├── data └── titanic.rda ├── endoR_0.1.0.tar.gz ├── figures ├── Workflow.png └── Decisions_summary.png ├── .gitignore ├── tests ├── testthat │ └── test-test.R └── testthat.R ├── figure ├── unnamed-chunk-14-1.png ├── unnamed-chunk-16-1.png ├── unnamed-chunk-17-1.png ├── unnamed-chunk-18-1.png ├── unnamed-chunk-25-1.png └── unnamed-chunk-26-1.png ├── endoR.Rcheck ├── R_check_bin │ ├── R │ └── Rscript ├── 00_pkg_src │ └── endoR │ │ ├── tests │ │ ├── testthat │ │ │ └── test-test.R │ │ └── testthat.R │ │ ├── data │ │ └── titanic.rda │ │ ├── build │ │ └── vignette.rds │ │ ├── figures │ │ ├── Workflow.png │ │ └── Decisions_summary.png │ │ ├── man │ │ ├── compatibleNames.Rd │ │ ├── endoR.Rd │ │ ├── getThresholds.Rd │ │ ├── model2DE_cluster.Rd │ │ ├── featureImportance.Rd │ │ ├── discretizeData_model.Rd │ │ ├── plotFeatures.Rd │ │ ├── decisions2FullDummy.Rd │ │ ├── discretizeData.Rd │ │ ├── decisionImportance.Rd │ │ ├── filterDecisionsImportances.Rd │ │ ├── plotNetwork.Rd │ │ ├── getNetwork.Rd │ │ ├── pruneDecisions.Rd │ │ ├── getDecisionsMetrics.Rd │ │ ├── model2DE_resampling.Rd │ │ ├── aggregateTaxa.Rd │ │ ├── evaluateAlpha.Rd │ │ ├── stabilitySelection.Rd │ │ ├── discretizeDecisions.Rd │ │ ├── preCluster.Rd │ │ └── model2DE.Rd │ │ ├── R │ │ ├── endoR.R │ │ ├── featureImportance.R │ │ ├── model2DE_cluster.R │ │ ├── discretizeVector_model.R │ │ ├── discretizeVector.R │ │ ├── aggregateTaxa_fine.R │ │ ├── discretizeData_model.R │ │ ├── Misc.R │ │ ├── filterDecisionsImportances.R │ │ ├── aggregateTaxa_coarse.R │ │ ├── measureSingleDecision.R │ │ ├── decisionImportance.R │ │ ├── discretizeData.R │ │ ├── aggregateTaxa.R │ │ ├── model2DE_resampling.R │ │ ├── evaluateAlpha.R │ │ ├── getThresholds.R │ │ ├── getDecisionsMetric.R │ │ ├── plotFeatures.R │ │ └── changeDecisionsDummies.R │ │ ├── NAMESPACE │ │ ├── .github │ │ └── workflows │ │ │ └── ci.yml │ │ ├── DESCRIPTION │ │ ├── vignettes │ │ └── iris_multiclass.Rmd │ │ └── README.md └── 00check.log ├── cran-comments.md ├── .Rbuildignore ├── man ├── endoR.Rd ├── compatibleNames.Rd ├── titanic.Rd ├── getThresholds.Rd ├── featureImportance.Rd ├── discretizeData_model.Rd ├── decisions2FullDummy.Rd ├── discretizeData.Rd ├── changeDecisionsDummies.Rd ├── aggregateTaxa.Rd ├── plotFeatures.Rd ├── plotNetwork.Rd ├── filterDecisionsImportances.Rd ├── decisionImportance.Rd ├── model2DE_cluster.Rd ├── pruneDecisions.Rd ├── getDecisionsMetrics.Rd ├── getNetwork.Rd └── discretizeDecisions.Rd ├── endoR.Rproj ├── R ├── featureImportance.R ├── discretizeVector_model.R ├── model2DE_cluster.R ├── discretizeVector.R ├── aggregateTaxa_fine.R ├── discretizeData_model.R ├── Misc.R ├── filterDecisionsImportances.R ├── endoR.R ├── aggregateTaxa_coarse.R ├── measureSingleDecision.R ├── decisionImportance.R ├── aggregateTaxa.R ├── discretizeData.R ├── getThresholds.R ├── getDecisionsMetric.R ├── evaluateAlpha.R ├── plotFeatures.R └── getComplements.R ├── NAMESPACE ├── .github └── workflows │ └── ci.yml ├── DESCRIPTION ├── examples ├── iris_basic.R ├── iris_each_function.R └── iris_bootstraps.R └── vignettes └── iris_multiclass.Rmd /data/titanic.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/data/titanic.rda -------------------------------------------------------------------------------- /endoR_0.1.0.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR_0.1.0.tar.gz 
-------------------------------------------------------------------------------- /figures/Workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figures/Workflow.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | inst/doc 6 | doc 7 | Meta 8 | -------------------------------------------------------------------------------- /tests/testthat/test-test.R: -------------------------------------------------------------------------------- 1 | test_that("multiplication works", { 2 | expect_equal(2 * 2, 4) 3 | }) 4 | -------------------------------------------------------------------------------- /figure/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-17-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-17-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-18-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-18-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-25-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-25-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-26-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-26-1.png -------------------------------------------------------------------------------- /figures/Decisions_summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figures/Decisions_summary.png -------------------------------------------------------------------------------- /endoR.Rcheck/R_check_bin/R: -------------------------------------------------------------------------------- 1 | echo "'R' should not be used without a path -- see par. 1.6 of the manual" 2 | exit 1 3 | -------------------------------------------------------------------------------- /endoR.Rcheck/R_check_bin/Rscript: -------------------------------------------------------------------------------- 1 | echo "'Rscript' should not be used without a path -- see par. 
1.6 of the manual" 2 | exit 1 3 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/tests/testthat/test-test.R: -------------------------------------------------------------------------------- 1 | test_that("multiplication works", { 2 | expect_equal(2 * 2, 4) 3 | }) 4 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/data/titanic.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/data/titanic.rda -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/build/vignette.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/build/vignette.rds -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/figures/Workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/figures/Workflow.png -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/figures/Decisions_summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/figures/Decisions_summary.png -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## R CMD check results 2 | There were no ERRORs, WARNINGs, or NOTEs. 3 | 4 | ## Downstream dependencies 5 | There are currently no downstream dependencies for this package. -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^doc$ 4 | ^Meta$ 5 | ^LICENSE\.md$ 6 | ^figures$ 7 | ^\.git* 8 | ^\.RData$ 9 | cran-comments.md 10 | ^vignettes/titanic_boostraps\.Rmd\.orig$ 11 | ^figure$ 12 | ^examples$ -------------------------------------------------------------------------------- /man/endoR.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/endoR.R 3 | \docType{package} 4 | \name{endoR} 5 | \alias{endoR} 6 | \title{endoR} 7 | \description{ 8 | endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 9 | } 10 | \author{ 11 | Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 12 | } 13 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/compatibleNames.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Misc.R 3 | \name{compatibleNames} 4 | \alias{compatibleNames} 5 | \title{Transform character strings to be compatible with endoR functions.} 6 | \usage{ 7 | compatibleNames(x) 8 | } 9 | \description{ 10 | Transform character strings to be compatible with endoR functions. 
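}
\examples{
# Make arbitrary strings safe for endoR functions (mirrors the example in
# the top-level man/compatibleNames.Rd):
x <- c('hello.world__I am;happy', 'me-2')
compatibleNames(x)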
11 | } 12 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/endoR.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/endoR.R 3 | \docType{package} 4 | \name{endoR} 5 | \alias{endoR} 6 | \title{endoR} 7 | \description{ 8 | endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 9 | } 10 | \author{ 11 | Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 12 | } 13 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | # This file is part of the standard setup for testthat. 2 | # It is recommended that you do not modify it. 3 | # 4 | # Where should you do additional test configuration? 5 | # Learn more about the roles of various files in: 6 | # * https://r-pkgs.org/tests.html 7 | # * https://testthat.r-lib.org/reference/test_package.html#special-files 8 | 9 | library(testthat) 10 | library(endoR) 11 | 12 | test_check("endoR") 13 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/tests/testthat.R: -------------------------------------------------------------------------------- 1 | # This file is part of the standard setup for testthat. 2 | # It is recommended that you do not modify it. 3 | # 4 | # Where should you do additional test configuration? 5 | # Learn more about the roles of various files in: 6 | # * https://r-pkgs.org/tests.html 7 | # * https://testthat.r-lib.org/reference/test_package.html#special-files 8 | 9 | library(testthat) 10 | library(endoR) 11 | 12 | test_check("endoR") 13 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/endoR.R: -------------------------------------------------------------------------------- 1 | #' endoR 2 | #' 3 | #' endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 
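#'
#' A minimal sketch of the typical workflow, assuming a randomForest model
#' 'mod' fitted on predictors 'X' with response 'y' (see examples/iris_basic.R
#' for a runnable version):
#' de <- model2DE(model = mod, model_type = "rf", data = X, target = y)
#' plotFeatures(de)
#' plotNetwork(de)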
4 | #' 5 | #' @docType package 6 | #' 7 | #' @author Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 8 | #' 9 | #' @name endoR 10 | #' @import data.table 11 | #' @import inTrees 12 | #' @import dplyr 13 | #' @import stringr 14 | #' @import ggplot2 15 | #' @import ggraph 16 | #' @import igraph 17 | NULL 18 | -------------------------------------------------------------------------------- /endoR.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/getThresholds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getThresholds.R 3 | \name{getThresholds} 4 | \alias{getThresholds} 5 | \title{Get thresholds to discretize variables according to their split in tree ensemble models} 6 | \usage{ 7 | getThresholds(conditions, data, Kmax = 2) 8 | } 9 | \description{ 10 | Get thresholds to discretize variables according to their split in tree ensemble models 11 | } 12 | -------------------------------------------------------------------------------- /endoR.Rcheck/00check.log: -------------------------------------------------------------------------------- 1 | * using log directory ‘/mnt/c/Users/Albane Ruaud/MPI-EBIO/Ruaud2022/endoR/endoR.Rcheck’ 2 | * using R version 4.0.2 (2020-06-22) 3 | * using platform: x86_64-conda_cos6-linux-gnu (64-bit) 4 | * using session charset: UTF-8 5 | * using option ‘--as-cran’ 6 | * checking for file ‘endoR/DESCRIPTION’ ... OK 7 | * checking extension type ... Package 8 | * this is package ‘endoR’ version ‘0.1.0’ 9 | * package encoding: UTF-8 10 | * checking CRAN incoming feasibility ... -------------------------------------------------------------------------------- /man/compatibleNames.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Misc.R 3 | \name{compatibleNames} 4 | \alias{compatibleNames} 5 | \title{Transform character strings to be compatible with endoR functions.} 6 | \usage{ 7 | compatibleNames(x) 8 | } 9 | \arguments{ 10 | \item{x}{character string or vector} 11 | } 12 | \description{ 13 | Transform character strings to be compatible with endoR functions. 14 | } 15 | \examples{ 16 | x <- c('hello.world__I am;happy', 'me-2') 17 | compatibleNames(x) 18 | } 19 | -------------------------------------------------------------------------------- /man/titanic.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/endoR.R 3 | \docType{data} 4 | \name{titanic} 5 | \alias{titanic} 6 | \title{Titanic passenger and crew data} 7 | \format{ 8 | An object of class \code{data.frame} with 2207 rows and 9 columns. 
9 | } 10 | \usage{ 11 | titanic 12 | } 13 | \description{ 14 | Passenger and crew data from the Titanic, adapted from the 'titanic' package and included as an example dataset. 15 | } 16 | \references{ 17 | \url{https://cran.r-project.org/package=titanic} 18 | } 19 | \author{ 20 | Paul Hendricks \email{paul.hendricks.2013@owu.edu} 21 | } 22 | \keyword{datasets} 23 | -------------------------------------------------------------------------------- /man/getThresholds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getThresholds.R 3 | \name{getThresholds} 4 | \alias{getThresholds} 5 | \title{Get thresholds to discretize variables according to their split in tree ensemble models} 6 | \usage{ 7 | getThresholds(conditions, data, Kmax = 2) 8 | } 9 | \arguments{ 10 | \item{conditions}{character vector with all conditions from which to find the thresholds} 11 | 12 | \item{data}{data to discretize} 13 | 14 | \item{Kmax}{numeric, maximal number of categories for each variable (default: Kmax = 2).} 15 | } 16 | \description{ 17 | Get thresholds to discretize variables according to their split in tree ensemble models 18 | } 19 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/model2DE_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE_cluster.R 3 | \name{model2DE_cluster} 4 | \alias{model2DE_cluster} 5 | \title{Run model2DE on several bootstrap resamples in parallel.} 6 | \usage{ 7 | model2DE_cluster(partition) 8 | } 9 | \arguments{ 10 | \item{partition}{a vector with row numbers to subset data.} 11 | } 12 | \description{ 13 | Function to use with the Q() function from the clustermq R-package with the following arguments in export: 14 | data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 15 | See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 16 | } 17 | -------------------------------------------------------------------------------- /man/featureImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/featureImportance.R 3 | \name{featureImportance} 4 | \alias{featureImportance} 5 | \title{Calculate the feature importances from a decision ensemble} 6 | \usage{ 7 | featureImportance(nodes_agg) 8 | } 9 | \arguments{ 10 | \item{nodes_agg}{a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n).} 11 | } 12 | \value{ 13 | a datatable with the feature importance for each variable. 14 | } 15 | \description{ 16 | This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 
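}
\examples{
# Minimal sketch with a hypothetical two-decision ensemble; levels are
# encoded as 'variable__level' as described above.
nodes_agg <- data.table::data.table(
  var = c("Petal.Width__High", "Sepal.Length", "Petal.Width__High"),
  importance = c(0.5, 0.3, 0.2), imp = c(0.8, 0.8, 0.6), n = c(2, 2, 1)
)
featureImportance(nodes_agg)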
17 | } 18 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/featureImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/featureImportance.R 3 | \name{featureImportance} 4 | \alias{featureImportance} 5 | \title{Calculate the feature importances from a decision ensemble} 6 | \usage{ 7 | featureImportance(nodes_agg) 8 | } 9 | \arguments{ 10 | \item{nodes_agg}{a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n).} 11 | } 12 | \value{ 13 | a datatable with the feature importance for each variable. 14 | } 15 | \description{ 16 | This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 17 | } 18 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/discretizeData_model.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData_model.R 3 | \name{discretizeData_model} 4 | \alias{discretizeData_model} 5 | \title{Discretize numerical variables in a dataset based on thresholds used in the model to create splits.} 6 | \usage{ 7 | discretizeData_model(data, conditions, Kmax = 2, return_split = FALSE) 8 | } 9 | \arguments{ 10 | \item{data}{data to discretize.} 11 | 12 | \item{conditions}{character vector with all conditions from which to find the thresholds} 13 | 14 | \item{Kmax}{numeric, maximal number of categories for each variable (default: Kmax = 2).} 15 | 16 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 17 | } 18 | \value{ 19 | Data with discretized variables. 20 | } 21 | \description{ 22 | This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 23 | } 24 | -------------------------------------------------------------------------------- /man/discretizeData_model.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData_model.R 3 | \name{discretizeData_model} 4 | \alias{discretizeData_model} 5 | \title{Discretize numerical variables in a dataset based on thresholds used in the model to create splits.} 6 | \usage{ 7 | discretizeData_model(data, conditions, Kmax = 2, return_split = FALSE) 8 | } 9 | \arguments{ 10 | \item{data}{data to discretize} 11 | 12 | \item{conditions}{character vector with all conditions from which to find the thresholds} 13 | 14 | \item{Kmax}{numeric, maximal number of categories for each variable (default: Kmax = 2).} 15 | 16 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 17 | } 18 | \value{ 19 | Data with discretized variables. 20 | } 21 | \description{ 22 | This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 
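}
\examples{
\dontrun{
# Hedged sketch, assuming inTrees-style condition strings where column
# indices refer to columns of 'data':
conds <- c("X[,1]<=5.4", "X[,3]>4.8")
iris_d <- discretizeData_model(data = iris[, -5], conditions = conds)
}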
23 | } 24 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(Ranger2List_endoR) 4 | export(aggregateTaxa) 5 | export(compatibleNames) 6 | export(decisionImportance) 7 | export(decisions2FullDummy) 8 | export(discretizeData) 9 | export(discretizeData_model) 10 | export(evaluateAlpha) 11 | export(featureImportance) 12 | export(filterDecisionsImportances) 13 | export(getDecisionsMetrics) 14 | export(getNetwork) 15 | export(getNewRule) 16 | export(getThresholds) 17 | export(interactionVariables) 18 | export(model2DE) 19 | export(model2DE_cluster) 20 | export(model2DE_resampling) 21 | export(perlevelRule) 22 | export(plotFeatures) 23 | export(plotNetwork) 24 | export(preCluster) 25 | export(pruneDecisions) 26 | export(pruneSingleRule_endoR) 27 | export(rmError) 28 | export(singleRulePerLevel) 29 | export(stabilitySelection) 30 | import(data.table) 31 | import(dplyr) 32 | import(ggplot2) 33 | import(ggraph) 34 | import(igraph) 35 | import(inTrees) 36 | import(stringr) 37 | -------------------------------------------------------------------------------- /R/featureImportance.R: -------------------------------------------------------------------------------- 1 | #' Calculate the feature importances from a decision ensemble 2 | #' 3 | #' This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 4 | #' 5 | #' @param nodes_agg a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n). 6 | #' 7 | #' @return a datatable with the feature importance for each variable. 8 | #' 9 | #' @export 10 | featureImportance <- function(nodes_agg) { 11 | featImp <- copy(nodes_agg)[, `:=`(Feature = str_extract(var, pattern = ".*(?=\\_{2})"))][ 12 | is.na(Feature), Feature := var 13 | ][, list(Feature, importance, imp, n)][ 14 | , importance := sum(importance * imp * n), 15 | by = Feature 16 | ][, list(Feature, importance)] 17 | featImp <- unique(featImp) 18 | setorder(featImp, -importance) 19 | return(featImp) 20 | } 21 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/featureImportance.R: -------------------------------------------------------------------------------- 1 | #' Calculate the feature importances from a decision ensemble 2 | #' 3 | #' This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 4 | #' 5 | #' @param nodes_agg a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n). 6 | #' 7 | #' @return a datatable with the feature importance for each variable. 
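#'
#' @examples
#' # Minimal sketch with a hypothetical two-decision ensemble (levels are
#' # encoded as 'variable__level'):
#' nodes_agg <- data.table::data.table(
#'   var = c("Petal.Width__High", "Sepal.Length", "Petal.Width__High"),
#'   importance = c(0.5, 0.3, 0.2), imp = c(0.8, 0.8, 0.6), n = c(2, 2, 1)
#' )
#' featureImportance(nodes_agg)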
8 | #' 9 | #' @export 10 | featureImportance <- function(nodes_agg) { 11 | featImp <- copy(nodes_agg)[, `:=`(Feature = str_extract(var, pattern = ".*(?=\\_{2})"))][ 12 | is.na(Feature), Feature := var 13 | ][, .(Feature, importance, imp, n)][ 14 | , importance := sum(importance * imp * n), 15 | by = Feature 16 | ][, .(Feature, importance)] 17 | featImp <- unique(featImp) 18 | setorder(featImp, -importance) 19 | return(featImp) 20 | } 21 | -------------------------------------------------------------------------------- /man/decisions2FullDummy.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisions2FullDummy.R 3 | \name{decisions2FullDummy} 4 | \alias{decisions2FullDummy} 5 | \title{Transform all variables in a decision ensemble into dummy variables.} 6 | \usage{ 7 | decisions2FullDummy( 8 | rules, 9 | data, 10 | in_parallel = FALSE, 11 | n_cores = detectCores() - 1, 12 | cluster = NULL 13 | ) 14 | } 15 | \arguments{ 16 | \item{rules}{data frame with a 'condition' column.} 17 | 18 | \item{data}{data used to train the original model, or for which the column order corresponds to the column numbers in the rules (column 'condition').} 19 | 20 | \item{in_parallel}{if TRUE, the function is run in parallel (default = FALSE).} 21 | 22 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 23 | 24 | \item{cluster}{the cluster to use to run the function in parallel.} 25 | } 26 | \description{ 27 | This function first transforms all non-numeric variables from data used to train the original model into dummy variables. It then updates the decision ensemble. 28 | } 29 | -------------------------------------------------------------------------------- /man/discretizeData.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData.R 3 | \name{discretizeData} 4 | \alias{discretizeData} 5 | \title{Discretize numerical variables in a dataset} 6 | \usage{ 7 | discretizeData( 8 | data, 9 | K = 2, 10 | features = NULL, 11 | knames = NULL, 12 | return_split = FALSE 13 | ) 14 | } 15 | \arguments{ 16 | \item{data}{data to discretize.} 17 | 18 | \item{K}{numeric, number of categories (default: K = 2).} 19 | 20 | \item{features}{vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized.} 21 | 22 | \item{knames}{optional: character vector of the same length as K, containing the ordered names for categories.} 23 | 24 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 25 | } 26 | \value{ 27 | Data with discretized variables. 28 | } 29 | \description{ 30 | This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 31 | Names of categories can optionally be given in knames. 
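}
\examples{
# Minimal sketch: discretize the numeric iris columns into named thirds.
discretizeData(iris, K = 3, knames = c("Low", "Medium", "High"))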
32 | } 33 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/plotFeatures.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotFeatures.R 3 | \name{plotFeatures} 4 | \alias{plotFeatures} 5 | \title{Plot the importance and influence of features.} 6 | \usage{ 7 | plotFeatures( 8 | decision_ensemble, 9 | levels_order = NULL, 10 | colour_low = "#E69F00", 11 | colour_mid = "grey87", 12 | colour_high = "#0072B2", 13 | return_all = FALSE 14 | ) 15 | } 16 | \arguments{ 17 | \item{decision_ensemble}{stable decision ensemble (see stabilitySelection).} 18 | 19 | \item{levels_order}{optional, order for variable levels on the influence plot} 20 | 21 | \item{return_all}{if TRUE, returns the table of feature importances and influences, and each plot separately (default = FALSE).} 22 | } 23 | \value{ 24 | 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots. 25 | } 26 | \description{ 27 | Returns a ggplot object with variable importances (across all categorical levels for factor variables) and per-level variable influences. 28 | It uses the ggpubr package to combine plots. 29 | } 30 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/model2DE_cluster.R: -------------------------------------------------------------------------------- 1 | #' Run model2DE on several bootstrap resamples in parallel. 2 | #' 3 | #' Function to use with the Q() function from the clustermq R-package with the following arguments in export: 4 | #' data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 5 | #' See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 6 | #' 7 | #' @param partition a vector with row numbers to subset data. 
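#'
#' A hedged sketch of the intended call, where 'partitions', 'data_d', and
#' 'exec' are assumed outputs of preCluster() and the exported names follow
#' the list above (values here are illustrative):
#' res <- clustermq::Q(model2DE_cluster, partition = partitions, n_jobs = 2,
#'   export = list(data = data_d, target = y, exec = exec, classPos = 'setosa',
#'     dummy_var = NULL, prune = TRUE, maxDecay = 0.05, typeDecay = 2,
#'     filter = TRUE, in_parallel = FALSE, n_cores = 1))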
8 | #' @export 9 | model2DE_cluster <- function(partition) { 10 | library(data.table) 11 | res <- model2DE( 12 | data = data[partition, ], target = target[partition], 13 | exec = exec, 14 | classPos = classPos, dummy_var = dummy_var, 15 | prune = prune, maxDecay = maxDecay, typeDecay = typeDecay, 16 | filter = filter, min_imp = 1, 17 | in_parallel = in_parallel, n_cores = n_cores, 18 | light = TRUE 19 | ) 20 | res <- list( 21 | "pdecisions" = res$n_decisions, 22 | "rules" = res[[length(res) - 4]], 23 | "nodes_agg" = res$nodes_agg, "edges_agg" = res$edges_agg 24 | ) 25 | return(res) 26 | } 27 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/decisions2FullDummy.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisions2FullDummy.R 3 | \name{decisions2FullDummy} 4 | \alias{decisions2FullDummy} 5 | \title{Transform all variables in a decision ensemble into dummy variables.} 6 | \usage{ 7 | decisions2FullDummy( 8 | rules, 9 | data, 10 | in_parallel = FALSE, 11 | n_cores = detectCores() - 1, 12 | cluster = NULL 13 | ) 14 | } 15 | \arguments{ 16 | \item{rules}{data frame with a 'condition' column.} 17 | 18 | \item{data}{data used to train the original model, or for which the column order corresponds to the column numbers in the rules (column 'condition').} 19 | 20 | \item{in_parallel}{if TRUE, the function is run in parallel (default = FALSE).} 21 | 22 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 23 | 24 | \item{cluster}{the cluster to use to run the function in parallel.} 25 | } 26 | \description{ 27 | This function first transforms all non-numeric variables from data used to train the original model into dummy variables. It then updates the decision ensemble. 28 | } 29 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/discretizeData.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData.R 3 | \name{discretizeData} 4 | \alias{discretizeData} 5 | \title{Discretize numerical variables in a dataset} 6 | \usage{ 7 | discretizeData( 8 | data, 9 | K = 2, 10 | features = NULL, 11 | knames = NULL, 12 | return_split = FALSE 13 | ) 14 | } 15 | \arguments{ 16 | \item{data}{data to discretize.} 17 | 18 | \item{K}{numeric, number of categories (default: K = 2).} 19 | 20 | \item{features}{vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized.} 21 | 22 | \item{knames}{optional: character vector of the same length as K, containing the ordered names for categories.} 23 | 24 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 25 | } 26 | \value{ 27 | Data with discretized variables. 28 | } 29 | \description{ 30 | This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 31 | Names of categories can optionally be given in knames. 
32 | } 33 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/decisionImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisionImportance.R 3 | \name{decisionImportance} 4 | \alias{decisionImportance} 5 | \title{Compute the importance of decisions} 6 | \usage{ 7 | decisionImportance( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | in_parallel = FALSE, 13 | n_cores = detectCores() - 1, 14 | cluster = NULL 15 | ) 16 | } 17 | \arguments{ 18 | \item{data}{data from which to get the decision support.} 19 | 20 | \item{target}{response variable.} 21 | 22 | \item{classPos}{if classification, the positive class.} 23 | 24 | \item{in_parallel}{if TRUE, the function is run in parallel.} 25 | 26 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 27 | 28 | \item{cluster}{the cluster to use to run the function in parallel.} 29 | 30 | \item{rules}{a data.frame with a column "condition" or a vector with name "condition".} 31 | } 32 | \value{ 33 | the data.frame passed in rules with the gain and importance of each decision. 34 | } 35 | \description{ 36 | This function computes the importance of decisions. 37 | } 38 | -------------------------------------------------------------------------------- /R/discretizeVector_model.R: -------------------------------------------------------------------------------- 1 | discretizeVector_model <- function(l_var, return_all = FALSE) { 2 | 3 | # sanity check 4 | if (length(l_var$thr) == 0) { 5 | return(l_var) 6 | } 7 | 8 | # set category names 9 | numSplit <- length(l_var$thr) 10 | if (numSplit == 1) { 11 | vNames <- c("Low", "High") 12 | } else if (numSplit == 2) { 13 | vNames <- c("Low", "Medium", "High") 14 | } else if (numSplit == 3) { 15 | vNames <- c("Very_Low", "Low", "High", "Very_High") 16 | } else if (numSplit == 4) { 17 | vNames <- c("Very_Low", "Low", "Medium", "High", "Very_High") 18 | } else { 19 | vNames <- as.character(1:(numSplit + 1)) 20 | } 21 | 22 | 23 | newV <- rep(vNames[numSplit + 1], length(l_var$var_v)) 24 | 25 | for (j in seq(numSplit, 1, by = -1)) { 26 | newV[which(l_var$var_v <= l_var$thr[j])] <- vNames[j] 27 | } 28 | 29 | if (return_all == TRUE) { 30 | l_var$med <- lapply(unique(newV), function(x, l_var) { 31 | i <- which(newV == x) 32 | return(median(l_var$var_v[i])) 33 | }, l_var = l_var) 34 | names(l_var$med) <- unique(newV) 35 | l_var$var_d <- newV 36 | return(l_var) 37 | } else { 38 | return(newV) 39 | } 40 | } 41 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(aggregateTaxa) 4 | export(changeDecisionsDummies) 5 | export(compatibleNames) 6 | export(decisionImportance) 7 | export(decisions2FullDummy) 8 | export(discretizeData) 9 | export(discretizeData_model) 10 | export(discretizeDecisions) 11 | export(evaluateAlpha) 12 | export(featureImportance) 13 | export(filterDecisionsImportances) 14 | export(getDecisionsMetrics) 15 | export(getNetwork) 16 | export(getThresholds) 17 | export(model2DE) 18 | export(model2DE_cluster) 19 | export(model2DE_resampling) 20 | export(plotFeatures) 21 | export(plotNetwork) 22 | export(preCluster) 23 | export(pruneDecisions) 24 | export(stabilitySelection) 25 | 
import(caret) 26 | import(clustermq) 27 | import(data.table) 28 | import(dplyr, except = c(union, as_data_frame, groups, combine, slice, filter, lag, last, first, between)) 29 | import(ggplot2, except = margin) 30 | import(ggraph) 31 | import(igraph, except = c(is_named, decompose, spectrum)) 32 | import(inTrees) 33 | import(parallel) 34 | import(randomForest, except = importance) 35 | import(ranger) 36 | import(rlang, except = ':=') 37 | import(stats) 38 | import(stringr) 39 | import(tidyverse) 40 | import(utils) 41 | import(xgboost) 42 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeVector_model.R: -------------------------------------------------------------------------------- 1 | discretizeVector_model <- function(l_var, return_all = FALSE) { 2 | 3 | # sanity check 4 | if (length(l_var$thr) == 0) { 5 | return(l_var) 6 | } 7 | 8 | # set category names 9 | numSplit <- length(l_var$thr) 10 | if (numSplit == 1) { 11 | vNames <- c("Low", "High") 12 | } else if (numSplit == 2) { 13 | vNames <- c("Low", "Medium", "High") 14 | } else if (numSplit == 3) { 15 | vNames <- c("Very_Low", "Low", "High", "Very_High") 16 | } else if (numSplit == 4) { 17 | vNames <- c("Very_Low", "Low", "Medium", "High", "Very_High") 18 | } else { 19 | vNames <- as.character(1:(numSplit + 1)) 20 | } 21 | 22 | 23 | newV <- rep(vNames[numSplit + 1], length(l_var$var_v)) 24 | 25 | for (j in seq(numSplit, 1, by = -1)) { 26 | newV[which(l_var$var_v <= l_var$thr[j])] <- vNames[j] 27 | } 28 | 29 | if (return_all == TRUE) { 30 | l_var$med <- lapply(unique(newV), function(x, l_var) { 31 | i <- which(newV == x) 32 | return(median(l_var$var_v[i])) 33 | }, l_var = l_var) 34 | names(l_var$med) <- unique(newV) 35 | l_var$var_d <- newV 36 | return(l_var) 37 | } else { 38 | return(newV) 39 | } 40 | } 41 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/filterDecisionsImportances.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/filterDecisionsImportances.R 3 | \name{filterDecisionsImportances} 4 | \alias{filterDecisionsImportances} 5 | \title{Filter decisions according to their metrics} 6 | \usage{ 7 | filterDecisionsImportances(rules, min_imp = 0.7) 8 | } 9 | \arguments{ 10 | \item{rules}{data.frame corresponding to the decisions, with all their metrics.} 11 | 12 | \item{min_imp}{minimal relative importance of the decisions that must be kept; the threshold used to remove decisions therefore never exceeds max(imp)*min_imp.} 13 | } 14 | \value{ 15 | The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 16 | } 17 | \description{ 18 | This function filters decisions heuristically according to their importance and multiplicity. 19 | A relative importance threshold is calculated that maximises both the average of the product (relative importance * n) and the number of decisions removed. 20 | All decisions with a relative importance above that threshold are kept; the argument min_imp sets the minimal relative importance that a kept decision can have. 
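}
\examples{
\dontrun{
# Hedged sketch: 'rules' is assumed to already carry the metrics computed
# by getDecisionsMetrics(), in particular the importance column 'imp'.
rules_filt <- filterDecisionsImportances(rules, min_imp = 0.7)
}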
21 | } 22 | -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: EndoR 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | pull_request: 7 | branches: [ main ] 8 | 9 | jobs: 10 | build: 11 | name: build (${{ matrix.r-version }}, ubuntu-latest) 12 | runs-on: ubuntu-latest 13 | strategy: 14 | matrix: 15 | r-version: [4.1] 16 | steps: 17 | - uses: conda-incubator/setup-miniconda@v2 18 | with: 19 | miniconda-version: 'latest' 20 | auto-update-conda: true 21 | python-version: '3.10' 22 | channels: conda-forge,bioconda,defaults 23 | activate-environment: endor_env 24 | - name: conda env setup 25 | shell: bash -l {0} 26 | run: | 27 | conda install -y mamba 28 | mamba install -y r-base=${{ matrix.r-version }} r-renv r-nloptr r-data.table r-dplyr r-ggplot2 r-ggraph r-igraph r-stringr 29 | - uses: actions/checkout@v2 30 | - name: renv install 31 | shell: bash -l {0} 32 | run: | 33 | R -e 'renv::install("rstudio/renv")' 34 | R -e 'renv::install("softwaredeng/inTrees/inTrees.Rcheck/inTrees")' 35 | R -e 'renv::install("nick-youngblut/endoR")' 36 | - name: Package unit tests 37 | shell: bash -l {0} 38 | run: | 39 | R -e 'library(endoR)' 40 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: EndoR 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | pull_request: 7 | branches: [ main ] 8 | 9 | jobs: 10 | build: 11 | name: build (${{ matrix.r-version }}, ubuntu-latest) 12 | runs-on: ubuntu-latest 13 | strategy: 14 | matrix: 15 | r-version: [4.1] 16 | steps: 17 | - uses: conda-incubator/setup-miniconda@v2 18 | with: 19 | miniconda-version: 'latest' 20 | auto-update-conda: true 21 | python-version: '3.10' 22 | channels: conda-forge,bioconda,defaults 23 | activate-environment: endor_env 24 | - name: conda env setup 25 | shell: bash -l {0} 26 | run: | 27 | conda install -y mamba 28 | mamba install -y r-base=${{ matrix.r-version }} r-renv r-nloptr r-data.table r-dplyr r-ggplot2 r-ggraph r-igraph r-stringr 29 | - uses: actions/checkout@v2 30 | - name: renv install 31 | shell: bash -l {0} 32 | run: | 33 | R -e 'renv::install("rstudio/renv")' 34 | R -e 'renv::install("softwaredeng/inTrees/inTrees.Rcheck/inTrees")' 35 | R -e 'renv::install("nick-youngblut/endoR")' 36 | - name: Package unit tests 37 | shell: bash -l {0} 38 | run: | 39 | R -e 'library(endoR)' 40 | -------------------------------------------------------------------------------- /man/changeDecisionsDummies.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/changeDecisionsDummies.R 3 | \name{changeDecisionsDummies} 4 | \alias{changeDecisionsDummies} 5 | \title{Modify decisions so that only one level of a multiclass variable is used in decisions} 6 | \usage{ 7 | changeDecisionsDummies( 8 | rules, 9 | dummy_var, 10 | data, 11 | target, 12 | classPos = NULL, 13 | in_parallel = FALSE, 14 | n_cores = detectCores() - 1, 15 | cluster = NULL 16 | ) 17 | } 18 | \arguments{ 19 | \item{rules}{a data frame with a column "condition".} 20 | 21 | \item{dummy_var}{string vector with the names of columns to change to dummy variable.} 22 | 23 | \item{data}{data on which to fit the decision ensemble.} 24 | 25 | 
\item{target}{response variable.} 26 | 27 | \item{classPos}{for classification, the positive class.} 28 | 29 | \item{in_parallel}{if TRUE, the function is run in parallel.} 30 | 31 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 32 | 33 | \item{cluster}{the cluster to use to run the function in parallel (optional).} 34 | } 35 | \description{ 36 | Modify decisions so that only one level of a multiclass variable is used in decisions 37 | } 38 | -------------------------------------------------------------------------------- /R/model2DE_cluster.R: -------------------------------------------------------------------------------- 1 | #' Run model2DE on several bootstrap resamples in parallel. 2 | #' 3 | #' Function to use with the Q() function from the clustermq R-package with the following arguments in export: 4 | #' data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 5 | #' See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 6 | #' 7 | #' @param partition a vector with row numbers to subset data. 8 | #' @example examples/iris_bootstraps.R 9 | #' @export 10 | model2DE_cluster <- function(partition) { 11 | #library(data.table) 12 | res <- model2DE( 13 | data = data[partition, ], target = target[partition], 14 | exec = exec, 15 | classPos = classPos, dummy_var = dummy_var, 16 | prune = prune, maxDecay = maxDecay, typeDecay = typeDecay, 17 | filter = filter, min_imp = 1, 18 | in_parallel = in_parallel, n_cores = n_cores, 19 | light = TRUE 20 | ) 21 | 22 | # get the position of the last set of computed rules 23 | tmp <- str_which(names(res), pattern = 'rules') 24 | tmp <- tmp[length(tmp)] 25 | 26 | res <- list( 27 | "pdecisions" = res$n_decisions, 28 | "rules" = res$rules, 29 | "nodes_agg" = res$nodes_agg, "edges_agg" = res$edges_agg 30 | ) 31 | return(res) 32 | } 33 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/plotNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotNetwork.R 3 | \name{plotNetwork} 4 | \alias{plotNetwork} 5 | \title{Plot an interaction variable network.} 6 | \usage{ 7 | plotNetwork( 8 | decision_ensemble, 9 | path_length = Inf, 10 | layout = "stress", 11 | colour_edge_low = "#E69F00", 12 | colour_edge_mid = "grey87", 13 | colour_edge_high = "#0072B2", 14 | colour_node_low = "#E69F00", 15 | colour_node_mid = "grey87", 16 | colour_node_high = "#0072B2", 17 | text_size = 4, 18 | hide_isolated_nodes = TRUE, 19 | seed = 0 20 | ) 21 | } 22 | \arguments{ 23 | \item{decision_ensemble}{stable decision ensemble: list with edges, nodes, etc.} 24 | 25 | \item{path_length}{maximal number of edges between 2 nodes, default = Inf.} 26 | 27 | \item{layout}{network layout, default is 'stress' (see ggraph package)} 28 | 29 | \item{text_size}{size of node labels.} 30 | 31 | \item{hide_isolated_nodes}{logical, default = TRUE (= nodes without any edge are not shown).} 32 | 33 | \item{seed}{the seed to use for generating the network.} 34 | 35 | \item{colour_x_y}{colour of x = edge or node, and y = low, mid or high (e.g., colour_edge_mid), to use for the colour gradients.} 36 | } 37 | \value{ 38 | a ggraph object 39 | } 40 | \description{ 41 | Returns a plot of nodes and edges. 
Plots are created with the ggraph and igraph packages. 42 | } 43 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: endoR 2 | Type: Package 3 | Title: Interpret and Visualize Stable Tree Ensemble Models 4 | Version: 0.1.0 5 | Authors@R: c(person("Albane", "Ruaud", email = "albane.ruaud@tuebingen.mpg.de" 6 | , role = c("aut", "cre"), comment = c(ORCID = "0000-0001-5920-1710")) 7 | , person("Houtao", "Deng", role = "ctb")) 8 | Description: Extract and visualize how predictive variables contribute to tree ensemble model accuracy. 9 | You provide the model and data, tell endoR which regularization steps you want in order to obtain stable results, and it will return a plot of the feature importance and influence on the response variable and a decision network. For more details, please refer to our pre-print: Albane Ruaud, Niklas A Pfister, Ruth E Ley, Nicholas D Youngblut. Interpreting tree ensemble machine learning models with endoR. bioRxiv (2022). 10 | License: GPL (>= 3) 11 | Encoding: UTF-8 12 | LazyData: true 13 | Suggests: 14 | knitr, 15 | rmarkdown, 16 | testthat (>= 3.0.0) 17 | Imports: rlang, data.table, dplyr, ggplot2, ggraph, igraph, inTrees, stringr, caret, ggpubr, ranger, xgboost, parallel, clustermq, utils, randomForest, tidyverse 18 | RoxygenNote: 7.2.0 19 | VignetteBuilder: knitr 20 | Depends: 21 | R (>= 2.10) 22 | BugReports: https://github.com/aruaud/endoR/issues 23 | URL: https://github.com/aruaud/endoR 24 | Config/testthat/edition: 3 25 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: endoR 2 | Type: Package 3 | Title: Interpret and Visualize Stable Tree Ensemble Models 4 | Version: 0.1.0 5 | Authors@R: c(person("Albane", "Ruaud", email = "albane.ruaud@tuebingen.mpg.de" 6 | , role = c("aut", "cre"), comment = c(ORCID = "0000-0001-5920-1710")) 7 | , person("Houtao", "Deng", role = "ctb")) 8 | Description: endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 9 | You provide the model and data, tell endoR which regularization steps you want in order to obtain stable results, and it will return a plot of the feature importance and influence on the response variable and a decision network. 
10 | License: GPL (>= 3) 11 | Encoding: UTF-8 12 | LazyData: true 13 | Suggests: knitr, rmarkdown, ranger, xgboost, parallel, testthat (>= 14 | 3.0.0) 15 | Imports: data.table, dplyr, ggplot2, ggraph, igraph, inTrees, stringr, 16 | caret, ggpubr 17 | RoxygenNote: 7.1.1 18 | VignetteBuilder: knitr 19 | Depends: R (>= 2.10) 20 | BugReports: https://github.com/aruaud/endoR/issues 21 | URL: https://github.com/aruaud/endoR 22 | Config/testthat/edition: 3 23 | NeedsCompilation: no 24 | Packaged: 2022-08-11 14:46:14 UTC; Albane Ruaud 25 | Author: Albane Ruaud [aut, cre] (), 26 | Houtao Deng [ctb] 27 | Maintainer: Albane Ruaud 28 | -------------------------------------------------------------------------------- /examples/iris_basic.R: -------------------------------------------------------------------------------- 1 | library(randomForest) 2 | library(caret) 3 | 4 | # import data and fit model 5 | data(iris) 6 | mod <- randomForest(Species ~ ., data = iris) 7 | 8 | # Fit a decision ensemble to predict the species setosa (vs versicolor and 9 | # virginica): no regularization (no decision pruning, discretization, 10 | # bootstrapping, or decision filtering) 11 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 12 | , target = iris$Species, classPos = "setosa" 13 | , filter = FALSE, discretize = FALSE, prune = FALSE) 14 | 15 | # Only decision pruning (default = TRUE) and discretization (default in 2 16 | # categories, we want 3 categories so we change K); no bootstrapping or 17 | # decision filtering. 18 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 19 | , target = iris$Species, classPos = "setosa" 20 | , filter = FALSE, discretize = TRUE, K = 3) 21 | 22 | # idem but run it in parallel on 2 threads 23 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 24 | , target = iris$Species, classPos = "setosa" 25 | , filter = FALSE, discretize = TRUE, K = 3 26 | , in_parallel = TRUE, n_cores = 2) 27 | 28 | # Plot the decision ensemble: 29 | # Plants from the setosa species have small petals and long, narrow sepals. 30 | plotFeatures(endo_setosa, levels_order = c("Low", "Medium", "High")) 31 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = "fr") 32 | -------------------------------------------------------------------------------- /R/discretizeVector.R: -------------------------------------------------------------------------------- 1 | # modified from discretizeVector() of the inTrees package. 
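# Arguments: v, numeric vector to discretize; K, number of quantile-based
# categories; knames, optional ordered category names (length K); return_all,
# if TRUE also return the named vector of split points.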
2 | discretizeVector <- 3 | function(v, K = 5, knames = NULL, return_all = FALSE) { 4 | 5 | # set category names 6 | if (is.null(knames) == TRUE) { 7 | if (K == 2) { 8 | knames <- c("min", "Low", "High") 9 | } else if (K == 3) { 10 | knames <- c("min", "Low", "Medium", "High") 11 | } else if (K == 4) { 12 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 13 | } else if (K == 5) { 14 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 15 | } else { 16 | knames <- c("min", paste("L", seq_len(K), sep = "")) 17 | } 18 | } else if (length(knames) == K) { 19 | knames <- c("min", knames) 20 | } 21 | 22 | 23 | splitV <- quantile(v, probs = seq(0, 1, 1 / K), na.rm = TRUE, names = TRUE, type = 3) 24 | names(splitV) <- knames 25 | 26 | numSplit <- length(splitV) 27 | if (numSplit < 2) { 28 | return(v) 29 | } 30 | 31 | newV <- vector("character", length(v)) 32 | newV[which(v <= splitV[2])] <- names(splitV)[2] 33 | 34 | if (numSplit >= 3) { 35 | for (jj in 3:numSplit) { 36 | newV[which(v > splitV[jj - 1] & v <= splitV[jj])] <- names(splitV)[jj] 37 | } 38 | } 39 | 40 | splitV <- splitV[c("min", unique(newV))] 41 | 42 | if (return_all == TRUE) { 43 | return(list("newV" = newV, "splitV" = splitV[order(splitV)])) 44 | } else { 45 | return(newV) 46 | } 47 | } 48 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/getNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getNetwork.R 3 | \name{getNetwork} 4 | \alias{getNetwork} 5 | \title{Transform a decision ensemble into a network} 6 | \usage{ 7 | getNetwork( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | additional_decisions = NULL, 13 | aggregate_taxa = FALSE, 14 | taxa = NULL, 15 | type = "coarse", 16 | in_parallel = FALSE, 17 | n_cores = detectCores() - 1, 18 | cluster = NULL 19 | ) 20 | } 21 | \arguments{ 22 | \item{rules}{the decision ensemble.} 23 | 24 | \item{data}{data from which to measure characteristics.} 25 | 26 | \item{target}{response variable.} 27 | 28 | \item{classPos}{the positive class predicted by decisions.} 29 | 30 | \item{in_parallel}{if TRUE, the function is run in parallel.} 31 | 32 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 33 | 34 | \item{cluster}{the cluster to use to run the function in parallel.} 35 | } 36 | \value{ 37 | A list with the nodes and edges dataframes containing the feature and interaction importances and influences; the decision-wise feature and interaction importances and influences are contained in the nodes_agg and edges_agg dataframes. 38 | } 39 | \description{ 40 | Takes a decision ensemble and measures the importance and influence of each feature and pair of features to create a network. 41 | For categorical variables or discretized ones, the importance and influence are calculated per level. See featureImportance to obtain the overall feature importance. 42 | } 43 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeVector.R: -------------------------------------------------------------------------------- 1 | # modified from discretizeVector() of the inTrees package. 
2 | discretizeVector <- 3 | function(v, K = 5, knames = NULL, return_all = FALSE) { 4 | 5 | # set category names 6 | if (is.null(knames) == TRUE) { 7 | if (K == 2) { 8 | knames <- c("min", "Low", "High") 9 | } else if (K == 3) { 10 | knames <- c("min", "Low", "Medium", "High") 11 | } else if (K == 4) { 12 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 13 | } else if (K == 5) { 14 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 15 | } else { 16 | knames <- c("min", paste("L", seq_len(K), sep = "")) 17 | } 18 | } else if (length(knames) == K) { 19 | knames <- c("min", knames) 20 | } 21 | 22 | 23 | splitV <- quantile(v, probs = seq(0, 1, 1 / K), na.rm = TRUE, names = TRUE, type = 3) 24 | names(splitV) <- knames 25 | 26 | numSplit <- length(splitV) 27 | if (numSplit < 2) { 28 | return(v) 29 | } 30 | 31 | newV <- vector("character", length(v)) 32 | newV[which(v <= splitV[2])] <- names(splitV)[2] 33 | 34 | if (numSplit >= 3) { 35 | for (jj in 3:numSplit) { 36 | newV[which(v > splitV[jj - 1] & v <= splitV[jj])] <- names(splitV)[jj] 37 | } 38 | } 39 | 40 | splitV <- splitV[c("min", unique(newV))] 41 | 42 | if (return_all == TRUE) { 43 | return(list("newV" = newV, "splitV" = splitV[order(splitV)])) 44 | } else { 45 | return(newV) 46 | } 47 | } 48 | -------------------------------------------------------------------------------- /R/aggregateTaxa_fine.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_fine <- function(taxa) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | 4 | if (!("data.table" %in% class(taxa))) { 5 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 6 | } 7 | # create the columns for the updating 8 | taxa <- taxa[, `:=`(newFeature = Feature, ix = .I)] 9 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 10 | 11 | ### First: at the genus level 12 | not_s <- taxa[is.na(s) & !is.na(g), ix] 13 | for (i in not_s) { 14 | # subset the species of the same genus 15 | tmp <- taxa[["g"]][i] 16 | fine <- taxa[g == tmp & !is.na(s), newFeature] 17 | if (length(fine) == 1) { 18 | taxa <- taxa[ix == i, newFeature := fine] 19 | } 20 | fine <- NULL 21 | } 22 | 23 | ### Then: at the family level 24 | not_s <- taxa[is.na(g), ix] 25 | for (i in not_s) { 26 | # subset to the features of the same family 27 | tmp <- taxa[["f"]][i] 28 | fine <- unique(taxa[f == tmp & !is.na(g), newFeature]) 29 | if (length(fine) == 1) { 30 | taxa <- taxa[ix == i, newFeature := fine] 31 | } 32 | fine <- NULL 33 | } 34 | 35 | # record how many times a newFeature has been attributed and who was changed 36 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 37 | n > 1 & newFeature == Feature, changed := "Recipient" 38 | ][ 39 | n > 1 & newFeature != Feature, changed := "Upgraded" 40 | ][ 41 | , changed := as.factor(changed) 42 | ] 43 | 44 | return(taxa) 45 | } 46 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/pruneDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/pruneDecisions.R 3 | \name{pruneDecisions} 4 | \alias{pruneDecisions} 5 | \title{Prune rules of a decision ensemble} 6 | \usage{ 7 | pruneDecisions( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | maxDecay = 0, 13 | typeDecay = 1, 14 | in_parallel = FALSE, 15 | n_cores = 
detectCores() - 1, 16 | cluster = NULL 17 | ) 18 | } 19 | \arguments{ 20 | \item{rules}{a data frame with a column "condition".} 21 | 22 | \item{data}{data to use for calculating the metrics.} 23 | 24 | \item{target}{response variable.} 25 | 26 | \item{classPos}{for classification tasks, the positive class predicted.} 27 | 28 | \item{maxDecay}{threshold for the increase in error; if maxDecay = -Inf, no pruning is done; if maxDecay = 0, only variables increasing the error are pruned from decisions.} 29 | 30 | \item{typeDecay}{if typeDecay = 1 (default), the absolute increase in error is computed; the relative one is computed if typeDecay = 2.} 31 | 32 | \item{in_parallel}{if TRUE, the function is run in parallel.} 33 | 34 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 35 | 36 | \item{cluster}{the cluster to use to run the function in parallel.} 37 | } 38 | \value{ 39 | Decision ensemble with pruned conditions. 40 | } 41 | \description{ 42 | This function removes from rules the variables that do not increase the rule error by more than a certain threshold. See the pruneRules function from the inTrees package (Deng, 2019) for more details. 43 | } 44 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/aggregateTaxa_fine.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_fine <- function(taxa) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | 4 | if (!("data.table" %in% class(taxa))) { 5 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 6 | } 7 | # create the columns for the updating 8 | taxa <- taxa[, `:=`(newFeature = Feature, ix = .I)] 9 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 10 | 11 | ### First: at the genus level 12 | not_s <- taxa[is.na(s) & !is.na(g), ix] 13 | for (i in not_s) { 14 | # subset the species of the same genus 15 | tmp <- taxa[["g"]][i] 16 | fine <- taxa[g == tmp & !is.na(s), newFeature] 17 | if (length(fine) == 1) { 18 | taxa <- taxa[ix == i, newFeature := fine] 19 | } 20 | fine <- NULL 21 | } 22 | 23 | ### Then: at the family level 24 | not_s <- taxa[is.na(g), ix] 25 | for (i in not_s) { 26 | # subset to the features of the same family 27 | tmp <- taxa[["f"]][i] 28 | fine <- unique(taxa[f == tmp & !is.na(g), newFeature]) 29 | if (length(fine) == 1) { 30 | taxa <- taxa[ix == i, newFeature := fine] 31 | } 32 | fine <- NULL 33 | } 34 | 35 | # record how many times a newFeature has been attributed and who was changed 36 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 37 | n > 1 & newFeature == Feature, changed := "Recipient" 38 | ][ 39 | n > 1 & newFeature != Feature, changed := "Upgraded" 40 | ][ 41 | , changed := as.factor(changed) 42 | ] 43 | 44 | return(taxa) 45 | } 46 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/getDecisionsMetrics.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getDecisionsMetric.R 3 | \name{getDecisionsMetrics} 4 | \alias{getDecisionsMetrics} 5 | \title{Measure the error, prediction and importance of decisions} 6 | \usage{ 7 | getDecisionsMetrics( 8 | ruleExec, 9 | data, 10 | target, 11 | classPos = NULL, 12 | importances = TRUE, 13 | in_parallel = FALSE, 14 | 
n_cores = detectCores() - 1, 15 | cluster = NULL 16 | ) 17 | } 18 | \arguments{ 19 | \item{ruleExec}{a vector with name "condition" or a data.frame with a column "condition".} 20 | 21 | \item{data}{data from which to get the decision support.} 22 | 23 | \item{target}{response variable.} 24 | 25 | \item{classPos}{for classification tasks, the positive class to be predicted by decisions.} 26 | 27 | \item{importances}{if FALSE, the importances are not calculated (importances = TRUE by default).} 28 | 29 | \item{in_parallel}{if TRUE, the function is run in parallel.} 30 | 31 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 32 | 33 | \item{cluster}{the cluster to use to run the function in parallel.} 34 | } 35 | \value{ 36 | a datatable with the rule (column "condition"), error ("err"), prediction ("pred"), support ("support"), and number of variables in the decision rule ("len"). Columns "gain" and "imp" with the gain and importance of the decision are added if importances were calculated. 37 | } 38 | \description{ 39 | This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off. 40 | } 41 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeData_model.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset based on thresholds used in the model to create splits. 2 | #' 3 | #' This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 4 | #' 5 | #' @param data data to discretize. 6 | #' @param Kmax numeric, maximal number of categories for each variable (default: Kmax = 2). 7 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 8 | #' @return Data with discretized variables. 
9 | #' @export 10 | discretizeData_model <- function(data, conditions, Kmax = 2, return_split = FALSE) { 11 | 12 | # get the thresholds 13 | l_var <- getThresholds(conditions = conditions, data = data, Kmax = Kmax) 14 | colNb <- names(l_var) 15 | 16 | # discretize the variable vectors 17 | l_var <- lapply(l_var, discretizeVector_model, return_all = TRUE) 18 | names(l_var) <- colNb 19 | colNb <- as.integer(colNb) 20 | 21 | data_ctg <- as.data.table(data) 22 | 23 | for (j in colNb) set(data_ctg, i = NULL, j = j, l_var[[as.character(j)]]$var_d) 24 | 25 | data_ctg <- data_ctg[, (colNb) := lapply(.SD, as.factor), .SDcols = colNb] 26 | 27 | 28 | # get the split table if asked 29 | if (return_split == TRUE) { 30 | splitV <- lapply(l_var, function(x) { 31 | x$thr 32 | }) 33 | names(splitV) <- names(l_var) 34 | 35 | splitV_med <- lapply(l_var, function(x) { 36 | x$med 37 | }) 38 | names(splitV_med) <- names(splitV) 39 | 40 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 41 | } 42 | 43 | 44 | return(data_ctg) 45 | } 46 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/model2DE_resampling.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE_resampling.R 3 | \name{model2DE_resampling} 4 | \alias{model2DE_resampling} 5 | \title{Run model2DE on several bootstrap resamples.} 6 | \usage{ 7 | model2DE_resampling( 8 | model, 9 | model_type, 10 | data, 11 | target, 12 | classPos = NULL, 13 | times = 10, 14 | p = 0.5, 15 | sample_weight = NULL, 16 | ntree = "all", 17 | maxdepth = Inf, 18 | dummy_var = NULL, 19 | prune = TRUE, 20 | maxDecay = 0.05, 21 | typeDecay = 2, 22 | discretize = TRUE, 23 | K = 2, 24 | mode = "data", 25 | filter = TRUE, 26 | min_imp = 0.9, 27 | seed = 0, 28 | in_parallel = FALSE, 29 | n_cores = detectCores() - 1, 30 | cluster = NULL 31 | ) 32 | } 33 | \arguments{ 34 | \item{times}{number of bootstraps.} 35 | 36 | \item{p}{fraction of data to resample.} 37 | 38 | \item{sample_weight}{numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classPos argument).} 39 | 40 | \item{in_parallel}{if TRUE, the function is run in parallel} 41 | 42 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use} 43 | 44 | \item{cluster}{the cluster to use to run the function in parallel} 45 | 46 | \item{...}{arguments to be passed to the model2DE function.} 47 | } 48 | \value{ 49 | A list with the row numbers of partitioned data, the rules originally extracted from the model, a list with results from each bootstrap (use stabilitySelection to obtain the stable decision ensemble). 50 | } 51 | \description{ 52 | Wrapper around the model2DE function to run it on several bootstrap resamples. 
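Use stabilitySelection on the returned list of bootstrap results to obtain the stable decision ensemble.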
53 | } 54 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/Misc.R: -------------------------------------------------------------------------------- 1 | taxa2vector <- function(taxa_table) { 2 | 3 | ### taxa_table is a data frame with taxonomic levels as columns 4 | # Add the previous levels to each one of them 5 | taxa_table$k <- paste("k_", taxa_table$k, sep = "") 6 | taxa_table$p <- paste(taxa_table$k, taxa_table$p, sep = "__p_") 7 | taxa_table$c <- paste(taxa_table$p, taxa_table$c, sep = "__c_") 8 | taxa_table$o <- paste(taxa_table$c, taxa_table$o, sep = "__o_") 9 | taxa_table$f <- paste(taxa_table$o, taxa_table$f, sep = "__f_") 10 | taxa_table$g <- paste(taxa_table$f, taxa_table$g, sep = "__g_") 11 | taxa_table$s <- paste(taxa_table$g, taxa_table$s, sep = "__s_") 12 | 13 | ### Vectorize the taxa_table table 14 | tax_vector <- unique(unlist(setDT(taxa_table)[, .(k, p, c, o, f, g, s)])) 15 | 16 | ### Text formatting: 17 | # no punctuation or weird symbols, only '_' 18 | 19 | tax_vector <- str_replace_all(tax_vector, pattern = "[:punct:]", replacement = "_") 20 | tax_vector <- str_replace_all(tax_vector, pattern = " ", replacement = "_") 21 | tax_vector <- str_replace_all(tax_vector, pattern = "__", replacement = "_") 22 | 23 | # levels are separated by 2x'_' and the initial of the level 24 | tax_vector <- str_replace_all(tax_vector, pattern = "\\_(?=(p|c|o|f|g|s){1}_)", replacement = "__") 25 | 26 | return(tax_vector) 27 | } 28 | 29 | 30 | #' Transform character strings to be compatible with endoR functions. 31 | #' @export 32 | compatibleNames <- function(x) { 33 | x %>% 34 | str_replace_all(pattern = "[:punct:]", replacement = "_") %>% 35 | str_replace_all(pattern = " ", replacement = "_") %>% 36 | str_replace_all(pattern = "\\_+", replacement = "_") %>% 37 | str_replace_all(pattern = "\\_$", replacement = "") 38 | } 39 | -------------------------------------------------------------------------------- /R/discretizeData_model.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset based on thresholds used in the model to create splits. 2 | #' 3 | #' This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 4 | #' 5 | #' @param data data to discretize 6 | #' @param conditions character vector with all conditions from which to find the thresholds 7 | #' @param Kmax numeric, maximal number of categories for each variable (default: Kmax = 2). 8 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 9 | #' @return Data with discretized variables. 
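#' @examples
#' \dontrun{
#' # A hedged sketch: `de` is assumed to be a decision ensemble with a
#' # 'condition' column, e.g. the exec element returned by preCluster()
#' iris_d <- discretizeData_model(data = iris[, -5], conditions = de$condition, Kmax = 2)
#' }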
10 | #' @export 11 | discretizeData_model <- function(data, conditions, Kmax = 2, return_split = FALSE) { 12 | 13 | # get the thresholds 14 | l_var <- getThresholds(conditions = conditions, data = data, Kmax = Kmax) 15 | colNb <- names(l_var) 16 | 17 | # discretize the variable vectors 18 | l_var <- lapply(l_var, discretizeVector_model, return_all = TRUE) 19 | names(l_var) <- colNb 20 | colNb <- as.integer(colNb) 21 | 22 | data_ctg <- as.data.table(data) 23 | 24 | for (j in colNb) set(data_ctg, i = NULL, j = j, l_var[[as.character(j)]]$var_d) 25 | 26 | data_ctg <- data_ctg[, (colNb) := lapply(.SD, as.factor), .SDcols = colNb] 27 | 28 | 29 | # get the split table if asked 30 | if (return_split == TRUE) { 31 | splitV <- lapply(l_var, function(x) { 32 | x$thr 33 | }) 34 | names(splitV) <- names(l_var) 35 | 36 | splitV_med <- lapply(l_var, function(x) { 37 | x$med 38 | }) 39 | names(splitV_med) <- names(splitV) 40 | 41 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 42 | } 43 | 44 | 45 | return(data_ctg) 46 | } 47 | -------------------------------------------------------------------------------- /man/aggregateTaxa.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/aggregateTaxa.R 3 | \name{aggregateTaxa} 4 | \alias{aggregateTaxa} 5 | \title{Aggregate taxa according to a measure.} 6 | \usage{ 7 | aggregateTaxa( 8 | taxa, 9 | features = NULL, 10 | weights = NULL, 11 | thr = NULL, 12 | type = "coarse" 13 | ) 14 | } 15 | \arguments{ 16 | \item{taxa}{taxa should be a data.frame with the feature name and all its other coarser/finer levels (columns: 'Feature', 'f', 'g', 's')} 17 | 18 | \item{features}{if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked.} 19 | 20 | \item{weights}{a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only.} 21 | 22 | \item{thr}{numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only. If NULL, the median of weights is used.} 23 | 24 | \item{type}{character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine' then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e. there is a unique finer level for that coarser level. If 'both', both aggregation steps are sequentially performed.} 25 | } 26 | \value{ 27 | A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 28 | } 29 | \description{ 30 | This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 
31 | } 32 | -------------------------------------------------------------------------------- /R/Misc.R: -------------------------------------------------------------------------------- 1 | taxa2vector <- function(taxa_table) { 2 | 3 | ### taxa_table is a data frame with taxonomic levels as columns 4 | # Add the previous levels to each one of them 5 | taxa_table$k <- paste("k_", taxa_table$k, sep = "") 6 | taxa_table$p <- paste(taxa_table$k, taxa_table$p, sep = "__p_") 7 | taxa_table$c <- paste(taxa_table$p, taxa_table$c, sep = "__c_") 8 | taxa_table$o <- paste(taxa_table$c, taxa_table$o, sep = "__o_") 9 | taxa_table$f <- paste(taxa_table$o, taxa_table$f, sep = "__f_") 10 | taxa_table$g <- paste(taxa_table$f, taxa_table$g, sep = "__g_") 11 | taxa_table$s <- paste(taxa_table$g, taxa_table$s, sep = "__s_") 12 | 13 | ### Vectorize the taxa_table table 14 | tax_vector <- unique(unlist(setDT(taxa_table)[, list(k, p, c, o, f, g, s)])) 15 | 16 | ### Text formatting: 17 | # no punctuation or weird symbols, only '_' 18 | 19 | tax_vector <- str_replace_all(tax_vector, pattern = "[:punct:]", replacement = "_") 20 | tax_vector <- str_replace_all(tax_vector, pattern = " ", replacement = "_") 21 | tax_vector <- str_replace_all(tax_vector, pattern = "__", replacement = "_") 22 | 23 | # levels are separated by 2x'_' and the initial of the level 24 | tax_vector <- str_replace_all(tax_vector, pattern = "\\_(?=(p|c|o|f|g|s){1}_)", replacement = "__") 25 | 26 | return(tax_vector) 27 | } 28 | 29 | 30 | #' Transform character strings to be compatible with endoR functions. 31 | #' @param x character string or vector 32 | #' @examples 33 | #' x <- c('hello.world__I am;happy', 'me-2') 34 | #' compatibleNames(x) 35 | #' @export 36 | compatibleNames <- function(x) { 37 | x %>% 38 | str_replace_all(pattern = "[:punct:]", replacement = "_") %>% 39 | str_replace_all(pattern = " ", replacement = "_") %>% 40 | str_replace_all(pattern = "\\_+", replacement = "_") %>% 41 | str_replace_all(pattern = "\\_$", replacement = "") 42 | } 43 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/aggregateTaxa.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/aggregateTaxa.R 3 | \name{aggregateTaxa} 4 | \alias{aggregateTaxa} 5 | \title{Aggregate taxa according to a measure.} 6 | \usage{ 7 | aggregateTaxa( 8 | taxa, 9 | features = NULL, 10 | weights = NULL, 11 | thr = NULL, 12 | type = "coarse" 13 | ) 14 | } 15 | \arguments{ 16 | \item{taxa}{taxa should be a vector with the long format of the features of the taxa table with taxonomic levels as columns - cf taxa2vector; or a data.frame with the feature name and all its other coarser/finer levels.} 17 | 18 | \item{features}{if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked.} 19 | 20 | \item{weights}{a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only.} 21 | 22 | \item{thr}{numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only. If NULL, the median of weights is used.} 23 | 24 | \item{type}{character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine' then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e. there is a unique finer level for that coarser level. 
If 'both', both aggregation steps are sequentially performed.} 25 | } 26 | \value{ 27 | A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 28 | } 29 | \description{ 30 | This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 31 | } 32 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/filterDecisionsImportances.R: -------------------------------------------------------------------------------- 1 | #' Filter decisions according to their metrics 2 | #' 3 | #' This function filters decisions in a heuristic manner according to their importance and multiplicity. 4 | #' A relative importance threshold is calculated that maximises the product of the summed relative importance * multiplicity of the retained decisions and the number of decisions removed. 5 | #' All decisions with a relative importance above that threshold are kept. The argument min_imp is the minimal relative importance of the decisions kept. 6 | #' 7 | #' @param rules data.frame corresponding to the decisions, with all their metrics. 8 | #' @param min_imp minimal relative importance of the decisions that must be kept, the threshold to remove decisions is thus going to take lower values than max(imp)*min_imp. 9 | #' @return The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 10 | #' @export 11 | filterDecisionsImportances <- function(rules, min_imp = 0.7) { 12 | if (("data.table" %in% class(rules)) == FALSE) { 13 | rules <- setDT(rules) 14 | } 15 | 16 | maxThr <- rules[, max(imp)] * min_imp 17 | impthr <- seq(0, maxThr, by = maxThr / 500) 18 | 19 | checkImp <- data.table(impThr = impthr) 20 | 21 | for (i in 1:nrow(checkImp)) { 22 | set(checkImp, i, "nrules", nrow(rules[imp >= impthr[i], ])) 23 | set(checkImp, i, "sum_impn", sum(rules[imp >= impthr[i], imp * n])) ### remove n 24 | } 25 | 26 | checkImp <- checkImp[, nremoved := max(nrules) - nrules][, f := sum_impn * nremoved] 27 | 28 | maxD <- max(checkImp[f == max(f), impThr]) 29 | rules <- rules[, filt_imp := "not ok"][imp >= maxD, filt_imp := "ok"] 30 | 31 | message( 32 | "Threshold for relative importance: < ", round(maxD, digits = 3), " and ", 33 | nrow(rules[filt_imp == "not ok"]), " rules removed.\n" 34 | ) 35 | 36 | # Filter rules 37 | rules <- subset(rules, filt_imp == "ok", select = -filt_imp) 38 | 39 | return(rules) 40 | } 41 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/evaluateAlpha.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/evaluateAlpha.R 3 | \name{evaluateAlpha} 4 | \alias{evaluateAlpha} 5 | \title{Calculate the number of decisions and of predicted samples from decision ensembles obtained with different alpha values} 6 | \usage{ 7 | evaluateAlpha( 8 | rules, 9 | alphas = c(5, 10, 15, 20, 30, 50, 75), 10 | pi_thr = 0.7, 11 | data = NULL, 12 | decision_ensembles = TRUE, 13 | aggregate_taxa = FALSE, 14 | taxa = NULL 15 | ) 16 | } 17 | \arguments{ 18 | 
\item{alphas}{expected numbers of false positive decisions selected, i.e. the alpha values to evaluate (default = c(5, 10, 15, 20, 30, 50, 75)).} 19 | 20 | \item{pi_thr}{fraction of bootstraps in which a decision should have been selected to be included in the stable decision ensemble (default = 0.7).} 21 | 22 | \item{decision_ensembles}{should the decision ensembles be returned?} 23 | 24 | \item{rules}{list of bootstrap results} 25 | } 26 | \value{ 27 | A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importances and influences in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes. 28 | } 29 | \description{ 30 | The aim is to help pick an alpha that will result in a decision ensemble able to predict most samples. Performs stability selection for each of the given alpha values. The number of decisions and of samples that follow decisions are also calculated. 31 | The procedure is adapted from Meinshausen and Buehlmann (2010): the best decisions from each bootstrap are pre-selected and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 32 | The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 33 | } 34 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/stabilitySelection.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/stabilitySelection.R 3 | \name{stabilitySelection} 4 | \alias{stabilitySelection} 5 | \title{Obtain a stable decision ensemble} 6 | \usage{ 7 | stabilitySelection( 8 | rules, 9 | alpha_error = 1, 10 | pi_thr = 0.7, 11 | aggregate_taxa = FALSE, 12 | taxa = NULL 13 | ) 14 | } 15 | \arguments{ 16 | \item{rules}{list of bootstrap results} 17 | 18 | \item{alpha_error}{expected number of false positive decisions selected (default = 1).} 19 | 20 | \item{pi_thr}{fraction of bootstraps in which a decision should have been selected to be included in the stable decision ensemble (default = 0.7).} 21 | 22 | \item{aggregate_taxa}{should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species)} 23 | 24 | \item{taxa}{if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s)} 25 | } 26 | \value{ 27 | A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importances and influences in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes. 28 | } 29 | \description{ 30 | Performs stability selection after bootstrapping with the model2DE_cluster or model2DE_resampling functions. 31 | The procedure is adapted from Meinshausen and Buehlmann (2010): the best decisions from each bootstrap are pre-selected and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 
32 | The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 33 | } 34 | -------------------------------------------------------------------------------- /R/filterDecisionsImportances.R: -------------------------------------------------------------------------------- 1 | #' Filter decisions according to their metrics 2 | #' 3 | #' This function filters decisions in a heuristic manner according to their importance and multiplicity. 4 | #' A relative importance threshold is calculated that maximises the product of the summed relative importance * multiplicity of the retained decisions and the number of decisions removed. 5 | #' All decisions with a relative importance above that threshold are kept. The argument min_imp is the minimal relative importance of the decisions kept. 6 | #' 7 | #' @param rules data.frame corresponding to the decisions, with all their metrics. 8 | #' @param min_imp minimal relative importance of the decisions that must be kept, the threshold to remove decisions is thus going to take lower values than max(imp)*min_imp. 9 | #' @return The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 10 | #' 11 | #' @example examples/iris_each_function.R 12 | #' @export 13 | filterDecisionsImportances <- function(rules, min_imp = 0.7) { 14 | if (("data.table" %in% class(rules)) == FALSE) { 15 | rules <- setDT(rules) 16 | } 17 | 18 | maxThr <- rules[, max(imp)] * min_imp 19 | impthr <- seq(0, maxThr, by = maxThr / 500) 20 | 21 | checkImp <- data.table(impThr = impthr) 22 | 23 | for (i in 1:nrow(checkImp)) { 24 | set(checkImp, i, "nrules", nrow(rules[imp >= impthr[i], ])) 25 | set(checkImp, i, "sum_impn", sum(rules[imp >= impthr[i], imp * n])) ### remove n 26 | } 27 | 28 | checkImp <- checkImp[, nremoved := max(nrules) - nrules][, f := sum_impn * nremoved] 29 | 30 | maxD <- max(checkImp[f == max(f), impThr]) 31 | rules <- rules[, filt_imp := "not ok"][imp >= maxD, filt_imp := "ok"] 32 | 33 | message( 34 | "Threshold for relative importance: < ", round(maxD, digits = 3), " and ", 35 | nrow(rules[filt_imp == "not ok"]), " rules removed.\n" 36 | ) 37 | 38 | # Filter rules 39 | rules <- subset(rules, filt_imp == "ok", select = -filt_imp) 40 | 41 | return(rules) 42 | } 43 | -------------------------------------------------------------------------------- /R/endoR.R: -------------------------------------------------------------------------------- 1 | #' endoR 2 | #' 3 | #' endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 
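#'
#' Typical workflow (a sketch; see the vignette and the examples/ scripts):
#' fit a tree ensemble, e.g. with randomForest or xgboost, call model2DE()
#' to extract, prune, and filter a decision ensemble, then visualize it with
#' plotFeatures() and plotNetwork().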
4 | #' 5 | #' @docType package 6 | #' 7 | #' @author Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 8 | #' 9 | #' @name endoR 10 | #' @import data.table 11 | #' @rawNamespace import(dplyr, except = c(union, as_data_frame, groups, combine, slice, filter, lag, last, first, between)) 12 | #' @import inTrees 13 | #' @import stringr 14 | #' @rawNamespace import(ggplot2, except = margin) 15 | #' @import ggraph 16 | #' @rawNamespace import(igraph, except = c(is_named, decompose, spectrum)) 17 | #' @import stats 18 | #' @import parallel 19 | #' @import clustermq 20 | #' @rawNamespace import(rlang, except = ':=') 21 | #' @import utils 22 | #' @rawNamespace import(randomForest, except = importance) 23 | #' @import tidyverse 24 | #' @import caret 25 | #' @import ranger 26 | #' @import xgboost 27 | 28 | utils::globalVariables(c('Feature', 'Feature_short', 'association_sign' 29 | , 'change', 'changed', 'classPos', 'coarseFeature' 30 | , 'condition', 'd', 'd.x', 'd.y', 'd_assoc', 'decay' 31 | , 'dummy_var', 'elongateTaxa', 'err', 'err.rm', 'f' 32 | , 'filt_imp', 'full', 'fullID', 'g', 'gain', 'imp' 33 | , 'impThr', 'imp_sd', 'importance.x', 'importance.y' 34 | , 'inN', 'in_parallel', 'ix', 'k', 'len', 'lev' 35 | , 'level', 'maxDecay', 'n_cores', 'n_sd', 'name' 36 | , 'newFeature', 'newWeight', 'nremoved', 'nrules', 'o' 37 | , 'oldN', 'p', 'pred', 'pred.rm', 'predRm.x' 38 | , 'predRm.xy', 'predRm.y', 'prune', 'ruleID', 's' 39 | , 's.rm', 'sRm.x', 'sRm.y', 'sum_impn', 'support' 40 | , 'target', 'to_up', 'to_update', 'to_update_ctg' 41 | , 'typeDecay', 'weight', 'wimp', 'x', 'y')) 42 | 43 | 44 | #' Titanic passenger survival data (from the titanic package) 45 | #' 46 | #' @author Paul Hendricks \email{paul.hendricks.2013@owu.edu} 47 | #' @references \url{https://cran.r-project.org/package=titanic} 48 | "titanic" 49 | 50 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/discretizeDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeDecisions.R 3 | \name{discretizeDecisions} 4 | \alias{discretizeDecisions} 5 | \title{Discretize numerical variables in a decision ensemble} 6 | \usage{ 7 | discretizeDecisions( 8 | rules, 9 | data = NULL, 10 | target, 11 | mode = "data", 12 | K = 2, 13 | splitV = NULL, 14 | classPos = NULL, 15 | in_parallel = FALSE, 16 | n_cores = detectCores() - 1, 17 | cluster = NULL 18 | ) 19 | } 20 | \arguments{ 21 | \item{rules}{a data frame with a column "condition".} 22 | 23 | \item{data}{data to discretize.} 24 | 25 | \item{target}{response variable.} 26 | 27 | \item{mode}{whether to discretize variables based on the data distribution (default, mode = 'data') or on the data splits in the model (mode = 'model').} 28 | 29 | \item{K}{numeric, number of categories to create from numeric variables (default: K = 2).} 30 | 31 | \item{splitV}{instead of running internally discretizeData, one can provide a list with, for each variable to discretize in rules, the thresholds delimiting each new category.} 32 | 33 | \item{classPos}{for classification, the positive class.} 34 | 35 | \item{in_parallel}{if TRUE, the function is run in parallel.} 36 | 37 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 38 | 39 | \item{cluster}{the cluster to use to run the function in parallel.}
40 | 41 | \item{data_ctg}{discretized data, if splitV is passed. Necessary to re-compute the metrics (if column 'err' in rules).} 42 | 43 | \item{return_data}{if TRUE, discretized data are also returned.} 44 | } 45 | \value{ 46 | Decision ensemble with discretized variables in the condition. Decisions with the same condition are aggregated: their importances are summed, and all other metrics are averaged. 47 | } 48 | \description{ 49 | This function replaces in a decision ensemble the boundaries of numerical features with their corresponding levels when the variable is discretized. 50 | If discretized data are not passed, data are first discretized into K categories based on their quantiles (see discretizeData). 51 | The error, prediction, importance and multiplicity of decisions are updated after discretization. 52 | } 53 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/preCluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/preCluster.R 3 | \name{preCluster} 4 | \alias{preCluster} 5 | \title{Extract decisions from a model and create bootstrap resamples.} 6 | \usage{ 7 | preCluster( 8 | model, 9 | model_type, 10 | data, 11 | target, 12 | times = 10, 13 | p = 0.5, 14 | sample_weight = NULL, 15 | classPos = NULL, 16 | ntree = "all", 17 | maxdepth = Inf, 18 | dummy_var = NULL, 19 | discretize = FALSE, 20 | K = 2, 21 | mode = "data", 22 | seed = 0, 23 | in_parallel = FALSE, 24 | n_cores = detectCores() - 1 25 | ) 26 | } 27 | \arguments{ 28 | \item{model}{model to extract rules from.} 29 | 30 | \item{model_type}{character string: 'RF', 'random forest', 'rf', 'xgboost', 'XGBOOST', 'xgb', 'XGB', 'ranger', 'Ranger', 'gbm' or 'GBM'.} 31 | 32 | \item{data}{data with the same columns as the data used to fit the model.} 33 | 34 | \item{target}{response variable.} 35 | 36 | \item{times}{number of bootstraps.} 37 | 38 | \item{p}{fraction of data to resample.} 39 | 40 | \item{sample_weight}{numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classPos argument).} 41 | 42 | \item{classPos}{the positive class predicted by decisions} 43 | 44 | \item{ntree}{number of trees to use from the model (default = all)} 45 | 46 | \item{maxdepth}{maximal node depth to use for extracting rules (by default, full branches are used).} 47 | 48 | \item{dummy_var}{if multiclass variables were transformed into dummy variables before fitting the model, one can pass their names in a vector here to avoid multiple levels being used in the same rule (recommended).} 49 | 50 | \item{discretize}{if TRUE, discretization is performed with K categories (see discretizeDecisions; default = FALSE).} 51 | 52 | \item{in_parallel}{if TRUE, the function is run in parallel.} 53 | 54 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 55 | 56 | \item{cluster}{the cluster to use to run the function in parallel.} 57 | } 58 | \value{ 59 | A list with the row numbers of partitioned data, the rules originally extracted from the model and new data if discretization was performed. 60 | } 61 | \description{ 62 | To run before bootstrapping in parallel with the clustermq package and the model2DE_cluster function. Extracts decisions and optionally discretizes them. Creates data partitions for bootstrapping. 
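The bootstrap results can then be aggregated with stabilitySelection to obtain a stable decision ensemble.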
63 | } 64 | -------------------------------------------------------------------------------- /R/aggregateTaxa_coarse.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_coarse <- function(taxa, weights, thr = NULL) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | ### weights: df with the features and the weights only (the higher the better) 4 | 5 | if (is.null(thr) == TRUE) { 6 | thr <- mean(weights$weight) 7 | } 8 | 9 | # some formatting 10 | if (!("data.table" %in% class(taxa))) { 11 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 12 | } 13 | taxa <- merge(taxa, weights[, list(Feature, weight)], by = "Feature", all.x = TRUE) 14 | setorder(taxa, weight) 15 | taxa <- taxa[!is.na(f), ][, `:=`(newFeature = Feature, newWeight = weight, ix = .I)] 16 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 17 | lim <- taxa[weight > thr, min(ix)] 18 | 19 | 20 | for (i in 1:lim) { 21 | 22 | # subset to the features of the same family doing better (to look at the right spot) 23 | tmp <- taxa$f[i] 24 | to_check <- taxa[f == tmp, ix] 25 | to_check <- to_check[which(to_check > i)] 26 | 27 | if (length(to_check) > 0) { 28 | 29 | # record if features have the same genus/family names and how good their rank is 30 | check <- matrix(ncol = 4) 31 | colnames(check) <- c(colnames(taxa)[tax_col], "ix") 32 | for (k in 1:length(to_check)) { 33 | tmp <- cbind((taxa[i, c(tax_col), with = FALSE] == taxa[to_check[k], c(tax_col), with = FALSE]), 34 | ix = to_check[k] 35 | ) 36 | check <- rbind(check, tmp) 37 | } 38 | check <- as.data.table(check, stringsAsFactors = FALSE) 39 | 40 | # select the one with the finest taxonomic level 41 | if (is.na(taxa$s[i]) == FALSE) { 42 | better <- check[is.na(s) & g == 1, ix] 43 | } else if (is.na(taxa$g[i]) == FALSE) { 44 | better <- check[is.na(g) & f == 1, ix] 45 | } else { 46 | better <- i 47 | } 48 | 49 | # and give it to the bad one 50 | if (length(better) > 0) { 51 | # to make it retro-active!
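# (clarification: any feature that was already aggregated into Feature[i]
# is re-pointed to the new recipient as well, so aggregation chains
# collapse onto a single recipient)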
52 | to_change <- taxa[newFeature == taxa[["Feature"]][i], ix] 53 | taxa <- taxa[ix %in% to_change, `:=`( 54 | newFeature = taxa[["newFeature"]][max(better)], 55 | newWeight = taxa[["newWeight"]][max(better)] 56 | )] 57 | } 58 | } 59 | } 60 | 61 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 62 | n > 1 & newFeature == Feature, changed := "Recipient" 63 | ][ 64 | n > 1 & newFeature != Feature, changed := "Downgraded" 65 | ][ 66 | , changed := as.factor(changed) 67 | ] 68 | 69 | return(taxa) 70 | } 71 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/aggregateTaxa_coarse.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_coarse <- function(taxa, weights, thr = NULL) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | ### weights: df with the features and the weights only (the higher the better) 4 | 5 | if (is.null(thr) == TRUE) { 6 | thr <- mean(weights$weight) 7 | } 8 | 9 | # some formatting 10 | if (!("data.table" %in% class(taxa))) { 11 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 12 | } 13 | taxa <- merge(taxa, weights[, .(Feature, weight)], by = "Feature", all.x = TRUE) 14 | setorder(taxa, weight) 15 | taxa <- taxa[!is.na(f), ][, `:=`(newFeature = Feature, newWeight = weight, ix = .I)] 16 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 17 | lim <- taxa[weight > thr, min(ix)] 18 | 19 | 20 | for (i in 1:lim) { 21 | 22 | # subset to the features of the same family doing better (to look at the right spot) 23 | tmp <- taxa$f[i] 24 | to_check <- taxa[f == tmp, ix] 25 | to_check <- to_check[which(to_check > i)] 26 | 27 | if (length(to_check) > 0) { 28 | 29 | # record if features have the same genus/family names and how good their rank is 30 | check <- matrix(ncol = 4) 31 | colnames(check) <- c(colnames(taxa)[tax_col], "ix") 32 | for (k in 1:length(to_check)) { 33 | tmp <- cbind((taxa[i, c(tax_col), with = FALSE] == taxa[to_check[k], c(tax_col), with = FALSE]), 34 | ix = to_check[k] 35 | ) 36 | check <- rbind(check, tmp) 37 | } 38 | check <- as.data.table(check, stringsAsFactors = FALSE) 39 | 40 | # select the one with the finest taxonomic level 41 | if (is.na(taxa$s[i]) == FALSE) { 42 | better <- check[is.na(s) & g == 1, ix] 43 | } else if (is.na(taxa$g[i]) == FALSE) { 44 | better <- check[is.na(g) & f == 1, ix] 45 | } else { 46 | better <- i 47 | } 48 | 49 | # and give it to the bad one 50 | if (length(better) > 0) { 51 | # to make it retro-active!
52 | to_change <- taxa[newFeature == taxa[["Feature"]][i], ix] 53 | taxa <- taxa[ix %in% to_change, `:=`( 54 | newFeature = taxa[["newFeature"]][max(better)], 55 | newWeight = taxa[["newWeight"]][max(better)] 56 | )] 57 | } 58 | } 59 | } 60 | 61 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 62 | n > 1 & newFeature == Feature, changed := "Recipient" 63 | ][ 64 | n > 1 & newFeature != Feature, changed := "Downgraded" 65 | ][ 66 | , changed := as.factor(changed) 67 | ] 68 | 69 | return(taxa) 70 | } 71 | -------------------------------------------------------------------------------- /R/measureSingleDecision.R: -------------------------------------------------------------------------------- 1 | measureSingleDecision <- 2 | function(ruleExec, data, target, type = "reg", gain = FALSE, pred_null = NULL) { 3 | 4 | # same as the original inTrees function, but the fraction of class 1 is returned 5 | # instead of the majority class 6 | 7 | if (gain == FALSE) { 8 | colNames <- c("len", "support", "err", "condition", "pred") 9 | } else { 10 | colNames <- c("len", "support", "err", "condition", "pred", "gain") 11 | } 12 | 13 | 14 | len <- length(unlist(strsplit(ruleExec, split = " & "))) 15 | origRule <- ruleExec 16 | 17 | ruleExec <- paste("which(", ruleExec, ")") 18 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 19 | ixMatch <- eval(parse(text = ruleExec)) 20 | 21 | if (length(ixMatch) == 0) { 22 | v <- rep("-1", length(colNames)) 23 | names(v) <- colNames 24 | return(v) 25 | } 26 | ys <- target[ixMatch] 27 | pred <- mean(ys, na.rm = TRUE) 28 | support <- length(ys) / nrow(data) 29 | 30 | 31 | if (type == "reg") { 32 | err <- mean((pred - ys)^2, na.rm = TRUE) 33 | if (gain == FALSE) { 34 | v <- c(len, support, err, origRule, pred) 35 | } else { 36 | g <- 1 - err / (sum((pred_null - ys)^2) / length(ys)) 37 | v <- c(len, support, err, origRule, pred, g) 38 | } 39 | } else { 40 | if (pred %in% c(0, 1)) { 41 | err <- 0 42 | } else { 43 | err <- 1 - exp(mean(sapply(ys, function(x, pred) { 44 | x * log(pred) + (1 - x) * log(1 - pred) 45 | }, pred = pred))) 46 | } 47 | 48 | if (gain == FALSE) { 49 | v <- c(len, support, err, origRule, pred) 50 | } else { 51 | err_null <- 1 - exp(mean(sapply(ys, function(x, pred) { 52 | x * log(pred) + (1 - x) * log(1 - pred) 53 | }, pred = pred_null))) 54 | g <- 1 - err / err_null 55 | v <- c(len, support, err, origRule, pred, g) 56 | } 57 | } 58 | 59 | names(v) <- colNames 60 | return(v) 61 | } 62 | 63 | 64 | 65 | 66 | ############################################################### 67 | 68 | measureAll <- 69 | function(data, target, classPos = NULL) { 70 | if (!is.numeric(target)) { 71 | target <- ifelse(target == classPos, 1, 0) 72 | pred <- mean(target, na.rm = TRUE) 73 | err <- 1 - exp(mean(sapply(target, function(x, pred) { 74 | x * log(pred) + (1 - x) * log(1 - pred) 75 | }, pred = pred))) 76 | } else { 77 | pred <- mean(target, na.rm = TRUE) 78 | err <- mean((pred - target)^2, na.rm = TRUE) 79 | } 80 | 81 | v <- c(err, pred) 82 | names(v) <- c("err", "pred") 83 | 84 | return(v) 85 | } 86 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/model2DE.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE.R 3 | \name{model2DE} 4 | \alias{model2DE} 5 | \title{Extract a decision ensemble from a tree-based model, simplify it and create an 
interaction network from it.} 6 | \usage{ 7 | model2DE( 8 | model, 9 | model_type, 10 | data, 11 | target, 12 | ntree = "all", 13 | maxdepth = Inf, 14 | classPos = NULL, 15 | dummy_var = NULL, 16 | discretize = FALSE, 17 | K = 2, 18 | mode = "data", 19 | prune = TRUE, 20 | maxDecay = 0.05, 21 | typeDecay = 2, 22 | aggregate_taxa = FALSE, 23 | taxa = NULL, 24 | type = "both", 25 | filter = TRUE, 26 | min_imp = 0.7, 27 | ntest = 100, 28 | exec = NULL, 29 | in_parallel = FALSE, 30 | n_cores = detectCores() - 1, 31 | cluster = NULL, 32 | light = FALSE 33 | ) 34 | } 35 | \arguments{ 36 | \item{model}{model to extract rules from.} 37 | 38 | \item{model_type}{character string: 'RF', 'random forest', 'rf', 'xgboost', 'XGBOOST', 'xgb', 'XGB', 'ranger', 'Ranger', 'gbm' or 'GBM'.} 39 | 40 | \item{data}{data with the same columns as the data used to fit the model.} 41 | 42 | \item{target}{response variable.} 43 | 44 | \item{ntree}{number of trees to use from the model (default = all)} 45 | 46 | \item{maxdepth}{maximal node depth to use for extracting rules (by default, full branches are used).} 47 | 48 | \item{classPos}{the positive class predicted by decisions} 49 | 50 | \item{dummy_var}{if multiclass variables were transformed into dummy variables before fitting the model, one can pass their names in a vector here to avoid multiple levels being used in the same rule (recommended).} 51 | 52 | \item{exec}{if decisions have already been extracted, datatable with a 'condition' column.} 53 | 54 | \item{in_parallel}{if TRUE, the function is run in parallel} 55 | 56 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1} 57 | 58 | \item{cluster}{the cluster to use to run the function in parallel} 59 | 60 | \item{light}{if FALSE (default), returns all intermediary decision ensembles.} 61 | 62 | \item{...}{arguments to be passed to pruneDecisions, discretizeDecisions, filterDecisionsImportances.} 63 | } 64 | \value{ 65 | A list with the final decision ensemble; the discretized data, if numeric variables were discretized in decisions; and the edges and nodes to make a network (see plotNetwork and plotFeatures). 66 | } 67 | \description{ 68 | Wrapper function to extract rules from a tree-based model. 69 | It automatically transforms multiclass predictive variables into dummy variables. 70 | Optionally discretizes numeric variables (see discretizeDecisions). Measures decisions and optionally prunes them. Finally, generates a network. 71 | Can be run in parallel. 
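Worked examples on the iris dataset are provided in the plotFeatures documentation and in examples/iris_each_function.R.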
72 | } 73 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/measureSingleDecision.R: -------------------------------------------------------------------------------- 1 | measureSingleDecision <- 2 | function(ruleExec, data, target, type = "reg", gain = FALSE, pred_null = NULL) { 3 | 4 | # same as the original inTrees function, but the fraction of class 1 is returned 5 | # instead of the majority class 6 | 7 | if (gain == FALSE) { 8 | colNames <- c("len", "support", "err", "condition", "pred") 9 | } else { 10 | colNames <- c("len", "support", "err", "condition", "pred", "gain") 11 | } 12 | 13 | 14 | len <- length(unlist(strsplit(ruleExec, split = " & "))) 15 | origRule <- ruleExec 16 | 17 | ruleExec <- paste("which(", ruleExec, ")") 18 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 19 | ixMatch <- eval(parse(text = ruleExec)) 20 | 21 | if (length(ixMatch) == 0) { 22 | v <- rep("-1", length(colNames)) 23 | names(v) <- colNames 24 | return(v) 25 | } 26 | ys <- target[ixMatch] 27 | pred <- mean(ys, na.rm = TRUE) 28 | support <- length(ys) / nrow(data) 29 | 30 | 31 | if (type == "reg") { 32 | err <- mean((pred - ys)^2, na.rm = TRUE) 33 | if (gain == FALSE) { 34 | v <- c(len, support, err, origRule, pred) 35 | } else { 36 | g <- 1 - err / (sum((pred_null - ys)^2) / length(ys)) 37 | v <- c(len, support, err, origRule, pred, g) 38 | } 39 | } else { 40 | if (pred %in% c(0, 1)) { 41 | err <- 0 42 | } else { 43 | err <- 1 - exp(mean(sapply(ys, function(x, pred) { 44 | x * log(pred) + (1 - x) * log(1 - pred) 45 | }, pred = pred))) 46 | } 47 | 48 | if (gain == FALSE) { 49 | v <- c(len, support, err, origRule, pred) 50 | } else { 51 | err_null <- 1 - exp(mean(sapply(ys, function(x, pred) { 52 | x * log(pred) + (1 - x) * log(1 - pred) 53 | }, pred = pred_null))) 54 | g <- 1 - err / err_null 55 | v <- c(len, support, err, origRule, pred, g) 56 | } 57 | } 58 | 59 | names(v) <- colNames 60 | return(v) 61 | } 62 | 63 | 64 | 65 | 66 | ############################################################### 67 | 68 | measureAll <- 69 | function(data, target, classPos = NULL) { 70 | if (!is.numeric(target)) { 71 | target <- ifelse(target == classPos, 1, 0) 72 | pred <- mean(target, na.rm = TRUE) 73 | err <- 1 - exp(mean(sapply(target, function(x, pred) { 74 | x * log(pred) + (1 - x) * log(1 - pred) 75 | }, pred = pred))) 76 | } else { 77 | pred <- mean(target, na.rm = TRUE) 78 | err <- mean((pred - target)^2, na.rm = TRUE) 79 | } 80 | 81 | v <- c(err, pred) 82 | names(v) <- c("err", "pred") 83 | 84 | return(v) 85 | } 86 | -------------------------------------------------------------------------------- /man/plotFeatures.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotFeatures.R 3 | \name{plotFeatures} 4 | \alias{plotFeatures} 5 | \title{Plot the importance and influence of features.} 6 | \usage{ 7 | plotFeatures( 8 | decision_ensemble, 9 | levels_order = NULL, 10 | colour_low = "#E69F00", 11 | colour_mid = "grey87", 12 | colour_high = "#0072B2", 13 | return_all = FALSE 14 | ) 15 | } 16 | \arguments{ 17 | \item{decision_ensemble}{stable decision ensemble (see stabilitySelection).} 18 | 19 | \item{levels_order}{optional, order for variable levels on the influence plot} 20 | 21 | \item{colour_low}{colour for the negative feature influence values (default: yellowish)} 22 | 23 | \item{colour_mid}{colour for the null 
feature influence values (default: light grey)} 24 | 25 | \item{colour_high}{colour for the positive feature influence values (default: blue)} 26 | 27 | \item{return_all}{if TRUE, returns the table of feature importances and influences and each plot separately (default = FALSE).} 28 | } 29 | \value{ 30 | 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots. 31 | } 32 | \description{ 33 | Returns a ggplot object with variable importance (across all categorical levels for factor variables) and variable per-level influence. 34 | It uses the ggpubr package to combine plots. 35 | } 36 | \examples{ 37 | library(randomForest) 38 | library(caret) 39 | 40 | # import data and fit model 41 | data(iris) 42 | mod <- randomForest(Species ~ ., data = iris) 43 | 44 | # Fit a decision ensemble to predict the species setosa (vs versicolor and 45 | # virginica): no regularization (no decision pruning, discretization, 46 | # bootstrapping, or decision filtering) 47 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 48 | , target = iris$Species, classPos = "setosa" 49 | , filter = FALSE, discretize = FALSE, prune = FALSE) 50 | 51 | # Only decision pruning (default = TRUE) and discretization (default in 2 52 | # categories, we want 3 categories so we change K); no bootstrapping or 53 | # decision filtering. 54 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 55 | , target = iris$Species, classPos = "setosa" 56 | , filter = FALSE, discretize = TRUE, K = 3) 57 | 58 | # idem but run it in parallel on 2 threads 59 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 60 | , target = iris$Species, classPos = "setosa" 61 | , filter = FALSE, discretize = TRUE, K = 3 62 | , in_parallel = TRUE, n_cores = 2) 63 | 64 | # Plot the decision ensemble: 65 | # Plants from the setosa species have small petals and long narrow sepals. 66 | plotFeatures(endo_setosa, levels_order = c("Low", "Medium", "High")) 67 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = "fr") 68 | } 69 | -------------------------------------------------------------------------------- /examples/iris_each_function.R: -------------------------------------------------------------------------------- 1 | library(randomForest) 2 | library(caret) 3 | library(data.table) 4 | 5 | # import data and fit model 6 | data(iris) 7 | mod <- randomForest(Species ~ ., data = iris) 8 | 9 | # Let's get the decision ensemble. One could use the wrapping function 10 | # model2DE(), but we will run each function separately. 
11 | 12 | # Get the raw decision ensemble 13 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 14 | , target = iris$Species, classPos = "setosa" 15 | , times = 1 # number of bootstraps, here just one 16 | , discretize = FALSE) # we will discretize outside for the example 17 | summary(de) 18 | # exec = the decision ensemble 19 | # partitions = list of sample indexes for bootstrapping 20 | # if we had done discretization, the new data would be in data_ctg 21 | de <- de$exec 22 | 23 | # Discretize variables in 3 categories - optional 24 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 25 | , K = 3, classPos = "setosa", mode = "data") 26 | data_ctg <- de$data_ctg 27 | de <- de$rules_ctg 28 | 29 | # Homogenize the decision ensemble 30 | de <- de[, condition := sapply(condition, function(x) { 31 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 32 | })] 33 | de <- unique( 34 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 35 | ) 36 | 37 | # Calculate decision metrics; we don't need the importances yet since we will 38 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 39 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 40 | , classPos = "setosa", importances = FALSE) 41 | de <- de[de_met, on = "condition"] 42 | 43 | # Pruning - optional 44 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 45 | , classPos = "setosa") 46 | 47 | # Decision importances 48 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 49 | , classPos = "setosa") 50 | 51 | # Filter out decisions with the lowest importance: min_imp = the minimal 52 | # importance in the decision ensemble compared to the maximal one. 53 | # E.g., if min_imp = 0.5, then at least all decisions with an 54 | # importance > 0.5*max(importance) will be kept. 55 | # This ensures that we don't throw out too much. 56 | # Since the decision ensemble is quite small, we don't need to filter much... 57 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 58 | 59 | # Get the network 60 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 61 | , classPos = "setosa") 62 | 63 | # Plot the feature importance/influence and the network 64 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 65 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 66 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/decisionImportance.R: -------------------------------------------------------------------------------- 1 | #' Compute the importance of decisions 2 | #' 3 | #' This function computes the importance of decisions. 4 | #' 5 | #' 6 | #' @param rules a data.frame with a column "condition" or a vector with name "condition". 7 | #' @param data data from which to get the decision support. 8 | #' @param target response variable. 9 | #' @param classPos if classification, the positive class. 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' 14 | #' @return the data.frame passed in rules with the gain and importance of each decision. 
15 | #' 16 | #' @export 17 | decisionImportance <- function(rules, data, target, classPos = NULL, 18 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 19 | 20 | 21 | # numerise target 22 | if (is.numeric(target) == FALSE) { 23 | target <- ifelse(target == classPos, 1, 0) 24 | type <- "classification" 25 | } else { 26 | type <- "reg" 27 | } 28 | 29 | 30 | if ("data.table" %in% class(data)) { 31 | data <- as.data.frame(data) 32 | } 33 | 34 | pred_null <- mean(target, na.rm = TRUE) 35 | 36 | 37 | if (in_parallel == TRUE) { 38 | if (is.null(cluster) == TRUE) { 39 | message("Initiate parallelisation ... ") 40 | cluster <- makeCluster(n_cores) 41 | on.exit(stopCluster(cluster)) 42 | } 43 | tmp <- parApply( 44 | cl = cluster, rules, MARGIN = 1, FUN = importanceSingleRule, 45 | data = data, target = target, type = type, pred_null = pred_null 46 | ) 47 | } else { 48 | tmp <- apply(rules, 49 | MARGIN = 1, FUN = importanceSingleRule, 50 | data = data, target = target, type = type, pred_null = pred_null 51 | ) 52 | } 53 | 54 | 55 | rules <- rules[, "gain" := tmp][, "imp" := gain * support] 56 | 57 | if ("n" %in% colnames(rules)) { 58 | rules <- rules[, "n" := as.numeric(n)][order(-n * imp), ] 59 | } else { 60 | setorder(rules, -imp) 61 | } 62 | 63 | 64 | return(rules) 65 | } 66 | 67 | 68 | ###################################################### 69 | importanceSingleRule <- 70 | function(rule, data, target, type = "reg", pred_null = NULL) { 71 | ruleExec <- paste("which(", rule["condition"], ")") 72 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 73 | ixMatch <- eval(parse(text = ruleExec)) 74 | 75 | if (length(ixMatch) == 0) { 76 | return(0) 77 | } 78 | ys <- target[ixMatch] 79 | 80 | if (type == "reg") { 81 | err_null <- sum((pred_null - ys)^2) / length(ys) 82 | } else { 83 | err_null <- exp(mean(sapply(ys, function(x, pred) { 84 | x * log(pred) + (1 - x) * log(1 - pred) 85 | }, pred = pred_null))) 86 | err_null <- 1 - err_null 87 | } 88 | 89 | g <- 1 - as.numeric(rule["err"]) / err_null 90 | return(g) 91 | } 92 | -------------------------------------------------------------------------------- /R/decisionImportance.R: -------------------------------------------------------------------------------- 1 | #' Compute the importance of decisions 2 | #' 3 | #' This function computes the importance of decisions. 4 | #' 5 | #' 6 | #' @param rules a data.frame with a column "condition" or a vector with name "condition". 7 | #' @param data data from which to get the decision support. 8 | #' @param target response variable. 9 | #' @param classPos if classification, the positive class. 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' 14 | #' @return the data.frame passed in rules with the gain and importance of each decision. 
15 | #' @example examples/iris_each_function.R 16 | #' 17 | #' @export 18 | decisionImportance <- function(rules, data, target, classPos = NULL, 19 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 20 | 21 | 22 | # numerise target 23 | if (is.numeric(target) == FALSE) { 24 | target <- ifelse(target == classPos, 1, 0) 25 | type <- "classification" 26 | } else { 27 | type <- "reg" 28 | } 29 | 30 | 31 | if ("data.table" %in% class(data)) { 32 | data <- as.data.frame(data) 33 | } 34 | 35 | pred_null <- mean(target, na.rm = TRUE) 36 | 37 | 38 | if (in_parallel == TRUE) { 39 | if (is.null(cluster) == TRUE) { 40 | message("Initiate parallelisation ... ") 41 | cluster <- makeCluster(n_cores) 42 | on.exit(stopCluster(cluster)) 43 | } 44 | tmp <- parApply( 45 | cl = cluster, rules, MARGIN = 1, FUN = importanceSingleRule, 46 | data = data, target = target, type = type, pred_null = pred_null 47 | ) 48 | } else { 49 | tmp <- apply(rules, 50 | MARGIN = 1, FUN = importanceSingleRule, 51 | data = data, target = target, type = type, pred_null = pred_null 52 | ) 53 | } 54 | 55 | 56 | rules <- rules[, "gain" := tmp][, "imp" := gain * support] 57 | 58 | if ("n" %in% colnames(rules)) { 59 | rules <- rules[, "n" := as.numeric(n)][order(-n * imp), ] 60 | } else { 61 | setorder(rules, -imp) 62 | } 63 | 64 | 65 | return(rules) 66 | } 67 | 68 | 69 | ###################################################### 70 | importanceSingleRule <- 71 | function(rule, data, target, type = "reg", pred_null = NULL) { 72 | ruleExec <- paste("which(", rule["condition"], ")") 73 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 74 | ixMatch <- eval(parse(text = ruleExec)) 75 | 76 | if (length(ixMatch) == 0) { 77 | return(0) 78 | } 79 | ys <- target[ixMatch] 80 | 81 | if (type == "reg") { 82 | err_null <- sum((pred_null - ys)^2) / length(ys) 83 | } else { 84 | err_null <- exp(mean(sapply(ys, function(x, pred) { 85 | x * log(pred) + (1 - x) * log(1 - pred) 86 | }, pred = pred_null))) 87 | err_null <- 1 - err_null 88 | } 89 | 90 | g <- 1 - as.numeric(rule["err"]) / err_null 91 | return(g) 92 | } 93 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeData.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset 2 | #' 3 | #' This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 4 | #' Names of categories can optionally be given in knames. 5 | #' 6 | #' @param data data to discretize. 7 | #' @param K numeric, number of categories (default: K = 2). 8 | #' @param features vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized. 9 | #' @param knames optional: character vector of the same length as K, containing the ordered names for categories. 10 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 11 | #' @return Data with discretized variables.
12 | #' @export 13 | discretizeData <- function(data, K = 2, features = NULL, knames = NULL, return_split = FALSE) { 14 | 15 | # set category names 16 | if (is.null(knames) == TRUE) { 17 | if (K == 2) { 18 | knames <- c("min", "Low", "High") 19 | } else if (K == 3) { 20 | knames <- c("min", "Low", "Medium", "High") 21 | } else if (K == 4) { 22 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 23 | } else if (K == 5) { 24 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 25 | } else { 26 | knames <- c("min", paste("L", seq(1:K), sep = "")) 27 | } 28 | } else if (length(knames) == K) { 29 | knames <- c("min", knames) 30 | } 31 | 32 | # get column numbers - not the most efficient but later easier 33 | if (is.character(features) == TRUE) { 34 | colNb <- which(colnames(data) %in% features) 35 | } else if (is.numeric(features) == TRUE) { 36 | colNb <- features 37 | } else { 38 | colNb <- which(sapply(data, function(x) { 39 | (length(unique(x)) > 2 & is.numeric(x)) 40 | })) 41 | } 42 | 43 | # discretize 44 | data_ctg <- as.data.table(data) 45 | tmp <- lapply(data_ctg[, (colNb), with = FALSE], discretizeVector, K = K, knames = knames, return_all = TRUE) 46 | names(tmp) <- as.character(colNb) 47 | for (j in colNb) { 48 | set(data_ctg, i = NULL, j = j, tmp[[as.character(j)]]$newV) 49 | set(data_ctg, i = which(data_ctg[, j, with = FALSE] == ""), j = j, value = NA) 50 | } 51 | data_ctg <- data_ctg[, c(colNb) := lapply(.SD, as.factor), .SDcols = colNb] 52 | 53 | 54 | # get the split table if asked 55 | if (return_split == TRUE) { 56 | splitV <- list() 57 | for (i in 1:length(tmp)) { 58 | splitV[[i]] <- tmp[[i]]$splitV 59 | } 60 | names(splitV) <- names(tmp) 61 | 62 | splitV_med <- lapply(names(splitV), FUN = getMedian, data = data, splitV = splitV) 63 | names(splitV_med) <- names(splitV) 64 | 65 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 66 | } 67 | 68 | return(data_ctg) 69 | } 70 | 71 | 72 | ############## 73 | getMedian <- function(var, data, splitV) { 74 | dataVar <- data[[as.numeric(var)]] 75 | med <- list() 76 | for (i in 2:length(splitV[[var]])) { 77 | iX <- which(dataVar >= splitV[[var]][i - 1] & dataVar <= splitV[[var]][i]) 78 | med <- c(med, median(dataVar[iX], na.rm = TRUE)) 79 | } 80 | names(med) <- names(splitV[[var]])[-1] 81 | return(med) 82 | } 83 | -------------------------------------------------------------------------------- /R/aggregateTaxa.R: -------------------------------------------------------------------------------- 1 | #' Aggregate taxa according to a measure. 2 | #' 3 | #' This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 4 | #' 5 | #' @param taxa taxa should be a data.frame with the feature name and all its other coarser/finer levels (columns: 'Feature', 'f', 'g', 's') 6 | #' @param features if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked. 7 | #' @param weights a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only. 8 | #' @param thr numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only.
If NULL, the median of weights is used. 9 | #' @param type character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine', then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e., there is a unique finer level for that coarser level. If 'both', both aggregation steps are sequentially performed. 10 | #' 11 | #' @return A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 12 | #' 13 | #' @export 14 | aggregateTaxa <- function(taxa, features = NULL, weights = NULL, thr = NULL, type = "coarse") { 15 | if (!is.null(weights) && !("weight" %in% colnames(weights))) { 16 | wI <- which(sapply(weights, is.numeric)) 17 | colnames(weights)[wI] <- "weight" 18 | } 19 | 20 | #if (is.vector(taxa) == TRUE || !("Feature" %in% colnames(taxa))) { 21 | # taxa <- elongateTaxa(features = features, taxa = taxa) 22 | #} 23 | 24 | if (is.null(thr) == TRUE) { 25 | thr <- median(weights$weight) 26 | } 27 | 28 | ### wrapper of the 2 collapsing functions, with the option of performing both 29 | if (type == "coarse") { 30 | taxa <- aggregateTaxa_coarse(taxa, weights, thr) 31 | } 32 | 33 | if (type == "fine") { 34 | taxa <- aggregateTaxa_fine(taxa) 35 | } 36 | 37 | if (type == "both") { 38 | taxa_d <- aggregateTaxa_coarse(taxa, weights, thr) 39 | taxa_u <- aggregateTaxa_fine(subset(taxa, Feature %in% taxa_d$newFeature)) 40 | 41 | taxa <- unique(taxa_u[, list(Feature, newFeature)])[, coarseFeature := newFeature][, newFeature := NULL] 42 | taxa <- merge(taxa, taxa_d[, list(newFeature, Feature, weight, newWeight)], by.x = "coarseFeature", by.y = "newFeature") 43 | 44 | # record how many times a newFeature has been attributed and who was changed 45 | taxa <- taxa[, n := .N, by = "newFeature"][, "changed" := "Unchanged"][ 46 | n > 1 & newFeature == Feature, changed := "Recipient" 47 | ][ 48 | newFeature != coarseFeature & coarseFeature == Feature, changed := "Upgraded" 49 | ][ 50 | newFeature == coarseFeature & coarseFeature != Feature, changed := "Downgraded" 51 | ][ 52 | newFeature != coarseFeature & coarseFeature != Feature, changed := "Down&Upgraded" 53 | ][ 54 | , changed := as.factor(changed) 55 | ] 56 | } 57 | 58 | return(taxa) 59 | } 60 | -------------------------------------------------------------------------------- /man/plotNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotNetwork.R 3 | \name{plotNetwork} 4 | \alias{plotNetwork} 5 | \title{Plot an interaction variable network.} 6 | \usage{ 7 | plotNetwork( 8 | decision_ensemble, 9 | path_length = Inf, 10 | layout = "stress", 11 | colour_edge_low = "#E69F00", 12 | colour_edge_mid = "grey87", 13 | colour_edge_high = "#0072B2", 14 | colour_node_low = "#E69F00", 15 | colour_node_mid = "grey87", 16 | colour_node_high = "#0072B2", 17 | text_size = 4, 18 | hide_isolated_nodes = TRUE, 19 | seed = 0 20 | ) 21 | } 22 | \arguments{ 23 | \item{decision_ensemble}{stable decision ensemble: list with edges, nodes, etc.} 24 | 25 | \item{path_length}{maximal number of edges between 2 nodes, default = Inf.} 26 | 27 | \item{layout}{network layout, default is 'stress' (see ggraph package)} 28 | 29 | \item{colour_edge_low}{colour for the negative interaction influence values (i.e., edges; default: yellowish)} 30 | 31 | \item{colour_edge_mid}{colour for the null interaction influence values (i.e., edges; default:
light grey)} 32 | 33 | \item{colour_edge_high}{colour for the positive interaction influence values (i.e., edges; default: blue)} 34 | 35 | \item{colour_node_low}{colour for the negative feature influence values (i.e., nodes; default: yellowish)} 36 | 37 | \item{colour_node_mid}{colour for the null feature influence values (i.e., nodes; default: light grey)} 38 | 39 | \item{colour_node_high}{colour for the positive feature influence values (i.e., nodes; default: blue)} 40 | 41 | \item{text_size}{size of node labels.} 42 | 43 | \item{hide_isolated_nodes}{logical, default = TRUE (= nodes without any edge are not shown).} 44 | 45 | \item{seed}{the seed to use for generating the network.} 46 | } 47 | \value{ 48 | a ggraph object 49 | } 50 | \description{ 51 | Returns a plot of nodes and edges. Plots are created with the ggraph and igraph packages. 52 | } 53 | \examples{ 54 | library(randomForest) 55 | library(caret) 56 | 57 | # import data and fit model 58 | data(iris) 59 | mod <- randomForest(Species ~ ., data = iris) 60 | 61 | # Fit a decision ensemble to predict the species setosa (vs versicolor and 62 | # virginica): no regularization (no decision pruning, discretization, 63 | # bootstrapping, or decision filtering) 64 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 65 | , target = iris$Species, classPos = "setosa" 66 | , filter = FALSE, discretize = FALSE, prune = FALSE) 67 | 68 | # Only decision pruning (default = TRUE) and discretization (default in 2 69 | # categories, we want 3 categories so we change K); no bootstrapping or 70 | # decision filtering. 71 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 72 | , target = iris$Species, classPos = "setosa" 73 | , filter = FALSE, discretize = TRUE, K = 3) 74 | 75 | # idem but run it in parallel on 2 threads 76 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 77 | , target = iris$Species, classPos = "setosa" 78 | , filter = FALSE, discretize = TRUE, K = 3 79 | , in_parallel = TRUE, n_cores = 2) 80 | 81 | # Plot the decision ensemble: 82 | # Plants from the setosa species have small petals and narrow, long sepals. 83 | plotFeatures(endo_setosa, levels_order = c("Low", "Medium", "High")) 84 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = "fr") 85 | } 86 | -------------------------------------------------------------------------------- /R/discretizeData.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset 2 | #' 3 | #' This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 4 | #' Names of categories can optionally be given in knames. 5 | #' 6 | #' @param data data to discretize. 7 | #' @param K numeric, number of categories (default: K = 2). 8 | #' @param features vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized. 9 | #' @param knames optional: character vector of the same length as K, containing the ordered names for categories. 10 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 11 | #' @return Data with discretized variables.
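#' @examples
#' # Minimal usage sketch (added for illustration; assumes only base R and the
#' # iris dataset): discretize the numeric iris features into K = 3
#' # quantile-based categories.
#' data(iris)
#' iris_ctg <- discretizeData(iris[, -5], K = 3)
#' summary(iris_ctg)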
12 | #' @export 13 | discretizeData <- function(data, K = 2, features = NULL, knames = NULL, return_split = FALSE) { 14 | 15 | # set category names 16 | if (is.null(knames) == TRUE) { 17 | if (K == 2) { 18 | knames <- c("min", "Low", "High") 19 | } else if (K == 3) { 20 | knames <- c("min", "Low", "Medium", "High") 21 | } else if (K == 4) { 22 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 23 | } else if (K == 5) { 24 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 25 | } else { 26 | knames <- c("min", paste("L", seq(1:K), sep = "")) 27 | } 28 | } else if (length(knames) == K) { 29 | knames <- c("min", knames) 30 | } 31 | 32 | # get column numbers - not the most efficient but later easier 33 | if (is.character(features) == TRUE) { 34 | colNb <- which(colnames(data) %in% features) 35 | } else if (is.numeric(features) == TRUE) { 36 | colNb <- features 37 | } else { 38 | colNb <- which(sapply(data, function(x) { 39 | (length(unique(x)) > 2 & is.numeric(x)) 40 | })) 41 | } 42 | 43 | # discretize 44 | data_ctg <- as.data.table(data) 45 | tmp <- lapply(data_ctg[, (colNb), with = FALSE], discretizeVector, K = K, knames = knames, return_all = TRUE) 46 | names(tmp) <- as.character(colNb) 47 | for (j in colNb) { 48 | set(data_ctg, i = NULL, j = j, tmp[[as.character(j)]]$newV) 49 | set(data_ctg, i = which(data_ctg[, j, with = FALSE] == ""), j = j, value = NA) 50 | } 51 | data_ctg <- data_ctg[, c(colNb) := lapply(.SD, as.factor), .SDcols = colNb] 52 | 53 | 54 | # get the split table if asked 55 | if (return_split == TRUE) { 56 | splitV <- list() 57 | for (i in 1:length(tmp)) { 58 | splitV[[i]] <- tmp[[i]]$splitV 59 | } 60 | names(splitV) <- names(tmp) 61 | 62 | splitV_med <- lapply(names(splitV), FUN = getMedian, data = data, splitV = splitV) 63 | names(splitV_med) <- names(splitV) 64 | 65 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 66 | } 67 | 68 | return(data_ctg) 69 | } 70 | 71 | 72 | ############## 73 | getMedian <- function(var, data, splitV) { 74 | dataVar <- data[[as.numeric(var)]] 75 | med <- list() 76 | for (i in 2:length(splitV[[var]])) { 77 | iX <- which(dataVar > splitV[[var]][i - 1] & dataVar <= splitV[[var]][i]) 78 | med <- c(med, median(dataVar[iX], na.rm = TRUE)) 79 | } 80 | names(med) <- names(splitV[[var]])[-1] 81 | 82 | if (is.na(med[[1]])){ 83 | med[[1]] = min(dataVar) 84 | } 85 | return(med) 86 | } 87 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/aggregateTaxa.R: -------------------------------------------------------------------------------- 1 | #' Aggregate taxa according to a measure. 2 | #' 3 | #' This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 4 | #' 5 | #' @param taxa taxa should be a vector with the long format of the features of the taxa table with taxonomic levels as columns - cf taxa2vector; or a data.frame with the feature name and all its other coarser/finer levels. 6 | #' @param features if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked. 7 | #' @param weights a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only.
8 | #' @param thr numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only. If NULL, the median of weights is used. 9 | #' @param type character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine', then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e., there is a unique finer level for that coarser level. If 'both', both aggregation steps are sequentially performed. 10 | #' 11 | #' @return A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 12 | #' 13 | #' @export 14 | aggregateTaxa <- function(taxa, features = NULL, weights = NULL, thr = NULL, type = "coarse") { 15 | if (!is.null(weights) && !("weight" %in% colnames(weights))) { 16 | wI <- which(sapply(weights, is.numeric)) 17 | colnames(weights)[wI] <- "weight" 18 | } 19 | 20 | if (is.vector(taxa) == TRUE || !("Feature" %in% colnames(taxa))) { 21 | taxa <- elongateTaxa(features = features, taxa = taxa) 22 | } 23 | 24 | if (is.null(thr) == TRUE) { 25 | thr <- median(weights$weight) 26 | } 27 | 28 | ### wrapper of the 2 collapsing functions, with the option of performing both 29 | if (type == "coarse") { 30 | taxa <- aggregateTaxa_coarse(taxa, weights, thr) 31 | } 32 | 33 | if (type == "fine") { 34 | taxa <- aggregateTaxa_fine(taxa) 35 | } 36 | 37 | if (type == "both") { 38 | taxa_d <- aggregateTaxa_coarse(taxa, weights, thr) 39 | taxa_u <- aggregateTaxa_fine(subset(taxa, Feature %in% taxa_d$newFeature)) 40 | 41 | taxa <- unique(taxa_u[, .(Feature, newFeature)])[, coarseFeature := newFeature][, newFeature := NULL] 42 | taxa <- merge(taxa, taxa_d[, .(newFeature, Feature, weight, newWeight)], by.x = "coarseFeature", by.y = "newFeature") 43 | 44 | # record how many times a newFeature has been attributed and who was changed 45 | taxa <- taxa[, n := .N, by = "newFeature"][, "changed" := "Unchanged"][ 46 | n > 1 & newFeature == Feature, changed := "Recipient" 47 | ][ 48 | newFeature != coarseFeature & coarseFeature == Feature, changed := "Upgraded" 49 | ][ 50 | newFeature == coarseFeature & coarseFeature != Feature, changed := "Downgraded" 51 | ][ 52 | newFeature != coarseFeature & coarseFeature != Feature, changed := "Down&Upgraded" 53 | ][ 54 | , changed := as.factor(changed) 55 | ] 56 | } 57 | 58 | return(taxa) 59 | } 60 | -------------------------------------------------------------------------------- /examples/iris_bootstraps.R: -------------------------------------------------------------------------------- 1 | library(randomForest) 2 | library(caret) 3 | 4 | # import data and fit model 5 | data(iris) 6 | mod <- randomForest(Species ~ ., data = iris) 7 | 8 | # Get decision ensemble with bootstrapping.
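# (Note added for clarity: each bootstrap re-extracts and re-scores decisions on
# a resample of the data; stabilitySelection() then keeps the decisions that
# were selected in a sufficient fraction of bootstraps.)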
9 | 10 | # Run 1 bootstrap after the other (times = 2 bootstraps) 11 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 12 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 13 | , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE) 14 | 15 | # Same but use different sample weights for bootstrapping 16 | n_setosa <- sum(iris$Species == "setosa") 17 | n_samp <- length(iris$Species) 18 | samp_weight <- round( 19 | ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp) 20 | , digits = 2) 21 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 22 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 23 | , times = 2, sample_weight = samp_weight 24 | , in_parallel = TRUE, n_cores = 2, filter = FALSE) 25 | 26 | # Run the bootstraps in parallel 27 | # First do all steps before bootstrapping 28 | preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 29 | , target = iris$Species, classPos = "setosa", times = 2 30 | , discretize = TRUE, in_parallel = FALSE) 31 | 32 | # Remove the special characters from column names 33 | colnames(preclu$data) <- compatibleNames(colnames(preclu$data)) 34 | 35 | # Parameters for clustermq: can also run on HPC environment 36 | library(clustermq) 37 | options(clustermq.scheduler = "multiprocess") 38 | # ... and run in parallel on each bootstrap 39 | # (preclu$partitions = list of sample indexes for each bootstrap) 40 | endo_setosa <- Q(model2DE_cluster 41 | , partition = preclu$partitions 42 | , export = list(data = preclu$data 43 | , target = iris$Species 44 | , exec = preclu$exec 45 | , classPos = "setosa" 46 | , prune = TRUE, filter = FALSE 47 | , maxDecay = 0.05 # values needed for maxDecay and typeDecay 48 | , typeDecay = 2 # here default ones, see pruneDecisions() 49 | , in_parallel = FALSE # can parallelize within each bootstrap! 50 | ) 51 | , n_jobs = 2 # max number of bootstraps that can be run in parallel 52 | , pkgs = c("data.table", "parallel", "caret", "stringr", "scales" 53 | , "dplyr", "inTrees", "endoR") 54 | , log_worker = FALSE # to keep a log of the runs, e.g., if it fails 55 | ) 56 | 57 | 58 | # Stability selection 59 | # First we can look at the effect of the alpha parameter on selection; 60 | # alpha = expected number of false decisions 61 | alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10) 62 | , data = preclu$data) 63 | alphas$summary_table 64 | 65 | # perform stability selection with alpha = 7 66 | de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7) 67 | 68 | # Plot the decision ensemble: 69 | # Plants from the setosa species have small petals and narrow, long sepals. 70 | plotFeatures(de_final, levels_order = c("Low", "Medium", "High")) 71 | 72 | # there is no interaction between variables (all decisions with len = 1, 73 | # the number of variables in the rules) 74 | de_final$rules_summary 75 | # hence the network would be empty and couldn't be plotted... 76 | # plotNetwork(de_final, hide_isolated_nodes = FALSE) 77 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/model2DE_resampling.R: -------------------------------------------------------------------------------- 1 | #' Run model2DE on several bootstrap resamples. 2 | #' 3 | #' Wrapper around the model2DE function to run it on several bootstrap resamples. 4 | #' 5 | #' @param ... arguments to be passed to the model2DE function.
6 | #' @param times number of bootstraps 7 | #' @param p fraction of data to resample. 8 | #' @param sample_weight numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classpos argument). 9 | #' @param in_parallel if TRUE, the function is run in parallel 10 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use 11 | #' @param cluster the cluster to use to run the function in parallel 12 | #' @return A list with the row numbers of partitioned data, the rules originally extracted from the model, a list with results from each bootstrap (use stabilitySelection to obtain the stable decision ensemble). 13 | #' @import data.table 14 | #' @export 15 | model2DE_resampling <- function(model, model_type, data, target, classPos = NULL, 16 | times = 10, p = .5, sample_weight = NULL, 17 | ntree = "all", maxdepth = Inf, 18 | dummy_var = NULL, 19 | prune = TRUE, maxDecay = 0.05, typeDecay = 2, 20 | discretize = TRUE, K = 2, mode = "data", 21 | filter = TRUE, min_imp = 0.9 22 | # , aggregate_taxa = FALSE, taxa = NULL 23 | # , alpha_error = 1, pi_thr = 0.7 24 | , seed = 0, in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 25 | 26 | 27 | ### Get the partitions and rules 28 | tmp <- preCluster( 29 | model = model, model_type = model_type, data = data, target = target, 30 | times = times, p = p, sample_weight = sample_weight, classPos = classPos, 31 | ntree = ntree, maxdepth = maxdepth, dummy_var = dummy_var, 32 | discretize = discretize, K = K, mode = mode, 33 | seed = seed, 34 | in_parallel = in_parallel, n_cores = n_cores 35 | ) 36 | exec <- tmp$exec 37 | partitions <- tmp$partitions 38 | data <- tmp$data 39 | 40 | 41 | ### Get a network for each subset 42 | # define classPos if it has not been passed 43 | if (is.character(target) && is.null(classPos) == TRUE) { 44 | classPos <- names(which.max(table(target))) 45 | cat("Positive class:", classPos, "\n") 46 | } 47 | resamp <- list() 48 | k <- 1 49 | for (ix in partitions) { 50 | res <- model2DE( 51 | data = data[ix, ], target = target[ix], 52 | exec = exec, 53 | classPos = classPos, 54 | prune = prune, maxDecay = maxDecay, typeDecay = typeDecay, 55 | filter = filter, 56 | in_parallel = in_parallel, n_cores = n_cores, 57 | light = TRUE 58 | ) 59 | 60 | # get the decisions: depends on options so we'll just take whatever was produced before getting nodes! 61 | i_rules <- which(names(res) == "nodes") - 1 62 | 63 | res <- list( 64 | "pdecisions" = res$n_decisions, 65 | "rules" = res[[i_rules]], 66 | "nodes_agg" = res$nodes_agg, "edges_agg" = res$edges_agg 67 | ) 68 | 69 | resamp[[k]] <- res 70 | k <- k + 1 71 | } 72 | 73 | 74 | return(list("partitions" = partitions, "exec" = exec, "data" = data, "resamp" = resamp)) 75 | } 76 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/evaluateAlpha.R: -------------------------------------------------------------------------------- 1 | #' Calculate the number of decisions and of predicted samples from decision ensembles obtained with different alpha values 2 | #' 3 | #' The aim is to help pick an alpha that will result in a decision ensemble able to predict most samples. Performs stability selection for each of the given alpha values. The number of decisions and of samples that follow decisions are also calculated.
4 | #' The procedure is adapted from Meinshausen and Bühlmann (2010): the best decisions from each bootstrap are pre-selected, and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 5 | #' The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 6 | #' @param rules list of bootstrap results 7 | #' @param alphas numeric vector of expected numbers of false positive decisions selected, to evaluate (default = c(5, 10, 15, 20, 30, 50, 75)). 8 | #' @param decision_ensembles should the decision ensembles be returned? 9 | #' @param pi_thr fraction of bootstraps in which a decision must have been selected to be included in the stable decision ensemble (default = 0.7). 10 | #' @return A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importance and influence in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes. 11 | #' @export 12 | 13 | evaluateAlpha <- function(rules, alphas = c(5, 10, 15, 20, 30, 50, 75), pi_thr = 0.7, 14 | data = NULL, decision_ensembles = TRUE, 15 | aggregate_taxa = FALSE, taxa = NULL) { 16 | check_sampl <- data.frame(alpha = numeric(), n_decision = numeric(), n_samples = numeric()) 17 | if ("resamp" %in% names(rules)) { 18 | minN <- pi_thr * length(rules$resamp) 19 | } else { 20 | minN <- pi_thr * length(rules) 21 | } 22 | 23 | if (is.null(data) & !("data" %in% names(rules))) { 24 | warning("Please provide the data used to create the decision ensemble (original if not discretized, precluster$data if discretized).") 25 | return() 26 | } else if ("data" %in% names(rules)) { 27 | data <- rules$data 28 | } 29 | 30 | res <- list() 31 | check_sampl <- data.frame(alpha = numeric(), n_dec = numeric(), n_samp = numeric()) 32 | 33 | for (i in 1:length(alphas)) { 34 | tmp <- stabilitySelection(rules = rules, alpha_error = alphas[i], pi_thr = pi_thr, aggregate_taxa = aggregate_taxa, taxa = taxa) 35 | if (decision_ensembles == TRUE) { 36 | res[[as.character(alphas[i])]] <- tmp 37 | } 38 | 39 | tmp <- subset(tmp$rules_summary, inN >= minN) 40 | 41 | cond <- tmp$condition 42 | if (length(cond) == 0) { 43 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = 0, n_samp = 0) 44 | } else { 45 | cond <- str_replace_all(cond, pattern = "X", replacement = "data") 46 | cond <- paste0("which(", cond, ")") 47 | pred_ix <- lapply(cond, function(x) { 48 | eval(parse(text = x)) 49 | }) %>% 50 | unlist() %>% 51 | unique() 52 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = length(cond), n_samp = length(pred_ix)) 53 | } 54 | } 55 | if (decision_ensembles == TRUE) { 56 | res[["summary_table"]] <- check_sampl 57 | return(res) 58 | } else { 59 | return(check_sampl) 60 | } 61 | } 62 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/getThresholds.R: -------------------------------------------------------------------------------- 1 | #' Get thresholds to discretize variables according to their split in tree ensemble models 2 | #' @export 3 | 4 | getThresholds <- function(conditions, data, Kmax = 2) { 5 | ### a little slow, may be improved 6 | 7 | # get all individual sub conditions per variable 8 | var_cond <-
unlist(lapply(conditions, function(x) { 9 | unlist(strsplit(x, split = " & ")) 10 | })) 11 | # make it to a data.frame with variable and thresholds used 12 | var_cond <- data.frame( 13 | var = as.numeric(str_extract(var_cond, pattern = "[:digit:]*(?=\\])")), 14 | thr = as.numeric(str_extract(var_cond, pattern = "-?[:digit:]*\\.?[:digit:]*$")) 15 | ) 16 | 17 | # and now transform to a list 18 | var_split <- lapply(sort(unique(var_cond$var)), 19 | function(x, var_cond, data) { 20 | thr <- unlist(subset(var_cond, var == x, select = thr)) 21 | var_v <- data[[x]] 22 | thr[thr < min(var_v)] <- min(var_v) 23 | thr[thr > max(var_v)] <- max(var_v) 24 | return(thr) 25 | }, 26 | var_cond = var_cond, data = data 27 | ) 28 | names(var_split) <- as.character(sort(unique(var_cond$var))) 29 | 30 | # remove non-numeric variables 31 | are_num <- as.character(which(sapply(data, function(x) { 32 | length(unique(x)) > 2 33 | }))) 34 | are_num <- are_num[are_num %in% names(var_split)] 35 | var_split <- var_split[are_num] 36 | 37 | # get thresholds for discretization 38 | if (Kmax == 2) { 39 | new_thr <- lapply(var_split, getMode) 40 | } else { 41 | new_thr <- lapply(var_split, getModes_all) 42 | new_thr <- lapply(new_thr, function(x, Kmax) { 43 | x[1:min(Kmax - 1, length(x))] 44 | }, Kmax = Kmax) 45 | } 46 | 47 | # remove thresholds out of range 48 | colNb <- as.integer(names(new_thr)) 49 | new_thr <- lapply(colNb, function(x, data, thr) { 50 | list( 51 | "var_v" = data[[x]], 52 | "thr" = thr[[as.character(x)]] 53 | ) 54 | }, data = data, thr = new_thr) 55 | 56 | new_thr <- lapply(new_thr, function(x) { 57 | x$thr <- x$thr[x$thr >= min(x$var_v) & x$thr <= max(x$var_v)] 58 | if (length(x$thr) == 0) x$thr <- min(x$var_v) 59 | return(x) 60 | }) 61 | names(new_thr) <- as.character(colNb) 62 | 63 | return(new_thr) 64 | } 65 | 66 | 67 | ########## 68 | 69 | getMode <- function(x) { 70 | if (length(x) == 1) { 71 | return(x) 72 | } else { 73 | tmp <- density(x) 74 | return(tmp$x[which.max(tmp$y)]) 75 | } 76 | } 77 | 78 | 79 | getModes_all <- function(var) { 80 | # copied from the pastecs R-package: 81 | # https://github.com/phgrosjean/pastecs/blob/master/R/turnpoints.R 82 | # (I just removed the unnecessary bits.)
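# Added explanatory note: the code below locates the turning points of the
# kernel density of `var` and returns the x-positions of its local maxima (the
# modes), ordered from the highest to the lowest peak.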
83 | 84 | if (length(var) == 1) { 85 | return(var) 86 | } 87 | 88 | x <- as.vector(density(var)$y) 89 | n <- length(x) 90 | diffs <- c(x[1] - 1, x[1:(n - 1)]) != x 91 | 92 | uniques <- x[diffs] 93 | 94 | n2 <- length(uniques) 95 | poss <- (1:n)[diffs] 96 | exaequos <- c(poss[2:n2], n + 1) - poss - 1 97 | 98 | m <- n2 - 2 99 | ex <- matrix(uniques[1:m + rep(3:1, rep(m, 3)) - 1], m) 100 | peaks <- c(FALSE, apply(ex, 1, max, na.rm = TRUE) == ex[, 2], FALSE) 101 | tppos <- (poss + exaequos)[peaks] 102 | 103 | # Now, order the peaks and return the x values 104 | y_peaks <- x[tppos] 105 | x_peaks <- density(var)$x[tppos] 106 | 107 | peaks_order <- order(y_peaks, decreasing = TRUE) 108 | # y_peaks <- y_peaks[peaks_order] 109 | # x_peaks <- x_peaks[peaks_order] 110 | return(x_peaks[peaks_order]) 111 | } 112 | -------------------------------------------------------------------------------- /vignettes/iris_multiclass.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "endoR: multiclass" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{endoR: multiclass} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>", 14 | fig.width = 7, 15 | fig.height = 2.5 16 | ) 17 | ``` 18 | 19 | This vignette illustrates how to interpret multiclass models with endoR. 20 | We will use the iris data for this purpose: the species is being predicted 21 | (= target) using the length and width of petals and sepals (= features). 22 | 23 | # Preamble 24 | ```{r setup, message = FALSE, warning=FALSE} 25 | library(endoR) 26 | library(randomForest) 27 | library(tidyverse) 28 | library(caret) 29 | library(ggpubr) 30 | library(ggraph) 31 | ``` 32 | 33 | # Data 34 | ```{r} 35 | data(iris) 36 | summary(iris) 37 | ``` 38 | 39 | # Fit model 40 | ```{r} 41 | set.seed(1313) 42 | mod <- randomForest(Species ~ ., data = iris) 43 | mod 44 | ``` 45 | 46 | # Fit endoR for the "setosa" species 47 | 48 | We will only prune (prune = TRUE) and discretize decisions into K = 3 levels. 49 | ```{r} 50 | endo_setosa <- model2DE(model = mod, model_type = 'rf' 51 | , data = select(iris, -Species), target = iris$Species 52 | , classPos = 'setosa' # our focal class 53 | , filter = FALSE 54 | 55 | # we discretize the numerical features into K = 3 categories 56 | , discretize = TRUE, K = 3) 57 | ``` 58 | 59 | Plants from the setosa species have small petals and narrow, long sepals. 60 | ```{r} 61 | plotFeatures(endo_setosa, levels_order = c('Low', 'Medium', 'High')) 62 | 63 | # The font warnings are due to Windows; no worries. 64 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = 'fr') 65 | ``` 66 | 67 | The only interaction used for predictions is the one of sepal length (High = 68 | long) and width (Low = narrow). We can also use ggplot2 `theme` to format the 69 | plot, e.g., put the legend boxes next to each other. 70 | ```{r} 71 | plotNetwork(endo_setosa, hide_isolated_nodes = TRUE, layout = 'fr')+ 72 | theme(legend.box = "horizontal") 73 | ``` 74 | 75 | 76 | # Fit endoR for the "versicolor" species 77 | 78 | This time we will filter decisions based on their importance to trim the network 79 | (filter = TRUE). We will use min_imp = 0.5 to keep at least all decisions with 80 | an importance > 0.5*the best importance (the lower min_imp, the milder the 81 | filtering).
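For instance, if the most important decision has an importance of 0.8, then
min_imp = 0.5 guarantees that every decision with an importance above 0.4 is
kept.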
82 | 83 | ```{r} 84 | endo_versicolor <- model2DE(model = mod, model_type = 'rf' 85 | , data = select(iris, -Species), target = iris$Species 86 | , classPos = 'versicolor' 87 | , K = 3, discretize = TRUE 88 | , filter = TRUE, min_imp = 0.5) 89 | ``` 90 | 91 | 92 | The petals' proportions are intermediate between those of the setosa and 93 | virginica species. 94 | ```{r} 95 | plotFeatures(endo_versicolor, levels_order = c('Low', 'Medium', 'High')) 96 | ``` 97 | 98 | Sepals have a narrow-to-intermediate width and a long-to-intermediate length 99 | (= if wide and short, then it is not a versicolor, as seen on the network). 100 | ```{r} 101 | plotNetwork(endo_versicolor, hide_isolated_nodes = FALSE, layout = 'fr' 102 | # we show only edges that connect 3 nodes max -> removes the edges with 103 | # the lowest importances; for longer paths (= a more complex network), 104 | # you can increase path_length 105 | , path_length = 3 106 | )+ 107 | scale_edge_alpha(range = c(0.8,1))+ 108 | theme(legend.box = "horizontal") 109 | ``` 110 | -------------------------------------------------------------------------------- /R/getThresholds.R: -------------------------------------------------------------------------------- 1 | #' Get thresholds to discretize variables according to their split in tree ensemble models 2 | #' @param conditions character vector with all conditions from which to find the thresholds 3 | #' @param data data to discretize 4 | #' @param Kmax numeric, maximal number of categories for each variable (default: Kmax = 2). 5 | #' @export 6 | 7 | getThresholds <- function(conditions, data, Kmax = 2) { 8 | ### a little slow, may be improved 9 | 10 | # get all individual sub conditions per variable 11 | var_cond <- unlist(lapply(conditions, function(x) { 12 | unlist(strsplit(x, split = " & ")) 13 | })) 14 | # make it to a data.frame with variable and thresholds used 15 | var_cond <- data.frame( 16 | var = as.numeric(str_extract(var_cond, pattern = "[:digit:]*(?=\\])")), 17 | thr = as.numeric(str_extract(var_cond, pattern = "-?[:digit:]*\\.?[:digit:]*$")) 18 | ) 19 | 20 | # and now transform to a list 21 | var_split <- lapply(sort(unique(var_cond$var)), 22 | function(x, var_cond, data) { 23 | thr <- unlist(subset(var_cond, var == x, select = thr)) 24 | var_v <- data[[x]] 25 | thr[thr < min(var_v)] <- min(var_v) 26 | thr[thr > max(var_v)] <- max(var_v) 27 | return(thr) 28 | }, 29 | var_cond = var_cond, data = data 30 | ) 31 | names(var_split) <- as.character(sort(unique(var_cond$var))) 32 | 33 | # remove non-numeric variables 34 | are_num <- as.character(which(sapply(data, function(x) { 35 | length(unique(x)) > 2 36 | }))) 37 | are_num <- are_num[are_num %in% names(var_split)] 38 | var_split <- var_split[are_num] 39 | 40 | # get thresholds for discretization 41 | if (Kmax == 2) { 42 | new_thr <- lapply(var_split, getMode) 43 | } else { 44 | new_thr <- lapply(var_split, getModes_all) 45 | new_thr <- lapply(new_thr, function(x, Kmax) { 46 | x[1:min(Kmax - 1, length(x))] 47 | }, Kmax = Kmax) 48 | } 49 | 50 | # remove thresholds out of range 51 | colNb <- as.integer(names(new_thr)) 52 | new_thr <- lapply(colNb, function(x, data, thr) { 53 | list( 54 | "var_v" = data[[x]], 55 | "thr" = thr[[as.character(x)]] 56 | ) 57 | }, data = data, thr = new_thr) 58 | 59 | new_thr <- lapply(new_thr, function(x) { 60 | x$thr <- x$thr[x$thr >= min(x$var_v) & x$thr <= max(x$var_v)] 61 | if (length(x$thr) == 0) x$thr <- min(x$var_v) 62 | return(x) 63 | }) 64 | names(new_thr) <- as.character(colNb) 65 | 66 | return(new_thr) 67 | } 68 |
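# Added note: the helper functions below derive discretization thresholds from
# the pooled split values of a variable: getMode() returns the location of the
# highest kernel-density peak, and getModes_all() returns all peak locations
# ordered by decreasing height.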
70 | ########## 71 | 72 | getMode <- function(x) { 73 | if (length(x) == 1) { 74 | return(x) 75 | } else { 76 | tmp <- density(x) 77 | return(tmp$x[which.max(tmp$y)]) 78 | } 79 | } 80 | 81 | 82 | getModes_all <- function(var) { 83 | # copied from the pastecs R-package: 84 | # https://github.com/phgrosjean/pastecs/blob/master/R/turnpoints.R 85 | # (I just removed the unnecessary bits.) 86 | 87 | if (length(var) == 1) { 88 | return(var) 89 | } 90 | 91 | x <- as.vector(density(var)$y) 92 | n <- length(x) 93 | diffs <- c(x[1] - 1, x[1:(n - 1)]) != x 94 | 95 | uniques <- x[diffs] 96 | 97 | n2 <- length(uniques) 98 | poss <- (1:n)[diffs] 99 | exaequos <- c(poss[2:n2], n + 1) - poss - 1 100 | 101 | m <- n2 - 2 102 | ex <- matrix(uniques[1:m + rep(3:1, rep(m, 3)) - 1], m) 103 | peaks <- c(FALSE, apply(ex, 1, max, na.rm = TRUE) == ex[, 2], FALSE) 104 | tppos <- (poss + exaequos)[peaks] 105 | 106 | # Now, order the peaks and return the x values 107 | y_peaks <- x[tppos] 108 | x_peaks <- density(var)$x[tppos] 109 | 110 | peaks_order <- order(y_peaks, decreasing = TRUE) 111 | # y_peaks <- y_peaks[peaks_order] 112 | # x_peaks <- x_peaks[peaks_order] 113 | return(x_peaks[peaks_order]) 114 | } 115 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/vignettes/iris_multiclass.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "endoR: multiclass" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{endoR: multiclass} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>", 14 | fig.width = 7, 15 | fig.height = 2.5 16 | ) 17 | ``` 18 | 19 | This vignette illustrates how to interpret multiclass models with endoR. 20 | We will use the iris data for this purpose: the species is being predicted 21 | (= target) using the length and width of petals and sepals (= features). 22 | 23 | # Preamble 24 | ```{r setup, message = FALSE, warning=FALSE} 25 | library(endoR) 26 | library(randomForest) 27 | library(tidyverse) 28 | library(caret) 29 | library(ggpubr) 30 | library(ggraph) 31 | ``` 32 | 33 | # Data 34 | ```{r} 35 | data(iris) 36 | summary(iris) 37 | ``` 38 | 39 | # Fit model 40 | ```{r} 41 | set.seed(1313) 42 | mod <- randomForest(Species ~ ., data = iris) 43 | mod 44 | ``` 45 | 46 | # Fit endoR for the "setosa" species 47 | 48 | We will only prune (prune = TRUE) and discretize decisions into K = 3 levels. 49 | ```{r} 50 | endo_setosa <- model2DE(model = mod, model_type = 'rf' 51 | , data = select(iris, -Species), target = iris$Species 52 | , classPos = 'setosa' # our focal class 53 | , filter = FALSE 54 | 55 | # we discretize the numerical features into K = 3 categories 56 | , discretize = TRUE, K = 3) 57 | ``` 58 | 59 | Plants from the setosa species have small petals and narrow, long sepals. 60 | ```{r} 61 | plotFeatures(endo_setosa, levels_order = c('Low', 'Medium', 'High')) 62 | 63 | # The font warnings are due to Windows; no worries. 64 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = 'fr') 65 | ``` 66 | 67 | The only interaction used for predictions is the one of sepal length (High = 68 | long) and width (Low = narrow). We can also use ggplot2 `theme` to format the 69 | plot, e.g., put the legend boxes next to each other.
70 | ```{r} 71 | plotNetwork(endo_setosa, hide_isolated_nodes = TRUE, layout = 'fr')+ 72 | theme(legend.box = "horizontal") 73 | ``` 74 | 75 | 76 | # Fit endoR for the "versicolor" species 77 | 78 | This time we will filter decisions based on their importance to trim the network 79 | (filter = TRUE). We will use min_imp = 0.5 to keep at least all decisions with 80 | an importance > 0.5*the best importance (the lower min_imp, the milder the 81 | filtering). 82 | 83 | ```{r} 84 | endo_versicolor <- model2DE(model = mod, model_type = 'rf' 85 | , data = select(iris, -Species), target = iris$Species 86 | , classPos = 'versicolor' 87 | , K = 3, discretize = TRUE 88 | , filter = TRUE, min_imp = 0.5) 89 | ``` 90 | 91 | 92 | The petals' proportions are intermediate between those of the setosa and 93 | virginica species. 94 | ```{r} 95 | plotFeatures(endo_versicolor, levels_order = c('Low', 'Medium', 'High')) 96 | ``` 97 | 98 | Sepals have a narrow-to-intermediate width and a long-to-intermediate length 99 | (= if wide and short, then it is not a versicolor, as seen on the network). 100 | ```{r} 101 | plotNetwork(endo_versicolor, hide_isolated_nodes = FALSE, layout = 'fr' 102 | # we show only edges that connect 3 nodes max -> removes the edges with 103 | # the lowest importances; for longer paths (= a more complex network), 104 | # you can increase path_length 105 | , path_length = 3 106 | )+ 107 | scale_edge_alpha(range = c(0.8,1))+ 108 | theme(legend.box = "horizontal") 109 | ``` 110 | -------------------------------------------------------------------------------- /man/filterDecisionsImportances.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/filterDecisionsImportances.R 3 | \name{filterDecisionsImportances} 4 | \alias{filterDecisionsImportances} 5 | \title{Filter decisions according to their metrics} 6 | \usage{ 7 | filterDecisionsImportances(rules, min_imp = 0.7) 8 | } 9 | \arguments{ 10 | \item{rules}{data.frame corresponding to the decisions, with all their metrics.} 11 | 12 | \item{min_imp}{minimal relative importance of the decisions that must be kept; the threshold used to remove decisions will thus be lower than max(imp)*min_imp.} 13 | } 14 | \value{ 15 | The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 16 | } 17 | \description{ 18 | This function filters decisions in a heuristic manner according to their importance and multiplicity. 19 | A relative importance threshold is calculated that maximises the average product (relative importance * n) and the number of decisions to be removed. 20 | All decisions with a relative importance above that threshold are kept. The argument min_imp is the minimal relative importance of the decisions kept. 21 | } 22 | \examples{ 23 | library(randomForest) 24 | library(caret) 25 | library(data.table) 26 | 27 | # import data and fit model 28 | data(iris) 29 | mod <- randomForest(Species ~ ., data = iris) 30 | 31 | # Let's get the decision ensemble. One could use the wrapping function 32 | # model2DE(), but we will run each function separately.
33 | 34 | # Get the raw decision ensemble 35 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 36 | , target = iris$Species, classPos = "setosa" 37 | , times = 1 # number of bootstraps, here just one 38 | , discretize = FALSE) # we will discretize outside for the example 39 | summary(de) 40 | # exec = the decision ensemble 41 | # partitions = list of sample indexes for bootstrapping 42 | # if we had done discretization, the new data would be in data_ctg 43 | de <- de$exec 44 | 45 | # Discretize variables in 3 categories - optional 46 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 47 | , K = 3, classPos = "setosa", mode = "data") 48 | data_ctg <- de$data_ctg 49 | de <- de$rules_ctg 50 | 51 | # Homogenize the decision ensemble 52 | de <- de[, condition := sapply(condition, function(x) { 53 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 54 | })] 55 | de <- unique( 56 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 57 | ) 58 | 59 | # Calculate decision metrics; we don't need the importances yet since we will 60 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 61 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 62 | , classPos = "setosa", importances = FALSE) 63 | de <- de[de_met, on = "condition"] 64 | 65 | # Pruning - optional 66 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 67 | , classPos = "setosa") 68 | 69 | # Decision importances 70 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 71 | , classPos = "setosa") 72 | 73 | # Filter out decisions with the lowest importance: min_imp = the minimal 74 | # importance in the decision ensemble compared to the maximal one. 75 | # E.g., if min_imp = 0.5, then at least all decisions with an 76 | # importance > 0.5*max(importance) will be kept. 77 | # This ensures that we don't throw out too much. 78 | # Since the decision ensemble is quite small, we don't need to filter much...
79 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 80 | 81 | # Get the network 82 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 83 | , classPos = "setosa") 84 | 85 | # Plot the feature importance/influence and the network 86 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 87 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 88 | } 89 | -------------------------------------------------------------------------------- /man/decisionImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisionImportance.R 3 | \name{decisionImportance} 4 | \alias{decisionImportance} 5 | \title{Compute the importance of decisions} 6 | \usage{ 7 | decisionImportance( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | in_parallel = FALSE, 13 | n_cores = detectCores() - 1, 14 | cluster = NULL 15 | ) 16 | } 17 | \arguments{ 18 | \item{rules}{a data.frame with a column "condition" or a vector with name "condition".} 19 | 20 | \item{data}{data from which to get the decision support.} 21 | 22 | \item{target}{response variable.} 23 | 24 | \item{classPos}{if classification, the positive class.} 25 | 26 | \item{in_parallel}{if TRUE, the function is run in parallel.} 27 | 28 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 29 | 30 | \item{cluster}{the cluster to use to run the function in parallel.} 31 | } 32 | \value{ 33 | the data.frame passed in rules with the gain and importance of each decision. 34 | } 35 | \description{ 36 | This function computes the importance of decisions. 37 | } 38 | \examples{ 39 | library(randomForest) 40 | library(caret) 41 | library(data.table) 42 | 43 | # import data and fit model 44 | data(iris) 45 | mod <- randomForest(Species ~ ., data = iris) 46 | 47 | # Let's get the decision ensemble. One could use the wrapping function 48 | # model2DE(), but we will run each function separately. 49 | 50 | # Get the raw decision ensemble 51 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 52 | , target = iris$Species, classPos = "setosa" 53 | , times = 1 # number of bootstraps, here just one 54 | , discretize = FALSE) # we will discretize outside for the example 55 | summary(de) 56 | # exec = the decision ensemble 57 | # partitions = list of sample indexes for bootstrapping 58 | # if we had done discretization, the new data would be in data_ctg 59 | de <- de$exec 60 | 61 | # Discretize variables in 3 categories - optional 62 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 63 | , K = 3, classPos = "setosa", mode = "data") 64 | data_ctg <- de$data_ctg 65 | de <- de$rules_ctg 66 | 67 | # Homogenize the decision ensemble 68 | de <- de[, condition := sapply(condition, function(x) { 69 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 70 | })] 71 | de <- unique( 72 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 73 | ) 74 | 75 | # Calculate decision metrics; we don't need the importances yet since we will 76 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps.
77 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 78 | , classPos = "setosa", importances = FALSE) 79 | de <- de[de_met, on = "condition"] 80 | 81 | # Pruning - optional 82 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 83 | , classPos = "setosa") 84 | 85 | # Decision importances 86 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 87 | , classPos = "setosa") 88 | 89 | # Filter out decisions with the lowest importance: min_imp = the minimal 90 | # importance in the decision ensemble compared to the maximal one. 91 | # E.g., if min_imp = 0.5, then at least all decisions with an 92 | # importance > 0.5*max(importance) will be kept. 93 | # This ensures that we don't throw out too much. 94 | # Since the decision ensemble is quite small, we don't need to filter much... 95 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 96 | 97 | # Get the network 98 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 99 | , classPos = "setosa") 100 | 101 | # Plot the feature importance/influence and the network 102 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 103 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 104 | } 105 | -------------------------------------------------------------------------------- /man/model2DE_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE_cluster.R 3 | \name{model2DE_cluster} 4 | \alias{model2DE_cluster} 5 | \title{Run model2DE on several bootstrap resamples in parallel.} 6 | \usage{ 7 | model2DE_cluster(partition) 8 | } 9 | \arguments{ 10 | \item{partition}{a vector with row numbers to subset data.} 11 | } 12 | \description{ 13 | Function to use with the Q() function from the clustermq R-package, with the following arguments in export: 14 | data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 15 | See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 16 | } 17 | \examples{ 18 | library(randomForest) 19 | library(caret) 20 | 21 | # import data and fit model 22 | data(iris) 23 | mod <- randomForest(Species ~ ., data = iris) 24 | 25 | # Get decision ensemble with bootstrapping.
26 | 27 | # Run 1 bootstrap after the other (times = 2 bootstraps) 28 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 29 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 30 | , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE) 31 | 32 | # Same but use different sample weights for bootstrapping 33 | n_setosa <- sum(iris$Species == "setosa") 34 | n_samp <- length(iris$Species) 35 | samp_weight <- round( 36 | ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp) 37 | , digits = 2) 38 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 39 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 40 | , times = 2, sample_weight = samp_weight 41 | , in_parallel = TRUE, n_cores = 2, filter = FALSE) 42 | 43 | # Run the bootstraps in parallel 44 | # First do all steps before bootstrapping 45 | preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 46 | , target = iris$Species, classPos = "setosa", times = 2 47 | , discretize = TRUE, in_parallel = FALSE) 48 | 49 | # Remove the special characters from column names 50 | colnames(preclu$data) <- compatibleNames(colnames(preclu$data)) 51 | 52 | # Parameters for clustermq: can also run on HPC environment 53 | library(clustermq) 54 | options(clustermq.scheduler = "multiprocess") 55 | # ... and run in parallel on each bootstrap 56 | # (preclu$partitions = list of sample indexes for each bootstrap) 57 | endo_setosa <- Q(model2DE_cluster 58 | , partition = preclu$partitions 59 | , export = list(data = preclu$data 60 | , target = iris$Species 61 | , exec = preclu$exec 62 | , classPos = "setosa" 63 | , prune = TRUE, filter = FALSE 64 | , maxDecay = 0.05 # values needed for maxDecay and typeDecay 65 | , typeDecay = 2 # here default ones, see pruneDecisions() 66 | , in_parallel = FALSE # can parallelize within each bootstrap! 67 | ) 68 | , n_jobs = 2 # max number of bootstraps that can be run in parallel 69 | , pkgs = c("data.table", "parallel", "caret", "stringr", "scales" 70 | , "dplyr", "inTrees", "endoR") 71 | , log_worker = FALSE # to keep a log of the runs, e.g., if it fails 72 | ) 73 | 74 | 75 | # Stability selection 76 | # First we can look at the effect of the alpha parameter on selection; 77 | # alpha = expected number of false decisions 78 | alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10) 79 | , data = preclu$data) 80 | alphas$summary_table 81 | 82 | # perform stability selection with alpha = 7 83 | de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7) 84 | 85 | # Plot the decision ensemble: 86 | # Plants from the setosa species have small petals and narrow, long sepals. 87 | plotFeatures(de_final, levels_order = c("Low", "Medium", "High")) 88 | 89 | # there is no interaction between variables (all decisions with len = 1, 90 | # the number of variables in the rules) 91 | de_final$rules_summary 92 | # hence the network would be empty and couldn't be plotted... 93 | # plotNetwork(de_final, hide_isolated_nodes = FALSE) 94 | } 95 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/getDecisionsMetric.R: -------------------------------------------------------------------------------- 1 | #' Measure the error, prediction and importance of decisions 2 | #' 3 | #' This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off.
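#' (Note added for clarity: when importances are computed, imp = gain * support,
#' consistent with decisionImportance().)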
4 | #' 5 | #' @param ruleExec a vector with name "condition" or a data.frame with a column "condition". 6 | #' @param data data from which to get the decision support. 7 | #' @param target response variable. 8 | #' @param classPos for classification tasks, the positive class to be predicted by decisions. 9 | #' @param importances if FALSE, the importances are not calculated (importances = TRUE by default). 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' @return a datatable with the rule (column "condition"), error ("err"), prediction ("pred"), support ("support"), and number of variables in the decision rule ("len"). Columns "gain" and "imp" with the gain and importance of the decision are added if importances were calculated. 14 | #' @export 15 | getDecisionsMetrics <- function(ruleExec, data, target, classPos = NULL, 16 | importances = TRUE, 17 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 18 | # modified from getRuleMetrics in the inTrees R-package 19 | 20 | # get the classPos 21 | if (is.numeric(target) == FALSE) { 22 | if (is.null(classPos) == TRUE) { 23 | classPos <- names(which.max(table(target))) 24 | cat("Positive class:", classPos, "\n") 25 | } 26 | target <- ifelse(target == classPos, 1, 0) 27 | type <- "classification" 28 | } else { 29 | type <- "reg" 30 | } 31 | 32 | if ("data.table" %in% class(data)) { 33 | data <- as.data.frame(data) 34 | } 35 | 36 | if (importances == TRUE) { 37 | pred_null <- mean(target, na.rm = TRUE) 38 | } else { 39 | pred_null <- NULL 40 | } 41 | 42 | 43 | 44 | if (in_parallel == FALSE) { 45 | ruleMetric <- t(sapply(as.matrix(ruleExec[, "condition"]), measureSingleDecision, 46 | data = data, target = target, type = type, 47 | gain = importances, pred_null = pred_null 48 | )) 49 | } else { 50 | if (is.null(cluster) == TRUE) { 51 | message("Initiate parallelisation ... ") 52 | cluster <- makeCluster(n_cores) 53 | clusterEvalQ(cluster, library(inTrees)) 54 | on.exit(stopCluster(cluster)) 55 | } 56 | 57 | message("Calculate metrics ... 
") 58 | ruleMetric <- parLapply(cluster, as.matrix(ruleExec[, "condition"]), measureSingleDecision, 59 | data = data, target = target, type = type, 60 | gain = importances, pred_null = pred_null 61 | ) 62 | 63 | if (importances == FALSE) { 64 | colN <- c("len", "support", "err", "condition", "pred") 65 | } else { 66 | colN <- c("len", "support", "err", "condition", "pred", "gain") 67 | } 68 | 69 | ruleMetric <- matrix(unlist(ruleMetric), nrow = length(colN), byrow = FALSE) 70 | rownames(ruleMetric) <- colN 71 | ruleMetric <- t(ruleMetric) 72 | } 73 | 74 | rownames(ruleMetric) <- NULL 75 | # remove the decisions with empty support 76 | dIx <- which(ruleMetric[, "len"] == "-1") 77 | if (length(dIx) > 0) { 78 | ruleMetric <- ruleMetric[-dIx, ] 79 | } 80 | if ("character" %in% class(ruleMetric)) { 81 | ruleMetric <- t(ruleMetric) 82 | } 83 | 84 | 85 | if (importances == TRUE) { 86 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred", "gain") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred", "gain")][ 87 | , "imp" := gain * support 88 | ] 89 | } else { 90 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred")] 91 | } 92 | 93 | 94 | return(ruleMetric) 95 | } 96 | -------------------------------------------------------------------------------- /R/getDecisionsMetric.R: -------------------------------------------------------------------------------- 1 | #' Measure the error, prediction and importance of decisions 2 | #' 3 | #' This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off. 4 | #' 5 | #' @param ruleExec a vector with name "condition" or a data.frame with a column "condition". 6 | #' @param data data from which to get the decision support. 7 | #' @param target response variable. 8 | #' @param classPos for clssification tasks, the positive class to be predicted by decisions. 9 | #' @param importances if FALSE, the importances are not calculated (importances = TRUE by default). 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' @return a datatable with the rule (column "condition"), error ("err"), prediction ("pred") support, number of variables in the decision rule ("len"). Columns "gain" and "imp" wit hthe gain and importance of teh decision are added if importances were calculated. 
14 | #' 15 | #' @example examples/iris_each_function.R 16 | #' 17 | #' @export 18 | getDecisionsMetrics <- function(ruleExec, data, target, classPos = NULL, 19 | importances = TRUE, 20 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 21 | # modified from getRuleMetrics in the inTrees R-package 22 | 23 | # get the classPos 24 | if (is.numeric(target) == FALSE) { 25 | if (is.null(classPos) == TRUE) { 26 | classPos <- names(which.max(table(target))) 27 | cat("Positive class:", classPos, "\n") 28 | } 29 | target <- ifelse(target == classPos, 1, 0) 30 | type <- "classification" 31 | } else { 32 | type <- "reg" 33 | } 34 | 35 | if ("data.table" %in% class(data)) { 36 | data <- as.data.frame(data) 37 | } 38 | 39 | if (importances == TRUE) { 40 | pred_null <- mean(target, na.rm = TRUE) 41 | } else { 42 | pred_null <- NULL 43 | } 44 | 45 | 46 | 47 | if (in_parallel == FALSE) { 48 | ruleMetric <- t(sapply(as.matrix(ruleExec[, "condition"]), measureSingleDecision, 49 | data = data, target = target, type = type, 50 | gain = importances, pred_null = pred_null 51 | )) 52 | } else { 53 | if (is.null(cluster) == TRUE) { 54 | message("Initiate parallelisation ... ") 55 | cluster <- makeCluster(n_cores) 56 | clusterEvalQ(cluster, library(inTrees)) 57 | on.exit(stopCluster(cluster)) 58 | } 59 | 60 | message("Calculate metrics ... ") 61 | ruleMetric <- parLapply(cluster, as.matrix(ruleExec[, "condition"]), measureSingleDecision, 62 | data = data, target = target, type = type, 63 | gain = importances, pred_null = pred_null 64 | ) 65 | 66 | if (importances == FALSE) { 67 | colN <- c("len", "support", "err", "condition", "pred") 68 | } else { 69 | colN <- c("len", "support", "err", "condition", "pred", "gain") 70 | } 71 | 72 | ruleMetric <- matrix(unlist(ruleMetric), nrow = length(colN), byrow = FALSE) 73 | rownames(ruleMetric) <- colN 74 | ruleMetric <- t(ruleMetric) 75 | } 76 | 77 | rownames(ruleMetric) <- NULL 78 | # remove the decisions with empty support 79 | dIx <- which(ruleMetric[, "len"] == "-1") 80 | if (length(dIx) > 0) { 81 | ruleMetric <- ruleMetric[-dIx, ] 82 | } 83 | if ("character" %in% class(ruleMetric)) { 84 | ruleMetric <- t(ruleMetric) 85 | } 86 | 87 | 88 | if (importances == TRUE) { 89 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred", "gain") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred", "gain")][ 90 | , "imp" := gain * support 91 | ] 92 | } else { 93 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred")] 94 | } 95 | 96 | 97 | return(ruleMetric) 98 | } 99 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/plotFeatures.R: -------------------------------------------------------------------------------- 1 | #' Plot the importance and influence of features. 2 | #' 3 | #' Returns a ggplot object with variables importance (across all categorical levels for factor variables) and variable per-level influence. 4 | #' It uses the ggpubr package to combine plots. 5 | #' 6 | #' @param decision_ensemble stable decision ensemble (see stabilitySelection). 7 | #' @param levels_order optional, order for variables levels on the influence plot 8 | #' @param return_all TRUE, returns the table of feature importance and influences and each plot separated (default = FALSE). 
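#' @param colour_low colour for the negative feature influence values (default: yellowish)
#' @param colour_mid colour for the null feature influence values (default: light grey)
#' @param colour_high colour for the positive feature influence values (default: blue)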
9 | #' @return 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots. 10 | #' @export 11 | plotFeatures <- function(decision_ensemble, levels_order = NULL, 12 | colour_low = "#E69F00", colour_mid = "grey87", colour_high = "#0072B2", 13 | return_all = FALSE) { 14 | 15 | 16 | # get the importance across all levels for each feature 17 | agg_imp <- decision_ensemble$nodes 18 | agg_imp$Feature <- str_replace(agg_imp$var, pattern = "\\_{2}.*", replacement = "") 19 | agg_imp <- agg_imp %>% 20 | group_by(Feature) %>% 21 | summarize(importance = sum(importance)) 22 | # order features based on their importance 23 | agg_imp$Feature <- factor(agg_imp$Feature, levels = agg_imp$Feature[order(agg_imp$importance)]) 24 | 25 | # make the feature importance plot 26 | paggimp <- ggplot(agg_imp, aes(x = Feature, y = importance)) + 27 | geom_col(fill = "lightgray", width = 0.1) + 28 | geom_point(size = 3) + 29 | coord_flip() + 30 | theme_classic() + 31 | theme( 32 | axis.text.y = element_text(size = rel(1.2)), 33 | axis.text.x = element_text(size = rel(1.2)), legend.text = element_text(size = rel(1.2)), 34 | panel.grid.major.y = element_line(colour = "grey", size = 0.2) 35 | ) + 36 | labs(x = "", y = "Importance", title = "Feature importance") 37 | 38 | 39 | # arrange the feature influences 40 | agg_inf <- decision_ensemble$nodes %>% select(var, influence) 41 | agg_inf$level <- str_extract(agg_inf$var, pattern = "(?<=\\_{2}).*") 42 | agg_inf$Feature_short <- str_replace(agg_inf$var, pattern = "\\_{2}.*", replacement = "") 43 | agg_inf <- left_join(agg_inf, agg_imp, by = c(Feature_short = "Feature")) 44 | l <- unique(agg_inf$level) 45 | l <- l[!is.na(l)] 46 | 47 | if (length(l) > 0) { 48 | ix <- which(is.na(agg_inf$level)) 49 | tmp <- agg_inf[rep(ix, length(l)), ] 50 | tmp$level <- unlist(lapply(l, function(x) rep(x, length(ix)))) 51 | agg_inf <- agg_inf[complete.cases(agg_inf), ] %>% rbind(tmp) 52 | agg_inf$Feature_short <- factor(agg_inf$Feature_short, levels = unique(agg_inf$Feature_short[order(agg_inf$importance)])) 53 | if (!is.null(levels_order)) { 54 | agg_inf$level <- factor(agg_inf$level, levels = levels_order) 55 | } 56 | } else { 57 | agg_inf$level <- "" 58 | } 59 | 60 | level_inf <- ggplot(agg_inf, aes( 61 | x = level, y = Feature_short, 62 | fill = influence 63 | )) + 64 | geom_tile(color = "white") + 65 | theme_classic() + 66 | theme( 67 | axis.text.y = element_blank(), axis.text.x = element_text(size = rel(1.2)), 68 | legend.text = element_text(size = rel(1.2)), legend.title = element_text( 69 | size = rel(1.2), 70 | vjust = 0.85 71 | ) 72 | ) + 73 | labs(x = "Level", y = "", title = "Influence per level") + 74 | scale_fill_gradient2( 75 | low = colour_low, mid = colour_mid, 76 | high = colour_high, midpoint = 0, name = "Influence on\nphenotype" 77 | ) 78 | 79 | 80 | if (return_all == TRUE) { 81 | return(list( 82 | "importance" = agg_imp, "importance_p" = paggimp, 83 | "influences" = agg_inf, "influence_p" = level_inf 84 | )) 85 | } else { 86 | require(ggpubr) 87 | paggs <- ggpubr::ggarrange(paggimp + theme(legend.position = "none"), 88 | level_inf, 89 | nrow = 1, ncol = 2, widths = c(1, 0.8) 90 | ) 91 | return(paggs) 92 | } 93 | } 94 | -------------------------------------------------------------------------------- /R/evaluateAlpha.R: -------------------------------------------------------------------------------- 1 | #' Calculate the number of decisions and of predicted samples from decision ensembles
obtained with different alpha values 2 | #' 3 | #' The aim is to help pick an alpha that will result in a decision ensemble able to predict most samples. Performs stability selection for each of the given alpha values. The numbers of decisions and of samples that follow decisions are also calculated. 4 | #' The procedure is adapted from Meinshausen and Buehlmann (2010): the best decisions from each bootstrap are pre-selected, and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 5 | #' The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 6 | #' @param rules list of bootstrap results 7 | #' @param alphas expected numbers of false positive decisions selected (default = c(5, 10, 15, 20, 30, 50, 75)). 8 | #' @param data data for which to calculate how many samples follow each decision. Columns should be the same as for fitting the decision ensemble, but the samples may differ. 9 | #' @param decision_ensembles should the decision ensembles be returned? 10 | #' @param pi_thr fraction of bootstraps in which a decision must have been pre-selected to be included in the stable decision ensemble (default = 0.7). 11 | #' @param aggregate_taxa should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species) 12 | #' @param taxa if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s) 13 | #' @return A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importance and influence in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes.
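#' @details A decision is included in a stable decision ensemble only if it was pre-selected in at least pi_thr * (number of bootstraps) resamples (the inN >= minN filter in the code below).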
14 | #' @example examples/iris_bootstraps.R 15 | #' @export 16 | 17 | evaluateAlpha <- function(rules, alphas = c(5, 10, 15, 20, 30, 50, 75), pi_thr = 0.7, 18 | data = NULL, decision_ensembles = TRUE, 19 | aggregate_taxa = FALSE, taxa = NULL 20 | ) { 21 | 22 | if ("resamp" %in% names(rules)) { 23 | minN <- pi_thr * length(rules$resamp) 24 | } else { 25 | minN <- pi_thr * length(rules) 26 | } 27 | 28 | if (is.null(data) & !("data" %in% names(rules))) { 29 | warning("Please provide the data used to create the decision ensemble (original if not discretized, precluster$data if discretized).") 30 | return() 31 | } else if ("data" %in% names(rules)) { 32 | data <- rules$data 33 | } 34 | 35 | res <- list() 36 | check_sampl <- data.frame(alpha = numeric(), n_dec = numeric(), n_samp = numeric()) 37 | 38 | for (i in 1:length(alphas)) { 39 | tmp <- stabilitySelection(rules = rules, alpha_error = alphas[i], pi_thr = pi_thr, aggregate_taxa = aggregate_taxa, taxa = taxa) 40 | if (decision_ensembles == TRUE) { 41 | res[[as.character(alphas[i])]] <- tmp 42 | } 43 | 44 | tmp <- subset(tmp$rules_summary, inN >= minN) 45 | 46 | cond <- tmp$condition 47 | if (length(cond) == 0) { 48 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = 0, n_samp = 0) 49 | } else { 50 | cond <- str_replace_all(cond, pattern = "X", replacement = "data") 51 | cond <- paste0("which(", cond, ")") 52 | pred_ix <- lapply(cond, function(x) { 53 | eval(parse(text = x)) 54 | }) %>% 55 | unlist() %>% 56 | unique() 57 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = length(cond), n_samp = length(pred_ix)) 58 | } 59 | } 60 | if (decision_ensembles == TRUE) { 61 | res[["summary_table"]] <- check_sampl 62 | return(res) 63 | } else { 64 | return(check_sampl) 65 | } 66 | } 67 | -------------------------------------------------------------------------------- /R/plotFeatures.R: -------------------------------------------------------------------------------- 1 | #' Plot the importance and influence of features. 2 | #' 3 | #' Returns a ggplot object with variable importance (across all categorical levels for factor variables) and variable per-level influence. 4 | #' It uses the ggpubr package to combine plots. 5 | #' 6 | #' @param decision_ensemble stable decision ensemble (see stabilitySelection). 7 | #' @param levels_order optional, order for variable levels on the influence plot 8 | #' @param return_all if TRUE, returns the table of feature importances and influences and each plot separately (default = FALSE). 9 | #' @param colour_low colour for the negative feature influence values (default: yellowish) 10 | #' @param colour_mid colour for the null feature influence values (default: light grey) 11 | #' @param colour_high colour for the positive feature influence values (default: blue) 12 | #' @return 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots.
13 | #' 14 | #' @example examples/iris_basic.R 15 | #' @export 16 | plotFeatures <- function(decision_ensemble, levels_order = NULL, 17 | colour_low = "#E69F00", colour_mid = "grey87", colour_high = "#0072B2", 18 | return_all = FALSE) { 19 | 20 | 21 | # get the importance across all levels for each feature 22 | agg_imp <- decision_ensemble$nodes 23 | agg_imp$Feature <- str_replace(agg_imp$var, pattern = "\\_{2}.*", replacement = "") 24 | agg_imp <- agg_imp %>% 25 | group_by(Feature) %>% 26 | summarize(importance = sum(importance)) 27 | # order features based on their importance 28 | agg_imp$Feature <- factor(agg_imp$Feature, levels = agg_imp$Feature[order(agg_imp$importance)]) 29 | 30 | # make the feature importance plot 31 | paggimp <- ggplot(agg_imp, aes(x = Feature, y = importance)) + 32 | geom_col(fill = "lightgray", width = 0.1) + 33 | geom_point(size = 3) + 34 | coord_flip() + 35 | theme_classic() + 36 | theme( 37 | axis.text.y = element_text(size = rel(1.2)), 38 | axis.text.x = element_text(size = rel(1.2)), legend.text = element_text(size = rel(1.2)), 39 | panel.grid.major.y = element_line(colour = "grey", size = 0.2) 40 | ) + 41 | labs(x = "", y = "Importance", title = "Feature importance") 42 | 43 | 44 | # arrange the feature influences 45 | agg_inf <- decision_ensemble$nodes %>% select(var, influence) 46 | agg_inf$level <- str_extract(agg_inf$var, pattern = "(?<=\\_{2}).*") 47 | agg_inf$Feature_short <- str_replace(agg_inf$var, pattern = "\\_{2}.*", replacement = "") 48 | agg_inf <- left_join(agg_inf, agg_imp, by = c(Feature_short = "Feature")) 49 | l <- unique(agg_inf$level) 50 | l <- l[!is.na(l)] 51 | 52 | if (length(l) > 0) { 53 | ix <- which(is.na(agg_inf$level)) 54 | tmp <- agg_inf[rep(ix, length(l)), ] 55 | tmp$level <- unlist(lapply(l, function(x) rep(x, length(ix)))) 56 | agg_inf <- agg_inf[complete.cases(agg_inf), ] %>% rbind(tmp) 57 | agg_inf$Feature_short <- factor(agg_inf$Feature_short, levels = unique(agg_inf$Feature_short[order(agg_inf$importance)])) 58 | if (!is.null(levels_order)) { 59 | agg_inf$level <- factor(agg_inf$level, levels = levels_order) 60 | } 61 | } else { 62 | agg_inf$level <- "" 63 | } 64 | 65 | level_inf <- ggplot(agg_inf, aes( 66 | x = level, y = Feature_short, 67 | fill = influence 68 | )) + 69 | geom_tile(color = "white") + 70 | theme_classic() + 71 | theme( 72 | axis.text.y = element_blank(), axis.text.x = element_text(size = rel(1.2)), 73 | legend.text = element_text(size = rel(1.2)), legend.title = element_text( 74 | size = rel(1.2), 75 | vjust = 0.85 76 | ) 77 | ) + 78 | labs(x = "Level", y = "", title = "Influence per level") + 79 | scale_fill_gradient2( 80 | low = colour_low, mid = colour_mid, 81 | high = colour_high, midpoint = 0, name = "Influence on\nphenotype" 82 | ) 83 | 84 | 85 | if (return_all == TRUE) { 86 | return(list( 87 | "importance" = agg_imp, "importance_p" = paggimp, 88 | "influences" = agg_inf, "influence_p" = level_inf 89 | )) 90 | } else { 91 | paggs <- ggpubr::ggarrange(paggimp + theme(legend.position = "none"), 92 | level_inf, 93 | nrow = 1, ncol = 2, widths = c(1, 0.8) 94 | ) 95 | return(paggs) 96 | } 97 | } 98 | -------------------------------------------------------------------------------- /man/pruneDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/pruneDecisions.R 3 | \name{pruneDecisions} 4 | \alias{pruneDecisions} 5 | \title{Prune rules of a decision ensemble} 6 | \usage{ 7 | 
pruneDecisions( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | maxDecay = 0, 13 | typeDecay = 1, 14 | in_parallel = FALSE, 15 | n_cores = detectCores() - 1, 16 | cluster = NULL 17 | ) 18 | } 19 | \arguments{ 20 | \item{rules}{a data frame with a column "condition".} 21 | 22 | \item{data}{data to use for calculating the metrics.} 23 | 24 | \item{target}{response variable.} 25 | 26 | \item{classPos}{for classification tasks, the positive class predicted.} 27 | 28 | \item{maxDecay}{threshold for the increase in error; if maxDecay = -Inf, no pruning is done; if maxDecay = 0, only variables increasing the error are pruned from decisions.} 29 | 30 | \item{typeDecay}{if typeDecay = 1, the absolute increase in error is computed, and the relative one is computed if typeDecay = 2 (default).} 31 | 32 | \item{in_parallel}{if TRUE, the function is run in parallel.} 33 | 34 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 35 | 36 | \item{cluster}{the cluster to use to run the function in parallel.} 37 | } 38 | \value{ 39 | Decision ensemble with pruned conditions. 40 | } 41 | \description{ 42 | This function removes from decision rules the variables that do not increase the rule error by more than a certain threshold. See the pruneRules function from the inTrees package (Deng, 2019) for more details. 43 | } 44 | \examples{ 45 | library(randomForest) 46 | library(caret) 47 | library(data.table) 48 | 49 | # import data and fit model 50 | data(iris) 51 | mod <- randomForest(Species ~ ., data = iris) 52 | 53 | # Let's get the decision ensemble. One could use the wrapping function 54 | # model2DE(), but we will run each function separately. 55 | 56 | # Get the raw decision ensemble 57 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 58 | , target = iris$Species, classPos = "setosa" 59 | , times = 1 # number of bootstraps, here just one 60 | , discretize = FALSE) # we will discretize outside for the example 61 | summary(de) 62 | # exec = the decision ensemble 63 | # partitions = list of sample indexes for bootstrapping 64 | # if we had done discretization, the new data would be in data_ctg 65 | de <- de$exec 66 | 67 | # Discretize variables into 3 categories - optional 68 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 69 | , K = 3, classPos = "setosa", mode = "data") 70 | data_ctg <- de$data_ctg 71 | de <- de$rules_ctg 72 | 73 | # Homogenize the decision ensemble 74 | de <- de[, condition := sapply(condition, function(x) { 75 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 76 | })] 77 | de <- unique( 78 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 79 | ) 80 | 81 | # Calculate decision metrics; we don't need the importances yet since we will 82 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 83 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 84 | , classPos = "setosa", importances = FALSE) 85 | de <- de[de_met, on = "condition"] 86 | 87 | # Pruning - optional 88 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 89 | , classPos = "setosa") 90 | 91 | # Decision importances 92 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 93 | , classPos = "setosa") 94 | 95 | # Filter out decisions with the lowest importance: min_imp = the minimal 96 | # importance in the decision ensemble compared to the maximal one.
97 | # E.g., if min_imp = 0.5, then at least all decisions with an 98 | # importance > 0.5*max(importance) will be kept. 99 | # This ensures that we don't throw out too much. 100 | # Since the decision ensemble is quite small, we don't need to filter much... 101 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 102 | 103 | # Get the network 104 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 105 | , classPos = "setosa") 106 | 107 | # Plot the feature importance/influence and the network 108 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 109 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 110 | } 111 | -------------------------------------------------------------------------------- /man/getDecisionsMetrics.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getDecisionsMetric.R 3 | \name{getDecisionsMetrics} 4 | \alias{getDecisionsMetrics} 5 | \title{Measure the error, prediction and importance of decisions} 6 | \usage{ 7 | getDecisionsMetrics( 8 | ruleExec, 9 | data, 10 | target, 11 | classPos = NULL, 12 | importances = TRUE, 13 | in_parallel = FALSE, 14 | n_cores = detectCores() - 1, 15 | cluster = NULL 16 | ) 17 | } 18 | \arguments{ 19 | \item{ruleExec}{a vector with name "condition" or a data.frame with a column "condition".} 20 | 21 | \item{data}{data from which to get the decision support.} 22 | 23 | \item{target}{response variable.} 24 | 25 | \item{classPos}{for classification tasks, the positive class to be predicted by decisions.} 26 | 27 | \item{importances}{if FALSE, the importances are not calculated (importances = TRUE by default).} 28 | 29 | \item{in_parallel}{if TRUE, the function is run in parallel.} 30 | 31 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 32 | 33 | \item{cluster}{the cluster to use to run the function in parallel.} 34 | } 35 | \value{ 36 | a datatable with the rule (column "condition"), error ("err"), prediction ("pred"), support ("support"), and number of variables in the decision rule ("len"). Columns "gain" and "imp" with the gain and importance of the decision are added if importances were calculated. 37 | } 38 | \description{ 39 | This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off. 40 | } 41 | \examples{ 42 | library(randomForest) 43 | library(caret) 44 | library(data.table) 45 | 46 | # import data and fit model 47 | data(iris) 48 | mod <- randomForest(Species ~ ., data = iris) 49 | 50 | # Let's get the decision ensemble. One could use the wrapping function 51 | # model2DE(), but we will run each function separately.
52 | 53 | # Get the raw decision ensemble 54 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 55 | , target = iris$Species, classPos = "setosa" 56 | , times = 1 # number of bootstraps, here just one 57 | , discretize = FALSE) # we will discretize outside for the example 58 | summary(de) 59 | # exec = the decision ensemble 60 | # partitions = list of sample indexes for bootstrapping 61 | # if we had done discretization, the new data would be in data_ctg 62 | de <- de$exec 63 | 64 | # Discretize variables into 3 categories - optional 65 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 66 | , K = 3, classPos = "setosa", mode = "data") 67 | data_ctg <- de$data_ctg 68 | de <- de$rules_ctg 69 | 70 | # Homogenize the decision ensemble 71 | de <- de[, condition := sapply(condition, function(x) { 72 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 73 | })] 74 | de <- unique( 75 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 76 | ) 77 | 78 | # Calculate decision metrics; we don't need the importances yet since we will 79 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 80 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 81 | , classPos = "setosa", importances = FALSE) 82 | de <- de[de_met, on = "condition"] 83 | 84 | # Pruning - optional 85 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 86 | , classPos = "setosa") 87 | 88 | # Decision importances 89 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 90 | , classPos = "setosa") 91 | 92 | # Filter out decisions with the lowest importance: min_imp = the minimal 93 | # importance in the decision ensemble compared to the maximal one. 94 | # E.g., if min_imp = 0.5, then at least all decisions with an 95 | # importance > 0.5*max(importance) will be kept. 96 | # This ensures that we don't throw out too much. 97 | # Since the decision ensemble is quite small, we don't need to filter much...
98 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 99 | 100 | # Get the network 101 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 102 | , classPos = "setosa") 103 | 104 | # Plot the feature importance/influence and the network 105 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 106 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 107 | } 108 | -------------------------------------------------------------------------------- /man/getNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getNetwork.R 3 | \name{getNetwork} 4 | \alias{getNetwork} 5 | \title{Transform a decision ensemble into a network} 6 | \usage{ 7 | getNetwork( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | aggregate_taxa = FALSE, 13 | taxa = NULL, 14 | in_parallel = FALSE, 15 | n_cores = detectCores() - 1, 16 | cluster = NULL 17 | ) 18 | } 19 | \arguments{ 20 | \item{rules}{the decision ensemble.} 21 | 22 | \item{data}{data from which to measure decision metrics.} 23 | 24 | \item{target}{response variable.} 25 | 26 | \item{classPos}{the positive class predicted by decisions.} 27 | 28 | \item{aggregate_taxa}{should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species)} 29 | 30 | \item{taxa}{if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s)} 31 | 32 | \item{in_parallel}{if TRUE, the function is run in parallel.} 33 | 34 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 35 | 36 | \item{cluster}{the cluster to use to run the function in parallel.} 37 | } 38 | \value{ 39 | A list with the feature and interaction importance and influence in the nodes and edges dataframes; the decision-wise feature and interaction importances and influences are contained in the nodes_agg and edges_agg dataframes. 40 | } 41 | \description{ 42 | Takes a decision ensemble and measures the importance and influence of each feature and pair of features to create a network. 43 | For categorical variables or discretized ones, the importance and influence are calculated per level. See featureImportance to obtain the overall feature importance. 44 | } 45 | \examples{ 46 | library(randomForest) 47 | library(caret) 48 | library(data.table) 49 | 50 | # import data and fit model 51 | data(iris) 52 | mod <- randomForest(Species ~ ., data = iris) 53 | 54 | # Let's get the decision ensemble. One could use the wrapping function 55 | # model2DE(), but we will run each function separately.
56 | 57 | # Get the raw decision ensemble 58 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 59 | , target = iris$Species, classPos = "setosa" 60 | , times = 1 # number of bootstraps, here just one 61 | , discretize = FALSE) # we will discretize outside for the example 62 | summary(de) 63 | # exec = the decision ensemble 64 | # partitions = list of sample indexes for bootstrapping 65 | # if we had done discretization, the new data would be in data_ctg 66 | de <- de$exec 67 | 68 | # Discretize variables into 3 categories - optional 69 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 70 | , K = 3, classPos = "setosa", mode = "data") 71 | data_ctg <- de$data_ctg 72 | de <- de$rules_ctg 73 | 74 | # Homogenize the decision ensemble 75 | de <- de[, condition := sapply(condition, function(x) { 76 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 77 | })] 78 | de <- unique( 79 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 80 | ) 81 | 82 | # Calculate decision metrics; we don't need the importances yet since we will 83 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 84 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 85 | , classPos = "setosa", importances = FALSE) 86 | de <- de[de_met, on = "condition"] 87 | 88 | # Pruning - optional 89 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 90 | , classPos = "setosa") 91 | 92 | # Decision importances 93 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 94 | , classPos = "setosa") 95 | 96 | # Filter out decisions with the lowest importance: min_imp = the minimal 97 | # importance in the decision ensemble compared to the maximal one. 98 | # E.g., if min_imp = 0.5, then at least all decisions with an 99 | # importance > 0.5*max(importance) will be kept. 100 | # This ensures that we don't throw out too much. 101 | # Since the decision ensemble is quite small, we don't need to filter much...
102 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 103 | 104 | # Get the network 105 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 106 | , classPos = "setosa") 107 | 108 | # Plot the feature importance/influence and the network 109 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 110 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 111 | } 112 | -------------------------------------------------------------------------------- /man/discretizeDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeDecisions.R 3 | \name{discretizeDecisions} 4 | \alias{discretizeDecisions} 5 | \title{Discretize numerical variables in decision ensemble} 6 | \usage{ 7 | discretizeDecisions( 8 | rules, 9 | data = NULL, 10 | target, 11 | mode = "data", 12 | K = 2, 13 | splitV = NULL, 14 | classPos = NULL, 15 | in_parallel = FALSE, 16 | n_cores = detectCores() - 1, 17 | cluster = NULL 18 | ) 19 | } 20 | \arguments{ 21 | \item{rules}{a data frame with a column "condition".} 22 | 23 | \item{data}{data to discretize.} 24 | 25 | \item{target}{response variable.} 26 | 27 | \item{mode}{whether to discretize variables based on the data distribution (default, mode = 'data') or on the data splits in the model (mode = 'model').} 28 | 29 | \item{K}{numeric, number of categories to create from numeric variables (default: K = 2).} 30 | 31 | \item{splitV}{instead of running discretizeData internally, one can provide a list with, for each variable to discretize in rules, the thresholds delimiting each new category.} 32 | 33 | \item{classPos}{for classification, the positive class.} 34 | 35 | \item{in_parallel}{if TRUE, the function is run in parallel.} 36 | 37 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 38 | 39 | \item{cluster}{the cluster to use to run the function in parallel.} 40 | } 41 | \value{ 42 | Decision ensemble with discretized variables in the condition. Decisions with the same condition are aggregated: their importances are summed, and all other metrics are averaged. 43 | } 44 | \description{ 45 | This function replaces, in a decision ensemble, the boundaries of numerical features by their corresponding levels when the variable is discretized. 46 | If discretized data are not passed, data are first discretized into K categories based on their quantiles (see discretizeData). 47 | The error, prediction, importance and multiplicity of decisions are updated after discretization. 48 | } 49 | \examples{ 50 | library(randomForest) 51 | library(caret) 52 | library(data.table) 53 | 54 | # import data and fit model 55 | data(iris) 56 | mod <- randomForest(Species ~ ., data = iris) 57 | 58 | # Let's get the decision ensemble. One could use the wrapping function 59 | # model2DE(), but we will run each function separately.
60 | 61 | # Get the raw decision ensemble 62 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 63 | , target = iris$Species, classPos = "setosa" 64 | , times = 1 # number of bootstraps, here just one 65 | , discretize = FALSE) # we will discretize outside for the example 66 | summary(de) 67 | # exec = the decision ensemble 68 | # partitions = list of sample indexes for bootstrapping 69 | # if we had done discretization, the new data would be in data_ctg 70 | de <- de$exec 71 | 72 | # Discretize variables into 3 categories - optional 73 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 74 | , K = 3, classPos = "setosa", mode = "data") 75 | data_ctg <- de$data_ctg 76 | de <- de$rules_ctg 77 | 78 | # Homogenize the decision ensemble 79 | de <- de[, condition := sapply(condition, function(x) { 80 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 81 | })] 82 | de <- unique( 83 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 84 | ) 85 | 86 | # Calculate decision metrics; we don't need the importances yet since we will 87 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 88 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 89 | , classPos = "setosa", importances = FALSE) 90 | de <- de[de_met, on = "condition"] 91 | 92 | # Pruning - optional 93 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 94 | , classPos = "setosa") 95 | 96 | # Decision importances 97 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 98 | , classPos = "setosa") 99 | 100 | # Filter out decisions with the lowest importance: min_imp = the minimal 101 | # importance in the decision ensemble compared to the maximal one. 102 | # E.g., if min_imp = 0.5, then at least all decisions with an 103 | # importance > 0.5*max(importance) will be kept. 104 | # This ensures that we don't throw out too much. 105 | # Since the decision ensemble is quite small, we don't need to filter much... 106 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 107 | 108 | # Get the network 109 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 110 | , classPos = "setosa") 111 | 112 | # Plot the feature importance/influence and the network 113 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 114 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 115 | } 116 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/changeDecisionsDummies.R: -------------------------------------------------------------------------------- 1 | # takes decisions and modifies them so that only one level of a multiclass variable is used in decisions 2 | #' @export 3 | changeDecisionsDummies <- function(rules, dummy_var, data, target, classPos = NULL, 4 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 5 | 6 | # get the colnames and column numbers of dummy variables 7 | dum_lev <- lapply(dummy_var, getDummyLevels, colN = colnames(data)) 8 | names(dum_lev) <- dummy_var 9 | dum_colN <- unlist(lapply(dummy_var, FUN = dumColnames, colN = colnames(data))) 10 | 11 | # replace the <=0.5 conditions in rules for all other levels with >0.5 12 | rules <- rules[, to_update := "no"] 13 | # separate levels in rules 14 | if (in_parallel == TRUE) { 15 | if (is.null(cluster) == TRUE) { 16 | message("Initiate parallelisation ... 
") 17 | cluster <- makeCluster(n_cores) 18 | clusterEvalQ(cluster, library(endoR)) 19 | clusterEvalQ(cluster, library(stringr)) 20 | on.exit(stopCluster(cluster)) 21 | } 22 | 23 | rules <- t(parApply( 24 | cl = cluster, rules, MARGIN = 1, FUN = changeSingleRuleDummies, 25 | dum_colN = dum_colN, dum_lev = dum_lev 26 | )) 27 | } else { 28 | rules <- t(apply(rules, 29 | MARGIN = 1, FUN = changeSingleRuleDummies, 30 | dum_colN = dum_colN, dum_lev = dum_lev 31 | )) 32 | } 33 | 34 | 35 | # get one rule per level 36 | tmp <- split(rules, seq(nrow(rules))) 37 | tmp <- lapply(tmp, 38 | FUN = function(x, newN) { 39 | names(x) <- newN 40 | return(x) 41 | }, 42 | newN = colnames(rules) 43 | ) 44 | if (in_parallel == TRUE) { 45 | rules <- parLapply(cl = cluster, tmp, fun = singleRulePerLevel) 46 | } else { 47 | rules <- lapply(tmp, FUN = singleRulePerLevel) 48 | } 49 | 50 | rules <- do.call(what = rbind, rules) 51 | rules <- as.data.table(rules)[, n := as.numeric(n)][, n := sum(n), by = condition] 52 | 53 | if ("err" %in% colnames(rules)) { 54 | rules <- rules[to_update == "yes", to_up := "yes"][, to_update := NULL][, to_update := NULL] 55 | rules <- unique(rules) 56 | # Get the new metrics 57 | if (nrow(rules[to_up == "yes", ]) > 0) { 58 | if ("imp" %in% colnames(rules)) { 59 | importances <- TRUE 60 | colN <- c("len", "err", "support", "pred", "imp", "to_up") 61 | } else { 62 | importances <- FALSE 63 | colN <- c("len", "err", "support", "pred", "to_up") 64 | } 65 | 66 | no_up <- unique(subset(rules, is.na(to_up), select = -to_up)) 67 | tmp1 <- unique(subset(rules, to_up == "yes" & !(condition %in% no_up$condition))[, (colN) := NULL]) 68 | 69 | tmp2 <- getDecisionsMetrics(tmp1, 70 | data = data, target = target, classPos = classPos, 71 | importances = importances, 72 | in_parallel = in_parallel, n_cores = n_cores, cluster = cluster 73 | ) 74 | tmp2 <- tmp1[tmp2, on = "condition"] 75 | # tmp2 <- merge(tmp2, tmp1, all.x = TRUE, by = 'condition' ) 76 | 77 | if (nrow(no_up) > 0) { 78 | rules <- rbind(no_up, tmp2) 79 | } else { 80 | rules <- tmp2 81 | } 82 | 83 | rules <- rules[, c("len", "support", "err", "pred", "n") := lapply(.SD, as.numeric), 84 | .SDcols = c("len", "support", "err", "pred", "n") 85 | ] 86 | rules <- unique(rules[, n := sum(n), by = condition]) 87 | } 88 | } else { 89 | rules <- unique(rules[, to_update := NULL][, to_update := NULL]) 90 | } 91 | 92 | 93 | return(rules) 94 | } 95 | 96 | 97 | ################################################## 98 | getDummyLevels <- function(var, colN) { 99 | sub <- str_which(colN, pattern = paste0("^", var)) 100 | sub <- paste0("X[,", sub, "]<=0.5") 101 | return(sub) 102 | } 103 | dumColnames <- function(var, colN) { 104 | sub <- str_which(colN, pattern = paste0("^", var)) 105 | dum <- rep(var, length(sub)) 106 | names(dum) <- paste0("X[,", sub, "]<=0.5") 107 | return(dum) 108 | } 109 | 110 | ################################################# 111 | changeSingleRuleDummies <- function(rule, dum_colN, dum_lev) { 112 | ruleV <- unlist(strsplit(rule["condition"], " & ")) 113 | if (length(ruleV[ruleV %in% names(dum_colN)]) > 0) { 114 | oriRule <- ruleV[!(ruleV %in% names(dum_colN))] 115 | ruleV <- ruleV[ruleV %in% names(dum_colN)] 116 | names(ruleV) <- dum_colN[ruleV] 117 | vdum <- unique(names(ruleV)) 118 | newV <- c() 119 | for (v in vdum) { 120 | tmp <- names(dum_colN[dum_colN == v]) 121 | tmp <- tmp[!(tmp %in% ruleV)] 122 | tmp <- str_replace(tmp, pattern = "<=", replacement = ">") 123 | if (length(tmp[tmp %in% oriRule]) > 0) next 124 | if (length(tmp) > 
1) { 125 | tmp <- paste0("(", paste(tmp, collapse = " | "), ")") 126 | } 127 | newV <- c(newV, tmp) 128 | } 129 | rule["condition"] <- paste0(sort(c(oriRule, newV)), collapse = " & ") 130 | rule["to_update"] <- "yes" 131 | } 132 | 133 | return(rule) 134 | } 135 | -------------------------------------------------------------------------------- /R/getComplements.R: -------------------------------------------------------------------------------- 1 | getComplements <- function(rules, data, target, classPos = NULL, 2 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 3 | if (!("data.table" %in% class(rules))) { 4 | rules <- as.data.table(rules) 5 | } 6 | 7 | # give rules ID 8 | rules <- rules[, ruleID := paste0("decision", .I)] 9 | 10 | 11 | # get the single complements and rm rules 12 | if (in_parallel == FALSE) { 13 | tmp <- apply(rules, MARGIN = 1, complementSingleRule) 14 | } else { 15 | if (is.null(cluster) == TRUE) { 16 | message("Initiate parallelisation ... ") 17 | cluster <- makeCluster(n_cores) 18 | clusterEvalQ(cluster, library(stringr)) 19 | on.exit(stopCluster(cluster)) 20 | } 21 | 22 | message("Generate additional decisions ... ") 23 | tmp <- parApply(cl = cluster, rules, MARGIN = 1, FUN = complementSingleRule) 24 | } 25 | # the rm decisions 26 | rulesAdd <- unlist(lapply(tmp, function(x) { 27 | x$nR 28 | })) 29 | names(rulesAdd) <- str_replace(names(rulesAdd), pattern = "^decision[:digit:]+\\.", replacement = "") 30 | # the signs 31 | sVar <- unlist(lapply(tmp, function(x) { 32 | x$s 33 | })) 34 | names(sVar) <- str_replace(names(sVar), pattern = "^decision[:digit:]+\\.", replacement = "") 35 | sVar <- unlist(lapply(sVar, FUN = function(x) { 36 | eval(parse(text = x)) 37 | })) 38 | 39 | 40 | # get the directions 41 | directions <- data.table( 42 | ruleID = str_extract(names(sVar), pattern = "^decision[:digit:]+(?=var)"), 43 | var = str_extract(names(sVar), pattern = "(?<=var)[:digit:]+$"), 44 | d = ifelse(sVar == TRUE, 1, -1) 45 | ) 46 | 47 | 48 | # get metrics for the rm 49 | cIx <- str_which(names(rulesAdd), pattern = "varRm") 50 | rulesRm <- data.table( 51 | fullID = names(rulesAdd)[cIx], 52 | ruleID = str_extract(names(rulesAdd)[cIx], pattern = "^decision[:digit:]+(?=var)"), 53 | var = str_extract(names(rulesAdd)[cIx], pattern = "(?<=varRm)[:digit:]+$"), 54 | condition = rulesAdd[cIx] 55 | ) 56 | tmp <- unique(copy(rulesRm)[condition != "", list(condition)]) # remove repetitions due to the dummy variables 57 | if (nrow(tmp) == 0) { 58 | message("There is no interaction in the decision ensemble, only single variable effects.") 59 | } else { 60 | suppressMessages( 61 | tmp <- getDecisionsMetrics(tmp, 62 | data = data, target = target, classPos = classPos, 63 | importances = FALSE, 64 | in_parallel = in_parallel, n_cores = n_cores, cluster = cluster 65 | ) 66 | ) 67 | rulesRm <- tmp[rulesRm, on = "condition"] 68 | } 69 | 70 | # add the metrics for the "empty decisions" (original rule = a single variable only) 71 | tmp <- measureAll(data = data, target = target, classPos = classPos) 72 | rulesRm <- rulesRm[condition == "", `:=`(len = 0, support = 1, err = tmp["err"], pred = tmp["pred"])] 73 | 74 | 75 | # get the absolute support size for all 76 | nall <- nrow(data) 77 | rules <- rules[, "support_abs" := nall * support] 78 | rulesRm <- rulesRm[, "support_abs" := nall * support] 79 | 80 | return(list( 81 | "original" = rules, 82 | "rm" = rulesRm, "directions" = directions 83 | )) 84 | } 85 | 86 | 87 | 88 | 89 | 90 | 
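# Helpers for getComplements(): for each variable of a rule,
# complementSingleRule() builds the rule with that variable removed (the "rm"
# decisions) and an expression giving the direction of each condition
# (d = 1 when the condition selects high values of the variable, -1 otherwise).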
#################################################################### 91 | 92 | 93 | 94 | ### Get the complements for individual rules 95 | complementSingleRule <- function(rule) { 96 | rule <- unlist(rule) 97 | ruleExec <- unlist(strsplit(rule["condition"], split = " & ")) 98 | vars <- str_extract(ruleExec, pattern = "(?<=X\\[,)[:digit:]+") 99 | # order per variable 100 | tmp <- order(vars) 101 | ruleExec <- ruleExec[tmp] 102 | vars <- vars[tmp] 103 | # group conditions on a same variable 104 | in_dup <- which(duplicated(vars)) 105 | if (length(in_dup) > 0) { 106 | for (i in in_dup) { 107 | ruleExec[i - 1] <- paste(ruleExec[i - 1], ruleExec[i], sep = " & ") 108 | } 109 | ruleExec <- ruleExec[-in_dup] 110 | vars <- vars[-in_dup] 111 | } 112 | 113 | ruleID <- rule["ruleID"] 114 | signs <- c("<=" = "a", ">" = "b", "a" = ">", "b" = "<=") # need that trick to be sure replace them all when there are 2! 115 | newRules <- list() 116 | s <- list() 117 | 118 | for (i in 1:length(ruleExec)) { 119 | if (str_detect(ruleExec[i], pattern = "&")) { 120 | comp <- str_replace_all(ruleExec[i], signs) 121 | comp <- paste0("(", str_replace_all(comp, pattern = "&", replacement = "|"), ")") 122 | sV <- paste0("mean(unlist(data[which(", ruleExec[i], "),", vars[i], "])) > mean(unlist(data[which(", comp, "),", vars[i], "]))") 123 | sV <- str_replace_all(sV, pattern = "(?<=\\(| )X(?=\\[)", replacement = "data") 124 | } else { 125 | if (str_detect(ruleExec[i], pattern = ">")) { 126 | sV <- "TRUE" 127 | } else { 128 | sV <- "FALSE" 129 | } 130 | } 131 | 132 | rm <- paste(ruleExec[-i], collapse = " & ") 133 | 134 | newRules[[paste0(ruleID, "varRm", vars[i])]] <- rm 135 | s[[paste0(ruleID, "var", vars[i])]] <- sV 136 | } 137 | 138 | return(list("nR" = newRules, "s" = s)) 139 | } 140 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/README.md: -------------------------------------------------------------------------------- 1 | # endoR 2 | Code and manual of the endoR R-package. 3 | 4 | [![DOI](https://zenodo.org/badge/349814633.svg)](https://zenodo.org/badge/latestdoi/349814633) 5 | 6 | 7 | - author: Albane Ruaud [albane.ruaud@tuebingen.mpg.de](mailto:albane.ruaud@tuebingen.mpg.de) 8 | - maintainer: Albane Ruaud [albane.ruaud@tuebingen.mpg.de](mailto:albane.ruaud@tuebingen.mpg.de) 9 | 10 | Preprint: Albane Ruaud, Niklas A Pfister, Ruth E Ley, Nicholas D Youngblut. Interpreting tree ensemble machine learning models with endoR. bioRxiv (2022). DOI: [10.1101/2022.01.03.474763](https://www.biorxiv.org/content/10.1101/2022.01.03.474763v1) 11 | 12 | 13 | # Abstract 14 | **Motivation:** Tree ensemble machine learning models are increasingly used in microbiome science to explore associations between microbes and their environment, as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. The complex structure of such models enables efficient capture of higher-order interactions to improve predictions, but makes the final model often difficult to interpret. Therefore, while tree ensembles are often the most accurate models for microbiome data, the approach often yields limited insight into how microbial taxa or genomic content may be associated with host phenotypes. 15 | 16 | **Procedure:** endoR is a method that extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 
The fitted model is simplified into a decision ensemble and then reduced via regularization and bootstrapping to retain only the most essential aspects. Information about predictive variables and their pairwise interactions is extracted from the decision ensemble and displayed as a network for straightforward interpretation. The network and importance scores derived from endoR help understand how tree ensemble models make predictions from variables and interactions between variables. 17 | 18 | ![endoR overall workflow](figures/Workflow.png) 19 | 20 | 21 | # Usage 22 | Installation: `devtools::install_github(repo = "aruaud/endoR")` 23 | 24 | Before: select and fit a machine learning model (regression or classification) 25 | - random forest R-packages: randomForest, ranger 26 | - gradient boosted model R-packages: GBM, XGBoost 27 | 28 | 1. Generate the **stable decision ensemble**: 29 | - without bootstrapping, opt. in parallel, with `model2DE()` 30 | - with bootstrapping, opt. in parallel, with `model2DE_resampling()` 31 | - with bootstrapping, each bootstrap run in parallel (recommended): `preCluster()` followed by `model2DE_cluster()` iterated on partitions in the `Q()` function of the clustermq R-package 32 | 2. If bootstrapping, perform **stability selection**: `stabilitySelection()`. Alternatively, use `evaluateAlpha()` to perform stability selection with various values of `alpha` and pick the decision ensemble that can predict as many samples as possible for the lowest `alpha`. 33 | 3. Plot results with `plotFeatures()` and `plotNetwork()` (created with the ggplot2, ggraph and igraph R-packages, arranged with the ggpubr R-package). A minimal end-to-end sketch is given at the end of this README. 34 | 35 | Some common issues: 36 | - if the installation doesn't work, you may need to install R version 4.0.3 and use the inTrees R-package version 1.3 37 | - endoR uses text patterns: make sure to remove all special characters from your column names before using endoR, `colnames(data) <- compatibleNames(colnames(data))` 38 | 39 | 40 | # Background 41 | 42 | **From tree ensemble to decision ensembles** 43 | 44 | ![Schema decisions](figures/Decisions_summary.png) 45 | 46 | A decision consists of a full branch of a decision tree: 47 | - all splits along the branch form the *rule* of the decision; 48 | - the average response variable in the subset of observations following the rule corresponds to the *prediction* of the decision; 49 | - the sample support is the set of observations that follow the rule; 50 | - the error is the average deviation of the prediction from the observed response variable. 51 | 52 | 53 | **Decision importance** 54 | 55 | We measure the decision importance by comparing the error of the decision to what could be expected from random guessing, i.e., always predicting the mean response variable across all samples. Respectively for regressions and classifications, this corresponds to calculating the coefficient of determination R2 and Cohen's kappa. We weight this gain in accuracy by the support of the decision, i.e., the fraction of samples that follow its rule. 56 | 57 | 58 | 59 | 60 | **Feature and interaction importance** = how much a feature, or pair of features, contributes to the accuracy of predictions 61 | 62 | **Feature and interaction influence** = how a feature, or pair of features, changes predictions (e.g., increases the predicted age or changes the prediction towards "disease" vs "healthy" state) 63 | 64 | **Regularization** to obtain a stable decision ensemble that is generalizable 65 | - decision pruning = remove unimportant variables from decision rules (i.e., the ones that do not increase the decision error by more than a threshold) 66 | - bootstrapping and stability selection = select decisions measured as important across bootstrap resamples (i.e., across bootstraps of observations/samples)
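A minimal end-to-end sketch of this workflow (illustrative only: the function calls follow the package examples, but `times = 2` and `alpha_error = 1` are arbitrary choices, and no HPC/clustermq setup is shown):

```r
library(randomForest)
library(endoR)

# Fit a tree ensemble (here, a random forest on the iris data)
data(iris)
mod <- randomForest(Species ~ ., data = iris)

# 1. Decision ensemble with bootstrapping
endo <- model2DE_resampling(model = mod, model_type = "rf",
                            data = iris[, -5], target = iris$Species,
                            classPos = "setosa", times = 2, filter = FALSE)

# 2. Stability selection (use evaluateAlpha() to compare several alphas)
de_final <- stabilitySelection(rules = endo, alpha_error = 1)

# 3. Plots: feature importances/influences and the interaction network
# (the network may be empty if the ensemble contains no interactions)
plotFeatures(de_final, levels_order = c("Low", "Medium", "High"))
plotNetwork(de_final, hide_isolated_nodes = FALSE)
```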
--------------------------------------------------------------------------------