56 |
57 | Pull requests will be evaluated against the following checklist:
58 |
59 | 1. __Motivation__. Your pull request should clearly and concisely motivate the
60 | need for change. Please describe the problem your PR addresses and show
61 | how your pull request solves it as concisely as possible.
62 |
63 | Also include this motivation in `NEWS` so that when a new release of
64 | `SmartML` comes out it's easy for users to see what's changed. Add your
65 | item at the top of the file and use markdown for formatting. The
66 | news item should end with `(@yourGithubUsername, #the_issue_number)`.
67 |
68 | 2. __Only related changes__. Before you submit your pull request, please
69 | check to make sure that you haven't accidentally included any unrelated
70 | changes. These make it harder to see exactly what's changed, and to
71 | evaluate any unexpected side effects.
72 |
73 | Each PR corresponds to a git branch, so if you expect to submit
74 | multiple changes make sure to create multiple branches. If you have
75 | multiple changes that depend on each other, start with the first one
76 | and don't submit any others until the first one has been processed.
77 |
78 | 3. __Documentation__. If you're adding new parameters or a new function, you'll
79 | also need to document them with [roxygen](https://github.com/klutometis/roxygen),
80 | as sketched below. Make sure to re-run `devtools::document()` on the code before submitting.
81 |
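For example, a minimal roxygen sketch for a new exported helper might look like
the following (the function and its parameter are hypothetical and only for
illustration):

```r
#' @title Count class labels
#'
#' @description Return the number of instances per class label in a dataset.
#'
#' @param dataset Dataframe containing a \code{class} column.
#'
#' @return Table of class counts.
#'
#' @export
countClassLabels <- function(dataset) {
  # table() gives one count per observed class label
  table(dataset$class)
}
```

After adding or changing roxygen comments, running `devtools::document()`
regenerates the `NAMESPACE` file and the `.Rd` help pages.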
82 | This seems like a lot of work, but don't worry if your pull request isn't
83 | perfect. Submitting pull requests is a learning process, and unless you've
84 | submitted a few in the past it's unlikely that yours will be accepted as is
85 | on the first attempt. Please don't submit pull requests that change existing
86 | behaviour; instead, think about how you can add a new feature in a minimally
87 | invasive way.
88 |
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: SmartML
2 | Version: 0.3.0
3 | Title: Machine Learning Automation
4 | Authors@R:
5 | c(person(given = "Mohamed",
6 | family = "Maher",
7 | email = "s-mohamed.zenhom@zewailcity.edu.eg",
8 | role = c("aut", "cre")),
9 | person(given = "Sherif",
10 | family = "Sakr",
11 | email = "sherif.sakr@ut.ee",
12 | role = "aut"),
13 | person(given = "Bruno Rucy",
14 | family = "Carneiro Alves de Lima",
15 | email = "brurucy@protonmail.ch",
16 | role = "ctb"))
17 | Description: This package is a meta-learning based framework for automated selection and hyper-parameter tuning for machine learning algorithms. Being meta-learning based, the framework is able to simulate the role of the machine learning expert. In particular, the framework is equipped with a continuously updated knowledge base that stores information about statistical meta features of all processed datasets along with the associated performance of the different classifiers and their tuned parameters. Thus, for any new dataset, SmartML automatically extracts its meta features and searches its knowledge base for the best performing algorithm to start its optimization process. In addition, SmartML makes use of the new runs to continuously enrich its knowledge base to improve its performance and robustness for future runs.
18 | License: GPL-3
19 | Encoding: UTF-8
20 | LazyData: false
21 | Imports:
22 | devtools, R.utils, stats, tictoc, e1071, BBmisc, kknn, purrr, xgboost, ranger,
23 | KernSmooth, data.table, randomForest, rpart, glmnet, nloptr, bbotk
24 | Suggests:
25 | knitr,
26 | covr,
27 | testthat,
28 | rmarkdown
29 | Depends:
30 | mlr3,
31 | mlr3learners,
32 | mlr3pipelines,
33 | mlr3filters
34 | RoxygenNote: 7.1.1
35 | VignetteBuilder: knitr
36 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 | export(autoRLearn)
4 | export(autoRLearn_)
5 | export(evocate)
6 | export(runClassifier)
7 | import(RWeka)
8 | import(caret)
9 | import(devtools)
10 | import(farff)
11 | import(ggplot2)
12 | import(mice)
13 | import(purrr)
14 | import(rjson)
15 | importFrom(BBmisc,normalize)
16 | importFrom(C50,C5.0)
17 | importFrom(C50,C5.0Control)
18 | importFrom(FNN,knn)
19 | importFrom(KernSmooth,bkde)
20 | importFrom(KernSmooth,dpik)
21 | importFrom(LiblineaR,LiblineaR)
22 | importFrom(MASS,lda)
23 | importFrom(R.utils,withTimeout)
24 | importFrom(RCurl,getURL)
25 | importFrom(RMySQL,MySQL)
26 | importFrom(RMySQL,dbConnect)
27 | importFrom(RMySQL,dbDisconnect)
28 | importFrom(RMySQL,dbSendQuery)
29 | importFrom(RMySQL,fetch)
30 | importFrom(UBL,SmoteClassif)
31 | importFrom(caret,confusionMatrix)
32 | importFrom(caret,plsda)
33 | importFrom(data.table,fcase)
34 | importFrom(deepboost,deepboost)
35 | importFrom(deepboost,deepboost.predict)
36 | importFrom(dplyr,arrange)
37 | importFrom(dplyr,case_when)
38 | importFrom(dplyr,distinct)
39 | importFrom(dplyr,filter)
40 | importFrom(dplyr,group_by)
41 | importFrom(dplyr,mutate)
42 | importFrom(dplyr,mutate_if)
43 | importFrom(dplyr,n)
44 | importFrom(dplyr,select)
45 | importFrom(dplyr,top_frac)
46 | importFrom(e1071,kurtosis)
47 | importFrom(e1071,naiveBayes)
48 | importFrom(e1071,skewness)
49 | importFrom(e1071,svm)
50 | importFrom(fastNaiveBayes,fnb.train)
51 | importFrom(graphics,plot)
52 | importFrom(httr,POST)
53 | importFrom(httr,content)
54 | importFrom(iml,FeatureImp)
55 | importFrom(iml,Interaction)
56 | importFrom(iml,Predictor)
57 | importFrom(imputeMissings,compute)
58 | importFrom(imputeMissings,impute)
59 | importFrom(ipred,bagging)
60 | importFrom(klaR,rda)
61 | importFrom(mda,bruto)
62 | importFrom(mda,fda)
63 | importFrom(mda,gen.ridge)
64 | importFrom(mda,mars)
65 | importFrom(mda,polyreg)
66 | importFrom(nnet,nnet)
67 | importFrom(randomForest,randomForest)
68 | importFrom(ranger,ranger)
69 | importFrom(rjson,fromJSON)
70 | importFrom(rpart,rpart)
71 | importFrom(rpart,rpart.control)
72 | importFrom(stats,complete.cases)
73 | importFrom(stats,dnorm)
74 | importFrom(stats,glm)
75 | importFrom(stats,na.omit)
76 | importFrom(stats,pnorm)
77 | importFrom(stats,predict)
78 | importFrom(stats,rnorm)
79 | importFrom(stats,runif)
80 | importFrom(stats,setNames)
81 | importFrom(stats,var)
82 | importFrom(tictoc,tic)
83 | importFrom(tictoc,toc)
84 | importFrom(tidyr,drop_na)
85 | importFrom(tidyr,gather)
86 | importFrom(tidyr,separate)
87 | importFrom(tidyr,spread)
88 | importFrom(tidyr,unite)
89 | importFrom(truncnorm,dtruncnorm)
90 | importFrom(truncnorm,rtruncnorm)
91 | importFrom(utils,capture.output)
92 | importFrom(utils,head)
93 | importFrom(utils,read.csv)
94 | importFrom(xgboost,xgb.DMatrix)
95 | importFrom(xgboost,xgboost)
96 |
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 | # SmartML 0.3.0.1
2 |
3 | * Hotfix: fixed some dependency issues related to dplyr
4 |
5 | # SmartML 0.3.0
6 |
7 | ## Features
8 |
9 | * Added the high-performing Ranger, XGBoost, fastNaiveBayes and LiblineaR algorithms
10 | * Added the autoRLearn_ function, which assumes the data is already in perfect shape and loads it from a dataframe, unlike autoRLearn, which can only load from a data file outside R.
11 | * Added Hyperband and Bayesian Optimization Hyperband (BOHB) to the new autoRLearn_
12 | * Added some extra temporary dependencies which will be removed in the following months (all tidyverse packages other than purrr)
13 | * Fixed some small mistakes in the code and JSONs
14 |
15 | ## Current Roadmap
16 |
17 | * Fix meta-learning; at the moment it doesn't work because of a problem with the AWS server we are using.
18 | * Change the dplyr back end to use data.table with dtplyr
19 | * Merge autoRLearn and autoRLearn_ into a single function that can load both from a data file and from an in-memory dataframe.
20 | * Rewrite SMAC, as requested by Sherif.
21 |
22 | ## Extra info
23 |
24 | * brurucy is a new and active maintainer
25 | * Nightly and experimental versions, independent from the Data Systems Lab, are being developed at https://github.com/brurucy/witchcraft
26 | * Updates will be conservative and focused on non-breaking changes, up to release 1.0.
27 |
--------------------------------------------------------------------------------
/R/autoRLearn.R:
--------------------------------------------------------------------------------
1 | #' @title Run smartML function for automatic Supervised Machine Learning.
2 | #'
3 | #' @description Run the smartML main function for automatic classifier algorithm selection, and hyper-parameter tuning.
4 | #'
5 | #' @param maxTime Float numeric of the maximum time budget, in minutes, for reading the dataset, preprocessing, calculating meta-features, algorithm selection and hyper-parameter tuning (excluding model interpretability) - This is applicable in case of \code{option} = 2 only.
6 | #' @param directory String Character of the training dataset directory (SmartML accepts file formats arff/(csv with columns headers) ).
7 | #' @param testDirectory String Character of the testing dataset directory (SmartML accepts file formats arff/(csv with columns headers) ).
8 | #' @param classCol String Character of the name of the class label column in the dataset (default = 'class').
9 | #' @param vRatio Float numeric of the validation set ratio that should be split out of the training set for the evaluation process (default = 0.3 --> 30\%).
10 | #' @param preProcessF Vector of string Character containing the names of the preprocessing algorithms to apply (default = c('standardize', 'zv')):
11 | #' \itemize{
12 | #' \item "boxcox" - apply a Box–Cox transform and values must be non-zero and positive in all features,
13 | #' \item "yeo-Johnson" - apply a Yeo-Johnson transform, like a BoxCox, but values can be negative,
14 | #' \item "zv" - remove attributes with a zero variance (all the same value),
15 | #' \item "center" - subtract mean from values,
16 | #' \item "scale" - divide values by standard deviation,
17 | #' \item "standardize" - perform both centering and scaling,
18 | #' \item "normalize" - normalize values,
19 | #' \item "pca" - transform data to the principal components,
20 | #' \item "ica" - transform data to the independent components.
21 | #' }
22 | #' @param featuresToPreProcess Vector of number of features to perform the feature preprocessing on - In case of empty vector, this means to include all features in the dataset file (default = c()) - This vector should be a subset of \code{selectedFeats}.
23 | #' @param nComp Integer numeric of Number of components needed if either "pca" or "ica" feature preprocessors are needed.
24 | #' @param nModels Integer numeric representing the number of classifier algorithms that you want to select based on Meta-Learning and start to tune using Bayesian Optimization (default = 5).
25 | #' @param option Integer numeric: 1 means only classifier algorithm selection is performed; 2 (the default) means algorithm selection plus hyper-parameter tuning.
26 | #' @param featureTypes Vector of either 'numerical' or 'categorical' representing the types of features in the dataset (default = c() --> any factor or character features will be considered as categorical otherwise numerical).
27 | #' @param interp Boolean representing whether model interpretability (Feature Importance and Interaction) is needed or not (default = FALSE). This option will take more of the time budget if set to TRUE.
28 | #' @param missingOpr Boolean variable: FALSE applies median/mode imputation for instances with missing values; TRUE applies imputation using the "MICE" library, which imputes missing values with plausible data values drawn from a distribution designed for each missing datapoint.
29 | #' @param balance Boolean variable representing whether SMOTE class balancing is required or not (default = FALSE).
30 | #' @param metric Metric of string character to be used in evaluation:
31 | #' \itemize{
32 | #' \item "acc" - Accuracy,
33 | #' \item "avg-fscore" - Average of F-Score of each label,
34 | #' \item "avg-recall" - Average of Recall of each label,
35 | #' \item "avg-precision" - Average of Precision of each label,
36 | #' \item "fscore" - Micro-Average of F-Score of each label,
37 | #' \item "recall" - Micro-Average of Recall of each label,
38 | #' \item "precision" - Micro-Average of Precision of each label.
39 | #' }
40 | #'
41 | #' @return List of Results
42 | #' \itemize{
43 | #' \item "option=1" - Chosen classifier algorithm names \code{clfs} with their parameter configurations \code{params}, Training DataFrame \code{TRData}, Test DataFrame \code{TEData},
44 | #' \item "option=2" - Best classifier algorithm name found \code{clfs} with its parameter configuration \code{params}, Training DataFrame \code{TRData}, Test DataFrame \code{TEData}, model variable \code{model}, predicted values on test set \code{pred}, performance on TestingSet \code{perf}, and Feature Importance \code{interpret$featImp} / Interaction \code{interpret$Interact} plots in case of interpretability \code{interp} = TRUE and chosen model is not knn.
45 | #' }
46 | #'
47 | #' @examples
48 | #' \dontrun{
49 | #' autoRLearn(1, 'sampleDatasets/car/train.arff',
50 | #'            'sampleDatasets/car/test.arff', option = 2, preProcessF = 'normalize')
51 | #'
52 | #' result <- autoRLearn(10, 'sampleDatasets/shuttle/train.arff', 'sampleDatasets/shuttle/test.arff')
53 | #' }
54 | #'
55 | #' @importFrom tictoc tic toc
56 | #' @importFrom R.utils withTimeout
57 | #' @importFrom graphics plot
58 | #' @import ggplot2
59 | #'
60 | #' @export autoRLearn
61 |
62 | autoRLearn <- function(maxTime, directory, testDirectory, classCol = 'class', metric = 'acc', vRatio = 0.3, preProcessF = c('standardize', 'zv'), featuresToPreProcess = c(), nComp = NA, nModels = 5, option = 2, featureTypes = c(), interp = FALSE, missingOpr = FALSE, balance = FALSE) {
63 | #Set Seed
64 | set.seed(22)
65 | #Read Dataset
66 | datasetReadError <- try(
67 | {
68 | #Read Training Dataset
69 | dataset <- readDataset(directory, testDirectory, classCol = classCol, vRatio = vRatio, preProcessF = preProcessF, featuresToPreProcess = featuresToPreProcess, nComp = nComp, missingOpr = missingOpr, metric = metric, balance = balance)
70 | trainingSet <- dataset$TD
71 | #Read Testing Dataset
72 | testDataset <- dataset$TED
73 | #Read all training Dataset without validation
74 | trainDataset <- dataset$FULLTD
75 | })
76 | if(inherits(datasetReadError, "try-error")){
77 | print('Error: Failed Reading Dataset: Make sure that dataset directory is correct and it is a valid csv/arff file.')
78 | return(-1)
79 | }
80 |
81 | #Calculate Meta-Features for the dataset
82 | metaFeaturesError <- try(
83 | {
84 | metaFeatures <- computeMetaFeatures(trainingSet, maxTime, featureTypes)
85 | })
86 | if(inherits(metaFeaturesError, "try-error")){
87 | print('Error: Failed Extracting Dataset MetaFeatures.')
88 | return(-1)
89 | }
90 |
91 | splitError <- try(
92 | {
93 | #Convert Categorical Features to Numerical Ones and split the dataset
94 | B <- max(10, as.integer((metaFeatures$nInstances) / 2000)) #Number of folds to work on for the dataset and trees in SMAC forest model
95 |
96 | dataset <- convertCategorical(dataset, trainDataset, testDataset, B = B)
97 | validationSet <- dataset$VD #Validation set
98 | trainingSet <- dataset$TD #Training Set
99 | foldedSet <- dataset$FD #Folded sets of Training Data.
100 | #Convert for all TrainingSet
101 | trainDataset <- dataset$FULLTD
102 | #Convert for all TestingSet
103 | testDataset <- dataset$TED
104 | })
105 | if(inherits(splitError, "try-error")){
106 | print('Error: Failed Splitting Dataset.')
107 | return(-1)
108 | }
109 |
110 | #Generate candidate classifiers
111 | candidateClfsError <- try(
112 | {
113 | nClassifiers <- 15
114 | output <- getCandidateClassifiers(maxTime, metaFeatures, min(c(nModels, nClassifiers)) )
115 | algorithms <- output$c #Classifier Algorithm names selected.
116 | tRatio <- output$r #Time ratio between all classifiers.
117 | algorithmsParams <- output$p #Initial Parameter configuration of each classifier.
118 | })
119 | if(inherits(candidateClfsError, "try-error")){
120 | print('Error: Can not generate Candidate classifiers.')
121 | return(-1)
122 | }
123 |
124 | tryCatch({
125 | #Option 1: Only Candidate Classifiers with initial parameters will be resulted (No Hyper-parameter tuning)
126 | if(option == 1 && length(algorithms) == length(algorithmsParams))
127 | return (list(clfs = algorithms, params = algorithmsParams, TRData = dataset$FULLTD, TEData = dataset$TED))
128 | else if(option == 1)
129 | return ('Error: Failed to Connect to KnowledgeBase, Option 1 can not be executed')
130 |
131 | #Option 2: Classifier Algorithm Selection + Parameter Tuning
132 | res <- withTimeout({
133 | #variables to hold best classifiers
134 | bestAlgorithm <- '' #bestClassifierName.
135 | bestAlgorithmPerf <- 0 #bestClassifierPerformance.
136 | bestAlgorithmParams <- list() #Parameters of best Classifier.
137 |
138 | #loop over each classifier
139 | for(i in 1:length(algorithms)){
140 | classifierAlgorithm <- algorithms[[i]]
141 | if (i <= length(algorithmsParams))
142 | classifierAlgorithmParams <- algorithmsParams[[i]]
143 | else
144 | classifierAlgorithmParams <- '' #use the default initial parameter configuration
145 |
146 | #Read maxTime for the current classifier algorithm and convert to seconds
147 | maxClfTime <- tRatio[i] * 60
148 | #Read the current classifier default parameter configuration
149 | classifierConf <- getClassifierConf(classifierAlgorithm)
150 | cat('\nStart Tuning Classifier Algorithm: ', classifierAlgorithm, '\n')
151 | #initialize step
152 | R <- initialize(classifierAlgorithm, classifierConf, classifierAlgorithmParams)
153 | cntParams <- R[, -which(names(R) == "performance")]
154 | #start hyperParameter tuning till maximum Time
155 | tic(quiet = TRUE)
156 | timeTillNow <- 0
157 | #Regression Random Forest Trees for training set folds
158 | tree <- data.frame(fold=integer(), parent=integer(), params=character(), rightChild=integer(), leftChild=integer(), performance=double(), rowN = integer())
159 | bestParams <- cntParams
160 | bestPerf <- c()
161 | classifierFailureCounter <- 0
162 |
163 | repeat{
164 | gc()
165 | #Fit Model
166 | output <- fitModel(bestParams, bestPerf, trainingSet, validationSet, foldedSet, classifierAlgorithm, tree, B = B)
167 | #If this classifier has failed more than twice, skip to the next classifier
168 | if((length(bestPerf) > 0 && mean(bestPerf) == 0) || length(bestPerf) == 0){
169 | classifierFailureCounter <- classifierFailureCounter + 1
170 | if(classifierFailureCounter > 2) break
171 | }
172 | tree <- output$t
173 | bestPerf <- output$p
174 | bestParams <- output$bp
175 | #Select Candidate Classifier Configurations
176 | candidateConfs <- selectConfiguration(R, classifierAlgorithm, tree, bestParams, B = B)
177 | #Intensify
178 | if(nrow(candidateConfs) > 0){
179 | output <- intensify(R, bestParams, bestPerf, candidateConfs, foldedSet, trainingSet, validationSet, classifierAlgorithm, maxClfTime, timeTillNow, B = B, metric = metric)
180 | bestParams <- output$params
181 | bestPerf <- output$perf
182 | timeTillNow <- output$timeTillNow
183 | classifierFailureCounter <- classifierFailureCounter + output$fails
184 | R <- output$r
185 | }
186 | #Check if execution time exceeded the allowed time or not
187 | t <- toc(quiet = TRUE)
188 | timeTillNow <- timeTillNow + t$toc - t$tic
189 | tic(quiet = TRUE)
190 | if(timeTillNow > maxClfTime){
191 | if(mean(bestPerf) > mean(bestAlgorithmPerf)){
192 | bestAlgorithmPerf <- bestPerf
193 | bestAlgorithm <- classifierAlgorithm
194 | bestAlgorithmParams <- bestParams
195 | #cat('Best Classifier:', bestAlgorithm, ' --> Performance:', bestAlgorithmPerf, '\n')
196 | }
197 | break
198 | }
199 |
200 | }
201 | }
202 |
203 | },timeout = maxTime * 60, cpu = maxTime * 60)
204 | }, TimeoutException = function(ex) {
205 | message("NOTE: Time Budget allowed has been finished.")
206 | })
207 |
208 | print("Time limit for the tuning process has been reached. Training the best classifier found over the whole training set now.")
209 | if (bestAlgorithm != '')
210 | bestAlgorithmParams <- bestAlgorithmParams[,names(bestAlgorithmParams) != "EI" & names(bestAlgorithmParams) != "performance"]
211 | else{
212 | bestAlgorithm <- algorithms[[1]]
213 | bestAlgorithmParams <- algorithmsParams[[1]]
214 | }
215 |
216 | trainFinalModelError <- try(
217 | {
218 | #Run Classifier over all training set and check performance on testing set
219 | finalResult <- runClassifier(trainingSet = trainDataset, validationSet = testDataset, params = bestAlgorithmParams, classifierAlgorithm = bestAlgorithm, metric = metric, interp = interp)
220 | finalResult$clfs <- bestAlgorithm
221 | finalResult$params <- bestAlgorithmParams
222 | #save results to Temporary File
223 | query <- sendToTmp(metaFeatures, bestAlgorithm, bestAlgorithmParams, finalResult$perf, nModels, metric)
224 | #check internet connection and send data in tmp file to database if connection exists
225 | if(checkInternet() == TRUE){
226 | sendToDatabase(query)
227 | }
228 | })
229 | if(inherits(trainFinalModelError, "try-error")){
230 | print('Error: Not enough computational resources. Cannot build a model over the current dataset!')
231 | }
232 |
233 |
234 | finalResult$TRData = dataset$FULLTD
235 | finalResult$TEData = dataset$TED
236 | return(finalResult)
237 | }
238 |
--------------------------------------------------------------------------------
/R/autoRLearn_.R:
--------------------------------------------------------------------------------
1 | #' @title Advanced version of autoRLearn.
2 | #'
3 | #' @description Tunes the hyperparameters of the desired algorithm/s using either hyperband or BOHB.
4 | #'
5 | #' @param df_train Dataframe of the training dataset. Assumes it is in perfect shape with all numeric variables and factor response variable named "class".
6 | #' @param df_test Dataframe of the test dataset. Assumes it is in perfect shape with all numeric variables and factor response variable named "class".
7 | #' @param maxTime Float representing the maximum time the algorithm should be run (in minutes).
8 | #' @param models List of strings denoting which algorithms to use for the process:
9 | #' \itemize{
10 | #' \item "randomForest" - Random forests using the randomForest package
11 | #' \item "ranger" - Random forests using the ranger package (unstable)
12 | #' \item "naiveBayes" - Naive Bayes using the fastNaiveBayes package
13 | #' \item "boosting" - Gradient boosting using xgboost
14 | #' \item "l2-linear-classifier" - Linear primal Support vector machine from LibLinear
15 | #' \item "svm" - RBF kernel svm from e1071
16 | #' }
17 | #' @param optimizationAlgorithm - String of which hyperparameter tuning algorithm to use:
18 | #' \itemize{
19 | #' \item "hyperband" - Hyperband with uniformly initiated parameters
20 | #' \item "bohb" - Hyperband with Bayesian optimization as described in the 2018 BOHB paper by Falkner, Klein and Hutter. Has extra parameters bw and kde_type
21 | #' }
22 | #' @param bw - (only applies to BOHB) Double representing how much the KDE bandwidth should be widened. Higher values allow the algorithm to explore more hyperparameter combinations
23 | #' @param max_iter - (affects both hyperband and BOHB) Integer representing the maximum number of iterations that one successive halving run can have
24 | #' @param kde_type - (only applies to BOHB) String representing whether a model's hyperparameters should be tuned individually of each other or have their probability densities multiplied:
25 | #' \itemize{
26 | #' \item "single" - each hyperparameter has its own expected improvement calculated
27 | #' \item "mixed" - all hyperparameters' probability densities are multiplied and only one mixed expected improvement is calculated
28 | #' }
29 | #' @param metric String of the evaluation metric to be used in the model performance optimization:
30 | #' \itemize{
31 | #' \item "acc" - Accuracy,
32 | #' \item "avg-fscore" - Average of F-Score of each label,
33 | #' \item "avg-recall" - Average of Recall of each label,
34 | #' \item "avg-precision" - Average of Precision of each label,
35 | #' \item "fscore" - Micro-Average of F-Score of each label,
36 | #' \item "recall" - Micro-Average of Recall of each label,
37 | #' \item "precision" - Micro-Average of Precision of each label.
38 | #' }
39 | #' @return List of Results
40 | #' \itemize{
41 | #' \item \code{perf} - Evaluated metric of the best performing model on the test data
42 | #' \item \code{pred} - prediction on the test data using the best model
43 | #' \item \code{model} - best model object
44 | #' \item \code{best_models} - table with the best hyperparameters found for the selected models.
45 | #' }
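#'
#' @examples
#' \dontrun{
#' # A minimal sketch; it assumes df_train and df_test are already-prepared
#' # numeric dataframes with a factor column named "class" (e.g. a random
#' # split of iris with Species renamed to class).
#' res <- autoRLearn_(df_train, df_test, maxTime = 2,
#'                    models = c("randomForest", "svm"),
#'                    optimizationAlgorithm = "bohb", kde_type = "single")
#' res$best_models
#' res$perf
#' }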
46 |
47 | #' @importFrom R.utils withTimeout
48 | #' @importFrom tictoc tic toc
49 | #' @importFrom stats na.omit runif
50 | #' @importFrom utils head
51 |
52 | #' @export autoRLearn_
53 | autoRLearn_ <- function(df_train, df_test, maxTime = 10,
54 | models = c("randomForest", "naiveBayes", "boosting", "l2-linear-classifier", "svm"),
55 | optimizationAlgorithm = "hyperband", bw = 3, kde_type = "single",
56 | max_iter = 81, metric = "acc") {
57 |
58 | total_time <- maxTime * 60
59 | parameters_per_model <- map_int(models, .f = ~ length(jsons[[.x]]$params))
60 | times <- (parameters_per_model * total_time) / (sum(parameters_per_model))
61 |
62 | print("Time distribution:")
63 | print(times)
64 | print("Models selected:")
65 | print(models)
66 |
67 | run_optimization <- function(model, time) {
68 | results <- NULL
69 | priors <- data.frame()
70 |
71 | tic(model, "optimization time:")
72 |
73 | if(optimizationAlgorithm == "hyperband") {
74 | current <- Sys.time() %>% as.integer()
75 | end <- (Sys.time() %>% as.integer()) + time
76 | repeat {
77 | gc(verbose = F)
78 | tic("current hyperband runtime")
79 | print(paste("started", model))
80 | time_left <- max(end - (Sys.time() %>% as.integer()), 1)
81 | print(paste("There are:", time_left, "seconds left for this hyperband run"))
82 | res <- hyperband(df = df_train, model = model, max_iter = max_iter, maxtime = time_left)
83 | if(is_empty(flatten(res)) == F) {
84 | res <- res %>%
85 | map_dfr(.f = ~ .x[["answer"]]) %>%
86 | arrange(desc(acc)) %>%
87 | head(1)
88 | results <- c(list(res), results)
89 | print(paste('Best accuracy from hyperband this round: ', res$acc))
90 | }
91 | elapsed <- (Sys.time() %>% as.integer()) - current
92 | if(elapsed >= time) {
93 | break
94 | }
95 | }
96 | }
97 |
98 | else if(optimizationAlgorithm == "bohb") {
99 | current <- Sys.time() %>% as.integer()
100 | end <- (Sys.time() %>% as.integer()) + time
101 | repeat {
102 | gc(verbose = F)
103 | tic("current bohb time")
104 | print(paste("started", model))
105 | time_left <- max(end - (Sys.time() %>% as.integer()), 1)
106 | print(paste("There are:", time_left, "seconds left for this bohb run"))
107 | res <- bohb(df = df_train, model = model, bw = bw, max_iter = max_iter, maxtime = time_left,
108 | priors = priors, kde_type = kde_type)
109 | if(is_empty(flatten(res)) == F) {
110 | priors <- res %>%
111 | map_dfr(.f = ~ .x[["sh_runs"]])
112 | res <- res %>%
113 | map_dfr(.f = ~ .x[["answer"]]) %>%
114 | arrange(desc(acc)) %>%
115 | head(1)
116 | results <- c(list(res), results)
117 | print(paste('Best accuracy from bohb this round: ', res$acc))
118 | }
119 | elapsed <- (Sys.time() %>% as.integer()) - current
120 | if(elapsed >= time) {
121 | break
122 | }
123 | }
124 | }
125 |
126 | else {
127 | stop("Only hyperband and bohb are valid optimization algorithms at this moment.")
129 | }
130 |
131 | toc()
132 | results
133 | }
134 |
135 | print("Starting to run all optimizations.")
136 | ans <- vector(mode = "list", length = length(models))
137 |
138 | for(i in 1:length(models)) {
139 | flag <- TRUE
140 | #tryCatch(expr = {
141 | ans[[i]] <- run_optimization(models[[i]], times[[i]])
142 | #}, error = function(e) {
143 | # print("Error spotted, going to the next model!")
144 | # flag <<- FALSE
145 | #})
146 | if (!flag) next
147 | }
148 |
149 | print(ans)
150 | ans <- ans %>%
151 | map(.f = ~ map_dfr(.x = .x, .f = ~ .x %>% select(model, params, acc))) %>%
152 | map_dfr(.f = ~ .x %>% arrange(desc(acc)) %>% head(1)) %>%
153 | arrange(desc(acc))
154 | best_model <- ans %>% head(1)
155 | final_evaluation <- eval_loss(model = best_model[["model"]], train_df = df_train, test_df = df_test,
156 | params = best_model[["params"]])
157 | final_evaluation$best_models <- ans
158 | print(paste("Winner:", best_model$model, "test accuracy:", final_evaluation$perf))
159 | final_evaluation
160 |
161 | }
162 |
163 |
--------------------------------------------------------------------------------
/R/bohb.R:
--------------------------------------------------------------------------------
1 | #' @importFrom dplyr distinct n group_by
2 |
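#' @title Internal BOHB driver.
#' @description A sketch of the control flow below: the successive-halving
#'   brackets of one Hyperband iteration are run under a time budget
#'   (\code{withTimeout}). The first bracket draws configurations uniformly at
#'   random; later brackets mix KDE-based resampling of earlier evaluations
#'   (\code{successive_resampling}) with a \code{random_frac} share of fresh
#'   random configurations, accumulating all runs in \code{runs_df}.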
3 | #' @keywords internal
4 | bohb <- function(df, model, max_iter = 81, eta = 3, bw = 3, random_frac = 1/3,
5 | maxtime, priors = data.frame(), kde_type = "single") {
6 | logeta = as_mapper(~ log(.x) / log(eta))
7 | s_max = trunc(logeta(max_iter))
8 | B = (s_max + 1) * max_iter
9 | nrs = map_dfc(s_max:0, .f = ~ calc_n_r(max_iter, eta, .x, B)) %>%
10 | t() %>%
11 | `colnames<-`(value = c("n", "r")) %>%
12 | as_tibble()
13 | nrs$s = s_max:0
14 | length_params <- length(jsons[[model]]$params)
15 |
16 | tryCatch(expr = {withTimeout(expr = {
17 | liszt = vector(mode = "list",
18 | length = max(nrs$s) + 1)
19 | runs_df <- NULL
20 | current_sh_run <- NULL
21 | for (row in 1:nrow(nrs)) {
22 | if(row == 1) {
23 | print(paste("Iteration number", row))
24 | #print(paste("n = ", nrs[[row, 1]], " r = ", nrs[[row, 2]], " s_max = ", nrs[[row, 3]], sep = ""))
25 | current_sh_run <- successive_halving(df = df,
26 | params_config = sample_n_params(n = nrs[[row, 1]],
27 | model = model),
28 | n = nrs[[row, 1]],
29 | r = nrs[[row, 2]],
30 | s_max = nrs[[row, 3]],
31 | max_iter = max_iter,
32 | eta = eta,
33 | evaluations = priors)
34 | runs_df <- runs_df %>%
35 | bind_rows(current_sh_run$sh_runs)
36 | liszt[[row]] <- current_sh_run
37 | next
38 | }
39 | else if(row > 1){
40 | bayesian_opt_samples <- successive_resampling(df = runs_df,
41 | model = model,
42 | samples = max_iter,
43 | n = round(max(nrs[[row, 1]] * (1 - random_frac), 1)),
44 | bw = bw,
45 | kde_type = kde_type)
46 |
47 | current_sh_run <- successive_halving(df = df,
48 | params_config = bayesian_opt_samples %>%
49 | bind_rows(sample_n_params(n = round(max(nrs[[row, 1]] * random_frac, 1)), model = model)),
50 | n = nrs[[row, 1]],
51 | r = nrs[[row, 2]],
52 | s_max = nrs[[row, 3]],
53 | max_iter = max_iter,
54 | eta = eta)
55 | }
56 | runs_df <- runs_df %>%
57 | bind_rows(current_sh_run$sh_runs)
58 | liszt[[row]] <- current_sh_run
59 | }
60 | }, timeout = maxtime, cpu = maxtime)},
61 |
62 | TimeoutException = function(ex) {
63 | print("Budget ended.")
64 | return(liszt)
65 | },
66 |
67 | finally = {
68 | print("BOHB successfully finished.")
69 | },
70 |
71 | error = function(ex) {
72 | print(paste("Error found in model ", model, sep = ""))
73 | print(geterrmessage())
74 | })
78 |
79 | return(liszt)
80 | }
81 |
--------------------------------------------------------------------------------
/R/bohb_utility.R:
--------------------------------------------------------------------------------
1 | #' @keywords internal
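#' @description Density-ratio acquisition score used by BOHB: the predicted
#'   density of the "good" KDE (\code{lkde}) divided by that of the "bad" KDE
#'   (\code{gkde}) at the candidate configuration; larger values are better.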
2 | EI <- function(..., lkde, gkde) { predict(lkde, x = c(...)) / predict(gkde, x = c(...)) }
3 |
4 | #' @keywords internal
5 | map_all <- function(df) {
6 | do.call("mapply", c(list, df, SIMPLIFY = FALSE, USE.NAMES=FALSE))
7 | }
8 |
9 | #' @keywords internal
10 | coalesce_all_columns <- function(df, group_vars = NULL) {
11 | if (is.null(group_vars)) {
12 | group_vars <-
13 | df %>%
14 | purrr::keep(~ dplyr::n_distinct(.x) == 1L) %>%
15 | names()
16 | }
17 |
18 | msk <- colnames(df) %in% group_vars
19 | same_df <- df[1L, msk, drop = FALSE]
20 | coal_df <- df[, !msk, drop = FALSE] %>%
21 | purrr::map_dfc(na.omit)
22 | cbind(same_df, coal_df)
23 | }
24 |
25 | #' @keywords internal
26 | sample_n_params <- function(n, model) {
27 | ans <- map_chr(.x = rep(model, n), .f = make_paste_final) %>%
28 | data.frame(model = model,
29 | params = .) %>%
30 | mutate_all(.funs = as.character)
31 | ans
32 | }
33 |
34 | #' @keywords internal
35 | make_paste_final <- function(model) {
36 | params_list <- get_random_hp_config(jsons[[model]])
37 |
38 | names_list <- names(params_list) %>%
39 | map(~ str_glue(.x, " = ")) %>%
40 | map2(params_list, ~paste(.x, .y, sep = "")) %>%
41 | paste(collapse = ",")
42 | names_list
43 | }
44 |
--------------------------------------------------------------------------------
/R/checkInternet.R:
--------------------------------------------------------------------------------
1 | #' @title Check Internet Connectivity.
2 | #'
3 | #' @description Checking if user has Internet connectivity at the moment of execution to send results to the knowledge base / get recommendation from knowledge base.
4 | #'
5 | #' @return Boolean representing the Internet connectivity status.
6 | #'
7 | #' @examples
8 | #' checkInternet()
9 | #'
10 | #' @importFrom RCurl getURL
11 | #'
12 | #' @noRd
13 | #'
14 | #' @keywords internal
15 |
16 | checkInternet <- function() {
17 | out <- FALSE
18 | tryCatch({
19 | out <- is.character(getURL("www.yahoo.com"))
20 | },
21 | error = function(e) {
22 | out <- FALSE
23 | }
24 | )
25 | out
26 | }
27 |
--------------------------------------------------------------------------------
/R/computeEI.R:
--------------------------------------------------------------------------------
1 | #' @title Compute Expected Improvement.
2 | #'
3 | #' @description Compute the expected improvement for the suggested parameter configurations of a specific classifier.
4 | #'
5 | #' @param cmin Minimum error rate achieved till now.
6 | #' @param perf Expected Performance of the current configuration on each tree of the forest of SMAC algorithm.
7 | #' @param B number of trees in the forest of trees of SMAC optimization algorithm (default = 10).
8 | #'
9 | #' @return Float Number of Expected Improvement value.
10 | #'
11 | #' @examples
12 | #' computeEI(0.9, c(0.91, 0.95, 0.89, 0.88, 0.93), 5).
13 | #'
14 | #' @importFrom stats pnorm dnorm var
15 | #'
16 | #' @noRd
17 | #'
18 | #' @keywords internal
19 |
20 | computeEI <- function(cmin, perf, B = 10){
21 | for(i in 1:B){
22 | perf[i] <- 1 - perf[i]
23 | }
24 | perfMean <- mean(perf)
25 | perfStdDev <- sqrt(var(perf))
26 | u <- (cmin - perfMean)/perfStdDev
27 | cdf <- pnorm(u, mean=0, sd=1)
28 | pdf <- dnorm(u, mean=0, sd=1)
29 | EI <- perfStdDev * (u * cdf + pdf)
30 | return (EI)
31 | }
32 |
--------------------------------------------------------------------------------
/R/computeMetaFeatures.R:
--------------------------------------------------------------------------------
1 | #' @title Compute Meta-Features.
2 | #'
3 | #' @description Compute Statistical Meta-Features for a dataset.
4 | #'
5 | #' @param dataset The dataframe containing the dataset to process.
6 | #' @param maxTime The maximum time budget entered by user for the parameter optimization part (in minutes).
7 | #' @param featureTypes Vector of Types of each feature in the dataset either ('numerical', 'categorical').
8 | #'
9 | #' @return dataframe with 25 statistical meta-feature of \code{dataset}.
10 | #'
11 | #' @examples
12 | #' computeMetaFeatures(data.frame(salary = c(623, 515, 611, 729, 843), class = c(0, 0, 0, 1, 1)), 10, c('numerical', 'numerical'))
13 | #'
14 | #' @importFrom e1071 skewness kurtosis
15 | #' @importFrom stats var
16 | #'
17 | #' @noRd
18 | #'
19 | #' @keywords internal
20 |
21 | computeMetaFeatures <- function(dataset, maxTime, featureTypes) {
22 | print('###################START: Preparation of Meta-Features of the Dataset###################')
23 | #1- number of instances
24 | nInstances <- nrow(dataset)
25 | cat(sprintf("1-Number of Instances: %d\n", nInstances))
26 | #2- log number of instances
27 | lognInstances <- log(nInstances)
28 | cat(sprintf("2-Log number of Instances: %f\n",lognInstances))
29 | #3- number of features
30 | nFeatures <- ncol(dataset) - 1
31 | cat(sprintf("3-Number of Features: %d\n", nFeatures))
32 | #4- log number of features
33 | lognFeatures <- log(nFeatures)
34 | cat(sprintf("4-Log number of Features: %f\n", lognFeatures))
35 | #5- number of classes
36 | classes <- unique(dataset$class)
37 | nClasses <- length(classes)
38 | cat(sprintf("5-Total number of Classes: %d\n", nClasses))
39 | #6- number of categorical features
40 | nCatFeatures <- 0
41 | nNumFeatures <- 0
42 | skewVector <- c()
43 | kurtosisVector <- c()
44 | symbolsVector <- c()
45 | featsType <- lapply(dataset, class)
46 | if(length(featureTypes) == 0){
47 | for(i in colnames(dataset)){
48 | if(i == 'class')next
49 | if(featsType[[i]] != 'factor' && featsType[[i]] != 'character' && length(unique(dataset[[i]])) > lognInstances){
50 | nNumFeatures <- nNumFeatures + 1
51 | skewVector <- c(skewVector, skewness(dataset[[i]]))
52 | kurtosisVector <- c(kurtosisVector, kurtosis(dataset[[i]]))
53 | }
54 | else{
55 | nCatFeatures <- nCatFeatures + 1
56 | symbolsVector <- c(symbolsVector, length(unique(dataset[[i]])))
57 | }
58 | }
59 | }
60 | else{
61 | counter <- 0
62 | for(i in colnames(dataset)){
63 | counter <- counter + 1
64 | if(i == 'class')next
65 | if(featureTypes[counter] == 'numerical'){
66 | nNumFeatures <- nNumFeatures + 1
67 | skewVector <- c(skewVector, skewness(dataset[[i]]))
68 | kurtosisVector <- c(kurtosisVector, kurtosis(dataset[[i]]))
69 | }
70 | else{
71 | nCatFeatures <- nCatFeatures + 1
72 | symbolsVector <- c(symbolsVector, length(unique(dataset[[i]])))
73 | }
74 | }
75 | }
76 | cat(sprintf("6-Number of Categorical Features: %d\n", nCatFeatures))
77 | #7- number of numerical features
78 | cat(sprintf("7-Number of Numerical Features: %d\n", nNumFeatures))
79 | #8- ratio of categorical to numerical features
80 | if(nNumFeatures > 0){
81 | ratioNumToCat <- nCatFeatures / nNumFeatures
82 | }
83 | else{
84 | ratioNumToCat <- 999999
85 | }
86 | cat(sprintf("8-Ratio of Categorical to Numerical Features %f\n", ratioNumToCat))
87 | #9- class entropy
88 | probClasses <- c()
89 | classEntropy <- 0
90 | for(i in classes){
91 | prob <- length(which(dataset$class==i))/nInstances
92 | probClasses <- c(probClasses, prob)
93 | classEntropy <- classEntropy - prob * log2(prob)
94 | }
95 | cat(sprintf("9-Class Entropy: %f\n", classEntropy))
96 | #10- class probability max
97 | classProbMax <- max(probClasses)
98 | cat(sprintf("10-Maximum Class Probability: %f\n", classProbMax))
99 | #11- class probability min
100 | classProbMin <- min(probClasses)
101 | cat(sprintf("11-Minimum Class Probability: %f\n", classProbMin))
102 | #12- class probability mean
103 | classProbMean <- mean(probClasses)
104 | cat(sprintf("12-Mean Class Probability: %f\n", classProbMean))
105 | #13- class probability std. dev
106 | classProbStdDev <- sqrt(var(probClasses))
107 | cat(sprintf("13-Standard Deviation of Class Probability: %f\n", classProbStdDev))
108 | #14- Symbols Mean
109 | if(length(symbolsVector) > 0) symbolsMean <- mean(symbolsVector)
110 | else symbolsMean <- 'NULL'
111 | cat(sprintf("14-Mean of Number of Symbols: %s\n", symbolsMean))
112 | #15- Symbols sum
113 | if(length(symbolsVector) > 0) symbolsSum <- sum(symbolsVector)
114 | else symbolsSum <- 'NULL'
115 | cat(sprintf("15-Sum of Number of Symbols: %s\n", symbolsSum))
116 | #16- Symbols Std. Deviation
117 | if(length(symbolsVector) > 0) symbolsStdDev <- sqrt(var(symbolsVector))
118 | else symbolsStdDev <- 'NULL'
119 | cat(sprintf("16-Std. Deviation of Number of Symbols: %s\n", symbolsStdDev))
120 | #17- skewness min
121 | if(length(skewVector) > 0) featuresSkewMin <- try(min(skewVector))
122 | else featuresSkewMin <- 0
123 | cat(sprintf("17-Features Skewness Minimum: %s\n", featuresSkewMin))
124 | #18- skewness mean
125 | if(length(skewVector) > 0) featuresSkewMean <- try(mean(skewVector))
126 | else featuresSkewMean <- 0
127 | cat(sprintf("18-Features Skewness Mean: %s\n", featuresSkewMean))
128 | #19- skewness max
129 | if(length(skewVector) > 0) featuresSkewMax <- try(max(skewVector))
130 | else featuresSkewMax <- 0
131 | cat(sprintf("19-Features Skewness Maximum: %s\n", featuresSkewMax))
132 | #20- skewness std. dev.
133 | if(length(skewVector) > 0) featuresSkewStdDev <- try(sqrt(var(skewVector)))
134 | else featuresSkewStdDev <- 0
135 | cat(sprintf("20-Features Skewness Std. Deviation: %s\n", featuresSkewStdDev))
136 | #21- Kurtosis min
137 | if(length(kurtosisVector) > 0) featuresKurtMin <- try(min(kurtosisVector))
138 | else featuresKurtMin <- 0
139 | cat(sprintf("21-Features Kurtosis Min: %s\n", featuresKurtMin))
140 | #22- Kurtosis max
141 | if(length(kurtosisVector) > 0) featuresKurtMax <- try(max(kurtosisVector))
142 | else featuresKurtMax <- 0
143 | cat(sprintf("22-Features Kurtosis Max: %s\n", featuresKurtMax))
144 | #23- Kurtosis mean
145 | if(length(kurtosisVector) > 0) featuresKurtMean <- try(mean(kurtosisVector))
146 | else featuresKurtMean <- 0
147 | cat(sprintf("23-Features Kurtosis Mean: %s\n", featuresKurtMean))
148 | #24- Kurtosis std. dev.
149 | if(length(kurtosisVector) > 0) featuresKurtStdDev <- try(sqrt(var(kurtosisVector)))
150 | else featuresKurtStdDev <- 0
151 | cat(sprintf("24-Features Kurtosis Std. Deviation: %s\n", featuresKurtStdDev))
152 | #25- Dataset Ratio (ratio of number features: number of instances)
153 | datasetRatio <- nFeatures / nInstances
154 | cat(sprintf("25-Dataset Ratio: %f\n", datasetRatio))
155 |
156 | #Collecting Meta-Features in a dataFrame
157 | df <- data.frame(datasetRatio = datasetRatio, featuresKurtStdDev = featuresKurtStdDev,
158 | featuresKurtMean = featuresKurtMean, featuresKurtMax = featuresKurtMax,
159 | featuresKurtMin = featuresKurtMin, featuresSkewStdDev = featuresSkewStdDev,
160 | featuresSkewMean = featuresSkewMean, featuresSkewMax = featuresSkewMax,
161 | featuresSkewMin = featuresSkewMin, symbolsStdDev = symbolsStdDev, symbolsSum = symbolsSum,
162 | symbolsMean = symbolsMean, classProbStdDev = classProbStdDev, classProbMean = classProbMean,
163 | classProbMax = classProbMax, classProbMin = classProbMin, classEntropy = classEntropy,
164 | ratioNumToCat = ratioNumToCat, nCatFeatures = nCatFeatures, nNumFeatures = nNumFeatures,
165 | nInstances = nInstances, nFeatures = nFeatures, nClasses = nClasses,
166 | lognFeatures = lognFeatures, lognInstances = lognInstances, maxTime = maxTime)
167 | print('###################END: Preparation of Meta-Features of the Dataset###################')
168 | return(df)
169 | }
170 |
--------------------------------------------------------------------------------
/R/convertCategorical.R:
--------------------------------------------------------------------------------
1 | #' @title Convert Categorical to Numerical Features.
2 | #'
3 | #' @description Perform One-Hot-Encoding for the categorical features to convert them to numerical ones.
4 | #'
5 | #' @param dataset List of training and validation dataframes containing the dataset to process.
6 | #' @param trainDataset Dataframe of full training set
7 | #' @param testDataset Dataframe of full testing set
8 | #' @param B number of trees in the forest of trees of SMAC optimization algorithm (default = 10).
9 | #'
10 | #' @return List of data frames for the new dataset after encoding categorical to numerical features (TD = Training Dataset, VD = Validation Dataset, FD = Training Dataset after splitting it into \code{B} folds).
11 | #'
12 | #' @examples
13 | #' convertCategorical(dataset, trainDataset, testDataset, B = 10)
14 | #'
15 | #' @import caret
16 | #'
17 | #' @noRd
18 | #'
19 | #' @keywords internal
20 |
21 | convertCategorical <- function(dataset, trainDataset, testDataset, B = 10) {
22 | #Convert Factor/String Features into numeric features
23 | dmy <- caret::dummyVars(" ~ .", data = rbind(trainDataset, testDataset)[,names(trainDataset) != "class"])
24 | datasetTmp <- data.frame(predict(dmy, newdata = dataset$TD))
25 | dataset$FULLTD <- data.frame(predict(dmy, newdata = trainDataset))
26 | dataset$TED <- data.frame(predict(dmy, newdata = testDataset))
27 |
28 | datasetTmp$class <- dataset$TD$class
29 | dataset$TD <- datasetTmp
30 | dataset$FULLTD$class <- trainDataset$class
31 | dataset$TED$class <- testDataset$class
32 |
33 | if(nrow(dataset$VD) > 1){
34 | validationSet <- data.frame(predict(dmy, newdata = dataset$VD))
35 | validationSet$class <- dataset$VD$class
36 | dataset$VD <- validationSet
37 | dataset$FD <- createFolds(dataset$TD$class, k = B, list = TRUE, returnTrain = FALSE)
38 | }
39 | return(dataset)
40 | }
41 |
--------------------------------------------------------------------------------
/R/datasetReader.R:
--------------------------------------------------------------------------------
1 | #' @title Read Dataset File into Memory.
2 | #'
3 | #' @description Read the file of the training and testing dataset, and perform preprocessing and data cleaning if necessary.
4 | #'
5 | #' @param directory String of the directory to the file containing the training dataset.
6 | #' @param testDirectory String of the directory to the file containing the testing dataset.
7 | #' @param selectedFeats Vector of numbers of features columns to include from the training set and ignore the rest of columns - In case of empty vector, this means to include all features in the dataset file (default = c()).
8 | #' @param classCol String of the name of the class label column in the dataset (default = 'class').
9 | #' @param preProcessF string containing the name of the preprocessing algorithm (default = 'N' --> no preprocessing):
10 | #' \itemize{
11 | #' \item "boxcox" - apply a Box–Cox transform and values must be non-zero and positive in all features,
12 | #' \item "yeo-Johnson" - apply a Yeo-Johnson transform, like a BoxCox, but values can be negative,
13 | #' \item "zv" - remove attributes with a zero variance (all the same value),
14 | #' \item "center" - subtract mean from values,
15 | #' \item "scale" - divide values by standard deviation,
16 | #' \item "standardize" - perform both centering and scaling,
17 | #' \item "normalize" - normalize values,
18 | #' \item "pca" - transform data to the principal components,
19 | #' \item "ica" - transform data to the independent components.
20 | #' }
21 | #' @param featuresToPreProcess Vector of number of features to perform the feature preprocessing on - In case of empty vector, this means to include all features in the dataset file (default = c()) - This vector should be a subset of \code{selectedFeats}.
22 | #' @param nComp Integer of Number of components needed if either "pca" or "ica" feature preprocessors are needed.
23 | #' @param missingVal Vector of strings representing the missing values in dataset (default: c('NA', '?', ' ')).
24 | #' @param missingOpr Boolean variable represents either delete instances with missing values or apply imputation using "MICE" library which helps you imputing missing values with plausible data values that are drawn from a distribution specifically designed for each missing datapoint- (default = 0 --> delete instances).
25 | #'
26 | #' @return List of the TrainingSet \code{Train} and TestingSet \code{Test}.
27 | #'
28 | #' @import RWeka
29 | #' @import farff
30 | #' @import caret
31 | #' @import mice
32 | #' @importFrom utils read.csv
33 | #' @importFrom stats complete.cases
34 | #'
35 | #' @examples
36 | #' \dontrun{
37 | #' dataset <- datasetReader('/Datasets/irisTrain.csv', '/Datasets/irisTest.csv')
38 | #' }
39 |
40 | datasetReader <- function(directory, testDirectory, selectedFeats = c(), classCol = 'class',
41 | preProcessF = 'N', featuresToPreProcess = c(), nComp = NA,
42 | missingVal = c('NA', '?', ' '), missingOpr = 0) {
43 | #check if CSV or arff
44 | ext <- substr(directory, nchar(directory)-2, nchar(directory))
45 | #Read CSV file of data
46 | if(ext == 'csv'){
47 | con <- file(directory, "r")
48 | data <- read.csv(file = con, header = TRUE, sep = ",", stringsAsFactors = FALSE)
49 | close(con)
50 | con <- file(testDirectory, "r")
51 | dataTED <- read.csv(file = con, header = TRUE, sep = ",", stringsAsFactors = FALSE)
52 | close(con)
53 | }
54 | else{
55 | data <- readARFF(directory)
56 | dataTED <- readARFF(testDirectory)
57 | }
58 |
59 | #change column name of classes to be "class"
60 | colnames(data)[which(names(data) == classCol)] <- "class"
61 | colnames(dataTED)[which(names(dataTED) == classCol)] <- "class"
62 | cInd <- grep("class", colnames(data)) #index of class column
63 |
64 | #Convert characters representing missing values to NA
65 | m1 <- as.matrix(data)
66 | m1[m1 %in% missingVal] <- NA
67 | m2 <- as.matrix(dataTED)
68 | m2[m2 %in% missingVal] <- NA
69 |
70 | #check either to delete instance with missing values or perform imputation
71 | if (missingOpr == 0){
72 | data <- data[complete.cases(m1), ]
73 | dataTED <- dataTED[complete.cases(m2), ]
74 | }
75 | else{
76 | data <- complete(mice(data, m = 1))
77 | dataTED <- complete(mice(dataTED, m = 1))
78 | }
79 |
80 | #select features only upon user request
81 | if(length(selectedFeats) == 0){
82 | selectedFeats <- c(1:ncol(data))
83 | }
84 | #perform preprocessing
85 | if(preProcessF != 'N'){
86 | if(length(featuresToPreProcess ) == 0)
87 | featuresToPreProcess <- selectedFeats
88 |
89 | featuresToPreProcess <- featuresToPreProcess[!featuresToPreProcess %in% cInd] #remove class column from set of features to be preprocessed
90 | dataTmp <- featurePreProcessing(data[,featuresToPreProcess], dataTED[,featuresToPreProcess], preProcessF, nComp)
91 |
92 | #add other features that don't require feature preprocessing to the features obtained after preprocessing
93 | diffTmp <- setdiff(selectedFeats, c(cInd, featuresToPreProcess))
94 | dataTDTmp <- cbind(dataTmp$TD, data[, diffTmp])
95 | dataTEDTmp <- cbind(dataTmp$TED, dataTED[, diffTmp])
96 | #add class column to the dataframe of the dataset
97 | dataTDTmp$class <- data$class
98 | dataTEDTmp$class <- dataTED$class
99 | data <- dataTDTmp
100 | dataTED <- dataTEDTmp
101 | }
102 | else{
103 | data <- data[, selectedFeats]
104 | dataTED <- dataTED[, selectedFeats]
105 | }
106 | return (list(Train = data, Test = dataTED))
107 | }
108 |
--------------------------------------------------------------------------------
/R/evaluateMet.R:
--------------------------------------------------------------------------------
1 | #' @title Evaluate Fitted Model.
2 | #'
3 | #' @description Evaluate Predictions obtained from a specific model based on true labels, its predictions, and the evaluation metric.
4 | #'
5 | #' @param yTrue Vector of true labels.
6 | #' @param pred Vector of predicted labels.
7 | #' @param metric Metric to be used in evaluation:
8 | #' \itemize{
9 | #' \item "acc" - Accuracy,
10 | #' \item "avg-fscore" - Average of F-Score of each label,
11 | #' \item "avg-recall" - Average of Recall of each label,
12 | #' \item "avg-precision" - Average of Precision of each label,
13 | #' \item "fscore" - Micro-Average of F-Score of each label,
14 | #' \item "recall" - Micro-Average of Recall of each label,
15 | #' \item "precision" - Micro-Average of Precision of each label.
16 | #' }
17 | #'
18 | #' @importFrom caret confusionMatrix
19 | #'
20 | #' @return Float number representing the evaluation.
21 | #'
22 | #' @examples
23 | #' \dontrun{
24 | #' result1 <- evaluateMet(c('a', 'b', 'a', 'a'), c('a', 'b', 'b', 'a'), metric = 'acc')
25 | #' }
26 | #'
27 | #' @noRd
28 | #'
29 | #' @keywords internal
30 | #'
31 | evaluateMet <- function(yTrue, pred, metric = 'acc'){
32 | lvls <- union(pred, yTrue)
33 | cm = as.matrix(table(Actual = factor(yTrue, lvls),
34 | Predicted = factor(pred, lvls)) ) # create the confusion matrix
35 | n = sum(cm) # number of instances
36 | nc = nrow(cm) # number of classes
37 | diag = diag(cm) # number of correctly classified instances per class
38 | rowsums = apply(cm, 1, sum) # number of instances per class
39 | colsums = apply(cm, 2, sum) # number of predictions per class
40 | oneVsAll = lapply(1 : nc,
41 | function(i){
42 | v = c(cm[i,i],
43 | rowsums[i] - cm[i,i],
44 | colsums[i] - cm[i,i],
45 | n-rowsums[i] - colsums[i] + cm[i,i]);
46 | return(matrix(v, nrow = 2, byrow = T))})
47 | s = matrix(0, nrow = 2, ncol = 2)
48 | for(i in 1 : nc){s = s + oneVsAll[[i]]}
49 |
50 | if (metric == 'acc'){
51 | perf <- sum(diag) / n
52 | }
53 | else if(metric == 'avg-precision'){
54 | precision <- diag / colsums
55 | perf <- mean(precision)
56 | }
57 | else if(metric == 'avg-recall'){
58 | recall <- diag / rowsums
59 | perf <- mean(recall)
60 | }
61 | else if(metric == 'avg-fscore'){
62 | precision <- diag / colsums
63 | recall <- diag / rowsums
64 | f1 <- 2 * precision * recall / (precision + recall)
65 | perf <- mean(f1)
66 | }
67 | else{
68 | perf <- (diag(s) / apply(s,1, sum))[1];
69 | }
70 |
71 | return(perf)
72 | }
73 |
--------------------------------------------------------------------------------
/R/evocate.R:
--------------------------------------------------------------------------------
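#' @title Ensemble-oriented variant of autoRLearn_ built on mlr3.
#'
#' @description A brief sketch of what this function does: like
#'   \code{autoRLearn_}, it splits the time budget across the requested models
#'   and tunes each with Hyperband or BOHB; it then stacks up to
#'   \code{ensemble_size} of the best candidates into an mlr3pipelines
#'   ensemble (see \code{ensembling}) and reports its performance on
#'   \code{df_test} using \code{measure}.
#'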
1 | #' @export evocate
2 | evocate <- function(df_train, df_test, maxTime = 1, models = "xgboost",
3 | optimizationAlgorithm = "hyperband", bw = 3, max_iter = 81, kde_type = "single",
4 | problem = "classification", measure = "classif.acc", ensemble_size = 1) {
5 |
6 | total_time <- maxTime * 60
7 | parameters_per_model <- map_int(models, .f = ~ length(jsons[[.x]]$params))
8 | times <- (parameters_per_model * total_time) / (sum(parameters_per_model))
9 |
10 | cat("Models selected:", models, '\n', sep = ' ')
11 | cat("Time distribution:", times, '\n', sep = ' ')
12 |
13 | run_optimization <- function(model, time) {
14 | results <- NULL
15 | priors <- data.frame()
16 | tic(model, "optimization time:")
17 |
18 | if(optimizationAlgorithm == 'hyperband') {
19 | current <- Sys.time() %>% as.integer()
20 | end <- (Sys.time() %>% as.integer()) + time
21 |
22 | repeat {
23 | gc(verbose = F)
24 | tic('current hyperband runtime')
25 | print(paste('Started', model, ' model...'))
26 | # Compute the time left for this model
27 | time_left <- max(end - (Sys.time() %>% as.integer()), 1)
28 | print(paste("There are:", time_left, "seconds left for this hyperband run"))
29 | res <- hyperband(df = df_train, model = model, max_iter = max_iter,
30 | maxtime = time_left, problem = problem, measure = measure)
31 |
32 | if(is_empty(flatten(res)) == F) {
33 | res <- res %>%
34 | map_dfr(.f = ~ .x[["answer"]]) %>%
35 | arrange(desc(acc)) %>%
36 | head(1)
37 | results <- c(list(res), results)
38 | print(paste('Best performance from hyperband this round: ', res$acc))
39 | }
40 | # Break if the elapsed time exceeds the allowed time budget
41 | elapsed <- (Sys.time() %>% as.integer()) - current
42 | if(elapsed >= time) {
43 | break
44 | }
45 | }
46 | }
47 | else if(optimizationAlgorithm == "bohb") {
48 | current <- Sys.time() %>% as.integer()
49 | end <- (Sys.time() %>% as.integer()) + time
50 | repeat {
51 | gc(verbose = F)
52 | tic("current bohb time")
53 | print(paste("started", model))
54 | time_left <- max(end - (Sys.time() %>% as.integer()), 1)
55 | print(paste("There are:", time_left, "seconds left for this bohb run"))
56 | res <- bohb(df = df_train, model = model, bw = bw, max_iter = max_iter,
57 | maxtime = time_left, priors = priors, kde_type = kde_type)
58 |
59 | if(is_empty(flatten(res)) == F) {
60 | priors <- res %>%
61 | map_dfr(.f = ~ .x[["sh_runs"]])
62 | res <- res %>%
63 | map_dfr(.f = ~ .x[["answer"]]) %>%
64 | arrange(desc(acc)) %>%
65 | head(1)
66 |
67 | results <- c(list(res), results)
68 | print(paste('Best accuracy from bohb this round: ', res$acc))
69 | }
70 |
71 | elapsed <- (Sys.time() %>% as.integer()) - current
72 | if(elapsed >= time) {
73 | break
74 | }
75 | }
76 | }
77 | else {
78 | stop("Only hyperband and bohb are valid optimization algorithms at this moment.")
80 | }
81 | toc()
82 | results
83 | }
84 |
85 | print("Starting to run all optimizations.")
86 | ans <- vector(mode = "list", length = length(models))
87 |
88 | for(i in 1:length(models)) {
89 | flag <- TRUE
90 | tryCatch({
91 | ans[[i]] <- run_optimization(models[[i]], times[[i]])
92 | }, error = function(e) {
93 | cat('Error spotted: ')
94 | message(e)
95 | cat(' In ', models[[i]], ' model, going to the next model!\n')
96 | flag <<- FALSE
97 | })
98 | if (!flag) next
99 | }
100 |
101 | # Arrange Results according to the best performance
102 | ensemble_size <- min(max(1, length(ans[[1]])), ensemble_size)
103 | print(ensemble_size)
104 | tryCatch({best_model <- ans %>%
105 | map(.f = ~ map_dfr(.x = .x, .f = ~ .x %>% select(model, acc))) %>%
106 | map_dfr(.f = ~ .x %>% arrange(desc(acc)) %>% head(ensemble_size)) %>%
107 | arrange(desc(acc))
108 | print('----------------------####------------------------')
109 | # Return the best performing model
110 | results <- ensembling(best_model, df_train, df_test, problem = problem, measure = measure)
111 | return (results)
112 | }, error = function(e){
113 | cat('Error spotted: ')
114 | message(e)
115 | cat('Try increasing the time budget or use a different model.\n')
116 | return (-1)
117 | })
118 |
119 | }
120 |
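121 | # A minimal usage sketch (not run; a hypothetical illustration). It assumes `train` and
122 | # `test` are data frames whose target column is named "class", as required by the task
123 | # construction in ensembling(), and that the requested model key exists in the internal
124 | # `jsons` parameter database:
125 | #   res <- evocate(df_train = train, df_test = test, maxTime = 2,
126 | #                  models = "xgboost", optimizationAlgorithm = "hyperband")
127 | #   res$model        # fitted GraphLearner ensemble
128 | #   res$performance  # score on df_test for the chosen `measure`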
--------------------------------------------------------------------------------
/R/evocate_utilities.R:
--------------------------------------------------------------------------------
1 | #' @import nloptr
2 | #' @import bbotk
3 |
4 | #' @keywords internal
5 |
6 | ensembling = function(best_models, df_train, df_test,
7 | problem = 'classification', measure = 'classif.acc'){
8 | lrns = c()
9 | for(i in 1:nrow(best_models)){
10 | lrns = c(lrns, po('learner_cv', best_models[[1]][[i]],
11 | id = paste('lrn', as.character(i), sep='') ))
12 | }
13 |
14 | level0 = gunion(list(
15 | lrns)) %>>%
16 | po("featureunion", id = "union1")
17 |
18 | if(problem == 'classification'){
19 | problem = 'classif'
20 | ensemble = level0 %>>% LearnerClassifAvg$new(id = "classif.avg")
21 | task = TaskClassif$new(id = 'final_eval', backend = df_train, target = 'class')
22 | }
23 | else{
24 | problem = 'regr'
25 | ensemble = level0 %>>% LearnerRegrAvg$new(id = "regr.avg")
26 | task = TaskRegr$new(id = 'final_eval', backend = df_train, target = 'class')
27 | }
28 |
29 | ens_lrn = GraphLearner$new(ensemble)
30 | if (problem == 'classif') ens_lrn$predict_type = "prob"
31 | ens_lrn$train(task)
32 | perf <- ens_lrn$predict_newdata(df_test)$score(msr(measure))
33 | return (list("model" = ens_lrn, "performance" = perf))
34 | }
35 |
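36 | # Note on the expected input: `best_models` is the table assembled in evocate(), whose
37 | # first column (`model`) holds already-constructed mlr3 learner objects and whose `acc`
38 | # column holds their scores. Each learner is wrapped in po('learner_cv') and stacked
39 | # under a classif.avg / regr.avg combiner before being refit on df_train.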
--------------------------------------------------------------------------------
/R/featurePreProcessing.R:
--------------------------------------------------------------------------------
1 | #' @title Perform Feature Preprocessing if specified by user.
2 | #'
3 | #' @description Perform a preprocessing algorithm on the dataset and return the preprocessed one.
4 | #'
5 | #' @param data Data frame containing the dataset to process.
6 | #' @param dataTED Data frame containing the test dataset to process.
7 | #' @param preProcessF string containing the name of the preprocessing algorithm:
8 | #' "boxcox": apply a Box–Cox transform and values must be non-zero and positive in all features,
9 | #' "yeo-Johnson": apply a Yeo-Johnson transform, like a BoxCox, but values can be negative,
10 | #' "zv": remove attributes with a zero variance (all the same value),
11 | #' "center": subtract mean from values,
12 | #' "scale": divide values by standard deviation,
13 | #' "standardize": perform both centering and scaling,
14 | #' "normalize": normalize values,
15 | #' "pca": transform data to the principal components,
16 | #' "ica": transform data to the independent components.
17 | #' @param nComp Integer of Number of components needed if either "pca" or "ica" feature preprocessors are needed.
18 | #'
19 | #' @return List of two Dataframes of the preprocessed training and testing datasets.
20 | #'
21 | #' @examples featurePreProcessing(\code{data}, \code{dataTED}, "center", 0).
22 | #'
23 | #' @noRd
24 | #'
25 | #' @keywords internal
26 |
27 | featurePreProcessing <- function(data, dataTED, preProcessF, nComp) {
28 |
29 | if(preProcessF == 'scale'){
30 | preprocessParams <- preProcess(data, method=c("scale"))
31 | }
32 | else if(preProcessF == 'center'){
33 | preprocessParams <- preProcess(data, method=c("center"))
34 | }
35 | else if(preProcessF == 'standardize'){
36 | preprocessParams <- preProcess(data, method=c("center", "scale"))
37 | }
38 | else if(preProcessF == 'normalize'){
39 | preprocessParams <- preProcess(data, method=c("range"))
40 | }
41 | else if(preProcessF == 'pca'){
42 | if (is.na(nComp))
43 | preprocessParams <- preProcess(data, method=c("pca"))
44 | else
45 | preprocessParams <- preProcess(data, method=c("center", "scale", "pca"), pcaComp = nComp)
46 | }
47 | else if(preProcessF == 'ica'){
48 | preprocessParams <- preProcess(data, method=c("center", "scale", "ica"), n.comp=nComp)
49 | }
50 | else if(preProcessF == 'yeo-Johnson'){
51 | preprocessParams <- preProcess(data, method=c("YeoJohnson"))
52 | }
53 | else if(preProcessF == 'boxcox'){
54 | preprocessParams <- preProcess(data, method=c("BoxCox"))
55 | }
56 | else if(preProcessF == 'zv'){
57 | preprocessParams <- preProcess(data, method=c("zv"))
58 | }
59 | else{
60 | print('Error: Unknown preprocessing algorithm... skipping feature preprocessing!')
61 | return(list(TD = data, TED = dataTED))
62 | }
63 | data <- predict(preprocessParams, data)
64 | dataTED <- predict(preprocessParams, dataTED)
65 | return(list(TD = data, TED = dataTED))
66 | }
67 |
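68 | # A minimal usage sketch (not run; `trainDF`/`testDF` are placeholder data frames).
69 | # `preProcess` comes from the caret package, which readDataset() imports:
70 | #   prep <- featurePreProcessing(trainDF, testDF, preProcessF = "standardize", nComp = NA)
71 | #   trainDF <- prep$TD
72 | #   testDF  <- prep$TED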
--------------------------------------------------------------------------------
/R/fitModel.R:
--------------------------------------------------------------------------------
1 | #' @title Fit SMAC Model.
2 | #'
3 | #' @description Fit the trees of the SMAC forest model by adding new nodes to each of the forest trees.
4 | #'
5 | #' @param params A string of parameter configuration values for the current classifier to be tuned (parameters are separated by #).
6 | #' @param bestPerf Vector of performance values of the best parameter configuration on the folds of the SMAC model.
7 | #' @param trainingSet Dataframe of the training set.
8 | #' @param validationSet Dataframe of the validation Set.
9 | #' @param foldedSet List of the folds of the dataset in each tree of the SMAC forest.
10 | #' @param classifierAlgorithm String of the name of classifier algorithm used now.
11 | #' @param tree List of data frames, representing the data structure for the forest of trees of the SMAC model.
12 | #' @param B number of trees in the forest of trees of SMAC optimization algorithm (default = 10).
13 | #' @param metric Metric to be used in evaluation:
14 | #' \itemize{
15 | #' \item "acc" - Accuracy,
16 | #' \item "avg-fscore" - Average of F-Score of each label,
17 | #' \item "avg-recall" - Average of Recall of each label,
18 | #' \item "avg-precision" - Average of Precision of each label,
19 | #' \item "fscore" - Micro-Average of F-Score of each label,
20 | #' \item "recall" - Micro-Average of Recall of each label,
21 | #' \item "precision" - Micro-Average of Precision of each label.
22 | #' }
23 | #'
24 | #' @return List of: \code{t} trees of fitted SMAC Model - \code{p} performance of current parameter configuration on whole dataset - \code{bp} Current added parameter configuration.
25 | #'
26 | #' @examples fitModel('1', c(0.91, 0.89), data.frame(salary = c(623, 515, 611, 729, 843), class = c (0, 0, 0, 1, 1)), data.frame(salary = c(400, 800), class = c (0, 1)), list(c(1,2,4), c(3,5)), 'knn', data.frame(fold = c(), parent = c(), params = c(), leftChild = c(), rightChild = c(), performance = c(), rowN = c()), 2).
27 | #'
28 | #' @noRd
29 | #'
30 | #' @keywords internal
31 |
32 | fitModel <- function(params, bestPerf, trainingSet, validationSet, foldedSet, classifierAlgorithm, tree, B = 10, metric = 'acc') {
33 | #fit SMAC model using the current best parameters
34 | #get current best parameters
35 | cntParams <- params
36 | cntParamStr <- paste( unlist(cntParams), collapse='#')
37 | #initiate a variable to store its performance on each decision tree of the forest
38 | perf <- c()
39 | for(i in 1:B){
40 | cntNode <- tree[tree$fold==i & is.na(tree$parent), ]
41 | #Get position to add the new node
42 | cParent <- NA
43 | cChild <- NA
44 | if(nrow(cntNode) > 0){
45 | cParent <- cntNode$rowN
46 | while(!is.na(cntNode[[1]])){
47 | cParent <- cntNode$rowN
48 | if(cntParamStr > as.character(cntNode$params)){
49 | cntNode <- tree[as.integer(cntNode$rightChild), ]
50 | cChild <- 5 #pointer position to right node
51 | }
52 | else if(cntParamStr < as.character(cntNode$params)){
53 | cntNode <- tree[as.integer(cntNode$leftChild), ]
54 | cChild <- 4 #pointer position to left node
55 | }
56 | else{
57 | return(list(bp = params, t = tree, p=bestPerf))
58 | }
59 | }
60 | }
61 |
62 | if(length(bestPerf) >= i)
63 | perf <- bestPerf
64 | else
65 | perf <- c(perf, (runClassifier(trainingSet[foldedSet[[i]], ], validationSet, cntParams, classifierAlgorithm, metric = metric))$perf)
66 |
67 | #row number of new node to be added
68 | newRowN <- nrow(tree) + 1
69 | #Update parent's child
70 | if(!is.na(cChild))
71 | tree[cParent, cChild] <- newRowN
72 | #Add new node with current configuration
73 | df <- data.frame(fold = i, parent = cParent, params = cntParamStr, leftChild = NA, rightChild = NA, performance = perf[i], rowN = newRowN)
74 | tree <- rbind(tree, df)
75 | }
76 |
77 | cntParams$performance <- mean(perf)
78 | return(list(t = tree, p=perf, bp=cntParams))
79 | }
80 |
--------------------------------------------------------------------------------
/R/getCandidateClassifiers.R:
--------------------------------------------------------------------------------
1 | #' @title Get candidate Good Classifier Algorithms.
2 | #'
3 | #' @description Compare the dataset meta-features with the knowledge base to recommend good classifier algorithms, based on nearest-neighbour datasets with outperforming pipelines.
4 | #'
5 | #' @param maxTime Float of the maximum time budget allowed.
6 | #' @param metaFeatures List of the meta-features collected from the dataset.
7 | #' @param nModels Integer of the required number of classifier algorithm recommendations to return.
8 | #'
9 | #' @return List of recommended classifier algorithms, their initial parameter configurations, and time ratio to be spent in tuning each classifier.
10 | #'
11 | #' @examples getCandidateClassifiers(10, \code{metaFeatures}, 3)
12 | #'
13 | #' @importFrom BBmisc normalize
14 | #' @importFrom RMySQL MySQL fetch dbDisconnect dbSendQuery dbConnect
15 | #' @importFrom httr POST content
16 | #' @importFrom stats setNames
17 | #'
18 | #' @noRd
19 | #'
20 | #' @keywords internal
21 |
22 | getCandidateClassifiers <- function(maxTime, metaFeatures, nModels) {
23 | classifiers <- c('randomForest', 'c50', 'j48', 'svm', 'naiveBayes','knn', 'bagging', 'rda', 'neuralnet', 'plsda', 'part', 'deepboost', 'rpart', 'lda', 'lmt')
24 | classifiersWt <- c(10, 20, 11, 21, 10, 5, 25, 5, 5, 6, 11, 21, 6, 5, 10) #weight of each classifier to tune based on number and types of parameters
25 |
26 | #Chosen classifiers parameters initialization
27 | params <- c()
28 | cclassifiers <- c() #chosen classifiers
29 | ratio <- c() #time ratios for each classifier
30 | KBFlag <- FALSE
31 | for(trial in 1:3){ #TRY to connect to knowledge base
32 | readKnowledgeBase <- try(
33 | {
34 | metaData <- content(POST("https://jncvt2k156.execute-api.eu-west-1.amazonaws.com/default/callKnowledgeBase"))
35 | KBFlag <- TRUE
36 | metaDataFeatures <- data.frame(matrix(unlist(metaData, recursive = FALSE), nrow = length(metaData), byrow = T))
37 | colnames(metaDataFeatures) <- c('datasetRatio', 'featuresKurtStdDev', 'featuresKurtMean', 'featuresKurtMax', 'featuresKurtMin', 'featuresSkewStdDev', 'featuresSkewMean', 'featuresSkewMax', 'featuresSkewMin', 'symbolsStdDev', 'symbolsSum', 'symbolsMean', 'classProbStdDev', 'classProbMean', 'classProbMax', 'classProbMin', 'classEntropy', 'ratioNumToCat', 'nCatFeatures', 'nNumFeatures', 'nInstances', 'nFeatures', 'nClasses', 'lognFeatures', 'lognInstances', 'classifierAlgorithm', 'parameters', 'maxTime', 'metric', 'performance')
38 |
39 | #Remove useless columns for now
40 | metaDataFeatures$performance <- NULL
41 | metaDataFeatures$metric <- NULL
42 | metaDataFeatures$ipInserted <- NULL
43 | metaDataFeatures$maxTime <- NULL
44 | metaDataFeatures$dateInserted <- NULL
45 | metaDataFeatures$ID <- NULL
46 | metaFeatures$maxTime <- NULL
47 |
48 | #Separate Best Classifier Algorithms and Their Parameters
49 | bestClf <- metaDataFeatures$classifierAlgorithm
50 | nClasses <- metaDataFeatures$nClasses
51 | bestClfParams <- metaDataFeatures$parameters
52 | metaDataFeatures$classifierAlgorithm <- NULL
53 | metaDataFeatures$parameters <- NULL
54 |
55 | #Append new dataset meta features to the metaDataFeatures
56 | metaDataFeatures <- rbind(metaDataFeatures, metaFeatures)
57 |
58 | #Normalize the distance matrix
59 | metaDataFeatures[] <- suppressWarnings(lapply(metaDataFeatures, function(x) as.numeric(as.character(x))))
60 | metaDataFeatures <- normalize(metaDataFeatures, method = "standardize", range = c(0, 1), margin = 1L, on.constant = "quiet")
61 |
62 | #Construct the distance list to extract the nearest neighbors
63 | cntMeta <- nrow(metaDataFeatures)
64 | distMat <- data.frame()
65 | distMat[['dist']] <- as.numeric()
66 | distMat[['index']] <- as.numeric()
67 |
68 | for(i in 1:(nrow(metaDataFeatures)-1)){
69 | dist <- 0
70 | for(j in 1:ncol(metaDataFeatures)){
71 | if(is.na(metaDataFeatures[i,j]) == TRUE && is.na(metaDataFeatures[cntMeta,j]) == TRUE)
72 | dist <- dist + 0
73 |
74 | else if ( (is.na(metaDataFeatures[i,j]) == TRUE && is.na(metaDataFeatures[cntMeta,j]) == FALSE) || (is.na(metaDataFeatures[i,j]) == FALSE && is.na(metaDataFeatures[cntMeta,j]) == TRUE) )
75 | dist <- dist + 0.5
76 |
77 | else
78 | dist <- dist + (suppressWarnings(as.numeric(metaDataFeatures[i,j])) - suppressWarnings(as.numeric(metaDataFeatures[cntMeta, j])) )^2
79 |
80 | }
81 | tmpDist <- list(dist = dist, index = i)
82 | distMat <- rbind(distMat, tmpDist)
83 | }
84 | #Sort Dataframe
85 | orderInd <- order(distMat$dist)
86 | distMat <- distMat[orderInd, ]
87 |
88 | #Get best classifiers with their parameters
89 | for(i in 1:nrow(distMat)){
90 | ind <- distMat[i,]$index
91 | clf <- bestClf[ind]
92 | if(is.element(clf, cclassifiers) == FALSE){
93 | #Exception for deep Boost requires binary classes dataset
94 | if((clf == 'deepboost' && nClasses > 2)||clf == 'fda')
95 | next
96 | cclassifiers <- c(cclassifiers, clf)
97 | params <- c(params, bestClfParams[ind])
98 |
99 | clfInd = which(classifiers == clf)
100 | ratio <- c(ratio, classifiersWt[clfInd])
101 | }
102 | if(length(cclassifiers) == nModels)
103 | break
104 | }
105 | })
106 | if(inherits(readKnowledgeBase, "try-error")){
107 | KBFlag <- FALSE
108 | print('Warning: Cannot connect to the knowledge base! Check your internet connectivity. Trying again.')
109 | next
110 | }
111 |
112 | if(KBFlag == TRUE) #managed to get information from knowledge base
113 | break
114 | }
115 |
116 | if(KBFlag == FALSE)
117 | print('Random classifiers will be used. Use a larger time budget and nModels for better results.')
118 | #Assign time ratio for each classifier
119 | if (length(cclassifiers) < nModels){ #failed to make use of meta-learning --> tune over all classifiers
120 | #cclassifiers <- classifiers
121 | for (clf in classifiers){
122 | if(is.element(clf, cclassifiers) == TRUE) #already inserted this classifier
123 | next
124 | ind = which(classifiers == clf)
125 | ratio <- c(ratio, classifiersWt[ind])
126 | cclassifiers <- c(cclassifiers, clf)
127 | params <- c(params, '')
128 | if(length(cclassifiers) == nModels) #completed number of required models
129 | break
130 | }
131 | }
132 | ratio <- ratio / sum(ratio) * (maxTime * 0.9) #Only using 90% of the allowed time budget
133 |
134 | return (list(c = cclassifiers, r = ratio, p = params))
135 | }
136 |
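137 | # Distance rule used above for each standardized meta-feature j of the new dataset x and
138 | # a knowledge-base dataset y: both NA contributes 0, exactly one NA contributes 0.5, and
139 | # otherwise (x_j - y_j)^2 is added. Neighbours are then ranked by the summed distance and
140 | # the best classifier of each nearest neighbour is recommended until nModels are collected.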
--------------------------------------------------------------------------------
/R/hb_utilities.R:
--------------------------------------------------------------------------------
1 | #' @importFrom data.table fcase
2 | #' @import purrr
3 |
4 | #' @keywords internal
5 |
6 | param_sample <- function(model, hparam, columns = NULL) {
7 | param = jsons[[model]][[hparam]]
8 | type <- param$type
9 | type_scale <- param$scale
10 |
11 | if(type == "boolean") {
12 | param_estimation <- paste(base::sample(x = as.list(param$values), size = 1), sep = "")
13 | param_estimation <- ifelse(param_estimation == "FALSE", FALSE, TRUE)
14 | return(param_estimation)
15 | }
16 | else if(type == "discrete") {
17 | param_estimation <- paste(base::sample(x = as.list(param$values), size = 1), sep = "")
18 | return(param_estimation)
19 | }
20 |
21 | else {
22 | int_val <- ifelse(hparam == "mtry", as.numeric(columns) - 1, as.numeric(param$maxVal))
23 | param_estimation <- fcase(type_scale == "int", rdunif(1, a = as.numeric(param$minVal),
24 | b = int_val),
25 | type_scale == "any", runif(1, min = as.numeric(param$minVal),
26 | max = as.numeric(param$maxVal)),
27 | type_scale == "double", runif(1, min = as.numeric(param$minVal),
28 | max = as.numeric(param$maxVal)),
29 | type_scale == "exp", 2^rdunif(1, a = as.numeric(param$minVal),
30 | b = as.numeric(param$maxVal)))
31 | return(as.numeric(param_estimation))
32 | }
33 |
34 | }
35 |
36 | #' @keywords internal
37 | get_random_hp_config <- function(model, columns = NULL) {
38 | param_db <- jsons[[model]]
39 | params_list <- param_db$params
40 | params_list_mapped <- map(.x = params_list,
41 | .f = as_mapper( ~ param_sample(model = model,
42 | hparam = .x,
43 | columns = columns)))
44 | `names<-`(params_list_mapped, params_list)
45 | }
46 |
47 | #' @keywords internal
48 | calc_n_r = function(max_iter = 81, eta = 3, s = 4, B = 405) {
49 | n = trunc(ceiling(trunc(B/max_iter/(s+1)) * eta**s))
50 | r = max_iter * eta^(-s)
51 | ans = c(n, r)
52 | ans
53 | }
54 |
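55 | # Worked example with the defaults (max_iter = 81, eta = 3, s = 4, B = 405):
56 | #   n = trunc(ceiling(trunc(405/81/5) * 3^4)) = trunc(ceiling(1 * 81)) = 81
57 | #   r = 81 * 3^-4 = 1
58 | # i.e. the most exploratory bracket starts 81 configurations at the minimum budget of 1.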
--------------------------------------------------------------------------------
/R/hyperband.R:
--------------------------------------------------------------------------------
1 | #' @keywords internal hyperband
2 | hyperband <- function(df, model, max_iter = 81, eta = 3, maxtime = 1000,
3 | problem = 'classification', measure = 'classif.acc') {
4 | logeta = as_mapper(~ log(.x) / log(eta))
5 | s_max = trunc(logeta(max_iter))
6 | B = (s_max + 1) * max_iter
7 | nrs = map_dfc(s_max:0, .f = ~ calc_n_r(max_iter, eta, .x, B)) %>%
8 | t() %>%
9 | `colnames<-`(value = c("n", "r")) %>%
10 | as.data.table()
11 | nrs$s = s_max:0
12 | partial_halving <- function(n, r, s) {
13 | successive_halving(df = df, model = model,
14 | params_config = replicate(n, get_random_hp_config(model, columns = ncol(df) - 1),
15 | simplify = FALSE),
16 | n = n, r = r, s_max = s, max_iter = max_iter, eta = eta,
17 | problem = problem, measure = measure)
18 | }
19 |
20 | liszt = vector(mode = "list", length = max(nrs$s) + 1)
21 | if (model != 'ranger'){
22 | tryCatch({tmp <- withTimeout({
23 | for (row in 1:nrow(nrs)) {
24 | liszt[[row]] <- partial_halving(nrs[[row, 1]],
25 | nrs[[row, 2]],
26 | nrs[[row, 3]])
27 | print("Looped once")
28 | }
29 | }, timeout = maxtime, elapsed = maxtime)
30 | }, TimeoutException = function(ex) {
31 | err <- geterrmessage()
32 | if (startsWith(err, 'reached') == FALSE)
33 | print(paste('Error Found, ', err, ' Replace ', model, sep = ''))
34 | else
35 | print("Time Budget ended.")
36 | },
37 | finally = {
38 | print("Hyperband run finished.")
39 | })
40 | }
41 | else{
42 | current <- Sys.time() %>% as.integer()
43 | for (row in 1:nrow(nrs)) {
44 | tryCatch({liszt[[row]] <- partial_halving(nrs[[row, 1]],
45 | nrs[[row, 2]],
46 | nrs[[row, 3]])
47 | }, error = function(ex) {
48 | err <- geterrmessage()
49 | print(paste('Error Found, ', err, ' Replace ', model, sep = ''))
50 | })
51 | now <- Sys.time() %>% as.integer()
52 | if ((now - current) > maxtime){
53 | print("Time Budget ended.")
54 | break
55 | }
56 | print("Looped once")
57 | }
58 | }
59 | return(liszt)
60 | }
61 |
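62 | # With the defaults (max_iter = 81, eta = 3) this gives s_max = 4, B = 405 and the usual
63 | # Hyperband brackets, as (number of configurations, initial budget) pairs from calc_n_r():
64 | #   s = 4: (81, 1)   s = 3: (27, 3)   s = 2: (9, 9)   s = 1: (6, 27)   s = 0: (5, 81)
65 | # Each bracket is then run through successive_halving() via partial_halving() above.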
--------------------------------------------------------------------------------
/R/initialize.R:
--------------------------------------------------------------------------------
1 | #' @title Initialize the SMAC model.
2 | #'
3 | #' @description Initialize the SMAC model with the classifier default parameter configuration.
4 | #'
5 | #' @param classifierName String of the classifier algorithm name.
6 | #' @param result List of the converted classifier json parameter configuration into set of vectors and lists.
7 | #' @param initParams String of the initial parameter configuration of \code{classifierName} to start the model with.
8 | #'
9 | #' @return Dataframe of one row holding the initial parameter configuration, plus \code{performance} and \code{EI} columns.
10 | #'
11 | #' @examples
12 | #'
13 | #' @noRd
14 | #'
15 | #' @keywords internal
16 |
17 | initialize <- function(classifierName, result, initParams) {
18 | #get list of Classifier Parameters
19 | params <- result$params
20 | #get list of GrandParent parametes
21 | gparams <- result$parents
22 | #Create dataFrame for classifier default parameters
23 | defaultParams <- data.frame(matrix(ncol = length(params)+1, nrow = 1))
24 | colnames(defaultParams) <- c(params, 'performance')
25 | i <- 1
26 | while(i <= length(gparams)){
27 | parI <- gparams[i]
28 | defaultParams[[parI]] <- result[[parI]]$'default'
29 | require <- result[[parI]]$'requires'[[result[[parI]]$'default']]$'require'
30 | gparams <- c(gparams, require)
31 | i <- i + 1
32 | }
33 |
34 | if ( initParams != ""){
35 | initParams <- unlist(strsplit(initParams, "#"))
36 | j <- 1
37 | for(i in colnames(defaultParams)){
38 | if(i == 'performance' || i == 'nodesize')
39 | next
40 | if(initParams[j] == 'NA')
41 | defaultParams[[i]] <- NA
42 | else
43 | defaultParams[[i]] <- initParams[j]
44 |
45 | j <- j + 1
46 | }
47 | }
48 | defaultParams[["EI"]] <- NA
49 | return (defaultParams)
50 | }
51 |
--------------------------------------------------------------------------------
/R/intensify.R:
--------------------------------------------------------------------------------
1 | #' @title Intensify of SMAC model
2 | #'
3 | #' @description Checking if current candidate parameter configuration is better than the current best parameter configuration chosen till now or not.
4 | #'
5 | #' @param R Dataframe of tried out candidate parameter configurations.
6 | #' @param bestParams String of best parameter configuration found till now.
7 | #' @param bestPerf Vector of performance of classifier on all folds of dataset.
8 | #' @param candidateConfs Vector of strings of candidate parameter configurations.
9 | #' @param trainingSet Dataframe of the training set.
10 | #' @param validationSet Dataframe of the validation Set.
11 | #' @param foldedSet List of the folds of the dataset in each tree of the SMAC forest.
12 | #' @param classifierAlgorithm String value of the classifier Name.
13 | #' @param maxTime Float of maximum time budget allowed.
14 | #' @param timeTillNow Float of the time spent till now.
15 | #' @param B number of trees in the forest of trees of SMAC optimization algorithm (default = 10).
16 | #' @param metric Metric to be used in evaluation:
17 | #' \itemize{
18 | #' \item "acc" - Accuracy,
19 | #' \item "avg-fscore" - Average of F-Score of each label,
20 | #' \item "avg-recall" - Average of Recall of each label,
21 | #' \item "avg-precision" - Average of Precision of each label,
22 | #' \item "fscore" - Micro-Average of F-Score of each label,
23 | #' \item "recall" - Micro-Average of Recall of each label,
24 | #' \item "precision" - Micro-Average of Precision of each label.
25 | #' }
26 | #'
27 | #' @return List of current best parameter configuration, its performance, dataframe of tried out candidate parameter configurations, and time till now.
28 | #'
29 | #' @examples intensify(c('1'), '1', c(0.89, 0.91), list(c(1,2,4), c(3,5)), data.frame(salary = c(623, 515, 611, 729, 843), class = c (0, 0, 0, 1, 1)), data.frame(salary = c(400, 800), class = c (0, 1)), 'knn', 100, 5, 2)
30 | #'
31 | #' @noRd
32 | #'
33 | #' @keywords internal
34 |
35 | intensify <- function(R, bestParams, bestPerf, candidateConfs, foldedSet, trainingSet, validationSet, classifierAlgorithm, maxTime, timeTillNow , B = 10, metric = metric) {
36 | for(j in 1:nrow(candidateConfs)){
37 | cntParams <- candidateConfs[j,]
38 | cntPerf <- c()
39 | folds <- sample(1:B)
40 | pointer <- 1
41 | timeFlag <- FALSE
42 | N <- 1
43 | #number of folds with higher performance for candidate configuration
44 | forMe <- 0
45 | #number of folds with lower performance for candidate configuration
46 | againstMe <- 0
47 | fails <- 0
48 | while(pointer < B){
49 | for(i in pointer:min(pointer+N-1, B)){
50 | tmpPerf <- runClassifier(trainingSet[foldedSet[[i]], ], validationSet, cntParams, classifierAlgorithm, metric = metric)
51 | if(tmpPerf$perf == 0){
52 | fails <- fails + 1
53 | }
54 | cntPerf <- c(cntPerf, tmpPerf$perf)
55 | if(i > length(bestPerf)){
56 | tmpPerf <- runClassifier(trainingSet[foldedSet[[i]], ], validationSet, bestParams, classifierAlgorithm, metric = metric)
57 | bestPerf <- c(bestPerf, tmpPerf$perf) }
58 | if(cntPerf[i] >= bestPerf[i])forMe <- forMe + 1
59 | else againstMe <- againstMe + 1
60 |
61 | #Check time consumed till now
62 | t <- toc(quiet = TRUE)
63 | timeTillNow <- timeTillNow + t$toc - t$tic
64 | tic(quiet = TRUE)
65 | if(timeTillNow > maxTime || fails > 2){
66 | timeFlag <- TRUE
67 | break
68 | }
69 | }
70 | if(forMe < againstMe || timeFlag == TRUE) break
71 | pointer <- pointer + N
72 | N <- N * 2
73 | }
74 | #make the current candidate as the best candidate
75 | if(timeFlag == FALSE && forMe > againstMe){
76 | bestParams <- cntParams
77 | bestPerf <- cntPerf
78 | }
79 | cntParams$performance <- mean(cntPerf)
80 | bestParams$performance <- mean(bestPerf)
81 | R <- rbind(R, cntParams)
82 | }
83 | return(list(params = bestParams, perf = bestPerf, r = R, timeTillNow = timeTillNow, fails = fails))
84 | }
85 |
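86 | # Intensification sketch: each candidate races the incumbent fold by fold, starting with
87 | # N = 1 fold and doubling N after every surviving round. The race stops early when the
88 | # candidate falls behind (forMe < againstMe), the time budget is exhausted, or the
89 | # classifier fails more than twice; the candidate replaces the incumbent only if it
90 | # finishes ahead within the budget.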
--------------------------------------------------------------------------------
/R/intrepretability.R:
--------------------------------------------------------------------------------
1 | #' @title Perform Interpretability on Model.
2 | #'
3 | #' @description Perform model interpretability analysis on the selected model by obtaining two plots: feature importance and feature interaction.
4 | #'
5 | #' @param model Fitted Model of any of the chosen classifiers and fitted on the training set.
6 | #' @param x Dataframe of the training set.
7 | #'
8 | #' @return List of two plots of feature importance and feature interaction.
9 | #'
10 | #' @examples interpret(\code{model}, data.frame(salary = c(623, 515, 611, 729, 843), class = c (0, 0, 0, 1, 1)))
11 | #'
12 | #' @importFrom iml FeatureImp Interaction Predictor
13 | #'
14 | #' @noRd
15 | #'
16 | #' @keywords internal
17 |
18 | Loss <- function(actual, predicted){
19 | err <- 0
20 | for(i in 1:length(actual)){
21 | act <- as.character(actual[i])
22 | pred <- substring(as.character(predicted[i]), 2)
23 | if (act != pred)
24 | err <- err + 1
25 | }
26 | return(err/length(actual))
27 | }
28 |
29 | interpret <- function(model, x){
30 | clas = as.factor(x$class)
31 | X = x[which(names(x) != "class")]
32 | X[] <- lapply(X, function(x) {
33 | as.double(as.character(x))
34 | })
35 | predictor = Predictor$new(model, data = X, y = as.factor(clas))
36 | out <- list()
37 | out$featImp <- FeatureImp$new(predictor, loss = Loss)
38 | out$interact = Interaction$new(predictor)
39 | return(out)
40 | }
41 |
--------------------------------------------------------------------------------
/R/outClassifierConf.R:
--------------------------------------------------------------------------------
1 | #' @title Output Classifier Parameter Configuration.
2 | #'
3 | #' @description Get the classifier parameter configuration in a human readable format.
4 | #'
5 | #' @param classifierName String of the name of classifier algorithm used now.
6 | #' @param result List of the converted classifier json parameter configuration into set of vectors and lists.
7 | #' @param initParams String of parameters of \code{classifierName} separated by #.
8 | #'
9 | #' @return String of the human readable output in HTML format.
10 | #'
11 | #' @examples outClassifierConf('knn', list(params = c('k'), parents = c('k'), k = list(default = '7', require = c())), '1')
12 | #'
13 | #' @noRd
14 | #'
15 | #' @keywords internal
16 |
17 | outClassifierConf <- function(classifierName, result, initParams) {
18 | #get list of Classifier Parameters names
19 | params <- result$params
20 | #get list of GrandParent parameters
21 | gparams <- result$parents
22 | #Create dataFrame for classifier default parameters
23 | defaultParams <- data.frame(matrix(ncol = length(params), nrow = 1))
24 | colnames(defaultParams) <- c(params)
25 |
26 | i <- 1
27 | while(i <= length(gparams)){
28 | parI <- gparams[i]
29 | defaultParams[[parI]] <- result[[parI]]$'default'
30 | require <- result[[parI]]$'requires'[[result[[parI]]$'default']]$'require'
31 | gparams <- c(gparams, require)
32 | i <- i + 1
33 | }
34 |
35 | return(initParams)
36 | }
37 |
--------------------------------------------------------------------------------
/R/readDataset.R:
--------------------------------------------------------------------------------
1 | #' @title Read Dataset File into Memory.
2 | #'
3 | #' @description Read the file of the dataset, and split it into training and validation sets.
4 | #'
5 | #' @param directory String of the directory to the file containing the training dataset.
6 | #' @param testDirectory String of the directory to the file containing the testing dataset.
7 | #' @param vRatio The validation split ratio of the dataset (default = 0.3 --> 30% validation, 70% training).
8 | #' @param classCol String of the class column of the dataset.
9 | #' @param preProcessF Vector of Strings of the preprocessing algorithm to apply.
10 | #' @param featuresToPreProcess Vector of indices of the feature columns to preprocess - an empty vector means all numeric features.
11 | #' @param nComp Number of components needed if either "pca" or "ica" feature preprocessors are needed.
12 | #' @param missingOpr Boolean controlling missing-value handling: FALSE (default) imputes with the median/mode via "imputeMissings", TRUE imputes using the "mice" library.
13 | #' @param metric String of the metric to be used in evaluation:
14 | #' \itemize{
15 | #' \item "acc" - Accuracy,
16 | #' \item "avg-fscore" - Average of F-Score of each label,
17 | #' \item "avg-recall" - Average of Recall of each label,
18 | #' \item "avg-precision" - Average of Precision of each label,
19 | #' \item "fscore" - Micro-Average of F-Score of each label,
20 | #' \item "recall" - Micro-Average of Recall of each label,
21 | #' \item "precision" - Micro-Average of Precision of each label.
22 | #' }
23 | #' @param balance Boolean indicating whether SMOTE class balancing is required (default FALSE).
24 | #'
25 | #' @return List of the Training and Validation Sets splits.
26 | #'
27 | #' @examples readDataset('/Datasets/irisTrain.csv', '/Datasets/irisTest.csv', 0.1, 'class', 'pca', c(), 2, FALSE, 'acc', FALSE)
28 | #'
29 | #' @import RWeka
30 | #' @import farff
31 | #' @import caret
32 | #' @import mice
33 | #' @importFrom UBL SmoteClassif
34 | #' @importFrom imputeMissings compute impute
35 | #' @importFrom utils read.csv
36 | #' @importFrom stats complete.cases
37 | #'
38 | #' @noRd
39 | #'
40 | #' @keywords internal
41 |
42 | readDataset <- function(directory, testDirectory, vRatio = 0.3, classCol, preProcessF, featuresToPreProcess, nComp, missingOpr, metric, balance) {
43 | #check if CSV or arff
44 | ext <- substr(directory, nchar(directory)-2, nchar(directory))
45 | #Read CSV file of data
46 | if(ext == 'csv'){
47 | con <- file(directory, "r")
48 | data <- read.csv(file = con, header = TRUE, sep = ",", stringsAsFactors = TRUE)
49 | close(con)
50 | con <- file(testDirectory, "r")
51 | dataTED <- read.csv(file = con, header = TRUE, sep = ",", stringsAsFactors = TRUE)
52 | close(con)
53 | }
54 | else{
55 | data <- readARFF(directory)
56 | dataTED <- readARFF(testDirectory)
57 | }
58 |
59 | #Sampling from large datasets
60 | maxSample = 20000000
61 | n = as.integer(maxSample / ncol(data))
62 | if(maxSample < nrow(data) * ncol(data)){
63 | sampleInds <- createDataPartition(y = data$class, times = 1, p = n/nrow(data), list = FALSE)
64 | data <- data[sampleInds,]
65 | }
66 |
67 | #change column name of classes to be "class"
68 | colnames(data)[which(names(data) == classCol)] <- "class"
69 | colnames(dataTED)[which(names(dataTED) == classCol)] <- "class"
70 | cInd <- grep("class", colnames(data)) #index of class column
71 | #function which returns function which will encode vectors with values of class column labels
72 | label_encoder <- function(vec){
73 | levels <- sort(unique(vec))
74 | function(x){
75 | match(x, levels)
76 | }
77 | }
78 | classEncoder <- label_encoder(data$class) # create class encoder
79 | data$class <- classEncoder(data$class) # encoding class labels of training set
80 | dataTED$class <- classEncoder(dataTED$class) # encoding class labels of testing set
81 |
82 | #check either to delete an instance with missing values or perform imputation
83 | if (missingOpr == FALSE){
84 | missingVals <- imputeMissings::compute(data, method = "median/mode")
85 | data <- impute(data, object = missingVals)
86 | dataTED <- impute(dataTED, object = missingVals)
87 | }
88 | else{
89 | data <- complete(mice(data, m = 1, threshold = 1, printFlag = FALSE))
90 | dataTED <- complete(mice(dataTED, m = 1, threshold = 1, printFlag = FALSE))
91 | }
92 |
93 | #remove ID features
94 | numericFlag <- unlist(lapply(data, is.numeric))
95 | rmvFlag = c()
96 | for(i in 1:ncol(data)){
97 | len = length(unique(data[,i]))
98 | if(numericFlag[i] == FALSE && ((len / nrow(data) > 0.5) || len == 1) )
99 | rmvFlag <- c(rmvFlag, i)
100 | }
101 | keepFlag = c(1:ncol(data))
102 | keepFlag = keepFlag[!keepFlag %in% rmvFlag]
103 | data <- data[, keepFlag]
104 | dataTED <- dataTED[, keepFlag]
105 |
106 | #Select all remaining features
107 | selectedFeats <- c(1:ncol(data))
108 |
109 | #perform preprocessing
110 | if(length(featuresToPreProcess ) == 0){
111 | numericFlag <- unlist(lapply(data, is.numeric))
112 | for(i in 1:ncol(data)){
113 | if(numericFlag[i] == TRUE && i != cInd)
114 | featuresToPreProcess <- c(featuresToPreProcess, i)
115 | }
116 | }
117 | if(length(preProcessF) != 0 && length(featuresToPreProcess) > 1){
118 | featuresToPreProcess <- featuresToPreProcess[!featuresToPreProcess %in% cInd] #remove class column from set of features to be preprocessed
119 | dataTmp = list(TD = data[,featuresToPreProcess], TED = dataTED[,featuresToPreProcess])
120 | #Add PCA if we have more than 100 features
121 | if(length(featuresToPreProcess) > 100 && !('pca' %in% preProcessF) )
122 | preProcessF <- c(preProcessF, 'pca')
123 | for(i in 1:length(preProcessF)){
124 | dataTmp <- featurePreProcessing(dataTmp$TD, dataTmp$TED, preProcessF[i], nComp)
125 | }
126 |
127 | #add other features that don't require feature preprocessing to the features obtained after preprocessing
128 | diffTmp <- setdiff(selectedFeats, c(cInd, featuresToPreProcess))
129 | dHead = c(colnames(dataTmp$TD), colnames(data)[diffTmp])
130 |
131 | dataTDTmp <- data.frame(cbind(dataTmp$TD, data[,diffTmp]))
132 | dataTEDTmp <- data.frame(cbind(dataTmp$TED, dataTED[,diffTmp]))
133 | colnames(dataTDTmp) <- dHead
134 | colnames(dataTEDTmp) <- dHead
135 |
136 | #add class column to the dataframe of the dataset
137 | dataTDTmp$class <- data$class
138 | dataTEDTmp$class <- dataTED$class
139 | data <- dataTDTmp
140 | dataTED <- dataTEDTmp
141 | }
142 |
143 | #Class Balancing using Smote for metrics other than accuracy and binary class problems
144 | if( balance == TRUE || (metric != 'acc' && length(unique(data$class)) == 2) ){
145 | data$class = factor(data$class)
146 | data <- SmoteClassif(class ~., data, dist = 'HEOM')
147 | }
148 |
149 | # Use 70% of the dataset as Training - 30% of the dataset as Validation by default
150 | #smp_size <- floor((1-vRatio) * nrow(data))
151 | # set the seed to make your partition reproducible
152 | #train_ind <- sample(seq_len(nrow(data)), size = smp_size)
153 | train_ind <- createDataPartition(y = data$class, times = 1, p = (1-vRatio), list = FALSE)
154 | trainingDataset <- data[train_ind, ]
155 | validationDataset <- data[-train_ind, ]
156 | return (list(TD = trainingDataset, VD = validationDataset, FULLTD = data, TED = dataTED))
157 | }
158 |
--------------------------------------------------------------------------------
/R/runClassifier_.R:
--------------------------------------------------------------------------------
1 | #' @keywords internal
2 | runClassifier_ <- function(trainingSet, validationSet, params, classifierAlgorithm, metric = "acc") {
3 |
4 | #training set features and classes
5 | xFeatures <- subset(trainingSet, select = -class)
6 | xClass <- c(subset(trainingSet, select = class)$'class')
7 |
8 | #print(levels(xClass))
9 |
10 | #validation set features and classes
11 | yFeatures <- subset(validationSet, select = -class)
12 | yClass <- c(subset(validationSet, select = class)$'class')
13 |
14 | #print(levels(yClass))
15 |
16 | #remove not available parameters
17 | if(typeof(params) == 'character'){
18 | classifierConf <- getClassifierConf(classifierAlgorithm)
19 | params <- initialize(classifierAlgorithm, classifierConf, params)
20 | }
21 | for(i in colnames(params)){
22 | if(is.na(params[[i]]) || params[[i]] == 'NA' || params[[i]] == 'EI'){
23 | params <- subset(params, select = -get(i))
24 | }
25 | }
26 | # build model
27 | if(classifierAlgorithm == 'svm'){
28 | if(exists('gamma', where=params) && !is.na(params$gamma))
29 | params$gamma <- (2^ as.double(params$gamma))
30 | if(exists('cost', where=params) && !is.na(params$cost))
31 | params$cost <- (2^ as.double(params$cost))
32 | if(exists('tolerance', where=params) && !is.na(params$tolerance))
33 | params$tolerance <- (2^ as.double(params$tolerance))
34 | if(!exists('kernel', where = params))
35 | params$kernel <- 'radial'
36 | invisible(capture.output(suppressWarnings(model <- do.call(svm,c(list(x = xFeatures, y = xClass, type = 'C-classification', scale = F), params)))))
37 | #check performance
38 | pred <- predict(model, yFeatures)
39 | }
40 | else if(classifierAlgorithm == 'l2-linear-classifier'){
41 | params$cost <- (2^as.numeric(params$cost))
42 | params$epsilon <- as.numeric(params$epsilon)
43 | model <- LiblineaR(target = as.factor(xClass), data = xFeatures, cost = params$cost, epsilon = params$epsilon, type = 2)
44 | pred <- predict(model, yFeatures)$predictions
45 | }
46 | else if(classifierAlgorithm == 'naiveBayes'){
47 | if(!exists('eps', where = params)) {
48 | params$laplace <- as.numeric(params$laplace)
49 |
50 | model <- fnb.train(x = xFeatures, y = as.factor(xClass), laplace = params$laplace)
51 | }
52 | if(exists('eps', where = params)) {
53 |
54 | params$laplace <- as.numeric(params$laplace)
55 | params$eps <- (2 ^ as.numeric(params$eps))
56 | learn <- cbind(xClass, xFeatures)
57 | model <- naiveBayes(as.factor(xClass) ~., data = learn, laplace = params$laplace, eps = params$eps)
58 |
59 | }
60 |
61 | pred <- predict(model, yFeatures)
62 |
63 | }
64 | else if(classifierAlgorithm == 'boosting'){
65 | params$eta <- (2^as.numeric(params$eta))
66 | params$max_depth <- as.numeric(params$max_depth)
67 | params$min_child_weight <- as.numeric(params$min_child_weight)
68 | params$gamma <- as.numeric(params$gamma)
69 | params$colsample_bytree <- as.numeric(params$colsample_bytree)
70 |
71 | xClass_dmat <- xClass %>% as.numeric() %>% map(.f = ~ .x - 1)
72 | xFeatures_dmat <- xFeatures %>% as.matrix()
73 | mode(xFeatures_dmat) = 'double'
74 | yFeatures_dmat <- yFeatures %>% as.matrix()
75 | mode(yFeatures_dmat) = 'double'
76 |
77 | learn <- xgb.DMatrix(data = xFeatures_dmat, label = xClass_dmat)
78 | model <- xgboost(data = learn,
79 | nrounds = 5,
80 | eta = params$eta,
81 | max_depth = params$max_depth,
82 | min_child_weight = params$min_child_weight,
83 | gamma = params$gamma,
84 | colsample_bytree = params$colsample_bytree,
85 | objective = "multi:softprob",
86 | num_class = length(unique(xClass_dmat)),
87 | verbose = 0,
88 | nthread = 1)
89 |
90 | pred_prep <- predict(model, yFeatures_dmat, nthreads = 1)
91 |
92 | pred_mat <- matrix(pred_prep, ncol = length(unique(xClass_dmat)), byrow = T)
93 |
94 | colnames(pred_mat) <- levels(trainingSet$class)
95 |
96 | pred <- apply(pred_mat, 1, function(x) colnames(pred_mat)[which.max(x)])
97 |
98 | levels(pred) <- levels(trainingSet$class)
99 |
100 | }
101 | else if(classifierAlgorithm == 'ranger'){
102 | params$max.depth <- as.numeric(params$max.depth)
103 | params$num.trees <- as.numeric(params$num.trees)
104 | params$mtry <- min(as.numeric(params$mtry), ncol(xFeatures))
105 | params$min.node.size <- as.numeric(params$min.node.size)
106 | learn <- cbind(xClass, xFeatures)
107 | model <- ranger(as.factor(xClass) ~ .,
108 | data = learn,
109 | max.depth = params$max.depth,
110 | num.trees = params$num.trees,
111 | mtry = params$mtry,
112 | min.node.size = params$min.node.size,
113 | num.threads = 1)
114 | pred <- predict(model, yFeatures, num.threads = 1)$prediction
115 | }
116 | else if(classifierAlgorithm == 'randomForest'){
117 | params$mtry <- as.numeric(params$mtry)
118 | params$ntree <- as.numeric(params$ntree)
119 | params$mtry <- min(params$mtry, ncol(xFeatures))
120 | model <- do.call(randomForest,c(list(x = xFeatures, y = as.factor(xClass)), params))
121 | pred <- predict(model, yFeatures)
122 | }
123 | if (classifierAlgorithm != 'boosting') {
124 |
125 | perf <- evaluateMet(yClass, pred, metric = metric)
126 |
127 | }
128 | else {
129 |
130 | perf <- evaluateMet(validationSet$class, pred %>% factor(levels = levels(validationSet$class)), metric = metric)
131 |
132 | }
133 |
134 | result <- list()
135 | result$perf <- perf
136 |
137 | result$model <- model
138 | result$pred <- pred
139 |
140 | return(result)
141 | }
142 |
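143 | # A minimal usage sketch (not run; `trainDF`/`validDF` are placeholder data frames with a
144 | # `class` column). `params` may be either a '#'-separated string, which is expanded into a
145 | # one-row configuration via initialize(), or a one-row data frame of parameter values:
146 | #   out <- runClassifier_(trainDF, validDF, params = '', classifierAlgorithm = 'randomForest')
147 | #   out$perf   # evaluation metric on the validation set
148 | #   out$model  # fitted model object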
--------------------------------------------------------------------------------
/R/selectConfiguration.R:
--------------------------------------------------------------------------------
1 | #' @title Select Candidate Parameter Configuration
2 | #'
3 | #' @description Generate neighbor parameter configurations, sort them according to the expected improvement, and select the top promising ones as candidate configurations.
4 | #'
5 | #' @param R Dataframe of tried out parameter configurations.
6 | #' @param classifierAlgorithm String value of the classifier Name.
7 | #' @param tree List of data frames, representing the data structure for the forest of trees of the SMAC model.
8 | #' @param bestParams String of best parameter configuration found till now.
9 | #' @param B number of trees in the forest of trees of SMAC optimization algorithm (default = 10).
10 | #'
11 | #' @return Vector of strings of candidate parameter configurations.
12 | #'
13 | #' @examples selectConfiguration(c('1'), 'knn', data.frame(fold = c(), parent = c(), params = c(), leftChild = c(), rightChild = c(), performance = c(), rowN = c()), '1', 10)
14 | #'
15 | #' @import rjson
16 | #' @importFrom stats rnorm
17 | #'
18 | #' @noRd
19 | #'
20 | #' @keywords internal
21 |
22 | selectConfiguration <- function(R, classifierAlgorithm, tree, bestParams, B = 10) {
23 | #Read Classifier Algorithm Configuration Parameters
24 | #Open the Classifier Parameters Configuration File
25 | classifierConfDir <- system.file("extdata", paste(classifierAlgorithm,'.json',sep=""), package = "SmartML", mustWork = TRUE)
26 | result <- fromJSON(file = classifierConfDir)
27 |
28 | #get list of Classifier Parameters
29 | params <- result$params
30 |
31 | #minimum error rate found till now
32 | cmin <- (1 - bestParams$performance)
33 |
34 | #calculate Expected Improvement for all saved configurations
35 | for(i in 1:nrow(R)){
36 | cntParams <- R[i,]
37 | cntParamStr <- paste( unlist(cntParams), collapse='#')
38 | cntPerf <- c()
39 | #calculate Expected improvment from SMAC random forest model
40 | for(j in 1:B){
41 | cntNode <- tree[tree$fold==j & is.na(tree$parent), ]
42 | while(!is.na(cntNode[1])){
43 | cParent <- cntNode$rowN
44 | cntNode$params
45 | if(cntParamStr > as.character(cntNode$params) && !is.na(cntNode$rightChild)){
46 | cntNode <- tree[cntNode$rightChild, ]
47 | }
48 | else if(cntParamStr < as.character(cntNode$params) && !is.na(cntNode$leftChild)){
49 | cntNode <- tree[cntNode$leftChild, ]
50 | }
51 | else{
52 | cntPerf <- c(cntPerf, cntNode$performance)
53 | cntNode <- NA
54 | }
55 | }
56 | }
57 | cntParams$EI <- computeEI(cmin, cntPerf)
58 | R[i, ] <- cntParams
59 | }
60 | #sort according to Expected Improvement
61 | sortedR <- R[order(-R$EI),]
62 | #choose best promising configurations to suggest candidate configurations
63 | candidates <- R[0,]
64 | for(i in 1:min(10, nrow(R))){
65 | cntParams <- sortedR[i,]
66 | for(parI in params){
67 | tmpParams <- cntParams
68 | cntParam <- cntParams[[parI]]
69 | if(is.na(cntParam))
70 | next
71 | #for continuous Integer parameters
72 | if(result[[parI]]$type == 'continuous' && result[[parI]]$scale == 'int'){
73 | minVal <- as.double(result[[parI]]$minVal)
74 | maxVal <- as.double(result[[parI]]$maxVal)
75 | cntParam <- as.double(cntParam)
76 |
77 | #generate a candidate
78 | parValues <- c(result[[parI]]$values)
79 |
80 | while(cntParam == cntParams[[parI]]){
81 | cntParam <- sample(minVal:maxVal, 1, TRUE)
82 | if(result[[parI]]$constraint == 'odd' && (cntParam %% 2) == 0)
83 | cntParam = cntParams[[parI]]
84 | }
85 | tmpParams[[parI]] <- cntParam
86 | gparams <- c(parI)
87 | i <- 1
88 | while(i <= length(gparams)){
89 | parTmp <- gparams[i]
90 | if(parTmp != parI){
91 | if(is.na(cntParams[[parTmp]]))tmpParams[[parTmp]] <- result[[parTmp]]$default
92 | else tmpParams[[parTmp]] <- cntParams[[parTmp]]
93 | }
94 | i <- i + 1
95 | }
96 | tmpParams$EI <- NA
97 | tmpParams$performance <- NA
98 | candidates <- rbind(candidates, tmpParams)
99 | }
100 | #for continuous Non-Integer parameters
101 | else if(result[[parI]]$type == 'continuous'){
102 | minVal <- as.double(result[[parI]]$minVal)
103 | maxVal <- as.double(result[[parI]]$maxVal)
104 | cntParam <- as.double(cntParam)
105 | meanU <- (cntParam - minVal)/(maxVal - minVal)
106 | #generate four candidates
107 | num <- 1
108 | while(num < 5){
109 | cntParam <- rnorm(1, mean = meanU, sd = 0.2)
110 | if(cntParam <= 1 && cntParam >= 0){
111 | num <- num + 1
112 | tmpParams[[parI]] <- as.character(cntParam * (maxVal - minVal) + minVal)
113 | tmpParams$EI <- NA
114 | tmpParams$performance <- NA
115 | candidates <- rbind(candidates, tmpParams)
116 | }
117 | }
118 | }
119 | #for Categorical (discrete parameters)
120 | else if(result[[parI]]$type == 'discrete'){
121 | parValues <- c(result[[parI]]$values)
122 | while(cntParam == cntParams[[parI]])
123 | cntParam <- sample(parValues, 1)
124 | tmpParams[[parI]] <- cntParam
125 | gparams <- c(parI)
126 | i <- 1
127 | while(i <= length(gparams)){
128 | parTmp <- gparams[i]
129 | if(parTmp != parI){
130 | if(is.na(cntParams[[parTmp]]))tmpParams[[parTmp]] <- result[[parTmp]]$default
131 | else tmpParams[[parTmp]] <- cntParams[[parTmp]]
132 | }
133 | require <- result[[parTmp]]$'requires'[[cntParam]]$require
134 | gparams <- c(gparams, require)
135 | i <- i + 1
136 | }
137 | tmpParams$EI <- NA
138 | tmpParams$performance <- NA
139 | candidates <- rbind(candidates, tmpParams)
140 | }
141 | }
142 | }
143 | candidates <- unique(candidates)
144 |
145 | #Remove Duplicate Candidate Configurations
146 | duplicates <- c()
147 | for(i in 1:nrow(candidates)){
148 | for(j in 1:nrow(R)){
149 | flager <- FALSE
150 | for(k in 1:(ncol(candidates)-2)){
151 | if((is.na(candidates[i,k]) != is.na(R[j,k])) || (!is.na(candidates[i,k]) && !is.na(R[j,k]) && candidates[i,k] != R[j,k])){
152 | flager <- TRUE
153 | break
154 | }
155 | }
156 | if(flager == FALSE)
157 | duplicates <- c(duplicates, i)
158 | }
159 | }
160 | if(length(duplicates) > 0)
161 | candidates <- candidates[-duplicates,]
162 | #End Remove Candidate Configurations
163 | return(candidates)
164 | }
165 |
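166 | # Candidate generation sketch: for each of the top configurations (ranked by expected
167 | # improvement over the SMAC forest), integer parameters get one resampled neighbour,
168 | # continuous parameters get four Gaussian perturbations (sd = 0.2 in [0, 1]-normalized
169 | # space), and discrete parameters get one alternative value; duplicates of already tried
170 | # configurations are dropped before the candidates are returned.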
--------------------------------------------------------------------------------
/R/sendToDatabase.R:
--------------------------------------------------------------------------------
1 | #' @title Send Results to Knowledge Base
2 | #'
3 | #' @description Connect to the cloud knowledge base to store the results obtained to be used in meta-learning of future runs.
4 | #' @param tmp String of characters to be sent to knowledge base
5 | #' @return None
6 | #'
7 | #' @examples sendToDatabase()
8 | #'
9 | #' @noRd
10 | #'
11 | #' @import devtools
12 | #' @importFrom rjson fromJSON
13 | #' @importFrom httr POST
14 | #'
15 | #' @keywords internal
16 |
17 | sendToDatabase <- function(tmp){
18 | #Get IP
19 | cntIP <- fromJSON(readLines("http://api.hostip.info/get_json.php", warn=F))$ip
20 |
21 | #Update knowledge base
22 | updateKB <- try(
23 | {
24 | #tmp <- paste(readLines(system.file("extdata", "tmp", package = "SmartML", mustWork = TRUE)), collapse="\n")
25 | res <- POST("https://jncvt2k156.execute-api.eu-west-1.amazonaws.com/default/s3-trigger-rautoml", body = list(data = paste(tmp, "&DATA&", sep=""),
26 | fName = paste(cntIP,".csv&FILENAME&", sep=""),
27 | encode = "json"))
28 | #write("", file=system.file("extdata", "tmp", package = "SmartML", mustWork = TRUE),append=TRUE) #Empty the tmp file
29 | })
30 | if(inherits(updateKB, "try-error"))
31 | print('Failed to update Knowledge base.')
32 |
33 | }
34 |
--------------------------------------------------------------------------------
/R/sendToTmp.R:
--------------------------------------------------------------------------------
1 | #' @title Write results.
2 | #'
3 | #' @description Append results to a log file.
4 | #'
5 | #' @param df List of the dataset meta-features
6 | #' @param algorithmName String of the name of selected classifier algorithm.
7 | #' @param bestParams String of the best parameters configuration found.
8 | #' @param perf String of the performance value obtained using the selected algorithm and parameter configuration.
9 | #' @param nModels Integer representing the number of classifier algorithms that you want to select based on Meta-Learning and start to tune using Bayesian Optimization.
10 | #' @param metric Metric to be used in evaluation:
11 | #' \itemize{
12 | #' \item "acc" - Accuracy,
13 | #' \item "fscore" - Micro-Average of F-Score of each label,
14 | #' \item "recall" - Micro-Average of Recall of each label,
15 | #' \item "precision" - Micro-Average of Precision of each label.
16 | #' }
17 | #'
18 | #' @return None
19 | #'
20 | #' @examples sendToTmp(\code{df}, 'knn', '1', '0.9').
21 | #'
22 | #' @noRd
23 | #'
24 | #' @keywords internal
25 |
26 | sendToTmp <- function(df, algorithmName, bestParams, perf, nModels, metric = 'acc') {
27 | df$params <- sprintf("%s", paste( unlist(bestParams), collapse='#'))
28 | df$performance <- perf
29 | df$classifierAlgorithm <- sprintf("%s", algorithmName)
30 |
31 | query <- sprintf("%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s",
32 | df$datasetRatio, df$featuresKurtStdDev, df$featuresKurtMean, df$featuresKurtMax, df$featuresKurtMin, df$featuresSkewStdDev,
33 | df$featuresSkewMean, df$featuresSkewMax, df$featuresSkewMin, df$symbolsStdDev, df$symbolsSum, df$symbolsMean, df$classProbStdDev,
34 | df$classProbMean, df$classProbMax, df$classProbMin, df$classEntropy, df$ratioNumToCat, df$nCatFeatures, df$nNumFeatures,
35 | df$nInstances, df$nFeatures, df$nClasses, df$lognFeatures, df$lognInstances, df$classifierAlgorithm, df$params, df$maxTime, metric,
36 | df$performance, nModels)
37 | return(query)
38 | #write(query, file=system.file("extdata", "tmp", package = "SmartML", mustWork = TRUE),append=TRUE)
39 | }
40 |
--------------------------------------------------------------------------------
/R/successive_halving.R:
--------------------------------------------------------------------------------
1 | #' @keywords internal
2 | #'
3 | successive_halving <- function(df, model, params_config, n = 81, r = 1, eta = 3,
4 | max_iter = 81, s_max = 5, evaluations = data.frame(),
5 | problem = 'classification', measure = 'classif.acc') {
6 |
7 | final_df = params_config
8 | print('GOT HERE 0')
9 | if(problem == 'classification'){
10 | problem = 'classif'
11 | task = TaskClassif$new(id = 'sh', backend = df, target = 'class')
12 | }
13 | else{
14 | problem = 'regr'
15 | task = TaskRegr$new(id = 'sh', backend = df, target = 'class')
16 | }
17 | param_number = length(params_config)
18 |
19 | for (k in 0:s_max) {
20 | gc()
21 | n_i = n * (eta ** -k)
22 | r_i = r * (eta ** k)
23 | r_p = r_i / max_iter
24 | min_train_datapoints = (length(unique(df$class)) * 3) + 1
25 | min_prob_datapoints = min_train_datapoints / nrow(df)
26 | train_idxs <- sample(task$nrow, task$nrow * max(min(r_p, 0.8), min_prob_datapoints))
27 | test_idxs <- setdiff(seq_len(task$nrow), train_idxs)
28 | if (problem == 'classif')
29 | learners <- replicate(n = n_i, expr = {lrn(paste(problem, sep = '.', model),
30 | predict_type = 'prob')})
31 | else
32 | learners <- replicate(n = n_i, expr = {lrn(paste(problem, sep = '.', model))})
33 |
34 | print('GOT HERE 1')
35 | j = 1
36 | for (i in learners) {
37 | cnt_field <- final_df[[j]]
38 | ## Some conditions to filter the parameter values
39 | if (model == 'svm' && final_df[[j]]$kernel != 'polynomial')
40 | cnt_field$degree <- NULL
41 | if ( (model == 'svm' && final_df[[j]]$kernel == 'linear') || (model == 'cv_glmnet' && final_df[[j]]$relax == FALSE))
42 | cnt_field$gamma <- NULL
43 |
44 | i$param_set$values = cnt_field
45 | j = j + 1
46 | }
47 |
48 | print('GOT HERE 2')
49 | for (l in learners) {
50 | l$train(task = task, row_ids = train_idxs)
51 | }
52 |
53 | print('GOT HERE 3')
54 | preds <- map(.x = learners, .f = ~ .x$predict(task, row_ids = test_idxs)$score(msr(measure)))
55 |
56 |
57 | final_df <- final_df %>%
58 | as.data.table() %>%
59 | t() %>%
60 | `colnames<-`(value = jsons[[model]]$params) %>%
61 | as.data.table()
62 |
63 |
64 | final_df[, acc := unlist(preds)]
65 | final_df[, budget := r_i]
66 | final_df[, rp := r_p]
67 | final_df[, model := unlist(learners)]
68 | setorder(final_df, -acc)
69 | evaluations <- rbindlist(list(evaluations, final_df))
70 |
71 |
72 | final_df <- final_df %>%
73 | head(max(n_i/eta, 1))
74 |
75 |
76 | if(k == s_max){
77 | return(list("answer" = final_df, "sh_runs" = evaluations))
78 | }
79 |
80 | final_df$acc = NULL
81 | final_df$budget = NULL
82 | final_df$model = NULL
83 | final_df <- purrr::transpose(final_df)
84 |
85 | }
86 | }
87 |
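88 | # Worked example for the most exploratory bracket (n = 81, r = 1, eta = 3, s_max = 4,
89 | # max_iter = 81): round k trains n_i = n * eta^-k configurations on a fraction
90 | # r_i / max_iter of the task and promotes the top n_i / eta of them:
91 | #   k = 0: 81 configs on 1/81 of the rows,  k = 1: 27 on 3/81,  k = 2: 9 on 9/81,
92 | #   k = 3: 3 on 27/81,  k = 4: 1 on the full budget (the train split is capped at 80%
93 | #   of the rows and floored at min_prob_datapoints above).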
--------------------------------------------------------------------------------
/R/successive_resampling.R:
--------------------------------------------------------------------------------
1 | #' @importFrom KernSmooth dpik bkde
2 | #' @importFrom tidyr drop_na separate gather spread unite
3 | #' @importFrom dplyr select mutate_if arrange top_frac case_when mutate filter group_by ungroup top_n
4 | #' @importFrom truncnorm rtruncnorm dtruncnorm
5 |
6 | #' @keywords internal
7 | dpikSafe <- function(x, ...)
8 | {
9 | result <- try(dpik(x, ...), silent = TRUE)
10 | if (inherits(result, "try-error"))
11 | {
12 | msg <- geterrmessage()
13 | if (grepl("scale estimate is zero for input data", msg))
14 | {
15 | warning("Using standard deviation as scale estimate, probably because IQR == 0")
16 | result <- try(dpik(x, scalest = "stdev", ...), silent = TRUE )
17 | if (inherits(result, "try-error")) {
18 | msg <- geterrmessage()
19 | if (grepl("scale estimate is zero for input data", msg)) {
20 | warning("0 scale, bandwidth estimation failed. using 1e-3")
21 | result <- 1e-3
22 | }
23 | }
24 | } else
25 | {
26 | stop(msg)
27 | }
28 | }
29 | return(result)
30 | }
31 |
32 | #' @keywords internal
33 | successive_resampling <- function(df, model, samples = 64, n = 27, bw = 3, kde_type = "single") {
34 | samples_filtered <- df %>% drop_na()
35 | params_list <- jsons[[model]]$params
36 | length_params <- length(params_list)
37 | biggest_budget_that_satisfies <- samples_filtered %>%
38 | mutate(acc = as.numeric(acc)) %>%
39 | group_by(budget) %>%
40 | mutate(size = n()) %>%
41 | ungroup() %>%
42 | filter(size > ((length_params + 1) * 20/3)) %>%
43 | filter(budget == max(budget)) %>%
44 | arrange(desc(acc)) %>%
45 | select(-size) %>%
46 | separate(col = params,
47 | into = jsons[[model]]$params,
48 | sep = ",") %>%
49 | select(-model, -rp) %>%
50 | mutate_if(is.character, .funs = ~ str_extract(.x, pattern = "(?<==).*$") %>% parse_number)
51 | l_samples <- biggest_budget_that_satisfies %>%
52 | top_frac(0.15, wt = acc) %>%
53 | select(-acc, -budget)
54 |
55 | g_samples <- biggest_budget_that_satisfies %>%
56 | top_frac(-0.85, wt = acc) %>%
57 | select(-acc, -budget)
58 |
59 | l_kde_bws <- suppressWarnings(map_dbl(l_samples, dpikSafe))
60 | g_kde_bws <- suppressWarnings(map_dbl(g_samples, dpikSafe))
61 | l_kde_means <- map2_dbl(.x = l_samples, .y = l_kde_bws, .f = ~ mean(bkde(x = .x, bandwidth = .y)$x))
62 | g_kde_means <- map2_dbl(.x = g_samples, .y = g_kde_bws, .f = ~ mean(bkde(x = .x, bandwidth = .y)$x))
63 | maxvals <- map_dbl(.x = params_list, .f = ~ readr::parse_number(jsons[[model]][[.x]]$maxVal))
64 | minvals <- map_dbl(.x = params_list, .f = ~ readr::parse_number(jsons[[model]][[.x]]$minVal))
65 | types <- map_chr(.x = params_list, .f = ~ jsons[[model]][[.x]]$scale)
66 | partial_rtruncnorm <- function(n, a, b, mu, sigma, type) {
67 | case_when(type == "int" ~ round(rtruncnorm(n = n, a = a, b = b, mean = mu, sd = sigma)),
68 | type == "double" | type == "exp" ~ rtruncnorm(n = n, a = a, b = b, mean = mu, sd = sigma))
69 | }
70 |
71 | partial_dtruncnorm <- function(x, a, b, mu, sigma) {
72 | dtruncnorm(x = x, a = a, b = b, mean = mu, sd = sigma)
73 | }
74 |
75 | batch_samples <- pmap_dfc(.l = list("a" = minvals,
76 | "b" = maxvals,
77 | "mu" = l_kde_means,
78 | "sigma" = l_kde_bws * bw,
79 | "type" = types),
80 | .f = partial_rtruncnorm,
81 | n = samples) %>%
82 | set_names(nm = params_list)
83 |
84 | batch_samples_densities_l <- pmap_dfc(.l = list("x" = batch_samples,
85 | "a" = minvals,
86 | "b" = maxvals,
87 | "mu" = l_kde_means,
88 | "sigma" = l_kde_bws),
89 | .f = partial_dtruncnorm)
90 |
91 | batch_samples_densities_g <- pmap_dfc(.l = list("x" = batch_samples,
92 | "a" = minvals,
93 | "b" = maxvals,
94 | "mu" = g_kde_means,
95 | "sigma" = g_kde_bws),
96 | .f = partial_dtruncnorm)
97 |
98 | evaluate_batch_convolution <- batch_samples_densities_l / batch_samples_densities_g
99 |
100 | rank_sample_density <- function(samp, kdensity, n) {
101 | samp <- samp %>% as.data.frame()
102 | samp$rank <- kdensity
103 | sorted_samp <- samp %>% arrange(desc(rank)) %>% head(n)
104 | subset(sorted_samp, select = -rank)
105 | }
106 |
107 | if(kde_type == "mixed") {
108 | EI <- evaluate_batch_convolution %>%
109 | reduce(.f = `*`) %>%
110 | map_if(.p = ~ ((is.nan(.x) | is.infinite(.x)) == T),
111 | .f = ~ runif(1, min = 1e-5, max = 1e-3)) %>%
112 | flatten_dbl()
113 |
114 | batch_samples$rank <- EI
115 |
116 | evaluated_batch <- batch_samples %>%
117 | arrange(desc(rank)) %>%
118 | top_n(n = n, wt = rank)
119 |
120 | evaluated_batch_step_two <- evaluated_batch %>%
121 | select(-rank) %>%
122 | gather(key, value) %>%
123 | mutate(params = paste(key, value, sep = " = ")) %>%
124 | .[["params"]]
125 |
126 | eval_batch_step_three <- evaluated_batch_step_two %>%
127 | matrix(nrow = n, ncol = length(params_list)) %>%
128 | as.data.frame() %>%
129 | unite(col = "params", sep = ",") %>%
130 | mutate(model = model) %>%
131 | select(model, params)
132 |
133 | return(eval_batch_step_three)
134 |
135 | } else if(kde_type == "single") {
136 | evaluated_batch <- map2_dfc(.x = batch_samples, .y = evaluate_batch_convolution,
137 | .f = rank_sample_density, n = n)
138 |
139 | colnames(evaluated_batch) <- params_list
140 | final_df <- evaluated_batch %>%
141 | gather(key, value) %>%
142 | mutate(params = paste(key, value, sep = " = ")) %>%
143 | .[["params"]] %>%
144 | matrix(nrow = n, ncol = length(params_list)) %>%
145 | as.data.frame() %>%
146 | unite(col = "params", sep = ",") %>%
147 | mutate(model = model) %>%
148 | select(model, params)
149 |
150 | return(final_df)
151 | }
152 |
153 | }
154 |
--------------------------------------------------------------------------------
/R/sysdata.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataSystemsGroupUT/SmartML/e58b5bddb0fbf741e16f31651a282146143e78fe/R/sysdata.rda
--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | output: github_document
3 | ---
4 |
5 |
6 |
7 | ```{r setup, include = FALSE}
8 | knitr::opts_chunk$set(
9 | collapse = TRUE,
10 | comment = "#>",
11 | fig.path = "man/figures/README-",
12 | out.width = "100%"
13 | )
14 | ```
15 |
16 | # witchcraft
17 |
18 | [](https://cran.r-project.org/package=witchcraft)
19 | [](https://www.tidyverse.org/lifecycle/#experimental)
20 | [](https://travis-ci.org/brurucy/witchcraft)
21 |
22 |
23 | The R package *witchcraft* is an opinionated framework for automated machine learning, with the intent of being frequently updated with the newest state-of-the-art optimization methods.
24 |
25 | At the moment, *witchcraft* uses the [Bayesian-Optimization-Hyperband](https://arxiv.org/pdf/1603.06560.pdf) algorithm.
26 |
27 | Besides *Combined Algorithm Selection and Hyperparameter Optimization* (CASH), *witchcraft* provides tools to evaluate the results, consistent with the mlr3 workflow.
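
For intuition, the bracket schedule that Hyperband iterates over can be sketched in a few lines of base R. This is an illustrative simplification of the algorithm from the paper linked above, not the package's internal implementation; `R_max` and `eta` are assumed names for the maximum budget and the halving rate:

```{r hyperband-schedule, eval = FALSE}
# Simplified sketch of the Hyperband bracket schedule (Li et al., 2016).
R_max <- 81   # maximum budget per configuration
eta   <- 3    # fraction of configurations kept at each rung is 1/eta
s_max <- floor(log(R_max, base = eta) + 1e-9)  # epsilon guards against floating-point error

for (s in s_max:0) {
  n <- ceiling(((s_max + 1) / (s + 1)) * eta^s)  # configurations to start the bracket with
  r <- R_max * eta^(-s)                          # smallest budget used in this bracket
  cat("bracket s =", s, "| start with", n, "configs at budget", r, "\n")
}
```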
28 |
29 | ## Installation
30 |
31 | Soon, installing the **stable** version from [CRAN](https://cran.r-project.org/package=witchcraft) will be possible:
32 |
33 | ```{r cran-installation, eval = FALSE}
34 | install.packages("witchcraft")
35 | ```
36 |
37 | You can always install the **development** version from
38 | [GitHub](https://github.com/brurucy/witchcraft):
39 |
40 | ```{r gh-installation, eval = FALSE}
41 | # install.packages("remotes")
42 | remotes::install_github("brurucy/witchcraft")
43 | ```
44 |
45 | Installing this software requires a compiler.
46 |
47 | ## Valid example
48 |
49 | ```{r example, message=FALSE, eval=FALSE}
50 | library(SmartML)
51 | library(readr)
52 |
53 | data_train <- readr::read_csv('inst/extdata/dota_train.csv') %>%
54 | as.data.table()
55 |
56 | data_test <- readr::read_csv('inst/extdata/dota_test.csv') %>%
57 | as.data.table()
58 |
59 | data_train[, class := factor(class, levels = sort(unique(class)))]
60 | data_test[, class := factor(class, levels = sort(unique(class)))]
61 |
62 | params <- SmartML:::get_random_hp_config('kknn', columns = ncol(data_train) - 1)
63 |
64 | print(typeof(params$kernel))
65 | params
66 |
67 | ```
68 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | [](https://doi.org/10.5441/002/edbt.2019.54)
5 |
6 |
7 | ## SmartML:
8 | Currently, SmartML is an R package representing a meta-learning-based framework for automated selection and hyperparameter tuning for machine learning algorithms. Being meta-learning based, the framework is able to simulate the role of the machine learning expert. In particular, the framework is equipped with a continuously updated knowledge base that stores information about the meta-features of all processed datasets along with the associated performance of the different classifiers and their tuned parameters. Thus, for any new dataset, SmartML automatically extracts its meta-features and searches its knowledge base for the best performing algorithm to start its optimization process. In addition, SmartML makes use of the new runs to continuously enrich its knowledge base to improve its performance and robustness for future runs.
9 |
10 |
11 |
12 | ---
13 | ## SmartML Contribution Points and Goals:
14 |
15 | The goal of SmartML is to automate the process of classifier algorithm selection and hyper-parameter tuning in supervised machine learning, using a modified version of SMAC Bayesian optimization that favours exploitation over exploration thanks to meta-learning.
16 | 1. SmartML is the first R package to tackle supervised machine learning automation, and it is built on 16 different classifier algorithms from different R packages.
17 | 2. In addition, we offer different data preprocessing and feature engineering algorithms that can be specified by the user and easily applied to tabular datasets in either CSV or ARFF format.
18 | 3. SmartML has a collaborative knowledge base that grows over time as more users use the tool.
19 | 4. Finally, SmartML can produce model interpretability plots for feature importance and interaction with the help of the ```iml``` package.
20 |
21 | ---
22 | ## Installation
23 |
24 | You can install the released version of SmartML from [GitHub](https://github.com/mmaher22/SmartML) with:
25 |
26 | ``` r
27 | devtools::install_github("mmaher22/SmartML") # requires the devtools (or remotes) package
28 | ```
29 |
30 | ---
31 | ## User Manual
32 |
33 | The manual for the SmartML R package can be found HERE
34 |
35 | ---
36 | ## Example
37 |
38 | This is a basic example which shows how to run SmartML:
39 |
40 | ```{r}
41 | library(SmartML)
42 | ```
43 |
44 | ```{r}
45 | #' Option 1 = Classifier Selection Only, apply PCA as a preprocessing step with 4 components and get two candidate models as output only
46 | result1 <- autoRLearn(1, 'sampleDatasets/shuttle/train.arff', 'sampleDatasets/shuttle/test.arff', option = 1, preProcessF = 'pca', nComp = 4, nModels = 2)
47 |
48 | #option 1 runs for Classifier Algorithm Selection Only
49 | result1$clfs #Vector of recommended nModels classifiers
50 | result1$params #Vector of initial suggested parameter configurations of nModels recommended classifiers
51 |
52 | #Use recommended model to train over training data and make predictions over test data
53 | resultRun <- runClassifier(result1$TRData, result1$TEData, result1$params[[1]], result1$clfs[[1]])
54 | resultRun$perf #model performance on test set
55 | ```
56 |
57 | ```{r}
58 | #' Option 2 = Both Classifier Selection and Parameter Optimization and compute model interpretability plots
59 | result2 <- autoRLearn(2, 'sampleDatasets/car/train.arff', 'sampleDatasets/car/test.arff', interp = TRUE) # Option 2 runs for both classifier algorithm selection and parameter tuning for 2 minutes.
60 |
61 | result2$clfs #best classifier found
62 | result2$params #parameter configuration for best classifier
63 | result2$perf #performance of chosen classifier on testing set after fitting on whole training set
64 | ```
65 |
66 | ```{r}
67 | plot(result2$interpret$featImp) #Feature Importance Plot
68 | ```
69 |
70 | ```{r}
71 | #' Option 2 = Both Classifier Selection and Parameter Optimization, use 20% validation set from training set, and apply MICE for missing values imputation
72 | result3 <- autoRLearn(5, 'sampleDatasets/EEGEyeState/train.csv', 'sampleDatasets/EEGEyeState/test.csv', vRatio = 0.2, missingOpr = TRUE) # Option 2 runs for both classifier algorithm selection and parameter tuning for 5 minutes.
73 |
74 |
75 | result3$clfs #best classifier found
76 | result3$params #parameter configuration for best classifier
77 | result3$perf #performance of chosen classifier on testing set
78 | ```
79 |
80 | ---
81 | ## Contribution Guidelines
82 | To contribute to `SmartML`, please follow these guidelines.
83 |
84 | ---
85 | ## Publication
86 |
87 | SmartML has been accepted as a demo paper at EDBT 2019 in Lisbon, Portugal [PDF]:
88 | ```
89 | Mohamed Maher and Sherif Sakr. SmartML: A Meta Learning-Based Framework for Automated Selection and Hyperparameter Tuning for Machine Learning Algorithms. Advances in Database Technology - EDBT 2019: 22nd International Conference on Extending Database Technology, Lisbon, Portugal, March 26-29, 2019.
90 | ```
91 |
92 | ---
93 | ## License
94 | This work is licensed under the terms of the GNU General Public License, version 3.0 (GPLv3).
95 |
--------------------------------------------------------------------------------
/SmartML.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
15 | AutoAppendNewline: Yes
16 | StripTrailingWhitespace: Yes
17 |
18 | BuildType: Package
19 | PackageUseDevtools: Yes
20 | PackageInstallArgs: --no-multiarch --with-keep.source
21 | PackageCheckArgs: --as-cran
22 | PackageRoxygenize: rd,collate,namespace
23 |
--------------------------------------------------------------------------------
/SmartML_0.3.0.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataSystemsGroupUT/SmartML/e58b5bddb0fbf741e16f31651a282146143e78fe/SmartML_0.3.0.pdf
--------------------------------------------------------------------------------
/codecov.yml:
--------------------------------------------------------------------------------
1 | comment: false
2 | coverage:
3 | status:
4 | project:
5 | default:
6 | target: auto
7 | threshold: 1%
8 | patch:
9 | default:
10 | target: auto
11 | threshold: 1%
12 | language: R
13 | sudo: false
14 |
15 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataSystemsGroupUT/SmartML/e58b5bddb0fbf741e16f31651a282146143e78fe/inst/extdata/hyperband_jsons.zip
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/cv_glmnet.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["dfmax", "alpha", "gamma", "relax", "nfolds"],
3 | "parents":["dfmax", "alpha", "gamma", "relax", "nfolds"],
4 | "gamma":
5 | {
6 | "type":"continuous",
7 | "scale":"double",
8 | "minVal":"0",
9 | "maxVal":"1",
10 | "default":"0.5",
11 | "constraint":"any"
12 | },
13 | "alpha":
14 | {
15 | "type":"continuous",
16 | "scale":"double",
17 | "minVal":"0",
18 | "maxVal":"1",
19 | "default":"0.3",
20 | "constraint":"any"
21 | },
22 | "dfmax":
23 | {
24 | "type":"continuous",
25 | "scale":"int",
26 | "minVal":"10",
27 | "maxVal":"100",
28 | "default":"50",
29 | "constraint":"any"
30 | },
31 | "nfolds":
32 | {
33 | "type":"continuous",
34 | "scale":"int",
35 | "minVal":"3",
36 | "maxVal":"3",
37 | "default":"3",
38 | "constraint":"any"
39 | },
40 | "relax":
41 | {
42 | "type":"boolean",
43 | "values":["TRUE", "FALSE"],
44 | "default":"FALSE"
45 | }
46 | }
47 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/glmnet.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["dfmax", "alpha", "gamma", "relax"],
3 | "parents":["dfmax", "alpha", "gamma", "relax"],
4 | "gamma":
5 | {
6 | "type":"continuous",
7 | "scale":"double",
8 | "minVal":"0",
9 | "maxVal":"1",
10 | "default":"0.5",
11 | "constraint":"any"
12 | },
13 | "alpha":
14 | {
15 | "type":"continuous",
16 | "scale":"double",
17 | "minVal":"0",
18 | "maxVal":"1",
19 | "default":"0.3",
20 | "constraint":"any"
21 | },
22 | "dfmax":
23 | {
24 | "type":"continuous",
25 | "scale":"int",
26 | "minVal":"10",
27 | "maxVal":"100",
28 | "default":"50",
29 | "constraint":"any"
30 | },
31 | "relax":
32 | {
33 | "type":"boolean",
34 | "values":["TRUE", "FALSE"],
35 | "default":"FALSE"
36 | }
37 | }
38 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/kknn.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["k", "distance", "kernel"],
3 | "parents":["k", "distance", "kernel"],
4 | "k":
5 | {
6 | "type":"continuous",
7 | "scale":"int",
8 | "minVal":"1",
9 | "maxVal":"20",
10 | "default":"7",
11 | "constraint":"any"
12 | },
13 | "distance":
14 | {
15 | "type":"continuous",
16 | "scale":"int",
17 | "minVal":"1",
18 | "maxVal":"4",
19 | "default":"2",
20 | "constraint":"any"
21 | },
22 | "kernel":
23 | {
24 | "type":"discrete",
25 | "values":["rectangular", "epanechnikov", "gaussian", "rank", "optimal"],
26 | "default":"optimal"
27 | }
28 | }
29 |
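As a rough illustration of how a spec like this can drive random sampling, the sketch below reads the file with `jsonlite` and draws one configuration. This is only a sketch under the assumption that `jsonlite` is available; the package's own `get_random_hp_config()` (referenced in `README.Rmd`) may handle the ranges and scales (e.g. `exp`) differently.

``` r
library(jsonlite)

# Read the spec shown above and draw one random configuration from it.
spec <- fromJSON("inst/extdata/hyperband_jsons/kknn.json")

draw_param <- function(p) {
  if (!is.null(p$values)) {                 # discrete / boolean parameters
    sample(p$values, 1)
  } else if (identical(p$scale, "int")) {   # integer-valued range
    vals <- seq(as.numeric(p$minVal), as.numeric(p$maxVal))
    vals[sample.int(length(vals), 1)]       # avoids sample()'s 1:n behaviour on length-1 ranges
  } else {                                  # double (or exp) range
    runif(1, as.numeric(p$minVal), as.numeric(p$maxVal))
  }
}

config <- lapply(setNames(spec$params, spec$params), function(nm) draw_param(spec[[nm]]))
str(config)
```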
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/lm.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["singular.ok"],
3 | "parents":["singular.ok"],
4 | "type":
5 | {
6 | "type":"boolean",
7 | "values":["TRUE"],
8 | "default":"TRUE"
9 | }
10 | }
11 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/naive_bayes.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["laplace"],
3 | "parents":["laplace"],
4 | "laplace":
5 | {
6 | "default":"0",
7 | "type":"continuous",
8 | "scale":"int",
9 | "minVal":"0",
10 | "maxVal":"4",
11 | "constraint":"any"
12 | }
13 | }
14 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/ranger.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["num.trees", "mtry", "max.depth", "min.node.size", "verbose"],
3 | "parents":["num.trees", "mtry", "max.depth", "min.node.size", "verbose"],
4 | "num.trees":
5 | {
6 | "type":"continuous",
7 | "scale":"int",
8 | "minVal":"1",
9 | "maxVal":"500",
10 | "default":"500",
11 | "constraint":"any"
12 | },
13 | "mtry":
14 | {
15 | "type":"continuous",
16 | "scale":"int",
17 | "minVal":"1",
18 | "maxVal":"30",
19 | "default":"5",
20 | "constraint":"any"
21 | },
22 | "max.depth":
23 | {
24 | "type":"continuous",
25 | "scale":"int",
26 | "minVal":"0",
27 | "maxVal":"10",
28 | "default":"0",
29 | "constraint":"any"
30 | },
31 | "min.node.size":
32 | {
33 | "type":"continuous",
34 | "scale":"int",
35 | "minVal":"1",
36 | "maxVal":"10",
37 | "default":"2",
38 | "constraint":"any"
39 | },
40 | "verbose":
41 | {
42 | "type":"boolean",
43 | "values":["FALSE"],
44 | "default":"FALSE"
45 | }
46 | }
47 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/rpart.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["maxdepth", "minsplit"],
3 | "parents":["maxdepth", "minsplit"],
4 | "maxdepth":
5 | {
6 | "type":"continuous",
7 | "scale":"int",
8 | "minVal":"1",
9 | "maxVal":"30",
10 | "default":"6",
11 | "constraint":"any"
12 | },
13 | "minsplit":
14 | {
15 | "type":"continuous",
16 | "scale":"int",
17 | "minVal":"1",
18 | "maxVal":"30",
19 | "default":"10",
20 | "constraint":"any"
21 | }
22 | }
23 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/svm.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["kernel", "type", "degree", "gamma", "cost"],
3 | "parents":["kernel", "type", "degree", "gamma", "cost"],
4 | "kernel":
5 | {
6 | "type":"discrete",
7 | "values":["linear", "radial", "polynomial"],
8 | "default":"linear"
9 | },
10 | "type":
11 | {
12 | "type":"discrete",
13 | "values":["C-classification"],
14 | "default":"C-classification"
15 | },
16 | "gamma":
17 | {
18 | "default":"-4",
19 | "type":"continuous",
20 | "minVal":"-10",
21 | "maxVal":"5",
22 | "scale":"exp",
23 | "constraint":"any"
24 | },
25 | "degree":
26 | {
27 | "default":"3",
28 | "type":"continuous",
29 | "minVal":"2",
30 | "maxVal":"5",
31 | "scale":"int",
32 | "constraint":"any"
33 | },
34 | "cost":
35 | {
36 | "default":"-2",
37 | "type":"continuous",
38 | "minVal":"-6",
39 | "maxVal":"12",
40 | "scale":"exp",
41 | "constraint":"any"
42 | }
43 | }
44 |
--------------------------------------------------------------------------------
/inst/extdata/hyperband_jsons/xgboost.json:
--------------------------------------------------------------------------------
1 | {
2 | "params":["eta", "max_depth", "nrounds", "verbose", "min_child_weight"],
3 | "parents":["eta", "max_depth", "nrounds", "verbose", "min_child_weight"],
4 | "verbose":
5 | {
6 | "type":"continuous",
7 | "scale":"int",
8 | "minVal":"0",
9 | "maxVal":"0",
10 | "default":"0"
11 | },
12 | "nrounds":
13 | {
14 | "type":"continuous",
15 | "scale":"int",
16 | "minVal":"10",
17 | "maxVal":"1000",
18 | "default":"10"
19 | },
20 | "eta":
21 | {
22 | "type":"continuous",
23 | "scale":"double",
24 | "minVal":"0.01",
25 | "maxVal":"0.5",
26 | "default":"0.3"
27 | },
28 | "max_depth":
29 | {
30 | "type":"continuous",
31 | "scale":"int",
32 | "minVal":"2",
33 | "maxVal":"10",
34 | "default":"6"
35 | },
36 | "min_child_weight":
37 | {
38 | "type":"continuous",
39 | "scale":"int",
40 | "minVal":"1",
41 | "maxVal":"10",
42 | "default":"1"
43 | }
44 | }
45 |
--------------------------------------------------------------------------------
/inst/extdata/ta_test.csv:
--------------------------------------------------------------------------------
1 | X1.1,X1.2,X2.1,X2.2,X2.3,X2.4,X2.5,X2.6,X2.7,X2.8,X2.9,X2.10,X2.11,X2.12,X2.13,X2.14,X2.15,X2.16,X2.17,X2.18,X2.19,X2.20,X2.21,X2.22,X2.23,X2.24,X2.25,X3.1,X3.2,X3.3,X3.4,X3.5,X3.6,X3.7,X3.8,X3.9,X3.10,X3.11,X3.12,X3.13,X3.14,X3.15,X3.16,X3.17,X3.18,X3.19,X3.20,X3.21,X3.22,X3.23,X3.24,X3.25,X3.26,X4.1,X4.2,X5,class
2 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.104308876612462,3
3 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,3
4 | 0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.4550689930872934,2
5 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.096069109761043,1
6 | 0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.7940812560427943,1
7 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.6877397085145439,3
8 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.8428535187993775,3
9 | 0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1.1736260149034599,2
10 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.4062967303307103,2
11 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-1.3857518547962953,2
12 | 0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7082845840489589,1
13 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.22239827766004297,3
14 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,3
15 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.610182803372127,2
16 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.087829342909624325,1
17 | 0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.06728446737520931,1
18 | 0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.3775120879448766,1
19 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.9574348331790463,1
20 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.8428535187993775,3
21 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.6184225702235457,3
22 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.026751971470045,3
23 | 1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.087829342909624325,3
24 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.3081949496538785,2
25 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,-0.5326258982297103,2
26 | 0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.29995518280245975,2
27 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.7735363805083795,2
28 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.096069109761043,2
29 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0.087829342909624325,1
30 | 0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.2306380445114617,1
31 |
--------------------------------------------------------------------------------
/inst/extdata/ta_train.csv:
--------------------------------------------------------------------------------
1 | X1.1,X1.2,X2.1,X2.2,X2.3,X2.4,X2.5,X2.6,X2.7,X2.8,X2.9,X2.10,X2.11,X2.12,X2.13,X2.14,X2.15,X2.16,X2.17,X2.18,X2.19,X2.20,X2.21,X2.22,X2.23,X2.24,X2.25,X3.1,X3.2,X3.3,X3.4,X3.5,X3.6,X3.7,X3.8,X3.9,X3.10,X3.11,X3.12,X3.13,X3.14,X3.15,X3.16,X3.17,X3.18,X3.19,X3.20,X3.21,X3.22,X3.23,X3.24,X3.25,X3.26,X4.1,X4.2,X5,class
2 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.6877397085145439,3
3 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.8428535187993775,3
4 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.6389674457579608,3
5 | 1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.39805696347929165,3
6 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,3
7 | 0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.06728446737520931,3
8 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2.3369795920397123,3
9 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,3
10 | 0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,-1.463308759938712,3
11 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.16538624805204116,3
12 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0.087829342909624325,3
13 | 0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0.8633983943337925,3
14 | 0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1.096069109761043,2
15 | 0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1.1736260149034599,2
16 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.3857518547962953,2
17 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.4062967303307103,2
18 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-1.3857518547962953,2
19 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1.096069109761043,2
20 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.06728446737520931,2
21 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.3775120879448766,2
22 | 0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.24294315319445797,2
23 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7082845840489589,2
24 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.1530811393690448,2
25 | 0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.29995518280245975,2
26 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7858414891913758,2
27 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.01027243776720751,1
28 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,1
29 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.24294315319445797,1
30 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-1.1530811393690448,1
31 | 0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7082845840489589,1
32 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.6307276789065421,1
33 | 0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,-0.5326258982297103,1
34 | 0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.5614105406155439,1
35 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7858414891913758,1
36 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.6389674457579608,3
37 | 1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.39805696347929165,3
38 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.104308876612462,3
39 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,3
40 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,3
41 | 0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.06728446737520931,3
42 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.3369795920397123,3
43 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,3
44 | 0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,-1.463308759938712,3
45 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.16538624805204116,3
46 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0.087829342909624325,3
47 | 0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0.8633983943337925,3
48 | 0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1.096069109761043,2
49 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.3857518547962953,2
50 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1.096069109761043,2
51 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.06728446737520931,2
52 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.3775120879448766,2
53 | 0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.24294315319445797,2
54 | 0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.4550689930872934,2
55 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7082845840489589,2
56 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.1530811393690448,2
57 | 0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.29995518280245975,2
58 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7858414891913758,2
59 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.096069109761043,1
60 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.01027243776720751,1
61 | 0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.7940812560427943,1
62 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,1
63 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.24294315319445797,1
64 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-1.1530811393690448,1
65 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.6307276789065421,1
66 | 0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,-0.5326258982297103,1
67 | 0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.5614105406155439,1
68 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7858414891913758,1
69 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.8428535187993775,3
70 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,-1.3081949496538785,3
71 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.8633983943337925,3
72 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-1.3081949496538785,3
73 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-0.6877397085145439,3
74 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.3287398251882936,3
75 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,3
76 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,3
77 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7858414891913758,3
78 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,-0.8428535187993775,3
79 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,3
80 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.29995518280245975,3
81 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,-0.22239827766004297,3
82 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0.24294315319445797,3
83 | 0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.24294315319445797,3
84 | 0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.7652966136569607,2
85 | 1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,-0.4550689930872934,2
86 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.06728446737520931,2
87 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.075524234226628,2
88 | 1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0.5531707737641253,2
89 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,2
90 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,-0.610182803372127,2
91 | 0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0.7082845840489589,2
92 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.9979673290842112,2
93 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.22239827766004297,2
94 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.3857518547962953,2
95 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.075524234226628,1
96 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7858414891913758,1
97 | 0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,1
98 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.16538624805204116,1
99 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.3205000583368748,1
100 | 0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.47561386862170846,1
101 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.2306380445114617,1
102 | 0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.087829342909624325,1
103 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,1
104 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.928650190793213,1
105 | 1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.5326258982297103,3
106 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.6307276789065421,3
107 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.3287398251882936,3
108 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.9204104239417944,2
109 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.7652966136569607,2
110 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1.2511829200458766,2
111 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,-0.8428535187993775,2
112 | 0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,-0.610182803372127,2
113 | 0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.16538624805204116,1
114 | 0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.6877397085145439,1
115 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.3081949496538785,1
116 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,-0.9979673290842112,1
117 | 0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.7082845840489589,1
118 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.3857518547962953,1
119 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.29995518280245975,1
120 | 0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.14484137251762613,1
121 | 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.5614105406155439,1
122 | 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.7940812560427943,1
123 | 0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-0.06728446737520931,1
124 |
--------------------------------------------------------------------------------
/inst/extdata/test_schizo.csv:
--------------------------------------------------------------------------------
1 | target,gain_ratio_1,gain_ratio_2,gain_ratio_3,gain_ratio_4,gain_ratio_5,gain_ratio_6,gain_ratio_7,gain_ratio_8,gain_ratio_9,gain_ratio_10,gain_ratio_11,sex,y
2 | PS,0.879,0.864,0.804,0.65,0.74,0.766,0.866,0.817,0.879,0.733,0.845,female,non-schizophrenic
3 | PS,0.919,0.875,0.828,0.915,0.883,0.802,0.802,0.77,0.963,0.932,1.01,female,non-schizophrenic
4 | CS,0.829,0.753,0.774,0.716,0.776,0.793,0.738,0.731,0.76,0.636,0.642,female,non-schizophrenic
5 | PS,0.8425,0.829,0.828,0.741,0.831,0.832,0.665,0.816,0.819,0.73,0.816,female,non-schizophrenic
6 | PS,0.948,0.896,0.872,0.869,0.819,0.852,0.815,0.83,0.799,0.728,0.69,female,non-schizophrenic
7 | PS,0.862,0.881,0.874,0.874,0.835,0.814,0.825,0.772,0.711,0.716,0.726,female,non-schizophrenic
8 | PS,0.8425,0.829,0.952,0.83,0.831,0.98,0.827,0.892,0.962,0.836,0.816,female,non-schizophrenic
9 | PS,0.791,0.834,0.726,0.83,0.831,0.832,0.722,0.816,0.838,0.827,0.916,female,non-schizophrenic
10 | PS,0.872,0.829,0.867,0.919,0.795,0.756,0.854,0.945,0.842,0.82,0.816,female,non-schizophrenic
11 | PS,0.947,0.912,0.94,0.919,0.915,0.889,0.901,0.874,0.837,0.872,0.84,female,non-schizophrenic
12 | PS,0.88,0.829,0.798,0.822,0.77,0.815,0.803,0.816,0.767,0.82,0.797,female,non-schizophrenic
13 | PS,0.799,0.69,0.701,0.738,0.831,0.761,0.696,0.679,0.709,0.65,0.816,female,non-schizophrenic
14 | TR,0.8425,0.829,0.966,0.926,0.831,0.832,0.827,0.916,0.777,0.82,0.816,female,non-schizophrenic
15 | PS,0.8425,0.829,0.828,0.83,0.947,1.2,1.14,1.1,1.12,0.871,0.809,female,non-schizophrenic
16 | PS,0.896,0.874,0.893,0.944,0.933,0.941,0.892,0.893,0.84,0.82,0.829,female,non-schizophrenic
17 | PS,0.914,0.873,0.844,0.925,0.868,0.783,0.701,0.741,0.722,0.828,0.816,female,non-schizophrenic
18 | TR,0.8425,0.829,0.828,0.83,0.831,0.832,0.827,0.816,0.819,0.82,0.816,female,non-schizophrenic
19 | CS,0.807,0.811,0.787,0.728,0.803,0.832,0.827,0.816,0.819,0.82,0.816,female,non-schizophrenic
20 | PS,0.803,0.782,0.623,0.828,0.826,0.793,0.811,0.75,0.816,0.753,0.766,female,non-schizophrenic
21 | CS,0.939,0.841,0.901,0.917,0.896,0.921,0.899,0.804,0.894,0.846,0.902,female,non-schizophrenic
22 | PS,0.813,0.758,0.828,0.83,0.831,0.832,0.827,0.77,0.819,0.82,0.773,female,non-schizophrenic
23 | TR,0.697,0.617,0.759,0.83,0.6,0.604,0.619,0.592,0.819,0.82,0.679,female,non-schizophrenic
24 | TR,0.782,0.88,0.828,0.83,0.709,0.886,0.841,0.816,0.819,0.82,0.843,female,non-schizophrenic
25 | CS,1.03,1.01,1.02,0.964,1.05,1.01,0.985,0.964,1.01,1,1.01,male,non-schizophrenic
26 | PS,0.822,0.843,0.625,0.81,0.702,0.702,0.842,0.865,0.701,0.77,0.801,male,non-schizophrenic
27 | PS,0.863,0.913,0.743,0.86,0.803,0.889,0.924,0.87,0.872,0.859,0.84,male,non-schizophrenic
28 | PS,0.901,0.777,0.743,0.858,0.811,0.751,0.627,0.748,0.808,0.669,0.844,male,non-schizophrenic
29 | PS,0.81,0.735,0.664,0.826,0.767,0.604,0.669,0.87,0.817,0.59,0.835,male,non-schizophrenic
30 | TR,0.674,0.646,0.626,0.639,0.64,0.665,0.655,0.661,0.724,0.7,0.661,male,non-schizophrenic
31 | CS,1,0.958,0.938,1.02,0.956,0.909,1.04,0.902,0.956,0.939,0.954,male,non-schizophrenic
32 | PS,0.8425,0.829,0.828,0.83,0.924,0.924,1,0.986,0.962,1.02,0.991,male,non-schizophrenic
33 | PS,0.94,0.971,0.76,0.983,0.998,0.894,0.856,0.942,0.937,0.965,0.936,male,non-schizophrenic
34 | TR,0.8425,0.829,0.803,0.826,0.764,0.815,0.868,0.791,0.86,0.82,0.839,male,non-schizophrenic
35 | CS,0.894,0.954,0.939,0.938,0.9,0.936,0.944,0.884,0.93,0.885,0.846,male,non-schizophrenic
36 | CS,0.8425,0.829,0.711,0.83,0.78,0.832,0.775,0.68,0.819,0.858,0.662,male,non-schizophrenic
37 | CS,0.962,0.93,0.922,0.858,0.905,0.793,0.867,0.948,0.879,0.916,0.781,male,non-schizophrenic
38 | TR,0.757,0.756,0.811,0.709,0.714,0.743,0.745,0.816,0.819,0.82,0.813,male,non-schizophrenic
39 | PS,0.8425,0.93,0.906,1.01,0.933,0.832,0.862,0.816,0.819,0.82,0.816,male,non-schizophrenic
40 | CS,0.868,0.901,0.893,0.864,0.831,0.795,0.905,0.872,0.873,0.872,0.822,male,non-schizophrenic
41 | TR,0.8,0.76,0.815,0.759,0.828,0.77,0.769,0.789,0.73,0.766,0.85,male,non-schizophrenic
42 | PS,0.895,0.771,0.997,0.885,0.948,0.832,0.843,0.66,0.729,0.801,0.893,male,non-schizophrenic
43 | CS,0.643,0.829,0.828,0.83,0.831,0.832,0.626,0.816,0.819,0.82,0.816,male,non-schizophrenic
44 | TR,0.742,0.829,0.7,0.743,0.748,0.827,0.827,0.816,0.819,0.82,0.776,male,non-schizophrenic
45 | CS,0.767,0.822,0.828,0.798,0.806,0.766,0.767,0.816,0.82,0.876,0.756,male,non-schizophrenic
46 | PS,0.876,0.866,0.899,0.923,0.832,0.849,0.827,0.906,0.822,0.885,0.826,female,schizophrenic
47 | CS,0.836,0.944,0.889,0.909,0.863,0.838,0.844,0.784,0.819,0.82,0.816,female,schizophrenic
48 | PS,0.8425,0.857,0.828,0.798,0.831,0.832,0.757,0.742,0.819,0.82,0.816,female,schizophrenic
49 | TR,0.8425,0.829,0.682,0.651,0.672,0.832,0.827,0.604,0.819,0.82,0.816,female,schizophrenic
50 | CS,0.919,0.856,0.825,0.908,0.896,0.886,0.905,0.938,0.875,0.983,0.881,female,schizophrenic
51 | PS,0.911,0.927,0.798,0.938,0.899,0.952,0.925,0.851,0.953,0.761,0.952,female,schizophrenic
52 | PS,0.8425,0.829,0.613,0.44,0.831,0.832,0.827,0.816,0.819,0.82,0.816,female,schizophrenic
53 | PS,0.726,0.734,0.862,0.83,0.972,0.9,0.876,0.83,0.878,0.79,0.868,female,schizophrenic
54 | TR,0.756,0.871,0.712,0.897,0.785,0.789,0.724,0.798,0.581,0.672,0.636,female,schizophrenic
55 | PS,0.782,0.829,0.828,0.83,0.84,0.832,0.837,0.816,0.819,0.797,0.816,female,schizophrenic
56 | PS,0.937,0.776,0.857,0.899,0.955,0.929,0.827,0.89,0.819,0.818,0.945,female,schizophrenic
57 | TR,0.75,0.829,0.744,0.83,0.794,0.732,0.827,0.697,0.819,0.772,0.816,female,schizophrenic
58 | PS,0.8,0.866,0.915,0.911,0.9,0.886,0.837,0.848,0.896,0.755,0.861,female,schizophrenic
59 | TR,0.899,0.768,0.787,0.781,0.735,0.827,0.796,0.793,0.729,0.801,0.838,female,schizophrenic
60 | CS,0.83,0.828,0.697,0.731,0.817,0.687,0.778,0.612,0.668,0.755,0.754,male,schizophrenic
61 | PS,0.63,0.631,0.828,0.664,0.579,0.832,0.801,0.641,0.819,0.82,0.816,male,schizophrenic
62 | PS,0.691,0.709,0.828,0.83,0.831,0.687,0.639,0.667,0.669,0.695,0.545,male,schizophrenic
63 | PS,0.782,0.812,0.828,0.669,0.701,0.726,0.827,0.673,0.708,0.637,0.728,male,schizophrenic
64 | PS,0.932,0.783,0.809,0.837,0.744,0.794,0.767,0.71,0.622,0.569,0.562,male,schizophrenic
65 | PS,0.851,0.828,0.808,0.827,0.873,0.862,0.752,0.668,0.687,0.717,0.696,male,schizophrenic
66 | PS,0.73,0.729,0.828,0.704,0.831,0.692,0.637,0.581,0.819,0.654,0.816,male,schizophrenic
67 | TR,0.564,0.703,0.59,0.58,0.831,0.832,0.667,0.584,0.819,0.688,0.584,male,schizophrenic
68 | CS,0.779,0.707,0.705,0.785,0.58,0.746,0.715,0.551,0.799,0.668,0.779,male,schizophrenic
69 | PS,0.787,0.748,0.764,0.796,0.778,0.758,0.75,0.746,0.763,0.647,0.734,male,schizophrenic
70 | PS,0.8425,0.773,0.635,0.594,0.608,0.832,0.526,0.625,0.623,0.712,0.782,male,schizophrenic
71 | PS,0.927,0.854,0.828,0.83,1.01,0.955,0.916,0.957,0.905,0.855,0.947,male,schizophrenic
72 | CS,0.893,0.702,0.902,0.83,0.831,0.777,0.827,0.816,0.819,0.82,0.816,male,schizophrenic
73 | PS,0.8425,0.829,0.828,0.83,0.909,0.895,0.827,0.816,0.931,0.956,0.97,male,schizophrenic
74 | PS,0.8425,0.994,1.05,0.941,0.98,1.02,0.96,1.03,0.973,0.813,0.909,male,schizophrenic
75 | PS,0.8425,0.857,0.895,0.879,0.831,0.832,0.852,0.894,0.888,0.82,0.816,male,schizophrenic
76 | TR,0.8425,0.728,0.828,0.83,0.777,0.825,0.827,0.816,0.819,0.82,0.686,male,schizophrenic
77 | PS,0.776,0.956,0.944,0.928,0.85,0.925,0.942,0.9,0.945,0.919,0.898,male,schizophrenic
78 | CS,0.618,0.829,0.828,0.83,0.737,0.832,0.827,0.816,0.643,0.82,0.62,male,schizophrenic
79 | PS,0.8425,0.829,0.828,0.83,0.831,0.712,0.871,0.832,0.819,0.82,0.816,male,schizophrenic
80 | CS,0.956,0.825,0.953,0.825,0.916,0.92,0.964,0.903,0.868,0.945,0.895,male,schizophrenic
81 | CS,0.66,0.655,0.828,0.58,0.708,0.688,0.646,0.816,0.588,0.82,0.74,male,schizophrenic
82 | CS,0.782,0.779,0.72,0.787,0.763,0.755,0.784,0.764,0.754,0.789,0.753,male,schizophrenic
83 | PS,0.602,0.829,0.641,0.574,0.831,0.832,0.827,0.793,0.819,0.613,0.634,male,schizophrenic
84 | TR,0.684,0.579,0.509,0.496,0.436,0.558,0.564,0.816,0.819,0.82,0.259,male,schizophrenic
85 | CS,0.856,0.835,0.946,0.844,0.907,0.897,0.827,0.816,0.819,0.82,0.816,male,schizophrenic
86 |
--------------------------------------------------------------------------------
/inst/extdata/tictactoe_test.csv:
--------------------------------------------------------------------------------
1 | X1b,X1o,X1x,X2b,X2o,X2x,X3b,X3o,X3x,X4b,X4o,X4x,X5b,X5o,X5x,X6b,X6o,X6x,X7b,X7o,X7x,X8b,X8o,X8x,X9b,X9o,X9x,class
2 | 0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,positive
3 | 0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,positive
4 | 0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,positive
5 | 0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,positive
6 | 0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,1,0,1,0,0,positive
7 | 0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,1,0,0,positive
8 | 0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,1,0,0,0,0,1,0,1,0,positive
9 | 0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,1,positive
10 | 0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,1,0,positive
11 | 0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,positive
12 | 0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,positive
13 | 0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,1,0,positive
14 | 0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,positive
15 | 0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,positive
16 | 0,0,1,0,0,1,0,0,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,positive
17 | 0,0,1,0,0,1,0,0,1,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,positive
18 | 0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,positive
19 | 0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,positive
20 | 0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,positive
21 | 0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,positive
22 | 0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1,0,positive
23 | 0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,positive
24 | 0,0,1,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,0,positive
25 | 0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,positive
26 | 0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,1,positive
27 | 0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,positive
28 | 0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,positive
29 | 0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,positive
30 | 0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,1,0,0,0,0,1,positive
31 | 0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,1,0,0,positive
32 | 0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,0,0,1,positive
33 | 0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,1,positive
34 | 0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,positive
35 | 0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,positive
36 | 0,0,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,positive
37 | 0,0,1,0,1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,positive
38 | 0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,positive
39 | 0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,0,0,1,positive
40 | 0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,positive
41 | 0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,positive
42 | 0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,positive
43 | 0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,positive
44 | 0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,positive
45 | 0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,0,0,0,0,1,positive
46 | 0,0,1,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,1,0,0,1,0,0,0,1,positive
47 | 0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,positive
48 | 0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,positive
49 | 0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,positive
50 | 0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,positive
51 | 0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0,0,positive
52 | 0,0,1,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,positive
53 | 0,0,1,1,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,positive
54 | 0,0,1,1,0,0,0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,positive
55 | 0,0,1,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,positive
56 | 0,0,1,1,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,positive
57 | 0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,positive
58 | 0,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,positive
59 | 0,0,1,1,0,0,1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,positive
60 | 0,0,1,1,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1,0,positive
61 | 0,0,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,positive
62 | 0,0,1,1,0,0,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,positive
63 | 0,0,1,1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,0,1,0,1,0,0,0,0,1,positive
64 | 0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,positive
65 | 0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,positive
66 | 0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,positive
67 | 0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,positive
68 | 0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,1,0,positive
69 | 0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,positive
70 | 0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,positive
71 | 0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,positive
72 | 0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,1,0,0,positive
73 | 0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,positive
74 | 0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,1,0,0,positive
75 | 0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,positive
76 | 0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,positive
77 | 0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,positive
78 | 0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,positive
79 | 0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,positive
80 | 0,1,0,0,1,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,1,positive
81 | 0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,positive
82 | 0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,positive
83 | 0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,1,positive
84 | 0,1,0,0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,1,positive
85 | 0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,positive
86 | 0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,positive
87 | 0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,positive
88 | 0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,positive
89 | 0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,positive
90 | 0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,1,positive
91 | 0,1,0,1,0,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,positive
92 | 0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,positive
93 | 0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,positive
94 | 0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,1,positive
95 | 0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,positive
96 | 0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,positive
97 | 1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,positive
98 | 1,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,positive
99 | 1,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,positive
100 | 1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,positive
101 | 1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,positive
102 | 1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,positive
103 | 1,0,0,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,positive
104 | 1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,positive
105 | 1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,positive
106 | 1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,positive
107 | 1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,positive
108 | 1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,positive
109 | 1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,positive
110 | 1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,positive
111 | 1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,positive
112 | 1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,1,positive
113 | 1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,positive
114 | 1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,positive
115 | 1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,positive
116 | 1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,1,0,1,0,0,positive
117 | 1,0,0,0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,positive
118 | 1,0,0,0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,positive
119 | 1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,positive
120 | 1,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,positive
121 | 1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,positive
122 | 1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0,1,0,0,0,1,0,positive
123 | 1,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,positive
124 | 1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,positive
125 | 1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,positive
126 | 1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,positive
127 | 0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,negative
128 | 0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,negative
129 | 0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,negative
130 | 0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,negative
131 | 0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,negative
132 | 0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,negative
133 | 0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,0,1,0,negative
134 | 0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,1,0,0,negative
135 | 0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,1,0,0,1,0,negative
136 | 0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,0,0,1,negative
137 | 0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,negative
138 | 0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,negative
139 | 0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,negative
140 | 0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,negative
141 | 0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,1,0,1,0,0,negative
142 | 0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,negative
143 | 0,0,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,negative
144 | 0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,negative
145 | 0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,1,0,0,1,0,0,1,0,negative
146 | 0,0,1,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,negative
147 | 0,0,1,1,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,negative
148 | 0,0,1,1,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,negative
149 | 0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,negative
150 | 0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,0,1,0,negative
151 | 0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,negative
152 | 0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,negative
153 | 0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,1,0,0,1,0,0,negative
154 | 0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,negative
155 | 0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,negative
156 | 0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,negative
157 | 0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,negative
158 | 0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,negative
159 | 0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,negative
160 | 0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,negative
161 | 0,1,0,0,0,1,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,negative
162 | 0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,negative
163 | 0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,negative
164 | 0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,negative
165 | 0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,negative
166 | 0,1,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,negative
167 | 0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,negative
168 | 0,1,0,0,1,0,0,1,0,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,negative
169 | 0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0,0,0,1,negative
170 | 0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,1,negative
171 | 0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,negative
172 | 0,1,0,1,0,0,0,0,1,1,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,negative
173 | 0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,negative
174 | 0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,negative
175 | 0,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,negative
176 | 0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,negative
177 | 1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,negative
178 | 1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,negative
179 | 1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,negative
180 | 1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,negative
181 | 1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,1,0,0,0,0,1,negative
182 | 1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,negative
183 | 1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,1,0,0,negative
184 | 1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,negative
185 | 1,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,negative
186 | 1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,1,0,negative
187 | 1,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,negative
188 | 0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,negative
189 | 0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,negative
190 | 0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,negative
191 | 0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,negative
192 | 0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,negative
193 |
--------------------------------------------------------------------------------
/man/autoRLearn.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/autoRLearn.R
3 | \name{autoRLearn}
4 | \alias{autoRLearn}
5 | \title{Run smartML function for automatic Supervised Machine Learning.}
6 | \usage{
7 | autoRLearn(
8 | maxTime,
9 | directory,
10 | testDirectory,
11 | classCol = "class",
12 | metric = "acc",
13 | vRatio = 0.3,
14 | preProcessF = c("standardize", "zv"),
15 | featuresToPreProcess = c(),
16 | nComp = NA,
17 | nModels = 5,
18 | option = 2,
19 | featureTypes = c(),
20 | interp = FALSE,
21 | missingOpr = FALSE,
22 | balance = FALSE
23 | )
24 | }
25 | \arguments{
26 | \item{maxTime}{Float numeric of the maximum time budget, in minutes, for reading the dataset, preprocessing, calculating meta-features, algorithm selection, and hyper-parameter tuning (excluding model interpretability) - applicable in case of option = 2 only.}
27 |
28 | \item{directory}{String character of the training dataset directory (SmartML accepts arff files, or csv files with column headers).}
29 |
30 | \item{testDirectory}{String character of the testing dataset directory (SmartML accepts arff files, or csv files with column headers).}
31 |
32 | \item{classCol}{String Character of the name of the class label column in the dataset (default = 'class').}
33 |
34 | \item{metric}{Metric string character to be used in evaluation:
35 | \itemize{
36 | \item "acc" - Accuracy,
37 | \item "avg-fscore" - Average of F-Score of each label,
38 | \item "avg-recall" - Average of Recall of each label,
39 | \item "avg-precision" - Average of Precision of each label,
40 | \item "fscore" - Micro-Average of F-Score of each label,
41 | \item "recall" - Micro-Average of Recall of each label,
42 | \item "precision" - Micro-Average of Precision of each label.
43 | }}
44 |
45 | \item{vRatio}{Float numeric of the validation set ratio that should be split out of the training set for the evaluation process (default = 0.3 --> 30\%).}
46 |
47 | \item{preProcessF}{Vector of string characters containing the names of the preprocessing algorithms to apply (default = c('standardize', 'zv')):
48 | \itemize{
49 | \item "boxcox" - apply a Box–Cox transform and values must be non-zero and positive in all features,
50 | \item "yeo-Johnson" - apply a Yeo-Johnson transform, like a BoxCox, but values can be negative,
51 | \item "zv" - remove attributes with a zero variance (all the same value),
52 | \item "center" - subtract mean from values,
53 | \item "scale" - divide values by standard deviation,
54 | \item "standardize" - perform both centering and scaling,
55 | \item "normalize" - normalize values,
56 | \item "pca" - transform data to the principal components,
57 | \item "ica" - transform data to the independent components.
58 | }}
59 |
60 | \item{featuresToPreProcess}{Vector of the column numbers of the features to apply the feature preprocessing on - an empty vector means that all features in the dataset file are included (default = c()). This vector should be a subset of \code{selectedFeats}.}
61 |
62 | \item{nComp}{Integer numeric of the number of components needed if either the "pca" or "ica" feature preprocessor is used.}
63 |
64 | \item{nModels}{Integer numeric representing the number of classifier algorithms that you want to select based on Meta-Learning and start to tune using Bayesian Optimization (default = 5).}
65 |
66 | \item{option}{Integer numeric: 1 = classifier algorithm selection only, 2 = algorithm selection plus hyper-parameter tuning (default = 2).}
67 |
68 | \item{featureTypes}{Vector of either 'numerical' or 'categorical' representing the types of features in the dataset (default = c() --> any factor or character features will be considered as categorical otherwise numerical).}
69 |
70 | \item{interp}{Boolean representing whether model interpretability (feature importance and interaction) is needed or not (default = FALSE). This option consumes more of the time budget if set to TRUE.}
71 |
72 | \item{missingOpr}{Boolean variable: FALSE applies median/mode imputation for instances with missing values; TRUE applies imputation using the "MICE" library, which imputes missing values with plausible values drawn from a distribution designed for each missing data point.}
73 |
74 | \item{balance}{Boolean variable representing whether SMOTE class balancing is required or not (default = FALSE).}
75 | }
76 | \value{
77 | List of Results
78 | \itemize{
79 | \item "option=1" - Choosen Classifier Algorithms Names \code{clfs} with their parameters configurations \code{params}, Training DataFrame \code{TRData}, Test DataFrame \code{TEData} in case of \code{option=2},
80 | \item "option=2" - Best classifier algorithm name found \code{clfs} with its parameters configuration \code{params}, , Training DataFrame \code{TRData}, Test DataFrame \code{TEData}, model variable \code{model}, predicted values on test set \code{pred}, performance on TestingSet \code{perf}, and Feature Importance \code{interpret$featImp} / Interaction \code{interpret$Interact} plots in case of interpretability \code{interp} = TRUE and chosen model is not knn.
81 | }
82 | }
83 | \description{
84 | Run the smartML main function for automatic classifier algorithm selection, and hyper-parameter tuning.
85 | }
86 | \examples{
87 | \dontrun{
88 | autoRLearn(1, 'sampleDatasets/car/train.arff',
89 |   'sampleDatasets/car/test.arff', option = 2, preProcessF = 'normalize')
90 |
91 | result <- autoRLearn(10, 'sampleDatasets/shuttle/train.arff', 'sampleDatasets/shuttle/test.arff')
92 | }
93 |
94 | }
95 |
--------------------------------------------------------------------------------
/man/autoRLearn_.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/autoRLearn_.R
3 | \name{autoRLearn_}
4 | \alias{autoRLearn_}
5 | \title{Advanced version of autoRLearn.}
6 | \usage{
7 | autoRLearn_(
8 | df_train,
9 | df_test,
10 | maxTime = 10,
11 | models = c("randomForest", "naiveBayes", "boosting", "l2-linear-classifier", "svm"),
12 | optimizationAlgorithm = "hyperband",
13 | bw = 3,
14 | kde_type = "single",
15 | max_iter = 81,
16 | metric = "acc"
17 | )
18 | }
19 | \arguments{
20 | \item{df_train}{Dataframe of the training dataset. Assumes it is in perfect shape with all numeric variables and factor response variable named "class".}
21 |
22 | \item{df_test}{Dataframe of the test dataset. Assumes it is in perfect shape with all numeric variables and factor response variable named "class".}
23 |
24 | \item{maxTime}{Float representing the maximum time the algorithm should be run (in minutes).}
25 |
26 | \item{models}{List of strings denoting which algorithms to use for the process:
27 | \itemize{
28 | \item "randomForest" - Random forests using the randomForest package
29 | \item "ranger - Random forests using the ranger package (unstable)
30 | \item "naiveBayes" - Naive bayes using the fastNaiveBayes package
31 | \item "boosting" - Gradient boosting using xgboost
32 | \item "l2-linear-classifier" - Linear primal Support vector machine from LibLinear
33 | \item "svm" - RBF kernel svm from e1071
34 | }}
35 |
36 | \item{optimizationAlgorithm}{- String of which hyperparameter tuning algorithm to use:
37 | \itemize{
38 | \item "hyperband" - Hyperband with uniformly initiated parameters
39 | \item "bohb" - Hyperband with bayesian optimization as described on F. Hutter et al 2018 paper BOHB. Has extra parameters bw and kde_type
40 | }}
41 |
42 | \item{bw}{- (only applies to BOHB) Double representing how much the KDE bandwidth should be widened. Higher values allow the algorithm to explore more hyperparameter combinations}
43 |
44 | \item{kde_type}{- (only applies to BOHB) String representing whether a model's hyperparameters should be tuned independently of each other or have their probability densities multiplied:
45 | \itemize{
46 | \item "single" - each hyperparameter has its own expected improvement calculated
47 | \item "mixed" - all hyperparameters' probabilty densities are multiplied and only one mixed expected improvement is calculated
48 | }}
49 |
50 | \item{max_iter}{- (affects both hyperband and BOHB) Integer representing the maximum number of iterations that one successive halving run can have}
51 |
52 | \item{metric}{String of the evaluation metric to be used in the model performance optimization:
53 | \itemize{
54 | \item "acc" - Accuracy,
55 | \item "avg-fscore" - Average of F-Score of each label,
56 | \item "avg-recall" - Average of Recall of each label,
57 | \item "avg-precision" - Average of Precision of each label,
58 | \item "fscore" - Micro-Average of F-Score of each label,
59 | \item "recall" - Micro-Average of Recall of each label,
60 | \item "precision" - Micro-Average of Precision of each label.
61 | }}
62 | }
63 | \value{
64 | List of Results
65 | \itemize{
66 | \item \code{perf} - Evaluated metric of the best performing model on the test data
67 | \item \code{pred} - prediction on the test data using the best model
68 | \item \code{model} - best model object
69 | \item \code{best_models} - table with the best hyperparameters found for the selected models.
70 | }
71 | }
72 | \description{
73 | Tunes the hyperparameters of the desired algorithm/s using either hyperband or BOHB.
74 | }
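% A minimal usage sketch, added for illustration only (the rest of this file is roxygen-generated).
% The CSV paths below are hypothetical; the data must already be numeric with a factor column named "class".
\examples{
\dontrun{
train <- read.csv('path/to/train.csv')  # hypothetical training file
test  <- read.csv('path/to/test.csv')   # hypothetical test file
train$class <- as.factor(train$class)
test$class  <- as.factor(test$class)
result <- autoRLearn_(train, test, maxTime = 10,
                      models = c("randomForest", "svm"),
                      optimizationAlgorithm = "hyperband")
result$perf        # evaluated metric of the best model on the test data
result$best_models # best hyperparameters found for each selected model
}
}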
75 |
--------------------------------------------------------------------------------
/man/datasetReader.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/datasetReader.R
3 | \name{datasetReader}
4 | \alias{datasetReader}
5 | \title{Read Dataset File into Memory.}
6 | \usage{
7 | datasetReader(
8 | directory,
9 | testDirectory,
10 | selectedFeats = c(),
11 | classCol = "class",
12 | preProcessF = "N",
13 | featuresToPreProcess = c(),
14 | nComp = NA,
15 | missingVal = c("NA", "?", " "),
16 | missingOpr = 0
17 | )
18 | }
19 | \arguments{
20 | \item{directory}{String of the directory to the file containing the training dataset.}
21 |
22 | \item{testDirectory}{String of the directory to the file containing the testing dataset.}
23 |
24 | \item{selectedFeats}{Vector of the column numbers of the features to include from the training set, ignoring the rest of the columns - an empty vector means that all features in the dataset file are included (default = c()).}
25 |
26 | \item{classCol}{String of the name of the class label column in the dataset (default = 'class').}
27 |
28 | \item{preProcessF}{string containing the name of the preprocessing algorithm (default = 'N' --> no preprocessing):
29 | \itemize{
30 | \item "boxcox" - apply a Box–Cox transform and values must be non-zero and positive in all features,
31 | \item "yeo-Johnson" - apply a Yeo-Johnson transform, like a BoxCox, but values can be negative,
32 | \item "zv" - remove attributes with a zero variance (all the same value),
33 | \item "center" - subtract mean from values,
34 | \item "scale" - divide values by standard deviation,
35 | \item "standardize" - perform both centering and scaling,
36 | \item "normalize" - normalize values,
37 | \item "pca" - transform data to the principal components,
38 | \item "ica" - transform data to the independent components.
39 | }}
40 |
41 | \item{featuresToPreProcess}{Vector of the column numbers of the features to apply the feature preprocessing on - an empty vector means that all features in the dataset file are included (default = c()). This vector should be a subset of \code{selectedFeats}.}
42 |
43 | \item{nComp}{Integer of the number of components needed if either the "pca" or "ica" feature preprocessor is used.}
44 |
45 | \item{missingVal}{Vector of strings representing the missing values in dataset (default: c('NA', '?', ' ')).}
46 |
47 | \item{missingOpr}{Boolean variable: 0 deletes instances with missing values (default); 1 applies imputation using the "MICE" library, which imputes missing values with plausible values drawn from a distribution designed for each missing data point.}
48 | }
49 | \value{
50 | List of the TrainingSet \code{Train} and TestingSet \code{Test}.
51 | }
52 | \description{
53 | Read the file of the training and testing dataset, and perform preprocessing and data cleaning if necessary.
54 | }
55 | \examples{
56 | \dontrun{
57 | dataset <- datasetReader('/Datasets/irisTrain.csv', '/Datasets/irisTest.csv')
58 | }
59 | }
60 |
--------------------------------------------------------------------------------
/man/metafeatures.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataSystemsGroupUT/SmartML/e58b5bddb0fbf741e16f31651a282146143e78fe/man/metafeatures.pdf
--------------------------------------------------------------------------------
/man/runClassifier.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/runClassifier.R
3 | \name{runClassifier}
4 | \alias{runClassifier}
5 | \title{Fit a classifier model.}
6 | \usage{
7 | runClassifier(
8 | trainingSet,
9 | validationSet,
10 | params,
11 | classifierAlgorithm,
12 | metric = "acc",
13 | interp = 0
14 | )
15 | }
16 | \arguments{
17 | \item{trainingSet}{Dataframe of the training set.}
18 |
19 | \item{validationSet}{Dataframe of the validation Set.}
20 |
21 | \item{params}{A string character of parameter configuration values for the current classifier to be tuned (parameters are separated by #) and can be obtained from \code{params} out of resulted list after running \code{autoRLearn} function.}
22 |
23 | \item{classifierAlgorithm}{String character of the name of the classifier algorithm to use:
24 | \itemize{
25 | \item "svm" - Support Vector Machines from e1071 package,
26 | \item "naiveBayes" - naiveBayes from e1071 package,
27 | \item "randomForest" - randomForest from randomForest package,
28 | \item "lmt" - LMT Weka classifier trees from RWeka package,
29 | \item "lda" - Linear Discriminant Analysis from MASS package,
30 | \item "j48" - J48 Weka classifier Trees from RWeka package,
31 | \item "bagging" - Bagging Classfier from ipred package,
32 | \item "knn" - K nearest Neighbors from FNN package,
33 | \item "nnet" - Simple neural net from nnet package,
34 | \item "C50" - C50 decision tree from C5.0 pacakge,
35 | \item "rpart" - rpart decision tree from rpart package,
36 | \item "rda" - regularized discriminant analysis from klaR package,
37 | \item "plsda" - Partial Least Squares And Sparse Partial Least Squares Discriminant Analysis from caret package,
38 | \item "glm" - Fitting Generalized Linear Models from stats package,
39 | \item "deepboost" - deep boost classifier from deepboost package.
40 | }}
41 |
42 | \item{metric}{Metric string character to be used in evaluation:
43 | \itemize{
44 | \item "acc" - Accuracy,
45 | \item "avg-fscore" - Average of F-Score of each label,
46 | \item "avg-recall" - Average of Recall of each label,
47 | \item "avg-precision" - Average of Precision of each label,
48 | \item "fscore" - Micro-Average of F-Score of each label,
49 | \item "recall" - Micro-Average of Recall of each label,
50 | \item "precision" - Micro-Average of Precision of each label
51 | }}
52 |
53 | \item{interp}{Boolean representing if interpretability is required or not (Default = 0).}
54 | }
55 | \value{
56 | List of performance on validationSet named \code{perf}, model fitted on trainingSet named \code{m}, predictions on test set \code{pred}, and interpretability plots named \code{interpret} in case of interp = 1
57 | }
58 | \description{
59 | Run the classifier on a training set and measure performance on a validation set.
60 | }
61 | \examples{
62 | \dontrun{
63 | result1 <- autoRLearn(10, 'sampleDatasets/shuttle/train.arff', 'sampleDatasets/shuttle/test.arff')
64 | dataset <- datasetReader('/Datasets/irisTrain.csv', '/Datasets/irisTest.csv')
65 | result2 <- runClassifier(dataset$Train, dataset$Test, result1$params, result1$clfs)
66 | }
67 |
68 | }
69 |
--------------------------------------------------------------------------------
/man/supportedAlgorithms.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataSystemsGroupUT/SmartML/e58b5bddb0fbf741e16f31651a282146143e78fe/man/supportedAlgorithms.pdf
--------------------------------------------------------------------------------
/manual.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataSystemsGroupUT/SmartML/e58b5bddb0fbf741e16f31651a282146143e78fe/manual.pdf
--------------------------------------------------------------------------------
/save_jsons.R:
--------------------------------------------------------------------------------
1 | library(purrr)
2 | library(stringr)
3 | library(jsonlite)
4 | library(devtools)
5 |
6 |
7 | files <- dir(path <- "inst/extdata/hyperband_jsons", pattern = "*.json")
8 | names_clf <- files %>%
9 | map_chr(~ str_remove(.x, pattern = ".json"))
10 | paths <- file.path(path, files)
11 | jsons <- paths %>%
12 | map(.f = ~ jsonlite::fromJSON(txt = .x, flatten = T))
13 | names(jsons) <- names_clf
14 |
15 | ## Then:
16 |
17 | save(jsons, file = "R/sysdata.rda")
18 | save(jsons, file = "sysdata.rda")
19 | load("sysdata.rda")
20 |
--------------------------------------------------------------------------------
/sysdata.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataSystemsGroupUT/SmartML/e58b5bddb0fbf741e16f31651a282146143e78fe/sysdata.rda
--------------------------------------------------------------------------------
/test_rmarkdown/new_tests.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "new_tests"
3 | author: "rucy"
4 | date: "9/22/2020"
5 | output: html_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = TRUE)
10 | ```
11 |
12 | ## R Markdown
13 |
14 | ```{r}
15 |
16 | library(R.utils)
17 | library(mlr3)
18 | library(mlr3learners)
19 | library(readr)
20 | library(data.table)
21 | library(purrr)
22 | library(stringr)
23 | library(jsonlite)
24 | library(tictoc)
25 |
26 | ## If you change any of the jsons
27 |
28 | ## Do this:
29 |
30 | files <- dir(path <- "~/school_stuff/schoolwork/witchcraft/inst/extdata/hyperband_jsons", pattern = "*.json")
31 |
32 | names_clf <- files %>%
33 | map_chr(~ str_remove(.x, pattern = ".json"))
34 |
35 | paths <- file.path(path, files)
36 |
37 | jsons <- paths %>%
38 | map(.f = ~ fromJSON(txt = .x, flatten = T))
39 |
40 | names(jsons) <- names_clf
41 |
42 | ## Then:
43 |
44 | ## save(jsons, file = "~/school_stuff/schoolwork/witchcraft/sysdata.rda")
45 |
46 | # load("~/school_stuff/schoolwork/witchcraft/R/sysdata.rda")
47 |
48 | ## Do this ^^
49 |
50 | param_sample <- function(model, hparam, columns = NULL) {
51 |
52 | param <- jsons[[model]][[hparam]]
53 |
54 | type <- param$type
55 |
56 | type_scale <- param$scale
57 |
58 | if(type == "discrete") {
59 |
60 | param_estimation <- paste("'", base::sample(x = as.list(param$values), size = 1), "'", sep = "")
61 |
62 | return(param_estimation)
63 |
64 | }
65 |
66 | else {
67 |
68 | int_val <- ifelse(hparam == "mtry", as.numeric(columns) - 1, as.numeric(param$maxVal))
69 |
70 | param_estimation <- fcase(type_scale == "int", rdunif(1, a = as.numeric(param$minVal),
71 | b = int_val),
72 | type_scale == "any", runif(1, min = as.numeric(param$minVal),
73 | max = as.numeric(param$maxVal)),
74 | type_scale == "double", runif(1, min = as.numeric(param$minVal),
75 | max = as.numeric(param$maxVal)),
76 | type_scale == "exp", runif(1, min = 2^as.numeric(param$minVal),
77 | max = 2^as.numeric(param$maxVal)))
78 |
79 | return(param_estimation)
80 |
81 | }
82 |
83 | }
84 |
85 | get_random_hp_config <- function(model, columns = NULL) {
86 |
87 | param_db <- jsons[[model]]
88 |
89 | params_list <- param_db$params
90 |
91 | params_list_mapped <- map(.x = params_list,
92 | .f = as_mapper( ~ param_sample(model = model, hparam = .x, columns = columns)))
93 |
94 | `names<-`(params_list_mapped, params_list)
95 |
96 | }
97 |
98 | data_load <- read_csv(file = "~/school_stuff/schoolwork/witchcraft/inst/extdata/ta_train.csv")
99 |
100 | data_model <- data_load %>%
101 | as.data.table()
102 |
103 | data_model[, class := factor(class, levels = unique(class)) %>% sort()]
104 |
105 | ```
106 |
107 | ### New successive halving
108 |
109 | ```{r}
110 |
111 | library(data.table)
112 |
113 | successive_halving <- function(df, model, params_config, n = 81, r = 1, eta = 3, max_iter = 81, s_max = 4, evaluations = data.frame()) {
114 |
115 | final_df <- params_config
116 |
117 | task <- TaskClassif$new(id = "sh", backend = df, target = "class")
118 |
119 | param_number <- length(params_config)
120 |
121 | for (k in 0:s_max) {
122 |
123 | gc()
124 |
125 | n_i = n * (eta ** -k)
126 |
127 | r_i = r * (eta ** k)
128 |
129 | r_p = r_i / max_iter
130 |
131 | min_train_datapoints = (length(unique(df$class)) * 3) + 1
132 |
133 | min_prob_datapoints = min_train_datapoints / nrow(df) # share of rows needed for a minimal training split (nrow of the data, not of a single column)
134 |
135 | train_idxs <- sample(task$nrow, task$nrow * max(min(r_p, 0.8), min_prob_datapoints))
136 | test_idxs <- setdiff(seq_len(task$nrow), train_idxs)
137 |
138 | learners <- replicate(n = n_i, expr = {lrn(paste("classif", sep = ".", model))})
139 |
140 | j = 1
141 | for (i in learners) {
142 |
143 | i$param_set$values = final_df[[j]]
144 |
145 | j = j + 1
146 |
147 | }
148 |
149 | for (l in learners) {
150 |
151 | l$train(task = task, row_ids = train_idxs)
152 |
153 | }
154 |
155 | measure <- msr("classif.acc")
156 |
157 | preds <- map(.x = learners, .f = ~ .x$predict(task, row_ids = test_idxs)$score(measure))
158 |
159 | final_df <- final_df %>%
160 | as.data.table() %>%
161 | t() %>%
162 | `colnames<-`(value = jsons[[model]]$params) %>%
163 | as.data.table()
164 |
165 | final_df[, acc := unlist(preds)]
166 |
167 | final_df[, budget := r_i]
168 |
169 | final_df[, budget := r_p]
170 |
171 | setorder(final_df, -acc)
172 |
173 | evaluations <- rbindlist(list(evaluations, final_df))
174 |
175 | final_df <- final_df %>%
176 | head(max(n_i/eta, 1))
177 |
178 | if(k == s_max){
179 |
180 | return(list("answer" = final_df, "sh_runs" = evaluations))
181 |
182 | }
183 |
184 | final_df$acc = NULL
185 | final_df$budget = NULL
186 |
187 | final_df <- purrr::transpose(final_df)
188 |
189 | }
190 | }
191 |
192 | test_param_sampling <- replicate(81, get_random_hp_config("xgboost", columns = ncol(data_model)), simplify = FALSE)
193 |
194 | test_sh <- successive_halving(df = data_model, model = "xgboost", params_config = test_param_sampling)
195 | ```
196 |
197 | ### New hyperbandito
198 |
199 | ```{r}
200 |
201 | calc_n_r = function(max_iter = 81, eta = 3, s = 4, B = 405) {
202 |
203 | n = trunc(ceiling(trunc(B/max_iter/(s+1)) * eta**s))
204 |
205 | r = max_iter * eta^(-s)
206 |
207 | ans = c(n, r)
208 |
209 | ans
210 |
211 | }
212 |
213 |
214 | hyperband <- function(df, model, max_iter = 81, eta = 3, maxtime = 1000) {
215 |
216 | logeta = as_mapper(~ log(.x) / log(eta))
217 |
218 | s_max = trunc(logeta(max_iter))
219 |
220 | B = (s_max + 1) * max_iter
221 |
222 | nrs = map_dfc(s_max:0, .f = ~ calc_n_r(max_iter, eta, .x, B)) %>%
223 | t() %>%
224 | `colnames<-`(value = c("n", "r")) %>%
225 | as.data.table()
226 |
227 | nrs$s = s_max:0
228 |
229 | partial_halving <- function(n, r, s) {
230 |
231 | successive_halving(df = df,
232 | model = model,
233 | params_config = replicate(n, get_random_hp_config(model, columns = ncol(df) - 1), simplify = FALSE),
234 | n = n,
235 | r = r,
236 | s_max = s,
237 | max_iter = max_iter,
238 | eta = eta)
239 |
240 | }
241 |
242 | tryCatch(expr = {withTimeout(expr = {
243 |
244 | liszt = vector(mode = "list",
245 | length = max(nrs$s) + 1)
246 |
247 | for (row in 1:nrow(nrs)) {
248 |
249 | liszt[[row]] <- partial_halving(nrs[[row, 1]],
250 | nrs[[row, 2]],
251 | nrs[[row, 3]])
252 |
253 | }
254 | }, timeout = maxtime, cpu = maxtime)},
255 |
256 | TimeoutException = function(ex) {
257 |
258 | print("Budget ended.")
259 |
260 | return(liszt)
261 |
262 | },
263 |
264 | # 'finally' takes an expression (not a handler function); it runs after the tryCatch either way
265 | finally = {
266 | 
267 | print("Hyperband successfully finished.")
268 | 
269 | },
270 | 
271 | error = function(ex) {
272 | 
273 | print(paste("Error found, replace ", model, sep = ""))
274 | 
275 | print(conditionMessage(ex)) # message of the condition actually caught
276 | 
277 | liszt # 'break' is invalid outside a loop; return whatever results were collected so far
278 | 
279 | })
280 |
281 | return(liszt)
282 |
283 | }
284 |
285 | tezt_hyperband = hyperband(df = data_model, model = "xgboost", maxtime = 120)
286 | ```
287 |
288 | Evocation test
289 |
290 | ```{r}
291 |
292 | evocate <- function(df_train, df_test, maxTime = 10, models = "xgboost", optimizationAlgorithm = "hyperband", bw = 3, max_iter = 81, kde_type = "single") {
293 |
294 | total_time = maxTime * 60
295 |
296 | parameters_per_model <- map_int(models, .f = ~ length(jsons[[.x]]$params))
297 |
298 | times = (parameters_per_model * total_time) / (sum(parameters_per_model))
299 |
300 | print("Time distribution:")
301 | print(times)
302 | print("Models selected:")
303 | print(models)
304 |
305 | run_optimization = function(model, time) {
306 |
307 | results = NULL
308 |
309 | priors = data.frame()
310 |
311 | tic(model, "optimization time:")
312 |
313 | if(optimizationAlgorithm == "hyperband") {
314 |
315 | current <- Sys.time() %>% as.integer()
316 |
317 | end <- (Sys.time() %>% as.integer()) + time
318 |
319 | repeat {
320 |
321 | gc(verbose = F)
322 |
323 | tic("current hyperband runtime")
324 |
325 | print(paste("started", model))
326 |
327 | time_left <- max(end - (Sys.time() %>% as.integer()), 1)
328 |
329 | print(paste("There are:", time_left, "seconds left for this hyperband run"))
330 |
331 | res <- hyperband(df = df_train, model = model, max_iter = max_iter, maxtime = time_left)
332 |
333 | if(is_empty(purrr::flatten(res)) == F) {
334 |
335 | res <- res %>%
336 | map_dfr(.f = ~ .x[["answer"]]) %>%
337 | as.data.table()
338 |
339 | setorder(res, -acc)
340 |
341 | res <- res %>% head(1)
342 |
343 | results <- c(list(res), results)
344 |
345 | print(res)
346 |
347 | print(paste('Best accuracy from hyperband this round: ', res$acc))
348 |
349 | }
350 |
351 | elapsed <- (Sys.time() %>% as.integer()) - current
352 |
353 | if(elapsed >= time) {
354 |
355 | break
356 |
357 | }
358 |
359 | }
360 |
361 | }
362 |
363 | else if(optimizationAlgorithm == "bohb") {
364 |
365 | current <- Sys.time() %>% as.integer()
366 |
367 | end <- (Sys.time() %>% as.integer()) + time
368 |
369 | repeat {
370 |
371 | gc(verbose = F)
372 |
373 | tic("current bohb time")
374 |
375 | print(paste("started", model))
376 |
377 | time_left <- max(end - (Sys.time() %>% as.integer()), 1)
378 |
379 | print(paste("There are:", time_left, "seconds left for this bohb run"))
380 |
381 | res <- bohb(df = df_train, model = model, bw = bw, max_iter = max_iter, maxtime = time_left, priors = priors, kde_type = kde_type)
382 |
383 | if(is_empty(flatten(res)) == F) {
384 |
385 | priors <- res %>%
386 | map_dfr(.f = ~ .x[["sh_runs"]])
387 |
388 | res <- res %>%
389 | map_dfr(.f = ~ .x[["answer"]]) %>%
390 | arrange(desc(acc)) %>%
391 | head(1)
392 |
393 | results <- c(list(res), results)
394 |
395 | print(paste('Best accuracy from bohb this round: ', res$acc))
396 |
397 | }
398 |
399 | elapsed <- (Sys.time() %>% as.integer()) - current
400 |
401 | if(elapsed >= time) {
402 |
403 | break
404 |
405 | }
406 |
407 | }
408 |
409 |
410 | }
411 |
412 | else {
413 |
414 | # errorCondition() only creates a condition object without signalling it, and 'break' is invalid outside a loop; stop() raises the error directly
415 | stop("Only hyperband and bohb are valid optimization algorithms at this moment.")
416 | 
417 |
418 | }
419 |
420 | toc()
421 |
422 | results
423 |
424 | }
425 |
426 | print("Finished all optimizations.")
427 |
428 | ans = vector(mode = "list", length = length(models))
429 |
430 | for(i in 1:length(models)) {
431 |
432 | flag <- TRUE
433 |
434 | tryCatch(expr = {
435 |
436 | ans[[i]] <- run_optimization(models[[i]], times[[i]])
437 |
438 | }, error = function(e) {
439 |
440 | print("Error spotted, going to the next model")
441 |
442 | flag <<- FALSE
443 |
444 | })
445 |
446 | if (!flag) next
447 |
448 | }
449 |
450 | return(ans)
451 |
452 | ### TO DO - add the final model evaluation.
453 | ### with your cross validation ideas and etc.
454 |
455 | }
456 |
457 |
458 | ```
459 |
460 | ```{r}
461 |
462 | data_train <- read_csv(file = "~/school_stuff/schoolwork/witchcraft/inst/extdata/ta_train.csv") %>% as.data.table()
463 | data_test <- read_csv(file = "~/school_stuff/schoolwork/witchcraft/inst/extdata/ta_test.csv") %>% as.data.table()
464 |
465 | data_train[, class := factor(class, levels = unique(class)) %>% sort()]
466 | data_test[, class := factor(class, levels = unique(class)) %>% sort()]
467 |
468 | tezt <- evocate(data_train, data_test, maxTime = 2, models = "xgboost")
469 |
470 | ```
471 |
--------------------------------------------------------------------------------
/testing.R:
--------------------------------------------------------------------------------
1 | # Title : Testing the Main Package Function
2 | # Objective : Package Testing
3 | # Created by: s-moh
4 | # Created on: 11/12/2020
5 | library(SmartML)
6 | library(tidyverse)
7 | library(R.utils)
8 | library(mlr)
9 | library(mlr3)
10 | library(mlr3learners)
11 | library(mlr3pipelines)
12 | library(mlr3filters)
13 | library(readr)
14 | library(data.table)
15 | library(stringr)
16 | library(jsonlite)
17 | library(tictoc)
18 |
19 | #################################################################################################
20 | # Classification
21 |
22 | "lrn1 <- lrn('classif.rpart', predict_type = 'prob')
23 | lrn2 <- lrn('classif.ranger', predict_type = 'prob')
24 | lrn3 <- lrn('classif.svm', predict_type = 'prob')
25 |
26 | rpart_cv1 = po('learner_cv', lrn1, id = 'lrn1')
27 | ranger_cv1 = po('learner_cv', lrn2, id = 'lrn2')
28 | svm_cv1 = po('learner_cv', lrn3, id = 'lrn3')
29 | lrns = c(rpart_cv1, ranger_cv1, svm_cv1)
30 |
31 | level0 = gunion(list(
32 | lrns)) %>>%
33 | po('featureunion', id = 'union1')
34 |
35 | ensemble = level0 %>>% LearnerClassifAvg$new(id = 'classif.avg')
36 | ensemble$plot(html = FALSE)
37 |
38 | ens_lrn = GraphLearner$new(ensemble)
39 | ens_lrn$predict_type = 'prob'
40 |
41 | task = mlr_tasks$get('iris')
42 | train.idx = sample(seq_len(task$nrow), 120)
43 | test.idx = setdiff(seq_len(task$nrow), train.idx)
44 |
45 | perf <- ens_lrn$train(task, train.idx)$predict(task, test.idx)$score(msr('classif.acc'))
46 | print(perf)"
47 |
48 | #################################################################################################
49 |
50 | data_train <- readr::read_csv('inst/extdata/tictactoe_train.csv') %>%
51 | as.data.table()
52 |
53 | data_test <- readr::read_csv('inst/extdata/tictactoe_test.csv') %>%
54 | as.data.table()
55 |
56 | data_train[, class := factor(class, levels = unique(class)) %>% sort()]
57 | data_test[, class := factor(class, levels = unique(class)) %>% sort()]
58 |
59 | opt <- SmartML::evocate(df_train = data_train,
60 | df_test = data_test,
61 | models = c('rpart', 'ranger', 'svm'),
62 | #'svm(done)', 'kknn(done)', 'ranger(done)', 'rpart(done)',
63 | #'xgboost(done)', 'cv_glmnet(done)', 'naive_bayes(done)'
64 | optimizationAlgorithm = 'hyperband',
65 | maxTime = 5, ensemble_size = 3)
66 |
67 | print(opt)
68 | gc()
69 |
--------------------------------------------------------------------------------
/tests/testthat.R:
--------------------------------------------------------------------------------
1 | library(testthat)
2 | library(SmartML)
3 |
4 | test_check("SmartML")
5 |
--------------------------------------------------------------------------------
/tests/testthat/test-autorlearn.R:
--------------------------------------------------------------------------------
1 | context("test-autorlearn")
2 |
3 | test_that("option1", {
4 | result1 <- autoRLearn(1, system.file("extdata", "shuttle/train.arff", package = "SmartML"), system.file("extdata", "shuttle/train.arff", package = "SmartML"), option = 1, preProcessF = 'pca', nComp = 3, nModels = 2)
5 | result1$clfs #Vector of recommended nModels classifiers
6 | result1$params #Vector of initial suggested parameter configurations of nModels recommended classifiers
7 | })
8 |
9 |
--------------------------------------------------------------------------------
/tests/testthat/test-hyperband_test.R:
--------------------------------------------------------------------------------
1 | test_that("Parameter sampling works", {
2 | expect_length(param_sample("ranger", "mtry", columns = 11), 1)
3 | })
4 |
--------------------------------------------------------------------------------
/vignettes/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | *.R
3 |
--------------------------------------------------------------------------------
/vignettes/introduction.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to SmartML: Automatic Supervised Machine Learning in R"
3 | author: "Mohamed Maher - Data Systems Group @ University of Tartu"
4 | output: rmarkdown::html_vignette
5 | fig_width: 10
6 | fig_height: 10
7 | vignette: >
8 | %\VignetteIndexEntry{Introduction to SmartML: Automatic Supervised Machine Learning in R}
9 | %\VignetteEngine{knitr::rmarkdown}
10 | %\VignetteEncoding{UTF-8}
11 | ---
12 |
13 | ```{r setup, include = FALSE}
14 | knitr::opts_chunk$set(
15 | collapse = TRUE,
16 | comment = "#>"
17 | )
18 | ```
19 |
20 |
21 | ## SmartML:
22 | Currently, SmartML is an R-Package representing a meta learning-based framework for automated selection and hyperparameter tuning for machine learning algorithms. Being meta-learning based, the framework is able to simulate the role of the machine learning expert. In particular, the framework is equipped with a continuously updated knowledge base that stores information about the meta-features of all processed datasets along with the associated performance of the different classifiers and their tuned parameters. Thus, for any new dataset, SmartML automatically extracts its meta features and searches its knowledge base for the best performing algorithm to start its optimization process. In addition, SmartML makes use of the new runs to continuously enrich its knowledge base to improve its performance and robustness for future runs.
23 |
24 |
25 |
26 | ## SmartML Contribution Points and Goals:
27 |
28 | The goal of SmartML is to automate the process of classifier algorithm selection and hyper-parameter tuning in supervised machine learning, using a modified version of SMAC Bayesian optimization that prefers exploitation over exploration thanks to meta-learning.
29 | 1. SmartML is the first R package to deal with supervised machine learning automation, and it is built over 16 different classifier algorithms from different R packages.
30 | 2. In addition, we offer different data preprocessing and feature engineering algorithms that can be specified by the user and applied easily to tabular datasets with either CSV or ARFF extensions.
31 | 3. SmartML has a collaborative knowledge base that grows over time as more users use the tool.
32 | 4. Finally, SmartML can produce model interpretability plots for feature importance and interaction with the help of the ```iml``` package for ML model interpretability.
33 | 5. SmartML has a web service for the tool with a simple R Shiny interface that can be found HERE, and a demonstration of how to use the web service can be found HERE.
34 |
35 | ## Installation
36 |
37 | You can install the development version of SmartML from [Github](https://github.com/DataSystemsGroupUT/SmartML) with:
38 |
39 | ``` r
40 | devtools::install_github("DataSystemsGroupUT/SmartML")
41 | ```
42 |
43 | ---
44 | ## User Manual
45 |
46 | The manual for the SmartML R package can be found HERE.
47 |
48 | ---
49 | ## Example
50 |
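Below is a minimal usage sketch based on the `autoRLearn` examples shipped with the package documentation; the dataset paths are illustrative and assume the `sampleDatasets` folder is available locally.

``` r
library(SmartML)

# Run algorithm selection and hyper-parameter tuning for 10 minutes
# on a training/testing pair of ARFF files.
result <- autoRLearn(10,
                     'sampleDatasets/shuttle/train.arff',
                     'sampleDatasets/shuttle/test.arff',
                     option = 2)

result$clfs   # best classifier algorithm found
result$params # its tuned parameter configuration
result$perf   # performance on the test set
```
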
51 | ---
52 | ## Contribution Guidelines to SmartML
53 | To contribute to `SmartML`, please follow these guidelines.
54 |
55 | ---
56 | ## Publication
57 |
58 | For more details, you can view our publication about SmartML.
59 | SmartML has been accepted as a demo paper at EDBT 2019 in Lisbon, Portugal [PDF]:
60 | ```
61 | Mohamed Maher, Sherif Sakr. SmartML: A Meta Learning-Based Framework for Automated Selection and Hyperparameter Tuning for Machine Learning Algorithms (2019). Advances in Database Technology - EDBT 2019: 22nd International Conference on Extending Database Technology, Lisbon, Portugal, March 26-29.
62 | ```
63 |
64 | ---
65 | ## Funding:
66 | This work is funded by the European Regional Development Funds via the Mobilitas Plus programme (grant MOBTT75).
67 |
68 | ---
69 | ## Licence:
70 |
71 | © 2019, Data Systems Group at University of Tartu
72 |
73 | This work is licensed under the terms of the GNU General Public License, version 3.0 (GPLv3)
74 |
--------------------------------------------------------------------------------