├── .Rprofile ├── .gitignore ├── README.Rmd ├── README.md ├── README_files └── figure-gfm │   ├── kj-patch-1.png │   ├── kj_patch_kj-1.png │   ├── perf-error-line-1.png │   ├── perf_bt_charts-1.png │   ├── perf_build_times-1.png │   └── unnamed-chunk-1-1.png ├── _drake-kj.R ├── _drake-raschka.R ├── data └── fivek-simdat.pickle ├── duration-experiment ├── kuhn-johnson │   ├── nested-cv-h2o-kj.R │   ├── nested-cv-parsnip-kj.R │   ├── nested-cv-ranger-kj.R │   ├── nested-cv-sklearn-kj.R │   └── nested-cv-tune-kj.R ├── outputs │   ├── 0225-results.png │   ├── duration-pkg-tbl.png │   └── duration-runs.rds ├── package-sources-gt-tbl.R └── raschka │   ├── nested-cv-kj-raschka.R │   ├── nested-cv-mlr3-raschka.R │   ├── nested-cv-py-raschka.py │   └── nested-cv-retic-raschka.R ├── environment.yml ├── grids ├── elast-latin-params.pickle └── rf-latin-params.pickle ├── images └── ncv.png ├── nested-cross-validation-comparison.Rproj ├── palettes └── Forest Floor.ase ├── performance-experiment ├── Kuhn-Johnson │   ├── plan-kj.R │   └── r_make-kj.R ├── Raschka │   ├── plan-raschka.R │   └── r_make-raschka.R ├── functions │   ├── create-grids.R │   ├── create-models.R │   ├── create-ncv-objects.R │   ├── inner-tune.R │   ├── mlbench-data.R │   ├── ncv-compare.R │   ├── outer-cv.R │   └── run-ncv.R ├── output │   ├── perf-exp-output-r.csv │   └── perf-exp-output.csv └── packages.R ├── renv.lock └── renv ├── .gitignore ├── activate.R └── settings.dcf /.Rprofile: -------------------------------------------------------------------------------- 1 | source("renv/activate.R") 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | .env 6 | .drake 7 | .drake-raschka 8 | README.html 9 | ec2-ssh-raw.log 10 | README_cache 11 | check-results.R 12 | perf-exp-output-backup.rds -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | --- 4 | 5 | # Nested Cross-Validation: Comparing Methods and Implementations 6 | 7 | ![](images/ncv.png) 8 | [![DOI](https://zenodo.org/badge/242267104.svg)](https://zenodo.org/badge/latestdoi/242267104) 9 | 10 | Experiments conducted in May 2020. Packages in renv.lock had to be updated in August 2021, but all scripts haven't been re-run to make sure everything still works. So, some refactoring of code may be necessary in order to reproduce results (e.g. {future} progress bar implementation and plan options have changed). 11 | 12 | Nested cross-validation (or double cross-validation) has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Using standard methods such as k-fold cross-validation in these cases may result in substantial increases in optimization bias: the more models that are trained on a fold, the greater the opportunity for a model to achieve a low score by chance. Nested cross-validation has been shown to produce less biased, out-of-sample error estimates even using datasets with only hundreds of rows and therefore gives a better estimate of generalization performance. The primary issue with this technique is that it is usually computationally expensive, with potentially tens of thousands of models being trained during the process.
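To make the loop structure concrete, here is a minimal sketch of what the scripts in this repo do, using the Kuhn-Johnson resampling settings from the duration-experiment scripts (the `sim_data()` helper is defined in those scripts; the chunk is display-only and isn't evaluated when knitting):

```{r, eval=FALSE}
library(rsample)

# outer loop estimates out-of-sample error; inner loop is only used for tuning
ncv <- nested_cv(sim_data(100),
                 outside = vfold_cv(v = 10, repeats = 2),  # outer loop
                 inside  = bootstraps(times = 25))         # inner loop

# for each outer fold: grid-search on its inner resamples, refit on the outer
# analysis set with the winning hyperparameters, then score on the assessment set
```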
13 | 14 | While researching this technique, I found two slightly different variations of performing nested cross-validation — one authored by [Sebastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://www.tidymodels.org/learn/work/nested-resampling/). After the nested cross-validation procedure finishes and an algorithm is chosen, Raschka performs an extra k-fold cross-validation using the inner-loop cv strategy on the entire training set in order to tune his final model. Therefore, the hyperparameter tuning that takes place in the inner-loop during nested cross-validation is only in service of algorithm selection. Kuhn-Johnson uses a majority vote: whichever set of hyperparameter values has been chosen most often during the inner-loop tuning procedure is the set used to fit the final model. The other differences are just the number of folds/resamples used in the outer and inner loops, which are essentially tuning parameters. 15 | 16 | Various elements of the technique affect the run times and performance. These include: 17 | 18 | 1. Hyperparameter value grids 19 | 2. Grid search strategy 20 | 3. Inner-Loop CV strategy 21 | 4. Outer-Loop CV strategy 22 | 23 | I'll be examining two aspects of nested cross-validation: 24 | 25 | 1. Duration: Find out which packages and combinations of model functions give us the fastest implementation of each method. 26 | 2. Performance: First, develop a testing framework. Then, for a given data generating process, determine how large a sample size is needed to obtain a reasonably accurate out-of-sample error estimate. Also, determine how many repeats in the outer-loop cv strategy should be used to calculate this error estimate. 27 | 28 | The results from these experiments should give us an idea about which methodology, model packages, and compute specifications will produce lower training times, lower costs, and lower generalization error. 29 | 30 | 31 | ## Recommendations: 32 | * For faster training times, use {mlr3} (caveat: see Discussion) or other R model packages outside of the {tidymodels} ecosystem and code the nested cross-validation loops manually (Code: [mlr3](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/raschka/nested-cv-mlr3-raschka.R), [ranger-kj](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R), [Kuhn-Johnson](https://www.tidymodels.org/learn/work/nested-resampling/)). 33 | * Choose compute resources with large amounts of RAM instead of opting for powerful processors. From the AWS CPU product line, I found the r5.#xlarge instances ran fastest. The most efficient number of vCPUs may vary according to the algorithm. 34 | * For the data in this experiment with row numbers in the low thousands, Raschka's method performed just as well as Kuhn-Johnson's but was substantially faster. 35 | * For the data in this experiment with row numbers in the hundreds, Raschka's method with at least 3 repeats performed just as well as Kuhn-Johnson's but was still substantially faster even with the repeats. 36 | 37 | 38 | 39 | ## Duration 40 | #### Experiment details: 41 | 42 | * Random Forest and Elastic Net Regression algorithms 43 | * Both algorithms are tuned with 100x2 hyperparameter grids using a latin hypercube design.
44 | * From {mlbench}, I'm using the generated data set, friedman1, from Friedman's Multivariate Adaptive Regression Splines (MARS) paper. 45 | * Kuhn-Johnson 46 | + 100 observations: 10 features, numeric target variable 47 | + outer loop: 2 repeats, 10 folds 48 | + inner loop: 25 bootstrap resamples 49 | * Raschka 50 | + 5000 observations: 10 features, numeric target variable 51 | + outer loop: 5 folds 52 | + inner loop: 2 folds 53 | 54 | The sizes of the data sets are the same as those in the original scripts by the authors. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer-loop * number of folds/resamples in the inner-loop) are trained for each algorithm — using Raschka's, 1,001 models for each algorithm. The one extra model in the Raschka variation is due to his method of choosing the hyperparameter values for the final model. 55 | 56 | [MLFlow](https://mlflow.org/docs/latest/index.html) is used to keep track of the duration (seconds) of each run along with the implementation and method used. 57 | 58 | ![](duration-experiment/outputs/0225-results.png) 59 | 60 | ![](duration-experiment/outputs/duration-pkg-tbl.png) 61 | 62 | ```{r, echo=FALSE, message=FALSE} 63 | pacman::p_load(dplyr, ggplot2, patchwork, stringr, tidytext) 64 | 65 | runs_raw <- readr::read_rds("duration-experiment/outputs/duration-runs.rds") 66 | 67 | 68 | 69 | runs <- runs_raw %>% 70 | mutate(duration = round(duration/60, 2), 71 | implementation = as.factor(str_to_title(implementation)), 72 | implementation = reorder_within(implementation, duration, method)) 73 | 74 | 75 | raschka <- runs %>% 76 | filter(method == "raschka") %>% 77 | ggplot(aes(y = duration, x = implementation, label = duration)) + 78 | geom_bar(aes(color = after_scale(prismatic::clr_darken(rep("#195198",4), 0.3))), fill = "#195198", stat = "identity", width = 0.50) + 79 | coord_flip() + 80 | scale_x_reordered() + 81 | geom_text(hjust = 1.3, size = 3.5, color = "white") + 82 | labs(x = NULL, y = NULL, 83 | title = "Raschka") + 84 | theme(plot.title = element_text(size = rel(0.9))) 85 | 86 | 87 | kj <- runs %>% 88 | filter(method == "kj") %>% 89 | ggplot(aes(y = duration, x = implementation, label = duration)) + 90 | geom_bar(aes(color = after_scale(prismatic::clr_darken(rep("#BD9865",5), 0.3))), stat = "identity", width = 0.50, fill = "#BD9865") + 91 | coord_flip() + 92 | scale_x_reordered() + 93 | geom_text(hjust = 1.3, size = 3.5, color = "black") + 94 | labs(x = NULL, y = NULL, 95 | title = "Kuhn-Johnson") + 96 | theme(plot.title = element_text(size = rel(0.9))) 97 | 98 | durations <- raschka + kj + 99 | plot_annotation(title = "Durations", 100 | subtitle = "minutes") & 101 | theme(axis.ticks = element_blank(), 102 | axis.text.x = element_blank(), 103 | panel.background = element_rect(fill = "ivory", 104 | colour = "ivory"), 105 | plot.background = element_rect(fill = "ivory"), 106 | plot.subtitle = element_text(size = rel(0.85)), 107 | panel.border = element_blank(), 108 | panel.grid.major = element_blank(), 109 | panel.grid.minor = element_blank() 110 | ) 111 | durations 112 | 113 | 114 | ``` 115 | 116 | #### Duration Results: 117 | 118 | * For the Raschka method, the {mlr3} comes in first with {ranger}/{parsnip} coming in a close second. 119 | * For the Kuhn-Johnson method, {ranger}/{parsnip} is clearly fastest. 120 | * This was my first time using the {reticulate} package, and I wanted to see if there was any speed penalty for using its api instead of just running a straight Python script. 
There doesn't appear to be any. 121 | * {h2o} and {sklearn} are surprisingly slow. If the data size were larger, I think {h2o} would be more competitive. 122 | * The {tidymodels} packages, {parsnip} and {tune}, add substantial overhead. 123 | 124 | 125 | ## Performance 126 | 127 | #### Experiment details: 128 | 129 | * The same data, algorithms, and hyperparameter grids are used. 130 | * The fastest implementation of each method is used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy. 131 | * The {mlr3} implementation is the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation is close. To simplify, I am using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods. 132 | * The chosen algorithm, with its selected hyperparameters, is fit on the entire training set, and the resulting final model predicts on a 100K row Friedman dataset. 133 | * The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset is calculated for each combination of repeat, data size, and method. 134 | * To make this experiment manageable in terms of runtimes, I am using AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for the Random Forest. 135 | + Also see the Discussion section 136 | * Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and the impermanent nature of my internet connection, it's nice to be able to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate (a simplified sketch of the plan follows).
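The gist of the {drake} setup is a plan whose targets are crossed over sample size and repeat count, so each combination becomes a separately cached subtarget. Below is a simplified sketch of that idea; the real plan lives in performance-experiment/Kuhn-Johnson/plan-kj.R, and the `run_ncv()` signature shown here is assumed for illustration rather than copied from that file (the sizes and repeat counts match the ones discussed in the results).

```{r, eval=FALSE}
library(drake)

plan <- drake_plan(
  ncv_results = target(
    run_ncv(n = sizes, repeats = reps),
    # static branching: one cached subtarget per sample-size/repeat combination
    transform = cross(sizes = c(100, 800, 2000, 5000), reps = c(1, 2, 3, 4, 5))
  )
)

make(plan)  # finished subtargets stay cached, so an interrupted run resumes where it left off
```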
137 | 138 | ```{r perf_build_times_kj, echo=FALSE, message=FALSE} 139 | 140 | pacman::p_load(dplyr, purrr, lubridate, ggplot2, ggfittext, drake, patchwork) 141 | bt <- build_times(starts_with("ncv_results"), digits = 4) 142 | 143 | subtarget_bts <- bt %>% 144 | filter(stringr::str_detect(target, pattern = "[0-9]_([0-9]|[a-z])")) %>% 145 | select(target, elapsed) 146 | 147 | subtargets_raw <- map_dfr(subtarget_bts$target, function(x) { 148 | results <- readd(x, character_only = TRUE) %>% 149 | mutate(subtarget = x) %>% 150 | select(subtarget, everything()) 151 | 152 | }) %>% 153 | inner_join(subtarget_bts, by = c("subtarget" = "target")) 154 | 155 | subtargets <- subtargets_raw %>% 156 | mutate(repeats = factor(repeats), 157 | n = factor(n), 158 | elapsed = round(as.numeric(elapsed)/3600, 2), 159 | percent_error = round(delta_error/oos_error, 3)) 160 | 161 | readr::write_csv(subtargets, "performance-experiment/output/perf-exp-output.csv") 162 | # readr::write_rds(subtargets, "performance-experiment/output/perf-exp-output-backup.rds") 163 | 164 | ``` 165 | 166 | ```{r perf_bt_charts_kj, echo=FALSE, message=FALSE} 167 | 168 | fill_colors <- unname(swatches::read_ase("palettes/Forest Floor.ase")) 169 | 170 | b <- ggplot(subtargets, aes(y = elapsed, x = repeats, 171 | fill = n, label = elapsed)) + 172 | geom_col(position = position_dodge(width = 0.85)) + 173 | scale_fill_manual(values = fill_colors[4:7]) + 174 | geom_bar_text(position = "dodge", min.size = 9, 175 | place = "right", contrast = TRUE) + 176 | coord_flip() + 177 | labs(y = "Runtime (hrs)", x = "Repeats", 178 | fill = "Sample Size") + 179 | theme(legend.position = "top", 180 | legend.background = element_rect(fill = "ivory"), 181 | legend.key = element_rect(fill = "ivory"), 182 | axis.ticks = element_blank(), 183 | panel.background = element_rect(fill = "ivory", 184 | colour = "ivory"), 185 | plot.background = element_rect(fill = "ivory"), 186 | panel.border = element_blank(), 187 | panel.grid.major = element_blank(), 188 | panel.grid.minor = element_blank() 189 | ) 190 | 191 | ``` 192 | 193 | ```{r perf_error_line_kj, echo=FALSE, message=FALSE} 194 | e <- ggplot(subtargets, aes(x = repeats, y = percent_error, group = n)) + 195 | geom_point(aes(color = n), size = 3) + 196 | geom_line(aes(color = n), size = 2) + 197 | expand_limits(y = c(0, 0.10)) + 198 | scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) + 199 | scale_color_manual(values = fill_colors[4:7]) + 200 | labs(y = "Percent Error", x = "Repeats", 201 | color = "Sample Size") + 202 | theme(legend.position = "top", 203 | legend.background = element_rect(fill = "ivory"), 204 | legend.key = element_rect(fill = "ivory"), 205 | axis.ticks = element_blank(), 206 | panel.background = element_rect(fill = "ivory", 207 | color = "ivory"), 208 | plot.background = element_rect(fill = "ivory"), 209 | panel.border = element_blank(), 210 | panel.grid.major = element_blank(), 211 | panel.grid.minor = element_blank() 212 | ) 213 | ``` 214 | 215 | ```{r kj_patch_kj, echo=FALSE, fig.width=12, fig.height=7} 216 | b + e + plot_layout(guides = "auto") + 217 | plot_annotation(title = "Kuhn-Johnson") & 218 | theme(legend.position = "top", 219 | legend.text = element_text(size = 12), 220 | axis.text.x = element_text(size = 11, 221 | face = "bold"), 222 | axis.text.y = element_text(size = 11, 223 | face = "bold"), 224 | panel.background = element_rect(fill = "ivory", 225 | color = "ivory"), 226 | plot.background = element_rect(fill = "ivory"),) 227 | ``` 228 | 229 | #### Performance 
Results (Kuhn-Johnson): 230 | 231 | * Runtimes for n = 100 and n = 800 are close, and there's a large jump in runtime going from n = 2000 to n = 5000. 232 | * The number of repeats has little effect on the amount of percent error. 233 | * For n = 100, there is substantially more variation in percent error than in the other sample sizes. 234 | * While there is a large runtime cost that comes with increasing the sample size from 2000 to 5000 observations, it doesn't seem to provide any benefit in gaining a more accurate estimate of the out-of-sample error. 235 | 236 | 237 | ```{r perf_build_times_r, echo=FALSE, message=FALSE} 238 | 239 | cache_raschka <- drake_cache(path = ".drake-raschka") 240 | 241 | bt_r <- build_times(starts_with("ncv_results"), 242 | digits = 4, cache = cache_raschka) 243 | 244 | subtarget_bts_r <- bt_r %>% 245 | filter(stringr::str_detect(target, pattern = "[0-9]_([0-9]|[a-z])")) %>% 246 | select(target, elapsed) 247 | 248 | subtargets_raw_r <- map_dfr(subtarget_bts_r$target, function(x) { 249 | results <- readd(x, character_only = TRUE, 250 | cache = cache_raschka) %>% 251 | mutate(subtarget = x) %>% 252 | select(subtarget, everything()) 253 | 254 | }) %>% 255 | inner_join(subtarget_bts_r, by = c("subtarget" = "target")) 256 | 257 | subtargets_r <- subtargets_raw_r %>% 258 | mutate(repeats = factor(repeats), 259 | n = factor(n), 260 | elapsed = round(as.numeric(elapsed)/60, 2), 261 | percent_error = round(delta_error/oos_error, 3)) 262 | 263 | readr::write_csv(subtargets_r, "performance-experiment/output/perf-exp-output-r.csv") 264 | # readr::write_rds(subtargets, "performance-experiment/output/perf-exp-output-backup-r.rds") 265 | 266 | ``` 267 | 268 | ```{r perf_bt_charts_r, echo=FALSE, message=FALSE} 269 | 270 | b_r <- ggplot(subtargets_r, aes(y = elapsed, x = repeats, 271 | fill = n, label = elapsed)) + 272 | geom_col(position = position_dodge(width = 0.85)) + 273 | scale_fill_manual(values = fill_colors[4:7]) + 274 | geom_bar_text(position = "dodge", min.size = 9, 275 | place = "right", contrast = TRUE) + 276 | coord_flip() + 277 | labs(y = "Runtime (minutes)", x = "Repeats", 278 | fill = "Sample Size") + 279 | theme(legend.position = "top", 280 | legend.background = element_rect(fill = "ivory"), 281 | legend.key = element_rect(fill = "ivory"), 282 | axis.ticks = element_blank(), 283 | panel.background = element_rect(fill = "ivory", 284 | colour = "ivory"), 285 | plot.background = element_rect(fill = "ivory"), 286 | panel.border = element_blank(), 287 | panel.grid.major = element_blank(), 288 | panel.grid.minor = element_blank() 289 | ) 290 | 291 | ``` 292 | 293 | ```{r perf-error-line_r, echo=FALSE, message=FALSE} 294 | e_r <- ggplot(subtargets_r, aes(x = repeats, y = percent_error, group = n)) + 295 | geom_point(aes(color = n), size = 3) + 296 | geom_line(aes(color = n), size = 2) + 297 | expand_limits(y = c(0, 0.10)) + 298 | scale_y_continuous(labels = scales::percent_format(accuracy = 0.1), 299 | breaks = seq(0,0.125, by=0.025)) + 300 | scale_color_manual(values = fill_colors[4:7]) + 301 | labs(y = "Percent Error", x = "Repeats", 302 | color = "Sample Size") + 303 | theme(legend.position = "top", 304 | legend.background = element_rect(fill = "ivory"), 305 | legend.key = element_rect(fill = "ivory"), 306 | axis.ticks = element_blank(), 307 | panel.background = element_rect(fill = "ivory", 308 | color = "ivory"), 309 | plot.background = element_rect(fill = "ivory"), 310 | panel.border = element_blank(), 311 | panel.grid.major = element_blank(), 312 | 
panel.grid.minor = element_blank() 313 | ) 314 | ``` 315 | 316 | ```{r kj-patch, echo=FALSE, fig.width=12, fig.height=7} 317 | b_r + e_r + plot_layout(guides = "auto") + 318 | plot_annotation(title = "Raschka") & 319 | theme(legend.position = "top", 320 | legend.text = element_text(size = 12), 321 | axis.text.x = element_text(size = 11, 322 | face = "bold"), 323 | axis.text.y = element_text(size = 11, 324 | face = "bold"), 325 | panel.background = element_rect(fill = "ivory", 326 | color = "ivory"), 327 | plot.background = element_rect(fill = "ivory"),) 328 | ``` 329 | 330 | 331 | #### Performance Results (Raschka): 332 | 333 | * The longest runtime is under 30 minutes, so runtime isn't as large a consideration if we are only comparing a few algorithms. 334 | * There isn't much difference in runtime between n = 100 and n = 2000. 335 | * For n = 100, there's a relatively large change in percent error when going from 1 repeat to 2 repeats. The error estimate then stabilizes for repeats 3 through 5. 336 | * n = 5000 gives poorer out-of-sample error estimates than n = 800 and n = 2000 for all values of repeats. 337 | * n = 800 remains under 2.5% error for all repeat values, but also shows considerable volatility. 338 | 339 | 340 | ## Discussion 341 | * {mlr3} wasn't included in the Kuhn-Johnson section of the duration experiment, because with the extra folds/resamples, the RAM usage rapidly increases to the maximum and either locks up or slows the training time tremendously. I haven't explored this further. 342 | * The elasticnet model was slower to train than the random forest for the 100 row dataset. Compute resources should be optimized for each algorithm. For example, the number of vCPUs capable of being utilized by a random forest algorithm is much higher than the number for an elasticnet algorithm. The elasticnet only used the number of vCPUs that matched the number of training folds while the random forest used all available vCPUs. Using a sparse matrix or another package (e.g. biglasso) might help to lower training times for elasticnet. 343 | * Adjusting the inner-loop strategy seems to have the most effect on the volatility of the results. 344 | * For data sizes of a few thousand rows, Kuhn-Johnson trains 50x as many models and takes 8x longer to run for a similar amount of generalization error compared to the Raschka method. The similar results in generalization error might be specific to this dataset though. 345 | + Kuhn-Johnson's runtime starts to really balloon once you get into datasets with over a thousand rows. 346 | + The extra folds in the outer loop made a huge difference. With Kuhn-Johnson, the runtimes were hours, and with Raschka's, it was minutes. 347 | * For smaller datasets, you should have at least 3 repeats when running Raschka's method. 348 | * This is just one dataset, but I still found it surprising how little a difference repeats made in reducing generalization error. The benefit only kicked in with the dataset that had hundreds of rows. 349 | 350 | 351 | ## Next Steps 352 | * The performance experimental framework used here could be useful as a way to gain insight into the amounts and types of resources that a project's first steps might require. For example, testing simulated data before collection of actual data begins. I might try to see if there's much difficulty extending it to {mlr3} (assuming I can figure out the RAM issue) and Python. 353 | * Experiment with Raschka's method more.
354 | + Raschka's method using the majority vote method from Kuhn-Johnson for the final hyperparameter settings might be an additional optimization step. If the final k-fold cv can be discarded without much loss in generalization error, then maybe training times can be shortened further. 355 | + There's probably room to increase the number of folds in the inner-loop of Raschka's method in order to gain more stable results while keeping the training time comparatively low. 356 | * There should be a version of this technique that's capable of working for time series. I have ideas, so it might be something I'll work on for a future project. 357 | 358 | 359 | 360 | 361 | ## References 362 | 363 | Boulesteix, AL, and C Strobl. 2009. "Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High-Dimensional Prediction." BMC Medical Research Methodology 9 (1): 85. [link](https://www.researchgate.net/publication/40756303_Optimal_classifier_selection_and_negative_bias_in_error_rate_estimation_An_empirical_study_on_high-dimensional_prediction) 364 | 365 | Sebastian Raschka, "STAT 479 Statistical Tests and Algorithm Comparison," (Lecture Notes, University of Wisconsin-Madison, Fall 2019). [link](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/11-eval4-algo__notes.pdf) 366 | 367 | Sudhir Varma and Richard Simon. "Bias in error estimation when using cross-validation for model selection". In: BMC bioinformatics 7.1 (2006). p. 91. [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-91) 368 | 369 | 370 | 371 | 372 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Nested Cross-Validation: Comparing Methods and Implementations 3 | 4 | ![](images/ncv.png) 5 | [![DOI](https://zenodo.org/badge/242267104.svg)](https://zenodo.org/badge/latestdoi/242267104) 6 | 7 | Experiments conducted in May 2020. Packages in renv.lock had to be 8 | updated in August 2021, but all scripts haven’t been re-run to make sure 9 | everything still works. So, some refactoring of code may be necessary in 10 | order to reproduce results (e.g. {future} progress bar implementation 11 | and plan options have changed). 12 | 13 | Nested cross-validation (or double cross-validation) has become a 14 | recommended technique for situations in which the size of our dataset is 15 | insufficient to simultaneously handle hyperparameter tuning and 16 | algorithm comparison. Using standard methods such as k-fold 17 | cross-validation in these cases may result in substantial increases in 18 | optimization bias: the more models that are trained on a fold, the 19 | greater the opportunity for a model to achieve a low score by 20 | chance. Nested cross-validation has been shown to produce less biased, 21 | out-of-sample error estimates even using datasets with only hundreds of 22 | rows and therefore gives a better estimate of generalization performance. 23 | The primary issue with this technique is that it is usually 24 | computationally expensive, with potentially tens of thousands of models 25 | being trained during the process.
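The optimization-bias claim above can be illustrated with a toy simulation (not part of this repo): if many candidate hyperparameter settings with the same true error are scored on one small fold, the best observed score is optimistic, which is exactly what scoring the winner on untouched outer folds guards against.

``` r
set.seed(1)
# 100 candidates, all with a true error of 1.0; fold-level estimates are noisy
picked <- replicate(5000, min(rnorm(100, mean = 1, sd = 0.1)))
mean(picked)  # roughly 0.75, so the selected model looks better than it is
```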
26 | 27 | While researching this technique, I found two slightly different 28 | variations of performing nested cross-validation — one authored by 29 | [Sebastian 30 | Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) 31 | and the other by [Max Kuhn and Kjell 32 | Johnson](https://www.tidymodels.org/learn/work/nested-resampling/). 33 | After the nested cross-validation procedure finishes and an algorithm is 34 | chosen, Raschka performs an extra k-fold cross-validation using the inner-loop 35 | cv strategy on the entire training set in order to tune his final model. 36 | Therefore, the hyperparameter tuning that takes place in the inner-loop 37 | during nested cross-validation is only in service of algorithm 38 | selection. Kuhn-Johnson uses a majority vote: whichever set of 39 | hyperparameter values has been chosen most often during the inner-loop 40 | tuning procedure is the set used to fit the final model. The 41 | other differences are just the number of folds/resamples used in the 42 | outer and inner loops, which are essentially tuning parameters. 43 | 44 | Various elements of the technique affect the run times and performance. 45 | These include: 46 | 47 | 1. Hyperparameter value grids 48 | 2. Grid search strategy 49 | 3. Inner-Loop CV strategy 50 | 4. Outer-Loop CV strategy 51 | 52 | I’ll be examining two aspects of nested cross-validation: 53 | 54 | 1. Duration: Find out which packages and combinations of model 55 | functions give us the fastest implementation of each method. 56 | 2. Performance: First, develop a testing framework. Then, for a given 57 | data generating process, determine how large a sample size is 58 | needed to obtain a reasonably accurate out-of-sample error estimate. 59 | Also, determine how many repeats in the outer-loop cv strategy 60 | should be used to calculate this error estimate. 61 | 62 | The results from these experiments should give us an idea about which 63 | methodology, model packages, and compute specifications will produce 64 | lower training times, lower costs, and lower generalization error. 65 | 66 | ## Recommendations: 67 | 68 | - For faster training times, use {mlr3} (caveat: see Discussion) or 69 | other R model packages outside of the {tidymodels} ecosystem and 70 | code the nested cross-validation loops manually (Code: 71 | [mlr3](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/raschka/nested-cv-mlr3-raschka.R), 72 | [ranger-kj](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R), 73 | [Kuhn-Johnson](https://www.tidymodels.org/learn/work/nested-resampling/)). 74 | - 2022-11-11 {parsnip} has undergone changes potentially leading to a 3-fold speed-up ([link](https://parsnip.tidymodels.org/news/index.html#parsnip-103)) 75 | - Choose compute resources with large amounts of RAM instead of opting 76 | for powerful processors. From the AWS CPU product line, I found the 77 | r5.\#xlarge instances ran fastest. The most efficient number of 78 | vCPUs may vary according to the algorithm. 79 | - For the data in this experiment with row numbers in the low 80 | thousands, Raschka’s method performed just as well as Kuhn-Johnson’s 81 | but was substantially faster.
82 | - For the data in this experiment with row numbers in the hundreds, 83 | Raschka’s method with at least 3 repeats performed just as well as 84 | Kuhn-Johnson’s but was still substantially faster even with the 85 | repeats. 86 | 87 | ## Duration 88 | 89 | #### Experiment details: 90 | 91 | - Random Forest and Elastic Net Regression algorithms 92 | - Both algorithms are tuned with 100x2 hyperparameter grids using a 93 | latin hypercube design. 94 | - From {mlbench}, I’m using the generated data set, friedman1, from 95 | Friedman’s Multivariate Adaptive Regression Splines (MARS) paper. 96 | - Kuhn-Johnson 97 | - 100 observations: 10 features, numeric target variable 98 | - outer loop: 2 repeats, 10 folds 99 | - inner loop: 25 bootstrap resamples 100 | - Raschka 101 | - 5000 observations: 10 features, numeric target variable 102 | - outer loop: 5 folds 103 | - inner loop: 2 folds 104 | 105 | The sizes of the data sets are the same as those in the original scripts 106 | by the authors. Using Kuhn-Johnson, 50,000 models (grid size \* number 107 | of repeats \* number of folds in the outer-loop \* number of 108 | folds/resamples in the inner-loop) are trained for each algorithm — 109 | using Raschka’s, 1,001 models for each algorithm. The one extra model in 110 | the Raschka variation is due to his method of choosing the 111 | hyperparameter values for the final model. 112 | 113 | [MLFlow](https://mlflow.org/docs/latest/index.html) is used to keep 114 | track of the duration (seconds) of each run along with the 115 | implementation and method used. 116 | 117 | ![](duration-experiment/outputs/0225-results.png) 118 | 119 | ![](duration-experiment/outputs/duration-pkg-tbl.png) 120 | 121 | ![](README_files/figure-gfm/unnamed-chunk-1-1.png) 122 | 123 | #### Duration Results: 124 | 125 | - For the Raschka method, the {mlr3} comes in first with 126 | {ranger}/{parsnip} coming in a close second. 127 | - For the Kuhn-Johnson method, {ranger}/{parsnip} is clearly 128 | fastest. 129 | - This was my first time using the {reticulate} package, and I wanted 130 | to see if there was any speed penalty for using its api instead of 131 | just running a straight Python script. There doesn’t appear to be 132 | any. 133 | - {h2o} and {sklearn} are surprisingly slow. If the data size were 134 | larger, I think {h2o} would be more competitive. 135 | - The {tidymodels} packages, {parsnip} and {tune}, add substantial 136 | overhead. 137 | - 2022-11-11 {parsnip} has undergone changes potentially leading to a 3-fold speed-up ([link](https://parsnip.tidymodels.org/news/index.html#parsnip-103)) 138 | 139 | ## Performance 140 | 141 | #### Experiment details: 142 | 143 | - The same data, algorithms, and hyperparameter grids are used. 144 | - The fastest implementation of each method is used in running a 145 | nested cross-validation with different sizes of data ranging from 146 | 100 to 5000 observations and different numbers of repeats of the 147 | outer-loop cv strategy. 148 | - The {mlr3} implementation is the fastest for Raschka’s method, 149 | but the Ranger-Kuhn-Johnson implementation is close. To 150 | simplify, I am using 151 | [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) 152 | for both methods. 153 | - The chosen algorithm with hyperparameters is fit on the entire 154 | training set, and the resulting final model predicts on a 100K row 155 | Friedman dataset. 
156 | - The percent error between the the average mean absolute error (MAE) 157 | across the outer-loop folds and the MAE of the predictions on this 158 | 100K dataset is calculated for each combination of repeat, data 159 | size, and method. 160 | - To make this experiment manageable in terms of runtimes, I am using 161 | AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge 162 | for Random Forest. 163 | - Also see the Discussion section 164 | - Iterating through different numbers of repeats, sample sizes, and 165 | methods makes a functional approach more appropriate than running 166 | imperative scripts. Also, given the long runtimes and impermanent 167 | nature of my internet connection, it would also be nice to cache 168 | each iteration as it finishes. The 169 | [{drake}](https://github.com/ropensci/drake) package is superb on 170 | both counts, so I’m using it to orchestrate. 171 | 172 | ![](README_files/figure-gfm/kj_patch_kj-1.png) 173 | 174 | #### Performance Results (Kuhn-Johnson): 175 | 176 | - Runtimes for n = 100 and n = 800 are close, and there’s a large jump 177 | in runtime going from n = 2000 to n = 5000. 178 | - The number of repeats has little effect on the amount of percent 179 | error. 180 | - For n = 100, there is substantially more variation in percent error 181 | than in the other sample sizes. 182 | - While there is a large runtime cost that comes with increasing the 183 | sample size from 2000 to 5000 observations, it doesn’t seem to 184 | provide any benefit in gaining a more accurate estimate of the 185 | out-of-sample error. 186 | 187 | ![](README_files/figure-gfm/kj-patch-1.png) 188 | 189 | #### Performance Results (Raschka): 190 | 191 | - The longest runtime is under 30 minutes, so runtime isn’t as large 192 | of a consideration if we are only comparing a few algorithms. 193 | - There isn’t much difference in runtime between n = 100 and n = 194 | 2000. 195 | - For n = 100, there’s a relatively large change in percent error when 196 | going from 1 repeat to 2 repeats. The error estimate then stabilizes 197 | for repeats 3 through 5. 198 | - n = 5000 gives poorer out-of-sample error estimates than n = 800 and 199 | n = 2000 for all values of repeats. 200 | - n = 800 remains under 2.5% percent error for all repeat values, but 201 | also shows considerable volatility. 202 | 203 | ## Discussion 204 | 205 | - {mlr3} wasn’t included in the Kuhn-Johnson section of the duration 206 | experiment, because with the extra folds/resamples, the RAM usage 207 | rapidly increases to the maximum and either locks-up or slows the 208 | training time tremendously. I haven’t explored this further. 209 | - The elasticnet model was slower to train than the random forest for 210 | the 100 row dataset. Compute resources should be optimized for each 211 | algorithm. For example, the number of vCPUs capable of being 212 | utilized by a random forest algorithm is much higher than number for 213 | an elasticnet algorithm. The elasticnet only used the number of 214 | vCPUs that matched the number of training folds while the random 215 | forest used all available vCPUs. Using a sparse matrix or another 216 | package (e.g. biglasso) might help to lower training times for 217 | elasticnet. 218 | - Adjusting the inner-loop strategy seems to have the most effect on 219 | the volatility of the results. 
- For data sizes of a few thousand rows, Kuhn-Johnson trains 50x as 221 | many models and takes 8x longer to run for a similar amount of 222 | generalization error compared to the Raschka method. The similar 223 | results in generalization error might be specific to this dataset 224 | though. 225 | - Kuhn-Johnson’s runtime starts to really balloon once you get 226 | into datasets with over a thousand rows. 227 | - The extra folds in the outer loop made a huge difference. With 228 | Kuhn-Johnson, the runtimes were hours, and with Raschka’s, it 229 | was minutes. 230 | - For smaller datasets, you should have at least 3 repeats when 231 | running Raschka’s method. 232 | - This is just one dataset, but I still found it surprising how little 233 | a difference repeats made in reducing generalization error. The 234 | benefit only kicked in with the dataset that had hundreds of rows. 235 | 236 | ## Next Steps 237 | 238 | - The performance experimental framework used here could be useful as 239 | a way to gain insight into the amounts and types of resources that a 240 | project’s first steps might require. For example, testing simulated 241 | data before collection of actual data begins. I might try to see if 242 | there’s much difficulty extending it to {mlr3} (assuming I can 243 | figure out the RAM issue) and Python. 244 | - Experiment with Raschka’s method more. 245 | - Raschka’s method using the majority vote method from 246 | Kuhn-Johnson for the final hyperparameter settings might be an 247 | additional optimization step. If the final k-fold cv can be 248 | discarded without much loss in generalization error, then maybe 249 | training times can be shortened further. 250 | - There’s probably room to increase the number of folds in the 251 | inner-loop of Raschka’s method in order to gain more stable 252 | results while keeping the training time comparatively low. 253 | - There should be a version of this technique that’s capable of 254 | working for time series. I have ideas, so it might be something I’ll 255 | work on for a future project. 256 | 257 | ## References 258 | 259 | Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and 260 | Negative Bias in Error Rate Estimation: An Empirical Study on 261 | High-Dimensional Prediction.” BMC Medical Research Methodology 9 (1): 262 | 85. 263 | [link](https://www.researchgate.net/publication/40756303_Optimal_classifier_selection_and_negative_bias_in_error_rate_estimation_An_empirical_study_on_high-dimensional_prediction) 264 | 265 | Sebastian Raschka, “STAT 479 Statistical Tests and Algorithm 266 | Comparison,” (Lecture Notes, University of Wisconsin-Madison, Fall 267 | 2019). 268 | [link](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/11-eval4-algo__notes.pdf) 269 | 270 | Sudhir Varma and Richard Simon. “Bias in error estimation when using 271 | cross-validation for model selection”. In: BMC bioinformatics 7.1 272 | (2006). p. 91.
273 | [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-91) 274 | -------------------------------------------------------------------------------- /README_files/figure-gfm/kj-patch-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/kj-patch-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/kj_patch_kj-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/kj_patch_kj-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/perf-error-line-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/perf-error-line-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/perf_bt_charts-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/perf_bt_charts-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/perf_build_times-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/perf_build_times-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/unnamed-chunk-1-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/unnamed-chunk-1-1.png -------------------------------------------------------------------------------- /_drake-kj.R: -------------------------------------------------------------------------------- 1 | # drake make file for Kuhn-Johnson performance experiment 2 | 3 | 4 | # Notes: 5 | # 1. see plan-kj.R for more details on how this thing works 6 | # 2. 
link to {future} issue with instructions on special PuTTY settings, https://github.com/HenrikBengtsson/future/issues/370 7 | 8 | 9 | # load packages, functions, and drake plan 10 | source("performance-experiment/packages.R") 11 | source("performance-experiment/functions/mlbench-data.R") 12 | source("performance-experiment/functions/create-ncv-objects.R") 13 | source("performance-experiment/functions/create-models.R") 14 | source("performance-experiment/functions/create-grids.R") 15 | source("performance-experiment/functions/inner-tune.R") 16 | source("performance-experiment/functions/outer-cv.R") 17 | source("performance-experiment/functions/ncv-compare.R") 18 | source("performance-experiment/functions/run-ncv.R") 19 | source("performance-experiment/Kuhn-Johnson/plan-kj.R") 20 | 21 | 22 | 23 | 24 | set.seed(2019) 25 | 26 | # Using different compute sizes for each model 27 | ip1 <- Sys.getenv("GLMEC2IP") 28 | ip2 <- Sys.getenv("RFEC2IP") 29 | public_ips <- c(ip1, ip2) 30 | # ppk file converted by PuTTY from an AWS pem file 31 | ssh_private_key_file <- Sys.getenv("AWSKEYPATH") 32 | 33 | 34 | cl <- future::makeClusterPSOCK( 35 | 36 | ## Public IP numbers of EC2 instances 37 | public_ips, 38 | 39 | ## User name (always 'ubuntu') 40 | user = "ubuntu", 41 | 42 | ## Use private SSH key registered with AWS 43 | ## futureSettings is a saved PuTTY session with settings to keep ssh active 44 | rshcmd = c("plink", "-ssh", "-load", "futureSettings","-i", ssh_private_key_file), 45 | rshopts = c( 46 | "-sshrawlog", "ec2-ssh-raw.log" 47 | ), 48 | 49 | rscript_args = c("-e", shQuote(".libPaths('/home/rstudio/R/x86_64-pc-linux-gnu-library/3.6')") 50 | ), 51 | verbose = TRUE 52 | ) 53 | 54 | 55 | future::plan(list(tweak(cluster, workers = cl), multiprocess)) 56 | 57 | cache_kj <- drake_cache(path = ".drake") 58 | 59 | # verbose = 0 prints nothing, verbose = 1 prints message as each target completes; verbose = 2 adds a progress bar that tracks target completion 60 | drake_config( 61 | plan, 62 | verbose = 1, 63 | lock_envir = FALSE, 64 | jobs_preprocess = 7, 65 | cache = cache_kj 66 | ) 67 | 68 | -------------------------------------------------------------------------------- /_drake-raschka.R: -------------------------------------------------------------------------------- 1 | # drake make file for Raschka performance experiment 2 | 3 | 4 | # Notes: 5 | # 1. see plan-raschka.R for more details on how this thing works 6 | # 2. 
link to {future} issue with instructions on special PuTTY settings, https://github.com/HenrikBengtsson/future/issues/370 7 | 8 | 9 | # load packages, functions, and drake plan 10 | source("performance-experiment/packages.R") 11 | source("performance-experiment/functions/mlbench-data.R") 12 | source("performance-experiment/functions/create-ncv-objects.R") 13 | source("performance-experiment/functions/create-models.R") 14 | source("performance-experiment/functions/create-grids.R") 15 | source("performance-experiment/functions/inner-tune.R") 16 | source("performance-experiment/functions/outer-cv.R") 17 | source("performance-experiment/functions/ncv-compare.R") 18 | source("performance-experiment/functions/run-ncv.R") 19 | source("performance-experiment/Raschka/plan-raschka.R") 20 | 21 | 22 | 23 | 24 | 25 | set.seed(2019) 26 | 27 | # Using different compute sizes for each model 28 | ip1 <- Sys.getenv("GLMEC2IP") 29 | ip2 <- Sys.getenv("RFEC2IP") 30 | public_ips <- c(ip1, ip2) 31 | # ppk file converted by PuTTY from an AWS pem file 32 | ssh_private_key_file <- Sys.getenv("AWSKEYPATH") 33 | 34 | 35 | cl <- future::makeClusterPSOCK( 36 | 37 | ## Public IP numbers of EC2 instances 38 | public_ips, 39 | 40 | ## User name (always 'ubuntu') 41 | user = "ubuntu", 42 | 43 | ## Use private SSH key registered with AWS 44 | ## futureSettings is a saved PuTTY session with settings to keep ssh active 45 | rshcmd = c("plink", "-ssh", "-load", "futureSettings","-i", ssh_private_key_file), 46 | rshopts = c( 47 | "-sshrawlog", "ec2-ssh-raw.log" 48 | ), 49 | 50 | rscript_args = c("-e", shQuote(".libPaths('/home/rstudio/R/x86_64-pc-linux-gnu-library/3.6')") 51 | ), 52 | verbose = TRUE 53 | ) 54 | 55 | 56 | future::plan(list(tweak(cluster, workers = cl), multiprocess)) 57 | 58 | 59 | cache_raschka <- drake_cache(path = ".drake-raschka") 60 | 61 | # verbose = 0 prints nothing, verbose = 1 prints message as each target completes; verbose = 2 adds a progress bar that tracks target completion 62 | drake_config( 63 | plan, 64 | verbose = 1, 65 | lock_envir = FALSE, 66 | jobs_preprocess = 7, 67 | cache = cache_raschka 68 | ) 69 | 70 | 71 | -------------------------------------------------------------------------------- /data/fivek-simdat.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/data/fivek-simdat.pickle -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-h2o-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # H2O 5 | 6 | 7 | # Notes 8 | # 1. *** Make sure target variable is the last column in dataframe *** 9 | # 2. h2o_grid arg parallelism sets the number of models to be computed at a time in parallel. parallelism = 0 lets h2o decide and it will use all resources. 10 | # 3. *** For glm models, setting parallelism = 0 causes the grid search to hang. Manually set the number of models to compute in parallel *** 11 | # 4. H2O doesn't have a cv strategy that just fits the grid row by row. Your options are RandomDiscrete or Cartesian Grid. I set the parameter ranges so that using the Cartesian grid would create a 100 row grid to match the other scripts. 12 | # 5. 
Uses few cpu resources when tuning glmnet, but does maximize cpu usage for tuning the rf 13 | 14 | 15 | # Sections 16 | # 1. Set-up 17 | # 2. Error and Model Functions; 18 | # 3. Hyperparameter Grids 19 | # 4. Functions used in the loops 20 | # 5. Compare Algorithms 21 | 22 | 23 | 24 | 25 | #################################################### 26 | # Set-Up 27 | #################################################### 28 | 29 | 30 | # texts me when there's an error 31 | options(error = function() { 32 | library(RPushbullet) 33 | pbPost("note", "Error", geterrmessage()) 34 | if(!interactive()) stop(geterrmessage()) 35 | }) 36 | 37 | 38 | # start MLflow server 39 | sys::exec_background("mlflow server") 40 | Sys.sleep(10) 41 | 42 | 43 | library(tictoc) 44 | tic() 45 | 46 | pacman::p_load(glue, RPushbullet, dials, h2o, rsample, purrr, dplyr, mlflow) 47 | 48 | # make explicit the name of the exeriement to record to 49 | mlflow_set_experiment("ncv_duration") 50 | 51 | set.seed(2019) 52 | 53 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 54 | sim_data <- function(n) { 55 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 56 | tmp <- cbind(tmp$x, tmp$y) 57 | tmp <- as.data.frame(tmp) 58 | names(tmp)[ncol(tmp)] <- "y" 59 | tmp 60 | } 61 | 62 | # Use small data to tune and compare models 63 | small_dat <- sim_data(100) 64 | 65 | 66 | 67 | ncv_dat_10 <- rsample::nested_cv(small_dat, 68 | outside = vfold_cv(v = 10, repeats = 2), 69 | inside = bootstraps(times = 25)) 70 | 71 | 72 | # Start h2o cluster 73 | h2o.init() 74 | 75 | 76 | 77 | 78 | ################################################### 79 | # # Error and Model Functions 80 | ################################################### 81 | 82 | 83 | error_FUN <- function(model){ 84 | h2o.mae(model, valid = T) 85 | } 86 | 87 | 88 | # Distributed Random Forest 89 | 90 | rf_FUN <- function(x, y, anal_h2o, ass_h2o, params) { 91 | 92 | mtries <- params$mtries[[1]] 93 | ntrees <- params$ntrees[[1]] 94 | 95 | # h20 ususally needs unique ids or loops will return exact same values over and over 96 | modelId <- as.character(dqrng::dqrnorm(1)) 97 | 98 | h2o.show_progress() 99 | 100 | h2o.randomForest(x = x, 101 | y = y, 102 | training_frame = anal_h2o, 103 | model_id = modelId, 104 | validation_frame = ass_h2o, 105 | mtries = mtries, 106 | ntrees = ntrees) 107 | } 108 | 109 | 110 | # Elastic Net Regression 111 | 112 | glm_FUN <- function(x, y, anal_h2o, ass_h2o, params) { 113 | 114 | alpha <- params$alpha[[1]] 115 | lambda <- params$lambda[[1]] 116 | 117 | # h20 needs unique ids or loops will return exact same values over and over 118 | modelId <- as.character(dqrng::dqrnorm(1)) 119 | 120 | h2o.show_progress() 121 | 122 | h2o.glm(x = x, 123 | y = y, 124 | training_frame = anal_h2o, 125 | model_id = modelId, 126 | validation_frame = ass_h2o, 127 | alpha = alpha, 128 | lambda = lambda) 129 | } 130 | 131 | 132 | mod_FUN_list <- list(glm = glm_FUN, rf = rf_FUN) 133 | 134 | alg_list <- list(glm = "glm", rf = "drf") 135 | 136 | 137 | 138 | 139 | ##################################################### 140 | # Hyperparameter Grids 141 | ##################################################### 142 | 143 | 144 | # 5*20 = 100 rows for glm grid, 2*50 = 100 rows for the rf grid 145 | params_list <- list(glm = list(alpha = c(0, 0.25, 0.5, 0.75, 1), 146 | lambda = -1/log10(sample(seq(0, 1, by = 0.00001), 20))), 147 | rf = list(mtries = c(3,4), 148 | ntrees = seq(200, 300, by = 2)) 149 | ) 150 | 151 | 152 | 153 | 154 | 
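# Illustrative check of the grid sizes: h2o.grid() defaults to a Cartesian search
# strategy, so every combination of the values above is fit. That is 5 * 20 = 100
# combinations for the glm grid and, since seq(200, 300, by = 2) contains 51
# values, 2 * 51 = 102 combinations for the rf grid.
sapply(params_list, function(p) prod(lengths(p)))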
##################################################### 155 | # Functions used in the loops 156 | ##################################################### 157 | 158 | 159 | # inputs params, model, and resample, calls model and error functions, outputs error 160 | mod_error <- function(params, mod_FUN, dat) { 161 | anal_df <- rsample::analysis(dat) 162 | ass_df <- rsample::assessment(dat) 163 | 164 | h2o.no_progress() 165 | 166 | # send data to the h2o cluster 167 | anal_h2o <- as.h2o(anal_df) 168 | ass_h2o <- as.h2o(ass_df) 169 | 170 | y <- names(anal_h2o)[[ncol(anal_h2o)]] 171 | x <- setdiff(names(anal_h2o), y) 172 | 173 | mod <- mod_FUN(x, y, anal_h2o, ass_h2o, params) 174 | error <- error_FUN(mod) 175 | error 176 | } 177 | 178 | 179 | compare_algs <- function(alg, params, mod_FUN, dat){ 180 | 181 | # tune models by grid searching on resamples in the inner loop, returns df with fold id, bootstrap id, params, and errors 182 | tuning_results <- purrr::map(dat$inner_resamples, function(dat, alg, params){ 183 | 184 | param_names <- names(params) 185 | 186 | # loops each folds set of resamples, grid search, returen table of hyperparam combos and error values 187 | params_errors <- purrr::map_dfr(dat$splits, function(dat, alg, params){ 188 | 189 | # split into analysis and assessment sets 190 | anal_df <- rsample::analysis(dat) 191 | ass_df <- rsample::assessment(dat) 192 | 193 | # as.h2o and h2o.grid have progress bars. That's too many of progress bars. 194 | h2o.no_progress() 195 | 196 | # send data to the h2o cluster 197 | anal_h2o <- as.h2o(anal_df) 198 | ass_h2o <- as.h2o(ass_df) 199 | 200 | y <- names(anal_h2o)[[ncol(anal_h2o)]] 201 | x <- setdiff(names(anal_h2o), y) 202 | 203 | h2o.show_progress() 204 | 205 | # need a unique grid id or h2o just gives you the same predictions over aand over 206 | gridId <- as.character(dqrng::dqrnorm(1)) 207 | 208 | mod_grid <- h2o.grid(alg, x = x, y = y, 209 | grid_id = gridId, 210 | training_frame = anal_h2o, 211 | validation_frame = ass_h2o, 212 | hyper_params = params, 213 | parallelism = 8) 214 | # results 215 | mod_gridperf <- h2o.getGrid(grid_id = gridId, sort_by = "mae") 216 | 217 | # Grab hyperparams, model_id, errors 218 | grid_results <- mod_gridperf@summary_table 219 | colnames(grid_results)[4] <- "error" 220 | 221 | # clean the cluster 222 | h2o.removeAll() 223 | 224 | grid_results 225 | 226 | }, alg, params, .id = "bootstrap") %>% 227 | mutate(error = as.numeric(error)) %>% 228 | group_by_at(vars(param_names)) %>% 229 | summarize(mean_error = mean(error, na.rm = TRUE)) %>% 230 | ungroup() 231 | 232 | }, alg, params) 233 | 234 | 235 | best_hyper_vals <- tuning_results %>% 236 | map_df(function(dat) { 237 | dat[which.min(dat$mean_error),] 238 | }) %>% 239 | select(names(params)) %>% 240 | # H2O makes params into char vars and adds brackets to the values. Guessing they're json. 241 | mutate_all(~stringr::str_remove_all(., "^\\[")) %>% 242 | mutate_all(~stringr::str_remove_all(., "\\]")) %>% 243 | mutate_all(as.numeric) 244 | 245 | # fit models on the outer-loop folds using best hyperparams (e.g. 
5 repeats, 10 folds = 50 models), returns numeric with error values 246 | outer_fold_error <- map2_dbl(dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 247 | params <- best_hyper_vals[row,] 248 | mod_error(params, mod_FUN, dat) 249 | }) 250 | 251 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 252 | chosen_params <- best_hyper_vals %>% 253 | group_by_all() %>% 254 | tally() %>% 255 | ungroup() %>% 256 | filter(n == max(n)) 257 | 258 | # output error stats and chosen hyperparams 259 | tibble( 260 | chosen_params = list(chosen_params), 261 | mean_error = mean(outer_fold_error), 262 | median_error = median(outer_fold_error), 263 | sd_error = sd(outer_fold_error) 264 | ) 265 | } 266 | 267 | 268 | 269 | 270 | ################################## 271 | # Compare algorithms 272 | ################################## 273 | 274 | 275 | lol <- list(alg_list, params_list, mod_FUN_list) 276 | 277 | # start the nested-cv 278 | algorithm_comparison_ten <- pmap_dfr(lol, compare_algs, ncv_dat_10) %>% 279 | mutate(model = names(mod_FUN_list)) %>% 280 | select(model, everything()) 281 | toc(log = TRUE) 282 | 283 | 284 | # log duration metric to MLflow 285 | ncv_times <- tic.log(format = FALSE) 286 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 287 | mlflow_log_metric("duration", duration) 288 | mlflow_set_tag("implementation", "h2o") 289 | mlflow_set_tag("method", "kj") 290 | mlflow_end_run() 291 | 292 | 293 | # text me results 294 | log.txt <- tic.log(format = TRUE) 295 | msg <- glue("Using h2o-kj script: \n After running 10 fold, {log.txt[[1]]}.") 296 | pbPost("note", title="h2o-kj script finished", body=msg) 297 | tic.clearlog() 298 | 299 | 300 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 301 | installr::kill_process(process = c("waitress-serve.exe")) 302 | 303 | # shutdown cluster 304 | h2o.shutdown(prompt = FALSE) 305 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-parsnip-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # parsnip 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Make sure the target column is last in dataframe *** 10 | # 2. Using the Ranger Random Forest model function utilized through the parsnip package. 11 | 12 | 13 | # Sections 14 | # 1. Set-Up 15 | # 2. Error and 16 | # 3. Model Functions 17 | # 4. Hyperparameter Grids 18 | # 5. Functions used in the loops 19 | # 6. 
Compare algorithms 20 | 21 | 22 | 23 | 24 | #################################################### 25 | # Set-Up 26 | #################################################### 27 | 28 | 29 | # text me if an error occurs 30 | options(error = function() { 31 | library(RPushbullet) 32 | pbPost("note", "Error", geterrmessage()) 33 | if(!interactive()) stop(geterrmessage()) 34 | }) 35 | 36 | # start MLflow server 37 | sys::exec_background("mlflow server") 38 | Sys.sleep(10) 39 | 40 | library(tictoc) 41 | tic() 42 | 43 | set.seed(2019) 44 | 45 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 46 | sim_data <- function(n) { 47 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 48 | tmp <- cbind(tmp$x, tmp$y) 49 | tmp <- as.data.frame(tmp) 50 | names(tmp)[ncol(tmp)] <- "y" 51 | tmp 52 | } 53 | 54 | # Use small data to tune and compare models 55 | small_dat <- sim_data(100) 56 | 57 | 58 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, mlflow) 59 | 60 | # make explicit the name of the exeriement to record to 61 | mlflow_set_experiment("ncv_duration") 62 | 63 | plan(multiprocess) 64 | 65 | 66 | ncv_dat_10 <- nested_cv(small_dat, 67 | outside = vfold_cv(v = 10, repeats = 2), 68 | inside = bootstraps(times = 25)) 69 | 70 | 71 | 72 | 73 | ################################## 74 | # Error Functions 75 | ################################## 76 | 77 | 78 | error_FUN <- function(y_obs, y_hat){ 79 | y_obs <- unlist(y_obs) 80 | y_hat <- unlist(y_hat) 81 | Metrics::mae(y_obs, y_hat) 82 | } 83 | 84 | 85 | 86 | ##################################### 87 | # Model Functions 88 | ##################################### 89 | 90 | 91 | # Random Forest 92 | 93 | pars_ranger_FUN <- function(params, analysis_set) { 94 | mtry <- params$mtry[[1]] 95 | trees <- params$trees[[1]] 96 | model <- parsnip::rand_forest(mode = "regression", mtry = mtry, trees = trees) %>% 97 | parsnip::set_engine("ranger", importance = 'impurity') %>% 98 | generics::fit(y ~ ., data = analysis_set) 99 | model 100 | } 101 | 102 | 103 | 104 | # Elastic Net Regression 105 | 106 | glm_FUN <- function(params, analysis_set) { 107 | alpha <- params$mixture[[1]] 108 | lambda <- params$penalty[[1]] 109 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 110 | parsnip::set_engine("glmnet") %>% 111 | generics::fit(y ~ ., data = analysis_set) 112 | model 113 | } 114 | 115 | 116 | mod_FUN_list_pars <- list(glmnet = glm_FUN, ranger = pars_ranger_FUN) 117 | 118 | 119 | 120 | 121 | ################################### 122 | # Hyperparameter Grids 123 | ################################### 124 | 125 | 126 | # size = number of rows 127 | # Default ranges look good for mixture and penalty 128 | glm_params <- grid_latin_hypercube( 129 | mixture(), 130 | penalty(), 131 | size = 100 132 | ) 133 | 134 | rf_params <- grid_latin_hypercube( 135 | mtry(range = c(3, 4)), 136 | trees(range = c(200, 300)), 137 | size = 100 138 | ) 139 | 140 | params_list <- list(glmnet = glm_params, ranger = rf_params) 141 | 142 | 143 | 144 | 145 | ##################################################### 146 | # Functions used in the loops 147 | ##################################################### 148 | 149 | 150 | # inputs params, model, and resample, calls model and error functions, outputs error 151 | mod_error <- function(params, mod_FUN, dat) { 152 | y_col <- ncol(dat$data) 153 | y_obs <- assessment(dat)[y_col] 154 | mod <- mod_FUN(params, analysis(dat)) 155 | pred <- predict(mod, assessment(dat)) 156 | if 
(!is.data.frame(pred)) { 157 | pred <- pred$predictions 158 | } 159 | error <- error_FUN(y_obs, pred) 160 | error 161 | } 162 | 163 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 164 | tune_over_params <- function(dat, mod_FUN, params) { 165 | params$error <- map_dbl(1:nrow(params), function(row) { 166 | params <- params[row,] 167 | mod_error(params, mod_FUN, dat) 168 | }) 169 | params 170 | } 171 | 172 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 173 | summarize_tune_results <- function(dat, mod_FUN, params) { 174 | # Return row-bound tibble that has the 25 bootstrap results 175 | param_names <- names(params) 176 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 177 | lazy_dt(., key_by = param_names) %>% 178 | # For each value of the tuning parameter, compute the 179 | # average which is the inner bootstrap estimate. 180 | group_by_at(vars(param_names)) %>% 181 | summarize(mean_error = mean(error, na.rm = TRUE), 182 | n = length(error)) %>% 183 | as_tibble() 184 | } 185 | 186 | 187 | 188 | ###################################################### 189 | # Compare algorithms 190 | ###################################################### 191 | 192 | 193 | compare_algs <- function(mod_FUN, params, ncv_dat){ 194 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 195 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 196 | 197 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 198 | best_hyper_vals <- tuning_results %>% 199 | map_df(function(dat) { 200 | dat[which.min(dat$mean_error),] 201 | }) %>% 202 | select(names(params)) 203 | 204 | # fit models on the outer-loop folds using best hyperparams (e.g. 
5 repeats, 10 folds = 50 models) 205 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 206 | params <- best_hyper_vals[row,] 207 | mod_error(params, mod_FUN, dat) 208 | }, .progress = TRUE) 209 | 210 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 211 | chosen_params <- best_hyper_vals %>% 212 | group_by_all() %>% 213 | tally() %>% 214 | ungroup() %>% 215 | filter(n == max(n)) 216 | 217 | # output error stats and chosen hyperparams 218 | tibble( 219 | chosen_params = list(chosen_params), 220 | mean_error = mean(outer_fold_error), 221 | median_error = median(outer_fold_error), 222 | sd_error = sd(outer_fold_error) 223 | ) 224 | } 225 | 226 | 227 | # start nested-cv 228 | algorithm_comparison_ten_pars <- map2_dfr(mod_FUN_list_pars, params_list, compare_algs, ncv_dat_10) %>% 229 | mutate(model = names(mod_FUN_list_pars)) %>% 230 | select(model, everything()) 231 | toc(log = TRUE) 232 | 233 | 234 | # log duration metric to MLflow 235 | ncv_times <- tic.log(format = FALSE) 236 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 237 | mlflow_log_metric("duration", duration) 238 | mlflow_set_tag("implementation", "parsnip") 239 | mlflow_set_tag("method", "kj") 240 | mlflow_end_run() 241 | 242 | 243 | # text me results 244 | log.txt <- tic.log(format = TRUE) 245 | msg <- glue("Using parsnip-kj script: \n After running 10 fold pars, {log.txt[[1]]}") 246 | pbPost("note", title="parsnip-kj script finished", body=msg) 247 | tic.clearlog() 248 | 249 | 250 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 251 | installr::kill_process(process = c("waitress-serve.exe")) 252 | 253 | 254 | 255 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # ranger 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Make sure the target column is last in dataframe *** 10 | # 2. Using the Ranger Random Forest model function from the ranger package 11 | 12 | 13 | # Sections 14 | # 1. Set-Up 15 | # 2. Error 16 | # 3. Model Functions 17 | # 4. Hyperparameter Grids 18 | # 5. Functions used in the loops 19 | # 6. 
Compare algorithms 20 | 21 | 22 | 23 | 24 | 25 | #################################################### 26 | # Set-Up 27 | #################################################### 28 | 29 | 30 | # texts me if an error occurs 31 | options(error = function() { 32 | library(RPushbullet) 33 | pbPost("note", "Error", geterrmessage()) 34 | if(!interactive()) stop(geterrmessage()) 35 | }) 36 | 37 | # start MLflow server 38 | sys::exec_background("mlflow server") 39 | Sys.sleep(10) 40 | 41 | 42 | library(tictoc) 43 | tic() 44 | 45 | set.seed(2019) 46 | 47 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 48 | sim_data <- function(n) { 49 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 50 | tmp <- cbind(tmp$x, tmp$y) 51 | tmp <- as.data.frame(tmp) 52 | names(tmp)[ncol(tmp)] <- "y" 53 | tmp 54 | } 55 | 56 | # Use small data to tune and compare models 57 | small_dat <- sim_data(100) 58 | 59 | 60 | 61 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, mlflow) 62 | 63 | # make explicit the name of the exeriement to record to 64 | mlflow_set_experiment("ncv_duration") 65 | 66 | plan(multiprocess) 67 | 68 | 69 | ncv_dat_10 <- nested_cv(small_dat, 70 | outside = vfold_cv(v = 10, repeats = 2), 71 | inside = bootstraps(times = 25)) 72 | 73 | 74 | 75 | 76 | ################################## 77 | # Error Functions 78 | ################################## 79 | 80 | 81 | error_FUN <- function(y_obs, y_hat){ 82 | y_obs <- unlist(y_obs) 83 | y_hat <- unlist(y_hat) 84 | Metrics::mae(y_obs, y_hat) 85 | } 86 | 87 | 88 | 89 | 90 | ##################################### 91 | # Model Functions 92 | ##################################### 93 | 94 | 95 | # Random Forest 96 | 97 | ranger_FUN <- function(params, analysis_set) { 98 | mtry <- params$mtry[[1]] 99 | trees <- params$trees[[1]] 100 | model <- ranger::ranger(y ~ ., data = analysis_set, mtry = mtry, num.trees = trees) 101 | model 102 | } 103 | 104 | 105 | # Elastic Net Regression 106 | 107 | glm_FUN <- function(params, analysis_set) { 108 | alpha <- params$mixture[[1]] 109 | lambda <- params$penalty[[1]] 110 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 111 | parsnip::set_engine("glmnet") %>% 112 | generics::fit(y ~ ., data = analysis_set) 113 | model 114 | } 115 | 116 | 117 | mod_FUN_list_ranger <- list(glmnet = glm_FUN, ranger = ranger_FUN) 118 | 119 | 120 | 121 | 122 | ################################### 123 | # Hyperparameter Grids 124 | ################################### 125 | 126 | 127 | # size = number of rows 128 | # Default ranges look good for mixture and penalty 129 | glm_params <- grid_latin_hypercube( 130 | mixture(), 131 | penalty(), 132 | size = 100 133 | ) 134 | 135 | rf_params <- grid_latin_hypercube( 136 | mtry(range = c(3, 4)), 137 | trees(range = c(200, 300)), 138 | size = 100 139 | ) 140 | 141 | params_list <- list(glmnet = glm_params, ranger = rf_params) 142 | 143 | 144 | 145 | 146 | ##################################################### 147 | # Functions used in the loops 148 | ##################################################### 149 | 150 | 151 | # inputs params, model, and resample, calls model and error functions, outputs error 152 | mod_error <- function(params, mod_FUN, dat) { 153 | y_col <- ncol(dat$data) 154 | y_obs <- assessment(dat)[y_col] 155 | mod <- mod_FUN(params, analysis(dat)) 156 | pred <- predict(mod, assessment(dat)) 157 | if (!is.data.frame(pred)) { 158 | pred <- pred$predictions 159 | } 160 | error <- error_FUN(y_obs, pred) 161 | 
error 162 | } 163 | 164 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 165 | tune_over_params <- function(dat, mod_FUN, params) { 166 | params$error <- map_dbl(1:nrow(params), function(row) { 167 | params <- params[row,] 168 | mod_error(params, mod_FUN, dat) 169 | }) 170 | params 171 | } 172 | 173 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 174 | summarize_tune_results <- function(dat, mod_FUN, params) { 175 | # Return row-bound tibble that has the 25 bootstrap results 176 | param_names <- names(params) 177 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 178 | lazy_dt(., key_by = param_names) %>% 179 | # For each value of the tuning parameter, compute the 180 | # average which is the inner bootstrap estimate. 181 | group_by_at(vars(param_names)) %>% 182 | summarize(mean_error = mean(error, na.rm = TRUE), 183 | n = length(error)) %>% 184 | as_tibble() 185 | } 186 | 187 | 188 | 189 | ###################################################### 190 | # Compare algorithms 191 | ###################################################### 192 | 193 | 194 | compare_algs <- function(mod_FUN, params, ncv_dat){ 195 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 196 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 197 | 198 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 199 | best_hyper_vals <- tuning_results %>% 200 | map_df(function(dat) { 201 | dat[which.min(dat$mean_error),] 202 | }) %>% 203 | select(names(params)) 204 | 205 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 206 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 207 | params <- best_hyper_vals[row,] 208 | mod_error(params, mod_FUN, dat) 209 | }, .progress = TRUE) 210 | 211 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 212 | chosen_params <- best_hyper_vals %>% 213 | group_by_all() %>% 214 | tally() %>% 215 | ungroup() %>% 216 | filter(n == max(n)) 217 | 218 | # output error stats and chosen hyperparams 219 | tibble( 220 | chosen_params = list(chosen_params), 221 | mean_error = mean(outer_fold_error), 222 | median_error = median(outer_fold_error), 223 | sd_error = sd(outer_fold_error) 224 | ) 225 | } 226 | 227 | 228 | 229 | # Start the nested-cv 230 | algorithm_comparison_ten_rang <- map2_dfr(mod_FUN_list_ranger, params_list, compare_algs, ncv_dat_10) %>% 231 | mutate(model = names(mod_FUN_list_ranger)) %>% 232 | select(model, everything()) 233 | toc(log = TRUE) 234 | 235 | 236 | # log duration metric to MLflow 237 | ncv_times <- tic.log(format = FALSE) 238 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 239 | mlflow_log_metric("duration", duration) 240 | mlflow_set_tag("implementation", "ranger") 241 | mlflow_set_tag("method", "kj") 242 | mlflow_end_run() 243 | 244 | 245 | # text me results 246 | log.txt <- tic.log(format = TRUE) 247 | msg <- glue("Using ranger-kj script: \n After running 10 fold rang, {log.txt[[1]]}") 248 | pbPost("note", title="ranger-kj script finished", body=msg) 249 | tic.clearlog() 250 | 251 | 252 | # MLflow uses waitress for Windows. 
Killing it also kills mlflow.exe, python.exe, console window host processes 253 | installr::kill_process(process = c("waitress-serve.exe")) 254 | 255 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-sklearn-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # sklearn 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Target column needs to be last in dataframe *** 10 | # 2. *** All sklearn import aliases should have a "sklearn_" prefix *** 11 | # 3. Uses both R model functions along with those from sklearn modules 12 | # 4. Starts the MLflow server in the background with a OS command and kills the server process at the end of the script 13 | 14 | 15 | # Sections 16 | # 1. Set-Up 17 | # 2. Error Functions 18 | # 3. Model Functions 19 | # 4. Hyperparameter Grids 20 | # 3. Functions used in the loops 21 | # 4. Compare algorithms 22 | 23 | 24 | 25 | 26 | #################################################### 27 | # Set-Up 28 | #################################################### 29 | 30 | 31 | # text me if an error occurs 32 | options(error = function() { 33 | library(RPushbullet) 34 | pbPost("note", "Error", geterrmessage()) 35 | if(!interactive()) stop(geterrmessage()) 36 | }) 37 | 38 | # start MLflow server 39 | sys::exec_background("mlflow server") 40 | Sys.sleep(10) 41 | 42 | 43 | set.seed(2019) 44 | 45 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 46 | sim_data <- function(n) { 47 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 48 | tmp <- cbind(tmp$x, tmp$y) 49 | tmp <- as.data.frame(tmp) 50 | names(tmp)[ncol(tmp)] <- "y" 51 | tmp 52 | } 53 | 54 | # Use small data to tune and compare models 55 | small_dat <- sim_data(100) 56 | 57 | 58 | library(tictoc) 59 | tic() 60 | 61 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, reticulate, mlflow) 62 | 63 | # make explicit the name of the exeriement to record to 64 | mlflow_set_experiment("ncv_duration") 65 | 66 | 67 | plan(multiprocess) 68 | 69 | 70 | ncv_dat_10 <- nested_cv(small_dat, 71 | outside = vfold_cv(v = 10, repeats = 2), 72 | inside = bootstraps(times = 25)) 73 | 74 | 75 | 76 | 77 | ################################## 78 | # Error Functions 79 | ################################## 80 | 81 | 82 | error_FUN <- function(y_obs, y_hat){ 83 | y_obs <- unlist(y_obs) 84 | y_hat <- unlist(y_hat) 85 | Metrics::mae(y_obs, y_hat) 86 | } 87 | 88 | 89 | 90 | ##################################### 91 | # Model Functions 92 | ##################################### 93 | 94 | 95 | # Random Forest 96 | 97 | sklearn_rf_FUN <- function(params, analysis_set) { 98 | sklearn_e <- import("sklearn.ensemble") 99 | max_features <- r_to_py(params$mtry[[1]]) 100 | n_estimators <- r_to_py(params$trees[[1]]) 101 | 102 | # get data into sklearn's preferred format 103 | y_idx <- ncol(analysis_set) - 1 104 | X_idx <- y_idx - 1 105 | pAnal <- r_to_py(analysis_set) 106 | y_train <- pAnal$iloc(axis = 1L)[y_idx]$values 107 | X_train <- pAnal$iloc(axis = 1L)[0:X_idx] 108 | 109 | model <- sklearn_e$RandomForestRegressor(criterion = "mae", 110 | max_features = max_features, 111 | n_estimators = n_estimators, 112 | random_state = 1L) 113 | mod_fit <- model$fit(X_train, y_train) 114 | } 115 | 116 | 117 | # Elastic Net Regression 118 | 119 | glm_FUN <- function(params, analysis_set) { 120 | alpha <- 
params$mixture[[1]] 121 | lambda <- params$penalty[[1]] 122 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 123 | parsnip::set_engine("glmnet") %>% 124 | generics::fit(y ~ ., data = analysis_set) 125 | model 126 | } 127 | 128 | 129 | 130 | mod_FUN_list_skrf <- list(glmnet = glm_FUN, sklearn_rf = sklearn_rf_FUN) 131 | 132 | 133 | 134 | 135 | ################################### 136 | # Hyperparameter Grids 137 | ################################### 138 | 139 | 140 | # size = number of rows 141 | # Default ranges look good for mixture and penalty 142 | glm_params <- grid_latin_hypercube( 143 | mixture(), 144 | penalty(), 145 | size = 100 146 | ) 147 | 148 | rf_params <- grid_latin_hypercube( 149 | mtry(range = c(3, 4)), 150 | trees(range = c(200, 300)), 151 | size = 100 152 | ) 153 | 154 | 155 | # params_list <- list(glmnet = glm_params, ranger = rf_params) 156 | params_list <- list(glmnet = glm_params, sklearn_rf = rf_params) 157 | 158 | 159 | 160 | 161 | ##################################################### 162 | # Functions used in the loops 163 | ##################################################### 164 | 165 | 166 | # detect if the model function is from sklearn 167 | is_sklearn <- function(modfun) { 168 | string <- toString(body(modfun)) 169 | stringr::str_detect(string, pattern = "sklearn") 170 | } 171 | 172 | 173 | # inputs params, model, and resample, calls model and error functions, outputs error 174 | mod_error <- function(params, mod_FUN, dat) { 175 | y_col <- ncol(dat$data) 176 | y_obs <- assessment(dat)[y_col] 177 | mod <- mod_FUN(params, analysis(dat)) 178 | 179 | if(is_sklearn(mod_FUN)) { 180 | X_dat <- r_to_py(assessment(dat)[-y_col]) 181 | pred <- mod$predict(X_dat) 182 | } else { 183 | pred <- predict(mod, assessment(dat)) 184 | if (!is.data.frame(pred)) { 185 | pred <- pred$predictions 186 | } 187 | } 188 | 189 | error <- error_FUN(y_obs, pred) 190 | error 191 | } 192 | 193 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 194 | tune_over_params <- function(dat, mod_FUN, params) { 195 | params$error <- map_dbl(1:nrow(params), function(row) { 196 | params <- params[row,] 197 | mod_error(params, mod_FUN, dat) 198 | }) 199 | params 200 | } 201 | 202 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 203 | summarize_tune_results <- function(dat, mod_FUN, params) { 204 | # Return row-bound tibble that has the 25 bootstrap results 205 | param_names <- names(params) 206 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 207 | lazy_dt(., key_by = param_names) %>% 208 | # For each value of the tuning parameter, compute the 209 | # average which is the inner bootstrap estimate. 210 | group_by_at(vars(param_names)) %>% 211 | summarize(mean_error = mean(error, na.rm = TRUE), 212 | n = length(error)) %>% 213 | as_tibble() 214 | } 215 | 216 | 217 | 218 | ###################################################### 219 | # Compare algorithms 220 | ###################################################### 221 | 222 | 223 | compare_algs <- function(mod_FUN, params, ncv_dat){ 224 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 225 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 226 | 227 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 
5 repeats 10 folds = 50 best hyperparam combos) 228 | best_hyper_vals <- tuning_results %>% 229 | map_df(function(dat) { 230 | dat[which.min(dat$mean_error),] 231 | }) %>% 232 | select(names(params)) 233 | 234 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 235 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 236 | params <- best_hyper_vals[row,] 237 | mod_error(params, mod_FUN, dat) 238 | }, .progress = TRUE) 239 | 240 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 241 | chosen_params <- best_hyper_vals %>% 242 | group_by_all() %>% 243 | tally() %>% 244 | ungroup() %>% 245 | filter(n == max(n)) 246 | 247 | # output error stats and chosen hyperparams 248 | tibble( 249 | chosen_params = list(chosen_params), 250 | mean_error = mean(outer_fold_error), 251 | median_error = median(outer_fold_error), 252 | sd_error = sd(outer_fold_error) 253 | ) 254 | } 255 | 256 | 257 | # start the nested-cv 258 | algorithm_comparison_ten_skrf <- map2_dfr(mod_FUN_list_skrf, params_list, compare_algs, ncv_dat_10) %>% 259 | mutate(model = names(mod_FUN_list_skrf)) %>% 260 | select(model, everything()) 261 | toc(log = TRUE) 262 | 263 | 264 | # log duration metric to MLflow 265 | ncv_times <- tic.log(format = FALSE) 266 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 267 | mlflow_log_metric("duration", duration) 268 | mlflow_set_tag("implementation", "sklearn") 269 | mlflow_set_tag("method", "kj") 270 | mlflow_end_run() 271 | 272 | 273 | # text me the results 274 | log.txt <- tic.log(format = TRUE) 275 | msg <- glue("Using sklearn-kj script: \n After running 10 fold skrf, {log.txt[[1]]}") 276 | pbPost("note", title="sklearn-kj script finished", body=msg) 277 | tic.clearlog() 278 | 279 | 280 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 281 | installr::kill_process(process = c("waitress-serve.exe")) 282 | 283 | 284 | 285 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-tune-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation using tune package 2 | 3 | # Kuhn-Johnson method 4 | # tune 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Make sure the target column is last in dataframe *** 10 | 11 | 12 | # Sections 13 | # 1. Set-up 14 | # 2. Error functions 15 | # 3. Model functions 16 | # 4, Hyperparameter Grids 17 | # 5. Functions used in the loops 18 | # 6. 
Compare Algorithms 19 | 20 | 21 | 22 | 23 | ################################ 24 | # Set-up 25 | ################################ 26 | 27 | # text me if an error occurs 28 | options(error = function() { 29 | library(RPushbullet) 30 | pbPost("note", "Error", geterrmessage()) 31 | if(!interactive()) stop(geterrmessage()) 32 | }) 33 | 34 | # start MLflow server 35 | sys::exec_background("mlflow server") 36 | Sys.sleep(10) 37 | 38 | 39 | library(tictoc) 40 | tic() 41 | 42 | pacman::p_load(RPushbullet, glue, ranger, doFuture, dplyr, purrr, tidymodels, tune, dials, mlflow) 43 | 44 | 45 | set.seed(2019) 46 | 47 | # make explicit the name of the exeriement to record to 48 | mlflow_set_experiment("ncv_duration") 49 | 50 | registerDoFuture() 51 | plan(multiprocess) 52 | 53 | 54 | 55 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 56 | sim_data <- function(n) { 57 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 58 | tmp <- cbind(tmp$x, tmp$y) 59 | tmp <- as.data.frame(tmp) 60 | names(tmp)[ncol(tmp)] <- "y" 61 | tmp 62 | } 63 | 64 | # Use small data to tune and compare models 65 | small_dat <- sim_data(100) 66 | 67 | 68 | 69 | ncv_dat_10 <- nested_cv(small_dat, 70 | outside = vfold_cv(v = 10, repeats = 2), 71 | inside = bootstraps(times = 25)) 72 | 73 | 74 | 75 | 76 | ################################ 77 | # Error functions 78 | ################################ 79 | 80 | 81 | error_funs <- metric_set(mae) 82 | error_FUN <- function(y_obs, y_hat){ 83 | y_obs <- unlist(y_obs) 84 | y_hat <- unlist(y_hat) 85 | Metrics::mae(y_obs, y_hat) 86 | } 87 | 88 | 89 | 90 | ################################ 91 | # Mode functions 92 | ################################ 93 | 94 | 95 | # Random Forest 96 | 97 | # inner-loop tuning 98 | rf_inner <- rand_forest(mtry = tune(), trees = tune()) %>% 99 | set_engine("ranger", importance = 'impurity') %>% 100 | set_mode("regression") 101 | 102 | # outer loop scoring and model selection 103 | rf_FUN <- function(params, analysis_set) { 104 | mtry <- params$mtry[[1]] 105 | trees <- params$trees[[1]] 106 | rand_forest(mode = "regression", mtry = mtry, trees = trees) %>% 107 | set_engine("ranger", importance = 'impurity') %>% 108 | fit(y ~ ., data = analysis_set) 109 | } 110 | 111 | 112 | # Regularized Regression 113 | 114 | glm_inner <- linear_reg(mixture = tune(), penalty = tune()) %>% 115 | set_engine("glmnet") 116 | 117 | glm_FUN <- function(params, analysis_set) { 118 | alpha <- params$mixture[[1]] 119 | lambda <- params$penalty[[1]] 120 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 121 | parsnip::set_engine("glmnet") %>% 122 | generics::fit(y ~ ., data = analysis_set) 123 | model 124 | } 125 | 126 | 127 | mod_inner_list <- list(glm = glm_inner, rf = rf_inner) 128 | mod_FUN_list <- list(glm = glm_FUN, rf = rf_FUN) 129 | 130 | 131 | 132 | ################################ 133 | # Hyperparameter Grids 134 | ################################ 135 | 136 | 137 | glm_params <- grid_latin_hypercube( 138 | mixture(), 139 | penalty(), 140 | size = 100 141 | ) 142 | 143 | rf_params <- grid_latin_hypercube( 144 | mtry(range = c(3, 4)), 145 | trees(range = c(200, 300)), 146 | size = 100 147 | ) 148 | 149 | 150 | params_list <- list(glm = glm_params, rf = rf_params) 151 | 152 | 153 | 154 | ################################ 155 | # Functions used in the loops 156 | ################################ 157 | 158 | 159 | # inputs params, model, and resample, calls model and error functions, outputs error 160 | mod_error <- function(params, 
mod_FUN, dat) { 161 | y_col <- ncol(dat$data) 162 | y_obs <- assessment(dat)[y_col] 163 | mod <- mod_FUN(params, analysis(dat)) 164 | pred <- predict(mod, assessment(dat)) 165 | error <- error_FUN(y_obs, pred) 166 | error 167 | } 168 | 169 | 170 | compare_algs <- function(mod_inner, params, mod_FUN, ncv_dat){ 171 | 172 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 173 | tuning_results <- map(ncv_dat$inner_resamples, function(dat, mod_inner, params) { 174 | tune_grid(y ~ ., 175 | model = mod_inner, 176 | resamples = dat, 177 | grid = params, 178 | metrics = error_funs) 179 | }, 180 | mod_inner, params) 181 | 182 | num_params <- ncol(params) 183 | 184 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 185 | best_hyper_vals <- tuning_results %>% 186 | map_dfr(function(dat){ 187 | dat %>% 188 | collect_metrics() %>% 189 | filter(mean == min(mean)) %>% 190 | slice(1) 191 | }) %>% 192 | select(1:num_params) 193 | 194 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 195 | outer_fold_error <- furrr::future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 196 | params <- best_hyper_vals[row,] 197 | mod_error(params, mod_FUN, dat) 198 | }, .progress = TRUE) 199 | 200 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 201 | chosen_params <- best_hyper_vals %>% 202 | group_by_all() %>% 203 | tally() %>% 204 | ungroup() %>% 205 | filter(n == max(n)) 206 | 207 | # output error stats and chosen hyperparams 208 | tibble( 209 | chosen_params = list(chosen_params), 210 | mean_error = mean(outer_fold_error), 211 | median_error = median(outer_fold_error), 212 | sd_error = sd(outer_fold_error) 213 | ) 214 | } 215 | 216 | 217 | 218 | ################################ 219 | # Compare Algorithms 220 | ################################ 221 | 222 | 223 | lol <- list(mod_inner_list, params_list, mod_FUN_list) 224 | 225 | 226 | # start the nested-cv 227 | algorithm_comparison_ten <- pmap_dfr(lol, compare_algs, ncv_dat_10) %>% 228 | mutate(model = names(mod_FUN_list)) %>% 229 | select(model, everything()) 230 | toc(log = TRUE) 231 | 232 | 233 | # log duration metric to MLflow 234 | ncv_times <- tic.log(format = FALSE) 235 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 236 | mlflow_log_metric("duration", duration) 237 | mlflow_set_tag("implementation", "tune") 238 | mlflow_set_tag("method", "kj") 239 | mlflow_end_run() 240 | 241 | 242 | # text me results 243 | log.txt <- tic.log(format = TRUE) 244 | msg <- glue("Using tune-kj script: \n After running 10 fold, {log.txt[[1]]}") 245 | pbPost("note", title="tune-kj script finished", body=msg) 246 | tic.clearlog() 247 | 248 | 249 | # MLflow uses waitress for Windows. 
Killing it also kills mlflow.exe, python.exe, console window host processes 250 | installr::kill_process(process = c("waitress-serve.exe")) 251 | 252 | -------------------------------------------------------------------------------- /duration-experiment/outputs/0225-results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/duration-experiment/outputs/0225-results.png -------------------------------------------------------------------------------- /duration-experiment/outputs/duration-pkg-tbl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/duration-experiment/outputs/duration-pkg-tbl.png -------------------------------------------------------------------------------- /duration-experiment/outputs/duration-runs.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/duration-experiment/outputs/duration-runs.rds -------------------------------------------------------------------------------- /duration-experiment/package-sources-gt-tbl.R: -------------------------------------------------------------------------------- 1 | # GT table: Package Sources for the Model Functions 2 | 3 | 4 | pacman::p_load(tibble, dplyr, tidyr, gt) 5 | 6 | runs_raw <- readr::read_rds("data/duration-runs.rds") %>% 7 | mutate(implementation = stringr::str_to_title(implementation)) 8 | 9 | 10 | # hardcoded 11 | # packages used for the model functions ordered by implementation in the runs_raw file 12 | elast_net <- c("sklearn", "sklearn", "parsnip-glmnet", "mlr-glmnet", "parsnip-glmnet", "h2o", "parsnip-glmnet", "parsnip-glmnet", "parsnip-glmnet") 13 | rand_forest <- c("sklearn", "sklearn", "ranger", "mlr-ranger", "parsnip-ranger", "h2o", "sklearn", "parsnip-ranger", "ranger") 14 | 15 | 16 | elast_dat <- runs_raw %>% 17 | select(implementation) %>% 18 | bind_cols(`Elastic Net` = elast_net) %>% 19 | pivot_wider(names_from = implementation, values_from = `Elastic Net`) 20 | 21 | # implementations as cols, algorithm as rows, values = package used 22 | model_dat <- runs_raw %>% 23 | select(implementation) %>% 24 | bind_cols(`Random Forest` = rand_forest) %>% 25 | pivot_wider(names_from = implementation, values_from = `Random Forest`) %>% 26 | bind_rows(elast_dat) %>% 27 | mutate(rowname = c("Random Forest", "Elastic Net")) 28 | 29 | model_dat %>% 30 | gt() %>% 31 | tab_spanner( 32 | label = "Implementation", 33 | columns = everything() 34 | ) %>% 35 | data_color(columns = vars(Reticulate, Python, Mlr3, `Ranger-Kj`), 36 | colors = scales::col_factor( 37 | palette = "#195198", 38 | domain = c("sklearn", "ranger", "parsnip-glmnet", "mlr-ranger", "mlr-glmnet" ))) %>% 39 | data_color(columns = vars(Tune, H2o, Sklearn, Parsnip, Ranger), 40 | colors = scales::col_factor( 41 | palette = "#BD9865", 42 | domain = c("sklearn", "ranger", "parsnip-glmnet", "parsnip-ranger", "h2o" ))) %>% 43 | tab_style( 44 | style = cell_text(align = "center"), 45 | locations = cells_body() 46 | ) %>% 47 | tab_options( 48 | table.background.color = "ivory", 49 | table.border.top.style = "None" 50 | ) %>% 51 | tab_header( 52 | title = "Package Sources for the Model Functions" 53 | ) %>% 54 | gtsave(filename = 
"duration-pkg-tbl.png", 55 | path = "duration-experiment/outputs") 56 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-kj-raschka.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | 4 | # Raschka method 5 | # ranger-kj 6 | 7 | 8 | 9 | # Notes 10 | # 1. *** Make sure the target column is last in dataframe *** 11 | 12 | 13 | # Sections 14 | # 1. Set-Up 15 | # 2. Error function 16 | # 3. Model Functions 17 | # 4. Hyperparameter Grids 18 | # 5. Functions used in the loops 19 | # 6. Compare algorithms; tune chosen algorithm 20 | # 7. Score chosen algorithm 21 | 22 | 23 | 24 | 25 | 26 | ###################################################### 27 | # Set-Up 28 | ###################################################### 29 | 30 | 31 | # texts me if an error occurs 32 | options(error = function() { 33 | library(RPushbullet) 34 | pbPost("note", "Error", geterrmessage()) 35 | if(!interactive()) stop(geterrmessage()) 36 | }) 37 | 38 | # start MLflow server 39 | sys::exec_background("mlflow server") 40 | Sys.sleep(10) 41 | 42 | 43 | library(tictoc) 44 | tic() 45 | 46 | set.seed(2019) 47 | 48 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 49 | sim_data <- function(n) { 50 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 51 | tmp <- cbind(tmp$x, tmp$y) 52 | tmp <- as.data.frame(tmp) 53 | names(tmp)[ncol(tmp)] <- "y" 54 | tmp 55 | } 56 | 57 | dat <- sim_data(5000) 58 | 59 | 60 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, mlflow) 61 | 62 | # make explicit the name of the exeriement to record to 63 | mlflow_set_experiment("ncv_duration") 64 | 65 | plan(multiprocess) 66 | 67 | 68 | train_idx <- caret::createDataPartition(y = dat$y, p = 0.80, list = FALSE) 69 | train_dat <- dat[train_idx, ] 70 | test_dat <- dat[-train_idx, ] 71 | 72 | ncv_dat <- nested_cv(train_dat, 73 | outside = vfold_cv(v = 5), 74 | inside = vfold_cv(v = 2)) 75 | 76 | 77 | 78 | 79 | ###################################################### 80 | # Error Functions 81 | ###################################################### 82 | 83 | 84 | error_FUN <- function(y_obs, y_hat){ 85 | y_obs <- unlist(y_obs) 86 | y_hat <- unlist(y_hat) 87 | Metrics::mae(y_obs, y_hat) 88 | } 89 | 90 | 91 | 92 | ###################################################### 93 | # Model Functions 94 | ###################################################### 95 | 96 | 97 | # Random Forest 98 | 99 | ranger_FUN <- function(params, analysis_set) { 100 | mtry <- params$mtry[[1]] 101 | trees <- params$trees[[1]] 102 | model <- ranger::ranger(y ~ ., data = analysis_set, mtry = mtry, num.trees = trees) 103 | model 104 | } 105 | 106 | 107 | # Regularized Regression 108 | 109 | glm_FUN <- function(params, analysis_set) { 110 | alpha <- params$mixture[[1]] 111 | lambda <- params$penalty[[1]] 112 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 113 | parsnip::set_engine("glmnet") %>% 114 | generics::fit(y ~ ., data = analysis_set) 115 | model 116 | } 117 | 118 | 119 | mod_FUN_list <- list(glmnet = glm_FUN, ranger = ranger_FUN) 120 | 121 | 122 | 123 | ###################################################### 124 | # Hyperparameter Grids 125 | ###################################################### 126 | 127 | 128 | # size = number of rows 129 | glm_params <- grid_latin_hypercube( 130 | mixture(), 131 | penalty(), 132 | 
size = 100 133 | ) 134 | 135 | rf_params <- grid_latin_hypercube( 136 | mtry(range = c(3, 4)), 137 | trees(range = c(200, 300)), 138 | size = 100 139 | ) 140 | 141 | params_list <- list(glmnet = glm_params, ranger = rf_params) 142 | 143 | 144 | 145 | 146 | ###################################################### 147 | # Functions used in the loops 148 | ###################################################### 149 | 150 | 151 | # inputs params, model, and fold, calls model and error functions, outputs error 152 | mod_error <- function(params, mod_FUN, dat) { 153 | y_col <- ncol(dat$data) 154 | y_obs <- assessment(dat)[y_col] 155 | mod <- mod_FUN(params, analysis(dat)) 156 | pred <- predict(mod, assessment(dat)) 157 | if (!is.data.frame(pred)) { 158 | pred <- pred$predictions 159 | } 160 | error <- error_FUN(y_obs, pred) 161 | error 162 | } 163 | 164 | # inputs fold, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 165 | tune_over_params <- function(dat, mod_FUN, params) { 166 | params$error <- map_dbl(1:nrow(params), function(row) { 167 | params <- params[row,] 168 | mod_error(params, mod_FUN, dat) 169 | }) 170 | params 171 | } 172 | 173 | # inputs outer fold and sends outer fold's inner folds to tuning function, collects and averages fold's error for each hyperparameter combo 174 | summarize_tune_results <- function(dat, mod_FUN, params) { 175 | # Return row-bound tibble that has the 25 bootstrap results 176 | param_names <- names(params) 177 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 178 | lazy_dt(., key_by = param_names) %>% 179 | # For each value of the tuning parameter, compute the 180 | # average which is the inner bootstrap estimate. 181 | group_by_at(vars(param_names)) %>% 182 | summarize(mean_error = mean(error, na.rm = TRUE), 183 | n = length(error)) %>% 184 | as_tibble() 185 | } 186 | 187 | 188 | # primary function for the nested-cv 189 | compare_algs <- function(mod_FUN, params, ncv_dat){ 190 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 191 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 192 | 193 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 
5 repeats 10 folds = 50 best hyperparam combos) 194 | best_hyper_vals <- tuning_results %>% 195 | map_df(function(dat) { 196 | dat[which.min(dat$mean_error),] 197 | }) %>% 198 | select(names(params)) 199 | 200 | # fit models on the outer-loop folds using best hyperparams 201 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 202 | params <- best_hyper_vals[row,] 203 | mod_error(params, mod_FUN, dat) 204 | }, .progress = TRUE) 205 | 206 | 207 | # output error stats 208 | tibble( 209 | mean_error = mean(outer_fold_error), 210 | median_error = median(outer_fold_error), 211 | sd_error = sd(outer_fold_error) 212 | ) 213 | } 214 | 215 | 216 | 217 | 218 | ###################################################### 219 | # Compare algorithms, Tune chosen algorithm 220 | ###################################################### 221 | 222 | 223 | # outputs df with outer fold stats for each algorithm 224 | algorithm_comparison <- map2_dfr(mod_FUN_list, params_list, compare_algs, ncv_dat) %>% 225 | mutate(model = names(mod_FUN_list)) %>% 226 | select(model, everything()) 227 | 228 | # Choose alg with lowest avg error 229 | chosen_alg <- algorithm_comparison %>% 230 | filter(mean_error == min(mean_error)) %>% 231 | pull(1) 232 | 233 | # Set inputs to chosen alg 234 | mod_FUN <- mod_FUN_list[[chosen_alg]] 235 | params <- params_list[[chosen_alg]] 236 | total_train <- vfold_cv(train_dat, v = 2) 237 | 238 | # tune chosen alg on the inner-loop cv strategy 239 | # code is an amalgam of funs: summarize_tune_results, tune_over_params 240 | tuning_results <- map(total_train$splits, function(dat, mod_FUN, params) { 241 | params$error <- future_map_dbl(1:nrow(params), function(row) { 242 | params <- params[row,] 243 | mod_error(params, mod_FUN, dat) 244 | }) 245 | return(params) 246 | }, mod_FUN, params) %>% 247 | bind_rows() %>% 248 | lazy_dt(., key_by = names(params)) %>% 249 | # For each value of the tuning parameter, compute the 250 | # average which is the inner bootstrap estimate. 
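  # (in this script the resamples come from total_train, a 2-fold vfold_cv of the
  #  training set, so the average here is over the two validation folds rather than
  #  over bootstrap resamples)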
251 | group_by_at(vars(names(params))) %>% 252 | summarize(mean_error = mean(error, na.rm = TRUE)) %>% 253 | as_tibble() 254 | 255 | 256 | 257 | 258 | ###################################################### 259 | # Score chosen algorithm 260 | ###################################################### 261 | 262 | 263 | # Get best params from the tuning 264 | best_hyper_vals <- tuning_results %>% 265 | filter(mean_error == min(mean_error)) %>% 266 | slice(1) %>% 267 | select(names(params)) 268 | 269 | # Get avg error across validation folds from alg using best params 270 | avg_kfold_error <- tuning_results %>% 271 | filter(mean_error == min(mean_error)) %>% 272 | mutate(mean_error = round(mean_error, 5)) %>% 273 | select(mean_error) 274 | 275 | # train tuned mod on entire training set; score on test set 276 | chosen_mod <- mod_FUN(best_hyper_vals, train_dat) 277 | 278 | toc(log = TRUE) 279 | 280 | 281 | 282 | # log duration metric to MLflow 283 | ncv_times <- tic.log(format = FALSE) 284 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 285 | mlflow_log_metric("duration", duration) 286 | mlflow_set_tag("implementation", "ranger-kj") 287 | mlflow_set_tag("method", "raschka") 288 | mlflow_end_run() 289 | 290 | 291 | # Score on the held out test set 292 | chosen_preds <- predict(chosen_mod, test_dat) 293 | if (!is.data.frame(chosen_preds)) { 294 | chosen_preds <- chosen_preds$predictions 295 | } 296 | test_error <- round(error_FUN(test_dat$y, chosen_preds), 5) 297 | 298 | # avg error across outer-loop folds in ncv 299 | outer_kfold_error <- algorithm_comparison %>% 300 | filter(model == chosen_alg) %>% 301 | mutate(mean_error = round(mean_error, 5)) %>% 302 | pull(mean_error) 303 | 304 | # Create output message and text me the results 305 | best_hyper_vals <- best_hyper_vals %>% 306 | tidyr::pivot_longer(cols = names(.), names_to = "param", values_to = "value" ) %>% 307 | glue_data("{param} = {value}") %>% 308 | glue_collapse(sep = ",", last = " and ") 309 | 310 | msg <- glue("Avg K-Fold CV error: {avg_kfold_error[[1]]} 311 | Outer-Fold Avg Error: {outer_kfold_error} 312 | Test Error: {test_error} 313 | Best Parameters for {chosen_alg}: 314 | {best_hyper_vals}") 315 | 316 | log.txt <- tic.log(format = TRUE) 317 | text_msg <- glue("{log.txt[[1]]} for kj-raschka script to complete 318 | Results: 319 | {msg}") 320 | 321 | pbPost("note", title="kj-raschka script finished", body=text_msg) 322 | tic.clearlog() 323 | 324 | 325 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 326 | installr::kill_process(process = c("waitress-serve.exe")) 327 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-mlr3-raschka.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | 4 | # Raschka method 5 | # mlr3 6 | 7 | 8 | 9 | # Notes: 10 | # 1. *** Make sure target variable is the last column*** 11 | # 2. As of 16Jan2020, folds are not fixed so each call to resample() creates a new nested_cv structure which means the algorithms are being compared on somewhat different data. 12 | # 3. For the glm, lambda is not a tunable parameter in mlr3. So I had to use lambda_min_ratio and nlambda to get a grid that should be equivalent to a 100 row lambda grid. 13 | # 4. The batch arg in the tuner function allows you to specify how you want to parallelize for each algorithm which is nice. 
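#    (On note 3: glmnet builds its lambda path internally -- nlambda values spaced on a log
#     scale from a data-derived maximum down to lambda.min.ratio * lambda_max -- so, as a
#     rough sketch assuming that default behavior, the comparable direct grid would be
#     something like exp(seq(log(lambda_max), log(lambda_min_ratio * lambda_max),
#     length.out = nlambda)). Tuning lambda.min.ratio and nlambda therefore only
#     approximates tuning lambda itself.)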
14 | 15 | 16 | # Sections: 17 | # 1. Set-Up and Data 18 | # 2. Functions Used in the Loops 19 | # 3. Model Functions; Hyperparameter Grids 20 | # 4. Compare Algorithms 21 | # 5. Tune and Score Chosen Algorithm 22 | 23 | 24 | 25 | 26 | 27 | ##################################################### 28 | # set up and Data 29 | ##################################################### 30 | 31 | 32 | # texts me when there's an error 33 | options(error = function() { 34 | library(RPushbullet) 35 | pbPost("note", "Error", geterrmessage()) 36 | if(!interactive()) stop(geterrmessage()) 37 | }) 38 | 39 | # start MLflow server 40 | sys::exec_background("mlflow server") 41 | Sys.sleep(10) 42 | 43 | 44 | library(tictoc) 45 | tic() 46 | 47 | pacman::p_load(glue, RPushbullet, dplyr, mlr3, mlr3learners, mlr3tuning, future, mlflow) 48 | 49 | 50 | set.seed(2019) 51 | 52 | # make explicit the name of the exeriement to record to 53 | mlflow_set_experiment("ncv_duration") 54 | 55 | 56 | plan(multiprocess) 57 | 58 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 59 | sim_data <- function(n) { 60 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 61 | tmp <- cbind(tmp$x, tmp$y) 62 | tmp <- as.data.frame(tmp) 63 | names(tmp)[ncol(tmp)] <- "y" 64 | tmp 65 | } 66 | 67 | # Use small data to tune and compare models 68 | dat <- sim_data(5000) 69 | 70 | train_idx <- caret::createDataPartition(y = dat$y, p = 0.80, list = FALSE) 71 | train_dat <- dat[train_idx, ] 72 | test_dat <- dat[-train_idx, ] 73 | 74 | 75 | 76 | 77 | ##################################################### 78 | # Functions used in the loops 79 | ##################################################### 80 | 81 | 82 | # task obj consists of the data and target variable 83 | train_task <- TaskRegr$new(id = "train_dat", backend = train_dat, target = "y") 84 | 85 | # inner-loop tuning 86 | resampling_inner <- rsmp("cv", folds = 2) 87 | 88 | # outer loop 89 | resampling_outer_five <- rsmp("cv", folds = 5) 90 | 91 | # error function 92 | measures <- msr("regr.mae") 93 | 94 | # required to specify an early stopping criteria. I don't want one. 95 | terminator <- term("none") 96 | 97 | 98 | 99 | 100 | ##################################################### 101 | # Model Functions; Hyperparameter Grids 102 | ##################################################### 103 | 104 | 105 | # Model functions 106 | rf_mod <- lrn("regr.ranger") 107 | glm_mod <- lrn("regr.glmnet", nlambda = 1) 108 | 109 | # available hyperparameters 110 | # mlr3::mlr_learners$get("regr.ranger")$param_set 111 | # mlr3::mlr_learners$get("regr.glmnet")$param_set 112 | 113 | # Hyperparameter restrictions/conditions 114 | rf_params <- paradox::ParamSet$new( 115 | params = list(paradox::ParamInt$new("mtry", lower = 3, upper = 4), 116 | paradox::ParamInt$new("num.trees", lower = 200, upper = 300) 117 | ) 118 | ) 119 | 120 | # values from for lambda.min.ratio come from the h2o docs about the parameter. 
121 | glm_params <- paradox::ParamSet$new( 122 | params = list(paradox::ParamDbl$new("alpha", lower = 0, upper = 1), 123 | paradox::ParamDbl$new("lambda.min.ratio", lower = 0.0001, upper = 0.01) 124 | ) 125 | ) 126 | 127 | 128 | 129 | # generates a latin hypercube of values based on the restrictions above 130 | # rf and glm grids consists of 100 rows 131 | rf_grid <- paradox::generate_design_lhs(rf_params, 100)$data 132 | 133 | glm_grid <- paradox::generate_design_lhs(glm_params, 100)$data 134 | 135 | 136 | 137 | # Design_points is the grid strategy that allows me to make my own grids to experiment on. Batch_size indicates how many rows of the grid to tune simulataneously. Essentially how much to parallelize. 138 | rf_tuner <- tnr("design_points", batch_size = 8, design = rf_grid) 139 | glm_tuner <- tnr("design_points", batch_size = 8, design = glm_grid) 140 | 141 | 142 | learner_list <- list(glm = glm_mod, rf = rf_mod) 143 | params_list <- list(glm = glm_params, rf = rf_params) 144 | tuner_list <- list(glm = glm_tuner, rf = rf_tuner) 145 | 146 | lol <- list(learner_list, params_list, tuner_list) 147 | 148 | 149 | 150 | 151 | ##################################################### 152 | # Compare algorithms 153 | ##################################################### 154 | 155 | 156 | compare_algs <- function(mod, params, tuner, resampling_inner, resampling_outer, measures, terminator) { 157 | 158 | # Learner augmented with tuning. Tunes on the inner-loop bootstraps, then the best model will be used on the outer loop fold 159 | mod_tuned <- AutoTuner$new(learner = mod, 160 | resampling = resampling_inner, 161 | measures = measures, 162 | tune_ps = params, 163 | terminator = terminator, 164 | tuner = tuner) 165 | 166 | # Initiates the ncv 167 | ncv_mod <- resample(task = train_task, learner = mod_tuned, resampling = resampling_outer) 168 | 169 | # pull() needs the name of the error function being used 170 | measure_name <- measures$id 171 | # Score for each fold 172 | outer_fold_error <- ncv_mod$score(measures) %>% 173 | pull(measure_name) 174 | # Average score across outer loop folds 175 | mean_error <- ncv_mod$aggregate(measures) 176 | 177 | # output error stats and chosen hyperparams 178 | tibble( 179 | mean_error = mean_error, 180 | median_error = median(outer_fold_error), 181 | sd_error = sd(outer_fold_error) 182 | ) 183 | 184 | } 185 | 186 | 187 | algorithm_comparison <- purrr::pmap_dfr(lol, compare_algs, resampling_inner, resampling_outer_five, measures, terminator) %>% 188 | mutate(model = names(learner_list)) %>% 189 | select(model, everything()) 190 | 191 | 192 | 193 | 194 | ##################################################### 195 | # Tune and Score the Chosen Algorithm 196 | ##################################################### 197 | 198 | 199 | # Choose the best algorithm based on the lowest mean error on the outer loop folds 200 | chosen_alg <- algorithm_comparison %>% 201 | filter(mean_error == min(mean_error)) %>% 202 | pull(1) 203 | 204 | chosen_learner <- learner_list[[chosen_alg]]$reset() 205 | chosen_grid <- params_list[[chosen_alg]] 206 | chosen_tuner <- tuner_list[[chosen_alg]] 207 | 208 | # Use inner-loop tuning strategy to tune the chosen model 209 | chosen_tuned <- AutoTuner$new(learner = chosen_learner, 210 | resampling = resampling_inner, 211 | measures = measures, 212 | tune_ps = chosen_grid, 213 | terminator = terminator, 214 | tuner = chosen_tuner) 215 | 216 | # Tune the chosen model on the entire training set 217 | chosen_train <- chosen_tuned$train(train_task) 
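# (AutoTuner$train() runs the inner-loop 2-fold tuning over the full training task and then
#  refits the learner on all of train_dat with the best hyperparameters it found; the stored
#  tuning results are read back out of chosen_train$tuning_result below.)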
218 | 219 | toc(log = TRUE) 220 | 221 | 222 | 223 | # log duration metric to MLflow 224 | ncv_times <- tic.log(format = FALSE) 225 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 226 | mlflow_log_metric("duration", duration) 227 | mlflow_set_tag("implementation", "mlr3") 228 | mlflow_set_tag("method", "raschka") 229 | mlflow_end_run() 230 | 231 | 232 | 233 | # Score the best model on the entire training set and the held out test set 234 | train_error <- round(chosen_train$predict(train_task)$score(measures), 4) 235 | test_error <- round(chosen_train$predict_newdata(test_dat, task = train_task)$score(measures), 4) 236 | 237 | # Average error across tuning folds and the parameter values chosen during tuning 238 | kfold_error <- round(chosen_train$tuning_result$perf, 4) 239 | best_params <- as.data.frame(chosen_train$tuning_result$params) %>% 240 | tidyr::pivot_longer(cols = names(.), names_to = "param", values_to = "value" ) %>% 241 | glue_data("{param} = {value}") %>% 242 | glue_collapse(sep = ",", last = " and ") 243 | 244 | # avg error across outer-loop folds in ncv 245 | outer_kfold_error <- algorithm_comparison %>% 246 | filter(model == chosen_alg) %>% 247 | mutate(mean_error = round(mean_error, 5)) %>% 248 | pull(mean_error) 249 | 250 | 251 | msg <- glue("Average of K-fold CV test folds: {kfold_error} 252 | Outer-Fold Avg Error: {outer_kfold_error} 253 | Training Error: {train_error} 254 | Test Error: {test_error} 255 | Best Parameters for {chosen_alg}: 256 | {best_params}") 257 | 258 | log.txt <- tic.log(format = TRUE) 259 | text_msg <- glue("{log.txt[[1]]} for script to complete 260 | Results: 261 | {msg}") 262 | pbPost("note", title="mlr3-raschka script finished", body=text_msg) 263 | tic.clearlog() 264 | 265 | 266 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 267 | installr::kill_process(process = c("waitress-serve.exe")) 268 | 269 | 270 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-py-raschka.py: -------------------------------------------------------------------------------- 1 | # Nested Cross-Validation with sklearn models 2 | 3 | 4 | # Raschka method 5 | # python 6 | 7 | 8 | # Notes 9 | # 1. *** make sure target variable is the last variable in the data.frame *** 10 | # 2. Hyperparameter values were generated from the R package dials and saved to pickle format through reticulate 11 | # 3. Data was simulated by R package mlbench and saved in pickle format through reticulate 12 | # 4. Best to use reticulate::source_python('path/to/script.py', envir = NULL, convert = FALSE) 13 | # (4. cont.) envir=NULL, convert=FALSE increases the speed 14 | # 5. Bootstrap CV strategy isn't offered by Scikit Learn and I couldn't find any other Python packages offering it. 15 | # 6. With n_iter set to the number of rows in the grids, RandomizedGridSearch just reshuffles the grids. 16 | # (6. cont.) Shouldn't be done this way in real life with Latin Hypercubes because I'd think shuffling 17 | # (6. cont.) defeats the purpose of grid algorithm. Sklearn doesn't execute ParameterGrid in parallel though, 18 | # (6. cont.) and I'm just worried about fairly testing the speed of implentations. 19 | 20 | 21 | # Sections 22 | # 1. Set-up 23 | # 2. Data 24 | # 3. Algorithms 25 | # 4. Hyperparameter grids 26 | # 5. Create inner-loop tuning strategy 27 | # 6. Run nested-cv 28 | # 7. 
Train and Score Chosen Algorithm 29 | 30 | 31 | 32 | 33 | 34 | ################################### 35 | # Set-up 36 | ################################### 37 | 38 | 39 | # # If running with reticulate::repl_python or using reticulate::source_python, necessary in order 40 | # # to run in parallel. 41 | # # Should be ran before other modules imported. 42 | # # Updates executable path in sys module. 43 | # import sys 44 | # import os 45 | # exe = os.path.join(sys.exec_prefix, "pythonw.exe") 46 | # sys.executable = exe 47 | # sys._base_executable = exe 48 | # # update executable path in multiprocessing module 49 | # import multiprocessing 50 | # multiprocessing.set_executable(exe) 51 | 52 | 53 | # # If running with reticulate::repl_python or using reticulate::source_python, necessary in order 54 | # # start MLflow's server 55 | # import subprocess 56 | # import time 57 | # subprocess.Popen('mlflow server') 58 | # time.sleep(10) 59 | 60 | 61 | from pytictoc import TicToc 62 | t = TicToc() 63 | t.tic() 64 | 65 | from pushbullet import Pushbullet 66 | import os 67 | import mlflow 68 | import pickle 69 | import numpy as np 70 | import pandas as pd 71 | from sklearn.model_selection import RandomizedSearchCV, train_test_split, KFold 72 | from sklearn.ensemble import RandomForestRegressor 73 | from sklearn.linear_model import ElasticNet 74 | from sklearn.metrics import mean_absolute_error 75 | 76 | np.random.seed(2019) 77 | 78 | # dotenv allows for persistent environment variables 79 | from dotenv import load_dotenv 80 | load_dotenv() 81 | pb_token = os.getenv('PUSHBULLET_TOKEN') 82 | pb = Pushbullet(pb_token) 83 | 84 | # make explicit the name of the exeriement to record to 85 | mlflow.set_experiment("ncv_duration") 86 | 87 | 88 | 89 | ################################### 90 | # Data 91 | ################################### 92 | 93 | 94 | # load simulated data 95 | # r = read mode, b = binary; pickle is binary 96 | with open('./data/fivek-simdat.pickle', 'rb') as fried: 97 | pdat = pickle.load(fried) 98 | 99 | # load elastic net regression hyperparameter values 100 | with open('./grids/elast-latin-params.pickle', 'rb') as elastp: 101 | elast_params = pickle.load(elastp) 102 | 103 | # load random forest hyperparater values 104 | with open('./grids/rf-latin-params.pickle', 'rb') as rfp: 105 | rf_params = pickle.load(rfp) 106 | 107 | 108 | y_idx = len(pdat.columns) - 1 109 | y = pdat.iloc[:, y_idx].values 110 | X = pdat.iloc[:, 0:y_idx] 111 | X_train, X_test, y_train, y_test = train_test_split(X, y, 112 | test_size=0.2, 113 | random_state=1) 114 | 115 | 116 | 117 | 118 | #################################### 119 | # Algorithms 120 | #################################### 121 | 122 | 123 | # Elastic Net Regression 124 | elast_est = ElasticNet(normalize = True, 125 | fit_intercept = True) 126 | 127 | # Random Forest 128 | rf_est = RandomForestRegressor(criterion = "mae", 129 | random_state = 1) 130 | 131 | 132 | est_dict = {'Elastic Net': elast_est, 'Random Forest': rf_est} 133 | 134 | 135 | 136 | 137 | #################################### 138 | # Hyperparameter grids 139 | #################################### 140 | 141 | # elastic net regression 142 | alpha = elast_params.pop('penalty').values 143 | l1_ratio = elast_params.pop('mixture').values 144 | elast_grid = [{'alpha': alpha, 145 | 'l1_ratio': l1_ratio}] 146 | 147 | # random forest 148 | max_features = rf_params.pop('mtry').values 149 | n_estimators = rf_params.pop('trees').values 150 | rf_grid = [{'max_features': max_features, 151 | 'n_estimators': 
n_estimators}] 152 | 153 | 154 | grid_dict = {'Elastic Net': elast_grid, 'Random Forest': rf_grid} 155 | 156 | 157 | 158 | 159 | #################################### 160 | # Create inner-loop tuning strategy 161 | #################################### 162 | 163 | 164 | # vessel for my inner-loop grid search objects 165 | gridcvs = {} 166 | 167 | # shuffle = True required for setting random state 168 | # setting random state makes sure all algorithms tuned on the same splits 169 | inner_cv = KFold(n_splits = 2, shuffle = True, random_state = 1) 170 | 171 | # Setting this parameter to the size of the grid tells Random Search to use every grid value once 172 | elast_iter = len(elast_params) 173 | rf_iter = len(rf_params) 174 | iter_dict = {'Elastic Net': elast_iter, 'Random Forest': rf_iter} 175 | 176 | 177 | # Setting up multiple RandomSearchCV objects, 1 for each algorithm 178 | # Collecting them in the gridcvs dict 179 | for pgrid, est, n_iter, name in zip((elast_grid, rf_grid), 180 | (elast_est, rf_est), 181 | (elast_iter, rf_iter), 182 | ('Elastic Net', 'Random Forest')): 183 | gcv = RandomizedSearchCV(estimator = est, 184 | param_distributions = pgrid, 185 | n_iter = n_iter, 186 | scoring = 'neg_mean_absolute_error', 187 | n_jobs = -1, 188 | cv = inner_cv, 189 | verbose = 0, 190 | refit = True) 191 | gridcvs[name] = gcv 192 | 193 | 194 | 195 | 196 | #################################### 197 | # Run nested-cv 198 | #################################### 199 | 200 | # vessel for stats on the outer fold results 201 | results = pd.DataFrame() 202 | 203 | # The validation set scores of the outer loop folds will be used to choose the best algorithm 204 | # loop the grid objects we created (1 for each algorithm) 205 | for name, gs_est in sorted(gridcvs.items()): 206 | outer_scores = [] 207 | # set the outer loop cv strategy 208 | outer_cv = KFold(n_splits=5, shuffle=True, random_state=1) 209 | 210 | # training set is split into train/valid as the outer-loop folds 211 | for train_idx, valid_idx in outer_cv.split(X_train, y_train): 212 | # training set is fed into inner-loop for tuning 213 | gridcvs[name].fit(X_train.values[train_idx], y_train[train_idx]) 214 | # best model chosen to be scored on the outer-loop valid set 215 | outer_scores.append(gridcvs[name].best_estimator_.score(X_train.values[valid_idx], y_train[valid_idx])) 216 | 217 | # stats calc'd on outer fold scores 218 | fold_score = {'model': name, 219 | 'mean_error': np.mean(outer_scores), 220 | 'sd_error': np.std(outer_scores)} 221 | # 1 row per algorithm 222 | results = results.append(fold_score, ignore_index=True) 223 | 224 | # Choose the best algorithm based on the lowest mean error on the outer loop folds 225 | chosen_alg = results[results.mean_error == results.mean_error.min()]['model'][0] 226 | 227 | chosen_est = est_dict[chosen_alg] 228 | chosen_grid = grid_dict[chosen_alg] 229 | chosen_iter = iter_dict[chosen_alg] 230 | 231 | 232 | 233 | 234 | #################################### 235 | # Train and Score Chosen Algorithm 236 | #################################### 237 | 238 | 239 | # Use inner-loop tuning strategy to tune the chosen model 240 | gcv_model_select = RandomizedSearchCV(estimator = chosen_est, 241 | param_distributions = chosen_grid, 242 | n_iter = chosen_iter, 243 | scoring = 'neg_mean_absolute_error', 244 | n_jobs = -1, 245 | cv = inner_cv, 246 | verbose = 0, 247 | refit = True) 248 | 249 | # Tune the chosen model on the entire training set 250 | gcv_model_select.fit(X_train, y_train) 251 | 252 | # given in seconds 
253 | time_elapsed = round(t.tocvalue(), 2) 254 | 255 | # log the metric on MLflow 256 | mlflow.log_metric('duration', time_elapsed) 257 | tags = {'implementation': 'python', 'method': 'raschka'} 258 | mlflow.set_tags(tags) 259 | mlflow.end_run() 260 | 261 | 262 | best_model = gcv_model_select.best_estimator_ 263 | 264 | # Score the best model on the entire training set and the held out test set 265 | train_error = round(mean_absolute_error(y_true=y_train, y_pred=best_model.predict(X_train)), 5) 266 | test_error = round(mean_absolute_error(y_true=y_test, y_pred=best_model.predict(X_test)), 5) 267 | 268 | # Average error across tuning folds and the parameter values chosen during tuning 269 | k_fold_score = round(-1 * gcv_model_select.best_score_, 5) 270 | outer_kfold_score = round(results[results.model == chosen_alg]['mean_error'][0], 5) 271 | best_hyper_vals = gcv_model_select.best_params_ 272 | 273 | # best_hyper_vals is a dict. Use it to create the df, then add the errors. 274 | model_stats = pd.DataFrame(data = best_hyper_vals, index = [1]) 275 | model_stats = model_stats.assign(kfold_error = k_fold_score, outer_fold_error = outer_kfold_score, train_error = train_error, test_error = test_error) 276 | 277 | 278 | # evidently has to be written into a paragraph because print outputs can't be saved and there's no glue in python. 279 | msg = f'Python script finished in {time_elapsed} seconds. The chosen algorithm was {chosen_alg} with parameters, {best_hyper_vals}. Avg score over cv\'s test folds was {k_fold_score}. Outer fold avg score was {outer_kfold_score}. Training Error: {train_error}, Test Error: {test_error}' 280 | 281 | # text me the results 282 | pb.push_note("Nested CV script finished", msg) 283 | 284 | 285 | # # only necessary if running with reticulate::repl_python or using reticulate::source_python 286 | # # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 287 | # os.system('taskkill /f /im waitress-serve.exe') 288 | # os.system('taskkill /f /im pythonw.exe') 289 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-retic-raschka.R: -------------------------------------------------------------------------------- 1 | # Nested Cross-validation using Scikit-Learn 2 | 3 | 4 | # Raschka's method 5 | # reticulate 6 | 7 | 8 | # Notes 9 | # 1. Uses the reticulate pkg to implement Raschka's nested cv 10 | # 2. *** make sure target variable is the last variable in the data.frame *** 11 | # 3. Bootstrap CV strategy isn't offered by Scikit Learn and I couldn't find any other Python packages offering it. 12 | # 4. With n_iter set to the number of rows in the grids, RandomizedGridSearch just reshuffles the grids. Shouldn't be done this way in real life with Latin Hypercubes because I'd think shuffling defeats the purpose of grid algorithm. Sklearn doesn't execute ParameterGrid in parallel though, and I'm just worried about fairly testing the speed of implentations. 13 | 14 | 15 | # Sections: 16 | # 1. Set-Up 17 | # 2. Data 18 | # 3. Estimators 19 | # 4. Hyperparameter Grids 20 | # 5. Inner and Outer CV Strategies 21 | # 6. Compare Algorithms 22 | # 7. 
Tune and Score Chosen Algorithm 23 | 24 | 25 | 26 | 27 | 28 | ###################################### 29 | # Set-up 30 | ###################################### 31 | 32 | 33 | # text me if any errors occur 34 | options(error = function() { 35 | library(RPushbullet) 36 | pbPost("note", "Error", geterrmessage()) 37 | if(!interactive()) stop(geterrmessage()) 38 | }) 39 | 40 | 41 | # start MLflow server 42 | sys::exec_background("mlflow server") 43 | Sys.sleep(10) 44 | 45 | 46 | library(tictoc) 47 | tic() 48 | 49 | pacman::p_load(RPushbullet, dplyr, glue, reticulate, mlflow) 50 | 51 | # make explicit the name of the exeriement to record to 52 | mlflow_set_experiment("ncv_duration") 53 | 54 | 55 | #------- Required for Multiprocessing with reticulate (in just Windows?) 56 | 57 | # update executable path in sys module 58 | sys <- import("sys") 59 | exe <- file.path(sys$exec_prefix, "pythonw.exe") 60 | sys$executable <- exe 61 | sys$`_base_executable` <- exe 62 | 63 | # update executable path in multiprocessing module 64 | multiprocessing <- import("multiprocessing") 65 | multiprocessing$set_executable(exe) 66 | 67 | #------- 68 | 69 | 70 | sk_lm <- import("sklearn.linear_model") 71 | sk_e <- import("sklearn.ensemble") 72 | sk_ms <- import("sklearn.model_selection") 73 | sk_m <- import("sklearn.metrics") 74 | 75 | set.seed(2019) 76 | py_set_seed(2019) 77 | 78 | 79 | ###################################### 80 | # Data 81 | ###################################### 82 | 83 | 84 | sim_data <- function(n) { 85 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 86 | tmp <- cbind(tmp$x, tmp$y) 87 | tmp <- as.data.frame(tmp) 88 | names(tmp)[ncol(tmp)] <- "y" 89 | tmp 90 | } 91 | 92 | dat <- sim_data(5000) 93 | 94 | pdat = r_to_py(dat) 95 | 96 | y_idx <- py_len(pdat$columns) - 1 97 | X_idx <- y_idx - 1 98 | y = pdat$iloc(axis = 1L)[y_idx]$values 99 | X = pdat$iloc(axis = 1L)[0:X_idx] 100 | 101 | dat_splits<- sk_ms$train_test_split(X, y, 102 | test_size=0.2, 103 | random_state=1L) 104 | 105 | X_train <- dat_splits[[1]] 106 | X_test <- dat_splits[[2]] 107 | y_train <- as.numeric(dat_splits[[3]]) 108 | y_test <- as.numeric(dat_splits[[4]]) 109 | 110 | 111 | 112 | 113 | ###################################### 114 | # Estimators 115 | ###################################### 116 | 117 | # Elastic Net Regression 118 | elast_est <- sk_lm$ElasticNet(normalize = TRUE, 119 | fit_intercept = TRUE) 120 | 121 | # Random Forest 122 | rf_est <- sk_e$RandomForestRegressor(criterion = "mae", 123 | random_state = 1L) 124 | 125 | 126 | alg_list <- list(elastic_net = elast_est, rf = rf_est) 127 | 128 | 129 | 130 | 131 | ###################################### 132 | # Hyperparameter grids 133 | ###################################### 134 | 135 | 136 | # Elastic Net Regression 137 | elast_params <- r_to_py(dials::grid_latin_hypercube( 138 | dials::mixture(), 139 | dials::penalty(), 140 | size = 100 141 | )) 142 | alpha <- elast_params$pop('penalty')$values 143 | l1_ratio <- elast_params$pop('mixture')$values 144 | elast_grid <- py_dict(list('alpha', 'l1_ratio'), 145 | list(alpha, l1_ratio)) 146 | 147 | 148 | 149 | # Random Forest 150 | rf_params <- r_to_py(dials::grid_latin_hypercube( 151 | dials::mtry(range = c(3, 4)), 152 | dials::trees(range = c(200, 300)), 153 | size = 100 154 | )) 155 | max_features <- rf_params$pop('mtry')$values 156 | n_estimators <- rf_params$pop('trees')$values 157 | rf_grid <- py_dict(list('max_features', 'n_estimators'), list(max_features, n_estimators)) 158 | 159 | 160 | grid_list <- list(elastic_net = elast_grid, 
rf = rf_grid) 161 | 162 | 163 | 164 | 165 | ###################################### 166 | # Inner and Outer CV strategies 167 | ###################################### 168 | 169 | # Setting the inner-loop tuning strategy 170 | inner_cv <- sk_ms$KFold(n_splits = 2L, 171 | shuffle = TRUE, 172 | random_state = 1L) 173 | 174 | # Setting the outer-loop cv strategy 175 | outer_cv <- sk_ms$KFold(n_splits = 5L, 176 | shuffle = TRUE, 177 | random_state = 1L) 178 | 179 | # Setting n_iter to the total rows for each grid just reshuffles the grids 180 | n_iter_list <- list(elastic_net = py_len(elast_params), rf = py_len(rf_params)) 181 | 182 | lol <- list(alg_list, grid_list, n_iter_list) 183 | 184 | # Setting up multiple RandomSearchCV objects, 1 for each algorithm 185 | # Collecting them in the inner-loop list 186 | inner_loop <- purrr::pmap(lol, function(alg, grid, n_iter) { 187 | sk_ms$RandomizedSearchCV(estimator = alg, 188 | param_distributions = grid, 189 | n_iter = n_iter, 190 | scoring = 'neg_mean_absolute_error', 191 | cv = inner_cv, 192 | n_jobs = -1L, 193 | pre_dispatch = '2*n_jobs', 194 | refit = TRUE) 195 | }) 196 | 197 | 198 | 199 | 200 | ###################################### 201 | # Compare Algorithms 202 | ###################################### 203 | 204 | 205 | algorithm_comparison <- purrr::map(inner_loop, function(grid_search) { 206 | 207 | outer_scores <- list() 208 | counter <- 0 209 | 210 | outer_split <- outer_cv$split(X_train, y_train) 211 | 212 | # while loop + iter_next = a python for-loop 213 | while (TRUE) { 214 | 215 | # python methods create "iterable" objs 216 | fold <- iter_next(outer_split) 217 | # loop ends once we run out of folds 218 | if (is.null(fold)) 219 | break 220 | 221 | # python 0-indexed, so need to add 1 in order to correctly subset in R 222 | train_idx <- as.integer(fold[[1]]) + 1 223 | valid_idx <- as.integer(fold[[2]]) + 1 224 | 225 | pX_train <- r_to_py(X_train[train_idx,]) 226 | pX_valid <- r_to_py(X_train[valid_idx,]) 227 | 228 | py_train <- r_to_py(y_train[train_idx]) 229 | py_valid <- r_to_py(y_train[valid_idx]) 230 | 231 | # training set is fed into inner-loop for tuning 232 | tuning_results <- grid_search$fit(pX_train, py_train) 233 | 234 | counter <- sum(counter, 1) 235 | 236 | # best model chosen to be scored on the outer-loop valid set 237 | outer_scores[[counter]] <- data.frame(error = tuning_results$best_estimator_$score(pX_valid, py_valid)) 238 | } 239 | 240 | # stats calc'd on outer fold scores 241 | outer_stats <- outer_scores %>% 242 | bind_rows() %>% 243 | summarize(mean_error = mean(error, na.rm = TRUE), 244 | median_error = median(error, na.rm = TRUE), 245 | sd_error = sd(error, na.rm = TRUE)) 246 | }) 247 | 248 | 249 | 250 | 251 | ###################################### 252 | # Tune and Score the Chosen Algorithm 253 | ###################################### 254 | 255 | 256 | # Choose the best algorithm based on the lowest mean error on the outer loop folds 257 | chosen_alg <- purrr::map_dfr(algorithm_comparison, ~select(., mean_error), .id = "model") %>% 258 | filter(mean_error == min(mean_error)) %>% 259 | pull(1) 260 | 261 | alg <- alg_list[[chosen_alg]] 262 | grid <- grid_list[[chosen_alg]] 263 | n_iter <- n_iter_list[[chosen_alg]] 264 | 265 | # Use inner-loop tuning strategy to tune the chosen model 266 | chosen_tuned <- sk_ms$RandomizedSearchCV(estimator = alg, 267 | param_distributions = grid, 268 | n_iter = n_iter, 269 | scoring = 'neg_mean_absolute_error', 270 | cv = inner_cv, 271 | n_jobs = -1L, 272 | refit = TRUE) 273 | 274 | 
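# Note (descriptive, not in the original script): with refit = TRUE, scikit-learn refits the best
# hyperparameter combination on whatever data is passed to fit(), so the best_estimator_ used
# further below is already a final model trained on the full training set.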
# Tune the chosen model on the entire training set 275 | chosen_train <- chosen_tuned$fit(X_train, y_train) 276 | 277 | toc(log = TRUE) 278 | 279 | 280 | 281 | # log duration metric to MLflow 282 | ncv_times <- tic.log(format = FALSE) 283 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 284 | mlflow_log_metric("duration", duration) 285 | mlflow_set_tag("implementation", "reticulate") 286 | mlflow_set_tag("method", "raschka") 287 | mlflow_end_run() 288 | 289 | 290 | 291 | chosen_model <- chosen_train$best_estimator_ 292 | 293 | # Score the best model on the entire training set and the held out test set 294 | train_error <- round(sk_m$mean_absolute_error(y_true = y_train, y_pred = chosen_model$predict(X_train)), 4) 295 | test_error <- round(sk_m$mean_absolute_error(y_true = y_test, y_pred = chosen_model$predict(X_test)), 4) 296 | 297 | # Average error across tuning folds and the parameter values chosen during tuning 298 | kfold_error <- round(-chosen_tuned$best_score_, 4) 299 | best_params <- as.data.frame(chosen_tuned$best_params_) %>% 300 | tidyr::pivot_longer(cols = names(.), names_to = "param", values_to = "value" ) %>% 301 | glue_data("{param} = {value}") %>% 302 | glue_collapse(sep = ",", last = " and ") 303 | 304 | outer_kfold_error <- purrr::map_dfr(algorithm_comparison, ~select(., mean_error), .id = "model") %>% 305 | filter(mean_error == min(mean_error)) %>% 306 | mutate(mean_error = round(mean_error, 5)) %>% 307 | pull(mean_error) 308 | 309 | 310 | msg <- glue("Average of K-fold CV test folds: {kfold_error} 311 | Outer-Fold Avg Error: {outer_kfold_error} 312 | Training Error: {train_error} 313 | Test Error: {test_error} 314 | Best Parameters for {chosen_alg}: 315 | {best_params}") 316 | 317 | log.txt <- tic.log(format = TRUE) 318 | text_msg <- glue("{log.txt[[1]]} for script to complete 319 | Results: 320 | {msg}") 321 | 322 | # text me the results 323 | pbPost("note", title="reticulate-raschka script finished", body=text_msg) 324 | tic.clearlog() 325 | 326 | 327 | # MLflow uses waitress for Windows. 
Killing it also kills mlflow.exe, python.exe, console window host processes 328 | installr::kill_process(process = c("waitress-serve.exe", "pythonw.exe")) 329 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: null 2 | channels: 3 | - defaults 4 | dependencies: 5 | - certifi=2019.11.28=py36_0 6 | - pip=20.0.2=py36_1 7 | - python=3.6.10=h9f7ef89_0 8 | - setuptools=45.2.0=py36_0 9 | - sqlite=3.31.1=he774522_0 10 | - vc=14.1=h0510ff6_4 11 | - vs2015_runtime=14.16.27012=hf0eaf9b_1 12 | - wheel=0.34.2=py36_0 13 | - wincertstore=0.2=py36h7fe50ca_0 14 | - pip: 15 | - alembic==1.4.0 16 | - chardet==3.0.4 17 | - click==7.0 18 | - cloudpickle==1.3.0 19 | - configparser==4.0.2 20 | - databricks-cli==0.9.1 21 | - docker==4.2.0 22 | - entrypoints==0.3 23 | - flask==1.1.1 24 | - gitdb2==3.0.2 25 | - gitpython==3.0.8 26 | - gorilla==0.3.0 27 | - idna==2.9 28 | - itsdangerous==1.1.0 29 | - jinja2==2.11.1 30 | - joblib==0.14.1 31 | - mako==1.1.1 32 | - markupsafe==1.1.1 33 | - mlflow==1.6.0 34 | - numpy==1.18.1 35 | - pandas==1.0.1 36 | - prometheus-client==0.7.1 37 | - prometheus-flask-exporter==0.12.2 38 | - protobuf==3.11.3 39 | - pushbullet-py==0.11.0 40 | - pypiwin32==223 41 | - python-dateutil==2.8.1 42 | - python-dotenv==0.11.0 43 | - python-editor==1.0.4 44 | - python-magic==0.4.15 45 | - pytictoc==1.5.0 46 | - pytz==2019.3 47 | - pywin32==227 48 | - pyyaml==5.3 49 | - querystring-parser==1.2.4 50 | - requests==2.23.0 51 | - scikit-learn==0.22.1 52 | - scipy==1.4.1 53 | - simplejson==3.17.0 54 | - six==1.14.0 55 | - smmap2==2.0.5 56 | - sqlalchemy==1.3.13 57 | - sqlparse==0.3.0 58 | - tabulate==0.8.6 59 | - urllib3==1.25.8 60 | - waitress==1.4.3 61 | - websocket-client==0.57.0 62 | - werkzeug==1.0.0 63 | prefix: C:\Users\tbats\Documents\R\Projects\nested-cross-validation-comparison\renv\python\condaenvs\renv-python 64 | -------------------------------------------------------------------------------- /grids/elast-latin-params.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/grids/elast-latin-params.pickle -------------------------------------------------------------------------------- /grids/rf-latin-params.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/grids/rf-latin-params.pickle -------------------------------------------------------------------------------- /images/ncv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/images/ncv.png -------------------------------------------------------------------------------- /nested-cross-validation-comparison.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 6 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /palettes/Forest Floor.ase: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/palettes/Forest Floor.ase -------------------------------------------------------------------------------- /performance-experiment/Kuhn-Johnson/plan-kj.R: -------------------------------------------------------------------------------- 1 | # Kuhn-Johnson drake plan 2 | 3 | 4 | # Notes: 5 | # 1. I broke the plan into units by sample size. I'm sure its possible to formulate the plan to perform the whole experiment by looping the kj and raschka method along with sample sizes into one large, more compact plan, but I wanted units that I could run overnight on my desktop. 6 | # 2. sample_sizes: 100, 800, 2000, 5000, 10000 (maybe) 7 | # 3. I'm trying to minimize the delta_error. Delta error is the absolute difference between the average error across the outer-folds of the nested cross-validation and the out-of-sample error which uses the chosen model and parameters to predict on a simulated 100K row dataset. 8 | 9 | 10 | 11 | 12 | 13 | error_FUN <- function(y_obs, y_hat){ 14 | y_obs <- unlist(y_obs) 15 | y_hat <- unlist(y_hat) 16 | Metrics::mae(y_obs, y_hat) 17 | } 18 | 19 | method <- "kj" 20 | algorithms <- list("glmnet", "rf") 21 | repeats <- seq(1:5) 22 | grid_size <- 100 23 | 24 | plan <- drake_plan( 25 | # model functions for each algorithm 26 | mod_FUN_list = create_models(algorithms), 27 | # data used to estimate out-of-sample error 28 | # noise_sd, seed settings are the defaults 29 | large_dat = mlbench_data(n = 10^5, 30 | noise_sd = 1, 31 | seed = 2019), 32 | # sample size = 100 33 | sim_dat_100 = mlbench_data(100), 34 | # hyperparameter grids for each algorithm 35 | # This probably doesn't need to be a "dynamic" target since mtry is only concerned about the number of columns in data (see script), but I'll do it anyways 36 | params_list_100 = create_grids(sim_dat_100, 37 | algorithms, 38 | size = grid_size), 39 | # create a separate ncv data object for each repeat value 40 | ncv_dat_100 = create_ncv_objects(sim_dat_100, 41 | repeats, 42 | method), 43 | # runs nested-cv and compares ncv error with out-of-sample error 44 | # outputs: ncv error, oos error, delta error, chosen algorithm, chosen hyperparameters 45 | ncv_results_100 = target( 46 | run_ncv(ncv_dat_100, 47 | sim_dat_100, 48 | large_dat, 49 | mod_FUN_list, 50 | params_list_100, 51 | error_FUN, 52 | method), 53 | dynamic = map(ncv_dat_100) 54 | ), 55 | 56 | # repeat for the rest of the sample sizes 57 | # sample size = 800 58 | sim_dat_800 = mlbench_data(800), 59 | params_list_800 = create_grids(sim_dat_800, 60 | algorithms, 61 | size = grid_size), 62 | ncv_dat_800 = create_ncv_objects(sim_dat_800, 63 | repeats, 64 | method), 65 | ncv_results_800 = target( 66 | run_ncv(ncv_dat_800, 67 | sim_dat_800, 68 | large_dat, 69 | mod_FUN_list, 70 | params_list_800, 71 | error_FUN, 72 | method), 73 | dynamic = map(ncv_dat_800) 74 | ), 75 | 76 | # sample size = 2000 77 | sim_dat_2000 = mlbench_data(2000), 78 | params_list_2000 = create_grids(sim_dat_2000, 79 | algorithms, 80 | size = grid_size), 81 | ncv_dat_2000 = create_ncv_objects(sim_dat_2000, 82 | repeats, 83 | method), 84 | ncv_results_2000 = target( 85 | run_ncv(ncv_dat_2000, 86 | sim_dat_2000, 87 | large_dat, 88 | mod_FUN_list, 89 | params_list_2000, 90 | error_FUN, 91 | method), 92 | dynamic = map(ncv_dat_2000) 93 | ), 94 | 95 | # sample size = 5000 96 | sim_dat_5000 = mlbench_data(5000), 
97 | params_list_5000 = create_grids(sim_dat_5000, 98 | algorithms, 99 | size = grid_size), 100 | ncv_dat_5000 = create_ncv_objects(sim_dat_5000, 101 | repeats, 102 | method), 103 | ncv_results_5000 = target( 104 | run_ncv(ncv_dat_5000, 105 | sim_dat_5000, 106 | large_dat, 107 | mod_FUN_list, 108 | params_list_5000, 109 | error_FUN, 110 | method), 111 | dynamic = map(ncv_dat_5000) 112 | ) 113 | 114 | ) 115 | 116 | 117 | -------------------------------------------------------------------------------- /performance-experiment/Kuhn-Johnson/r_make-kj.R: -------------------------------------------------------------------------------- 1 | # r_make 2 | # runs experiment in a clean environment in a separate r session 3 | 4 | 5 | # Notes 6 | # 1. Didn't make the render readme function into a target because the readd doesn't explicitly call a specific target, so drake isn't triggered to build it. The buildtimes function in the readme also doesn't trigger a build. 7 | 8 | 9 | # text me if an error occurs 10 | options(error = function() { 11 | library(RPushbullet) 12 | pbPost("note", "Error", geterrmessage()) 13 | if(!interactive()) stop(geterrmessage()) 14 | }) 15 | 16 | 17 | 18 | drake::r_make(source = "_drake-kj.R" ) 19 | 20 | 21 | rmarkdown::render( 22 | input = "README.Rmd" 23 | ) 24 | 25 | 26 | # text me when it finishes 27 | RPushbullet::pbPost("note", title="kj performance experiment", body="ncv run finished") 28 | 29 | 30 | -------------------------------------------------------------------------------- /performance-experiment/Raschka/plan-raschka.R: -------------------------------------------------------------------------------- 1 | # Raschka drake plan 2 | 3 | 4 | # Notes: 5 | # 1. I broke the plan into units by sample size. I'm sure its possible to formulate the plan to perform the whole experiment by looping the kj and raschka method along with sample sizes into one large, more compact plan, but I wanted units that I could run overnight on my desktop. 6 | # 2. sample_sizes: 100, 800, 2000, 5000, 10000 (maybe) 7 | # 3. I'm trying to minimize the delta_error. Delta error is the absolute difference between the average error across the outer-folds of the nested cross-validation and the out-of-sample error which uses the chosen model and parameters to predict on a simulated 100K row dataset. 
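# (For reference: ncv-compare.R computes this quantity as delta_error = abs(oos_error - ncv_error),
# where ncv_error is the mean error across the outer folds and oos_error is the error of the final
# chosen model predicting on the 100K-row simulated set.)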
8 | 9 | 10 | 11 | 12 | 13 | error_FUN <- function(y_obs, y_hat){ 14 | y_obs <- unlist(y_obs) 15 | y_hat <- unlist(y_hat) 16 | Metrics::mae(y_obs, y_hat) 17 | } 18 | 19 | method <- "raschka" 20 | algorithms <- list("glmnet", "rf") 21 | repeats <- seq(1:5) 22 | grid_size <- 100 23 | 24 | plan <- drake_plan( 25 | # model functions for each algorithm 26 | mod_FUN_list_r = create_models(algorithms), 27 | # data used to estimate out-of-sample error 28 | # noise_sd, seed settings are the defaults 29 | large_dat_r = mlbench_data(n = 10^5, 30 | noise_sd = 1, 31 | seed = 2019), 32 | # sample size = 100 33 | sim_dat_r_100 = mlbench_data(100), 34 | # hyperparameter grids for each algorithm 35 | params_list_r_100 = create_grids(sim_dat_r_100, 36 | algorithms, 37 | size = grid_size), 38 | # create a separate ncv data object for each repeat value 39 | ncv_dat_r_100 = create_ncv_objects(sim_dat_r_100, 40 | repeats, 41 | method), 42 | # runs nested-cv and compares ncv error with out-of-sample error 43 | # outputs: ncv error, oos error, delta error, chosen algorithm, chosen hyperparameters 44 | ncv_results_r_100 = target( 45 | run_ncv(ncv_dat_r_100, 46 | sim_dat_r_100, 47 | large_dat_r, 48 | mod_FUN_list_r, 49 | params_list_r_100, 50 | error_FUN, 51 | method), 52 | dynamic = map(ncv_dat_r_100) 53 | ), 54 | 55 | # repeat for the rest of the sample sizes 56 | # sample size = 800 57 | sim_dat_r_800 = mlbench_data(800), 58 | params_list_r_800 = create_grids(sim_dat_r_800, 59 | algorithms, 60 | size = grid_size), 61 | ncv_dat_r_800 = create_ncv_objects(sim_dat_r_800, 62 | repeats, 63 | method), 64 | ncv_results_r_800 = target( 65 | run_ncv(ncv_dat_r_800, 66 | sim_dat_r_800, 67 | large_dat_r, 68 | mod_FUN_list_r, 69 | params_list_r_800, 70 | error_FUN, 71 | method), 72 | dynamic = map(ncv_dat_r_800) 73 | ), 74 | 75 | # sample size = 2000 76 | sim_dat_r_2000 = mlbench_data(2000), 77 | params_list_r_2000 = create_grids(sim_dat_r_2000, 78 | algorithms, 79 | size = grid_size), 80 | ncv_dat_r_2000 = create_ncv_objects(sim_dat_r_2000, 81 | repeats, 82 | method), 83 | ncv_results_r_2000 = target( 84 | run_ncv(ncv_dat_r_2000, 85 | sim_dat_r_2000, 86 | large_dat_r, 87 | mod_FUN_list_r, 88 | params_list_r_2000, 89 | error_FUN, 90 | method), 91 | dynamic = map(ncv_dat_r_2000) 92 | ), 93 | 94 | # sample size = 5000 95 | sim_dat_r_5000 = mlbench_data(5000), 96 | params_list_r_5000 = create_grids(sim_dat_r_5000, 97 | algorithms, 98 | size = grid_size), 99 | ncv_dat_r_5000 = create_ncv_objects(sim_dat_r_5000, 100 | repeats, 101 | method), 102 | ncv_results_r_5000 = target( 103 | run_ncv(ncv_dat_r_5000, 104 | sim_dat_r_5000, 105 | large_dat_r, 106 | mod_FUN_list_r, 107 | params_list_r_5000, 108 | error_FUN, 109 | method), 110 | dynamic = map(ncv_dat_r_5000) 111 | ) 112 | ) 113 | 114 | 115 | -------------------------------------------------------------------------------- /performance-experiment/Raschka/r_make-raschka.R: -------------------------------------------------------------------------------- 1 | # r_make 2 | # runs experiment in a clean environment in a separate r session 3 | 4 | 5 | # Notes 6 | # 1. Didn't make the render readme function into a target because the readd doesn't explicitly call a specific target, so drake isn't triggered to build it. The buildtimes function in the readme also doesn't trigger a build. 
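# 2. (Illustration only, not used in this project) drake's documented pattern for tracking a report is to
#    render through knitr_in()/file_out() inside the plan and to call readd()/loadd() with explicit target
#    names in the Rmd, e.g.:
#      report = rmarkdown::render(knitr_in("README.Rmd"), output_file = file_out("README.md"), quiet = TRUE)
#    The README.md output file name here is an assumption based on the github_document output format.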
7 | 8 | 9 | # text me if an error occurs 10 | options(error = function() { 11 | library(RPushbullet) 12 | pbPost("note", "Error", geterrmessage()) 13 | if(!interactive()) stop(geterrmessage()) 14 | }) 15 | 16 | 17 | 18 | drake::r_make(source = "_drake-raschka.R") 19 | 20 | rmarkdown::render( 21 | input = "README.Rmd" 22 | ) 23 | 24 | 25 | # text me when it finishes 26 | RPushbullet::pbPost("note", title="raschka performance experiment", body="ncv run finished") 27 | 28 | -------------------------------------------------------------------------------- /performance-experiment/functions/create-grids.R: -------------------------------------------------------------------------------- 1 | # Create Hyperparameter grid list 2 | 3 | 4 | # input: 5 | # 1, size = number of rows 6 | # 2. algorithms = list of algorithm abbreviations 7 | # "rf" = Ranger Random Forest 8 | # "glmnet" = Elastic Net regression 9 | # "svm" = Support Vector Machines 10 | 11 | # output: list of grid objects 12 | 13 | 14 | 15 | create_grids <- function(sim_dat, algorithms, size = 100) { 16 | 17 | # Elastic Net Regression 18 | 19 | glm_params <- dials::grid_latin_hypercube( 20 | dials::mixture(), 21 | dials::penalty(), 22 | size = size 23 | ) 24 | 25 | # Random Forest 26 | 27 | # Some of the parnsip model parameters have "unknown" for the default value ranges. finalize replaces the unknowns with values based on the data. 28 | mtry_updated <- dials::finalize(dials::mtry(), select(sim_dat, -ncol(sim_dat))) 29 | 30 | rf_params <- dials::grid_latin_hypercube( 31 | mtry_updated, 32 | dials::trees(), 33 | size = size 34 | ) 35 | 36 | # Support Vector Machines 37 | 38 | svm_params <- dials::grid_latin_hypercube( 39 | dials::cost(), 40 | dials::margin(), 41 | size = size 42 | ) 43 | 44 | # list of grid objects depending on the algorithms inputted (switch is pretty cool) 45 | # stop_glue throws error if algorithm inputted isn't available (Should be in glue pkg but isn't) 46 | grid_list <- purrr::map(algorithms, function(alg) { 47 | switch(alg, 48 | rf = rf_params -> alg_grid, 49 | glmnet = glm_params -> alg_grid, 50 | svm = svm_params -> alg_grid, 51 | infer:::stop_glue("{alg} grid not available.")) 52 | alg_grid 53 | 54 | }) %>% 55 | purrr::set_names(algorithms) 56 | } 57 | 58 | -------------------------------------------------------------------------------- /performance-experiment/functions/create-models.R: -------------------------------------------------------------------------------- 1 | # Creates list of model functions 2 | 3 | # input: list of algorithm abbreviations 4 | # "rf" = Ranger Random Forest 5 | # "glmnet" = Elastic Net regression 6 | # "svm" = Support Vector Machines 7 | 8 | # output: list of model functions 9 | 10 | 11 | 12 | create_models <- function(algorithms) { 13 | 14 | # Random Forest 15 | 16 | ranger_FUN <- function(params, analysis_set) { 17 | mtry <- params$mtry[[1]] 18 | trees <- params$trees[[1]] 19 | model <- ranger::ranger(y ~ ., data = analysis_set, mtry = mtry, num.trees = trees) 20 | model 21 | } 22 | 23 | # Elastic Net Regression 24 | 25 | glm_FUN <- function(params, analysis_set) { 26 | alpha <- params$mixture[[1]] 27 | lambda <- params$penalty[[1]] 28 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 29 | parsnip::set_engine("glmnet") %>% 30 | generics::fit(y ~ ., data = analysis_set) 31 | model 32 | } 33 | 34 | # Support Vector Machines 35 | 36 | svm_FUN <- function(params, analysis_set) { 37 | cost <- params$cost[[1]] 38 | model <- kernlab::ksvm(y ~ ., data = analysis_set, C = 
cost) 39 | model 40 | } 41 | 42 | # list of model objects depending on the algorithms inputted (switch is pretty cool) 43 | # stop_glue throws error if algorithm inputted isn't available (Should be in glue pkg but isn't) 44 | mod_FUN_list <- purrr::map(algorithms, function(alg) { 45 | switch(alg, 46 | rf = ranger_FUN -> mod_fun, 47 | glmnet = glm_FUN -> mod_fun, 48 | svm = svm_FUN -> mod_fun, 49 | infer:::stop_glue("{alg} model function not available.")) 50 | mod_fun 51 | 52 | }) %>% 53 | purrr::set_names(algorithms) 54 | } 55 | 56 | 57 | -------------------------------------------------------------------------------- /performance-experiment/functions/create-ncv-objects.R: -------------------------------------------------------------------------------- 1 | # nested-cv data function 2 | 3 | # inputs: 4 | # 1. dat = dataset 5 | # 2. repeats = numeric vector with numbers of repeats 6 | # 3. method = "kj" or "raschka" 7 | # outputs: 8 | # 1. list of {rsample} nested cv objects; one object per repeat value 9 | 10 | 11 | 12 | create_ncv_objects <- function(dat, repeats, method) { 13 | 14 | attempt::stop_if_not(repeats, is.numeric, "repeats needs to be a numeric class") 15 | attempt::stop_if_not(method, is.character, "method needs to be a character class") 16 | 17 | # don't remember but guessing crossing needs a list object 18 | if (is.data.frame(dat)) { 19 | dat <- list(dat) 20 | } 21 | # tibble grid of data and repeats 22 | grid <- tidyr::crossing(dat, repeats) 23 | 24 | # generate list of ncv objects 25 | # dynGet needed to get reps out of the envirnonment and into the nested_cv function 26 | if (method == "kj") { 27 | ncv_list <- purrr::map2(grid$dat, grid$repeats, function(dat, reps) { 28 | rsample::nested_cv(dat, 29 | outside = vfold_cv(v = 10, repeats = dynGet("reps")), 30 | inside = bootstraps(times = 25)) 31 | }) 32 | } else if (method == "raschka") { 33 | ncv_list <- purrr::map2(grid$dat, grid$repeats, function(dat, reps) { 34 | rsample::nested_cv(dat, 35 | outside = vfold_cv(v = 5, repeats = dynGet("reps")), 36 | inside = vfold_cv(v = 2)) 37 | }) 38 | } else { 39 | stop("Need to specify method as kj or raschka", call. = FALSE) 40 | } 41 | 42 | return(ncv_list) 43 | } 44 | 45 | 46 | 47 | 48 | -------------------------------------------------------------------------------- /performance-experiment/functions/inner-tune.R: -------------------------------------------------------------------------------- 1 | # inner loop tuning function 2 | 3 | # inputs: 4 | # 1. ncv_dat = one ncv object from the list created by create-ncv-objects.R 5 | # 2. mod_FUN_list = all the model objects created by create-models.R 6 | # 3. params_list = all the hyperparameter grids created by create-grids.R 7 | # 4. 
error_FUN = specified at the start of plan-.R 8 | 9 | # outputs: df of hyperparameters for each fold that was chosen in the inner-loop 10 | 11 | 12 | 13 | inner_tune <- function(ncv_dat, mod_FUN_list, params_list, error_FUN) { 14 | 15 | # inputs params, model, and resample, calls model and error functions, outputs error 16 | mod_error <- function(params, mod_FUN, dat) { 17 | y_col <- ncol(dat$data) 18 | y_obs <- rsample::assessment(dat)[y_col] 19 | mod <- mod_FUN(params, rsample::analysis(dat)) 20 | pred <- predict(mod, rsample::assessment(dat)) 21 | if (!is.data.frame(pred)) { 22 | pred <- pred$predictions 23 | } 24 | error <- error_FUN(y_obs, pred) 25 | error 26 | } 27 | 28 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 29 | tune_over_params <- function(dat, mod_FUN, params) { 30 | params$error <- purrr::map_dbl(1:nrow(params), function(row) { 31 | params <- params[row,] 32 | mod_error(params, mod_FUN, dat) 33 | }) 34 | params 35 | } 36 | 37 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 38 | summarize_tune_results <- function(dat, mod_FUN, params) { 39 | # Return row-bound tibble that has the 25 bootstrap results 40 | param_names <- names(params) 41 | furrr::future_map_dfr(dat$splits, tune_over_params, mod_FUN, params) %>% 42 | lazy_dt(., key_by = param_names) %>% 43 | # For each value of the tuning parameter, compute the 44 | # average which is the inner bootstrap estimate. 45 | group_by_at(vars(param_names)) %>% 46 | summarize(mean_error = mean(error, na.rm = TRUE), 47 | sd_error = sd(error, na.rm = TRUE), 48 | n = length(error)) %>% 49 | as_tibble() 50 | } 51 | 52 | tune_algorithms <- furrr::future_map2(mod_FUN_list, params_list, function(mod_FUN, params){ 53 | tuning_results <- purrr::map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 54 | 55 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 56 | best_hyper_vals <- tuning_results %>% 57 | purrr::map_df(function(dat) { 58 | dat %>% 59 | filter(mean_error == min(mean_error)) %>% 60 | arrange(sd_error) %>% 61 | slice(1) 62 | }) %>% 63 | select(names(params)) 64 | }) 65 | } 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /performance-experiment/functions/mlbench-data.R: -------------------------------------------------------------------------------- 1 | # create simulation data 2 | 3 | # Inputs are 10 independent variables uniformly distributed on the interval [0,1], only 5 out of these 10 are actually used. Outputs are created according to the formula 4 | # y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e 5 | 6 | mlbench_data <- function(n, noise_sd = 1, seed = 2019) { 7 | set.seed(seed) 8 | tmp <- mlbench::mlbench.friedman1(n, sd = noise_sd) 9 | tmp <- cbind(tmp$x, tmp$y) 10 | tmp <- as.data.frame(tmp) 11 | names(tmp)[ncol(tmp)] <- "y" 12 | tmp 13 | } -------------------------------------------------------------------------------- /performance-experiment/functions/ncv-compare.R: -------------------------------------------------------------------------------- 1 | # ncv_compare function 2 | 3 | 4 | # Chooses the best algorithm, fits best model on entire training set, predicts against large simulated data set 5 | 6 | # inputs: 7 | # 1. train_dat = the entire training dataset 8 | # 2. large_dat = the test dataset 9 | # 3. 
cv_stats = outer_cv.R output: df with chosen model, outer fold stats, hyperparams 10 | # 4. mod_FUN_list = list of model objects created from create_models.R 11 | # 5. params_list = list of hyperparameter grids created from create_grids.R 12 | # 6. error_FUN = error function given at the start of plan_.R 13 | # 7. method = "kj" or "raschka", given at the start of plan_.R 14 | 15 | # output: df with algorithm, hyperparams, and error values 16 | 17 | 18 | ncv_compare <- function(train_dat, large_dat, cv_stats, mod_FUN_list, params_list, error_FUN, method) { 19 | 20 | if (method == "kj") { 21 | # Choose alg with lowest avg error 22 | chosen_alg <- cv_stats %>% 23 | bind_rows(.id = "model") %>% 24 | filter(mean_error == min(mean_error)) %>% 25 | pull(model) 26 | 27 | # Set inputs to chosen alg 28 | mod_FUN <- mod_FUN_list[[chosen_alg]] 29 | params <- cv_stats[[chosen_alg]] %>% 30 | select(names(params_list[[chosen_alg]])) 31 | 32 | } else if (method == "raschka") { 33 | chosen_alg <- cv_stats %>% 34 | pull(model) 35 | mod_FUN <- mod_FUN_list[[chosen_alg]] 36 | params <- cv_stats %>% 37 | filter(model == chosen_alg) %>% 38 | select(names(params_list[[chosen_alg]])) 39 | } 40 | 41 | # fit model over entire training set 42 | fit <- mod_FUN(params, train_dat) 43 | 44 | # predict on test set 45 | preds <- predict(fit, large_dat) 46 | if (!is.data.frame(preds)) { 47 | preds <- preds$predictions 48 | } 49 | 50 | # calculate out-of-sample and retrieve nested-cv error 51 | y_col <- ncol(large_dat) 52 | y_obs <- large_dat[y_col] 53 | oos_error <- round(error_FUN(y_obs, preds), 5) 54 | 55 | if (method == "kj") { 56 | ncv_error <- cv_stats[[chosen_alg]] %>% 57 | mutate(mean_error = round(mean_error, 5)) %>% 58 | pull(mean_error) 59 | } else if (method == "raschka") { 60 | ncv_error <- cv_stats %>% 61 | filter(model == chosen_alg) %>% 62 | mutate(mean_error = round(mean_error, 5)) %>% 63 | pull(mean_error) 64 | } 65 | 66 | # delta (the difference between errors) is how well the ncv estimated generalization performance 67 | ncv_perf <- bind_cols(oos_error = oos_error, ncv_error = ncv_error) %>% 68 | mutate(method = method, 69 | delta_error = abs(oos_error - ncv_error), 70 | chosen_algorithm = chosen_alg) %>% 71 | bind_cols(params) %>% 72 | select(method, everything()) 73 | 74 | } 75 | 76 | 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /performance-experiment/functions/outer-cv.R: -------------------------------------------------------------------------------- 1 | # outer_cv function 2 | 3 | # Takes the hyperparameter values that were chosen by the tuning process in the inner loop and uses them for cross-validation in the outer loop. 4 | 5 | # inputs: 6 | # 1. ncv_dat = ncv obj from list created by create-ncv-objects.R 7 | # 2. best_hypervals_list = output from inner-tune.R 8 | # 3. mod_FUN_list = output from create-models.R 9 | # 4. error_FUN = error function given at start of plan_.R 10 | # 5. method = "kj" or "raschka" 11 | # 6. train_dat = entire training set; output from mlbench-data.R 12 | # 7. params_list = list of hyperparameter grids; output from create-grids.R 13 | 14 | # output: df with chosen model, chosen hyperparameters, outer-fold error stats: mean, median, sd error for each algorithm 15 | 16 | 17 | 18 | 19 | outer_cv <- function(ncv_dat, best_hypervals_list, mod_FUN_list, error_FUN, method, train_dat = NULL, params_list = NULL) { 20 | if (method == "raschka" & is.null(train_dat)) { 21 | stop("train_dat argument = NULL. 
Entire training set needs to be included for raschka method") 22 | } 23 | if (method == "raschka" & is.null(params_list)) { 24 | stop("params_list argument = NULL. Hyperparameter grid list needs to be included for raschka method") 25 | } 26 | 27 | # inputs params, model, and resample, calls model and error functions, outputs error 28 | mod_error <- function(params, mod_FUN, dat) { 29 | y_col <- ncol(dat$data) 30 | y_obs <- rsample::assessment(dat)[y_col] 31 | mod <- mod_FUN(params, rsample::analysis(dat)) 32 | pred <- predict(mod, rsample::assessment(dat)) 33 | if (!is.data.frame(pred)) { 34 | pred <- pred$predictions 35 | } 36 | error <- error_FUN(y_obs, pred) 37 | error 38 | } 39 | outer_stats <- furrr::future_map2(mod_FUN_list, best_hypervals_list, function(mod_FUN, best_hyper_vals){ 40 | 41 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 42 | outer_fold_error <- furrr::future_map2_dfr(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 43 | params <- best_hyper_vals[row,] 44 | error <- mod_error(params, mod_FUN, dat) 45 | tibble( 46 | error = error 47 | ) 48 | }) %>% 49 | bind_cols(best_hyper_vals) %>% 50 | mutate_all(~round(., 6)) 51 | 52 | 53 | if (method == "kj") { 54 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 55 | chosen_params <- best_hyper_vals %>% 56 | group_by_all() %>% 57 | tally() %>% 58 | ungroup() %>% 59 | filter(n == max(n)) %>% 60 | slice(1) 61 | 62 | # if majority vote chooses more than one parameter set, then choose the set with the lowest error. And take the first row in case more than one set of params has min error. 63 | if (nrow(chosen_params) > 1) { 64 | chosen_params <- chosen_params %>% 65 | mutate_all(~round(., 6)) %>% 66 | inner_join(outer_fold_error, by = c("mixture", "penalty")) %>% 67 | filter(error == min(error)) %>% 68 | slice(1) %>% 69 | select(names(best_hyper_vals)) 70 | } 71 | 72 | # output error stats and chosen hyperparams 73 | tibble( 74 | mean_error = mean(outer_fold_error$error), 75 | median_error = median(outer_fold_error$error), 76 | sd_error = sd(outer_fold_error$error) 77 | ) %>% 78 | bind_cols(chosen_params) 79 | 80 | } else if (method == "raschka") { 81 | tibble( 82 | mean_error = mean(outer_fold_error$error), 83 | median_error = median(outer_fold_error$error), 84 | sd_error = sd(outer_fold_error$error) 85 | ) 86 | 87 | } else { 88 | stop("Need to specify method as kj or raschka", call. = FALSE) 89 | } 90 | }) 91 | 92 | if (method == "raschka") { 93 | chosen_alg <- outer_stats %>% 94 | bind_rows(.id = "model") %>% 95 | filter(mean_error == min(mean_error)) %>% 96 | pull(1) 97 | 98 | # Set inputs to chosen alg 99 | mod_FUN <- mod_FUN_list[[chosen_alg]] 100 | params <- params_list[[chosen_alg]] 101 | total_train <- rsample::vfold_cv(train_dat, v = 2) 102 | 103 | # tune chosen alg on the inner-loop cv strategy 104 | # code is an amalgam of funs: summarize_tune_results, tune_over_params 105 | tuning_results <- furrr::future_map(total_train$splits, function(dat, mod_FUN, params) { 106 | params$error <- furrr::future_map_dbl(1:nrow(params), function(row) { 107 | params <- params[row,] 108 | mod_error(params, mod_FUN, dat) 109 | }) 110 | return(params) 111 | }, mod_FUN, params) %>% 112 | bind_rows() %>% 113 | lazy_dt(., key_by = names(params)) %>% 114 | # For each value of the tuning parameter, compute the 115 | # average which is the inner bootstrap estimate. 
116 | group_by_at(vars(names(params))) %>% 117 | summarize(mean_error = mean(error, na.rm = TRUE), 118 | sd_error = sd(error, na.rm = TRUE)) %>% 119 | as_tibble() 120 | 121 | # Get best params from the tuning 122 | chosen_hyper_vals <- tuning_results %>% 123 | filter(mean_error == min(mean_error)) %>% 124 | arrange(sd_error) %>% 125 | slice(1) %>% 126 | select(names(params)) 127 | 128 | outer_stats <- outer_stats %>% 129 | bind_rows(.id = "model") %>% 130 | filter(model == chosen_alg) %>% 131 | bind_cols(chosen_hyper_vals) 132 | } 133 | return(outer_stats) 134 | } 135 | 136 | 137 | 138 | -------------------------------------------------------------------------------- /performance-experiment/functions/run-ncv.R: -------------------------------------------------------------------------------- 1 | # Runs nested cross-validation 2 | 3 | # main function: compares the algorithms; choose the one with smallest error; predicts on a test set; calcs difference between test error and mean of outer-fold error 4 | 5 | # calls inner-tune.R, outer-cv.R, and ncv-compare.R 6 | 7 | # input and output description in the individual function scripts 8 | 9 | 10 | run_ncv <- function(ncv_dat, sim_dat, large_dat, mod_FUN_list, params_list, error_FUN, method) { 11 | 12 | # output: hypervalues for each fold, chosen in the inner-loop 13 | best_hypervals_list <- inner_tune( 14 | ncv_dat = ncv_dat[[1]], 15 | mod_FUN_list = mod_FUN_list, 16 | params_list = params_list, 17 | error_FUN = error_FUN) 18 | 19 | # output: model, mean, median, sd error, and hyperparameter columns 20 | if (method == "raschka") { 21 | cv_stats <- outer_cv( 22 | ncv_dat = ncv_dat[[1]], 23 | best_hypervals_list = best_hypervals_list, 24 | mod_FUN_list = mod_FUN_list, 25 | error_FUN = error_FUN, 26 | method = method, 27 | train_dat = sim_dat, 28 | params_list = params_list) 29 | } else if (method == "kj") { 30 | cv_stats <- outer_cv( 31 | ncv_dat = ncv_dat[[1]], 32 | best_hypervals_list = best_hypervals_list, 33 | mod_FUN_list = mod_FUN_list, 34 | error_FUN = error_FUN, 35 | method = method) 36 | } 37 | 38 | # output: algorithm, hyperparams, and error values 39 | genl_perf_est <- ncv_compare(train_dat = sim_dat, 40 | large_dat = large_dat, 41 | cv_stats = cv_stats, 42 | mod_FUN_list = mod_FUN_list, 43 | params_list = params_list, 44 | error_FUN = error_FUN, 45 | method = method) 46 | 47 | # if there's repeat == 1, then there is no repeat column (id), id becomes the fold co instead of there being an id2 col 48 | rep_status <- stringr::str_detect(ncv_dat[[1]]$id[[1]], pattern = "Repeat") 49 | 50 | if (rep_status == TRUE) { 51 | # number of repeats 52 | num_reps <- ncv_dat[[1]] %>% 53 | select(id) %>% 54 | mutate(repeats = stringr::str_extract(id, pattern = "[0-9]") %>% 55 | as.numeric()) %>% 56 | slice(n()) %>% 57 | pull(repeats) 58 | } else { 59 | num_reps <- 1 60 | } 61 | 62 | # cols: n, repeats, error calcs, chosen alg, chosen hyperparams 63 | final_results <- tibble(n = nrow(ncv_dat[[1]]$splits$`1`[[1]]), 64 | repeats = num_reps) %>% 65 | bind_cols(genl_perf_est) 66 | 67 | } -------------------------------------------------------------------------------- /performance-experiment/output/perf-exp-output-r.csv: -------------------------------------------------------------------------------- 1 | subtarget,n,repeats,method,oos_error,ncv_error,delta_error,chosen_algorithm,mixture,penalty,mtry,trees,elapsed,percent_error 2 | 
ncv_results_r_100_2cf8f29f,100,4,raschka,2.19966,1.99714,0.20252000000000026,glmnet,0.9937975914264098,3.6844362948689596e-6,NA,NA,6.48,0.092 3 | ncv_results_r_100_73d34817,100,1,raschka,2.20712,1.92178,0.28534000000000015,glmnet,0.2830161499953829,0.4736074164981174,NA,NA,1.65,0.129 4 | ncv_results_r_100_80f90de9,100,3,raschka,2.24343,2.05019,0.19323999999999986,glmnet,0.8434229354886338,0.38592594070682457,NA,NA,4.75,0.086 5 | ncv_results_r_100_c6a30309,100,5,raschka,2.19318,1.99697,0.19621,glmnet,0.27113973276223985,0.27236778939605205,NA,NA,7.9,0.089 6 | ncv_results_r_100_ed9aeedc,100,2,raschka,2.19315,2.02878,0.16437000000000035,glmnet,0.3506598100136034,0.24982118433658881,NA,NA,3.13,0.075 7 | ncv_results_r_2000_97d0d6fc,2000,4,raschka,1.39104,1.40004,0.008999999999999897,rf,NA,NA,5,1167,8.43,0.006 8 | ncv_results_r_2000_a6a8efea,2000,1,raschka,1.38907,1.40676,0.017689999999999984,rf,NA,NA,6,227,3.08,0.013 9 | ncv_results_r_2000_b959e941,2000,2,raschka,1.38666,1.39311,0.006450000000000067,rf,NA,NA,5,1167,4.85,0.005 10 | ncv_results_r_2000_e7878a5b,2000,5,raschka,1.38627,1.39816,0.011890000000000178,rf,NA,NA,6,1309,10.68,0.009 11 | ncv_results_r_2000_f2c5e5eb,2000,3,raschka,1.39259,1.40355,0.01096000000000008,rf,NA,NA,7,576,6.52,0.008 12 | ncv_results_r_5000_0131c1ff,5000,5,raschka,1.24466,1.27121,0.02654999999999985,rf,NA,NA,5,1454,23.95,0.021 13 | ncv_results_r_5000_5b1d8a27,5000,3,raschka,1.24056,1.27153,0.030969999999999942,rf,NA,NA,6,1874,16.2,0.025 14 | ncv_results_r_5000_86c8ceed,5000,4,raschka,1.2404,1.28051,0.04011000000000009,rf,NA,NA,6,1281,20.13,0.032 15 | ncv_results_r_5000_8bc71439,5000,2,raschka,1.24009,1.27624,0.036150000000000126,rf,NA,NA,6,446,11.53,0.029 16 | ncv_results_r_5000_e931a398,5000,1,raschka,1.24406,1.28087,0.03681000000000001,rf,NA,NA,5,1462,7.7,0.03 17 | ncv_results_r_800_0f9280b4,800,4,raschka,1.6537,1.63063,0.023069999999999924,rf,NA,NA,7,92,7.27,0.014 18 | ncv_results_r_800_2ff75ff1,800,5,raschka,1.64388,1.61275,0.031130000000000102,rf,NA,NA,7,162,8.52,0.019 19 | ncv_results_r_800_67abc218,800,3,raschka,1.63369,1.61903,0.014660000000000117,rf,NA,NA,9,256,5.37,0.009 20 | ncv_results_r_800_a8ac41db,800,2,raschka,1.63013,1.62835,0.0017800000000001148,rf,NA,NA,9,1807,3.88,0.001 21 | ncv_results_r_800_e4b0a9bc,800,1,raschka,1.63248,1.59467,0.0378099999999999,rf,NA,NA,7,1289,2.22,0.023 22 | -------------------------------------------------------------------------------- /performance-experiment/output/perf-exp-output.csv: -------------------------------------------------------------------------------- 1 | subtarget,n,repeats,method,oos_error,ncv_error,delta_error,chosen_algorithm,mixture,penalty,mtry,trees,elapsed,percent_error 2 | ncv_results_100_0108d912,100,5,kj,2.19359,2.01424,0.1793499999999999,glmnet,0.50424303883221,0.2211151988375703,NA,NA,1.36,0.082 3 | ncv_results_100_7aaa57d2,100,1,kj,2.19359,2.04781,0.1457799999999998,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.15,0.066 4 | ncv_results_100_97e7fe04,100,2,kj,2.19359,1.99077,0.20282,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.4,0.092 5 | ncv_results_100_9d044993,100,4,kj,2.19359,1.99643,0.19716,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.97,0.09 6 | ncv_results_100_ea11bf8d,100,3,kj,2.19262,2.01702,0.17559999999999976,glmnet,0.5809470646083355,0.16010254880830843,NA,NA,0.65,0.08 7 | ncv_results_2000_47742c31,2000,4,kj,1.38697,1.37171,0.015260000000000051,rf,NA,NA,5,1779,2.96,0.011 8 | 
ncv_results_2000_746435d6,2000,5,kj,1.39092,1.37625,0.01466999999999996,rf,NA,NA,5,1779,3.71,0.011 9 | ncv_results_2000_7d80d14d,2000,1,kj,1.38466,1.36553,0.01913000000000009,rf,NA,NA,5,1948,0.74,0.014 10 | ncv_results_2000_80d2e33a,2000,3,kj,1.38955,1.3711,0.018450000000000077,rf,NA,NA,5,1948,2.22,0.013 11 | ncv_results_2000_c16e9aff,2000,2,kj,1.38739,1.37015,0.017239999999999922,rf,NA,NA,5,1948,1.48,0.012 12 | ncv_results_5000_20d7ace1,5000,4,kj,1.24192,1.25837,0.016450000000000076,rf,NA,NA,5,1573,8.92,0.013 13 | ncv_results_5000_2a916af4,5000,5,kj,1.24272,1.25644,0.013719999999999954,rf,NA,NA,5,1664,11.13,0.011 14 | ncv_results_5000_7b1fdb55,5000,2,kj,1.24336,1.2612,0.017840000000000078,rf,NA,NA,5,1351,4.46,0.014 15 | ncv_results_5000_7b6f8e72,5000,1,kj,1.24304,1.25709,0.014050000000000118,rf,NA,NA,5,1664,2.23,0.011 16 | ncv_results_5000_d380966a,5000,3,kj,1.24267,1.25724,0.014569999999999972,rf,NA,NA,5,1365,6.69,0.012 17 | ncv_results_800_3b54c7f8,800,1,kj,1.63668,1.58422,0.05245999999999995,rf,NA,NA,6,1507,0.26,0.032 18 | ncv_results_800_3f87e120,800,2,kj,1.6333,1.58689,0.04641000000000006,rf,NA,NA,6,1168,0.51,0.028 19 | ncv_results_800_50b46544,800,4,kj,1.63707,1.58522,0.05184999999999995,rf,NA,NA,6,1693,1.09,0.032 20 | ncv_results_800_589454bb,800,3,kj,1.63456,1.5905,0.04405999999999999,rf,NA,NA,6,1168,0.76,0.027 21 | ncv_results_800_a2c27fe0,800,5,kj,1.63489,1.58745,0.04743999999999993,rf,NA,NA,6,1507,1.52,0.029 22 | -------------------------------------------------------------------------------- /performance-experiment/packages.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | pacman::p_load(data.table, dtplyr, dplyr, furrr, drake) 4 | 5 | 6 | -------------------------------------------------------------------------------- /renv/.gitignore: -------------------------------------------------------------------------------- 1 | local/ 2 | lock/ 3 | library/ 4 | python/ 5 | staging/ 6 | -------------------------------------------------------------------------------- /renv/activate.R: -------------------------------------------------------------------------------- 1 | 2 | local({ 3 | 4 | # the requested version of renv 5 | version <- "0.14.0-3" 6 | 7 | # the project directory 8 | project <- getwd() 9 | 10 | # allow environment variable to control activation 11 | activate <- Sys.getenv("RENV_ACTIVATE_PROJECT") 12 | if (!nzchar(activate)) { 13 | 14 | # don't auto-activate when R CMD INSTALL is running 15 | if (nzchar(Sys.getenv("R_INSTALL_PKG"))) 16 | return(FALSE) 17 | 18 | } 19 | 20 | # bail if activation was explicitly disabled 21 | if (tolower(activate) %in% c("false", "f", "0")) 22 | return(FALSE) 23 | 24 | # avoid recursion 25 | if (nzchar(Sys.getenv("RENV_R_INITIALIZING"))) 26 | return(invisible(TRUE)) 27 | 28 | # signal that we're loading renv during R startup 29 | Sys.setenv("RENV_R_INITIALIZING" = "true") 30 | on.exit(Sys.unsetenv("RENV_R_INITIALIZING"), add = TRUE) 31 | 32 | # signal that we've consented to use renv 33 | options(renv.consent = TRUE) 34 | 35 | # load the 'utils' package eagerly -- this ensures that renv shims, which 36 | # mask 'utils' packages, will come first on the search path 37 | library(utils, lib.loc = .Library) 38 | 39 | # check to see if renv has already been loaded 40 | if ("renv" %in% loadedNamespaces()) { 41 | 42 | # if renv has already been loaded, and it's the requested version of renv, 43 | # nothing to do 44 | spec <- .getNamespaceInfo(.getNamespace("renv"), "spec") 45 | if 
(identical(spec[["version"]], version)) 46 | return(invisible(TRUE)) 47 | 48 | # otherwise, unload and attempt to load the correct version of renv 49 | unloadNamespace("renv") 50 | 51 | } 52 | 53 | # load bootstrap tools 54 | bootstrap <- function(version, library) { 55 | 56 | # attempt to download renv 57 | tarball <- tryCatch(renv_bootstrap_download(version), error = identity) 58 | if (inherits(tarball, "error")) 59 | stop("failed to download renv ", version) 60 | 61 | # now attempt to install 62 | status <- tryCatch(renv_bootstrap_install(version, tarball, library), error = identity) 63 | if (inherits(status, "error")) 64 | stop("failed to install renv ", version) 65 | 66 | } 67 | 68 | renv_bootstrap_tests_running <- function() { 69 | getOption("renv.tests.running", default = FALSE) 70 | } 71 | 72 | renv_bootstrap_repos <- function() { 73 | 74 | # check for repos override 75 | repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA) 76 | if (!is.na(repos)) 77 | return(repos) 78 | 79 | # if we're testing, re-use the test repositories 80 | if (renv_bootstrap_tests_running()) 81 | return(getOption("renv.tests.repos")) 82 | 83 | # retrieve current repos 84 | repos <- getOption("repos") 85 | 86 | # ensure @CRAN@ entries are resolved 87 | repos[repos == "@CRAN@"] <- getOption( 88 | "renv.repos.cran", 89 | "https://cloud.r-project.org" 90 | ) 91 | 92 | # add in renv.bootstrap.repos if set 93 | default <- c(FALLBACK = "https://cloud.r-project.org") 94 | extra <- getOption("renv.bootstrap.repos", default = default) 95 | repos <- c(repos, extra) 96 | 97 | # remove duplicates that might've snuck in 98 | dupes <- duplicated(repos) | duplicated(names(repos)) 99 | repos[!dupes] 100 | 101 | } 102 | 103 | renv_bootstrap_download <- function(version) { 104 | 105 | # if the renv version number has 4 components, assume it must 106 | # be retrieved via github 107 | nv <- numeric_version(version) 108 | components <- unclass(nv)[[1]] 109 | 110 | methods <- if (length(components) == 4L) { 111 | list( 112 | renv_bootstrap_download_github 113 | ) 114 | } else { 115 | list( 116 | renv_bootstrap_download_cran_latest, 117 | renv_bootstrap_download_cran_archive 118 | ) 119 | } 120 | 121 | for (method in methods) { 122 | path <- tryCatch(method(version), error = identity) 123 | if (is.character(path) && file.exists(path)) 124 | return(path) 125 | } 126 | 127 | stop("failed to download renv ", version) 128 | 129 | } 130 | 131 | renv_bootstrap_download_impl <- function(url, destfile) { 132 | 133 | mode <- "wb" 134 | 135 | # https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17715 136 | fixup <- 137 | Sys.info()[["sysname"]] == "Windows" && 138 | substring(url, 1L, 5L) == "file:" 139 | 140 | if (fixup) 141 | mode <- "w+b" 142 | 143 | utils::download.file( 144 | url = url, 145 | destfile = destfile, 146 | mode = mode, 147 | quiet = TRUE 148 | ) 149 | 150 | } 151 | 152 | renv_bootstrap_download_cran_latest <- function(version) { 153 | 154 | spec <- renv_bootstrap_download_cran_latest_find(version) 155 | 156 | message("* Downloading renv ", version, " ... 
", appendLF = FALSE) 157 | 158 | type <- spec$type 159 | repos <- spec$repos 160 | 161 | info <- tryCatch( 162 | utils::download.packages( 163 | pkgs = "renv", 164 | destdir = tempdir(), 165 | repos = repos, 166 | type = type, 167 | quiet = TRUE 168 | ), 169 | condition = identity 170 | ) 171 | 172 | if (inherits(info, "condition")) { 173 | message("FAILED") 174 | return(FALSE) 175 | } 176 | 177 | # report success and return 178 | message("OK (downloaded ", type, ")") 179 | info[1, 2] 180 | 181 | } 182 | 183 | renv_bootstrap_download_cran_latest_find <- function(version) { 184 | 185 | # check whether binaries are supported on this system 186 | binary <- 187 | getOption("renv.bootstrap.binary", default = TRUE) && 188 | !identical(.Platform$pkgType, "source") && 189 | !identical(getOption("pkgType"), "source") && 190 | Sys.info()[["sysname"]] %in% c("Darwin", "Windows") 191 | 192 | types <- c(if (binary) "binary", "source") 193 | 194 | # iterate over types + repositories 195 | for (type in types) { 196 | for (repos in renv_bootstrap_repos()) { 197 | 198 | # retrieve package database 199 | db <- tryCatch( 200 | as.data.frame( 201 | utils::available.packages(type = type, repos = repos), 202 | stringsAsFactors = FALSE 203 | ), 204 | error = identity 205 | ) 206 | 207 | if (inherits(db, "error")) 208 | next 209 | 210 | # check for compatible entry 211 | entry <- db[db$Package %in% "renv" & db$Version %in% version, ] 212 | if (nrow(entry) == 0) 213 | next 214 | 215 | # found it; return spec to caller 216 | spec <- list(entry = entry, type = type, repos = repos) 217 | return(spec) 218 | 219 | } 220 | } 221 | 222 | # if we got here, we failed to find renv 223 | fmt <- "renv %s is not available from your declared package repositories" 224 | stop(sprintf(fmt, version)) 225 | 226 | } 227 | 228 | renv_bootstrap_download_cran_archive <- function(version) { 229 | 230 | name <- sprintf("renv_%s.tar.gz", version) 231 | repos <- renv_bootstrap_repos() 232 | urls <- file.path(repos, "src/contrib/Archive/renv", name) 233 | destfile <- file.path(tempdir(), name) 234 | 235 | message("* Downloading renv ", version, " ... ", appendLF = FALSE) 236 | 237 | for (url in urls) { 238 | 239 | status <- tryCatch( 240 | renv_bootstrap_download_impl(url, destfile), 241 | condition = identity 242 | ) 243 | 244 | if (identical(status, 0L)) { 245 | message("OK") 246 | return(destfile) 247 | } 248 | 249 | } 250 | 251 | message("FAILED") 252 | return(FALSE) 253 | 254 | } 255 | 256 | renv_bootstrap_download_github <- function(version) { 257 | 258 | enabled <- Sys.getenv("RENV_BOOTSTRAP_FROM_GITHUB", unset = "TRUE") 259 | if (!identical(enabled, "TRUE")) 260 | return(FALSE) 261 | 262 | # prepare download options 263 | pat <- Sys.getenv("GITHUB_PAT") 264 | if (nzchar(Sys.which("curl")) && nzchar(pat)) { 265 | fmt <- "--location --fail --header \"Authorization: token %s\"" 266 | extra <- sprintf(fmt, pat) 267 | saved <- options("download.file.method", "download.file.extra") 268 | options(download.file.method = "curl", download.file.extra = extra) 269 | on.exit(do.call(base::options, saved), add = TRUE) 270 | } else if (nzchar(Sys.which("wget")) && nzchar(pat)) { 271 | fmt <- "--header=\"Authorization: token %s\"" 272 | extra <- sprintf(fmt, pat) 273 | saved <- options("download.file.method", "download.file.extra") 274 | options(download.file.method = "wget", download.file.extra = extra) 275 | on.exit(do.call(base::options, saved), add = TRUE) 276 | } 277 | 278 | message("* Downloading renv ", version, " from GitHub ... 
", appendLF = FALSE) 279 | 280 | url <- file.path("https://api.github.com/repos/rstudio/renv/tarball", version) 281 | name <- sprintf("renv_%s.tar.gz", version) 282 | destfile <- file.path(tempdir(), name) 283 | 284 | status <- tryCatch( 285 | renv_bootstrap_download_impl(url, destfile), 286 | condition = identity 287 | ) 288 | 289 | if (!identical(status, 0L)) { 290 | message("FAILED") 291 | return(FALSE) 292 | } 293 | 294 | message("OK") 295 | return(destfile) 296 | 297 | } 298 | 299 | renv_bootstrap_install <- function(version, tarball, library) { 300 | 301 | # attempt to install it into project library 302 | message("* Installing renv ", version, " ... ", appendLF = FALSE) 303 | dir.create(library, showWarnings = FALSE, recursive = TRUE) 304 | 305 | # invoke using system2 so we can capture and report output 306 | bin <- R.home("bin") 307 | exe <- if (Sys.info()[["sysname"]] == "Windows") "R.exe" else "R" 308 | r <- file.path(bin, exe) 309 | args <- c("--vanilla", "CMD", "INSTALL", "-l", shQuote(library), shQuote(tarball)) 310 | output <- system2(r, args, stdout = TRUE, stderr = TRUE) 311 | message("Done!") 312 | 313 | # check for successful install 314 | status <- attr(output, "status") 315 | if (is.numeric(status) && !identical(status, 0L)) { 316 | header <- "Error installing renv:" 317 | lines <- paste(rep.int("=", nchar(header)), collapse = "") 318 | text <- c(header, lines, output) 319 | writeLines(text, con = stderr()) 320 | } 321 | 322 | status 323 | 324 | } 325 | 326 | renv_bootstrap_platform_prefix <- function() { 327 | 328 | # construct version prefix 329 | version <- paste(R.version$major, R.version$minor, sep = ".") 330 | prefix <- paste("R", numeric_version(version)[1, 1:2], sep = "-") 331 | 332 | # include SVN revision for development versions of R 333 | # (to avoid sharing platform-specific artefacts with released versions of R) 334 | devel <- 335 | identical(R.version[["status"]], "Under development (unstable)") || 336 | identical(R.version[["nickname"]], "Unsuffered Consequences") 337 | 338 | if (devel) 339 | prefix <- paste(prefix, R.version[["svn rev"]], sep = "-r") 340 | 341 | # build list of path components 342 | components <- c(prefix, R.version$platform) 343 | 344 | # include prefix if provided by user 345 | prefix <- renv_bootstrap_platform_prefix_impl() 346 | if (!is.na(prefix) && nzchar(prefix)) 347 | components <- c(prefix, components) 348 | 349 | # build prefix 350 | paste(components, collapse = "/") 351 | 352 | } 353 | 354 | renv_bootstrap_platform_prefix_impl <- function() { 355 | 356 | # if an explicit prefix has been supplied, use it 357 | prefix <- Sys.getenv("RENV_PATHS_PREFIX", unset = NA) 358 | if (!is.na(prefix)) 359 | return(prefix) 360 | 361 | # if the user has requested an automatic prefix, generate it 362 | auto <- Sys.getenv("RENV_PATHS_PREFIX_AUTO", unset = NA) 363 | if (auto %in% c("TRUE", "True", "true", "1")) 364 | return(renv_bootstrap_platform_prefix_auto()) 365 | 366 | # empty string on failure 367 | "" 368 | 369 | } 370 | 371 | renv_bootstrap_platform_prefix_auto <- function() { 372 | 373 | prefix <- tryCatch(renv_bootstrap_platform_os(), error = identity) 374 | if (inherits(prefix, "error") || prefix %in% "unknown") { 375 | 376 | msg <- paste( 377 | "failed to infer current operating system", 378 | "please file a bug report at https://github.com/rstudio/renv/issues", 379 | sep = "; " 380 | ) 381 | 382 | warning(msg) 383 | 384 | } 385 | 386 | prefix 387 | 388 | } 389 | 390 | renv_bootstrap_platform_os <- function() { 391 | 392 | sysinfo 
<- Sys.info() 393 | sysname <- sysinfo[["sysname"]] 394 | 395 | # handle Windows + macOS up front 396 | if (sysname == "Windows") 397 | return("windows") 398 | else if (sysname == "Darwin") 399 | return("macos") 400 | 401 | # check for os-release files 402 | for (file in c("/etc/os-release", "/usr/lib/os-release")) 403 | if (file.exists(file)) 404 | return(renv_bootstrap_platform_os_via_os_release(file, sysinfo)) 405 | 406 | # check for redhat-release files 407 | if (file.exists("/etc/redhat-release")) 408 | return(renv_bootstrap_platform_os_via_redhat_release()) 409 | 410 | "unknown" 411 | 412 | } 413 | 414 | renv_bootstrap_platform_os_via_os_release <- function(file, sysinfo) { 415 | 416 | # read /etc/os-release 417 | release <- utils::read.table( 418 | file = file, 419 | sep = "=", 420 | quote = c("\"", "'"), 421 | col.names = c("Key", "Value"), 422 | comment.char = "#", 423 | stringsAsFactors = FALSE 424 | ) 425 | 426 | vars <- as.list(release$Value) 427 | names(vars) <- release$Key 428 | 429 | # get os name 430 | os <- tolower(sysinfo[["sysname"]]) 431 | 432 | # read id 433 | id <- "unknown" 434 | for (field in c("ID", "ID_LIKE")) { 435 | if (field %in% names(vars) && nzchar(vars[[field]])) { 436 | id <- vars[[field]] 437 | break 438 | } 439 | } 440 | 441 | # read version 442 | version <- "unknown" 443 | for (field in c("UBUNTU_CODENAME", "VERSION_CODENAME", "VERSION_ID", "BUILD_ID")) { 444 | if (field %in% names(vars) && nzchar(vars[[field]])) { 445 | version <- vars[[field]] 446 | break 447 | } 448 | } 449 | 450 | # join together 451 | paste(c(os, id, version), collapse = "-") 452 | 453 | } 454 | 455 | renv_bootstrap_platform_os_via_redhat_release <- function() { 456 | 457 | # read /etc/redhat-release 458 | contents <- readLines("/etc/redhat-release", warn = FALSE) 459 | 460 | # infer id 461 | id <- if (grepl("centos", contents, ignore.case = TRUE)) 462 | "centos" 463 | else if (grepl("redhat", contents, ignore.case = TRUE)) 464 | "redhat" 465 | else 466 | "unknown" 467 | 468 | # try to find a version component (very hacky) 469 | version <- "unknown" 470 | 471 | parts <- strsplit(contents, "[[:space:]]")[[1L]] 472 | for (part in parts) { 473 | 474 | nv <- tryCatch(numeric_version(part), error = identity) 475 | if (inherits(nv, "error")) 476 | next 477 | 478 | version <- nv[1, 1] 479 | break 480 | 481 | } 482 | 483 | paste(c("linux", id, version), collapse = "-") 484 | 485 | } 486 | 487 | renv_bootstrap_library_root_name <- function(project) { 488 | 489 | # use project name as-is if requested 490 | asis <- Sys.getenv("RENV_PATHS_LIBRARY_ROOT_ASIS", unset = "FALSE") 491 | if (asis) 492 | return(basename(project)) 493 | 494 | # otherwise, disambiguate based on project's path 495 | id <- substring(renv_bootstrap_hash_text(project), 1L, 8L) 496 | paste(basename(project), id, sep = "-") 497 | 498 | } 499 | 500 | renv_bootstrap_library_root <- function(project) { 501 | 502 | path <- Sys.getenv("RENV_PATHS_LIBRARY", unset = NA) 503 | if (!is.na(path)) 504 | return(path) 505 | 506 | path <- Sys.getenv("RENV_PATHS_LIBRARY_ROOT", unset = NA) 507 | if (!is.na(path)) { 508 | name <- renv_bootstrap_library_root_name(project) 509 | return(file.path(path, name)) 510 | } 511 | 512 | prefix <- renv_bootstrap_profile_prefix() 513 | paste(c(project, prefix, "renv/library"), collapse = "/") 514 | 515 | } 516 | 517 | renv_bootstrap_validate_version <- function(version) { 518 | 519 | loadedversion <- utils::packageDescription("renv", fields = "Version") 520 | if (version == loadedversion) 521 | 
return(TRUE) 522 | 523 | # assume four-component versions are from GitHub; three-component 524 | # versions are from CRAN 525 | components <- strsplit(loadedversion, "[.-]")[[1]] 526 | remote <- if (length(components) == 4L) 527 | paste("rstudio/renv", loadedversion, sep = "@") 528 | else 529 | paste("renv", loadedversion, sep = "@") 530 | 531 | fmt <- paste( 532 | "renv %1$s was loaded from project library, but this project is configured to use renv %2$s.", 533 | "Use `renv::record(\"%3$s\")` to record renv %1$s in the lockfile.", 534 | "Use `renv::restore(packages = \"renv\")` to install renv %2$s into the project library.", 535 | sep = "\n" 536 | ) 537 | 538 | msg <- sprintf(fmt, loadedversion, version, remote) 539 | warning(msg, call. = FALSE) 540 | 541 | FALSE 542 | 543 | } 544 | 545 | renv_bootstrap_hash_text <- function(text) { 546 | 547 | hashfile <- tempfile("renv-hash-") 548 | on.exit(unlink(hashfile), add = TRUE) 549 | 550 | writeLines(text, con = hashfile) 551 | tools::md5sum(hashfile) 552 | 553 | } 554 | 555 | renv_bootstrap_load <- function(project, libpath, version) { 556 | 557 | # try to load renv from the project library 558 | if (!requireNamespace("renv", lib.loc = libpath, quietly = TRUE)) 559 | return(FALSE) 560 | 561 | # warn if the version of renv loaded does not match 562 | renv_bootstrap_validate_version(version) 563 | 564 | # load the project 565 | renv::load(project) 566 | 567 | TRUE 568 | 569 | } 570 | 571 | renv_bootstrap_profile_load <- function(project) { 572 | 573 | # if RENV_PROFILE is already set, just use that 574 | profile <- Sys.getenv("RENV_PROFILE", unset = NA) 575 | if (!is.na(profile) && nzchar(profile)) 576 | return(profile) 577 | 578 | # check for a profile file (nothing to do if it doesn't exist) 579 | path <- file.path(project, "renv/local/profile") 580 | if (!file.exists(path)) 581 | return(NULL) 582 | 583 | # read the profile, and set it if it exists 584 | contents <- readLines(path, warn = FALSE) 585 | if (length(contents) == 0L) 586 | return(NULL) 587 | 588 | # set RENV_PROFILE 589 | profile <- contents[[1L]] 590 | if (nzchar(profile)) 591 | Sys.setenv(RENV_PROFILE = profile) 592 | 593 | profile 594 | 595 | } 596 | 597 | renv_bootstrap_profile_prefix <- function() { 598 | profile <- renv_bootstrap_profile_get() 599 | if (!is.null(profile)) 600 | return(file.path("renv/profiles", profile)) 601 | } 602 | 603 | renv_bootstrap_profile_get <- function() { 604 | profile <- Sys.getenv("RENV_PROFILE", unset = "") 605 | renv_bootstrap_profile_normalize(profile) 606 | } 607 | 608 | renv_bootstrap_profile_set <- function(profile) { 609 | profile <- renv_bootstrap_profile_normalize(profile) 610 | if (is.null(profile)) 611 | Sys.unsetenv("RENV_PROFILE") 612 | else 613 | Sys.setenv(RENV_PROFILE = profile) 614 | } 615 | 616 | renv_bootstrap_profile_normalize <- function(profile) { 617 | 618 | if (is.null(profile) || profile %in% c("", "default")) 619 | return(NULL) 620 | 621 | profile 622 | 623 | } 624 | 625 | # load the renv profile, if any 626 | renv_bootstrap_profile_load(project) 627 | 628 | # construct path to library root 629 | root <- renv_bootstrap_library_root(project) 630 | 631 | # construct library prefix for platform 632 | prefix <- renv_bootstrap_platform_prefix() 633 | 634 | # construct full libpath 635 | libpath <- file.path(root, prefix) 636 | 637 | # attempt to load 638 | if (renv_bootstrap_load(project, libpath, version)) 639 | return(TRUE) 640 | 641 | # load failed; inform user we're about to bootstrap 642 | prefix <- paste("# 
Bootstrapping renv", version) 643 | postfix <- paste(rep.int("-", 77L - nchar(prefix)), collapse = "") 644 | header <- paste(prefix, postfix) 645 | message(header) 646 | 647 | # perform bootstrap 648 | bootstrap(version, libpath) 649 | 650 | # exit early if we're just testing bootstrap 651 | if (!is.na(Sys.getenv("RENV_BOOTSTRAP_INSTALL_ONLY", unset = NA))) 652 | return(TRUE) 653 | 654 | # try again to load 655 | if (requireNamespace("renv", lib.loc = libpath, quietly = TRUE)) { 656 | message("* Successfully installed and loaded renv ", version, ".") 657 | return(renv::load()) 658 | } 659 | 660 | # failed to download or load renv; warn the user 661 | msg <- c( 662 | "Failed to find an renv installation: the project will not be loaded.", 663 | "Use `renv::activate()` to re-initialize the project." 664 | ) 665 | 666 | warning(paste(msg, collapse = "\n"), call. = FALSE) 667 | 668 | }) 669 | -------------------------------------------------------------------------------- /renv/settings.dcf: -------------------------------------------------------------------------------- 1 | external.libraries: 2 | ignored.packages: 3 | package.dependency.fields: Imports, Depends, LinkingTo 4 | r.version: 5 | snapshot.type: implicit 6 | use.cache: TRUE 7 | vcs.ignore.library: TRUE 8 | vcs.ignore.local: TRUE 9 | --------------------------------------------------------------------------------
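A minimal sketch of rebuilding the environment before re-running either experiment, assuming the repository root is the working directory and that `_drake-kj.R` / `_drake-raschka.R` still run against the updated package versions noted in the README (renv/activate.R above is sourced automatically by the project's .Rprofile when R starts):

```r
# restore the package library pinned in renv.lock
renv::restore()

# build the Kuhn-Johnson performance-experiment plan;
# substitute source = "_drake-raschka.R" to build the Raschka plan
drake::r_make(source = "_drake-kj.R")
```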