├── .Rprofile ├── .gitignore ├── README.Rmd ├── README.md ├── README_files └── figure-gfm │   ├── kj-patch-1.png │   ├── kj_patch_kj-1.png │   ├── perf-error-line-1.png │   ├── perf_bt_charts-1.png │   ├── perf_build_times-1.png │   └── unnamed-chunk-1-1.png ├── _drake-kj.R ├── _drake-raschka.R ├── data └── fivek-simdat.pickle ├── duration-experiment ├── kuhn-johnson │   ├── nested-cv-h2o-kj.R │   ├── nested-cv-parsnip-kj.R │   ├── nested-cv-ranger-kj.R │   ├── nested-cv-sklearn-kj.R │   └── nested-cv-tune-kj.R ├── outputs │   ├── 0225-results.png │   ├── duration-pkg-tbl.png │   └── duration-runs.rds ├── package-sources-gt-tbl.R └── raschka │   ├── nested-cv-kj-raschka.R │   ├── nested-cv-mlr3-raschka.R │   ├── nested-cv-py-raschka.py │   └── nested-cv-retic-raschka.R ├── environment.yml ├── grids ├── elast-latin-params.pickle └── rf-latin-params.pickle ├── images └── ncv.png ├── nested-cross-validation-comparison.Rproj ├── palettes └── Forest Floor.ase ├── performance-experiment ├── Kuhn-Johnson │   ├── plan-kj.R │   └── r_make-kj.R ├── Raschka │   ├── plan-raschka.R │   └── r_make-raschka.R ├── functions │   ├── create-grids.R │   ├── create-models.R │   ├── create-ncv-objects.R │   ├── inner-tune.R │   ├── mlbench-data.R │   ├── ncv-compare.R │   ├── outer-cv.R │   └── run-ncv.R ├── output │   ├── perf-exp-output-r.csv │   └── perf-exp-output.csv └── packages.R ├── renv.lock └── renv ├── .gitignore ├── activate.R └── settings.dcf /.Rprofile: -------------------------------------------------------------------------------- 1 | source("renv/activate.R") 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | .env 6 | .drake 7 | .drake-raschka 8 | README.html 9 | ec2-ssh-raw.log 10 | README_cache 11 | check-results.R 12 | perf-exp-output-backup.rds -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | --- 4 | 5 | # Nested Cross-Validation: Comparing Methods and Implementations 6 | 7 | ![](images/ncv.png) 8 | [![DOI](https://zenodo.org/badge/242267104.svg)](https://zenodo.org/badge/latestdoi/242267104) 9 | 10 | Experiments conducted in May 2020. Packages in renv.lock had to be updated in August 2021, but all scripts haven't been re-run to make sure everything still works. So, some refactoring of code may be necessary in order to reproduce results (e.g. {future} progress bar implementation and plan options have changed). 11 | 12 | Nested cross-validation (or double cross-validation) has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Using standard methods such as k-fold cross-validation in these cases may result in substantial increases in optimization bias: the more models that are trained on a fold, the greater the opportunity for a model to achieve a low score by chance. Nested cross-validation has been shown to produce less biased, out-of-sample error estimates even using datasets with only hundreds of rows and therefore gives a better estimate of generalization performance. The primary issue with this technique is that it is usually computationally expensive, with potentially tens of thousands of models being trained during the process.
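To make the loop structure concrete, here is a minimal sketch of what the scripts in this repo do, using the Kuhn-Johnson resampling settings from the duration-experiment scripts (the `sim_data()` helper is defined in those scripts; the chunk is display-only and isn't evaluated when knitting):

```{r, eval=FALSE}
library(rsample)

# outer loop estimates out-of-sample error; inner loop is only used for tuning
ncv <- nested_cv(sim_data(100),
                 outside = vfold_cv(v = 10, repeats = 2),  # outer loop
                 inside  = bootstraps(times = 25))         # inner loop

# for each outer fold: grid-search on its inner resamples, refit on the outer
# analysis set with the winning hyperparameters, then score on the assessment set
```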
13 | 14 | While researching this technique, I found two slightly different variations of performing nested cross-validation — one authored by [Sebastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://www.tidymodels.org/learn/work/nested-resampling/). After the nested cross-validation procedure finishes and an algorithm is chosen, Raschka performs an extra k-fold cross-validation using the inner-loop cv strategy on the entire training set in order to tune his final model. Therefore, the hyperparameter tuning that takes place in the inner-loop during nested cross-validation is only in service of algorithm selection. Kuhn-Johnson uses a majority vote: whichever set of hyperparameter values has been chosen most often during the inner-loop tuning procedure is the set used to fit the final model. The other differences are just the number of folds/resamples used in the outer and inner loops, which are essentially tuning parameters. 15 | 16 | Various elements of the technique affect the run times and performance. These include: 17 | 18 | 1. Hyperparameter value grids 19 | 2. Grid search strategy 20 | 3. Inner-Loop CV strategy 21 | 4. Outer-Loop CV strategy 22 | 23 | I'll be examining two aspects of nested cross-validation: 24 | 25 | 1. Duration: Find out which packages and combinations of model functions give us the fastest implementation of each method. 26 | 2. Performance: First, develop a testing framework. Then, for a given data generating process, determine how large a sample size is needed to obtain a reasonably accurate out-of-sample error estimate. Also, determine how many repeats in the outer-loop cv strategy should be used to calculate this error estimate. 27 | 28 | The results from these experiments should give us an idea about which methodology, model packages, and compute specifications will produce lower training times, lower costs, and lower generalization error. 29 | 30 | 31 | ## Recommendations: 32 | * For faster training times, use {mlr3} (caveat: see Discussion) or other R model packages outside of the {tidymodels} ecosystem and code the nested cross-validation loops manually (Code: [mlr3](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/raschka/nested-cv-mlr3-raschka.R), [ranger-kj](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R), [Kuhn-Johnson](https://www.tidymodels.org/learn/work/nested-resampling/)). 33 | * Choose compute resources with large amounts of RAM instead of opting for powerful processors. From the AWS CPU product line, I found the r5.#xlarge instances ran fastest. The most efficient number of vCPUs may vary according to the algorithm. 34 | * For the data in this experiment with row numbers in the low thousands, Raschka's method performed just as well as Kuhn-Johnson's but was substantially faster. 35 | * For the data in this experiment with row numbers in the hundreds, Raschka's method with at least 3 repeats performed just as well as Kuhn-Johnson's but was still substantially faster even with the repeats. 36 | 37 | 38 | 39 | ## Duration 40 | #### Experiment details: 41 | 42 | * Random Forest and Elastic Net Regression algorithms 43 | * Both algorithms are tuned with 100x2 hyperparameter grids using a latin hypercube design.
44 | * From {mlbench}, I'm using the generated data set, friedman1, from Friedman's Multivariate Adaptive Regression Splines (MARS) paper. 45 | * Kuhn-Johnson 46 | + 100 observations: 10 features, numeric target variable 47 | + outer loop: 2 repeats, 10 folds 48 | + inner loop: 25 bootstrap resamples 49 | * Raschka 50 | + 5000 observations: 10 features, numeric target variable 51 | + outer loop: 5 folds 52 | + inner loop: 2 folds 53 | 54 | The sizes of the data sets are the same as those in the original scripts by the authors. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer-loop * number of folds/resamples in the inner-loop) are trained for each algorithm — using Raschka's, 1,001 models for each algorithm. The one extra model in the Raschka variation is due to his method of choosing the hyperparameter values for the final model. 55 | 56 | [MLFlow](https://mlflow.org/docs/latest/index.html) is used to keep track of the duration (seconds) of each run along with the implementation and method used. 57 | 58 | ![](duration-experiment/outputs/0225-results.png) 59 | 60 | ![](duration-experiment/outputs/duration-pkg-tbl.png) 61 | 62 | ```{r, echo=FALSE, message=FALSE} 63 | pacman::p_load(dplyr, ggplot2, patchwork, stringr, tidytext) 64 | 65 | runs_raw <- readr::read_rds("duration-experiment/outputs/duration-runs.rds") 66 | 67 | 68 | 69 | runs <- runs_raw %>% 70 | mutate(duration = round(duration/60, 2), 71 | implementation = as.factor(str_to_title(implementation)), 72 | implementation = reorder_within(implementation, duration, method)) 73 | 74 | 75 | raschka <- runs %>% 76 | filter(method == "raschka") %>% 77 | ggplot(aes(y = duration, x = implementation, label = duration)) + 78 | geom_bar(aes(color = after_scale(prismatic::clr_darken(rep("#195198",4), 0.3))), fill = "#195198", stat = "identity", width = 0.50) + 79 | coord_flip() + 80 | scale_x_reordered() + 81 | geom_text(hjust = 1.3, size = 3.5, color = "white") + 82 | labs(x = NULL, y = NULL, 83 | title = "Raschka") + 84 | theme(plot.title = element_text(size = rel(0.9))) 85 | 86 | 87 | kj <- runs %>% 88 | filter(method == "kj") %>% 89 | ggplot(aes(y = duration, x = implementation, label = duration)) + 90 | geom_bar(aes(color = after_scale(prismatic::clr_darken(rep("#BD9865",5), 0.3))), stat = "identity", width = 0.50, fill = "#BD9865") + 91 | coord_flip() + 92 | scale_x_reordered() + 93 | geom_text(hjust = 1.3, size = 3.5, color = "black") + 94 | labs(x = NULL, y = NULL, 95 | title = "Kuhn-Johnson") + 96 | theme(plot.title = element_text(size = rel(0.9))) 97 | 98 | durations <- raschka + kj + 99 | plot_annotation(title = "Durations", 100 | subtitle = "minutes") & 101 | theme(axis.ticks = element_blank(), 102 | axis.text.x = element_blank(), 103 | panel.background = element_rect(fill = "ivory", 104 | colour = "ivory"), 105 | plot.background = element_rect(fill = "ivory"), 106 | plot.subtitle = element_text(size = rel(0.85)), 107 | panel.border = element_blank(), 108 | panel.grid.major = element_blank(), 109 | panel.grid.minor = element_blank() 110 | ) 111 | durations 112 | 113 | 114 | ``` 115 | 116 | #### Duration Results: 117 | 118 | * For the Raschka method, the {mlr3} comes in first with {ranger}/{parsnip} coming in a close second. 119 | * For the Kuhn-Johnson method, {ranger}/{parsnip} is clearly fastest. 120 | * This was my first time using the {reticulate} package, and I wanted to see if there was any speed penalty for using its api instead of just running a straight Python script. 
There doesn't appear to be any. 121 | * {h2o} and {sklearn} are surprisingly slow. If the data size were larger, I think {h2o} would be more competitive. 122 | * The {tidymodels} packages, {parsnip} and {tune}, add substantial overhead. 123 | 124 | 125 | ## Performance 126 | 127 | #### Experiment details: 128 | 129 | * The same data, algorithms, and hyperparameter grids are used. 130 | * The fastest implementation of each method is used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy. 131 | * The {mlr3} implementation is the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation is close. To simplify, I am using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods. 132 | * The chosen algorithm, with its selected hyperparameters, is fit on the entire training set, and the resulting final model predicts on a 100K row Friedman dataset. 133 | * The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset is calculated for each combination of repeat, data size, and method. 134 | * To make this experiment manageable in terms of runtimes, I am using AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for the Random Forest. 135 | + Also see the Discussion section 136 | * Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and the impermanent nature of my internet connection, it's nice to be able to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate (a simplified sketch of the plan follows).
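The gist of the {drake} setup is a plan whose targets are crossed over sample size and repeat count, so each combination becomes a separately cached subtarget. Below is a simplified sketch of that idea; the real plan lives in performance-experiment/Kuhn-Johnson/plan-kj.R, and the `run_ncv()` signature shown here is assumed for illustration rather than copied from that file (the sizes and repeat counts match the ones discussed in the results).

```{r, eval=FALSE}
library(drake)

plan <- drake_plan(
  ncv_results = target(
    run_ncv(n = sizes, repeats = reps),
    # static branching: one cached subtarget per sample-size/repeat combination
    transform = cross(sizes = c(100, 800, 2000, 5000), reps = c(1, 2, 3, 4, 5))
  )
)

make(plan)  # finished subtargets stay cached, so an interrupted run resumes where it left off
```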
137 | 138 | ```{r perf_build_times_kj, echo=FALSE, message=FALSE} 139 | 140 | pacman::p_load(dplyr, purrr, lubridate, ggplot2, ggfittext, drake, patchwork) 141 | bt <- build_times(starts_with("ncv_results"), digits = 4) 142 | 143 | subtarget_bts <- bt %>% 144 | filter(stringr::str_detect(target, pattern = "[0-9]_([0-9]|[a-z])")) %>% 145 | select(target, elapsed) 146 | 147 | subtargets_raw <- map_dfr(subtarget_bts$target, function(x) { 148 | results <- readd(x, character_only = TRUE) %>% 149 | mutate(subtarget = x) %>% 150 | select(subtarget, everything()) 151 | 152 | }) %>% 153 | inner_join(subtarget_bts, by = c("subtarget" = "target")) 154 | 155 | subtargets <- subtargets_raw %>% 156 | mutate(repeats = factor(repeats), 157 | n = factor(n), 158 | elapsed = round(as.numeric(elapsed)/3600, 2), 159 | percent_error = round(delta_error/oos_error, 3)) 160 | 161 | readr::write_csv(subtargets, "performance-experiment/output/perf-exp-output.csv") 162 | # readr::write_rds(subtargets, "performance-experiment/output/perf-exp-output-backup.rds") 163 | 164 | ``` 165 | 166 | ```{r perf_bt_charts_kj, echo=FALSE, message=FALSE} 167 | 168 | fill_colors <- unname(swatches::read_ase("palettes/Forest Floor.ase")) 169 | 170 | b <- ggplot(subtargets, aes(y = elapsed, x = repeats, 171 | fill = n, label = elapsed)) + 172 | geom_col(position = position_dodge(width = 0.85)) + 173 | scale_fill_manual(values = fill_colors[4:7]) + 174 | geom_bar_text(position = "dodge", min.size = 9, 175 | place = "right", contrast = TRUE) + 176 | coord_flip() + 177 | labs(y = "Runtime (hrs)", x = "Repeats", 178 | fill = "Sample Size") + 179 | theme(legend.position = "top", 180 | legend.background = element_rect(fill = "ivory"), 181 | legend.key = element_rect(fill = "ivory"), 182 | axis.ticks = element_blank(), 183 | panel.background = element_rect(fill = "ivory", 184 | colour = "ivory"), 185 | plot.background = element_rect(fill = "ivory"), 186 | panel.border = element_blank(), 187 | panel.grid.major = element_blank(), 188 | panel.grid.minor = element_blank() 189 | ) 190 | 191 | ``` 192 | 193 | ```{r perf_error_line_kj, echo=FALSE, message=FALSE} 194 | e <- ggplot(subtargets, aes(x = repeats, y = percent_error, group = n)) + 195 | geom_point(aes(color = n), size = 3) + 196 | geom_line(aes(color = n), size = 2) + 197 | expand_limits(y = c(0, 0.10)) + 198 | scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) + 199 | scale_color_manual(values = fill_colors[4:7]) + 200 | labs(y = "Percent Error", x = "Repeats", 201 | color = "Sample Size") + 202 | theme(legend.position = "top", 203 | legend.background = element_rect(fill = "ivory"), 204 | legend.key = element_rect(fill = "ivory"), 205 | axis.ticks = element_blank(), 206 | panel.background = element_rect(fill = "ivory", 207 | color = "ivory"), 208 | plot.background = element_rect(fill = "ivory"), 209 | panel.border = element_blank(), 210 | panel.grid.major = element_blank(), 211 | panel.grid.minor = element_blank() 212 | ) 213 | ``` 214 | 215 | ```{r kj_patch_kj, echo=FALSE, fig.width=12, fig.height=7} 216 | b + e + plot_layout(guides = "auto") + 217 | plot_annotation(title = "Kuhn-Johnson") & 218 | theme(legend.position = "top", 219 | legend.text = element_text(size = 12), 220 | axis.text.x = element_text(size = 11, 221 | face = "bold"), 222 | axis.text.y = element_text(size = 11, 223 | face = "bold"), 224 | panel.background = element_rect(fill = "ivory", 225 | color = "ivory"), 226 | plot.background = element_rect(fill = "ivory"),) 227 | ``` 228 | 229 | #### Performance 
Results (Kuhn-Johnson): 230 | 231 | * Runtimes for n = 100 and n = 800 are close, and there's a large jump in runtime going from n = 2000 to n = 5000. 232 | * The number of repeats has little effect on the amount of percent error. 233 | * For n = 100, there is substantially more variation in percent error than in the other sample sizes. 234 | * While there is a large runtime cost that comes with increasing the sample size from 2000 to 5000 observations, it doesn't seem to provide any benefit in gaining a more accurate estimate of the out-of-sample error. 235 | 236 | 237 | ```{r perf_build_times_r, echo=FALSE, message=FALSE} 238 | 239 | cache_raschka <- drake_cache(path = ".drake-raschka") 240 | 241 | bt_r <- build_times(starts_with("ncv_results"), 242 | digits = 4, cache = cache_raschka) 243 | 244 | subtarget_bts_r <- bt_r %>% 245 | filter(stringr::str_detect(target, pattern = "[0-9]_([0-9]|[a-z])")) %>% 246 | select(target, elapsed) 247 | 248 | subtargets_raw_r <- map_dfr(subtarget_bts_r$target, function(x) { 249 | results <- readd(x, character_only = TRUE, 250 | cache = cache_raschka) %>% 251 | mutate(subtarget = x) %>% 252 | select(subtarget, everything()) 253 | 254 | }) %>% 255 | inner_join(subtarget_bts_r, by = c("subtarget" = "target")) 256 | 257 | subtargets_r <- subtargets_raw_r %>% 258 | mutate(repeats = factor(repeats), 259 | n = factor(n), 260 | elapsed = round(as.numeric(elapsed)/60, 2), 261 | percent_error = round(delta_error/oos_error, 3)) 262 | 263 | readr::write_csv(subtargets_r, "performance-experiment/output/perf-exp-output-r.csv") 264 | # readr::write_rds(subtargets, "performance-experiment/output/perf-exp-output-backup-r.rds") 265 | 266 | ``` 267 | 268 | ```{r perf_bt_charts_r, echo=FALSE, message=FALSE} 269 | 270 | b_r <- ggplot(subtargets_r, aes(y = elapsed, x = repeats, 271 | fill = n, label = elapsed)) + 272 | geom_col(position = position_dodge(width = 0.85)) + 273 | scale_fill_manual(values = fill_colors[4:7]) + 274 | geom_bar_text(position = "dodge", min.size = 9, 275 | place = "right", contrast = TRUE) + 276 | coord_flip() + 277 | labs(y = "Runtime (minutes)", x = "Repeats", 278 | fill = "Sample Size") + 279 | theme(legend.position = "top", 280 | legend.background = element_rect(fill = "ivory"), 281 | legend.key = element_rect(fill = "ivory"), 282 | axis.ticks = element_blank(), 283 | panel.background = element_rect(fill = "ivory", 284 | colour = "ivory"), 285 | plot.background = element_rect(fill = "ivory"), 286 | panel.border = element_blank(), 287 | panel.grid.major = element_blank(), 288 | panel.grid.minor = element_blank() 289 | ) 290 | 291 | ``` 292 | 293 | ```{r perf-error-line_r, echo=FALSE, message=FALSE} 294 | e_r <- ggplot(subtargets_r, aes(x = repeats, y = percent_error, group = n)) + 295 | geom_point(aes(color = n), size = 3) + 296 | geom_line(aes(color = n), size = 2) + 297 | expand_limits(y = c(0, 0.10)) + 298 | scale_y_continuous(labels = scales::percent_format(accuracy = 0.1), 299 | breaks = seq(0,0.125, by=0.025)) + 300 | scale_color_manual(values = fill_colors[4:7]) + 301 | labs(y = "Percent Error", x = "Repeats", 302 | color = "Sample Size") + 303 | theme(legend.position = "top", 304 | legend.background = element_rect(fill = "ivory"), 305 | legend.key = element_rect(fill = "ivory"), 306 | axis.ticks = element_blank(), 307 | panel.background = element_rect(fill = "ivory", 308 | color = "ivory"), 309 | plot.background = element_rect(fill = "ivory"), 310 | panel.border = element_blank(), 311 | panel.grid.major = element_blank(), 312 | 
panel.grid.minor = element_blank() 313 | ) 314 | ``` 315 | 316 | ```{r kj-patch, echo=FALSE, fig.width=12, fig.height=7} 317 | b_r + e_r + plot_layout(guides = "auto") + 318 | plot_annotation(title = "Raschka") & 319 | theme(legend.position = "top", 320 | legend.text = element_text(size = 12), 321 | axis.text.x = element_text(size = 11, 322 | face = "bold"), 323 | axis.text.y = element_text(size = 11, 324 | face = "bold"), 325 | panel.background = element_rect(fill = "ivory", 326 | color = "ivory"), 327 | plot.background = element_rect(fill = "ivory"),) 328 | ``` 329 | 330 | 331 | #### Performance Results (Raschka): 332 | 333 | * The longest runtime is under 30 minutes, so runtime isn't as large a consideration if we are only comparing a few algorithms. 334 | * There isn't much difference in runtime between n = 100 and n = 2000. 335 | * For n = 100, there's a relatively large change in percent error when going from 1 repeat to 2 repeats. The error estimate then stabilizes for repeats 3 through 5. 336 | * n = 5000 gives poorer out-of-sample error estimates than n = 800 and n = 2000 for all values of repeats. 337 | * n = 800 remains under 2.5% error for all repeat values, but also shows considerable volatility. 338 | 339 | 340 | ## Discussion 341 | * {mlr3} wasn't included in the Kuhn-Johnson section of the duration experiment, because with the extra folds/resamples, the RAM usage rapidly increases to the maximum and either locks up or slows the training time tremendously. I haven't explored this further. 342 | * The elasticnet model was slower to train than the random forest for the 100 row dataset. Compute resources should be optimized for each algorithm. For example, the number of vCPUs capable of being utilized by a random forest algorithm is much higher than the number for an elasticnet algorithm. The elasticnet only used the number of vCPUs that matched the number of training folds while the random forest used all available vCPUs. Using a sparse matrix or another package (e.g. biglasso) might help to lower training times for elasticnet. 343 | * Adjusting the inner-loop strategy seems to have the most effect on the volatility of the results. 344 | * For data sizes of a few thousand rows, Kuhn-Johnson trains 50x as many models and takes 8x longer to run for a similar amount of generalization error compared to the Raschka method. The similar results in generalization error might be specific to this dataset though. 345 | + Kuhn-Johnson's runtime starts to really balloon once you get into datasets with over a thousand rows. 346 | + The extra folds in the outer loop made a huge difference. With Kuhn-Johnson, the runtimes were hours, and with Raschka's, it was minutes. 347 | * For smaller datasets, you should have at least 3 repeats when running Raschka's method. 348 | * This is just one dataset, but I still found it surprising how little a difference repeats made in reducing generalization error. The benefit only kicked in with the dataset that had hundreds of rows. 349 | 350 | 351 | ## Next Steps 352 | * The performance experimental framework used here could be useful as a way to gain insight into the amounts and types of resources that a project's first steps might require. For example, testing simulated data before collection of actual data begins. I might try to see if there's much difficulty extending it to {mlr3} (assuming I can figure out the RAM issue) and Python. 353 | * Experiment with Raschka's method more.
354 | + Raschka's method using the majority vote method from Kuhn-Johnson for the final hyperparameter settings might be an additional optimization step. If the final k-fold cv can be discarded without much loss in generalization error, then maybe training times can be shortened further. 355 | + There's probably room to increase the number of folds in the inner-loop of Raschka's method in order to gain more stable results while keeping the training time comparatively low. 356 | * There should be a version of this technique that's capable of working for time series. I have ideas, so it might be something I'll work on for a future project. 357 | 358 | 359 | 360 | 361 | ## References 362 | 363 | Boulesteix, AL, and C Strobl. 2009. "Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High-Dimensional Prediction." BMC Medical Research Methodology 9 (1): 85. [link](https://www.researchgate.net/publication/40756303_Optimal_classifier_selection_and_negative_bias_in_error_rate_estimation_An_empirical_study_on_high-dimensional_prediction) 364 | 365 | Sebastian Raschka, "STAT 479 Statistical Tests and Algorithm Comparison," (Lecture Notes, University of Wisconsin-Madison, Fall 2019). [link](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/11-eval4-algo__notes.pdf) 366 | 367 | Sudhir Varma and Richard Simon. "Bias in error estimation when using cross-validation for model selection". In: BMC bioinformatics 7.1 (2006). p. 91. [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-91) 368 | 369 | 370 | 371 | 372 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Nested Cross-Validation: Comparing Methods and Implementations 3 | 4 | ![](images/ncv.png) 5 | [![DOI](https://zenodo.org/badge/242267104.svg)](https://zenodo.org/badge/latestdoi/242267104) 6 | 7 | Experiments conducted in May 2020. Packages in renv.lock had to be 8 | updated in August 2021, but all scripts haven’t been re-run to make sure 9 | everything still works. So, some refactoring of code may be necessary in 10 | order to reproduce results (e.g. {future} progress bar implementation 11 | and plan options have changed). 12 | 13 | Nested cross-validation (or double cross-validation) has become a 14 | recommended technique for situations in which the size of our dataset is 15 | insufficient to simultaneously handle hyperparameter tuning and 16 | algorithm comparison. Using standard methods such as k-fold 17 | cross-validation in these cases may result in substantial increases in 18 | optimization bias: the more models that are trained on a fold, the 19 | greater the opportunity for a model to achieve a low score by 20 | chance. Nested cross-validation has been shown to produce less biased, 21 | out-of-sample error estimates even using datasets with only hundreds of 22 | rows and therefore gives a better estimate of generalization performance. 23 | The primary issue with this technique is that it is usually 24 | computationally expensive, with potentially tens of thousands of models 25 | being trained during the process.
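The optimization-bias claim above can be illustrated with a toy simulation (not part of this repo): if many candidate hyperparameter settings with the same true error are scored on one small fold, the best observed score is optimistic, which is exactly what scoring the winner on untouched outer folds guards against.

``` r
set.seed(1)
# 100 candidates, all with a true error of 1.0; fold-level estimates are noisy
picked <- replicate(5000, min(rnorm(100, mean = 1, sd = 0.1)))
mean(picked)  # roughly 0.75, so the selected model looks better than it is
```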
26 | 27 | While researching this technique, I found two slightly different 28 | variations of performing nested cross-validation — one authored by 29 | [Sebastian 30 | Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) 31 | and the other by [Max Kuhn and Kjell 32 | Johnson](https://www.tidymodels.org/learn/work/nested-resampling/). 33 | After the nested cross-validation procedure finishes and an algorithm is 34 | chosen, Raschka performs an extra k-fold cross-validation using the inner-loop 35 | cv strategy on the entire training set in order to tune his final model. 36 | Therefore, the hyperparameter tuning that takes place in the inner-loop 37 | during nested cross-validation is only in service of algorithm 38 | selection. Kuhn-Johnson uses a majority vote: whichever set of 39 | hyperparameter values has been chosen most often during the inner-loop 40 | tuning procedure is the set used to fit the final model. The 41 | other differences are just the number of folds/resamples used in the 42 | outer and inner loops, which are essentially tuning parameters. 43 | 44 | Various elements of the technique affect the run times and performance. 45 | These include: 46 | 47 | 1. Hyperparameter value grids 48 | 2. Grid search strategy 49 | 3. Inner-Loop CV strategy 50 | 4. Outer-Loop CV strategy 51 | 52 | I’ll be examining two aspects of nested cross-validation: 53 | 54 | 1. Duration: Find out which packages and combinations of model 55 | functions give us the fastest implementation of each method. 56 | 2. Performance: First, develop a testing framework. Then, for a given 57 | data generating process, determine how large a sample size is 58 | needed to obtain a reasonably accurate out-of-sample error estimate. 59 | Also, determine how many repeats in the outer-loop cv strategy 60 | should be used to calculate this error estimate. 61 | 62 | The results from these experiments should give us an idea about which 63 | methodology, model packages, and compute specifications will produce 64 | lower training times, lower costs, and lower generalization error. 65 | 66 | ## Recommendations: 67 | 68 | - For faster training times, use {mlr3} (caveat: see Discussion) or 69 | other R model packages outside of the {tidymodels} ecosystem and 70 | code the nested cross-validation loops manually (Code: 71 | [mlr3](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/raschka/nested-cv-mlr3-raschka.R), 72 | [ranger-kj](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R), 73 | [Kuhn-Johnson](https://www.tidymodels.org/learn/work/nested-resampling/)). 74 | - 2022-11-11 {parsnip} has undergone changes potentially leading to a 3-fold speed-up ([link](https://parsnip.tidymodels.org/news/index.html#parsnip-103)) 75 | - Choose compute resources with large amounts of RAM instead of opting 76 | for powerful processors. From the AWS CPU product line, I found the 77 | r5.\#xlarge instances ran fastest. The most efficient number of 78 | vCPUs may vary according to the algorithm. 79 | - For the data in this experiment with row numbers in the low 80 | thousands, Raschka’s method performed just as well as Kuhn-Johnson’s 81 | but was substantially faster.
82 | - For the data in this experiment with row numbers in the hundreds, 83 | Raschka’s method with at least 3 repeats performed just as well as 84 | Kuhn-Johnson’s but was still substantially faster even with the 85 | repeats. 86 | 87 | ## Duration 88 | 89 | #### Experiment details: 90 | 91 | - Random Forest and Elastic Net Regression algorithms 92 | - Both algorithms are tuned with 100x2 hyperparameter grids using a 93 | latin hypercube design. 94 | - From {mlbench}, I’m using the generated data set, friedman1, from 95 | Friedman’s Multivariate Adaptive Regression Splines (MARS) paper. 96 | - Kuhn-Johnson 97 | - 100 observations: 10 features, numeric target variable 98 | - outer loop: 2 repeats, 10 folds 99 | - inner loop: 25 bootstrap resamples 100 | - Raschka 101 | - 5000 observations: 10 features, numeric target variable 102 | - outer loop: 5 folds 103 | - inner loop: 2 folds 104 | 105 | The sizes of the data sets are the same as those in the original scripts 106 | by the authors. Using Kuhn-Johnson, 50,000 models (grid size \* number 107 | of repeats \* number of folds in the outer-loop \* number of 108 | folds/resamples in the inner-loop) are trained for each algorithm — 109 | using Raschka’s, 1,001 models for each algorithm. The one extra model in 110 | the Raschka variation is due to his method of choosing the 111 | hyperparameter values for the final model. 112 | 113 | [MLFlow](https://mlflow.org/docs/latest/index.html) is used to keep 114 | track of the duration (seconds) of each run along with the 115 | implementation and method used. 116 | 117 | ![](duration-experiment/outputs/0225-results.png) 118 | 119 | ![](duration-experiment/outputs/duration-pkg-tbl.png) 120 | 121 | ![](README_files/figure-gfm/unnamed-chunk-1-1.png) 122 | 123 | #### Duration Results: 124 | 125 | - For the Raschka method, the {mlr3} comes in first with 126 | {ranger}/{parsnip} coming in a close second. 127 | - For the Kuhn-Johnson method, {ranger}/{parsnip} is clearly 128 | fastest. 129 | - This was my first time using the {reticulate} package, and I wanted 130 | to see if there was any speed penalty for using its api instead of 131 | just running a straight Python script. There doesn’t appear to be 132 | any. 133 | - {h2o} and {sklearn} are surprisingly slow. If the data size were 134 | larger, I think {h2o} would be more competitive. 135 | - The {tidymodels} packages, {parsnip} and {tune}, add substantial 136 | overhead. 137 | - 2022-11-11 {parsnip} has undergone changes potentially leading to a 3-fold speed-up ([link](https://parsnip.tidymodels.org/news/index.html#parsnip-103)) 138 | 139 | ## Performance 140 | 141 | #### Experiment details: 142 | 143 | - The same data, algorithms, and hyperparameter grids are used. 144 | - The fastest implementation of each method is used in running a 145 | nested cross-validation with different sizes of data ranging from 146 | 100 to 5000 observations and different numbers of repeats of the 147 | outer-loop cv strategy. 148 | - The {mlr3} implementation is the fastest for Raschka’s method, 149 | but the Ranger-Kuhn-Johnson implementation is close. To 150 | simplify, I am using 151 | [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) 152 | for both methods. 153 | - The chosen algorithm with hyperparameters is fit on the entire 154 | training set, and the resulting final model predicts on a 100K row 155 | Friedman dataset. 
156 | - The percent error between the the average mean absolute error (MAE) 157 | across the outer-loop folds and the MAE of the predictions on this 158 | 100K dataset is calculated for each combination of repeat, data 159 | size, and method. 160 | - To make this experiment manageable in terms of runtimes, I am using 161 | AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge 162 | for Random Forest. 163 | - Also see the Discussion section 164 | - Iterating through different numbers of repeats, sample sizes, and 165 | methods makes a functional approach more appropriate than running 166 | imperative scripts. Also, given the long runtimes and impermanent 167 | nature of my internet connection, it would also be nice to cache 168 | each iteration as it finishes. The 169 | [{drake}](https://github.com/ropensci/drake) package is superb on 170 | both counts, so I’m using it to orchestrate. 171 | 172 | ![](README_files/figure-gfm/kj_patch_kj-1.png) 173 | 174 | #### Performance Results (Kuhn-Johnson): 175 | 176 | - Runtimes for n = 100 and n = 800 are close, and there’s a large jump 177 | in runtime going from n = 2000 to n = 5000. 178 | - The number of repeats has little effect on the amount of percent 179 | error. 180 | - For n = 100, there is substantially more variation in percent error 181 | than in the other sample sizes. 182 | - While there is a large runtime cost that comes with increasing the 183 | sample size from 2000 to 5000 observations, it doesn’t seem to 184 | provide any benefit in gaining a more accurate estimate of the 185 | out-of-sample error. 186 | 187 | ![](README_files/figure-gfm/kj-patch-1.png) 188 | 189 | #### Performance Results (Raschka): 190 | 191 | - The longest runtime is under 30 minutes, so runtime isn’t as large 192 | of a consideration if we are only comparing a few algorithms. 193 | - There isn’t much difference in runtime between n = 100 and n = 194 | 2000. 195 | - For n = 100, there’s a relatively large change in percent error when 196 | going from 1 repeat to 2 repeats. The error estimate then stabilizes 197 | for repeats 3 through 5. 198 | - n = 5000 gives poorer out-of-sample error estimates than n = 800 and 199 | n = 2000 for all values of repeats. 200 | - n = 800 remains under 2.5% percent error for all repeat values, but 201 | also shows considerable volatility. 202 | 203 | ## Discussion 204 | 205 | - {mlr3} wasn’t included in the Kuhn-Johnson section of the duration 206 | experiment, because with the extra folds/resamples, the RAM usage 207 | rapidly increases to the maximum and either locks-up or slows the 208 | training time tremendously. I haven’t explored this further. 209 | - The elasticnet model was slower to train than the random forest for 210 | the 100 row dataset. Compute resources should be optimized for each 211 | algorithm. For example, the number of vCPUs capable of being 212 | utilized by a random forest algorithm is much higher than number for 213 | an elasticnet algorithm. The elasticnet only used the number of 214 | vCPUs that matched the number of training folds while the random 215 | forest used all available vCPUs. Using a sparse matrix or another 216 | package (e.g. biglasso) might help to lower training times for 217 | elasticnet. 218 | - Adjusting the inner-loop strategy seems to have the most effect on 219 | the volatility of the results. 
- For data sizes of a few thousand rows, Kuhn-Johnson trains 50x as 221 | many models and takes 8x longer to run for a similar amount of 222 | generalization error compared to the Raschka method. The similar 223 | results in generalization error might be specific to this dataset 224 | though. 225 | - Kuhn-Johnson’s runtime starts to really balloon once you get 226 | into datasets with over a thousand rows. 227 | - The extra folds in the outer loop made a huge difference. With 228 | Kuhn-Johnson, the runtimes were hours, and with Raschka’s, it 229 | was minutes. 230 | - For smaller datasets, you should have at least 3 repeats when 231 | running Raschka’s method. 232 | - This is just one dataset, but I still found it surprising how little 233 | a difference repeats made in reducing generalization error. The 234 | benefit only kicked in with the dataset that had hundreds of rows. 235 | 236 | ## Next Steps 237 | 238 | - The performance experimental framework used here could be useful as 239 | a way to gain insight into the amounts and types of resources that a 240 | project’s first steps might require. For example, testing simulated 241 | data before collection of actual data begins. I might try to see if 242 | there’s much difficulty extending it to {mlr3} (assuming I can 243 | figure out the RAM issue) and Python. 244 | - Experiment with Raschka’s method more. 245 | - Raschka’s method using the majority vote method from 246 | Kuhn-Johnson for the final hyperparameter settings might be an 247 | additional optimization step. If the final k-fold cv can be 248 | discarded without much loss in generalization error, then maybe 249 | training times can be shortened further. 250 | - There’s probably room to increase the number of folds in the 251 | inner-loop of Raschka’s method in order to gain more stable 252 | results while keeping the training time comparatively low. 253 | - There should be a version of this technique that’s capable of 254 | working for time series. I have ideas, so it might be something I’ll 255 | work on for a future project. 256 | 257 | ## References 258 | 259 | Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and 260 | Negative Bias in Error Rate Estimation: An Empirical Study on 261 | High-Dimensional Prediction.” BMC Medical Research Methodology 9 (1): 262 | 85. 263 | [link](https://www.researchgate.net/publication/40756303_Optimal_classifier_selection_and_negative_bias_in_error_rate_estimation_An_empirical_study_on_high-dimensional_prediction) 264 | 265 | Sebastian Raschka, “STAT 479 Statistical Tests and Algorithm 266 | Comparison,” (Lecture Notes, University of Wisconsin-Madison, Fall 267 | 2019). 268 | [link](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/11-eval4-algo__notes.pdf) 269 | 270 | Sudhir Varma and Richard Simon. “Bias in error estimation when using 271 | cross-validation for model selection”. In: BMC bioinformatics 7.1 272 | (2006). p. 91.
273 | [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-91) 274 | -------------------------------------------------------------------------------- /README_files/figure-gfm/kj-patch-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/kj-patch-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/kj_patch_kj-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/kj_patch_kj-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/perf-error-line-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/perf-error-line-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/perf_bt_charts-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/perf_bt_charts-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/perf_build_times-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/perf_build_times-1.png -------------------------------------------------------------------------------- /README_files/figure-gfm/unnamed-chunk-1-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/README_files/figure-gfm/unnamed-chunk-1-1.png -------------------------------------------------------------------------------- /_drake-kj.R: -------------------------------------------------------------------------------- 1 | # drake make file for Kuhn-Johnson performance experiment 2 | 3 | 4 | # Notes: 5 | # 1. see plan-kj.R for more details on how this thing works 6 | # 2. 
link to {future} issue with instructions on special PuTTY settings, https://github.com/HenrikBengtsson/future/issues/370 7 | 8 | 9 | # load packages, functions, and drake plan 10 | source("performance-experiment/packages.R") 11 | source("performance-experiment/functions/mlbench-data.R") 12 | source("performance-experiment/functions/create-ncv-objects.R") 13 | source("performance-experiment/functions/create-models.R") 14 | source("performance-experiment/functions/create-grids.R") 15 | source("performance-experiment/functions/inner-tune.R") 16 | source("performance-experiment/functions/outer-cv.R") 17 | source("performance-experiment/functions/ncv-compare.R") 18 | source("performance-experiment/functions/run-ncv.R") 19 | source("performance-experiment/Kuhn-Johnson/plan-kj.R") 20 | 21 | 22 | 23 | 24 | set.seed(2019) 25 | 26 | # Using different compute sizes for each model 27 | ip1 <- Sys.getenv("GLMEC2IP") 28 | ip2 <- Sys.getenv("RFEC2IP") 29 | public_ips <- c(ip1, ip2) 30 | # ppk file converted by PuTTY from an AWS pem file 31 | ssh_private_key_file <- Sys.getenv("AWSKEYPATH") 32 | 33 | 34 | cl <- future::makeClusterPSOCK( 35 | 36 | ## Public IP numbers of EC2 instances 37 | public_ips, 38 | 39 | ## User name (always 'ubuntu') 40 | user = "ubuntu", 41 | 42 | ## Use private SSH key registered with AWS 43 | ## futureSettings is a saved PuTTY session with settings to keep ssh active 44 | rshcmd = c("plink", "-ssh", "-load", "futureSettings","-i", ssh_private_key_file), 45 | rshopts = c( 46 | "-sshrawlog", "ec2-ssh-raw.log" 47 | ), 48 | 49 | rscript_args = c("-e", shQuote(".libPaths('/home/rstudio/R/x86_64-pc-linux-gnu-library/3.6')") 50 | ), 51 | verbose = TRUE 52 | ) 53 | 54 | 55 | future::plan(list(tweak(cluster, workers = cl), multiprocess)) 56 | 57 | cache_kj <- drake_cache(path = ".drake") 58 | 59 | # verbose = 0 prints nothing, verbose = 1 prints message as each target completes; verbose = 2 adds a progress bar that tracks target completion 60 | drake_config( 61 | plan, 62 | verbose = 1, 63 | lock_envir = FALSE, 64 | jobs_preprocess = 7, 65 | cache = cache_kj 66 | ) 67 | 68 | -------------------------------------------------------------------------------- /_drake-raschka.R: -------------------------------------------------------------------------------- 1 | # drake make file for Raschka performance experiment 2 | 3 | 4 | # Notes: 5 | # 1. see plan-raschka.R for more details on how this thing works 6 | # 2. 
link to {future} issue with instructions on special PuTTY settings, https://github.com/HenrikBengtsson/future/issues/370 7 | 8 | 9 | # load packages, functions, and drake plan 10 | source("performance-experiment/packages.R") 11 | source("performance-experiment/functions/mlbench-data.R") 12 | source("performance-experiment/functions/create-ncv-objects.R") 13 | source("performance-experiment/functions/create-models.R") 14 | source("performance-experiment/functions/create-grids.R") 15 | source("performance-experiment/functions/inner-tune.R") 16 | source("performance-experiment/functions/outer-cv.R") 17 | source("performance-experiment/functions/ncv-compare.R") 18 | source("performance-experiment/functions/run-ncv.R") 19 | source("performance-experiment/Raschka/plan-raschka.R") 20 | 21 | 22 | 23 | 24 | 25 | set.seed(2019) 26 | 27 | # Using different compute sizes for each model 28 | ip1 <- Sys.getenv("GLMEC2IP") 29 | ip2 <- Sys.getenv("RFEC2IP") 30 | public_ips <- c(ip1, ip2) 31 | # ppk file converted by PuTTY from an AWS pem file 32 | ssh_private_key_file <- Sys.getenv("AWSKEYPATH") 33 | 34 | 35 | cl <- future::makeClusterPSOCK( 36 | 37 | ## Public IP numbers of EC2 instances 38 | public_ips, 39 | 40 | ## User name (always 'ubuntu') 41 | user = "ubuntu", 42 | 43 | ## Use private SSH key registered with AWS 44 | ## futureSettings is a saved PuTTY session with settings to keep ssh active 45 | rshcmd = c("plink", "-ssh", "-load", "futureSettings","-i", ssh_private_key_file), 46 | rshopts = c( 47 | "-sshrawlog", "ec2-ssh-raw.log" 48 | ), 49 | 50 | rscript_args = c("-e", shQuote(".libPaths('/home/rstudio/R/x86_64-pc-linux-gnu-library/3.6')") 51 | ), 52 | verbose = TRUE 53 | ) 54 | 55 | 56 | future::plan(list(tweak(cluster, workers = cl), multiprocess)) 57 | 58 | 59 | cache_raschka <- drake_cache(path = ".drake-raschka") 60 | 61 | # verbose = 0 prints nothing, verbose = 1 prints message as each target completes; verbose = 2 adds a progress bar that tracks target completion 62 | drake_config( 63 | plan, 64 | verbose = 1, 65 | lock_envir = FALSE, 66 | jobs_preprocess = 7, 67 | cache = cache_raschka 68 | ) 69 | 70 | 71 | -------------------------------------------------------------------------------- /data/fivek-simdat.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/data/fivek-simdat.pickle -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-h2o-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # H2O 5 | 6 | 7 | # Notes 8 | # 1. *** Make sure target variable is the last column in dataframe *** 9 | # 2. h2o_grid arg parallelism sets the number of models to be computed at a time in parallel. parallelism = 0 lets h2o decide and it will use all resources. 10 | # 3. *** For glm models, setting parallelism = 0 causes the grid search to hang. Manually set the number of models to compute in parallel *** 11 | # 4. H2O doesn't have a cv strategy that just fits the grid row by row. Your options are RandomDiscrete or Cartesian Grid. I set the parameter ranges so that using the Cartesian grid would create a 100 row grid to match the other scripts. 12 | # 5. 
Uses few cpu resources when tuning glmnet, but does maximize cpu usage for tuning the rf 13 | 14 | 15 | # Sections 16 | # 1. Set-up 17 | # 2. Error and Model Functions; 18 | # 3. Hyperparameter Grids 19 | # 4. Functions used in the loops 20 | # 5. Compare Algorithms 21 | 22 | 23 | 24 | 25 | #################################################### 26 | # Set-Up 27 | #################################################### 28 | 29 | 30 | # texts me when there's an error 31 | options(error = function() { 32 | library(RPushbullet) 33 | pbPost("note", "Error", geterrmessage()) 34 | if(!interactive()) stop(geterrmessage()) 35 | }) 36 | 37 | 38 | # start MLflow server 39 | sys::exec_background("mlflow server") 40 | Sys.sleep(10) 41 | 42 | 43 | library(tictoc) 44 | tic() 45 | 46 | pacman::p_load(glue, RPushbullet, dials, h2o, rsample, purrr, dplyr, mlflow) 47 | 48 | # make explicit the name of the exeriement to record to 49 | mlflow_set_experiment("ncv_duration") 50 | 51 | set.seed(2019) 52 | 53 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 54 | sim_data <- function(n) { 55 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 56 | tmp <- cbind(tmp$x, tmp$y) 57 | tmp <- as.data.frame(tmp) 58 | names(tmp)[ncol(tmp)] <- "y" 59 | tmp 60 | } 61 | 62 | # Use small data to tune and compare models 63 | small_dat <- sim_data(100) 64 | 65 | 66 | 67 | ncv_dat_10 <- rsample::nested_cv(small_dat, 68 | outside = vfold_cv(v = 10, repeats = 2), 69 | inside = bootstraps(times = 25)) 70 | 71 | 72 | # Start h2o cluster 73 | h2o.init() 74 | 75 | 76 | 77 | 78 | ################################################### 79 | # # Error and Model Functions 80 | ################################################### 81 | 82 | 83 | error_FUN <- function(model){ 84 | h2o.mae(model, valid = T) 85 | } 86 | 87 | 88 | # Distributed Random Forest 89 | 90 | rf_FUN <- function(x, y, anal_h2o, ass_h2o, params) { 91 | 92 | mtries <- params$mtries[[1]] 93 | ntrees <- params$ntrees[[1]] 94 | 95 | # h20 ususally needs unique ids or loops will return exact same values over and over 96 | modelId <- as.character(dqrng::dqrnorm(1)) 97 | 98 | h2o.show_progress() 99 | 100 | h2o.randomForest(x = x, 101 | y = y, 102 | training_frame = anal_h2o, 103 | model_id = modelId, 104 | validation_frame = ass_h2o, 105 | mtries = mtries, 106 | ntrees = ntrees) 107 | } 108 | 109 | 110 | # Elastic Net Regression 111 | 112 | glm_FUN <- function(x, y, anal_h2o, ass_h2o, params) { 113 | 114 | alpha <- params$alpha[[1]] 115 | lambda <- params$lambda[[1]] 116 | 117 | # h20 needs unique ids or loops will return exact same values over and over 118 | modelId <- as.character(dqrng::dqrnorm(1)) 119 | 120 | h2o.show_progress() 121 | 122 | h2o.glm(x = x, 123 | y = y, 124 | training_frame = anal_h2o, 125 | model_id = modelId, 126 | validation_frame = ass_h2o, 127 | alpha = alpha, 128 | lambda = lambda) 129 | } 130 | 131 | 132 | mod_FUN_list <- list(glm = glm_FUN, rf = rf_FUN) 133 | 134 | alg_list <- list(glm = "glm", rf = "drf") 135 | 136 | 137 | 138 | 139 | ##################################################### 140 | # Hyperparameter Grids 141 | ##################################################### 142 | 143 | 144 | # 5*20 = 100 rows for glm grid, 2*50 = 100 rows for the rf grid 145 | params_list <- list(glm = list(alpha = c(0, 0.25, 0.5, 0.75, 1), 146 | lambda = -1/log10(sample(seq(0, 1, by = 0.00001), 20))), 147 | rf = list(mtries = c(3,4), 148 | ntrees = seq(200, 300, by = 2)) 149 | ) 150 | 151 | 152 | 153 | 154 | 
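# Illustrative check of the grid sizes: h2o.grid() defaults to a Cartesian search
# strategy, so every combination of the values above is fit. That is 5 * 20 = 100
# combinations for the glm grid and, since seq(200, 300, by = 2) contains 51
# values, 2 * 51 = 102 combinations for the rf grid.
sapply(params_list, function(p) prod(lengths(p)))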
##################################################### 155 | # Functions used in the loops 156 | ##################################################### 157 | 158 | 159 | # inputs params, model, and resample, calls model and error functions, outputs error 160 | mod_error <- function(params, mod_FUN, dat) { 161 | anal_df <- rsample::analysis(dat) 162 | ass_df <- rsample::assessment(dat) 163 | 164 | h2o.no_progress() 165 | 166 | # send data to the h2o cluster 167 | anal_h2o <- as.h2o(anal_df) 168 | ass_h2o <- as.h2o(ass_df) 169 | 170 | y <- names(anal_h2o)[[ncol(anal_h2o)]] 171 | x <- setdiff(names(anal_h2o), y) 172 | 173 | mod <- mod_FUN(x, y, anal_h2o, ass_h2o, params) 174 | error <- error_FUN(mod) 175 | error 176 | } 177 | 178 | 179 | compare_algs <- function(alg, params, mod_FUN, dat){ 180 | 181 | # tune models by grid searching on resamples in the inner loop, returns df with fold id, bootstrap id, params, and errors 182 | tuning_results <- purrr::map(dat$inner_resamples, function(dat, alg, params){ 183 | 184 | param_names <- names(params) 185 | 186 | # loops each folds set of resamples, grid search, returen table of hyperparam combos and error values 187 | params_errors <- purrr::map_dfr(dat$splits, function(dat, alg, params){ 188 | 189 | # split into analysis and assessment sets 190 | anal_df <- rsample::analysis(dat) 191 | ass_df <- rsample::assessment(dat) 192 | 193 | # as.h2o and h2o.grid have progress bars. That's too many of progress bars. 194 | h2o.no_progress() 195 | 196 | # send data to the h2o cluster 197 | anal_h2o <- as.h2o(anal_df) 198 | ass_h2o <- as.h2o(ass_df) 199 | 200 | y <- names(anal_h2o)[[ncol(anal_h2o)]] 201 | x <- setdiff(names(anal_h2o), y) 202 | 203 | h2o.show_progress() 204 | 205 | # need a unique grid id or h2o just gives you the same predictions over aand over 206 | gridId <- as.character(dqrng::dqrnorm(1)) 207 | 208 | mod_grid <- h2o.grid(alg, x = x, y = y, 209 | grid_id = gridId, 210 | training_frame = anal_h2o, 211 | validation_frame = ass_h2o, 212 | hyper_params = params, 213 | parallelism = 8) 214 | # results 215 | mod_gridperf <- h2o.getGrid(grid_id = gridId, sort_by = "mae") 216 | 217 | # Grab hyperparams, model_id, errors 218 | grid_results <- mod_gridperf@summary_table 219 | colnames(grid_results)[4] <- "error" 220 | 221 | # clean the cluster 222 | h2o.removeAll() 223 | 224 | grid_results 225 | 226 | }, alg, params, .id = "bootstrap") %>% 227 | mutate(error = as.numeric(error)) %>% 228 | group_by_at(vars(param_names)) %>% 229 | summarize(mean_error = mean(error, na.rm = TRUE)) %>% 230 | ungroup() 231 | 232 | }, alg, params) 233 | 234 | 235 | best_hyper_vals <- tuning_results %>% 236 | map_df(function(dat) { 237 | dat[which.min(dat$mean_error),] 238 | }) %>% 239 | select(names(params)) %>% 240 | # H2O makes params into char vars and adds brackets to the values. Guessing they're json. 241 | mutate_all(~stringr::str_remove_all(., "^\\[")) %>% 242 | mutate_all(~stringr::str_remove_all(., "\\]")) %>% 243 | mutate_all(as.numeric) 244 | 245 | # fit models on the outer-loop folds using best hyperparams (e.g. 
5 repeats, 10 folds = 50 models), returns numeric with error values 246 | outer_fold_error <- map2_dbl(dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 247 | params <- best_hyper_vals[row,] 248 | mod_error(params, mod_FUN, dat) 249 | }) 250 | 251 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 252 | chosen_params <- best_hyper_vals %>% 253 | group_by_all() %>% 254 | tally() %>% 255 | ungroup() %>% 256 | filter(n == max(n)) 257 | 258 | # output error stats and chosen hyperparams 259 | tibble( 260 | chosen_params = list(chosen_params), 261 | mean_error = mean(outer_fold_error), 262 | median_error = median(outer_fold_error), 263 | sd_error = sd(outer_fold_error) 264 | ) 265 | } 266 | 267 | 268 | 269 | 270 | ################################## 271 | # Compare algorithms 272 | ################################## 273 | 274 | 275 | lol <- list(alg_list, params_list, mod_FUN_list) 276 | 277 | # start the nested-cv 278 | algorithm_comparison_ten <- pmap_dfr(lol, compare_algs, ncv_dat_10) %>% 279 | mutate(model = names(mod_FUN_list)) %>% 280 | select(model, everything()) 281 | toc(log = TRUE) 282 | 283 | 284 | # log duration metric to MLflow 285 | ncv_times <- tic.log(format = FALSE) 286 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 287 | mlflow_log_metric("duration", duration) 288 | mlflow_set_tag("implementation", "h2o") 289 | mlflow_set_tag("method", "kj") 290 | mlflow_end_run() 291 | 292 | 293 | # text me results 294 | log.txt <- tic.log(format = TRUE) 295 | msg <- glue("Using h2o-kj script: \n After running 10 fold, {log.txt[[1]]}.") 296 | pbPost("note", title="h2o-kj script finished", body=msg) 297 | tic.clearlog() 298 | 299 | 300 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 301 | installr::kill_process(process = c("waitress-serve.exe")) 302 | 303 | # shutdown cluster 304 | h2o.shutdown(prompt = FALSE) 305 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-parsnip-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # parsnip 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Make sure the target column is last in dataframe *** 10 | # 2. Using the Ranger Random Forest model function utilized through the parsnip package. 11 | 12 | 13 | # Sections 14 | # 1. Set-Up 15 | # 2. Error and 16 | # 3. Model Functions 17 | # 4. Hyperparameter Grids 18 | # 5. Functions used in the loops 19 | # 6. 
Compare algorithms 20 | 21 | 22 | 23 | 24 | #################################################### 25 | # Set-Up 26 | #################################################### 27 | 28 | 29 | # text me if an error occurs 30 | options(error = function() { 31 | library(RPushbullet) 32 | pbPost("note", "Error", geterrmessage()) 33 | if(!interactive()) stop(geterrmessage()) 34 | }) 35 | 36 | # start MLflow server 37 | sys::exec_background("mlflow server") 38 | Sys.sleep(10) 39 | 40 | library(tictoc) 41 | tic() 42 | 43 | set.seed(2019) 44 | 45 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 46 | sim_data <- function(n) { 47 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 48 | tmp <- cbind(tmp$x, tmp$y) 49 | tmp <- as.data.frame(tmp) 50 | names(tmp)[ncol(tmp)] <- "y" 51 | tmp 52 | } 53 | 54 | # Use small data to tune and compare models 55 | small_dat <- sim_data(100) 56 | 57 | 58 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, mlflow) 59 | 60 | # make explicit the name of the exeriement to record to 61 | mlflow_set_experiment("ncv_duration") 62 | 63 | plan(multiprocess) 64 | 65 | 66 | ncv_dat_10 <- nested_cv(small_dat, 67 | outside = vfold_cv(v = 10, repeats = 2), 68 | inside = bootstraps(times = 25)) 69 | 70 | 71 | 72 | 73 | ################################## 74 | # Error Functions 75 | ################################## 76 | 77 | 78 | error_FUN <- function(y_obs, y_hat){ 79 | y_obs <- unlist(y_obs) 80 | y_hat <- unlist(y_hat) 81 | Metrics::mae(y_obs, y_hat) 82 | } 83 | 84 | 85 | 86 | ##################################### 87 | # Model Functions 88 | ##################################### 89 | 90 | 91 | # Random Forest 92 | 93 | pars_ranger_FUN <- function(params, analysis_set) { 94 | mtry <- params$mtry[[1]] 95 | trees <- params$trees[[1]] 96 | model <- parsnip::rand_forest(mode = "regression", mtry = mtry, trees = trees) %>% 97 | parsnip::set_engine("ranger", importance = 'impurity') %>% 98 | generics::fit(y ~ ., data = analysis_set) 99 | model 100 | } 101 | 102 | 103 | 104 | # Elastic Net Regression 105 | 106 | glm_FUN <- function(params, analysis_set) { 107 | alpha <- params$mixture[[1]] 108 | lambda <- params$penalty[[1]] 109 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 110 | parsnip::set_engine("glmnet") %>% 111 | generics::fit(y ~ ., data = analysis_set) 112 | model 113 | } 114 | 115 | 116 | mod_FUN_list_pars <- list(glmnet = glm_FUN, ranger = pars_ranger_FUN) 117 | 118 | 119 | 120 | 121 | ################################### 122 | # Hyperparameter Grids 123 | ################################### 124 | 125 | 126 | # size = number of rows 127 | # Default ranges look good for mixture and penalty 128 | glm_params <- grid_latin_hypercube( 129 | mixture(), 130 | penalty(), 131 | size = 100 132 | ) 133 | 134 | rf_params <- grid_latin_hypercube( 135 | mtry(range = c(3, 4)), 136 | trees(range = c(200, 300)), 137 | size = 100 138 | ) 139 | 140 | params_list <- list(glmnet = glm_params, ranger = rf_params) 141 | 142 | 143 | 144 | 145 | ##################################################### 146 | # Functions used in the loops 147 | ##################################################### 148 | 149 | 150 | # inputs params, model, and resample, calls model and error functions, outputs error 151 | mod_error <- function(params, mod_FUN, dat) { 152 | y_col <- ncol(dat$data) 153 | y_obs <- assessment(dat)[y_col] 154 | mod <- mod_FUN(params, analysis(dat)) 155 | pred <- predict(mod, assessment(dat)) 156 | if 
(!is.data.frame(pred)) { 157 | pred <- pred$predictions 158 | } 159 | error <- error_FUN(y_obs, pred) 160 | error 161 | } 162 | 163 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 164 | tune_over_params <- function(dat, mod_FUN, params) { 165 | params$error <- map_dbl(1:nrow(params), function(row) { 166 | params <- params[row,] 167 | mod_error(params, mod_FUN, dat) 168 | }) 169 | params 170 | } 171 | 172 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 173 | summarize_tune_results <- function(dat, mod_FUN, params) { 174 | # Return row-bound tibble that has the 25 bootstrap results 175 | param_names <- names(params) 176 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 177 | lazy_dt(., key_by = param_names) %>% 178 | # For each value of the tuning parameter, compute the 179 | # average which is the inner bootstrap estimate. 180 | group_by_at(vars(param_names)) %>% 181 | summarize(mean_error = mean(error, na.rm = TRUE), 182 | n = length(error)) %>% 183 | as_tibble() 184 | } 185 | 186 | 187 | 188 | ###################################################### 189 | # Compare algorithms 190 | ###################################################### 191 | 192 | 193 | compare_algs <- function(mod_FUN, params, ncv_dat){ 194 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 195 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 196 | 197 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 198 | best_hyper_vals <- tuning_results %>% 199 | map_df(function(dat) { 200 | dat[which.min(dat$mean_error),] 201 | }) %>% 202 | select(names(params)) 203 | 204 | # fit models on the outer-loop folds using best hyperparams (e.g. 
5 repeats, 10 folds = 50 models) 205 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 206 | params <- best_hyper_vals[row,] 207 | mod_error(params, mod_FUN, dat) 208 | }, .progress = TRUE) 209 | 210 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 211 | chosen_params <- best_hyper_vals %>% 212 | group_by_all() %>% 213 | tally() %>% 214 | ungroup() %>% 215 | filter(n == max(n)) 216 | 217 | # output error stats and chosen hyperparams 218 | tibble( 219 | chosen_params = list(chosen_params), 220 | mean_error = mean(outer_fold_error), 221 | median_error = median(outer_fold_error), 222 | sd_error = sd(outer_fold_error) 223 | ) 224 | } 225 | 226 | 227 | # start nested-cv 228 | algorithm_comparison_ten_pars <- map2_dfr(mod_FUN_list_pars, params_list, compare_algs, ncv_dat_10) %>% 229 | mutate(model = names(mod_FUN_list_pars)) %>% 230 | select(model, everything()) 231 | toc(log = TRUE) 232 | 233 | 234 | # log duration metric to MLflow 235 | ncv_times <- tic.log(format = FALSE) 236 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 237 | mlflow_log_metric("duration", duration) 238 | mlflow_set_tag("implementation", "parsnip") 239 | mlflow_set_tag("method", "kj") 240 | mlflow_end_run() 241 | 242 | 243 | # text me results 244 | log.txt <- tic.log(format = TRUE) 245 | msg <- glue("Using parsnip-kj script: \n After running 10 fold pars, {log.txt[[1]]}") 246 | pbPost("note", title="parsnip-kj script finished", body=msg) 247 | tic.clearlog() 248 | 249 | 250 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 251 | installr::kill_process(process = c("waitress-serve.exe")) 252 | 253 | 254 | 255 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # ranger 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Make sure the target column is last in dataframe *** 10 | # 2. Using the Ranger Random Forest model function from the ranger package 11 | 12 | 13 | # Sections 14 | # 1. Set-Up 15 | # 2. Error 16 | # 3. Model Functions 17 | # 4. Hyperparameter Grids 18 | # 5. Functions used in the loops 19 | # 6. 
Compare algorithms 20 | 21 | 22 | 23 | 24 | 25 | #################################################### 26 | # Set-Up 27 | #################################################### 28 | 29 | 30 | # texts me if an error occurs 31 | options(error = function() { 32 | library(RPushbullet) 33 | pbPost("note", "Error", geterrmessage()) 34 | if(!interactive()) stop(geterrmessage()) 35 | }) 36 | 37 | # start MLflow server 38 | sys::exec_background("mlflow server") 39 | Sys.sleep(10) 40 | 41 | 42 | library(tictoc) 43 | tic() 44 | 45 | set.seed(2019) 46 | 47 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 48 | sim_data <- function(n) { 49 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 50 | tmp <- cbind(tmp$x, tmp$y) 51 | tmp <- as.data.frame(tmp) 52 | names(tmp)[ncol(tmp)] <- "y" 53 | tmp 54 | } 55 | 56 | # Use small data to tune and compare models 57 | small_dat <- sim_data(100) 58 | 59 | 60 | 61 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, mlflow) 62 | 63 | # make explicit the name of the exeriement to record to 64 | mlflow_set_experiment("ncv_duration") 65 | 66 | plan(multiprocess) 67 | 68 | 69 | ncv_dat_10 <- nested_cv(small_dat, 70 | outside = vfold_cv(v = 10, repeats = 2), 71 | inside = bootstraps(times = 25)) 72 | 73 | 74 | 75 | 76 | ################################## 77 | # Error Functions 78 | ################################## 79 | 80 | 81 | error_FUN <- function(y_obs, y_hat){ 82 | y_obs <- unlist(y_obs) 83 | y_hat <- unlist(y_hat) 84 | Metrics::mae(y_obs, y_hat) 85 | } 86 | 87 | 88 | 89 | 90 | ##################################### 91 | # Model Functions 92 | ##################################### 93 | 94 | 95 | # Random Forest 96 | 97 | ranger_FUN <- function(params, analysis_set) { 98 | mtry <- params$mtry[[1]] 99 | trees <- params$trees[[1]] 100 | model <- ranger::ranger(y ~ ., data = analysis_set, mtry = mtry, num.trees = trees) 101 | model 102 | } 103 | 104 | 105 | # Elastic Net Regression 106 | 107 | glm_FUN <- function(params, analysis_set) { 108 | alpha <- params$mixture[[1]] 109 | lambda <- params$penalty[[1]] 110 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 111 | parsnip::set_engine("glmnet") %>% 112 | generics::fit(y ~ ., data = analysis_set) 113 | model 114 | } 115 | 116 | 117 | mod_FUN_list_ranger <- list(glmnet = glm_FUN, ranger = ranger_FUN) 118 | 119 | 120 | 121 | 122 | ################################### 123 | # Hyperparameter Grids 124 | ################################### 125 | 126 | 127 | # size = number of rows 128 | # Default ranges look good for mixture and penalty 129 | glm_params <- grid_latin_hypercube( 130 | mixture(), 131 | penalty(), 132 | size = 100 133 | ) 134 | 135 | rf_params <- grid_latin_hypercube( 136 | mtry(range = c(3, 4)), 137 | trees(range = c(200, 300)), 138 | size = 100 139 | ) 140 | 141 | params_list <- list(glmnet = glm_params, ranger = rf_params) 142 | 143 | 144 | 145 | 146 | ##################################################### 147 | # Functions used in the loops 148 | ##################################################### 149 | 150 | 151 | # inputs params, model, and resample, calls model and error functions, outputs error 152 | mod_error <- function(params, mod_FUN, dat) { 153 | y_col <- ncol(dat$data) 154 | y_obs <- assessment(dat)[y_col] 155 | mod <- mod_FUN(params, analysis(dat)) 156 | pred <- predict(mod, assessment(dat)) 157 | if (!is.data.frame(pred)) { 158 | pred <- pred$predictions 159 | } 160 | error <- error_FUN(y_obs, pred) 161 | 
error 162 | } 163 | 164 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 165 | tune_over_params <- function(dat, mod_FUN, params) { 166 | params$error <- map_dbl(1:nrow(params), function(row) { 167 | params <- params[row,] 168 | mod_error(params, mod_FUN, dat) 169 | }) 170 | params 171 | } 172 | 173 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 174 | summarize_tune_results <- function(dat, mod_FUN, params) { 175 | # Return row-bound tibble that has the 25 bootstrap results 176 | param_names <- names(params) 177 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 178 | lazy_dt(., key_by = param_names) %>% 179 | # For each value of the tuning parameter, compute the 180 | # average which is the inner bootstrap estimate. 181 | group_by_at(vars(param_names)) %>% 182 | summarize(mean_error = mean(error, na.rm = TRUE), 183 | n = length(error)) %>% 184 | as_tibble() 185 | } 186 | 187 | 188 | 189 | ###################################################### 190 | # Compare algorithms 191 | ###################################################### 192 | 193 | 194 | compare_algs <- function(mod_FUN, params, ncv_dat){ 195 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 196 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 197 | 198 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 199 | best_hyper_vals <- tuning_results %>% 200 | map_df(function(dat) { 201 | dat[which.min(dat$mean_error),] 202 | }) %>% 203 | select(names(params)) 204 | 205 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 206 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 207 | params <- best_hyper_vals[row,] 208 | mod_error(params, mod_FUN, dat) 209 | }, .progress = TRUE) 210 | 211 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 212 | chosen_params <- best_hyper_vals %>% 213 | group_by_all() %>% 214 | tally() %>% 215 | ungroup() %>% 216 | filter(n == max(n)) 217 | 218 | # output error stats and chosen hyperparams 219 | tibble( 220 | chosen_params = list(chosen_params), 221 | mean_error = mean(outer_fold_error), 222 | median_error = median(outer_fold_error), 223 | sd_error = sd(outer_fold_error) 224 | ) 225 | } 226 | 227 | 228 | 229 | # Start the nested-cv 230 | algorithm_comparison_ten_rang <- map2_dfr(mod_FUN_list_ranger, params_list, compare_algs, ncv_dat_10) %>% 231 | mutate(model = names(mod_FUN_list_ranger)) %>% 232 | select(model, everything()) 233 | toc(log = TRUE) 234 | 235 | 236 | # log duration metric to MLflow 237 | ncv_times <- tic.log(format = FALSE) 238 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 239 | mlflow_log_metric("duration", duration) 240 | mlflow_set_tag("implementation", "ranger") 241 | mlflow_set_tag("method", "kj") 242 | mlflow_end_run() 243 | 244 | 245 | # text me results 246 | log.txt <- tic.log(format = TRUE) 247 | msg <- glue("Using ranger-kj script: \n After running 10 fold rang, {log.txt[[1]]}") 248 | pbPost("note", title="ranger-kj script finished", body=msg) 249 | tic.clearlog() 250 | 251 | 252 | # MLflow uses waitress for Windows. 
Killing it also kills mlflow.exe, python.exe, console window host processes 253 | installr::kill_process(process = c("waitress-serve.exe")) 254 | 255 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-sklearn-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | # Kuhn-Johnson method 4 | # sklearn 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Target column needs to be last in dataframe *** 10 | # 2. *** All sklearn import aliases should have a "sklearn_" prefix *** 11 | # 3. Uses both R model functions along with those from sklearn modules 12 | # 4. Starts the MLflow server in the background with a OS command and kills the server process at the end of the script 13 | 14 | 15 | # Sections 16 | # 1. Set-Up 17 | # 2. Error Functions 18 | # 3. Model Functions 19 | # 4. Hyperparameter Grids 20 | # 3. Functions used in the loops 21 | # 4. Compare algorithms 22 | 23 | 24 | 25 | 26 | #################################################### 27 | # Set-Up 28 | #################################################### 29 | 30 | 31 | # text me if an error occurs 32 | options(error = function() { 33 | library(RPushbullet) 34 | pbPost("note", "Error", geterrmessage()) 35 | if(!interactive()) stop(geterrmessage()) 36 | }) 37 | 38 | # start MLflow server 39 | sys::exec_background("mlflow server") 40 | Sys.sleep(10) 41 | 42 | 43 | set.seed(2019) 44 | 45 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 46 | sim_data <- function(n) { 47 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 48 | tmp <- cbind(tmp$x, tmp$y) 49 | tmp <- as.data.frame(tmp) 50 | names(tmp)[ncol(tmp)] <- "y" 51 | tmp 52 | } 53 | 54 | # Use small data to tune and compare models 55 | small_dat <- sim_data(100) 56 | 57 | 58 | library(tictoc) 59 | tic() 60 | 61 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, reticulate, mlflow) 62 | 63 | # make explicit the name of the exeriement to record to 64 | mlflow_set_experiment("ncv_duration") 65 | 66 | 67 | plan(multiprocess) 68 | 69 | 70 | ncv_dat_10 <- nested_cv(small_dat, 71 | outside = vfold_cv(v = 10, repeats = 2), 72 | inside = bootstraps(times = 25)) 73 | 74 | 75 | 76 | 77 | ################################## 78 | # Error Functions 79 | ################################## 80 | 81 | 82 | error_FUN <- function(y_obs, y_hat){ 83 | y_obs <- unlist(y_obs) 84 | y_hat <- unlist(y_hat) 85 | Metrics::mae(y_obs, y_hat) 86 | } 87 | 88 | 89 | 90 | ##################################### 91 | # Model Functions 92 | ##################################### 93 | 94 | 95 | # Random Forest 96 | 97 | sklearn_rf_FUN <- function(params, analysis_set) { 98 | sklearn_e <- import("sklearn.ensemble") 99 | max_features <- r_to_py(params$mtry[[1]]) 100 | n_estimators <- r_to_py(params$trees[[1]]) 101 | 102 | # get data into sklearn's preferred format 103 | y_idx <- ncol(analysis_set) - 1 104 | X_idx <- y_idx - 1 105 | pAnal <- r_to_py(analysis_set) 106 | y_train <- pAnal$iloc(axis = 1L)[y_idx]$values 107 | X_train <- pAnal$iloc(axis = 1L)[0:X_idx] 108 | 109 | model <- sklearn_e$RandomForestRegressor(criterion = "mae", 110 | max_features = max_features, 111 | n_estimators = n_estimators, 112 | random_state = 1L) 113 | mod_fit <- model$fit(X_train, y_train) 114 | } 115 | 116 | 117 | # Elastic Net Regression 118 | 119 | glm_FUN <- function(params, analysis_set) { 120 | alpha <- 
params$mixture[[1]] 121 | lambda <- params$penalty[[1]] 122 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 123 | parsnip::set_engine("glmnet") %>% 124 | generics::fit(y ~ ., data = analysis_set) 125 | model 126 | } 127 | 128 | 129 | 130 | mod_FUN_list_skrf <- list(glmnet = glm_FUN, sklearn_rf = sklearn_rf_FUN) 131 | 132 | 133 | 134 | 135 | ################################### 136 | # Hyperparameter Grids 137 | ################################### 138 | 139 | 140 | # size = number of rows 141 | # Default ranges look good for mixture and penalty 142 | glm_params <- grid_latin_hypercube( 143 | mixture(), 144 | penalty(), 145 | size = 100 146 | ) 147 | 148 | rf_params <- grid_latin_hypercube( 149 | mtry(range = c(3, 4)), 150 | trees(range = c(200, 300)), 151 | size = 100 152 | ) 153 | 154 | 155 | # params_list <- list(glmnet = glm_params, ranger = rf_params) 156 | params_list <- list(glmnet = glm_params, sklearn_rf = rf_params) 157 | 158 | 159 | 160 | 161 | ##################################################### 162 | # Functions used in the loops 163 | ##################################################### 164 | 165 | 166 | # detect if the model function is from sklearn 167 | is_sklearn <- function(modfun) { 168 | string <- toString(body(modfun)) 169 | stringr::str_detect(string, pattern = "sklearn") 170 | } 171 | 172 | 173 | # inputs params, model, and resample, calls model and error functions, outputs error 174 | mod_error <- function(params, mod_FUN, dat) { 175 | y_col <- ncol(dat$data) 176 | y_obs <- assessment(dat)[y_col] 177 | mod <- mod_FUN(params, analysis(dat)) 178 | 179 | if(is_sklearn(mod_FUN)) { 180 | X_dat <- r_to_py(assessment(dat)[-y_col]) 181 | pred <- mod$predict(X_dat) 182 | } else { 183 | pred <- predict(mod, assessment(dat)) 184 | if (!is.data.frame(pred)) { 185 | pred <- pred$predictions 186 | } 187 | } 188 | 189 | error <- error_FUN(y_obs, pred) 190 | error 191 | } 192 | 193 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 194 | tune_over_params <- function(dat, mod_FUN, params) { 195 | params$error <- map_dbl(1:nrow(params), function(row) { 196 | params <- params[row,] 197 | mod_error(params, mod_FUN, dat) 198 | }) 199 | params 200 | } 201 | 202 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 203 | summarize_tune_results <- function(dat, mod_FUN, params) { 204 | # Return row-bound tibble that has the 25 bootstrap results 205 | param_names <- names(params) 206 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 207 | lazy_dt(., key_by = param_names) %>% 208 | # For each value of the tuning parameter, compute the 209 | # average which is the inner bootstrap estimate. 210 | group_by_at(vars(param_names)) %>% 211 | summarize(mean_error = mean(error, na.rm = TRUE), 212 | n = length(error)) %>% 213 | as_tibble() 214 | } 215 | 216 | 217 | 218 | ###################################################### 219 | # Compare algorithms 220 | ###################################################### 221 | 222 | 223 | compare_algs <- function(mod_FUN, params, ncv_dat){ 224 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 225 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 226 | 227 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 
5 repeats 10 folds = 50 best hyperparam combos) 228 | best_hyper_vals <- tuning_results %>% 229 | map_df(function(dat) { 230 | dat[which.min(dat$mean_error),] 231 | }) %>% 232 | select(names(params)) 233 | 234 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 235 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 236 | params <- best_hyper_vals[row,] 237 | mod_error(params, mod_FUN, dat) 238 | }, .progress = TRUE) 239 | 240 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 241 | chosen_params <- best_hyper_vals %>% 242 | group_by_all() %>% 243 | tally() %>% 244 | ungroup() %>% 245 | filter(n == max(n)) 246 | 247 | # output error stats and chosen hyperparams 248 | tibble( 249 | chosen_params = list(chosen_params), 250 | mean_error = mean(outer_fold_error), 251 | median_error = median(outer_fold_error), 252 | sd_error = sd(outer_fold_error) 253 | ) 254 | } 255 | 256 | 257 | # start the nested-cv 258 | algorithm_comparison_ten_skrf <- map2_dfr(mod_FUN_list_skrf, params_list, compare_algs, ncv_dat_10) %>% 259 | mutate(model = names(mod_FUN_list_skrf)) %>% 260 | select(model, everything()) 261 | toc(log = TRUE) 262 | 263 | 264 | # log duration metric to MLflow 265 | ncv_times <- tic.log(format = FALSE) 266 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 267 | mlflow_log_metric("duration", duration) 268 | mlflow_set_tag("implementation", "sklearn") 269 | mlflow_set_tag("method", "kj") 270 | mlflow_end_run() 271 | 272 | 273 | # text me the results 274 | log.txt <- tic.log(format = TRUE) 275 | msg <- glue("Using sklearn-kj script: \n After running 10 fold skrf, {log.txt[[1]]}") 276 | pbPost("note", title="sklearn-kj script finished", body=msg) 277 | tic.clearlog() 278 | 279 | 280 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 281 | installr::kill_process(process = c("waitress-serve.exe")) 282 | 283 | 284 | 285 | -------------------------------------------------------------------------------- /duration-experiment/kuhn-johnson/nested-cv-tune-kj.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation using tune package 2 | 3 | # Kuhn-Johnson method 4 | # tune 5 | 6 | 7 | 8 | # Notes 9 | # 1. *** Make sure the target column is last in dataframe *** 10 | 11 | 12 | # Sections 13 | # 1. Set-up 14 | # 2. Error functions 15 | # 3. Model functions 16 | # 4, Hyperparameter Grids 17 | # 5. Functions used in the loops 18 | # 6. 
Compare Algorithms 19 | 20 | 21 | 22 | 23 | ################################ 24 | # Set-up 25 | ################################ 26 | 27 | # text me if an error occurs 28 | options(error = function() { 29 | library(RPushbullet) 30 | pbPost("note", "Error", geterrmessage()) 31 | if(!interactive()) stop(geterrmessage()) 32 | }) 33 | 34 | # start MLflow server 35 | sys::exec_background("mlflow server") 36 | Sys.sleep(10) 37 | 38 | 39 | library(tictoc) 40 | tic() 41 | 42 | pacman::p_load(RPushbullet, glue, ranger, doFuture, dplyr, purrr, tidymodels, tune, dials, mlflow) 43 | 44 | 45 | set.seed(2019) 46 | 47 | # make explicit the name of the exeriement to record to 48 | mlflow_set_experiment("ncv_duration") 49 | 50 | registerDoFuture() 51 | plan(multiprocess) 52 | 53 | 54 | 55 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 56 | sim_data <- function(n) { 57 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 58 | tmp <- cbind(tmp$x, tmp$y) 59 | tmp <- as.data.frame(tmp) 60 | names(tmp)[ncol(tmp)] <- "y" 61 | tmp 62 | } 63 | 64 | # Use small data to tune and compare models 65 | small_dat <- sim_data(100) 66 | 67 | 68 | 69 | ncv_dat_10 <- nested_cv(small_dat, 70 | outside = vfold_cv(v = 10, repeats = 2), 71 | inside = bootstraps(times = 25)) 72 | 73 | 74 | 75 | 76 | ################################ 77 | # Error functions 78 | ################################ 79 | 80 | 81 | error_funs <- metric_set(mae) 82 | error_FUN <- function(y_obs, y_hat){ 83 | y_obs <- unlist(y_obs) 84 | y_hat <- unlist(y_hat) 85 | Metrics::mae(y_obs, y_hat) 86 | } 87 | 88 | 89 | 90 | ################################ 91 | # Mode functions 92 | ################################ 93 | 94 | 95 | # Random Forest 96 | 97 | # inner-loop tuning 98 | rf_inner <- rand_forest(mtry = tune(), trees = tune()) %>% 99 | set_engine("ranger", importance = 'impurity') %>% 100 | set_mode("regression") 101 | 102 | # outer loop scoring and model selection 103 | rf_FUN <- function(params, analysis_set) { 104 | mtry <- params$mtry[[1]] 105 | trees <- params$trees[[1]] 106 | rand_forest(mode = "regression", mtry = mtry, trees = trees) %>% 107 | set_engine("ranger", importance = 'impurity') %>% 108 | fit(y ~ ., data = analysis_set) 109 | } 110 | 111 | 112 | # Regularized Regression 113 | 114 | glm_inner <- linear_reg(mixture = tune(), penalty = tune()) %>% 115 | set_engine("glmnet") 116 | 117 | glm_FUN <- function(params, analysis_set) { 118 | alpha <- params$mixture[[1]] 119 | lambda <- params$penalty[[1]] 120 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 121 | parsnip::set_engine("glmnet") %>% 122 | generics::fit(y ~ ., data = analysis_set) 123 | model 124 | } 125 | 126 | 127 | mod_inner_list <- list(glm = glm_inner, rf = rf_inner) 128 | mod_FUN_list <- list(glm = glm_FUN, rf = rf_FUN) 129 | 130 | 131 | 132 | ################################ 133 | # Hyperparameter Grids 134 | ################################ 135 | 136 | 137 | glm_params <- grid_latin_hypercube( 138 | mixture(), 139 | penalty(), 140 | size = 100 141 | ) 142 | 143 | rf_params <- grid_latin_hypercube( 144 | mtry(range = c(3, 4)), 145 | trees(range = c(200, 300)), 146 | size = 100 147 | ) 148 | 149 | 150 | params_list <- list(glm = glm_params, rf = rf_params) 151 | 152 | 153 | 154 | ################################ 155 | # Functions used in the loops 156 | ################################ 157 | 158 | 159 | # inputs params, model, and resample, calls model and error functions, outputs error 160 | mod_error <- function(params, 
mod_FUN, dat) { 161 | y_col <- ncol(dat$data) 162 | y_obs <- assessment(dat)[y_col] 163 | mod <- mod_FUN(params, analysis(dat)) 164 | pred <- predict(mod, assessment(dat)) 165 | error <- error_FUN(y_obs, pred) 166 | error 167 | } 168 | 169 | 170 | compare_algs <- function(mod_inner, params, mod_FUN, ncv_dat){ 171 | 172 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 173 | tuning_results <- map(ncv_dat$inner_resamples, function(dat, mod_inner, params) { 174 | tune_grid(y ~ ., 175 | model = mod_inner, 176 | resamples = dat, 177 | grid = params, 178 | metrics = error_funs) 179 | }, 180 | mod_inner, params) 181 | 182 | num_params <- ncol(params) 183 | 184 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 185 | best_hyper_vals <- tuning_results %>% 186 | map_dfr(function(dat){ 187 | dat %>% 188 | collect_metrics() %>% 189 | filter(mean == min(mean)) %>% 190 | slice(1) 191 | }) %>% 192 | select(1:num_params) 193 | 194 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 195 | outer_fold_error <- furrr::future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 196 | params <- best_hyper_vals[row,] 197 | mod_error(params, mod_FUN, dat) 198 | }, .progress = TRUE) 199 | 200 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 201 | chosen_params <- best_hyper_vals %>% 202 | group_by_all() %>% 203 | tally() %>% 204 | ungroup() %>% 205 | filter(n == max(n)) 206 | 207 | # output error stats and chosen hyperparams 208 | tibble( 209 | chosen_params = list(chosen_params), 210 | mean_error = mean(outer_fold_error), 211 | median_error = median(outer_fold_error), 212 | sd_error = sd(outer_fold_error) 213 | ) 214 | } 215 | 216 | 217 | 218 | ################################ 219 | # Compare Algorithms 220 | ################################ 221 | 222 | 223 | lol <- list(mod_inner_list, params_list, mod_FUN_list) 224 | 225 | 226 | # start the nested-cv 227 | algorithm_comparison_ten <- pmap_dfr(lol, compare_algs, ncv_dat_10) %>% 228 | mutate(model = names(mod_FUN_list)) %>% 229 | select(model, everything()) 230 | toc(log = TRUE) 231 | 232 | 233 | # log duration metric to MLflow 234 | ncv_times <- tic.log(format = FALSE) 235 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 236 | mlflow_log_metric("duration", duration) 237 | mlflow_set_tag("implementation", "tune") 238 | mlflow_set_tag("method", "kj") 239 | mlflow_end_run() 240 | 241 | 242 | # text me results 243 | log.txt <- tic.log(format = TRUE) 244 | msg <- glue("Using tune-kj script: \n After running 10 fold, {log.txt[[1]]}") 245 | pbPost("note", title="tune-kj script finished", body=msg) 246 | tic.clearlog() 247 | 248 | 249 | # MLflow uses waitress for Windows. 
Killing it also kills mlflow.exe, python.exe, console window host processes 250 | installr::kill_process(process = c("waitress-serve.exe")) 251 | 252 | -------------------------------------------------------------------------------- /duration-experiment/outputs/0225-results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/duration-experiment/outputs/0225-results.png -------------------------------------------------------------------------------- /duration-experiment/outputs/duration-pkg-tbl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/duration-experiment/outputs/duration-pkg-tbl.png -------------------------------------------------------------------------------- /duration-experiment/outputs/duration-runs.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/duration-experiment/outputs/duration-runs.rds -------------------------------------------------------------------------------- /duration-experiment/package-sources-gt-tbl.R: -------------------------------------------------------------------------------- 1 | # GT table: Package Sources for the Model Functions 2 | 3 | 4 | pacman::p_load(tibble, dplyr, tidyr, gt) 5 | 6 | runs_raw <- readr::read_rds("data/duration-runs.rds") %>% 7 | mutate(implementation = stringr::str_to_title(implementation)) 8 | 9 | 10 | # hardcoded 11 | # packages used for the model functions ordered by implementation in the runs_raw file 12 | elast_net <- c("sklearn", "sklearn", "parsnip-glmnet", "mlr-glmnet", "parsnip-glmnet", "h2o", "parsnip-glmnet", "parsnip-glmnet", "parsnip-glmnet") 13 | rand_forest <- c("sklearn", "sklearn", "ranger", "mlr-ranger", "parsnip-ranger", "h2o", "sklearn", "parsnip-ranger", "ranger") 14 | 15 | 16 | elast_dat <- runs_raw %>% 17 | select(implementation) %>% 18 | bind_cols(`Elastic Net` = elast_net) %>% 19 | pivot_wider(names_from = implementation, values_from = `Elastic Net`) 20 | 21 | # implementations as cols, algorithm as rows, values = package used 22 | model_dat <- runs_raw %>% 23 | select(implementation) %>% 24 | bind_cols(`Random Forest` = rand_forest) %>% 25 | pivot_wider(names_from = implementation, values_from = `Random Forest`) %>% 26 | bind_rows(elast_dat) %>% 27 | mutate(rowname = c("Random Forest", "Elastic Net")) 28 | 29 | model_dat %>% 30 | gt() %>% 31 | tab_spanner( 32 | label = "Implementation", 33 | columns = everything() 34 | ) %>% 35 | data_color(columns = vars(Reticulate, Python, Mlr3, `Ranger-Kj`), 36 | colors = scales::col_factor( 37 | palette = "#195198", 38 | domain = c("sklearn", "ranger", "parsnip-glmnet", "mlr-ranger", "mlr-glmnet" ))) %>% 39 | data_color(columns = vars(Tune, H2o, Sklearn, Parsnip, Ranger), 40 | colors = scales::col_factor( 41 | palette = "#BD9865", 42 | domain = c("sklearn", "ranger", "parsnip-glmnet", "parsnip-ranger", "h2o" ))) %>% 43 | tab_style( 44 | style = cell_text(align = "center"), 45 | locations = cells_body() 46 | ) %>% 47 | tab_options( 48 | table.background.color = "ivory", 49 | table.border.top.style = "None" 50 | ) %>% 51 | tab_header( 52 | title = "Package Sources for the Model Functions" 53 | ) %>% 54 | gtsave(filename = 
"duration-pkg-tbl.png", 55 | path = "duration-experiment/outputs") 56 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-kj-raschka.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | 4 | # Raschka method 5 | # ranger-kj 6 | 7 | 8 | 9 | # Notes 10 | # 1. *** Make sure the target column is last in dataframe *** 11 | 12 | 13 | # Sections 14 | # 1. Set-Up 15 | # 2. Error function 16 | # 3. Model Functions 17 | # 4. Hyperparameter Grids 18 | # 5. Functions used in the loops 19 | # 6. Compare algorithms; tune chosen algorithm 20 | # 7. Score chosen algorithm 21 | 22 | 23 | 24 | 25 | 26 | ###################################################### 27 | # Set-Up 28 | ###################################################### 29 | 30 | 31 | # texts me if an error occurs 32 | options(error = function() { 33 | library(RPushbullet) 34 | pbPost("note", "Error", geterrmessage()) 35 | if(!interactive()) stop(geterrmessage()) 36 | }) 37 | 38 | # start MLflow server 39 | sys::exec_background("mlflow server") 40 | Sys.sleep(10) 41 | 42 | 43 | library(tictoc) 44 | tic() 45 | 46 | set.seed(2019) 47 | 48 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 49 | sim_data <- function(n) { 50 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 51 | tmp <- cbind(tmp$x, tmp$y) 52 | tmp <- as.data.frame(tmp) 53 | names(tmp)[ncol(tmp)] <- "y" 54 | tmp 55 | } 56 | 57 | dat <- sim_data(5000) 58 | 59 | 60 | pacman::p_load(RPushbullet, glue, ranger, tidymodels, data.table, dtplyr, dplyr, furrr, mlflow) 61 | 62 | # make explicit the name of the exeriement to record to 63 | mlflow_set_experiment("ncv_duration") 64 | 65 | plan(multiprocess) 66 | 67 | 68 | train_idx <- caret::createDataPartition(y = dat$y, p = 0.80, list = FALSE) 69 | train_dat <- dat[train_idx, ] 70 | test_dat <- dat[-train_idx, ] 71 | 72 | ncv_dat <- nested_cv(train_dat, 73 | outside = vfold_cv(v = 5), 74 | inside = vfold_cv(v = 2)) 75 | 76 | 77 | 78 | 79 | ###################################################### 80 | # Error Functions 81 | ###################################################### 82 | 83 | 84 | error_FUN <- function(y_obs, y_hat){ 85 | y_obs <- unlist(y_obs) 86 | y_hat <- unlist(y_hat) 87 | Metrics::mae(y_obs, y_hat) 88 | } 89 | 90 | 91 | 92 | ###################################################### 93 | # Model Functions 94 | ###################################################### 95 | 96 | 97 | # Random Forest 98 | 99 | ranger_FUN <- function(params, analysis_set) { 100 | mtry <- params$mtry[[1]] 101 | trees <- params$trees[[1]] 102 | model <- ranger::ranger(y ~ ., data = analysis_set, mtry = mtry, num.trees = trees) 103 | model 104 | } 105 | 106 | 107 | # Regularized Regression 108 | 109 | glm_FUN <- function(params, analysis_set) { 110 | alpha <- params$mixture[[1]] 111 | lambda <- params$penalty[[1]] 112 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 113 | parsnip::set_engine("glmnet") %>% 114 | generics::fit(y ~ ., data = analysis_set) 115 | model 116 | } 117 | 118 | 119 | mod_FUN_list <- list(glmnet = glm_FUN, ranger = ranger_FUN) 120 | 121 | 122 | 123 | ###################################################### 124 | # Hyperparameter Grids 125 | ###################################################### 126 | 127 | 128 | # size = number of rows 129 | glm_params <- grid_latin_hypercube( 130 | mixture(), 131 | penalty(), 132 | 
size = 100 133 | ) 134 | 135 | rf_params <- grid_latin_hypercube( 136 | mtry(range = c(3, 4)), 137 | trees(range = c(200, 300)), 138 | size = 100 139 | ) 140 | 141 | params_list <- list(glmnet = glm_params, ranger = rf_params) 142 | 143 | 144 | 145 | 146 | ###################################################### 147 | # Functions used in the loops 148 | ###################################################### 149 | 150 | 151 | # inputs params, model, and fold, calls model and error functions, outputs error 152 | mod_error <- function(params, mod_FUN, dat) { 153 | y_col <- ncol(dat$data) 154 | y_obs <- assessment(dat)[y_col] 155 | mod <- mod_FUN(params, analysis(dat)) 156 | pred <- predict(mod, assessment(dat)) 157 | if (!is.data.frame(pred)) { 158 | pred <- pred$predictions 159 | } 160 | error <- error_FUN(y_obs, pred) 161 | error 162 | } 163 | 164 | # inputs fold, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 165 | tune_over_params <- function(dat, mod_FUN, params) { 166 | params$error <- map_dbl(1:nrow(params), function(row) { 167 | params <- params[row,] 168 | mod_error(params, mod_FUN, dat) 169 | }) 170 | params 171 | } 172 | 173 | # inputs outer fold and sends outer fold's inner folds to tuning function, collects and averages fold's error for each hyperparameter combo 174 | summarize_tune_results <- function(dat, mod_FUN, params) { 175 | # Return row-bound tibble that has the 25 bootstrap results 176 | param_names <- names(params) 177 | future_map_dfr(dat$splits, tune_over_params, mod_FUN, params, .progress = TRUE) %>% 178 | lazy_dt(., key_by = param_names) %>% 179 | # For each value of the tuning parameter, compute the 180 | # average which is the inner bootstrap estimate. 181 | group_by_at(vars(param_names)) %>% 182 | summarize(mean_error = mean(error, na.rm = TRUE), 183 | n = length(error)) %>% 184 | as_tibble() 185 | } 186 | 187 | 188 | # primary function for the nested-cv 189 | compare_algs <- function(mod_FUN, params, ncv_dat){ 190 | # tune models by grid searching on resamples in the inner loop (e.g. 5 repeats 10 folds = list of 50 tibbles with param and mean_error cols) 191 | tuning_results <- map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 192 | 193 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 
5 repeats 10 folds = 50 best hyperparam combos) 194 | best_hyper_vals <- tuning_results %>% 195 | map_df(function(dat) { 196 | dat[which.min(dat$mean_error),] 197 | }) %>% 198 | select(names(params)) 199 | 200 | # fit models on the outer-loop folds using best hyperparams 201 | outer_fold_error <- future_map2_dbl(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 202 | params <- best_hyper_vals[row,] 203 | mod_error(params, mod_FUN, dat) 204 | }, .progress = TRUE) 205 | 206 | 207 | # output error stats 208 | tibble( 209 | mean_error = mean(outer_fold_error), 210 | median_error = median(outer_fold_error), 211 | sd_error = sd(outer_fold_error) 212 | ) 213 | } 214 | 215 | 216 | 217 | 218 | ###################################################### 219 | # Compare algorithms, Tune chosen algorithm 220 | ###################################################### 221 | 222 | 223 | # outputs df with outer fold stats for each algorithm 224 | algorithm_comparison <- map2_dfr(mod_FUN_list, params_list, compare_algs, ncv_dat) %>% 225 | mutate(model = names(mod_FUN_list)) %>% 226 | select(model, everything()) 227 | 228 | # Choose alg with lowest avg error 229 | chosen_alg <- algorithm_comparison %>% 230 | filter(mean_error == min(mean_error)) %>% 231 | pull(1) 232 | 233 | # Set inputs to chosen alg 234 | mod_FUN <- mod_FUN_list[[chosen_alg]] 235 | params <- params_list[[chosen_alg]] 236 | total_train <- vfold_cv(train_dat, v = 2) 237 | 238 | # tune chosen alg on the inner-loop cv strategy 239 | # code is an amalgam of funs: summarize_tune_results, tune_over_params 240 | tuning_results <- map(total_train$splits, function(dat, mod_FUN, params) { 241 | params$error <- future_map_dbl(1:nrow(params), function(row) { 242 | params <- params[row,] 243 | mod_error(params, mod_FUN, dat) 244 | }) 245 | return(params) 246 | }, mod_FUN, params) %>% 247 | bind_rows() %>% 248 | lazy_dt(., key_by = names(params)) %>% 249 | # For each value of the tuning parameter, compute the 250 | # average which is the inner bootstrap estimate. 
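  # (in this script the resamples come from total_train, a 2-fold vfold_cv of the
  #  training set, so the average here is over the two validation folds rather than
  #  over bootstrap resamples)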
251 | group_by_at(vars(names(params))) %>% 252 | summarize(mean_error = mean(error, na.rm = TRUE)) %>% 253 | as_tibble() 254 | 255 | 256 | 257 | 258 | ###################################################### 259 | # Score chosen algorithm 260 | ###################################################### 261 | 262 | 263 | # Get best params from the tuning 264 | best_hyper_vals <- tuning_results %>% 265 | filter(mean_error == min(mean_error)) %>% 266 | slice(1) %>% 267 | select(names(params)) 268 | 269 | # Get avg error across validation folds from alg using best params 270 | avg_kfold_error <- tuning_results %>% 271 | filter(mean_error == min(mean_error)) %>% 272 | mutate(mean_error = round(mean_error, 5)) %>% 273 | select(mean_error) 274 | 275 | # train tuned mod on entire training set; score on test set 276 | chosen_mod <- mod_FUN(best_hyper_vals, train_dat) 277 | 278 | toc(log = TRUE) 279 | 280 | 281 | 282 | # log duration metric to MLflow 283 | ncv_times <- tic.log(format = FALSE) 284 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 285 | mlflow_log_metric("duration", duration) 286 | mlflow_set_tag("implementation", "ranger-kj") 287 | mlflow_set_tag("method", "raschka") 288 | mlflow_end_run() 289 | 290 | 291 | # Score on the held out test set 292 | chosen_preds <- predict(chosen_mod, test_dat) 293 | if (!is.data.frame(chosen_preds)) { 294 | chosen_preds <- chosen_preds$predictions 295 | } 296 | test_error <- round(error_FUN(test_dat$y, chosen_preds), 5) 297 | 298 | # avg error across outer-loop folds in ncv 299 | outer_kfold_error <- algorithm_comparison %>% 300 | filter(model == chosen_alg) %>% 301 | mutate(mean_error = round(mean_error, 5)) %>% 302 | pull(mean_error) 303 | 304 | # Create output message and text me the results 305 | best_hyper_vals <- best_hyper_vals %>% 306 | tidyr::pivot_longer(cols = names(.), names_to = "param", values_to = "value" ) %>% 307 | glue_data("{param} = {value}") %>% 308 | glue_collapse(sep = ",", last = " and ") 309 | 310 | msg <- glue("Avg K-Fold CV error: {avg_kfold_error[[1]]} 311 | Outer-Fold Avg Error: {outer_kfold_error} 312 | Test Error: {test_error} 313 | Best Parameters for {chosen_alg}: 314 | {best_hyper_vals}") 315 | 316 | log.txt <- tic.log(format = TRUE) 317 | text_msg <- glue("{log.txt[[1]]} for kj-raschka script to complete 318 | Results: 319 | {msg}") 320 | 321 | pbPost("note", title="kj-raschka script finished", body=text_msg) 322 | tic.clearlog() 323 | 324 | 325 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 326 | installr::kill_process(process = c("waitress-serve.exe")) 327 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-mlr3-raschka.R: -------------------------------------------------------------------------------- 1 | # Nested cross-validation for tuning and algorithm comparison 2 | 3 | 4 | # Raschka method 5 | # mlr3 6 | 7 | 8 | 9 | # Notes: 10 | # 1. *** Make sure target variable is the last column*** 11 | # 2. As of 16Jan2020, folds are not fixed so each call to resample() creates a new nested_cv structure which means the algorithms are being compared on somewhat different data. 12 | # 3. For the glm, lambda is not a tunable parameter in mlr3. So I had to use lambda_min_ratio and nlambda to get a grid that should be equivalent to a 100 row lambda grid. 13 | # 4. The batch arg in the tuner function allows you to specify how you want to parallelize for each algorithm which is nice. 
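#    (On note 3: glmnet builds its lambda path internally -- nlambda values spaced on a log
#     scale from a data-derived maximum down to lambda.min.ratio * lambda_max -- so, as a
#     rough sketch assuming that default behavior, the comparable direct grid would be
#     something like exp(seq(log(lambda_max), log(lambda_min_ratio * lambda_max),
#     length.out = nlambda)). Tuning lambda.min.ratio and nlambda therefore only
#     approximates tuning lambda itself.)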
14 | 15 | 16 | # Sections: 17 | # 1. Set-Up and Data 18 | # 2. Functions Used in the Loops 19 | # 3. Model Functions; Hyperparameter Grids 20 | # 4. Compare Algorithms 21 | # 5. Tune and Score Chosen Algorithm 22 | 23 | 24 | 25 | 26 | 27 | ##################################################### 28 | # set up and Data 29 | ##################################################### 30 | 31 | 32 | # texts me when there's an error 33 | options(error = function() { 34 | library(RPushbullet) 35 | pbPost("note", "Error", geterrmessage()) 36 | if(!interactive()) stop(geterrmessage()) 37 | }) 38 | 39 | # start MLflow server 40 | sys::exec_background("mlflow server") 41 | Sys.sleep(10) 42 | 43 | 44 | library(tictoc) 45 | tic() 46 | 47 | pacman::p_load(glue, RPushbullet, dplyr, mlr3, mlr3learners, mlr3tuning, future, mlflow) 48 | 49 | 50 | set.seed(2019) 51 | 52 | # make explicit the name of the exeriement to record to 53 | mlflow_set_experiment("ncv_duration") 54 | 55 | 56 | plan(multiprocess) 57 | 58 | # simulated data; generates 10 multi-patterned, numeric predictors plus outcome variable 59 | sim_data <- function(n) { 60 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 61 | tmp <- cbind(tmp$x, tmp$y) 62 | tmp <- as.data.frame(tmp) 63 | names(tmp)[ncol(tmp)] <- "y" 64 | tmp 65 | } 66 | 67 | # Use small data to tune and compare models 68 | dat <- sim_data(5000) 69 | 70 | train_idx <- caret::createDataPartition(y = dat$y, p = 0.80, list = FALSE) 71 | train_dat <- dat[train_idx, ] 72 | test_dat <- dat[-train_idx, ] 73 | 74 | 75 | 76 | 77 | ##################################################### 78 | # Functions used in the loops 79 | ##################################################### 80 | 81 | 82 | # task obj consists of the data and target variable 83 | train_task <- TaskRegr$new(id = "train_dat", backend = train_dat, target = "y") 84 | 85 | # inner-loop tuning 86 | resampling_inner <- rsmp("cv", folds = 2) 87 | 88 | # outer loop 89 | resampling_outer_five <- rsmp("cv", folds = 5) 90 | 91 | # error function 92 | measures <- msr("regr.mae") 93 | 94 | # required to specify an early stopping criteria. I don't want one. 95 | terminator <- term("none") 96 | 97 | 98 | 99 | 100 | ##################################################### 101 | # Model Functions; Hyperparameter Grids 102 | ##################################################### 103 | 104 | 105 | # Model functions 106 | rf_mod <- lrn("regr.ranger") 107 | glm_mod <- lrn("regr.glmnet", nlambda = 1) 108 | 109 | # available hyperparameters 110 | # mlr3::mlr_learners$get("regr.ranger")$param_set 111 | # mlr3::mlr_learners$get("regr.glmnet")$param_set 112 | 113 | # Hyperparameter restrictions/conditions 114 | rf_params <- paradox::ParamSet$new( 115 | params = list(paradox::ParamInt$new("mtry", lower = 3, upper = 4), 116 | paradox::ParamInt$new("num.trees", lower = 200, upper = 300) 117 | ) 118 | ) 119 | 120 | # values from for lambda.min.ratio come from the h2o docs about the parameter. 
121 | glm_params <- paradox::ParamSet$new( 122 | params = list(paradox::ParamDbl$new("alpha", lower = 0, upper = 1), 123 | paradox::ParamDbl$new("lambda.min.ratio", lower = 0.0001, upper = 0.01) 124 | ) 125 | ) 126 | 127 | 128 | 129 | # generates a latin hypercube of values based on the restrictions above 130 | # rf and glm grids consists of 100 rows 131 | rf_grid <- paradox::generate_design_lhs(rf_params, 100)$data 132 | 133 | glm_grid <- paradox::generate_design_lhs(glm_params, 100)$data 134 | 135 | 136 | 137 | # Design_points is the grid strategy that allows me to make my own grids to experiment on. Batch_size indicates how many rows of the grid to tune simulataneously. Essentially how much to parallelize. 138 | rf_tuner <- tnr("design_points", batch_size = 8, design = rf_grid) 139 | glm_tuner <- tnr("design_points", batch_size = 8, design = glm_grid) 140 | 141 | 142 | learner_list <- list(glm = glm_mod, rf = rf_mod) 143 | params_list <- list(glm = glm_params, rf = rf_params) 144 | tuner_list <- list(glm = glm_tuner, rf = rf_tuner) 145 | 146 | lol <- list(learner_list, params_list, tuner_list) 147 | 148 | 149 | 150 | 151 | ##################################################### 152 | # Compare algorithms 153 | ##################################################### 154 | 155 | 156 | compare_algs <- function(mod, params, tuner, resampling_inner, resampling_outer, measures, terminator) { 157 | 158 | # Learner augmented with tuning. Tunes on the inner-loop bootstraps, then the best model will be used on the outer loop fold 159 | mod_tuned <- AutoTuner$new(learner = mod, 160 | resampling = resampling_inner, 161 | measures = measures, 162 | tune_ps = params, 163 | terminator = terminator, 164 | tuner = tuner) 165 | 166 | # Initiates the ncv 167 | ncv_mod <- resample(task = train_task, learner = mod_tuned, resampling = resampling_outer) 168 | 169 | # pull() needs the name of the error function being used 170 | measure_name <- measures$id 171 | # Score for each fold 172 | outer_fold_error <- ncv_mod$score(measures) %>% 173 | pull(measure_name) 174 | # Average score across outer loop folds 175 | mean_error <- ncv_mod$aggregate(measures) 176 | 177 | # output error stats and chosen hyperparams 178 | tibble( 179 | mean_error = mean_error, 180 | median_error = median(outer_fold_error), 181 | sd_error = sd(outer_fold_error) 182 | ) 183 | 184 | } 185 | 186 | 187 | algorithm_comparison <- purrr::pmap_dfr(lol, compare_algs, resampling_inner, resampling_outer_five, measures, terminator) %>% 188 | mutate(model = names(learner_list)) %>% 189 | select(model, everything()) 190 | 191 | 192 | 193 | 194 | ##################################################### 195 | # Tune and Score the Chosen Algorithm 196 | ##################################################### 197 | 198 | 199 | # Choose the best algorithm based on the lowest mean error on the outer loop folds 200 | chosen_alg <- algorithm_comparison %>% 201 | filter(mean_error == min(mean_error)) %>% 202 | pull(1) 203 | 204 | chosen_learner <- learner_list[[chosen_alg]]$reset() 205 | chosen_grid <- params_list[[chosen_alg]] 206 | chosen_tuner <- tuner_list[[chosen_alg]] 207 | 208 | # Use inner-loop tuning strategy to tune the chosen model 209 | chosen_tuned <- AutoTuner$new(learner = chosen_learner, 210 | resampling = resampling_inner, 211 | measures = measures, 212 | tune_ps = chosen_grid, 213 | terminator = terminator, 214 | tuner = chosen_tuner) 215 | 216 | # Tune the chosen model on the entire training set 217 | chosen_train <- chosen_tuned$train(train_task) 
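# (AutoTuner$train() runs the inner-loop 2-fold tuning over the full training task and then
#  refits the learner on all of train_dat with the best hyperparameters it found; the stored
#  tuning results are read back out of chosen_train$tuning_result below.)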
218 | 219 | toc(log = TRUE) 220 | 221 | 222 | 223 | # log duration metric to MLflow 224 | ncv_times <- tic.log(format = FALSE) 225 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 226 | mlflow_log_metric("duration", duration) 227 | mlflow_set_tag("implementation", "mlr3") 228 | mlflow_set_tag("method", "raschka") 229 | mlflow_end_run() 230 | 231 | 232 | 233 | # Score the best model on the entire training set and the held out test set 234 | train_error <- round(chosen_train$predict(train_task)$score(measures), 4) 235 | test_error <- round(chosen_train$predict_newdata(test_dat, task = train_task)$score(measures), 4) 236 | 237 | # Average error across tuning folds and the parameter values chosen during tuning 238 | kfold_error <- round(chosen_train$tuning_result$perf, 4) 239 | best_params <- as.data.frame(chosen_train$tuning_result$params) %>% 240 | tidyr::pivot_longer(cols = names(.), names_to = "param", values_to = "value" ) %>% 241 | glue_data("{param} = {value}") %>% 242 | glue_collapse(sep = ",", last = " and ") 243 | 244 | # avg error across outer-loop folds in ncv 245 | outer_kfold_error <- algorithm_comparison %>% 246 | filter(model == chosen_alg) %>% 247 | mutate(mean_error = round(mean_error, 5)) %>% 248 | pull(mean_error) 249 | 250 | 251 | msg <- glue("Average of K-fold CV test folds: {kfold_error} 252 | Outer-Fold Avg Error: {outer_kfold_error} 253 | Training Error: {train_error} 254 | Test Error: {test_error} 255 | Best Parameters for {chosen_alg}: 256 | {best_params}") 257 | 258 | log.txt <- tic.log(format = TRUE) 259 | text_msg <- glue("{log.txt[[1]]} for script to complete 260 | Results: 261 | {msg}") 262 | pbPost("note", title="mlr3-raschka script finished", body=text_msg) 263 | tic.clearlog() 264 | 265 | 266 | # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 267 | installr::kill_process(process = c("waitress-serve.exe")) 268 | 269 | 270 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-py-raschka.py: -------------------------------------------------------------------------------- 1 | # Nested Cross-Validation with sklearn models 2 | 3 | 4 | # Raschka method 5 | # python 6 | 7 | 8 | # Notes 9 | # 1. *** make sure target variable is the last variable in the data.frame *** 10 | # 2. Hyperparameter values were generated from the R package dials and saved to pickle format through reticulate 11 | # 3. Data was simulated by R package mlbench and saved in pickle format through reticulate 12 | # 4. Best to use reticulate::source_python('path/to/script.py', envir = NULL, convert = FALSE) 13 | # (4. cont.) envir=NULL, convert=FALSE increases the speed 14 | # 5. Bootstrap CV strategy isn't offered by Scikit Learn and I couldn't find any other Python packages offering it. 15 | # 6. With n_iter set to the number of rows in the grids, RandomizedGridSearch just reshuffles the grids. 16 | # (6. cont.) Shouldn't be done this way in real life with Latin Hypercubes because I'd think shuffling 17 | # (6. cont.) defeats the purpose of grid algorithm. Sklearn doesn't execute ParameterGrid in parallel though, 18 | # (6. cont.) and I'm just worried about fairly testing the speed of implentations. 19 | 20 | 21 | # Sections 22 | # 1. Set-up 23 | # 2. Data 24 | # 3. Algorithms 25 | # 4. Hyperparameter grids 26 | # 5. Create inner-loop tuning strategy 27 | # 6. Run nested-cv 28 | # 7. 
Train and Score Chosen Algorithm 29 | 30 | 31 | 32 | 33 | 34 | ################################### 35 | # Set-up 36 | ################################### 37 | 38 | 39 | # # If running with reticulate::repl_python or using reticulate::source_python, necessary in order 40 | # # to run in parallel. 41 | # # Should be ran before other modules imported. 42 | # # Updates executable path in sys module. 43 | # import sys 44 | # import os 45 | # exe = os.path.join(sys.exec_prefix, "pythonw.exe") 46 | # sys.executable = exe 47 | # sys._base_executable = exe 48 | # # update executable path in multiprocessing module 49 | # import multiprocessing 50 | # multiprocessing.set_executable(exe) 51 | 52 | 53 | # # If running with reticulate::repl_python or using reticulate::source_python, necessary in order 54 | # # start MLflow's server 55 | # import subprocess 56 | # import time 57 | # subprocess.Popen('mlflow server') 58 | # time.sleep(10) 59 | 60 | 61 | from pytictoc import TicToc 62 | t = TicToc() 63 | t.tic() 64 | 65 | from pushbullet import Pushbullet 66 | import os 67 | import mlflow 68 | import pickle 69 | import numpy as np 70 | import pandas as pd 71 | from sklearn.model_selection import RandomizedSearchCV, train_test_split, KFold 72 | from sklearn.ensemble import RandomForestRegressor 73 | from sklearn.linear_model import ElasticNet 74 | from sklearn.metrics import mean_absolute_error 75 | 76 | np.random.seed(2019) 77 | 78 | # dotenv allows for persistent environment variables 79 | from dotenv import load_dotenv 80 | load_dotenv() 81 | pb_token = os.getenv('PUSHBULLET_TOKEN') 82 | pb = Pushbullet(pb_token) 83 | 84 | # make explicit the name of the exeriement to record to 85 | mlflow.set_experiment("ncv_duration") 86 | 87 | 88 | 89 | ################################### 90 | # Data 91 | ################################### 92 | 93 | 94 | # load simulated data 95 | # r = read mode, b = binary; pickle is binary 96 | with open('./data/fivek-simdat.pickle', 'rb') as fried: 97 | pdat = pickle.load(fried) 98 | 99 | # load elastic net regression hyperparameter values 100 | with open('./grids/elast-latin-params.pickle', 'rb') as elastp: 101 | elast_params = pickle.load(elastp) 102 | 103 | # load random forest hyperparater values 104 | with open('./grids/rf-latin-params.pickle', 'rb') as rfp: 105 | rf_params = pickle.load(rfp) 106 | 107 | 108 | y_idx = len(pdat.columns) - 1 109 | y = pdat.iloc[:, y_idx].values 110 | X = pdat.iloc[:, 0:y_idx] 111 | X_train, X_test, y_train, y_test = train_test_split(X, y, 112 | test_size=0.2, 113 | random_state=1) 114 | 115 | 116 | 117 | 118 | #################################### 119 | # Algorithms 120 | #################################### 121 | 122 | 123 | # Elastic Net Regression 124 | elast_est = ElasticNet(normalize = True, 125 | fit_intercept = True) 126 | 127 | # Random Forest 128 | rf_est = RandomForestRegressor(criterion = "mae", 129 | random_state = 1) 130 | 131 | 132 | est_dict = {'Elastic Net': elast_est, 'Random Forest': rf_est} 133 | 134 | 135 | 136 | 137 | #################################### 138 | # Hyperparameter grids 139 | #################################### 140 | 141 | # elastic net regression 142 | alpha = elast_params.pop('penalty').values 143 | l1_ratio = elast_params.pop('mixture').values 144 | elast_grid = [{'alpha': alpha, 145 | 'l1_ratio': l1_ratio}] 146 | 147 | # random forest 148 | max_features = rf_params.pop('mtry').values 149 | n_estimators = rf_params.pop('trees').values 150 | rf_grid = [{'max_features': max_features, 151 | 'n_estimators': 
n_estimators}] 152 | 153 | 154 | grid_dict = {'Elastic Net': elast_grid, 'Random Forest': rf_grid} 155 | 156 | 157 | 158 | 159 | #################################### 160 | # Create inner-loop tuning strategy 161 | #################################### 162 | 163 | 164 | # vessel for my inner-loop grid search objects 165 | gridcvs = {} 166 | 167 | # shuffle = True required for setting random state 168 | # setting random state makes sure all algorithms tuned on the same splits 169 | inner_cv = KFold(n_splits = 2, shuffle = True, random_state = 1) 170 | 171 | # Setting this parameter to the size of the grid tells Random Search to use every grid value once 172 | elast_iter = len(elast_params) 173 | rf_iter = len(rf_params) 174 | iter_dict = {'Elastic Net': elast_iter, 'Random Forest': rf_iter} 175 | 176 | 177 | # Setting up multiple RandomSearchCV objects, 1 for each algorithm 178 | # Collecting them in the gridcvs dict 179 | for pgrid, est, n_iter, name in zip((elast_grid, rf_grid), 180 | (elast_est, rf_est), 181 | (elast_iter, rf_iter), 182 | ('Elastic Net', 'Random Forest')): 183 | gcv = RandomizedSearchCV(estimator = est, 184 | param_distributions = pgrid, 185 | n_iter = n_iter, 186 | scoring = 'neg_mean_absolute_error', 187 | n_jobs = -1, 188 | cv = inner_cv, 189 | verbose = 0, 190 | refit = True) 191 | gridcvs[name] = gcv 192 | 193 | 194 | 195 | 196 | #################################### 197 | # Run nested-cv 198 | #################################### 199 | 200 | # vessel for stats on the outer fold results 201 | results = pd.DataFrame() 202 | 203 | # The validation set scores of the outer loop folds will be used to choose the best algorithm 204 | # loop the grid objects we created (1 for each algorithm) 205 | for name, gs_est in sorted(gridcvs.items()): 206 | outer_scores = [] 207 | # set the outer loop cv strategy 208 | outer_cv = KFold(n_splits=5, shuffle=True, random_state=1) 209 | 210 | # training set is split into train/valid as the outer-loop folds 211 | for train_idx, valid_idx in outer_cv.split(X_train, y_train): 212 | # training set is fed into inner-loop for tuning 213 | gridcvs[name].fit(X_train.values[train_idx], y_train[train_idx]) 214 | # best model chosen to be scored on the outer-loop valid set 215 | outer_scores.append(gridcvs[name].best_estimator_.score(X_train.values[valid_idx], y_train[valid_idx])) 216 | 217 | # stats calc'd on outer fold scores 218 | fold_score = {'model': name, 219 | 'mean_error': np.mean(outer_scores), 220 | 'sd_error': np.std(outer_scores)} 221 | # 1 row per algorithm 222 | results = results.append(fold_score, ignore_index=True) 223 | 224 | # Choose the best algorithm based on the lowest mean error on the outer loop folds 225 | chosen_alg = results[results.mean_error == results.mean_error.min()]['model'][0] 226 | 227 | chosen_est = est_dict[chosen_alg] 228 | chosen_grid = grid_dict[chosen_alg] 229 | chosen_iter = iter_dict[chosen_alg] 230 | 231 | 232 | 233 | 234 | #################################### 235 | # Train and Score Chosen Algorithm 236 | #################################### 237 | 238 | 239 | # Use inner-loop tuning strategy to tune the chosen model 240 | gcv_model_select = RandomizedSearchCV(estimator = chosen_est, 241 | param_distributions = chosen_grid, 242 | n_iter = chosen_iter, 243 | scoring = 'neg_mean_absolute_error', 244 | n_jobs = -1, 245 | cv = inner_cv, 246 | verbose = 0, 247 | refit = True) 248 | 249 | # Tune the chosen model on the entire training set 250 | gcv_model_select.fit(X_train, y_train) 251 | 252 | # given in seconds 
253 | time_elapsed = round(t.tocvalue(), 2) 254 | 255 | # log the metric on MLflow 256 | mlflow.log_metric('duration', time_elapsed) 257 | tags = {'implementation': 'python', 'method': 'raschka'} 258 | mlflow.set_tags(tags) 259 | mlflow.end_run() 260 | 261 | 262 | best_model = gcv_model_select.best_estimator_ 263 | 264 | # Score the best model on the entire training set and the held out test set 265 | train_error = round(mean_absolute_error(y_true=y_train, y_pred=best_model.predict(X_train)), 5) 266 | test_error = round(mean_absolute_error(y_true=y_test, y_pred=best_model.predict(X_test)), 5) 267 | 268 | # Average error across tuning folds and the parameter values chosen during tuning 269 | k_fold_score = round(-1 * gcv_model_select.best_score_, 5) 270 | outer_kfold_score = round(results[results.model == chosen_alg]['mean_error'][0], 5) 271 | best_hyper_vals = gcv_model_select.best_params_ 272 | 273 | # best_hyper_vals is a dict. Use it to create the df, then add the errors. 274 | model_stats = pd.DataFrame(data = best_hyper_vals, index = [1]) 275 | model_stats = model_stats.assign(kfold_error = k_fold_score, outer_fold_error = outer_kfold_score, train_error = train_error, test_error = test_error) 276 | 277 | 278 | # evidently has to be written into a paragraph because print outputs can't be saved and there's no glue in python. 279 | msg = f'Python script finished in {time_elapsed} seconds. The chosen algorithm was {chosen_alg} with parameters, {best_hyper_vals}. Avg score over cv\'s test folds was {k_fold_score}. Outer fold avg score was {outer_kfold_score}. Training Error: {train_error}, Test Error: {test_error}' 280 | 281 | # text me the results 282 | pb.push_note("Nested CV script finished", msg) 283 | 284 | 285 | # # only necessary if running with reticulate::repl_python or using reticulate::source_python 286 | # # MLflow uses waitress for Windows. Killing it also kills mlflow.exe, python.exe, console window host processes 287 | # os.system('taskkill /f /im waitress-serve.exe') 288 | # os.system('taskkill /f /im pythonw.exe') 289 | -------------------------------------------------------------------------------- /duration-experiment/raschka/nested-cv-retic-raschka.R: -------------------------------------------------------------------------------- 1 | # Nested Cross-validation using Scikit-Learn 2 | 3 | 4 | # Raschka's method 5 | # reticulate 6 | 7 | 8 | # Notes 9 | # 1. Uses the reticulate pkg to implement Raschka's nested cv 10 | # 2. *** make sure target variable is the last variable in the data.frame *** 11 | # 3. Bootstrap CV strategy isn't offered by Scikit Learn and I couldn't find any other Python packages offering it. 12 | # 4. With n_iter set to the number of rows in the grids, RandomizedGridSearch just reshuffles the grids. Shouldn't be done this way in real life with Latin Hypercubes because I'd think shuffling defeats the purpose of grid algorithm. Sklearn doesn't execute ParameterGrid in parallel though, and I'm just worried about fairly testing the speed of implentations. 13 | 14 | 15 | # Sections: 16 | # 1. Set-Up 17 | # 2. Data 18 | # 3. Estimators 19 | # 4. Hyperparameter Grids 20 | # 5. Inner and Outer CV Strategies 21 | # 6. Compare Algorithms 22 | # 7. 
Tune and Score Chosen Algorithm 23 | 24 | 25 | 26 | 27 | 28 | ###################################### 29 | # Set-up 30 | ###################################### 31 | 32 | 33 | # text me if any errors occur 34 | options(error = function() { 35 | library(RPushbullet) 36 | pbPost("note", "Error", geterrmessage()) 37 | if(!interactive()) stop(geterrmessage()) 38 | }) 39 | 40 | 41 | # start MLflow server 42 | sys::exec_background("mlflow server") 43 | Sys.sleep(10) 44 | 45 | 46 | library(tictoc) 47 | tic() 48 | 49 | pacman::p_load(RPushbullet, dplyr, glue, reticulate, mlflow) 50 | 51 | # make explicit the name of the exeriement to record to 52 | mlflow_set_experiment("ncv_duration") 53 | 54 | 55 | #------- Required for Multiprocessing with reticulate (in just Windows?) 56 | 57 | # update executable path in sys module 58 | sys <- import("sys") 59 | exe <- file.path(sys$exec_prefix, "pythonw.exe") 60 | sys$executable <- exe 61 | sys$`_base_executable` <- exe 62 | 63 | # update executable path in multiprocessing module 64 | multiprocessing <- import("multiprocessing") 65 | multiprocessing$set_executable(exe) 66 | 67 | #------- 68 | 69 | 70 | sk_lm <- import("sklearn.linear_model") 71 | sk_e <- import("sklearn.ensemble") 72 | sk_ms <- import("sklearn.model_selection") 73 | sk_m <- import("sklearn.metrics") 74 | 75 | set.seed(2019) 76 | py_set_seed(2019) 77 | 78 | 79 | ###################################### 80 | # Data 81 | ###################################### 82 | 83 | 84 | sim_data <- function(n) { 85 | tmp <- mlbench::mlbench.friedman1(n, sd=1) 86 | tmp <- cbind(tmp$x, tmp$y) 87 | tmp <- as.data.frame(tmp) 88 | names(tmp)[ncol(tmp)] <- "y" 89 | tmp 90 | } 91 | 92 | dat <- sim_data(5000) 93 | 94 | pdat = r_to_py(dat) 95 | 96 | y_idx <- py_len(pdat$columns) - 1 97 | X_idx <- y_idx - 1 98 | y = pdat$iloc(axis = 1L)[y_idx]$values 99 | X = pdat$iloc(axis = 1L)[0:X_idx] 100 | 101 | dat_splits<- sk_ms$train_test_split(X, y, 102 | test_size=0.2, 103 | random_state=1L) 104 | 105 | X_train <- dat_splits[[1]] 106 | X_test <- dat_splits[[2]] 107 | y_train <- as.numeric(dat_splits[[3]]) 108 | y_test <- as.numeric(dat_splits[[4]]) 109 | 110 | 111 | 112 | 113 | ###################################### 114 | # Estimators 115 | ###################################### 116 | 117 | # Elastic Net Regression 118 | elast_est <- sk_lm$ElasticNet(normalize = TRUE, 119 | fit_intercept = TRUE) 120 | 121 | # Random Forest 122 | rf_est <- sk_e$RandomForestRegressor(criterion = "mae", 123 | random_state = 1L) 124 | 125 | 126 | alg_list <- list(elastic_net = elast_est, rf = rf_est) 127 | 128 | 129 | 130 | 131 | ###################################### 132 | # Hyperparameter grids 133 | ###################################### 134 | 135 | 136 | # Elastic Net Regression 137 | elast_params <- r_to_py(dials::grid_latin_hypercube( 138 | dials::mixture(), 139 | dials::penalty(), 140 | size = 100 141 | )) 142 | alpha <- elast_params$pop('penalty')$values 143 | l1_ratio <- elast_params$pop('mixture')$values 144 | elast_grid <- py_dict(list('alpha', 'l1_ratio'), 145 | list(alpha, l1_ratio)) 146 | 147 | 148 | 149 | # Random Forest 150 | rf_params <- r_to_py(dials::grid_latin_hypercube( 151 | dials::mtry(range = c(3, 4)), 152 | dials::trees(range = c(200, 300)), 153 | size = 100 154 | )) 155 | max_features <- rf_params$pop('mtry')$values 156 | n_estimators <- rf_params$pop('trees')$values 157 | rf_grid <- py_dict(list('max_features', 'n_estimators'), list(max_features, n_estimators)) 158 | 159 | 160 | grid_list <- list(elastic_net = elast_grid, 
rf = rf_grid) 161 | 162 | 163 | 164 | 165 | ###################################### 166 | # Inner and Outer CV strategies 167 | ###################################### 168 | 169 | # Setting the inner-loop tuning strategy 170 | inner_cv <- sk_ms$KFold(n_splits = 2L, 171 | shuffle = TRUE, 172 | random_state = 1L) 173 | 174 | # Setting the outer-loop cv strategy 175 | outer_cv <- sk_ms$KFold(n_splits = 5L, 176 | shuffle = TRUE, 177 | random_state = 1L) 178 | 179 | # Setting n_iter to the total rows for each grid just reshuffles the grids 180 | n_iter_list <- list(elastic_net = py_len(elast_params), rf = py_len(rf_params)) 181 | 182 | lol <- list(alg_list, grid_list, n_iter_list) 183 | 184 | # Setting up multiple RandomSearchCV objects, 1 for each algorithm 185 | # Collecting them in the inner-loop list 186 | inner_loop <- purrr::pmap(lol, function(alg, grid, n_iter) { 187 | sk_ms$RandomizedSearchCV(estimator = alg, 188 | param_distributions = grid, 189 | n_iter = n_iter, 190 | scoring = 'neg_mean_absolute_error', 191 | cv = inner_cv, 192 | n_jobs = -1L, 193 | pre_dispatch = '2*n_jobs', 194 | refit = TRUE) 195 | }) 196 | 197 | 198 | 199 | 200 | ###################################### 201 | # Compare Algorithms 202 | ###################################### 203 | 204 | 205 | algorithm_comparison <- purrr::map(inner_loop, function(grid_search) { 206 | 207 | outer_scores <- list() 208 | counter <- 0 209 | 210 | outer_split <- outer_cv$split(X_train, y_train) 211 | 212 | # while loop + iter_next = a python for-loop 213 | while (TRUE) { 214 | 215 | # python methods create "iterable" objs 216 | fold <- iter_next(outer_split) 217 | # loop ends once we run out of folds 218 | if (is.null(fold)) 219 | break 220 | 221 | # python 0-indexed, so need to add 1 in order to correctly subset in R 222 | train_idx <- as.integer(fold[[1]]) + 1 223 | valid_idx <- as.integer(fold[[2]]) + 1 224 | 225 | pX_train <- r_to_py(X_train[train_idx,]) 226 | pX_valid <- r_to_py(X_train[valid_idx,]) 227 | 228 | py_train <- r_to_py(y_train[train_idx]) 229 | py_valid <- r_to_py(y_train[valid_idx]) 230 | 231 | # training set is fed into inner-loop for tuning 232 | tuning_results <- grid_search$fit(pX_train, py_train) 233 | 234 | counter <- sum(counter, 1) 235 | 236 | # best model chosen to be scored on the outer-loop valid set 237 | outer_scores[[counter]] <- data.frame(error = tuning_results$best_estimator_$score(pX_valid, py_valid)) 238 | } 239 | 240 | # stats calc'd on outer fold scores 241 | outer_stats <- outer_scores %>% 242 | bind_rows() %>% 243 | summarize(mean_error = mean(error, na.rm = TRUE), 244 | median_error = median(error, na.rm = TRUE), 245 | sd_error = sd(error, na.rm = TRUE)) 246 | }) 247 | 248 | 249 | 250 | 251 | ###################################### 252 | # Tune and Score the Chosen Algorithm 253 | ###################################### 254 | 255 | 256 | # Choose the best algorithm based on the lowest mean error on the outer loop folds 257 | chosen_alg <- purrr::map_dfr(algorithm_comparison, ~select(., mean_error), .id = "model") %>% 258 | filter(mean_error == min(mean_error)) %>% 259 | pull(1) 260 | 261 | alg <- alg_list[[chosen_alg]] 262 | grid <- grid_list[[chosen_alg]] 263 | n_iter <- n_iter_list[[chosen_alg]] 264 | 265 | # Use inner-loop tuning strategy to tune the chosen model 266 | chosen_tuned <- sk_ms$RandomizedSearchCV(estimator = alg, 267 | param_distributions = grid, 268 | n_iter = n_iter, 269 | scoring = 'neg_mean_absolute_error', 270 | cv = inner_cv, 271 | n_jobs = -1L, 272 | refit = TRUE) 273 | 274 | 
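# Note (descriptive, not in the original script): with refit = TRUE, scikit-learn refits the best
# hyperparameter combination on whatever data is passed to fit(), so the best_estimator_ used
# further below is already a final model trained on the full training set.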
# Tune the chosen model on the entire training set 275 | chosen_train <- chosen_tuned$fit(X_train, y_train) 276 | 277 | toc(log = TRUE) 278 | 279 | 280 | 281 | # log duration metric to MLflow 282 | ncv_times <- tic.log(format = FALSE) 283 | duration <- as.numeric(ncv_times[[1]]$toc - ncv_times[[1]]$tic) 284 | mlflow_log_metric("duration", duration) 285 | mlflow_set_tag("implementation", "reticulate") 286 | mlflow_set_tag("method", "raschka") 287 | mlflow_end_run() 288 | 289 | 290 | 291 | chosen_model <- chosen_train$best_estimator_ 292 | 293 | # Score the best model on the entire training set and the held out test set 294 | train_error <- round(sk_m$mean_absolute_error(y_true = y_train, y_pred = chosen_model$predict(X_train)), 4) 295 | test_error <- round(sk_m$mean_absolute_error(y_true = y_test, y_pred = chosen_model$predict(X_test)), 4) 296 | 297 | # Average error across tuning folds and the parameter values chosen during tuning 298 | kfold_error <- round(-chosen_tuned$best_score_, 4) 299 | best_params <- as.data.frame(chosen_tuned$best_params_) %>% 300 | tidyr::pivot_longer(cols = names(.), names_to = "param", values_to = "value" ) %>% 301 | glue_data("{param} = {value}") %>% 302 | glue_collapse(sep = ",", last = " and ") 303 | 304 | outer_kfold_error <- purrr::map_dfr(algorithm_comparison, ~select(., mean_error), .id = "model") %>% 305 | filter(mean_error == min(mean_error)) %>% 306 | mutate(mean_error = round(mean_error, 5)) %>% 307 | pull(mean_error) 308 | 309 | 310 | msg <- glue("Average of K-fold CV test folds: {kfold_error} 311 | Outer-Fold Avg Error: {outer_kfold_error} 312 | Training Error: {train_error} 313 | Test Error: {test_error} 314 | Best Parameters for {chosen_alg}: 315 | {best_params}") 316 | 317 | log.txt <- tic.log(format = TRUE) 318 | text_msg <- glue("{log.txt[[1]]} for script to complete 319 | Results: 320 | {msg}") 321 | 322 | # text me the results 323 | pbPost("note", title="reticulate-raschka script finished", body=text_msg) 324 | tic.clearlog() 325 | 326 | 327 | # MLflow uses waitress for Windows. 
Killing it also kills mlflow.exe, python.exe, console window host processes 328 | installr::kill_process(process = c("waitress-serve.exe", "pythonw.exe")) 329 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: null 2 | channels: 3 | - defaults 4 | dependencies: 5 | - certifi=2019.11.28=py36_0 6 | - pip=20.0.2=py36_1 7 | - python=3.6.10=h9f7ef89_0 8 | - setuptools=45.2.0=py36_0 9 | - sqlite=3.31.1=he774522_0 10 | - vc=14.1=h0510ff6_4 11 | - vs2015_runtime=14.16.27012=hf0eaf9b_1 12 | - wheel=0.34.2=py36_0 13 | - wincertstore=0.2=py36h7fe50ca_0 14 | - pip: 15 | - alembic==1.4.0 16 | - chardet==3.0.4 17 | - click==7.0 18 | - cloudpickle==1.3.0 19 | - configparser==4.0.2 20 | - databricks-cli==0.9.1 21 | - docker==4.2.0 22 | - entrypoints==0.3 23 | - flask==1.1.1 24 | - gitdb2==3.0.2 25 | - gitpython==3.0.8 26 | - gorilla==0.3.0 27 | - idna==2.9 28 | - itsdangerous==1.1.0 29 | - jinja2==2.11.1 30 | - joblib==0.14.1 31 | - mako==1.1.1 32 | - markupsafe==1.1.1 33 | - mlflow==1.6.0 34 | - numpy==1.18.1 35 | - pandas==1.0.1 36 | - prometheus-client==0.7.1 37 | - prometheus-flask-exporter==0.12.2 38 | - protobuf==3.11.3 39 | - pushbullet-py==0.11.0 40 | - pypiwin32==223 41 | - python-dateutil==2.8.1 42 | - python-dotenv==0.11.0 43 | - python-editor==1.0.4 44 | - python-magic==0.4.15 45 | - pytictoc==1.5.0 46 | - pytz==2019.3 47 | - pywin32==227 48 | - pyyaml==5.3 49 | - querystring-parser==1.2.4 50 | - requests==2.23.0 51 | - scikit-learn==0.22.1 52 | - scipy==1.4.1 53 | - simplejson==3.17.0 54 | - six==1.14.0 55 | - smmap2==2.0.5 56 | - sqlalchemy==1.3.13 57 | - sqlparse==0.3.0 58 | - tabulate==0.8.6 59 | - urllib3==1.25.8 60 | - waitress==1.4.3 61 | - websocket-client==0.57.0 62 | - werkzeug==1.0.0 63 | prefix: C:\Users\tbats\Documents\R\Projects\nested-cross-validation-comparison\renv\python\condaenvs\renv-python 64 | -------------------------------------------------------------------------------- /grids/elast-latin-params.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/grids/elast-latin-params.pickle -------------------------------------------------------------------------------- /grids/rf-latin-params.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/grids/rf-latin-params.pickle -------------------------------------------------------------------------------- /images/ncv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/images/ncv.png -------------------------------------------------------------------------------- /nested-cross-validation-comparison.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 6 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /palettes/Forest Floor.ase: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ercbk/nested-cross-validation-comparison/d0016e1cb08640749cadad2addf1ce824c7802da/palettes/Forest Floor.ase -------------------------------------------------------------------------------- /performance-experiment/Kuhn-Johnson/plan-kj.R: -------------------------------------------------------------------------------- 1 | # Kuhn-Johnson drake plan 2 | 3 | 4 | # Notes: 5 | # 1. I broke the plan into units by sample size. I'm sure its possible to formulate the plan to perform the whole experiment by looping the kj and raschka method along with sample sizes into one large, more compact plan, but I wanted units that I could run overnight on my desktop. 6 | # 2. sample_sizes: 100, 800, 2000, 5000, 10000 (maybe) 7 | # 3. I'm trying to minimize the delta_error. Delta error is the absolute difference between the average error across the outer-folds of the nested cross-validation and the out-of-sample error which uses the chosen model and parameters to predict on a simulated 100K row dataset. 8 | 9 | 10 | 11 | 12 | 13 | error_FUN <- function(y_obs, y_hat){ 14 | y_obs <- unlist(y_obs) 15 | y_hat <- unlist(y_hat) 16 | Metrics::mae(y_obs, y_hat) 17 | } 18 | 19 | method <- "kj" 20 | algorithms <- list("glmnet", "rf") 21 | repeats <- seq(1:5) 22 | grid_size <- 100 23 | 24 | plan <- drake_plan( 25 | # model functions for each algorithm 26 | mod_FUN_list = create_models(algorithms), 27 | # data used to estimate out-of-sample error 28 | # noise_sd, seed settings are the defaults 29 | large_dat = mlbench_data(n = 10^5, 30 | noise_sd = 1, 31 | seed = 2019), 32 | # sample size = 100 33 | sim_dat_100 = mlbench_data(100), 34 | # hyperparameter grids for each algorithm 35 | # This probably doesn't need to be a "dynamic" target since mtry is only concerned about the number of columns in data (see script), but I'll do it anyways 36 | params_list_100 = create_grids(sim_dat_100, 37 | algorithms, 38 | size = grid_size), 39 | # create a separate ncv data object for each repeat value 40 | ncv_dat_100 = create_ncv_objects(sim_dat_100, 41 | repeats, 42 | method), 43 | # runs nested-cv and compares ncv error with out-of-sample error 44 | # outputs: ncv error, oos error, delta error, chosen algorithm, chosen hyperparameters 45 | ncv_results_100 = target( 46 | run_ncv(ncv_dat_100, 47 | sim_dat_100, 48 | large_dat, 49 | mod_FUN_list, 50 | params_list_100, 51 | error_FUN, 52 | method), 53 | dynamic = map(ncv_dat_100) 54 | ), 55 | 56 | # repeat for the rest of the sample sizes 57 | # sample size = 800 58 | sim_dat_800 = mlbench_data(800), 59 | params_list_800 = create_grids(sim_dat_800, 60 | algorithms, 61 | size = grid_size), 62 | ncv_dat_800 = create_ncv_objects(sim_dat_800, 63 | repeats, 64 | method), 65 | ncv_results_800 = target( 66 | run_ncv(ncv_dat_800, 67 | sim_dat_800, 68 | large_dat, 69 | mod_FUN_list, 70 | params_list_800, 71 | error_FUN, 72 | method), 73 | dynamic = map(ncv_dat_800) 74 | ), 75 | 76 | # sample size = 2000 77 | sim_dat_2000 = mlbench_data(2000), 78 | params_list_2000 = create_grids(sim_dat_2000, 79 | algorithms, 80 | size = grid_size), 81 | ncv_dat_2000 = create_ncv_objects(sim_dat_2000, 82 | repeats, 83 | method), 84 | ncv_results_2000 = target( 85 | run_ncv(ncv_dat_2000, 86 | sim_dat_2000, 87 | large_dat, 88 | mod_FUN_list, 89 | params_list_2000, 90 | error_FUN, 91 | method), 92 | dynamic = map(ncv_dat_2000) 93 | ), 94 | 95 | # sample size = 5000 96 | sim_dat_5000 = mlbench_data(5000), 
97 | params_list_5000 = create_grids(sim_dat_5000, 98 | algorithms, 99 | size = grid_size), 100 | ncv_dat_5000 = create_ncv_objects(sim_dat_5000, 101 | repeats, 102 | method), 103 | ncv_results_5000 = target( 104 | run_ncv(ncv_dat_5000, 105 | sim_dat_5000, 106 | large_dat, 107 | mod_FUN_list, 108 | params_list_5000, 109 | error_FUN, 110 | method), 111 | dynamic = map(ncv_dat_5000) 112 | ) 113 | 114 | ) 115 | 116 | 117 | -------------------------------------------------------------------------------- /performance-experiment/Kuhn-Johnson/r_make-kj.R: -------------------------------------------------------------------------------- 1 | # r_make 2 | # runs experiment in a clean environment in a separate r session 3 | 4 | 5 | # Notes 6 | # 1. Didn't make the render readme function into a target because the readd doesn't explicitly call a specific target, so drake isn't triggered to build it. The buildtimes function in the readme also doesn't trigger a build. 7 | 8 | 9 | # text me if an error occurs 10 | options(error = function() { 11 | library(RPushbullet) 12 | pbPost("note", "Error", geterrmessage()) 13 | if(!interactive()) stop(geterrmessage()) 14 | }) 15 | 16 | 17 | 18 | drake::r_make(source = "_drake-kj.R" ) 19 | 20 | 21 | rmarkdown::render( 22 | input = "README.Rmd" 23 | ) 24 | 25 | 26 | # text me when it finishes 27 | RPushbullet::pbPost("note", title="kj performance experiment", body="ncv run finished") 28 | 29 | 30 | -------------------------------------------------------------------------------- /performance-experiment/Raschka/plan-raschka.R: -------------------------------------------------------------------------------- 1 | # Raschka drake plan 2 | 3 | 4 | # Notes: 5 | # 1. I broke the plan into units by sample size. I'm sure its possible to formulate the plan to perform the whole experiment by looping the kj and raschka method along with sample sizes into one large, more compact plan, but I wanted units that I could run overnight on my desktop. 6 | # 2. sample_sizes: 100, 800, 2000, 5000, 10000 (maybe) 7 | # 3. I'm trying to minimize the delta_error. Delta error is the absolute difference between the average error across the outer-folds of the nested cross-validation and the out-of-sample error which uses the chosen model and parameters to predict on a simulated 100K row dataset. 
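# (For reference: ncv-compare.R computes this quantity as delta_error = abs(oos_error - ncv_error),
# where ncv_error is the mean error across the outer folds and oos_error is the error of the final
# chosen model predicting on the 100K-row simulated set.)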
8 | 9 | 10 | 11 | 12 | 13 | error_FUN <- function(y_obs, y_hat){ 14 | y_obs <- unlist(y_obs) 15 | y_hat <- unlist(y_hat) 16 | Metrics::mae(y_obs, y_hat) 17 | } 18 | 19 | method <- "raschka" 20 | algorithms <- list("glmnet", "rf") 21 | repeats <- seq(1:5) 22 | grid_size <- 100 23 | 24 | plan <- drake_plan( 25 | # model functions for each algorithm 26 | mod_FUN_list_r = create_models(algorithms), 27 | # data used to estimate out-of-sample error 28 | # noise_sd, seed settings are the defaults 29 | large_dat_r = mlbench_data(n = 10^5, 30 | noise_sd = 1, 31 | seed = 2019), 32 | # sample size = 100 33 | sim_dat_r_100 = mlbench_data(100), 34 | # hyperparameter grids for each algorithm 35 | params_list_r_100 = create_grids(sim_dat_r_100, 36 | algorithms, 37 | size = grid_size), 38 | # create a separate ncv data object for each repeat value 39 | ncv_dat_r_100 = create_ncv_objects(sim_dat_r_100, 40 | repeats, 41 | method), 42 | # runs nested-cv and compares ncv error with out-of-sample error 43 | # outputs: ncv error, oos error, delta error, chosen algorithm, chosen hyperparameters 44 | ncv_results_r_100 = target( 45 | run_ncv(ncv_dat_r_100, 46 | sim_dat_r_100, 47 | large_dat_r, 48 | mod_FUN_list_r, 49 | params_list_r_100, 50 | error_FUN, 51 | method), 52 | dynamic = map(ncv_dat_r_100) 53 | ), 54 | 55 | # repeat for the rest of the sample sizes 56 | # sample size = 800 57 | sim_dat_r_800 = mlbench_data(800), 58 | params_list_r_800 = create_grids(sim_dat_r_800, 59 | algorithms, 60 | size = grid_size), 61 | ncv_dat_r_800 = create_ncv_objects(sim_dat_r_800, 62 | repeats, 63 | method), 64 | ncv_results_r_800 = target( 65 | run_ncv(ncv_dat_r_800, 66 | sim_dat_r_800, 67 | large_dat_r, 68 | mod_FUN_list_r, 69 | params_list_r_800, 70 | error_FUN, 71 | method), 72 | dynamic = map(ncv_dat_r_800) 73 | ), 74 | 75 | # sample size = 2000 76 | sim_dat_r_2000 = mlbench_data(2000), 77 | params_list_r_2000 = create_grids(sim_dat_r_2000, 78 | algorithms, 79 | size = grid_size), 80 | ncv_dat_r_2000 = create_ncv_objects(sim_dat_r_2000, 81 | repeats, 82 | method), 83 | ncv_results_r_2000 = target( 84 | run_ncv(ncv_dat_r_2000, 85 | sim_dat_r_2000, 86 | large_dat_r, 87 | mod_FUN_list_r, 88 | params_list_r_2000, 89 | error_FUN, 90 | method), 91 | dynamic = map(ncv_dat_r_2000) 92 | ), 93 | 94 | # sample size = 5000 95 | sim_dat_r_5000 = mlbench_data(5000), 96 | params_list_r_5000 = create_grids(sim_dat_r_5000, 97 | algorithms, 98 | size = grid_size), 99 | ncv_dat_r_5000 = create_ncv_objects(sim_dat_r_5000, 100 | repeats, 101 | method), 102 | ncv_results_r_5000 = target( 103 | run_ncv(ncv_dat_r_5000, 104 | sim_dat_r_5000, 105 | large_dat_r, 106 | mod_FUN_list_r, 107 | params_list_r_5000, 108 | error_FUN, 109 | method), 110 | dynamic = map(ncv_dat_r_5000) 111 | ) 112 | ) 113 | 114 | 115 | -------------------------------------------------------------------------------- /performance-experiment/Raschka/r_make-raschka.R: -------------------------------------------------------------------------------- 1 | # r_make 2 | # runs experiment in a clean environment in a separate r session 3 | 4 | 5 | # Notes 6 | # 1. Didn't make the render readme function into a target because the readd doesn't explicitly call a specific target, so drake isn't triggered to build it. The buildtimes function in the readme also doesn't trigger a build. 
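# 2. (Illustration only, not used in this project) drake's documented pattern for tracking a report is to
#    render through knitr_in()/file_out() inside the plan and to call readd()/loadd() with explicit target
#    names in the Rmd, e.g.:
#      report = rmarkdown::render(knitr_in("README.Rmd"), output_file = file_out("README.md"), quiet = TRUE)
#    The README.md output file name here is an assumption based on the github_document output format.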
7 | 8 | 9 | # text me if an error occurs 10 | options(error = function() { 11 | library(RPushbullet) 12 | pbPost("note", "Error", geterrmessage()) 13 | if(!interactive()) stop(geterrmessage()) 14 | }) 15 | 16 | 17 | 18 | drake::r_make(source = "_drake-raschka.R") 19 | 20 | rmarkdown::render( 21 | input = "README.Rmd" 22 | ) 23 | 24 | 25 | # text me when it finishes 26 | RPushbullet::pbPost("note", title="raschka performance experiment", body="ncv run finished") 27 | 28 | -------------------------------------------------------------------------------- /performance-experiment/functions/create-grids.R: -------------------------------------------------------------------------------- 1 | # Create Hyperparameter grid list 2 | 3 | 4 | # input: 5 | # 1, size = number of rows 6 | # 2. algorithms = list of algorithm abbreviations 7 | # "rf" = Ranger Random Forest 8 | # "glmnet" = Elastic Net regression 9 | # "svm" = Support Vector Machines 10 | 11 | # output: list of grid objects 12 | 13 | 14 | 15 | create_grids <- function(sim_dat, algorithms, size = 100) { 16 | 17 | # Elastic Net Regression 18 | 19 | glm_params <- dials::grid_latin_hypercube( 20 | dials::mixture(), 21 | dials::penalty(), 22 | size = size 23 | ) 24 | 25 | # Random Forest 26 | 27 | # Some of the parnsip model parameters have "unknown" for the default value ranges. finalize replaces the unknowns with values based on the data. 28 | mtry_updated <- dials::finalize(dials::mtry(), select(sim_dat, -ncol(sim_dat))) 29 | 30 | rf_params <- dials::grid_latin_hypercube( 31 | mtry_updated, 32 | dials::trees(), 33 | size = size 34 | ) 35 | 36 | # Support Vector Machines 37 | 38 | svm_params <- dials::grid_latin_hypercube( 39 | dials::cost(), 40 | dials::margin(), 41 | size = size 42 | ) 43 | 44 | # list of grid objects depending on the algorithms inputted (switch is pretty cool) 45 | # stop_glue throws error if algorithm inputted isn't available (Should be in glue pkg but isn't) 46 | grid_list <- purrr::map(algorithms, function(alg) { 47 | switch(alg, 48 | rf = rf_params -> alg_grid, 49 | glmnet = glm_params -> alg_grid, 50 | svm = svm_params -> alg_grid, 51 | infer:::stop_glue("{alg} grid not available.")) 52 | alg_grid 53 | 54 | }) %>% 55 | purrr::set_names(algorithms) 56 | } 57 | 58 | -------------------------------------------------------------------------------- /performance-experiment/functions/create-models.R: -------------------------------------------------------------------------------- 1 | # Creates list of model functions 2 | 3 | # input: list of algorithm abbreviations 4 | # "rf" = Ranger Random Forest 5 | # "glmnet" = Elastic Net regression 6 | # "svm" = Support Vector Machines 7 | 8 | # output: list of model functions 9 | 10 | 11 | 12 | create_models <- function(algorithms) { 13 | 14 | # Random Forest 15 | 16 | ranger_FUN <- function(params, analysis_set) { 17 | mtry <- params$mtry[[1]] 18 | trees <- params$trees[[1]] 19 | model <- ranger::ranger(y ~ ., data = analysis_set, mtry = mtry, num.trees = trees) 20 | model 21 | } 22 | 23 | # Elastic Net Regression 24 | 25 | glm_FUN <- function(params, analysis_set) { 26 | alpha <- params$mixture[[1]] 27 | lambda <- params$penalty[[1]] 28 | model <- parsnip::linear_reg(mixture = alpha, penalty = lambda) %>% 29 | parsnip::set_engine("glmnet") %>% 30 | generics::fit(y ~ ., data = analysis_set) 31 | model 32 | } 33 | 34 | # Support Vector Machines 35 | 36 | svm_FUN <- function(params, analysis_set) { 37 | cost <- params$cost[[1]] 38 | model <- kernlab::ksvm(y ~ ., data = analysis_set, C = 
cost) 39 | model 40 | } 41 | 42 | # list of model objects depending on the algorithms inputted (switch is pretty cool) 43 | # stop_glue throws error if algorithm inputted isn't available (Should be in glue pkg but isn't) 44 | mod_FUN_list <- purrr::map(algorithms, function(alg) { 45 | switch(alg, 46 | rf = ranger_FUN -> mod_fun, 47 | glmnet = glm_FUN -> mod_fun, 48 | svm = svm_FUN -> mod_fun, 49 | infer:::stop_glue("{alg} model function not available.")) 50 | mod_fun 51 | 52 | }) %>% 53 | purrr::set_names(algorithms) 54 | } 55 | 56 | 57 | -------------------------------------------------------------------------------- /performance-experiment/functions/create-ncv-objects.R: -------------------------------------------------------------------------------- 1 | # nested-cv data function 2 | 3 | # inputs: 4 | # 1. dat = dataset 5 | # 2. repeats = numeric vector with numbers of repeats 6 | # 3. method = "kj" or "raschka" 7 | # outputs: 8 | # 1. list of {rsample} nested cv objects; one object per repeat value 9 | 10 | 11 | 12 | create_ncv_objects <- function(dat, repeats, method) { 13 | 14 | attempt::stop_if_not(repeats, is.numeric, "repeats needs to be a numeric class") 15 | attempt::stop_if_not(method, is.character, "method needs to be a character class") 16 | 17 | # don't remember but guessing crossing needs a list object 18 | if (is.data.frame(dat)) { 19 | dat <- list(dat) 20 | } 21 | # tibble grid of data and repeats 22 | grid <- tidyr::crossing(dat, repeats) 23 | 24 | # generate list of ncv objects 25 | # dynGet needed to get reps out of the envirnonment and into the nested_cv function 26 | if (method == "kj") { 27 | ncv_list <- purrr::map2(grid$dat, grid$repeats, function(dat, reps) { 28 | rsample::nested_cv(dat, 29 | outside = vfold_cv(v = 10, repeats = dynGet("reps")), 30 | inside = bootstraps(times = 25)) 31 | }) 32 | } else if (method == "raschka") { 33 | ncv_list <- purrr::map2(grid$dat, grid$repeats, function(dat, reps) { 34 | rsample::nested_cv(dat, 35 | outside = vfold_cv(v = 5, repeats = dynGet("reps")), 36 | inside = vfold_cv(v = 2)) 37 | }) 38 | } else { 39 | stop("Need to specify method as kj or raschka", call. = FALSE) 40 | } 41 | 42 | return(ncv_list) 43 | } 44 | 45 | 46 | 47 | 48 | -------------------------------------------------------------------------------- /performance-experiment/functions/inner-tune.R: -------------------------------------------------------------------------------- 1 | # inner loop tuning function 2 | 3 | # inputs: 4 | # 1. ncv_dat = one ncv object from the list created by create-ncv-objects.R 5 | # 2. mod_FUN_list = all the model objects created by create-models.R 6 | # 3. params_list = all the hyperparameter grids created by create-grids.R 7 | # 4. 
error_FUN = specified at the start of plan-.R 8 | 9 | # outputs: df of hyperparameters for each fold that was chosen in the inner-loop 10 | 11 | 12 | 13 | inner_tune <- function(ncv_dat, mod_FUN_list, params_list, error_FUN) { 14 | 15 | # inputs params, model, and resample, calls model and error functions, outputs error 16 | mod_error <- function(params, mod_FUN, dat) { 17 | y_col <- ncol(dat$data) 18 | y_obs <- rsample::assessment(dat)[y_col] 19 | mod <- mod_FUN(params, rsample::analysis(dat)) 20 | pred <- predict(mod, rsample::assessment(dat)) 21 | if (!is.data.frame(pred)) { 22 | pred <- pred$predictions 23 | } 24 | error <- error_FUN(y_obs, pred) 25 | error 26 | } 27 | 28 | # inputs resample, loops hyperparam grid values to model/error function, collects error value for hyperparam combo 29 | tune_over_params <- function(dat, mod_FUN, params) { 30 | params$error <- purrr::map_dbl(1:nrow(params), function(row) { 31 | params <- params[row,] 32 | mod_error(params, mod_FUN, dat) 33 | }) 34 | params 35 | } 36 | 37 | # inputs and sends fold's resamples to tuning function, collects and averages fold's error for each hyperparameter combo 38 | summarize_tune_results <- function(dat, mod_FUN, params) { 39 | # Return row-bound tibble that has the 25 bootstrap results 40 | param_names <- names(params) 41 | furrr::future_map_dfr(dat$splits, tune_over_params, mod_FUN, params) %>% 42 | lazy_dt(., key_by = param_names) %>% 43 | # For each value of the tuning parameter, compute the 44 | # average which is the inner bootstrap estimate. 45 | group_by_at(vars(param_names)) %>% 46 | summarize(mean_error = mean(error, na.rm = TRUE), 47 | sd_error = sd(error, na.rm = TRUE), 48 | n = length(error)) %>% 49 | as_tibble() 50 | } 51 | 52 | tune_algorithms <- furrr::future_map2(mod_FUN_list, params_list, function(mod_FUN, params){ 53 | tuning_results <- purrr::map(ncv_dat$inner_resamples, summarize_tune_results, mod_FUN, params) 54 | 55 | # Choose best hyperparameter combos across all the resamples for each fold (e.g. 5 repeats 10 folds = 50 best hyperparam combos) 56 | best_hyper_vals <- tuning_results %>% 57 | purrr::map_df(function(dat) { 58 | dat %>% 59 | filter(mean_error == min(mean_error)) %>% 60 | arrange(sd_error) %>% 61 | slice(1) 62 | }) %>% 63 | select(names(params)) 64 | }) 65 | } 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /performance-experiment/functions/mlbench-data.R: -------------------------------------------------------------------------------- 1 | # create simulation data 2 | 3 | # Inputs are 10 independent variables uniformly distributed on the interval [0,1], only 5 out of these 10 are actually used. Outputs are created according to the formula 4 | # y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e 5 | 6 | mlbench_data <- function(n, noise_sd = 1, seed = 2019) { 7 | set.seed(seed) 8 | tmp <- mlbench::mlbench.friedman1(n, sd = noise_sd) 9 | tmp <- cbind(tmp$x, tmp$y) 10 | tmp <- as.data.frame(tmp) 11 | names(tmp)[ncol(tmp)] <- "y" 12 | tmp 13 | } -------------------------------------------------------------------------------- /performance-experiment/functions/ncv-compare.R: -------------------------------------------------------------------------------- 1 | # ncv_compare function 2 | 3 | 4 | # Chooses the best algorithm, fits best model on entire training set, predicts against large simulated data set 5 | 6 | # inputs: 7 | # 1. train_dat = the entire training dataset 8 | # 2. large_dat = the test dataset 9 | # 3. 
cv_stats = outer_cv.R output: df with chosen model, outer fold stats, hyperparams 10 | # 4. mod_FUN_list = list of model objects created from create_models.R 11 | # 5. params_list = list of hyperparameter grids created from create_grids.R 12 | # 6. error_FUN = error function given at the start of plan_.R 13 | # 7. method = "kj" or "raschka", given at the start of plan_.R 14 | 15 | # output: df with algorithm, hyperparams, and error values 16 | 17 | 18 | ncv_compare <- function(train_dat, large_dat, cv_stats, mod_FUN_list, params_list, error_FUN, method) { 19 | 20 | if (method == "kj") { 21 | # Choose alg with lowest avg error 22 | chosen_alg <- cv_stats %>% 23 | bind_rows(.id = "model") %>% 24 | filter(mean_error == min(mean_error)) %>% 25 | pull(model) 26 | 27 | # Set inputs to chosen alg 28 | mod_FUN <- mod_FUN_list[[chosen_alg]] 29 | params <- cv_stats[[chosen_alg]] %>% 30 | select(names(params_list[[chosen_alg]])) 31 | 32 | } else if (method == "raschka") { 33 | chosen_alg <- cv_stats %>% 34 | pull(model) 35 | mod_FUN <- mod_FUN_list[[chosen_alg]] 36 | params <- cv_stats %>% 37 | filter(model == chosen_alg) %>% 38 | select(names(params_list[[chosen_alg]])) 39 | } 40 | 41 | # fit model over entire training set 42 | fit <- mod_FUN(params, train_dat) 43 | 44 | # predict on test set 45 | preds <- predict(fit, large_dat) 46 | if (!is.data.frame(preds)) { 47 | preds <- preds$predictions 48 | } 49 | 50 | # calculate out-of-sample and retrieve nested-cv error 51 | y_col <- ncol(large_dat) 52 | y_obs <- large_dat[y_col] 53 | oos_error <- round(error_FUN(y_obs, preds), 5) 54 | 55 | if (method == "kj") { 56 | ncv_error <- cv_stats[[chosen_alg]] %>% 57 | mutate(mean_error = round(mean_error, 5)) %>% 58 | pull(mean_error) 59 | } else if (method == "raschka") { 60 | ncv_error <- cv_stats %>% 61 | filter(model == chosen_alg) %>% 62 | mutate(mean_error = round(mean_error, 5)) %>% 63 | pull(mean_error) 64 | } 65 | 66 | # delta (the difference between errors) is how well the ncv estimated generalization performance 67 | ncv_perf <- bind_cols(oos_error = oos_error, ncv_error = ncv_error) %>% 68 | mutate(method = method, 69 | delta_error = abs(oos_error - ncv_error), 70 | chosen_algorithm = chosen_alg) %>% 71 | bind_cols(params) %>% 72 | select(method, everything()) 73 | 74 | } 75 | 76 | 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /performance-experiment/functions/outer-cv.R: -------------------------------------------------------------------------------- 1 | # outer_cv function 2 | 3 | # Takes the hyperparameter values that were chosen by the tuning process in the inner loop and uses them for cross-validation in the outer loop. 4 | 5 | # inputs: 6 | # 1. ncv_dat = ncv obj from list created by create-ncv-objects.R 7 | # 2. best_hypervals_list = output from inner-tune.R 8 | # 3. mod_FUN_list = output from create-models.R 9 | # 4. error_FUN = error function given at start of plan_.R 10 | # 5. method = "kj" or "raschka" 11 | # 6. train_dat = entire training set; output from mlbench-data.R 12 | # 7. params_list = list of hyperparameter grids; output from create-grids.R 13 | 14 | # output: df with chosen model, chosen hyperparameters, outer-fold error stats: mean, median, sd error for each algorithm 15 | 16 | 17 | 18 | 19 | outer_cv <- function(ncv_dat, best_hypervals_list, mod_FUN_list, error_FUN, method, train_dat = NULL, params_list = NULL) { 20 | if (method == "raschka" & is.null(train_dat)) { 21 | stop("train_dat argument = NULL. 
Entire training set needs to be included for raschka method") 22 | } 23 | if (method == "raschka" & is.null(params_list)) { 24 | stop("params_list argument = NULL. Hyperparameter grid list needs to be included for raschka method") 25 | } 26 | 27 | # inputs params, model, and resample, calls model and error functions, outputs error 28 | mod_error <- function(params, mod_FUN, dat) { 29 | y_col <- ncol(dat$data) 30 | y_obs <- rsample::assessment(dat)[y_col] 31 | mod <- mod_FUN(params, rsample::analysis(dat)) 32 | pred <- predict(mod, rsample::assessment(dat)) 33 | if (!is.data.frame(pred)) { 34 | pred <- pred$predictions 35 | } 36 | error <- error_FUN(y_obs, pred) 37 | error 38 | } 39 | outer_stats <- furrr::future_map2(mod_FUN_list, best_hypervals_list, function(mod_FUN, best_hyper_vals){ 40 | 41 | # fit models on the outer-loop folds using best hyperparams (e.g. 5 repeats, 10 folds = 50 models) 42 | outer_fold_error <- furrr::future_map2_dfr(ncv_dat$splits, 1:nrow(best_hyper_vals), function(dat, row) { 43 | params <- best_hyper_vals[row,] 44 | error <- mod_error(params, mod_FUN, dat) 45 | tibble( 46 | error = error 47 | ) 48 | }) %>% 49 | bind_cols(best_hyper_vals) %>% 50 | mutate_all(~round(., 6)) 51 | 52 | 53 | if (method == "kj") { 54 | # hyperparam values for final model will be the ones most selected to use on the outer-loop folds 55 | chosen_params <- best_hyper_vals %>% 56 | group_by_all() %>% 57 | tally() %>% 58 | ungroup() %>% 59 | filter(n == max(n)) %>% 60 | slice(1) 61 | 62 | # if majority vote chooses more than one parameter set, then choose the set with the lowest error. And take the first row in case more than one set of params has min error. 63 | if (nrow(chosen_params) > 1) { 64 | chosen_params <- chosen_params %>% 65 | mutate_all(~round(., 6)) %>% 66 | inner_join(outer_fold_error, by = c("mixture", "penalty")) %>% 67 | filter(error == min(error)) %>% 68 | slice(1) %>% 69 | select(names(best_hyper_vals)) 70 | } 71 | 72 | # output error stats and chosen hyperparams 73 | tibble( 74 | mean_error = mean(outer_fold_error$error), 75 | median_error = median(outer_fold_error$error), 76 | sd_error = sd(outer_fold_error$error) 77 | ) %>% 78 | bind_cols(chosen_params) 79 | 80 | } else if (method == "raschka") { 81 | tibble( 82 | mean_error = mean(outer_fold_error$error), 83 | median_error = median(outer_fold_error$error), 84 | sd_error = sd(outer_fold_error$error) 85 | ) 86 | 87 | } else { 88 | stop("Need to specify method as kj or raschka", call. = FALSE) 89 | } 90 | }) 91 | 92 | if (method == "raschka") { 93 | chosen_alg <- outer_stats %>% 94 | bind_rows(.id = "model") %>% 95 | filter(mean_error == min(mean_error)) %>% 96 | pull(1) 97 | 98 | # Set inputs to chosen alg 99 | mod_FUN <- mod_FUN_list[[chosen_alg]] 100 | params <- params_list[[chosen_alg]] 101 | total_train <- rsample::vfold_cv(train_dat, v = 2) 102 | 103 | # tune chosen alg on the inner-loop cv strategy 104 | # code is an amalgam of funs: summarize_tune_results, tune_over_params 105 | tuning_results <- furrr::future_map(total_train$splits, function(dat, mod_FUN, params) { 106 | params$error <- furrr::future_map_dbl(1:nrow(params), function(row) { 107 | params <- params[row,] 108 | mod_error(params, mod_FUN, dat) 109 | }) 110 | return(params) 111 | }, mod_FUN, params) %>% 112 | bind_rows() %>% 113 | lazy_dt(., key_by = names(params)) %>% 114 | # For each value of the tuning parameter, compute the 115 | # average which is the inner bootstrap estimate. 
116 | group_by_at(vars(names(params))) %>% 117 | summarize(mean_error = mean(error, na.rm = TRUE), 118 | sd_error = sd(error, na.rm = TRUE)) %>% 119 | as_tibble() 120 | 121 | # Get best params from the tuning 122 | chosen_hyper_vals <- tuning_results %>% 123 | filter(mean_error == min(mean_error)) %>% 124 | arrange(sd_error) %>% 125 | slice(1) %>% 126 | select(names(params)) 127 | 128 | outer_stats <- outer_stats %>% 129 | bind_rows(.id = "model") %>% 130 | filter(model == chosen_alg) %>% 131 | bind_cols(chosen_hyper_vals) 132 | } 133 | return(outer_stats) 134 | } 135 | 136 | 137 | 138 | -------------------------------------------------------------------------------- /performance-experiment/functions/run-ncv.R: -------------------------------------------------------------------------------- 1 | # Runs nested cross-validation 2 | 3 | # main function: compares the algorithms; choose the one with smallest error; predicts on a test set; calcs difference between test error and mean of outer-fold error 4 | 5 | # calls inner-tune.R, outer-cv.R, and ncv-compare.R 6 | 7 | # input and output description in the individual function scripts 8 | 9 | 10 | run_ncv <- function(ncv_dat, sim_dat, large_dat, mod_FUN_list, params_list, error_FUN, method) { 11 | 12 | # output: hypervalues for each fold, chosen in the inner-loop 13 | best_hypervals_list <- inner_tune( 14 | ncv_dat = ncv_dat[[1]], 15 | mod_FUN_list = mod_FUN_list, 16 | params_list = params_list, 17 | error_FUN = error_FUN) 18 | 19 | # output: model, mean, median, sd error, and hyperparameter columns 20 | if (method == "raschka") { 21 | cv_stats <- outer_cv( 22 | ncv_dat = ncv_dat[[1]], 23 | best_hypervals_list = best_hypervals_list, 24 | mod_FUN_list = mod_FUN_list, 25 | error_FUN = error_FUN, 26 | method = method, 27 | train_dat = sim_dat, 28 | params_list = params_list) 29 | } else if (method == "kj") { 30 | cv_stats <- outer_cv( 31 | ncv_dat = ncv_dat[[1]], 32 | best_hypervals_list = best_hypervals_list, 33 | mod_FUN_list = mod_FUN_list, 34 | error_FUN = error_FUN, 35 | method = method) 36 | } 37 | 38 | # output: algorithm, hyperparams, and error values 39 | genl_perf_est <- ncv_compare(train_dat = sim_dat, 40 | large_dat = large_dat, 41 | cv_stats = cv_stats, 42 | mod_FUN_list = mod_FUN_list, 43 | params_list = params_list, 44 | error_FUN = error_FUN, 45 | method = method) 46 | 47 | # if there's repeat == 1, then there is no repeat column (id), id becomes the fold co instead of there being an id2 col 48 | rep_status <- stringr::str_detect(ncv_dat[[1]]$id[[1]], pattern = "Repeat") 49 | 50 | if (rep_status == TRUE) { 51 | # number of repeats 52 | num_reps <- ncv_dat[[1]] %>% 53 | select(id) %>% 54 | mutate(repeats = stringr::str_extract(id, pattern = "[0-9]") %>% 55 | as.numeric()) %>% 56 | slice(n()) %>% 57 | pull(repeats) 58 | } else { 59 | num_reps <- 1 60 | } 61 | 62 | # cols: n, repeats, error calcs, chosen alg, chosen hyperparams 63 | final_results <- tibble(n = nrow(ncv_dat[[1]]$splits$`1`[[1]]), 64 | repeats = num_reps) %>% 65 | bind_cols(genl_perf_est) 66 | 67 | } -------------------------------------------------------------------------------- /performance-experiment/output/perf-exp-output-r.csv: -------------------------------------------------------------------------------- 1 | subtarget,n,repeats,method,oos_error,ncv_error,delta_error,chosen_algorithm,mixture,penalty,mtry,trees,elapsed,percent_error 2 | 
ncv_results_r_100_2cf8f29f,100,4,raschka,2.19966,1.99714,0.20252000000000026,glmnet,0.9937975914264098,3.6844362948689596e-6,NA,NA,6.48,0.092 3 | ncv_results_r_100_73d34817,100,1,raschka,2.20712,1.92178,0.28534000000000015,glmnet,0.2830161499953829,0.4736074164981174,NA,NA,1.65,0.129 4 | ncv_results_r_100_80f90de9,100,3,raschka,2.24343,2.05019,0.19323999999999986,glmnet,0.8434229354886338,0.38592594070682457,NA,NA,4.75,0.086 5 | ncv_results_r_100_c6a30309,100,5,raschka,2.19318,1.99697,0.19621,glmnet,0.27113973276223985,0.27236778939605205,NA,NA,7.9,0.089 6 | ncv_results_r_100_ed9aeedc,100,2,raschka,2.19315,2.02878,0.16437000000000035,glmnet,0.3506598100136034,0.24982118433658881,NA,NA,3.13,0.075 7 | ncv_results_r_2000_97d0d6fc,2000,4,raschka,1.39104,1.40004,0.008999999999999897,rf,NA,NA,5,1167,8.43,0.006 8 | ncv_results_r_2000_a6a8efea,2000,1,raschka,1.38907,1.40676,0.017689999999999984,rf,NA,NA,6,227,3.08,0.013 9 | ncv_results_r_2000_b959e941,2000,2,raschka,1.38666,1.39311,0.006450000000000067,rf,NA,NA,5,1167,4.85,0.005 10 | ncv_results_r_2000_e7878a5b,2000,5,raschka,1.38627,1.39816,0.011890000000000178,rf,NA,NA,6,1309,10.68,0.009 11 | ncv_results_r_2000_f2c5e5eb,2000,3,raschka,1.39259,1.40355,0.01096000000000008,rf,NA,NA,7,576,6.52,0.008 12 | ncv_results_r_5000_0131c1ff,5000,5,raschka,1.24466,1.27121,0.02654999999999985,rf,NA,NA,5,1454,23.95,0.021 13 | ncv_results_r_5000_5b1d8a27,5000,3,raschka,1.24056,1.27153,0.030969999999999942,rf,NA,NA,6,1874,16.2,0.025 14 | ncv_results_r_5000_86c8ceed,5000,4,raschka,1.2404,1.28051,0.04011000000000009,rf,NA,NA,6,1281,20.13,0.032 15 | ncv_results_r_5000_8bc71439,5000,2,raschka,1.24009,1.27624,0.036150000000000126,rf,NA,NA,6,446,11.53,0.029 16 | ncv_results_r_5000_e931a398,5000,1,raschka,1.24406,1.28087,0.03681000000000001,rf,NA,NA,5,1462,7.7,0.03 17 | ncv_results_r_800_0f9280b4,800,4,raschka,1.6537,1.63063,0.023069999999999924,rf,NA,NA,7,92,7.27,0.014 18 | ncv_results_r_800_2ff75ff1,800,5,raschka,1.64388,1.61275,0.031130000000000102,rf,NA,NA,7,162,8.52,0.019 19 | ncv_results_r_800_67abc218,800,3,raschka,1.63369,1.61903,0.014660000000000117,rf,NA,NA,9,256,5.37,0.009 20 | ncv_results_r_800_a8ac41db,800,2,raschka,1.63013,1.62835,0.0017800000000001148,rf,NA,NA,9,1807,3.88,0.001 21 | ncv_results_r_800_e4b0a9bc,800,1,raschka,1.63248,1.59467,0.0378099999999999,rf,NA,NA,7,1289,2.22,0.023 22 | -------------------------------------------------------------------------------- /performance-experiment/output/perf-exp-output.csv: -------------------------------------------------------------------------------- 1 | subtarget,n,repeats,method,oos_error,ncv_error,delta_error,chosen_algorithm,mixture,penalty,mtry,trees,elapsed,percent_error 2 | ncv_results_100_0108d912,100,5,kj,2.19359,2.01424,0.1793499999999999,glmnet,0.50424303883221,0.2211151988375703,NA,NA,1.36,0.082 3 | ncv_results_100_7aaa57d2,100,1,kj,2.19359,2.04781,0.1457799999999998,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.15,0.066 4 | ncv_results_100_97e7fe04,100,2,kj,2.19359,1.99077,0.20282,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.4,0.092 5 | ncv_results_100_9d044993,100,4,kj,2.19359,1.99643,0.19716,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.97,0.09 6 | ncv_results_100_ea11bf8d,100,3,kj,2.19262,2.01702,0.17559999999999976,glmnet,0.5809470646083355,0.16010254880830843,NA,NA,0.65,0.08 7 | ncv_results_2000_47742c31,2000,4,kj,1.38697,1.37171,0.015260000000000051,rf,NA,NA,5,1779,2.96,0.011 8 | 
ncv_results_2000_746435d6,2000,5,kj,1.39092,1.37625,0.01466999999999996,rf,NA,NA,5,1779,3.71,0.011 9 | ncv_results_2000_7d80d14d,2000,1,kj,1.38466,1.36553,0.01913000000000009,rf,NA,NA,5,1948,0.74,0.014 10 | ncv_results_2000_80d2e33a,2000,3,kj,1.38955,1.3711,0.018450000000000077,rf,NA,NA,5,1948,2.22,0.013 11 | ncv_results_2000_c16e9aff,2000,2,kj,1.38739,1.37015,0.017239999999999922,rf,NA,NA,5,1948,1.48,0.012 12 | ncv_results_5000_20d7ace1,5000,4,kj,1.24192,1.25837,0.016450000000000076,rf,NA,NA,5,1573,8.92,0.013 13 | ncv_results_5000_2a916af4,5000,5,kj,1.24272,1.25644,0.013719999999999954,rf,NA,NA,5,1664,11.13,0.011 14 | ncv_results_5000_7b1fdb55,5000,2,kj,1.24336,1.2612,0.017840000000000078,rf,NA,NA,5,1351,4.46,0.014 15 | ncv_results_5000_7b6f8e72,5000,1,kj,1.24304,1.25709,0.014050000000000118,rf,NA,NA,5,1664,2.23,0.011 16 | ncv_results_5000_d380966a,5000,3,kj,1.24267,1.25724,0.014569999999999972,rf,NA,NA,5,1365,6.69,0.012 17 | ncv_results_800_3b54c7f8,800,1,kj,1.63668,1.58422,0.05245999999999995,rf,NA,NA,6,1507,0.26,0.032 18 | ncv_results_800_3f87e120,800,2,kj,1.6333,1.58689,0.04641000000000006,rf,NA,NA,6,1168,0.51,0.028 19 | ncv_results_800_50b46544,800,4,kj,1.63707,1.58522,0.05184999999999995,rf,NA,NA,6,1693,1.09,0.032 20 | ncv_results_800_589454bb,800,3,kj,1.63456,1.5905,0.04405999999999999,rf,NA,NA,6,1168,0.76,0.027 21 | ncv_results_800_a2c27fe0,800,5,kj,1.63489,1.58745,0.04743999999999993,rf,NA,NA,6,1507,1.52,0.029 22 | -------------------------------------------------------------------------------- /performance-experiment/packages.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | pacman::p_load(data.table, dtplyr, dplyr, furrr, drake) 4 | 5 | 6 | -------------------------------------------------------------------------------- /renv/.gitignore: -------------------------------------------------------------------------------- 1 | local/ 2 | lock/ 3 | library/ 4 | python/ 5 | staging/ 6 | -------------------------------------------------------------------------------- /renv/activate.R: -------------------------------------------------------------------------------- 1 | 2 | local({ 3 | 4 | # the requested version of renv 5 | version <- "0.14.0-3" 6 | 7 | # the project directory 8 | project <- getwd() 9 | 10 | # allow environment variable to control activation 11 | activate <- Sys.getenv("RENV_ACTIVATE_PROJECT") 12 | if (!nzchar(activate)) { 13 | 14 | # don't auto-activate when R CMD INSTALL is running 15 | if (nzchar(Sys.getenv("R_INSTALL_PKG"))) 16 | return(FALSE) 17 | 18 | } 19 | 20 | # bail if activation was explicitly disabled 21 | if (tolower(activate) %in% c("false", "f", "0")) 22 | return(FALSE) 23 | 24 | # avoid recursion 25 | if (nzchar(Sys.getenv("RENV_R_INITIALIZING"))) 26 | return(invisible(TRUE)) 27 | 28 | # signal that we're loading renv during R startup 29 | Sys.setenv("RENV_R_INITIALIZING" = "true") 30 | on.exit(Sys.unsetenv("RENV_R_INITIALIZING"), add = TRUE) 31 | 32 | # signal that we've consented to use renv 33 | options(renv.consent = TRUE) 34 | 35 | # load the 'utils' package eagerly -- this ensures that renv shims, which 36 | # mask 'utils' packages, will come first on the search path 37 | library(utils, lib.loc = .Library) 38 | 39 | # check to see if renv has already been loaded 40 | if ("renv" %in% loadedNamespaces()) { 41 | 42 | # if renv has already been loaded, and it's the requested version of renv, 43 | # nothing to do 44 | spec <- .getNamespaceInfo(.getNamespace("renv"), "spec") 45 | if 
(identical(spec[["version"]], version)) 46 | return(invisible(TRUE)) 47 | 48 | # otherwise, unload and attempt to load the correct version of renv 49 | unloadNamespace("renv") 50 | 51 | } 52 | 53 | # load bootstrap tools 54 | bootstrap <- function(version, library) { 55 | 56 | # attempt to download renv 57 | tarball <- tryCatch(renv_bootstrap_download(version), error = identity) 58 | if (inherits(tarball, "error")) 59 | stop("failed to download renv ", version) 60 | 61 | # now attempt to install 62 | status <- tryCatch(renv_bootstrap_install(version, tarball, library), error = identity) 63 | if (inherits(status, "error")) 64 | stop("failed to install renv ", version) 65 | 66 | } 67 | 68 | renv_bootstrap_tests_running <- function() { 69 | getOption("renv.tests.running", default = FALSE) 70 | } 71 | 72 | renv_bootstrap_repos <- function() { 73 | 74 | # check for repos override 75 | repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA) 76 | if (!is.na(repos)) 77 | return(repos) 78 | 79 | # if we're testing, re-use the test repositories 80 | if (renv_bootstrap_tests_running()) 81 | return(getOption("renv.tests.repos")) 82 | 83 | # retrieve current repos 84 | repos <- getOption("repos") 85 | 86 | # ensure @CRAN@ entries are resolved 87 | repos[repos == "@CRAN@"] <- getOption( 88 | "renv.repos.cran", 89 | "https://cloud.r-project.org" 90 | ) 91 | 92 | # add in renv.bootstrap.repos if set 93 | default <- c(FALLBACK = "https://cloud.r-project.org") 94 | extra <- getOption("renv.bootstrap.repos", default = default) 95 | repos <- c(repos, extra) 96 | 97 | # remove duplicates that might've snuck in 98 | dupes <- duplicated(repos) | duplicated(names(repos)) 99 | repos[!dupes] 100 | 101 | } 102 | 103 | renv_bootstrap_download <- function(version) { 104 | 105 | # if the renv version number has 4 components, assume it must 106 | # be retrieved via github 107 | nv <- numeric_version(version) 108 | components <- unclass(nv)[[1]] 109 | 110 | methods <- if (length(components) == 4L) { 111 | list( 112 | renv_bootstrap_download_github 113 | ) 114 | } else { 115 | list( 116 | renv_bootstrap_download_cran_latest, 117 | renv_bootstrap_download_cran_archive 118 | ) 119 | } 120 | 121 | for (method in methods) { 122 | path <- tryCatch(method(version), error = identity) 123 | if (is.character(path) && file.exists(path)) 124 | return(path) 125 | } 126 | 127 | stop("failed to download renv ", version) 128 | 129 | } 130 | 131 | renv_bootstrap_download_impl <- function(url, destfile) { 132 | 133 | mode <- "wb" 134 | 135 | # https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17715 136 | fixup <- 137 | Sys.info()[["sysname"]] == "Windows" && 138 | substring(url, 1L, 5L) == "file:" 139 | 140 | if (fixup) 141 | mode <- "w+b" 142 | 143 | utils::download.file( 144 | url = url, 145 | destfile = destfile, 146 | mode = mode, 147 | quiet = TRUE 148 | ) 149 | 150 | } 151 | 152 | renv_bootstrap_download_cran_latest <- function(version) { 153 | 154 | spec <- renv_bootstrap_download_cran_latest_find(version) 155 | 156 | message("* Downloading renv ", version, " ... 
", appendLF = FALSE) 157 | 158 | type <- spec$type 159 | repos <- spec$repos 160 | 161 | info <- tryCatch( 162 | utils::download.packages( 163 | pkgs = "renv", 164 | destdir = tempdir(), 165 | repos = repos, 166 | type = type, 167 | quiet = TRUE 168 | ), 169 | condition = identity 170 | ) 171 | 172 | if (inherits(info, "condition")) { 173 | message("FAILED") 174 | return(FALSE) 175 | } 176 | 177 | # report success and return 178 | message("OK (downloaded ", type, ")") 179 | info[1, 2] 180 | 181 | } 182 | 183 | renv_bootstrap_download_cran_latest_find <- function(version) { 184 | 185 | # check whether binaries are supported on this system 186 | binary <- 187 | getOption("renv.bootstrap.binary", default = TRUE) && 188 | !identical(.Platform$pkgType, "source") && 189 | !identical(getOption("pkgType"), "source") && 190 | Sys.info()[["sysname"]] %in% c("Darwin", "Windows") 191 | 192 | types <- c(if (binary) "binary", "source") 193 | 194 | # iterate over types + repositories 195 | for (type in types) { 196 | for (repos in renv_bootstrap_repos()) { 197 | 198 | # retrieve package database 199 | db <- tryCatch( 200 | as.data.frame( 201 | utils::available.packages(type = type, repos = repos), 202 | stringsAsFactors = FALSE 203 | ), 204 | error = identity 205 | ) 206 | 207 | if (inherits(db, "error")) 208 | next 209 | 210 | # check for compatible entry 211 | entry <- db[db$Package %in% "renv" & db$Version %in% version, ] 212 | if (nrow(entry) == 0) 213 | next 214 | 215 | # found it; return spec to caller 216 | spec <- list(entry = entry, type = type, repos = repos) 217 | return(spec) 218 | 219 | } 220 | } 221 | 222 | # if we got here, we failed to find renv 223 | fmt <- "renv %s is not available from your declared package repositories" 224 | stop(sprintf(fmt, version)) 225 | 226 | } 227 | 228 | renv_bootstrap_download_cran_archive <- function(version) { 229 | 230 | name <- sprintf("renv_%s.tar.gz", version) 231 | repos <- renv_bootstrap_repos() 232 | urls <- file.path(repos, "src/contrib/Archive/renv", name) 233 | destfile <- file.path(tempdir(), name) 234 | 235 | message("* Downloading renv ", version, " ... ", appendLF = FALSE) 236 | 237 | for (url in urls) { 238 | 239 | status <- tryCatch( 240 | renv_bootstrap_download_impl(url, destfile), 241 | condition = identity 242 | ) 243 | 244 | if (identical(status, 0L)) { 245 | message("OK") 246 | return(destfile) 247 | } 248 | 249 | } 250 | 251 | message("FAILED") 252 | return(FALSE) 253 | 254 | } 255 | 256 | renv_bootstrap_download_github <- function(version) { 257 | 258 | enabled <- Sys.getenv("RENV_BOOTSTRAP_FROM_GITHUB", unset = "TRUE") 259 | if (!identical(enabled, "TRUE")) 260 | return(FALSE) 261 | 262 | # prepare download options 263 | pat <- Sys.getenv("GITHUB_PAT") 264 | if (nzchar(Sys.which("curl")) && nzchar(pat)) { 265 | fmt <- "--location --fail --header \"Authorization: token %s\"" 266 | extra <- sprintf(fmt, pat) 267 | saved <- options("download.file.method", "download.file.extra") 268 | options(download.file.method = "curl", download.file.extra = extra) 269 | on.exit(do.call(base::options, saved), add = TRUE) 270 | } else if (nzchar(Sys.which("wget")) && nzchar(pat)) { 271 | fmt <- "--header=\"Authorization: token %s\"" 272 | extra <- sprintf(fmt, pat) 273 | saved <- options("download.file.method", "download.file.extra") 274 | options(download.file.method = "wget", download.file.extra = extra) 275 | on.exit(do.call(base::options, saved), add = TRUE) 276 | } 277 | 278 | message("* Downloading renv ", version, " from GitHub ... 
", appendLF = FALSE) 279 | 280 | url <- file.path("https://api.github.com/repos/rstudio/renv/tarball", version) 281 | name <- sprintf("renv_%s.tar.gz", version) 282 | destfile <- file.path(tempdir(), name) 283 | 284 | status <- tryCatch( 285 | renv_bootstrap_download_impl(url, destfile), 286 | condition = identity 287 | ) 288 | 289 | if (!identical(status, 0L)) { 290 | message("FAILED") 291 | return(FALSE) 292 | } 293 | 294 | message("OK") 295 | return(destfile) 296 | 297 | } 298 | 299 | renv_bootstrap_install <- function(version, tarball, library) { 300 | 301 | # attempt to install it into project library 302 | message("* Installing renv ", version, " ... ", appendLF = FALSE) 303 | dir.create(library, showWarnings = FALSE, recursive = TRUE) 304 | 305 | # invoke using system2 so we can capture and report output 306 | bin <- R.home("bin") 307 | exe <- if (Sys.info()[["sysname"]] == "Windows") "R.exe" else "R" 308 | r <- file.path(bin, exe) 309 | args <- c("--vanilla", "CMD", "INSTALL", "-l", shQuote(library), shQuote(tarball)) 310 | output <- system2(r, args, stdout = TRUE, stderr = TRUE) 311 | message("Done!") 312 | 313 | # check for successful install 314 | status <- attr(output, "status") 315 | if (is.numeric(status) && !identical(status, 0L)) { 316 | header <- "Error installing renv:" 317 | lines <- paste(rep.int("=", nchar(header)), collapse = "") 318 | text <- c(header, lines, output) 319 | writeLines(text, con = stderr()) 320 | } 321 | 322 | status 323 | 324 | } 325 | 326 | renv_bootstrap_platform_prefix <- function() { 327 | 328 | # construct version prefix 329 | version <- paste(R.version$major, R.version$minor, sep = ".") 330 | prefix <- paste("R", numeric_version(version)[1, 1:2], sep = "-") 331 | 332 | # include SVN revision for development versions of R 333 | # (to avoid sharing platform-specific artefacts with released versions of R) 334 | devel <- 335 | identical(R.version[["status"]], "Under development (unstable)") || 336 | identical(R.version[["nickname"]], "Unsuffered Consequences") 337 | 338 | if (devel) 339 | prefix <- paste(prefix, R.version[["svn rev"]], sep = "-r") 340 | 341 | # build list of path components 342 | components <- c(prefix, R.version$platform) 343 | 344 | # include prefix if provided by user 345 | prefix <- renv_bootstrap_platform_prefix_impl() 346 | if (!is.na(prefix) && nzchar(prefix)) 347 | components <- c(prefix, components) 348 | 349 | # build prefix 350 | paste(components, collapse = "/") 351 | 352 | } 353 | 354 | renv_bootstrap_platform_prefix_impl <- function() { 355 | 356 | # if an explicit prefix has been supplied, use it 357 | prefix <- Sys.getenv("RENV_PATHS_PREFIX", unset = NA) 358 | if (!is.na(prefix)) 359 | return(prefix) 360 | 361 | # if the user has requested an automatic prefix, generate it 362 | auto <- Sys.getenv("RENV_PATHS_PREFIX_AUTO", unset = NA) 363 | if (auto %in% c("TRUE", "True", "true", "1")) 364 | return(renv_bootstrap_platform_prefix_auto()) 365 | 366 | # empty string on failure 367 | "" 368 | 369 | } 370 | 371 | renv_bootstrap_platform_prefix_auto <- function() { 372 | 373 | prefix <- tryCatch(renv_bootstrap_platform_os(), error = identity) 374 | if (inherits(prefix, "error") || prefix %in% "unknown") { 375 | 376 | msg <- paste( 377 | "failed to infer current operating system", 378 | "please file a bug report at https://github.com/rstudio/renv/issues", 379 | sep = "; " 380 | ) 381 | 382 | warning(msg) 383 | 384 | } 385 | 386 | prefix 387 | 388 | } 389 | 390 | renv_bootstrap_platform_os <- function() { 391 | 392 | sysinfo 
<- Sys.info() 393 | sysname <- sysinfo[["sysname"]] 394 | 395 | # handle Windows + macOS up front 396 | if (sysname == "Windows") 397 | return("windows") 398 | else if (sysname == "Darwin") 399 | return("macos") 400 | 401 | # check for os-release files 402 | for (file in c("/etc/os-release", "/usr/lib/os-release")) 403 | if (file.exists(file)) 404 | return(renv_bootstrap_platform_os_via_os_release(file, sysinfo)) 405 | 406 | # check for redhat-release files 407 | if (file.exists("/etc/redhat-release")) 408 | return(renv_bootstrap_platform_os_via_redhat_release()) 409 | 410 | "unknown" 411 | 412 | } 413 | 414 | renv_bootstrap_platform_os_via_os_release <- function(file, sysinfo) { 415 | 416 | # read /etc/os-release 417 | release <- utils::read.table( 418 | file = file, 419 | sep = "=", 420 | quote = c("\"", "'"), 421 | col.names = c("Key", "Value"), 422 | comment.char = "#", 423 | stringsAsFactors = FALSE 424 | ) 425 | 426 | vars <- as.list(release$Value) 427 | names(vars) <- release$Key 428 | 429 | # get os name 430 | os <- tolower(sysinfo[["sysname"]]) 431 | 432 | # read id 433 | id <- "unknown" 434 | for (field in c("ID", "ID_LIKE")) { 435 | if (field %in% names(vars) && nzchar(vars[[field]])) { 436 | id <- vars[[field]] 437 | break 438 | } 439 | } 440 | 441 | # read version 442 | version <- "unknown" 443 | for (field in c("UBUNTU_CODENAME", "VERSION_CODENAME", "VERSION_ID", "BUILD_ID")) { 444 | if (field %in% names(vars) && nzchar(vars[[field]])) { 445 | version <- vars[[field]] 446 | break 447 | } 448 | } 449 | 450 | # join together 451 | paste(c(os, id, version), collapse = "-") 452 | 453 | } 454 | 455 | renv_bootstrap_platform_os_via_redhat_release <- function() { 456 | 457 | # read /etc/redhat-release 458 | contents <- readLines("/etc/redhat-release", warn = FALSE) 459 | 460 | # infer id 461 | id <- if (grepl("centos", contents, ignore.case = TRUE)) 462 | "centos" 463 | else if (grepl("redhat", contents, ignore.case = TRUE)) 464 | "redhat" 465 | else 466 | "unknown" 467 | 468 | # try to find a version component (very hacky) 469 | version <- "unknown" 470 | 471 | parts <- strsplit(contents, "[[:space:]]")[[1L]] 472 | for (part in parts) { 473 | 474 | nv <- tryCatch(numeric_version(part), error = identity) 475 | if (inherits(nv, "error")) 476 | next 477 | 478 | version <- nv[1, 1] 479 | break 480 | 481 | } 482 | 483 | paste(c("linux", id, version), collapse = "-") 484 | 485 | } 486 | 487 | renv_bootstrap_library_root_name <- function(project) { 488 | 489 | # use project name as-is if requested 490 | asis <- Sys.getenv("RENV_PATHS_LIBRARY_ROOT_ASIS", unset = "FALSE") 491 | if (asis) 492 | return(basename(project)) 493 | 494 | # otherwise, disambiguate based on project's path 495 | id <- substring(renv_bootstrap_hash_text(project), 1L, 8L) 496 | paste(basename(project), id, sep = "-") 497 | 498 | } 499 | 500 | renv_bootstrap_library_root <- function(project) { 501 | 502 | path <- Sys.getenv("RENV_PATHS_LIBRARY", unset = NA) 503 | if (!is.na(path)) 504 | return(path) 505 | 506 | path <- Sys.getenv("RENV_PATHS_LIBRARY_ROOT", unset = NA) 507 | if (!is.na(path)) { 508 | name <- renv_bootstrap_library_root_name(project) 509 | return(file.path(path, name)) 510 | } 511 | 512 | prefix <- renv_bootstrap_profile_prefix() 513 | paste(c(project, prefix, "renv/library"), collapse = "/") 514 | 515 | } 516 | 517 | renv_bootstrap_validate_version <- function(version) { 518 | 519 | loadedversion <- utils::packageDescription("renv", fields = "Version") 520 | if (version == loadedversion) 521 | 
return(TRUE) 522 | 523 | # assume four-component versions are from GitHub; three-component 524 | # versions are from CRAN 525 | components <- strsplit(loadedversion, "[.-]")[[1]] 526 | remote <- if (length(components) == 4L) 527 | paste("rstudio/renv", loadedversion, sep = "@") 528 | else 529 | paste("renv", loadedversion, sep = "@") 530 | 531 | fmt <- paste( 532 | "renv %1$s was loaded from project library, but this project is configured to use renv %2$s.", 533 | "Use `renv::record(\"%3$s\")` to record renv %1$s in the lockfile.", 534 | "Use `renv::restore(packages = \"renv\")` to install renv %2$s into the project library.", 535 | sep = "\n" 536 | ) 537 | 538 | msg <- sprintf(fmt, loadedversion, version, remote) 539 | warning(msg, call. = FALSE) 540 | 541 | FALSE 542 | 543 | } 544 | 545 | renv_bootstrap_hash_text <- function(text) { 546 | 547 | hashfile <- tempfile("renv-hash-") 548 | on.exit(unlink(hashfile), add = TRUE) 549 | 550 | writeLines(text, con = hashfile) 551 | tools::md5sum(hashfile) 552 | 553 | } 554 | 555 | renv_bootstrap_load <- function(project, libpath, version) { 556 | 557 | # try to load renv from the project library 558 | if (!requireNamespace("renv", lib.loc = libpath, quietly = TRUE)) 559 | return(FALSE) 560 | 561 | # warn if the version of renv loaded does not match 562 | renv_bootstrap_validate_version(version) 563 | 564 | # load the project 565 | renv::load(project) 566 | 567 | TRUE 568 | 569 | } 570 | 571 | renv_bootstrap_profile_load <- function(project) { 572 | 573 | # if RENV_PROFILE is already set, just use that 574 | profile <- Sys.getenv("RENV_PROFILE", unset = NA) 575 | if (!is.na(profile) && nzchar(profile)) 576 | return(profile) 577 | 578 | # check for a profile file (nothing to do if it doesn't exist) 579 | path <- file.path(project, "renv/local/profile") 580 | if (!file.exists(path)) 581 | return(NULL) 582 | 583 | # read the profile, and set it if it exists 584 | contents <- readLines(path, warn = FALSE) 585 | if (length(contents) == 0L) 586 | return(NULL) 587 | 588 | # set RENV_PROFILE 589 | profile <- contents[[1L]] 590 | if (nzchar(profile)) 591 | Sys.setenv(RENV_PROFILE = profile) 592 | 593 | profile 594 | 595 | } 596 | 597 | renv_bootstrap_profile_prefix <- function() { 598 | profile <- renv_bootstrap_profile_get() 599 | if (!is.null(profile)) 600 | return(file.path("renv/profiles", profile)) 601 | } 602 | 603 | renv_bootstrap_profile_get <- function() { 604 | profile <- Sys.getenv("RENV_PROFILE", unset = "") 605 | renv_bootstrap_profile_normalize(profile) 606 | } 607 | 608 | renv_bootstrap_profile_set <- function(profile) { 609 | profile <- renv_bootstrap_profile_normalize(profile) 610 | if (is.null(profile)) 611 | Sys.unsetenv("RENV_PROFILE") 612 | else 613 | Sys.setenv(RENV_PROFILE = profile) 614 | } 615 | 616 | renv_bootstrap_profile_normalize <- function(profile) { 617 | 618 | if (is.null(profile) || profile %in% c("", "default")) 619 | return(NULL) 620 | 621 | profile 622 | 623 | } 624 | 625 | # load the renv profile, if any 626 | renv_bootstrap_profile_load(project) 627 | 628 | # construct path to library root 629 | root <- renv_bootstrap_library_root(project) 630 | 631 | # construct library prefix for platform 632 | prefix <- renv_bootstrap_platform_prefix() 633 | 634 | # construct full libpath 635 | libpath <- file.path(root, prefix) 636 | 637 | # attempt to load 638 | if (renv_bootstrap_load(project, libpath, version)) 639 | return(TRUE) 640 | 641 | # load failed; inform user we're about to bootstrap 642 | prefix <- paste("# 
Bootstrapping renv", version) 643 | postfix <- paste(rep.int("-", 77L - nchar(prefix)), collapse = "") 644 | header <- paste(prefix, postfix) 645 | message(header) 646 | 647 | # perform bootstrap 648 | bootstrap(version, libpath) 649 | 650 | # exit early if we're just testing bootstrap 651 | if (!is.na(Sys.getenv("RENV_BOOTSTRAP_INSTALL_ONLY", unset = NA))) 652 | return(TRUE) 653 | 654 | # try again to load 655 | if (requireNamespace("renv", lib.loc = libpath, quietly = TRUE)) { 656 | message("* Successfully installed and loaded renv ", version, ".") 657 | return(renv::load()) 658 | } 659 | 660 | # failed to download or load renv; warn the user 661 | msg <- c( 662 | "Failed to find an renv installation: the project will not be loaded.", 663 | "Use `renv::activate()` to re-initialize the project." 664 | ) 665 | 666 | warning(paste(msg, collapse = "\n"), call. = FALSE) 667 | 668 | }) 669 | -------------------------------------------------------------------------------- /renv/settings.dcf: -------------------------------------------------------------------------------- 1 | external.libraries: 2 | ignored.packages: 3 | package.dependency.fields: Imports, Depends, LinkingTo 4 | r.version: 5 | snapshot.type: implicit 6 | use.cache: TRUE 7 | vcs.ignore.library: TRUE 8 | vcs.ignore.local: TRUE 9 | --------------------------------------------------------------------------------
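A minimal sketch of rebuilding the environment before re-running either experiment, assuming the repository root is the working directory and that `_drake-kj.R` / `_drake-raschka.R` still run against the updated package versions noted in the README (renv/activate.R above is sourced automatically by the project's .Rprofile when R starts):

```r
# restore the package library pinned in renv.lock
renv::restore()

# build the Kuhn-Johnson performance-experiment plan;
# substitute source = "_drake-raschka.R" to build the Raschka plan
drake::r_make(source = "_drake-kj.R")
```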