├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── NAMESPACE
├── NEWS.md
├── R
│   ├── checkargs.R
│   ├── estimateerrorparams.R
│   ├── findooberrors.R
│   ├── findtestpreds.R
│   ├── perror.R
│   ├── qerror.R
│   └── quantforesterror.R
├── README.md
├── cran-comments.md
├── forestError.Rproj
├── forestError_1.1.0.pdf
├── inst
│   └── CITATION
└── man
    ├── estimateErrorParams.Rd
    ├── findOOBErrors.Rd
    ├── findTestPreds.Rd
    ├── perror.Rd
    ├── qerror.Rd
    └── quantForestError.Rd
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | ^README\.Rmd$
4 | ^cran-comments\.md$
5 | .git$
6 | .gitignore$
7 | ^CRAN-RELEASE$
8 | ^NEWS\.Rmd$
9 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | .DS_Store
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: forestError
2 | Type: Package
3 | Title: A Unified Framework for Random Forest Prediction Error Estimation
4 | Version: 1.1.0
5 | Author: Benjamin Lu and Johanna Hardin
6 | Maintainer: Benjamin Lu
7 | Description: Estimates the conditional error distributions of random forest
8 |     predictions and common parameters of those distributions, including
9 |     conditional misclassification rates, conditional mean squared prediction
10 |     errors, conditional biases, and conditional quantiles, by out-of-bag
11 |     weighting of out-of-bag prediction errors as proposed by Lu and Hardin
12 |     (2021). This package is compatible with several existing packages that
13 |     implement random forests in R.
14 | Imports:
15 |     data.table,
16 |     purrr
17 | Suggests:
18 |     randomForest
19 | License: GPL-3
20 | Encoding: UTF-8
21 | RoxygenNote: 7.1.1
22 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 | export(findOOBErrors)
4 | export(quantForestError)
5 | import(data.table)
6 | importFrom(purrr,map_dbl)
7 | importFrom(stats,predict)
8 |
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 | ## Version 1.1.0 Update
2 |
3 | Version 1.1.0 makes two changes. First, it enables estimation of the conditional misclassification rate of predictions by classification random forests, as proposed by Lu and Hardin (2021). Second, it compartmentalizes a costly step in the `quantForestError` algorithm: the identification of each training observation's out-of-bag terminal nodes.
4 |
5 | ### Conditional Misclassification Rate Estimation
6 |
7 | The conditional misclassification rate of predictions by classification random forests can now be estimated. To estimate it, simply set the `what` argument in the `quantForestError` function to `"mcr"`. `what` will default to this if the provided `forest` is a classification random forest. See the example code in the README for a toy demonstration of the performance of this estimator.
8 |
9 | ### Compartmentalization
10 |
11 | The identification of each training observation's out-of-bag terminal nodes is now compartmentalized from the main `quantForestError` function. By isolating this step, Version 1.1.0 allows users to iterate the algorithm more efficiently.
Users may wish to feed `quantForestError` batches of test observations iteratively if they have streaming data or a test set too large to process in one go due to memory constraints. In previous versions of this package, doing so required the algorithm to recompute each training observation's out-of-bag terminal nodes in each iteration, which was redundant and costly. By separating this computation from the rest of the `quantForestError` algorithm, Version 1.1.0 allows the user to perform it only once.
12 |
13 | As part of this modularization, the `quantForestError` function now has two additional arguments. If `return_train_nodes` is set to `TRUE`, `quantForestError` will return a `data.table` identifying each training observation's out-of-bag terminal nodes. This `data.table` can then be fed back into `quantForestError` via the `train_nodes` argument to avoid the redundant recomputation.
14 |
15 | Version 1.1.0 also exports the function that produces this `data.table`: `findOOBErrors`. Given the same inputs, `findOOBErrors` produces the same output that is returned by setting `return_train_nodes` to `TRUE` in the `quantForestError` function.
16 |
17 | See the documentation on `quantForestError` and `findOOBErrors` for examples.
18 |
19 | Neither of these changes affects code that relied on Version 1.0.0 of this package: the changes consist solely of a newly exported function, two optional arguments to `quantForestError` that by default do nothing new, and a new possible input for the `what` argument.
20 |
21 | ## Version 1.0.0 Update
22 |
23 | This package has been updated to reflect the conventional sign of bias (mean prediction minus mean response). Previous versions of the package returned the negative bias (mean response minus mean prediction).
The sign of any algebraic operations involving the bias reported by this package must therefore be reversed to preserve their intended effect.
24 |
25 | In the future, we hope to implement a stochastic version of the `quantForestError` function, in which the parameters are estimated from random subsets of the training sample and/or the trees of the random forest.
26 |
27 | ## Version 0.2.0 Update
28 |
29 | Thanks to John Sheffield ([GitHub Profile](https://github.com/sheffe)) for his helpful improvements to the computational performance of this package. (See the [Issue Tracker](https://github.com/benjilu/forestError/issues/2) for details.) These changes, which substantially reduce the runtime and memory load of this package's `quantForestError`, `perror`, and `qerror` functions, have been implemented in Version 0.2.0.
30 |
31 | Version 0.2.0 also allows the user to generate conditional prediction intervals with different type-I error rates in a single call of the `quantForestError` function.
32 |
--------------------------------------------------------------------------------
/R/checkargs.R:
--------------------------------------------------------------------------------
1 | # check forest argument for problems 2 | checkForest <- function(forest) { 3 | if (typeof(forest) != "list") { 4 | stop("'forest' is not of the correct type") 5 | } else if (!any(c("randomForest", "ranger", "rfsrc", "quantregForest") %in% class(forest))) { 6 | stop("'forest' is not of the correct class") 7 | } else if (is.null(forest$inbag)) { 8 | stop("'forest' does not have record of which training observations are in bag for each tree.
Re-fit the random forest with argument keep.inbag = TRUE") 9 | } 10 | } 11 | 12 | # check training and test covariate arguments for problems 13 | checkXtrainXtest <- function(X.train, X.test) { 14 | if (length(dim(X.train)) != 2) { 15 | stop("'X.train' must be a matrix or data.frame of dimension 2") 16 | } else if (length(dim(X.test)) != 2) { 17 | stop("'X.test' must be a matrix or data.frame of dimension 2") 18 | } else if (ncol(X.train) != ncol(X.test)) { 19 | stop("'X.train' and 'X.test' must have the same predictor variables") 20 | } 21 | } 22 | 23 | # check training response argument for problems 24 | checkYtrain <- function(forest, Y.train, n.train) { 25 | if ("ranger" %in% class(forest)) { 26 | if (is.null(Y.train)) { 27 | stop("You must supply the training responses (Y.train)") 28 | } else if (length(Y.train) != n.train) { 29 | stop("Number of training responses does not match number of training observations") 30 | } 31 | } 32 | } 33 | 34 | # check type-I error rate argument for problems 35 | checkAlpha <- function(alpha) { 36 | if (typeof(alpha) != "double") { 37 | stop("'alpha' must be of type double") 38 | } else if (any(alpha <= 0 | alpha >= 1)) { 39 | stop("'alpha' must be in (0, 1)") 40 | } 41 | } 42 | 43 | # check index argument in perror and qerror functions for problems 44 | checkxs <- function(xs, n.test) { 45 | if (max(xs) > n.test | min(xs) < 1) { 46 | stop("Test indices are out of bounds") 47 | } else if (any(xs %% 1 != 0)) { 48 | stop("Test indices must be whole numbers") 49 | } 50 | } 51 | 52 | # check probability argument in qerror for problems 53 | checkps <- function(p) { 54 | if (max(p) > 1 | min(p) < 0) { 55 | stop("Probabilities must be between 0 and 1") 56 | } 57 | } 58 | 59 | # check core argument for problems 60 | checkcores <- function(n.cores) { 61 | if (is.null(n.cores)) { 62 | stop("Number of cores must be specified") 63 | } else if (n.cores < 1) { 64 | stop("Number of cores must be at least 1") 65 | } else if (n.cores %% 1 != 0) 
{ 66 | stop("Number of cores must be an integer") 67 | } 68 | } 69 | 70 | # check requested parameters 71 | checkwhat <- function(what, forest) { 72 | if (is.null(what)) { 73 | stop("Please specify the parameters to be estimated") 74 | } else if ("mcr" %in% what & length(what) > 2) { 75 | stop("Misclassification rate cannot be estimated for real-valued responses") 76 | } else if ("mcr" %in% what) { 77 | if ("quantregForest" %in% class(forest)) { 78 | if (forest$type != "classification") { 79 | stop("Misclassification rate can be estimated only for classification random forests") 80 | } 81 | } else if ("randomForest" %in% class(forest)) { 82 | if (forest$type != "classification") { 83 | stop("Misclassification rate can be estimated only for classification random forests") 84 | } 85 | } else if ("rfsrc" %in% class(forest)) { 86 | if (forest$family != "class") { 87 | stop("Misclassification rate can be estimated only for classification random forests") 88 | } 89 | } else if ("ranger" %in% class(forest)) { 90 | if (forest$treetype != "Classification") { 91 | stop("Misclassification rate can be estimated only for classification random forests") 92 | } 93 | } 94 | } else if (any(c("mspe", "bias", "interval", "p.error", "q.error") %in% what)) { 95 | if ("quantregForest" %in% class(forest)) { 96 | if (forest$type == "classification") { 97 | stop("Requested parameters cannot be estimated for classification random forests") 98 | } 99 | } else if ("randomForest" %in% class(forest)) { 100 | if (forest$type == "classification") { 101 | stop("Requested parameters cannot be estimated for classification random forests") 102 | } 103 | } else if ("rfsrc" %in% class(forest)) { 104 | if (forest$family == "class") { 105 | stop("Requested parameters cannot be estimated for classification random forests") 106 | } 107 | } else if ("ranger" %in% class(forest)) { 108 | if (forest$treetype == "Classification") { 109 | stop("Requested parameters cannot be estimated for classification random
forests") 110 | } 111 | } 112 | } 113 | } 114 | -------------------------------------------------------------------------------- /R/estimateerrorparams.R: -------------------------------------------------------------------------------- 1 | #' Estimate prediction error distribution parameters 2 | #' 3 | #' Estimates the prediction error distribution parameters requested in the input 4 | #' to \code{quantForestError}. 5 | #' 6 | #' This function is for internal use. 7 | #' 8 | #' @param train_nodes A \code{data.table} indicating which out-of-bag prediction 9 | #' errors are in each terminal node of each tree in the random forest. It 10 | #' must be formatted like the output of the \code{findOOBErrors} function. 11 | #' @param test_nodes A \code{data.table} indicating which test observations are 12 | #' in each terminal node of each tree in the random forest. It must be 13 | #' formatted like the output of the \code{findTestPreds} function. 14 | #' @param mspewhat A boolean indicating whether to estimate conditional MSPE. 15 | #' @param biaswhat A boolean indicating whether to estimate conditional bias. 16 | #' @param intervalwhat A boolean indicating whether to estimate conditional 17 | #' prediction intervals. 18 | #' @param pwhat A boolean indicating whether to estimate the conditional 19 | #' prediction error CDFs. 20 | #' @param qwhat A boolean indicating whether to estimate the conditional 21 | #' prediction error quantile functions. 22 | #' @param mcrwhat A boolean indicating whether to estimate the conditional 23 | #' misclassification rate. 24 | #' @param alpha A vector of type-I error rates desired for the conditional prediction 25 | #' intervals; required if \code{intervalwhat} is \code{TRUE}. 26 | #' @param n.test The number of test observations. 
27 | #' 28 | #' @return A \code{data.frame} with one or more of the following columns: 29 | #' 30 | #' \item{pred}{The random forest predictions of the test observations} 31 | #' \item{mspe}{The estimated conditional mean squared prediction errors of 32 | #' the random forest predictions} 33 | #' \item{bias}{The estimated conditional biases of the random forest 34 | #' predictions} 35 | #' \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 36 | #' prediction intervals for the test observations} 37 | #' \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 38 | #' prediction intervals for the test observations} 39 | #' \item{mcr}{The estimated conditional misclassification rate of the random 40 | #' forest predictions} 41 | #' 42 | #' In addition, one or both of the following functions: 43 | #' 44 | #' \item{perror}{The estimated cumulative distribution functions of the 45 | #' conditional error distributions associated with the test predictions} 46 | #' \item{qerror}{The estimated quantile functions of the conditional error 47 | #' distributions associated with the test predictions} 48 | #' 49 | #' @seealso \code{\link{quantForestError}}, \code{\link{findOOBErrors}}, \code{\link{findTestPreds}} 50 | #' 51 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 52 | #' 53 | #' @import data.table 54 | #' @importFrom purrr map_dbl 55 | #' @keywords internal 56 | estimateErrorParams <- function(train_nodes, test_nodes, mspewhat, biaswhat, intervalwhat, pwhat, qwhat, mcrwhat, alpha, n.test) { 57 | 58 | # set columns to return 59 | col_names <- c("pred", "mspe", "bias", "mcr")[c(TRUE, mspewhat, biaswhat, mcrwhat)] 60 | 61 | # initialize overall output 62 | output <- NULL 63 | 64 | # if the user wants misclassification rate 65 | if (mcrwhat) { 66 | # join the train and test edgelists by tree/node and compute relevant summary 67 | # statistics via chaining 68 | oob_error_stats <- 69 | train_nodes[test_nodes, 70 | .(tree, 
terminal_node, rowid_test, pred, node_errs)][, 71 | .(mcr = mean(unlist(node_errs))), 72 | keyby = c("rowid_test", "pred")] 73 | # else if the user does not want intervals, p.error, or q.error 74 | } else if (!intervalwhat & !pwhat & !qwhat) { 75 | 76 | # produce whichever of mspe and bias the user wants 77 | if (mspewhat & !biaswhat) { 78 | # join the train and test edgelists by tree/node and compute relevant summary 79 | # statistics via chaining 80 | oob_error_stats <- 81 | train_nodes[test_nodes, 82 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 83 | .(mspe = mean(unlist(node_errs) ^ 2)), 84 | keyby = c("rowid_test", "pred")] 85 | } else if (!mspewhat & biaswhat) { 86 | # join the train and test edgelists by tree/node and compute relevant summary 87 | # statistics via chaining 88 | oob_error_stats <- 89 | train_nodes[test_nodes, 90 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 91 | .(bias = -mean(unlist(node_errs))), 92 | keyby = c("rowid_test", "pred")] 93 | } else { 94 | # join the train and test edgelists by tree/node and compute relevant summary 95 | # statistics via chaining 96 | oob_error_stats <- 97 | train_nodes[test_nodes, 98 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 99 | .(mspe = mean(unlist(node_errs) ^ 2), 100 | bias = -mean(unlist(node_errs))), 101 | keyby = c("rowid_test", "pred")] 102 | } 103 | 104 | # else, do the same as above but keep the full list of OOB cohabitant prediction errors 105 | } else { 106 | 107 | # produce whichever of mspe and bias the user wants 108 | if (mspewhat & !biaswhat) { 109 | # join the train and test edgelists by tree/node and compute relevant summary 110 | # statistics via chaining 111 | oob_error_stats <- 112 | train_nodes[test_nodes, 113 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 114 | .(mspe = mean(unlist(node_errs) ^ 2), 115 | all_errs = list(sort(unlist(node_errs)))), 116 | keyby = c("rowid_test", "pred")] 117 | } else if (!mspewhat & biaswhat) { 118 | # join the 
train and test edgelists by tree/node and compute relevant summary 119 | # statistics via chaining 120 | oob_error_stats <- 121 | train_nodes[test_nodes, 122 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 123 | .(bias = -mean(unlist(node_errs)), 124 | all_errs = list(sort(unlist(node_errs)))), 125 | keyby = c("rowid_test", "pred")] 126 | } else if (mspewhat & biaswhat) { 127 | # join the train and test edgelists by tree/node and compute relevant summary 128 | # statistics via chaining 129 | oob_error_stats <- 130 | train_nodes[test_nodes, 131 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 132 | .(mspe = mean(unlist(node_errs) ^ 2), 133 | bias = -mean(unlist(node_errs)), 134 | all_errs = list(sort(unlist(node_errs)))), 135 | keyby = c("rowid_test", "pred")] 136 | } else { 137 | # join the train and test edgelists by tree/node and compute relevant summary 138 | # statistics via chaining 139 | oob_error_stats <- 140 | train_nodes[test_nodes, 141 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 142 | .(all_errs = list(sort(unlist(node_errs)))), 143 | keyby = c("rowid_test", "pred")] 144 | } 145 | } 146 | 147 | # if the user requests prediction intervals 148 | if (intervalwhat) { 149 | 150 | # check the alpha argument for issues 151 | checkAlpha(alpha) 152 | 153 | # format prediction interval output 154 | percentiles <- sort(c(alpha / 2, 1 - (alpha / 2))) 155 | interval_col_names <- paste0(rep(c("lower_", "upper_"), each = length(alpha)), 156 | c(alpha, rev(alpha))) 157 | col_names <- c(col_names, interval_col_names) 158 | 159 | # compute prediction intervals 160 | oob_error_stats[, (interval_col_names) := lapply(percentiles, FUN = function(p) { 161 | pred + purrr::map_dbl(all_errs, ~.x[ceiling(length(.x) * p)])})] 162 | } 163 | 164 | # extract summary statistics requested as data.frame 165 | output_df <- data.table::setDF(oob_error_stats[, ..col_names]) 166 | 167 | # if the user requests p.error 168 | if (pwhat) { 169 | 170 | # remove 
summary statistics from oob_error_stats 171 | oob_error_stats <- oob_error_stats[, .(all_errs)] 172 | 173 | # define the estimated CDF function 174 | perror <- function(q, xs = 1:n.test) { 175 | 176 | # check xs argument for issues 177 | checkxs(xs, n.test) 178 | 179 | # format output 180 | p_col_names <- paste0("p_", q) 181 | 182 | # compute CDF 183 | oob_error_stats[xs, (p_col_names) := lapply(q, FUN = function(que) { 184 | purrr::map_dbl(all_errs, ~mean(.x <= que))})] 185 | 186 | # format and return output 187 | to.return <- data.table::setDF(oob_error_stats[xs, ..p_col_names]) 188 | row.names(to.return) <- xs 189 | return(to.return) 190 | } 191 | 192 | # add to output 193 | output <- list(output_df, perror) 194 | names(output) <- c("estimates", "perror") 195 | } 196 | 197 | # if the user requests q.error 198 | if (qwhat) { 199 | 200 | # remove summary statistics from oob_error_stats 201 | oob_error_stats <- oob_error_stats[, .(all_errs)] 202 | 203 | # define the estimated quantile function 204 | qerror <- function(p, xs = 1:n.test) { 205 | 206 | # check p and xs arguments for issues 207 | checkps(p) 208 | checkxs(xs, n.test) 209 | 210 | # format output 211 | q_col_names <- paste0("q_", p) 212 | 213 | # compute quantiles 214 | oob_error_stats[xs, (q_col_names) := lapply(p, FUN = function(pe) { 215 | purrr::map_dbl(all_errs, ~.x[ceiling(length(.x) * pe)])})] 216 | 217 | # format and return output 218 | to.return <- data.table::setDF(oob_error_stats[xs, ..q_col_names]) 219 | row.names(to.return) <- xs 220 | return(to.return) 221 | } 222 | 223 | # add quantile function to output 224 | if (pwhat) { 225 | output[["qerror"]] <- qerror 226 | } else { 227 | output <- list(output_df, qerror) 228 | names(output) <- c("estimates", "qerror") 229 | } 230 | } 231 | 232 | # return output 233 | if (is.null(output)) { 234 | return(output_df) 235 | } else { 236 | return(output) 237 | } 238 | } 239 | --------------------------------------------------------------------------------
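The prediction-interval bounds computed in `estimateErrorParams` come from empirical quantiles of each test observation's pooled out-of-bag cohabitant errors: the sorted error vector is indexed at `ceiling(length(.x) * p)` and added to the test prediction. A minimal sketch of that rule on made-up numbers (`pred` and `all_errs` are hypothetical, not package output):

```r
# Hypothetical test prediction and its pooled, sorted OOB cohabitant errors
pred <- 10
all_errs <- sort(c(-3.1, -1.2, -0.4, 0.2, 0.9, 1.5, 2.8, 4.0))
alpha <- 0.1  # type-I error rate for a 90% prediction interval

# empirical-quantile rule used above: index the sorted errors at ceiling(n * p)
lower <- pred + all_errs[ceiling(length(all_errs) * (alpha / 2))]
upper <- pred + all_errs[ceiling(length(all_errs) * (1 - alpha / 2))]
c(lower = lower, upper = upper)  # 6.9 and 14
```

With eight errors, the 0.05 quantile indexes the first sorted error and the 0.95 quantile indexes the eighth, mirroring the `purrr::map_dbl(all_errs, ~.x[ceiling(length(.x) * p)])` call in the function body.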
/R/findooberrors.R: -------------------------------------------------------------------------------- 1 | #' Compute and locate out-of-bag prediction errors 2 | #' 3 | #' Computes each training observation's out-of-bag prediction error using the 4 | #' random forest and, for each tree for which the training observation is 5 | #' out of bag, finds the terminal node of the tree in which the training 6 | #' observation falls. 7 | #' 8 | #' This function accepts classification or regression random forests built using 9 | #' the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 10 | #' \code{quantregForest} packages. When training the random forest using 11 | #' \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 12 | #' must be set to \code{TRUE}. When training the random forest using 13 | #' \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 14 | #' 15 | #' @param forest The random forest object being used for prediction. 16 | #' @param X.train A \code{matrix} or \code{data.frame} with the observations 17 | #' that were used to train \code{forest}. Each row should be an observation, 18 | #' and each column should be a predictor variable. 19 | #' @param Y.train A vector of the responses of the observations that were used 20 | #' to train \code{forest}. Required if \code{forest} was created using 21 | #' \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 22 | #' \code{randomForestSRC}, or \code{quantregForest}. 23 | #' @param n.cores Number of cores to use (for parallel computation in \code{ranger}). 
24 | #' 25 | #' @return A \code{data.table} with the following three columns: 26 | #' 27 | #' \item{tree}{The ID of the tree of the random forest} 28 | #' \item{terminal_node}{The ID of the terminal node of the tree} 29 | #' \item{node_errs}{A vector of the out-of-bag prediction errors that fall 30 | #' within the terminal node of the tree} 31 | #' 32 | #' @seealso \code{\link{quantForestError}} 33 | #' 34 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 35 | #' 36 | #' @examples 37 | #' # load data 38 | #' data(airquality) 39 | #' 40 | #' # remove observations with missing predictor variable values 41 | #' airquality <- airquality[complete.cases(airquality), ] 42 | #' 43 | #' # get number of observations and the response column index 44 | #' n <- nrow(airquality) 45 | #' response.col <- 1 46 | #' 47 | #' # split data into training and test sets 48 | #' train.ind <- sample(1:n, n * 0.9, replace = FALSE) 49 | #' Xtrain <- airquality[train.ind, -response.col] 50 | #' Ytrain <- airquality[train.ind, response.col] 51 | #' Xtest <- airquality[-train.ind, -response.col] 52 | #' 53 | #' # fit random forest to the training data 54 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 55 | #' ntree = 500, keep.inbag = TRUE) 56 | #' 57 | #' # compute out-of-bag prediction errors and locate each 58 | #' # training observation in the trees for which it is out 59 | #' # of bag 60 | #' train_nodes <- findOOBErrors(rf, Xtrain) 61 | #' 62 | #' # estimate conditional mean squared prediction errors, 63 | #' # biases, prediction intervals, and error distribution 64 | #' # functions for the test observations. provide 65 | #' # train_nodes to avoid recomputing that step. 
66 | #' output <- quantForestError(rf, Xtrain, Xtest, 67 | #' train_nodes = train_nodes) 68 | #' 69 | #' @importFrom stats predict 70 | #' @import data.table 71 | #' @export 72 | findOOBErrors <- function(forest, X.train, Y.train = NULL, n.cores = 1) { 73 | 74 | # determine whether this is a regression or classification random forest 75 | categorical <- grepl("class", c(forest$type, forest$family, forest$treetype), TRUE) 76 | 77 | # if the forest is from the quantregForest package 78 | if ("quantregForest" %in% class(forest)) { 79 | 80 | # convert to random forest class 81 | class(forest) <- "randomForest" 82 | 83 | # get terminal nodes of training observations 84 | train.terminal.nodes <- attr(predict(forest, X.train, nodes = TRUE), "nodes") 85 | 86 | # get number of times each training observation appears in each tree 87 | bag.count <- forest$inbag 88 | 89 | # get the OOB prediction error of each training observation 90 | if (categorical) { 91 | oob.errors <- forest$y != forest$predicted 92 | } else { 93 | oob.errors <- forest$y - forest$predicted 94 | } 95 | 96 | # else if the forest is from the randomForest package 97 | } else if ("randomForest" %in% class(forest)) { 98 | 99 | # get terminal nodes of training observations 100 | train.terminal.nodes <- attr(predict(forest, X.train, nodes = TRUE), "nodes") 101 | 102 | # get number of times each training observation appears in each tree 103 | bag.count <- forest$inbag 104 | 105 | # get the OOB prediction error of each training observation 106 | if (categorical) { 107 | oob.errors <- forest$y != forest$predicted 108 | } else { 109 | oob.errors <- forest$y - forest$predicted 110 | } 111 | 112 | # else, if the forest is from the ranger package 113 | } else if ("ranger" %in% class(forest)) { 114 | 115 | # get terminal nodes of training observations 116 | train.terminal.nodes <- predict(forest, X.train, num.threads = n.cores, type = "terminalNodes")$predictions 117 | 118 | # get number of times each training 
observation appears in each tree 119 | bag.count <- matrix(unlist(forest$inbag.counts, use.names = FALSE), ncol = forest$num.trees, byrow = FALSE) 120 | 121 | # get the OOB prediction error of each training observation 122 | if (categorical) { 123 | oob.errors <- Y.train != forest$predictions 124 | } else { 125 | oob.errors <- Y.train - forest$predictions 126 | } 127 | 128 | # else, if the forest is from the randomForestSRC package 129 | } else if ("rfsrc" %in% class(forest)) { 130 | 131 | # get terminal nodes of all observations 132 | train.terminal.nodes <- forest$membership 133 | 134 | # get number of times each training observation appears in each tree 135 | bag.count <- forest$inbag 136 | 137 | # get the OOB prediction error of each training observation 138 | if (categorical) { 139 | oob.errors <- forest$yvar != forest$class.oob 140 | } else { 141 | oob.errors <- forest$yvar - forest$predicted.oob 142 | } 143 | } 144 | 145 | # get the terminal nodes of the training observations in the trees in which 146 | # they are OOB (for all other trees, set the terminal node to be NA) 147 | train.terminal.nodes[bag.count != 0] <- NA 148 | 149 | # reshape train.terminal.nodes to be a long data.table and include OOB 150 | # prediction errors as a column 151 | train_nodes <- data.table::as.data.table(train.terminal.nodes) 152 | train_nodes[, `:=`(oob_error = oob.errors)] 153 | train_nodes <- data.table::melt( 154 | train_nodes, 155 | id.vars = c("oob_error"), 156 | measure.vars = 1:ncol(train.terminal.nodes), 157 | variable.name = "tree", 158 | value.name = "terminal_node", 159 | variable.factor = FALSE, 160 | na.rm = TRUE) 161 | 162 | # collapse the long data.table by unique tree/node 163 | train_nodes <- train_nodes[, 164 | .(node_errs = list(oob_error)), 165 | keyby = c("tree", "terminal_node")] 166 | 167 | # return long data.table of OOB error values and locations 168 | return(train_nodes) 169 | } 170 | 
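The melt-and-collapse step at the end of `findOOBErrors` can be seen on a toy example (hypothetical data, not package code): three training observations and two trees, where `NA` marks the trees in which an observation is in bag, as produced by the `train.terminal.nodes[bag.count != 0] <- NA` line above.

```r
library(data.table)

# Hypothetical terminal-node matrix: rows = training observations,
# columns = trees; NA marks trees where the observation was in bag
train.terminal.nodes <- matrix(c(5L, NA, 7L, NA, 5L, 7L), nrow = 3,
                               dimnames = list(NULL, c("tree1", "tree2")))
oob.errors <- c(0.4, -1.1, 0.3)  # hypothetical OOB prediction errors

# same reshape as findOOBErrors: long format, dropping in-bag (NA) entries
train_nodes <- data.table::as.data.table(train.terminal.nodes)
train_nodes[, oob_error := oob.errors]
train_nodes <- data.table::melt(train_nodes, id.vars = "oob_error",
                                variable.name = "tree",
                                value.name = "terminal_node",
                                variable.factor = FALSE, na.rm = TRUE)

# collapse to one row per tree/terminal-node pair; each node's OOB errors
# are stored together as a list column, keyed for the later join
train_nodes <- train_nodes[, .(node_errs = list(oob_error)),
                           keyby = c("tree", "terminal_node")]
train_nodes  # 4 rows: (tree1,5), (tree1,7), (tree2,5), (tree2,7)
```

Observation 3 is out of bag in both trees, so its error 0.3 appears in the `node_errs` list of node 7 in each tree; this keyed `data.table` is what `estimateErrorParams` joins against the test-observation edgelist.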
-------------------------------------------------------------------------------- /R/findtestpreds.R: -------------------------------------------------------------------------------- 1 | #' Compute and locate test predictions 2 | #' 3 | #' Predicts each test observation's response using the random forest and, for 4 | #' each test observation and tree, finds the terminal node of the tree in which 5 | #' the test observation falls. 6 | #' 7 | #' This function accepts regression random forests built using the \code{randomForest}, 8 | #' \code{ranger}, \code{randomForestSRC}, and \code{quantregForest} packages. 9 | #' 10 | #' @param forest The random forest object being used for prediction. 11 | #' @param X.test A \code{matrix} or \code{data.frame} with the observations to 12 | #' be predicted. Each row should be an observation, and each column should be 13 | #' a predictor variable. 14 | #' @param n.cores Number of cores to use (for parallel computation in \code{ranger}). 15 | #' 16 | #' @return A \code{data.table} with the following four columns: 17 | #' 18 | #' \item{rowid_test}{The row ID of the test observation as provided by \code{X.test}} 19 | #' \item{pred}{The random forest prediction of the test observation} 20 | #' \item{tree}{The ID of the tree of the random forest} 21 | #' \item{terminal_node}{The ID of the terminal node of the tree in which the 22 | #' test observation falls} 23 | #' 24 | #' @seealso \code{\link{findOOBErrors}}, \code{\link{quantForestError}} 25 | #' 26 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 27 | #' 28 | #' @importFrom stats predict 29 | #' @import data.table 30 | #' @keywords internal 31 | findTestPreds <- function(forest, X.test, n.cores = 1) { 32 | 33 | # determine whether this is a regression or classification random forest 34 | categorical <- grepl("class", c(forest$type, forest$family, forest$treetype), TRUE) 35 | 36 | # if the forest is from the quantregForest package 37 | if ("quantregForest" %in% class(forest)) { 
38 | 39 | # convert to random forest class 40 | class(forest) <- "randomForest" 41 | 42 | # get test predictions 43 | test.preds <- predict(forest, X.test, nodes = TRUE) 44 | 45 | # get terminal nodes of test observations 46 | test.terminal.nodes <- attr(test.preds, "nodes") 47 | 48 | # format test observation predictions 49 | attr(test.preds, "nodes") <- NULL 50 | 51 | # else if the forest is from the randomForest package 52 | } else if ("randomForest" %in% class(forest)) { 53 | 54 | # get test predictions 55 | test.preds <- predict(forest, X.test, nodes = TRUE) 56 | 57 | # get terminal nodes of test observations 58 | test.terminal.nodes <- attr(test.preds, "nodes") 59 | 60 | # format test observation predictions 61 | attr(test.preds, "nodes") <- NULL 62 | 63 | # else, if the forest is from the ranger package 64 | } else if ("ranger" %in% class(forest)) { 65 | 66 | # get terminal nodes of test observations 67 | test.terminal.nodes <- predict(forest, X.test, num.threads = n.cores, type = "terminalNodes")$predictions 68 | 69 | # get test observation predictions 70 | test.preds <- predict(forest, X.test)$predictions 71 | 72 | # else, if the forest is from the randomForestSRC package 73 | } else if ("rfsrc" %in% class(forest)) { 74 | 75 | # get test predictions 76 | test.pred.list <- predict(forest, X.test, membership = TRUE) 77 | 78 | # get terminal nodes of test observations 79 | test.terminal.nodes <- test.pred.list$membership 80 | 81 | # format test observation predictions 82 | if (categorical) { 83 | test.preds <- test.pred.list$class 84 | } else { 85 | test.preds <- test.pred.list$predicted 86 | } 87 | } 88 | 89 | # reshape test.terminal.nodes to be a long data.table and 90 | # add unique IDs and predicted values 91 | test_nodes <- data.table::melt( 92 | data.table::as.data.table(test.terminal.nodes)[, `:=`(rowid_test = .I, pred = test.preds)], 93 | id.vars = c("rowid_test", "pred"), 94 | measure.vars = 1:ncol(test.terminal.nodes), 95 | variable.name = "tree", 
96 | variable.factor = FALSE, 97 | value.name = "terminal_node") 98 | 99 | # set key columns for faster indexing 100 | data.table::setkey(test_nodes, tree, terminal_node) 101 | 102 | # return data.table 103 | return(test_nodes) 104 | } 105 | -------------------------------------------------------------------------------- /R/perror.R: -------------------------------------------------------------------------------- 1 | #' Estimated conditional prediction error CDFs 2 | #' 3 | #' Returns probabilities from the estimated conditional cumulative distribution 4 | #' function of the prediction error associated with each test observation. 5 | #' 6 | #' This function is only defined as output of the \code{quantForestError} function. 7 | #' It is not exported as a standalone function. See the example. 8 | #' 9 | #' @usage perror(q, xs) 10 | #' 11 | #' @param q A vector of quantiles. 12 | #' @param xs A vector of the indices of the test observations for which the 13 | #' conditional error CDFs are desired. Defaults to all test observations 14 | #' given in the call of \code{quantForestError}. 15 | #' 16 | #' @return If either \code{q} or \code{xs} has length one, then a vector is 17 | #' returned with the desired probabilities. If both have length greater than 18 | #' one, then a \code{data.frame} of probabilities is returned, with rows 19 | #' corresponding to the inputted \code{xs} and columns corresponding to the 20 | #' inputted \code{q}. 
21 | #' 22 | #' @seealso \code{\link{quantForestError}} 23 | #' 24 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 25 | #' 26 | #' @examples 27 | #' # load data 28 | #' data(airquality) 29 | #' 30 | #' # remove observations with missing predictor variable values 31 | #' airquality <- airquality[complete.cases(airquality), ] 32 | #' 33 | #' # get number of observations and the response column index 34 | #' n <- nrow(airquality) 35 | #' response.col <- 1 36 | #' 37 | #' # split data into training and test sets 38 | #' train.ind <- sample(1:n, n * 0.9, replace = FALSE) 39 | #' Xtrain <- airquality[train.ind, -response.col] 40 | #' Ytrain <- airquality[train.ind, response.col] 41 | #' Xtest <- airquality[-train.ind, -response.col] 42 | #' Ytest <- airquality[-train.ind, response.col] 43 | #' 44 | #' # fit random forest to the training data 45 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 46 | #' ntree = 500, 47 | #' keep.inbag = TRUE) 48 | #' 49 | #' # estimate conditional error distribution functions 50 | #' output <- quantForestError(rf, Xtrain, Xtest, 51 | #' what = c("p.error", "q.error")) 52 | #' 53 | #' # get the probability that the error associated with each test 54 | #' # prediction is less than -4 and the probability that the error 55 | #' # associated with each test prediction is less than 7 56 | #' output$perror(c(-4, 7)) 57 | #' 58 | #' # same as above but only for the first three test observations 59 | #' output$perror(c(-4, 7), 1:3) 60 | perror <- function(q, xs) {stop("perror is not exported as a standalone function. You must first run the quantForestError function to define perror. 
See documentation.")} 61 | -------------------------------------------------------------------------------- /R/qerror.R: -------------------------------------------------------------------------------- 1 | #' Estimated conditional prediction error quantile functions 2 | #' 3 | #' Returns quantiles of the estimated conditional error distribution associated 4 | #' with each test prediction. 5 | #' 6 | #' This function is only defined as output of the \code{quantForestError} function. 7 | #' It is not exported as a standalone function. See the example. 8 | #' 9 | #' @usage qerror(p, xs) 10 | #' 11 | #' @param p A vector of probabilities. 12 | #' @param xs A vector of the indices of the test observations for which the 13 | #' conditional error quantiles are desired. Defaults to all test observations 14 | #' given in the call of \code{quantForestError}. 15 | #' 16 | #' @return If either \code{p} or \code{xs} has length one, then a vector is 17 | #' returned with the desired quantiles. If both have length greater than 18 | #' one, then a \code{data.frame} of quantiles is returned, with rows 19 | #' corresponding to the inputted \code{xs} and columns corresponding to the 20 | #' inputted \code{p}. 
21 | #' 22 | #' @seealso \code{\link{quantForestError}} 23 | #' 24 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 25 | #' @examples 26 | #' # load data 27 | #' data(airquality) 28 | #' 29 | #' # remove observations with missing predictor variable values 30 | #' airquality <- airquality[complete.cases(airquality), ] 31 | #' 32 | #' # get number of observations and the response column index 33 | #' n <- nrow(airquality) 34 | #' response.col <- 1 35 | #' 36 | #' # split data into training and test sets 37 | #' train.ind <- sample(1:n, n * 0.9, replace = FALSE) 38 | #' Xtrain <- airquality[train.ind, -response.col] 39 | #' Ytrain <- airquality[train.ind, response.col] 40 | #' Xtest <- airquality[-train.ind, -response.col] 41 | #' Ytest <- airquality[-train.ind, response.col] 42 | #' 43 | #' # fit random forest to the training data 44 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 45 | #' ntree = 500, 46 | #' keep.inbag = TRUE) 47 | #' 48 | #' # estimate conditional error distribution functions 49 | #' output <- quantForestError(rf, Xtrain, Xtest, 50 | #' what = c("p.error", "q.error")) 51 | #' 52 | #' # get the 0.25 and 0.8 quantiles of the error distribution 53 | #' # associated with each test prediction 54 | #' output$qerror(c(0.25, 0.8)) 55 | #' 56 | #' # same as above but only for the first three test observations 57 | #' output$qerror(c(0.25, 0.8), 1:3) 58 | qerror <- function(p, xs) {stop("qerror is not exported as a standalone function. You must first run the quantForestError function to define qerror. 
See documentation.")} 59 | -------------------------------------------------------------------------------- /R/quantforesterror.R: -------------------------------------------------------------------------------- 1 | # avoid check note 2 | if(getRversion() >= "2.15.1"){utils::globalVariables(c("n.test", "tree", "terminal_node", 3 | ".", "oob_error", "rowid_test", "pred", 4 | "node_errs", "pred", "all_errs", "..col_names", 5 | "..p_col_names", "..q_col_names"))} 6 | 7 | #' Quantify random forest prediction error 8 | #' 9 | #' Estimates the conditional misclassification rates, conditional mean squared 10 | #' prediction errors, conditional biases, conditional prediction intervals, and 11 | #' conditional error distributions of random forest predictions. 12 | #' 13 | #' This function accepts classification or regression random forests built using 14 | #' the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 15 | #' \code{quantregForest} packages. When training the random forest using 16 | #' \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 17 | #' must be set to \code{TRUE}. When training the random forest using 18 | #' \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 19 | #' 20 | #' The predictions computed by \code{ranger} can be parallelized by setting the 21 | #' value of \code{n.cores} to be greater than 1. 22 | #' 23 | #' The random forest predictions are always returned as a \code{data.frame}. Additional 24 | #' columns are included in the \code{data.frame} depending on the user's selections in 25 | #' the argument \code{what}. 
In particular, including \code{"mspe"} in \code{what} 26 | #' will add an additional column with the conditional mean squared prediction 27 | #' error of each test prediction to the \code{data.frame}; including \code{"bias"} in 28 | #' \code{what} will add an additional column with the conditional bias of each test 29 | #' prediction to the \code{data.frame}; including \code{"interval"} in \code{what} 30 | #' will add to the \code{data.frame} additional columns with the lower and 31 | #' upper bounds of conditional prediction intervals for each test prediction; 32 | #' and including \code{"mcr"} in \code{what} will add an additional column with 33 | #' the conditional misclassification rate of each test prediction to the 34 | #' \code{data.frame}. The conditional misclassification rate can be estimated 35 | #' only for classification random forests, while the other parameters can be 36 | #' estimated only for regression random forests. 37 | #' 38 | #' If \code{"p.error"} or \code{"q.error"} is included in \code{what}, or if 39 | #' \code{return_train_nodes} is set to \code{TRUE}, then a list will be returned 40 | #' as output. The first element of the list, named \code{"estimates"}, is the 41 | #' \code{data.frame} described in the above paragraph. The other elements of the 42 | #' list are the estimated cumulative distribution functions (\code{perror}) of 43 | #' the conditional error distributions, the estimated quantile functions 44 | #' (\code{qerror}) of the conditional error distributions, and/or a \code{data.table} 45 | #' indicating what out-of-bag prediction errors each terminal node of each tree 46 | #' in the random forest contains. 47 | #' 48 | #' @param forest The random forest object being used for prediction. 49 | #' @param X.train A \code{matrix} or \code{data.frame} with the observations 50 | #' that were used to train \code{forest}. Each row should be an observation, 51 | #' and each column should be a predictor variable. 
52 | #' @param X.test A \code{matrix} or \code{data.frame} with the observations to 53 | #' be predicted; each row should be an observation, and each column should be 54 | #' a predictor variable. 55 | #' @param Y.train A vector of the responses of the observations that were used 56 | #' to train \code{forest}. Required if \code{forest} was created using 57 | #' \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 58 | #' \code{randomForestSRC}, or \code{quantregForest}. 59 | #' @param what A vector of characters indicating what estimates are desired. 60 | #' Possible options are conditional mean squared prediction errors (\code{"mspe"}), 61 | #' conditional biases (\code{"bias"}), conditional prediction intervals (\code{"interval"}), 62 | #' conditional error distribution functions (\code{"p.error"}), conditional 63 | #' error quantile functions (\code{"q.error"}), and conditional 64 | #' misclassification rate (\code{"mcr"}). Note that the conditional 65 | #' misclassification rate is available only for categorical outcomes, while 66 | #' the other parameters are available only for real-valued outcomes. 67 | #' @param alpha A vector of type-I error rates desired for the conditional prediction 68 | #' intervals; required if \code{"interval"} is included in \code{what}. 69 | #' @param train_nodes A \code{data.table} indicating what out-of-bag prediction 70 | #' errors each terminal node of each tree in \code{forest} contains. It should 71 | #' be formatted like the output of \code{findOOBErrors}. If not provided, 72 | #' it will be computed internally. 73 | #' @param return_train_nodes A boolean indicating whether to return the 74 | #' \code{train_nodes} computed and/or used. 75 | #' @param n.cores Number of cores to use (for parallel computation in \code{ranger}). 
76 | #' 77 | #' @return A \code{data.frame} with one or more of the following columns, as described 78 | #' in the details section: 79 | #' 80 | #' \item{pred}{The random forest predictions of the test observations} 81 | #' \item{mspe}{The estimated conditional mean squared prediction errors of 82 | #' the random forest predictions} 83 | #' \item{bias}{The estimated conditional biases of the random forest 84 | #' predictions} 85 | #' \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 86 | #' prediction intervals for the test observations} 87 | #' \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 88 | #' prediction intervals for the test observations} 89 | #' \item{mcr}{The estimated conditional misclassification rate of the random 90 | #' forest predictions} 91 | #' 92 | #' In addition, one or both of the following functions, as described in the 93 | #' details section: 94 | #' 95 | #' \item{perror}{The estimated cumulative distribution functions of the 96 | #' conditional error distributions associated with the test predictions} 97 | #' \item{qerror}{The estimated quantile functions of the conditional error 98 | #' distributions associated with the test predictions} 99 | #' 100 | #' In addition, if \code{return_train_nodes} is \code{TRUE}, then a \code{data.table} 101 | #' called \code{train_nodes} indicating what out-of-bag prediction errors each 102 | #' terminal node of each tree in \code{forest} contains. 
103 | #' 104 | #' @seealso \code{\link{perror}}, \code{\link{qerror}}, \code{\link{findOOBErrors}} 105 | #' 106 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 107 | #' 108 | #' @examples 109 | #' # load data 110 | #' data(airquality) 111 | #' 112 | #' # remove observations with missing predictor variable values 113 | #' airquality <- airquality[complete.cases(airquality), ] 114 | #' 115 | #' # get number of observations and the response column index 116 | #' n <- nrow(airquality) 117 | #' response.col <- 1 118 | #' 119 | #' # split data into training and test sets 120 | #' train.ind <- sample(c("A", "B", "C"), n, 121 | #' replace = TRUE, prob = c(0.8, 0.1, 0.1)) 122 | #' Xtrain <- airquality[train.ind == "A", -response.col] 123 | #' Ytrain <- airquality[train.ind == "A", response.col] 124 | #' Xtest1 <- airquality[train.ind == "B", -response.col] 125 | #' Xtest2 <- airquality[train.ind == "C", -response.col] 126 | #' 127 | #' # fit regression random forest to the training data 128 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 129 | #' ntree = 500, 130 | #' keep.inbag = TRUE) 131 | #' 132 | #' # estimate conditional mean squared prediction errors, 133 | #' # biases, prediction intervals, and error distribution 134 | #' # functions for the observations in Xtest1. return 135 | #' # train_nodes to avoid recomputation in the next 136 | #' # line of code. 137 | #' output1 <- quantForestError(rf, Xtrain, Xtest1, 138 | #' return_train_nodes = TRUE) 139 | #' 140 | #' # estimate just the conditional mean squared prediction errors 141 | #' # and prediction intervals for the observations in Xtest2. 142 | #' # avoid recomputation by providing train_nodes from the 143 | #' # previous line of code. 
144 | #' output2 <- quantForestError(rf, Xtrain, Xtest2, 145 | #' what = c("mspe", "interval"), 146 | #' train_nodes = output1$train_nodes) 147 | #' 148 | #' # for illustrative purposes, convert response to categorical 149 | #' Ytrain <- as.factor(Ytrain > 31.5) 150 | #' 151 | #' # fit classification random forest to the training data 152 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3, 153 | #' ntree = 500, 154 | #' keep.inbag = TRUE) 155 | #' 156 | #' # estimate conditional misclassification rate of the 157 | #' # predictions of Xtest1 158 | #' output <- quantForestError(rf, Xtrain, Xtest1) 159 | #' 160 | #' @aliases forestError 161 | #' 162 | #' @importFrom stats predict 163 | #' @import data.table 164 | #' @importFrom purrr map_dbl 165 | #' @export 166 | quantForestError <- function(forest, X.train, X.test, Y.train = NULL, what = if (grepl("class", c(forest$type, forest$family, forest$treetype), TRUE)) "mcr" else c("mspe", "bias", "interval", "p.error", "q.error"), alpha = 0.05, train_nodes = NULL, return_train_nodes = FALSE, n.cores = 1) { 167 | 168 | # check forest, X.train, X.test arguments for issues 169 | checkForest(forest) 170 | checkXtrainXtest(X.train, X.test) 171 | # get number of training and test observations 172 | n.train <- nrow(X.train) 173 | n.test <- nrow(X.test) 174 | # check Y.train and n.cores arguments for issues 175 | checkYtrain(forest, Y.train, n.train) 176 | checkcores(n.cores) 177 | # check requested error parameters 178 | checkwhat(what, forest) 179 | 180 | # check what user wants to produce 181 | mspewhat <- "mspe" %in% what 182 | biaswhat <- "bias" %in% what 183 | intervalwhat <- "interval" %in% what 184 | pwhat <- "p.error" %in% what 185 | qwhat <- "q.error" %in% what 186 | mcrwhat <- "mcr" %in% what 187 | 188 | # compute and locate out-of-bag training errors and test predictions 189 | if (is.null(train_nodes)) { 190 | train_nodes <- findOOBErrors(forest, X.train, Y.train, n.cores) 191 | } 192 | test_nodes <- 
findTestPreds(forest, X.test, n.cores) 193 | 194 | # estimate the requested prediction error distribution parameters 195 | output <- estimateErrorParams(train_nodes, test_nodes, mspewhat, biaswhat, 196 | intervalwhat, pwhat, qwhat, mcrwhat, 197 | alpha, n.test) 198 | 199 | # add train_nodes if requested 200 | if (return_train_nodes) { 201 | if (pwhat | qwhat) { 202 | output[["train_nodes"]] <- train_nodes 203 | } else { 204 | output <- list(output, train_nodes) 205 | names(output) <- c("estimates", "train_nodes") 206 | } 207 | } 208 | 209 | # return output 210 | return(output) 211 | } 212 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # forestError: A Unified Framework for Random Forest Prediction Error Estimation 2 | [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active) 3 | 4 | ## Version 1.1.0 Update 5 | 6 | Version 1.1.0 makes two changes. First, it enables estimation of the conditional misclassification rate of predictions by classification random forests as proposed by Lu and Hardin (2021). Second, it compartmentalizes a costly step in the `quantForestError` algorithm: The identification of each training observation's out-of-bag terminal nodes. 7 | 8 | ### Conditional Misclassification Rate Estimation 9 | 10 | The conditional misclassification rate of predictions by classification random forests can now be estimated. To estimate it, simply set the `what` argument in the `quantForestError` function to `"mcr"`. `what` will default to this if the provided `forest` is a classification random forest. See the example code below for a toy demonstration of the performance of this estimator. 
11 | 12 | ### Compartmentalization 13 | 14 | The identification of each training observation's out-of-bag terminal nodes is now compartmentalized from the main `quantForestError` function. By isolating this step from the main `quantForestError` function, Version 1.1.0 allows users to more efficiently iterate the algorithm. Users may wish to feed `quantForestError` batches of test observations iteratively if they have streaming data or a large test set that cannot be processed in one go due to memory constraints. In previous versions of this package, doing so would require the algorithm to recompute each training observation's out-of-bag terminal nodes in each iteration. This was redundant and costly. By separating this computation from the rest of the `quantForestError` algorithm, Version 1.1.0 allows the user to perform this computation only once. 15 | 16 | As part of this modularization, the `quantForestError` function now has two additional arguments. If set to `TRUE`, `return_train_nodes` will return a `data.table` identifying each training observation's out-of-bag terminal nodes. This `data.table` can then be fed back into `quantForestError` via the argument `train_nodes` to avoid the redundant recomputation. 17 | 18 | Version 1.1.0 also exports the function that produces the `data.table` identifying each training observation's out-of-bag terminal nodes. It is called `findOOBErrors`. Assuming the same inputs, `findOOBErrors` will produce the same output that is returned by setting `return_train_nodes` to `TRUE` in the `quantForestError` function. 19 | 20 | See the documentation on `quantForestError` and `findOOBErrors` for examples. 21 | 22 | Neither of these changes affects code that relied on Version 1.0.0 of this package, as the changes consist solely of a newly exported function, two optional arguments to `quantForestError` that by default do nothing new, and a new possible input for the `what` argument. 
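
The batched workflow described above can be sketched as follows. This is a minimal illustration rather than package code: it assumes `rf` is a random forest trained with `keep.inbag = TRUE`, `Xtrain` holds its training predictors, and `test_batches` is a hypothetical list of test-set chunks.

```r
library(forestError)

# identify each training observation's out-of-bag terminal
# nodes once, up front
train_nodes <- findOOBErrors(rf, Xtrain)

# process the test batches iteratively, passing train_nodes
# back in so that step is never recomputed
results <- lapply(test_batches, function(X.batch) {
  quantForestError(rf, Xtrain, X.batch,
                   what = c("mspe", "interval"),
                   train_nodes = train_nodes)
})
```

Equivalently, the first call to `quantForestError` can be made with `return_train_nodes = TRUE`, and its `train_nodes` element passed to the subsequent calls.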
23 | 24 | ## Overview 25 | 26 | The `forestError` package estimates conditional misclassification rates, conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions for random forest predictions using the plug-in method introduced in Lu and Hardin (2021). These estimates are conditional on the test observations' predictor values, accounting for possible response heterogeneity, random forest prediction bias, and random forest prediction variability across the predictor space. 27 | 28 | In its current state, the main function in this package accepts classification and regression random forests built using any of the following packages: 29 | 30 | - `randomForest`, 31 | - `randomForestSRC`, 32 | - `ranger`, and 33 | - `quantregForest`. 34 | 35 | ## Installation 36 | 37 | Running the following line of code in `R` will install a stable version of this package from CRAN: 38 | 39 | ```{r} 40 | install.packages("forestError") 41 | ``` 42 | 43 | To install the developer version of this package from GitHub, run the following lines of code in `R`: 44 | 45 | ```{r} 46 | library(devtools) 47 | devtools::install_github(repo = "benjilu/forestError") 48 | ``` 49 | 50 | ## Instructions 51 | See the documentation for detailed information on how to use this package. A regression example and a classification example are given below.
52 | 53 | ```{r} 54 | ######################## REGRESSION ######################## 55 | # load data 56 | data(airquality) 57 | 58 | # remove observations with missing predictor variable values 59 | airquality <- airquality[complete.cases(airquality), ] 60 | 61 | # get number of observations and the response column index 62 | n <- nrow(airquality) 63 | response.col <- 1 64 | 65 | # split data into training and test sets 66 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 67 | Xtrain <- airquality[train.ind, -response.col] 68 | Ytrain <- airquality[train.ind, response.col] 69 | Xtest <- airquality[-train.ind, -response.col] 70 | Ytest <- airquality[-train.ind, response.col] 71 | 72 | # fit random forest to the training data 73 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 74 | ntree = 500, keep.inbag = TRUE) 75 | 76 | # estimate conditional mean squared prediction errors, conditional 77 | # biases, conditional prediction intervals, and conditional error 78 | # distribution functions for the test observations 79 | output <- quantForestError(rf, Xtrain, Xtest) 80 | 81 | ######################## CLASSIFICATION ######################## 82 | # data-generating parameters 83 | train_samp_size <- 10000 84 | test_samp_size <- 5000 85 | p <- 5 86 | 87 | # generate binary data where the probability of success is a 88 | # linear function of the first predictor variable 89 | Xtrain <- data.frame(matrix(runif(train_samp_size * p), 90 | ncol = p)) 91 | Xtest <- data.frame(matrix(runif(test_samp_size * p), 92 | ncol = p)) 93 | Ytrain <- as.factor(rbinom(train_samp_size, 1, Xtrain$X1)) 94 | 95 | # fit random forest to training data 96 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3, 97 | ntree = 1000, keep.inbag = TRUE) 98 | 99 | # estimate conditional misclassification rate 100 | output <- quantForestError(rf, Xtrain, Xtest) 101 | 102 | # plot conditional misclassification rate against the signal 103 | plot(Xtest$X1, output$mcr, xlab = "X1", 
ylab = "Estimated Misclassification Rate") 105 | ``` 106 | 107 | ## License 108 | See `DESCRIPTION` for information. 109 | 110 | ## Authors 111 | Benjamin Lu and Johanna Hardin 112 | 113 | ## References 114 | * Benjamin Lu and Johanna Hardin. A Unified Framework for Random Forest Prediction Error Estimation. Journal of Machine Learning Research, 22(8):1-41, 2021. [[Link](https://jmlr.org/papers/v22/18-558.html)] 115 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## Submission 2 | This is an updated version of an existing package. 3 | 4 | ## Changes from previous submission 5 | Modularized the main function's steps by defining them as separate functions, 6 | one of which is now newly exported. Enabled estimation of an additional 7 | parameter through the main function. 8 | 9 | ## Test environments 10 | * local macOS R 4.1.0 11 | * win-builder (devel and release) 12 | 13 | ## R CMD check results 14 | There were no ERRORs, WARNINGs, or NOTEs. 15 | 16 | ## Downstream dependencies 17 | No downstream dependencies are negatively affected based on revdepcheck.
18 | -------------------------------------------------------------------------------- /forestError.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /forestError_1.1.0.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/benjilu/forestError/5d9c4b2d01e2ad65bdedf1bdd499ba342e27be8d/forestError_1.1.0.pdf -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | citHeader("To cite forestError in publications, please use:") 2 | 3 | citEntry(entry = "Article", 4 | title = "A Unified Framework for Random Forest Prediction Error Estimation", 5 | author = personList(as.person("Benjamin Lu"), 6 | as.person("Johanna Hardin")), 7 | journal = "Journal of Machine Learning Research", 8 | year = "2021", 9 | volume="22", 10 | number="8", 11 | pages="1--41", 12 | 13 | textVersion = paste("Benjamin Lu and Johanna Hardin.", 14 | "A Unified Framework for Random Forest Prediction Error Estimation.", 15 | "Journal of Machine Learning Research, 22(8):1-41, 2021.") 16 | ) 17 | -------------------------------------------------------------------------------- /man/estimateErrorParams.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in
R/estimateerrorparams.R 3 | \name{estimateErrorParams} 4 | \alias{estimateErrorParams} 5 | \title{Estimate prediction error distribution parameters} 6 | \usage{ 7 | estimateErrorParams( 8 | train_nodes, 9 | test_nodes, 10 | mspewhat, 11 | biaswhat, 12 | intervalwhat, 13 | pwhat, 14 | qwhat, 15 | mcrwhat, 16 | alpha, 17 | n.test 18 | ) 19 | } 20 | \arguments{ 21 | \item{train_nodes}{A \code{data.table} indicating which out-of-bag prediction 22 | errors are in each terminal node of each tree in the random forest. It 23 | must be formatted like the output of the \code{findOOBErrors} function.} 24 | 25 | \item{test_nodes}{A \code{data.table} indicating which test observations are 26 | in each terminal node of each tree in the random forest. It must be 27 | formatted like the output of the \code{findTestPreds} function.} 28 | 29 | \item{mspewhat}{A boolean indicating whether to estimate conditional MSPE.} 30 | 31 | \item{biaswhat}{A boolean indicating whether to estimate conditional bias.} 32 | 33 | \item{intervalwhat}{A boolean indicating whether to estimate conditional 34 | prediction intervals.} 35 | 36 | \item{pwhat}{A boolean indicating whether to estimate the conditional 37 | prediction error CDFs.} 38 | 39 | \item{qwhat}{A boolean indicating whether to estimate the conditional 40 | prediction error quantile functions.} 41 | 42 | \item{mcrwhat}{A boolean indicating whether to estimate the conditional 43 | misclassification rate.} 44 | 45 | \item{alpha}{A vector of type-I error rates desired for the conditional prediction 46 | intervals; required if \code{intervalwhat} is \code{TRUE}.} 47 | 48 | \item{n.test}{The number of test observations.} 49 | } 50 | \value{ 51 | A \code{data.frame} with one or more of the following columns: 52 | 53 | \item{pred}{The random forest predictions of the test observations} 54 | \item{mspe}{The estimated conditional mean squared prediction errors of 55 | the random forest predictions} 56 | \item{bias}{The estimated conditional biases 
of the random forest 57 | predictions} 58 | \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 59 | prediction intervals for the test observations} 60 | \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 61 | prediction intervals for the test observations} 62 | \item{mcr}{The estimated conditional misclassification rate of the random 63 | forest predictions} 64 | 65 | In addition, one or both of the following functions: 66 | 67 | \item{perror}{The estimated cumulative distribution functions of the 68 | conditional error distributions associated with the test predictions} 69 | \item{qerror}{The estimated quantile functions of the conditional error 70 | distributions associated with the test predictions} 71 | } 72 | \description{ 73 | Estimates the prediction error distribution parameters requested in the input 74 | to \code{quantForestError}. 75 | } 76 | \details{ 77 | This function is for internal use. 78 | } 79 | \seealso{ 80 | \code{\link{quantForestError}}, \code{\link{findOOBErrors}}, \code{\link{findTestPreds}} 81 | } 82 | \author{ 83 | Benjamin Lu \code{}; Johanna Hardin \code{} 84 | } 85 | \keyword{internal} 86 | -------------------------------------------------------------------------------- /man/findOOBErrors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/findooberrors.R 3 | \name{findOOBErrors} 4 | \alias{findOOBErrors} 5 | \title{Compute and locate out-of-bag prediction errors} 6 | \usage{ 7 | findOOBErrors(forest, X.train, Y.train = NULL, n.cores = 1) 8 | } 9 | \arguments{ 10 | \item{forest}{The random forest object being used for prediction.} 11 | 12 | \item{X.train}{A \code{matrix} or \code{data.frame} with the observations 13 | that were used to train \code{forest}. 
Each row should be an observation, 14 | and each column should be a predictor variable.} 15 | 16 | \item{Y.train}{A vector of the responses of the observations that were used 17 | to train \code{forest}. Required if \code{forest} was created using 18 | \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 19 | \code{randomForestSRC}, or \code{quantregForest}.} 20 | 21 | \item{n.cores}{Number of cores to use (for parallel computation in \code{ranger}).} 22 | } 23 | \value{ 24 | A \code{data.table} with the following three columns: 25 | 26 | \item{tree}{The ID of the tree of the random forest} 27 | \item{terminal_node}{The ID of the terminal node of the tree} 28 | \item{node_errs}{A vector of the out-of-bag prediction errors that fall 29 | within the terminal node of the tree} 30 | } 31 | \description{ 32 | Computes each training observation's out-of-bag prediction error using the 33 | random forest and, for each tree for which the training observation is 34 | out of bag, finds the terminal node of the tree in which the training 35 | observation falls. 36 | } 37 | \details{ 38 | This function accepts classification or regression random forests built using 39 | the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 40 | \code{quantregForest} packages. When training the random forest using 41 | \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 42 | must be set to \code{TRUE}. When training the random forest using 43 | \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 
44 | } 45 | \examples{ 46 | # load data 47 | data(airquality) 48 | 49 | # remove observations with missing predictor variable values 50 | airquality <- airquality[complete.cases(airquality), ] 51 | 52 | # get number of observations and the response column index 53 | n <- nrow(airquality) 54 | response.col <- 1 55 | 56 | # split data into training and test sets 57 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 58 | Xtrain <- airquality[train.ind, -response.col] 59 | Ytrain <- airquality[train.ind, response.col] 60 | Xtest <- airquality[-train.ind, -response.col] 61 | 62 | # fit random forest to the training data 63 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 64 | ntree = 500, keep.inbag = TRUE) 65 | 66 | # compute out-of-bag prediction errors and locate each 67 | # training observation in the trees for which it is out 68 | # of bag 69 | train_nodes <- findOOBErrors(rf, Xtrain) 70 | 71 | # estimate conditional mean squared prediction errors, 72 | # biases, prediction intervals, and error distribution 73 | # functions for the test observations. provide 74 | # train_nodes to avoid recomputing that step. 
75 | output <- quantForestError(rf, Xtrain, Xtest, 76 | train_nodes = train_nodes) 77 | 78 | } 79 | \seealso{ 80 | \code{\link{quantForestError}} 81 | } 82 | \author{ 83 | Benjamin Lu \code{}; Johanna Hardin \code{} 84 | } 85 | -------------------------------------------------------------------------------- /man/findTestPreds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/findtestpreds.R 3 | \name{findTestPreds} 4 | \alias{findTestPreds} 5 | \title{Compute and locate test predictions} 6 | \usage{ 7 | findTestPreds(forest, X.test, n.cores = 1) 8 | } 9 | \arguments{ 10 | \item{forest}{The random forest object being used for prediction.} 11 | 12 | \item{X.test}{A \code{matrix} or \code{data.frame} with the observations to 13 | be predicted. Each row should be an observation, and each column should be 14 | a predictor variable.} 15 | 16 | \item{n.cores}{Number of cores to use (for parallel computation in \code{ranger}).} 17 | } 18 | \value{ 19 | A \code{data.table} with the following four columns: 20 | 21 | \item{rowid_test}{The row ID of the test observation as provided by \code{X.test}} 22 | \item{pred}{The random forest prediction of the test observation} 23 | \item{tree}{The ID of the tree of the random forest} 24 | \item{terminal_node}{The ID of the terminal node of the tree in which the 25 | test observation falls} 26 | } 27 | \description{ 28 | Predicts each test observation's response using the random forest and, for 29 | each test observation and tree, finds the terminal node of the tree in which 30 | the test observation falls. 31 | } 32 | \details{ 33 | This function accepts regression random forests built using the \code{randomForest}, 34 | \code{ranger}, \code{randomForestSRC}, and \code{quantregForest} packages. 
35 | } 36 | \seealso{ 37 | \code{\link{findOOBErrors}}, \code{\link{quantForestError}} 38 | } 39 | \author{ 40 | Benjamin Lu \code{}; Johanna Hardin \code{} 41 | } 42 | \keyword{internal} 43 | -------------------------------------------------------------------------------- /man/perror.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/perror.R 3 | \name{perror} 4 | \alias{perror} 5 | \title{Estimated conditional prediction error CDFs} 6 | \usage{ 7 | perror(q, xs) 8 | } 9 | \arguments{ 10 | \item{q}{A vector of quantiles.} 11 | 12 | \item{xs}{A vector of the indices of the test observations for which the 13 | conditional error CDFs are desired. Defaults to all test observations 14 | given in the call to \code{quantForestError}.} 15 | } 16 | \value{ 17 | If either \code{q} or \code{xs} has length one, then a vector is 18 | returned with the desired probabilities. If both have length greater than 19 | one, then a \code{data.frame} of probabilities is returned, with rows 20 | corresponding to the provided \code{xs} and columns corresponding to the 21 | provided \code{q}. 22 | } 23 | \description{ 24 | Returns probabilities from the estimated conditional cumulative distribution 25 | function of the prediction error associated with each test observation. 26 | } 27 | \details{ 28 | This function is only defined as output of the \code{quantForestError} function. 29 | It is not exported as a standalone function. See the example.
30 | } 31 | \examples{ 32 | # load data 33 | data(airquality) 34 | 35 | # remove observations with missing predictor variable values 36 | airquality <- airquality[complete.cases(airquality), ] 37 | 38 | # get number of observations and the response column index 39 | n <- nrow(airquality) 40 | response.col <- 1 41 | 42 | # split data into training and test sets 43 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 44 | Xtrain <- airquality[train.ind, -response.col] 45 | Ytrain <- airquality[train.ind, response.col] 46 | Xtest <- airquality[-train.ind, -response.col] 47 | Ytest <- airquality[-train.ind, response.col] 48 | 49 | # fit random forest to the training data 50 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 51 | ntree = 500, 52 | keep.inbag = TRUE) 53 | 54 | # estimate conditional error distribution functions 55 | output <- quantForestError(rf, Xtrain, Xtest, 56 | what = c("p.error", "q.error")) 57 | 58 | # get the probability that the error associated with each test 59 | # prediction is less than -4 and the probability that the error 60 | # associated with each test prediction is less than 7 61 | output$perror(c(-4, 7)) 62 | 63 | # same as above but only for the first three test observations 64 | output$perror(c(-4, 7), 1:3) 65 | } 66 | \seealso{ 67 | \code{\link{quantForestError}} 68 | } 69 | \author{ 70 | Benjamin Lu \code{}; Johanna Hardin \code{} 71 | } 72 | -------------------------------------------------------------------------------- /man/qerror.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/qerror.R 3 | \name{qerror} 4 | \alias{qerror} 5 | \title{Estimated conditional prediction error quantile functions} 6 | \usage{ 7 | qerror(p, xs) 8 | } 9 | \arguments{ 10 | \item{p}{A vector of probabilities.} 11 | 12 | \item{xs}{A vector of the indices of the test observations for which the 13 | conditional error 
quantiles are desired. Defaults to all test observations 14 | given in the call to \code{quantForestError}.} 15 | } 16 | \value{ 17 | If either \code{p} or \code{xs} has length one, then a vector is 18 | returned with the desired quantiles. If both have length greater than 19 | one, then a \code{data.frame} of quantiles is returned, with rows 20 | corresponding to the provided \code{xs} and columns corresponding to the 21 | provided \code{p}. 22 | } 23 | \description{ 24 | Returns quantiles of the estimated conditional error distribution associated 25 | with each test prediction. 26 | } 27 | \details{ 28 | This function is only defined as output of the \code{quantForestError} function. 29 | It is not exported as a standalone function. See the example. 30 | } 31 | \examples{ 32 | # load data 33 | data(airquality) 34 | 35 | # remove observations with missing predictor variable values 36 | airquality <- airquality[complete.cases(airquality), ] 37 | 38 | # get number of observations and the response column index 39 | n <- nrow(airquality) 40 | response.col <- 1 41 | 42 | # split data into training and test sets 43 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 44 | Xtrain <- airquality[train.ind, -response.col] 45 | Ytrain <- airquality[train.ind, response.col] 46 | Xtest <- airquality[-train.ind, -response.col] 47 | Ytest <- airquality[-train.ind, response.col] 48 | 49 | # fit random forest to the training data 50 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 51 | ntree = 500, 52 | keep.inbag = TRUE) 53 | 54 | # estimate conditional error distribution functions 55 | output <- quantForestError(rf, Xtrain, Xtest, 56 | what = c("p.error", "q.error")) 57 | 58 | # get the 0.25 and 0.8 quantiles of the error distribution 59 | # associated with each test prediction 60 | output$qerror(c(0.25, 0.8)) 61 | 62 | # same as above but only for the first three test observations 63 | output$qerror(c(0.25, 0.8), 1:3) 64 | } 65 | \seealso{ 66 |
\code{\link{quantForestError}} 67 | } 68 | \author{ 69 | Benjamin Lu \code{}; Johanna Hardin \code{} 70 | } 71 | -------------------------------------------------------------------------------- /man/quantForestError.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/quantforesterror.R 3 | \name{quantForestError} 4 | \alias{quantForestError} 5 | \alias{forestError} 6 | \title{Quantify random forest prediction error} 7 | \usage{ 8 | quantForestError( 9 | forest, 10 | X.train, 11 | X.test, 12 | Y.train = NULL, 13 | what = if (grepl("class", c(forest$type, forest$family, forest$treetype), TRUE)) 14 | "mcr" else c("mspe", "bias", "interval", "p.error", "q.error"), 15 | alpha = 0.05, 16 | train_nodes = NULL, 17 | return_train_nodes = FALSE, 18 | n.cores = 1 19 | ) 20 | } 21 | \arguments{ 22 | \item{forest}{The random forest object being used for prediction.} 23 | 24 | \item{X.train}{A \code{matrix} or \code{data.frame} with the observations 25 | that were used to train \code{forest}. Each row should be an observation, 26 | and each column should be a predictor variable.} 27 | 28 | \item{X.test}{A \code{matrix} or \code{data.frame} with the observations to 29 | be predicted; each row should be an observation, and each column should be 30 | a predictor variable.} 31 | 32 | \item{Y.train}{A vector of the responses of the observations that were used 33 | to train \code{forest}. Required if \code{forest} was created using 34 | \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 35 | \code{randomForestSRC}, or \code{quantregForest}.} 36 | 37 | \item{what}{A character vector indicating what estimates are desired.
38 | Possible options are conditional mean squared prediction errors (\code{"mspe"}), 39 | conditional biases (\code{"bias"}), conditional prediction intervals (\code{"interval"}), 40 | conditional error distribution functions (\code{"p.error"}), conditional 41 | error quantile functions (\code{"q.error"}), and conditional 42 | misclassification rate (\code{"mcr"}). Note that the conditional 43 | misclassification rate is available only for categorical outcomes, while 44 | the other parameters are available only for real-valued outcomes.} 45 | 46 | \item{alpha}{A vector of type-I error rates desired for the conditional prediction 47 | intervals; required if \code{"interval"} is included in \code{what}.} 48 | 49 | \item{train_nodes}{A \code{data.table} indicating what out-of-bag prediction 50 | errors each terminal node of each tree in \code{forest} contains. It should 51 | be formatted like the output of \code{findOOBErrors}. If not provided, 52 | it will be computed internally.} 53 | 54 | \item{return_train_nodes}{A boolean indicating whether to return the 55 | \code{train_nodes} computed and/or used.} 56 | 57 | \item{n.cores}{Number of cores to use (for parallel computation in \code{ranger}).} 58 | } 59 | \value{ 60 | A \code{data.frame} with one or more of the following columns, as described 61 | in the details section: 62 | 63 | \item{pred}{The random forest predictions of the test observations} 64 | \item{mspe}{The estimated conditional mean squared prediction errors of 65 | the random forest predictions} 66 | \item{bias}{The estimated conditional biases of the random forest 67 | predictions} 68 | \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 69 | prediction intervals for the test observations} 70 | \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 71 | prediction intervals for the test observations} 72 | \item{mcr}{The estimated conditional misclassification rate of the random 73 | forest predictions} 
74 | 75 | In addition, one or both of the following functions, as described in the 76 | details section: 77 | 78 | \item{perror}{The estimated cumulative distribution functions of the 79 | conditional error distributions associated with the test predictions} 80 | \item{qerror}{The estimated quantile functions of the conditional error 81 | distributions associated with the test predictions} 82 | 83 | In addition, if \code{return_train_nodes} is \code{TRUE}, then a \code{data.table} 84 | called \code{train_nodes} is also returned, indicating what out-of-bag prediction 85 | errors each terminal node of each tree in \code{forest} contains. 86 | } 87 | \description{ 88 | Estimates the conditional misclassification rates, conditional mean squared 89 | prediction errors, conditional biases, conditional prediction intervals, and 90 | conditional error distributions of random forest predictions. 91 | } 92 | \details{ 93 | This function accepts classification or regression random forests built using 94 | the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 95 | \code{quantregForest} packages. When training the random forest using 96 | \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 97 | must be set to \code{TRUE}. When training the random forest using 98 | \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 99 | 100 | The predictions computed by \code{ranger} can be parallelized by setting the 101 | value of \code{n.cores} to be greater than 1. 102 | 103 | The random forest predictions are always returned as a \code{data.frame}. Additional 104 | columns are included in the \code{data.frame} depending on the user's selections in 105 | the argument \code{what}.
In particular, including \code{"mspe"} in \code{what} 106 | will add an additional column with the conditional mean squared prediction 107 | error of each test prediction to the \code{data.frame}; including \code{"bias"} in 108 | \code{what} will add an additional column with the conditional bias of each test 109 | prediction to the \code{data.frame}; including \code{"interval"} in \code{what} 110 | will add to the \code{data.frame} additional columns with the lower and 111 | upper bounds of conditional prediction intervals for each test prediction; 112 | and including \code{"mcr"} in \code{what} will add an additional column with 113 | the conditional misclassification rate of each test prediction to the 114 | \code{data.frame}. The conditional misclassification rate can be estimated 115 | only for classification random forests, while the other parameters can be 116 | estimated only for regression random forests. 117 | 118 | If \code{"p.error"} or \code{"q.error"} is included in \code{what}, or if 119 | \code{return_train_nodes} is set to \code{TRUE}, then a list will be returned 120 | as output. The first element of the list, named \code{"estimates"}, is the 121 | \code{data.frame} described in the above paragraph. The other elements of the 122 | list are the estimated cumulative distribution functions (\code{perror}) of 123 | the conditional error distributions, the estimated quantile functions 124 | (\code{qerror}) of the conditional error distributions, and/or a \code{data.table} 125 | indicating what out-of-bag prediction errors each terminal node of each tree 126 | in the random forest contains. 
127 | } 128 | \examples{ 129 | # load data 130 | data(airquality) 131 | 132 | # remove observations with missing predictor variable values 133 | airquality <- airquality[complete.cases(airquality), ] 134 | 135 | # get number of observations and the response column index 136 | n <- nrow(airquality) 137 | response.col <- 1 138 | 139 | # split data into training and test sets 140 | train.ind <- sample(c("A", "B", "C"), n, 141 | replace = TRUE, prob = c(0.8, 0.1, 0.1)) 142 | Xtrain <- airquality[train.ind == "A", -response.col] 143 | Ytrain <- airquality[train.ind == "A", response.col] 144 | Xtest1 <- airquality[train.ind == "B", -response.col] 145 | Xtest2 <- airquality[train.ind == "C", -response.col] 146 | 147 | # fit regression random forest to the training data 148 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 149 | ntree = 500, 150 | keep.inbag = TRUE) 151 | 152 | # estimate conditional mean squared prediction errors, 153 | # biases, prediction intervals, and error distribution 154 | # functions for the observations in Xtest1. return 155 | # train_nodes to avoid recomputation in the next 156 | # line of code. 157 | output1 <- quantForestError(rf, Xtrain, Xtest1, 158 | return_train_nodes = TRUE) 159 | 160 | # estimate just the conditional mean squared prediction errors 161 | # and prediction intervals for the observations in Xtest2. 162 | # avoid recomputation by providing train_nodes from the 163 | # previous line of code. 
164 | output2 <- quantForestError(rf, Xtrain, Xtest2, 165 | what = c("mspe", "interval"), 166 | train_nodes = output1$train_nodes) 167 | 168 | # for illustrative purposes, convert response to categorical 169 | Ytrain <- as.factor(Ytrain > 31.5) 170 | 171 | # fit classification random forest to the training data 172 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3, 173 | ntree = 500, 174 | keep.inbag = TRUE) 175 | 176 | # estimate conditional misclassification rate of the 177 | # predictions of Xtest1 178 | output <- quantForestError(rf, Xtrain, Xtest1) 179 | 180 | } 181 | \seealso{ 182 | \code{\link{perror}}, \code{\link{qerror}}, \code{\link{findOOBErrors}} 183 | } 184 | \author{ 185 | Benjamin Lu \code{}; Johanna Hardin \code{} 186 | } 187 | --------------------------------------------------------------------------------
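The `perror` and `qerror` functions documented above are built on out-of-bag weighting of out-of-bag prediction errors (Lu and Hardin, 2021). The sketch below illustrates that idea in self-contained base R for a single test observation; it is not the package's implementation (which derives the weights from the terminal nodes each test observation shares with the out-of-bag training observations), and the error values and co-occurrence counts are made up for illustration.

```r
# Illustrative base-R sketch of the out-of-bag weighting idea behind
# perror() and qerror(). Hypothetical data: pooled out-of-bag errors
# for one test observation, weighted by made-up terminal-node
# co-occurrence counts.
oob_errs <- c(-3.2, -1.1, 0.4, 0.9, 2.5, 4.0)
cooccur  <- c(1, 2, 3, 2, 1, 1)
w        <- cooccur / sum(cooccur)  # normalize weights to sum to 1

# weighted empirical CDF of the conditional error distribution
perror_sketch <- function(q) {
  vapply(q, function(x) sum(w[oob_errs <= x]), numeric(1))
}

# weighted quantile function: the smallest error whose cumulative
# weight reaches the requested probability
qerror_sketch <- function(p) {
  ord <- order(oob_errs)
  cum <- cumsum(w[ord])
  vapply(p, function(pr) oob_errs[ord][which(cum >= pr)[1]], numeric(1))
}

perror_sketch(c(-4, 0, 7))   # estimated P(error <= q) at each q
qerror_sketch(c(0.25, 0.8))  # estimated error quantiles
```

In the package itself, the weights come from counting, across trees, how often each training observation is out of bag in a terminal node shared with the test observation; subtracting these error quantiles from a test prediction is also what yields the `lower_alpha`/`upper_alpha` prediction-interval bounds.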