├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── NAMESPACE
├── NEWS.md
├── R
│   ├── checkargs.R
│   ├── estimateerrorparams.R
│   ├── findooberrors.R
│   ├── findtestpreds.R
│   ├── perror.R
│   ├── qerror.R
│   └── quantforesterror.R
├── README.md
├── cran-comments.md
├── forestError.Rproj
├── forestError_1.1.0.pdf
├── inst
│   └── CITATION
└── man
    ├── estimateErrorParams.Rd
    ├── findOOBErrors.Rd
    ├── findTestPreds.Rd
    ├── perror.Rd
    ├── qerror.Rd
    └── quantForestError.Rd
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | ^README\.Rmd$
4 | ^cran-comments\.md$
5 | .git$
6 | .gitignore$
7 | ^CRAN-RELEASE$
8 | ^NEWS\.Rmd$
9 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | .DS_Store
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: forestError
2 | Type: Package
3 | Title: A Unified Framework for Random Forest Prediction Error Estimation
4 | Version: 1.1.0
5 | Author: Benjamin Lu and Johanna Hardin
6 | Maintainer: Benjamin Lu
7 | Description: Estimates the conditional error distributions of random forest
8 |     predictions and common parameters of those distributions, including
9 |     conditional misclassification rates, conditional mean squared prediction
10 |     errors, conditional biases, and conditional quantiles, by out-of-bag
11 |     weighting of out-of-bag prediction errors as proposed by Lu and Hardin
12 |     (2021). This package is compatible with several existing packages that
13 |     implement random forests in R.
14 | Imports:
15 |     data.table,
16 |     purrr
17 | Suggests:
18 |     randomForest
19 | License: GPL-3
20 | Encoding: UTF-8
21 | RoxygenNote: 7.1.1
22 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 | export(findOOBErrors)
4 | export(quantForestError)
5 | import(data.table)
6 | importFrom(purrr,map_dbl)
7 | importFrom(stats,predict)
8 |
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 | ## Version 1.1.0 Update
2 |
3 | Version 1.1.0 makes two changes. First, it enables estimation of the conditional misclassification rate of predictions by classification random forests, as proposed by Lu and Hardin (2021). Second, it compartmentalizes a costly step in the `quantForestError` algorithm: the identification of each training observation's out-of-bag terminal nodes.
4 |
5 | ### Conditional Misclassification Rate Estimation
6 |
7 | The conditional misclassification rate of predictions by classification random forests can now be estimated. To estimate it, simply set the `what` argument in the `quantForestError` function to `"mcr"`. `what` will default to this if the provided `forest` is a classification random forest. See the example code in the README for a toy demonstration of the performance of this estimator.
8 |
9 | ### Compartmentalization
10 |
11 | The identification of each training observation's out-of-bag terminal nodes is now compartmentalized from the main `quantForestError` function. By isolating this step, Version 1.1.0 allows users to iterate the algorithm more efficiently.
Users may wish to feed `quantForestError` batches of test observations iteratively if they have streaming data or a test set too large to process in one go due to memory constraints. In previous versions of this package, doing so required the algorithm to recompute each training observation's out-of-bag terminal nodes in each iteration, which was redundant and costly. By separating this computation from the rest of the `quantForestError` algorithm, Version 1.1.0 allows the user to perform it only once.
12 |
13 | As part of this modularization, the `quantForestError` function now has two additional arguments. If `return_train_nodes` is set to `TRUE`, `quantForestError` will return a `data.table` identifying each training observation's out-of-bag terminal nodes. This `data.table` can then be fed back into `quantForestError` via the `train_nodes` argument to avoid the redundant recomputation.
14 |
15 | Version 1.1.0 also exports the function that produces this `data.table`: `findOOBErrors`. Given the same inputs, `findOOBErrors` produces the same output that is returned by setting `return_train_nodes` to `TRUE` in the `quantForestError` function.
16 |
17 | See the documentation on `quantForestError` and `findOOBErrors` for examples.
18 |
19 | Neither of these changes affects code that relied on Version 1.0.0 of this package: the changes consist solely of a newly exported function, two optional arguments to `quantForestError` that by default do nothing new, and a new possible input for the `what` argument.
20 |
21 | ## Version 1.0.0 Update
22 |
23 | This package has been updated to reflect the conventional sign of bias (mean prediction minus mean response). Previous versions of the package returned the negative bias (mean response minus mean prediction).
The sign of any algebraic operations involving the bias reported by this package must therefore be reversed to preserve their intended effect.
24 |
25 | In the future, we hope to implement a stochastic version of the `quantForestError` function, in which the parameters are estimated from random subsets of the training sample and/or the trees of the random forest.
26 |
27 | ## Version 0.2.0 Update
28 |
29 | Thanks to John Sheffield ([GitHub Profile](https://github.com/sheffe)) for his helpful improvements to the computational performance of this package. (See the [Issue Tracker](https://github.com/benjilu/forestError/issues/2) for details.) These changes, which substantially reduce the runtime and memory load of this package's `quantForestError`, `perror`, and `qerror` functions, have been implemented in Version 0.2.0.
30 |
31 | Version 0.2.0 also allows the user to generate conditional prediction intervals with different type-I error rates in a single call of the `quantForestError` function.
32 |
--------------------------------------------------------------------------------
/R/checkargs.R:
--------------------------------------------------------------------------------
1 | # check forest argument for problems 2 | checkForest <- function(forest) { 3 | if (typeof(forest) != "list") { 4 | stop("'forest' is not of the correct type") 5 | } else if (!any(c("randomForest", "ranger", "rfsrc", "quantregForest") %in% class(forest))) { 6 | stop("'forest' is not of the correct class") 7 | } else if (is.null(forest$inbag)) { 8 | stop("'forest' does not have record of which training observations are in bag for each tree.
Re-fit the random forest with argument keep.inbag = TRUE") 9 | } 10 | } 11 | 12 | # check training and test covariate arguments for problems 13 | checkXtrainXtest <- function(X.train, X.test) { 14 | if (length(dim(X.train)) != 2) { 15 | stop("'X.train' must be a matrix or data.frame of dimension 2") 16 | } else if (length(dim(X.test)) != 2) { 17 | stop("'X.test' must be a matrix or data.frame of dimension 2") 18 | } else if (ncol(X.train) != ncol(X.test)) { 19 | stop("'X.train' and 'X.test' must have the same predictor variables") 20 | } 21 | } 22 | 23 | # check training response argument for problems 24 | checkYtrain <- function(forest, Y.train, n.train) { 25 | if ("ranger" %in% class(forest)) { 26 | if (is.null(Y.train)) { 27 | stop("You must supply the training responses (Y.train)") 28 | } else if (length(Y.train) != n.train) { 29 | stop("Number of training responses does not match number of training observations") 30 | } 31 | } 32 | } 33 | 34 | # check type-I error rate argument for problems 35 | checkAlpha <- function(alpha) { 36 | if (typeof(alpha) != "double") { 37 | stop("'alpha' must be of type double") 38 | } else if (any(alpha <= 0 | alpha >= 1)) { 39 | stop("'alpha' must be in (0, 1)") 40 | } 41 | } 42 | 43 | # check index argument in perror and qerror functions for problems 44 | checkxs <- function(xs, n.test) { 45 | if (max(xs) > n.test | min(xs) < 1) { 46 | stop("Test indices are out of bounds") 47 | } else if (any(xs %% 1 != 0)) { 48 | stop("Test indices must be whole numbers") 49 | } 50 | } 51 | 52 | # check probability argument in qerror for problems 53 | checkps <- function(p) { 54 | if (max(p) > 1 | min(p) < 0) { 55 | stop("Probabilities must be between 0 and 1") 56 | } 57 | } 58 | 59 | # check core argument for problems 60 | checkcores <- function(n.cores) { 61 | if (is.null(n.cores)) { 62 | stop("Number of cores must be specified") 63 | } else if (n.cores < 1) { 64 | stop("Number of cores must be at least 1") 65 | } else if (n.cores %% 1 != 0) 
{ 66 | stop("Number of cores must be an integer") 67 | } 68 | } 69 | 70 | # check requested parameters 71 | checkwhat <- function(what, forest) { 72 | if (is.null(what)) { 73 | stop("Please specify the parameters to be estimated") 74 | } else if ("mcr" %in% what & length(what) > 2) { 75 | stop("Misclassification rate cannot be estimated for real-valued responses") 76 | } else if ("mcr" %in% what) { 77 | if ("quantregForest" %in% class(forest)) { 78 | if (forest$type != "classification") { 79 | stop("Misclassification rate can be estimated only for classification random forests") 80 | } 81 | } else if ("randomForest" %in% class(forest)) { 82 | if (forest$type != "classification") { 83 | stop("Misclassification rate can be estimated only for classification random forests") 84 | } 85 | } else if ("rfsrc" %in% class(forest)) { 86 | if (forest$family != "class") { 87 | stop("Misclassification rate can be estimated only for classification random forests") 88 | } 89 | } else if ("ranger" %in% class(forest)) { 90 | if (forest$treetype != "Classification") { 91 | stop("Misclassification rate can be estimated only for classification random forests") 92 | } 93 | } 94 | } else if (any(c("mspe", "bias", "interval", "p.error", "q.error") %in% what)) { 95 | if ("quantregForest" %in% class(forest)) { 96 | if (forest$type == "classification") { 97 | stop("Requested parameters cannot be estimated for classification random forests") 98 | } 99 | } else if ("randomForest" %in% class(forest)) { 100 | if (forest$type == "classification") { 101 | stop("Requested parameters cannot be estimated for classification random forests") 102 | } 103 | } else if ("rfsrc" %in% class(forest)) { 104 | if (forest$family == "class") { 105 | stop("Requested parameters cannot be estimated for classification random forests") 106 | } 107 | } else if ("ranger" %in% class(forest)) { 108 | if (forest$treetype == "Classification") { 109 | stop("Requested parameters cannot be estimated for classification random
forests") 110 | } 111 | } 112 | } 113 | } 114 | -------------------------------------------------------------------------------- /R/estimateerrorparams.R: -------------------------------------------------------------------------------- 1 | #' Estimate prediction error distribution parameters 2 | #' 3 | #' Estimates the prediction error distribution parameters requested in the input 4 | #' to \code{quantForestError}. 5 | #' 6 | #' This function is for internal use. 7 | #' 8 | #' @param train_nodes A \code{data.table} indicating which out-of-bag prediction 9 | #' errors are in each terminal node of each tree in the random forest. It 10 | #' must be formatted like the output of the \code{findOOBErrors} function. 11 | #' @param test_nodes A \code{data.table} indicating which test observations are 12 | #' in each terminal node of each tree in the random forest. It must be 13 | #' formatted like the output of the \code{findTestPreds} function. 14 | #' @param mspewhat A boolean indicating whether to estimate conditional MSPE. 15 | #' @param biaswhat A boolean indicating whether to estimate conditional bias. 16 | #' @param intervalwhat A boolean indicating whether to estimate conditional 17 | #' prediction intervals. 18 | #' @param pwhat A boolean indicating whether to estimate the conditional 19 | #' prediction error CDFs. 20 | #' @param qwhat A boolean indicating whether to estimate the conditional 21 | #' prediction error quantile functions. 22 | #' @param mcrwhat A boolean indicating whether to estimate the conditional 23 | #' misclassification rate. 24 | #' @param alpha A vector of type-I error rates desired for the conditional prediction 25 | #' intervals; required if \code{intervalwhat} is \code{TRUE}. 26 | #' @param n.test The number of test observations. 
27 | #' 28 | #' @return A \code{data.frame} with one or more of the following columns: 29 | #' 30 | #' \item{pred}{The random forest predictions of the test observations} 31 | #' \item{mspe}{The estimated conditional mean squared prediction errors of 32 | #' the random forest predictions} 33 | #' \item{bias}{The estimated conditional biases of the random forest 34 | #' predictions} 35 | #' \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 36 | #' prediction intervals for the test observations} 37 | #' \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 38 | #' prediction intervals for the test observations} 39 | #' \item{mcr}{The estimated conditional misclassification rate of the random 40 | #' forest predictions} 41 | #' 42 | #' In addition, one or both of the following functions: 43 | #' 44 | #' \item{perror}{The estimated cumulative distribution functions of the 45 | #' conditional error distributions associated with the test predictions} 46 | #' \item{qerror}{The estimated quantile functions of the conditional error 47 | #' distributions associated with the test predictions} 48 | #' 49 | #' @seealso \code{\link{quantForestError}}, \code{\link{findOOBErrors}}, \code{\link{findTestPreds}} 50 | #' 51 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 52 | #' 53 | #' @import data.table 54 | #' @importFrom purrr map_dbl 55 | #' @keywords internal 56 | estimateErrorParams <- function(train_nodes, test_nodes, mspewhat, biaswhat, intervalwhat, pwhat, qwhat, mcrwhat, alpha, n.test) { 57 | 58 | # set columns to return 59 | col_names <- c("pred", "mspe", "bias", "mcr")[c(TRUE, mspewhat, biaswhat, mcrwhat)] 60 | 61 | # initialize overall output 62 | output <- NULL 63 | 64 | # if the user wants misclassification rate 65 | if (mcrwhat) { 66 | # join the train and test edgelists by tree/node and compute relevant summary 67 | # statistics via chaining 68 | oob_error_stats <- 69 | train_nodes[test_nodes, 70 | .(tree, 
terminal_node, rowid_test, pred, node_errs)][, 71 | .(mcr = mean(unlist(node_errs))), 72 | keyby = c("rowid_test", "pred")] 73 | # else if the user does not want intervals, p.error, or q.error 74 | } else if (!intervalwhat & !pwhat & !qwhat) { 75 | 76 | # produce whichever of mspe and bias the user wants 77 | if (mspewhat & !biaswhat) { 78 | # join the train and test edgelists by tree/node and compute relevant summary 79 | # statistics via chaining 80 | oob_error_stats <- 81 | train_nodes[test_nodes, 82 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 83 | .(mspe = mean(unlist(node_errs) ^ 2)), 84 | keyby = c("rowid_test", "pred")] 85 | } else if (!mspewhat & biaswhat) { 86 | # join the train and test edgelists by tree/node and compute relevant summary 87 | # statistics via chaining 88 | oob_error_stats <- 89 | train_nodes[test_nodes, 90 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 91 | .(bias = -mean(unlist(node_errs))), 92 | keyby = c("rowid_test", "pred")] 93 | } else { 94 | # join the train and test edgelists by tree/node and compute relevant summary 95 | # statistics via chaining 96 | oob_error_stats <- 97 | train_nodes[test_nodes, 98 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 99 | .(mspe = mean(unlist(node_errs) ^ 2), 100 | bias = -mean(unlist(node_errs))), 101 | keyby = c("rowid_test", "pred")] 102 | } 103 | 104 | # else, do the same as above but keep the full list of OOB cohabitant prediction errors 105 | } else { 106 | 107 | # produce whichever of mspe and bias the user wants 108 | if (mspewhat & !biaswhat) { 109 | # join the train and test edgelists by tree/node and compute relevant summary 110 | # statistics via chaining 111 | oob_error_stats <- 112 | train_nodes[test_nodes, 113 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 114 | .(mspe = mean(unlist(node_errs) ^ 2), 115 | all_errs = list(sort(unlist(node_errs)))), 116 | keyby = c("rowid_test", "pred")] 117 | } else if (!mspewhat & biaswhat) { 118 | # join the 
train and test edgelists by tree/node and compute relevant summary 119 | # statistics via chaining 120 | oob_error_stats <- 121 | train_nodes[test_nodes, 122 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 123 | .(bias = -mean(unlist(node_errs)), 124 | all_errs = list(sort(unlist(node_errs)))), 125 | keyby = c("rowid_test", "pred")] 126 | } else if (mspewhat & biaswhat) { 127 | # join the train and test edgelists by tree/node and compute relevant summary 128 | # statistics via chaining 129 | oob_error_stats <- 130 | train_nodes[test_nodes, 131 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 132 | .(mspe = mean(unlist(node_errs) ^ 2), 133 | bias = -mean(unlist(node_errs)), 134 | all_errs = list(sort(unlist(node_errs)))), 135 | keyby = c("rowid_test", "pred")] 136 | } else { 137 | # join the train and test edgelists by tree/node and compute relevant summary 138 | # statistics via chaining 139 | oob_error_stats <- 140 | train_nodes[test_nodes, 141 | .(tree, terminal_node, rowid_test, pred, node_errs)][, 142 | .(all_errs = list(sort(unlist(node_errs)))), 143 | keyby = c("rowid_test", "pred")] 144 | } 145 | } 146 | 147 | # if the user requests prediction intervals 148 | if (intervalwhat) { 149 | 150 | # check the alpha argument for issues 151 | checkAlpha(alpha) 152 | 153 | # format prediction interval output 154 | percentiles <- sort(c(alpha / 2, 1 - (alpha / 2))) 155 | interval_col_names <- paste0(rep(c("lower_", "upper_"), each = length(alpha)), 156 | c(alpha, rev(alpha))) 157 | col_names <- c(col_names, interval_col_names) 158 | 159 | # compute prediction intervals 160 | oob_error_stats[, (interval_col_names) := lapply(percentiles, FUN = function(p) { 161 | pred + purrr::map_dbl(all_errs, ~.x[ceiling(length(.x) * p)])})] 162 | } 163 | 164 | # extract summary statistics requested as data.frame 165 | output_df <- data.table::setDF(oob_error_stats[, ..col_names]) 166 | 167 | # if the user requests p.error 168 | if (pwhat) { 169 | 170 | # remove 
summary statistics from oob_error_stats 171 | oob_error_stats <- oob_error_stats[, .(all_errs)] 172 | 173 | # define the estimated CDF function 174 | perror <- function(q, xs = 1:n.test) { 175 | 176 | # check xs argument for issues 177 | checkxs(xs, n.test) 178 | 179 | # format output 180 | p_col_names <- paste0("p_", q) 181 | 182 | # compute CDF 183 | oob_error_stats[xs, (p_col_names) := lapply(q, FUN = function(que) { 184 | purrr::map_dbl(all_errs, ~mean(.x <= que))})] 185 | 186 | # format and return output 187 | to.return <- data.table::setDF(oob_error_stats[xs, ..p_col_names]) 188 | row.names(to.return) <- xs 189 | return(to.return) 190 | } 191 | 192 | # add to output 193 | output <- list(output_df, perror) 194 | names(output) <- c("estimates", "perror") 195 | } 196 | 197 | # if the user requests q.error 198 | if (qwhat) { 199 | 200 | # remove summary statistics from oob_error_stats 201 | oob_error_stats <- oob_error_stats[, .(all_errs)] 202 | 203 | # define the estimated quantile function 204 | qerror <- function(p, xs = 1:n.test) { 205 | 206 | # check p and xs arguments for issues 207 | checkps(p) 208 | checkxs(xs, n.test) 209 | 210 | # format output 211 | q_col_names <- paste0("q_", p) 212 | 213 | # compute quantiles 214 | oob_error_stats[xs, (q_col_names) := lapply(p, FUN = function(pe) { 215 | purrr::map_dbl(all_errs, ~.x[ceiling(length(.x) * pe)])})] 216 | 217 | # format and return output 218 | to.return <- data.table::setDF(oob_error_stats[xs, ..q_col_names]) 219 | row.names(to.return) <- xs 220 | return(to.return) 221 | } 222 | 223 | # add quantile function to output 224 | if (pwhat) { 225 | output[["qerror"]] <- qerror 226 | } else { 227 | output <- list(output_df, qerror) 228 | names(output) <- c("estimates", "qerror") 229 | } 230 | } 231 | 232 | # return output 233 | if (is.null(output)) { 234 | return(output_df) 235 | } else { 236 | return(output) 237 | } 238 | } 239 | --------------------------------------------------------------------------------
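The prediction-interval bounds computed in `estimateErrorParams` come from empirical quantiles of each test observation's pooled out-of-bag cohabitant errors: the sorted error vector is indexed at `ceiling(length(.x) * p)` and added to the test prediction. A minimal sketch of that rule on made-up numbers (`pred` and `all_errs` are hypothetical, not package output):

```r
# Hypothetical test prediction and its pooled, sorted OOB cohabitant errors
pred <- 10
all_errs <- sort(c(-3.1, -1.2, -0.4, 0.2, 0.9, 1.5, 2.8, 4.0))
alpha <- 0.1  # type-I error rate for a 90% prediction interval

# empirical-quantile rule used above: index the sorted errors at ceiling(n * p)
lower <- pred + all_errs[ceiling(length(all_errs) * (alpha / 2))]
upper <- pred + all_errs[ceiling(length(all_errs) * (1 - alpha / 2))]
c(lower = lower, upper = upper)  # 6.9 and 14
```

With eight errors, the 0.05 quantile indexes the first sorted error and the 0.95 quantile indexes the eighth, mirroring the `purrr::map_dbl(all_errs, ~.x[ceiling(length(.x) * p)])` call in the function body.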
/R/findooberrors.R: -------------------------------------------------------------------------------- 1 | #' Compute and locate out-of-bag prediction errors 2 | #' 3 | #' Computes each training observation's out-of-bag prediction error using the 4 | #' random forest and, for each tree for which the training observation is 5 | #' out of bag, finds the terminal node of the tree in which the training 6 | #' observation falls. 7 | #' 8 | #' This function accepts classification or regression random forests built using 9 | #' the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 10 | #' \code{quantregForest} packages. When training the random forest using 11 | #' \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 12 | #' must be set to \code{TRUE}. When training the random forest using 13 | #' \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 14 | #' 15 | #' @param forest The random forest object being used for prediction. 16 | #' @param X.train A \code{matrix} or \code{data.frame} with the observations 17 | #' that were used to train \code{forest}. Each row should be an observation, 18 | #' and each column should be a predictor variable. 19 | #' @param Y.train A vector of the responses of the observations that were used 20 | #' to train \code{forest}. Required if \code{forest} was created using 21 | #' \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 22 | #' \code{randomForestSRC}, or \code{quantregForest}. 23 | #' @param n.cores Number of cores to use (for parallel computation in \code{ranger}). 
24 | #' 25 | #' @return A \code{data.table} with the following three columns: 26 | #' 27 | #' \item{tree}{The ID of the tree of the random forest} 28 | #' \item{terminal_node}{The ID of the terminal node of the tree} 29 | #' \item{node_errs}{A vector of the out-of-bag prediction errors that fall 30 | #' within the terminal node of the tree} 31 | #' 32 | #' @seealso \code{\link{quantForestError}} 33 | #' 34 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 35 | #' 36 | #' @examples 37 | #' # load data 38 | #' data(airquality) 39 | #' 40 | #' # remove observations with missing predictor variable values 41 | #' airquality <- airquality[complete.cases(airquality), ] 42 | #' 43 | #' # get number of observations and the response column index 44 | #' n <- nrow(airquality) 45 | #' response.col <- 1 46 | #' 47 | #' # split data into training and test sets 48 | #' train.ind <- sample(1:n, n * 0.9, replace = FALSE) 49 | #' Xtrain <- airquality[train.ind, -response.col] 50 | #' Ytrain <- airquality[train.ind, response.col] 51 | #' Xtest <- airquality[-train.ind, -response.col] 52 | #' 53 | #' # fit random forest to the training data 54 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 55 | #' ntree = 500, keep.inbag = TRUE) 56 | #' 57 | #' # compute out-of-bag prediction errors and locate each 58 | #' # training observation in the trees for which it is out 59 | #' # of bag 60 | #' train_nodes <- findOOBErrors(rf, Xtrain) 61 | #' 62 | #' # estimate conditional mean squared prediction errors, 63 | #' # biases, prediction intervals, and error distribution 64 | #' # functions for the test observations. provide 65 | #' # train_nodes to avoid recomputing that step. 
66 | #' output <- quantForestError(rf, Xtrain, Xtest, 67 | #' train_nodes = train_nodes) 68 | #' 69 | #' @importFrom stats predict 70 | #' @import data.table 71 | #' @export 72 | findOOBErrors <- function(forest, X.train, Y.train = NULL, n.cores = 1) { 73 | 74 | # determine whether this is a regression or classification random forest 75 | categorical <- grepl("class", c(forest$type, forest$family, forest$treetype), TRUE) 76 | 77 | # if the forest is from the quantregForest package 78 | if ("quantregForest" %in% class(forest)) { 79 | 80 | # convert to random forest class 81 | class(forest) <- "randomForest" 82 | 83 | # get terminal nodes of training observations 84 | train.terminal.nodes <- attr(predict(forest, X.train, nodes = TRUE), "nodes") 85 | 86 | # get number of times each training observation appears in each tree 87 | bag.count <- forest$inbag 88 | 89 | # get the OOB prediction error of each training observation 90 | if (categorical) { 91 | oob.errors <- forest$y != forest$predicted 92 | } else { 93 | oob.errors <- forest$y - forest$predicted 94 | } 95 | 96 | # else if the forest is from the randomForest package 97 | } else if ("randomForest" %in% class(forest)) { 98 | 99 | # get terminal nodes of training observations 100 | train.terminal.nodes <- attr(predict(forest, X.train, nodes = TRUE), "nodes") 101 | 102 | # get number of times each training observation appears in each tree 103 | bag.count <- forest$inbag 104 | 105 | # get the OOB prediction error of each training observation 106 | if (categorical) { 107 | oob.errors <- forest$y != forest$predicted 108 | } else { 109 | oob.errors <- forest$y - forest$predicted 110 | } 111 | 112 | # else, if the forest is from the ranger package 113 | } else if ("ranger" %in% class(forest)) { 114 | 115 | # get terminal nodes of training observations 116 | train.terminal.nodes <- predict(forest, X.train, num.threads = n.cores, type = "terminalNodes")$predictions 117 | 118 | # get number of times each training 
observation appears in each tree 119 | bag.count <- matrix(unlist(forest$inbag.counts, use.names = FALSE), ncol = forest$num.trees, byrow = FALSE) 120 | 121 | # get the OOB prediction error of each training observation 122 | if (categorical) { 123 | oob.errors <- Y.train != forest$predictions 124 | } else { 125 | oob.errors <- Y.train - forest$predictions 126 | } 127 | 128 | # else, if the forest is from the randomForestSRC package 129 | } else if ("rfsrc" %in% class(forest)) { 130 | 131 | # get terminal nodes of all observations 132 | train.terminal.nodes <- forest$membership 133 | 134 | # get number of times each training observation appears in each tree 135 | bag.count <- forest$inbag 136 | 137 | # get the OOB prediction error of each training observation 138 | if (categorical) { 139 | oob.errors <- forest$yvar != forest$class.oob 140 | } else { 141 | oob.errors <- forest$yvar - forest$predicted.oob 142 | } 143 | } 144 | 145 | # get the terminal nodes of the training observations in the trees in which 146 | # they are OOB (for all other trees, set the terminal node to be NA) 147 | train.terminal.nodes[bag.count != 0] <- NA 148 | 149 | # reshape train.terminal.nodes to be a long data.table and include OOB 150 | # prediction errors as a column 151 | train_nodes <- data.table::as.data.table(train.terminal.nodes) 152 | train_nodes[, `:=`(oob_error = oob.errors)] 153 | train_nodes <- data.table::melt( 154 | train_nodes, 155 | id.vars = c("oob_error"), 156 | measure.vars = 1:ncol(train.terminal.nodes), 157 | variable.name = "tree", 158 | value.name = "terminal_node", 159 | variable.factor = FALSE, 160 | na.rm = TRUE) 161 | 162 | # collapse the long data.table by unique tree/node 163 | train_nodes <- train_nodes[, 164 | .(node_errs = list(oob_error)), 165 | keyby = c("tree", "terminal_node")] 166 | 167 | # return long data.table of OOB error values and locations 168 | return(train_nodes) 169 | } 170 | 
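The melt-and-collapse step at the end of `findOOBErrors` can be seen on a toy example (hypothetical data, not package code): three training observations and two trees, where `NA` marks the trees in which an observation is in bag, as produced by the `train.terminal.nodes[bag.count != 0] <- NA` line above.

```r
library(data.table)

# Hypothetical terminal-node matrix: rows = training observations,
# columns = trees; NA marks trees where the observation was in bag
train.terminal.nodes <- matrix(c(5L, NA, 7L, NA, 5L, 7L), nrow = 3,
                               dimnames = list(NULL, c("tree1", "tree2")))
oob.errors <- c(0.4, -1.1, 0.3)  # hypothetical OOB prediction errors

# same reshape as findOOBErrors: long format, dropping in-bag (NA) entries
train_nodes <- data.table::as.data.table(train.terminal.nodes)
train_nodes[, oob_error := oob.errors]
train_nodes <- data.table::melt(train_nodes, id.vars = "oob_error",
                                variable.name = "tree",
                                value.name = "terminal_node",
                                variable.factor = FALSE, na.rm = TRUE)

# collapse to one row per tree/terminal-node pair; each node's OOB errors
# are stored together as a list column, keyed for the later join
train_nodes <- train_nodes[, .(node_errs = list(oob_error)),
                           keyby = c("tree", "terminal_node")]
train_nodes  # 4 rows: (tree1,5), (tree1,7), (tree2,5), (tree2,7)
```

Observation 3 is out of bag in both trees, so its error 0.3 appears in the `node_errs` list of node 7 in each tree; this keyed `data.table` is what `estimateErrorParams` joins against the test-observation edgelist.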
-------------------------------------------------------------------------------- /R/findtestpreds.R: -------------------------------------------------------------------------------- 1 | #' Compute and locate test predictions 2 | #' 3 | #' Predicts each test observation's response using the random forest and, for 4 | #' each test observation and tree, finds the terminal node of the tree in which 5 | #' the test observation falls. 6 | #' 7 | #' This function accepts regression random forests built using the \code{randomForest}, 8 | #' \code{ranger}, \code{randomForestSRC}, and \code{quantregForest} packages. 9 | #' 10 | #' @param forest The random forest object being used for prediction. 11 | #' @param X.test A \code{matrix} or \code{data.frame} with the observations to 12 | #' be predicted. Each row should be an observation, and each column should be 13 | #' a predictor variable. 14 | #' @param n.cores Number of cores to use (for parallel computation in \code{ranger}). 15 | #' 16 | #' @return A \code{data.table} with the following four columns: 17 | #' 18 | #' \item{rowid_test}{The row ID of the test observation as provided by \code{X.test}} 19 | #' \item{pred}{The random forest prediction of the test observation} 20 | #' \item{tree}{The ID of the tree of the random forest} 21 | #' \item{terminal_node}{The ID of the terminal node of the tree in which the 22 | #' test observation falls} 23 | #' 24 | #' @seealso \code{\link{findOOBErrors}}, \code{\link{quantForestError}} 25 | #' 26 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 27 | #' 28 | #' @importFrom stats predict 29 | #' @import data.table 30 | #' @keywords internal 31 | findTestPreds <- function(forest, X.test, n.cores = 1) { 32 | 33 | # determine whether this is a regression or classification random forest 34 | categorical <- grepl("class", c(forest$type, forest$family, forest$treetype), TRUE) 35 | 36 | # if the forest is from the quantregForest package 37 | if ("quantregForest" %in% class(forest)) { 
38 | 39 | # convert to random forest class 40 | class(forest) <- "randomForest" 41 | 42 | # get test predictions 43 | test.preds <- predict(forest, X.test, nodes = TRUE) 44 | 45 | # get terminal nodes of test observations 46 | test.terminal.nodes <- attr(test.preds, "nodes") 47 | 48 | # format test observation predictions 49 | attr(test.preds, "nodes") <- NULL 50 | 51 | # else if the forest is from the randomForest package 52 | } else if ("randomForest" %in% class(forest)) { 53 | 54 | # get test predictions 55 | test.preds <- predict(forest, X.test, nodes = TRUE) 56 | 57 | # get terminal nodes of test observations 58 | test.terminal.nodes <- attr(test.preds, "nodes") 59 | 60 | # format test observation predictions 61 | attr(test.preds, "nodes") <- NULL 62 | 63 | # else, if the forest is from the ranger package 64 | } else if ("ranger" %in% class(forest)) { 65 | 66 | # get terminal nodes of test observations 67 | test.terminal.nodes <- predict(forest, X.test, num.threads = n.cores, type = "terminalNodes")$predictions 68 | 69 | # get test observation predictions 70 | test.preds <- predict(forest, X.test)$predictions 71 | 72 | # else, if the forest is from the randomForestSRC package 73 | } else if ("rfsrc" %in% class(forest)) { 74 | 75 | # get test predictions 76 | test.pred.list <- predict(forest, X.test, membership = TRUE) 77 | 78 | # get terminal nodes of test observations 79 | test.terminal.nodes <- test.pred.list$membership 80 | 81 | # format test observation predictions 82 | if (categorical) { 83 | test.preds <- test.pred.list$class 84 | } else { 85 | test.preds <- test.pred.list$predicted 86 | } 87 | } 88 | 89 | # reshape test.terminal.nodes to be a long data.table and 90 | # add unique IDs and predicted values 91 | test_nodes <- data.table::melt( 92 | data.table::as.data.table(test.terminal.nodes)[, `:=`(rowid_test = .I, pred = test.preds)], 93 | id.vars = c("rowid_test", "pred"), 94 | measure.vars = 1:ncol(test.terminal.nodes), 95 | variable.name = "tree", 
96 | variable.factor = FALSE, 97 | value.name = "terminal_node") 98 | 99 | # set key columns for faster indexing 100 | data.table::setkey(test_nodes, tree, terminal_node) 101 | 102 | # return data.table 103 | return(test_nodes) 104 | } 105 | -------------------------------------------------------------------------------- /R/perror.R: -------------------------------------------------------------------------------- 1 | #' Estimated conditional prediction error CDFs 2 | #' 3 | #' Returns probabilities from the estimated conditional cumulative distribution 4 | #' function of the prediction error associated with each test observation. 5 | #' 6 | #' This function is only defined as output of the \code{quantForestError} function. 7 | #' It is not exported as a standalone function. See the example. 8 | #' 9 | #' @usage perror(q, xs) 10 | #' 11 | #' @param q A vector of quantiles. 12 | #' @param xs A vector of the indices of the test observations for which the 13 | #' conditional error CDFs are desired. Defaults to all test observations 14 | #' given in the call of \code{quantForestError}. 15 | #' 16 | #' @return If either \code{q} or \code{xs} has length one, then a vector is 17 | #' returned with the desired probabilities. If both have length greater than 18 | #' one, then a \code{data.frame} of probabilities is returned, with rows 19 | #' corresponding to the inputted \code{xs} and columns corresponding to the 20 | #' inputted \code{q}. 
21 | #' 22 | #' @seealso \code{\link{quantForestError}} 23 | #' 24 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 25 | #' 26 | #' @examples 27 | #' # load data 28 | #' data(airquality) 29 | #' 30 | #' # remove observations with missing predictor variable values 31 | #' airquality <- airquality[complete.cases(airquality), ] 32 | #' 33 | #' # get number of observations and the response column index 34 | #' n <- nrow(airquality) 35 | #' response.col <- 1 36 | #' 37 | #' # split data into training and test sets 38 | #' train.ind <- sample(1:n, n * 0.9, replace = FALSE) 39 | #' Xtrain <- airquality[train.ind, -response.col] 40 | #' Ytrain <- airquality[train.ind, response.col] 41 | #' Xtest <- airquality[-train.ind, -response.col] 42 | #' Ytest <- airquality[-train.ind, response.col] 43 | #' 44 | #' # fit random forest to the training data 45 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 46 | #' ntree = 500, 47 | #' keep.inbag = TRUE) 48 | #' 49 | #' # estimate conditional error distribution functions 50 | #' output <- quantForestError(rf, Xtrain, Xtest, 51 | #' what = c("p.error", "q.error")) 52 | #' 53 | #' # get the probability that the error associated with each test 54 | #' # prediction is less than -4 and the probability that the error 55 | #' # associated with each test prediction is less than 7 56 | #' output$perror(c(-4, 7)) 57 | #' 58 | #' # same as above but only for the first three test observations 59 | #' output$perror(c(-4, 7), 1:3) 60 | perror <- function(q, xs) {stop("perror is not exported as a standalone function. You must first run the quantForestError function to define perror. 
See documentation.")} 61 | -------------------------------------------------------------------------------- /R/qerror.R: -------------------------------------------------------------------------------- 1 | #' Estimated conditional prediction error quantile functions 2 | #' 3 | #' Returns quantiles of the estimated conditional error distribution associated 4 | #' with each test prediction. 5 | #' 6 | #' This function is only defined as output of the \code{quantForestError} function. 7 | #' It is not exported as a standalone function. See the example. 8 | #' 9 | #' @usage qerror(p, xs) 10 | #' 11 | #' @param p A vector of probabilities. 12 | #' @param xs A vector of the indices of the test observations for which the 13 | #' conditional error quantiles are desired. Defaults to all test observations 14 | #' given in the call of \code{quantForestError}. 15 | #' 16 | #' @return If either \code{p} or \code{xs} has length one, then a vector is 17 | #' returned with the desired quantiles. If both have length greater than 18 | #' one, then a \code{data.frame} of quantiles is returned, with rows 19 | #' corresponding to the inputted \code{xs} and columns corresponding to the 20 | #' inputted \code{p}. 
21 | #' 22 | #' @seealso \code{\link{quantForestError}} 23 | #' 24 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 25 | #' @examples 26 | #' # load data 27 | #' data(airquality) 28 | #' 29 | #' # remove observations with missing predictor variable values 30 | #' airquality <- airquality[complete.cases(airquality), ] 31 | #' 32 | #' # get number of observations and the response column index 33 | #' n <- nrow(airquality) 34 | #' response.col <- 1 35 | #' 36 | #' # split data into training and test sets 37 | #' train.ind <- sample(1:n, n * 0.9, replace = FALSE) 38 | #' Xtrain <- airquality[train.ind, -response.col] 39 | #' Ytrain <- airquality[train.ind, response.col] 40 | #' Xtest <- airquality[-train.ind, -response.col] 41 | #' Ytest <- airquality[-train.ind, response.col] 42 | #' 43 | #' # fit random forest to the training data 44 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 45 | #' ntree = 500, 46 | #' keep.inbag = TRUE) 47 | #' 48 | #' # estimate conditional error distribution functions 49 | #' output <- quantForestError(rf, Xtrain, Xtest, 50 | #' what = c("p.error", "q.error")) 51 | #' 52 | #' # get the 0.25 and 0.8 quantiles of the error distribution 53 | #' # associated with each test prediction 54 | #' output$qerror(c(0.25, 0.8)) 55 | #' 56 | #' # same as above but only for the first three test observations 57 | #' output$qerror(c(0.25, 0.8), 1:3) 58 | qerror <- function(p, xs) {stop("qerror is not exported as a standalone function. You must first run the quantForestError function to define qerror. 
See documentation.")} 59 | -------------------------------------------------------------------------------- /R/quantforesterror.R: -------------------------------------------------------------------------------- 1 | # avoid check note 2 | if(getRversion() >= "2.15.1"){utils::globalVariables(c("n.test", "tree", "terminal_node", 3 | ".", "oob_error", "rowid_test", "pred", 4 | "node_errs", "pred", "all_errs", "..col_names", 5 | "..p_col_names", "..q_col_names"))} 6 | 7 | #' Quantify random forest prediction error 8 | #' 9 | #' Estimates the conditional misclassification rates, conditional mean squared 10 | #' prediction errors, conditional biases, conditional prediction intervals, and 11 | #' conditional error distributions of random forest predictions. 12 | #' 13 | #' This function accepts classification or regression random forests built using 14 | #' the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 15 | #' \code{quantregForest} packages. When training the random forest using 16 | #' \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 17 | #' must be set to \code{TRUE}. When training the random forest using 18 | #' \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 19 | #' 20 | #' The predictions computed by \code{ranger} can be parallelized by setting the 21 | #' value of \code{n.cores} to be greater than 1. 22 | #' 23 | #' The random forest predictions are always returned as a \code{data.frame}. Additional 24 | #' columns are included in the \code{data.frame} depending on the user's selections in 25 | #' the argument \code{what}. 
In particular, including \code{"mspe"} in \code{what} 26 | #' will add an additional column with the conditional mean squared prediction 27 | #' error of each test prediction to the \code{data.frame}; including \code{"bias"} in 28 | #' \code{what} will add an additional column with the conditional bias of each test 29 | #' prediction to the \code{data.frame}; including \code{"interval"} in \code{what} 30 | #' will add to the \code{data.frame} additional columns with the lower and 31 | #' upper bounds of conditional prediction intervals for each test prediction; 32 | #' and including \code{"mcr"} in \code{what} will add an additional column with 33 | #' the conditional misclassification rate of each test prediction to the 34 | #' \code{data.frame}. The conditional misclassification rate can be estimated 35 | #' only for classification random forests, while the other parameters can be 36 | #' estimated only for regression random forests. 37 | #' 38 | #' If \code{"p.error"} or \code{"q.error"} is included in \code{what}, or if 39 | #' \code{return_train_nodes} is set to \code{TRUE}, then a list will be returned 40 | #' as output. The first element of the list, named \code{"estimates"}, is the 41 | #' \code{data.frame} described in the above paragraph. The other elements of the 42 | #' list are the estimated cumulative distribution functions (\code{perror}) of 43 | #' the conditional error distributions, the estimated quantile functions 44 | #' (\code{qerror}) of the conditional error distributions, and/or a \code{data.table} 45 | #' indicating what out-of-bag prediction errors each terminal node of each tree 46 | #' in the random forest contains. 47 | #' 48 | #' @param forest The random forest object being used for prediction. 49 | #' @param X.train A \code{matrix} or \code{data.frame} with the observations 50 | #' that were used to train \code{forest}. Each row should be an observation, 51 | #' and each column should be a predictor variable. 
52 | #' @param X.test A \code{matrix} or \code{data.frame} with the observations to 53 | #' be predicted; each row should be an observation, and each column should be 54 | #' a predictor variable. 55 | #' @param Y.train A vector of the responses of the observations that were used 56 | #' to train \code{forest}. Required if \code{forest} was created using 57 | #' \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 58 | #' \code{randomForestSRC}, or \code{quantregForest}. 59 | #' @param what A vector of characters indicating what estimates are desired. 60 | #' Possible options are conditional mean squared prediction errors (\code{"mspe"}), 61 | #' conditional biases (\code{"bias"}), conditional prediction intervals (\code{"interval"}), 62 | #' conditional error distribution functions (\code{"p.error"}), conditional 63 | #' error quantile functions (\code{"q.error"}), and conditional 64 | #' misclassification rate (\code{"mcr"}). Note that the conditional 65 | #' misclassification rate is available only for categorical outcomes, while 66 | #' the other parameters are available only for real-valued outcomes. 67 | #' @param alpha A vector of type-I error rates desired for the conditional prediction 68 | #' intervals; required if \code{"interval"} is included in \code{what}. 69 | #' @param train_nodes A \code{data.table} indicating what out-of-bag prediction 70 | #' errors each terminal node of each tree in \code{forest} contains. It should 71 | #' be formatted like the output of \code{findOOBErrors}. If not provided, 72 | #' it will be computed internally. 73 | #' @param return_train_nodes A boolean indicating whether to return the 74 | #' \code{train_nodes} computed and/or used. 75 | #' @param n.cores Number of cores to use (for parallel computation in \code{ranger}). 
76 | #' 77 | #' @return A \code{data.frame} with one or more of the following columns, as described 78 | #' in the details section: 79 | #' 80 | #' \item{pred}{The random forest predictions of the test observations} 81 | #' \item{mspe}{The estimated conditional mean squared prediction errors of 82 | #' the random forest predictions} 83 | #' \item{bias}{The estimated conditional biases of the random forest 84 | #' predictions} 85 | #' \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 86 | #' prediction intervals for the test observations} 87 | #' \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 88 | #' prediction intervals for the test observations} 89 | #' \item{mcr}{The estimated conditional misclassification rate of the random 90 | #' forest predictions} 91 | #' 92 | #' In addition, one or both of the following functions, as described in the 93 | #' details section: 94 | #' 95 | #' \item{perror}{The estimated cumulative distribution functions of the 96 | #' conditional error distributions associated with the test predictions} 97 | #' \item{qerror}{The estimated quantile functions of the conditional error 98 | #' distributions associated with the test predictions} 99 | #' 100 | #' In addition, if \code{return_train_nodes} is \code{TRUE}, then a \code{data.table} 101 | #' called \code{train_nodes} indicating what out-of-bag prediction errors each 102 | #' terminal node of each tree in \code{forest} contains. 
103 | #' 104 | #' @seealso \code{\link{perror}}, \code{\link{qerror}}, \code{\link{findOOBErrors}} 105 | #' 106 | #' @author Benjamin Lu \code{}; Johanna Hardin \code{} 107 | #' 108 | #' @examples 109 | #' # load data 110 | #' data(airquality) 111 | #' 112 | #' # remove observations with missing predictor variable values 113 | #' airquality <- airquality[complete.cases(airquality), ] 114 | #' 115 | #' # get number of observations and the response column index 116 | #' n <- nrow(airquality) 117 | #' response.col <- 1 118 | #' 119 | #' # split data into training and test sets 120 | #' train.ind <- sample(c("A", "B", "C"), n, 121 | #' replace = TRUE, prob = c(0.8, 0.1, 0.1)) 122 | #' Xtrain <- airquality[train.ind == "A", -response.col] 123 | #' Ytrain <- airquality[train.ind == "A", response.col] 124 | #' Xtest1 <- airquality[train.ind == "B", -response.col] 125 | #' Xtest2 <- airquality[train.ind == "C", -response.col] 126 | #' 127 | #' # fit regression random forest to the training data 128 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 129 | #' ntree = 500, 130 | #' keep.inbag = TRUE) 131 | #' 132 | #' # estimate conditional mean squared prediction errors, 133 | #' # biases, prediction intervals, and error distribution 134 | #' # functions for the observations in Xtest1. return 135 | #' # train_nodes to avoid recomputation in the next 136 | #' # line of code. 137 | #' output1 <- quantForestError(rf, Xtrain, Xtest1, 138 | #' return_train_nodes = TRUE) 139 | #' 140 | #' # estimate just the conditional mean squared prediction errors 141 | #' # and prediction intervals for the observations in Xtest2. 142 | #' # avoid recomputation by providing train_nodes from the 143 | #' # previous line of code. 
144 | #' output2 <- quantForestError(rf, Xtrain, Xtest2, 145 | #' what = c("mspe", "interval"), 146 | #' train_nodes = output1$train_nodes) 147 | #' 148 | #' # for illustrative purposes, convert response to categorical 149 | #' Ytrain <- as.factor(Ytrain > 31.5) 150 | #' 151 | #' # fit classification random forest to the training data 152 | #' rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3, 153 | #' ntree = 500, 154 | #' keep.inbag = TRUE) 155 | #' 156 | #' # estimate conditional misclassification rate of the 157 | #' # predictions of Xtest1 158 | #' output <- quantForestError(rf, Xtrain, Xtest1) 159 | #' 160 | #' @aliases forestError 161 | #' 162 | #' @importFrom stats predict 163 | #' @import data.table 164 | #' @importFrom purrr map_dbl 165 | #' @export 166 | quantForestError <- function(forest, X.train, X.test, Y.train = NULL, what = if (grepl("class", c(forest$type, forest$family, forest$treetype), TRUE)) "mcr" else c("mspe", "bias", "interval", "p.error", "q.error"), alpha = 0.05, train_nodes = NULL, return_train_nodes = FALSE, n.cores = 1) { 167 | 168 | # check forest, X.train, X.test arguments for issues 169 | checkForest(forest) 170 | checkXtrainXtest(X.train, X.test) 171 | # get number of training and test observations 172 | n.train <- nrow(X.train) 173 | n.test <- nrow(X.test) 174 | # check Y.train and n.cores arguments for issues 175 | checkYtrain(forest, Y.train, n.train) 176 | checkcores(n.cores) 177 | # check requested error parameters 178 | checkwhat(what, forest) 179 | 180 | # check what user wants to produce 181 | mspewhat <- "mspe" %in% what 182 | biaswhat <- "bias" %in% what 183 | intervalwhat <- "interval" %in% what 184 | pwhat <- "p.error" %in% what 185 | qwhat <- "q.error" %in% what 186 | mcrwhat <- "mcr" %in% what 187 | 188 | # compute and locate out-of-bag training errors and test predictions 189 | if (is.null(train_nodes)) { 190 | train_nodes <- findOOBErrors(forest, X.train, Y.train, n.cores) 191 | } 192 | test_nodes <- 
findTestPreds(forest, X.test, n.cores) 193 | 194 | # estimate the requested prediction error distribution parameters 195 | output <- estimateErrorParams(train_nodes, test_nodes, mspewhat, biaswhat, 196 | intervalwhat, pwhat, qwhat, mcrwhat, 197 | alpha, n.test) 198 | 199 | # add train_nodes if requested 200 | if (return_train_nodes) { 201 | if (pwhat | qwhat) { 202 | output[["train_nodes"]] <- train_nodes 203 | } else { 204 | output <- list(output, train_nodes) 205 | names(output) <- c("estimates", "train_nodes") 206 | } 207 | } 208 | 209 | # return output 210 | return(output) 211 | } 212 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # forestError: A Unified Framework for Random Forest Prediction Error Estimation 2 | [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active) 3 | 4 | ## Version 1.1.0 Update 5 | 6 | Version 1.1.0 makes two changes. First, it enables estimation of the conditional misclassification rate of predictions by classification random forests as proposed by Lu and Hardin (2021). Second, it compartmentalizes a costly step in the `quantForestError` algorithm: The identification of each training observation's out-of-bag terminal nodes. 7 | 8 | ### Conditional Misclassification Rate Estimation 9 | 10 | The conditional misclassification rate of predictions by classification random forests can now be estimated. To estimate it, simply set the `what` argument in the `quantForestError` function to `"mcr"`. `what` will default to this if the provided `forest` is a classification random forest. See the example code below for a toy demonstration of the performance of this estimator. 
11 | 12 | ### Compartmentalization 13 | 14 | The identification of each training observation's out-of-bag terminal nodes is now compartmentalized from the main `quantForestError` function. By isolating this step from the main `quantForestError` function, Version 1.1.0 allows users to more efficiently iterate the algorithm. Users may wish to feed `quantForestError` batches of test observations iteratively if they have streaming data or a large test set that cannot be processed in one go due to memory constraints. In previous versions of this package, doing so would require the algorithm to recompute each training observation's out-of-bag terminal nodes in each iteration. This was redundant and costly. By separating this computation from the rest of the `quantForestError` algorithm, Version 1.1.0 allows the user to perform this computation only once. 15 | 16 | As part of this modularization, the `quantForestError` function now has two additional arguments. If set to `TRUE`, `return_train_nodes` will return a `data.table` identifying each training observation's out-of-bag terminal nodes. This `data.table` can then be fed back into `quantForestError` via the argument `train_nodes` to avoid the redundant recomputation. 17 | 18 | Version 1.1.0 also exports the function that produces the `data.table` identifying each training observation's out-of-bag terminal nodes. It is called `findOOBErrors`. Assuming the same inputs, `findOOBErrors` will produce the same output that is returned by setting `return_train_nodes` to `TRUE` in the `quantForestError` function. 19 | 20 | See the documentation on `quantForestError` and `findOOBErrors` for examples. 21 | 22 | Neither of these changes affects code that relied on Version 1.0.0 of this package, as the changes consist solely of a newly exported function, two optional arguments to `quantForestError` that by default do nothing new, and a new possible input for the `what` argument. 
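
The batched workflow described above can be sketched as follows. This is a minimal illustration rather than package code: it assumes `rf` is a random forest trained with `keep.inbag = TRUE`, `Xtrain` holds its training predictors, and `test_batches` is a hypothetical list of test-set chunks.

```r
library(forestError)

# identify each training observation's out-of-bag terminal
# nodes once, up front
train_nodes <- findOOBErrors(rf, Xtrain)

# process the test batches iteratively, passing train_nodes
# back in so that step is never recomputed
results <- lapply(test_batches, function(X.batch) {
  quantForestError(rf, Xtrain, X.batch,
                   what = c("mspe", "interval"),
                   train_nodes = train_nodes)
})
```

Equivalently, the first call to `quantForestError` can be made with `return_train_nodes = TRUE`, and its `train_nodes` element passed to the subsequent calls.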
23 | 24 | ## Overview 25 | 26 | The `forestError` package estimates conditional misclassification rates, conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions for random forest predictions using the plug-in method introduced in Lu and Hardin (2021). These estimates are conditional on the test observations' predictor values, accounting for possible response heterogeneity, random forest prediction bias, and random forest prediction variability across the predictor space. 27 | 28 | In its current state, the main function in this package accepts classification and regression random forests built using any of the following packages: 29 | 30 | - `randomForest`, 31 | - `randomForestSRC`, 32 | - `ranger`, and 33 | - `quantregForest`. 34 | 35 | ## Installation 36 | 37 | Running the following line of code in `R` will install a stable version of this package from CRAN: 38 | 39 | ```{r} 40 | install.packages("forestError") 41 | ``` 42 | 43 | To install the developer version of this package from GitHub, run the following lines of code in `R`: 44 | 45 | ```{r} 46 | library(devtools) 47 | devtools::install_github(repo = "benjilu/forestError") 48 | ``` 49 | 50 | ## Instructions 51 | See the documentation for detailed information on how to use this package. A regression example and a classification example are given below.
52 | 53 | ```{r} 54 | ######################## REGRESSION ######################## 55 | # load data 56 | data(airquality) 57 | 58 | # remove observations with missing predictor variable values 59 | airquality <- airquality[complete.cases(airquality), ] 60 | 61 | # get number of observations and the response column index 62 | n <- nrow(airquality) 63 | response.col <- 1 64 | 65 | # split data into training and test sets 66 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 67 | Xtrain <- airquality[train.ind, -response.col] 68 | Ytrain <- airquality[train.ind, response.col] 69 | Xtest <- airquality[-train.ind, -response.col] 70 | Ytest <- airquality[-train.ind, response.col] 71 | 72 | # fit random forest to the training data 73 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 74 | ntree = 500, keep.inbag = TRUE) 75 | 76 | # estimate conditional mean squared prediction errors, conditional 77 | # biases, conditional prediction intervals, and conditional error 78 | # distribution functions for the test observations 79 | output <- quantForestError(rf, Xtrain, Xtest) 80 | 81 | ######################## CLASSIFICATION ######################## 82 | # data-generating parameters 83 | train_samp_size <- 10000 84 | test_samp_size <- 5000 85 | p <- 5 86 | 87 | # generate binary data where the probability of success is a 88 | # linear function of the first predictor variable 89 | Xtrain <- data.frame(matrix(runif(train_samp_size * p), 90 | ncol = p)) 91 | Xtest <- data.frame(matrix(runif(test_samp_size * p), 92 | ncol = p)) 93 | Ytrain <- as.factor(rbinom(train_samp_size, 1, Xtrain$X1)) 94 | 95 | # fit random forest to training data 96 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3, 97 | ntree = 1000, keep.inbag = TRUE) 98 | 99 | # estimate conditional misclassification rate 100 | output <- quantForestError(rf, Xtrain, Xtest) 101 | 102 | # plot conditional misclassification rate against the signal 103 | plot(Xtest$X1, output$mcr, xlab = "X1", 
ylab = "Estimated Misclassification Rate") 105 | ``` 106 | 107 | ## License 108 | See `DESCRIPTION` for information. 109 | 110 | ## Authors 111 | Benjamin Lu and Johanna Hardin 112 | 113 | ## References 114 | * Benjamin Lu and Johanna Hardin. A Unified Framework for Random Forest Prediction Error Estimation. Journal of Machine Learning Research, 22(8):1-41, 2021. [[Link](https://jmlr.org/papers/v22/18-558.html)] 115 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## Submission 2 | This is an updated version of an existing package. 3 | 4 | ## Changes from previous submission 5 | Modularized the main function's steps by defining them as separate functions, 6 | one of which is now newly exported. Enabled estimation of an additional 7 | parameter through the main function. 8 | 9 | ## Test environments 10 | * local macOS R 4.1.0 11 | * win-builder (devel and release) 12 | 13 | ## R CMD check results 14 | There were no ERRORs, WARNINGs, or NOTEs. 15 | 16 | ## Downstream dependencies 17 | No downstream dependencies are negatively affected based on revdepcheck.
18 | -------------------------------------------------------------------------------- /forestError.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /forestError_1.1.0.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/benjilu/forestError/5d9c4b2d01e2ad65bdedf1bdd499ba342e27be8d/forestError_1.1.0.pdf -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | citHeader("To cite forestError in publications, please use:") 2 | 3 | citEntry(entry = "Article", 4 | title = "A Unified Framework for Random Forest Prediction Error Estimation", 5 | author = personList(as.person("Benjamin Lu"), 6 | as.person("Johanna Hardin")), 7 | journal = "Journal of Machine Learning Research", 8 | year = "2021", 9 | volume="22", 10 | number="8", 11 | pages="1--41", 12 | 13 | textVersion = paste("Benjamin Lu and Johanna Hardin.", 14 | "A Unified Framework for Random Forest Prediction Error Estimation.", 15 | "Journal of Machine Learning Research, 22(8):1-41, 2021.") 16 | ) 17 | -------------------------------------------------------------------------------- /man/estimateErrorParams.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in
R/estimateerrorparams.R 3 | \name{estimateErrorParams} 4 | \alias{estimateErrorParams} 5 | \title{Estimate prediction error distribution parameters} 6 | \usage{ 7 | estimateErrorParams( 8 | train_nodes, 9 | test_nodes, 10 | mspewhat, 11 | biaswhat, 12 | intervalwhat, 13 | pwhat, 14 | qwhat, 15 | mcrwhat, 16 | alpha, 17 | n.test 18 | ) 19 | } 20 | \arguments{ 21 | \item{train_nodes}{A \code{data.table} indicating which out-of-bag prediction 22 | errors are in each terminal node of each tree in the random forest. It 23 | must be formatted like the output of the \code{findOOBErrors} function.} 24 | 25 | \item{test_nodes}{A \code{data.table} indicating which test observations are 26 | in each terminal node of each tree in the random forest. It must be 27 | formatted like the output of the \code{findTestPreds} function.} 28 | 29 | \item{mspewhat}{A boolean indicating whether to estimate conditional MSPE.} 30 | 31 | \item{biaswhat}{A boolean indicating whether to estimate conditional bias.} 32 | 33 | \item{intervalwhat}{A boolean indicating whether to estimate conditional 34 | prediction intervals.} 35 | 36 | \item{pwhat}{A boolean indicating whether to estimate the conditional 37 | prediction error CDFs.} 38 | 39 | \item{qwhat}{A boolean indicating whether to estimate the conditional 40 | prediction error quantile functions.} 41 | 42 | \item{mcrwhat}{A boolean indicating whether to estimate the conditional 43 | misclassification rate.} 44 | 45 | \item{alpha}{A vector of type-I error rates desired for the conditional prediction 46 | intervals; required if \code{intervalwhat} is \code{TRUE}.} 47 | 48 | \item{n.test}{The number of test observations.} 49 | } 50 | \value{ 51 | A \code{data.frame} with one or more of the following columns: 52 | 53 | \item{pred}{The random forest predictions of the test observations} 54 | \item{mspe}{The estimated conditional mean squared prediction errors of 55 | the random forest predictions} 56 | \item{bias}{The estimated conditional biases 
of the random forest 57 | predictions} 58 | \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 59 | prediction intervals for the test observations} 60 | \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 61 | prediction intervals for the test observations} 62 | \item{mcr}{The estimated conditional misclassification rate of the random 63 | forest predictions} 64 | 65 | In addition, one or both of the following functions: 66 | 67 | \item{perror}{The estimated cumulative distribution functions of the 68 | conditional error distributions associated with the test predictions} 69 | \item{qerror}{The estimated quantile functions of the conditional error 70 | distributions associated with the test predictions} 71 | } 72 | \description{ 73 | Estimates the prediction error distribution parameters requested in the input 74 | to \code{quantForestError}. 75 | } 76 | \details{ 77 | This function is for internal use. 78 | } 79 | \seealso{ 80 | \code{\link{quantForestError}}, \code{\link{findOOBErrors}}, \code{\link{findTestPreds}} 81 | } 82 | \author{ 83 | Benjamin Lu \code{}; Johanna Hardin \code{} 84 | } 85 | \keyword{internal} 86 | -------------------------------------------------------------------------------- /man/findOOBErrors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/findooberrors.R 3 | \name{findOOBErrors} 4 | \alias{findOOBErrors} 5 | \title{Compute and locate out-of-bag prediction errors} 6 | \usage{ 7 | findOOBErrors(forest, X.train, Y.train = NULL, n.cores = 1) 8 | } 9 | \arguments{ 10 | \item{forest}{The random forest object being used for prediction.} 11 | 12 | \item{X.train}{A \code{matrix} or \code{data.frame} with the observations 13 | that were used to train \code{forest}. 
Each row should be an observation, 14 | and each column should be a predictor variable.} 15 | 16 | \item{Y.train}{A vector of the responses of the observations that were used 17 | to train \code{forest}. Required if \code{forest} was created using 18 | \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 19 | \code{randomForestSRC}, or \code{quantregForest}.} 20 | 21 | \item{n.cores}{Number of cores to use (for parallel computation in \code{ranger}).} 22 | } 23 | \value{ 24 | A \code{data.table} with the following three columns: 25 | 26 | \item{tree}{The ID of the tree of the random forest} 27 | \item{terminal_node}{The ID of the terminal node of the tree} 28 | \item{node_errs}{A vector of the out-of-bag prediction errors that fall 29 | within the terminal node of the tree} 30 | } 31 | \description{ 32 | Computes each training observation's out-of-bag prediction error using the 33 | random forest and, for each tree for which the training observation is 34 | out of bag, finds the terminal node of the tree in which the training 35 | observation falls. 36 | } 37 | \details{ 38 | This function accepts classification or regression random forests built using 39 | the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 40 | \code{quantregForest} packages. When training the random forest using 41 | \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 42 | must be set to \code{TRUE}. When training the random forest using 43 | \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 
44 | } 45 | \examples{ 46 | # load data 47 | data(airquality) 48 | 49 | # remove observations with missing predictor variable values 50 | airquality <- airquality[complete.cases(airquality), ] 51 | 52 | # get number of observations and the response column index 53 | n <- nrow(airquality) 54 | response.col <- 1 55 | 56 | # split data into training and test sets 57 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 58 | Xtrain <- airquality[train.ind, -response.col] 59 | Ytrain <- airquality[train.ind, response.col] 60 | Xtest <- airquality[-train.ind, -response.col] 61 | 62 | # fit random forest to the training data 63 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 64 | ntree = 500, keep.inbag = TRUE) 65 | 66 | # compute out-of-bag prediction errors and locate each 67 | # training observation in the trees for which it is out 68 | # of bag 69 | train_nodes <- findOOBErrors(rf, Xtrain) 70 | 71 | # estimate conditional mean squared prediction errors, 72 | # biases, prediction intervals, and error distribution 73 | # functions for the test observations. provide 74 | # train_nodes to avoid recomputing that step. 
75 | output <- quantForestError(rf, Xtrain, Xtest, 76 | train_nodes = train_nodes) 77 | 78 | } 79 | \seealso{ 80 | \code{\link{quantForestError}} 81 | } 82 | \author{ 83 | Benjamin Lu \code{}; Johanna Hardin \code{} 84 | } 85 | -------------------------------------------------------------------------------- /man/findTestPreds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/findtestpreds.R 3 | \name{findTestPreds} 4 | \alias{findTestPreds} 5 | \title{Compute and locate test predictions} 6 | \usage{ 7 | findTestPreds(forest, X.test, n.cores = 1) 8 | } 9 | \arguments{ 10 | \item{forest}{The random forest object being used for prediction.} 11 | 12 | \item{X.test}{A \code{matrix} or \code{data.frame} with the observations to 13 | be predicted. Each row should be an observation, and each column should be 14 | a predictor variable.} 15 | 16 | \item{n.cores}{Number of cores to use (for parallel computation in \code{ranger}).} 17 | } 18 | \value{ 19 | A \code{data.table} with the following four columns: 20 | 21 | \item{rowid_test}{The row ID of the test observation as provided by \code{X.test}} 22 | \item{pred}{The random forest prediction of the test observation} 23 | \item{tree}{The ID of the tree of the random forest} 24 | \item{terminal_node}{The ID of the terminal node of the tree in which the 25 | test observation falls} 26 | } 27 | \description{ 28 | Predicts each test observation's response using the random forest and, for 29 | each test observation and tree, finds the terminal node of the tree in which 30 | the test observation falls. 31 | } 32 | \details{ 33 | This function accepts regression random forests built using the \code{randomForest}, 34 | \code{ranger}, \code{randomForestSRC}, and \code{quantregForest} packages. 
35 | } 36 | \seealso{ 37 | \code{\link{findOOBErrors}}, \code{\link{quantForestError}} 38 | } 39 | \author{ 40 | Benjamin Lu \code{}; Johanna Hardin \code{} 41 | } 42 | \keyword{internal} 43 | -------------------------------------------------------------------------------- /man/perror.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/perror.R 3 | \name{perror} 4 | \alias{perror} 5 | \title{Estimated conditional prediction error CDFs} 6 | \usage{ 7 | perror(q, xs) 8 | } 9 | \arguments{ 10 | \item{q}{A vector of quantiles.} 11 | 12 | \item{xs}{A vector of the indices of the test observations for which the 13 | conditional error CDFs are desired. Defaults to all test observations 14 | given in the call to \code{quantForestError}.} 15 | } 16 | \value{ 17 | If either \code{q} or \code{xs} has length one, then a vector is 18 | returned with the desired probabilities. If both have length greater than 19 | one, then a \code{data.frame} of probabilities is returned, with rows 20 | corresponding to the provided \code{xs} and columns corresponding to the 21 | provided \code{q}. 22 | } 23 | \description{ 24 | Returns probabilities from the estimated conditional cumulative distribution 25 | function of the prediction error associated with each test observation. 26 | } 27 | \details{ 28 | This function is only defined as output of the \code{quantForestError} function. 29 | It is not exported as a standalone function. See the example.
30 | } 31 | \examples{ 32 | # load data 33 | data(airquality) 34 | 35 | # remove observations with missing predictor variable values 36 | airquality <- airquality[complete.cases(airquality), ] 37 | 38 | # get number of observations and the response column index 39 | n <- nrow(airquality) 40 | response.col <- 1 41 | 42 | # split data into training and test sets 43 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 44 | Xtrain <- airquality[train.ind, -response.col] 45 | Ytrain <- airquality[train.ind, response.col] 46 | Xtest <- airquality[-train.ind, -response.col] 47 | Ytest <- airquality[-train.ind, response.col] 48 | 49 | # fit random forest to the training data 50 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 51 | ntree = 500, 52 | keep.inbag = TRUE) 53 | 54 | # estimate conditional error distribution functions 55 | output <- quantForestError(rf, Xtrain, Xtest, 56 | what = c("p.error", "q.error")) 57 | 58 | # get the probability that the error associated with each test 59 | # prediction is less than -4 and the probability that the error 60 | # associated with each test prediction is less than 7 61 | output$perror(c(-4, 7)) 62 | 63 | # same as above but only for the first three test observations 64 | output$perror(c(-4, 7), 1:3) 65 | } 66 | \seealso{ 67 | \code{\link{quantForestError}} 68 | } 69 | \author{ 70 | Benjamin Lu \code{}; Johanna Hardin \code{} 71 | } 72 | -------------------------------------------------------------------------------- /man/qerror.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/qerror.R 3 | \name{qerror} 4 | \alias{qerror} 5 | \title{Estimated conditional prediction error quantile functions} 6 | \usage{ 7 | qerror(p, xs) 8 | } 9 | \arguments{ 10 | \item{p}{A vector of probabilities.} 11 | 12 | \item{xs}{A vector of the indices of the test observations for which the 13 | conditional error 
quantiles are desired. Defaults to all test observations 14 | given in the call to \code{quantForestError}.} 15 | } 16 | \value{ 17 | If either \code{p} or \code{xs} has length one, then a vector is 18 | returned with the desired quantiles. If both have length greater than 19 | one, then a \code{data.frame} of quantiles is returned, with rows 20 | corresponding to the provided \code{xs} and columns corresponding to the 21 | provided \code{p}. 22 | } 23 | \description{ 24 | Returns quantiles of the estimated conditional error distribution associated 25 | with each test prediction. 26 | } 27 | \details{ 28 | This function is only defined as output of the \code{quantForestError} function. 29 | It is not exported as a standalone function. See the example. 30 | } 31 | \examples{ 32 | # load data 33 | data(airquality) 34 | 35 | # remove observations with missing predictor variable values 36 | airquality <- airquality[complete.cases(airquality), ] 37 | 38 | # get number of observations and the response column index 39 | n <- nrow(airquality) 40 | response.col <- 1 41 | 42 | # split data into training and test sets 43 | train.ind <- sample(1:n, n * 0.9, replace = FALSE) 44 | Xtrain <- airquality[train.ind, -response.col] 45 | Ytrain <- airquality[train.ind, response.col] 46 | Xtest <- airquality[-train.ind, -response.col] 47 | Ytest <- airquality[-train.ind, response.col] 48 | 49 | # fit random forest to the training data 50 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 51 | ntree = 500, 52 | keep.inbag = TRUE) 53 | 54 | # estimate conditional error distribution functions 55 | output <- quantForestError(rf, Xtrain, Xtest, 56 | what = c("p.error", "q.error")) 57 | 58 | # get the 0.25 and 0.8 quantiles of the error distribution 59 | # associated with each test prediction 60 | output$qerror(c(0.25, 0.8)) 61 | 62 | # same as above but only for the first three test observations 63 | output$qerror(c(0.25, 0.8), 1:3) 64 | } 65 | \seealso{ 66 |
\code{\link{quantForestError}} 67 | } 68 | \author{ 69 | Benjamin Lu \code{}; Johanna Hardin \code{} 70 | } 71 | -------------------------------------------------------------------------------- /man/quantForestError.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/quantforesterror.R 3 | \name{quantForestError} 4 | \alias{quantForestError} 5 | \alias{forestError} 6 | \title{Quantify random forest prediction error} 7 | \usage{ 8 | quantForestError( 9 | forest, 10 | X.train, 11 | X.test, 12 | Y.train = NULL, 13 | what = if (grepl("class", c(forest$type, forest$family, forest$treetype), TRUE)) 14 | "mcr" else c("mspe", "bias", "interval", "p.error", "q.error"), 15 | alpha = 0.05, 16 | train_nodes = NULL, 17 | return_train_nodes = FALSE, 18 | n.cores = 1 19 | ) 20 | } 21 | \arguments{ 22 | \item{forest}{The random forest object being used for prediction.} 23 | 24 | \item{X.train}{A \code{matrix} or \code{data.frame} with the observations 25 | that were used to train \code{forest}. Each row should be an observation, 26 | and each column should be a predictor variable.} 27 | 28 | \item{X.test}{A \code{matrix} or \code{data.frame} with the observations to 29 | be predicted; each row should be an observation, and each column should be 30 | a predictor variable.} 31 | 32 | \item{Y.train}{A vector of the responses of the observations that were used 33 | to train \code{forest}. Required if \code{forest} was created using 34 | \code{ranger}, but not if \code{forest} was created using \code{randomForest}, 35 | \code{randomForestSRC}, or \code{quantregForest}.} 36 | 37 | \item{what}{A character vector indicating what estimates are desired.
38 | Possible options are conditional mean squared prediction errors (\code{"mspe"}), 39 | conditional biases (\code{"bias"}), conditional prediction intervals (\code{"interval"}), 40 | conditional error distribution functions (\code{"p.error"}), conditional 41 | error quantile functions (\code{"q.error"}), and conditional 42 | misclassification rate (\code{"mcr"}). Note that the conditional 43 | misclassification rate is available only for categorical outcomes, while 44 | the other parameters are available only for real-valued outcomes.} 45 | 46 | \item{alpha}{A vector of type-I error rates desired for the conditional prediction 47 | intervals; required if \code{"interval"} is included in \code{what}.} 48 | 49 | \item{train_nodes}{A \code{data.table} indicating what out-of-bag prediction 50 | errors each terminal node of each tree in \code{forest} contains. It should 51 | be formatted like the output of \code{findOOBErrors}. If not provided, 52 | it will be computed internally.} 53 | 54 | \item{return_train_nodes}{A boolean indicating whether to return the 55 | \code{train_nodes} computed and/or used.} 56 | 57 | \item{n.cores}{Number of cores to use (for parallel computation in \code{ranger}).} 58 | } 59 | \value{ 60 | A \code{data.frame} with one or more of the following columns, as described 61 | in the details section: 62 | 63 | \item{pred}{The random forest predictions of the test observations} 64 | \item{mspe}{The estimated conditional mean squared prediction errors of 65 | the random forest predictions} 66 | \item{bias}{The estimated conditional biases of the random forest 67 | predictions} 68 | \item{lower_alpha}{The estimated lower bounds of the conditional alpha-level 69 | prediction intervals for the test observations} 70 | \item{upper_alpha}{The estimated upper bounds of the conditional alpha-level 71 | prediction intervals for the test observations} 72 | \item{mcr}{The estimated conditional misclassification rate of the random 73 | forest predictions} 
74 | 75 | In addition, one or both of the following functions, as described in the 76 | details section: 77 | 78 | \item{perror}{The estimated cumulative distribution functions of the 79 | conditional error distributions associated with the test predictions} 80 | \item{qerror}{The estimated quantile functions of the conditional error 81 | distributions associated with the test predictions} 82 | 83 | In addition, if \code{return_train_nodes} is \code{TRUE}, then a \code{data.table} 84 | called \code{train_nodes} is also returned, indicating what out-of-bag prediction 85 | errors each terminal node of each tree in \code{forest} contains. 86 | } 87 | \description{ 88 | Estimates the conditional misclassification rates, conditional mean squared 89 | prediction errors, conditional biases, conditional prediction intervals, and 90 | conditional error distributions of random forest predictions. 91 | } 92 | \details{ 93 | This function accepts classification or regression random forests built using 94 | the \code{randomForest}, \code{ranger}, \code{randomForestSRC}, and 95 | \code{quantregForest} packages. When training the random forest using 96 | \code{randomForest}, \code{ranger}, or \code{quantregForest}, \code{keep.inbag} 97 | must be set to \code{TRUE}. When training the random forest using 98 | \code{randomForestSRC}, \code{membership} must be set to \code{TRUE}. 99 | 100 | The predictions computed by \code{ranger} can be parallelized by setting the 101 | value of \code{n.cores} to be greater than 1. 102 | 103 | The random forest predictions are always returned as a \code{data.frame}. Additional 104 | columns are included in the \code{data.frame} depending on the user's selections in 105 | the argument \code{what}.
In particular, including \code{"mspe"} in \code{what} 106 | will add an additional column with the conditional mean squared prediction 107 | error of each test prediction to the \code{data.frame}; including \code{"bias"} in 108 | \code{what} will add an additional column with the conditional bias of each test 109 | prediction to the \code{data.frame}; including \code{"interval"} in \code{what} 110 | will add to the \code{data.frame} additional columns with the lower and 111 | upper bounds of conditional prediction intervals for each test prediction; 112 | and including \code{"mcr"} in \code{what} will add an additional column with 113 | the conditional misclassification rate of each test prediction to the 114 | \code{data.frame}. The conditional misclassification rate can be estimated 115 | only for classification random forests, while the other parameters can be 116 | estimated only for regression random forests. 117 | 118 | If \code{"p.error"} or \code{"q.error"} is included in \code{what}, or if 119 | \code{return_train_nodes} is set to \code{TRUE}, then a list will be returned 120 | as output. The first element of the list, named \code{"estimates"}, is the 121 | \code{data.frame} described in the above paragraph. The other elements of the 122 | list are the estimated cumulative distribution functions (\code{perror}) of 123 | the conditional error distributions, the estimated quantile functions 124 | (\code{qerror}) of the conditional error distributions, and/or a \code{data.table} 125 | indicating what out-of-bag prediction errors each terminal node of each tree 126 | in the random forest contains. 
127 | } 128 | \examples{ 129 | # load data 130 | data(airquality) 131 | 132 | # remove observations with missing predictor variable values 133 | airquality <- airquality[complete.cases(airquality), ] 134 | 135 | # get number of observations and the response column index 136 | n <- nrow(airquality) 137 | response.col <- 1 138 | 139 | # split data into training and test sets 140 | train.ind <- sample(c("A", "B", "C"), n, 141 | replace = TRUE, prob = c(0.8, 0.1, 0.1)) 142 | Xtrain <- airquality[train.ind == "A", -response.col] 143 | Ytrain <- airquality[train.ind == "A", response.col] 144 | Xtest1 <- airquality[train.ind == "B", -response.col] 145 | Xtest2 <- airquality[train.ind == "C", -response.col] 146 | 147 | # fit regression random forest to the training data 148 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5, 149 | ntree = 500, 150 | keep.inbag = TRUE) 151 | 152 | # estimate conditional mean squared prediction errors, 153 | # biases, prediction intervals, and error distribution 154 | # functions for the observations in Xtest1. return 155 | # train_nodes to avoid recomputation in the next 156 | # line of code. 157 | output1 <- quantForestError(rf, Xtrain, Xtest1, 158 | return_train_nodes = TRUE) 159 | 160 | # estimate just the conditional mean squared prediction errors 161 | # and prediction intervals for the observations in Xtest2. 162 | # avoid recomputation by providing train_nodes from the 163 | # previous line of code. 
164 | output2 <- quantForestError(rf, Xtrain, Xtest2, 165 | what = c("mspe", "interval"), 166 | train_nodes = output1$train_nodes) 167 | 168 | # for illustrative purposes, convert response to categorical 169 | Ytrain <- as.factor(Ytrain > 31.5) 170 | 171 | # fit classification random forest to the training data 172 | rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3, 173 | ntree = 500, 174 | keep.inbag = TRUE) 175 | 176 | # estimate conditional misclassification rate of the 177 | # predictions of Xtest1 178 | output <- quantForestError(rf, Xtrain, Xtest1) 179 | 180 | } 181 | \seealso{ 182 | \code{\link{perror}}, \code{\link{qerror}}, \code{\link{findOOBErrors}} 183 | } 184 | \author{ 185 | Benjamin Lu \code{}; Johanna Hardin \code{} 186 | } 187 | --------------------------------------------------------------------------------
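The `perror` and `qerror` functions documented above are built on out-of-bag weighting of out-of-bag prediction errors (Lu and Hardin, 2021). The sketch below illustrates that idea in self-contained base R for a single test observation; it is not the package's implementation (which derives the weights from the terminal nodes each test observation shares with the out-of-bag training observations), and the error values and co-occurrence counts are made up for illustration.

```r
# Illustrative base-R sketch of the out-of-bag weighting idea behind
# perror() and qerror(). Hypothetical data: pooled out-of-bag errors
# for one test observation, weighted by made-up terminal-node
# co-occurrence counts.
oob_errs <- c(-3.2, -1.1, 0.4, 0.9, 2.5, 4.0)
cooccur  <- c(1, 2, 3, 2, 1, 1)
w        <- cooccur / sum(cooccur)  # normalize weights to sum to 1

# weighted empirical CDF of the conditional error distribution
perror_sketch <- function(q) {
  vapply(q, function(x) sum(w[oob_errs <= x]), numeric(1))
}

# weighted quantile function: the smallest error whose cumulative
# weight reaches the requested probability
qerror_sketch <- function(p) {
  ord <- order(oob_errs)
  cum <- cumsum(w[ord])
  vapply(p, function(pr) oob_errs[ord][which(cum >= pr)[1]], numeric(1))
}

perror_sketch(c(-4, 0, 7))   # estimated P(error <= q) at each q
qerror_sketch(c(0.25, 0.8))  # estimated error quantiles
```

In the package itself, the weights come from counting, across trees, how often each training observation is out of bag in a terminal node shared with the test observation; subtracting these error quantiles from a test prediction is also what yields the `lower_alpha`/`upper_alpha` prediction-interval bounds.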