├── Ch_03.Rnw ├── Ch_03.pdf ├── Ch_03_Ex_Sol.bib ├── Ch_04.Rnw ├── Ch_04.pdf ├── Ch_04_Ex_Sol.bib ├── Ch_06.Rnw ├── Ch_06.pdf ├── Ch_07.Rnw ├── Ch_07.pdf ├── Ch_07_Ex_Sol.bib ├── Ch_08.Rnw ├── Ch_08.pdf ├── Ch_08_Ex_Sol.bib ├── Ch_12_Ex_Sol.bib ├── Ch_13_Ex_Sol.bib └── readme.md /Ch_03.Rnw: -------------------------------------------------------------------------------- 1 | 2 | \documentclass[12pt]{article} 3 | 4 | \usepackage{amsmath} 5 | \usepackage{graphicx} 6 | \usepackage{color} 7 | \usepackage{xspace} 8 | \usepackage{fancyvrb} 9 | \usepackage[ 10 | colorlinks=true, 11 | linkcolor=blue, 12 | citecolor=blue, 13 | urlcolor=blue] 14 | {hyperref} 15 | 16 | \usepackage[default]{jasa_harvard} 17 | %\usepackage{JASA_manu} 18 | 19 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 20 | 21 | \setlength{\oddsidemargin}{-.25 truein} 22 | \setlength{\evensidemargin}{0truein} 23 | \setlength{\topmargin}{-0.2truein} 24 | \setlength{\textwidth}{7 truein} 25 | \setlength{\textheight}{8.5 truein} 26 | \setlength{\parindent}{0truein} 27 | \setlength{\parskip}{0.07truein} 28 | 29 | \definecolor{darkred}{rgb}{0.6,0.0,0} 30 | \definecolor{darkblue}{rgb}{.165, 0, .659} 31 | \definecolor{grey}{rgb}{0.85,0.85,0.85} 32 | \definecolor{darkorange}{rgb}{1,0.54,0} 33 | 34 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 35 | 36 | 37 | \newcommand{\bld}[1]{\mbox{\boldmath $#1$}} 38 | \newcommand{\shell}[1]{\mbox{$#1$}} 39 | \renewcommand{\vec}[1]{\mbox{\bf {#1}}} 40 | 41 | \newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize} 42 | \newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize} 43 | 44 | \newcommand{\halfs}{\frac{1}{2}} 45 | 46 | \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}} 47 | \fvset{fontsize=\footnotesize} 48 | 49 | \newcommand{\website}[1]{{\textsf{#1}}} 50 | \newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}} 51 | \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} 52 | \renewcommand{\pkg}[1]{{\textsf{#1}}} 53 | \newcommand{\todo}[1]{TODO: {\bf \textcolor{darkred}{#1}}} 54 | \newcommand{\Dag}{$^\dagger$} 55 | \newcommand{\Ast}{$^\ast$} 56 | 57 | 58 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 59 | 60 | \begin{document} 61 | 62 | <>= 63 | opts_chunk$set(tidy=FALSE,message=FALSE,size='footnotesize', 64 | background = 'white',comment=NA, digits = 3, 65 | prompt = TRUE) 66 | @ 67 | 68 | 69 | \title{ Exercises for \\ {\it Applied Predictive Modeling} \\ Chapter 3 --- Data Pre--Processing} 70 | \author{Max Kuhn, Kjell Johnson} 71 | \date{Version 1\\ \today} 72 | 73 | 74 | <>= 75 | library(caret) 76 | library(AppliedPredictiveModeling) 77 | library(mlbench) 78 | library(vcd) 79 | library(corrplot) 80 | 81 | options(width = 105) 82 | textList <- function (x, period = FALSE, last = " and ") 83 | { 84 | if (!is.character(x)) 85 | x <- as.character(x) 86 | numElements <- length(x) 87 | out <- if (length(x) > 0) { 88 | switch(min(numElements, 3), x, paste(x, collapse = last), 89 | { 90 | x <- paste(x, c(rep(",", numElements - 2), last, 91 | ""), sep = "") 92 | paste(x, collapse = " ") 93 | }) 94 | } 95 | else "" 96 | if (period) 97 | out <- paste(out, ".", sep = "") 98 | out 99 | } 100 | hook_inline = knit_hooks$get('inline') 101 | knit_hooks$set(inline = function(x) { 102 | if (is.character(x)) highr::hi_latex(x) else hook_inline(x) 103 | }) 104 | knit_theme$set("bclear") 105 | options(width = 80) 106 | @ 107 | 108 | 
\newcommand{\knnum}[1]{{\tt \small \hlnum{#1}}} 109 | \newcommand{\knarg}[1]{{\tt \small \hlkwc{#1}}} 110 | \newcommand{\knfun}[1]{{\tt \small \hlkwd{#1}}} 111 | \newcommand{\knlstr}[1]{{\tt \small \hlstr{#1}}} 112 | 113 | \maketitle 114 | 115 | \thispagestyle{empty} 116 | 117 | The solutions in this file use several \pkg{R} packages not used in the text. To install all of the packages needed for this document, use: 118 | 119 | <>= 120 | install.packages(c("AppliedPredictiveModeling", "car", "caret", "corrplot", 121 | "e1071", "mlbench", "subselect", "reshape2", "vcd")) 122 | @ 123 | 124 | One note about these exercises: the type and amount of pre--processing depend on the model being used. The results shown here are appropriate for models that have significant pre--processing requirements for the predictor variables. 125 | 126 | \section*{Exercise 1} 127 | <>= 128 | data(Glass) 129 | data(Soybean) 130 | @ 131 | 132 | \label{P:PreProcessGlass} 133 | The UC Irvine Machine Learning Repository\footnote{\website{http://archive.ics.uci.edu/ml/index.html}} contains a data set related to glass identification. The data consist of \Sexpr{nrow(Glass)} glass samples labeled as one of seven class categories. There are \Sexpr{ncol(Glass)-1} predictors, including the refractive index and percentages of \Sexpr{ncol(Glass)-2} elements: \Sexpr{textList(names(Glass[,-c(1, ncol(Glass))]))}. 134 | 135 | The data can be accessed via: 136 | <>= 137 | library(mlbench) 138 | data(Glass) 139 | str(Glass) 140 | @ 141 | 142 | \begin{itemize} 143 | \item[] (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors. 144 | \item[] (b) Do there appear to be any outliers in the data? Are any predictors skewed? 145 | \item[] (c) Are there any relevant transformations of one or more predictors that might improve the classification model? 146 | \end{itemize} 147 | 148 | \subsection*{Solutions} 149 | 150 | To examine the predictor distributions, individual histograms or density plots are useful. To look at them in a single plot, we will first ``melt'' the data into a ``long'' format so that the predictors are not in separate columns: 151 | 152 | <>= 153 | library(reshape2) 154 | meltedGlass <- melt(Glass, id.vars = "Type") 155 | head(meltedGlass) 156 | @ 157 | 158 | Now, we can use the \pkg{lattice} function \knfun{densityplot} to look at each predictor. The code used to create Figure \ref{F:Glass_dens} is: 159 | 160 | <>= 161 | library(lattice) 162 | densityplot(~value|variable, 163 | data = meltedGlass, 164 | ## Adjust each axis so that the measurement scale is 165 | ## different for each panel 166 | scales = list(x = list(relation = "free"), 167 | y = list(relation = "free")), 168 | ## 'adjust' smooths the curve out 169 | adjust = 1.25, 170 | ## change the symbol on the rug for each data point 171 | pch = "|", 172 | xlab = "Predictor") 173 | @ 174 | 175 | From Figure \ref{F:Glass_dens}, we can see that \texttt{K} and \texttt{Mg} appear to have possible second modes around zero and that several predictors (\texttt{Ca}, \texttt{Ba}, \texttt{Fe} and \texttt{RI}) show signs of skewness. There may be one or two outliers in \texttt{K}, but they could simply be due to natural skewness. Also, predictors \texttt{Ca}, \texttt{RI}, \texttt{Na} and \texttt{Si} have concentrations of samples in the middle of the scale and a small number of data points at the edges of the distribution. This characteristic is indicative of a ``heavy--tailed'' distribution.
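As a quick numerical companion to these plots, we can count how many samples have a value of exactly zero for each predictor (column 10 is the outcome, \texttt{Type}); this also foreshadows the transformation issues discussed below:

<<>>=
## Number of samples with a value of exactly zero for each predictor
colSums(Glass[, -10] == 0)
@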
176 | 177 | \begin{figure} 178 | \begin{center} 179 | <>= 180 | bookTheme() 181 | densityplot(~value|variable, 182 | data = meltedGlass, 183 | scales = list(x = list(relation = "free"), 184 | y = list(relation = "free")), 185 | adjust = 1.25, 186 | pch = "|", 187 | xlab = "Predictor") 188 | @ 189 | \caption{Density plots of each of the predictors in the original Glass data set. The points along the $x$--axis show the values of the individual samples. } 190 | \label{F:Glass_dens} 191 | \end{center} 192 | \end{figure} 193 | 194 | A scatterplot matrix can also be helpful for visualizing a data set of this size (i.e. 9 predictor variables). The \pkg{lattice} function \knfun{splom} was used to create the scatterplot matrix in Figure \ref{F:Glass_splom}. This visualization highlights several other important characteristics of these data: 195 | 196 | \begin{enumerate} 197 | \item The measurements of some glass types, specifically \texttt{Fe}, \texttt{Ba}, \texttt{K} and \texttt{Mg}, are zero. This creates a ``mixture distribution'' of points; one distribution consists of glass types containing the element in question whereas the other does not. This finding implies that the samples in these distributions may not behave in the same manner. 198 | \item Most predictors are uncorrelated, with the exception of the pairs \texttt{Ca}/\texttt{RI} and \texttt{Ca}/\texttt{Na}. 199 | \item Many of the pair--wise combinations have very non--standard distributions (i.e. heavy tails or mixtures of distributions). 200 | \item It is difficult to tell if the extreme point in the \texttt{K} data is an outlier or just an artifact of a skewed distribution that has not been sampled enough. In either case, this should be accounted for through the modeling, preferably by using models that are resistant to outliers. 201 | \end{enumerate} 202 | 203 | 204 | \setkeys{Gin}{width=.85\textwidth} 205 | \begin{figure} 206 | \begin{center} 207 | <>= 208 | splom(~Glass[, -10], pch = 16, col = rgb(.2, .2, .2, .4), cex = .7) 209 | @ 210 | \caption{A scatterplot matrix of the predictors in the original Glass data set.} 211 | \label{F:Glass_splom} 212 | \end{center} 213 | \end{figure} 214 | 215 | Would transformations help these data? Based on our findings above, we need to investigate transformations of individual predictors that will resolve skewness and/or outliers (e.g. the spatial sign transformation). 216 | 217 | For skewness, first note that several predictors have values of zero. This excludes transformations such as the log transformation or the Box--Cox family of transformations. When we are faced with predictors containing zero values, the Yeo--Johnson family of transformations can be helpful\footnote{We were not aware of this set of transformations when the text was written. } \cite{yeo2000new}. This family of transformations is very similar to the Box--Cox transformation, but can handle zero or negative predictor values. The transformation can be estimated using \pkg{caret}'s \knfun{preProcess} function: 218 | 219 | <>= 220 | yjTrans <- preProcess(Glass[, -10], method = "YeoJohnson") 221 | yjData <- predict(yjTrans, newdata = Glass[, -10]) 222 | melted <- melt(yjData) 223 | @ 224 | 225 | The resulting density plots are shown in Figure \ref{F:Glass_dens_trans}. The only substantive change relative to the original distributions is that a second mode was induced for predictors \texttt{Ba} and \texttt{Fe}. Given these results, this transformation did not seem to improve the data (in terms of skewness).
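We can quantify this impression by comparing the per--predictor skewness before and after the transformation. The short sketch below uses the \knfun{skewness} function from the \pkg{e1071} package and the \texttt{yjData} object created above:

<<>>=
library(e1071)
## Compare skewness before and after the Yeo-Johnson transformation
before <- apply(Glass[, -10], 2, skewness)
after  <- apply(yjData, 2, skewness)
round(cbind(original = before, transformed = after), 2)
@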
226 | 227 | \setkeys{Gin}{width=.8\textwidth} 228 | \begin{figure} 229 | \begin{center} 230 | <>= 231 | bookTheme() 232 | densityplot(~value|variable, 233 | data = melted, 234 | scales = list(x = list(relation = "free"), 235 | y = list(relation = "free")), 236 | adjust = 1.25, 237 | pch = "|", 238 | xlab = "Predictor") 239 | @ 240 | \caption{Density plots of the Glass predictors after a Yeo--Johnson transformation. } 241 | \label{F:Glass_dens_trans} 242 | \end{center} 243 | \end{figure} 244 | 245 | Next, we will apply the spatial sign transformation to attempt to mitigate outliers. For these data, we first center and scale the predictors, then apply the transformation: 246 | 247 | <>= 248 | centerScale <- preProcess(Glass[, -10], method = c("center", "scale")) 249 | csData <- predict(centerScale, newdata = Glass[, -10]) 250 | ssData <- spatialSign(csData) 251 | splom(~ssData, pch = 16, col = rgb(.2, .2, .2, .4), cex = .7) 252 | @ 253 | 254 | Figure \ref{F:Glass_splom_ss} shows the results. Many of the possible outliers have been contracted into the mainstream of the data. This transformation did result in at least one new pattern: the samples with zero values for both \texttt{Fe} and \texttt{Ba} are now projected onto a straight line in these two dimensions. 255 | 256 | \begin{figure} 257 | \begin{center} 258 | <>= 259 | centerScale <- preProcess(Glass[, -10], method = c("center", "scale")) 260 | csData <- predict(centerScale, newdata = Glass[, -10]) 261 | ssData <- spatialSign(csData) 262 | splom(~ssData, pch = 16, col = rgb(.2, .2, .2, .4), cex = .7) 263 | @ 264 | \caption{A scatterplot matrix of the Glass data after the spatial sign transformation.} 265 | \label{F:Glass_splom_ss} 266 | \end{center} 267 | \end{figure} 268 | 269 | While we were unable to resolve the skewness in these data via transformations, we were able to minimize the number of unusually extreme observations. Note that attempts to pre--process data to resolve predictor distribution problems are not always successful. Our best efforts in pre--processing may not yield highly desirable transformed values. Under these kinds of circumstances, we will need to use models that are not unduly affected by skewed distributions (e.g. tree--based methods). 270 | 271 | 272 | \clearpage 273 | 274 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 275 | 276 | \section*{Exercise 2} 277 | \label{P:PreProcessSoybeans} 278 | The Soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in \Sexpr{nrow(Soybean)} soybeans. The \Sexpr{ncol(Soybean)-1} predictors are mostly categorical and include information on the environmental conditions (e.g. temperature, precipitation) and plant conditions (e.g. leaf spots, mold growth). The outcome labels consist of \Sexpr{length(levels(Soybean$Class))} distinct classes. 279 | 280 | The data can be loaded via: 281 | 282 | <>= 283 | library(mlbench) 284 | data(Soybean) 285 | ## See ?Soybean for details 286 | @ 287 | 288 | \begin{itemize} 289 | \item[] (a) Investigate the frequency distributions for the categorical predictors. Are the distributions likely to cause issues for models? 290 | \item[] (b) Roughly \Sexpr{round(mean(!complete.cases(Soybean))*100)}$\%$ of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes? 291 | \item[] (c) Develop a strategy for dealing with the missing data, either by eliminating predictors or imputation.
292 | \end{itemize} 293 | 294 | \subsection*{Solutions} 295 | The contents of the Soybean data frame are: 296 | 297 | <>= 298 | str(Soybean) 299 | @ 300 | 301 | When we look closely at this output, we see that the factor levels of some predictors are not informative. For example, the \texttt{temp} column contains integer values. These values correspond to the relative temperature: below average, average and above average. For our understanding of the data, it would be very helpful to change these integer values to their actual meanings. This change can be done using the \knfun{recode} function in the \pkg{car} package. We can also make missing values an independent category so that we see them in tables: 302 | 303 | <>= 304 | Soybean2 <- Soybean 305 | table(Soybean2$temp, useNA = "always") 306 | library(car) 307 | Soybean2$temp <- recode(Soybean2$temp, 308 | "0 = 'low'; 1 = 'norm'; 2 = 'high'; NA = 'missing'", 309 | levels = c("low", "norm", "high", "missing")) 310 | table(Soybean2$temp) 311 | @ 312 | 313 | For this part of the solution, we will look at the relationships between the months, temperature and precipitation. To explore these relationships, we need to recode months and precipitation: 314 | 315 | <>= 316 | table(Soybean2$date, useNA = "always") 317 | Soybean2$date <- recode(Soybean2$date, 318 | "0 ='apr';1='may';2='june';3='july';4='aug';5='sept';6='oct';NA = 'missing'", 319 | levels = c("apr", "may", "june", "july", "aug", "sept", "oct", "missing")) 320 | table(Soybean2$date) 321 | 322 | table(Soybean2$precip, useNA = "always") 323 | Soybean2$precip <- recode(Soybean2$precip, 324 | "0 = 'low'; 1 = 'norm'; 2 = 'high'; NA = 'missing'", 325 | levels = c("low", "norm", "high", "missing")) 326 | table(Soybean2$precip) 327 | @ 328 | 329 | To start, let's look at the date predictor. Are the months represented equally? From the table above, we can see that June through September have the most data and that there is a single missing value. For precipitation (ironically) most of the data are above average. In addition, the temperature and precipitation columns have missing value rates of about 5$\%$. 330 | 331 | As in the previous problem, we should examine the pair--wise or joint distributions of these predictors. Joint distributions of factor predictors are often displayed in a contingency table. There are also several ways that these distributions can be displayed in a graph. The \knfun{mosaic} function in the \pkg{vcd} package \cite{Hornik:2006tv} and the \knfun{barchart} function in the \pkg{lattice} package are two options. What does the joint distribution of temperature and month look like? First, we will use a mosaic plot: 332 | 333 | <>= 334 | library(vcd) 335 | ## mosaic() can accept a table or a formula: 336 | mosaic(~date + temp, data = Soybean2) 337 | @ 338 | 339 | Alternatively, a bar chart can also be used: 340 | 341 | <>= 342 | barchart(table(Soybean2$date, Soybean2$temp), 343 | auto.key = list(columns = 4, title = "temperature")) 344 | @ 345 | 346 | The results are shown in Figure \ref{F:Soy_biv}. Note that in the bar chart, the bars are not cumulative (i.e. missing values are not the most frequent category). Here we see which months are the most frequent. Additionally, we see that average temperatures are the most frequent category within each month, although high temperatures are also very likely in September. Missing values are most likely in July. One useful option to \knfun{barchart} is \knarg{stack} to create stacked bars.
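For example, the same table could be displayed with cumulative bars (the \knarg{stack} argument is passed through to \knfun{panel.barchart}):

<<>>=
barchart(table(Soybean2$date, Soybean2$temp),
         stack = TRUE,
         auto.key = list(columns = 4, title = "temperature"))
@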
347 | 348 | \setkeys{Gin}{width=1\textwidth} 349 | \begin{figure} 350 | \begin{center} 351 | <>= 352 | library(vcd) 353 | ## mosaic() can accept a table or a formula: 354 | mosaic(~date + temp, data = Soybean2) 355 | @ 356 | 357 | \vspace{.01in} 358 | 359 | <>= 360 | barchart(table(Soybean2$date, Soybean2$temp), 361 | auto.key = list(columns = 4, title = "temperature")) 362 | @ 363 | \caption{Mosaic and bar charts of the joint frequency distribution for month and temperature. } 364 | \label{F:Soy_biv} 365 | \end{center} 366 | \end{figure} 367 | 368 | To investigate higher--order relationships, predictors can be added to the table or formula to create more complex visualizations (e.g. panels in the \pkg{lattice} plots, etc). 369 | 370 | What does the distribution look like per response class for the missing data? If we look at the frequency of {\em any} missing predictor value per class, the results show that some classes are more problematic than others: 371 | 372 | <>= 373 | table(Soybean$Class, complete.cases(Soybean)) 374 | hasMissing <- unlist(lapply(Soybean, function(x) any(is.na(x)))) 375 | hasMissing <- names(hasMissing)[hasMissing] 376 | head(hasMissing) 377 | @ 378 | 379 | There are several classes where all of the samples have at least one missing predictor value. Are these concentrated in a single predictor that we could remove? We can get the percentage of missing values for each predictor by class using the following syntax: 380 | 381 | <>= 382 | byPredByClass <- apply(Soybean[, hasMissing], 2, 383 | function(x, y) { 384 | tab <- table(is.na(x), y) 385 | tab[2,]/apply(tab, 2, sum) 386 | }, 387 | y = Soybean$Class) 388 | 389 | ## The columns are predictors and the rows are classes. Let's eliminate 390 | ## any rows and columns with no missing values 391 | 392 | byPredByClass <- byPredByClass[apply(byPredByClass, 1, sum) > 0,] 393 | byPredByClass <- byPredByClass[, apply(byPredByClass, 2, sum) > 0] 394 | 395 | ## now print: 396 | t(byPredByClass) 397 | @ 398 | 399 | From this output, we see that there are many predictors completely missing for the \texttt{2-4-d-injury}, \texttt{cyst-nematode} and \texttt{herbicide-injury} classes. The \texttt{phytophthora-rot} class has a high rate of missing data across many predictors and the \texttt{diaporthe-pod-$\&$-stem-blight} has a more moderate pattern of missing data. 400 | 401 | One approach to handling missing data is to use an imputation technique. However, it is unlikely that imputation will help since almost 100$\%$ of the predictor values will need to be imputed in a few cases. We could encode the missing as another level or eliminate the classes associated with the high rate of missing values from the data altogether. 402 | 403 | 404 | How would the frequencies of the predictor values affect the modeling process? If we are using a model that is sensitive to sparsity then the low rate of some of the factor levels might be an issue. We can convert the factors to a set of dummy variables and see how good or bad the sparsity is. 405 | 406 | <>= 407 | ## Some of the factors are ordinal. First convert them to unordered factors so 408 | ## that we get a set of binary indicators. 
409 | 410 | orderedVars <- unlist(lapply(Soybean, is.ordered)) 411 | orderedVars <- names(orderedVars)[orderedVars] 412 | 413 | ## Let's bypass the problem of missing data by removing the offending classes 414 | 415 | completeClasses <- as.character(unique(Soybean$Class[complete.cases(Soybean)])) 416 | Soybean3 <- subset(Soybean, Class %in% completeClasses) 417 | for(i in orderedVars) Soybean3[, i] <- factor(as.character(Soybean3[, i])) 418 | 419 | ## Use dummyVars to generate the binary predictors... 420 | dummyInfo <- dummyVars(Class ~ ., data = Soybean3) 421 | dummies <- predict(dummyInfo, Soybean3) 422 | 423 | ## ... then nearZeroVar to figure out which should be removed. 424 | predDistInfo <- nearZeroVar(dummies, saveMetrics = TRUE) 425 | head(predDistInfo) 426 | ## The number and percentage of predictors to remove: 427 | sum(predDistInfo$nzv) 428 | mean(predDistInfo$nzv) 429 | @ 430 | 431 | So if we wanted to remove sparse and unbalanced predictors, \Sexpr{round(mean(predDistInfo$nzv)*100,1)}$\%$ of the dummy variables would be eliminated. One way around this is to use models that are not sensitive to this characteristic, such as tree-- or rule--based models, or na\"{\i}ve Bayes. 432 | 433 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 434 | \clearpage 435 | 436 | <>= 437 | data(BloodBrain) 438 | @ 439 | 440 | \section*{Exercise 3} 441 | \label{P:PreProcessQSARProb} 442 | Chapter 5 introduces Quantitative Structure--Activity Relationship (QSAR) modeling where the characteristics of a chemical compound are used to predict other chemical properties. The \pkg{caret} package contains such a data set from \cite{Mente:2005vb}. Here, the ability of a chemical to permeate the blood--brain barrier was experimentally determined for \Sexpr{length(logBBB)} compounds. \Sexpr{ncol(bbbDescr)} predictors were measured for each compound. 443 | \begin{itemize} 444 | \item[] (a) Start \pkg{R} and use these commands to load the data: 445 | 446 | <>= 447 | library(caret) 448 | data(BloodBrain) 449 | # use ?BloodBrain to see more details 450 | @ 451 | 452 | \item[] The numeric outcome is contained in the vector \texttt{logBBB} while the predictors are in the data frame \texttt{bbbDescr}. 453 | \item[] (b) Do any of the individual predictors have degenerate distributions? 454 | \item[] (c) Generally speaking, are there strong relationships between the predictor data? If so, how could correlations in the predictor set be reduced? Does this have a dramatic effect on the number of predictors available for modeling? 455 | \end{itemize} 456 | 457 | 458 | \subsection*{Solutions} 459 | 460 | For these data, the first assessment looks for sparse and unbalanced predictors. The \pkg{caret} \knfun{nearZeroVar} function is used again, but this time with the \knarg{saveMetrics} option that retains information about each predictor: 461 | 462 | <>= 463 | ncol(bbbDescr) 464 | predictorInfo <- nearZeroVar(bbbDescr, saveMetrics = TRUE) 465 | head(predictorInfo) 466 | ## Are there any near-zero variance predictors? 467 | rownames(predictorInfo)[predictorInfo$nzv] 468 | ## Examples: 469 | table(bbbDescr$a_acid) 470 | table(bbbDescr$alert) 471 | ## Let's get rid of these: 472 | filter1 <- bbbDescr[, !predictorInfo$nzv] 473 | ncol(filter1) 474 | @ 475 | 476 | As mentioned in the text, there are some models that are resistant to near--zero variance predictors and, for these models, we would most likely leave them in. 477 | 478 | What about the distributions of the remaining predictors?
Although, it is time consuming to look at individual density plots of \Sexpr{ncol(filter1)} predictors, we do recommend it (or at least looking at a sample of predictors). For example, the top panel of Figure \ref{F:bbb_dens_trans} shows a random sample of eight predictors: 479 | 480 | <>= 481 | set.seed(532) 482 | sampled1 <- filter1[, sample(1:ncol(filter1), 8)] 483 | names(sampled1) 484 | @ 485 | 486 | A few of these predictors exhibit skewness and one (\texttt{frac.cation7.}) shows two distinct modes. Based on the rug plot of points in the panel for \texttt{o$\_$sp2}, these data are also likely to be bimodal. 487 | 488 | To numerically assess skewness, the function from the \pkg{e1071} package is used again: 489 | 490 | <>= 491 | library(e1071) 492 | skew <- apply(filter1, 2, skewness) 493 | summary(skew) 494 | @ 495 | 496 | 497 | There are a number of predictors that are left-- or right--skewed. We can again apply the Yeo--Johnson transformation to the data (some of the predictors are negative): 498 | 499 | <>= 500 | yjBBB <- preProcess(filter1, method = "YeoJohnson") 501 | transformed <- predict(yjBBB, newdata = filter1) 502 | sampled2 <- transformed[, names(sampled1)] 503 | @ 504 | 505 | \setkeys{Gin}{width=.8\textwidth} 506 | \begin{figure} 507 | \begin{center} 508 | <>= 509 | bookTheme() 510 | set.seed(532) 511 | sampled1 <- filter1[, sample(1:ncol(filter1), 8)] 512 | densityplot(~value|variable, 513 | data = melt(sampled1), 514 | scales = list(x = list(relation = "free"), 515 | y = list(relation = "free")), 516 | adjust = 1.25, 517 | pch = "|", 518 | main = "Original", 519 | xlab = "Predictor") 520 | 521 | @ 522 | 523 | \vspace{.1in} 524 | 525 | <>= 526 | bookTheme() 527 | densityplot(~value|variable, 528 | data = melt(sampled2), 529 | scales = list(x = list(relation = "free"), 530 | y = list(relation = "free")), 531 | adjust = 1.25, 532 | pch = "|", 533 | main = "Transformed", 534 | xlab = "Predictor") 535 | 536 | @ 537 | \caption{Density plots of the blood--brain barrier predictors before and after a Yeo--Johnson transformation. } 538 | \label{F:bbb_dens_trans} 539 | \end{center} 540 | \end{figure} 541 | 542 | The results for the sampled predictors are shown in the bottom panel of Figure \ref{F:bbb_dens_trans}. Although the distributions for \texttt{fpsa3} and \texttt{wpsa2} are more symmetric, the other predictors have either additional modes or more pronounced modes. One option would be to manually assess which predictors would benefit from this type of transformation. 543 | 544 | Is there severe correlation between the predictors? Based on previous experience with these types of data, there are likely to be many relationships between predictors. For example, when we examine the predictor names we find that \Sexpr{sum(grepl("(psa)|(fsa)|(nsa)", names(filter1)))} are some type of surface area predictor. These are most likely correlated to some extent. Also, surface area is usually related to the size (or weight) of a molecule, so additional correlations may exist. 545 | 546 | The correlation matrix of the predictors can be computed and examined. However, we know that many predictors are skewed in these data. Since the correlation is a function of squared values of the predictors, the samples in the tails of the predictor distributions may have a significant effect on the correlation structure. 
For this reason, we will look at the correlation structure three ways: the untransformed data, the data after the Yeo--Johnson transformation, and the data after a spatial sign transformation. 547 | 548 | <>= 549 | rawCorr <- cor(filter1) 550 | transCorr <- cor(transformed) 551 | 552 | ssData <- spatialSign(scale(filter1)) 553 | ssCorr <- cor(ssData) 554 | @ 555 | 556 | <>= 557 | library(corrplot) 558 | ## plot the matrix with no labels or grid 559 | corrplot(rawCorr, order = "hclust", addgrid.col = NA, tl.pos = "n") 560 | corrplot(transCorr, order = "hclust", addgrid.col = NA, tl.pos = "n") 561 | ssData <- spatialSign(scale(filter1)) 562 | ssCorr <- cor(ssData) 563 | corrplot(ssCorr, order = "hclust", addgrid.col = NA, tl.pos = "n") 564 | @ 565 | 566 | \begin{figure} 567 | \begin{center} 568 | <>= 569 | corrplot(rawCorr, order = "hclust", addgrid.col = rgb(.2, .2, .2, 0), tl.pos = "n") 570 | @ 571 | \vspace{.1in} 572 | 573 | <>= 574 | corrplot(transCorr, order = "hclust", addgrid.col = rgb(.2, .2, .2, 0), tl.pos = "n") 575 | @ 576 | \vspace{.1in} 577 | 578 | <>= 579 | corrplot(ssCorr, order = "hclust", addgrid.col = rgb(.2, .2, .2, 0), tl.pos = "n") 580 | @ 581 | \caption{Correlation matrices for the raw data (top), transformed via the Yeo--Johnson transformation (middle) and the spatial sign transformation (bottom). } 582 | \label{F:bbb_corr_mat} 583 | \end{center} 584 | \end{figure} 585 | 586 | 587 | The results are in Figure \ref{F:bbb_corr_mat}. This visualization indicates that correlations lessen with increasing levels of transformations: 588 | 589 | <>= 590 | corrInfo <- function(x) summary(x[upper.tri(x)]) 591 | corrInfo(rawCorr) 592 | corrInfo(transCorr) 593 | corrInfo(ssCorr) 594 | @ 595 | 596 | Rather than transform the data to resolve between--predictor correlations, it may be a better idea to remove predictors. The \pkg{caret} function \knfun{findCorrelation} was described in the text. The user is required to state what level of pair--wise correlations that they are willing to accept. The code below shows (for these data) the trade--off between the correlation threshold, the number of retained predictors, and the average absolute correlation in the data. Figure \ref{F:threshPlot} shows the results. 597 | 598 | 599 | <>= 600 | thresholds <- seq(.25, .95, by = 0.05) 601 | size <- meanCorr <- rep(NA, length(thresholds)) 602 | removals <- vector(mode = "list", length = length(thresholds)) 603 | 604 | for(i in seq_along(thresholds)){ 605 | removals[[i]] <- findCorrelation(rawCorr, thresholds[i]) 606 | subMat <- rawCorr[-removals[[i]], -removals[[i]]] 607 | size[i] <- ncol(rawCorr) -length(removals[[i]]) 608 | meanCorr[i] <- mean(abs(subMat[upper.tri(subMat)])) 609 | } 610 | 611 | corrData <- data.frame(value = c(size, meanCorr), 612 | threshold = c(thresholds, thresholds), 613 | what = rep(c("Predictors", 614 | "Average Absolute Correlation"), 615 | each = length(thresholds))) 616 | @ 617 | 618 | \setkeys{Gin}{width=.8\textwidth} 619 | \begin{figure} 620 | \begin{center} 621 | <>= 622 | bookTheme() 623 | xyplot(value~ threshold|what, data = corrData, 624 | scales = list(y = list(relation = "free")), 625 | type = c("p", "g", "smooth"), 626 | degree = 2, 627 | ylab = "") 628 | @ 629 | \caption{The average absolute correlation and the subset size for each value of the correlation filter threshold. } 630 | \label{F:threshPlot} 631 | \end{center} 632 | \end{figure} 633 | 634 | We can also try the \pkg{subselect} package \cite{Cerdeira2014} to remove predictors. 
This package uses a different criterion to evaluate the quality of a subset and has less greedy methods to search the predictor space. First, we have to remove all linear dependencies from the data. That includes perfect pair--wise correlations as well as relationships between three or more predictors. The \knfun{trim.matrix} function does that: 635 | 636 | <>= 637 | library(subselect) 638 | ncol(rawCorr) 639 | trimmed <- trim.matrix(rawCorr, tolval=1000*.Machine$double.eps)$trimmedmat 640 | ncol(trimmed) 641 | @ 642 | 643 | We can use simulated annealing and genetic algorithms to search for quality subsets. These techniques allow for lower and upper limits for the number of predictors. However, the functions get dramatically slower as the range increases. Here, we will look at one solution found by \knfun{findCorrelation} and, will subsequently use \pkg{subselect} to search within that subset size: 644 | 645 | <>= 646 | set.seed(702) 647 | sa <- anneal(trimmed, kmin = 18, kmax = 18, niter = 1000) 648 | saMat <- rawCorr[sa$bestsets[1,], sa$bestsets[1,]] 649 | 650 | set.seed(702) 651 | ga <- genetic(trimmed, kmin = 18, kmax = 18, nger = 1000) 652 | gaMat <- rawCorr[ga$bestsets[1,], ga$bestsets[1,]] 653 | 654 | fcMat <- rawCorr[-removals[size == 18][[1]], 655 | -removals[size == 18][[1]]] 656 | 657 | corrInfo(fcMat) 658 | corrInfo(saMat) 659 | corrInfo(gaMat) 660 | @ 661 | 662 | The main difference between these results is that the greedy approach of \knfun{findCorrelation} is much more conservative than the techniques found in the \pkg{subselect} package. 663 | 664 | \section*{Session Info} 665 | 666 | <>= 667 | toLatex(sessionInfo()) 668 | @ 669 | 670 | 671 | \bibliographystyle{ECA_jasa} 672 | \bibliography{Ch_03_Ex_Sol} 673 | 674 | 675 | \end{document} 676 | 677 | 678 | 679 | 680 | -------------------------------------------------------------------------------- /Ch_03.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/topepo/APM_Exercises/73d35f3e62522d9e9a61552159bbeda9d6f02d57/Ch_03.pdf -------------------------------------------------------------------------------- /Ch_03_Ex_Sol.bib: -------------------------------------------------------------------------------- 1 | @article{Mente:2005vb, 2 | author = {Mente, S and Lombardo, F}, 3 | title = {{A recursive-partitioning model for blood--brain barrier permeation}}, 4 | journal = {Journal of Computer-Aided Molecular Design}, 5 | year = {2005}, 6 | volume = {19}, 7 | number = {7}, 8 | pages = {465--481} 9 | } 10 | 11 | @Manual{Cerdeira2014, 12 | title= {\pkg{subselect}: Selecting variable subsets}, 13 | author = {J Cerdeira and P Silva And J Cadima And M Minhoto}, 14 | year= {2014}, 15 | url= {http://CRAN.R-project.org/package=subselect} 16 | } 17 | 18 | @article{yeo2000new, 19 | title={A new family of power transformations to improve normality or symmetry}, 20 | author={Yeo, I-K and Johnson, R}, 21 | journal={Biometrika}, 22 | volume={87}, 23 | number={4}, 24 | pages={954--959}, 25 | year={2000} 26 | } 27 | 28 | @Article{Hornik:2006tv, 29 | title = {The strucplot framework: Visualizing multi-way contingency 30 | tables with \pkg{vcd}}, 31 | author = {D Meyer and A Zeileis and K Hornik}, 32 | journal = {Journal of Statistical Software}, 33 | year = {2006}, 34 | volume = {17}, 35 | number = {3}, 36 | pages = {1--48}, 37 | url = {http://www.jstatsoft.org/v17/i03/}, 38 | } 39 | -------------------------------------------------------------------------------- /Ch_04.Rnw: 
-------------------------------------------------------------------------------- 1 | 2 | \documentclass[12pt]{article} 3 | 4 | \usepackage{amsmath} 5 | \usepackage{graphicx} 6 | \usepackage{color} 7 | \usepackage{xspace} 8 | \usepackage{fancyvrb} 9 | \usepackage{booktabs} 10 | \usepackage[ 11 | colorlinks=true, 12 | linkcolor=blue, 13 | citecolor=blue, 14 | urlcolor=blue] 15 | {hyperref} 16 | 17 | \usepackage[default]{jasa_harvard} 18 | %\usepackage{JASA_manu} 19 | 20 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 21 | 22 | \setlength{\oddsidemargin}{-.25 truein} 23 | \setlength{\evensidemargin}{0truein} 24 | \setlength{\topmargin}{-0.2truein} 25 | \setlength{\textwidth}{7 truein} 26 | \setlength{\textheight}{8.5 truein} 27 | \setlength{\parindent}{0truein} 28 | \setlength{\parskip}{0.07truein} 29 | 30 | \definecolor{darkred}{rgb}{0.6,0.0,0} 31 | \definecolor{darkblue}{rgb}{.165, 0, .659} 32 | \definecolor{grey}{rgb}{0.85,0.85,0.85} 33 | \definecolor{darkorange}{rgb}{1,0.54,0} 34 | 35 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 36 | 37 | 38 | \newcommand{\bld}[1]{\mbox{\boldmath $#1$}} 39 | \newcommand{\shell}[1]{\mbox{$#1$}} 40 | \renewcommand{\vec}[1]{\mbox{\bf {#1}}} 41 | 42 | \newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize} 43 | \newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize} 44 | 45 | \newcommand{\halfs}{\frac{1}{2}} 46 | 47 | \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}} 48 | \fvset{fontsize=\footnotesize} 49 | 50 | \newcommand{\website}[1]{{\textsf{#1}}} 51 | \newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}} 52 | \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} 53 | \renewcommand{\pkg}[1]{{\textsf{#1}}} 54 | \newcommand{\todo}[1]{TODO: {\bf \textcolor{darkred}{#1}}} 55 | \newcommand{\Dag}{$^\dagger$} 56 | \newcommand{\Ast}{$^\ast$} 57 | 58 | 59 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 60 | 61 | \begin{document} 62 | <>= 63 | opts_chunk$set(tidy=FALSE,message=FALSE,size='footnotesize', 64 | background = 'white',comment=NA, digits = 3, 65 | prompt = TRUE) 66 | knit_theme$set("bclear") 67 | @ 68 | 69 | \title{ Exercises for \\ {\it Applied Predictive Modeling} \\ Chapter 4 --- Over--Fitting and Model Tuning} 70 | \author{Max Kuhn, Kjell Johnson} 71 | \date{Version 1\\ \today} 72 | 73 | <>= 74 | library(caret) 75 | library(AppliedPredictiveModeling) 76 | library(ggplot2) 77 | library(reshape2) 78 | library(Hmisc) 79 | 80 | options(width = 105) 81 | textList <- function (x, period = FALSE, last = " and ") 82 | { 83 | if (!is.character(x)) 84 | x <- as.character(x) 85 | numElements <- length(x) 86 | out <- if (length(x) > 0) { 87 | switch(min(numElements, 3), x, paste(x, collapse = last), 88 | { 89 | x <- paste(x, c(rep(",", numElements - 2), last, 90 | ""), sep = "") 91 | paste(x, collapse = " ") 92 | }) 93 | } 94 | else "" 95 | if (period) 96 | out <- paste(out, ".", sep = "") 97 | out 98 | } 99 | 100 | hook_inline = knit_hooks$get('inline') 101 | knit_hooks$set(inline = function(x) { 102 | if (is.character(x)) highr::hi_latex(x) else hook_inline(x) 103 | }) 104 | 105 | options(width = 80) 106 | @ 107 | 108 | \newcommand{\apmfun}[1]{{\tt \small \hlkwd{#1}}} 109 | \newcommand{\apmarg}[1]{{\tt \small \hlkwc{#1}}} 110 | \newcommand{\apmstr}[1]{{\tt \small \hlstr{#1}}} 111 | \newcommand{\apmnum}[1]{{\tt \small \hlnum{#1}}} 112 | \newcommand{\apmstd}[1]{{\tt \small \hlstd{#1}}} 113 | 
\newcommand{\apmred}[1]{\textcolor[rgb]{0.8,0.0,0}{#1}}% 114 | 115 | \maketitle 116 | 117 | \thispagestyle{empty} 118 | 119 | The solutions in this file uses several \pkg{R} packages not used in the text. To install all of the packages needed for this document, use: 120 | 121 | <>= 122 | install.packages(c("AppliedPredictiveModeling", "car", "caret", "corrplot", "ggplot2", 123 | "e1071", "Hmisc", "mlbench", "reshape2", "subselect", "vcd")) 124 | @ 125 | 126 | \section*{Exercise 1} 127 | Consider the Music Genre data set described in Sect. 1.4. The objective for these data is to use the predictors to classify music samples into the appropriate music genre. 128 | \begin{itemize} 129 | \item[] (a) What data splitting method(s) would you use for these data? Explain. 130 | \item[] (b) Using tools described in this chapter, provide code for implementing your approach(es). 131 | \end{itemize} 132 | 133 | \subsection*{Solutions} 134 | 135 | 136 | \begin{figure}[h] 137 | \begin{center} 138 | <>= 139 | transparentTheme(pchSize = 1, trans = .7) 140 | music = read.csv("genresTrain.csv") 141 | counts = table(music$GENRE) 142 | 143 | barchart(Freq ~ Var1, 144 | data = as.data.frame(counts), 145 | ylim = c(0, max(counts)*1.1), 146 | ylab = "Frequency", 147 | xlab = "Genre") 148 | @ 149 | \caption[Music Genre Distribution]{The frequency distribution of genres in the music data.} 150 | \label{F:MusicGenreDistribution} 151 | \end{center} 152 | \end{figure} 153 | 154 | The frequency distribution of music genres is presented in Figure \ref{F:MusicGenreDistribution}. When determining the data splitting method, we should focus on two primary characteristics: 155 | \begin{itemize} 156 | \item the number of samples relative to the number of predictors in the data, and 157 | \item the distribution of samples across classes. 158 | \end{itemize} 159 | 160 | For these data, the number of samples is \Sexpr{nrow(music)} and the number of predictors is \Sexpr{ncol(music)-1}. Because the number of samples is more than \Sexpr{floor(nrow(music)/ncol(music))} fold greater than the number of predictors, we likely can split the data into a training set and test set. This split would enable us to independently verify model performance results without impacting the estimation of the optimal tuning parameter(s) selection. However, prior to making this choice, we must examine the class distribution of the response. 161 | 162 | Figure \ref{F:MusicGenreDistribution} clearly displays the imbalance across the classes of music genre with Metal having the lowest percentage of samples ($7\%$) and Classical having the highest percentage of samples ($28\%$). While there is a distinct imbalance, the number of samples in the overall data set is large enough such that resampling or cross-validation techniques have a good chance at randomly selecting samples across all classes in similar proportion to the entire data set. 163 | 164 | When selecting a resampling technique on a large data set, one must consider the computational requirements of each technique and model. $k$-fold cross-validation, with small $k$ (5-10) will be less computationally taxing in this scenario than repeated training/test splits or bootstrapping. However, repeated training/test splits or bootstrapping will likely provide more accurate estimates of tuning parameters and model performance. 
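As a rough sketch of how these choices would be specified when tuning a model with \pkg{caret}'s \apmfun{train} function, the resampling scheme is declared via \apmfun{trainControl} (method names as documented in \code{?trainControl}):

<<>>=
## k-fold CV is the least computationally expensive option ...
cvCtrl  <- trainControl(method = "cv", number = 10)
## ... while repeated training/test splits ("leave-group-out" CV)
## require many more model fits
lgoCtrl <- trainControl(method = "LGOCV", p = 0.8, number = 50)
@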
165 | 166 | The \apmfun{createFolds} function in the \pkg{caret} package can be used to partition the data into $k$ folds such that each fold approximately preserves the class distribution of the entire data set. The following code could be used to create 10 folds: 167 | 168 | <>= 169 | set.seed(31) 170 | tenFoldCV <- createFolds(music$GENRE, k = 10, returnTrain = TRUE) 171 | @ 172 | 173 | \clearpage 174 | 175 | \section*{Exercise 2} 176 | 177 | Consider the permeability data set described in Sect. 1.4. The objective for these data is to use the predictors to model compounds' permeability. 178 | \begin{itemize} 179 | \item[] (a) What data splitting method(s) would you use for these data? Explain. 180 | \item[] (b) Using tools described in this chapter, provide code for implementing your approach(es). 181 | \end{itemize} 182 | 183 | \begin{figure}[h] 184 | \begin{center} 185 | <>= 186 | library(AppliedPredictiveModeling) 187 | data(permeability) 188 | 189 | transparentTheme(pchSize = 1, trans = .7) 190 | histogram(~permeability, xlab = "Permeability") 191 | @ 192 | 193 | \caption[Permeability Distribution]{Distribution of permeability values.} 194 | \label{F:PermeabilityDistribution} 195 | \end{center} 196 | \end{figure} 197 | 198 | \subsection*{Solutions} 199 | 200 | The distribution of the permeability values is presented in Figure \ref{F:PermeabilityDistribution}. In this case, the number of samples is \Sexpr{nrow(fingerprints)} and the number of predictors is \Sexpr{ncol(fingerprints)}. Because the number of samples is small and is much smaller than the number of predictors, splitting the data into a separate training and testing set may hinder our ability to find signal among the predictors and the response. For the permeability data, we should focus on using resampling performance measures to select the optimal tuning parameters and to estimate predictive performance. 201 | 202 | Another characteristic we should consider when determining the resampling approach is the distribution of the response. Clearly, the distribution of permeability values is skewed, with the majority of samples having low values. Because the overall number of samples is small, randomly partitioning the samples may create sets that are not representative of the overall data set. That is, we are likely to have many partitions of the data that contain relatively few of the larger permeability samples. To create more representative, but still random, selections we should use stratification. 203 | 204 | The \apmfun{createDataPartition} function in the \pkg{caret} package performs stratified sampling based on the quantiles of the data. Repeated cross-validation would be an appropriate resampling technique for these data. The following code could be used to create 25 iterations of 10-fold cross-validation: 205 | 206 | <>= 207 | set.seed(72) 208 | repeatedCV <- createMultiFolds(permeability, k = 10, times = 25) 209 | @ 210 | 211 | \code{repeatedCV} is a list that contains vectors of integers for each fold. These integers are the row numbers of the samples that should be used to model the data within the fold. 212 | 213 | Figure \ref{F:ResampledPermeabilityDistribution} illustrates the distribution of the response for 6 of the randomly selected partitions of the data. Notice that the distributions are representative of the overall distribution of the data in Figure \ref{F:PermeabilityDistribution}.
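These folds can be passed directly to \apmfun{trainControl} through its \apmarg{index} argument so that \apmfun{train} uses exactly these stratified splits when tuning a model; a minimal sketch:

<<>>=
## Use the pre-computed folds when tuning a model with train()
ctrl <- trainControl(method = "repeatedcv", index = repeatedCV)
@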
214 | 215 | \begin{figure}[t] 216 | \begin{center} 217 | <>= 218 | set.seed(72) 219 | repeatedCV <- createMultiFolds(permeability, k = 10, times = 25) 220 | dataByFold <- lapply(repeatedCV[1:6], 221 | function(ind, dat) dat[ind,], 222 | dat = permeability) 223 | library(reshape2) 224 | dataByFold <- melt(dataByFold) 225 | 226 | transparentTheme(pchSize = 1, trans = .7) 227 | histogram(~value|L1, data = dataByFold, layout=c(3,2), xlab = "Permeability") 228 | @ 229 | \caption[Resampled Permeability Distribution]{Distribution of permeability values within six of the folds.} 230 | \label{F:ResampledPermeabilityDistribution} 231 | \end{center} 232 | \end{figure} 233 | 234 | \clearpage 235 | 236 | \section*{Exercise 3} 237 | 238 | Partial least squares (Sect. 6.3) was used to model the yield of a chemical manufacturing process (Sect. 1.4). The data can be found in the \pkg{AppliedPredictiveModeling} package and can be loaded using: 239 | 240 | <>= 241 | library(AppliedPredictiveModeling) 242 | data(ChemicalManufacturingProcess) 243 | @ 244 | 245 | The objective of this analysis is to find the number of PLS components that yields the optimal $R^2$ value (Sect. 5.1). PLS models with 1 through 10 components were each evaluated using 5 repeats of 10-fold cross--validation and the results are presented in the following table. 246 | 247 | <>= 248 | set.seed(19711230) 249 | plsProfileChemMod <- train(Yield ~ ., 250 | data = ChemicalManufacturingProcess, 251 | method = "pls", 252 | preProc = c("center", "scale"), 253 | tuneLength = 10, 254 | trControl = trainControl(method = "repeatedcv", repeats = 5)) 255 | @ 256 | 257 | \begin{itemize} 258 | \item[] (a) Using the ``one--standard error'' method, what number of PLS components provides the most parsimonious model? 259 | \item[] (b) Compute the tolerance values for this example. If a 10$\%$ loss in $R^2$ is acceptable, then what is the optimal number of PLS components? 260 | \item[] (c) Several other models (discussed in Part II) with varying degrees of complexity were trained and tuned and the results are presented in Figure \ref{F:OverfittingChemicalManPlot}. If the goal is to select the model that optimizes $R^2$, then which model(s) would you choose, and why? 261 | \item[] (d) Prediction time, as well as model complexity (Sect. 4.8) are other factors to consider when selecting the optimal model(s). Given each model's prediction time, model complexity, and $R^2$ estimates, which model(s) would you choose, and why? 262 | \end{itemize} 263 | 264 | 265 | \subsection*{Solutions} 266 | 267 | The model was fit using: 268 | 269 | <>= 270 | set.seed(19711230) 271 | plsProfileChemMod <- train(Yield ~ ., 272 | data = ChemicalManufacturingProcess, 273 | method = "pls", 274 | preProc = c("center", "scale"), 275 | tuneLength = 10, 276 | trControl = trainControl(method = "repeatedcv", repeats = 5)) 277 | @ 278 | 279 | <>= 280 | pls_table <- plsProfileChemMod$results[, c("ncomp", "Rsquared", "RsquaredSD")] 281 | pls_table$RsquaredSEM <- pls_table$RsquaredSD/sqrt(length(plsProfileChemMod$control$index)) 282 | pls_table$RsquaredSD <- NULL 283 | 284 | latex(pls_table, 285 | file = "", 286 | booktabs = TRUE, 287 | title = "", 288 | cgroup = c("", "Resampled $R^2$"), 289 | colheads = c("Components", "Mean", "Std. Error"), 290 | n.cgroup = c(1, 2), 291 | digits = 3, 292 | caption = "Number of components in the PLS model and their associated resampled $R^2$ values. 
", 293 | label = "T:OverfittingChemManTable", 294 | rowname = NULL) 295 | @ 296 | 297 | We can get the resampling summaries in the \apmstd{results} component of the object and find the number of resamples using the \apmstd{index} argument of the \apmstd{control} component. 298 | 299 | <>= 300 | R2values <- plsProfileChemMod$results[, c("ncomp", "Rsquared", "RsquaredSD")] 301 | R2values$RsquaredSEM <- R2values$RsquaredSD/sqrt(length(plsProfileChemMod$control$index)) 302 | @ 303 | 304 | 305 | Table \ref{T:OverfittingChemManTable} show the results but let's plot them too. The easiest way to do this is using the package \href{http://docs.ggplot2.org/}{\pkg{ggplot2}}. First, we can make a plot of the $R^2$ values, showing the resampled estimate minus one standard error. 306 | <>= 307 | library(ggplot2) 308 | oneSE <- ggplot(R2values, 309 | ## Create an aesthetic mapping that plots the 310 | ## number of PLS components versus the R^2 311 | ## values and their one SE lower bound 312 | aes(ncomp, Rsquared, 313 | ymin = Rsquared - RsquaredSEM, 314 | ## don't add anything here to get 315 | ## a one-sided interval plot 316 | ymax=Rsquared)) 317 | ## geom_linerange shoes the interval and geom_pointrange 318 | ## plots the resampled estimates. 319 | oneSE + geom_linerange() + geom_pointrange() + theme_bw() 320 | @ 321 | <>= 322 | bestR2 <- subset(R2values, ncomp == which.max(R2values$Rsquared)) 323 | bestR2$lb <- bestR2$Rsquared - bestR2$RsquaredSEM 324 | candR2 <- subset(R2values, Rsquared >= bestR2$lb & ncomp < bestR2$ncomp) 325 | @ 326 | Figure \ref{F:ChemR2oneSE} shows the results. The best setting uses \Sexpr{R2values$ncomp[which.max(R2values$Rsquared)]} PLS components with a lower bound of \Sexpr{round(bestR2$lb, 2)}. There is \Sexpr{nrow(candR2)} parameter setting whose resampled $R^2$ estimate is greater than or equal to this bound (and are simpler): a model using \Sexpr{candR2$ncomp[nrow(candR2)]} PLS components. 327 | 328 | 329 | \begin{figure} 330 | \begin{center} 331 | <>= 332 | oneSE <- ggplot(R2values, 333 | ## Create an aesthetic mapping that plots the 334 | ## number of PLS components versus the R^2 335 | ## values and their one SE lower bound 336 | aes(ncomp, Rsquared, 337 | ymin = Rsquared - RsquaredSEM, 338 | ## don't add anything here to get 339 | ## a one-sided interval plot 340 | ymax=Rsquared)) 341 | ## geom_linerange shoes the interval and geom_pointrange 342 | ## plots the resampled estimates. 343 | oneSE + geom_linerange() + geom_pointrange() + xlab("#Components") + theme_bw() 344 | @ 345 | \caption{Resampled $R^2$ values with a lower bound of one standard error for the chemical manufacturing data. } 346 | \label{F:ChemR2oneSE} 347 | \end{center} 348 | \end{figure} 349 | 350 | 351 | The following syntax can be used to get the tolerance values: 352 | <<>= 353 | bestR2 <- subset(R2values, ncomp == which.max(R2values$Rsquared)) 354 | R2values$tolerance <- (R2values$Rsquared - bestR2$Rsquared)/bestR2$Rsquared * 100 355 | @ 356 | 357 | Let's stick with \pkg{ggplot2} graphics and plot these in Figure \ref{F:ChemR2tol} using {\tt \Sexpr{'qplot(ncomp, tolerance, data = R2values)'}}. The lowest setting that does not exceed a 10$\%$ tolerance is a 2 component model. 358 | 359 | \begin{figure} 360 | \begin{center} 361 | <>= 362 | qplot(ncomp, tolerance, data = R2values) + xlab("#Components") + theme_bw() 363 | @ 364 | \caption{$R^2$ tolerance values for the chemical manufacturing data. 
} 365 | \label{F:ChemR2tol} 366 | \end{center} 367 | \end{figure} 368 | 369 | 370 | \setkeys{Gin}{width=.65\textwidth} 371 | \begin{figure} 372 | \begin{center} 373 | \includegraphics{figure/ChemicalManPlot} 374 | \caption[Contrasting Performance with Time]{A plot of the estimated model performance against the time to predict 500,000 new samples using the chemical manufacturing data.} 375 | \label{F:OverfittingChemicalManPlot} 376 | \end{center} 377 | \end{figure} 378 | 379 | Looking at Figure \ref{F:OverfittingChemicalManPlot}, the model with the best $R^2$ value is random forest. However, the support vector machine has nearly equivalent results and the confidence intervals for the $R^2$ values have some overlap. The next best model is boosted linear regression, although this model is probably significantly worse than the support vector machine. Based on $R^2$ alone, the random forest or SVM models would be best. However, when execution time is factored in, the SVM model clearly wins since it is far faster. This is, of course, subjective since it is highly dependent on the implementation. If the prediction function needed to be recoded for use, neither of these models would be preferred. In that case, the regression tree or PLS model would be better choices, albeit with a substantial drop in $R^2$. 380 | 381 | \clearpage 382 | 383 | 384 | 385 | \section*{Exercise 4} 386 | 387 | <>= 388 | library(caret) 389 | data(oil) 390 | @ 391 | 392 | \citeasnoun{BrodnjakVonina:2005p4943} develop a methodology for food laboratories to determine the type of oil from a sample. In their procedure, they used a gas chromatograph (an instrument that separates chemicals in a sample) to measure \Sexpr{ncol(fattyAcids)} different fatty acids in an oil. These measurements would then be used to predict the type of oil in a food sample. To create their model, they used \Sexpr{length(oilType)} samples\footnote{The authors state that there are 95 samples of known oils. However, we count 96 in their Table 1 (pgs. 33-35 of the article)} of \Sexpr{length(table(oilType))} types of oils. 393 | 394 | These data can be found in the \pkg{caret} package using {\tt \Sexpr{'data(oil)'}}. The oil types are contained in a factor variable called \apmstd{oilType}. The types are: pumpkin (coded as \apmstd{A}), sunflower (\apmstd{B}), peanut (\apmstd{C}), olive (\apmstd{D}), soybean (\apmstd{E}), rapeseed (\apmstd{F}) and corn (\apmstd{G}). In \pkg{R}: 395 | 396 | <>= 397 | library(caret) 398 | data(oil) 399 | str(oilType) 400 | table(oilType) 401 | @ 402 | 403 | \begin{itemize} 404 | \item[] (a) Use the \apmfun{sample} function in base \pkg{R} to create a completely random sample of 60 oils. How closely do the frequencies of the random sample match the original samples? Repeat this procedure several times to understand the variation in the sampling process. 405 | \item[] (b) Use the \pkg{caret} package function \apmfun{createDataPartition} to create a stratified random sample. How does this compare to the completely random samples? 406 | \item[] (c) With such a small sample size, what are the options for determining performance of the model? Should a test set be used? 407 | \item[] (d) One method for understanding the uncertainty of a test set is to use a confidence interval. To obtain a confidence interval for the overall accuracy, the base \pkg{R} function \apmfun{binom.test} can be used. It requires the user to input the number of samples and the number correctly classified to calculate the interval.
For example, suppose a test set sample of 20 oil samples was set aside and 76 were used for model training. For this test set size and a model that is about 80$\%$ accurate (16 out of 20 correct), the confidence interval would be computed using 408 | 409 | <>= 410 | binom.test(16, 20) 411 | @ 412 | 413 | \item[] In this case, the width of the 95$\%$ confidence interval is \Sexpr{round(diff(binom.test(16, 20)$conf.int)*100, 1)}$\%$. Try different sample sizes and accuracy rates to understand the trade--off between the uncertainty in the results, the model performance and the test set size. 414 | \end{itemize} 415 | 416 | \subsection*{Solutions} 417 | 418 | We can create 20 splits using the \apmfun{sample} function: 419 | 420 | <>= 421 | sampNum <- floor(length(oilType)*.6) + 1 422 | set.seed(629) 423 | oilSplits <- vector(mode = "list", length = 20) 424 | for(i in seq(along = oilSplits)) oilSplits[[i]] <- table(sample(oilType, size = sampNum)) 425 | head(oilSplits, 3) 426 | ## Combine the list of tables into a matrix 427 | oilSplits <- do.call("rbind", oilSplits) 428 | head(oilSplits, 3) 429 | ## What does the distribution of class percentages look like? 430 | summary(oilSplits/sampNum) 431 | @ 432 | 433 | Using stratified random sampling via \apmfun{createDataPartition}: 434 | 435 | <>= 436 | set.seed(629) 437 | oilSplits2 <- createDataPartition(oilType, p = .60, times = 20) 438 | oilSplits2 <- lapply(oilSplits2, function(x, y) table(y[x]), y = oilType) 439 | head(oilSplits2, 3) 440 | oilSplits2 <- do.call("rbind", oilSplits2) 441 | summary(oilSplits2/sampNum) 442 | @ 443 | 444 | The sampling done using \apmfun{createDataPartition} has much less variability than using the \apmfun{sample} function, and each partition has at least one sample in each class. 445 | 446 | Choosing a data splitting strategy is difficult. One possibility would be leave--one--out cross--validation, if only because, with the exception of class \apmstd{G}, each class will be represented in each resample. It should be noted that some classification models require at least one sample from each class, so resampling these data may place a restriction on which models can be used. As for a test set, it may be reasonable to rely on leave--one--out cross--validation as the only method for assessing performance. Alternatively, a test set could be used if it only consisted of the classes with the most samples (e.g. \apmstd{A}, \apmstd{B} and maybe \apmstd{E} and \apmstd{F}), although this would only protect against gross overfitting.
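If leave--one--out cross--validation were adopted, it can be requested directly from \apmfun{trainControl} when tuning a model with \apmfun{train}; a minimal sketch:

<<>>=
## Leave-one-out CV: every sample is held out exactly once
looCtrl <- trainControl(method = "LOOCV")
@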
447 | 448 | Looking at the confidence intervals for overall accuracy, let's examine samples sizes between 10 and 30 and accuracy rates of 70$\%$ to 95$\%$: 449 | 450 | <>= 451 | getWidth <- function(values) { 452 | binom.test(x = floor(values["size"]*values["accuracy"])+1, 453 | n = values["size"])$conf.int 454 | } 455 | 456 | ciInfo <- expand.grid(size = 10:30, accuracy = seq(.7, .95, by = 0.01)) 457 | ciWidths <- t(apply(ciInfo, 1, getWidth)) 458 | head(ciWidths) 459 | ciInfo$length <- ciWidths[,2] - ciWidths[,1] 460 | # levelplot(length ~ size * accuracy, data = ciInfo) 461 | @ 462 | 463 | 464 | \begin{figure} 465 | \begin{center} 466 | <>= 467 | bookTheme() 468 | levelplot(length ~ size * accuracy, data = ciInfo) 469 | @ 470 | \caption{The width of a binomial confidence interval for overall accuracy for different sample sizes and accuracy rates.} 471 | \label{F:width_plot} 472 | \end{center} 473 | \end{figure} 474 | 475 | 476 | \section*{Session Info} 477 | 478 | <>= 479 | toLatex(sessionInfo()) 480 | @ 481 | 482 | 483 | 484 | 485 | \bibliographystyle{ECA_jasa} 486 | \bibliography{Ch_04_Ex_Sol} 487 | 488 | 489 | \end{document} 490 | 491 | 492 | 493 | 494 | -------------------------------------------------------------------------------- /Ch_04.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/topepo/APM_Exercises/73d35f3e62522d9e9a61552159bbeda9d6f02d57/Ch_04.pdf -------------------------------------------------------------------------------- /Ch_04_Ex_Sol.bib: -------------------------------------------------------------------------------- 1 | @article{BrodnjakVonina:2005p4943, 2 | author = {D Brodnjak--Vonina and Z Kodba and M Novi}, 3 | journal = {Chemometrics and Intelligent Laboratory Systems}, 4 | title = {Multivariate data analysis in classification of vegetable oils characterized by the content of fatty acids}, 5 | number = {1}, 6 | pages = {31--43}, 7 | volume = {75}, 8 | year = {2005} 9 | } 10 | -------------------------------------------------------------------------------- /Ch_06.Rnw: -------------------------------------------------------------------------------- 1 | 2 | \documentclass[12pt]{article} 3 | 4 | \usepackage{amsmath} 5 | \usepackage{graphicx} 6 | \usepackage{color} 7 | \usepackage{xspace} 8 | \usepackage{fancyvrb} 9 | \usepackage[ 10 | colorlinks=true, 11 | linkcolor=blue, 12 | citecolor=blue, 13 | urlcolor=blue] 14 | {hyperref} 15 | 16 | \usepackage[default]{jasa_harvard} 17 | %\usepackage{JASA_manu} 18 | 19 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 20 | 21 | \setlength{\oddsidemargin}{-.25 truein} 22 | \setlength{\evensidemargin}{0truein} 23 | \setlength{\topmargin}{-0.2truein} 24 | \setlength{\textwidth}{7 truein} 25 | \setlength{\textheight}{8.5 truein} 26 | \setlength{\parindent}{0truein} 27 | \setlength{\parskip}{0.07truein} 28 | 29 | \definecolor{darkred}{rgb}{0.6,0.0,0} 30 | \definecolor{darkblue}{rgb}{.165, 0, .659} 31 | \definecolor{grey}{rgb}{0.85,0.85,0.85} 32 | \definecolor{darkorange}{rgb}{1,0.54,0} 33 | 34 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 35 | 36 | 37 | \newcommand{\bld}[1]{\mbox{\boldmath $#1$}} 38 | \newcommand{\shell}[1]{\mbox{$#1$}} 39 | \renewcommand{\vec}[1]{\mbox{\bf {#1}}} 40 | 41 | \newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize} 42 | \newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize} 43 | 44 | \newcommand{\halfs}{\frac{1}{2}} 45 | 46 | 
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}} 47 | \fvset{fontsize=\footnotesize} 48 | 49 | \newcommand{\website}[1]{{\textsf{#1}}} 50 | \newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}} 51 | \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} 52 | \renewcommand{\pkg}[1]{{\textsf{#1}}} 53 | \newcommand{\todo}[1]{TODO: {\bf \textcolor{darkred}{#1}}} 54 | \newcommand{\Dag}{$^\dagger$} 55 | \newcommand{\Ast}{$^\ast$} 56 | 57 | 58 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 59 | 60 | \begin{document} 61 | 62 | <>= 63 | opts_chunk$set(tidy=FALSE,message=FALSE,size='footnotesize', 64 | background = 'white',comment=NA, digits = 3, 65 | prompt = TRUE) 66 | knit_theme$set("bclear") 67 | @ 68 | 69 | \title{ Exercises for \\ {\it Applied Predictive Modeling} \\ Chapter 6 --- Linear Regression and Its Cousins} 70 | \author{Max Kuhn, Kjell Johnson} 71 | \date{Version 1\\ \today} 72 | 73 | \maketitle 74 | 75 | <>= 76 | library(caret) 77 | library(pls) 78 | library(AppliedPredictiveModeling) 79 | library(doMC) 80 | library(xtable) 81 | library(parallel) 82 | registerDoMC(detectCores(logical = FALSE) - 1) 83 | 84 | options(width = 105) 85 | textList <- function (x, period = FALSE, last = " and ") 86 | { 87 | if (!is.character(x)) 88 | x <- as.character(x) 89 | numElements <- length(x) 90 | out <- if (length(x) > 0) { 91 | switch(min(numElements, 3), x, paste(x, collapse = last), 92 | { 93 | x <- paste(x, c(rep(",", numElements - 2), last, 94 | ""), sep = "") 95 | paste(x, collapse = " ") 96 | }) 97 | } 98 | else "" 99 | if (period) 100 | out <- paste(out, ".", sep = "") 101 | out 102 | } 103 | 104 | hook_inline = knit_hooks$get('inline') 105 | knit_hooks$set(inline = function(x) { 106 | if (is.character(x)) highr::hi_latex(x) else hook_inline(x) 107 | }) 108 | 109 | options(width = 80) 110 | @ 111 | 112 | \newcommand{\apmfun}[1]{{\tt \small \hlkwd{#1}}} 113 | \newcommand{\apmarg}[1]{{\tt \small \hlkwc{#1}}} 114 | \newcommand{\apmstr}[1]{{\tt \small \hlstr{#1}}} 115 | \newcommand{\apmnum}[1]{{\tt \small \hlnum{#1}}} 116 | \newcommand{\apmstd}[1]{{\tt \small \hlstd{#1}}} 117 | \newcommand{\apmred}[1]{\textcolor[rgb]{0.8,0.0,0}{#1}}% 118 | 119 | \maketitle 120 | 121 | \thispagestyle{empty} 122 | 123 | The solutions in this file uses several \pkg{R} packages not used in the text. To install all of the packages needed for this document, use: 124 | 125 | <>= 126 | install.packages(c("AppliedPredictiveModeling", "caret", "elasticnet", "pls", 127 | "RColorBrewer", "reshape2")) 128 | @ 129 | 130 | 131 | \section*{Exercise 1} 132 | 133 | Infrared (IR) spectroscopy technology can be used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical make--up of the sample material. 134 | 135 | A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. A sample of these frequency profiles are displayed in Figure \ref{F:RegressionTecatorSpectrum}. In addition an IR profile, analytical chemistry was used to determine the percent content of water, fat, and protein for each sample. 
If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample's fat content instead of using analytical chemistry to determine the content. This would provide costs savings, since analytical chemistry is a more expensive, time consuming process. 136 | 137 | \begin{itemize} 138 | \item[] (a) Start \pkg{R} and use these commands to load the data: 139 | 140 | <>= 141 | library(caret) 142 | data(tecator) 143 | # use ?tecator to see more details 144 | @ 145 | 146 | \item[] The matrix \apmstd{absorp} contains the 100 absorbance values for the 215 samples, while matrix \apmstd{endpoints} contains the percent of moisture, fat, and protein in columns 1--3, respectively. 147 | \item[] (b) In this example the predictors are the measurements at the individual frequencies. Because the frequencies lie in a systematic order (850 to 1050 nm), the predictors have a high degree of correlation. Hence, the data lies in a smaller dimension than the total number of predictors (215). Use PCA to determine the effective dimension of this data. What is the effective dimension? 148 | \item[] (c) Split the data into a training and a test set, pre--process the data, and build each variety of model described in this chapter. For those models with tuning parameters, what are the optimal values of the tuning parameter(s)? 149 | \item[] (d) Which model has the best predictive ability? Is any model significantly better or worse than the others? 150 | \item[] (e) Explain which model you would use for predicting the fat content of a sample. 151 | \end{itemize} 152 | 153 | \begin{figure}[t] 154 | \begin{center} 155 | <>= 156 | library(reshape2) 157 | library(RColorBrewer) 158 | data(tecator) 159 | 160 | ## Select a random sample for plots 161 | set.seed(1) 162 | inSubset <- sample(1:dim(endpoints)[1], 10) 163 | 164 | absorpSubset <- absorp[inSubset,] 165 | ## Save the protein percentage outcome 166 | endpointSubset <- endpoints[inSubset, 3] 167 | newOrder <- order(-absorpSubset[,1]) 168 | 169 | ## Order the data from least amount of protein to the most 170 | absorpSubset <- absorpSubset[newOrder,] 171 | endpointSubset <- endpointSubset[newOrder] 172 | 173 | spectData <- as.data.frame(t(absorpSubset)) 174 | spectData$x <- 1:nrow(spectData) 175 | spectData2 <- melt(spectData, id.vars = c("x")) 176 | 177 | cols <- brewer.pal(9,"YlOrRd")[-(1:2)] 178 | cols <- colorRampPalette(cols)(10) 179 | spectTheme <- caretTheme() 180 | spectTheme$superpose.line$col <- cols 181 | spectTheme$superpose.line$lwd <- rep(2, 10) 182 | spectTheme$superpose.line$lty <- rep(1, 10) 183 | trellis.par.set(spectTheme) 184 | xyplot(value ~ x, 185 | data = spectData2, 186 | groups = variable, 187 | type = c("l", "g"), 188 | panel = function(...) { 189 | panel.xyplot(...) 190 | panel.text(rep(103.5, nrow(absorpSubset)), 191 | absorpSubset[,ncol(absorpSubset)], 192 | paste(endpointSubset), 193 | cex = .7) 194 | }, 195 | ylab = "Absorption", 196 | xlab = "") 197 | @ 198 | \caption[Tecator spectrum plots]{A sample of 10 spectrum of the Tecator data. 
The colors of the curves reflect the absorption values, where yellow indicates low absorption and red is indicative of high absorption.} 199 | \label{F:RegressionTecatorSpectrum} 200 | \end{center} 201 | \end{figure} 202 | 203 | \subsection*{Solutions} 204 | 205 | The full set of principal components can be obtained using the \apmfun{prcomp} function: 206 | 207 | <>= 208 | pcaObj <- prcomp(absorp, center = TRUE, scale = TRUE) 209 | @ 210 | 211 | To get the percentage of variance associated with each component, we can use the component standard deviations stored in the \apmfun{prcomp} object: 212 | 213 | <>= 214 | pctVar <- pcaObj$sdev^2/sum(pcaObj$sdev^2)*100 215 | head(pctVar) 216 | @ 217 | 218 | This indicates that the first component accounts for almost all of the information in the data. Based on this analysis, the true dimensionality is much lower than the number of predictors. However, this is based on a {\em linear} combination of the data; other nonlinear summarizations of the predictors may also be useful. 219 | 220 | There are three endpoints and we will model the third, which is the percentage of protein. Given the sample size, 25$\%$ of the data will be held back as a test set and five repeats of 10--fold cross--validation will be used to tune the models: 221 | 222 | <>= 223 | set.seed(1029) 224 | inMeatTraining <- createDataPartition(endpoints[, 3], p = 3/4, list= FALSE) 225 | 226 | absorpTrain <- absorp[ inMeatTraining,] 227 | absorpTest <- absorp[-inMeatTraining,] 228 | proteinTrain <- endpoints[ inMeatTraining, 3] 229 | proteinTest <- endpoints[-inMeatTraining,3] 230 | 231 | ctrl <- trainControl(method = "repeatedcv", repeats = 5) 232 | @ 233 | 234 | To start, a simple linear model was used for these data: 235 | 236 | <>= 237 | set.seed(529) 238 | meatLm <- train(x = absorpTrain, y = proteinTrain, method = "lm", trControl = ctrl) 239 | meatLm 240 | @ 241 | 242 | The RMSE is slightly more than \Sexpr{I(round(getTrainPerf(meatLm)[1, "TrainRMSE"], 1))}$\%$. However, it is fair to assume that the amount of multicollinearity in the predictors is probably degrading performance. For this reason, PCR and PLS models are also trained: 243 | 244 | <>= 245 | set.seed(529) 246 | meatPCR <- train(x = absorpTrain, y = proteinTrain, 247 | method = "pcr", 248 | trControl = ctrl, tuneLength = 25) 249 | @ 250 | <>= 251 | set.seed(529) 252 | meatPLS <- train(x = absorpTrain, y = proteinTrain, 253 | method = "pls", 254 | trControl = ctrl, 255 | preProcess = c("center", "scale"), 256 | tuneLength = 25) 257 | ## For Figure 258 | comps <- rbind(meatPLS$results, meatPCR$results) 259 | comps$Model <- rep(c("PLS", "PCR"), each = 25) 260 | @ 261 | 262 | The results are shown in Figure \ref{F:meat_pls}. Both models achieve comparable error rates, but the PLS model requires fewer (\Sexpr{meatPLS$bestTune$ncomp}) components. The RMSE for the PLS model was estimated to be \Sexpr{round(getTrainPerf(meatPLS)[1, "TrainRMSE"], 2)}$\%$.
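Because the same seed and resampling scheme were used for each fit, the three sets of resampling results can also be compared head--to--head. A brief sketch using \pkg{caret}'s \apmfun{resamples} and \apmfun{diff} methods (output not shown; the object name is arbitrary):

<>=
## Collect the common resampling results and compare the models;
## diff() computes paired differences in RMSE and R^2
meatLinResamples <- resamples(list(LM = meatLm, PCR = meatPCR, PLS = meatPLS))
summary(meatLinResamples)
summary(diff(meatLinResamples))
@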
263 | 264 | \begin{figure}[t] 265 | \begin{center} 266 | <>= 267 | bookTheme() 268 | xyplot(RMSE ~ ncomp, data = comps, 269 | groups = Model, type = c("g", "o"), 270 | auto.key = list(columns = 2), 271 | xlab = "#Components") 272 | @ 273 | \caption{The resampling performance profile for the PCR and PLS models built using the Tecator protein data.} 274 | \label{F:meat_pls} 275 | \end{center} 276 | \end{figure} 277 | 278 | An elastic net model was also tuned over five values of the $L_2$ regularization parameter and the $L_1$ regularization parameter: 279 | 280 | <>= 281 | fractionTune = c(seq(.005, .03,.005),seq(.04,.1,0.02),seq(0.2,1,0.2)) 282 | lambdaTune = c(0, .001, .01, .1, 1) 283 | 284 | set.seed(529) 285 | meatENet <- train(x = absorpTrain, y = proteinTrain, 286 | method = "enet", 287 | trControl = ctrl, 288 | preProcess = c("center", "scale"), 289 | tuneGrid = expand.grid(lambda = lambdaTune, 290 | fraction = fractionTune)) 291 | @ 292 | 293 | The final model used $\lambda = \Sexpr{meatENet$bestTune$lambda}$ and a fraction of \Sexpr{meatENet$bestTune$fraction}. This parameter combination corresponds to a Lasso model. Figure \ref{F:meat_enet} shows a heat map of the resampled RMSE values across the two tuning parameters. The RMSE associated with the optimal parameters was \Sexpr{round(getTrainPerf(meatENet)[1, "TrainRMSE"], 2)}$\%$. 294 | 295 | \begin{figure}[t] 296 | 297 | \begin{center} 298 | <>= 299 | bookTheme() 300 | plot(meatENet, plotType = "level") 301 | @ 302 | \caption{The resampling performance profile for the elastic net model built using the Tecator protein data.} 303 | \label{F:meat_enet} 304 | \end{center} 305 | \end{figure} 306 | 307 | Based on the resampling statistics, the PLS model and Lasso performed best for this dataset. The ordinary linear model, as expected, had the worse performance overall. 308 | 309 | 310 | <>= 311 | save(meatPLS, file = "meatPLS.RData") 312 | @ 313 | 314 | \clearpage 315 | 316 | \section*{Exercise 2} 317 | 318 | Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug. 319 | \begin{itemize} 320 | \item[] (a) Start \pkg{R} and use these commands to load the data: 321 | 322 | <>= 323 | library(AppliedPredictiveModeling) 324 | data(permeability) 325 | @ 326 | 327 | \item[] The matrix \apmstd{fingerprints} contains the 1107 binary molecular predictors for the 165 compounds, while \apmstd{permeability} contains permeability response. 328 | \item[] (b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the \apmfun{nearZeroVar} function from the \pkg{caret} package. How many predictors are left for modeling? 329 | \item[] (c) Split the data into a training and a test set, pre--process the data, and tune a partial least squares model. How many latent variables are optimal and what is the corresponding resampled estimate of $R^2$? 330 | \item[] (d) Predict the response for the test set. What is the test set estimate of $R^2$? 331 | \item[] (e) Try building other models discussed in this chapter. Do any have better predictive performance? 332 | \item[] (f) Would you recommend any of your models to replace the permeability laboratory experiment? 
333 | \end{itemize} 334 | 335 | \subsection*{Solutions} 336 | 337 | Before pre-processing and modeling the data, let's examine the distribution of the permeability response (left-hand plot in Figure \ref{F:permeabilityDistribution}). Clearly the distribution is not symmetric, but is skewed-to-the-right. When the response has this kind of distribution, an appropriate next step is to log-transform the data. We then should verify if the transformation induced a symmetric distribution (right-hand plot in Figure \ref{F:permeabilityDistribution}). For these data, the log-transformation did induce an approximately symmetric distribution and we will model the log-transformed response. 338 | 339 | \begin{figure}[t] 340 | \begin{center} 341 | <>= 342 | plotTheme <- bookTheme() 343 | trellis.par.set(plotTheme) 344 | h1 <- histogram(permeability, xlab="Permeability") 345 | h2 <- histogram(log10(permeability), xlab="log10(Permeability)") 346 | print(h1, split=c(1,1,2,1), more=TRUE) 347 | print(h2, split=c(2,1,2,1)) 348 | @ 349 | \caption{The response distribution for the permeability data.} 350 | \label{F:permeabilityDistribution} 351 | \end{center} 352 | \end{figure} 353 | 354 | Next, let's pre-process the fingerprint predictors to determine how many predictors will be removed and how many will remain for modeling. The following syntax can be used to identify and remove near-zero variance fingerprint predictors. 355 | 356 | <>= 357 | #Identify and remove NZV predictors 358 | nzvFingerprints <- nearZeroVar(fingerprints) 359 | noNzvFingerprints <- fingerprints[,-nzvFingerprints] 360 | @ 361 | 362 | In total there are \Sexpr{length(nzvFingerprints)} near-zero variance fingerprints, leaving \Sexpr{ncol(noNzvFingerprints)} left for modeling. This is a significant reduction from the original matrix, and indicates that many of the fingerprints are describing unique features of very small subsets of molecules. Because this data set has a small number of samples relative to predictors ($n$=\Sexpr{nrow(noNzvFingerprints)} vs $p$=\Sexpr{ncol(noNzvFingerprints)}), we will hold out 25\% in the test set for evaluating model performance. 363 | 364 | <>= 365 | #Split data into training and test sets 366 | set.seed(614) 367 | trainingRows <- createDataPartition(permeability, 368 | p = 0.75, 369 | list = FALSE) 370 | 371 | trainFingerprints <- noNzvFingerprints[trainingRows,] 372 | trainPermeability <- permeability[trainingRows,] 373 | 374 | testFingerprints <- noNzvFingerprints[-trainingRows,] 375 | testPermeability <- permeability[-trainingRows,] 376 | @ 377 | 378 | 379 | 380 | <>= 381 | set.seed(614) 382 | ctrl <- trainControl(method = "LGOCV") 383 | 384 | plsTune <- train(x = trainFingerprints, y = log10(trainPermeability), 385 | method = "pls", 386 | tuneGrid = expand.grid(ncomp = 1:15), 387 | trControl = ctrl) 388 | @ 389 | 390 | Figure \ref{F:permeabilityPLSTunePlot} indicates that the optimal number of latent variables that maximizes $R^2$ is \Sexpr{best(plsTune$results, "Rsquared", maximize = TRUE)}. 
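The resampled values behind this choice can also be printed directly from the \apmfun{train} object, for example:

<>=
## Resampled performance for each candidate number of components and
## the value ultimately selected by train()
plsTune$results
plsTune$bestTune
@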
391 | 392 | \begin{figure}[t] 393 | \begin{center} 394 | <>= 395 | plotTheme <- bookTheme() 396 | trellis.par.set(plotTheme) 397 | plot(plsTune,metric="Rsquared") 398 | @ 399 | \caption{PLS tuning parameter profile for the permeability data} 400 | \label{F:permeabilityPLSTunePlot} 401 | \end{center} 402 | \end{figure} 403 | 404 | 405 | 406 | The relationship between the PLS components and the response can be seen in Figure \ref{F:permeabilityPLSComponentVsYield}, where the straight line is the linear regression fit, and the curved line is the loess fit. For the training data, the first three components are linearly related to log10(permeability). Each component also contains a noticeable amount of noise relative to predicting the response. 407 | 408 | \begin{figure}[ht] 409 | \begin{center} 410 | <>= 411 | library(reshape2) 412 | plsDat <- data.frame(unclass(scores(plsTune$finalModel))) 413 | names(plsDat) <- paste("PLS", names(plsDat), sep = ".") 414 | plsDat$y <- trainPermeability 415 | 416 | scoreData <- melt(plsDat[, c("PLS.Comp.1", "PLS.Comp.2", "PLS.Comp.3", "y")], id.vars = "y") 417 | scoreData$Label <- "" 418 | scoreData$Label[scoreData$variable == "PLS.Comp.1"] <- "PLS Component 1" 419 | scoreData$Label[scoreData$variable == "PLS.Comp.2"] <- "PLS Component 2" 420 | scoreData$Label[scoreData$variable == "PLS.Comp.3"] <- "PLS Component 3" 421 | 422 | scoreData$variable <- as.character(scoreData$variable) 423 | scoreData$Comp <- substring(scoreData$variable, nchar(scoreData$variable)) 424 | 425 | scatterTheme <- caretTheme() 426 | 427 | scatterTheme$plot.line$col <- c("blue") 428 | scatterTheme$plot.line$lwd <- 2 429 | 430 | scatterTheme$plot.symbol$col <- rgb(0, 0, 0, .3) 431 | scatterTheme$plot.symbol$cex <- 0.8 432 | scatterTheme$plot.symbol$pch <- 16 433 | 434 | scatterTheme$add.text <- list(cex = 0.6) 435 | 436 | trellis.par.set(scatterTheme) 437 | xyplot(log10(y) ~ value|Label, 438 | scoreData, 439 | panel = function(...) { 440 | theDots <- list(...) 441 | panel.xyplot(..., type = c("p", "g","r","smooth")) 442 | corr <- round(cor(theDots$x, theDots$y), 2) 443 | panel.text(4, 444 | min(theDots$y), 445 | paste("corr:", corr), 446 | cex = 1.5) 447 | }, 448 | layout = c(3,1), 449 | as.table = TRUE, 450 | ylab = "log10(Permeability)", 451 | xlab = "Component Score") 452 | @ 453 | \caption[PLS manufacturing]{PLS scores versus log10(permeability).} 454 | \label{F:permeabilityPLSComponentVsYield} 455 | \end{center} 456 | \end{figure} 457 | 458 | \begin{figure}[ht] 459 | \begin{center} 460 | <>= 461 | plsTest <- data.frame(Observed=log10(testPermeability),Predicted=predict(plsTune,testFingerprints)) 462 | scatterTheme <- caretTheme() 463 | 464 | scatterTheme$plot.line$col <- c("blue") 465 | scatterTheme$plot.line$lwd <- 2 466 | 467 | scatterTheme$plot.symbol$col <- rgb(0, 0, 0, .3) 468 | scatterTheme$plot.symbol$cex <- 0.8 469 | scatterTheme$plot.symbol$pch <- 16 470 | 471 | scatterTheme$add.text <- list(cex = 0.6) 472 | 473 | trellis.par.set(scatterTheme) 474 | xyplot(Predicted ~ Observed, 475 | plsTest, 476 | panel = function(...) { 477 | theDots <- list(...) 
478 | panel.xyplot(..., type = c("p", "g","r","smooth")) 479 | corr <- round(cor(theDots$x, theDots$y), 2) 480 | panel.text(44, 481 | min(theDots$y), 482 | paste("corr:", corr)) 483 | }, 484 | ylab = "log10(Predicted)", 485 | xlab = "log10(Observed)") 486 | @ 487 | \caption[PLS permeability test]{PLS predictions for the test set for the permeability data.} 488 | \label{F:permeabilityPLSTestPreds} 489 | \end{center} 490 | \end{figure} 491 | 492 | Next we will predict the response for the test set and compare the $R^2$ value with the one obtained through leave-group-out cross-validation. The relationship is displayed in Figure \ref{F:permeabilityPLSTestPreds}, where the $R^2$ value is \Sexpr{round(cor(plsTest$Observed,plsTest$Predicted)^2, 3)}. Leave-group-out cross-validation estimated that the PLS model would have an $R^2$ value of \Sexpr{round(plsTune$results[best(plsTune$results, "Rsquared", maximize = TRUE), "Rsquared"],3)}. These values are close, indicating that the leave-group-out cross-validation performance is a good indicator of the performance from a true hold-out set. 493 | 494 | <>= 495 | ctrl <- trainControl(method = "LGOCV") 496 | 497 | set.seed(614) 498 | ridgeGrid <- data.frame(lambda = seq(0.02, .35, length = 9)) 499 | ridgeTune <- train(x = trainFingerprints, y = log10(trainPermeability), 500 | method = "ridge", 501 | tuneGrid = ridgeGrid, 502 | trControl = ctrl) 503 | 504 | enetGrid <- expand.grid(lambda = c(0, 0.05, .1), 505 | fraction = seq(.05, 1, length = 25)) 506 | set.seed(614) 507 | enetTune <- train(x = trainFingerprints, y = log10(trainPermeability), 508 | method = "enet", 509 | tuneGrid = enetGrid, 510 | trControl = ctrl) 511 | @ 512 | 513 | In addition to a PLS model, ridge regression and elastic net models were tuned. The tuning parameter profiles can be seen in Figures \ref{F:permeabilityRidgeTunePlot} and \ref{F:permeabilityEnetTunePlot}. For ridge regression, the optimal $R^2$ value is \Sexpr{round(ridgeTune$results[best(ridgeTune$results, "Rsquared", maximize = TRUE), "Rsquared"],3)}, which is obtained at a weight decay value of \Sexpr{ridgeTune$bestTune}. This shrinkage parameter value is relatively high and drives the coefficients more rapidly towards zero. For the elastic net, the optimal $R^2$ value is \Sexpr{round(enetTune$results[best(enetTune$results, "Rsquared", maximize = TRUE), "Rsquared"],3)}, which is obtained with the fraction of the full solution set at \Sexpr{enetTune$bestTune$fraction} (corresponding to \Sexpr{I(length(predictors(enetTune)))} predictors) and a weight decay of \Sexpr{enetTune$bestTune$lambda}. Therefore, the fraction parameter lessens the weight decay, relative to ridge regression. Clearly, the elastic net solution the best penalized model choice for this problem. 
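As a quick check of that conclusion, the penalized models can also be applied to the held--out samples in the same way as the PLS model. A short sketch using \apmfun{postResample} (output not shown):

<>=
## Test set RMSE and R^2 for the ridge and elastic net fits
ridgeTestPred <- predict(ridgeTune, testFingerprints)
enetTestPred  <- predict(enetTune, testFingerprints)
postResample(pred = ridgeTestPred, obs = log10(testPermeability))
postResample(pred = enetTestPred, obs = log10(testPermeability))
@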
514 | 515 | \begin{figure}[ht] 516 | \begin{center} 517 | <>= 518 | plotTheme <- bookTheme() 519 | #plotTheme$fontsize$text <- 10 520 | trellis.par.set(plotTheme) 521 | plot(ridgeTune,metric="Rsquared") 522 | @ 523 | \caption{Ridge regression tuning parameter profile for the permeability data.} 524 | \label{F:permeabilityRidgeTunePlot} 525 | \end{center} 526 | \end{figure} 527 | 528 | \begin{figure}[ht] 529 | \begin{center} 530 | <>= 531 | plotTheme <- bookTheme() 532 | #plotTheme$fontsize$text <- 10 533 | trellis.par.set(plotTheme) 534 | plot(enetTune,metric="Rsquared") 535 | @ 536 | \caption{Elastic net tuning parameter profile for the permeability data.} 537 | \label{F:permeabilityEnetTunePlot} 538 | \end{center} 539 | \end{figure} 540 | 541 | 542 | \clearpage 543 | \section*{Exercise 3} 544 | 545 | A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors) and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by $1\%$ will boost revenue by approximately one hundred thousand dollars per batch. 546 | \begin{itemize} 547 | \item[] (a) Start \pkg{R} and use these commands to load the data: 548 | 549 | <>= 550 | library(AppliedPredictiveModeling) 551 | data(ChemicalManufacturingProcess) 552 | @ 553 | 554 | \item[] The matrix \apmstd{processPredictors} contains the 57 predictors (12 describing the input biological material, and 45 describing the process predictors) for the 176 manufacturing runs. \apmstd{yield} contains the percent yield for each run. 555 | \item[] (b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g. see Sect. 3.8). 556 | \item[] (c) Split the data into a training and a test set, pre--process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric? 557 | \item[] (d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set? 558 | \item[] (e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list? 559 | \item[] (f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process? 560 | \end{itemize} 561 | 562 | \subsection*{Solutions} 563 | 564 | First, let's split the \apmstd{ChemicalManufacturingProcess} data set into the predictors and the response as follows: 565 | 566 | <>= 567 | predictors <- subset(ChemicalManufacturingProcess,select= -Yield) 568 | yield <- subset(ChemicalManufacturingProcess,select="Yield") 569 | @ 570 | 571 | Then let's understand some fundamental characteristics of the response and predictors before modeling. The distribution of the response is presented in Figure \ref{F:yieldDistribution}; it is fairly symmetric and does not need transformation prior to analysis. 
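A quick numeric check backs up the visual impression; the \apmfun{skewness} function used here comes from the \pkg{e1071} package, which is not otherwise required in this chapter, so this step is optional:

<>=
## Summary statistics and sample skewness for the yield response;
## values near zero indicate an approximately symmetric distribution
summary(yield$Yield)
e1071::skewness(yield$Yield)
@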
572 | 573 | \begin{figure}[t] 574 | \begin{center} 575 | <>= 576 | plotTheme <- bookTheme() 577 | trellis.par.set(plotTheme) 578 | histogram(~Yield, data=ChemicalManufacturingProcess, xlab="Yield") 579 | @ 580 | \caption{The response distribution for the chemical manufacturing data.} 581 | \label{F:yieldDistribution} 582 | \end{center} 583 | \end{figure} 584 | 585 | Are there any univariate relationships with the response? Because we have a small number of predictors, we can explore these relationships visually with a correlation plot as seen in Figure \ref{F:cmScatterplotCorrelation}. The figure shows that many of the biological material predictors are highly positively correlated, and some of the manufacturing process predictors are positively and negatively correlated. Biological materials 2, 3, 6, and 8, and manufacturing processes 9 and 32 are highly positively correlated with Yield, whereas manufacturing processes 13 and 36 are highly negatively correlated with Yield. 586 | 587 | \begin{figure}[ht] 588 | \begin{center} 589 | <>= 590 | library(corrplot) 591 | correlations <- cor(cbind(yield,predictors),use="pairwise.complete.obs") 592 | corrplot::corrplot(correlations, type="lower", tl.cex = 0.5, mar=c(0,0.2,0,0)) 593 | @ 594 | \caption{Scatterplot correlation matrix for the chemical manufacturing data.} 595 | \label{F:cmScatterplotCorrelation} 596 | \end{center} 597 | \end{figure} 598 | 599 | A small percentage of the predictor set cells are missing. Specifically, \Sexpr{sum(is.na(predictors))} cells are missing out of \Sexpr{nrow(predictors)*(ncol(predictors))} or \Sexpr{round(100*sum(is.na(predictors))/(nrow(predictors)*(ncol(predictors))),2)} percent. When we encounter missing cells, we also should investigate if any particular predictors or samples have a higher frequency of missingness. The top 10 missing predictors are: 600 | 601 | <>= 602 | missingCol = sapply(predictors, function(x) sum(is.na(x))) 603 | missingCol = sort(missingCol,decreasing=TRUE) 604 | topMissingCol = missingCol[1:10] 605 | data.frame(Frequency=topMissingCol) 606 | @ 607 | 608 | And the top 10 missing samples are: 609 | 610 | <>= 611 | missingRow = apply(predictors, 1, function(x) sum(is.na(x))) 612 | missingRow = sort(missingRow,decreasing=TRUE) 613 | topMissingRow = missingRow[1:10] 614 | data.frame(Frequency=topMissingRow) 615 | @ 616 | 617 | It may be worthwhile to investigate why rows (samples) 1 and 172-176 have the highest frequency of missingness. Since no rows or columns have too much missing data, we can impute this information without perturbing the true underlying relationship between the predictors and the response. Let's first split the data into training (70\%) and test (30\%) sets. On that data, we will identify the appropriate Box-Cox transformation, centering and scaling parameters, and will impute using the k-nearest neighbor method. We will then apply those transformations and imputation to the test data. 
618 | 619 | <>= 620 | #Split data into training and test sets 621 | set.seed(517) 622 | trainingRows <- createDataPartition(yield$Yield, 623 | p = 0.7, 624 | list = FALSE) 625 | 626 | trainPredictors <- predictors[trainingRows,] 627 | trainYield <- yield[trainingRows,] 628 | 629 | testPredictors <- predictors[-trainingRows,] 630 | testYield <- yield[-trainingRows,] 631 | 632 | #Pre-process trainPredictors and apply to trainPredictors and testPredictors 633 | pp <- preProcess(trainPredictors,method=c("BoxCox","center","scale","knnImpute")) 634 | ppTrainPredictors <- predict(pp,trainPredictors) 635 | ppTestPredictors <- predict(pp,testPredictors) 636 | @ 637 | 638 | Next, we should remove near-zero variance and highly correlated predictors. 639 | 640 | <>= 641 | #Identify and remove NZV 642 | nzvpp <- nearZeroVar(ppTrainPredictors) 643 | ppTrainPredictors <- ppTrainPredictors[-nzvpp] 644 | ppTestPredictors <- ppTestPredictors[-nzvpp] 645 | 646 | #Identify and remove highly correlated predictors 647 | predcorr = cor(ppTrainPredictors) 648 | highCorrpp <- findCorrelation(predcorr) 649 | ppTrainPredictors <- ppTrainPredictors[, -highCorrpp] 650 | ppTestPredictors <- ppTestPredictors[, -highCorrpp] 651 | @ 652 | 653 | After pre-processing, \Sexpr{ncol(ppTrainPredictors)} predictors remain for modeling. 654 | 655 | Given the moderate pairwise correlations that were apparent in Figure \ref{F:cmScatterplotCorrelation}, a dimension reduction or shrinkage technique would be an appropriate model for this data. Here we will tune a PLS model on the training data using 25 iterations of bootstrap cross-validation. 656 | 657 | <>= 658 | set.seed(517) 659 | ctrl <- trainControl(method = "boot", number = 25) 660 | 661 | plsTune <- train(x = ppTrainPredictors, y = trainYield, 662 | method = "pls", 663 | tuneLength = 15, 664 | trControl = ctrl) 665 | @ 666 | 667 | Figure \ref{F:cmPLSTunePlot} indicates that the optimal number of latent variables that maximizes $R^2$ is \Sexpr{best(plsTune$results, "Rsquared", maximize = TRUE)}. 668 | 669 | \begin{figure}[h] 670 | \begin{center} 671 | <>= 672 | plotTheme <- bookTheme() 673 | trellis.par.set(plotTheme) 674 | plot(plsTune,metric="Rsquared") 675 | @ 676 | \caption{PLS tuning parameter profile for the manufacturing data} 677 | \label{F:cmPLSTunePlot} 678 | \end{center} 679 | \end{figure} 680 | 681 | 682 | The relationship between the PLS components and the response can be seen in Figure \ref{F:cmPLSComponentVsYield}, where the straight line is the linear regression fit, and the curved line is the loess fit. For the training data, the first two components are linearly related to yield with the first component having the stronger relationship with the response. The third component has a weak relationship with the response and may not be linear as indicated by the loess fit. 
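If we want to see which predictors drive the first component, one option (a sketch, not part of the figure below) is to inspect the loadings from the underlying \pkg{pls} fit:

<>=
## The ten largest (absolute) loadings on the first PLS component
plsLoad1 <- loadings(plsTune$finalModel)[, 1]
head(sort(abs(plsLoad1), decreasing = TRUE), 10)
@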
683 | 684 | \begin{figure}[h] 685 | \begin{center} 686 | <>= 687 | library(pls) 688 | plsDat <- data.frame(unclass(scores(plsTune$finalModel))) 689 | names(plsDat) <- paste("PLS", names(plsDat), sep = ".") 690 | plsDat$y <- trainYield 691 | 692 | scoreData <- melt(plsDat[, c("PLS.Comp.1", "PLS.Comp.2", "PLS.Comp.3", "y")], id.vars = "y") 693 | scoreData$Label <- "" 694 | scoreData$Label[scoreData$variable == "PLS.Comp.1"] <- "PLS Component 1" 695 | scoreData$Label[scoreData$variable == "PLS.Comp.2"] <- "PLS Component 2" 696 | scoreData$Label[scoreData$variable == "PLS.Comp.3"] <- "PLS Component 3" 697 | 698 | scoreData$variable <- as.character(scoreData$variable) 699 | scoreData$Comp <- substring(scoreData$variable, nchar(scoreData$variable)) 700 | 701 | scatterTheme <- caretTheme() 702 | 703 | scatterTheme$plot.line$col <- c("blue") 704 | scatterTheme$plot.line$lwd <- 2 705 | 706 | scatterTheme$plot.symbol$col <- rgb(0, 0, 0, .3) 707 | scatterTheme$plot.symbol$cex <- 0.8 708 | scatterTheme$plot.symbol$pch <- 16 709 | 710 | scatterTheme$add.text <- list(cex = 0.6) 711 | 712 | trellis.par.set(scatterTheme) 713 | xyplot(y ~ value|Label, 714 | scoreData, 715 | panel = function(...) { 716 | theDots <- list(...) 717 | panel.xyplot(..., type = c("p", "g","r","smooth")) 718 | corr <- round(cor(theDots$x, theDots$y), 2) 719 | panel.text(4, 720 | min(theDots$y), 721 | paste("corr:", corr)) 722 | }, 723 | layout = c(3,1), 724 | as.table = TRUE, 725 | ylab = "Yield", 726 | xlab = "Component Score") 727 | @ 728 | \caption[PLS manufacturing]{PLS scores versus response for the manufacturing data.} 729 | \label{F:cmPLSComponentVsYield} 730 | \end{center} 731 | \end{figure} 732 | 733 | \begin{figure}[h] 734 | \begin{center} 735 | <>= 736 | plsTest <- data.frame(Observed=testYield,Predicted=predict(plsTune,ppTestPredictors)) 737 | scatterTheme <- caretTheme() 738 | 739 | scatterTheme$plot.line$col <- c("blue") 740 | scatterTheme$plot.line$lwd <- 2 741 | 742 | scatterTheme$plot.symbol$col <- rgb(0, 0, 0, .3) 743 | scatterTheme$plot.symbol$cex <- 0.8 744 | scatterTheme$plot.symbol$pch <- 16 745 | 746 | scatterTheme$add.text <- list(cex = 0.6) 747 | 748 | trellis.par.set(scatterTheme) 749 | xyplot(Predicted ~ Observed, 750 | plsTest, 751 | panel = function(...) { 752 | theDots <- list(...) 753 | panel.xyplot(..., type = c("p", "g","r","smooth")) 754 | corr <- round(cor(theDots$x, theDots$y), 2) 755 | panel.text(44, 756 | min(theDots$y), 757 | paste("corr:", corr)) 758 | }, 759 | ylab = "Predicted", 760 | xlab = "Observed") 761 | @ 762 | \caption[PLS manufacturing test]{PLS predictions for the test set for the manufacturing data.} 763 | \label{F:cmPLSTestPreds} 764 | \end{center} 765 | \end{figure} 766 | 767 | Next we will predict the response for the test set and compare the $R^2$ value with the one obtained through bootstrap cross-validation. The relationship is displayed in Figure \ref{F:cmPLSTestPreds}, where the $R^2$ value is \Sexpr{round(cor(plsTest$Observed,plsTest$Predicted)^2, 3)}. Bootstrap cross-validation estimated that a three-component PLS model would have an $R^2$ value of \Sexpr{round(plsTune$results[best(plsTune$results, "Rsquared", maximize = TRUE), "Rsquared"],3)}. 768 | 769 | Next, let's examine the variable importance values for the top 15 predictors for this data and model (Figure \ref{F:cmPLSVI}). For this data, the manufacturing process predictors dominate the top part of the list. This may be helpful for improving yield, since many of the manufacturing predictors can be controlled. 
770 | 771 | \begin{figure}[h] 772 | \begin{center} 773 | <>= 774 | plsImp <- varImp(plsTune, scale = FALSE) 775 | bookTheme() 776 | plot(plsImp, top=15, scales = list(y = list(cex = 0.8))) 777 | @ 778 | \caption[PLS Importance]{PLS variable importance scores for the manufacturing data.} 779 | \label{F:cmPLSVI} 780 | \end{center} 781 | \end{figure} 782 | 783 | \begin{figure}[h] 784 | \begin{center} 785 | <>= 786 | viporder <- order(abs(plsImp$importance),decreasing=TRUE) 787 | scatterTheme <- caretTheme() 788 | 789 | scatterTheme$plot.line$col <- "red" 790 | scatterTheme$plot.line$lwd <- 2 791 | 792 | scatterTheme$plot.symbol$col <- "black" 793 | scatterTheme$plot.symbol$cex <- .6 794 | 795 | scatterTheme$add.text <- list(cex = .6) 796 | 797 | trellis.par.set(scatterTheme) 798 | 799 | topVIP = rownames(plsImp$importance)[viporder[c(1:3)]] 800 | 801 | featurePlot(ppTrainPredictors[, topVIP], 802 | trainYield, 803 | plot = "scatter", 804 | between = list(x = 1, y = 1), 805 | type = c("g", "p", "smooth"), 806 | layout = c(3,1), 807 | labels = rep("", 2)) 808 | @ 809 | \caption[Top Corr Manufacturing Scatterplots]{Scatter plots of the top 3 correlated predictors in the manufacturing data set after pre-processing.} 810 | \label{F:cmPLSTopVIP} 811 | \end{center} 812 | \end{figure} 813 | 814 | Finally, let's explore the relationships between the three top important predictors and the response. Figure \ref{F:cmPLSTopVIP} provides the univariate relationships with each of these predictors (after transformation) and the response. Clearly Process 9 and Process 32 have a positive relationship with Yield, while Process 13 has a negative relationship. If these manufacturing processes can be controlled, then altering these steps in the process to have higher (or lower) values could improve the overall Yield of the process. A statistically designed experiment could be used to investigate a causal relationship between the settings of these processes and the overall yield. 
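A simple numeric companion to these scatter plots is the set of training set correlations between the top predictors (after pre-processing) and the response, which makes the signs of the relationships explicit:

<>=
## Correlations between the three most important predictors and yield
cor(ppTrainPredictors[, topVIP], trainYield)
@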
815 | 816 | \clearpage 817 | \section*{Session Info} 818 | 819 | <>= 820 | toLatex(sessionInfo()) 821 | @ 822 | 823 | 824 | \bibliographystyle{ECA_jasa} 825 | \bibliography{Ch_06_Ex_sol} 826 | 827 | 828 | \end{document} 829 | 830 | 831 | 832 | 833 | -------------------------------------------------------------------------------- /Ch_06.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/topepo/APM_Exercises/73d35f3e62522d9e9a61552159bbeda9d6f02d57/Ch_06.pdf -------------------------------------------------------------------------------- /Ch_07.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{article} 2 | 3 | \usepackage{amsmath} 4 | \usepackage{graphicx} 5 | \usepackage{color} 6 | \usepackage{xspace} 7 | \usepackage{fancyvrb} 8 | \usepackage[ 9 | colorlinks=true, 10 | linkcolor=blue, 11 | citecolor=blue, 12 | urlcolor=blue] 13 | {hyperref} 14 | 15 | \usepackage[default]{jasa_harvard} 16 | %\usepackage{JASA_manu} 17 | 18 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 19 | 20 | \setlength{\oddsidemargin}{-.25 truein} 21 | \setlength{\evensidemargin}{0truein} 22 | \setlength{\topmargin}{-0.2truein} 23 | \setlength{\textwidth}{7 truein} 24 | \setlength{\textheight}{8.5 truein} 25 | \setlength{\parindent}{0truein} 26 | \setlength{\parskip}{0.07truein} 27 | 28 | \definecolor{darkred}{rgb}{0.6,0.0,0} 29 | \definecolor{darkblue}{rgb}{.165, 0, .659} 30 | \definecolor{grey}{rgb}{0.85,0.85,0.85} 31 | \definecolor{darkorange}{rgb}{1,0.54,0} 32 | 33 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 34 | 35 | 36 | \newcommand{\bld}[1]{\mbox{\boldmath $#1$}} 37 | \newcommand{\shell}[1]{\mbox{$#1$}} 38 | \renewcommand{\vec}[1]{\mbox{\bf {#1}}} 39 | 40 | \newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize} 41 | \newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize} 42 | 43 | \newcommand{\halfs}{\frac{1}{2}} 44 | 45 | \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}} 46 | \fvset{fontsize=\footnotesize} 47 | 48 | \newcommand{\website}[1]{{\textsf{#1}}} 49 | \newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}} 50 | \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} 51 | \renewcommand{\pkg}[1]{{\textsf{#1}}} 52 | \newcommand{\todo}[1]{TODO: {\bf \textcolor{darkred}{#1}}} 53 | \newcommand{\Dag}{$^\dagger$} 54 | \newcommand{\Ast}{$^\ast$} 55 | 56 | 57 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 58 | 59 | \begin{document} 60 | 61 | <>= 62 | opts_chunk$set(tidy=FALSE,message=FALSE,size='footnotesize', 63 | background = 'white',comment=NA, digits = 3, 64 | prompt = TRUE,warning=FALSE) 65 | knit_theme$set("bclear") 66 | @ 67 | \title{ Exercises for \\ {\it Applied Predictive Modeling} \\ Chapter 7 --- Nonlinear Regression Models} 68 | \author{Max Kuhn, Kjell Johnson} 69 | \date{Version 1\\ \today} 70 | 71 | <>= 72 | library(caret) 73 | library(AppliedPredictiveModeling) 74 | library(doMC) 75 | library(lattice) 76 | library(xtable) 77 | library(parallel) 78 | registerDoMC(detectCores(logical = FALSE) - 1) 79 | 80 | options(width = 105) 81 | textList <- function (x, period = FALSE, last = " and ") 82 | { 83 | if (!is.character(x)) 84 | x <- as.character(x) 85 | numElements <- length(x) 86 | out <- if (length(x) > 0) { 87 | switch(min(numElements, 3), x, paste(x, collapse = last), 88 | { 89 | x <- paste(x, 
c(rep(",", numElements - 2), last, 90 | ""), sep = "") 91 | paste(x, collapse = " ") 92 | }) 93 | } 94 | else "" 95 | if (period) 96 | out <- paste(out, ".", sep = "") 97 | out 98 | } 99 | 100 | hook_inline = knit_hooks$get('inline') 101 | knit_hooks$set(inline = function(x) { 102 | if (is.character(x)) highr::hi_latex(x) else hook_inline(x) 103 | }) 104 | 105 | options(width = 80) 106 | @ 107 | 108 | \newcommand{\apmfun}[1]{{\tt \small \hlkwd{#1}}} 109 | \newcommand{\apmarg}[1]{{\tt \small \hlkwc{#1}}} 110 | \newcommand{\apmstr}[1]{{\tt \small \hlstr{#1}}} 111 | \newcommand{\apmnum}[1]{{\tt \small \hlnum{#1}}} 112 | \newcommand{\apmstd}[1]{{\tt \small \hlstd{#1}}} 113 | \newcommand{\apmred}[1]{\textcolor[rgb]{0.8,0.0,0}{#1}}% 114 | 115 | \maketitle 116 | 117 | \thispagestyle{empty} 118 | 119 | The solutions in this file uses several \pkg{R} packages not used in the text. To install all of the packages needed for this document, use: 120 | 121 | <>= 122 | install.packages(c("AppliedPredictiveModeling", "caret", "earth", "kernlab", 123 | "latticeExtra", "mlbench", "nnet", "plotmo")) 124 | @ 125 | 126 | \section*{Exercise 1} 127 | 128 | Investigate the relationship between the cost, $\epsilon$ and kernel parameters for a support vector machine model. Simulate a single predictor and a non--linear relationship, such as a $sin$ wave shown in Fig. 7.7: 129 | <>= 130 | set.seed(100) 131 | x <- runif(100, min = 2, max = 10) 132 | y <- sin(x) + rnorm(length(x)) * .25 133 | sinData <- data.frame(x = x, y = y) 134 | 135 | ## Create a grid of x values to use for prediction 136 | dataGrid <- data.frame(x = seq(2, 10, length = 100)) 137 | @ 138 | 139 | 140 | \begin{itemize} 141 | \item[] (a) Fit different models using a radial basis function and different values of the cost (the \apmarg{C} parameter) and $\epsilon$. Plot the fitted curve. For example: 142 | <>= 143 | library(kernlab) 144 | rbfSVM <- ksvm(x = x, y = y, data = sinData, 145 | kernel ="rbfdot", kpar = "automatic", 146 | C = 1, epsilon = 0.1) 147 | modelPrediction <- predict(rbfSVM, newdata = dataGrid) 148 | ## This is a matrix with one column. We can plot the 149 | ## model predictions by adding points to the previous plot 150 | plot(x, y) 151 | points(x = dataGrid$x, y = modelPrediction[,1], 152 | type = "l", col = "blue") 153 | ## Try other parameters 154 | @ 155 | \item[] (b) The $\sigma$ parameter can be adjusted using the \apmarg{kpar} argument, such as {\tt \Sexpr{'kpar = list(sigma = 1)'}}. Try different values of $\sigma$ to understand how this parameter changes the model fit. How do the cost, $\epsilon$ and $\sigma$ values affect the model? 156 | 157 | \end{itemize} 158 | 159 | \subsection*{Solutions} 160 | 161 | A series of SVM models will be fit over four values of $\epsilon$ and the cost parameter. A fixed value of $\sigma$ determined automatically using the \apmfun{sigest} function in the \pkg{kernlab} package. The same value is guaranteed by setting the random number seed before each SVM model. 
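For reference, \apmfun{sigest} can also be called directly; it returns a low, middle and high candidate value for $\sigma$ based on the distribution of pairwise distances among the training points, and \apmfun{ksvm} relies on such an estimate when \apmarg{kpar} is set to \apmstr{"automatic"}:

<>=
## Candidate sigma values for the simulated data
set.seed(131)
sigest(y ~ x, data = sinData, frac = .75)
@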
162 | 163 | <>= 164 | svmParam1 <- expand.grid(eps = c(.01, 0.05, .1, .5), costs = 2^c(-2, 0, 2, 8)) 165 | 166 | for(i in 1:nrow(svmParam1)) { 167 | set.seed(131) 168 | rbfSVM <- ksvm(x = x, y = y, data = sinData, 169 | kernel ="rbfdot", kpar = "automatic", 170 | C = svmParam1$costs[i], epsilon = svmParam1$eps[i]) 171 | 172 | tmp <- data.frame(x = dataGrid$x, 173 | y = predict(rbfSVM, newdata = dataGrid), 174 | eps = paste("epsilon:", format(svmParam1$eps)[i]), 175 | costs = paste("cost:", format(svmParam1$costs)[i])) 176 | svmPred1 <- if(i == 1) tmp else rbind(tmp, svmPred1) 177 | } 178 | svmPred1$costs <- factor(svmPred1$costs, levels = rev(levels(svmPred1$costs))) 179 | @ 180 | 181 | \begin{figure} 182 | \begin{center} 183 | <>= 184 | library(lattice) 185 | library(latticeExtra) 186 | bookTheme() 187 | useOuterStrips(xyplot(y~x|costs*eps, 188 | data = svmPred1, 189 | panel = function(...) { 190 | panel.xyplot(sinData$x, sinData$y, 191 | col = rgb(.2, .2, .2, .3), 192 | pch = 16, cex = .7) 193 | panel.xyplot(...) 194 | }, 195 | ylim = extendrange(sinData$y), 196 | type= "l", lwd = 2)) 197 | @ 198 | \caption{The relationship between the costs values and $\epsilon$ for the simulated data. } 199 | \label{F:sim_param1} 200 | \end{center} 201 | \end{figure} 202 | 203 | Figure \ref{F:sim_param1} shows the training data and the regression curve for each model fit. Generally speaking, the regression model becomes more jagged as the cost value increases. This is due to the training process putting a cost on the residual values. As a result, higher cost values will motivate the training process to minimize the residuals on the training set data as much as possible. This leads to models with lower bias and higher variability (i.e. the fit is likely to change if the data changes). The effect of $\epsilon$ is similar but not has potent. As $\epsilon$ {\em decreases} the model fit begins to overfit more. This effect is smaller than the cost effect. For example, in the left--most column, the amount of overfitting is negligible. However, in the right--most column, the overfitting effect can be seen the most at the extremes of the $x$--axis where the model fit changes more radically as $\epsilon$ becomes small. 
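One way to make the overfitting tangible (a rough sketch, not part of the original exercise) is to compute the apparent error on the training data for a low and a high cost value with $\epsilon$ fixed; the high--cost fit will typically track the training points much more closely:

<>=
## Apparent (training set) RMSE for two cost values with epsilon = 0.1
apparentRMSE <- function(cost) {
  set.seed(131)
  fit <- ksvm(x = x, y = y, kernel = "rbfdot", kpar = "automatic",
              C = cost, epsilon = 0.1)
  sqrt(mean((y - fitted(fit))^2))
}
sapply(c(0.25, 256), apparentRMSE)
@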
204 | 205 | Next, the effect of $\sigma$ is factored into the analysis: 206 | <>= 207 | set.seed(1016) 208 | svmParam2 <- expand.grid(eps = c(.01, 0.05, .1, .5), costs = 2^c(-2, 0, 2, 8), 209 | sigma = as.vector(sigest(y~x, data = sinData, frac = .75))) 210 | 211 | for(i in 1:nrow(svmParam2)) { 212 | rbfSVM <- ksvm(x = x, y = y, data = sinData, 213 | kernel ="rbfdot", 214 | kpar = list(sigma = svmParam2$sigma[i]), 215 | C = svmParam2$costs[i], 216 | epsilon = svmParam2$eps[i]) 217 | tmp <- data.frame(x = dataGrid$x, 218 | y = predict(rbfSVM, newdata = dataGrid), 219 | eps = paste("epsilon:", format(svmParam2$eps)[i]), 220 | costs = paste("cost:", svmParam2$costs[i]), 221 | sigma = paste("sigma:", format(svmParam2$sigma, digits = 2)[i])) 222 | svmPred2 <- if(i == 1) tmp else rbind(tmp, svmPred2) 223 | } 224 | svmPred2$costs <- factor(svmPred2$costs, levels = rev(levels(svmPred2$costs))) 225 | svmPred2$sigma <- factor(svmPred2$sigma, levels = rev(levels(svmPred2$sigma))) 226 | @ 227 | 228 | 229 | \begin{figure} 230 | \begin{center} 231 | <>= 232 | library(lattice) 233 | library(latticeExtra) 234 | svmTheme <- bookTheme(set = FALSE) 235 | svmTheme$superpose.line$col <- c(rgb(.3, .7, .3, .7), 236 | rgb(.9, .1, .1, .5), 237 | rgb(.2, .5, .7, .7)) 238 | trellis.par.set(svmTheme) 239 | useOuterStrips(xyplot(y~x|costs*eps, 240 | data = svmPred2, 241 | groups = sigma, 242 | auto.key = list(columns = 3, 243 | lines = TRUE, 244 | points = FALSE), 245 | ylim = extendrange(sinData$y), 246 | type= "l", lwd = 3)) 247 | @ 248 | \caption{The relationship between, $\sigma$, the costs values and $\epsilon$ for the simulated data} 249 | \label{F:sim_param2} 250 | \end{center} 251 | \end{figure} 252 | 253 | Figure \ref{F:sim_param2} shows the fitted regression curves. The effect of $\sigma$ is complex. For low to moderate cost values, $\sigma$ appears to effect the model bias. For example, in the left column (i.e. low cost) the curves are dynamic in $\sigma$; the low value tends to underfit the data, while high value tends to overfit the data. As cost increases, the low and mid value of $\sigma$ are similar, while the high value still tends to overfit. 254 | 255 | This low-dimensional example illustrates the powerful fitting ability that SVMs have through the tuning parameters. Clearly SVMs are prone to overfitting and must be appropriately trained/tuned to protect against overfitting to the training data. 256 | 257 | \clearpage 258 | 259 | \section*{Exercise 2} 260 | 261 | \citeasnoun{Friedman:1991p109} introduced several benchmark datasets create by simulation. One of these simulations used the following non--linear equation to create data: 262 | \[ 263 | y = 10 \sin(\pi x_1x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2) 264 | \] 265 | where the $x$ values are random variables uniformly distributed between [0, 1] (there are also 5 other non--informative variables also created in the simulation). The package \pkg{mlbench} contains a function called \apmfun{mlbench.friedman1} that can simulate these data: 266 | <>= 267 | library(mlbench) 268 | set.seed(200) 269 | trainingData <- mlbench.friedman1(200, sd = 1) 270 | ## We convert the 'x' data from a matrix to a data frame 271 | ## One reason is that this will give the columns names. 272 | trainingData$x <- data.frame(trainingData$x) 273 | 274 | ## Look at the data using 275 | ## featurePlot(trainingData$x, trainingData$y) 276 | ## or other methods. 277 | 278 | ## This creates a list with a vector 'y' and a matrix 279 | ## of predictors 'x'. 
Also simulate a large test set to 280 | ## estimate the true error rate with good precision: 281 | testData <- mlbench.friedman1(5000, sd = 1) 282 | testData$x <- data.frame(testData$x) 283 | @ 284 | 285 | Tune several models on these data. For example: 286 | <>= 287 | library(caret) 288 | set.seed(921) 289 | knnModel <- train(x = trainingData$x, 290 | y = trainingData$y, 291 | method = "knn", 292 | preProc = c("center", "scale"), 293 | tuneLength = 10) 294 | knnModel 295 | knnPred <- predict(knnModel, newdata = testData$x) 296 | 297 | ## The function 'postResample' can be used to get the test set 298 | ## performance values 299 | postResample(pred = knnPred, obs = testData$y) 300 | @ 301 | Which models appear to give the best performance? Does MARS select the informative predictors (those named \apmstd{X1} -- \apmstd{X5})? 302 | 303 | \subsection*{Solutions} 304 | $K$-nearest neighbors models perform better when the underlying relationship between the predictors and the response depends on the samples' proximity in the predictor space. That kind of neighborhood structure is not part of the data generation scheme for this particular data set. Hence, we would expect another type of model to perform better than $K$-NN. Let's try MARS and SVM radial basis function models. 305 | 306 | <>= 307 | marsGrid <- expand.grid(degree = 1:2, nprune = seq(2,14,by=2)) 308 | set.seed(921) 309 | marsModel <- train(x = trainingData$x, 310 | y = trainingData$y, 311 | method = "earth", 312 | preProc = c("center", "scale"), 313 | tuneGrid = marsGrid) 314 | 315 | marsPred <- predict(marsModel, newdata = testData$x) 316 | 317 | postResample(pred = marsPred, obs = testData$y) 318 | @ 319 | 320 | \begin{figure} 321 | \begin{center} 322 | <>= 323 | plot(marsModel) 324 | @ 325 | \caption{The tuning parameter profile for the MARS model.} 326 | \label{F:Exercise2marsTune} 327 | \end{center} 328 | \end{figure} 329 | 330 | Figure \ref{F:Exercise2marsTune} illustrates the MARS tuning parameter profile. The optimal model in this case uses \Sexpr{marsModel$bestTune$nprune} terms, has degree \Sexpr{marsModel$bestTune$degree}, with an RMSE of \Sexpr{round(getTrainPerf(marsModel)[1, "TrainRMSE"], 2)}$\%$. 331 | 332 | There are a few ways we can decipher which predictors were used in the final model. One way to do this is by using the variable importance scores: 333 | 334 | <>= 335 | varImp(marsModel) 336 | @ 337 | 338 | Another way would be to generate a summary of the model, which gives the exact form. We will re-create the model using the \apmfun{earth} function directly: 339 | 340 | <>= 341 | marsFit <- earth(x = trainingData$x, 342 | y = trainingData$y, 343 | nprune = 12, degree = 2) 344 | summary(marsFit) 345 | @ 346 | 347 | Clearly, MARS only selects \apmstd{X1} -- \apmstd{X5} as important predictors of the response. Figure \ref{F:plotmo} uses the \pkg{plotmo} package to create plots of each predictor versus the outcome. 348 | 349 | \begin{figure}[t] 350 | \begin{center} 351 | <>= 352 | plotmo(marsFit, caption = "") 353 | @ 354 | \caption{The functional form of the MARS model for the Friedman simulation data.
} 355 | \label{F:plotmo} 356 | \end{center} 357 | \end{figure} 358 | 359 | 360 | 361 | <>= 362 | set.seed(921) 363 | svmRModel <- train(x = trainingData$x, 364 | y = trainingData$y, 365 | method = "svmRadial", 366 | preProc = c("center", "scale"), 367 | tuneLength = 8) 368 | 369 | svmRPred <- predict(svmRModel, newdata = testData$x) 370 | 371 | postResample(pred = svmRPred, obs = testData$y) 372 | @ 373 | 374 | \begin{figure} 375 | \begin{center} 376 | <>= 377 | plot(svmRModel, scales = list(x = list(log = 2))) 378 | @ 379 | \caption{The tuning parameter profile for the SVM radial basis function model.} 380 | \label{F:Exercise2svmRTune} 381 | \end{center} 382 | \end{figure} 383 | 384 | Figure \ref{F:Exercise2svmRTune} illustrates the radial basis function SVM tuning parameter profile. The optimal model has a cost value of \Sexpr{svmRModel$bestTune$C}, and sigma value of \Sexpr{svmRModel$bestTune$sigma}, with an RMSE of \Sexpr{round(getTrainPerf(svmRModel)[1, "TrainRMSE"], 2)}$\%$. 385 | 386 | Overall, the MARS model performs best, with the radial basis function SVM coming in next in performance. $K$-NN has relatively poor performance for this problem, which is not surprising since the data was not generated using neighborhood information. 387 | 388 | \clearpage 389 | 390 | 391 | \section*{Exercise 3} 392 | 393 | For the Tecator data described in the last chapter, build SVM, neural network, MARS and $K$--nearest neighbor models. Since neural networks are especially sensitive to highly correlated predictors, does pre--processing using PCA help the model? 394 | 395 | \subsection*{Solutions} 396 | 397 | The same data--splitting and resampling methodology from the last set of exercises was used: 398 | <>= 399 | library(caret) 400 | data(tecator) 401 | 402 | set.seed(1029) 403 | inMeatTraining <- createDataPartition(endpoints[, 3], p = 3/4, list= FALSE) 404 | 405 | absorpTrain <- absorp[ inMeatTraining,] 406 | absorpTest <- absorp[-inMeatTraining,] 407 | proteinTrain <- endpoints[ inMeatTraining, 3] 408 | proteinTest <- endpoints[-inMeatTraining,3] 409 | 410 | ctrl <- trainControl(method = "repeatedcv", repeats = 5) 411 | @ 412 | 413 | A MARS model was fit to the data: 414 | <>= 415 | set.seed(529) 416 | meatMARS <- train(x = absorpTrain, y = proteinTrain, 417 | method = "earth", 418 | trControl = ctrl, 419 | tuneLength = 25) 420 | @ 421 | 422 | \noindent Figure \ref{F:meat_MARS} shows the results. In the end, \Sexpr{meatMARS$bestTune$nprune} terms were retained (using \Sexpr{length(predictors(meatMARS))} unique predictors) for a model with an associated RMSE of \Sexpr{round(getTrainPerf(meatMARS)[1, "TrainRMSE"], 2)}$\%$. This is worse performance than the linear models form the last set of exercises. Would bagging the MARS model help? The bagging results are also presented in Figure \ref{F:meat_MARS}. For this data, provides only a slight improvement over the original MARS model. 
423 | 424 | 425 | <>= 426 | set.seed(529) 427 | meatBMARS <- train(x = absorpTrain, y = proteinTrain, 428 | method = "bagEarth", 429 | trControl = ctrl, 430 | tuneLength = 25, 431 | B = 20) 432 | 433 | plotDat <- rbind(meatMARS$results, meatBMARS$results) 434 | plotDat$Model <- rep(c("Basic", "Bagged"), each = nrow(meatMARS$results)) 435 | @ 436 | 437 | \begin{figure} 438 | \begin{center} 439 | <>= 440 | bookTheme() 441 | xyplot(RMSE ~ nprune, 442 | data = plotDat, 443 | type = c("g", "o"), 444 | groups = Model, 445 | auto.key = list(columns = 2)) 446 | @ 447 | \caption{Resampled RMSE values versus the number of retained terms for the basic MARS model and a bagged version. } 448 | \label{F:meat_MARS} 449 | \end{center} 450 | \end{figure} 451 | 452 | 453 | <>= 454 | polyGrid <- expand.grid(degree = 1:2, 455 | C = 2^seq(8, 15, length = 8), 456 | scale = c(.5, .1, 0.01)) 457 | polyGrid <- polyGrid[!(polyGrid$scale == .5 & polyGrid$degree == 2),] 458 | 459 | set.seed(529) 460 | meatQSVM <- train(x = absorpTrain, y = proteinTrain, 461 | method = "svmPoly", 462 | preProcess = c("center", "scale"), 463 | trControl = ctrl, 464 | tuneGrid = polyGrid) 465 | @ 466 | 467 | The next model examines the performance of a polynomial kernel support vector machine tuning over the parameters of degree, cost and scale. Figure \ref{F:meat_pSVM} illustrates the results, where the optimal model has degree \Sexpr{meatQSVM$bestTune$degree}, a cost of \Sexpr{meatQSVM$bestTune$C}, and scale of \Sexpr{meatQSVM$bestTune$scale}. The RMSE of the optimal model is \Sexpr{round(getTrainPerf(meatQSVM)[1, "TrainRMSE"], 2)}$\%$ which is better than the MARS models. 468 | 469 | \begin{figure} 470 | \begin{center} 471 | <>= 472 | bookTheme() 473 | plot(meatQSVM, scales = list(x = list(log = 2)), ylim = c(.6, 2)) 474 | @ 475 | \caption{The RMSE resampling profiles for linear and quadratic support vector machines. } 476 | \label{F:meat_pSVM} 477 | \end{center} 478 | \end{figure} 479 | 480 | 481 | <>= 482 | set.seed(529) 483 | meatRSVM <- train(x = absorpTrain, y = proteinTrain, 484 | method = "svmRadial", 485 | preProcess = c("center", "scale"), 486 | trControl = ctrl, 487 | tuneLength = 10) 488 | @ 489 | 490 | To understand the performance of a radial basis support vector machine, we will search over 10 values of the cost parameter. Figure \ref{F:meat_rSVM} illustrates the results, where the optimal model has a cost of \Sexpr{meatRSVM$bestTune$C}. The RMSE of the optimal model is \Sexpr{round(getTrainPerf(meatRSVM)[1, "TrainRMSE"], 2)}$\%$ which is worse than the polynomial SVM and MARS models. 491 | 492 | \begin{figure} 493 | \begin{center} 494 | <>= 495 | bookTheme() 496 | plot(meatRSVM, scales = list(x = list(log = 2))) 497 | @ 498 | \caption{The relationship between the resampled RMSE values and the cost parameter for the RBF support vector machine. } 499 | \label{F:meat_rSVM} 500 | \end{center} 501 | \end{figure} 502 | 503 | For neural networks, we will tune over several values of the size and decay tuning parameters. In the first model, we will center and scale the predictors. In the second model, we will pre-process the predictors by centering, scaling, and using PCA. Figure \ref{F:meat_nnet} compares the tuning parameter profiles with and without PCA pre-processing. Notice that the RMSE is much higher for the PCA pre-processed set, indicating that PCA does not reduce the dimension of the predictor space in a way that is helpful for predicting absorption. 
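The reason PCA is worth trying at all is the very strong between--predictor correlation in the absorbance data. A quick way to quantify this is to summarize the pairwise correlations of the training set predictors; the chunk below is a sketch (not evaluated) that uses the \apmstd{absorpTrain} matrix defined earlier:

<<ch_07_meat_correlations, eval = FALSE>>=
## Distribution of the pairwise correlations between absorbance predictors
absorpCor <- cor(absorpTrain)
summary(absorpCor[upper.tri(absorpCor)])
@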
504 | 505 | <>= 506 | set.seed(529) 507 | meatNet <- train(x = absorpTrain, y = proteinTrain, 508 | method = "nnet", 509 | trControl = ctrl, 510 | preProc = c("center", "scale"), 511 | linout = TRUE, 512 | trace = FALSE, 513 | tuneGrid = expand.grid(size = 1:9, 514 | decay = c(0, .001, .01, .1))) 515 | 516 | @ 517 | 518 | 519 | <>= 520 | set.seed(529) 521 | meatPCANet <- train(x = absorpTrain, y = proteinTrain, 522 | method = "nnet", 523 | trControl = ctrl, 524 | preProc = c("center", "scale", "pca"), 525 | linout = TRUE, 526 | trace = FALSE, 527 | tuneGrid = expand.grid(size = 1:9, 528 | decay = c(0, .001, .01, .1))) 529 | plotNNet <- rbind(meatNet$results, meatPCANet$results) 530 | plotNNet$Model <- rep(c("Raw", "PCA"), each = nrow(meatNet$results)) 531 | @ 532 | 533 | \begin{figure} 534 | \begin{center} 535 | <>= 536 | bookTheme() 537 | print(xyplot(RMSE ~ size|Model, 538 | data = plotNNet, 539 | groups = decay, 540 | type = c("g", "o"), 541 | auto.key = list(columns = 4, 542 | lines = TRUE, 543 | cex = .6, 544 | title = "Weight Decay"), 545 | scales = list(y = list(relation = "free")))) 546 | @ 547 | \caption{Neural network resampling profiles for the RMSE statistic for different pre--processing techniques: PCA signal extraction with centering and scaling versus centering and scaling alone (Raw). } 548 | \label{F:meat_nnet} 549 | \end{center} 550 | \end{figure} 551 | 552 | <>= 553 | set.seed(529) 554 | meatKnn <- train(x = absorpTrain, y = proteinTrain, 555 | method = "knn", 556 | trControl = ctrl, 557 | preProc = c("center", "scale"), 558 | tuneGrid = data.frame(.k = seq(1, 20, by = 2))) 559 | @ 560 | 561 | <>= 562 | set.seed(529) 563 | meatPCAKnn <- train(x = absorpTrain, y = proteinTrain, 564 | method = "knn", 565 | trControl = ctrl, 566 | preProc = c("center", "scale", "pca"), 567 | tuneGrid = data.frame(k = seq(1, 20, by = 2))) 568 | 569 | plotKnn <- rbind(meatKnn$results, meatPCAKnn$results) 570 | plotKnn$Model <- rep(c("Raw", "PCA"), each = nrow(meatKnn$results)) 571 | @ 572 | 573 | Last, a $K$-nearest neighbors model is built over several values of $k$ where the predictors are pre-processed using centering and scaling, and are also pre-processed using centering, scaling, and PCA (Figure \ref{F:meat_knn}). Like neural networks, PCA does not provide useful dimension reduction for predicting the absorption response. Instead, it is better to pre-process by centering and scaling. Overall, the $K$-NN model performs the worst where RMSE of the optimal model is \Sexpr{round(getTrainPerf(meatKnn)[1, "TrainRMSE"], 2)}$\%$. 574 | 575 | 576 | \begin{figure} 577 | \begin{center} 578 | <>= 579 | bookTheme() 580 | xyplot(RMSE ~ k, 581 | data = plotKnn, 582 | groups = Model, 583 | type = c("g", "o"), 584 | auto.key = list(columns = 2, 585 | lines = TRUE, 586 | cex = .6, 587 | title = "#Neighbors")) 588 | @ 589 | \caption{The resampling RMSE profile of the $K$--nearest neighbors models (with and without PCA pre--processing). } 590 | \label{F:meat_knn} 591 | \end{center} 592 | \end{figure} 593 | 594 | 595 | <>= 596 | save(meatNet, file = "meatNet.RData") 597 | @ 598 | 599 | <>= 600 | load("meatPLS.RData") 601 | meatResamples <- resamples(list(PLS = meatPLS, 602 | MARS = meatBMARS, 603 | SVMlin = meatQSVM, 604 | SVMrad = meatRSVM, 605 | NNet = meatNet, 606 | KNN = meatKnn)) 607 | @ 608 | 609 | To compare models, the resampling distributions are presented in Figure \ref{F:meatCompare07}. 
For this data, neural networks (centering and scaling, only) and partial least squares (also centering and scaling) have the lowest overall RMSE. 610 | 611 | \begin{figure}[t!] 612 | \begin{center} 613 | <>= 614 | bwplot(meatResamples, metric = "RMSE") 615 | @ 616 | \caption{Resampling distributions of the RMSE statistic for the models fit in this document as well as the best form the linear models (i.e. PLS). } 617 | \label{F:meatCompare07} 618 | \end{center} 619 | \end{figure} 620 | 621 | \clearpage 622 | 623 | \section*{Exercise 4} 624 | 625 | Return to the permeability problem outlined in Exercise 6.2. Train several non--linear regression models and evaluate the resampling and test set performance. 626 | \begin{itemize} 627 | \item[] (a) Which non--linear regression model gives the optimal resampling and test set performance? 628 | \item[] (b) Do any of the non--linear models outperform the optimal linear model you previously developed in Exercise 6.2? If so, what might this tell you about the underlying relationship between the predictors and the response? 629 | \item[] (c) Would you recommend any of the models you have developed to replace the permeability laboratory experiment? 630 | \end{itemize} 631 | 632 | \subsection*{Solutions} 633 | 634 | In order to make a parallel comparison to the results in Exercise 6.2, we need to perform the same pre-precessing steps and set up the identical validation approach. The following syntax provides the same pre-processing, data partition into training and testing sets, and validation set-up. 635 | 636 | 637 | <>= 638 | library(AppliedPredictiveModeling) 639 | data(permeability) 640 | 641 | #Identify and remove NZV predictors 642 | nzvFingerprints = nearZeroVar(fingerprints) 643 | noNzvFingerprints <- fingerprints[,-nzvFingerprints] 644 | 645 | #Split data into training and test sets 646 | set.seed(614) 647 | trainingRows <- createDataPartition(permeability, 648 | p = 0.75, 649 | list = FALSE) 650 | 651 | trainFingerprints <- noNzvFingerprints[trainingRows,] 652 | trainPermeability <- permeability[trainingRows,] 653 | 654 | testFingerprints <- noNzvFingerprints[-trainingRows,] 655 | testPermeability <- permeability[-trainingRows,] 656 | 657 | set.seed(614) 658 | ctrl <- trainControl(method = "LGOCV") 659 | @ 660 | 661 | Next, we will find optimal tuning parameters for MARS, SVM (radial basis function), and K-NN models. 662 | 663 | <>= 664 | set.seed(614) 665 | 666 | marsPermGrid <- expand.grid(degree = 1:2, nprune = seq(2,14,by=2)) 667 | marsPermTune <- train(x = trainFingerprints, y = log10(trainPermeability), 668 | method = "earth", 669 | tuneGrid = marsPermGrid, 670 | trControl = ctrl) 671 | 672 | 673 | RSVMPermGrid <- expand.grid(sigma = c(0.0005,0.001,0.0015), 674 | C = seq(1,49,by=6)) 675 | RSVMPermTune <- train(x = trainFingerprints, y = log10(trainPermeability), 676 | method = "svmRadial", 677 | trControl = ctrl, 678 | tuneGrid = RSVMPermGrid) 679 | 680 | knnPermTune <- train(x = trainFingerprints, y = log10(trainPermeability), 681 | method = "knn", 682 | tuneLength = 10) 683 | @ 684 | 685 | Figure \ref{F:permeabilityMARSTunePlot} indicates that the optimal degree and number of terms that maximize $R^2$ are \Sexpr{marsPermTune$results$degree[best(marsPermTune$results, "Rsquared", maximize = TRUE)]} and \Sexpr{marsPermTune$results$nprune[best(marsPermTune$results, "Rsquared", maximize = TRUE)]}, respectively, with an $R^2$ of \Sexpr{round(marsPermTune$results$Rsquared[best(marsPermTune$results, "Rsquared", maximize = TRUE)],2)}. 
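The selected tuning parameters and resampled performance can also be read directly from each \apmfun{train} object, which is a convenient cross--check on the figures that follow. A sketch (not evaluated), assuming the three fits from the previous chunk:

<<ch_07_perm_best_tune, eval = FALSE>>=
## Chosen tuning parameters for each model
marsPermTune$bestTune
RSVMPermTune$bestTune
knnPermTune$bestTune

## Resampled performance for the selected settings
getTrainPerf(marsPermTune)
getTrainPerf(RSVMPermTune)
getTrainPerf(knnPermTune)
@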
686 | 687 | \begin{figure}[ht] 688 | \begin{center} 689 | <>= 690 | plotTheme <- bookTheme() 691 | trellis.par.set(plotTheme) 692 | plot(marsPermTune,metric="Rsquared") 693 | @ 694 | \caption{MARS tuning parameter profile for the permeability data} 695 | \label{F:permeabilityMARSTunePlot} 696 | \end{center} 697 | \end{figure} 698 | 699 | 700 | Figure \ref{F:permeabilityRSVMTunePlot} indicates that the optimal cost and sigma that maximize $R^2$ are \Sexpr{RSVMPermTune$results$C[best(RSVMPermTune$results, "Rsquared", maximize = TRUE)]} and \Sexpr{RSVMPermTune$results$sigma[best(RSVMPermTune$results, "Rsquared", maximize = TRUE)]}, respectively, with an $R^2$ of \Sexpr{round(RSVMPermTune$results$Rsquared[best(RSVMPermTune$results, "Rsquared", maximize = TRUE)],2)}. 701 | 702 | \begin{figure}[ht] 703 | \begin{center} 704 | <>= 705 | plotTheme <- bookTheme() 706 | trellis.par.set(plotTheme) 707 | plot(RSVMPermTune,metric="Rsquared", scales = list(x = list(log = 2))) 708 | @ 709 | \caption{Radial basis function SVM tuning parameter profile for the permeability data} 710 | \label{F:permeabilityRSVMTunePlot} 711 | \end{center} 712 | \end{figure} 713 | 714 | 715 | Figure \ref{F:permeabilityKNNTunePlot} indicates that the optimal number of nearest neighbors that maximize $R^2$ are \Sexpr{knnPermTune$results$k[best(knnPermTune$results, "Rsquared", maximize = TRUE)]}, respectively, with an $R^2$ of \Sexpr{round(knnPermTune$results$Rsquared[best(knnPermTune$results, "Rsquared", maximize = TRUE)],2)}. 716 | 717 | \begin{figure}[ht] 718 | \begin{center} 719 | <>= 720 | plotTheme <- bookTheme() 721 | trellis.par.set(plotTheme) 722 | plot(knnPermTune,metric="Rsquared") 723 | @ 724 | \caption{K-nearest neighbors tuning parameter profile for the permeability data} 725 | \label{F:permeabilityKNNTunePlot} 726 | \end{center} 727 | \end{figure} 728 | 729 | For these three non-linear models, the radial basis function SVM model performs best with a leave-group out cross-validated $R^2$ value of \Sexpr{round(RSVMPermTune$results$Rsquared[best(RSVMPermTune$results, "Rsquared", maximize = TRUE)],2)}. This is worse than the elastic net model tuned in Exercise 6.2 which had a leave-group out cross-validated $R^2$ value of 0.58. These results indicate that the underlying relationship between the predictors and the response is likely best described by a linear structure in a reduced dimension of the original space. 730 | 731 | For the models tuned thus far, we would recommend the elastic net model. 732 | 733 | \clearpage 734 | 735 | \section*{Exercise 5} 736 | 737 | Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data--splitting and pre--processing steps as before and train several non--linear regression models. 738 | \begin{itemize} 739 | \item[] (a) Which non--linear regression model gives the optimal resampling and test set performance? 740 | \item[] (b) Which predictors are most important in the optimal non--linear regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear model? 741 | \item[] (c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal non--linear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield? 
742 | \end{itemize} 743 | 744 | \subsection*{Solutions} 745 | 746 | <>= 747 | library(AppliedPredictiveModeling) 748 | data(ChemicalManufacturingProcess) 749 | 750 | predictors <- subset(ChemicalManufacturingProcess,select= -Yield) 751 | yield <- subset(ChemicalManufacturingProcess,select="Yield") 752 | 753 | set.seed(517) 754 | trainingRows <- createDataPartition(yield$Yield, 755 | p = 0.7, 756 | list = FALSE) 757 | 758 | trainPredictors <- predictors[trainingRows,] 759 | trainYield <- yield[trainingRows,] 760 | 761 | testPredictors <- predictors[-trainingRows,] 762 | testYield <- yield[-trainingRows,] 763 | 764 | #Pre-process trainPredictors and apply to trainPredictors and testPredictors 765 | pp <- preProcess(trainPredictors,method=c("BoxCox","center","scale","knnImpute")) 766 | ppTrainPredictors <- predict(pp,trainPredictors) 767 | ppTestPredictors <- predict(pp,testPredictors) 768 | 769 | #Identify and remove NZV 770 | nzvpp = nearZeroVar(ppTrainPredictors) 771 | ppTrainPredictors <- ppTrainPredictors[-nzvpp] 772 | ppTestPredictors <- ppTestPredictors[-nzvpp] 773 | 774 | #Identify and remove highly correlated predictors 775 | predcorr = cor(ppTrainPredictors) 776 | highCorrpp <- findCorrelation(predcorr) 777 | ppTrainPredictors <- ppTrainPredictors[, -highCorrpp] 778 | ppTestPredictors <- ppTestPredictors[, -highCorrpp] 779 | 780 | #Set-up trainControl 781 | set.seed(517) 782 | ctrl <- trainControl(method = "boot", number = 25) 783 | @ 784 | 785 | For this example, we will tune a MARS and SVM (polynomial kernel function) model 786 | 787 | <>= 788 | set.seed(614) 789 | 790 | marsChemGrid <- expand.grid(degree = c(1:2), nprune = c(2:10)) 791 | marsChemTune <- train(x = ppTrainPredictors, y = trainYield, 792 | method = "earth", 793 | tuneGrid = marsChemGrid, 794 | trControl = ctrl) 795 | 796 | psvmTuneGrid <- expand.grid(C=c(0.01,0.05,0.1), degree=c(1,2), scale=c(0.25,0.5,1)) 797 | PSVMChemTune <- train(x = ppTrainPredictors, y = trainYield, 798 | method = "svmPoly", 799 | trControl = ctrl, 800 | tuneGrid = psvmTuneGrid) 801 | @ 802 | 803 | Figure \ref{F:chemMARSTunePlot} indicates that the optimal degree and number of terms that maximize $R^2$ are \Sexpr{marsChemTune$results$degree[best(marsChemTune$results, "Rsquared", maximize = TRUE)]} and \Sexpr{marsChemTune$results$nprune[best(marsChemTune$results, "Rsquared", maximize = TRUE)]}, respectively, with an $R^2$ of \Sexpr{round(marsChemTune$results$Rsquared[best(marsChemTune$results, "Rsquared", maximize = TRUE)],2)}. Hence, MARS identifies a fairly simple model with the terms listed in Table \ref{T:chemMARSSummary}. 
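Because both models were tuned with 25 bootstrap resamples from the same \apmstd{ctrl} object, their resampling distributions can be placed side by side with \apmfun{resamples} (keeping in mind that the bootstrap samples themselves are not literally identical across the two calls). A sketch, not evaluated here:

<<ch_07_chem_resamples, eval = FALSE>>=
## Summarize the resampled performance of the two non-linear models
chemResamples <- resamples(list(MARS = marsChemTune, SVMpoly = PSVMChemTune))
summary(chemResamples)
@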
804 | 805 | \begin{figure}[ht] 806 | \begin{center} 807 | <>= 808 | plotTheme <- bookTheme() 809 | trellis.par.set(plotTheme) 810 | plot(marsChemTune,metric="Rsquared") 811 | @ 812 | \caption{MARS tuning parameter profile for the chemical manufacturing data.} 813 | \label{F:chemMARSTunePlot} 814 | \end{center} 815 | \end{figure} 816 | 817 | 818 | \begin{figure}[h] 819 | \begin{center} 820 | <>= 821 | marsTest <- data.frame(Observed=testYield,Predicted=predict(marsChemTune,ppTestPredictors)) 822 | scatterTheme <- caretTheme() 823 | 824 | scatterTheme$plot.line$col <- c("blue") 825 | scatterTheme$plot.line$lwd <- 2 826 | 827 | scatterTheme$plot.symbol$col <- rgb(0, 0, 0, .3) 828 | scatterTheme$plot.symbol$cex <- 0.8 829 | scatterTheme$plot.symbol$pch <- 16 830 | 831 | scatterTheme$add.text <- list(cex = 0.6) 832 | 833 | trellis.par.set(scatterTheme) 834 | xyplot(Predicted ~ Observed, 835 | marsTest, 836 | panel = function(...) { 837 | theDots <- list(...) 838 | panel.xyplot(..., type = c("p", "g","r","smooth")) 839 | corr <- round(cor(theDots$x, theDots$y), 2) 840 | panel.text(44, 841 | min(theDots$y), 842 | paste("corr:", corr)) 843 | }, 844 | ylab = "Predicted", 845 | xlab = "Observed") 846 | @ 847 | \caption[mars manufacturing test]{MARS predictions for the test set for the chemical manufacturing data.} 848 | \label{F:MARSTestPreds} 849 | \end{center} 850 | \end{figure} 851 | 852 | The test set predictions for the MARS model are presented in Figure \ref{F:MARSTestPreds}, with an $R^2$ value of \Sexpr{round(cor(marsTest$Observed,marsTest$Predicted)^2, 3)}. 853 | 854 | <>= 855 | marsSummary <- summary(marsChemTune$finalModel) 856 | 857 | xtable(marsSummary$coefficients, 858 | digits=c(0,2), 859 | align=c("l|r"), 860 | label = "T:chemMARSSummary", 861 | caption = "MARS coefficients for chemical manufacturing data.") 862 | @ 863 | 864 | 865 | The final MARS model has slightly worse cross-validation and test set performance than the optimal PLS model (see solutions for Exercise 6.3). This would indicate that the underlying structure between the predictors and the response is approximately linear. We can use marginal plots of each predictor in the model to better understand the relationship that the MARS model has detected. Figure \ref{F:chemMARSMarginalPlot} displays the marginal relationships between the predictors identified by the model and the predicted response. Based on this figure, the underlying relationships are approximately linear in this model. 
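An alternative to the custom grid function used in the chunk below is the \pkg{plotmo} package (used for Exercise 2 above), which builds similar marginal plots directly from the underlying \apmfun{earth} fit. A sketch, not evaluated:

<<ch_07_chem_plotmo, eval = FALSE>>=
library(plotmo)
## Marginal plots from the earth model stored inside the train object
plotmo(marsChemTune$finalModel, caption = "")
@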
866 | 867 | \begin{figure}[ht] 868 | \begin{center} 869 | <>= 870 | marsImp <- varImp(marsChemTune) 871 | impOrder <- order(marsImp$importance$Overall, decreasing=TRUE) 872 | impVars <- rownames(marsImp$importance)[impOrder] 873 | top2 <- impVars[1:2] 874 | 875 | gridfcn <- function(x, data, topPreds, size=2) { 876 | grid <- NULL 877 | mns <- matrix(colMeans(data[,topPreds]), nrow = 1) 878 | mns <- apply(mns, 2, function(x, n) rep(x, n), n = size) 879 | colnames(mns) <- topPreds 880 | mns <- as.data.frame(mns) 881 | grid <- vector(mode = "list", length = length(topPreds)) 882 | for(i in seq(along = topPreds)){ 883 | tmp <- mns 884 | tmp[,i] <- seq(min(data[,topPreds[i]]), max(data[,topPreds[i]]), length = size) 885 | grid[[i]] <- data.frame(pred = predict(x, tmp), 886 | var = topPreds[i], 887 | x = tmp[,i]) 888 | } 889 | grid <- do.call("rbind", grid) 890 | colnames(grid) <- c("pred","var","x") 891 | grid 892 | } 893 | 894 | mg <- gridfcn(marsChemTune$finalModel, ppTrainPredictors, topPreds=top2) 895 | 896 | plotTheme <- bookTheme() 897 | trellis.par.set(plotTheme) 898 | xyplot(pred ~ x|var, data=mg, 899 | scales = list(x=list(relation="free"), y=list(alternating=0)), 900 | type = c("g","l"), 901 | as.table = TRUE, 902 | xlab = "", 903 | ylab = "Predicted Outcome", 904 | between = list(x=1,y=1)) 905 | @ 906 | \caption{MARS marginal plots for chemical manufacturing data.} 907 | \label{F:chemMARSMarginalPlot} 908 | \end{center} 909 | \end{figure} 910 | 911 | Referring back to Figure 15 of the solutions from Chapter 6, the top two PLS predictors are ManufacturingProcess32 and ManufacturingProcess09, which are the same as what the MARS model identifies. PLS, however, identifies additional predictive information from the other predictors that improve the predictive ability of the models. Overall, many of the manufacturing process predictors are at the top of the importance list. 912 | 913 | Figure \ref{F:chemPSVMTunePlot} indicates that the optimal degree, cost, and scale that maximize $R^2$ are \Sexpr{PSVMChemTune$results$degree[best(PSVMChemTune$results, "Rsquared", maximize = TRUE)]}, \Sexpr{PSVMChemTune$results$C[best(PSVMChemTune$results, "Rsquared", maximize = TRUE)]}, and \Sexpr{PSVMChemTune$results$scale[best(PSVMChemTune$results, "Rsquared", maximize = TRUE)]}, respectively, with an $R^2$ of \Sexpr{round(PSVMChemTune$results$Rsquared[best(PSVMChemTune$results, "Rsquared", maximize = TRUE)],2)}. 914 | 915 | \begin{figure}[ht] 916 | \begin{center} 917 | <>= 918 | plotTheme <- bookTheme() 919 | trellis.par.set(plotTheme) 920 | plot(PSVMChemTune,metric="Rsquared") 921 | @ 922 | \caption{Polynomial SVM tuning parameter profile for the chemical manufacturing data.} 923 | \label{F:chemPSVMTunePlot} 924 | \end{center} 925 | \end{figure} 926 | 927 | The polynomial SVM tuning parameters both indicate that a linear model is sufficient for this model. Specifically, a degree of 1 is an indicator of linear structure, and a low cost is also an indicator of linear structure. 
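One way to confirm this is to print the selected tuning parameters and to check the test set for both non--linear models; a sketch (not evaluated), assuming the objects created earlier:

<<ch_07_chem_svm_check, eval = FALSE>>=
## Selected degree, cost, and scale for the polynomial SVM
PSVMChemTune$bestTune

## Test set performance for the polynomial SVM and MARS models
postResample(predict(PSVMChemTune, ppTestPredictors), testYield)
postResample(predict(marsChemTune, ppTestPredictors), testYield)
@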
928 | 929 | 930 | 931 | \clearpage 932 | 933 | \section*{Session Info} 934 | 935 | <>= 936 | toLatex(sessionInfo()) 937 | @ 938 | 939 | 940 | \bibliographystyle{ECA_jasa} 941 | \bibliography{Ch_07_Ex_sol} 942 | 943 | \end{document} 944 | 945 | 946 | 947 | 948 | -------------------------------------------------------------------------------- /Ch_07.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/topepo/APM_Exercises/73d35f3e62522d9e9a61552159bbeda9d6f02d57/Ch_07.pdf -------------------------------------------------------------------------------- /Ch_07_Ex_Sol.bib: -------------------------------------------------------------------------------- 1 | @article{Friedman:1991p109, 2 | Author = {J Friedman}, 3 | Date-Added = {2009-07-06 20:00:59 -0400}, 4 | Date-Modified = {2010-01-04 15:23:43 -0500}, 5 | Journal = {The Annals of Statistics}, 6 | Local-Url = {file://localhost/Users/kuhna03/Documents/Papers/1991/Friedman/The%20annals%20of%20statistics%201991%20Friedman.pdf}, 7 | Number = {1}, 8 | Pages = {1--141}, 9 | Pmid = {13659018209715916422related:ho61crGUjr0J}, 10 | Rating = {0}, 11 | Read = {Yes}, 12 | Title = {Multivariate Adaptive Regression Splines}, 13 | Uri = {papers://B8B5D302-E53E-4247-B570-F78A74DA1B28/Paper/p109}, 14 | Volume = {19}, 15 | Year = {1991}} 16 | -------------------------------------------------------------------------------- /Ch_08.Rnw: -------------------------------------------------------------------------------- 1 | 2 | \documentclass[12pt]{article} 3 | 4 | \usepackage{amsmath} 5 | \usepackage{graphicx} 6 | \usepackage{color} 7 | \usepackage{xspace} 8 | \usepackage{fancyvrb} 9 | \usepackage{rotating} 10 | \usepackage[ 11 | colorlinks=true, 12 | linkcolor=blue, 13 | citecolor=blue, 14 | urlcolor=blue] 15 | {hyperref} 16 | 17 | \usepackage[default]{jasa_harvard} 18 | %\usepackage{JASA_manu} 19 | 20 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 21 | 22 | \setlength{\oddsidemargin}{-.25 truein} 23 | \setlength{\evensidemargin}{0truein} 24 | \setlength{\topmargin}{-0.2truein} 25 | \setlength{\textwidth}{7 truein} 26 | \setlength{\textheight}{8.5 truein} 27 | \setlength{\parindent}{0truein} 28 | \setlength{\parskip}{0.07truein} 29 | 30 | \definecolor{darkred}{rgb}{0.6,0.0,0} 31 | \definecolor{darkblue}{rgb}{.165, 0, .659} 32 | \definecolor{grey}{rgb}{0.85,0.85,0.85} 33 | \definecolor{darkorange}{rgb}{1,0.54,0} 34 | 35 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 36 | 37 | 38 | \newcommand{\bld}[1]{\mbox{\boldmath $#1$}} 39 | \newcommand{\shell}[1]{\mbox{$#1$}} 40 | \renewcommand{\vec}[1]{\mbox{\bf {#1}}} 41 | 42 | \newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize} 43 | \newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize} 44 | 45 | \newcommand{\halfs}{\frac{1}{2}} 46 | 47 | \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}} 48 | \fvset{fontsize=\footnotesize} 49 | 50 | \newcommand{\website}[1]{{\textsf{#1}}} 51 | \newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}} 52 | \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} 53 | \renewcommand{\pkg}[1]{{\textsf{#1}}} 54 | \newcommand{\todo}[1]{TODO: {\bf \textcolor{darkred}{#1}}} 55 | \newcommand{\Dag}{$^\dagger$} 56 | \newcommand{\Ast}{$^\ast$} 57 | 58 | 59 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 60 | 61 | \begin{document} 62 | <>= 63 | 
opts_chunk$set(tidy=FALSE,message=FALSE,size='footnotesize', 64 | background = 'white',comment=NA, digits = 3, 65 | prompt = TRUE) 66 | knit_theme$set("bclear") 67 | @ 68 | \title{ Exercises for \\ {\it Applied Predictive Modeling} \\ Chapter 8 --- Regression Trees and Rule--Based Models} 69 | \author{Max Kuhn, Kjell Johnson} 70 | \date{Version 1\\ \today} 71 | 72 | <>= 73 | library(caret) 74 | library(AppliedPredictiveModeling) 75 | library(rpart) 76 | library(randomForest) 77 | library(ipred) 78 | library(party) 79 | library(partykit) 80 | library(Cubist) 81 | library(gbm) 82 | library(pls) 83 | library(kernlab) 84 | library(xtable) 85 | library(doMC) 86 | library(parallel) 87 | registerDoMC(detectCores(logical = FALSE) - 1) 88 | 89 | options(width = 105) 90 | textList <- function (x, period = FALSE, last = " and ") 91 | { 92 | if (!is.character(x)) 93 | x <- as.character(x) 94 | numElements <- length(x) 95 | out <- if (length(x) > 0) { 96 | switch(min(numElements, 3), x, paste(x, collapse = last), 97 | { 98 | x <- paste(x, c(rep(",", numElements - 2), last, 99 | ""), sep = "") 100 | paste(x, collapse = " ") 101 | }) 102 | } 103 | else "" 104 | if (period) 105 | out <- paste(out, ".", sep = "") 106 | out 107 | } 108 | 109 | hook_inline = knit_hooks$get('inline') 110 | knit_hooks$set(inline = function(x) { 111 | if (is.character(x)) highr::hi_latex(x) else hook_inline(x) 112 | }) 113 | @ 114 | 115 | \newcommand{\apmfun}[1]{{\tt \small \hlkwd{#1}}} 116 | \newcommand{\apmarg}[1]{{\tt \small \hlkwc{#1}}} 117 | \newcommand{\apmstr}[1]{{\tt \small \hlstr{#1}}} 118 | \newcommand{\apmnum}[1]{{\tt \small \hlnum{#1}}} 119 | \newcommand{\apmstd}[1]{{\tt \small \hlstd{#1}}} 120 | \newcommand{\apmred}[1]{\textcolor[rgb]{0.8,0.0,0}{#1}}% 121 | 122 | \maketitle 123 | 124 | \thispagestyle{empty} 125 | 126 | The solutions in this file uses several \pkg{R} packages not used in the text. To install all of the packages needed for this document, use: 127 | 128 | <>= 129 | install.packages(c("AppliedPredictiveModeling", "caret", "Cubist", "ipred", 130 | "mlbench", "party", "randomForest")) 131 | @ 132 | 133 | 134 | \section*{Exercise 1} 135 | 136 | 137 | Recreate the simulated data from Exercise 7.2: 138 | <>= 139 | library(mlbench) 140 | set.seed(200) 141 | simulated <- mlbench.friedman1(200, sd = 1) 142 | simulated <- cbind(simulated$x, simulated$y) 143 | simulated <- as.data.frame(simulated) 144 | colnames(simulated)[ncol(simulated)] <- "y" 145 | @ 146 | 147 | \begin{itemize} 148 | \item[] (a) Fit a random forest model to all of the predictors, then 149 | estimate the variable importance scores: 150 | <>= 151 | library(randomForest) 152 | library(caret) 153 | model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000) 154 | rfImp1 <- varImp(model1, scale = FALSE) 155 | @ 156 | \item[] Did the random forest model significantly use the 157 | uninformative predictors (\apmstd{V6} -- \apmstd{V10})? 158 | \item[] 159 | \item[] (b) Now add an additional predictor that is highly correlated 160 | with one of the informative predictors. For example: 161 | <>= 162 | set.seed(600) 163 | simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1 164 | cor(simulated$duplicate1, simulated$V1) 165 | @ 166 | \item[] Fit another random forest model to these data. Did the 167 | importance score for \apmstd{V1} change? What happens when you add 168 | another predictors that is also highly correlated with \apmstd{V1}? 
169 | \item[] 170 | \item[] (c) Use the \apmfun{cforest} function in the \pkg{party} package 171 | to fit a random forest model using conditional inference trees. The 172 | \pkg{party} package function \apmfun{varimp} can be used to calculate 173 | predictor importance. The \apmarg{conditional} argument of that 174 | function toggles between the traditional importance measure and the 175 | modified version described in \cite{17254353}. Do these importances 176 | show the same pattern as the traditional random forest model? 177 | \item[] 178 | \item[] (d) Repeat this process with different tree models, such as 179 | boosted trees and Cubist. Does the same pattern occur? 180 | \end{itemize} 181 | 182 | 183 | \subsection*{Solutions} 184 | The predictor importance scores for the simulated data set in part (a) can be seen in Table \ref{T:varImpSimulation1}. The model places most importance on predictors 1, 2, 4, and 5, and very little importance on 6 through 10. 185 | 186 | <>= 187 | print(xtable(round(rfImp1,2), 188 | caption = "Variable importance scores for part (a) simulation.", 189 | label = "T:varImpSimulation1")) 190 | @ 191 | 192 | Next, we will add a highly correlated predictor (part (b)) and re-model the data. Table \ref{T:varImpSimulation2} lists the importance scores for predictors V1--V10 when we include a predictor that is highly correlated with V1. Notice that the importance score for V1 drops when a highly correlated predictor is included in the data: V1 falls to third in the overall importance ranking. 193 | 194 | <>= 195 | model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000) 196 | rfImp2 <- varImp(model2, scale = FALSE) 197 | 198 | vnames <- c('V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'duplicate1') 199 | 200 | names(rfImp1) <- "Original" 201 | rfImp1$Variable <- factor(rownames(rfImp1), levels = vnames) 202 | 203 | names(rfImp2) <- "Extra" 204 | rfImp2$Variable <- factor(rownames(rfImp2), levels = vnames) 205 | 206 | rfImps <- merge(rfImp1, rfImp2, all = TRUE) 207 | rownames(rfImps) <- rfImps$Variable 208 | rfImps$Variable <- NULL 209 | @ 210 | 211 | <>= 212 | print(xtable(round(rfImps,2), 213 | caption = "Variable importance scores for part (b) simulation.", 214 | label = "T:varImpSimulation2")) 215 | @ 216 | 217 | Next, we will build a conditional inference random forest for the original data set and compute the corresponding predictor importance scores. We will also build a conditional inference random forest on the data set that includes the extra predictor that is highly correlated with V1. 
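Before doing so, one informal way to see that the part (b) drop for \apmstd{V1} reflects importance being shared with the near--duplicate predictor (rather than simply lost) is to add the two scores together and compare the total to the original score. This is only a rough check; the sketch below is not evaluated and uses the \apmstd{rfImps} data frame from the chunk above:

<<ch_08_shared_importance, eval = FALSE>>=
## Importance of V1 alone in the original fit...
rfImps["V1", "Original"]
## ...versus the combined importance of V1 and its near-duplicate in the
## fit that contains both (roughly comparable, not an exact decomposition)
sum(rfImps[c("V1", "duplicate1"), "Extra"])
@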
218 | 219 | <>= 220 | library(party) 221 | set.seed(147) 222 | cforest1 <- cforest(y ~ ., data = simulated[, 1:11], 223 | controls = cforest_control(ntree = 1000)) 224 | set.seed(147) 225 | cforest2 <- cforest(y ~ ., data = simulated, 226 | controls = cforest_control(ntree = 1000)) 227 | 228 | cfImps1 <- varimp(cforest1) 229 | cfImps2 <- varimp(cforest2) 230 | cfImps3 <- varimp(cforest1, conditional = TRUE) 231 | cfImps4 <- varimp(cforest2, conditional = TRUE) 232 | 233 | cfImps1 <- data.frame(Original = cfImps1, 234 | Variable = factor(names(cfImps1), levels = vnames)) 235 | 236 | cfImps2 <- data.frame(Extra = cfImps2, 237 | Variable = factor(names(cfImps2), levels = vnames)) 238 | 239 | cfImps3 <- data.frame(CondInf = cfImps3, 240 | Variable = factor(names(cfImps3), levels = vnames)) 241 | 242 | cfImps4 <- data.frame("CondInf Extra" = cfImps4, 243 | Variable = factor(names(cfImps4), levels = vnames)) 244 | 245 | cfImps <- merge(cfImps1, cfImps2, all = TRUE) 246 | cfImps <- merge(cfImps, cfImps3, all = TRUE) 247 | cfImps <- merge(cfImps, cfImps4, all = TRUE) 248 | rownames(cfImps) <- cfImps$Variable 249 | cfImps$Variable <- factor(cfImps$Variable, levels = vnames) 250 | cfImps <- cfImps[order(cfImps$Variable),] 251 | cfImps$Variable <- NULL 252 | @ 253 | 254 | Predictor importance scores for the conditional inference random forests can be seen in Table \ref{T:varImpSimulation3}. The conditional inference model has a similar pattern of importance as the random forest model from Part (a), placing most importance on predictors 1, 2, 4, and 5 and very little importance on 6 through 10. Adding a highly correlated predictor has a detrimenal effect on the importance for {\tt V1} dropping its importance rank to third. 255 | 256 | <>= 257 | print(xtable(round(cfImps,2), 258 | caption = "Variable importance scores for part (c) simulation.", 259 | label = "T:varImpSimulation3")) 260 | @ 261 | 262 | Finally, we will examine the effect of adding a highly correlated predictor on bagging and Cubist. We will explore bagging through the following simulation: 263 | 264 | <>= 265 | library(ipred) 266 | set.seed(147) 267 | bagFit1 <- bagging(y ~ ., data = simulated[, 1:11], nbag = 50) 268 | set.seed(147) 269 | bagFit2 <- bagging(y ~ ., data = simulated, nbag = 50) 270 | bagImp1 <- varImp(bagFit1) 271 | names(bagImp1) <- "Original" 272 | bagImp1$Variable <- factor(rownames(bagImp1), levels = vnames) 273 | 274 | bagImp2 <- varImp(bagFit2) 275 | names(bagImp2) <- "Extra" 276 | bagImp2$Variable <- factor(rownames(bagImp2), levels = vnames) 277 | 278 | bagImps <- merge(bagImp1, bagImp2, all = TRUE) 279 | rownames(bagImps) <- bagImps$Variable 280 | bagImps$Variable <- NULL 281 | @ 282 | 283 | Table \ref{T:varImpSimulation4} indicates that predictors {\tt V1}--{\tt V5} are at the top of the importance ranking. However {\tt V6}--{\tt V10} have relatively higher importance scores as compared to the random forest importance scores. Adding an extra highly correlated predictor with {\tt V1} has less of an impact on the overall importance score for {\tt V1} as compared to random forest. 284 | 285 | <>= 286 | print(xtable(round(bagImps,2), 287 | caption = "Variable importance scores for part (d) simulation using bagging.", 288 | label = "T:varImpSimulation4")) 289 | @ 290 | 291 | For Cubist, Table \ref{T:varImpSimulation5} indicates that predictors {\tt V1}--{\tt V5} are at the top of the importance ranking. 
Adding an extra highly correlated predictor with {\tt V1} has very little impact on the importance scores when using Cubist. 292 | 293 | <>= 294 | library(Cubist) 295 | set.seed(147) 296 | cbFit1 <- cubist(x = simulated[, 1:10], 297 | y = simulated$y, 298 | committees = 100) 299 | cbImp1 <- varImp(cbFit1) 300 | names(cbImp1) <- "Original" 301 | cbImp1$Variable <- factor(rownames(cbImp1), levels = vnames) 302 | 303 | set.seed(147) 304 | cbFit2 <- cubist(x = simulated[, names(simulated) != "y"], 305 | y = simulated$y, 306 | committees = 100) 307 | cbImp2 <- varImp(cbFit2) 308 | names(cbImp2) <- "Extra" 309 | cbImp2$Variable <- factor(rownames(cbImp2), levels = vnames) 310 | 311 | cbImp <- merge(cbImp1, cbImp2, all = TRUE) 312 | rownames(cbImp) <- cbImp$Variable 313 | cbImp$Variable <- NULL 314 | @ 315 | 316 | <>= 317 | print(xtable(round(cbImp,2), 318 | caption = "Variable importance scores for part (d) simulation using Cubist.", 319 | label = "T:varImpSimulation5")) 320 | @ 321 | 322 | 323 | 324 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 325 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 326 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 327 | 328 | \clearpage 329 | \section*{Exercise 2} 330 | 331 | Use a simulation to show tree bias with different granularities. 332 | 333 | \subsection*{Solutions} 334 | 335 | Recall \cite{lohshih97} found that predictors that are more granular (or have more potential split points) have a greater chance of being used towards the top of a tree to partition, even if the predictor has little-to-no relationship with the response. To investigate this phenomenon, let's develop a simple simulation. For the simulation, we will generate one categorical predictor that is informative at separating the response into two more homogenous groups. We will also generate a continuous predictor that is not informative at separating the response into two more homogenous groups. We will then use both of these predictors to build a one-split tree and note which predictor is used to split the data. This simulation will be run many times and we will tally the number of times each predictor is used as the first split. 336 | 337 | <>= 338 | 339 | set.seed(102) 340 | X1 <- rep(1:2,each=100) 341 | Y <- X1 + rnorm(200,mean=0,sd=4) 342 | set.seed(103) 343 | X2 <- rnorm(200,mean=0,sd=2) 344 | 345 | simData <- data.frame(Y=Y,X1=X1,X2=X2) 346 | @ 347 | 348 | The code chuck above defines how each predictor (X1 and X2) are related to the response. Predictor X1 has two categories and is created to separate the response into two more homogenous groups. Predictor X2, however, is not related to the response. Figure \ref{F:treeBiasFig1} illustrates the relationship between each predictor and the response. 349 | 350 | \begin{figure}[t!] 351 | \begin{center} 352 | <>= 353 | plotTheme <- bookTheme() 354 | trellis.par.set(plotTheme) 355 | bw <- bwplot(Y~X1, 356 | data=simData, 357 | ylab = "Y", 358 | xlab = "X1", 359 | horizontal = FALSE, 360 | panel = function(...) 361 | { 362 | panel.bwplot(...) 363 | } 364 | ) 365 | 366 | xy <- xyplot(Y~X2, 367 | data=simData, 368 | xlab = "X2", 369 | ylab = "Y") 370 | 371 | print(bw, split=c(1,1,2,1), more=TRUE) 372 | print(xy, split=c(2,1,2,1)) 373 | @ 374 | \caption[Tree bias sim]{The univariate relationship with each predictor and the response. 
Predictor X1 has only two categories, but is defined to create two more homogenous groups with respect to the response. Predictor X2 is continuous (and hence far more granular, with 200 distinct values) and is not related to the response.} 375 | \label{F:treeBiasFig1} 376 | \end{center} 377 | \end{figure} 378 | 379 | <>= 380 | selectedPredictors <- data.frame(Predictor=as.character()) 381 | for (i in 1:100 ) { 382 | set.seed(i) 383 | X1 <- rep(1:2,each=100) 384 | Y <- X1 + rnorm(200,mean=0,sd=4) 385 | #Y <- rnorm(200,mean=0,sd=2) 386 | set.seed(1000+i) 387 | X2 <- rnorm(200,mean=0,sd=2) 388 | currentSimData <- data.frame(Y=Y,X1=X1,X2=X2) 389 | currentRpart <- rpart(Y~X1+X2,data=currentSimData,control=rpart.control(maxdepth=1)) 390 | currentPredictor <- data.frame(Predictor=rownames(currentRpart$splits)[1]) 391 | selectedPredictors <- rbind(selectedPredictors,currentPredictor) 392 | } 393 | @ 394 | 395 | In this simulation, the frequency with which each predictor is selected is presented in Table \ref{T:treeBiasTable}. In this case, X1 and X2 are selected in near equal proportions despite the fact that the response is defined using information from X1 only. As the amount of noise in the simulation increases, the chance that X2 is selected increases; conversely, as the amount of noise decreases, the chance that X2 is selected decreases. This implies that the granularity of X2 has a strong influence on whether or not it is selected--not its (lack of) association with the response. 396 | 397 | 398 | <>= 399 | treeBiasTable <- table(selectedPredictors) 400 | print( 401 | xtable(table(selectedPredictors$Predictor), 402 | caption = "Frequency of predictor selection for tree bias simulation.", 403 | label = "T:treeBiasTable"), 404 | include.colnames=FALSE 405 | ) 406 | @ 407 | 408 | 409 | 410 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 411 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 412 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 413 | 414 | \clearpage 415 | \section*{Exercise 3} 416 | In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of the variable importance scores. Figure \ref{F:gbmImpCompare} provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9. 417 | 418 | \begin{itemize} 419 | \item[] (a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors? 420 | \item[] (b) Which model do you think would be more predictive of other samples? 421 | \item[] (c) How would increasing interaction depth affect the slope of predictor importance for either model in Figure \ref{F:gbmImpCompare}? 
422 | \end{itemize} 423 | 424 | <>= 425 | data(solubility) 426 | 427 | trainData <- solTrainXtrans 428 | trainData$y <- solTrainY 429 | 430 | set.seed(100) 431 | gbmindx <- createFolds(solTrainY, returnTrain = TRUE) 432 | gbmctrl <- trainControl(method = "cv", index = gbmindx) 433 | 434 | gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2), 435 | n.trees = seq(100, 1000, by = 50), 436 | shrinkage = c(0.01, 0.1)) 437 | set.seed(100) 438 | gbmTune <- train(solTrainXtrans, solTrainY, 439 | method = "gbm", 440 | tuneGrid = gbmGrid, 441 | trControl = gbmctrl, 442 | verbose = FALSE) 443 | 444 | gbmGrid1 <- expand.grid(interaction.depth = gbmTune$bestTune$interaction.depth, 445 | n.trees = gbmTune$bestTune$n.trees, 446 | shrinkage = 0.1) 447 | 448 | gbmGrid9 <- expand.grid(interaction.depth = gbmTune$bestTune$interaction.depth, 449 | n.trees = gbmTune$bestTune$n.trees, 450 | shrinkage = 0.9) 451 | 452 | set.seed(100) 453 | gbmTune11 <- train(solTrainXtrans, solTrainY, 454 | method = "gbm", 455 | tuneGrid = gbmGrid1, 456 | trControl = gbmctrl, 457 | bag.fraction = 0.1, 458 | verbose = FALSE) 459 | 460 | gbmImp11 <- varImp(gbmTune11, scale = FALSE) 461 | 462 | set.seed(100) 463 | gbmTune99 <- train(solTrainXtrans, solTrainY, 464 | method = "gbm", 465 | tuneGrid = gbmGrid9, 466 | trControl = gbmctrl, 467 | bag.fraction = 0.9, 468 | verbose = FALSE) 469 | 470 | gbmImp99 <- varImp(gbmTune99, scale = FALSE) 471 | 472 | @ 473 | 474 | 475 | \begin{figure}[t!] 476 | \begin{center} 477 | <>= 478 | plot11 <- plot(gbmImp11, top=25, scales = list(y = list(cex = .95))) 479 | plot99 <- plot(gbmImp99, top=25, scales = list(y = list(cex = .95))) 480 | 481 | print(plot11, split=c(1,1,2,1), more=TRUE) 482 | print(plot99, split=c(2,1,2,1)) 483 | @ 484 | \caption[GBM variable importance tuning parameter comparison]{A 485 | comparison of variable importance magnitudes for differing 486 | values of the bagging fraction and shrinkage parameters. Both 487 | tuning parameters are set to 0.1 in the left figure. Both are 488 | set to 0.9 in the right figure.} 489 | \label{F:gbmImpCompare} 490 | \end{center} 491 | \end{figure} 492 | 493 | \subsection*{Solutions} 494 | The model on the right focuses importance on just a few predictors for a couple of reasons. First, as the learning rate increases towards 1, the model becomes more greedy. As greediness increases, the model will be more likely to identify fewer predictors related to the response. Second, as the bagging fraction increases, the model uses more of the data in model construction. The less the stochastic element of the method (i.e. larger bagging fraction) the fewer predictors will be identified as important. Therefore, as the learning rate and bagging fraction increase, the importance will be concentrated on fewer and fewer predictors. 495 | 496 | At the same time, as the values of these parameters increase, model performance will correspondingly decrease. Hence, the model on the left is likley to have better performance than the model on the right. 497 | 498 | Interaction depth also relatively affects the variable importance metric. As tree depth increases, variable importance is likely to be spread over more predictors increasing the length of the horizontal lines in the importance figure. 
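One way to make the ``concentration'' of importance concrete is to compute the fraction of the total importance captured by the top few predictors under each setting. The helper function below is ours; the sketch is not evaluated and uses the \apmstd{gbmImp11} and \apmstd{gbmImp99} objects computed above:

<<ch_08_gbm_importance_share, eval = FALSE>>=
## Fraction of total importance captured by the top 'n' predictors
topShare <- function(impObj, n = 5) {
  scores <- sort(impObj$importance$Overall, decreasing = TRUE)
  sum(scores[1:n]) / sum(scores)
}
c("bag/shrinkage = 0.1" = topShare(gbmImp11),
  "bag/shrinkage = 0.9" = topShare(gbmImp99))
@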
499 | 500 | 501 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 502 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 503 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 504 | 505 | \clearpage 506 | \section*{Exercise 4} 507 | 508 | Use a single predictor in the solubility data, such as the molecular 509 | weight or the number of carbon atoms, and fit several models: 510 | 511 | \begin{itemize} 512 | \item[] (a) a simple regression tree 513 | \item[] (b) a random forest model 514 | \item[] (c) different Cubist models with: a single rule or multiple 515 | committees (each with and without using neighbor adjustments). 516 | \end{itemize} 517 | 518 | Using the test set data, plot the predictor data versus the solubility 519 | results. Overlay the model predictions for the test set. How do the 520 | models differ? Does changing the tuning parameter(s) significantly 521 | affect the model fit? 522 | 523 | \subsection*{Solutions} 524 | 525 | The code listed below constructs the models for parts (a) through (c), and the performance results of these models are provided in Table \ref{T:ex4Performance}. Not surprisingly, the single tree performs the worst. The randomness and aggregation across trees used by random forests improve predictive ability when using just this one predictor. For the Cubist models, a couple of trends can be seen. First, the no-neighbor models perform better than the corresponding models that were tuned using multiple neighbors. At the same time, using multiple committees slightly improves the predictive ability of the models. Still, the best Cubist model (multiple committees and no neighbors) performs slightly worse than the random forest model. 
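The exercise also asks for a plot of the predictor versus solubility with the model predictions overlaid. One way to build that overlay is sketched below; the chunk is not evaluated here, it assumes the fits created in the next chunk, and the grid of 500 molecular weight values is arbitrary:

<<ch_08_ex4_overlay, eval = FALSE>>=
## Overlay model predictions on the test set data for the single predictor
mwGrid <- data.frame(MolWeight = seq(min(solTestMW$MolWeight),
                                     max(solTestMW$MolWeight),
                                     length = 500))
plot(solTestMW$MolWeight, solTestY,
     col = rgb(0, 0, 0, .3), pch = 16,
     xlab = "MolWeight (transformed)", ylab = "Solubility")
lines(mwGrid$MolWeight, predict(rpartTune, mwGrid), col = "red")
lines(mwGrid$MolWeight, predict(rfTune, mwGrid), col = "blue")
lines(mwGrid$MolWeight, predict(cubistTune100.0, mwGrid), col = "darkgreen")
legend("topright",
       legend = c("CART", "Random forest", "Cubist (100 committees, 0 neighbors)"),
       col = c("red", "blue", "darkgreen"), lty = 1, cex = 0.8)
@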
526 | 527 | <>= 528 | data(solubility) 529 | 530 | solTrainMW <- subset(solTrainXtrans,select="MolWeight") 531 | solTestMW <- subset(solTestXtrans,select="MolWeight") 532 | 533 | set.seed(100) 534 | rpartTune <- train(solTrainMW, solTrainY, 535 | method = "rpart2", 536 | tuneLength = 1) 537 | rpartTest <- data.frame(Method = "RPart",Y=solTestY, 538 | X=predict(rpartTune,solTestMW)) 539 | 540 | rfTune <- train(solTrainMW, solTrainY, 541 | method = "rf", 542 | tuneLength = 1) 543 | rfTest <- data.frame(Method = "RF",Y=solTestY, 544 | X=predict(rfTune,solTestMW)) 545 | 546 | 547 | cubistTune1.0 <- train(solTrainMW, solTrainY, 548 | method = "cubist", 549 | verbose = FALSE, 550 | metric = "Rsquared", 551 | tuneGrid = expand.grid(committees = 1, 552 | neighbors = 0)) 553 | cubistTest1.0 <- data.frame(Method = "Cubist1.0",Y=solTestY, 554 | X=predict(cubistTune1.0,solTestMW)) 555 | 556 | cubistTune1.n <- train(solTrainMW, solTrainY, 557 | method = "cubist", 558 | verbose = FALSE, 559 | metric = "Rsquared", 560 | tuneGrid = expand.grid(committees = 1, 561 | neighbors = c(1,3,5,7))) 562 | cubistTest1.n <- data.frame(Method = "Cubist1.n",Y=solTestY, 563 | X=predict(cubistTune1.n,solTestMW)) 564 | 565 | cubistTune100.0 <- train(solTrainMW, solTrainY, 566 | method = "cubist", 567 | verbose = FALSE, 568 | metric = "Rsquared", 569 | tuneGrid = expand.grid(committees = 100, 570 | neighbors = 0)) 571 | cubistTest100.0 <- data.frame(Method = "Cubist100.0",Y=solTestY, 572 | X=predict(cubistTune100.0,solTestMW)) 573 | 574 | cubistTune100.n <- train(solTrainMW, solTrainY, 575 | method = "cubist", 576 | verbose = FALSE, 577 | metric = "Rsquared", 578 | tuneGrid = expand.grid(committees = 100, 579 | neighbors = c(1,3,5,7))) 580 | cubistTest100.n <- data.frame(Method = "Cubist100.n",Y=solTestY, 581 | X=predict(cubistTune100.n,solTestMW)) 582 | @ 583 | 584 | <>= 585 | rpartPerf <- data.frame(Method = "Recursive Partitioning", 586 | R2 = round(rpartTune$results$Rsquared[best(rpartTune$results, "Rsquared", maximize = TRUE)],3)) 587 | rfPerf <- data.frame(Method = "Random Forest", 588 | R2 = round(rfTune$results$Rsquared[best(rfTune$results, "Rsquared", maximize = TRUE)],3)) 589 | cubistPerf1.0 <- data.frame(Method = "Cubist.SingleRule.NoNeighbors", 590 | R2 = round(cubistTune1.0$results$Rsquared[best(cubistTune1.0$results, "Rsquared", maximize = TRUE)],3)) 591 | cubistPerf1.n <- data.frame(Method = "Cubist.SingleRule.MultNeighbors", 592 | R2 = round(cubistTune1.n$results$Rsquared[best(cubistTune1.n$results, "Rsquared", maximize = TRUE)],3)) 593 | cubistPerf100.0 <- data.frame(Method = "Cubist.MultCommittees.NoNeighbors", 594 | R2 = round(cubistTune100.0$results$Rsquared[best(cubistTune100.0$results, "Rsquared", maximize = TRUE)],3)) 595 | cubistPerf100.n <- data.frame(Method = "Cubist.MultCommittees.MultNeighbors", 596 | R2 = round(cubistTune100.n$results$Rsquared[best(cubistTune100.n$results, "Rsquared", maximize = TRUE)],3)) 597 | 598 | ex4Results <- rbind(rpartPerf,rfPerf,cubistPerf1.0,cubistPerf1.n,cubistPerf100.0,cubistPerf100.n) 599 | 600 | print(xtable(ex4Results, 601 | align=c("ll|r"), 602 | caption = "Model performance using only Molecular Weight as a predictor.", 603 | label = "T:ex4Performance"), 604 | include.rownames=FALSE 605 | ) 606 | @ 607 | 608 | Test set performance is illustrated in Figure \ref{F:Ex4TestPreds}. The performance for recursive partitioning stands out since there are only two possible X values due to the split on the single predictor. 
Performance across random forest and the Cubist models are similar, with random forest having slightly smaller vertical spread across the range of the line of agreement. All of the Cubist models appear to have a lower-bound on predicted values at approximately -4.5. 609 | 610 | \begin{figure}[h] 611 | \begin{center} 612 | <>= 613 | cubistEx4Test <- rbind(rpartTest, 614 | rfTest,cubistTest1.0,cubistTest1.n,cubistTest100.0,cubistTest100.n) 615 | scatterTheme <- caretTheme() 616 | 617 | scatterTheme$plot.line$col <- c("blue") 618 | scatterTheme$plot.line$lwd <- 2 619 | 620 | scatterTheme$plot.symbol$col <- rgb(0, 0, 0, .3) 621 | scatterTheme$plot.symbol$cex <- 0.8 622 | scatterTheme$plot.symbol$pch <- 16 623 | 624 | scatterTheme$add.text <- list(cex = 0.6) 625 | 626 | trellis.par.set(scatterTheme) 627 | xyplot(X ~ Y | Method, 628 | cubistEx4Test, 629 | layout = c(2,3), 630 | panel = function(...) { 631 | theDots <- list(...) 632 | panel.xyplot(..., type = c("p", "g")) 633 | corr <- round(cor(theDots$x, theDots$y), 2) 634 | panel.text(44, 635 | min(theDots$y), 636 | paste("corr:", corr)) 637 | }, 638 | ylab = "Predicted", 639 | xlab = "Observed") 640 | @ 641 | \caption[Ex4 Test Performance]{Test set performance across models using only Molecular Weight as a predictor.} 642 | \label{F:Ex4TestPreds} 643 | \end{center} 644 | \end{figure} 645 | 646 | 647 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 648 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 649 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 650 | 651 | 652 | \clearpage 653 | \section*{Exercise 5} 654 | 655 | Fit different tree-- and rule--based models for the Tecator data discussed 656 | in Exercise 6.1. How do they compare to 657 | linear models? Do the between--predictor correlations seem to affect 658 | your models? If so, how would you transform or re--encode the 659 | predictor data to mitigate this issue? 660 | 661 | \subsection*{Solutions} 662 | 663 | The optimal RMSE in Exercise 6.1 came from the PLS model and was 0.65. We can load the data in the same way as before: 664 | 665 | <>= 666 | library(caret) 667 | data(tecator) 668 | 669 | set.seed(1029) 670 | inMeatTraining <- createDataPartition(endpoints[, 3], p = 3/4, list= FALSE) 671 | 672 | absorpTrain <- absorp[ inMeatTraining,] 673 | absorpTest <- absorp[-inMeatTraining,] 674 | proteinTrain <- endpoints[ inMeatTraining, 3] 675 | proteinTest <- endpoints[-inMeatTraining,3] 676 | 677 | absorpTrain <- as.data.frame(absorpTrain) 678 | absorpTest <- as.data.frame(absorpTest) 679 | 680 | ctrl <- trainControl(method = "repeatedcv", repeats = 5) 681 | @ 682 | 683 | A simple CART model can be fit using the syntax here: 684 | 685 | <>= 686 | set.seed(529) 687 | meatCART <- train(x = absorpTrain, y = proteinTrain, 688 | method = "rpart", 689 | trControl = ctrl, 690 | tuneLength = 25) 691 | @ 692 | 693 | The resulting tuning parameter profile is presented in Figure \ref{F:meat_cart}. For this mode, the optimal RMSE is \Sexpr{round(meatCART$results$RMSE[best(meatCART$results, "RMSE", maximize = FALSE)],3)}. This value is worse than the optimal value found using the PLS model. 
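The selected complexity parameter and the size of the corresponding tree can be inspected directly; a sketch (not evaluated), assuming the \apmstd{meatCART} fit from the chunk above:

<<ch_08_meat_cart_inspect, eval = FALSE>>=
## Chosen complexity parameter
meatCART$bestTune
## Number of terminal nodes in the selected tree
sum(meatCART$finalModel$frame$var == "<leaf>")
@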
694 | 695 | \begin{figure} 696 | \begin{center} 697 | <>= 698 | ggplot(meatCART) + scale_x_log10() 699 | @ 700 | \caption{The RMSE resampling profile for the single CART model.} 701 | \label{F:meat_cart} 702 | \end{center} 703 | \end{figure} 704 | 705 | Next we will tune and evaluate the following models: bagged trees, random forest, gradient boosting machines, and Cubist. The tuning parameter profiles for random forest, gradient boosting machines, and Cubist can be found in Figures \ref{F:meat_rf}, \ref{F:meat_gbm}, and \ref{F:meat_cubist}, respectively. 706 | <>= 707 | set.seed(529) 708 | meatBagged <- train(x = absorpTrain, y = proteinTrain, 709 | method = "treebag", 710 | trControl = ctrl) 711 | @ 712 | 713 | <>= 714 | set.seed(529) 715 | meatRF <- train(x = absorpTrain, y = proteinTrain, 716 | method = "rf", 717 | ntree = 1500, 718 | tuneLength = 10, 719 | trControl = ctrl) 720 | @ 721 | 722 | <>= 723 | gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2), 724 | n.trees = seq(100, 1000, by = 50), 725 | shrinkage = c(0.01, 0.1)) 726 | set.seed(529) 727 | meatGBM <- train(x = absorpTrain, y = proteinTrain, 728 | method = "gbm", 729 | verbose = FALSE, 730 | tuneGrid = gbmGrid, 731 | trControl = ctrl) 732 | @ 733 | 734 | <>= 735 | set.seed(529) 736 | meatCubist <- train(x = absorpTrain, y = proteinTrain, 737 | method = "cubist", 738 | verbose = FALSE, 739 | tuneGrid = expand.grid(committees = c(1:10, 20, 50, 75, 100), 740 | neighbors = c(0, 1, 5, 9)), 741 | trControl = ctrl) 742 | @ 743 | 744 | 745 | \begin{figure} 746 | \begin{center} 747 | <>= 748 | ggplot(meatRF) 749 | @ 750 | \caption{The RMSE resampling profile for the random forest model.} 751 | \label{F:meat_rf} 752 | \end{center} 753 | \end{figure} 754 | 755 | 756 | \begin{figure} 757 | \begin{center} 758 | <>= 759 | ggplot(meatGBM) + theme(legend.position = "top") 760 | @ 761 | \caption{The RMSE resampling profile for the gradient boosting machine model.} 762 | \label{F:meat_gbm} 763 | \end{center} 764 | \end{figure} 765 | 766 | 767 | \begin{figure} 768 | \begin{center} 769 | <>= 770 | ggplot(meatCubist) + theme(legend.position = "top") 771 | @ 772 | \caption{The RMSE resampling profile for the cubist model.} 773 | \label{F:meat_cubist} 774 | \end{center} 775 | \end{figure} 776 | 777 | <>= 778 | load("meatPLS.RData") 779 | load("meatNet.RData") 780 | meatResamples <- resamples(list(CART = meatCART, 781 | GBM = meatGBM, 782 | Cubist = meatCubist, 783 | "Bagged Tree" = meatBagged, 784 | "Random Forest" = meatRF, 785 | PLS = meatPLS, 786 | "Neural Network" = meatNet)) 787 | @ 788 | 789 | To compare model performance across those built in Chapters 6, 7, and 8, we can examine the resampling performance distributions (Figure \ref{F:meatCompare08}). Clearly the distributions of the PLS, Cubist, and neural network models indicate better performance than the tree-based models with RMSE values well under 1 and less overall variation. 790 | 791 | The latent variable characteristic of PLS and neural network models could be crucial model characteristics for this data and could be better suited for handling between-predictor correlations. 792 | 793 | \begin{figure}[t!] 
794 | \begin{center} 795 | <>= 796 | bookTheme() 797 | bwplot(meatResamples, metric = "RMSE") 798 | @ 799 | \caption{Resampling distributions of tree-- and rule--based models, along with the best models from the previous two chapters (PLS and neural networks).} 800 | \label{F:meatCompare08} 801 | \end{center} 802 | \end{figure} 803 | 804 | <>= 805 | postResample(predict(meatPLS, absorpTest), proteinTest) 806 | postResample(predict(meatCubist, absorpTest), proteinTest) 807 | postResample(predict(meatNet, absorpTest), proteinTest) 808 | @ 809 | 810 | 811 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 812 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 813 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 814 | 815 | \clearpage 816 | \section*{Exercise 6} 817 | 818 | Return to the permeability problem described in Exercises 6.2 and 7.4. Train several tree--based models and evaluate the resampling and test set performance. 819 | \begin{itemize} 820 | \item[] (a) Which tree--based model gives the optimal resampling and test set performance? 821 | \item[] (b) Do any of these models outperform the linear or non--linear based regression models you have previously developed for this data? What criteria did you use to compare models' performance? 822 | \item[] (c) Of all the models you have developed thus far, which, if any, would you recommend to replace the permeability laboratory experiment? 823 | \end{itemize} 824 | 825 | \subsection*{Solutions} 826 | 827 | In order to make a parallel comparison to the results in Exercises 6.2 and 7.4, we need to perform the same pre-processing steps and set up the identical validation approach. Recall that the optimal $R^2$ value for linear based methods was 0.58 (Elastic Net) and for non-linear based methods was 0.55 (SVM). The following syntax provides the same pre-processing, data partition into training and testing sets, and validation set-up. 828 | 829 | <>= 830 | library(AppliedPredictiveModeling) 831 | data(permeability) 832 | 833 | #Identify and remove NZV predictors 834 | nzvFingerprints <- nearZeroVar(fingerprints) 835 | noNzvFingerprints <- fingerprints[,-nzvFingerprints] 836 | 837 | #Split data into training and test sets 838 | set.seed(614) 839 | trainingRows <- createDataPartition(permeability, 840 | p = 0.75, 841 | list = FALSE) 842 | 843 | trainFingerprints <- noNzvFingerprints[trainingRows,] 844 | trainPermeability <- permeability[trainingRows,] 845 | 846 | testFingerprints <- noNzvFingerprints[-trainingRows,] 847 | testPermeability <- permeability[-trainingRows,] 848 | 849 | set.seed(614) 850 | ctrl <- trainControl(method = "LGOCV") 851 | @ 852 | 853 | 854 | Next, we will find optimal tuning parameters for simple CART, RF and GBM models. 
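Part (a) also asks about test set performance. Once the three fits below are available, the held--out compounds can be scored in one place; the sketch is not evaluated here, and note that the response is modeled on the $\log_{10}$ scale:

<<ch_08_perm_test_perf, eval = FALSE>>=
## Test set performance for the three tree-based models tuned below
rbind(CART = postResample(predict(rpartPermTune, testFingerprints),
                          log10(testPermeability)),
      RF   = postResample(predict(rfPermTune, testFingerprints),
                          log10(testPermeability)),
      GBM  = postResample(predict(gbmPermTune, testFingerprints),
                          log10(testPermeability)))
@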
855 | 856 | <>= 857 | set.seed(614) 858 | rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1)) 859 | rpartPermTune <- train(x = trainFingerprints, y = log10(trainPermeability), 860 | method = "rpart2", 861 | tuneGrid = rpartGrid, 862 | trControl = ctrl) 863 | @ 864 | 865 | <>= 866 | set.seed(614) 867 | 868 | rfPermTune <- train(x = trainFingerprints, y = log10(trainPermeability), 869 | method = "rf", 870 | tuneLength = 10, 871 | importance = TRUE, 872 | trControl = ctrl) 873 | @ 874 | 875 | <>= 876 | set.seed(614) 877 | gbmGrid <- expand.grid(interaction.depth=seq(1,6,by=1), 878 | n.trees=c(25,50,100,200), 879 | shrinkage=c(0.01,0.05,0.1)) 880 | gbmPermTune <- train(x = trainFingerprints, y = log10(trainPermeability), 881 | method = "gbm", 882 | verbose = FALSE, 883 | tuneGrid = gbmGrid, 884 | trControl = ctrl) 885 | @ 886 | 887 | Figure \ref{F:permeabilityRpartTunePlot} indicates that the optimal tree depth that maximizes $R^2$ is \Sexpr{rpartPermTune$results$maxdepth[best(rpartPermTune$results, "Rsquared", maximize = TRUE)]}, with an $R^2$ of \Sexpr{round(rpartPermTune$results$Rsquared[best(rpartPermTune$results, "Rsquared", maximize = TRUE)],2)}. This result is slightly better than what we found with either the selected linear or non-linear based methods. 888 | 889 | \begin{figure}[ht] 890 | \begin{center} 891 | <>= 892 | plotTheme <- bookTheme() 893 | trellis.par.set(plotTheme) 894 | plot(rpartPermTune,metric="Rsquared") 895 | @ 896 | \caption{Recursive partitioning tuning parameter profile for the permeability data} 897 | \label{F:permeabilityRpartTunePlot} 898 | \end{center} 899 | \end{figure} 900 | 901 | 902 | Figure \ref{F:permeabilityRFTunePlot} indicates that the optimal $m_{try}$ value that maximizes $R^2$ is \Sexpr{rfPermTune$results$mtry[best(rfPermTune$results, "Rsquared", maximize = TRUE)]}, with an $R^2$ of \Sexpr{round(rfPermTune$results$Rsquared[best(rfPermTune$results, "Rsquared", maximize = TRUE)],2)}. The tuning parameter profile as well as the similar performance results with recursive partitioning indicates that the underlying data structure is fairly consistent across the samples. Hence, the modeling process does not benefit from the reduction in variance induced by random forests. 903 | 904 | \begin{figure}[ht] 905 | \begin{center} 906 | <>= 907 | plotTheme <- bookTheme() 908 | trellis.par.set(plotTheme) 909 | plot(rfPermTune,metric="Rsquared") 910 | @ 911 | \caption{Random forest tuning parameter profile for the permeability data} 912 | \label{F:permeabilityRFTunePlot} 913 | \end{center} 914 | \end{figure} 915 | 916 | Next, let's look at the variable importance of the top 10 predictors for the random forest model (Figure \ref{F:permeabilityRFVarImpPlot}). Clearly a handful of predictors are identified as most important by random forests. 
917 | 
918 | \begin{figure}[!ht]
919 | \begin{center}
920 | <>=
921 | rfPermVarImp <- varImp(rfPermTune)
922 | 
923 | plotTheme <- bookTheme()
924 | trellis.par.set(plotTheme)
925 | plot(rfPermVarImp, top = 10, scales = list(y = list(cex = .85)))
926 | @
927 | \caption{Variable importance for the random forest model for the permeability data}
928 | \label{F:permeabilityRFVarImpPlot}
929 | \end{center}
930 | \end{figure}
931 | 
932 | 
933 | Figure \ref{F:permeabilityGBMTunePlot} indicates that the optimal interaction depth, number of trees, and shrinkage that maximize $R^2$ are \Sexpr{gbmPermTune$results$interaction.depth[best(gbmPermTune$results, "Rsquared", maximize = TRUE)]}, \Sexpr{gbmPermTune$results$n.trees[best(gbmPermTune$results, "Rsquared", maximize = TRUE)]}, and \Sexpr{gbmPermTune$results$shrinkage[best(gbmPermTune$results, "Rsquared", maximize = TRUE)]}, respectively, with an $R^2$ of \Sexpr{round(gbmPermTune$results$Rsquared[best(gbmPermTune$results, "Rsquared", maximize = TRUE)],2)}.
934 | 
935 | There are a couple of interesting characteristics in the GBM tuning parameter profiles. First, a relatively small number of trees combined with a tiny amount of shrinkage is optimal. This, again, points to the stability of the underlying samples. Second, a more complex model like a GBM is not necessary for these data. Instead, a simpler model like a linear-based technique or a single CART tree provides near-optimal results while being more interpretable than, say, the optimal random forest model.
936 | 
937 | \begin{figure}[ht]
938 | \begin{center}
939 | <>=
940 | plotTheme <- bookTheme()
941 | trellis.par.set(plotTheme)
942 | plot(gbmPermTune, metric = "Rsquared")
943 | @
944 | \caption{Gradient boosting machine tuning parameter profile for the permeability data}
945 | \label{F:permeabilityGBMTunePlot}
946 | \end{center}
947 | \end{figure}
948 | 
949 | The optimal recursive partitioning tree is presented in Figure \ref{F:rpartPermTree}. This tree reveals that, similar to the variable importance rankings from random forests, X6, X93, and X157 play an important role in separating samples. Also, the splits reveal the impact of the presence (\textgreater 0.5) or absence of a fingerprint on permeability. Having fingerprint X6 appears to be associated with higher overall permeability values. Conversely, not having fingerprint X6 while having fingerprint X93 appears to be associated with lower overall permeability values.
950 | 
951 | \begin{sidewaysfigure}
952 | \begin{center}
953 | <>=
954 | plot(as.party(rpartPermTune$finalModel), gp = gpar(fontsize = 11))
955 | @
956 | \caption{Optimal recursive partitioning tree for permeability data}
957 | \label{F:rpartPermTree}
958 | \end{center}
959 | \end{sidewaysfigure}
960 | 
961 | The findings of this exercise as well as Exercises 6.2 and 7.4 indicate that an interpretable model like recursive partitioning ($R^2$ = \Sexpr{round(rpartPermTune$results$Rsquared[best(rpartPermTune$results, "Rsquared", maximize = TRUE)],2)}) performs just as well as any of the more complex models. An $R^2$ at this level may or may not be sufficient to replace the permeability laboratory experiment. However, these findings may enable a gross computational screening that could identify compounds that are likely to be at the extremes of permeability. The predictors identified by recursive partitioning and random forests may also provide key insights about structures that are relevant to compounds' permeability.
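
Part (a) also asks about test set performance. A short sketch (not part of the original solution) applies the three tuned models to the held-out fingerprints; because the models were trained on the log10 of permeability, the test set response is transformed the same way before comparison.

<<permTestPerfSketch, eval = FALSE>>=
## Sketch: test set performance for the tuned tree-based models,
## using the held-out fingerprints created above.
rbind(CART = postResample(predict(rpartPermTune, testFingerprints),
                          log10(testPermeability)),
      "Random Forest" = postResample(predict(rfPermTune, testFingerprints),
                                     log10(testPermeability)),
      GBM = postResample(predict(gbmPermTune, testFingerprints),
                         log10(testPermeability)))
@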
962 | 
963 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
964 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
965 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
966 | 
967 | \clearpage
968 | \section*{Exercise 7}
969 | 
970 | Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data--splitting and pre--processing steps as before and train several tree--based models.
971 | \begin{itemize}
972 | \item[] (a) Which tree--based regression model gives the optimal resampling and test set performance?
973 | \item[] (b) Which predictors are most important in the optimal tree--based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and non--linear models?
974 | \item[] (c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
975 | \end{itemize}
976 | 
977 | \subsection*{Solutions}
978 | 
979 | We will use the same pre-processing steps and validation approach as in Exercises 6.3 and 7.5. In Exercise 6.3, the cross-validated $R^2$ value for PLS was 0.57, and the key predictors were manufacturing processes 32, 09, and 13. Pairwise plots of these predictors against the response suggested possible non--linear relationships. In Exercise 7.5, the MARS model was optimal with a cross-validated $R^2$ of 0.52. MARS singled out manufacturing processes 32 and 09 for predicting the response.
980 | 
981 | <>=
982 | library(AppliedPredictiveModeling)
983 | data(ChemicalManufacturingProcess)
984 | 
985 | predictors <- subset(ChemicalManufacturingProcess, select = -Yield)
986 | yield <- subset(ChemicalManufacturingProcess, select = "Yield")
987 | 
988 | set.seed(517)
989 | trainingRows <- createDataPartition(yield$Yield,
990 |                                     p = 0.7,
991 |                                     list = FALSE)
992 | 
993 | trainPredictors <- predictors[trainingRows,]
994 | trainYield <- yield[trainingRows,]
995 | 
996 | testPredictors <- predictors[-trainingRows,]
997 | testYield <- yield[-trainingRows,]
998 | 
999 | # Pre-process trainPredictors and apply to trainPredictors and testPredictors
1000 | pp <- preProcess(trainPredictors, method = c("BoxCox", "center", "scale", "knnImpute"))
1001 | ppTrainPredictors <- predict(pp, trainPredictors)
1002 | ppTestPredictors <- predict(pp, testPredictors)
1003 | 
1004 | # Identify and remove NZV predictors
1005 | nzvpp <- nearZeroVar(ppTrainPredictors)
1006 | ppTrainPredictors <- ppTrainPredictors[, -nzvpp]
1007 | ppTestPredictors <- ppTestPredictors[, -nzvpp]
1008 | 
1009 | # Identify and remove highly correlated predictors
1010 | predcorr <- cor(ppTrainPredictors)
1011 | highCorrpp <- findCorrelation(predcorr)
1012 | ppTrainPredictors <- ppTrainPredictors[, -highCorrpp]
1013 | ppTestPredictors <- ppTestPredictors[, -highCorrpp]
1014 | 
1015 | # Set up trainControl
1016 | set.seed(517)
1017 | ctrl <- trainControl(method = "boot", number = 25)
1018 | @
1019 | 
1020 | Next, we will find optimal tuning parameters for simple CART, RF, GBM, and Cubist models.
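
Before doing so, a quick check (an aside, not part of the original solution) confirms that the imputation and filtering steps behaved as expected: the filtered training set should contain no missing values after imputation and fewer columns than the raw predictor set.

<<chemPreProcCheckSketch, eval = FALSE>>=
## Sketch: sanity checks on the pre-processed training predictors.
dim(ppTrainPredictors)       # predictors remaining after the NZV/correlation filters
sum(is.na(ppTrainPredictors))  # should be zero after knnImpute
@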
1021 | 
1022 | <>=
1023 | set.seed(614)
1024 | rpartGrid <- expand.grid(maxdepth = seq(1, 10, by = 1))
1025 | rpartChemTune <- train(x = ppTrainPredictors, y = trainYield,
1026 |                        method = "rpart2",
1027 |                        metric = "Rsquared",
1028 |                        tuneGrid = rpartGrid,
1029 |                        trControl = ctrl)
1030 | @
1031 | 
1032 | <>=
1033 | set.seed(614)
1034 | 
1035 | rfGrid <- expand.grid(mtry = seq(2, 38, by = 3))
1036 | 
1037 | rfChemTune <- train(x = ppTrainPredictors, y = trainYield,
1038 |                     method = "rf",
1039 |                     tuneGrid = rfGrid,
1040 |                     metric = "Rsquared",
1041 |                     importance = TRUE,
1042 |                     trControl = ctrl)
1043 | @
1044 | 
1045 | <>=
1046 | set.seed(614)
1047 | gbmGrid <- expand.grid(interaction.depth = seq(1, 6, by = 1),
1048 |                        n.trees = c(25, 50, 100, 200),
1049 |                        shrinkage = c(0.01, 0.05, 0.1, 0.2))
1050 | 
1051 | gbmChemTune <- train(x = ppTrainPredictors, y = trainYield,
1052 |                      method = "gbm",
1053 |                      metric = "Rsquared",
1054 |                      verbose = FALSE,
1055 |                      tuneGrid = gbmGrid,
1056 |                      trControl = ctrl)
1057 | @
1058 | 
1059 | <>=
1060 | set.seed(614)
1061 | cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
1062 |                           neighbors = c(0, 1, 3, 5, 7))
1063 | 
1064 | cubistChemTune <- train(x = ppTrainPredictors, y = trainYield,
1065 |                         method = "cubist",
1066 |                         verbose = FALSE,
1067 |                         metric = "Rsquared",
1068 |                         tuneGrid = cubistGrid,
1069 |                         trControl = ctrl)
1070 | @
1071 | 
1072 | The optimal recursive partitioning model has a bootstrap CV $R^2$ value of \Sexpr{round(rpartChemTune$results$Rsquared[best(rpartChemTune$results, "Rsquared", maximize = TRUE)],2)}. This result is substantially worse than what we found with either the optimal PLS or MARS models. The random forest and gradient boosting machine models have bootstrap CV $R^2$ values of \Sexpr{round(rfChemTune$results$Rsquared[best(rfChemTune$results, "Rsquared", maximize = TRUE)],2)} and \Sexpr{round(gbmChemTune$results$Rsquared[best(gbmChemTune$results, "Rsquared", maximize = TRUE)],2)}, respectively. These are similar to the MARS model results but still not quite as good as the PLS model results.
1073 | 
1074 | Figure \ref{F:chemCubistTunePlot} provides the tuning parameter profile plot for the Cubist model. Here, the optimal bootstrap CV $R^2$ value is \Sexpr{round(cubistChemTune$results$Rsquared[best(cubistChemTune$results, "Rsquared", maximize = TRUE)],2)}, which occurs using \Sexpr{cubistChemTune$results$neighbors[best(cubistChemTune$results, "Rsquared", maximize = TRUE)]} neighbor and \Sexpr{cubistChemTune$results$committees[best(cubistChemTune$results, "Rsquared", maximize = TRUE)]} committees. Therefore, very localized information is useful in constructing the model; as the number of neighbors increases, the optimal performance decreases. Also, adjusting the response via committees is necessary to improve prediction.
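
Because the same seed was set before each call to \knfun{train} with the same bootstrap control object, the four models were fit on identical resamples and their resampling distributions can be compared head--to--head. A short sketch (not in the original solution) using \knfun{resamples}:

<<chemResampCompareSketch, eval = FALSE>>=
## Sketch: collect the bootstrap results from the four tuned models.
## The same seed preceded each train() call, so the resamples match.
chemResamples <- resamples(list(CART = rpartChemTune,
                                "Random Forest" = rfChemTune,
                                GBM = gbmChemTune,
                                Cubist = cubistChemTune))
summary(chemResamples)
@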
1075 | 
1076 | 
1077 | 
1078 | \begin{figure}[ht]
1079 | \begin{center}
1080 | <>=
1081 | plotTheme <- bookTheme()
1082 | trellis.par.set(plotTheme)
1083 | plot(cubistChemTune, metric = "Rsquared")
1084 | @
1085 | \caption{Cubist tuning parameter profile for the chemical manufacturing data}
1086 | \label{F:chemCubistTunePlot}
1087 | \end{center}
1088 | \end{figure}
1089 | 
1090 | \begin{figure}[h]
1091 | \begin{center}
1092 | <>=
1093 | cubistChemTest <- data.frame(Observed = testYield, Predicted = predict(cubistChemTune, ppTestPredictors))
1094 | scatterTheme <- caretTheme()
1095 | 
1096 | scatterTheme$plot.line$col <- c("blue")
1097 | scatterTheme$plot.line$lwd <- 2
1098 | 
1099 | scatterTheme$plot.symbol$col <- rgb(0, 0, 0, .3)
1100 | scatterTheme$plot.symbol$cex <- 0.8
1101 | scatterTheme$plot.symbol$pch <- 16
1102 | 
1103 | scatterTheme$add.text <- list(cex = 0.6)
1104 | 
1105 | trellis.par.set(scatterTheme)
1106 | xyplot(Predicted ~ Observed,
1107 |        cubistChemTest,
1108 |        panel = function(...) {
1109 |          theDots <- list(...)
1110 |          panel.xyplot(..., type = c("p", "g", "r", "smooth"))
1111 |          corr <- round(cor(theDots$x, theDots$y), 2)
1112 |          panel.text(44,
1113 |                     min(theDots$y),
1114 |                     paste("corr:", corr))
1115 |        },
1116 |        ylab = "Predicted",
1117 |        xlab = "Observed")
1118 | @
1119 | \caption[Cubist manufacturing test]{Cubist predictions for the test set for the chemical manufacturing data.}
1120 | \label{F:cubistChemTestPreds}
1121 | \end{center}
1122 | \end{figure}
1123 | 
1124 | 
1125 | The Cubist model provides the best performance across all models we have tuned in this exercise as well as in Exercises 6.3 and 7.5. The test set predictions for the Cubist model are presented in Figure \ref{F:cubistChemTestPreds}, with an $R^2$ value of \Sexpr{round(cor(cubistChemTest$Observed,cubistChemTest$Predicted)^2, 3)}. This result is better than any of the previous models.
1126 | 
1127 | 
1128 | 
1129 | Figure \ref{F:cubistChemImp} lists the top 15 important predictors for the manufacturing data. Again, manufacturing processes 32 and 09 are at the top, while process 13 is in the top 5. All models thus far point towards processes 32 and 09, which should be investigated further to better understand if these can be controlled to improve yield. The role of biological material 03 and its impact on yield should also be investigated.
1130 | 
1131 | \begin{figure}[h]
1132 | \begin{center}
1133 | <>=
1134 | cubistChemImp <- varImp(cubistChemTune, scale = FALSE)
1135 | bookTheme()
1136 | plot(cubistChemImp, top = 15, scales = list(y = list(cex = 0.8)))
1137 | @
1138 | \caption[Cubist Importance]{Cubist variable importance scores for the manufacturing data.}
1139 | \label{F:cubistChemImp}
1140 | \end{center}
1141 | \end{figure}
1142 | 
1143 | The optimal recursive partitioning tree is presented in Figure \ref{F:rpartChemTree}. Manufacturing processes 32 and 13 are at the top, with higher values of process 32 being associated with larger yields and lower values being associated with smaller yields. However, lower values of process 32 may be counteracted by correspondingly lower values of process 13.
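
Finally, although only the Cubist test set predictions were plotted above, the same held-out data can be used to check whether the resampling ranking carries over to the test set. A short sketch (not part of the original solution), using the pre-processed test predictors created earlier:

<<chemTestPerfSketch, eval = FALSE>>=
## Sketch: test set RMSE and R^2 for each tuned model.
rbind(CART = postResample(predict(rpartChemTune, ppTestPredictors), testYield),
      "Random Forest" = postResample(predict(rfChemTune, ppTestPredictors), testYield),
      GBM = postResample(predict(gbmChemTune, ppTestPredictors), testYield),
      Cubist = postResample(predict(cubistChemTune, ppTestPredictors), testYield))
@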
1144 | 1145 | \begin{sidewaysfigure} 1146 | \begin{center} 1147 | <>= 1148 | plot(as.party(rpartChemTune$finalModel),gp=gpar(fontsize=11)) 1149 | @ 1150 | \caption{Optimal recursive partitioning tree for chemical manufacturing data} 1151 | \label{F:rpartChemTree} 1152 | \end{center} 1153 | \end{sidewaysfigure} 1154 | 1155 | 1156 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1157 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1158 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1159 | 1160 | \clearpage 1161 | \section*{Session Info} 1162 | 1163 | <>= 1164 | toLatex(sessionInfo()) 1165 | @ 1166 | 1167 | 1168 | \bibliographystyle{ECA_jasa} 1169 | \bibliography{Ch_08_Ex_sol} 1170 | 1171 | 1172 | \end{document} 1173 | 1174 | 1175 | 1176 | 1177 | -------------------------------------------------------------------------------- /Ch_08.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/topepo/APM_Exercises/73d35f3e62522d9e9a61552159bbeda9d6f02d57/Ch_08.pdf -------------------------------------------------------------------------------- /Ch_08_Ex_Sol.bib: -------------------------------------------------------------------------------- 1 | @Article{17254353, 2 | AUTHOR = {Strobl, Carolin and Boulesteix, Anne and Zeileis, Achim and Hothorn, Torsten}, 3 | TITLE = {Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution}, 4 | JOURNAL = {BMC Bioinformatics}, 5 | VOLUME = {8}, 6 | YEAR = {2007}, 7 | NUMBER = {1}, 8 | PAGES = {25}} 9 | 10 | @article{lohshih97, 11 | author = {Loh, W.--Y. and Shih, Y.--S.}, 12 | title = {Split Selection Methods for Classification Trees}, 13 | journal = {Statistica Sinica}, 14 | year = {1997}, 15 | volume = {7}, 16 | pages = {815--840}} 17 | 18 | -------------------------------------------------------------------------------- /Ch_12_Ex_Sol.bib: -------------------------------------------------------------------------------- 1 | @article{Menardi:2014th, 2 | author = {Menardi, G and Torelli, N}, 3 | title = {{Training and assessing classification rules with imbalanced data}}, 4 | journal = {Data Mining and Knowledge Discovery}, 5 | year = {2014}, 6 | volume = {28}, 7 | number = {1}, 8 | pages = {92--122}} 9 | 10 | 11 | @article{Chawla:2002ty, 12 | author = {Chawla, N and Bowyer, K and Hall, L and Kegelmeyer, W}, 13 | title = {{SMOTE}: synthetic minority over--sampling technique}, 14 | journal = {Journal of Artificial Intelligence Research}, 15 | year = {2002}, 16 | volume = {16}, 17 | number = {1}, 18 | pages = {321--357}} 19 | 20 | 21 | -------------------------------------------------------------------------------- /Ch_13_Ex_Sol.bib: -------------------------------------------------------------------------------- 1 | @techreport{kknn, 2 | volume = {399}, 3 | title = {\href{http://nbn-resolving.de/urn/resolver.pl?urn=nbn:de:bvb:19-epub-1769-9}{Weighted $k$--nearest--neighbor techniques and ordinal classification}}, 4 | author = {K Hechenbichler and K Schliep}, 5 | year = {2004}, 6 | number = {Discussion Paper 399}, 7 | INSTITUTION = {Ludwig--Maximilians University Munich} 8 | } -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | Exercises for the book [Applied Predictive Modeling](http://appliedpredictivemodeling.com) by Kuhn and Johnson (2013) 2 | 
--
3 | 
4 | This project contains the solutions and code for the end-of-chapter exercises. Included are a compiled pdf and a [knitr](http://yihui.name/knitr/) source file using LaTeX.
5 | 
6 | Right now, the first two sets are finished and we are almost ready to post the solutions for the regression chapters.
7 | 
8 | **Note**! You may have better solutions than we have here and we would love to see them. You can share them by creating a [pull request](https://help.github.com/articles/using-pull-requests/) or, if you are not git-savvy, by dropping an email to Max () and/or Kjell ().
9 | 
10 | 
--------------------------------------------------------------------------------