├── .gitignore ├── README.rst ├── articles ├── a-journey-to-the-tip-of-neural-networks-kr.ipynb ├── a-journey-to-the-tip-of-neural-networks.ipynb ├── fig11-2-neural-network.jpg ├── kaggle-a-few-practical-thoughts-on-titanic.ipynb ├── tutorial-deep-learning-for-nlp-with-pytorch-1-kr.ipynb ├── tutorial-deep-learning-for-nlp-with-pytorch-2-kr.ipynb ├── tutorial-deep-learning-for-nlp-with-pytorch-3-kr.ipynb ├── why-is-logistic-regression-called-linear-method-kr.ipynb └── why-is-logistic-regression-called-linear-method.ipynb ├── chapter02-overview-of-supervised-learning ├── section3-least-squares-and-nearest-neighbors.ipynb ├── section4-statistical-decision-theory.ipynb ├── section5-local-methods-in-high-dimensions.ipynb ├── section6-statistical-methods-supervised-learning-and-function-approximation.ipynb ├── section7-structured-regression-models.ipynb ├── section8-classes-of-restricted-estimators.ipynb └── section9-model-selection-and-the-bias-variance-tradeoff.ipynb ├── chapter03-linear-methods-for-regression ├── section1-introduction.ipynb ├── section2-0-linear-regression-models-and-least-squares.ipynb ├── section2-1-example-prostate-cancer.ipynb ├── section2-2-the-gauss-markov-theorem.ipynb ├── section2-3-multiple-regression-from-simple-univariate-regression.ipynb ├── section2-4-multiple-outputs.ipynb ├── section3-subset-selection.ipynb ├── section4-0-shrinkage-methods.ipynb ├── section4-1-ridge-regression.ipynb ├── section4-2-the-lasso.ipynb ├── section4-3-discussion-subset-selection-ridge-lasso.ipynb ├── section4-4-least-angle-regression.ipynb ├── section5-0-methods-using-derived-input-directions.ipynb ├── section5-1-principal-components-regression.ipynb ├── section5-2-partial-least-squares.ipynb └── section6-discussion-a-comparison-of-the-selection-and-shrinkage-methods.ipynb ├── chapter04-linear-methods-for-classification ├── section1-introduction.ipynb ├── section2-linear-regression-of-an-indicator-matrix.ipynb ├── section3-0-linear-discriminant-analysis.ipynb ├── section3-1-regularized-discriminant-analysis.ipynb ├── section3-2-computations-for-lda.ipynb ├── section3-3-reduced-rank-linear-discriminant-analysis.ipynb ├── section4-0-logistic-regression.ipynb ├── section4-1-fitting-logistic-regression-models.ipynb ├── section4-2-example-south-african-heart-disease.ipynb ├── section4-3-quadratic-approximations-and-inference.ipynb ├── section4-4-l1-regularized-logistic-regression.ipynb ├── section4-5-logistic-or-lda.ipynb ├── section5-0-separating-hyperplanes.ipynb ├── section5-1-rosenblatt-perceptron-learning-algorithm.ipynb └── section5-2-optimal-separating-hyperplanes.ipynb ├── chapter05-basis-expansions-and-regularization ├── section1-introduction.ipynb ├── section2-0-piecewise-polynomials-and-splines.ipynb ├── section2-1-natural-cubic-splines.ipynb ├── section2-2-example-south-african-heart-disease-continued.ipynb ├── section2-3-example-phoneme-recognition.ipynb ├── section3-filtering-and-feature-extraction.ipynb ├── section4-0-smoothing-splines.ipynb ├── section4-1-degrees-of-freedom-and-smoother-matrices.ipynb ├── section5-0-automatic-selection-of-the-smoothing-parameters.ipynb ├── section5-1-fixing-the-degrees-of-freedom.ipynb ├── section5-2-the-biase-variance-tradeoff.ipynb ├── section6-nonparametric-logistic-regression.ipynb ├── section7-multidimensional-splines.ipynb ├── section8-0-regularization-and-reproducing-kernel-hilbert-space.ipynb ├── section8-1-spaces-of-fucntions-generated-by-kernels.ipynb ├── section8-2-example-of-rkhs.ipynb ├── section9-0-wavelet-smoothing.ipynb ├── 
section9-1-wavelet-bases-and-the-wavelet-transform.ipynb └── section9-2-adaptive-wavelet-filtering.ipynb ├── chapter06-kernel-smoothing-methods ├── section0-introduction.ipynb ├── section1-0-one-dimensional-kernel-smoothers.ipynb ├── section1-1-local-linear-regression.ipynb ├── section1-2-local-polynomial-regression.ipynb ├── section2-selecting-the-width-of-the-kernel.ipynb ├── section3-local-regression-in-higher-dimensions.ipynb ├── section4-0-structured-local-regression-models.ipynb ├── section4-1-structured-kernels.ipynb ├── section4-2-structured-regression-functions.ipynb ├── section5-local-likelihood-and-other-models.ipynb ├── section6-0-kernel-density-estimation-and-classification.ipynb ├── section6-1-kernel-density-estimation.ipynb ├── section6-2-kernel-density-classification.ipynb ├── section6-3-the-naive-bayes-classifier.ipynb ├── section7-radial-basis-functions-and-kernels.ipynb ├── section8-mixture-models-for-density-estimation-and-classification.ipynb └── section9-computational-considerations.ipynb ├── chapter07-model-assessment-and-selection ├── fig7-12.jpg ├── fig7-2.jpg ├── section01-introduction.ipynb ├── section02-bias-variance-and-model-complexity.ipynb ├── section03-0-the-bias-variance-decomposition.ipynb ├── section03-1-example-bias-variance-tradeoff.ipynb ├── section04-optimism-of-the-training-error-rate.ipynb ├── section05-estimate-of-in-sample-prediction-error.ipynb ├── section06-the-effective-number-of-parameters.ipynb ├── section07-the-bayesian-approach-and-bic.ipynb ├── section08-minimum-description-length.ipynb ├── section10-0-cross-validation.ipynb ├── section10-1-k-fold-cross-validation.ipynb ├── section10-2-the-wrong-and-right-way-to-do-cross-validation.ipynb ├── section10-3-does-cross-validation-really-work.ipynb └── section11-bootstrap-methods.ipynb ├── chapter11-neural-networks ├── fig11-10.jpg ├── fig11-12.jpg ├── section01-introduction.ipynb ├── section02-projection-pursuit-regression.ipynb ├── section03-neural-networks.ipynb ├── section04-fitting-neural-networks.ipynb ├── section05-some-issues-in-training-neural-networks.ipynb ├── section06-example-simulated-data.ipynb ├── section07-example-zip-code-data.ipynb ├── section08-discussion.ipynb ├── section09-0-bayesian-neural-nets-and-the-nips-2003-challenge.ipynb ├── section09-1-bayes-boosting-and-bagging.ipynb ├── section09-2-performance-comparisons.ipynb └── section10-computational-considerations.ipynb ├── data ├── heart │ ├── SAheart.data │ └── SAheart.info.txt ├── phoneme │ ├── phoneme.data │ └── phoneme.info.txt ├── prostate │ ├── prostate.data │ └── prostate.info.txt ├── titanic │ ├── test.csv │ └── train.csv ├── vowel │ ├── vowel.info.txt │ ├── vowel.test │ └── vowel.train └── zipcode │ ├── zip.info.txt │ ├── zip.test │ ├── zip.test.gz │ ├── zip.train │ └── zip.train.gz ├── module └── cv.py └── references ├── ESLII_print12.pdf └── lars.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | .ipynb_checkpoints 3 | __pycache__ 4 | 5 | Pipfile.lock 6 | 7 | Untitled* 8 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ================================================================ 2 | Jupyter Notebooks for the Elements of Statistical Learning (WIP) 3 | ================================================================ 4 | 5 | It aims to summarize and reproduce the textbook "The Elements of Statistical Learning" 2/E by 
Hastie, Tibshirani, and Friedman. 6 | 7 | Currently working the early chapters, I try to implement without frameworks like scikit-learn for showing the algorithms that the textbook introduces to me. 8 | 9 | Also starting with the neural networks, I decided to use PyTorch_ which seems less magical (They say that ``torch.Tensor`` is ``numpy.ndarray`` with GPU support). 10 | 11 | .. _PyTorch: //pytorch.org 12 | 13 | 14 | Installation 15 | ------------ 16 | 17 | Use your favorite virtualenv system and install the below dependencies; quite standard ones. 18 | 19 | * numpy 20 | * scipy 21 | * matplotlib 22 | * pandas 23 | * jupyter 24 | * pytorch 25 | * scikit-learn (optional, used in my own articles) 26 | 27 | .. code-block:: bash 28 | 29 | (esl) $ pip install ipython numpy scipy matplotlib pandas jupyter 30 | 31 | # The command below installs pytorch for Python 3.6 without CUDA support. 32 | # For other settings, consult with pytorch.org. 33 | (esl) $ pip install http://download.pytorch.org/whl/cpu/torch-0.3.1-cp36-cp36m-linux_x86_64.whl 34 | 35 | 36 | Execution 37 | --------- 38 | 39 | Just run ``jupyter notebook``. 40 | -------------------------------------------------------------------------------- /articles/fig11-2-neural-network.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/articles/fig11-2-neural-network.jpg -------------------------------------------------------------------------------- /chapter02-overview-of-supervised-learning/section7-structured-regression-models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 2.7. Structured Regression Models\n", 8 | "\n", 9 | "### Review & motivation\n", 10 | "\n", 11 | "We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data.\n", 12 | "\n", 13 | "This section introduces classes of such structured approaches. Before we proceed, though, we discuss further the need for such classes." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## $\\S$ 2.7.1. Difficulty of the Problem\n", 21 | "\n", 22 | "Consider the RSS criterion for an arbitrary function $f$,\n", 23 | "\n", 24 | "\\begin{equation}\n", 25 | "\\text{RSS}(f) = \\sum_{i=1}^N \\left( y_i - f(x_i) \\right)^2.\n", 26 | "\\end{equation}\n", 27 | "\n", 28 | "Minimizing the RSS leads to infinitely many solutions: Any function $\\hat{f}$ passing through the training points $(x_i,y_i)$ is a solution. Any particular solution chosen might be a poor predictor at test points different from the training points.\n", 29 | "\n", 30 | "If there are multiple observation pairs $(x_i,y_{il})$, $l=1,\\cdots,N_i$, at each value of $x_i$, the risk is limited. In this case, the solution pass through the average values of the $y_{il}$ at each $x_i$ (Exercise 2.6). 
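A quick numerical illustration of this point (a minimal numpy sketch on simulated data; the true function, noise level, sample sizes, and the degree-5 polynomial below are arbitrary choices, not from the text): an interpolating estimate attains zero training RSS, yet a restricted class of solutions predicts far better at new inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def rss(y, fhat):
    """RSS(f) = sum_i (y_i - f(x_i))^2."""
    return np.sum((y - fhat) ** 2)

# Training data from Y = f(X) + eps with f(x) = sin(2*pi*x), sd(eps) = 0.3.
n = 30
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# An interpolating "fit" that simply returns y_i at every training x_i
# has training RSS exactly zero ...
print(rss(y, y))  # 0.0

# ... but at fresh inputs (here it predicts with the nearest training y_i)
# it reproduces the training noise and predicts poorly,
m = 1000
x_new = rng.uniform(0, 1, m)
y_new = np.sin(2 * np.pi * x_new) + rng.normal(scale=0.3, size=m)
nearest = np.abs(x_new[:, None] - x[None, :]).argmin(axis=1)
print(rss(y_new, y[nearest]) / m)  # roughly 2 * sigma^2 per point

# ... whereas a restricted class (a low-order polynomial here) gets much
# closer to the irreducible level sigma^2 = 0.09.
coef = np.polyfit(x, y, deg=5)
print(rss(y_new, np.polyval(coef, x_new)) / m)
```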
The situation is similar to the one we have already visited in $\\S$ 2.4; indeed, the above RSS is the finite sample version of the expected prediction error\n", 31 | "\n", 32 | "\\begin{equation}\n", 33 | "\\text{EPE}(f) = \\text{E}\\left( Y - f(X) \\right)^2 = \\int \\left( y - f(x) \\right)^2 \\text{Pr}(dx, dy).\n", 34 | "\\end{equation}" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Necessity & limit of the restriction\n", 42 | "\n", 43 | "If the sample size $N$ were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.\n", 44 | "\n", 45 | "In order to obtain useful results for finite $N$, we must restrict the eligible solution to the RSS to a smaller set of functions.\n", 46 | "\n", 47 | "> How to decide on the nature of the restrictions is based on considerations outside of the data.\n", 48 | "\n", 49 | "These restrictions are somtimes\n", 50 | "* encoded via the parametric representation of $f_\\theta$, or\n", 51 | "* may be built into the learning method itself, either implicitly or explicitly.\n", 52 | "\n", 53 | "> These restricted classes of solutions are the major topic of this book.\n", 54 | "\n", 55 | "One thing should be clear, though.\n", 56 | "\n", 57 | "> Any restrictions imposed on $f$ that lead to a unique solution to RSS do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the abmiguity has simply been transferred to the choice of constraint." 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "### Complexity\n", 65 | "\n", 66 | "In general the constraints imposed by most learning methods can be described as _complexity_ restrictions of one kind or another.\n", 67 | "\n", 68 | "> This usually means some kind of regular behavior in small neighborhoods of the input space.\n", 69 | "\n", 70 | "That is, for all input points $x$ sufficiently close to each other in some metric, $\\hat{f}$ exhibits some special structure such as\n", 71 | "* nearly constant,\n", 72 | "* linear or\n", 73 | "* low-order polynomial behavior.\n", 74 | "\n", 75 | "The estimator is then obtained by averaging or polynomial fitting in that neighborhood.\n", 76 | "\n", 77 | "The strength of the constraint is dictated by the neighborhood size.\n", 78 | "\n", 79 | "> The larger the size, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint.\n", 80 | "\n", 81 | "For example,\n", 82 | "* local constant fits in infinitesimally small neighborhoods is no constraints at all;\n", 83 | "* local linear fits in very large neighborhoods is almost a globally llinear model, and is very restrictive." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "### Metric\n", 91 | "\n", 92 | "The nature of the constraint depends on the metric used.\n", 93 | "\n", 94 | "Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. 
The kNN methods discussed so far are based on the assumption that locally the function is constant; close to a target input $x_0$, the function does not change much, and so close outputs can be averagedd to produce $\\hat{f}(x_0)$.\n", 95 | "\n", 96 | "Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior. In $\\S$ 5.4.1 we discuss the concept of an _equivalent kernel_, which describes this local dependence for any method linear in the outputs. These equivalent kernels in many cases look just like the explicitly defined weighting kernels discussed above -- peaked at the target point and falling smoothly away from it." 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "### Curse of dimensionality\n", 104 | "\n", 105 | "One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotopic neighborhoods will run into problems in high dimensions -- again the curse of dimensionality.\n", 106 | "\n", 107 | "And conversely, all methods that overcome the dimensionality problems have an associated -- and often implicit or adaptive -- metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions." 108 | ] 109 | } 110 | ], 111 | "metadata": { 112 | "kernelspec": { 113 | "display_name": "Python 3", 114 | "language": "python", 115 | "name": "python3" 116 | }, 117 | "language_info": { 118 | "codemirror_mode": { 119 | "name": "ipython", 120 | "version": 3 121 | }, 122 | "file_extension": ".py", 123 | "mimetype": "text/x-python", 124 | "name": "python", 125 | "nbconvert_exporter": "python", 126 | "pygments_lexer": "ipython3", 127 | "version": "3.6.4" 128 | } 129 | }, 130 | "nbformat": 4, 131 | "nbformat_minor": 2 132 | } 133 | -------------------------------------------------------------------------------- /chapter02-overview-of-supervised-learning/section9-model-selection-and-the-bias-variance-tradeoff.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 2.9. Model Selection and the Bias-Variance Tradeoff\n", 8 | "\n", 9 | "### Review\n", 10 | "\n", 11 | "All the Models described so far have a *smoothing* or *complexity* parameter that has to be determined:\n", 12 | "* The multiplier of the penalty term;\n", 13 | "* the width of the kernel;\n", 14 | "* or the number of basis functions.\n", 15 | "\n", 16 | "In the case of the smoothing spline, the parameter $\\lambda$ indexes models ranging from a straight line fit to the interpolating model.\n", 17 | "\n", 18 | "Similarly a local degree-$m$ polynomial model ranges between a degree-$m$ global polynomial when the window size is infinitely large, to an interpolating fit when the window size shrinks to zero.\n", 19 | "\n", 20 | "This means that we cannot use residual sum-of-squares on the training data to determine these parameters as well, since we would always pick those that gave interpolating fits and hence zero residuals. Such a model is unlikely to predict future data well at all." 
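This can be checked directly with a small simulation (a sketch with arbitrary choices of true function, noise level, and sample sizes; the degree of a global polynomial stands in for the smoothing/complexity parameter): the training error keeps falling as the fit approaches interpolation, so it cannot be used to choose the complexity, while the test error eventually rises.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data; polynomial degree plays the role of lambda, the kernel
# width, or the number of basis functions.
n = 30
x = np.sort(rng.uniform(-1, 1, n))
y = np.cos(3 * x) + rng.normal(scale=0.2, size=n)
x_test = rng.uniform(-1, 1, 500)
y_test = np.cos(3 * x_test) + rng.normal(scale=0.2, size=500)

for degree in (1, 3, 5, 9, 15):
    coef = np.polyfit(x, y, deg=degree)
    train = np.mean((y - np.polyval(coef, x)) ** 2)
    test = np.mean((y_test - np.polyval(coef, x_test)) ** 2)
    print(f"degree={degree:2d}  train MSE={train:.4f}  test MSE={test:.4f}")

# Training MSE can only decrease as the degree grows, so selecting the
# degree by training error would always pick the most complex,
# near-interpolating fit, even after the test MSE has started to grow.
```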
21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### The bias-variance tradeoff for the kNN\n", 28 | "\n", 29 | "The kNN regression fit $\\hat{f}_k(x_0)$ usefully illustrates the competing forces that affect the predictive ability of such approximations.\n", 30 | "\n", 31 | "Suppose\n", 32 | "* the data arise from a model $Y=f(X)+\\epsilon$, with $\\text{E}(\\epsilon)=0$ and $\\text{Var}(\\epsilon)=\\sigma^2$;\n", 33 | "* for simplicity here the values of $x_i$ in the sample are fixed in advance (nonrandom).\n", 34 | "\n", 35 | "The expected prediction error at $x_0$, a.k.a. _test_ or _generalization_ error, can be decomposed:\n", 36 | "\n", 37 | "\\begin{align}\n", 38 | "\\text{EPE}_k(x_0) &= \\text{E}\\left[(Y-\\hat{f}_k(x_0))^2|X=x_0\\right] \\\\\n", 39 | "&= \\text{E}\\left[(Y -f(x_0) + f(x_0) -\\hat{f}_k(x_0))^2|X=x_0\\right] \\\\\n", 40 | "&= \\text{E}(\\epsilon^2) + 2\\text{E}\\left[\\epsilon(f(x_0) -\\hat{f}_k(x_0))|X=x_0\\right] + \\text{E}\\left[\\left(f(x_0)-\\hat{f}_k(x_0)\\right)^2|X=x_0\\right] \\\\\n", 41 | "&= \\sigma^2 + 0+ \\left[\\text{Bias}^2(\\hat{f}_k(x_0))+\\text{Var}_\\mathcal{T}(\\hat{f}_k(x_0))\\right] \\\\\n", 42 | "&= \\sigma^2 + \\left(f(x_0) - \\frac{1}{k}\\sum_{l=1}^k f(x_{(l)})\\right)^2 + \\frac{\\sigma^2}{k}\n", 43 | "\\end{align},\n", 44 | "\n", 45 | "where subscripts in parentheses $(l)$ indicate the sequence of nearest neighbors to $x_0$.\n", 46 | "\n", 47 | "There are three terms in this expression.\n", 48 | "\n", 49 | "#### Irreducible error\n", 50 | "The first term $\\sigma^2$ is the *irreducible* error -- the variance of the new test target -- and is beyond our control, even if we know the true $f(x_0)$.\n", 51 | "\n", 52 | "The second and third terms are under our control, and make up the _mean squared error_ of $\\hat{f}_k(x_0)$ in estimateing $f(x_0)$, which is broken down into a bias component and a variance component.\n", 53 | "\n", 54 | "#### Bias\n", 55 | "The bias term is the squared difference between the true mean $f(x_0)$ and the expected value of the estimate, i.e.,\n", 56 | "\n", 57 | "\\begin{equation}\n", 58 | "\\left[ \\text{E}_\\mathcal{T} \\left( \\hat{f}_k(x_0) \\right) - f(x_0) \\right]^2,\n", 59 | "\\end{equation}\n", 60 | "\n", 61 | "where the expectation averages the randomness in the training data.\n", 62 | "\n", 63 | "This term will most likely increase with $k$, if the true function is reasonably smooth. For small $k$ the few closest neighbors will have values $f(x_{(l)})$ close to $f(x_0)$, so their average should be close to $f(x_0)$. As $k$ grows, the neighbors are further away, and then anything can happen.\n", 64 | "\n", 65 | "#### Variance\n", 66 | "The variance term is simply the variance of an average here, and decreases as the inverse of $k$.\n", 67 | "\n", 68 | "#### Finally, the tradeoff\n", 69 | "So as $k$ varies, there is a *bias-variance tradeoff*.\n", 70 | "\n", 71 | "More generally, as the _model complexity_ of our procedure is increased, the variance tends to increase and the squared bias tends to decrease, vice versa. For kNN, the model complexity is controlled by $k$.\n", 72 | "\n", 73 | "Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. 
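This decomposition is easy to verify by simulation (a minimal sketch; the choices of $f$, $\sigma$, $k$, $x_0$ and the fixed design below are arbitrary): holding the $x_i$ fixed and redrawing only the noise, the Monte Carlo estimate of $\text{EPE}_k(x_0)$ matches $\sigma^2 + \text{Bias}^2 + \sigma^2/k$.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.sin(2 * np.pi * x)

# Fixed design: the x_i are held fixed and only the noise is redrawn,
# matching the assumptions above.  sigma, k and x0 are illustrative choices.
sigma, k, x0 = 0.5, 7, 0.31
x = np.linspace(0, 1, 50)

# Indices of the k nearest neighbors of x0 (fixed, since the design is fixed).
nn = np.argsort(np.abs(x - x0))[:k]

# Theoretical decomposition: EPE_k(x0) = sigma^2 + Bias^2 + sigma^2 / k.
bias2 = (f(x0) - f(x[nn]).mean()) ** 2
theory = sigma ** 2 + bias2 + sigma ** 2 / k

# Monte Carlo: redraw the training noise and a test response at x0 many
# times, each time forming the kNN fit fhat_k(x0) = average of the k y's.
reps = 200_000
eps = rng.normal(scale=sigma, size=(reps, k))
fhat = (f(x[nn]) + eps).mean(axis=1)
y0 = f(x0) + rng.normal(scale=sigma, size=reps)
mc = np.mean((y0 - fhat) ** 2)

print(f"theory    EPE_k(x0) = {theory:.4f}")
print(f"simulated EPE_k(x0) = {mc:.4f}")   # the two agree closely
```

The training-error counterpart used next is the average squared residual, $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$.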
An obvious estimate of test error is _training error_\n", 74 | "\n", 75 | "\\begin{equation}\n", 76 | "\\frac{1}{N} \\sum_i (y_i \\hat{y}_i)^2.\n", 77 | "\\end{equation}\n", 78 | "\n", 79 | "Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity." 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "### Interpretation & implication\n", 87 | "FIGURE 2.11 shows the typical behavior of the test and training error, as model complexity is varied.\n", 88 | "\n", 89 | "> The training error tends to decrease whenever we increase the model complexity, i.e., whenever we fit the data harder.\n", 90 | "\n", 91 | "However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\\hat{f}(x_0)$ will have large variance, as reflected in the above EPE expression.\n", 92 | "\n", 93 | "In contrast, if the model is not complex enough, it will _underfit_ and may have large mias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity fir a given prediction method and training set." 94 | ] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 3 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython3", 113 | "version": "3.6.4" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 2 118 | } 119 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section1-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 3. Linear Methods for Regression\n", 8 | "# $\\S$ 3.1. Introduction\n", 9 | "\n", 10 | "A linear regression model assumes that the regression function $\\text{E}(Y|X)$ is linear in the inputs $X_1,\\cdots,X_p$. Linear models were largely developed in the precomputer age of statistics, but even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output.\n", 11 | "\n", 12 | "For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data.\n", 13 | "\n", 14 | "Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalization are sometimes called basis-function methods (Chapter 5).\n", 15 | "\n", 16 | "In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification.\n", 17 | "\n", 18 | "> On some topics we go into considerable detail, as it is out firm belief that an understanding of linear methods is essential for understanding nonlinear ones.\n", 19 | "\n", 20 | "In fact, many nonlinear techniques are direct generalizations of the linear methods discussed here." 
21 | ] 22 | } 23 | ], 24 | "metadata": { 25 | "kernelspec": { 26 | "display_name": "Python 3", 27 | "language": "python", 28 | "name": "python3" 29 | }, 30 | "language_info": { 31 | "codemirror_mode": { 32 | "name": "ipython", 33 | "version": 3 34 | }, 35 | "file_extension": ".py", 36 | "mimetype": "text/x-python", 37 | "name": "python", 38 | "nbconvert_exporter": "python", 39 | "pygments_lexer": "ipython3", 40 | "version": "3.6.4" 41 | } 42 | }, 43 | "nbformat": 4, 44 | "nbformat_minor": 2 45 | } 46 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section2-2-the-gauss-markov-theorem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 3.2.2. The Gauss-Markov Theorem\n", 8 | "\n", 9 | "One of the most famous results in statistics asserts that\n", 10 | "\n", 11 | "> the least squares estimates of the parameter $\\beta$ have the smallest variance among all linear unbiased estimates.\n", 12 | "\n", 13 | "We will make this precise here, and also make clear that\n", 14 | "\n", 15 | "> the restriction to unbiased estimates is not necessarily a wise one.\n", 16 | "\n", 17 | "This observation will lead us to consider biased estimates such as ridge regression later in the chapter." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "### The statement of the theorem\n", 25 | "\n", 26 | "We focus on estimation of any linear combination of the parameters $\\theta=a^T\\beta$. The least squares estimate of $a^T\\beta$ is\n", 27 | "\n", 28 | "\\begin{equation}\n", 29 | "\\hat\\theta = a^T\\hat\\beta = a^T\\left(\\mathbf{X}^T\\mathbf{X}\\right)^{-1}\\mathbf{X}^T\\mathbf{y}.\n", 30 | "\\end{equation}\n", 31 | "\n", 32 | "Considering $\\mathbf{X}$ to be fixed and the linear model is correct, $a^T\\beta$ is unbiased since\n", 33 | "\n", 34 | "\\begin{align}\n", 35 | "\\text{E}(a^T\\hat\\beta) &= \\text{E}\\left(a^T(\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T\\mathbf{y}\\right) \\\\\n", 36 | "&= a^T(\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T\\mathbf{X}\\beta \\\\\n", 37 | "&= a^T\\beta\n", 38 | "\\end{align}\n", 39 | "\n", 40 | "The Gauss-Markov Theorem states that if we have any other linear estimator $\\tilde\\theta = \\mathbf{c}^T\\mathbf{y}$ that is unbiased for $a^T\\beta$, that is, $\\text{E}(\\mathbf{c}^T\\mathbf{y})=a^T\\beta$, then\n", 41 | "\n", 42 | "\\begin{equation}\n", 43 | "\\text{Var}(a^T\\hat\\beta) \\le \\text{Var}(\\mathbf{c}^T\\mathbf{y}).\n", 44 | "\\end{equation}\n", 45 | "\n", 46 | "The proof (Exercise 3.3) uses the triangle inequality.\n", 47 | "\n", 48 | "For simplicity we have stated the result in terms of estimation of a single parameter $a^T \\beta$, but with a few more definitions one can state it in terms of the entire parameter vector $\\beta$ (Exercise 3.3)." 
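A small simulation makes the theorem concrete (a sketch; the design matrix, $\beta$, $a$, and the competing estimator below are arbitrary illustrative choices): weighted least squares with arbitrary positive weights is also linear and unbiased for $a^T\beta$, but its sampling variance cannot beat that of ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design X (with intercept), true beta, homoskedastic noise.
n, sigma = 40, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
a = np.array([0.0, 1.0, 1.0])                 # target theta = a^T beta

ols = np.linalg.inv(X.T @ X) @ X.T            # beta_hat = ols @ y

# A competing *linear, unbiased* estimator: weighted least squares with
# arbitrary positive weights (unbiased because E[y] = X beta regardless).
w = rng.uniform(0.2, 5.0, size=n)
wls = np.linalg.inv(X.T @ (w[:, None] * X)) @ (X.T * w)

reps = 100_000
Y = X @ beta[:, None] + rng.normal(scale=sigma, size=(n, reps))
theta_ols = a @ (ols @ Y)                     # one estimate per simulated y
theta_wls = a @ (wls @ Y)

print("target a^T beta      :", a @ beta)
print("means (both unbiased):", theta_ols.mean(), theta_wls.mean())
print("Var OLS =", theta_ols.var(),
      "(theory:", sigma**2 * a @ np.linalg.inv(X.T @ X) @ a, ")")
print("Var WLS =", theta_wls.var(), " -- never smaller, per Gauss-Markov")
```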
49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "### Implications of the Gauss-Markov theorem\n", 56 | "\n", 57 | "Consider the mean squared error of an estimator $\\tilde\\theta$ of $\\theta$:\n", 58 | "\n", 59 | "\\begin{align}\n", 60 | "\\text{MSE}(\\tilde\\theta) &= \\text{E}\\left(\\tilde\\theta-\\theta\\right)^2 \\\\\n", 61 | "&= \\text{Var}\\left(\\tilde\\theta\\right) + \\left[\\text{E}\\left(\\tilde\\theta-\\theta\\right)\\right]^2 \\\\\n", 62 | "&= \\text{Var} + \\text{Bias}^2\n", 63 | "\\end{align}\n", 64 | "\n", 65 | "The Gauss-Markov theorem implies that the least squares estimator has the smallest MSE of all linear estimators with no bias. However there may well exist a biased estimator with smaller MSE. Such an estimator would trade a little bias for a larger reduction in variance.\n", 66 | "\n", 67 | "Biased estimates are commonly used. Any method that shrinks or sets to zero some of the least squares coefficients may result in a biased estimate. We discuss many examples, including variable subset selection and ridge regression, later in this chapter.\n", 68 | "\n", 69 | "From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance. We go into these issues in more detail in Chapter 7." 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### Relation between prediction accuracy and MSE\n", 77 | "\n", 78 | "MSE is intimately related to prediction accuracy, as discussed in Chapter 2.\n", 79 | "\n", 80 | "Consider the prediction of the new response at input $x_0$,\n", 81 | "\n", 82 | "\\begin{equation}\n", 83 | "Y_0 = f(x_0) + \\epsilon.\n", 84 | "\\end{equation}\n", 85 | "\n", 86 | "Then the expected prediction error of an estimate $\\tilde{f}(x_0)=x_0^T\\tilde\\beta$ is\n", 87 | "\n", 88 | "\\begin{align}\n", 89 | "\\text{E}(Y_0 - \\tilde{f}(x_0))^2 &= \\text{E}\\left(Y_0 -f(x_0)+f(x_0) - \\tilde{f}(x_0)\\right)^2\\\\\n", 90 | "&= \\sigma^2 + \\text{E}\\left(x_o^T\\tilde\\beta - f(x_0)\\right)^2 \\\\\n", 91 | "&= \\sigma^2 + \\text{MSE}\\left(\\tilde{f}(x_0)\\right).\n", 92 | "\\end{align}\n", 93 | "\n", 94 | "Therefore, expected prediction error and MSE differ only by the constant $\\sigma^2$, representing the variance of the new observation $y_0$." 95 | ] 96 | } 97 | ], 98 | "metadata": { 99 | "kernelspec": { 100 | "display_name": "Python 3", 101 | "language": "python", 102 | "name": "python3" 103 | }, 104 | "language_info": { 105 | "codemirror_mode": { 106 | "name": "ipython", 107 | "version": 3 108 | }, 109 | "file_extension": ".py", 110 | "mimetype": "text/x-python", 111 | "name": "python", 112 | "nbconvert_exporter": "python", 113 | "pygments_lexer": "ipython3", 114 | "version": "3.6.4" 115 | } 116 | }, 117 | "nbformat": 4, 118 | "nbformat_minor": 2 119 | } 120 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section2-4-multiple-outputs.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 3.2.4. 
Multiple Outputs\n", 8 | "\n", 9 | "Suppose we have\n", 10 | "* multiple outputs $Y_1,Y_2,\\cdots,Y_K$\n", 11 | "* inputs $X_0,X_1,\\cdots,X_p$\n", 12 | "* a linear model for each output \n", 13 | "\\begin{align}\n", 14 | "Y_k &= \\beta_{0k} + \\sum_{j=1}^p X_j\\beta_{jk} + \\epsilon_k \\\\\n", 15 | "&= f_k(X) + \\epsilon_k\n", 16 | "\\end{align}\n", 17 | "\n", 18 | "> the coefficients for the $k$th outcome are just the least squares estimates in the regression of $y_k$ on $x_0,x_1,\\cdots,x_p$ . Multiple outputs do not affect one another’s least squares estimates.\n", 19 | "\n", 20 | "With $N$ training cases we can write the model in matrix notation\n", 21 | "\n", 22 | "\\begin{equation}\n", 23 | "\\mathbf{Y}=\\mathbf{XB}+\\mathbf{E},\n", 24 | "\\end{equation}\n", 25 | "\n", 26 | "where\n", 27 | "* $\\mathbf{Y}$ is $N\\times K$ with $ik$ entry $y_{ik}$,\n", 28 | "* $\\mathbf{X}$ is $N\\times(p+1)$ input matrix,\n", 29 | "* $\\mathbf{B}$ is $(p+1)\\times K$ parameter matrix,\n", 30 | "* $\\mathbf{E}$ is $N\\times K$ error matrix.\n", 31 | "\n", 32 | "A straightforward generalization of the univariate loss function is\n", 33 | "\n", 34 | "\\begin{align}\n", 35 | "\\text{RSS}(\\mathbf{B}) &= \\sum_{k=1}^K \\sum_{i=1}^N \\left( y_{ik} - f_k(x_i) \\right)^2 \\\\\n", 36 | "&= \\text{trace}\\left( (\\mathbf{Y}-\\mathbf{XB})^T(\\mathbf{Y}-\\mathbf{XB}) \\right)\n", 37 | "\\end{align}\n", 38 | "\n", 39 | "The least squares estimates have exactly the same form as before\n", 40 | "\n", 41 | "\\begin{equation}\n", 42 | "\\hat{\\mathbf{B}} = \\left(\\mathbf{X}^T\\mathbf{X}\\right)^{-1}\\mathbf{X}^T\\mathbf{Y}.\n", 43 | "\\end{equation}" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Correlated errors\n", 51 | "\n", 52 | "If the errors $\\epsilon = (\\epsilon_1,\\cdots,\\epsilon_K)$ are correlated with $\\text{Cov}(\\epsilon)=\\mathbf{\\Sigma}$, then the multivariate weighted criterion\n", 53 | "\n", 54 | "\\begin{equation}\n", 55 | "\\text{RSS}(\\mathbf{B};\\mathbf{\\Sigma}) = \\sum_{i=1}^N (y_i-f(x_i))^T \\mathbf{\\Sigma}^{-1} (y_i-f(x_i))\n", 56 | "\\end{equation}\n", 57 | "\n", 58 | "arises naturally from multivariate Gaussian theory. Here\n", 59 | "* $f(x) = \\left(f_1(x),\\cdots,f_K(x)\\right)^T$ is the vector function,\n", 60 | "* $y_i$ the vector of $K$ responses for observation $i$.\n", 61 | "However, the solution is again the same with ignoring the correlations as\n", 62 | "\n", 63 | "\\begin{equation}\n", 64 | "\\hat{\\mathbf{B}} = \\left(\\mathbf{X}^T\\mathbf{X}\\right)^{-1}\\mathbf{X}^T\\mathbf{Y}.\n", 65 | "\\end{equation}\n", 66 | "\n", 67 | "In Section 3.7 we pursue the multiple output problem and consider situations where it does pay to combine the regressions." 
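A short numerical check of these claims (a sketch on simulated data; all sizes and coefficient values are arbitrary): the joint estimate $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ coincides, column by column, with $K$ separate single-output regressions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated multi-output data.
N, p, K = 100, 3, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1)
B = rng.normal(size=(p + 1, K))                               # true coefficients
Y = X @ B + rng.normal(scale=0.5, size=(N, K))                # N x K

# Joint least squares estimate: Bhat = (X^T X)^{-1} X^T Y.
Bhat = np.linalg.solve(X.T @ X, X.T @ Y)

# Each column of Bhat equals the separate regression of y_k on X:
# multiple outputs do not affect one another's least squares estimates.
for k in range(K):
    bk, *_ = np.linalg.lstsq(X, Y[:, k], rcond=None)
    assert np.allclose(bk, Bhat[:, k])

# The criterion being minimized: RSS(B) = trace((Y - XB)^T (Y - XB)).
resid = Y - X @ Bhat
print("RSS(Bhat) =", np.trace(resid.T @ resid))
```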
68 | ] 69 | } 70 | ], 71 | "metadata": { 72 | "kernelspec": { 73 | "display_name": "Python 3", 74 | "language": "python", 75 | "name": "python3" 76 | }, 77 | "language_info": { 78 | "codemirror_mode": { 79 | "name": "ipython", 80 | "version": 3 81 | }, 82 | "file_extension": ".py", 83 | "mimetype": "text/x-python", 84 | "name": "python", 85 | "nbconvert_exporter": "python", 86 | "pygments_lexer": "ipython3", 87 | "version": "3.5.2" 88 | } 89 | }, 90 | "nbformat": 4, 91 | "nbformat_minor": 2 92 | } 93 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section4-0-shrinkage-methods.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 3.4. Shrinkage Methods\n", 8 | "\n", 9 | "By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process -- variables are either retained or discarded -- it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.\n", 10 | "\n", 11 | "Let's take a break, and grap a cup of coffee :)" 12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 3", 18 | "language": "python", 19 | "name": "python3" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 3 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython3", 31 | "version": "3.5.2" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 2 36 | } 37 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section4-2-the-lasso.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 3.4.2. The Lasso\n", 8 | "\n", 9 | "The lasso estimate is defined by\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "\\hat\\beta^{\\text{lasso}} = \\arg\\min_\\beta \\sum_{i=1}^N \\left( y_i-\\beta_0-\\sum_{j=1}^p x_{ij}\\beta_j \\right)^2 \\text{ subject to } \\sum_{j=1}^p |\\beta_j| \\le t,\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "Just as in ridge regression, we can re-parametrize the constant $\\beta_0$ by standardizing the predictors; $\\hat\\beta_0 = \\bar{y}$, and thereafter we fit a model without an intercept.\n", 16 | "\n", 17 | "In the signal processing literature, the lasso is a.k.a. *basis pursuit* (Chen et al., 1998).\n", 18 | "\n", 19 | "Also the lasso problem has the equivalent *Lagrangian form*\n", 20 | "\n", 21 | "\\begin{equation}\n", 22 | "\\hat\\beta^{\\text{lasso}} = \\arg\\min_\\beta \\left\\lbrace \\frac{1}{2}\\sum_{i=1}^N \\left( y_i-\\beta_0-\\sum_{j=1}^p x_{ij}\\beta_j \\right)^2 + \\lambda\\sum_{j=1}^p |\\beta_j| \\right\\rbrace,\n", 23 | "\\end{equation}\n", 24 | "\n", 25 | "which is similar to the ridge problem as the $L_2$ ridge penalty is replaced by the $L_1$ lasso penalty. This lasso constraint makes the solutions nonlinear in the $y_i$, and there is no closed form expresssion as in ridge regression. 
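Absent a closed form, a single lasso solution can still be computed by a simple iterative scheme. Below is a minimal cyclic coordinate-descent sketch for the Lagrangian form above (this is not the path algorithm of $\S$ 3.4.4; the function names, simulated data, and $\lambda$ values are illustrative): each coordinate update is a soft-thresholding of a univariate least squares coefficient, so larger $\lambda$ both shrinks the coefficients and sets some of them exactly to zero.

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for
        (1/2) * sum_i (y_i - b0 - x_i^T b)^2 + lam * sum_j |b_j|.
    Predictors are centered, so the intercept is just mean(y)."""
    Xc = X - X.mean(axis=0)
    b0 = y.mean()
    r = y - b0                                # residual with b = 0
    b = np.zeros(X.shape[1])
    col_ss = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = r + Xc[:, j] * b[j]           # remove j's contribution
            b[j] = soft_threshold(Xc[:, j] @ r, lam) / col_ss[j]
            r = r - Xc[:, j] * b[j]           # add it back
    return b0, b

# Tiny demonstration: increasing lambda shrinks the coefficients and zeroes
# some of them out (sizes and values are arbitrary).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=100)
for lam in (0.0, 20.0, 80.0):
    _, b = lasso_cd(X, y, lam)
    print(f"lambda={lam:5.1f}  beta_hat={np.round(b, 2)}")
```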
And computing the above lasso solution is a quadratic programming problem, although efficient algorithms, introduced in $\\S$ 3.4.4, are available for computing the entire path of solution as $\\lambda$ varies, with the same computational cost as for ridge regression." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Note that\n", 33 | "\n", 34 | "* If $t \\gt t_0 = \\sum_1^p \\lvert\\hat\\beta_j^{\\text{ls}}\\rvert$, then $\\hat\\beta^{\\text{lasso}} = \\hat\\beta^{\\text{ls}}$.\n", 35 | "* Say, for $t = t_0/2$, then the least squares coefficients are shrunk by about $50\\%$ on average. \n", 36 | "However, the nature of the shrinkage is not obvious, and we investigate it further in $\\S$ 3.4.4." 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "In FIGURE 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter\n", 44 | "\n", 45 | "\\begin{equation}\n", 46 | "s = \\frac{t}{\\sum_1^p \\lvert\\hat\\beta_j\\rvert}.\n", 47 | "\\end{equation}\n", 48 | "\n", 49 | "A value $\\hat s \\approx 0.36$ was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (see Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates are fairly large." 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "FIGURE 3.10 is discussed after implementing the lasso algorithm in $\\S$ 3.4.4." 57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "kernelspec": { 62 | "display_name": "Python 3", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "codemirror_mode": { 68 | "name": "ipython", 69 | "version": 3 70 | }, 71 | "file_extension": ".py", 72 | "mimetype": "text/x-python", 73 | "name": "python", 74 | "nbconvert_exporter": "python", 75 | "pygments_lexer": "ipython3", 76 | "version": "3.5.2" 77 | } 78 | }, 79 | "nbformat": 4, 80 | "nbformat_minor": 2 81 | } 82 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section5-0-methods-using-derived-input-directions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 3.5. Methods Using Derived Input Directions\n", 8 | "\n", 9 | "In many situations we have a large number of inputs, often very correlated. The methods in this section produce a small number of linear combinations $Z_m$, $m=1,\\cdots,M$ of the original inputs $X_j$, and the $Z_m$ are then used in place of the $X_j$ as inputs in regression. The methods differ in how the linear combinations are constructed." 
10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.5.2" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section5-1-principal-components-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 3.5. Methods Using Derived Input Directions\n", 8 | "## $\\S$ 3.5.1. Principal Components Regression\n", 9 | "\n", 10 | "The linear combinations $Z_m$ used in principal component regression (PCR) are the principal components as defined in $\\S$ 3.4.1." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "PCR forms the derived input columns\n", 18 | "\n", 19 | "\\begin{equation}\n", 20 | "\\mathbf{z}_m = \\mathbf{X} v_m,\n", 21 | "\\end{equation}\n", 22 | "\n", 23 | "and then regress $\\mathbf{y}$ on $\\mathbf{z}_1,\\mathbf{z}_2,\\cdots,\\mathbf{z}_M$ for some $M\\le p$. Since the $\\mathbf{z}_m$ are orthogonal, this regression is just a sum of univariate regressions:\n", 24 | "\n", 25 | "\\begin{equation}\n", 26 | "\\hat{\\mathbf{y}}_{(M)}^{\\text{pcr}} = \\bar{y}\\mathbf{1} + \\sum_{m=1}^M \\hat\\theta_m \\mathbf{z}_m = \\bar{y}\\mathbf{1} + \\mathbf{X}\\mathbf{V}_M\\hat{\\mathbf{\\theta}},\n", 27 | "\\end{equation}\n", 28 | "\n", 29 | "where $\\hat\\theta_m = \\langle\\mathbf{z}_m,\\mathbf{y}\\rangle \\big/ \\langle\\mathbf{z}_m,\\mathbf{z}_m\\rangle$. We can see from the last equality that, since the $\\mathbf{z}_m$ are each linear combinations of the original $\\mathbf{x}_j$, we can express the solution in terms of coefficients of the $\\mathbf{x}_j$.\n", 30 | "\n", 31 | "\\begin{equation}\n", 32 | "\\hat\\beta^{\\text{pcr}}(M) = \\sum_{m=1}^M \\hat\\theta_m v_m.\n", 33 | "\\end{equation}\n", 34 | "\n", 35 | "As with ridge regression, PCR depends on the scaling of the inputs, so typically we first standardized them." 
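A compact numpy sketch of these formulas (the simulated data, sizes, and the helper name `pcr_coef` are my own choices): take the principal component directions $v_m$ from the SVD of the standardized inputs, regress $\mathbf{y}$ on the derived inputs $\mathbf{z}_m = \mathbf{X}v_m$, and map back to input coefficients; with $M = p$ this reproduces ordinary least squares on the standardized data.

```python
import numpy as np

def pcr_coef(X, y, M):
    """Principal components regression coefficients, following the formulas
    above: standardize X, take the first M principal component directions,
    regress y on z_m = X v_m, and map back via beta_pcr = sum_m theta_m v_m."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized inputs
    yc = y - y.mean()
    # Right singular vectors of Xs are the principal component directions.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    V = Vt.T[:, :M]
    Z = Xs @ V                                        # derived inputs z_m
    theta = (Z * yc[:, None]).sum(axis=0) / (Z ** 2).sum(axis=0)
    return V @ theta                                  # on the standardized scale

# With M = p, PCR coincides with least squares on the standardized inputs.
rng = np.random.default_rng(6)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 0.5, 0.0, -1.5, 2.0]) + rng.normal(size=80)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
b_ls, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
print(np.allclose(pcr_coef(X, y, M=5), b_ls))         # True
print(pcr_coef(X, y, M=2))                            # a shrunken, rank-2 fit
```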
36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### Comparison with ridge regression\n", 43 | "\n", 44 | "If $M=p$, since the columns of $\\mathbf{Z} = \\mathbf{UD}$ span the $\\text{col}(\\mathbf{X})$,\n", 45 | "\n", 46 | "\\begin{equation}\n", 47 | "\\hat\\beta^{\\text{pcr}}(p) = \\hat\\beta^{\\text{ls}}.\n", 48 | "\\end{equation}\n", 49 | "\n", 50 | "For $M PLS seeks direction that have high variance *and* have high correlation with the response, in contrast to PCR with keys only on high variance (Stone and Brooks, 1990; Frank and Friedman, 1993).\n", 78 | "\n", 79 | "Since it uses the response $\\mathbf{y}$ to construct its directions, its solution path is a nonlinear function of $\\mathbf{y}$.\n", 80 | "\n", 81 | "In particular, the $m$th principal component direction $v_m$ solves:\n", 82 | "\n", 83 | "\\begin{equation}\n", 84 | "\\max_\\alpha \\text{Var}(\\mathbf{X}\\alpha)\\\\\n", 85 | "\\text{subject to } \\|\\alpha\\| = 1, \\alpha^T\\mathbf{S} v_l = 0 \\text{ for } l = 1,\\cdots, m-1,\n", 86 | "\\end{equation}\n", 87 | "\n", 88 | "where $\\mathbf{S}$ is the sample covariance matrix of the $\\mathbf{x}_j$. The condition $\\alpha^T\\mathbf{S} v_l= 0$ ensures that $\\mathbf{z}_m = \\mathbf{X}\\alpha$ is uncorrelated with all the previous linear combinations $\\mathbf{z}_l = \\mathbf{X} v_l$.\n", 89 | "\n", 90 | "The $m$th PLS direction $\\hat\\rho_m$ solves:\n", 91 | "\n", 92 | "\\begin{equation}\n", 93 | "\\max_\\alpha \\text{Corr}^2(\\mathbf{y},\\mathbf{S}\\alpha)\\text{Var}(\\mathbf{X}\\alpha)\\\\\n", 94 | "\\text{subject to } \\|\\alpha\\| = 1, \\alpha^T\\mathbf{S}\\hat\\rho_l = 0 \\text{ for } l=1,\\cdots, m-1.\n", 95 | "\\end{equation}\n", 96 | "\n", 97 | "Further analysis reveals that the variance aaspect tends to dominate, and so PLS behaves much like ridge regression and PCR. We discuss further in the next section." 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "If the input matrix $\\mathbf{X}$ is orthogonal, then PLS finds the least squares estimates after the first $m=1$ step, and subsequent steps have no effect since the $\\hat\\rho_{mj} = 0$ for $m>1$ (Exercise 3.14).\n", 105 | "\n", 106 | "It can be also shown that the sequence of PLS coefficients for $m=1,2,\\cdots,p$ represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18)." 107 | ] 108 | } 109 | ], 110 | "metadata": { 111 | "kernelspec": { 112 | "display_name": "Python 3", 113 | "language": "python", 114 | "name": "python3" 115 | }, 116 | "language_info": { 117 | "codemirror_mode": { 118 | "name": "ipython", 119 | "version": 3 120 | }, 121 | "file_extension": ".py", 122 | "mimetype": "text/x-python", 123 | "name": "python", 124 | "nbconvert_exporter": "python", 125 | "pygments_lexer": "ipython3", 126 | "version": "3.5.2" 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 2 131 | } 132 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section6-discussion-a-comparison-of-the-selection-and-shrinkage-methods.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 3.6. Discussion: A Comparison of the Selection and Shrinkage Methods\n", 8 | "\n", 9 | "> PLS, PCR and ridge regression tend to behave similarly. 
Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.\n", 10 | "\n", 11 | "Please check the textbook with FIGURE 3.18." 12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 3", 18 | "language": "python", 19 | "name": "python3" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 3 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython3", 31 | "version": "3.5.2" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 2 36 | } 37 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section1-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 4. Linear Methods for Classification\n", 8 | "# $\\S$ 4.1. Introduction\n", 9 | "\n", 10 | "Since our predictor $G(x)$ takes values in a discrete set $\\mathcal{G}$, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these *decision boundaries* are linear; this is what we will mean by linear methodds for classification." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "### Linear regression\n", 18 | "\n", 19 | "In Chapter 2 we fit linear regression models to the class indicator variable, and classify to the largest fit. Suppose there are $K$ classes labeled $1,\\cdots,K$, and the fitted linear model for the $k$th indicator response variable is\n", 20 | "\n", 21 | "\\begin{equation}\n", 22 | "\\hat{f}_k(x) = \\hat\\beta_{k0} + \\hat\\beta_k^Tx.\n", 23 | "\\end{equation}\n", 24 | "\n", 25 | "The decision boundary between class $k$ and $l$ is that set of points\n", 26 | "\n", 27 | "\\begin{equation}\n", 28 | "\\left\\lbrace x: \\hat{f}_k(x) = \\hat{f}_l(x) \\right\\rbrace = \\left\\lbrace x: (\\hat\\beta_{k0}-\\hat\\beta_{l0}) + (\\hat\\beta_k-\\hat\\beta_l)^Tx = 0 \\right\\rbrace,\n", 29 | "\\end{equation}\n", 30 | "\n", 31 | "which is an affine set or hyperplane. Since the same is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "### Discriminant function\n", 39 | "\n", 40 | "The regression approach is a member of a class of methods that model *discriminant functions* $\\delta_k(x)$ for each class, and then classify $x$ to the class with the largest value for its discriminant function. Methods that model the posterior probabilities $\\text{Pr}(G=k|X=x)$ are also in this class. Clearly, if either the $\\delta_k(x)$ or $\\text{Pr}(G=k|X=x)$ are linear in $x$, then the decision boundaries will be linear.\n", 41 | "\n", 42 | "### Logit transformation\n", 43 | "\n", 44 | "Actually, all we require is that some monotone transformation of $\\delta_k$ or $\\text{Pr}(G=k|X=x)$ be linear for the decision boundaries to be linear. 
For example, if there are two classes, a popular model for the posterior probabilities is\n", 45 | "\n", 46 | "\\begin{align}\n", 47 | "\\text{Pr}(G=1|X=x) &= \\frac{\\exp(\\beta_0+\\beta^Tx)}{1+\\exp(\\beta_0+\\beta^Tx)},\\\\\n", 48 | "\\text{Pr}(G=2|X=x) &= \\frac{1}{1+\\exp(\\beta_0+\\beta^Tx)},\\\\\n", 49 | "\\end{align}\n", 50 | "\n", 51 | "where the monotone transformation is the *logit* transformation\n", 52 | "\n", 53 | "\\begin{equation}\n", 54 | "\\log\\frac{p}{1-p},\n", 55 | "\\end{equation}\n", 56 | "\n", 57 | "and in fact we see that\n", 58 | "\n", 59 | "\\begin{equation}\n", 60 | "\\log\\frac{\\text{Pr}(G=1|X=x)}{\\text{Pr}(G=2|X=x)} = \\beta_0 + \\beta^Tx.\n", 61 | "\\end{equation}\n", 62 | "\n", 63 | "The decision boundary is the set of points for which the *log-odds* are zero, and this is a hyperplane defined by\n", 64 | "\n", 65 | "\\begin{equation}\n", 66 | "\\left\\lbrace x: \\beta_0+\\beta^Tx = 0 \\right\\rbrace.\n", 67 | "\\end{equation}\n", 68 | "\n", 69 | "We discuss two very popular but different methods that result in linear log-odds or logits: Linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the lineaer function is fir to the training data." 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### Separating hyperplanes\n", 77 | "\n", 78 | "A more direct approach is to explicitly model the boundaries between the classes as linear. For a two-class problem, this amounts to modeling the decision boundary as a hyperplane; a normal vector and a cut-point.\n", 79 | "\n", 80 | "We will look at two methods that explicitly look for \"separating hyperplanes\".\n", 81 | "1. The well-known *perceptron* model of Rosenblatt (1958), with an algorithm that finds a separating hyperplane in the training data, if one exists.\n", 82 | "2. Vapnik (1996) finds an *optimally separating hyperplane* if one exists, else finds a hyperplane that minimizes some measure of overlap in the training data.\n", 83 | "\n", 84 | "We treat separable cases here, and defer the nonseparable case to Chapter 12." 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "### Scope for generalization\n", 92 | "\n", 93 | "We can expand the input by including their squares $X_1^2,X_2^2,\\cdots$, and cross-products $X_1X_2,\\cdots$, thereby adding $p(p+1)/2$ additional variables. Linear functions in the augmented space map down to quadratic decision boundaires. FIGURE 4.1 illustrates the idea.\n", 94 | "\n", 95 | "This approach can be used with any basis transformation $h(X): \\mathbb{R}^p\\mapsto\\mathbb{R}^q$ with $q > p$, and will be explored in later chapters." 
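A small sketch of this idea on simulated data (the helper names, the ring-shaped classes, and all numbers are arbitrary choices): augmenting two inputs with their squares and cross-product and then fitting the linear regression of an indicator, as in $\S$ 4.2, yields a quadratic boundary in the original space and handles a class that no linear boundary separates well.

```python
import numpy as np

def quad_expand(X):
    """h(X): append squares and pairwise cross-products to the raw inputs, so
    a linear boundary in the expanded space is quadratic in the original one."""
    n, p = X.shape
    cross = [X[:, i] * X[:, j] for i in range(p) for j in range(i + 1, p)]
    return np.column_stack([X, X ** 2] + cross)

def indicator_fit_accuracy(F, g):
    """Least squares on the 0/1 indicator, classify by thresholding at 1/2."""
    F1 = np.column_stack([np.ones(len(F)), F])
    b, *_ = np.linalg.lstsq(F1, g, rcond=None)
    return np.mean(((F1 @ b) > 0.5) == g)

# Two-class toy data: class 1 forms a ring around class 0.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 2))
g = (np.sum(X ** 2, axis=1) > 2.0).astype(float)

print("linear features   :", indicator_fit_accuracy(X, g))
print("quadratic features:", indicator_fit_accuracy(quad_expand(X), g))
```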
96 | ] 97 | } 98 | ], 99 | "metadata": { 100 | "kernelspec": { 101 | "display_name": "Python 3", 102 | "language": "python", 103 | "name": "python3" 104 | }, 105 | "language_info": { 106 | "codemirror_mode": { 107 | "name": "ipython", 108 | "version": 3 109 | }, 110 | "file_extension": ".py", 111 | "mimetype": "text/x-python", 112 | "name": "python", 113 | "nbconvert_exporter": "python", 114 | "pygments_lexer": "ipython3", 115 | "version": "3.6.4" 116 | } 117 | }, 118 | "nbformat": 4, 119 | "nbformat_minor": 2 120 | } 121 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section3-1-regularized-discriminant-analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 4.3.1. Regularized Discriminant Analysis\n", 8 | "\n", 9 | "### $\\Sigma_k \\leftrightarrow \\Sigma$\n", 10 | "These methods are very similar in flavor to ridge regression. Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance $\\hat\\Sigma$ as in LDA. The regularized covariance matrices have the form\n", 11 | "\n", 12 | "\\begin{equation}\n", 13 | "\\hat\\Sigma_k(\\alpha) = \\alpha\\hat\\Sigma_k + (1+\\alpha)\\hat\\Sigma,\n", 14 | "\\end{equation}\n", 15 | "\n", 16 | "where $\\alpha\\in[0,1]$ allows a continuum of models between LDA and QDA, and needs to be specified. In practice $\\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "### $\\Sigma \\leftrightarrow \\sigma$\n", 24 | "\n", 25 | "Similar modifications allow $\\hat\\Sigma$ itelf to be shrunk toward the scalar covariance,\n", 26 | "\n", 27 | "\\begin{equation}\n", 28 | "\\hat\\Sigma(\\gamma) = \\gamma\\hat\\Sigma + (1-\\gamma)\\hat\\sigma^2\\mathbf{I},\n", 29 | "\\end{equation}\n", 30 | "\n", 31 | "for $\\gamma\\in[0,1]$.\n", 32 | "\n", 33 | "Combining two regularization leads to a more general family of covariances $\\hat\\Sigma(\\alpha,\\gamma)$." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### To be continued\n", 41 | "\n", 42 | "In Chapter 12, we discuss other regularized versions of LDA, which are more suitable when the data arise from digitized analog signals and images. In these situations the features are high-dimensional and correlated, and the LDA coefficients can be regularized to be smooth or sparse in original domain of the signal.\n", 43 | "\n", 44 | "In Chapter 18, we also deal with very high-dimensional problems, where for example, the features are gene-expression measurements in microarray studies." 
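Note that the first display is the convex combination $\hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1-\alpha)\hat\Sigma$; the "$1+\alpha$" above is a typo. A minimal sketch of both shrinkage steps follows (the function name, the choice $\hat\sigma^2 = \text{trace}(\hat\Sigma)/p$ for the scalar covariance, and the simulated data are my own assumptions, not from the text):

```python
import numpy as np

def rda_covariances(X, g, alpha, gamma=1.0):
    """Regularized discriminant analysis covariances (Friedman, 1989):
        Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma(gamma),
        Sigma(gamma)   = gamma * Sigma + (1 - gamma) * sigma2 * I,
    where Sigma is the pooled within-class covariance and sigma2 is taken
    here as trace(Sigma)/p, one common scalar-covariance choice."""
    classes = np.unique(g)
    n, p = X.shape
    sigmas, pooled = {}, np.zeros((p, p))
    for k in classes:
        Xk = X[g == k]
        sigmas[k] = np.cov(Xk, rowvar=False)       # class covariance estimate
        pooled += (len(Xk) - 1) * sigmas[k]
    pooled /= n - len(classes)
    # Shrink the pooled covariance toward the scalar covariance sigma2 * I.
    pooled = gamma * pooled + (1 - gamma) * (np.trace(pooled) / p) * np.eye(p)
    # Shrink each class covariance toward the (possibly shrunken) pooled one.
    return {k: alpha * sigmas[k] + (1 - alpha) * pooled for k in classes}

# alpha = 1 gives QDA's separate covariances, alpha = 0 gives LDA's common
# one; in practice alpha (and gamma) would be tuned on validation data or by
# cross-validation, as noted above.
rng = np.random.default_rng(8)
X = rng.normal(size=(60, 3))
g = rng.integers(0, 2, size=60)
print(rda_covariances(X, g, alpha=0.5, gamma=0.8)[0].shape)   # (3, 3)
```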
45 | ] 46 | } 47 | ], 48 | "metadata": { 49 | "kernelspec": { 50 | "display_name": "Python 3", 51 | "language": "python", 52 | "name": "python3" 53 | }, 54 | "language_info": { 55 | "codemirror_mode": { 56 | "name": "ipython", 57 | "version": 3 58 | }, 59 | "file_extension": ".py", 60 | "mimetype": "text/x-python", 61 | "name": "python", 62 | "nbconvert_exporter": "python", 63 | "pygments_lexer": "ipython3", 64 | "version": "3.5.2" 65 | } 66 | }, 67 | "nbformat": 4, 68 | "nbformat_minor": 2 69 | } 70 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section3-2-computations-for-lda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 4.3.2. Computations for LDA\n", 8 | "\n", 9 | "Computations for LDA and QDA are simplified by diagonalizing $\\hat\\Sigma$ or $\\hat\\Sigma_k$. For the latter, suppose we compute the eigen-decomposition, for each $k$,\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "\\hat\\Sigma_k = \\mathbf{U}_k\\mathbf{D}_k\\mathbf{U}_K^T,\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "where $\\mathbf{U}_k$ is $p\\times p$ orthogonal, and $\\mathbf{D}_k$ a diagonal matrix of positive eigenvalues $d_{kl}$.\n", 16 | "\n", 17 | "Then the ingredients for $\\delta_k(x)$ are\n", 18 | "* $(x-\\hat\\mu_k)^T\\hat\\Sigma_k^{-1}(x-\\hat\\mu_k) = \\left[\\mathbf{U}_k(x-\\hat\\mu_k)\\right]^T\\mathbf{D}_k^{-1}\\left[\\mathbf{U}_k(x-\\hat\\mu_k)\\right]$\n", 19 | "* $\\log|\\hat\\Sigma_k| = \\sum_l \\log d_{kl}$\n", 20 | "\n", 21 | "Note that the inversion of diagonal matrices only requires elementwise reprocials.\n", 22 | "\n", 23 | "The LDA classifier can be implemented by the following pair of steps:\n", 24 | "* *Sphere* the data w.r.t. the common covariance estimate $\\hat\\Sigma = \\mathbf{U}\\mathbf{D}\\mathbf{U}^T$: \n", 25 | "\\begin{equation}\n", 26 | "X* \\rightarrow \\mathbf{D}^{-\\frac{1}{2}}\\mathbf{U}^TX,\n", 27 | "\\end{equation} \n", 28 | "The common covariance estimate of $X*$ will now be the identity.\n", 29 | "* Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\\pi_k$." 30 | ] 31 | } 32 | ], 33 | "metadata": { 34 | "kernelspec": { 35 | "display_name": "Python 3", 36 | "language": "python", 37 | "name": "python3" 38 | }, 39 | "language_info": { 40 | "codemirror_mode": { 41 | "name": "ipython", 42 | "version": 3 43 | }, 44 | "file_extension": ".py", 45 | "mimetype": "text/x-python", 46 | "name": "python", 47 | "nbconvert_exporter": "python", 48 | "pygments_lexer": "ipython3", 49 | "version": "3.5.2" 50 | } 51 | }, 52 | "nbformat": 4, 53 | "nbformat_minor": 2 54 | } 55 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section4-0-logistic-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 4.4. 
Logistic Regression\n", 8 | "\n", 9 | "The logistic regression model arises from the desire to model the posterior probabilities of the $K$ classes via linear functions in $x$, ensuring the natural properties of the probability: They sum to one and remain in $[0,1]$.\n", 10 | "\n", 11 | "The model has the form\n", 12 | "\n", 13 | "\\begin{align}\n", 14 | "\\log\\frac{\\text{Pr}(G=1|X=x)}{\\text{Pr}(G=K|X=x)} &= \\beta_{10} + \\beta_1^Tx \\\\\n", 15 | "\\log\\frac{\\text{Pr}(G=2|X=x)}{\\text{Pr}(G=K|X=x)} &= \\beta_{20} + \\beta_2^Tx \\\\\n", 16 | "&\\vdots \\\\\n", 17 | "\\log\\frac{\\text{Pr}(G=K-1|X=x)}{\\text{Pr}(G=K|X=x)} &= \\beta_{(K-1)0} + \\beta_{K-1}^Tx \\\\\n", 18 | "\\end{align}\n", 19 | "\n", 20 | "The model is specified in terms of $K-1$ log-odds or logit transformations, reflecting the constraint that the probabilities sum to one. The choice of denominator ($K$ in this case) is arbitrary in that the estimates are equivalent under this choice." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### Sum to one\n", 28 | "\n", 29 | "To emphasize the dependence on the entire parameter set $\\theta = \\left\\lbrace \\beta_{10}, \\beta_1^T, \\cdots, \\beta_{(K-1)0}, \\beta_{K-1}^T\\right\\rbrace$, we denote the probabilities\n", 30 | "\n", 31 | "\\begin{equation}\n", 32 | "\\text{Pr}(G=k|X=x) = p_k(x;\\theta)\n", 33 | "\\end{equation}\n", 34 | "\n", 35 | "A simple calculation shows that\n", 36 | "\n", 37 | "\\begin{align}\n", 38 | "\\text{Pr}(G=k|X=x) &= \\frac{\\exp(\\beta_{k0}+\\beta_k^Tx)}{1+\\sum_{l=1}^{K-1}\\exp(\\beta_{l0}+\\beta_l^Tx)}, \\text{ for } k=1,\\cdots,K-1, \\\\\n", 39 | "\\text{Pr}(G=K|X=x) &= \\frac{1}{1+\\sum_{l=1}^{K-1}\\exp(\\beta_{l0}+\\beta_l^Tx)},\n", 40 | "\\end{align}\n", 41 | "\n", 42 | "and they clearly sum to one.\n", 43 | "\n", 44 | "When $K=2$, this model is especially simple, since there is only a single linear function." 45 | ] 46 | } 47 | ], 48 | "metadata": { 49 | "kernelspec": { 50 | "display_name": "Python 3", 51 | "language": "python", 52 | "name": "python3" 53 | }, 54 | "language_info": { 55 | "codemirror_mode": { 56 | "name": "ipython", 57 | "version": 3 58 | }, 59 | "file_extension": ".py", 60 | "mimetype": "text/x-python", 61 | "name": "python", 62 | "nbconvert_exporter": "python", 63 | "pygments_lexer": "ipython3", 64 | "version": "3.5.2" 65 | } 66 | }, 67 | "nbformat": 4, 68 | "nbformat_minor": 2 69 | } 70 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section4-3-quadratic-approximations-and-inference.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 4.4.3. Quadratic Approximations and Inference\n", 8 | "\n", 9 | "Please check this section later..." 
10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.5.2" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section4-4-l1-regularized-logistic-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 4.4.1. L1 Regularized Logistic Regression\n", 8 | "\n", 9 | "Please check this section later..." 10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.5.2" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section4-5-logistic-or-lda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 4.4.5. 
Logistic Regression or LDA?\n", 8 | "### Common linearity\n", 9 | "\n", 10 | "LDA has the log-posterior odds which are linear functions of x:\n", 11 | "\n", 12 | "\\begin{align}\n", 13 | "\\log\\frac{\\text{Pr}(G=k|X=x)}{\\text{Pr}(G=K|X=x)} &= \\log\\frac{\\pi_k}{\\pi_K} - \\frac{1}{2}(\\mu_k-\\mu_K)^T\\Sigma^{-1}(\\mu_k-\\mu_K) + x^T\\Sigma^{-1}(\\mu_k-\\mu_K) \\\\\n", 14 | "&= \\alpha_{k0} + \\alpha_k^Tx,\n", 15 | "\\end{align}\n", 16 | "\n", 17 | "and this linearity is a consequence of the Gaussian assumption for the class densities with a common covariance matrix.\n", 18 | "\n", 19 | "The linear logistic model by construction has linear logits:\n", 20 | "\n", 21 | "\\begin{equation}\n", 22 | "\\log\\frac{\\text{Pr}(G=k|X=x)}{\\text{Pr}(G=K|X=x)} = \\beta_{k0} + \\beta_k^Tx\n", 23 | "\\end{equation}\n", 24 | "\n", 25 | "It seems that the models are the same, and they have the common logit-linear form for the posterior probabilities:\n", 26 | "\n", 27 | "\\begin{equation}\n", 28 | "\\text{Pr}(G=k|X=x) = \\frac{\\exp(\\beta_{k0}+\\beta_k^Tx)}{1+\\sum_{l=1}^{K-1}\\exp(\\beta_{l0}+\\beta_l^Tx)}\n", 29 | "\\end{equation}" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Different assumptions\n", 37 | "\n", 38 | "Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated: The logistic regression model is more general, in that it makes less assumptions.\n", 39 | "\n", 40 | "Note the *joint density* of $X$ and $G$ as\n", 41 | "\n", 42 | "\\begin{equation}\n", 43 | "\\text{Pr}(X,G=k) = \\text{Pr}(X)\\text{Pr}(G=k|X),\n", 44 | "\\end{equation}\n", 45 | "\n", 46 | "where $\\text{Pr}(X)$ denotes the marginal density of the inputs $X$.\n", 47 | "\n", 48 | "The logistic regression model leaves $\\text{Pr}(X)$ as an arbitrary density function, and fits the parameters of $\\text{Pr}(G|X)$ by maximizing the *conditional likelihood* -- the multinomial likelihood with probabilities the $\\text{Pr}(G=k|X)$. Although $\\text{Pr}(X)$ is totally ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion, using empirical distribution function which places mass $1/N$ at each observation.\n", 49 | "\n", 50 | "LDA fits the parameters by maximizing the full log-likelihood, based on the joint density\n", 51 | "\n", 52 | "\\begin{equation}\n", 53 | "\\text{Pr}(X,G=k) = \\phi(X;\\mu_k,\\Sigma)\\pi_k,\n", 54 | "\\end{equation}\n", 55 | "\n", 56 | "where $\\phi$ is the Gaussian density function. Standard normal theory leads easily to the estimates $\\hat\\mu_k$, $\\hat\\Sigma$, and $\\hat\\pi_k$ given in Section 4.3. Since the linear parameters of the logistic form\n", 57 | "\n", 58 | "\\begin{equation}\n", 59 | "\\log\\frac{\\text{Pr}(G=k|X=x)}{\\text{Pr}(G=K|X=x)} = \\log\\frac{\\pi_k}{\\pi_K} - \\frac{1}{2}(\\mu_k-\\mu_K)^T\\Sigma^{-1}(\\mu_k-\\mu_K) + x^T\\Sigma^{-1}(\\mu_k-\\mu_K)\n", 60 | "\\end{equation}\n", 61 | "\n", 62 | "are functions of the Gaussian parameters, we get their maximum-likelihood estimates by plugging in the corresponding estimates." 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### Role of the marginal density $\\text{Pr}(X)$ in LDA\n", 70 | "\n", 71 | "However, unlike in the conditional case, the marginal density $\\text{Pr}(X)$ does play a role here. 
It is a mixture density\n", 72 | "\n", 73 | "\\begin{equation}\n", 74 | "\\text{Pr}(X) = \\sum_{k=1}^K \\pi_k\\phi(X;\\mu_k,\\Sigma),\n", 75 | "\\end{equation}\n", 76 | "\n", 77 | "which also involves the parameters. What role can this additional component or restriction play?\n", 78 | "\n", 79 | "By relying on the additional model assumptions, we have more information about the parameters, and hence can estimate them more efficiently (lower variance). If in fact the true $f_k(x)$ are Gaussian, then in the worst case ignoring this marginal part of the likelihood constitutes a loss of efficiency of about $30\\%$ asymptotically in the error rate (Efron, 1975). Paraphrasing: With $30\\%$ more data, the conditional likelihood will do as well.\n", 80 | "\n", 81 | "For example, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix. This is not a good news, because it also means that LDA is not robust to gross outliers." 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### Marginal likelihood as a regularizer\n", 89 | "\n", 90 | "The marginal likelihood can be thought of as a regularizer, requiring in some sense that class densities be *visible* from this marginal view.\n", 91 | "\n", 92 | "For example, if the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5).\n", 93 | "\n", 94 | "The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### In real world\n", 102 | "\n", 103 | "> It is generally felt that logistic regression is a safer and more robust bet than the LDA model, relying on fewer assumptions.\n", 104 | "\n", 105 | "In practice these assumptions are never correct, and often some of the components of $X$ are qualitative variables. It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors." 106 | ] 107 | } 108 | ], 109 | "metadata": { 110 | "kernelspec": { 111 | "display_name": "Python 3", 112 | "language": "python", 113 | "name": "python3" 114 | }, 115 | "language_info": { 116 | "codemirror_mode": { 117 | "name": "ipython", 118 | "version": 3 119 | }, 120 | "file_extension": ".py", 121 | "mimetype": "text/x-python", 122 | "name": "python", 123 | "nbconvert_exporter": "python", 124 | "pygments_lexer": "ipython3", 125 | "version": "3.5.2" 126 | } 127 | }, 128 | "nbformat": 4, 129 | "nbformat_minor": 2 130 | } 131 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section5-0-separating-hyperplanes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 4.5. Separating Hyperplanes\n", 8 | "\n", 9 | "We describe separating hyperplane classifiers, constructing linear decision boundaries that explicitly try to separate the data into different classes as well as possible. 
They provide the basis for support vector classifiers, discussed in Chapter 12.\n", 10 | "\n", 11 | "FIGURE 4.14 shows 20 data points of two classes in $\\mathbb{R}^2$, which can be separated by a linear boundary, but there are infinitely many possible *separating hyperplanes*.\n", 12 | "\n", 13 | "The orange line is the least squares solution to the problem, obtained by regressing the $-1/1$ response $Y$ on $X$ with intercept; the line is given by\n", 14 | "\n", 15 | "\\begin{equation}\n", 16 | "\\left\\lbrace x: \\hat\\beta_0 + \\hat\\beta_1x_1 + \\hat\\beta_2x_2=0 \\right\\rbrace.\n", 17 | "\\end{equation}\n", 18 | "\n", 19 | "This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by LDA, in light of its equivalence with linear regression in the two-class case ($\\S$ 4.3 and Exercise 4.2)." 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Perceptrons\n", 27 | "\n", 28 | "Classifiers such as the one above, which compute a linear combination of the input features and return the sign, were called *perceptrons* in the engineering literature in the late 1950s (Rosenblatt, 1958). Perceptrons set the foundations for the neural network models of the 1980s and 1990s." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "### Review on vector algebra\n", 36 | "\n", 37 | "FIGURE 4.15 depicts a hyperplane or *affine set* $L$ defined by the equation\n", 38 | "\n", 39 | "\\begin{equation}\n", 40 | "f(x) = \\beta_0 + \\beta^T x = 0,\n", 41 | "\\end{equation}\n", 42 | "\n", 43 | "since we are in $\\mathbb{R}^2$, this is a line.\n", 44 | "\n", 45 | "Here we list some properties:\n", 46 | "1. For any two points $x_1$ and $x_2$ lying in $L$, \n", 47 | "\\begin{equation}\n", 48 | "\\beta^T(x_1-x_2)=0,\n", 49 | "\\end{equation}\n", 50 | "and hence the unit vector $\\beta^* = \\beta/\\|\\beta\\|$ is normal to the surface of $L$.\n", 51 | "2. For any point $x_0$ in $L$, \n", 52 | "\\begin{equation}\n", 53 | "\\beta^Tx_0 = -\\beta_0.\n", 54 | "\\end{equation}\n", 55 | "3. The signed distance of any point $x$ to $L$ is given by \n", 56 | "\\begin{align}\n", 57 | "\\beta^{*T}(x-x_0) &= \\frac{1}{\\|\\beta\\|}(\\beta^Tx+\\beta_0) \\\\\n", 58 | "&= \\frac{1}{\\|f'(x)\\|}f(x).\n", 59 | "\\end{align}\n", 60 | "Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x)=0$." 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "kernelspec": { 66 | "display_name": "Python 3", 67 | "language": "python", 68 | "name": "python3" 69 | }, 70 | "language_info": { 71 | "codemirror_mode": { 72 | "name": "ipython", 73 | "version": 3 74 | }, 75 | "file_extension": ".py", 76 | "mimetype": "text/x-python", 77 | "name": "python", 78 | "nbconvert_exporter": "python", 79 | "pygments_lexer": "ipython3", 80 | "version": "3.5.2" 81 | } 82 | }, 83 | "nbformat": 4, 84 | "nbformat_minor": 2 85 | } 86 | -------------------------------------------------------------------------------- /chapter04-linear-methods-for-classification/section5-1-rosenblatt-perceptron-learning-algorithm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 4.5.1. 
Rosenblatt's Perceptron Learning Algorithm\n", 8 | "\n", 9 | "> The *perceptron learning algorithm* tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.\n", 10 | "\n", 11 | "If a response $y_i=1$ is misclassified, then $x_i^T\\beta + \\beta_0 \\lt 0$, and the opposite for a misclassified response with $y_i=-1$. The goal is to minimize\n", 12 | "\n", 13 | "\\begin{equation}\n", 14 | "D(\\beta,\\beta_0) = -\\sum_{i\\in\\mathcal{M}} y_i(x_i^T\\beta + \\beta_0),\n", 15 | "\\end{equation}\n", 16 | "\n", 17 | "where $\\mathcal{M}$ indexes the set of misclassified points. The quantity is non-negative and proportional to the distance of the misclassified points to the decision boundary defined by $\\beta^Tx+\\beta_0=0$.\n", 18 | "\n", 19 | "Assuming $\\mathcal{M}$ is fixed, the gradient is given by\n", 20 | "\n", 21 | "\\begin{align}\n", 22 | "\\frac{\\partial D(\\beta,\\beta_0)}{\\partial\\beta} &= -\\sum_{i\\in\\mathcal{M}} y_ix_i, \\\\\n", 23 | "\\frac{\\partial D(\\beta,\\beta_0)}{\\partial\\beta_0} &= -\\sum_{i\\in\\mathcal{M}} y_i.\n", 24 | "\\end{align}" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "### Stochastic gradient descent\n", 32 | "\n", 33 | "The algorithm in fact uses *stochastic gradient descent* to minimize this piecewise linear criterion. This means that rather than computing the sum of the gradient contributions of each observation followed by a step in the negative gradient direction, a step is taken after each observation is visited.\n", 34 | "\n", 35 | "Hence the misclassified observations are visited in some sequence, and the parameters $\\beta$ are updated via\n", 36 | "\n", 37 | "\\begin{equation}\n", 38 | "\\begin{pmatrix}\\beta \\\\ \\beta_0\\end{pmatrix}\n", 39 | "\\leftarrow\n", 40 | "\\begin{pmatrix}\\beta \\\\ \\beta_0\\end{pmatrix}\n", 41 | "+\n", 42 | "\\rho \\begin{pmatrix}y_ix_i \\\\ y_i\\end{pmatrix},\n", 43 | "\\end{equation}\n", 44 | "\n", 45 | "where $\\rho$ is the learning rate, which in this case can be taken to be $1$ WLOG.\n", 46 | "\n", 47 | "If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (Exercise 4.6). FIGURE 4.14 shows two solutions to a toy problem, each started at a different random guess." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 1, 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "name": "stdout", 57 | "output_type": "stream", 58 | "text": [ 59 | "Under construction ...\n" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "\"\"\"FIGURE 4.14. A toy example with two classes separable by a hyperplane.\n", 65 | "\n", 66 | "The orange line is the least squares solution, which misclassifies one of\n", 67 | "the training points. Also shown are two blue separating hyperplanes found\n", 68 | "by the perceptron learning algorithm with different random starts.\n", 69 | "\"\"\"\n", 70 | "print('Under construction ...')" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "### Issues\n", 78 | "\n", 79 | "There are a number of problems with this algorithm, summarized in Ripley (1996):\n", 80 | "* When the data are separable, there are many solutions, and which one is found depends on the starting values.\n", 81 | "* The \"finite\" number of steps can be very large. 
The smaller the gap, the longer the time to find it.\n", 82 | "* When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.\n", 83 | "\n", 84 | "The second problem can often be eliminated by seeking a hyperplane not in the original space, but in a much enlarged space obtained by creating many basis-function transformations of the original variables. This is analogous to driving the residuals in a polynomial regression problem down to zero by making the degree sufficiently large.\n", 85 | "\n", 86 | "Perfect separation cannot always be achieved: for example, observations from two different classes may share the same input. It may not be desirable either, since the resulting model is likely to be overfit and will not generalize well.\n", 87 | "\n", 88 | "A rather elegant solution to the first problem is to add additional constraints to the separating hyperplane." 89 | ] 90 | } 91 | ], 92 | "metadata": { 93 | "kernelspec": { 94 | "display_name": "Python 3", 95 | "language": "python", 96 | "name": "python3" 97 | }, 98 | "language_info": { 99 | "codemirror_mode": { 100 | "name": "ipython", 101 | "version": 3 102 | }, 103 | "file_extension": ".py", 104 | "mimetype": "text/x-python", 105 | "name": "python", 106 | "nbconvert_exporter": "python", 107 | "pygments_lexer": "ipython3", 108 | "version": "3.5.2" 109 | } 110 | }, 111 | "nbformat": 4, 112 | "nbformat_minor": 2 113 | } 114 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section1-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 5. Basis Expansions and Regularization\n", 8 | "# $\\S$ 5.1. Introduction" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Linearity is unrealistic, but necessary\n", 16 | "\n", 17 | "Linear regression, LDA, logistic regression and separating hyperplanes all rely on a linear model. It is extremely unlikely that the true function $f(X)$ is actually linear in $X$. In regression problems, $f(X) = \\text{E}(Y|X)$ will typically be nonlinear and nonadditive in $X$.\n", 18 | "\n", 19 | "Representing $f(X)$ by a linear model is usually a convenient, and sometimes necessary, approximation.\n", 20 | "* Convenient because a linear model is easy to interpret, and is the first-order Taylor approximation to $f(X)$.\n", 21 | "* Sometimes necessary because with $N$ small and/or $p$ large, a linear model might be all we are able to fit to the data without overfitting.\n", 22 | "\n", 23 | "Likewise in classification, a linear, Bayes-optimal decision boundary implies that some monotone transformation of $\\text{Pr}(Y=1|X)$ is linear in $X$. This is inevitably an approximation." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Beyond linearity\n", 31 | "\n", 32 | "The core idea in this chapter is to augment/replace the vector of inputs $X$ with additional variables, which are transformations of $X$, and then use linear models in this new space of derived input features.\n", 33 | "\n", 34 | "Denote by\n", 35 | "\n", 36 | "\\begin{equation}\n", 37 | "h_m(X): \\mathbb{R}^p \\mapsto \\mathbb{R}\n", 38 | "\\end{equation}\n", 39 | "\n", 40 | "the $m$th transformation of $X$ for $m=1,\\cdots,M$. 
We then model\n", 41 | "\n", 42 | "\\begin{equation}\n", 43 | "f(X) = \\sum_{m=1}^M \\beta_m h_m(X),\n", 44 | "\\end{equation}\n", 45 | "\n", 46 | "_a linear basis expansion_ in $X$.\n", 47 | "\n", 48 | "The beauty of this approach is that once the basis functions $h_m$ have been determined, the models are linear in these new variables, and the fitting proceeds as before." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "### Examples\n", 56 | "\n", 57 | "Some simple and widely used examples of the $h_m$ are the following.\n", 58 | "\n", 59 | "* $h_m(X)=X_m$, $m=1,\\cdots,p$ recovers the original linear model.\n", 60 | "* $h_m(X)=X_j^2$ or $h_m(X)=X_j X_k$ allows us to augment the inputs with polynomial terms to achieve higher-order Taylor expansions. \n", 61 | "Note, however, that the number of variables grows exponentially in the degree of the polynomial. A full quadratic model in $p$ variables requires $O(p^2)$ square and cross-product terms, or more generally $O(p^d)$ for a degree-$d$ polynomial.\n", 62 | "* $h_m(X)=\\log(X_j)$, $\\sqrt{X_j}$, $\\cdots$ permits other nonlinear transformations of single inputs. More generally one can use similar functions involving several inputs, such as $h_m(X)=\\|X\\|$.\n", 63 | "* $h_m(X)=I(L_m \\le X_k \\lt U_m)$, an indicator for a region of $X_k$. Breaking the range of $X_k$ up into $M_k$ such nonoverlapping regions results in a model with a piecewise constant contribution for $X_k$." 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### Preview\n", 71 | "\n", 72 | "* _Piecewise-polynomials_ and _splines_ allow for local polynomial representations.\n", 73 | "* _Wavelet_ bases produce a _dictionary_ $\\mathcal{D}$ consisting of typically a very large number $\\left|\\mathcal{D}\\right|$ of basis functions, far more than we can afford to fit to our data. \n", 74 | "Along with the dictionary we require a method for controlling the complexity of our model, using basis functions from the dictionary. There are three common approaches.\n", 75 | " * Restriction methods, where we decide beforehand to limit the class of functions. Additivity is an example, where we assume that our model has the form\n", 76 | "\\begin{equation}\n", 77 | "f(X) = \\sum_{j=1}^p f_j(X_j) = \\sum_{j=1}^p \\sum_{m=1}^{M_j} \\beta_{jm} h_{jm}(X_j).\n", 78 | "\\end{equation}\n", 79 | " The size of the model is limited by the number of basis functions $M_j$ used for each component function $f_j$.\n", 80 | " * Selection methods, which adaptively scan the dictionary and include only those basis functions $h_m$ that contribute significantly to the fit of the model. Here the variable selection techniques discussed in Chapter 3 are useful. The stagewise greedy approaches such as CART, MARS and boosting fall into this category as well.\n", 81 | " * Regularization methods where we use the entire dictionary but restrict the coefficients. Ridge regression is a simple example of a regularization approach, while the lasso is both a regularization and selection method."
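As a concrete illustration of the basis-expansion idea (not from the book; the simulated data and the cubic-polynomial choice of $h_m$ are assumptions), one can build the columns $h_m(X)$ explicitly and fit them by ordinary least squares, exactly as for any linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(100)

# Basis expansion: h_1(X)=1, h_2(X)=X, h_3(X)=X^2, h_4(X)=X^3.
H = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# The model is linear in the derived inputs, so least squares applies directly.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
f_hat = H @ beta          # fitted values of f(X) = sum_m beta_m h_m(X)
```

Replacing the polynomial columns with spline, indicator, or wavelet basis functions changes only how `H` is constructed, not the fitting step.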
82 | ] 83 | } 84 | ], 85 | "metadata": { 86 | "kernelspec": { 87 | "display_name": "Python 3", 88 | "language": "python", 89 | "name": "python3" 90 | }, 91 | "language_info": { 92 | "codemirror_mode": { 93 | "name": "ipython", 94 | "version": 3 95 | }, 96 | "file_extension": ".py", 97 | "mimetype": "text/x-python", 98 | "name": "python", 99 | "nbconvert_exporter": "python", 100 | "pygments_lexer": "ipython3", 101 | "version": "3.6.4" 102 | } 103 | }, 104 | "nbformat": 4, 105 | "nbformat_minor": 2 106 | } 107 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section2-2-example-south-african-heart-disease-continued.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 5.2.2. Example: South African Heart Disease (Continued)\n", 8 | "\n", 9 | "In $\\S$ 4.4.2 we fit linear logistic regression models to the South African heart disease data. Here we explore nonlinearities in the functions using natural splines.\n", 10 | "\n", 11 | "The functional form of the model is\n", 12 | "\n", 13 | "\\begin{equation}\n", 14 | "\\text{logit}\\left[ \\text{Pr}(\\textsf{chd}|X) \\right] = \\theta_0 + h_1(X_1)^T\\theta_1 + h_2(X_2)^T\\theta_2 + \\cdots + h_p(X_p)^T\\theta_p,\n", 15 | "\\end{equation}\n", 16 | "\n", 17 | "where each of the $\\theta_j$ are vectors of coefficients multiplying their associated vector of natural spline basis functions $h_j$." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "### Transformation in a whole\n", 25 | "\n", 26 | "We can combine all $p$ vectors of basis functions (and the constant term) into one big vector $h(X)$, and then the model is simply\n", 27 | "\n", 28 | "\\begin{equation}\n", 29 | "h(X)^T\\theta,\n", 30 | "\\end{equation}\n", 31 | "\n", 32 | "with total number of parameters\n", 33 | "\n", 34 | "\\begin{equation}\n", 35 | "\\text{df} = \\sum_{j=1}^p \\text{df}_j.\n", 36 | "\\end{equation}\n", 37 | "\n", 38 | "Each basis function is evaluated at each of the $N$ samples, resulting in a $N \\times \\text{df}$ basis matrix $\\mathbf{H}$. At this point the model is like any other linear logistic model, and the algorithms described in $\\S$ 4.4.1 apply." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "### Backward stepwise process\n", 46 | "\n", 47 | "We carried out a backward stepwise deletion process, dropping terms from this model while preserving the group structure of each term, rather than dropping one coefficient at a time. The AIC statistic ($\\S$ 7.5) was used to drop terms, and all the terms remaining in the final model would cause AIC to increase if deleted from the model (see TABLE 5.1).\n", 48 | "\n", 49 | "FIGURE 5.4 shows a plot of the final model selected by the stepwise regression. The functions displayed are\n", 50 | "\n", 51 | "\\begin{equation}\n", 52 | "\\hat{f}_j(X_j) = h_j(X_j)^T\\hat\\theta_j,\n", 53 | "\\end{equation}\n", 54 | "\n", 55 | "for each variable $X_j$. $\\mathbf{\\Sigma}$, the covariance matrix of $\\hat\\theta$, is estimated by\n", 56 | "\n", 57 | "\\begin{equation}\n", 58 | "\\hat{\\mathbf{\\Sigma}} = \\left( \\mathbf{H}^T\\mathbf{WH} \\right)^{-1},\n", 59 | "\\end{equation}\n", 60 | "\n", 61 | "where $\\mathbf{W}$ is the diagonal weight matrix from the logistic regression. 
Hence\n", 62 | "\n", 63 | "\\begin{equation}\n", 64 | "v_j(X_j) = \\text{Var}\\left( \\hat{f}_j(X_j) \\right) = h_j(X_j)^T \\hat{\\mathbf{\\Sigma}}_{jj} h_j(X_j)\n", 65 | "\\end{equation}\n", 66 | "\n", 67 | "is the pointwise variance function of $\\hat{f}_j$, where $\\text{Cov}(\\hat\\theta_j) = \\hat{\\mathbf{\\Sigma}}_{jj}$ is the appropriate sub-matrix of $\\hat{\\mathbf{\\Sigma}}$.\n", 68 | "\n", 69 | "The shaded region in each panel is defined by\n", 70 | "\n", 71 | "\\begin{equation}\n", 72 | "\\hat{f}_j(X_j) \\pm 2\\sqrt{v_j(X_j)}.\n", 73 | "\\end{equation}" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### What linear models could not excavate: nonlinearity\n", 81 | "\n", 82 | "The AIC statistic is slightly more generous than the likelihood-ratio test (deviance test). Both $\\textsf{sbp}$ and $\\textsf{obesity}$ are included in the model, while they are not in the linear model. FIGURE 5.4 explains why, since their contributions are inherently nonlinear.\n", 83 | "\n", 84 | "> These effects at first may come as a surprise, but a explanation lies in the nature of the retrospective data." 85 | ] 86 | } 87 | ], 88 | "metadata": { 89 | "kernelspec": { 90 | "display_name": "Python 3", 91 | "language": "python", 92 | "name": "python3" 93 | }, 94 | "language_info": { 95 | "codemirror_mode": { 96 | "name": "ipython", 97 | "version": 3 98 | }, 99 | "file_extension": ".py", 100 | "mimetype": "text/x-python", 101 | "name": "python", 102 | "nbconvert_exporter": "python", 103 | "pygments_lexer": "ipython3", 104 | "version": "3.6.4" 105 | } 106 | }, 107 | "nbformat": 4, 108 | "nbformat_minor": 2 109 | } 110 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section3-filtering-and-feature-extraction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 5.3. Filtering and Feature Extraction\n", 8 | "\n", 9 | "### Review of the phoneme example\n", 10 | "\n", 11 | "In the previous example, we constructed a $p \\times M$ basis matrix $\\mathbf{H}$, and then transformed our features $x$ into new features\n", 12 | "\n", 13 | "\\begin{equation}\n", 14 | "x^* = \\mathbf{H}^T x.\n", 15 | "\\end{equation}\n", 16 | "\n", 17 | "These filtered versions of the features were then used as inputs into a learning procedure: In the previous example, this was linear logistic regression." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "### Nonlinear or linear preprocessing of features\n", 25 | "\n", 26 | "Preprocessing of high-dimensional features is a very general and powerful method for improving the performance of a learning algorithm. The preprocessing need not be linear as it was above, but can be a general (nonlinear) function of the form\n", 27 | "\n", 28 | "\\begin{equation}\n", 29 | "x^* = g(x)\n", 30 | "\\end{equation}\n", 31 | "\n", 32 | "The derived features $x^*$ can then be used as inputs into any (linear of nonlinear) learning procedure." 
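A minimal sketch of this filtering step (the random basis matrix `H` and the downstream use are placeholders for illustration, not the phoneme example itself):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, M = 200, 256, 12

X = rng.standard_normal((N, p))    # raw high-dimensional features, one row per sample
H = rng.standard_normal((p, M))    # p x M basis/filter matrix (e.g. a spline basis)

X_star = X @ H                     # filtered features x* = H^T x, applied row-wise
# X_star can now be fed into any learning procedure, linear or nonlinear.
```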
33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "### Example: Wavelet and neural network\n", 40 | "\n", 41 | "For example, for signal or image recognition a popular approach is to first transform the raw features via a wavelet transform ($\\S$ 5.9) and then use the features as inputs into a neural network (Chapter 11).\n", 42 | "\n", 43 | "Wavelets are effective in capturing discrete jumps or edges, and the neural network is a powerful tool for constructing nonlinear functions of these features for predicting the target variable. By using domain knowledge to construct appropriate features, one can often improve upon a learning method that has only the raw features $x$ at its disposal." 44 | ] 45 | } 46 | ], 47 | "metadata": { 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "codemirror_mode": { 55 | "name": "ipython", 56 | "version": 3 57 | }, 58 | "file_extension": ".py", 59 | "mimetype": "text/x-python", 60 | "name": "python", 61 | "nbconvert_exporter": "python", 62 | "pygments_lexer": "ipython3", 63 | "version": "3.6.4" 64 | } 65 | }, 66 | "nbformat": 4, 67 | "nbformat_minor": 2 68 | } 69 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section4-0-smoothing-splines.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 5.4. Smoothing Splines\n", 8 | "\n", 9 | "Here we discuss a spline basis method that avoids the knot selection problem completely by using a maximal set of knots.\n", 10 | "\n", 11 | "> The complexity of the fit is controlled by regularization.\n", 12 | "\n", 13 | "Consider the following problem: Among all functions $f(x)$ with two continuous derivatives, find one that minimizes the penalized residual sum of squares\n", 14 | "\n", 15 | "\\begin{equation}\n", 16 | "\\text{RSS}(f, \\lambda) = \\sum_{i=1}^N \\left( y_i - f(x_i) \\right)^2 + \\lambda \\int \\left( f''(t)\\right)^2 dt,\n", 17 | "\\end{equation}\n", 18 | "\n", 19 | "where $\\lambda$ is a fixed _smoothing parameter_. The first term measures closeness to the data, while the second term penalizes curvature in the function, and $\\lambda$ establishes a tradeoff between the two.\n", 20 | "\n", 21 | "Consider the two special cases:\n", 22 | "1. $\\lambda = 0$: $f$ can be any function that interpolates the data.\n", 23 | "2. $\\lambda = \\infty$: the simple least squares line fit, since no second derivative can be tolerated.\n", 24 | "\n", 25 | "These vary from very rough to very smooth, and the hope is that $\\lambda \\in (0,\\infty)$ indexes an interesting class of functions in between." 
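The criterion is easy to evaluate numerically for any candidate fit; a small sketch (the grid representation of $f$ and the finite-difference approximation of $\int (f''(t))^2 dt$ are assumptions for illustration):

```python
import numpy as np

def penalized_rss(y, f_at_x, f_on_grid, dx, lam):
    """RSS(f, lambda): squared-error fit term plus a curvature penalty.

    f_at_x is f evaluated at the data points; f_on_grid is f on a uniform
    grid with spacing dx, used to approximate the roughness integral.
    """
    fit = np.sum((y - f_at_x) ** 2)
    f2 = np.diff(f_on_grid, n=2) / dx ** 2     # second differences ~ f''(t)
    penalty = np.sum(f2 ** 2) * dx             # ~ integral of f''(t)^2 dt
    return fit + lam * penalty
```

Evaluating this for a very rough interpolant and for a straight-line fit makes the two limiting cases $\lambda=0$ and $\lambda=\infty$ concrete.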
26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### The natural cubic spline as the minimizer\n", 33 | "\n", 34 | "The above $\\text{RSS}$ is defined on an infinite-dimensional function space -- in fact, a Sobolev space of functions for which the second term is defined.\n", 35 | "\n", 36 | "Remarkably, it can be shown that for the $\\text{RSS}$ there is an explicit, finite-dimensional, unique minimizer which is a natural cubic spline with knots at the unique values of the $x_i$, $i=1,\\cdots,N$ (Exercise 5.7).\n", 37 | "\n", 38 | "At face value it seems that the family is still over-parametrized, since there are as many as $N$ knots, which implies $N$ degrees of freedom. However, the penalty term translates to a penalty on the spline coefficients, which are shrunk some of the way toward the linear fit." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "### Computation\n", 46 | "\n", 47 | "Since the solution is a natural spline, we can write it as\n", 48 | "\n", 49 | "\\begin{equation}\n", 50 | "f(x) = \\sum_{j=1}^N N_j(x) \\theta_j,\n", 51 | "\\end{equation}\n", 52 | "\n", 53 | "where the $N_j(x)$ are an $N$-dimensional set of basis functions for representing this family of natural splines ($\\S$ 5.2.1 and Exercise 5.4). The criterion thus reduces to\n", 54 | "\n", 55 | "\\begin{equation}\n", 56 | "\\text{RSS}(\\theta, \\lambda) = (\\mathbf{y} - \\mathbf{N}\\theta)^T(\\mathbf{y} - \\mathbf{N}\\theta) + \\lambda\\theta^T\\mathbf{\\Omega}_N\\theta,\n", 57 | "\\end{equation}\n", 58 | "\n", 59 | "where\n", 60 | "* $\\lbrace\\mathbf{N}\\rbrace_{ij} = N_j(x_i)$ and \n", 61 | "* $\\lbrace\\mathbf{\\Omega}_N\\rbrace_{jk} = \\int N_j''(t)N_k''(t)dt$.\n", 62 | "\n", 63 | "The solution is easily seen to be\n", 64 | "\n", 65 | "\\begin{equation}\n", 66 | "\\hat\\theta = \\left( \\mathbf{N}^T\\mathbf{N} + \\lambda\\mathbf{\\Omega}_N \\right)^{-1} \\mathbf{N}^T \\mathbf{y},\n", 67 | "\\end{equation}\n", 68 | "\n", 69 | "a generalized ridge regression. The fitted smoothing spline is given by\n", 70 | "\n", 71 | "\\begin{equation}\n", 72 | "\\hat{f}(x) = \\sum_{j=1}^N N_j(x) \\hat\\theta_j.\n", 73 | "\\end{equation}\n", 74 | "\n", 75 | "See the Appendix of this chapter for efficient computational techniques for smoothing splines." 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 2, 81 | "metadata": {}, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "Under construction ...\n" 88 | ] 89 | } 90 | ], 91 | "source": [ 92 | "\"\"\"FIGURE 5.6. 
A smoothing spline to BMD data with fixed lambda ~= 0.00022\n", 93 | "This choice, corresponding to about 12 degrees of freedom, will be discussed\n", 94 | "in the next section.\"\"\"\n", 95 | "print('Under construction ...')" 96 | ] 97 | } 98 | ], 99 | "metadata": { 100 | "kernelspec": { 101 | "display_name": "Python 3", 102 | "language": "python", 103 | "name": "python3" 104 | }, 105 | "language_info": { 106 | "codemirror_mode": { 107 | "name": "ipython", 108 | "version": 3 109 | }, 110 | "file_extension": ".py", 111 | "mimetype": "text/x-python", 112 | "name": "python", 113 | "nbconvert_exporter": "python", 114 | "pygments_lexer": "ipython3", 115 | "version": "3.6.4" 116 | } 117 | }, 118 | "nbformat": 4, 119 | "nbformat_minor": 2 120 | } 121 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section5-0-automatic-selection-of-the-smoothing-parameters.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 5.5. Automatic Selection of the Smoothing Parameters\n", 8 | "\n", 9 | "> Selecting the placement and number of knots for regression splines can be a combinatorially complex task, and we will not discuss this further here.\n", 10 | "\n", 11 | "The smoothing parameters for regression splines encompass the degree of the splines, and the number and placement of the knots. For splines, we have only the penalty parameter $\\lambda$ to select, since the knots are at all the unique training $X$'s, and cubic degree is almost always used in practice.\n", 12 | "\n", 13 | "Selecting the placement and number of knots for regression splines can be a combinatorially complex task, unless some simplifications are enforced. The MARS procedure (in Chapter 9) uses a greedy algorithm with some additional approximations to achieve a practical compromise. We will not discuss this further here." 14 | ] 15 | } 16 | ], 17 | "metadata": { 18 | "kernelspec": { 19 | "display_name": "Python 3", 20 | "language": "python", 21 | "name": "python3" 22 | }, 23 | "language_info": { 24 | "codemirror_mode": { 25 | "name": "ipython", 26 | "version": 3 27 | }, 28 | "file_extension": ".py", 29 | "mimetype": "text/x-python", 30 | "name": "python", 31 | "nbconvert_exporter": "python", 32 | "pygments_lexer": "ipython3", 33 | "version": "3.6.4" 34 | } 35 | }, 36 | "nbformat": 4, 37 | "nbformat_minor": 2 38 | } 39 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section5-1-fixing-the-degrees-of-freedom.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 5.5. Automatic Selection of the Smoothing Parameters\n", 8 | "## $\\S$ 5.5.1. Fixing the Degrees of Freedom\n", 9 | "\n", 10 | "Since, for smoothing splines,\n", 11 | "\n", 12 | "\\begin{equation}\n", 13 | "\\text{df}_\\lambda = \\text{trace}(\\mathbf{S}_\\lambda)\n", 14 | "\\end{equation}\n", 15 | "\n", 16 | "is monotone in $\\lambda$, we can invert the relationship and specify $\\lambda$ by fixing $\\text{df}$." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "### Numerical inverse and its usage to model selection\n", 24 | "\n", 25 | "In practice this can be achieved by simple numerical methods. 
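One such numerical method is simple root finding on the monotone map $\lambda \mapsto \text{trace}(\mathbf{S}_\lambda)$; a sketch assuming the basis matrix `N_basis` and penalty matrix `Omega` from $\S$ 5.4 are already available, and that the target df lies between the values attained at the two bracket endpoints:

```python
import numpy as np
from scipy.optimize import brentq

def df_of_lambda(lam, N_basis, Omega):
    """trace(S_lambda) for the smoother S = N (N^T N + lam*Omega)^{-1} N^T."""
    A = N_basis.T @ N_basis + lam * Omega
    S = N_basis @ np.linalg.solve(A, N_basis.T)
    return np.trace(S)

def lambda_for_df(target_df, N_basis, Omega, lo=1e-8, hi=1e8):
    """Invert the monotone relationship lambda -> df_lambda by root finding."""
    return brentq(lambda lam: df_of_lambda(lam, N_basis, Omega) - target_df, lo, hi)
```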
So, for example, in $\\textsf{R}$ one can use $\\textsf{smooth.spline(x,y,df=6)}$ to specify the amount of smoothing. This encourages a more traditional mode of model selection, where we might try a couple of different values of $\\text{df}$, and select one based on approximate $F$-tests, residual plots and other more subjective criteria. Using $\\text{df}$ in this way provides a uniform approach to compare many different smoothing methods. It is particularly useful in _generalized additive models_ (Chapter 9), where several smoothing methods can be simultaneously used in one model." 26 | ] 27 | } 28 | ], 29 | "metadata": { 30 | "kernelspec": { 31 | "display_name": "Python 3", 32 | "language": "python", 33 | "name": "python3" 34 | }, 35 | "language_info": { 36 | "codemirror_mode": { 37 | "name": "ipython", 38 | "version": 3 39 | }, 40 | "file_extension": ".py", 41 | "mimetype": "text/x-python", 42 | "name": "python", 43 | "nbconvert_exporter": "python", 44 | "pygments_lexer": "ipython3", 45 | "version": "3.6.4" 46 | } 47 | }, 48 | "nbformat": 4, 49 | "nbformat_minor": 2 50 | } 51 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section5-2-the-biase-variance-tradeoff.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 5.5.2. The Bias-Variance Tradeoff\n", 8 | "\n", 9 | "FIGURE 5.9 shows the effect of the choice of $\\text{df}_\\lambda$ when using a smoothing spline on a simple example:\n", 10 | "\n", 11 | "\\begin{align}\n", 12 | "Y &= f(X) + \\epsilon \\\\\n", 13 | "f(X) &= \\frac{\\sin(12(X+0.2))}{X+0.2},\n", 14 | "\\end{align}\n", 15 | "\n", 16 | "with\n", 17 | "* $X\\sim U[0,1]$,\n", 18 | "* $\\epsilon\\sim N(0,1)$;\n", 19 | "* Our training sample consists of $N=100$ pairs of $x_i, y_i$, drawn independently from this model." 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "name": "stdout", 29 | "output_type": "stream", 30 | "text": [ 31 | "Under construction ...\n" 32 | ] 33 | } 34 | ], 35 | "source": [ 36 | "\"\"\"FIGURE 5.9. CV results and fitted splines for three different df's.\n", 37 | "\"\"\"\n", 38 | "print('Under construction ...')" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "### Computing bias and variance\n", 46 | "\n", 47 | "The yellow shaded region in the figure represents the pointwise standard error of $\\hat{f}_\\lambda$, e.g., we have shaded the region between \n", 48 | "\n", 49 | "\\begin{equation}\n", 50 | "\\hat{f}_\\lambda(x) \\pm 2 \\cdot \\text{se}(\\hat{f}_\\lambda(x)).\n", 51 | "\\end{equation}\n", 52 | "\n", 53 | "Since $\\hat{\\mathbf{f}} = \\mathbf{S}_\\lambda \\mathbf{y}$,\n", 54 | "\n", 55 | "\\begin{align}\n", 56 | "\\text{Cov}(\\hat{\\mathbf{f}}) &= \\mathbf{S}_\\lambda \\text{Cov}(\\mathbf{y}) \\mathbf{S}_\\lambda^T \\\\\n", 57 | "&= \\mathbf{S}_\\lambda \\mathbf{S}_\\lambda^T.\n", 58 | "\\end{align}\n", 59 | "\n", 60 | "The diagonal contains the pointwise variances at the training $x_i$. 
The bias is given by\n", 61 | "\n", 62 | "\\begin{align}\n", 63 | "\\text{Bias}(\\hat{\\mathbf{f}}) &= \\mathbf{f} - \\text{E}(\\hat{\\mathbf{f}}) \\\\\n", 64 | "&= \\mathbf{f} - \\mathbf{S}_\\lambda \\mathbf{f},\n", 65 | "\\end{align}\n", 66 | "\n", 67 | "where $\\mathbf{f}$ is the (unknown) vector of evaluations of the true $f$ at the training $X$'s.\n", 68 | "\n", 69 | "The expectations and variances are w.r.t. repeated draws of samples of size $N=100$ from the model $f$. In a similar fashion $\\text{Var}(\\hat{f}_\\lambda(x_0))$ and $\\text{Bias}(\\hat{f}_\\lambda(x_0))$ can be computed at any point $x_0$ (Exercise 5.10)." 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### Visual interpretation of bias-variance tradeoff\n", 77 | "\n", 78 | "The three fits displayed in FIGURE 5.9 give a visual demonstration of the bias-variance tradeoff associated with selecting the smoothing parameter.\n", 79 | "\n", 80 | "* $\\text{df}_\\lambda = 5$: The spline underfits, and clearly _trims down hills and fills in the valleys_. This leads to a bias that is most dramatic in regions of high curvature. The standard error band is very narrow, so we estimate a badly biased version of the true function with great reliability!\n", 81 | "* $\\text{df}_\\lambda = 9$: Here the fitted function is close to the true function, although a slight amount of bias seems evident. The variance has not increased appreciably.\n", 82 | "* $\\text{df}_\\lambda = 15$: The fitted function is somewhat wiggly, but close to the true function. The wiggliness also accounts for the increased width of the standard error bands -- the curve is starting to follow some individual points too closely.\n", 83 | "\n", 84 | "Note that in these figures we are seeing a single realization of data and hence fitted spline $\\hat{f}$ in each case, while the bias involves an expectation $\\text{E}(\\hat{f})$.\n", 85 | "\n", 86 | "The middle curve seems \"just right\", in that it has achieved a good compromise between bias and variance." 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "The integrated squared prediction error ($\\text{EPE}$) combines both bias and variance in a single summary:\n", 94 | "\n", 95 | "\\begin{align}\n", 96 | "\\text{EPE}(\\hat{f}_\\lambda) &= \\text{E}\\left( Y - \\hat{f}_\\lambda(X) \\right)^2 \\\\\n", 97 | "&= \\text{Var}(Y) + \\text{E}\\left( \\text{Bias}^2(\\hat{f}_\\lambda(X)) + \\text{Var}(\\hat{f}_\\lambda(X)) \\right) \\\\\n", 98 | "&= \\sigma^2 + \\text{MSE}(\\hat{f}_\\lambda).\n", 99 | "\\end{align}\n", 100 | "\n", 101 | "Note that this is averaged both over the training sample (giving rise to $\\hat{f}_\\lambda$), and the values of the (independently chosen) prediction points $(X,Y)$.\n", 102 | "\n", 103 | "$\\text{EPE}$ is a natural quantity of interest, and does create a tradeoff between bias and variance. The test error rate (blue points) in the top left panel of FIGURE 5.9 suggests that $\\text{df}=9$ is spot on!" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "### Estimation of EPE\n", 111 | "\n", 112 | "Since we don't know the true function, we do not have access to $\\text{EPE}$, and need an estimate. This topic is discussed in some detail in Chapter 7, and techniques such as $K$-fold cross-validation, $\\text{GCV}$ and $C_p$ are all in common use. 
In FIGURE 5.9 we include the $N$-fold (leave-one-out) cross-validation curve:\n", 113 | "\n", 114 | "\\begin{align}\n", 115 | "\\text{CV}(\\hat{f}_\\lambda) &= \\frac{1}{N} \\sum_{i=1}^N \\left( y_i - \\hat{f}_\\lambda^{(-i)}(x_i)\\right)^2 \\\\\n", 116 | "&= \\frac{1}{N} \\sum_{i=1}^N \\left( \\frac{y_i - \\hat{f}_\\lambda(x_i)}{1 - S_\\lambda(i,i)} \\right)^2,\n", 117 | "\\end{align}\n", 118 | "\n", 119 | "which can (remarkably) be computed for each value of $\\lambda$ from the original fitted values and the diagonal elements $S_\\lambda(i,i)$ of $\\mathbf{S}_\\lambda$ (Exercise 5.13).\n", 120 | "\n", 121 | "The $\\text{EPE}$ and $\\text{CV}$ curves have a similar shape, but the entire $\\text{CV}$ curve is above the $\\text{EPE}$ curve. For some realizations this is reversed, and everall the $\\text{CV}$ curve is approximately unbiased as an estimate of the $\\text{EPE}$ curve." 122 | ] 123 | } 124 | ], 125 | "metadata": { 126 | "kernelspec": { 127 | "display_name": "Python 3", 128 | "language": "python", 129 | "name": "python3" 130 | }, 131 | "language_info": { 132 | "codemirror_mode": { 133 | "name": "ipython", 134 | "version": 3 135 | }, 136 | "file_extension": ".py", 137 | "mimetype": "text/x-python", 138 | "name": "python", 139 | "nbconvert_exporter": "python", 140 | "pygments_lexer": "ipython3", 141 | "version": "3.6.4" 142 | } 143 | }, 144 | "nbformat": 4, 145 | "nbformat_minor": 2 146 | } 147 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section6-nonparametric-logistic-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 5.6. Nonparametric Logistic Regression\n", 8 | "\n", 9 | "The smoothing spline problem in $\\S$ 5.4,\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "\\text{RSS}(f, \\lambda) = \\sum_{i=1}^N \\left( y_i - f(x_i) \\right)^2 + \\lambda\\int \\left( f''(t) \\right)^2 dt,\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "is posed in a regression setting. It is typically straightforward to transfer this technology to other domains. Here we consider logistic regression with a single quantitative input $X$. Then the model is\n", 16 | "\n", 17 | "\\begin{equation}\n", 18 | "\\log \\frac{\\text{Pr}(Y=1|X=x)}{\\text{Pr}(Y=0|X=x)} = f(x),\n", 19 | "\\end{equation}\n", 20 | "\n", 21 | "which implies\n", 22 | "\n", 23 | "\\begin{equation}\n", 24 | "\\text{Pr}(Y=1|X=x) = \\frac{e^{f(x)}}{1+e^{f(x)}}.\n", 25 | "\\end{equation}\n", 26 | "\n", 27 | "Fitting $f(x)$ in a smooth fashion leads to a smooth estimate of the conditional probability $\\text{Pr}(Y=1|x)$, which can be used for classification or risk scoring." 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "### MLE\n", 35 | "\n", 36 | "We construct the penalized log-likelihood criterion\n", 37 | "\n", 38 | "\\begin{align}\n", 39 | "l(f;\\lambda) &= \\sum_{i=1}^N \\left[ y_i\\log{p(x_i)} + (1-y_i)\\log{(1-p(x_i))} \\right] - \\frac{\\lambda}{2} \\int \\left( f''(t) \\right)^2 dt \\\\\n", 40 | "&= \\sum_{i=1}^N \\left[ y_i f(x_i) - \\log{(1+e^{f(x_i)})} \\right] - \\frac{\\lambda}{2} \\int \\left( f''(t) \\right)^2 dt,\n", 41 | "\\end{align}\n", 42 | "\n", 43 | "where $p(x) = \\text{Pr}(Y=1|x)$. The first term is the log-likelihood on the binomial distribution (c.f. Chapter 4, page 120)." 
44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Iterative procedure using Newton-Raphson, again\n", 51 | "\n", 52 | "Arguments similar to those used in $\\S$ 5.4 show that the optimal $f$ is a finite-dimensional natural spline with knots at the unique values of $x$. This means that we can represent\n", 53 | "\n", 54 | "\\begin{equation}\n", 55 | "f(x) = \\sum_{j=1}^N N_j(x) \\theta_j.\n", 56 | "\\end{equation}\n", 57 | "\n", 58 | "We compute the first and second derivatives\n", 59 | "\n", 60 | "\\begin{align}\n", 61 | "\\frac{\\partial l(\\theta)}{\\partial\\theta} &= \\mathbf{N}^T(\\mathbf{y}-\\mathbf{p}) - \\lambda\\mathbf{\\Omega}\\theta, \\\\\n", 62 | "\\frac{\\partial^2 l(\\theta)}{\\partial\\theta\\partial\\theta^T} &= -\\mathbf{N}^T\\mathbf{WN} - \\lambda\\mathbf{\\Omega},\n", 63 | "\\end{align}\n", 64 | "\n", 65 | "where\n", 66 | "* $\\mathbf{p}$ is the $N$-vector with elements $p(x_i)$,\n", 67 | "* $\\mathbf{W}$ is a diagonal matrix of weights $p(x_i)(1-p(x_i))$.\n", 68 | "\n", 69 | "The first derivative is nonlinear in $\\theta$, so we need to use an iterative algorithm as in $\\S$ 4.4.1. Using Newton-Raphson as for linear logistic regression, the update equation can be written\n", 70 | "\n", 71 | "\\begin{align}\n", 72 | "\\theta^{\\text{new}} &= \\left( \\mathbf{N}^T\\mathbf{WN} + \\lambda\\mathbf{\\Omega} \\right)^{-1} \\mathbf{N}^T\\mathbf{W} \\left( \\mathbf{N}\\theta^{\\text{old}} + \\mathbf{W}^{-1}(\\mathbf{y}-\\mathbf{p}) \\right) \\\\\n", 73 | "&= \\left( \\mathbf{N}^T\\mathbf{WN} + \\lambda\\mathbf{\\Omega} \\right)^{-1} \\mathbf{N}^T\\mathbf{Wz}.\n", 74 | "\\end{align}\n", 75 | "\n", 76 | "We can also express this update in terms of the fitted values\n", 77 | "\n", 78 | "\\begin{align}\n", 79 | "\\mathbf{f}^{\\text{new}} &= \\mathbf{N} \\left( \\mathbf{N}^T\\mathbf{WN} + \\lambda\\mathbf{\\Omega} \\right)^{-1} \\mathbf{N}^T\\mathbf{W} \\left( \\mathbf{f}^{\\text{old}} + \\mathbf{W}^{-1}(\\mathbf{y}-\\mathbf{p}) \\right) \\\\\n", 80 | "&= \\mathbf{S}_{\\lambda,\\omega}\\mathbf{z}.\n", 81 | "\\end{align}" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### Comparison with regressions\n", 89 | "\n", 90 | "Referring back to the regression solution of the smoothing spline problem in $\\S$ 5.4,\n", 91 | "\n", 92 | "\\begin{align}\n", 93 | "\\hat\\theta &= \\left( \\mathbf{N}^T\\mathbf{N} + \\lambda\\mathbf{\\Omega}_N \\right)^{-1} \\mathbf{N}^T \\mathbf{y} \\\\\n", 94 | "\\hat{\\mathbf{f}} &= \\mathbf{N} \\left( \\mathbf{N}^T\\mathbf{N} + \\lambda\\mathbf{\\Omega}_N \\right)^{-1} \\mathbf{N}^T \\mathbf{y} \\\\\n", 95 | "&= \\mathbf{S}_\\lambda \\mathbf{y},\n", 96 | "\\end{align}\n", 97 | "\n", 98 | "we see that the update fits a weighted smoothing spline to the working response $\\mathbf{z}$ (Exercise 5.12).\n", 99 | "\n", 100 | "The form of $\\mathbf{f}^{\\text{new}}$ is suggestive. It is tempting to replace $\\mathbf{S}_{\\lambda,\\omega}$ by any nonparametric (weighted) regression operator, and obtain general families of nonparametric logistic regression models.\n", 101 | "\n", 102 | "Although here $x$ is one-dimensional, this procedure generalizes naturally to higher-dimensional $x$. These extensions are at the heart of _generalized additive models_, which we pursue in Chapter 9." 
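A compact sketch of the update above (the basis matrix `N_basis`, penalty matrix `Omega`, and fixed iteration count are assumptions; no safeguards such as step-halving or protection against weights $p(1-p)$ near zero are included):

```python
import numpy as np

def penalized_logistic_spline(N_basis, y, Omega, lam, n_iter=25):
    """Newton-Raphson / IRLS for the penalized binomial log-likelihood."""
    theta = np.zeros(N_basis.shape[1])
    for _ in range(n_iter):
        f = N_basis @ theta
        p = 1.0 / (1.0 + np.exp(-f))          # fitted probabilities p(x_i)
        W = p * (1.0 - p)                     # diagonal of the weight matrix
        z = f + (y - p) / W                   # working response
        A = N_basis.T @ (N_basis * W[:, None]) + lam * Omega   # N^T W N + lam*Omega
        theta = np.linalg.solve(A, N_basis.T @ (W * z))        # N^T W z
    return theta
```

Each pass is exactly a weighted smoothing-spline fit to the working response $\mathbf{z}$, as noted above.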
103 | ] 104 | } 105 | ], 106 | "metadata": { 107 | "kernelspec": { 108 | "display_name": "Python 3", 109 | "language": "python", 110 | "name": "python3" 111 | }, 112 | "language_info": { 113 | "codemirror_mode": { 114 | "name": "ipython", 115 | "version": 3 116 | }, 117 | "file_extension": ".py", 118 | "mimetype": "text/x-python", 119 | "name": "python", 120 | "nbconvert_exporter": "python", 121 | "pygments_lexer": "ipython3", 122 | "version": "3.6.4" 123 | } 124 | }, 125 | "nbformat": 4, 126 | "nbformat_minor": 2 127 | } 128 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section8-0-regularization-and-reproducing-kernel-hilbert-space.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 5.8. Regularization and Reproducing Kernel Hilbert Spaces\n", 8 | "\n", 9 | "### Formulation\n", 10 | "\n", 11 | "A general class of regularization problems has the form\n", 12 | "\n", 13 | "\\begin{equation}\n", 14 | "\\min_{f\\in\\mathcal{H}} \\left[ \\sum_{i=1}^N L(y_i, f(x_i)) + \\lambda J(f) \\right],\n", 15 | "\\end{equation}\n", 16 | "\n", 17 | "where\n", 18 | "* $L(y, f(x))$ is a loss function,\n", 19 | "* $J(f)$ is a penalty functional, and\n", 20 | "* $\\mathcal{H}$ is a space of functions on which $J(f)$ is defined.\n", 21 | "\n", 22 | "Girosi et al. (1995) describe quite general penalty functionals of the form\n", 23 | "\n", 24 | "\\begin{equation}\n", 25 | "J(f) = \\int_{\\mathbb{R}^d} \\frac{|\\tilde{f}(s)|^2}{\\tilde{G}(s)} ds,\n", 26 | "\\end{equation}\n", 27 | "\n", 28 | "where\n", 29 | "* $\\tilde{f}$ denotes the Fourier transform of $f$,\n", 30 | "* $\\tilde{G}$ is some positive function that $\\tilde{G} \\rightarrow 0$ as $\\|s\\| \\rightarrow \\infty$, and\n", 31 | "* the idea is that $1/\\tilde{G}$ increases the penalty for high-frequency components of $f$." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "### Solution\n", 39 | "\n", 40 | "Under some additional assumptions they show that the solutions have the form\n", 41 | "\n", 42 | "\\begin{equation}\n", 43 | "f(X) = \\sum_{k=1}^K \\alpha_k\\phi_k(X) + \\sum_{i=1}^N \\theta_i G(X-x_i),\n", 44 | "\\end{equation}\n", 45 | "\n", 46 | "where\n", 47 | "* $\\phi_k$ span the null space of the penalty functional $J$, \n", 48 | "* $G$ is the inverse Fourier transform of $\\tilde{G}$.\n", 49 | "\n", 50 | "Smoothing splines and thin-plate splines fall into this framework.\n", 51 | "\n", 52 | "> The remarkable feature of this solution is that while the criterion is defined over an infinite-dimensional space, the solution is finite-dimensional." 
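A minimal sketch in the spirit of this expansion, using a Gaussian kernel for $G$, dropping the null-space terms $\phi_k$, and fitting $\theta$ with a simple ridge-style criterion (all of these choices are assumptions for illustration, not the general solution of Girosi et al.):

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Gaussian kernel matrix G(a_i - b_j) between two sets of points."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * width ** 2))

def fit_kernel_expansion(X, y, lam=1e-2, width=1.0):
    """Coefficients theta_i in f(x) = sum_i theta_i G(x - x_i)."""
    G = gaussian_kernel(X, X, width)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict_kernel_expansion(X_new, X_train, theta, width=1.0):
    return gaussian_kernel(X_new, X_train, width) @ theta
```

The key point carries over: although the function class is infinite-dimensional, the fitted $f$ is a finite sum over the $N$ training points.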
53 | ] 54 | } 55 | ], 56 | "metadata": { 57 | "kernelspec": { 58 | "display_name": "Python 3", 59 | "language": "python", 60 | "name": "python3" 61 | }, 62 | "language_info": { 63 | "codemirror_mode": { 64 | "name": "ipython", 65 | "version": 3 66 | }, 67 | "file_extension": ".py", 68 | "mimetype": "text/x-python", 69 | "name": "python", 70 | "nbconvert_exporter": "python", 71 | "pygments_lexer": "ipython3", 72 | "version": "3.6.4" 73 | } 74 | }, 75 | "nbformat": 4, 76 | "nbformat_minor": 2 77 | } 78 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section9-0-wavelet-smoothing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 5.9. Wavelet Smoothing\n", 8 | "\n", 9 | "We have seen two different modes of operation with dictionaries of basis function.\n", 10 | "* With regression splines, we select a subset of the bases, using either subject-matter knowledge, or else automatically. The more adaptive procedures such as MARS (Chapter 9) can capture both smooth and non-smooth behaviour.\n", 11 | "* With smooth splines, we use a complete basis, but then shrink the coefficients toward smoothness." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "### Sparse representation\n", 19 | "\n", 20 | "Wavelets typically use a complete orthonormal basis to represent functions, but then shrink and select the coefficients toward a _sparse_ representation. Just as a smooth function can be represented by a few spline basis functions, a mostly flat function with a few isolated bumps can be represented with a few (bumpy) basis function." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### Time and frequency localization\n", 28 | "\n", 29 | "Wavelets bases are very popular in signal processing and compression, since they are able to represent both smooth and/or locally bumpy functions in an efficient way -- a phenomenon dubbed _time and frequency localization_. In contrast, the traditional Fourier basis allows only frequency localization." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Introduction\n", 37 | "\n", 38 | "Before we give details, let's look at the Haar wavelets in the left panel of FIGURE 5.16 to get an intuitive idea of how wavelet smoothing works.\n", 39 | "\n", 40 | "The vertical axis indicates the scale (frequency) of the wavelets, from low scale at the bottom to high scale at the top. At each scale the wavelets are \"packed in\" side-by-side to completely fill the time axis: We have only shown a selected subset.\n", 41 | "\n", 42 | "Wavelet smoothing fits the coefficients for this basis by least squares, and then thresholds (discards, filters) the smaller coefficients. Since there are many basis functions at each scale, it can use bases where it needs them and discard the unnecessary ones, to achieve time and frequency localization. The Haar wavelets are simple to understand, but not smooth enought for most purposes. The _symmlet_ wavelets in the right panel of FIGURE 5.16 have the same orthonormal properties, but are smoother." 
43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "### Nuclear magnetic resonance (NMR) data\n", 50 | "\n", 51 | "FIGURE 5.17 displays an NMR signal, which appears to be composed of\n", 52 | "* smooth components and\n", 53 | "* isolated spikes,\n", 54 | "* plus some noise.\n", 55 | "\n", 56 | "The wavelet transform, using a symmlet basis, is shown in the lower left panel. The wavelet coefficients are arranged in rows, from lowest scale at the bottom to highest scale at the top.\n", 57 | "\n", 58 | "The bottom right panel shows the wavelet coefficients after thresholding. The thresholding procedure is the same soft-thresholding rule that arises in the lasso procedure for linear regression ($\\S$ 3.4.2).\n", 59 | "\n", 60 | "Notice that many of the smaller coefficients have been set to zero. The green curve in the top panel shows the back-transform of the thresholded coefficients: This is the smoothed version of the orignal signal." 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 1, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "name": "stdout", 70 | "output_type": "stream", 71 | "text": [ 72 | "Data not found on the official ESL page ):\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "\"\"\"FIGURE 5.17\"\"\"\n", 78 | "print('Data not found on the official ESL page ):')" 79 | ] 80 | } 81 | ], 82 | "metadata": { 83 | "kernelspec": { 84 | "display_name": "Python 3", 85 | "language": "python", 86 | "name": "python3" 87 | }, 88 | "language_info": { 89 | "codemirror_mode": { 90 | "name": "ipython", 91 | "version": 3 92 | }, 93 | "file_extension": ".py", 94 | "mimetype": "text/x-python", 95 | "name": "python", 96 | "nbconvert_exporter": "python", 97 | "pygments_lexer": "ipython3", 98 | "version": "3.6.4" 99 | } 100 | }, 101 | "nbformat": 4, 102 | "nbformat_minor": 2 103 | } 104 | -------------------------------------------------------------------------------- /chapter05-basis-expansions-and-regularization/section9-2-adaptive-wavelet-filtering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 5.9.2. Adaptive Wavelet Filtering\n", 8 | "\n", 9 | "Wavelets are particular useful when the data are measured on a uniform lattice, such as a discretized signal, image, or a time series. We will focus on the one-dimensional case, and having $N=2^J$ lattice-points is convenient.\n", 10 | "\n", 11 | "Suppose\n", 12 | "* $\\mathbf{y}$ is the response vector,\n", 13 | "* $\\mathbf{W}$ is the $N \\times N$ orthonormal wavelet basis matrix evaluated at the $N$ uniformly spaced observations.\n", 14 | "\n", 15 | "Then\n", 16 | "\n", 17 | "\\begin{equation}\n", 18 | "\\mathbf{y}^* = \\mathbf{W}^T\\mathbf{y}\n", 19 | "\\end{equation}\n", 20 | "\n", 21 | "is called the _wavelet transform_ of $\\mathbf{y}$ (and is the full least squares regression coefficients)." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "A popular method for adaptive wavelet fitting is known as _SURE shrinkage_ (Stein Unbiased Risk Estimation, Donoho and Johnstone, 1994). 
We start with the criterion\n", 29 | "\n", 30 | "\\begin{equation}\n", 31 | "\\min_{\\boldsymbol\\theta} \\|\\mathbf{y} - \\mathbf{W}\\boldsymbol\\theta\\|_2^2 + 2\\lambda\\|\\boldsymbol\\theta\\|_1,\n", 32 | "\\end{equation}\n", 33 | "\n", 34 | "which is the same as the lasso criterion in Chapter 3.\n", 35 | "\n", 36 | "Because $\\mathbf{W}$ is orthonormal, this leads to the simple solution:\n", 37 | "\n", 38 | "\\begin{equation}\n", 39 | "\\hat\\theta_j = \\text{sign}(y_j^*)(|y_j^*|-\\lambda)_+.\n", 40 | "\\end{equation}\n", 41 | "\n", 42 | "The least squares coefficients are translated toward zero, and truncated at zero. The fitted function (vector) is then given by the _inverse wavelet transform_\n", 43 | "\n", 44 | "\\begin{equation}\n", 45 | "\\hat{\\mathbf{f}} = \\mathbf{W}\\hat{\\boldsymbol\\theta}\n", 46 | "\\end{equation}" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Choice of $\\lambda$\n", 54 | "\n", 55 | "A simple choice of $\\lambda$ is\n", 56 | "\n", 57 | "\\begin{equation}\n", 58 | "\\lambda = \\sigma\\sqrt{2\\log N},\n", 59 | "\\end{equation}\n", 60 | "\n", 61 | "where $\\sigma$ is an estimate of the standard deviation of the noise.\n", 62 | "\n", 63 | "#### Motivation for this choice\n", 64 | "\n", 65 | "Since $\\mathbf{W}$ is an orthonormal transformation, if the elements of $\\mathbf{y}$ are white noise (independent Gaussian variates with mean 0 and variance $\\sigma^2$), then so are $\\mathbf{y}^*$. Furthermore if random variables $Z_1, Z_2, \\cdots, Z_N$ are white noise, the expected maximum $|Z_j|$ for $j=1,\\cdots,N$ is approximately $\\sigma\\sqrt{2\\log N}$. Hence all coefficients below $\\sigma\\sqrt{2\\log N}$ are likely to be noise and are set to zero.\n", 66 | "\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Choice of $\\mathbf{W}$\n", 74 | "\n", 75 | "The space $\\mathbf{W}$ could be any basis of orthonormal functions: Polynomials, natural splines or cosinusoids. What makes wavelets special is the particular form of basis functions used, which allows for a representation _localized in time and in frequency_." 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "### NMR signal revisited" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### Similarity between SURE and smoothing spline criteria\n", 90 | "\n", 91 | "* Both are hierarchically structured from coarse to fine detail, although wavelets are also localized in time within each resolution level.\n", 92 | "* The splines build in a basis toward smooth functions by imposing differential shrinking constants $d_k$. Early version of SURE shrinkage treated all scales equally.\n", 93 | "* The spline $L_2$ penalty cause pure shrinkage, while the SURE $L_1$ penalty does shrinkage and selection.\n", 94 | "\n", 95 | "More generally smoothing splines achieve compression of the original signal by imposing smoothness, while wavelets impose sparsity. FIGURE 5.19 compares a wavelet fit (using SURE shrinkage) to a smoothing spline fit (using corss-validation) on two examples different in nature.\n", 96 | "\n", 97 | "For the NMR data in the upper panel, the smoothing spline introducees detail everywhere in order to capture the detail in the isolated spike; the wavelet fit nicely localizes the spike. In the lower panel, the true function is smooth, and the noise is relatively high. 
The wavelet fit has let in some additional and unnecessary wiggles -- a price it pays in variance for the additional adaptivity." 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 1, 103 | "metadata": {}, 104 | "outputs": [ 105 | { 106 | "name": "stdout", 107 | "output_type": "stream", 108 | "text": [ 109 | "Under construction ...\n" 110 | ] 111 | } 112 | ], 113 | "source": [ 114 | "\"\"\"FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two examples.\n", 115 | "Each panel compares the SURE-shrunk wavelet fit to the cross-validated smoothing spline fit.\"\"\"\n", 116 | "print('Under construction ...')" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "### Computational aspects\n", 124 | "\n", 125 | "The wavelet transform is not performed by matrix multiplication as in\n", 126 | "\n", 127 | "\\begin{equation}\n", 128 | "\\mathbf{y}^* = \\mathbf{W}^T\\mathbf{y}.\n", 129 | "\\end{equation}\n", 130 | "\n", 131 | "In fact, using clever pyramidal schemes $\\mathbf{y}^*$ can be obtained in $O(N)$ computations, which is even faster than the $N\\log(N)$ of the FFT. It is easy to see for the Haar basis (Exercise 5.19). Likewise, the inverse wavelet transform $\\mathbf{W}\\hat{\\boldsymbol\\theta}$ is also $O(N)$." 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "This has been a very brief glimpse of this vast and growing field. There is a very large mathematical and computational base built on wavelets. Modern image compression is often performed using two-dimensional wavelet representations." 139 | ] 140 | } 141 | ], 142 | "metadata": { 143 | "kernelspec": { 144 | "display_name": "Python 3", 145 | "language": "python", 146 | "name": "python3" 147 | }, 148 | "language_info": { 149 | "codemirror_mode": { 150 | "name": "ipython", 151 | "version": 3 152 | }, 153 | "file_extension": ".py", 154 | "mimetype": "text/x-python", 155 | "name": "python", 156 | "nbconvert_exporter": "python", 157 | "pygments_lexer": "ipython3", 158 | "version": "3.6.4" 159 | } 160 | }, 161 | "nbformat": 4, 162 | "nbformat_minor": 2 163 | } 164 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section0-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 6. Kernel Smoothing Methods\n", 8 | "\n", 9 | "In this chapter we describe a class of regression techniques that achieve flexibility in estimating the regression function $f(X)$ over the domain $\\mathbb{R}^p$ by fitting a different but simple model separately at each query point $x_0$. This is done for some neighborhood of the target point $x_0$ to fit the simple model, and in such a way that the resulting estimated function $\\hat{f}(X)$ is smooth in $\\mathbb{R}^p$.\n", 10 | "\n", 11 | "This localization is achieved via a weighting function or _kernel_ $K_\\lambda(x_0,x_i)$, which assigns a weight to $x_i$ based on its distance from $x_0$. The kernels $K_\\lambda$ are typically indexed by a paramter $\\lambda$ that dictates the width of the neighborhood. These _memory-based_ methods require in principle little or no training. The only paramter that needs to be determined from the training data is $\\lambda$. 
The model, however, is the entire training data set.\n", 12 | "\n", 13 | "We also discuss more general classes of kernel-based techniques, which tie in with structured methods in other chapters, and are useful for density estimation and classficiation." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### Caution, don't confuse!\n", 21 | "\n", 22 | "> In this chapter kernels are mostly used as a device for localization.\n", 23 | "\n", 24 | "The techniques in this chapter should not be confused with those associated with the more recent usage of the phrase \"kernel methods\". In this chapter kernels are mostly used as a device for localization. We discuss kernel methods in $\\S$ 5.8, 14.5.4, 18.5, and Chapter 12; in those contexts the kernel computes an inner product in a high-dimensional (implicit) feature space, and is used for regularized nonlinear modeling. We make some connections to the methodology in this chapter at the end of $\\S$ 6.7." 25 | ] 26 | } 27 | ], 28 | "metadata": { 29 | "kernelspec": { 30 | "display_name": "Python 3", 31 | "language": "python", 32 | "name": "python3" 33 | }, 34 | "language_info": { 35 | "codemirror_mode": { 36 | "name": "ipython", 37 | "version": 3 38 | }, 39 | "file_extension": ".py", 40 | "mimetype": "text/x-python", 41 | "name": "python", 42 | "nbconvert_exporter": "python", 43 | "pygments_lexer": "ipython3", 44 | "version": "3.6.4" 45 | } 46 | }, 47 | "nbformat": 4, 48 | "nbformat_minor": 2 49 | } 50 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section1-1-local-linear-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 6.1.1. Local Linear Regression\n", 8 | "\n", 9 | "The smooth kernel fit still has a problems. Locally-weighted averages can be badly biased on the boundaries of the domain, because of the asymmetry of the kernel in that region, as exhibited in FIGURE 6.3 (left panel).\n", 10 | "\n", 11 | "By fitting straight lines rather than constants locally, we can remove this bias exactly to first order; see the right panel of FIGURE 6.3. Actually, this bias can be present in the interior of the domain as well, if the $X$ values are not equally spaced (for the same reasons, but usually less severe). Again locally weighted linear regression will make a first-order correction." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "### Formulation\n", 19 | "\n", 20 | "Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:\n", 21 | "\n", 22 | "\\begin{equation}\n", 23 | "\\min_{\\alpha(x_0),\\beta(x_0)} \\sum_{i=1}^N K_\\lambda(x_0,x_i) \\left( y_i - \\alpha(x_0) - \\beta(x_0)x_i \\right)^2.\n", 24 | "\\end{equation}\n", 25 | "\n", 26 | "The estimate is then\n", 27 | "\n", 28 | "\\begin{equation}\n", 29 | "\\hat{f}(x_0) = \\hat\\alpha(x_0) + \\hat\\beta(x_0).\n", 30 | "\\end{equation}\n", 31 | "\n", 32 | "Notice that although we fit an entire linear model to the data in the region, we only use it to evaluate the fit at the single point $x_0$." 
33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "### Matrix formulation and equivalent kernel\n", 40 | "\n", 41 | "Define\n", 42 | "* the vector-valued function $b(x)^T = (1, x)$,\n", 43 | "* the $N\\times2$ regression matrix $\\mathbf{B}$ with $i$th row $b(x_i)^T$, and\n", 44 | "* the $N\\times N$ diagonal matrix $\\mathbf{W}(x_0)$ with $i$th diagonal element $K_\\lambda(x_0,x_i)$.\n", 45 | "\n", 46 | "Then\n", 47 | "\n", 48 | "\\begin{align}\n", 49 | "\\hat{f}(x_0) &= b(x_0)^T \\left( \\mathbf{B}^T\\mathbf{W}(x_0)\\mathbf{B} \\right)^{-1} \\mathbf{B}^T \\mathbf{W}(x_0) \\mathbf{y} \\\\\n", 50 | "&= \\sum_{i=1}^N l_i(x_0)y_i.\n", 51 | "\\end{align}\n", 52 | "\n", 53 | "Note that $l_i(x_0)$ do not involve $\\mathbf{y}$ and thus the estimate is _linear_ in $y_i$. These weights $l_i(x_0)$ combine the weighting kernel $K_\\lambda(x_0,\\cdot)$ and the least squares operations, and are sometimes referred to as the _equivalent kernel_.\n", 54 | "\n", 55 | "FIGURE 6.4 illustrates the effect of local linear regression on the equivalent kernel." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 1, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "Under construction ...\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "\"\"\"FIGURE 6.4. Equivalent kernel li(x0) for local regression\"\"\"\n", 73 | "print('Under construction ...')" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### Automatic kernel carpentry\n", 81 | "\n", 82 | "Historically, the bias in the Nadaraya-Watson and other local average kernel methods were corrected by modifying the kernel. These modifications were based on theoretical asymptotic MSE considerations, and besides being tedious to implement, are only approximate for finite sample sizes.\n", 83 | "\n", 84 | "Local linear regression _automatically_ modfies the kernel to correct the bias _exactly_ to first order, a phenomenon dubbed as _automatic kernel carpentry_.\n", 85 | "\n", 86 | "Consider the following expansion for $\\text{E}\\hat{f}(x_0)$, using the linearity of local regression and a series expansion of the true function $f$ around $x_0$,\n", 87 | "\n", 88 | "\\begin{align}\n", 89 | "\\text{E}\\hat{f}(x_0) &= \\sum_{i=1}^N l_i(x_0)f(x_i) \\\\\n", 90 | "&= f(x_0)\\sum_{i=1}^N l_i(x_0) + f'(x_0)\\sum_{i=1}^N (x_i-x_0)l_i(x_0) + \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R,\n", 91 | "\\end{align}\n", 92 | "\n", 93 | "where the remainder term $R$ involves third- and higher-order derivatives of $f$, and is typically small under suitable smoothness assumptions. 
It can be shown (Exercise 6.2) that for local linear regression,\n", 94 | "\n", 95 | "\\begin{equation}\n", 96 | "\\sum_{i=1}^N l_i(x_0) = 1 \\text{ and } \\sum_{i=1}^N (x_i-x_0)l_i(x_0) = 0.\n", 97 | "\\end{equation}\n", 98 | "\n", 99 | "Hence\n", 100 | "\n", 101 | "\\begin{align}\n", 102 | "\\text{E}\\hat{f}(x_0) &= f(x_0)\\sum_{i=1}^N l_i(x_0) + f'(x_0)\\sum_{i=1}^N (x_i-x_0)l_i(x_0) + \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R \\\\\n", 103 | "&= f(x_0) + \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R,\n", 104 | "\\end{align}\n", 105 | "\n", 106 | "and the bias\n", 107 | "\n", 108 | "\\begin{equation}\n", 109 | "\\text{E}\\hat{f}(x_0) - f(x_0) = \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R.\n", 110 | "\\end{equation}\n", 111 | "\n", 112 | "We see that it depends only on quadratic and higher-order terms in the expansion of $f$." 113 | ] 114 | } 115 | ], 116 | "metadata": { 117 | "kernelspec": { 118 | "display_name": "Python 3", 119 | "language": "python", 120 | "name": "python3" 121 | }, 122 | "language_info": { 123 | "codemirror_mode": { 124 | "name": "ipython", 125 | "version": 3 126 | }, 127 | "file_extension": ".py", 128 | "mimetype": "text/x-python", 129 | "name": "python", 130 | "nbconvert_exporter": "python", 131 | "pygments_lexer": "ipython3", 132 | "version": "3.6.4" 133 | } 134 | }, 135 | "nbformat": 4, 136 | "nbformat_minor": 2 137 | } 138 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section1-2-local-polynomial-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 6.1.2. Local Polynomial Regression\n", 8 | "\n", 9 | "Why stop at local linear fit? We can fit local polynomial fits of any degree $d$,\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "\\min_{\\alpha(x_0),\\beta_j(x_0), j=1,\\cdots,d} \\sum_{i=1}^N K_\\lambda(x_0,x_i) \\left( y_i - \\alpha(x_0) - \\sum_{j=1}^d \\beta_j(x_0)x_i^j \\right)^2\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "with solution\n", 16 | "\n", 17 | "\\begin{equation}\n", 18 | "\\hat{f}(x_0) = \\hat\\alpha(x_0) + \\sum_{j=1}^N \\hat\\beta(x_0)x_o^j.\n", 19 | "\\end{equation}\n", 20 | "\n", 21 | "In fact, the expansion shown in the previous section will tell us that the bias will only have components of degree $d+1$ and higher (Exercise 6.2).\n", 22 | "\n", 23 | "FIGURE 6.5 illustrates local quadratic regression. Local linear fits tend to be biased in regions of curvature of the true function, a phenomenon referred to as _trimming the hills_ and _filling the valleys_. Local quadratic regression is generally able to correct this bias." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Bias-variance tradeoff, again\n", 31 | "\n", 32 | "There is of course a price to be paid for this bias reduction, and this is increased variance. The fit in the right panel of FIGURE 6.5 is slightly more wiggly, especially in the tails.\n", 33 | "\n", 34 | "Assume the model\n", 35 | "\n", 36 | "\\begin{equation}\n", 37 | "y_i = f(x_i) + \\epsilon_i,\n", 38 | "\\end{equation}\n", 39 | "\n", 40 | "with $\\epsilon_i \\sim^{\\text{iid}} (0, \\sigma^2)$. 
Then\n", 41 | "\n", 42 | "\\begin{equation}\n", 43 | "\\text{Var}(\\hat{f}(x_0)) = \\sigma^2 \\|l(x_0)\\|^2,\n", 44 | "\\end{equation}\n", 45 | "\n", 46 | "where $l(x_0)$ is the vector of equivalent kernel weights at $x_0$.\n", 47 | "\n", 48 | "It can be shown (Exercise 6.3) that $\\|l(x_0)\\|$ increases with $d$, and so there is a bias-variance tradeoff in selecting the polynomial degree.\n", 49 | "\n", 50 | "FIGURE 6.6 illustrates these variance curves for degree zero, one and two local polynomials. To summarize some collected wisdom on this issue:\n", 51 | "\n", 52 | "* Local linear fits can help bias dramatically at the boundaries at a modest cost in variance. Local quadratic fits do little at the boundaries for bias, but increase the variance a lot.\n", 53 | "* Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.\n", 54 | "* Asymptotic analysis suggest that local polynomials of odd degree dominate those of even degree. This is largely due to the fact that asymptotically the MSE is dominated by boundary effects.\n", 55 | "\n", 56 | "While it may be helpful to tinker, and move from local linear fits at the boundary to local quadratic fits in the interior, we do not recommend such strategies. Usually the application will dictate the degree of the fit. For example, if we are interested in extrapolation, then the boundary is of more interest and local linear fits are probably more reliable." 57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "kernelspec": { 62 | "display_name": "Python 3", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "codemirror_mode": { 68 | "name": "ipython", 69 | "version": 3 70 | }, 71 | "file_extension": ".py", 72 | "mimetype": "text/x-python", 73 | "name": "python", 74 | "nbconvert_exporter": "python", 75 | "pygments_lexer": "ipython3", 76 | "version": "3.6.4" 77 | } 78 | }, 79 | "nbformat": 4, 80 | "nbformat_minor": 2 81 | } 82 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section2-selecting-the-width-of-the-kernel.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 6.2. Selecting the Width of the Kernel\n", 8 | "\n", 9 | "In each of the kernels $K_\\lambda$, $\\lambda$ is a parameter that controls its width:\n", 10 | "\n", 11 | "* For the Epanechnikov or tri-cube kernel with metric width, $\\lambda$ is the radius of the support region.\n", 12 | "* For the Gaussian kernel, $\\lambda$ is the standard deviation.\n", 13 | "* $\\lambda$ is the number $k$ of nearest neighbors in $k$-nearest neighborhoods, often expressed as a fraction or _span_ $k/N$ of the total training sample." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### Bias-variance tradeoff, again and again\n", 21 | "\n", 22 | "There is a natural bias-variance tradeoff as we change the width of the averaging window, which is most explicit for local averages:\n", 23 | "\n", 24 | "* If the window is narrow, $\\hat{f}(x_0)$ is an average of a small number of $y_i$ close to $x_0$, and its variance will be relatively large -- close to that of an individual $y_i$. 
The bias will tend to be small, again because each of the $\\text{E}(y_i) = f(x_i)$ should be close to $f(x_0)$.\n", 25 | "* If the window is wide, the variance of $\\hat{f}(x_0)$ will be small relative to the variance of any $y_i$, because of the effects of averaging. The bias will be higher, because we are now using observations $x_i$ further from $x_0$, and there is no quarantee that $f(x_i)$ will be close to $f(x_0)$.\n", 26 | "\n", 27 | "Similar arguments apply to local regression estimates, say local linear:\n", 28 | "* As the width goes to zero, the estimates approach a piecewise-linear function that interpolates the training data;\n", 29 | "* as the width gets infinitely large, the fit approaches the global linear least-squares fit to the data." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "The discussion in Chapter 5 on selecting the regularization parameter for smoothing splines applies here, and will not be repeated.\n", 37 | "\n", 38 | "Local regression smoothers are linear estimators; the smoother matrix in\n", 39 | "\n", 40 | "\\begin{equation}\n", 41 | "\\hat{\\mathbf{f}} = \\mathbf{S}_\\lambda\\mathbf{y}\n", 42 | "\\end{equation}\n", 43 | "\n", 44 | "is built up from the equivalent kernels ($\\S$ 6.1.1), and has $ij$th entry $\\{\\mathbf{S}_\\lambda\\}_{ij} = l_i(x_j)$.\n", 45 | "\n", 46 | "Leave-one-out cross-validation is particularly simple (Exercise 6.7), as is generalized cross-validation, $C_p$ (Exercise 6.10), and $k$-fold cross-validation. The effective degrees of freedom is again defined as $\\text{trace}(\\mathbf{S}_\\lambda)$, and can be used to calibrate the amount of smoothing." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "FIGURE 6.7 compares the equivalent kernels for a smoothing spline and local linear regression. The local regression smoother has a span of $40%$, which results in $\\text{df} = \\text{trace}(\\mathbf{S}_\\lambda) = 5.86$. The smoothing spline was calibrated to have the same $\\text{df}$, and their equivalent kernels are qualitatively quite similar." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 1, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "Under construction ...\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "\"\"\"FIGURE 6.7. Equivalent kernels for a local linear regreession smoother and\n", 71 | "a smoothing spline\"\"\"\n", 72 | "print('Under construction ...')" 73 | ] 74 | } 75 | ], 76 | "metadata": { 77 | "kernelspec": { 78 | "display_name": "Python 3", 79 | "language": "python", 80 | "name": "python3" 81 | }, 82 | "language_info": { 83 | "codemirror_mode": { 84 | "name": "ipython", 85 | "version": 3 86 | }, 87 | "file_extension": ".py", 88 | "mimetype": "text/x-python", 89 | "name": "python", 90 | "nbconvert_exporter": "python", 91 | "pygments_lexer": "ipython3", 92 | "version": "3.6.4" 93 | } 94 | }, 95 | "nbformat": 4, 96 | "nbformat_minor": 2 97 | } 98 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section3-local-regression-in-higher-dimensions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 6.3. 
Local Regression in $\\mathbb{R}^p$\n", 8 | "\n", 9 | "Kernel smoothing and local regression generalize very naturally to two or more dimensions.\n", 10 | "\n", 11 | "* The Nadaraya-Watson kernel smoother fits a constant locally with weights supplied by a $p$-dimensional kernel.\n", 12 | "* Local linear regression will fit a hyperplane locally in $X$, by weighted least squares, with weights supplied by a $p$-dimensional kernel. \n", 13 | " It is simple to implement and is generally preferred to the local constant fit for its superior performacne on the boundaries." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### Formulation\n", 21 | "\n", 22 | "Let $b(X)$ be a vector of polynomial terms in $X$ of maximum degree $d$; e.g., with $d=1$ and $p=2$ we get\n", 23 | "\n", 24 | "\\begin{equation}\n", 25 | "b(X) = (1, X_1, X_2);\n", 26 | "\\end{equation}\n", 27 | "\n", 28 | "with $d=2$ we get\n", 29 | "\n", 30 | "\\begin{equation}\n", 31 | "b(X) = (1, X_1, X_2, X_1^2, X_2^2, X_1X_2);\n", 32 | "\\end{equation}\n", 33 | "\n", 34 | "and trivially with $d=0$ we get\n", 35 | "\n", 36 | "\\begin{equation}\n", 37 | "b(X) = 1.\n", 38 | "\\end{equation}\n", 39 | "\n", 40 | "At each $x_0 \\in \\mathbb{R}^p$ solve\n", 41 | "\n", 42 | "\\begin{equation}\n", 43 | "\\min_{\\beta(x_0)} \\sum_{i=1}^N K_\\lambda(x_0,x_i) \\left( y_i - b(x_i)^T\\beta(x_0) \\right)^2\n", 44 | "\\end{equation}\n", 45 | "\n", 46 | "to produce the fit\n", 47 | "\n", 48 | "\\begin{equation}\n", 49 | "\\hat{f}(x_0) = b(x_0)^T \\hat\\beta(x_0).\n", 50 | "\\end{equation}\n", 51 | "\n", 52 | "Typically the kernel will be a radial function, such as the radial Epanechnikov or tri-cube kernel\n", 53 | "\n", 54 | "\\begin{equation}\n", 55 | "K_\\lambda(x_0,x) = D\\left( \\frac{\\|x-x_0\\|}\\lambda \\right),\n", 56 | "\\end{equation}\n", 57 | "\n", 58 | "where $\\|\\cdot\\|$ is the Euclidean norm.\n", 59 | "\n", 60 | "Since the Euclidean norm depends on the units in each coordinate, it makes most sense to standardize each predictor, e.g., to unit standard deviation, prior to smoothing." 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "### Boundary problem gets serious with the curse of dimensionality\n", 68 | "\n", 69 | "While boundary effects are a problem in 1D smoothing, they are a much bigger problem in two or higher dimensions, since the fraction of points on the boundary is larger. In fact, one of the manifestations of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows.\n", 70 | "\n", 71 | "Directly modifying the kernel to accommodate two-dimensional boundaries becomes very messy, especially for irregular boundaries.\n", 72 | "\n", 73 | "Local polynomial regression seamlessly performs boundary correction to the desired order in any dimensions. FIGURE 6.8 illustrates local linear regression on some measurements from an astronomical study with an unusual predictor design (star-shaped). Here the boundary is extremely irregular, and the fitted surface must also interpolate over regions of increasing data sparsity as we approach the boundary." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 1, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Under construction ...\n" 86 | ] 87 | } 88 | ], 89 | "source": [ 90 | "\"\"\"FIGURE 6.8 3D Galaxy data\"\"\"\n", 91 | "print('Under construction ...')" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "Local regression becomes less useful in dimensions much higher than two or three. We have discussed in some detail the problems of the dimensionality, e.g., in Chapter 2. It is impossible to simultaneously maintain localness ($\\Rightarrow$ low bias) and a sizeable sample in the neighborhood ($\\Rightarrow$ low variance) as the dimension increases, without the total sample size increasing exponentially in $p$.\n", 99 | "\n", 100 | "Visualization of $\\hat{f}(X)$ also becomes difficult in higher dimensions, and this is often one of the primary goals of smoothing. Although the scatter-cloud and wire-frame pictures in FIGURE 6.8 look at attractive, it is quite difficult to interpret the results except at a gross level." 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "From a data analysis perspective, conditional plots are far more useful.\n", 108 | "\n", 109 | "FIGURE 6.9 shows an analysis of some environmental data with three predictors. The _trellis_ display here show ozone as a function of radiation, conditioned on the other two variables, temperature and wind speed. However, conditioning on the value of a variable really implies loca to that value (as in local regression)." 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 2, 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "Under construction ...\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "\"\"\"FIGURE 6.9. Conditional plots for Los Angeles Ozone data\"\"\"\n", 127 | "print('Under construction ...')" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "Above each of the panels in FIGURE 6.9 is an indication of the range of values present in that panel for each of the conditioning values. In the panel itself the data subsets are displayed (response versus remaining variable), and a 1D local linear regression is fit to the data.\n", 135 | "\n", 136 | "Although this is not quite the same as looking at slices of a fitted 3D surface, it is probably more useful in terms of understanding the joint behavior of the data." 
137 | ] 138 | } 139 | ], 140 | "metadata": { 141 | "kernelspec": { 142 | "display_name": "Python 3", 143 | "language": "python", 144 | "name": "python3" 145 | }, 146 | "language_info": { 147 | "codemirror_mode": { 148 | "name": "ipython", 149 | "version": 3 150 | }, 151 | "file_extension": ".py", 152 | "mimetype": "text/x-python", 153 | "name": "python", 154 | "nbconvert_exporter": "python", 155 | "pygments_lexer": "ipython3", 156 | "version": "3.6.4" 157 | } 158 | }, 159 | "nbformat": 4, 160 | "nbformat_minor": 2 161 | } 162 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section4-0-structured-local-regression-models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 6.4. Structured Local Regression Models in $\\mathbb{R}^p$\n", 8 | "\n", 9 | "> When the dimension to sample-size ratio is unfavorable, local regression does not help us much, unless we are willing to make some structural assumptions about the model.\n", 10 | ">\n", 11 | "> Much of this book is about structured regression and classification models. Here we focus on some approaches directly related to kernel methods." 12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 3", 18 | "language": "python", 19 | "name": "python3" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 3 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython3", 31 | "version": "3.6.4" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 2 36 | } 37 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section4-1-structured-kernels.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 6.4.1. Structured Kernels\n", 8 | "### Standardization\n", 9 | "\n", 10 | "One line of approach is to modify the kernel. The default spherical kernel\n", 11 | "\n", 12 | "\\begin{equation}\n", 13 | "K_\\lambda(x_0,x) = D\\left( \\frac{\\|x-x_0\\|}\\lambda \\right)\n", 14 | "\\end{equation}\n", 15 | "\n", 16 | "gives equal weight to each coordinate, and so a natural default strategy is to standardize each variable to unit standard deviation.\n", 17 | "\n", 18 | "A more general approach is to use a positive semidefinite matrix $\\mathbf{A}$ to weigh the different coordinates:\n", 19 | "\n", 20 | "\\begin{equation}\n", 21 | "K_{\\lambda,\\mathbf{A}}(x_0,x) = D \\left( \\frac{(x-x_0)^T\\mathbf{A}(x-x_0)}\\lambda \\right).\n", 22 | "\\end{equation}\n", 23 | "\n", 24 | "Entire coordinates or directions can be downgraded or omitted by imposing appropriate restrictions on $\\mathbf{A}$. For example, if $\\mathbf{A}$ is diagonal, then we can increase or decrease the influence of individual predictors $X_j$ by increasing or decreasing $A_{jj}$.\n", 25 | "\n", 26 | "Often the predictors are many and highly correlated, such as those arising from digitized analog signals or images. The covariance function of the predictors can be used to tailor a metric $\\mathbf{A}$ that focuses less, say, on high-freqeuncy contrast (Exercise 6.4)." 
27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### Structured regression over general models for $\\mathbf{A}$\n", 34 | "\n", 35 | "Proposals have been made for learning the parameters for multidimensional kernels. For example, the projection-pursuit regression model discussed in Chapter 11 is of this flavor, where low-rank versions of $\\mathbf{A}$ imply ridge functions for $\\hat{f}(X)$.\n", 36 | "\n", 37 | "More general models for $\\mathbf{A}$ are cumbersome, and we favor instead the structured forms for the regression function discussed next." 38 | ] 39 | } 40 | ], 41 | "metadata": { 42 | "kernelspec": { 43 | "display_name": "Python 3", 44 | "language": "python", 45 | "name": "python3" 46 | }, 47 | "language_info": { 48 | "codemirror_mode": { 49 | "name": "ipython", 50 | "version": 3 51 | }, 52 | "file_extension": ".py", 53 | "mimetype": "text/x-python", 54 | "name": "python", 55 | "nbconvert_exporter": "python", 56 | "pygments_lexer": "ipython3", 57 | "version": "3.6.4" 58 | } 59 | }, 60 | "nbformat": 4, 61 | "nbformat_minor": 2 62 | } 63 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section4-2-structured-regression-functions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 6.4.2. Structured Regression Functions\n", 8 | "\n", 9 | "We are trying to fit a regression function\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "\\text{E}(Y|X) = f(X_1,X_2,\\cdots,X_p)\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "in $\\mathbb{R}^p$, in which every level of interaction is potentially present." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### Structure via ANOVA decomposition\n", 23 | "It is natural to consider ANOVA decompositions of the form\n", 24 | "\n", 25 | "\\begin{equation}\n", 26 | "f(X_1,X_2,\\cdots,X_p) = \\alpha + \\sum_j g_j(X_j) + \\sum_{k Suppose we have a random sample $x_1,\\cdots,x_N$ drawn from a probability density $f_X(x)$, and we wish to estimate $f_X$ at a point $x_0$.\n", 12 | "\n", 13 | "For simplicity assume for now that $X\\in\\mathbb{R}$." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### Local estimate\n", 21 | "\n", 22 | "A natural local estimate has the form\n", 23 | "\n", 24 | "\\begin{equation}\n", 25 | "\\hat{f}_X(x_0) = \\frac{\\#x_i \\in \\mathcal{N}(x_0)}{N\\lambda},\n", 26 | "\\end{equation}\n", 27 | "\n", 28 | "where $\\mathcal{N}$ is a small metric neighborhood around $x_0$ of width $\\lambda$.\n", 29 | "\n", 30 | "This estimate is bumpy, so..." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Estimate with kernels\n", 38 | "\n", 39 | "the smooth _Parzen_ estimate is preferred.\n", 40 | "\n", 41 | "\\begin{equation}\n", 42 | "\\hat{f}(x_0) = \\frac1{N\\lambda} \\sum_{i=1}^N K_\\lambda(x_0,x_i),\n", 43 | "\\end{equation}\n", 44 | "\n", 45 | "because it counts observations close to $x_0$ with weights that decrease with distance from $x_0$. 
In this case a popular choice for $K_\\lambda$ is the Gaussian kernel\n", 46 | "\n", 47 | "\\begin{equation}\n", 48 | "K_\\lambda(x_0,x) = \\phi\\left(\\frac{|x-x_0|}\\lambda\\right).\n", 49 | "\\end{equation}\n", 50 | "\n", 51 | "FIGURE 6.13 shows a Gaussian kernel density fit to the sample values for $\\textsf{systolic blood pressure}$ for the $\\textsf{CHD}$ group.\n", 52 | "\n", 53 | "Letting $\\phi_\\lambda$ denote the Gaussian density with mean zero and standard-deviation $\\lambda$, the Parzen estimate has the form\n", 54 | "\n", 55 | "\\begin{align}\n", 56 | "\\hat{f}_X(x) &= \\frac1N \\sum_{i=1}^N \\phi_\\lambda(x-x_i) \\\\\n", 57 | "&= \\left( \\hat{F}\\star\\phi_\\lambda \\right)(x),\n", 58 | "\\end{align}\n", 59 | "\n", 60 | "the convolution of the sample empirical distribution $\\hat{F}$ with $\\phi_\\lambda$. The distribution $\\hat{F}(x)$ puts mass $1/N$ at each of the observed $x_i$, and is jumpy; in $\\hat{f}_X(x)$ we have smoothed $\\hat{F}$ by adding independent Gaussian noise to each observation $x_i$." 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "### Convolution = local averaging\n", 68 | "\n", 69 | "The Parzen density estimate is the equivalent of the local average, and improvements have been proposed along the lines of local regression [on the log scale for densities; see Loader (1999)]. We will not pursue these here." 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### In $\\mathbb{R}^p$\n", 77 | "\n", 78 | "The natural generalization of the Gaussian density estimate amounts to using the Gaussian product kernel,\n", 79 | "\n", 80 | "\\begin{equation}\n", 81 | "\\hat{f}(x_0) = \\frac1{N(2\\lambda^2\\pi)^{\\frac{p}2}} \\sum_{i=1}^N \\exp \\left\\{ -\\frac12 \\left( \\|x_i-x_0\\|/\\lambda \\right)^2 \\right\\}.\n", 82 | "\\end{equation}" 83 | ] 84 | } 85 | ], 86 | "metadata": { 87 | "kernelspec": { 88 | "display_name": "Python 3", 89 | "language": "python", 90 | "name": "python3" 91 | }, 92 | "language_info": { 93 | "codemirror_mode": { 94 | "name": "ipython", 95 | "version": 3 96 | }, 97 | "file_extension": ".py", 98 | "mimetype": "text/x-python", 99 | "name": "python", 100 | "nbconvert_exporter": "python", 101 | "pygments_lexer": "ipython3", 102 | "version": "3.6.4" 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 2 107 | } 108 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section6-2-kernel-density-classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 6.6.2. 
Kernel Density Classification\n", 8 | "\n", 9 | "One can use nonparametric density estimates for classification in a straight-forward fashion using Bayes' theorem.\n", 10 | "\n", 11 | "Suppose for a $J$ class problem\n", 12 | "* we fit nonparametric density estimates $\\hat{f}_j(X)$, for $j=1,\\cdots,J$ separately in each of the classes,\n", 13 | "* and we also have estimates of the class priors $\\hat\\pi_j$ (usually the sample proportions).\n", 14 | "\n", 15 | "Then\n", 16 | "\n", 17 | "\\begin{equation}\n", 18 | "\\hat{\\text{Pr}}(G=j|X=x_0) = \\frac{\\hat\\pi_j \\hat{f}_j(x_0)}{\\sum_{k=1}^J \\hat\\pi_k \\hat{f}_k(x_0)}.\n", 19 | "\\end{equation}" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Difficulty with sparse data\n", 27 | "\n", 28 | "FIGURE 6.14 uses this method to estimate to prevalence of $\\textsf{CHD}$ for the heart risk factor study, and should be compared with the left panel of FIGURE 6.12. The main difference occurs in the region of high SBP in the right panel of FIGURE 6.14. In this region the data are sparse for both classes, and since the Gaussian kernel density estimates use metric kernels, the density estimates are low and of poor quality (high variance) in these region.\n", 29 | "\n", 30 | "The local logistic regression method uses the tri-cube kernel with $k$-NN of the local linear assumption to smooth out the estimate (on the logit scale)." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "If classification is the ultimate goal, then learning the separate class densities well may be unnecessary, and can in fact be misleading.\n", 38 | "\n", 39 | "FIGURE 6.15 shows an example where the densities are both multimodal, but the posterior ratio is quite smooth. In learning the separate densities from data, one might decide to settle for a rougher, high-variance fit to capture these features, which are irrelevant for the purposes of estimating the posterior probabilities. In fact, if classification is the ultimate goal, then we need only to estimate the posterior well near the decision boundary. e.g., for two classes, this is the set\n", 40 | "\n", 41 | "\\begin{equation}\n", 42 | "\\{ x| \\text{Pr}(G=1|X=x) = \\frac12 \\}.\n", 43 | "\\end{equation}" 44 | ] 45 | } 46 | ], 47 | "metadata": { 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "codemirror_mode": { 55 | "name": "ipython", 56 | "version": 3 57 | }, 58 | "file_extension": ".py", 59 | "mimetype": "text/x-python", 60 | "name": "python", 61 | "nbconvert_exporter": "python", 62 | "pygments_lexer": "ipython3", 63 | "version": "3.6.4" 64 | } 65 | }, 66 | "nbformat": 4, 67 | "nbformat_minor": 2 68 | } 69 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section6-3-the-naive-bayes-classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 6.6.3. The Naive Bayes Classifier\n", 8 | "\n", 9 | "This is a technique that has remained popular over the years, despite its name (a.k.a. \"Idiot's Bayes\"). It is espeically appropriate when the dimension $p$ of the feature space is high, making density estimation unattractive." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### Naive assumption\n", 17 | "\n", 18 | "The naive Bayes model assumes that given a class $G=j$, the features $X_k$ are independent:\n", 19 | "\n", 20 | "\\begin{equation}\n", 21 | "f_j(X) = \\prod_{k=1}^p f_{jk}(X_k)\n", 22 | "\\end{equation}\n", 23 | "\n", 24 | "While this assumption is generally not true, it does simplify the estimation dramatically:\n", 25 | "\n", 26 | "* The individual class-conditional marginal densities $f_{jk}$ can each be estimated separately using 1D kernel density estimates. This is in fact a generalization fo the original naive Bayes procedures, which used univariate Gaussians to represent these marginals.\n", 27 | "* If a component $X_j$ of $X$ is discrete, then an appropriate histogram estimate can be used. This provides a seamless wway of mixing variable types in a feature vector.\n", 28 | "\n", 29 | "Despite these rather optimistic assumptions, naive Bayes classifiers often outperform far more sophisticated alternatives. The reasons are related to FIGURE 6.15: Although the individual class density estimates may be biased, this bias might not hurt the posterior probabilities as much, especially near the decision regions. In fact, the problem may be able to withstand considerable bias for the savings in variance such a \"naive\" assumption earns." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Formulation\n", 37 | "\n", 38 | "Starting from the independence assumption, we can derive the logit-transform (using class $J$ as the base):\n", 39 | "\n", 40 | "\\begin{align}\n", 41 | "\\log\\frac{\\text{Pr}(G=l|X)}{\\text{Pr}(G=J|X)} &= \\log\\frac{\\pi_l f_l(X)}{\\pi_J f_J(X)}\\\\\n", 42 | "&= \\log\\frac{\\pi_l \\prod_{k=1}^p f_{lk}(X_k)}{\\pi_J \\prod_{k=1}^p f_{Jk}(X_k)}\\\\\n", 43 | "&= \\log\\frac{\\pi_l}{\\pi_J} + \\sum_{k=1}^p \\log\\frac{f_{lk}(X_k)}{f_{Jk}(X_k)}\\\\\n", 44 | "&= \\alpha_l + \\sum_{k=1}^p g_{lk}(X_k).\n", 45 | "\\end{align}" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Similarity with generalized additive model\n", 53 | "\n", 54 | "This has the form of a _generalized additive model_ (Chapter 9). The models are fit in quite different ways though (Exercise 6.9). The relationship between naive Bayes and generalized additive models is analogous to that between LDA and logistic regression ($\\S$ 4.4.5)." 55 | ] 56 | } 57 | ], 58 | "metadata": { 59 | "kernelspec": { 60 | "display_name": "Python 3", 61 | "language": "python", 62 | "name": "python3" 63 | }, 64 | "language_info": { 65 | "codemirror_mode": { 66 | "name": "ipython", 67 | "version": 3 68 | }, 69 | "file_extension": ".py", 70 | "mimetype": "text/x-python", 71 | "name": "python", 72 | "nbconvert_exporter": "python", 73 | "pygments_lexer": "ipython3", 74 | "version": "3.6.4" 75 | } 76 | }, 77 | "nbformat": 4, 78 | "nbformat_minor": 2 79 | } 80 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section7-radial-basis-functions-and-kernels.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 6.7. 
Radial Basis Functions and Kernels\n", 8 | "### Review on basis expansion\n", 9 | "\n", 10 | "In Chapter 5, functions are represented as expansions in basis functions:\n", 11 | "\n", 12 | "\\begin{equation}\n", 13 | "f(x) = \\sum_{j=1}^M \\beta_j h_j(x).\n", 14 | "\\end{equation}\n", 15 | "\n", 16 | "The art of flexible modeling using basis expansions consists of picking an appropriate family of basis functions, and then controlling the complexity of the representation by selection, regularization, or both.\n", 17 | "\n", 18 | "Some of the families of basis functions have elements that are defined locally; e.g., B-splines. If more flexibility is desired in a particular region, then that region needs to be represented by more basis functions (which in the case of B-splines translates to more knots).\n", 19 | "\n", 20 | "Tensor products of $\\mathbb{R}$-local basis functions deliver basis functions local in $\\mathbb{R}^p$. Not all basis functions are local -- e.g., the truncated power bases for splines, or sigmoidal basis functions $\\sigma(\\alpha_0 + \\alpha x)$ used in neural-networks (Chapter 11).\n", 21 | "\n", 22 | "The composed function $f(x)$ can nevertheless show local behavior, because of the particular signs and values of the coefficients causing cancellations of global effects. For example, the truncated power basis has an equivalent B-spline basis for the same space of functions; the cancellation is exact in this case." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Review on kernel smoothing\n", 30 | "\n", 31 | "Kernel methods achieve flexibility by fitting simple models in a region local to the target point $x_0$. Localization is achieved via a weighting kernel $K_\\lambda$, and individual observations receive weights $K_\\lambda(x_0,x)$." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "### Radial basis functions\n", 39 | "\n", 40 | "Radial basis functions combine these ideas, by treating the kernel functions $K_\\lambda(\\xi,x)$ as basis functions. This leads to the model\n", 41 | "\n", 42 | "\\begin{align}\n", 43 | "f(x) &= \\sum_{j=1}^M K_{\\lambda_i}(\\xi_j,x)\\beta_j \\\\\n", 44 | "&= \\sum_{j=1}^N D\\left( \\frac{\\|x-\\xi_j\\|}{\\lambda_j} \\right)\\beta_j,\n", 45 | "\\end{align}\n", 46 | "\n", 47 | "where each basis element is indexed by a location or _prototype_ parameter $xi_j$ and a scale parameter $\\lambda_j$. A popular choice for $D$ is the standard Gaussian density function.\n", 48 | "\n", 49 | "There are several approaches to learning the parameters $\\{\\lambda_j, \\xi_j, \\beta_j\\}$, for $j=1,\\cdots,M$. For simplicity we will focus on least squares methods for regression, and use the Gaussian kernel" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "1) \n", 57 | "Optimize the sum-of-squares w.r.t. all the parameters,\n", 58 | "\n", 59 | "\\begin{equation}\n", 60 | "\\min_{\\{\\lambda_j,\\xi_j,\\beta_j\\}_1^M} \\sum_{i=1}^N \\left( y_i - \\beta_0 - \\sum_{j=1}^M \\beta_j \\exp\\left( -\\frac{(x_i-\\xi_j)^T(x_i-\\xi_j)}{\\lambda_j^2} \\right)\\right)^2.\n", 61 | "\\end{equation}\n", 62 | "\n", 63 | "This model is commonly referred to as an RBF network, an alternative to the sigmoidal neural network; the $\\xi_j$ and $\\lambda_j$ playing the role of the weights. This criterion is nonconvex with multiple local minima, and the algorithms for optimization are similar to those used for neural networks." 
64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "2) \n", 71 | "Estimate the $\\{\\lambda_j,\\xi_j\\}$ separately from the $\\beta_j$. Given $\\{\\lambda_j,\\xi_j\\}$, the estimation of $\\beta_j$ is a simple least squares problem.\n", 72 | "\n", 73 | "Often the kernel parameters $\\lambda_j$ and $\\xi_j$ are chosen in an unsupervised way using the $X$ distribution alone. One of the methods is to fit a Gaussian mixture density model to the training $x_i$, which provides both the centers $\\xi_j$ and the scales $\\lambda_j$.\n", 74 | "\n", 75 | "Other even more adhoc approaches use clustering methods to local the prototypes $\\xi_j$, and treat $\\lambda_j = \\lambda$ as a hyper-parameter. The obvious drawback of these approaches is that the conditional distribution $\\text{Pr}(Y|X)$ and in particular $\\text{E}(Y|X)$ is having no say in where the action is concentrated.\n", 76 | "\n", 77 | "On the positive side, they are much simpler to implement." 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### Renormalize\n", 85 | "\n", 86 | "While it would seem attractive to reduce the parameter set and assume a constant value for $\\lambda_j=\\lambda$, this can have an undesirable side effect of creating _holes_ -- regions of $\\mathbb{R}^p$ where none of the kernels has appreciable support, as illustrated in FIGURE 6.16 upper panel. _Renormalized_ radial basis functions,\n", 87 | "\n", 88 | "\\begin{equation}\n", 89 | "h_j(x) = \\frac{D(\\|x-\\xi_j\\|/\\lambda)}{\\sum_{k=1}^M D(\\|x-\\xi_k\\|/\\lambda)}\n", 90 | "\\end{equation}\n", 91 | "\n", 92 | "avoid this problem (FIGURE 6.16 lower panel)." 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "The Nadaraya-Watson kernel regression estimator in $\\mathbb{R}^p$ can be viewed as an expansion in renormalized radial basis functions,\n", 100 | "\n", 101 | "\\begin{align}\n", 102 | "\\hat{f}(x_0) &= \\sum_{i=1}^N y_i \\frac{K_\\lambda(x_0,x_i)}{\\sum_{j=1}^N K_\\lambda(x_0,x_j)} \\\\\n", 103 | "&= \\sum_{i=1}^N y_i h_i(x_0),\n", 104 | "\\end{align}\n", 105 | "\n", 106 | "with a basis function $h_i$ located at every observation and coefficients $y_i$; i.e.,\n", 107 | "\n", 108 | "\\begin{align}\n", 109 | "\\xi_i &= x_i, \\\\\n", 110 | "\\hat\\beta_i &= y_i, \\text{ for } i=1,\\cdots,N.\n", 111 | "\\end{align}\n", 112 | "\n", 113 | "Note the similarity between the expansion and the solution (5.50 on page 169) to the regularization problem induced by the kernel $K$. Radial basis functions form the bridge between the modern \"kernel methods\" and local fitting technology." 
114 | ] 115 | } 116 | ], 117 | "metadata": { 118 | "kernelspec": { 119 | "display_name": "Python 3", 120 | "language": "python", 121 | "name": "python3" 122 | }, 123 | "language_info": { 124 | "codemirror_mode": { 125 | "name": "ipython", 126 | "version": 3 127 | }, 128 | "file_extension": ".py", 129 | "mimetype": "text/x-python", 130 | "name": "python", 131 | "nbconvert_exporter": "python", 132 | "pygments_lexer": "ipython3", 133 | "version": "3.6.4" 134 | } 135 | }, 136 | "nbformat": 4, 137 | "nbformat_minor": 2 138 | } 139 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section8-mixture-models-for-density-estimation-and-classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 6.8. Mixture Models for Density Estimation and Classification\n", 8 | "\n", 9 | "The mixture model is a useful tool for density estimation, and can be viewed as a kind of kernel method. The Gaussian mixture model has the form\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "f(x) = \\sum_{m=1}^M \\alpha_m\\phi(x;\\mu_m,\\mathbf{\\Sigma}_m)\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "with mixing proportions $\\sum\\alpha_m=1$.\n", 16 | "\n", 17 | "In general, mixture models can use any component densities in place of the Gaussian: The Gaussian mixture model is by far the most popular.\n", 18 | "\n", 19 | "The parameters are usually fit by maximum likelihood, using the EM algorithm as described in Chapter 8." 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Some special cases arise\n", 27 | "\n", 28 | "* If the covariance matrices are constrained to be scalar: $\\mathbf{\\Sigma}_m=\\sigma_m\\mathbf{I}$, then it has the form of a radial basis expansion.\n", 29 | "* If in addition $\\sigma_m = \\sigma >0$ is fixed, and $M\\uparrow N$, then the maximum likelihood estimate approaches the kernel density estimate \n", 30 | "\n", 31 | " \\begin{equation}\n", 32 | " \\hat{f}_X(x_0) = \\frac1{N\\lambda} \\sum_{i=1}^N K_\\lambda(x_0,x_i)\n", 33 | " \\end{equation}\n", 34 | " \n", 35 | " where $\\hat\\alpha_m = 1/N$ and $\\hat\\mu_m = x_m$.\n", 36 | "\n", 37 | "Using Bayes' theorem, separate mixture densities in each class lead to flexible models for $\\text{Pr}(G|X)$ (Chapter 12)." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Heart disease, again\n", 45 | "\n", 46 | "FIGURE 6.17 shows an application of mixtures to the heart disease risk factor study." 47 | ] 48 | } 49 | ], 50 | "metadata": { 51 | "kernelspec": { 52 | "display_name": "Python 3", 53 | "language": "python", 54 | "name": "python3" 55 | }, 56 | "language_info": { 57 | "codemirror_mode": { 58 | "name": "ipython", 59 | "version": 3 60 | }, 61 | "file_extension": ".py", 62 | "mimetype": "text/x-python", 63 | "name": "python", 64 | "nbconvert_exporter": "python", 65 | "pygments_lexer": "ipython3", 66 | "version": "3.6.4" 67 | } 68 | }, 69 | "nbformat": 4, 70 | "nbformat_minor": 2 71 | } 72 | -------------------------------------------------------------------------------- /chapter06-kernel-smoothing-methods/section9-computational-considerations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 6.9. 
Computational Considerations\n", 8 | "\n", 9 | "Kernel and local regression and density estimation are _memory-based_ methods: The model is the entire training data set, and the fitting is done at evaluation or prediction time. For many real-time applications, this can make this class of methods infeasible." 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "The computational cost to fit at a single observation $x_0$ is $O(N)$ flops, except in oversimplified cases (such as square kernels). By comparison, an expansion in $M$ basis functions costs $O(M)$ for one evaluation, and typically $M\\sim O(\\log N)$. Basis function methods have an initial cost of at least $O(NM^2+M^3)$." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "The smoothing parameter(s) $\\lambda$ for kernel methods are typically determined off-line, e.g. using cross-validation, at a cost of $O(N^2)$ flops." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "Popular implementations of local regression, such as the $\\textsf{loess}$ function in S-PLUS and R, and the $\\textsf{locfit}$ procedure (Loader, 1999), use triangulation schemes to reduce the computations. They compute the fit exactly at $M$ carefully chosen locations ($O(NM)$), and then use blending techniques to interpolate the fit elsewhere ($O(M)$ per evaluation)." 31 | ] 32 | } 33 | ], 34 | "metadata": { 35 | "kernelspec": { 36 | "display_name": "Python 3", 37 | "language": "python", 38 | "name": "python3" 39 | }, 40 | "language_info": { 41 | "codemirror_mode": { 42 | "name": "ipython", 43 | "version": 3 44 | }, 45 | "file_extension": ".py", 46 | "mimetype": "text/x-python", 47 | "name": "python", 48 | "nbconvert_exporter": "python", 49 | "pygments_lexer": "ipython3", 50 | "version": "3.6.4" 51 | } 52 | }, 53 | "nbformat": 4, 54 | "nbformat_minor": 2 55 | } 56 | -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/fig7-12.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/chapter07-model-assessment-and-selection/fig7-12.jpg -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/fig7-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/chapter07-model-assessment-and-selection/fig7-2.jpg -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/section01-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 7. Model Assessment and Selection\n", 8 | "# $\\S$ 7.1. Introduction\n", 9 | "\n", 10 | "The _generalization_ performance of a learning method relates to its prediction capability on independent test data. 
Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.\n", 11 | "\n", 12 | "In this chapter we describe and illustrate the key methods for performance assessment, and show how they are used to select models. We begin this chapter with a discussion of the interplay between bias, variance, and model complexity." 13 | ] 14 | } 15 | ], 16 | "metadata": { 17 | "kernelspec": { 18 | "display_name": "Python 3", 19 | "language": "python", 20 | "name": "python3" 21 | }, 22 | "language_info": { 23 | "codemirror_mode": { 24 | "name": "ipython", 25 | "version": 3 26 | }, 27 | "file_extension": ".py", 28 | "mimetype": "text/x-python", 29 | "name": "python", 30 | "nbconvert_exporter": "python", 31 | "pygments_lexer": "ipython3", 32 | "version": "3.6.4" 33 | } 34 | }, 35 | "nbformat": 4, 36 | "nbformat_minor": 2 37 | } 38 | -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/section03-1-example-bias-variance-tradeoff.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 7.3.1. Example: Bias-Variance Tradeoff\n", 8 | "\n", 9 | "### Example specification\n", 10 | "\n", 11 | "FIGURE 7.3 shows the bias-variance tradeoff for two simulated examples. There are 80 observations and 20 predictors, uniformly distributed in the hypercube $[0,1]^{20}$. The situations are as follows:\n", 12 | "\n", 13 | "* _Left panels_: $Y=\\begin{cases}0 &\\text{ if } X_1 \\le 0.5 \\\\ 1 &\\text{ otherwise,}\\end{cases}$ and we apply kNN.\n", 14 | "* _Right panels_: $Y=\\begin{cases}1 &\\text{ if } \\sum_{j=1}^{10} X_j \\gt 5 \\\\ 0 &\\text{ otherwise,}\\end{cases}$ and we use best subset linear regression of size $p$.\n", 15 | "\n", 16 | "* The top row is regression with squared error loss;\n", 17 | "* the bottom row is classification with 0-1 loss.\n", 18 | "\n", 19 | "The figures show\n", 20 | "* the prediction error (red),\n", 21 | "* squared bias (green), and\n", 22 | "* variance (blue),\n", 23 | "\n", 24 | "all computed for a large test sample.\n", 25 | "\n", 26 | "In the regression problem, bias and variance add to produce the prediction error curve, with minima at about $k=5$ for $k$NN, and $p\\ge 10$ for the linear model." 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 1, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "\"\"\"FIGURE 7.3. 
simulation for bias-variance tradeoff\"\"\"\n", 36 | "%matplotlib inline\n", 37 | "import numpy as np\n", 38 | "import matplotlib.pyplot as plt" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 12, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "# Training set\n", 48 | "size_training = 80\n", 49 | "size_predictor = 20\n", 50 | "train_x = np.random.rand(size_training, size_predictor)\n", 51 | "train_x.shape\n", 52 | "\n", 53 | "train_y1 = np.where(train_x[:,0] <= .5, 0, 1)\n", 54 | "train_y2 = np.where(train_x[:,:10].sum(axis=1) > 5, 1, 0)\n", 55 | "# print(train_x[:,0], train_y1)\n", 56 | "# print(train_x[:,:10].sum(axis=1), train_y2)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 13, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "# kNN simulation\n", 66 | "# kNN regression\n", 67 | "# kNN classification" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 14, 73 | "metadata": {}, 74 | "outputs": [ 75 | { 76 | "name": "stdout", 77 | "output_type": "stream", 78 | "text": [ 79 | "Under construction ...\n" 80 | ] 81 | } 82 | ], 83 | "source": [ 84 | "# Linear model simulation\n", 85 | "# Linear regression\n", 86 | "# Linear classification\n", 87 | "print('Under construction ...')" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "### Questions\n", 95 | "\n", 96 | "For classification loss (bottom figures), some interesting phenomena can be seen. The bias and variance curves are the same as in the top figures, and prediction error now refers to misclassification rate. We see that prediction error is no longer the sum of squared bias and variance.\n", 97 | "\n", 98 | "1. For the $k$NN classifier, prediction error decreases or stays the same as the number of neighbors is increased to 20, despite the fact that the squared bias is rising.\n", 99 | "2. For the linear model classifier the minimum occurs for $p\\ge 10$ as in regression, but the improvement over the $p=1$ model is more dramatic.\n", 100 | "\n", 101 | "We see that bias and variance seem to interact in determining prediction error." 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Answer for Question 1\n", 109 | "\n", 110 | "Why does this happen? There is a simple explanation for the first phenomenon.\n", 111 | "\n", 112 | "Suppose at a given input point, the true probability of class 1 is $0.9$ while the expected value of our estimate is $0.6$. Then the squared bias -- $(0.6 - 0.9)^2$ -- is considerable, but the prediction error is zero since we make the correct decision.\n", 113 | "\n", 114 | "In other words, estimation errors that leave us on the right side of the decision boundary don't hurt. Exercise 7.2 demonstrates this phenomenon analytically, and also shows the interaction effect between bias and variance." 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### Summary\n", 122 | "\n", 123 | "The overall point is that the bias-variance tradeoff behaves differently for 0-1 loss than it does for squared error loss. This in turn means that the best choices of tuning parameters may differ substantially in the two settings. One should base the choice of tuning parameter on an estimate of prediction error, as described in the following sections." 
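,
    "\n",
    "The kNN part of the simulation is still marked as under construction above. As a stopgap, here is a rough Monte Carlo sketch (an illustrative stand-in, not the code behind FIGURE 7.3) of how the squared bias and variance of a $k$NN regression fit could be estimated at a set of test points in the left-panel setting. Since $Y$ is a deterministic function of $X$ in this example, all of the variance comes from the sampling of the training inputs; the numbers of replications and test points are arbitrary.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(2)\n",
    "N, p, k, n_rep, n_test = 80, 20, 5, 100, 50\n",
    "\n",
    "x_test = rng.uniform(size=(n_test, p))\n",
    "f_test = np.where(x_test[:, 0] <= 0.5, 0.0, 1.0)   # true f(X) in the left panels\n",
    "\n",
    "preds = np.empty((n_rep, n_test))\n",
    "for r in range(n_rep):\n",
    "    X = rng.uniform(size=(N, p))\n",
    "    y = np.where(X[:, 0] <= 0.5, 0.0, 1.0)\n",
    "    d = ((x_test[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)\n",
    "    nearest = np.argsort(d, axis=1)[:, :k]          # indices of the k nearest neighbors\n",
    "    preds[r] = y[nearest].mean(axis=1)              # kNN regression fit at each test point\n",
    "\n",
    "bias2 = ((preds.mean(axis=0) - f_test) ** 2).mean()\n",
    "var = preds.var(axis=0).mean()\n",
    "print(round(bias2, 3), round(var, 3))\n",
    "```"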
124 | ] 125 | } 126 | ], 127 | "metadata": { 128 | "kernelspec": { 129 | "display_name": "Python 3", 130 | "language": "python", 131 | "name": "python3" 132 | }, 133 | "language_info": { 134 | "codemirror_mode": { 135 | "name": "ipython", 136 | "version": 3 137 | }, 138 | "file_extension": ".py", 139 | "mimetype": "text/x-python", 140 | "name": "python", 141 | "nbconvert_exporter": "python", 142 | "pygments_lexer": "ipython3", 143 | "version": "3.6.4" 144 | } 145 | }, 146 | "nbformat": 4, 147 | "nbformat_minor": 2 148 | } 149 | -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/section06-the-effective-number-of-parameters.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 7.6. The Effective Number of Parameters\n", 8 | "\n", 9 | "The concept of \"number of parameters\" can be generalized, especially to models where regularization is used in the fitting.\n", 10 | "\n", 11 | "Suppose we stack the outcomes $y_1,y_2,\\cdots,y_N$ into a vector $\\mathbf{y}$, and similarly for the predictions $\\hat{\\mathbf{y}}$. Then a linear fitting method is one for which we can write\n", 12 | "\n", 13 | "\\begin{equation}\n", 14 | "\\hat{\\mathbf{y}} = \\mathbf{Sy},\n", 15 | "\\end{equation}\n", 16 | "\n", 17 | "where $\\mathbf{S}$ is an $N\\times N$ matrix depending on the input vectors $x_i$, but not on the $y_i$.\n", 18 | "\n", 19 | "Linear fitting methods include\n", 20 | "* linear regression on the original features or on a derived basis set, and\n", 21 | "* smoothing methods that use quadratic shrinkage, such as ridge regression and cubic smoothing splines.\n", 22 | "\n", 23 | "Then the _effective number of parameters_ (a.k.a. the _effective degrees of freedom_) is defined as\n", 24 | "\n", 25 | "\\begin{equation}\n", 26 | "\\text{df}(\\mathbf{S}) = \\text{trace}(\\mathbf{S}).\n", 27 | "\\end{equation}\n", 28 | "\n", 29 | "Note that if $\\mathbf{S}$ is an orthogonal-projection matrix onto a basis set spanned by $M$ features, then\n", 30 | "\n", 31 | "\\begin{equation}\n", 32 | "\\text{trace}(\\mathbf{S}) = M.\n", 33 | "\\end{equation}\n", 34 | "\n", 35 | "It turns out that $\\text{trace}(\\mathbf{S})$ is exactly the correct quantity to replace $d$ as the number of parameters in the $C_p$ statistic." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### For additive-error models\n", 43 | "\n", 44 | "If $\\mathbf{y}$ arises from an additive-error model\n", 45 | "\n", 46 | "\\begin{equation}\n", 47 | "Y = f(X) + \\epsilon\n", 48 | "\\end{equation}\n", 49 | "\n", 50 | "with $\\text{Var}(\\epsilon) = \\sigma_\\epsilon^2$, then one can show that\n", 51 | "\n", 52 | "\\begin{equation}\n", 53 | "\\sum_{i=1}^N \\text{Cov}(\\hat{y}_i,y_i) = \\text{trace}(\\mathbf{S})\\sigma_\\epsilon^2,\n", 54 | "\\end{equation}\n", 55 | "\n", 56 | "which motivates the more general definition (Exercise 7.4 and 7.5)\n", 57 | "\n", 58 | "\\begin{equation}\n", 59 | "\\text{df}(\\hat{\\mathbf{y}}) = \\frac{\\sum_{i=1}^N \\text{Cov}(\\hat{y}_i,y_i)}{\\sigma_\\epsilon^2}.\n", 60 | "\\end{equation}\n", 61 | "\n", 62 | "$\\S$ 5.4.1 on page 153 gives some more intuition for the definition $\\text{df} = \\text{trace}(\\mathbf{S})$ in the context of smoothing splines." 
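,
    "\n",
    "As a small numerical check (not from the text), the snippet below computes $\\text{df}(\\mathbf{S}) = \\text{trace}(\\mathbf{S})$ for the ridge regression smoother $\\mathbf{S} = \\mathbf{X}(\\mathbf{X}^T\\mathbf{X} + \\alpha\\mathbf{I})^{-1}\\mathbf{X}^T$ on random inputs, and verifies it against the equivalent eigenvalue form $\\sum_j d_j^2/(d_j^2+\\alpha)$ from Chapter 3. The simulated data and the grid of $\\alpha$ values are arbitrary.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(3)\n",
    "N, p = 50, 10\n",
    "X = rng.normal(size=(N, p))\n",
    "\n",
    "def ridge_df(X, alpha):\n",
    "    # S = X (X^T X + alpha I)^{-1} X^T for ridge regression; df = trace(S)\n",
    "    S = X @ np.linalg.inv(X.T @ X + alpha * np.eye(X.shape[1])) @ X.T\n",
    "    return np.trace(S)\n",
    "\n",
    "d2 = np.linalg.svd(X, compute_uv=False) ** 2\n",
    "for alpha in (0.0, 1.0, 10.0, 100.0):\n",
    "    df = ridge_df(X, alpha)\n",
    "    assert np.isclose(df, np.sum(d2 / (d2 + alpha)))\n",
    "    print(alpha, round(df, 2))   # alpha = 0 gives a projection, so df = p\n",
    "```"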
63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### For models like neural networks\n", 70 | "\n", 71 | "in which we minimize an error function $R(\\omega)$ with weight decay penalty (regularization) $\\alpha \\sum_m \\omega_m^2$, the effective number of parameters has the form\n", 72 | "\n", 73 | "\\begin{equation}\n", 74 | "\\text{df}(\\alpha) = \\sum_{m=1}^M \\frac{\\theta_m}{\\theta_m+\\alpha},\n", 75 | "\\end{equation}\n", 76 | "\n", 77 | "where the $\\theta_m$ are the eigenvalues of the Hessian matrix $\\frac{\\partial^2 R(\\omega)}{\\partial\\omega\\partial\\omega^T}$.\n", 78 | "\n", 79 | "Expression $(7.34)$ follows from $(7.32)$ if we make a quadratic approximation to the error function at the solution (Bishop, 1995)." 80 | ] 81 | } 82 | ], 83 | "metadata": { 84 | "kernelspec": { 85 | "display_name": "Python 3", 86 | "language": "python", 87 | "name": "python3" 88 | }, 89 | "language_info": { 90 | "codemirror_mode": { 91 | "name": "ipython", 92 | "version": 3 93 | }, 94 | "file_extension": ".py", 95 | "mimetype": "text/x-python", 96 | "name": "python", 97 | "nbconvert_exporter": "python", 98 | "pygments_lexer": "ipython3", 99 | "version": "3.6.4" 100 | } 101 | }, 102 | "nbformat": 4, 103 | "nbformat_minor": 2 104 | } 105 | -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/section10-0-cross-validation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 7.10. Cross-Validation\n", 8 | "\n", 9 | "Probably the simplest and most widely used method for estimating prediction error is cross-validation. This method directly estimates the expected extra-sample error\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "\\text{Err} = \\text{E}\\left[ L(Y, \\hat{f}(X)) \\right],\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "the average generalization error when the method $\\hat{f}(X)$ is applied to an independent test sample from the joint distribution of $X$ and $Y$." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "As mentioned earlier, we might hope that cross-validation estimates the conditional error, with the training set $\\mathcal{T}$ held fixed. But as we will see in $\\S$ 7.12, cross-validation typically estimates well only the expected prediction error." 23 | ] 24 | } 25 | ], 26 | "metadata": { 27 | "kernelspec": { 28 | "display_name": "Python 3", 29 | "language": "python", 30 | "name": "python3" 31 | }, 32 | "language_info": { 33 | "codemirror_mode": { 34 | "name": "ipython", 35 | "version": 3 36 | }, 37 | "file_extension": ".py", 38 | "mimetype": "text/x-python", 39 | "name": "python", 40 | "nbconvert_exporter": "python", 41 | "pygments_lexer": "ipython3", 42 | "version": "3.6.4" 43 | } 44 | }, 45 | "nbformat": 4, 46 | "nbformat_minor": 2 47 | } 48 | -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/section10-2-the-wrong-and-right-way-to-do-cross-validation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 7.10.2. 
The Wrong and Right Way to Do Cross-Validation\n", 8 | "\n", 9 | "Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:\n", 10 | "\n", 11 | "1. Screen the predictors: find a subset of \"good\" predictors that show fairly strong (univariate) correlation with the class labels.\n", 12 | "1. Using just this subset of predictors, build a multivariate classifier.\n", 13 | "1. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.\n", 14 | "\n", 15 | "Is this a correct application of cross-validation?" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "Consider a scenario with $N=50$ samples in two equal-sized classes, and $p=5000$ quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%.\n", 23 | "\n", 24 | "We carried out the above recipe,\n", 25 | "1. choosing in step (1) the 100 predictors having highest correlation with the class labels, and then\n", 26 | "1. using a 1NN classifier, based on just these 100 predictors, in step (2).\n", 27 | "\n", 28 | "Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "### What happened?\n", 36 | "\n", 37 | "The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of _all of the samples_. Leaving samples out _after_ the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors \"have already seen\" the left out samples.\n", 38 | "\n", 39 | "FIGURE 7.10 (top panel) illustrates the problem. We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples (top panel). We see that the correlations average about 0.28, rather than 0, as one might expect." 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "Here is the correct way to carry out cross-validation in this example:\n", 47 | "1. Divide the samples into $K$ cross-validation folds (groups) at random.\n", 48 | "1. For each fold $k=1,2,\\cdots,K$\n", 49 | " 1. Find a subset of \"good\" predictors that show fairly strong (univariate) correlation with the class labels, using all of the samples except those in fold $k$.\n", 50 | " 1. Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold $k$.\n", 51 | " 1. Use the classifier to predict the class labels for the samples in fold $k$.\n", 52 | "\n", 53 | "The error estimates from step 2(c) are then accumulated over all $K$ folds, to produce the cross-validation estimate of prediction error. The lower panel of FIGURE 7.10 shows the correlations of class labels with the 100 predictors chosen in step 2(a) of the correct procedure, over the samples in a typical fold $k$. We see that they average about zero, as they should." 
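,
    "\n",
    "The two recipes are easy to compare numerically. The sketch below is one hypothetical way to do it with scikit-learn, which is not a dependency of these notebooks: univariate ANOVA screening (a stand-in for the correlation screening described above) followed by a 1NN classifier, cross-validated either after screening on all samples (the wrong way) or with the screening refit inside each fold via a pipeline (the right way). Exact error rates vary with the random seed, but the wrong recipe reports a far lower error than the true 50%.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "rng = np.random.default_rng(4)\n",
    "N, p = 50, 5000\n",
    "X = rng.normal(size=(N, p))\n",
    "y = np.repeat([0, 1], N // 2)       # labels independent of X; true error rate is 50%\n",
    "\n",
    "# wrong: screen the 100 best predictors on ALL samples, then cross-validate 1NN\n",
    "X_screened = SelectKBest(f_classif, k=100).fit_transform(X, y)\n",
    "wrong = cross_val_score(KNeighborsClassifier(n_neighbors=1), X_screened, y, cv=5)\n",
    "\n",
    "# right: the screening step is refit inside each training fold\n",
    "pipe = make_pipeline(SelectKBest(f_classif, k=100), KNeighborsClassifier(n_neighbors=1))\n",
    "right = cross_val_score(pipe, X, y, cv=5)\n",
    "\n",
    "print(1 - wrong.mean(), 1 - right.mean())\n",
    "```"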
54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be \"left out\" before any selection or filtering steps are applied.\n", 61 | "\n", 62 | "> There is one qualification: Initial _unsupervised_ screening steps can be done before samples are left out.\n", 63 | "\n", 64 | "For example, we could select the 1000 predictors with highest variance across all 50 samples, before starting cross-validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage." 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Even in published papers!\n", 72 | "\n", 73 | "While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top rank journals. With the large numbers of predictors that are so common in genomic and other areas, the potential consequences of this error have also increased dramatically; see Ambroise and McLachlan (2002) for a detailed discussion of this issue." 74 | ] 75 | } 76 | ], 77 | "metadata": { 78 | "kernelspec": { 79 | "display_name": "Python 3", 80 | "language": "python", 81 | "name": "python3" 82 | }, 83 | "language_info": { 84 | "codemirror_mode": { 85 | "name": "ipython", 86 | "version": 3 87 | }, 88 | "file_extension": ".py", 89 | "mimetype": "text/x-python", 90 | "name": "python", 91 | "nbconvert_exporter": "python", 92 | "pygments_lexer": "ipython3", 93 | "version": "3.6.4" 94 | } 95 | }, 96 | "nbformat": 4, 97 | "nbformat_minor": 2 98 | } 99 | -------------------------------------------------------------------------------- /chapter07-model-assessment-and-selection/section10-3-does-cross-validation-really-work.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 7.10.3. Does Cross-Validation Really Work?\n", 8 | "\n", 9 | "We once again examine the behavior of cross-validation in a high-dimensional classification problem.\n", 10 | "\n", 11 | "Consider a scenario with $N=20$ samples in two equal-sized classes, and $p=500$ quantitative predictors that are independent of the class labels. Once again, the true error rate of any classifier is 50%.\n", 12 | "\n", 13 | "Consider a simple univariate classifier: A single split that minimizes the misclassification error (a \"stump\"). Stumps are trees with a single split, and are used in boosting methods (Chapter 10). A simple argument suggests that cross-validation will not work properly in this setting (This argument was made to us by a scientist at a proteomics lab meeting, and led to material in this section):\n", 14 | "\n", 15 | "> Fitting to the entire training set, we will find a predictor that splits the data very well. If we do 5-fold cross-validation, this same predictor should split any 4/5ths and 1/5th of the data well too, and hence its cross-validation error will be small (much less than 50%). Thus CV does not give an accurate estimate of error." 
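,
    "\n",
    "Before turning to the simulation in the book, here is a minimal self-contained sketch of this setting (not the code behind FIGURE 7.11): the best stump is refit inside each of the five training folds and then evaluated on the held-out fold. A single run averages over only 20 predictions and is therefore noisy, but over repeated runs the cross-validation error hovers around 50%, contradicting the argument above.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(5)\n",
    "N, p, K = 20, 500, 5\n",
    "X = rng.normal(size=(N, p))\n",
    "y = np.repeat([0, 1], N // 2)\n",
    "\n",
    "def fit_stump(X, y):\n",
    "    # exhaustive search for the (feature, threshold, orientation) with fewest training errors\n",
    "    best = None\n",
    "    for j in range(X.shape[1]):\n",
    "        for t in X[:, j]:\n",
    "            for flip in (0, 1):\n",
    "                pred = (X[:, j] > t).astype(int) ^ flip\n",
    "                err = np.mean(pred != y)\n",
    "                if best is None or err < best[0]:\n",
    "                    best = (err, j, t, flip)\n",
    "    return best[1:]\n",
    "\n",
    "def predict_stump(stump, X):\n",
    "    j, t, flip = stump\n",
    "    return (X[:, j] > t).astype(int) ^ flip\n",
    "\n",
    "folds = np.array_split(rng.permutation(N), K)\n",
    "errors = []\n",
    "for k in range(K):\n",
    "    test = folds[k]\n",
    "    train = np.setdiff1d(np.arange(N), test)\n",
    "    stump = fit_stump(X[train], y[train])\n",
    "    errors.append(np.mean(predict_stump(stump, X[test]) != y[test]))\n",
    "print(np.mean(errors))   # close to 0.5 when averaged over many repetitions\n",
    "```"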
16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### Simulation\n", 23 | "\n", 24 | "To investigate whether this argument is correct, FIGURE 7.11 shows the result of a simulation from this setting.\n", 25 | "\n", 26 | "There are 500 predictors and 20 samples in two equal-sized classes, with all predictors having a standard Gaussian distribution.\n", 27 | "\n", 28 | "The panel in the top left shows the number of training errors for each of the 500 stumps fit to the training data. We have marked in color the six predictors yielding the fewest errors.\n", 29 | "\n", 30 | "In the top right panel, the training errors are shown for stumps fit to a random 4/5ths partition of the data (16 samples), and tested on the remaining 1/5th (four samples). The colored points indicate the same predictors marked in the top left panel. We see that the stump for the blue predictor (the best in the top left panel) makes two out of four test errors (50%), and is no better than random." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | "Necessary to reproduce this figure?\n" 43 | ] 44 | } 45 | ], 46 | "source": [ 47 | "\"\"\"FIGURE 7.11. Simulation study to investigate the performance of cross-validation in a\n", 48 | "high-dimensional problem where the predictors are independent of the class labels.\n", 49 | "\n", 50 | "The top-left panel shows the number of errors made by individual stump classifiers on\n", 51 | "the full training set (20 observations).\n", 52 | "\n", 53 | "The top right panel shows the errors made by individual stumps trained on a random split of\n", 54 | "the dataset into 4/5ths (16 observations) and tested on the remaining 1/5th (4 observations).\n", 55 | "\n", 56 | "The best performers are depicted by colored dots in each panel.\n", 57 | "\n", 58 | "The bottom left panel shows the effect of re-estimating the split point in each fold:\n", 59 | "The colored points correspond to the four samples in the 1/5th validation set. The split\n", 60 | "point derived from the full dataset classifies all four samples correctly, but when the split\n", 61 | "point is re-estimated on the 4/5ths data (as it should be), it commits two errors on the four\n", 62 | "validation samples.\n", 63 | "\n", 64 | "In the bottom right we see the overall result of five-fold cross-validation applied to\n", 65 | "50 simulated datasets. The average error rate is about 50%, as it should be.\"\"\"\n", 66 | "N, p, K = 20, 500, 5\n", 67 | "print('Necessary to reproduce this figure?')" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "What has happened? The preceding argument has ignored the fact that in cross-validation, the model must be completely retrained for each fold of the process.\n", 75 | "\n", 76 | "In the present example, this means that the best predictor and corresponding split point are found from 4/5ths of the data. The effect of predictor choice is seen in the top right panel. Since the class labels are independent of the predictors, the performance of a stump on the 4/5ths training data contains no information about its performance in the remaining 1/5th.\n", 77 | "\n", 78 | "The effect of the choice of split point is shown in the bottom left panel. Here we see the data for predictor 436, corresponding to the blue dot in the top left plot. 
The colored points indicate the 1/5th data, while the remaining points belong to the 4/5ths. The optimal split points for this predictor based on both the full training set and 4/5ths data are indicated.\n", 79 | "\n", 80 | "The split based on the full data makes no errors on the 1/5th data. But cross-validation must base its split on the 4/5ths data, and this incurs two errors out of four samples." 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "The results of applying five-fold cross-validation to each of 50 simulated datasets are shown in the bottom right panel. As we would hope, the average cross-validation error is around 50%, which is the true expected prediction error for this classifier. Hence cross-validation has behaved as it should.\n", 88 | "\n", 89 | "On the other hand, there is considerable variability in the error, underscoring the importance of reporting the estimated standard error of the CV estimate. See Exercise 7.10 for another variation of this problem." 90 | ] 91 | } 92 | ], 93 | "metadata": { 94 | "kernelspec": { 95 | "display_name": "Python 3", 96 | "language": "python", 97 | "name": "python3" 98 | }, 99 | "language_info": { 100 | "codemirror_mode": { 101 | "name": "ipython", 102 | "version": 3 103 | }, 104 | "file_extension": ".py", 105 | "mimetype": "text/x-python", 106 | "name": "python", 107 | "nbconvert_exporter": "python", 108 | "pygments_lexer": "ipython3", 109 | "version": "3.6.4" 110 | } 111 | }, 112 | "nbformat": 4, 113 | "nbformat_minor": 2 114 | } 115 | -------------------------------------------------------------------------------- /chapter11-neural-networks/fig11-10.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/chapter11-neural-networks/fig11-10.jpg -------------------------------------------------------------------------------- /chapter11-neural-networks/fig11-12.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/chapter11-neural-networks/fig11-12.jpg -------------------------------------------------------------------------------- /chapter11-neural-networks/section01-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 11. Neural Networks\n", 8 | "# $\\S$ 11.1 Introduction\n", 9 | "\n", 10 | "In this chapter we describe a class of learning methods that was developed separately in different fields -- statistics and artificial intelligence -- based on essentially identical models.\n", 11 | "\n", 12 | "> The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features.\n", 13 | "\n", 14 | "The result is a powerful learning method, with widespread applications in many fields."
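,
    "\n",
    "As a concrete preview (using notation that is developed later in the chapter, and random weights rather than fitted ones), the sketch below computes the derived features $Z_m = \\sigma(\\alpha_{0m} + \\alpha_m^T X)$ for a single hidden layer and then a fit that is a function of those features.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(6)\n",
    "N, p, M = 5, 4, 3                      # observations, inputs, hidden units\n",
    "X = rng.normal(size=(N, p))\n",
    "\n",
    "alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, p))\n",
    "beta0, beta = 0.0, rng.normal(size=M)\n",
    "\n",
    "def sigmoid(v):\n",
    "    return 1.0 / (1.0 + np.exp(-v))\n",
    "\n",
    "Z = sigmoid(alpha0 + X @ alpha.T)      # derived features: linear combinations, then a nonlinearity\n",
    "f = beta0 + Z @ beta                   # the target is then modeled in terms of Z\n",
    "print(f)\n",
    "```"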
15 | ] 16 | } 17 | ], 18 | "metadata": { 19 | "kernelspec": { 20 | "display_name": "Python 3", 21 | "language": "python", 22 | "name": "python3" 23 | }, 24 | "language_info": { 25 | "codemirror_mode": { 26 | "name": "ipython", 27 | "version": 3 28 | }, 29 | "file_extension": ".py", 30 | "mimetype": "text/x-python", 31 | "name": "python", 32 | "nbconvert_exporter": "python", 33 | "pygments_lexer": "ipython3", 34 | "version": "3.6.4" 35 | } 36 | }, 37 | "nbformat": 4, 38 | "nbformat_minor": 2 39 | } 40 | -------------------------------------------------------------------------------- /chapter11-neural-networks/section08-discussion.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 11.8. Discussion\n", 8 | "\n", 9 | "Both projection pursuit regression and neural networks take nonlinear functions of linear combinations (\"derived features\") of the inputs. This is a powerful and very general approach for regression and classification, and has been shown to compete well with the best learning methods on many problems.\n", 10 | "\n", 11 | "> These tools are especially effective in problems with a high SNR and settings where prediction without interpretation is the goal.\n", 12 | "\n", 13 | "> They are less effective for problems where the goal is to describe the physical process that generated the data and the roles of individual inputs.\n", 14 | "\n", 15 | "Each input enters into the model in many places, in a nonlinear fashion." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### Efforts to comprehend the model\n", 23 | "\n", 24 | "Some authors (Hinton, 1989) plot a diagram of the estimated weights into each hidden unit, to try to understand the feature that each unit is extracting. This is limited, however, by the lack of identifiability of the parameter vectors $\\alpha_m$, $m=1,\\cdots,M$. Often there are solutions with $\\alpha_m$ spanning the same linear space as the ones found during training, giving predicted values that are roughly the same.\n", 25 | "\n", 26 | "Some authors suggest carrying out a principal component analysis of these weights, to try to find an interpretable solution.\n", 27 | "\n", 28 | "In general, the difficulty of interpreting these models has limited their use in fields like medicine, where interpretation of the model is very important." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "### Bayesian inference\n", 36 | "\n", 37 | "There has been a great deal of research on the training of neural networks.\n", 38 | "\n", 39 | "Unlike methods like CART and MARS, neural networks are smooth functions of real-valued parameters. This facilitates the development of Bayesian inference for these models.\n", 40 | "\n", 41 | "The next section discusses a successful Bayesian implementation of neural networks."
42 | ] 43 | } 44 | ], 45 | "metadata": { 46 | "kernelspec": { 47 | "display_name": "Python 3", 48 | "language": "python", 49 | "name": "python3" 50 | }, 51 | "language_info": { 52 | "codemirror_mode": { 53 | "name": "ipython", 54 | "version": 3 55 | }, 56 | "file_extension": ".py", 57 | "mimetype": "text/x-python", 58 | "name": "python", 59 | "nbconvert_exporter": "python", 60 | "pygments_lexer": "ipython3", 61 | "version": "3.6.4" 62 | } 63 | }, 64 | "nbformat": 4, 65 | "nbformat_minor": 2 66 | } 67 | -------------------------------------------------------------------------------- /chapter11-neural-networks/section09-0-bayesian-neural-nets-and-the-nips-2003-challenge.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 11.9. Bayesian Neural Nets and the NIPS 2003 Challenge\n", 8 | "\n", 9 | "A classification competition was held in 2003, in which five labeled training datasets were provided to participants. It was organized for a Neural Information Processing Systems (NIPS) workshop. Each of the data sets constituted a two-class classification problem, with different sizes and from a variety of domains (see TABLE 11.2). Feature measurements for a validation dataset were also available.\n", 10 | "\n", 11 | "Participants developed and applied statistical learning procedures to make predictions on the datasets, and could submit predictions to a website on the validation set for a period of 12 weeks. With this feedback, participants were then asked to submit predictions for a separate test set and they received their results. Finally, the class labels for the validation set were released and participants had one week to train on the combined training and validation sets and submit their final predictions to the competition website. A total of 75 groups participated, with 20 and 16 eventually making submissions on the validation and test sets, respectively." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "#### TABLE 11.2. NIPS 2003 challenge data sets. The column labeled $p$ is the number of features. For the Dorothea dataset the features are binary. $N_{\\text{tr}}$, $N_{\\text{val}}$, and $N_{\\text{te}}$ are the number of training, validation and test cases, respectively.\n", 19 | "Dataset | Domain | Feature Type | $p$ | Percent Probes | $N_{\\text{tr}}$ | $N_{\\text{val}}$ | $N_{\\text{te}}$\n", 20 | "--- | --- | --- | --- | --- | --- | --- | ---\n", 21 | "Arcene | Mass spectrometry | Dense | 10,000 | 30 | 100 | 100 | 700\n", 22 | "Dexter | Text classification | Sparse | 20,000 | 50 | 300 | 300 | 2000\n", 23 | "Dorothea | Drug discovery | Sparse | 100,000 | 50 | 800 | 350 | 800\n", 24 | "Gisette | Digit recognition | Dense | 5,000 | 30 | 6000 | 1000 | 6500\n", 25 | "Madelon | Artificial | Dense | 500 | 96 | 2000 | 600 | 1800" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### Probes\n", 33 | "There was an emphasis on feature extraction in the competition. Artificial \"probes\" were added to the data: These are noise features with distributions resembling the real features but independent of the class labels. The percentage of probes that were added to each dataset, relative to the total set of features, is shown in TABLE 11.2. Thus each learning algorithm had to figure out a way of identifying the probes and downweighting or eliminating them."
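,
    "\n",
    "For concreteness, one simple way to manufacture such probes (an illustrative assumption; the organizers' actual construction may differ) is to permute copies of real feature columns: the permuted columns keep their marginal distributions but are independent of the class labels.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(7)\n",
    "N, p, n_probes = 100, 20, 10\n",
    "X = rng.normal(size=(N, p))\n",
    "\n",
    "cols = rng.integers(0, p, size=n_probes)\n",
    "probes = np.column_stack([rng.permutation(X[:, j]) for j in cols])\n",
    "X_aug = np.column_stack([X, probes])   # a third of the columns are now probes\n",
    "print(X_aug.shape)\n",
    "```"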
34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### Metrics and Winners\n", 41 | "A number of metrics were used to evaluate the entries, including the percentage correct on the test set, the area under the ROC curve, and a combined score that compared each pair of classifiers head-to-head. The results of the competition are very interesting and are detailed in Guyon et al. (2006). The most notable result: the entries of Neal and Zhang (2006) were the clear overall winners. In the final competition they finished first in three of the five datasets, and were 5th and 7th on the remaining two datasets.\n", 42 | "\n", 43 | "In their winning entries, Neal and Zhang (2006) used a series of preprocessing feature-selection steps, followed by Bayesian neural networks, Dirichlet diffusion trees, and combinations of these methods. Here we focus only on the Bayesian neural network approach, and try to discern which aspects of their approach were important for its success. We rerun their programs and compare the results to boosted neural networks and boosted trees, and other related methods." 44 | ] 45 | } 46 | ], 47 | "metadata": { 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "codemirror_mode": { 55 | "name": "ipython", 56 | "version": 3 57 | }, 58 | "file_extension": ".py", 59 | "mimetype": "text/x-python", 60 | "name": "python", 61 | "nbconvert_exporter": "python", 62 | "pygments_lexer": "ipython3", 63 | "version": "3.6.4" 64 | } 65 | }, 66 | "nbformat": 4, 67 | "nbformat_minor": 2 68 | } 69 | -------------------------------------------------------------------------------- /chapter11-neural-networks/section10-computational-considerations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 11.10. Computational Considerations\n", 8 | "\n", 9 | "With\n", 10 | "* $N$ observations,\n", 11 | "* $p$ predictors,\n", 12 | "* $M$ hidden units and\n", 13 | "* $L$ training epochs,\n", 14 | "\n", 15 | "a neural network fit typically requires $O(NpML)$ operations.\n", 16 | "\n", 17 | "There are many packages available for fitting neural networks, probably many more than exist for mainstream statistical methods. Because the available software varies widely in quality, and the learning problem for neural networks is sensitive to issues such as input scaling, such software should be carefully chosen and tested." 18 | ] 19 | } 20 | ], 21 | "metadata": { 22 | "kernelspec": { 23 | "display_name": "Python 3", 24 | "language": "python", 25 | "name": "python3" 26 | }, 27 | "language_info": { 28 | "codemirror_mode": { 29 | "name": "ipython", 30 | "version": 3 31 | }, 32 | "file_extension": ".py", 33 | "mimetype": "text/x-python", 34 | "name": "python", 35 | "nbconvert_exporter": "python", 36 | "pygments_lexer": "ipython3", 37 | "version": "3.6.4" 38 | } 39 | }, 40 | "nbformat": 4, 41 | "nbformat_minor": 2 42 | } 43 | -------------------------------------------------------------------------------- /data/heart/SAheart.info.txt: -------------------------------------------------------------------------------- 1 | A retrospective sample of males in a heart-disease high-risk region 2 | of the Western Cape, South Africa. There are roughly two controls per 3 | case of CHD. 
Many of the CHD positive men have undergone blood 4 | pressure reduction treatment and other programs to reduce their risk 5 | factors after their CHD event. In some cases the measurements were 6 | made after these treatments. These data are taken from a larger 7 | dataset, described in Rousseauw et al, 1983, South African Medical 8 | Journal. 9 | 10 | sbp systolic blood pressure 11 | tobacco cumulative tobacco (kg) 12 | ldl low density lipoprotein cholesterol 13 | adiposity 14 | famhist family history of heart disease (Present, Absent) 15 | typea type-A behavior 16 | obesity 17 | alcohol current alcohol consumption 18 | age age at onset 19 | chd response, coronary heart disease 20 | 21 | To read into R: 22 | read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data", 23 | sep=",",head=T,row.names=1) 24 | -------------------------------------------------------------------------------- /data/phoneme/phoneme.info.txt: -------------------------------------------------------------------------------- 1 | These data arose from a collaboration between Andreas Buja, Werner 2 | Stuetzle and Martin Maechler, and we used as an illustration in the 3 | paper on Penalized Discriminant Analysis by Hastie, Buja and 4 | Tibshirani (1995), referenced in the text. 5 | 6 | The data were extracted from the TIMIT database (TIMIT 7 | Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce) 8 | which is a widely used resource for research in speech recognition. A 9 | dataset was formed by selecting five phonemes for 10 | classification based on digitized speech from this database. The 11 | phonemes are transcribed as follows: "sh" as in "she", "dcl" as in 12 | "dark", "iy" as the vowel in "she", "aa" as the vowel in "dark", and 13 | "ao" as the first vowel in "water". From continuous speech of 50 male 14 | speakers, 4509 speech frames of 32 msec duration were selected, 15 | approximately 2 examples of each phoneme from each speaker. Each 16 | speech frame is represented by 512 samples at a 16kHz sampling rate, 17 | and each frame represents one of the above five phonemes. The 18 | breakdown of the 4509 speech frames into phoneme frequencies is as 19 | follows: 20 | 21 | aa ao dcl iy sh 22 | 695 1022 757 1163 872 23 | 24 | From each speech frame, we computed a log-periodogram, which is one of 25 | several widely used methods for casting speech data in a form suitable 26 | for speech recognition. Thus the data used in what follows consist of 27 | 4509 log-periodograms of length 256, with known class (phoneme) 28 | memberships. 29 | 30 | The data contain 256 columns labelled "x.1" - "x.256", a response 31 | column labelled "g", and a column labelled "speaker" identifying the 32 | different speakers. 33 | -------------------------------------------------------------------------------- /data/prostate/prostate.info.txt: -------------------------------------------------------------------------------- 1 | Prostate data info 2 | 3 | Predictors (columns 1--8) 4 | 5 | lcavol 6 | lweight 7 | age 8 | lbph 9 | svi 10 | lcp 11 | gleason 12 | pgg45 13 | 14 | outcome (column 9) 15 | 16 | lpsa 17 | 18 | train/test indicator (column 10) 19 | 20 | This last column indicates which 67 observations were used as the 21 | "training set" and which 30 as the test set, as described on page 48 22 | in the book. 23 | 24 | There was an error in these data in the first edition of this 25 | book. Subject 32 had a value of 6.1 for lweight, which translates to a 26 | 449 gm prostate! 
The correct value is 44.9 gm. We are grateful to 27 | Prof. Stephen W. Link for alerting us to this error. 28 | 29 | The features must first be scaled to have mean zero and variance 96 (=n) 30 | before the analyses in Tables 3.1 and beyond. That is, if x is the 96 by 8 matrix 31 | of features, we compute xp <- scale(x,TRUE,TRUE) 32 | 33 | -------------------------------------------------------------------------------- /data/zipcode/zip.info.txt: -------------------------------------------------------------------------------- 1 | Normalized handwritten digits, automatically 2 | scanned from envelopes by the U.S. Postal Service. The original 3 | scanned digits are binary and of different sizes and orientations; the 4 | images here have been deslanted and size normalized, resulting 5 | in 16 x 16 grayscale images (Le Cun et al., 1990). 6 | 7 | The data are in two gzipped files, and each line consists of the digit 8 | id (0-9) followed by the 256 grayscale values. 9 | 10 | There are 7291 training observations and 2007 test observations, 11 | distributed as follows: 12 | 0 1 2 3 4 5 6 7 8 9 Total 13 | Train 1194 1005 731 658 652 556 664 645 542 644 7291 14 | Test 359 264 198 166 200 160 170 147 166 177 2007 15 | 16 | or as proportions: 17 | 0 1 2 3 4 5 6 7 8 9 18 | Train 0.16 0.14 0.1 0.09 0.09 0.08 0.09 0.09 0.07 0.09 19 | Test 0.18 0.13 0.1 0.08 0.10 0.08 0.08 0.07 0.08 0.09 20 | 21 | 22 | Alternatively, the training data are available as separate files per 23 | digit (and hence without the digit identifier in each row) 24 | 25 | The test set is notoriously "difficult", and a 2.5% error rate is 26 | excellent. These data were kindly made available by the neural network 27 | group at AT&T research labs (thanks to Yann Le Cunn). 28 | 29 | 30 | -------------------------------------------------------------------------------- /data/zipcode/zip.test.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/data/zipcode/zip.test.gz -------------------------------------------------------------------------------- /data/zipcode/zip.train.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/data/zipcode/zip.train.gz -------------------------------------------------------------------------------- /module/cv.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | 5 | def index_tenfold(n:int) ->np.ndarray: 6 | """Produce index array for tenfold CV with dataframe length n.""" 7 | original_indices = np.arange(n) 8 | tenfold_indices = np.zeros(n) 9 | 10 | div, mod = divmod(n, 10) 11 | unit_sizes = [div for _ in range(10)] 12 | for i in range(mod): 13 | unit_sizes[i] += 1 14 | 15 | for k, unit_size in enumerate(unit_sizes): 16 | tenfold = np.random.choice(original_indices, unit_size, 17 | replace=False) 18 | tenfold_indices[tenfold] = k 19 | original_indices = np.delete( 20 | original_indices, 21 | [np.argwhere(original_indices == val) for val in tenfold], 22 | ) 23 | # print(tenfold, original_indices) 24 | return tenfold_indices -------------------------------------------------------------------------------- /references/ESLII_print12.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/references/ESLII_print12.pdf -------------------------------------------------------------------------------- /references/lars.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/references/lars.pdf --------------------------------------------------------------------------------