├── .gitignore ├── README.rst ├── articles ├── a-journey-to-the-tip-of-neural-networks-kr.ipynb ├── a-journey-to-the-tip-of-neural-networks.ipynb ├── fig11-2-neural-network.jpg ├── kaggle-a-few-practical-thoughts-on-titanic.ipynb ├── tutorial-deep-learning-for-nlp-with-pytorch-1-kr.ipynb ├── tutorial-deep-learning-for-nlp-with-pytorch-2-kr.ipynb ├── tutorial-deep-learning-for-nlp-with-pytorch-3-kr.ipynb ├── why-is-logistic-regression-called-linear-method-kr.ipynb └── why-is-logistic-regression-called-linear-method.ipynb ├── chapter02-overview-of-supervised-learning ├── section3-least-squares-and-nearest-neighbors.ipynb ├── section4-statistical-decision-theory.ipynb ├── section5-local-methods-in-high-dimensions.ipynb ├── section6-statistical-methods-supervised-learning-and-function-approximation.ipynb ├── section7-structured-regression-models.ipynb ├── section8-classes-of-restricted-estimators.ipynb └── section9-model-selection-and-the-bias-variance-tradeoff.ipynb ├── chapter03-linear-methods-for-regression ├── section1-introduction.ipynb ├── section2-0-linear-regression-models-and-least-squares.ipynb ├── section2-1-example-prostate-cancer.ipynb ├── section2-2-the-gauss-markov-theorem.ipynb ├── section2-3-multiple-regression-from-simple-univariate-regression.ipynb ├── section2-4-multiple-outputs.ipynb ├── section3-subset-selection.ipynb ├── section4-0-shrinkage-methods.ipynb ├── section4-1-ridge-regression.ipynb ├── section4-2-the-lasso.ipynb ├── section4-3-discussion-subset-selection-ridge-lasso.ipynb ├── section4-4-least-angle-regression.ipynb ├── section5-0-methods-using-derived-input-directions.ipynb ├── section5-1-principal-components-regression.ipynb ├── section5-2-partial-least-squares.ipynb └── section6-discussion-a-comparison-of-the-selection-and-shrinkage-methods.ipynb ├── chapter04-linear-methods-for-classification ├── section1-introduction.ipynb ├── section2-linear-regression-of-an-indicator-matrix.ipynb ├── section3-0-linear-discriminant-analysis.ipynb ├── section3-1-regularized-discriminant-analysis.ipynb ├── section3-2-computations-for-lda.ipynb ├── section3-3-reduced-rank-linear-discriminant-analysis.ipynb ├── section4-0-logistic-regression.ipynb ├── section4-1-fitting-logistic-regression-models.ipynb ├── section4-2-example-south-african-heart-disease.ipynb ├── section4-3-quadratic-approximations-and-inference.ipynb ├── section4-4-l1-regularized-logistic-regression.ipynb ├── section4-5-logistic-or-lda.ipynb ├── section5-0-separating-hyperplanes.ipynb ├── section5-1-rosenblatt-perceptron-learning-algorithm.ipynb └── section5-2-optimal-separating-hyperplanes.ipynb ├── chapter05-basis-expansions-and-regularization ├── section1-introduction.ipynb ├── section2-0-piecewise-polynomials-and-splines.ipynb ├── section2-1-natural-cubic-splines.ipynb ├── section2-2-example-south-african-heart-disease-continued.ipynb ├── section2-3-example-phoneme-recognition.ipynb ├── section3-filtering-and-feature-extraction.ipynb ├── section4-0-smoothing-splines.ipynb ├── section4-1-degrees-of-freedom-and-smoother-matrices.ipynb ├── section5-0-automatic-selection-of-the-smoothing-parameters.ipynb ├── section5-1-fixing-the-degrees-of-freedom.ipynb ├── section5-2-the-biase-variance-tradeoff.ipynb ├── section6-nonparametric-logistic-regression.ipynb ├── section7-multidimensional-splines.ipynb ├── section8-0-regularization-and-reproducing-kernel-hilbert-space.ipynb ├── section8-1-spaces-of-fucntions-generated-by-kernels.ipynb ├── section8-2-example-of-rkhs.ipynb ├── section9-0-wavelet-smoothing.ipynb ├── 
section9-1-wavelet-bases-and-the-wavelet-transform.ipynb └── section9-2-adaptive-wavelet-filtering.ipynb ├── chapter06-kernel-smoothing-methods ├── section0-introduction.ipynb ├── section1-0-one-dimensional-kernel-smoothers.ipynb ├── section1-1-local-linear-regression.ipynb ├── section1-2-local-polynomial-regression.ipynb ├── section2-selecting-the-width-of-the-kernel.ipynb ├── section3-local-regression-in-higher-dimensions.ipynb ├── section4-0-structured-local-regression-models.ipynb ├── section4-1-structured-kernels.ipynb ├── section4-2-structured-regression-functions.ipynb ├── section5-local-likelihood-and-other-models.ipynb ├── section6-0-kernel-density-estimation-and-classification.ipynb ├── section6-1-kernel-density-estimation.ipynb ├── section6-2-kernel-density-classification.ipynb ├── section6-3-the-naive-bayes-classifier.ipynb ├── section7-radial-basis-functions-and-kernels.ipynb ├── section8-mixture-models-for-density-estimation-and-classification.ipynb └── section9-computational-considerations.ipynb ├── chapter07-model-assessment-and-selection ├── fig7-12.jpg ├── fig7-2.jpg ├── section01-introduction.ipynb ├── section02-bias-variance-and-model-complexity.ipynb ├── section03-0-the-bias-variance-decomposition.ipynb ├── section03-1-example-bias-variance-tradeoff.ipynb ├── section04-optimism-of-the-training-error-rate.ipynb ├── section05-estimate-of-in-sample-prediction-error.ipynb ├── section06-the-effective-number-of-parameters.ipynb ├── section07-the-bayesian-approach-and-bic.ipynb ├── section08-minimum-description-length.ipynb ├── section10-0-cross-validation.ipynb ├── section10-1-k-fold-cross-validation.ipynb ├── section10-2-the-wrong-and-right-way-to-do-cross-validation.ipynb ├── section10-3-does-cross-validation-really-work.ipynb └── section11-bootstrap-methods.ipynb ├── chapter11-neural-networks ├── fig11-10.jpg ├── fig11-12.jpg ├── section01-introduction.ipynb ├── section02-projection-pursuit-regression.ipynb ├── section03-neural-networks.ipynb ├── section04-fitting-neural-networks.ipynb ├── section05-some-issues-in-training-neural-networks.ipynb ├── section06-example-simulated-data.ipynb ├── section07-example-zip-code-data.ipynb ├── section08-discussion.ipynb ├── section09-0-bayesian-neural-nets-and-the-nips-2003-challenge.ipynb ├── section09-1-bayes-boosting-and-bagging.ipynb ├── section09-2-performance-comparisons.ipynb └── section10-computational-considerations.ipynb ├── data ├── heart │ ├── SAheart.data │ └── SAheart.info.txt ├── phoneme │ ├── phoneme.data │ └── phoneme.info.txt ├── prostate │ ├── prostate.data │ └── prostate.info.txt ├── titanic │ ├── test.csv │ └── train.csv ├── vowel │ ├── vowel.info.txt │ ├── vowel.test │ └── vowel.train └── zipcode │ ├── zip.info.txt │ ├── zip.test │ ├── zip.test.gz │ ├── zip.train │ └── zip.train.gz ├── module └── cv.py └── references ├── ESLII_print12.pdf └── lars.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | .ipynb_checkpoints 3 | __pycache__ 4 | 5 | Pipfile.lock 6 | 7 | Untitled* 8 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ================================================================ 2 | Jupyter Notebooks for the Elements of Statistical Learning (WIP) 3 | ================================================================ 4 | 5 | It aims to summarize and reproduce the textbook "The Elements of Statistical Learning" 2/E by 
Hastie, Tibshirani, and Friedman. 6 | 7 | Currently working through the early chapters, I try to implement the algorithms that the textbook introduces without frameworks such as scikit-learn, in order to show how they work. 8 | 9 | Starting with the neural networks chapters, I decided to use PyTorch_, which seems less magical (they describe ``torch.Tensor`` as a ``numpy.ndarray`` with GPU support). 10 | 11 | .. _PyTorch: //pytorch.org 12 | 13 | 14 | Installation 15 | ------------ 16 | 17 | Use your favorite virtualenv system and install the dependencies below; they are quite standard. 18 | 19 | * numpy 20 | * scipy 21 | * matplotlib 22 | * pandas 23 | * jupyter 24 | * pytorch 25 | * scikit-learn (optional, used in my own articles) 26 | 27 | .. code-block:: bash 28 | 29 | (esl) $ pip install ipython numpy scipy matplotlib pandas jupyter 30 | 31 | # The command below installs pytorch for Python 3.6 without CUDA support. 32 | # For other settings, consult pytorch.org. 33 | (esl) $ pip install http://download.pytorch.org/whl/cpu/torch-0.3.1-cp36-cp36m-linux_x86_64.whl 34 | 35 | 36 | Execution 37 | --------- 38 | 39 | Just run ``jupyter notebook``. 40 | -------------------------------------------------------------------------------- /articles/fig11-2-neural-network.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgkim5360/the-elements-of-statistical-learning-notebooks/2c13a4818379451bcce802bcd7917f4878e12977/articles/fig11-2-neural-network.jpg -------------------------------------------------------------------------------- /chapter02-overview-of-supervised-learning/section7-structured-regression-models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 2.7. Structured Regression Models\n", 8 | "\n", 9 | "### Review & motivation\n", 10 | "\n", 11 | "We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data.\n", 12 | "\n", 13 | "This section introduces classes of such structured approaches. Before we proceed, though, we discuss further the need for such classes." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## $\\S$ 2.7.1. Difficulty of the Problem\n", 21 | "\n", 22 | "Consider the RSS criterion for an arbitrary function $f$,\n", 23 | "\n", 24 | "\\begin{equation}\n", 25 | "\\text{RSS}(f) = \\sum_{i=1}^N \\left( y_i - f(x_i) \\right)^2.\n", 26 | "\\end{equation}\n", 27 | "\n", 28 | "Minimizing the RSS leads to infinitely many solutions: Any function $\\hat{f}$ passing through the training points $(x_i,y_i)$ is a solution. Any particular solution chosen might be a poor predictor at test points different from the training points.\n", 29 | "\n", 30 | "If there are multiple observation pairs $(x_i,y_{il})$, $l=1,\\cdots,N_i$, at each value of $x_i$, the risk is limited. In this case, the solutions pass through the average values of the $y_{il}$ at each $x_i$ (Exercise 2.6). 
The situation is similar to the one we have already visited in $\\S$ 2.4; indeed, the above RSS is the finite sample version of the expected prediction error\n", 31 | "\n", 32 | "\\begin{equation}\n", 33 | "\\text{EPE}(f) = \\text{E}\\left( Y - f(X) \\right)^2 = \\int \\left( y - f(x) \\right)^2 \\text{Pr}(dx, dy).\n", 34 | "\\end{equation}" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Necessity & limit of the restriction\n", 42 | "\n", 43 | "If the sample size $N$ were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.\n", 44 | "\n", 45 | "In order to obtain useful results for finite $N$, we must restrict the eligible solutions to the RSS to a smaller set of functions.\n", 46 | "\n", 47 | "> How to decide on the nature of the restrictions is based on considerations outside of the data.\n", 48 | "\n", 49 | "These restrictions are sometimes\n", 50 | "* encoded via the parametric representation of $f_\\theta$, or\n", 51 | "* may be built into the learning method itself, either implicitly or explicitly.\n", 52 | "\n", 53 | "> These restricted classes of solutions are the major topic of this book.\n", 54 | "\n", 55 | "One thing should be clear, though.\n", 56 | "\n", 57 | "> Any restrictions imposed on $f$ that lead to a unique solution to RSS do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint." 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "### Complexity\n", 65 | "\n", 66 | "In general the constraints imposed by most learning methods can be described as _complexity_ restrictions of one kind or another.\n", 67 | "\n", 68 | "> This usually means some kind of regular behavior in small neighborhoods of the input space.\n", 69 | "\n", 70 | "That is, for all input points $x$ sufficiently close to each other in some metric, $\\hat{f}$ exhibits some special structure such as\n", 71 | "* nearly constant,\n", 72 | "* linear or\n", 73 | "* low-order polynomial behavior.\n", 74 | "\n", 75 | "The estimator is then obtained by averaging or polynomial fitting in that neighborhood.\n", 76 | "\n", 77 | "The strength of the constraint is dictated by the neighborhood size.\n", 78 | "\n", 79 | "> The larger the size, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint.\n", 80 | "\n", 81 | "For example,\n", 82 | "* local constant fits in infinitesimally small neighborhoods amount to no constraint at all;\n", 83 | "* local linear fits in very large neighborhoods amount to almost a globally linear model, and are very restrictive." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "### Metric\n", 91 | "\n", 92 | "The nature of the constraint depends on the metric used.\n", 93 | "\n", 94 | "Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. 
The kNN methods discussed so far are based on the assumption that locally the function is constant; close to a target input $x_0$, the function does not change much, and so close outputs can be averaged to produce $\\hat{f}(x_0)$.\n", 95 | "\n", 96 | "Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior. In $\\S$ 5.4.1 we discuss the concept of an _equivalent kernel_, which describes this local dependence for any method linear in the outputs. These equivalent kernels in many cases look just like the explicitly defined weighting kernels discussed above -- peaked at the target point and falling smoothly away from it." 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "### Curse of dimensionality\n", 104 | "\n", 105 | "One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions -- again the curse of dimensionality.\n", 106 | "\n", 107 | "And conversely, all methods that overcome the dimensionality problems have an associated -- and often implicit or adaptive -- metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions." 108 | ] 109 | } 110 | ], 111 | "metadata": { 112 | "kernelspec": { 113 | "display_name": "Python 3", 114 | "language": "python", 115 | "name": "python3" 116 | }, 117 | "language_info": { 118 | "codemirror_mode": { 119 | "name": "ipython", 120 | "version": 3 121 | }, 122 | "file_extension": ".py", 123 | "mimetype": "text/x-python", 124 | "name": "python", 125 | "nbconvert_exporter": "python", 126 | "pygments_lexer": "ipython3", 127 | "version": "3.6.4" 128 | } 129 | }, 130 | "nbformat": 4, 131 | "nbformat_minor": 2 132 | } 133 | -------------------------------------------------------------------------------- /chapter02-overview-of-supervised-learning/section9-model-selection-and-the-bias-variance-tradeoff.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 2.9. Model Selection and the Bias-Variance Tradeoff\n", 8 | "\n", 9 | "### Review\n", 10 | "\n", 11 | "All the models described so far have a *smoothing* or *complexity* parameter that has to be determined:\n", 12 | "* The multiplier of the penalty term;\n", 13 | "* the width of the kernel;\n", 14 | "* or the number of basis functions.\n", 15 | "\n", 16 | "In the case of the smoothing spline, the parameter $\\lambda$ indexes models ranging from a straight line fit to the interpolating model.\n", 17 | "\n", 18 | "Similarly a local degree-$m$ polynomial model ranges between a degree-$m$ global polynomial when the window size is infinitely large, to an interpolating fit when the window size shrinks to zero.\n", 19 | "\n", 20 | "This means that we cannot use residual sum-of-squares on the training data to determine these parameters as well, since we would always pick those that gave interpolating fits and hence zero residuals. Such a model is unlikely to predict future data well at all."
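    ,
    "\n",
    "A small simulation (added here as an illustration; it is not part of the textbook) makes the point concrete: a $k$-nearest-neighbor fit with $k=1$ interpolates the training data and so has zero training error, yet a smoother fit with a larger $k$ predicts new data better.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "\n",
    "def knn_predict(x_train, y_train, x_query, k):\n",
    "    # average the y-values of the k nearest training points (1-d inputs)\n",
    "    dist = np.abs(x_query[:, None] - x_train[None, :])\n",
    "    idx = np.argsort(dist, axis=1)[:, :k]\n",
    "    return y_train[idx].mean(axis=1)\n",
    "\n",
    "f = np.sin  # the 'true' regression function, chosen arbitrarily for the demo\n",
    "x_train = rng.uniform(0, 2*np.pi, 100)\n",
    "y_train = f(x_train) + rng.normal(scale=.5, size=100)\n",
    "x_test = rng.uniform(0, 2*np.pi, 1000)\n",
    "y_test = f(x_test) + rng.normal(scale=.5, size=1000)\n",
    "\n",
    "for k in (1, 15):\n",
    "    train_err = np.mean((y_train - knn_predict(x_train, y_train, x_train, k))**2)\n",
    "    test_err = np.mean((y_test - knn_predict(x_train, y_train, x_test, k))**2)\n",
    "    print(k, train_err, test_err)\n",
    "# k=1 attains zero training error but (typically) the larger test error\n",
    "```"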
21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### The bias-variance tradeoff for the kNN\n", 28 | "\n", 29 | "The kNN regression fit $\\hat{f}_k(x_0)$ usefully illustrates the competing forces that affect the predictive ability of such approximations.\n", 30 | "\n", 31 | "Suppose\n", 32 | "* the data arise from a model $Y=f(X)+\\epsilon$, with $\\text{E}(\\epsilon)=0$ and $\\text{Var}(\\epsilon)=\\sigma^2$;\n", 33 | "* for simplicity here the values of $x_i$ in the sample are fixed in advance (nonrandom).\n", 34 | "\n", 35 | "The expected prediction error at $x_0$, a.k.a. _test_ or _generalization_ error, can be decomposed:\n", 36 | "\n", 37 | "\\begin{align}\n", 38 | "\\text{EPE}_k(x_0) &= \\text{E}\\left[(Y-\\hat{f}_k(x_0))^2|X=x_0\\right] \\\\\n", 39 | "&= \\text{E}\\left[(Y -f(x_0) + f(x_0) -\\hat{f}_k(x_0))^2|X=x_0\\right] \\\\\n", 40 | "&= \\text{E}(\\epsilon^2) + 2\\text{E}\\left[\\epsilon(f(x_0) -\\hat{f}_k(x_0))|X=x_0\\right] + \\text{E}\\left[\\left(f(x_0)-\\hat{f}_k(x_0)\\right)^2|X=x_0\\right] \\\\\n", 41 | "&= \\sigma^2 + 0+ \\left[\\text{Bias}^2(\\hat{f}_k(x_0))+\\text{Var}_\\mathcal{T}(\\hat{f}_k(x_0))\\right] \\\\\n", 42 | "&= \\sigma^2 + \\left(f(x_0) - \\frac{1}{k}\\sum_{l=1}^k f(x_{(l)})\\right)^2 + \\frac{\\sigma^2}{k}\n", 43 | "\\end{align},\n", 44 | "\n", 45 | "where subscripts in parentheses $(l)$ indicate the sequence of nearest neighbors to $x_0$.\n", 46 | "\n", 47 | "There are three terms in this expression.\n", 48 | "\n", 49 | "#### Irreducible error\n", 50 | "The first term $\\sigma^2$ is the *irreducible* error -- the variance of the new test target -- and is beyond our control, even if we know the true $f(x_0)$.\n", 51 | "\n", 52 | "The second and third terms are under our control, and make up the _mean squared error_ of $\\hat{f}_k(x_0)$ in estimating $f(x_0)$, which is broken down into a bias component and a variance component.\n", 53 | "\n", 54 | "#### Bias\n", 55 | "The bias term is the squared difference between the true mean $f(x_0)$ and the expected value of the estimate, i.e.,\n", 56 | "\n", 57 | "\\begin{equation}\n", 58 | "\\left[ \\text{E}_\\mathcal{T} \\left( \\hat{f}_k(x_0) \\right) - f(x_0) \\right]^2,\n", 59 | "\\end{equation}\n", 60 | "\n", 61 | "where the expectation averages the randomness in the training data.\n", 62 | "\n", 63 | "This term will most likely increase with $k$, if the true function is reasonably smooth. For small $k$ the few closest neighbors will have values $f(x_{(l)})$ close to $f(x_0)$, so their average should be close to $f(x_0)$. As $k$ grows, the neighbors are further away, and then anything can happen.\n", 64 | "\n", 65 | "#### Variance\n", 66 | "The variance term is simply the variance of an average here, and decreases as the inverse of $k$.\n", 67 | "\n", 68 | "#### Finally, the tradeoff\n", 69 | "So as $k$ varies, there is a *bias-variance tradeoff*.\n", 70 | "\n", 71 | "More generally, as the _model complexity_ of our procedure is increased, the variance tends to increase and the squared bias tends to decrease, and vice versa. For kNN, the model complexity is controlled by $k$.\n", 72 | "\n", 73 | "Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. 
An obvious estimate of test error is _training error_\n", 74 | "\n", 75 | "\\begin{equation}\n", 76 | "\\frac{1}{N} \\sum_i (y_i - \\hat{y}_i)^2.\n", 77 | "\\end{equation}\n", 78 | "\n", 79 | "Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity." 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "### Interpretation & implication\n", 87 | "FIGURE 2.11 shows the typical behavior of the test and training error, as model complexity is varied.\n", 88 | "\n", 89 | "> The training error tends to decrease whenever we increase the model complexity, i.e., whenever we fit the data harder.\n", 90 | "\n", 91 | "However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\\hat{f}(x_0)$ will have large variance, as reflected in the above EPE expression.\n", 92 | "\n", 93 | "In contrast, if the model is not complex enough, it will _underfit_ and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set." 94 | ] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 3 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython3", 113 | "version": "3.6.4" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 2 118 | } 119 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section1-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 3. Linear Methods for Regression\n", 8 | "# $\\S$ 3.1. Introduction\n", 9 | "\n", 10 | "A linear regression model assumes that the regression function $\\text{E}(Y|X)$ is linear in the inputs $X_1,\\cdots,X_p$. Linear models were largely developed in the precomputer age of statistics, but even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output.\n", 11 | "\n", 12 | "For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data.\n", 13 | "\n", 14 | "Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods (Chapter 5).\n", 15 | "\n", 16 | "In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification.\n", 17 | "\n", 18 | "> On some topics we go into considerable detail, as it is our firm belief that an understanding of linear methods is essential for understanding nonlinear ones.\n", 19 | "\n", 20 | "In fact, many nonlinear techniques are direct generalizations of the linear methods discussed here."
21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## $\\S$ 3.2.2. The Gauss-Markov Theorem\n", 28 | "\n", 29 | "One of the most famous results in statistics asserts that\n", 30 | "\n", 31 | "> the least squares estimates of the parameter $\\beta$ have the smallest variance among all linear unbiased estimates.\n", 32 | "\n", 33 | "We will make this precise here, and also make clear that\n", 34 | "\n", 35 | "> the restriction to unbiased estimates is not necessarily a wise one.\n", 36 | "\n", 37 | "This observation will lead us to consider biased estimates such as ridge regression later in the chapter." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### The statement of the theorem\n", 45 | "\n", 46 | "We focus on estimation of any linear combination of the parameters $\\theta=a^T\\beta$. The least squares estimate of $a^T\\beta$ is\n", 47 | "\n", 48 | "\\begin{equation}\n", 49 | "\\hat\\theta = a^T\\hat\\beta = a^T\\left(\\mathbf{X}^T\\mathbf{X}\\right)^{-1}\\mathbf{X}^T\\mathbf{y}.\n", 50 | "\\end{equation}\n", 51 | "\n", 52 | "Considering $\\mathbf{X}$ to be fixed and assuming that the linear model is correct, $a^T\\hat\\beta$ is unbiased since\n", 53 | "\n", 54 | "\\begin{align}\n", 55 | "\\text{E}(a^T\\hat\\beta) &= \\text{E}\\left(a^T(\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T\\mathbf{y}\\right) \\\\\n", 56 | "&= a^T(\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T\\mathbf{X}\\beta \\\\\n", 57 | "&= a^T\\beta\n", 58 | "\\end{align}\n", 59 | "\n", 60 | "The Gauss-Markov Theorem states that if we have any other linear estimator $\\tilde\\theta = \\mathbf{c}^T\\mathbf{y}$ that is unbiased for $a^T\\beta$, that is, $\\text{E}(\\mathbf{c}^T\\mathbf{y})=a^T\\beta$, then\n", 61 | "\n", 62 | "\\begin{equation}\n", 63 | "\\text{Var}(a^T\\hat\\beta) \\le \\text{Var}(\\mathbf{c}^T\\mathbf{y}).\n", 64 | "\\end{equation}\n", 65 | "\n", 66 | "The proof (Exercise 3.3) uses the triangle inequality.\n", 67 | "\n", 68 | "For simplicity we have stated the result in terms of estimation of a single parameter $a^T \\beta$, but with a few more definitions one can state it in terms of the entire parameter vector $\\beta$ (Exercise 3.3)."
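    ,
    "\n",
    "A quick simulation (an added sketch, not part of the textbook) previews why the restriction to unbiased estimates can be unwise: a ridge-shrunken estimator is biased, yet for this setup it attains a smaller mean squared error than least squares. The problem sizes and the penalty are arbitrary demo settings.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "n, p, sigma, lam = 30, 10, 2.0, 5.0   # arbitrary demo settings\n",
    "X = rng.normal(size=(n, p))\n",
    "beta = np.ones(p)\n",
    "\n",
    "ols_err, ridge_err = [], []\n",
    "for _ in range(2000):\n",
    "    y = X @ beta + sigma * rng.normal(size=n)\n",
    "    b_ols = np.linalg.solve(X.T @ X, X.T @ y)\n",
    "    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # biased, lower variance\n",
    "    ols_err.append(((b_ols - beta)**2).sum())\n",
    "    ridge_err.append(((b_ridge - beta)**2).sum())\n",
    "\n",
    "print('least squares MSE:', np.mean(ols_err))\n",
    "print('ridge (biased) MSE:', np.mean(ridge_err))  # typically smaller here\n",
    "```"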
49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "### Implications of the Gauss-Markov theorem\n", 56 | "\n", 57 | "Consider the mean squared error of an estimator $\\tilde\\theta$ of $\\theta$:\n", 58 | "\n", 59 | "\\begin{align}\n", 60 | "\\text{MSE}(\\tilde\\theta) &= \\text{E}\\left(\\tilde\\theta-\\theta\\right)^2 \\\\\n", 61 | "&= \\text{Var}\\left(\\tilde\\theta\\right) + \\left[\\text{E}\\left(\\tilde\\theta-\\theta\\right)\\right]^2 \\\\\n", 62 | "&= \\text{Var} + \\text{Bias}^2\n", 63 | "\\end{align}\n", 64 | "\n", 65 | "The Gauss-Markov theorem implies that the least squares estimator has the smallest MSE of all linear estimators with no bias. However there may well exist a biased estimator with smaller MSE. Such an estimator would trade a little bias for a larger reduction in variance.\n", 66 | "\n", 67 | "Biased estimates are commonly used. Any method that shrinks or sets to zero some of the least squares coefficients may result in a biased estimate. We discuss many examples, including variable subset selection and ridge regression, later in this chapter.\n", 68 | "\n", 69 | "From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance. We go into these issues in more detail in Chapter 7." 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### Relation between prediction accuracy and MSE\n", 77 | "\n", 78 | "MSE is intimately related to prediction accuracy, as discussed in Chapter 2.\n", 79 | "\n", 80 | "Consider the prediction of the new response at input $x_0$,\n", 81 | "\n", 82 | "\\begin{equation}\n", 83 | "Y_0 = f(x_0) + \\epsilon.\n", 84 | "\\end{equation}\n", 85 | "\n", 86 | "Then the expected prediction error of an estimate $\\tilde{f}(x_0)=x_0^T\\tilde\\beta$ is\n", 87 | "\n", 88 | "\\begin{align}\n", 89 | "\\text{E}(Y_0 - \\tilde{f}(x_0))^2 &= \\text{E}\\left(Y_0 -f(x_0)+f(x_0) - \\tilde{f}(x_0)\\right)^2\\\\\n", 90 | "&= \\sigma^2 + \\text{E}\\left(x_o^T\\tilde\\beta - f(x_0)\\right)^2 \\\\\n", 91 | "&= \\sigma^2 + \\text{MSE}\\left(\\tilde{f}(x_0)\\right).\n", 92 | "\\end{align}\n", 93 | "\n", 94 | "Therefore, expected prediction error and MSE differ only by the constant $\\sigma^2$, representing the variance of the new observation $y_0$." 95 | ] 96 | } 97 | ], 98 | "metadata": { 99 | "kernelspec": { 100 | "display_name": "Python 3", 101 | "language": "python", 102 | "name": "python3" 103 | }, 104 | "language_info": { 105 | "codemirror_mode": { 106 | "name": "ipython", 107 | "version": 3 108 | }, 109 | "file_extension": ".py", 110 | "mimetype": "text/x-python", 111 | "name": "python", 112 | "nbconvert_exporter": "python", 113 | "pygments_lexer": "ipython3", 114 | "version": "3.6.4" 115 | } 116 | }, 117 | "nbformat": 4, 118 | "nbformat_minor": 2 119 | } 120 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section2-4-multiple-outputs.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 3.2.4. 
Multiple Outputs\n", 8 | "\n", 9 | "Suppose we have\n", 10 | "* multiple outputs $Y_1,Y_2,\\cdots,Y_K$\n", 11 | "* inputs $X_0,X_1,\\cdots,X_p$\n", 12 | "* a linear model for each output \n", 13 | "\\begin{align}\n", 14 | "Y_k &= \\beta_{0k} + \\sum_{j=1}^p X_j\\beta_{jk} + \\epsilon_k \\\\\n", 15 | "&= f_k(X) + \\epsilon_k\n", 16 | "\\end{align}\n", 17 | "\n", 18 | "> the coefficients for the $k$th outcome are just the least squares estimates in the regression of $y_k$ on $x_0,x_1,\\cdots,x_p$. Multiple outputs do not affect one another’s least squares estimates.\n", 19 | "\n", 20 | "With $N$ training cases we can write the model in matrix notation\n", 21 | "\n", 22 | "\\begin{equation}\n", 23 | "\\mathbf{Y}=\\mathbf{XB}+\\mathbf{E},\n", 24 | "\\end{equation}\n", 25 | "\n", 26 | "where\n", 27 | "* $\\mathbf{Y}$ is $N\\times K$ with $ik$ entry $y_{ik}$,\n", 28 | "* $\\mathbf{X}$ is $N\\times(p+1)$ input matrix,\n", 29 | "* $\\mathbf{B}$ is $(p+1)\\times K$ parameter matrix,\n", 30 | "* $\\mathbf{E}$ is $N\\times K$ error matrix.\n", 31 | "\n", 32 | "A straightforward generalization of the univariate loss function is\n", 33 | "\n", 34 | "\\begin{align}\n", 35 | "\\text{RSS}(\\mathbf{B}) &= \\sum_{k=1}^K \\sum_{i=1}^N \\left( y_{ik} - f_k(x_i) \\right)^2 \\\\\n", 36 | "&= \\text{trace}\\left( (\\mathbf{Y}-\\mathbf{XB})^T(\\mathbf{Y}-\\mathbf{XB}) \\right)\n", 37 | "\\end{align}\n", 38 | "\n", 39 | "The least squares estimates have exactly the same form as before\n", 40 | "\n", 41 | "\\begin{equation}\n", 42 | "\\hat{\\mathbf{B}} = \\left(\\mathbf{X}^T\\mathbf{X}\\right)^{-1}\\mathbf{X}^T\\mathbf{Y}.\n", 43 | "\\end{equation}" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Correlated errors\n", 51 | "\n", 52 | "If the errors $\\epsilon = (\\epsilon_1,\\cdots,\\epsilon_K)$ are correlated with $\\text{Cov}(\\epsilon)=\\mathbf{\\Sigma}$, then the multivariate weighted criterion\n", 53 | "\n", 54 | "\\begin{equation}\n", 55 | "\\text{RSS}(\\mathbf{B};\\mathbf{\\Sigma}) = \\sum_{i=1}^N (y_i-f(x_i))^T \\mathbf{\\Sigma}^{-1} (y_i-f(x_i))\n", 56 | "\\end{equation}\n", 57 | "\n", 58 | "arises naturally from multivariate Gaussian theory. Here\n", 59 | "* $f(x) = \\left(f_1(x),\\cdots,f_K(x)\\right)^T$ is the vector function,\n", 60 | "* $y_i$ the vector of $K$ responses for observation $i$.\n", 61 | "\nHowever, the solution is again the same as the one obtained by ignoring the correlations:\n", 62 | "\n", 63 | "\\begin{equation}\n", 64 | "\\hat{\\mathbf{B}} = \\left(\\mathbf{X}^T\\mathbf{X}\\right)^{-1}\\mathbf{X}^T\\mathbf{Y}.\n", 65 | "\\end{equation}\n", 66 | "\n", 67 | "In Section 3.7 we pursue the multiple output problem and consider situations where it does pay to combine the regressions."
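    ,
    "\n",
    "A quick numerical check (added, not from the textbook) of the claim that multiple outputs do not affect one another's least squares estimates:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "N, p, K = 50, 3, 2\n",
    "X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # includes the intercept column\n",
    "B = rng.normal(size=(p + 1, K))\n",
    "Y = X @ B + rng.normal(size=(N, K))\n",
    "\n",
    "Bhat = np.linalg.solve(X.T @ X, X.T @ Y)   # multivariate least squares, all outputs at once\n",
    "Bsep = np.column_stack([np.linalg.solve(X.T @ X, X.T @ Y[:, k]) for k in range(K)])\n",
    "print(np.allclose(Bhat, Bsep))             # True: each column is a separate regression\n",
    "```"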
68 | ] 69 | } 70 | ], 71 | "metadata": { 72 | "kernelspec": { 73 | "display_name": "Python 3", 74 | "language": "python", 75 | "name": "python3" 76 | }, 77 | "language_info": { 78 | "codemirror_mode": { 79 | "name": "ipython", 80 | "version": 3 81 | }, 82 | "file_extension": ".py", 83 | "mimetype": "text/x-python", 84 | "name": "python", 85 | "nbconvert_exporter": "python", 86 | "pygments_lexer": "ipython3", 87 | "version": "3.5.2" 88 | } 89 | }, 90 | "nbformat": 4, 91 | "nbformat_minor": 2 92 | } 93 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section4-0-shrinkage-methods.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 3.4. Shrinkage Methods\n", 8 | "\n", 9 | "By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process -- variables are either retained or discarded -- it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.\n", 10 | "\n", 11 | "Let's take a break, and grab a cup of coffee :)" 12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 3", 18 | "language": "python", 19 | "name": "python3" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 3 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython3", 31 | "version": "3.5.2" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 2 36 | } 37 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section4-2-the-lasso.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## $\\S$ 3.4.2. The Lasso\n", 8 | "\n", 9 | "The lasso estimate is defined by\n", 10 | "\n", 11 | "\\begin{equation}\n", 12 | "\\hat\\beta^{\\text{lasso}} = \\arg\\min_\\beta \\sum_{i=1}^N \\left( y_i-\\beta_0-\\sum_{j=1}^p x_{ij}\\beta_j \\right)^2 \\text{ subject to } \\sum_{j=1}^p |\\beta_j| \\le t,\n", 13 | "\\end{equation}\n", 14 | "\n", 15 | "Just as in ridge regression, we can re-parametrize the constant $\\beta_0$ by standardizing the predictors; $\\hat\\beta_0 = \\bar{y}$, and thereafter we fit a model without an intercept.\n", 16 | "\n", 17 | "In the signal processing literature, the lasso is a.k.a. *basis pursuit* (Chen et al., 1998).\n", 18 | "\n", 19 | "Also the lasso problem has the equivalent *Lagrangian form*\n", 20 | "\n", 21 | "\\begin{equation}\n", 22 | "\\hat\\beta^{\\text{lasso}} = \\arg\\min_\\beta \\left\\lbrace \\frac{1}{2}\\sum_{i=1}^N \\left( y_i-\\beta_0-\\sum_{j=1}^p x_{ij}\\beta_j \\right)^2 + \\lambda\\sum_{j=1}^p |\\beta_j| \\right\\rbrace,\n", 23 | "\\end{equation}\n", 24 | "\n", 25 | "which is similar to the ridge problem as the $L_2$ ridge penalty is replaced by the $L_1$ lasso penalty. This lasso constraint makes the solutions nonlinear in the $y_i$, and there is no closed form expression as in ridge regression. 
And computing the above lasso solution is a quadratic programming problem, although efficient algorithms, introduced in $\\S$ 3.4.4, are available for computing the entire path of solution as $\\lambda$ varies, with the same computational cost as for ridge regression." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Note that\n", 33 | "\n", 34 | "* If $t \\gt t_0 = \\sum_1^p \\lvert\\hat\\beta_j^{\\text{ls}}\\rvert$, then $\\hat\\beta^{\\text{lasso}} = \\hat\\beta^{\\text{ls}}$.\n", 35 | "* Say, for $t = t_0/2$, then the least squares coefficients are shrunk by about $50\\%$ on average. \n", 36 | "However, the nature of the shrinkage is not obvious, and we investigate it further in $\\S$ 3.4.4." 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "In FIGURE 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter\n", 44 | "\n", 45 | "\\begin{equation}\n", 46 | "s = \\frac{t}{\\sum_1^p \\lvert\\hat\\beta_j\\rvert}.\n", 47 | "\\end{equation}\n", 48 | "\n", 49 | "A value $\\hat s \\approx 0.36$ was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (see Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates are fairly large." 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "FIGURE 3.10 is discussed after implementing the lasso algorithm in $\\S$ 3.4.4." 57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "kernelspec": { 62 | "display_name": "Python 3", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "codemirror_mode": { 68 | "name": "ipython", 69 | "version": 3 70 | }, 71 | "file_extension": ".py", 72 | "mimetype": "text/x-python", 73 | "name": "python", 74 | "nbconvert_exporter": "python", 75 | "pygments_lexer": "ipython3", 76 | "version": "3.5.2" 77 | } 78 | }, 79 | "nbformat": 4, 80 | "nbformat_minor": 2 81 | } 82 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section5-0-methods-using-derived-input-directions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 3.5. Methods Using Derived Input Directions\n", 8 | "\n", 9 | "In many situations we have a large number of inputs, often very correlated. The methods in this section produce a small number of linear combinations $Z_m$, $m=1,\\cdots,M$ of the original inputs $X_j$, and the $Z_m$ are then used in place of the $X_j$ as inputs in regression. The methods differ in how the linear combinations are constructed." 
10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.5.2" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /chapter03-linear-methods-for-regression/section5-1-principal-components-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $\\S$ 3.5. Methods Using Derived Input Directions\n", 8 | "## $\\S$ 3.5.1. Principal Components Regression\n", 9 | "\n", 10 | "The linear combinations $Z_m$ used in principal component regression (PCR) are the principal components as defined in $\\S$ 3.4.1." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "PCR forms the derived input columns\n", 18 | "\n", 19 | "\\begin{equation}\n", 20 | "\\mathbf{z}_m = \\mathbf{X} v_m,\n", 21 | "\\end{equation}\n", 22 | "\n", 23 | "and then regress $\\mathbf{y}$ on $\\mathbf{z}_1,\\mathbf{z}_2,\\cdots,\\mathbf{z}_M$ for some $M\\le p$. Since the $\\mathbf{z}_m$ are orthogonal, this regression is just a sum of univariate regressions:\n", 24 | "\n", 25 | "\\begin{equation}\n", 26 | "\\hat{\\mathbf{y}}_{(M)}^{\\text{pcr}} = \\bar{y}\\mathbf{1} + \\sum_{m=1}^M \\hat\\theta_m \\mathbf{z}_m = \\bar{y}\\mathbf{1} + \\mathbf{X}\\mathbf{V}_M\\hat{\\mathbf{\\theta}},\n", 27 | "\\end{equation}\n", 28 | "\n", 29 | "where $\\hat\\theta_m = \\langle\\mathbf{z}_m,\\mathbf{y}\\rangle \\big/ \\langle\\mathbf{z}_m,\\mathbf{z}_m\\rangle$. We can see from the last equality that, since the $\\mathbf{z}_m$ are each linear combinations of the original $\\mathbf{x}_j$, we can express the solution in terms of coefficients of the $\\mathbf{x}_j$.\n", 30 | "\n", 31 | "\\begin{equation}\n", 32 | "\\hat\\beta^{\\text{pcr}}(M) = \\sum_{m=1}^M \\hat\\theta_m v_m.\n", 33 | "\\end{equation}\n", 34 | "\n", 35 | "As with ridge regression, PCR depends on the scaling of the inputs, so typically we first standardized them." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### Comparison with ridge regression\n", 43 | "\n", 44 | "If $M=p$, since the columns of $\\mathbf{Z} = \\mathbf{UD}$ span the $\\text{col}(\\mathbf{X})$,\n", 45 | "\n", 46 | "\\begin{equation}\n", 47 | "\\hat\\beta^{\\text{pcr}}(p) = \\hat\\beta^{\\text{ls}}.\n", 48 | "\\end{equation}\n", 49 | "\n", 50 | "For $M
PLS seeks directions that have high variance *and* have high correlation with the response, in contrast to PCR, which keys only on high variance (Stone and Brooks, 1990; Frank and Friedman, 1993).\n",
78 | "\n",
79 | "Since it uses the response $\\mathbf{y}$ to construct its directions, its solution path is a nonlinear function of $\\mathbf{y}$.\n",
80 | "\n",
81 | "In particular, the $m$th principal component direction $v_m$ solves:\n",
82 | "\n",
83 | "\\begin{equation}\n",
84 | "\\max_\\alpha \\text{Var}(\\mathbf{X}\\alpha)\\\\\n",
85 | "\\text{subject to } \\|\\alpha\\| = 1, \\alpha^T\\mathbf{S} v_l = 0 \\text{ for } l = 1,\\cdots, m-1,\n",
86 | "\\end{equation}\n",
87 | "\n",
88 | "where $\\mathbf{S}$ is the sample covariance matrix of the $\\mathbf{x}_j$. The condition $\\alpha^T\\mathbf{S} v_l= 0$ ensures that $\\mathbf{z}_m = \\mathbf{X}\\alpha$ is uncorrelated with all the previous linear combinations $\\mathbf{z}_l = \\mathbf{X} v_l$.\n",
89 | "\n",
90 | "The $m$th PLS direction $\\hat\\rho_m$ solves:\n",
91 | "\n",
92 | "\\begin{equation}\n",
93 | "\\max_\\alpha \\text{Corr}^2(\\mathbf{y},\\mathbf{S}\\alpha)\\text{Var}(\\mathbf{X}\\alpha)\\\\\n",
94 | "\\text{subject to } \\|\\alpha\\| = 1, \\alpha^T\\mathbf{S}\\hat\\rho_l = 0 \\text{ for } l=1,\\cdots, m-1.\n",
95 | "\\end{equation}\n",
96 | "\n",
97 | "Further analysis reveals that the variance aaspect tends to dominate, and so PLS behaves much like ridge regression and PCR. We discuss further in the next section."
98 | ]
99 | },
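  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below is an added numerical sketch (not part of the textbook text). With standardized inputs, the first principal-component direction is the top right-singular vector of $\\mathbf{X}$, while the first PLS direction is proportional to $\\mathbf{X}^T\\mathbf{y}$: maximizing $\\text{Corr}^2(\\mathbf{y},\\mathbf{X}\\alpha)\\text{Var}(\\mathbf{X}\\alpha)$ over unit vectors $\\alpha$ is equivalent to maximizing $\\text{Cov}^2(\\mathbf{y},\\mathbf{X}\\alpha)$, whose solution is $\\alpha \\propto \\mathbf{X}^T\\mathbf{y}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(1)\n",
    "N = 200\n",
    "z = rng.normal(size=N)\n",
    "# two highly correlated inputs plus an independent one that drives the response\n",
    "X = np.column_stack([z + .1*rng.normal(size=N),\n",
    "                     z + .1*rng.normal(size=N),\n",
    "                     rng.normal(size=N)])\n",
    "y = X[:, 2] + .5*rng.normal(size=N)\n",
    "\n",
    "X = (X - X.mean(0)) / X.std(0)   # standardize the inputs\n",
    "y = y - y.mean()\n",
    "\n",
    "def pls_criterion(a):\n",
    "    zz = X @ a\n",
    "    return np.corrcoef(y, zz)[0, 1]**2 * zz.var()\n",
    "\n",
    "v_pcr = np.linalg.svd(X, full_matrices=False)[2][0]  # first principal-component direction\n",
    "v_pls = X.T @ y\n",
    "v_pls /= np.linalg.norm(v_pls)                       # first PLS direction\n",
    "\n",
    "print('Var(Xv):      ', (X @ v_pcr).var(), (X @ v_pls).var())\n",
    "print('Corr^2 * Var: ', pls_criterion(v_pcr), pls_criterion(v_pls))\n",
    "# the PC direction wins on variance, the PLS direction on the PLS criterion"
   ]
  },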
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "If the input matrix $\\mathbf{X}$ is orthogonal, then PLS finds the least squares estimates after the first $m=1$ step, and subsequent steps have no effect since the $\\hat\\rho_{mj} = 0$ for $m>1$ (Exercise 3.14).\n",
105 | "\n",
106 | "It can be also shown that the sequence of PLS coefficients for $m=1,2,\\cdots,p$ represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18)."
107 | ]
108 | }
109 | ],
110 | "metadata": {
111 | "kernelspec": {
112 | "display_name": "Python 3",
113 | "language": "python",
114 | "name": "python3"
115 | },
116 | "language_info": {
117 | "codemirror_mode": {
118 | "name": "ipython",
119 | "version": 3
120 | },
121 | "file_extension": ".py",
122 | "mimetype": "text/x-python",
123 | "name": "python",
124 | "nbconvert_exporter": "python",
125 | "pygments_lexer": "ipython3",
126 | "version": "3.5.2"
127 | }
128 | },
129 | "nbformat": 4,
130 | "nbformat_minor": 2
131 | }
132 |
--------------------------------------------------------------------------------
/chapter03-linear-methods-for-regression/section6-discussion-a-comparison-of-the-selection-and-shrinkage-methods.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 3.6. Discussion: A Comparison of the Selection and Shrinkage Methods\n",
8 | "\n",
9 | "> PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.\n",
10 | "\n",
11 | "Please check the textbook with FIGURE 3.18."
12 | ]
13 | }
14 | ],
15 | "metadata": {
16 | "kernelspec": {
17 | "display_name": "Python 3",
18 | "language": "python",
19 | "name": "python3"
20 | },
21 | "language_info": {
22 | "codemirror_mode": {
23 | "name": "ipython",
24 | "version": 3
25 | },
26 | "file_extension": ".py",
27 | "mimetype": "text/x-python",
28 | "name": "python",
29 | "nbconvert_exporter": "python",
30 | "pygments_lexer": "ipython3",
31 | "version": "3.5.2"
32 | }
33 | },
34 | "nbformat": 4,
35 | "nbformat_minor": 2
36 | }
37 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section1-introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Chapter 4. Linear Methods for Classification\n",
8 | "# $\\S$ 4.1. Introduction\n",
9 | "\n",
10 | "Since our predictor $G(x)$ takes values in a discrete set $\\mathcal{G}$, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these *decision boundaries* are linear; this is what we will mean by linear methodds for classification."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "### Linear regression\n",
18 | "\n",
19 | "In Chapter 2 we fit linear regression models to the class indicator variable, and classify to the largest fit. Suppose there are $K$ classes labeled $1,\\cdots,K$, and the fitted linear model for the $k$th indicator response variable is\n",
20 | "\n",
21 | "\\begin{equation}\n",
22 | "\\hat{f}_k(x) = \\hat\\beta_{k0} + \\hat\\beta_k^Tx.\n",
23 | "\\end{equation}\n",
24 | "\n",
25 | "The decision boundary between class $k$ and $l$ is that set of points\n",
26 | "\n",
27 | "\\begin{equation}\n",
28 | "\\left\\lbrace x: \\hat{f}_k(x) = \\hat{f}_l(x) \\right\\rbrace = \\left\\lbrace x: (\\hat\\beta_{k0}-\\hat\\beta_{l0}) + (\\hat\\beta_k-\\hat\\beta_l)^Tx = 0 \\right\\rbrace,\n",
29 | "\\end{equation}\n",
30 | "\n",
31 | "which is an affine set or hyperplane. Since the same is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries."
32 | ]
33 | },
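  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An added sketch (not from the textbook): fit the $K$ indicator responses by least squares on simulated data and classify to the largest fitted value. The boundary between classes $k$ and $l$ is the hyperplane where the two fitted values coincide."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(2)\n",
    "# three Gaussian classes in the plane\n",
    "means = np.array([[0., 0.], [3., 0.], [0., 3.]])\n",
    "X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in means])\n",
    "g = np.repeat([0, 1, 2], 50)\n",
    "\n",
    "Y = np.eye(3)[g]                              # N x K indicator response matrix\n",
    "Xa = np.column_stack([np.ones(len(X)), X])    # add the intercept column\n",
    "Bhat = np.linalg.lstsq(Xa, Y, rcond=None)[0]  # (p+1) x K coefficient matrix\n",
    "\n",
    "fits = Xa @ Bhat\n",
    "ghat = fits.argmax(axis=1)                    # classify to the largest fitted value\n",
    "print('training error rate:', (ghat != g).mean())\n",
    "# the boundary between classes k and l is where the fitted values are equal,\n",
    "# i.e. the hyperplane {x : (Bhat[:,k] - Bhat[:,l])^T (1, x) = 0}\n",
    "print('class 0 vs 1 boundary coefficients:', Bhat[:, 0] - Bhat[:, 1])"
   ]
  },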
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "### Discriminant function\n",
39 | "\n",
40 | "The regression approach is a member of a class of methods that model *discriminant functions* $\\delta_k(x)$ for each class, and then classify $x$ to the class with the largest value for its discriminant function. Methods that model the posterior probabilities $\\text{Pr}(G=k|X=x)$ are also in this class. Clearly, if either the $\\delta_k(x)$ or $\\text{Pr}(G=k|X=x)$ are linear in $x$, then the decision boundaries will be linear.\n",
41 | "\n",
42 | "### Logit transformation\n",
43 | "\n",
44 | "Actually, all we require is that some monotone transformation of $\\delta_k$ or $\\text{Pr}(G=k|X=x)$ be linear for the decision boundaries to be linear. For example, if there are two classes, a popular model for the posterior probabilities is\n",
45 | "\n",
46 | "\\begin{align}\n",
47 | "\\text{Pr}(G=1|X=x) &= \\frac{\\exp(\\beta_0+\\beta^Tx)}{1+\\exp(\\beta_0+\\beta^Tx)},\\\\\n",
48 | "\\text{Pr}(G=2|X=x) &= \\frac{1}{1+\\exp(\\beta_0+\\beta^Tx)},\\\\\n",
49 | "\\end{align}\n",
50 | "\n",
51 | "where the monotone transformation is the *logit* transformation\n",
52 | "\n",
53 | "\\begin{equation}\n",
54 | "\\log\\frac{p}{1-p},\n",
55 | "\\end{equation}\n",
56 | "\n",
57 | "and in fact we see that\n",
58 | "\n",
59 | "\\begin{equation}\n",
60 | "\\log\\frac{\\text{Pr}(G=1|X=x)}{\\text{Pr}(G=2|X=x)} = \\beta_0 + \\beta^Tx.\n",
61 | "\\end{equation}\n",
62 | "\n",
63 | "The decision boundary is the set of points for which the *log-odds* are zero, and this is a hyperplane defined by\n",
64 | "\n",
65 | "\\begin{equation}\n",
66 | "\\left\\lbrace x: \\beta_0+\\beta^Tx = 0 \\right\\rbrace.\n",
67 | "\\end{equation}\n",
68 | "\n",
69 | "We discuss two very popular but different methods that result in linear log-odds or logits: Linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the lineaer function is fir to the training data."
70 | ]
71 | },
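  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tiny added numeric check (not from the textbook) that the two-class model above produces linear log-odds: for any $x$, $\\log(\\text{Pr}(G=1|x)/\\text{Pr}(G=2|x))$ recovers $\\beta_0+\\beta^Tx$. The particular values of $\\beta_0$ and $\\beta$ are arbitrary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "beta0, beta = -1.0, np.array([2.0, -3.0])    # arbitrary coefficients for the check\n",
    "\n",
    "def two_class_posterior(x):\n",
    "    eta = beta0 + beta @ x\n",
    "    p1 = np.exp(eta) / (1.0 + np.exp(eta))   # Pr(G=1|X=x)\n",
    "    return p1, 1.0 - p1                      # and Pr(G=2|X=x)\n",
    "\n",
    "x = np.array([0.5, 1.2])\n",
    "p1, p2 = two_class_posterior(x)\n",
    "print(np.log(p1 / p2), beta0 + beta @ x)     # the two numbers agree"
   ]
  },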
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "### Separating hyperplanes\n",
77 | "\n",
78 | "A more direct approach is to explicitly model the boundaries between the classes as linear. For a two-class problem, this amounts to modeling the decision boundary as a hyperplane; a normal vector and a cut-point.\n",
79 | "\n",
80 | "We will look at two methods that explicitly look for \"separating hyperplanes\".\n",
81 | "1. The well-known *perceptron* model of Rosenblatt (1958), with an algorithm that finds a separating hyperplane in the training data, if one exists.\n",
82 | "2. Vapnik (1996) finds an *optimally separating hyperplane* if one exists, else finds a hyperplane that minimizes some measure of overlap in the training data.\n",
83 | "\n",
84 | "We treat separable cases here, and defer the nonseparable case to Chapter 12."
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "### Scope for generalization\n",
92 | "\n",
93 | "We can expand the input by including their squares $X_1^2,X_2^2,\\cdots$, and cross-products $X_1X_2,\\cdots$, thereby adding $p(p+1)/2$ additional variables. Linear functions in the augmented space map down to quadratic decision boundaires. FIGURE 4.1 illustrates the idea.\n",
94 | "\n",
95 | "This approach can be used with any basis transformation $h(X): \\mathbb{R}^p\\mapsto\\mathbb{R}^q$ with $q > p$, and will be explored in later chapters."
96 | ]
97 | }
98 | ],
99 | "metadata": {
100 | "kernelspec": {
101 | "display_name": "Python 3",
102 | "language": "python",
103 | "name": "python3"
104 | },
105 | "language_info": {
106 | "codemirror_mode": {
107 | "name": "ipython",
108 | "version": 3
109 | },
110 | "file_extension": ".py",
111 | "mimetype": "text/x-python",
112 | "name": "python",
113 | "nbconvert_exporter": "python",
114 | "pygments_lexer": "ipython3",
115 | "version": "3.6.4"
116 | }
117 | },
118 | "nbformat": 4,
119 | "nbformat_minor": 2
120 | }
121 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section3-1-regularized-discriminant-analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 4.3.1. Regularized Discriminant Analysis\n",
8 | "\n",
9 | "### $\\Sigma_k \\leftrightarrow \\Sigma$\n",
10 | "These methods are very similar in flavor to ridge regression. Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance $\\hat\\Sigma$ as in LDA. The regularized covariance matrices have the form\n",
11 | "\n",
12 | "\\begin{equation}\n",
13 | "\\hat\\Sigma_k(\\alpha) = \\alpha\\hat\\Sigma_k + (1+\\alpha)\\hat\\Sigma,\n",
14 | "\\end{equation}\n",
15 | "\n",
16 | "where $\\alpha\\in[0,1]$ allows a continuum of models between LDA and QDA, and needs to be specified. In practice $\\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### $\\Sigma \\leftrightarrow \\sigma$\n",
24 | "\n",
25 | "Similar modifications allow $\\hat\\Sigma$ itelf to be shrunk toward the scalar covariance,\n",
26 | "\n",
27 | "\\begin{equation}\n",
28 | "\\hat\\Sigma(\\gamma) = \\gamma\\hat\\Sigma + (1-\\gamma)\\hat\\sigma^2\\mathbf{I},\n",
29 | "\\end{equation}\n",
30 | "\n",
31 | "for $\\gamma\\in[0,1]$.\n",
32 | "\n",
33 | "Combining two regularization leads to a more general family of covariances $\\hat\\Sigma(\\alpha,\\gamma)$."
34 | ]
35 | },
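  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An added sketch of the two shrinkage steps (not from the textbook). Two modeling choices are assumed here: the scalar variance is taken as the average of the diagonal of $\\hat\\Sigma$, and the regularizations are combined by plugging $\\hat\\Sigma(\\gamma)$ into $\\hat\\Sigma_k(\\alpha)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def rda_covariances(X, g, alpha, gamma):\n",
    "    # regularized covariance estimates Sigma_k(alpha, gamma) for each class k\n",
    "    classes = np.unique(g)\n",
    "    N, p = X.shape\n",
    "    Sk = {k: np.cov(X[g == k].T) for k in classes}          # per-class (QDA) estimates\n",
    "    Nk = {k: (g == k).sum() for k in classes}\n",
    "    S = sum((Nk[k] - 1) * Sk[k] for k in classes) / (N - len(classes))  # pooled (LDA)\n",
    "    sigma2 = np.trace(S) / p                                # scalar variance (one common choice)\n",
    "    S_gamma = gamma * S + (1 - gamma) * sigma2 * np.eye(p)\n",
    "    return {k: alpha * Sk[k] + (1 - alpha) * S_gamma for k in classes}\n",
    "\n",
    "rng = np.random.RandomState(3)\n",
    "X = rng.normal(size=(60, 4))\n",
    "g = np.repeat([0, 1, 2], 20)\n",
    "covs = rda_covariances(X, g, alpha=0.5, gamma=0.7)\n",
    "print(covs[0].shape)   # (4, 4); alpha=1, gamma=1 gives QDA, alpha=0, gamma=1 gives LDA"
   ]
  },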
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "### To be continued\n",
41 | "\n",
42 | "In Chapter 12, we discuss other regularized versions of LDA, which are more suitable when the data arise from digitized analog signals and images. In these situations the features are high-dimensional and correlated, and the LDA coefficients can be regularized to be smooth or sparse in original domain of the signal.\n",
43 | "\n",
44 | "In Chapter 18, we also deal with very high-dimensional problems, where for example, the features are gene-expression measurements in microarray studies."
45 | ]
46 | }
47 | ],
48 | "metadata": {
49 | "kernelspec": {
50 | "display_name": "Python 3",
51 | "language": "python",
52 | "name": "python3"
53 | },
54 | "language_info": {
55 | "codemirror_mode": {
56 | "name": "ipython",
57 | "version": 3
58 | },
59 | "file_extension": ".py",
60 | "mimetype": "text/x-python",
61 | "name": "python",
62 | "nbconvert_exporter": "python",
63 | "pygments_lexer": "ipython3",
64 | "version": "3.5.2"
65 | }
66 | },
67 | "nbformat": 4,
68 | "nbformat_minor": 2
69 | }
70 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section3-2-computations-for-lda.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 4.3.2. Computations for LDA\n",
8 | "\n",
9 | "Computations for LDA and QDA are simplified by diagonalizing $\\hat\\Sigma$ or $\\hat\\Sigma_k$. For the latter, suppose we compute the eigen-decomposition, for each $k$,\n",
10 | "\n",
11 | "\\begin{equation}\n",
12 | "\\hat\\Sigma_k = \\mathbf{U}_k\\mathbf{D}_k\\mathbf{U}_K^T,\n",
13 | "\\end{equation}\n",
14 | "\n",
15 | "where $\\mathbf{U}_k$ is $p\\times p$ orthogonal, and $\\mathbf{D}_k$ a diagonal matrix of positive eigenvalues $d_{kl}$.\n",
16 | "\n",
17 | "Then the ingredients for $\\delta_k(x)$ are\n",
18 | "* $(x-\\hat\\mu_k)^T\\hat\\Sigma_k^{-1}(x-\\hat\\mu_k) = \\left[\\mathbf{U}_k(x-\\hat\\mu_k)\\right]^T\\mathbf{D}_k^{-1}\\left[\\mathbf{U}_k(x-\\hat\\mu_k)\\right]$\n",
19 | "* $\\log|\\hat\\Sigma_k| = \\sum_l \\log d_{kl}$\n",
20 | "\n",
21 | "Note that the inversion of diagonal matrices only requires elementwise reprocials.\n",
22 | "\n",
23 | "The LDA classifier can be implemented by the following pair of steps:\n",
24 | "* *Sphere* the data w.r.t. the common covariance estimate $\\hat\\Sigma = \\mathbf{U}\\mathbf{D}\\mathbf{U}^T$: \n",
25 | "\\begin{equation}\n",
26 | "X* \\rightarrow \\mathbf{D}^{-\\frac{1}{2}}\\mathbf{U}^TX,\n",
27 | "\\end{equation} \n",
28 | "The common covariance estimate of $X*$ will now be the identity.\n",
29 | "* Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\\pi_k$."
30 | ]
31 | }
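  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code cell below is an added sketch (not part of the textbook) of this two-step recipe: estimate the pooled covariance, sphere the data with its eigen-decomposition, and classify to the closest sphered centroid after adjusting for the class priors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def lda_fit(X, g):\n",
    "    classes, Nk = np.unique(g, return_counts=True)\n",
    "    mus = np.array([X[g == k].mean(axis=0) for k in classes])\n",
    "    # pooled within-class covariance estimate\n",
    "    Sigma = sum((nk - 1) * np.cov(X[g == k].T) for k, nk in zip(classes, Nk))\n",
    "    Sigma /= len(X) - len(classes)\n",
    "    D, U = np.linalg.eigh(Sigma)   # Sigma = U diag(D) U^T\n",
    "    W = U / np.sqrt(D)             # sphering matrix: X @ W has identity covariance\n",
    "    return classes, mus @ W, W, np.log(Nk / len(X))\n",
    "\n",
    "def lda_predict(X, classes, mus_star, W, log_priors):\n",
    "    Xs = X @ W                     # sphered data\n",
    "    d2 = ((Xs[:, None, :] - mus_star[None, :, :])**2).sum(-1)\n",
    "    # closest centroid, modulo the class prior probabilities\n",
    "    return classes[np.argmin(0.5 * d2 - log_priors, axis=1)]\n",
    "\n",
    "rng = np.random.RandomState(5)\n",
    "X = np.vstack([rng.normal([0, 0], 1., (40, 2)), rng.normal([2, 2], 1., (40, 2))])\n",
    "g = np.repeat([0, 1], 40)\n",
    "model = lda_fit(X, g)\n",
    "print('training error:', (lda_predict(X, *model) != g).mean())"
   ]
  }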
32 | ],
33 | "metadata": {
34 | "kernelspec": {
35 | "display_name": "Python 3",
36 | "language": "python",
37 | "name": "python3"
38 | },
39 | "language_info": {
40 | "codemirror_mode": {
41 | "name": "ipython",
42 | "version": 3
43 | },
44 | "file_extension": ".py",
45 | "mimetype": "text/x-python",
46 | "name": "python",
47 | "nbconvert_exporter": "python",
48 | "pygments_lexer": "ipython3",
49 | "version": "3.5.2"
50 | }
51 | },
52 | "nbformat": 4,
53 | "nbformat_minor": 2
54 | }
55 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section4-0-logistic-regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 4.4. Logistic Regression\n",
8 | "\n",
9 | "The logistic regression model arises from the desire to model the posterior probabilities of the $K$ classes via linear functions in $x$, ensuring the natural properties of the probability: They sum to one and remain in $[0,1]$.\n",
10 | "\n",
11 | "The model has the form\n",
12 | "\n",
13 | "\\begin{align}\n",
14 | "\\log\\frac{\\text{Pr}(G=1|X=x)}{\\text{Pr}(G=K|X=x)} &= \\beta_{10} + \\beta_1^Tx \\\\\n",
15 | "\\log\\frac{\\text{Pr}(G=2|X=x)}{\\text{Pr}(G=K|X=x)} &= \\beta_{20} + \\beta_2^Tx \\\\\n",
16 | "&\\vdots \\\\\n",
17 | "\\log\\frac{\\text{Pr}(G=K-1|X=x)}{\\text{Pr}(G=K|X=x)} &= \\beta_{(K-1)0} + \\beta_{K-1}^Tx \\\\\n",
18 | "\\end{align}\n",
19 | "\n",
20 | "The model is specified in terms of $K-1$ log-odds or logit transformations, reflecting the constraint that the probabilities sum to one. The choice of denominator ($K$ in this case) is arbitrary in that the estimates are equivalent under this choice."
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "### Sum to one\n",
28 | "\n",
29 | "To emphasize the dependence on the entire parameter set $\\theta = \\left\\lbrace \\beta_{10}, \\beta_1^T, \\cdots, \\beta_{(K-1)0}, \\beta_{K-1}^T\\right\\rbrace$, we denote the probabilities\n",
30 | "\n",
31 | "\\begin{equation}\n",
32 | "\\text{Pr}(G=k|X=x) = p_k(x;\\theta)\n",
33 | "\\end{equation}\n",
34 | "\n",
35 | "A simple calculation shows that\n",
36 | "\n",
37 | "\\begin{align}\n",
38 | "\\text{Pr}(G=k|X=x) &= \\frac{\\exp(\\beta_{k0}+\\beta_k^Tx)}{1+\\sum_{l=1}^{K-1}\\exp(\\beta_{l0}+\\beta_l^Tx)}, \\text{ for } k=1,\\cdots,K-1, \\\\\n",
39 | "\\text{Pr}(G=K|X=x) &= \\frac{1}{1+\\sum_{l=1}^{K-1}\\exp(\\beta_{l0}+\\beta_l^Tx)},\n",
40 | "\\end{align}\n",
41 | "\n",
42 | "and they clearly sum to one.\n",
43 | "\n",
44 | "When $K=2$, this model is especially simple, since there is only a single linear function."
45 | ]
46 | }
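,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small numerical check of the formulas above: for made-up parameters $\\theta$ and an arbitrary input $x$, the $K$ posterior probabilities are computed from the $K-1$ linear predictors and indeed sum to one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical parameters for K = 3 classes and p = 2 inputs (made-up numbers)\n",
"beta0 = np.array([0.5, -0.2])              # intercepts beta_{10}, beta_{20}\n",
"beta = np.array([[1.0, -1.0],              # beta_1\n",
"                 [0.3,  2.0]])             # beta_2\n",
"x = np.array([0.7, -1.2])\n",
"\n",
"eta = beta0 + beta @ x                     # the K-1 linear predictors\n",
"denom = 1.0 + np.exp(eta).sum()\n",
"probs = np.append(np.exp(eta) / denom, 1.0 / denom)\n",
"print(probs, probs.sum())                  # the K probabilities sum to one"
]
}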
47 | ],
48 | "metadata": {
49 | "kernelspec": {
50 | "display_name": "Python 3",
51 | "language": "python",
52 | "name": "python3"
53 | },
54 | "language_info": {
55 | "codemirror_mode": {
56 | "name": "ipython",
57 | "version": 3
58 | },
59 | "file_extension": ".py",
60 | "mimetype": "text/x-python",
61 | "name": "python",
62 | "nbconvert_exporter": "python",
63 | "pygments_lexer": "ipython3",
64 | "version": "3.5.2"
65 | }
66 | },
67 | "nbformat": 4,
68 | "nbformat_minor": 2
69 | }
70 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section4-3-quadratic-approximations-and-inference.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 4.4.3. Quadratic Approximations and Inference\n",
8 | "\n",
9 | "Please check this section later..."
10 | ]
11 | }
12 | ],
13 | "metadata": {
14 | "kernelspec": {
15 | "display_name": "Python 3",
16 | "language": "python",
17 | "name": "python3"
18 | },
19 | "language_info": {
20 | "codemirror_mode": {
21 | "name": "ipython",
22 | "version": 3
23 | },
24 | "file_extension": ".py",
25 | "mimetype": "text/x-python",
26 | "name": "python",
27 | "nbconvert_exporter": "python",
28 | "pygments_lexer": "ipython3",
29 | "version": "3.5.2"
30 | }
31 | },
32 | "nbformat": 4,
33 | "nbformat_minor": 2
34 | }
35 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section4-4-l1-regularized-logistic-regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 4.4.1. L1 Regularized Logistic Regression\n",
8 | "\n",
9 | "Please check this section later..."
10 | ]
11 | }
12 | ],
13 | "metadata": {
14 | "kernelspec": {
15 | "display_name": "Python 3",
16 | "language": "python",
17 | "name": "python3"
18 | },
19 | "language_info": {
20 | "codemirror_mode": {
21 | "name": "ipython",
22 | "version": 3
23 | },
24 | "file_extension": ".py",
25 | "mimetype": "text/x-python",
26 | "name": "python",
27 | "nbconvert_exporter": "python",
28 | "pygments_lexer": "ipython3",
29 | "version": "3.5.2"
30 | }
31 | },
32 | "nbformat": 4,
33 | "nbformat_minor": 2
34 | }
35 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section4-5-logistic-or-lda.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 4.4.5. Logistic Regression or LDA?\n",
8 | "### Common linearity\n",
9 | "\n",
10 | "LDA has the log-posterior odds which are linear functions of x:\n",
11 | "\n",
12 | "\\begin{align}\n",
13 | "\\log\\frac{\\text{Pr}(G=k|X=x)}{\\text{Pr}(G=K|X=x)} &= \\log\\frac{\\pi_k}{\\pi_K} - \\frac{1}{2}(\\mu_k-\\mu_K)^T\\Sigma^{-1}(\\mu_k-\\mu_K) + x^T\\Sigma^{-1}(\\mu_k-\\mu_K) \\\\\n",
14 | "&= \\alpha_{k0} + \\alpha_k^Tx,\n",
15 | "\\end{align}\n",
16 | "\n",
17 | "and this linearity is a consequence of the Gaussian assumption for the class densities with a common covariance matrix.\n",
18 | "\n",
19 | "The linear logistic model by construction has linear logits:\n",
20 | "\n",
21 | "\\begin{equation}\n",
22 | "\\log\\frac{\\text{Pr}(G=k|X=x)}{\\text{Pr}(G=K|X=x)} = \\beta_{k0} + \\beta_k^Tx\n",
23 | "\\end{equation}\n",
24 | "\n",
25 | "It seems that the models are the same, and they have the common logit-linear form for the posterior probabilities:\n",
26 | "\n",
27 | "\\begin{equation}\n",
28 | "\\text{Pr}(G=k|X=x) = \\frac{\\exp(\\beta_{k0}+\\beta_k^Tx)}{1+\\sum_{l=1}^{K-1}\\exp(\\beta_{l0}+\\beta_l^Tx)}\n",
29 | "\\end{equation}"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "### Different assumptions\n",
37 | "\n",
38 | "Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated: The logistic regression model is more general, in that it makes less assumptions.\n",
39 | "\n",
40 | "Note the *joint density* of $X$ and $G$ as\n",
41 | "\n",
42 | "\\begin{equation}\n",
43 | "\\text{Pr}(X,G=k) = \\text{Pr}(X)\\text{Pr}(G=k|X),\n",
44 | "\\end{equation}\n",
45 | "\n",
46 | "where $\\text{Pr}(X)$ denotes the marginal density of the inputs $X$.\n",
47 | "\n",
48 | "The logistic regression model leaves $\\text{Pr}(X)$ as an arbitrary density function, and fits the parameters of $\\text{Pr}(G|X)$ by maximizing the *conditional likelihood* -- the multinomial likelihood with probabilities the $\\text{Pr}(G=k|X)$. Although $\\text{Pr}(X)$ is totally ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion, using empirical distribution function which places mass $1/N$ at each observation.\n",
49 | "\n",
50 | "LDA fits the parameters by maximizing the full log-likelihood, based on the joint density\n",
51 | "\n",
52 | "\\begin{equation}\n",
53 | "\\text{Pr}(X,G=k) = \\phi(X;\\mu_k,\\Sigma)\\pi_k,\n",
54 | "\\end{equation}\n",
55 | "\n",
56 | "where $\\phi$ is the Gaussian density function. Standard normal theory leads easily to the estimates $\\hat\\mu_k$, $\\hat\\Sigma$, and $\\hat\\pi_k$ given in Section 4.3. Since the linear parameters of the logistic form\n",
57 | "\n",
58 | "\\begin{equation}\n",
59 | "\\log\\frac{\\text{Pr}(G=k|X=x)}{\\text{Pr}(G=K|X=x)} = \\log\\frac{\\pi_k}{\\pi_K} - \\frac{1}{2}(\\mu_k-\\mu_K)^T\\Sigma^{-1}(\\mu_k-\\mu_K) + x^T\\Sigma^{-1}(\\mu_k-\\mu_K)\n",
60 | "\\end{equation}\n",
61 | "\n",
62 | "are functions of the Gaussian parameters, we get their maximum-likelihood estimates by plugging in the corresponding estimates."
63 | ]
64 | },
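{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small simulation (not from the text) contrasting the two estimation strategies: with two Gaussian classes sharing a covariance, the LDA coefficients obtained by plugging the Gaussian MLEs into the logistic form come out close to, but not identical with, the conditional-likelihood fit. The scikit-learn call with a very large $C$ is only a stand-in for unpenalized logistic regression."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Two Gaussian classes with a common covariance (made-up parameters)\n",
"rng = np.random.RandomState(2)\n",
"Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])\n",
"mu1, mu2 = np.array([0.0, 0.0]), np.array([1.5, 0.5])\n",
"X = np.vstack([rng.multivariate_normal(mu1, Sigma, 200),\n",
"               rng.multivariate_normal(mu2, Sigma, 200)])\n",
"y = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]\n",
"\n",
"# LDA: plug the Gaussian MLEs into the logistic form (class 2 vs class 1)\n",
"pi1 = pi2 = 0.5\n",
"m1, m2 = X[y == 0].mean(0), X[y == 1].mean(0)\n",
"S = ((X[y == 0] - m1).T @ (X[y == 0] - m1)\n",
"     + (X[y == 1] - m2).T @ (X[y == 1] - m2)) / (len(y) - 2)\n",
"Sinv = np.linalg.inv(S)\n",
"alpha = Sinv @ (m2 - m1)\n",
"alpha0 = np.log(pi2 / pi1) - 0.5 * (m2 + m1) @ Sinv @ (m2 - m1)\n",
"\n",
"# Logistic regression: maximize the conditional likelihood (large C ~ no penalty)\n",
"lr = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)\n",
"print('LDA  plug-in:', alpha0, alpha)\n",
"print('logistic MLE:', lr.intercept_[0], lr.coef_[0])"
]
},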
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "### Role of the marginal density $\\text{Pr}(X)$ in LDA\n",
70 | "\n",
71 | "However, unlike in the conditional case, the marginal density $\\text{Pr}(X)$ does play a role here. It is a mixture density\n",
72 | "\n",
73 | "\\begin{equation}\n",
74 | "\\text{Pr}(X) = \\sum_{k=1}^K \\pi_k\\phi(X;\\mu_k,\\Sigma),\n",
75 | "\\end{equation}\n",
76 | "\n",
77 | "which also involves the parameters. What role can this additional component or restriction play?\n",
78 | "\n",
79 | "By relying on the additional model assumptions, we have more information about the parameters, and hence can estimate them more efficiently (lower variance). If in fact the true $f_k(x)$ are Gaussian, then in the worst case ignoring this marginal part of the likelihood constitutes a loss of efficiency of about $30\\%$ asymptotically in the error rate (Efron, 1975). Paraphrasing: With $30\\%$ more data, the conditional likelihood will do as well.\n",
80 | "\n",
81 | "For example, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix. This is not a good news, because it also means that LDA is not robust to gross outliers."
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "### Marginal likelihood as a regularizer\n",
89 | "\n",
90 | "The marginal likelihood can be thought of as a regularizer, requiring in some sense that class densities be *visible* from this marginal view.\n",
91 | "\n",
92 | "For example, if the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5).\n",
93 | "\n",
94 | "The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies."
95 | ]
96 | },
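{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of the degeneracy: on perfectly separable toy data the (nearly) unpenalized logistic coefficients grow without bound as the ridge penalty is relaxed, while the LDA coefficients remain finite. The scikit-learn estimators are used here only as convenient stand-ins."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
"\n",
"# Two perfectly separable point clouds in R^2 (made up for illustration)\n",
"rng = np.random.RandomState(3)\n",
"X = np.vstack([rng.randn(20, 2) + [-4, 0], rng.randn(20, 2) + [4, 0]])\n",
"y = np.r_[np.zeros(20, dtype=int), np.ones(20, dtype=int)]\n",
"\n",
"# As the ridge penalty is relaxed (C grows), the logistic coefficients blow up ...\n",
"for C in [1e0, 1e2, 1e4, 1e6]:\n",
"    coef = LogisticRegression(C=C, max_iter=1000).fit(X, y).coef_[0]\n",
"    print('C = %g, |beta| = %.1f' % (C, np.linalg.norm(coef)))\n",
"\n",
"# ... while the LDA coefficients stay finite.\n",
"print('LDA coef:', LinearDiscriminantAnalysis().fit(X, y).coef_[0])"
]
},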
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "### In real world\n",
102 | "\n",
103 | "> It is generally felt that logistic regression is a safer and more robust bet than the LDA model, relying on fewer assumptions.\n",
104 | "\n",
105 | "In practice these assumptions are never correct, and often some of the components of $X$ are qualitative variables. It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors."
106 | ]
107 | }
108 | ],
109 | "metadata": {
110 | "kernelspec": {
111 | "display_name": "Python 3",
112 | "language": "python",
113 | "name": "python3"
114 | },
115 | "language_info": {
116 | "codemirror_mode": {
117 | "name": "ipython",
118 | "version": 3
119 | },
120 | "file_extension": ".py",
121 | "mimetype": "text/x-python",
122 | "name": "python",
123 | "nbconvert_exporter": "python",
124 | "pygments_lexer": "ipython3",
125 | "version": "3.5.2"
126 | }
127 | },
128 | "nbformat": 4,
129 | "nbformat_minor": 2
130 | }
131 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section5-0-separating-hyperplanes.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 4.5. Separating Hyperplanes\n",
8 | "\n",
9 | "We describe separating hyperplane classifiers, constructing linear decision boundaries that explicitly try to separate the data into different classes as well as possible. They provide the basis for support vector classifiers, discussed in Chapter 12.\n",
10 | "\n",
11 | "FIGURE 4.14 shows 20 data points of two classes in $\\mathbb{R}^2$, which can be separated by a linear boundary but there are infinitely many possible *separating hyperplanes*.\n",
12 | "\n",
13 | "The orange line is the least squares solution to the problem, obtained by regressing the $-1/1$ response $Y$ on $X$ with intercept; the line is given by\n",
14 | "\n",
15 | "\\begin{equation}\n",
16 | "\\left\\lbrace x: \\hat\\beta_0 + \\hat\\beta_1x_1 + \\hat\\beta_2x_2=0 \\right\\rbrace.\n",
17 | "\\end{equation}\n",
18 | "\n",
19 | "This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by LDA, in light of its equivalence with linear regression in the two-class case ($\\S$ 4.3 and Exercise 4.2)."
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "### Perceptrons\n",
27 | "\n",
28 | "Classifiers such as above, that compute a linear combination of the input features and return the sign, were called *perceptrons* in the engineering literatur in the late 1950s (Rosenblatt, 1958). Perceptrons set he foundations for the neural network models of the 1980s and 1990s."
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "### Review on vector algebra\n",
36 | "\n",
37 | "FIGURE 4.15 depicts a hyperplane or *affine set* $L$ defined by the equation\n",
38 | "\n",
39 | "\\begin{equation}\n",
40 | "f(x) = \\beta_0 + \\beta^T x = 0,\n",
41 | "\\end{equation}\n",
42 | "\n",
43 | "since we are in $\\mathbb{R}^2$ this is a line.\n",
44 | "\n",
45 | "Here we list some properties:\n",
46 | "1. For any two points $x_1$ and $x_2$ lying in $L$, \n",
47 | "\\begin{equation}\n",
48 | "\\beta^T(x_1-x_2)=0,\n",
49 | "\\end{equation}\n",
50 | "and hence the unit vector $\\beta^* = \\beta/\\|\\beta\\|$ is normal to the surface of $L$.\n",
51 | "2. For any point $x_0$ in $L$, \n",
52 | "\\begin{equation}\n",
53 | "\\beta^Tx_0 = -\\beta_0.\n",
54 | "\\end{equation}\n",
55 | "3. The signed distance of any point $x$ to $L$ is given by \n",
56 | "\\begin{align}\n",
57 | "\\beta^{*T}(x-x_0) &= \\frac{1}{\\|\\beta\\|}(\\beta^Tx+\\beta_0) \\\\\n",
58 | "&= \\frac{1}{\\|f'(x)\\|}f(x).\n",
59 | "\\end{align}\n",
60 | "Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x)=0$."
61 | ]
62 | }
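,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical check of property 3, with a made-up hyperplane in $\\mathbb{R}^2$: the signed distance computed as $f(x)/\\|\\beta\\|$ agrees with $\\beta^{*T}(x-x_0)$ for a point $x_0$ lying in $L$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hyperplane f(x) = beta0 + beta^T x = 0 in R^2 (made-up coefficients)\n",
"beta0, beta = -1.0, np.array([3.0, 4.0])\n",
"\n",
"def signed_distance(x):\n",
"    # f(x) / ||f'(x)|| = (beta0 + beta^T x) / ||beta||\n",
"    return (beta0 + beta @ x) / np.linalg.norm(beta)\n",
"\n",
"x0 = np.array([1.0, -0.5])               # a point in L: 3*1 + 4*(-0.5) - 1 = 0\n",
"x = np.array([2.0, 2.0])\n",
"\n",
"beta_star = beta / np.linalg.norm(beta)  # unit vector normal to L\n",
"print(signed_distance(x))                # via f(x) / ||f'(x)||\n",
"print(beta_star @ (x - x0))              # via beta*^T (x - x0); the two agree"
]
}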
63 | ],
64 | "metadata": {
65 | "kernelspec": {
66 | "display_name": "Python 3",
67 | "language": "python",
68 | "name": "python3"
69 | },
70 | "language_info": {
71 | "codemirror_mode": {
72 | "name": "ipython",
73 | "version": 3
74 | },
75 | "file_extension": ".py",
76 | "mimetype": "text/x-python",
77 | "name": "python",
78 | "nbconvert_exporter": "python",
79 | "pygments_lexer": "ipython3",
80 | "version": "3.5.2"
81 | }
82 | },
83 | "nbformat": 4,
84 | "nbformat_minor": 2
85 | }
86 |
--------------------------------------------------------------------------------
/chapter04-linear-methods-for-classification/section5-1-rosenblatt-perceptron-learning-algorithm.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 4.5.1. Rosenblatt's Perceptron Learning Algorithm\n",
8 | "\n",
9 | "> The *perceptron learning algorithm* tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.\n",
10 | "\n",
11 | "If a response $y_i=1$ is misclassified, then $x_i^T\\beta + \\beta_0 \\lt 0$, and the opposite for a misclassified response with $y_i=-1$. The goal is to minimize\n",
12 | "\n",
13 | "\\begin{equation}\n",
14 | "D(\\beta,\\beta_0) = -\\sum_{i\\in\\mathcal{M}} y_i(x_i^T\\beta + \\beta_0),\n",
15 | "\\end{equation}\n",
16 | "\n",
17 | "where $\\mathcal{M}$ indexes the set of misclassified points. The quantity is non-negative and proportional to the distance of the misclassified points to the decision boundary defined by $\\beta^Tx+\\beta_0=0$.\n",
18 | "\n",
19 | "Assuming $\\mathcal{M}$ is fixed, the gradient is given by\n",
20 | "\n",
21 | "\\begin{align}\n",
22 | "\\partial\\frac{D(\\beta,\\beta_0)}{\\partial\\beta} &= -\\sum_{i\\in\\mathcal{M}} y_ix_i, \\\\\n",
23 | "\\partial\\frac{D(\\beta,\\beta_0)}{\\partial\\beta_0} &= -\\sum_{i\\in\\mathcal{M}} y_i.\n",
24 | "\\end{align}"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "### Stochastic gradient descent\n",
32 | "\n",
33 | "The algorithm in face uses *stochastic gradient descent* to minimize this piecewise linear criterion. This means that rather than computing the sum of the gradient contributions of each observation followed by a step in the negative gradient direction, a step in taken after each observation is visited.\n",
34 | "\n",
35 | "Hence the misclassified observations are visited in some sequence, and the parameters $\\beta$ are updated via\n",
36 | "\n",
37 | "\\begin{equation}\n",
38 | "\\begin{pmatrix}\\beta \\\\ \\beta_0\\end{pmatrix}\n",
39 | "\\leftarrow\n",
40 | "\\begin{pmatrix}\\beta \\\\ \\beta_0\\end{pmatrix}\n",
41 | "+\n",
42 | "\\rho \\begin{pmatrix}y_ix_i \\\\ y_i\\end{pmatrix},\n",
43 | "\\end{equation}\n",
44 | "\n",
45 | "where $\\rho$ is the learning rate, which in this case can be taken to be $1$ WLOG.\n",
46 | "\n",
47 | "If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (Exercise 4.6). FIGURE 4.14 shows two solutions to a toy problem, each started at a different random guess."
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 1,
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "name": "stdout",
57 | "output_type": "stream",
58 | "text": [
59 | "Under construction ...\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "\"\"\"FIGURE 4.14. A toy example with two classes separable by a hyperplane.\n",
65 | "\n",
66 | "The orange line is the least squares solution, which misclassifies one of\n",
67 | "the training points. Also shown are two blue separating hyperplanes found\n",
68 | "by the perceptron learning algorithm with different random starts.\n",
69 | "\"\"\"\n",
70 | "print('Under construction ...')"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "### Issues\n",
78 | "\n",
79 | "There are a number of problems with this algorithm, summarized in Ripley (1996):\n",
80 | "* When the data are separable, there are many solutions, and which one is found depends on the starting values.\n",
81 | "* The \"finite\" number of steps can be very large. The smaller the gap, the longer the time to find it.\n",
82 | "* When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.\n",
83 | "\n",
84 | "The second problem can often be eliminated by seeking a hyperplane not in the orignal space, but in a much enlarged space obtained by creating many basis-function transformations of the original variables. This is analogous to driving the residuals in a ploynomial regression problem down to zero by making the degree sufficiently large.\n",
85 | "\n",
86 | "Perfect separation cannot always be achieved: For example, if observations from two different classes share the same input. It may not be desirable either, since the resulting model is likely to be overfit and will not generalizes well.\n",
87 | "\n",
88 | "A rather elegant solution to the first problem is to add additional constraints to the separating hyperplane."
89 | ]
90 | }
91 | ],
92 | "metadata": {
93 | "kernelspec": {
94 | "display_name": "Python 3",
95 | "language": "python",
96 | "name": "python3"
97 | },
98 | "language_info": {
99 | "codemirror_mode": {
100 | "name": "ipython",
101 | "version": 3
102 | },
103 | "file_extension": ".py",
104 | "mimetype": "text/x-python",
105 | "name": "python",
106 | "nbconvert_exporter": "python",
107 | "pygments_lexer": "ipython3",
108 | "version": "3.5.2"
109 | }
110 | },
111 | "nbformat": 4,
112 | "nbformat_minor": 2
113 | }
114 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section1-introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Chapter 5.Basis Expansions and Regularization\n",
8 | "# $\\S$ 5.1. Introduction"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### Linearity is unrealistic, but necessary\n",
16 | "\n",
17 | "Linear regression, LDA, logistic regression and separating hyper planes all rely on a linear model. It is extremely unlikely that the true function $f(X)$ is actually linear in $X$. In regression problems, $f(X) = \\text{E}(Y|X)$ will typically be nonlinear and nonadditive in $X$.\n",
18 | "\n",
19 | "Representing $f(X)$ by a linear model is usually a convenient, and sometimes necessary, approximation.\n",
20 | "* Convenient because a linear model is easy to interpret, and is the first-order Taylor approximation to $f(X)$.\n",
21 | "* Sometimes necessary because with $N$ small and/or $p$ large, a linear model might be all we are able to fit to the data without overfitting.\n",
22 | "\n",
23 | "Likewise in classification, a linear, Bayes-optimal decision boundary implies that some monotone transformation of $\\text{Pr}(Y=1|X)$ is linear in $X$. This is inevitably an approximation."
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "### Beyond linearity\n",
31 | "\n",
32 | "The core idea in this chapter is to augment/replace the vector of inputs $X$ with additional variables, which are transformatons of $X$, and then use linear models in this new space of derived input features.\n",
33 | "\n",
34 | "Denote by\n",
35 | "\n",
36 | "\\begin{equation}\n",
37 | "h_m(X): \\mathbb{R}^p \\mapsto \\mathbb{R}\n",
38 | "\\end{equation}\n",
39 | "\n",
40 | "the $m$th transformation of $X$ for $m=1,\\cdots,M$. We then model\n",
41 | "\n",
42 | "\\begin{equation}\n",
43 | "f(X) = \\sum_{m=1}^M \\beta_m h_m(X),\n",
44 | "\\end{equation}\n",
45 | "\n",
46 | "_a linear basis expansion_ in $X$.\n",
47 | "\n",
48 | "The beauty of this approach is that once the basis functions $h_m$ have been determined, the models are linear in these new variables, and the fitting proceeds as before."
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "### Examples\n",
56 | "\n",
57 | "Some simple and widely used examples of the $h_m$ are the following.\n",
58 | "\n",
59 | "* $h_m(X)=X_m$, $m=1,\\cdots,p$ recovers the original linear model.\n",
60 | "* $h_m(X)=X_j^2$ or $h_m(X)=X_j X_k$ allows us to augment the inputs with polynomial terms to achieve higher-order Taylor expansions. \n",
61 | "Note, however, that the number of variables grows exponentially in the degrees of the polynomial. A full quadratic model in $p$ variables requires $O(p^2)$ square and corss-product terms, or more generally $O(p^d)$ for a degree-$d$ polynomial.\n",
62 | "* $h_m(X)=\\log(X_j)$, $\\sqrt{X_j}$, $\\cdots$ permits other nonlinear transformations of single inputs. More generally one can use similar functions involving several inputs, such as $h_m(X)=\\|X\\|$.\n",
63 | "* $h_m(X)=I(L_m \\le X_k \\lt U_m)$, an indicator for a region of $X_k$. By breaking the range of $X_k$ up into $M_k$ such nonoverlapping regions results in a model with a piecewise constant contribution for $X_k$."
64 | ]
65 | },
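{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small NumPy sketch of a hand-built basis expansion of the kinds listed above (quadratic terms, a log transform, an indicator), followed by ordinary least squares in the derived features. The data-generating function and the particular choice of $h_m$ are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Simulated data and a hand-built basis h(X) (all choices illustrative)\n",
"rng = np.random.RandomState(5)\n",
"N = 200\n",
"X = rng.uniform(0.1, 3.0, size=(N, 2))\n",
"y = np.sin(2 * X[:, 0]) + np.log(X[:, 1]) + 0.1 * rng.randn(N)\n",
"\n",
"def h(X):\n",
"    X1, X2 = X[:, 0], X[:, 1]\n",
"    return np.column_stack([\n",
"        np.ones(len(X)),           # constant\n",
"        X1, X2,                    # original inputs\n",
"        X1 ** 2, X1 * X2,          # quadratic terms\n",
"        np.log(X2),                # nonlinear transform of a single input\n",
"        (X1 >= 1.5).astype(float)  # indicator for a region of X1\n",
"    ])\n",
"\n",
"H = h(X)\n",
"beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # the model is linear in h(X)\n",
"print('fitted coefficients:', np.round(beta, 3))"
]
},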
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "### Preview\n",
71 | "\n",
72 | "* _Piecewise-polynomials_ and _splines_ allow for local polynomial representations.\n",
73 | "* _wavelet_ bases produce a _dictionary_ $\\mathcal{D}$ consisting of typically a very large number $\\left|\\mathcal{D}\\right|$ of basis functions, far more than we can afford to fit to our data. \n",
74 | "Along with the dictionary we require a method for controlling the complexity of out model, using basis functions from the dictionary. These are three common approaches.\n",
75 | " * Restriction methods, where we decide before-hand to limit the class of functions. Additivity is an example, where we assume that our model has the form\n",
76 | "\\begin{equation}\n",
77 | "f(X) = \\sum_{j=1}^p f_j(X_j) = \\sum_{j=1}^p \\sum_{m=1}^{M-j} \\beta_{jm} h_{jm}(X_j).\n",
78 | "\\end{equation}\n",
79 | " The size of the model is limited by the number of basis functions $M_j$ used for each component function $f_j$.\n",
80 | " * Selection methods, which adaptively scan the dictionary and include only those basis functions $h_m$ that contribute significantly to the fit of the model. Here the variable selection techniques discussed in Chapter 3 are useful. The stagewise greedy approaches such as CART, MARS and boosting fall into this category as well.\n",
81 | " * Regularization methods where we use the entire dictionary but restrict the coefficients. Ridge regression is a simple example of a regularization approach, while the lasso is both a regularization and selection method."
82 | ]
83 | }
84 | ],
85 | "metadata": {
86 | "kernelspec": {
87 | "display_name": "Python 3",
88 | "language": "python",
89 | "name": "python3"
90 | },
91 | "language_info": {
92 | "codemirror_mode": {
93 | "name": "ipython",
94 | "version": 3
95 | },
96 | "file_extension": ".py",
97 | "mimetype": "text/x-python",
98 | "name": "python",
99 | "nbconvert_exporter": "python",
100 | "pygments_lexer": "ipython3",
101 | "version": "3.6.4"
102 | }
103 | },
104 | "nbformat": 4,
105 | "nbformat_minor": 2
106 | }
107 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section2-2-example-south-african-heart-disease-continued.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 5.2.2. Example: South African Heart Disease (Continued)\n",
8 | "\n",
9 | "In $\\S$ 4.4.2 we fit linear logistic regression models to the South African heart disease data. Here we explore nonlinearities in the functions using natural splines.\n",
10 | "\n",
11 | "The functional form of the model is\n",
12 | "\n",
13 | "\\begin{equation}\n",
14 | "\\text{logit}\\left[ \\text{Pr}(\\textsf{chd}|X) \\right] = \\theta_0 + h_1(X_1)^T\\theta_1 + h_2(X_2)^T\\theta_2 + \\cdots + h_p(X_p)^T\\theta_p,\n",
15 | "\\end{equation}\n",
16 | "\n",
17 | "where each of the $\\theta_j$ are vectors of coefficients multiplying their associated vector of natural spline basis functions $h_j$."
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "### Transformation in a whole\n",
25 | "\n",
26 | "We can combine all $p$ vectors of basis functions (and the constant term) into one big vector $h(X)$, and then the model is simply\n",
27 | "\n",
28 | "\\begin{equation}\n",
29 | "h(X)^T\\theta,\n",
30 | "\\end{equation}\n",
31 | "\n",
32 | "with total number of parameters\n",
33 | "\n",
34 | "\\begin{equation}\n",
35 | "\\text{df} = \\sum_{j=1}^p \\text{df}_j.\n",
36 | "\\end{equation}\n",
37 | "\n",
38 | "Each basis function is evaluated at each of the $N$ samples, resulting in a $N \\times \\text{df}$ basis matrix $\\mathbf{H}$. At this point the model is like any other linear logistic model, and the algorithms described in $\\S$ 4.4.1 apply."
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "### Backward stepwise process\n",
46 | "\n",
47 | "We carried out a backward stepwise deletion process, dropping terms from this model while preserving the group structure of each term, rather than dropping one coefficient at a time. The AIC statistic ($\\S$ 7.5) was used to drop terms, and all the terms remaining in the final model would cause AIC to increase if deleted from the model (see TABLE 5.1).\n",
48 | "\n",
49 | "FIGURE 5.4 shows a plot of the final model selected by the stepwise regression. The functions displayed are\n",
50 | "\n",
51 | "\\begin{equation}\n",
52 | "\\hat{f}_j(X_j) = h_j(X_j)^T\\hat\\theta_j,\n",
53 | "\\end{equation}\n",
54 | "\n",
55 | "for each variable $X_j$. $\\mathbf{\\Sigma}$, the covariance matrix of $\\hat\\theta$, is estimated by\n",
56 | "\n",
57 | "\\begin{equation}\n",
58 | "\\hat{\\mathbf{\\Sigma}} = \\left( \\mathbf{H}^T\\mathbf{WH} \\right)^{-1},\n",
59 | "\\end{equation}\n",
60 | "\n",
61 | "where $\\mathbf{W}$ is the diagonal weight matrix from the logistic regression. Hence\n",
62 | "\n",
63 | "\\begin{equation}\n",
64 | "v_j(X_j) = \\text{Var}\\left( \\hat{f}_j(X_j) \\right) = h_j(X_j)^T \\hat{\\mathbf{\\Sigma}}_{jj} h_j(X_j)\n",
65 | "\\end{equation}\n",
66 | "\n",
67 | "is the pointwise variance function of $\\hat{f}_j$, where $\\text{Cov}(\\hat\\theta_j) = \\hat{\\mathbf{\\Sigma}}_{jj}$ is the appropriate sub-matrix of $\\hat{\\mathbf{\\Sigma}}$.\n",
68 | "\n",
69 | "The shaded region in each panel is defined by\n",
70 | "\n",
71 | "\\begin{equation}\n",
72 | "\\hat{f}_j(X_j) \\pm 2\\sqrt{v_j(X_j)}.\n",
73 | "\\end{equation}"
74 | ]
75 | },
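{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of how the pointwise bands can be computed. A cubic polynomial basis per variable stands in for the natural-spline basis $h_j$, the model is fit with (nearly) unpenalized logistic regression from scikit-learn, and $\\hat{\\mathbf{\\Sigma}} = \\left( \\mathbf{H}^T\\mathbf{WH} \\right)^{-1}$ then gives the $\\pm 2$ standard-error band for one component. Everything except the formulas is made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Simulated binary data; a cubic polynomial basis per variable stands in for h_j\n",
"rng = np.random.RandomState(6)\n",
"N = 500\n",
"X = rng.randn(N, 2)\n",
"logit = 1.0 + np.sin(2 * X[:, 0]) - 0.5 * X[:, 1] ** 2\n",
"y = (rng.rand(N) < 1 / (1 + np.exp(-logit))).astype(int)\n",
"\n",
"def basis(x):                                   # df_j = 3 basis functions per variable\n",
"    return np.column_stack([x, x ** 2, x ** 3])\n",
"\n",
"H = np.column_stack([np.ones(N), basis(X[:, 0]), basis(X[:, 1])])\n",
"fit = LogisticRegression(C=1e6, fit_intercept=False, max_iter=2000).fit(H, y)\n",
"theta = fit.coef_[0]\n",
"\n",
"p = 1 / (1 + np.exp(-(H @ theta)))\n",
"W = p * (1 - p)                                 # diagonal IRLS weights\n",
"Sigma = np.linalg.inv(H.T @ (H * W[:, None]))   # (H^T W H)^{-1}\n",
"\n",
"# Fitted component and +-2 SE band for the first variable (columns 1:4 of H)\n",
"idx = slice(1, 4)\n",
"grid = np.linspace(-2, 2, 5)\n",
"h1 = basis(grid)\n",
"f1 = h1 @ theta[idx]\n",
"v1 = np.einsum('ij,jk,ik->i', h1, Sigma[idx, idx], h1)   # h_1^T Sigma_11 h_1\n",
"print(np.column_stack([grid, f1 - 2 * np.sqrt(v1), f1 + 2 * np.sqrt(v1)]))"
]
},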
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "### What linear models could not excavate: nonlinearity\n",
81 | "\n",
82 | "The AIC statistic is slightly more generous than the likelihood-ratio test (deviance test). Both $\\textsf{sbp}$ and $\\textsf{obesity}$ are included in the model, while they are not in the linear model. FIGURE 5.4 explains why, since their contributions are inherently nonlinear.\n",
83 | "\n",
84 | "> These effects at first may come as a surprise, but a explanation lies in the nature of the retrospective data."
85 | ]
86 | }
87 | ],
88 | "metadata": {
89 | "kernelspec": {
90 | "display_name": "Python 3",
91 | "language": "python",
92 | "name": "python3"
93 | },
94 | "language_info": {
95 | "codemirror_mode": {
96 | "name": "ipython",
97 | "version": 3
98 | },
99 | "file_extension": ".py",
100 | "mimetype": "text/x-python",
101 | "name": "python",
102 | "nbconvert_exporter": "python",
103 | "pygments_lexer": "ipython3",
104 | "version": "3.6.4"
105 | }
106 | },
107 | "nbformat": 4,
108 | "nbformat_minor": 2
109 | }
110 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section3-filtering-and-feature-extraction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 5.3. Filtering and Feature Extraction\n",
8 | "\n",
9 | "### Review of the phoneme example\n",
10 | "\n",
11 | "In the previous example, we constructed a $p \\times M$ basis matrix $\\mathbf{H}$, and then transformed our features $x$ into new features\n",
12 | "\n",
13 | "\\begin{equation}\n",
14 | "x^* = \\mathbf{H}^T x.\n",
15 | "\\end{equation}\n",
16 | "\n",
17 | "These filtered versions of the features were then used as inputs into a learning procedure: In the previous example, this was linear logistic regression."
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "### Nonlinear or linear preprocessing of features\n",
25 | "\n",
26 | "Preprocessing of high-dimensional features is a very general and powerful method for improving the performance of a learning algorithm. The preprocessing need not be linear as it was above, but can be a general (nonlinear) function of the form\n",
27 | "\n",
28 | "\\begin{equation}\n",
29 | "x^* = g(x)\n",
30 | "\\end{equation}\n",
31 | "\n",
32 | "The derived features $x^*$ can then be used as inputs into any (linear of nonlinear) learning procedure."
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "### Example: Wavelet and neural network\n",
40 | "\n",
41 | "For example, for signal or image recognition a popular approach is to first transform the raw features via a wavelet transform ($\\S$ 5.9) and then use the features as inputs into a neural network (Chapter 11).\n",
42 | "\n",
43 | "Wavelets are effective in capturing discrete jumps or edges, and the neural network is a powerful tool for constructing nonlinear functions of these features for predicting the target variable. By using domain knowledge to construct appropriate features, one can often improve upon a learning method that has only the raw features $x$ at its disposal."
44 | ]
45 | }
46 | ],
47 | "metadata": {
48 | "kernelspec": {
49 | "display_name": "Python 3",
50 | "language": "python",
51 | "name": "python3"
52 | },
53 | "language_info": {
54 | "codemirror_mode": {
55 | "name": "ipython",
56 | "version": 3
57 | },
58 | "file_extension": ".py",
59 | "mimetype": "text/x-python",
60 | "name": "python",
61 | "nbconvert_exporter": "python",
62 | "pygments_lexer": "ipython3",
63 | "version": "3.6.4"
64 | }
65 | },
66 | "nbformat": 4,
67 | "nbformat_minor": 2
68 | }
69 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section4-0-smoothing-splines.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 5.4. Smoothing Splines\n",
8 | "\n",
9 | "Here we discuss a spline basis method that avoids the knot selection problem completely by using a maximal set of knots.\n",
10 | "\n",
11 | "> The complexity of the fit is controlled by regularization.\n",
12 | "\n",
13 | "Consider the following problem: Among all functions $f(x)$ with two continuous derivatives, find one that minimizes the penalized residual sum of squares\n",
14 | "\n",
15 | "\\begin{equation}\n",
16 | "\\text{RSS}(f, \\lambda) = \\sum_{i=1}^N \\left( y_i - f(x_i) \\right)^2 + \\lambda \\int \\left( f''(t)\\right)^2 dt,\n",
17 | "\\end{equation}\n",
18 | "\n",
19 | "where $\\lambda$ is a fixed _smoothing parameter_. The first term measures closeness to the data, while the second term penalizes curvature in the function, and $\\lambda$ establishes a tradeoff between the two.\n",
20 | "\n",
21 | "Consider the two special cases:\n",
22 | "1. $\\lambda = 0$: $f$ can be any function that interpolates the data.\n",
23 | "2. $\\lambda = \\infty$: the simple least squares line fit, since no second derivative can be tolerated.\n",
24 | "\n",
25 | "These vary from very rough to very smooth, and the hope is that $\\lambda \\in (0,\\infty)$ indexes an interesting class of functions in between."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "### The natural cubic spline as the minimizer\n",
33 | "\n",
34 | "The above $\\text{RSS}$ is defined on an infinite-dimensional function space -- in fact, a Sobolev space of functions for which the second term is defined.\n",
35 | "\n",
36 | "Remarkably, it can be shown that for the $\\text{RSS}$ there is an explicit, finite-dimensional, unique minimizer which is a natural cubic spline with knots at the unique values of the $x_i$, $i=1,\\cdots,N$ (Exercise 5.7).\n",
37 | "\n",
38 | "At face value it seems that the family is still over-parametrized, since there are as many as $N$ knots, which implies $N$ degrees of freedom. However, the penalty term translates to a penalty on the spline coefficients, which are shrunk some of the way toward the linear fit."
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "### Computation\n",
46 | "\n",
47 | "Since the solution is a natural spline, we can write it as\n",
48 | "\n",
49 | "\\begin{equation}\n",
50 | "f(x) = \\sum_{j=1}^N N_j(x) \\theta_j,\n",
51 | "\\end{equation}\n",
52 | "\n",
53 | "where the $N_j(x)$ are an $N$-dimensional set of basis functions for representing this family of natural splines ($\\S$ 5.2.1 and Exercise 5.4). The criterion thus reduces to\n",
54 | "\n",
55 | "\\begin{equation}\n",
56 | "\\text{RSS}(\\theta, \\lambda) = (\\mathbf{y} - \\mathbf{N}\\theta)^T(\\mathbf{y} - \\mathbf{N}\\theta) + \\lambda\\theta^T\\mathbf{\\Omega}_N\\theta,\n",
57 | "\\end{equation}\n",
58 | "\n",
59 | "where\n",
60 | "* $\\lbrace\\mathbf{N}\\rbrace_{ij} = N_j(x_i)$ and \n",
61 | "* $\\lbrace\\mathbf{\\Omega}_N\\rbrace_{jk} = \\int N_j''(t)N_k''(t)dt$.\n",
62 | "\n",
63 | "The solution is easily seen to be\n",
64 | "\n",
65 | "\\begin{equation}\n",
66 | "\\hat\\theta = \\left( \\mathbf{N}^T\\mathbf{N} + \\lambda\\mathbf{\\Omega}_N \\right)^{-1} \\mathbf{N}^T \\mathbf{y},\n",
67 | "\\end{equation}\n",
68 | "\n",
69 | "a generalized ridge regression. The fitted smoothing spline is given by\n",
70 | "\n",
71 | "\\begin{equation}\n",
72 | "\\hat{f}(x) = \\sum_{j=1}^N N_j(x) \\hat\\theta_j.\n",
73 | "\\end{equation}\n",
74 | "\n",
75 | "See the Appendix of this chapter for efficient computational techniques for smoothing splines."
76 | ]
77 | },
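{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough NumPy sketch of the generalized ridge computation. A truncated-power cubic spline basis stands in for the natural-spline basis $N_j$, and the penalty matrix $\\mathbf{\\Omega}$ is obtained by crude numerical integration of the products of (analytic) second derivatives. This illustrates the formula only; it is not an efficient smoothing-spline implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Simulated data (values made up for illustration)\n",
"rng = np.random.RandomState(7)\n",
"N = 100\n",
"x = np.sort(rng.rand(N))\n",
"y = np.sin(12 * (x + 0.2)) / (x + 0.2) + rng.randn(N)\n",
"\n",
"knots = np.linspace(0.05, 0.95, 10)\n",
"\n",
"def B(x):    # truncated-power cubic basis: 1, x, x^2, x^3, (x - xi_l)_+^3\n",
"    x = np.atleast_1d(x)\n",
"    return np.column_stack([np.ones_like(x), x, x ** 2, x ** 3,\n",
"                            np.maximum(x[:, None] - knots, 0) ** 3])\n",
"\n",
"def B2(x):   # second derivatives of the basis functions (analytic)\n",
"    x = np.atleast_1d(x)\n",
"    return np.column_stack([np.zeros_like(x), np.zeros_like(x),\n",
"                            2 * np.ones_like(x), 6 * x,\n",
"                            6 * np.maximum(x[:, None] - knots, 0)])\n",
"\n",
"# Crude numerical integration of Omega_{jk} = int B_j''(t) B_k''(t) dt on [0, 1]\n",
"grid = np.linspace(0, 1, 2001)\n",
"G2 = B2(grid)\n",
"Omega = (G2[:, :, None] * G2[:, None, :]).sum(axis=0) * (grid[1] - grid[0])\n",
"\n",
"lam = 1e-4\n",
"Bx = B(x)\n",
"theta = np.linalg.solve(Bx.T @ Bx + lam * Omega, Bx.T @ y)   # generalized ridge\n",
"print('fitted values at the first five x:', np.round(B(x[:5]) @ theta, 2))"
]
},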
78 | {
79 | "cell_type": "code",
80 | "execution_count": 2,
81 | "metadata": {},
82 | "outputs": [
83 | {
84 | "name": "stdout",
85 | "output_type": "stream",
86 | "text": [
87 | "Under construction ...\n"
88 | ]
89 | }
90 | ],
91 | "source": [
92 | "\"\"\"FIGURE 5.6. A smoothing spline to BMD data with fixed lambda ~= 0.00022\n",
93 | "This choice, corresponding to about 12 degrees of freedom, will be discussed\n",
94 | "in the next section.\"\"\"\n",
95 | "print('Under construction ...')"
96 | ]
97 | }
98 | ],
99 | "metadata": {
100 | "kernelspec": {
101 | "display_name": "Python 3",
102 | "language": "python",
103 | "name": "python3"
104 | },
105 | "language_info": {
106 | "codemirror_mode": {
107 | "name": "ipython",
108 | "version": 3
109 | },
110 | "file_extension": ".py",
111 | "mimetype": "text/x-python",
112 | "name": "python",
113 | "nbconvert_exporter": "python",
114 | "pygments_lexer": "ipython3",
115 | "version": "3.6.4"
116 | }
117 | },
118 | "nbformat": 4,
119 | "nbformat_minor": 2
120 | }
121 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section5-0-automatic-selection-of-the-smoothing-parameters.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 5.5. Automatic Selection of the Smoothing Parameters\n",
8 | "\n",
9 | "> Selecting the placement and number of knots for regression splines can be a combinatorially complex task, and we will not discuss this further here.\n",
10 | "\n",
11 | "The smoothing parameters for regression splines encompass the degree of the splines, and the number and placement of the knots. For splines, we have only the penalty parameter $\\lambda$ to select, since the knots are at all the unique training $X$'s, and cubic degree is almost always used in practice.\n",
12 | "\n",
13 | "Selecting the placement and number of knots for regression splines can be a combinatorially complex task, unless some simplifications are enforced. The MARS procedure (in Chapter 9) uses a greedy algorithm with some additional approximations to achieve a practical compromise. We will not discuss this further here."
14 | ]
15 | }
16 | ],
17 | "metadata": {
18 | "kernelspec": {
19 | "display_name": "Python 3",
20 | "language": "python",
21 | "name": "python3"
22 | },
23 | "language_info": {
24 | "codemirror_mode": {
25 | "name": "ipython",
26 | "version": 3
27 | },
28 | "file_extension": ".py",
29 | "mimetype": "text/x-python",
30 | "name": "python",
31 | "nbconvert_exporter": "python",
32 | "pygments_lexer": "ipython3",
33 | "version": "3.6.4"
34 | }
35 | },
36 | "nbformat": 4,
37 | "nbformat_minor": 2
38 | }
39 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section5-1-fixing-the-degrees-of-freedom.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 5.5. Automatic Selection of the Smoothing Parameters\n",
8 | "## $\\S$ 5.5.1. Fixing the Degrees of Freedom\n",
9 | "\n",
10 | "Since, for smoothing splines,\n",
11 | "\n",
12 | "\\begin{equation}\n",
13 | "\\text{df}_\\lambda = \\text{trace}(\\mathbf{S}_\\lambda)\n",
14 | "\\end{equation}\n",
15 | "\n",
16 | "is monotone in $\\lambda$, we can invert the relationship and specify $\\lambda$ by fixing $\\text{df}$."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### Numerical inverse and its usage to model selection\n",
24 | "\n",
25 | "In practice this can be achieved by simple numerical methods. So, for example, in $\\textsf{R}$ one can use $\\textsf{smooth.spline(x,y,df=6)}$ to specify the amount of smoothing. This encourages a more traditional mode of model selection, where we might try a couple of different values of $\\text{df}$, and select one based on approximate $F$-tests, residual plots and other more subjective criteria. Using $\\text{df}$ in this way provides a uniform approach to compare many different smoothing methods. It is particularly useful in _generalized additive models_ (Chapter 9), where several smoothing methods can be simultaneously used in one model."
26 | ]
27 | }
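,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the inversion: for a smoother of the form $\\mathbf{S}_\\lambda = \\mathbf{N}\\left( \\mathbf{N}^T\\mathbf{N} + \\lambda\\mathbf{\\Omega} \\right)^{-1}\\mathbf{N}^T$, $\\text{df}_\\lambda$ is monotone decreasing in $\\lambda$, so a target $\\text{df}$ can be matched by bisection on the log scale. A random basis matrix and penalty stand in here for the natural-spline quantities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# df(lambda) = trace(S_lambda) for S_lambda = N (N^T N + lambda Omega)^{-1} N^T\n",
"rng = np.random.RandomState(8)\n",
"Nmat = rng.randn(100, 12)                  # stand-in for the natural-spline basis\n",
"A = rng.randn(12, 12)\n",
"Omega = A.T @ A                            # stand-in positive semi-definite penalty\n",
"\n",
"def df(lam):\n",
"    return np.trace(Nmat @ np.linalg.solve(Nmat.T @ Nmat + lam * Omega, Nmat.T))\n",
"\n",
"def lambda_for_df(target, lo=1e-8, hi=1e8, tol=1e-3):\n",
"    # df is monotone decreasing in lambda, so bisect on the log scale\n",
"    while hi / lo > 1 + tol:\n",
"        mid = np.sqrt(lo * hi)\n",
"        if df(mid) > target:               # too flexible: increase the penalty\n",
"            lo = mid\n",
"        else:\n",
"            hi = mid\n",
"    return np.sqrt(lo * hi)\n",
"\n",
"lam6 = lambda_for_df(6.0)\n",
"print('lambda for df = 6:', lam6, ' achieved df:', df(lam6))"
]
}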
28 | ],
29 | "metadata": {
30 | "kernelspec": {
31 | "display_name": "Python 3",
32 | "language": "python",
33 | "name": "python3"
34 | },
35 | "language_info": {
36 | "codemirror_mode": {
37 | "name": "ipython",
38 | "version": 3
39 | },
40 | "file_extension": ".py",
41 | "mimetype": "text/x-python",
42 | "name": "python",
43 | "nbconvert_exporter": "python",
44 | "pygments_lexer": "ipython3",
45 | "version": "3.6.4"
46 | }
47 | },
48 | "nbformat": 4,
49 | "nbformat_minor": 2
50 | }
51 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section5-2-the-biase-variance-tradeoff.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 5.5.2. The Bias-Variance Tradeoff\n",
8 | "\n",
9 | "FIGURE 5.9 shows the effect of the choice of $\\text{df}_\\lambda$ when using a smoothing spline on a simple example:\n",
10 | "\n",
11 | "\\begin{align}\n",
12 | "Y &= f(X) + \\epsilon \\\\\n",
13 | "f(X) &= \\frac{\\sin(12(X+0.2))}{X+0.2},\n",
14 | "\\end{align}\n",
15 | "\n",
16 | "with\n",
17 | "* $X\\sim U[0,1]$,\n",
18 | "* $\\epsilon\\sim N(0,1)$;\n",
19 | "* Our training sample consists of $N=100$ pairs of $x_i, y_i$, drawn independently from this model."
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {},
26 | "outputs": [
27 | {
28 | "name": "stdout",
29 | "output_type": "stream",
30 | "text": [
31 | "Under construction ...\n"
32 | ]
33 | }
34 | ],
35 | "source": [
36 | "\"\"\"FIGURE 5.9. CV results and fitted splines for three different df's.\n",
37 | "\"\"\"\n",
38 | "print('Under construction ...')"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "### Computing bias and variance\n",
46 | "\n",
47 | "The yellow shaded region in the figure represents the pointwise standard error of $\\hat{f}_\\lambda$, e.g., we have shaded the region between \n",
48 | "\n",
49 | "\\begin{equation}\n",
50 | "\\hat{f}_\\lambda(x) \\pm 2 \\cdot \\text{se}(\\hat{f}_\\lambda(x)).\n",
51 | "\\end{equation}\n",
52 | "\n",
53 | "Since $\\hat{\\mathbf{f}} = \\mathbf{S}_\\lambda \\mathbf{y}$,\n",
54 | "\n",
55 | "\\begin{align}\n",
56 | "\\text{Cov}(\\hat{\\mathbf{f}}) &= \\mathbf{S}_\\lambda \\text{Cov}(\\mathbf{y}) \\mathbf{S}_\\lambda^T \\\\\n",
57 | "&= \\mathbf{S}_\\lambda \\mathbf{S}_\\lambda^T.\n",
58 | "\\end{align}\n",
59 | "\n",
60 | "The diagonal contains the pointwise variances at the training $x_i$. The bias is given by\n",
61 | "\n",
62 | "\\begin{align}\n",
63 | "\\text{Bias}(\\hat{\\mathbf{f}}) &= \\mathbf{f} - \\text{E}(\\hat{\\mathbf{f}}) \\\\\n",
64 | "&= \\mathbf{f} - \\mathbf{S}_\\lambda \\mathbf{f},\n",
65 | "\\end{align}\n",
66 | "\n",
67 | "where $\\mathbf{f}$ is the (unknown) vector of evaluations of the true $f$ at the training $X$'s.\n",
68 | "\n",
69 | "The expectations and variances are w.r.t. repeated draws of samples of size $N=100$ from the model $f$. In a similar fashion $\\text{Var}(\\hat{f}_\\lambda(x_0))$ and $\\text{Bias}(\\hat{f}_\\lambda(x_0))$ can be computed at any point $x_0$ (Exercise 5.10)."
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "### Visual interpretation of bias-variance tradeoff\n",
77 | "\n",
78 | "The three fits displayed in FIGURE 5.9 give a visual demonstration of the bias-variance tradeoff associated with selecting the smoothing parameter.\n",
79 | "\n",
80 | "* $\\text{df}_\\lambda = 5$: The spline under fits, and clearly _trims down hills and fills in the valleys_. This leads to a bias that is most dramatic in regions of high curvature. The standard error hand is very narrow, so we estimate a badly biased version of the true function with great reliability!\n",
81 | "* $\\text{df}_\\lambda = 9$: Here the fitted function is close to the true function, although a slight amount of bias seems evident. The variance has not increased appreicably.\n",
82 | "* $\\text{df}_\\lambda = 15$: The fitted function is somewhat wiggly, but close to the true function. The wiggliness also accounts for the increased width of the standard error bands -- the curve is starting to follow some individual points too closely.\n",
83 | "\n",
84 | "Note that in these figures we are seeing a single realization of data and hence fitted spline $\\hat{f}$ in each case, while the bias involves an expectation $\\text{E}(\\hat{f})$.\n",
85 | "\n",
86 | "The middle curve seems \"just right\", in that it has achieved a good compromise between bias and variance."
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "The integrated squared prediction error ($\\text{EPE}$) combines both bias and variance in a single summary:\n",
94 | "\n",
95 | "\\begin{align}\n",
96 | "\\text{EPE}(\\hat{f}_\\lambda) &= \\text{E}\\left( Y - \\hat{f}_\\lambda(X) \\right)^2 \\\\\n",
97 | "&= \\text{Var}(Y) + \\text{E}\\left( \\text{Bias}^2(\\hat{f}_\\lambda(X)) + \\text{Var}(\\hat{f}_\\lambda(X)) \\right) \\\\\n",
98 | "&= \\sigma^2 + \\text{MSE}(\\hat{f}_\\lambda).\n",
99 | "\\end{align}\n",
100 | "\n",
101 | "Note that this is averaged both over the training sample (giving rise to $\\hat{f}_\\lambda$), and the values of the (independently chosen) prediction points $(X,Y)$.\n",
102 | "\n",
103 | "$\\text{EPE}$ is a natural quantity of interest, and does create a tradeoff between bias and variance. The test error rate (blue points) in the top left panel of FIGURE 5.9 suggest that $\\text{df}=9$ is spot on!"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "### Estimation of EPE\n",
111 | "\n",
112 | "Since we don't know the true function, we do not have access to $\\text{EPE}$, and need an estimate. This topic is discussed in some detail in Chapter 7, and techniques such as $K$-fold cross-validation, $\\text{GCV}$ and $C_p$ are all in common use. In FIGURE 5.9 we include the $N$-fold (leave-one-out) cross-validation curve:\n",
113 | "\n",
114 | "\\begin{align}\n",
115 | "\\text{CV}(\\hat{f}_\\lambda) &= \\frac{1}{N} \\sum_{i=1}^N \\left( y_i - \\hat{f}_\\lambda^{(-i)}(x_i)\\right)^2 \\\\\n",
116 | "&= \\frac{1}{N} \\sum_{i=1}^N \\left( \\frac{y_i - \\hat{f}_\\lambda(x_i)}{1 - S_\\lambda(i,i)} \\right)^2,\n",
117 | "\\end{align}\n",
118 | "\n",
119 | "which can (remarkably) be computed for each value of $\\lambda$ from the original fitted values and the diagonal elements $S_\\lambda(i,i)$ of $\\mathbf{S}_\\lambda$ (Exercise 5.13).\n",
120 | "\n",
121 | "The $\\text{EPE}$ and $\\text{CV}$ curves have a similar shape, but the entire $\\text{CV}$ curve is above the $\\text{EPE}$ curve. For some realizations this is reversed, and everall the $\\text{CV}$ curve is approximately unbiased as an estimate of the $\\text{EPE}$ curve."
122 | ]
123 | }
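,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A numerical check of the leave-one-out shortcut for a linear smoother. Ridge regression, itself a generalized ridge fit like the smoothing spline, stands in for $\\mathbf{S}_\\lambda$; the brute-force leave-one-out error and the shortcut formula agree. Data and $\\lambda$ are made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Ridge regression as a stand-in linear smoother S_lambda (data made up)\n",
"rng = np.random.RandomState(9)\n",
"N, p, lam = 50, 5, 2.0\n",
"X = rng.randn(N, p)\n",
"y = X @ rng.randn(p) + rng.randn(N)\n",
"\n",
"S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # smoother matrix\n",
"y_hat = S @ y\n",
"\n",
"# Shortcut: CV = (1/N) sum_i ((y_i - f_hat(x_i)) / (1 - S_ii))^2\n",
"cv_shortcut = np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)\n",
"\n",
"# Brute force: refit without observation i, then predict at x_i\n",
"cv_brute = 0.0\n",
"for i in range(N):\n",
"    keep = np.arange(N) != i\n",
"    beta_i = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p),\n",
"                             X[keep].T @ y[keep])\n",
"    cv_brute += (y[i] - X[i] @ beta_i) ** 2\n",
"cv_brute /= N\n",
"\n",
"print(cv_shortcut, cv_brute)   # the two agree"
]
}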
124 | ],
125 | "metadata": {
126 | "kernelspec": {
127 | "display_name": "Python 3",
128 | "language": "python",
129 | "name": "python3"
130 | },
131 | "language_info": {
132 | "codemirror_mode": {
133 | "name": "ipython",
134 | "version": 3
135 | },
136 | "file_extension": ".py",
137 | "mimetype": "text/x-python",
138 | "name": "python",
139 | "nbconvert_exporter": "python",
140 | "pygments_lexer": "ipython3",
141 | "version": "3.6.4"
142 | }
143 | },
144 | "nbformat": 4,
145 | "nbformat_minor": 2
146 | }
147 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section6-nonparametric-logistic-regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 5.6. Nonparametric Logistic Regression\n",
8 | "\n",
9 | "The smoothing spline problem in $\\S$ 5.4,\n",
10 | "\n",
11 | "\\begin{equation}\n",
12 | "\\text{RSS}(f, \\lambda) = \\sum_{i=1}^N \\left( y_i - f(x_i) \\right)^2 + \\lambda\\int \\left( f''(t) \\right)^2 dt,\n",
13 | "\\end{equation}\n",
14 | "\n",
15 | "is posed in a regression setting. It is typically straightforward to transfer this technology to other domains. Here we consider logistic regression with a single quantitative input $X$. Then the model is\n",
16 | "\n",
17 | "\\begin{equation}\n",
18 | "\\log \\frac{\\text{Pr}(Y=1|X=x)}{\\text{Pr}(Y=0|X=x)} = f(x),\n",
19 | "\\end{equation}\n",
20 | "\n",
21 | "which implies\n",
22 | "\n",
23 | "\\begin{equation}\n",
24 | "\\text{Pr}(Y=1|X=x) = \\frac{e^{f(x)}}{1+e^{f(x)}}.\n",
25 | "\\end{equation}\n",
26 | "\n",
27 | "Fitting $f(x)$ in a smooth fashion leads to a smooth estimate of the conditional probability $\\text{Pr}(Y=1|x)$, which can be used for classification or risk scoring."
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "### MLE\n",
35 | "\n",
36 | "We construct the penalized log-likelihood criterion\n",
37 | "\n",
38 | "\\begin{align}\n",
39 | "l(f;\\lambda) &= \\sum_{i=1}^N \\left[ y_i\\log{p(x_i)} + (1-y_i)\\log{(1-p(x_i))} \\right] - \\frac{\\lambda}{2} \\int \\left( f''(t) \\right)^2 dt \\\\\n",
40 | "&= \\sum_{i=1}^N \\left[ y_i f(x_i) - \\log{(1+e^{f(x_i)})} \\right] - \\frac{\\lambda}{2} \\int \\left( f''(t) \\right)^2 dt,\n",
41 | "\\end{align}\n",
42 | "\n",
43 | "where $p(x) = \\text{Pr}(Y=1|x)$. The first term is the log-likelihood on the binomial distribution (c.f. Chapter 4, page 120)."
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "### Iterative procedure using Newton-Raphson, again\n",
51 | "\n",
52 | "Arguments similar to those used in $\\S$ 5.4 show that the optimal $f$ is a finite-dimensional natural spline with knots at the unique values of $x$. This means that we can represent\n",
53 | "\n",
54 | "\\begin{equation}\n",
55 | "f(x) = \\sum_{j=1}^N N_j(x) \\theta_j.\n",
56 | "\\end{equation}\n",
57 | "\n",
58 | "We compute the first and second derivatives\n",
59 | "\n",
60 | "\\begin{align}\n",
61 | "\\frac{\\partial l(\\theta)}{\\partial\\theta} &= \\mathbf{N}^T(\\mathbf{y}-\\mathbf{p}) - \\lambda\\mathbf{\\Omega}\\theta, \\\\\n",
62 | "\\frac{\\partial^2 l(\\theta)}{\\partial\\theta\\partial\\theta^T} &= -\\mathbf{N}^T\\mathbf{WN} - \\lambda\\mathbf{\\Omega},\n",
63 | "\\end{align}\n",
64 | "\n",
65 | "where\n",
66 | "* $\\mathbf{p}$ is the $N$-vector with elements $p(x_i)$,\n",
67 | "* $\\mathbf{W}$ is a diagonal matrix of weights $p(x_i)(1-p(x_i))$.\n",
68 | "\n",
69 | "The first derivative is nonlinear in $\\theta$, so we need to use an iterative algorithm as in $\\S$ 4.4.1. Using Newton-Raphson as for linear logistic regression, the update equation can be written\n",
70 | "\n",
71 | "\\begin{align}\n",
72 | "\\theta^{\\text{new}} &= \\left( \\mathbf{N}^T\\mathbf{WN} + \\lambda\\mathbf{\\Omega} \\right)^{-1} \\mathbf{N}^T\\mathbf{W} \\left( \\mathbf{N}\\theta^{\\text{old}} + \\mathbf{W}^{-1}(\\mathbf{y}-\\mathbf{p}) \\right) \\\\\n",
73 | "&= \\left( \\mathbf{N}^T\\mathbf{WN} + \\lambda\\mathbf{\\Omega} \\right)^{-1} \\mathbf{N}^T\\mathbf{Wz}.\n",
74 | "\\end{align}\n",
75 | "\n",
76 | "We can also express this update in terms of the fitted values\n",
77 | "\n",
78 | "\\begin{align}\n",
79 | "\\mathbf{f}^{\\text{new}} &= \\mathbf{N} \\left( \\mathbf{N}^T\\mathbf{WN} + \\lambda\\mathbf{\\Omega} \\right)^{-1} \\mathbf{N}^T\\mathbf{W} \\left( \\mathbf{f}^{\\text{old}} + \\mathbf{W}^{-1}(\\mathbf{y}-\\mathbf{p}) \\right) \\\\\n",
80 | "&= \\mathbf{S}_{\\lambda,\\omega}\\mathbf{z}.\n",
81 | "\\end{align}"
82 | ]
83 | },
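{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the penalized Newton-Raphson (IRLS) iteration above. A polynomial basis and a crude penalty matrix stand in for the natural-spline $\\mathbf{N}$ and $\\mathbf{\\Omega}$; only the update formula itself is taken from the text, and the data are simulated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Simulated binary data; polynomial basis and crude penalty stand in for N, Omega\n",
"rng = np.random.RandomState(10)\n",
"n = 300\n",
"x = rng.uniform(-2, 2, n)\n",
"y = (rng.rand(n) < 1 / (1 + np.exp(-2 * np.sin(2 * x)))).astype(float)\n",
"\n",
"Nmat = np.column_stack([x ** k for k in range(6)])   # basis evaluated at the x_i\n",
"Omega = np.diag([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])      # constant/linear unpenalized\n",
"lam = 1e-3\n",
"\n",
"theta = np.zeros(Nmat.shape[1])\n",
"for _ in range(20):                                  # Newton-Raphson iterations\n",
"    eta = Nmat @ theta\n",
"    p = 1 / (1 + np.exp(-eta))\n",
"    W = np.clip(p * (1 - p), 1e-6, None)             # IRLS weights, kept away from 0\n",
"    z = eta + (y - p) / W                            # working response\n",
"    theta = np.linalg.solve(Nmat.T @ (Nmat * W[:, None]) + lam * Omega,\n",
"                            Nmat.T @ (W * z))\n",
"\n",
"p_hat = 1 / (1 + np.exp(-(Nmat @ theta)))\n",
"print('fitted Pr(Y=1|x) at the first five x:', np.round(p_hat[:5], 2))"
]
},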
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "### Comparison with regressions\n",
89 | "\n",
90 | "Referring back to the regression solution of the smoothing spline problem in $\\S$ 5.4,\n",
91 | "\n",
92 | "\\begin{align}\n",
93 | "\\hat\\theta &= \\left( \\mathbf{N}^T\\mathbf{N} + \\lambda\\mathbf{\\Omega}_N \\right)^{-1} \\mathbf{N}^T \\mathbf{y} \\\\\n",
94 | "\\hat{\\mathbf{f}} &= \\mathbf{N} \\left( \\mathbf{N}^T\\mathbf{N} + \\lambda\\mathbf{\\Omega}_N \\right)^{-1} \\mathbf{N}^T \\mathbf{y} \\\\\n",
95 | "&= \\mathbf{S}_\\lambda \\mathbf{y},\n",
96 | "\\end{align}\n",
97 | "\n",
98 | "we see that the update fits a weighted smoothing spline to the working response $\\mathbf{z}$ (Exercise 5.12).\n",
99 | "\n",
100 | "The form of $\\mathbf{f}^{\\text{new}}$ is suggestive. It is tempting to replace $\\mathbf{S}_{\\lambda,\\omega}$ by any nonparametric (weighted) regression operator, and obtain general families of nonparametric logistic regression models.\n",
101 | "\n",
102 | "Although here $x$ is one-dimensional, this procedure generalizes naturally to higher-dimensional $x$. These extensions are at the heart of _generalized additive models_, which we pursue in Chapter 9."
103 | ]
104 | }
105 | ],
106 | "metadata": {
107 | "kernelspec": {
108 | "display_name": "Python 3",
109 | "language": "python",
110 | "name": "python3"
111 | },
112 | "language_info": {
113 | "codemirror_mode": {
114 | "name": "ipython",
115 | "version": 3
116 | },
117 | "file_extension": ".py",
118 | "mimetype": "text/x-python",
119 | "name": "python",
120 | "nbconvert_exporter": "python",
121 | "pygments_lexer": "ipython3",
122 | "version": "3.6.4"
123 | }
124 | },
125 | "nbformat": 4,
126 | "nbformat_minor": 2
127 | }
128 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section8-0-regularization-and-reproducing-kernel-hilbert-space.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 5.8. Regularization and Reproducing Kernel Hilbert Spaces\n",
8 | "\n",
9 | "### Formulation\n",
10 | "\n",
11 | "A general class of regularization problems has the form\n",
12 | "\n",
13 | "\\begin{equation}\n",
14 | "\\min_{f\\in\\mathcal{H}} \\left[ \\sum_{i=1}^N L(y_i, f(x_i)) + \\lambda J(f) \\right],\n",
15 | "\\end{equation}\n",
16 | "\n",
17 | "where\n",
18 | "* $L(y, f(x))$ is a loss function,\n",
19 | "* $J(f)$ is a penalty functional, and\n",
20 | "* $\\mathcal{H}$ is a space of functions on which $J(f)$ is defined.\n",
21 | "\n",
22 | "Girosi et al. (1995) describe quite general penalty functionals of the form\n",
23 | "\n",
24 | "\\begin{equation}\n",
25 | "J(f) = \\int_{\\mathbb{R}^d} \\frac{|\\tilde{f}(s)|^2}{\\tilde{G}(s)} ds,\n",
26 | "\\end{equation}\n",
27 | "\n",
28 | "where\n",
29 | "* $\\tilde{f}$ denotes the Fourier transform of $f$,\n",
30 | "* $\\tilde{G}$ is some positive function that $\\tilde{G} \\rightarrow 0$ as $\\|s\\| \\rightarrow \\infty$, and\n",
31 | "* the idea is that $1/\\tilde{G}$ increases the penalty for high-frequency components of $f$."
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "### Solution\n",
39 | "\n",
40 | "Under some additional assumptions they show that the solutions have the form\n",
41 | "\n",
42 | "\\begin{equation}\n",
43 | "f(X) = \\sum_{k=1}^K \\alpha_k\\phi_k(X) + \\sum_{i=1}^N \\theta_i G(X-x_i),\n",
44 | "\\end{equation}\n",
45 | "\n",
46 | "where\n",
47 | "* $\\phi_k$ span the null space of the penalty functional $J$, \n",
48 | "* $G$ is the inverse Fourier transform of $\\tilde{G}$.\n",
49 | "\n",
50 | "Smoothing splines and thin-plate splines fall into this framework.\n",
51 | "\n",
52 | "> The remarkable feature of this solution is that while the criterion is defined over an infinite-dimensional space, the solution is finite-dimensional."
53 | ]
54 | }
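{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a concrete, hedged illustration of this finite-dimensional solution, the sketch below fits $f(x) = \\sum_{i=1}^N \\theta_i G(x-x_i)$ with a Gaussian $G$ by penalized least squares (i.e., kernel ridge regression) on synthetic one-dimensional data. The kernel width and $\\lambda$ are arbitrary choices, and the null-space terms $\\phi_k$ are omitted for simplicity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"x = np.sort(rng.uniform(-2, 2, 50))\n",
"y = np.sin(3 * x) + 0.3 * rng.randn(50)\n",
"\n",
"def G(u, width=0.3):\n",
"    # Gaussian kernel; the inverse Fourier transform of a Gaussian G-tilde is again Gaussian\n",
"    return np.exp(-0.5 * (u / width) ** 2)\n",
"\n",
"K = G(x[:, None] - x[None, :])            # N x N matrix of G(x_i - x_j)\n",
"lam = 0.1                                 # regularization parameter lambda\n",
"theta = np.linalg.solve(K + lam * np.eye(len(x)), y)    # penalized least squares coefficients\n",
"\n",
"x_grid = np.linspace(-2, 2, 200)\n",
"f_hat = G(x_grid[:, None] - x[None, :]) @ theta         # evaluate the expansion on a grid"
]
}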
55 | ],
56 | "metadata": {
57 | "kernelspec": {
58 | "display_name": "Python 3",
59 | "language": "python",
60 | "name": "python3"
61 | },
62 | "language_info": {
63 | "codemirror_mode": {
64 | "name": "ipython",
65 | "version": 3
66 | },
67 | "file_extension": ".py",
68 | "mimetype": "text/x-python",
69 | "name": "python",
70 | "nbconvert_exporter": "python",
71 | "pygments_lexer": "ipython3",
72 | "version": "3.6.4"
73 | }
74 | },
75 | "nbformat": 4,
76 | "nbformat_minor": 2
77 | }
78 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section9-0-wavelet-smoothing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 5.9. Wavelet Smoothing\n",
8 | "\n",
9 | "We have seen two different modes of operation with dictionaries of basis function.\n",
10 | "* With regression splines, we select a subset of the bases, using either subject-matter knowledge, or else automatically. The more adaptive procedures such as MARS (Chapter 9) can capture both smooth and non-smooth behaviour.\n",
11 | "* With smooth splines, we use a complete basis, but then shrink the coefficients toward smoothness."
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "### Sparse representation\n",
19 | "\n",
20 | "Wavelets typically use a complete orthonormal basis to represent functions, but then shrink and select the coefficients toward a _sparse_ representation. Just as a smooth function can be represented by a few spline basis functions, a mostly flat function with a few isolated bumps can be represented with a few (bumpy) basis function."
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "### Time and frequency localization\n",
28 | "\n",
29 | "Wavelets bases are very popular in signal processing and compression, since they are able to represent both smooth and/or locally bumpy functions in an efficient way -- a phenomenon dubbed _time and frequency localization_. In contrast, the traditional Fourier basis allows only frequency localization."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "### Introduction\n",
37 | "\n",
38 | "Before we give details, let's look at the Haar wavelets in the left panel of FIGURE 5.16 to get an intuitive idea of how wavelet smoothing works.\n",
39 | "\n",
40 | "The vertical axis indicates the scale (frequency) of the wavelets, from low scale at the bottom to high scale at the top. At each scale the wavelets are \"packed in\" side-by-side to completely fill the time axis: We have only shown a selected subset.\n",
41 | "\n",
42 | "Wavelet smoothing fits the coefficients for this basis by least squares, and then thresholds (discards, filters) the smaller coefficients. Since there are many basis functions at each scale, it can use bases where it needs them and discard the unnecessary ones, to achieve time and frequency localization. The Haar wavelets are simple to understand, but not smooth enought for most purposes. The _symmlet_ wavelets in the right panel of FIGURE 5.16 have the same orthonormal properties, but are smoother."
43 | ]
44 | },
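{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the Haar basis concrete, the snippet below builds the $N \\times N$ orthonormal Haar basis matrix by the standard recursion (coarse averages on top, finer differences below) and checks its orthonormality. This is only an illustration, and the choice $N = 8$ is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def haar_matrix(N):\n",
"    # Rows of the returned matrix form an orthonormal Haar basis (N a power of 2)\n",
"    if N == 1:\n",
"        return np.array([[1.0]])\n",
"    H = haar_matrix(N // 2)\n",
"    top = np.kron(H, [1.0, 1.0])                    # coarse (scaling) rows\n",
"    bottom = np.kron(np.eye(N // 2), [1.0, -1.0])   # detail (wavelet) rows\n",
"    return np.vstack([top, bottom]) / np.sqrt(2.0)\n",
"\n",
"H = haar_matrix(8)\n",
"print(np.allclose(H @ H.T, np.eye(8)))   # True: the rows are orthonormal"
]
},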
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "### Nuclear magnetic resonance (NMR) data\n",
50 | "\n",
51 | "FIGURE 5.17 displays an NMR signal, which appears to be composed of\n",
52 | "* smooth components and\n",
53 | "* isolated spikes,\n",
54 | "* plus some noise.\n",
55 | "\n",
56 | "The wavelet transform, using a symmlet basis, is shown in the lower left panel. The wavelet coefficients are arranged in rows, from lowest scale at the bottom to highest scale at the top.\n",
57 | "\n",
58 | "The bottom right panel shows the wavelet coefficients after thresholding. The thresholding procedure is the same soft-thresholding rule that arises in the lasso procedure for linear regression ($\\S$ 3.4.2).\n",
59 | "\n",
60 | "Notice that many of the smaller coefficients have been set to zero. The green curve in the top panel shows the back-transform of the thresholded coefficients: This is the smoothed version of the orignal signal."
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 1,
66 | "metadata": {},
67 | "outputs": [
68 | {
69 | "name": "stdout",
70 | "output_type": "stream",
71 | "text": [
72 | "Data not found on the official ESL page ):\n"
73 | ]
74 | }
75 | ],
76 | "source": [
77 | "\"\"\"FIGURE 5.17\"\"\"\n",
78 | "print('Data not found on the official ESL page ):')"
79 | ]
80 | }
81 | ],
82 | "metadata": {
83 | "kernelspec": {
84 | "display_name": "Python 3",
85 | "language": "python",
86 | "name": "python3"
87 | },
88 | "language_info": {
89 | "codemirror_mode": {
90 | "name": "ipython",
91 | "version": 3
92 | },
93 | "file_extension": ".py",
94 | "mimetype": "text/x-python",
95 | "name": "python",
96 | "nbconvert_exporter": "python",
97 | "pygments_lexer": "ipython3",
98 | "version": "3.6.4"
99 | }
100 | },
101 | "nbformat": 4,
102 | "nbformat_minor": 2
103 | }
104 |
--------------------------------------------------------------------------------
/chapter05-basis-expansions-and-regularization/section9-2-adaptive-wavelet-filtering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 5.9.2. Adaptive Wavelet Filtering\n",
8 | "\n",
9 | "Wavelets are particular useful when the data are measured on a uniform lattice, such as a discretized signal, image, or a time series. We will focus on the one-dimensional case, and having $N=2^J$ lattice-points is convenient.\n",
10 | "\n",
11 | "Suppose\n",
12 | "* $\\mathbf{y}$ is the response vector,\n",
13 | "* $\\mathbf{W}$ is the $N \\times N$ orthonormal wavelet basis matrix evaluated at the $N$ uniformly spaced observations.\n",
14 | "\n",
15 | "Then\n",
16 | "\n",
17 | "\\begin{equation}\n",
18 | "\\mathbf{y}^* = \\mathbf{W}^T\\mathbf{y}\n",
19 | "\\end{equation}\n",
20 | "\n",
21 | "is called the _wavelet transform_ of $\\mathbf{y}$ (and is the full least squares regression coefficients)."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "A popular method for adaptive wavelet fitting is known as _SURE shrinkage_ (Stein Unbiased Risk Estimation, Donoho and Johnstone, 1994). We start with the criterion\n",
29 | "\n",
30 | "\\begin{equation}\n",
31 | "\\min_{\\boldsymbol\\theta} \\|\\mathbf{y} - \\mathbf{W}\\boldsymbol\\theta\\|_2^2 + 2\\lambda\\|\\boldsymbol\\theta\\|_1,\n",
32 | "\\end{equation}\n",
33 | "\n",
34 | "which is the same as the lasso criterion in Chapter 3.\n",
35 | "\n",
36 | "Because $\\mathbf{W}$ is orthonormal, this leads to the simple solution:\n",
37 | "\n",
38 | "\\begin{equation}\n",
39 | "\\hat\\theta_j = \\text{sign}(y_j^*)(|y_j^*|-\\lambda)_+.\n",
40 | "\\end{equation}\n",
41 | "\n",
42 | "The least squares coefficients are translated toward zero, and truncated at zero. The fitted function (vector) is then given by the _inverse wavelet transform_\n",
43 | "\n",
44 | "\\begin{equation}\n",
45 | "\\hat{\\mathbf{f}} = \\mathbf{W}\\hat{\\boldsymbol\\theta}\n",
46 | "\\end{equation}"
47 | ]
48 | },
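{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch below applies this recipe to a synthetic signal (a smooth component plus an isolated spike), assuming the PyWavelets package (`pywt`) is available: take the symmlet-8 wavelet transform, soft-threshold the detail coefficients at a threshold $\\lambda$ (here the $\\sigma\\sqrt{2\\log N}$ value discussed next), and invert the transform. The signal, the noise level and the wavelet are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pywt  # PyWavelets, assumed installed\n",
"\n",
"rng = np.random.RandomState(0)\n",
"N = 1024\n",
"t = np.arange(N) / N\n",
"signal = np.sin(4 * np.pi * t) + 2.0 * ((t > 0.37) & (t < 0.41))   # smooth part plus a spike\n",
"sigma = 0.3\n",
"y = signal + sigma * rng.randn(N)\n",
"\n",
"coeffs = pywt.wavedec(y, 'sym8', mode='periodization')      # symmlet-8 wavelet transform\n",
"lam = sigma * np.sqrt(2 * np.log(N))                         # simple threshold (see next cell)\n",
"coeffs[1:] = [pywt.threshold(c, value=lam, mode='soft')      # soft-threshold detail coefficients\n",
"              for c in coeffs[1:]]\n",
"f_hat = pywt.waverec(coeffs, 'sym8', mode='periodization')   # inverse wavelet transform"
]
},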
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "### Choice of $\\lambda$\n",
54 | "\n",
55 | "A simple choice of $\\lambda$ is\n",
56 | "\n",
57 | "\\begin{equation}\n",
58 | "\\lambda = \\sigma\\sqrt{2\\log N},\n",
59 | "\\end{equation}\n",
60 | "\n",
61 | "where $\\sigma$ is an estimate of the standard deviation of the noise.\n",
62 | "\n",
63 | "#### Motivation for this choice\n",
64 | "\n",
65 | "Since $\\mathbf{W}$ is an orthonormal transformation, if the elements of $\\mathbf{y}$ are white noise (independent Gaussian variates with mean 0 and variance $\\sigma^2$), then so are $\\mathbf{y}^*$. Furthermore if random variables $Z_1, Z_2, \\cdots, Z_N$ are white noise, the expected maximum $|Z_j|$ for $j=1,\\cdots,N$ is approximately $\\sigma\\sqrt{2\\log N}$. Hence all coefficients below $\\sigma\\sqrt{2\\log N}$ are likely to be noise and are set to zero.\n",
66 | "\n"
67 | ]
68 | },
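{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick simulation of this motivation (with an arbitrary $N$ and $\\sigma = 1$): the average maximum of $|Z_j|$ over $N$ white-noise variates is of the same order as $\\sigma\\sqrt{2\\log N}$, and the ratio tends to 1 as $N$ grows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"N, sigma = 1024, 1.0\n",
"max_abs = np.abs(sigma * rng.randn(2000, N)).max(axis=1)   # max |Z_j| over N variates, 2000 repeats\n",
"print(max_abs.mean(), sigma * np.sqrt(2 * np.log(N)))      # comparable; the ratio approaches 1 as N grows"
]
},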
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "### Choice of $\\mathbf{W}$\n",
74 | "\n",
75 | "The space $\\mathbf{W}$ could be any basis of orthonormal functions: Polynomials, natural splines or cosinusoids. What makes wavelets special is the particular form of basis functions used, which allows for a representation _localized in time and in frequency_."
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "### NMR signal revisited"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "### Similarity between SURE and smoothing spline criteria\n",
90 | "\n",
91 | "* Both are hierarchically structured from coarse to fine detail, although wavelets are also localized in time within each resolution level.\n",
92 | "* The splines build in a basis toward smooth functions by imposing differential shrinking constants $d_k$. Early version of SURE shrinkage treated all scales equally.\n",
93 | "* The spline $L_2$ penalty cause pure shrinkage, while the SURE $L_1$ penalty does shrinkage and selection.\n",
94 | "\n",
95 | "More generally smoothing splines achieve compression of the original signal by imposing smoothness, while wavelets impose sparsity. FIGURE 5.19 compares a wavelet fit (using SURE shrinkage) to a smoothing spline fit (using corss-validation) on two examples different in nature.\n",
96 | "\n",
97 | "For the NMR data in the upper panel, the smoothing spline introducees detail everywhere in order to capture the detail in the isolated spike; the wavelet fit nicely localizes the spike. In the lower panel, the true function is smooth, and the noise is relatively high. The wavelet fit has let in some additional and unnecessary wiggles -- a price it pays in variance for the additional adaptivity."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 1,
103 | "metadata": {},
104 | "outputs": [
105 | {
106 | "name": "stdout",
107 | "output_type": "stream",
108 | "text": [
109 | "Under construction ...\n"
110 | ]
111 | }
112 | ],
113 | "source": [
114 | "\"\"\"FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two examples.\n",
115 | "Each panel compares the SURE-shrunk wavelet fit to the cross-validated smoothing spline fit.\"\"\"\n",
116 | "print('Under construction ...')"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "### Computational aspects\n",
124 | "\n",
125 | "The wavelet transform is not performed by matrix multiplication as in\n",
126 | "\n",
127 | "\\begin{equation}\n",
128 | "\\mathbf{y}^* = \\mathbf{W}^T\\mathbf{y}.\n",
129 | "\\end{equation}\n",
130 | "\n",
131 | "In fact, using clever pyramidal schemes $\\mathbf{y}^*$ can be obtained in $O(N)$ computations, which is even faster than the $N\\log(N)$ of the FFT. It is easy to see for the Haar basis (Exercise 5.19). Likewise, the inverse wavelet transform $\\mathbf{W}\\hat{\\boldsymbol\\theta}$ is also $O(N)$."
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "This has been a very brief glimpse of this vast and growing field. There is a very large mathematical and computational base built on wavelets. Modern image compression is often performed using two-dimensional wavelet representations."
139 | ]
140 | }
141 | ],
142 | "metadata": {
143 | "kernelspec": {
144 | "display_name": "Python 3",
145 | "language": "python",
146 | "name": "python3"
147 | },
148 | "language_info": {
149 | "codemirror_mode": {
150 | "name": "ipython",
151 | "version": 3
152 | },
153 | "file_extension": ".py",
154 | "mimetype": "text/x-python",
155 | "name": "python",
156 | "nbconvert_exporter": "python",
157 | "pygments_lexer": "ipython3",
158 | "version": "3.6.4"
159 | }
160 | },
161 | "nbformat": 4,
162 | "nbformat_minor": 2
163 | }
164 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section0-introduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Chapter 6. Kernel Smoothing Methods\n",
8 | "\n",
9 | "In this chapter we describe a class of regression techniques that achieve flexibility in estimating the regression function $f(X)$ over the domain $\\mathbb{R}^p$ by fitting a different but simple model separately at each query point $x_0$. This is done for some neighborhood of the target point $x_0$ to fit the simple model, and in such a way that the resulting estimated function $\\hat{f}(X)$ is smooth in $\\mathbb{R}^p$.\n",
10 | "\n",
11 | "This localization is achieved via a weighting function or _kernel_ $K_\\lambda(x_0,x_i)$, which assigns a weight to $x_i$ based on its distance from $x_0$. The kernels $K_\\lambda$ are typically indexed by a paramter $\\lambda$ that dictates the width of the neighborhood. These _memory-based_ methods require in principle little or no training. The only paramter that needs to be determined from the training data is $\\lambda$. The model, however, is the entire training data set.\n",
12 | "\n",
13 | "We also discuss more general classes of kernel-based techniques, which tie in with structured methods in other chapters, and are useful for density estimation and classficiation."
14 | ]
15 | },
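{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this localization idea (synthetic data, with an Epanechnikov kernel and a width chosen by hand), the snippet below computes a kernel-weighted average of the $y_i$ at a single query point $x_0$; repeating this over a grid of query points traces out a smooth $\\hat{f}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def epanechnikov(t):\n",
"    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)\n",
"\n",
"def local_average(x0, x, y, lam=0.2):\n",
"    w = epanechnikov((x - x0) / lam)    # weights K_lambda(x0, x_i)\n",
"    return np.sum(w * y) / np.sum(w)    # kernel-weighted local fit at x0 only\n",
"\n",
"rng = np.random.RandomState(0)\n",
"x = rng.uniform(0, 1, 100)\n",
"y = np.sin(4 * x) + 0.3 * rng.randn(100)\n",
"print(local_average(0.5, x, y))"
]
},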
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "### Caution, don't confuse!\n",
21 | "\n",
22 | "> In this chapter kernels are mostly used as a device for localization.\n",
23 | "\n",
24 | "The techniques in this chapter should not be confused with those associated with the more recent usage of the phrase \"kernel methods\". In this chapter kernels are mostly used as a device for localization. We discuss kernel methods in $\\S$ 5.8, 14.5.4, 18.5, and Chapter 12; in those contexts the kernel computes an inner product in a high-dimensional (implicit) feature space, and is used for regularized nonlinear modeling. We make some connections to the methodology in this chapter at the end of $\\S$ 6.7."
25 | ]
26 | }
27 | ],
28 | "metadata": {
29 | "kernelspec": {
30 | "display_name": "Python 3",
31 | "language": "python",
32 | "name": "python3"
33 | },
34 | "language_info": {
35 | "codemirror_mode": {
36 | "name": "ipython",
37 | "version": 3
38 | },
39 | "file_extension": ".py",
40 | "mimetype": "text/x-python",
41 | "name": "python",
42 | "nbconvert_exporter": "python",
43 | "pygments_lexer": "ipython3",
44 | "version": "3.6.4"
45 | }
46 | },
47 | "nbformat": 4,
48 | "nbformat_minor": 2
49 | }
50 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section1-1-local-linear-regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 6.1.1. Local Linear Regression\n",
8 | "\n",
9 | "The smooth kernel fit still has a problems. Locally-weighted averages can be badly biased on the boundaries of the domain, because of the asymmetry of the kernel in that region, as exhibited in FIGURE 6.3 (left panel).\n",
10 | "\n",
11 | "By fitting straight lines rather than constants locally, we can remove this bias exactly to first order; see the right panel of FIGURE 6.3. Actually, this bias can be present in the interior of the domain as well, if the $X$ values are not equally spaced (for the same reasons, but usually less severe). Again locally weighted linear regression will make a first-order correction."
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "### Formulation\n",
19 | "\n",
20 | "Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:\n",
21 | "\n",
22 | "\\begin{equation}\n",
23 | "\\min_{\\alpha(x_0),\\beta(x_0)} \\sum_{i=1}^N K_\\lambda(x_0,x_i) \\left( y_i - \\alpha(x_0) - \\beta(x_0)x_i \\right)^2.\n",
24 | "\\end{equation}\n",
25 | "\n",
26 | "The estimate is then\n",
27 | "\n",
28 | "\\begin{equation}\n",
29 | "\\hat{f}(x_0) = \\hat\\alpha(x_0) + \\hat\\beta(x_0).\n",
30 | "\\end{equation}\n",
31 | "\n",
32 | "Notice that although we fit an entire linear model to the data in the region, we only use it to evaluate the fit at the single point $x_0$."
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "### Matrix formulation and equivalent kernel\n",
40 | "\n",
41 | "Define\n",
42 | "* the vector-valued function $b(x)^T = (1, x)$,\n",
43 | "* the $N\\times2$ regression matrix $\\mathbf{B}$ with $i$th row $b(x_i)^T$, and\n",
44 | "* the $N\\times N$ diagonal matrix $\\mathbf{W}(x_0)$ with $i$th diagonal element $K_\\lambda(x_0,x_i)$.\n",
45 | "\n",
46 | "Then\n",
47 | "\n",
48 | "\\begin{align}\n",
49 | "\\hat{f}(x_0) &= b(x_0)^T \\left( \\mathbf{B}^T\\mathbf{W}(x_0)\\mathbf{B} \\right)^{-1} \\mathbf{B}^T \\mathbf{W}(x_0) \\mathbf{y} \\\\\n",
50 | "&= \\sum_{i=1}^N l_i(x_0)y_i.\n",
51 | "\\end{align}\n",
52 | "\n",
53 | "Note that $l_i(x_0)$ do not involve $\\mathbf{y}$ and thus the estimate is _linear_ in $y_i$. These weights $l_i(x_0)$ combine the weighting kernel $K_\\lambda(x_0,\\cdot)$ and the least squares operations, and are sometimes referred to as the _equivalent kernel_.\n",
54 | "\n",
55 | "FIGURE 6.4 illustrates the effect of local linear regression on the equivalent kernel."
56 | ]
57 | },
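{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch below implements the matrix formula above directly on synthetic data: at each target point $x_0$ it solves the weighted least squares problem with Epanechnikov weights and evaluates the local line at $x_0$ only. The kernel, its width and the data are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def epanechnikov(t):\n",
"    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)\n",
"\n",
"def local_linear(x0, x, y, lam=0.2):\n",
"    w = epanechnikov((x - x0) / lam)             # K_lambda(x0, x_i)\n",
"    B = np.column_stack([np.ones_like(x), x])    # rows b(x_i)^T = (1, x_i)\n",
"    WB = w[:, None] * B                          # W(x0) B\n",
"    beta = np.linalg.solve(B.T @ WB, WB.T @ y)   # (B^T W B)^{-1} B^T W y\n",
"    return np.array([1.0, x0]) @ beta            # evaluate the local fit at x0 only\n",
"\n",
"rng = np.random.RandomState(0)\n",
"x = np.sort(rng.uniform(0, 1, 100))\n",
"y = np.sin(4 * x) + 0.3 * rng.randn(100)\n",
"x_grid = np.linspace(0, 1, 50)\n",
"f_hat = np.array([local_linear(x0, x, y) for x0 in x_grid])"
]
},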
58 | {
59 | "cell_type": "code",
60 | "execution_count": 1,
61 | "metadata": {},
62 | "outputs": [
63 | {
64 | "name": "stdout",
65 | "output_type": "stream",
66 | "text": [
67 | "Under construction ...\n"
68 | ]
69 | }
70 | ],
71 | "source": [
72 | "\"\"\"FIGURE 6.4. Equivalent kernel li(x0) for local regression\"\"\"\n",
73 | "print('Under construction ...')"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "### Automatic kernel carpentry\n",
81 | "\n",
82 | "Historically, the bias in the Nadaraya-Watson and other local average kernel methods were corrected by modifying the kernel. These modifications were based on theoretical asymptotic MSE considerations, and besides being tedious to implement, are only approximate for finite sample sizes.\n",
83 | "\n",
84 | "Local linear regression _automatically_ modfies the kernel to correct the bias _exactly_ to first order, a phenomenon dubbed as _automatic kernel carpentry_.\n",
85 | "\n",
86 | "Consider the following expansion for $\\text{E}\\hat{f}(x_0)$, using the linearity of local regression and a series expansion of the true function $f$ around $x_0$,\n",
87 | "\n",
88 | "\\begin{align}\n",
89 | "\\text{E}\\hat{f}(x_0) &= \\sum_{i=1}^N l_i(x_0)f(x_i) \\\\\n",
90 | "&= f(x_0)\\sum_{i=1}^N l_i(x_0) + f'(x_0)\\sum_{i=1}^N (x_i-x_0)l_i(x_0) + \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R,\n",
91 | "\\end{align}\n",
92 | "\n",
93 | "where the remainder term $R$ involves third- and higher-order derivatives of $f$, and is typically small under suitable smoothness assumptions. It can be shown (Exercise 6.2) that for local linear regression,\n",
94 | "\n",
95 | "\\begin{equation}\n",
96 | "\\sum_{i=1}^N l_i(x_0) = 1 \\text{ and } \\sum_{i=1}^N (x_i-x_0)l_i(x_0) = 0.\n",
97 | "\\end{equation}\n",
98 | "\n",
99 | "Hence\n",
100 | "\n",
101 | "\\begin{align}\n",
102 | "\\text{E}\\hat{f}(x_0) &= f(x_0)\\sum_{i=1}^N l_i(x_0) + f'(x_0)\\sum_{i=1}^N (x_i-x_0)l_i(x_0) + \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R \\\\\n",
103 | "&= f(x_0) + \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R,\n",
104 | "\\end{align}\n",
105 | "\n",
106 | "and the bias\n",
107 | "\n",
108 | "\\begin{equation}\n",
109 | "\\text{E}\\hat{f}(x_0) - f(x_0) = \\frac{f''(x_0)}2 \\sum_{i=1}^N \\sum_{i=1}^N (x_i-x_0)^2 l_i(x_0) + R.\n",
110 | "\\end{equation}\n",
111 | "\n",
112 | "We see that it depends only on quadratic and higher-order terms in the expansion of $f$."
113 | ]
114 | }
115 | ],
116 | "metadata": {
117 | "kernelspec": {
118 | "display_name": "Python 3",
119 | "language": "python",
120 | "name": "python3"
121 | },
122 | "language_info": {
123 | "codemirror_mode": {
124 | "name": "ipython",
125 | "version": 3
126 | },
127 | "file_extension": ".py",
128 | "mimetype": "text/x-python",
129 | "name": "python",
130 | "nbconvert_exporter": "python",
131 | "pygments_lexer": "ipython3",
132 | "version": "3.6.4"
133 | }
134 | },
135 | "nbformat": 4,
136 | "nbformat_minor": 2
137 | }
138 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section1-2-local-polynomial-regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 6.1.2. Local Polynomial Regression\n",
8 | "\n",
9 | "Why stop at local linear fit? We can fit local polynomial fits of any degree $d$,\n",
10 | "\n",
11 | "\\begin{equation}\n",
12 | "\\min_{\\alpha(x_0),\\beta_j(x_0), j=1,\\cdots,d} \\sum_{i=1}^N K_\\lambda(x_0,x_i) \\left( y_i - \\alpha(x_0) - \\sum_{j=1}^d \\beta_j(x_0)x_i^j \\right)^2\n",
13 | "\\end{equation}\n",
14 | "\n",
15 | "with solution\n",
16 | "\n",
17 | "\\begin{equation}\n",
18 | "\\hat{f}(x_0) = \\hat\\alpha(x_0) + \\sum_{j=1}^N \\hat\\beta(x_0)x_o^j.\n",
19 | "\\end{equation}\n",
20 | "\n",
21 | "In fact, the expansion shown in the previous section will tell us that the bias will only have components of degree $d+1$ and higher (Exercise 6.2).\n",
22 | "\n",
23 | "FIGURE 6.5 illustrates local quadratic regression. Local linear fits tend to be biased in regions of curvature of the true function, a phenomenon referred to as _trimming the hills_ and _filling the valleys_. Local quadratic regression is generally able to correct this bias."
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "### Bias-variance tradeoff, again\n",
31 | "\n",
32 | "There is of course a price to be paid for this bias reduction, and this is increased variance. The fit in the right panel of FIGURE 6.5 is slightly more wiggly, especially in the tails.\n",
33 | "\n",
34 | "Assume the model\n",
35 | "\n",
36 | "\\begin{equation}\n",
37 | "y_i = f(x_i) + \\epsilon_i,\n",
38 | "\\end{equation}\n",
39 | "\n",
40 | "with $\\epsilon_i \\sim^{\\text{iid}} (0, \\sigma^2)$. Then\n",
41 | "\n",
42 | "\\begin{equation}\n",
43 | "\\text{Var}(\\hat{f}(x_0)) = \\sigma^2 \\|l(x_0)\\|^2,\n",
44 | "\\end{equation}\n",
45 | "\n",
46 | "where $l(x_0)$ is the vector of equivalent kernel weights at $x_0$.\n",
47 | "\n",
48 | "It can be shown (Exercise 6.3) that $\\|l(x_0)\\|$ increases with $d$, and so there is a bias-variance tradeoff in selecting the polynomial degree.\n",
49 | "\n",
50 | "FIGURE 6.6 illustrates these variance curves for degree zero, one and two local polynomials. To summarize some collected wisdom on this issue:\n",
51 | "\n",
52 | "* Local linear fits can help bias dramatically at the boundaries at a modest cost in variance. Local quadratic fits do little at the boundaries for bias, but increase the variance a lot.\n",
53 | "* Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.\n",
54 | "* Asymptotic analysis suggest that local polynomials of odd degree dominate those of even degree. This is largely due to the fact that asymptotically the MSE is dominated by boundary effects.\n",
55 | "\n",
56 | "While it may be helpful to tinker, and move from local linear fits at the boundary to local quadratic fits in the interior, we do not recommend such strategies. Usually the application will dictate the degree of the fit. For example, if we are interested in extrapolation, then the boundary is of more interest and local linear fits are probably more reliable."
57 | ]
58 | }
59 | ],
60 | "metadata": {
61 | "kernelspec": {
62 | "display_name": "Python 3",
63 | "language": "python",
64 | "name": "python3"
65 | },
66 | "language_info": {
67 | "codemirror_mode": {
68 | "name": "ipython",
69 | "version": 3
70 | },
71 | "file_extension": ".py",
72 | "mimetype": "text/x-python",
73 | "name": "python",
74 | "nbconvert_exporter": "python",
75 | "pygments_lexer": "ipython3",
76 | "version": "3.6.4"
77 | }
78 | },
79 | "nbformat": 4,
80 | "nbformat_minor": 2
81 | }
82 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section2-selecting-the-width-of-the-kernel.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 6.2. Selecting the Width of the Kernel\n",
8 | "\n",
9 | "In each of the kernels $K_\\lambda$, $\\lambda$ is a parameter that controls its width:\n",
10 | "\n",
11 | "* For the Epanechnikov or tri-cube kernel with metric width, $\\lambda$ is the radius of the support region.\n",
12 | "* For the Gaussian kernel, $\\lambda$ is the standard deviation.\n",
13 | "* $\\lambda$ is the number $k$ of nearest neighbors in $k$-nearest neighborhoods, often expressed as a fraction or _span_ $k/N$ of the total training sample."
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "### Bias-variance tradeoff, again and again\n",
21 | "\n",
22 | "There is a natural bias-variance tradeoff as we change the width of the averaging window, which is most explicit for local averages:\n",
23 | "\n",
24 | "* If the window is narrow, $\\hat{f}(x_0)$ is an average of a small number of $y_i$ close to $x_0$, and its variance will be relatively large -- close to that of an individual $y_i$. The bias will tend to be small, again because each of the $\\text{E}(y_i) = f(x_i)$ should be close to $f(x_0)$.\n",
25 | "* If the window is wide, the variance of $\\hat{f}(x_0)$ will be small relative to the variance of any $y_i$, because of the effects of averaging. The bias will be higher, because we are now using observations $x_i$ further from $x_0$, and there is no quarantee that $f(x_i)$ will be close to $f(x_0)$.\n",
26 | "\n",
27 | "Similar arguments apply to local regression estimates, say local linear:\n",
28 | "* As the width goes to zero, the estimates approach a piecewise-linear function that interpolates the training data;\n",
29 | "* as the width gets infinitely large, the fit approaches the global linear least-squares fit to the data."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "The discussion in Chapter 5 on selecting the regularization parameter for smoothing splines applies here, and will not be repeated.\n",
37 | "\n",
38 | "Local regression smoothers are linear estimators; the smoother matrix in\n",
39 | "\n",
40 | "\\begin{equation}\n",
41 | "\\hat{\\mathbf{f}} = \\mathbf{S}_\\lambda\\mathbf{y}\n",
42 | "\\end{equation}\n",
43 | "\n",
44 | "is built up from the equivalent kernels ($\\S$ 6.1.1), and has $ij$th entry $\\{\\mathbf{S}_\\lambda\\}_{ij} = l_i(x_j)$.\n",
45 | "\n",
46 | "Leave-one-out cross-validation is particularly simple (Exercise 6.7), as is generalized cross-validation, $C_p$ (Exercise 6.10), and $k$-fold cross-validation. The effective degrees of freedom is again defined as $\\text{trace}(\\mathbf{S}_\\lambda)$, and can be used to calibrate the amount of smoothing."
47 | ]
48 | },
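{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch below builds the smoother matrix $\\mathbf{S}_\\lambda$ for a local linear fit on synthetic data (each row is the equivalent kernel at one training point) and reads off the effective degrees of freedom as its trace. The kernel, width and data are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def equivalent_kernel(x0, x, lam=0.2):\n",
"    # Equivalent-kernel weights l(x0) for local linear regression with an Epanechnikov kernel\n",
"    w = np.where(np.abs((x - x0) / lam) <= 1, 0.75 * (1 - ((x - x0) / lam) ** 2), 0.0)\n",
"    B = np.column_stack([np.ones_like(x), x])\n",
"    WB = w[:, None] * B\n",
"    return np.array([1.0, x0]) @ np.linalg.solve(B.T @ WB, WB.T)\n",
"\n",
"rng = np.random.RandomState(0)\n",
"x = np.sort(rng.uniform(0, 1, 100))\n",
"S = np.vstack([equivalent_kernel(x0, x) for x0 in x])   # rows of the smoother matrix S_lambda\n",
"print(np.trace(S))    # effective degrees of freedom trace(S_lambda)"
]
},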
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "FIGURE 6.7 compares the equivalent kernels for a smoothing spline and local linear regression. The local regression smoother has a span of $40%$, which results in $\\text{df} = \\text{trace}(\\mathbf{S}_\\lambda) = 5.86$. The smoothing spline was calibrated to have the same $\\text{df}$, and their equivalent kernels are qualitatively quite similar."
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 1,
59 | "metadata": {},
60 | "outputs": [
61 | {
62 | "name": "stdout",
63 | "output_type": "stream",
64 | "text": [
65 | "Under construction ...\n"
66 | ]
67 | }
68 | ],
69 | "source": [
70 | "\"\"\"FIGURE 6.7. Equivalent kernels for a local linear regreession smoother and\n",
71 | "a smoothing spline\"\"\"\n",
72 | "print('Under construction ...')"
73 | ]
74 | }
75 | ],
76 | "metadata": {
77 | "kernelspec": {
78 | "display_name": "Python 3",
79 | "language": "python",
80 | "name": "python3"
81 | },
82 | "language_info": {
83 | "codemirror_mode": {
84 | "name": "ipython",
85 | "version": 3
86 | },
87 | "file_extension": ".py",
88 | "mimetype": "text/x-python",
89 | "name": "python",
90 | "nbconvert_exporter": "python",
91 | "pygments_lexer": "ipython3",
92 | "version": "3.6.4"
93 | }
94 | },
95 | "nbformat": 4,
96 | "nbformat_minor": 2
97 | }
98 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section3-local-regression-in-higher-dimensions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 6.3. Local Regression in $\\mathbb{R}^p$\n",
8 | "\n",
9 | "Kernel smoothing and local regression generalize very naturally to two or more dimensions.\n",
10 | "\n",
11 | "* The Nadaraya-Watson kernel smoother fits a constant locally with weights supplied by a $p$-dimensional kernel.\n",
12 | "* Local linear regression will fit a hyperplane locally in $X$, by weighted least squares, with weights supplied by a $p$-dimensional kernel. \n",
13 | " It is simple to implement and is generally preferred to the local constant fit for its superior performacne on the boundaries."
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "### Formulation\n",
21 | "\n",
22 | "Let $b(X)$ be a vector of polynomial terms in $X$ of maximum degree $d$; e.g., with $d=1$ and $p=2$ we get\n",
23 | "\n",
24 | "\\begin{equation}\n",
25 | "b(X) = (1, X_1, X_2);\n",
26 | "\\end{equation}\n",
27 | "\n",
28 | "with $d=2$ we get\n",
29 | "\n",
30 | "\\begin{equation}\n",
31 | "b(X) = (1, X_1, X_2, X_1^2, X_2^2, X_1X_2);\n",
32 | "\\end{equation}\n",
33 | "\n",
34 | "and trivially with $d=0$ we get\n",
35 | "\n",
36 | "\\begin{equation}\n",
37 | "b(X) = 1.\n",
38 | "\\end{equation}\n",
39 | "\n",
40 | "At each $x_0 \\in \\mathbb{R}^p$ solve\n",
41 | "\n",
42 | "\\begin{equation}\n",
43 | "\\min_{\\beta(x_0)} \\sum_{i=1}^N K_\\lambda(x_0,x_i) \\left( y_i - b(x_i)^T\\beta(x_0) \\right)^2\n",
44 | "\\end{equation}\n",
45 | "\n",
46 | "to produce the fit\n",
47 | "\n",
48 | "\\begin{equation}\n",
49 | "\\hat{f}(x_0) = b(x_0)^T \\hat\\beta(x_0).\n",
50 | "\\end{equation}\n",
51 | "\n",
52 | "Typically the kernel will be a radial function, such as the radial Epanechnikov or tri-cube kernel\n",
53 | "\n",
54 | "\\begin{equation}\n",
55 | "K_\\lambda(x_0,x) = D\\left( \\frac{\\|x-x_0\\|}\\lambda \\right),\n",
56 | "\\end{equation}\n",
57 | "\n",
58 | "where $\\|\\cdot\\|$ is the Euclidean norm.\n",
59 | "\n",
60 | "Since the Euclidean norm depends on the units in each coordinate, it makes most sense to standardize each predictor, e.g., to unit standard deviation, prior to smoothing."
61 | ]
62 | },
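{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of this fit with $p = 2$: standardize the predictors, weight the observations with a radial tri-cube kernel, and solve the weighted least squares problem for $b(X) = (1, X_1, X_2)$ at a single target point. The data and tuning choices are arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def tricube(t):\n",
"    return np.where(np.abs(t) <= 1, (1 - np.abs(t) ** 3) ** 3, 0.0)\n",
"\n",
"def local_linear_p(x0, X, y, lam=0.5):\n",
"    mu, sd = X.mean(axis=0), X.std(axis=0)\n",
"    Z, z0 = (X - mu) / sd, (x0 - mu) / sd                # standardized predictors and target\n",
"    w = tricube(np.linalg.norm(Z - z0, axis=1) / lam)    # radial kernel weights\n",
"    B = np.column_stack([np.ones(len(Z)), Z])            # rows b(x_i)^T = (1, x_i1, x_i2)\n",
"    WB = w[:, None] * B\n",
"    beta = np.linalg.solve(B.T @ WB, WB.T @ y)\n",
"    return np.concatenate([[1.0], z0]) @ beta            # fit at the target point only\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.uniform(-1, 1, size=(200, 2))\n",
"y = np.sin(2 * X[:, 0]) * X[:, 1] + 0.1 * rng.randn(200)\n",
"print(local_linear_p(np.array([0.2, -0.3]), X, y))"
]
},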
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "### Boundary problem gets serious with the curse of dimensionality\n",
68 | "\n",
69 | "While boundary effects are a problem in 1D smoothing, they are a much bigger problem in two or higher dimensions, since the fraction of points on the boundary is larger. In fact, one of the manifestations of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows.\n",
70 | "\n",
71 | "Directly modifying the kernel to accommodate two-dimensional boundaries becomes very messy, especially for irregular boundaries.\n",
72 | "\n",
73 | "Local polynomial regression seamlessly performs boundary correction to the desired order in any dimensions. FIGURE 6.8 illustrates local linear regression on some measurements from an astronomical study with an unusual predictor design (star-shaped). Here the boundary is extremely irregular, and the fitted surface must also interpolate over regions of increasing data sparsity as we approach the boundary."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 1,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "Under construction ...\n"
86 | ]
87 | }
88 | ],
89 | "source": [
90 | "\"\"\"FIGURE 6.8 3D Galaxy data\"\"\"\n",
91 | "print('Under construction ...')"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "Local regression becomes less useful in dimensions much higher than two or three. We have discussed in some detail the problems of the dimensionality, e.g., in Chapter 2. It is impossible to simultaneously maintain localness ($\\Rightarrow$ low bias) and a sizeable sample in the neighborhood ($\\Rightarrow$ low variance) as the dimension increases, without the total sample size increasing exponentially in $p$.\n",
99 | "\n",
100 | "Visualization of $\\hat{f}(X)$ also becomes difficult in higher dimensions, and this is often one of the primary goals of smoothing. Although the scatter-cloud and wire-frame pictures in FIGURE 6.8 look at attractive, it is quite difficult to interpret the results except at a gross level."
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "From a data analysis perspective, conditional plots are far more useful.\n",
108 | "\n",
109 | "FIGURE 6.9 shows an analysis of some environmental data with three predictors. The _trellis_ display here show ozone as a function of radiation, conditioned on the other two variables, temperature and wind speed. However, conditioning on the value of a variable really implies loca to that value (as in local regression)."
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 2,
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "name": "stdout",
119 | "output_type": "stream",
120 | "text": [
121 | "Under construction ...\n"
122 | ]
123 | }
124 | ],
125 | "source": [
126 | "\"\"\"FIGURE 6.9. Conditional plots for Los Angeles Ozone data\"\"\"\n",
127 | "print('Under construction ...')"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "Above each of the panels in FIGURE 6.9 is an indication of the range of values present in that panel for each of the conditioning values. In the panel itself the data subsets are displayed (response versus remaining variable), and a 1D local linear regression is fit to the data.\n",
135 | "\n",
136 | "Although this is not quite the same as looking at slices of a fitted 3D surface, it is probably more useful in terms of understanding the joint behavior of the data."
137 | ]
138 | }
139 | ],
140 | "metadata": {
141 | "kernelspec": {
142 | "display_name": "Python 3",
143 | "language": "python",
144 | "name": "python3"
145 | },
146 | "language_info": {
147 | "codemirror_mode": {
148 | "name": "ipython",
149 | "version": 3
150 | },
151 | "file_extension": ".py",
152 | "mimetype": "text/x-python",
153 | "name": "python",
154 | "nbconvert_exporter": "python",
155 | "pygments_lexer": "ipython3",
156 | "version": "3.6.4"
157 | }
158 | },
159 | "nbformat": 4,
160 | "nbformat_minor": 2
161 | }
162 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section4-0-structured-local-regression-models.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# $\\S$ 6.4. Structured Local Regression Models in $\\mathbb{R}^p$\n",
8 | "\n",
9 | "> When the dimension to sample-size ratio is unfavorable, local regression does not help us much, unless we are willing to make some structural assumptions about the model.\n",
10 | ">\n",
11 | "> Much of this book is about structured regression and classification models. Here we focus on some approaches directly related to kernel methods."
12 | ]
13 | }
14 | ],
15 | "metadata": {
16 | "kernelspec": {
17 | "display_name": "Python 3",
18 | "language": "python",
19 | "name": "python3"
20 | },
21 | "language_info": {
22 | "codemirror_mode": {
23 | "name": "ipython",
24 | "version": 3
25 | },
26 | "file_extension": ".py",
27 | "mimetype": "text/x-python",
28 | "name": "python",
29 | "nbconvert_exporter": "python",
30 | "pygments_lexer": "ipython3",
31 | "version": "3.6.4"
32 | }
33 | },
34 | "nbformat": 4,
35 | "nbformat_minor": 2
36 | }
37 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section4-1-structured-kernels.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 6.4.1. Structured Kernels\n",
8 | "### Standardization\n",
9 | "\n",
10 | "One line of approach is to modify the kernel. The default spherical kernel\n",
11 | "\n",
12 | "\\begin{equation}\n",
13 | "K_\\lambda(x_0,x) = D\\left( \\frac{\\|x-x_0\\|}\\lambda \\right)\n",
14 | "\\end{equation}\n",
15 | "\n",
16 | "gives equal weight to each coordinate, and so a natural default strategy is to standardize each variable to unit standard deviation.\n",
17 | "\n",
18 | "A more general approach is to use a positive semidefinite matrix $\\mathbf{A}$ to weigh the different coordinates:\n",
19 | "\n",
20 | "\\begin{equation}\n",
21 | "K_{\\lambda,\\mathbf{A}}(x_0,x) = D \\left( \\frac{(x-x_0)^T\\mathbf{A}(x-x_0)}\\lambda \\right).\n",
22 | "\\end{equation}\n",
23 | "\n",
24 | "Entire coordinates or directions can be downgraded or omitted by imposing appropriate restrictions on $\\mathbf{A}$. For example, if $\\mathbf{A}$ is diagonal, then we can increase or decrease the influence of individual predictors $X_j$ by increasing or decreasing $A_{jj}$.\n",
25 | "\n",
26 | "Often the predictors are many and highly correlated, such as those arising from digitized analog signals or images. The covariance function of the predictors can be used to tailor a metric $\\mathbf{A}$ that focuses less, say, on high-freqeuncy contrast (Exercise 6.4)."
27 | ]
28 | },
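{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny sketch of this structured kernel with a diagonal $\\mathbf{A}$: the second coordinate is heavily downweighted, so a point that differs mainly in that coordinate still receives a large weight. The function $D$, the matrix $\\mathbf{A}$ and the points are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def structured_kernel(x0, x, A, lam=1.0):\n",
"    d = x - x0\n",
"    return np.exp(-0.5 * (d @ A @ d) / lam)   # D applied to the weighted quadratic form\n",
"\n",
"A = np.diag([1.0, 0.01])   # nearly ignore the second predictor\n",
"x0 = np.zeros(2)\n",
"print(structured_kernel(x0, np.array([0.5, 3.0]), A))   # still sizeable although x2 differs a lot\n",
"print(structured_kernel(x0, np.array([3.0, 0.5]), A))   # tiny: differences in x1 count fully"
]
},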
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "### Structured regression over general models for $\\mathbf{A}$\n",
34 | "\n",
35 | "Proposals have been made for learning the parameters for multidimensional kernels. For example, the projection-pursuit regression model discussed in Chapter 11 is of this flavor, where low-rank versions of $\\mathbf{A}$ imply ridge functions for $\\hat{f}(X)$.\n",
36 | "\n",
37 | "More general models for $\\mathbf{A}$ are cumbersome, and we favor instead the structured forms for the regression function discussed next."
38 | ]
39 | }
40 | ],
41 | "metadata": {
42 | "kernelspec": {
43 | "display_name": "Python 3",
44 | "language": "python",
45 | "name": "python3"
46 | },
47 | "language_info": {
48 | "codemirror_mode": {
49 | "name": "ipython",
50 | "version": 3
51 | },
52 | "file_extension": ".py",
53 | "mimetype": "text/x-python",
54 | "name": "python",
55 | "nbconvert_exporter": "python",
56 | "pygments_lexer": "ipython3",
57 | "version": "3.6.4"
58 | }
59 | },
60 | "nbformat": 4,
61 | "nbformat_minor": 2
62 | }
63 |
--------------------------------------------------------------------------------
/chapter06-kernel-smoothing-methods/section4-2-structured-regression-functions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## $\\S$ 6.4.2. Structured Regression Functions\n",
8 | "\n",
9 | "We are trying to fit a regression function\n",
10 | "\n",
11 | "\\begin{equation}\n",
12 | "\\text{E}(Y|X) = f(X_1,X_2,\\cdots,X_p)\n",
13 | "\\end{equation}\n",
14 | "\n",
15 | "in $\\mathbb{R}^p$, in which every level of interaction is potentially present."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "### Structure via ANOVA decomposition\n",
23 | "It is natural to consider ANOVA decompositions of the form\n",
24 | "\n",
25 | "\\begin{equation}\n",
26 | "f(X_1,X_2,\\cdots,X_p) = \\alpha + \\sum_j g_j(X_j) + \\sum_{k