├── Chapter01 ├── .ipynb_checkpoints │ ├── ClassicalInferenceMean-checkpoint.ipynb │ └── DescriptiveStatistics-checkpoint.ipynb ├── BayesianAnalysis.ipynb ├── BayesianInferenceMean.ipynb ├── BayesianInferenceProportions.ipynb ├── ClassicalInferenceMean.ipynb ├── ClassicalInferenceProportions.ipynb ├── Correlations.ipynb └── DescriptiveStatistics.ipynb ├── Chapter02 ├── EvaluatingModelResults.ipynb ├── MachineLearningPrinciples.ipynb ├── TrainingMLModels.ipynb └── titanic.csv ├── Chapter03 ├── .ipynb_checkpoints │ ├── BeyondBinary-checkpoint.ipynb │ ├── DecisionTree-checkpoint.ipynb │ ├── LogisticRegression-checkpoint.ipynb │ ├── NaiveBayes-checkpoint.ipynb │ ├── RandomForests-checkpoint.ipynb │ ├── SVM-checkpoint.ipynb │ └── Untitled-checkpoint.ipynb ├── BeyondBinary.ipynb ├── DecisionTree.ipynb ├── LogisticRegression.ipynb ├── Metadata.docx ├── NaiveBayes.ipynb ├── RandomForests.ipynb ├── SVM.ipynb ├── Untitled.ipynb ├── knn.ipynb └── titanic.csv ├── Chapter04 ├── .ipynb_checkpoints │ ├── BayesianRegression-checkpoint.ipynb │ ├── EvaluatingLinearModel-checkpoint.ipynb │ ├── LASSORegression-checkpoint.ipynb │ ├── OLS-checkpoint.ipynb │ ├── RidgeRegression-checkpoint.ipynb │ ├── SplineInterpolation-checkpoint.ipynb │ └── Untitled-checkpoint.ipynb ├── BayesianRegression.ipynb ├── EvaluatingLinearModel.ipynb ├── LASSORegression.ipynb ├── OLS.ipynb ├── RidgeRegression.ipynb ├── SplineInterpolation.ipynb ├── USCapitol.png ├── Untitled.ipynb ├── mystery_function.npy └── mystery_function_2.npy ├── Chapter05 ├── .ipynb_checkpoints │ └── RegressionNeuralNetworks-checkpoint.ipynb ├── Perceptron.ipynb ├── RegressionNeuralNetworks.ipynb └── TrainingNeuralNetwork.ipynb ├── Chapter06 ├── .ipynb_checkpoints │ ├── EvaluatingClusteringModelResults-checkpoint.ipynb │ ├── Hierarchical-checkpoint.ipynb │ ├── Spectral-checkpoint.ipynb │ ├── UnsupervisedLearning-checkpoint.ipynb │ └── kmeans-checkpoint.ipynb ├── EvaluatingClusteringModelResults.ipynb ├── HNHeadlines.txt ├── 
Hierarchical.ipynb ├── Spectral.ipynb ├── UnsupervisedLearning.ipynb ├── frog.png └── kmeans.ipynb ├── Chapter07 ├── .ipynb_checkpoints │ ├── LowDimensionalRepresentation-checkpoint.ipynb │ ├── PCA-checkpoint.ipynb │ └── SVD-checkpoint.ipynb ├── HNHeadlines.txt ├── LowDimensionalRepresentation.ipynb ├── Metadata.docx ├── PCA.ipynb ├── SVD.ipynb └── frog.png ├── LICENSE └── README.md /Chapter01/BayesianAnalysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Diving Into Bayesian Analysis\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "In this notebook I proceed with a simple example to introduce the idea of Bayesian statistical analysis.\n", 11 | "\n", 12 | "## Solving a Hit-and-Run\n", 13 | "\n", 14 | "In a city, 95% of cabs are owned by the Yellow Cab Company, and 5% are owned by Green Cab, Inc. Recently a cab was involved in a hit-and-run accident, injuring a pedestrian. A witness saw the accident and claimed that the cab that hit the pedestrian was a Green cab. 
Tests by investigators reveal that, under similar circumstances, this witness correctly identifies a Green cab 90% of the time and correctly identifies a Yellow cab 85% of the time; this means she incorrectly calls a Yellow cab a Green cab 15% of the time and incorrectly calls a Green cab a Yellow cab 10% of the time.\n", 15 | "\n", 16 | "Should investigators pursue Green Cab, Inc.?\n", 17 | "\n", 18 | "In general, **Bayes' Theorem** states\n", 19 | "\n", 20 | "$$P(H|D) = \\frac{P(D|H)P(H)}{P(D|H)P(H) + P(D|\\text{ not }H) P(\\text{not }H)} = \\frac{P(D|H)P(H)}{P(D)}$$\n", 21 | "\n", 22 | "Bayesian statisticians often emphasize the importance of the numerator by saying $P(H|D)$ is proportional to $P(D|H) P(H)$, or\n", 23 | "\n", 24 | "$$P(H|D) \\propto P(D|H) P(H)$$\n", 25 | "\n", 26 | "The denominator is needed only to make the right-hand side a proper probability.\n", 27 | "\n", 28 | "Let $H$ represent the event that a Green cab hit the pedestrian, and $D$ the event that the witness claims to see a Green cab. Let's compute the probability that, given the witness's testimony, the pedestrian was actually hit by a Green cab." 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": { 35 | "collapsed": false 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "d_h = 0.9 # P(D|H): The probability the witness claims to see a Green cab when a Green cab hit the pedestrian\n", 40 | "h = 0.05 # P(H): The probability the cab was Green, based on the proportion of Green cabs on the streets\n", 41 | "d_nh = 0.15 # P(D|not H): The probability the witness claims to see a Green cab when a Yellow cab hit the pedestrian\n", 42 | "nh = 0.95 # P(not H): The probability the cab was Yellow\n", 43 | "\n", 44 | "h" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "At the outset, the probability that the cab was Green is low given how few Green cabs there are on the streets. 
How does the witness's testimony change this?" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "collapsed": false 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "h_d = (d_h * h)/(d_h * h + d_nh * nh) # Bayes' theorem\n", 63 | "h_d" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "While the probability that the cab was Green has increased, it's still less than 50%; the Yellow Cab Company is more likely to employ the driver involved in the hit-and-run! (The witness is not accurate enough to overcome how few Green cabs there are on the streets.)" 71 | ] 72 | } 73 | ], 74 | "metadata": { 75 | "kernelspec": { 76 | "display_name": "Python 3", 77 | "language": "python", 78 | "name": "python3" 79 | }, 80 | "language_info": { 81 | "codemirror_mode": { 82 | "name": "ipython", 83 | "version": 3 84 | }, 85 | "file_extension": ".py", 86 | "mimetype": "text/x-python", 87 | "name": "python", 88 | "nbconvert_exporter": "python", 89 | "pygments_lexer": "ipython3", 90 | "version": "3.6.0" 91 | } 92 | }, 93 | "nbformat": 4, 94 | "nbformat_minor": 2 95 | } 96 | -------------------------------------------------------------------------------- /Chapter01/BayesianInferenceProportions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Bayesian Posterior Analysis: Proportions\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "In previous videos we saw classical methods for answering common questions about proportions. Here I demonstrate Bayesian methods for inference regarding proportions. 
In the process I introduce further ideas from Bayesian statistics.\n", 11 | "\n", 12 | "## Conjugate Priors\n", 13 | "\n", 14 | "Bayesian analysis is based on the following relation:\n", 15 | "\n", 16 | "$$p(\\theta | D) \\propto p(D | \\theta) p(\\theta)$$\n", 17 | "\n", 18 | "$p(\\theta)$ is the **prior** distribution of $\\theta$, and $p(\\theta | D)$ is the **posterior** distribution of $\\theta$ after observing $D$. $p(D | \\theta)$ is the **likelihood** of the evidence $D$. A **conjugate prior** is a prior distribution such that for the likelihood function $p(D | \\theta)$, the posterior belongs to the same family of distributions as the prior.\n", 19 | "\n", 20 | "See [this Wikipedia article on conjugate priors](https://en.wikipedia.org/wiki/Conjugate_prior), which also includes a list of common conjugate priors.\n", 21 | "\n", 22 | "For data that takes values of either 0 or 1, the [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) serves as a conjugate prior. We denote this distribution by $B(\\alpha, \\beta)$ and say that the proportion of successes $\\theta$ follows $\\theta \\sim B(\\alpha, \\beta)$. $\\alpha - 1$ can be interpreted as imaginary prior \"successes\", and $\\beta - 1$ can be interpreted as imaginary prior \"failures\". If $\\alpha = \\beta = 1$, we interpret this as there being no prior \"successes\" or \"failures\" and every probability of \"success\" $\\theta$ is equally likely." 
23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "from scipy.stats import beta\n", 34 | "import numpy as np\n", 35 | "import matplotlib.pyplot as plt\n", 36 | "%matplotlib inline" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "x = np.linspace(0, 1, num=1000)\n", 48 | "plt.plot(x, beta.pdf(x, a=1, b=1)) # Plot of uninformative prior: the uniform distribution\n", 49 | "plt.show()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": { 56 | "collapsed": false 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "plt.plot(x, beta.pdf(x, a=3, b=3)) # Not an uninformative prior: has 2 \"successes\" and 2 \"failures\"\n", 61 | "plt.show()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "Suppose that in a sample of size $N$ there are $M$ \"successes\". Then when the prior distribution of $\\theta$ is $B(\\alpha, \\beta)$, the posterior distribution of $\\theta$ is $B(\\alpha + M, \\beta + N - M)$.\n", 69 | "\n", 70 | "Let's reconsider an earlier example. Suppose that on a certain website, out of 1126 visitors on a given day, 310 clicked on an ad purchased by a sponsor. We will *not* use an uninformative prior; instead, our prior distribution will be $B(3, 3)$, which is interpreted as having 2 imaginary prior successes and 2 imaginary prior failures ($\\theta$ is biased towards $\\theta = \\frac{1}{2}$, or a 50/50 chance of clicking the ad for each visitor). Then the posterior distribution of $\\theta$ is $B(3 + 310, 3 + 1126 - 310) = B(313, 819)$. The prior and posterior are plotted below together to show their relationship." 
71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "plt.plot(x, beta.pdf(x, a=3, b=3), 'b-') # Prior\n", 82 | "plt.plot(x, beta.pdf(x, a=313, b=819), 'r-') # Posterior\n", 83 | "plt.show()" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "## Credible Intervals\n", 91 | "\n", 92 | "The Bayesian analogue to the **confidence interval** is the **credible interval**. The true value of $\\theta$ has probability $C$ of lying within the $100 \\times C$% credible interval. So there is a 95% chance that the true $\\theta$ lies in the 95% credible interval.\n", 93 | "\n", 94 | "I have written a function for computing a credible interval when using the conjugate prior." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "collapsed": true 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "def bernoulli_beta_credible_interval(M, N, a=1, b=1, C=.95):\n", 106 | " \"\"\"Computes a 100C% credible interval for Bernoulli (0/1) data\n", 107 | " \n", 108 | " args:\n", 109 | " M: int; number of \"successes\"\n", 110 | " N: int; total sample size\n", 111 | " a: float; first argument of the prior Beta distribution\n", 112 | " b: float; second argument of the prior Beta distribution\n", 113 | " C: float; the credibility (chance of containing theta) of the interval\n", 114 | " \n", 115 | " return:\n", 116 | " tuple; first number is the lower bound, second the upper bound, of the credible interval\n", 117 | " \"\"\"\n", 118 | " \n", 119 | " # Error checking\n", 120 | " if type(M) is not int or type(N) is not int:\n", 121 | " raise TypeError(\"M, N must both be integers\")\n", 122 | " elif M < 0 or N < M:\n", 123 | " raise ValueError(\"M, N must be non-negative, and N >= M\")\n", 124 | " elif a <= 0 or b <= 0:\n", 125 | " raise ValueError(\"Cannot have negative prior 
parameters!\")\n", 126 | " elif type(C) is not float:\n", 127 | " raise TypeError(\"C must be numeric\")\n", 128 | " elif C < 0 or C > 1:\n", 129 | " raise ValueError(\"C must be interpretable as a probability\")\n", 130 | " \n", 131 | " post = (a + M, b + N - M)\n", 132 | " alpha = (1 - C)/2\n", 133 | " return (beta.ppf(alpha, post[0], post[1]), beta.ppf(1 - alpha, post[0], post[1]))" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "bernoulli_beta_credible_interval(310, 1126)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": false 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "bernoulli_beta_credible_interval(310, 1126, a=3, b=3)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "bernoulli_beta_credible_interval(310, 1126, a=3, b=3, C=.99) # Like with confidence intervals,\n", 167 | " # larger C -> larger interval" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "## Hypothesis Testing\n", 175 | "\n", 176 | "Bayesian hypothesis testing is merely computing the probability that $\\theta$ lies within the interval of interest.\n", 177 | "\n", 178 | "For instance, suppose that the administrator of the website you're testing claims that at least 30% of visitors to the site click the ad. What is the probability that the administrator is correct?\n", 179 | "\n", 180 | "We use the **cumulative distribution function (CDF)** of the posterior distribution to answer this question. The CDF is defined as $F(x) = P(X \\leq x)$. Notice $P(X > x) = 1 - F(x)$. Here, we want $P(\\theta > .3 | D)$." 
181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "1 - beta.cdf(.3, # Coincides with the administrator's claim\n", 192 | " a=3 + 310, # Posterior a\n", 193 | " b=3 + 1126 - 310) # Posterior b" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "There is only a small probability that the administrator is correct. (Note, though, that this cannot be interpreted the same way as a p-value!)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "## Comparing Two Proportions\n", 208 | "\n", 209 | "The website is trying two different ad formats, format A and format B. Users are randomly assigned to one format or the other, and the website tracks how many viewers click the ad in the different formats. The website wants to know whether format B leads to more clicks than format A.\n", 210 | "\n", 211 | "Here we want the *joint* posterior distribution for $\\theta_A$ and $\\theta_B$, but this is not easy to compute and getting the probability that $\\theta_A < \\theta_B$ is also hard. So we do the following:\n", 212 | "\n", 213 | "1. Assume that $\\theta_A$ and $\\theta_B$ are independent.\n", 214 | "2. Collect data and compute the posterior distributions for $\\theta_A$ and $\\theta_B$ separately.\n", 215 | "3. Simulate $\\theta_A$ and $\\theta_B$, randomly sampling from their posterior distributions.\n", 216 | "4. Compute the proportion of times that $\\theta_A < \\theta_B$.\n", 217 | "\n", 218 | "516 visitors saw format A, and 108 of them clicked the ad. 510 visitors saw format B, and 144 of them clicked the ad. If the prior distribution for both $\\theta_A$ and $\\theta_B$ is $B(3, 3)$, then the posterior distribution for $\\theta_A$ is then $B(111, 411)$, and for $\\theta_B$, $B(147, 369)$. They're visualized below." 
219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "collapsed": false 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "plt.plot(x, beta.pdf(x, 111, 411), 'b-') # Posterior distribution for theta_A\n", 230 | "plt.plot(x, beta.pdf(x, 147, 369), 'r-') # Posterior distribution for theta_B\n", 231 | "plt.show()" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "Now we engage in the simulation to see how often $\\theta_A < \\theta_B$." 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "# Demonstration: A random theta_A\n", 250 | "beta.rvs(111, 411)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": { 257 | "collapsed": false 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "# A random theta_B\n", 262 | "beta.rvs(147, 369)" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": { 269 | "collapsed": true 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "N = 1000 # Number of simulations\n", 274 | "random_A = beta.rvs(111, 411, size=N)\n", 275 | "random_B = beta.rvs(147, 369, size=N)" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": { 282 | "collapsed": false 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "random_A[0:10]" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": { 293 | "collapsed": false 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "random_B[0:10]" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "random_A[0:10] < random_B[0:10]" 309 | ] 310 | }, 311 | { 312 | 
"cell_type": "code", 313 | "execution_count": null, 314 | "metadata": { 315 | "collapsed": false 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "trial = random_A < random_B\n", 320 | "trial.sum() # Number of times theta_A < theta_B" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": { 327 | "collapsed": false 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "trial.mean() # Estimated probability theta_A < theta_B" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "There is a high probability that $\\theta_A < \\theta_B$ according to our simulation." 339 | ] 340 | } 341 | ], 342 | "metadata": { 343 | "kernelspec": { 344 | "display_name": "Python 3", 345 | "language": "python", 346 | "name": "python3" 347 | }, 348 | "language_info": { 349 | "codemirror_mode": { 350 | "name": "ipython", 351 | "version": 3 352 | }, 353 | "file_extension": ".py", 354 | "mimetype": "text/x-python", 355 | "name": "python", 356 | "nbconvert_exporter": "python", 357 | "pygments_lexer": "ipython3", 358 | "version": "3.6.0" 359 | } 360 | }, 361 | "nbformat": 4, 362 | "nbformat_minor": 2 363 | } 364 | -------------------------------------------------------------------------------- /Chapter01/ClassicalInferenceProportions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Confidence Intervals and Classical Hypothesis Testing: Proportions\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "The first major topic is reaching conclusions about proportions.\n", 11 | "\n", 12 | "In a sample of size $N$ there are $M$ \"successes\" (say, people who clicked on an advertisement) and $N - M$ \"failures\" (everyone else, who did not click on an advertisement). 
The **sample proportion** is then:\n", 13 | "\n", 14 | "$$\\hat{p} = \\frac{M}{N}$$\n", 15 | "\n", 16 | "In fact, if your data $x_i$ is 1 for every \"success\" and 0 for every \"failure\", then we can say:\n", 17 | "\n", 18 | "$$\\hat{p} = \\frac{1}{N} \\sum_{i = 1}^{N} x_i = \\bar{x}$$\n", 19 | "\n", 20 | "That is, the sample proportion is the sample mean of the dataset.\n", 21 | "\n", 22 | "Let's say we want to know what proportion of visitors (including future visitors, not yet seen) will click on our ad based on previous data. How can we go from a sample proportion to a statement about the **population proportion**?\n", 23 | "\n", 24 | "## Confidence Interval for Population Proportion\n", 25 | "\n", 26 | "We can construct a **confidence interval**, an interval we believe will contain the true population proportion of visitors who click our ad. We have an interval with a lower and upper bound and we believe that the true population proportion is within this interval with some level of confidence. For a 95% confidence interval, we are \"95% confident\" the true proportion is in the interval (in the sense that such intervals contain the population proportion 95% of the time).\n", 27 | "\n", 28 | "The classical way to construct this interval is:\n", 29 | "\n", 30 | "$$\\hat{p} \\pm z_{1 - \\frac{\\alpha}{2}} \\sqrt{\\frac{\\hat{p}(1 - \\hat{p})}{N}} \\equiv \\left(\\hat{p} - z_{1 - \\frac{\\alpha}{2}} \\sqrt{\\frac{\\hat{p}(1 - \\hat{p})}{N}}, \\hat{p} + z_{1 - \\frac{\\alpha}{2}} \\sqrt{\\frac{\\hat{p}(1 - \\hat{p})}{N}}\\right)$$\n", 31 | "\n", 32 | "where $z_{p}$ is the $100\\times p$th percentile of the standard [Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) and $1 - \\alpha$ is the confidence level.\n", 33 | "\n", 34 | "In Python, the **statsmodels** package can be used for statistical computations such as computing a confidence interval.\n", 35 | "\n", 36 | "Let's suppose that on a certain website, out of 1126 visitors on a given day, 310 clicked on an ad purchased by a sponsor. 
Let's construct a confidence interval for the *population* proportion of visitors who click the ad." 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": { 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "import statsmodels.api as sm" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "310 / 1126 # Sample proportion" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "from statsmodels.stats.proportion import proportion_confint # Function for computing confidence intervals\n", 66 | "proportion_confint(count=310, # Number of \"successes\"\n", 67 | " nobs=1126, # Number of trials\n", 68 | " alpha=(1 - 0.95)) # Alpha, which is 1 minus the confidence level" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "If we wanted a 99% confidence interval, we would have a wider interval, but more confidence that the true proportion lies in this interval." 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "proportion_confint(310, 1126, alpha=(1 - 0.99))" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "## Testing the Proportion\n", 92 | "\n", 93 | "The website administrator claims that 30% of visitors to the website click the advertisement. Is this true? The sample proportion does not match the administrator's claim, but this does not discredit the claim.\n", 94 | "\n", 95 | "We will do a **statistical test** to test the administrator's claim. 
We test the **null hypothesis**:\n", 96 | "\n", 97 | "$$H_0: p = 0.3$$\n", 98 | "\n", 99 | "(where $p$ denotes the true proportion of visitors who click the ad on the site) against the **alternative hypothesis**:\n", 100 | "\n", 101 | "$$H_A: p \\neq 0.3$$\n", 102 | "\n", 103 | "How do we do this? We first compute a **test statistic**.\n", 104 | "\n", 105 | "$$z = \\frac{\\hat{p} - p_0}{\\sqrt{\\frac{p_0(1 - p_0)}{N}}} = \\frac{\\hat{p} - 0.3}{\\sqrt{\\frac{0.3(1 - 0.3)}{N}}}$$\n", 106 | "\n", 107 | "We then compute a $p$-value, which can be interpreted as the probability, assuming $H_0$ is true, of observing a test statistic at least as \"extreme\" as the test statistic actually observed. If the $p$-value is small, we will reject $H_0$ and conclude that the administrator's claim is false; the proportion of visitors who click the ad is not $0.3$. If the $p$-value is not small, then we do not reject $H_0$; the evidence from our data does not contradict the administrator's claim.\n", 108 | "\n", 109 | "What counts as a \"small\" $p$-value? Here, we will decide that if a $p$-value is less than 0.05, then the $p$-value is \"small\" and we reject the null hypothesis. If we see a $p$-value greater than 0.05, we will not reject the null hypothesis. (We could have chosen a number other than 0.05; maybe 0.01 if we wanted to err on the side of not contradicting the administrator.)\n", 110 | "\n", 111 | "I now conduct the test and compute the $p$-value. (By default, `proportions_ztest()` estimates the variance from the sample proportion $\\hat{p}$ rather than from $p_0$, so the statistic it reports can differ slightly from the formula above.)" 
112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "from statsmodels.stats.proportion import proportions_ztest # Performs the test just described\n", 121 | "\n", 122 | "res = proportions_ztest(count=310,\n", 123 | " nobs=1126,\n", 124 | " value=0.3, # The hypothesized value of population proportion p\n", 125 | " alternative='two-sided') # Tests the \"not equal to\" alternative hypothesis\n", 126 | "\n", 127 | "res # A tuple; the first entry is the value of the test statistic, and the second is the p-value" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "Here, we got a test statistic of $z \\approx -1.85$ and a $p$-value of $\\approx 0.0636 > 0.05$. We conclude there is not enough statistical evidence to disagree with the website administrator." 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "## Testing for Common Proportions\n", 142 | "\n", 143 | "The website decides to conduct an experiment. One day, the website shows its visitors different versions of an advertisement created by a sponsor. Users are randomly assigned to Version A and Version B. The website tracks how often Version A was clicked and how often Version B was clicked.\n", 144 | "\n", 145 | "On this day, 516 visitors saw Version A of the ad, and 510 saw Version B. 
Of those who saw Version A, 108 clicked the ad, while 144 of those who saw Version B clicked it.\n", 146 | "\n", 147 | "Which ad generates more clicks?\n", 148 | "\n", 149 | "Here we test the following hypotheses:\n", 150 | "\n", 151 | "$$H_0: p_A = p_B$$\n", 152 | "$$H_A: p_A \\neq p_B$$\n", 153 | "\n", 154 | "The test statistic for this test is:\n", 155 | "\n", 156 | "$$z = \\frac{\\hat{p}_A - \\hat{p}_B}{\\sqrt{\\hat{p}(1 - \\hat{p})\\left(\\frac{1}{n_A} + \\frac{1}{n_B}\\right)}}$$\n", 157 | "\n", 158 | "where $\\hat{p}_A$ and $\\hat{p}_B$ are the sample proportions for group A and group B, $n_A$ and $n_B$ are the group sample sizes, and $\\hat{p}$ is the proportion from the pooled sample (grouping A and B together).\n", 159 | "\n", 160 | "`proportions_ztest()` can perform this test." 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": { 167 | "collapsed": true 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "import numpy as np" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "np.array([108, 144])" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "proportions_ztest(count=np.array([108, 144]),\n", 190 | " nobs=np.array([516, 510]),\n", 191 | " alternative='two-sided')" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "With a p-value of about 0.0066, which is small, we reject the null hypothesis; it appears that the two ads do not have the same proportion of clicks." 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "## $\\chi^2$ Test for Goodness of Fit\n", 206 | "\n", 207 | "The **$\\chi^2$ test for goodness of fit** generalizes the test for a population proportion. 
Whereas we have worked before with variables that either do or do not have some characteristic (such as: a visitor either did or did not click an ad), this test checks whether a variable that could fall in some category has some distribution.\n", 208 | "\n", 209 | "Suppose a website offers five colors of shoes: blue, black, brown, white, and red. We want to know whether each color is equally likely or not. That is, if $p_{\\text{color}}$ is the proportion of shoe buyers who bought a particular color, we wish to test:\n", 210 | "\n", 211 | "$$H_0: p_{\\text{blue}} = p_{\\text{black}} = p_{\\text{brown}} = p_{\\text{white}} = p_{\\text{red}}$$\n", 212 | "$$H_A: H_0 \\text{ is false}$$\n", 213 | "\n", 214 | "Suppose that out of 464 buyers of shoes, 98 bought blue shoes, 117 bought black shoes, 80 bought brown shoes, 73 bought white shoes, and 96 bought red shoes. If each shoe is equally likely to be bought, then $p_{\\text{color}} = 0.2$ for every color. We would expect to see $0.2 \\times 464 = 92.8$ pairs of each color sold if this were true.\n", 215 | "\n", 216 | "We can now use the function `chisquare()` from **scipy.stats** to perform the test." 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "from scipy.stats import chisquare" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "chisquare(f_obs=[98, 117, 80, 73, 96], # Observed frequency for each color\n", 237 | " f_exp=[464 * .2, 464 * .2, 464 * .2, 464 * .2, 464 * .2]) # Expected frequency under the null hypothesis" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "The p-value is approximately 0.0128. This is small, and suggests that the null hypothesis is false. 
It's likely that some shoe colors are more popular than others (black is a prime suspect)." 245 | ] 246 | } 247 | ], 248 | "metadata": { 249 | "kernelspec": { 250 | "display_name": "Python 3", 251 | "language": "python", 252 | "name": "python3" 253 | }, 254 | "language_info": { 255 | "codemirror_mode": { 256 | "name": "ipython", 257 | "version": 3 258 | }, 259 | "file_extension": ".py", 260 | "mimetype": "text/x-python", 261 | "name": "python", 262 | "nbconvert_exporter": "python", 263 | "pygments_lexer": "ipython3", 264 | "version": "3.6.5" 265 | } 266 | }, 267 | "nbformat": 4, 268 | "nbformat_minor": 2 269 | } 270 | -------------------------------------------------------------------------------- /Chapter01/Correlations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Finding Correlations Using Pandas and SciPy\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "**Correlation** is a measure of how strongly two variables are related to one another. The most common measure of correlation is the **Pearson correlation coefficient**, which, for two sets of paired data $x_i$ and $y_i$, is defined as\n", 11 | "\n", 12 | "$$r = \\frac{1}{n - 1}\\sum_{i = 1}^n \\left(\\frac{x_i - \\bar{x}}{s_x}\\right)\\left(\\frac{y_i - \\bar{y}}{s_y}\\right)$$\n", 13 | "\n", 14 | "$r$ is a number between -1 and 1, with $r > 0$ indicating a positive relationship ($x$ and $y$ increase together) and $r < 0$ a negative relationship ($y$ decreases as $x$ increases). When $|r| = 1$, there is a perfect *linear* relationship, while if $r = 0$ there is no *linear* relationship ($r$ may fail to capture non-linear relationships). In practice, $r$ is never exactly 0, so values of $r$ with small magnitude are treated as \"no correlation\". $|r| = 1$ does occur, usually when two variables effectively describe the same phenomenon (for example, height in meters vs. 
height in centimeters, or grocery bill and sales tax).\n", 15 | "\n", 16 | "## Loading the Boston House Price Dataset\n", 17 | "\n", 18 | "The Boston housing prices dataset is included with **sklearn** as a \"toy\" dataset (one used to experiment with statistical and machine learning methods). It includes the results of a survey that prices houses from various areas of Boston, and includes variables such as the crime rate of an area, the age of the home owners, and other variables. While many applications focus on predicting the price of housing based on these variables, I'm only interested in the correlation between these variables (perhaps this will suggest a model later).\n", 19 | "\n", 20 | "Below I load in the dataset and create a Pandas `DataFrame` from it." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "from sklearn.datasets import load_boston\n", 32 | "import pandas as pd\n", 33 | "from pandas import DataFrame\n", 34 | "import matplotlib.pyplot as plt\n", 35 | "%matplotlib inline" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "boston = load_boston()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "print(boston.DESCR)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": false, 65 | "scrolled": true 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "boston.data" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "boston.feature_names" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 
| "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "boston.target" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false, 99 | "scrolled": false 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "temp = DataFrame(boston.data, columns=pd.Index(boston.feature_names))\n", 104 | "boston = temp.join(DataFrame(boston.target, columns=[\"PRICE\"]))\n", 105 | "boston" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "## Correlation Between Two Variables\n", 113 | "\n", 114 | "We could use NumPy's `corrcoef()` function if we wanted the correlation between two variables, say, the local area crime rate (CRIM) and the price of a home (PRICE)." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "from numpy import corrcoef" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "boston.CRIM.values # As a NumPy array (as_matrix() was removed in pandas 1.0)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "corrcoef(boston.CRIM.values, boston.PRICE.values)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "The numbers in the off-diagonal entries correspond to the correlation between the two variables. 
In this case, there is a negative relationship, which makes sense (more crime is associated with lower prices), but the correlation is only moderate.\n", 155 | "\n", 156 | "## Computing a Correlation Matrix\n", 157 | "\n", 158 | "When we have several variables we may want to see what correlations there are among them. We can compute a **correlation matrix** that includes the correlations between the different variables in the dataset.\n", 159 | "\n", 160 | "Once the data is loaded into a Pandas `DataFrame`, we can use the `corr()` method to get the correlation matrix." 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": { 167 | "collapsed": false 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "boston.corr()" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "While this contains a lot of information, it's not easy to read. Let's visualize the correlations with a heatmap." 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "import seaborn as sns # Allows for easy plotting of heatmaps" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": { 196 | "collapsed": false 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "sns.heatmap(boston.corr(), annot=True)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "The heatmap reveals some interesting patterns. 
We can see\n", 208 | "\n", 209 | "* A strong positive relationship between home prices and the average number of rooms for homes in that area (RM)\n", 210 | "* A strong negative relationship between home prices and the percentage of lower status of the population (LSTAT)\n", 211 | "* A strong positive relationship between accessibility to radial highways (RAD) and property taxes (TAX)\n", 212 | "* A negative relationship between nitric oxides concentration (NOX) and distance to major employment areas in Boston\n", 213 | "* No relationship between the Charles River variable (CHAS) and any other variable\n", 214 | "\n", 215 | "## Statistical Test for Correlation\n", 216 | "\n", 217 | "Suppose we want extra assurance that two variables are correlated. We could perform a statistical test that tests\n", 218 | "\n", 219 | "$$H_0: \\rho = 0$$\n", 220 | "$$H_A: \\rho \\neq 0$$\n", 221 | "\n", 222 | "(where $\\rho$ is the population, or \"true\", correlation). SciPy provides this test." 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "collapsed": true 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "from scipy.stats import pearsonr" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": { 240 | "collapsed": false 241 | }, 242 | "outputs": [], 243 | "source": [ 244 | "# Test to see if crime rate and house prices are correlated\n", 245 | "pearsonr(boston.CRIM, boston.PRICE)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "The first number in the returned tuple is the computed sample correlation coefficient $r$, and the second number is the p-value of the test. In this case, the evidence that there is *any* non-zero correlation is strong. That said, just because we can conclude that the correlation is not zero does not mean that the correlation is meaningful." 
253 | ] 254 | } 255 | ], 256 | "metadata": { 257 | "kernelspec": { 258 | "display_name": "Python [Root]", 259 | "language": "python", 260 | "name": "Python [Root]" 261 | }, 262 | "language_info": { 263 | "codemirror_mode": { 264 | "name": "ipython", 265 | "version": 3 266 | }, 267 | "file_extension": ".py", 268 | "mimetype": "text/x-python", 269 | "name": "python", 270 | "nbconvert_exporter": "python", 271 | "pygments_lexer": "ipython3", 272 | "version": "3.5.2" 273 | } 274 | }, 275 | "nbformat": 4, 276 | "nbformat_minor": 2 277 | } 278 | -------------------------------------------------------------------------------- /Chapter03/.ipynb_checkpoints/BeyondBinary-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Going Beyond Binary\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "I have emphasized binary classification because it is the simplest form of classification and it is easier to develop binary classifiers than classifiers that predict one of more than two labels (which we may call **multiclass classification**). That said, such use cases certainly exist. What can we do then?\n", 11 | "\n", 12 | "Let's take for example predicting the species of flowers in the iris dataset. Below I load in the iris dataset." 
13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "from sklearn.datasets import load_iris\n", 24 | "from sklearn.model_selection import train_test_split, cross_validate\n", 25 | "from sklearn.metrics import classification_report" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "iris_obj = load_iris()\n", 35 | "flower, species = iris_obj.data, iris_obj.target\n", 36 | "flower_train, flower_test, species_train, species_test = train_test_split(flower, species, test_size = 0.1)\n", 37 | "flower_train[:5, :]" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "species_train[:5]" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Inherently Multiclass Classifiers\n", 54 | "\n", 55 | "Some classifiers don't lean on the binary assumption and are ready for predicting one of many labels already. 
Classifiers we've seen that are inherently multiclass classifiers include:\n", 56 | "\n", 57 | "* KNN\n", 58 | "* Decision trees\n", 59 | "* Random forests\n", 60 | "* Naive Bayes\n", 61 | "\n", 62 | "### KNN\n", 63 | "\n", 64 | "We already saw KNN applied to this dataset and its ability to predict one of many labels.\n", 65 | "\n", 66 | "### Decision Tree" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "from sklearn.tree import DecisionTreeClassifier\n", 78 | "from io import StringIO # sklearn.externals.six was removed from scikit-learn\n", 79 | "from IPython.display import Image \n", 80 | "from sklearn.tree import export_graphviz\n", 81 | "import pydotplus" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "tree = DecisionTreeClassifier(max_depth=3)\n", 91 | "tree = tree.fit(flower_train, species_train)\n", 92 | "print(classification_report(species_test, tree.predict(flower_test)))" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "dot_data = StringIO()\n", 102 | "\n", 103 | "export_graphviz(tree, # Function for exporting a visualization of the tree\n", 104 | " out_file=dot_data,\n", 105 | " # Data controlling the display of the graph\n", 106 | " filled=True, rounded=True,\n", 107 | " special_characters=True,\n", 108 | " feature_names=[\"Sepal Length\", \"Sepal Width\",\n", 109 | " \"Petal Length\", \"Petal Width\"], # Use the name of the features\n", 110 | " proportion=True) # Show proportions for labels\n", 111 | "\n", 112 | "# Display graph in Jupyter notebook\n", 113 | "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", 114 | "Image(graph.create_png())" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### Random Forest" 122 
| ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": { 128 | "collapsed": true 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "from sklearn.ensemble import RandomForestClassifier" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "forest = RandomForestClassifier(n_estimators=20, max_depth=2)\n", 142 | "forest.fit(flower_train, species_train)\n", 143 | "print(classification_report(species_test, forest.predict(flower_test)))" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "### Naive Bayes\n", 151 | "\n", 152 | "In this case, I will use the exclusively Gaussian variant of the naive Bayes classifier, implemented in `GaussianNB`, since all variables in the iris dataset are continuous variables." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "collapsed": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "from sklearn.naive_bayes import GaussianNB" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "nb = GaussianNB()\n", 173 | "nb = nb.fit(flower_train, species_train)\n", 174 | "print(classification_report(species_test, nb.predict(flower_test)))" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "## One vs. All and One vs. One Classification\n", 182 | "\n", 183 | "After we exhaust classifiers that are inherently multiclass, we are forced to combine binary classifiers in such a way that they can predict multiple labels. SVMs and logistic regression are examples of classifiers that are not inherently multiclass and need to be handled this way.\n", 184 | "\n", 185 | "**One vs. 
all** classification trains a classifier for every class, where for each classifier trained, one class exclusively consists of \"successes\" and all data points not in that class are \"failures\". All classifiers make a prediction, and if a classifier predicts \"success\" while others predict \"failure\", the class associated with that classifier is the predicted class.\n", 186 | "\n", 187 | "One vs. all classification is simple since we need as many classifiers as we have classes, and so can be done relatively quickly. It also works well when the number of data points in the training set doesn't cause large performance lags. Thus this scheme is popular. However, this algorithm assumes that every class can be separated from the rest by a single hyperplane; this may not be true, in which case learning fails.\n", 188 | "\n", 189 | "**One vs. one** classification trains a classifier for every *combination* of classes. The training dataset is restricted to observations from these two classes, and a classifier is trained to distinguish them. In prediction, each classifier trained this way makes a prediction. The most common class predicted among the classifiers is the class finally predicted.\n", 190 | "\n", 191 | "This mode of classification requires more classifiers; if there are $K$ classes, $\\frac{K(K-1)}{2} \\sim K^2$ classifiers are needed. This slows down prediction as well. This scheme does work well, though, when training the classifiers is expensive with respect to the size of the dataset (smaller datasets are used for training).\n", 192 | "\n", 193 | "All classifiers implemented in **scikit-learn** support multiclass classification out of the box; `SVC` and `LogisticRegression`, in particular, already support these schemes. However, the **multiclass** module includes objects that allow for manual implementation of these schemes: `OneVsRestClassifier` for the one vs. all scheme, and `OneVsOneClassifier` for the one vs. 
one scheme.\n", 194 | "\n", 195 | "`SVC` by default implements the one vs. one method, and `LogisticRegression` uses the one vs. all method.\n", 196 | "\n", 197 | "### SVM (One vs. One)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": true 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "from sklearn.svm import SVC" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "svm = SVC()\n", 218 | "svm.fit(flower_train, species_train)\n", 219 | "print(classification_report(species_test, svm.predict(flower_test)))" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": true 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "from sklearn.linear_model import LogisticRegression" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "### Logistic Regression (One vs. 
All)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "logit = LogisticRegression()\n", 247 | "logit.fit(flower_train, species_train)\n", 248 | "print(classification_report(species_test, logit.predict(flower_test)))" 249 | ] 250 | } 251 | ], 252 | "metadata": { 253 | "kernelspec": { 254 | "display_name": "Python 3", 255 | "language": "python", 256 | "name": "python3" 257 | }, 258 | "language_info": { 259 | "codemirror_mode": { 260 | "name": "ipython", 261 | "version": 3 262 | }, 263 | "file_extension": ".py", 264 | "mimetype": "text/x-python", 265 | "name": "python", 266 | "nbconvert_exporter": "python", 267 | "pygments_lexer": "ipython3", 268 | "version": "3.6.5" 269 | } 270 | }, 271 | "nbformat": 4, 272 | "nbformat_minor": 2 273 | } 274 | -------------------------------------------------------------------------------- /Chapter03/.ipynb_checkpoints/LogisticRegression-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Logistic Regression for Machine Learning\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "**Logistic regression** (also referred to as **logit models**) is a form of regression that, given an observation's features, produces probabilities predicting whether an observation belongs to a certain class. While common in machine learning, logit models are popular statistical models in general, appearing in fields such as economics and medicine.\n", 11 | "\n", 12 | "After fitting a logit model, we make predictions using the probability produced by the model that an observation belongs to a certain class. For a given observation, we may decide to predict the class to which it is most likely to belong; in other words, if the probability that an observation belongs to a class is greater than 0.5, we predict it belongs to that class. 
(In principle we could choose a different threshold than 0.5.)\n", 13 | "\n", 14 | "Logit models are considered linear models, but by changing the features the model uses we may express non-linear relationships.\n", 15 | "\n", 16 | "Logit models are implemented in **scikit-learn** in the `LogisticRegression` class.\n", 17 | "\n", 18 | "Again we will work with the *Titanic* dataset. We will make the same transformations we needed when fitting an SVM." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "import pandas as pd\n", 30 | "from pandas import DataFrame\n", 31 | "from sklearn.model_selection import train_test_split, cross_validate\n", 32 | "from sklearn.metrics import classification_report" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "titanic = pd.read_csv(\"titanic.csv\")\n", 42 | "titanic.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)\n", 43 | "titanic.drop(\"Name\", axis=1, inplace=True)\n", 44 | "titanic = titanic.join(pd.get_dummies(titanic.Pclass, prefix='Pclass')).drop(\"Pclass\", axis=1)\n", 45 | "titanic_train, titanic_test = train_test_split(titanic)\n", 46 | "titanic_train.head()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Fitting a Logit Model\n", 54 | "\n", 55 | "Fitting logit models is similar to what we've seen before." 
56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "from sklearn.linear_model import LogisticRegression" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "logit = LogisticRegression()\n", 76 | "logit.fit(X=titanic_train.drop(\"Survived\", axis=1),\n", 77 | " y=titanic_train.Survived)\n", 78 | "logit.predict([[0, 26, 0, 0, 30, 0, 1, 0]]) # Example prediction" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "logit.predict_proba([[0, 26, 0, 0, 30, 0, 1, 0]]) # What is the probability of belonging to certain classes?" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "print(classification_report(titanic_train.Survived, logit.predict(titanic_train.drop(\"Survived\", axis=1))))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Let's see the logit model's performance on test data." 
104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "print(classification_report(titanic_test.Survived, logit.predict(titanic_test.drop(\"Survived\", axis=1))))" 113 | ] 114 | } 115 | ], 116 | "metadata": { 117 | "kernelspec": { 118 | "display_name": "Python 3", 119 | "language": "python", 120 | "name": "python3" 121 | }, 122 | "language_info": { 123 | "codemirror_mode": { 124 | "name": "ipython", 125 | "version": 3 126 | }, 127 | "file_extension": ".py", 128 | "mimetype": "text/x-python", 129 | "name": "python", 130 | "nbconvert_exporter": "python", 131 | "pygments_lexer": "ipython3", 132 | "version": "3.6.5" 133 | } 134 | }, 135 | "nbformat": 4, 136 | "nbformat_minor": 2 137 | } 138 | -------------------------------------------------------------------------------- /Chapter03/.ipynb_checkpoints/NaiveBayes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Making Predictions Using the Naive Bayes Algorithm\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "The **naive Bayes** algorithm is based on Bayes' theorem. Let's suppose that our data uses features $X_1, X_2, ..., X_K$ and $Y$ is the target variable. One form to describe the naive Bayes classifier is:\n", 11 | "\n", 12 | "$$P(Y = y | X_1 = x_1, ..., X_K = x_K) = \\frac{P(X_1 = x_1, ..., X_K = x_K | Y = y) P(Y = y)}{P(X_1 = x_1, ..., X_K = x_K)}$$\n", 13 | "\n", 14 | "The \"naive\" part of the naive Bayes classifier is to make the (unrealistic) assumption that all the features are **independent** random variables; that is, information about one feature provides essentially no information about the other, and $P(X_i = x_i | X_j = x_j) = P(X_i = x_i)$ when $i \\neq j$ (but we *don't* assume the features are independent of the target variable). 
Under this assumption, $P(X_i = x_i, X_j = x_j) = P(X_i = x_i)P(X_j = x_j)$ and we can rewrite Bayes' theorem as\n", 15 | "\n", 16 | "$$P(Y = y | X_1 = x_1, ..., X_K = x_K) = \\frac{P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)}{\\prod_{k = 1}^{K} P(X_k = x_k)} \\propto P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)$$\n", 17 | "\n", 18 | "The quantities in the above expression (after the $\\propto$ symbol) are easily estimated from a dataset; this is what is done when training the algorithm.\n", 19 | "\n", 20 | "When predicting, we observe features $x_1, ..., x_K$ for a data point. The predicted label $y$ is the label that maximizes $P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)$; that is, if $\\hat{y}$ represents our prediction, then\n", 21 | "\n", 22 | "$$\\hat{y} = \\arg \\max_y P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)$$\n", 23 | "\n", 24 | "Notice that what I have written works only when $X_k$ are all discrete; this algorithm will work for categorical features (like passenger class or sex) or variables that take countable values (like number of children aboard), but it won't work for features we may consider continuous (like fare paid) since in that case $P(X_k = x_k | Y = y) = 0$ always. We fix this problem by replacing, for continuous features, $P(X_k = x_k | Y = y)$ with $f_k(x_k | y)$, where $f_k( \\cdot | y)$ is a probability density function. A common choice is the Gaussian density:\n", 25 | "\n", 26 | "$$f_k(x_k | y) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2_{k,y}}} \\exp\\left(-\\frac{(x_k - \\mu_{k,y})^2}{2 \\sigma_{k,y}^2}\\right)$$\n", 27 | "\n", 28 | "(Note that $\\exp(x) = e^x$.) 
The parameters $\\mu_{k,y}$ and $\\sigma_{k,y}$ are estimated from the data.\n", 29 | "\n", 30 | "If $U_1, ..., U_I$ are discrete variables and $V_1, ..., V_J$ continuous, we can rewrite the naive Bayes classifier like so:\n", 31 | "\n", 32 | "$$\\hat{y} = \\arg \\max_y P(Y = y) \\prod_{i = 1}^{I} P(U_i = u_i | Y = y) \\prod_{j = 1}^{J} f_j(v_j | y)$$\n", 33 | "\n", 34 | "Hyperparameters for the naive Bayes algorithm are the prior distributions for all features and probabilities mentioned here, before observing data. In particular we may want to choose a prior probability for $P(Y = y)$.\n", 35 | "\n", 36 | "In this notebook I will be implementing Bernoulli naive Bayes, where we don't allow for any continuous random variables; every variable takes one of two values. This means we will need to implement binning for continuous variables and break up count variables into each observed count. This requires preprocessing the data. The choice of bins also acts like a hyperparameter.\n", 37 | "\n", 38 | "Bernoulli naive Bayes is implemented in **scikit-learn** through the `BernoulliNB` object.\n", 39 | "\n", 40 | "## Linear Separability\n", 41 | "\n", 42 | "KNN, decision trees, and decision forests don't assume much about the underlying data, but from this point on the classifiers I consider (including the naive Bayes algorithm) assume that the data is **linearly separable**. That is, data with different labels can be separated by a straight line in 2D space, or a hyperplane in N-dimensional space (a hyperplane generalizes the notion of a line). Data usually lies in a space with more than three dimensions. If this assumption is violated, the algorithm may struggle, if not fail outright, to correctly learn and classify the data.\n", 43 | "\n", 44 | "In general it's difficult to check whether data is linearly separable. 
Visualization may be useful for determining this.\n", 45 | "\n", 46 | "## Preprocessing the Data\n", 47 | "\n", 48 | "We will continue our work with the *Titanic* dataset." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "import pandas as pd\n", 60 | "from pandas import DataFrame\n", 61 | "from sklearn.model_selection import train_test_split, cross_validate\n", 62 | "from sklearn.metrics import classification_report" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "titanic = pd.read_csv(\"titanic.csv\")\n", 72 | "titanic.head()" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "Here we would model `Pclass`, `Sex`, `Siblings/Spouses Aboard`, and `Parents/Children Aboard` as discrete variables, while `Age` and `Fare` should be considered continuous. We will need to bin `Age` and `Fare` in order to be able to use `BernoulliNB`. Our work with decision trees may suggest what bins to use (we may also want to use cross-validation)." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "pd.cut(titanic.Age, bins=[-1, 2, titanic.Age.max() + 1]).head()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "scrolled": true 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "pd.cut(titanic.Fare, bins=[0, 23.35, titanic.Fare.max() + 1]).head()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "titanic = titanic.assign(Age_cat=(titanic.Age <= 2), Fare_cat=(titanic.Fare <= 23.35))\n", 109 | "titanic.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)\n", 110 | "titanic.drop(['Age', 'Fare', 'Name'], axis=1, inplace=True)\n", 111 | "titanic.head()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "titanic_train, titanic_test = train_test_split(titanic)\n", 121 | "titanic_train.head()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## Training a Naive Bayes Algorithm\n", 129 | "\n", 130 | "Let's now fit a Bernoulli naive Bayes algorithm." 
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": { 137 | "collapsed": true 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "from sklearn.naive_bayes import BernoulliNB" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "bnb = BernoulliNB(alpha=0, # Additive smoothing parameter; setting to 0 for no smoothing\n", 151 | " fit_prior=False, # Don't learn class priors from the data; use a uniform prior\n", 152 | " class_prior=None) # Don't supply prior probabilities for the classes\n", 153 | "bnb = bnb.fit(titanic_train.drop(\"Survived\", axis=1), titanic_train.Survived)\n", 154 | "print(classification_report(titanic_train.Survived, bnb.predict(titanic_train.drop(\"Survived\", axis=1))))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Now let's see the algorithm's predictive accuracy on the test set." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "survived_test_predict = bnb.predict(titanic_test.drop(\"Survived\", axis=1))\n", 171 | "print(classification_report(titanic_test.Survived, survived_test_predict))" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "In-sample and out-of-sample performance for this dataset are similar. That demonstrates one feature of linear models that data scientists like: linear models tend to generalize well.\n", 179 | "\n", 180 | "That said, let's see if we can get better performance." 
181 | ] 182 | } 183 | ], 184 | "metadata": { 185 | "kernelspec": { 186 | "display_name": "Python 3", 187 | "language": "python", 188 | "name": "python3" 189 | }, 190 | "language_info": { 191 | "codemirror_mode": { 192 | "name": "ipython", 193 | "version": 3 194 | }, 195 | "file_extension": ".py", 196 | "mimetype": "text/x-python", 197 | "name": "python", 198 | "nbconvert_exporter": "python", 199 | "pygments_lexer": "ipython3", 200 | "version": "3.6.5" 201 | } 202 | }, 203 | "nbformat": 4, 204 | "nbformat_minor": 2 205 | } 206 | -------------------------------------------------------------------------------- /Chapter03/.ipynb_checkpoints/RandomForests-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Machine Learning Using Random Forests\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "A **random forest** is a collection of decision trees that each individually make a prediction for an observation. Each tree is formed from a random subset of the training set. The majority decision among the trees is then the predicted value of an observation. Random forests are an example of **ensemble methods**, where the predictions of individual classifiers are used for decision making.\n", 11 | "\n", 12 | "The **scikit-learn** class `RandomForestClassifier` can be used for training random forests. For random forests we may consider an additional hyperparameter to tree depth: the number of trees to train. Each tree should individually be shallow, and having more trees should lead to less overfitting.\n", 13 | "\n", 14 | "We will still be using the *Titanic* dataset." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import pandas as pd\n", 24 | "from pandas import DataFrame\n", 25 | "from sklearn.model_selection import train_test_split, cross_validate\n", 26 | "from sklearn.metrics import classification_report\n", 27 | "from numpy.random import seed # Seed NumPy's global generator, which scikit-learn uses by default" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "seed(110717) # Set the seed\n", 37 | "titanic = pd.read_csv(\"titanic.csv\")\n", 38 | "titanic_train, titanic_test = train_test_split(titanic)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## Growing a Random Forest\n", 46 | "\n", 47 | "Let's generate a random forest where I cap the depth for each tree at $m = 5$ and grow 10 trees." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 3, 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "name": "stderr", 57 | "output_type": "stream", 58 | "text": [ 59 | "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\ensemble\\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", 60 | " from numpy.core.umath_tests import inner1d\n" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "from sklearn.ensemble import RandomForestClassifier" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/plain": [ 76 | "array([0], dtype=int64)" 77 | ] 78 | }, 79 | "execution_count": 4, 80 | "metadata": {}, 81 | "output_type": "execute_result" 82 | } 83 | ], 84 | "source": [ 85 | "forest1 = RandomForestClassifier(n_estimators=10, # Number of trees to grow\n", 86 | " max_depth=5) # Maximum depth of a tree\n", 87 | "forest1.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 88 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 89 | " y=titanic_train.Survived)\n", 90 | "\n", 91 | "# Example prediction\n", 92 | "forest1.predict([[2, 0, 26, 0, 0, 30]])" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 5, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | " precision recall f1-score support\n", 105 | "\n", 106 | " 0 0.83 0.92 0.87 399\n", 107 | " 1 0.86 0.71 0.78 266\n", 108 | "\n", 109 | "avg / total 0.84 0.84 0.83 665\n", 110 | "\n" 111 | ] 112 | } 113 | ], 114 | "source": [ 115 | "pred1 = forest1.predict(titanic_train.replace({'Sex': {'male': 0, 'female': 1}}\n", 116 | " ).drop([\"Survived\", \"Name\"], axis=1))\n", 117 | "print(classification_report(titanic_train.Survived, pred1))" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "The random forest does not perform as well on the training data as a full-grown decision tree, but such a tree overfit. 
The random forest, in comparison, seems to do about as well as a better-tuned decision tree so far.\n", 125 | "\n", 126 | "## Optimizing Multiple Hyperparameters\n", 127 | "\n", 128 | "We now have two hyperparameters to optimize: tree depth and the number of trees to grow. We have a few ways to proceed:\n", 129 | "\n", 130 | "1. We could use cross-validation to see which combination of hyperparameters performs the best. Beware that there could be many combinations to check!\n", 131 | "2. We could use cross-validation to optimize one hyperparameter first, then the next, and so on. While not necessarily producing a globally optimal solution, this is less work and likely yields a \"good enough\" result.\n", 132 | "3. We could randomly pick combinations of hyperparameters and use the results to guess a good combination. This is like option 1 but less work.\n", 133 | "\n", 134 | "Here I will go with option 2. I will optimize the number of trees to use first, then maximum tree depth." 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "n_candidate = [10, 20, 30, 40, 60, 80, 100] # Candidate forest sizes\n", 144 | "res1 = dict()\n", 145 | "\n", 146 | "for n in n_candidate:\n", 147 | " pred3 = RandomForestClassifier(n_estimators=n, max_depth=5)\n", 148 | " res1[n] = cross_validate(pred3,\n", 149 | " X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 150 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 151 | " y=titanic_train.Survived,\n", 152 | " cv=10,\n", 153 | " return_train_score=False,\n", 154 | " scoring='accuracy')\n", 155 | "\n", 156 | "res1df = DataFrame({(i, j): res1[i][j]\n", 157 | " for i in res1.keys()\n", 158 | " for j in res1[i].keys()}).T\n", 159 | "\n", 160 | "res1df.loc[(slice(None), 'test_score'), :]" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | 
"source": [ 169 | "res1df.loc[(slice(None), 'test_score'), :].mean(axis=1)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "$n = 100$ seems to do well. Now let's pick optimal tree depth." 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": true 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "m_candidate = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # Candidate depths" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "res2 = dict()\n", 197 | "\n", 198 | "for m in m_candidate:\n", 199 | " pred3 = RandomForestClassifier(max_depth=m, n_estimators=100) # Use the forest size chosen above\n", 200 | " res2[m] = cross_validate(pred3,\n", 201 | " X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 202 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 203 | " y=titanic_train.Survived,\n", 204 | " cv=10,\n", 205 | " return_train_score=False,\n", 206 | " scoring='accuracy')\n", 207 | "\n", 208 | "res2df = DataFrame({(i, j): res2[i][j]\n", 209 | " for i in res2.keys()\n", 210 | " for j in res2[i].keys()}).T\n", 211 | "\n", 212 | "res2df.loc[(slice(None), 'test_score'), :]" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "res2df.loc[(slice(None), 'test_score'), :].mean(axis=1)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "A maximum tree depth of $m = 7$ seems to work well. A way to try and combat the path-dependence of this approach would be to repeat the search for optimal forest size but with the new tree depth and so on, but I will not do so here.\n", 229 | "\n", 230 | "Let's now see how the new random forest performs on the test set." 
231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": true 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "forest2 = RandomForestClassifier(max_depth=7, n_estimators=100) # Depth and forest size chosen above\n", 242 | "forest2.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 243 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 244 | " y=titanic_train.Survived)\n", 245 | "\n", 246 | "survived_test_predict = forest2.predict(X=titanic_test.replace(\n", 247 | " {'Sex': {'male': 0, 'female': 1}}\n", 248 | ").drop([\"Survived\", \"Name\"], axis=1))" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "print(classification_report(titanic_test.Survived, survived_test_predict))" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "The random forest does reasonably well, though it does not appear to be much of an improvement over the decision tree. Given the complexity of the random forest, a simple decision tree would be preferred." 
265 | ] 266 | } 267 | ], 268 | "metadata": { 269 | "kernelspec": { 270 | "display_name": "Python 3", 271 | "language": "python", 272 | "name": "python3" 273 | }, 274 | "language_info": { 275 | "codemirror_mode": { 276 | "name": "ipython", 277 | "version": 3 278 | }, 279 | "file_extension": ".py", 280 | "mimetype": "text/x-python", 281 | "name": "python", 282 | "nbconvert_exporter": "python", 283 | "pygments_lexer": "ipython3", 284 | "version": "3.6.5" 285 | } 286 | }, 287 | "nbformat": 4, 288 | "nbformat_minor": 2 289 | } 290 | -------------------------------------------------------------------------------- /Chapter03/.ipynb_checkpoints/SVM-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Working With Support Vector Machines (SVM) for Classification and Detection\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "**Support vector machines (SVMs)** are linear classifiers. An SVM can be understood as a hyperplane (think: line) such that, ideally, one side of the plane contains only data belonging to one class while the other side contains all instances of the other class. Prediction amounts to determining on which side of the line a data point lies.\n", 11 | "\n", 12 | "Training an SVM involves finding a hyperplane that best separates data from different classes and trying to maximize the margin between the plane and the nearest data points from two different classes. By doing this SVMs tend to generalize well from training data to future data; they are not known to be prone to overfitting.\n", 13 | "\n", 14 | "A hyperparameter common to all SVMs is a parameter $C$ known as the error penalty parameter. 
Smaller $C$ tends to combat overfitting.\n", 15 | "\n", 16 | "SVMs can be trained using the `SVC` object provided in **scikit-learn**.\n", 17 | "\n", 18 | "## Kernel Methods\n", 19 | "\n", 20 | "The linearity assumption seems restrictive. Analysts overcome it by choosing a **kernel**, a mathematical function that alters the feature space an SVM is trained on. Choosing different kernels allows the boundary between classes to take different shapes, which may lead to better predictions.\n", 21 | "\n", 22 | "Again, we can consider choice of kernel as a hyperparameter to optimize. However, we may also use **domain knowledge** (our understanding of the phenomenon being learned) to pick a kernel.\n", 23 | "\n", 24 | "Let's load in the *Titanic* dataset again." 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import pandas as pd\n", 36 | "from pandas import DataFrame\n", 37 | "from sklearn.model_selection import train_test_split, cross_validate\n", 38 | "from sklearn.metrics import classification_report\n", 39 | "from numpy.random import seed # Seed NumPy's global generator, which scikit-learn uses by default" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "seed(110717)\n", 49 | "\n", 50 | "titanic = pd.read_csv(\"titanic.csv\")\n", 51 | "titanic.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)\n", 52 | "titanic.drop(\"Name\", axis=1, inplace=True)\n", 53 | "titanic.head()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "Here we would be wise to handle passenger class with more care. While written as numbers, this is actually a categorical or ordinal variable; the actual numbers don't matter. We should have binary variables, one for each class." 
61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "pd.get_dummies(titanic.Pclass).head()" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "titanic = titanic.join(pd.get_dummies(titanic.Pclass, prefix='Pclass')).drop(\"Pclass\", axis=1)\n", 79 | "titanic.head()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "titanic_train, titanic_test = train_test_split(titanic)\n", 89 | "titanic_train.head()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Training an SVM\n", 97 | "\n", 98 | "We can train an SVM like so:" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": true 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "from sklearn.svm import SVC" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "svm1 = SVC(C=1.0, # Penalty parameter C\n", 119 | " kernel='linear') # Using a linear kernel\n", 120 | "svm1.fit(X=titanic_train.drop(\"Survived\", axis=1), y=titanic_train.Survived)\n", 121 | "\n", 122 | "svm1.predict([[0, 26, 0, 0, 30, 0, 1, 0]]) # Predicting whether a 26 year old male without family aboard in second\n", 123 | " # class who paid $30 fare would survive" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "Choosing the kernel and $C$ could be done with cross-validation, but I will not demonstrate this (it would take too long for this video)." 
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "print(classification_report(titanic_train.Survived, svm1.predict(titanic_train.drop(\"Survived\", axis=1))))" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "The SVM does reasonably well on the training data. Let's see how it does on the test data." 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "survived_test_predict = svm1.predict(titanic_test.drop(\"Survived\", axis=1))\n", 156 | "print(classification_report(titanic_test.Survived, survived_test_predict))" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Performance on test data is slightly worse." 164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 3", 170 | "language": "python", 171 | "name": "python3" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 3 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython3", 183 | "version": "3.6.5" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 2 188 | } 189 | -------------------------------------------------------------------------------- /Chapter03/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /Chapter03/BeyondBinary.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | 
"cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Going Beyond Binary\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "I have emphasized binary classification because it is the simplest form of classification and it is easier to develop binary classifiers than classifiers that predict one of more than two labels (which we may call **multiclass classification**). That said, such use cases certainly exist. What can we do then?\n", 11 | "\n", 12 | "Let's take for example predicting the species of flowers in the iris dataset. Below I load in the iris dataset." 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "from sklearn.datasets import load_iris\n", 24 | "from sklearn.model_selection import train_test_split, cross_validate\n", 25 | "from sklearn.metrics import classification_report" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "iris_obj = load_iris()\n", 35 | "flower, species = iris_obj.data, iris_obj.target\n", 36 | "flower_train, flower_test, species_train, species_test = train_test_split(flower, species, test_size = 0.1)\n", 37 | "flower_train[:5, :]" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "species_train[:5]" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Inherently Multiclass Classifiers\n", 54 | "\n", 55 | "Some classifiers don't lean on the binary assumption and are ready for predicting one of many labels already. 
Classifiers we've seen that are inherently multiclass classifiers include:\n", 56 | "\n", 57 | "* KNN\n", 58 | "* Decision trees\n", 59 | "* Random forests\n", 60 | "* Naive Bayes\n", 61 | "\n", 62 | "### KNN\n", 63 | "\n", 64 | "We already saw KNN applied to this dataset and its ability to predict one of many labels.\n", 65 | "\n", 66 | "### Decision tree" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "from sklearn.tree import DecisionTreeClassifier\n", 78 | "from io import StringIO # sklearn.externals.six was removed in newer scikit-learn releases\n", 79 | "from IPython.display import Image \n", 80 | "from sklearn.tree import export_graphviz\n", 81 | "import pydotplus" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "tree = DecisionTreeClassifier(max_depth=3)\n", 91 | "tree = tree.fit(flower_train, species_train)\n", 92 | "print(classification_report(species_test, tree.predict(flower_test)))" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "dot_data = StringIO()\n", 102 | "\n", 103 | "export_graphviz(tree, # Function for exporting a visualization of the tree\n", 104 | " out_file=dot_data,\n", 105 | " # Data controlling the display of the graph\n", 106 | " filled=True, rounded=True,\n", 107 | " special_characters=True,\n", 108 | " feature_names=[\"Sepal Length\", \"Sepal Width\",\n", 109 | " \"Petal Length\", \"Petal Width\"], # Use the name of the features\n", 110 | " proportion=True) # Show proportions for labels\n", 111 | "\n", 112 | "# Display graph in Jupyter notebook\n", 113 | "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", 114 | "Image(graph.create_png())" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": 
[ 121 | "### Random Forest" 122 
| ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": { 128 | "collapsed": true 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "from sklearn.ensemble import RandomForestClassifier" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "forest = RandomForestClassifier(n_estimators=20, max_depth=2)\n", 142 | "forest.fit(flower_train, species_train)\n", 143 | "print(classification_report(species_test, forest.predict(flower_test)))" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "### Naive Bayes\n", 151 | "\n", 152 | "In this case, I will use the exclusively Gaussian variant of the naive Bayes classifier, implemented in `GaussianNB`, since all variables in the iris dataset are continuous variables." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "collapsed": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "from sklearn.naive_bayes import GaussianNB" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "nb = GaussianNB()\n", 173 | "nb = nb.fit(flower_train, species_train)\n", 174 | "print(classification_report(species_test, nb.predict(flower_test)))" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "## One vs. All and One vs. One Classification\n", 182 | "\n", 183 | "After we exhaust classifiers that are inherently multiclass, we are forced to combine binary classifiers in such a way that they can predict multiple labels. SVMs and logistic regression are examples of classifiers that are not inherently multiclass and need to be handled this way.\n", 184 | "\n", 185 | "**One vs. 
all** classification trains a classifier for every class, where for each classifier trained, one class exclusively consists of \"successes\" and all data points not in that class are \"failures\". All classifiers make a prediction, and if a classifier predicts \"success\" while others predict \"failure\", the class associated with that classifier is the predicted class.\n", 186 | "\n", 187 | "One vs. all classification is simple since we need only as many classifiers as we have classes, and so can be done relatively quickly. Every classifier is trained on the full training set, so the scheme works well as long as the size of the training set doesn't make training slow. Thus this scheme is popular. However, this algorithm assumes that every class can be separated from the rest by a single hyperplane; this may not be true, in which case learning fails.\n", 188 | "\n", 189 | "**One vs. one** classification trains a classifier for every *pair* of classes. The training dataset is restricted to observations from these two classes, and a classifier is trained to distinguish them. In prediction, each classifier trained this way makes a prediction. The most common class predicted among the classifiers is the class finally predicted.\n", 190 | "\n", 191 | "This mode of classification requires more classifiers; if there are $K$ classes, $\\frac{K(K-1)}{2} \\sim K^2$ classifiers are needed. This slows down prediction as well. This scheme does work well, though, when training the classifiers is expensive with respect to the size of the dataset (smaller datasets are used for training).\n", 192 | "\n", 193 | "All classifiers implemented in **scikit-learn** support multiclass classification out of the box; `SVC` and `LogisticRegression`, in particular, already support these schemes. However, the **multiclass** module includes objects that allow for manual implementation of these schemes: `OneVsRestClassifier` for the one vs. all scheme, and `OneVsOneClassifier` for the one vs. 
one scheme.\n", 194 | "\n", 195 | "`SVC` by default implements the one vs. one method, and `LogisticRegression` uses the one vs. all method.\n", 196 | "\n", 197 | "### SVM (One vs. One)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": true 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "from sklearn.svm import SVC" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "svm = SVC()\n", 218 | "svm.fit(flower_train, species_train)\n", 219 | "print(classification_report(species_test, svm.predict(flower_test)))" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": true 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "from sklearn.linear_model import LogisticRegression" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "### Logistic Regression (One vs. 
All)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "logit = LogisticRegression()\n", 247 | "logit.fit(flower_train, species_train)\n", 248 | "print(classification_report(species_test, logit.predict(flower_test)))" 249 | ] 250 | } 251 | ], 252 | "metadata": { 253 | "kernelspec": { 254 | "display_name": "Python 3", 255 | "language": "python", 256 | "name": "python3" 257 | }, 258 | "language_info": { 259 | "codemirror_mode": { 260 | "name": "ipython", 261 | "version": 3 262 | }, 263 | "file_extension": ".py", 264 | "mimetype": "text/x-python", 265 | "name": "python", 266 | "nbconvert_exporter": "python", 267 | "pygments_lexer": "ipython3", 268 | "version": "3.6.5" 269 | } 270 | }, 271 | "nbformat": 4, 272 | "nbformat_minor": 2 273 | } 274 | -------------------------------------------------------------------------------- /Chapter03/LogisticRegression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Logistic Regression for Machine Learning\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "**Logistic regression** (also referred to as **logit models**) is a form of regression that, given an observation's features, produces probabilities predicting whether an observation belongs to a certain class. While common in machine learning, logit models are popular statistical models in general, appearing in fields such as economics and medicine.\n", 11 | "\n", 12 | "After fitting a logit model, we make predictions using the probability produced by the model that an observation belongs to a certain class. For an observation, we may decide to predict the class to which it is most likely to belong; in other words, if the probability that an observation belongs to a class is greater than 0.5, we predict it belongs to that class. 
(In principle we could choose a different threshold than 0.5.)\n", 13 | "\n", 14 | "Logit models are considered linear models, but by changing the features the model uses we may express non-linear relationships.\n", 15 | "\n", 16 | "Logit models are implemented in **scikit-learn** in the `LogisticRegression` class.\n", 17 | "\n", 18 | "Again we will work with the *Titanic* dataset. We will make the same transformations we needed when fitting an SVM." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "import pandas as pd\n", 30 | "from pandas import DataFrame\n", 31 | "from sklearn.model_selection import train_test_split, cross_validate\n", 32 | "from sklearn.metrics import classification_report" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "titanic = pd.read_csv(\"titanic.csv\")\n", 42 | "titanic.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)\n", 43 | "titanic.drop(\"Name\", axis=1, inplace=True)\n", 44 | "titanic = titanic.join(pd.get_dummies(titanic.Pclass, prefix='Pclass')).drop(\"Pclass\", axis=1)\n", 45 | "titanic_train, titanic_test = train_test_split(titanic)\n", 46 | "titanic_train.head()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Fitting a Logit Model\n", 54 | "\n", 55 | "Fitting logit models is similar to what we've seen before." 
56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "from sklearn.linear_model import LogisticRegression" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "logit = LogisticRegression()\n", 76 | "logit.fit(X=titanic_train.drop(\"Survived\", axis=1),\n", 77 | " y=titanic_train.Survived)\n", 78 | "logit.predict([[0, 26, 0, 0, 30, 0, 1, 0]]) # Example prediction" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "logit.predict_proba([[0, 26, 0, 0, 30, 0, 1, 0]]) # What is the probability of belonging to certain classes?" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "print(classification_report(titanic_train.Survived, logit.predict(titanic_train.drop(\"Survived\", axis=1))))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Let's see the logit model's performance on test data." 
104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "print(classification_report(titanic_test.Survived, logit.predict(titanic_test.drop(\"Survived\", axis=1))))" 113 | ] 114 | } 115 | ], 116 | "metadata": { 117 | "kernelspec": { 118 | "display_name": "Python 3", 119 | "language": "python", 120 | "name": "python3" 121 | }, 122 | "language_info": { 123 | "codemirror_mode": { 124 | "name": "ipython", 125 | "version": 3 126 | }, 127 | "file_extension": ".py", 128 | "mimetype": "text/x-python", 129 | "name": "python", 130 | "nbconvert_exporter": "python", 131 | "pygments_lexer": "ipython3", 132 | "version": "3.6.5" 133 | } 134 | }, 135 | "nbformat": 4, 136 | "nbformat_minor": 2 137 | } 138 | -------------------------------------------------------------------------------- /Chapter03/Metadata.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Training-Systems-Using-Python-Statistical-Modeling/5eb619df9648570e83e910f781aa97fd19f17403/Chapter03/Metadata.docx -------------------------------------------------------------------------------- /Chapter03/NaiveBayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Making Predictions Using the Naive Bayes Algorithm\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "The **naive Bayes** algorithm is based on Bayes' theorem. Let's suppose that our data uses features $X_1, X_2, ..., X_K$ and $Y$ is the target variable. 
One form to describe the naive Bayes classifier is:\n", 11 | "\n", 12 | "$$P(Y = y | X_1 = x_1, ..., X_K = x_K) = \\frac{P(X_1 = x_1, ..., X_K = x_K | Y = y) P(Y = y)}{P(X_1 = x_1, ..., X_K = x_K)}$$\n", 13 | "\n", 14 | "The \"naive\" part of the naive Bayes classifier is to make the (unrealistic) assumption that all the features are **independent** random variables; that is, information about one feature provides essentially no information about the others, and $P(X_i = x_i | X_j = x_j) = P(X_i = x_i)$ when $i \\neq j$ (but we *don't* assume the features are independent of the target variable). Under this assumption, $P(X_i = x_i, X_j = x_j) = P(X_i = x_i)P(X_j = x_j)$ (and the same factorization is assumed to hold conditional on $Y = y$), so we can rewrite Bayes' theorem as\n", 15 | "\n", 16 | "$$P(Y = y | X_1 = x_1, ..., X_K = x_K) = \\frac{P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)}{\\prod_{k = 1}^{K} P(X_k = x_k)} \\propto P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)$$\n", 17 | "\n", 18 | "The quantities in the above expression (to the right of the $\\propto$ symbol) are easily estimated from a dataset; this is what is done when training the algorithm.\n", 19 | "\n", 20 | "When predicting, we observe features $x_1, ..., x_K$ for a data point. The predicted label $y$ is the label that maximizes $P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)$; that is, if $\\hat{y}$ represents our prediction, then\n", 21 | "\n", 22 | "$$\\hat{y} = \\arg \\max_y P(Y = y) \\prod_{k = 1}^{K} P(X_k = x_k | Y = y)$$\n", 23 | "\n", 24 | "Notice that what I have written works only when $X_k$ are all discrete; this algorithm will work for categorical features (like passenger class or sex) or variables that take countable values (like number of children aboard), but it won't work for features we may consider continuous (like fare paid) since in that case $P(X_k = x_k | Y = y) = 0$ always. 
We fix this problem by replacing, for continuous features, $P(X_k = x_k | Y = y)$ with $f_k(x_k | y)$, where $f_k( \\cdot | y)$ is a probability density function. 
A common choice is the Gaussian density:\n", 25 | "\n", 26 | "$$f_k(x_k | y) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2_{k,y}}} \\exp\\left(-\\frac{(x_k - \\mu_{k,y})^2}{2 \\sigma_{k,y}^2}\\right)$$\n", 27 | "\n", 28 | "(Note that $\\exp(x) = e^x$.) The parameters $\\mu_{k,y}$ and $\\sigma_{k,y}$ are estimated from the data.\n", 29 | "\n", 30 | "If $U_1, ..., U_I$ are discrete variables and $V_1, ..., V_J$ continuous, we can rewrite the naive Bayes classifier like so:\n", 31 | "\n", 32 | "$$\\hat{y} = \\arg \\max_y P(Y = y) \\prod_{i = 1}^{I} P(U_i = u_i | Y = y) \\prod_{j = 1}^{J} f_j(v_j | y)$$\n", 33 | "\n", 34 | "Hyperparameters for the naive Bayes algorithm are the prior distributions for all features and probabilities mentioned here, before observing data. In particular we may want to choose a prior probability for $P(Y = y)$.\n", 35 | "\n", 36 | "In this notebook I will be implementing Bernoulli naive Bayes, where we don't allow for any continuous random variables; every variable takes one of two values. This means we will need to implement binning for continuous variables and break up count variables into each observed count. This requires preprocessing the data. The choice of bins also acts like a hyperparameter.\n", 37 | "\n", 38 | "Bernoulli naive Bayes is implemented in **scikit-learn** through the `BernoulliNB` object.\n", 39 | "\n", 40 | "## Linear Separability\n", 41 | "\n", 42 | "KNN, decision trees, and decision forests don't assume much about the underlying data, but from this point on the classifiers I consider (including the naive Bayes algorithm) assume that the data is **linearly separable**. That is, data with different labels can be separated by a straight line in 2D space, or a hyperplane in N-dimensional space (a hyperplane is a general notion of a line). 
Data usually lies in a space that has more than three dimensions. If this assumption is violated, the algorithm may struggle, if not fail outright, to correctly learn and classify the data.\n", 43 | "\n", 44 | "In general it's difficult to check whether data is linearly separable. Visualization may be useful for determining this.\n", 45 | "\n", 46 | "## Preprocessing the Data\n", 47 | "\n", 48 | "We will continue our work with the *Titanic* dataset." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "import pandas as pd\n", 60 | "from pandas import DataFrame\n", 61 | "from sklearn.model_selection import train_test_split, cross_validate\n", 62 | "from sklearn.metrics import classification_report" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "titanic = pd.read_csv(\"titanic.csv\")\n", 72 | "titanic.head()" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "Here we would model `Pclass`, `Sex`, `Siblings/Spouses Aboard`, and `Parents/Children Aboard` as discrete variables, while `Age` and `Fare` should be considered continuous. We will need to bin `Age` and `Fare` in order to be able to use `BernoulliNB`. Our work with decision trees may suggest what bins to use (we may also want to use cross-validation)." 
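The binning step described above can be sketched on a toy table. This is only an illustration under stated assumptions: the values are made up, the column names merely mirror the *Titanic* file, and the cut points (2 for age, 23.35 for fare) are the ones used in this notebook. It also shows one thing to watch for: `pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge is left unbinned unless `include_lowest=True` is passed.

```python
import pandas as pd

# Made-up values (an assumption); only the column names mirror the Titanic file
toy = pd.DataFrame({"Age": [0.5, 2.0, 30.0, 61.0],
                    "Fare": [0.0, 7.25, 23.35, 80.0]})

# Default intervals are (0, 23.35] and (23.35, 81], so a fare of exactly 0 gets no bin (NaN)
fare_default = pd.cut(toy["Fare"], bins=[0, 23.35, 81])

# include_lowest=True makes the first interval include its left edge
fare_lowest = pd.cut(toy["Fare"], bins=[0, 23.35, 81], include_lowest=True)

# A two-bin split is the same as a single comparison, giving one binary feature per variable
binary = pd.DataFrame({
    "Age_le_2": (toy["Age"] <= 2).astype(int),
    "Fare_le_23.35": (toy["Fare"] <= 23.35).astype(int),
})
```

With only two bins per variable, the comparison form is the simplest way to get features that take exactly two values, which is what a Bernoulli model expects.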
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "pd.cut(titanic.Age, bins=[-1, 2, titanic.Age.max() + 1]).head()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "scrolled": true 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "pd.cut(titanic.Fare, bins=[0, 23.35, titanic.Fare.max() + 1]).head()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "titanic = titanic.assign(Age_cat=(titanic.Age <= 2), Fare_cat=(titanic.Fare <= 23.35))\n", 109 | "titanic.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)\n", 110 | "titanic.drop(['Age', 'Fare', 'Name'], axis=1, inplace=True)\n", 111 | "titanic.head()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "titanic_train, titanic_test = train_test_split(titanic)\n", 121 | "titanic_train.head()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## Training a Naive Bayes Algorithm\n", 129 | "\n", 130 | "Let's now fit a Bernoulli naive Bayes algorithm." 
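Before handing the work to **scikit-learn**, the $\arg \max$ rule from the introduction can be sketched by hand for two-valued features. This is a minimal sketch under stated assumptions, not the library's implementation: the data are made up, the function names `fit_bernoulli_nb` and `predict_bernoulli_nb` are hypothetical, and I add additive smoothing (`alpha`) so that no estimated probability is exactly 0 or 1. Sums of logarithms replace the product to avoid numerical underflow.

```python
import numpy as np

# Made-up binary data (an assumption): rows are observations, columns are two-valued features
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(Y = y) and P(X_k = 1 | Y = y), with additive smoothing alpha."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # Smoothed fraction of ones in each column, computed separately for each class
    probs = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, priors, probs


def predict_bernoulli_nb(x, classes, priors, probs):
    """Return the class maximizing log P(Y = y) + sum_k log P(X_k = x_k | Y = y)."""
    log_likelihood = (x * np.log(probs) + (1 - x) * np.log(1 - probs)).sum(axis=1)
    return classes[np.argmax(np.log(priors) + log_likelihood)]


classes, priors, probs = fit_bernoulli_nb(X, y)
prediction = predict_bernoulli_nb(np.array([1, 0]), classes, priors, probs)
```

For the observation `[1, 0]` the two unnormalized posteriors are $0.5 \cdot 0.2 \cdot 0.6 = 0.06$ for class 0 and $0.5 \cdot 0.8 \cdot 0.6 = 0.24$ for class 1, so class 1 is predicted.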
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": { 137 | "collapsed": true 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "from sklearn.naive_bayes import BernoulliNB" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "bnb = BernoulliNB(alpha=0, # Additive smoothing parameter; setting to 0 for no smoothing\n", 151 | " fit_prior=False, # Don't learn class priors from the data; use a uniform prior for the label\n", 152 | " class_prior=None) # No user-specified prior probabilities for the classes\n", 153 | "bnb = bnb.fit(titanic_train.drop(\"Survived\", axis=1), titanic_train.Survived)\n", 154 | "print(classification_report(titanic_train.Survived, bnb.predict(titanic_train.drop(\"Survived\", axis=1))))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Now let's see the algorithm's predictive accuracy on the test set." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "survived_test_predict = bnb.predict(titanic_test.drop(\"Survived\", axis=1))\n", 171 | "print(classification_report(titanic_test.Survived, survived_test_predict))" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "In-sample and out-of-sample performance for this dataset are similar. That demonstrates one feature of linear models that data scientists like: linear models tend to generalize well.\n", 179 | "\n", 180 | "That said, let's see if we can get better performance." 
181 | ] 182 | } 183 | ], 184 | "metadata": { 185 | "kernelspec": { 186 | "display_name": "Python 3", 187 | "language": "python", 188 | "name": "python3" 189 | }, 190 | "language_info": { 191 | "codemirror_mode": { 192 | "name": "ipython", 193 | "version": 3 194 | }, 195 | "file_extension": ".py", 196 | "mimetype": "text/x-python", 197 | "name": "python", 198 | "nbconvert_exporter": "python", 199 | "pygments_lexer": "ipython3", 200 | "version": "3.6.5" 201 | } 202 | }, 203 | "nbformat": 4, 204 | "nbformat_minor": 2 205 | } 206 | -------------------------------------------------------------------------------- /Chapter03/RandomForests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Machine Learning Using Random Forests\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "A **random forest** is a collection of decision trees that each individually make a prediction for an observation. Each tree is formed from a random subset of the training set. The majority decision among the trees is then the predicted value of an observation. Random forests are an example of **ensemble methods**, where the predictions of individual classifiers are used for decision making.\n", 11 | "\n", 12 | "The **scikit-learn** class `RandomForestClassifier` can be used for training random forests. For random forests we may consider an additional hyperparameter to tree depth: the number of trees to train. Each tree should individually be shallow, and having more trees should lead to less overfitting.\n", 13 | "\n", 14 | "We will still be using the *Titanic* dataset." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd\n", 26 | "from pandas import DataFrame\n", 27 | "from sklearn.model_selection import train_test_split, cross_validate\n", 28 | "from sklearn.metrics import classification_report\n", 29 | "from random import seed # Seeds Python's built-in random module" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "seed(110717) # Affects Python's random module only, not scikit-learn\n", 41 | "titanic = pd.read_csv(\"titanic.csv\")\n", 42 | "titanic_train, titanic_test = train_test_split(titanic, random_state=110717) # random_state makes the split reproducible" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Growing a Random Forest\n", 50 | "\n", 51 | "Let's generate a random forest where I cap the depth for each tree at $m = 5$ and grow 10 trees." 
52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "collapsed": true 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "from sklearn.ensemble import RandomForestClassifier" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "forest1 = RandomForestClassifier(n_estimators=10, # Number of trees to grow\n", 72 | " max_depth=5) # Maximum depth of a tree\n", 73 | "forest1.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 74 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 75 | " y=titanic_train.Survived)\n", 76 | "\n", 77 | "# Example prediction\n", 78 | "forest1.predict([[2, 0, 26, 0, 0, 30]])" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "pred1 = forest1.predict(titanic_train.replace({'Sex': {'male': 0, 'female': 1}}\n", 88 | " ).drop([\"Survived\", \"Name\"], axis=1))\n", 89 | "print(classification_report(titanic_train.Survived, pred1))" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "The random forest does not perform as well on the training data as a full-grown decision tree, but such a tree overfit. The random forest, in comparison, seems to do as well as a better decision tree so far.\n", 97 | "\n", 98 | "## Optimizing Multiple Hyperparameters\n", 99 | "\n", 100 | "We now have two hyperparameters to optimize: tree depth and the number of trees to grow. We have a few ways to proceed:\n", 101 | "\n", 102 | "1. We could use cross-validation to see which combination of hyperparameters performs the best. Beware that there could be many combinations to check!\n", 103 | "2. We could use cross-validation to optimize one hyperparameter first, then the next, and so on. 
While not necessarily producing a globally optimal solution this is less work and likely yields a \"good enough\" result.\n", 104 | "3. We could randomly pick combinations of hyperparameters and use the results to guess a good combination. This is like 1 but less work.\n", 105 | "\n", 106 | "Here I will go with option 2. I will optimize the number of trees to use first, then maximum tree depth." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "n_candidate = [10, 20, 30, 40, 60, 80, 100] # Candidate forest sizes\n", 116 | "res1 = dict()\n", 117 | "\n", 118 | "for n in n_candidate:\n", 119 | " pred3 = RandomForestClassifier(n_estimators=n, max_depth=5)\n", 120 | " res1[n] = cross_validate(pred3,\n", 121 | " X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 122 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 123 | " y=titanic_train.Survived,\n", 124 | " cv=10,\n", 125 | " return_train_score=False,\n", 126 | " scoring='accuracy')\n", 127 | "\n", 128 | "res1df = DataFrame({(i, j): res1[i][j]\n", 129 | " for i in res1.keys()\n", 130 | " for j in res1[i].keys()}).T\n", 131 | "\n", 132 | "res1df.loc[(slice(None), 'test_score'), :]" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "res1df.loc[(slice(None), 'test_score'), :].mean(axis=1)" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "$n = 100$ seems to do well. Now let's pick optimal tree depth." 
149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": { 155 | "collapsed": true 156 | }, 157 | "outputs": [], 158 | "source": [ 159 | "m_candidate = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # Candidate depths" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "res2 = dict()\n", 169 | "\n", 170 | "for m in m_candidate:\n", 171 | " pred3 = RandomForestClassifier(max_depth=m, n_estimators=40)\n", 172 | " res2[m] = cross_validate(pred3,\n", 173 | " X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 174 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 175 | " y=titanic_train.Survived,\n", 176 | " cv=10,\n", 177 | " return_train_score=False,\n", 178 | " scoring='accuracy')\n", 179 | "\n", 180 | "res2df = DataFrame({(i, j): res2[i][j]\n", 181 | " for i in res2.keys()\n", 182 | " for j in res2[i].keys()}).T\n", 183 | "\n", 184 | "res2df.loc[(slice(None), 'test_score'), :]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "res2df.loc[(slice(None), 'test_score'), :].mean(axis=1)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "A maximum tree depth of $m = 7$ seems to work well. A way to try and combat the path-dependence of this approach would be to repeat the search for optimal forest size but with the new tree depth and so on, but I will not do so here.\n", 201 | "\n", 202 | "Let's now see how the new random forest performs on the test set." 
203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": { 209 | "collapsed": true 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "forest2 = RandomForestClassifier(max_depth=9, n_estimators=40)\n", 214 | "forest2.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}} # Replace strings with numbers\n", 215 | " ).drop([\"Survived\", \"Name\"], axis=1),\n", 216 | " y=titanic_train.Survived)\n", 217 | "\n", 218 | "survived_test_predict = forest2.predict(X=titanic_test.replace(\n", 219 | " {'Sex': {'male': 0, 'female': 1}}\n", 220 | ").drop([\"Survived\", \"Name\"], axis=1))" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "print(classification_report(titanic_test.Survived, survived_test_predict))" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "The random forest does reasonably well, though it does not appear to be much of an improvement over the decision tree. Given the complexity of the random forest, a simple decision tree would be preferred." 
237 | ] 238 | } 239 | ], 240 | "metadata": { 241 | "kernelspec": { 242 | "display_name": "Python 3", 243 | "language": "python", 244 | "name": "python3" 245 | }, 246 | "language_info": { 247 | "codemirror_mode": { 248 | "name": "ipython", 249 | "version": 3 250 | }, 251 | "file_extension": ".py", 252 | "mimetype": "text/x-python", 253 | "name": "python", 254 | "nbconvert_exporter": "python", 255 | "pygments_lexer": "ipython3", 256 | "version": "3.6.5" 257 | } 258 | }, 259 | "nbformat": 4, 260 | "nbformat_minor": 2 261 | } 262 | -------------------------------------------------------------------------------- /Chapter03/SVM.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Working With Support Vector Machines (SVM) for Classification and Detection\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "**Support vector machines (SVMs)** are linear classifiers. An SVM can be understood as a hyperplane (think: line) such that one side of the plane contains (ideally) only data belonging to one class, while all instances of the other class lie on the other side. Prediction amounts to determining on which side of the line a data point lies.\n", 11 | "\n", 12 | "Training an SVM involves finding a hyperplane that best separates data from different classes and trying to maximize the margin between the plane and the nearest data points from two different classes. By doing this SVMs tend to generalize well from training data to future data; they are not known to be prone to overfitting.\n", 13 | "\n", 14 | "A hyperparameter common to all SVMs is a parameter $C$ known as the error penalty parameter. Smaller $C$ tends to combat overfitting.\n", 15 | "\n", 16 | "SVMs can be trained using the `SVC` object provided in **scikit-learn**.\n", 17 | "\n", 18 | "## Kernel Methods\n", 19 | "\n", 20 | "The linearity assumption seems restrictive. 
Analysts overcome it by choosing a **kernel**, a mathematical function that alters the feature space a SVM is trained on. Choosing different kernels allows the boundary between classes to take different shapes, which may lead to better predictions.\n", 21 | "\n", 22 | "Again, we can consider choice of kernel as a hyperparameter to optimize. However, we may also use **domain knowledge** (our understanding of the phenomenon being learned) to pick a kernel.\n", 23 | "\n", 24 | "Let's load in the *Titanic* dataset again." 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import pandas as pd\n", 36 | "from pandas import DataFrame\n", 37 | "from sklearn.model_selection import train_test_split, cross_validate\n", 38 | "from sklearn.metrics import classification_report\n", 39 | "from random import seed" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "seed(110717)\n", 49 | "\n", 50 | "titanic = pd.read_csv(\"titanic.csv\")\n", 51 | "titanic.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)\n", 52 | "titanic.drop(\"Name\", axis=1, inplace=True)\n", 53 | "titanic.head()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "Here we would be wise to handle passenger class with more care. While written as numbers this is actually a categorical or ordinal variable; the actual numbers don't matter. We should have binary variables, one for each class." 
61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "pd.get_dummies(titanic.Pclass).head()" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "titanic = titanic.join(pd.get_dummies(titanic.Pclass, prefix='Pclass')).drop(\"Pclass\", axis=1)\n", 79 | "titanic.head()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "titanic_train, titanic_test = train_test_split(titanic, random_state=110717) # random_state makes the split reproducible\n", 89 | "titanic_train.head()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Training an SVM\n", 97 | "\n", 98 | "We can train an SVM like so:" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": true 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "from sklearn.svm import SVC" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "svm1 = SVC(C=1.0, # Penalty parameter C\n", 119 | " kernel='linear') # Using a linear kernel\n", 120 | "svm1.fit(X=titanic_train.drop(\"Survived\", axis=1), y=titanic_train.Survived)\n", 121 | "\n", 122 | "svm1.predict([[0, 26, 0, 0, 30, 0, 1, 0]]) # Predicting whether a 26-year-old male without family aboard in second\n", 123 | " # class who paid $30 fare would survive" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "Choosing the kernel and $C$ could be done with cross-validation, but I will not demonstrate this (it would take too long for this video)." 
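For readers who want to try it, a cross-validated search over the kernel and $C$ might look like the sketch below. It is not part of the original analysis: the data are a small synthetic stand-in generated with `make_classification` so the search finishes quickly, and the candidate kernels and values of $C$ are hypothetical choices. `GridSearchCV` fits one model per combination and fold, then reports the combination with the best mean score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption), kept small so the search is fast
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical candidate values; a real search would choose these with the problem in mind
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_  # Combination with the highest mean cross-validated accuracy
best_score = search.best_score_    # That mean accuracy
```

On the *Titanic* data the same call would take the training features and `Survived` column in place of `X` and `y`; expect it to run noticeably longer, since the cost grows with the number of combinations times the number of folds.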
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "print(classification_report(titanic_train.Survived, svm1.predict(titanic_train.drop(\"Survived\", axis=1))))" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "The SVM does reasonably well on the training data. Let's see how it does on the test data." 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "survived_test_predict = svm1.predict(titanic_test.drop(\"Survived\", axis=1))\n", 156 | "print(classification_report(titanic_test.Survived, survived_test_predict))" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Performance on test data is slightly worse." 164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 3", 170 | "language": "python", 171 | "name": "python3" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 3 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython3", 183 | "version": "3.6.5" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 2 188 | } 189 | -------------------------------------------------------------------------------- /Chapter03/knn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Classifying Data in Python using the $k$-Nearest Neighbors (KNN) Algorithm\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "In this notebook I will demonstrate training and using **$k$-nearest neighbors (KNN)** algorithms with **sklearn**.\n", 11 | "\n", 12 | "We 
will be using the iris dataset, which I load below." 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "from sklearn.datasets import load_iris\n", 22 | "from sklearn.model_selection import train_test_split, cross_validate\n", 23 | "from sklearn.metrics import classification_report" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [ 34 | "iris_obj = load_iris()\n", 35 | "\n", 36 | "flower, species = iris_obj.data, iris_obj.target" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "flower_train, flower_test, species_train, species_test = train_test_split(flower, species, test_size = 0.1)\n", 46 | "flower_train[:5]" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "species_train[:5]" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Creating a Classifier\n", 63 | "\n", 64 | "The `KNeighborsClassifier` allows for fitting and predicting using the KNN algorithm. Recall that with KNN, training a model means saving the training data, and predicting is done by picking the most common label among the $k$ nearest neighbors of a point.\n", 65 | "\n", 66 | "Besides choice of variables, there are two hyperparameters that need to be picked to use KNN: the number of neighbors $k$ used for prediction and the choice of metric for defining distance. Here I will use Euclidean distance, and I start by picking $k = 1$." 
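To make the prediction rule concrete, here is a minimal from-scratch sketch of KNN with Euclidean distance. It is an illustration only, not how **scikit-learn** implements it: the function name `knn_predict` is hypothetical and the six training points are made up. "Training" is nothing more than keeping `train_X` and `train_y` around.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, point, k=1):
    """Predict the label of `point` by majority vote among its k nearest training points."""
    distances = np.sqrt(((train_X - point) ** 2).sum(axis=1))  # Euclidean distance to every row
    nearest = np.argsort(distances)[:k]                        # Indices of the k closest rows
    votes = Counter(train_y[nearest].tolist())                 # Count labels among the neighbors
    return votes.most_common(1)[0][0]


# Made-up training data (an assumption): two well-separated groups
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(train_X, train_y, np.array([0.5, 0.5]), k=3)
label_far = knn_predict(train_X, train_y, np.array([5.5, 5.5]), k=3)
```

With $k = 1$ every training point is its own nearest neighbor (at distance zero), which is why a 1-nearest-neighbor classifier reproduces the training labels exactly unless two identical points carry different labels.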
67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "from sklearn.neighbors import KNeighborsClassifier\n", 78 | "import numpy as np" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "knn1 = KNeighborsClassifier(n_neighbors=1) # Setting the k parameter\n", 88 | "knn1.fit(flower_train, species_train) # Fitting the model\n", 89 | "knn1.predict(np.array([[7, 3, 5, 2]])) # A test prediction" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "pred1 = knn1.predict(flower_train)\n", 99 | "pred1" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "print(classification_report(species_train, pred1))" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "*Of course* the model does perfectly on the training data! (How can it not?)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Choosing $k$\n", 123 | "\n", 124 | "Let's perform cross-validation to see what $k$ seems to lead to the best predictive accuracy, along with getting a sense of what level of accuracy in prediction we can hope to see." 
125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": true 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "import pandas as pd\n", 136 | "from pandas import DataFrame" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "k_candidate = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 146 | "res = dict()\n", 147 | "\n", 148 | "for k in k_candidate:\n", 149 | " pred2 = KNeighborsClassifier(n_neighbors=k)\n", 150 | " res[k] = cross_validate(estimator=pred2, # The predictor\n", 151 | " X=flower_train, # Features array\n", 152 | " y=species_train, # Target array\n", 153 | " cv=10, # Number of folds (but other meanings exist)\n", 154 | " return_train_score=False, # Don't return training scores\n", 155 | " scoring='accuracy') # What scores to return (other meanings exist)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "resdf = DataFrame({(i, j): res[i][j]\n", 165 | " for i in res.keys()\n", 166 | " for j in res[i].keys()}).T\n", 167 | "resdf" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "resdf.loc[(slice(None), 'test_score'), :]" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "resdf.loc[(slice(None), 'test_score'), :].mean(axis=1)" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "It seems that the best accuracy is attained when $k = 8$. Let's see how our classifier does on the test set." 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "pred3 = KNeighborsClassifier(n_neighbors=8)\n", 202 | "pred3.fit(flower_train, species_train)\n", 203 | "species_test_predict = pred3.predict(flower_test)\n", 204 | "print(classification_report(species_test, species_test_predict))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Our KNN classifier does well predicting the setosa species, and the worst behavior is for the virginica species.\n", 212 | "\n", 213 | "Considering the graphic below, where species correctly predicted are shown in blue and those incorrectly predicted in red (with shape corresponding to species), we can see this result should be expected; setosa flowers are easily identified while versicolor and virginica would be more difficult to predict." 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": true 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "import matplotlib.pyplot as plt\n", 225 | "%matplotlib inline" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "marker_map = {0: 'o', 1: 's', 2: '^'}\n", 235 | "var1, var2 = 0, 1 # Sepal length and sepal width variables\n", 236 | "for length, width, species in zip(flower_train[:, var1], flower_train[:, var2], species_train[:]):\n", 237 | " plt.scatter(x=length, y=width, marker=marker_map[species], c=\"black\")\n", 238 | "# Plot correct prediction\n", 239 | "correct = (species_test == species_test_predict)\n", 240 | "for length, width, species in zip(flower_test[correct, var1], flower_test[correct, var2], species_test[correct]):\n", 241 | " plt.scatter(x=length, y=width, marker=marker_map[species], c=\"blue\")\n", 242 | "for length, width, species in 
zip(flower_test[np.logical_not(correct), var1],\n", 243 | " flower_test[np.logical_not(correct), var2],\n", 244 | " species_test[np.logical_not(correct)]):\n", 245 | " plt.scatter(x=length, y=width, marker=marker_map[species], c=\"red\")\n", 246 | "plt.xlabel(iris_obj.feature_names[var1])\n", 247 | "plt.ylabel(iris_obj.feature_names[var2])\n", 248 | "plt.show()" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "marker_map = {0: 'o', 1: 's', 2: '^'}\n", 258 | "var1, var2 = 2, 3 # Petal length and petal width variables\n", 259 | "for length, width, species in zip(flower_train[:, var1], flower_train[:, var2], species_train[:]):\n", 260 | " plt.scatter(x=length, y=width, marker=marker_map[species], c=\"black\")\n", 261 | "# Plot correct prediction\n", 262 | "correct = (species_test == species_test_predict)\n", 263 | "for length, width, species in zip(flower_test[correct, var1], flower_test[correct, var2], species_test[correct]):\n", 264 | " plt.scatter(x=length, y=width, marker=marker_map[species], c=\"blue\")\n", 265 | "for length, width, species in zip(flower_test[np.logical_not(correct), var1],\n", 266 | " flower_test[np.logical_not(correct), var2],\n", 267 | " species_test[np.logical_not(correct)]):\n", 268 | " plt.scatter(x=length, y=width, marker=marker_map[species], c=\"red\")\n", 269 | "plt.xlabel(iris_obj.feature_names[var1])\n", 270 | "plt.ylabel(iris_obj.feature_names[var2])\n", 271 | "plt.show()" 272 | ] 273 | } 274 | ], 275 | "metadata": { 276 | "kernelspec": { 277 | "display_name": "Python 3", 278 | "language": "python", 279 | "name": "python3" 280 | }, 281 | "language_info": { 282 | "codemirror_mode": { 283 | "name": "ipython", 284 | "version": 3 285 | }, 286 | "file_extension": ".py", 287 | "mimetype": "text/x-python", 288 | "name": "python", 289 | "nbconvert_exporter": "python", 290 | "pygments_lexer": "ipython3", 291 | "version": "3.6.2" 292 | } 293 | 
}, 294 | "nbformat": 4, 295 | "nbformat_minor": 2 296 | } 297 | -------------------------------------------------------------------------------- /Chapter04/.ipynb_checkpoints/EvaluatingLinearModel-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Evaluating a Linear Model\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "MSE is a useful metric for seeing the performance of a model, but other metrics can help us decide which model to use.\n", 11 | "\n", 12 | "Here we will use **statsmodels** for fitting linear models since the package easily computes the metrics I want to see.\n", 13 | "\n", 14 | "Let's load in the Boston housing dataset." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from sklearn.datasets import load_boston\n", 24 | "from sklearn.model_selection import train_test_split" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "boston_obj = load_boston()\n", 36 | "data_train, data_test, price_train, price_test = train_test_split(boston_obj.data, boston_obj.target)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Fitting a Linear Model with OLS in **statsmodels**\n", 44 | "\n", 45 | "The `OLS` object in **statsmodels** handles fitting models with OLS. Below I show how to fit the same model for Boston home prices I fitted using **scikit-learn**." 
46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import statsmodels.api as sm\n", 55 | "import numpy as np" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "data_train, data_test = sm.add_constant(data_train), sm.add_constant(data_test) # Necessary to add the intercept\n", 65 | "data_train[:5, :]" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "data_train[:5, 0]" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "ols1 = sm.OLS(price_train, data_train) # Target, features\n", 84 | "model1 = ols1.fit()\n", 85 | "model1.params # The parameters of the regression" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "model1.predict([[ # An example prediction\n", 95 | " 1, # Intercept term; always 1\n", 96 | " 10, # Per capita crime rate\n", 97 | " 25, # Proportion of land zoned for large homes\n", 98 | " 5, # Proportion of land zoned for non-retail business\n", 99 | " 1, # Tract bounds the Charles River\n", 100 | " 0.3, # NOX concentration\n", 101 | " 10, # Average number of rooms per dwelling\n", 102 | " 2, # Proportion of owner-occupied units built prior to 1940\n", 103 | " 10, # Weighted distance to employment centers\n", 104 | " 3, # Index for highway accessibility\n", 105 | " 400, # Tax rate\n", 106 | " 15, # Pupil/teacher ratio\n", 107 | " 200, # Index for number of blacks\n", 108 | " 5 # % lower status of population\n", 109 | "]])" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "We can get a summary of the model easily in **statsmodels**." 
117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "print(model1.summary())" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "Let's parse these results:\n", 133 | "\n", 134 | "* `R-squared` (mathematically, $R^2$) describes how much variation in the target variable the model is able to \"explain.\" (Think: predict.) Practitioners often prefer `Adj. R-squared` ($\\bar{R}^2$) since this takes into account how many variables are used. (It is impossible for $R^2$ to go down when adding variables, even if those variables only contribute noise; $\\bar{R}^2$ doesn't have this property.) Here $\\bar{R}^2$ is somewhat high, suggesting that the model does a reasonable job at predicting home prices.\n", 135 | "* `F-statistic` is the test statistic for a test of whether any of the coefficients in the model (other than the intercept) are non-zero. `Prob (F-statistic)` is the corresponding $p$-value. Here the model clearly has a non-zero coefficient, though the statistic does not say which.\n", 136 | "* `AIC` is the **Akaike information criterion (AIC)**, and `BIC` the **Bayesian information criterion (BIC)**. These are measures of the quality of statistical models and provide a means to decide between models. Models that lead to smaller AIC and BIC are seen as better.\n", 137 | "* The table contains the coefficients of the statistical model and the results of $t$-tests of whether each coefficient is zero, in addition to confidence intervals for the coefficient values. We might consider removing features whose coefficients are not statistically different from zero (but we should also refer to the AIC and BIC statistics when making decisions between models).\n", 138 | "\n", 139 | "## Using AIC to Pick Models\n", 140 | "\n", 141 | "Let's see how we can use the AIC to decide between different models. (The BIC can be used similarly.)
Notice that in our table the third and seventh features don't have coefficients statistically different from zero (these correspond to proportion of non-retail business acres per town and proportion of owner-occupied units built prior to 1940). Does a model without these features do better according to the AIC?" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "ols2 = sm.OLS(price_train, np.delete(data_train, [3, 7], axis=1))\n", 151 | "model2 = ols2.fit()\n", 152 | "print(model2.summary())" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "This model has a smaller AIC.\n", 160 | "\n", 161 | "The different AIC values can be interpreted this way: If model 1 has $\\text{AIC}_1$ and model 2 $\\text{AIC}_2$ and $\\text{AIC}_2 < \\text{AIC}_1$, then model 1 is $\\exp((\\text{AIC}_2 - \\text{AIC}_1)/2)$ times as likely as model 2 to minimize information loss." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "np.exp((model2.aic - model1.aic)/2)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "The inverse of that quantity can be interpreted as how many times as likely model 2 is as model 1 to minimize the information loss." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "np.exp((model1.aic - model2.aic)/2)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "So we can see that model 2 should be preferred to model 1.\n", 194 | "\n", 195 | "Let's see how model 2 would have done on the test set, evaluating its performance with the MSE."
196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "from sklearn.metrics import mean_squared_error" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "price_train_pred = model2.predict(np.delete(data_train, [3, 7], axis=1))\n", 216 | "mean_squared_error(price_train, price_train_pred) # Performance on the training set" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "price_test_pred = model2.predict(np.delete(data_test, [3, 7], axis=1))\n", 226 | "mean_squared_error(price_test, price_test_pred) # Performance on the test set" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "In comparison to model 1..." 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "price_test_pred_mod1 = model1.predict(data_test)\n", 243 | "mean_squared_error(price_test, price_test_pred_mod1)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "Model 2 did better."
251 | ] 252 | } 253 | ], 254 | "metadata": { 255 | "kernelspec": { 256 | "display_name": "Python 3", 257 | "language": "python", 258 | "name": "python3" 259 | }, 260 | "language_info": { 261 | "codemirror_mode": { 262 | "name": "ipython", 263 | "version": 3 264 | }, 265 | "file_extension": ".py", 266 | "mimetype": "text/x-python", 267 | "name": "python", 268 | "nbconvert_exporter": "python", 269 | "pygments_lexer": "ipython3", 270 | "version": "3.6.5" 271 | } 272 | }, 273 | "nbformat": 4, 274 | "nbformat_minor": 2 275 | } 276 | -------------------------------------------------------------------------------- /Chapter04/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /Chapter04/OLS.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Linear Models and OLS\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "**Regression** refers to the prediction of a continuous variable (income, age, height, etc.) using a dataset's features. A **linear model** is a model of the form:\n", 11 | "\n", 12 | "$$y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + ... + \\beta_K x_K + \\epsilon$$\n", 13 | "\n", 14 | "Here $\\epsilon$ is an **error term**; the predicted value for $y$ is given by $\\hat{y} = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + ... + \\beta_K x_K$, so $y - \\hat{y} = \\epsilon$.\n", 15 | "\n", 16 | "$\\epsilon$ is almost never zero, so for regression we must measure \"accuracy\" differently. The **sum of squared errors (SSE)** is the sum $\\sum_{i = 1}^n (y_i - \\hat{y}_i)^2$ (letting $y_i = \\beta_0 + \\beta_1 x_{1,i} + \\beta_2 x_{2,i} + ... 
+ \\beta_K x_{K,i} + \\epsilon_i$ and $\\hat{y}_i$ defined analogously). We might define the \"most accurate\" regression model as the model that minimizes the SSE. However, when measuring performance, the **mean squared error (MSE)** is often used. The MSE is given by $\\frac{\\text{SSE}}{n} = \\frac{1}{n}\\sum_{i = 1}^{n} (y_i - \\hat{y}_i)^2$.\n", 17 | "\n", 18 | "**Ordinary least squares (OLS)** is a procedure for finding a linear model that minimizes the SSE on a dataset. This is the simplest procedure for fitting a linear model on a dataset. To evaluate the model's performance we may split a dataset into training and test set, and evaluate the trained model's performance by computing the MSE of the model's predictions on the test set. If the model has a high MSE on both the training and test set, it's underfitting. If it has a small MSE on the training set and a high MSE on the test set, it is overfitting.\n", 19 | "\n", 20 | "With OLS the most important decision is which features to use in prediction and how to use them. \"Linear\" means linear in coefficients only; these models can handle many kinds of functions. (The models $\\hat{y} = \\beta_0 + \\beta_1 x + \\beta_2 x^2$ and $\\hat{y} = \\beta_0 + \\beta_1 \\log(x)$ are linear, but $\\hat{y} = \\frac{\\beta_0}{1 + \\beta_1 x}$ is not.) Many approaches exist for deciding which features to include. For now we will only use cross-validation.\n", 21 | "\n", 22 | "## Fitting a Linear Model with OLS\n", 23 | "\n", 24 | "OLS is supported by the `LinearRegression` object in **scikit-learn**, while the function `mean_squared_error()` computes the MSE.\n", 25 | "\n", 26 | "I will be using OLS to find a linear model for predicting home prices in the Boston house price dataset, created below." 
27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 1, 32 | "metadata": {}, 33 | "outputs": [], 43 | "source": [ 44 | "from sklearn.datasets import load_boston\n", 45 | "from sklearn.model_selection import train_test_split" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 2, 51 | "metadata": {}, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/plain": [ 56 | "array([[6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,\n", 57 | " 6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,\n", 58 | " 1.5300e+01, 3.9690e+02, 4.9800e+00],\n", 59 | " [2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,\n", 60 | " 6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,\n", 61 | " 1.7800e+01, 3.9690e+02, 9.1400e+00],\n", 62 | " [2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,\n", 63 | " 7.1850e+00, 6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,\n", 64 | " 1.7800e+01, 3.9283e+02, 4.0300e+00],\n", 65 | " [3.2370e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,\n", 66 | " 6.9980e+00, 4.5800e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,\n", 67 | " 1.8700e+01, 3.9463e+02, 2.9400e+00],\n", 68 | " [6.9050e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,\n", 69 | " 7.1470e+00, 5.4200e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,\n", 70 | " 1.8700e+01, 3.9690e+02, 5.3300e+00]])" 71 | ] 72 | }, 73 | "execution_count": 2, 74 | "metadata": {}, 75 | "output_type":
"execute_result" 76 | } 77 | ], 78 | "source": [ 79 | "boston_obj = load_boston()\n", 80 | "data, price = boston_obj.data, boston_obj.target\n", 81 | "data[:5, :]" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "array([24. , 21.6, 34.7, 33.4, 36.2])" 93 | ] 94 | }, 95 | "execution_count": 3, 96 | "metadata": {}, 97 | "output_type": "execute_result" 98 | } 99 | ], 100 | "source": [ 101 | "price[:5]" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "array([[5.60200e-02, 0.00000e+00, 2.46000e+00, 0.00000e+00, 4.88000e-01,\n", 113 | " 7.83100e+00, 5.36000e+01, 3.19920e+00, 3.00000e+00, 1.93000e+02,\n", 114 | " 1.78000e+01, 3.92630e+02, 4.45000e+00],\n", 115 | " [8.30800e-02, 0.00000e+00, 2.46000e+00, 0.00000e+00, 4.88000e-01,\n", 116 | " 5.60400e+00, 8.98000e+01, 2.98790e+00, 3.00000e+00, 1.93000e+02,\n", 117 | " 1.78000e+01, 3.91000e+02, 1.39800e+01],\n", 118 | " [8.71675e+00, 0.00000e+00, 1.81000e+01, 0.00000e+00, 6.93000e-01,\n", 119 | " 6.47100e+00, 9.88000e+01, 1.72570e+00, 2.40000e+01, 6.66000e+02,\n", 120 | " 2.02000e+01, 3.91980e+02, 1.71200e+01],\n", 121 | " [8.79212e+00, 0.00000e+00, 1.81000e+01, 0.00000e+00, 5.84000e-01,\n", 122 | " 5.56500e+00, 7.06000e+01, 2.06350e+00, 2.40000e+01, 6.66000e+02,\n", 123 | " 2.02000e+01, 3.65000e+00, 1.71600e+01],\n", 124 | " [7.84200e-01, 0.00000e+00, 8.14000e+00, 0.00000e+00, 5.38000e-01,\n", 125 | " 5.99000e+00, 8.17000e+01, 4.25790e+00, 4.00000e+00, 3.07000e+02,\n", 126 | " 2.10000e+01, 3.86750e+02, 1.46700e+01]])" 127 | ] 128 | }, 129 | "execution_count": 4, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "data_train, data_test, price_train, price_test = train_test_split(data, price)\n", 136 | "data_train[:5, :]" 137 | ] 138 | }, 139 
| { 140 | "cell_type": "code", 141 | "execution_count": 5, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "array([50. , 26.4, 13.1, 11.7, 17.5])" 148 | ] 149 | }, 150 | "execution_count": 5, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "price_train[:5]" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "We will go ahead and use all features for prediction in our first linear model. (In general this does *not* necessarily produce better models; some features may introduce only noise that makes prediction *more* difficult, not less.)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 6, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "from sklearn.linear_model import LinearRegression\n", 173 | "from sklearn.metrics import mean_squared_error\n", 174 | "import numpy as np" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 7, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/plain": [ 185 | "array([40.59396201])" 186 | ] 187 | }, 188 | "execution_count": 7, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "ols1 = LinearRegression()\n", 195 | "ols1.fit(data_train, price_train) # Fitting a linear model\n", 196 | "ols1.predict([[ # An example prediction\n", 197 | " 1, # Per capita crime rate\n", 198 | " 25, # Proportion of land zoned for large homes\n", 199 | " 5, # Proportion of land zoned for non-retail business\n", 200 | " 1, # Tract bounds the Charles River\n", 201 | " 0.3, # NOX concentration\n", 202 | " 10, # Average number of rooms per dwelling\n", 203 | " 2, # Proportion of owner-occupied units built prior to 1940\n", 204 | " 10, # Weighted distance to employment centers\n", 205 | " 3, # Index for highway accessibility\n", 206 | " 400, # Tax rate\n", 207 | " 15, # 
Pupil/teacher ratio\n", 208 | " 200, # Index for number of blacks\n", 209 | " 5 # % lower status of population\n", 210 | "]])" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 8, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "data": { 220 | "text/plain": [ 221 | "array([36.60268045, 22.74630558, 20.19610388, 14.04474667, 17.14567269])" 222 | ] 223 | }, 224 | "execution_count": 8, 225 | "metadata": {}, 226 | "output_type": "execute_result" 227 | } 228 | ], 229 | "source": [ 230 | "predprice = ols1.predict(data_train)\n", 231 | "predprice[:5]" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 9, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "data": { 241 | "text/plain": [ 242 | "19.24697544587027" 243 | ] 244 | }, 245 | "execution_count": 9, 246 | "metadata": {}, 247 | "output_type": "execute_result" 248 | } 249 | ], 250 | "source": [ 251 | "mean_squared_error(price_train, predprice)" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 10, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "data": { 261 | "text/plain": [ 262 | "4.387137500223838" 263 | ] 264 | }, 265 | "execution_count": 10, 266 | "metadata": {}, 267 | "output_type": "execute_result" 268 | } 269 | ], 270 | "source": [ 271 | "np.sqrt(mean_squared_error(price_train, predprice))" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "The square root of the mean squared error can be interpreted as the average amount of error; in this case, the average difference between homes' actual and predicted prices. (This is almost the standard deviation of the error.)\n", 279 | "\n", 280 | "For cross-validation, I will use `cross_val_score()`, which performs the entire cross-validation process." 
281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 11, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "from sklearn.model_selection import cross_val_score" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 12, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "text/plain": [ 300 | "-21.336072658144275" 301 | ] 302 | }, 303 | "execution_count": 12, 304 | "metadata": {}, 305 | "output_type": "execute_result" 306 | } 307 | ], 308 | "source": [ 309 | "ols2 = LinearRegression()\n", 310 | "ols_cv_mse = cross_val_score(ols2, data_train, price_train, scoring='neg_mean_squared_error', cv=10)\n", 311 | "ols_cv_mse.mean()" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "The above number is the negative average MSE for cross-validation (minimizing MSE is equivalent to maximizing the negative MSE). This is close to our in-sample MSE.\n", 319 | "\n", 320 | "Let's now see the MSE for the fitted model on the test set." 
321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 13, 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "data": { 330 | "text/plain": [ 331 | "30.480706235794237" 332 | ] 333 | }, 334 | "execution_count": 13, 335 | "metadata": {}, 336 | "output_type": "execute_result" 337 | } 338 | ], 339 | "source": [ 340 | "testpredprice = ols1.predict(data_test)\n", 341 | "mean_squared_error(price_test, testpredprice)" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 14, 347 | "metadata": {}, 348 | "outputs": [ 349 | { 350 | "data": { 351 | "text/plain": [ 352 | "5.520933456925037" 353 | ] 354 | }, 355 | "execution_count": 14, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "np.sqrt(mean_squared_error(price_test, testpredprice))" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "The test MSE (about 30.5) is higher than both the training MSE (about 19.2) and the cross-validated MSE (about 21.3), so the model overfits somewhat, although the gap in root-MSE terms (about 5.5 versus 4.4) is modest." 369 | ] 370 | } 371 | ], 372 | "metadata": { 373 | "kernelspec": { 374 | "display_name": "Python 3", 375 | "language": "python", 376 | "name": "python3" 377 | }, 378 | "language_info": { 379 | "codemirror_mode": { 380 | "name": "ipython", 381 | "version": 3 382 | }, 383 | "file_extension": ".py", 384 | "mimetype": "text/x-python", 385 | "name": "python", 386 | "nbconvert_exporter": "python", 387 | "pygments_lexer": "ipython3", 388 | "version": "3.6.5" 389 | } 390 | }, 391 | "nbformat": 4, 392 | "nbformat_minor": 2 393 | } 394 | -------------------------------------------------------------------------------- /Chapter04/USCapitol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Training-Systems-Using-Python-Statistical-Modeling/5eb619df9648570e83e910f781aa97fd19f17403/Chapter04/USCapitol.png -------------------------------------------------------------------------------- /Chapter04/Untitled.ipynb:
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /Chapter04/mystery_function.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Training-Systems-Using-Python-Statistical-Modeling/5eb619df9648570e83e910f781aa97fd19f17403/Chapter04/mystery_function.npy -------------------------------------------------------------------------------- /Chapter04/mystery_function_2.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Training-Systems-Using-Python-Statistical-Modeling/5eb619df9648570e83e910f781aa97fd19f17403/Chapter04/mystery_function_2.npy -------------------------------------------------------------------------------- /Chapter05/.ipynb_checkpoints/RegressionNeuralNetworks-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Regression with Neural Networks\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "Neural networks (MLPs in particular) can also be used for nonlinear regression. This is implemented via the `MLPRegressor` object in **scikit-learn**. In this video I demonstrate neural network regression on the Boston housing price dataset." 
11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from sklearn.datasets import load_boston\n", 20 | "from sklearn.neural_network import MLPRegressor\n", 21 | "from sklearn.model_selection import train_test_split\n", 22 | "from sklearn.metrics import mean_squared_error" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "boston_obj = load_boston()\n", 32 | "data_train, data_test, price_train, price_test = train_test_split(boston_obj.data, boston_obj.target)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "For the neural network I use a MLP network with:\n", 40 | "\n", 41 | "* Three layers, each with 100 neurons\n", 42 | "* A regularization parameter $\\alpha = 10$\n", 43 | "* The hyperbolic tangent function for activation\n", 44 | "\n", 45 | "Here are the results." 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "mlp = MLPRegressor(hidden_layer_sizes=(100,100,100), activation='tanh', alpha=10, max_iter=1000)\n", 55 | "mlp = mlp.fit(data_train, price_train)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 4, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/plain": [ 66 | "array([19.11119061])" 67 | ] 68 | }, 69 | "execution_count": 4, 70 | "metadata": {}, 71 | "output_type": "execute_result" 72 | } 73 | ], 74 | "source": [ 75 | "mlp.predict(data_train[[0], :])" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 5, 81 | "metadata": {}, 82 | "outputs": [ 83 | { 84 | "data": { 85 | "text/plain": [ 86 | "30.227707302425383" 87 | ] 88 | }, 89 | "execution_count": 5, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "price_pred_train = 
mlp.predict(data_train)\n", 96 | "mean_squared_error(price_pred_train, price_train)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 6, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "data": { 106 | "text/plain": [ 107 | "42.938874737346204" 108 | ] 109 | }, 110 | "execution_count": 6, 111 | "metadata": {}, 112 | "output_type": "execute_result" 113 | } 114 | ], 115 | "source": [ 116 | "price_pred_test = mlp.predict(data_test)\n", 117 | "mean_squared_error(price_pred_test, price_test)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "The MLP does not do a superior job to the linear regression models we considered before. It appears to be strongly overfitting." 125 | ] 126 | } 127 | ], 128 | "metadata": { 129 | "kernelspec": { 130 | "display_name": "Python 3", 131 | "language": "python", 132 | "name": "python3" 133 | }, 134 | "language_info": { 135 | "codemirror_mode": { 136 | "name": "ipython", 137 | "version": 3 138 | }, 139 | "file_extension": ".py", 140 | "mimetype": "text/x-python", 141 | "name": "python", 142 | "nbconvert_exporter": "python", 143 | "pygments_lexer": "ipython3", 144 | "version": "3.6.5" 145 | } 146 | }, 147 | "nbformat": 4, 148 | "nbformat_minor": 2 149 | } 150 | -------------------------------------------------------------------------------- /Chapter05/Perceptron.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The Perceptron\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "The **perceptron** is an online (in the sense that learning theoretically never ends) learning algorithm. 
It is a linear classifier like SVMs, but:\n", 11 | "\n", 12 | "* The perceptron does not seek to maximize the margin separating different classes (a characteristic of SVMs)\n", 13 | "* SVMs support only batch learning (train once, then deploy), while perceptrons support online learning (feedback can be used to update the algorithm)\n", 14 | "\n", 15 | "Perceptrons serve as a building block for neural networks and so should be understood first.\n", 16 | "\n", 17 | "In **scikit-learn**, the `Perceptron` object supports training perceptrons, including allowing for online learning. We will apply perceptrons (which are binary classifiers) to predicting the species of iris flowers. (Perceptrons support multiclass learning via the one-vs-all approach.)\n", 18 | "\n", 19 | "## Creating and Training a Perceptron\n", 20 | "\n", 21 | "Here we will actually create *two* test sets. One test set is interpreted as *the* test set, while the other holdout will be used to demonstrate online learning." 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "from sklearn.linear_model import Perceptron\n", 33 | "from sklearn.model_selection import train_test_split\n", 34 | "from sklearn.metrics import accuracy_score\n", 35 | "from sklearn.datasets import load_iris" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "iris_obj = load_iris()\n", 45 | "data_train, data_test, species_train, species_test = train_test_split(iris_obj.data, iris_obj.target)\n", 46 | "data_in, data_out, species_in, species_out = train_test_split(data_train, species_train, test_size=.1)\n", 47 | "data_in[:5,]" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "species_in[:5]" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 
61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "perc = Perceptron()\n", 66 | "perc = perc.fit(data_in, species_in) # Train first using one set of data\n", 67 | "\n", 68 | "species_pred_in = perc.predict(data_in)\n", 69 | "accuracy_score(species_pred_in, species_in)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## Online Learning\n", 77 | "\n", 78 | "Let's now see what online learning may look like. We will use the remaining data in the training data to update the perceptron we trained." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "species_pred_out = perc.predict(data_out) # Seeing performance on some out-of-sample data\n", 88 | "accuracy_score(species_pred_out, species_out)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": true 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "perc = perc.partial_fit(data_out, species_out)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "species_pred_out = perc.predict(data_out) # Seeing performance on some out-of-sample data\n", 109 | "accuracy_score(species_pred_out, species_out)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "species_pred_train = perc.predict(data_train) # The performance on the entire training sample\n", 119 | "accuracy_score(species_pred_train, species_train)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "Now we finally see out-of-sample results." 
127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "species_pred_test = perc.predict(data_test)\n", 136 | "accuracy_score(species_pred_test, species_test)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "We can see two advantages to online learning. One is that we can use feedback to improve our algorithm in real time. The other is that online learning scales well to very large datasets, since not all data needs to be in memory to train the algorithm." 144 | ] 145 | } 146 | ], 147 | "metadata": { 148 | "kernelspec": { 149 | "display_name": "Python 3", 150 | "language": "python", 151 | "name": "python3" 152 | }, 153 | "language_info": { 154 | "codemirror_mode": { 155 | "name": "ipython", 156 | "version": 3 157 | }, 158 | "file_extension": ".py", 159 | "mimetype": "text/x-python", 160 | "name": "python", 161 | "nbconvert_exporter": "python", 162 | "pygments_lexer": "ipython3", 163 | "version": "3.6.2" 164 | } 165 | }, 166 | "nbformat": 4, 167 | "nbformat_minor": 2 168 | } 169 | -------------------------------------------------------------------------------- /Chapter05/RegressionNeuralNetworks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Regression with Neural Networks\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "Neural networks (MLPs in particular) can also be used for nonlinear regression. This is implemented via the `MLPRegressor` object in **scikit-learn**. In this video I demonstrate neural network regression on the Boston housing price dataset." 
11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from sklearn.datasets import load_boston\n", 20 | "from sklearn.neural_network import MLPRegressor\n", 21 | "from sklearn.model_selection import train_test_split\n", 22 | "from sklearn.metrics import mean_squared_error" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "boston_obj = load_boston()\n", 32 | "data_train, data_test, price_train, price_test = train_test_split(boston_obj.data, boston_obj.target)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "For the neural network I use a MLP network with:\n", 40 | "\n", 41 | "* Three layers, each with 100 neurons\n", 42 | "* A regularization parameter $\\alpha = 10$\n", 43 | "* The hyperbolic tangent function for activation\n", 44 | "\n", 45 | "Here are the results." 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "mlp = MLPRegressor(hidden_layer_sizes=(100,100,100), activation='tanh', alpha=10, max_iter=1000)\n", 55 | "mlp = mlp.fit(data_train, price_train)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 4, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/plain": [ 66 | "array([19.11119061])" 67 | ] 68 | }, 69 | "execution_count": 4, 70 | "metadata": {}, 71 | "output_type": "execute_result" 72 | } 73 | ], 74 | "source": [ 75 | "mlp.predict(data_train[[0], :])" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 5, 81 | "metadata": {}, 82 | "outputs": [ 83 | { 84 | "data": { 85 | "text/plain": [ 86 | "30.227707302425383" 87 | ] 88 | }, 89 | "execution_count": 5, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "price_pred_train = 
mlp.predict(data_train)\n", 96 | "mean_squared_error(price_pred_train, price_train)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 6, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "data": { 106 | "text/plain": [ 107 | "42.938874737346204" 108 | ] 109 | }, 110 | "execution_count": 6, 111 | "metadata": {}, 112 | "output_type": "execute_result" 113 | } 114 | ], 115 | "source": [ 116 | "price_pred_test = mlp.predict(data_test)\n", 117 | "mean_squared_error(price_pred_test, price_test)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "The MLP does not do a superior job to the linear regression models we considered before. It appears to be strongly overfitting." 125 | ] 126 | } 127 | ], 128 | "metadata": { 129 | "kernelspec": { 130 | "display_name": "Python 3", 131 | "language": "python", 132 | "name": "python3" 133 | }, 134 | "language_info": { 135 | "codemirror_mode": { 136 | "name": "ipython", 137 | "version": 3 138 | }, 139 | "file_extension": ".py", 140 | "mimetype": "text/x-python", 141 | "name": "python", 142 | "nbconvert_exporter": "python", 143 | "pygments_lexer": "ipython3", 144 | "version": "3.6.5" 145 | } 146 | }, 147 | "nbformat": 4, 148 | "nbformat_minor": 2 149 | } 150 | -------------------------------------------------------------------------------- /Chapter05/TrainingNeuralNetwork.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Training a Neural Network\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "In this notebook I demonstrate how to train the neural network known as the **multilayer perceptron (MLP)**. We will use a MLP to classify the iris dataset and also a dataset of handwritten digits, in order to detect different characters.\n", 11 | "\n", 12 | "Neural networks have a lot of parameters to set when training. 
These include:\n", 13 | "\n", 14 | "* How many hidden layers to have\n", 15 | "* How many neurons to include in each layer\n", 16 | "* The activation functions of neurons in the hidden layers\n", 17 | "* Value of the regularization term to control overfitting (referred to as $\\alpha$)\n", 18 | "\n", 19 | "Issues that arise when training a neural network are also acute. These are choices related to the actual optimization algorithm that estimates the parameters used for prediction. For neural networks this fitting process is very involved.\n", 20 | "\n", 21 | "MLPs are online algorithms just like perceptrons. This is especially advantageous for training on large datasets that don't necessarily fit into memory. Additionally, MLPs are *not* linear classifiers/regressors. This makes MLPs popular for learning problems that require fitting data that isn't linearly separable.\n", 22 | "\n", 23 | "MLPs can be used for classification and regression. This notebook focuses on classification.\n", 24 | "\n", 25 | "First, let's load in the datasets we will use." 
26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "from sklearn.datasets import load_iris, load_digits\n", 37 | "from sklearn.model_selection import train_test_split\n", 38 | "import numpy as np\n", 39 | "import matplotlib.pyplot as plt\n", 40 | "%matplotlib inline" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# First, the iris dataset\n", 50 | "iris_obj = load_iris()\n", 51 | "iris_data_train, iris_data_test, species_train, species_test = train_test_split(iris_obj.data, iris_obj.target)\n", 52 | "\n", 53 | "# Next, the digits dataset\n", 54 | "digits_obj = load_digits()\n", 55 | "print(digits_obj.DESCR)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "digits_obj.data.shape" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "digits_data_train, digits_data_test, number_train, number_test = train_test_split(digits_obj.data, digits_obj.target)\n", 74 | "number_train[:5]" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "digits_data_train[0, :]" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "digits_data_train[0, :].reshape((8, 8))" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "plt.imshow(digits_data_train[0, :].reshape((8, 8)))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Fitting a MLP to the Iris Data\n", 109 | "\n", 110 | "MLP 
models are implemented via the `MLPClassifier` object in **scikit-learn**. The MLP classifier I train:\n", 111 | "\n", 112 | "* Has one hidden layer with 20 neurons\n", 113 | "* Uses the logistic activation function for the hidden layers\n", 114 | "* Uses a regularization parameter of $\\alpha = 1$\n", 115 | "\n", 116 | "I demonstrate its use below." 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": { 123 | "collapsed": true 124 | }, 125 | "outputs": [], 126 | "source": [ 127 | "from sklearn.neural_network import MLPClassifier\n", 128 | "from sklearn.metrics import accuracy_score" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "mlp_iris = MLPClassifier(hidden_layer_sizes=(20,), # A tuple with the number of neurons for each hidden layer\n", 138 | " activation='logistic', # Which activation function to use\n", 139 | " alpha=1, # Regularization parameter\n", 140 | " max_iter=1000) # Maximum number of iterations taken by the solver\n", 141 | "mlp_iris = mlp_iris.fit(iris_data_train, species_train)\n", 142 | "mlp_iris.predict(iris_data_train[:1,:])" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "species_pred_train = mlp_iris.predict(iris_data_train)\n", 152 | "accuracy_score(species_pred_train, species_train)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "species_pred_test = mlp_iris.predict(iris_data_test)\n", 162 | "accuracy_score(species_pred_test, species_test)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "The classifier has extremely high accuracy for this dataset.\n", 170 | "\n", 171 | "## Fitting a MLP to the Digits Dataset\n", 172 | "\n", 173 | 
"Let's now see how the MLP classifier performs for the digits dataset. Again there is only one hidden layer, this one with 50 neurons." 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": true 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "mlp_digits = MLPClassifier(hidden_layer_sizes=(50,),\n", 185 | " activation='logistic',\n", 186 | " alpha=1)\n", 187 | "mlp_digits = mlp_digits.fit(digits_data_train, number_train)" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "mlp_digits.predict(digits_data_train[[0], :])" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "number_pred_train = mlp_digits.predict(digits_data_train)\n", 206 | "accuracy_score(number_pred_train, number_train)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "number_pred_test = mlp_digits.predict(digits_data_test)\n", 216 | "accuracy_score(number_pred_test, number_test)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "The classifier shines in these nonlinear contexts." 
224 | ] 225 | } 226 | ], 227 | "metadata": { 228 | "kernelspec": { 229 | "display_name": "Python 3", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.6.2" 244 | } 245 | }, 246 | "nbformat": 4, 247 | "nbformat_minor": 2 248 | } 249 | -------------------------------------------------------------------------------- /Chapter06/.ipynb_checkpoints/EvaluatingClusteringModelResults-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Evaluating Clustering Model Results\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "Deciding whether a clustering scheme is \"good\" is harder to *define* than evaluating supervised learning algorithms. This is because the data is unlabeled. There is no \"truth\" to compare against to decide whether the clusters split the data well. Thus choosing between clustering schemes is largely subjective.\n", 11 | "\n", 12 | "However, we can impose criteria we would like a clustering scheme to satisfy. For instance, we may want clusters to minimize the within-cluster sum of squared \"errors\" (distances from the centroid) but not at the expense of having lots of clusters. We may then decide that the optimal number of clusters is the one that makes this measure of error small without requiring many clusters. Other evaluation schemes exist as well.\n", 13 | "\n", 14 | "We will look at evaluating these schemes when clustering the iris data using $k$-means. 
The two approaches I will show are the \"elbow\" approach and silhouette analysis.\n", 15 | "\n", 16 | "## Clustering the Iris Dataset\n", 17 | "\n", 18 | "The first example will demonstrate using $k$-means clustering for the iris dataset. I first load in that dataset." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "from sklearn.datasets import load_iris\n", 30 | "import matplotlib.pyplot as plt\n", 31 | "import matplotlib.cm as cm\n", 32 | "import numpy as np\n", 33 | "%matplotlib inline" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "iris_obj = load_iris()\n", 43 | "iris_data = iris_obj.data\n", 44 | "species = iris_obj.target\n", 45 | "iris_data[:5,:]" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "plt.scatter(iris_data[:, 0], iris_data[:, 1], c=species, cmap=plt.cm.brg)\n", 55 | "plt.xlabel(\"Sepal Length\")\n", 56 | "plt.ylabel(\"Sepal Width\")\n", 57 | "plt.show()" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "Next I import the `KMeans` object to perform $k$-means clustering, and then apply the method." 
65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": { 71 | "collapsed": true 72 | }, 73 | "outputs": [], 74 | "source": [ 75 | "from sklearn.cluster import KMeans" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "irisclust = KMeans(n_clusters=3, init='random') # Three clusters with cluster centers chosen as random dataset points\n", 85 | "irisclust.fit(iris_data)\n", 86 | "irisclust.cluster_centers_ # The coordinates of cluster centers" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "# Visualizing the clustering\n", 96 | "plt.scatter(iris_data[:, 0], iris_data[:, 1], c=irisclust.predict(iris_data), cmap=plt.cm.brg)\n", 97 | "plt.scatter(irisclust.cluster_centers_[:, 0], irisclust.cluster_centers_[:, 1],\n", 98 | " c=irisclust.predict(irisclust.cluster_centers_), cmap=plt.cm.brg, marker='^', s=200,\n", 99 | " edgecolors='k')\n", 100 | "plt.xlabel(\"Sepal Length\")\n", 101 | "plt.ylabel(\"Sepal Width\")\n", 102 | "plt.show()" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## Finding the \"Elbow\"\n", 110 | "\n", 111 | "Consider a plot of the sum of within-cluster squared errors:\n", 112 | "\n", 113 | "$$\\sum_{k=1}^{K} \\sum_{i_k = 1}^{N_k} \\|x_{i_k} - \\mu_{k}\\|^2$$\n", 114 | "\n", 115 | "We want this number to be small, but we also want the number of clusters to be small. So we plot this quantity and choose the number of clusters $K$ that corresponds to the \"elbow\" of the plot.\n", 116 | "\n", 117 | "The function below creates such a plot." 
118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | "collapsed": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "def wcsse_plot(data, max_clusters):\n", 129 | " \"\"\"Plots sum of within-cluster sum of squared errors for clusters up to max_clusters for dataset data\"\"\"\n", 130 | " wcsse = np.zeros(max_clusters) # Float array, so the inertia values are not truncated to integers\n", 131 | " for k in range(1, max_clusters + 1):\n", 132 | " wcsse[k - 1] = KMeans(n_clusters=k).fit(data).inertia_ # inertia is the sum described above\n", 133 | " plt.plot(np.arange(max_clusters) + 1, wcsse, marker='o')" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "wcsse_plot(iris_data, 10)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Where is the elbow? We know that there are actually three species of iris flowers. Based on this plot, three clusters isn't a bad choice; this is around where the \"elbow\" is.\n", 150 | "\n", 151 | "## Silhouette Analysis\n", 152 | "\n", 153 | "Let $a(i)$ be the average dissimilarity of data point $i$ to all other data points in its cluster, and $b(i)$ the average dissimilarity of data point $i$ to the data points in the next nearest cluster (other than the cluster to which data point $i$ has been assigned). The silhouette score for data point $i$ is then:\n", 154 | "\n", 155 | "$$s(i) = \\frac{b(i) - a(i)}{\\max \\left(a(i), b(i)\\right)}$$\n", 156 | "\n", 157 | "This is a number between -1 and 1. A score of 1 means that data point $i$ is very different from clusters other than the cluster to which it has been assigned, while -1 (and any number less than zero) suggests that data point $i$ is in the wrong cluster. 
A value of 0 suggests that data point $i$ is on the border of belonging to one of at least two clusters.\n", 158 | "\n", 159 | "If we average the silhouette scores of each data point, we get the average silhouette score. We prefer this number not to be small, since a small value suggests that clusters are not highly dissimilar; there are likely too many.\n", 160 | "\n", 161 | "We can use the silhouette scores to construct a silhouette plot. **scikit-learn** provides two functions, `silhouette_samples()` and `silhouette_score()`, that compute the silhouette score of each data point and the average silhouette score, respectively. I have written a function that uses them to construct the plot. (The code for writing this function was strongly influenced by [this example](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py) provided by **scikit-learn**'s documentation.)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": true 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "from sklearn.metrics import silhouette_samples, silhouette_score" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "collapsed": true 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "def silhouette_plot(data, labels, metric=\"euclidean\", xticks = True):\n", 184 | " \"\"\"Creates a silhouette plot given a dataset and the labels corresponding to cluster assignment, and reports the\n", 185 | " average silhouette score\"\"\"\n", 186 | " silhouette_avg = silhouette_score(data, labels,\n", 187 | " metric=metric) # The average silhouette score over the entire sample\n", 188 | " sample_silhouette_values = silhouette_samples(data, labels,\n", 189 | " metric=metric) # The silhouette score of each individual data point\n", 190 | " \n", 191 | " # This loop 
creates the silhouettes in the silhouette plot\n", 192 | " y_lower = 10 # For space between silhouettes\n", 193 | " for k in np.unique(labels):\n", 194 | " cluster_values = sample_silhouette_values[labels == k]\n", 195 | " cluster_values.sort()\n", 196 | " nk = len(cluster_values)\n", 197 | " y_upper = y_lower + nk\n", 198 | " color = cm.nipy_spectral(float(k) / len(np.unique(labels))) # cm.spectral was removed from newer matplotlib releases\n", 199 | " plt.fill_betweenx(np.arange(y_lower, y_upper),\n", 200 | " 0, cluster_values,\n", 201 | " facecolor=color, edgecolor=color)\n", 202 | " plt.text(-0.05, y_lower + 0.5 * nk, str(k))\n", 203 | " y_lower = y_upper + 10\n", 204 | " \n", 205 | " plt.axvline(x=silhouette_avg, color=\"red\", linestyle=\"--\")\n", 206 | " if xticks:\n", 207 | " plt.xticks([-0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])\n", 208 | " plt.yticks([])\n", 209 | " plt.xlabel(\"Silhouette Score\")\n", 210 | " plt.ylabel(\"Cluster\")\n", 211 | " plt.show()\n", 212 | " \n", 213 | " print(\"The average silhouette score is\", silhouette_avg)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "silhouette_plot(iris_data, irisclust.predict(iris_data))" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "When looking at a silhouette plot we do not want to see a cluster with all (or nearly all) observations below the average silhouette score (as indicated by the red dashed line). This occurrence would suggest a bad clustering." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": { 236 | "collapsed": true 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "def nclust_silhouette_kmeans(data, n_clusters):\n", 241 | " \"\"\"A function for finding the k-means clustering with n clusters for a dataset and creating the corresponding silhouette\n", 242 | " plot\"\"\"\n", 243 | " labels = KMeans(n_clusters=n_clusters, init='random').fit_predict(data)\n", 244 | " silhouette_plot(data=data, labels=labels)" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "nclust_silhouette_kmeans(iris_data, 2)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": {}, 260 | "outputs": [], 261 | "source": [ 262 | "nclust_silhouette_kmeans(iris_data, 3)" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "nclust_silhouette_kmeans(iris_data, 4)" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "nclust_silhouette_kmeans(iris_data, 5)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "The last two silhouette plots suggest that 4 and 5 clusters are not the right amounts since we end up with clusters whose silhouette scores are almost all below the mean. Either two or three clusters appear to be appropriate.\n", 288 | "\n", 289 | "These analyses suggest that either two or three clusters are appropriate. We know that there are three species in this dataset but without this information we may need to decide which is appropriate. (It appears that one species, setosa, is very different from the other two. 
Look at plots of the flowers' petal length and petal width if you don't believe me.) Any number above three, though, is clearly inappropriate, as indicated by these plots." 290 | ] 291 | } 292 | ], 293 | "metadata": { 294 | "kernelspec": { 295 | "display_name": "Python 3", 296 | "language": "python", 297 | "name": "python3" 298 | }, 299 | "language_info": { 300 | "codemirror_mode": { 301 | "name": "ipython", 302 | "version": 3 303 | }, 304 | "file_extension": ".py", 305 | "mimetype": "text/x-python", 306 | "name": "python", 307 | "nbconvert_exporter": "python", 308 | "pygments_lexer": "ipython3", 309 | "version": "3.6.5" 310 | } 311 | }, 312 | "nbformat": 4, 313 | "nbformat_minor": 2 314 | } 315 | -------------------------------------------------------------------------------- /Chapter06/.ipynb_checkpoints/UnsupervisedLearning-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Diving Into Clustering and Unsupervised Learning\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "In this notebook I give some functions for computing distances between points. This is to introduce the idea of different distance metrics, an important idea in data science and clustering.\n", 11 | "\n", 12 | "Many of these metrics are already supported in relevant packages, but you are welcome to look at functions defining them to understand how they work.\n", 13 | "\n", 14 | "## Euclidean Distance\n", 15 | "\n", 16 | "This is the \"straight line\" distance people are most familiar with." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "import numpy as np" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "def euclidean_distance(v1, v2):\n", 35 | " \"\"\"Computes the Euclidean distance between two vectors\"\"\"\n", 36 | " return np.sqrt(np.sum((v1 - v2) ** 2))" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "4.242640687119285" 48 | ] 49 | }, 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "vec1 = np.array([1, 2, 3])\n", 57 | "vec2 = np.array([1, -1, 0])\n", 58 | "\n", 59 | "euclidean_distance(vec1, vec2)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "## Manhattan Distance\n", 67 | "\n", 68 | "Also commonly known as \"taxicab distance\" this is the distance between two points when \"diagonal\" movement is not allowed." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 4, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "def manhattan_distance(v1, v2):\n", 78 | " \"\"\"Computes the Manhattan distance between two vectors\"\"\"\n", 79 | " return np.sum(np.abs(v1 - v2))" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 5, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "6" 91 | ] 92 | }, 93 | "execution_count": 5, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "manhattan_distance(vec1, vec2)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Angular Distance\n", 107 | "\n", 108 | "This is the size of the angle between the two vectors." 
109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 6, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "from numpy.linalg import norm\n", 118 | "\n", 119 | "def angular_distance(v1, v2):\n", 120 | " \"\"\"Computes the angular distance between two vectors\"\"\"\n", 121 | " sim = v1.dot(v2)/(norm(v1) * norm(v2))\n", 122 | " return np.arccos(sim)/np.pi" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 7, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "data": { 132 | "text/plain": [ 133 | "0.5605188591618384" 134 | ] 135 | }, 136 | "execution_count": 7, 137 | "metadata": {}, 138 | "output_type": "execute_result" 139 | } 140 | ], 141 | "source": [ 142 | "angular_distance(vec1, vec2)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 8, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "data": { 152 | "text/plain": [ 153 | "0.0" 154 | ] 155 | }, 156 | "execution_count": 8, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "angular_distance(vec1, vec1) # Two identical vectors have an angular distance of 0" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 9, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "0.0" 174 | ] 175 | }, 176 | "execution_count": 9, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "angular_distance(vec1, 2 * vec1) # It's insensitive to magnitude (technically it's not a metric as defined by\n", 183 | " # mathematicians because of this, except on a unit circle)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Hamming Distance\n", 191 | "\n", 192 | "Intended for strings (bitstring or otherwise), the Hamming distance between two strings is the number of symbols that need to change in one string to make it identical 
to the other. (The following code was shamelessly stolen from [Wikipedia](https://en.wikipedia.org/wiki/Hamming_distance).)" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 10, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "def hamming_distance(s1, s2):\n", 202 | " \"\"\"Return the Hamming distance between equal-length sequences\"\"\"\n", 203 | " if len(s1) != len(s2):\n", 204 | " raise ValueError(\"Undefined for sequences of unequal length\")\n", 205 | " return sum(el1 != el2 for el1, el2 in zip(s1, s2))" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 11, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "2" 217 | ] 218 | }, 219 | "execution_count": 11, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "hamming_distance(\"11101\", \"11011\")" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## Jaccard Distance\n", 233 | "\n", 234 | "The Jaccard distance, defined for two sets, is the number of elements that the two sets don't have in common divided by the total number of elements the two sets combined have (removing duplicates)." 
235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 12, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "def jaccard_distance(s1, s2):\n", 244 | " \"\"\"Computes the Jaccard distance between two sets\"\"\"\n", 245 | " s1, s2 = set(s1), set(s2)\n", 246 | " diff = len(s1.union(s2)) - len(s1.intersection(s2))\n", 247 | " return diff / len(s1.union(s2))" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 13, 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "data": { 257 | "text/plain": [ 258 | "0.8" 259 | ] 260 | }, 261 | "execution_count": 13, 262 | "metadata": {}, 263 | "output_type": "execute_result" 264 | } 265 | ], 266 | "source": [ 267 | "jaccard_distance([\"cow\", \"pig\", \"horse\"], [\"cow\", \"donkey\", \"chicken\"])" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 14, 273 | "metadata": {}, 274 | "outputs": [ 275 | { 276 | "data": { 277 | "text/plain": [ 278 | "0.0" 279 | ] 280 | }, 281 | "execution_count": 14, 282 | "metadata": {}, 283 | "output_type": "execute_result" 284 | } 285 | ], 286 | "source": [ 287 | "jaccard_distance(\"11101\", \"11011\") # Sets formed from the contents of these strings are identical" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "In a later video I will discuss similarity metrics, focusing on Jaccard similarity." 
295 | ] 296 | } 297 | ], 298 | "metadata": { 299 | "kernelspec": { 300 | "display_name": "Python 3", 301 | "language": "python", 302 | "name": "python3" 303 | }, 304 | "language_info": { 305 | "codemirror_mode": { 306 | "name": "ipython", 307 | "version": 3 308 | }, 309 | "file_extension": ".py", 310 | "mimetype": "text/x-python", 311 | "name": "python", 312 | "nbconvert_exporter": "python", 313 | "pygments_lexer": "ipython3", 314 | "version": "3.6.5" 315 | } 316 | }, 317 | "nbformat": 4, 318 | "nbformat_minor": 2 319 | } 320 | -------------------------------------------------------------------------------- /Chapter06/.ipynb_checkpoints/kmeans-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# $k$-Means Clustering\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "Here I demonstrate clustering using the $k$-means algorithm.\n", 11 | "\n", 12 | "## Clustering the Iris Dataset\n", 13 | "\n", 14 | "The first example will demonstrate using $k$-means clustering for the iris dataset. I first load in that dataset." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "from sklearn.datasets import load_iris\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "%matplotlib inline" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "iris_obj = load_iris()\n", 37 | "iris_data = iris_obj.data\n", 38 | "species = iris_obj.target\n", 39 | "iris_data[:5,:]" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "plt.scatter(iris_data[:, 0], iris_data[:, 1], c=species, cmap=plt.cm.brg)\n", 49 | "plt.xlabel(\"Sepal Length\")\n", 50 | "plt.ylabel(\"Sepal Width\")\n", 51 | "plt.show()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "Next I import the `KMeans` object to perform $k$-means clustering, and then apply the method." 
59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": { 65 | "collapsed": true 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "from sklearn.cluster import KMeans" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "irisclust = KMeans(n_clusters=3, init='random') # Three clusters with cluster centers chosen as random dataset points\n", 79 | "irisclust.fit(iris_data)\n", 80 | "irisclust.cluster_centers_ # The coordinates of cluster centers" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "# Visualizing the clustering\n", 90 | "plt.scatter(iris_data[:, 0], iris_data[:, 1], c=irisclust.predict(iris_data), cmap=plt.cm.brg)\n", 91 | "plt.scatter(irisclust.cluster_centers_[:, 0], irisclust.cluster_centers_[:, 1],\n", 92 | " c=irisclust.predict(irisclust.cluster_centers_), cmap=plt.cm.brg, marker='^', s=200,\n", 93 | " edgecolors='k')\n", 94 | "plt.xlabel(\"Sepal Length\")\n", 95 | "plt.ylabel(\"Sepal Width\")\n", 96 | "plt.show()" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## Image Compression with $k$-Means\n", 104 | "\n", 105 | "$k$-means can also be used for image compression. An image is first clustered using the $k$-means algorithm, each pixel being assigned to a cluster (often pixels are represented using RGB values). The number of clusters is the number of unique colors that need to be stored. Additionally, we would need to store the dimensions of the image and which color each pixel of the image is.\n", 106 | "\n", 107 | "I demonstrate this approach with an image of a poison dart frog, which we will compress with $k$-means into ten unique colors." 
108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": { 114 | "collapsed": true 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "from sklearn.datasets import load_sample_image\n", 119 | "from PIL import Image\n", 120 | "import numpy as np" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "collapsed": true 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "frog = np.array(Image.open(\"frog.png\").convert(\"RGB\")) / 255 # The last division to force numbers to be in [0,1]" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "frog.shape" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "frog[:5, :5, 0]" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "frog[:5, :5, 1]" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "frog[:5, :5, 2]" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "plt.imshow(frog)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": true 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "def kmeans_compression(img, n_clusters):\n", 188 | " \"\"\"Recolors an image when colors are clustered using the k-means algorithm\"\"\"\n", 189 | " h, w, d = img.shape\n", 190 | " assert d == 3\n", 191 | " img_data = img.reshape(h * w, d) # The new array should have a row per pixel\n", 192 | " img_clust = KMeans(n_clusters=n_clusters, init='random').fit(img_data) # The actual k-means 
clustering step\n", 193 | " centroids = img_clust.cluster_centers_ # The RGB (normalized) values for the new pixels\n", 194 | " clust_pixels = img_clust.predict(img_data) # Which pixel gets which new value\n", 195 | " new_img_data = centroids[clust_pixels]\n", 196 | " return new_img_data.reshape(h, w, d)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": { 203 | "collapsed": true 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "newfrog = kmeans_compression(frog, 10)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "plt.imshow(newfrog)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "While the original image would need more memory to store each pixel's unique color, the compressed image uses only ten colors, so it carries less information and needs less memory to store, at the cost of image quality." 
224 | ] 225 | } 226 | ], 227 | "metadata": { 228 | "kernelspec": { 229 | "display_name": "Python 3", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.6.5" 244 | } 245 | }, 246 | "nbformat": 4, 247 | "nbformat_minor": 2 248 | } 249 | -------------------------------------------------------------------------------- /Chapter06/UnsupervisedLearning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Diving Into Clustering and Unsupervised Learning\n", 8 | "*Curtis Miller*\n", 9 | "\n", 10 | "In this notebook I give some functions for computing distances between points. This is to introduce the idea of different distance metrics, an important idea in data science and clustering.\n", 11 | "\n", 12 | "Many of these metrics are already supported in relevant packages, but you are welcome to look at functions defining them to understand how they work.\n", 13 | "\n", 14 | "## Euclidean Distance\n", 15 | "\n", 16 | "This is the \"straight line\" distance people are most familiar with." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "import numpy as np" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "def euclidean_distance(v1, v2):\n", 35 | " \"\"\"Computes the Euclidean distance between two vectors\"\"\"\n", 36 | " return np.sqrt(np.sum((v1 - v2) ** 2))" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "4.242640687119285" 48 | ] 49 | }, 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "vec1 = np.array([1, 2, 3])\n", 57 | "vec2 = np.array([1, -1, 0])\n", 58 | "\n", 59 | "euclidean_distance(vec1, vec2)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "## Manhattan Distance\n", 67 | "\n", 68 | "Also commonly known as \"taxicab distance\", this is the distance between two points when \"diagonal\" movement is not allowed." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 4, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "def manhattan_distance(v1, v2):\n", 78 | " \"\"\"Computes the Manhattan distance between two vectors\"\"\"\n", 79 | " return np.sum(np.abs(v1 - v2))" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 5, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "6" 91 | ] 92 | }, 93 | "execution_count": 5, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "manhattan_distance(vec1, vec2)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Angular Distance\n", 107 | "\n", 108 | "This is the size of the angle between the two vectors, divided by $\\pi$ so that the result lies between 0 and 1." 
109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 6, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "from numpy.linalg import norm\n", 118 | "\n", 119 | "def angular_distance(v1, v2):\n", 120 | " \"\"\"Computes the angular distance between two vectors\"\"\"\n", 121 | " sim = v1.dot(v2)/(norm(v1) * norm(v2))\n", 122 | " return np.arccos(sim)/np.pi" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 7, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "data": { 132 | "text/plain": [ 133 | "0.5605188591618384" 134 | ] 135 | }, 136 | "execution_count": 7, 137 | "metadata": {}, 138 | "output_type": "execute_result" 139 | } 140 | ], 141 | "source": [ 142 | "angular_distance(vec1, vec2)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 8, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "data": { 152 | "text/plain": [ 153 | "0.0" 154 | ] 155 | }, 156 | "execution_count": 8, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "angular_distance(vec1, vec1) # Two identical vectors have an angular distance of 0" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 9, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "0.0" 174 | ] 175 | }, 176 | "execution_count": 9, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "angular_distance(vec1, 2 * vec1) # It's insensitive to magnitude (technically it's not a metric as defined by\n", 183 | " # mathematicians because of this, except on a unit circle)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Hamming Distance\n", 191 | "\n", 192 | "Intended for strings (bitstring or otherwise), the Hamming distance between two strings is the number of symbols that need to change in one string to make it identical 
to the other. (The following code was shamelessly stolen from [Wikipedia](https://en.wikipedia.org/wiki/Hamming_distance).)" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 10, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "def hamming_distance(s1, s2):\n", 202 | " \"\"\"Return the Hamming distance between equal-length sequences\"\"\"\n", 203 | " if len(s1) != len(s2):\n", 204 | " raise ValueError(\"Undefined for sequences of unequal length\")\n", 205 | " return sum(el1 != el2 for el1, el2 in zip(s1, s2))" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 11, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "2" 217 | ] 218 | }, 219 | "execution_count": 11, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "hamming_distance(\"11101\", \"11011\")" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## Jaccard Distance\n", 233 | "\n", 234 | "The Jaccard distance, defined for two sets, is the number of elements that the two sets don't have in common divided by the total number of elements the two sets combined have (removing duplicates)." 
235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 12, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "def jaccard_distance(s1, s2):\n", 244 | " \"\"\"Computes the Jaccard distance between two sets\"\"\"\n", 245 | " s1, s2 = set(s1), set(s2)\n", 246 | " diff = len(s1.union(s2)) - len(s1.intersection(s2))\n", 247 | " return diff / len(s1.union(s2))" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 13, 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "data": { 257 | "text/plain": [ 258 | "0.8" 259 | ] 260 | }, 261 | "execution_count": 13, 262 | "metadata": {}, 263 | "output_type": "execute_result" 264 | } 265 | ], 266 | "source": [ 267 | "jaccard_distance([\"cow\", \"pig\", \"horse\"], [\"cow\", \"donkey\", \"chicken\"])" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 14, 273 | "metadata": {}, 274 | "outputs": [ 275 | { 276 | "data": { 277 | "text/plain": [ 278 | "0.0" 279 | ] 280 | }, 281 | "execution_count": 14, 282 | "metadata": {}, 283 | "output_type": "execute_result" 284 | } 285 | ], 286 | "source": [ 287 | "jaccard_distance(\"11101\", \"11011\") # Sets formed from the contents of these strings are identical" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "In a later video I will discuss similarity metrics, focusing on Jaccard similarity." 
295 | ] 296 | } 297 | ], 298 | "metadata": { 299 | "kernelspec": { 300 | "display_name": "Python 3", 301 | "language": "python", 302 | "name": "python3" 303 | }, 304 | "language_info": { 305 | "codemirror_mode": { 306 | "name": "ipython", 307 | "version": 3 308 | }, 309 | "file_extension": ".py", 310 | "mimetype": "text/x-python", 311 | "name": "python", 312 | "nbconvert_exporter": "python", 313 | "pygments_lexer": "ipython3", 314 | "version": "3.6.5" 315 | } 316 | }, 317 | "nbformat": 4, 318 | "nbformat_minor": 2 319 | } 320 | -------------------------------------------------------------------------------- /Chapter06/frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Training-Systems-Using-Python-Statistical-Modeling/5eb619df9648570e83e910f781aa97fd19f17403/Chapter06/frog.png -------------------------------------------------------------------------------- /Chapter07/Metadata.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Training-Systems-Using-Python-Statistical-Modeling/5eb619df9648570e83e910f781aa97fd19f17403/Chapter07/Metadata.docx -------------------------------------------------------------------------------- /Chapter07/frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Training-Systems-Using-Python-Statistical-Modeling/5eb619df9648570e83e910f781aa97fd19f17403/Chapter07/frog.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without 
restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Training Systems Using Python Statistical Modeling 5 | 6 | Training Systems Using Python Statistical Modeling 7 | 8 | This is the code repository for [Training Systems Using Python Statistical Modeling](https://www.packtpub.com/big-data-and-business-intelligence/training-systems-using-python-statistical-modeling), published by Packt. 9 | 10 | **Explore popular techniques for modeling your data in Python** 11 | 12 | ## What is this book about? 13 | Python's ease of use and multi-purpose nature have made it the tool of choice for many data scientists and machine learning developers today. Its rich libraries are widely used for data analysis and, more importantly, for building state-of-the-art predictive models. This book takes you on a journey through using these libraries to implement effective statistical models for predictive analytics. 
14 | 15 | You’ll start by diving into classical statistical analysis, where you will learn to compute descriptive statistics using pandas. You will look at supervised learning, where you will explore the principles of machine learning and train different machine learning models from scratch. You will also work with binary prediction models, such as data classification using k-nearest neighbors, decision trees, and random forests. This book also covers algorithms for regression analysis, such as ridge and lasso regression, and their implementation in Python. You will also learn how neural networks can be trained and deployed for more accurate predictions, and which Python libraries can be used to implement them. 16 | 17 | By the end of this book, you will have all the knowledge you need to design, build, and deploy enterprise-grade statistical models for machine learning using Python and its rich ecosystem of libraries for predictive analytics. 18 | 19 | This book has the following features: 20 | * Understand the importance of statistical modeling 21 | * Learn about the various Python packages for statistical analysis 22 | * Build predictive models from scratch using Python's scikit-learn library 23 | * Implement regression analysis and clustering 24 | * Learn how to train a neural network in Python 25 | 26 | If you feel this book is for you, get your [copy](https://www.amazon.com/Training-Systems-Python-Statistical-Modeling/dp/1838823735) today! 27 | 28 | https://www.packtpub.com/ 30 | 31 | ## Instructions and Navigations 32 | All of the code is organized into folders, one per chapter (for example, Chapter02). 33 | 34 | With the following software and hardware list you can run all code files present in the book (Chapters 1-7). 
35 | ### Software and Hardware List 36 | | Chapter | Software required | OS required | 37 | | -------- | ------------------------------------ | ----------------------------------- | 38 | | All | Python 3.6 and above | Windows, Mac OS X, and Linux (Any) | 39 | | All | Jupyter Notebook | Windows, Mac OS X, and Linux (Any) | 40 | | All | Anaconda | Windows, Mac OS X, and Linux (Any) | 41 | 42 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](https://www.packtpub.com/sites/default/files/downloads/9781838823733_ColorImages.pdf). 43 | 44 | ### Related products 45 | * Hands-On Data Analysis with NumPy and Pandas [[PACKT]](https://www.packtpub.com/big-data-and-business-intelligence/hands-data-analysis-numpy-and-pandas) [[Amazon]](https://www.amazon.com/Hands-Data-Analysis-NumPy-pandas/dp/1789530792) 46 | 47 | * Statistical Application Development with R and Python - Second Edition [[PACKT]](https://www.packtpub.com/big-data-and-business-intelligence/statistical-application-development-r-and-python-second-edition) [[Amazon]](https://www.amazon.com/Statistical-Application-Development-Python-applications/dp/1788621190) 48 | 49 | 50 | ## Get to Know the Author 51 | **Curtis Miller** 52 | is a doctoral candidate at the University of Utah studying mathematical statistics. He writes software for both research and personal interest, including the R package (CPAT) available on the Comprehensive R Archive Network (CRAN). Among Curtis Miller's publications are academic papers along with books and video courses all published by Packt Publishing. Curtis Miller's video courses include Unpacking NumPy and Pandas, Data Acquisition and Manipulation with Python, Training Your Systems with Python Statistical Modelling, and Applications of Statistical Learning with Python. His books include Hands-On Data Analysis with NumPy and Pandas. 
53 | ### Suggestions and Feedback 54 | [Click here](https://docs.google.com/forms/d/e/1FAIpQLSdy7dATC6QmEL81FIUuymZ0Wy9vH1jHkvpY57OiMeKGqib_Ow/viewform) if you have any feedback or suggestions. 55 | 56 | 57 | ### Download a free PDF 58 | 59 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
60 |

https://packt.link/free-ebook/9781838823733

--------------------------------------------------------------------------------