├── images ├── shuttle.png ├── cs109gitflow3.png └── conditionalmean.png ├── data └── chall.txt ├── README.md ├── .gitignore ├── Lab4-stats.ipynb └── Lab4-stats_original.ipynb /images/shuttle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs109/2015lab4/master/images/shuttle.png -------------------------------------------------------------------------------- /images/cs109gitflow3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs109/2015lab4/master/images/cs109gitflow3.png -------------------------------------------------------------------------------- /images/conditionalmean.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs109/2015lab4/master/images/conditionalmean.png -------------------------------------------------------------------------------- /data/chall.txt: -------------------------------------------------------------------------------- 1 | 66 0 2 | 70 1 3 | 69 0 4 | 68 0 5 | 67 0 6 | 72 0 7 | 73 0 8 | 70 0 9 | 57 1 10 | 63 1 11 | 70 1 12 | 78 0 13 | 67 0 14 | 53 1 15 | 67 0 16 | 75 0 17 | 70 0 18 | 81 0 19 | 76 0 20 | 79 0 21 | 75 1 22 | 76 0 23 | 58 1 24 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 2015lab4 2 | 3 | Fork this lab! 4 | 5 | When you follow along, you can add in your own notes, and try your own variations. As you are doing this, dont forget to continue doing the "add/commit/push" cycle, so that you save and version your changes, and push them to your fork. This typically looks like: 6 | - git add . 7 | - git commit -a 8 | - git push 9 | 10 | In case we make changes, you can incorporate them into your repo by doing: `git fetch course; git checkout course/master -- labname_original.ipynb` where `labname.ipynb` is the lab in question. An "add/commit/push" cycle will make sure these changes go into your fork as well. If you intend to work on the changed file, simply copy the file to another one and work on it. Or you could make a new branch. Remember that this fork is YOUR repository, and you can do to it what you like. 11 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | 59 | #Ipython 60 | .ipynb_checkpoints/ 61 | # Created by .ignore support plugin (hsz.mobi) 62 | ### OSX template 63 | .DS_Store 64 | .AppleDouble 65 | .LSOverride 66 | 67 | # Icon must end with two \r 68 | Icon 69 | 70 | # Thumbnails 71 | ._* 72 | 73 | # Files that might appear in the root of a volume 74 | .DocumentRevisions-V100 75 | .fseventsd 76 | .Spotlight-V100 77 | .TemporaryItems 78 | .Trashes 79 | .VolumeIcon.icns 80 | 81 | # Directories potentially created on remote AFP share 82 | .AppleDB 83 | .AppleDesktop 84 | Network Trash Folder 85 | Temporary Items 86 | .apdisk 87 | 88 | #Temporary data 89 | hw1/tempdata/ 90 | hw1/.ipynb_checkpoints/ 91 | 92 | -------------------------------------------------------------------------------- /Lab4-stats.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# CS-109: Fall 2015 -- Lab 4\n", 8 | "\n", 9 | "# Regression in Python\n", 10 | "\n", 11 | "***\n", 12 | "This is a very quick run-through of some basic statistical concepts\n", 13 | "\n", 14 | "* Regression Models\n", 15 | " * Linear, Logistic\n", 16 | "* Prediction using linear regression\n", 17 | "* Some re-sampling methods \n", 18 | " * Train-Test splits\n", 19 | " * Cross Validation\n", 20 | "\n", 21 | "Linear regression is used to model and predict continuous outcomes while logistic regression is used to model binary outcomes. We'll see some examples of linear regression as well as Train-test splits.\n", 22 | "\n", 23 | "\n", 24 | "The packages we'll cover are: `statsmodels`, `seaborn`, and `scikit-learn`.\n", 25 | "***" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "\n", 33 | "***" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "# special IPython command to prepare the notebook for matplotlib\n", 45 | "%matplotlib inline \n", 46 | "\n", 47 | "import numpy as np\n", 48 | "import pandas as pd\n", 49 | "import scipy.stats as stats\n", 50 | "import matplotlib.pyplot as plt\n", 51 | "import sklearn\n", 52 | "import statsmodels.api as sm\n", 53 | "\n", 54 | "import seaborn as sns\n", 55 | "sns.set_style(\"whitegrid\")\n", 56 | "sns.set_context(\"poster\")\n", 57 | "\n", 58 | "# special matplotlib argument for improved plots\n", 59 | "from matplotlib import rcParams\n" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "# Part 0: Piazza Posting Guidelines\n", 67 | "***\n", 68 | "
\n", 69 | "The high volume of posts on piazza has made it very difficult for us to answer all of your questions. For this reason, we are taking measures to decrease the volume of posts, and increase their quality. Below is the general format that we now require for all posts. ONLY posts that are in this format will be answered by staff. Posts that are not in this format will be made private, and students will be asked to reformat their question.\n", 70 | "
\n", 71 | " \n", 72 | "1. At the top of your post, make a list of all the keywords you entered in your piazza search when looking for answers to your question. Also include a list of keywords you searched in google. Provide links to all the posts that you read that were relevant, but did not quite answer your question. \n", 73 | " \n", 74 | "2. If you are sure that your question is not a duplicate question, write down your question in as much detail as possible without posting code. You can post your error messages, and unit tests.\n", 75 | " \n", 76 | "3. Include the steps you have taken for debugging, and what the outcome was. You must have spent at least 30 minutes trying to debug before you post a question. \n", 77 | " \n", 78 | "4. Post your question in the most specific folder possible, for example hw1-1.1 (see @564)\n", 79 | " \n", 80 | "5. Follow up! Describe the solution that worked for you. Also, click the \"resolve\" button if your problem has been resolved.\n", 81 | " \n", 82 | "See @310 for general posting guidelines and etiquette. " 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "***\n", 90 | "# Part 1: Linear Regression\n", 91 | "### Purpose of linear regression\n", 92 | "***\n", 93 | "
\n", 94 | "\n", 95 | "

Given a dataset $X$ and $Y$, linear regression can be used to:

\n", 96 | "\n", 105 | "
\n", 106 | "\n", 107 | "### A brief recap\n", 108 | "***\n", 109 | "\n", 110 | "[Linear Regression](http://en.wikipedia.org/wiki/Linear_regression) is a method to model the relationship between a set of independent variables $X$ (also knowns as explanatory variables, features, predictors) and a dependent variable $Y$. This method assumes the relationship between each predictor $X$ is linearly related to the dependent variable $Y$. \n", 111 | "\n", 112 | "$$ Y = \\beta_0 + \\beta_1 X + \\epsilon$$\n", 113 | "\n", 114 | "where $\\epsilon$ is considered as an unobservable random variable that adds noise to the linear relationship. This is the simplest form of linear regression (one variable), we'll call this the simple model. \n", 115 | "\n", 116 | "* $\\beta_0$ is the intercept of the linear model\n", 117 | "\n", 118 | "* Multiple linear regression is when you have more than one independent variable\n", 119 | " * $X_1$, $X_2$, $X_3$, $\\ldots$\n", 120 | "\n", 121 | "$$ Y = \\beta_0 + \\beta_1 X_1 + \\ldots + \\beta_p X_p + \\epsilon$$ \n", 122 | "\n", 123 | "* Back to the simple model. The model in linear regression is the *conditional mean* of $Y$ given the values in $X$ is expressed a linear function. \n", 124 | "\n", 125 | "$$ y = f(x) = E(Y | X = x)$$ \n", 126 | "\n", 127 | "![conditional mean](images/conditionalmean.png)\n", 128 | "http://www.learner.org/courses/againstallodds/about/glossary.html\n", 129 | "\n", 130 | "* The goal is to estimate the coefficients (e.g. $\\beta_0$ and $\\beta_1$). We represent the estimates of the coefficients with a \"hat\" on top of the letter. \n", 131 | "\n", 132 | "$$ \\hat{\\beta}_0, \\hat{\\beta}_1 $$\n", 133 | "\n", 134 | "* Once you estimate the coefficients $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$, you can use these to predict new values of $Y$\n", 135 | "\n", 136 | "$$\\hat{y} = \\hat{\\beta}_0 + \\hat{\\beta}_1 x_1$$\n", 137 | "\n", 138 | "\n", 139 | "* How do you estimate the coefficients? \n", 140 | " * There are many ways to fit a linear regression model\n", 141 | " * The method called **least squares** is one of the most common methods\n", 142 | " * We will discuss least squares today\n", 143 | " \n", 144 | "### Estimating $\\hat\\beta$: Least squares\n", 145 | "***\n", 146 | "[Least squares](http://en.wikipedia.org/wiki/Least_squares) is a method that can estimate the coefficients of a linear model by minimizing the difference between the following: \n", 147 | "\n", 148 | "$$ S = \\sum_{i=1}^N r_i = \\sum_{i=1}^N (y_i - (\\beta_0 + \\beta_1 x_i))^2 $$\n", 149 | "\n", 150 | "where $N$ is the number of observations. \n", 151 | "\n", 152 | "* We will not go into the mathematical details, but the least squares estimates $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$ minimize the sum of the squared residuals $r_i = y_i - (\\beta_0 + \\beta_1 x_i)$ in the model (i.e. makes the difference between the observed $y_i$ and linear model $\\beta_0 + \\beta_1 x_i$ as small as possible). \n", 153 | "\n", 154 | "The solution can be written in compact matrix notation as\n", 155 | "\n", 156 | "$$\\hat\\beta = (X^T X)^{-1}X^T Y$$ \n", 157 | "\n", 158 | "We wanted to show you this in case you remember linear algebra, in order for this solution to exist we need $X^T X$ to be invertible. Of course this requires a few extra assumptions, $X$ must be full rank so that $X^T X$ is invertible, etc. 
**This is important for us because this means that having redundant features in our regression models will lead to poorly fitting (and unstable) models.** We'll see an implementation of this in the extra linear regression example.\n", 159 | "\n", 160 | "**Note**: The \"hat\" means it is an estimate of the coefficient. " 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "***\n", 168 | "# Part 2: Boston Housing Data Set\n", 169 | "\n", 170 | "The [Boston Housing data set](https://archive.ics.uci.edu/ml/datasets/Housing) contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository. \n", 171 | "\n", 172 | "\n", 173 | "## Load the Boston Housing data set from `sklearn`\n", 174 | "***\n", 175 | "\n", 176 | "This data set is available in the [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston) python module which is how we will access it today. " 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "from sklearn.datasets import load_boston\n", 188 | "boston = load_boston()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "boston.keys()" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "boston.data.shape" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "# Print column names\n", 222 | "print boston.feature_names" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "collapsed": false 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "# Print description of Boston housing data set\n", 234 | "print boston.DESCR" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "Now let's explore the data set itself. " 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "bos = pd.DataFrame(boston.data)\n", 253 | "bos.head()" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "There are no column names in the DataFrame. Let's add those. " 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "bos.columns = boston.feature_names\n", 272 | "bos.head()" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "Now we have a pandas DataFrame called `bos` containing all the data we want to use to predict Boston Housing prices. Let's create a variable called `PRICE` which will contain the prices. This information is contained in the `target` data. 
" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "print boston.target.shape" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": { 297 | "collapsed": false 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "bos['PRICE'] = boston.target\n", 302 | "bos.head()" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "## EDA and Summary Statistics\n", 310 | "***\n", 311 | "\n", 312 | "Let's explore this data set. First we use `describe()` to get basic summary statistics for each of the columns. " 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": { 319 | "collapsed": false 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "bos.describe()" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "### Scatter plots\n", 331 | "***\n", 332 | "\n", 333 | "Let's look at some scatter plots for three variables: 'CRIM', 'RM' and 'PTRATIO'. \n", 334 | "\n", 335 | "What kind of relationship do you see? e.g. positive, negative? linear? non-linear? " 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "plt.scatter(bos.CRIM, bos.PRICE)\n", 347 | "plt.xlabel(\"Per capita crime rate by town (CRIM)\")\n", 348 | "plt.ylabel(\"Housing Price\")\n", 349 | "plt.title(\"Relationship between CRIM and Price\")" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "plt.scatter(bos.RM, bos.PRICE)\n", 361 | "plt.xlabel(\"Average number of rooms per dwelling (RM)\")\n", 362 | "plt.ylabel(\"Housing Price\")\n", 363 | "plt.title(\"Relationship between RM and Price\")\n", 364 | "\n", 365 | "# sns.regplot(y=\"PRICE\", x=\"RM\", data=bos, fit_reg = True)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": { 372 | "collapsed": false 373 | }, 374 | "outputs": [], 375 | "source": [ 376 | "# We can also use seaborn regplot for this\n", 377 | "# This provides automatic linear regression fits (useful for data exploration later on)\n", 378 | "\n", 379 | "sns.regplot(y=\"PRICE\", x=\"RM\", data=bos, fit_reg = True)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "collapsed": false 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "plt.scatter(bos.PTRATIO, bos.PRICE)\n", 391 | "plt.xlabel(\"Pupil-to-Teacher Ratio (PTRATIO)\")\n", 392 | "plt.ylabel(\"Housing Price\")\n", 393 | "plt.title(\"Relationship between PTRATIO and Price\")" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### Histograms\n", 401 | "***\n" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": { 408 | "collapsed": false 409 | }, 410 | "outputs": [], 411 | "source": [ 412 | "plt.hist(bos.CRIM)\n", 413 | "plt.title(\"CRIM\")\n", 414 | "plt.xlabel(\"Crime rate per capita\")\n", 415 | "plt.ylabel(\"Frequencey\")\n", 416 | "plt.show()" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": { 423 | "collapsed": false 424 | }, 425 | 
"outputs": [], 426 | "source": [ 427 | "plt.hist(bos.PRICE)\n", 428 | "plt.title('Housing Prices: $Y_i$')\n", 429 | "plt.xlabel('Price')\n", 430 | "plt.ylabel('Frequency')\n", 431 | "plt.show()" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "## Linear regression with Boston housing data example\n", 439 | "***\n", 440 | "\n", 441 | "Here, \n", 442 | "\n", 443 | "$Y$ = boston housing prices (also called \"target\" data in python)\n", 444 | "\n", 445 | "and\n", 446 | "\n", 447 | "$X$ = all the other features (or independent variables)\n", 448 | "\n", 449 | "which we will use to fit a linear regression model and predict Boston housing prices. We will use the least squares method as the way to estimate the coefficients. " 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "We'll use two ways of fitting a linear regression. We recommend the first but the second is also powerful in its features." 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "### Fitting Linear Regression using `statsmodels`\n", 464 | "***" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": { 471 | "collapsed": true 472 | }, 473 | "outputs": [], 474 | "source": [ 475 | "# Import regression modules\n", 476 | "# ols - stands for Ordinary least squares, we'll use this\n", 477 | "import statsmodels.api as sm\n", 478 | "from statsmodels.formula.api import ols" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": { 485 | "collapsed": false 486 | }, 487 | "outputs": [], 488 | "source": [ 489 | "# statsmodels works nicely with pandas dataframes\n", 490 | "# The thing inside the \"quotes\" is called a formula, a bit on that below\n", 491 | "m = ols('PRICE ~ RM',bos).fit()\n", 492 | "print m.summary()" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "#### Interpreting coefficients\n", 500 | "\n", 501 | "There is a ton of information in this output. But we'll concentrate on the coefficient table (middle table). We can interpret the `RM` coefficient (9.1021) by first noticing that the p-vale (under `P>|t|`) is so small, basically zero. We can interpret the coefficient as, if we compare two groups of towns, one where the average number of rooms is say $5$ and the other group is the same except that they all have $6$ rooms. For these two groups the average difference in house prives is about $9.1$ (in thousands) so about $\\$9,100$ difference. The confidence interval fives us a range of plausible values for this difference, about ($\\$8,279, \\$9,925$), deffinitely not chump change. \n", 502 | "\n", 503 | "In the last section of this Lab we discuss p-values in more detail. Please have a read though it and ask your TFs for more help." 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "#### `statsmodels` formulas\n", 511 | "***\n", 512 | "This formula notation will seem familiar to `R` users, but will take some getting used to for people coming from other languages or are new to statistics.\n", 513 | "\n", 514 | "The formula gives instruction for a general structure for a regression call. For `statsmodels` (`ols` or `logit`) calls you need to have a Pandas dataframe with column names that you will add to your formula. 
In the example below you need a pandas dataframe that includes the columns named (`Outcome`, `X1`, `X2`, ...), but you don't need to build a new dataframe for every regression; use the same dataframe with all of these columns in it. The structure is very simple:\n", 515 | "\n", 516 | "`Outcome ~ X1`\n", 517 | "\n", 518 | "But of course we want to be able to handle more complex models; for example, multiple regression is done like this:\n", 519 | "\n", 520 | "`Outcome ~ X1 + X2 + X3`\n", 521 | "\n", 522 | "This is the very basic structure, but it should be enough to get you through the homework. Things can get much more complex; for a quick run-down of further uses see the `statsmodels` [help page](http://statsmodels.sourceforge.net/devel/example_formulas.html).\n" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "Let's see how our model actually fit our data. We can see below that there is a ceiling effect; we should probably look into that. Also, for large values of $Y$ we get underpredictions: most predictions are below the 45-degree line. " 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": { 536 | "collapsed": false 537 | }, 538 | "outputs": [], 539 | "source": [ 540 | "plt.scatter(bos['PRICE'], m.fittedvalues)\n", 541 | "plt.xlabel(\"Prices: $Y_i$\")\n", 542 | "plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n", 543 | "plt.title(\"Prices vs Predicted Prices: $Y_i$ vs $\\hat{Y}_i$\")\n" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "### Fitting Linear Regression using `sklearn`\n" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": { 557 | "collapsed": false 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "from sklearn.linear_model import LinearRegression\n", 562 | "X = bos.drop('PRICE', axis = 1)\n", 563 | "\n", 564 | "# This creates a LinearRegression object\n", 565 | "lm = LinearRegression()\n", 566 | "lm" 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "metadata": {}, 572 | "source": [ 573 | "#### What can you do with a LinearRegression object? " 574 | ] 575 | }, 576 | { 577 | "cell_type": "code", 578 | "execution_count": null, 579 | "metadata": { 580 | "collapsed": false 581 | }, 582 | "outputs": [], 583 | "source": [ 584 | "# Look inside linear regression object\n", 585 | "# LinearRegression." 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "Main functions | Description\n", 593 | "--- | --- \n", 594 | "`lm.fit()` | Fit a linear model\n", 595 | "`lm.predict()` | Predict Y using the linear model with estimated coefficients\n", 596 | "`lm.score()` | Returns the coefficient of determination ($R^2$), *a measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model*" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "#### What output can you get?" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": { 610 | "collapsed": false 611 | }, 612 | "outputs": [], 613 | "source": [ 614 | "# Look inside lm object\n", 615 | "# lm." 
616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "Output | Description\n", 623 | "--- | --- \n", 624 | "`lm.coef_` | Estimated coefficients\n", 625 | "`lm.intercept_` | Estimated intercept " 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "### Fit a linear model\n", 633 | "***\n", 634 | "\n", 635 | "The `lm.fit()` function estimates the coefficients the linear regression using least squares. " 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": { 642 | "collapsed": false 643 | }, 644 | "outputs": [], 645 | "source": [ 646 | "# Use all 13 predictors to fit linear regression model\n", 647 | "lm.fit(X, bos.PRICE)\n", 648 | "\n", 649 | "# your turn\n", 650 | "# notice fit_intercept=True and normalize=True\n", 651 | "# How would you change the model to not fit an intercept term? \n" 652 | ] 653 | }, 654 | { 655 | "cell_type": "markdown", 656 | "metadata": {}, 657 | "source": [ 658 | "### Estimated intercept and coefficients\n", 659 | "\n", 660 | "Let's look at the estimated coefficients from the linear model using `1m.intercept_` and `lm.coef_`. \n", 661 | "\n", 662 | "After we have fit our linear regression model using the least squares method, we want to see what are the estimates of our coefficients $\\beta_0$, $\\beta_1$, ..., $\\beta_{13}$: \n", 663 | "\n", 664 | "$$ \\hat{\\beta}_0, \\hat{\\beta}_1, \\ldots, \\hat{\\beta}_{13} $$\n", 665 | "\n" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": { 672 | "collapsed": false 673 | }, 674 | "outputs": [], 675 | "source": [ 676 | "print 'Estimated intercept coefficient:', lm.intercept_" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": { 683 | "collapsed": false 684 | }, 685 | "outputs": [], 686 | "source": [ 687 | "print 'Number of coefficients:', len(lm.coef_)" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": { 694 | "collapsed": false 695 | }, 696 | "outputs": [], 697 | "source": [ 698 | "# The coefficients\n", 699 | "pd.DataFrame(zip(X.columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": {}, 705 | "source": [ 706 | "### Predict Prices \n", 707 | "\n", 708 | "We can calculate the predicted prices ($\\hat{Y}_i$) using `lm.predict`. \n", 709 | "\n", 710 | "$$ \\hat{Y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1 X_1 + \\ldots \\hat{\\beta}_{13} X_{13} $$" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "metadata": { 717 | "collapsed": false 718 | }, 719 | "outputs": [], 720 | "source": [ 721 | "# first five predicted prices\n", 722 | "lm.predict(X)[0:5]" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": { 729 | "collapsed": false 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "plt.hist(lm.predict(X))\n", 734 | "plt.title('Predicted Housing Prices (fitted values): $\\hat{Y}_i$')\n", 735 | "plt.xlabel('Price')\n", 736 | "plt.ylabel('Frequency')" 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": {}, 742 | "source": [ 743 | "Let's plot the true prices compared to the predicted prices to see they disagree, we saw this exactly befor but this is how you access the predicted values in using `sklearn`." 
744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": null, 749 | "metadata": { 750 | "collapsed": false 751 | }, 752 | "outputs": [], 753 | "source": [ 754 | "# plt.scatter(bos.PRICE, lm.predict(X))\n", 755 | "# plt.xlabel(\"Prices: $Y_i$\")\n", 756 | "# plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n", 757 | "# plt.title(\"Prices vs Predicted Prices: $Y_i$ vs $\\hat{Y}_i$\")" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": {}, 763 | "source": [ 764 | "### Residual sum of squares\n", 765 | "\n", 766 | "Let's calculate the residual sum of squares \n", 767 | "\n", 768 | "$$ S = \\sum_{i=1}^N r_i = \\sum_{i=1}^N (y_i - (\\beta_0 + \\beta_1 x_i))^2 $$" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": { 775 | "collapsed": false 776 | }, 777 | "outputs": [], 778 | "source": [ 779 | "print np.sum((bos.PRICE - lm.predict(X)) ** 2)" 780 | ] 781 | }, 782 | { 783 | "cell_type": "markdown", 784 | "metadata": {}, 785 | "source": [ 786 | "#### Mean squared error" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": { 793 | "collapsed": false 794 | }, 795 | "outputs": [], 796 | "source": [ 797 | "mseFull = np.mean((bos.PRICE - lm.predict(X)) ** 2)\n", 798 | "print mseFull" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "## Relationship between `PTRATIO` and housing price\n", 806 | "***\n", 807 | "\n", 808 | "Try fitting a linear regression model using only the 'PTRATIO' (pupil-teacher ratio by town)\n", 809 | "\n", 810 | "Calculate the mean squared error. \n" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "metadata": { 817 | "collapsed": false 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "lm = LinearRegression()\n", 822 | "lm.fit(X[['PTRATIO']], bos.PRICE)" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": null, 828 | "metadata": { 829 | "collapsed": false 830 | }, 831 | "outputs": [], 832 | "source": [ 833 | "msePTRATIO = np.mean((bos.PRICE - lm.predict(X[['PTRATIO']])) ** 2)\n", 834 | "print msePTRATIO" 835 | ] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": {}, 840 | "source": [ 841 | "We can also plot the fitted linear regression line. " 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "metadata": { 848 | "collapsed": false 849 | }, 850 | "outputs": [], 851 | "source": [ 852 | "plt.scatter(bos.PTRATIO, bos.PRICE)\n", 853 | "plt.xlabel(\"Pupil-to-Teacher Ratio (PTRATIO)\")\n", 854 | "plt.ylabel(\"Housing Price\")\n", 855 | "plt.title(\"Relationship between PTRATIO and Price\")\n", 856 | "\n", 857 | "plt.plot(bos.PTRATIO, lm.predict(X[['PTRATIO']]), color='blue', linewidth=3)\n", 858 | "plt.show()" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "# Your turn\n", 866 | "***\n", 867 | "\n", 868 | "Try fitting a linear regression model using three independent variables\n", 869 | "\n", 870 | "1. 'CRIM' (per capita crime rate by town)\n", 871 | "2. 'RM' (average number of rooms per dwelling)\n", 872 | "3. 'PTRATIO' (pupil-teacher ratio by town)\n", 873 | "\n", 874 | "Calculate the mean squared error. 
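If you want to sanity-check your attempt, here is one minimal sketch of a possible solution; it assumes the `bos` DataFrame built earlier in this lab is still in scope:

```python
# One possible solution sketch -- assumes `bos` from earlier in the lab.
import numpy as np
from sklearn.linear_model import LinearRegression

X3 = bos[['CRIM', 'RM', 'PTRATIO']]    # the three chosen predictors
lm3 = LinearRegression()
lm3.fit(X3, bos.PRICE)

# Mean squared error of the three-variable model
mse3 = np.mean((bos.PRICE - lm3.predict(X3)) ** 2)
print(mse3)
```

Because these models are nested, this in-sample MSE should fall between the full 13-variable model's MSE and the PTRATIO-only model's MSE.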
" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": null, 880 | "metadata": { 881 | "collapsed": false 882 | }, 883 | "outputs": [], 884 | "source": [ 885 | "# your turn" 886 | ] 887 | }, 888 | { 889 | "cell_type": "markdown", 890 | "metadata": {}, 891 | "source": [ 892 | "\n", 893 | "## Other important things to think about when fitting a linear regression model\n", 894 | "***\n", 895 | "
\n", 896 | "
    \n", 897 | "
  • **Linearity**. The dependent variable $Y$ is a linear combination of the regression coefficients and the independent variables $X$.
  • \n", 898 | "
  • **Constant standard deviation**. The SD of the dependent variable $Y$ should be constant for different values of X. \n", 899 | "
      \n", 900 | "
    • e.g. PTRATIO\n", 901 | "
    \n", 902 | "
  • \n", 903 | "
  • **Normal distribution for errors**. The $\\epsilon$ term we discussed at the beginning are assumed to be normally distributed. \n", 904 | " $$ \\epsilon_i \\sim N(0, \\sigma^2)$$\n", 905 | "Sometimes the distributions of responses $Y$ may not be normally distributed at any given value of $X$. e.g. skewed positively or negatively.
  • \n", 906 | "
  • **Independent errors**. The observations are assumed to be obtained independently.\n", 907 | "
      \n", 908 | "
    • e.g. Observations across time may be correlated\n", 909 | "
    \n", 910 | "
  • \n", 911 | "
\n", 912 | "\n", 913 | "
\n" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "metadata": {}, 919 | "source": [ 920 | "# Part 3: Training and Test Data sets\n", 921 | "\n", 922 | "### Purpose of splitting data into Training/testing sets\n", 923 | "***\n", 924 | "
\n", 925 | "\n", 926 | "

Let's stick to the linear regression example:

\n", 927 | "
    \n", 928 | "
  • We built our model with the requirement that the model fit the data well.
  • \n", 929 | "
  • As a side-effect, the model will fit THIS dataset well. What about new data?
  • \n", 930 | "
      \n", 931 | "
    • We wanted the model for predictions, right?
    • \n", 932 | "
    \n", 933 | "
  • One simple solution, leave out some data (for testing) and train the model on the rest
  • \n", 934 | "
  • This also leads directly to the idea of cross-validation, next section.
  • \n", 935 | "
\n", 936 | "
\n", 937 | "\n", 938 | "***\n", 939 | "\n", 940 | "One way of doing this is you can create training and testing data sets manually. " 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": null, 946 | "metadata": { 947 | "collapsed": false 948 | }, 949 | "outputs": [], 950 | "source": [ 951 | "X_train = X[:-50]\n", 952 | "X_test = X[-50:]\n", 953 | "Y_train = bos.PRICE[:-50]\n", 954 | "Y_test = bos.PRICE[-50:]\n", 955 | "print X_train.shape\n", 956 | "print X_test.shape\n", 957 | "print Y_train.shape\n", 958 | "print Y_test.shape" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": {}, 964 | "source": [ 965 | "Another way, is to split the data into random train and test subsets using the function `train_test_split` in `sklearn.cross_validation`. " 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "execution_count": null, 971 | "metadata": { 972 | "collapsed": false 973 | }, 974 | "outputs": [], 975 | "source": [ 976 | "# let's look at the function in the help file\n", 977 | "# sklearn.cross_validation.train_test_split?" 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "execution_count": null, 983 | "metadata": { 984 | "collapsed": false 985 | }, 986 | "outputs": [], 987 | "source": [ 988 | "X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(\n", 989 | " X, bos.PRICE, test_size=0.33, random_state = 5)\n", 990 | "print X_train.shape\n", 991 | "print X_test.shape\n", 992 | "print Y_train.shape\n", 993 | "print Y_test.shape" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "Your turn. Let's build a linear regression model using our new training data sets. " 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": null, 1006 | "metadata": { 1007 | "collapsed": false 1008 | }, 1009 | "outputs": [], 1010 | "source": [ 1011 | "# your turn\n", 1012 | "lm = LinearRegression()\n", 1013 | "lm.fit(X_train, Y_train)\n", 1014 | "pred_train = lm.predict(X_train)\n", 1015 | "pred_test = lm.predict(X_test)" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": {}, 1021 | "source": [ 1022 | "Now, calculate the mean squared error using just the test data and compare to mean squared from using all the data to fit the model. 
" 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "code", 1027 | "execution_count": null, 1028 | "metadata": { 1029 | "collapsed": false 1030 | }, 1031 | "outputs": [], 1032 | "source": [ 1033 | "# your turn\n", 1034 | "print \"Fit a model X_train, and calculate MSE with Y_train:\", np.mean((Y_train - lm.predict(X_train)) ** 2)\n", 1035 | "print \"Fit a model X_train, and calculate MSE with X_test, Y_test:\", np.mean((Y_test - lm.predict(X_test)) ** 2)" 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "markdown", 1040 | "metadata": {}, 1041 | "source": [ 1042 | "#### Residual plots" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "code", 1047 | "execution_count": null, 1048 | "metadata": { 1049 | "collapsed": false 1050 | }, 1051 | "outputs": [], 1052 | "source": [ 1053 | "plt.scatter(lm.predict(X_train), lm.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)\n", 1054 | "plt.scatter(lm.predict(X_test), lm.predict(X_test) - Y_test, c='g', s=40)\n", 1055 | "plt.hlines(y = 0, xmin=0, xmax = 50)\n", 1056 | "plt.title('Residual Plot using training (blue) and test (green) data')\n", 1057 | "plt.ylabel('Residuals')" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "markdown", 1062 | "metadata": {}, 1063 | "source": [ 1064 | "### K-fold Cross-validation as an extension of this idea\n", 1065 | "***\n", 1066 | "
\n", 1067 | "\n", 1068 | "

A simple extension of the Test/train split is called K-fold cross-validation.

\n", 1069 | "\n", 1070 | "

Here's the procedure:

\n", 1071 | "
    \n", 1072 | "
  • randomly assign your $n$ samples to one of $K$ groups. They'll each have about $n/k$ samples
  • \n", 1073 | "
  • For each group $k$:
  • \n", 1074 | "
      \n", 1075 | "
    • Fit the model (e.g. run regression) on all data excluding the $k^{th}$ group
    • \n", 1076 | "
    • Use the model to predict the outcomes in group $k$
    • \n", 1077 | "
    • Calculate your prediction error for each observation in $k^{th}$ group (e.g. $(Y_i - \\hat{Y}_i)^2$ for regression, $\\mathbb{1}(Y_i = \\hat{Y}_i)$ for logistic regression).
    • \n", 1078 | "
    \n", 1079 | "
  • Calculate the average prediction error across all samples $Err_{CV} = \\frac{1}{n}\\sum_{i=1}^n (Y_i - \\hat{Y}_i)^2$
  • \n", 1080 | "
\n", 1081 | "
\n", 1082 | "\n", 1083 | "***\n", 1084 | "\n", 1085 | "Luckily you don't have to do this entire process all by hand (``for`` loops, etc.) every single time, ``sci-kit learn`` has a very nice implementation of this, have a look at the [documentation](http://scikit-learn.org/stable/modules/cross_validation.html)." 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "markdown", 1090 | "metadata": {}, 1091 | "source": [ 1092 | "## Another Example: Old Faithful Geyser Data Set\n", 1093 | "***\n", 1094 | "\n", 1095 | "The [Old Faithful Geyser](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/faithful.html) data set is a well-known data set that depicts the relationship of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA [[webcam]](http://yellowstone.net/webcams/). This data set is found in the base installation of the [R programming language](http://cran.r-project.org). \n", 1096 | "\n", 1097 | "`faithful` is a data set with 272 observations on 2 variables.\n", 1098 | "\n", 1099 | "Column name| Description \n", 1100 | "--- | --- \n", 1101 | "eruptions | Eruption time (in mins)\n", 1102 | "waiting\t| Waiting time to next eruption (in mins)\n", 1103 | "\n", 1104 | "There is a function in `statsmodels` (or `sm` for short) called `sm.datasets.get_rdataset` which will download and return a data set found in [R](http://cran.r-project.org). \n", 1105 | "\n", 1106 | "Let's import the `faithful` dataset. " 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": null, 1112 | "metadata": { 1113 | "collapsed": false 1114 | }, 1115 | "outputs": [], 1116 | "source": [ 1117 | "faithful = sm.datasets.get_rdataset(\"faithful\")" 1118 | ] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "execution_count": null, 1123 | "metadata": { 1124 | "collapsed": false 1125 | }, 1126 | "outputs": [], 1127 | "source": [ 1128 | "# Let's look at the help file\n", 1129 | "sm.datasets.get_rdataset?\n", 1130 | "faithful?" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "code", 1135 | "execution_count": null, 1136 | "metadata": { 1137 | "collapsed": false 1138 | }, 1139 | "outputs": [], 1140 | "source": [ 1141 | "faithful.title" 1142 | ] 1143 | }, 1144 | { 1145 | "cell_type": "code", 1146 | "execution_count": null, 1147 | "metadata": { 1148 | "collapsed": false 1149 | }, 1150 | "outputs": [], 1151 | "source": [ 1152 | "faithful = faithful.data\n", 1153 | "faithful.head()" 1154 | ] 1155 | }, 1156 | { 1157 | "cell_type": "code", 1158 | "execution_count": null, 1159 | "metadata": { 1160 | "collapsed": false 1161 | }, 1162 | "outputs": [], 1163 | "source": [ 1164 | "faithful.shape" 1165 | ] 1166 | }, 1167 | { 1168 | "cell_type": "markdown", 1169 | "metadata": {}, 1170 | "source": [ 1171 | "### Histogram \n", 1172 | "***\n", 1173 | "\n", 1174 | "Create a histogram of the time between eruptions. What do you see? 
" 1175 | ] 1176 | }, 1177 | { 1178 | "cell_type": "code", 1179 | "execution_count": null, 1180 | "metadata": { 1181 | "collapsed": false 1182 | }, 1183 | "outputs": [], 1184 | "source": [ 1185 | "plt.hist(faithful.waiting)\n", 1186 | "plt.xlabel('Waiting time to next eruption (in mins)')\n", 1187 | "plt.ylabel('Frequency')\n", 1188 | "plt.title('Old Faithful Geyser time between eruption')\n", 1189 | "plt.show()" 1190 | ] 1191 | }, 1192 | { 1193 | "cell_type": "markdown", 1194 | "metadata": {}, 1195 | "source": [ 1196 | "This histogram indicates [Old Faithful isn’t as “faithful” as you might think](http://people.stern.nyu.edu/jsimonof/classes/2301/pdf/geystime.pdf). " 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "markdown", 1201 | "metadata": {}, 1202 | "source": [ 1203 | "### Scatter plot \n", 1204 | "***\n", 1205 | "\n", 1206 | "Create a scatter plot of the `waiting` on the x-axis and the `eruptions` on the y-axis. " 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "code", 1211 | "execution_count": null, 1212 | "metadata": { 1213 | "collapsed": false 1214 | }, 1215 | "outputs": [], 1216 | "source": [ 1217 | "plt.scatter(faithful.waiting, faithful.eruptions)\n", 1218 | "plt.xlabel('Waiting time to next eruption (in mins)')\n", 1219 | "plt.ylabel('Eruption time (in mins)')\n", 1220 | "plt.title('Old Faithful Geyser')\n", 1221 | "plt.show()\n" 1222 | ] 1223 | }, 1224 | { 1225 | "cell_type": "markdown", 1226 | "metadata": {}, 1227 | "source": [ 1228 | "### Build a linear regression to predict eruption time using `statsmodels`\n", 1229 | "***\n", 1230 | "\n", 1231 | "Now let's build a linear regression model for the `faithful` DataFrame, and estimate the next eruption duration if the waiting time since the last eruption has been 75 minutes.\n", 1232 | "\n", 1233 | "$$ Eruptions = \\beta_0 + \\beta_1 * Waiting + \\epsilon $$ " 1234 | ] 1235 | }, 1236 | { 1237 | "cell_type": "code", 1238 | "execution_count": null, 1239 | "metadata": { 1240 | "collapsed": false 1241 | }, 1242 | "outputs": [], 1243 | "source": [ 1244 | "X = faithful.waiting\n", 1245 | "y = faithful.eruptions\n", 1246 | "model = sm.OLS(y, X)" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": null, 1252 | "metadata": { 1253 | "collapsed": false 1254 | }, 1255 | "outputs": [], 1256 | "source": [ 1257 | "# Let's look at the options in model\n", 1258 | "# model." 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "code", 1263 | "execution_count": null, 1264 | "metadata": { 1265 | "collapsed": false 1266 | }, 1267 | "outputs": [], 1268 | "source": [ 1269 | "results = model.fit()" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": null, 1275 | "metadata": { 1276 | "collapsed": false 1277 | }, 1278 | "outputs": [], 1279 | "source": [ 1280 | "# Let's look at the options in results\n", 1281 | "# results." 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "code", 1286 | "execution_count": null, 1287 | "metadata": { 1288 | "collapsed": false 1289 | }, 1290 | "outputs": [], 1291 | "source": [ 1292 | "print results.summary()" 1293 | ] 1294 | }, 1295 | { 1296 | "cell_type": "code", 1297 | "execution_count": null, 1298 | "metadata": { 1299 | "collapsed": false 1300 | }, 1301 | "outputs": [], 1302 | "source": [ 1303 | "results.params.values" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": {}, 1309 | "source": [ 1310 | "We notice, there is no intercept ($\\beta_0$) fit in this linear model. To add it, we can use the function `sm.add_constant`. 
" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "code", 1315 | "execution_count": null, 1316 | "metadata": { 1317 | "collapsed": false 1318 | }, 1319 | "outputs": [], 1320 | "source": [ 1321 | "X = sm.add_constant(X)\n", 1322 | "X.head()" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "markdown", 1327 | "metadata": {}, 1328 | "source": [ 1329 | "Now let's fit a linear regression model with an intercept. " 1330 | ] 1331 | }, 1332 | { 1333 | "cell_type": "code", 1334 | "execution_count": null, 1335 | "metadata": { 1336 | "collapsed": false 1337 | }, 1338 | "outputs": [], 1339 | "source": [ 1340 | "modelW0 = sm.OLS(y, X)\n", 1341 | "resultsW0 = modelW0.fit()\n", 1342 | "print resultsW0.summary()" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": {}, 1348 | "source": [ 1349 | "If you want to predict the time to the next eruption using a waiting time of 75, you can directly estimate this using the equation \n", 1350 | "\n", 1351 | "$$ \\hat{y} = \\hat{\\beta}_0 + \\hat{\\beta}_1 * 75 $$ \n", 1352 | "\n", 1353 | "or you can use `results.predict`. " 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "code", 1358 | "execution_count": null, 1359 | "metadata": { 1360 | "collapsed": false 1361 | }, 1362 | "outputs": [], 1363 | "source": [ 1364 | "newX = np.array([1,75])\n", 1365 | "resultsW0.params[0]*newX[0] + resultsW0.params[1] * newX[1]" 1366 | ] 1367 | }, 1368 | { 1369 | "cell_type": "code", 1370 | "execution_count": null, 1371 | "metadata": { 1372 | "collapsed": false 1373 | }, 1374 | "outputs": [], 1375 | "source": [ 1376 | "resultsW0.predict(newX)" 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "markdown", 1381 | "metadata": {}, 1382 | "source": [ 1383 | "Based on this linear regression, if the waiting time since the last eruption has been 75 minutes, we expect the next one to last approximately 3.80 minutes." 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "metadata": {}, 1389 | "source": [ 1390 | "### Plot the regression line \n", 1391 | "***\n", 1392 | "\n", 1393 | "Instead of using `resultsW0.predict(X)`, we can use `resultsW0.fittedvalues` which are the $\\hat{y}$. " 1394 | ] 1395 | }, 1396 | { 1397 | "cell_type": "code", 1398 | "execution_count": null, 1399 | "metadata": { 1400 | "collapsed": false 1401 | }, 1402 | "outputs": [], 1403 | "source": [ 1404 | "plt.scatter(faithful.waiting, faithful.eruptions)\n", 1405 | "plt.xlabel('Waiting time to next eruption (in mins)')\n", 1406 | "plt.ylabel('Eruption time (in mins)')\n", 1407 | "plt.title('Old Faithful Geyser')\n", 1408 | "\n", 1409 | "plt.plot(faithful.waiting, resultsW0.fittedvalues, color='blue', linewidth=3)\n", 1410 | "plt.show()\n" 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "markdown", 1415 | "metadata": {}, 1416 | "source": [ 1417 | "### Residuals, residual sum of squares, mean squared error\n", 1418 | "***\n", 1419 | "\n", 1420 | "Recall, we can directly calculate the residuals as \n", 1421 | "\n", 1422 | "$$r_i = y_i - (\\hat{\\beta}_0 + \\hat{\\beta}_1 x_i)$$\n", 1423 | "\n", 1424 | "To calculate the residual sum of squares, \n", 1425 | "\n", 1426 | "$$ S = \\sum_{i=1}^n r_i = \\sum_{i=1}^n (y_i - (\\hat{\\beta}_0 + \\hat{\\beta}_1 x_i))^2 $$\n", 1427 | "\n", 1428 | "where $n$ is the number of observations. 
Alternatively, we can simply ask for the residuals using `resultsW0.predict`" 1429 | ] 1430 | }, 1431 | { 1432 | "cell_type": "code", 1433 | "execution_count": null, 1434 | "metadata": { 1435 | "collapsed": false 1436 | }, 1437 | "outputs": [], 1438 | "source": [ 1439 | "resids = faithful.eruptions - resultsW0.predict(X)\n" 1440 | ] 1441 | }, 1442 | { 1443 | "cell_type": "code", 1444 | "execution_count": null, 1445 | "metadata": { 1446 | "collapsed": false 1447 | }, 1448 | "outputs": [], 1449 | "source": [ 1450 | "resids = resultsW0.resid\n" 1451 | ] 1452 | }, 1453 | { 1454 | "cell_type": "code", 1455 | "execution_count": null, 1456 | "metadata": { 1457 | "collapsed": false 1458 | }, 1459 | "outputs": [], 1460 | "source": [ 1461 | "plt.plot(faithful.waiting, resids, 'o')\n", 1462 | "plt.hlines(y = 0, xmin=40, xmax = 100)\n", 1463 | "plt.xlabel('Waiting time')\n", 1464 | "plt.ylabel('Residuals')\n", 1465 | "plt.title('Residual Plot')\n", 1466 | "plt.show()" 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "markdown", 1471 | "metadata": {}, 1472 | "source": [ 1473 | "The residual sum of squares: " 1474 | ] 1475 | }, 1476 | { 1477 | "cell_type": "code", 1478 | "execution_count": null, 1479 | "metadata": { 1480 | "collapsed": false 1481 | }, 1482 | "outputs": [], 1483 | "source": [ 1484 | "print np.sum((faithful.eruptions - resultsW0.predict(X)) ** 2)" 1485 | ] 1486 | }, 1487 | { 1488 | "cell_type": "markdown", 1489 | "metadata": {}, 1490 | "source": [ 1491 | "Mean squared error: " 1492 | ] 1493 | }, 1494 | { 1495 | "cell_type": "code", 1496 | "execution_count": null, 1497 | "metadata": { 1498 | "collapsed": false 1499 | }, 1500 | "outputs": [], 1501 | "source": [ 1502 | "print np.mean((faithful.eruptions - resultsW0.predict(X)) ** 2)" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "markdown", 1507 | "metadata": {}, 1508 | "source": [ 1509 | "## Build a linear regression to predict eruption time using least squares \n", 1510 | "***\n", 1511 | "\n", 1512 | "Now let's build a linear regression model for the `faithful` DataFrame, but instead of using `statmodels` (or `sklearn`), let's use the least squares estimates of the coefficients for the linear regression model.\n", 1513 | "\n", 1514 | "$$ \\hat{\\beta} = (X^{\\top}X)^{-1} X^{\\top}Y $$ \n", 1515 | "\n", 1516 | "The `numpy` function [`np.dot`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html#numpy.dot) is the dot product (or inner product) of two vectors (or arrays in python). \n", 1517 | "\n", 1518 | "The `numpy` function [`np.linalg.inv`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html#numpy.linalg.inv) can be used to compute the inverse of a matrix. " 1519 | ] 1520 | }, 1521 | { 1522 | "cell_type": "code", 1523 | "execution_count": null, 1524 | "metadata": { 1525 | "collapsed": false 1526 | }, 1527 | "outputs": [], 1528 | "source": [ 1529 | "X = sm.add_constant(faithful.waiting)\n", 1530 | "y = faithful.eruptions\n" 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "markdown", 1535 | "metadata": {}, 1536 | "source": [ 1537 | "First, compute $X^{\\top}X$\n" 1538 | ] 1539 | }, 1540 | { 1541 | "cell_type": "code", 1542 | "execution_count": null, 1543 | "metadata": { 1544 | "collapsed": false 1545 | }, 1546 | "outputs": [], 1547 | "source": [ 1548 | "np.dot(X.T, X)\n" 1549 | ] 1550 | }, 1551 | { 1552 | "cell_type": "markdown", 1553 | "metadata": {}, 1554 | "source": [ 1555 | "Next, compute the inverse of $X^{\\top}X$ or $(X^{\\top}X)^{-1}$. 
" 1556 | ] 1557 | }, 1558 | { 1559 | "cell_type": "code", 1560 | "execution_count": null, 1561 | "metadata": { 1562 | "collapsed": false 1563 | }, 1564 | "outputs": [], 1565 | "source": [ 1566 | "np.linalg.inv(np.dot(X.T, X))" 1567 | ] 1568 | }, 1569 | { 1570 | "cell_type": "markdown", 1571 | "metadata": {}, 1572 | "source": [ 1573 | "Finally, compute $\\hat{\\beta} = (X^{\\top}X)^{-1} X^{\\top}Y $" 1574 | ] 1575 | }, 1576 | { 1577 | "cell_type": "code", 1578 | "execution_count": null, 1579 | "metadata": { 1580 | "collapsed": false 1581 | }, 1582 | "outputs": [], 1583 | "source": [ 1584 | "beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)\n", 1585 | "print \"Directly estimating beta:\", beta\n", 1586 | "print \"Estimating beta using statmodels: \", resultsW0.params.values\n", 1587 | "\n", 1588 | "\n" 1589 | ] 1590 | }, 1591 | { 1592 | "cell_type": "markdown", 1593 | "metadata": {}, 1594 | "source": [ 1595 | "# Part 4: Many different types of regression\n", 1596 | "***\n", 1597 | "
\n", 1598 | "\n", 1599 | "

You do not always have a continuous $y$ variable that you are measuring. Sometimes it may be binary (e.g. 0 or 1). Sometimes it may be count data. What do you do?

\n", 1600 | "\n", 1601 | "

Use other types of regression besides just simple linear regression.

\n", 1602 | "\n", 1603 | "

[Nice summary of several types of regression](http://www.datasciencecentral.com/profiles/blogs/10-types-of-regressions-which-one-to-use).

\n", 1604 | "
\n" 1605 | ] 1606 | }, 1607 | { 1608 | "cell_type": "markdown", 1609 | "metadata": {}, 1610 | "source": [ 1611 | "# Part 5: Logistic Regression\n", 1612 | "***\n" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "markdown", 1617 | "metadata": {}, 1618 | "source": [ 1619 | "
\n", 1620 | "

Logistic regression is a probabilistic model that links observed binary data to a set of features.

\n", 1621 | "\n", 1622 | "

Suppose that we have a set of binary (that is, taking the values 0 or 1) observations $Y_1,\\cdots,Y_n$, and for each observation $Y_i$ we have a vector of features $X_i$. The logistic regression model assumes that there is some set of **weights**, **coefficients**, or **parameters** $\\beta$, one for each feature, so that the data were generated by flipping a weighted coin whose probability of giving a 1 is given by the following equation:\n", 1623 | "\n", 1624 | "$$\n", 1625 | "P(Y_i = 1) = \\mathrm{logistic}(\\sum \\beta_i X_i),\n", 1626 | "$$\n", 1627 | "\n", 1628 | "where\n", 1629 | "\n", 1630 | "$$\n", 1631 | "\\mathrm{logistic}(x) = \\frac{e^x}{1+e^x}.\n", 1632 | "$$\n", 1633 | "

\n", 1634 | "

When we *fit* a logistic regression model, we determine values for each $\\beta$ that allows the model to best fit the *training data* we have observed. Once we do this, we can use these coefficients to make predictions about data we have not yet observed.

\n", 1635 | "\n", 1636 | "
" 1637 | ] 1638 | }, 1639 | { 1640 | "cell_type": "markdown", 1641 | "metadata": {}, 1642 | "source": [ 1643 | "From http://www.edwardtufte.com/tufte/ebooks, in \"Visual and Statistical Thinking: \n", 1644 | "Displays of Evidence for Making Decisions\":\n", 1645 | "\n", 1646 | ">On January 28, 1986, the space shuttle Challenger exploded and seven astronauts died because two rubber O-rings leaked. These rings had lost their resiliency because the shuttle was launched on a very cold day. Ambient temperatures were in the low 30s and the O-rings themselves were much colder, less than 20F.\n", 1647 | "\n", 1648 | ">One day before the flight, the predicted temperature for the launch was 26F to 29F. Concerned that the rings would not seal at such a cold temperature, the engineers who designed the rocket opposed launching Challenger the next day.\n", 1649 | "\n", 1650 | "But they did not make their case persuasively, and were over-ruled by NASA." 1651 | ] 1652 | }, 1653 | { 1654 | "cell_type": "code", 1655 | "execution_count": null, 1656 | "metadata": { 1657 | "collapsed": false 1658 | }, 1659 | "outputs": [], 1660 | "source": [ 1661 | "from IPython.display import Image as Im\n", 1662 | "from IPython.display import display\n", 1663 | "Im('./images/shuttle.png')" 1664 | ] 1665 | }, 1666 | { 1667 | "cell_type": "markdown", 1668 | "metadata": {}, 1669 | "source": [ 1670 | "The image above shows the leak, where the O-ring failed.\n", 1671 | "\n", 1672 | "We have here data on previous failures of the O-rings at various temperatures." 1673 | ] 1674 | }, 1675 | { 1676 | "cell_type": "code", 1677 | "execution_count": null, 1678 | "metadata": { 1679 | "collapsed": false 1680 | }, 1681 | "outputs": [], 1682 | "source": [ 1683 | "data=np.array([[float(j) for j in e.strip().split()] for e in open(\"./data/chall.txt\")])\n", 1684 | "data" 1685 | ] 1686 | }, 1687 | { 1688 | "cell_type": "markdown", 1689 | "metadata": {}, 1690 | "source": [ 1691 | "Lets plot this data" 1692 | ] 1693 | }, 1694 | { 1695 | "cell_type": "code", 1696 | "execution_count": null, 1697 | "metadata": { 1698 | "collapsed": false 1699 | }, 1700 | "outputs": [], 1701 | "source": [ 1702 | "# fit logistic regression model\n", 1703 | "import statsmodels.api as sm\n", 1704 | "from statsmodels.formula.api import logit, glm, ols\n", 1705 | "\n", 1706 | "# statsmodels works nicely with pandas dataframes\n", 1707 | "dat = pd.DataFrame(data, columns = ['Temperature', 'Failure'])\n", 1708 | "logit_model = logit('Failure ~ Temperature',dat).fit()\n", 1709 | "print logit_model.summary()\n" 1710 | ] 1711 | }, 1712 | { 1713 | "cell_type": "markdown", 1714 | "metadata": {}, 1715 | "source": [ 1716 | "#### Interpreting p-values:\n", 1717 | "Generally we'd like the p-values to be very small as they represent the probability that we observed such an strong relationship between temperature and O-ring failures purely by chance. So when the p-value is small (we usually consider \"small\" as less than 0.05), what we're saying is that based on the data we observed, we know **fairly certainly** that temperature is strongly associated with the failure of O-rings. This is a very powerful statement that can take us a long way in terms of learning if used properly. Have a look at the Wikipedia page on [p-values](https://en.wikipedia.org/wiki/P-value) for a quick reminder.\n", 1718 | "\n", 1719 | "There are some issues with testing many many hypotheses that we'll also encounter in the homework. 
But generally, the idea is that the data may (or may not) have information about things you're interested in. We ask the data questions through hypotheses, but the more questions we ask of it, the higher the chance that the data will show associations purely at random. Have a look at [this article](https://en.wikipedia.org/wiki/Multiple_comparisons_problem) to get an idea of the problem and some solutions. We'll be considering a very crude solution known as the [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction), but that is by no means the best solution. " 1720 | ] 1721 | }, 1722 | { 1723 | "cell_type": "code", 1724 | "execution_count": null, 1725 | "metadata": { 1726 | "collapsed": true 1727 | }, 1728 | "outputs": [], 1729 | "source": [ 1730 | "# calculate predicted failure probabilities for new temperatures\n", 1731 | "x = np.linspace(50, 85, 1000)\n", 1732 | "p = logit_model.params\n", 1733 | "eta = p['Intercept'] + x*p['Temperature']\n", 1734 | "y = np.exp(eta)/(1 + np.exp(eta))" 1735 | ] 1736 | }, 1737 | { 1738 | "cell_type": "markdown", 1739 | "metadata": {}, 1740 | "source": [ 1741 | "Let's plot the data along with a range of predicted failure probabilities for unobserved temperatures." 1742 | ] 1743 | }, 1744 | { 1745 | "cell_type": "code", 1746 | "execution_count": null, 1747 | "metadata": { 1748 | "collapsed": false 1749 | }, 1750 | "outputs": [], 1751 | "source": [ 1752 | "# plot data\n", 1753 | "temps, pfail = data[:,0], data[:,1]\n", 1754 | "plt.scatter(temps, pfail)\n", 1755 | "axes=plt.gca()\n", 1756 | "plt.xlabel('Temperature')\n", 1757 | "plt.ylabel('Failure')\n", 1758 | "plt.title('O-ring failures')\n", 1759 | "\n", 1760 | "# plot fitted values\n", 1761 | "plt.plot(x, y)\n", 1762 | "\n", 1763 | "# change limits, for a nicer plot\n", 1764 | "plt.xlim(50, 85)\n", 1765 | "plt.ylim(-0.1, 1.1)\n" 1766 | ] 1767 | }, 1768 | { 1769 | "cell_type": "markdown", 1770 | "metadata": {}, 1771 | "source": [ 1772 | "We can interpret the output from a logistic regression by looking at the coefficient of temperature (as well as the p-value). Since the coefficient of temperature is negative, we can say that an increase in temperature is associated with a decrease in the odds of having an O-ring failure. 
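Two small follow-ups are worth sketching here (these assume `logit_model` from the fit above is in scope): exponentiating the temperature coefficient gives the multiplicative change in the odds of failure per degree, and plugging in the roughly 31F launch-day temperature from the Tufte excerpt extrapolates the model well below the coldest observed launch (53F), so treat that number with caution:

```python
import numpy as np

# assumes logit_model was fit in the cells above
p = logit_model.params

# odds ratio: multiplicative change in the odds of failure per 1F increase
print "odds ratio per degree F:", np.exp(p['Temperature'])

# extrapolated failure probability near the ~31F launch-day temperature
eta31 = p['Intercept'] + 31 * p['Temperature']
print "P(failure at 31F) ~", np.exp(eta31) / (1 + np.exp(eta31))
```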
" 1773 | ] 1774 | } 1775 | ], 1776 | "metadata": { 1777 | "kernelspec": { 1778 | "display_name": "Python 2", 1779 | "language": "python", 1780 | "name": "python2" 1781 | }, 1782 | "language_info": { 1783 | "codemirror_mode": { 1784 | "name": "ipython", 1785 | "version": 2 1786 | }, 1787 | "file_extension": ".py", 1788 | "mimetype": "text/x-python", 1789 | "name": "python", 1790 | "nbconvert_exporter": "python", 1791 | "pygments_lexer": "ipython2", 1792 | "version": "2.7.10" 1793 | } 1794 | }, 1795 | "nbformat": 4, 1796 | "nbformat_minor": 0 1797 | } 1798 | -------------------------------------------------------------------------------- /Lab4-stats_original.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# CS-109: Fall 2015 -- Lab 4\n", 8 | "\n", 9 | "# Regression in Python\n", 10 | "\n", 11 | "***\n", 12 | "This is a very quick run-through of some basic statistical concepts\n", 13 | "\n", 14 | "* Regression Models\n", 15 | " * Linear, Logistic\n", 16 | "* Prediction using linear regression\n", 17 | "* Some re-sampling methods \n", 18 | " * Train-Test splits\n", 19 | " * Cross Validation\n", 20 | "\n", 21 | "Linear regression is used to model and predict continuous outcomes while logistic regression is used to model binary outcomes. We'll see some examples of linear regression as well as Train-test splits.\n", 22 | "\n", 23 | "\n", 24 | "The packages we'll cover are: `statsmodels`, `seaborn`, and `scikit-learn`.\n", 25 | "***" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "\n", 33 | "***" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "# special IPython command to prepare the notebook for matplotlib\n", 45 | "%matplotlib inline \n", 46 | "\n", 47 | "import numpy as np\n", 48 | "import pandas as pd\n", 49 | "import scipy.stats as stats\n", 50 | "import matplotlib.pyplot as plt\n", 51 | "import sklearn\n", 52 | "import statsmodels.api as sm\n", 53 | "\n", 54 | "import seaborn as sns\n", 55 | "sns.set_style(\"whitegrid\")\n", 56 | "sns.set_context(\"poster\")\n", 57 | "\n", 58 | "# special matplotlib argument for improved plots\n", 59 | "from matplotlib import rcParams\n" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "# Part 0: Piazza Posting Guidelines\n", 67 | "***\n", 68 | "
\n", 69 | "The high volume of posts on piazza has made it very difficult for us to answer all of your questions. For this reason, we are taking measures to decrease the volume of posts, and increase their quality. Below is the general format that we now require for all posts. ONLY posts that are in this format will be answered by staff. Posts that are not in this format will be made private, and students will be asked to reformat their question.\n", 70 | "
\n", 71 | " \n", 72 | "1. At the top of your post, make a list of all the keywords you entered in your piazza search when looking for answers to your question. Also include a list of keywords you searched in google. Provide links to all the posts that you read that were relevant, but did not quite answer your question. \n", 73 | " \n", 74 | "2. If you are sure that your question is not a duplicate question, write down your question in as much detail as possible without posting code. You can post your error messages, and unit tests.\n", 75 | " \n", 76 | "3. Include the steps you have taken for debugging, and what the outcome was. You must have spent at least 30 minutes trying to debug before you post a question. \n", 77 | " \n", 78 | "4. Post your question in the most specific folder possible, for example hw1-1.1 (see @564)\n", 79 | " \n", 80 | "5. Follow up! Describe the solution that worked for you. Also, click the \"resolve\" button if your problem has been resolved.\n", 81 | " \n", 82 | "See @310 for general posting guidelines and etiquette. " 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "***\n", 90 | "# Part 1: Linear Regression\n", 91 | "### Purpose of linear regression\n", 92 | "***\n", 93 | "
\n", 94 | "\n", 95 | "

Given a dataset $X$ and $Y$, linear regression can be used to:

\n", 96 | "
    \n", 97 | "
  • Build a predictive model to predict future values of $X_i$ without a $Y$ value.
  • \n", 98 | "
  • Model the strength of the relationship between each dependent variable $X_i$ and $Y$
  • \n", 99 | "
      \n", 100 | "
    • Sometimes not all $X_i$ will have a relationship with $Y$
    • \n", 101 | "
    • Need to figure out which $X_i$ contributes most information to determine $Y$
    • \n", 102 | "
    \n", 103 | "
  • Linear regression is used in so many applications that I won't warrant this with examples. It is in many cases, the first pass prediction algorithm for continuous outcomes.
  • \n", 104 | "
\n", 105 | "
\n", 106 | "\n", 107 | "### A brief recap\n", 108 | "***\n", 109 | "\n", 110 | "[Linear Regression](http://en.wikipedia.org/wiki/Linear_regression) is a method to model the relationship between a set of independent variables $X$ (also knowns as explanatory variables, features, predictors) and a dependent variable $Y$. This method assumes the relationship between each predictor $X$ is linearly related to the dependent variable $Y$. \n", 111 | "\n", 112 | "$$ Y = \\beta_0 + \\beta_1 X + \\epsilon$$\n", 113 | "\n", 114 | "where $\\epsilon$ is considered as an unobservable random variable that adds noise to the linear relationship. This is the simplest form of linear regression (one variable), we'll call this the simple model. \n", 115 | "\n", 116 | "* $\\beta_0$ is the intercept of the linear model\n", 117 | "\n", 118 | "* Multiple linear regression is when you have more than one independent variable\n", 119 | " * $X_1$, $X_2$, $X_3$, $\\ldots$\n", 120 | "\n", 121 | "$$ Y = \\beta_0 + \\beta_1 X_1 + \\ldots + \\beta_p X_p + \\epsilon$$ \n", 122 | "\n", 123 | "* Back to the simple model. The model in linear regression is the *conditional mean* of $Y$ given the values in $X$ is expressed a linear function. \n", 124 | "\n", 125 | "$$ y = f(x) = E(Y | X = x)$$ \n", 126 | "\n", 127 | "![conditional mean](images/conditionalmean.png)\n", 128 | "http://www.learner.org/courses/againstallodds/about/glossary.html\n", 129 | "\n", 130 | "* The goal is to estimate the coefficients (e.g. $\\beta_0$ and $\\beta_1$). We represent the estimates of the coefficients with a \"hat\" on top of the letter. \n", 131 | "\n", 132 | "$$ \\hat{\\beta}_0, \\hat{\\beta}_1 $$\n", 133 | "\n", 134 | "* Once you estimate the coefficients $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$, you can use these to predict new values of $Y$\n", 135 | "\n", 136 | "$$\\hat{y} = \\hat{\\beta}_0 + \\hat{\\beta}_1 x_1$$\n", 137 | "\n", 138 | "\n", 139 | "* How do you estimate the coefficients? \n", 140 | " * There are many ways to fit a linear regression model\n", 141 | " * The method called **least squares** is one of the most common methods\n", 142 | " * We will discuss least squares today\n", 143 | " \n", 144 | "### Estimating $\\hat\\beta$: Least squares\n", 145 | "***\n", 146 | "[Least squares](http://en.wikipedia.org/wiki/Least_squares) is a method that can estimate the coefficients of a linear model by minimizing the difference between the following: \n", 147 | "\n", 148 | "$$ S = \\sum_{i=1}^N r_i = \\sum_{i=1}^N (y_i - (\\beta_0 + \\beta_1 x_i))^2 $$\n", 149 | "\n", 150 | "where $N$ is the number of observations. \n", 151 | "\n", 152 | "* We will not go into the mathematical details, but the least squares estimates $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$ minimize the sum of the squared residuals $r_i = y_i - (\\beta_0 + \\beta_1 x_i)$ in the model (i.e. makes the difference between the observed $y_i$ and linear model $\\beta_0 + \\beta_1 x_i$ as small as possible). \n", 153 | "\n", 154 | "The solution can be written in compact matrix notation as\n", 155 | "\n", 156 | "$$\\hat\\beta = (X^T X)^{-1}X^T Y$$ \n", 157 | "\n", 158 | "We wanted to show you this in case you remember linear algebra, in order for this solution to exist we need $X^T X$ to be invertible. Of course this requires a few extra assumptions, $X$ must be full rank so that $X^T X$ is invertible, etc. 
**This is important for us because this means that having redundant features in our regression models will lead to poorly fitting (and unstable) models.** We'll see an implementation of this in the extra linear regression example.\n", 159 | "\n", 160 | "**Note**: The \"hat\" means it is an estimate of the coefficient. " 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "***\n", 168 | "# Part 2: Boston Housing Data Set\n", 169 | "\n", 170 | "The [Boston Housing data set](https://archive.ics.uci.edu/ml/datasets/Housing) contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository. \n", 171 | "\n", 172 | "\n", 173 | "## Load the Boston Housing data set from `sklearn`\n", 174 | "***\n", 175 | "\n", 176 | "This data set is available in the [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston) python module which is how we will access it today. " 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "from sklearn.datasets import load_boston\n", 188 | "boston = load_boston()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "boston.keys()" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "boston.data.shape" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "# Print column names\n", 222 | "print boston.feature_names" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "collapsed": false 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "# Print description of Boston housing data set\n", 234 | "print boston.DESCR" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "Now let's explore the data set itself. " 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "bos = pd.DataFrame(boston.data)\n", 253 | "bos.head()" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "There are no column names in the DataFrame. Let's add those. " 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "bos.columns = boston.feature_names\n", 272 | "bos.head()" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "Now we have a pandas DataFrame called `bos` containing all the data we want to use to predict Boston Housing prices. Let's create a variable called `PRICE` which will contain the prices. This information is contained in the `target` data. 
" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "print boston.target.shape" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": { 297 | "collapsed": false 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "bos['PRICE'] = boston.target\n", 302 | "bos.head()" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "## EDA and Summary Statistics\n", 310 | "***\n", 311 | "\n", 312 | "Let's explore this data set. First we use `describe()` to get basic summary statistics for each of the columns. " 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": { 319 | "collapsed": false 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "bos.describe()" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "### Scatter plots\n", 331 | "***\n", 332 | "\n", 333 | "Let's look at some scatter plots for three variables: 'CRIM', 'RM' and 'PTRATIO'. \n", 334 | "\n", 335 | "What kind of relationship do you see? e.g. positive, negative? linear? non-linear? " 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "plt.scatter(bos.CRIM, bos.PRICE)\n", 347 | "plt.xlabel(\"Per capita crime rate by town (CRIM)\")\n", 348 | "plt.ylabel(\"Housing Price\")\n", 349 | "plt.title(\"Relationship between CRIM and Price\")" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "plt.scatter(bos.RM, bos.PRICE)\n", 361 | "plt.xlabel(\"Average number of rooms per dwelling (RM)\")\n", 362 | "plt.ylabel(\"Housing Price\")\n", 363 | "plt.title(\"Relationship between RM and Price\")\n", 364 | "\n", 365 | "# sns.regplot(y=\"PRICE\", x=\"RM\", data=bos, fit_reg = True)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": { 372 | "collapsed": false 373 | }, 374 | "outputs": [], 375 | "source": [ 376 | "# We can also use seaborn regplot for this\n", 377 | "# This provides automatic linear regression fits (useful for data exploration later on)\n", 378 | "\n", 379 | "sns.regplot(y=\"PRICE\", x=\"RM\", data=bos, fit_reg = True)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "collapsed": false 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "plt.scatter(bos.PTRATIO, bos.PRICE)\n", 391 | "plt.xlabel(\"Pupil-to-Teacher Ratio (PTRATIO)\")\n", 392 | "plt.ylabel(\"Housing Price\")\n", 393 | "plt.title(\"Relationship between PTRATIO and Price\")" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### Histograms\n", 401 | "***\n" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": { 408 | "collapsed": false 409 | }, 410 | "outputs": [], 411 | "source": [ 412 | "plt.hist(bos.CRIM)\n", 413 | "plt.title(\"CRIM\")\n", 414 | "plt.xlabel(\"Crime rate per capita\")\n", 415 | "plt.ylabel(\"Frequencey\")\n", 416 | "plt.show()" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": { 423 | "collapsed": false 424 | }, 425 | 
"outputs": [], 426 | "source": [ 427 | "plt.hist(bos.PRICE)\n", 428 | "plt.title('Housing Prices: $Y_i$')\n", 429 | "plt.xlabel('Price')\n", 430 | "plt.ylabel('Frequency')\n", 431 | "plt.show()" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "## Linear regression with Boston housing data example\n", 439 | "***\n", 440 | "\n", 441 | "Here, \n", 442 | "\n", 443 | "$Y$ = boston housing prices (also called \"target\" data in python)\n", 444 | "\n", 445 | "and\n", 446 | "\n", 447 | "$X$ = all the other features (or independent variables)\n", 448 | "\n", 449 | "which we will use to fit a linear regression model and predict Boston housing prices. We will use the least squares method as the way to estimate the coefficients. " 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "We'll use two ways of fitting a linear regression. We recommend the first but the second is also powerful in its features." 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "### Fitting Linear Regression using `statsmodels`\n", 464 | "***" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": { 471 | "collapsed": true 472 | }, 473 | "outputs": [], 474 | "source": [ 475 | "# Import regression modules\n", 476 | "# ols - stands for Ordinary least squares, we'll use this\n", 477 | "import statsmodels.api as sm\n", 478 | "from statsmodels.formula.api import ols" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": { 485 | "collapsed": false 486 | }, 487 | "outputs": [], 488 | "source": [ 489 | "# statsmodels works nicely with pandas dataframes\n", 490 | "# The thing inside the \"quotes\" is called a formula, a bit on that below\n", 491 | "m = ols('PRICE ~ RM',bos).fit()\n", 492 | "print m.summary()" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "#### Interpreting coefficients\n", 500 | "\n", 501 | "There is a ton of information in this output. But we'll concentrate on the coefficient table (middle table). We can interpret the `RM` coefficient (9.1021) by first noticing that the p-vale (under `P>|t|`) is so small, basically zero. We can interpret the coefficient as, if we compare two groups of towns, one where the average number of rooms is say $5$ and the other group is the same except that they all have $6$ rooms. For these two groups the average difference in house prives is about $9.1$ (in thousands) so about $\\$9,100$ difference. The confidence interval fives us a range of plausible values for this difference, about ($\\$8,279, \\$9,925$), deffinitely not chump change. \n", 502 | "\n", 503 | "In the last section of this Lab we discuss p-values in more detail. Please have a read though it and ask your TFs for more help." 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "#### `statsmodels` formulas\n", 511 | "***\n", 512 | "This formula notation will seem familiar to `R` users, but will take some getting used to for people coming from other languages or are new to statistics.\n", 513 | "\n", 514 | "The formula gives instruction for a general structure for a regression call. For `statsmodels` (`ols` or `logit`) calls you need to have a Pandas dataframe with column names that you will add to your formula. 
In the example below you need a pandas dataframe that includes the named columns (`Outcome`, `X1`, `X2`, ...), but you don't need to build a new dataframe for every regression; use the same dataframe with all these columns in it. The structure is very simple:\n", 515 | "\n", 516 | "`Outcome ~ X1`\n", 517 | "\n", 518 | "But of course we want to be able to handle more complex models; for example, multiple regression is done like this:\n", 519 | "\n", 520 | "`Outcome ~ X1 + X2 + X3`\n", 521 | "\n", 522 | "This is the very basic structure, but it should be enough to get you through the homework. Things can get much more complex; for a quick run-down of further uses see the `statsmodels` [help page](http://statsmodels.sourceforge.net/devel/example_formulas.html).\n" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "Let's see how our model actually fit our data. We can see below that there is a ceiling effect; we should probably look into that. Also, for large values of $Y$ we get under-predictions; most predictions fall below the 45-degree line. " 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": { 536 | "collapsed": false 537 | }, 538 | "outputs": [], 539 | "source": [ 540 | "plt.scatter(bos['PRICE'], m.fittedvalues)\n", 541 | "plt.xlabel(\"Prices: $Y_i$\")\n", 542 | "plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n", 543 | "plt.title(\"Prices vs Predicted Prices: $Y_i$ vs $\\hat{Y}_i$\")\n" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "### Fitting Linear Regression using `sklearn`\n" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": { 557 | "collapsed": false 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "from sklearn.linear_model import LinearRegression\n", 562 | "X = bos.drop('PRICE', axis = 1)\n", 563 | "\n", 564 | "# This creates a LinearRegression object\n", 565 | "lm = LinearRegression()\n", 566 | "lm" 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "metadata": {}, 572 | "source": [ 573 | "#### What can you do with a LinearRegression object? " 574 | ] 575 | }, 576 | { 577 | "cell_type": "code", 578 | "execution_count": null, 579 | "metadata": { 580 | "collapsed": false 581 | }, 582 | "outputs": [], 583 | "source": [ 584 | "# Look inside linear regression object\n", 585 | "# LinearRegression." 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "Main functions | Description\n", 593 | "--- | --- \n", 594 | "`lm.fit()` | Fit a linear model\n", 595 | "`lm.predict()` | Predict Y using the linear model with estimated coefficients\n", 596 | "`lm.score()` | Returns the coefficient of determination (R^2). *A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model*" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "#### What output can you get?" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": { 610 | "collapsed": false 611 | }, 612 | "outputs": [], 613 | "source": [ 614 | "# Look inside lm object\n", 615 | "# lm." 
616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "Output | Description\n", 623 | "--- | --- \n", 624 | "`lm.coef_` | Estimated coefficients\n", 625 | "`lm.intercept_` | Estimated intercept " 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "### Fit a linear model\n", 633 | "***\n", 634 | "\n", 635 | "The `lm.fit()` function estimates the coefficients of the linear regression using least squares. " 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": { 642 | "collapsed": false 643 | }, 644 | "outputs": [], 645 | "source": [ 646 | "# Use all 13 predictors to fit linear regression model\n", 647 | "lm.fit(X, bos.PRICE)\n", 648 | "\n", 649 | "# your turn\n", 650 | "# notice fit_intercept=True and normalize=True\n", 651 | "# How would you change the model to not fit an intercept term? \n" 652 | ] 653 | }, 654 | { 655 | "cell_type": "markdown", 656 | "metadata": {}, 657 | "source": [ 658 | "### Estimated intercept and coefficients\n", 659 | "\n", 660 | "Let's look at the estimated coefficients from the linear model using `lm.intercept_` and `lm.coef_`. \n", 661 | "\n", 662 | "After we have fit our linear regression model using the least squares method, we want to see the estimates of our coefficients $\\beta_0$, $\\beta_1$, ..., $\\beta_{13}$: \n", 663 | "\n", 664 | "$$ \\hat{\\beta}_0, \\hat{\\beta}_1, \\ldots, \\hat{\\beta}_{13} $$\n", 665 | "\n" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": { 672 | "collapsed": false 673 | }, 674 | "outputs": [], 675 | "source": [ 676 | "print 'Estimated intercept coefficient:', lm.intercept_" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": { 683 | "collapsed": false 684 | }, 685 | "outputs": [], 686 | "source": [ 687 | "print 'Number of coefficients:', len(lm.coef_)" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": { 694 | "collapsed": false 695 | }, 696 | "outputs": [], 697 | "source": [ 698 | "# The coefficients\n", 699 | "pd.DataFrame(zip(X.columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": {}, 705 | "source": [ 706 | "### Predict Prices \n", 707 | "\n", 708 | "We can calculate the predicted prices ($\\hat{Y}_i$) using `lm.predict`. \n", 709 | "\n", 710 | "$$ \\hat{Y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1 X_1 + \\ldots \\hat{\\beta}_{13} X_{13} $$" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "metadata": { 717 | "collapsed": false 718 | }, 719 | "outputs": [], 720 | "source": [ 721 | "# first five predicted prices\n", 722 | "lm.predict(X)[0:5]" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": { 729 | "collapsed": false 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "plt.hist(lm.predict(X))\n", 734 | "plt.title('Predicted Housing Prices (fitted values): $\\hat{Y}_i$')\n", 735 | "plt.xlabel('Price')\n", 736 | "plt.ylabel('Frequency')" 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": {}, 742 | "source": [ 743 | "Let's plot the true prices against the predicted prices to see where they disagree. We saw this exact plot before; this is just how you access the predicted values using `sklearn`." 
744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": null, 749 | "metadata": { 750 | "collapsed": false 751 | }, 752 | "outputs": [], 753 | "source": [ 754 | "plt.scatter(bos.PRICE, lm.predict(X))\n", 755 | "plt.xlabel(\"Prices: $Y_i$\")\n", 756 | "plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n", 757 | "plt.title(\"Prices vs Predicted Prices: $Y_i$ vs $\\hat{Y}_i$\")" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": {}, 763 | "source": [ 764 | "### Residual sum of squares\n", 765 | "\n", 766 | "Let's calculate the residual sum of squares \n", 767 | "\n", 768 | "$$ S = \\sum_{i=1}^N r_i^2 = \\sum_{i=1}^N (y_i - (\\beta_0 + \\beta_1 x_i))^2 $$" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": { 775 | "collapsed": false 776 | }, 777 | "outputs": [], 778 | "source": [ 779 | "print np.sum((bos.PRICE - lm.predict(X)) ** 2)" 780 | ] 781 | }, 782 | { 783 | "cell_type": "markdown", 784 | "metadata": {}, 785 | "source": [ 786 | "#### Mean squared error" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": { 793 | "collapsed": false 794 | }, 795 | "outputs": [], 796 | "source": [ 797 | "mseFull = np.mean((bos.PRICE - lm.predict(X)) ** 2)\n", 798 | "print mseFull" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "## Relationship between `PTRATIO` and housing price\n", 806 | "***\n", 807 | "\n", 808 | "Try fitting a linear regression model using only 'PTRATIO' (pupil-teacher ratio by town).\n", 809 | "\n", 810 | "Calculate the mean squared error. \n" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "metadata": { 817 | "collapsed": false 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "lm = LinearRegression()\n", 822 | "lm.fit(X[['PTRATIO']], bos.PRICE)" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": null, 828 | "metadata": { 829 | "collapsed": false 830 | }, 831 | "outputs": [], 832 | "source": [ 833 | "msePTRATIO = np.mean((bos.PRICE - lm.predict(X[['PTRATIO']])) ** 2)\n", 834 | "print msePTRATIO" 835 | ] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": {}, 840 | "source": [ 841 | "We can also plot the fitted linear regression line. " 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "metadata": { 848 | "collapsed": false 849 | }, 850 | "outputs": [], 851 | "source": [ 852 | "plt.scatter(bos.PTRATIO, bos.PRICE)\n", 853 | "plt.xlabel(\"Pupil-to-Teacher Ratio (PTRATIO)\")\n", 854 | "plt.ylabel(\"Housing Price\")\n", 855 | "plt.title(\"Relationship between PTRATIO and Price\")\n", 856 | "\n", 857 | "plt.plot(bos.PTRATIO, lm.predict(X[['PTRATIO']]), color='blue', linewidth=3)\n", 858 | "plt.show()" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "# Your turn\n", 866 | "***\n", 867 | "\n", 868 | "Try fitting a linear regression model using three independent variables:\n", 869 | "\n", 870 | "1. 'CRIM' (per capita crime rate by town)\n", 871 | "2. 'RM' (average number of rooms per dwelling)\n", 872 | "3. 'PTRATIO' (pupil-teacher ratio by town)\n", 873 | "\n", 874 | "Calculate the mean squared error. 
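Try it yourself before peeking; one possible sketch, assuming `bos`, `X`, and `LinearRegression` from the cells above, simply mirrors the PTRATIO example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# fit on the three predictors, then compute the mean squared error
lm3 = LinearRegression()
lm3.fit(X[['CRIM', 'RM', 'PTRATIO']], bos.PRICE)
mse3 = np.mean((bos.PRICE - lm3.predict(X[['CRIM', 'RM', 'PTRATIO']])) ** 2)
print mse3
```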
" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": null, 880 | "metadata": { 881 | "collapsed": false 882 | }, 883 | "outputs": [], 884 | "source": [ 885 | "# your turn" 886 | ] 887 | }, 888 | { 889 | "cell_type": "markdown", 890 | "metadata": {}, 891 | "source": [ 892 | "\n", 893 | "## Other important things to think about when fitting a linear regression model\n", 894 | "***\n", 895 | "
\n", 896 | "
    \n", 897 | "
  • **Linearity**. The dependent variable $Y$ is a linear combination of the regression coefficients and the independent variables $X$.
  • \n", 898 | "
  • **Constant standard deviation**. The SD of the dependent variable $Y$ should be constant for different values of X. \n", 899 | "
      \n", 900 | "
    • e.g. PTRATIO\n", 901 | "
    \n", 902 | "
  • \n", 903 | "
  • **Normal distribution for errors**. The $\\epsilon$ term we discussed at the beginning are assumed to be normally distributed. \n", 904 | " $$ \\epsilon_i \\sim N(0, \\sigma^2)$$\n", 905 | "Sometimes the distributions of responses $Y$ may not be normally distributed at any given value of $X$. e.g. skewed positively or negatively.
  • \n", 906 | "
  • **Independent errors**. The observations are assumed to be obtained independently.\n", 907 | "
      \n", 908 | "
    • e.g. Observations across time may be correlated\n", 909 | "
    \n", 910 | "
  • \n", 911 | "
\n", 912 | "\n", 913 | "
\n" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "metadata": {}, 919 | "source": [ 920 | "# Part 3: Training and Test Data sets\n", 921 | "\n", 922 | "### Purpose of splitting data into Training/testing sets\n", 923 | "***\n", 924 | "
\n", 925 | "\n", 926 | "

Let's stick to the linear regression example:

\n", 927 | "
    \n", 928 | "
  • We built our model with the requirement that the model fit the data well.
  • \n", 929 | "
  • As a side-effect, the model will fit THIS dataset well. What about new data?
  • \n", 930 | "
      \n", 931 | "
    • We wanted the model for predictions, right?
    • \n", 932 | "
    \n", 933 | "
  • One simple solution, leave out some data (for testing) and train the model on the rest
  • \n", 934 | "
  • This also leads directly to the idea of cross-validation, next section.
  • \n", 935 | "
\n", 936 | "
\n", 937 | "\n", 938 | "***\n", 939 | "\n", 940 | "One way of doing this is you can create training and testing data sets manually. " 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": null, 946 | "metadata": { 947 | "collapsed": false 948 | }, 949 | "outputs": [], 950 | "source": [ 951 | "X_train = X[:-50]\n", 952 | "X_test = X[-50:]\n", 953 | "Y_train = bos.PRICE[:-50]\n", 954 | "Y_test = bos.PRICE[-50:]\n", 955 | "print X_train.shape\n", 956 | "print X_test.shape\n", 957 | "print Y_train.shape\n", 958 | "print Y_test.shape" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": {}, 964 | "source": [ 965 | "Another way, is to split the data into random train and test subsets using the function `train_test_split` in `sklearn.cross_validation`. " 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "execution_count": null, 971 | "metadata": { 972 | "collapsed": false 973 | }, 974 | "outputs": [], 975 | "source": [ 976 | "# let's look at the function in the help file\n", 977 | "# sklearn.cross_validation.train_test_split?" 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "execution_count": null, 983 | "metadata": { 984 | "collapsed": false 985 | }, 986 | "outputs": [], 987 | "source": [ 988 | "X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(\n", 989 | " X, bos.PRICE, test_size=0.33, random_state = 5)\n", 990 | "print X_train.shape\n", 991 | "print X_test.shape\n", 992 | "print Y_train.shape\n", 993 | "print Y_test.shape" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "Your turn. Let's build a linear regression model using our new training data sets. " 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": null, 1006 | "metadata": { 1007 | "collapsed": false 1008 | }, 1009 | "outputs": [], 1010 | "source": [ 1011 | "# your turn\n", 1012 | "lm = LinearRegression()\n", 1013 | "lm.fit(X_train, Y_train)\n", 1014 | "pred_train = lm.predict(X_train)\n", 1015 | "pred_test = lm.predict(X_test)" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": {}, 1021 | "source": [ 1022 | "Now, calculate the mean squared error using just the test data and compare to mean squared from using all the data to fit the model. 
" 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "code", 1027 | "execution_count": null, 1028 | "metadata": { 1029 | "collapsed": false 1030 | }, 1031 | "outputs": [], 1032 | "source": [ 1033 | "# your turn\n", 1034 | "print \"Fit a model X_train, and calculate MSE with Y_train:\", np.mean((Y_train - lm.predict(X_train)) ** 2)\n", 1035 | "print \"Fit a model X_train, and calculate MSE with X_test, Y_test:\", np.mean((Y_test - lm.predict(X_test)) ** 2)" 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "markdown", 1040 | "metadata": {}, 1041 | "source": [ 1042 | "#### Residual plots" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "code", 1047 | "execution_count": null, 1048 | "metadata": { 1049 | "collapsed": false 1050 | }, 1051 | "outputs": [], 1052 | "source": [ 1053 | "plt.scatter(lm.predict(X_train), lm.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)\n", 1054 | "plt.scatter(lm.predict(X_test), lm.predict(X_test) - Y_test, c='g', s=40)\n", 1055 | "plt.hlines(y = 0, xmin=0, xmax = 50)\n", 1056 | "plt.title('Residual Plot using training (blue) and test (green) data')\n", 1057 | "plt.ylabel('Residuals')" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "markdown", 1062 | "metadata": {}, 1063 | "source": [ 1064 | "### K-fold Cross-validation as an extension of this idea\n", 1065 | "***\n", 1066 | "
\n", 1067 | "\n", 1068 | "

A simple extension of the Test/train split is called K-fold cross-validation.

\n", 1069 | "\n", 1070 | "

Here's the procedure:

\n", 1071 | "
    \n", 1072 | "
  • randomly assign your $n$ samples to one of $K$ groups. They'll each have about $n/k$ samples
  • \n", 1073 | "
  • For each group $k$:
  • \n", 1074 | "
      \n", 1075 | "
    • Fit the model (e.g. run regression) on all data excluding the $k^{th}$ group
    • \n", 1076 | "
    • Use the model to predict the outcomes in group $k$
    • \n", 1077 | "
    • Calculate your prediction error for each observation in $k^{th}$ group (e.g. $(Y_i - \\hat{Y}_i)^2$ for regression, $\\mathbb{1}(Y_i = \\hat{Y}_i)$ for logistic regression).
    • \n", 1078 | "
    \n", 1079 | "
  • Calculate the average prediction error across all samples $Err_{CV} = \\frac{1}{n}\\sum_{i=1}^n (Y_i - \\hat{Y}_i)^2$
  • \n", 1080 | "
\n", 1081 | "
\n", 1082 | "\n", 1083 | "***\n", 1084 | "\n", 1085 | "Luckily you don't have to do this entire process all by hand (``for`` loops, etc.) every single time, ``sci-kit learn`` has a very nice implementation of this, have a look at the [documentation](http://scikit-learn.org/stable/modules/cross_validation.html)." 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "markdown", 1090 | "metadata": {}, 1091 | "source": [ 1092 | "## Another Example: Old Faithful Geyser Data Set\n", 1093 | "***\n", 1094 | "\n", 1095 | "The [Old Faithful Geyser](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/faithful.html) data set is a well-known data set that depicts the relationship of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA [[webcam]](http://yellowstone.net/webcams/). This data set is found in the base installation of the [R programming language](http://cran.r-project.org). \n", 1096 | "\n", 1097 | "`faithful` is a data set with 272 observations on 2 variables.\n", 1098 | "\n", 1099 | "Column name| Description \n", 1100 | "--- | --- \n", 1101 | "eruptions | Eruption time (in mins)\n", 1102 | "waiting\t| Waiting time to next eruption (in mins)\n", 1103 | "\n", 1104 | "There is a function in `statsmodels` (or `sm` for short) called `sm.datasets.get_rdataset` which will download and return a data set found in [R](http://cran.r-project.org). \n", 1105 | "\n", 1106 | "Let's import the `faithful` dataset. " 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": null, 1112 | "metadata": { 1113 | "collapsed": false 1114 | }, 1115 | "outputs": [], 1116 | "source": [ 1117 | "faithful = sm.datasets.get_rdataset(\"faithful\")" 1118 | ] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "execution_count": null, 1123 | "metadata": { 1124 | "collapsed": false 1125 | }, 1126 | "outputs": [], 1127 | "source": [ 1128 | "# Let's look at the help file\n", 1129 | "sm.datasets.get_rdataset?\n", 1130 | "faithful?" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "code", 1135 | "execution_count": null, 1136 | "metadata": { 1137 | "collapsed": false 1138 | }, 1139 | "outputs": [], 1140 | "source": [ 1141 | "faithful.title" 1142 | ] 1143 | }, 1144 | { 1145 | "cell_type": "code", 1146 | "execution_count": null, 1147 | "metadata": { 1148 | "collapsed": false 1149 | }, 1150 | "outputs": [], 1151 | "source": [ 1152 | "faithful = faithful.data\n", 1153 | "faithful.head()" 1154 | ] 1155 | }, 1156 | { 1157 | "cell_type": "code", 1158 | "execution_count": null, 1159 | "metadata": { 1160 | "collapsed": false 1161 | }, 1162 | "outputs": [], 1163 | "source": [ 1164 | "faithful.shape" 1165 | ] 1166 | }, 1167 | { 1168 | "cell_type": "markdown", 1169 | "metadata": {}, 1170 | "source": [ 1171 | "### Histogram \n", 1172 | "***\n", 1173 | "\n", 1174 | "Create a histogram of the time between eruptions. What do you see? 
" 1175 | ] 1176 | }, 1177 | { 1178 | "cell_type": "code", 1179 | "execution_count": null, 1180 | "metadata": { 1181 | "collapsed": false 1182 | }, 1183 | "outputs": [], 1184 | "source": [ 1185 | "plt.hist(faithful.waiting)\n", 1186 | "plt.xlabel('Waiting time to next eruption (in mins)')\n", 1187 | "plt.ylabel('Frequency')\n", 1188 | "plt.title('Old Faithful Geyser time between eruption')\n", 1189 | "plt.show()" 1190 | ] 1191 | }, 1192 | { 1193 | "cell_type": "markdown", 1194 | "metadata": {}, 1195 | "source": [ 1196 | "This histogram indicates [Old Faithful isn’t as “faithful” as you might think](http://people.stern.nyu.edu/jsimonof/classes/2301/pdf/geystime.pdf). " 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "markdown", 1201 | "metadata": {}, 1202 | "source": [ 1203 | "### Scatter plot \n", 1204 | "***\n", 1205 | "\n", 1206 | "Create a scatter plot of the `waiting` on the x-axis and the `eruptions` on the y-axis. " 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "code", 1211 | "execution_count": null, 1212 | "metadata": { 1213 | "collapsed": false 1214 | }, 1215 | "outputs": [], 1216 | "source": [ 1217 | "plt.scatter(faithful.waiting, faithful.eruptions)\n", 1218 | "plt.xlabel('Waiting time to next eruption (in mins)')\n", 1219 | "plt.ylabel('Eruption time (in mins)')\n", 1220 | "plt.title('Old Faithful Geyser')\n", 1221 | "plt.show()\n" 1222 | ] 1223 | }, 1224 | { 1225 | "cell_type": "markdown", 1226 | "metadata": {}, 1227 | "source": [ 1228 | "### Build a linear regression to predict eruption time using `statsmodels`\n", 1229 | "***\n", 1230 | "\n", 1231 | "Now let's build a linear regression model for the `faithful` DataFrame, and estimate the next eruption duration if the waiting time since the last eruption has been 75 minutes.\n", 1232 | "\n", 1233 | "$$ Eruptions = \\beta_0 + \\beta_1 * Waiting + \\epsilon $$ " 1234 | ] 1235 | }, 1236 | { 1237 | "cell_type": "code", 1238 | "execution_count": null, 1239 | "metadata": { 1240 | "collapsed": false 1241 | }, 1242 | "outputs": [], 1243 | "source": [ 1244 | "X = faithful.waiting\n", 1245 | "y = faithful.eruptions\n", 1246 | "model = sm.OLS(y, X)" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": null, 1252 | "metadata": { 1253 | "collapsed": false 1254 | }, 1255 | "outputs": [], 1256 | "source": [ 1257 | "# Let's look at the options in model\n", 1258 | "# model." 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "code", 1263 | "execution_count": null, 1264 | "metadata": { 1265 | "collapsed": false 1266 | }, 1267 | "outputs": [], 1268 | "source": [ 1269 | "results = model.fit()" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": null, 1275 | "metadata": { 1276 | "collapsed": false 1277 | }, 1278 | "outputs": [], 1279 | "source": [ 1280 | "# Let's look at the options in results\n", 1281 | "# results." 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "code", 1286 | "execution_count": null, 1287 | "metadata": { 1288 | "collapsed": false 1289 | }, 1290 | "outputs": [], 1291 | "source": [ 1292 | "print results.summary()" 1293 | ] 1294 | }, 1295 | { 1296 | "cell_type": "code", 1297 | "execution_count": null, 1298 | "metadata": { 1299 | "collapsed": false 1300 | }, 1301 | "outputs": [], 1302 | "source": [ 1303 | "results.params.values" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": {}, 1309 | "source": [ 1310 | "We notice, there is no intercept ($\\beta_0$) fit in this linear model. To add it, we can use the function `sm.add_constant`. 
" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "code", 1315 | "execution_count": null, 1316 | "metadata": { 1317 | "collapsed": false 1318 | }, 1319 | "outputs": [], 1320 | "source": [ 1321 | "X = sm.add_constant(X)\n", 1322 | "X.head()" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "markdown", 1327 | "metadata": {}, 1328 | "source": [ 1329 | "Now let's fit a linear regression model with an intercept. " 1330 | ] 1331 | }, 1332 | { 1333 | "cell_type": "code", 1334 | "execution_count": null, 1335 | "metadata": { 1336 | "collapsed": false 1337 | }, 1338 | "outputs": [], 1339 | "source": [ 1340 | "modelW0 = sm.OLS(y, X)\n", 1341 | "resultsW0 = modelW0.fit()\n", 1342 | "print resultsW0.summary()" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": {}, 1348 | "source": [ 1349 | "If you want to predict the time to the next eruption using a waiting time of 75, you can directly estimate this using the equation \n", 1350 | "\n", 1351 | "$$ \\hat{y} = \\hat{\\beta}_0 + \\hat{\\beta}_1 * 75 $$ \n", 1352 | "\n", 1353 | "or you can use `results.predict`. " 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "code", 1358 | "execution_count": null, 1359 | "metadata": { 1360 | "collapsed": false 1361 | }, 1362 | "outputs": [], 1363 | "source": [ 1364 | "newX = np.array([1,75])\n", 1365 | "resultsW0.params[0]*newX[0] + resultsW0.params[1] * newX[1]" 1366 | ] 1367 | }, 1368 | { 1369 | "cell_type": "code", 1370 | "execution_count": null, 1371 | "metadata": { 1372 | "collapsed": false 1373 | }, 1374 | "outputs": [], 1375 | "source": [ 1376 | "resultsW0.predict(newX)" 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "markdown", 1381 | "metadata": {}, 1382 | "source": [ 1383 | "Based on this linear regression, if the waiting time since the last eruption has been 75 minutes, we expect the next one to last approximately 3.80 minutes." 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "metadata": {}, 1389 | "source": [ 1390 | "### Plot the regression line \n", 1391 | "***\n", 1392 | "\n", 1393 | "Instead of using `resultsW0.predict(X)`, we can use `resultsW0.fittedvalues` which are the $\\hat{y}$. " 1394 | ] 1395 | }, 1396 | { 1397 | "cell_type": "code", 1398 | "execution_count": null, 1399 | "metadata": { 1400 | "collapsed": false 1401 | }, 1402 | "outputs": [], 1403 | "source": [ 1404 | "plt.scatter(faithful.waiting, faithful.eruptions)\n", 1405 | "plt.xlabel('Waiting time to next eruption (in mins)')\n", 1406 | "plt.ylabel('Eruption time (in mins)')\n", 1407 | "plt.title('Old Faithful Geyser')\n", 1408 | "\n", 1409 | "plt.plot(faithful.waiting, resultsW0.fittedvalues, color='blue', linewidth=3)\n", 1410 | "plt.show()\n" 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "markdown", 1415 | "metadata": {}, 1416 | "source": [ 1417 | "### Residuals, residual sum of squares, mean squared error\n", 1418 | "***\n", 1419 | "\n", 1420 | "Recall, we can directly calculate the residuals as \n", 1421 | "\n", 1422 | "$$r_i = y_i - (\\hat{\\beta}_0 + \\hat{\\beta}_1 x_i)$$\n", 1423 | "\n", 1424 | "To calculate the residual sum of squares, \n", 1425 | "\n", 1426 | "$$ S = \\sum_{i=1}^n r_i = \\sum_{i=1}^n (y_i - (\\hat{\\beta}_0 + \\hat{\\beta}_1 x_i))^2 $$\n", 1427 | "\n", 1428 | "where $n$ is the number of observations. 
We can compute the residuals from the predictions by hand, or simply ask for them with `resultsW0.resid`." 1429 | ] 1430 | }, 1431 | { 1432 | "cell_type": "code", 1433 | "execution_count": null, 1434 | "metadata": { 1435 | "collapsed": false 1436 | }, 1437 | "outputs": [], 1438 | "source": [ 1439 | "resids = faithful.eruptions - resultsW0.predict(X)\n" 1440 | ] 1441 | }, 1442 | { 1443 | "cell_type": "code", 1444 | "execution_count": null, 1445 | "metadata": { 1446 | "collapsed": false 1447 | }, 1448 | "outputs": [], 1449 | "source": [ 1450 | "resids = resultsW0.resid\n" 1451 | ] 1452 | }, 1453 | { 1454 | "cell_type": "code", 1455 | "execution_count": null, 1456 | "metadata": { 1457 | "collapsed": false 1458 | }, 1459 | "outputs": [], 1460 | "source": [ 1461 | "plt.plot(faithful.waiting, resids, 'o')\n", 1462 | "plt.hlines(y = 0, xmin=40, xmax = 100)\n", 1463 | "plt.xlabel('Waiting time')\n", 1464 | "plt.ylabel('Residuals')\n", 1465 | "plt.title('Residual Plot')\n", 1466 | "plt.show()" 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "markdown", 1471 | "metadata": {}, 1472 | "source": [ 1473 | "The residual sum of squares: " 1474 | ] 1475 | }, 1476 | { 1477 | "cell_type": "code", 1478 | "execution_count": null, 1479 | "metadata": { 1480 | "collapsed": false 1481 | }, 1482 | "outputs": [], 1483 | "source": [ 1484 | "print np.sum((faithful.eruptions - resultsW0.predict(X)) ** 2)" 1485 | ] 1486 | }, 1487 | { 1488 | "cell_type": "markdown", 1489 | "metadata": {}, 1490 | "source": [ 1491 | "Mean squared error: " 1492 | ] 1493 | }, 1494 | { 1495 | "cell_type": "code", 1496 | "execution_count": null, 1497 | "metadata": { 1498 | "collapsed": false 1499 | }, 1500 | "outputs": [], 1501 | "source": [ 1502 | "print np.mean((faithful.eruptions - resultsW0.predict(X)) ** 2)" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "markdown", 1507 | "metadata": {}, 1508 | "source": [ 1509 | "## Build a linear regression to predict eruption time using least squares \n", 1510 | "***\n", 1511 | "\n", 1512 | "Now let's build a linear regression model for the `faithful` DataFrame, but instead of using `statsmodels` (or `sklearn`), let's use the least squares estimates of the coefficients for the linear regression model.\n", 1513 | "\n", 1514 | "$$ \\hat{\\beta} = (X^{\\top}X)^{-1} X^{\\top}Y $$ \n", 1515 | "\n", 1516 | "The `numpy` function [`np.dot`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html#numpy.dot) is the dot product (or inner product) of two vectors (or arrays in Python). \n", 1517 | "\n", 1518 | "The `numpy` function [`np.linalg.inv`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html#numpy.linalg.inv) can be used to compute the inverse of a matrix. " 1519 | ] 1520 | }, 1521 | { 1522 | "cell_type": "code", 1523 | "execution_count": null, 1524 | "metadata": { 1525 | "collapsed": false 1526 | }, 1527 | "outputs": [], 1528 | "source": [ 1529 | "X = sm.add_constant(faithful.waiting)\n", 1530 | "y = faithful.eruptions\n" 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "markdown", 1535 | "metadata": {}, 1536 | "source": [ 1537 | "First, compute $X^{\\top}X$\n" 1538 | ] 1539 | }, 1540 | { 1541 | "cell_type": "code", 1542 | "execution_count": null, 1543 | "metadata": { 1544 | "collapsed": false 1545 | }, 1546 | "outputs": [], 1547 | "source": [ 1548 | "np.dot(X.T, X)\n" 1549 | ] 1550 | }, 1551 | { 1552 | "cell_type": "markdown", 1553 | "metadata": {}, 1554 | "source": [ 1555 | "Next, compute the inverse of $X^{\\top}X$ or $(X^{\\top}X)^{-1}$. 
" 1556 | ] 1557 | }, 1558 | { 1559 | "cell_type": "code", 1560 | "execution_count": null, 1561 | "metadata": { 1562 | "collapsed": false 1563 | }, 1564 | "outputs": [], 1565 | "source": [ 1566 | "np.linalg.inv(np.dot(X.T, X))" 1567 | ] 1568 | }, 1569 | { 1570 | "cell_type": "markdown", 1571 | "metadata": {}, 1572 | "source": [ 1573 | "Finally, compute $\\hat{\\beta} = (X^{\\top}X)^{-1} X^{\\top}Y $" 1574 | ] 1575 | }, 1576 | { 1577 | "cell_type": "code", 1578 | "execution_count": null, 1579 | "metadata": { 1580 | "collapsed": false 1581 | }, 1582 | "outputs": [], 1583 | "source": [ 1584 | "beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)\n", 1585 | "print \"Directly estimating beta:\", beta\n", 1586 | "print \"Estimating beta using statmodels: \", resultsW0.params.values\n", 1587 | "\n", 1588 | "\n" 1589 | ] 1590 | }, 1591 | { 1592 | "cell_type": "markdown", 1593 | "metadata": {}, 1594 | "source": [ 1595 | "# Part 4: Many different types of regression\n", 1596 | "***\n", 1597 | "
\n", 1598 | "\n", 1599 | "

You do not always have a continuous $y$ variable that you are measuring. Sometimes it may be binary (e.g. 0 or 1). Sometimes it may be count data. What do you do?

\n", 1600 | "\n", 1601 | "

Use other types of regression besides just simple linear regression.

\n", 1602 | "\n", 1603 | "

[Nice summary of several types of regression](http://www.datasciencecentral.com/profiles/blogs/10-types-of-regressions-which-one-to-use).
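As one concrete example, count outcomes are commonly handled with Poisson regression, which models the log of the expected count as a linear function of the features. A sketch using the `glm` function this lab already imports from `statsmodels.formula.api` (the data below are made up purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm

# made-up count data, for illustration only
np.random.seed(0)
df = pd.DataFrame({'x': np.random.uniform(0, 2, 200)})
df['counts'] = np.random.poisson(np.exp(0.5 + 1.0 * df['x']))

# Poisson regression via a GLM with a Poisson family
poisson_model = glm('counts ~ x', data=df, family=sm.families.Poisson()).fit()
print poisson_model.summary()
```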

\n", 1604 | "
\n" 1605 | ] 1606 | }, 1607 | { 1608 | "cell_type": "markdown", 1609 | "metadata": {}, 1610 | "source": [ 1611 | "# Part 5: Logistic Regression\n", 1612 | "***\n" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "markdown", 1617 | "metadata": {}, 1618 | "source": [ 1619 | "
\n", 1620 | "

Logistic regression is a probabilistic model that links observed binary data to a set of features.

\n", 1621 | "\n", 1622 | "

Suppose that we have a set of binary (that is, taking the values 0 or 1) observations $Y_1,\\cdots,Y_n$, and for each observation $Y_i$ we have a vector of features $X_i$. The logistic regression model assumes that there is some set of **weights**, **coefficients**, or **parameters** $\\beta$, one for each feature, so that the data were generated by flipping a weighted coin whose probability of giving a 1 is given by the following equation:\n", 1623 | "\n", 1624 | "$$\n", 1625 | "P(Y_i = 1) = \\mathrm{logistic}(\\sum_j \\beta_j X_{ij}),\n", 1626 | "$$\n", 1627 | "\n", 1628 | "where\n", 1629 | "\n", 1630 | "$$\n", 1631 | "\\mathrm{logistic}(x) = \\frac{e^x}{1+e^x}.\n", 1632 | "$$\n", 1633 | "

\n", 1634 | "

When we *fit* a logistic regression model, we determine values for each $\\beta$ that allow the model to best fit the *training data* we have observed. Once we do this, we can use these coefficients to make predictions about data we have not yet observed.
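This lab fits logistic regressions with `statsmodels`; as an aside, `scikit-learn` offers `LogisticRegression` as well. A minimal sketch on made-up data (note that sklearn regularizes by default, so a large `C` is needed to approximate an unpenalized fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up binary data, for illustration only
np.random.seed(1)
Xtoy = np.random.uniform(50, 85, 100).reshape(-1, 1)
ptoy = 1.0 / (1 + np.exp(0.3 * (Xtoy[:, 0] - 65)))
ytoy = (np.random.uniform(size=100) < ptoy).astype(int)

clf = LogisticRegression(C=1e6)  # large C approximates an unregularized fit
clf.fit(Xtoy, ytoy)
print "intercept:", clf.intercept_, "coefficient:", clf.coef_
print "P(y=1 at x=55):", clf.predict_proba([[55.0]])[0, 1]
```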

\n", 1635 | "\n", 1636 | "
" 1637 | ] 1638 | }, 1639 | { 1640 | "cell_type": "markdown", 1641 | "metadata": {}, 1642 | "source": [ 1643 | "From http://www.edwardtufte.com/tufte/ebooks, in \"Visual and Statistical Thinking: \n", 1644 | "Displays of Evidence for Making Decisions\":\n", 1645 | "\n", 1646 | ">On January 28, 1986, the space shuttle Challenger exploded and seven astronauts died because two rubber O-rings leaked. These rings had lost their resiliency because the shuttle was launched on a very cold day. Ambient temperatures were in the low 30s and the O-rings themselves were much colder, less than 20F.\n", 1647 | "\n", 1648 | ">One day before the flight, the predicted temperature for the launch was 26F to 29F. Concerned that the rings would not seal at such a cold temperature, the engineers who designed the rocket opposed launching Challenger the next day.\n", 1649 | "\n", 1650 | "But they did not make their case persuasively, and were over-ruled by NASA." 1651 | ] 1652 | }, 1653 | { 1654 | "cell_type": "code", 1655 | "execution_count": null, 1656 | "metadata": { 1657 | "collapsed": false 1658 | }, 1659 | "outputs": [], 1660 | "source": [ 1661 | "from IPython.display import Image as Im\n", 1662 | "from IPython.display import display\n", 1663 | "Im('./images/shuttle.png')" 1664 | ] 1665 | }, 1666 | { 1667 | "cell_type": "markdown", 1668 | "metadata": {}, 1669 | "source": [ 1670 | "The image above shows the leak, where the O-ring failed.\n", 1671 | "\n", 1672 | "We have here data on previous failures of the O-rings at various temperatures." 1673 | ] 1674 | }, 1675 | { 1676 | "cell_type": "code", 1677 | "execution_count": null, 1678 | "metadata": { 1679 | "collapsed": false 1680 | }, 1681 | "outputs": [], 1682 | "source": [ 1683 | "data=np.array([[float(j) for j in e.strip().split()] for e in open(\"./data/chall.txt\")])\n", 1684 | "data" 1685 | ] 1686 | }, 1687 | { 1688 | "cell_type": "markdown", 1689 | "metadata": {}, 1690 | "source": [ 1691 | "Lets plot this data" 1692 | ] 1693 | }, 1694 | { 1695 | "cell_type": "code", 1696 | "execution_count": null, 1697 | "metadata": { 1698 | "collapsed": false 1699 | }, 1700 | "outputs": [], 1701 | "source": [ 1702 | "# fit logistic regression model\n", 1703 | "import statsmodels.api as sm\n", 1704 | "from statsmodels.formula.api import logit, glm, ols\n", 1705 | "\n", 1706 | "# statsmodels works nicely with pandas dataframes\n", 1707 | "dat = pd.DataFrame(data, columns = ['Temperature', 'Failure'])\n", 1708 | "logit_model = logit('Failure ~ Temperature',dat).fit()\n", 1709 | "print logit_model.summary()\n" 1710 | ] 1711 | }, 1712 | { 1713 | "cell_type": "markdown", 1714 | "metadata": {}, 1715 | "source": [ 1716 | "#### Interpreting p-values:\n", 1717 | "Generally we'd like the p-values to be very small as they represent the probability that we observed such an strong relationship between temperature and O-ring failures purely by chance. So when the p-value is small (we usually consider \"small\" as less than 0.05), what we're saying is that based on the data we observed, we know **fairly certainly** that temperature is strongly associated with the failure of O-rings. This is a very powerful statement that can take us a long way in terms of learning if used properly. Have a look at the Wikipedia page on [p-values](https://en.wikipedia.org/wiki/P-value) for a quick reminder.\n", 1718 | "\n", 1719 | "There are some issues with testing many many hypotheses that we'll also encounter in the homework. 
But generally, the idea is that the data may (or may not) have information about things you're interested in. We ask the data questions through hypotheses, but the more questions we ask of it, the higher the chance that the data will show associations purely at random. Have a look at [this article](https://en.wikipedia.org/wiki/Multiple_comparisons_problem) to get an idea of the problem and some solutions. We'll be considering a very crude solution known as the [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction), but that is by no means the best solution. " 1720 | ] 1721 | }, 1722 | { 1723 | "cell_type": "code", 1724 | "execution_count": null, 1725 | "metadata": { 1726 | "collapsed": true 1727 | }, 1728 | "outputs": [], 1729 | "source": [ 1730 | "# calculate predicted failure probabilities for new temperatures\n", 1731 | "x = np.linspace(50, 85, 1000)\n", 1732 | "p = logit_model.params\n", 1733 | "eta = p['Intercept'] + x*p['Temperature']\n", 1734 | "y = np.exp(eta)/(1 + np.exp(eta))" 1735 | ] 1736 | }, 1737 | { 1738 | "cell_type": "markdown", 1739 | "metadata": {}, 1740 | "source": [ 1741 | "Let's plot the data along with a range of predicted failure probabilities for unobserved temperatures." 1742 | ] 1743 | }, 1744 | { 1745 | "cell_type": "code", 1746 | "execution_count": null, 1747 | "metadata": { 1748 | "collapsed": false 1749 | }, 1750 | "outputs": [], 1751 | "source": [ 1752 | "# plot data\n", 1753 | "temps, pfail = data[:,0], data[:,1]\n", 1754 | "plt.scatter(temps, pfail)\n", 1755 | "axes=plt.gca()\n", 1756 | "plt.xlabel('Temperature')\n", 1757 | "plt.ylabel('Failure')\n", 1758 | "plt.title('O-ring failures')\n", 1759 | "\n", 1760 | "# plot fitted values\n", 1761 | "plt.plot(x, y)\n", 1762 | "\n", 1763 | "# change limits, for a nicer plot\n", 1764 | "plt.xlim(50, 85)\n", 1765 | "plt.ylim(-0.1, 1.1)\n" 1766 | ] 1767 | }, 1768 | { 1769 | "cell_type": "markdown", 1770 | "metadata": {}, 1771 | "source": [ 1772 | "We can interpret the output from a logistic regression by looking at the coefficient of temperature (as well as the p-value). Since the coefficient of temperature is negative, we can say that an increase in temperature is associated with a decrease in the odds of having an O-ring failure. " 1773 | ] 1774 | } 1775 | ], 1776 | "metadata": { 1777 | "kernelspec": { 1778 | "display_name": "Python 2", 1779 | "language": "python", 1780 | "name": "python2" 1781 | }, 1782 | "language_info": { 1783 | "codemirror_mode": { 1784 | "name": "ipython", 1785 | "version": 2 1786 | }, 1787 | "file_extension": ".py", 1788 | "mimetype": "text/x-python", 1789 | "name": "python", 1790 | "nbconvert_exporter": "python", 1791 | "pygments_lexer": "ipython2", 1792 | "version": "2.7.10" 1793 | } 1794 | }, 1795 | "nbformat": 4, 1796 | "nbformat_minor": 0 1797 | } 1798 | --------------------------------------------------------------------------------