├── contrasts.ipynb ├── contrasts.py ├── discrete_choice.ipynb ├── discrete_choice.py ├── generic_mle.ipynb ├── generic_mle.py ├── kernel_density.ipynb ├── kernel_density.py ├── linear_models.ipynb ├── linear_models.py ├── preliminaries.ipynb ├── preliminaries.py ├── rmagic_extension.ipynb ├── rmagic_extension.py ├── robust_models.ipynb ├── robust_models.py ├── salary.table ├── star_diagram.png ├── tsa_arma.ipynb ├── tsa_arma.py ├── tsa_filters.ipynb ├── tsa_filters.py ├── tsa_var.ipynb ├── tsa_var.py ├── whats_coming.ipynb └── whats_coming.py /contrasts.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "contrasts" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 3, 13 | "metadata": {}, 14 | "source": [ 15 | "Contrasts Overview" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "collapsed": false, 21 | "input": [ 22 | "import statsmodels.api as sm" 23 | ], 24 | "language": "python", 25 | "metadata": {}, 26 | "outputs": [] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "This document is based heavily on this excellent resource from UCLA http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm" 33 | ] 34 | }, 35 | { 36 | "cell_type": "raw", 37 | "metadata": {}, 38 | "source": [ 39 | "A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.\n", 40 | "\n", 41 | "In fact, the dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of *contrasts* for a categorical variable with `k` levels is a set of `k-1` functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. The dummy coding isn't wrong *per se*. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients such as in ANOVA. Linear regression models do not assume independence of the coefficients and thus dummy coding is often the only coding that is taught in this context.\n", 42 | "\n", 43 | "To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data." 
44 | ] 45 | }, 46 | { 47 | "cell_type": "heading", 48 | "level": 4, 49 | "metadata": {}, 50 | "source": [ 51 | "Example Data" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "collapsed": false, 57 | "input": [ 58 | "import pandas\n", 59 | "url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'\n", 60 | "hsb2 = pandas.read_table(url, delimiter=\",\")" 61 | ], 62 | "language": "python", 63 | "metadata": {}, 64 | "outputs": [] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "collapsed": false, 69 | "input": [ 70 | "hsb2.head(10)" 71 | ], 72 | "language": "python", 73 | "metadata": {}, 74 | "outputs": [] 75 | }, 76 | { 77 | "cell_type": "raw", 78 | "metadata": {}, 79 | "source": [ 80 | "It will be instructive to look at the mean of the dependent variable, write, for each level of race ((1 = Hispanic, 2 = Asian, 3 = African American and 4 = Caucasian))." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "collapsed": false, 86 | "input": [ 87 | "hsb2.groupby('race')['write'].mean()" 88 | ], 89 | "language": "python", 90 | "metadata": {}, 91 | "outputs": [] 92 | }, 93 | { 94 | "cell_type": "heading", 95 | "level": 4, 96 | "metadata": {}, 97 | "source": [ 98 | "Treatment (Dummy) Coding" 99 | ] 100 | }, 101 | { 102 | "cell_type": "raw", 103 | "metadata": {}, 104 | "source": [ 105 | "Dummy coding is likely the most well known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "collapsed": false, 111 | "input": [ 112 | "from patsy.contrasts import Treatment\n", 113 | "levels = [1,2,3,4]\n", 114 | "contrast = Treatment(reference=0).code_without_intercept(levels)\n", 115 | "print contrast.matrix" 116 | ], 117 | "language": "python", 118 | "metadata": {}, 119 | "outputs": [] 120 | }, 121 | { 122 | "cell_type": "raw", 123 | "metadata": {}, 124 | "source": [ 125 | "Here we used `reference=0`, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the `race` variable." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "collapsed": false, 131 | "input": [ 132 | "hsb2.race.head(10)" 133 | ], 134 | "language": "python", 135 | "metadata": {}, 136 | "outputs": [] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "collapsed": false, 141 | "input": [ 142 | "print contrast.matrix[hsb2.race-1, :][:20]" 143 | ], 144 | "language": "python", 145 | "metadata": {}, 146 | "outputs": [] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "collapsed": false, 151 | "input": [ 152 | "sm.categorical(hsb2.race.values)" 153 | ], 154 | "language": "python", 155 | "metadata": {}, 156 | "outputs": [] 157 | }, 158 | { 159 | "cell_type": "raw", 160 | "metadata": {}, 161 | "source": [ 162 | "This is a bit of a trick, as the `race` category conveniently maps to zero-based indices. If it does not, this conversion happens under the hood, so this won't work in general but nonetheless is a useful exercise to fix ideas. 
The below illustrates the output using the three contrasts above" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "collapsed": false, 168 | "input": [ 169 | "from statsmodels.formula.api import ols\n", 170 | "mod = ols(\"write ~ C(race, Treatment)\", data=hsb2)\n", 171 | "res = mod.fit()\n", 172 | "print res.summary()" 173 | ], 174 | "language": "python", 175 | "metadata": {}, 176 | "outputs": [] 177 | }, 178 | { 179 | "cell_type": "raw", 180 | "metadata": {}, 181 | "source": [ 182 | "We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted this." 183 | ] 184 | }, 185 | { 186 | "cell_type": "heading", 187 | "level": 3, 188 | "metadata": {}, 189 | "source": [ 190 | "Simple Coding" 191 | ] 192 | }, 193 | { 194 | "cell_type": "raw", 195 | "metadata": {}, 196 | "source": [ 197 | "Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factors. Patsy doesn't have the Simple contrast included, but you can easily define your own contrasts. To do so, write a class that contains a code_with_intercept and a code_without_intercept method that returns a patsy.contrast.ContrastMatrix instance" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "collapsed": false, 203 | "input": [ 204 | "from patsy.contrasts import ContrastMatrix\n", 205 | "\n", 206 | "def _name_levels(prefix, levels):\n", 207 | " return [\"[%s%s]\" % (prefix, level) for level in levels]\n", 208 | "\n", 209 | "class Simple(object):\n", 210 | " def _simple_contrast(self, levels):\n", 211 | " nlevels = len(levels)\n", 212 | " contr = -1./nlevels * np.ones((nlevels, nlevels-1))\n", 213 | " contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels\n", 214 | " return contr\n", 215 | "\n", 216 | " def code_with_intercept(self, levels):\n", 217 | " contrast = np.column_stack((np.ones(len(levels)),\n", 218 | " self._simple_contrast(levels)))\n", 219 | " return ContrastMatrix(contrast, _name_levels(\"Simp.\", levels))\n", 220 | "\n", 221 | " def code_without_intercept(self, levels):\n", 222 | " contrast = self._simple_contrast(levels)\n", 223 | " return ContrastMatrix(contrast, _name_levels(\"Simp.\", levels[:-1]))" 224 | ], 225 | "language": "python", 226 | "metadata": {}, 227 | "outputs": [] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "collapsed": false, 232 | "input": [ 233 | "hsb2.groupby('race')['write'].mean().mean()" 234 | ], 235 | "language": "python", 236 | "metadata": {}, 237 | "outputs": [] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "collapsed": false, 242 | "input": [ 243 | "contrast = Simple().code_without_intercept(levels)\n", 244 | "print contrast.matrix" 245 | ], 246 | "language": "python", 247 | "metadata": {}, 248 | "outputs": [] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "collapsed": false, 253 | "input": [ 254 | "mod = ols(\"write ~ C(race, Simple)\", data=hsb2)\n", 255 | "res = mod.fit()\n", 256 | "print res.summary()" 257 | ], 258 | "language": "python", 259 | "metadata": {}, 260 | "outputs": [] 261 | }, 262 | { 263 | "cell_type": "heading", 264 | "level": 3, 265 | "metadata": {}, 266 | "source": [ 267 | "Sum (Deviation) Coding" 268 | ] 269 | }, 270 | { 271 | "cell_type": "raw", 272 | "metadata": {}, 273 | "source": [ 274 | "Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. 
That is, it uses contrasts between each of the first k-1 levels and level k In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others." 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "collapsed": false, 280 | "input": [ 281 | "from patsy.contrasts import Sum\n", 282 | "contrast = Sum().code_without_intercept(levels)\n", 283 | "print contrast.matrix" 284 | ], 285 | "language": "python", 286 | "metadata": {}, 287 | "outputs": [] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "collapsed": false, 292 | "input": [ 293 | "mod = ols(\"write ~ C(race, Sum)\", data=hsb2)\n", 294 | "res = mod.fit()\n", 295 | "print res.summary()" 296 | ], 297 | "language": "python", 298 | "metadata": {}, 299 | "outputs": [] 300 | }, 301 | { 302 | "cell_type": "raw", 303 | "metadata": {}, 304 | "source": [ 305 | "This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean where the grand mean is the mean of means of the dependent variable by each level." 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "collapsed": false, 311 | "input": [ 312 | "hsb2.groupby('race')['write'].mean().mean()" 313 | ], 314 | "language": "python", 315 | "metadata": {}, 316 | "outputs": [] 317 | }, 318 | { 319 | "cell_type": "heading", 320 | "level": 3, 321 | "metadata": {}, 322 | "source": [ 323 | "Backward Difference Coding" 324 | ] 325 | }, 326 | { 327 | "cell_type": "raw", 328 | "metadata": {}, 329 | "source": [ 330 | "In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable." 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "collapsed": false, 336 | "input": [ 337 | "from patsy.contrasts import Diff\n", 338 | "contrast = Diff().code_without_intercept(levels)\n", 339 | "print contrast.matrix" 340 | ], 341 | "language": "python", 342 | "metadata": {}, 343 | "outputs": [] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "collapsed": false, 348 | "input": [ 349 | "mod = ols(\"write ~ C(race, Diff)\", data=hsb2)\n", 350 | "res = mod.fit()\n", 351 | "print res.summary()" 352 | ], 353 | "language": "python", 354 | "metadata": {}, 355 | "outputs": [] 356 | }, 357 | { 358 | "cell_type": "raw", 359 | "metadata": {}, 360 | "source": [ 361 | "For example, here the coefficient on level 1 is the mean of `write` at level 2 compared with the mean at level 1. Ie.," 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "collapsed": false, 367 | "input": [ 368 | "res.params[\"C(race, Diff)[D.1]\"]\n", 369 | "hsb2.groupby('race').mean()[\"write\"][2] - \\\n", 370 | " hsb2.groupby('race').mean()[\"write\"][1]" 371 | ], 372 | "language": "python", 373 | "metadata": {}, 374 | "outputs": [] 375 | }, 376 | { 377 | "cell_type": "heading", 378 | "level": 3, 379 | "metadata": {}, 380 | "source": [ 381 | "Helmert Coding" 382 | ] 383 | }, 384 | { 385 | "cell_type": "raw", 386 | "metadata": {}, 387 | "source": [ 388 | "Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence, the name 'reverse' being sometimes applied to differentiate from forward Helmert coding. 
This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so:" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "collapsed": false, 394 | "input": [ 395 | "from patsy.contrasts import Helmert\n", 396 | "contrast = Helmert().code_without_intercept(levels)\n", 397 | "print contrast.matrix" 398 | ], 399 | "language": "python", 400 | "metadata": {}, 401 | "outputs": [] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "collapsed": false, 406 | "input": [ 407 | "mod = ols(\"write ~ C(race, Helmert)\", data=hsb2)\n", 408 | "res = mod.fit()\n", 409 | "print res.summary()" 410 | ], 411 | "language": "python", 412 | "metadata": {}, 413 | "outputs": [] 414 | }, 415 | { 416 | "cell_type": "raw", 417 | "metadata": {}, 418 | "source": [ 419 | "To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels taken from the mean at level 4" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "collapsed": false, 425 | "input": [ 426 | "grouped = hsb2.groupby('race')\n", 427 | "grouped.mean()[\"write\"][4] - grouped.mean()[\"write\"][:3].mean()" 428 | ], 429 | "language": "python", 430 | "metadata": {}, 431 | "outputs": [] 432 | }, 433 | { 434 | "cell_type": "raw", 435 | "metadata": {}, 436 | "source": [ 437 | "As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same." 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "collapsed": false, 443 | "input": [ 444 | "k = 4\n", 445 | "1./k * (grouped.mean()[\"write\"][k] - grouped.mean()[\"write\"][:k-1].mean())\n", 446 | "k = 3\n", 447 | "1./k * (grouped.mean()[\"write\"][k] - grouped.mean()[\"write\"][:k-1].mean())" 448 | ], 449 | "language": "python", 450 | "metadata": {}, 451 | "outputs": [] 452 | }, 453 | { 454 | "cell_type": "heading", 455 | "level": 3, 456 | "metadata": {}, 457 | "source": [ 458 | "Orthogonal Polynomial Coding" 459 | ] 460 | }, 461 | { 462 | "cell_type": "raw", 463 | "metadata": {}, 464 | "source": [ 465 | "The coefficients taken on by polynomial coding for `k=4` levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order `k-1`. Since `race` is not an ordered factor variable let's use `read` as an example. First we need to create an ordered categorical from `read`." 
466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "collapsed": false, 471 | "input": [ 472 | "hsb2['readcat'] = pandas.cut(hsb2.read, bins=3)\n", 473 | "hsb2.groupby('readcat').mean()['write']" 474 | ], 475 | "language": "python", 476 | "metadata": {}, 477 | "outputs": [] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "collapsed": false, 482 | "input": [ 483 | "from patsy.contrasts import Poly\n", 484 | "levels = hsb2.readcat.unique().tolist()\n", 485 | "contrast = Poly().code_without_intercept(levels)\n", 486 | "print contrast.matrix" 487 | ], 488 | "language": "python", 489 | "metadata": {}, 490 | "outputs": [] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "collapsed": false, 495 | "input": [ 496 | "mod = ols(\"write ~ C(readcat, Poly)\", data=hsb2)\n", 497 | "res = mod.fit()\n", 498 | "print res.summary()" 499 | ], 500 | "language": "python", 501 | "metadata": {}, 502 | "outputs": [] 503 | }, 504 | { 505 | "cell_type": "raw", 506 | "metadata": {}, 507 | "source": [ 508 | "As you can see, readcat has a significant linear effect on the dependent variable `write` but not a significant quadratic or cubic effect." 509 | ] 510 | } 511 | ], 512 | "metadata": {} 513 | } 514 | ] 515 | } -------------------------------------------------------------------------------- /contrasts.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # Contrasts Overview 7 | 8 | # 9 | 10 | import statsmodels.api as sm 11 | 12 | # 13 | 14 | # This document is based heavily on this excellent resource from UCLA http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm 15 | 16 | # 17 | 18 | # A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses. 19 | # 20 | # In fact, the dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of *contrasts* for a categorical variable with `k` levels is a set of `k-1` functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. The dummy coding isn't wrong *per se*. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients such as in ANOVA. Linear regression models do not assume independence of the coefficients and thus dummy coding is often the only coding that is taught in this context. 21 | # 22 | # To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data. 23 | 24 | # 25 | 26 | # Example Data 27 | 28 | # 29 | 30 | import pandas 31 | url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv' 32 | hsb2 = pandas.read_table(url, delimiter=",") 33 | 34 | # 35 | 36 | hsb2.head(10) 37 | 38 | # 39 | 40 | # It will be instructive to look at the mean of the dependent variable, write, for each level of race ((1 = Hispanic, 2 = Asian, 3 = African American and 4 = Caucasian)). 
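# A small illustration (a sketch, using the `hsb2` data loaded above): with Treatment
# (dummy) coding and level 1 (Hispanic) as the reference, the intercept of the
# Treatment-coded OLS fit shown later in this script is the reference-level mean of
# `write`, and each coefficient is that level's mean minus the reference mean, so the
# group means below already determine those estimates.
means = hsb2.groupby('race')['write'].mean()
print means[1]            # expected intercept under Treatment coding
print means - means[1]    # expected Treatment coefficients (the reference entry is zero)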
41 | 42 | # 43 | 44 | hsb2.groupby('race')['write'].mean() 45 | 46 | # 47 | 48 | # Treatment (Dummy) Coding 49 | 50 | # 51 | 52 | # Dummy coding is likely the most well known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be 53 | 54 | # 55 | 56 | from patsy.contrasts import Treatment 57 | levels = [1,2,3,4] 58 | contrast = Treatment(reference=0).code_without_intercept(levels) 59 | print contrast.matrix 60 | 61 | # 62 | 63 | # Here we used `reference=0`, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the `race` variable. 64 | 65 | # 66 | 67 | hsb2.race.head(10) 68 | 69 | # 70 | 71 | print contrast.matrix[hsb2.race-1, :][:20] 72 | 73 | # 74 | 75 | sm.categorical(hsb2.race.values) 76 | 77 | # 78 | 79 | # This is a bit of a trick, as the `race` category conveniently maps to zero-based indices. If it does not, this conversion happens under the hood, so this won't work in general but nonetheless is a useful exercise to fix ideas. The below illustrates the output using the three contrasts above 80 | 81 | # 82 | 83 | from statsmodels.formula.api import ols 84 | mod = ols("write ~ C(race, Treatment)", data=hsb2) 85 | res = mod.fit() 86 | print res.summary() 87 | 88 | # 89 | 90 | # We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted this. 91 | 92 | # 93 | 94 | # Simple Coding 95 | 96 | # 97 | 98 | # Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factors. Patsy doesn't have the Simple contrast included, but you can easily define your own contrasts. 
To do so, write a class that contains a code_with_intercept and a code_without_intercept method that returns a patsy.contrast.ContrastMatrix instance 99 | 100 | # 101 | 102 | from patsy.contrasts import ContrastMatrix 103 | 104 | def _name_levels(prefix, levels): 105 | return ["[%s%s]" % (prefix, level) for level in levels] 106 | 107 | class Simple(object): 108 | def _simple_contrast(self, levels): 109 | nlevels = len(levels) 110 | contr = -1./nlevels * np.ones((nlevels, nlevels-1)) 111 | contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels 112 | return contr 113 | 114 | def code_with_intercept(self, levels): 115 | contrast = np.column_stack((np.ones(len(levels)), 116 | self._simple_contrast(levels))) 117 | return ContrastMatrix(contrast, _name_levels("Simp.", levels)) 118 | 119 | def code_without_intercept(self, levels): 120 | contrast = self._simple_contrast(levels) 121 | return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1])) 122 | 123 | # 124 | 125 | hsb2.groupby('race')['write'].mean().mean() 126 | 127 | # 128 | 129 | contrast = Simple().code_without_intercept(levels) 130 | print contrast.matrix 131 | 132 | # 133 | 134 | mod = ols("write ~ C(race, Simple)", data=hsb2) 135 | res = mod.fit() 136 | print res.summary() 137 | 138 | # 139 | 140 | # Sum (Deviation) Coding 141 | 142 | # 143 | 144 | # Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others. 145 | 146 | # 147 | 148 | from patsy.contrasts import Sum 149 | contrast = Sum().code_without_intercept(levels) 150 | print contrast.matrix 151 | 152 | # 153 | 154 | mod = ols("write ~ C(race, Sum)", data=hsb2) 155 | res = mod.fit() 156 | print res.summary() 157 | 158 | # 159 | 160 | # This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean where the grand mean is the mean of means of the dependent variable by each level. 161 | 162 | # 163 | 164 | hsb2.groupby('race')['write'].mean().mean() 165 | 166 | # 167 | 168 | # Backward Difference Coding 169 | 170 | # 171 | 172 | # In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable. 173 | 174 | # 175 | 176 | from patsy.contrasts import Diff 177 | contrast = Diff().code_without_intercept(levels) 178 | print contrast.matrix 179 | 180 | # 181 | 182 | mod = ols("write ~ C(race, Diff)", data=hsb2) 183 | res = mod.fit() 184 | print res.summary() 185 | 186 | # 187 | 188 | # For example, here the coefficient on level 1 is the mean of `write` at level 2 compared with the mean at level 1. Ie., 189 | 190 | # 191 | 192 | res.params["C(race, Diff)[D.1]"] 193 | hsb2.groupby('race').mean()["write"][2] - \ 194 | hsb2.groupby('race').mean()["write"][1] 195 | 196 | # 197 | 198 | # Helmert Coding 199 | 200 | # 201 | 202 | # Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence, the name 'reverse' being sometimes applied to differentiate from forward Helmert coding. 
This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so: 203 | 204 | # 205 | 206 | from patsy.contrasts import Helmert 207 | contrast = Helmert().code_without_intercept(levels) 208 | print contrast.matrix 209 | 210 | # 211 | 212 | mod = ols("write ~ C(race, Helmert)", data=hsb2) 213 | res = mod.fit() 214 | print res.summary() 215 | 216 | # 217 | 218 | # To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels taken from the mean at level 4 219 | 220 | # 221 | 222 | grouped = hsb2.groupby('race') 223 | grouped.mean()["write"][4] - grouped.mean()["write"][:3].mean() 224 | 225 | # 226 | 227 | # As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same. 228 | 229 | # 230 | 231 | k = 4 232 | 1./k * (grouped.mean()["write"][k] - grouped.mean()["write"][:k-1].mean()) 233 | k = 3 234 | 1./k * (grouped.mean()["write"][k] - grouped.mean()["write"][:k-1].mean()) 235 | 236 | # 237 | 238 | # Orthogonal Polynomial Coding 239 | 240 | # 241 | 242 | # The coefficients taken on by polynomial coding for `k=4` levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order `k-1`. Since `race` is not an ordered factor variable let's use `read` as an example. First we need to create an ordered categorical from `read`. 243 | 244 | # 245 | 246 | hsb2['readcat'] = pandas.cut(hsb2.read, bins=3) 247 | hsb2.groupby('readcat').mean()['write'] 248 | 249 | # 250 | 251 | from patsy.contrasts import Poly 252 | levels = hsb2.readcat.unique().tolist() 253 | contrast = Poly().code_without_intercept(levels) 254 | print contrast.matrix 255 | 256 | # 257 | 258 | mod = ols("write ~ C(readcat, Poly)", data=hsb2) 259 | res = mod.fit() 260 | print res.summary() 261 | 262 | # 263 | 264 | # As you can see, readcat has a significant linear effect on the dependent variable `write` but not a significant quadratic or cubic effect. 265 | 266 | -------------------------------------------------------------------------------- /discrete_choice.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "discrete_choice" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 2, 13 | "metadata": {}, 14 | "source": [ 15 | "Discrete Choice Models - Fair's Affair data" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "A survey of women only was conducted in 1974 by *Redbook* asking about extramarital affairs." 
23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "collapsed": false, 28 | "input": [ 29 | "import numpy as np\n", 30 | "from scipy import stats\n", 31 | "import matplotlib.pyplot as plt\n", 32 | "import statsmodels.api as sm\n", 33 | "from statsmodels.formula.api import logit, probit, poisson, ols" 34 | ], 35 | "language": "python", 36 | "metadata": {}, 37 | "outputs": [] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "collapsed": false, 42 | "input": [ 43 | "print sm.datasets.fair.SOURCE" 44 | ], 45 | "language": "python", 46 | "metadata": {}, 47 | "outputs": [] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "collapsed": false, 52 | "input": [ 53 | "print sm.datasets.fair.NOTE" 54 | ], 55 | "language": "python", 56 | "metadata": {}, 57 | "outputs": [] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "collapsed": false, 62 | "input": [ 63 | "dta = sm.datasets.fair.load_pandas().data" 64 | ], 65 | "language": "python", 66 | "metadata": {}, 67 | "outputs": [] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "collapsed": false, 72 | "input": [ 73 | "dta['affair'] = (dta['affairs'] > 0).astype(float)\n", 74 | "print dta.head(10)" 75 | ], 76 | "language": "python", 77 | "metadata": {}, 78 | "outputs": [] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "collapsed": false, 83 | "input": [ 84 | "print dta.describe()" 85 | ], 86 | "language": "python", 87 | "metadata": {}, 88 | "outputs": [] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "collapsed": false, 93 | "input": [ 94 | "affair_mod = logit(\"affair ~ occupation + educ + occupation_husb\" \n", 95 | " \"+ rate_marriage + age + yrs_married + children\"\n", 96 | " \" + religious\", dta).fit()" 97 | ], 98 | "language": "python", 99 | "metadata": {}, 100 | "outputs": [] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "collapsed": false, 105 | "input": [ 106 | "print affair_mod.summary()" 107 | ], 108 | "language": "python", 109 | "metadata": {}, 110 | "outputs": [] 111 | }, 112 | { 113 | "cell_type": "raw", 114 | "metadata": {}, 115 | "source": [ 116 | "How well are we predicting?" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "collapsed": false, 122 | "input": [ 123 | "affair_mod.pred_table()" 124 | ], 125 | "language": "python", 126 | "metadata": {}, 127 | "outputs": [] 128 | }, 129 | { 130 | "cell_type": "raw", 131 | "metadata": {}, 132 | "source": [ 133 | "The coefficients of the discrete choice model do not tell us much. What we're after is marginal effects." 
134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "collapsed": false, 139 | "input": [ 140 | "mfx = affair_mod.get_margeff()\n", 141 | "print mfx.summary()" 142 | ], 143 | "language": "python", 144 | "metadata": {}, 145 | "outputs": [] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "collapsed": false, 150 | "input": [ 151 | "respondent1000 = dta.ix[1000]\n", 152 | "print respondent1000" 153 | ], 154 | "language": "python", 155 | "metadata": {}, 156 | "outputs": [] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "collapsed": false, 161 | "input": [ 162 | "resp = dict(zip(range(1,9), respondent1000[[\"occupation\", \"educ\", \n", 163 | " \"occupation_husb\", \"rate_marriage\", \n", 164 | " \"age\", \"yrs_married\", \"children\", \n", 165 | " \"religious\"]].tolist()))\n", 166 | "resp.update({0 : 1})\n", 167 | "print resp" 168 | ], 169 | "language": "python", 170 | "metadata": {}, 171 | "outputs": [] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "collapsed": false, 176 | "input": [ 177 | "mfx = affair_mod.get_margeff(atexog=resp)\n", 178 | "print mfx.summary()" 179 | ], 180 | "language": "python", 181 | "metadata": {}, 182 | "outputs": [] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "collapsed": false, 187 | "input": [ 188 | "affair_mod.predict(respondent1000)" 189 | ], 190 | "language": "python", 191 | "metadata": {}, 192 | "outputs": [] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "collapsed": false, 197 | "input": [ 198 | "affair_mod.fittedvalues[1000]" 199 | ], 200 | "language": "python", 201 | "metadata": {}, 202 | "outputs": [] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "collapsed": false, 207 | "input": [ 208 | "affair_mod.model.cdf(affair_mod.fittedvalues[1000])" 209 | ], 210 | "language": "python", 211 | "metadata": {}, 212 | "outputs": [] 213 | }, 214 | { 215 | "cell_type": "raw", 216 | "metadata": {}, 217 | "source": [ 218 | "The \"correct\" model here is likely the Tobit model. We have an work in progress branch \"tobit-model\" on github, if anyone is interested in censored regression models." 219 | ] 220 | }, 221 | { 222 | "cell_type": "heading", 223 | "level": 3, 224 | "metadata": {}, 225 | "source": [ 226 | "Exercise: Logit vs Probit" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "collapsed": false, 232 | "input": [ 233 | "fig = plt.figure(figsize=(12,8))\n", 234 | "ax = fig.add_subplot(111)\n", 235 | "support = np.linspace(-6, 6, 1000)\n", 236 | "ax.plot(support, stats.logistic.cdf(support), 'r-', label='Logistic')\n", 237 | "ax.plot(support, stats.norm.cdf(support), label='Probit')\n", 238 | "ax.legend();" 239 | ], 240 | "language": "python", 241 | "metadata": {}, 242 | "outputs": [] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "collapsed": false, 247 | "input": [ 248 | "fig = plt.figure(figsize=(12,8))\n", 249 | "ax = fig.add_subplot(111)\n", 250 | "support = np.linspace(-6, 6, 1000)\n", 251 | "ax.plot(support, stats.logistic.pdf(support), 'r-', label='Logistic')\n", 252 | "ax.plot(support, stats.norm.pdf(support), label='Probit')\n", 253 | "ax.legend();" 254 | ], 255 | "language": "python", 256 | "metadata": {}, 257 | "outputs": [] 258 | }, 259 | { 260 | "cell_type": "raw", 261 | "metadata": {}, 262 | "source": [ 263 | "Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Much difference in marginal effects?" 
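# One possible approach to the exercise (a sketch, not part of the original notebook;
# it reuses `probit`, `dta`, and the formula from the cells above):
probit_mod = probit("affair ~ occupation + educ + occupation_husb"
                    "+ rate_marriage + age + yrs_married + children"
                    " + religious", dta).fit()
print probit_mod.pred_table()                # compare with affair_mod.pred_table()
print probit_mod.get_margeff().summary()     # compare with the logit marginal effects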
264 | ] 265 | }, 266 | { 267 | "cell_type": "heading", 268 | "level": 3, 269 | "metadata": {}, 270 | "source": [ 271 | "Genarlized Linear Model Example" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "collapsed": false, 277 | "input": [ 278 | "print sm.datasets.star98.SOURCE" 279 | ], 280 | "language": "python", 281 | "metadata": {}, 282 | "outputs": [] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "collapsed": false, 287 | "input": [ 288 | "print sm.datasets.star98.DESCRLONG" 289 | ], 290 | "language": "python", 291 | "metadata": {}, 292 | "outputs": [] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "collapsed": false, 297 | "input": [ 298 | "print sm.datasets.star98.NOTE" 299 | ], 300 | "language": "python", 301 | "metadata": {}, 302 | "outputs": [] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "collapsed": false, 307 | "input": [ 308 | "dta = sm.datasets.star98.load_pandas().data\n", 309 | "print dta.columns" 310 | ], 311 | "language": "python", 312 | "metadata": {}, 313 | "outputs": [] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "collapsed": false, 318 | "input": [ 319 | "print dta[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP', 'PERMINTE']].head(10)" 320 | ], 321 | "language": "python", 322 | "metadata": {}, 323 | "outputs": [] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "collapsed": false, 328 | "input": [ 329 | "print dta[['AVYRSEXP', 'AVSALK', 'PERSPENK', 'PTRATIO', 'PCTAF', 'PCTCHRT', 'PCTYRRND']].head(10)" 330 | ], 331 | "language": "python", 332 | "metadata": {}, 333 | "outputs": [] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "collapsed": false, 338 | "input": [ 339 | "formula = 'NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT '\n", 340 | "formula += '+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'" 341 | ], 342 | "language": "python", 343 | "metadata": {}, 344 | "outputs": [] 345 | }, 346 | { 347 | "cell_type": "heading", 348 | "level": 4, 349 | "metadata": {}, 350 | "source": [ 351 | "Aside: Binomial distribution" 352 | ] 353 | }, 354 | { 355 | "cell_type": "raw", 356 | "metadata": {}, 357 | "source": [ 358 | "Toss a six-sided die 5 times, what's the probability of exactly 2 fours?" 
359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "collapsed": false, 364 | "input": [ 365 | "stats.binom(5, 1./6).pmf(2)" 366 | ], 367 | "language": "python", 368 | "metadata": {}, 369 | "outputs": [] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "collapsed": false, 374 | "input": [ 375 | "from scipy.misc import comb\n", 376 | "comb(5,2) * (1/6.)**2 * (5/6.)**3" 377 | ], 378 | "language": "python", 379 | "metadata": {}, 380 | "outputs": [] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "collapsed": false, 385 | "input": [ 386 | "from statsmodels.formula.api import glm\n", 387 | "glm_mod = glm(formula, dta, family=sm.families.Binomial()).fit()" 388 | ], 389 | "language": "python", 390 | "metadata": {}, 391 | "outputs": [] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "collapsed": false, 396 | "input": [ 397 | "print glm_mod.summary()" 398 | ], 399 | "language": "python", 400 | "metadata": {}, 401 | "outputs": [] 402 | }, 403 | { 404 | "cell_type": "raw", 405 | "metadata": {}, 406 | "source": [ 407 | "The number of trials " 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "collapsed": false, 413 | "input": [ 414 | "glm_mod.model.data.orig_endog.sum(1)" 415 | ], 416 | "language": "python", 417 | "metadata": {}, 418 | "outputs": [] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "collapsed": false, 423 | "input": [ 424 | "glm_mod.fittedvalues * glm_mod.model.data.orig_endog.sum(1)" 425 | ], 426 | "language": "python", 427 | "metadata": {}, 428 | "outputs": [] 429 | }, 430 | { 431 | "cell_type": "raw", 432 | "metadata": {}, 433 | "source": [ 434 | "First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact\n", 435 | "on the response variables:" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "collapsed": false, 441 | "input": [ 442 | "exog = glm_mod.model.data.orig_exog # get the dataframe" 443 | ], 444 | "language": "python", 445 | "metadata": {}, 446 | "outputs": [] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "collapsed": false, 451 | "input": [ 452 | "means25 = exog.mean()\n", 453 | "print means25" 454 | ], 455 | "language": "python", 456 | "metadata": {}, 457 | "outputs": [] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "collapsed": false, 462 | "input": [ 463 | "means25['LOWINC'] = exog['LOWINC'].quantile(.25)\n", 464 | "print means25" 465 | ], 466 | "language": "python", 467 | "metadata": {}, 468 | "outputs": [] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "collapsed": false, 473 | "input": [ 474 | "means75 = exog.mean()\n", 475 | "means75['LOWINC'] = exog['LOWINC'].quantile(.75)\n", 476 | "print means75" 477 | ], 478 | "language": "python", 479 | "metadata": {}, 480 | "outputs": [] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "collapsed": false, 485 | "input": [ 486 | "resp25 = glm_mod.predict(means25)\n", 487 | "resp75 = glm_mod.predict(means75)\n", 488 | "diff = resp75 - resp25" 489 | ], 490 | "language": "python", 491 | "metadata": {}, 492 | "outputs": [] 493 | }, 494 | { 495 | "cell_type": "raw", 496 | "metadata": {}, 497 | "source": [ 498 | "The interquartile first difference for the percentage of low income households in a school district is:" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "collapsed": false, 504 | "input": [ 505 | "print \"%2.4f%%\" % (diff[0]*100)" 506 | ], 507 | "language": "python", 508 | "metadata": {}, 509 | "outputs": [] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "collapsed": false, 514 | 
"input": [ 515 | "nobs = glm_mod.nobs\n", 516 | "y = glm_mod.model.endog\n", 517 | "yhat = glm_mod.mu" 518 | ], 519 | "language": "python", 520 | "metadata": {}, 521 | "outputs": [] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "collapsed": false, 526 | "input": [ 527 | "from statsmodels.graphics.api import abline_plot\n", 528 | "fig = plt.figure(figsize=(12,8))\n", 529 | "ax = fig.add_subplot(111, ylabel='Observed Values', xlabel='Fitted Values')\n", 530 | "ax.scatter(yhat, y)\n", 531 | "y_vs_yhat = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()\n", 532 | "fig = abline_plot(model_results=y_vs_yhat, ax=ax)" 533 | ], 534 | "language": "python", 535 | "metadata": {}, 536 | "outputs": [] 537 | }, 538 | { 539 | "cell_type": "heading", 540 | "level": 4, 541 | "metadata": {}, 542 | "source": [ 543 | "Plot fitted values vs Pearson residuals" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "Pearson residuals are defined to be \n", 551 | "\n", 552 | "$$\\frac{(y - \\mu)}{\\sqrt{(var(\\mu))}}$$\n", 553 | "\n", 554 | "where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "collapsed": false, 560 | "input": [ 561 | "fig = plt.figure(figsize=(12,8))\n", 562 | "ax = fig.add_subplot(111, title='Residual Dependence Plot', xlabel='Fitted Values',\n", 563 | " ylabel='Pearson Residuals')\n", 564 | "ax.scatter(yhat, stats.zscore(glm_mod.resid_pearson))\n", 565 | "ax.axis('tight')\n", 566 | "ax.plot([0.0, 1.0],[0.0, 0.0], 'k-');" 567 | ], 568 | "language": "python", 569 | "metadata": {}, 570 | "outputs": [] 571 | }, 572 | { 573 | "cell_type": "heading", 574 | "level": 4, 575 | "metadata": {}, 576 | "source": [ 577 | "Histogram of standardized deviance residuals with Kernel Density Estimate overlayed" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "The definition of the deviance residuals depends on the family. 
For the Binomial distribution this is \n", 585 | "\n", 586 | "$$r_{dev} = sign\\(Y-\\mu\\)*\\sqrt{2n(Y\\log\\frac{Y}{\\mu}+(1-Y)\\log\\frac{(1-Y)}{(1-\\mu)}}$$\n", 587 | "\n", 588 | "They can be used to detect ill-fitting covariates" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "collapsed": false, 594 | "input": [ 595 | "resid = glm_mod.resid_deviance\n", 596 | "resid_std = stats.zscore(resid) \n", 597 | "kde_resid = sm.nonparametric.KDEUnivariate(resid_std)\n", 598 | "kde_resid.fit()" 599 | ], 600 | "language": "python", 601 | "metadata": {}, 602 | "outputs": [] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "collapsed": false, 607 | "input": [ 608 | "fig = plt.figure(figsize=(12,8))\n", 609 | "ax = fig.add_subplot(111, title=\"Standardized Deviance Residuals\")\n", 610 | "ax.hist(resid_std, bins=25, normed=True);\n", 611 | "ax.plot(kde_resid.support, kde_resid.density, 'r');" 612 | ], 613 | "language": "python", 614 | "metadata": {}, 615 | "outputs": [] 616 | }, 617 | { 618 | "cell_type": "heading", 619 | "level": 4, 620 | "metadata": {}, 621 | "source": [ 622 | "QQ-plot of deviance residuals" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "collapsed": false, 628 | "input": [ 629 | "fig = plt.figure(figsize=(12,8))\n", 630 | "ax = fig.add_subplot(111)\n", 631 | "fig = sm.graphics.qqplot(resid, line='r', ax=ax)" 632 | ], 633 | "language": "python", 634 | "metadata": {}, 635 | "outputs": [] 636 | } 637 | ], 638 | "metadata": {} 639 | } 640 | ] 641 | } -------------------------------------------------------------------------------- /discrete_choice.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # Discrete Choice Models - Fair's Affair data 7 | 8 | # 9 | 10 | # A survey of women only was conducted in 1974 by *Redbook* asking about extramarital affairs. 11 | 12 | # 13 | 14 | import numpy as np 15 | from scipy import stats 16 | import matplotlib.pyplot as plt 17 | import statsmodels.api as sm 18 | from statsmodels.formula.api import logit, probit, poisson, ols 19 | 20 | # 21 | 22 | print sm.datasets.fair.SOURCE 23 | 24 | # 25 | 26 | print sm.datasets.fair.NOTE 27 | 28 | # 29 | 30 | dta = sm.datasets.fair.load_pandas().data 31 | 32 | # 33 | 34 | dta['affair'] = (dta['affairs'] > 0).astype(float) 35 | print dta.head(10) 36 | 37 | # 38 | 39 | print dta.describe() 40 | 41 | # 42 | 43 | affair_mod = logit("affair ~ occupation + educ + occupation_husb" 44 | "+ rate_marriage + age + yrs_married + children" 45 | " + religious", dta).fit() 46 | 47 | # 48 | 49 | print affair_mod.summary() 50 | 51 | # 52 | 53 | # How well are we predicting? 54 | 55 | # 56 | 57 | affair_mod.pred_table() 58 | 59 | # 60 | 61 | # The coefficients of the discrete choice model do not tell us much. What we're after is marginal effects. 
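# A quick cross-check (a sketch; assumes `affair_mod` is the fitted logit above): for a
# logit, the marginal effect of a continuous regressor k at x is beta_k * p(x) * (1 - p(x)),
# so averaging p*(1-p) over the sample and scaling by the coefficients should roughly
# reproduce get_margeff()'s default "overall" output (the intercept is not a marginal
# effect and is dropped by get_margeff).
p = affair_mod.predict()
print affair_mod.params * (p * (1 - p)).mean()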
62 | 63 | # 64 | 65 | mfx = affair_mod.get_margeff() 66 | print mfx.summary() 67 | 68 | # 69 | 70 | respondent1000 = dta.ix[1000] 71 | print respondent1000 72 | 73 | # 74 | 75 | resp = dict(zip(range(1,9), respondent1000[["occupation", "educ", 76 | "occupation_husb", "rate_marriage", 77 | "age", "yrs_married", "children", 78 | "religious"]].tolist())) 79 | resp.update({0 : 1}) 80 | print resp 81 | 82 | # 83 | 84 | mfx = affair_mod.get_margeff(atexog=resp) 85 | print mfx.summary() 86 | 87 | # 88 | 89 | affair_mod.predict(respondent1000) 90 | 91 | # 92 | 93 | affair_mod.fittedvalues[1000] 94 | 95 | # 96 | 97 | affair_mod.model.cdf(affair_mod.fittedvalues[1000]) 98 | 99 | # 100 | 101 | # The "correct" model here is likely the Tobit model. We have an work in progress branch "tobit-model" on github, if anyone is interested in censored regression models. 102 | 103 | # 104 | 105 | # Exercise: Logit vs Probit 106 | 107 | # 108 | 109 | fig = plt.figure(figsize=(12,8)) 110 | ax = fig.add_subplot(111) 111 | support = np.linspace(-6, 6, 1000) 112 | ax.plot(support, stats.logistic.cdf(support), 'r-', label='Logistic') 113 | ax.plot(support, stats.norm.cdf(support), label='Probit') 114 | ax.legend(); 115 | 116 | # 117 | 118 | fig = plt.figure(figsize=(12,8)) 119 | ax = fig.add_subplot(111) 120 | support = np.linspace(-6, 6, 1000) 121 | ax.plot(support, stats.logistic.pdf(support), 'r-', label='Logistic') 122 | ax.plot(support, stats.norm.pdf(support), label='Probit') 123 | ax.legend(); 124 | 125 | # 126 | 127 | # Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Much difference in marginal effects? 128 | 129 | # 130 | 131 | # Genarlized Linear Model Example 132 | 133 | # 134 | 135 | print sm.datasets.star98.SOURCE 136 | 137 | # 138 | 139 | print sm.datasets.star98.DESCRLONG 140 | 141 | # 142 | 143 | print sm.datasets.star98.NOTE 144 | 145 | # 146 | 147 | dta = sm.datasets.star98.load_pandas().data 148 | print dta.columns 149 | 150 | # 151 | 152 | print dta[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP', 'PERMINTE']].head(10) 153 | 154 | # 155 | 156 | print dta[['AVYRSEXP', 'AVSALK', 'PERSPENK', 'PTRATIO', 'PCTAF', 'PCTCHRT', 'PCTYRRND']].head(10) 157 | 158 | # 159 | 160 | formula = 'NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT ' 161 | formula += '+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF' 162 | 163 | # 164 | 165 | # Aside: Binomial distribution 166 | 167 | # 168 | 169 | # Toss a six-sided die 5 times, what's the probability of exactly 2 fours? 
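# A quick simulation cross-check of the same probability (a sketch; the seed and sample
# size are arbitrary). The analytic answer computed below is about 0.1608.
np.random.seed(0)
rolls = np.random.randint(1, 7, size=(100000, 5))
print ((rolls == 4).sum(axis=1) == 2).mean()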
170 | 171 | # 172 | 173 | stats.binom(5, 1./6).pmf(2) 174 | 175 | # 176 | 177 | from scipy.misc import comb 178 | comb(5,2) * (1/6.)**2 * (5/6.)**3 179 | 180 | # 181 | 182 | from statsmodels.formula.api import glm 183 | glm_mod = glm(formula, dta, family=sm.families.Binomial()).fit() 184 | 185 | # 186 | 187 | print glm_mod.summary() 188 | 189 | # 190 | 191 | # The number of trials 192 | 193 | # 194 | 195 | glm_mod.model.data.orig_endog.sum(1) 196 | 197 | # 198 | 199 | glm_mod.fittedvalues * glm_mod.model.data.orig_endog.sum(1) 200 | 201 | # 202 | 203 | # First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact 204 | # on the response variables: 205 | 206 | # 207 | 208 | exog = glm_mod.model.data.orig_exog # get the dataframe 209 | 210 | # 211 | 212 | means25 = exog.mean() 213 | print means25 214 | 215 | # 216 | 217 | means25['LOWINC'] = exog['LOWINC'].quantile(.25) 218 | print means25 219 | 220 | # 221 | 222 | means75 = exog.mean() 223 | means75['LOWINC'] = exog['LOWINC'].quantile(.75) 224 | print means75 225 | 226 | # 227 | 228 | resp25 = glm_mod.predict(means25) 229 | resp75 = glm_mod.predict(means75) 230 | diff = resp75 - resp25 231 | 232 | # 233 | 234 | # The interquartile first difference for the percentage of low income households in a school district is: 235 | 236 | # 237 | 238 | print "%2.4f%%" % (diff[0]*100) 239 | 240 | # 241 | 242 | nobs = glm_mod.nobs 243 | y = glm_mod.model.endog 244 | yhat = glm_mod.mu 245 | 246 | # 247 | 248 | from statsmodels.graphics.api import abline_plot 249 | fig = plt.figure(figsize=(12,8)) 250 | ax = fig.add_subplot(111, ylabel='Observed Values', xlabel='Fitted Values') 251 | ax.scatter(yhat, y) 252 | y_vs_yhat = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit() 253 | fig = abline_plot(model_results=y_vs_yhat, ax=ax) 254 | 255 | # 256 | 257 | # Plot fitted values vs Pearson residuals 258 | 259 | # 260 | 261 | # Pearson residuals are defined to be 262 | # 263 | # $$\frac{(y - \mu)}{\sqrt{(var(\mu))}}$$ 264 | # 265 | # where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$ 266 | 267 | # 268 | 269 | fig = plt.figure(figsize=(12,8)) 270 | ax = fig.add_subplot(111, title='Residual Dependence Plot', xlabel='Fitted Values', 271 | ylabel='Pearson Residuals') 272 | ax.scatter(yhat, stats.zscore(glm_mod.resid_pearson)) 273 | ax.axis('tight') 274 | ax.plot([0.0, 1.0],[0.0, 0.0], 'k-'); 275 | 276 | # 277 | 278 | # Histogram of standardized deviance residuals with Kernel Density Estimate overlayed 279 | 280 | # 281 | 282 | # The definition of the deviance residuals depends on the family. 
For the Binomial distribution this is 283 | # 284 | # $$r_{dev} = sign\(Y-\mu\)*\sqrt{2n(Y\log\frac{Y}{\mu}+(1-Y)\log\frac{(1-Y)}{(1-\mu)}}$$ 285 | # 286 | # They can be used to detect ill-fitting covariates 287 | 288 | # 289 | 290 | resid = glm_mod.resid_deviance 291 | resid_std = stats.zscore(resid) 292 | kde_resid = sm.nonparametric.KDEUnivariate(resid_std) 293 | kde_resid.fit() 294 | 295 | # 296 | 297 | fig = plt.figure(figsize=(12,8)) 298 | ax = fig.add_subplot(111, title="Standardized Deviance Residuals") 299 | ax.hist(resid_std, bins=25, normed=True); 300 | ax.plot(kde_resid.support, kde_resid.density, 'r'); 301 | 302 | # 303 | 304 | # QQ-plot of deviance residuals 305 | 306 | # 307 | 308 | fig = plt.figure(figsize=(12,8)) 309 | ax = fig.add_subplot(111) 310 | fig = sm.graphics.qqplot(resid, line='r', ax=ax) 311 | 312 | -------------------------------------------------------------------------------- /generic_mle.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "generic_mle" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "code", 12 | "collapsed": false, 13 | "input": [ 14 | "import numpy as np\n", 15 | "from scipy import stats\n", 16 | "import statsmodels.api as sm\n", 17 | "from statsmodels.base.model import GenericLikelihoodModel" 18 | ], 19 | "language": "python", 20 | "metadata": {}, 21 | "outputs": [] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "collapsed": false, 26 | "input": [ 27 | "print sm.datasets.spector.NOTE" 28 | ], 29 | "language": "python", 30 | "metadata": {}, 31 | "outputs": [] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "collapsed": false, 36 | "input": [ 37 | "data = sm.datasets.spector.load_pandas()\n", 38 | "exog = sm.add_constant(data.exog, prepend=True)\n", 39 | "endog = data.endog" 40 | ], 41 | "language": "python", 42 | "metadata": {}, 43 | "outputs": [] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "collapsed": false, 48 | "input": [ 49 | "sm_probit = sm.Probit(endog, exog).fit()" 50 | ], 51 | "language": "python", 52 | "metadata": {}, 53 | "outputs": [] 54 | }, 55 | { 56 | "cell_type": "raw", 57 | "metadata": {}, 58 | "source": [ 59 | "* To create your own Likelihood Model, you just need to overwrite the loglike method." 
60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "collapsed": false, 65 | "input": [ 66 | "class MyProbit(GenericLikelihoodModel):\n", 67 | " def loglike(self, params):\n", 68 | " exog = self.exog\n", 69 | " endog = self.endog\n", 70 | " q = 2 * endog - 1\n", 71 | " return stats.norm.logcdf(q*np.dot(exog, params)).sum()" 72 | ], 73 | "language": "python", 74 | "metadata": {}, 75 | "outputs": [] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "collapsed": false, 80 | "input": [ 81 | "my_probit = MyProbit(endog, exog).fit()" 82 | ], 83 | "language": "python", 84 | "metadata": {}, 85 | "outputs": [] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "collapsed": false, 90 | "input": [ 91 | "print sm_probit.params" 92 | ], 93 | "language": "python", 94 | "metadata": {}, 95 | "outputs": [] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "collapsed": false, 100 | "input": [ 101 | "print sm_probit.cov_params()" 102 | ], 103 | "language": "python", 104 | "metadata": {}, 105 | "outputs": [] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "collapsed": false, 110 | "input": [ 111 | "print my_probit.params" 112 | ], 113 | "language": "python", 114 | "metadata": {}, 115 | "outputs": [] 116 | }, 117 | { 118 | "cell_type": "raw", 119 | "metadata": {}, 120 | "source": [ 121 | "You can get the variance-covariance of the parameters. Notice that we didn't have to provide Hessian or Score functions." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "collapsed": false, 127 | "input": [ 128 | "print my_probit.cov_params()" 129 | ], 130 | "language": "python", 131 | "metadata": {}, 132 | "outputs": [] 133 | } 134 | ], 135 | "metadata": {} 136 | } 137 | ] 138 | } -------------------------------------------------------------------------------- /generic_mle.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | import numpy as np 7 | from scipy import stats 8 | import statsmodels.api as sm 9 | from statsmodels.base.model import GenericLikelihoodModel 10 | 11 | # 12 | 13 | print sm.datasets.spector.NOTE 14 | 15 | # 16 | 17 | data = sm.datasets.spector.load_pandas() 18 | exog = sm.add_constant(data.exog, prepend=True) 19 | endog = data.endog 20 | 21 | # 22 | 23 | sm_probit = sm.Probit(endog, exog).fit() 24 | 25 | # 26 | 27 | # * To create your own Likelihood Model, you just need to overwrite the loglike method. 28 | 29 | # 30 | 31 | class MyProbit(GenericLikelihoodModel): 32 | def loglike(self, params): 33 | exog = self.exog 34 | endog = self.endog 35 | q = 2 * endog - 1 36 | return stats.norm.logcdf(q*np.dot(exog, params)).sum() 37 | 38 | # 39 | 40 | my_probit = MyProbit(endog, exog).fit() 41 | 42 | # 43 | 44 | print sm_probit.params 45 | 46 | # 47 | 48 | print sm_probit.cov_params() 49 | 50 | # 51 | 52 | print my_probit.params 53 | 54 | # 55 | 56 | # You can get the variance-covariance of the parameters. Notice that we didn't have to provide Hessian or Score functions. 
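# A small check (a sketch): the standard errors implied by the numerically evaluated
# Hessian of MyProbit should be close to the analytic Probit ones from sm.Probit above.
print np.sqrt(np.diag(my_probit.cov_params()))
print sm_probit.bse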
57 | 58 | # 59 | 60 | print my_probit.cov_params() 61 | 62 | -------------------------------------------------------------------------------- /kernel_density.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "kernel_density" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 3, 13 | "metadata": {}, 14 | "source": [ 15 | "Kernel Density Estimation" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "collapsed": false, 21 | "input": [ 22 | "import numpy as np\n", 23 | "from scipy import stats\n", 24 | "import statsmodels.api as sm\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "from statsmodels.distributions.mixture_rvs import mixture_rvs" 27 | ], 28 | "language": "python", 29 | "metadata": {}, 30 | "outputs": [] 31 | }, 32 | { 33 | "cell_type": "heading", 34 | "level": 4, 35 | "metadata": {}, 36 | "source": [ 37 | "A univariate example." 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "collapsed": false, 43 | "input": [ 44 | "np.random.seed(12345)" 45 | ], 46 | "language": "python", 47 | "metadata": {}, 48 | "outputs": [] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "collapsed": false, 53 | "input": [ 54 | "obs_dist1 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.norm],\n", 55 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))" 56 | ], 57 | "language": "python", 58 | "metadata": {}, 59 | "outputs": [] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "collapsed": false, 64 | "input": [ 65 | "kde = sm.nonparametric.KDEUnivariate(obs_dist1)\n", 66 | "kde.fit()" 67 | ], 68 | "language": "python", 69 | "metadata": {}, 70 | "outputs": [] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "collapsed": false, 75 | "input": [ 76 | "fig = plt.figure(figsize=(12,8))\n", 77 | "ax = fig.add_subplot(111)\n", 78 | "ax.hist(obs_dist1, bins=50, normed=True, color='red')\n", 79 | "ax.plot(kde.support, kde.density, lw=2, color='black');" 80 | ], 81 | "language": "python", 82 | "metadata": {}, 83 | "outputs": [] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "collapsed": false, 88 | "input": [ 89 | "obs_dist2 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.beta],\n", 90 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=1,args=(1,.5))))\n", 91 | "\n", 92 | "kde2 = sm.nonparametric.KDEUnivariate(obs_dist2)\n", 93 | "kde2.fit()" 94 | ], 95 | "language": "python", 96 | "metadata": {}, 97 | "outputs": [] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "collapsed": false, 102 | "input": [ 103 | "fig = plt.figure(figsize=(12,8))\n", 104 | "ax = fig.add_subplot(111)\n", 105 | "ax.hist(obs_dist2, bins=50, normed=True, color='red')\n", 106 | "ax.plot(kde2.support, kde2.density, lw=2, color='black');" 107 | ], 108 | "language": "python", 109 | "metadata": {}, 110 | "outputs": [] 111 | }, 112 | { 113 | "cell_type": "raw", 114 | "metadata": {}, 115 | "source": [ 116 | "The fitted KDE object is a full non-parametric distribution." 
117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "collapsed": false, 122 | "input": [ 123 | "obs_dist3 = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],\n", 124 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))\n", 125 | "kde3 = sm.nonparametric.KDEUnivariate(obs_dist3)\n", 126 | "kde3.fit()" 127 | ], 128 | "language": "python", 129 | "metadata": {}, 130 | "outputs": [] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "collapsed": false, 135 | "input": [ 136 | "kde3.entropy" 137 | ], 138 | "language": "python", 139 | "metadata": {}, 140 | "outputs": [] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "collapsed": false, 145 | "input": [ 146 | "kde3.evaluate(-1)" 147 | ], 148 | "language": "python", 149 | "metadata": {}, 150 | "outputs": [] 151 | }, 152 | { 153 | "cell_type": "heading", 154 | "level": 4, 155 | "metadata": {}, 156 | "source": [ 157 | "CDF" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "collapsed": false, 163 | "input": [ 164 | "fig = plt.figure(figsize=(12,8))\n", 165 | "ax = fig.add_subplot(111)\n", 166 | "ax.plot(kde3.support, kde3.cdf);" 167 | ], 168 | "language": "python", 169 | "metadata": {}, 170 | "outputs": [] 171 | }, 172 | { 173 | "cell_type": "heading", 174 | "level": 4, 175 | "metadata": {}, 176 | "source": [ 177 | "Cumulative Hazard Function" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "collapsed": false, 183 | "input": [ 184 | "fig = plt.figure(figsize=(12,8))\n", 185 | "ax = fig.add_subplot(111)\n", 186 | "ax.plot(kde3.support, kde3.cumhazard);" 187 | ], 188 | "language": "python", 189 | "metadata": {}, 190 | "outputs": [] 191 | }, 192 | { 193 | "cell_type": "heading", 194 | "level": 4, 195 | "metadata": {}, 196 | "source": [ 197 | "Inverse CDF" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "collapsed": false, 203 | "input": [ 204 | "fig = plt.figure(figsize=(12,8))\n", 205 | "ax = fig.add_subplot(111)\n", 206 | "ax.plot(kde3.support, kde3.icdf);" 207 | ], 208 | "language": "python", 209 | "metadata": {}, 210 | "outputs": [] 211 | }, 212 | { 213 | "cell_type": "heading", 214 | "level": 4, 215 | "metadata": {}, 216 | "source": [ 217 | "Survival Function" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "collapsed": false, 223 | "input": [ 224 | "fig = plt.figure(figsize=(12,8))\n", 225 | "ax = fig.add_subplot(111)\n", 226 | "ax.plot(kde3.support, kde3.sf);" 227 | ], 228 | "language": "python", 229 | "metadata": {}, 230 | "outputs": [] 231 | } 232 | ], 233 | "metadata": {} 234 | } 235 | ] 236 | } 237 | -------------------------------------------------------------------------------- /kernel_density.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # Kernel Density Estimation 7 | 8 | # 9 | 10 | import numpy as np 11 | import statsmodels.api as sm 12 | import matplotlib.pyplot as plt 13 | from statsmodels.distributions.mixture_rvs import mixture_rvs 14 | 15 | # 16 | 17 | # A univariate example. 
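# (Added note.) The mixture draws below use `stats.norm` and `stats.beta`, but this
# .py export lacks the scipy.stats import that the notebook version above has, so:
from scipy import stats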
18 | 19 | # 20 | 21 | np.random.seed(12345) 22 | 23 | # 24 | 25 | obs_dist1 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.norm], 26 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5))) 27 | 28 | # 29 | 30 | kde = sm.nonparametric.KDEUnivariate(obs_dist1) 31 | kde.fit() 32 | 33 | # 34 | 35 | fig = plt.figure(figsize=(12,8)) 36 | ax = fig.add_subplot(111) 37 | ax.hist(obs_dist1, bins=50, normed=True, color='red') 38 | ax.plot(kde.support, kde.density, lw=2, color='black'); 39 | 40 | # 41 | 42 | obs_dist2 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.beta], 43 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=1,args=(1,.5)))) 44 | 45 | kde2 = sm.nonparametric.KDEUnivariate(obs_dist2) 46 | kde2.fit() 47 | 48 | # 49 | 50 | fig = plt.figure(figsize=(12,8)) 51 | ax = fig.add_subplot(111) 52 | ax.hist(obs_dist2, bins=50, normed=True, color='red') 53 | ax.plot(kde2.support, kde2.density, lw=2, color='black'); 54 | 55 | # 56 | 57 | # The fitted KDE object is a full non-parametric distribution. 58 | 59 | # 60 | 61 | obs_dist3 = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm], 62 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5))) 63 | kde3 = sm.nonparametric.KDEUnivariate(obs_dist3) 64 | kde3.fit() 65 | 66 | # 67 | 68 | kde3.entropy 69 | 70 | # 71 | 72 | kde3.evaluate(-1) 73 | 74 | # 75 | 76 | # CDF 77 | 78 | # 79 | 80 | fig = plt.figure(figsize=(12,8)) 81 | ax = fig.add_subplot(111) 82 | ax.plot(kde3.support, kde3.cdf); 83 | 84 | # 85 | 86 | # Cumulative Hazard Function 87 | 88 | # 89 | 90 | fig = plt.figure(figsize=(12,8)) 91 | ax = fig.add_subplot(111) 92 | ax.plot(kde3.support, kde3.cumhazard); 93 | 94 | # 95 | 96 | # Inverse CDF 97 | 98 | # 99 | 100 | fig = plt.figure(figsize=(12,8)) 101 | ax = fig.add_subplot(111) 102 | ax.plot(kde3.support, kde3.icdf); 103 | 104 | # 105 | 106 | # Survival Function 107 | 108 | # 109 | 110 | fig = plt.figure(figsize=(12,8)) 111 | ax = fig.add_subplot(111) 112 | ax.plot(kde3.support, kde3.sf); 113 | 114 | -------------------------------------------------------------------------------- /linear_models.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # This notebook introduces the use of pandas and the formula framework in statsmodels in the context of linear modeling. 
7 | 8 | # 9 | 10 | # **It is based heavily on Jonathan Taylor's [class notes that use R](http://www.stanford.edu/class/stats191/interactions.html)** 11 | 12 | # 13 | 14 | import matplotlib.pyplot as plt 15 | import pandas 16 | import numpy as np 17 | 18 | from statsmodels.formula.api import ols 19 | from statsmodels.graphics.api import interaction_plot, abline_plot, qqplot 20 | from statsmodels.stats.api import anova_lm 21 | 22 | # 23 | 24 | # Example 1: IT salary data 25 | 26 | # 27 | 28 | # Outcome: S, salaries for IT staff in a corporation 29 | # Predictors: X, experience in years 30 | # M, managment, 2 levels, 0=non-management, 1=management 31 | # E, education, 3 levels, 1=Bachelor's, 2=Master's, 3=Ph.D 32 | 33 | # 34 | 35 | url = 'http://stats191.stanford.edu/data/salary.table' 36 | salary_table = pandas.read_table(url) # needs pandas 0.7.3 37 | salary_table.to_csv('salary.table', index=False) 38 | 39 | # 40 | 41 | print salary_table.head(10) 42 | 43 | # 44 | 45 | E = salary_table.E # Education 46 | M = salary_table.M # Management 47 | X = salary_table.X # Experience 48 | S = salary_table.S # Salary 49 | 50 | # 51 | 52 | # Let's explore the data 53 | 54 | # 55 | 56 | fig = plt.figure(figsize=(10,8)) 57 | ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary', 58 | xlim=(0, 27), ylim=(9600, 28800)) 59 | symbols = ['D', '^'] 60 | man_label = ["Non-Mgmt", "Mgmt"] 61 | educ_label = ["Bachelors", "Masters", "PhD"] 62 | colors = ['r', 'g', 'blue'] 63 | factor_groups = salary_table.groupby(['E','M']) 64 | for values, group in factor_groups: 65 | i,j = values 66 | label = "%s - %s" % (man_label[j], educ_label[i-1]) 67 | ax.scatter(group['X'], group['S'], marker=symbols[j], color=colors[i-1], 68 | s=350, label=label) 69 | ax.legend(scatterpoints=1, markerscale=.7, labelspacing=1); 70 | 71 | # 72 | 73 | # Fit a linear model 74 | # 75 | # $$S_i = \beta_0 + \beta_1X_i + \beta_2E_{i2} + \beta_3E_{i3} + \beta_4M_i + \epsilon_i$$ 76 | # 77 | # where 78 | # 79 | # $$ E_{i2}=\cases{1,&if $E_i=2$;\cr 0,&otherwise. \cr}$$ 80 | # $$ E_{i3}=\cases{1,&if $E_i=3$;\cr 0,&otherwise. \cr}$$ 81 | 82 | # 83 | 84 | formula = 'S ~ C(E) + C(M) + X' 85 | lm = ols(formula, salary_table).fit() 86 | print lm.summary() 87 | 88 | # 89 | 90 | # Aside: Contrasts (see contrasts notebook) 91 | 92 | # 93 | 94 | # Look at the design matrix created for us. Every results instance has a reference to the model. 95 | 96 | # 97 | 98 | lm.model.exog[:10] 99 | 100 | # 101 | 102 | # Since we initially passed in a DataFrame, we have a transformed DataFrame available. 103 | 104 | # 105 | 106 | print lm.model.data.orig_exog.head(10) 107 | 108 | # 109 | 110 | # There is a reference to the original untouched data in 111 | 112 | # 113 | 114 | print lm.model.data.frame.head(10) 115 | 116 | # 117 | 118 | # If you use the formula interface, statsmodels remembers this transformation. Say you want to know the predicted salary for someone with 12 years experience and a Master's degree who is in a management position 119 | 120 | # 121 | 122 | lm.predict({'X' : [12], 'M' : [1], 'E' : [2]}) 123 | 124 | # 125 | 126 | # So far we've assumed that the effect of experience is the same for each level of education and professional role. 127 | # Perhaps this assumption isn't merited. We can formally test this using some interactions. 128 | 129 | # 130 | 131 | # We can start by seeing if our model assumptions are met. Let's look at a residuals plot. 
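# (Added sketch, not in the original script.) A quick residuals-versus-fitted plot for
# the model above; systematic structure here would suggest the linear fit is off.
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, xlabel='Fitted values', ylabel='Residuals')
ax.scatter(lm.fittedvalues, lm.resid)
ax.axhline(0, color='black', linestyle='--');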
132 | 133 | # 134 | 135 | # And some formal tests 136 | 137 | # 138 | 139 | # Plot the residuals within the groups separately. 140 | 141 | # 142 | 143 | resid = lm.resid 144 | 145 | # 146 | 147 | fig = plt.figure(figsize=(12,8)) 148 | xticks = [] 149 | ax = fig.add_subplot(111, xlabel='Group (E, M)', ylabel='Residuals') 150 | for values, group in factor_groups: 151 | i,j = values 152 | xticks.append(str((i, j))) 153 | group_num = i*2 + j - 1 # for plotting purposes 154 | x = [group_num] * len(group) 155 | ax.scatter(x, resid[group.index], marker=symbols[j], color=colors[i-1], 156 | s=144, edgecolors='black') 157 | ax.set_xticks([1,2,3,4,5,6]) 158 | ax.set_xticklabels(xticks) 159 | ax.axis('tight'); 160 | 161 | # 162 | 163 | # Add an interaction between salary and experience, allowing different intercepts for level of experience. 164 | # 165 | # $$S_i = \beta_0+\beta_1X_i+\beta_2E_{i2}+\beta_3E_{i3}+\beta_4M_i+\beta_5E_{i2}X_i+\beta_6E_{i3}X_i+\epsilon_i$$ 166 | 167 | # 168 | 169 | interX_lm = ols('S ~ C(E)*X + C(M)', salary_table).fit() 170 | print interX_lm.summary() 171 | 172 | # 173 | 174 | # Test that $\beta_5 = \beta_6 = 0$. We can use anova_lm or we can use an F-test. 175 | 176 | # 177 | 178 | print anova_lm(lm, interX_lm) 179 | 180 | # 181 | 182 | print interX_lm.f_test('C(E)[T.2]:X = C(E)[T.3]:X = 0') 183 | 184 | # 185 | 186 | print interX_lm.f_test([[0,0,0,0,0,1,-1],[0,0,0,0,0,0,1]]) 187 | 188 | # 189 | 190 | # The contrasts are created here under the hood by patsy. 191 | 192 | # 193 | 194 | # Recall that F-tests are of the form $R\beta = q$ 195 | 196 | # 197 | 198 | LC = interX_lm.model.data.orig_exog.design_info.linear_constraint('C(E)[T.2]:X = C(E)[T.3]:X = 0') 199 | print LC.coefs 200 | print LC.constants 201 | 202 | # 203 | 204 | # Interact education with management 205 | 206 | # 207 | 208 | interM_lm = ols('S ~ X + C(E)*C(M)', salary_table).fit() 209 | print interM_lm.summary() 210 | 211 | # 212 | 213 | print anova_lm(lm, interM_lm) 214 | 215 | # 216 | 217 | infl = interM_lm.get_influence() 218 | resid = infl.resid_studentized_internal 219 | 220 | # 221 | 222 | fig = plt.figure(figsize=(12,8)) 223 | ax = fig.add_subplot(111, xlabel='X', ylabel='standardized resids') 224 | 225 | for values, group in factor_groups: 226 | i,j = values 227 | idx = group.index 228 | ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1], 229 | s=144, edgecolors='black') 230 | ax.axis('tight'); 231 | 232 | # 233 | 234 | # There looks to be an outlier. 
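# (Added sketch, not in the original script.) The largest standardized residual flags
# the suspect observation before we run the formal outlier test below.
worst = np.argmax(np.abs(resid))
print(worst)
print(resid[worst])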
235 | 236 | # 237 | 238 | outl = interM_lm.outlier_test('fdr_bh') 239 | outl.sort('unadj_p', inplace=True) 240 | print outl 241 | 242 | # 243 | 244 | idx = salary_table.index.drop(32) 245 | 246 | # 247 | 248 | print idx 249 | 250 | # 251 | 252 | lm32 = ols('S ~ C(E) + X + C(M)', data=salary_table, subset=idx).fit() 253 | print lm32.summary() 254 | 255 | # 256 | 257 | interX_lm32 = ols('S ~ C(E) * X + C(M)', data=salary_table, subset=idx).fit() 258 | print interX_lm32.summary() 259 | 260 | # 261 | 262 | table3 = anova_lm(lm32, interX_lm32) 263 | print table3 264 | 265 | # 266 | 267 | interM_lm32 = ols('S ~ X + C(E) * C(M)', data=salary_table, subset=idx).fit() 268 | print anova_lm(lm32, interM_lm32) 269 | 270 | # 271 | 272 | # Re-plotting the residuals 273 | 274 | # 275 | 276 | resid = interM_lm32.get_influence().summary_frame()['standard_resid'] 277 | fig = plt.figure(figsize=(12,8)) 278 | ax = fig.add_subplot(111, xlabel='X[~[32]]', ylabel='standardized resids') 279 | 280 | for values, group in factor_groups: 281 | i,j = values 282 | idx = group.index 283 | ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1], 284 | s=144, edgecolors='black') 285 | ax.axis('tight'); 286 | 287 | # 288 | 289 | # A final plot of the fitted values 290 | 291 | # 292 | 293 | lm_final = ols('S ~ X + C(E)*C(M)', data=salary_table.drop([32])).fit() 294 | mf = lm_final.model.data.orig_exog 295 | lstyle = ['-','--'] 296 | 297 | fig = plt.figure(figsize=(12,8)) 298 | ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary') 299 | 300 | for values, group in factor_groups: 301 | i,j = values 302 | idx = group.index 303 | ax.scatter(X[idx], S[idx], marker=symbols[j], color=colors[i-1], 304 | s=144, edgecolors='black') 305 | # drop NA because there is no idx 32 in the final model 306 | ax.plot(mf.X[idx].dropna(), lm_final.fittedvalues[idx].dropna(), 307 | ls=lstyle[j], color=colors[i-1]) 308 | ax.axis('tight'); 309 | 310 | # 311 | 312 | # From our first look at the data, the difference between Master's and PhD in the management group is different than in the non-management group. This is an interaction between the two qualitative variables management, M and education, E. We can visualize this by first removing the effect of experience, then plotting the means within each of the 6 groups using interaction.plot. 
313 | 314 | # 315 | 316 | U = S - X * interX_lm32.params['X'] 317 | U.name = 'Salary|X' 318 | 319 | fig = plt.figure(figsize=(12,8)) 320 | ax = fig.add_subplot(111) 321 | ax = interaction_plot(E, M, U, colors=['red','blue'], markers=['^','D'], 322 | markersize=10, ax=ax) 323 | 324 | # 325 | 326 | # Minority Employment Data - ABLine plotting 327 | 328 | # 329 | 330 | # TEST - Job Aptitude Test Score 331 | # ETHN - 1 if minority, 0 otherwise 332 | # JPERF - Job performance evaluation 333 | 334 | # 335 | 336 | try: 337 | minority_table = pandas.read_table('minority.table') 338 | except: # don't have data already 339 | url = 'http://stats191.stanford.edu/data/minority.table' 340 | minority_table = pandas.read_table(url) 341 | minority_table.to_csv('minority.table', sep="\t", index=False) 342 | 343 | # 344 | 345 | factor_group = minority_table.groupby(['ETHN']) 346 | 347 | fig = plt.figure(figsize=(12,8)) 348 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF') 349 | colors = ['purple', 'green'] 350 | markers = ['o', 'v'] 351 | for factor, group in factor_group: 352 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor], 353 | marker=markers[factor], s=12**2) 354 | ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1) 355 | 356 | # 357 | 358 | min_lm = ols('JPERF ~ TEST', data=minority_table).fit() 359 | print min_lm.summary() 360 | 361 | # 362 | 363 | fig = plt.figure(figsize=(12,8)) 364 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF') 365 | for factor, group in factor_group: 366 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor], 367 | marker=markers[factor], s=12**2) 368 | ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left') 369 | fig = abline_plot(model_results = min_lm, ax=ax) 370 | 371 | # 372 | 373 | min_lm2 = ols('JPERF ~ TEST + TEST:ETHN', data=minority_table).fit() 374 | print min_lm2.summary() 375 | 376 | # 377 | 378 | fig = plt.figure(figsize=(12,8)) 379 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF') 380 | for factor, group in factor_group: 381 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor], 382 | marker=markers[factor], s=12**2) 383 | 384 | fig = abline_plot(intercept = min_lm2.params['Intercept'], 385 | slope = min_lm2.params['TEST'], ax=ax, color='purple') 386 | ax = fig.axes[0] 387 | fig = abline_plot(intercept = min_lm2.params['Intercept'], 388 | slope = min_lm2.params['TEST'] + min_lm2.params['TEST:ETHN'], 389 | ax=ax, color='green') 390 | ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left'); 391 | 392 | # 393 | 394 | min_lm3 = ols('JPERF ~ TEST + ETHN', data=minority_table).fit() 395 | print min_lm3.summary() 396 | 397 | # 398 | 399 | fig = plt.figure(figsize=(12,8)) 400 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF') 401 | for factor, group in factor_group: 402 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor], 403 | marker=markers[factor], s=12**2) 404 | 405 | fig = abline_plot(intercept = min_lm3.params['Intercept'], 406 | slope = min_lm3.params['TEST'], ax=ax, color='purple') 407 | 408 | ax = fig.axes[0] 409 | fig = abline_plot(intercept = min_lm3.params['Intercept'] + min_lm3.params['ETHN'], 410 | slope = min_lm3.params['TEST'], ax=ax, color='green') 411 | ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left'); 412 | 413 | # 414 | 415 | min_lm4 = ols('JPERF ~ TEST * ETHN', data=minority_table).fit() 416 | print min_lm4.summary() 417 | 418 | # 419 | 420 | fig = plt.figure(figsize=(12,8)) 421 | ax = fig.add_subplot(111, 
ylabel='JPERF', xlabel='TEST') 422 | for factor, group in factor_group: 423 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor], 424 | marker=markers[factor], s=12**2) 425 | 426 | fig = abline_plot(intercept = min_lm4.params['Intercept'], 427 | slope = min_lm4.params['TEST'], ax=ax, color='purple') 428 | ax = fig.axes[0] 429 | fig = abline_plot(intercept = min_lm4.params['Intercept'] + min_lm4.params['ETHN'], 430 | slope = min_lm4.params['TEST'] + min_lm4.params['TEST:ETHN'], 431 | ax=ax, color='green') 432 | ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left'); 433 | 434 | # 435 | 436 | # Is there any effect of ETHN on slope or intercept? 437 | #
438 | # Y ~ TEST vs. Y ~ TEST + ETHN + ETHN:TEST 439 | 440 | # 441 | 442 | table5 = anova_lm(min_lm, min_lm4) 443 | print table5 444 | 445 | # 446 | 447 | # Is there any effect of ETHN on intercept? 448 | #
449 | # Y ~ TEST vs. Y ~ TEST + ETHN 450 | 451 | # 452 | 453 | table6 = anova_lm(min_lm, min_lm3) 454 | print table6 455 | 456 | # 457 | 458 | # Is there any effect of ETHN on slope? 459 | #
460 | # Y ~ TEST vs. Y ~ TEST + ETHN:TEST 461 | 462 | # 463 | 464 | table7 = anova_lm(min_lm, min_lm2) 465 | print table7 466 | 467 | # 468 | 469 | # Is it just the slope or both? 470 | #
471 | # Y ~ TEST + ETHN:TEST vs Y ~ TEST + ETHN + ETHN:TEST 472 | 473 | # 474 | 475 | table8 = anova_lm(min_lm2, min_lm4) 476 | print table8 477 | 478 | # 479 | 480 | # Two Way ANOVA - Kidney failure data 481 | 482 | # 483 | 484 | # Weight - (1,2,3) - Level of weight gan between treatments 485 | # Duration - (1,2) - Level of duration of treatment 486 | # Days - Time of stay in hospital 487 | 488 | # 489 | 490 | try: 491 | kidney_table = pandas.read_table('kidney.table') 492 | except: 493 | url = 'http://stats191.stanford.edu/data/kidney.table' 494 | kidney_table = pandas.read_table(url, delimiter=" *") 495 | kidney_table.to_csv("kidney.table", sep="\t", index=False) 496 | 497 | # 498 | 499 | # Explore the dataset, it's a balanced design 500 | print kidney_table.groupby(['Weight', 'Duration']).size() 501 | 502 | # 503 | 504 | kt = kidney_table 505 | fig = plt.figure(figsize=(10,8)) 506 | ax = fig.add_subplot(111) 507 | fig = interaction_plot(kt['Weight'], kt['Duration'], np.log(kt['Days']+1), 508 | colors=['red', 'blue'], markers=['D','^'], ms=10, ax=ax) 509 | 510 | # 511 | 512 | # $$Y_{ijk} = \mu + \alpha_i + \beta_j + \left(\alpha\beta\right)_{ij}+\epsilon_{ijk}$$ 513 | # 514 | # with 515 | # 516 | # $$\epsilon_{ijk}\sim N\left(0,\sigma^2\right)$$ 517 | 518 | # 519 | 520 | help(anova_lm) 521 | 522 | # 523 | 524 | # Things available in the calling namespace are available in the formula evaluation namespace 525 | 526 | # 527 | 528 | kidney_lm = ols('np.log(Days+1) ~ C(Duration) * C(Weight)', data=kt).fit() 529 | 530 | # 531 | 532 | # ANOVA Type-I Sum of Squares 533 | #
534 | # SS(A) for factor A.
535 | # SS(B|A) for factor B.
536 | # SS(AB|B, A) for interaction AB.
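# (Added sketch, not in the original script.) Type-I sums of squares are sequential,
# so in general they depend on the order in which terms enter the formula; the original
# call follows below. With a balanced design such as this one, the swapped ordering
# should give the same table, but with unbalanced data it usually will not.
print(anova_lm(ols('np.log(Days+1) ~ C(Weight) * C(Duration)', data=kt).fit()))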
537 | 538 | # 539 | 540 | print anova_lm(kidney_lm) 541 | 542 | # 543 | 544 | # ANOVA Type-II Sum of Squares 545 | #
546 | # SS(A|B) for factor A.
547 | # SS(B|A) for factor B.
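# (Added sketch, not in the original script.) Type-II tests are most meaningful when
# the interaction is negligible; one way to check is an explicit model comparison:
print(anova_lm(ols('np.log(Days+1) ~ C(Duration) + C(Weight)', data=kt).fit(),
               ols('np.log(Days+1) ~ C(Duration) * C(Weight)', data=kt).fit()))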
548 | 549 | # 550 | 551 | print anova_lm(kidney_lm, typ=2) 552 | 553 | # 554 | 555 | # ANOVA Type-III Sum of Squares 556 | #
557 | # SS(A|B, AB) for factor A.
558 | # SS(B|A, AB) for factor B.
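# (Added sketch, not in the original script.) Type-III tests are sensitive to the
# contrast coding of the factors, which is why the call below re-specifies Sum and
# Poly contrasts; with the default Treatment coding used for kidney_lm above, the
# main-effect tests generally come out differently:
print(anova_lm(kidney_lm, typ=3))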
559 | 560 | # 561 | 562 | print anova_lm(ols('np.log(Days+1) ~ C(Duration, Sum) * C(Weight, Poly)', 563 | data=kt).fit(), typ=3) 564 | 565 | # 566 | 567 | # Excercise: Find the 'best' model for the kidney failure dataset 568 | 569 | -------------------------------------------------------------------------------- /preliminaries.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # Learn More and Get Help 7 | 8 | # 9 | 10 | # Documentation: http://statsmodels.sf.net 11 | # 12 | # Mailing List: http://groups.google.com/group/pystatsmodels 13 | # 14 | # Use the source: https://github.com/statsmodels/statsmodels 15 | 16 | # 17 | 18 | # Tutorial Import Assumptions 19 | 20 | # 21 | 22 | import numpy as np 23 | import statsmodels.api as sm 24 | import matplotlib.pyplot as plt 25 | import pandas 26 | from scipy import stats 27 | 28 | np.set_printoptions(precision=4, suppress=True) 29 | pandas.set_printoptions(notebook_repr_html=False, 30 | precision=4, 31 | max_columns=12) 32 | 33 | # 34 | 35 | # Statsmodels Import Convention 36 | 37 | # 38 | 39 | import statsmodels.api as sm 40 | 41 | # 42 | 43 | # Import convention for models for which a formula is available. 44 | 45 | # 46 | 47 | from statsmodels.formula.api import ols, rlm, glm, #etc. 48 | 49 | # 50 | 51 | # Package Overview 52 | 53 | # 54 | 55 | # Regression models in statsmodels.regression 56 | 57 | # 58 | 59 | # Discrete choice models in statsmodels.discrete 60 | 61 | # 62 | 63 | # Robust linear models in statsmodels.robust 64 | 65 | # 66 | 67 | # Generalized linear models in statsmodels.genmod 68 | 69 | # 70 | 71 | # Time Series Analysis in statsmodels.tsa 72 | 73 | # 74 | 75 | # Nonparametric models in statsmodels.nonparametric 76 | 77 | # 78 | 79 | # Plotting functions in statsmodels.graphics 80 | 81 | # 82 | 83 | # Input/Output in statsmodels.iolib (Foreign data, ascii, HTML, $\LaTeX$ tables) 84 | 85 | # 86 | 87 | # Statistical tests, ANOVA in statsmodels.stats 88 | 89 | # 90 | 91 | # Datasets in statsmodels.datasets (See also the new GPL package Rdatasets: https://github.com/vincentarelbundock/Rdatasets) 92 | 93 | # 94 | 95 | # Base Classes 96 | 97 | # 98 | 99 | from statsmodels.base import model 100 | 101 | # 102 | 103 | help(model.Model) 104 | 105 | # 106 | 107 | help(model.LikelihoodModel) 108 | 109 | # 110 | 111 | help(model.LikelihoodModelResults) 112 | 113 | # 114 | 115 | from statsmodels.regression.linear_model import RegressionResults 116 | 117 | # 118 | 119 | help(RegressionResults) 120 | 121 | -------------------------------------------------------------------------------- /rmagic_extension.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "rmagic_extension" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 1, 13 | "metadata": {}, 14 | "source": [ 15 | "Rmagic Functions Extension" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "collapsed": false, 21 | "input": [ 22 | "%pylab inline" 23 | ], 24 | "language": "python", 25 | "metadata": {}, 26 | "outputs": [] 27 | }, 28 | { 29 | "cell_type": "heading", 30 | "level": 2, 31 | "metadata": {}, 32 | "source": [ 33 | "Line magics" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "* IPython has an `rmagic` extension that contains a some magic functions for working with R via rpy2. 
\n", 41 | "* This extension can be loaded using the `%load_ext` magic as follows:" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "collapsed": true, 47 | "input": [ 48 | "%load_ext rmagic " 49 | ], 50 | "language": "python", 51 | "metadata": {}, 52 | "outputs": [] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "* We can go from numpy arrays to compute some statistics in R and back\n", 59 | "* Let's suppose we just want to fit a simple linear model to a scatterplot." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "collapsed": false, 65 | "input": [ 66 | "import numpy as np\n", 67 | "import pylab\n", 68 | "X = np.array([0,1,2,3,4])\n", 69 | "Y = np.array([3,5,4,6,7])\n", 70 | "pylab.scatter(X, Y)" 71 | ], 72 | "language": "python", 73 | "metadata": {}, 74 | "outputs": [] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "* We can accomplish this by first pushing variables to R\n", 81 | "* Then fitting a model\n", 82 | "* And finally returning the results\n", 83 | "* The line magic `%Rpush` copies its arguments to variables of the same name in rpy2\n", 84 | "* The `%R` line magic evaluates the string in rpy2 and returns the result" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "collapsed": false, 90 | "input": [ 91 | "%Rpush X Y\n", 92 | "%R lm(Y~X)$coef" 93 | ], 94 | "language": "python", 95 | "metadata": {}, 96 | "outputs": [] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "We can check that this is correct fairly easily:" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "collapsed": false, 108 | "input": [ 109 | "Xr = X - X.mean(); Yr = Y - Y.mean()\n", 110 | "slope = (Xr*Yr).sum() / (Xr**2).sum()\n", 111 | "intercept = Y.mean() - X.mean() * slope\n", 112 | "(intercept, slope)" 113 | ], 114 | "language": "python", 115 | "metadata": {}, 116 | "outputs": [] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "It is also possible to return more than one value with %R." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "collapsed": false, 128 | "input": [ 129 | "%R resid(lm(Y~X)); coef(lm(Y~X))" 130 | ], 131 | "language": "python", 132 | "metadata": {}, 133 | "outputs": [] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "* One can also easily capture the results of %R into python objects. \n", 140 | "* Like R, the return value of this multiline expression is the final value, which is the `coef(lm(X~Y))`. \n", 141 | "* To pull other variables from R, there is one more set of magic functions" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "* `%Rpull` and `%Rget` \n", 149 | "* Both are useful to retrieve variables in the rpy2 namespace\n", 150 | "* The main difference is that one returns the value (%Rget), while the other pulls it to the user's namespace.\n", 151 | "* Imagine we've stored the results of some calculation in the variable `a` in rpy2's namespace. \n", 152 | "* By using the %R magic, we can obtain these results and store them in b. \n", 153 | "* We can also pull them directly to the namespace with %Rpull. \n", 154 | "* Note that they are both views on the same data." 
155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "collapsed": false, 160 | "input": [ 161 | "b = %R a=resid(lm(Y~X))\n", 162 | "%Rpull a\n", 163 | "print a\n", 164 | "assert id(b.data) == id(a.data)\n", 165 | "%R -o a" 166 | ], 167 | "language": "python", 168 | "metadata": {}, 169 | "outputs": [] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "%Rpull is equivalent to calling %R with just -o\n" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "collapsed": false, 181 | "input": [ 182 | "%R d=resid(lm(Y~X)); e=coef(lm(Y~X))\n", 183 | "%R -o d -o e\n", 184 | "%Rpull e\n", 185 | "print d\n", 186 | "print e\n", 187 | "import numpy as np\n", 188 | "np.testing.assert_almost_equal(d, a)" 189 | ], 190 | "language": "python", 191 | "metadata": {}, 192 | "outputs": [] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "On the other hand %Rpush is equivalent to calling %R with just -i and no trailing code." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "collapsed": false, 204 | "input": [ 205 | "A = np.arange(20)\n", 206 | "%R -i A\n", 207 | "%R mean(A)" 208 | ], 209 | "language": "python", 210 | "metadata": {}, 211 | "outputs": [] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "The magic %Rget retrieves one variable from R." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "collapsed": false, 223 | "input": [ 224 | "%Rget A" 225 | ], 226 | "language": "python", 227 | "metadata": {}, 228 | "outputs": [] 229 | }, 230 | { 231 | "cell_type": "heading", 232 | "level": 2, 233 | "metadata": {}, 234 | "source": [ 235 | "Plotting and capturing output" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "* R's console (i.e. its stdout() connection) is captured by ipython,\n", 243 | "* So are any plots which are published as PNG files\n", 244 | "* As a call to %R may produce a return value (see above), what happens to a magic like the one below?\n", 245 | "* The R code specifies that something is published to the notebook. \n", 246 | "* If anything is published to the notebook, that call to %R returns None." 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "collapsed": false, 252 | "input": [ 253 | "v1 = %R plot(X,Y); print(summary(lm(Y~X))); vv=mean(X)*mean(Y)\n", 254 | "print 'v1 is:', v1\n", 255 | "v2 = %R mean(X)*mean(Y)\n", 256 | "print 'v2 is:', v2" 257 | ], 258 | "language": "python", 259 | "metadata": {}, 260 | "outputs": [] 261 | }, 262 | { 263 | "cell_type": "heading", 264 | "level": 2, 265 | "metadata": {}, 266 | "source": [ 267 | "What value is returned from %R?" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "* Some calls have no particularly interesting return value, the magic `%R` will not return anything in this case. \n", 275 | "* The return value in rpy2 is actually NULL so `%R` returns None." 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "collapsed": false, 281 | "input": [ 282 | "v = %R plot(X,Y)\n", 283 | "assert v == None" 284 | ], 285 | "language": "python", 286 | "metadata": {}, 287 | "outputs": [] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "Also, if the return value of a call to %R (inline mode) has just been printed to the console, then its value is also not returned." 
294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "collapsed": false, 299 | "input": [ 300 | "v = %R print(X)\n", 301 | "assert v == None" 302 | ], 303 | "language": "python", 304 | "metadata": {}, 305 | "outputs": [] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "* If the last value did not print anything to console, the value is returned" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "collapsed": false, 317 | "input": [ 318 | "v = %R print(summary(X)); X\n", 319 | "print 'v:', v" 320 | ], 321 | "language": "python", 322 | "metadata": {}, 323 | "outputs": [] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "* The return value can be suppressed by a trailing ';' or an -n argument" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "collapsed": true, 335 | "input": [ 336 | "%R -n X" 337 | ], 338 | "language": "python", 339 | "metadata": {}, 340 | "outputs": [] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "collapsed": true, 345 | "input": [ 346 | "%R X; " 347 | ], 348 | "language": "python", 349 | "metadata": {}, 350 | "outputs": [] 351 | }, 352 | { 353 | "cell_type": "heading", 354 | "level": 2, 355 | "metadata": {}, 356 | "source": [ 357 | "Cell level magic" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "* What if we want to run several lines of R code\n", 365 | "* This is the cell-level magic.\n", 366 | "* For the cell level magic, inputs can be passed via the -i or --inputs argument in the line\n", 367 | "* These variables are copied from the shell namespace to R's namespace using `rpy2.robjects.r.assign` \n", 368 | "* It would be nice not to have to copy these into R: rnumpy ( http://bitbucket.org/njs/rnumpy/wiki/API ) has done some work to limit or at least make transparent the number of copies of an array. \n", 369 | "* Arrays can be output from R via the -o or --outputs argument in the line. All other arguments are sent to R's png function, which is the graphics device used to create the plots.\n", 370 | "* We can redo the above calculations in one ipython cell. \n", 371 | "* We might also want to add a summary or perhaps the standard plotting diagnostics of the lm." 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "collapsed": false, 377 | "input": [ 378 | "%%R -i X,Y -o XYcoef\n", 379 | "XYlm = lm(Y~X)\n", 380 | "XYcoef = coef(XYlm)\n", 381 | "print(summary(XYlm))\n", 382 | "par(mfrow=c(2,2))\n", 383 | "plot(XYlm)" 384 | ], 385 | "language": "python", 386 | "metadata": {}, 387 | "outputs": [] 388 | }, 389 | { 390 | "cell_type": "heading", 391 | "level": 2, 392 | "metadata": {}, 393 | "source": [ 394 | "Passing data back and forth" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "* Currently (Summer 2012), data is passed through `RMagics.pyconverter` when going from python to R and `RMagics.Rconverter` when going from R to python. \n", 402 | "* These currently default to numpy.ndarray. Future work will involve writing better converters, most likely involving integration with http://pandas.sourceforge.net.\n", 403 | "* Passing ndarrays into R requires a copy, though once an object is returned to python, this object is NOT copied, and it is possible to change its values." 
404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "collapsed": true, 409 | "input": [ 410 | "seq1 = np.arange(10)" 411 | ], 412 | "language": "python", 413 | "metadata": {}, 414 | "outputs": [] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "collapsed": false, 419 | "input": [ 420 | "%%R -i seq1 -o seq2\n", 421 | "seq2 = rep(seq1, 2)\n", 422 | "print(seq2)" 423 | ], 424 | "language": "python", 425 | "metadata": {}, 426 | "outputs": [] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "collapsed": false, 431 | "input": [ 432 | "seq2[::2] = 0\n", 433 | "seq2" 434 | ], 435 | "language": "python", 436 | "metadata": {}, 437 | "outputs": [] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "collapsed": false, 442 | "input": [ 443 | "%%R\n", 444 | "print(seq2)" 445 | ], 446 | "language": "python", 447 | "metadata": {}, 448 | "outputs": [] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "* Once the array data has been passed to R, modifring its contents does not modify R's copy of the data." 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "collapsed": false, 460 | "input": [ 461 | "seq1[0] = 200\n", 462 | "%R print(seq1)" 463 | ], 464 | "language": "python", 465 | "metadata": {}, 466 | "outputs": [] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "* If we pass data as both input and output, then the value of \"data\" in the user's namespace will be overwritten \n", 473 | "* the new array will be a view of the data in R's copy." 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "collapsed": false, 479 | "input": [ 480 | "print seq1\n", 481 | "%R -i seq1 -o seq1\n", 482 | "print seq1\n", 483 | "seq1[0] = 200\n", 484 | "%R print(seq1)\n", 485 | "seq1_view = %R seq1\n", 486 | "assert(id(seq1_view.data) == id(seq1.data))" 487 | ], 488 | "language": "python", 489 | "metadata": {}, 490 | "outputs": [] 491 | }, 492 | { 493 | "cell_type": "heading", 494 | "level": 2, 495 | "metadata": {}, 496 | "source": [ 497 | "Exception handling\n" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "Exceptions are handled by passing back rpy2's exception and the line that triggered it." 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "collapsed": false, 510 | "input": [ 511 | "try:\n", 512 | " %R -n nosuchvar\n", 513 | "except Exception as e:\n", 514 | " print e" 515 | ], 516 | "language": "python", 517 | "metadata": {}, 518 | "outputs": [] 519 | }, 520 | { 521 | "cell_type": "heading", 522 | "level": 2, 523 | "metadata": {}, 524 | "source": [ 525 | "Structured arrays and data frames\n" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "* In R, data frames play an important role as they allow array-like objects of mixed type with column names (and row names). \n", 533 | "* In numpy, the closest analogy is a structured array with named fields. \n", 534 | "* In future work, it would be nice to use pandas to return full-fledged DataFrames from rpy2. 
\n", 535 | "* In the mean time, structured arrays can be passed back and forth with the -d flag to %R, %Rpull, and %Rget" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "collapsed": true, 541 | "input": [ 542 | "datapy= np.array([(1, 2.9, 'a'), (2, 3.5, 'b'), (3, 2.1, 'c')],\n", 543 | " dtype=[('x', '3.0 3 | 4 | # 5 | 6 | # Rmagic Functions Extension 7 | 8 | # 9 | 10 | %pylab inline 11 | 12 | # 13 | 14 | # Line magics 15 | 16 | # 17 | 18 | # * IPython has an `rmagic` extension that contains a some magic functions for working with R via rpy2. 19 | # * This extension can be loaded using the `%load_ext` magic as follows: 20 | 21 | # 22 | 23 | %load_ext rmagic 24 | 25 | # 26 | 27 | # * We can go from numpy arrays to compute some statistics in R and back 28 | # * Let's suppose we just want to fit a simple linear model to a scatterplot. 29 | 30 | # 31 | 32 | import numpy as np 33 | import pylab 34 | X = np.array([0,1,2,3,4]) 35 | Y = np.array([3,5,4,6,7]) 36 | pylab.scatter(X, Y) 37 | 38 | # 39 | 40 | # * We can accomplish this by first pushing variables to R 41 | # * Then fitting a model 42 | # * And finally returning the results 43 | # * The line magic `%Rpush` copies its arguments to variables of the same name in rpy2 44 | # * The `%R` line magic evaluates the string in rpy2 and returns the result 45 | 46 | # 47 | 48 | %Rpush X Y 49 | %R lm(Y~X)$coef 50 | 51 | # 52 | 53 | # We can check that this is correct fairly easily: 54 | 55 | # 56 | 57 | Xr = X - X.mean(); Yr = Y - Y.mean() 58 | slope = (Xr*Yr).sum() / (Xr**2).sum() 59 | intercept = Y.mean() - X.mean() * slope 60 | (intercept, slope) 61 | 62 | # 63 | 64 | # It is also possible to return more than one value with %R. 65 | 66 | # 67 | 68 | %R resid(lm(Y~X)); coef(lm(Y~X)) 69 | 70 | # 71 | 72 | # * One can also easily capture the results of %R into python objects. 73 | # * Like R, the return value of this multiline expression is the final value, which is the `coef(lm(X~Y))`. 74 | # * To pull other variables from R, there is one more set of magic functions 75 | 76 | # 77 | 78 | # * `%Rpull` and `%Rget` 79 | # * Both are useful to retrieve variables in the rpy2 namespace 80 | # * The main difference is that one returns the value (%Rget), while the other pulls it to the user's namespace. 81 | # * Imagine we've stored the results of some calculation in the variable `a` in rpy2's namespace. 82 | # * By using the %R magic, we can obtain these results and store them in b. 83 | # * We can also pull them directly to the namespace with %Rpull. 84 | # * Note that they are both views on the same data. 85 | 86 | # 87 | 88 | b = %R a=resid(lm(Y~X)) 89 | %Rpull a 90 | print a 91 | assert id(b.data) == id(a.data) 92 | %R -o a 93 | 94 | # 95 | 96 | # %Rpull is equivalent to calling %R with just -o 97 | 98 | # 99 | 100 | %R d=resid(lm(Y~X)); e=coef(lm(Y~X)) 101 | %R -o d -o e 102 | %Rpull e 103 | print d 104 | print e 105 | import numpy as np 106 | np.testing.assert_almost_equal(d, a) 107 | 108 | # 109 | 110 | # On the other hand %Rpush is equivalent to calling %R with just -i and no trailing code. 111 | 112 | # 113 | 114 | A = np.arange(20) 115 | %R -i A 116 | %R mean(A) 117 | 118 | # 119 | 120 | # The magic %Rget retrieves one variable from R. 121 | 122 | # 123 | 124 | %Rget A 125 | 126 | # 127 | 128 | # Plotting and capturing output 129 | 130 | # 131 | 132 | # * R's console (i.e. 
its stdout() connection) is captured by ipython, 133 | # * So are any plots which are published as PNG files 134 | # * As a call to %R may produce a return value (see above), what happens to a magic like the one below? 135 | # * The R code specifies that something is published to the notebook. 136 | # * If anything is published to the notebook, that call to %R returns None. 137 | 138 | # 139 | 140 | v1 = %R plot(X,Y); print(summary(lm(Y~X))); vv=mean(X)*mean(Y) 141 | print 'v1 is:', v1 142 | v2 = %R mean(X)*mean(Y) 143 | print 'v2 is:', v2 144 | 145 | # 146 | 147 | # What value is returned from %R? 148 | 149 | # 150 | 151 | # * Some calls have no particularly interesting return value, the magic `%R` will not return anything in this case. 152 | # * The return value in rpy2 is actually NULL so `%R` returns None. 153 | 154 | # 155 | 156 | v = %R plot(X,Y) 157 | assert v == None 158 | 159 | # 160 | 161 | # Also, if the return value of a call to %R (inline mode) has just been printed to the console, then its value is also not returned. 162 | 163 | # 164 | 165 | v = %R print(X) 166 | assert v == None 167 | 168 | # 169 | 170 | # * If the last value did not print anything to console, the value is returned 171 | 172 | # 173 | 174 | v = %R print(summary(X)); X 175 | print 'v:', v 176 | 177 | # 178 | 179 | # * The return value can be suppressed by a trailing ';' or an -n argument 180 | 181 | # 182 | 183 | %R -n X 184 | 185 | # 186 | 187 | %R X; 188 | 189 | # 190 | 191 | # Cell level magic 192 | 193 | # 194 | 195 | # * What if we want to run several lines of R code 196 | # * This is the cell-level magic. 197 | # * For the cell level magic, inputs can be passed via the -i or --inputs argument in the line 198 | # * These variables are copied from the shell namespace to R's namespace using `rpy2.robjects.r.assign` 199 | # * It would be nice not to have to copy these into R: rnumpy ( http://bitbucket.org/njs/rnumpy/wiki/API ) has done some work to limit or at least make transparent the number of copies of an array. 200 | # * Arrays can be output from R via the -o or --outputs argument in the line. All other arguments are sent to R's png function, which is the graphics device used to create the plots. 201 | # * We can redo the above calculations in one ipython cell. 202 | # * We might also want to add a summary or perhaps the standard plotting diagnostics of the lm. 203 | 204 | # 205 | 206 | %%R -i X,Y -o XYcoef 207 | XYlm = lm(Y~X) 208 | XYcoef = coef(XYlm) 209 | print(summary(XYlm)) 210 | par(mfrow=c(2,2)) 211 | plot(XYlm) 212 | 213 | # 214 | 215 | # Passing data back and forth 216 | 217 | # 218 | 219 | # * Currently (Summer 2012), data is passed through `RMagics.pyconverter` when going from python to R and `RMagics.Rconverter` when going from R to python. 220 | # * These currently default to numpy.ndarray. Future work will involve writing better converters, most likely involving integration with http://pandas.sourceforge.net. 221 | # * Passing ndarrays into R requires a copy, though once an object is returned to python, this object is NOT copied, and it is possible to change its values. 222 | 223 | # 224 | 225 | seq1 = np.arange(10) 226 | 227 | # 228 | 229 | %%R -i seq1 -o seq2 230 | seq2 = rep(seq1, 2) 231 | print(seq2) 232 | 233 | # 234 | 235 | seq2[::2] = 0 236 | seq2 237 | 238 | # 239 | 240 | %%R 241 | print(seq2) 242 | 243 | # 244 | 245 | # * Once the array data has been passed to R, modifring its contents does not modify R's copy of the data. 
246 | 247 | # 248 | 249 | seq1[0] = 200 250 | %R print(seq1) 251 | 252 | # 253 | 254 | # * If we pass data as both input and output, then the value of "data" in the user's namespace will be overwritten 255 | # * the new array will be a view of the data in R's copy. 256 | 257 | # 258 | 259 | print seq1 260 | %R -i seq1 -o seq1 261 | print seq1 262 | seq1[0] = 200 263 | %R print(seq1) 264 | seq1_view = %R seq1 265 | assert(id(seq1_view.data) == id(seq1.data)) 266 | 267 | # 268 | 269 | # Exception handling 270 | 271 | # 272 | 273 | # Exceptions are handled by passing back rpy2's exception and the line that triggered it. 274 | 275 | # 276 | 277 | try: 278 | %R -n nosuchvar 279 | except Exception as e: 280 | print e 281 | 282 | # 283 | 284 | # Structured arrays and data frames 285 | 286 | # 287 | 288 | # * In R, data frames play an important role as they allow array-like objects of mixed type with column names (and row names). 289 | # * In numpy, the closest analogy is a structured array with named fields. 290 | # * In future work, it would be nice to use pandas to return full-fledged DataFrames from rpy2. 291 | # * In the mean time, structured arrays can be passed back and forth with the -d flag to %R, %Rpull, and %Rget 292 | 293 | # 294 | 295 | datapy= np.array([(1, 2.9, 'a'), (2, 3.5, 'b'), (3, 2.1, 'c')], 296 | dtype=[('x', ' 299 | 300 | %%R -i datapy -d datar 301 | datar = datapy 302 | 303 | # 304 | 305 | datar 306 | 307 | # 308 | 309 | %R datar2 = datapy 310 | %Rpull -d datar2 311 | datar2 312 | 313 | # 314 | 315 | %Rget -d datar2 316 | 317 | # 318 | 319 | # For arrays without names, the -d argument has no effect because the R object has no colnames or names. 320 | 321 | # 322 | 323 | Z = np.arange(6) 324 | %R -i Z 325 | %Rget -d Z 326 | 327 | # 328 | 329 | # * For mixed-type data frames in R, if the -d flag is not used, then an array of a single type is returned 330 | # * Its value is transposed. 331 | # * This would be nice to fix, but it seems something that should be fixed at the rpy2 level. See [here](https://bitbucket.org/lgautier/rpy2/issue/44/numpyrecarray-as-dataframe) 332 | 333 | # 334 | 335 | %Rget datar2 336 | 337 | -------------------------------------------------------------------------------- /robust_models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "robust_models" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 2, 13 | "metadata": {}, 14 | "source": [ 15 | "M-Estimators for Robust Linear Modeling" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "collapsed": false, 21 | "input": [ 22 | "import numpy as np\n", 23 | "from scipy import stats\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "\n", 26 | "import statsmodels.api as sm" 27 | ], 28 | "language": "python", 29 | "metadata": {}, 30 | "outputs": [] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "* An M-estimator minimizes the function \n", 37 | "\n", 38 | "$$Q(e_i, \\rho) = \\sum_i~\\rho(\\frac{e_i}{s})$$\n", 39 | "\n", 40 | "where $\\rho$ is a symmetric function of the residuals \n", 41 | "\n", 42 | "* The effect of $\\rho$ is to reduce the influence of outliers\n", 43 | "* $s$ is an estimate of scale. 
\n", 44 | "* The robust estimates $\\hat{\\beta}$ are computed by the iteratively re-weighted least squares algorithm" 45 | ] 46 | }, 47 | { 48 | "cell_type": "raw", 49 | "metadata": {}, 50 | "source": [ 51 | "* We have several choices available for the weighting functions to be used" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "collapsed": false, 57 | "input": [ 58 | "norms = sm.robust.norms" 59 | ], 60 | "language": "python", 61 | "metadata": {}, 62 | "outputs": [] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "collapsed": false, 67 | "input": [ 68 | "def plot_weights(support, weights_func, xlabels, xticks):\n", 69 | " fig = plt.figure(figsize=(12,8))\n", 70 | " ax = fig.add_subplot(111)\n", 71 | " ax.plot(support, weights_func(support))\n", 72 | " ax.set_xticks(xticks)\n", 73 | " ax.set_xticklabels(xlabels, fontsize=16)\n", 74 | " ax.set_ylim(-.1, 1.1)\n", 75 | " return ax" 76 | ], 77 | "language": "python", 78 | "metadata": {}, 79 | "outputs": [] 80 | }, 81 | { 82 | "cell_type": "heading", 83 | "level": 3, 84 | "metadata": {}, 85 | "source": [ 86 | "Andrew's Wave" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "collapsed": false, 92 | "input": [ 93 | "help(norms.AndrewWave.weights)" 94 | ], 95 | "language": "python", 96 | "metadata": {}, 97 | "outputs": [] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "collapsed": false, 102 | "input": [ 103 | "a = 1.339\n", 104 | "support = np.linspace(-np.pi*a, np.pi*a, 100)\n", 105 | "andrew = norms.AndrewWave(a=a)\n", 106 | "plot_weights(support, andrew.weights, ['$-\\pi*a$', '0', '$\\pi*a$'], [-np.pi*a, 0, np.pi*a]);" 107 | ], 108 | "language": "python", 109 | "metadata": {}, 110 | "outputs": [] 111 | }, 112 | { 113 | "cell_type": "heading", 114 | "level": 3, 115 | "metadata": {}, 116 | "source": [ 117 | "Hampel's 17A" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "collapsed": false, 123 | "input": [ 124 | "help(norms.Hampel.weights)" 125 | ], 126 | "language": "python", 127 | "metadata": {}, 128 | "outputs": [] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "collapsed": false, 133 | "input": [ 134 | "c = 8\n", 135 | "support = np.linspace(-3*c, 3*c, 1000)\n", 136 | "hampel = norms.Hampel(a=2., b=4., c=c)\n", 137 | "plot_weights(support, hampel.weights, ['3*c', '0', '3*c'], [-3*c, 0, 3*c]);" 138 | ], 139 | "language": "python", 140 | "metadata": {}, 141 | "outputs": [] 142 | }, 143 | { 144 | "cell_type": "heading", 145 | "level": 3, 146 | "metadata": {}, 147 | "source": [ 148 | "Huber's t" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "collapsed": false, 154 | "input": [ 155 | "help(norms.HuberT.weights)" 156 | ], 157 | "language": "python", 158 | "metadata": {}, 159 | "outputs": [] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "collapsed": false, 164 | "input": [ 165 | "t = 1.345\n", 166 | "support = np.linspace(-3*t, 3*t, 1000)\n", 167 | "huber = norms.HuberT(t=t)\n", 168 | "plot_weights(support, huber.weights, ['-3*t', '0', '3*t'], [-3*t, 0, 3*t]);" 169 | ], 170 | "language": "python", 171 | "metadata": {}, 172 | "outputs": [] 173 | }, 174 | { 175 | "cell_type": "heading", 176 | "level": 3, 177 | "metadata": {}, 178 | "source": [ 179 | "Least Squares" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "collapsed": false, 185 | "input": [ 186 | "help(norms.LeastSquares.weights)" 187 | ], 188 | "language": "python", 189 | "metadata": {}, 190 | "outputs": [] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "collapsed": false, 195 | "input": [ 196 | "support = np.linspace(-3, 3, 1000)\n", 197 
| "lst_sq = norms.LeastSquares()\n", 198 | "plot_weights(support, lst_sq.weights, ['-3', '0', '3'], [-3, 0, 3]);" 199 | ], 200 | "language": "python", 201 | "metadata": {}, 202 | "outputs": [] 203 | }, 204 | { 205 | "cell_type": "heading", 206 | "level": 3, 207 | "metadata": {}, 208 | "source": [ 209 | "Ramsay's Ea" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "collapsed": false, 215 | "input": [ 216 | "help(norms.RamsayE.weights)" 217 | ], 218 | "language": "python", 219 | "metadata": {}, 220 | "outputs": [] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "collapsed": false, 225 | "input": [ 226 | "a = .3\n", 227 | "support = np.linspace(-3*a, 3*a, 1000)\n", 228 | "ramsay = norms.RamsayE(a=a)\n", 229 | "plot_weights(support, ramsay.weights, ['-3*a', '0', '3*a'], [-3*a, 0, 3*a]);" 230 | ], 231 | "language": "python", 232 | "metadata": {}, 233 | "outputs": [] 234 | }, 235 | { 236 | "cell_type": "heading", 237 | "level": 3, 238 | "metadata": {}, 239 | "source": [ 240 | "Trimmed Mean" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "collapsed": false, 246 | "input": [ 247 | "help(norms.TrimmedMean.weights)" 248 | ], 249 | "language": "python", 250 | "metadata": {}, 251 | "outputs": [] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "collapsed": false, 256 | "input": [ 257 | "c = 2\n", 258 | "support = np.linspace(-3*c, 3*c, 1000)\n", 259 | "trimmed = norms.TrimmedMean(c=c)\n", 260 | "plot_weights(support, trimmed.weights, ['-3*c', '0', '3*c'], [-3*c, 0, 3*c]);" 261 | ], 262 | "language": "python", 263 | "metadata": {}, 264 | "outputs": [] 265 | }, 266 | { 267 | "cell_type": "heading", 268 | "level": 3, 269 | "metadata": {}, 270 | "source": [ 271 | "Tukey's Biweight" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "collapsed": false, 277 | "input": [ 278 | "help(norms.TukeyBiweight.weights)" 279 | ], 280 | "language": "python", 281 | "metadata": {}, 282 | "outputs": [] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "collapsed": false, 287 | "input": [ 288 | "c = 4.685\n", 289 | "support = np.linspace(-3*c, 3*c, 1000)\n", 290 | "tukey = norms.TukeyBiweight(c=c)\n", 291 | "plot_weights(support, tukey.weights, ['-3*c', '0', '3*c'], [-3*c, 0, 3*c]);" 292 | ], 293 | "language": "python", 294 | "metadata": {}, 295 | "outputs": [] 296 | }, 297 | { 298 | "cell_type": "heading", 299 | "level": 3, 300 | "metadata": {}, 301 | "source": [ 302 | "Scale Estimators" 303 | ] 304 | }, 305 | { 306 | "cell_type": "raw", 307 | "metadata": {}, 308 | "source": [ 309 | "* Robust estimates of the location" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "collapsed": false, 315 | "input": [ 316 | "x = np.array([1, 2, 3, 4, 500])" 317 | ], 318 | "language": "python", 319 | "metadata": {}, 320 | "outputs": [] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "* The mean is not a robust estimator of location" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "collapsed": false, 332 | "input": [ 333 | "x.mean()" 334 | ], 335 | "language": "python", 336 | "metadata": {}, 337 | "outputs": [] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "* The median, on the other hand, is a robust estimator with a breakdown point of 50%" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "collapsed": false, 349 | "input": [ 350 | "np.median(x)" 351 | ], 352 | "language": "python", 353 | "metadata": {}, 354 | "outputs": [] 355 | }, 356 | { 357 | "cell_type": "raw", 358 | "metadata": {}, 
359 | "source": [ 360 | "* Analagously for the scale\n", 361 | "* The standard deviation is not robust" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "collapsed": false, 367 | "input": [ 368 | "x.std()" 369 | ], 370 | "language": "python", 371 | "metadata": {}, 372 | "outputs": [] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "Median Absolute Deviation\n", 379 | "\n", 380 | "$$ median_i |X_i - median_j(X_j)|) $$" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "Standardized Median Absolute Deviation is a consistent estimator for $\\hat{\\sigma}$\n", 388 | "\n", 389 | "$$\\hat{\\sigma}=K \\cdot MAD$$\n", 390 | "\n", 391 | "where $K$ depends on the distribution. For the normal distribution for example,\n", 392 | "\n", 393 | "$$K = \\Phi^{-1}(.75)$$" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "collapsed": false, 399 | "input": [ 400 | "stats.norm.ppf(.75)" 401 | ], 402 | "language": "python", 403 | "metadata": {}, 404 | "outputs": [] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "collapsed": false, 409 | "input": [ 410 | "print x" 411 | ], 412 | "language": "python", 413 | "metadata": {}, 414 | "outputs": [] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "collapsed": false, 419 | "input": [ 420 | "sm.robust.scale.stand_mad(x)" 421 | ], 422 | "language": "python", 423 | "metadata": {}, 424 | "outputs": [] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "collapsed": false, 429 | "input": [ 430 | "np.array([1,2,3,4,5.]).std()" 431 | ], 432 | "language": "python", 433 | "metadata": {}, 434 | "outputs": [] 435 | }, 436 | { 437 | "cell_type": "raw", 438 | "metadata": {}, 439 | "source": [ 440 | "* The default for Robust Linear Models is MAD\n", 441 | "* another popular choice is Huber's proposal 2" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "collapsed": false, 447 | "input": [ 448 | "np.random.seed(12345)\n", 449 | "fat_tails = stats.t(6).rvs(40)" 450 | ], 451 | "language": "python", 452 | "metadata": {}, 453 | "outputs": [] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "collapsed": false, 458 | "input": [ 459 | "kde = sm.nonparametric.KDE(fat_tails)\n", 460 | "kde.fit()\n", 461 | "fig = plt.figure(figsize=(12,8))\n", 462 | "ax = fig.add_subplot(111)\n", 463 | "ax.plot(kde.support, kde.density);" 464 | ], 465 | "language": "python", 466 | "metadata": {}, 467 | "outputs": [] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "collapsed": false, 472 | "input": [ 473 | "print fat_tails.mean(), fat_tails.std()" 474 | ], 475 | "language": "python", 476 | "metadata": {}, 477 | "outputs": [] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "collapsed": false, 482 | "input": [ 483 | "print stats.norm.fit(fat_tails)" 484 | ], 485 | "language": "python", 486 | "metadata": {}, 487 | "outputs": [] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "collapsed": false, 492 | "input": [ 493 | "print stats.t.fit(fat_tails, f0=6)" 494 | ], 495 | "language": "python", 496 | "metadata": {}, 497 | "outputs": [] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "collapsed": false, 502 | "input": [ 503 | "huber = sm.robust.scale.Huber()\n", 504 | "loc, scale = huber(fat_tails)\n", 505 | "print loc, scale" 506 | ], 507 | "language": "python", 508 | "metadata": {}, 509 | "outputs": [] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "collapsed": false, 514 | "input": [ 515 | "sm.robust.stand_mad(fat_tails)" 516 | ], 517 | "language": "python", 518 | "metadata": {}, 
519 | "outputs": [] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "collapsed": false, 524 | "input": [ 525 | "sm.robust.stand_mad(fat_tails, c=stats.t(6).ppf(.75))" 526 | ], 527 | "language": "python", 528 | "metadata": {}, 529 | "outputs": [] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "collapsed": false, 534 | "input": [ 535 | "sm.robust.scale.mad(fat_tails)" 536 | ], 537 | "language": "python", 538 | "metadata": {}, 539 | "outputs": [] 540 | }, 541 | { 542 | "cell_type": "heading", 543 | "level": 3, 544 | "metadata": {}, 545 | "source": [ 546 | "Duncan's Occupational Prestige data - M-estimation for outliers" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "collapsed": false, 552 | "input": [ 553 | "from statsmodels.graphics.api import abline_plot\n", 554 | "from statsmodels.formula.api import ols, rlm" 555 | ], 556 | "language": "python", 557 | "metadata": {}, 558 | "outputs": [] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "collapsed": false, 563 | "input": [ 564 | "prestige = sm.datasets.get_rdataset(\"Duncan\", \"car\", cache=True).data" 565 | ], 566 | "language": "python", 567 | "metadata": {}, 568 | "outputs": [] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "collapsed": false, 573 | "input": [ 574 | "print prestige.head(10)" 575 | ], 576 | "language": "python", 577 | "metadata": {}, 578 | "outputs": [] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "collapsed": false, 583 | "input": [ 584 | "fig = plt.figure(figsize=(12,12))\n", 585 | "ax1 = fig.add_subplot(211, xlabel='Income', ylabel='Prestige')\n", 586 | "ax1.scatter(prestige.income, prestige.prestige)\n", 587 | "xy_outlier = prestige.ix['minister'][['income','prestige']]\n", 588 | "ax1.annotate('Minister', xy_outlier, xy_outlier+1, fontsize=16)\n", 589 | "ax2 = fig.add_subplot(212, xlabel='Education',\n", 590 | " ylabel='Prestige')\n", 591 | "ax2.scatter(prestige.education, prestige.prestige);" 592 | ], 593 | "language": "python", 594 | "metadata": {}, 595 | "outputs": [] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "collapsed": false, 600 | "input": [ 601 | "ols_model = ols('prestige ~ income + education', prestige).fit()\n", 602 | "print ols_model.summary()" 603 | ], 604 | "language": "python", 605 | "metadata": {}, 606 | "outputs": [] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "collapsed": false, 611 | "input": [ 612 | "infl = ols_model.get_influence()\n", 613 | "student = infl.summary_frame()['student_resid']\n", 614 | "print student" 615 | ], 616 | "language": "python", 617 | "metadata": {}, 618 | "outputs": [] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "collapsed": false, 623 | "input": [ 624 | "print student.ix[np.abs(student) > 2]" 625 | ], 626 | "language": "python", 627 | "metadata": {}, 628 | "outputs": [] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "collapsed": false, 633 | "input": [ 634 | "print infl.summary_frame().ix['minister']" 635 | ], 636 | "language": "python", 637 | "metadata": {}, 638 | "outputs": [] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "collapsed": false, 643 | "input": [ 644 | "sidak = ols_model.outlier_test('sidak')\n", 645 | "sidak.sort('unadj_p', inplace=True)\n", 646 | "print sidak" 647 | ], 648 | "language": "python", 649 | "metadata": {}, 650 | "outputs": [] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "collapsed": false, 655 | "input": [ 656 | "fdr = ols_model.outlier_test('fdr_bh')\n", 657 | "fdr.sort('unadj_p', inplace=True)\n", 658 | "print fdr" 659 | ], 660 | "language": "python", 661 | "metadata": {}, 662 | 
"outputs": [] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "collapsed": false, 667 | "input": [ 668 | "rlm_model = rlm('prestige ~ income + education', prestige).fit()\n", 669 | "print rlm_model.summary()" 670 | ], 671 | "language": "python", 672 | "metadata": {}, 673 | "outputs": [] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "collapsed": false, 678 | "input": [ 679 | "print rlm_model.weights" 680 | ], 681 | "language": "python", 682 | "metadata": {}, 683 | "outputs": [] 684 | }, 685 | { 686 | "cell_type": "heading", 687 | "level": 3, 688 | "metadata": {}, 689 | "source": [ 690 | "Hertzprung Russell data for Star Cluster CYG 0B1 - Leverage Points" 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": {}, 696 | "source": [ 697 | "* Data is on the luminosity and temperature of 47 stars in the direction of Cygnus." 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "collapsed": false, 703 | "input": [ 704 | "dta = sm.datasets.get_rdataset(\"starsCYG\", \"robustbase\", cache=True).data" 705 | ], 706 | "language": "python", 707 | "metadata": {}, 708 | "outputs": [] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "collapsed": false, 713 | "input": [ 714 | "from matplotlib.patches import Ellipse\n", 715 | "fig = plt.figure(figsize=(12,8))\n", 716 | "ax = fig.add_subplot(111, xlabel='log(Temp)', ylabel='log(Light)', title='Hertzsprung-Russell Diagram of Star Cluster CYG OB1')\n", 717 | "ax.scatter(*dta.values.T)\n", 718 | "# highlight outliers\n", 719 | "e = Ellipse((3.5, 6), .2, 1, alpha=.25, color='r')\n", 720 | "ax.add_patch(e);\n", 721 | "ax.annotate('Red giants', xy=(3.6, 6), xytext=(3.8, 6),\n", 722 | " arrowprops=dict(facecolor='black', shrink=0.05, width=2),\n", 723 | " horizontalalignment='left', verticalalignment='bottom',\n", 724 | " clip_on=True, # clip to the axes bounding box\n", 725 | " fontsize=16,\n", 726 | " )\n", 727 | "# annotate these with their index\n", 728 | "for i,row in dta.ix[dta['log.Te'] < 3.8].iterrows():\n", 729 | " ax.annotate(i, row, row + .01, fontsize=14)\n", 730 | "xlim, ylim = ax.get_xlim(), ax.get_ylim()" 731 | ], 732 | "language": "python", 733 | "metadata": {}, 734 | "outputs": [] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "collapsed": false, 739 | "input": [ 740 | "from IPython.display import Image\n", 741 | "Image(filename='star_diagram.png')" 742 | ], 743 | "language": "python", 744 | "metadata": {}, 745 | "outputs": [] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "collapsed": false, 750 | "input": [ 751 | "y = dta['log.light']\n", 752 | "X = sm.add_constant(dta['log.Te'], prepend=True)\n", 753 | "ols_model = sm.OLS(y, X).fit()\n", 754 | "abline_plot(model_results=ols_model, ax=ax)" 755 | ], 756 | "language": "python", 757 | "metadata": {}, 758 | "outputs": [] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "collapsed": false, 763 | "input": [ 764 | "rlm_mod = sm.RLM(y, X, sm.robust.norms.TrimmedMean(.5)).fit()\n", 765 | "abline_plot(model_results=rlm_mod, ax=ax, color='red')" 766 | ], 767 | "language": "python", 768 | "metadata": {}, 769 | "outputs": [] 770 | }, 771 | { 772 | "cell_type": "markdown", 773 | "metadata": {}, 774 | "source": [ 775 | "* Why? Because M-estimators are not robust to leverage points." 
776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "collapsed": false, 781 | "input": [ 782 | "infl = ols_model.get_influence()" 783 | ], 784 | "language": "python", 785 | "metadata": {}, 786 | "outputs": [] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "collapsed": false, 791 | "input": [ 792 | "h_bar = 2*(ols_model.df_model + 1 )/ols_model.nobs\n", 793 | "hat_diag = infl.summary_frame()['hat_diag']\n", 794 | "hat_diag.ix[hat_diag > h_bar]" 795 | ], 796 | "language": "python", 797 | "metadata": {}, 798 | "outputs": [] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "collapsed": false, 803 | "input": [ 804 | "sidak2 = ols_model.outlier_test('sidak')\n", 805 | "sidak2.sort('unadj_p', inplace=True)\n", 806 | "print sidak2" 807 | ], 808 | "language": "python", 809 | "metadata": {}, 810 | "outputs": [] 811 | }, 812 | { 813 | "cell_type": "code", 814 | "collapsed": false, 815 | "input": [ 816 | "fdr2 = ols_model.outlier_test('fdr_bh')\n", 817 | "fdr2.sort('unadj_p', inplace=True)\n", 818 | "print fdr2" 819 | ], 820 | "language": "python", 821 | "metadata": {}, 822 | "outputs": [] 823 | }, 824 | { 825 | "cell_type": "markdown", 826 | "metadata": {}, 827 | "source": [ 828 | "* Let's delete that line" 829 | ] 830 | }, 831 | { 832 | "cell_type": "code", 833 | "collapsed": false, 834 | "input": [ 835 | "del ax.lines[-1]" 836 | ], 837 | "language": "python", 838 | "metadata": {}, 839 | "outputs": [] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "collapsed": false, 844 | "input": [ 845 | "weights = np.ones(len(X))\n", 846 | "weights[X[X['log.Te'] < 3.8].index.values - 1] = 0\n", 847 | "wls_model = sm.WLS(y, X, weights=weights).fit()\n", 848 | "abline_plot(model_results=wls_model, ax=ax, color='green')" 849 | ], 850 | "language": "python", 851 | "metadata": {}, 852 | "outputs": [] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "* MM estimators are good for this type of problem, unfortunately, we don't yet have these yet. \n", 859 | "* It's being worked on, but it gives a good excuse to look at the R cell magics in the notebook." 
860 | ] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "collapsed": false, 865 | "input": [ 866 | "yy = y.values[:,None]\n", 867 | "xx = X['log.Te'].values[:,None]" 868 | ], 869 | "language": "python", 870 | "metadata": {}, 871 | "outputs": [] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "collapsed": false, 876 | "input": [ 877 | "%load_ext rmagic\n", 878 | "\n", 879 | "%R library(robustbase)\n", 880 | "%Rpush yy xx\n", 881 | "%R mod <- lmrob(yy ~ xx);\n", 882 | "%R params <- mod$coefficients;\n", 883 | "%Rpull params" 884 | ], 885 | "language": "python", 886 | "metadata": {}, 887 | "outputs": [] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "collapsed": false, 892 | "input": [ 893 | "%R print(mod)" 894 | ], 895 | "language": "python", 896 | "metadata": {}, 897 | "outputs": [] 898 | }, 899 | { 900 | "cell_type": "code", 901 | "collapsed": false, 902 | "input": [ 903 | "print params" 904 | ], 905 | "language": "python", 906 | "metadata": {}, 907 | "outputs": [] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "collapsed": false, 912 | "input": [ 913 | "abline_plot(intercept=params[0], slope=params[1], ax=ax, color='green')" 914 | ], 915 | "language": "python", 916 | "metadata": {}, 917 | "outputs": [] 918 | }, 919 | { 920 | "cell_type": "heading", 921 | "level": 3, 922 | "metadata": {}, 923 | "source": [ 924 | "Exercise: Breakdown points of M-estimator" 925 | ] 926 | }, 927 | { 928 | "cell_type": "code", 929 | "collapsed": false, 930 | "input": [ 931 | "np.random.seed(12345)\n", 932 | "nobs = 200\n", 933 | "beta_true = np.array([3, 1, 2.5, 3, -4])\n", 934 | "X = np.random.uniform(-20,20, size=(nobs, len(beta_true)-1))\n", 935 | "# stack a constant in front\n", 936 | "X = sm.add_constant(X, prepend=True) # np.c_[np.ones(nobs), X]\n", 937 | "mc_iter = 500\n", 938 | "contaminate = .25 # percentage of response variables to contaminate" 939 | ], 940 | "language": "python", 941 | "metadata": {}, 942 | "outputs": [] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "collapsed": false, 947 | "input": [ 948 | "all_betas = []\n", 949 | "for i in range(mc_iter):\n", 950 | " y = np.dot(X, beta_true) + np.random.normal(size=200)\n", 951 | " random_idx = np.random.randint(0, nobs, size=int(contaminate * nobs))\n", 952 | " y[random_idx] = np.random.uniform(-750, 750) #, size=len(random_idx))\n", 953 | " beta_hat = sm.RLM(y, X).fit().params\n", 954 | " all_betas.append(beta_hat)" 955 | ], 956 | "language": "python", 957 | "metadata": {}, 958 | "outputs": [] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "collapsed": false, 963 | "input": [ 964 | "all_betas = np.asarray(all_betas)\n", 965 | "se_loss = lambda x : np.linalg.norm(x, ord=2)**2\n", 966 | "se_beta = map(se_loss, all_betas - beta_true)" 967 | ], 968 | "language": "python", 969 | "metadata": {}, 970 | "outputs": [] 971 | }, 972 | { 973 | "cell_type": "heading", 974 | "level": 4, 975 | "metadata": {}, 976 | "source": [ 977 | "Squared error loss" 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "collapsed": false, 983 | "input": [ 984 | "np.array(se_beta).mean()" 985 | ], 986 | "language": "python", 987 | "metadata": {}, 988 | "outputs": [] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "collapsed": false, 993 | "input": [ 994 | "all_betas.mean(0)" 995 | ], 996 | "language": "python", 997 | "metadata": {}, 998 | "outputs": [] 999 | }, 1000 | { 1001 | "cell_type": "code", 1002 | "collapsed": false, 1003 | "input": [ 1004 | "beta_true" 1005 | ], 1006 | "language": "python", 1007 | "metadata": {}, 1008 | "outputs": [] 1009 | 
}, 1010 | { 1011 | "cell_type": "code", 1012 | "collapsed": false, 1013 | "input": [ 1014 | "se_loss(all_betas.mean(0) - beta_true)" 1015 | ], 1016 | "language": "python", 1017 | "metadata": {}, 1018 | "outputs": [] 1019 | } 1020 | ], 1021 | "metadata": {} 1022 | } 1023 | ] 1024 | } -------------------------------------------------------------------------------- /robust_models.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # M-Estimators for Robust Linear Modeling 7 | 8 | # 9 | 10 | import numpy as np 11 | from scipy import stats 12 | import matplotlib.pyplot as plt 13 | 14 | import statsmodels.api as sm 15 | 16 | # 17 | 18 | # * An M-estimator minimizes the function 19 | # 20 | # $$Q(e_i, \rho) = \sum_i~\rho(\frac{e_i}{s})$$ 21 | # 22 | # where $\rho$ is a symmetric function of the residuals 23 | # 24 | # * The effect of $\rho$ is to reduce the influence of outliers 25 | # * $s$ is an estimate of scale. 26 | # * The robust estimates $\hat{\beta}$ are computed by the iteratively re-weighted least squares algorithm 27 | 28 | # 29 | 30 | # * We have several choices available for the weighting functions to be used 31 | 32 | # 33 | 34 | norms = sm.robust.norms 35 | 36 | # 37 | 38 | def plot_weights(support, weights_func, xlabels, xticks): 39 | fig = plt.figure(figsize=(12,8)) 40 | ax = fig.add_subplot(111) 41 | ax.plot(support, weights_func(support)) 42 | ax.set_xticks(xticks) 43 | ax.set_xticklabels(xlabels, fontsize=16) 44 | ax.set_ylim(-.1, 1.1) 45 | return ax 46 | 47 | # 48 | 49 | # Andrew's Wave 50 | 51 | # 52 | 53 | help(norms.AndrewWave.weights) 54 | 55 | # 56 | 57 | a = 1.339 58 | support = np.linspace(-np.pi*a, np.pi*a, 100) 59 | andrew = norms.AndrewWave(a=a) 60 | plot_weights(support, andrew.weights, ['$-\pi*a$', '0', '$\pi*a$'], [-np.pi*a, 0, np.pi*a]); 61 | 62 | # 63 | 64 | # Hampel's 17A 65 | 66 | # 67 | 68 | help(norms.Hampel.weights) 69 | 70 | # 71 | 72 | c = 8 73 | support = np.linspace(-3*c, 3*c, 1000) 74 | hampel = norms.Hampel(a=2., b=4., c=c) 75 | plot_weights(support, hampel.weights, ['3*c', '0', '3*c'], [-3*c, 0, 3*c]); 76 | 77 | # 78 | 79 | # Huber's t 80 | 81 | # 82 | 83 | help(norms.HuberT.weights) 84 | 85 | # 86 | 87 | t = 1.345 88 | support = np.linspace(-3*t, 3*t, 1000) 89 | huber = norms.HuberT(t=t) 90 | plot_weights(support, huber.weights, ['-3*t', '0', '3*t'], [-3*t, 0, 3*t]); 91 | 92 | # 93 | 94 | # Least Squares 95 | 96 | # 97 | 98 | help(norms.LeastSquares.weights) 99 | 100 | # 101 | 102 | support = np.linspace(-3, 3, 1000) 103 | lst_sq = norms.LeastSquares() 104 | plot_weights(support, lst_sq.weights, ['-3', '0', '3'], [-3, 0, 3]); 105 | 106 | # 107 | 108 | # Ramsay's Ea 109 | 110 | # 111 | 112 | help(norms.RamsayE.weights) 113 | 114 | # 115 | 116 | a = .3 117 | support = np.linspace(-3*a, 3*a, 1000) 118 | ramsay = norms.RamsayE(a=a) 119 | plot_weights(support, ramsay.weights, ['-3*a', '0', '3*a'], [-3*a, 0, 3*a]); 120 | 121 | # 122 | 123 | # Trimmed Mean 124 | 125 | # 126 | 127 | help(norms.TrimmedMean.weights) 128 | 129 | # 130 | 131 | c = 2 132 | support = np.linspace(-3*c, 3*c, 1000) 133 | trimmed = norms.TrimmedMean(c=c) 134 | plot_weights(support, trimmed.weights, ['-3*c', '0', '3*c'], [-3*c, 0, 3*c]); 135 | 136 | # 137 | 138 | # Tukey's Biweight 139 | 140 | # 141 | 142 | help(norms.TukeyBiweight.weights) 143 | 144 | # 145 | 146 | c = 4.685 147 | support = np.linspace(-3*c, 3*c, 1000) 148 | tukey = norms.TukeyBiweight(c=c) 149 | plot_weights(support, 
tukey.weights, ['-3*c', '0', '3*c'], [-3*c, 0, 3*c]); 150 | 151 | # 152 | 153 | # Scale Estimators 154 | 155 | # 156 | 157 | # * Robust estimates of the location 158 | 159 | # 160 | 161 | x = np.array([1, 2, 3, 4, 500]) 162 | 163 | # 164 | 165 | # * The mean is not a robust estimator of location 166 | 167 | # 168 | 169 | x.mean() 170 | 171 | # 172 | 173 | # * The median, on the other hand, is a robust estimator with a breakdown point of 50% 174 | 175 | # 176 | 177 | np.median(x) 178 | 179 | # 180 | 181 | # * Analagously for the scale 182 | # * The standard deviation is not robust 183 | 184 | # 185 | 186 | x.std() 187 | 188 | # 189 | 190 | # Median Absolute Deviation 191 | # 192 | # $$ median_i |X_i - median_j(X_j)|) $$ 193 | 194 | # 195 | 196 | # Standardized Median Absolute Deviation is a consistent estimator for $\hat{\sigma}$ 197 | # 198 | # $$\hat{\sigma}=K \cdot MAD$$ 199 | # 200 | # where $K$ depends on the distribution. For the normal distribution for example, 201 | # 202 | # $$K = \Phi^{-1}(.75)$$ 203 | 204 | # 205 | 206 | stats.norm.ppf(.75) 207 | 208 | # 209 | 210 | print x 211 | 212 | # 213 | 214 | sm.robust.scale.stand_mad(x) 215 | 216 | # 217 | 218 | np.array([1,2,3,4,5.]).std() 219 | 220 | # 221 | 222 | # * The default for Robust Linear Models is MAD 223 | # * another popular choice is Huber's proposal 2 224 | 225 | # 226 | 227 | np.random.seed(12345) 228 | fat_tails = stats.t(6).rvs(40) 229 | 230 | # 231 | 232 | kde = sm.nonparametric.KDE(fat_tails) 233 | kde.fit() 234 | fig = plt.figure(figsize=(12,8)) 235 | ax = fig.add_subplot(111) 236 | ax.plot(kde.support, kde.density); 237 | 238 | # 239 | 240 | print fat_tails.mean(), fat_tails.std() 241 | 242 | # 243 | 244 | print stats.norm.fit(fat_tails) 245 | 246 | # 247 | 248 | print stats.t.fit(fat_tails, f0=6) 249 | 250 | # 251 | 252 | huber = sm.robust.scale.Huber() 253 | loc, scale = huber(fat_tails) 254 | print loc, scale 255 | 256 | # 257 | 258 | sm.robust.stand_mad(fat_tails) 259 | 260 | # 261 | 262 | sm.robust.stand_mad(fat_tails, c=stats.t(6).ppf(.75)) 263 | 264 | # 265 | 266 | sm.robust.scale.mad(fat_tails) 267 | 268 | # 269 | 270 | # Duncan's Occupational Prestige data - M-estimation for outliers 271 | 272 | # 273 | 274 | from statsmodels.graphics.api import abline_plot 275 | from statsmodels.formula.api import ols, rlm 276 | 277 | # 278 | 279 | prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True).data 280 | 281 | # 282 | 283 | print prestige.head(10) 284 | 285 | # 286 | 287 | fig = plt.figure(figsize=(12,12)) 288 | ax1 = fig.add_subplot(211, xlabel='Income', ylabel='Prestige') 289 | ax1.scatter(prestige.income, prestige.prestige) 290 | xy_outlier = prestige.ix['minister'][['income','prestige']] 291 | ax1.annotate('Minister', xy_outlier, xy_outlier+1, fontsize=16) 292 | ax2 = fig.add_subplot(212, xlabel='Education', 293 | ylabel='Prestige') 294 | ax2.scatter(prestige.education, prestige.prestige); 295 | 296 | # 297 | 298 | ols_model = ols('prestige ~ income + education', prestige).fit() 299 | print ols_model.summary() 300 | 301 | # 302 | 303 | infl = ols_model.get_influence() 304 | student = infl.summary_frame()['student_resid'] 305 | print student 306 | 307 | # 308 | 309 | print student.ix[np.abs(student) > 2] 310 | 311 | # 312 | 313 | print infl.summary_frame().ix['minister'] 314 | 315 | # 316 | 317 | sidak = ols_model.outlier_test('sidak') 318 | sidak.sort('unadj_p', inplace=True) 319 | print sidak 320 | 321 | # 322 | 323 | fdr = ols_model.outlier_test('fdr_bh') 324 | fdr.sort('unadj_p', inplace=True) 325 
| print fdr 326 | 327 | # 328 | 329 | rlm_model = rlm('prestige ~ income + education', prestige).fit() 330 | print rlm_model.summary() 331 | 332 | # 333 | 334 | print rlm_model.weights 335 | 336 | # 337 | 338 | # Hertzprung Russell data for Star Cluster CYG 0B1 - Leverage Points 339 | 340 | # 341 | 342 | # * Data is on the luminosity and temperature of 47 stars in the direction of Cygnus. 343 | 344 | # 345 | 346 | dta = sm.datasets.get_rdataset("starsCYG", "robustbase", cache=True).data 347 | 348 | # 349 | 350 | from matplotlib.patches import Ellipse 351 | fig = plt.figure(figsize=(12,8)) 352 | ax = fig.add_subplot(111, xlabel='log(Temp)', ylabel='log(Light)', title='Hertzsprung-Russell Diagram of Star Cluster CYG OB1') 353 | ax.scatter(*dta.values.T) 354 | # highlight outliers 355 | e = Ellipse((3.5, 6), .2, 1, alpha=.25, color='r') 356 | ax.add_patch(e); 357 | ax.annotate('Red giants', xy=(3.6, 6), xytext=(3.8, 6), 358 | arrowprops=dict(facecolor='black', shrink=0.05, width=2), 359 | horizontalalignment='left', verticalalignment='bottom', 360 | clip_on=True, # clip to the axes bounding box 361 | fontsize=16, 362 | ) 363 | # annotate these with their index 364 | for i,row in dta.ix[dta['log.Te'] < 3.8].iterrows(): 365 | ax.annotate(i, row, row + .01, fontsize=14) 366 | xlim, ylim = ax.get_xlim(), ax.get_ylim() 367 | 368 | # 369 | 370 | from IPython.display import Image 371 | Image(filename='star_diagram.png') 372 | 373 | # 374 | 375 | y = dta['log.light'] 376 | X = sm.add_constant(dta['log.Te'], prepend=True) 377 | ols_model = sm.OLS(y, X).fit() 378 | abline_plot(model_results=ols_model, ax=ax) 379 | 380 | # 381 | 382 | rlm_mod = sm.RLM(y, X, sm.robust.norms.TrimmedMean(.5)).fit() 383 | abline_plot(model_results=rlm_mod, ax=ax, color='red') 384 | 385 | # 386 | 387 | # * Why? Because M-estimators are not robust to leverage points. 388 | 389 | # 390 | 391 | infl = ols_model.get_influence() 392 | 393 | # 394 | 395 | h_bar = 2*(ols_model.df_model + 1 )/ols_model.nobs 396 | hat_diag = infl.summary_frame()['hat_diag'] 397 | hat_diag.ix[hat_diag > h_bar] 398 | 399 | # 400 | 401 | sidak2 = ols_model.outlier_test('sidak') 402 | sidak2.sort('unadj_p', inplace=True) 403 | print sidak2 404 | 405 | # 406 | 407 | fdr2 = ols_model.outlier_test('fdr_bh') 408 | fdr2.sort('unadj_p', inplace=True) 409 | print fdr2 410 | 411 | # 412 | 413 | # * Let's delete that line 414 | 415 | # 416 | 417 | del ax.lines[-1] 418 | 419 | # 420 | 421 | weights = np.ones(len(X)) 422 | weights[X[X['log.Te'] < 3.8].index.values - 1] = 0 423 | wls_model = sm.WLS(y, X, weights=weights).fit() 424 | abline_plot(model_results=wls_model, ax=ax, color='green') 425 | 426 | # 427 | 428 | # * MM estimators are good for this type of problem, unfortunately, we don't yet have these yet. 429 | # * It's being worked on, but it gives a good excuse to look at the R cell magics in the notebook. 
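# Before switching to R, it is worth seeing why the M-step alone is not enough. Below is a
# minimal, hand-rolled IRLS loop for Tukey's biweight (an illustrative sketch only, not
# statsmodels' implementation), started from OLS. Because the starting fit and the MAD scale
# are themselves distorted by the high-leverage red giants, the iterations typically stay
# near the poor OLS solution; an MM-estimator avoids this by starting from a high-breakdown fit.

def tukey_biweight_weights(u, c=4.685):
    # weight function w(u) = (1 - (u/c)^2)^2 for |u| <= c, and 0 otherwise
    w = (1 - (u / c)**2)**2
    w[np.abs(u) > c] = 0.
    return w

Xmat, yvec = X.values, y.values
params = np.linalg.lstsq(Xmat, yvec)[0]           # OLS starting values
for _ in range(50):
    resid = yvec - Xmat.dot(params)
    scale = sm.robust.scale.mad(resid)            # resistant scale of the residuals
    w = tukey_biweight_weights(resid / scale)
    WX = Xmat * w[:, None]
    new_params = np.linalg.solve(Xmat.T.dot(WX), WX.T.dot(yvec))
    if np.allclose(new_params, params):
        break
    params = new_params
print params    # compare with rlm_mod.params above and the lmrob fit below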
430 | 431 | # 432 | 433 | yy = y.values[:,None] 434 | xx = X['log.Te'].values[:,None] 435 | 436 | # 437 | 438 | %load_ext rmagic 439 | 440 | %R library(robustbase) 441 | %Rpush yy xx 442 | %R mod <- lmrob(yy ~ xx); 443 | %R params <- mod$coefficients; 444 | %Rpull params 445 | 446 | # 447 | 448 | %R print(mod) 449 | 450 | # 451 | 452 | print params 453 | 454 | # 455 | 456 | abline_plot(intercept=params[0], slope=params[1], ax=ax, color='green') 457 | 458 | # 459 | 460 | # Exercise: Breakdown points of M-estimator 461 | 462 | # 463 | 464 | np.random.seed(12345) 465 | nobs = 200 466 | beta_true = np.array([3, 1, 2.5, 3, -4]) 467 | X = np.random.uniform(-20,20, size=(nobs, len(beta_true)-1)) 468 | # stack a constant in front 469 | X = sm.add_constant(X, prepend=True) # np.c_[np.ones(nobs), X] 470 | mc_iter = 500 471 | contaminate = .25 # percentage of response variables to contaminate 472 | 473 | # 474 | 475 | all_betas = [] 476 | for i in range(mc_iter): 477 | y = np.dot(X, beta_true) + np.random.normal(size=200) 478 | random_idx = np.random.randint(0, nobs, size=int(contaminate * nobs)) 479 | y[random_idx] = np.random.uniform(-750, 750) #, size=len(random_idx)) 480 | beta_hat = sm.RLM(y, X).fit().params 481 | all_betas.append(beta_hat) 482 | 483 | # 484 | 485 | all_betas = np.asarray(all_betas) 486 | se_loss = lambda x : np.linalg.norm(x, ord=2)**2 487 | se_beta = map(se_loss, all_betas - beta_true) 488 | 489 | # 490 | 491 | # Squared error loss 492 | 493 | # 494 | 495 | np.array(se_beta).mean() 496 | 497 | # 498 | 499 | all_betas.mean(0) 500 | 501 | # 502 | 503 | beta_true 504 | 505 | # 506 | 507 | se_loss(all_betas.mean(0) - beta_true) 508 | 509 | -------------------------------------------------------------------------------- /salary.table: -------------------------------------------------------------------------------- 1 | S,X,E,M 2 | 13876,1,1,1 3 | 11608,1,3,0 4 | 18701,1,3,1 5 | 11283,1,2,0 6 | 11767,1,3,0 7 | 20872,2,2,1 8 | 11772,2,2,0 9 | 10535,2,1,0 10 | 12195,2,3,0 11 | 12313,3,2,0 12 | 14975,3,1,1 13 | 21371,3,2,1 14 | 19800,3,3,1 15 | 11417,4,1,0 16 | 20263,4,3,1 17 | 13231,4,3,0 18 | 12884,4,2,0 19 | 13245,5,2,0 20 | 13677,5,3,0 21 | 15965,5,1,1 22 | 12336,6,1,0 23 | 21352,6,3,1 24 | 13839,6,2,0 25 | 22884,6,2,1 26 | 16978,7,1,1 27 | 14803,8,2,0 28 | 17404,8,1,1 29 | 22184,8,3,1 30 | 13548,8,1,0 31 | 14467,10,1,0 32 | 15942,10,2,0 33 | 23174,10,3,1 34 | 23780,10,2,1 35 | 25410,11,2,1 36 | 14861,11,1,0 37 | 16882,12,2,0 38 | 24170,12,3,1 39 | 15990,13,1,0 40 | 26330,13,2,1 41 | 17949,14,2,0 42 | 25685,15,3,1 43 | 27837,16,2,1 44 | 18838,16,2,0 45 | 17483,16,1,0 46 | 19207,17,2,0 47 | 19346,20,1,0 48 | -------------------------------------------------------------------------------- /star_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jseabold/statsmodels-tutorial/07b0e675a699d6cd3fd397d9c09dc5c529377f5d/star_diagram.png -------------------------------------------------------------------------------- /tsa_arma.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "tsa_arma" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 3, 13 | "metadata": {}, 14 | "source": [ 15 | "ARMA example using sunpots data" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "collapsed": false, 21 | "input": [ 22 | "import numpy as np\n", 23 | "from scipy import 
stats\n", 24 | "import pandas\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "\n", 27 | "import statsmodels.api as sm" 28 | ], 29 | "language": "python", 30 | "metadata": {}, 31 | "outputs": [] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "collapsed": false, 36 | "input": [ 37 | "from statsmodels.graphics.api import qqplot" 38 | ], 39 | "language": "python", 40 | "metadata": {}, 41 | "outputs": [] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "collapsed": false, 46 | "input": [ 47 | "print sm.datasets.sunspots.NOTE" 48 | ], 49 | "language": "python", 50 | "metadata": {}, 51 | "outputs": [] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "collapsed": false, 56 | "input": [ 57 | "dta = sm.datasets.sunspots.load_pandas().data" 58 | ], 59 | "language": "python", 60 | "metadata": {}, 61 | "outputs": [] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "collapsed": false, 66 | "input": [ 67 | "dta.index = pandas.Index(sm.tsa.datetools.dates_from_range('1700', '2008'))\n", 68 | "del dta[\"YEAR\"]" 69 | ], 70 | "language": "python", 71 | "metadata": {}, 72 | "outputs": [] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "collapsed": false, 77 | "input": [ 78 | "dta.plot(figsize=(12,8));" 79 | ], 80 | "language": "python", 81 | "metadata": {}, 82 | "outputs": [] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "collapsed": false, 87 | "input": [ 88 | "fig = plt.figure(figsize=(12,8))\n", 89 | "ax1 = fig.add_subplot(211)\n", 90 | "fig = sm.graphics.tsa.plot_acf(dta.values.squeeze(), lags=40, ax=ax1)\n", 91 | "ax2 = fig.add_subplot(212)\n", 92 | "fig = sm.graphics.tsa.plot_pacf(dta, lags=40, ax=ax2)" 93 | ], 94 | "language": "python", 95 | "metadata": {}, 96 | "outputs": [] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "collapsed": false, 101 | "input": [ 102 | "arma_mod20 = sm.tsa.ARMA(dta, (2,0)).fit()\n", 103 | "print arma_mod20.params" 104 | ], 105 | "language": "python", 106 | "metadata": {}, 107 | "outputs": [] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "collapsed": false, 112 | "input": [ 113 | "arma_mod30 = sm.tsa.ARMA(dta, (3,0)).fit()" 114 | ], 115 | "language": "python", 116 | "metadata": {}, 117 | "outputs": [] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "collapsed": false, 122 | "input": [ 123 | "print arma_mod20.aic, arma_mod20.bic, arma_mod20.hqic" 124 | ], 125 | "language": "python", 126 | "metadata": {}, 127 | "outputs": [] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "collapsed": false, 132 | "input": [ 133 | "print arma_mod30.params" 134 | ], 135 | "language": "python", 136 | "metadata": {}, 137 | "outputs": [] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "collapsed": false, 142 | "input": [ 143 | "print arma_mod30.aic, arma_mod30.bic, arma_mod30.hqic" 144 | ], 145 | "language": "python", 146 | "metadata": {}, 147 | "outputs": [] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "* Does our model obey the theory?" 
154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "collapsed": false, 159 | "input": [ 160 | "sm.stats.durbin_watson(arma_mod30.resid.values)" 161 | ], 162 | "language": "python", 163 | "metadata": {}, 164 | "outputs": [] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "collapsed": false, 169 | "input": [ 170 | "fig = plt.figure(figsize=(12,8))\n", 171 | "ax = fig.add_subplot(111)\n", 172 | "ax = arma_mod30.resid.plot(ax=ax);" 173 | ], 174 | "language": "python", 175 | "metadata": {}, 176 | "outputs": [] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "collapsed": false, 181 | "input": [ 182 | "resid = arma_mod30.resid" 183 | ], 184 | "language": "python", 185 | "metadata": {}, 186 | "outputs": [] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "collapsed": false, 191 | "input": [ 192 | "stats.normaltest(resid)" 193 | ], 194 | "language": "python", 195 | "metadata": {}, 196 | "outputs": [] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "collapsed": false, 201 | "input": [ 202 | "fig = plt.figure(figsize=(12,8))\n", 203 | "ax = fig.add_subplot(111)\n", 204 | "fig = qqplot(resid, line='q', ax=ax, fit=True)" 205 | ], 206 | "language": "python", 207 | "metadata": {}, 208 | "outputs": [] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "collapsed": false, 213 | "input": [ 214 | "fig = plt.figure(figsize=(12,8))\n", 215 | "ax1 = fig.add_subplot(211)\n", 216 | "fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1)\n", 217 | "ax2 = fig.add_subplot(212)\n", 218 | "fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2)" 219 | ], 220 | "language": "python", 221 | "metadata": {}, 222 | "outputs": [] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "collapsed": false, 227 | "input": [ 228 | "r,q,p = sm.tsa.acf(resid.values.squeeze(), qstat=True)\n", 229 | "data = np.c_[range(1,41), r[1:], q, p]\n", 230 | "table = pandas.DataFrame(data, columns=['lag', \"AC\", \"Q\", \"Prob(>Q)\"])\n", 231 | "print table.set_index('lag')" 232 | ], 233 | "language": "python", 234 | "metadata": {}, 235 | "outputs": [] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "* This indicates a lack of fit." 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "* In-sample dynamic prediction. How good does our model do?" 
249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "collapsed": false, 254 | "input": [ 255 | "predict_sunspots = arma_mod30.predict('1990', '2012', dynamic=True)\n", 256 | "print predict_sunspots" 257 | ], 258 | "language": "python", 259 | "metadata": {}, 260 | "outputs": [] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "collapsed": false, 265 | "input": [ 266 | "ax = dta.ix['1950':].plot(figsize=(12,8))\n", 267 | "ax = predict_sunspots.plot(ax=ax, style='r--', label='Dynamic Prediction');\n", 268 | "ax.legend();\n", 269 | "ax.axis((-20.0, 38.0, -4.0, 200.0));" 270 | ], 271 | "language": "python", 272 | "metadata": {}, 273 | "outputs": [] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "collapsed": false, 278 | "input": [ 279 | "def mean_forecast_err(y, yhat):\n", 280 | " return y.sub(yhat).mean()" 281 | ], 282 | "language": "python", 283 | "metadata": {}, 284 | "outputs": [] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "collapsed": false, 289 | "input": [ 290 | "mean_forecast_err(dta.SUNACTIVITY, predict_sunspots)" 291 | ], 292 | "language": "python", 293 | "metadata": {}, 294 | "outputs": [] 295 | }, 296 | { 297 | "cell_type": "heading", 298 | "level": 3, 299 | "metadata": {}, 300 | "source": [ 301 | "Exercise: Can you obtain a better fit for the Sunspots model? (Hint: sm.tsa.AR has a method select_order)" 302 | ] 303 | }, 304 | { 305 | "cell_type": "heading", 306 | "level": 3, 307 | "metadata": {}, 308 | "source": [ 309 | "Simulated ARMA(4,1): Model Identification is Difficult" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "collapsed": false, 315 | "input": [ 316 | "from statsmodels.tsa.arima_process import arma_generate_sample, ArmaProcess" 317 | ], 318 | "language": "python", 319 | "metadata": {}, 320 | "outputs": [] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "collapsed": false, 325 | "input": [ 326 | "np.random.seed(1234)\n", 327 | "# include zero-th lag\n", 328 | "arparams = np.array([1, .75, -.65, -.55, .9])\n", 329 | "maparams = np.array([1, .65])" 330 | ], 331 | "language": "python", 332 | "metadata": {}, 333 | "outputs": [] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "* Let's make sure this models is estimable." 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "collapsed": false, 345 | "input": [ 346 | "arma_t = ArmaProcess(arparams, maparams)" 347 | ], 348 | "language": "python", 349 | "metadata": {}, 350 | "outputs": [] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "collapsed": false, 355 | "input": [ 356 | "arma_t.isinvertible()" 357 | ], 358 | "language": "python", 359 | "metadata": {}, 360 | "outputs": [] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "collapsed": false, 365 | "input": [ 366 | "arma_t.isstationary()" 367 | ], 368 | "language": "python", 369 | "metadata": {}, 370 | "outputs": [] 371 | }, 372 | { 373 | "cell_type": "raw", 374 | "metadata": {}, 375 | "source": [ 376 | "* What does this mean?" 
377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "collapsed": false, 382 | "input": [ 383 | "fig = plt.figure(figsize=(12,8))\n", 384 | "ax = fig.add_subplot(111)\n", 385 | "ax.plot(arma_t.generate_sample(size=50));" 386 | ], 387 | "language": "python", 388 | "metadata": {}, 389 | "outputs": [] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "collapsed": false, 394 | "input": [ 395 | "arparams = np.array([1, .35, -.15, .55, .1])\n", 396 | "maparams = np.array([1, .65])\n", 397 | "arma_t = ArmaProcess(arparams, maparams)\n", 398 | "arma_t.isstationary()" 399 | ], 400 | "language": "python", 401 | "metadata": {}, 402 | "outputs": [] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "collapsed": false, 407 | "input": [ 408 | "arma_rvs = arma_t.generate_sample(size=500, burnin=250, scale=2.5)" 409 | ], 410 | "language": "python", 411 | "metadata": {}, 412 | "outputs": [] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "collapsed": false, 417 | "input": [ 418 | "fig = plt.figure(figsize=(12,8))\n", 419 | "ax1 = fig.add_subplot(211)\n", 420 | "fig = sm.graphics.tsa.plot_acf(arma_rvs, lags=40, ax=ax1)\n", 421 | "ax2 = fig.add_subplot(212)\n", 422 | "fig = sm.graphics.tsa.plot_pacf(arma_rvs, lags=40, ax=ax2)" 423 | ], 424 | "language": "python", 425 | "metadata": {}, 426 | "outputs": [] 427 | }, 428 | { 429 | "cell_type": "raw", 430 | "metadata": {}, 431 | "source": [ 432 | "* For mixed ARMA processes the Autocorrelation function is a mixture of exponentials and damped sine waves after (q-p) lags. \n", 433 | "* The partial autocorrelation function is a mixture of exponentials and dampened sine waves after (p-q) lags." 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "collapsed": false, 439 | "input": [ 440 | "arma11 = sm.tsa.ARMA(arma_rvs, (1,1)).fit()\n", 441 | "resid = arma11.resid\n", 442 | "r,q,p = sm.tsa.acf(resid, qstat=True)\n", 443 | "data = np.c_[range(1,41), r[1:], q, p]\n", 444 | "table = pandas.DataFrame(data, columns=['lag', \"AC\", \"Q\", \"Prob(>Q)\"])\n", 445 | "print table.set_index('lag')" 446 | ], 447 | "language": "python", 448 | "metadata": {}, 449 | "outputs": [] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "collapsed": false, 454 | "input": [ 455 | "arma41 = sm.tsa.ARMA(arma_rvs, (4,1)).fit()\n", 456 | "resid = arma41.resid\n", 457 | "r,q,p = sm.tsa.acf(resid, qstat=True)\n", 458 | "data = np.c_[range(1,41), r[1:], q, p]\n", 459 | "table = pandas.DataFrame(data, columns=['lag', \"AC\", \"Q\", \"Prob(>Q)\"])\n", 460 | "print table.set_index('lag')" 461 | ], 462 | "language": "python", 463 | "metadata": {}, 464 | "outputs": [] 465 | }, 466 | { 467 | "cell_type": "heading", 468 | "level": 3, 469 | "metadata": {}, 470 | "source": [ 471 | "Exercise: How good of in-sample prediction can you do for another series, say, CPI" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "collapsed": false, 477 | "input": [ 478 | "macrodta = sm.datasets.macrodata.load_pandas().data\n", 479 | "macrodta.index = pandas.Index(sm.tsa.datetools.dates_from_range('1959Q1', '2009Q3'))\n", 480 | "cpi = macrodta[\"cpi\"]" 481 | ], 482 | "language": "python", 483 | "metadata": {}, 484 | "outputs": [] 485 | }, 486 | { 487 | "cell_type": "heading", 488 | "level": 4, 489 | "metadata": {}, 490 | "source": [ 491 | "Hint: " 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "collapsed": false, 497 | "input": [ 498 | "fig = plt.figure(figsize=(12,8))\n", 499 | "ax = fig.add_subplot(111)\n", 500 | "ax = cpi.plot(ax=ax);\n", 501 | "ax.legend();" 502 | ], 503 | "language": 
"python", 504 | "metadata": {}, 505 | "outputs": [] 506 | }, 507 | { 508 | "cell_type": "raw", 509 | "metadata": {}, 510 | "source": [ 511 | "P-value of the unit-root test, resoundly rejects the null of no unit-root." 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "collapsed": false, 517 | "input": [ 518 | "print sm.tsa.adfuller(cpi)[1]" 519 | ], 520 | "language": "python", 521 | "metadata": {}, 522 | "outputs": [] 523 | } 524 | ], 525 | "metadata": {} 526 | } 527 | ] 528 | } -------------------------------------------------------------------------------- /tsa_arma.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # ARMA example using sunpots data 7 | 8 | # 9 | 10 | import numpy as np 11 | from scipy import stats 12 | import pandas 13 | import matplotlib.pyplot as plt 14 | 15 | import statsmodels.api as sm 16 | 17 | # 18 | 19 | from statsmodels.graphics.api import qqplot 20 | 21 | # 22 | 23 | print sm.datasets.sunspots.NOTE 24 | 25 | # 26 | 27 | dta = sm.datasets.sunspots.load_pandas().data 28 | 29 | # 30 | 31 | dta.index = pandas.Index(sm.tsa.datetools.dates_from_range('1700', '2008')) 32 | del dta["YEAR"] 33 | 34 | # 35 | 36 | dta.plot(figsize=(12,8)); 37 | 38 | # 39 | 40 | fig = plt.figure(figsize=(12,8)) 41 | ax1 = fig.add_subplot(211) 42 | fig = sm.graphics.tsa.plot_acf(dta.values.squeeze(), lags=40, ax=ax1) 43 | ax2 = fig.add_subplot(212) 44 | fig = sm.graphics.tsa.plot_pacf(dta, lags=40, ax=ax2) 45 | 46 | # 47 | 48 | arma_mod20 = sm.tsa.ARMA(dta, (2,0)).fit() 49 | print arma_mod20.params 50 | 51 | # 52 | 53 | arma_mod30 = sm.tsa.ARMA(dta, (3,0)).fit() 54 | 55 | # 56 | 57 | print arma_mod20.aic, arma_mod20.bic, arma_mod20.hqic 58 | 59 | # 60 | 61 | print arma_mod30.params 62 | 63 | # 64 | 65 | print arma_mod30.aic, arma_mod30.bic, arma_mod30.hqic 66 | 67 | # 68 | 69 | # * Does our model obey the theory? 70 | 71 | # 72 | 73 | sm.stats.durbin_watson(arma_mod30.resid.values) 74 | 75 | # 76 | 77 | fig = plt.figure(figsize=(12,8)) 78 | ax = fig.add_subplot(111) 79 | ax = arma_mod30.resid.plot(ax=ax); 80 | 81 | # 82 | 83 | resid = arma_mod30.resid 84 | 85 | # 86 | 87 | stats.normaltest(resid) 88 | 89 | # 90 | 91 | fig = plt.figure(figsize=(12,8)) 92 | ax = fig.add_subplot(111) 93 | fig = qqplot(resid, line='q', ax=ax, fit=True) 94 | 95 | # 96 | 97 | fig = plt.figure(figsize=(12,8)) 98 | ax1 = fig.add_subplot(211) 99 | fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1) 100 | ax2 = fig.add_subplot(212) 101 | fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2) 102 | 103 | # 104 | 105 | r,q,p = sm.tsa.acf(resid.values.squeeze(), qstat=True) 106 | data = np.c_[range(1,41), r[1:], q, p] 107 | table = pandas.DataFrame(data, columns=['lag', "AC", "Q", "Prob(>Q)"]) 108 | print table.set_index('lag') 109 | 110 | # 111 | 112 | # * This indicates a lack of fit. 113 | 114 | # 115 | 116 | # * In-sample dynamic prediction. How good does our model do? 
117 | 118 | # 119 | 120 | predict_sunspots = arma_mod30.predict('1990', '2012', dynamic=True) 121 | print predict_sunspots 122 | 123 | # 124 | 125 | ax = dta.ix['1950':].plot(figsize=(12,8)) 126 | ax = predict_sunspots.plot(ax=ax, style='r--', label='Dynamic Prediction'); 127 | ax.legend(); 128 | ax.axis((-20.0, 38.0, -4.0, 200.0)); 129 | 130 | # 131 | 132 | def mean_forecast_err(y, yhat): 133 | return y.sub(yhat).mean() 134 | 135 | # 136 | 137 | mean_forecast_err(dta.SUNACTIVITY, predict_sunspots) 138 | 139 | # 140 | 141 | # Exercise: Can you obtain a better fit for the Sunspots model? (Hint: sm.tsa.AR has a method select_order) 142 | 143 | # 144 | 145 | # Simulated ARMA(4,1): Model Identification is Difficult 146 | 147 | # 148 | 149 | from statsmodels.tsa.arima_process import arma_generate_sample, ArmaProcess 150 | 151 | # 152 | 153 | np.random.seed(1234) 154 | # include zero-th lag 155 | arparams = np.array([1, .75, -.65, -.55, .9]) 156 | maparams = np.array([1, .65]) 157 | 158 | # 159 | 160 | # * Let's make sure this models is estimable. 161 | 162 | # 163 | 164 | arma_t = ArmaProcess(arparams, maparams) 165 | 166 | # 167 | 168 | arma_t.isinvertible() 169 | 170 | # 171 | 172 | arma_t.isstationary() 173 | 174 | # 175 | 176 | # * What does this mean? 177 | 178 | # 179 | 180 | fig = plt.figure(figsize=(12,8)) 181 | ax = fig.add_subplot(111) 182 | ax.plot(arma_t.generate_sample(size=50)); 183 | 184 | # 185 | 186 | arparams = np.array([1, .35, -.15, .55, .1]) 187 | maparams = np.array([1, .65]) 188 | arma_t = ArmaProcess(arparams, maparams) 189 | arma_t.isstationary() 190 | 191 | # 192 | 193 | arma_rvs = arma_t.generate_sample(size=500, burnin=250, scale=2.5) 194 | 195 | # 196 | 197 | fig = plt.figure(figsize=(12,8)) 198 | ax1 = fig.add_subplot(211) 199 | fig = sm.graphics.tsa.plot_acf(arma_rvs, lags=40, ax=ax1) 200 | ax2 = fig.add_subplot(212) 201 | fig = sm.graphics.tsa.plot_pacf(arma_rvs, lags=40, ax=ax2) 202 | 203 | # 204 | 205 | # * For mixed ARMA processes the Autocorrelation function is a mixture of exponentials and damped sine waves after (q-p) lags. 206 | # * The partial autocorrelation function is a mixture of exponentials and dampened sine waves after (p-q) lags. 207 | 208 | # 209 | 210 | arma11 = sm.tsa.ARMA(arma_rvs, (1,1)).fit() 211 | resid = arma11.resid 212 | r,q,p = sm.tsa.acf(resid, qstat=True) 213 | data = np.c_[range(1,41), r[1:], q, p] 214 | table = pandas.DataFrame(data, columns=['lag', "AC", "Q", "Prob(>Q)"]) 215 | print table.set_index('lag') 216 | 217 | # 218 | 219 | arma41 = sm.tsa.ARMA(arma_rvs, (4,1)).fit() 220 | resid = arma41.resid 221 | r,q,p = sm.tsa.acf(resid, qstat=True) 222 | data = np.c_[range(1,41), r[1:], q, p] 223 | table = pandas.DataFrame(data, columns=['lag', "AC", "Q", "Prob(>Q)"]) 224 | print table.set_index('lag') 225 | 226 | # 227 | 228 | # Exercise: How good of in-sample prediction can you do for another series, say, CPI 229 | 230 | # 231 | 232 | macrodta = sm.datasets.macrodata.load_pandas().data 233 | macrodta.index = pandas.Index(sm.tsa.datetools.dates_from_range('1959Q1', '2009Q3')) 234 | cpi = macrodta["cpi"] 235 | 236 | # 237 | 238 | # Hint: 239 | 240 | # 241 | 242 | fig = plt.figure(figsize=(12,8)) 243 | ax = fig.add_subplot(111) 244 | ax = cpi.plot(ax=ax); 245 | ax.legend(); 246 | 247 | # 248 | 249 | # P-value of the unit-root test, resoundly rejects the null of no unit-root. 
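# (The p-value printed just below is for the CPI *level*.) One hedged way to start the
# exercise -- a sketch, not the only reasonable answer: the level is clearly non-stationary,
# so work with the (log-)differenced series, check the unit-root test again, and then fit an
# ARMA model to the differenced data.

cpi_growth = np.log(cpi).diff().dropna() * 400   # roughly the annualized quarterly inflation rate
print sm.tsa.adfuller(cpi_growth)[1]
arma_cpi = sm.tsa.ARMA(cpi_growth, (1, 1)).fit()
print arma_cpi.params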
250 | 251 | # 252 | 253 | print sm.tsa.adfuller(cpi)[1] 254 | 255 | -------------------------------------------------------------------------------- /tsa_filters.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "tsa_filters" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 2, 13 | "metadata": {}, 14 | "source": [ 15 | "Filtering Time Series Data" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "collapsed": false, 21 | "input": [ 22 | "import pandas\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "\n", 25 | "import statsmodels.api as sm" 26 | ], 27 | "language": "python", 28 | "metadata": {}, 29 | "outputs": [] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "collapsed": false, 34 | "input": [ 35 | "dta = sm.datasets.macrodata.load_pandas().data" 36 | ], 37 | "language": "python", 38 | "metadata": {}, 39 | "outputs": [] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "collapsed": false, 44 | "input": [ 45 | "index = pandas.Index(sm.tsa.datetools.dates_from_range('1959Q1', '2009Q3'))\n", 46 | "print index" 47 | ], 48 | "language": "python", 49 | "metadata": {}, 50 | "outputs": [] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "collapsed": false, 55 | "input": [ 56 | "dta.index = index\n", 57 | "del dta['year']\n", 58 | "del dta['quarter']" 59 | ], 60 | "language": "python", 61 | "metadata": {}, 62 | "outputs": [] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "collapsed": false, 67 | "input": [ 68 | "print sm.datasets.macrodata.NOTE" 69 | ], 70 | "language": "python", 71 | "metadata": {}, 72 | "outputs": [] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "collapsed": false, 77 | "input": [ 78 | "print dta.head(10)" 79 | ], 80 | "language": "python", 81 | "metadata": {}, 82 | "outputs": [] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "collapsed": false, 87 | "input": [ 88 | "fig = plt.figure(figsize=(12,8))\n", 89 | "ax = fig.add_subplot(111)\n", 90 | "dta.realgdp.plot(ax=ax);\n", 91 | "legend = ax.legend(loc = 'upper left');\n", 92 | "legend.prop.set_size(20);" 93 | ], 94 | "language": "python", 95 | "metadata": {}, 96 | "outputs": [] 97 | }, 98 | { 99 | "cell_type": "heading", 100 | "level": 3, 101 | "metadata": {}, 102 | "source": [ 103 | "Hodrick-Prescott Filter" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "The Hodrick-Prescott filter separates a time-series $y_t$ into a trend $\\tau_t$ and a cyclical component $\\zeta_t$ \n", 111 | "\n", 112 | "$$y_t = \\tau_t + \\zeta_t$$\n", 113 | "\n", 114 | "The components are determined by minimizing the following quadratic loss function\n", 115 | "\n", 116 | "$$\\min_{\\\\{ \\tau_{t}\\\\} }\\sum_{t}^{T}\\zeta_{t}^{2}+\\lambda\\sum_{t=1}^{T}\\left[\\left(\\tau_{t}-\\tau_{t-1}\\right)-\\left(\\tau_{t-1}-\\tau_{t-2}\\right)\\right]^{2}$$" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "collapsed": false, 122 | "input": [ 123 | "gdp_cycle, gdp_trend = sm.tsa.filters.hpfilter(dta.realgdp)" 124 | ], 125 | "language": "python", 126 | "metadata": {}, 127 | "outputs": [] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "collapsed": false, 132 | "input": [ 133 | "gdp_decomp = dta[['realgdp']]\n", 134 | "gdp_decomp[\"cycle\"] = gdp_cycle\n", 135 | "gdp_decomp[\"trend\"] = gdp_trend" 136 | ], 137 | "language": "python", 138 | "metadata": {}, 139 | "outputs": [] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "collapsed": 
false, 144 | "input": [ 145 | "fig = plt.figure(figsize=(12,8))\n", 146 | "ax = fig.add_subplot(111)\n", 147 | "gdp_decomp[[\"realgdp\", \"trend\"]][\"2000-03-31\":].plot(ax=ax, fontsize=16);\n", 148 | "legend = ax.get_legend()\n", 149 | "legend.prop.set_size(20);" 150 | ], 151 | "language": "python", 152 | "metadata": {}, 153 | "outputs": [] 154 | }, 155 | { 156 | "cell_type": "heading", 157 | "level": 3, 158 | "metadata": {}, 159 | "source": [ 160 | "Baxter-King approximate band-pass filter: Inflation and Unemployment" 161 | ] 162 | }, 163 | { 164 | "cell_type": "heading", 165 | "level": 4, 166 | "metadata": {}, 167 | "source": [ 168 | "Explore the hypothesis that inflation and unemployment are counter-cyclical." 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "The Baxter-King filter is intended to explictly deal with the periodicty of the business cycle. By applying their band-pass filter to a series, they produce a new series that does not contain fluctuations at higher or lower than those of the business cycle. Specifically, the BK filter takes the form of a symmetric moving average \n", 176 | "\n", 177 | "$$y_{t}^{*}=\\sum_{k=-K}^{k=K}a_ky_{t-k}$$\n", 178 | "\n", 179 | "where $a_{-k}=a_k$ and $\\sum_{k=-k}^{K}a_k=0$ to eliminate any trend in the series and render it stationary if the series is I(1) or I(2).\n", 180 | "\n", 181 | "For completeness, the filter weights are determined as follows\n", 182 | "\n", 183 | "$$a_{j} = B_{j}+\\theta\\text{ for }j=0,\\pm1,\\pm2,\\dots,\\pm K$$\n", 184 | "\n", 185 | "$$B_{0} = \\frac{\\left(\\omega_{2}-\\omega_{1}\\right)}{\\pi}$$\n", 186 | "$$B_{j} = \\frac{1}{\\pi j}\\left(\\sin\\left(\\omega_{2}j\\right)-\\sin\\left(\\omega_{1}j\\right)\\right)\\text{ for }j=0,\\pm1,\\pm2,\\dots,\\pm K$$\n", 187 | "\n", 188 | "where $\\theta$ is a normalizing constant such that the weights sum to zero.\n", 189 | "\n", 190 | "$$\\theta=\\frac{-\\sum_{j=-K^{K}b_{j}}}{2K+1}$$\n", 191 | "\n", 192 | "$$\\omega_{1}=\\frac{2\\pi}{P_{H}}$$\n", 193 | "\n", 194 | "$$\\omega_{2}=\\frac{2\\pi}{P_{L}}$$\n", 195 | "\n", 196 | "$P_L$ and $P_H$ are the periodicity of the low and high cut-off frequencies. Following Burns and Mitchell's work on US business cycles which suggests cycles last from 1.5 to 8 years, we use $P_L=6$ and $P_H=32$ by default." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "collapsed": false, 202 | "input": [ 203 | "bk_cycles = sm.tsa.filters.bkfilter(dta[[\"infl\",\"unemp\"]])" 204 | ], 205 | "language": "python", 206 | "metadata": {}, 207 | "outputs": [] 208 | }, 209 | { 210 | "cell_type": "raw", 211 | "metadata": {}, 212 | "source": [ 213 | "* We lose K observations on both ends. It is suggested to use K=12 for quarterly data." 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "collapsed": false, 219 | "input": [ 220 | "fig = plt.figure(figsize=(14,10))\n", 221 | "ax = fig.add_subplot(111)\n", 222 | "bk_cycles.plot(ax=ax, style=['r--', 'b-']);" 223 | ], 224 | "language": "python", 225 | "metadata": {}, 226 | "outputs": [] 227 | }, 228 | { 229 | "cell_type": "heading", 230 | "level": 3, 231 | "metadata": {}, 232 | "source": [ 233 | "Christiano-Fitzgerald approximate band-pass filter: Inflation and Unemployment" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "The Christiano-Fitzgerald filter is a generalization of BK and can thus also be seen as weighted moving average. 
However, the CF filter is asymmetric about $t$ as well as using the entire series. The implementation of their filter involves the\n", 241 | "calculations of the weights in\n", 242 | "\n", 243 | "$$y_{t}^{*}=B_{0}y_{t}+B_{1}y_{t+1}+\\dots+B_{T-1-t}y_{T-1}+\\tilde B_{T-t}y_{T}+B_{1}y_{t-1}+\\dots+B_{t-2}y_{2}+\\tilde B_{t-1}y_{1}$$\n", 244 | "\n", 245 | "for $t=3,4,...,T-2$, where\n", 246 | "\n", 247 | "$$B_{j} = \\frac{\\sin(jb)-\\sin(ja)}{\\pi j},j\\geq1$$\n", 248 | "\n", 249 | "$$B_{0} = \\frac{b-a}{\\pi},a=\\frac{2\\pi}{P_{u}},b=\\frac{2\\pi}{P_{L}}$$\n", 250 | "\n", 251 | "$\\tilde B_{T-t}$ and $\\tilde B_{t-1}$ are linear functions of the $B_{j}$'s, and the values for $t=1,2,T-1,$ and $T$ are also calculated in much the same way. $P_{U}$ and $P_{L}$ are as described above with the same interpretation." 252 | ] 253 | }, 254 | { 255 | "cell_type": "raw", 256 | "metadata": {}, 257 | "source": [ 258 | "The CF filter is appropriate for series that may follow a random walk." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "collapsed": false, 264 | "input": [ 265 | "print sm.tsa.stattools.adfuller(dta['unemp'])[:3]" 266 | ], 267 | "language": "python", 268 | "metadata": {}, 269 | "outputs": [] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "collapsed": false, 274 | "input": [ 275 | "print sm.tsa.stattools.adfuller(dta['infl'])[:3]" 276 | ], 277 | "language": "python", 278 | "metadata": {}, 279 | "outputs": [] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "collapsed": false, 284 | "input": [ 285 | "cf_cycles, cf_trend = sm.tsa.filters.cffilter(dta[[\"infl\",\"unemp\"]])\n", 286 | "print cf_cycles.head(10)" 287 | ], 288 | "language": "python", 289 | "metadata": {}, 290 | "outputs": [] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "collapsed": false, 295 | "input": [ 296 | "fig = plt.figure(figsize=(14,10))\n", 297 | "ax = fig.add_subplot(111)\n", 298 | "cf_cycles.plot(ax=ax, style=['r--','b-']);" 299 | ], 300 | "language": "python", 301 | "metadata": {}, 302 | "outputs": [] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "Filtering assumes *a priori* that business cycles exist. Due to this assumption, many macroeconomic models seek to create models that match the shape of impulse response functions rather than replicating properties of filtered series. See VAR notebook." 
309 | ] 310 | } 311 | ], 312 | "metadata": {} 313 | } 314 | ] 315 | } -------------------------------------------------------------------------------- /tsa_filters.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # Filtering Time Series Data 7 | 8 | # 9 | 10 | import pandas 11 | import matplotlib.pyplot as plt 12 | 13 | import statsmodels.api as sm 14 | 15 | # 16 | 17 | dta = sm.datasets.macrodata.load_pandas().data 18 | 19 | # 20 | 21 | index = pandas.Index(sm.tsa.datetools.dates_from_range('1959Q1', '2009Q3')) 22 | print index 23 | 24 | # 25 | 26 | dta.index = index 27 | del dta['year'] 28 | del dta['quarter'] 29 | 30 | # 31 | 32 | print sm.datasets.macrodata.NOTE 33 | 34 | # 35 | 36 | print dta.head(10) 37 | 38 | # 39 | 40 | fig = plt.figure(figsize=(12,8)) 41 | ax = fig.add_subplot(111) 42 | dta.realgdp.plot(ax=ax); 43 | legend = ax.legend(loc = 'upper left'); 44 | legend.prop.set_size(20); 45 | 46 | # 47 | 48 | # Hodrick-Prescott Filter 49 | 50 | # 51 | 52 | # The Hodrick-Prescott filter separates a time-series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$ 53 | # 54 | # $$y_t = \tau_t + \zeta_t$$ 55 | # 56 | # The components are determined by minimizing the following quadratic loss function 57 | # 58 | # $$\min_{\\{ \tau_{t}\\} }\sum_{t}^{T}\zeta_{t}^{2}+\lambda\sum_{t=1}^{T}\left[\left(\tau_{t}-\tau_{t-1}\right)-\left(\tau_{t-1}-\tau_{t-2}\right)\right]^{2}$$ 59 | 60 | # 61 | 62 | gdp_cycle, gdp_trend = sm.tsa.filters.hpfilter(dta.realgdp) 63 | 64 | # 65 | 66 | gdp_decomp = dta[['realgdp']] 67 | gdp_decomp["cycle"] = gdp_cycle 68 | gdp_decomp["trend"] = gdp_trend 69 | 70 | # 71 | 72 | fig = plt.figure(figsize=(12,8)) 73 | ax = fig.add_subplot(111) 74 | gdp_decomp[["realgdp", "trend"]]["2000-03-31":].plot(ax=ax, fontsize=16); 75 | legend = ax.get_legend() 76 | legend.prop.set_size(20); 77 | 78 | # 79 | 80 | # Baxter-King approximate band-pass filter: Inflation and Unemployment 81 | 82 | # 83 | 84 | # Explore the hypothesis that inflation and unemployment are counter-cyclical. 85 | 86 | # 87 | 88 | # The Baxter-King filter is intended to explictly deal with the periodicty of the business cycle. By applying their band-pass filter to a series, they produce a new series that does not contain fluctuations at higher or lower than those of the business cycle. Specifically, the BK filter takes the form of a symmetric moving average 89 | # 90 | # $$y_{t}^{*}=\sum_{k=-K}^{k=K}a_ky_{t-k}$$ 91 | # 92 | # where $a_{-k}=a_k$ and $\sum_{k=-k}^{K}a_k=0$ to eliminate any trend in the series and render it stationary if the series is I(1) or I(2). 93 | # 94 | # For completeness, the filter weights are determined as follows 95 | # 96 | # $$a_{j} = B_{j}+\theta\text{ for }j=0,\pm1,\pm2,\dots,\pm K$$ 97 | # 98 | # $$B_{0} = \frac{\left(\omega_{2}-\omega_{1}\right)}{\pi}$$ 99 | # $$B_{j} = \frac{1}{\pi j}\left(\sin\left(\omega_{2}j\right)-\sin\left(\omega_{1}j\right)\right)\text{ for }j=0,\pm1,\pm2,\dots,\pm K$$ 100 | # 101 | # where $\theta$ is a normalizing constant such that the weights sum to zero. 102 | # 103 | # $$\theta=\frac{-\sum_{j=-K^{K}b_{j}}}{2K+1}$$ 104 | # 105 | # $$\omega_{1}=\frac{2\pi}{P_{H}}$$ 106 | # 107 | # $$\omega_{2}=\frac{2\pi}{P_{L}}$$ 108 | # 109 | # $P_L$ and $P_H$ are the periodicity of the low and high cut-off frequencies. Following Burns and Mitchell's work on US business cycles which suggests cycles last from 1.5 to 8 years, we use $P_L=6$ and $P_H=32$ by default. 
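# To make the weight formulas above concrete, here is a direct computation of the a_j for the
# default quarterly settings (P_L=6, P_H=32, truncation K=12). This is only an illustrative
# sketch of the algebra; sm.tsa.filters.bkfilter performs the equivalent computation internally.

import numpy as np   # not otherwise imported in this script

K, P_L, P_H = 12, 6, 32
w1, w2 = 2 * np.pi / P_H, 2 * np.pi / P_L
j = np.arange(1, K + 1)
B = np.r_[(w2 - w1) / np.pi, (np.sin(w2 * j) - np.sin(w1 * j)) / (np.pi * j)]
a = np.r_[B[:0:-1], B]        # symmetric weights: a_{-k} = a_k
a = a - a.mean()              # the theta adjustment, so the weights sum to zero
print a.sum(), len(a)         # ~0 and 2*K + 1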
110 | 111 | # 112 | 113 | bk_cycles = sm.tsa.filters.bkfilter(dta[["infl","unemp"]]) 114 | 115 | # 116 | 117 | # * We lose K observations on both ends. It is suggested to use K=12 for quarterly data. 118 | 119 | # 120 | 121 | fig = plt.figure(figsize=(14,10)) 122 | ax = fig.add_subplot(111) 123 | bk_cycles.plot(ax=ax, style=['r--', 'b-']); 124 | 125 | # 126 | 127 | # Christiano-Fitzgerald approximate band-pass filter: Inflation and Unemployment 128 | 129 | # 130 | 131 | # The Christiano-Fitzgerald filter is a generalization of BK and can thus also be seen as a weighted moving average. However, the CF filter is asymmetric about $t$ and uses the entire series. The implementation of their filter involves the 132 | # calculation of the weights in 133 | # 134 | # $$y_{t}^{*}=B_{0}y_{t}+B_{1}y_{t+1}+\dots+B_{T-1-t}y_{T-1}+\tilde B_{T-t}y_{T}+B_{1}y_{t-1}+\dots+B_{t-2}y_{2}+\tilde B_{t-1}y_{1}$$ 135 | # 136 | # for $t=3,4,...,T-2$, where 137 | # 138 | # $$B_{j} = \frac{\sin(jb)-\sin(ja)}{\pi j},j\geq1$$ 139 | # 140 | # $$B_{0} = \frac{b-a}{\pi},a=\frac{2\pi}{P_{U}},b=\frac{2\pi}{P_{L}}$$ 141 | # 142 | # $\tilde B_{T-t}$ and $\tilde B_{t-1}$ are linear functions of the $B_{j}$'s, and the values for $t=1,2,T-1,$ and $T$ are also calculated in much the same way. $P_{U}$ and $P_{L}$ are as described above with the same interpretation. 143 | 144 | # 145 | 146 | # The CF filter is appropriate for series that may follow a random walk. 147 | 148 | # 149 | 150 | print sm.tsa.stattools.adfuller(dta['unemp'])[:3] 151 | 152 | # 153 | 154 | print sm.tsa.stattools.adfuller(dta['infl'])[:3] 155 | 156 | # 157 | 158 | cf_cycles, cf_trend = sm.tsa.filters.cffilter(dta[["infl","unemp"]]) 159 | print cf_cycles.head(10) 160 | 161 | # 162 | 163 | fig = plt.figure(figsize=(14,10)) 164 | ax = fig.add_subplot(111) 165 | cf_cycles.plot(ax=ax, style=['r--','b-']); 166 | 167 | # 168 | 169 | # Filtering assumes *a priori* that business cycles exist. Due to this assumption, many macroeconomists seek to build models that match the shape of impulse response functions rather than to replicate the properties of filtered series. See the VAR notebook. 170 | 171 | -------------------------------------------------------------------------------- /tsa_var.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # Vector Autoregressions: inflation-unemployment-interest rate 7 | 8 | # 9 | 10 | # Vector Autoregression (VAR), introduced by Nobel laureate Christopher Sims in 1980, is a powerful statistical tool in the macroeconomist's toolkit. 11 | 12 | # 13 | 14 | # Formally, a VAR model of order $p$ is 15 | # 16 | # $$Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t$$ 17 | # 18 | # $$u_t \sim {\sf Normal}(0, \Sigma_u)$$ 19 | # 20 | # where $Y_t$ is a $K$-dimensional vector and each $A_i$ is a $K \times K$ coefficient matrix.
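# Assumed imports, matching tsa_filters.py above; the code below uses np, pandas,
# and sm (matplotlib is pulled in for the plots).

import numpy as np
import pandas
import matplotlib.pyplot as plt

import statsmodels.api as sm

# In the fitted results below, the $A_i$ matrices are exposed as var_mod.coefs and
# $\Sigma_u$ as var_mod.sigma_u.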
21 | 22 | # 23 | 24 | dta = sm.datasets.macrodata.load_pandas().data 25 | endog = dta[["infl", "unemp", "tbilrate"]] 26 | 27 | # 28 | 29 | index = sm.tsa.datetools.dates_from_range('1959Q1', '2009Q3') 30 | dta.index = pandas.Index(index) 31 | del dta['year'] 32 | del dta['quarter'] 33 | endog.index = pandas.Index(index) # DatetimeIndex or PeriodIndex in 0.8.0 34 | print endog.head(10) 35 | 36 | # 37 | 38 | endog.plot(subplots=True, figsize=(14,18)); 39 | 40 | # 41 | 42 | # model only after Volcker appointment 43 | var_mod = sm.tsa.VAR(endog.ix['1979-12-31':]).fit(maxlags=4, ic=None) 44 | print var_mod.summary() 45 | 46 | # 47 | 48 | # Diagnostics 49 | 50 | # 51 | 52 | np.abs(var_mod.roots) 53 | 54 | # 55 | 56 | # var_mod.test_normality() and var_mod.test_whiteness() are also available. There are problems with this model... 57 | 58 | # 59 | 60 | # Granger-Causality tests 61 | 62 | # 63 | 64 | var_mod.test_causality('unemp', 'tbilrate', kind='Wald') 65 | 66 | # 67 | 68 | table = pandas.DataFrame(np.zeros((9,3)), columns=['chi2', 'df', 'prob(>chi2)']) 69 | index = [] 70 | variables = set(endog.columns.tolist()) 71 | i = 0 72 | for vari in variables: 73 | others = [] 74 | for j,ex_vari in enumerate(variables): 75 | if vari == ex_vari: # don't want to test this 76 | continue 77 | others.append(ex_vari) 78 | res = var_mod.test_causality(vari, ex_vari, kind='Wald', verbose=False) 79 | table.ix[[i], ['chi2', 'df', 'prob(>chi2)']] = (res['statistic'], res['df'], res['pvalue']) 80 | i += 1 81 | index.append([vari, ex_vari]) 82 | res = var_mod.test_causality(vari, others, kind='Wald', verbose=False) 83 | table.ix[[i], ['chi2', 'df', 'prob(>chi2)']] = res['statistic'], res['df'], res['pvalue'] 84 | index.append([vari, 'ALL']) 85 | i += 1 86 | table.index = pandas.MultiIndex.from_tuples(index, names=['Equation', 'Excluded']) 87 | 88 | # 89 | 90 | print table 91 | 92 | # 93 | 94 | # From this we reject the null hypothesis of no Granger causality in all cases except for infl -> tbilrate. In other words, in almost all cases we can reject the null hypothesis that the lags of the *excluded* variable are jointly zero in *Equation*. 95 | 96 | # 97 | 98 | # Order Selection 99 | 100 | # 101 | 102 | var_mod.model.select_order() 103 | 104 | # 105 | 106 | # Impulse Response Functions 107 | 108 | # 109 | 110 | # Suppose we want to examine what happens to each of the variables when a 1 unit increase in the current value of one of the VAR errors occurs (a "shock"). To isolate the effect of only one error while holding the others constant, we need the model to be in a form such that the contemporaneous errors are uncorrelated across equations. One way to achieve this is the so-called recursive VAR. In the recursive VAR, the order of the variables is determined by how the econometrician views the economic processes as occurring. Given the ordering used here (infl, unemp, tbilrate), inflation is a function of only past values of itself, unemployment, and the T-bill rate; unemployment is additionally affected by contemporaneous inflation; and the T-bill rate is additionally affected by contemporaneous inflation and unemployment. 111 | # 112 | # We achieve such a structure by using the Cholesky decomposition of the residual covariance matrix (a short sketch after the IRF plots below makes this explicit). 113 | 114 | # 115 | 116 | irf = var_mod.irf(24) 117 | 118 | # 119 | 120 | irf.plot(orth=True, signif=.33, subplot_params = {'fontsize' : 18}) 121 | 122 | # 123 | 124 | # Note that inflation dynamics are not very persistent, but do appear to have a significant and immediate impact on interest rates and on unemployment in the medium run.
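# A minimal sketch (not part of the original script) of the recursive identification
# described above: the orthogonalized shocks come from the Cholesky factor of the
# estimated residual covariance matrix, so the column order of endog determines the
# recursive structure.

P = np.linalg.cholesky(var_mod.sigma_u)   # lower triangular: infl -> unemp -> tbilrate
print P
# A one-unit orthogonalized shock to the k-th variable moves the reduced-form errors
# by the k-th column of P; irf.plot(orth=True) traces these shocks through the VAR.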
125 | 126 | # 127 | 128 | # Forecast Error Variance Decompositions 129 | 130 | # 131 | 132 | var_mod.fevd(24).summary() 133 | 134 | # 135 | 136 | var_mod.fevd(24).plot(figsize=(12,12)) 137 | 138 | # 139 | 140 | # There is some amount of interaction between the variables. For instance, at the 12-quarter horizon, 40% of the error in the forecast of the T-bill rate is attributed to the inflation and unemployment shocks in the recursive VAR. 141 | 142 | # 143 | 144 | # To make structural inferences (e.g., the effect on inflation and unemployment of an unexpected 100 basis point increase in the Federal Funds rate, proxied here by the T-bill rate), we might want to fit a structural VAR model based on the economic theory of monetary policy. For instance, we might replace the VAR equation for the T-bill rate with a policy equation such as a Taylor rule and restrict the coefficients. You can do so with the sm.tsa.SVAR class. 145 | 146 | # 147 | 148 | # Exercises 149 | 150 | # 151 | 152 | # Experiment with different VAR models. You can try to adjust the number of lags in the VAR model calculated above or the ordering of the variables and see how it affects the model. 153 | 154 | # 155 | 156 | # You might also try adding variables to the VAR, say the *M1* measure of the money supply, or estimating a different model using measures of consumption (realcons), government spending (realgovt), or GDP (realgdp). 157 | 158 | -------------------------------------------------------------------------------- /whats_coming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "whats_coming" 4 | }, 5 | "nbformat": 3, 6 | "nbformat_minor": 0, 7 | "worksheets": [ 8 | { 9 | "cells": [ 10 | { 11 | "cell_type": "heading", 12 | "level": 3, 13 | "metadata": {}, 14 | "source": [ 15 | "Google Summer of Code" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "Multivariate KDE and nonparametric regression\n", 23 | "\n", 24 | "Systems of Equations Models\n", 25 | "\n", 26 | "Empirical Likelihood estimators\n", 27 | "\n", 28 | "Robust Non-Linear Estimators" 29 | ] 30 | }, 31 | { 32 | "cell_type": "heading", 33 | "level": 3, 34 | "metadata": {}, 35 | "source": [ 36 | "Formulas" 37 | ] 38 | }, 39 | { 40 | "cell_type": "raw", 41 | "metadata": {}, 42 | "source": [ 43 | "Continued integration with the formula code. API changes. Feedback welcome!" 44 | ] 45 | }, 46 | { 47 | "cell_type": "heading", 48 | "level": 3, 49 | "metadata": {}, 50 | "source": [ 51 | "Seasonal Data" 52 | ] 53 | }, 54 | { 55 | "cell_type": "raw", 56 | "metadata": {}, 57 | "source": [ 58 | "Multiplicative and Additive SARIMA models, seasonal filtering methods" 59 | ] 60 | }, 61 | { 62 | "cell_type": "heading", 63 | "level": 3, 64 | "metadata": {}, 65 | "source": [ 66 | "Time Series Models" 67 | ] 68 | }, 69 | { 70 | "cell_type": "raw", 71 | "metadata": {}, 72 | "source": [ 73 | "GARCH(1,1), VECM, Bayesian VAR, Fast Kalman filtering and non-linear state-space methods" 74 | ] 75 | }, 76 | { 77 | "cell_type": "heading", 78 | "level": 3, 79 | "metadata": {}, 80 | "source": [ 81 | "What else?" 82 | ] 83 | }, 84 | { 85 | "cell_type": "raw", 86 | "metadata": {}, 87 | "source": [ 88 | "Generalized Additive Models, Mixed Effects Models, Panel Data Models, Censored and Truncated Regression Models (Tobit, Heckman), Instrumental Variables, parallel Monte Carlo..."
89 | ] 90 | } 91 | ], 92 | "metadata": {} 93 | } 94 | ] 95 | } -------------------------------------------------------------------------------- /whats_coming.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3.0 3 | 4 | # 5 | 6 | # Google Summer of Code 7 | 8 | # 9 | 10 | # Multivariate KDE and nonparametric regression 11 | # 12 | # Systems of Equations Models 13 | # 14 | # Empirical Likelihood estimators 15 | # 16 | # Robust Non-Linear Estimators 17 | 18 | # 19 | 20 | # Formulas 21 | 22 | # 23 | 24 | # Continued integration with the formula code. API changes. Feedback welcome! 25 | 26 | # 27 | 28 | # Seasonal Data 29 | 30 | # 31 | 32 | # Multiplicative and Additive SARIMA models, seasonal filtering methods 33 | 34 | # 35 | 36 | # Time Series Models 37 | 38 | # 39 | 40 | # GARCH(1,1), VECM, Bayesian VAR, Fast Kalman filtering and non-linear state-space methods 41 | 42 | # 43 | 44 | # What else? 45 | 46 | # 47 | 48 | # Generalized Additive Models, Mixed Effects Models, Panel Data Models, Censored and Truncated Regression Models (Tobit, Heckman), Instrumental Variables, parallel Monte Carlo... 49 | 50 | --------------------------------------------------------------------------------