├── contrasts.ipynb
├── contrasts.py
├── discrete_choice.ipynb
├── discrete_choice.py
├── generic_mle.ipynb
├── generic_mle.py
├── kernel_density.ipynb
├── kernel_density.py
├── linear_models.ipynb
├── linear_models.py
├── preliminaries.ipynb
├── preliminaries.py
├── rmagic_extension.ipynb
├── rmagic_extension.py
├── robust_models.ipynb
├── robust_models.py
├── salary.table
├── star_diagram.png
├── tsa_arma.ipynb
├── tsa_arma.py
├── tsa_filters.ipynb
├── tsa_filters.py
├── tsa_var.ipynb
├── tsa_var.py
├── whats_coming.ipynb
└── whats_coming.py
/contrasts.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "contrasts"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 3,
13 | "metadata": {},
14 | "source": [
15 | "Contrasts Overview"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "collapsed": false,
21 | "input": [
22 | "import statsmodels.api as sm"
23 | ],
24 | "language": "python",
25 | "metadata": {},
26 | "outputs": []
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "This document is based heavily on this excellent resource from UCLA http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm"
33 | ]
34 | },
35 | {
36 | "cell_type": "raw",
37 | "metadata": {},
38 | "source": [
39 | "A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.\n",
40 | "\n",
41 | "In fact, the dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of *contrasts* for a categorical variable with `k` levels is a set of `k-1` functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. The dummy coding isn't wrong *per se*. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients such as in ANOVA. Linear regression models do not assume independence of the coefficients and thus dummy coding is often the only coding that is taught in this context.\n",
42 | "\n",
43 | "To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data."
44 | ]
45 | },
46 | {
47 | "cell_type": "heading",
48 | "level": 4,
49 | "metadata": {},
50 | "source": [
51 | "Example Data"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "collapsed": false,
57 | "input": [
58 | "import pandas\n",
59 | "url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'\n",
60 | "hsb2 = pandas.read_table(url, delimiter=\",\")"
61 | ],
62 | "language": "python",
63 | "metadata": {},
64 | "outputs": []
65 | },
66 | {
67 | "cell_type": "code",
68 | "collapsed": false,
69 | "input": [
70 | "hsb2.head(10)"
71 | ],
72 | "language": "python",
73 | "metadata": {},
74 | "outputs": []
75 | },
76 | {
77 | "cell_type": "raw",
78 | "metadata": {},
79 | "source": [
80 | "It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, 4 = Caucasian)."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "collapsed": false,
86 | "input": [
87 | "hsb2.groupby('race')['write'].mean()"
88 | ],
89 | "language": "python",
90 | "metadata": {},
91 | "outputs": []
92 | },
93 | {
94 | "cell_type": "heading",
95 | "level": 4,
96 | "metadata": {},
97 | "source": [
98 | "Treatment (Dummy) Coding"
99 | ]
100 | },
101 | {
102 | "cell_type": "raw",
103 | "metadata": {},
104 | "source": [
105 | "Dummy coding is likely the most well-known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "collapsed": false,
111 | "input": [
112 | "from patsy.contrasts import Treatment\n",
113 | "levels = [1,2,3,4]\n",
114 | "contrast = Treatment(reference=0).code_without_intercept(levels)\n",
115 | "print contrast.matrix"
116 | ],
117 | "language": "python",
118 | "metadata": {},
119 | "outputs": []
120 | },
121 | {
122 | "cell_type": "raw",
123 | "metadata": {},
124 | "source": [
125 | "Here we used `reference=0`, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the `race` variable."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "collapsed": false,
131 | "input": [
132 | "hsb2.race.head(10)"
133 | ],
134 | "language": "python",
135 | "metadata": {},
136 | "outputs": []
137 | },
138 | {
139 | "cell_type": "code",
140 | "collapsed": false,
141 | "input": [
142 | "print contrast.matrix[hsb2.race-1, :][:20]"
143 | ],
144 | "language": "python",
145 | "metadata": {},
146 | "outputs": []
147 | },
148 | {
149 | "cell_type": "code",
150 | "collapsed": false,
151 | "input": [
152 | "sm.categorical(hsb2.race.values)"
153 | ],
154 | "language": "python",
155 | "metadata": {},
156 | "outputs": []
157 | },
158 | {
159 | "cell_type": "raw",
160 | "metadata": {},
161 | "source": [
162 | "This is a bit of a trick, as the `race` category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so this manual indexing won't work in general, but it is nonetheless a useful exercise to fix ideas. The regression below illustrates the output using the three contrasts above"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "collapsed": false,
168 | "input": [
169 | "from statsmodels.formula.api import ols\n",
170 | "mod = ols(\"write ~ C(race, Treatment)\", data=hsb2)\n",
171 | "res = mod.fit()\n",
172 | "print res.summary()"
173 | ],
174 | "language": "python",
175 | "metadata": {},
176 | "outputs": []
177 | },
178 | {
179 | "cell_type": "raw",
180 | "metadata": {},
181 | "source": [
182 | "We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted this."
183 | ]
184 | },
185 | {
186 | "cell_type": "heading",
187 | "level": 3,
188 | "metadata": {},
189 | "source": [
190 | "Simple Coding"
191 | ]
192 | },
193 | {
194 | "cell_type": "raw",
195 | "metadata": {},
196 | "source": [
197 | "Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factor. Patsy doesn't include the Simple contrast, but you can easily define your own contrasts. To do so, write a class that contains code_with_intercept and code_without_intercept methods, each returning a patsy.contrasts.ContrastMatrix instance"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "collapsed": false,
203 | "input": [
204 | "from patsy.contrasts import ContrastMatrix\n",
"import numpy as np\n",
205 | "\n",
206 | "def _name_levels(prefix, levels):\n",
207 | " return [\"[%s%s]\" % (prefix, level) for level in levels]\n",
208 | "\n",
209 | "class Simple(object):\n",
210 | " def _simple_contrast(self, levels):\n",
211 | " nlevels = len(levels)\n",
212 | " contr = -1./nlevels * np.ones((nlevels, nlevels-1))\n",
213 | " contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels\n",
214 | " return contr\n",
215 | "\n",
216 | " def code_with_intercept(self, levels):\n",
217 | " contrast = np.column_stack((np.ones(len(levels)),\n",
218 | " self._simple_contrast(levels)))\n",
219 | " return ContrastMatrix(contrast, _name_levels(\"Simp.\", levels))\n",
220 | "\n",
221 | " def code_without_intercept(self, levels):\n",
222 | " contrast = self._simple_contrast(levels)\n",
223 | " return ContrastMatrix(contrast, _name_levels(\"Simp.\", levels[:-1]))"
224 | ],
225 | "language": "python",
226 | "metadata": {},
227 | "outputs": []
228 | },
229 | {
230 | "cell_type": "code",
231 | "collapsed": false,
232 | "input": [
233 | "hsb2.groupby('race')['write'].mean().mean()"
234 | ],
235 | "language": "python",
236 | "metadata": {},
237 | "outputs": []
238 | },
239 | {
240 | "cell_type": "code",
241 | "collapsed": false,
242 | "input": [
243 | "contrast = Simple().code_without_intercept(levels)\n",
244 | "print contrast.matrix"
245 | ],
246 | "language": "python",
247 | "metadata": {},
248 | "outputs": []
249 | },
250 | {
251 | "cell_type": "code",
252 | "collapsed": false,
253 | "input": [
254 | "mod = ols(\"write ~ C(race, Simple)\", data=hsb2)\n",
255 | "res = mod.fit()\n",
256 | "print res.summary()"
257 | ],
258 | "language": "python",
259 | "metadata": {},
260 | "outputs": []
261 | },
262 | {
263 | "cell_type": "heading",
264 | "level": 3,
265 | "metadata": {},
266 | "source": [
267 | "Sum (Deviation) Coding"
268 | ]
269 | },
270 | {
271 | "cell_type": "raw",
272 | "metadata": {},
273 | "source": [
274 | "Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others."
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "collapsed": false,
280 | "input": [
281 | "from patsy.contrasts import Sum\n",
282 | "contrast = Sum().code_without_intercept(levels)\n",
283 | "print contrast.matrix"
284 | ],
285 | "language": "python",
286 | "metadata": {},
287 | "outputs": []
288 | },
289 | {
290 | "cell_type": "code",
291 | "collapsed": false,
292 | "input": [
293 | "mod = ols(\"write ~ C(race, Sum)\", data=hsb2)\n",
294 | "res = mod.fit()\n",
295 | "print res.summary()"
296 | ],
297 | "language": "python",
298 | "metadata": {},
299 | "outputs": []
300 | },
301 | {
302 | "cell_type": "raw",
303 | "metadata": {},
304 | "source": [
305 | "This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, i.e., the mean of the level means of the dependent variable."
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "collapsed": false,
311 | "input": [
312 | "hsb2.groupby('race')['write'].mean().mean()"
313 | ],
314 | "language": "python",
315 | "metadata": {},
316 | "outputs": []
317 | },
318 | {
319 | "cell_type": "heading",
320 | "level": 3,
321 | "metadata": {},
322 | "source": [
323 | "Backward Difference Coding"
324 | ]
325 | },
326 | {
327 | "cell_type": "raw",
328 | "metadata": {},
329 | "source": [
330 | "In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable."
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "collapsed": false,
336 | "input": [
337 | "from patsy.contrasts import Diff\n",
338 | "contrast = Diff().code_without_intercept(levels)\n",
339 | "print contrast.matrix"
340 | ],
341 | "language": "python",
342 | "metadata": {},
343 | "outputs": []
344 | },
345 | {
346 | "cell_type": "code",
347 | "collapsed": false,
348 | "input": [
349 | "mod = ols(\"write ~ C(race, Diff)\", data=hsb2)\n",
350 | "res = mod.fit()\n",
351 | "print res.summary()"
352 | ],
353 | "language": "python",
354 | "metadata": {},
355 | "outputs": []
356 | },
357 | {
358 | "cell_type": "raw",
359 | "metadata": {},
360 | "source": [
361 | "For example, here the coefficient on level 1 is the mean of `write` at level 2 compared with the mean at level 1, i.e.,"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "collapsed": false,
367 | "input": [
368 | "res.params[\"C(race, Diff)[D.1]\"]\n",
369 | "hsb2.groupby('race').mean()[\"write\"][2] - \\\n",
370 | " hsb2.groupby('race').mean()[\"write\"][1]"
371 | ],
372 | "language": "python",
373 | "metadata": {},
374 | "outputs": []
375 | },
376 | {
377 | "cell_type": "heading",
378 | "level": 3,
379 | "metadata": {},
380 | "source": [
381 | "Helmert Coding"
382 | ]
383 | },
384 | {
385 | "cell_type": "raw",
386 | "metadata": {},
387 | "source": [
388 | "Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels; hence the name 'reverse' is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so:"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "collapsed": false,
394 | "input": [
395 | "from patsy.contrasts import Helmert\n",
396 | "contrast = Helmert().code_without_intercept(levels)\n",
397 | "print contrast.matrix"
398 | ],
399 | "language": "python",
400 | "metadata": {},
401 | "outputs": []
402 | },
403 | {
404 | "cell_type": "code",
405 | "collapsed": false,
406 | "input": [
407 | "mod = ols(\"write ~ C(race, Helmert)\", data=hsb2)\n",
408 | "res = mod.fit()\n",
409 | "print res.summary()"
410 | ],
411 | "language": "python",
412 | "metadata": {},
413 | "outputs": []
414 | },
415 | {
416 | "cell_type": "raw",
417 | "metadata": {},
418 | "source": [
419 | "To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels subtracted from the mean at level 4:"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "collapsed": false,
425 | "input": [
426 | "grouped = hsb2.groupby('race')\n",
427 | "grouped.mean()[\"write\"][4] - grouped.mean()[\"write\"][:3].mean()"
428 | ],
429 | "language": "python",
430 | "metadata": {},
431 | "outputs": []
432 | },
433 | {
434 | "cell_type": "raw",
435 | "metadata": {},
436 | "source": [
437 | "As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same."
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "collapsed": false,
443 | "input": [
444 | "k = 4\n",
445 | "1./k * (grouped.mean()[\"write\"][k] - grouped.mean()[\"write\"][:k-1].mean())\n",
446 | "k = 3\n",
447 | "1./k * (grouped.mean()[\"write\"][k] - grouped.mean()[\"write\"][:k-1].mean())"
448 | ],
449 | "language": "python",
450 | "metadata": {},
451 | "outputs": []
452 | },
453 | {
454 | "cell_type": "heading",
455 | "level": 3,
456 | "metadata": {},
457 | "source": [
458 | "Orthogonal Polynomial Coding"
459 | ]
460 | },
461 | {
462 | "cell_type": "raw",
463 | "metadata": {},
464 | "source": [
465 | "The coefficients taken on by polynomial coding for `k=4` levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order `k-1`. Since `race` is not an ordered factor variable let's use `read` as an example. First we need to create an ordered categorical from `read`."
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "collapsed": false,
471 | "input": [
472 | "hsb2['readcat'] = pandas.cut(hsb2.read, bins=3)\n",
473 | "hsb2.groupby('readcat').mean()['write']"
474 | ],
475 | "language": "python",
476 | "metadata": {},
477 | "outputs": []
478 | },
479 | {
480 | "cell_type": "code",
481 | "collapsed": false,
482 | "input": [
483 | "from patsy.contrasts import Poly\n",
484 | "levels = hsb2.readcat.unique().tolist()\n",
485 | "contrast = Poly().code_without_intercept(levels)\n",
486 | "print contrast.matrix"
487 | ],
488 | "language": "python",
489 | "metadata": {},
490 | "outputs": []
491 | },
492 | {
493 | "cell_type": "code",
494 | "collapsed": false,
495 | "input": [
496 | "mod = ols(\"write ~ C(readcat, Poly)\", data=hsb2)\n",
497 | "res = mod.fit()\n",
498 | "print res.summary()"
499 | ],
500 | "language": "python",
501 | "metadata": {},
502 | "outputs": []
503 | },
504 | {
505 | "cell_type": "raw",
506 | "metadata": {},
507 | "source": [
508 | "As you can see, `readcat` has a significant linear effect on the dependent variable `write` but not a significant quadratic effect."
509 | ]
510 | }
511 | ],
512 | "metadata": {}
513 | }
514 | ]
515 | }
--------------------------------------------------------------------------------
/contrasts.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Contrasts Overview
7 |
8 | #
9 |
10 | import statsmodels.api as sm
11 |
12 | #
13 |
14 | # This document is based heavily on this excellent resource from UCLA http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
15 |
16 | #
17 |
18 | # A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.
19 | #
20 | # In fact, the dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of *contrasts* for a categorical variable with `k` levels is a set of `k-1` functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. The dummy coding isn't wrong *per se*. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients such as in ANOVA. Linear regression models do not assume independence of the coefficients and thus dummy coding is often the only coding that is taught in this context.
21 | #
22 | # To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data.
23 |
24 | #
25 |
26 | # Example Data
27 |
28 | #
29 |
30 | import pandas
31 | url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'
32 | hsb2 = pandas.read_table(url, delimiter=",")
33 |
34 | #
35 |
36 | hsb2.head(10)
37 |
38 | #
39 |
40 | # It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, 4 = Caucasian).
41 |
42 | #
43 |
44 | hsb2.groupby('race')['write'].mean()
45 |
46 | #
47 |
48 | # Treatment (Dummy) Coding
49 |
50 | #
51 |
52 | # Dummy coding is likely the most well-known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be
53 |
54 | #
55 |
56 | from patsy.contrasts import Treatment
57 | levels = [1,2,3,4]
58 | contrast = Treatment(reference=0).code_without_intercept(levels)
59 | print contrast.matrix
60 |
61 | #
62 |
63 | # Here we used `reference=0`, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the `race` variable.
64 |
65 | #
66 |
67 | hsb2.race.head(10)
68 |
69 | #
70 |
71 | print contrast.matrix[hsb2.race-1, :][:20]
72 |
73 | #
74 |
75 | sm.categorical(hsb2.race.values)
76 |
77 | #
78 |
79 | # This is a bit of a trick, as the `race` category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so this manual indexing won't work in general, but it is nonetheless a useful exercise to fix ideas. The regression below illustrates the output using the three contrasts above
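The zero-based-index trick can be made concrete with a small standalone sketch (the level codes and observations below are hypothetical, not the hsb2 values): when the level codes are not already 0, 1, 2, ..., each observation first has to be mapped to the row of its level before indexing into the contrast matrix.

```python
import numpy as np

# Hypothetical, non-contiguous level codes -- `codes - 1` would not work here.
levels = np.array([10, 20, 30, 40])
obs = np.array([30, 10, 40, 40, 20])        # hypothetical observations

# Map each observation to the row of its level (levels must be sorted).
rows = np.searchsorted(levels, obs)         # -> [2, 0, 3, 3, 1]

# 4x3 treatment contrast matrix: the reference level occupies row 0.
contrast = np.vstack([np.zeros(3), np.eye(3)])
encoded = contrast[rows]                    # the design columns for `obs`
```

The same row lookup is what patsy performs internally when it encodes a categorical factor.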
80 |
81 | #
82 |
83 | from statsmodels.formula.api import ols
84 | mod = ols("write ~ C(race, Treatment)", data=hsb2)
85 | res = mod.fit()
86 | print res.summary()
87 |
88 | #
89 |
90 | # We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted this.
91 |
92 | #
93 |
94 | # Simple Coding
95 |
96 | #
97 |
98 | # Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factor. Patsy doesn't include the Simple contrast, but you can easily define your own contrasts. To do so, write a class that contains code_with_intercept and code_without_intercept methods, each returning a patsy.contrasts.ContrastMatrix instance
99 |
100 | #
101 |
102 | from patsy.contrasts import ContrastMatrix
import numpy as np
103 |
104 | def _name_levels(prefix, levels):
105 | return ["[%s%s]" % (prefix, level) for level in levels]
106 |
107 | class Simple(object):
108 | def _simple_contrast(self, levels):
109 | nlevels = len(levels)
110 | contr = -1./nlevels * np.ones((nlevels, nlevels-1))
111 | contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels
112 | return contr
113 |
114 | def code_with_intercept(self, levels):
115 | contrast = np.column_stack((np.ones(len(levels)),
116 | self._simple_contrast(levels)))
117 | return ContrastMatrix(contrast, _name_levels("Simp.", levels))
118 |
119 | def code_without_intercept(self, levels):
120 | contrast = self._simple_contrast(levels)
121 | return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
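As a quick sanity check of the `_simple_contrast` construction, rebuilt here with plain numpy so it runs standalone: each column of the Simple contrast sums to zero, which is what makes it functionally independent of the intercept.

```python
import numpy as np

nlevels = 4
# Same construction as _simple_contrast above, for k = 4 levels.
contr = -1.0 / nlevels * np.ones((nlevels, nlevels - 1))
contr[1:][np.diag_indices(nlevels - 1)] = (nlevels - 1.0) / nlevels

col_sums = contr.sum(axis=0)   # should be all (numerically) zero
```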
122 |
123 | #
124 |
125 | hsb2.groupby('race')['write'].mean().mean()
126 |
127 | #
128 |
129 | contrast = Simple().code_without_intercept(levels)
130 | print contrast.matrix
131 |
132 | #
133 |
134 | mod = ols("write ~ C(race, Simple)", data=hsb2)
135 | res = mod.fit()
136 | print res.summary()
137 |
138 | #
139 |
140 | # Sum (Deviation) Coding
141 |
142 | #
143 |
144 | # Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.
145 |
146 | #
147 |
148 | from patsy.contrasts import Sum
149 | contrast = Sum().code_without_intercept(levels)
150 | print contrast.matrix
151 |
152 | #
153 |
154 | mod = ols("write ~ C(race, Sum)", data=hsb2)
155 | res = mod.fit()
156 | print res.summary()
157 |
158 | #
159 |
160 | # This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, i.e., the mean of the level means of the dependent variable.
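The sum-to-zero constraint can be illustrated without the data (the level means below are made up, not the hsb2 values): the intercept is the mean of the level means, the reported effects are deviations from it, and the omitted level's effect is recovered as the negative sum of the others.

```python
import numpy as np

level_means = np.array([46.0, 58.0, 48.0, 54.0])  # hypothetical means of `write`
grand_mean = level_means.mean()                   # the Sum-coding intercept
effects = level_means - grand_mean                # deviation of each level

# Only the first k-1 effects appear in the summary; the last one is implied:
implied_last = -effects[:-1].sum()
```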
161 |
162 | #
163 |
164 | hsb2.groupby('race')['write'].mean().mean()
165 |
166 | #
167 |
168 | # Backward Difference Coding
169 |
170 | #
171 |
172 | # In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
173 |
174 | #
175 |
176 | from patsy.contrasts import Diff
177 | contrast = Diff().code_without_intercept(levels)
178 | print contrast.matrix
179 |
180 | #
181 |
182 | mod = ols("write ~ C(race, Diff)", data=hsb2)
183 | res = mod.fit()
184 | print res.summary()
185 |
186 | #
187 |
188 | # For example, here the coefficient on level 1 is the mean of `write` at level 2 compared with the mean at level 1, i.e.,
189 |
190 | #
191 |
192 | res.params["C(race, Diff)[D.1]"]
193 | hsb2.groupby('race').mean()["write"][2] - \
194 | hsb2.groupby('race').mean()["write"][1]
195 |
196 | #
197 |
198 | # Helmert Coding
199 |
200 | #
201 |
202 | # Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels; hence the name 'reverse' is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so:
203 |
204 | #
205 |
206 | from patsy.contrasts import Helmert
207 | contrast = Helmert().code_without_intercept(levels)
208 | print contrast.matrix
209 |
210 | #
211 |
212 | mod = ols("write ~ C(race, Helmert)", data=hsb2)
213 | res = mod.fit()
214 | print res.summary()
215 |
216 | #
217 |
218 | # To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels subtracted from the mean at level 4:
219 |
220 | #
221 |
222 | grouped = hsb2.groupby('race')
223 | grouped.mean()["write"][4] - grouped.mean()["write"][:3].mean()
224 |
225 | #
226 |
227 | # As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same.
228 |
229 | #
230 |
231 | k = 4
232 | 1./k * (grouped.mean()["write"][k] - grouped.mean()["write"][:k-1].mean())
233 | k = 3
234 | 1./k * (grouped.mean()["write"][k] - grouped.mean()["write"][:k-1].mean())
235 |
236 | #
237 |
238 | # Orthogonal Polynomial Coding
239 |
240 | #
241 |
242 | # The coefficients taken on by polynomial coding for `k=4` levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order `k-1`. Since `race` is not an ordered factor variable let's use `read` as an example. First we need to create an ordered categorical from `read`.
243 |
244 | #
245 |
246 | hsb2['readcat'] = pandas.cut(hsb2.read, bins=3)
247 | hsb2.groupby('readcat').mean()['write']
248 |
249 | #
250 |
251 | from patsy.contrasts import Poly
252 | levels = hsb2.readcat.unique().tolist()
253 | contrast = Poly().code_without_intercept(levels)
254 | print contrast.matrix
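What makes these columns "orthogonal" can be checked directly. A numpy-only sketch for k = 3 equally spaced levels (patsy's `Poly` builds equivalent columns, up to sign): the normalized linear and quadratic trend columns are orthonormal, and each sums to zero, so they are independent of the intercept.

```python
import numpy as np

# k = 3 equally spaced levels: normalized linear and quadratic trends.
linear = np.array([-1.0, 0.0, 1.0]) / np.sqrt(2.0)
quadratic = np.array([1.0, -2.0, 1.0]) / np.sqrt(6.0)
contrast = np.column_stack([linear, quadratic])

gram = contrast.T.dot(contrast)   # identity => columns are orthonormal
col_sums = contrast.sum(axis=0)   # zeros => independent of the intercept
```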
255 |
256 | #
257 |
258 | mod = ols("write ~ C(readcat, Poly)", data=hsb2)
259 | res = mod.fit()
260 | print res.summary()
261 |
262 | #
263 |
264 | # As you can see, `readcat` has a significant linear effect on the dependent variable `write` but not a significant quadratic effect.
265 |
266 |
--------------------------------------------------------------------------------
/discrete_choice.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "discrete_choice"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 2,
13 | "metadata": {},
14 | "source": [
15 | "Discrete Choice Models - Fair's Affair data"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "In 1974, *Redbook* conducted a survey of women asking about extramarital affairs."
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "collapsed": false,
28 | "input": [
29 | "import numpy as np\n",
30 | "from scipy import stats\n",
31 | "import matplotlib.pyplot as plt\n",
32 | "import statsmodels.api as sm\n",
33 | "from statsmodels.formula.api import logit, probit, poisson, ols"
34 | ],
35 | "language": "python",
36 | "metadata": {},
37 | "outputs": []
38 | },
39 | {
40 | "cell_type": "code",
41 | "collapsed": false,
42 | "input": [
43 | "print sm.datasets.fair.SOURCE"
44 | ],
45 | "language": "python",
46 | "metadata": {},
47 | "outputs": []
48 | },
49 | {
50 | "cell_type": "code",
51 | "collapsed": false,
52 | "input": [
53 | "print sm.datasets.fair.NOTE"
54 | ],
55 | "language": "python",
56 | "metadata": {},
57 | "outputs": []
58 | },
59 | {
60 | "cell_type": "code",
61 | "collapsed": false,
62 | "input": [
63 | "dta = sm.datasets.fair.load_pandas().data"
64 | ],
65 | "language": "python",
66 | "metadata": {},
67 | "outputs": []
68 | },
69 | {
70 | "cell_type": "code",
71 | "collapsed": false,
72 | "input": [
73 | "dta['affair'] = (dta['affairs'] > 0).astype(float)\n",
74 | "print dta.head(10)"
75 | ],
76 | "language": "python",
77 | "metadata": {},
78 | "outputs": []
79 | },
80 | {
81 | "cell_type": "code",
82 | "collapsed": false,
83 | "input": [
84 | "print dta.describe()"
85 | ],
86 | "language": "python",
87 | "metadata": {},
88 | "outputs": []
89 | },
90 | {
91 | "cell_type": "code",
92 | "collapsed": false,
93 | "input": [
94 | "affair_mod = logit(\"affair ~ occupation + educ + occupation_husb\" \n",
95 | " \"+ rate_marriage + age + yrs_married + children\"\n",
96 | " \" + religious\", dta).fit()"
97 | ],
98 | "language": "python",
99 | "metadata": {},
100 | "outputs": []
101 | },
102 | {
103 | "cell_type": "code",
104 | "collapsed": false,
105 | "input": [
106 | "print affair_mod.summary()"
107 | ],
108 | "language": "python",
109 | "metadata": {},
110 | "outputs": []
111 | },
112 | {
113 | "cell_type": "raw",
114 | "metadata": {},
115 | "source": [
116 | "How well are we predicting?"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "collapsed": false,
122 | "input": [
123 | "affair_mod.pred_table()"
124 | ],
125 | "language": "python",
126 | "metadata": {},
127 | "outputs": []
128 | },
129 | {
130 | "cell_type": "raw",
131 | "metadata": {},
132 | "source": [
133 | "The coefficients of the discrete choice model do not tell us much. What we're after is marginal effects."
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "collapsed": false,
139 | "input": [
140 | "mfx = affair_mod.get_margeff()\n",
141 | "print mfx.summary()"
142 | ],
143 | "language": "python",
144 | "metadata": {},
145 | "outputs": []
146 | },
147 | {
148 | "cell_type": "code",
149 | "collapsed": false,
150 | "input": [
151 | "respondent1000 = dta.ix[1000]\n",
152 | "print respondent1000"
153 | ],
154 | "language": "python",
155 | "metadata": {},
156 | "outputs": []
157 | },
158 | {
159 | "cell_type": "code",
160 | "collapsed": false,
161 | "input": [
162 | "resp = dict(zip(range(1,9), respondent1000[[\"occupation\", \"educ\", \n",
163 | " \"occupation_husb\", \"rate_marriage\", \n",
164 | " \"age\", \"yrs_married\", \"children\", \n",
165 | " \"religious\"]].tolist()))\n",
166 | "resp.update({0 : 1})\n",
167 | "print resp"
168 | ],
169 | "language": "python",
170 | "metadata": {},
171 | "outputs": []
172 | },
173 | {
174 | "cell_type": "code",
175 | "collapsed": false,
176 | "input": [
177 | "mfx = affair_mod.get_margeff(atexog=resp)\n",
178 | "print mfx.summary()"
179 | ],
180 | "language": "python",
181 | "metadata": {},
182 | "outputs": []
183 | },
184 | {
185 | "cell_type": "code",
186 | "collapsed": false,
187 | "input": [
188 | "affair_mod.predict(respondent1000)"
189 | ],
190 | "language": "python",
191 | "metadata": {},
192 | "outputs": []
193 | },
194 | {
195 | "cell_type": "code",
196 | "collapsed": false,
197 | "input": [
198 | "affair_mod.fittedvalues[1000]"
199 | ],
200 | "language": "python",
201 | "metadata": {},
202 | "outputs": []
203 | },
204 | {
205 | "cell_type": "code",
206 | "collapsed": false,
207 | "input": [
208 | "affair_mod.model.cdf(affair_mod.fittedvalues[1000])"
209 | ],
210 | "language": "python",
211 | "metadata": {},
212 | "outputs": []
213 | },
214 | {
215 | "cell_type": "raw",
216 | "metadata": {},
217 | "source": [
218 | "The \"correct\" model here is likely the Tobit model. We have a work-in-progress branch \"tobit-model\" on GitHub, if anyone is interested in censored regression models."
219 | ]
220 | },
221 | {
222 | "cell_type": "heading",
223 | "level": 3,
224 | "metadata": {},
225 | "source": [
226 | "Exercise: Logit vs Probit"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "collapsed": false,
232 | "input": [
233 | "fig = plt.figure(figsize=(12,8))\n",
234 | "ax = fig.add_subplot(111)\n",
235 | "support = np.linspace(-6, 6, 1000)\n",
236 | "ax.plot(support, stats.logistic.cdf(support), 'r-', label='Logistic')\n",
237 | "ax.plot(support, stats.norm.cdf(support), label='Probit')\n",
238 | "ax.legend();"
239 | ],
240 | "language": "python",
241 | "metadata": {},
242 | "outputs": []
243 | },
244 | {
245 | "cell_type": "code",
246 | "collapsed": false,
247 | "input": [
248 | "fig = plt.figure(figsize=(12,8))\n",
249 | "ax = fig.add_subplot(111)\n",
250 | "support = np.linspace(-6, 6, 1000)\n",
251 | "ax.plot(support, stats.logistic.pdf(support), 'r-', label='Logistic')\n",
252 | "ax.plot(support, stats.norm.pdf(support), label='Probit')\n",
253 | "ax.legend();"
254 | ],
255 | "language": "python",
256 | "metadata": {},
257 | "outputs": []
258 | },
259 | {
260 | "cell_type": "raw",
261 | "metadata": {},
262 | "source": [
263 | "Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Is there much difference in the marginal effects?"
264 | ]
265 | },
266 | {
267 | "cell_type": "heading",
268 | "level": 3,
269 | "metadata": {},
270 | "source": [
271 | "Generalized Linear Model Example"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "collapsed": false,
277 | "input": [
278 | "print sm.datasets.star98.SOURCE"
279 | ],
280 | "language": "python",
281 | "metadata": {},
282 | "outputs": []
283 | },
284 | {
285 | "cell_type": "code",
286 | "collapsed": false,
287 | "input": [
288 | "print sm.datasets.star98.DESCRLONG"
289 | ],
290 | "language": "python",
291 | "metadata": {},
292 | "outputs": []
293 | },
294 | {
295 | "cell_type": "code",
296 | "collapsed": false,
297 | "input": [
298 | "print sm.datasets.star98.NOTE"
299 | ],
300 | "language": "python",
301 | "metadata": {},
302 | "outputs": []
303 | },
304 | {
305 | "cell_type": "code",
306 | "collapsed": false,
307 | "input": [
308 | "dta = sm.datasets.star98.load_pandas().data\n",
309 | "print dta.columns"
310 | ],
311 | "language": "python",
312 | "metadata": {},
313 | "outputs": []
314 | },
315 | {
316 | "cell_type": "code",
317 | "collapsed": false,
318 | "input": [
319 | "print dta[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP', 'PERMINTE']].head(10)"
320 | ],
321 | "language": "python",
322 | "metadata": {},
323 | "outputs": []
324 | },
325 | {
326 | "cell_type": "code",
327 | "collapsed": false,
328 | "input": [
329 | "print dta[['AVYRSEXP', 'AVSALK', 'PERSPENK', 'PTRATIO', 'PCTAF', 'PCTCHRT', 'PCTYRRND']].head(10)"
330 | ],
331 | "language": "python",
332 | "metadata": {},
333 | "outputs": []
334 | },
335 | {
336 | "cell_type": "code",
337 | "collapsed": false,
338 | "input": [
339 | "formula = 'NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT '\n",
340 | "formula += '+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'"
341 | ],
342 | "language": "python",
343 | "metadata": {},
344 | "outputs": []
345 | },
346 | {
347 | "cell_type": "heading",
348 | "level": 4,
349 | "metadata": {},
350 | "source": [
351 | "Aside: Binomial distribution"
352 | ]
353 | },
354 | {
355 | "cell_type": "raw",
356 | "metadata": {},
357 | "source": [
358 | "Toss a six-sided die 5 times. What's the probability of exactly 2 fours?"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "collapsed": false,
364 | "input": [
365 | "stats.binom(5, 1./6).pmf(2)"
366 | ],
367 | "language": "python",
368 | "metadata": {},
369 | "outputs": []
370 | },
371 | {
372 | "cell_type": "code",
373 | "collapsed": false,
374 | "input": [
375 | "from scipy.misc import comb\n",
376 | "comb(5,2) * (1/6.)**2 * (5/6.)**3"
377 | ],
378 | "language": "python",
379 | "metadata": {},
380 | "outputs": []
381 | },
382 | {
383 | "cell_type": "code",
384 | "collapsed": false,
385 | "input": [
386 | "from statsmodels.formula.api import glm\n",
387 | "glm_mod = glm(formula, dta, family=sm.families.Binomial()).fit()"
388 | ],
389 | "language": "python",
390 | "metadata": {},
391 | "outputs": []
392 | },
393 | {
394 | "cell_type": "code",
395 | "collapsed": false,
396 | "input": [
397 | "print glm_mod.summary()"
398 | ],
399 | "language": "python",
400 | "metadata": {},
401 | "outputs": []
402 | },
403 | {
404 | "cell_type": "raw",
405 | "metadata": {},
406 | "source": [
407 | "The number of trials for each observation is the row sum of the endogenous variables:"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "collapsed": false,
413 | "input": [
414 | "glm_mod.model.data.orig_endog.sum(1)"
415 | ],
416 | "language": "python",
417 | "metadata": {},
418 | "outputs": []
419 | },
420 | {
421 | "cell_type": "code",
422 | "collapsed": false,
423 | "input": [
424 | "glm_mod.fittedvalues * glm_mod.model.data.orig_endog.sum(1)"
425 | ],
426 | "language": "python",
427 | "metadata": {},
428 | "outputs": []
429 | },
430 | {
431 | "cell_type": "raw",
432 | "metadata": {},
433 | "source": [
434 | "First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact\n",
435 | "on the response variables:"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "collapsed": false,
441 | "input": [
442 | "exog = glm_mod.model.data.orig_exog # get the dataframe"
443 | ],
444 | "language": "python",
445 | "metadata": {},
446 | "outputs": []
447 | },
448 | {
449 | "cell_type": "code",
450 | "collapsed": false,
451 | "input": [
452 | "means25 = exog.mean()\n",
453 | "print means25"
454 | ],
455 | "language": "python",
456 | "metadata": {},
457 | "outputs": []
458 | },
459 | {
460 | "cell_type": "code",
461 | "collapsed": false,
462 | "input": [
463 | "means25['LOWINC'] = exog['LOWINC'].quantile(.25)\n",
464 | "print means25"
465 | ],
466 | "language": "python",
467 | "metadata": {},
468 | "outputs": []
469 | },
470 | {
471 | "cell_type": "code",
472 | "collapsed": false,
473 | "input": [
474 | "means75 = exog.mean()\n",
475 | "means75['LOWINC'] = exog['LOWINC'].quantile(.75)\n",
476 | "print means75"
477 | ],
478 | "language": "python",
479 | "metadata": {},
480 | "outputs": []
481 | },
482 | {
483 | "cell_type": "code",
484 | "collapsed": false,
485 | "input": [
486 | "resp25 = glm_mod.predict(means25)\n",
487 | "resp75 = glm_mod.predict(means75)\n",
488 | "diff = resp75 - resp25"
489 | ],
490 | "language": "python",
491 | "metadata": {},
492 | "outputs": []
493 | },
494 | {
495 | "cell_type": "raw",
496 | "metadata": {},
497 | "source": [
498 | "The interquartile first difference for the percentage of low income households in a school district is:"
499 | ]
500 | },
501 | {
502 | "cell_type": "code",
503 | "collapsed": false,
504 | "input": [
505 | "print \"%2.4f%%\" % (diff[0]*100)"
506 | ],
507 | "language": "python",
508 | "metadata": {},
509 | "outputs": []
510 | },
511 | {
512 | "cell_type": "code",
513 | "collapsed": false,
514 | "input": [
515 | "nobs = glm_mod.nobs\n",
516 | "y = glm_mod.model.endog\n",
517 | "yhat = glm_mod.mu"
518 | ],
519 | "language": "python",
520 | "metadata": {},
521 | "outputs": []
522 | },
523 | {
524 | "cell_type": "code",
525 | "collapsed": false,
526 | "input": [
527 | "from statsmodels.graphics.api import abline_plot\n",
528 | "fig = plt.figure(figsize=(12,8))\n",
529 | "ax = fig.add_subplot(111, ylabel='Observed Values', xlabel='Fitted Values')\n",
530 | "ax.scatter(yhat, y)\n",
531 | "y_vs_yhat = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()\n",
532 | "fig = abline_plot(model_results=y_vs_yhat, ax=ax)"
533 | ],
534 | "language": "python",
535 | "metadata": {},
536 | "outputs": []
537 | },
538 | {
539 | "cell_type": "heading",
540 | "level": 4,
541 | "metadata": {},
542 | "source": [
543 | "Plot fitted values vs Pearson residuals"
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 | "Pearson residuals are defined to be \n",
551 | "\n",
552 | "$$\\frac{y - \\mu}{\\sqrt{\\operatorname{var}(\\mu)}}$$\n",
553 | "\n",
554 | "where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$"
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "collapsed": false,
560 | "input": [
561 | "fig = plt.figure(figsize=(12,8))\n",
562 | "ax = fig.add_subplot(111, title='Residual Dependence Plot', xlabel='Fitted Values',\n",
563 | " ylabel='Pearson Residuals')\n",
564 | "ax.scatter(yhat, stats.zscore(glm_mod.resid_pearson))\n",
565 | "ax.axis('tight')\n",
566 | "ax.plot([0.0, 1.0],[0.0, 0.0], 'k-');"
567 | ],
568 | "language": "python",
569 | "metadata": {},
570 | "outputs": []
571 | },
572 | {
573 | "cell_type": "heading",
574 | "level": 4,
575 | "metadata": {},
576 | "source": [
577 | "Histogram of standardized deviance residuals with Kernel Density Estimate overlaid"
578 | ]
579 | },
580 | {
581 | "cell_type": "markdown",
582 | "metadata": {},
583 | "source": [
584 | "The definition of the deviance residuals depends on the family. For the Binomial distribution this is \n",
585 | "\n",
586 | "$$r_{dev} = \\operatorname{sign}(Y-\\mu)\\sqrt{2n\\left(Y\\log\\frac{Y}{\\mu}+(1-Y)\\log\\frac{1-Y}{1-\\mu}\\right)}$$\n",
587 | "\n",
588 | "They can be used to detect ill-fitting covariates."
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "collapsed": false,
594 | "input": [
595 | "resid = glm_mod.resid_deviance\n",
596 | "resid_std = stats.zscore(resid) \n",
597 | "kde_resid = sm.nonparametric.KDEUnivariate(resid_std)\n",
598 | "kde_resid.fit()"
599 | ],
600 | "language": "python",
601 | "metadata": {},
602 | "outputs": []
603 | },
604 | {
605 | "cell_type": "code",
606 | "collapsed": false,
607 | "input": [
608 | "fig = plt.figure(figsize=(12,8))\n",
609 | "ax = fig.add_subplot(111, title=\"Standardized Deviance Residuals\")\n",
610 | "ax.hist(resid_std, bins=25, normed=True);\n",
611 | "ax.plot(kde_resid.support, kde_resid.density, 'r');"
612 | ],
613 | "language": "python",
614 | "metadata": {},
615 | "outputs": []
616 | },
617 | {
618 | "cell_type": "heading",
619 | "level": 4,
620 | "metadata": {},
621 | "source": [
622 | "QQ-plot of deviance residuals"
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "collapsed": false,
628 | "input": [
629 | "fig = plt.figure(figsize=(12,8))\n",
630 | "ax = fig.add_subplot(111)\n",
631 | "fig = sm.graphics.qqplot(resid, line='r', ax=ax)"
632 | ],
633 | "language": "python",
634 | "metadata": {},
635 | "outputs": []
636 | }
637 | ],
638 | "metadata": {}
639 | }
640 | ]
641 | }
--------------------------------------------------------------------------------
/discrete_choice.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Discrete Choice Models - Fair's Affair data
7 |
8 | #
9 |
10 | # In 1974, *Redbook* magazine conducted a survey of women, asking about extramarital affairs.
11 |
12 | #
13 |
14 | import numpy as np
15 | from scipy import stats
16 | import matplotlib.pyplot as plt
17 | import statsmodels.api as sm
18 | from statsmodels.formula.api import logit, probit, poisson, ols
19 |
20 | #
21 |
22 | print sm.datasets.fair.SOURCE
23 |
24 | #
25 |
26 | print sm.datasets.fair.NOTE
27 |
28 | #
29 |
30 | dta = sm.datasets.fair.load_pandas().data
31 |
32 | #
33 |
34 | dta['affair'] = (dta['affairs'] > 0).astype(float)
35 | print dta.head(10)
36 |
37 | #
38 |
39 | print dta.describe()
40 |
41 | #
42 |
43 | affair_mod = logit("affair ~ occupation + educ + occupation_husb"
44 | "+ rate_marriage + age + yrs_married + children"
45 | " + religious", dta).fit()
46 |
47 | #
48 |
49 | print affair_mod.summary()
50 |
51 | #
52 |
53 | # How well are we predicting?
54 |
55 | #
56 |
57 | affair_mod.pred_table()
58 |
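As a quick aside on reading the table: `pred_table()` returns a 2x2 array of counts with observed outcomes on the rows and predicted outcomes on the columns, so overall accuracy is the diagonal sum over the total. A minimal sketch with made-up counts (not the actual output of the model above):

```python
import numpy as np

# Hypothetical 2x2 prediction table: rows are observed 0/1,
# columns are predicted 0/1 (illustrative counts only).
table = np.array([[3500., 500.],
                  [1200., 1166.]])

# Fraction of observations classified correctly.
accuracy = np.diag(table).sum() / table.sum()
print(accuracy)
```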
59 | #
60 |
61 | # The coefficients of the discrete choice model do not tell us much by themselves. What we're after are the marginal effects.
62 |
63 | #
64 |
65 | mfx = affair_mod.get_margeff()
66 | print mfx.summary()
67 |
68 | #
69 |
70 | respondent1000 = dta.ix[1000]
71 | print respondent1000
72 |
73 | #
74 |
75 | resp = dict(zip(range(1,9), respondent1000[["occupation", "educ",
76 | "occupation_husb", "rate_marriage",
77 | "age", "yrs_married", "children",
78 | "religious"]].tolist()))
79 | resp.update({0 : 1})
80 | print resp
81 |
82 | #
83 |
84 | mfx = affair_mod.get_margeff(atexog=resp)
85 | print mfx.summary()
86 |
87 | #
88 |
89 | affair_mod.predict(respondent1000)
90 |
91 | #
92 |
93 | affair_mod.fittedvalues[1000]
94 |
95 | #
96 |
97 | affair_mod.model.cdf(affair_mod.fittedvalues[1000])
98 |
99 | #
100 |
101 | # The "correct" model here is likely the Tobit model. We have a work-in-progress branch "tobit-model" on GitHub, if anyone is interested in censored regression models.
102 |
103 | #
104 |
105 | # Exercise: Logit vs Probit
106 |
107 | #
108 |
109 | fig = plt.figure(figsize=(12,8))
110 | ax = fig.add_subplot(111)
111 | support = np.linspace(-6, 6, 1000)
112 | ax.plot(support, stats.logistic.cdf(support), 'r-', label='Logistic')
113 | ax.plot(support, stats.norm.cdf(support), label='Probit')
114 | ax.legend();
115 |
116 | #
117 |
118 | fig = plt.figure(figsize=(12,8))
119 | ax = fig.add_subplot(111)
120 | support = np.linspace(-6, 6, 1000)
121 | ax.plot(support, stats.logistic.pdf(support), 'r-', label='Logistic')
122 | ax.plot(support, stats.norm.pdf(support), label='Probit')
123 | ax.legend();
124 |
125 | #
126 |
127 | # Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Is there much difference in the marginal effects?
128 |
129 | #
130 |
131 | # Generalized Linear Model Example
132 |
133 | #
134 |
135 | print sm.datasets.star98.SOURCE
136 |
137 | #
138 |
139 | print sm.datasets.star98.DESCRLONG
140 |
141 | #
142 |
143 | print sm.datasets.star98.NOTE
144 |
145 | #
146 |
147 | dta = sm.datasets.star98.load_pandas().data
148 | print dta.columns
149 |
150 | #
151 |
152 | print dta[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP', 'PERMINTE']].head(10)
153 |
154 | #
155 |
156 | print dta[['AVYRSEXP', 'AVSALK', 'PERSPENK', 'PTRATIO', 'PCTAF', 'PCTCHRT', 'PCTYRRND']].head(10)
157 |
158 | #
159 |
160 | formula = 'NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT '
161 | formula += '+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
162 |
163 | #
164 |
165 | # Aside: Binomial distribution
166 |
167 | #
168 |
169 | # Toss a six-sided die 5 times. What's the probability of exactly 2 fours?
170 |
171 | #
172 |
173 | stats.binom(5, 1./6).pmf(2)
174 |
175 | #
176 |
177 | from scipy.misc import comb
178 | comb(5,2) * (1/6.)**2 * (5/6.)**3
179 |
180 | #
181 |
182 | from statsmodels.formula.api import glm
183 | glm_mod = glm(formula, dta, family=sm.families.Binomial()).fit()
184 |
185 | #
186 |
187 | print glm_mod.summary()
188 |
189 | #
190 |
191 | # The number of trials for each observation is the row sum of the endogenous variables:
192 |
193 | #
194 |
195 | glm_mod.model.data.orig_endog.sum(1)
196 |
197 | #
198 |
199 | glm_mod.fittedvalues * glm_mod.model.data.orig_endog.sum(1)
200 |
201 | #
202 |
203 | # First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact
204 | # on the response variables:
205 |
206 | #
207 |
208 | exog = glm_mod.model.data.orig_exog # get the dataframe
209 |
210 | #
211 |
212 | means25 = exog.mean()
213 | print means25
214 |
215 | #
216 |
217 | means25['LOWINC'] = exog['LOWINC'].quantile(.25)
218 | print means25
219 |
220 | #
221 |
222 | means75 = exog.mean()
223 | means75['LOWINC'] = exog['LOWINC'].quantile(.75)
224 | print means75
225 |
226 | #
227 |
228 | resp25 = glm_mod.predict(means25)
229 | resp75 = glm_mod.predict(means75)
230 | diff = resp75 - resp25
231 |
232 | #
233 |
234 | # The interquartile first difference for the percentage of low income households in a school district is:
235 |
236 | #
237 |
238 | print "%2.4f%%" % (diff[0]*100)
239 |
240 | #
241 |
242 | nobs = glm_mod.nobs
243 | y = glm_mod.model.endog
244 | yhat = glm_mod.mu
245 |
246 | #
247 |
248 | from statsmodels.graphics.api import abline_plot
249 | fig = plt.figure(figsize=(12,8))
250 | ax = fig.add_subplot(111, ylabel='Observed Values', xlabel='Fitted Values')
251 | ax.scatter(yhat, y)
252 | y_vs_yhat = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()
253 | fig = abline_plot(model_results=y_vs_yhat, ax=ax)
254 |
255 | #
256 |
257 | # Plot fitted values vs Pearson residuals
258 |
259 | #
260 |
261 | # Pearson residuals are defined to be
262 | #
263 | # $$\frac{y - \mu}{\sqrt{\operatorname{var}(\mu)}}$$
264 | #
265 | # where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$
266 |
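To make the formula concrete, here is a hand computation of Pearson residuals for binomial counts, using made-up numbers rather than the model's data:

```python
import numpy as np

# Made-up data: y successes out of n trials, with fitted probabilities p.
n = np.array([40., 35., 50., 45., 30.])
p = np.array([0.3, 0.5, 0.6, 0.4, 0.7])
y = np.array([14., 17., 27., 19., 24.])

mu = n * p             # fitted mean count
var = n * p * (1 - p)  # binomial variance n*p*(1 - p)
resid_pearson = (y - mu) / np.sqrt(var)
print(resid_pearson)
```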
267 | #
268 |
269 | fig = plt.figure(figsize=(12,8))
270 | ax = fig.add_subplot(111, title='Residual Dependence Plot', xlabel='Fitted Values',
271 | ylabel='Pearson Residuals')
272 | ax.scatter(yhat, stats.zscore(glm_mod.resid_pearson))
273 | ax.axis('tight')
274 | ax.plot([0.0, 1.0],[0.0, 0.0], 'k-');
275 |
276 | #
277 |
278 | # Histogram of standardized deviance residuals with Kernel Density Estimate overlaid
279 |
280 | #
281 |
282 | # The definition of the deviance residuals depends on the family. For the Binomial distribution this is
283 | #
284 | # $$r_{dev} = \operatorname{sign}(Y-\mu)\sqrt{2n\left(Y\log\frac{Y}{\mu}+(1-Y)\log\frac{1-Y}{1-\mu}\right)}$$
285 | #
286 | # They can be used to detect ill-fitting covariates.
287 |
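A hand computation of the binomial deviance residual formula, again with made-up numbers rather than the model's data (`Y` is the observed proportion of successes out of `n` trials, `mu` the fitted probability):

```python
import numpy as np

# Made-up data, for illustration only.
n = np.array([40., 35., 50., 45., 30.])
mu = np.array([0.30, 0.50, 0.60, 0.40, 0.70])
Y = np.array([0.35, 0.49, 0.54, 0.42, 0.80])

# Deviance residual: sign(Y - mu) times the square root of
# 2n[Y log(Y/mu) + (1 - Y) log((1 - Y)/(1 - mu))].
dev = 2 * n * (Y * np.log(Y / mu) + (1 - Y) * np.log((1 - Y) / (1 - mu)))
resid_dev = np.sign(Y - mu) * np.sqrt(dev)
print(resid_dev)
```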
288 | #
289 |
290 | resid = glm_mod.resid_deviance
291 | resid_std = stats.zscore(resid)
292 | kde_resid = sm.nonparametric.KDEUnivariate(resid_std)
293 | kde_resid.fit()
294 |
295 | #
296 |
297 | fig = plt.figure(figsize=(12,8))
298 | ax = fig.add_subplot(111, title="Standardized Deviance Residuals")
299 | ax.hist(resid_std, bins=25, normed=True);
300 | ax.plot(kde_resid.support, kde_resid.density, 'r');
301 |
302 | #
303 |
304 | # QQ-plot of deviance residuals
305 |
306 | #
307 |
308 | fig = plt.figure(figsize=(12,8))
309 | ax = fig.add_subplot(111)
310 | fig = sm.graphics.qqplot(resid, line='r', ax=ax)
311 |
312 |
--------------------------------------------------------------------------------
/generic_mle.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "generic_mle"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "code",
12 | "collapsed": false,
13 | "input": [
14 | "import numpy as np\n",
15 | "from scipy import stats\n",
16 | "import statsmodels.api as sm\n",
17 | "from statsmodels.base.model import GenericLikelihoodModel"
18 | ],
19 | "language": "python",
20 | "metadata": {},
21 | "outputs": []
22 | },
23 | {
24 | "cell_type": "code",
25 | "collapsed": false,
26 | "input": [
27 | "print sm.datasets.spector.NOTE"
28 | ],
29 | "language": "python",
30 | "metadata": {},
31 | "outputs": []
32 | },
33 | {
34 | "cell_type": "code",
35 | "collapsed": false,
36 | "input": [
37 | "data = sm.datasets.spector.load_pandas()\n",
38 | "exog = sm.add_constant(data.exog, prepend=True)\n",
39 | "endog = data.endog"
40 | ],
41 | "language": "python",
42 | "metadata": {},
43 | "outputs": []
44 | },
45 | {
46 | "cell_type": "code",
47 | "collapsed": false,
48 | "input": [
49 | "sm_probit = sm.Probit(endog, exog).fit()"
50 | ],
51 | "language": "python",
52 | "metadata": {},
53 | "outputs": []
54 | },
55 | {
56 | "cell_type": "raw",
57 | "metadata": {},
58 | "source": [
59 | "* To create your own Likelihood Model, you just need to override the loglike method."
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "collapsed": false,
65 | "input": [
66 | "class MyProbit(GenericLikelihoodModel):\n",
67 | " def loglike(self, params):\n",
68 | " exog = self.exog\n",
69 | " endog = self.endog\n",
70 | " q = 2 * endog - 1\n",
71 | " return stats.norm.logcdf(q*np.dot(exog, params)).sum()"
72 | ],
73 | "language": "python",
74 | "metadata": {},
75 | "outputs": []
76 | },
77 | {
78 | "cell_type": "code",
79 | "collapsed": false,
80 | "input": [
81 | "my_probit = MyProbit(endog, exog).fit()"
82 | ],
83 | "language": "python",
84 | "metadata": {},
85 | "outputs": []
86 | },
87 | {
88 | "cell_type": "code",
89 | "collapsed": false,
90 | "input": [
91 | "print sm_probit.params"
92 | ],
93 | "language": "python",
94 | "metadata": {},
95 | "outputs": []
96 | },
97 | {
98 | "cell_type": "code",
99 | "collapsed": false,
100 | "input": [
101 | "print sm_probit.cov_params()"
102 | ],
103 | "language": "python",
104 | "metadata": {},
105 | "outputs": []
106 | },
107 | {
108 | "cell_type": "code",
109 | "collapsed": false,
110 | "input": [
111 | "print my_probit.params"
112 | ],
113 | "language": "python",
114 | "metadata": {},
115 | "outputs": []
116 | },
117 | {
118 | "cell_type": "raw",
119 | "metadata": {},
120 | "source": [
121 | "You can get the variance-covariance matrix of the parameters. Notice that we didn't have to provide Hessian or score functions."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "collapsed": false,
127 | "input": [
128 | "print my_probit.cov_params()"
129 | ],
130 | "language": "python",
131 | "metadata": {},
132 | "outputs": []
133 | }
134 | ],
135 | "metadata": {}
136 | }
137 | ]
138 | }
--------------------------------------------------------------------------------
/generic_mle.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | import numpy as np
7 | from scipy import stats
8 | import statsmodels.api as sm
9 | from statsmodels.base.model import GenericLikelihoodModel
10 |
11 | #
12 |
13 | print sm.datasets.spector.NOTE
14 |
15 | #
16 |
17 | data = sm.datasets.spector.load_pandas()
18 | exog = sm.add_constant(data.exog, prepend=True)
19 | endog = data.endog
20 |
21 | #
22 |
23 | sm_probit = sm.Probit(endog, exog).fit()
24 |
25 | #
26 |
27 | # * To create your own Likelihood Model, you just need to override the loglike method.
28 |
29 | #
30 |
31 | class MyProbit(GenericLikelihoodModel):
32 | def loglike(self, params):
33 | exog = self.exog
34 | endog = self.endog
35 | q = 2 * endog - 1
36 | return stats.norm.logcdf(q*np.dot(exog, params)).sum()
37 |
38 | #
39 |
40 | my_probit = MyProbit(endog, exog).fit()
41 |
42 | #
43 |
44 | print sm_probit.params
45 |
46 | #
47 |
48 | print sm_probit.cov_params()
49 |
50 | #
51 |
52 | print my_probit.params
53 |
54 | #
55 |
56 | # You can get the variance-covariance matrix of the parameters. Notice that we didn't have to provide Hessian or score functions.
57 |
58 | #
59 |
60 | print my_probit.cov_params()
61 |
62 |
--------------------------------------------------------------------------------
/kernel_density.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "kernel_density"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 3,
13 | "metadata": {},
14 | "source": [
15 | "Kernel Density Estimation"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "collapsed": false,
21 | "input": [
22 | "import numpy as np\n",
23 | "from scipy import stats\n",
24 | "import statsmodels.api as sm\n",
25 | "import matplotlib.pyplot as plt\n",
26 | "from statsmodels.distributions.mixture_rvs import mixture_rvs"
27 | ],
28 | "language": "python",
29 | "metadata": {},
30 | "outputs": []
31 | },
32 | {
33 | "cell_type": "heading",
34 | "level": 4,
35 | "metadata": {},
36 | "source": [
37 | "A univariate example."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "collapsed": false,
43 | "input": [
44 | "np.random.seed(12345)"
45 | ],
46 | "language": "python",
47 | "metadata": {},
48 | "outputs": []
49 | },
50 | {
51 | "cell_type": "code",
52 | "collapsed": false,
53 | "input": [
54 | "obs_dist1 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.norm],\n",
55 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))"
56 | ],
57 | "language": "python",
58 | "metadata": {},
59 | "outputs": []
60 | },
61 | {
62 | "cell_type": "code",
63 | "collapsed": false,
64 | "input": [
65 | "kde = sm.nonparametric.KDEUnivariate(obs_dist1)\n",
66 | "kde.fit()"
67 | ],
68 | "language": "python",
69 | "metadata": {},
70 | "outputs": []
71 | },
72 | {
73 | "cell_type": "code",
74 | "collapsed": false,
75 | "input": [
76 | "fig = plt.figure(figsize=(12,8))\n",
77 | "ax = fig.add_subplot(111)\n",
78 | "ax.hist(obs_dist1, bins=50, normed=True, color='red')\n",
79 | "ax.plot(kde.support, kde.density, lw=2, color='black');"
80 | ],
81 | "language": "python",
82 | "metadata": {},
83 | "outputs": []
84 | },
85 | {
86 | "cell_type": "code",
87 | "collapsed": false,
88 | "input": [
89 | "obs_dist2 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.beta],\n",
90 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=1,args=(1,.5))))\n",
91 | "\n",
92 | "kde2 = sm.nonparametric.KDEUnivariate(obs_dist2)\n",
93 | "kde2.fit()"
94 | ],
95 | "language": "python",
96 | "metadata": {},
97 | "outputs": []
98 | },
99 | {
100 | "cell_type": "code",
101 | "collapsed": false,
102 | "input": [
103 | "fig = plt.figure(figsize=(12,8))\n",
104 | "ax = fig.add_subplot(111)\n",
105 | "ax.hist(obs_dist2, bins=50, normed=True, color='red')\n",
106 | "ax.plot(kde2.support, kde2.density, lw=2, color='black');"
107 | ],
108 | "language": "python",
109 | "metadata": {},
110 | "outputs": []
111 | },
112 | {
113 | "cell_type": "raw",
114 | "metadata": {},
115 | "source": [
116 | "The fitted KDE object is a full non-parametric distribution."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "collapsed": false,
122 | "input": [
123 | "obs_dist3 = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],\n",
124 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))\n",
125 | "kde3 = sm.nonparametric.KDEUnivariate(obs_dist3)\n",
126 | "kde3.fit()"
127 | ],
128 | "language": "python",
129 | "metadata": {},
130 | "outputs": []
131 | },
132 | {
133 | "cell_type": "code",
134 | "collapsed": false,
135 | "input": [
136 | "kde3.entropy"
137 | ],
138 | "language": "python",
139 | "metadata": {},
140 | "outputs": []
141 | },
142 | {
143 | "cell_type": "code",
144 | "collapsed": false,
145 | "input": [
146 | "kde3.evaluate(-1)"
147 | ],
148 | "language": "python",
149 | "metadata": {},
150 | "outputs": []
151 | },
152 | {
153 | "cell_type": "heading",
154 | "level": 4,
155 | "metadata": {},
156 | "source": [
157 | "CDF"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "collapsed": false,
163 | "input": [
164 | "fig = plt.figure(figsize=(12,8))\n",
165 | "ax = fig.add_subplot(111)\n",
166 | "ax.plot(kde3.support, kde3.cdf);"
167 | ],
168 | "language": "python",
169 | "metadata": {},
170 | "outputs": []
171 | },
172 | {
173 | "cell_type": "heading",
174 | "level": 4,
175 | "metadata": {},
176 | "source": [
177 | "Cumulative Hazard Function"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "collapsed": false,
183 | "input": [
184 | "fig = plt.figure(figsize=(12,8))\n",
185 | "ax = fig.add_subplot(111)\n",
186 | "ax.plot(kde3.support, kde3.cumhazard);"
187 | ],
188 | "language": "python",
189 | "metadata": {},
190 | "outputs": []
191 | },
192 | {
193 | "cell_type": "heading",
194 | "level": 4,
195 | "metadata": {},
196 | "source": [
197 | "Inverse CDF"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "collapsed": false,
203 | "input": [
204 | "fig = plt.figure(figsize=(12,8))\n",
205 | "ax = fig.add_subplot(111)\n",
206 | "ax.plot(kde3.support, kde3.icdf);"
207 | ],
208 | "language": "python",
209 | "metadata": {},
210 | "outputs": []
211 | },
212 | {
213 | "cell_type": "heading",
214 | "level": 4,
215 | "metadata": {},
216 | "source": [
217 | "Survival Function"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "collapsed": false,
223 | "input": [
224 | "fig = plt.figure(figsize=(12,8))\n",
225 | "ax = fig.add_subplot(111)\n",
226 | "ax.plot(kde3.support, kde3.sf);"
227 | ],
228 | "language": "python",
229 | "metadata": {},
230 | "outputs": []
231 | }
232 | ],
233 | "metadata": {}
234 | }
235 | ]
236 | }
237 |
--------------------------------------------------------------------------------
/kernel_density.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Kernel Density Estimation
7 |
8 | #
9 |
10 | import numpy as np
11 | from scipy import stats
12 | import statsmodels.api as sm
12 | import matplotlib.pyplot as plt
13 | from statsmodels.distributions.mixture_rvs import mixture_rvs
14 |
15 | #
16 |
17 | # A univariate example.
18 |
19 | #
20 |
21 | np.random.seed(12345)
22 |
23 | #
24 |
25 | obs_dist1 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.norm],
26 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))
27 |
28 | #
29 |
30 | kde = sm.nonparametric.KDEUnivariate(obs_dist1)
31 | kde.fit()
32 |
33 | #
34 |
35 | fig = plt.figure(figsize=(12,8))
36 | ax = fig.add_subplot(111)
37 | ax.hist(obs_dist1, bins=50, normed=True, color='red')
38 | ax.plot(kde.support, kde.density, lw=2, color='black');
39 |
40 | #
41 |
42 | obs_dist2 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.beta],
43 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=1,args=(1,.5))))
44 |
45 | kde2 = sm.nonparametric.KDEUnivariate(obs_dist2)
46 | kde2.fit()
47 |
48 | #
49 |
50 | fig = plt.figure(figsize=(12,8))
51 | ax = fig.add_subplot(111)
52 | ax.hist(obs_dist2, bins=50, normed=True, color='red')
53 | ax.plot(kde2.support, kde2.density, lw=2, color='black');
54 |
55 | #
56 |
57 | # The fitted KDE object is a full non-parametric distribution.
58 |
59 | #
60 |
61 | obs_dist3 = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],
62 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))
63 | kde3 = sm.nonparametric.KDEUnivariate(obs_dist3)
64 | kde3.fit()
65 |
66 | #
67 |
68 | kde3.entropy
69 |
70 | #
71 |
72 | kde3.evaluate(-1)
73 |
74 | #
75 |
76 | # CDF
77 |
78 | #
79 |
80 | fig = plt.figure(figsize=(12,8))
81 | ax = fig.add_subplot(111)
82 | ax.plot(kde3.support, kde3.cdf);
83 |
84 | #
85 |
86 | # Cumulative Hazard Function
87 |
88 | #
89 |
90 | fig = plt.figure(figsize=(12,8))
91 | ax = fig.add_subplot(111)
92 | ax.plot(kde3.support, kde3.cumhazard);
93 |
94 | #
95 |
96 | # Inverse CDF
97 |
98 | #
99 |
100 | fig = plt.figure(figsize=(12,8))
101 | ax = fig.add_subplot(111)
102 | ax.plot(kde3.support, kde3.icdf);
103 |
104 | #
105 |
106 | # Survival Function
107 |
108 | #
109 |
110 | fig = plt.figure(figsize=(12,8))
111 | ax = fig.add_subplot(111)
112 | ax.plot(kde3.support, kde3.sf);
113 |
114 |
--------------------------------------------------------------------------------
/linear_models.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # This notebook introduces the use of pandas and the formula framework in statsmodels in the context of linear modeling.
7 |
8 | #
9 |
10 | # **It is based heavily on Jonathan Taylor's [class notes that use R](http://www.stanford.edu/class/stats191/interactions.html)**
11 |
12 | #
13 |
14 | import matplotlib.pyplot as plt
15 | import pandas
16 | import numpy as np
17 |
18 | from statsmodels.formula.api import ols
19 | from statsmodels.graphics.api import interaction_plot, abline_plot, qqplot
20 | from statsmodels.stats.api import anova_lm
21 |
22 | #
23 |
24 | # Example 1: IT salary data
25 |
26 | #
27 |
28 | # Outcome: S, salaries for IT staff in a corporation
29 | # Predictors: X, experience in years
30 | # M, management, 2 levels, 0=non-management, 1=management
31 | # E, education, 3 levels, 1=Bachelor's, 2=Master's, 3=Ph.D.
32 |
33 | #
34 |
35 | url = 'http://stats191.stanford.edu/data/salary.table'
36 | salary_table = pandas.read_table(url) # needs pandas 0.7.3
37 | salary_table.to_csv('salary.table', index=False)
38 |
39 | #
40 |
41 | print salary_table.head(10)
42 |
43 | #
44 |
45 | E = salary_table.E # Education
46 | M = salary_table.M # Management
47 | X = salary_table.X # Experience
48 | S = salary_table.S # Salary
49 |
50 | #
51 |
52 | # Let's explore the data
53 |
54 | #
55 |
56 | fig = plt.figure(figsize=(10,8))
57 | ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary',
58 | xlim=(0, 27), ylim=(9600, 28800))
59 | symbols = ['D', '^']
60 | man_label = ["Non-Mgmt", "Mgmt"]
61 | educ_label = ["Bachelors", "Masters", "PhD"]
62 | colors = ['r', 'g', 'blue']
63 | factor_groups = salary_table.groupby(['E','M'])
64 | for values, group in factor_groups:
65 | i,j = values
66 | label = "%s - %s" % (man_label[j], educ_label[i-1])
67 | ax.scatter(group['X'], group['S'], marker=symbols[j], color=colors[i-1],
68 | s=350, label=label)
69 | ax.legend(scatterpoints=1, markerscale=.7, labelspacing=1);
70 |
71 | #
72 |
73 | # Fit a linear model
74 | #
75 | # $$S_i = \beta_0 + \beta_1X_i + \beta_2E_{i2} + \beta_3E_{i3} + \beta_4M_i + \epsilon_i$$
76 | #
77 | # where
78 | #
79 | # $$ E_{i2}=\cases{1,&if $E_i=2$;\cr 0,&otherwise. \cr}$$
80 | # $$ E_{i3}=\cases{1,&if $E_i=3$;\cr 0,&otherwise. \cr}$$
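A minimal by-hand sketch of the treatment (dummy) coding that patsy builds for `C(E)` — the education values here are hypothetical, not the salary data:

```python
import numpy as np

# Treatment coding for a 3-level factor: level 1 is the base category,
# and E2, E3 are the indicator columns for levels 2 and 3.
E = np.array([1, 2, 3, 2, 1, 3])          # hypothetical education levels
E2 = (E == 2).astype(int)
E3 = (E == 3).astype(int)
design = np.column_stack([np.ones_like(E), E2, E3])  # intercept + dummies
print(design)
```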
81 |
82 | #
83 |
84 | formula = 'S ~ C(E) + C(M) + X'
85 | lm = ols(formula, salary_table).fit()
86 | print lm.summary()
87 |
88 | #
89 |
90 | # Aside: Contrasts (see contrasts notebook)
91 |
92 | #
93 |
94 | # Look at the design matrix created for us. Every results instance has a reference to the model.
95 |
96 | #
97 |
98 | lm.model.exog[:10]
99 |
100 | #
101 |
102 | # Since we initially passed in a DataFrame, we have a transformed DataFrame available.
103 |
104 | #
105 |
106 | print lm.model.data.orig_exog.head(10)
107 |
108 | #
109 |
110 | # There is a reference to the original untouched data in
111 |
112 | #
113 |
114 | print lm.model.data.frame.head(10)
115 |
116 | #
117 |
118 | # If you use the formula interface, statsmodels remembers this transformation. Say you want to know the predicted salary for someone with 12 years of experience and a Master's degree who is in a management position.
119 |
120 | #
121 |
122 | lm.predict({'X' : [12], 'M' : [1], 'E' : [2]})
123 |
124 | #
125 |
126 | # So far we've assumed that the effect of experience is the same for each level of education and professional role.
127 | # Perhaps this assumption isn't merited. We can formally test this using some interactions.
128 |
129 | #
130 |
131 | # We can start by seeing if our model assumptions are met. Let's look at a residuals plot.
132 |
133 | #
134 |
135 | # And some formal tests
136 |
137 | #
138 |
139 | # Plot the residuals within the groups separately.
140 |
141 | #
142 |
143 | resid = lm.resid
144 |
145 | #
146 |
147 | fig = plt.figure(figsize=(12,8))
148 | xticks = []
149 | ax = fig.add_subplot(111, xlabel='Group (E, M)', ylabel='Residuals')
150 | for values, group in factor_groups:
151 | i,j = values
152 | xticks.append(str((i, j)))
153 | group_num = i*2 + j - 1 # for plotting purposes
154 | x = [group_num] * len(group)
155 | ax.scatter(x, resid[group.index], marker=symbols[j], color=colors[i-1],
156 | s=144, edgecolors='black')
157 | ax.set_xticks([1,2,3,4,5,6])
158 | ax.set_xticklabels(xticks)
159 | ax.axis('tight');
160 |
161 | #
162 |
163 | # Add an interaction between education and experience, allowing a different slope in experience for each level of education.
164 | #
165 | # $$S_i = \beta_0+\beta_1X_i+\beta_2E_{i2}+\beta_3E_{i3}+\beta_4M_i+\beta_5E_{i2}X_i+\beta_6E_{i3}X_i+\epsilon_i$$
166 |
167 | #
168 |
169 | interX_lm = ols('S ~ C(E)*X + C(M)', salary_table).fit()
170 | print interX_lm.summary()
171 |
172 | #
173 |
174 | # Test that $\beta_5 = \beta_6 = 0$. We can use anova_lm or we can use an F-test.
175 |
176 | #
177 |
178 | print anova_lm(lm, interX_lm)
179 |
180 | #
181 |
182 | print interX_lm.f_test('C(E)[T.2]:X = C(E)[T.3]:X = 0')
183 |
184 | #
185 |
186 | print interX_lm.f_test([[0,0,0,0,0,1,-1],[0,0,0,0,0,0,1]])
187 |
188 | #
189 |
190 | # The contrasts are created here under the hood by patsy.
191 |
192 | #
193 |
194 | # Recall that F-tests are of the form $R\beta = q$
195 |
196 | #
197 |
198 | LC = interX_lm.model.data.orig_exog.design_info.linear_constraint('C(E)[T.2]:X = C(E)[T.3]:X = 0')
199 | print LC.coefs
200 | print LC.constants
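The F statistic itself can be built by hand. A sketch on synthetic data (an assumption here, not the salary data): the Wald form $\left(R\hat\beta - q\right)'\left[R\,\widehat{cov}(\hat\beta)\,R'\right]^{-1}\left(R\hat\beta - q\right)/J$ agrees exactly with the restricted-vs-full RSS comparison in OLS.

```python
import numpy as np

# F-test of R beta = q two ways: the Wald form and the nested-RSS form.
rng = np.random.RandomState(0)
n = 100
X = np.column_stack([np.ones(n), rng.randn(n, 2)])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.randn(n)

beta, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)
rss_full = resid[0]
sigma2 = rss_full / (n - X.shape[1])       # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)      # cov(beta_hat)

# restriction: beta_2 = 0, i.e. R = [0 0 1], q = 0, J = 1
R = np.array([[0.0, 0.0, 1.0]])
q = np.zeros(1)
diff = R @ beta - q
F_wald = float(diff @ np.linalg.solve(R @ cov @ R.T, diff)) / R.shape[0]

# restricted model: drop the last column of X
_, resid_r, _, _ = np.linalg.lstsq(X[:, :2], y, rcond=None)
F_rss = ((resid_r[0] - rss_full) / R.shape[0]) / sigma2

print(F_wald, F_rss)   # the two forms coincide for linear restrictions
```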
201 |
202 | #
203 |
204 | # Interact education with management
205 |
206 | #
207 |
208 | interM_lm = ols('S ~ X + C(E)*C(M)', salary_table).fit()
209 | print interM_lm.summary()
210 |
211 | #
212 |
213 | print anova_lm(lm, interM_lm)
214 |
215 | #
216 |
217 | infl = interM_lm.get_influence()
218 | resid = infl.resid_studentized_internal
219 |
220 | #
221 |
222 | fig = plt.figure(figsize=(12,8))
223 | ax = fig.add_subplot(111, xlabel='X', ylabel='standardized resids')
224 |
225 | for values, group in factor_groups:
226 | i,j = values
227 | idx = group.index
228 | ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],
229 | s=144, edgecolors='black')
230 | ax.axis('tight');
231 |
232 | #
233 |
234 | # There looks to be an outlier.
235 |
236 | #
237 |
238 | outl = interM_lm.outlier_test('fdr_bh')
239 | outl.sort('unadj_p', inplace=True)
240 | print outl
241 |
242 | #
243 |
244 | idx = salary_table.index.drop(32)
245 |
246 | #
247 |
248 | print idx
249 |
250 | #
251 |
252 | lm32 = ols('S ~ C(E) + X + C(M)', data=salary_table, subset=idx).fit()
253 | print lm32.summary()
254 |
255 | #
256 |
257 | interX_lm32 = ols('S ~ C(E) * X + C(M)', data=salary_table, subset=idx).fit()
258 | print interX_lm32.summary()
259 |
260 | #
261 |
262 | table3 = anova_lm(lm32, interX_lm32)
263 | print table3
264 |
265 | #
266 |
267 | interM_lm32 = ols('S ~ X + C(E) * C(M)', data=salary_table, subset=idx).fit()
268 | print anova_lm(lm32, interM_lm32)
269 |
270 | #
271 |
272 | # Re-plotting the residuals
273 |
274 | #
275 |
276 | resid = interM_lm32.get_influence().summary_frame()['standard_resid']
277 | fig = plt.figure(figsize=(12,8))
278 | ax = fig.add_subplot(111, xlabel='X[~[32]]', ylabel='standardized resids')
279 |
280 | for values, group in factor_groups:
281 | i,j = values
282 | idx = group.index
283 | ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],
284 | s=144, edgecolors='black')
285 | ax.axis('tight');
286 |
287 | #
288 |
289 | # A final plot of the fitted values
290 |
291 | #
292 |
293 | lm_final = ols('S ~ X + C(E)*C(M)', data=salary_table.drop([32])).fit()
294 | mf = lm_final.model.data.orig_exog
295 | lstyle = ['-','--']
296 |
297 | fig = plt.figure(figsize=(12,8))
298 | ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary')
299 |
300 | for values, group in factor_groups:
301 | i,j = values
302 | idx = group.index
303 | ax.scatter(X[idx], S[idx], marker=symbols[j], color=colors[i-1],
304 | s=144, edgecolors='black')
305 | # drop NA because there is no idx 32 in the final model
306 | ax.plot(mf.X[idx].dropna(), lm_final.fittedvalues[idx].dropna(),
307 | ls=lstyle[j], color=colors[i-1])
308 | ax.axis('tight');
309 |
310 | #
311 |
312 | # From our first look at the data, the difference between Master's and Ph.D. in the management group differs from that in the non-management group. This is an interaction between the two qualitative variables, management (M) and education (E). We can visualize it by first removing the effect of experience and then plotting the means within each of the 6 groups using interaction_plot.
313 |
314 | #
315 |
316 | U = S - X * interX_lm32.params['X']
317 | U.name = 'Salary|X'
318 |
319 | fig = plt.figure(figsize=(12,8))
320 | ax = fig.add_subplot(111)
321 | ax = interaction_plot(E, M, U, colors=['red','blue'], markers=['^','D'],
322 | markersize=10, ax=ax)
323 |
324 | #
325 |
326 | # Minority Employment Data - ABLine plotting
327 |
328 | #
329 |
330 | # TEST - Job Aptitude Test Score
331 | # ETHN - 1 if minority, 0 otherwise
332 | # JPERF - Job performance evaluation
333 |
334 | #
335 |
336 | try:
337 | minority_table = pandas.read_table('minority.table')
338 | except: # don't have data already
339 | url = 'http://stats191.stanford.edu/data/minority.table'
340 | minority_table = pandas.read_table(url)
341 | minority_table.to_csv('minority.table', sep="\t", index=False)
342 |
343 | #
344 |
345 | factor_group = minority_table.groupby(['ETHN'])
346 |
347 | fig = plt.figure(figsize=(12,8))
348 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
349 | colors = ['purple', 'green']
350 | markers = ['o', 'v']
351 | for factor, group in factor_group:
352 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
353 | marker=markers[factor], s=12**2)
354 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1)
355 |
356 | #
357 |
358 | min_lm = ols('JPERF ~ TEST', data=minority_table).fit()
359 | print min_lm.summary()
360 |
361 | #
362 |
363 | fig = plt.figure(figsize=(12,8))
364 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
365 | for factor, group in factor_group:
366 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
367 | marker=markers[factor], s=12**2)
368 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left')
369 | fig = abline_plot(model_results = min_lm, ax=ax)
370 |
371 | #
372 |
373 | min_lm2 = ols('JPERF ~ TEST + TEST:ETHN', data=minority_table).fit()
374 | print min_lm2.summary()
375 |
376 | #
377 |
378 | fig = plt.figure(figsize=(12,8))
379 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
380 | for factor, group in factor_group:
381 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
382 | marker=markers[factor], s=12**2)
383 |
384 | fig = abline_plot(intercept = min_lm2.params['Intercept'],
385 | slope = min_lm2.params['TEST'], ax=ax, color='purple')
386 | ax = fig.axes[0]
387 | fig = abline_plot(intercept = min_lm2.params['Intercept'],
388 | slope = min_lm2.params['TEST'] + min_lm2.params['TEST:ETHN'],
389 | ax=ax, color='green')
390 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left');
391 |
392 | #
393 |
394 | min_lm3 = ols('JPERF ~ TEST + ETHN', data=minority_table).fit()
395 | print min_lm3.summary()
396 |
397 | #
398 |
399 | fig = plt.figure(figsize=(12,8))
400 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
401 | for factor, group in factor_group:
402 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
403 | marker=markers[factor], s=12**2)
404 |
405 | fig = abline_plot(intercept = min_lm3.params['Intercept'],
406 | slope = min_lm3.params['TEST'], ax=ax, color='purple')
407 |
408 | ax = fig.axes[0]
409 | fig = abline_plot(intercept = min_lm3.params['Intercept'] + min_lm3.params['ETHN'],
410 | slope = min_lm3.params['TEST'], ax=ax, color='green')
411 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left');
412 |
413 | #
414 |
415 | min_lm4 = ols('JPERF ~ TEST * ETHN', data=minority_table).fit()
416 | print min_lm4.summary()
417 |
418 | #
419 |
420 | fig = plt.figure(figsize=(12,8))
421 | ax = fig.add_subplot(111, ylabel='JPERF', xlabel='TEST')
422 | for factor, group in factor_group:
423 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
424 | marker=markers[factor], s=12**2)
425 |
426 | fig = abline_plot(intercept = min_lm4.params['Intercept'],
427 | slope = min_lm4.params['TEST'], ax=ax, color='purple')
428 | ax = fig.axes[0]
429 | fig = abline_plot(intercept = min_lm4.params['Intercept'] + min_lm4.params['ETHN'],
430 | slope = min_lm4.params['TEST'] + min_lm4.params['TEST:ETHN'],
431 | ax=ax, color='green')
432 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left');
433 |
434 | #
435 |
436 | # Is there any effect of ETHN on slope or intercept?
437 | #
438 | # Y ~ TEST vs. Y ~ TEST + ETHN + ETHN:TEST
439 |
440 | #
441 |
442 | table5 = anova_lm(min_lm, min_lm4)
443 | print table5
444 |
445 | #
446 |
447 | # Is there any effect of ETHN on intercept?
448 | #
449 | # Y ~ TEST vs. Y ~ TEST + ETHN
450 |
451 | #
452 |
453 | table6 = anova_lm(min_lm, min_lm3)
454 | print table6
455 |
456 | #
457 |
458 | # Is there any effect of ETHN on slope?
459 | #
460 | # Y ~ TEST vs. Y ~ TEST + ETHN:TEST
461 |
462 | #
463 |
464 | table7 = anova_lm(min_lm, min_lm2)
465 | print table7
466 |
467 | #
468 |
469 | # Is it just the slope or both?
470 | #
471 | # Y ~ TEST + ETHN:TEST vs Y ~ TEST + ETHN + ETHN:TEST
472 |
473 | #
474 |
475 | table8 = anova_lm(min_lm2, min_lm4)
476 | print table8
477 |
478 | #
479 |
480 | # Two Way ANOVA - Kidney failure data
481 |
482 | #
483 |
484 | # Weight - (1,2,3) - Level of weight gain between treatments
485 | # Duration - (1,2) - Level of duration of treatment
486 | # Days - Time of stay in hospital
487 |
488 | #
489 |
490 | try:
491 | kidney_table = pandas.read_table('kidney.table')
492 | except:
493 | url = 'http://stats191.stanford.edu/data/kidney.table'
494 | kidney_table = pandas.read_table(url, delimiter=" *")
495 | kidney_table.to_csv("kidney.table", sep="\t", index=False)
496 |
497 | #
498 |
499 | # Explore the dataset; it's a balanced design
500 | print kidney_table.groupby(['Weight', 'Duration']).size()
501 |
502 | #
503 |
504 | kt = kidney_table
505 | fig = plt.figure(figsize=(10,8))
506 | ax = fig.add_subplot(111)
507 | fig = interaction_plot(kt['Weight'], kt['Duration'], np.log(kt['Days']+1),
508 | colors=['red', 'blue'], markers=['D','^'], ms=10, ax=ax)
509 |
510 | #
511 |
512 | # $$Y_{ijk} = \mu + \alpha_i + \beta_j + \left(\alpha\beta\right)_{ij}+\epsilon_{ijk}$$
513 | #
514 | # with
515 | #
516 | # $$\epsilon_{ijk}\sim N\left(0,\sigma^2\right)$$
517 |
518 | #
519 |
520 | help(anova_lm)
521 |
522 | #
523 |
524 | # Things available in the calling namespace are available in the formula evaluation namespace
525 |
526 | #
527 |
528 | kidney_lm = ols('np.log(Days+1) ~ C(Duration) * C(Weight)', data=kt).fit()
529 |
530 | #
531 |
532 | # ANOVA Type-I Sum of Squares
533 | #
534 | # SS(A) for factor A.
535 | # SS(B|A) for factor B.
536 | # SS(AB|B, A) for interaction AB.
537 |
538 | #
539 |
540 | print anova_lm(kidney_lm)
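The sequential (Type-I) sums of squares can be reproduced as successive RSS reductions over a nested sequence of models. A sketch on synthetic two-factor data (assumed here, not the kidney data):

```python
import numpy as np

# Type-I (sequential) SS as nested RSS differences:
# SS(A) = RSS(1) - RSS(1+A), SS(B|A) = RSS(1+A) - RSS(1+A+B), etc.
def rss(X, y):
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

rng = np.random.RandomState(1)
n = 60
A = rng.randint(0, 2, n)                  # two-level factor A
B = rng.randint(0, 3, n)                  # three-level factor B
y = 1.0 + 0.5 * A + 0.3 * B + rng.randn(n)

one = np.ones((n, 1))
XA = np.column_stack([one, A == 1])                       # 1 + A
XB = np.column_stack([XA, B == 1, B == 2])                # 1 + A + B
XAB = np.column_stack([XB, (A == 1) & (B == 1),           # + A:B dummies
                       (A == 1) & (B == 2)])

ss_A = rss(one, y) - rss(XA, y)
ss_B_given_A = rss(XA, y) - rss(XB, y)
ss_AB = rss(XB, y) - rss(XAB, y)

# the sequential SS and the full-model RSS partition the total SS exactly
total_ss = rss(one, y)
print(ss_A + ss_B_given_A + ss_AB + rss(XAB, y), total_ss)
```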
541 |
542 | #
543 |
544 | # ANOVA Type-II Sum of Squares
545 | #
546 | # SS(A|B) for factor A.
547 | # SS(B|A) for factor B.
548 |
549 | #
550 |
551 | print anova_lm(kidney_lm, typ=2)
552 |
553 | #
554 |
555 | # ANOVA Type-III Sum of Squares
556 | #
557 | # SS(A|B, AB) for factor A.
558 | # SS(B|A, AB) for factor B.
559 |
560 | #
561 |
562 | print anova_lm(ols('np.log(Days+1) ~ C(Duration, Sum) * C(Weight, Poly)',
563 | data=kt).fit(), typ=3)
564 |
565 | #
566 |
567 | # Exercise: Find the 'best' model for the kidney failure dataset
568 |
569 |
--------------------------------------------------------------------------------
/preliminaries.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Learn More and Get Help
7 |
8 | #
9 |
10 | # Documentation: http://statsmodels.sf.net
11 | #
12 | # Mailing List: http://groups.google.com/group/pystatsmodels
13 | #
14 | # Use the source: https://github.com/statsmodels/statsmodels
15 |
16 | #
17 |
18 | # Tutorial Import Assumptions
19 |
20 | #
21 |
22 | import numpy as np
23 | import statsmodels.api as sm
24 | import matplotlib.pyplot as plt
25 | import pandas
26 | from scipy import stats
27 |
28 | np.set_printoptions(precision=4, suppress=True)
29 | pandas.set_printoptions(notebook_repr_html=False,
30 | precision=4,
31 | max_columns=12)
32 |
33 | #
34 |
35 | # Statsmodels Import Convention
36 |
37 | #
38 |
39 | import statsmodels.api as sm
40 |
41 | #
42 |
43 | # Import convention for models for which a formula is available.
44 |
45 | #
46 |
47 | from statsmodels.formula.api import ols, rlm, glm  # etc.
48 |
49 | #
50 |
51 | # Package Overview
52 |
53 | #
54 |
55 | # Regression models in statsmodels.regression
56 |
57 | #
58 |
59 | # Discrete choice models in statsmodels.discrete
60 |
61 | #
62 |
63 | # Robust linear models in statsmodels.robust
64 |
65 | #
66 |
67 | # Generalized linear models in statsmodels.genmod
68 |
69 | #
70 |
71 | # Time Series Analysis in statsmodels.tsa
72 |
73 | #
74 |
75 | # Nonparametric models in statsmodels.nonparametric
76 |
77 | #
78 |
79 | # Plotting functions in statsmodels.graphics
80 |
81 | #
82 |
83 | # Input/Output in statsmodels.iolib (Foreign data, ascii, HTML, $\LaTeX$ tables)
84 |
85 | #
86 |
87 | # Statistical tests, ANOVA in statsmodels.stats
88 |
89 | #