├── contrasts.ipynb
├── contrasts.py
├── discrete_choice.ipynb
├── discrete_choice.py
├── generic_mle.ipynb
├── generic_mle.py
├── kernel_density.ipynb
├── kernel_density.py
├── linear_models.ipynb
├── linear_models.py
├── preliminaries.ipynb
├── preliminaries.py
├── rmagic_extension.ipynb
├── rmagic_extension.py
├── robust_models.ipynb
├── robust_models.py
├── salary.table
├── star_diagram.png
├── tsa_arma.ipynb
├── tsa_arma.py
├── tsa_filters.ipynb
├── tsa_filters.py
├── tsa_var.ipynb
├── tsa_var.py
├── whats_coming.ipynb
└── whats_coming.py
/contrasts.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "contrasts"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 3,
13 | "metadata": {},
14 | "source": [
15 | "Contrasts Overview"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "collapsed": false,
21 | "input": [
22 | "import statsmodels.api as sm"
23 | ],
24 | "language": "python",
25 | "metadata": {},
26 | "outputs": []
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "This document is based heavily on this excellent resource from UCLA http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm"
33 | ]
34 | },
35 | {
36 | "cell_type": "raw",
37 | "metadata": {},
38 | "source": [
39 | "A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.\n",
40 | "\n",
41 | "In fact, the dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of *contrasts* for a categorical variable with `k` levels is a set of `k-1` functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. The dummy coding isn't wrong *per se*. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients such as in ANOVA. Linear regression models do not assume independence of the coefficients and thus dummy coding is often the only coding that is taught in this context.\n",
42 | "\n",
43 | "To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data."
44 | ]
45 | },
46 | {
47 | "cell_type": "heading",
48 | "level": 4,
49 | "metadata": {},
50 | "source": [
51 | "Example Data"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "collapsed": false,
57 | "input": [
58 | "import pandas\n",
59 | "url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'\n",
60 | "hsb2 = pandas.read_table(url, delimiter=\",\")"
61 | ],
62 | "language": "python",
63 | "metadata": {},
64 | "outputs": []
65 | },
66 | {
67 | "cell_type": "code",
68 | "collapsed": false,
69 | "input": [
70 | "hsb2.head(10)"
71 | ],
72 | "language": "python",
73 | "metadata": {},
74 | "outputs": []
75 | },
76 | {
77 | "cell_type": "raw",
78 | "metadata": {},
79 | "source": [
80 | "It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, 4 = Caucasian)."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "collapsed": false,
86 | "input": [
87 | "hsb2.groupby('race')['write'].mean()"
88 | ],
89 | "language": "python",
90 | "metadata": {},
91 | "outputs": []
92 | },
93 | {
94 | "cell_type": "heading",
95 | "level": 4,
96 | "metadata": {},
97 | "source": [
98 | "Treatment (Dummy) Coding"
99 | ]
100 | },
101 | {
102 | "cell_type": "raw",
103 | "metadata": {},
104 | "source": [
105 | "Dummy coding is likely the most well-known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "collapsed": false,
111 | "input": [
112 | "from patsy.contrasts import Treatment\n",
113 | "levels = [1,2,3,4]\n",
114 | "contrast = Treatment(reference=0).code_without_intercept(levels)\n",
115 | "print contrast.matrix"
116 | ],
117 | "language": "python",
118 | "metadata": {},
119 | "outputs": []
120 | },
121 | {
122 | "cell_type": "raw",
123 | "metadata": {},
124 | "source": [
125 | "Here we used `reference=0`, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the `race` variable."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "collapsed": false,
131 | "input": [
132 | "hsb2.race.head(10)"
133 | ],
134 | "language": "python",
135 | "metadata": {},
136 | "outputs": []
137 | },
138 | {
139 | "cell_type": "code",
140 | "collapsed": false,
141 | "input": [
142 | "print contrast.matrix[hsb2.race-1, :][:20]"
143 | ],
144 | "language": "python",
145 | "metadata": {},
146 | "outputs": []
147 | },
148 | {
149 | "cell_type": "code",
150 | "collapsed": false,
151 | "input": [
152 | "sm.categorical(hsb2.race.values)"
153 | ],
154 | "language": "python",
155 | "metadata": {},
156 | "outputs": []
157 | },
158 | {
159 | "cell_type": "raw",
160 | "metadata": {},
161 | "source": [
162 | "This is a bit of a trick, as the `race` category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so this manual indexing won't work in general, but it is nonetheless a useful exercise to fix ideas. The regression below illustrates the output using the three contrasts above"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "collapsed": false,
168 | "input": [
169 | "from statsmodels.formula.api import ols\n",
170 | "mod = ols(\"write ~ C(race, Treatment)\", data=hsb2)\n",
171 | "res = mod.fit()\n",
172 | "print res.summary()"
173 | ],
174 | "language": "python",
175 | "metadata": {},
176 | "outputs": []
177 | },
178 | {
179 | "cell_type": "raw",
180 | "metadata": {},
181 | "source": [
182 | "We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted this."
183 | ]
184 | },
185 | {
186 | "cell_type": "heading",
187 | "level": 3,
188 | "metadata": {},
189 | "source": [
190 | "Simple Coding"
191 | ]
192 | },
193 | {
194 | "cell_type": "raw",
195 | "metadata": {},
196 | "source": [
197 | "Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factor. Patsy doesn't include the Simple contrast, but you can easily define your own contrasts. To do so, write a class that contains code_with_intercept and code_without_intercept methods, each returning a patsy.contrasts.ContrastMatrix instance"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "collapsed": false,
203 | "input": [
204 | "from patsy.contrasts import ContrastMatrix\n",
"import numpy as np\n",
205 | "\n",
206 | "def _name_levels(prefix, levels):\n",
207 | " return [\"[%s%s]\" % (prefix, level) for level in levels]\n",
208 | "\n",
209 | "class Simple(object):\n",
210 | " def _simple_contrast(self, levels):\n",
211 | " nlevels = len(levels)\n",
212 | " contr = -1./nlevels * np.ones((nlevels, nlevels-1))\n",
213 | " contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels\n",
214 | " return contr\n",
215 | "\n",
216 | " def code_with_intercept(self, levels):\n",
217 | " contrast = np.column_stack((np.ones(len(levels)),\n",
218 | " self._simple_contrast(levels)))\n",
219 | " return ContrastMatrix(contrast, _name_levels(\"Simp.\", levels))\n",
220 | "\n",
221 | " def code_without_intercept(self, levels):\n",
222 | " contrast = self._simple_contrast(levels)\n",
223 | " return ContrastMatrix(contrast, _name_levels(\"Simp.\", levels[:-1]))"
224 | ],
225 | "language": "python",
226 | "metadata": {},
227 | "outputs": []
228 | },
229 | {
230 | "cell_type": "code",
231 | "collapsed": false,
232 | "input": [
233 | "hsb2.groupby('race')['write'].mean().mean()"
234 | ],
235 | "language": "python",
236 | "metadata": {},
237 | "outputs": []
238 | },
239 | {
240 | "cell_type": "code",
241 | "collapsed": false,
242 | "input": [
243 | "contrast = Simple().code_without_intercept(levels)\n",
244 | "print contrast.matrix"
245 | ],
246 | "language": "python",
247 | "metadata": {},
248 | "outputs": []
249 | },
250 | {
251 | "cell_type": "code",
252 | "collapsed": false,
253 | "input": [
254 | "mod = ols(\"write ~ C(race, Simple)\", data=hsb2)\n",
255 | "res = mod.fit()\n",
256 | "print res.summary()"
257 | ],
258 | "language": "python",
259 | "metadata": {},
260 | "outputs": []
261 | },
262 | {
263 | "cell_type": "heading",
264 | "level": 3,
265 | "metadata": {},
266 | "source": [
267 | "Sum (Deviation) Coding"
268 | ]
269 | },
270 | {
271 | "cell_type": "raw",
272 | "metadata": {},
273 | "source": [
274 | "Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others."
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "collapsed": false,
280 | "input": [
281 | "from patsy.contrasts import Sum\n",
282 | "contrast = Sum().code_without_intercept(levels)\n",
283 | "print contrast.matrix"
284 | ],
285 | "language": "python",
286 | "metadata": {},
287 | "outputs": []
288 | },
289 | {
290 | "cell_type": "code",
291 | "collapsed": false,
292 | "input": [
293 | "mod = ols(\"write ~ C(race, Sum)\", data=hsb2)\n",
294 | "res = mod.fit()\n",
295 | "print res.summary()"
296 | ],
297 | "language": "python",
298 | "metadata": {},
299 | "outputs": []
300 | },
301 | {
302 | "cell_type": "raw",
303 | "metadata": {},
304 | "source": [
305 | "This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, i.e., the mean of the level means of the dependent variable."
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "collapsed": false,
311 | "input": [
312 | "hsb2.groupby('race')['write'].mean().mean()"
313 | ],
314 | "language": "python",
315 | "metadata": {},
316 | "outputs": []
317 | },
318 | {
319 | "cell_type": "heading",
320 | "level": 3,
321 | "metadata": {},
322 | "source": [
323 | "Backward Difference Coding"
324 | ]
325 | },
326 | {
327 | "cell_type": "raw",
328 | "metadata": {},
329 | "source": [
330 | "In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable."
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "collapsed": false,
336 | "input": [
337 | "from patsy.contrasts import Diff\n",
338 | "contrast = Diff().code_without_intercept(levels)\n",
339 | "print contrast.matrix"
340 | ],
341 | "language": "python",
342 | "metadata": {},
343 | "outputs": []
344 | },
345 | {
346 | "cell_type": "code",
347 | "collapsed": false,
348 | "input": [
349 | "mod = ols(\"write ~ C(race, Diff)\", data=hsb2)\n",
350 | "res = mod.fit()\n",
351 | "print res.summary()"
352 | ],
353 | "language": "python",
354 | "metadata": {},
355 | "outputs": []
356 | },
357 | {
358 | "cell_type": "raw",
359 | "metadata": {},
360 | "source": [
361 | "For example, here the coefficient on level 1 is the mean of `write` at level 2 compared with the mean at level 1, i.e.,"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "collapsed": false,
367 | "input": [
368 | "res.params[\"C(race, Diff)[D.1]\"]\n",
369 | "hsb2.groupby('race').mean()[\"write\"][2] - \\\n",
370 | " hsb2.groupby('race').mean()[\"write\"][1]"
371 | ],
372 | "language": "python",
373 | "metadata": {},
374 | "outputs": []
375 | },
376 | {
377 | "cell_type": "heading",
378 | "level": 3,
379 | "metadata": {},
380 | "source": [
381 | "Helmert Coding"
382 | ]
383 | },
384 | {
385 | "cell_type": "raw",
386 | "metadata": {},
387 | "source": [
388 | "Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels; hence the name 'reverse' is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so:"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "collapsed": false,
394 | "input": [
395 | "from patsy.contrasts import Helmert\n",
396 | "contrast = Helmert().code_without_intercept(levels)\n",
397 | "print contrast.matrix"
398 | ],
399 | "language": "python",
400 | "metadata": {},
401 | "outputs": []
402 | },
403 | {
404 | "cell_type": "code",
405 | "collapsed": false,
406 | "input": [
407 | "mod = ols(\"write ~ C(race, Helmert)\", data=hsb2)\n",
408 | "res = mod.fit()\n",
409 | "print res.summary()"
410 | ],
411 | "language": "python",
412 | "metadata": {},
413 | "outputs": []
414 | },
415 | {
416 | "cell_type": "raw",
417 | "metadata": {},
418 | "source": [
419 | "To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels subtracted from the mean at level 4:"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "collapsed": false,
425 | "input": [
426 | "grouped = hsb2.groupby('race')\n",
427 | "grouped.mean()[\"write\"][4] - grouped.mean()[\"write\"][:3].mean()"
428 | ],
429 | "language": "python",
430 | "metadata": {},
431 | "outputs": []
432 | },
433 | {
434 | "cell_type": "raw",
435 | "metadata": {},
436 | "source": [
437 | "As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same."
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "collapsed": false,
443 | "input": [
444 | "k = 4\n",
445 | "1./k * (grouped.mean()[\"write\"][k] - grouped.mean()[\"write\"][:k-1].mean())\n",
446 | "k = 3\n",
447 | "1./k * (grouped.mean()[\"write\"][k] - grouped.mean()[\"write\"][:k-1].mean())"
448 | ],
449 | "language": "python",
450 | "metadata": {},
451 | "outputs": []
452 | },
453 | {
454 | "cell_type": "heading",
455 | "level": 3,
456 | "metadata": {},
457 | "source": [
458 | "Orthogonal Polynomial Coding"
459 | ]
460 | },
461 | {
462 | "cell_type": "raw",
463 | "metadata": {},
464 | "source": [
465 | "The coefficients taken on by polynomial coding for `k=4` levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order `k-1`. Since `race` is not an ordered factor variable let's use `read` as an example. First we need to create an ordered categorical from `read`."
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "collapsed": false,
471 | "input": [
472 | "hsb2['readcat'] = pandas.cut(hsb2.read, bins=3)\n",
473 | "hsb2.groupby('readcat').mean()['write']"
474 | ],
475 | "language": "python",
476 | "metadata": {},
477 | "outputs": []
478 | },
479 | {
480 | "cell_type": "code",
481 | "collapsed": false,
482 | "input": [
483 | "from patsy.contrasts import Poly\n",
484 | "levels = hsb2.readcat.unique().tolist()\n",
485 | "contrast = Poly().code_without_intercept(levels)\n",
486 | "print contrast.matrix"
487 | ],
488 | "language": "python",
489 | "metadata": {},
490 | "outputs": []
491 | },
492 | {
493 | "cell_type": "code",
494 | "collapsed": false,
495 | "input": [
496 | "mod = ols(\"write ~ C(readcat, Poly)\", data=hsb2)\n",
497 | "res = mod.fit()\n",
498 | "print res.summary()"
499 | ],
500 | "language": "python",
501 | "metadata": {},
502 | "outputs": []
503 | },
504 | {
505 | "cell_type": "raw",
506 | "metadata": {},
507 | "source": [
508 | "As you can see, `readcat` has a significant linear effect on the dependent variable `write` but not a significant quadratic effect."
509 | ]
510 | }
511 | ],
512 | "metadata": {}
513 | }
514 | ]
515 | }
--------------------------------------------------------------------------------
/contrasts.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Contrasts Overview
7 |
8 | #
9 |
10 | import statsmodels.api as sm
11 |
12 | #
13 |
14 | # This document is based heavily on this excellent resource from UCLA http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
15 |
16 | #
17 |
18 | # A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.
19 | #
20 | # In fact, the dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of *contrasts* for a categorical variable with `k` levels is a set of `k-1` functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. The dummy coding isn't wrong *per se*. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients such as in ANOVA. Linear regression models do not assume independence of the coefficients and thus dummy coding is often the only coding that is taught in this context.
21 | #
22 | # To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data.
23 |
24 | #
25 |
26 | # Example Data
27 |
28 | #
29 |
30 | import pandas
31 | url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'
32 | hsb2 = pandas.read_table(url, delimiter=",")
33 |
34 | #
35 |
36 | hsb2.head(10)
37 |
38 | #
39 |
40 | # It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, 4 = Caucasian).
41 |
42 | #
43 |
44 | hsb2.groupby('race')['write'].mean()
45 |
46 | #
47 |
48 | # Treatment (Dummy) Coding
49 |
50 | #
51 |
52 | # Dummy coding is likely the most well-known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be
53 |
54 | #
55 |
56 | from patsy.contrasts import Treatment
57 | levels = [1,2,3,4]
58 | contrast = Treatment(reference=0).code_without_intercept(levels)
59 | print contrast.matrix
60 |
61 | #
62 |
63 | # Here we used `reference=0`, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the `race` variable.
64 |
65 | #
66 |
67 | hsb2.race.head(10)
68 |
69 | #
70 |
71 | print contrast.matrix[hsb2.race-1, :][:20]
72 |
73 | #
74 |
75 | sm.categorical(hsb2.race.values)
76 |
77 | #
78 |
79 | # This is a bit of a trick, as the `race` category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so this manual indexing won't work in general, but it is nonetheless a useful exercise to fix ideas. The regression below illustrates the output using the three contrasts above
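The zero-based-index trick can be made concrete with a small standalone sketch (the level codes and observations below are hypothetical, not the hsb2 values): when the level codes are not already 0, 1, 2, ..., each observation first has to be mapped to the row of its level before indexing into the contrast matrix.

```python
import numpy as np

# Hypothetical, non-contiguous level codes -- `codes - 1` would not work here.
levels = np.array([10, 20, 30, 40])
obs = np.array([30, 10, 40, 40, 20])        # hypothetical observations

# Map each observation to the row of its level (levels must be sorted).
rows = np.searchsorted(levels, obs)         # -> [2, 0, 3, 3, 1]

# 4x3 treatment contrast matrix: the reference level occupies row 0.
contrast = np.vstack([np.zeros(3), np.eye(3)])
encoded = contrast[rows]                    # the design columns for `obs`
```

The same row lookup is what patsy performs internally when it encodes a categorical factor.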
80 |
81 | #
82 |
83 | from statsmodels.formula.api import ols
84 | mod = ols("write ~ C(race, Treatment)", data=hsb2)
85 | res = mod.fit()
86 | print res.summary()
87 |
88 | #
89 |
90 | # We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted this.
91 |
92 | #
93 |
94 | # Simple Coding
95 |
96 | #
97 |
98 | # Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factor. Patsy doesn't include the Simple contrast, but you can easily define your own contrasts. To do so, write a class that contains code_with_intercept and code_without_intercept methods, each returning a patsy.contrasts.ContrastMatrix instance
99 |
100 | #
101 |
102 | from patsy.contrasts import ContrastMatrix
import numpy as np
103 |
104 | def _name_levels(prefix, levels):
105 | return ["[%s%s]" % (prefix, level) for level in levels]
106 |
107 | class Simple(object):
108 | def _simple_contrast(self, levels):
109 | nlevels = len(levels)
110 | contr = -1./nlevels * np.ones((nlevels, nlevels-1))
111 | contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels
112 | return contr
113 |
114 | def code_with_intercept(self, levels):
115 | contrast = np.column_stack((np.ones(len(levels)),
116 | self._simple_contrast(levels)))
117 | return ContrastMatrix(contrast, _name_levels("Simp.", levels))
118 |
119 | def code_without_intercept(self, levels):
120 | contrast = self._simple_contrast(levels)
121 | return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
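As a quick sanity check of the `_simple_contrast` construction, rebuilt here with plain numpy so it runs standalone: each column of the Simple contrast sums to zero, which is what makes it functionally independent of the intercept.

```python
import numpy as np

nlevels = 4
# Same construction as _simple_contrast above, for k = 4 levels.
contr = -1.0 / nlevels * np.ones((nlevels, nlevels - 1))
contr[1:][np.diag_indices(nlevels - 1)] = (nlevels - 1.0) / nlevels

col_sums = contr.sum(axis=0)   # should be all (numerically) zero
```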
122 |
123 | #
124 |
125 | hsb2.groupby('race')['write'].mean().mean()
126 |
127 | #
128 |
129 | contrast = Simple().code_without_intercept(levels)
130 | print contrast.matrix
131 |
132 | #
133 |
134 | mod = ols("write ~ C(race, Simple)", data=hsb2)
135 | res = mod.fit()
136 | print res.summary()
137 |
138 | #
139 |
140 | # Sum (Deviation) Coding
141 |
142 | #
143 |
144 | # Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.
145 |
146 | #
147 |
148 | from patsy.contrasts import Sum
149 | contrast = Sum().code_without_intercept(levels)
150 | print contrast.matrix
151 |
152 | #
153 |
154 | mod = ols("write ~ C(race, Sum)", data=hsb2)
155 | res = mod.fit()
156 | print res.summary()
157 |
158 | #
159 |
160 | # This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, i.e., the mean of the level means of the dependent variable.
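The sum-to-zero constraint can be illustrated without the data (the level means below are made up, not the hsb2 values): the intercept is the mean of the level means, the reported effects are deviations from it, and the omitted level's effect is recovered as the negative sum of the others.

```python
import numpy as np

level_means = np.array([46.0, 58.0, 48.0, 54.0])  # hypothetical means of `write`
grand_mean = level_means.mean()                   # the Sum-coding intercept
effects = level_means - grand_mean                # deviation of each level

# Only the first k-1 effects appear in the summary; the last one is implied:
implied_last = -effects[:-1].sum()
```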
161 |
162 | #
163 |
164 | hsb2.groupby('race')['write'].mean().mean()
165 |
166 | #
167 |
168 | # Backward Difference Coding
169 |
170 | #
171 |
172 | # In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
173 |
174 | #
175 |
176 | from patsy.contrasts import Diff
177 | contrast = Diff().code_without_intercept(levels)
178 | print contrast.matrix
179 |
180 | #
181 |
182 | mod = ols("write ~ C(race, Diff)", data=hsb2)
183 | res = mod.fit()
184 | print res.summary()
185 |
186 | #
187 |
188 | # For example, here the coefficient on level 1 is the mean of `write` at level 2 compared with the mean at level 1, i.e.,
189 |
190 | #
191 |
192 | res.params["C(race, Diff)[D.1]"]
193 | hsb2.groupby('race').mean()["write"][2] - \
194 | hsb2.groupby('race').mean()["write"][1]
195 |
196 | #
197 |
198 | # Helmert Coding
199 |
200 | #
201 |
202 | # Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels; hence the name 'reverse' is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so:
203 |
204 | #
205 |
206 | from patsy.contrasts import Helmert
207 | contrast = Helmert().code_without_intercept(levels)
208 | print contrast.matrix
209 |
210 | #
211 |
212 | mod = ols("write ~ C(race, Helmert)", data=hsb2)
213 | res = mod.fit()
214 | print res.summary()
215 |
216 | #
217 |
218 | # To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels subtracted from the mean at level 4:
219 |
220 | #
221 |
222 | grouped = hsb2.groupby('race')
223 | grouped.mean()["write"][4] - grouped.mean()["write"][:3].mean()
224 |
225 | #
226 |
227 | # As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same.
228 |
229 | #
230 |
231 | k = 4
232 | 1./k * (grouped.mean()["write"][k] - grouped.mean()["write"][:k-1].mean())
233 | k = 3
234 | 1./k * (grouped.mean()["write"][k] - grouped.mean()["write"][:k-1].mean())
235 |
236 | #
237 |
238 | # Orthogonal Polynomial Coding
239 |
240 | #
241 |
242 | # The coefficients taken on by polynomial coding for `k=4` levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order `k-1`. Since `race` is not an ordered factor variable let's use `read` as an example. First we need to create an ordered categorical from `read`.
243 |
244 | #
245 |
246 | hsb2['readcat'] = pandas.cut(hsb2.read, bins=3)
247 | hsb2.groupby('readcat').mean()['write']
248 |
249 | #
250 |
251 | from patsy.contrasts import Poly
252 | levels = hsb2.readcat.unique().tolist()
253 | contrast = Poly().code_without_intercept(levels)
254 | print contrast.matrix
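What makes these columns "orthogonal" can be checked directly. A numpy-only sketch for k = 3 equally spaced levels (patsy's `Poly` builds equivalent columns, up to sign): the normalized linear and quadratic trend columns are orthonormal, and each sums to zero, so they are independent of the intercept.

```python
import numpy as np

# k = 3 equally spaced levels: normalized linear and quadratic trends.
linear = np.array([-1.0, 0.0, 1.0]) / np.sqrt(2.0)
quadratic = np.array([1.0, -2.0, 1.0]) / np.sqrt(6.0)
contrast = np.column_stack([linear, quadratic])

gram = contrast.T.dot(contrast)   # identity => columns are orthonormal
col_sums = contrast.sum(axis=0)   # zeros => independent of the intercept
```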
255 |
256 | #
257 |
258 | mod = ols("write ~ C(readcat, Poly)", data=hsb2)
259 | res = mod.fit()
260 | print res.summary()
261 |
262 | #
263 |
264 | # As you can see, `readcat` has a significant linear effect on the dependent variable `write` but not a significant quadratic effect.
265 |
266 |
--------------------------------------------------------------------------------
/discrete_choice.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "discrete_choice"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 2,
13 | "metadata": {},
14 | "source": [
15 | "Discrete Choice Models - Fair's Affair data"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "In 1974, *Redbook* conducted a survey of women asking about extramarital affairs."
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "collapsed": false,
28 | "input": [
29 | "import numpy as np\n",
30 | "from scipy import stats\n",
31 | "import matplotlib.pyplot as plt\n",
32 | "import statsmodels.api as sm\n",
33 | "from statsmodels.formula.api import logit, probit, poisson, ols"
34 | ],
35 | "language": "python",
36 | "metadata": {},
37 | "outputs": []
38 | },
39 | {
40 | "cell_type": "code",
41 | "collapsed": false,
42 | "input": [
43 | "print sm.datasets.fair.SOURCE"
44 | ],
45 | "language": "python",
46 | "metadata": {},
47 | "outputs": []
48 | },
49 | {
50 | "cell_type": "code",
51 | "collapsed": false,
52 | "input": [
53 | "print sm.datasets.fair.NOTE"
54 | ],
55 | "language": "python",
56 | "metadata": {},
57 | "outputs": []
58 | },
59 | {
60 | "cell_type": "code",
61 | "collapsed": false,
62 | "input": [
63 | "dta = sm.datasets.fair.load_pandas().data"
64 | ],
65 | "language": "python",
66 | "metadata": {},
67 | "outputs": []
68 | },
69 | {
70 | "cell_type": "code",
71 | "collapsed": false,
72 | "input": [
73 | "dta['affair'] = (dta['affairs'] > 0).astype(float)\n",
74 | "print dta.head(10)"
75 | ],
76 | "language": "python",
77 | "metadata": {},
78 | "outputs": []
79 | },
80 | {
81 | "cell_type": "code",
82 | "collapsed": false,
83 | "input": [
84 | "print dta.describe()"
85 | ],
86 | "language": "python",
87 | "metadata": {},
88 | "outputs": []
89 | },
90 | {
91 | "cell_type": "code",
92 | "collapsed": false,
93 | "input": [
94 | "affair_mod = logit(\"affair ~ occupation + educ + occupation_husb\" \n",
95 | " \"+ rate_marriage + age + yrs_married + children\"\n",
96 | " \" + religious\", dta).fit()"
97 | ],
98 | "language": "python",
99 | "metadata": {},
100 | "outputs": []
101 | },
102 | {
103 | "cell_type": "code",
104 | "collapsed": false,
105 | "input": [
106 | "print affair_mod.summary()"
107 | ],
108 | "language": "python",
109 | "metadata": {},
110 | "outputs": []
111 | },
112 | {
113 | "cell_type": "raw",
114 | "metadata": {},
115 | "source": [
116 | "How well are we predicting?"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "collapsed": false,
122 | "input": [
123 | "affair_mod.pred_table()"
124 | ],
125 | "language": "python",
126 | "metadata": {},
127 | "outputs": []
128 | },
129 | {
130 | "cell_type": "raw",
131 | "metadata": {},
132 | "source": [
133 | "The coefficients of the discrete choice model do not tell us much. What we're after is marginal effects."
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "collapsed": false,
139 | "input": [
140 | "mfx = affair_mod.get_margeff()\n",
141 | "print mfx.summary()"
142 | ],
143 | "language": "python",
144 | "metadata": {},
145 | "outputs": []
146 | },
147 | {
148 | "cell_type": "code",
149 | "collapsed": false,
150 | "input": [
151 | "respondent1000 = dta.ix[1000]\n",
152 | "print respondent1000"
153 | ],
154 | "language": "python",
155 | "metadata": {},
156 | "outputs": []
157 | },
158 | {
159 | "cell_type": "code",
160 | "collapsed": false,
161 | "input": [
162 | "resp = dict(zip(range(1,9), respondent1000[[\"occupation\", \"educ\", \n",
163 | " \"occupation_husb\", \"rate_marriage\", \n",
164 | " \"age\", \"yrs_married\", \"children\", \n",
165 | " \"religious\"]].tolist()))\n",
166 | "resp.update({0 : 1})\n",
167 | "print resp"
168 | ],
169 | "language": "python",
170 | "metadata": {},
171 | "outputs": []
172 | },
173 | {
174 | "cell_type": "code",
175 | "collapsed": false,
176 | "input": [
177 | "mfx = affair_mod.get_margeff(atexog=resp)\n",
178 | "print mfx.summary()"
179 | ],
180 | "language": "python",
181 | "metadata": {},
182 | "outputs": []
183 | },
184 | {
185 | "cell_type": "code",
186 | "collapsed": false,
187 | "input": [
188 | "affair_mod.predict(respondent1000)"
189 | ],
190 | "language": "python",
191 | "metadata": {},
192 | "outputs": []
193 | },
194 | {
195 | "cell_type": "code",
196 | "collapsed": false,
197 | "input": [
198 | "affair_mod.fittedvalues[1000]"
199 | ],
200 | "language": "python",
201 | "metadata": {},
202 | "outputs": []
203 | },
204 | {
205 | "cell_type": "code",
206 | "collapsed": false,
207 | "input": [
208 | "affair_mod.model.cdf(affair_mod.fittedvalues[1000])"
209 | ],
210 | "language": "python",
211 | "metadata": {},
212 | "outputs": []
213 | },
214 | {
215 | "cell_type": "raw",
216 | "metadata": {},
217 | "source": [
218 | "The \"correct\" model here is likely the Tobit model. We have a work-in-progress branch \"tobit-model\" on GitHub, if anyone is interested in censored regression models."
219 | ]
220 | },
221 | {
222 | "cell_type": "heading",
223 | "level": 3,
224 | "metadata": {},
225 | "source": [
226 | "Exercise: Logit vs Probit"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "collapsed": false,
232 | "input": [
233 | "fig = plt.figure(figsize=(12,8))\n",
234 | "ax = fig.add_subplot(111)\n",
235 | "support = np.linspace(-6, 6, 1000)\n",
236 | "ax.plot(support, stats.logistic.cdf(support), 'r-', label='Logistic')\n",
237 | "ax.plot(support, stats.norm.cdf(support), label='Probit')\n",
238 | "ax.legend();"
239 | ],
240 | "language": "python",
241 | "metadata": {},
242 | "outputs": []
243 | },
244 | {
245 | "cell_type": "code",
246 | "collapsed": false,
247 | "input": [
248 | "fig = plt.figure(figsize=(12,8))\n",
249 | "ax = fig.add_subplot(111)\n",
250 | "support = np.linspace(-6, 6, 1000)\n",
251 | "ax.plot(support, stats.logistic.pdf(support), 'r-', label='Logistic')\n",
252 | "ax.plot(support, stats.norm.pdf(support), label='Probit')\n",
253 | "ax.legend();"
254 | ],
255 | "language": "python",
256 | "metadata": {},
257 | "outputs": []
258 | },
259 | {
260 | "cell_type": "raw",
261 | "metadata": {},
262 | "source": [
263 | "Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Is there much difference in the marginal effects?"
264 | ]
265 | },
266 | {
267 | "cell_type": "heading",
268 | "level": 3,
269 | "metadata": {},
270 | "source": [
271 | "Generalized Linear Model Example"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "collapsed": false,
277 | "input": [
278 | "print sm.datasets.star98.SOURCE"
279 | ],
280 | "language": "python",
281 | "metadata": {},
282 | "outputs": []
283 | },
284 | {
285 | "cell_type": "code",
286 | "collapsed": false,
287 | "input": [
288 | "print sm.datasets.star98.DESCRLONG"
289 | ],
290 | "language": "python",
291 | "metadata": {},
292 | "outputs": []
293 | },
294 | {
295 | "cell_type": "code",
296 | "collapsed": false,
297 | "input": [
298 | "print sm.datasets.star98.NOTE"
299 | ],
300 | "language": "python",
301 | "metadata": {},
302 | "outputs": []
303 | },
304 | {
305 | "cell_type": "code",
306 | "collapsed": false,
307 | "input": [
308 | "dta = sm.datasets.star98.load_pandas().data\n",
309 | "print dta.columns"
310 | ],
311 | "language": "python",
312 | "metadata": {},
313 | "outputs": []
314 | },
315 | {
316 | "cell_type": "code",
317 | "collapsed": false,
318 | "input": [
319 | "print dta[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP', 'PERMINTE']].head(10)"
320 | ],
321 | "language": "python",
322 | "metadata": {},
323 | "outputs": []
324 | },
325 | {
326 | "cell_type": "code",
327 | "collapsed": false,
328 | "input": [
329 | "print dta[['AVYRSEXP', 'AVSALK', 'PERSPENK', 'PTRATIO', 'PCTAF', 'PCTCHRT', 'PCTYRRND']].head(10)"
330 | ],
331 | "language": "python",
332 | "metadata": {},
333 | "outputs": []
334 | },
335 | {
336 | "cell_type": "code",
337 | "collapsed": false,
338 | "input": [
339 | "formula = 'NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT '\n",
340 | "formula += '+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'"
341 | ],
342 | "language": "python",
343 | "metadata": {},
344 | "outputs": []
345 | },
346 | {
347 | "cell_type": "heading",
348 | "level": 4,
349 | "metadata": {},
350 | "source": [
351 | "Aside: Binomial distribution"
352 | ]
353 | },
354 | {
355 | "cell_type": "raw",
356 | "metadata": {},
357 | "source": [
358 | "Toss a six-sided die 5 times. What's the probability of exactly 2 fours?"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "collapsed": false,
364 | "input": [
365 | "stats.binom(5, 1./6).pmf(2)"
366 | ],
367 | "language": "python",
368 | "metadata": {},
369 | "outputs": []
370 | },
371 | {
372 | "cell_type": "code",
373 | "collapsed": false,
374 | "input": [
375 | "from scipy.misc import comb\n",
376 | "comb(5,2) * (1/6.)**2 * (5/6.)**3"
377 | ],
378 | "language": "python",
379 | "metadata": {},
380 | "outputs": []
381 | },
382 | {
383 | "cell_type": "code",
384 | "collapsed": false,
385 | "input": [
386 | "from statsmodels.formula.api import glm\n",
387 | "glm_mod = glm(formula, dta, family=sm.families.Binomial()).fit()"
388 | ],
389 | "language": "python",
390 | "metadata": {},
391 | "outputs": []
392 | },
393 | {
394 | "cell_type": "code",
395 | "collapsed": false,
396 | "input": [
397 | "print glm_mod.summary()"
398 | ],
399 | "language": "python",
400 | "metadata": {},
401 | "outputs": []
402 | },
403 | {
404 | "cell_type": "raw",
405 | "metadata": {},
406 | "source": [
407 | "The number of trials for each observation is the row sum of the endogenous variables:"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "collapsed": false,
413 | "input": [
414 | "glm_mod.model.data.orig_endog.sum(1)"
415 | ],
416 | "language": "python",
417 | "metadata": {},
418 | "outputs": []
419 | },
420 | {
421 | "cell_type": "code",
422 | "collapsed": false,
423 | "input": [
424 | "glm_mod.fittedvalues * glm_mod.model.data.orig_endog.sum(1)"
425 | ],
426 | "language": "python",
427 | "metadata": {},
428 | "outputs": []
429 | },
430 | {
431 | "cell_type": "raw",
432 | "metadata": {},
433 | "source": [
434 | "First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact\n",
435 | "on the response variables:"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "collapsed": false,
441 | "input": [
442 | "exog = glm_mod.model.data.orig_exog # get the dataframe"
443 | ],
444 | "language": "python",
445 | "metadata": {},
446 | "outputs": []
447 | },
448 | {
449 | "cell_type": "code",
450 | "collapsed": false,
451 | "input": [
452 | "means25 = exog.mean()\n",
453 | "print means25"
454 | ],
455 | "language": "python",
456 | "metadata": {},
457 | "outputs": []
458 | },
459 | {
460 | "cell_type": "code",
461 | "collapsed": false,
462 | "input": [
463 | "means25['LOWINC'] = exog['LOWINC'].quantile(.25)\n",
464 | "print means25"
465 | ],
466 | "language": "python",
467 | "metadata": {},
468 | "outputs": []
469 | },
470 | {
471 | "cell_type": "code",
472 | "collapsed": false,
473 | "input": [
474 | "means75 = exog.mean()\n",
475 | "means75['LOWINC'] = exog['LOWINC'].quantile(.75)\n",
476 | "print means75"
477 | ],
478 | "language": "python",
479 | "metadata": {},
480 | "outputs": []
481 | },
482 | {
483 | "cell_type": "code",
484 | "collapsed": false,
485 | "input": [
486 | "resp25 = glm_mod.predict(means25)\n",
487 | "resp75 = glm_mod.predict(means75)\n",
488 | "diff = resp75 - resp25"
489 | ],
490 | "language": "python",
491 | "metadata": {},
492 | "outputs": []
493 | },
494 | {
495 | "cell_type": "raw",
496 | "metadata": {},
497 | "source": [
498 | "The interquartile first difference for the percentage of low income households in a school district is:"
499 | ]
500 | },
501 | {
502 | "cell_type": "code",
503 | "collapsed": false,
504 | "input": [
505 | "print \"%2.4f%%\" % (diff[0]*100)"
506 | ],
507 | "language": "python",
508 | "metadata": {},
509 | "outputs": []
510 | },
511 | {
512 | "cell_type": "code",
513 | "collapsed": false,
514 | "input": [
515 | "nobs = glm_mod.nobs\n",
516 | "y = glm_mod.model.endog\n",
517 | "yhat = glm_mod.mu"
518 | ],
519 | "language": "python",
520 | "metadata": {},
521 | "outputs": []
522 | },
523 | {
524 | "cell_type": "code",
525 | "collapsed": false,
526 | "input": [
527 | "from statsmodels.graphics.api import abline_plot\n",
528 | "fig = plt.figure(figsize=(12,8))\n",
529 | "ax = fig.add_subplot(111, ylabel='Observed Values', xlabel='Fitted Values')\n",
530 | "ax.scatter(yhat, y)\n",
531 | "y_vs_yhat = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()\n",
532 | "fig = abline_plot(model_results=y_vs_yhat, ax=ax)"
533 | ],
534 | "language": "python",
535 | "metadata": {},
536 | "outputs": []
537 | },
538 | {
539 | "cell_type": "heading",
540 | "level": 4,
541 | "metadata": {},
542 | "source": [
543 | "Plot fitted values vs Pearson residuals"
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 | "Pearson residuals are defined to be \n",
551 | "\n",
552 | "$$\\frac{y - \\mu}{\\sqrt{\\operatorname{var}(\\mu)}}$$\n",
553 | "\n",
554 | "where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$"
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "collapsed": false,
560 | "input": [
561 | "fig = plt.figure(figsize=(12,8))\n",
562 | "ax = fig.add_subplot(111, title='Residual Dependence Plot', xlabel='Fitted Values',\n",
563 | " ylabel='Pearson Residuals')\n",
564 | "ax.scatter(yhat, stats.zscore(glm_mod.resid_pearson))\n",
565 | "ax.axis('tight')\n",
566 | "ax.plot([0.0, 1.0],[0.0, 0.0], 'k-');"
567 | ],
568 | "language": "python",
569 | "metadata": {},
570 | "outputs": []
571 | },
572 | {
573 | "cell_type": "heading",
574 | "level": 4,
575 | "metadata": {},
576 | "source": [
577 | "Histogram of standardized deviance residuals with Kernel Density Estimate overlaid"
578 | ]
579 | },
580 | {
581 | "cell_type": "markdown",
582 | "metadata": {},
583 | "source": [
584 | "The definition of the deviance residuals depends on the family. For the Binomial distribution this is \n",
585 | "\n",
586 | "$$r_{dev} = \\operatorname{sign}(Y-\\mu)\\sqrt{2n\\left(Y\\log\\frac{Y}{\\mu}+(1-Y)\\log\\frac{1-Y}{1-\\mu}\\right)}$$\n",
587 | "\n",
588 | "They can be used to detect ill-fitting covariates."
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "collapsed": false,
594 | "input": [
595 | "resid = glm_mod.resid_deviance\n",
596 | "resid_std = stats.zscore(resid) \n",
597 | "kde_resid = sm.nonparametric.KDEUnivariate(resid_std)\n",
598 | "kde_resid.fit()"
599 | ],
600 | "language": "python",
601 | "metadata": {},
602 | "outputs": []
603 | },
604 | {
605 | "cell_type": "code",
606 | "collapsed": false,
607 | "input": [
608 | "fig = plt.figure(figsize=(12,8))\n",
609 | "ax = fig.add_subplot(111, title=\"Standardized Deviance Residuals\")\n",
610 | "ax.hist(resid_std, bins=25, normed=True);\n",
611 | "ax.plot(kde_resid.support, kde_resid.density, 'r');"
612 | ],
613 | "language": "python",
614 | "metadata": {},
615 | "outputs": []
616 | },
617 | {
618 | "cell_type": "heading",
619 | "level": 4,
620 | "metadata": {},
621 | "source": [
622 | "QQ-plot of deviance residuals"
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "collapsed": false,
628 | "input": [
629 | "fig = plt.figure(figsize=(12,8))\n",
630 | "ax = fig.add_subplot(111)\n",
631 | "fig = sm.graphics.qqplot(resid, line='r', ax=ax)"
632 | ],
633 | "language": "python",
634 | "metadata": {},
635 | "outputs": []
636 | }
637 | ],
638 | "metadata": {}
639 | }
640 | ]
641 | }
--------------------------------------------------------------------------------
/discrete_choice.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Discrete Choice Models - Fair's Affair data
7 |
8 | #
9 |
10 | # In 1974, *Redbook* magazine conducted a survey of women, asking about extramarital affairs.
11 |
12 | #
13 |
14 | import numpy as np
15 | from scipy import stats
16 | import matplotlib.pyplot as plt
17 | import statsmodels.api as sm
18 | from statsmodels.formula.api import logit, probit, poisson, ols
19 |
20 | #
21 |
22 | print sm.datasets.fair.SOURCE
23 |
24 | #
25 |
26 | print sm.datasets.fair.NOTE
27 |
28 | #
29 |
30 | dta = sm.datasets.fair.load_pandas().data
31 |
32 | #
33 |
34 | dta['affair'] = (dta['affairs'] > 0).astype(float)
35 | print dta.head(10)
36 |
37 | #
38 |
39 | print dta.describe()
40 |
41 | #
42 |
43 | affair_mod = logit("affair ~ occupation + educ + occupation_husb"
44 | "+ rate_marriage + age + yrs_married + children"
45 | " + religious", dta).fit()
46 |
47 | #
48 |
49 | print affair_mod.summary()
50 |
51 | #
52 |
53 | # How well are we predicting?
54 |
55 | #
56 |
57 | affair_mod.pred_table()
58 |
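As a quick aside on reading the table: `pred_table()` returns a 2x2 array of counts with observed outcomes on the rows and predicted outcomes on the columns, so overall accuracy is the diagonal sum over the total. A minimal sketch with made-up counts (not the actual output of the model above):

```python
import numpy as np

# Hypothetical 2x2 prediction table: rows are observed 0/1,
# columns are predicted 0/1 (illustrative counts only).
table = np.array([[3500., 500.],
                  [1200., 1166.]])

# Fraction of observations classified correctly.
accuracy = np.diag(table).sum() / table.sum()
print(accuracy)
```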
59 | #
60 |
61 | # The coefficients of the discrete choice model do not tell us much by themselves. What we're after are the marginal effects.
62 |
63 | #
64 |
65 | mfx = affair_mod.get_margeff()
66 | print mfx.summary()
67 |
68 | #
69 |
70 | respondent1000 = dta.ix[1000]
71 | print respondent1000
72 |
73 | #
74 |
75 | resp = dict(zip(range(1,9), respondent1000[["occupation", "educ",
76 | "occupation_husb", "rate_marriage",
77 | "age", "yrs_married", "children",
78 | "religious"]].tolist()))
79 | resp.update({0 : 1})
80 | print resp
81 |
82 | #
83 |
84 | mfx = affair_mod.get_margeff(atexog=resp)
85 | print mfx.summary()
86 |
87 | #
88 |
89 | affair_mod.predict(respondent1000)
90 |
91 | #
92 |
93 | affair_mod.fittedvalues[1000]
94 |
95 | #
96 |
97 | affair_mod.model.cdf(affair_mod.fittedvalues[1000])
98 |
99 | #
100 |
101 | # The "correct" model here is likely the Tobit model. We have a work-in-progress branch "tobit-model" on GitHub, if anyone is interested in censored regression models.
102 |
103 | #
104 |
105 | # Exercise: Logit vs Probit
106 |
107 | #
108 |
109 | fig = plt.figure(figsize=(12,8))
110 | ax = fig.add_subplot(111)
111 | support = np.linspace(-6, 6, 1000)
112 | ax.plot(support, stats.logistic.cdf(support), 'r-', label='Logistic')
113 | ax.plot(support, stats.norm.cdf(support), label='Probit')
114 | ax.legend();
115 |
116 | #
117 |
118 | fig = plt.figure(figsize=(12,8))
119 | ax = fig.add_subplot(111)
120 | support = np.linspace(-6, 6, 1000)
121 | ax.plot(support, stats.logistic.pdf(support), 'r-', label='Logistic')
122 | ax.plot(support, stats.norm.pdf(support), label='Probit')
123 | ax.legend();
124 |
125 | #
126 |
127 | # Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Is there much difference in the marginal effects?
128 |
129 | #
130 |
131 | # Generalized Linear Model Example
132 |
133 | #
134 |
135 | print sm.datasets.star98.SOURCE
136 |
137 | #
138 |
139 | print sm.datasets.star98.DESCRLONG
140 |
141 | #
142 |
143 | print sm.datasets.star98.NOTE
144 |
145 | #
146 |
147 | dta = sm.datasets.star98.load_pandas().data
148 | print dta.columns
149 |
150 | #
151 |
152 | print dta[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP', 'PERMINTE']].head(10)
153 |
154 | #
155 |
156 | print dta[['AVYRSEXP', 'AVSALK', 'PERSPENK', 'PTRATIO', 'PCTAF', 'PCTCHRT', 'PCTYRRND']].head(10)
157 |
158 | #
159 |
160 | formula = 'NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT '
161 | formula += '+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
162 |
163 | #
164 |
165 | # Aside: Binomial distribution
166 |
167 | #
168 |
169 | # Toss a six-sided die 5 times. What's the probability of exactly 2 fours?
170 |
171 | #
172 |
173 | stats.binom(5, 1./6).pmf(2)
174 |
175 | #
176 |
177 | from scipy.misc import comb
178 | comb(5,2) * (1/6.)**2 * (5/6.)**3
179 |
180 | #
181 |
182 | from statsmodels.formula.api import glm
183 | glm_mod = glm(formula, dta, family=sm.families.Binomial()).fit()
184 |
185 | #
186 |
187 | print glm_mod.summary()
188 |
189 | #
190 |
191 | # The number of trials for each observation is the row sum of the endogenous variables:
192 |
193 | #
194 |
195 | glm_mod.model.data.orig_endog.sum(1)
196 |
197 | #
198 |
199 | glm_mod.fittedvalues * glm_mod.model.data.orig_endog.sum(1)
200 |
201 | #
202 |
203 | # First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact
204 | # on the response variables:
205 |
206 | #
207 |
208 | exog = glm_mod.model.data.orig_exog # get the dataframe
209 |
210 | #
211 |
212 | means25 = exog.mean()
213 | print means25
214 |
215 | #
216 |
217 | means25['LOWINC'] = exog['LOWINC'].quantile(.25)
218 | print means25
219 |
220 | #
221 |
222 | means75 = exog.mean()
223 | means75['LOWINC'] = exog['LOWINC'].quantile(.75)
224 | print means75
225 |
226 | #
227 |
228 | resp25 = glm_mod.predict(means25)
229 | resp75 = glm_mod.predict(means75)
230 | diff = resp75 - resp25
231 |
232 | #
233 |
234 | # The interquartile first difference for the percentage of low income households in a school district is:
235 |
236 | #
237 |
238 | print "%2.4f%%" % (diff[0]*100)
239 |
240 | #
241 |
242 | nobs = glm_mod.nobs
243 | y = glm_mod.model.endog
244 | yhat = glm_mod.mu
245 |
246 | #
247 |
248 | from statsmodels.graphics.api import abline_plot
249 | fig = plt.figure(figsize=(12,8))
250 | ax = fig.add_subplot(111, ylabel='Observed Values', xlabel='Fitted Values')
251 | ax.scatter(yhat, y)
252 | y_vs_yhat = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()
253 | fig = abline_plot(model_results=y_vs_yhat, ax=ax)
254 |
255 | #
256 |
257 | # Plot fitted values vs Pearson residuals
258 |
259 | #
260 |
261 | # Pearson residuals are defined to be
262 | #
263 | # $$\frac{y - \mu}{\sqrt{\operatorname{var}(\mu)}}$$
264 | #
265 | # where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$
266 |
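To make the formula concrete, here is a hand computation of Pearson residuals for binomial counts, using made-up numbers rather than the model's data:

```python
import numpy as np

# Made-up data: y successes out of n trials, with fitted probabilities p.
n = np.array([40., 35., 50., 45., 30.])
p = np.array([0.3, 0.5, 0.6, 0.4, 0.7])
y = np.array([14., 17., 27., 19., 24.])

mu = n * p             # fitted mean count
var = n * p * (1 - p)  # binomial variance n*p*(1 - p)
resid_pearson = (y - mu) / np.sqrt(var)
print(resid_pearson)
```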
267 | #
268 |
269 | fig = plt.figure(figsize=(12,8))
270 | ax = fig.add_subplot(111, title='Residual Dependence Plot', xlabel='Fitted Values',
271 | ylabel='Pearson Residuals')
272 | ax.scatter(yhat, stats.zscore(glm_mod.resid_pearson))
273 | ax.axis('tight')
274 | ax.plot([0.0, 1.0],[0.0, 0.0], 'k-');
275 |
276 | #
277 |
278 | # Histogram of standardized deviance residuals with Kernel Density Estimate overlaid
279 |
280 | #
281 |
282 | # The definition of the deviance residuals depends on the family. For the Binomial distribution this is
283 | #
284 | # $$r_{dev} = \operatorname{sign}(Y-\mu)\sqrt{2n\left(Y\log\frac{Y}{\mu}+(1-Y)\log\frac{1-Y}{1-\mu}\right)}$$
285 | #
286 | # They can be used to detect ill-fitting covariates.
287 |
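A hand computation of the binomial deviance residual formula, again with made-up numbers rather than the model's data (`Y` is the observed proportion of successes out of `n` trials, `mu` the fitted probability):

```python
import numpy as np

# Made-up data, for illustration only.
n = np.array([40., 35., 50., 45., 30.])
mu = np.array([0.30, 0.50, 0.60, 0.40, 0.70])
Y = np.array([0.35, 0.49, 0.54, 0.42, 0.80])

# Deviance residual: sign(Y - mu) times the square root of
# 2n[Y log(Y/mu) + (1 - Y) log((1 - Y)/(1 - mu))].
dev = 2 * n * (Y * np.log(Y / mu) + (1 - Y) * np.log((1 - Y) / (1 - mu)))
resid_dev = np.sign(Y - mu) * np.sqrt(dev)
print(resid_dev)
```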
288 | #
289 |
290 | resid = glm_mod.resid_deviance
291 | resid_std = stats.zscore(resid)
292 | kde_resid = sm.nonparametric.KDEUnivariate(resid_std)
293 | kde_resid.fit()
294 |
295 | #
296 |
297 | fig = plt.figure(figsize=(12,8))
298 | ax = fig.add_subplot(111, title="Standardized Deviance Residuals")
299 | ax.hist(resid_std, bins=25, normed=True);
300 | ax.plot(kde_resid.support, kde_resid.density, 'r');
301 |
302 | #
303 |
304 | # QQ-plot of deviance residuals
305 |
306 | #
307 |
308 | fig = plt.figure(figsize=(12,8))
309 | ax = fig.add_subplot(111)
310 | fig = sm.graphics.qqplot(resid, line='r', ax=ax)
311 |
312 |
--------------------------------------------------------------------------------
/generic_mle.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "generic_mle"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "code",
12 | "collapsed": false,
13 | "input": [
14 | "import numpy as np\n",
15 | "from scipy import stats\n",
16 | "import statsmodels.api as sm\n",
17 | "from statsmodels.base.model import GenericLikelihoodModel"
18 | ],
19 | "language": "python",
20 | "metadata": {},
21 | "outputs": []
22 | },
23 | {
24 | "cell_type": "code",
25 | "collapsed": false,
26 | "input": [
27 | "print sm.datasets.spector.NOTE"
28 | ],
29 | "language": "python",
30 | "metadata": {},
31 | "outputs": []
32 | },
33 | {
34 | "cell_type": "code",
35 | "collapsed": false,
36 | "input": [
37 | "data = sm.datasets.spector.load_pandas()\n",
38 | "exog = sm.add_constant(data.exog, prepend=True)\n",
39 | "endog = data.endog"
40 | ],
41 | "language": "python",
42 | "metadata": {},
43 | "outputs": []
44 | },
45 | {
46 | "cell_type": "code",
47 | "collapsed": false,
48 | "input": [
49 | "sm_probit = sm.Probit(endog, exog).fit()"
50 | ],
51 | "language": "python",
52 | "metadata": {},
53 | "outputs": []
54 | },
55 | {
56 | "cell_type": "raw",
57 | "metadata": {},
58 | "source": [
59 | "* To create your own Likelihood Model, you just need to override the loglike method."
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "collapsed": false,
65 | "input": [
66 | "class MyProbit(GenericLikelihoodModel):\n",
67 | " def loglike(self, params):\n",
68 | " exog = self.exog\n",
69 | " endog = self.endog\n",
70 | " q = 2 * endog - 1\n",
71 | " return stats.norm.logcdf(q*np.dot(exog, params)).sum()"
72 | ],
73 | "language": "python",
74 | "metadata": {},
75 | "outputs": []
76 | },
77 | {
78 | "cell_type": "code",
79 | "collapsed": false,
80 | "input": [
81 | "my_probit = MyProbit(endog, exog).fit()"
82 | ],
83 | "language": "python",
84 | "metadata": {},
85 | "outputs": []
86 | },
87 | {
88 | "cell_type": "code",
89 | "collapsed": false,
90 | "input": [
91 | "print sm_probit.params"
92 | ],
93 | "language": "python",
94 | "metadata": {},
95 | "outputs": []
96 | },
97 | {
98 | "cell_type": "code",
99 | "collapsed": false,
100 | "input": [
101 | "print sm_probit.cov_params()"
102 | ],
103 | "language": "python",
104 | "metadata": {},
105 | "outputs": []
106 | },
107 | {
108 | "cell_type": "code",
109 | "collapsed": false,
110 | "input": [
111 | "print my_probit.params"
112 | ],
113 | "language": "python",
114 | "metadata": {},
115 | "outputs": []
116 | },
117 | {
118 | "cell_type": "raw",
119 | "metadata": {},
120 | "source": [
121 | "You can get the variance-covariance matrix of the parameters. Notice that we didn't have to provide Hessian or score functions."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "collapsed": false,
127 | "input": [
128 | "print my_probit.cov_params()"
129 | ],
130 | "language": "python",
131 | "metadata": {},
132 | "outputs": []
133 | }
134 | ],
135 | "metadata": {}
136 | }
137 | ]
138 | }
--------------------------------------------------------------------------------
/generic_mle.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | import numpy as np
7 | from scipy import stats
8 | import statsmodels.api as sm
9 | from statsmodels.base.model import GenericLikelihoodModel
10 |
11 | #
12 |
13 | print sm.datasets.spector.NOTE
14 |
15 | #
16 |
17 | data = sm.datasets.spector.load_pandas()
18 | exog = sm.add_constant(data.exog, prepend=True)
19 | endog = data.endog
20 |
21 | #
22 |
23 | sm_probit = sm.Probit(endog, exog).fit()
24 |
25 | #
26 |
27 | # * To create your own Likelihood Model, you just need to override the loglike method.
28 |
29 | #
30 |
31 | class MyProbit(GenericLikelihoodModel):
32 | def loglike(self, params):
33 | exog = self.exog
34 | endog = self.endog
35 | q = 2 * endog - 1
36 | return stats.norm.logcdf(q*np.dot(exog, params)).sum()
37 |
38 | #
39 |
40 | my_probit = MyProbit(endog, exog).fit()
41 |
42 | #
43 |
44 | print sm_probit.params
45 |
46 | #
47 |
48 | print sm_probit.cov_params()
49 |
50 | #
51 |
52 | print my_probit.params
53 |
54 | #
55 |
56 | # You can get the variance-covariance matrix of the parameters. Notice that we didn't have to provide Hessian or score functions.
57 |
58 | #
59 |
60 | print my_probit.cov_params()
61 |
62 |
--------------------------------------------------------------------------------
/kernel_density.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "kernel_density"
4 | },
5 | "nbformat": 3,
6 | "nbformat_minor": 0,
7 | "worksheets": [
8 | {
9 | "cells": [
10 | {
11 | "cell_type": "heading",
12 | "level": 3,
13 | "metadata": {},
14 | "source": [
15 | "Kernel Density Estimation"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "collapsed": false,
21 | "input": [
22 | "import numpy as np\n",
23 | "from scipy import stats\n",
24 | "import statsmodels.api as sm\n",
25 | "import matplotlib.pyplot as plt\n",
26 | "from statsmodels.distributions.mixture_rvs import mixture_rvs"
27 | ],
28 | "language": "python",
29 | "metadata": {},
30 | "outputs": []
31 | },
32 | {
33 | "cell_type": "heading",
34 | "level": 4,
35 | "metadata": {},
36 | "source": [
37 | "A univariate example."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "collapsed": false,
43 | "input": [
44 | "np.random.seed(12345)"
45 | ],
46 | "language": "python",
47 | "metadata": {},
48 | "outputs": []
49 | },
50 | {
51 | "cell_type": "code",
52 | "collapsed": false,
53 | "input": [
54 | "obs_dist1 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.norm],\n",
55 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))"
56 | ],
57 | "language": "python",
58 | "metadata": {},
59 | "outputs": []
60 | },
61 | {
62 | "cell_type": "code",
63 | "collapsed": false,
64 | "input": [
65 | "kde = sm.nonparametric.KDEUnivariate(obs_dist1)\n",
66 | "kde.fit()"
67 | ],
68 | "language": "python",
69 | "metadata": {},
70 | "outputs": []
71 | },
72 | {
73 | "cell_type": "code",
74 | "collapsed": false,
75 | "input": [
76 | "fig = plt.figure(figsize=(12,8))\n",
77 | "ax = fig.add_subplot(111)\n",
78 | "ax.hist(obs_dist1, bins=50, normed=True, color='red')\n",
79 | "ax.plot(kde.support, kde.density, lw=2, color='black');"
80 | ],
81 | "language": "python",
82 | "metadata": {},
83 | "outputs": []
84 | },
85 | {
86 | "cell_type": "code",
87 | "collapsed": false,
88 | "input": [
89 | "obs_dist2 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.beta],\n",
90 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=1,args=(1,.5))))\n",
91 | "\n",
92 | "kde2 = sm.nonparametric.KDEUnivariate(obs_dist2)\n",
93 | "kde2.fit()"
94 | ],
95 | "language": "python",
96 | "metadata": {},
97 | "outputs": []
98 | },
99 | {
100 | "cell_type": "code",
101 | "collapsed": false,
102 | "input": [
103 | "fig = plt.figure(figsize=(12,8))\n",
104 | "ax = fig.add_subplot(111)\n",
105 | "ax.hist(obs_dist2, bins=50, normed=True, color='red')\n",
106 | "ax.plot(kde2.support, kde2.density, lw=2, color='black');"
107 | ],
108 | "language": "python",
109 | "metadata": {},
110 | "outputs": []
111 | },
112 | {
113 | "cell_type": "raw",
114 | "metadata": {},
115 | "source": [
116 | "The fitted KDE object is a full non-parametric distribution."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "collapsed": false,
122 | "input": [
123 | "obs_dist3 = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],\n",
124 | " kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))\n",
125 | "kde3 = sm.nonparametric.KDEUnivariate(obs_dist3)\n",
126 | "kde3.fit()"
127 | ],
128 | "language": "python",
129 | "metadata": {},
130 | "outputs": []
131 | },
132 | {
133 | "cell_type": "code",
134 | "collapsed": false,
135 | "input": [
136 | "kde3.entropy"
137 | ],
138 | "language": "python",
139 | "metadata": {},
140 | "outputs": []
141 | },
142 | {
143 | "cell_type": "code",
144 | "collapsed": false,
145 | "input": [
146 | "kde3.evaluate(-1)"
147 | ],
148 | "language": "python",
149 | "metadata": {},
150 | "outputs": []
151 | },
152 | {
153 | "cell_type": "heading",
154 | "level": 4,
155 | "metadata": {},
156 | "source": [
157 | "CDF"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "collapsed": false,
163 | "input": [
164 | "fig = plt.figure(figsize=(12,8))\n",
165 | "ax = fig.add_subplot(111)\n",
166 | "ax.plot(kde3.support, kde3.cdf);"
167 | ],
168 | "language": "python",
169 | "metadata": {},
170 | "outputs": []
171 | },
172 | {
173 | "cell_type": "heading",
174 | "level": 4,
175 | "metadata": {},
176 | "source": [
177 | "Cumulative Hazard Function"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "collapsed": false,
183 | "input": [
184 | "fig = plt.figure(figsize=(12,8))\n",
185 | "ax = fig.add_subplot(111)\n",
186 | "ax.plot(kde3.support, kde3.cumhazard);"
187 | ],
188 | "language": "python",
189 | "metadata": {},
190 | "outputs": []
191 | },
192 | {
193 | "cell_type": "heading",
194 | "level": 4,
195 | "metadata": {},
196 | "source": [
197 | "Inverse CDF"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "collapsed": false,
203 | "input": [
204 | "fig = plt.figure(figsize=(12,8))\n",
205 | "ax = fig.add_subplot(111)\n",
206 | "ax.plot(kde3.support, kde3.icdf);"
207 | ],
208 | "language": "python",
209 | "metadata": {},
210 | "outputs": []
211 | },
212 | {
213 | "cell_type": "heading",
214 | "level": 4,
215 | "metadata": {},
216 | "source": [
217 | "Survival Function"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "collapsed": false,
223 | "input": [
224 | "fig = plt.figure(figsize=(12,8))\n",
225 | "ax = fig.add_subplot(111)\n",
226 | "ax.plot(kde3.support, kde3.sf);"
227 | ],
228 | "language": "python",
229 | "metadata": {},
230 | "outputs": []
231 | }
232 | ],
233 | "metadata": {}
234 | }
235 | ]
236 | }
237 |
--------------------------------------------------------------------------------
/kernel_density.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Kernel Density Estimation
7 |
8 | #
9 |
10 | import numpy as np
11 | from scipy import stats
12 | import statsmodels.api as sm
12 | import matplotlib.pyplot as plt
13 | from statsmodels.distributions.mixture_rvs import mixture_rvs
14 |
15 | #
16 |
17 | # A univariate example.
18 |
19 | #
20 |
21 | np.random.seed(12345)
22 |
23 | #
24 |
25 | obs_dist1 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.norm],
26 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))
27 |
28 | #
29 |
30 | kde = sm.nonparametric.KDEUnivariate(obs_dist1)
31 | kde.fit()
32 |
33 | #
34 |
35 | fig = plt.figure(figsize=(12,8))
36 | ax = fig.add_subplot(111)
37 | ax.hist(obs_dist1, bins=50, normed=True, color='red')
38 | ax.plot(kde.support, kde.density, lw=2, color='black');
39 |
40 | #
41 |
42 | obs_dist2 = mixture_rvs([.25,.75], size=10000, dist=[stats.norm, stats.beta],
43 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=1,args=(1,.5))))
44 |
45 | kde2 = sm.nonparametric.KDEUnivariate(obs_dist2)
46 | kde2.fit()
47 |
48 | #
49 |
50 | fig = plt.figure(figsize=(12,8))
51 | ax = fig.add_subplot(111)
52 | ax.hist(obs_dist2, bins=50, normed=True, color='red')
53 | ax.plot(kde2.support, kde2.density, lw=2, color='black');
54 |
55 | #
56 |
57 | # The fitted KDE object is a full non-parametric distribution.
58 |
59 | #
60 |
61 | obs_dist3 = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],
62 | kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))
63 | kde3 = sm.nonparametric.KDEUnivariate(obs_dist3)
64 | kde3.fit()
65 |
66 | #
67 |
68 | kde3.entropy
69 |
70 | #
71 |
72 | kde3.evaluate(-1)
73 |
74 | #
75 |
76 | # CDF
77 |
78 | #
79 |
80 | fig = plt.figure(figsize=(12,8))
81 | ax = fig.add_subplot(111)
82 | ax.plot(kde3.support, kde3.cdf);
83 |
84 | #
85 |
86 | # Cumulative Hazard Function
87 |
88 | #
89 |
90 | fig = plt.figure(figsize=(12,8))
91 | ax = fig.add_subplot(111)
92 | ax.plot(kde3.support, kde3.cumhazard);
93 |
94 | #
95 |
96 | # Inverse CDF
97 |
98 | #
99 |
100 | fig = plt.figure(figsize=(12,8))
101 | ax = fig.add_subplot(111)
102 | ax.plot(kde3.support, kde3.icdf);
103 |
104 | #
105 |
106 | # Survival Function
107 |
108 | #
109 |
110 | fig = plt.figure(figsize=(12,8))
111 | ax = fig.add_subplot(111)
112 | ax.plot(kde3.support, kde3.sf);
113 |
114 |
--------------------------------------------------------------------------------
/linear_models.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # This notebook introduces the use of pandas and the formula framework in statsmodels in the context of linear modeling.
7 |
8 | #
9 |
10 | # **It is based heavily on Jonathan Taylor's [class notes that use R](http://www.stanford.edu/class/stats191/interactions.html)**
11 |
12 | #
13 |
14 | import matplotlib.pyplot as plt
15 | import pandas
16 | import numpy as np
17 |
18 | from statsmodels.formula.api import ols
19 | from statsmodels.graphics.api import interaction_plot, abline_plot, qqplot
20 | from statsmodels.stats.api import anova_lm
21 |
22 | #
23 |
24 | # Example 1: IT salary data
25 |
26 | #
27 |
28 | # Outcome: S, salaries for IT staff in a corporation
29 | # Predictors: X, experience in years
30 | # M, management, 2 levels, 0=non-management, 1=management
31 | # E, education, 3 levels, 1=Bachelor's, 2=Master's, 3=Ph.D.
32 |
33 | #
34 |
35 | url = 'http://stats191.stanford.edu/data/salary.table'
36 | salary_table = pandas.read_table(url) # needs pandas 0.7.3
37 | salary_table.to_csv('salary.table', index=False)
38 |
39 | #
40 |
41 | print salary_table.head(10)
42 |
43 | #
44 |
45 | E = salary_table.E # Education
46 | M = salary_table.M # Management
47 | X = salary_table.X # Experience
48 | S = salary_table.S # Salary
49 |
50 | #
51 |
52 | # Let's explore the data
53 |
54 | #
55 |
56 | fig = plt.figure(figsize=(10,8))
57 | ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary',
58 | xlim=(0, 27), ylim=(9600, 28800))
59 | symbols = ['D', '^']
60 | man_label = ["Non-Mgmt", "Mgmt"]
61 | educ_label = ["Bachelors", "Masters", "PhD"]
62 | colors = ['r', 'g', 'blue']
63 | factor_groups = salary_table.groupby(['E','M'])
64 | for values, group in factor_groups:
65 | i,j = values
66 | label = "%s - %s" % (man_label[j], educ_label[i-1])
67 | ax.scatter(group['X'], group['S'], marker=symbols[j], color=colors[i-1],
68 | s=350, label=label)
69 | ax.legend(scatterpoints=1, markerscale=.7, labelspacing=1);
70 |
71 | #
72 |
73 | # Fit a linear model
74 | #
75 | # $$S_i = \beta_0 + \beta_1X_i + \beta_2E_{i2} + \beta_3E_{i3} + \beta_4M_i + \epsilon_i$$
76 | #
77 | # where
78 | #
79 | # $$ E_{i2}=\cases{1,&if $E_i=2$;\cr 0,&otherwise. \cr}$$
80 | # $$ E_{i3}=\cases{1,&if $E_i=3$;\cr 0,&otherwise. \cr}$$
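A minimal by-hand sketch of the treatment (dummy) coding that patsy builds for `C(E)` — the education values here are hypothetical, not the salary data:

```python
import numpy as np

# Treatment coding for a 3-level factor: level 1 is the base category,
# and E2, E3 are the indicator columns for levels 2 and 3.
E = np.array([1, 2, 3, 2, 1, 3])          # hypothetical education levels
E2 = (E == 2).astype(int)
E3 = (E == 3).astype(int)
design = np.column_stack([np.ones_like(E), E2, E3])  # intercept + dummies
print(design)
```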
81 |
82 | #
83 |
84 | formula = 'S ~ C(E) + C(M) + X'
85 | lm = ols(formula, salary_table).fit()
86 | print lm.summary()
87 |
88 | #
89 |
90 | # Aside: Contrasts (see contrasts notebook)
91 |
92 | #
93 |
94 | # Look at the design matrix created for us. Every results instance has a reference to the model.
95 |
96 | #
97 |
98 | lm.model.exog[:10]
99 |
100 | #
101 |
102 | # Since we initially passed in a DataFrame, we have a transformed DataFrame available.
103 |
104 | #
105 |
106 | print lm.model.data.orig_exog.head(10)
107 |
108 | #
109 |
110 | # There is a reference to the original untouched data in
111 |
112 | #
113 |
114 | print lm.model.data.frame.head(10)
115 |
116 | #
117 |
118 | # If you use the formula interface, statsmodels remembers this transformation. Say you want to know the predicted salary for someone with 12 years of experience and a Master's degree who is in a management position.
119 |
120 | #
121 |
122 | lm.predict({'X' : [12], 'M' : [1], 'E' : [2]})
123 |
124 | #
125 |
126 | # So far we've assumed that the effect of experience is the same for each level of education and professional role.
127 | # Perhaps this assumption isn't merited. We can formally test this using some interactions.
128 |
129 | #
130 |
131 | # We can start by seeing if our model assumptions are met. Let's look at a residuals plot.
132 |
133 | #
134 |
135 | # And some formal tests
136 |
137 | #
138 |
139 | # Plot the residuals within the groups separately.
140 |
141 | #
142 |
143 | resid = lm.resid
144 |
145 | #
146 |
147 | fig = plt.figure(figsize=(12,8))
148 | xticks = []
149 | ax = fig.add_subplot(111, xlabel='Group (E, M)', ylabel='Residuals')
150 | for values, group in factor_groups:
151 | i,j = values
152 | xticks.append(str((i, j)))
153 | group_num = i*2 + j - 1 # for plotting purposes
154 | x = [group_num] * len(group)
155 | ax.scatter(x, resid[group.index], marker=symbols[j], color=colors[i-1],
156 | s=144, edgecolors='black')
157 | ax.set_xticks([1,2,3,4,5,6])
158 | ax.set_xticklabels(xticks)
159 | ax.axis('tight');
160 |
161 | #
162 |
163 | # Add an interaction between education and experience, allowing a different slope in experience for each level of education.
164 | #
165 | # $$S_i = \beta_0+\beta_1X_i+\beta_2E_{i2}+\beta_3E_{i3}+\beta_4M_i+\beta_5E_{i2}X_i+\beta_6E_{i3}X_i+\epsilon_i$$
166 |
167 | #
168 |
169 | interX_lm = ols('S ~ C(E)*X + C(M)', salary_table).fit()
170 | print interX_lm.summary()
171 |
172 | #
173 |
174 | # Test that $\beta_5 = \beta_6 = 0$. We can use anova_lm or we can use an F-test.
175 |
176 | #
177 |
178 | print anova_lm(lm, interX_lm)
179 |
180 | #
181 |
182 | print interX_lm.f_test('C(E)[T.2]:X = C(E)[T.3]:X = 0')
183 |
184 | #
185 |
186 | print interX_lm.f_test([[0,0,0,0,0,1,-1],[0,0,0,0,0,0,1]])
187 |
188 | #
189 |
190 | # The contrasts are created here under the hood by patsy.
191 |
192 | #
193 |
194 | # Recall that F-tests are of the form $R\beta = q$
195 |
196 | #
197 |
198 | LC = interX_lm.model.data.orig_exog.design_info.linear_constraint('C(E)[T.2]:X = C(E)[T.3]:X = 0')
199 | print LC.coefs
200 | print LC.constants
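The F statistic itself can be built by hand. A sketch on synthetic data (an assumption here, not the salary data): the Wald form $\left(R\hat\beta - q\right)'\left[R\,\widehat{cov}(\hat\beta)\,R'\right]^{-1}\left(R\hat\beta - q\right)/J$ agrees exactly with the restricted-vs-full RSS comparison in OLS.

```python
import numpy as np

# F-test of R beta = q two ways: the Wald form and the nested-RSS form.
rng = np.random.RandomState(0)
n = 100
X = np.column_stack([np.ones(n), rng.randn(n, 2)])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.randn(n)

beta, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)
rss_full = resid[0]
sigma2 = rss_full / (n - X.shape[1])       # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)      # cov(beta_hat)

# restriction: beta_2 = 0, i.e. R = [0 0 1], q = 0, J = 1
R = np.array([[0.0, 0.0, 1.0]])
q = np.zeros(1)
diff = R @ beta - q
F_wald = float(diff @ np.linalg.solve(R @ cov @ R.T, diff)) / R.shape[0]

# restricted model: drop the last column of X
_, resid_r, _, _ = np.linalg.lstsq(X[:, :2], y, rcond=None)
F_rss = ((resid_r[0] - rss_full) / R.shape[0]) / sigma2

print(F_wald, F_rss)   # the two forms coincide for linear restrictions
```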
201 |
202 | #
203 |
204 | # Interact education with management
205 |
206 | #
207 |
208 | interM_lm = ols('S ~ X + C(E)*C(M)', salary_table).fit()
209 | print interM_lm.summary()
210 |
211 | #
212 |
213 | print anova_lm(lm, interM_lm)
214 |
215 | #
216 |
217 | infl = interM_lm.get_influence()
218 | resid = infl.resid_studentized_internal
219 |
220 | #
221 |
222 | fig = plt.figure(figsize=(12,8))
223 | ax = fig.add_subplot(111, xlabel='X', ylabel='standardized resids')
224 |
225 | for values, group in factor_groups:
226 | i,j = values
227 | idx = group.index
228 | ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],
229 | s=144, edgecolors='black')
230 | ax.axis('tight');
231 |
232 | #
233 |
234 | # There looks to be an outlier.
235 |
236 | #
237 |
238 | outl = interM_lm.outlier_test('fdr_bh')
239 | outl.sort('unadj_p', inplace=True)
240 | print outl
241 |
242 | #
243 |
244 | idx = salary_table.index.drop(32)
245 |
246 | #
247 |
248 | print idx
249 |
250 | #
251 |
252 | lm32 = ols('S ~ C(E) + X + C(M)', data=salary_table, subset=idx).fit()
253 | print lm32.summary()
254 |
255 | #
256 |
257 | interX_lm32 = ols('S ~ C(E) * X + C(M)', data=salary_table, subset=idx).fit()
258 | print interX_lm32.summary()
259 |
260 | #
261 |
262 | table3 = anova_lm(lm32, interX_lm32)
263 | print table3
264 |
265 | #
266 |
267 | interM_lm32 = ols('S ~ X + C(E) * C(M)', data=salary_table, subset=idx).fit()
268 | print anova_lm(lm32, interM_lm32)
269 |
270 | #
271 |
272 | # Re-plotting the residuals
273 |
274 | #
275 |
276 | resid = interM_lm32.get_influence().summary_frame()['standard_resid']
277 | fig = plt.figure(figsize=(12,8))
278 | ax = fig.add_subplot(111, xlabel='X[~[32]]', ylabel='standardized resids')
279 |
280 | for values, group in factor_groups:
281 | i,j = values
282 | idx = group.index
283 | ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],
284 | s=144, edgecolors='black')
285 | ax.axis('tight');
286 |
287 | #
288 |
289 | # A final plot of the fitted values
290 |
291 | #
292 |
293 | lm_final = ols('S ~ X + C(E)*C(M)', data=salary_table.drop([32])).fit()
294 | mf = lm_final.model.data.orig_exog
295 | lstyle = ['-','--']
296 |
297 | fig = plt.figure(figsize=(12,8))
298 | ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary')
299 |
300 | for values, group in factor_groups:
301 | i,j = values
302 | idx = group.index
303 | ax.scatter(X[idx], S[idx], marker=symbols[j], color=colors[i-1],
304 | s=144, edgecolors='black')
305 | # drop NA because there is no idx 32 in the final model
306 | ax.plot(mf.X[idx].dropna(), lm_final.fittedvalues[idx].dropna(),
307 | ls=lstyle[j], color=colors[i-1])
308 | ax.axis('tight');
309 |
310 | #
311 |
312 | # From our first look at the data, the difference between Master's and Ph.D. in the management group differs from that in the non-management group. This is an interaction between the two qualitative variables, management (M) and education (E). We can visualize it by first removing the effect of experience and then plotting the means within each of the 6 groups using interaction_plot.
313 |
314 | #
315 |
316 | U = S - X * interX_lm32.params['X']
317 | U.name = 'Salary|X'
318 |
319 | fig = plt.figure(figsize=(12,8))
320 | ax = fig.add_subplot(111)
321 | ax = interaction_plot(E, M, U, colors=['red','blue'], markers=['^','D'],
322 | markersize=10, ax=ax)
323 |
324 | #
325 |
326 | # Minority Employment Data - ABLine plotting
327 |
328 | #
329 |
330 | # TEST - Job Aptitude Test Score
331 | # ETHN - 1 if minority, 0 otherwise
332 | # JPERF - Job performance evaluation
333 |
334 | #
335 |
336 | try:
337 | minority_table = pandas.read_table('minority.table')
338 | except: # don't have data already
339 | url = 'http://stats191.stanford.edu/data/minority.table'
340 | minority_table = pandas.read_table(url)
341 | minority_table.to_csv('minority.table', sep="\t", index=False)
342 |
343 | #
344 |
345 | factor_group = minority_table.groupby(['ETHN'])
346 |
347 | fig = plt.figure(figsize=(12,8))
348 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
349 | colors = ['purple', 'green']
350 | markers = ['o', 'v']
351 | for factor, group in factor_group:
352 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
353 | marker=markers[factor], s=12**2)
354 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1)
355 |
356 | #
357 |
358 | min_lm = ols('JPERF ~ TEST', data=minority_table).fit()
359 | print min_lm.summary()
360 |
361 | #
362 |
363 | fig = plt.figure(figsize=(12,8))
364 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
365 | for factor, group in factor_group:
366 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
367 | marker=markers[factor], s=12**2)
368 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left')
369 | fig = abline_plot(model_results = min_lm, ax=ax)
370 |
371 | #
372 |
373 | min_lm2 = ols('JPERF ~ TEST + TEST:ETHN', data=minority_table).fit()
374 | print min_lm2.summary()
375 |
376 | #
377 |
378 | fig = plt.figure(figsize=(12,8))
379 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
380 | for factor, group in factor_group:
381 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
382 | marker=markers[factor], s=12**2)
383 |
384 | fig = abline_plot(intercept = min_lm2.params['Intercept'],
385 | slope = min_lm2.params['TEST'], ax=ax, color='purple')
386 | ax = fig.axes[0]
387 | fig = abline_plot(intercept = min_lm2.params['Intercept'],
388 | slope = min_lm2.params['TEST'] + min_lm2.params['TEST:ETHN'],
389 | ax=ax, color='green')
390 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left');
391 |
392 | #
393 |
394 | min_lm3 = ols('JPERF ~ TEST + ETHN', data=minority_table).fit()
395 | print min_lm3.summary()
396 |
397 | #
398 |
399 | fig = plt.figure(figsize=(12,8))
400 | ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')
401 | for factor, group in factor_group:
402 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
403 | marker=markers[factor], s=12**2)
404 |
405 | fig = abline_plot(intercept = min_lm3.params['Intercept'],
406 | slope = min_lm3.params['TEST'], ax=ax, color='purple')
407 |
408 | ax = fig.axes[0]
409 | fig = abline_plot(intercept = min_lm3.params['Intercept'] + min_lm3.params['ETHN'],
410 | slope = min_lm3.params['TEST'], ax=ax, color='green')
411 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left');
412 |
413 | #
414 |
415 | min_lm4 = ols('JPERF ~ TEST * ETHN', data=minority_table).fit()
416 | print min_lm4.summary()
417 |
418 | #
419 |
420 | fig = plt.figure(figsize=(12,8))
421 | ax = fig.add_subplot(111, ylabel='JPERF', xlabel='TEST')
422 | for factor, group in factor_group:
423 | ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
424 | marker=markers[factor], s=12**2)
425 |
426 | fig = abline_plot(intercept = min_lm4.params['Intercept'],
427 | slope = min_lm4.params['TEST'], ax=ax, color='purple')
428 | ax = fig.axes[0]
429 | fig = abline_plot(intercept = min_lm4.params['Intercept'] + min_lm4.params['ETHN'],
430 | slope = min_lm4.params['TEST'] + min_lm4.params['TEST:ETHN'],
431 | ax=ax, color='green')
432 | ax.legend(['ETHN == 0', 'ETHN == 1'], scatterpoints=1, loc='upper left');
433 |
434 | #
435 |
436 | # Is there any effect of ETHN on slope or intercept?
437 | #
438 | # Y ~ TEST vs. Y ~ TEST + ETHN + ETHN:TEST
439 |
440 | #
441 |
442 | table5 = anova_lm(min_lm, min_lm4)
443 | print table5
444 |
445 | #
446 |
447 | # Is there any effect of ETHN on intercept?
448 | #
449 | # Y ~ TEST vs. Y ~ TEST + ETHN
450 |
451 | #
452 |
453 | table6 = anova_lm(min_lm, min_lm3)
454 | print table6
455 |
456 | #
457 |
458 | # Is there any effect of ETHN on slope?
459 | #
460 | # Y ~ TEST vs. Y ~ TEST + ETHN:TEST
461 |
462 | #
463 |
464 | table7 = anova_lm(min_lm, min_lm2)
465 | print table7
466 |
467 | #
468 |
469 | # Is it just the slope or both?
470 | #
471 | # Y ~ TEST + ETHN:TEST vs Y ~ TEST + ETHN + ETHN:TEST
472 |
473 | #
474 |
475 | table8 = anova_lm(min_lm2, min_lm4)
476 | print table8
477 |
478 | #
479 |
480 | # Two Way ANOVA - Kidney failure data
481 |
482 | #
483 |
484 | # Weight - (1,2,3) - Level of weight gain between treatments
485 | # Duration - (1,2) - Level of duration of treatment
486 | # Days - Time of stay in hospital
487 |
488 | #
489 |
490 | try:
491 | kidney_table = pandas.read_table('kidney.table')
492 | except:
493 | url = 'http://stats191.stanford.edu/data/kidney.table'
494 | kidney_table = pandas.read_table(url, delimiter=" *")
495 | kidney_table.to_csv("kidney.table", sep="\t", index=False)
496 |
497 | #
498 |
499 | # Explore the dataset; it's a balanced design
500 | print kidney_table.groupby(['Weight', 'Duration']).size()
501 |
502 | #
503 |
504 | kt = kidney_table
505 | fig = plt.figure(figsize=(10,8))
506 | ax = fig.add_subplot(111)
507 | fig = interaction_plot(kt['Weight'], kt['Duration'], np.log(kt['Days']+1),
508 | colors=['red', 'blue'], markers=['D','^'], ms=10, ax=ax)
509 |
510 | #
511 |
512 | # $$Y_{ijk} = \mu + \alpha_i + \beta_j + \left(\alpha\beta\right)_{ij}+\epsilon_{ijk}$$
513 | #
514 | # with
515 | #
516 | # $$\epsilon_{ijk}\sim N\left(0,\sigma^2\right)$$
517 |
518 | #
519 |
520 | help(anova_lm)
521 |
522 | #
523 |
524 | # Things available in the calling namespace are available in the formula evaluation namespace
525 |
526 | #
527 |
528 | kidney_lm = ols('np.log(Days+1) ~ C(Duration) * C(Weight)', data=kt).fit()
529 |
530 | #
531 |
532 | # ANOVA Type-I Sum of Squares
533 | #
534 | # SS(A) for factor A.
535 | # SS(B|A) for factor B.
536 | # SS(AB|B, A) for interaction AB.
537 |
538 | #
539 |
540 | print anova_lm(kidney_lm)
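The sequential (Type-I) sums of squares can be reproduced as successive RSS reductions over a nested sequence of models. A sketch on synthetic two-factor data (assumed here, not the kidney data):

```python
import numpy as np

# Type-I (sequential) SS as nested RSS differences:
# SS(A) = RSS(1) - RSS(1+A), SS(B|A) = RSS(1+A) - RSS(1+A+B), etc.
def rss(X, y):
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

rng = np.random.RandomState(1)
n = 60
A = rng.randint(0, 2, n)                  # two-level factor A
B = rng.randint(0, 3, n)                  # three-level factor B
y = 1.0 + 0.5 * A + 0.3 * B + rng.randn(n)

one = np.ones((n, 1))
XA = np.column_stack([one, A == 1])                       # 1 + A
XB = np.column_stack([XA, B == 1, B == 2])                # 1 + A + B
XAB = np.column_stack([XB, (A == 1) & (B == 1),           # + A:B dummies
                       (A == 1) & (B == 2)])

ss_A = rss(one, y) - rss(XA, y)
ss_B_given_A = rss(XA, y) - rss(XB, y)
ss_AB = rss(XB, y) - rss(XAB, y)

# the sequential SS and the full-model RSS partition the total SS exactly
total_ss = rss(one, y)
print(ss_A + ss_B_given_A + ss_AB + rss(XAB, y), total_ss)
```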
541 |
542 | #
543 |
544 | # ANOVA Type-II Sum of Squares
545 | #
546 | # SS(A|B) for factor A.
547 | # SS(B|A) for factor B.
548 |
549 | #
550 |
551 | print anova_lm(kidney_lm, typ=2)
552 |
553 | #
554 |
555 | # ANOVA Type-III Sum of Squares
556 | #
557 | # SS(A|B, AB) for factor A.
558 | # SS(B|A, AB) for factor B.
559 |
560 | #
561 |
562 | print anova_lm(ols('np.log(Days+1) ~ C(Duration, Sum) * C(Weight, Poly)',
563 | data=kt).fit(), typ=3)
564 |
565 | #
566 |
567 | # Exercise: Find the 'best' model for the kidney failure dataset
568 |
569 |
--------------------------------------------------------------------------------
/preliminaries.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # 3.0
3 |
4 | #
5 |
6 | # Learn More and Get Help
7 |
8 | #
9 |
10 | # Documentation: http://statsmodels.sf.net
11 | #
12 | # Mailing List: http://groups.google.com/group/pystatsmodels
13 | #
14 | # Use the source: https://github.com/statsmodels/statsmodels
15 |
16 | #
17 |
18 | # Tutorial Import Assumptions
19 |
20 | #
21 |
22 | import numpy as np
23 | import statsmodels.api as sm
24 | import matplotlib.pyplot as plt
25 | import pandas
26 | from scipy import stats
27 |
28 | np.set_printoptions(precision=4, suppress=True)
29 | pandas.set_printoptions(notebook_repr_html=False,
30 | precision=4,
31 | max_columns=12)
32 |
33 | #
34 |
35 | # Statsmodels Import Convention
36 |
37 | #
38 |
39 | import statsmodels.api as sm
40 |
41 | #
42 |
43 | # Import convention for models for which a formula is available.
44 |
45 | #
46 |
47 | from statsmodels.formula.api import ols, rlm, glm  # etc.
48 |
49 | #
50 |
51 | # Package Overview
52 |
53 | #
54 |
55 | # Regression models in statsmodels.regression
56 |
57 | #
58 |
59 | # Discrete choice models in statsmodels.discrete
60 |
61 | #
62 |
63 | # Robust linear models in statsmodels.robust
64 |
65 | #
66 |
67 | # Generalized linear models in statsmodels.genmod
68 |
69 | #
70 |
71 | # Time Series Analysis in statsmodels.tsa
72 |
73 | #
74 |
75 | # Nonparametric models in statsmodels.nonparametric
76 |
77 | #
78 |
79 | # Plotting functions in statsmodels.graphics
80 |
81 | #
82 |
83 | # Input/Output in statsmodels.iolib (Foreign data, ascii, HTML, $\LaTeX$ tables)
84 |
85 | #
86 |
87 | # Statistical tests, ANOVA in statsmodels.stats
88 |
89 | #