├── .gitignore ├── 2002FemPreg.dat.gz ├── 2002FemPreg.dct ├── LICENSE ├── README.md ├── _config.yml ├── check_env.py ├── cumulative_snowfall.png ├── effect_size.ipynb ├── effect_size_soln.ipynb ├── environment.yml ├── first.py ├── hypothesis.ipynb ├── hypothesis.py ├── hypothesis_soln.ipynb ├── hypothesis_testing.pdf ├── hypothesis_testing.png ├── hypothesis_testing.svg ├── hypothesis_testing_small.png ├── look_and_say.ipynb ├── lyrics-elvis-presley.txt ├── nsfg.py ├── nsfg2.py ├── pg2591.txt ├── pmf_intro.ipynb ├── resampling.ipynb ├── resampling.pdf ├── resampling.png ├── resampling.svg ├── resampling_small.png ├── sampling.ipynb ├── sampling_soln.ipynb ├── text_analysis.ipynb ├── the_fault_in_our_stars.txt ├── thinkplot.py ├── thinkstats2.py ├── tumbleweed.ipynb └── tutorial.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | lib/ 17 | lib64/ 18 | parts/ 19 | sdist/ 20 | var/ 21 | *.egg-info/ 22 | .installed.cfg 23 | *.egg 24 | 25 | # PyInstaller 26 | # Usually these files are written by a python script from a template 27 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
28 | *.manifest 29 | *.spec 30 | 31 | # Installer logs 32 | pip-log.txt 33 | pip-delete-this-directory.txt 34 | 35 | # Unit test / coverage reports 36 | htmlcov/ 37 | .tox/ 38 | .coverage 39 | .cache 40 | nosetests.xml 41 | coverage.xml 42 | 43 | # Translations 44 | *.mo 45 | *.pot 46 | 47 | # Django stuff: 48 | *.log 49 | 50 | # Sphinx documentation 51 | docs/_build/ 52 | 53 | # PyBuilder 54 | target/ 55 | -------------------------------------------------------------------------------- /2002FemPreg.dat.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/2002FemPreg.dat.gz -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Allen Downey 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CompStats 2 | 3 | Code for a workshop on statistical inference using computational methods in Python. 4 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-minimal -------------------------------------------------------------------------------- /check_env.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2013 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function, division 9 | 10 | import math 11 | import numpy 12 | 13 | from matplotlib import pyplot 14 | 15 | import thinkplot 16 | import thinkstats2 17 | 18 | 19 | def RenderPdf(mu, sigma, n=101): 20 | """Makes xs and ys for a normal PDF with (mu, sigma). 21 | 22 | n: number of places to evaluate the PDF 23 | """ 24 | xs = numpy.linspace(mu-4*sigma, mu+4*sigma, n) 25 | ys = [thinkstats2.EvalNormalPdf(x, mu, sigma) for x in xs] 26 | return xs, ys 27 | 28 | 29 | def main(): 30 | xs, ys = RenderPdf(100, 15) 31 | 32 | n = 34 33 | pyplot.fill_between(xs[-n:], ys[-n:], y2=0.0001, color='blue', alpha=0.2) 34 | s = 'Congratulations!\nIf you got this far,\nyou must be here.'
35 | d = dict(shrink=0.05) 36 | pyplot.annotate(s, [127, 0.002], xytext=[80, 0.005], arrowprops=d) 37 | 38 | thinkplot.Plot(xs, ys) 39 | thinkplot.Show(title='Distribution of IQ', 40 | xlabel='IQ', 41 | ylabel='PDF', 42 | legend=False) 43 | 44 | 45 | if __name__ == "__main__": 46 | main() 47 | -------------------------------------------------------------------------------- /cumulative_snowfall.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/cumulative_snowfall.png -------------------------------------------------------------------------------- /effect_size.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Effect Size\n", 8 | "===\n", 9 | "\n", 10 | "Examples and exercises for a tutorial on statistical inference.\n", 11 | "\n", 12 | "Copyright 2016 Allen Downey\n", 13 | "\n", 14 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "from __future__ import print_function, division\n", 26 | "\n", 27 | "import numpy\n", 28 | "import scipy.stats\n", 29 | "\n", 30 | "import matplotlib.pyplot as pyplot\n", 31 | "\n", 32 | "from ipywidgets import interact, interactive, fixed\n", 33 | "import ipywidgets as widgets\n", 34 | "\n", 35 | "# seed the random number generator so we all get the same results\n", 36 | "numpy.random.seed(17)\n", 37 | "\n", 38 | "# some nice colors from http://colorbrewer2.org/\n", 39 | "COLOR1 = '#7fc97f'\n", 40 | "COLOR2 = '#beaed4'\n", 41 | "COLOR3 = '#fdc086'\n", 42 | "COLOR4 = '#ffff99'\n", 43 | "COLOR5 = '#386cb0'\n", 44 | "\n", 45 | "%matplotlib inline" 46 | ] 
47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Part One\n", 53 | "\n", 54 | "To explore statistics that quantify effect size, we'll look at the difference in height between men and women. I used data from the Behavioral Risk Factor Surveillance System (BRFSS) to estimate the mean and standard deviation of height in cm for adult women and men in the U.S.\n", 55 | "\n", 56 | "I'll use `scipy.stats.norm` to represent the distributions. The result is an `rv` object (which stands for random variable)." 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "mu1, sig1 = 178, 7.7\n", 68 | "male_height = scipy.stats.norm(mu1, sig1)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": true 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "mu2, sig2 = 163, 7.3\n", 80 | "female_height = scipy.stats.norm(mu2, sig2)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "The following function evaluates the normal (Gaussian) probability density function (PDF) within 4 standard deviations of the mean. It takes an rv object and returns a pair of NumPy arrays." 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "def eval_pdf(rv, num=4):\n", 99 | " mean, std = rv.mean(), rv.std()\n", 100 | " xs = numpy.linspace(mean - num*std, mean + num*std, 100)\n", 101 | " ys = rv.pdf(xs)\n", 102 | " return xs, ys" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Here's what the two distributions look like."
110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "xs, ys = eval_pdf(male_height)\n", 119 | "pyplot.plot(xs, ys, label='male', linewidth=4, color=COLOR2)\n", 120 | "\n", 121 | "xs, ys = eval_pdf(female_height)\n", 122 | "pyplot.plot(xs, ys, label='female', linewidth=4, color=COLOR3)\n", 123 | "pyplot.xlabel('height (cm)')\n", 124 | "None" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "Let's assume for now that those are the true distributions for the population.\n", 132 | "\n", 133 | "I'll use `rvs` to generate random samples from the population distributions. Note that these are totally random, totally representative samples, with no measurement error!" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "male_sample = male_height.rvs(1000)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "female_sample = female_height.rvs(1000)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "Both samples are NumPy arrays. Now we can compute sample statistics like the mean and standard deviation." 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "mean1, std1 = male_sample.mean(), male_sample.std()\n", 172 | "mean1, std1" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "The sample mean is close to the population mean, but not exact, as expected." 
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "mean2, std2 = female_sample.mean(), female_sample.std()\n", 189 | "mean2, std2" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "And the results are similar for the female sample.\n", 197 | "\n", 198 | "Now, there are many ways to describe the magnitude of the difference between these distributions. An obvious one is the difference in the means:" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "difference_in_means = male_sample.mean() - female_sample.mean()\n", 208 | "difference_in_means # in cm" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "On average, men are 14--15 centimeters taller. For some applications, that would be a good way to describe the difference, but there are a few problems:\n", 216 | "\n", 217 | "* Without knowing more about the distributions (like the standard deviations) it's hard to interpret whether a difference like 15 cm is a lot or not.\n", 218 | "\n", 219 | "* The magnitude of the difference depends on the units of measure, making it hard to compare across different studies.\n", 220 | "\n", 221 | "There are a number of ways to quantify the difference between distributions. A simple option is to express the difference as a percentage of the mean.\n", 222 | "\n", 223 | "**Exercise 1**: what is the relative difference in means, expressed as a percentage?" 
224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": { 230 | "collapsed": true 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "# Solution goes here" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "**STOP HERE**: We'll regroup and discuss before you move on." 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "## Part Two\n", 249 | "\n", 250 | "An alternative way to express the difference between distributions is to see how much they overlap. To define overlap, we choose a threshold between the two means. The simplest threshold is the midpoint between the means:" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "simple_thresh = (mean1 + mean2) / 2\n", 260 | "simple_thresh" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "A better, but slightly more complicated threshold is the place where the PDFs cross."
268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)\n", 277 | "thresh" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "In this example, there's not much difference between the two thresholds.\n", 285 | "\n", 286 | "Now we can count how many men are below the threshold:" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "male_below_thresh = sum(male_sample < thresh)\n", 296 | "male_below_thresh" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "And how many women are above it:" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "female_above_thresh = sum(female_sample > thresh)\n", 313 | "female_above_thresh" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "The \"overlap\" is the area under the curves that ends up on the wrong side of the threshold." 
321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "male_overlap = male_below_thresh / len(male_sample)\n", 330 | "female_overlap = female_above_thresh / len(female_sample)\n", 331 | "male_overlap, female_overlap" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "In practical terms, you might report the fraction of people who would be misclassified if you tried to use height to guess sex, which is the average of the male and female overlap rates:" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "misclassification_rate = (male_overlap + female_overlap) / 2\n", 348 | "misclassification_rate" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "Another way to quantify the difference between distributions is what's called \"probability of superiority\", which is a problematic term, but in this context it's the probability that a randomly-chosen man is taller than a randomly-chosen woman.\n", 356 | "\n", 357 | "**Exercise 2**: Suppose I choose a man and a woman at random. What is the probability that the man is taller?\n", 358 | "\n", 359 | "HINT: You can `zip` the two samples together and count the number of pairs where the male is taller, or use NumPy array operations." 
360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": { 366 | "collapsed": true 367 | }, 368 | "outputs": [], 369 | "source": [ 370 | "# Solution goes here" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": { 377 | "collapsed": true 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "# Solution goes here" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "Overlap (or misclassification rate) and \"probability of superiority\" have two good properties:\n", 389 | "\n", 390 | "* As probabilities, they don't depend on units of measure, so they are comparable between studies.\n", 391 | "\n", 392 | "* They are expressed in operational terms, so a reader has a sense of what practical effect the difference makes.\n", 393 | "\n", 394 | "### Cohen's effect size\n", 395 | "\n", 396 | "There is one other common way to express the difference between distributions. Cohen's $d$ is the difference in means, standardized by dividing by the standard deviation. 
Here's the math notation:\n", 397 | "\n", 398 | "$ d = \\frac{\\bar{x}_1 - \\bar{x}_2} s $\n", 399 | "\n", 400 | "where $s$ is the pooled standard deviation:\n", 401 | "\n", 402 | "$s = \\sqrt{\\frac{n_1 s^2_1 + n_2 s^2_2}{n_1+n_2}}$\n", 403 | "\n", 404 | "Here's a function that computes it:\n" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": { 411 | "collapsed": true 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "def CohenEffectSize(group1, group2):\n", 416 | " \"\"\"Compute Cohen's d.\n", 417 | "\n", 418 | " group1: Series or NumPy array\n", 419 | " group2: Series or NumPy array\n", 420 | "\n", 421 | " returns: float\n", 422 | " \"\"\"\n", 423 | " diff = group1.mean() - group2.mean()\n", 424 | "\n", 425 | " n1, n2 = len(group1), len(group2)\n", 426 | " var1 = group1.var()\n", 427 | " var2 = group2.var()\n", 428 | "\n", 429 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 430 | " d = diff / numpy.sqrt(pooled_var)\n", 431 | " return d" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "Computing the denominator is a little complicated; in fact, people have proposed several ways to do it. This implementation uses the \"pooled standard deviation\", which is a weighted average of the standard deviations of the two groups.\n", 439 | "\n", 440 | "And here's the result for the difference in height between men and women." 
441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [ 449 | "CohenEffectSize(male_sample, female_sample)" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "Most people don't have a good sense of how big $d=1.9$ is, so let's make a visualization to get calibrated.\n", 457 | "\n", 458 | "Here's a function that encapsulates the code we already saw for computing overlap and probability of superiority." 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": { 465 | "collapsed": true 466 | }, 467 | "outputs": [], 468 | "source": [ 469 | "def overlap_superiority(control, treatment, n=1000):\n", 470 | " \"\"\"Estimates overlap and superiority based on a sample.\n", 471 | " \n", 472 | " control: scipy.stats rv object\n", 473 | " treatment: scipy.stats rv object\n", 474 | " n: sample size\n", 475 | " \"\"\"\n", 476 | " control_sample = control.rvs(n)\n", 477 | " treatment_sample = treatment.rvs(n)\n", 478 | " thresh = (control.mean() + treatment.mean()) / 2\n", 479 | " \n", 480 | " control_above = sum(control_sample > thresh)\n", 481 | " treatment_below = sum(treatment_sample < thresh)\n", 482 | " overlap = (control_above + treatment_below) / n\n", 483 | " \n", 484 | " superiority = (treatment_sample > control_sample).mean()\n", 485 | " return overlap, superiority" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "Here's the function that takes Cohen's $d$, plots normal distributions with the given effect size, and prints their overlap and superiority." 
493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": { 499 | "collapsed": true 500 | }, 501 | "outputs": [], 502 | "source": [ 503 | "def plot_pdfs(cohen_d=2):\n", 504 | " \"\"\"Plot PDFs for distributions that differ by some number of stds.\n", 505 | " \n", 506 | " cohen_d: number of standard deviations between the means\n", 507 | " \"\"\"\n", 508 | " control = scipy.stats.norm(0, 1)\n", 509 | " treatment = scipy.stats.norm(cohen_d, 1)\n", 510 | " xs, ys = eval_pdf(control)\n", 511 | " pyplot.fill_between(xs, ys, label='control', color=COLOR3, alpha=0.7)\n", 512 | "\n", 513 | " xs, ys = eval_pdf(treatment)\n", 514 | " pyplot.fill_between(xs, ys, label='treatment', color=COLOR2, alpha=0.7)\n", 515 | " \n", 516 | " o, s = overlap_superiority(control, treatment)\n", 517 | " pyplot.text(0, 0.05, 'overlap ' + str(o))\n", 518 | " pyplot.text(0, 0.15, 'superiority ' + str(s))\n", 519 | " pyplot.show()\n", 520 | " #print('overlap', o)\n", 521 | " #print('superiority', s)" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "Here's an example that demonstrates the function:" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": null, 534 | "metadata": {}, 535 | "outputs": [], 536 | "source": [ 537 | "plot_pdfs(2)" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "And an interactive widget you can use to visualize what different values of $d$ mean:" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "slider = widgets.FloatSlider(min=0, max=4, value=2)\n", 554 | "interact(plot_pdfs, cohen_d=slider)\n", 555 | "None" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "Cohen's $d$ has a few nice properties:\n", 563 | "\n", 564 | "* Because mean and standard 
deviation have the same units, their ratio is dimensionless, so we can compare $d$ across different studies.\n", 565 | "\n", 566 | "* In fields that commonly use $d$, people are calibrated to know what values should be considered big, surprising, or important.\n", 567 | "\n", 568 | "* Given $d$ (and the assumption that the distributions are normal), you can compute overlap, superiority, and related statistics." 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "In summary, the best way to report effect size depends on the audience and your goals. There is often a tradeoff between summary statistics that have good technical properties and statistics that are meaningful to a general audience." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": { 582 | "collapsed": true 583 | }, 584 | "outputs": [], 585 | "source": [] 586 | } 587 | ], 588 | "metadata": { 589 | "kernelspec": { 590 | "display_name": "Python 3", 591 | "language": "python", 592 | "name": "python3" 593 | }, 594 | "language_info": { 595 | "codemirror_mode": { 596 | "name": "ipython", 597 | "version": 3 598 | }, 599 | "file_extension": ".py", 600 | "mimetype": "text/x-python", 601 | "name": "python", 602 | "nbconvert_exporter": "python", 603 | "pygments_lexer": "ipython3", 604 | "version": "3.6.1" 605 | } 606 | }, 607 | "nbformat": 4, 608 | "nbformat_minor": 1 609 | } 610 | -------------------------------------------------------------------------------- /effect_size_soln.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Effect Size\n", 8 | "===\n", 9 | "\n", 10 | "Examples and exercises for a tutorial on statistical inference.\n", 11 | "\n", 12 | "Copyright 2016 Allen Downey\n", 13 | "\n", 14 | "License: [Creative Commons Attribution 4.0 
International](http://creativecommons.org/licenses/by/4.0/)" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from __future__ import print_function, division\n", 24 | "\n", 25 | "import numpy\n", 26 | "import scipy.stats\n", 27 | "\n", 28 | "import matplotlib.pyplot as pyplot\n", 29 | "\n", 30 | "from ipywidgets import interact, interactive, fixed\n", 31 | "import ipywidgets as widgets\n", 32 | "\n", 33 | "# seed the random number generator so we all get the same results\n", 34 | "numpy.random.seed(17)\n", 35 | "\n", 36 | "# some nice colors from http://colorbrewer2.org/\n", 37 | "COLOR1 = '#7fc97f'\n", 38 | "COLOR2 = '#beaed4'\n", 39 | "COLOR3 = '#fdc086'\n", 40 | "COLOR4 = '#ffff99'\n", 41 | "COLOR5 = '#386cb0'\n", 42 | "\n", 43 | "%matplotlib inline" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## Part One\n", 51 | "\n", 52 | "To explore statistics that quantify effect size, we'll look at the difference in height between men and women. I used data from the Behavioral Risk Factor Surveillance System (BRFSS) to estimate the mean and standard deviation of height in cm for adult women and men in the U.S.\n", 53 | "\n", 54 | "I'll use `scipy.stats.norm` to represent the distributions. The result is an `rv` object (which stands for random variable)." 
55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "mu1, sig1 = 178, 7.7\n", 64 | "male_height = scipy.stats.norm(mu1, sig1)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "mu2, sig2 = 163, 7.3\n", 74 | "female_height = scipy.stats.norm(mu2, sig2)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "The following function evaluates the normal (Gaussian) probability density function (PDF) within 4 standard deviations of the mean. It takes an rv object and returns a pair of NumPy arrays." 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "def eval_pdf(rv, num=4):\n", 91 | " mean, std = rv.mean(), rv.std()\n", 92 | " xs = numpy.linspace(mean - num*std, mean + num*std, 100)\n", 93 | " ys = rv.pdf(xs)\n", 94 | " return xs, ys" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "Here's what the two distributions look like." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "xs, ys = eval_pdf(male_height)\n", 111 | "pyplot.plot(xs, ys, label='male', linewidth=4, color=COLOR2)\n", 112 | "\n", 113 | "xs, ys = eval_pdf(female_height)\n", 114 | "pyplot.plot(xs, ys, label='female', linewidth=4, color=COLOR3)\n", 115 | "pyplot.xlabel('height (cm)')\n", 116 | "None" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Let's assume for now that those are the true distributions for the population.\n", 124 | "\n", 125 | "I'll use `rvs` to generate random samples from the population distributions.
Note that these are totally random, totally representative samples, with no measurement error!" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "male_sample = male_height.rvs(1000)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "female_sample = female_height.rvs(1000)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "Both samples are NumPy arrays. Now we can compute sample statistics like the mean and standard deviation." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "mean1, std1 = male_sample.mean(), male_sample.std()\n", 160 | "mean1, std1" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "The sample mean is close to the population mean, but not exact, as expected." 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "mean2, std2 = female_sample.mean(), female_sample.std()\n", 177 | "mean2, std2" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "And the results are similar for the female sample.\n", 185 | "\n", 186 | "Now, there are many ways to describe the magnitude of the difference between these distributions. 
An obvious one is the difference in the means:" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "difference_in_means = male_sample.mean() - female_sample.mean()\n", 196 | "difference_in_means # in cm" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "On average, men are 14--15 centimeters taller. For some applications, that would be a good way to describe the difference, but there are a few problems:\n", 204 | "\n", 205 | "* Without knowing more about the distributions (like the standard deviations) it's hard to interpret whether a difference like 15 cm is a lot or not.\n", 206 | "\n", 207 | "* The magnitude of the difference depends on the units of measure, making it hard to compare across different studies.\n", 208 | "\n", 209 | "There are a number of ways to quantify the difference between distributions. A simple option is to express the difference as a percentage of the mean.\n", 210 | "\n", 211 | "**Exercise 1**: what is the relative difference in means, expressed as a percentage?" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "# Solution goes here\n", 221 | "\n", 222 | "relative_difference = difference_in_means / male_sample.mean()\n", 223 | "print(relative_difference * 100) # percent\n", 224 | "\n", 225 | "# A problem with relative differences is that you have to choose \n", 226 | "# which mean to express them relative to.\n", 227 | "\n", 228 | "relative_difference = difference_in_means / female_sample.mean()\n", 229 | "print(relative_difference * 100) # percent" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "**STOP HERE**: We'll regroup and discuss before you move on." 
237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "## Part Two\n", 244 | "\n", 245 | "An alternative way to express the difference between distributions is to see how much they overlap. To define overlap, we choose a threshold between the two means. The simplest threshold is the midpoint between the means:" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "simple_thresh = (mean1 + mean2) / 2\n", 255 | "simple_thresh" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "A better, but slightly more complicated threshold is the place where the PDFs cross." 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)\n", 272 | "thresh" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "In this example, there's not much difference between the two thresholds.\n", 280 | "\n", 281 | "Now we can count how many men are below the threshold:" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "male_below_thresh = sum(male_sample < thresh)\n", 291 | "male_below_thresh" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "And how many women are above it:" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "female_above_thresh = sum(female_sample > thresh)\n", 308 | "female_above_thresh" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "The \"overlap\" is the area under the curves that ends up
on the wrong side of the threshold." 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "male_overlap = male_below_thresh / len(male_sample)\n", 325 | "female_overlap = female_above_thresh / len(female_sample)\n", 326 | "male_overlap, female_overlap" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "In practical terms, you might report the fraction of people who would be misclassified if you tried to use height to guess sex, which is the average of the male and female overlap rates:" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "misclassification_rate = (male_overlap + female_overlap) / 2\n", 343 | "misclassification_rate" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "Another way to quantify the difference between distributions is what's called \"probability of superiority\", which is a problematic term, but in this context it's the probability that a randomly-chosen man is taller than a randomly-chosen woman.\n", 351 | "\n", 352 | "**Exercise 2**: Suppose I choose a man and a woman at random. What is the probability that the man is taller?\n", 353 | "\n", 354 | "HINT: You can `zip` the two samples together and count the number of pairs where the male is taller, or use NumPy array operations." 
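The overlap calculation can be exercised end to end on synthetic data. The following is a minimal sketch, not part of the notebook; the normal parameters are illustrative, not estimates from the samples used above:

```python
import numpy as np

np.random.seed(17)

# Illustrative heights in cm (assumed parameters, not the data above).
male_sample = np.random.normal(178, 7.7, size=10000)
female_sample = np.random.normal(163, 7.3, size=10000)

mean1, std1 = male_sample.mean(), male_sample.std()
mean2, std2 = female_sample.mean(), female_sample.std()

# Threshold where the two normal PDFs cross, as defined above.
thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)

# Fraction of each group that ends up on the wrong side of the threshold.
male_overlap = (male_sample < thresh).mean()
female_overlap = (female_sample > thresh).mean()
misclassification_rate = (male_overlap + female_overlap) / 2
print(misclassification_rate)
```

With these parameters the gap between the means is about two standard deviations, so the threshold sits roughly one standard deviation from each mean and the misclassification rate comes out near 16%.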
355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "# Solution goes here\n", 364 | "\n", 365 | "sum(x > y for x, y in zip(male_sample, female_sample)) / len(male_sample)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "# Solution goes here\n", 375 | "\n", 376 | "(male_sample > female_sample).mean()" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "Overlap (or misclassification rate) and \"probability of superiority\" have two good properties:\n", 384 | "\n", 385 | "* As probabilities, they don't depend on units of measure, so they are comparable between studies.\n", 386 | "\n", 387 | "* They are expressed in operational terms, so a reader has a sense of what practical effect the difference makes.\n", 388 | "\n", 389 | "### Cohen's effect size\n", 390 | "\n", 391 | "There is one other common way to express the difference between distributions. Cohen's $d$ is the difference in means, standardized by dividing by the standard deviation. 
Here's the math notation:\n", 392 | "\n", 393 | "$ d = \\frac{\\bar{x}_1 - \\bar{x}_2} s $\n", 394 | "\n", 395 | "where $s$ is the pooled standard deviation:\n", 396 | "\n", 397 | "$s = \\sqrt{\\frac{n_1 s^2_1 + n_2 s^2_2}{n_1+n_2}}$\n", 398 | "\n", 399 | "Here's a function that computes it:\n" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "def CohenEffectSize(group1, group2):\n", 409 | " \"\"\"Compute Cohen's d.\n", 410 | "\n", 411 | " group1: Series or NumPy array\n", 412 | " group2: Series or NumPy array\n", 413 | "\n", 414 | " returns: float\n", 415 | " \"\"\"\n", 416 | " diff = group1.mean() - group2.mean()\n", 417 | "\n", 418 | " n1, n2 = len(group1), len(group2)\n", 419 | " var1 = group1.var()\n", 420 | " var2 = group2.var()\n", 421 | "\n", 422 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 423 | " d = diff / numpy.sqrt(pooled_var)\n", 424 | " return d" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "Computing the denominator is a little complicated; in fact, people have proposed several ways to do it. This implementation uses the \"pooled standard deviation\", which is a weighted average of the standard deviations of the two groups.\n", 432 | "\n", 433 | "And here's the result for the difference in height between men and women." 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "CohenEffectSize(male_sample, female_sample)" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Most people don't have a good sense of how big $d=1.9$ is, so let's make a visualization to get calibrated.\n", 450 | "\n", 451 | "Here's a function that encapsulates the code we already saw for computing overlap and probability of superiority." 
452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "def overlap_superiority(control, treatment, n=1000):\n", 461 | " \"\"\"Estimates overlap and superiority based on a sample.\n", 462 | " \n", 463 | " control: scipy.stats rv object\n", 464 | " treatment: scipy.stats rv object\n", 465 | " n: sample size\n", 466 | " \"\"\"\n", 467 | " control_sample = control.rvs(n)\n", 468 | " treatment_sample = treatment.rvs(n)\n", 469 | " thresh = (control.mean() + treatment.mean()) / 2\n", 470 | " \n", 471 | " control_above = sum(control_sample > thresh)\n", 472 | " treatment_below = sum(treatment_sample < thresh)\n", 473 | " overlap = (control_above + treatment_below) / n\n", 474 | " \n", 475 | " superiority = (treatment_sample > control_sample).mean()\n", 476 | " return overlap, superiority" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "Here's the function that takes Cohen's $d$, plots normal distributions with the given effect size, and prints their overlap and superiority." 
484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "def plot_pdfs(cohen_d=2):\n", 493 | " \"\"\"Plot PDFs for distributions that differ by some number of stds.\n", 494 | " \n", 495 | " cohen_d: number of standard deviations between the means\n", 496 | " \"\"\"\n", 497 | " control = scipy.stats.norm(0, 1)\n", 498 | " treatment = scipy.stats.norm(cohen_d, 1)\n", 499 | " xs, ys = eval_pdf(control)\n", 500 | " pyplot.fill_between(xs, ys, label='control', color=COLOR3, alpha=0.7)\n", 501 | "\n", 502 | " xs, ys = eval_pdf(treatment)\n", 503 | " pyplot.fill_between(xs, ys, label='treatment', color=COLOR2, alpha=0.7)\n", 504 | " \n", 505 | " o, s = overlap_superiority(control, treatment)\n", 506 | " pyplot.text(0, 0.05, 'overlap ' + str(o))\n", 507 | " pyplot.text(0, 0.15, 'superiority ' + str(s))\n", 508 | " pyplot.show()\n", 509 | " #print('overlap', o)\n", 510 | " #print('superiority', s)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "Here's an example that demonstrates the function:" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "plot_pdfs(2)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "And an interactive widget you can use to visualize what different values of $d$ mean:" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "slider = widgets.FloatSlider(min=0, max=4, value=2)\n", 543 | "interact(plot_pdfs, cohen_d=slider)\n", 544 | "None" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "Cohen's $d$ has a few nice properties:\n", 552 | "\n", 553 | "* Because mean and standard deviation have the same units, 
their ratio is dimensionless, so we can compare $d$ across different studies.\n", 554 | "\n", 555 | "* In fields that commonly use $d$, people are calibrated to know what values should be considered big, surprising, or important.\n", 556 | "\n", 557 | "* Given $d$ (and the assumption that the distributions are normal), you can compute overlap, superiority, and related statistics." 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": {}, 563 | "source": [ 564 | "In summary, the best way to report effect size depends on the audience and your goals. There is often a tradeoff between summary statistics that have good technical properties and statistics that are meaningful to a general audience." 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": {}, 571 | "outputs": [], 572 | "source": [] 573 | } 574 | ], 575 | "metadata": { 576 | "kernelspec": { 577 | "display_name": "Python 3", 578 | "language": "python", 579 | "name": "python3" 580 | }, 581 | "language_info": { 582 | "codemirror_mode": { 583 | "name": "ipython", 584 | "version": 3 585 | }, 586 | "file_extension": ".py", 587 | "mimetype": "text/x-python", 588 | "name": "python", 589 | "nbconvert_exporter": "python", 590 | "pygments_lexer": "ipython3", 591 | "version": "3.6.1" 592 | } 593 | }, 594 | "nbformat": 4, 595 | "nbformat_minor": 1 596 | } 597 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: CompStats 2 | 3 | dependencies: 4 | - python=3.7 5 | - jupyter 6 | - numpy 7 | - matplotlib 8 | - seaborn 9 | - pandas 10 | - scipy 11 | 12 | 13 | 14 | 15 | 16 | -------------------------------------------------------------------------------- /first.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. 
Downey, available from greenteapress.com 3 | 4 | Copyright 2014 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import math 11 | import numpy as np 12 | 13 | import nsfg 14 | import thinkstats2 15 | import thinkplot 16 | 17 | 18 | def MakeFrames(): 19 | """Reads pregnancy data and partitions first babies and others. 20 | 21 | returns: DataFrames (all live births, first babies, others) 22 | """ 23 | preg = nsfg.ReadFemPreg() 24 | 25 | live = preg[preg.outcome == 1] 26 | firsts = live[live.birthord == 1] 27 | others = live[live.birthord != 1] 28 | 29 | assert len(live) == 9148 30 | assert len(firsts) == 4413 31 | assert len(others) == 4735 32 | 33 | return live, firsts, others 34 | 35 | 36 | def Summarize(live, firsts, others): 37 | """Print various summary statistics.""" 38 | 39 | mean = live.prglngth.mean() 40 | var = live.prglngth.var() 41 | std = live.prglngth.std() 42 | 43 | print('Live mean', mean) 44 | print('Live variance', var) 45 | print('Live std', std) 46 | 47 | mean1 = firsts.prglngth.mean() 48 | mean2 = others.prglngth.mean() 49 | 50 | var1 = firsts.prglngth.var() 51 | var2 = others.prglngth.var() 52 | 53 | print('Mean') 54 | print('First babies', mean1) 55 | print('Others', mean2) 56 | 57 | print('Variance') 58 | print('First babies', var1) 59 | print('Others', var2) 60 | 61 | print('Difference in weeks', mean1 - mean2) 62 | print('Difference in hours', (mean1 - mean2) * 7 * 24) 63 | 64 | print('Difference relative to 39 weeks', (mean1 - mean2) / 39 * 100) 65 | 66 | d = thinkstats2.CohenEffectSize(firsts.prglngth, others.prglngth) 67 | print('Cohen d', d) 68 | 69 | 70 | def PrintExtremes(live): 71 | """Plots the histogram of pregnancy lengths and prints the extremes. 
72 | 73 | live: DataFrame of live births 74 | """ 75 | hist = thinkstats2.Hist(live.prglngth) 76 | thinkplot.Hist(hist, label='live births') 77 | 78 | thinkplot.Save(root='first_nsfg_hist_live', 79 | title='Histogram', 80 | xlabel='weeks', 81 | ylabel='frequency') 82 | 83 | print('Shortest lengths:') 84 | for weeks, freq in hist.Smallest(10): 85 | print(weeks, freq) 86 | 87 | print('Longest lengths:') 88 | for weeks, freq in hist.Largest(10): 89 | print(weeks, freq) 90 | 91 | 92 | def MakeHists(live): 93 | """Plot Hists for live births 94 | 95 | live: DataFrame 96 | others: DataFrame 97 | """ 98 | hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb') 99 | thinkplot.Hist(hist) 100 | thinkplot.Save(root='first_wgt_lb_hist', 101 | xlabel='pounds', 102 | ylabel='frequency', 103 | axis=[-1, 14, 0, 3200]) 104 | 105 | hist = thinkstats2.Hist(live.birthwgt_oz, label='birthwgt_oz') 106 | thinkplot.Hist(hist) 107 | thinkplot.Save(root='first_wgt_oz_hist', 108 | xlabel='ounces', 109 | ylabel='frequency', 110 | axis=[-1, 16, 0, 1200]) 111 | 112 | hist = thinkstats2.Hist(np.floor(live.agepreg), label='agepreg') 113 | thinkplot.Hist(hist) 114 | thinkplot.Save(root='first_agepreg_hist', 115 | xlabel='years', 116 | ylabel='frequency') 117 | 118 | hist = thinkstats2.Hist(live.prglngth, label='prglngth') 119 | thinkplot.Hist(hist) 120 | thinkplot.Save(root='first_prglngth_hist', 121 | xlabel='weeks', 122 | ylabel='frequency', 123 | axis=[-1, 53, 0, 5000]) 124 | 125 | 126 | def MakeComparison(firsts, others): 127 | """Plots histograms of pregnancy length for first babies and others. 
128 | 129 | firsts: DataFrame 130 | others: DataFrame 131 | """ 132 | first_hist = thinkstats2.Hist(firsts.prglngth, label='first') 133 | other_hist = thinkstats2.Hist(others.prglngth, label='other') 134 | 135 | width = 0.45 136 | thinkplot.PrePlot(2) 137 | thinkplot.Hist(first_hist, align='right', width=width) 138 | thinkplot.Hist(other_hist, align='left', width=width) 139 | 140 | thinkplot.Save(root='first_nsfg_hist', 141 | title='Histogram', 142 | xlabel='weeks', 143 | ylabel='frequency', 144 | axis=[27, 46, 0, 2700]) 145 | 146 | 147 | def main(script): 148 | live, firsts, others = MakeFrames() 149 | 150 | MakeHists(live) 151 | PrintExtremes(live) 152 | MakeComparison(firsts, others) 153 | Summarize(live, firsts, others) 154 | 155 | 156 | if __name__ == '__main__': 157 | import sys 158 | main(*sys.argv) 159 | 160 | 161 | -------------------------------------------------------------------------------- /hypothesis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Hypothesis Testing\n", 8 | "==================\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "from __future__ import print_function, division\n", 24 | "\n", 25 | "import numpy\n", 26 | "import scipy.stats\n", 27 | "\n", 28 | "import matplotlib.pyplot as pyplot\n", 29 | "\n", 30 | "import first\n", 31 | "\n", 32 | "# some nicer colors from http://colorbrewer2.org/\n", 33 | "COLOR1 = '#7fc97f'\n", 34 | "COLOR2 = '#beaed4'\n", 35 | "COLOR3 = '#fdc086'\n", 36 | "COLOR4 = '#ffff99'\n", 37 | "COLOR5 = '#386cb0'\n", 38 | "\n", 39 | "%matplotlib inline" 40 | ] 41 | }, 42 | { 43 | 
"cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Part One" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Suppose you observe an apparent difference between two groups and you want to check whether it might be due to chance.\n", 54 | "\n", 55 | "As an example, we'll look at differences between first babies and others. The `first` module provides code to read data from the National Survey of Family Growth (NSFG)." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "live, firsts, others = first.MakeFrames()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "We'll look at a couple of variables, including pregnancy length and birth weight. The effect size we'll consider is the difference in the means.\n", 74 | "\n", 75 | "Other examples might include a correlation between variables or a coefficient in a linear regression. The number that quantifies the size of the effect is called the \"test statistic\"." 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": true 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "def TestStatistic(data):\n", 87 | " group1, group2 = data\n", 88 | " test_stat = abs(group1.mean() - group2.mean())\n", 89 | " return test_stat" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "For the first example, I extract the pregnancy length for first babies and others. The results are pandas Series objects." 
97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": true 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "group1 = firsts.prglngth\n", 108 | "group2 = others.prglngth" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "The actual difference in the means is 0.078 weeks, which is only 13 hours." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "actual = TestStatistic((group1, group2))\n", 125 | "actual" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "The null hypothesis is that there is no difference between the groups. We can model that by forming a pooled sample that includes first babies and others." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "n, m = len(group1), len(group2)\n", 144 | "pool = numpy.hstack((group1, group2))" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Then we can simulate the null hypothesis by shuffling the pool and dividing it into two groups, using the same sizes as the actual sample." 
152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": { 158 | "collapsed": true 159 | }, 160 | "outputs": [], 161 | "source": [ 162 | "def RunModel():\n", 163 | " numpy.random.shuffle(pool)\n", 164 | " data = pool[:n], pool[n:]\n", 165 | " return data" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "The result of running the model is two NumPy arrays with the shuffled pregnancy lengths:" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "RunModel()" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "Then we compute the same test statistic using the simulated data:" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "TestStatistic(RunModel())" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "If we run the model 1000 times and compute the test statistic, we can see how much the test statistic varies under the null hypothesis." 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "test_stats = numpy.array([TestStatistic(RunModel()) for i in range(1000)])\n", 214 | "test_stats.shape" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "Here's the sampling distribution of the test statistic under the null hypothesis, with the actual difference in means indicated by a gray line." 
222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "pyplot.axvline(actual, linewidth=3, color='0.8')\n", 231 | "pyplot.hist(test_stats, color=COLOR5)\n", 232 | "pyplot.xlabel('difference in means')\n", 233 | "pyplot.ylabel('count')\n", 234 | "None" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "The p-value is the probability that the test statistic under the null hypothesis exceeds the actual value." 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "pvalue = sum(test_stats >= actual) / len(test_stats)\n", 251 | "pvalue" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "In this case the result is about 15%, which means that even if there is no difference between the groups, it is plausible that we could see a sample difference as big as 0.078 weeks.\n", 259 | "\n", 260 | "We conclude that the apparent effect might be due to chance, so we are not confident that it would appear in the general population, or in another sample from the same population.\n", 261 | "\n", 262 | "STOP HERE\n", 263 | "---------" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "Part Two\n", 271 | "========\n", 272 | "\n", 273 | "We can take the pieces from the previous section and organize them in a class that represents the structure of a hypothesis test." 
274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "class HypothesisTest(object):\n", 285 | " \"\"\"Represents a hypothesis test.\"\"\"\n", 286 | "\n", 287 | " def __init__(self, data):\n", 288 | " \"\"\"Initializes.\n", 289 | "\n", 290 | " data: data in whatever form is relevant\n", 291 | " \"\"\"\n", 292 | " self.data = data\n", 293 | " self.MakeModel()\n", 294 | " self.actual = self.TestStatistic(data)\n", 295 | " self.test_stats = None\n", 296 | "\n", 297 | " def PValue(self, iters=1000):\n", 298 | " \"\"\"Computes the distribution of the test statistic and p-value.\n", 299 | "\n", 300 | " iters: number of iterations\n", 301 | "\n", 302 | " returns: float p-value\n", 303 | " \"\"\"\n", 304 | " self.test_stats = numpy.array([self.TestStatistic(self.RunModel()) \n", 305 | " for _ in range(iters)])\n", 306 | "\n", 307 | " count = sum(self.test_stats >= self.actual)\n", 308 | " return count / iters\n", 309 | "\n", 310 | " def MaxTestStat(self):\n", 311 | " \"\"\"Returns the largest test statistic seen during simulations.\n", 312 | " \"\"\"\n", 313 | " return max(self.test_stats)\n", 314 | "\n", 315 | " def PlotHist(self, label=None):\n", 316 | " \"\"\"Plots a histogram of the test statistics, with a vertical line at the observed test stat.\n", 317 | " \"\"\"\n", 318 | " pyplot.hist(self.test_stats, color=COLOR4)\n", 319 | " pyplot.axvline(self.actual, linewidth=3, color='0.8')\n", 320 | " pyplot.xlabel('test statistic')\n", 321 | " pyplot.ylabel('count')\n", 322 | "\n", 323 | " def TestStatistic(self, data):\n", 324 | " \"\"\"Computes the test statistic.\n", 325 | "\n", 326 | " data: data in whatever form is relevant \n", 327 | " \"\"\"\n", 328 | " raise UnimplementedMethodException()\n", 329 | "\n", 330 | " def MakeModel(self):\n", 331 | " \"\"\"Build a model of the null hypothesis.\n", 332 | " \"\"\"\n", 333 | " pass\n", 334 | "\n", 335 | " def RunModel(self):\n", 336
| " \"\"\"Run the model of the null hypothesis.\n", 337 | "\n", 338 | " returns: simulated data\n", 339 | " \"\"\"\n", 340 | " raise UnimplementedMethodException()\n" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "`HypothesisTest` is an abstract parent class that encodes the template. Child classes fill in the missing methods. For example, here's the test from the previous section." 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "class DiffMeansPermute(HypothesisTest):\n", 359 | " \"\"\"Tests a difference in means by permutation.\"\"\"\n", 360 | "\n", 361 | " def TestStatistic(self, data):\n", 362 | " \"\"\"Computes the test statistic.\n", 363 | "\n", 364 | " data: data in whatever form is relevant \n", 365 | " \"\"\"\n", 366 | " group1, group2 = data\n", 367 | " test_stat = abs(group1.mean() - group2.mean())\n", 368 | " return test_stat\n", 369 | "\n", 370 | " def MakeModel(self):\n", 371 | " \"\"\"Build a model of the null hypothesis.\n", 372 | " \"\"\"\n", 373 | " group1, group2 = self.data\n", 374 | " self.n, self.m = len(group1), len(group2)\n", 375 | " self.pool = numpy.hstack((group1, group2))\n", 376 | "\n", 377 | " def RunModel(self):\n", 378 | " \"\"\"Run the model of the null hypothesis.\n", 379 | "\n", 380 | " returns: simulated data\n", 381 | " \"\"\"\n", 382 | " numpy.random.shuffle(self.pool)\n", 383 | " data = self.pool[:self.n], self.pool[self.n:]\n", 384 | " return data" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "Now we can run the test by instantiating a DiffMeansPermute object:" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "data = (firsts.prglngth, others.prglngth)\n", 401 | "ht = 
DiffMeansPermute(data)\n", 402 | "p_value = ht.PValue(iters=1000)\n", 403 | "print('\\nmeans permute pregnancy length')\n", 404 | "print('p-value =', p_value)\n", 405 | "print('actual =', ht.actual)\n", 406 | "print('ts max =', ht.MaxTestStat())" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "And we can plot the sampling distribution of the test statistic under the null hypothesis." 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "ht.PlotHist()" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "### Difference in standard deviation\n", 430 | "\n", 431 | "**Exercise 1**: Write a class named `DiffStdPermute` that extends `DiffMeansPermute` and overrides `TestStatistic` to compute the difference in standard deviations. Is the difference in standard deviations statistically significant?" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": { 438 | "collapsed": true 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "# Solution goes here" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Here's the code to test your solution to the previous exercise."
450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "data = (firsts.prglngth, others.prglngth)\n", 459 | "ht = DiffStdPermute(data)\n", 460 | "p_value = ht.PValue(iters=1000)\n", 461 | "print('\\nstd permute pregnancy length')\n", 462 | "print('p-value =', p_value)\n", 463 | "print('actual =', ht.actual)\n", 464 | "print('ts max =', ht.MaxTestStat())" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "### Difference in birth weights\n", 472 | "\n", 473 | "Now let's run DiffMeansPermute again to see if there is a difference in birth weight between first babies and others." 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": { 480 | "collapsed": true 481 | }, 482 | "outputs": [], 483 | "source": [ 484 | "data = (firsts.totalwgt_lb.dropna(), others.totalwgt_lb.dropna())\n", 485 | "ht = DiffMeansPermute(data)\n", 486 | "p_value = ht.PValue(iters=1000)\n", 487 | "print('\\nmeans permute birthweight')\n", 488 | "print('p-value =', p_value)\n", 489 | "print('actual =', ht.actual)\n", 490 | "print('ts max =', ht.MaxTestStat())" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "In this case, after 1000 attempts, we never see a sample difference as big as the observed difference, so we conclude that the apparent effect is unlikely under the null hypothesis. Under normal circumstances, we can also make the inference that the apparent effect is unlikely to be caused by random sampling.\n", 498 | "\n", 499 | "One final note: in this case I would report that the p-value is less than 1/1000 or less than 0.001. I would not report p=0, because the apparent effect is not impossible under the null hypothesis; just unlikely." 
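That reporting convention can be captured in a small helper. This is a hypothetical function (the name is mine, not part of the notebook) that turns a simulation count into a report, bounding the p-value by 1/iters instead of reporting zero:

```python
def format_pvalue(count, iters):
    """Report a p-value estimated from `count` hits in `iters` simulations.

    If no simulated test statistic matched or exceeded the observed one,
    report an upper bound rather than p=0: the apparent effect is unlikely
    under the null hypothesis, not impossible.
    """
    if count == 0:
        return 'p < %g' % (1 / iters)
    return 'p = %g' % (count / iters)

print(format_pvalue(0, 1000))    # p < 0.001
print(format_pvalue(150, 1000))  # p = 0.15
```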
500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "### Part Three\n", 507 | "\n", 508 | "In this section, we'll explore the dangers of p-hacking by running multiple tests until we find one that's statistically significant.\n", 509 | "\n", 510 | "Suppose we want to compare IQs for two groups of people. And suppose that, in fact, the two groups are statistically identical; that is, their IQs are drawn from a normal distribution with mean 100 and standard deviation 15.\n", 511 | "\n", 512 | "I'll use `numpy.random.normal` to generate fake data I might get from running such an experiment:" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": { 519 | "collapsed": true 520 | }, 521 | "outputs": [], 522 | "source": [ 523 | "group1 = numpy.random.normal(100, 15, size=100)\n", 524 | "group2 = numpy.random.normal(100, 15, size=100)" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "We expect the mean in both groups to be near 100, but just by random chance, it might be higher or lower." 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": { 538 | "collapsed": true 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "group1.mean(), group2.mean()" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "We can use DiffMeansPermute to compute the p-value for this fake data, which is the probability that we would see a difference between the groups as big as what we saw, just by chance." 
550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": { 556 | "collapsed": true 557 | }, 558 | "outputs": [], 559 | "source": [ 560 | "data = (group1, group2)\n", 561 | "ht = DiffMeansPermute(data)\n", 562 | "p_value = ht.PValue(iters=1000)\n", 563 | "p_value" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "Now let's check the p-value. If it's less than 0.05, the result is statistically significant, and we can publish it. Otherwise, we can try again." 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": { 577 | "collapsed": true 578 | }, 579 | "outputs": [], 580 | "source": [ 581 | "if p_value < 0.05:\n", 582 | " print('Congratulations! Publish it!')\n", 583 | "else:\n", 584 | " print('Too bad! Please try again.')" 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": {}, 590 | "source": [ 591 | "You can probably see where this is going. If we play this game over and over (or if many researchers play it in parallel), the false positive rate can be as high as 100%.\n", 592 | "\n", 593 | "To see this more clearly, let's simulate 100 researchers playing this game. 
I'll take the code we have so far and wrap it in a function:" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "metadata": { 600 | "collapsed": true 601 | }, 602 | "outputs": [], 603 | "source": [ 604 | "def run_a_test(sample_size=100):\n", 605 | " \"\"\"Generate random data and run a hypothesis test on it.\n", 606 | "\n", 607 | " sample_size: integer\n", 608 | "\n", 609 | " returns: p-value\n", 610 | " \"\"\"\n", 611 | " group1 = numpy.random.normal(100, 15, size=sample_size)\n", 612 | " group2 = numpy.random.normal(100, 15, size=sample_size)\n", 613 | " data = (group1, group2)\n", 614 | " ht = DiffMeansPermute(data)\n", 615 | " p_value = ht.PValue(iters=200)\n", 616 | " return p_value" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "Now let's run that function 100 times and save the p-values." 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": null, 629 | "metadata": { 630 | "collapsed": true 631 | }, 632 | "outputs": [], 633 | "source": [ 634 | "num_experiments = 100\n", 635 | "p_values = numpy.array([run_a_test() for i in range(num_experiments)])\n", 636 | "sum(p_values < 0.05)" 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": {}, 642 | "source": [ 643 | "On average, we expect to get a false positive about 5 times out of 100. To see why, let's plot the histogram of the p-values we got." 
644 | ] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "execution_count": null, 649 | "metadata": { 650 | "collapsed": true 651 | }, 652 | "outputs": [], 653 | "source": [ 654 | "bins = numpy.linspace(0, 1, 21)\n", 655 | "bins" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": { 662 | "collapsed": true 663 | }, 664 | "outputs": [], 665 | "source": [ 666 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 667 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 668 | "pyplot.xlabel('p-value')\n", 669 | "pyplot.ylabel('count')\n", 670 | "None" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "The distribution of p-values is uniform from 0 to 1, so it falls below 0.05 about 5% of the time.\n", 678 | "\n", 679 | "**Exercise:** If the threshold for statistical significance is 5%, the probability of a false positive is 5%. You might hope that things would get better with larger sample sizes, but they don't. Run this experiment again with a larger sample size, and see for yourself." 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "### Part four\n", 687 | "\n", 688 | "In the previous section, we computed the false positive rate, which is the probability of seeing a \"statistically significant\" result even when there is no actual difference between the groups.\n", 689 | "\n", 690 | "Now let's ask the complementary question: if there really is a difference between groups, what is the chance of seeing a \"statistically significant\" result?\n", 691 | "\n", 692 | "The answer to this question is called the \"power\" of the test. It depends on the sample size (unlike the false positive rate), and it also depends on how big the actual difference is.\n", 693 | "\n", 694 | "We can estimate the power of a test by running simulations similar to the ones in the previous section. 
Here's a version of `run_a_test` that takes the actual difference between groups as a parameter:" 695 | ] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": null, 700 | "metadata": { 701 | "collapsed": true 702 | }, 703 | "outputs": [], 704 | "source": [ 705 | "def run_a_test2(actual_diff, sample_size=100):\n", 706 | " \"\"\"Generate random data and run a hypothesis test on it.\n", 707 | "\n", 708 | " actual_diff: The actual difference between groups.\n", 709 | " sample_size: integer\n", 710 | "\n", 711 | " returns: p-value\n", 712 | " \"\"\"\n", 713 | " group1 = numpy.random.normal(100, 15, \n", 714 | " size=sample_size)\n", 715 | " group2 = numpy.random.normal(100 + actual_diff, 15, \n", 716 | " size=sample_size)\n", 717 | " data = (group1, group2)\n", 718 | " ht = DiffMeansPermute(data)\n", 719 | " p_value = ht.PValue(iters=200)\n", 720 | " return p_value" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "Now let's run it 100 times with an actual difference of 5:" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": null, 733 | "metadata": { 734 | "collapsed": true 735 | }, 736 | "outputs": [], 737 | "source": [ 738 | "p_values = numpy.array([run_a_test2(5) for i in range(100)])\n", 739 | "sum(p_values < 0.05)" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "With sample size 100 and an actual difference of 5, the power of the test is approximately 65%. 
That means if we ran this hypothetical experiment 100 times, we'd expect a statistically significant result about 65 times.\n", 747 | "\n", 748 | "That's pretty good, but it also means we would NOT get a statistically significant result about 35 times, which is a lot.\n", 749 | "\n", 750 | "Again, let's look at the distribution of p-values:" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": { 757 | "collapsed": true 758 | }, 759 | "outputs": [], 760 | "source": [ 761 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 762 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 763 | "pyplot.xlabel('p-value')\n", 764 | "pyplot.ylabel('count')\n", 765 | "None" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "Here's the point of this example: if you get a negative result (no statistical significance), that is not always strong evidence that there is no difference between the groups. It is also possible that the power of the test was too low; that is, that it was unlikely to produce a positive result, even if there is a difference between the groups.\n", 773 | "\n", 774 | "**Exercise:** Assuming that the actual difference between the groups is 5, what sample size is needed to get the power of the test up to 80%? And if the actual difference is only 2, what sample size do we need to reach 80%?" 
775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": { 781 | "collapsed": true 782 | }, 783 | "outputs": [], 784 | "source": [] 785 | } 786 | ], 787 | "metadata": { 788 | "kernelspec": { 789 | "display_name": "Python 3", 790 | "language": "python", 791 | "name": "python3" 792 | }, 793 | "language_info": { 794 | "codemirror_mode": { 795 | "name": "ipython", 796 | "version": 3 797 | }, 798 | "file_extension": ".py", 799 | "mimetype": "text/x-python", 800 | "name": "python", 801 | "nbconvert_exporter": "python", 802 | "pygments_lexer": "ipython3", 803 | "version": "3.6.1" 804 | } 805 | }, 806 | "nbformat": 4, 807 | "nbformat_minor": 1 808 | } 809 | -------------------------------------------------------------------------------- /hypothesis.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2010 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function, division 9 | 10 | import nsfg 11 | import nsfg2 12 | import first 13 | 14 | import thinkstats2 15 | import thinkplot 16 | 17 | import copy 18 | import random 19 | import numpy as np 20 | import matplotlib.pyplot as pyplot 21 | 22 | 23 | class CoinTest(thinkstats2.HypothesisTest): 24 | """Tests the hypothesis that a coin is fair.""" 25 | 26 | def TestStatistic(self, data): 27 | """Computes the test statistic. 28 | 29 | data: data in whatever form is relevant 30 | """ 31 | heads, tails = data 32 | test_stat = abs(heads - tails) 33 | return test_stat 34 | 35 | def RunModel(self): 36 | """Run the model of the null hypothesis. 
37 | 38 | returns: simulated data 39 | """ 40 | heads, tails = self.data 41 | n = heads + tails 42 | sample = [random.choice('HT') for _ in range(n)] 43 | hist = thinkstats2.Hist(sample) 44 | data = hist['H'], hist['T'] 45 | return data 46 | 47 | 48 | class DiffMeansPermute(thinkstats2.HypothesisTest): 49 | """Tests a difference in means by permutation.""" 50 | 51 | def TestStatistic(self, data): 52 | """Computes the test statistic. 53 | 54 | data: data in whatever form is relevant 55 | """ 56 | group1, group2 = data 57 | test_stat = abs(group1.mean() - group2.mean()) 58 | return test_stat 59 | 60 | def MakeModel(self): 61 | """Build a model of the null hypothesis. 62 | """ 63 | group1, group2 = self.data 64 | self.n, self.m = len(group1), len(group2) 65 | self.pool = np.hstack((group1, group2)) 66 | 67 | def RunModel(self): 68 | """Run the model of the null hypothesis. 69 | 70 | returns: simulated data 71 | """ 72 | np.random.shuffle(self.pool) 73 | data = self.pool[:self.n], self.pool[self.n:] 74 | return data 75 | 76 | 77 | class DiffMeansOneSided(DiffMeansPermute): 78 | """Tests a one-sided difference in means by permutation.""" 79 | 80 | def TestStatistic(self, data): 81 | """Computes the test statistic. 82 | 83 | data: data in whatever form is relevant 84 | """ 85 | group1, group2 = data 86 | test_stat = group1.mean() - group2.mean() 87 | return test_stat 88 | 89 | 90 | class DiffStdPermute(DiffMeansPermute): 91 | """Tests a one-sided difference in standard deviation by permutation.""" 92 | 93 | def TestStatistic(self, data): 94 | """Computes the test statistic. 95 | 96 | data: data in whatever form is relevant 97 | """ 98 | group1, group2 = data 99 | test_stat = group1.std() - group2.std() 100 | return test_stat 101 | 102 | 103 | class CorrelationPermute(thinkstats2.HypothesisTest): 104 | """Tests correlations by permutation.""" 105 | 106 | def TestStatistic(self, data): 107 | """Computes the test statistic. 
108 | 109 | data: tuple of xs and ys 110 | """ 111 | xs, ys = data 112 | test_stat = abs(thinkstats2.Corr(xs, ys)) 113 | return test_stat 114 | 115 | def RunModel(self): 116 | """Run the model of the null hypothesis. 117 | 118 | returns: simulated data 119 | """ 120 | xs, ys = self.data 121 | xs = np.random.permutation(xs) 122 | return xs, ys 123 | 124 | 125 | class DiceTest(thinkstats2.HypothesisTest): 126 | """Tests whether a six-sided die is fair.""" 127 | 128 | def TestStatistic(self, data): 129 | """Computes the test statistic. 130 | 131 | data: list of frequencies 132 | """ 133 | observed = data 134 | n = sum(observed) 135 | expected = np.ones(6) * n / 6 136 | test_stat = sum(abs(observed - expected)) 137 | return test_stat 138 | 139 | def RunModel(self): 140 | """Run the model of the null hypothesis. 141 | 142 | returns: simulated data 143 | """ 144 | n = sum(self.data) 145 | values = [1,2,3,4,5,6] 146 | rolls = np.random.choice(values, n, replace=True) 147 | hist = thinkstats2.Hist(rolls) 148 | freqs = hist.Freqs(values) 149 | return freqs 150 | 151 | 152 | class DiceChiTest(DiceTest): 153 | """Tests a six-sided die using a chi-squared statistic.""" 154 | 155 | def TestStatistic(self, data): 156 | """Computes the test statistic. 157 | 158 | data: list of frequencies 159 | """ 160 | observed = data 161 | n = sum(observed) 162 | expected = np.ones(6) * n / 6 163 | test_stat = sum((observed - expected)**2 / expected) 164 | return test_stat 165 | 166 | 167 | class PregLengthTest(thinkstats2.HypothesisTest): 168 | """Tests difference in pregnancy length using a chi-squared statistic.""" 169 | 170 | def TestStatistic(self, data): 171 | """Computes the test statistic. 172 | 173 | data: pair of lists of pregnancy lengths 174 | """ 175 | firsts, others = data 176 | stat = self.ChiSquared(firsts) + self.ChiSquared(others) 177 | return stat 178 | 179 | def ChiSquared(self, lengths): 180 | """Computes the chi-squared statistic. 
181 | 182 | lengths: sequence of lengths 183 | 184 | returns: float 185 | """ 186 | hist = thinkstats2.Hist(lengths) 187 | observed = np.array(hist.Freqs(self.values)) 188 | expected = self.expected_probs * len(lengths) 189 | stat = sum((observed - expected)**2 / expected) 190 | return stat 191 | 192 | def MakeModel(self): 193 | """Build a model of the null hypothesis. 194 | """ 195 | firsts, others = self.data 196 | self.n = len(firsts) 197 | self.pool = np.hstack((firsts, others)) 198 | 199 | pmf = thinkstats2.Pmf(self.pool) 200 | self.values = range(35, 44) 201 | self.expected_probs = np.array(pmf.Probs(self.values)) 202 | 203 | def RunModel(self): 204 | """Run the model of the null hypothesis. 205 | 206 | returns: simulated data 207 | """ 208 | np.random.shuffle(self.pool) 209 | data = self.pool[:self.n], self.pool[self.n:] 210 | return data 211 | 212 | 213 | def RunDiceTest(): 214 | """Tests whether a die is fair. 215 | """ 216 | data = [8, 9, 19, 5, 8, 11] 217 | dt = DiceTest(data) 218 | print('dice test', dt.PValue(iters=10000)) 219 | dt = DiceChiTest(data) 220 | print('dice chi test', dt.PValue(iters=10000)) 221 | 222 | 223 | def FalseNegRate(data, num_runs=1000): 224 | """Computes the chance of a false negative based on resampling. 225 | 226 | data: pair of sequences 227 | num_runs: how many experiments to simulate 228 | 229 | returns: float false negative rate 230 | """ 231 | group1, group2 = data 232 | count = 0 233 | 234 | for i in range(num_runs): 235 | sample1 = thinkstats2.Resample(group1) 236 | sample2 = thinkstats2.Resample(group2) 237 | ht = DiffMeansPermute((sample1, sample2)) 238 | p_value = ht.PValue(iters=101) 239 | if p_value > 0.05: 240 | count += 1 241 | 242 | return count / num_runs 243 | 244 | 245 | def PrintTest(p_value, ht): 246 | """Prints results from a hypothesis test. 
247 | 248 | p_value: float 249 | ht: HypothesisTest 250 | """ 251 | print('p-value =', p_value) 252 | print('actual =', ht.actual) 253 | print('ts max =', ht.MaxTestStat()) 254 | 255 | 256 | def RunTests(data, iters=1000): 257 | """Runs several tests on the given data. 258 | 259 | data: pair of sequences 260 | iters: number of iterations to run 261 | """ 262 | 263 | # test the difference in means 264 | ht = DiffMeansPermute(data) 265 | p_value = ht.PValue(iters=iters) 266 | print('\nmeans permute two-sided') 267 | PrintTest(p_value, ht) 268 | 269 | ht.PlotCdf() 270 | thinkplot.Save(root='hypothesis1', 271 | title='Permutation test', 272 | xlabel='difference in means (weeks)', 273 | ylabel='CDF', 274 | legend=False) 275 | 276 | # test the difference in means one-sided 277 | ht = DiffMeansOneSided(data) 278 | p_value = ht.PValue(iters=iters) 279 | print('\nmeans permute one-sided') 280 | PrintTest(p_value, ht) 281 | 282 | # test the difference in std 283 | ht = DiffStdPermute(data) 284 | p_value = ht.PValue(iters=iters) 285 | print('\nstd permute one-sided') 286 | PrintTest(p_value, ht) 287 | 288 | 289 | def ReplicateTests(): 290 | """Replicates tests with the new NSFG data.""" 291 | 292 | live, firsts, others = nsfg2.MakeFrames() 293 | 294 | # compare pregnancy lengths 295 | print('\nprglngth2') 296 | data = firsts.prglngth.values, others.prglngth.values 297 | ht = DiffMeansPermute(data) 298 | p_value = ht.PValue(iters=1000) 299 | print('means permute two-sided') 300 | PrintTest(p_value, ht) 301 | 302 | print('\nbirth weight 2') 303 | data = (firsts.totalwgt_lb.dropna().values, 304 | others.totalwgt_lb.dropna().values) 305 | ht = DiffMeansPermute(data) 306 | p_value = ht.PValue(iters=1000) 307 | print('means permute two-sided') 308 | PrintTest(p_value, ht) 309 | 310 | # test correlation 311 | live2 = live.dropna(subset=['agepreg', 'totalwgt_lb']) 312 | data = live2.agepreg.values, live2.totalwgt_lb.values 313 | ht = CorrelationPermute(data) 314 | p_value = 
ht.PValue() 315 | print('\nage weight correlation 2') 316 | PrintTest(p_value, ht) 317 | 318 | # compare pregnancy lengths (chi-squared) 319 | data = firsts.prglngth.values, others.prglngth.values 320 | ht = PregLengthTest(data) 321 | p_value = ht.PValue() 322 | print('\npregnancy length chi-squared 2') 323 | PrintTest(p_value, ht) 324 | 325 | 326 | def main(): 327 | thinkstats2.RandomSeed(17) 328 | 329 | # run the coin test 330 | ct = CoinTest((140, 110)) 331 | pvalue = ct.PValue() 332 | print('coin test p-value', pvalue) 333 | 334 | # compare pregnancy lengths 335 | print('\nprglngth') 336 | live, firsts, others = first.MakeFrames() 337 | data = firsts.prglngth.values, others.prglngth.values 338 | RunTests(data) 339 | 340 | # compare birth weights 341 | print('\nbirth weight') 342 | data = (firsts.totalwgt_lb.dropna().values, 343 | others.totalwgt_lb.dropna().values) 344 | ht = DiffMeansPermute(data) 345 | p_value = ht.PValue(iters=1000) 346 | print('means permute two-sided') 347 | PrintTest(p_value, ht) 348 | 349 | # test correlation 350 | live2 = live.dropna(subset=['agepreg', 'totalwgt_lb']) 351 | data = live2.agepreg.values, live2.totalwgt_lb.values 352 | ht = CorrelationPermute(data) 353 | p_value = ht.PValue() 354 | print('\nage weight correlation') 355 | print('n=', len(live2)) 356 | PrintTest(p_value, ht) 357 | 358 | # run the dice test 359 | RunDiceTest() 360 | 361 | # compare pregnancy lengths (chi-squared) 362 | data = firsts.prglngth.values, others.prglngth.values 363 | ht = PregLengthTest(data) 364 | p_value = ht.PValue() 365 | print('\npregnancy length chi-squared') 366 | PrintTest(p_value, ht) 367 | 368 | # compute the false negative rate for difference in pregnancy length 369 | data = firsts.prglngth.values, others.prglngth.values 370 | neg_rate = FalseNegRate(data) 371 | print('false neg rate', neg_rate) 372 | 373 | # run the tests with new nsfg data 374 | ReplicateTests() 375 | 376 | 377 | if __name__ == "__main__": 378 | main() 379 | 
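
As an aside, the permutation test that `DiffMeansPermute` implements above can be sketched without the `thinkstats2` machinery. The following standalone version (the function name `permutation_p_value` and the synthetic data are illustrative, not part of the original file) mirrors `TestStatistic`, `MakeModel`, and `RunModel` in a dozen lines:

```python
import numpy as np

def permutation_p_value(group1, group2, iters=1000, seed=17):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of shuffled pools whose difference in means
    is at least as large as the observed difference.
    """
    rng = np.random.default_rng(seed)
    # observed test statistic (as in TestStatistic)
    actual = abs(group1.mean() - group2.mean())
    # pooled sample under the null hypothesis (as in MakeModel)
    pool = np.hstack((group1, group2))
    n = len(group1)
    count = 0
    for _ in range(iters):
        # shuffle and re-split (as in RunModel)
        rng.shuffle(pool)
        test_stat = abs(pool[:n].mean() - pool[n:].mean())
        count += test_stat >= actual
    return count / iters

# two groups drawn from the same distribution, so any apparent
# difference is due to chance
rng = np.random.default_rng(17)
group1 = rng.normal(100, 15, size=100)
group2 = rng.normal(100, 15, size=100)
print(permutation_p_value(group1, group2))
```

With two samples drawn from the same distribution, the returned p-value is uniformly distributed between 0 and 1, which is the point the notebooks above make about false positives.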
-------------------------------------------------------------------------------- /hypothesis_soln.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Hypothesis Testing\n", 8 | "==================\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "from __future__ import print_function, division\n", 22 | "\n", 23 | "import numpy\n", 24 | "import scipy.stats\n", 25 | "\n", 26 | "import matplotlib.pyplot as pyplot\n", 27 | "\n", 28 | "import first\n", 29 | "\n", 30 | "# some nicer colors from http://colorbrewer2.org/\n", 31 | "COLOR1 = '#7fc97f'\n", 32 | "COLOR2 = '#beaed4'\n", 33 | "COLOR3 = '#fdc086'\n", 34 | "COLOR4 = '#ffff99'\n", 35 | "COLOR5 = '#386cb0'\n", 36 | "\n", 37 | "%matplotlib inline" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "## Part One" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Suppose you observe an apparent difference between two groups and you want to check whether it might be due to chance.\n", 52 | "\n", 53 | "As an example, we'll look at differences between first babies and others. The `first` module provides code to read data from the National Survey of Family Growth (NSFG)." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "live, firsts, others = first.MakeFrames()" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "We'll look at a couple of variables, including pregnancy length and birth weight. 
The effect size we'll consider is the difference in the means.\n", 70 | "\n", 71 | "Other examples might include a correlation between variables or a coefficient in a linear regression. The number that quantifies the size of the effect is called the \"test statistic\"." 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "def TestStatistic(data):\n", 81 | " group1, group2 = data\n", 82 | " test_stat = abs(group1.mean() - group2.mean())\n", 83 | " return test_stat" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "For the first example, I extract the pregnancy length for first babies and others. The results are pandas Series objects." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "group1 = firsts.prglngth\n", 100 | "group2 = others.prglngth" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "The actual difference in the means is 0.078 weeks, which is only 13 hours." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "actual = TestStatistic((group1, group2))\n", 117 | "actual" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "The null hypothesis is that there is no difference between the groups. We can model that by forming a pooled sample that includes first babies and others." 
125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "n, m = len(group1), len(group2)\n", 134 | "pool = numpy.hstack((group1, group2))" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Then we can simulate the null hypothesis by shuffling the pool and dividing it into two groups, using the same sizes as the actual sample." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "def RunModel():\n", 151 | " numpy.random.shuffle(pool)\n", 152 | " data = pool[:n], pool[n:]\n", 153 | " return data" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "The result of running the model is two NumPy arrays with the shuffled pregnancy lengths:" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "RunModel()" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "Then we compute the same test statistic using the simulated data:" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "TestStatistic(RunModel())" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "If we run the model 1000 times and compute the test statistic, we can see how much the test statistic varies under the null hypothesis." 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "test_stats = numpy.array([TestStatistic(RunModel()) for i in range(1000)])\n", 202 | "test_stats.shape" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "Here's the sampling distribution of the test statistic under the null hypothesis, with the actual difference in means indicated by a gray line." 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "pyplot.axvline(actual, linewidth=3, color='0.8')\n", 219 | "pyplot.hist(test_stats, color=COLOR5)\n", 220 | "pyplot.xlabel('difference in means')\n", 221 | "pyplot.ylabel('count')\n", 222 | "None" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "The p-value is the probability that the test statistic under the null hypothesis exceeds the actual value." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "pvalue = sum(test_stats >= actual) / len(test_stats)\n", 239 | "pvalue" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "In this case the result is about 15%, which means that even if there is no difference between the groups, it is plausible that we could see a sample difference as big as 0.078 weeks.\n", 247 | "\n", 248 | "We conclude that the apparent effect might be due to chance, so we are not confident that it would appear in the general population, or in another sample from the same population.\n", 249 | "\n", 250 | "STOP HERE\n", 251 | "---------" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Part Two\n", 259 | "========\n", 260 | "\n", 261 | "We can take the pieces from the previous section and organize them in a class that represents the structure of a hypothesis test." 
262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "class HypothesisTest(object):\n", 271 | " \"\"\"Represents a hypothesis test.\"\"\"\n", 272 | "\n", 273 | " def __init__(self, data):\n", 274 | " \"\"\"Initializes.\n", 275 | "\n", 276 | " data: data in whatever form is relevant\n", 277 | " \"\"\"\n", 278 | " self.data = data\n", 279 | " self.MakeModel()\n", 280 | " self.actual = self.TestStatistic(data)\n", 281 | " self.test_stats = None\n", 282 | "\n", 283 | " def PValue(self, iters=1000):\n", 284 | " \"\"\"Computes the distribution of the test statistic and p-value.\n", 285 | "\n", 286 | " iters: number of iterations\n", 287 | "\n", 288 | " returns: float p-value\n", 289 | " \"\"\"\n", 290 | " self.test_stats = numpy.array([self.TestStatistic(self.RunModel()) \n", 291 | " for _ in range(iters)])\n", 292 | "\n", 293 | " count = sum(self.test_stats >= self.actual)\n", 294 | " return count / iters\n", 295 | "\n", 296 | " def MaxTestStat(self):\n", 297 | " \"\"\"Returns the largest test statistic seen during simulations.\n", 298 | " \"\"\"\n", 299 | " return max(self.test_stats)\n", 300 | "\n", 301 | " def PlotHist(self, label=None):\n", 302 | " \"\"\"Draws a histogram of the test statistics, with a vertical line at the observed test stat.\n", 303 | " \"\"\"\n", 304 | " pyplot.hist(self.test_stats, color=COLOR4)\n", 305 | " pyplot.axvline(self.actual, linewidth=3, color='0.8')\n", 306 | " pyplot.xlabel('test statistic')\n", 307 | " pyplot.ylabel('count')\n", 308 | "\n", 309 | " def TestStatistic(self, data):\n", 310 | " \"\"\"Computes the test statistic.\n", 311 | "\n", 312 | " data: data in whatever form is relevant \n", 313 | " \"\"\"\n", 314 | " raise UnimplementedMethodException()\n", 315 | "\n", 316 | " def MakeModel(self):\n", 317 | " \"\"\"Build a model of the null hypothesis.\n", 318 | " \"\"\"\n", 319 | " pass\n", 320 | "\n", 321 | " def RunModel(self):\n", 322 | " \"\"\"Run the model of the 
null hypothesis.\n", 323 | "\n", 324 | " returns: simulated data\n", 325 | " \"\"\"\n", 326 | " raise UnimplementedMethodException()\n" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "`HypothesisTest` is an abstract parent class that encodes the template. Child classes fill in the missing methods. For example, here's the test from the previous section." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "class DiffMeansPermute(HypothesisTest):\n", 343 | " \"\"\"Tests a difference in means by permutation.\"\"\"\n", 344 | "\n", 345 | " def TestStatistic(self, data):\n", 346 | " \"\"\"Computes the test statistic.\n", 347 | "\n", 348 | " data: data in whatever form is relevant \n", 349 | " \"\"\"\n", 350 | " group1, group2 = data\n", 351 | " test_stat = abs(group1.mean() - group2.mean())\n", 352 | " return test_stat\n", 353 | "\n", 354 | " def MakeModel(self):\n", 355 | " \"\"\"Build a model of the null hypothesis.\n", 356 | " \"\"\"\n", 357 | " group1, group2 = self.data\n", 358 | " self.n, self.m = len(group1), len(group2)\n", 359 | " self.pool = numpy.hstack((group1, group2))\n", 360 | "\n", 361 | " def RunModel(self):\n", 362 | " \"\"\"Run the model of the null hypothesis.\n", 363 | "\n", 364 | " returns: simulated data\n", 365 | " \"\"\"\n", 366 | " numpy.random.shuffle(self.pool)\n", 367 | " data = self.pool[:self.n], self.pool[self.n:]\n", 368 | " return data" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "Now we can run the test by instantiating a DiffMeansPermute object:" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "data = (firsts.prglngth, others.prglngth)\n", 385 | "ht = DiffMeansPermute(data)\n", 386 | "p_value = ht.PValue(iters=1000)\n", 387 | 
"print('\\nmeans permute pregnancy length')\n", 388 | "print('p-value =', p_value)\n", 389 | "print('actual =', ht.actual)\n", 390 | "print('ts max =', ht.MaxTestStat())" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "And we can plot the sampling distribution of the test statistic under the null hypothesis." 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "ht.PlotHist()" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "### Difference in standard deviation\n", 414 | "\n", 415 | "**Exercize 1**: Write a class named `DiffStdPermute` that extends `DiffMeansPermute` and overrides `TestStatistic` to compute the difference in standard deviations. Is the difference in standard deviations statistically significant?" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "# Solution goes here\n", 425 | "\n", 426 | "class DiffStdPermute(DiffMeansPermute):\n", 427 | " \"\"\"Tests a difference in means by permutation.\"\"\"\n", 428 | "\n", 429 | " def TestStatistic(self, data):\n", 430 | " \"\"\"Computes the test statistic.\n", 431 | "\n", 432 | " data: data in whatever form is relevant \n", 433 | " \"\"\"\n", 434 | " group1, group2 = data\n", 435 | " test_stat = abs(group1.std() - group2.std())\n", 436 | " return test_stat" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "Here's the code to test your solution to the previous exercise." 
444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "data = (firsts.prglngth, others.prglngth)\n", 453 | "ht = DiffStdPermute(data)\n", 454 | "p_value = ht.PValue(iters=1000)\n", 455 | "print('\\nstd permute pregnancy length')\n", 456 | "print('p-value =', p_value)\n", 457 | "print('actual =', ht.actual)\n", 458 | "print('ts max =', ht.MaxTestStat())" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "### Difference in birth weights\n", 466 | "\n", 467 | "Now let's run DiffMeansPermute again to see if there is a difference in birth weight between first babies and others." 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "data = (firsts.totalwgt_lb.dropna(), others.totalwgt_lb.dropna())\n", 477 | "ht = DiffMeansPermute(data)\n", 478 | "p_value = ht.PValue(iters=1000)\n", 479 | "print('\\nmeans permute birthweight')\n", 480 | "print('p-value =', p_value)\n", 481 | "print('actual =', ht.actual)\n", 482 | "print('ts max =', ht.MaxTestStat())" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "In this case, after 1000 attempts, we never see a sample difference as big as the observed difference, so we conclude that the apparent effect is unlikely under the null hypothesis. Under normal circumstances, we can also make the inference that the apparent effect is unlikely to be caused by random sampling.\n", 490 | "\n", 491 | "One final note: in this case I would report that the p-value is less than 1/1000 or less than 0.001. I would not report p=0, because the apparent effect is not impossible under the null hypothesis; just unlikely." 
492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "### Part Three\n", 499 | "\n", 500 | "In this section, we'll explore the dangers of p-hacking by running multiple tests until we find one that's statistically significant.\n", 501 | "\n", 502 | "Suppose we want to compare IQs for two groups of people. And suppose that, in fact, the two groups are statistically identical; that is, their IQs are drawn from a normal distribution with mean 100 and standard deviation 15.\n", 503 | "\n", 504 | "I'll use `numpy.random.normal` to generate fake data I might get from running such an experiment:" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "group1 = numpy.random.normal(100, 15, size=100)\n", 514 | "group2 = numpy.random.normal(100, 15, size=100)" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "We expect the mean in both groups to be near 100, but just by random chance, it might be higher or lower." 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": {}, 528 | "outputs": [], 529 | "source": [ 530 | "group1.mean(), group2.mean()" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "We can use DiffMeansPermute to compute the p-value for this fake data, which is the probability that we would see a difference between the groups as big as what we saw, just by chance." 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "metadata": {}, 544 | "outputs": [], 545 | "source": [ 546 | "data = (group1, group2)\n", 547 | "ht = DiffMeansPermute(data)\n", 548 | "p_value = ht.PValue(iters=1000)\n", 549 | "p_value" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "Now let's check the p-value. 
If it's less than 0.05, the result is statistically significant, and we can publish it. Otherwise, we can try again." 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": null, 562 | "metadata": {}, 563 | "outputs": [], 564 | "source": [ 565 | "if p_value < 0.05:\n", 566 | " print('Congratulations! Publish it!')\n", 567 | "else:\n", 568 | " print('Too bad! Please try again.')" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "You can probably see where this is going. If we play this game over and over (or if many researchers play it in parallel), the false positive rate can be as high as 100%.\n", 576 | "\n", 577 | "To see this more clearly, let's simulate 100 researchers playing this game. I'll take the code we have so far and wrap it in a function:" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": { 584 | "collapsed": true 585 | }, 586 | "outputs": [], 587 | "source": [ 588 | "def run_a_test(sample_size=100):\n", 589 | " \"\"\"Generate random data and run a hypothesis test on it.\n", 590 | "\n", 591 | " sample_size: integer\n", 592 | "\n", 593 | " returns: p-value\n", 594 | " \"\"\"\n", 595 | " group1 = numpy.random.normal(100, 15, size=sample_size)\n", 596 | " group2 = numpy.random.normal(100, 15, size=sample_size)\n", 597 | " data = (group1, group2)\n", 598 | " ht = DiffMeansPermute(data)\n", 599 | " p_value = ht.PValue(iters=200)\n", 600 | " return p_value" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "Now let's run that function 100 times and save the p-values." 
608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": null, 613 | "metadata": {}, 614 | "outputs": [], 615 | "source": [ 616 | "num_experiments = 100\n", 617 | "p_values = numpy.array([run_a_test() for i in range(num_experiments)])\n", 618 | "sum(p_values < 0.05)" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "metadata": {}, 624 | "source": [ 625 | "On average, we expect to get a false positive about 5 times out of 100. To see why, let's plot the histogram of the p-values we got." 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "execution_count": null, 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "bins = numpy.linspace(0, 1, 21)\n", 635 | "bins" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 645 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 646 | "pyplot.xlabel('p-value')\n", 647 | "pyplot.ylabel('count')\n", 648 | "None" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "The distribution of p-values is uniform from 0 to 1. So it falls below 5% about 5% of the time.\n", 656 | "\n", 657 | "**Exercise:** If the threshold for statistical significance is 5%, the probability of a false positive is 5%. You might hope that things would get better with larger sample sizes, but they don't. Run this experiment again with a larger sample size, and see for yourself."
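    ,
    "\n",
    "\n",
    "One way to set that up (a sketch, reusing `run_a_test` from above) is to loop over a few sample sizes and count the false positives at each:\n",
    "\n",
    "```python\n",
    "for sample_size in [100, 400, 1600]:\n",
    "    p_values = numpy.array([run_a_test(sample_size) for i in range(100)])\n",
    "    print(sample_size, sum(p_values < 0.05))\n",
    "```"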
658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": {}, 663 | "source": [ 664 | "### Part Four\n", 665 | "\n", 666 | "In the previous section, we computed the false positive rate, which is the probability of seeing a \"statistically significant\" result, even if there is no statistical difference between groups.\n", 667 | "\n", 668 | "Now let's ask the complementary question: if there really is a difference between groups, what is the chance of seeing a \"statistically significant\" result?\n", 669 | "\n", 670 | "The answer to this question is called the \"power\" of the test. It depends on the sample size (unlike the false positive rate), and it also depends on how big the actual difference is.\n", 671 | "\n", 672 | "We can estimate the power of a test by running simulations similar to the ones in the previous section. Here's a version of `run_a_test` that takes the actual difference between groups as a parameter:" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": { 679 | "collapsed": true 680 | }, 681 | "outputs": [], 682 | "source": [ 683 | "def run_a_test2(actual_diff, sample_size=100):\n", 684 | "    \"\"\"Generate random data and run a hypothesis test on it.\n", 685 | "\n", 686 | "    actual_diff: The actual difference between groups.\n", 687 | "    sample_size: integer\n", 688 | "\n", 689 | "    returns: p-value\n", 690 | "    \"\"\"\n", 691 | "    group1 = numpy.random.normal(100, 15, \n", 692 | "                                 size=sample_size)\n", 693 | "    group2 = numpy.random.normal(100 + actual_diff, 15, \n", 694 | "                                 size=sample_size)\n", 695 | "    data = (group1, group2)\n", 696 | "    ht = DiffMeansPermute(data)\n", 697 | "    p_value = ht.PValue(iters=200)\n", 698 | "    return p_value" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "Now let's run it 100 times with an actual difference of 5:" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 |
"metadata": {}, 712 | "outputs": [], 713 | "source": [ 714 | "p_values = numpy.array([run_a_test2(5) for i in range(100)])\n", 715 | "sum(p_values < 0.05)" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "With sample size 100 and an actual difference of 5, the power of the test is approximately 65%. That means if we ran this hypothetical experiment 100 times, we'd expect a statistically significant result about 65 times.\n", 723 | "\n", 724 | "That's pretty good, but it also means we would NOT get a statistically significant result about 35 times, which is a lot.\n", 725 | "\n", 726 | "Again, let's look at the distribution of p-values:" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": null, 732 | "metadata": {}, 733 | "outputs": [], 734 | "source": [ 735 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 736 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 737 | "pyplot.xlabel('p-value')\n", 738 | "pyplot.ylabel('count')\n", 739 | "None" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "Here's the point of this example: if you get a negative result (no statistical significance), that is not always strong evidence that there is no difference between the groups. It is also possible that the power of the test was too low; that is, that it was unlikely to produce a positive result, even if there is a difference between the groups.\n", 747 | "\n", 748 | "**Exercise:** Assuming that the actual difference between the groups is 5, what sample size is needed to get the power of the test up to 80%? What if the actual difference is 2, what sample size do we need to get to 80%?" 
749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "metadata": { 755 | "collapsed": true 756 | }, 757 | "outputs": [], 758 | "source": [] 759 | } 760 | ], 761 | "metadata": { 762 | "kernelspec": { 763 | "display_name": "Python 3", 764 | "language": "python", 765 | "name": "python3" 766 | }, 767 | "language_info": { 768 | "codemirror_mode": { 769 | "name": "ipython", 770 | "version": 3 771 | }, 772 | "file_extension": ".py", 773 | "mimetype": "text/x-python", 774 | "name": "python", 775 | "nbconvert_exporter": "python", 776 | "pygments_lexer": "ipython3", 777 | "version": "3.6.1" 778 | } 779 | }, 780 | "nbformat": 4, 781 | "nbformat_minor": 1 782 | } 783 | -------------------------------------------------------------------------------- /hypothesis_testing.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/hypothesis_testing.pdf -------------------------------------------------------------------------------- /hypothesis_testing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/hypothesis_testing.png -------------------------------------------------------------------------------- /hypothesis_testing_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/hypothesis_testing_small.png -------------------------------------------------------------------------------- /look_and_say.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 47, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%config 
InteractiveShell.ast_node_interactivity='last_expr_or_assign'\n", 10 | "\n", 11 | "import numpy as np" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 147, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/plain": [ 22 | "array([1, 1])" 23 | ] 24 | }, 25 | "execution_count": 147, 26 | "metadata": {}, 27 | "output_type": "execute_result" 28 | } 29 | ], 30 | "source": [ 31 | "a = np.array([1,1])" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 148, 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/plain": [ 42 | "array([1, 0, 1])" 43 | ] 44 | }, 45 | "execution_count": 148, 46 | "metadata": {}, 47 | "output_type": "execute_result" 48 | } 49 | ], 50 | "source": [ 51 | "diff = np.ediff1d(a, 1, 1)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 149, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "data": { 61 | "text/plain": [ 62 | "array([0, 2])" 63 | ] 64 | }, 65 | "execution_count": 149, 66 | "metadata": {}, 67 | "output_type": "execute_result" 68 | } 69 | ], 70 | "source": [ 71 | "index = np.nonzero(diff)[0]" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 150, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "array([2])" 83 | ] 84 | }, 85 | "execution_count": 150, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "counts = np.ediff1d(index)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 151, 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "array([1])" 103 | ] 104 | }, 105 | "execution_count": 151, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "vals = a[index[:-1]]" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 152, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "data": { 121 | 
"text/plain": [ 122 | "array([2, 1])" 123 | ] 124 | }, 125 | "execution_count": 152, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "# best way to interleave arrays\n", 132 | "# https://stackoverflow.com/questions/5347065/interweaving-two-numpy-arrays\n", 133 | "b = np.empty((vals.size + counts.size,), dtype=vals.dtype)\n", 134 | "b[0::2] = counts\n", 135 | "b[1::2] = vals\n", 136 | "b" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 153, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "def look_and_say(a):\n", 146 | " diff = np.ediff1d(a, 1, 1)\n", 147 | " index = np.nonzero(diff)[0]\n", 148 | " counts = np.ediff1d(index)\n", 149 | " vals = a[index[:-1]]\n", 150 | " c = np.empty((vals.size + counts.size,), dtype=vals.dtype)\n", 151 | " c[0::2] = counts\n", 152 | " c[1::2] = vals\n", 153 | " return c" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 154, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "array([1, 2, 1, 1])" 165 | ] 166 | }, 167 | "execution_count": 154, 168 | "metadata": {}, 169 | "output_type": "execute_result" 170 | } 171 | ], 172 | "source": [ 173 | "look_and_say(b)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 156, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "[1 1]\n", 186 | "[2 1]\n", 187 | "[1 2 1 1]\n", 188 | "[1 1 1 2 2 1]\n", 189 | "[3 1 2 2 1 1]\n", 190 | "[1 3 1 1 2 2 2 1]\n", 191 | "[1 1 1 3 2 1 3 2 1 1]\n", 192 | "[3 1 1 3 1 2 1 1 1 3 1 2 2 1]\n", 193 | "[1 3 2 1 1 3 1 1 1 2 3 1 1 3 1 1 2 2 1 1]\n", 194 | "[1 1 1 3 1 2 2 1 1 3 3 1 1 2 1 3 2 1 1 3 2 1 2 2 2 1]\n", 195 | "[3 1 1 3 1 1 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 3 2 1 1]\n", 196 | "[1 3 2 1 1 3 2 1 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1\n", 197 | " 2 2 1 1 3 1 2 2 
1]\n", 198 | "[1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 1 2 3 1 1 2 1 1 1 3 1 1 2 2 2 1 1 2 1 3 2\n", 199 | " 1 1 3 2 1 3 2 2 1 1 3 3 1 2 2 2 1 1 3 1 1 2 2 1 1]\n", 200 | "[3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3 2 1 3\n", 201 | " 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 1 1 3 2 2 1 1 3 2 1\n", 202 | " 2 2 2 1]\n", 203 | "[1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 1\n", 204 | " 2 1 3 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1\n", 205 | " 1 3 3 2 1 1 1 2 1 3 2 1 1 3 2 2 2 1 1 3 1 2 1 1 3 2 1 1]\n", 206 | "[1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 2 1 2 3 2\n", 207 | " 1 1 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 2 1\n", 208 | " 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1\n", 209 | " 2 2 1 1 3 3 2 2 1 1 3 1 1 1 2 2 1 1 3 1 2 2 1]\n", 210 | "[3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1\n", 211 | " 3 1 1 2 2 1 1 1 2 1 3 1 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 2 3 1 1 3 1 1\n", 212 | " 2 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1\n", 213 | " 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 1 1 1 2 1 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3\n", 214 | " 1 1 2 2 2 1 2 3 2 2 2 1 1 3 3 1 2 2 2 1 1 3 1 1 2 2 1 1]\n", 215 | "[1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1 1 2\n", 216 | " 2 2 1 1 2 1 3 2 1 1 3 2 1 2 2 3 1 1 2 1 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2\n", 217 | " 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 2 3 2 1 1 2 1 1 1\n", 218 | " 3 1 2 1 1 1 2 1 3 3 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2\n", 219 | " 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 1 2 3 1 1 2 1 1 1 3 3 1 1 2 1 1 1 3 1\n", 220 | " 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 1 1 1 2 1 3 3 2 2 1 2 3 1 1 3 2 2 1 1 3 2 1\n", 221 | " 2 2 2 1]\n", 222 | "[1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 1 1 1 2 1\n", 223 | " 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 1 
1 1 3 1 2 2 1 1 3 1 2 1 1\n", 224 | " 2 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3\n", 225 | " 1 1 3 3 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 1 1 1 2 1 3 1 2 2 1\n", 226 | " 1 2 3 1 1 3 1 1 1 2 3 1 1 2 1 1 2 3 2 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1\n", 227 | " 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1 1 2 2 2 1 1 2 1 3 2 1 1 3 3 1\n", 228 | " 1 2 1 3 2 1 1 2 3 1 2 3 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1\n", 229 | " 3 1 2 1 1 1 3 1 2 3 1 1 2 1 1 2 3 2 2 1 1 1 2 1 3 2 1 1 3 2 2 2 1 1 3 1 2\n", 230 | " 1 1 3 2 1 1]\n", 231 | "[3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1\n", 232 | " 3 1 1 1 2 3 1 1 2 1 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 1 2 1 3 2 1 1 3 1 2 1 1\n", 233 | " 1 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 2 1 2 2 1 1 1 3 1 2 2 1 1\n", 234 | " 2 1 3 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2\n", 235 | " 1 3 2 1 2 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 2 2 3 1 1\n", 236 | " 2 1 1 1 3 1 1 2 2 2 1 1 2 1 3 2 1 1 3 3 1 1 2 1 3 2 1 1 2 2 1 1 2 1 3 3 2\n", 237 | " 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 1\n", 238 | " 1 1 2 1 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 1 1 1 3 1 2 2 1 2 3\n", 239 | " 2 1 1 2 1 1 1 3 1 2 2 1 1 2 1 3 1 1 1 2 1 3 1 2 2 1 1 2 1 3 2 1 1 3 2 1 3\n", 240 | " 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 1 1 1 2 1 3 2 1 1 2 2 1\n", 241 | " 1 2 1 3 2 2 3 1 1 2 1 1 1 3 1 2 2 1 1 3 3 2 2 1 1 3 1 1 1 2 2 1 1 3 1 2 2\n", 242 | " 1]\n", 243 | "[1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1 1 2\n", 244 | " 2 2 1 1 2 1 3 2 1 1 3 3 1 1 2 1 3 2 1 1 2 3 1 2 3 2 1 1 2 3 1 1 3 1 1 2 2\n", 245 | " 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 2 1 1 2 1 3 2 1 1 3 2 1 3\n", 246 | " 2 2 1 1 3 3 1 2 2 1 1 2 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 1\n", 247 | " 1 2 3 1 1 3 3 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1\n", 248 | " 2 1 1 1 3 1 2 1 1 1 2 1 3 3 2 
2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2\n", 249 | " 1 3 2 1 1 3 2 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 1 1 1 3 1 2 2 1 2 3 2\n", 250 | " 1 1 2 1 1 1 3 1 2 2 1 2 2 2 1 1 2 1 1 2 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1\n", 251 | " 1 3 1 1 1 2 3 1 1 3 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 1 2 3 1 1 2 1\n", 252 | " 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 1 2 1 3 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 3 1\n", 253 | " 1 3 1 1 2 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 3 1 1\n", 254 | " 2 1 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 1 3\n", 255 | " 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 2 2 2\n", 256 | " 1 1 2 1 1 1 3 2 2 1 3 2 1 1 2 3 1 1 3 1 1 2 2 2 1 2 3 2 2 2 1 1 3 3 1 2 2\n", 257 | " 2 1 1 3 1 1 2 2 1 1]\n" 258 | ] 259 | } 260 | ], 261 | "source": [ 262 | "c = a\n", 263 | "print(c)\n", 264 | "\n", 265 | "for i in range(20):\n", 266 | " c = look_and_say(c)\n", 267 | " print(c)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [] 276 | } 277 | ], 278 | "metadata": { 279 | "kernelspec": { 280 | "display_name": "Python 3", 281 | "language": "python", 282 | "name": "python3" 283 | }, 284 | "language_info": { 285 | "codemirror_mode": { 286 | "name": "ipython", 287 | "version": 3 288 | }, 289 | "file_extension": ".py", 290 | "mimetype": "text/x-python", 291 | "name": "python", 292 | "nbconvert_exporter": "python", 293 | "pygments_lexer": "ipython3", 294 | "version": "3.6.5" 295 | } 296 | }, 297 | "nbformat": 4, 298 | "nbformat_minor": 2 299 | } 300 | -------------------------------------------------------------------------------- /nsfg.py: -------------------------------------------------------------------------------- 1 | """This file contains code for use with "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2010 Allen B. 
Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | from collections import defaultdict 11 | import numpy as np 12 | import sys 13 | 14 | import thinkstats2 15 | 16 | 17 | def ReadFemPreg(dct_file='2002FemPreg.dct', 18 | dat_file='2002FemPreg.dat.gz'): 19 | """Reads the NSFG pregnancy data. 20 | 21 | dct_file: string file name 22 | dat_file: string file name 23 | 24 | returns: DataFrame 25 | """ 26 | dct = thinkstats2.ReadStataDct(dct_file) 27 | df = dct.ReadFixedWidth(dat_file, compression='gzip') 28 | CleanFemPreg(df) 29 | return df 30 | 31 | 32 | def CleanFemPreg(df): 33 | """Recodes variables from the pregnancy frame. 34 | 35 | df: DataFrame 36 | """ 37 | # mother's age is encoded in centiyears; convert to years 38 | df.agepreg /= 100.0 39 | 40 | # birthwgt_lb contains at least one bogus value (51 lbs) 41 | # replace with NaN 42 | df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan 43 | 44 | # replace 'not ascertained', 'refused', 'don't know' with NaN 45 | na_vals = [97, 98, 99] 46 | df.birthwgt_lb.replace(na_vals, np.nan, inplace=True) 47 | df.birthwgt_oz.replace(na_vals, np.nan, inplace=True) 48 | df.hpagelb.replace(na_vals, np.nan, inplace=True) 49 | 50 | df.babysex.replace([7, 9], np.nan, inplace=True) 51 | df.nbrnaliv.replace([9], np.nan, inplace=True) 52 | 53 | # birthweight is stored in two columns, lbs and oz. 54 | # convert to a single column in lb 55 | # NOTE: creating a new column requires dictionary syntax, 56 | # not attribute assignment (like df.totalwgt_lb) 57 | df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0 58 | 59 | # due to a bug in ReadStataDct, the last variable gets clipped; 60 | # so for now set it to NaN 61 | df.cmintvw = np.nan 62 | 63 | 64 | def MakePregMap(df): 65 | """Make a map from caseid to list of preg indices. 
66 | 67 | df: DataFrame 68 | 69 | returns: dict that maps from caseid to list of indices into preg df 70 | """ 71 | d = defaultdict(list) 72 | for index, caseid in df.caseid.iteritems(): 73 | d[caseid].append(index) 74 | return d 75 | 76 | 77 | def main(script): 78 | """Tests the functions in this module. 79 | 80 | script: string script name 81 | """ 82 | df = ReadFemPreg() 83 | print(df.shape) 84 | 85 | assert len(df) == 13593 86 | 87 | assert df.caseid[13592] == 12571 88 | assert df.pregordr.value_counts()[1] == 5033 89 | assert df.nbrnaliv.value_counts()[1] == 8981 90 | assert df.babysex.value_counts()[1] == 4641 91 | assert df.birthwgt_lb.value_counts()[7] == 3049 92 | assert df.birthwgt_oz.value_counts()[0] == 1037 93 | assert df.prglngth.value_counts()[39] == 4744 94 | assert df.outcome.value_counts()[1] == 9148 95 | assert df.birthord.value_counts()[1] == 4413 96 | assert df.agepreg.value_counts()[22.75] == 100 97 | assert df.totalwgt_lb.value_counts()[7.5] == 302 98 | 99 | weights = df.finalwgt.value_counts() 100 | key = max(weights.keys()) 101 | assert df.finalwgt.value_counts()[key] == 6 102 | 103 | print('%s: All tests passed.' % script) 104 | 105 | if __name__ == '__main__': 106 | main(*sys.argv) 107 | -------------------------------------------------------------------------------- /nsfg2.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2014 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import numpy as np 11 | 12 | import thinkstats2 13 | 14 | def MakeFrames(): 15 | """Reads pregnancy data and partitions first babies and others. 
16 | 17 | returns: DataFrames (all live births, first babies, others) 18 | """ 19 | preg = ReadFemPreg() 20 | 21 | live = preg[preg.outcome == 1] 22 | firsts = live[live.birthord == 1] 23 | others = live[live.birthord != 1] 24 | 25 | assert(len(live) == 14292) 26 | assert(len(firsts) == 6683) 27 | assert(len(others) == 7609) 28 | 29 | return live, firsts, others 30 | 31 | 32 | def ReadFemPreg(dct_file='2006_2010_FemPregSetup.dct', 33 | dat_file='2006_2010_FemPreg.dat.gz'): 34 | """Reads the NSFG 2006-2010 pregnancy data. 35 | 36 | dct_file: string file name 37 | dat_file: string file name 38 | 39 | returns: DataFrame 40 | """ 41 | dct = thinkstats2.ReadStataDct(dct_file, encoding='iso-8859-1') 42 | df = dct.ReadFixedWidth(dat_file, compression='gzip') 43 | CleanFemPreg(df) 44 | return df 45 | 46 | 47 | def CleanFemPreg(df): 48 | """Recodes variables from the pregnancy frame. 49 | 50 | df: DataFrame 51 | """ 52 | # mother's age is encoded in centiyears; convert to years 53 | df.agepreg /= 100.0 54 | 55 | # birthwgt_lb contains at least one bogus value (51 lbs) 56 | # replace with NaN; use .loc to avoid assigning through a chained index 57 | df.loc[df.birthwgt_lb1 > 20, 'birthwgt_lb1'] = np.nan 58 | 59 | # replace 'not ascertained', 'refused', 'don't know' with NaN 60 | na_vals = [97, 98, 99] 61 | df.birthwgt_lb1.replace(na_vals, np.nan, inplace=True) 62 | df.birthwgt_oz1.replace(na_vals, np.nan, inplace=True) 63 | 64 | # birthweight is stored in two columns, lbs and oz.
65 | # convert to a single column in lb 66 | # NOTE: creating a new column requires dictionary syntax, 67 | # not attribute assignment (like df.totalwgt_lb) 68 | df['totalwgt_lb'] = df.birthwgt_lb1 + df.birthwgt_oz1 / 16.0 69 | 70 | # due to a bug in ReadStataDct, the last variable gets clipped; 71 | # so for now set it to NaN 72 | df.phase = np.nan 73 | 74 | 75 | def main(): 76 | live, firsts, others = MakeFrames() 77 | 78 | 79 | if __name__ == '__main__': 80 | main() 81 | 82 | 83 | -------------------------------------------------------------------------------- /resampling.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/resampling.pdf -------------------------------------------------------------------------------- /resampling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/resampling.png -------------------------------------------------------------------------------- /resampling_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/resampling_small.png -------------------------------------------------------------------------------- /sampling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Random Sampling\n", 8 | "=============\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | 
"outputs": [], 22 | "source": [ 23 | "from __future__ import print_function, division\n", 24 | "\n", 25 | "import numpy\n", 26 | "import scipy.stats\n", 27 | "\n", 28 | "import matplotlib.pyplot as pyplot\n", 29 | "\n", 30 | "from ipywidgets import interact, interactive, fixed\n", 31 | "import ipywidgets as widgets\n", 32 | "\n", 33 | "# seed the random number generator so we all get the same results\n", 34 | "numpy.random.seed(18)\n", 35 | "\n", 36 | "# some nicer colors from http://colorbrewer2.org/\n", 37 | "COLOR1 = '#7fc97f'\n", 38 | "COLOR2 = '#beaed4'\n", 39 | "COLOR3 = '#fdc086'\n", 40 | "COLOR4 = '#ffff99'\n", 41 | "COLOR5 = '#386cb0'\n", 42 | "\n", 43 | "%matplotlib inline" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "Part One\n", 51 | "========\n", 52 | "\n", 53 | "Suppose we want to estimate the average weight of men and women in the U.S.\n", 54 | "\n", 55 | "And we want to quantify the uncertainty of the estimate.\n", 56 | "\n", 57 | "One approach is to simulate many experiments and see how much the results vary from one experiment to the next.\n", 58 | "\n", 59 | "I'll start with the unrealistic assumption that we know the actual distribution of weights in the population. Then I'll show how to solve the problem without that assumption.\n", 60 | "\n", 61 | "Based on data from the [BRFSS](http://www.cdc.gov/brfss/), I found that the distribution of weight in kg for women in the U.S. 
is well modeled by a lognormal distribution with the following parameters:" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 71 | "weight.mean(), weight.std()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Here's what that distribution looks like:" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "xs = numpy.linspace(20, 160, 100)\n", 88 | "ys = weight.pdf(xs)\n", 89 | "pyplot.plot(xs, ys, linewidth=4, color=COLOR1)\n", 90 | "pyplot.xlabel('weight (kg)')\n", 91 | "pyplot.ylabel('PDF')\n", 92 | "None" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "`make_sample` draws a random sample from this distribution. The result is a NumPy array." 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": { 106 | "collapsed": true 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "def make_sample(n=100):\n", 111 | " sample = weight.rvs(n)\n", 112 | " return sample" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Here's an example with `n=100`. The mean and std of the sample are close to the mean and std of the population, but not exact." 
120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "sample = make_sample(n=100)\n", 129 | "sample.mean(), sample.std()" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "We want to estimate the average weight in the population, so the \"sample statistic\" we'll use is the mean:" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": true 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "def sample_stat(sample):\n", 148 | " return sample.mean()" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "One iteration of \"the experiment\" is to collect a sample of 100 women and compute their average weight.\n", 156 | "\n", 157 | "We can simulate running this experiment many times, and collect a list of sample statistics. The result is a NumPy array." 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "def compute_sampling_distribution(n=100, iters=1000):\n", 169 | " stats = [sample_stat(make_sample(n)) for i in range(iters)]\n", 170 | " return numpy.array(stats)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "The next line runs the simulation 1000 times and puts the results in\n", 178 | "`sample_means`:" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": true 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "sample_means = compute_sampling_distribution(n=100, iters=1000)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "Let's look at the distribution of the sample means. 
This distribution shows how much the results vary from one experiment to the next.\n", 197 | "\n", 198 | "Remember that this distribution is not the same as the distribution of weight in the population. This is the distribution of results across repeated imaginary experiments." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "pyplot.hist(sample_means, color=COLOR5)\n", 208 | "pyplot.xlabel('sample mean (n=100)')\n", 209 | "pyplot.ylabel('count')\n", 210 | "None" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "The mean of the sample means is close to the actual population mean, which is nice, but not actually the important part." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "sample_means.mean()" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "The standard deviation of the sample means quantifies the variability from one experiment to the next, and reflects the precision of the estimate.\n", 234 | "\n", 235 | "This quantity is called the \"standard error\"." 
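    ,
    "\n",
    "\n",
    "As a sanity check (a sketch, using the `weight` distribution defined above): for the sample mean, theory predicts a standard error equal to the population standard deviation divided by the square root of the sample size, which should be close to the simulated value computed below:\n",
    "\n",
    "```python\n",
    "weight.std() / numpy.sqrt(100)  # analytic prediction for n = 100\n",
    "```"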
236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "std_err = sample_means.std()\n", 245 | "std_err" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "We can also use the distribution of sample means to compute a \"90% confidence interval\", which contains 90% of the experimental results:" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "conf_int = numpy.percentile(sample_means, [5, 95])\n", 262 | "conf_int" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "Now we'd like to see what happens as we vary the sample size, `n`. The following function takes `n`, runs 1000 simulated experiments, and summarizes the results." 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": { 276 | "collapsed": true 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "def plot_sampling_distribution(n, xlim=None):\n", 281 | " \"\"\"Plot the sampling distribution.\n", 282 | " \n", 283 | " n: sample size\n", 284 | " xlim: [xmin, xmax] range for the x axis \n", 285 | " \"\"\"\n", 286 | " sample_stats = compute_sampling_distribution(n, iters=1000)\n", 287 | " se = numpy.std(sample_stats)\n", 288 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 289 | " \n", 290 | " pyplot.hist(sample_stats, color=COLOR2)\n", 291 | " pyplot.xlabel('sample statistic')\n", 292 | " pyplot.xlim(xlim)\n", 293 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 294 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 295 | " pyplot.show()\n", 296 | " \n", 297 | "def text(x, y, s):\n", 298 | " \"\"\"Plot a string at a given location in axis coordinates.\n", 299 | " \n", 300 | " x: coordinate\n", 301 | " y: coordinate\n", 302 | " s: string\n", 303 | " \"\"\"\n", 304 | " 
ax = pyplot.gca()\n", 305 | " pyplot.text(x, y, s,\n", 306 | " horizontalalignment='left',\n", 307 | " verticalalignment='top',\n", 308 | " transform=ax.transAxes)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "Here's a test run with `n=100`:" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "plot_sampling_distribution(100)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "Now we can use `interact` to run `plot_sampling_distribution` with different values of `n`. Note: `xlim` sets the limits of the x-axis so the figure doesn't get rescaled as we vary `n`." 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "def sample_stat(sample):\n", 341 | " return sample.mean()\n", 342 | "\n", 343 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 344 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([55, 95]))\n", 345 | "None" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "### Other sample statistics\n", 353 | "\n", 354 | "This framework works with any other quantity we want to estimate. 
By changing `sample_stat`, you can compute the SE and CI for any sample statistic.\n", 355 | "\n", 356 | "**Exercise 1**: Fill in `sample_stat` below with any of these statistics:\n", 357 | "\n", 358 | "* Standard deviation of the sample.\n", 359 | "* Coefficient of variation, which is the sample standard deviation divided by the sample mean.\n", 360 | "* Min or max.\n", 361 | "* Median (which is the 50th percentile).\n", 362 | "* 10th or 90th percentile.\n", 363 | "* Interquartile range (IQR), which is the difference between the 75th and 25th percentiles.\n", 364 | "\n", 365 | "NumPy array methods you might find useful include `std`, `min`, `max`, and `percentile`.\n", 366 | "Depending on the results, you might want to adjust `xlim`." 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "def sample_stat(sample):\n", 376 | " # TODO: replace the following line with another sample statistic\n", 377 | " return sample.mean()\n", 378 | "\n", 379 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 380 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([0, 100]))\n", 381 | "None" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "STOP HERE\n", 389 | "---------\n", 390 | "\n", 391 | "We will regroup and discuss before going on." 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "Part Two\n", 399 | "========\n", 400 | "\n", 401 | "So far we have shown that if we know the actual distribution of the population, we can compute the sampling distribution for any sample statistic, and from that we can compute SE and CI.\n", 402 | "\n", 403 | "But in real life we don't know the actual distribution of the population. 
If we did, we wouldn't be doing statistical inference in the first place!\n", 404 | "\n", 405 | "In real life, we use the sample to build a model of the population distribution, then use the model to generate the sampling distribution. A simple and popular way to do that is \"resampling,\" which means we use the sample itself as a model of the population distribution and draw samples from it.\n", 406 | "\n", 407 | "Before we go on, I want to collect some of the code from Part One and organize it as a class. This class represents a framework for computing sampling distributions." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": { 414 | "collapsed": true 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "class Resampler(object):\n", 419 | " \"\"\"Represents a framework for computing sampling distributions.\"\"\"\n", 420 | " \n", 421 | " def __init__(self, sample, xlim=None):\n", 422 | " \"\"\"Stores the actual sample.\"\"\"\n", 423 | " self.sample = sample\n", 424 | " self.n = len(sample)\n", 425 | " self.xlim = xlim\n", 426 | " \n", 427 | " def resample(self):\n", 428 | " \"\"\"Generates a new sample by choosing from the original\n", 429 | " sample with replacement.\n", 430 | " \"\"\"\n", 431 | " new_sample = numpy.random.choice(self.sample, self.n, replace=True)\n", 432 | " return new_sample\n", 433 | " \n", 434 | " def sample_stat(self, sample):\n", 435 | " \"\"\"Computes a sample statistic using the original sample or a\n", 436 | " simulated sample.\n", 437 | " \"\"\"\n", 438 | " return sample.mean()\n", 439 | " \n", 440 | " def compute_sampling_distribution(self, iters=1000):\n", 441 | " \"\"\"Simulates many experiments and collects the resulting sample\n", 442 | " statistics.\n", 443 | " \"\"\"\n", 444 | " stats = [self.sample_stat(self.resample()) for i in range(iters)]\n", 445 | " return numpy.array(stats)\n", 446 | " \n", 447 | " def plot_sampling_distribution(self):\n", 448 | " \"\"\"Plots the sampling 
distribution.\"\"\"\n", 449 | " sample_stats = self.compute_sampling_distribution()\n", 450 | " se = sample_stats.std()\n", 451 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 452 | " \n", 453 | " pyplot.hist(sample_stats, color=COLOR2)\n", 454 | " pyplot.xlabel('sample statistic')\n", 455 | " pyplot.xlim(self.xlim)\n", 456 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 457 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 458 | " pyplot.show()" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "The following function instantiates a `Resampler` and runs it." 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": { 472 | "collapsed": true 473 | }, 474 | "outputs": [], 475 | "source": [ 476 | "def interact_func(n, xlim):\n", 477 | " sample = weight.rvs(n)\n", 478 | " resampler = Resampler(sample, xlim=xlim)\n", 479 | " resampler.plot_sampling_distribution()" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "Here's a test run with `n=100`" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "interact_func(n=100, xlim=[50, 100])" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "Now we can use `interact_func` in an interaction:" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": null, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 512 | "interact(interact_func, n=slider, xlim=fixed([50, 100]))\n", 513 | "None" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "**Exercise 2**: write a new class called `StdResampler` that inherits from `Resampler` and overrides `sample_stat` so it computes the standard deviation 
of the resampled data." 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "metadata": { 527 | "collapsed": true 528 | }, 529 | "outputs": [], 530 | "source": [ 531 | "# Solution goes here" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "Test your code using the cell below:" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": null, 544 | "metadata": {}, 545 | "outputs": [], 546 | "source": [ 547 | "def interact_func2(n, xlim):\n", 548 | " sample = weight.rvs(n)\n", 549 | " resampler = StdResampler(sample, xlim=xlim)\n", 550 | " resampler.plot_sampling_distribution()\n", 551 | " \n", 552 | "interact_func2(n=100, xlim=[0, 100])" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "When your `StdResampler` is working, you should be able to interact with it:" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": { 566 | "collapsed": true 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 571 | "interact(interact_func2, n=slider, xlim=fixed([0, 100]))\n", 572 | "None" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "STOP HERE\n", 580 | "---------\n", 581 | "\n", 582 | "We will regroup and discuss before going on." 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": {}, 588 | "source": [ 589 | "Part Three\n", 590 | "==========\n", 591 | "\n", 592 | "We can extend this framework to compute SE and CI for a difference in means.\n", 593 | "\n", 594 | "For example, men are heavier than women on average. 
Here's the women's distribution again (from BRFSS data):" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": null, 600 | "metadata": { 601 | "collapsed": true 602 | }, 603 | "outputs": [], 604 | "source": [ 605 | "female_weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 606 | "female_weight.mean(), female_weight.std()" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "And here's the men's distribution:" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": null, 619 | "metadata": { 620 | "collapsed": true 621 | }, 622 | "outputs": [], 623 | "source": [ 624 | "male_weight = scipy.stats.lognorm(0.20, 0, 87.3)\n", 625 | "male_weight.mean(), male_weight.std()" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "I'll simulate a sample of 100 men and 100 women:" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": { 639 | "collapsed": true 640 | }, 641 | "outputs": [], 642 | "source": [ 643 | "female_sample = female_weight.rvs(100)\n", 644 | "male_sample = male_weight.rvs(100)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "The difference in means should be about 17 kg, but will vary from one random sample to the next:" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": { 658 | "collapsed": true 659 | }, 660 | "outputs": [], 661 | "source": [ 662 | "male_sample.mean() - female_sample.mean()" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "Here's the function that computes Cohen's effect size again:" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": { 676 | "collapsed": true 677 | }, 678 | "outputs": [], 679 | "source": [ 680 | "def CohenEffectSize(group1, group2):\n", 
681 | " \"\"\"Compute Cohen's d.\n", 682 | "\n", 683 | " group1: Series or NumPy array\n", 684 | " group2: Series or NumPy array\n", 685 | "\n", 686 | " returns: float\n", 687 | " \"\"\"\n", 688 | " diff = group1.mean() - group2.mean()\n", 689 | "\n", 690 | " n1, n2 = len(group1), len(group2)\n", 691 | " var1 = group1.var()\n", 692 | " var2 = group2.var()\n", 693 | "\n", 694 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 695 | " d = diff / numpy.sqrt(pooled_var)\n", 696 | " return d" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "The difference in weight between men and women is about 1 standard deviation:" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "metadata": { 710 | "collapsed": true 711 | }, 712 | "outputs": [], 713 | "source": [ 714 | "CohenEffectSize(male_sample, female_sample)" 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": {}, 720 | "source": [ 721 | "Now we can write a version of the `Resampler` that computes the sampling distribution of $d$." 
722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "metadata": { 728 | "collapsed": true 729 | }, 730 | "outputs": [], 731 | "source": [ 732 | "class CohenResampler(Resampler):\n", 733 | " def __init__(self, group1, group2, xlim=None):\n", 734 | " self.group1 = group1\n", 735 | " self.group2 = group2\n", 736 | " self.xlim = xlim\n", 737 | " \n", 738 | " def resample(self):\n", 739 | " n, m = len(self.group1), len(self.group2)\n", 740 | " group1 = numpy.random.choice(self.group1, n, replace=True)\n", 741 | " group2 = numpy.random.choice(self.group2, m, replace=True)\n", 742 | " return group1, group2\n", 743 | " \n", 744 | " def sample_stat(self, groups):\n", 745 | " group1, group2 = groups\n", 746 | " return CohenEffectSize(group1, group2)" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "Now we can instantiate a `CohenResampler` and plot the sampling distribution." 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": null, 759 | "metadata": { 760 | "collapsed": true 761 | }, 762 | "outputs": [], 763 | "source": [ 764 | "resampler = CohenResampler(male_sample, female_sample)\n", 765 | "resampler.plot_sampling_distribution()" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "This example demonstrates an advantage of the computational framework over mathematical analysis. Statistics like Cohen's $d$, which are ratios of other statistics, are relatively difficult to analyze mathematically. But with a computational approach, all sample statistics are equally \"easy\".\n", 773 | "\n", 774 | "One note on vocabulary: what I am calling \"resampling\" here is a specific kind of resampling called \"bootstrapping\". Other techniques that are also considered resampling include permutation tests, which we'll see in the next section, and \"jackknife\" resampling." 
775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": { 781 | "collapsed": true 782 | }, 783 | "outputs": [], 784 | "source": [] 785 | } 786 | ], 787 | "metadata": { 788 | "kernelspec": { 789 | "display_name": "Python 3", 790 | "language": "python", 791 | "name": "python3" 792 | }, 793 | "language_info": { 794 | "codemirror_mode": { 795 | "name": "ipython", 796 | "version": 3 797 | }, 798 | "file_extension": ".py", 799 | "mimetype": "text/x-python", 800 | "name": "python", 801 | "nbconvert_exporter": "python", 802 | "pygments_lexer": "ipython3", 803 | "version": "3.6.1" 804 | } 805 | }, 806 | "nbformat": 4, 807 | "nbformat_minor": 1 808 | } 809 | -------------------------------------------------------------------------------- /sampling_soln.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Random Sampling\n", 8 | "=============\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "from __future__ import print_function, division\n", 22 | "\n", 23 | "import numpy\n", 24 | "import scipy.stats\n", 25 | "\n", 26 | "import matplotlib.pyplot as pyplot\n", 27 | "\n", 28 | "from ipywidgets import interact, interactive, fixed\n", 29 | "import ipywidgets as widgets\n", 30 | "\n", 31 | "# seed the random number generator so we all get the same results\n", 32 | "numpy.random.seed(18)\n", 33 | "\n", 34 | "# some nicer colors from http://colorbrewer2.org/\n", 35 | "COLOR1 = '#7fc97f'\n", 36 | "COLOR2 = '#beaed4'\n", 37 | "COLOR3 = '#fdc086'\n", 38 | "COLOR4 = '#ffff99'\n", 39 | "COLOR5 = '#386cb0'\n", 40 | "\n", 41 | "%matplotlib inline" 42 | ] 
43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Part One\n", 49 | "========\n", 50 | "\n", 51 | "Suppose we want to estimate the average weight of men and women in the U.S.\n", 52 | "\n", 53 | "And we want to quantify the uncertainty of the estimate.\n", 54 | "\n", 55 | "One approach is to simulate many experiments and see how much the results vary from one experiment to the next.\n", 56 | "\n", 57 | "I'll start with the unrealistic assumption that we know the actual distribution of weights in the population. Then I'll show how to solve the problem without that assumption.\n", 58 | "\n", 59 | "Based on data from the [BRFSS](http://www.cdc.gov/brfss/), I found that the distribution of weight in kg for women in the U.S. is well modeled by a lognormal distribution with the following parameters:" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 69 | "weight.mean(), weight.std()" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "Here's what that distribution looks like:" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "xs = numpy.linspace(20, 160, 100)\n", 86 | "ys = weight.pdf(xs)\n", 87 | "pyplot.plot(xs, ys, linewidth=4, color=COLOR1)\n", 88 | "pyplot.xlabel('weight (kg)')\n", 89 | "pyplot.ylabel('PDF')\n", 90 | "None" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "`make_sample` draws a random sample from this distribution. The result is a NumPy array." 
98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "def make_sample(n=100):\n", 107 | " sample = weight.rvs(n)\n", 108 | " return sample" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "Here's an example with `n=100`. The mean and std of the sample are close to the mean and std of the population, but not exact." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "sample = make_sample(n=100)\n", 125 | "sample.mean(), sample.std()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "We want to estimate the average weight in the population, so the \"sample statistic\" we'll use is the mean:" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "def sample_stat(sample):\n", 142 | " return sample.mean()" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "One iteration of \"the experiment\" is to collect a sample of 100 women and compute their average weight.\n", 150 | "\n", 151 | "We can simulate running this experiment many times, and collect a list of sample statistics. The result is a NumPy array." 
152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "def compute_sampling_distribution(n=100, iters=1000):\n", 161 | " stats = [sample_stat(make_sample(n)) for i in range(iters)]\n", 162 | " return numpy.array(stats)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "The next line runs the simulation 1000 times and puts the results in\n", 170 | "`sample_means`:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "sample_means = compute_sampling_distribution(n=100, iters=1000)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Let's look at the distribution of the sample means. This distribution shows how much the results vary from one experiment to the next.\n", 187 | "\n", 188 | "Remember that this distribution is not the same as the distribution of weight in the population. This is the distribution of results across repeated imaginary experiments." 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "pyplot.hist(sample_means, color=COLOR5)\n", 198 | "pyplot.xlabel('sample mean (n=100)')\n", 199 | "pyplot.ylabel('count')\n", 200 | "None" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "The mean of the sample means is close to the actual population mean, which is nice, but not actually the important part." 
208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "sample_means.mean()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "The standard deviation of the sample means quantifies the variability from one experiment to the next, and reflects the precision of the estimate.\n", 224 | "\n", 225 | "This quantity is called the \"standard error\"." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "std_err = sample_means.std()\n", 235 | "std_err" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "We can also use the distribution of sample means to compute a \"90% confidence interval\", which contains 90% of the experimental results:" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "conf_int = numpy.percentile(sample_means, [5, 95])\n", 252 | "conf_int" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "Now we'd like to see what happens as we vary the sample size, `n`. The following function takes `n`, runs 1000 simulated experiments, and summarizes the results." 
260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "def plot_sampling_distribution(n, xlim=None):\n", 269 | " \"\"\"Plot the sampling distribution.\n", 270 | " \n", 271 | " n: sample size\n", 272 | " xlim: [xmin, xmax] range for the x axis \n", 273 | " \"\"\"\n", 274 | " sample_stats = compute_sampling_distribution(n, iters=1000)\n", 275 | " se = numpy.std(sample_stats)\n", 276 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 277 | " \n", 278 | " pyplot.hist(sample_stats, color=COLOR2)\n", 279 | " pyplot.xlabel('sample statistic')\n", 280 | " pyplot.xlim(xlim)\n", 281 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 282 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 283 | " pyplot.show()\n", 284 | " \n", 285 | "def text(x, y, s):\n", 286 | " \"\"\"Plot a string at a given location in axis coordinates.\n", 287 | " \n", 288 | " x: coordinate\n", 289 | " y: coordinate\n", 290 | " s: string\n", 291 | " \"\"\"\n", 292 | " ax = pyplot.gca()\n", 293 | " pyplot.text(x, y, s,\n", 294 | " horizontalalignment='left',\n", 295 | " verticalalignment='top',\n", 296 | " transform=ax.transAxes)" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "Here's a test run with `n=100`:" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "plot_sampling_distribution(100)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "Now we can use `interact` to run `plot_sampling_distribution` with different values of `n`. Note: `xlim` sets the limits of the x-axis so the figure doesn't get rescaled as we vary `n`." 
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "def sample_stat(sample):\n", 329 | " return sample.mean()\n", 330 | "\n", 331 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 332 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([55, 95]))\n", 333 | "None" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "### Other sample statistics\n", 341 | "\n", 342 | "This framework works with any other quantity we want to estimate. By changing `sample_stat`, you can compute the SE and CI for any sample statistic.\n", 343 | "\n", 344 | "**Exercise 1**: Fill in `sample_stat` below with any of these statistics:\n", 345 | "\n", 346 | "* Standard deviation of the sample.\n", 347 | "* Coefficient of variation, which is the sample standard deviation divided by the sample mean.\n", 348 | "* Min or max.\n", 349 | "* Median (which is the 50th percentile).\n", 350 | "* 10th or 90th percentile.\n", 351 | "* Interquartile range (IQR), which is the difference between the 75th and 25th percentiles.\n", 352 | "\n", 353 | "NumPy array methods you might find useful include `std`, `min`, `max`, and `percentile`.\n", 354 | "Depending on the results, you might want to adjust `xlim`." 
355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "def sample_stat(sample):\n", 364 | " # TODO: replace the following line with another sample statistic\n", 365 | " return sample.mean()\n", 366 | "\n", 367 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 368 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([0, 100]))\n", 369 | "None" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "STOP HERE\n", 377 | "---------\n", 378 | "\n", 379 | "We will regroup and discuss before going on." 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "Part Two\n", 387 | "========\n", 388 | "\n", 389 | "So far we have shown that if we know the actual distribution of the population, we can compute the sampling distribution for any sample statistic, and from that we can compute SE and CI.\n", 390 | "\n", 391 | "But in real life we don't know the actual distribution of the population. If we did, we wouldn't be doing statistical inference in the first place!\n", 392 | "\n", 393 | "In real life, we use the sample to build a model of the population distribution, then use the model to generate the sampling distribution. A simple and popular way to do that is \"resampling,\" which means we use the sample itself as a model of the population distribution and draw samples from it.\n", 394 | "\n", 395 | "Before we go on, I want to collect some of the code from Part One and organize it as a class. This class represents a framework for computing sampling distributions." 
396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "class Resampler(object):\n", 405 | " \"\"\"Represents a framework for computing sampling distributions.\"\"\"\n", 406 | " \n", 407 | " def __init__(self, sample, xlim=None):\n", 408 | " \"\"\"Stores the actual sample.\"\"\"\n", 409 | " self.sample = sample\n", 410 | " self.n = len(sample)\n", 411 | " self.xlim = xlim\n", 412 | " \n", 413 | " def resample(self):\n", 414 | " \"\"\"Generates a new sample by choosing from the original\n", 415 | " sample with replacement.\n", 416 | " \"\"\"\n", 417 | " new_sample = numpy.random.choice(self.sample, self.n, replace=True)\n", 418 | " return new_sample\n", 419 | " \n", 420 | " def sample_stat(self, sample):\n", 421 | " \"\"\"Computes a sample statistic using the original sample or a\n", 422 | " simulated sample.\n", 423 | " \"\"\"\n", 424 | " return sample.mean()\n", 425 | " \n", 426 | " def compute_sampling_distribution(self, iters=1000):\n", 427 | " \"\"\"Simulates many experiments and collects the resulting sample\n", 428 | " statistics.\n", 429 | " \"\"\"\n", 430 | " stats = [self.sample_stat(self.resample()) for i in range(iters)]\n", 431 | " return numpy.array(stats)\n", 432 | " \n", 433 | " def plot_sampling_distribution(self):\n", 434 | " \"\"\"Plots the sampling distribution.\"\"\"\n", 435 | " sample_stats = self.compute_sampling_distribution()\n", 436 | " se = sample_stats.std()\n", 437 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 438 | " \n", 439 | " pyplot.hist(sample_stats, color=COLOR2)\n", 440 | " pyplot.xlabel('sample statistic')\n", 441 | " pyplot.xlim(self.xlim)\n", 442 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 443 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 444 | " pyplot.show()" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "The following function instantiates a `Resampler` and 
runs it." 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "def interact_func(n, xlim):\n", 461 | " sample = weight.rvs(n)\n", 462 | " resampler = Resampler(sample, xlim=xlim)\n", 463 | " resampler.plot_sampling_distribution()" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "Here's a test run with `n=100`" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "interact_func(n=100, xlim=[50, 100])" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "Now we can use `interact_func` in an interaction:" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 496 | "interact(interact_func, n=slider, xlim=fixed([50, 100]))\n", 497 | "None" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "**Exercise 2**: write a new class called `StdResampler` that inherits from `Resampler` and overrides `sample_stat` so it computes the standard deviation of the resampled data." 
505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "# Solution goes here\n", 514 | "\n", 515 | "class StdResampler(Resampler): \n", 516 | " \"\"\"Computes the sampling distribution of the standard deviation.\"\"\"\n", 517 | " \n", 518 | " def sample_stat(self, sample):\n", 519 | " \"\"\"Computes a sample statistic using the original sample or a\n", 520 | " simulated sample.\n", 521 | " \"\"\"\n", 522 | " return sample.std()" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "Test your code using the cell below:" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "def interact_func2(n, xlim):\n", 539 | " sample = weight.rvs(n)\n", 540 | " resampler = StdResampler(sample, xlim=xlim)\n", 541 | " resampler.plot_sampling_distribution()\n", 542 | " \n", 543 | "interact_func2(n=100, xlim=[0, 100])" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "When your `StdResampler` is working, you should be able to interact with it:" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": {}, 557 | "outputs": [], 558 | "source": [ 559 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 560 | "interact(interact_func2, n=slider, xlim=fixed([0, 100]))\n", 561 | "None" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "STOP HERE\n", 569 | "---------\n", 570 | "\n", 571 | "We will regroup and discuss before going on." 
572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "Part Three\n", 579 | "==========\n", 580 | "\n", 581 | "We can extend this framework to compute SE and CI for a difference in means.\n", 582 | "\n", 583 | "For example, men are heavier than women on average. Here's the women's distribution again (from BRFSS data):" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "female_weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 593 | "female_weight.mean(), female_weight.std()" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "And here's the men's distribution:" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "male_weight = scipy.stats.lognorm(0.20, 0, 87.3)\n", 610 | "male_weight.mean(), male_weight.std()" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "I'll simulate a sample of 100 men and 100 women:" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "female_sample = female_weight.rvs(100)\n", 627 | "male_sample = male_weight.rvs(100)" 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": {}, 633 | "source": [ 634 | "The difference in means should be about 17 kg, but will vary from one random sample to the next:" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": null, 640 | "metadata": {}, 641 | "outputs": [], 642 | "source": [ 643 | "male_sample.mean() - female_sample.mean()" 644 | ] 645 | }, 646 | { 647 | "cell_type": "markdown", 648 | "metadata": {}, 649 | "source": [ 650 | "Here's the function that computes Cohen's effect size again:" 651 | ] 652 | }, 653 | { 654 | 
"cell_type": "code", 655 | "execution_count": null, 656 | "metadata": {}, 657 | "outputs": [], 658 | "source": [ 659 | "def CohenEffectSize(group1, group2):\n", 660 | " \"\"\"Compute Cohen's d.\n", 661 | "\n", 662 | " group1: Series or NumPy array\n", 663 | " group2: Series or NumPy array\n", 664 | "\n", 665 | " returns: float\n", 666 | " \"\"\"\n", 667 | " diff = group1.mean() - group2.mean()\n", 668 | "\n", 669 | " n1, n2 = len(group1), len(group2)\n", 670 | " var1 = group1.var()\n", 671 | " var2 = group2.var()\n", 672 | "\n", 673 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 674 | " d = diff / numpy.sqrt(pooled_var)\n", 675 | " return d" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "The difference in weight between men and women is about 1 standard deviation:" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": null, 688 | "metadata": {}, 689 | "outputs": [], 690 | "source": [ 691 | "CohenEffectSize(male_sample, female_sample)" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "Now we can write a version of the `Resampler` that computes the sampling distribution of $d$." 
699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": {}, 705 | "outputs": [], 706 | "source": [ 707 | "class CohenResampler(Resampler):\n", 708 | " def __init__(self, group1, group2, xlim=None):\n", 709 | " self.group1 = group1\n", 710 | " self.group2 = group2\n", 711 | " self.xlim = xlim\n", 712 | " \n", 713 | " def resample(self):\n", 714 | " n, m = len(self.group1), len(self.group2)\n", 715 | " group1 = numpy.random.choice(self.group1, n, replace=True)\n", 716 | " group2 = numpy.random.choice(self.group2, m, replace=True)\n", 717 | " return group1, group2\n", 718 | " \n", 719 | " def sample_stat(self, groups):\n", 720 | " group1, group2 = groups\n", 721 | " return CohenEffectSize(group1, group2)" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "Now we can instantiate a `CohenResampler` and plot the sampling distribution." 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [ 737 | "resampler = CohenResampler(male_sample, female_sample)\n", 738 | "resampler.plot_sampling_distribution()" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "metadata": {}, 744 | "source": [ 745 | "This example demonstrates an advantage of the computational framework over mathematical analysis. Statistics like Cohen's $d$, which is the ratio of other statistics, are relatively difficult to analyze. But with a computational approach, all sample statistics are equally \"easy\".\n", 746 | "\n", 747 | "One note on vocabulary: what I am calling \"resampling\" here is a specific kind of resampling called \"bootstrapping\". Other techniques that are also considered resampling include permutation tests, which we'll see in the next section, and \"jackknife\" resampling. You can read more at ."
748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": {}, 754 | "outputs": [], 755 | "source": [] 756 | } 757 | ], 758 | "metadata": { 759 | "kernelspec": { 760 | "display_name": "Python 3", 761 | "language": "python", 762 | "name": "python3" 763 | }, 764 | "language_info": { 765 | "codemirror_mode": { 766 | "name": "ipython", 767 | "version": 3 768 | }, 769 | "file_extension": ".py", 770 | "mimetype": "text/x-python", 771 | "name": "python", 772 | "nbconvert_exporter": "python", 773 | "pygments_lexer": "ipython3", 774 | "version": "3.6.1" 775 | } 776 | }, 777 | "nbformat": 4, 778 | "nbformat_minor": 1 779 | } 780 | -------------------------------------------------------------------------------- /thinkplot.py: -------------------------------------------------------------------------------- 1 | """This file contains code for use with "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2014 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import math 11 | import matplotlib 12 | import matplotlib.pyplot as pyplot 13 | import numpy as np 14 | import pandas 15 | 16 | import warnings 17 | 18 | # customize some matplotlib attributes 19 | #matplotlib.rc('figure', figsize=(4, 3)) 20 | 21 | #matplotlib.rc('font', size=14.0) 22 | #matplotlib.rc('axes', labelsize=22.0, titlesize=22.0) 23 | #matplotlib.rc('legend', fontsize=20.0) 24 | 25 | #matplotlib.rc('xtick.major', size=6.0) 26 | #matplotlib.rc('xtick.minor', size=3.0) 27 | 28 | #matplotlib.rc('ytick.major', size=6.0) 29 | #matplotlib.rc('ytick.minor', size=3.0) 30 | 31 | 32 | class _Brewer(object): 33 | """Encapsulates a nice sequence of colors. 34 | 35 | Shades of blue that look good in color and can be distinguished 36 | in grayscale (up to a point). 
37 | 38 | Borrowed from http://colorbrewer2.org/ 39 | """ 40 | color_iter = None 41 | 42 | colors = ['#f7fbff', '#deebf7', '#c6dbef', 43 | '#9ecae1', '#6baed6', '#4292c6', 44 | '#2171b5','#08519c','#08306b'][::-1] 45 | 46 | # lists that indicate which colors to use depending on how many are used 47 | which_colors = [[], 48 | [1], 49 | [1, 3], 50 | [0, 2, 4], 51 | [0, 2, 4, 6], 52 | [0, 2, 3, 5, 6], 53 | [0, 2, 3, 4, 5, 6], 54 | [0, 1, 2, 3, 4, 5, 6], 55 | [0, 1, 2, 3, 4, 5, 6, 7], 56 | [0, 1, 2, 3, 4, 5, 6, 7, 8], 57 | ] 58 | 59 | current_figure = None 60 | 61 | @classmethod 62 | def Colors(cls): 63 | """Returns the list of colors. 64 | """ 65 | return cls.colors 66 | 67 | @classmethod 68 | def ColorGenerator(cls, num): 69 | """Returns an iterator of color strings. 70 | 71 | num: how many colors will be used 72 | """ 73 | for i in cls.which_colors[num]: 74 | yield cls.colors[i] 75 | return  # ending the generator raises StopIteration in the caller (raising it explicitly is an error under PEP 479) 76 | 77 | @classmethod 78 | def InitIter(cls, num): 79 | """Initializes the color iterator with the given number of colors.""" 80 | cls.color_iter = cls.ColorGenerator(num) 81 | 82 | @classmethod 83 | def ClearIter(cls): 84 | """Sets the color iterator to None.""" 85 | cls.color_iter = None 86 | 87 | @classmethod 88 | def GetIter(cls, num): 89 | """Gets the color iterator.""" 90 | fig = pyplot.gcf() 91 | if fig != cls.current_figure: 92 | cls.InitIter(num) 93 | cls.current_figure = fig 94 | 95 | if cls.color_iter is None: 96 | cls.InitIter(num) 97 | 98 | return cls.color_iter 99 | 100 | 101 | def _UnderrideColor(options): 102 | """If color is not in the options, chooses a color.
103 | """ 104 | if 'color' in options: 105 | return options 106 | 107 | # get the current color iterator; if there is none, init one 108 | color_iter = _Brewer.GetIter(5) 109 | 110 | try: 111 | options['color'] = next(color_iter) 112 | except StopIteration: 113 | # if you run out of colors, initialize the color iterator 114 | # and try again 115 | warnings.warn('Ran out of colors. Starting over.') 116 | _Brewer.ClearIter() 117 | _UnderrideColor(options) 118 | 119 | return options 120 | 121 | 122 | def PrePlot(num=None, rows=None, cols=None): 123 | """Takes hints about what's coming. 124 | 125 | num: number of lines that will be plotted 126 | rows: number of rows of subplots 127 | cols: number of columns of subplots 128 | """ 129 | if num: 130 | _Brewer.InitIter(num) 131 | 132 | if rows is None and cols is None: 133 | return 134 | 135 | if rows is not None and cols is None: 136 | cols = 1 137 | 138 | if cols is not None and rows is None: 139 | rows = 1 140 | 141 | # resize the image, depending on the number of rows and cols 142 | size_map = {(1, 1): (8, 6), 143 | (1, 2): (12, 6), 144 | (1, 3): (12, 6), 145 | (2, 2): (10, 10), 146 | (2, 3): (16, 10), 147 | (3, 1): (8, 10), 148 | (4, 1): (8, 12), 149 | } 150 | 151 | if (rows, cols) in size_map: 152 | fig = pyplot.gcf() 153 | fig.set_size_inches(*size_map[rows, cols]) 154 | 155 | # create the first subplot 156 | if rows > 1 or cols > 1: 157 | ax = pyplot.subplot(rows, cols, 1) 158 | global SUBPLOT_ROWS, SUBPLOT_COLS 159 | SUBPLOT_ROWS = rows 160 | SUBPLOT_COLS = cols 161 | else: 162 | ax = pyplot.gca() 163 | 164 | return ax 165 | 166 | def SubPlot(plot_number, rows=None, cols=None, **options): 167 | """Configures the number of subplots and changes the current plot. 
168 | 169 | rows: int 170 | cols: int 171 | plot_number: int 172 | options: passed to subplot 173 | """ 174 | rows = rows or SUBPLOT_ROWS 175 | cols = cols or SUBPLOT_COLS 176 | return pyplot.subplot(rows, cols, plot_number, **options) 177 | 178 | 179 | def _Underride(d, **options): 180 | """Add key-value pairs to d only if key is not in d. 181 | 182 | If d is None, create a new dictionary. 183 | 184 | d: dictionary 185 | options: keyword args to add to d 186 | """ 187 | if d is None: 188 | d = {} 189 | 190 | for key, val in options.items(): 191 | d.setdefault(key, val) 192 | 193 | return d 194 | 195 | 196 | def Clf(): 197 | """Clears the figure and any hints that have been set.""" 198 | global LOC 199 | LOC = None 200 | _Brewer.ClearIter() 201 | pyplot.clf() 202 | fig = pyplot.gcf() 203 | fig.set_size_inches(8, 6) 204 | 205 | 206 | def Figure(**options): 207 | """Sets options for the current figure.""" 208 | _Underride(options, figsize=(6, 8)) 209 | pyplot.figure(**options) 210 | 211 | 212 | def Plot(obj, ys=None, style='', **options): 213 | """Plots a line. 214 | 215 | Args: 216 | obj: sequence of x values, or Series, or anything with Render() 217 | ys: sequence of y values 218 | style: style string passed along to pyplot.plot 219 | options: keyword args passed to pyplot.plot 220 | """ 221 | options = _UnderrideColor(options) 222 | label = getattr(obj, 'label', '_nolegend_') 223 | options = _Underride(options, linewidth=3, alpha=0.7, label=label) 224 | 225 | xs = obj 226 | if ys is None: 227 | if hasattr(obj, 'Render'): 228 | xs, ys = obj.Render() 229 | if isinstance(obj, pandas.Series): 230 | ys = obj.values 231 | xs = obj.index 232 | 233 | if ys is None: 234 | pyplot.plot(xs, style, **options) 235 | else: 236 | pyplot.plot(xs, ys, style, **options) 237 | 238 | 239 | def Vlines(xs, y1, y2, **options): 240 | """Plots a set of vertical lines. 
241 | 242 | Args: 243 | xs: sequence of x values 244 | y1: sequence of y values 245 | y2: sequence of y values 246 | options: keyword args passed to pyplot.vlines 247 | """ 248 | options = _UnderrideColor(options) 249 | options = _Underride(options, linewidth=1, alpha=0.5) 250 | pyplot.vlines(xs, y1, y2, **options) 251 | 252 | 253 | def Hlines(ys, x1, x2, **options): 254 | """Plots a set of horizontal lines. 255 | 256 | Args: 257 | ys: sequence of y values 258 | x1: sequence of x values 259 | x2: sequence of x values 260 | options: keyword args passed to pyplot.hlines 261 | """ 262 | options = _UnderrideColor(options) 263 | options = _Underride(options, linewidth=1, alpha=0.5) 264 | pyplot.hlines(ys, x1, x2, **options) 265 | 266 | 267 | def FillBetween(xs, y1, y2=None, where=None, **options): 268 | """Fills the space between two lines. 269 | 270 | Args: 271 | xs: sequence of x values 272 | y1: sequence of y values 273 | y2: sequence of y values 274 | where: sequence of boolean 275 | options: keyword args passed to pyplot.fill_between 276 | """ 277 | options = _UnderrideColor(options) 278 | options = _Underride(options, linewidth=0, alpha=0.5) 279 | pyplot.fill_between(xs, y1, y2, where, **options) 280 | 281 | 282 | def Bar(xs, ys, **options): 283 | """Plots a bar chart. 284 | 285 | Args: 286 | xs: sequence of x values 287 | ys: sequence of y values 288 | options: keyword args passed to pyplot.bar 289 | """ 290 | options = _UnderrideColor(options) 291 | options = _Underride(options, linewidth=0, alpha=0.6) 292 | pyplot.bar(xs, ys, **options) 293 | 294 | 295 | def Scatter(xs, ys=None, **options): 296 | """Makes a scatter plot.
297 | 298 | xs: x values 299 | ys: y values 300 | options: options passed to pyplot.scatter 301 | """ 302 | options = _Underride(options, color='blue', alpha=0.2, 303 | s=30, edgecolors='none') 304 | 305 | if ys is None and isinstance(xs, pandas.Series): 306 | ys = xs.values 307 | xs = xs.index 308 | 309 | pyplot.scatter(xs, ys, **options) 310 | 311 | 312 | def HexBin(xs, ys, **options): 313 | """Makes a hexbin plot. 314 | 315 | xs: x values 316 | ys: y values 317 | options: options passed to pyplot.hexbin 318 | """ 319 | options = _Underride(options, cmap=matplotlib.cm.Blues) 320 | pyplot.hexbin(xs, ys, **options) 321 | 322 | 323 | def Pdf(pdf, **options): 324 | """Plots a Pdf, Pmf, or Hist as a line. 325 | 326 | Args: 327 | pdf: Pdf, Pmf, or Hist object 328 | options: keyword args passed to pyplot.plot 329 | """ 330 | low, high = options.pop('low', None), options.pop('high', None) 331 | n = options.pop('n', 101) 332 | xs, ps = pdf.Render(low=low, high=high, n=n) 333 | options = _Underride(options, label=pdf.label) 334 | Plot(xs, ps, **options) 335 | 336 | 337 | def Pdfs(pdfs, **options): 338 | """Plots a sequence of PDFs. 339 | 340 | Options are passed along for all PDFs. If you want different 341 | options for each pdf, make multiple calls to Pdf. 342 | 343 | Args: 344 | pdfs: sequence of PDF objects 345 | options: keyword args passed to pyplot.plot 346 | """ 347 | for pdf in pdfs: 348 | Pdf(pdf, **options) 349 | 350 | 351 | def Hist(hist, **options): 352 | """Plots a Pmf or Hist with a bar plot. 353 | 354 | The default width of the bars is based on the minimum difference 355 | between values in the Hist. If that's too small, you can override 356 | it by providing a width keyword argument, in the same units 357 | as the values.
358 | 359 | Args: 360 | hist: Hist or Pmf object 361 | options: keyword args passed to pyplot.bar 362 | """ 363 | # find the minimum distance between adjacent values 364 | xs, ys = hist.Render() 365 | 366 | if 'width' not in options: 367 | try: 368 | options['width'] = 0.9 * np.diff(xs).min() 369 | except TypeError: 370 | warnings.warn("Hist: Can't compute bar width automatically." 371 | "Check for non-numeric types in Hist." 372 | "Or try providing width option." 373 | ) 374 | 375 | options = _Underride(options, label=hist.label) 376 | options = _Underride(options, align='center') 377 | if options['align'] == 'left': 378 | options['align'] = 'edge' 379 | elif options['align'] == 'right': 380 | options['align'] = 'edge' 381 | options['width'] *= -1 382 | 383 | Bar(xs, ys, **options) 384 | 385 | 386 | def Hists(hists, **options): 387 | """Plots two histograms as interleaved bar plots. 388 | 389 | Options are passed along for all PMFs. If you want different 390 | options for each pmf, make multiple calls to Pmf. 391 | 392 | Args: 393 | hists: list of two Hist or Pmf objects 394 | options: keyword args passed to pyplot.plot 395 | """ 396 | for hist in hists: 397 | Hist(hist, **options) 398 | 399 | 400 | def Pmf(pmf, **options): 401 | """Plots a Pmf or Hist as a line. 402 | 403 | Args: 404 | pmf: Hist or Pmf object 405 | options: keyword args passed to pyplot.plot 406 | """ 407 | xs, ys = pmf.Render() 408 | low, high = min(xs), max(xs) 409 | 410 | width = options.pop('width', None) 411 | if width is None: 412 | try: 413 | width = np.diff(xs).min() 414 | except TypeError: 415 | warnings.warn("Pmf: Can't compute bar width automatically." 416 | "Check for non-numeric types in Pmf." 
417 | "Or try providing width option.") 418 | points = [] 419 | 420 | lastx = np.nan 421 | lasty = 0 422 | for x, y in zip(xs, ys): 423 | if (x - lastx) > 1e-5: 424 | points.append((lastx, 0)) 425 | points.append((x, 0)) 426 | 427 | points.append((x, lasty)) 428 | points.append((x, y)) 429 | points.append((x+width, y)) 430 | 431 | lastx = x + width 432 | lasty = y 433 | points.append((lastx, 0)) 434 | pxs, pys = zip(*points) 435 | 436 | align = options.pop('align', 'center') 437 | if align == 'center': 438 | pxs = np.array(pxs) - width/2.0 439 | if align == 'right': 440 | pxs = np.array(pxs) - width 441 | 442 | options = _Underride(options, label=pmf.label) 443 | Plot(pxs, pys, **options) 444 | 445 | 446 | def Pmfs(pmfs, **options): 447 | """Plots a sequence of PMFs. 448 | 449 | Options are passed along for all PMFs. If you want different 450 | options for each pmf, make multiple calls to Pmf. 451 | 452 | Args: 453 | pmfs: sequence of PMF objects 454 | options: keyword args passed to pyplot.plot 455 | """ 456 | for pmf in pmfs: 457 | Pmf(pmf, **options) 458 | 459 | 460 | def Diff(t): 461 | """Compute the differences between adjacent elements in a sequence. 462 | 463 | Args: 464 | t: sequence of number 465 | 466 | Returns: 467 | sequence of differences (length one less than t) 468 | """ 469 | diffs = [t[i+1] - t[i] for i in range(len(t)-1)] 470 | return diffs 471 | 472 | 473 | def Cdf(cdf, complement=False, transform=None, **options): 474 | """Plots a CDF as a line. 475 | 476 | Args: 477 | cdf: Cdf object 478 | complement: boolean, whether to plot the complementary CDF 479 | transform: string, one of 'exponential', 'pareto', 'weibull', 'gumbel' 480 | options: keyword args passed to pyplot.plot 481 | 482 | Returns: 483 | dictionary with the scale options that should be passed to 484 | Config, Show or Save. 
485 | """ 486 | xs, ps = cdf.Render() 487 | xs = np.asarray(xs) 488 | ps = np.asarray(ps) 489 | 490 | scale = dict(xscale='linear', yscale='linear') 491 | 492 | for s in ['xscale', 'yscale']: 493 | if s in options: 494 | scale[s] = options.pop(s) 495 | 496 | if transform == 'exponential': 497 | complement = True 498 | scale['yscale'] = 'log' 499 | 500 | if transform == 'pareto': 501 | complement = True 502 | scale['yscale'] = 'log' 503 | scale['xscale'] = 'log' 504 | 505 | if complement: 506 | ps = [1.0-p for p in ps] 507 | 508 | if transform == 'weibull': 509 | xs = np.delete(xs, -1) 510 | ps = np.delete(ps, -1) 511 | ps = [-math.log(1.0-p) for p in ps] 512 | scale['xscale'] = 'log' 513 | scale['yscale'] = 'log' 514 | 515 | if transform == 'gumbel': 516 | xs = np.delete(xs, 0) 517 | ps = np.delete(ps, 0) 518 | ps = [-math.log(p) for p in ps] 519 | scale['yscale'] = 'log' 520 | 521 | options = _Underride(options, label=cdf.label) 522 | Plot(xs, ps, **options) 523 | return scale 524 | 525 | 526 | def Cdfs(cdfs, complement=False, transform=None, **options): 527 | """Plots a sequence of CDFs. 528 | 529 | cdfs: sequence of CDF objects 530 | complement: boolean, whether to plot the complementary CDF 531 | transform: string, one of 'exponential', 'pareto', 'weibull', 'gumbel' 532 | options: keyword args passed to pyplot.plot 533 | """ 534 | for cdf in cdfs: 535 | Cdf(cdf, complement, transform, **options) 536 | 537 | 538 | def Contour(obj, pcolor=False, contour=True, imshow=False, **options): 539 | """Makes a contour plot.
540 | 541 | obj: map from (x, y) to z, or object that provides GetDict 542 | pcolor: boolean, whether to make a pseudocolor plot 543 | contour: boolean, whether to make a contour plot 544 | imshow: boolean, whether to use pyplot.imshow 545 | options: keyword args passed to pyplot.pcolor and/or pyplot.contour 546 | """ 547 | try: 548 | d = obj.GetDict() 549 | except AttributeError: 550 | d = obj 551 | 552 | _Underride(options, linewidth=3, cmap=matplotlib.cm.Blues) 553 | 554 | xs, ys = zip(*d.keys()) 555 | xs = sorted(set(xs)) 556 | ys = sorted(set(ys)) 557 | 558 | X, Y = np.meshgrid(xs, ys) 559 | func = lambda x, y: d.get((x, y), 0) 560 | func = np.vectorize(func) 561 | Z = func(X, Y) 562 | 563 | x_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False) 564 | axes = pyplot.gca() 565 | axes.xaxis.set_major_formatter(x_formatter) 566 | 567 | if pcolor: 568 | pyplot.pcolormesh(X, Y, Z, **options) 569 | if contour: 570 | cs = pyplot.contour(X, Y, Z, **options) 571 | pyplot.clabel(cs, inline=1, fontsize=10) 572 | if imshow: 573 | extent = xs[0], xs[-1], ys[0], ys[-1] 574 | pyplot.imshow(Z, extent=extent, **options) 575 | 576 | 577 | def Pcolor(xs, ys, zs, pcolor=True, contour=False, **options): 578 | """Makes a pseudocolor plot.
579 | 580 | xs: sequence of x values 581 | ys: sequence of y values 582 | zs: array of z values 583 | pcolor: boolean, whether to make a pseudocolor plot 584 | contour: boolean, whether to make a contour plot 585 | options: keyword args passed to pyplot.pcolor and/or pyplot.contour 586 | """ 587 | _Underride(options, linewidth=3, cmap=matplotlib.cm.Blues) 588 | 589 | X, Y = np.meshgrid(xs, ys) 590 | Z = zs 591 | 592 | x_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False) 593 | axes = pyplot.gca() 594 | axes.xaxis.set_major_formatter(x_formatter) 595 | 596 | if pcolor: 597 | pyplot.pcolormesh(X, Y, Z, **options) 598 | 599 | if contour: 600 | cs = pyplot.contour(X, Y, Z, **options) 601 | pyplot.clabel(cs, inline=1, fontsize=10) 602 | 603 | 604 | def Text(x, y, s, **options): 605 | """Puts text in a figure. 606 | 607 | x: number 608 | y: number 609 | s: string 610 | options: keyword args passed to pyplot.text 611 | """ 612 | options = _Underride(options, 613 | fontsize=16, 614 | verticalalignment='top', 615 | horizontalalignment='left') 616 | pyplot.text(x, y, s, **options) 617 | 618 | 619 | LEGEND = True 620 | LOC = None 621 | 622 | def Config(**options): 623 | """Configures the plot. 624 | 625 | Pulls options out of the option dictionary and passes them to 626 | the corresponding pyplot functions.
627 | """ 628 | names = ['title', 'xlabel', 'ylabel', 'xscale', 'yscale', 629 | 'xticks', 'yticks', 'axis', 'xlim', 'ylim'] 630 | 631 | for name in names: 632 | if name in options: 633 | getattr(pyplot, name)(options[name]) 634 | 635 | global LEGEND 636 | LEGEND = options.get('legend', LEGEND) 637 | 638 | if LEGEND: 639 | global LOC 640 | LOC = options.get('loc', LOC) 641 | pyplot.legend(loc=LOC) 642 | 643 | val = options.get('xticklabels', None) 644 | if val is not None: 645 | if val == 'invisible': 646 | ax = pyplot.gca() 647 | labels = ax.get_xticklabels() 648 | pyplot.setp(labels, visible=False) 649 | 650 | val = options.get('yticklabels', None) 651 | if val is not None: 652 | if val == 'invisible': 653 | ax = pyplot.gca() 654 | labels = ax.get_yticklabels() 655 | pyplot.setp(labels, visible=False) 656 | 657 | 658 | def Show(**options): 659 | """Shows the plot. 660 | 661 | For options, see Config. 662 | 663 | options: keyword args used to invoke various pyplot functions 664 | """ 665 | clf = options.pop('clf', True) 666 | Config(**options) 667 | pyplot.show() 668 | if clf: 669 | Clf() 670 | 671 | 672 | def Plotly(**options): 673 | """Shows the plot. 674 | 675 | For options, see Config. 676 | 677 | options: keyword args used to invoke various pyplot functions 678 | """ 679 | clf = options.pop('clf', True) 680 | Config(**options) 681 | import plotly.plotly as plotly 682 | url = plotly.plot_mpl(pyplot.gcf()) 683 | if clf: 684 | Clf() 685 | return url 686 | 687 | 688 | def Save(root=None, formats=None, **options): 689 | """Saves the plot in the given formats and clears the figure. 690 | 691 | For options, see Config. 
692 | 693 | Args: 694 | root: string filename root 695 | formats: list of string formats 696 | options: keyword args used to invoke various pyplot functions 697 | """ 698 | clf = options.pop('clf', True) 699 | Config(**options) 700 | 701 | if formats is None: 702 | formats = ['pdf', 'eps'] 703 | 704 | try: 705 | formats.remove('plotly') 706 | Plotly(clf=False) 707 | except ValueError: 708 | pass 709 | 710 | if root: 711 | for fmt in formats: 712 | SaveFormat(root, fmt) 713 | if clf: 714 | Clf() 715 | 716 | 717 | def SaveFormat(root, fmt='eps'): 718 | """Writes the current figure to a file in the given format. 719 | 720 | Args: 721 | root: string filename root 722 | fmt: string format 723 | """ 724 | filename = '%s.%s' % (root, fmt) 725 | print('Writing', filename) 726 | pyplot.savefig(filename, format=fmt, dpi=300) 727 | 728 | 729 | # provide aliases for calling functions with lower-case names 730 | preplot = PrePlot 731 | subplot = SubPlot 732 | clf = Clf 733 | figure = Figure 734 | plot = Plot 735 | vlines = Vlines 736 | hlines = Hlines 737 | fill_between = FillBetween 738 | text = Text 739 | scatter = Scatter 740 | pmf = Pmf 741 | pmfs = Pmfs 742 | hist = Hist 743 | hists = Hists 744 | diff = Diff 745 | cdf = Cdf 746 | cdfs = Cdfs 747 | contour = Contour 748 | pcolor = Pcolor 749 | config = Config 750 | show = Show 751 | save = Save 752 | 753 | 754 | def main(): 755 | color_iter = _Brewer.ColorGenerator(7) 756 | for color in color_iter: 757 | print(color) 758 | 759 | 760 | if __name__ == '__main__': 761 | main() 762 | -------------------------------------------------------------------------------- /tutorial.md: -------------------------------------------------------------------------------- 1 | ## Tutorial: Computational Statistics 2 | 3 | Allen Downey 4 | 5 | Do you know the difference between standard deviation and standard error? Do you know what statistical test to use for any occasion? Do you really know what a p-value is? How about a confidence interval?
6 | 7 | Most people don’t really understand these concepts, even after taking several statistics classes. The problem is that these classes focus on mathematical methods that bury the concepts under a mountain of details. 8 | 9 | This tutorial uses Python to implement simple statistical experiments that develop deep understanding. I will present examples using real-world data to answer relevant questions, and attendees will practice with hands-on exercises. 10 | 11 | The tutorial material is based on my book, [*Think Stats*](http://greenteapress.com/wp/think-stats-2e/), a class I teach at Olin College, and my blog, [“Probably Overthinking It.”](http://allendowney.blogspot.com/) 12 | 13 | 14 | ### Installation instructions 15 | 16 | Note: Please try to install everything you need for this tutorial before you leave home! 17 | 18 | To prepare for this tutorial, you have two options: 19 | 20 | 1. Install Jupyter on your laptop and download my code from GitHub. 21 | 22 | 2. Run the Jupyter notebook on a virtual machine on Binder. 23 | 24 | I'll provide instructions for both, but here's the catch: if everyone chooses Option 2, the wireless network will fail and no one will be able to do the hands-on part of the workshop. 25 | 26 | So, I strongly encourage you to try Option 1 and only resort to Option 2 if you can't get Option 1 working. 27 | 28 | 29 | 30 | #### Option 1A: If you already have Jupyter installed. 31 | 32 | To do the exercises, you need Python 2 or 3 with NumPy, SciPy, and matplotlib. If you are not sure whether you have those modules already, the easiest way to check is to run my code and see if it works. 33 | 34 | Code for this workshop is in a Git repository on GitHub. 35 | If you have a Git client installed, you should be able to download it by running: 36 | 37 | git clone https://github.com/AllenDowney/CompStats.git 38 | 39 | It should create a directory named `CompStats`.
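If you'd rather check for the required packages before launching Jupyter, a quick snippet like the following (a convenience sketch, not part of the repository) reports what is missing when you run it at a Python prompt:

```python
import importlib

# The packages the tutorial notebooks rely on
required = ['numpy', 'scipy', 'matplotlib']

for name in required:
    try:
        importlib.import_module(name)
        print(name, 'OK')
    except ImportError:
        print(name, 'MISSING')
```

If anything is reported missing, install it with your package manager, or install Anaconda as described below.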
40 | Otherwise you can download the repository in [this zip file](https://github.com/AllenDowney/CompStats/archive/master.zip). 41 | 42 | To start Jupyter, run: 43 | 44 | cd CompStats 45 | jupyter notebook 46 | 47 | Jupyter should launch your default browser or open a tab in an existing browser window. 48 | If not, the Jupyter server should print a URL you can use. For example, when I launch Jupyter, I get 49 | 50 | ``` 51 | ~/ThinkComplexity2$ jupyter notebook 52 | [I 10:03:20.115 NotebookApp] Serving notebooks from local directory: /home/downey/CompStats 53 | [I 10:03:20.115 NotebookApp] 0 active kernels 54 | [I 10:03:20.115 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/ 55 | [I 10:03:20.115 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). 56 | ``` 57 | 58 | In this case, the URL is [http://localhost:8888](http://localhost:8888). 59 | When you start your server, you might get a different URL. 60 | Whatever it is, if you paste it into a browser, you should see a home page with a list of the 61 | notebooks in the repository. 62 | 63 | Click on `effect_size.ipynb`. It should open the first notebook for the tutorial. 64 | 65 | Select the cell with the import statements and press "Shift-Enter" to run the code in the cell. 66 | If it works and you get no error messages, **you are all set**. 67 | 68 | If you get error messages about missing packages, you can install the packages you need using your package manager, or try Option 1B and install Anaconda. 69 | 70 | 71 | #### Option 1B: If you don't already have Jupyter. 72 | 73 | I highly recommend installing Anaconda, which is a Python distribution that contains everything 74 | you need for this tutorial. It is easy to install on Windows, Mac, and Linux, and because it does a 75 | user-level install, it will not interfere with other Python installations.
76 | 77 | [Information about installing Anaconda is here](http://docs.continuum.io/anaconda/install.html). 78 | 79 | When you install Anaconda, you should get Jupyter by default, but if not, run 80 | 81 | conda install jupyter 82 | 83 | Then go to Option 1A to make sure you can run my code. 84 | 85 | If you don't want to install Anaconda, 86 | [you can see some other options here](http://jupyter.readthedocs.io/en/latest/install.html). 87 | 88 | 89 | #### Option 2: only if Option 1 failed. 90 | 91 | You can run my notebook in a virtual machine on Binder. To launch the VM, press this button: 92 | 93 | [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org:/repo/allendowney/compstats) 94 | 95 | You should see a home page with a list of the files in the repository. 96 | 97 | If you want to try the exercises, open `effect_size.ipynb`. If you just want to see the answers, open `effect_size_soln.ipynb`. Either way, you should be able to run the notebooks in your browser and try out the examples. 98 | 99 | However, be aware that the virtual machine you are running is temporary. If you leave it idle for more than an hour or so, it will disappear along with any work you have done. 100 | 101 | Special thanks to the generous people who run Binder, which makes it easy to share and reproduce computation. 102 | --------------------------------------------------------------------------------