├── .gitignore ├── 2002FemPreg.dat.gz ├── 2002FemPreg.dct ├── LICENSE ├── README.md ├── _config.yml ├── check_env.py ├── cumulative_snowfall.png ├── effect_size.ipynb ├── effect_size_soln.ipynb ├── environment.yml ├── first.py ├── hypothesis.ipynb ├── hypothesis.py ├── hypothesis_soln.ipynb ├── hypothesis_testing.pdf ├── hypothesis_testing.png ├── hypothesis_testing.svg ├── hypothesis_testing_small.png ├── look_and_say.ipynb ├── lyrics-elvis-presley.txt ├── nsfg.py ├── nsfg2.py ├── pg2591.txt ├── pmf_intro.ipynb ├── resampling.ipynb ├── resampling.pdf ├── resampling.png ├── resampling.svg ├── resampling_small.png ├── sampling.ipynb ├── sampling_soln.ipynb ├── text_analysis.ipynb ├── the_fault_in_our_stars.txt ├── thinkplot.py ├── thinkstats2.py ├── tumbleweed.ipynb └── tutorial.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | lib/ 17 | lib64/ 18 | parts/ 19 | sdist/ 20 | var/ 21 | *.egg-info/ 22 | .installed.cfg 23 | *.egg 24 | 25 | # PyInstaller 26 | # Usually these files are written by a python script from a template 27 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
28 | *.manifest 29 | *.spec 30 | 31 | # Installer logs 32 | pip-log.txt 33 | pip-delete-this-directory.txt 34 | 35 | # Unit test / coverage reports 36 | htmlcov/ 37 | .tox/ 38 | .coverage 39 | .cache 40 | nosetests.xml 41 | coverage.xml 42 | 43 | # Translations 44 | *.mo 45 | *.pot 46 | 47 | # Django stuff: 48 | *.log 49 | 50 | # Sphinx documentation 51 | docs/_build/ 52 | 53 | # PyBuilder 54 | target/ 55 | -------------------------------------------------------------------------------- /2002FemPreg.dat.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/2002FemPreg.dat.gz -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Allen Downey 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CompStats 2 | 3 | Code for a workshop on statistical inference using computational methods in Python. 4 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-minimal -------------------------------------------------------------------------------- /check_env.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2013 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function, division 9 | 10 | import math 11 | import numpy 12 | 13 | from matplotlib import pyplot 14 | 15 | import thinkplot 16 | import thinkstats2 17 | 18 | 19 | def RenderPdf(mu, sigma, n=101): 20 | """Makes xs and ys for a normal PDF with (mu, sigma). 21 | 22 | n: number of places to evaluate the PDF 23 | """ 24 | xs = numpy.linspace(mu-4*sigma, mu+4*sigma, n) 25 | ys = [thinkstats2.EvalNormalPdf(x, mu, sigma) for x in xs] 26 | return xs, ys 27 | 28 | 29 | def main(): 30 | xs, ys = RenderPdf(100, 15) 31 | 32 | n = 34 33 | pyplot.fill_between(xs[-n:], ys[-n:], y2=0.0001, color='blue', alpha=0.2) 34 | s = 'Congratulations!\nIf you got this far,\nyou must be here.'
35 | d = dict(shrink=0.05) 36 | pyplot.annotate(s, [127, 0.002], xytext=[80, 0.005], arrowprops=d) 37 | 38 | thinkplot.Plot(xs, ys) 39 | thinkplot.Show(title='Distribution of IQ', 40 | xlabel='IQ', 41 | ylabel='PDF', 42 | legend=False) 43 | 44 | 45 | if __name__ == "__main__": 46 | main() 47 | -------------------------------------------------------------------------------- /cumulative_snowfall.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/cumulative_snowfall.png -------------------------------------------------------------------------------- /effect_size.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Effect Size\n", 8 | "===\n", 9 | "\n", 10 | "Examples and exercises for a tutorial on statistical inference.\n", 11 | "\n", 12 | "Copyright 2016 Allen Downey\n", 13 | "\n", 14 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "from __future__ import print_function, division\n", 26 | "\n", 27 | "import numpy\n", 28 | "import scipy.stats\n", 29 | "\n", 30 | "import matplotlib.pyplot as pyplot\n", 31 | "\n", 32 | "from ipywidgets import interact, interactive, fixed\n", 33 | "import ipywidgets as widgets\n", 34 | "\n", 35 | "# seed the random number generator so we all get the same results\n", 36 | "numpy.random.seed(17)\n", 37 | "\n", 38 | "# some nice colors from http://colorbrewer2.org/\n", 39 | "COLOR1 = '#7fc97f'\n", 40 | "COLOR2 = '#beaed4'\n", 41 | "COLOR3 = '#fdc086'\n", 42 | "COLOR4 = '#ffff99'\n", 43 | "COLOR5 = '#386cb0'\n", 44 | "\n", 45 | "%matplotlib inline" 46 | ] 
47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Part One\n", 53 | "\n", 54 | "To explore statistics that quantify effect size, we'll look at the difference in height between men and women. I used data from the Behavioral Risk Factor Surveillance System (BRFSS) to estimate the mean and standard deviation of height in cm for adult women and men in the U.S.\n", 55 | "\n", 56 | "I'll use `scipy.stats.norm` to represent the distributions. The result is an `rv` object (which stands for random variable)." 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "mu1, sig1 = 178, 7.7\n", 68 | "male_height = scipy.stats.norm(mu1, sig1)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": true 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "mu2, sig2 = 163, 7.3\n", 80 | "female_height = scipy.stats.norm(mu2, sig2)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "The following function evaluates the normal (Gaussian) probability density function (PDF) within 4 standard deviations of the mean. It takes an rv object and returns a pair of NumPy arrays." 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "def eval_pdf(rv, num=4):\n", 99 | " mean, std = rv.mean(), rv.std()\n", 100 | " xs = numpy.linspace(mean - num*std, mean + num*std, 100)\n", 101 | " ys = rv.pdf(xs)\n", 102 | " return xs, ys" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Here's what the two distributions look like."
110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "xs, ys = eval_pdf(male_height)\n", 119 | "pyplot.plot(xs, ys, label='male', linewidth=4, color=COLOR2)\n", 120 | "\n", 121 | "xs, ys = eval_pdf(female_height)\n", 122 | "pyplot.plot(xs, ys, label='female', linewidth=4, color=COLOR3)\n", 123 | "pyplot.xlabel('height (cm)')\n", 124 | "None" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "Let's assume for now that those are the true distributions for the population.\n", 132 | "\n", 133 | "I'll use `rvs` to generate random samples from the population distributions. Note that these are totally random, totally representative samples, with no measurement error!" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "male_sample = male_height.rvs(1000)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "female_sample = female_height.rvs(1000)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "Both samples are NumPy arrays. Now we can compute sample statistics like the mean and standard deviation." 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "mean1, std1 = male_sample.mean(), male_sample.std()\n", 172 | "mean1, std1" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "The sample mean is close to the population mean, but not exact, as expected." 
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "mean2, std2 = female_sample.mean(), female_sample.std()\n", 189 | "mean2, std2" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "And the results are similar for the female sample.\n", 197 | "\n", 198 | "Now, there are many ways to describe the magnitude of the difference between these distributions. An obvious one is the difference in the means:" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "difference_in_means = male_sample.mean() - female_sample.mean()\n", 208 | "difference_in_means # in cm" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "On average, men are 14--15 centimeters taller. For some applications, that would be a good way to describe the difference, but there are a few problems:\n", 216 | "\n", 217 | "* Without knowing more about the distributions (like the standard deviations) it's hard to interpret whether a difference like 15 cm is a lot or not.\n", 218 | "\n", 219 | "* The magnitude of the difference depends on the units of measure, making it hard to compare across different studies.\n", 220 | "\n", 221 | "There are a number of ways to quantify the difference between distributions. A simple option is to express the difference as a percentage of the mean.\n", 222 | "\n", 223 | "**Exercise 1**: what is the relative difference in means, expressed as a percentage?" 
224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": { 230 | "collapsed": true 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "# Solution goes here" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "**STOP HERE**: We'll regroup and discuss before you move on." 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "## Part Two\n", 249 | "\n", 250 | "An alternative way to express the difference between distributions is to see how much they overlap. To define overlap, we choose a threshold between the two means. The simplest threshold is the midpoint between the means:" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "simple_thresh = (mean1 + mean2) / 2\n", 260 | "simple_thresh" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "A better, but slightly more complicated threshold is the place where the PDFs cross."
268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)\n", 277 | "thresh" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "In this example, there's not much difference between the two thresholds.\n", 285 | "\n", 286 | "Now we can count how many men are below the threshold:" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "male_below_thresh = sum(male_sample < thresh)\n", 296 | "male_below_thresh" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "And how many women are above it:" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "female_above_thresh = sum(female_sample > thresh)\n", 313 | "female_above_thresh" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "The \"overlap\" is the area under the curves that ends up on the wrong side of the threshold." 
321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "male_overlap = male_below_thresh / len(male_sample)\n", 330 | "female_overlap = female_above_thresh / len(female_sample)\n", 331 | "male_overlap, female_overlap" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "In practical terms, you might report the fraction of people who would be misclassified if you tried to use height to guess sex, which is the average of the male and female overlap rates:" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "misclassification_rate = (male_overlap + female_overlap) / 2\n", 348 | "misclassification_rate" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "Another way to quantify the difference between distributions is what's called \"probability of superiority\", which is a problematic term, but in this context it's the probability that a randomly-chosen man is taller than a randomly-chosen woman.\n", 356 | "\n", 357 | "**Exercise 2**: Suppose I choose a man and a woman at random. What is the probability that the man is taller?\n", 358 | "\n", 359 | "HINT: You can `zip` the two samples together and count the number of pairs where the male is taller, or use NumPy array operations." 
360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": { 366 | "collapsed": true 367 | }, 368 | "outputs": [], 369 | "source": [ 370 | "# Solution goes here" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": { 377 | "collapsed": true 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "# Solution goes here" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "Overlap (or misclassification rate) and \"probability of superiority\" have two good properties:\n", 389 | "\n", 390 | "* As probabilities, they don't depend on units of measure, so they are comparable between studies.\n", 391 | "\n", 392 | "* They are expressed in operational terms, so a reader has a sense of what practical effect the difference makes.\n", 393 | "\n", 394 | "### Cohen's effect size\n", 395 | "\n", 396 | "There is one other common way to express the difference between distributions. Cohen's $d$ is the difference in means, standardized by dividing by the standard deviation. 
Here's the math notation:\n", 397 | "\n", 398 | "$ d = \\frac{\\bar{x}_1 - \\bar{x}_2} s $\n", 399 | "\n", 400 | "where $s$ is the pooled standard deviation:\n", 401 | "\n", 402 | "$s = \\sqrt{\\frac{n_1 s^2_1 + n_2 s^2_2}{n_1+n_2}}$\n", 403 | "\n", 404 | "Here's a function that computes it:\n" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": { 411 | "collapsed": true 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "def CohenEffectSize(group1, group2):\n", 416 | " \"\"\"Compute Cohen's d.\n", 417 | "\n", 418 | " group1: Series or NumPy array\n", 419 | " group2: Series or NumPy array\n", 420 | "\n", 421 | " returns: float\n", 422 | " \"\"\"\n", 423 | " diff = group1.mean() - group2.mean()\n", 424 | "\n", 425 | " n1, n2 = len(group1), len(group2)\n", 426 | " var1 = group1.var()\n", 427 | " var2 = group2.var()\n", 428 | "\n", 429 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 430 | " d = diff / numpy.sqrt(pooled_var)\n", 431 | " return d" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "Computing the denominator is a little complicated; in fact, people have proposed several ways to do it. This implementation uses the \"pooled standard deviation\", which is a weighted average of the standard deviations of the two groups.\n", 439 | "\n", 440 | "And here's the result for the difference in height between men and women." 
441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [ 449 | "CohenEffectSize(male_sample, female_sample)" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "Most people don't have a good sense of how big $d=1.9$ is, so let's make a visualization to get calibrated.\n", 457 | "\n", 458 | "Here's a function that encapsulates the code we already saw for computing overlap and probability of superiority." 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": { 465 | "collapsed": true 466 | }, 467 | "outputs": [], 468 | "source": [ 469 | "def overlap_superiority(control, treatment, n=1000):\n", 470 | " \"\"\"Estimates overlap and superiority based on a sample.\n", 471 | " \n", 472 | " control: scipy.stats rv object\n", 473 | " treatment: scipy.stats rv object\n", 474 | " n: sample size\n", 475 | " \"\"\"\n", 476 | " control_sample = control.rvs(n)\n", 477 | " treatment_sample = treatment.rvs(n)\n", 478 | " thresh = (control.mean() + treatment.mean()) / 2\n", 479 | " \n", 480 | " control_above = sum(control_sample > thresh)\n", 481 | " treatment_below = sum(treatment_sample < thresh)\n", 482 | " overlap = (control_above + treatment_below) / n\n", 483 | " \n", 484 | " superiority = (treatment_sample > control_sample).mean()\n", 485 | " return overlap, superiority" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "Here's the function that takes Cohen's $d$, plots normal distributions with the given effect size, and prints their overlap and superiority." 
493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": { 499 | "collapsed": true 500 | }, 501 | "outputs": [], 502 | "source": [ 503 | "def plot_pdfs(cohen_d=2):\n", 504 | " \"\"\"Plot PDFs for distributions that differ by some number of stds.\n", 505 | " \n", 506 | " cohen_d: number of standard deviations between the means\n", 507 | " \"\"\"\n", 508 | " control = scipy.stats.norm(0, 1)\n", 509 | " treatment = scipy.stats.norm(cohen_d, 1)\n", 510 | " xs, ys = eval_pdf(control)\n", 511 | " pyplot.fill_between(xs, ys, label='control', color=COLOR3, alpha=0.7)\n", 512 | "\n", 513 | " xs, ys = eval_pdf(treatment)\n", 514 | " pyplot.fill_between(xs, ys, label='treatment', color=COLOR2, alpha=0.7)\n", 515 | " \n", 516 | " o, s = overlap_superiority(control, treatment)\n", 517 | " pyplot.text(0, 0.05, 'overlap ' + str(o))\n", 518 | " pyplot.text(0, 0.15, 'superiority ' + str(s))\n", 519 | " pyplot.show()\n", 520 | " #print('overlap', o)\n", 521 | " #print('superiority', s)" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "Here's an example that demonstrates the function:" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": null, 534 | "metadata": {}, 535 | "outputs": [], 536 | "source": [ 537 | "plot_pdfs(2)" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "And an interactive widget you can use to visualize what different values of $d$ mean:" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "slider = widgets.FloatSlider(min=0, max=4, value=2)\n", 554 | "interact(plot_pdfs, cohen_d=slider)\n", 555 | "None" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "Cohen's $d$ has a few nice properties:\n", 563 | "\n", 564 | "* Because mean and standard 
deviation have the same units, their ratio is dimensionless, so we can compare $d$ across different studies.\n", 565 | "\n", 566 | "* In fields that commonly use $d$, people are calibrated to know what values should be considered big, surprising, or important.\n", 567 | "\n", 568 | "* Given $d$ (and the assumption that the distributions are normal), you can compute overlap, superiority, and related statistics." 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "In summary, the best way to report effect size depends on the audience and your goals. There is often a tradeoff between summary statistics that have good technical properties and statistics that are meaningful to a general audience." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": { 582 | "collapsed": true 583 | }, 584 | "outputs": [], 585 | "source": [] 586 | } 587 | ], 588 | "metadata": { 589 | "kernelspec": { 590 | "display_name": "Python 3", 591 | "language": "python", 592 | "name": "python3" 593 | }, 594 | "language_info": { 595 | "codemirror_mode": { 596 | "name": "ipython", 597 | "version": 3 598 | }, 599 | "file_extension": ".py", 600 | "mimetype": "text/x-python", 601 | "name": "python", 602 | "nbconvert_exporter": "python", 603 | "pygments_lexer": "ipython3", 604 | "version": "3.6.1" 605 | } 606 | }, 607 | "nbformat": 4, 608 | "nbformat_minor": 1 609 | } 610 | -------------------------------------------------------------------------------- /effect_size_soln.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Effect Size\n", 8 | "===\n", 9 | "\n", 10 | "Examples and exercises for a tutorial on statistical inference.\n", 11 | "\n", 12 | "Copyright 2016 Allen Downey\n", 13 | "\n", 14 | "License: [Creative Commons Attribution 4.0 
International](http://creativecommons.org/licenses/by/4.0/)" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from __future__ import print_function, division\n", 24 | "\n", 25 | "import numpy\n", 26 | "import scipy.stats\n", 27 | "\n", 28 | "import matplotlib.pyplot as pyplot\n", 29 | "\n", 30 | "from ipywidgets import interact, interactive, fixed\n", 31 | "import ipywidgets as widgets\n", 32 | "\n", 33 | "# seed the random number generator so we all get the same results\n", 34 | "numpy.random.seed(17)\n", 35 | "\n", 36 | "# some nice colors from http://colorbrewer2.org/\n", 37 | "COLOR1 = '#7fc97f'\n", 38 | "COLOR2 = '#beaed4'\n", 39 | "COLOR3 = '#fdc086'\n", 40 | "COLOR4 = '#ffff99'\n", 41 | "COLOR5 = '#386cb0'\n", 42 | "\n", 43 | "%matplotlib inline" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## Part One\n", 51 | "\n", 52 | "To explore statistics that quantify effect size, we'll look at the difference in height between men and women. I used data from the Behavioral Risk Factor Surveillance System (BRFSS) to estimate the mean and standard deviation of height in cm for adult women and men in the U.S.\n", 53 | "\n", 54 | "I'll use `scipy.stats.norm` to represent the distributions. The result is an `rv` object (which stands for random variable)." 
55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "mu1, sig1 = 178, 7.7\n", 64 | "male_height = scipy.stats.norm(mu1, sig1)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "mu2, sig2 = 163, 7.3\n", 74 | "female_height = scipy.stats.norm(mu2, sig2)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "The following function evaluates the normal (Gaussian) probability density function (PDF) within 4 standard deviations of the mean. It takes an rv object and returns a pair of NumPy arrays." 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "def eval_pdf(rv, num=4):\n", 91 | " mean, std = rv.mean(), rv.std()\n", 92 | " xs = numpy.linspace(mean - num*std, mean + num*std, 100)\n", 93 | " ys = rv.pdf(xs)\n", 94 | " return xs, ys" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "Here's what the two distributions look like." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "xs, ys = eval_pdf(male_height)\n", 111 | "pyplot.plot(xs, ys, label='male', linewidth=4, color=COLOR2)\n", 112 | "\n", 113 | "xs, ys = eval_pdf(female_height)\n", 114 | "pyplot.plot(xs, ys, label='female', linewidth=4, color=COLOR3)\n", 115 | "pyplot.xlabel('height (cm)')\n", 116 | "None" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Let's assume for now that those are the true distributions for the population.\n", 124 | "\n", 125 | "I'll use `rvs` to generate random samples from the population distributions.
Note that these are totally random, totally representative samples, with no measurement error!" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "male_sample = male_height.rvs(1000)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "female_sample = female_height.rvs(1000)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "Both samples are NumPy arrays. Now we can compute sample statistics like the mean and standard deviation." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "mean1, std1 = male_sample.mean(), male_sample.std()\n", 160 | "mean1, std1" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "The sample mean is close to the population mean, but not exact, as expected." 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "mean2, std2 = female_sample.mean(), female_sample.std()\n", 177 | "mean2, std2" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "And the results are similar for the female sample.\n", 185 | "\n", 186 | "Now, there are many ways to describe the magnitude of the difference between these distributions. 
An obvious one is the difference in the means:" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "difference_in_means = male_sample.mean() - female_sample.mean()\n", 196 | "difference_in_means # in cm" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "On average, men are 14--15 centimeters taller. For some applications, that would be a good way to describe the difference, but there are a few problems:\n", 204 | "\n", 205 | "* Without knowing more about the distributions (like the standard deviations) it's hard to interpret whether a difference like 15 cm is a lot or not.\n", 206 | "\n", 207 | "* The magnitude of the difference depends on the units of measure, making it hard to compare across different studies.\n", 208 | "\n", 209 | "There are a number of ways to quantify the difference between distributions. A simple option is to express the difference as a percentage of the mean.\n", 210 | "\n", 211 | "**Exercise 1**: what is the relative difference in means, expressed as a percentage?" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "# Solution goes here\n", 221 | "\n", 222 | "relative_difference = difference_in_means / male_sample.mean()\n", 223 | "print(relative_difference * 100) # percent\n", 224 | "\n", 225 | "# A problem with relative differences is that you have to choose \n", 226 | "# which mean to express them relative to.\n", 227 | "\n", 228 | "relative_difference = difference_in_means / female_sample.mean()\n", 229 | "print(relative_difference * 100) # percent" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "**STOP HERE**: We'll regroup and discuss before you move on." 
237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "## Part Two\n", 244 | "\n", 245 | "An alternative way to express the difference between distributions is to see how much they overlap. To define overlap, we choose a threshold between the two means. The simplest threshold is the midpoint between the means:" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "simple_thresh = (mean1 + mean2) / 2\n", 255 | "simple_thresh" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "A better, but slightly more complicated threshold is the place where the PDFs cross." 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)\n", 272 | "thresh" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "In this example, there's not much difference between the two thresholds.\n", 280 | "\n", 281 | "Now we can count how many men are below the threshold:" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "male_below_thresh = sum(male_sample < thresh)\n", 291 | "male_below_thresh" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "And how many women are above it:" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "female_above_thresh = sum(female_sample > thresh)\n", 308 | "female_above_thresh" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "The \"overlap\" is the area under the curves that ends up
on the wrong side of the threshold." 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "male_overlap = male_below_thresh / len(male_sample)\n", 325 | "female_overlap = female_above_thresh / len(female_sample)\n", 326 | "male_overlap, female_overlap" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "In practical terms, you might report the fraction of people who would be misclassified if you tried to use height to guess sex, which is the average of the male and female overlap rates:" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "misclassification_rate = (male_overlap + female_overlap) / 2\n", 343 | "misclassification_rate" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "Another way to quantify the difference between distributions is what's called \"probability of superiority\", which is a problematic term, but in this context it's the probability that a randomly-chosen man is taller than a randomly-chosen woman.\n", 351 | "\n", 352 | "**Exercise 2**: Suppose I choose a man and a woman at random. What is the probability that the man is taller?\n", 353 | "\n", 354 | "HINT: You can `zip` the two samples together and count the number of pairs where the male is taller, or use NumPy array operations." 
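The overlap calculation can be exercised end to end on synthetic data. The following is a minimal sketch, not part of the notebook; the normal parameters are illustrative, not estimates from the samples used above:

```python
import numpy as np

np.random.seed(17)

# Illustrative heights in cm (assumed parameters, not the data above).
male_sample = np.random.normal(178, 7.7, size=10000)
female_sample = np.random.normal(163, 7.3, size=10000)

mean1, std1 = male_sample.mean(), male_sample.std()
mean2, std2 = female_sample.mean(), female_sample.std()

# Threshold where the two normal PDFs cross, as defined above.
thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)

# Fraction of each group that ends up on the wrong side of the threshold.
male_overlap = (male_sample < thresh).mean()
female_overlap = (female_sample > thresh).mean()
misclassification_rate = (male_overlap + female_overlap) / 2
print(misclassification_rate)
```

With these parameters the gap between the means is about two standard deviations, so the threshold sits roughly one standard deviation from each mean and the misclassification rate comes out near 16%.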
355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "# Solution goes here\n", 364 | "\n", 365 | "sum(x > y for x, y in zip(male_sample, female_sample)) / len(male_sample)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "# Solution goes here\n", 375 | "\n", 376 | "(male_sample > female_sample).mean()" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "Overlap (or misclassification rate) and \"probability of superiority\" have two good properties:\n", 384 | "\n", 385 | "* As probabilities, they don't depend on units of measure, so they are comparable between studies.\n", 386 | "\n", 387 | "* They are expressed in operational terms, so a reader has a sense of what practical effect the difference makes.\n", 388 | "\n", 389 | "### Cohen's effect size\n", 390 | "\n", 391 | "There is one other common way to express the difference between distributions. Cohen's $d$ is the difference in means, standardized by dividing by the standard deviation. 
Here's the math notation:\n", 392 | "\n", 393 | "$ d = \\frac{\\bar{x}_1 - \\bar{x}_2} s $\n", 394 | "\n", 395 | "where $s$ is the pooled standard deviation:\n", 396 | "\n", 397 | "$s = \\sqrt{\\frac{n_1 s^2_1 + n_2 s^2_2}{n_1+n_2}}$\n", 398 | "\n", 399 | "Here's a function that computes it:\n" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "def CohenEffectSize(group1, group2):\n", 409 | " \"\"\"Compute Cohen's d.\n", 410 | "\n", 411 | " group1: Series or NumPy array\n", 412 | " group2: Series or NumPy array\n", 413 | "\n", 414 | " returns: float\n", 415 | " \"\"\"\n", 416 | " diff = group1.mean() - group2.mean()\n", 417 | "\n", 418 | " n1, n2 = len(group1), len(group2)\n", 419 | " var1 = group1.var()\n", 420 | " var2 = group2.var()\n", 421 | "\n", 422 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 423 | " d = diff / numpy.sqrt(pooled_var)\n", 424 | " return d" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "Computing the denominator is a little complicated; in fact, people have proposed several ways to do it. This implementation uses the \"pooled standard deviation\", which is a weighted average of the standard deviations of the two groups.\n", 432 | "\n", 433 | "And here's the result for the difference in height between men and women." 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "CohenEffectSize(male_sample, female_sample)" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Most people don't have a good sense of how big $d=1.9$ is, so let's make a visualization to get calibrated.\n", 450 | "\n", 451 | "Here's a function that encapsulates the code we already saw for computing overlap and probability of superiority." 
452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "def overlap_superiority(control, treatment, n=1000):\n", 461 | " \"\"\"Estimates overlap and superiority based on a sample.\n", 462 | " \n", 463 | " control: scipy.stats rv object\n", 464 | " treatment: scipy.stats rv object\n", 465 | " n: sample size\n", 466 | " \"\"\"\n", 467 | " control_sample = control.rvs(n)\n", 468 | " treatment_sample = treatment.rvs(n)\n", 469 | " thresh = (control.mean() + treatment.mean()) / 2\n", 470 | " \n", 471 | " control_above = sum(control_sample > thresh)\n", 472 | " treatment_below = sum(treatment_sample < thresh)\n", 473 | " overlap = (control_above + treatment_below) / n\n", 474 | " \n", 475 | " superiority = (treatment_sample > control_sample).mean()\n", 476 | " return overlap, superiority" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "Here's the function that takes Cohen's $d$, plots normal distributions with the given effect size, and prints their overlap and superiority." 
484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "def plot_pdfs(cohen_d=2):\n", 493 | " \"\"\"Plot PDFs for distributions that differ by some number of stds.\n", 494 | " \n", 495 | " cohen_d: number of standard deviations between the means\n", 496 | " \"\"\"\n", 497 | " control = scipy.stats.norm(0, 1)\n", 498 | " treatment = scipy.stats.norm(cohen_d, 1)\n", 499 | " xs, ys = eval_pdf(control)\n", 500 | " pyplot.fill_between(xs, ys, label='control', color=COLOR3, alpha=0.7)\n", 501 | "\n", 502 | " xs, ys = eval_pdf(treatment)\n", 503 | " pyplot.fill_between(xs, ys, label='treatment', color=COLOR2, alpha=0.7)\n", 504 | " \n", 505 | " o, s = overlap_superiority(control, treatment)\n", 506 | " pyplot.text(0, 0.05, 'overlap ' + str(o))\n", 507 | " pyplot.text(0, 0.15, 'superiority ' + str(s))\n", 508 | " pyplot.show()\n", 509 | " #print('overlap', o)\n", 510 | " #print('superiority', s)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "Here's an example that demonstrates the function:" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "plot_pdfs(2)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "And an interactive widget you can use to visualize what different values of $d$ mean:" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "slider = widgets.FloatSlider(min=0, max=4, value=2)\n", 543 | "interact(plot_pdfs, cohen_d=slider)\n", 544 | "None" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "Cohen's $d$ has a few nice properties:\n", 552 | "\n", 553 | "* Because mean and standard deviation have the same units, 
their ratio is dimensionless, so we can compare $d$ across different studies.\n", 554 | "\n", 555 | "* In fields that commonly use $d$, people are calibrated to know what values should be considered big, surprising, or important.\n", 556 | "\n", 557 | "* Given $d$ (and the assumption that the distributions are normal), you can compute overlap, superiority, and related statistics." 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": {}, 563 | "source": [ 564 | "In summary, the best way to report effect size depends on the audience and your goals. There is often a tradeoff between summary statistics that have good technical properties and statistics that are meaningful to a general audience." 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": {}, 571 | "outputs": [], 572 | "source": [] 573 | } 574 | ], 575 | "metadata": { 576 | "kernelspec": { 577 | "display_name": "Python 3", 578 | "language": "python", 579 | "name": "python3" 580 | }, 581 | "language_info": { 582 | "codemirror_mode": { 583 | "name": "ipython", 584 | "version": 3 585 | }, 586 | "file_extension": ".py", 587 | "mimetype": "text/x-python", 588 | "name": "python", 589 | "nbconvert_exporter": "python", 590 | "pygments_lexer": "ipython3", 591 | "version": "3.6.1" 592 | } 593 | }, 594 | "nbformat": 4, 595 | "nbformat_minor": 1 596 | } 597 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: CompStats 2 | 3 | dependencies: 4 | - python=3.7 5 | - jupyter 6 | - numpy 7 | - matplotlib 8 | - seaborn 9 | - pandas 10 | - scipy 11 | 12 | 13 | 14 | 15 | 16 | -------------------------------------------------------------------------------- /first.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. 
Downey, available from greenteapress.com 3 | 4 | Copyright 2014 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import math 11 | import numpy as np 12 | 13 | import nsfg 14 | import thinkstats2 15 | import thinkplot 16 | 17 | 18 | def MakeFrames(): 19 | """Reads pregnancy data and partitions first babies and others. 20 | 21 | returns: DataFrames (all live births, first babies, others) 22 | """ 23 | preg = nsfg.ReadFemPreg() 24 | 25 | live = preg[preg.outcome == 1] 26 | firsts = live[live.birthord == 1] 27 | others = live[live.birthord != 1] 28 | 29 | assert len(live) == 9148 30 | assert len(firsts) == 4413 31 | assert len(others) == 4735 32 | 33 | return live, firsts, others 34 | 35 | 36 | def Summarize(live, firsts, others): 37 | """Print various summary statistics.""" 38 | 39 | mean = live.prglngth.mean() 40 | var = live.prglngth.var() 41 | std = live.prglngth.std() 42 | 43 | print('Live mean', mean) 44 | print('Live variance', var) 45 | print('Live std', std) 46 | 47 | mean1 = firsts.prglngth.mean() 48 | mean2 = others.prglngth.mean() 49 | 50 | var1 = firsts.prglngth.var() 51 | var2 = others.prglngth.var() 52 | 53 | print('Mean') 54 | print('First babies', mean1) 55 | print('Others', mean2) 56 | 57 | print('Variance') 58 | print('First babies', var1) 59 | print('Others', var2) 60 | 61 | print('Difference in weeks', mean1 - mean2) 62 | print('Difference in hours', (mean1 - mean2) * 7 * 24) 63 | 64 | print('Difference relative to 39 weeks', (mean1 - mean2) / 39 * 100) 65 | 66 | d = thinkstats2.CohenEffectSize(firsts.prglngth, others.prglngth) 67 | print('Cohen d', d) 68 | 69 | 70 | def PrintExtremes(live): 71 | """Plots the histogram of pregnancy lengths and prints the extremes. 
72 | 73 | live: DataFrame of live births 74 | """ 75 | hist = thinkstats2.Hist(live.prglngth) 76 | thinkplot.Hist(hist, label='live births') 77 | 78 | thinkplot.Save(root='first_nsfg_hist_live', 79 | title='Histogram', 80 | xlabel='weeks', 81 | ylabel='frequency') 82 | 83 | print('Shortest lengths:') 84 | for weeks, freq in hist.Smallest(10): 85 | print(weeks, freq) 86 | 87 | print('Longest lengths:') 88 | for weeks, freq in hist.Largest(10): 89 | print(weeks, freq) 90 | 91 | 92 | def MakeHists(live): 93 | """Plot Hists for live births 94 | 95 | live: DataFrame 96 | others: DataFrame 97 | """ 98 | hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb') 99 | thinkplot.Hist(hist) 100 | thinkplot.Save(root='first_wgt_lb_hist', 101 | xlabel='pounds', 102 | ylabel='frequency', 103 | axis=[-1, 14, 0, 3200]) 104 | 105 | hist = thinkstats2.Hist(live.birthwgt_oz, label='birthwgt_oz') 106 | thinkplot.Hist(hist) 107 | thinkplot.Save(root='first_wgt_oz_hist', 108 | xlabel='ounces', 109 | ylabel='frequency', 110 | axis=[-1, 16, 0, 1200]) 111 | 112 | hist = thinkstats2.Hist(np.floor(live.agepreg), label='agepreg') 113 | thinkplot.Hist(hist) 114 | thinkplot.Save(root='first_agepreg_hist', 115 | xlabel='years', 116 | ylabel='frequency') 117 | 118 | hist = thinkstats2.Hist(live.prglngth, label='prglngth') 119 | thinkplot.Hist(hist) 120 | thinkplot.Save(root='first_prglngth_hist', 121 | xlabel='weeks', 122 | ylabel='frequency', 123 | axis=[-1, 53, 0, 5000]) 124 | 125 | 126 | def MakeComparison(firsts, others): 127 | """Plots histograms of pregnancy length for first babies and others. 
128 | 129 | firsts: DataFrame 130 | others: DataFrame 131 | """ 132 | first_hist = thinkstats2.Hist(firsts.prglngth, label='first') 133 | other_hist = thinkstats2.Hist(others.prglngth, label='other') 134 | 135 | width = 0.45 136 | thinkplot.PrePlot(2) 137 | thinkplot.Hist(first_hist, align='right', width=width) 138 | thinkplot.Hist(other_hist, align='left', width=width) 139 | 140 | thinkplot.Save(root='first_nsfg_hist', 141 | title='Histogram', 142 | xlabel='weeks', 143 | ylabel='frequency', 144 | axis=[27, 46, 0, 2700]) 145 | 146 | 147 | def main(script): 148 | live, firsts, others = MakeFrames() 149 | 150 | MakeHists(live) 151 | PrintExtremes(live) 152 | MakeComparison(firsts, others) 153 | Summarize(live, firsts, others) 154 | 155 | 156 | if __name__ == '__main__': 157 | import sys 158 | main(*sys.argv) 159 | 160 | 161 | -------------------------------------------------------------------------------- /hypothesis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Hypothesis Testing\n", 8 | "==================\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "from __future__ import print_function, division\n", 24 | "\n", 25 | "import numpy\n", 26 | "import scipy.stats\n", 27 | "\n", 28 | "import matplotlib.pyplot as pyplot\n", 29 | "\n", 30 | "import first\n", 31 | "\n", 32 | "# some nicer colors from http://colorbrewer2.org/\n", 33 | "COLOR1 = '#7fc97f'\n", 34 | "COLOR2 = '#beaed4'\n", 35 | "COLOR3 = '#fdc086'\n", 36 | "COLOR4 = '#ffff99'\n", 37 | "COLOR5 = '#386cb0'\n", 38 | "\n", 39 | "%matplotlib inline" 40 | ] 41 | }, 42 | { 43 | 
"cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Part One" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Suppose you observe an apparent difference between two groups and you want to check whether it might be due to chance.\n", 54 | "\n", 55 | "As an example, we'll look at differences between first babies and others. The `first` module provides code to read data from the National Survey of Family Growth (NSFG)." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "live, firsts, others = first.MakeFrames()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "We'll look at a couple of variables, including pregnancy length and birth weight. The effect size we'll consider is the difference in the means.\n", 74 | "\n", 75 | "Other examples might include a correlation between variables or a coefficient in a linear regression. The number that quantifies the size of the effect is called the \"test statistic\"." 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": true 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "def TestStatistic(data):\n", 87 | " group1, group2 = data\n", 88 | " test_stat = abs(group1.mean() - group2.mean())\n", 89 | " return test_stat" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "For the first example, I extract the pregnancy length for first babies and others. The results are pandas Series objects." 
97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": true 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "group1 = firsts.prglngth\n", 108 | "group2 = others.prglngth" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "The actual difference in the means is 0.078 weeks, which is only 13 hours." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "actual = TestStatistic((group1, group2))\n", 125 | "actual" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "The null hypothesis is that there is no difference between the groups. We can model that by forming a pooled sample that includes first babies and others." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "n, m = len(group1), len(group2)\n", 144 | "pool = numpy.hstack((group1, group2))" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Then we can simulate the null hypothesis by shuffling the pool and dividing it into two groups, using the same sizes as the actual sample." 
152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": { 158 | "collapsed": true 159 | }, 160 | "outputs": [], 161 | "source": [ 162 | "def RunModel():\n", 163 | " numpy.random.shuffle(pool)\n", 164 | " data = pool[:n], pool[n:]\n", 165 | " return data" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "The result of running the model is two NumPy arrays with the shuffled pregnancy lengths:" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "RunModel()" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "Then we compute the same test statistic using the simulated data:" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "TestStatistic(RunModel())" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "If we run the model 1000 times and compute the test statistic, we can see how much the test statistic varies under the null hypothesis." 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "test_stats = numpy.array([TestStatistic(RunModel()) for i in range(1000)])\n", 214 | "test_stats.shape" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "Here's the sampling distribution of the test statistic under the null hypothesis, with the actual difference in means indicated by a gray line." 
222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "pyplot.axvline(actual, linewidth=3, color='0.8')\n", 231 | "pyplot.hist(test_stats, color=COLOR5)\n", 232 | "pyplot.xlabel('difference in means')\n", 233 | "pyplot.ylabel('count')\n", 234 | "None" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "The p-value is the probability that the test statistic under the null hypothesis exceeds the actual value." 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "pvalue = sum(test_stats >= actual) / len(test_stats)\n", 251 | "pvalue" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "In this case the result is about 15%, which means that even if there is no difference between the groups, it is plausible that we could see a sample difference as big as 0.078 weeks.\n", 259 | "\n", 260 | "We conclude that the apparent effect might be due to chance, so we are not confident that it would appear in the general population, or in another sample from the same population.\n", 261 | "\n", 262 | "STOP HERE\n", 263 | "---------" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "Part Two\n", 271 | "========\n", 272 | "\n", 273 | "We can take the pieces from the previous section and organize them in a class that represents the structure of a hypothesis test." 
274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "class HypothesisTest(object):\n", 285 | " \"\"\"Represents a hypothesis test.\"\"\"\n", 286 | "\n", 287 | " def __init__(self, data):\n", 288 | " \"\"\"Initializes.\n", 289 | "\n", 290 | " data: data in whatever form is relevant\n", 291 | " \"\"\"\n", 292 | " self.data = data\n", 293 | " self.MakeModel()\n", 294 | " self.actual = self.TestStatistic(data)\n", 295 | " self.test_stats = None\n", 296 | "\n", 297 | " def PValue(self, iters=1000):\n", 298 | " \"\"\"Computes the distribution of the test statistic and p-value.\n", 299 | "\n", 300 | " iters: number of iterations\n", 301 | "\n", 302 | " returns: float p-value\n", 303 | " \"\"\"\n", 304 | " self.test_stats = numpy.array([self.TestStatistic(self.RunModel()) \n", 305 | " for _ in range(iters)])\n", 306 | "\n", 307 | " count = sum(self.test_stats >= self.actual)\n", 308 | " return count / iters\n", 309 | "\n", 310 | " def MaxTestStat(self):\n", 311 | " \"\"\"Returns the largest test statistic seen during simulations.\n", 312 | " \"\"\"\n", 313 | " return max(self.test_stats)\n", 314 | "\n", 315 | " def PlotHist(self, label=None):\n", 316 | " \"\"\"Plots a histogram of the test statistics, with a vertical line at the observed test stat.\n", 317 | " \"\"\"\n", 318 | " pyplot.hist(self.test_stats, color=COLOR4)\n", 319 | " pyplot.axvline(self.actual, linewidth=3, color='0.8')\n", 320 | " pyplot.xlabel('test statistic')\n", 321 | " pyplot.ylabel('count')\n", 322 | "\n", 323 | " def TestStatistic(self, data):\n", 324 | " \"\"\"Computes the test statistic.\n", 325 | "\n", 326 | " data: data in whatever form is relevant \n", 327 | " \"\"\"\n", 328 | " raise UnimplementedMethodException()\n", 329 | "\n", 330 | " def MakeModel(self):\n", 331 | " \"\"\"Build a model of the null hypothesis.\n", 332 | " \"\"\"\n", 333 | " pass\n", 334 | "\n", 335 | " def RunModel(self):\n", 336
| " \"\"\"Run the model of the null hypothesis.\n", 337 | "\n", 338 | " returns: simulated data\n", 339 | " \"\"\"\n", 340 | " raise UnimplementedMethodException()\n" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "`HypothesisTest` is an abstract parent class that encodes the template. Child classes fill in the missing methods. For example, here's the test from the previous section." 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "class DiffMeansPermute(HypothesisTest):\n", 359 | " \"\"\"Tests a difference in means by permutation.\"\"\"\n", 360 | "\n", 361 | " def TestStatistic(self, data):\n", 362 | " \"\"\"Computes the test statistic.\n", 363 | "\n", 364 | " data: data in whatever form is relevant \n", 365 | " \"\"\"\n", 366 | " group1, group2 = data\n", 367 | " test_stat = abs(group1.mean() - group2.mean())\n", 368 | " return test_stat\n", 369 | "\n", 370 | " def MakeModel(self):\n", 371 | " \"\"\"Build a model of the null hypothesis.\n", 372 | " \"\"\"\n", 373 | " group1, group2 = self.data\n", 374 | " self.n, self.m = len(group1), len(group2)\n", 375 | " self.pool = numpy.hstack((group1, group2))\n", 376 | "\n", 377 | " def RunModel(self):\n", 378 | " \"\"\"Run the model of the null hypothesis.\n", 379 | "\n", 380 | " returns: simulated data\n", 381 | " \"\"\"\n", 382 | " numpy.random.shuffle(self.pool)\n", 383 | " data = self.pool[:self.n], self.pool[self.n:]\n", 384 | " return data" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "Now we can run the test by instantiating a DiffMeansPermute object:" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "data = (firsts.prglngth, others.prglngth)\n", 401 | "ht = 
DiffMeansPermute(data)\n", 402 | "p_value = ht.PValue(iters=1000)\n", 403 | "print('\\nmeans permute pregnancy length')\n", 404 | "print('p-value =', p_value)\n", 405 | "print('actual =', ht.actual)\n", 406 | "print('ts max =', ht.MaxTestStat())" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "And we can plot the sampling distribution of the test statistic under the null hypothesis." 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "ht.PlotHist()" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "### Difference in standard deviation\n", 430 | "\n", 431 | "**Exercise 1**: Write a class named `DiffStdPermute` that extends `DiffMeansPermute` and overrides `TestStatistic` to compute the difference in standard deviations. Is the difference in standard deviations statistically significant?" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": { 438 | "collapsed": true 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "# Solution goes here" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Here's the code to test your solution to the previous exercise."
450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "data = (firsts.prglngth, others.prglngth)\n", 459 | "ht = DiffStdPermute(data)\n", 460 | "p_value = ht.PValue(iters=1000)\n", 461 | "print('\\nstd permute pregnancy length')\n", 462 | "print('p-value =', p_value)\n", 463 | "print('actual =', ht.actual)\n", 464 | "print('ts max =', ht.MaxTestStat())" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "### Difference in birth weights\n", 472 | "\n", 473 | "Now let's run DiffMeansPermute again to see if there is a difference in birth weight between first babies and others." 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": { 480 | "collapsed": true 481 | }, 482 | "outputs": [], 483 | "source": [ 484 | "data = (firsts.totalwgt_lb.dropna(), others.totalwgt_lb.dropna())\n", 485 | "ht = DiffMeansPermute(data)\n", 486 | "p_value = ht.PValue(iters=1000)\n", 487 | "print('\\nmeans permute birthweight')\n", 488 | "print('p-value =', p_value)\n", 489 | "print('actual =', ht.actual)\n", 490 | "print('ts max =', ht.MaxTestStat())" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "In this case, after 1000 attempts, we never see a sample difference as big as the observed difference, so we conclude that the apparent effect is unlikely under the null hypothesis. Under normal circumstances, we can also make the inference that the apparent effect is unlikely to be caused by random sampling.\n", 498 | "\n", 499 | "One final note: in this case I would report that the p-value is less than 1/1000 or less than 0.001. I would not report p=0, because the apparent effect is not impossible under the null hypothesis; just unlikely." 
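That reporting convention can be captured in a small helper. This is a hypothetical function (the name is mine, not part of the notebook) that turns a simulation count into a report, bounding the p-value by 1/iters instead of reporting zero:

```python
def format_pvalue(count, iters):
    """Report a p-value estimated from `count` hits in `iters` simulations.

    If no simulated test statistic matched or exceeded the observed one,
    report an upper bound rather than p=0: the apparent effect is unlikely
    under the null hypothesis, not impossible.
    """
    if count == 0:
        return 'p < %g' % (1 / iters)
    return 'p = %g' % (count / iters)

print(format_pvalue(0, 1000))    # p < 0.001
print(format_pvalue(150, 1000))  # p = 0.15
```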
500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "### Part Three\n", 507 | "\n", 508 | "In this section, we'll explore the dangers of p-hacking by running multiple tests until we find one that's statistically significant.\n", 509 | "\n", 510 | "Suppose we want to compare IQs for two groups of people. And suppose that, in fact, the two groups are statistically identical; that is, their IQs are drawn from a normal distribution with mean 100 and standard deviation 15.\n", 511 | "\n", 512 | "I'll use `numpy.random.normal` to generate fake data I might get from running such an experiment:" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": { 519 | "collapsed": true 520 | }, 521 | "outputs": [], 522 | "source": [ 523 | "group1 = numpy.random.normal(100, 15, size=100)\n", 524 | "group2 = numpy.random.normal(100, 15, size=100)" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "We expect the mean in both groups to be near 100, but just by random chance, it might be higher or lower." 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": { 538 | "collapsed": true 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "group1.mean(), group2.mean()" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "We can use DiffMeansPermute to compute the p-value for this fake data, which is the probability that we would see a difference between the groups as big as what we saw, just by chance." 
550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": { 556 | "collapsed": true 557 | }, 558 | "outputs": [], 559 | "source": [ 560 | "data = (group1, group2)\n", 561 | "ht = DiffMeansPermute(data)\n", 562 | "p_value = ht.PValue(iters=1000)\n", 563 | "p_value" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "Now let's check the p-value. If it's less than 0.05, the result is statistically significant, and we can publish it. Otherwise, we can try again." 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": { 577 | "collapsed": true 578 | }, 579 | "outputs": [], 580 | "source": [ 581 | "if p_value < 0.05:\n", 582 | " print('Congratulations! Publish it!')\n", 583 | "else:\n", 584 | " print('Too bad! Please try again.')" 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": {}, 590 | "source": [ 591 | "You can probably see where this is going. If we play this game over and over (or if many researchers play it in parallel), the false positive rate can be as high as 100%.\n", 592 | "\n", 593 | "To see this more clearly, let's simulate 100 researchers playing this game. 
I'll take the code we have so far and wrap it in a function:" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "metadata": { 600 | "collapsed": true 601 | }, 602 | "outputs": [], 603 | "source": [ 604 | "def run_a_test(sample_size=100):\n", 605 | " \"\"\"Generate random data and run a hypothesis test on it.\n", 606 | "\n", 607 | " sample_size: integer\n", 608 | "\n", 609 | " returns: p-value\n", 610 | " \"\"\"\n", 611 | " group1 = numpy.random.normal(100, 15, size=sample_size)\n", 612 | " group2 = numpy.random.normal(100, 15, size=sample_size)\n", 613 | " data = (group1, group2)\n", 614 | " ht = DiffMeansPermute(data)\n", 615 | " p_value = ht.PValue(iters=200)\n", 616 | " return p_value" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "Now let's run that function 100 times and save the p-values." 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": null, 629 | "metadata": { 630 | "collapsed": true 631 | }, 632 | "outputs": [], 633 | "source": [ 634 | "num_experiments = 100\n", 635 | "p_values = numpy.array([run_a_test() for i in range(num_experiments)])\n", 636 | "sum(p_values < 0.05)" 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": {}, 642 | "source": [ 643 | "On average, we expect to get a false positive about 5 times out of 100. To see why, let's plot the histogram of the p-values we got." 
644 | ] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "execution_count": null, 649 | "metadata": { 650 | "collapsed": true 651 | }, 652 | "outputs": [], 653 | "source": [ 654 | "bins = numpy.linspace(0, 1, 21)\n", 655 | "bins" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": { 662 | "collapsed": true 663 | }, 664 | "outputs": [], 665 | "source": [ 666 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 667 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 668 | "pyplot.xlabel('p-value')\n", 669 | "pyplot.ylabel('count')\n", 670 | "None" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "The distribution of p-values is uniform from 0 to 1, so it falls below 0.05 about 5% of the time.\n", 678 | "\n", 679 | "**Exercise:** If the threshold for statistical significance is 5%, the probability of a false positive is 5%. You might hope that things would get better with larger sample sizes, but they don't. Run this experiment again with a larger sample size, and see for yourself." 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "### Part four\n", 687 | "\n", 688 | "In the previous section, we computed the false positive rate, which is the probability of seeing a \"statistically significant\" result even when there is no actual difference between the groups.\n", 689 | "\n", 690 | "Now let's ask the complementary question: if there really is a difference between groups, what is the chance of seeing a \"statistically significant\" result?\n", 691 | "\n", 692 | "The answer to this question is called the \"power\" of the test. It depends on the sample size (unlike the false positive rate), and it also depends on how big the actual difference is.\n", 693 | "\n", 694 | "We can estimate the power of a test by running simulations similar to the ones in the previous section. 
Here's a version of `run_a_test` that takes the actual difference between groups as a parameter:" 695 | ] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": null, 700 | "metadata": { 701 | "collapsed": true 702 | }, 703 | "outputs": [], 704 | "source": [ 705 | "def run_a_test2(actual_diff, sample_size=100):\n", 706 | " \"\"\"Generate random data and run a hypothesis test on it.\n", 707 | "\n", 708 | " actual_diff: The actual difference between groups.\n", 709 | " sample_size: integer\n", 710 | "\n", 711 | " returns: p-value\n", 712 | " \"\"\"\n", 713 | " group1 = numpy.random.normal(100, 15, \n", 714 | " size=sample_size)\n", 715 | " group2 = numpy.random.normal(100 + actual_diff, 15, \n", 716 | " size=sample_size)\n", 717 | " data = (group1, group2)\n", 718 | " ht = DiffMeansPermute(data)\n", 719 | " p_value = ht.PValue(iters=200)\n", 720 | " return p_value" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "Now let's run it 100 times with an actual difference of 5:" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": null, 733 | "metadata": { 734 | "collapsed": true 735 | }, 736 | "outputs": [], 737 | "source": [ 738 | "p_values = numpy.array([run_a_test2(5) for i in range(100)])\n", 739 | "sum(p_values < 0.05)" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "With sample size 100 and an actual difference of 5, the power of the test is approximately 65%. 
That means if we ran this hypothetical experiment 100 times, we'd expect a statistically significant result about 65 times.\n", 747 | "\n", 748 | "That's pretty good, but it also means we would NOT get a statistically significant result about 35 times, which is a lot.\n", 749 | "\n", 750 | "Again, let's look at the distribution of p-values:" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": { 757 | "collapsed": true 758 | }, 759 | "outputs": [], 760 | "source": [ 761 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 762 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 763 | "pyplot.xlabel('p-value')\n", 764 | "pyplot.ylabel('count')\n", 765 | "None" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "Here's the point of this example: if you get a negative result (no statistical significance), that is not always strong evidence that there is no difference between the groups. It is also possible that the power of the test was too low; that is, that it was unlikely to produce a positive result, even if there is a difference between the groups.\n", 773 | "\n", 774 | "**Exercise:** Assuming that the actual difference between the groups is 5, what sample size is needed to get the power of the test up to 80%? And if the actual difference is only 2, what sample size do we need to reach 80%?" 
775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": { 781 | "collapsed": true 782 | }, 783 | "outputs": [], 784 | "source": [] 785 | } 786 | ], 787 | "metadata": { 788 | "kernelspec": { 789 | "display_name": "Python 3", 790 | "language": "python", 791 | "name": "python3" 792 | }, 793 | "language_info": { 794 | "codemirror_mode": { 795 | "name": "ipython", 796 | "version": 3 797 | }, 798 | "file_extension": ".py", 799 | "mimetype": "text/x-python", 800 | "name": "python", 801 | "nbconvert_exporter": "python", 802 | "pygments_lexer": "ipython3", 803 | "version": "3.6.1" 804 | } 805 | }, 806 | "nbformat": 4, 807 | "nbformat_minor": 1 808 | } 809 | -------------------------------------------------------------------------------- /hypothesis.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2010 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function, division 9 | 10 | import nsfg 11 | import nsfg2 12 | import first 13 | 14 | import thinkstats2 15 | import thinkplot 16 | 17 | import copy 18 | import random 19 | import numpy as np 20 | import matplotlib.pyplot as pyplot 21 | 22 | 23 | class CoinTest(thinkstats2.HypothesisTest): 24 | """Tests the hypothesis that a coin is fair.""" 25 | 26 | def TestStatistic(self, data): 27 | """Computes the test statistic. 28 | 29 | data: data in whatever form is relevant 30 | """ 31 | heads, tails = data 32 | test_stat = abs(heads - tails) 33 | return test_stat 34 | 35 | def RunModel(self): 36 | """Run the model of the null hypothesis. 
37 | 38 | returns: simulated data 39 | """ 40 | heads, tails = self.data 41 | n = heads + tails 42 | sample = [random.choice('HT') for _ in range(n)] 43 | hist = thinkstats2.Hist(sample) 44 | data = hist['H'], hist['T'] 45 | return data 46 | 47 | 48 | class DiffMeansPermute(thinkstats2.HypothesisTest): 49 | """Tests a difference in means by permutation.""" 50 | 51 | def TestStatistic(self, data): 52 | """Computes the test statistic. 53 | 54 | data: data in whatever form is relevant 55 | """ 56 | group1, group2 = data 57 | test_stat = abs(group1.mean() - group2.mean()) 58 | return test_stat 59 | 60 | def MakeModel(self): 61 | """Build a model of the null hypothesis. 62 | """ 63 | group1, group2 = self.data 64 | self.n, self.m = len(group1), len(group2) 65 | self.pool = np.hstack((group1, group2)) 66 | 67 | def RunModel(self): 68 | """Run the model of the null hypothesis. 69 | 70 | returns: simulated data 71 | """ 72 | np.random.shuffle(self.pool) 73 | data = self.pool[:self.n], self.pool[self.n:] 74 | return data 75 | 76 | 77 | class DiffMeansOneSided(DiffMeansPermute): 78 | """Tests a one-sided difference in means by permutation.""" 79 | 80 | def TestStatistic(self, data): 81 | """Computes the test statistic. 82 | 83 | data: data in whatever form is relevant 84 | """ 85 | group1, group2 = data 86 | test_stat = group1.mean() - group2.mean() 87 | return test_stat 88 | 89 | 90 | class DiffStdPermute(DiffMeansPermute): 91 | """Tests a one-sided difference in standard deviation by permutation.""" 92 | 93 | def TestStatistic(self, data): 94 | """Computes the test statistic. 95 | 96 | data: data in whatever form is relevant 97 | """ 98 | group1, group2 = data 99 | test_stat = group1.std() - group2.std() 100 | return test_stat 101 | 102 | 103 | class CorrelationPermute(thinkstats2.HypothesisTest): 104 | """Tests correlations by permutation.""" 105 | 106 | def TestStatistic(self, data): 107 | """Computes the test statistic. 
108 | 109 | data: tuple of xs and ys 110 | """ 111 | xs, ys = data 112 | test_stat = abs(thinkstats2.Corr(xs, ys)) 113 | return test_stat 114 | 115 | def RunModel(self): 116 | """Run the model of the null hypothesis. 117 | 118 | returns: simulated data 119 | """ 120 | xs, ys = self.data 121 | xs = np.random.permutation(xs) 122 | return xs, ys 123 | 124 | 125 | class DiceTest(thinkstats2.HypothesisTest): 126 | """Tests whether a six-sided die is fair.""" 127 | 128 | def TestStatistic(self, data): 129 | """Computes the test statistic. 130 | 131 | data: list of frequencies 132 | """ 133 | observed = data 134 | n = sum(observed) 135 | expected = np.ones(6) * n / 6 136 | test_stat = sum(abs(observed - expected)) 137 | return test_stat 138 | 139 | def RunModel(self): 140 | """Run the model of the null hypothesis. 141 | 142 | returns: simulated data 143 | """ 144 | n = sum(self.data) 145 | values = [1,2,3,4,5,6] 146 | rolls = np.random.choice(values, n, replace=True) 147 | hist = thinkstats2.Hist(rolls) 148 | freqs = hist.Freqs(values) 149 | return freqs 150 | 151 | 152 | class DiceChiTest(DiceTest): 153 | """Tests a six-sided die using a chi-squared statistic.""" 154 | 155 | def TestStatistic(self, data): 156 | """Computes the test statistic. 157 | 158 | data: list of frequencies 159 | """ 160 | observed = data 161 | n = sum(observed) 162 | expected = np.ones(6) * n / 6 163 | test_stat = sum((observed - expected)**2 / expected) 164 | return test_stat 165 | 166 | 167 | class PregLengthTest(thinkstats2.HypothesisTest): 168 | """Tests difference in pregnancy length using a chi-squared statistic.""" 169 | 170 | def TestStatistic(self, data): 171 | """Computes the test statistic. 172 | 173 | data: pair of lists of pregnancy lengths 174 | """ 175 | firsts, others = data 176 | stat = self.ChiSquared(firsts) + self.ChiSquared(others) 177 | return stat 178 | 179 | def ChiSquared(self, lengths): 180 | """Computes the chi-squared statistic. 
181 | 182 | lengths: sequence of lengths 183 | 184 | returns: float 185 | """ 186 | hist = thinkstats2.Hist(lengths) 187 | observed = np.array(hist.Freqs(self.values)) 188 | expected = self.expected_probs * len(lengths) 189 | stat = sum((observed - expected)**2 / expected) 190 | return stat 191 | 192 | def MakeModel(self): 193 | """Build a model of the null hypothesis. 194 | """ 195 | firsts, others = self.data 196 | self.n = len(firsts) 197 | self.pool = np.hstack((firsts, others)) 198 | 199 | pmf = thinkstats2.Pmf(self.pool) 200 | self.values = range(35, 44) 201 | self.expected_probs = np.array(pmf.Probs(self.values)) 202 | 203 | def RunModel(self): 204 | """Run the model of the null hypothesis. 205 | 206 | returns: simulated data 207 | """ 208 | np.random.shuffle(self.pool) 209 | data = self.pool[:self.n], self.pool[self.n:] 210 | return data 211 | 212 | 213 | def RunDiceTest(): 214 | """Tests whether a die is fair. 215 | """ 216 | data = [8, 9, 19, 5, 8, 11] 217 | dt = DiceTest(data) 218 | print('dice test', dt.PValue(iters=10000)) 219 | dt = DiceChiTest(data) 220 | print('dice chi test', dt.PValue(iters=10000)) 221 | 222 | 223 | def FalseNegRate(data, num_runs=1000): 224 | """Computes the chance of a false negative based on resampling. 225 | 226 | data: pair of sequences 227 | num_runs: how many experiments to simulate 228 | 229 | returns: float false negative rate 230 | """ 231 | group1, group2 = data 232 | count = 0 233 | 234 | for i in range(num_runs): 235 | sample1 = thinkstats2.Resample(group1) 236 | sample2 = thinkstats2.Resample(group2) 237 | ht = DiffMeansPermute((sample1, sample2)) 238 | p_value = ht.PValue(iters=101) 239 | if p_value > 0.05: 240 | count += 1 241 | 242 | return count / num_runs 243 | 244 | 245 | def PrintTest(p_value, ht): 246 | """Prints results from a hypothesis test. 
247 | 248 | p_value: float 249 | ht: HypothesisTest 250 | """ 251 | print('p-value =', p_value) 252 | print('actual =', ht.actual) 253 | print('ts max =', ht.MaxTestStat()) 254 | 255 | 256 | def RunTests(data, iters=1000): 257 | """Runs several tests on the given data. 258 | 259 | data: pair of sequences 260 | iters: number of iterations to run 261 | """ 262 | 263 | # test the difference in means 264 | ht = DiffMeansPermute(data) 265 | p_value = ht.PValue(iters=iters) 266 | print('\nmeans permute two-sided') 267 | PrintTest(p_value, ht) 268 | 269 | ht.PlotCdf() 270 | thinkplot.Save(root='hypothesis1', 271 | title='Permutation test', 272 | xlabel='difference in means (weeks)', 273 | ylabel='CDF', 274 | legend=False) 275 | 276 | # test the difference in means one-sided 277 | ht = DiffMeansOneSided(data) 278 | p_value = ht.PValue(iters=iters) 279 | print('\nmeans permute one-sided') 280 | PrintTest(p_value, ht) 281 | 282 | # test the difference in std 283 | ht = DiffStdPermute(data) 284 | p_value = ht.PValue(iters=iters) 285 | print('\nstd permute one-sided') 286 | PrintTest(p_value, ht) 287 | 288 | 289 | def ReplicateTests(): 290 | """Replicates tests with the new NSFG data.""" 291 | 292 | live, firsts, others = nsfg2.MakeFrames() 293 | 294 | # compare pregnancy lengths 295 | print('\nprglngth2') 296 | data = firsts.prglngth.values, others.prglngth.values 297 | ht = DiffMeansPermute(data) 298 | p_value = ht.PValue(iters=1000) 299 | print('means permute two-sided') 300 | PrintTest(p_value, ht) 301 | 302 | print('\nbirth weight 2') 303 | data = (firsts.totalwgt_lb.dropna().values, 304 | others.totalwgt_lb.dropna().values) 305 | ht = DiffMeansPermute(data) 306 | p_value = ht.PValue(iters=1000) 307 | print('means permute two-sided') 308 | PrintTest(p_value, ht) 309 | 310 | # test correlation 311 | live2 = live.dropna(subset=['agepreg', 'totalwgt_lb']) 312 | data = live2.agepreg.values, live2.totalwgt_lb.values 313 | ht = CorrelationPermute(data) 314 | p_value = 
ht.PValue() 315 | print('\nage weight correlation 2') 316 | PrintTest(p_value, ht) 317 | 318 | # compare pregnancy lengths (chi-squared) 319 | data = firsts.prglngth.values, others.prglngth.values 320 | ht = PregLengthTest(data) 321 | p_value = ht.PValue() 322 | print('\npregnancy length chi-squared 2') 323 | PrintTest(p_value, ht) 324 | 325 | 326 | def main(): 327 | thinkstats2.RandomSeed(17) 328 | 329 | # run the coin test 330 | ct = CoinTest((140, 110)) 331 | pvalue = ct.PValue() 332 | print('coin test p-value', pvalue) 333 | 334 | # compare pregnancy lengths 335 | print('\nprglngth') 336 | live, firsts, others = first.MakeFrames() 337 | data = firsts.prglngth.values, others.prglngth.values 338 | RunTests(data) 339 | 340 | # compare birth weights 341 | print('\nbirth weight') 342 | data = (firsts.totalwgt_lb.dropna().values, 343 | others.totalwgt_lb.dropna().values) 344 | ht = DiffMeansPermute(data) 345 | p_value = ht.PValue(iters=1000) 346 | print('means permute two-sided') 347 | PrintTest(p_value, ht) 348 | 349 | # test correlation 350 | live2 = live.dropna(subset=['agepreg', 'totalwgt_lb']) 351 | data = live2.agepreg.values, live2.totalwgt_lb.values 352 | ht = CorrelationPermute(data) 353 | p_value = ht.PValue() 354 | print('\nage weight correlation') 355 | print('n=', len(live2)) 356 | PrintTest(p_value, ht) 357 | 358 | # run the dice test 359 | RunDiceTest() 360 | 361 | # compare pregnancy lengths (chi-squared) 362 | data = firsts.prglngth.values, others.prglngth.values 363 | ht = PregLengthTest(data) 364 | p_value = ht.PValue() 365 | print('\npregnancy length chi-squared') 366 | PrintTest(p_value, ht) 367 | 368 | # compute the false negative rate for difference in pregnancy length 369 | data = firsts.prglngth.values, others.prglngth.values 370 | neg_rate = FalseNegRate(data) 371 | print('false neg rate', neg_rate) 372 | 373 | # run the tests with new nsfg data 374 | ReplicateTests() 375 | 376 | 377 | if __name__ == "__main__": 378 | main() 379 | 
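
As an aside, the permutation test that `DiffMeansPermute` implements above can be sketched without the `thinkstats2` machinery. The following standalone version (the function name `permutation_p_value` and the synthetic data are illustrative, not part of the original file) mirrors `TestStatistic`, `MakeModel`, and `RunModel` in a dozen lines:

```python
import numpy as np

def permutation_p_value(group1, group2, iters=1000, seed=17):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of shuffled pools whose difference in means
    is at least as large as the observed difference.
    """
    rng = np.random.default_rng(seed)
    # observed test statistic (as in TestStatistic)
    actual = abs(group1.mean() - group2.mean())
    # pooled sample under the null hypothesis (as in MakeModel)
    pool = np.hstack((group1, group2))
    n = len(group1)
    count = 0
    for _ in range(iters):
        # shuffle and re-split (as in RunModel)
        rng.shuffle(pool)
        test_stat = abs(pool[:n].mean() - pool[n:].mean())
        count += test_stat >= actual
    return count / iters

# two groups drawn from the same distribution, so any apparent
# difference is due to chance
rng = np.random.default_rng(17)
group1 = rng.normal(100, 15, size=100)
group2 = rng.normal(100, 15, size=100)
print(permutation_p_value(group1, group2))
```

With two samples drawn from the same distribution, the returned p-value is uniformly distributed between 0 and 1, which is the point the notebooks above make about false positives.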
-------------------------------------------------------------------------------- /hypothesis_soln.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Hypothesis Testing\n", 8 | "==================\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "from __future__ import print_function, division\n", 22 | "\n", 23 | "import numpy\n", 24 | "import scipy.stats\n", 25 | "\n", 26 | "import matplotlib.pyplot as pyplot\n", 27 | "\n", 28 | "import first\n", 29 | "\n", 30 | "# some nicer colors from http://colorbrewer2.org/\n", 31 | "COLOR1 = '#7fc97f'\n", 32 | "COLOR2 = '#beaed4'\n", 33 | "COLOR3 = '#fdc086'\n", 34 | "COLOR4 = '#ffff99'\n", 35 | "COLOR5 = '#386cb0'\n", 36 | "\n", 37 | "%matplotlib inline" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "## Part One" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Suppose you observe an apparent difference between two groups and you want to check whether it might be due to chance.\n", 52 | "\n", 53 | "As an example, we'll look at differences between first babies and others. The `first` module provides code to read data from the National Survey of Family Growth (NSFG)." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "live, firsts, others = first.MakeFrames()" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "We'll look at a couple of variables, including pregnancy length and birth weight. 
The effect size we'll consider is the difference in the means.\n", 70 | "\n", 71 | "Other examples might include a correlation between variables or a coefficient in a linear regression. The number that quantifies the size of the effect is called the \"test statistic\"." 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "def TestStatistic(data):\n", 81 | " group1, group2 = data\n", 82 | " test_stat = abs(group1.mean() - group2.mean())\n", 83 | " return test_stat" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "For the first example, I extract the pregnancy length for first babies and others. The results are pandas Series objects." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "group1 = firsts.prglngth\n", 100 | "group2 = others.prglngth" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "The actual difference in the means is 0.078 weeks, which is only 13 hours." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "actual = TestStatistic((group1, group2))\n", 117 | "actual" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "The null hypothesis is that there is no difference between the groups. We can model that by forming a pooled sample that includes first babies and others." 
125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "n, m = len(group1), len(group2)\n", 134 | "pool = numpy.hstack((group1, group2))" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Then we can simulate the null hypothesis by shuffling the pool and dividing it into two groups, using the same sizes as the actual sample." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "def RunModel():\n", 151 | " numpy.random.shuffle(pool)\n", 152 | " data = pool[:n], pool[n:]\n", 153 | " return data" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "The result of running the model is two NumPy arrays with the shuffled pregnancy lengths:" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "RunModel()" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "Then we compute the same test statistic using the simulated data:" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "TestStatistic(RunModel())" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "If we run the model 1000 times and compute the test statistic, we can see how much the test statistic varies under the null hypothesis." 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "test_stats = numpy.array([TestStatistic(RunModel()) for i in range(1000)])\n", 202 | "test_stats.shape" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "Here's the sampling distribution of the test statistic under the null hypothesis, with the actual difference in means indicated by a gray line." 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "pyplot.axvline(actual, linewidth=3, color='0.8')\n", 219 | "pyplot.hist(test_stats, color=COLOR5)\n", 220 | "pyplot.xlabel('difference in means')\n", 221 | "pyplot.ylabel('count')\n", 222 | "None" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "The p-value is the probability that the test statistic under the null hypothesis exceeds the actual value." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "pvalue = sum(test_stats >= actual) / len(test_stats)\n", 239 | "pvalue" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "In this case the result is about 15%, which means that even if there is no difference between the groups, it is plausible that we could see a sample difference as big as 0.078 weeks.\n", 247 | "\n", 248 | "We conclude that the apparent effect might be due to chance, so we are not confident that it would appear in the general population, or in another sample from the same population.\n", 249 | "\n", 250 | "STOP HERE\n", 251 | "---------" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Part Two\n", 259 | "========\n", 260 | "\n", 261 | "We can take the pieces from the previous section and organize them in a class that represents the structure of a hypothesis test." 
262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "class HypothesisTest(object):\n", 271 | " \"\"\"Represents a hypothesis test.\"\"\"\n", 272 | "\n", 273 | " def __init__(self, data):\n", 274 | " \"\"\"Initializes.\n", 275 | "\n", 276 | " data: data in whatever form is relevant\n", 277 | " \"\"\"\n", 278 | " self.data = data\n", 279 | " self.MakeModel()\n", 280 | " self.actual = self.TestStatistic(data)\n", 281 | " self.test_stats = None\n", 282 | "\n", 283 | " def PValue(self, iters=1000):\n", 284 | " \"\"\"Computes the distribution of the test statistic and p-value.\n", 285 | "\n", 286 | " iters: number of iterations\n", 287 | "\n", 288 | " returns: float p-value\n", 289 | " \"\"\"\n", 290 | " self.test_stats = numpy.array([self.TestStatistic(self.RunModel()) \n", 291 | " for _ in range(iters)])\n", 292 | "\n", 293 | " count = sum(self.test_stats >= self.actual)\n", 294 | " return count / iters\n", 295 | "\n", 296 | " def MaxTestStat(self):\n", 297 | " \"\"\"Returns the largest test statistic seen during simulations.\n", 298 | " \"\"\"\n", 299 | " return max(self.test_stats)\n", 300 | "\n", 301 | " def PlotHist(self, label=None):\n", 302 | " \"\"\"Draws a histogram of the test statistics, with a vertical line at the observed test stat.\n", 303 | " \"\"\"\n", 304 | " pyplot.hist(self.test_stats, color=COLOR4)\n", 305 | " pyplot.axvline(self.actual, linewidth=3, color='0.8')\n", 306 | " pyplot.xlabel('test statistic')\n", 307 | " pyplot.ylabel('count')\n", 308 | "\n", 309 | " def TestStatistic(self, data):\n", 310 | " \"\"\"Computes the test statistic.\n", 311 | "\n", 312 | " data: data in whatever form is relevant \n", 313 | " \"\"\"\n", 314 | " raise UnimplementedMethodException()\n", 315 | "\n", 316 | " def MakeModel(self):\n", 317 | " \"\"\"Build a model of the null hypothesis.\n", 318 | " \"\"\"\n", 319 | " pass\n", 320 | "\n", 321 | " def RunModel(self):\n", 322 | " \"\"\"Run the model of the 
null hypothesis.\n", 323 | "\n", 324 | " returns: simulated data\n", 325 | " \"\"\"\n", 326 | " raise UnimplementedMethodException()\n" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "`HypothesisTest` is an abstract parent class that encodes the template. Child classes fill in the missing methods. For example, here's the test from the previous section." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "class DiffMeansPermute(HypothesisTest):\n", 343 | " \"\"\"Tests a difference in means by permutation.\"\"\"\n", 344 | "\n", 345 | " def TestStatistic(self, data):\n", 346 | " \"\"\"Computes the test statistic.\n", 347 | "\n", 348 | " data: data in whatever form is relevant \n", 349 | " \"\"\"\n", 350 | " group1, group2 = data\n", 351 | " test_stat = abs(group1.mean() - group2.mean())\n", 352 | " return test_stat\n", 353 | "\n", 354 | " def MakeModel(self):\n", 355 | " \"\"\"Build a model of the null hypothesis.\n", 356 | " \"\"\"\n", 357 | " group1, group2 = self.data\n", 358 | " self.n, self.m = len(group1), len(group2)\n", 359 | " self.pool = numpy.hstack((group1, group2))\n", 360 | "\n", 361 | " def RunModel(self):\n", 362 | " \"\"\"Run the model of the null hypothesis.\n", 363 | "\n", 364 | " returns: simulated data\n", 365 | " \"\"\"\n", 366 | " numpy.random.shuffle(self.pool)\n", 367 | " data = self.pool[:self.n], self.pool[self.n:]\n", 368 | " return data" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "Now we can run the test by instantiating a DiffMeansPermute object:" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "data = (firsts.prglngth, others.prglngth)\n", 385 | "ht = DiffMeansPermute(data)\n", 386 | "p_value = ht.PValue(iters=1000)\n", 387 | 
"print('\\nmeans permute pregnancy length')\n", 388 | "print('p-value =', p_value)\n", 389 | "print('actual =', ht.actual)\n", 390 | "print('ts max =', ht.MaxTestStat())" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "And we can plot the sampling distribution of the test statistic under the null hypothesis." 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "ht.PlotHist()" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "### Difference in standard deviation\n", 414 | "\n", 415 | "**Exercize 1**: Write a class named `DiffStdPermute` that extends `DiffMeansPermute` and overrides `TestStatistic` to compute the difference in standard deviations. Is the difference in standard deviations statistically significant?" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "# Solution goes here\n", 425 | "\n", 426 | "class DiffStdPermute(DiffMeansPermute):\n", 427 | " \"\"\"Tests a difference in means by permutation.\"\"\"\n", 428 | "\n", 429 | " def TestStatistic(self, data):\n", 430 | " \"\"\"Computes the test statistic.\n", 431 | "\n", 432 | " data: data in whatever form is relevant \n", 433 | " \"\"\"\n", 434 | " group1, group2 = data\n", 435 | " test_stat = abs(group1.std() - group2.std())\n", 436 | " return test_stat" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "Here's the code to test your solution to the previous exercise." 
444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "data = (firsts.prglngth, others.prglngth)\n", 453 | "ht = DiffStdPermute(data)\n", 454 | "p_value = ht.PValue(iters=1000)\n", 455 | "print('\\nstd permute pregnancy length')\n", 456 | "print('p-value =', p_value)\n", 457 | "print('actual =', ht.actual)\n", 458 | "print('ts max =', ht.MaxTestStat())" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "### Difference in birth weights\n", 466 | "\n", 467 | "Now let's run DiffMeansPermute again to see if there is a difference in birth weight between first babies and others." 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "data = (firsts.totalwgt_lb.dropna(), others.totalwgt_lb.dropna())\n", 477 | "ht = DiffMeansPermute(data)\n", 478 | "p_value = ht.PValue(iters=1000)\n", 479 | "print('\\nmeans permute birthweight')\n", 480 | "print('p-value =', p_value)\n", 481 | "print('actual =', ht.actual)\n", 482 | "print('ts max =', ht.MaxTestStat())" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "In this case, after 1000 attempts, we never see a sample difference as big as the observed difference, so we conclude that the apparent effect is unlikely under the null hypothesis. Under normal circumstances, we can also make the inference that the apparent effect is unlikely to be caused by random sampling.\n", 490 | "\n", 491 | "One final note: in this case I would report that the p-value is less than 1/1000 or less than 0.001. I would not report p=0, because the apparent effect is not impossible under the null hypothesis; just unlikely." 
492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "### Part Three\n", 499 | "\n", 500 | "In this section, we'll explore the dangers of p-hacking by running multiple tests until we find one that's statistically significant.\n", 501 | "\n", 502 | "Suppose we want to compare IQs for two groups of people. And suppose that, in fact, the two groups are statistically identical; that is, their IQs are drawn from a normal distribution with mean 100 and standard deviation 15.\n", 503 | "\n", 504 | "I'll use `numpy.random.normal` to generate fake data I might get from running such an experiment:" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "group1 = numpy.random.normal(100, 15, size=100)\n", 514 | "group2 = numpy.random.normal(100, 15, size=100)" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "We expect the mean in both groups to be near 100, but just by random chance, it might be higher or lower." 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": {}, 528 | "outputs": [], 529 | "source": [ 530 | "group1.mean(), group2.mean()" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "We can use DiffMeansPermute to compute the p-value for this fake data, which is the probability that we would see a difference between the groups as big as what we saw, just by chance." 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "metadata": {}, 544 | "outputs": [], 545 | "source": [ 546 | "data = (group1, group2)\n", 547 | "ht = DiffMeansPermute(data)\n", 548 | "p_value = ht.PValue(iters=1000)\n", 549 | "p_value" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "Now let's check the p-value. 
If it's less than 0.05, the result is statistically significant, and we can publish it. Otherwise, we can try again." 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": null, 562 | "metadata": {}, 563 | "outputs": [], 564 | "source": [ 565 | "if p_value < 0.05:\n", 566 | " print('Congratulations! Publish it!')\n", 567 | "else:\n", 568 | " print('Too bad! Please try again.')" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "You can probably see where this is going. If we play this game over and over (or if many researchers play it in parallel), the false positive rate can be as high as 100%.\n", 576 | "\n", 577 | "To see this more clearly, let's simulate 100 researchers playing this game. I'll take the code we have so far and wrap it in a function:" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": { 584 | "collapsed": true 585 | }, 586 | "outputs": [], 587 | "source": [ 588 | "def run_a_test(sample_size=100):\n", 589 | " \"\"\"Generate random data and run a hypothesis test on it.\n", 590 | "\n", 591 | " sample_size: integer\n", 592 | "\n", 593 | " returns: p-value\n", 594 | " \"\"\"\n", 595 | " group1 = numpy.random.normal(100, 15, size=sample_size)\n", 596 | " group2 = numpy.random.normal(100, 15, size=sample_size)\n", 597 | " data = (group1, group2)\n", 598 | " ht = DiffMeansPermute(data)\n", 599 | " p_value = ht.PValue(iters=200)\n", 600 | " return p_value" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "Now let's run that function 100 times and save the p-values." 
608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": null, 613 | "metadata": {}, 614 | "outputs": [], 615 | "source": [ 616 | "num_experiments = 100\n", 617 | "p_values = numpy.array([run_a_test() for i in range(num_experiments)])\n", 618 | "sum(p_values < 0.05)" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "metadata": {}, 624 | "source": [ 625 | "On average, we expect to get a false positive about 5 times out of 100. To see why, let's plot the histogram of the p-values we got." 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "execution_count": null, 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "bins = numpy.linspace(0, 1, 21)\n", 635 | "bins" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 645 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 646 | "pyplot.xlabel('p-value')\n", 647 | "pyplot.ylabel('count')\n", 648 | "None" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "The distribution of p-values is uniform from 0 to 1. So it falls below 5% about 5% of the time.\n", 656 | "\n", 657 | "**Exercise:** If the threshold for statistical significance is 5%, the probability of a false positive is 5%. You might hope that things would get better with larger sample sizes, but they don't. Run this experiment again with a larger sample size, and see for yourself."
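    ,
    "\n",
    "\n",
    "One way to set that up (a sketch, reusing `run_a_test` from above) is to loop over a few sample sizes and count the false positives at each:\n",
    "\n",
    "```python\n",
    "for sample_size in [100, 400, 1600]:\n",
    "    p_values = numpy.array([run_a_test(sample_size) for i in range(100)])\n",
    "    print(sample_size, sum(p_values < 0.05))\n",
    "```"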
658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": {}, 663 | "source": [ 664 | "### Part Four\n", 665 | "\n", 666 | "In the previous section, we computed the false positive rate, which is the probability of seeing a \"statistically significant\" result, even if there is no statistical difference between groups.\n", 667 | "\n", 668 | "Now let's ask the complementary question: if there really is a difference between groups, what is the chance of seeing a \"statistically significant\" result?\n", 669 | "\n", 670 | "The answer to this question is called the \"power\" of the test. It depends on the sample size (unlike the false positive rate), and it also depends on how big the actual difference is.\n", 671 | "\n", 672 | "We can estimate the power of a test by running simulations similar to the ones in the previous section. Here's a version of `run_a_test` that takes the actual difference between groups as a parameter:" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": { 679 | "collapsed": true 680 | }, 681 | "outputs": [], 682 | "source": [ 683 | "def run_a_test2(actual_diff, sample_size=100):\n", 684 | "    \"\"\"Generate random data and run a hypothesis test on it.\n", 685 | "\n", 686 | "    actual_diff: The actual difference between groups.\n", 687 | "    sample_size: integer\n", 688 | "\n", 689 | "    returns: p-value\n", 690 | "    \"\"\"\n", 691 | "    group1 = numpy.random.normal(100, 15, \n", 692 | "                                 size=sample_size)\n", 693 | "    group2 = numpy.random.normal(100 + actual_diff, 15, \n", 694 | "                                 size=sample_size)\n", 695 | "    data = (group1, group2)\n", 696 | "    ht = DiffMeansPermute(data)\n", 697 | "    p_value = ht.PValue(iters=200)\n", 698 | "    return p_value" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "Now let's run it 100 times with an actual difference of 5:" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 |
"metadata": {}, 712 | "outputs": [], 713 | "source": [ 714 | "p_values = numpy.array([run_a_test2(5) for i in range(100)])\n", 715 | "sum(p_values < 0.05)" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "With sample size 100 and an actual difference of 5, the power of the test is approximately 65%. That means if we ran this hypothetical experiment 100 times, we'd expect a statistically significant result about 65 times.\n", 723 | "\n", 724 | "That's pretty good, but it also means we would NOT get a statistically significant result about 35 times, which is a lot.\n", 725 | "\n", 726 | "Again, let's look at the distribution of p-values:" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": null, 732 | "metadata": {}, 733 | "outputs": [], 734 | "source": [ 735 | "pyplot.hist(p_values, bins, color=COLOR5)\n", 736 | "pyplot.axvline(0.05, linewidth=3, color='0.8')\n", 737 | "pyplot.xlabel('p-value')\n", 738 | "pyplot.ylabel('count')\n", 739 | "None" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "Here's the point of this example: if you get a negative result (no statistical significance), that is not always strong evidence that there is no difference between the groups. It is also possible that the power of the test was too low; that is, that it was unlikely to produce a positive result, even if there is a difference between the groups.\n", 747 | "\n", 748 | "**Exercise:** Assuming that the actual difference between the groups is 5, what sample size is needed to get the power of the test up to 80%? What if the actual difference is 2, what sample size do we need to get to 80%?" 
749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "metadata": { 755 | "collapsed": true 756 | }, 757 | "outputs": [], 758 | "source": [] 759 | } 760 | ], 761 | "metadata": { 762 | "kernelspec": { 763 | "display_name": "Python 3", 764 | "language": "python", 765 | "name": "python3" 766 | }, 767 | "language_info": { 768 | "codemirror_mode": { 769 | "name": "ipython", 770 | "version": 3 771 | }, 772 | "file_extension": ".py", 773 | "mimetype": "text/x-python", 774 | "name": "python", 775 | "nbconvert_exporter": "python", 776 | "pygments_lexer": "ipython3", 777 | "version": "3.6.1" 778 | } 779 | }, 780 | "nbformat": 4, 781 | "nbformat_minor": 1 782 | } 783 | -------------------------------------------------------------------------------- /hypothesis_testing.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/hypothesis_testing.pdf -------------------------------------------------------------------------------- /hypothesis_testing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/hypothesis_testing.png -------------------------------------------------------------------------------- /hypothesis_testing_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/hypothesis_testing_small.png -------------------------------------------------------------------------------- /look_and_say.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 47, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%config 
InteractiveShell.ast_node_interactivity='last_expr_or_assign'\n", 10 | "\n", 11 | "import numpy as np" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 147, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/plain": [ 22 | "array([1, 1])" 23 | ] 24 | }, 25 | "execution_count": 147, 26 | "metadata": {}, 27 | "output_type": "execute_result" 28 | } 29 | ], 30 | "source": [ 31 | "a = np.array([1,1])" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 148, 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/plain": [ 42 | "array([1, 0, 1])" 43 | ] 44 | }, 45 | "execution_count": 148, 46 | "metadata": {}, 47 | "output_type": "execute_result" 48 | } 49 | ], 50 | "source": [ 51 | "diff = np.ediff1d(a, 1, 1)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 149, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "data": { 61 | "text/plain": [ 62 | "array([0, 2])" 63 | ] 64 | }, 65 | "execution_count": 149, 66 | "metadata": {}, 67 | "output_type": "execute_result" 68 | } 69 | ], 70 | "source": [ 71 | "index = np.nonzero(diff)[0]" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 150, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "array([2])" 83 | ] 84 | }, 85 | "execution_count": 150, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "counts = np.ediff1d(index)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 151, 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "array([1])" 103 | ] 104 | }, 105 | "execution_count": 151, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "vals = a[index[:-1]]" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 152, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "data": { 121 | 
"text/plain": [ 122 | "array([2, 1])" 123 | ] 124 | }, 125 | "execution_count": 152, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "# best way to interleave arrays\n", 132 | "# https://stackoverflow.com/questions/5347065/interweaving-two-numpy-arrays\n", 133 | "b = np.empty((vals.size + counts.size,), dtype=vals.dtype)\n", 134 | "b[0::2] = counts\n", 135 | "b[1::2] = vals\n", 136 | "b" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 153, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "def look_and_say(a):\n", 146 | " diff = np.ediff1d(a, 1, 1)\n", 147 | " index = np.nonzero(diff)[0]\n", 148 | " counts = np.ediff1d(index)\n", 149 | " vals = a[index[:-1]]\n", 150 | " c = np.empty((vals.size + counts.size,), dtype=vals.dtype)\n", 151 | " c[0::2] = counts\n", 152 | " c[1::2] = vals\n", 153 | " return c" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 154, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "array([1, 2, 1, 1])" 165 | ] 166 | }, 167 | "execution_count": 154, 168 | "metadata": {}, 169 | "output_type": "execute_result" 170 | } 171 | ], 172 | "source": [ 173 | "look_and_say(b)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 156, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "[1 1]\n", 186 | "[2 1]\n", 187 | "[1 2 1 1]\n", 188 | "[1 1 1 2 2 1]\n", 189 | "[3 1 2 2 1 1]\n", 190 | "[1 3 1 1 2 2 2 1]\n", 191 | "[1 1 1 3 2 1 3 2 1 1]\n", 192 | "[3 1 1 3 1 2 1 1 1 3 1 2 2 1]\n", 193 | "[1 3 2 1 1 3 1 1 1 2 3 1 1 3 1 1 2 2 1 1]\n", 194 | "[1 1 1 3 1 2 2 1 1 3 3 1 1 2 1 3 2 1 1 3 2 1 2 2 2 1]\n", 195 | "[3 1 1 3 1 1 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 3 2 1 1]\n", 196 | "[1 3 2 1 1 3 2 1 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1\n", 197 | " 2 2 1 1 3 1 2 2 
1]\n", 198 | "[1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 1 2 3 1 1 2 1 1 1 3 1 1 2 2 2 1 1 2 1 3 2\n", 199 | " 1 1 3 2 1 3 2 2 1 1 3 3 1 2 2 2 1 1 3 1 1 2 2 1 1]\n", 200 | "[3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3 2 1 3\n", 201 | " 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 1 1 3 2 2 1 1 3 2 1\n", 202 | " 2 2 2 1]\n", 203 | "[1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 1\n", 204 | " 2 1 3 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1\n", 205 | " 1 3 3 2 1 1 1 2 1 3 2 1 1 3 2 2 2 1 1 3 1 2 1 1 3 2 1 1]\n", 206 | "[1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 2 1 2 3 2\n", 207 | " 1 1 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 2 1\n", 208 | " 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1\n", 209 | " 2 2 1 1 3 3 2 2 1 1 3 1 1 1 2 2 1 1 3 1 2 2 1]\n", 210 | "[3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1\n", 211 | " 3 1 1 2 2 1 1 1 2 1 3 1 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 2 3 1 1 3 1 1\n", 212 | " 2 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1\n", 213 | " 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 1 1 1 2 1 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3\n", 214 | " 1 1 2 2 2 1 2 3 2 2 2 1 1 3 3 1 2 2 2 1 1 3 1 1 2 2 1 1]\n", 215 | "[1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1 1 2\n", 216 | " 2 2 1 1 2 1 3 2 1 1 3 2 1 2 2 3 1 1 2 1 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2\n", 217 | " 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 2 3 2 1 1 2 1 1 1\n", 218 | " 3 1 2 1 1 1 2 1 3 3 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2\n", 219 | " 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 1 2 3 1 1 2 1 1 1 3 3 1 1 2 1 1 1 3 1\n", 220 | " 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 1 1 1 2 1 3 3 2 2 1 2 3 1 1 3 2 2 1 1 3 2 1\n", 221 | " 2 2 2 1]\n", 222 | "[1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 1 1 1 2 1\n", 223 | " 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 1 
1 1 3 1 2 2 1 1 3 1 2 1 1\n", 224 | " 2 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3\n", 225 | " 1 1 3 3 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 1 1 1 2 1 3 1 2 2 1\n", 226 | " 1 2 3 1 1 3 1 1 1 2 3 1 1 2 1 1 2 3 2 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1\n", 227 | " 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1 1 2 2 2 1 1 2 1 3 2 1 1 3 3 1\n", 228 | " 1 2 1 3 2 1 1 2 3 1 2 3 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1\n", 229 | " 3 1 2 1 1 1 3 1 2 3 1 1 2 1 1 2 3 2 2 1 1 1 2 1 3 2 1 1 3 2 2 2 1 1 3 1 2\n", 230 | " 1 1 3 2 1 1]\n", 231 | "[3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1\n", 232 | " 3 1 1 1 2 3 1 1 2 1 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 1 2 1 3 2 1 1 3 1 2 1 1\n", 233 | " 1 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 2 1 2 2 1 1 1 3 1 2 2 1 1\n", 234 | " 2 1 3 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2\n", 235 | " 1 3 2 1 2 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 2 2 3 1 1\n", 236 | " 2 1 1 1 3 1 1 2 2 2 1 1 2 1 3 2 1 1 3 3 1 1 2 1 3 2 1 1 2 2 1 1 2 1 3 3 2\n", 237 | " 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1 2 1 1 1 3 1 2 1\n", 238 | " 1 1 2 1 3 1 1 1 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 1 1 1 3 1 2 2 1 2 3\n", 239 | " 2 1 1 2 1 1 1 3 1 2 2 1 1 2 1 3 1 1 1 2 1 3 1 2 2 1 1 2 1 3 2 1 1 3 2 1 3\n", 240 | " 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 3 1 1 1 2 3 1 1 3 1 1 1 2 1 3 2 1 1 2 2 1\n", 241 | " 1 2 1 3 2 2 3 1 1 2 1 1 1 3 1 2 2 1 1 3 3 2 2 1 1 3 1 1 1 2 2 1 1 3 1 2 2\n", 242 | " 1]\n", 243 | "[1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 2 3 1 2 3 1 1 2 1 1 1 3 1 1 2\n", 244 | " 2 2 1 1 2 1 3 2 1 1 3 3 1 1 2 1 3 2 1 1 2 3 1 2 3 2 1 1 2 3 1 1 3 1 1 2 2\n", 245 | " 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 1 1 2 3 1 1 3 3 2 2 1 1 2 1 3 2 1 1 3 2 1 3\n", 246 | " 2 2 1 1 3 3 1 2 2 1 1 2 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 1\n", 247 | " 1 2 3 1 1 3 3 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 2 3 2 1 1\n", 248 | " 2 1 1 1 3 1 2 1 1 1 2 1 3 3 2 
2 1 1 2 1 3 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2\n", 249 | " 1 3 2 1 1 3 2 2 1 3 2 1 1 2 3 1 1 3 2 1 3 2 2 1 1 2 1 1 1 3 1 2 2 1 2 3 2\n", 250 | " 1 1 2 1 1 1 3 1 2 2 1 2 2 2 1 1 2 1 1 2 3 2 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1\n", 251 | " 1 3 1 1 1 2 3 1 1 3 3 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 1 2 3 1 1 2 1\n", 252 | " 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 1 2 1 3 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 3 1\n", 253 | " 1 3 1 1 2 2 1 1 1 2 1 3 1 2 2 1 1 2 3 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 3 1 1\n", 254 | " 2 1 1 1 3 1 1 2 2 2 1 1 2 1 1 1 3 1 2 2 1 1 3 1 2 1 1 1 3 2 2 2 1 1 2 1 3\n", 255 | " 2 1 1 3 2 1 3 2 2 1 1 3 3 1 1 2 1 3 2 1 1 3 3 1 1 2 1 1 1 3 1 2 2 1 2 2 2\n", 256 | " 1 1 2 1 1 1 3 2 2 1 3 2 1 1 2 3 1 1 3 1 1 2 2 2 1 2 3 2 2 2 1 1 3 3 1 2 2\n", 257 | " 2 1 1 3 1 1 2 2 1 1]\n" 258 | ] 259 | } 260 | ], 261 | "source": [ 262 | "c = a\n", 263 | "print(c)\n", 264 | "\n", 265 | "for i in range(20):\n", 266 | " c = look_and_say(c)\n", 267 | " print(c)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [] 276 | } 277 | ], 278 | "metadata": { 279 | "kernelspec": { 280 | "display_name": "Python 3", 281 | "language": "python", 282 | "name": "python3" 283 | }, 284 | "language_info": { 285 | "codemirror_mode": { 286 | "name": "ipython", 287 | "version": 3 288 | }, 289 | "file_extension": ".py", 290 | "mimetype": "text/x-python", 291 | "name": "python", 292 | "nbconvert_exporter": "python", 293 | "pygments_lexer": "ipython3", 294 | "version": "3.6.5" 295 | } 296 | }, 297 | "nbformat": 4, 298 | "nbformat_minor": 2 299 | } 300 | -------------------------------------------------------------------------------- /nsfg.py: -------------------------------------------------------------------------------- 1 | """This file contains code for use with "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2010 Allen B. 
Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | from collections import defaultdict 11 | import numpy as np 12 | import sys 13 | 14 | import thinkstats2 15 | 16 | 17 | def ReadFemPreg(dct_file='2002FemPreg.dct', 18 | dat_file='2002FemPreg.dat.gz'): 19 | """Reads the NSFG pregnancy data. 20 | 21 | dct_file: string file name 22 | dat_file: string file name 23 | 24 | returns: DataFrame 25 | """ 26 | dct = thinkstats2.ReadStataDct(dct_file) 27 | df = dct.ReadFixedWidth(dat_file, compression='gzip') 28 | CleanFemPreg(df) 29 | return df 30 | 31 | 32 | def CleanFemPreg(df): 33 | """Recodes variables from the pregnancy frame. 34 | 35 | df: DataFrame 36 | """ 37 | # mother's age is encoded in centiyears; convert to years 38 | df.agepreg /= 100.0 39 | 40 | # birthwgt_lb contains at least one bogus value (51 lbs) 41 | # replace with NaN 42 | df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan 43 | 44 | # replace 'not ascertained', 'refused', 'don't know' with NaN 45 | na_vals = [97, 98, 99] 46 | df.birthwgt_lb.replace(na_vals, np.nan, inplace=True) 47 | df.birthwgt_oz.replace(na_vals, np.nan, inplace=True) 48 | df.hpagelb.replace(na_vals, np.nan, inplace=True) 49 | 50 | df.babysex.replace([7, 9], np.nan, inplace=True) 51 | df.nbrnaliv.replace([9], np.nan, inplace=True) 52 | 53 | # birthweight is stored in two columns, lbs and oz. 54 | # convert to a single column in lb 55 | # NOTE: creating a new column requires dictionary syntax, 56 | # not attribute assignment (like df.totalwgt_lb) 57 | df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0 58 | 59 | # due to a bug in ReadStataDct, the last variable gets clipped; 60 | # so for now set it to NaN 61 | df.cmintvw = np.nan 62 | 63 | 64 | def MakePregMap(df): 65 | """Make a map from caseid to list of preg indices. 
66 | 67 | df: DataFrame 68 | 69 | returns: dict that maps from caseid to list of indices into preg df 70 | """ 71 | d = defaultdict(list) 72 | for index, caseid in df.caseid.iteritems(): 73 | d[caseid].append(index) 74 | return d 75 | 76 | 77 | def main(script): 78 | """Tests the functions in this module. 79 | 80 | script: string script name 81 | """ 82 | df = ReadFemPreg() 83 | print(df.shape) 84 | 85 | assert len(df) == 13593 86 | 87 | assert df.caseid[13592] == 12571 88 | assert df.pregordr.value_counts()[1] == 5033 89 | assert df.nbrnaliv.value_counts()[1] == 8981 90 | assert df.babysex.value_counts()[1] == 4641 91 | assert df.birthwgt_lb.value_counts()[7] == 3049 92 | assert df.birthwgt_oz.value_counts()[0] == 1037 93 | assert df.prglngth.value_counts()[39] == 4744 94 | assert df.outcome.value_counts()[1] == 9148 95 | assert df.birthord.value_counts()[1] == 4413 96 | assert df.agepreg.value_counts()[22.75] == 100 97 | assert df.totalwgt_lb.value_counts()[7.5] == 302 98 | 99 | weights = df.finalwgt.value_counts() 100 | key = max(weights.keys()) 101 | assert df.finalwgt.value_counts()[key] == 6 102 | 103 | print('%s: All tests passed.' % script) 104 | 105 | if __name__ == '__main__': 106 | main(*sys.argv) 107 | -------------------------------------------------------------------------------- /nsfg2.py: -------------------------------------------------------------------------------- 1 | """This file contains code used in "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2014 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import numpy as np 11 | 12 | import thinkstats2 13 | 14 | def MakeFrames(): 15 | """Reads pregnancy data and partitions first babies and others. 
16 | 17 | returns: DataFrames (all live births, first babies, others) 18 | """ 19 | preg = ReadFemPreg() 20 | 21 | live = preg[preg.outcome == 1] 22 | firsts = live[live.birthord == 1] 23 | others = live[live.birthord != 1] 24 | 25 | assert(len(live) == 14292) 26 | assert(len(firsts) == 6683) 27 | assert(len(others) == 7609) 28 | 29 | return live, firsts, others 30 | 31 | 32 | def ReadFemPreg(dct_file='2006_2010_FemPregSetup.dct', 33 | dat_file='2006_2010_FemPreg.dat.gz'): 34 | """Reads the NSFG 2006-2010 pregnancy data. 35 | 36 | dct_file: string file name 37 | dat_file: string file name 38 | 39 | returns: DataFrame 40 | """ 41 | dct = thinkstats2.ReadStataDct(dct_file, encoding='iso-8859-1') 42 | df = dct.ReadFixedWidth(dat_file, compression='gzip') 43 | CleanFemPreg(df) 44 | return df 45 | 46 | 47 | def CleanFemPreg(df): 48 | """Recodes variables from the pregnancy frame. 49 | 50 | df: DataFrame 51 | """ 52 | # mother's age is encoded in centiyears; convert to years 53 | df.agepreg /= 100.0 54 | 55 | # birthwgt_lb contains at least one bogus value (51 lbs) 56 | # replace with NaN; use .loc to avoid assigning through a chained index 57 | df.loc[df.birthwgt_lb1 > 20, 'birthwgt_lb1'] = np.nan 58 | 59 | # replace 'not ascertained', 'refused', 'don't know' with NaN 60 | na_vals = [97, 98, 99] 61 | df.birthwgt_lb1.replace(na_vals, np.nan, inplace=True) 62 | df.birthwgt_oz1.replace(na_vals, np.nan, inplace=True) 63 | 64 | # birthweight is stored in two columns, lbs and oz.
65 | # convert to a single column in lb 66 | # NOTE: creating a new column requires dictionary syntax, 67 | # not attribute assignment (like df.totalwgt_lb) 68 | df['totalwgt_lb'] = df.birthwgt_lb1 + df.birthwgt_oz1 / 16.0 69 | 70 | # due to a bug in ReadStataDct, the last variable gets clipped; 71 | # so for now set it to NaN 72 | df.phase = np.nan 73 | 74 | 75 | def main(): 76 | live, firsts, others = MakeFrames() 77 | 78 | 79 | if __name__ == '__main__': 80 | main() 81 | 82 | 83 | -------------------------------------------------------------------------------- /resampling.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/resampling.pdf -------------------------------------------------------------------------------- /resampling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/resampling.png -------------------------------------------------------------------------------- /resampling_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenDowney/CompStats/dc459e04ba74613d533adae2550abefb18da002a/resampling_small.png -------------------------------------------------------------------------------- /sampling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Random Sampling\n", 8 | "=============\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | 
"outputs": [], 22 | "source": [ 23 | "from __future__ import print_function, division\n", 24 | "\n", 25 | "import numpy\n", 26 | "import scipy.stats\n", 27 | "\n", 28 | "import matplotlib.pyplot as pyplot\n", 29 | "\n", 30 | "from ipywidgets import interact, interactive, fixed\n", 31 | "import ipywidgets as widgets\n", 32 | "\n", 33 | "# seed the random number generator so we all get the same results\n", 34 | "numpy.random.seed(18)\n", 35 | "\n", 36 | "# some nicer colors from http://colorbrewer2.org/\n", 37 | "COLOR1 = '#7fc97f'\n", 38 | "COLOR2 = '#beaed4'\n", 39 | "COLOR3 = '#fdc086'\n", 40 | "COLOR4 = '#ffff99'\n", 41 | "COLOR5 = '#386cb0'\n", 42 | "\n", 43 | "%matplotlib inline" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "Part One\n", 51 | "========\n", 52 | "\n", 53 | "Suppose we want to estimate the average weight of men and women in the U.S.\n", 54 | "\n", 55 | "And we want to quantify the uncertainty of the estimate.\n", 56 | "\n", 57 | "One approach is to simulate many experiments and see how much the results vary from one experiment to the next.\n", 58 | "\n", 59 | "I'll start with the unrealistic assumption that we know the actual distribution of weights in the population. Then I'll show how to solve the problem without that assumption.\n", 60 | "\n", 61 | "Based on data from the [BRFSS](http://www.cdc.gov/brfss/), I found that the distribution of weight in kg for women in the U.S. 
is well modeled by a lognormal distribution with the following parameters:" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 71 | "weight.mean(), weight.std()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Here's what that distribution looks like:" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "xs = numpy.linspace(20, 160, 100)\n", 88 | "ys = weight.pdf(xs)\n", 89 | "pyplot.plot(xs, ys, linewidth=4, color=COLOR1)\n", 90 | "pyplot.xlabel('weight (kg)')\n", 91 | "pyplot.ylabel('PDF')\n", 92 | "None" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "`make_sample` draws a random sample from this distribution. The result is a NumPy array." 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": { 106 | "collapsed": true 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "def make_sample(n=100):\n", 111 | " sample = weight.rvs(n)\n", 112 | " return sample" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Here's an example with `n=100`. The mean and std of the sample are close to the mean and std of the population, but not exact." 
120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "sample = make_sample(n=100)\n", 129 | "sample.mean(), sample.std()" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "We want to estimate the average weight in the population, so the \"sample statistic\" we'll use is the mean:" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": true 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "def sample_stat(sample):\n", 148 | " return sample.mean()" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "One iteration of \"the experiment\" is to collect a sample of 100 women and compute their average weight.\n", 156 | "\n", 157 | "We can simulate running this experiment many times, and collect a list of sample statistics. The result is a NumPy array." 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "def compute_sampling_distribution(n=100, iters=1000):\n", 169 | " stats = [sample_stat(make_sample(n)) for i in range(iters)]\n", 170 | " return numpy.array(stats)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "The next line runs the simulation 1000 times and puts the results in\n", 178 | "`sample_means`:" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": true 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "sample_means = compute_sampling_distribution(n=100, iters=1000)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "Let's look at the distribution of the sample means. 
This distribution shows how much the results vary from one experiment to the next.\n", 197 | "\n", 198 | "Remember that this distribution is not the same as the distribution of weight in the population. This is the distribution of results across repeated imaginary experiments." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "pyplot.hist(sample_means, color=COLOR5)\n", 208 | "pyplot.xlabel('sample mean (n=100)')\n", 209 | "pyplot.ylabel('count')\n", 210 | "None" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "The mean of the sample means is close to the actual population mean, which is nice, but not actually the important part." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "sample_means.mean()" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "The standard deviation of the sample means quantifies the variability from one experiment to the next, and reflects the precision of the estimate.\n", 234 | "\n", 235 | "This quantity is called the \"standard error\"." 
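    ,
    "\n",
    "\n",
    "As a sanity check (a sketch, using the `weight` distribution defined above): for the sample mean, theory predicts a standard error equal to the population standard deviation divided by the square root of the sample size, which should be close to the simulated value computed below:\n",
    "\n",
    "```python\n",
    "weight.std() / numpy.sqrt(100)  # analytic prediction for n = 100\n",
    "```"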
236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "std_err = sample_means.std()\n", 245 | "std_err" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "We can also use the distribution of sample means to compute a \"90% confidence interval\", which contains 90% of the experimental results:" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "conf_int = numpy.percentile(sample_means, [5, 95])\n", 262 | "conf_int" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "Now we'd like to see what happens as we vary the sample size, `n`. The following function takes `n`, runs 1000 simulated experiments, and summarizes the results." 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": { 276 | "collapsed": true 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "def plot_sampling_distribution(n, xlim=None):\n", 281 | " \"\"\"Plot the sampling distribution.\n", 282 | " \n", 283 | " n: sample size\n", 284 | " xlim: [xmin, xmax] range for the x axis \n", 285 | " \"\"\"\n", 286 | " sample_stats = compute_sampling_distribution(n, iters=1000)\n", 287 | " se = numpy.std(sample_stats)\n", 288 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 289 | " \n", 290 | " pyplot.hist(sample_stats, color=COLOR2)\n", 291 | " pyplot.xlabel('sample statistic')\n", 292 | " pyplot.xlim(xlim)\n", 293 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 294 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 295 | " pyplot.show()\n", 296 | " \n", 297 | "def text(x, y, s):\n", 298 | " \"\"\"Plot a string at a given location in axis coordinates.\n", 299 | " \n", 300 | " x: coordinate\n", 301 | " y: coordinate\n", 302 | " s: string\n", 303 | " \"\"\"\n", 304 | " 
ax = pyplot.gca()\n", 305 | " pyplot.text(x, y, s,\n", 306 | " horizontalalignment='left',\n", 307 | " verticalalignment='top',\n", 308 | " transform=ax.transAxes)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "Here's a test run with `n=100`:" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "plot_sampling_distribution(100)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "Now we can use `interact` to run `plot_sampling_distribution` with different values of `n`. Note: `xlim` sets the limits of the x-axis so the figure doesn't get rescaled as we vary `n`." 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "def sample_stat(sample):\n", 341 | " return sample.mean()\n", 342 | "\n", 343 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 344 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([55, 95]))\n", 345 | "None" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "### Other sample statistics\n", 353 | "\n", 354 | "This framework works with any other quantity we want to estimate. 
By changing `sample_stat`, you can compute the SE and CI for any sample statistic.\n", 355 | "\n", 356 | "**Exercise 1**: Fill in `sample_stat` below with any of these statistics:\n", 357 | "\n", 358 | "* Standard deviation of the sample.\n", 359 | "* Coefficient of variation, which is the sample standard deviation divided by the sample mean.\n", 360 | "* Min or max.\n", 361 | "* Median (which is the 50th percentile).\n", 362 | "* 10th or 90th percentile.\n", 363 | "* Interquartile range (IQR), which is the difference between the 75th and 25th percentiles.\n", 364 | "\n", 365 | "NumPy array methods you might find useful include `std`, `min`, `max`, and `percentile`.\n", 366 | "Depending on the results, you might want to adjust `xlim`." 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "def sample_stat(sample):\n", 376 | " # TODO: replace the following line with another sample statistic\n", 377 | " return sample.mean()\n", 378 | "\n", 379 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 380 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([0, 100]))\n", 381 | "None" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "STOP HERE\n", 389 | "---------\n", 390 | "\n", 391 | "We will regroup and discuss before going on." 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "Part Two\n", 399 | "========\n", 400 | "\n", 401 | "So far we have shown that if we know the actual distribution of the population, we can compute the sampling distribution for any sample statistic, and from that we can compute SE and CI.\n", 402 | "\n", 403 | "But in real life we don't know the actual distribution of the population. 
If we did, we wouldn't be doing statistical inference in the first place!\n", 404 | "\n", 405 | "In real life, we use the sample to build a model of the population distribution, then use the model to generate the sampling distribution. A simple and popular way to do that is \"resampling,\" which means we use the sample itself as a model of the population distribution and draw samples from it.\n", 406 | "\n", 407 | "Before we go on, I want to collect some of the code from Part One and organize it as a class. This class represents a framework for computing sampling distributions." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": { 414 | "collapsed": true 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "class Resampler(object):\n", 419 | " \"\"\"Represents a framework for computing sampling distributions.\"\"\"\n", 420 | " \n", 421 | " def __init__(self, sample, xlim=None):\n", 422 | " \"\"\"Stores the actual sample.\"\"\"\n", 423 | " self.sample = sample\n", 424 | " self.n = len(sample)\n", 425 | " self.xlim = xlim\n", 426 | " \n", 427 | " def resample(self):\n", 428 | " \"\"\"Generates a new sample by choosing from the original\n", 429 | " sample with replacement.\n", 430 | " \"\"\"\n", 431 | " new_sample = numpy.random.choice(self.sample, self.n, replace=True)\n", 432 | " return new_sample\n", 433 | " \n", 434 | " def sample_stat(self, sample):\n", 435 | " \"\"\"Computes a sample statistic using the original sample or a\n", 436 | " simulated sample.\n", 437 | " \"\"\"\n", 438 | " return sample.mean()\n", 439 | " \n", 440 | " def compute_sampling_distribution(self, iters=1000):\n", 441 | " \"\"\"Simulates many experiments and collects the resulting sample\n", 442 | " statistics.\n", 443 | " \"\"\"\n", 444 | " stats = [self.sample_stat(self.resample()) for i in range(iters)]\n", 445 | " return numpy.array(stats)\n", 446 | " \n", 447 | " def plot_sampling_distribution(self):\n", 448 | " \"\"\"Plots the sampling 
distribution.\"\"\"\n", 449 | " sample_stats = self.compute_sampling_distribution()\n", 450 | " se = sample_stats.std()\n", 451 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 452 | " \n", 453 | " pyplot.hist(sample_stats, color=COLOR2)\n", 454 | " pyplot.xlabel('sample statistic')\n", 455 | " pyplot.xlim(self.xlim)\n", 456 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 457 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 458 | " pyplot.show()" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "The following function instantiates a `Resampler` and runs it." 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": { 472 | "collapsed": true 473 | }, 474 | "outputs": [], 475 | "source": [ 476 | "def interact_func(n, xlim):\n", 477 | " sample = weight.rvs(n)\n", 478 | " resampler = Resampler(sample, xlim=xlim)\n", 479 | " resampler.plot_sampling_distribution()" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "Here's a test run with `n=100`" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "interact_func(n=100, xlim=[50, 100])" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "Now we can use `interact_func` in an interaction:" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": null, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 512 | "interact(interact_func, n=slider, xlim=fixed([50, 100]))\n", 513 | "None" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "**Exercise 2**: write a new class called `StdResampler` that inherits from `Resampler` and overrides `sample_stat` so it computes the standard deviation 
of the resampled data." 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "metadata": { 527 | "collapsed": true 528 | }, 529 | "outputs": [], 530 | "source": [ 531 | "# Solution goes here" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "Test your code using the cell below:" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": null, 544 | "metadata": {}, 545 | "outputs": [], 546 | "source": [ 547 | "def interact_func2(n, xlim):\n", 548 | " sample = weight.rvs(n)\n", 549 | " resampler = StdResampler(sample, xlim=xlim)\n", 550 | " resampler.plot_sampling_distribution()\n", 551 | " \n", 552 | "interact_func2(n=100, xlim=[0, 100])" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "When your `StdResampler` is working, you should be able to interact with it:" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": { 566 | "collapsed": true 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 571 | "interact(interact_func2, n=slider, xlim=fixed([0, 100]))\n", 572 | "None" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "STOP HERE\n", 580 | "---------\n", 581 | "\n", 582 | "We will regroup and discuss before going on." 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": {}, 588 | "source": [ 589 | "Part Three\n", 590 | "==========\n", 591 | "\n", 592 | "We can extend this framework to compute SE and CI for a difference in means.\n", 593 | "\n", 594 | "For example, men are heavier than women on average. 
Here's the women's distribution again (from BRFSS data):" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": null, 600 | "metadata": { 601 | "collapsed": true 602 | }, 603 | "outputs": [], 604 | "source": [ 605 | "female_weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 606 | "female_weight.mean(), female_weight.std()" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "And here's the men's distribution:" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": null, 619 | "metadata": { 620 | "collapsed": true 621 | }, 622 | "outputs": [], 623 | "source": [ 624 | "male_weight = scipy.stats.lognorm(0.20, 0, 87.3)\n", 625 | "male_weight.mean(), male_weight.std()" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "I'll simulate a sample of 100 men and 100 women:" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": { 639 | "collapsed": true 640 | }, 641 | "outputs": [], 642 | "source": [ 643 | "female_sample = female_weight.rvs(100)\n", 644 | "male_sample = male_weight.rvs(100)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "The difference in means should be about 17 kg, but will vary from one random sample to the next:" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": { 658 | "collapsed": true 659 | }, 660 | "outputs": [], 661 | "source": [ 662 | "male_sample.mean() - female_sample.mean()" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "Here's the function that computes Cohen's effect size again:" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": { 676 | "collapsed": true 677 | }, 678 | "outputs": [], 679 | "source": [ 680 | "def CohenEffectSize(group1, group2):\n", 
681 | " \"\"\"Compute Cohen's d.\n", 682 | "\n", 683 | " group1: Series or NumPy array\n", 684 | " group2: Series or NumPy array\n", 685 | "\n", 686 | " returns: float\n", 687 | " \"\"\"\n", 688 | " diff = group1.mean() - group2.mean()\n", 689 | "\n", 690 | " n1, n2 = len(group1), len(group2)\n", 691 | " var1 = group1.var()\n", 692 | " var2 = group2.var()\n", 693 | "\n", 694 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 695 | " d = diff / numpy.sqrt(pooled_var)\n", 696 | " return d" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "The difference in weight between men and women is about 1 standard deviation:" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "metadata": { 710 | "collapsed": true 711 | }, 712 | "outputs": [], 713 | "source": [ 714 | "CohenEffectSize(male_sample, female_sample)" 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": {}, 720 | "source": [ 721 | "Now we can write a version of the `Resampler` that computes the sampling distribution of $d$." 
722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "metadata": { 728 | "collapsed": true 729 | }, 730 | "outputs": [], 731 | "source": [ 732 | "class CohenResampler(Resampler):\n", 733 | " def __init__(self, group1, group2, xlim=None):\n", 734 | " self.group1 = group1\n", 735 | " self.group2 = group2\n", 736 | " self.xlim = xlim\n", 737 | " \n", 738 | " def resample(self):\n", 739 | " n, m = len(self.group1), len(self.group2)\n", 740 | " group1 = numpy.random.choice(self.group1, n, replace=True)\n", 741 | " group2 = numpy.random.choice(self.group2, m, replace=True)\n", 742 | " return group1, group2\n", 743 | " \n", 744 | " def sample_stat(self, groups):\n", 745 | " group1, group2 = groups\n", 746 | " return CohenEffectSize(group1, group2)" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "Now we can instantiate a `CohenResampler` and plot the sampling distribution." 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": null, 759 | "metadata": { 760 | "collapsed": true 761 | }, 762 | "outputs": [], 763 | "source": [ 764 | "resampler = CohenResampler(male_sample, female_sample)\n", 765 | "resampler.plot_sampling_distribution()" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "This example demonstrates an advantage of the computational framework over mathematical analysis. Statistics like Cohen's $d$, which are ratios of other statistics, are relatively difficult to analyze mathematically. But with a computational approach, all sample statistics are equally \"easy\".\n", 773 | "\n", 774 | "One note on vocabulary: what I am calling \"resampling\" here is a specific kind of resampling called \"bootstrapping\". Other techniques that are also considered resampling include permutation tests, which we'll see in the next section, and \"jackknife\" resampling." 
775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": { 781 | "collapsed": true 782 | }, 783 | "outputs": [], 784 | "source": [] 785 | } 786 | ], 787 | "metadata": { 788 | "kernelspec": { 789 | "display_name": "Python 3", 790 | "language": "python", 791 | "name": "python3" 792 | }, 793 | "language_info": { 794 | "codemirror_mode": { 795 | "name": "ipython", 796 | "version": 3 797 | }, 798 | "file_extension": ".py", 799 | "mimetype": "text/x-python", 800 | "name": "python", 801 | "nbconvert_exporter": "python", 802 | "pygments_lexer": "ipython3", 803 | "version": "3.6.1" 804 | } 805 | }, 806 | "nbformat": 4, 807 | "nbformat_minor": 1 808 | } 809 | -------------------------------------------------------------------------------- /sampling_soln.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Random Sampling\n", 8 | "=============\n", 9 | "\n", 10 | "Copyright 2016 Allen Downey\n", 11 | "\n", 12 | "License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "from __future__ import print_function, division\n", 22 | "\n", 23 | "import numpy\n", 24 | "import scipy.stats\n", 25 | "\n", 26 | "import matplotlib.pyplot as pyplot\n", 27 | "\n", 28 | "from ipywidgets import interact, interactive, fixed\n", 29 | "import ipywidgets as widgets\n", 30 | "\n", 31 | "# seed the random number generator so we all get the same results\n", 32 | "numpy.random.seed(18)\n", 33 | "\n", 34 | "# some nicer colors from http://colorbrewer2.org/\n", 35 | "COLOR1 = '#7fc97f'\n", 36 | "COLOR2 = '#beaed4'\n", 37 | "COLOR3 = '#fdc086'\n", 38 | "COLOR4 = '#ffff99'\n", 39 | "COLOR5 = '#386cb0'\n", 40 | "\n", 41 | "%matplotlib inline" 42 | ] 
43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Part One\n", 49 | "========\n", 50 | "\n", 51 | "Suppose we want to estimate the average weight of men and women in the U.S.\n", 52 | "\n", 53 | "And we want to quantify the uncertainty of the estimate.\n", 54 | "\n", 55 | "One approach is to simulate many experiments and see how much the results vary from one experiment to the next.\n", 56 | "\n", 57 | "I'll start with the unrealistic assumption that we know the actual distribution of weights in the population. Then I'll show how to solve the problem without that assumption.\n", 58 | "\n", 59 | "Based on data from the [BRFSS](http://www.cdc.gov/brfss/), I found that the distribution of weight in kg for women in the U.S. is well modeled by a lognormal distribution with the following parameters:" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 69 | "weight.mean(), weight.std()" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "Here's what that distribution looks like:" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "xs = numpy.linspace(20, 160, 100)\n", 86 | "ys = weight.pdf(xs)\n", 87 | "pyplot.plot(xs, ys, linewidth=4, color=COLOR1)\n", 88 | "pyplot.xlabel('weight (kg)')\n", 89 | "pyplot.ylabel('PDF')\n", 90 | "None" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "`make_sample` draws a random sample from this distribution. The result is a NumPy array." 
98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "def make_sample(n=100):\n", 107 | " sample = weight.rvs(n)\n", 108 | " return sample" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "Here's an example with `n=100`. The mean and std of the sample are close to the mean and std of the population, but not exact." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "sample = make_sample(n=100)\n", 125 | "sample.mean(), sample.std()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "We want to estimate the average weight in the population, so the \"sample statistic\" we'll use is the mean:" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "def sample_stat(sample):\n", 142 | " return sample.mean()" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "One iteration of \"the experiment\" is to collect a sample of 100 women and compute their average weight.\n", 150 | "\n", 151 | "We can simulate running this experiment many times, and collect a list of sample statistics. The result is a NumPy array." 
152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "def compute_sampling_distribution(n=100, iters=1000):\n", 161 | " stats = [sample_stat(make_sample(n)) for i in range(iters)]\n", 162 | " return numpy.array(stats)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "The next line runs the simulation 1000 times and puts the results in\n", 170 | "`sample_means`:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "sample_means = compute_sampling_distribution(n=100, iters=1000)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Let's look at the distribution of the sample means. This distribution shows how much the results vary from one experiment to the next.\n", 187 | "\n", 188 | "Remember that this distribution is not the same as the distribution of weight in the population. This is the distribution of results across repeated imaginary experiments." 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "pyplot.hist(sample_means, color=COLOR5)\n", 198 | "pyplot.xlabel('sample mean (n=100)')\n", 199 | "pyplot.ylabel('count')\n", 200 | "None" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "The mean of the sample means is close to the actual population mean, which is nice, but not actually the important part." 
208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "sample_means.mean()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "The standard deviation of the sample means quantifies the variability from one experiment to the next, and reflects the precision of the estimate.\n", 224 | "\n", 225 | "This quantity is called the \"standard error\"." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "std_err = sample_means.std()\n", 235 | "std_err" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "We can also use the distribution of sample means to compute a \"90% confidence interval\", which contains 90% of the experimental results:" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "conf_int = numpy.percentile(sample_means, [5, 95])\n", 252 | "conf_int" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "Now we'd like to see what happens as we vary the sample size, `n`. The following function takes `n`, runs 1000 simulated experiments, and summarizes the results." 
260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "def plot_sampling_distribution(n, xlim=None):\n", 269 | " \"\"\"Plot the sampling distribution.\n", 270 | " \n", 271 | " n: sample size\n", 272 | " xlim: [xmin, xmax] range for the x axis \n", 273 | " \"\"\"\n", 274 | " sample_stats = compute_sampling_distribution(n, iters=1000)\n", 275 | " se = numpy.std(sample_stats)\n", 276 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 277 | " \n", 278 | " pyplot.hist(sample_stats, color=COLOR2)\n", 279 | " pyplot.xlabel('sample statistic')\n", 280 | " pyplot.xlim(xlim)\n", 281 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 282 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 283 | " pyplot.show()\n", 284 | " \n", 285 | "def text(x, y, s):\n", 286 | " \"\"\"Plot a string at a given location in axis coordinates.\n", 287 | " \n", 288 | " x: coordinate\n", 289 | " y: coordinate\n", 290 | " s: string\n", 291 | " \"\"\"\n", 292 | " ax = pyplot.gca()\n", 293 | " pyplot.text(x, y, s,\n", 294 | " horizontalalignment='left',\n", 295 | " verticalalignment='top',\n", 296 | " transform=ax.transAxes)" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "Here's a test run with `n=100`:" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "plot_sampling_distribution(100)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "Now we can use `interact` to run `plot_sampling_distribution` with different values of `n`. Note: `xlim` sets the limits of the x-axis so the figure doesn't get rescaled as we vary `n`." 
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "def sample_stat(sample):\n", 329 | " return sample.mean()\n", 330 | "\n", 331 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 332 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([55, 95]))\n", 333 | "None" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "### Other sample statistics\n", 341 | "\n", 342 | "This framework works with any other quantity we want to estimate. By changing `sample_stat`, you can compute the SE and CI for any sample statistic.\n", 343 | "\n", 344 | "**Exercise 1**: Fill in `sample_stat` below with any of these statistics:\n", 345 | "\n", 346 | "* Standard deviation of the sample.\n", 347 | "* Coefficient of variation, which is the sample standard deviation divided by the sample mean.\n", 348 | "* Min or max.\n", 349 | "* Median (which is the 50th percentile).\n", 350 | "* 10th or 90th percentile.\n", 351 | "* Interquartile range (IQR), which is the difference between the 75th and 25th percentiles.\n", 352 | "\n", 353 | "NumPy array methods you might find useful include `std`, `min`, `max`, and `percentile`.\n", 354 | "Depending on the results, you might want to adjust `xlim`." 
355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "def sample_stat(sample):\n", 364 | " # TODO: replace the following line with another sample statistic\n", 365 | " return sample.mean()\n", 366 | "\n", 367 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 368 | "interact(plot_sampling_distribution, n=slider, xlim=fixed([0, 100]))\n", 369 | "None" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "STOP HERE\n", 377 | "---------\n", 378 | "\n", 379 | "We will regroup and discuss before going on." 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "Part Two\n", 387 | "========\n", 388 | "\n", 389 | "So far we have shown that if we know the actual distribution of the population, we can compute the sampling distribution for any sample statistic, and from that we can compute SE and CI.\n", 390 | "\n", 391 | "But in real life we don't know the actual distribution of the population. If we did, we wouldn't be doing statistical inference in the first place!\n", 392 | "\n", 393 | "In real life, we use the sample to build a model of the population distribution, then use the model to generate the sampling distribution. A simple and popular way to do that is \"resampling,\" which means we use the sample itself as a model of the population distribution and draw samples from it.\n", 394 | "\n", 395 | "Before we go on, I want to collect some of the code from Part One and organize it as a class. This class represents a framework for computing sampling distributions." 
396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "class Resampler(object):\n", 405 | " \"\"\"Represents a framework for computing sampling distributions.\"\"\"\n", 406 | " \n", 407 | " def __init__(self, sample, xlim=None):\n", 408 | " \"\"\"Stores the actual sample.\"\"\"\n", 409 | " self.sample = sample\n", 410 | " self.n = len(sample)\n", 411 | " self.xlim = xlim\n", 412 | " \n", 413 | " def resample(self):\n", 414 | " \"\"\"Generates a new sample by choosing from the original\n", 415 | " sample with replacement.\n", 416 | " \"\"\"\n", 417 | " new_sample = numpy.random.choice(self.sample, self.n, replace=True)\n", 418 | " return new_sample\n", 419 | " \n", 420 | " def sample_stat(self, sample):\n", 421 | " \"\"\"Computes a sample statistic using the original sample or a\n", 422 | " simulated sample.\n", 423 | " \"\"\"\n", 424 | " return sample.mean()\n", 425 | " \n", 426 | " def compute_sampling_distribution(self, iters=1000):\n", 427 | " \"\"\"Simulates many experiments and collects the resulting sample\n", 428 | " statistics.\n", 429 | " \"\"\"\n", 430 | " stats = [self.sample_stat(self.resample()) for i in range(iters)]\n", 431 | " return numpy.array(stats)\n", 432 | " \n", 433 | " def plot_sampling_distribution(self):\n", 434 | " \"\"\"Plots the sampling distribution.\"\"\"\n", 435 | " sample_stats = self.compute_sampling_distribution()\n", 436 | " se = sample_stats.std()\n", 437 | " ci = numpy.percentile(sample_stats, [5, 95])\n", 438 | " \n", 439 | " pyplot.hist(sample_stats, color=COLOR2)\n", 440 | " pyplot.xlabel('sample statistic')\n", 441 | " pyplot.xlim(self.xlim)\n", 442 | " text(0.03, 0.95, 'CI [%0.2f %0.2f]' % tuple(ci))\n", 443 | " text(0.03, 0.85, 'SE %0.2f' % se)\n", 444 | " pyplot.show()" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "The following function instantiates a `Resampler` and 
runs it." 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "def interact_func(n, xlim):\n", 461 | " sample = weight.rvs(n)\n", 462 | " resampler = Resampler(sample, xlim=xlim)\n", 463 | " resampler.plot_sampling_distribution()" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "Here's a test run with `n=100`" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "interact_func(n=100, xlim=[50, 100])" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "Now we can use `interact_func` in an interaction:" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 496 | "interact(interact_func, n=slider, xlim=fixed([50, 100]))\n", 497 | "None" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "**Exercise 2**: write a new class called `StdResampler` that inherits from `Resampler` and overrides `sample_stat` so it computes the standard deviation of the resampled data." 
505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "# Solution goes here\n", 514 | "\n", 515 | "class StdResampler(Resampler): \n", 516 | " \"\"\"Computes the sampling distribution of the standard deviation.\"\"\"\n", 517 | " \n", 518 | " def sample_stat(self, sample):\n", 519 | " \"\"\"Computes a sample statistic using the original sample or a\n", 520 | " simulated sample.\n", 521 | " \"\"\"\n", 522 | " return sample.std()" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "Test your code using the cell below:" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "def interact_func2(n, xlim):\n", 539 | " sample = weight.rvs(n)\n", 540 | " resampler = StdResampler(sample, xlim=xlim)\n", 541 | " resampler.plot_sampling_distribution()\n", 542 | " \n", 543 | "interact_func2(n=100, xlim=[0, 100])" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "When your `StdResampler` is working, you should be able to interact with it:" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": {}, 557 | "outputs": [], 558 | "source": [ 559 | "slider = widgets.IntSlider(min=10, max=1000, value=100)\n", 560 | "interact(interact_func2, n=slider, xlim=fixed([0, 100]))\n", 561 | "None" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "STOP HERE\n", 569 | "---------\n", 570 | "\n", 571 | "We will regroup and discuss before going on." 
572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "Part Three\n", 579 | "==========\n", 580 | "\n", 581 | "We can extend this framework to compute SE and CI for a difference in means.\n", 582 | "\n", 583 | "For example, men are heavier than women on average. Here's the women's distribution again (from BRFSS data):" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "female_weight = scipy.stats.lognorm(0.23, 0, 70.8)\n", 593 | "female_weight.mean(), female_weight.std()" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "And here's the men's distribution:" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "male_weight = scipy.stats.lognorm(0.20, 0, 87.3)\n", 610 | "male_weight.mean(), male_weight.std()" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "I'll simulate a sample of 100 men and 100 women:" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "female_sample = female_weight.rvs(100)\n", 627 | "male_sample = male_weight.rvs(100)" 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": {}, 633 | "source": [ 634 | "The difference in means should be about 17 kg, but will vary from one random sample to the next:" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": null, 640 | "metadata": {}, 641 | "outputs": [], 642 | "source": [ 643 | "male_sample.mean() - female_sample.mean()" 644 | ] 645 | }, 646 | { 647 | "cell_type": "markdown", 648 | "metadata": {}, 649 | "source": [ 650 | "Here's the function that computes Cohen's effect size again:" 651 | ] 652 | }, 653 | { 654 | 
"cell_type": "code", 655 | "execution_count": null, 656 | "metadata": {}, 657 | "outputs": [], 658 | "source": [ 659 | "def CohenEffectSize(group1, group2):\n", 660 | " \"\"\"Compute Cohen's d.\n", 661 | "\n", 662 | " group1: Series or NumPy array\n", 663 | " group2: Series or NumPy array\n", 664 | "\n", 665 | " returns: float\n", 666 | " \"\"\"\n", 667 | " diff = group1.mean() - group2.mean()\n", 668 | "\n", 669 | " n1, n2 = len(group1), len(group2)\n", 670 | " var1 = group1.var()\n", 671 | " var2 = group2.var()\n", 672 | "\n", 673 | " pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", 674 | " d = diff / numpy.sqrt(pooled_var)\n", 675 | " return d" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "The difference in weight between men and women is about 1 standard deviation:" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": null, 688 | "metadata": {}, 689 | "outputs": [], 690 | "source": [ 691 | "CohenEffectSize(male_sample, female_sample)" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "Now we can write a version of the `Resampler` that computes the sampling distribution of $d$." 
699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": {}, 705 | "outputs": [], 706 | "source": [ 707 | "class CohenResampler(Resampler):\n", 708 | " def __init__(self, group1, group2, xlim=None):\n", 709 | " self.group1 = group1\n", 710 | " self.group2 = group2\n", 711 | " self.xlim = xlim\n", 712 | " \n", 713 | " def resample(self):\n", 714 | " n, m = len(self.group1), len(self.group2)\n", 715 | " group1 = numpy.random.choice(self.group1, n, replace=True)\n", 716 | " group2 = numpy.random.choice(self.group2, m, replace=True)\n", 717 | " return group1, group2\n", 718 | " \n", 719 | " def sample_stat(self, groups):\n", 720 | " group1, group2 = groups\n", 721 | " return CohenEffectSize(group1, group2)" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "Now we can instantiate a `CohenResampler` and plot the sampling distribution." 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [ 737 | "resampler = CohenResampler(male_sample, female_sample)\n", 738 | "resampler.plot_sampling_distribution()" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "metadata": {}, 744 | "source": [ 745 | "This example demonstrates an advantage of the computational framework over mathematical analysis. Statistics like Cohen's $d$, which is the ratio of other statistics, are relatively difficult to analyze. But with a computational approach, all sample statistics are equally \"easy\".\n", 746 | "\n", 747 | "One note on vocabulary: what I am calling \"resampling\" here is a specific kind of resampling called \"bootstrapping\". Other techniques that are also considered resampling include permutation tests, which we'll see in the next section, and \"jackknife\" resampling. You can read more at ."
748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": {}, 754 | "outputs": [], 755 | "source": [] 756 | } 757 | ], 758 | "metadata": { 759 | "kernelspec": { 760 | "display_name": "Python 3", 761 | "language": "python", 762 | "name": "python3" 763 | }, 764 | "language_info": { 765 | "codemirror_mode": { 766 | "name": "ipython", 767 | "version": 3 768 | }, 769 | "file_extension": ".py", 770 | "mimetype": "text/x-python", 771 | "name": "python", 772 | "nbconvert_exporter": "python", 773 | "pygments_lexer": "ipython3", 774 | "version": "3.6.1" 775 | } 776 | }, 777 | "nbformat": 4, 778 | "nbformat_minor": 1 779 | } 780 | -------------------------------------------------------------------------------- /thinkplot.py: -------------------------------------------------------------------------------- 1 | """This file contains code for use with "Think Stats", 2 | by Allen B. Downey, available from greenteapress.com 3 | 4 | Copyright 2014 Allen B. Downey 5 | License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import math 11 | import matplotlib 12 | import matplotlib.pyplot as pyplot 13 | import numpy as np 14 | import pandas 15 | 16 | import warnings 17 | 18 | # customize some matplotlib attributes 19 | #matplotlib.rc('figure', figsize=(4, 3)) 20 | 21 | #matplotlib.rc('font', size=14.0) 22 | #matplotlib.rc('axes', labelsize=22.0, titlesize=22.0) 23 | #matplotlib.rc('legend', fontsize=20.0) 24 | 25 | #matplotlib.rc('xtick.major', size=6.0) 26 | #matplotlib.rc('xtick.minor', size=3.0) 27 | 28 | #matplotlib.rc('ytick.major', size=6.0) 29 | #matplotlib.rc('ytick.minor', size=3.0) 30 | 31 | 32 | class _Brewer(object): 33 | """Encapsulates a nice sequence of colors. 34 | 35 | Shades of blue that look good in color and can be distinguished 36 | in grayscale (up to a point). 
37 | 38 | Borrowed from http://colorbrewer2.org/ 39 | """ 40 | color_iter = None 41 | 42 | colors = ['#f7fbff', '#deebf7', '#c6dbef', 43 | '#9ecae1', '#6baed6', '#4292c6', 44 | '#2171b5','#08519c','#08306b'][::-1] 45 | 46 | # lists that indicate which colors to use depending on how many are used 47 | which_colors = [[], 48 | [1], 49 | [1, 3], 50 | [0, 2, 4], 51 | [0, 2, 4, 6], 52 | [0, 2, 3, 5, 6], 53 | [0, 2, 3, 4, 5, 6], 54 | [0, 1, 2, 3, 4, 5, 6], 55 | [0, 1, 2, 3, 4, 5, 6, 7], 56 | [0, 1, 2, 3, 4, 5, 6, 7, 8], 57 | ] 58 | 59 | current_figure = None 60 | 61 | @classmethod 62 | def Colors(cls): 63 | """Returns the list of colors. 64 | """ 65 | return cls.colors 66 | 67 | @classmethod 68 | def ColorGenerator(cls, num): 69 | """Returns an iterator of color strings. 70 | 71 | num: how many colors will be used 72 | """ 73 | for i in cls.which_colors[num]: 74 | yield cls.colors[i] 75 | return  # ending the generator raises StopIteration in the caller (raising it explicitly is an error under PEP 479) 76 | 77 | @classmethod 78 | def InitIter(cls, num): 79 | """Initializes the color iterator with the given number of colors.""" 80 | cls.color_iter = cls.ColorGenerator(num) 81 | 82 | @classmethod 83 | def ClearIter(cls): 84 | """Sets the color iterator to None.""" 85 | cls.color_iter = None 86 | 87 | @classmethod 88 | def GetIter(cls, num): 89 | """Gets the color iterator.""" 90 | fig = pyplot.gcf() 91 | if fig != cls.current_figure: 92 | cls.InitIter(num) 93 | cls.current_figure = fig 94 | 95 | if cls.color_iter is None: 96 | cls.InitIter(num) 97 | 98 | return cls.color_iter 99 | 100 | 101 | def _UnderrideColor(options): 102 | """If color is not in the options, chooses a color.
103 | """ 104 | if 'color' in options: 105 | return options 106 | 107 | # get the current color iterator; if there is none, init one 108 | color_iter = _Brewer.GetIter(5) 109 | 110 | try: 111 | options['color'] = next(color_iter) 112 | except StopIteration: 113 | # if you run out of colors, initialize the color iterator 114 | # and try again 115 | warnings.warn('Ran out of colors. Starting over.') 116 | _Brewer.ClearIter() 117 | _UnderrideColor(options) 118 | 119 | return options 120 | 121 | 122 | def PrePlot(num=None, rows=None, cols=None): 123 | """Takes hints about what's coming. 124 | 125 | num: number of lines that will be plotted 126 | rows: number of rows of subplots 127 | cols: number of columns of subplots 128 | """ 129 | if num: 130 | _Brewer.InitIter(num) 131 | 132 | if rows is None and cols is None: 133 | return 134 | 135 | if rows is not None and cols is None: 136 | cols = 1 137 | 138 | if cols is not None and rows is None: 139 | rows = 1 140 | 141 | # resize the image, depending on the number of rows and cols 142 | size_map = {(1, 1): (8, 6), 143 | (1, 2): (12, 6), 144 | (1, 3): (12, 6), 145 | (2, 2): (10, 10), 146 | (2, 3): (16, 10), 147 | (3, 1): (8, 10), 148 | (4, 1): (8, 12), 149 | } 150 | 151 | if (rows, cols) in size_map: 152 | fig = pyplot.gcf() 153 | fig.set_size_inches(*size_map[rows, cols]) 154 | 155 | # create the first subplot 156 | if rows > 1 or cols > 1: 157 | ax = pyplot.subplot(rows, cols, 1) 158 | global SUBPLOT_ROWS, SUBPLOT_COLS 159 | SUBPLOT_ROWS = rows 160 | SUBPLOT_COLS = cols 161 | else: 162 | ax = pyplot.gca() 163 | 164 | return ax 165 | 166 | def SubPlot(plot_number, rows=None, cols=None, **options): 167 | """Configures the number of subplots and changes the current plot. 
168 | 169 | rows: int 170 | cols: int 171 | plot_number: int 172 | options: passed to subplot 173 | """ 174 | rows = rows or SUBPLOT_ROWS 175 | cols = cols or SUBPLOT_COLS 176 | return pyplot.subplot(rows, cols, plot_number, **options) 177 | 178 | 179 | def _Underride(d, **options): 180 | """Add key-value pairs to d only if key is not in d. 181 | 182 | If d is None, create a new dictionary. 183 | 184 | d: dictionary 185 | options: keyword args to add to d 186 | """ 187 | if d is None: 188 | d = {} 189 | 190 | for key, val in options.items(): 191 | d.setdefault(key, val) 192 | 193 | return d 194 | 195 | 196 | def Clf(): 197 | """Clears the figure and any hints that have been set.""" 198 | global LOC 199 | LOC = None 200 | _Brewer.ClearIter() 201 | pyplot.clf() 202 | fig = pyplot.gcf() 203 | fig.set_size_inches(8, 6) 204 | 205 | 206 | def Figure(**options): 207 | """Sets options for the current figure.""" 208 | _Underride(options, figsize=(6, 8)) 209 | pyplot.figure(**options) 210 | 211 | 212 | def Plot(obj, ys=None, style='', **options): 213 | """Plots a line. 214 | 215 | Args: 216 | obj: sequence of x values, or Series, or anything with Render() 217 | ys: sequence of y values 218 | style: style string passed along to pyplot.plot 219 | options: keyword args passed to pyplot.plot 220 | """ 221 | options = _UnderrideColor(options) 222 | label = getattr(obj, 'label', '_nolegend_') 223 | options = _Underride(options, linewidth=3, alpha=0.7, label=label) 224 | 225 | xs = obj 226 | if ys is None: 227 | if hasattr(obj, 'Render'): 228 | xs, ys = obj.Render() 229 | if isinstance(obj, pandas.Series): 230 | ys = obj.values 231 | xs = obj.index 232 | 233 | if ys is None: 234 | pyplot.plot(xs, style, **options) 235 | else: 236 | pyplot.plot(xs, ys, style, **options) 237 | 238 | 239 | def Vlines(xs, y1, y2, **options): 240 | """Plots a set of vertical lines. 
241 | 242 | Args: 243 | xs: sequence of x values 244 | y1: sequence of y values 245 | y2: sequence of y values 246 | options: keyword args passed to pyplot.vlines 247 | """ 248 | options = _UnderrideColor(options) 249 | options = _Underride(options, linewidth=1, alpha=0.5) 250 | pyplot.vlines(xs, y1, y2, **options) 251 | 252 | 253 | def Hlines(ys, x1, x2, **options): 254 | """Plots a set of horizontal lines. 255 | 256 | Args: 257 | ys: sequence of y values 258 | x1: sequence of x values 259 | x2: sequence of x values 260 | options: keyword args passed to pyplot.hlines 261 | """ 262 | options = _UnderrideColor(options) 263 | options = _Underride(options, linewidth=1, alpha=0.5) 264 | pyplot.hlines(ys, x1, x2, **options) 265 | 266 | 267 | def FillBetween(xs, y1, y2=None, where=None, **options): 268 | """Fills the space between two lines. 269 | 270 | Args: 271 | xs: sequence of x values 272 | y1: sequence of y values 273 | y2: sequence of y values 274 | where: sequence of boolean 275 | options: keyword args passed to pyplot.fill_between 276 | """ 277 | options = _UnderrideColor(options) 278 | options = _Underride(options, linewidth=0, alpha=0.5) 279 | pyplot.fill_between(xs, y1, y2, where, **options) 280 | 281 | 282 | def Bar(xs, ys, **options): 283 | """Plots a bar chart. 284 | 285 | Args: 286 | xs: sequence of x values 287 | ys: sequence of y values 288 | options: keyword args passed to pyplot.bar 289 | """ 290 | options = _UnderrideColor(options) 291 | options = _Underride(options, linewidth=0, alpha=0.6) 292 | pyplot.bar(xs, ys, **options) 293 | 294 | 295 | def Scatter(xs, ys=None, **options): 296 | """Makes a scatter plot.
297 | 298 | xs: x values 299 | ys: y values 300 | options: options passed to pyplot.scatter 301 | """ 302 | options = _Underride(options, color='blue', alpha=0.2, 303 | s=30, edgecolors='none') 304 | 305 | if ys is None and isinstance(xs, pandas.Series): 306 | ys = xs.values 307 | xs = xs.index 308 | 309 | pyplot.scatter(xs, ys, **options) 310 | 311 | 312 | def HexBin(xs, ys, **options): 313 | """Makes a hexbin plot. 314 | 315 | xs: x values 316 | ys: y values 317 | options: options passed to pyplot.hexbin 318 | """ 319 | options = _Underride(options, cmap=matplotlib.cm.Blues) 320 | pyplot.hexbin(xs, ys, **options) 321 | 322 | 323 | def Pdf(pdf, **options): 324 | """Plots a Pdf, Pmf, or Hist as a line. 325 | 326 | Args: 327 | pdf: Pdf, Pmf, or Hist object 328 | options: keyword args passed to pyplot.plot 329 | """ 330 | low, high = options.pop('low', None), options.pop('high', None) 331 | n = options.pop('n', 101) 332 | xs, ps = pdf.Render(low=low, high=high, n=n) 333 | options = _Underride(options, label=pdf.label) 334 | Plot(xs, ps, **options) 335 | 336 | 337 | def Pdfs(pdfs, **options): 338 | """Plots a sequence of PDFs. 339 | 340 | Options are passed along for all PDFs. If you want different 341 | options for each pdf, make multiple calls to Pdf. 342 | 343 | Args: 344 | pdfs: sequence of PDF objects 345 | options: keyword args passed to pyplot.plot 346 | """ 347 | for pdf in pdfs: 348 | Pdf(pdf, **options) 349 | 350 | 351 | def Hist(hist, **options): 352 | """Plots a Pmf or Hist with a bar plot. 353 | 354 | The default width of the bars is based on the minimum difference 355 | between values in the Hist. If that's too small, you can override 356 | it by providing a width keyword argument, in the same units 357 | as the values.
358 | 359 | Args: 360 | hist: Hist or Pmf object 361 | options: keyword args passed to pyplot.bar 362 | """ 363 | # find the minimum distance between adjacent values 364 | xs, ys = hist.Render() 365 | 366 | if 'width' not in options: 367 | try: 368 | options['width'] = 0.9 * np.diff(xs).min() 369 | except TypeError: 370 | warnings.warn("Hist: Can't compute bar width automatically." 371 | "Check for non-numeric types in Hist." 372 | "Or try providing width option." 373 | ) 374 | 375 | options = _Underride(options, label=hist.label) 376 | options = _Underride(options, align='center') 377 | if options['align'] == 'left': 378 | options['align'] = 'edge' 379 | elif options['align'] == 'right': 380 | options['align'] = 'edge' 381 | options['width'] *= -1 382 | 383 | Bar(xs, ys, **options) 384 | 385 | 386 | def Hists(hists, **options): 387 | """Plots two histograms as interleaved bar plots. 388 | 389 | Options are passed along for all PMFs. If you want different 390 | options for each pmf, make multiple calls to Pmf. 391 | 392 | Args: 393 | hists: list of two Hist or Pmf objects 394 | options: keyword args passed to pyplot.plot 395 | """ 396 | for hist in hists: 397 | Hist(hist, **options) 398 | 399 | 400 | def Pmf(pmf, **options): 401 | """Plots a Pmf or Hist as a line. 402 | 403 | Args: 404 | pmf: Hist or Pmf object 405 | options: keyword args passed to pyplot.plot 406 | """ 407 | xs, ys = pmf.Render() 408 | low, high = min(xs), max(xs) 409 | 410 | width = options.pop('width', None) 411 | if width is None: 412 | try: 413 | width = np.diff(xs).min() 414 | except TypeError: 415 | warnings.warn("Pmf: Can't compute bar width automatically." 416 | "Check for non-numeric types in Pmf." 
417 | "Or try providing width option.") 418 | points = [] 419 | 420 | lastx = np.nan 421 | lasty = 0 422 | for x, y in zip(xs, ys): 423 | if (x - lastx) > 1e-5: 424 | points.append((lastx, 0)) 425 | points.append((x, 0)) 426 | 427 | points.append((x, lasty)) 428 | points.append((x, y)) 429 | points.append((x+width, y)) 430 | 431 | lastx = x + width 432 | lasty = y 433 | points.append((lastx, 0)) 434 | pxs, pys = zip(*points) 435 | 436 | align = options.pop('align', 'center') 437 | if align == 'center': 438 | pxs = np.array(pxs) - width/2.0 439 | if align == 'right': 440 | pxs = np.array(pxs) - width 441 | 442 | options = _Underride(options, label=pmf.label) 443 | Plot(pxs, pys, **options) 444 | 445 | 446 | def Pmfs(pmfs, **options): 447 | """Plots a sequence of PMFs. 448 | 449 | Options are passed along for all PMFs. If you want different 450 | options for each pmf, make multiple calls to Pmf. 451 | 452 | Args: 453 | pmfs: sequence of PMF objects 454 | options: keyword args passed to pyplot.plot 455 | """ 456 | for pmf in pmfs: 457 | Pmf(pmf, **options) 458 | 459 | 460 | def Diff(t): 461 | """Compute the differences between adjacent elements in a sequence. 462 | 463 | Args: 464 | t: sequence of number 465 | 466 | Returns: 467 | sequence of differences (length one less than t) 468 | """ 469 | diffs = [t[i+1] - t[i] for i in range(len(t)-1)] 470 | return diffs 471 | 472 | 473 | def Cdf(cdf, complement=False, transform=None, **options): 474 | """Plots a CDF as a line. 475 | 476 | Args: 477 | cdf: Cdf object 478 | complement: boolean, whether to plot the complementary CDF 479 | transform: string, one of 'exponential', 'pareto', 'weibull', 'gumbel' 480 | options: keyword args passed to pyplot.plot 481 | 482 | Returns: 483 | dictionary with the scale options that should be passed to 484 | Config, Show or Save. 
485 | """ 486 | xs, ps = cdf.Render() 487 | xs = np.asarray(xs) 488 | ps = np.asarray(ps) 489 | 490 | scale = dict(xscale='linear', yscale='linear') 491 | 492 | for s in ['xscale', 'yscale']: 493 | if s in options: 494 | scale[s] = options.pop(s) 495 | 496 | if transform == 'exponential': 497 | complement = True 498 | scale['yscale'] = 'log' 499 | 500 | if transform == 'pareto': 501 | complement = True 502 | scale['yscale'] = 'log' 503 | scale['xscale'] = 'log' 504 | 505 | if complement: 506 | ps = [1.0-p for p in ps] 507 | 508 | if transform == 'weibull': 509 | xs = np.delete(xs, -1) 510 | ps = np.delete(ps, -1) 511 | ps = [-math.log(1.0-p) for p in ps] 512 | scale['xscale'] = 'log' 513 | scale['yscale'] = 'log' 514 | 515 | if transform == 'gumbel': 516 | xs = np.delete(xs, 0) 517 | ps = np.delete(ps, 0) 518 | ps = [-math.log(p) for p in ps] 519 | scale['yscale'] = 'log' 520 | 521 | options = _Underride(options, label=cdf.label) 522 | Plot(xs, ps, **options) 523 | return scale 524 | 525 | 526 | def Cdfs(cdfs, complement=False, transform=None, **options): 527 | """Plots a sequence of CDFs. 528 | 529 | cdfs: sequence of CDF objects 530 | complement: boolean, whether to plot the complementary CDF 531 | transform: string, one of 'exponential', 'pareto', 'weibull', 'gumbel' 532 | options: keyword args passed to pyplot.plot 533 | """ 534 | for cdf in cdfs: 535 | Cdf(cdf, complement, transform, **options) 536 | 537 | 538 | def Contour(obj, pcolor=False, contour=True, imshow=False, **options): 539 | """Makes a contour plot.
540 | 541 | obj: map from (x, y) to z, or object that provides GetDict 542 | pcolor: boolean, whether to make a pseudocolor plot 543 | contour: boolean, whether to make a contour plot 544 | imshow: boolean, whether to use pyplot.imshow 545 | options: keyword args passed to pyplot.pcolor and/or pyplot.contour 546 | """ 547 | try: 548 | d = obj.GetDict() 549 | except AttributeError: 550 | d = obj 551 | 552 | _Underride(options, linewidth=3, cmap=matplotlib.cm.Blues) 553 | 554 | xs, ys = zip(*d.keys()) 555 | xs = sorted(set(xs)) 556 | ys = sorted(set(ys)) 557 | 558 | X, Y = np.meshgrid(xs, ys) 559 | func = lambda x, y: d.get((x, y), 0) 560 | func = np.vectorize(func) 561 | Z = func(X, Y) 562 | 563 | x_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False) 564 | axes = pyplot.gca() 565 | axes.xaxis.set_major_formatter(x_formatter) 566 | 567 | if pcolor: 568 | pyplot.pcolormesh(X, Y, Z, **options) 569 | if contour: 570 | cs = pyplot.contour(X, Y, Z, **options) 571 | pyplot.clabel(cs, inline=1, fontsize=10) 572 | if imshow: 573 | extent = xs[0], xs[-1], ys[0], ys[-1] 574 | pyplot.imshow(Z, extent=extent, **options) 575 | 576 | 577 | def Pcolor(xs, ys, zs, pcolor=True, contour=False, **options): 578 | """Makes a pseudocolor plot.
579 | 580 | xs: sequence of x values 581 | ys: sequence of y values 582 | zs: array of z values 583 | pcolor: boolean, whether to make a pseudocolor plot 584 | contour: boolean, whether to make a contour plot 585 | options: keyword args passed to pyplot.pcolor and/or pyplot.contour 586 | """ 587 | _Underride(options, linewidth=3, cmap=matplotlib.cm.Blues) 588 | 589 | X, Y = np.meshgrid(xs, ys) 590 | Z = zs 591 | 592 | x_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False) 593 | axes = pyplot.gca() 594 | axes.xaxis.set_major_formatter(x_formatter) 595 | 596 | if pcolor: 597 | pyplot.pcolormesh(X, Y, Z, **options) 598 | 599 | if contour: 600 | cs = pyplot.contour(X, Y, Z, **options) 601 | pyplot.clabel(cs, inline=1, fontsize=10) 602 | 603 | 604 | def Text(x, y, s, **options): 605 | """Puts text in a figure. 606 | 607 | x: number 608 | y: number 609 | s: string 610 | options: keyword args passed to pyplot.text 611 | """ 612 | options = _Underride(options, 613 | fontsize=16, 614 | verticalalignment='top', 615 | horizontalalignment='left') 616 | pyplot.text(x, y, s, **options) 617 | 618 | 619 | LEGEND = True 620 | LOC = None 621 | 622 | def Config(**options): 623 | """Configures the plot. 624 | 625 | Pulls options out of the option dictionary and passes them to 626 | the corresponding pyplot functions.
627 | """ 628 | names = ['title', 'xlabel', 'ylabel', 'xscale', 'yscale', 629 | 'xticks', 'yticks', 'axis', 'xlim', 'ylim'] 630 | 631 | for name in names: 632 | if name in options: 633 | getattr(pyplot, name)(options[name]) 634 | 635 | global LEGEND 636 | LEGEND = options.get('legend', LEGEND) 637 | 638 | if LEGEND: 639 | global LOC 640 | LOC = options.get('loc', LOC) 641 | pyplot.legend(loc=LOC) 642 | 643 | val = options.get('xticklabels', None) 644 | if val is not None: 645 | if val == 'invisible': 646 | ax = pyplot.gca() 647 | labels = ax.get_xticklabels() 648 | pyplot.setp(labels, visible=False) 649 | 650 | val = options.get('yticklabels', None) 651 | if val is not None: 652 | if val == 'invisible': 653 | ax = pyplot.gca() 654 | labels = ax.get_yticklabels() 655 | pyplot.setp(labels, visible=False) 656 | 657 | 658 | def Show(**options): 659 | """Shows the plot. 660 | 661 | For options, see Config. 662 | 663 | options: keyword args used to invoke various pyplot functions 664 | """ 665 | clf = options.pop('clf', True) 666 | Config(**options) 667 | pyplot.show() 668 | if clf: 669 | Clf() 670 | 671 | 672 | def Plotly(**options): 673 | """Shows the plot. 674 | 675 | For options, see Config. 676 | 677 | options: keyword args used to invoke various pyplot functions 678 | """ 679 | clf = options.pop('clf', True) 680 | Config(**options) 681 | import plotly.plotly as plotly 682 | url = plotly.plot_mpl(pyplot.gcf()) 683 | if clf: 684 | Clf() 685 | return url 686 | 687 | 688 | def Save(root=None, formats=None, **options): 689 | """Saves the plot in the given formats and clears the figure. 690 | 691 | For options, see Config. 
692 | 693 | Args: 694 | root: string filename root 695 | formats: list of string formats 696 | options: keyword args used to invoke various pyplot functions 697 | """ 698 | clf = options.pop('clf', True) 699 | Config(**options) 700 | 701 | if formats is None: 702 | formats = ['pdf', 'eps'] 703 | 704 | try: 705 | formats.remove('plotly') 706 | Plotly(clf=False) 707 | except ValueError: 708 | pass 709 | 710 | if root: 711 | for fmt in formats: 712 | SaveFormat(root, fmt) 713 | if clf: 714 | Clf() 715 | 716 | 717 | def SaveFormat(root, fmt='eps'): 718 | """Writes the current figure to a file in the given format. 719 | 720 | Args: 721 | root: string filename root 722 | fmt: string format 723 | """ 724 | filename = '%s.%s' % (root, fmt) 725 | print('Writing', filename) 726 | pyplot.savefig(filename, format=fmt, dpi=300) 727 | 728 | 729 | # provide aliases for calling functions with lower-case names 730 | preplot = PrePlot 731 | subplot = SubPlot 732 | clf = Clf 733 | figure = Figure 734 | plot = Plot 735 | vlines = Vlines 736 | hlines = Hlines 737 | fill_between = FillBetween 738 | text = Text 739 | scatter = Scatter 740 | pmf = Pmf 741 | pmfs = Pmfs 742 | hist = Hist 743 | hists = Hists 744 | diff = Diff 745 | cdf = Cdf 746 | cdfs = Cdfs 747 | contour = Contour 748 | pcolor = Pcolor 749 | config = Config 750 | show = Show 751 | save = Save 752 | 753 | 754 | def main(): 755 | color_iter = _Brewer.ColorGenerator(7) 756 | for color in color_iter: 757 | print(color) 758 | 759 | 760 | if __name__ == '__main__': 761 | main() 762 | -------------------------------------------------------------------------------- /tutorial.md: -------------------------------------------------------------------------------- 1 | ## Tutorial: Computational Statistics 2 | 3 | Allen Downey 4 | 5 | Do you know the difference between standard deviation and standard error? Do you know what statistical test to use for any occasion? Do you really know what a p-value is? How about a confidence interval?
6 | 7 | Most people don’t really understand these concepts, even after taking several statistics classes. The problem is that these classes focus on mathematical methods that bury the concepts under a mountain of details. 8 | 9 | This tutorial uses Python to implement simple statistical experiments that develop deep understanding. I will present examples using real-world data to answer relevant questions, and attendees will practice with hands-on exercises. 10 | 11 | The tutorial material is based on my book, [*Think Stats*](http://greenteapress.com/wp/think-stats-2e/), a class I teach at Olin College, and my blog, [“Probably Overthinking It.”](http://allendowney.blogspot.com/) 12 | 13 | 14 | ### Installation instructions 15 | 16 | Note: Please try to install everything you need for this tutorial before you leave home! 17 | 18 | To prepare for this tutorial, you have two options: 19 | 20 | 1. Install Jupyter on your laptop and download my code from GitHub. 21 | 22 | 2. Run the Jupyter notebook on a virtual machine on Binder. 23 | 24 | I'll provide instructions for both, but here's the catch: if everyone chooses Option 2, the wireless network will fail and no one will be able to do the hands-on part of the workshop. 25 | 26 | So, I strongly encourage you to try Option 1 and only resort to Option 2 if you can't get Option 1 working. 27 | 28 | 29 | 30 | #### Option 1A: If you already have Jupyter installed. 31 | 32 | To do the exercises, you need Python 2 or 3 with NumPy, SciPy, and matplotlib. If you are not sure whether you have those modules already, the easiest way to check is to run my code and see if it works. 33 | 34 | Code for this workshop is in a Git repository on GitHub. 35 | If you have a Git client installed, you should be able to download it by running: 36 | 37 | git clone https://github.com/AllenDowney/CompStats.git 38 | 39 | It should create a directory named `CompStats`.
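If you'd rather check for the required packages before launching Jupyter, a quick snippet like the following (a convenience sketch, not part of the repository) reports what is missing when you run it at a Python prompt:

```python
import importlib

# The packages the tutorial notebooks rely on
required = ['numpy', 'scipy', 'matplotlib']

for name in required:
    try:
        importlib.import_module(name)
        print(name, 'OK')
    except ImportError:
        print(name, 'MISSING')
```

If anything is reported missing, install it with your package manager, or install Anaconda as described below.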
40 | Otherwise you can download the repository in [this zip file](https://github.com/AllenDowney/CompStats/archive/master.zip). 41 | 42 | To start Jupyter, run: 43 | 44 | cd CompStats 45 | jupyter notebook 46 | 47 | Jupyter should launch your default browser or open a tab in an existing browser window. 48 | If not, the Jupyter server should print a URL you can use. For example, when I launch Jupyter, I get 49 | 50 | ``` 51 | ~/ThinkComplexity2$ jupyter notebook 52 | [I 10:03:20.115 NotebookApp] Serving notebooks from local directory: /home/downey/CompStats 53 | [I 10:03:20.115 NotebookApp] 0 active kernels 54 | [I 10:03:20.115 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/ 55 | [I 10:03:20.115 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). 56 | ``` 57 | 58 | In this case, the URL is [http://localhost:8888](http://localhost:8888). 59 | When you start your server, you might get a different URL. 60 | Whatever it is, if you paste it into a browser, you should see a home page with a list of the 61 | notebooks in the repository. 62 | 63 | Click on `effect_size.ipynb`. It should open the first notebook for the tutorial. 64 | 65 | Select the cell with the import statements and press "Shift-Enter" to run the code in the cell. 66 | If it works and you get no error messages, **you are all set**. 67 | 68 | If you get error messages about missing packages, you can install the packages you need using your package manager, or try Option 1B and install Anaconda. 69 | 70 | 71 | #### Option 1B: If you don't already have Jupyter. 72 | 73 | I highly recommend installing Anaconda, which is a Python distribution that contains everything 74 | you need for this tutorial. It is easy to install on Windows, Mac, and Linux, and because it does a 75 | user-level install, it will not interfere with other Python installations.
76 | 77 | [Information about installing Anaconda is here](http://docs.continuum.io/anaconda/install.html). 78 | 79 | When you install Anaconda, you should get Jupyter by default, but if not, run 80 | 81 | conda install jupyter 82 | 83 | Then go to Option 1A to make sure you can run my code. 84 | 85 | If you don't want to install Anaconda, 86 | [you can see some other options here](http://jupyter.readthedocs.io/en/latest/install.html). 87 | 88 | 89 | #### Option 2: only if Option 1 failed. 90 | 91 | You can run my notebook in a virtual machine on Binder. To launch the VM, press this button: 92 | 93 | [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org:/repo/allendowney/compstats) 94 | 95 | You should see a home page with a list of the files in the repository. 96 | 97 | If you want to try the exercises, open `effect_size.ipynb`. If you just want to see the answers, open `effect_size_soln.ipynb`. Either way, you should be able to run the notebooks in your browser and try out the examples. 98 | 99 | However, be aware that the virtual machine you are running is temporary. If you leave it idle for more than an hour or so, it will disappear along with any work you have done. 100 | 101 | Special thanks to the generous people who run Binder, which makes it easy to share and reproduce computation. 102 | --------------------------------------------------------------------------------