├── README.md
├── assets
├── datacamp.svg
├── gfmt_faces.png
├── pcorr_conf_int.svg
└── pcorr_ecdf.svg
├── data
├── airbnb.csv
└── gfmt_sleep.csv
├── drafts
├── hacker_stats_in_python_part_1_with_ConfInt_class.ipynb
└── hacker_stats_in_python_part_1_with_ConfInt_class_solution.ipynb
└── notebooks
├── .gitignore
├── hacker_stats_in_python_part_1.ipynb
├── hacker_stats_in_python_part_1_solution.ipynb
├── hacker_stats_in_python_part_2.ipynb
├── hacker_stats_in_python_part_2_solution.ipynb
└── sandbox.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # **Hacker Stats in Python** by **Justin Bois**
2 |
3 |
4 | ## Step 1: Foundations
5 |
6 |
7 | ### A. What problem(s) will students learn how to solve? (minimum of 5 problems)
8 |
9 | This live training will review the concepts of statistical inference laid out in Statistical Thinking in Python I and II using a new data set (one that I think is rather fun!). The goal is to reinforce the concepts and techniques from those courses and help students gain confidence applying them to new data analysis tasks. Specifically, we will:
10 |
11 | - Review fundamental notions of probability.
12 | - Perform graphical exploratory data analysis (EDA).
13 | - Review concepts of confidence intervals and null hypothesis significance tests (NHSTs).
14 | - Apply bootstrap methods for computing confidence intervals.
15 | - Investigate correlations in two-dimensional data.
16 | - Perform a NHST comparing two treatments in a data set.
17 |
18 |
19 | ### B. What technologies, packages, or functions will students use? Please be exhaustive.
20 |
21 | - NumPy
22 | - pandas
23 | - seaborn
24 | - Jupyter notebooks using Google Colab
25 | - [dc_stat_think](https://github.com/justinbois/dc_stat_think)
26 |
27 |
28 | ### C. What terms or jargon will you define?
29 |
30 | In our probability review, we will define and discuss:
31 |
32 | - **Frequentist interpretation of probability**: The probability of an observation represents a long-run frequency over a large number of identical repetitions of an experiment. These repetitions can be, and often are, hypothetical. For example, if I were to list an item on eBay, the probability that it will sell for over $100 is the fraction of times it would do so in a very large number of hypothetical worlds in which I sell the item on eBay.
33 | - **Probability distributions**: Probability distributions provide the link between events and probability. For example, a Normal distribution might link people's heights to probability. It is highly probable that a person is between five and six feet tall, but improbable that a person is above seven feet tall.
34 | - **Generative distributions**: Obtaining a measurement (by any means: experimentation, surveys, consumer trials, etc.) involves drawing samples out of a probability distribution. This distribution is called a generative distribution, and we do not in general know what it is. We can build models for generative distributions, and we do this kind of modeling in the Statistical Thinking courses, but we will not cover it in this live session.
35 | - **Empirical distributions**: An empirical distribution is defined simply in terms of measured data. When we draw samples out of an empirical distribution, we randomly choose measurements we have already made.
36 | - **The plug-in principle**: Under the plug-in principle, we use the empirical distribution in place of the (unknown) true generative distribution in all of our statistical inference. This is the approach we will take in this live session.
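As a minimal sketch of the plug-in principle in code (the numbers here are invented for illustration, not data from this study): drawing from the empirical distribution is just `np.random.choice` applied to the measured data, and a plug-in estimate is an ordinary summary statistic of those data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements; the empirical distribution is just these data
data = np.array([61.5, 70.0, 77.5, 80.0, 85.0, 90.0, 92.5])

# Plug-in estimate of the mean of the (unknown) generative distribution
mean_estimate = np.mean(data)

# Drawing out of the empirical distribution = randomly choosing measurements
sample = rng.choice(data, size=5)
```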
37 |
38 |
39 |
40 | ### D. What mistakes or misconceptions do you expect?
41 |
42 | - The most common misconception concerns what constitutes a null hypothesis significance test. The result of a NHST is a p-value, defined as follows. Assume the null hypothesis is indeed true. With this assumption in place, the p-value is the probability of obtaining a value of a test statistic at least as extreme as what was observed. So, to specify a NHST, we need a clear definition of the null hypothesis, the test statistic, and what it means to be at least as extreme.
43 | - The distinction between a bootstrap hypothesis test and a permutation test is often confusing. Different hypotheses may be assessed by the two approaches. A permutation test can address the null hypothesis that two data sets come from the same generative distribution. A bootstrap hypothesis test can address a null hypothesis, e.g., that two data sets come from two different generative distributions that nonetheless have the same mean.
44 | - The definition of a confidence interval is sometimes a sticking point. A 95% confidence interval of a mean (or median, standard deviation, ...) may be defined as follows. If we were to repeat a set of measurements again and again and again, then the mean (or median, standard deviation, ...) that we compute from the data would fall within the 95% confidence interval for 95% of these repeated experiments. In practice, we repeat the experiment by sampling out of the empirical distribution; this is bootstrapping.
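This bootstrapping procedure can be sketched on invented data (the live session uses the GFMT data set instead): resample with replacement, recompute the statistic, and take percentiles of the replicates.

```python
import numpy as np

rng = np.random.default_rng(8675309)

# Hypothetical measurements standing in for a real data set
data = rng.normal(165, 9, size=100)

# Bootstrap replicates of the mean: resample with replacement, recompute
bs_reps = np.array(
    [np.mean(rng.choice(data, size=len(data))) for _ in range(10_000)]
)

# The 2.5th and 97.5th percentiles give a 95% confidence interval
conf_int = np.percentile(bs_reps, [2.5, 97.5])
```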
45 |
46 |
47 | ### E. What datasets will you use?
48 |
49 | We will be working with a fun data set. In a 2016 paper, [Beattie, et al.](https://doi.org/10.1098/rsos.160321) used the [Glasgow Facial Matching Test](https://en.wikipedia.org/wiki/Glasgow_Face_Matching_Test) (GFMT, [original paper](https://doi.org/10.3758/BRM.42.1.286)) to investigate how sleep deprivation affects a human subject’s ability to match faces, as well as the confidence the subject has in those matches. Briefly, the test works by having subjects look at a pair of faces. Two such pairs are shown below.
50 |
51 | 
52 |
53 | For each pair of faces, the subject gets as much time as he or she needs and then says whether or not they are the same person. The subject then rates his or her confidence in the choice.
54 |
55 | In this study, subjects also took surveys to determine properties about their sleep. The Sleep Condition Indicator (SCI) is a measure of insomnia disorder over the past month (scores of 16 and below indicate insomnia). The Pittsburgh Sleep Quality Index (PSQI) quantifies how well a subject sleeps in terms of interruptions, latency, etc. A higher score indicates poorer sleep. The Epworth Sleepiness Scale (ESS) assesses daytime drowsiness.
56 |
57 | We will explore how the various sleep metrics are related to each other and how sleep disorders affect subjects' ability to discern faces and their confidence in doing so.
58 |
59 |
60 | ## Step 2: Who is this session for?
61 |
62 | This session is for anyone who wants to sharpen their skills in statistical inference. These skills apply across all industries and disciplines of interest to DataCamp learners; they are key for anyone working with data. Participants should have completed the DataCamp courses Statistical Thinking in Python I and II.
63 |
64 |
65 | ### What roles would this live training be suitable for?
66 |
67 | *Check all that apply.*
68 |
69 | - [x] Data Consumer
70 | - [x] Leader
71 | - [x] Data Analyst
72 | - [x] Citizen Data Scientist
73 | - [x] Data Scientist
74 | - [x] Data Engineer
75 | - [ ] Database Administrator
76 | - [x] Statistician
77 | - [x] Machine Learning Scientist
78 | - [ ] Programmer
79 | - [ ] Other (please describe)
80 |
81 | ### What industries would this apply to?
82 |
83 | The topics of this live training are quite general. Performing EDA, computing confidence intervals, and (though to a lesser extent) performing hypothesis tests apply across many industries and applications. Whether you are doing business analytics, quality control, public health, science, or really anything involving collection and interpretation of data, statistical inference plays an important role.
84 |
85 |
86 | ### What level of expertise should learners have before beginning the live training?
87 |
88 | Learners should be able to do the following heading into the live session.
89 |
90 | - Extract columns from pandas DataFrames.
91 | - Be comfortable with NumPy, particularly the random module.
92 | - Be able to make basic plots with Matplotlib; simple scatter plots should be sufficient.
93 | - We will review computing summary statistics (like means and medians), drawing bootstrap samples, and performing linear regressions, but familiarity with those methods will be helpful.
94 |
95 |
96 | ## Step 3: Prerequisites
97 |
98 | Learners should have completed the DataCamp courses Statistical Thinking in Python I and II.
99 |
100 |
101 | ## Step 4: Session Outline
102 |
103 | ### Introduction Slides
104 | - Introduction to the webinar and instructor (led by DataCamp TA)
105 | - Our approach to statistical inference
106 | + Statistical inference: deduction of properties of a generative distribution
107 | + The frequentist interpretation of probability and probability distributions
108 | + Hacker stats allows direct application of probability without mathematical gymnastics
109 | - Objectives
110 | + Obtain plug-in estimates and confidence intervals for pertinent parameters
111 | + Compare effect sizes between two samples
112 | + Perform a null hypothesis significance test
113 | + ...all with hacker stats using Python!
114 | - Introduction to the data set
115 | + Domain knowledge is key for determining analysis procedures (no data scientist should work alone)
116 | + Introduction to GFMT and sleep disorder study
117 |
118 |
119 | ### Live Training
120 |
121 | #### Exploratory data analysis
122 | - Import data into a pandas DataFrame and display using `df.head()`.
123 | - Add a column to a data frame.
124 | - Extract a column while dropping NaNs.
125 | - Generate ECDFs.
126 | - Make scatter plots of possibly correlated variables.
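These EDA steps might look like the following minimal sketch. A tiny hand-made DataFrame stands in for the real file (which you would load with `pd.read_csv('data/gfmt_sleep.csv')`); the column names match the data set.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv('data/gfmt_sleep.csv'); columns as in the real file
df = pd.DataFrame({
    'percent correct': [72.5, 90.0, 92.5, np.nan, 62.5],
    'sci': [9, 4, 10, 13, 13],
})

# Add a column flagging insomnia (SCI scores of 16 and below)
df['insomnia'] = df['sci'] <= 16

# Extract a column while dropping NaNs
pc = df['percent correct'].dropna().values
```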
127 |
128 | #### Bootstrap confidence intervals
129 | - Review of bootstrapping procedure, definition of a confidence interval, and the plug-in principle.
130 | - Write a function to obtain bootstrap samples.
131 | - Compare an ECDF of a bootstrap sample to the original data set.
132 | - Write a function to obtain bootstrap replicates.
133 | - Make an ECDF of bootstrap replicates.
134 | - Compute percentiles of bootstrap replicates to obtain a confidence interval.
135 | - Make a graphical display of confidence intervals.
136 | - **Q & A**
137 |
138 | #### Pairs bootstrap confidence intervals
139 | - Write a function to obtain pairs bootstrap samples.
140 | - Write a function to obtain pairs bootstrap replicates of the Pearson correlation.
141 | - Compute percentiles of bootstrap replicates to obtain a confidence interval.
142 | - For what aspects of this data set should we perform a linear regression?
143 | - Bonus assignment for after the live session: Perform a linear regression where appropriate and give a pairs bootstrap confidence interval for the slope.
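A sketch of pairs bootstrapping on synthetic paired data (the function name and numbers here are illustrative, not from the session materials): resample (x, y) pairs together so that the relationship between the variables is preserved.

```python
import numpy as np

rng = np.random.default_rng(1234)

# Synthetic paired data standing in for two correlated sleep metrics
x = rng.normal(0, 1, size=50)
y = 0.5 * x + rng.normal(0, 1, size=50)

def pairs_bootstrap_pearson(x, y, size=1):
    """Pairs bootstrap replicates of the Pearson correlation."""
    inds = np.arange(len(x))
    reps = np.empty(size)
    for i in range(size):
        bs_inds = rng.choice(inds, size=len(inds))  # resample pairs, not values
        reps[i] = np.corrcoef(x[bs_inds], y[bs_inds])[0, 1]
    return reps

bs_reps = pairs_bootstrap_pearson(x, y, size=2000)
conf_int = np.percentile(bs_reps, [2.5, 97.5])
```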
144 |
145 | #### Null Hypothesis significance testing: a permutation test
146 | - Review definition of p-value and essential pieces of NHST specification.
147 | - Hacker stats approach: *simulate* data generation under the null hypothesis.
148 | - Hypothesis: two variables are identically distributed. Generate permutations to simulate it.
149 | - Compute p-value from permutation samples.
150 | - Bonus assignment for after the live session: How would you simulate a null hypothesis that the two variables are not necessarily identically distributed but do have the same mean? Simulate this and compute a p-value.
151 | - **Q & A**
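The permutation test above can be sketched as follows on invented samples (the numbers are hypothetical): pool the data, shuffle, re-split, and count replicates at least as extreme as the observed statistic.

```python
import numpy as np

rng = np.random.default_rng(2718)

# Hypothetical samples for two treatments
a = rng.normal(81, 10, size=25)
b = rng.normal(76, 10, size=30)

# Observed test statistic: difference of sample means
observed = np.mean(a) - np.mean(b)

# Simulate the null (identical distributions) by permuting the pooled data
pooled = np.concatenate((a, b))
perm_reps = np.empty(10_000)
for i in range(10_000):
    perm = rng.permutation(pooled)
    perm_reps[i] = np.mean(perm[:len(a)]) - np.mean(perm[len(a):])

# p-value: fraction of replicates at least as extreme as observed
p_value = np.mean(perm_reps >= observed)
```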
152 |
153 |
154 | ### Ending slides
155 | - Recap of what we learned
156 | - Emphasize the importance of thinking probabilistically.
157 | - You can use your computer to do probability directly.
158 |
159 |
--------------------------------------------------------------------------------
/assets/datacamp.svg:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/assets/gfmt_faces.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/Hacker-Stats-in-Python-Live-Training/4f3fc34bd0c6038a3c5e29b0e6b0350b3aede8d6/assets/gfmt_faces.png
--------------------------------------------------------------------------------
/assets/pcorr_conf_int.svg:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/assets/pcorr_ecdf.svg:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/data/gfmt_sleep.csv:
--------------------------------------------------------------------------------
1 | participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence when incorrect hit,confidence when correct reject,confidence when incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
2 | 8,f,39,65,80,72.5,91,90,93,83.5,93,90,9,13,2
3 | 16,m,42,90,90,90,75.5,55.5,70.5,50,75,50,4,11,7
4 | 18,f,31,90,95,92.5,89.5,90,86,81,89,88,10,9,3
5 | 22,f,35,100,75,87.5,89.5,*,71,80,88,80,13,8,20
6 | 27,f,74,60,65,62.5,68.5,49,61,49,65,49,13,9,12
7 | 28,f,61,80,20,50,71,63,31,72.5,64.5,70.5,15,14,2
8 | 30,m,32,90,75,82.5,67,56.5,66,65,66,64,16,9,3
9 | 33,m,62,45,90,67.5,54,37,65,81.5,62,61,14,9,9
10 | 34,f,33,80,100,90,70.5,76.5,64.5,*,68,76.5,14,12,10
11 | 35,f,53,100,50,75,74.5,*,60.5,65,71,65,14,8,7
12 | 38,f,41,70,55,62.5,82,61.5,73,69,82,64,14,5,19
13 | 41,f,36,90,100,95,76.5,75.5,75,*,76,75.5,15,7,0
14 | 46,f,40,95,65,80,80,89,79,58.5,79.5,63,10,12,8
15 | 49,f,24,85,75,80,58,50,49,68,55,59,14,13,4
16 | 55,f,32,75,55,65,85,81,85,86,85,83.5,5,13,7
17 | 71,f,40,40,100,70,69,56,70,*,70,56,0,11,14
18 | 76,f,61,100,40,70,69.5,*,44.5,73,54.5,73,16,4,12
19 | 77,f,42,70,90,80,87,72,90.5,43.5,88.5,64,11,10,10
20 | 78,m,31,100,70,85,92,*,81,60,87.5,60,14,6,11
21 | 80,m,28,100,50,75,100,*,100,100,100,100,12,7,12
22 | 89,f,26,60,80,70,70,77,82,67.5,77,70.5,14,8,1
23 | 90,m,45,100,95,97.5,100,*,100,100,100,100,14,9,6
24 | 93,f,28,100,75,87.5,89.5,*,67,60,80,60,16,7,4
25 | 100,f,44,65,25,45,62,72,87,77,69.5,73.5,1,15,6
26 | 101,f,28,100,40,70,87,*,68,54,81,54,14,7,2
27 | 1,f,42,80,65,72.5,51.5,44.5,43,49,51,49,29,1,5
28 | 2,f,45,80,90,85,75,55.5,80,75,78.5,67,19,5,1
29 | 3,f,16,70,80,75,70,57,54,53,57,54.5,23,1,3
30 | 4,f,21,70,65,67.5,63.5,64,50,50,60,50,26,5,4
31 | 5,f,18,90,100,95,76.5,83,80,*,80,83,21,7,5
32 | 6,f,28,95,80,87.5,100,85,94,61,99,65,19,7,12
33 | 7,f,38,90,95,92.5,77,43.5,79,21,78,36,28,3,4
34 | 9,m,17,90,90,90,80.5,87.5,76.5,27,78.5,67.5,29,3,4
35 | 10,f,25,100,100,100,90,*,85,*,90,*,17,10,11
36 | 11,f,22,80,60,70,70,70,70,65,70,70,22,4,6
37 | 12,m,41,90,80,85,76.5,55.5,67.5,52.5,74,55.5,28,5,3
38 | 13,m,53,95,60,77.5,40,33,56,49,47,44,31,2,11
39 | 14,m,43,95,90,92.5,52,29,49,36,52,29,31,2,10
40 | 15,f,23,90,80,85,88,40,70.5,66.5,84,54.5,32,2,12
41 | 17,m,19,55,60,57.5,62,50,66,50.5,63,50,25,5,6
42 | 19,f,45,100,85,92.5,68,*,61,54,62,54,30,2,13
43 | 20,f,43,65,65,65,59,55,64,59,59.5,57,28,3,12
44 | 21,m,35,90,100,95,75.5,35.5,74.5,*,75.5,35.5,30,3,5
45 | 23,m,24,55,100,77.5,68,67,80,*,78,67,22,4,10
46 | 24,f,64,75,85,80,50,25,66,24,63,24.5,20,9,4
47 | 25,f,36,100,80,90,88,*,66,63.5,81,63.5,26,5,4
48 | 26,m,35,70,90,80,29.5,28.5,19,41.5,24,35,32,2,6
49 | 29,f,43,90,85,87.5,65.5,27,41,45,50,45,32,1,3
50 | 31,m,44,95,90,92.5,83,56,79,42.5,81,54,24,4,0
51 | 32,f,29,95,55,75,67,53,65,53,67,53,26,8,12
52 | 36,m,22,70,65,67.5,78,88,77,80,77,84,30,2,6
53 | 37,f,46,80,75,77.5,72.5,72,100,91,100,91,28,2,5
54 | 39,f,35,50,95,72.5,62.5,29,77,32,68,31,21,6,9
55 | 40,m,53,65,80,72.5,54,32,47.5,78.5,48,51,29,3,7
56 | 42,m,29,100,70,85,75,*,64.5,43,74,43,32,1,6
57 | 43,f,31,85,90,87.5,82,49,81,36,82,49,26,5,10
58 | 44,f,21,85,90,87.5,66,29,70,29,67,29,26,7,18
59 | 45,f,42,90,90,90,83,83,80.5,36,82.5,76,23,3,11
60 | 48,f,23,90,85,87.5,67,47,69,40,67,40,18,6,8
61 | 50,m,54,90,70,80,90,83.5,77.5,69,88,79,22,6,16
62 | 51,f,24,85,95,90,97,41,74,73,83,55.5,29,1,7
63 | 52,f,21,85,75,80,65,73,56,68,63.5,68,20,6,9
64 | 53,f,21,90,80,85,84,55.5,73.5,70,80.5,65,27,4,11
65 | 54,f,43,95,75,85,74,89,68,65,71,68,19,4,4
66 | 56,m,50,70,85,77.5,92.5,72.5,95,65,95,65,29,3,7
67 | 57,f,53,95,75,85,84,55,68,61,78.5,58,24,5,4
68 | 58,f,16,85,85,85,55,30,50,40,52.5,35,29,2,11
69 | 59,f,67,95,75,85,70,7,69,60,69,59.5,17,7,12
70 | 60,m,36,90,65,77.5,67.5,28.5,55,52,61,50,26,4,3
71 | 61,f,34,90,90,90,58.5,43,73.5,53.5,66,47.5,30,0,3
72 | 62,f,42,100,100,100,74.5,*,74,*,74,*,17,5,4
73 | 63,f,46,80,90,85,92,75.5,92,63,92,73.5,25,1,11
74 | 64,f,69,95,80,87.5,80,65,78.5,70.5,80,70,31,1,1
75 | 65,f,31,100,95,97.5,98,*,90,40,92,40,27,4,4
76 | 66,f,44,90,95,92.5,87,47.5,69,87,83,67,32,1,2
77 | 67,f,25,100,100,100,61.5,*,58.5,*,60.5,*,28,8,9
78 | 68,f,45,70,50,60,80.5,51.5,63,69,72.5,61.5,25,4,1
79 | 69,f,47,90,100,95,100,*,71.5,83,97.5,83,30,2,2
80 | 70,f,33,85,70,77.5,70,38,58.5,65,68,40,21,7,12
81 | 72,f,18,80,75,77.5,67.5,51.5,66,57,67,53,29,4,6
82 | 73,f,74,85,80,82.5,66,55,63,50.5,65,55,20,1,5
83 | 74,m,21,40,40,40,90.5,80,74.5,83,82,81,22,7,5
84 | 75,f,45,80,95,87.5,74,67,76,17,75,64,23,4,4
85 | 79,f,37,90,80,85,95.5,68,83.5,83,94,71,20,5,9
86 | 81,m,41,90,85,87.5,80,59.5,70,41,77,59,17,6,3
87 | 82,f,41,80,75,77.5,94.5,61.5,86,74,92,67,27,4,8
88 | 83,f,34,90,35,62.5,81,52,71,58,81,58,27,2,6
89 | 84,f,39,75,70,72.5,57,57,59.5,50,58,50,22,3,10
90 | 85,f,18,85,85,85,93,92,91,89,91.5,91,25,4,21
91 | 86,f,31,100,85,92.5,100,*,100,50,100,50,30,3,5
92 | 87,m,26,95,75,85,85,88,82,82,85,85,32,1,5
93 | 88,m,66,60,85,72.5,67.5,66,74,57,74,64,30,5,9
94 | 91,m,62,100,80,90,81,*,74.5,82,79.5,82,32,2,1
95 | 92,m,22,85,95,90,66,56,72,63,70.5,59.5,28,1,8
96 | 94,f,41,35,75,55,55,61,80,57,72,60,31,1,11
97 | 95,m,46,95,80,87.5,90,75,80,80,85,75,29,3,5
98 | 96,f,56,70,50,60,63,52.5,67.5,65.5,64,59.5,26,6,7
99 | 97,f,23,70,85,77.5,77,66.5,77,77.5,77,74,20,8,10
100 | 98,f,70,90,85,87.5,65.5,85.5,87,80,74,80,19,8,7
101 | 99,f,24,70,80,75,61.5,81,70,61,65,81,31,2,15
102 | 102,f,40,75,65,70,53,37,84,52,81,51,22,4,7
103 | 103,f,33,85,40,62.5,80,27,31,82.5,81,73,24,5,7
--------------------------------------------------------------------------------
/drafts/hacker_stats_in_python_part_1_with_ConfInt_class.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "6Ijg5wUCTQYG"
8 | },
9 | "source": [
10 | "\n",
11 | "\n",
12 | "\n",
13 | "\n",
14 | "\n",
15 | "This live training will review the concepts of statistical inference laid out in the DataCamp courses [Statistical Thinking in Python I](https://learn.datacamp.com/courses/statistical-thinking-in-python-part-1) and [Statistical Thinking in Python II](https://learn.datacamp.com/courses/statistical-thinking-in-python-part-2). The concepts of those courses were reviewed and fortified in [Case Studies in Statistical Thinking](https://learn.datacamp.com/courses/case-studies-in-statistical-thinking). You can link to those courses by clicking on the badges below.\n",
16 | "\n",
17 | "\n",
24 | "\n",
25 | "\n",
26 | "Our goal here is to reinforce the concepts and techniques of statistical inference using hacker stats to help students gain confidence applying them to new data analysis tasks. In this part of the training, we will:\n",
27 | "\n",
28 | "- Review fundamental notions of probability and clarify the tasks of statistical inference.\n",
29 | "- Perform graphical exploratory data analysis (EDA).\n",
30 | "- Review concepts of confidence intervals.\n",
31 | "- Apply bootstrap methods for computing confidence intervals.\n",
32 | "\n",
33 | "\n",
39 | " \n",
40 | "We will do all of this using a data set (described [below](#The-Dataset)) which explores how sleep deprivation affects facial recognition tasks in humans."
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## **Necessary packages**\n",
48 | "\n",
49 | "It is good practice to always have necessary package imports at the top of any `.py` file or notebook, so let's get our imports in before moving on.\n",
50 | "\n",
51 | "Unless I'll be plotting very large data sets (which is not the case in this webinar), I generally also like to set the figure format to be SVG. This avoids a pixelated look to the plots."
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {
58 | "colab": {},
59 | "colab_type": "code",
60 | "id": "EMQfyC7GUNhT"
61 | },
62 | "outputs": [],
63 | "source": [
64 | "import numpy as np\n",
65 | "import scipy.stats\n",
66 | "import pandas as pd\n",
67 | "import matplotlib.pyplot as plt\n",
68 | "import seaborn as sns\n",
69 | "sns.set()\n",
70 | "\n",
71 | "# I want crisp graphics, so we'll output SVG\n",
72 | "%config InlineBackend.figure_format = 'svg'"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "## **Statistical inference**\n",
80 | "\n",
81 | "In this section of the notebook, I will discuss the goals of statistical inference and lay out the approach we will take. You may have seen these ideas presented before, but in my experience, investigating a topic from multiple angles leads to a much fuller understanding, and that is my goal here.\n",
82 | "\n",
83 | "We will dive into our (rather fun) data set momentarily, but for now, we will go over some important theoretical background to make sure we understand what statistical inference is all about, and how the hacker stats approach works."
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "### **What is probability?**\n",
91 | "\n",
92 | "To think about what probability means, consider an example. Say I am interested in the heights of American adult women. If we were to select an American woman at random, I think we would all agree that she is probably going to be under 190 cm (about 6' 3\"). We just used the word \"probably!\" To see what we mean by that, let's pick another woman at random. And another. And another. And another. We do this over and over and over. Eventually, we will pick a woman who is over 190 cm tall (for example, basketball legend [Lisa Leslie](https://en.wikipedia.org/wiki/Lisa_Leslie) is 196 cm). So, the probability that a randomly chosen woman is over 190 cm is not zero, but it is small.\n",
93 | "\n",
94 | "We can directly apply this mode of thinking to give us an interpretation of probability, referred to as the **frequentist interpretation of probability**. The probability of an observation represents a long-run frequency over a large number of identical repetitions of a measurement/data collection. These repetitions can be, and often are, hypothetical."
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "### **Probability distributions and CDFs**\n",
102 | "\n",
103 | "A **probability distribution** provides the link between observations and probability. For example, a Normal distribution might link women's heights to probability. It is highly probable that a woman is between 150 and 180 cm (about five and six feet) tall, but improbable that a woman is above 190 cm.\n",
104 | "\n",
105 | "These statements of probability stem from the **cumulative distribution function** (CDF) associated with a probability distribution. The CDF evaluated at _x_ is defined as follows.\n",
106 | "\n",
107 | ">CDF(_x_) = probability of observing a value less than or equal to _x_.\n",
108 | "\n",
109 | "To get an idea of what a CDF might look like when we plot it, we can plot the CDF for a Normal distribution, say describing the height of American adult women. They average 165 cm, with a standard deviation of 9 cm. The mathematical expression for the CDF is complicated but known, and we can use the built-in functions in `scipy.stats` to generate the values."
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": null,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "# x-values for height CDF\n",
119 | "\n",
120 | "# Values of the CDF\n",
121 | "\n",
122 | "# Make the plot"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "Remember, the CDF evaluated at a given position along the x-axis is the probability of having an observation less than or equal to _x_. So, if we look at _x_ = 160 cm, we see that the probability that a woman is shorter than 160 cm is about 0.3.\n",
130 | "\n",
131 | "Importantly, the CDF contains all of the information about a distribution."
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "### **Generative distributions**\n",
139 | "\n",
140 | "Say we did the experiment of measuring women's heights. For cost reasons (time and money), we can measure 100 women to get 100 height measurements. We might expect about 30 of these measurements to be below 160 cm and, based on the CDF above, about 7 of these measurements to be above 180 cm. This assumes that the distribution that provides that link between height measurements and probability is in fact a Normal distribution. If this is true, then the Normal distribution *generates* the data; it is the **generative distribution**.\n",
141 | "\n",
142 | "If we *know* the true generative distribution, then drawing random numbers out of the distribution is the same as performing the measurements themselves. We can draw random numbers out of a probability distribution using the `numpy.random` module. So, we can \"repeat\" the measurement of 100 women by drawing out of the generative distribution."
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": null,
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "# For reproducibility, seed the random number generator\n",
152 | "np.random.seed(3252)\n",
153 | "\n",
154 | "# Draw 100 measurements from a normal with mean 165 and std 9"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "---\n",
162 | "\n",
163 | "Q&A 1\n",
164 | "\n",
165 | "---"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "### **The empirical cumulative distribution function**\n",
173 | "\n",
174 | "The **empirical cumulative distribution function**, or ECDF, is a useful plot to make when doing exploratory data analysis. The ECDF at position _x_ is defined as\n",
175 | "\n",
176 | ">ECDF(_x_) = fraction of data points ≤ _x_.\n",
177 | "\n",
178 | "Compare this to the definition of the CDF.\n",
179 | "\n",
180 | ">CDF(_x_) = probability of observing a value less than or equal to _x_.\n",
181 | "\n",
182 | "Think about the frequentist interpretation of probability. If we had many, many measurements, then the fraction of data points less than or equal to _x_ is in fact the *probability* of observing a value less than or equal to _x_. So, the ECDF is close to the CDF. The differences between an ECDF and the CDF of the generative distribution are due entirely to the fact that we have only a finite number of measurements for the ECDF.\n",
183 | "\n",
184 | "To plot an ECDF, we position a dot for each data point where the x-value is the value of the data point itself and the y-value is the fraction of data points less than or equal to the x-value. We can write a function to generate the x- and y-values for the ECDF."
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "def ecdf(data):\n",
194 | " \"\"\"Compute ECDF for a one-dimensional array of measurements.\"\"\"\n",
195 | " # Number of data points\n",
196 | " \n",
197 | " # x-data for the ECDF\n",
198 | "\n",
199 | " # y-data for the ECDF\n",
200 | "\n",
201 | " return x, y"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "metadata": {},
207 | "source": [
208 | "We can use this function to add the ECDF to our plot of the generative CDF."
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "# Compute ECDF\n",
218 | "\n",
219 | "# Make the plot"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "Indeed, the ECDF closely matches the CDF, but we do see variation from it, owing to the small sample size of 100. If we increase the sample size to 1000, the ECDF follows more closely."
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": null,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "# Draw more heights\n",
236 | "\n",
237 | "# Compute ECDF\n",
238 | "\n",
239 | "# Make the plot"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
246 | "### **The task of statistical inference**\n",
247 | "\n",
248 | "In the above example, we knew the generative distribution. If we did actually know the generative distribution ahead of time, there would be no point in doing measurements. We generally do not know what the generative distribution is, so we collect data. **The task of statistical inference is to deduce the properties of a generative distribution of data.**"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "### **The plug-in principle**\n",
256 | "\n",
257 | "We could imagine that we knew (or strongly suspected) that the generative distribution for American women's heights is Normal, but we do not know the parameters. In this case, the statistical inference task is largely to determine what the parameters of the generative distribution are.\n",
258 | "\n",
259 | "But imagine we do not know anything a priori about the generative distribution. We still make measurements, and the measurements are all we have. We do not know anything about the CDF, but we do know about the ECDF. As the CDF defines a generative distribution, so too does the ECDF define an **empirical distribution**. We can proceed with statistical inference by using the empirical distribution in place of the (unknown) generative distribution. This approximation is referred to as the **plug-in** principle. This principle underlies many **non-parametric** approaches to statistical inference, so named because we are not trying to find parameters of a generative distribution, but are using only information from the data themselves. Application of the plug-in principle is the approach we take in this tutorial."
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "### **Properties of distributions via plug-in estimates**\n",
267 | "\n",
268 | "One-dimensional distributions have properties you may have heard of, like means, medians, and variances. Imagine we have a set of measurements stored in a NumPy array `x`. Then, the plug-in estimates for the various properties of the generative distribution are shown in the table below.\n",
269 | "\n",
270 | "| Property | Plug-in estimate |\n",
271 | "|--------------------|-----------------------|\n",
272 | "| mean | `np.mean(x)` |\n",
273 | "| median | `np.median(x)` |\n",
274 | "| variance | `np.var(x)` |\n",
275 | "| standard deviation | `np.std(x)` |\n",
276 | "| _p_ percentile | `np.percentile(x, p)` |\n",
277 | "\n",
278 | "\n",
279 | "\n",
280 | "Two-dimensional distributions additionally have properties like covariances and correlations. We can also estimate these using the plug-in principle. Let `x` and `y` be NumPy arrays. Then, for two-dimensional distributions, the plug-in estimates are:\n",
281 | "\n",
282 | "| Property | Plug-in estimate |\n",
283 | "|--------------------|-------------------------------|\n",
284 | "| covariance | `np.cov(x, y, ddof=0)[0, 1]` |\n",
285 | "| correlation | `np.corrcoef(x, y)[0, 1]` |\n"
286 | ]
287 | },
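{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the tables above (a sketch with made-up numbers, not data from this tutorial), the plug-in estimates can be computed for any arrays of measurements:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Made-up measurements for illustration only\n",
"x = np.array([1.1, 2.3, 2.8, 3.5, 4.1, 5.2])\n",
"y = np.array([0.9, 2.0, 3.1, 3.3, 4.5, 4.9])\n",
"\n",
"# One-dimensional plug-in estimates\n",
"print(\"mean:  \", np.mean(x))\n",
"print(\"median:\", np.median(x))\n",
"print(\"std:   \", np.std(x))\n",
"\n",
"# Two-dimensional plug-in estimates\n",
"print(\"covariance: \", np.cov(x, y, ddof=0)[0, 1])\n",
"print(\"correlation:\", np.corrcoef(x, y)[0, 1])"
]
},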
288 | {
289 | "cell_type": "markdown",
290 | "metadata": {},
291 | "source": [
292 | "---\n",
293 | "\n",
"### **Q&A 2**\n",
295 | "\n",
296 | "---"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "### **Onwards!**\n",
304 | "\n",
305 | "We have now laid the theoretical groundwork. In the data set we present below and subsequent analysis, we will use the plug-in principle to deduce properties about the generative distribution. For each set of measurements we consider, we will use the empirical distribution as a plug-in replacement for the unknown generative distribution and compute relevant properties."
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {
311 | "colab_type": "text",
312 | "id": "6Ijg5wUCTQYG"
313 | },
314 | "source": [
315 | "## **The Dataset**\n",
316 | "\n",
317 | "The data set in this webinar comes from a study by [Beattie, et al.](https://doi.org/10.1098/rsos.160321) in which they used the [Glasgow Facial Matching Test](https://en.wikipedia.org/wiki/Glasgow_Face_Matching_Test) (GFMT, [original paper](https://doi.org/10.3758/BRM.42.1.286)) to investigate how sleep deprivation affects a human subject's ability to match faces, as well as the confidence the subject has in those matches. Briefly, the test works by having subjects look at a pair of faces. Two such pairs are shown below.\n",
318 | "\n",
"*(Figure: two example pairs of faces from the GFMT; `assets/gfmt_faces.png`.)*\n",
323 | "\n",
324 | "\n",
"For each of 40 pairs of faces, the subject gets as much time as he or she needs and then says whether or not the two faces belong to the same person. The subject then rates his or her confidence in the choice.\n",
326 | "\n",
"In this study, subjects also took surveys to determine properties about their sleep. The surveys provide three different metrics of sleep quality and wakefulness.\n",
328 | "\n",
329 | "- The Sleep Condition Indicator (SCI) is a measure of insomnia disorder over the past month. High scores indicate better sleep and scores of 16 and below indicate insomnia. \n",
330 | "- The Pittsburgh Sleep Quality Index (PSQI) quantifies how well a subject sleeps in terms of interruptions, latency, etc. A higher score indicates poorer sleep. \n",
331 | "- The Epworth Sleepiness Scale (ESS) assesses daytime drowsiness. Higher scores indicate greater drowsiness.\n",
332 | "\n",
333 | "We will explore how the various sleep metrics are related to each other and how sleep disorders affect subjects' ability to discern faces and their confidence in doing so."
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {
339 | "colab_type": "text",
340 | "id": "BMYfcKeDY85K"
341 | },
342 | "source": [
343 | "### **Loading and inspecting the data set**\n",
344 | "\n",
345 | "Let's load in the data set provided by Beattie and coworkers. We'll load in the data set and check out the first few rows using the `head()` method of pandas DataFrames. Importantly, missing data in this data set are denoted with an asterisk, which we specify using the `na_values` keyword argument."
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": null,
351 | "metadata": {
352 | "colab": {
353 | "base_uri": "https://localhost:8080/",
354 | "height": 479
355 | },
356 | "colab_type": "code",
357 | "id": "l8t_EwRNZPLB",
358 | "outputId": "36a85c6f-f2ae-44e0-ac01-fc55462bc616"
359 | },
360 | "outputs": [],
361 | "source": [
362 | "# Read in the dataset\n",
363 | "df = pd.read_csv(\n",
364 | " \"https://github.com/datacamp/Hacker-Stats-in-Python-Live-Training/blob/master/data/gfmt_sleep.csv?raw=True\",\n",
365 | " na_values=\"*\",\n",
366 | ")\n",
367 | "\n",
368 | "# Print header\n",
369 | "df.head()"
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "Here is some information about what is contained in each of the columns.\n",
377 | "\n",
378 | "- `participant number`: Unique identifier for each subject\n",
379 | "- `age`: Age of subject in years\n",
380 | "- `correct hit percentage`: Percentage of correct responses among trials for which the faces match\n",
381 | "- `correct reject percentage`: Percentage of correct responses among trials for which the faces do not match\n",
382 | "- `percent correct`: Percentage of correct responses among all trials\n",
383 | "- `confidence when correct hit`: Average confidence when the subject gave a correct response for trials for which the faces match\n",
384 | "- `confidence when incorrect hit`: Average confidence when the subject gave an incorrect response for trials for which the faces match\n",
385 | "- `confidence when correct reject`: Average confidence when the subject gave a correct response for trials for which the faces do not match\n",
386 | "- `confidence when incorrect reject`: Average confidence when the subject gave an incorrect response for trials for which the faces do not match\n",
"- `confidence when correct`: Average confidence when the subject gave a correct response for all trials\n",
"- `confidence when incorrect`: Average confidence when the subject gave an incorrect response for all trials\n",
389 | "- `sci`: The subject's Sleep Condition Indicator.\n",
390 | "- `psqi`: The subject's Pittsburgh Sleep Quality Index.\n",
391 | "- `ess`: The subject's Epworth Sleepiness Scale.\n",
392 | "\n",
393 | "Going forward, it will be useful to separate the subjects into two groups, insomniacs and normal sleepers. We will therefore add an `'insomnia'` column to the DataFrame with True/False entries. Recall that a person is deemed an insomniac if their SCI is 16 or below."
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {},
400 | "outputs": [],
401 | "source": [
"# Add a column to the data frame for insomnia\n",
"df[\"insomnia\"] = df[\"sci\"] <= 16"
403 | ]
404 | },
405 | {
406 | "cell_type": "markdown",
407 | "metadata": {},
408 | "source": [
409 | "It is important to know how many total subjects are included, so we can check on the length of the DataFrame."
410 | ]
411 | },
412 | {
413 | "cell_type": "code",
414 | "execution_count": null,
415 | "metadata": {},
416 | "outputs": [],
417 | "source": [
"# Number of entries in data set\n",
"len(df)"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {},
424 | "source": [
425 | "So, we have 102 subjects, hopefully enough to make meaningful comparisons.\n",
426 | "\n",
427 | "With our data set in place, we can get moving with statistical inference."
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "## **Our approach**\n",
435 | "\n",
"In this live training, we will use statistical inference to address the following question about the data:\n",
437 | "\n",
438 | "1. How different is the facial matching performance of insomniacs versus normal sleepers?\n",
439 | "\n",
440 | "In our next live training, we will address other aspects of the data set.\n",
441 | "\n",
442 | "2. How different is *confidence* in facial matching for insomniacs versus normal sleepers?\n",
443 | "3. How are the different sleep metrics correlated?\n",
"4. How do sleep metrics *correlate* with facial matching performance?\n",
445 | "\n",
446 | "Each question requires a different sort of analysis involving calculation of confidence intervals (this session) and p-values (the next session). Along the way, we will introduce the necessary theoretical and technical concepts.\n",
447 | "\n",
"Note that even though this webinar is about statistical inference, it is always important to do EDA first. Remember what [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) said,\n",
449 | "\n",
450 | "> \"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.\"\n",
451 | "\n",
452 | "In each of the analyses, we will start with exploratory data analysis."
453 | ]
454 | },
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {},
458 | "source": [
459 | "## **1. Performance of insomniacs versus normal sleepers**\n",
460 | "\n",
"Our first investigation is into how well insomniacs perform the face matching task compared to normal sleepers. As our first step in exploratory data analysis, we will make a plot of the ECDF of the percent correct on the facial matching test for the two categories.\n",
"\n",
"To do so, we will extract the values of the `'percent correct'` column of the DataFrame for normal sleepers and for insomniacs. We will be sure to drop any missing data (NaNs). We will also convert these respective pandas Series to NumPy arrays, which enable faster computing."
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": null,
469 | "metadata": {},
470 | "outputs": [],
471 | "source": [
"# Extract percent correct for normal sleepers\n",
"pcorr_normal = df.loc[~df[\"insomnia\"], \"percent correct\"].dropna().values\n",
"\n",
"# Extract percent correct for insomniacs\n",
"pcorr_insom = df.loc[df[\"insomnia\"], \"percent correct\"].dropna().values"
475 | ]
476 | },
477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "We can now compute the ECDFs and plot them."
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": null,
487 | "metadata": {},
488 | "outputs": [],
489 | "source": [
"# Compute ECDF for normal sleepers\n",
"x_normal, y_normal = ecdf(pcorr_normal)\n",
"\n",
"# Compute ECDF for insomniacs\n",
"x_insom, y_insom = ecdf(pcorr_insom)\n",
"\n",
"# Make plot of ECDFs\n",
"fig, ax = plt.subplots()\n",
"ax.plot(x_normal, y_normal, marker=\".\", linestyle=\"none\")\n",
"ax.plot(x_insom, y_insom, marker=\".\", linestyle=\"none\")"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "There are clearly fewer data points for insomniacs (25 versus 77 for normal sleepers), which will be important to consider as we do statistical inference. In eyeballing the ECDFs, it appears that those without insomnia perform a bit better; the ECDF is shifted rightward toward better scores."
502 | ]
503 | },
504 | {
505 | "cell_type": "markdown",
506 | "metadata": {},
507 | "source": [
508 | "### **Plug-in estimates**\n",
509 | "\n",
"We have already computed a plug-in estimate! The ECDF itself is a plug-in estimate for the CDF. From the ECDF, we can also directly read off plug-in estimates for any percentile. For example, the median is the 50th percentile; it is the percent correct where the ECDF is 0.5. That is, half of the measurements lie below and half above. The median for normal sleepers is 85 and that for insomniacs is 75, a difference of 10 percentage points.\n",
511 | "\n",
512 | "We can also get plug-in estimates for the mean. These we can't read directly off of the ECDF, but can compute them using the `np.mean()` function."
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": null,
518 | "metadata": {},
519 | "outputs": [],
520 | "source": [
"# Plug-in estimates for means\n",
"pcorr_normal_mean = np.mean(pcorr_normal)\n",
"pcorr_insom_mean = np.mean(pcorr_insom)\n",
"\n",
"# Print the results\n",
"print(\"Plug-in estimates for the mean percent correct:\")\n",
525 | "print(\"normal sleepers:\", pcorr_normal_mean)\n",
526 | "print(\"insomniacs: \", pcorr_insom_mean)"
527 | ]
528 | },
529 | {
530 | "cell_type": "markdown",
531 | "metadata": {},
532 | "source": [
533 | "There is about a 5% difference in the mean scores. In looking at the ECDFs, it seems like this (or the median) might be a good difference to use to compare insomniacs and normal sleepers because the ECDFs are similar at the tails (low and high percent correct), but differ in the middle."
534 | ]
535 | },
536 | {
537 | "cell_type": "markdown",
538 | "metadata": {},
539 | "source": [
540 | "### **Computing a confidence interval**\n",
541 | "\n",
542 | "So, we are now faced with the question: If I were to do the same experiment again, how much variation would I get in the mean percent correct? Might we again see that the insomniacs perform more poorly?\n",
543 | "\n",
"To answer this question, we can compute a **confidence interval**, which can be defined as follows.\n",
545 | "\n",
546 | ">If an experiment is repeated over and over again, the estimate I compute will lie between the bounds of the 95% confidence interval for 95% of the experiments.\n",
547 | "\n",
"So, all we have to do is go to Scotland, randomly select 102 people, 27 of whom are insomniacs, have them perform the face matching test, record the results, and compute the mean for insomniacs and normal sleepers. Then, we have to go back to Scotland, do the whole procedure again, and do that again, and again, and again. Simple, right?\n",
549 | "\n",
550 | "Of course, we can't do that! But remember that performing an experiment is the same thing as drawing random samples out of the generative distribution. Because the generative distribution is unknown, the only way we know how to sample out of it is to literally do the experiment again, which is just not possible. However, we can use the plug-in principle to *approximate* the generative distribution with the empirical distribution. We *can* sample out of the empirical distribution using NumPy's random number generation! A sample of a new data set drawn from the empirical distribution is called a **bootstrap sample**. \n",
551 | "\n",
"Imagine we have a set of measurements stored in a NumPy array `data`. To get a bootstrap sample, we use `np.random.choice()` to draw `len(data)` entries out of the array `data`. We do this *with replacement* (the default for `np.random.choice()`). The result is a bootstrap sample. The syntax is\n",
553 | "\n",
554 | " bs_sample = np.random.choice(data, len(data))\n",
555 | " \n",
556 | "The bootstrap sample is approximately a new data set drawn from the generative distribution.\n",
557 | "\n",
558 | "After drawing a bootstrap sample, we want to compute the mean in order to see how it will change from experiment to experiment. A mean (or other value of interest) computed from a bootstrap sample is referred to as a **bootstrap replicate**. We can write a function to compute a bootstrap replicate. This function takes as arguments a 1D array of data `data` and a function `func` that is to be applied to the bootstrap samples to return a bootstrap replicate."
559 | ]
560 | },
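{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, with a toy array (a sketch for illustration, not the study data), drawing a bootstrap sample looks like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy data set for illustration only\n",
"toy_data = np.array([1, 2, 3, 4, 5])\n",
"\n",
"# Draw len(toy_data) entries with replacement\n",
"np.random.choice(toy_data, len(toy_data))"
]
},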
561 | {
562 | "cell_type": "code",
563 | "execution_count": null,
564 | "metadata": {},
565 | "outputs": [],
566 | "source": [
"def bootstrap_replicate_1d(data, func):\n",
"    \"\"\"Generate bootstrap replicate of 1D data.\"\"\"\n",
"    return func(np.random.choice(data, len(data)))"
569 | ]
570 | },
571 | {
572 | "cell_type": "markdown",
573 | "metadata": {},
574 | "source": [
"Now, we want to compute many of these replicates so we can see what range of values for the mean comprises the middle 95%, which gives a 95% confidence interval. It is therefore useful to write a function to draw many bootstrap replicates."
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": null,
581 | "metadata": {},
582 | "outputs": [],
583 | "source": [
"def draw_bs_reps(data, func, size=1):\n",
"    \"\"\"Draw `size` bootstrap replicates.\"\"\"\n",
"    return np.array([bootstrap_replicate_1d(data, func) for _ in range(size)])"
586 | ]
587 | },
588 | {
589 | "cell_type": "markdown",
590 | "metadata": {},
591 | "source": [
592 | "Excellent! Let's put these functions to use to draw bootstrap replicates of the mean for normal sleepers and for insomniacs. Because this calculation is fast, we can \"do\" the experiment over and over again many times. We'll do it 10,000 times."
593 | ]
594 | },
595 | {
596 | "cell_type": "code",
597 | "execution_count": null,
598 | "metadata": {},
599 | "outputs": [],
600 | "source": [
"# Draw bootstrap replicates for the mean\n",
"bs_reps_normal = draw_bs_reps(pcorr_normal, np.mean, size=10000)\n",
"bs_reps_insom = draw_bs_reps(pcorr_insom, np.mean, size=10000)\n",
"\n",
"# Take a quick peek\n",
"bs_reps_normal"
605 | ]
606 | },
607 | {
608 | "cell_type": "markdown",
609 | "metadata": {},
610 | "source": [
611 | "The replicates are stored in NumPy arrays of length 10,000. The values hover around the means, but they do vary.\n",
612 | "\n",
613 | "We can compute the percentiles of the bootstrap replicates using the `np.percentile()` function. We pass in the array we want to compute percentiles for, followed by a list of the percentiles we want. For a 95% confidence interval, we can use `[2.5, 97.5]`, which will give the middle 95% of the samples."
614 | ]
615 | },
616 | {
617 | "cell_type": "code",
618 | "execution_count": null,
619 | "metadata": {},
620 | "outputs": [],
621 | "source": [
"# Compute 95% confidence intervals of the mean\n",
"conf_int_normal = np.percentile(bs_reps_normal, [2.5, 97.5])\n",
"conf_int_insom = np.percentile(bs_reps_insom, [2.5, 97.5])\n",
"\n",
"# Print confidence intervals\n",
625 | "print(\"Normal sleepers:\", conf_int_normal)\n",
626 | "print(\"Insomniacs: \", conf_int_insom)"
627 | ]
628 | },
629 | {
630 | "cell_type": "markdown",
631 | "metadata": {},
632 | "source": [
"The 95% confidence interval of the mean for normal sleepers spans about 5 percentage points, from 79% to 84%. That for insomniacs is about twice as wide, ranging from 71% to 81%."
634 | ]
635 | },
636 | {
637 | "cell_type": "markdown",
638 | "metadata": {},
639 | "source": [
640 | "---\n",
641 | "\n",
"### **Q&A 3**\n",
643 | "\n",
644 | "---"
645 | ]
646 | },
647 | {
648 | "cell_type": "markdown",
649 | "metadata": {},
650 | "source": [
651 | "### **Automating and organizing confidence intervals**\n",
652 | "\n",
653 | "There are many useful quantities we obtained by going through the above procedure. We took an array of data, computed the x- and y-values for an ECDF, defined a function to compute a plug-in estimate (in our case, the mean), used the function to compute the estimate, generated bootstrap replicates of the estimate, and computed a confidence interval. We did this for two categories, normal sleepers and insomniacs.\n",
654 | "\n",
655 | "At various points in our analyses, we might like to access these results. It is therefore useful to define a class to compute and store the results of a 1D confidence interval calculation."
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "execution_count": null,
661 | "metadata": {},
662 | "outputs": [],
663 | "source": [
664 | "class ConfInt1D(object):\n",
665 | " \"\"\"Class for computing and storing confidence intervals from\n",
666 | " one-dimensional data.\"\"\"\n",
667 | " def __init__(\n",
668 | " self, data, func, ptiles=(2.5, 97.5), n_bs_reps=10000, category=None\n",
669 | " ):\n",
670 | " # Store data and settings\n",
671 | " self.data = data\n",
672 | " self.func = func\n",
673 | " self.ptiles = ptiles\n",
674 | " self.n_bs_reps = n_bs_reps\n",
675 | " self.category = category\n",
676 | " \n",
677 | " # Compute ECDF x and y values\n",
678 | " self.ecdf_x, self.ecdf_y = ecdf(self.data)\n",
679 | "\n",
680 | " # Compute plug-in estimate\n",
681 | " self.estimate = func(data)\n",
682 | " \n",
683 | " # Compute bootstrap confidence interval\n",
684 | " self.bs_reps = draw_bs_reps(data, func, size=n_bs_reps)\n",
685 | " self.conf_int = np.percentile(self.bs_reps, ptiles)"
686 | ]
687 | },
688 | {
689 | "cell_type": "markdown",
690 | "metadata": {},
691 | "source": [
692 | "Many learners may be less familiar with some of the object-oriented features in Python, so I will briefly explain how this works. To instantiate the class, we essentially make a function call."
693 | ]
694 | },
695 | {
696 | "cell_type": "code",
697 | "execution_count": null,
698 | "metadata": {},
699 | "outputs": [],
700 | "source": [
701 | "# Store the percent correct for insomniacs in a ConfInt1D object\n",
702 | "pcorr_insom = ConfInt1D(pcorr_insom, np.mean, category=\"insomniacs\")"
703 | ]
704 | },
705 | {
706 | "cell_type": "markdown",
707 | "metadata": {},
708 | "source": [
709 | "Whenever you instantiate a class in Python, the `__init__()` function gets called. For this class, this involves computing the ECDF values, computing the plug-in estimate, drawing and storing the bootstrap replicates, and computing the confidence interval. Important attributes are stored in the `pcorr_insom` variable. For example, to access the confidence interval, we use `pcorr_insom.conf_int`."
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": null,
715 | "metadata": {},
716 | "outputs": [],
717 | "source": [
718 | "pcorr_insom.conf_int"
719 | ]
720 | },
721 | {
722 | "cell_type": "markdown",
723 | "metadata": {},
724 | "source": [
725 | "We could get the bootstrap replicates as well."
726 | ]
727 | },
728 | {
729 | "cell_type": "code",
730 | "execution_count": null,
731 | "metadata": {},
732 | "outputs": [],
733 | "source": [
734 | "pcorr_insom.bs_reps"
735 | ]
736 | },
737 | {
738 | "cell_type": "markdown",
739 | "metadata": {},
740 | "source": [
741 | "So, the class allows convenient access to all of the statistical features we calculated with our data set."
742 | ]
743 | },
744 | {
745 | "cell_type": "markdown",
746 | "metadata": {},
747 | "source": [
748 | "### **Visualizing confidence intervals**\n",
749 | "\n",
"The confidence intervals we printed above are useful, but they are perhaps better visualized graphically. The function below generates a plot of confidence intervals. Since our focus here is primarily on the concepts behind inference, you may for now take the plotting function as a black box. Briefly, we are using the more object-oriented mode of plotting with Matplotlib, where we first generate a `Figure` object and an `AxesSubplot` object using `plt.subplots()`. We then use the methods of the `AxesSubplot` object to populate the plot with markers and to make further modifications to the plot, such as adding axis labels. We plot the plug-in estimate (in this case, the mean) as a dot and the confidence interval as a line.\n",
751 | "\n",
752 | "The function takes as arguments a list of `ConfInt1D` instances."
753 | ]
754 | },
755 | {
756 | "cell_type": "code",
757 | "execution_count": null,
758 | "metadata": {},
759 | "outputs": [],
760 | "source": [
761 | "def plot_conf_ints(confints, palette=None):\n",
762 | " \"\"\"Plot confidence intervals with estimates.\"\"\"\n",
763 | " # Set a nice color palette\n",
764 | " if palette is None:\n",
765 | " palette = [\n",
766 | " \"#1f77b4\",\n",
767 | " \"#ff7f0e\",\n",
768 | " \"#2ca02c\",\n",
769 | " \"#d62728\",\n",
770 | " \"#9467bd\",\n",
771 | " \"#8c564b\",\n",
772 | " \"#e377c2\",\n",
773 | " \"#7f7f7f\",\n",
774 | " \"#bcbd22\",\n",
775 | " \"#17becf\",\n",
776 | " ]\n",
"    elif isinstance(palette, str):\n",
778 | " palette = [palette]\n",
779 | "\n",
780 | " cats = [ci.category for ci in confints][::-1]\n",
781 | " estimates = [ci.estimate for ci in confints][::-1]\n",
782 | " conf_intervals = [ci.conf_int for ci in confints][::-1]\n",
783 | " palette = palette[:len(cats)][::-1]\n",
784 | " \n",
785 | " # Set up axes for plot\n",
786 | " fig, ax = plt.subplots(figsize=(5, len(cats) / 2))\n",
787 | "\n",
788 | " # Plot estimates as dots and confidence intervals as lines\n",
789 | " for i, (cat, est, conf_int) in enumerate(zip(cats, estimates, conf_intervals)):\n",
790 | " color = palette[i % len(palette)]\n",
791 | " ax.plot(\n",
792 | " [est],\n",
793 | " [cat],\n",
794 | " marker=\".\",\n",
795 | " linestyle=\"none\",\n",
796 | " markersize=10,\n",
797 | " color=color,\n",
798 | " )\n",
799 | "\n",
800 | " ax.plot(conf_int, [cat] * 2, linewidth=3, color=color)\n",
801 | "\n",
802 | " # Make sure margins look ok\n",
803 | " ax.margins(y=0.25 if len(cats) < 3 else 0.125)\n",
804 | "\n",
805 | " return ax"
806 | ]
807 | },
808 | {
809 | "cell_type": "markdown",
810 | "metadata": {},
811 | "source": [
812 | "Let's make a plot. We set up `ConfInt1D` instances for normal sleepers and insomniacs and pass them to the function."
813 | ]
814 | },
815 | {
816 | "cell_type": "code",
817 | "execution_count": null,
818 | "metadata": {},
819 | "outputs": [],
820 | "source": [
"# Instantiate class for normal (already did insomniacs)\n",
"pcorr_normal = ConfInt1D(pcorr_normal, np.mean, category=\"normal sleepers\")\n",
"\n",
"# Make plot\n",
"ax = plot_conf_ints([pcorr_normal, pcorr_insom])"
824 | ]
825 | },
826 | {
827 | "cell_type": "markdown",
828 | "metadata": {},
829 | "source": [
"The difference in the lengths of the confidence intervals is starkly apparent on the plot. Because we have fewer measurements for insomniacs, our estimate of their mean performance is less precise.\n",
831 | "\n",
832 | "In looking at the plot of confidence intervals, it seems possible that if we did the experiment again, we might even get a scenario where insomniacs perform *better* than normal sleepers. But how likely is such a scenario?"
833 | ]
834 | },
835 | {
836 | "cell_type": "markdown",
837 | "metadata": {},
838 | "source": [
839 | "### **Confidence interval for difference of means**\n",
840 | "\n",
841 | "Remember that we are not restricted as to what confidence intervals we can compute. We can instead compute a confidence interval on the *difference* of means between normal sleepers and insomniacs. To do this, we do the following procedure, which again uses the plug-in principle to \"do\" the experiment again to get a bootstrap replicate of the difference of means.\n",
842 | "\n",
843 | "1. Generate a bootstrap sample of percent correct for normal sleepers.\n",
844 | "2. Generate a bootstrap sample of percent correct for insomniacs.\n",
845 | "3. Take the mean of each bootstrap sample, giving a bootstrap replicate for the mean of each.\n",
846 | "4. Subtract the mean for insomniacs from that of normal sleepers.\n",
847 | "\n",
848 | "This is actually trivial to do now because we have already computed and stored bootstrap replicates of the means! We simply have to subtract them."
849 | ]
850 | },
851 | {
852 | "cell_type": "code",
853 | "execution_count": null,
854 | "metadata": {},
855 | "outputs": [],
856 | "source": [
"# Get bootstrap replicates for difference of means\n",
"bs_reps_diff = bs_reps_normal - bs_reps_insom"
858 | ]
859 | },
860 | {
861 | "cell_type": "markdown",
862 | "metadata": {},
863 | "source": [
864 | "Now, we can compute the confidence interval by finding the percentiles."
865 | ]
866 | },
867 | {
868 | "cell_type": "code",
869 | "execution_count": null,
870 | "metadata": {},
871 | "outputs": [],
872 | "source": [
"# Compute confidence interval from bootstrap replicates\n",
"np.percentile(bs_reps_diff, [2.5, 97.5])"
874 | ]
875 | },
876 | {
877 | "cell_type": "markdown",
878 | "metadata": {},
879 | "source": [
880 | "The confidence interval just barely crosses zero, suggesting that the insomniacs will rarely perform better than normal sleepers. \n",
881 | "\n",
882 | "We can find out the *probability* of having the insomniacs perform better than the normal sleepers by counting how many times the mean percent correct for insomniacs exceeded that of normal sleepers and dividing by the total number of bootstrap replicates. To do the count, we can make an array containing `True` and `False` values for whether or not the difference of means is negative and sum the result (since `True` is worth 1 and `False` is worth 0)."
883 | ]
884 | },
885 | {
886 | "cell_type": "code",
887 | "execution_count": null,
888 | "metadata": {},
889 | "outputs": [],
890 | "source": [
"# Compute probability of having insomniacs have better mean score\n",
"np.sum(bs_reps_diff < 0) / len(bs_reps_diff)"
892 | ]
893 | },
894 | {
895 | "cell_type": "markdown",
896 | "metadata": {},
897 | "source": [
"So, if we were to do the experiment again, there is about a 3% chance we would observe the insomniacs performing at parity or better than the normal sleepers, at least based on the observations we have. If we made more observations, this chance could rise or fall; we cannot know without more measurements."
899 | ]
900 | },
901 | {
902 | "cell_type": "markdown",
903 | "metadata": {},
904 | "source": [
905 | "### **Summarizing the results in a report**\n",
906 | "\n",
"There are many opinions about displaying the results of an analysis like this one. For me, the ECDFs are the most instructive part of our analysis, and I think they should be the central point of the discussion. In my experience, I have met resistance to presenting ECDFs because they are not in as common use as, say, bee swarm (strip) plots, histograms, box plots, or bar graphs. I get the argument that because they are less common, other people may find them difficult to interpret.\n",
908 | "\n",
909 | "With the exception of the bee swarm plots, all of these kinds of plots fail to plot all of the data. You (or someone in your organization) spent a lot of time and money to get you the data; you should display it all if you can.\n",
910 | "\n",
"The bee swarm plots, while useful visualizations, are not as clear as ECDFs in showing how the data are distributed. Remember the task of statistical inference: You are trying to learn about the (unknown) generative distribution. The ECDF is an approximation of its CDF, made by plotting all of your data. You can't really get better than that.\n",
912 | "\n",
913 | "I therefore advocate for educating your organization in reading and interpreting ECDFs. They are exceptionally effective graphics, and time spent learning how to interpret them is well worth it.\n",
914 | "\n",
915 | "In addition to the ECDFs, I would also include the summary plot of the 95% confidence intervals of the mean. It helps establish what would happen if we did the experiment again; how big of a difference is there in performance between normal sleepers and insomniacs compared to changes due to variation and finite sample size?\n",
916 | "\n",
917 | "So, my summary report would look something like this...."
918 | ]
919 | },
920 | {
921 | "cell_type": "markdown",
922 | "metadata": {},
923 | "source": [
924 | "#### **Sleep deprivation and facial matching**\n",
925 | "\n",
"Twenty-seven subjects suffering from insomnia and seventy-five subjects with normal sleeping patterns took the short version of the Glasgow Facial Matching Test, comparing 40 pairs of faces each. The subjects' performance was scored based on the percent of the face matching tasks they identified correctly.\n",
927 | "\n",
928 | "Below is an empirical cumulative distribution function describing the results.\n",
929 | "\n",
"*(Figure: ECDFs of percent correct for normal sleepers and insomniacs; `assets/pcorr_ecdf.svg`.)*\n",
934 | "\n",
935 | "The distribution for insomniacs is clearly shifted leftward relative to that for normal sleepers, indicating that insomniacs have poorer performance in face matching tasks. The tails of the distributions are similar; both groups have some very poor performers and some very good performers. The key difference lies in the middle of the distribution.\n",
936 | "\n",
937 | "Below is a plot of the 95% confidence interval for the mean percent correct for normal sleepers and insomniacs.\n",
938 | "\n",
"*(Figure: 95% confidence intervals of the mean percent correct; `assets/pcorr_conf_int.svg`.)*\n",
943 | "\n",
"The larger uncertainty in the estimate for insomniacs is due to the smaller sample size. As an estimate, the difference in mean performance in the facial matching task is about 5 percentage points."
945 | ]
946 | },
947 | {
948 | "cell_type": "markdown",
949 | "metadata": {},
950 | "source": [
951 | "---\n",
952 | "\n",
"### **Q&A 4**\n",
954 | "\n",
955 | "---"
956 | ]
957 | },
958 | {
959 | "cell_type": "markdown",
960 | "metadata": {},
961 | "source": [
962 | "## Conclusions\n",
963 | "\n",
964 | "Measured data come from an unknown generative distribution, and the job of statistical inference is to learn as much as we can about that generative distribution. Hacker stats enables us to use the plug-in principle, in which the generative distribution is approximated by the empirical distribution, to obtain this information using random number generation on our computers. The results are easier to come by and to understand.\n",
965 | "\n",
966 | "In this live session, we computed ECDFs and confidence intervals for univariate data. In another live session on hacker stats, we will extend these concepts to bivariate data. We will also introduce and perform null hypothesis significance tests (NHSTs)."
967 | ]
968 | },
969 | {
970 | "cell_type": "markdown",
971 | "metadata": {},
972 | "source": [
973 | "## Take-home question\n",
974 | "\n",
"There are plenty of interesting aspects of this data set to explore. For practice, use what you have learned in this live training to compute plug-in estimates and confidence intervals for the percent correct for normal sleepers, separated by gender. Is there a big difference between the genders? You should also make informative graphics and write a short report like the example above to share your findings."
976 | ]
977 | }
978 | ],
979 | "metadata": {
980 | "colab": {
981 | "name": "python_live_session_template.ipynb",
982 | "provenance": []
983 | },
984 | "kernelspec": {
985 | "display_name": "Python 3",
986 | "language": "python",
987 | "name": "python3"
988 | },
989 | "language_info": {
990 | "codemirror_mode": {
991 | "name": "ipython",
992 | "version": 3
993 | },
994 | "file_extension": ".py",
995 | "mimetype": "text/x-python",
996 | "name": "python",
997 | "nbconvert_exporter": "python",
998 | "pygments_lexer": "ipython3",
999 | "version": "3.7.7"
1000 | }
1001 | },
1002 | "nbformat": 4,
1003 | "nbformat_minor": 4
1004 | }
1005 |
--------------------------------------------------------------------------------
/notebooks/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | *.html
--------------------------------------------------------------------------------
/notebooks/hacker_stats_in_python_part_1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "6Ijg5wUCTQYG"
8 | },
9 | "source": [
10 | "
\n",
11 | "\n",
12 | "
\n",
13 | "
\n",
14 | "\n",
15 | "This live training will review the concepts of statistical inference laid out in the DataCamp courses [Statistical Thinking in Python I](https://learn.datacamp.com/courses/statistical-thinking-in-python-part-1) and [Statistical Thinking in Python II](https://learn.datacamp.com/courses/statistical-thinking-in-python-part-2). The concepts of those courses were reviewed and fortified in [Case Studies in Statistical Thinking](https://learn.datacamp.com/courses/case-studies-in-statistical-thinking). You can link to those courses by clicking on the badges below.\n",
16 | "\n",
17 | "
\n",
24 | "\n",
25 | "\n",
26 | "Our goal here is to reinforce the concepts and techniques of statistical inference using hacker stats to help students gain confidence applying them to new data analysis tasks. In this part of the training, we will:\n",
27 | "\n",
28 | "- Review fundamental notions of probability and clarify the tasks of statistical inference.\n",
29 | "- Perform graphical exploratory data analysis (EDA).\n",
30 | "- Review concepts of confidence intervals.\n",
31 | "- Apply bootstrap methods for computing confidence intervals.\n",
32 | " \n",
33 | "We will do all of this using a data set (described [below](#The-Dataset)) which explores how sleep deprivation affects facial recognition tasks in humans."
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## **Necessary packages**\n",
41 | "\n",
42 | "It is good practice to always have necessary package imports at the top of any `.py` file or notebook, so let's get our imports in before moving on.\n",
43 | "\n",
44 | "Unless I'll be plotting very large data sets (which is not the case in this live training), I generally also like to set the figure format to be SVG. This avoids a pixelated look to the plots."
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": null,
50 | "metadata": {
51 | "colab": {},
52 | "colab_type": "code",
53 | "id": "EMQfyC7GUNhT"
54 | },
55 | "outputs": [],
56 | "source": [
57 | "import numpy as np\n",
58 | "import scipy.stats\n",
59 | "import pandas as pd\n",
60 | "import matplotlib.pyplot as plt\n",
61 | "import seaborn as sns\n",
62 | "sns.set()\n",
63 | "\n",
64 | "# I want crisp graphics, so we'll output SVG\n",
65 | "%config InlineBackend.figure_format = 'svg'"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "## **Statistical inference**\n",
73 | "\n",
74 | "In this section of the notebook, I will discuss the goals of statistical inference and lay out the approach we will take. You may have seen these ideas presented before, but in my experience, investigating a topic from multiple angles leads to a much fuller understanding, and that is my goal here.\n",
75 | "\n",
76 | "We will dive into our (rather fun) data set momentarily, but for now, we will go over some important theoretical background to make sure we understand what statistical inference is all about, and how the hacker stats approach works."
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "### **What is probability?**\n",
84 | "\n",
85 | "To think about what probability means, consider an example. Say I am interested in the heights of American adult women. If we were to select an American woman at random, I think we would all agree that she is probably going to be under 190 cm (about 6' 3\"). We just used the word \"probably!\" To see what we mean by that, let's pick another woman at random. And another. And another. And another. We do this over and over and over. Eventually, we will pick a woman who is over 190 cm tall (for example, basketball legend [Lisa Leslie](https://en.wikipedia.org/wiki/Lisa_Leslie) is 196 cm). So, the probability that a randomly chosen woman is over 190 cm is not zero, but it is small.\n",
86 | "\n",
87 | "We can directly apply this mode of thinking to give us an interpretation of probability, referred to as the **frequentist interpretation of probability**. The probability of an observation represents a long-run frequency over a large number of identical repetitions of a measurement/data collection. These repetitions can be, and often are, hypothetical."
88 | ]
89 | },
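As a quick illustration of the frequentist interpretation, we can estimate a probability as a long-run frequency by simulation. This is a minimal sketch with made-up numbers (heights assumed Normal with mean 165 cm and standard deviation 9 cm, as we will use below):

```python
import numpy as np

np.random.seed(42)

# Simulate many hypothetical "randomly chosen women," assuming heights are
# Normally distributed with mean 165 cm and standard deviation 9 cm
heights = np.random.normal(165, 9, size=1_000_000)

# The long-run fraction of draws above 190 cm approximates the probability
p_over_190 = np.mean(heights > 190)
print(p_over_190)
```

The result is small, a fraction of a percent, but nonzero, matching the intuition above.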
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "### **Probability distributions and CDFs**\n",
95 | "\n",
96 | "A **probability distribution** provides the link between observations and probability. For example, a Normal distribution might link women's heights to probability. It is highly probable that a woman is between 150 and 180 cm (about five and six feet) tall, but improbable that a woman is above 190 cm.\n",
97 | "\n",
98 | "These statements of probability stem from the **cumulative distribution function** (CDF) associated with a probability distribution. The CDF evaluated at _x_ is defined as follows.\n",
99 | "\n",
100 | ">CDF(_x_) = probability of observing a value less than or equal to _x_.\n",
101 | "\n",
102 | "To get an idea of what a CDF might look like when we plot it, we can plot the CDF for a Normal distribution, say describing the height of American adult women. They average 165 cm, with a standard deviation of 9 cm. The mathematical expression for the CDF is complicated but known, and we can use the built-in functions in `scipy.stats` to generate the values."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {},
109 | "outputs": [],
110 | "source": [
"# x-values for height CDF\n",
"x = np.linspace(125, 200, 400)\n",
"\n",
"# Values of the CDF for a Normal distribution with mean 165 cm, std 9 cm\n",
"cdf = scipy.stats.norm.cdf(x, loc=165, scale=9)\n",
"\n",
"# Make the plot\n",
"plt.plot(x, cdf)\n",
"plt.xlabel(\"height (cm)\")\n",
"plt.ylabel(\"CDF\")"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "Remember, the CDF evaluated at a given position along the x-axis is the probability of having an observation less than or equal to _x_. So, if we look at _x_ = 160 cm, we see that the probability that a woman is shorter than 160 cm is about 0.3.\n",
123 | "\n",
124 | "Importantly, the CDF contains all of the information about a distribution."
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "### **Generative distributions**\n",
132 | "\n",
133 | "Say we did the experiment of measuring women's heights. For cost reasons (time and money), we can measure 100 women to get 100 height measurements. We might expect about 30 of these measurements to be below 160 cm and, based on the CDF above, about 7 of these measurements to be above 180 cm. This assumes that the distribution that provides that link between height measurements and probability is in fact a Normal distribution. If this is true, then the Normal distribution *generates* the data; it is the **generative distribution**.\n",
134 | "\n",
135 | "If we *know* the true generative distribution, then drawing random numbers out of the distribution is the same as performing the measurements themselves. We can draw random numbers out of a probability distribution using the `numpy.random` module. So, we can \"repeat\" the measurement of 100 women by drawing out of the generative distribution."
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
"# For reproducibility, seed the random number generator\n",
"np.random.seed(3252)\n",
"\n",
"# Draw 100 measurements from a Normal with mean 165 and std 9\n",
"heights = np.random.normal(165, 9, size=100)"
148 | ]
149 | },
150 | {
151 | "cell_type": "markdown",
152 | "metadata": {},
153 | "source": [
154 | "---\n",
155 | "\n",
156 | "
Q&A 1
\n",
157 | "\n",
158 | "---"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "### **The empirical cumulative distribution function**\n",
166 | "\n",
167 | "It is useful to plot the **empirical cumulative distribution function**, or ECDF, when doing exploratory data analysis. The ECDF at position _x_ is defined as\n",
168 | "\n",
169 | ">ECDF(_x_) = fraction of data points ≤ _x_.\n",
170 | "\n",
171 | "Compare this to the definition of the CDF.\n",
172 | "\n",
173 | ">CDF(_x_) = probability of observing a value less than or equal to _x_.\n",
174 | "\n",
175 | "Think about the frequentist interpretation of probability. If we had many, many measurements, then the fraction of data points less than or equal to _x_ is in fact the *probability* of observing a value less than or equal to _x_. So, the ECDF is close to the CDF. The differences between an ECDF and the CDF of the generative distribution are due entirely to the fact that we have only a finite number of measurements for the ECDF.\n",
176 | "\n",
177 | "To plot an ECDF, the x-coordinate of a dot is the value of a data point. The y-value is the fraction of data points less or equal to the x-value. We can write a function to generate the x- and y-values for the ECDF."
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": null,
183 | "metadata": {},
184 | "outputs": [],
185 | "source": [
"def ecdf(data):\n",
"    \"\"\"Compute ECDF for a one-dimensional array of measurements.\"\"\"\n",
"    # Number of data points\n",
"    n = len(data)\n",
"\n",
"    # x-data for the ECDF: the sorted measurements\n",
"    x = np.sort(data)\n",
"\n",
"    # y-data for the ECDF: fraction of data points <= each x-value\n",
"    y = np.arange(1, n + 1) / n\n",
"\n",
"    return x, y"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "We can use this function to add the ECDF to our plot of the generative CDF."
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
"# Compute ECDF\n",
"x_ecdf, y_ecdf = ecdf(heights)\n",
"\n",
"# Make the plot, overlaying the ECDF on the generative CDF\n",
"plt.plot(x, cdf)\n",
"plt.plot(x_ecdf, y_ecdf, marker=\".\", linestyle=\"none\")"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "Indeed, the ECDF closely matches the CDF, but we do see variation from it, owing to the small sample size of 100. If we increase the sample size to 1000, the ECDF follows more closely."
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": null,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
"# Draw more heights\n",
"heights_1000 = np.random.normal(165, 9, size=1000)\n",
"\n",
"# Compute ECDF\n",
"x_ecdf, y_ecdf = ecdf(heights_1000)\n",
"\n",
"# Make the plot\n",
"plt.plot(x, cdf)\n",
"plt.plot(x_ecdf, y_ecdf, marker=\".\", linestyle=\"none\")"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "### **The task of statistical inference**\n",
240 | "\n",
241 | "In the above example, we knew the generative distribution. If we did actually know the generative distribution ahead of time, there would be no point in doing measurements. We generally do not know what the generative distribution is, so we collect data. **The task of statistical inference is to deduce the properties of a generative distribution of data.**"
242 | ]
243 | },
244 | {
245 | "cell_type": "markdown",
246 | "metadata": {},
247 | "source": [
248 | "### **The plug-in principle**\n",
249 | "\n",
250 | "We could imagine that we knew (or strongly suspected) that the generative distribution for American women's heights is Normal, but we do not know the parameters. In this case, the statistical inference task is largely to characterize the parameters of the generative distribution.\n",
251 | "\n",
252 | "But imagine we do not know anything a priori about the generative distribution. We still make measurements, and the measurements are all we have. We do not know anything about the CDF, but we do know the ECDF. As the CDF defines a generative distribution, so too does the ECDF define an **empirical distribution**. We can proceed with statistical inference by using the empirical distribution in place of the (unknown) generative distribution. This approximation is referred to as the **plug-in** principle, as we \"plug in\" the empirical distribution for the generative distribution. This principle underlies many **non-parametric** approaches to statistical inference, so named because we are not trying to find parameters of a generative distribution, but are using only information from the data themselves. Application of the plug-in principle is the approach we take in this tutorial."
253 | ]
254 | },
255 | {
256 | "cell_type": "markdown",
257 | "metadata": {},
258 | "source": [
259 | "### **Properties of distributions via plug-in estimates**\n",
260 | "\n",
261 | "One-dimensional distributions have properties you may have heard of, like means, medians, and variances. Imagine we have a set of measurements stored in a NumPy array `x`. Then, the plug-in estimates for the various properties of the generative distribution are shown in the table below.\n",
262 | "\n",
263 | "| Property | Plug-in estimate |\n",
264 | "|--------------------|-----------------------|\n",
265 | "| mean | `np.mean(x)` |\n",
266 | "| median | `np.median(x)` |\n",
267 | "| variance | `np.var(x)` |\n",
268 | "| standard deviation | `np.std(x)` |\n",
269 | "| _p_ percentile | `np.percentile(x, p)` |\n",
270 | "\n",
271 | "\n",
272 | "\n",
273 | "Two-dimensional distributions additionally have properties like covariances and correlations. We can also estimate these using the plug-in principle. Let `x` and `y` be NumPy arrays. Then, for two-dimensional distributions, the plug-in estimates are:\n",
274 | "\n",
275 | "| Property | Plug-in estimate |\n",
276 | "|--------------------|-------------------------------|\n",
277 | "| covariance | `np.cov(x, y, ddof=0)[0, 1]` |\n",
278 | "| correlation | `np.corrcoef(x, y)[0, 1]` |\n"
279 | ]
280 | },
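As a sketch of how these plug-in estimates are computed in practice, consider the snippet below. The data are made up and the array names are illustrative, not from the GFMT data set:

```python
import numpy as np

np.random.seed(42)

# Made-up measurements standing in for real data
x = np.random.normal(165, 9, size=100)
y = x + np.random.normal(0, 3, size=100)

# Plug-in estimates for one-dimensional properties
mean_est = np.mean(x)
median_est = np.median(x)
var_est = np.var(x)
std_est = np.std(x)
pctl_95 = np.percentile(x, 95)

# Plug-in estimates for two-dimensional properties
cov_est = np.cov(x, y, ddof=0)[0, 1]
corr_est = np.corrcoef(x, y)[0, 1]
```

Note that `np.var()` and `np.std()` default to `ddof=0`, which is exactly the plug-in estimate; passing `ddof=0` to `np.cov()` makes the covariance a plug-in estimate as well.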
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "---\n",
286 | "\n",
287 | "
Q&A 2
\n",
288 | "\n",
289 | "---"
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "We have now laid the theoretical groundwork. In the data set we present below and subsequent analysis, we will use the plug-in principle to deduce properties about the generative distribution. For each set of measurements we consider, we will use the empirical distribution as a plug-in replacement for the unknown generative distribution and compute relevant properties."
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {
302 | "colab_type": "text",
303 | "id": "6Ijg5wUCTQYG"
304 | },
305 | "source": [
306 | "## **The Dataset**\n",
307 | "\n",
308 | "The data set in this webinar comes from a study by [Beattie, et al.](https://doi.org/10.1098/rsos.160321) in which they used the [Glasgow Facial Matching Test](https://en.wikipedia.org/wiki/Glasgow_Face_Matching_Test) (GFMT, [original paper](https://doi.org/10.3758/BRM.42.1.286)) to investigate how sleep deprivation affects a human subject's ability to match faces, as well as the confidence the subject has in those matches. Briefly, the test works by having subjects look at a pair of faces. Two such pairs are shown below.\n",
309 | "\n",
310 | "
\n",
311 | "\n",
312 | "
\n",
313 | " \n",
314 | "\n",
315 | "\n",
316 | "For each of the 40 pairs of faces, the subject gets as much time as needed and then says whether or not the two images show the same person. The subject then rates their confidence in the choice.\n",
317 | "\n",
318 | "In this study, subjects also took surveys to determine properties about their sleep. The surveys provide three different metrics of sleep quality and wakefulness. \n",
319 | "\n",
320 | "- The Sleep Condition Indicator (SCI) is a measure of insomnia disorder over the past month. High scores indicate better sleep and scores of 16 and below indicate insomnia. \n",
321 | "- The Pittsburgh Sleep Quality Index (PSQI) quantifies how well a subject sleeps in terms of interruptions, latency, etc. A higher score indicates poorer sleep. \n",
322 | "- The Epworth Sleepiness Scale (ESS) assesses daytime drowsiness. Higher scores indicate greater drowsiness.\n",
323 | "\n",
324 | "In this live training, we will explore how insomnia affects subjects' ability to discern faces."
325 | ]
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {
330 | "colab_type": "text",
331 | "id": "BMYfcKeDY85K"
332 | },
333 | "source": [
334 | "### **Loading and inspecting the data set**\n",
335 | "\n",
336 | "Let's load in the data set provided by Beattie and coworkers. We'll load in the data set and check out the first few rows using the `head()` method of pandas DataFrames. Importantly, missing data in this data set are denoted with an asterisk, which we specify using the `na_values` keyword argument."
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": null,
342 | "metadata": {
343 | "colab": {
344 | "base_uri": "https://localhost:8080/",
345 | "height": 479
346 | },
347 | "colab_type": "code",
348 | "id": "l8t_EwRNZPLB",
349 | "outputId": "36a85c6f-f2ae-44e0-ac01-fc55462bc616"
350 | },
351 | "outputs": [],
352 | "source": [
353 | "# Read in the dataset\n",
354 | "df = pd.read_csv(\n",
355 | " \"https://github.com/datacamp/Hacker-Stats-in-Python-Live-Training/blob/master/data/gfmt_sleep.csv?raw=True\",\n",
356 | " na_values=\"*\",\n",
357 | ")\n",
358 | "\n",
359 | "# Print header\n",
360 | "df.head()"
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "Here is some information about what is contained in each of the columns.\n",
368 | "\n",
369 | "- `participant number`: Unique identifier for each subject\n",
370 | "- `age`: Age of subject in years\n",
371 | "- `correct hit percentage`: Percentage of correct responses among trials for which the faces match\n",
372 | "- `correct reject percentage`: Percentage of correct responses among trials for which the faces do not match\n",
373 | "- `percent correct`: Percentage of correct responses among all trials\n",
374 | "- `confidence when correct hit`: Average confidence when the subject gave a correct response for trials for which the faces match\n",
375 | "- `confidence when incorrect hit`: Average confidence when the subject gave an incorrect response for trials for which the faces match\n",
376 | "- `confidence when correct reject`: Average confidence when the subject gave a correct response for trials for which the faces do not match\n",
377 | "- `confidence when incorrect reject`: Average confidence when the subject gave an incorrect response for trials for which the faces do not match\n",
378 | "- `confidence when correct`: Average confidence when the subject gave a correct response for all trials\n",
379 | "- `confidence when incorrect`: Average confidence when the subject gave an incorrect response for all trials\n",
380 | "- `sci`: The subject's Sleep Condition Indicator.\n",
381 | "- `psqi`: The subject's Pittsburgh Sleep Quality Index.\n",
382 | "- `ess`: The subject's Epworth Sleepiness Scale.\n",
383 | "\n",
384 | "Going forward, it will be useful to separate the subjects into two groups, insomniacs and normal sleepers. We will therefore add an `'insomnia'` column to the DataFrame with True/False entries. Recall that a person is deemed an insomniac if their SCI is 16 or below."
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": null,
390 | "metadata": {},
391 | "outputs": [],
392 | "source": [
"# Add a column to the data frame for insomnia (SCI of 16 or below)\n",
"df[\"insomnia\"] = df[\"sci\"] <= 16"
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "It is important to know how many total subjects are included, so we can check on the length of the DataFrame."
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": null,
406 | "metadata": {},
407 | "outputs": [],
408 | "source": [
"# Number of entries in data set\n",
"len(df)"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "So, we have 102 subjects, hopefully enough to make meaningful comparisons.\n",
417 | "\n",
418 | "With our data set in place, we can get moving with statistical inference."
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {},
424 | "source": [
425 | "## **Our approach**\n",
426 | "\n",
427 | "In this live training, we will use statistical inference to address the following question of the data:\n",
428 | "\n",
429 | "1. How different is the facial matching performance of insomniacs versus normal sleepers?\n",
430 | "\n",
431 | "In our next live training, we will address other aspects of the data set.\n",
432 | "\n",
433 | "2. How different is *confidence* in facial matching for insomniacs versus normal sleepers?\n",
434 | "3. How are the different sleep metrics correlated?\n",
435 | "4. How do sleep metrics *correlate* with facial matching performance?\n",
436 | "\n",
437 | "Each question requires a different sort of analysis involving calculation of confidence intervals (this session) and p-values (the next session). Along the way, we will introduce the necessary theoretical and technical concepts.\n",
438 | "\n",
439 | "Note that even though this live training is about statistical inference, it is always important to do EDA first. Remember what [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) said,\n",
440 | "\n",
441 | "> \"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.\"\n",
442 | "\n",
443 | "In each of the analyses, we will start with exploratory data analysis."
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "## **1. Performance of insomniacs versus normal sleepers**\n",
451 | "\n",
452 | "Our first investigation is into how well insomniacs perform the face matching task versus normal sleepers. As our first step in exploratory data analysis, we will make a plot of the ECDF of the percent correct on the facial matching test for the two categories.\n",
453 | "\n",
454 | "We will then compare the means of the two data sets. To do so, we will extract the values of the `'percent correct'` column of the DataFrame for normal sleepers and for insomniacs. We will be sure to drop any missing data (NaNs). We will also convert these respective pandas Series to NumPy arrays, which enables faster computation."
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": null,
460 | "metadata": {},
461 | "outputs": [],
462 | "source": [
"# Extract percent correct for normal sleepers\n",
"pcorr_normal = df.loc[~df[\"insomnia\"], \"percent correct\"].dropna().values\n",
"\n",
"# Extract percent correct for insomniacs\n",
"pcorr_insom = df.loc[df[\"insomnia\"], \"percent correct\"].dropna().values"
466 | ]
467 | },
468 | {
469 | "cell_type": "markdown",
470 | "metadata": {},
471 | "source": [
472 | "We can now compute the ECDFs and plot them."
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": null,
478 | "metadata": {},
479 | "outputs": [],
480 | "source": [
"# Compute ECDF for normal sleepers\n",
"x_normal, y_normal = ecdf(pcorr_normal)\n",
"\n",
"# Compute ECDF for insomniacs\n",
"x_insom, y_insom = ecdf(pcorr_insom)\n",
"\n",
"# Make plot of ECDFs\n",
"plt.plot(x_normal, y_normal, marker=\".\", linestyle=\"none\")\n",
"plt.plot(x_insom, y_insom, marker=\".\", linestyle=\"none\")\n",
"plt.xlabel(\"percent correct\")"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {},
491 | "source": [
492 | "There are clearly fewer data points for insomniacs (25 versus 77 for normal sleepers), which will be important to consider as we do statistical inference. In eyeballing the ECDFs, it appears that those without insomnia perform a bit better; the ECDF is shifted rightward toward better scores."
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "### **Plug-in estimates**\n",
500 | "\n",
501 | "We have already computed a plug-in estimate! The ECDF itself is a plug-in estimate for the CDF. From the ECDF, we can also directly read off plug-in estimates for any percentile. For example, the median is the 50th percentile; it is the percent correct where the ECDF is 0.5. That is, half of the measurements lie below and half above. The median for normal sleepers is 85 and that for insomniacs is 75, a 10% difference.\n",
502 | "\n",
503 | "We can also get plug-in estimates for the mean. These we can't read directly off of the ECDF, but we can compute them using the `np.mean()` function."
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "metadata": {},
510 | "outputs": [],
511 | "source": [
"# Plug-in estimates for means\n",
"pcorr_normal_mean = np.mean(pcorr_normal)\n",
"pcorr_insom_mean = np.mean(pcorr_insom)\n",
"\n",
"# Print the results\n",
"print(\"Plug-in estimates for the mean percent correct:\")\n",
"print(\"normal sleepers:\", pcorr_normal_mean)\n",
"print(\"insomniacs:     \", pcorr_insom_mean)"
518 | ]
519 | },
520 | {
521 | "cell_type": "markdown",
522 | "metadata": {},
523 | "source": [
524 | "There is about a 5% difference in the mean scores. In looking at the ECDFs, it seems like this (or the median) might be a good difference to use to compare insomniacs and normal sleepers because the ECDFs are similar at the tails (low and high percent correct), but differ in the middle."
525 | ]
526 | },
527 | {
528 | "cell_type": "markdown",
529 | "metadata": {},
530 | "source": [
531 | "### **Computing a confidence interval**\n",
532 | "\n",
533 | "So, we are now faced with the question: If I were to do the same experiment again, how much variation would I get in the mean percent correct? Might we again see that the insomniacs perform more poorly?\n",
534 | "\n",
535 | "To answer this question, we can compute a **confidence interval**. A **confidence interval** for a value computed from the data, which we will refer to as an **estimate**, can be defined as follows.\n",
536 | "\n",
537 | ">If an experiment is repeated over and over again, the estimate I compute will lie between the bounds of the 95% confidence interval for 95% of the experiments.\n",
538 | "\n",
539 | "So, all we have to do is go to Scotland, randomly select 102 people, 25 of whom are insomniacs, and perform the face matching test, record the results, and compute the mean for insomniacs and normal sleepers. Then, we have to go back to Scotland again, do the whole procedure again, and do that again, and again, and again. Simple, right? \n",
540 | "\n",
541 | "Of course, we can't do that! But remember that performing an experiment is the same thing as drawing random samples out of the generative distribution. Because the generative distribution is unknown, the only way we know how to sample out of it is to literally do the experiment again, which is just not possible. However, we can use the plug-in principle to *approximate* the generative distribution with the empirical distribution. We *can* sample out of the empirical distribution using NumPy's random number generation! A sample of a new data set drawn from the empirical distribution is called a **bootstrap sample**. \n",
542 | "\n",
543 | "Imagine we have set of measurements stored in NumPy array `data`. To get a bootstrap sample, we use `np.random.choice()` to draw `len(data)` numbers out of the array `data`. We do this *with replacement*. The result is a bootstrap sample. The syntax is\n",
544 | "\n",
545 | " bs_sample = np.random.choice(data, len(data))\n",
546 | " \n",
547 | "The bootstrap sample is approximately a new data set drawn from the generative distribution.\n",
548 | "\n",
549 | "After drawing a bootstrap sample, we want to compute the mean in order to see how it will change from experiment to experiment. A mean (or other value of interest) computed from a bootstrap sample is referred to as a **bootstrap replicate**. We can write a function to compute a bootstrap replicate. This function takes as arguments a 1D array of data `data` and a function `func` that is to be applied to a bootstrap sample to return a bootstrap replicate."
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": null,
555 | "metadata": {},
556 | "outputs": [],
557 | "source": [
"def bootstrap_replicate_1d(data, func):\n",
"    \"\"\"Generate bootstrap replicate of 1D data.\"\"\"\n",
"    bs_sample = np.random.choice(data, len(data))\n",
"    return func(bs_sample)"
560 | ]
561 | },
562 | {
563 | "cell_type": "markdown",
564 | "metadata": {},
565 | "source": [
566 | "Now, we want to compute many of these replicates so we can see what range of values for the mean comprises the middle 95%, which gives the 95% confidence interval. It is therefore useful to write a function to draw many bootstrap replicates."
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": null,
572 | "metadata": {},
573 | "outputs": [],
574 | "source": [
"def draw_bs_reps(data, func, size=1):\n",
"    \"\"\"Draw `size` bootstrap replicates.\"\"\"\n",
"    return np.array([bootstrap_replicate_1d(data, func) for _ in range(size)])"
577 | ]
578 | },
579 | {
580 | "cell_type": "markdown",
581 | "metadata": {},
582 | "source": [
583 | "Excellent! Let's put these functions to use to draw bootstrap replicates of the mean for normal sleepers and for insomniacs. Because this calculation is fast, we can \"do\" the experiment over and over again many times. We'll do it 10,000 times."
584 | ]
585 | },
586 | {
587 | "cell_type": "code",
588 | "execution_count": null,
589 | "metadata": {},
590 | "outputs": [],
591 | "source": [
"# Draw bootstrap replicates for the mean\n",
"bs_reps_normal = draw_bs_reps(pcorr_normal, np.mean, size=10000)\n",
"bs_reps_insom = draw_bs_reps(pcorr_insom, np.mean, size=10000)\n",
"\n",
"# Take a quick peek\n",
"bs_reps_normal"
596 | ]
597 | },
598 | {
599 | "cell_type": "markdown",
600 | "metadata": {},
601 | "source": [
602 | "The replicates are stored in NumPy arrays of length 10,000. The values hover around the means, but they do vary.\n",
603 | "\n",
604 | "We can compute the percentiles of the bootstrap replicates using the `np.percentile()` function. We pass in the array we want to compute percentiles for, followed by a list of the percentiles we want. For a 95% confidence interval, we can use `[2.5, 97.5]`, which will give the middle 95% of the samples."
605 | ]
606 | },
607 | {
608 | "cell_type": "code",
609 | "execution_count": null,
610 | "metadata": {},
611 | "outputs": [],
612 | "source": [
"# Compute 95% confidence intervals of the mean\n",
"conf_int_normal = np.percentile(bs_reps_normal, [2.5, 97.5])\n",
"conf_int_insom = np.percentile(bs_reps_insom, [2.5, 97.5])\n",
"\n",
"# Print confidence intervals\n",
"print(\"Normal sleepers:\", conf_int_normal)\n",
"print(\"Insomniacs:     \", conf_int_insom)"
618 | ]
619 | },
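Putting the whole bootstrap pipeline together in one place can help make the logic concrete. Below is a self-contained sketch on synthetic data; the numbers are made up for illustration, not the GFMT measurements:

```python
import numpy as np

np.random.seed(3252)

# Synthetic "percent correct" data standing in for real measurements
data = np.random.normal(81, 12, size=77)

# Draw 10,000 bootstrap replicates of the mean: resample with replacement,
# then compute the mean of each resample
bs_reps = np.array(
    [np.mean(np.random.choice(data, len(data))) for _ in range(10_000)]
)

# The middle 95% of the replicates gives the 95% confidence interval
conf_int = np.percentile(bs_reps, [2.5, 97.5])
print(conf_int)
```

The sample mean always lies inside this interval, and the interval narrows as the sample size grows, since each bootstrap sample then pins down the mean more tightly.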
620 | {
621 | "cell_type": "markdown",
622 | "metadata": {},
623 | "source": [
624 | "The 95% confidence interval of the mean for normal sleepers ranges over 5%, from 79 to 84%. That for insomniacs is twice as wide, ranging from 71 to 81%. "
625 | ]
626 | },
627 | {
628 | "cell_type": "markdown",
629 | "metadata": {},
630 | "source": [
631 | "### **Hence, \"hacker stats\"**\n",
632 | "\n",
633 | "Computing a confidence interval is not a trivial task if we seek to do it analytically by pen and paper. A \"hacker stats\" approach involves direct application of the plug-in principle and frequentist interpretation of probability to do statistical inference by random number generation. Because it involves coding rather than pen-and-paper work, we call our approach here \"hacker stats.\""
634 | ]
635 | },
636 | {
637 | "cell_type": "markdown",
638 | "metadata": {},
639 | "source": [
640 | "---\n",
641 | "\n",
642 |     "**Q&A 3**\n",
643 | "\n",
644 | "---"
645 | ]
646 | },
647 | {
648 | "cell_type": "markdown",
649 | "metadata": {},
650 | "source": [
651 | "### **Visualizing confidence intervals**\n",
652 | "\n",
653 |     "The confidence intervals we printed above are useful, but they are perhaps better understood graphically. The function below generates a plot of confidence intervals. Since our focus here is primarily on the concepts behind inference, you may for now take the plotting function as a black box. Briefly, we are using the more object-oriented mode of plotting with Matplotlib, where we first generate a `Figure` object and an `AxesSubplot` object using `plt.subplots()`. We then use the methods of the `AxesSubplot` object to populate the plot with markers and to make further modifications, such as adding axis labels. We plot the plug-in estimate (in this case, the mean) as a dot and the confidence interval as a line.\n",
654 | "\n",
655 | "The function takes as arguments a list of categories, a list of plug-in estimates, and a list of confidence intervals."
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "execution_count": null,
661 | "metadata": {},
662 | "outputs": [],
663 | "source": [
664 | "def plot_conf_ints(categories, estimates, conf_ints, palette=None):\n",
665 | " \"\"\"Plot confidence intervals with estimates.\"\"\"\n",
666 | " # Set a nice color palette\n",
667 | " if palette is None:\n",
668 | " palette = [\n",
669 | " \"#1f77b4\",\n",
670 | " \"#ff7f0e\",\n",
671 | " \"#2ca02c\",\n",
672 | " \"#d62728\",\n",
673 | " \"#9467bd\",\n",
674 | " \"#8c564b\",\n",
675 | " \"#e377c2\",\n",
676 | " \"#7f7f7f\",\n",
677 | " \"#bcbd22\",\n",
678 | " \"#17becf\",\n",
679 | " ]\n",
680 | " elif type(palette) == str:\n",
681 | " palette = [palette]\n",
682 | " palette = palette[: len(categories)][::-1]\n",
683 | "\n",
684 | " # Set up axes for plot\n",
685 | " fig, ax = plt.subplots(figsize=(5, len(categories) / 2))\n",
686 | "\n",
687 | " # Plot estimates as dots and confidence intervals as lines\n",
688 | " for i, (cat, est, conf_int) in enumerate(\n",
689 | " zip(categories[::-1], estimates[::-1], conf_ints[::-1])\n",
690 | " ):\n",
691 | " color = palette[i % len(palette)]\n",
692 | " ax.plot(\n",
693 | " [est],\n",
694 | " [cat],\n",
695 | " marker=\".\",\n",
696 | " linestyle=\"none\",\n",
697 | " markersize=10,\n",
698 | " color=color,\n",
699 | " )\n",
700 | "\n",
701 | " ax.plot(conf_int, [cat] * 2, linewidth=3, color=color)\n",
702 | "\n",
703 | " # Make sure margins look ok\n",
704 | " ax.margins(y=0.25 if len(categories) < 3 else 0.125)\n",
705 | "\n",
706 | " return ax"
707 | ]
708 | },
709 | {
710 | "cell_type": "markdown",
711 | "metadata": {},
712 | "source": [
713 | "All right! Let's use this to make a plot."
714 | ]
715 | },
716 | {
717 | "cell_type": "code",
718 | "execution_count": null,
719 | "metadata": {},
720 | "outputs": [],
721 | "source": [
722 |     "# Make plot; the mean of the bootstrap replicates approximates the plug-in estimate\n",
723 |     "ax = plot_conf_ints(\n",
724 |     "    [\"normal sleepers\", \"insomniacs\"],\n",
725 |     "    [np.mean(bs_reps_normal), np.mean(bs_reps_insom)],\n",
726 |     "    [conf_int_normal, conf_int_insom],\n",
727 |     ")"
723 | ]
724 | },
725 | {
726 | "cell_type": "markdown",
727 | "metadata": {},
728 | "source": [
729 |     "The difference in the length of the confidence intervals is starkly apparent on the plot. Because we have fewer measurements for insomniacs, our estimate of their mean is less precise.\n",
730 | "\n",
731 | "In looking at the plot of confidence intervals, it seems possible that if we did the experiment again, we might even get a scenario where insomniacs perform *better* than normal sleepers. But how likely is such a scenario?"
732 | ]
733 | },
734 | {
735 | "cell_type": "markdown",
736 | "metadata": {},
737 | "source": [
738 | "### **Confidence interval for difference of means**\n",
739 | "\n",
740 |     "Remember that we are not restricted in which estimates we can compute confidence intervals for. We can instead compute a confidence interval on the *difference* of means between normal sleepers and insomniacs. To do so, we use the following procedure, which again applies the plug-in principle to \"do\" the experiment again and get a bootstrap replicate of the difference of means.\n",
741 | "\n",
742 | "1. Generate a bootstrap sample of percent correct for normal sleepers.\n",
743 | "2. Generate a bootstrap sample of percent correct for insomniacs.\n",
744 | "3. Take the mean of each bootstrap sample, giving a bootstrap replicate for the mean of each.\n",
745 | "4. Subtract the mean for insomniacs from that of normal sleepers.\n",
746 | "\n",
747 | "This is actually trivial to do now because we have already computed and stored bootstrap replicates of the means! We simply have to subtract them."
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": null,
753 | "metadata": {},
754 | "outputs": [],
755 | "source": [
756 |     "# Get bootstrap replicates for difference of means\n",
757 |     "bs_reps_diff = bs_reps_normal - bs_reps_insom"
757 | ]
758 | },
759 | {
760 | "cell_type": "markdown",
761 | "metadata": {},
762 | "source": [
763 | "Now, we can compute the confidence interval by finding the percentiles."
764 | ]
765 | },
766 | {
767 | "cell_type": "code",
768 | "execution_count": null,
769 | "metadata": {},
770 | "outputs": [],
771 | "source": [
772 |     "# Compute confidence interval from bootstrap replicates\n",
773 |     "np.percentile(bs_reps_diff, [2.5, 97.5])"
773 | ]
774 | },
775 | {
776 | "cell_type": "markdown",
777 | "metadata": {},
778 | "source": [
779 | "The confidence interval just barely crosses zero, suggesting that the insomniacs will rarely perform better than normal sleepers. \n",
780 | "\n",
781 | "We can find out the *probability* of having the insomniacs perform better than the normal sleepers by counting how many times the mean percent correct for insomniacs exceeded that of normal sleepers and dividing by the total number of bootstrap replicates. To do the count, we can make an array containing `True` and `False` values for whether or not the difference of means is negative and sum the result (since `True` is worth 1 and `False` is worth 0)."
782 | ]
783 | },
784 | {
785 | "cell_type": "code",
786 | "execution_count": null,
787 | "metadata": {},
788 | "outputs": [],
789 | "source": [
790 |     "# Compute probability of having insomniacs have better mean score\n",
791 |     "np.sum(bs_reps_diff < 0) / len(bs_reps_diff)"
791 | ]
792 | },
793 | {
794 | "cell_type": "markdown",
795 | "metadata": {},
796 | "source": [
797 |     "So, if we were to do the experiment again, there is about a 3% chance we would observe the insomniacs performing at parity with or better than the normal sleepers, at least based on the observations we have. If we made more observations, this chance could rise or fall; we cannot know without more measurements."
798 | ]
799 | },
800 | {
801 | "cell_type": "markdown",
802 | "metadata": {},
803 | "source": [
804 | "### **Summarizing the results in a report**\n",
805 | "\n",
806 |     "There are many opinions about displaying the results of an analysis like this one. For me, the ECDFs are the most instructive part of our analysis, and I think they should be the central point of the discussion. In my experience, I have met resistance to presenting ECDFs because they are not in as common use as, say, bee swarm (strip) plots, histograms, box plots, or bar graphs. I hear the argument that because they are less common, other people may find them difficult to interpret.\n",
807 | "\n",
808 |     "With the exception of the bee swarm plots, all of these kinds of plots fail to show all of the data. You (or someone in your organization) spent a lot of time and money to get the data; you should display it all if you can.\n",
809 | "\n",
810 |     "The bee swarm plots, while useful visualizations, are not as clear as ECDFs in showing how the data are distributed. Remember the task of statistical inference: You are trying to learn about the (unknown) generative distribution. The ECDF is an approximation of its CDF, made by plotting all of your data. You can't really get better than that.\n",
811 | "\n",
812 | "I therefore advocate for educating your organization in reading and interpreting ECDFs. They are exceptionally effective graphics, and time spent learning how to interpret them is well worth it.\n",
813 | "\n",
814 |     "In addition to the ECDFs, I would also include the summary plot of the 95% confidence intervals of the mean. It helps establish what would happen if we did the experiment again: how big a difference is there in performance between normal sleepers and insomniacs compared to changes due to variation and finite sample size?\n",
815 | "\n",
816 | "So, my summary report would look something like this...."
817 | ]
818 | },
819 | {
820 | "cell_type": "markdown",
821 | "metadata": {},
822 | "source": [
823 | "#### **Sleep deprivation and facial matching**\n",
824 | "\n",
825 | "Twenty-seven subjects suffering from insomnia and seventy-five subjects with normal sleeping patterns were subjected to the short version of the Glasgow Facial Matching Test, comparing 40 pairs of faces each. The subjects' performance was scored based on the percent of the face matching tasks they identified correctly.\n",
826 | "\n",
827 | "Below is an empirical cumulative distribution function describing the results.\n",
828 | "\n",
829 |     "![ECDF of percent correct for normal sleepers and insomniacs](../assets/pcorr_ecdf.svg)\n",
833 | "\n",
834 | "The distribution for insomniacs is clearly shifted leftward relative to that for normal sleepers, indicating that insomniacs have poorer performance in face matching tasks. The tails of the distributions are similar; both groups have some very poor performers and some very good performers. The key difference lies in the middle of the distribution.\n",
835 | "\n",
836 | "Below is a plot of the 95% confidence interval for the mean percent correct for normal sleepers and insomniacs.\n",
837 | "\n",
838 |     "![95% confidence intervals of the mean percent correct](../assets/pcorr_conf_int.svg)\n",
842 | "\n",
843 |     "The greater uncertainty in the estimate for insomniacs is due to the smaller sample size. As an estimate, the difference in mean performance in the facial matching task is about 5%."
844 | ]
845 | },
846 | {
847 | "cell_type": "markdown",
848 | "metadata": {},
849 | "source": [
850 | "---\n",
851 | "\n",
852 |     "**Q&A 4**\n",
853 | "\n",
854 | "---"
855 | ]
856 | },
857 | {
858 | "cell_type": "markdown",
859 | "metadata": {},
860 | "source": [
861 | "## Conclusions\n",
862 | "\n",
863 | "Measured data come from an unknown generative distribution, and the job of statistical inference is to learn as much as we can about that generative distribution. Hacker stats enables us to use the plug-in principle, in which the generative distribution is approximated by the empirical distribution, to obtain this information using random number generation on our computers.\n",
864 | "\n",
865 | "In this live session, we computed ECDFs and confidence intervals for univariate data. In another live session on hacker stats, we will extend these concepts to bivariate data. We will also introduce and perform null hypothesis significance tests (NHSTs)."
866 | ]
867 | },
868 | {
869 | "cell_type": "markdown",
870 | "metadata": {},
871 | "source": [
872 | "## Take-home question\n",
873 | "\n",
874 |     "There are plenty of interesting aspects of this data set to explore. For good practice, you can use what you have learned in this live training to compute plug-in estimates and confidence intervals for the percent correct for normal sleepers, separated by gender. Is there a big difference between the genders? You should also make informative graphics and write a short report like the example above to share your findings."
875 | ]
876 | }
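877 | ,
878 | {
879 | "cell_type": "markdown",
880 | "metadata": {},
881 | "source": [
882 | "As a starting point, here is a sketch of the computation. The column names `gender` and `percent correct`, and the data frame name `df_normal`, are assumptions; adjust them (and the filter for normal sleepers) to match the data frame loaded earlier in the session."
883 | ]
884 | },
885 | {
886 | "cell_type": "code",
887 | "execution_count": null,
888 | "metadata": {},
889 | "outputs": [],
890 | "source": [
891 | "# Sketch for the take-home question; names are assumed, adjust as needed\n",
892 | "# pc_f = df_normal.loc[df_normal['gender'] == 'f', 'percent correct'].values\n",
893 | "# pc_m = df_normal.loc[df_normal['gender'] == 'm', 'percent correct'].values\n",
894 | "# bs_reps_f = dcst.draw_bs_reps(pc_f, np.mean, size=10000)\n",
895 | "# bs_reps_m = dcst.draw_bs_reps(pc_m, np.mean, size=10000)\n",
896 | "# conf_int_f = np.percentile(bs_reps_f, [2.5, 97.5])\n",
897 | "# conf_int_m = np.percentile(bs_reps_m, [2.5, 97.5])"
898 | ]
899 | }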
877 | ],
878 | "metadata": {
879 | "colab": {
880 | "name": "python_live_session_template.ipynb",
881 | "provenance": []
882 | },
883 | "kernelspec": {
884 | "display_name": "Python 3",
885 | "language": "python",
886 | "name": "python3"
887 | },
888 | "language_info": {
889 | "codemirror_mode": {
890 | "name": "ipython",
891 | "version": 3
892 | },
893 | "file_extension": ".py",
894 | "mimetype": "text/x-python",
895 | "name": "python",
896 | "nbconvert_exporter": "python",
897 | "pygments_lexer": "ipython3",
898 | "version": "3.7.7"
899 | }
900 | },
901 | "nbformat": 4,
902 | "nbformat_minor": 4
903 | }
904 |
--------------------------------------------------------------------------------