├── .gitignore ├── README.md ├── check_env.py ├── data ├── Demographics_State.csv ├── Population_State.csv └── Weed_Price.csv ├── notebooks ├── 1. Introduction.ipynb ├── 2. Warm-up.ipynb ├── 3. Resampling.ipynb ├── 4. Basic Metrics.ipynb ├── 5. Distributions.ipynb ├── 6. Hypothesis Testing.ipynb ├── 7. Linear Regression.ipynb ├── 8. Closing thoughts and terminology.ipynb ├── 9. References.ipynb └── img │ ├── 6sigma.png │ ├── binomial.gif │ ├── binomial_pmf.png │ ├── correlation.gif │ ├── correlation_not_causation.gif │ ├── covariance.png │ ├── exponential_pdf.png │ ├── kurtosis.png │ ├── leastsquare.gif │ ├── normal_cdf.png │ ├── normal_pdf.png │ ├── normaldist.png │ ├── skewness.png │ ├── uniform.png │ └── variance.png ├── requirements.txt └── requirements_linux.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by https://www.gitignore.io/api/vim,ipythonnotebook,virtualenv 2 | 3 | ### Vim ### 4 | [._]*.s[a-w][a-z] 5 | [._]s[a-w][a-z] 6 | *.un~ 7 | Session.vim 8 | .netrwhist 9 | *~ 10 | 11 | 12 | ### IPythonNotebook ### 13 | # Temporary data 14 | .ipynb_checkpoints/ 15 | 16 | 17 | ### VirtualEnv ### 18 | # Virtualenv 19 | # http://iamzed.com/2009/05/07/a-primer-on-virtualenv/ 20 | .Python 21 | [Bb]in 22 | [Ii]nclude 23 | [Ll]ib 24 | [Ss]cripts 25 | pyvenv.cfg 26 | pip-selfcheck.json 27 | 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to Statistics 2 | 3 | [![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/rouseguy/intro2stats/trend.png)](https://bitdeli.com/free "Bitdeli Badge") 4 | 5 | 6 | Inspired by Allen Downey's books [Think Stats](http://greenteapress.com/thinkstats/) and [Think Bayes](http://greenteapress.com/thinkbayes/), this is an attempt to learn Statistics using an application-centric programming approach. 
7 | 8 | ## Objective 9 | Showcase real-life examples and the statistics to use in each of them. Almost every book teaches a concept and then shows an example. As a result, every topic gets treated separately and no holistic view is presented. Here, we take examples and see how to make sense of them. 10 | 11 | ## Topics covered 12 | 13 | * Mean, Median, Mode 14 | * Standard Deviation 15 | * Variance 16 | * Co-variance 17 | * Probability Distribution 18 | * Hypothesis Testing 19 | * t-test, p-value, chi-squared test 20 | * Confidence Intervals 21 | * Confidence levels and Significance levels 22 | * Correlation 23 | * Resampling (and uses in Big Data) 24 | * A/B Testing 25 | * A simple linear regression model 26 | 27 | ## Workshop Plan 28 | We will use marijuana prices in various states of the USA, along with demographic data of the USA based on the latest census data. 29 | 30 | There will be separate IPython notebooks, grouped by topic similarity. *notebooks will be uploaded later* 31 | Some examples include: 32 | * Find sum of people buying weed in a year, by various states. 33 | * Find mean of price in a week/month, by various states. 34 | * Find variance of price in selected states. Find variance of selected states by week of month 35 | * Define distribution. Plot histograms 36 | * Determining outliers (plots, quantiles, box plots, percentiles) in weed price data 37 | * Continuous distributions (exponential distribution, normal distribution) 38 | * Introduction to Probability 39 | * Hypothesis testing. Check whether weed prices across states are similar or not. Check for different qualities of weed 40 | * Resampling 41 | * Simple regression model: Predict weed price for the next month. Understand the output and diagnostics 42 | * Introduction to A/B testing: Impact of regulation and deregulation on a couple of states 43 | 44 | 45 | ## Prerequisites 46 | * Basics of Python.
Users should know how to write functions; read and parse text files (csv, txt, fwf); use conditional and looping constructs; use standard libraries such as os and sys; and work with lists, list comprehensions, and dictionaries 47 | * It is good to know the basics of the following: 48 | * Numpy 49 | * Scipy 50 | * Pandas 51 | * Matplotlib 52 | * Seaborn 53 | * IPython and IPython notebook - Everything here would be an IPython notebook 54 | * Software Requirements 55 | * Python 2.7 56 | * git - so that this repo can be cloned :) 57 | * virtualenv 58 | * Libraries from *requirements.txt* 59 | 60 | ## Optional 61 | Users can choose to install Anaconda instead. If using Anaconda or Enthought, please ensure that all libraries listed in *requirements.txt* are installed. 62 | 63 | *Note to Windows Users*: Neither of us uses Windows. In past workshops, Windows users have had trouble installing with the steps explained below. It is advisable to install Anaconda and ensure that all the libraries listed in the *requirements.txt* file are installed. 64 | 65 | ## Setup Guide 66 | 67 | #### Clone the repository 68 | $ git clone https://github.com/rouseguy/intro2stats.git 69 | 70 | #### Create a virtual environment & activate 71 | $ cd intro2stats 72 | $ virtualenv env 73 | $ source env/bin/activate 74 | 75 | #### Install requirements from requirements file 76 | $ pip install -r requirements.txt 77 | 78 | #### Note: Make sure you have libraries for png & freetype. 79 | Ubuntu users can install them as follows 80 | 81 | apt-get install libfreetype6-dev 82 | apt-get install libpng-dev 83 | 84 | ### Script to check if installation is fine for the workshop 85 | Please execute the following at the command prompt 86 | 87 | $ python check_env.py 88 | 89 | If any library has a `FAIL` message, please install/upgrade that library. 90 | 91 | --- 92 | 93 | Creative Commons License
Introduction to Statistics using Python by Bargava and Raghotham is licensed under a Creative Commons Attribution 4.0 International License. 94 | 95 | 96 | 97 | -------------------------------------------------------------------------------- /check_env.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script will check if the environment setup is correct for the workshop. 3 | 4 | To run, please execute the following command from the command prompt 5 | $ python check_env.py 6 | 7 | The output will indicate if any of the libraries are missing or need to be updated. 8 | 9 | This script is inspired by https://github.com/fonnesbeck/scipy2015_tutorial/blob/master/check_env.py 10 | """ 11 | 12 | from __future__ import print_function 13 | 14 | try: 15 | import curses 16 | curses.setupterm() 17 | assert curses.tigetnum("colors") > 2 18 | OK = "\x1b[1;%dm[ OK ]\x1b[0m" % (30 + curses.COLOR_GREEN) 19 | FAIL = "\x1b[1;%dm[FAIL]\x1b[0m" % (30 + curses.COLOR_RED) 20 | except: 21 | OK = '[ OK ]' 22 | FAIL = '[FAIL]' 23 | 24 | import sys 25 | try: 26 | import importlib 27 | except ImportError: 28 | print(FAIL, "Python version 2.7 is required, but %s is installed." % sys.version) 29 | from distutils.version import LooseVersion as Version 30 | 31 | def import_version(pkg, min_ver, fail_msg=""): 32 | mod = None 33 | try: 34 | mod = importlib.import_module(pkg) 35 | # workaround for Image not having __version__ 36 | version = getattr(mod, "__version__", 0) or getattr(mod, "VERSION", 0) 37 | if Version(version) < min_ver: 38 | print(FAIL, "%s version %s or higher required, but %s installed." 39 | % (pkg, min_ver, version)) 40 | else: 41 | print(OK, '%s version %s' % (pkg, version)) 42 | except ImportError: 43 | print(FAIL, '%s not installed.
%s' % (pkg, fail_msg)) 44 | return mod 45 | 46 | 47 | # first check the python version 48 | print('Using python in', sys.prefix) 49 | print(sys.version) 50 | pyversion = Version(sys.version) 51 | if pyversion >= "3": 52 | print(FAIL, "Python version 2.7 is required, but %s is installed." % sys.version) 53 | elif pyversion >= "2": 54 | if pyversion < "2.7": 55 | print(FAIL, "Python version 2.7 is required, but %s is installed." % sys.version) 56 | else: 57 | print(FAIL, "Unknown Python version: %s" % sys.version) 58 | 59 | print() 60 | requirements = {'numpy': "1.9.2", 'pandas': "0.16.2", 'scipy': "0.9", 'matplotlib': "1.4.3", 61 | 'IPython': "4.0", 'sklearn': "0.16.1", 'seaborn': "0.6.0", 'statsmodels': "0.6.1"} 62 | 63 | # now the dependencies 64 | for lib, required_version in list(requirements.items()): 65 | import_version(lib, required_version) 66 | -------------------------------------------------------------------------------- /data/Demographics_State.csv: -------------------------------------------------------------------------------- 1 | "region","total_population","percent_white","percent_black","percent_asian","percent_hispanic","per_capita_income","median_rent","median_age" 2 | "alabama",4799277,67,26,1,4,23680,501,38.1 3 | "alaska",720316,63,3,5,6,32651,978,33.6 4 | "arizona",6479703,57,4,3,30,25358,747,36.3 5 | "arkansas",2933369,74,15,1,7,22170,480,37.5 6 | "california",37659181,40,6,13,38,29527,1119,35.4 7 | "colorado",5119329,70,4,3,21,31109,825,36.1 8 | "connecticut",3583561,70,9,4,14,37892,880,40.2 9 | "delaware",908446,65,21,3,8,29819,828,38.9 10 | "district of columbia",619371,35,49,3,10,45290,1154,33.8 11 | "florida",19091156,57,15,2,23,26236,838,41 12 | "georgia",9810417,55,30,3,9,25182,673,35.6 13 | "hawaii",1376298,23,2,37,9,29305,1220,38.3 14 | "idaho",1583364,84,1,1,11,22568,607,34.9 15 | "illinois",12848554,63,14,5,16,29666,759,36.8 16 | "indiana",6514861,81,9,2,6,24635,577,37.1 17 | "iowa",3062553,88,3,2,5,27027,534,38.1 18 | 
"kansas",2868107,78,6,2,11,26929,551,36 19 | "kentucky",4361333,86,8,1,3,23462,506,38.2 20 | "louisiana",4567968,60,32,2,4,24442,610,36 21 | "maine",1328320,94,1,1,1,26824,664,43.2 22 | "maryland",5834299,54,29,6,8,36354,1034,38 23 | "massachusetts",6605058,76,6,6,10,35763,936,39.2 24 | "michigan",9886095,76,14,3,5,25681,623,39.1 25 | "minnesota",5347740,83,5,4,5,30913,734,37.6 26 | "mississippi",2976872,58,37,1,3,20618,510,36.2 27 | "missouri",6007182,81,11,2,4,25649,549,38 28 | "montana",998554,87,0,1,3,25373,577,39.9 29 | "nebraska",1841625,82,4,2,9,26899,563,36.3 30 | "nevada",2730066,53,8,7,27,26589,840,36.6 31 | "new hampshire",1319171,92,1,2,3,33134,878,41.5 32 | "new jersey",8832406,59,13,9,18,36027,1024,39.1 33 | "new mexico",2069706,40,2,1,47,23763,635,36.7 34 | "new york",19487053,58,14,8,18,32382,963,38.1 35 | "north carolina",9651380,65,21,2,9,25284,602,37.6 36 | "north dakota",689781,88,1,1,2,29732,564,36.4 37 | "ohio",11549590,81,12,2,3,26046,562,39 38 | "oklahoma",3785742,68,7,2,9,24208,525,36.2 39 | "oregon",3868721,78,2,4,12,26809,749,38.7 40 | "pennsylvania",12731381,79,10,3,6,28502,652,40.3 41 | "rhode island",1051695,76,5,3,13,30469,781,39.6 42 | "south carolina",4679602,64,28,1,5,23943,582,38.1 43 | "south dakota",825198,84,1,1,3,25740,517,36.9 44 | "tennessee",6402387,75,17,1,5,24409,568,38.2 45 | "texas",25639373,45,12,4,38,26019,688,33.8 46 | "utah",2813673,80,1,2,13,23873,739,29.6 47 | "vermont",625904,94,1,1,2,29167,754,42 48 | "virginia",8100653,64,19,6,8,33493,910,37.5 49 | "washington",6819579,72,3,7,11,30742,853,37.3 50 | "west virginia",1853619,93,3,1,1,22966,448,41.5 51 | "wisconsin",5706871,83,6,2,6,27523,636,38.7 52 | "wyoming",570134,85,1,1,9,28902,647,36.8 53 | -------------------------------------------------------------------------------- /data/Population_State.csv: -------------------------------------------------------------------------------- 1 | "region","value" 2 | "alabama",4777326 3 | "alaska",711139 4 | 
"arizona",6410979 5 | "arkansas",2916372 6 | "california",37325068 7 | "colorado",5042853 8 | "connecticut",3572213 9 | "delaware",900131 10 | "district of columbia",605759 11 | "florida",18885152 12 | "georgia",9714569 13 | "hawaii",1362730 14 | "idaho",1567803 15 | "illinois",12823860 16 | "indiana",6485530 17 | "iowa",3047646 18 | "kansas",2851183 19 | "kentucky",4340167 20 | "louisiana",4529605 21 | "maine",1329084 22 | "maryland",5785496 23 | "massachusetts",6560595 24 | "michigan",9897264 25 | "minnesota",5313081 26 | "mississippi",2967620 27 | "missouri",5982413 28 | "montana",990785 29 | "nebraska",1827306 30 | "nevada",2704204 31 | "new hampshire",1317474 32 | "new jersey",8793888 33 | "new mexico",2055287 34 | "new york",19398125 35 | "north carolina",9544249 36 | "north dakota",676253 37 | "ohio",11533561 38 | "oklahoma",3749005 39 | "oregon",3836628 40 | "pennsylvania",12699589 41 | "rhode island",1052471 42 | "south carolina",4630351 43 | "south dakota",815871 44 | "tennessee",6353226 45 | "texas",25208897 46 | "utah",2766233 47 | "vermont",625498 48 | "virginia",8014955 49 | "washington",6738714 50 | "west virginia",1850481 51 | "wisconsin",5687219 52 | "wyoming",562803 53 | -------------------------------------------------------------------------------- /notebooks/1. Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "> **The fact that data science exists as a field is a colossal failure of statistics ** - Hadley Wickham\n", 8 | "\n", 9 | "\n", 10 | "# THERE ARE HACKS, DAMN HACKS, AND THERE ARE STATISTICS \n", 11 | "\n", 12 | "\n", 13 | "Statistics, to most people, is a bunch of formulae, yes - invariably complicated with strong assumptions. Very few practitioners in fields outside math can derive those from first principles. 
But does it necessarily need to be that way?\n", 14 | "\n", 15 | "## Philosophy for this workshop\n", 16 | "Instead of a formula-based, discretely-structured classical framework for teaching statistics, we believe in the hackers' philosophy of DIY. \n", 17 | "> **I hear and I forget. I see and I remember. I do and I understand - Confucius** \n", 18 | "\n", 19 | "What is the point of learning something if you don't know how and where to apply it? We aim to bridge that gap a bit.\n", 20 | "\n", 21 | "## Data Analysis\n", 22 | "\n", 23 | "The art of data analysis\n", 24 | "\n", 25 | "> https://github.com/amitkaps/weed/blob/master/0-Intro.ipynb\n", 26 | "\n", 27 | "# Statistics: Inferring results about a population given a sample\n", 28 | "\n", 29 | "\n", 30 | "# *Where does statistics fit in Business Analytics?*\n", 31 | "\n", 32 | "> DISCUSSION\n", 33 | "\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [] 44 | } 45 | ], 46 | "metadata": { 47 | "kernelspec": { 48 | "display_name": "Python 2", 49 | "language": "python", 50 | "name": "python2" 51 | }, 52 | "language_info": { 53 | "codemirror_mode": { 54 | "name": "ipython", 55 | "version": 2 56 | }, 57 | "file_extension": ".py", 58 | "mimetype": "text/x-python", 59 | "name": "python", 60 | "nbconvert_exporter": "python", 61 | "pygments_lexer": "ipython2", 62 | "version": "2.7.10" 63 | } 64 | }, 65 | "nbformat": 4, 66 | "nbformat_minor": 0 67 | } 68 | -------------------------------------------------------------------------------- /notebooks/2. Warm-up.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "> **SOMETIMES THE QUESTIONS ARE COMPLICATED AND THE ANSWERS ARE SIMPLE **\n", 8 | "\n", 9 | ">*Dr. 
Seuss*\n", 10 | "\n", 11 | "## Coin Toss\n", 12 | "\n", 13 | "You toss a coin 30 times and see heads 24 times. Is it a fair coin?\n", 14 | "\n", 15 | "**Hypothesis 1**: A fair coin should give about 15 heads in 30 tosses, so this coin is biased\n", 16 | "\n", 17 | "**Hypothesis 2**: Come on, even a fair coin could show 24 heads in 30 tosses. This is just by chance\n", 18 | "\n", 19 | "#### Statistical Method\n", 20 | "\n", 21 | "P(H) = ? \n", 22 | "\n", 23 | "P(HH) = ?\n", 24 | "\n", 25 | "P(THH) = ?\n", 26 | "\n", 27 | "Now, slightly tougher: P(2H, 1T) = ?\n", 28 | "\n", 29 | "Generalizing, \n", 30 | "\n", 31 | "\n", 32 | "\n", 33 | "
\n", 34 | "
\n", 35 | "
\n", 36 | "
\n", 37 | "\n", 38 | "\n", 39 | "**What is the probability of getting 24 heads in 30 tosses?**\n", 40 | "\n", 41 | "The quantity we need is the probability of getting heads 24 times or more, that is, a result at least as extreme as the one observed. \n", 42 | "\n", 43 | "#### Hacker's Approach\n", 44 | "\n", 45 | "Simulation. Run the experiment with a fair coin 100,000 times. Find the percentage of times the experiment returned 24 or more heads. If it is less than 5%, such a result is unlikely for a fair coin and we conclude that the coin is biased. " 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [ 55 | { 56 | "name": "stdout", 57 | "output_type": "stream", 58 | "text": [ 59 | "Data of the Experiment: [1 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0]\n", 60 | "Heads in the Experiment: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", 61 | "Number of heads in the experiment: 15\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "import numpy as np \n", 67 | "\n", 68 | "total_tosses = 30\n", 69 | "num_heads = 24\n", 70 | "prob_head = 0.5\n", 71 | "\n", 72 | "#0 is tails. 1 is heads. Generate one experiment\n", 73 | "experiment = np.random.randint(0,2,total_tosses)\n", 74 | "print \"Data of the Experiment:\", experiment\n", 75 | "#Find the number of heads\n", 76 | "print \"Heads in the Experiment:\", experiment[experiment==1] #This will give all the heads in the array\n", 77 | "head_count = experiment[experiment==1].shape[0] #This will get the count of heads in the array\n", 78 | "print \"Number of heads in the experiment:\", head_count" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 2, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "#Now, the above experiment needs to be repeated 100 times. 
Let's write a function and put the above code in a loop\n", 90 | "\n", 91 | "def coin_toss_experiment(times_to_repeat):\n", 92 | "\n", 93 | " head_count = np.empty([times_to_repeat,1], dtype=int)\n", 94 | " \n", 95 | " for times in np.arange(times_to_repeat):\n", 96 | " experiment = np.random.randint(0,2,total_tosses)\n", 97 | " head_count[times] = experiment[experiment==1].shape[0]\n", 98 | " \n", 99 | " return head_count" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 3, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "head_count = coin_toss_experiment(100)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 4, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "array([[15],\n", 124 | " [13],\n", 125 | " [15],\n", 126 | " [16],\n", 127 | " [11],\n", 128 | " [16],\n", 129 | " [14],\n", 130 | " [16],\n", 131 | " [13],\n", 132 | " [17]])" 133 | ] 134 | }, 135 | "execution_count": 4, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "head_count[:10] " 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 5, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [ 151 | { 152 | "name": "stdout", 153 | "output_type": "stream", 154 | "text": [ 155 | "Dimensions: (100, 1) \n", 156 | "Type of object: \n" 157 | ] 158 | } 159 | ], 160 | "source": [ 161 | "print \"Dimensions:\", head_count.shape, \"\\n\",\"Type of object:\", type(head_count)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 6, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "#Let's plot the above distribution\n", 173 | "import matplotlib.pyplot as plt\n", 174 | "%matplotlib inline\n", 175 | "import seaborn as sns\n", 176 | "sns.set(color_codes = True)" 177 | ] 178 
| }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 7, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [ 186 | { 187 | "data": { 188 | "text/plain": [ 189 | "" 190 | ] 191 | }, 192 | "execution_count": 7, 193 | "metadata": {}, 194 | "output_type": "execute_result" 195 | }, 196 | { 197 | "data": { 198 | "image/png": "[base64-encoded PNG omitted: histogram of head_count over the 100 experiments]", 199 | "text/plain": [ 200 | "" 201 | ] 202 | }, 203 | "metadata": {}, 204 | "output_type": "display_data" 205 | } 206 | ], 207 
| "source": [ 208 | "sns.distplot(head_count, kde=False)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "**Exercise**: Try setting `kde=True` in the above cell and observe what happens" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "collapsed": true 223 | }, 224 | "outputs": [], 225 | "source": [] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 8, 230 | "metadata": { 231 | "collapsed": false 232 | }, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "array([], dtype=int64)" 238 | ] 239 | }, 240 | "execution_count": 8, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "#Number of times the experiment returned 24 heads.\n", 247 | "head_count[head_count>=24]" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 9, 253 | "metadata": { 254 | "collapsed": false 255 | }, 256 | "outputs": [ 257 | { 258 | "name": "stdout", 259 | "output_type": "stream", 260 | "text": [ 261 | "No of times experiment returned 24 heads or more: 0\n", 262 | "% of times with 24 or more heads: 0.0\n" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "print \"No of times experiment returned 24 heads or more:\", head_count[head_count>=24].shape[0]\n", 268 | "print \"% of times with 24 or more heads: \", head_count[head_count>=24].shape[0]/float(head_count.shape[0])*100" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "collapsed": false 276 | }, 277 | "outputs": [], 278 | "source": [] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "#### Exercise: Repeat the experiment 100,000 times. 
" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "collapsed": true 292 | }, 293 | "outputs": [], 294 | "source": [] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "# Is the coin fair?" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": { 307 | "collapsed": true 308 | }, 309 | "outputs": [], 310 | "source": [] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "### Extra pointers on numpy" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "**** Removing `for` loop in the funciton ****" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 10, 329 | "metadata": { 330 | "collapsed": false 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "def coin_toss_experiment_2(times_to_repeat):\n", 335 | "\n", 336 | " head_count = np.empty([times_to_repeat,1], dtype=int)\n", 337 | " experiment = np.random.randint(0,2,[times_to_repeat,total_tosses])\n", 338 | " return experiment.sum(axis=1)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "#### Exercise: Benchmark `coin_toss_experiment` and `coin_toss_experiment_2` for 100 and 100,000 runs and report improvements, if any" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": { 352 | "collapsed": true 353 | }, 354 | "outputs": [], 355 | "source": [] 356 | } 357 | ], 358 | "metadata": { 359 | "kernelspec": { 360 | "display_name": "Python 2", 361 | "language": "python", 362 | "name": "python2" 363 | }, 364 | "language_info": { 365 | "codemirror_mode": { 366 | "name": "ipython", 367 | "version": 2 368 | }, 369 | "file_extension": ".py", 370 | "mimetype": "text/x-python", 371 | "name": "python", 372 | "nbconvert_exporter": "python", 373 | "pygments_lexer": 
"ipython2", 374 | "version": "2.7.10" 375 | } 376 | }, 377 | "nbformat": 4, 378 | "nbformat_minor": 0 379 | } 380 | -------------------------------------------------------------------------------- /notebooks/3. Resampling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Problem\n", 8 | "The number of shoes sold by an e-commerce company during the first three months(12 weeks) of the year were:\n", 9 | "
\n", 10 | "23 21 19 24 35 17 18 24 33 27 21 23\n", 11 | "\n", 12 | "Meanwhile, the company developed some dynamic price optimization algorithms and the sales for the next 12 weeks were:\n", 13 | "
\n", 14 | "31 28 19 24 32 27 16 41 23 32 29 33\n", 15 | "\n", 16 | "Did the dynamic price optimization algorithm deliver superior results? Can it be trusted?\n", 17 | "\n", 18 | "### Solution\n", 19 | "\n", 20 | "Before we get onto different approaches, let's quickly get a feel for the data\n", 21 | "\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "import numpy as np\n", 33 | "import seaborn as sns\n", 34 | "sns.set(color_codes=True)\n", 35 | "%matplotlib inline" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "#Load the data\n", 47 | "before_opt = np.array([23, 21, 19, 24, 35, 17, 18, 24, 33, 27, 21, 23])\n", 48 | "after_opt = np.array([31, 28, 19, 24, 32, 27, 16, 41, 23, 32, 29, 33])" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 3, 54 | "metadata": { 55 | "collapsed": false 56 | }, 57 | "outputs": [ 58 | { 59 | "data": { 60 | "text/plain": [ 61 | "23.75" 62 | ] 63 | }, 64 | "execution_count": 3, 65 | "metadata": {}, 66 | "output_type": "execute_result" 67 | } 68 | ], 69 | "source": [ 70 | "before_opt.mean()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 4, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/plain": [ 83 | "27.916666666666668" 84 | ] 85 | }, 86 | "execution_count": 4, 87 | "metadata": {}, 88 | "output_type": "execute_result" 89 | } 90 | ], 91 | "source": [ 92 | "after_opt.mean()" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 5, 98 | "metadata": { 99 | "collapsed": true 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "observed_difference = after_opt.mean() - before_opt.mean()" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 6, 109 | "metadata": { 110 
| "collapsed": false 111 | }, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "Difference between the means is: 4.16666666667\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "print \"Difference between the means is:\", observed_difference" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "On average, sales after optimization are higher than sales before optimization. But is the difference real, or could it be due to chance?\n", 130 | "\n", 131 | "**Classical Method**: This entails doing a *t-test*. We will cover this method later on.\n", 132 | "\n", 133 | "**Hacker's Method**: Let's see if we can provide a hacker's perspective on this problem, similar to what we did in the previous notebook." 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "#Step 1: Create the dataset. 
Let's give Label 0 to before_opt and Label 1 to after_opt" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "#Learn about the following three functions" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "?np.append" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "collapsed": true 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "?np.zeros" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": { 184 | "collapsed": true 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "?np.ones" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 7, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "shoe_sales = np.array([np.append(np.zeros(before_opt.shape[0]), np.ones(after_opt.shape[0])),\n", 200 | "np.append(before_opt, after_opt)], dtype=int)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 8, 206 | "metadata": { 207 | "collapsed": false 208 | }, 209 | "outputs": [ 210 | { 211 | "name": "stdout", 212 | "output_type": "stream", 213 | "text": [ 214 | "Shape: (2, 24)\n", 215 | "Data: \n", 216 | "[[ 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1]\n", 217 | " [23 21 19 24 35 17 18 24 33 27 21 23 31 28 19 24 32 27 16 41 23 32 29 33]]\n" 218 | ] 219 | } 220 | ], 221 | "source": [ 222 | "print \"Shape:\", shoe_sales.shape\n", 223 | "print \"Data:\", \"\\n\", shoe_sales" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 9, 229 | "metadata": { 230 | "collapsed": false 231 | }, 232 | "outputs": [ 233 | { 234 | "name": "stdout", 235 | "output_type": "stream", 236 | "text": [ 237 | 
"Shape: (24, 2)\n", 238 | "Data: \n", 239 | "[[ 0 23]\n", 240 | " [ 0 21]\n", 241 | " [ 0 19]\n", 242 | " [ 0 24]\n", 243 | " [ 0 35]\n", 244 | " [ 0 17]\n", 245 | " [ 0 18]\n", 246 | " [ 0 24]\n", 247 | " [ 0 33]\n", 248 | " [ 0 27]\n", 249 | " [ 0 21]\n", 250 | " [ 0 23]\n", 251 | " [ 1 31]\n", 252 | " [ 1 28]\n", 253 | " [ 1 19]\n", 254 | " [ 1 24]\n", 255 | " [ 1 32]\n", 256 | " [ 1 27]\n", 257 | " [ 1 16]\n", 258 | " [ 1 41]\n", 259 | " [ 1 23]\n", 260 | " [ 1 32]\n", 261 | " [ 1 29]\n", 262 | " [ 1 33]]\n" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "shoe_sales = shoe_sales.T\n", 268 | "print \"Shape:\",shoe_sales.shape\n", 269 | "print \"Data:\", \"\\n\", shoe_sales" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 10, 275 | "metadata": { 276 | "collapsed": true 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "#This is the approach we are going to take\n", 281 | "#We are going to randomly shuffle the labels. Then compute the mean between the two groups. 
\n", 282 | "#Find the % of times when the difference between the means computed is greater than what we observed above\n", 283 | "#If the % of times is less than 5%, we would make the call that the improvements are real" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 11, 289 | "metadata": { 290 | "collapsed": true 291 | }, 292 | "outputs": [], 293 | "source": [ 294 | "np.random.shuffle(shoe_sales)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 12, 300 | "metadata": { 301 | "collapsed": false 302 | }, 303 | "outputs": [ 304 | { 305 | "data": { 306 | "text/plain": [ 307 | "array([[ 1, 29],\n", 308 | " [ 1, 24],\n", 309 | " [ 1, 19],\n", 310 | " [ 1, 16],\n", 311 | " [ 1, 28],\n", 312 | " [ 1, 41],\n", 313 | " [ 1, 27],\n", 314 | " [ 0, 18],\n", 315 | " [ 1, 33],\n", 316 | " [ 0, 24],\n", 317 | " [ 0, 21],\n", 318 | " [ 1, 32],\n", 319 | " [ 1, 31],\n", 320 | " [ 0, 23],\n", 321 | " [ 0, 19],\n", 322 | " [ 0, 17],\n", 323 | " [ 0, 27],\n", 324 | " [ 0, 21],\n", 325 | " [ 1, 32],\n", 326 | " [ 0, 23],\n", 327 | " [ 0, 24],\n", 328 | " [ 0, 33],\n", 329 | " [ 0, 35],\n", 330 | " [ 1, 23]])" 331 | ] 332 | }, 333 | "execution_count": 12, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "shoe_sales" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 13, 345 | "metadata": { 346 | "collapsed": true 347 | }, 348 | "outputs": [], 349 | "source": [ 350 | "experiment_label = np.random.randint(0,2,shoe_sales.shape[0])" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 14, 356 | "metadata": { 357 | "collapsed": false 358 | }, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "array([0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,\n", 364 | " 1])" 365 | ] 366 | }, 367 | "execution_count": 14, 368 | "metadata": {}, 369 | "output_type": "execute_result" 370 | } 371 | ], 372 | 
"source": [ 373 | "experiment_label" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 15, 379 | "metadata": { 380 | "collapsed": false 381 | }, 382 | "outputs": [ 383 | { 384 | "name": "stdout", 385 | "output_type": "stream", 386 | "text": [ 387 | "[[ 0 29]\n", 388 | " [ 1 24]\n", 389 | " [ 1 19]\n", 390 | " [ 1 16]\n", 391 | " [ 1 28]\n", 392 | " [ 1 41]\n", 393 | " [ 0 27]\n", 394 | " [ 0 18]\n", 395 | " [ 0 33]\n", 396 | " [ 0 24]\n", 397 | " [ 0 21]\n", 398 | " [ 1 32]\n", 399 | " [ 0 31]\n", 400 | " [ 1 23]\n", 401 | " [ 0 19]\n", 402 | " [ 1 17]\n", 403 | " [ 0 27]\n", 404 | " [ 1 21]\n", 405 | " [ 1 32]\n", 406 | " [ 0 23]\n", 407 | " [ 1 24]\n", 408 | " [ 1 33]\n", 409 | " [ 0 35]\n", 410 | " [ 1 23]]\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "experiment_data = np.array([experiment_label, shoe_sales[:,1]])\n", 416 | "experiment_data = experiment_data.T\n", 417 | "print experiment_data" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 16, 423 | "metadata": { 424 | "collapsed": false 425 | }, 426 | "outputs": [], 427 | "source": [ 428 | "experiment_diff_mean = experiment_data[experiment_data[:,0]==1].mean() \\\n", 429 | " - experiment_data[experiment_data[:,0]==0].mean()" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": 17, 435 | "metadata": { 436 | "collapsed": false 437 | }, 438 | "outputs": [ 439 | { 440 | "data": { 441 | "text/plain": [ 442 | "0.26223776223776341" 443 | ] 444 | }, 445 | "execution_count": 17, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [ 451 | "experiment_diff_mean" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": 18, 457 | "metadata": { 458 | "collapsed": true 459 | }, 460 | "outputs": [], 461 | "source": [ 462 | "#Like the previous notebook, let's repeat this experiment 100 and then 100000 times" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | 
"execution_count": 19, 468 | "metadata": { 469 | "collapsed": true 470 | }, 471 | "outputs": [], 472 | "source": [ 473 | "def shuffle_experiment(number_of_times):\n", 474 | " experiment_diff_mean = np.empty([number_of_times,1])\n", 475 | " for times in np.arange(number_of_times):\n", 476 | " experiment_label = np.random.randint(0,2,shoe_sales.shape[0])\n", 477 | " experiment_data = np.array([experiment_label, shoe_sales[:,1]]).T\n", 478 | " experiment_diff_mean[times] = experiment_data[experiment_data[:,0]==1].mean() \\\n", 479 | " - experiment_data[experiment_data[:,0]==0].mean()\n", 480 | " return experiment_diff_mean " 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 20, 486 | "metadata": { 487 | "collapsed": false 488 | }, 489 | "outputs": [], 490 | "source": [ 491 | "experiment_diff_mean = shuffle_experiment(100)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 21, 497 | "metadata": { 498 | "collapsed": false 499 | }, 500 | "outputs": [ 501 | { 502 | "data": { 503 | "text/plain": [ 504 | "array([[ 1.83333333],\n", 505 | " [ 0.7 ],\n", 506 | " [-0.33333333],\n", 507 | " [ 0.54444444],\n", 508 | " [-1.0625 ],\n", 509 | " [ 0.61428571],\n", 510 | " [ 3.36713287],\n", 511 | " [ 1.3 ],\n", 512 | " [ 0.3 ],\n", 513 | " [ 0.66666667]])" 514 | ] 515 | }, 516 | "execution_count": 21, 517 | "metadata": {}, 518 | "output_type": "execute_result" 519 | } 520 | ], 521 | "source": [ 522 | "experiment_diff_mean[:10]" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 22, 528 | "metadata": { 529 | "collapsed": false 530 | }, 531 | "outputs": [ 532 | { 533 | "data": { 534 | "text/plain": [ 535 | "" 536 | ] 537 | }, 538 | "execution_count": 22, 539 | "metadata": {}, 540 | "output_type": "execute_result" 541 | }, 542 | { 543 | "data": { 544 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXEAAAECCAYAAAAIMefLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEARJREFUeJzt3X+M5Hddx/Hn7l5vZndm7kpyK2psIAH5BKMoVIIgoW0K\nBqKmQuAPU4kgBBASiqINVFJjgkjE1kCsRBCoYsTQpkDQADVn0yYkNAcWFDHv/iAkog3uHaWdmdvd\n+zHjHztn7sruzo+dme++756PpMn8+H6/n1e/O/ea73xnvt/vQr/fR5KU02LVASRJk7PEJSkxS1yS\nErPEJSkxS1ySErPEJSmxA7s9WUq5DPg48DSgBrwX+C7wj8CDg8k+HBGfnmVISdL2di1x4HpgLSJe\nW0p5CvAN4I+AWyLi1pmnkyTtaliJ3wHcObi9CJwGrgRKKeU64CHgHRHRmV1ESdJOFkY5YrOU0gI+\nB3wEqAPfiIgHSik3AU+JiN+fbUxJ0naGfrFZSrkC+BfgbyPiH4DPRMQDg6c/Czx3hvkkSbsY9sXm\nU4G7gbdGxD2Dh79YSnl7RBwDrgW+OmyQfr/fX1hY2HNYSbqEjFSau+5OKaV8EHgNEOc9/C7gFrb2\njz8KvGmEfeL9tbX2KHn2pdXVFlnzZ84O5q+a+auzutoaqcR33RKPiBuAG7Z56sWThJIkTZcH+0hS\nYpa4JCVmiUtSYpa4JCVmiUtSYpa4JCVmiUtSYpa4JCVmiUtSYpa4JCVmiUtSYpa4JCVmiUtSYsMu\nz6aLUK/Xo9vtVh0DgEajweKi2xLSpCzxS1C32+XosUeo1ZcrzbG5sc61z38GrVar0hxSZpb4JapW\nX2Z5pVl1DEl75OdYSUrMEpekxCxxSUrMEpekxCxxSUrMEpekxCxxSUrMEpekxCxxSUrMEpekxCxx\nSUrMEpekxCxxSUrMEpekxCxxSUrM84nrkjfLKx3Van3a7c7I03ulI43LEtclb5ZXOmo26nS6GyNN\n65WONAlLXGJ2VzpaadQ52/efmWbHz22SlJglLkmJWeKSlNiuO+tKKZcBHweeBtSA9wL/CdwO9IBv\nAm+LiP5sY0qStjNsS/x6YC0iXgK8HLgNuAW4afDYAnDdbCNKknYyrMTvAG4+b9rTwPMi4r7BY18A\nXjqjbJKkIXbdnRIRXYBSSoutQn8P8GfnTdIBDs8snSRpV0N/wFpKuQK4C7gtIj5VSvnT855uAT8Y\nZaDV1dwHMGTO/+TstVqfZqPOSqNeUaItSwtnOHKkyaFDu6/bWa/7Wa+PVnO05Y66PuYt82sf8ucf\nZtgXm08F7gbeGhH3DB5+oJRyVUTcC7wCODrKQGtr7T0FrdLqaitt/u2yt9sdOt2Nyg9CWT+5wfHj\nHTY3F3acZh7rfpbro9Ws0+6MdsTmKOtj3jK/9iF3/lHffIa9am9ia3fJzaWUc/vGbwA+VEo5CHwL\nuHPSkJKkvRm2T/wGtkr7ya6eSRpJ0lg82EeSErPEJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySErPE\nJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySErPEJSkx\nS1ySErPEJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySErPEJSkxS1ySEjtQdQBdunq9\nHp1OZ9dparU+7fbu0+xVp9Oh3+vPdAxpVixxVebU5jr3ff0JDh2+fMdpmo06ne7GTHM8/tgJ6itN\nVmY6ijQblrgqVasvs7zS3PH5lUads/3Zvkw31rszXb40S+4Tl6TELHFJSmykz6mllBcA74+Ia0op\nzwU+Dzw0ePrDEfHpWQWUJO1saImXUm4EfgM49xOBK4FbI+LWWQaTJA03yu6Uh4FXAQuD+1cCv1xK\nubeU8tellJ2/lZIkzdTQEo+Iu4Az5z10P/B7EXEV8G3gD2eUT
ZI0xCRfbH4mIh4Y3P4s8Nwp5pEk\njWGSH+B+sZTy9og4BlwLfHWUmVZXWxMMtX9kzv/k7LVan2ajzkqjXlGiLafW6ywuLdFq7p5j2PPz\nyjGpUZe7tHCGI0eaHDq0v15rmV/7kD//MOOU+Lnjkt8C3FZKOQ08CrxplJnX1tpjRts/VldbafNv\nl73d7tDpbsz8IJphuic3WFg8wIGDOx+R2WrWaXdme8TmKDkmNU7+9ZMbHD/eYXNzYfjEc5L5tQ+5\n84/65jPSv+KI+A7wosHtbwAvnjSYJGl6PNhHkhKzxCUpMUtckhKzxCUpMUtckhKzxCUpMUtckhKz\nxCUpMUtckhKzxCUpMUtckhKzxCUpMUtckhKzxCUpsWpPKC3p//V6PTqdzvAJ56DRaLC46DZeBpa4\ntE+c2lznvq8/waHDl1eaY3NjnWuf/wxarYv7ijgXC0tc2kdq9WWWV5pVx1Aifl6SpMQscUlKzBKX\npMQscUlKzBKXpMQscUlKzJ8YzlGv16Pb7c51zFqtT7t94QEknU6Hfq8/1xySZsMSn6Nut8vRY49Q\nqy/Pbcxmo06nu3HBY48/doL6SpOVuaWQNCuW+JzN+2COlUads/0L/8wb6/P9NCBpdtwnLkmJWeKS\nlJglLkmJWeKSlJglLkmJWeKSlJglLkmJWeKSlJglLkmJWeKSlJglLkmJjXTulFLKC4D3R8Q1pZRn\nArcDPeCbwNsiwlPiSVIFhm6Jl1JuBD4K1AYP3QrcFBEvARaA62YXT5K0m1F2pzwMvIqtwgZ4XkTc\nN7j9BeClswgmSRpuaIlHxF3AmfMeWjjvdgc4PO1QkqTRTPLFZu+82y3gB1PKIkka0yQXhXiglHJV\nRNwLvAI4OspMq6utCYbaP6aRv1br02zUWWnUp5BodK3mheOdWq+zuLT0Q4/P26g5Zp1z1utj1OXu\nl7/L0sIZjhxpcujQ1mvef7v72zglfu4XKO8EPlpKOQh8C7hzlJnX1tpjRts/VldbU8nfbnfodDd+\n6Eo7s9Rq1ml3Lrw8W/fkBguLBzhwcGOHueZjlBzb5a8ix6TGyb9f/i7rJzc4frzD5ubC1F77Vcmc\nf9Q3n5HaJCK+A7xocPsh4OoJc0mSpsiDfSQpMUtckhKzxCUpMUtckhKzxCUpMUtckhKzxCUpMUtc\nkhKzxCUpMUtckhKzxCUpMUtckhKzxCUpMUtckhKzxCUpMUtckhKzxCUpMUtckhKzxCUpMUtckhKz\nxCUpMUtckhI7UHUASXqyXq9Ht9vd83JqtT7tdmdPy2g0Giwu7t/tXUtc0r7T7XY5euwRavXlPS2n\n2ajT6W5MPP/mxjrXPv8ZtFqtPeWYJUtc0r5Uqy+zvNLc0zJWGnXO9i/umtu/nxEkSUNZ4pKUmCUu\nSYlZ4pKUmCUuSYlZ4pKUmCUuSYlZ4pKUmCUuSYlZ4pKUmCUuSYlZ4pKU2MRnhiml/Cvw+ODutyPi\nDdOJJEka1UQlXkqpA0TENdONI0kax6Rb4j8LrJRSvjRYxk0Rcf/0YkmSRjFpiXeBD0TEx0opPwl8\noZTyrIjoTTGbpAr0ej06na2r4UzjyjiT6HQ69Hv9uY+b0aQl/iDwMEBEPFRKOQH8GPDfO82wurp/\nr4wximnkr9X6NBt1Vhr1KSQaXat54Xin1ussLi390OPzNmqOWeec9foYdbn75+/yOF978HscvvwU\nfPv7lWR47LE1VlZaU1kXe1nG0sIZjhxpcujQ/u2vSUv89cBzgLeVUn4cOAQ8utsMa2vtCYeq3upq\nayr52+0One7GXK800mrWaXcuvDxV9+QGC4sHOHBw8stWTcMoObbLX0WOSY2Tf7/9Xc72D8xl/W/n\n7NlFOt3NPa+LveZfP7nB8eMdNjcX9pRjEqNuOE7aJh8DPlFKuW9w//XuSpGk+ZuoxCPiDPDaKWeR\nJI3Jg30kKTFLXJISs8QlK
TFLXJISs8QlKTFLXJISs8QlKTFLXJISs8QlKTFLXJISs8QlKTFLXJIS\ns8QlKTFLXJISs8QlKTFLXJISs8QlKTFLXJISs8QlKTFLXJISs8QlKTFLXJISOzCPQb73v2ucONGZ\nx1A7qh2scehQq9IMkjRtcynx+7/1fdrdjXkMtaPWgU1e+LxnV5pBkqZtLiV+sFbj4On+PIba0SKn\nKx1fkmbBfeKSlJglLkmJWeKSlJglLkmJWeKSlJglLkmJzeUnhvtBr9ej3W5PNG+t1qfd3vvBSp1O\nh36v2p9aSrq4XDIlvrmxztFjj1CrL489b7NRpzOFg5Uef+wE9ZUmK3tekiRtuWRKHKBWX2Z5pTn2\nfCuNOmf7e19VG+vdPS9Dks7nPnFJSswSl6TEJtpHUEpZBP4SeA6wCbwxIh6ZZjBJ0nCTbon/GnAw\nIl4EvAu4ZXqRJEmjmrTEfxH4IkBE3A/8/NQSSZJGNmmJHwKeOO/+2cEuFknSHE36u7kngPMvk7MY\nEb2dJt7snGCzU+1FIS7rn2bz9PpE8y4tnGH95N7zb25ssLC4xPrJ+V3laLvsVeTYzig5prXu95pj\nUuPk349/l3ms/2EZ9mKv+Tc3JuuMeZq0xL8M/CpwRynlF4B/223i61525cKE40iSdjFpiX8GeFkp\n5cuD+6+fUh5J0hgW+n3P5SFJWfllpCQlZolLUmKWuCQlZolLUmIzPxVtKaUB/D1wOXAK+M2I+J9Z\njzstpZTDwN+x9bv4g8DvRsRXqk01mVLKK4FXR8T1VWcZ5mI5P08p5QXA+yPimqqzjKqUchnwceBp\nQA14b0R8vtpUoyulLAEfBZ4F9IG3RMR/VJtqfKWUHwG+BlwbEQ/uNN08tsTfCByLiKvYKsMb5zDm\nNP0O8M8RcTXwOuC2StNMqJTyQeB9QJbf7Kc/P08p5Ua2yqRWdZYxXQ+sRcRLgJcDf1FxnnH9CtCL\niBcD7wH+uOI8Yxu8kf4VMPQiBDMv8Yg4Vx6w9c7+2KzHnLI/Bz4yuH0ZsP8P4drel4HfJk+JXwzn\n53kYeBV51vk5dwA3D24vAmcqzDK2iPgc8ObB3aeTr3MAPgB8GHh02IRT3Z1SSnkD8I4nPfy6iPha\nKeUo8NPAL01zzGkakv9HgU8CN8w/2eh2+X/4dCnl6goiTWrb8/PsdnqH/SYi7iqlPL3qHOOKiC5A\nKaXFVqH/QbWJxhcRZ0sptwOvBF5dcZyxlFJex9YnobtLKe9myEbAXA/2KaUU4J8i4plzG3QKSik/\nA3wKeGdEfKnqPJMalPibI+LXq84yTCnlFuArEXHH4P5/RcQVFcca26DEPxURL6w6yzhKKVcAdwG3\nRcTtFceZWCnlqcD9wLMjIsWn6FLKvWzty+8DPwcEcF1EfG+76efxxea7ge9GxCfZ2r+T6qNZKeWn\n2NoaeU1E/HvVeS4hY52fR9MzKL67gbdGxD1V5xlXKeW1wE9ExJ+wtfuzN/gvhcH3hwCUUu5ha8Nr\n2wKH+Vwo+WPA35RSfgtYIt95Vt7H1q9SPrT1QYIfRMQrq400sXPv7hlcTOfnybLOz7kJOAzcXEo5\nt2/8FRFR7alIR3cncPtgi/Yy4IaI2Kw408x47hRJSsyDfSQpMUtckhKzxCUpMUtckhKzxCUpMUtc\nkhKzxCUpMUtckhL7P8BDb+El45uCAAAAAElFTkSuQmCC\n", 545 | "text/plain": [ 546 | "" 547 | ] 548 | }, 549 | "metadata": {}, 550 | "output_type": "display_data" 551 | } 552 | ], 553 | "source": [ 554 | "sns.distplot(experiment_diff_mean, kde=False)" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | 
"execution_count": 23, 560 | "metadata": { 561 | "collapsed": false 562 | }, 563 | "outputs": [ 564 | { 565 | "name": "stdout", 566 | "output_type": "stream", 567 | "text": [ 568 | "Data: Difference in mean greater than observed: []\n", 569 | "Number of times diff in mean greater than observed: 0\n", 570 | "% of times diff in mean greater than observed: 0.0\n" 571 | ] 572 | } 573 | ], 574 | "source": [ 575 | "#Finding % of times difference of means is greater than observed\n", 576 | "print \"Data: Difference in mean greater than observed:\", \\\n", 577 | " experiment_diff_mean[experiment_diff_mean>=observed_difference]\n", 578 | "\n", 579 | "print \"Number of times diff in mean greater than observed:\", \\\n", 580 | " experiment_diff_mean[experiment_diff_mean>=observed_difference].shape[0]\n", 581 | "print \"% of times diff in mean greater than observed:\", \\\n", 582 | " experiment_diff_mean[experiment_diff_mean>=observed_difference].shape[0]/float(experiment_diff_mean.shape[0])*100" 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": {}, 588 | "source": [ 589 | "#### Exercise: Repeat the above for 100,000 runs and report the results" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": null, 595 | "metadata": { 596 | "collapsed": true 597 | }, 598 | "outputs": [], 599 | "source": [] 600 | }, 601 | { 602 | "cell_type": "markdown", 603 | "metadata": {}, 604 | "source": [ 605 | "# Is the result by chance? " 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "### What is the justification for shuffling the labels? \n", 613 | "\n", 614 | ">Thought process is this: If price optimization had no real effect, then, the sales before optimization would often give more sales than sales after optimization. By shuffling, we are simulating the situation where that happens - sales before optimization is greater than sales after optimization. 
If many such shuffled trials show a difference at least as large as the one we observed, then the price optimization probably has no real effect. In statistical terms, *the observed difference could have occurred by chance*. \n", 615 | "\n", 616 | "Now, to show that the same difference in mean might lead to a different conclusion, let's try the same experiment with a different dataset. " 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 24, 622 | "metadata": { 623 | "collapsed": true 624 | }, 625 | "outputs": [], 626 | "source": [ 627 | "before_opt = np.array([230, 210, 190, 240, 350, 170, 180, 240, 330, 270, 210, 230])\n", 628 | "after_opt = np.array([310, 180, 190, 240, 220, 240, 160, 410, 130, 320, 290, 210])" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 25, 634 | "metadata": { 635 | "collapsed": false 636 | }, 637 | "outputs": [ 638 | { 639 | "name": "stdout", 640 | "output_type": "stream", 641 | "text": [ 642 | "Mean sales before price optimization: 237.5\n", 643 | "Mean sales after price optimization: 241.666666667\n", 644 | "Difference in mean sales: 4.16666666667\n" 645 | ] 646 | } 647 | ], 648 | "source": [ 649 | "print \"Mean sales before price optimization:\", np.mean(before_opt)\n", 650 | "print \"Mean sales after price optimization:\", np.mean(after_opt)\n", 651 | "print \"Difference in mean sales:\", np.mean(after_opt) - np.mean(before_opt) #Same as above" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": 26, 657 | "metadata": { 658 | "collapsed": true 659 | }, 660 | "outputs": [], 661 | "source": [ 662 | "shoe_sales = np.array([np.append(np.zeros(before_opt.shape[0]), np.ones(after_opt.shape[0])),\n", 663 | "np.append(before_opt, after_opt)], dtype=int)\n", 664 | "shoe_sales = shoe_sales.T" 665 | ] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "execution_count": 27, 670 | "metadata": { 671 | "collapsed": false 672 | }, 673 | "outputs": [ 674 | { 675 | "data": { 676 | "text/plain": [ 677 | "" 678 | ] 679 | }, 680 | 
"execution_count": 27, 681 | "metadata": {}, 682 | "output_type": "execute_result" 683 | }, 684 | { 685 | "data": { 686 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAECCAYAAAAW+Nd4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGShJREFUeJzt3W+MXNd53/Hv7FI7s7szS7/g2IpUI6nd6KmSlHYkE64l\ng6QgRbIEG0oDREYUuJKRSLXMCHIrIK23igAbVNTEkdoSdpiEdEIpKlzAhJPYJUTJJQIuvalCxpGU\nqHIem0qcwEbS7tIkd2Y4M8vdmb6Yu83szv/d2fl3fh+A0M5z7869RzP7u3fOOXNvrFwuIyIiYRnr\n9w6IiEjvKfxFRAKk8BcRCZDCX0QkQAp/EZEAKfxFRAK0o9UKZvYA8GD0cBJ4D/BB4L8CJeAN4IC7\nl83sIeBhYAU46O4nzGwSeAFIAxngAXdf7HZDRESkfbFO5vmb2eeB14CPAM+4+5yZHQZeAl4BXgZu\npnKQ+AbwPuCXgKS7f9bMPgp8wN0/1d1miIhIJ9ru9jGz9wE/5u5HgZvdfS5a9CJwB7AHmHf3q+6+\nBJwHdgO3AiejdU9G64qISB910uc/C3wm+jlWVc8AO4EZ4HKD+tKGmoiI9FFb4W9mbwNucPfTUalU\ntXgGuEQl4FNV9VSd+lpNRET6qOWAb2QvcKrq8atmti86GNwdLTsLPGVmcSAB3EhlMHgeuAc4F607\nRxPlcrkci8WarSIiIrU6Cs52w/8G4K2qx48DR8xsAngTOB7N9jkEnKHyiWLW3YvRgPBzZnYGKAL3\nN937WIyFhUwnbRgq6XRK7RtSo9w2UPuGXTqdar1SlY5m+/RIedRfILVvOI1y20DtG3bpdKqjM399\nyUtEJEAKfxGRACn8RUQC1O6Ar4gApVKJXC5Xd9n09DRjYzqfkuGg8BfpQC6X49S5t4gnJtfVi4U8\nt+95N6lUZzMuRPpF4S9SR70z/Hi8TDabZWIiweRUsk97JtIdCn+ROuqd4SenE3z/e98nMZVkqo/7\nJtINCn+RBuKJyXVn+FPTCeKJRB/3SKR7NDolIhIghb+ISIAU/iIiAVL4i4gESOEvIhIghb+ISIAU\n/iIiAVL4i4gESOEvIhIghb+ISIAU/iIiAVL4i4gESOEvIhIghb+ISIB0SWeRbaTbPsqgUvjLyOtn\nAOu2jzKoWoa/mX0a+AhwDfB5YB44BpSAN4AD7l42s4eAh4EV4KC7nzCzSeAFIA1kgAfcfXE7GiLS\nSL8DeONNYUQGQdNTHjPbD3zA3W8B9gPvAp4BZt19LxAD7jWza4FHgVuAu4CnzWwCeAR4PVr3eeCJ\nbWqHSFNrAVz9b+PBQCQkrT7v3gn8pZn9IfA14KvAze4+Fy1/EbgD2APMu/tVd18CzgO7gVuBk9G6\nJ6N1RUSkz1p1+6SBdwIfpnLW/zUqZ/trMsBOYAa43KC+tKEmIiJ91ir8F4FvufsK8G0zKwDXVy2f\nAS5RCfjqjtNUnfparaV0erQHwdS+3orHyySnE0xNr7/5+nhshV27kszM1O5vo9+ZnkowNj5OKtne\nc21m2/00aK9dt416+zrRKvy/ATwGPGtm1wFTwCkz2+fup4G7gVPAWeApM4sDCeBGKoPB88A9wLlo\n3bnaTdRaWMhsoinDIZ1OqX09lslkyeYKrJbXv93zVwosLmYpFmNt/U4qmSB3pUBsbAc7JgptPddm\ntt0vg/jadVMI7etE0/CPZuzsNbOzVMYHPgl8FzgSDei+CRyPZvscAs5E6826e9HMDgPPmdkZoAjc\n32mDRESk+1pO9XT3f1+nvL/OekeBoxtqeeC+ze6ciIhsD33JS6QPSqUS2Wy24XJ9+1e2m8
JfpA+W\ni3nmXltiZufbapbp27/SCwp/kT7RN3+ln/S5UkQkQAp/EZEAKfxFRAKk8BcRCZDCX0QkQAp/EZEA\naaqnjIxGd+zKZrOUS+U+7JHI4FL4y8hodMeuyxcvkJhKMtWn/RIZRAp/GSn1vjhVyNe/f69IyNTn\nLyISIIW/iEiAFP4iIgFS+IuIBEjhLyISIIW/iEiAFP4iIgFS+IuIBEjhLyISIIW/iEiAFP4iIgFS\n+IuIBKitC7uZ2Z8Dl6OHfw08DRwDSsAbwAF3L5vZQ8DDwApw0N1PmNkk8AKQBjLAA+6+2NVWiIhI\nR1qGv5klANz9tqraV4FZd58zs8PAvWb2CvAocDMwCXzDzL4OPAK87u6fNbOPAk8An+p+U0Q6UyqV\nyGazdZfpHgAy6to5838PMGVmL0Xr/0fgJnefi5a/CNwJrALz7n4VuGpm54HdwK3Ar0XrngR+pYv7\nL7Jpy8U8c68tMbPzbTXLdA8AGXXthH8O+Jy7f9HMfpRKgFfLADuBGf6xa2hjfWlDTWQg1Lv+P+ge\nADL62gn/bwPnAdz9O2Z2AfjJquUzwCUqAZ+qqqfq1NdqTaXTqVarDDW1b3vE42WS0wmmphPr6sv5\nBGPj46SS7dWbLZueql8fj62wa1eSmZn1be90n5o9Vy/ovRmOdsL/41S6bw6Y2XVUAvxlM9vn7qeB\nu4FTwFngKTOLAwngRiqDwfPAPcC5aN252k2st7CQ2URThkM6nVL7tkkmkyWbK7BaXv+2zl0pEBvb\nwY6JQlv1RstSyUTD38lfKbC4mKVYjG1pn5o913bTe3O4dXpgayf8vwj8npmthfbHgQvAETObAN4E\njkezfQ4BZ6hMIZ1192I0IPycmZ0BisD9He2hBKnRzdjXTE9PMzammcoim9Uy/N19BfhYnUX766x7\nFDi6oZYH7tvk/kmgGt2MHaBYyHP7nneTSukjvMhm6QbuMrAaDcYOokbTRjVlVAaVwl+kCxpNG9WU\nURlUCn+RLqn3SUVTRmVQacRMRCRAOvMXGRLNZkBp9pN0SuEvMiQazYDS7CfZDIW/yBAZphlQMtj0\nOVFEJEAKfxGRACn8RUQCpPAXEQmQwl9EJEAKfxGRACn8RUQCpPAXEQmQwl9EJEAKfxGRAOnyDjJ0\ndOMUka1T+MvQ0Y1TRLZO4S9DaZRvnKJPNtILCn+RAaNPNtILCn+RATTKn2xkMGi2j4hIgBT+IiIB\naqvbx8zeDnwTuB0oAcei/74BHHD3spk9BDwMrAAH3f2EmU0CLwBpIAM84O6LXW+FiIh0pOWZv5ld\nA/w2kANiwLPArLvvjR7fa2bXAo8CtwB3AU+b2QTwCPB6tO7zwBPb0goREelIO90+nwMOA38fPb7J\n3eein18E7gD2APPuftXdl4DzwG7gVuBktO7JaF0REemzpuFvZg8CC+7+clSKRf/WZICdwAxwuUF9\naUNNRET6rFWf/8eBspndAbwXeI5K//2aGeASlYBPVdVTdeprtZbS6VTrlYaY2tdaPF4mOZ1gajpR\ns2w5n2BsfJxUMrEt9WbLpqf6t+1G9fHYCrt2JZmZ2fr/d703w9E0/N1939rPZvbHwCeAz5nZPnc/\nDdwNnALOAk+ZWRxIADdSGQyeB+4BzkXrztGGhYVM5y0ZEul0Su1rQyaTJZsrsFqufYvmrhSIje1g\nx0RhW+qNlqWSib5tu1k9f6XA4mKWYjHGVui9Odw6PbB1+iWvMvA4cCQa0H0TOB7N9jkEnKHSlTTr\n7kUzOww8Z2ZngCJwf4fbE5EWGl0OAmB6epqxMc3ollpth7+731b1cH+d5UeBoxtqeeC+ze6ciLTW\n6HIQxUKe2/e8m1RKXR1SS5d3EBkB9S4HIdKMwl/6qlQqkcvVXrNGV7AU2V4Kf+mrXC7HqXNvEU9M\nrqvrCpYi20vhL32nK1iK9J6mAYiIBEjhLyISIIW/iE
iAFP4iIgFS+IuIBEjhLyISIIW/iEiAFP4i\nIgFS+IuIBEjhLyISIIW/iEiAFP4iIgFS+IuIBEjhLyISIIW/iEiAFP4iIgFS+IuIBEjhLyISIIW/\niEiAFP4iIgFqeQN3MxsHjgA3AGXgE0AROAaUgDeAA+5eNrOHgIeBFeCgu58ws0ngBSANZIAH3H1x\nG9oiIiJtaufM/8NAyd0/CDwB/CrwDDDr7nuBGHCvmV0LPArcAtwFPG1mE8AjwOvRus9HzyEiIn3U\nMvzd/Y+AfxM9/BHgInCzu89FtReBO4A9wLy7X3X3JeA8sBu4FTgZrXsyWldERPqoZbcPgLuvmtkx\n4KeBnwV+qmpxBtgJzACXG9SXNtQkMKVSiVwuV1PPZrOUS+U+7JFI2NoKfwB3f9DM3gGcBRJVi2aA\nS1QCPlVVT9Wpr9WaSqdTrVYZaiG2b2lpiVPnvkciMbWufvHiAlNTKVLJxLr6cj7B2Ph4Tb3Zsm7V\nmy2bnurftjutx8rLxONl4vH6B9dkMsnY2PoP/yG+N0PVzoDvx4B/4u5PA3lgFfgzM9vn7qeBu4FT\nVA4KT5lZnMrB4UYqg8HzwD3AuWjdudqtrLewkNlca4ZAOp0Ksn2ZTJaV1TFWy+vfcqurY2RzRXZM\nFNbVc1cKxMZ21NSbLetWvdGyVDLRt21vpn7xBxf5g1P/l5mdb6vZRrGQ5/Y97yaV+scwDPW9OSo6\nPbC1c+Z/HDhmZqeBa4DHgL8CjkQDum8Cx6PZPoeAM1TGEmbdvWhmh4HnzOwMlVlC93e0hyKyafHE\nJJNTyX7vhgygluHv7nngo3UW7a+z7lHgaJ3fv2+T+yci26BUKpHNZtfV4vEymUyW6enpmu4gGT1t\n9/mLtKNUKpHJ1H601sDuYFku5pl7bWldl1ByOsGFCxdruoNkNCn8pauy2Synzr1FPDG5rn754gUS\nU0mmGvye9N7GLqGp6QTZXO0YhIwmhb90Xb1+5kK+dpqniPSPOvZERAKk8BcRCZDCX0QkQAp/EZEA\nKfxFRAKk8BcRCZDCX0QkQAp/EZEAKfxFRAKk8BcRCZDCX0QkQAp/EZEAKfxFRAKk8BcRCZDCX0Qk\nQAp/EZEA6WYu0rFSqUQuV//mLMvLZd2uUWQIKPylY7lcru6tGgGuLmcolSd0u8YhVe/G7mt0Y/fR\novCXTal3q0aA8dgqVworfdgj6YZ6N3YHKBbyurH7iFH4i8g6jQ7sMlr0GU5EJEBNz/zN7Brgd4Ef\nBuLAQeBbwDGgBLwBHHD3spk9BDwMrAAH3f2EmU0CLwBpIAM84O6L29QWERFpU6sz/58HFtx9L/Ah\n4AvAM8BsVIsB95rZtcCjwC3AXcDTZjYBPAK8Hq37PPDE9jRDREQ60Sr8vww8WbXuVeAmd5+Lai8C\ndwB7gHl3v+ruS8B5YDdwK3AyWvdktK6IiPRZ024fd88BmFmKyoHgCeA3qlbJADuBGeByg/rShpqI\niPRZy9k+ZvZO4CvAF9z9S2b261WLZ4BLVAK+eg5Yqk59rdZSOj3a08mGvX3xeJnkdIKp6UTNsgv5\nyySn46SS65ct5xOMjY9vud7N59rMNqan+rftfrZvPLbCrl1JZmaG+7077H973dRqwPcdwMvAJ939\nj6Pyq2a2z91PA3cDp4CzwFNmFgcSwI1UBoPngXuAc9G6c7RhYSGziaYMh3Q6NfTty2SyZHMFVsv1\n3z7ZXJEdE4V1tdyVArGxHVuud/O5Ot1GKpno27b73b78lQKLi1mKxVjNcw2LUfjba6bTA1urM/9Z\nKl01T5rZWt//Y8ChaED3TeB4NNvnEHCGytjArLsXzeww8JyZnQGKwP0d7Z2IiGyLVn3+j1EJ+432\n11n3KHB0Qy0P3LeF/RMRkW2gL3mJiARI4S8iEiCFv4hIgBT+IiIBUviLiARI4S8iEiCFv4hIgBT+\nIiIBUviLiARIt3
GUhkqlErlcrqaezWYpl8p92CPpF93YffQo/KWhXC7HqXNvEU9MrqtfvniBxFSS\nqT7tl/Sebuw+ehT+0lS9m3kX8rWfBmT06cbuo0Wf1UREAqTwFxEJkMJfRCRACn8RkQAp/EVEAqTw\nFxEJkMJfRCRACn8RkQDpS16iyziIBEjhL7qMg2xas2v+gK77M8gU/gLoMg6yOY2u+QO67s+gU/iL\nyJbomj/Dqa3wN7P3A//J3W8zs38GHANKwBvAAXcvm9lDwMPACnDQ3U+Y2STwApAGMsAD7r64De0Q\nEZEOtOyMM7NfBo4A8aj0LDDr7nuBGHCvmV0LPArcAtwFPG1mE8AjwOvRus8DT3S/CSIyiNbGAzKZ\nTM2/UqnU790LXjtn/ueBnwF+P3p8k7vPRT+/CNwJrALz7n4VuGpm54HdwK3Ar0XrngR+pVs7LiKD\nTfcAGGwtw9/dv2JmP1JVilX9nAF2AjPA5Qb1pQ016RNN6ZRe03jA4NrMgG/157UZ4BKVgK8+jKfq\n1NdqLaXTo31G0K/2LS0tcerc90gk1k/evHhxgampFKlkYl19OZ9gbHy87TrAhfxlktPxLT9Xs210\n67k2s43pqf5te5ja12zZeGyFXbuSzMz0/u9g1LOlE5sJ/1fNbJ+7nwbuBk4BZ4GnzCwOJIAbqQwG\nzwP3AOeidefqP+V6CwuZTezWcEinU31rXyaTZWV1jNXy+pd9dXWMbK7IjonCunruSoHY2I6262u6\n8VzNttGt5+p0G6lkom/bHrb2NVuWv1JgcTFLsRir+Z3t1M+/vV7o9MDWybcv1voFHgc+Y2Z/QuXg\ncdzd/w9wCDhD5WAw6+5F4DDw42Z2BvhF4DMd7Z2IiGyLts783f27VGby4O7fAfbXWecocHRDLQ/c\nt9WdFBGR7tL3rkVEAqTwFxEJkMJfRCRACn8RkQAp/EVEAqTwFxEJkMJfRCRACn8RkQDpZi4i0lPN\nbv2o2z72jsJ/BOnqnTLIdKnnwaDwH0G6IbsMOl3quf8U/iNKN2QXkWbUuSYiEiCd+Q+pRv36oL59\nGU4aCO4thf+QatSvD+rbl+GkgeDeUvgPsUaDZurbl2GlgeDe0ecoEZEA6cx/wGnOvohsB4X/gNOc\nfQmdBoK3h8J/CGjOvoRMA8HbQ+EvIgNPA8Hdp89LIiIB0pn/gNDArkhnmo0FgMYDWtn28DezMeA3\ngd1AEfhFd39ru7c7bDSwK9KZRmMBoPGAdvTizP+ngQl3v8XM3g88E9WCVCqVyGQyNfVsNsvEREID\nuyIdaDQWUO9TQTxeJpPJ6hNBpBfhfytwEsDd/9TM3teDbfZVs+vuLC8v8T//9C0SU+vP5XWGL9I9\n9T4VJKcTLCxc4AM//kMkk7UHjNAOCr0I/xlgqerxqpmNuXupB9vuikZhXipVmrDxDZPNZnnlf/9D\nTcADXF3OEBuf0Bm+yDbb+KlgajpBbPECc6/9XU1XUf5KLriDQi/Cfwmo7njravAvLy/z2uuv1V12\n/XXXMTOzc8vbyGaznH71b4jHE+vqly//gLHYOKkN27h8+QdMTqZINDiPLxby5K9kN9QKxMbGt1zv\n5nNtZttXl69QLKwOdfsaLRuPrfRt28PWvl60Yyvt22i5WODrr/xVzd9ysVhg30/+07oHhV7ZrnGL\nXoT/PPAR4Mtm9i+Bv2ixfiyd7qyx119/+yZ3rX27d9+w7duQYfAT/d6Bbab2haIX4f8HwE+Z2Xz0\n+OM92KaIiDQRK5c1h1xEJDSjN4ohIiItKfxFRAKk8BcRCZDCX0QkQANxYTczGweeBW4GJoAn3f1k\nNDX0vwArwMvu/tk+7uaWmdk/B14B3u7uy6PSPjPbCbxA5fscE8C/c/dXRqh9I3d9KjO7Bvhd4IeB\nOHAQ+BZwDCgBbwAH3H2oZ4SY2duBbwK3U2nXMUakfWb2aSrT6K8BPk9lWv0x2mzf
oJz5fwzY4e4f\npHLdnxuj+m8BPxfV329m7+3XDm6Vmc1Qua5Roap8mNFo378Fvu7u+4EHgS9E9VF5/f7/9amA/0Dl\ndRx2Pw8suPte4ENUXrNngNmoFgPu7eP+bVl0gPttIEelPc8yIu0zs/3AB6L35H7gXXT4+g1K+N8J\nfN/M/gdwBPijKCwn3P1vonVeAu7o1w5uhZnFqLwJPw3ko9oMEB+F9gH/Gfid6OdrgLyZpRiR148N\n16cCRuH6VF8Gnox+HgOuAje5+1xUe5Hhfb3WfI7KCdbfR49HqX13An9pZn8IfA34KnBzJ+3rebeP\nmf0C8KkN5QUg7+4fNrO9wO8B97P+mkAZKke3gdagfX8L/Hd3/wszg8pReeM1j4a5fQ+6+zfN7Frg\n94HHgJ0MYfsaGPrrU23k7jmA6CD9ZeAJ4DeqVslSeQ2Hkpk9SOWTzctR90gs+rdmqNsHpIF3Ah+m\n8nf1NTpsX8/D392/CHyxumZmXwJORMvnzOwGaq8JNANc6tV+blaD9n0H+IUoOK+lchb8EUakfQBm\n9i+ALwGPu/uZ6JPN0LWvgW29PlW/mNk7ga8AX3D3L5nZr1ctTjG8rxdUriRQNrM7gPcCz1EJzDXD\n3r5F4FvuvgJ828wKwPVVy1u2b1C6fb4B3ANgZu8B/tbdM8Cymb0r6ja5E5hr8hwDy91/1N1vc/fb\ngH8A7hyl9pnZj1E5e/w5d38JwN2XGJH2URlIW3t/tnN9qoFnZu8AXgZ+2d2PReVXzWxf9PPdDO/r\nhbvvc/f90d/ca8C/Bk6OSvuoZOaHAMzsOmAKONVJ+wZitg+Vfv7DZva/osefqPrvfwPGgZfc/Vw/\ndq7LqkffR6V9v0plls+hqFvrkrv/K0anfaN4fapZKt0CT5rZWt//Y1RewwngTeB4v3ZuG5SBx4Ej\no9A+dz9hZnvN7CyVk/hPAt+lg/bp2j4iIgEalG4fERHpIYW/iEiAFP4iIgFS+IuIBEjhLyISIIW/\niEiAFP4iIgFS+IuIBOj/AVMVjN48eqv2AAAAAElFTkSuQmCC\n", 687 | "text/plain": [ 688 | "" 689 | ] 690 | }, 691 | "metadata": {}, 692 | "output_type": "display_data" 693 | } 694 | ], 695 | "source": [ 696 | "experiment_diff_mean = shuffle_experiment(100000)\n", 697 | "sns.distplot(experiment_diff_mean, kde=False)" 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": 28, 703 | "metadata": { 704 | "collapsed": false 705 | }, 706 | "outputs": [ 707 | { 708 | "name": "stdout", 709 | "output_type": "stream", 710 | "text": [ 711 | "Number of times diff in mean greater than observed: 40473\n", 712 | "% of times diff in mean greater than observed: 40.473\n" 713 | ] 714 | } 715 | ], 716 | "source": [ 717 | "#Finding % of times difference of means is greater than observed\n", 718 | "print \"Number of times diff in mean greater than observed:\", \\\n", 719 | " experiment_diff_mean[experiment_diff_mean>=observed_difference].shape[0]\n", 720 | "print 
\"% of times diff in mean greater than observed:\", \\\n", 721 | " experiment_diff_mean[experiment_diff_mean>=observed_difference].shape[0]/float(experiment_diff_mean.shape[0])*100" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "### Did the conclusion change now? " 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": { 735 | "collapsed": true 736 | }, 737 | "outputs": [], 738 | "source": [] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "metadata": {}, 743 | "source": [ 744 | "# Effect Size\n", 745 | "\n", 746 | "> **Because you can't argue with all the fools in the world. It's easier to let them have their way, then trick them when they're not paying attention** - Christopher Paolini\n", 747 | "\n", 748 | "In the first case, how much did the price optimization increase the sales on average?" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": 29, 754 | "metadata": { 755 | "collapsed": false 756 | }, 757 | "outputs": [ 758 | { 759 | "name": "stdout", 760 | "output_type": "stream", 761 | "text": [ 762 | "The % increase of sales in the first case: 17.5438596491 %\n" 763 | ] 764 | } 765 | ], 766 | "source": [ 767 | "before_opt = np.array([23, 21, 19, 24, 35, 17, 18, 24, 33, 27, 21, 23])\n", 768 | "after_opt = np.array([31, 28, 19, 24, 32, 27, 16, 41, 23, 32, 29, 33])\n", 769 | "\n", 770 | "print \"The % increase of sales in the first case:\", \\\n", 771 | "(np.mean(after_opt) - np.mean(before_opt))/np.mean(before_opt)*100,\"%\"" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 30, 777 | "metadata": { 778 | "collapsed": false 779 | }, 780 | "outputs": [ 781 | { 782 | "name": "stdout", 783 | "output_type": "stream", 784 | "text": [ 785 | "The % increase of sales in the second case: 1.75438596491 %\n" 786 | ] 787 | } 788 | ], 789 | "source": [ 790 | "before_opt = np.array([230, 210, 190, 240, 350, 170, 180, 
240, 330, 270, 210, 230])\n", 791 | "after_opt = np.array([310, 180, 190, 240, 220, 240, 160, 410, 130, 320, 290, 210])\n", 792 | "\n", 793 | "print \"The % increase of sales in the second case:\", \\\n", 794 | "(np.mean(after_opt) - np.mean(before_opt))/np.mean(before_opt)*100,\"%\"" 795 | ] 796 | }, 797 | { 798 | "cell_type": "markdown", 799 | "metadata": {}, 800 | "source": [ 801 | "**Would a business feel comfortable spending millions of dollars if the increase is going to be just 1.75%? Does it make sense? Maybe yes, if margins are thin and any increase is considered good. But if the returns from the price optimization module do not let the company break even, it makes no sense to take that path.**" 802 | ] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "metadata": {}, 807 | "source": [ 808 | "> Someone tells you the result is statistically significant. What is the first question you should ask?\n", 809 | "\n", 810 | "# How large is the effect?\n", 811 | "\n", 812 | "To answer such a question, we will make use of the concept of a **confidence interval**.\n", 813 | "\n", 814 | "In plain English, a *confidence interval* is a range of values within which the metric being measured is expected to fall a stated percentage of the time.\n", 815 | "\n", 816 | "An example would be: 90% of the time, the increase in average sales (before and after price optimization) would be between `3.4 and 6.7` (These numbers are illustrative. We will derive the actual numbers below.)\n", 817 | "\n", 818 | "What is the *hacker's way* of doing it? We will do the following steps:\n", 819 | "\n", 820 | "1. From the actual sales data, sample with replacement (separately for before and after), keeping the sample size the same as the original.\n", 821 | "2. Find the difference between the means of the two samples.\n", 822 | "3. Repeat steps 1 and 2, say, 100,000 times.\n", 823 | "4. Sort the differences. To get the 90% interval, take the 5th and 95th percentile values. That range gives you the 90% confidence interval on the difference in means.\n", 824 | "5. 
This process of generating the samples is called **bootstrapping**" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": 31, 830 | "metadata": { 831 | "collapsed": true 832 | }, 833 | "outputs": [], 834 | "source": [ 835 | "#Load the data\n", 836 | "before_opt = np.array([23, 21, 19, 24, 35, 17, 18, 24, 33, 27, 21, 23])\n", 837 | "after_opt = np.array([31, 28, 19, 24, 32, 27, 16, 41, 23, 32, 29, 33])" 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": 32, 843 | "metadata": { 844 | "collapsed": false 845 | }, 846 | "outputs": [], 847 | "source": [ 848 | "#generate a uniform random sample\n", 849 | "random_before_opt = np.random.choice(before_opt, size=before_opt.size, replace=True)" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": 33, 855 | "metadata": { 856 | "collapsed": false 857 | }, 858 | "outputs": [ 859 | { 860 | "name": "stdout", 861 | "output_type": "stream", 862 | "text": [ 863 | "Actual sample before optimization: [23 21 19 24 35 17 18 24 33 27 21 23]\n", 864 | "Bootstrapped sample before optimization: [21 17 19 21 33 27 24 18 18 19 24 24]\n" 865 | ] 866 | } 867 | ], 868 | "source": [ 869 | "print \"Actual sample before optimization:\", before_opt\n", 870 | "print \"Bootstrapped sample before optimization: \", random_before_opt" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 34, 876 | "metadata": { 877 | "collapsed": false 878 | }, 879 | "outputs": [ 880 | { 881 | "name": "stdout", 882 | "output_type": "stream", 883 | "text": [ 884 | "Mean for actual sample: 23.75\n", 885 | "Mean for bootstrapped sample: 22.0833333333\n" 886 | ] 887 | } 888 | ], 889 | "source": [ 890 | "print \"Mean for actual sample:\", np.mean(before_opt)\n", 891 | "print \"Mean for bootstrapped sample:\", np.mean(random_before_opt)" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": 35, 897 | "metadata": { 898 | "collapsed": false 899 | }, 900 | 
"outputs": [ 901 | { 902 | "name": "stdout", 903 | "output_type": "stream", 904 | "text": [ 905 | "Actual sample after optimization: [31 28 19 24 32 27 16 41 23 32 29 33]\n", 906 | "Bootstrapped sample after optimization: [33 41 27 32 28 41 33 41 41 31 29 19]\n", 907 | "Mean for actual sample: 27.9166666667\n", 908 | "Mean for bootstrapped sample: 33.0\n" 909 | ] 910 | } 911 | ], 912 | "source": [ 913 | "random_after_opt = np.random.choice(after_opt, size=after_opt.size, replace=True)\n", 914 | "print \"Actual sample after optimization:\", after_opt\n", 915 | "print \"Bootstrapped sample after optimization: \", random_after_opt\n", 916 | "print \"Mean for actual sample:\", np.mean(after_opt)\n", 917 | "print \"Mean for bootstrapped sample:\", np.mean(random_after_opt)" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 36, 923 | "metadata": { 924 | "collapsed": false 925 | }, 926 | "outputs": [ 927 | { 928 | "name": "stdout", 929 | "output_type": "stream", 930 | "text": [ 931 | "Difference in means of actual samples: 4.16666666667\n", 932 | "Difference in means of bootstrapped samples: 10.9166666667\n" 933 | ] 934 | } 935 | ], 936 | "source": [ 937 | "print \"Difference in means of actual samples:\", np.mean(after_opt) - np.mean(before_opt)\n", 938 | "print \"Difference in means of bootstrapped samples:\", np.mean(random_after_opt) - np.mean(random_before_opt)" 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": 37, 944 | "metadata": { 945 | "collapsed": true 946 | }, 947 | "outputs": [], 948 | "source": [ 949 | "#Like always, we will repeat this experiment 100,000 times. 
\n", 950 | "\n", 951 | "def bootstrap_experiment(number_of_times):\n", 952 | " mean_difference = np.empty([number_of_times,1])\n", 953 | " for times in np.arange(number_of_times):\n", 954 | " random_before_opt = np.random.choice(before_opt, size=before_opt.size, replace=True)\n", 955 | " random_after_opt = np.random.choice(after_opt, size=after_opt.size, replace=True)\n", 956 | " mean_difference[times] = np.mean(random_after_opt) - np.mean(random_before_opt)\n", 957 | " return mean_difference" 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 38, 963 | "metadata": { 964 | "collapsed": false 965 | }, 966 | "outputs": [ 967 | { 968 | "data": { 969 | "text/plain": [ 970 | "" 971 | ] 972 | }, 973 | "execution_count": 38, 974 | "metadata": {}, 975 | "output_type": "execute_result" 976 | }, 977 | { 978 | "data": { 979 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAECCAYAAAAW+Nd4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGhdJREFUeJzt3X+M2/d93/En785H3vFIuYi4OGmDbnHhd90ASmJXaCpn\nkj171qwl1RoMHuA0cIxEnl3B8IBsGXZTPbiQo7SpjUZoJgxWGtlQUKAxkgyFYMWDVujU6w8rqa3G\nc/K25aQItnbpnWbfkRTJ04ncH/zyRFHUkbwjj0d+Xg/gIPLNz5Gfj/i91/fL748PY5VKBRERCctI\nvzsgIiIbT+EvIhIghb+ISIAU/iIiAVL4i4gESOEvIhKgsdUeNLMR4ChwC1AG9gGXgWPR/deA/e5e\nMbN9wMPAMnDQ3U+Y2QRwHMgAWeBBd5/v0VhERKRNrbb87wWS7v5R4LeBLwBPA9PuvhOIAXvN7Cbg\nMWAHsBs4ZGbjwKPAuajt88CB3gxDREQ60Sr8C8AWM4sBW4Al4HZ3n4kefxG4B9gOzLr7JXdfBM4D\n24A7gJNR25NRWxER6bNVd/sAs0AC+CHwLuDjwM66x7NUVwppYOE69cWGmoiI9FmrLf/PU92iN+BD\nVHfd3FD3eBp4h2rAp+rqqSb1Wk1ERPqs1ZZ/kitb7m9H7V8xs13ufhq4DzgFvAw8ZWZxqp8UbqV6\nMHgW2AOcjdrO0EKlUqnEYrE1DEVEJGgdBWdstYndzOxG4GvAVqpb/L8PfA94FhgHXgf2RWf7fJbq\n2T4jwFPu/q3obJ/ngPcAJeABd/+HFn2qzM1lOxnDQMlkUmh8g2mYxwYa36DLZFLdC/8+UfgPsGEe\n3zCPDTS+Qddp+OsiLxGRACn8RUQCpPAXEQmQwl9EJEAKfxGRACn8RUQC1OoiL5GhVy6Xyefz19ST\nySQjI9o+kuGk8Jfg5fN5Tp19i3hiYqVWKha4e/vNpFKpVX5TZHAp/EWAeGKCicmpfndDZMPoM62I\nSIAU/iIiAVL4i4gESOEvIhIgHfCVoDQ7rTOXy1Epb7rZbUV6SuEvQWl2WufC2xdITE4x2cd+iWw0\nhb8Ep/G0zmLh2gu8RIad9vmLiARI4S8iEiCFv4hIg
BT+IiIBannA18weBD4d3Z0APgh8FPgyUAZe\nA/a7e8XM9gEPA8vAQXc/YWYTwHEgA2SBB919vtsDERGR9rXc8nf359z9Lne/C/gu8BjwBDDt7juB\nGLDXzG6KHtsB7AYOmdk48ChwLmr7PHCgN0MREZF2tb3bx8x+Gfgldz8K3O7uM9FDLwL3ANuBWXe/\n5O6LwHlgG3AHcDJqezJqKyIifdTJPv9p4MnodqyungW2AGlg4Tr1xYaaiIj0UVvhb2Y3Are4++mo\nVK57OA28QzXg67/5ItWkXquJiEgftXuF707gVN39V8xsV7QyuC967GXgKTOLAwngVqoHg2eBPcDZ\nqO0MLWQyw/3tSRpf/8TjFaaSCSaTiZXaUiHByOgoqakrtdHYMlu3TpFOXz2WzTy2btD4wtFu+N8C\nvFV3/3PAs9EB3deBF6KzfQ4DZ6h+oph295KZHQGeM7MzQAl4oNWLzc1lOxnDQMlkUhpfH2WzOXL5\nIpcrVxb9/MUisZExxsaLK7XCxSLz8zlKpSt7ODf72NZL4xtsna7Y2gp/d/+9hvtvAnc2aXcUONpQ\nKwD3d9QrERHpKV3kJSISIIW/iEiAFP4iIgFS+IuIBEjhLyISIH2Tl0gT5XKZXC53VS0er1AuVxgZ\n0TaTDD6Fvwyt9XxZ+1KpwMyri6S33LhSGxst85Ff+jlSKV0oJINP4S9Da71f1t74Xb+jseUe9FKk\nPxT+MtT0Ze0izWnnpYhIgBT+IiIBUviLiARI4S8iEiCFv4hIgBT+IiIBUviLiARI5/mLtKnZlA8A\nyWRSUz7IwFH4i7SpVCww8+o/XDXlQ6lY4O7tN2vKBxk4Cn+RDjReMSwyqPRZVUQkQC23/M3sPwEf\nB24A/gCYBY4BZeA1YL+7V8xsH/AwsAwcdPcTZjYBHAcyQBZ40N3nezEQERFp36pb/mZ2J/Cr7r4D\nuBN4P/A0MO3uO4EYsNfMbgIeA3YAu4FDZjYOPAqci9o+Dxzo0ThERKQDrbb87wW+b2bfBtLAfwA+\n4+4z0eMvRm0uA7Pufgm4ZGbngW3AHcDvRG1PAr/V5f6LrGicv7/duftFQtQq/DPA+4CPUd3q/xOq\nW/s1WWAL1RXDwnXqiw01kZ5onL+/k7n7RULTKvzngR+4+zLwhpkVgZ+tezwNvEM14OvPdUs1qddq\nLWUyw33anMbXG/F4hXe962eYTFbPxhmNXWZkdJTUVGKlzVIhsebahcICU8n4VbXR2DJbt06RTg/H\ne6plMxytwv/PgMeBZ8zsvcAkcMrMdrn7aeA+4BTwMvCUmcWBBHAr1YPBs8Ae4GzUdubal7jW3Fx2\nDUMZDJlMSuPrkWw2Ry5f5HKluljnLxaJjYwxNl5cabOeGkAuX7qqVrhYZH4+R6kUY9Bp2Rxsna7Y\nVg3/6IydnWb2MtWDw78J/C3wbHRA93Xghehsn8PAmajdtLuXzOwI8JyZnQFKwAOdDkhERLqv5ame\n7v4fm5TvbNLuKHC0oVYA7l9r50REpDd0ha9seo1n8dRshjl1ms33sxn6JdKKwl82vcazeGDzzKmz\nVCow8+riynw/m6VfIq0o/GUgbOY5dTZz30SuR59NRUQCpPAXEQmQdvuIdJG+8EUGhcJfpIsaDwCD\nDgLL5qTwl4HUbAt7s0zkpgPAMggU/jKQmm1hayI3kfYp/GVgNW5hFwvXXggmIs3pCJSISIAU/iIi\nAVL4i4gESOEvIhIghb+ISIAU/iIiAVL4i4gESOEvIhIghb+ISIAU/iIiAWpregcz+2tgIbr7I+AQ\ncAwoA68B+929Ymb7gIeBZeCgu58wswngOJABssCD7j7f1VGIiEhHWm75m1kCwN3vin4+AzwDTLv7\nTiAG7DWzm4DHgB3AbuCQmY0DjwLnorbPAwd6MxQREWlXO1v+HwQmzew7Ufv/DNzm7jPR4y8C9wKX\ngVl3vwRcMrPzw
DbgDuB3orYngd/qYv9FRGQN2tnnnwe+5O67gUeArzc8ngW2AGmu7BpqrC821ERE\npI/a2fJ/AzgP4O5vmtkF4MN1j6eBd6gGfP1XFaWa1Gu1VWUyw/2NRxpfZ+LxClPJBJPJxEptqZBg\nZHSU1NT1a+206aR2obDAVDLe8WuOxpbZunWKdHrzv+9aNsPRTvg/RHX3zX4zey/VAH/JzHa5+2ng\nPuAU8DLwlJnFgQRwK9WDwbPAHuBs1Hbm2pe42txcdg1DGQyZTErj61A2myOXL3K5cmVxzV8sEhsZ\nY2y8eN1aO206qQHk8qWOX7Nwscj8fI5SKbbe/4qe0rI52DpdsbUT/l8FvmZmtdB+CLgAPBsd0H0d\neCE62+cwcIbq7qRpdy+Z2RHgOTM7A5SABzrqoYiIdF3L8Hf3ZeBTTR66s0nbo8DRhloBuH+N/RMR\nkR7QRV4iIgFS+IuIBEjhLyISIIW/iEiAFP4iIgFS+IuIBEjhLyISIIW/iEiA2prPX0TWrlwuk8vl\nrqknk0lGRrT9Jf2h8BfpsaVSgZlXF0lvuXGlVioWuHv7zaRSmmhM+kPhL7IB4okJJian+t0NkRX6\nzCkiEiCFv4hIgBT+IiIBUviLiARI4S8iEiCFv4hIgBT+IiIBUviLiARIF3nJplIul8nn81fVcrkc\nlXKlTz0SGU5thb+Z/SPge8DdQBk4Fv37GrDf3Stmtg94GFgGDrr7CTObAI4DGSALPOju810fhQyN\nfD7PqbNvEU9MrNQW3r5AYnKKyT72S2TYtNztY2Y3AP8NyAMx4Blg2t13Rvf3mtlNwGPADmA3cMjM\nxoFHgXNR2+eBAz0ZhQyV2lQItZ94ItHvLokMnXb2+X8JOAL8fXT/NnefiW6/CNwDbAdm3f2Suy8C\n54FtwB3AyajtyaitiIj02arhb2afBubc/aWoFIt+arLAFiANLFynvthQExGRPmu1z/8hoGJm9wAf\nAp6juv++Jg28QzXg6+emTTWp12otZTLDPc2txnd98XiFqWSCyeSVXT1LhQQjo6OkpjqrrfX3rle7\nUFhgKhnvymuOxpbZunWKdHpzLQtaNsOxavi7+67abTP7U+AR4EtmtsvdTwP3AaeAl4GnzCwOJIBb\nqR4MngX2AGejtjO0YW4u2/lIBkQmk9L4VpHN5sjli1yuXFk08xeLxEbGGBsvdlRb6+9drwaQy5e6\n8pqFi0Xm53OUSvUfpPtLy+Zg63TF1ul5/hXgc8CTZvbnVFceL7j7T4HDwBmqK4Npdy9RPVbwATM7\nA3wWeLLD1xMRkR5o+zx/d7+r7u6dTR4/ChxtqBWA+9faORER6Q1d4SsiEiCFv4hIgBT+IiIBUviL\niARI4S8iEiCFv4hIgBT+IiIBUviLiARI4S8iEiCFv4hIgPQ1jiJ9UC6XyeVy19STySQjI9omk95T\n+Iv0wVKpwMyri6S33LhSKxUL3L39ZlIpTTssvafwF+mT2tdVivSDPl+KiARI4S8iEiDt9pG+KZfL\n5PP5q2q5XI5KudKnHomEQ+EvfZPP5zl19i3iiYmV2sLbF0hMTjHZx36JhEDhL33VeNCzWMiv0lpE\nukX7/EVEAqTwFxEJUMvdPmY2CjwL3AJUgEeAEnAMKAOvAfvdvWJm+4CHgWXgoLufMLMJ4DiQAbLA\ng+4+34OxiIhIm9rZ8v8YUHb3jwIHgC8ATwPT7r4TiAF7zewm4DFgB7AbOGRm48CjwLmo7fPRc4iI\nSB+1DH93/+/Av43u/mPgbeB2d5+Jai8C9wDbgVl3v+Tui8B5YBtwB3AyansyaisiIn3U1j5/d79s\nZseALwNfp7q1X5MFtgBpYOE69cWGmoiI9FHbp3q6+6fN7N3Ay0Ci7qE08A7VgK+fkSrVpF6rrSqT\nGe6JrTS+qni8wlQywWTyyuK0VEgwMjpKamr9tW4+F8CFwgJTyXjPXnM0tszWrVO
k0/1bPrRshqOd\nA76fAn7O3Q8BBeAy8F0z2+Xup4H7gFNUVwpPmVmc6srhVqoHg2eBPcDZqO3Mta9ytbm57NpGMwAy\nmZTGF8lmc+TyRS5XriyG+YtFYiNjjI0X113r5nPV5PKlnr1m4WKR+fkcpVL9B+uNo2VzsHW6Ymtn\ny/8F4JiZnQZuAB4Hfgg8Gx3QfR14ITrb5zBwhurupGl3L5nZEeA5MztD9SyhBzrqoYiIdF3L8Hf3\nAvBvmjx0Z5O2R4GjTX7//jX2T0REekDTO4hsEvp2L9lICn+RTULf7iUbSeEvsono271ko+izpIhI\ngLTlLxtCX9wisrko/GVD6ItbRDYXhb9sGH1xi8jmoX3+IiIBUviLiARI4S8iEiCFv4hIgBT+IiIB\nUviLiARI4S8iEiCFv4hIgBT+IiIB0hW+0hONc/loHp+10Rz/0isKf+mJxrl8NI/P2miOf+kVhb/0\nTP1cPprHZ+00x7/0gj43iogEaNUtfzO7AfhD4OeBOHAQ+AFwDCgDrwH73b1iZvuAh4Fl4KC7nzCz\nCeA4kAGywIPuPt+jsYiISJtabfl/Ephz953AvwC+AjwNTEe1GLDXzG4CHgN2ALuBQ2Y2DjwKnIva\nPg8c6M0wRESkE63C/xvAE3VtLwG3uftMVHsRuAfYDsy6+yV3XwTOA9uAO4CTUduTUVsREemzVXf7\nuHsewMxSVFcEB4Dfq2uSBbYAaWDhOvXFhpqIrEOz0z916qd0quXZPmb2PuCbwFfc/Y/M7HfrHk4D\n71AN+PrzzlJN6rVaS5nMcJ/CFsL44vEKU8kEk8kEAEuFBCOjo6SmEivtel3r9vNfKCwwlYxv6Gs2\nry3wvTd+ypYblwAoFi/ysZ2/SDq9/uUqhGVTqlod8H038BLwm+7+p1H5FTPb5e6ngfuAU8DLwFNm\nFgcSwK1UDwbPAnuAs1HbGdowN5ddw1AGQyaTCmJ82WyOXL7I5Up1EctfLBIbGWNsvLjStte1bj8/\nQC5f2tDXXK1W+79dvjzC/HyOUinGeoSybA6rTldsrbb8p6nuqnnCzGr7/h8HDkcHdF8HXojO9jkM\nnKF6bGDa3UtmdgR4zszOACXggY56JyIiPdFqn//jVMO+0Z1N2h4FjjbUCsD96+ifiIj0gI4QiYgE\nSOEvIhIghb+ISIA0sZusW/30zfF4pXqmj6ZwFtnUFP6ybvXTN08lE+TyRU3hLLLJKfylK2rTDk8m\nE1yujGkKZ5FNTvv8RUQCpPAXEQmQwl9EJEAKfxGRACn8RUQCpPAXEQmQwl9EJEA6z186Un81b42u\n5hUZPAp/6Uj91bw1uppXZPAo/KVjtat5a3Q1r8jg0T5/EZEAKfxFRAKk8BcRCZDCX0QkQG0d8DWz\nXwG+6O53mdkvAMeAMvAasN/dK2a2D3gYWAYOuvsJM5sAjgMZIAs86O7zPRiHSLDK5TK5XO6aejKZ\nZGRE23fSXMslw8w+DzwLxKPSM8C0u+8EYsBeM7sJeAzYAewGDpnZOPAocC5q+zxwoPtDEAnbUqnA\nzKs/4c++/3crP6fOvnXN9Rgi9drZLDgPfIJq0APc5u4z0e0XgXuA7cCsu19y98Xod7YBdwAno7Yn\no7Yi0mW1029rP/XXYYg00zL83f2bVHfl1MTqbmeBLUAaWLhOfbGhJiIifbaWi7zKdbfTwDtUAz5V\nV081qddqLWUyqdaNBtggjy8erzCVTDCZTKzUlgoJRkZHSU1Va6mpxDW1xvsbUev2818oLDCVjG/o\na651nKOxZbZunSKd7mxZG+Rlsx3DPr5OrCX8XzGzXe5+GrgPOAW8DDxlZnEgAdxK9WDwLLAHOBu1\nnWn+lFebm8uuoVuDIZNJDfT4stkcuXyRy5Uri07+YpHYyBhj40VSUwmyueJVtcY2zX6vF7VuPz9A\nLl/a0Ndc6zgLF4vMz+coleo/qK9u0JfNVkI
YXyc6ORWgNnPX54AnzezPqa48XnD3nwKHgTNUVwbT\n7l4CjgAfMLMzwGeBJzvqnYiI9ERbW/7u/rdUz+TB3d8E7mzS5ihwtKFWAO5fbyelPzSDp8jw0sRu\ncl2awXNw6dx/aUXhL6vSDJ6DqXru/yLpLTeu1ErFAndvv5lUSgc9ReEvMrQaV9wi9fT5T0QkQAp/\nEZEAKfxFRAKkff4C6LROkdAo/AXQaZ0ioVH4ywqd1ikSDoW/SCB04ZfUU/iLBEIXfkk9hX+AdHA3\nXLrwS2oU/gHSwV0RUfgHSgd3Ba4+DhCPV8hmq7d1HGD4KfxFAlZ/HGAqmSCXL+o4QCAU/gFo3Mev\n/ftSr/YpcDKZuOob2mS46Z0OQOM+fu3fFxGF/5C53pk84+OJlX382r8vq9H1AGFQ+A8Znckj69Xs\neoDCxTy/+oH3MDV15SQBrQwGW8/D38xGgP8KbANKwGfd/a1ev27IdCaPrFezZWjm1Z+srBB0UHjw\nbcSW/78Cxt19h5n9CvB0VJN10sVaspF0gdhw2YjwvwM4CeDuf2Vmv7wBrzl0rhf0f/m//i+JySs7\ndLSLRzaCjgsMvo0I/zSwWHf/spmNuHt5A15702sW6uVy9b+m/o9otaDXLh7ZaO0eF2i2LDergVYc\nG20jwn8RqN8xOHDBf+5vXmVpaWnlfowYN9/8fsbGxjt+rvqrKKEa6qdf+THxeGKltrDw/xiJjZJK\nb7mqNjGRItGwTV8qFihczNXdLxIbGV211k6btdZGY8sULhY39DU3apyXli5SKl7u2/9tr1/zeu/d\naq9Zb6lU5H/85Q+vWW6bLcuNtVKpyK4P/5OrVhzdlsno+ES9jQj/WeDjwDfM7CPA37RoH9tsb9I9\nd//Trj5fOp2+6v62bbd09flFpLnNli39tBHh/y3gn5vZbHT/oQ14TRERWUWsUtGZISIiodHRFRGR\nACn8RUQCpPAXEQmQwl9EJECbamI3M/t14F+7+yej+x8Bfh9YBl5y99/uZ/+6wcxiwP8G3ohKf+Hu\n033s0rqFMH+Tmf01sBDd/ZG7f6af/emGaLqVL7r7XWb2C8AxoAy8Bux394E+G6RhfB8G/gR4M3r4\niLv/cf96tz5mdgPwh8DPA3HgIPADOngPN034m9mXgXuBV+rKR4BPuPuPzeyEmX3I3V/tTw+75mbg\ne+7+a/3uSBcN9fxNZpYAcPe7+t2XbjGzzwO/AdSu5HoGmHb3GTM7AuwFvt2v/q1Xk/HdDjzj7s/0\nr1dd9Ulgzt0/ZWY/A5yjmp1tv4ebabfPLPAoEAMwszQQd/cfR49/B7inT33rptuBnzWz/xmt0Ibh\nCq+r5m8Chm3+pg8Ck2b2HTM7Fa3gBt154BNEf2/Abe4+E91+kcH/W2sc3+3AvzSz02Z21MwGfYa6\nbwBPRLdHgEt0+B5uePib2WfM7PsNP7c3+QjWOCdQFtjCAGk2VuDvgC+4+z8DvgAc728vu6Lp/E39\n6kwP5IEvuftu4BHg64M+Pnf/JtXdqTWxuts5BuxvrVGT8f0V8O/dfRfwI+C/9KVjXeLueXfPmVmK\n6orgAFfnecv3cMN3+7j7V4GvttG0cU6gNPBOTzrVI83GamYTRAulu8+a2Xv70bcuG/j5m1p4g+qW\nJO7+ppldAN4D/J++9qq76t+vFAP2t9aGb7l77ZjNt4HD/exMN5jZ+4BvAl9x9z8ys9+te7jle7hp\nt17cfRFYMrP3RwdJ7wVmWvzaIHgC+HcAZvZB4Cf97U5XzAJ7YOUgfav5mwbNQ1SPYxCtrNPA3/e1\nR933ipntim7fx3D8rdU7aWbbo9t3A9/tZ2fWy8zeDbwEfN7dj0Xljt7DTXPAN1KJfmoeAb4OjALf\ncfezfelVd30ROG5me6h+Avh0f7vTFcM+f9NXga+ZWe2P6aEh+mRT+3v7HPCsmY0DrwMv9K9LXVUb\n3yPAV8z
sEtUV98P961JXTFPdrfOEmdX2/T8OHG73PdTcPiIiAdq0u31ERKR3FP4iIgFS+IuIBEjh\nLyISIIW/iEiAFP4iIgFS+IuIBEjhLyISoP8PoK6vAauRiewAAAAASUVORK5CYII=\n", 980 | "text/plain": [ 981 | "" 982 | ] 983 | }, 984 | "metadata": {}, 985 | "output_type": "display_data" 986 | } 987 | ], 988 | "source": [ 989 | "mean_difference = bootstrap_experiment(100000)\n", 990 | "sns.distplot(mean_difference, kde=False)" 991 | ] 992 | }, 993 | { 994 | "cell_type": "code", 995 | "execution_count": 39, 996 | "metadata": { 997 | "collapsed": false 998 | }, 999 | "outputs": [], 1000 | "source": [ 1001 | "mean_difference = np.sort(mean_difference, axis=0)" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "code", 1006 | "execution_count": 40, 1007 | "metadata": { 1008 | "collapsed": false 1009 | }, 1010 | "outputs": [ 1011 | { 1012 | "data": { 1013 | "text/plain": [ 1014 | "array([[ -6.66666667],\n", 1015 | " [ -6.33333333],\n", 1016 | " [ -6.08333333],\n", 1017 | " ..., \n", 1018 | " [ 13.16666667],\n", 1019 | " [ 13.16666667],\n", 1020 | " [ 15. 
]])" 1021 | ] 1022 | }, 1023 | "execution_count": 40, 1024 | "metadata": {}, 1025 | "output_type": "execute_result" 1026 | } 1027 | ], 1028 | "source": [ 1029 | "mean_difference #Sorted difference" 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "code", 1034 | "execution_count": 41, 1035 | "metadata": { 1036 | "collapsed": false 1037 | }, 1038 | "outputs": [ 1039 | { 1040 | "data": { 1041 | "text/plain": [ 1042 | "array([ 0.16666667, 8.08333333])" 1043 | ] 1044 | }, 1045 | "execution_count": 41, 1046 | "metadata": {}, 1047 | "output_type": "execute_result" 1048 | } 1049 | ], 1050 | "source": [ 1051 | "np.percentile(mean_difference, [5,95])" 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "markdown", 1056 | "metadata": {}, 1057 | "source": [ 1058 | "Reiterating what this means: 90% of the time, the bootstrapped difference in means falls between the limits shown above." 1059 | ] 1060 | }, 1061 | { 1062 | "cell_type": "markdown", 1063 | "metadata": {}, 1064 | "source": [ 1065 | "**Exercise: Find the percentiles that give the 95% confidence interval**" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "execution_count": null, 1071 | "metadata": { 1072 | "collapsed": true 1073 | }, 1074 | "outputs": [], 1075 | "source": [] 1076 | }, 1077 | { 1078 | "cell_type": "markdown", 1079 | "metadata": {}, 1080 | "source": [ 1081 | "### Where do we go from here? \n", 1082 | "\n", 1083 | "First of all, there are two points to be made.\n", 1084 | "\n", 1085 | "1. Why do we need significance testing if confidence intervals can provide us more information?\n", 1086 | "2. How does it relate to the traditional statistical procedure of finding confidence intervals?\n", 1087 | "\n", 1088 | "For the first one:\n", 1089 | "\n", 1090 | "What if sales in the first month after the price change were 80 and sales in the month before were 40? The difference is 40. A confidence interval built by resampling with replacement, as explained above, would always give 40. 
But if we run the significance test detailed above, where the labels are shuffled, the two values are equally likely to land in either group. And so, significance testing would answer that there was no difference. But don't we all know that the data is **too small** to make meaningful inferences?\n", 1091 | "\n", 1092 | "For the second one:\n", 1093 | "\n", 1094 | "Traditional statistical derivations assume a normal distribution. But what if the underlying distribution isn't normal? Also, people relate to resampling much better :-) " 1095 | ] 1096 | } 1097 | ], 1098 | "metadata": { 1099 | "kernelspec": { 1100 | "display_name": "Python 2", 1101 | "language": "python", 1102 | "name": "python2" 1103 | }, 1104 | "language_info": { 1105 | "codemirror_mode": { 1106 | "name": "ipython", 1107 | "version": 2 1108 | }, 1109 | "file_extension": ".py", 1110 | "mimetype": "text/x-python", 1111 | "name": "python", 1112 | "nbconvert_exporter": "python", 1113 | "pygments_lexer": "ipython2", 1114 | "version": "2.7.10" 1115 | } 1116 | }, 1117 | "nbformat": 4, 1118 | "nbformat_minor": 0 1119 | } 1120 | -------------------------------------------------------------------------------- /notebooks/4. Basic Metrics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Basic Metrics\n", 8 | "\n", 9 | "When we think about summarizing data, what are the metrics that we look at?\n", 10 | "\n", 11 | "In this notebook, we will look at the weed price dataset along with demographic information for the United States. \n", 12 | "\n", 13 | "To learn how the data was acquired, please read [this](https://github.com/amitkaps/weed/blob/master/1-Acquire.ipynb).\n", 14 | "\n", 15 | "This notebook will make use of pandas quite a bit." 
16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "import numpy as np\n", 27 | "import pandas as pd\n", 28 | "from datetime import datetime as dt\n", 29 | "from scipy import stats" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Read the input datasets. There are three datasets:\n", 37 | "\n", 38 | "1. Weed price by date / state\n", 39 | "2. Demographics of State\n", 40 | "3. Population of state" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": false 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "prices_pd = pd.read_csv(\"../data/Weed_Price.csv\", parse_dates=[-1])\n", 52 | "demography_pd = pd.read_csv(\"../data/Demographics_State.csv\")\n", 53 | "population_pd = pd.read_csv(\"../data/Population_State.csv\")" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 3, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/html": [ 66 | "
\n", 67 | "\n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | "
StateHighQHighQNMedQMedQNLowQLowQNdate
0Alabama339.061042198.64933149.491232014-01-01
1Alaska288.75252260.60297388.58262014-01-01
2Arizona303.311941209.351625189.452222014-01-01
3Arkansas361.85576185.62544125.871122014-01-01
4California248.7812096193.5612812192.927782014-01-01
\n", 139 | "
" 140 | ], 141 | "text/plain": [ 142 | " State HighQ HighQN MedQ MedQN LowQ LowQN date\n", 143 | "0 Alabama 339.06 1042 198.64 933 149.49 123 2014-01-01\n", 144 | "1 Alaska 288.75 252 260.60 297 388.58 26 2014-01-01\n", 145 | "2 Arizona 303.31 1941 209.35 1625 189.45 222 2014-01-01\n", 146 | "3 Arkansas 361.85 576 185.62 544 125.87 112 2014-01-01\n", 147 | "4 California 248.78 12096 193.56 12812 192.92 778 2014-01-01" 148 | ] 149 | }, 150 | "execution_count": 3, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "prices_pd.head()" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 4, 162 | "metadata": { 163 | "collapsed": false 164 | }, 165 | "outputs": [ 166 | { 167 | "data": { 168 | "text/html": [ 169 | "
\n", 170 | "\n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | "
StateHighQHighQNMedQMedQNLowQLowQNdate
22894Virginia364.983513293.123079NaN2842014-12-31
22895Washington233.053337189.923562NaN1602014-12-31
22896West Virginia359.35551224.03545NaN602014-12-31
22897Wisconsin350.522244272.712221NaN1672014-12-31
22898Wyoming322.27131351.86197NaN122014-12-31
\n", 242 | "
" 243 | ], 244 | "text/plain": [ 245 | " State HighQ HighQN MedQ MedQN LowQ LowQN date\n", 246 | "22894 Virginia 364.98 3513 293.12 3079 NaN 284 2014-12-31\n", 247 | "22895 Washington 233.05 3337 189.92 3562 NaN 160 2014-12-31\n", 248 | "22896 West Virginia 359.35 551 224.03 545 NaN 60 2014-12-31\n", 249 | "22897 Wisconsin 350.52 2244 272.71 2221 NaN 167 2014-12-31\n", 250 | "22898 Wyoming 322.27 131 351.86 197 NaN 12 2014-12-31" 251 | ] 252 | }, 253 | "execution_count": 4, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "prices_pd.tail()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 5, 265 | "metadata": { 266 | "collapsed": false 267 | }, 268 | "outputs": [ 269 | { 270 | "data": { 271 | "text/html": [ 272 | "
\n", 273 | "\n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | "
regiontotal_populationpercent_whitepercent_blackpercent_asianpercent_hispanicper_capita_incomemedian_rentmedian_age
0alabama47992776726142368050138.1
1alaska720316633563265197833.6
2arizona64797035743302535874736.3
3arkansas29333697415172217048037.5
4california37659181406133829527111935.4
\n", 351 | "
" 352 | ], 353 | "text/plain": [ 354 | " region total_population percent_white percent_black percent_asian \\\n", 355 | "0 alabama 4799277 67 26 1 \n", 356 | "1 alaska 720316 63 3 5 \n", 357 | "2 arizona 6479703 57 4 3 \n", 358 | "3 arkansas 2933369 74 15 1 \n", 359 | "4 california 37659181 40 6 13 \n", 360 | "\n", 361 | " percent_hispanic per_capita_income median_rent median_age \n", 362 | "0 4 23680 501 38.1 \n", 363 | "1 6 32651 978 33.6 \n", 364 | "2 30 25358 747 36.3 \n", 365 | "3 7 22170 480 37.5 \n", 366 | "4 38 29527 1119 35.4 " 367 | ] 368 | }, 369 | "execution_count": 5, 370 | "metadata": {}, 371 | "output_type": "execute_result" 372 | } 373 | ], 374 | "source": [ 375 | "demography_pd.head()" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 6, 381 | "metadata": { 382 | "collapsed": false 383 | }, 384 | "outputs": [ 385 | { 386 | "data": { 387 | "text/html": [ 388 | "
\n", 389 | "\n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | "
regionvalue
0alabama4777326
1alaska711139
2arizona6410979
3arkansas2916372
4california37325068
\n", 425 | "
" 426 | ], 427 | "text/plain": [ 428 | " region value\n", 429 | "0 alabama 4777326\n", 430 | "1 alaska 711139\n", 431 | "2 arizona 6410979\n", 432 | "3 arkansas 2916372\n", 433 | "4 california 37325068" 434 | ] 435 | }, 436 | "execution_count": 6, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "population_pd.head()" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": 7, 448 | "metadata": { 449 | "collapsed": false 450 | }, 451 | "outputs": [ 452 | { 453 | "data": { 454 | "text/plain": [ 455 | "State object\n", 456 | "HighQ float64\n", 457 | "HighQN int64\n", 458 | "MedQ float64\n", 459 | "MedQN int64\n", 460 | "LowQ float64\n", 461 | "LowQN int64\n", 462 | "date datetime64[ns]\n", 463 | "dtype: object" 464 | ] 465 | }, 466 | "execution_count": 7, 467 | "metadata": {}, 468 | "output_type": "execute_result" 469 | } 470 | ], 471 | "source": [ 472 | "prices_pd.dtypes" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": {}, 478 | "source": [ 479 | "#### Sort the data on state and date, then fill NA values" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": null, 485 | "metadata": { 486 | "collapsed": false 487 | }, 488 | "outputs": [], 489 | "source": [ 490 | "prices_pd.sort(columns=['State', 'date'], inplace=True)\n", 491 | "prices_pd.fillna(method='ffill', inplace=True)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "### Finding mean, median, mode, variance, standard deviation for California" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "#### Mean\n", 506 | "\n", 507 | "The arithmetic average of a set of values, computed by dividing the sum of all values by the number of values." 
508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 9, 513 | "metadata": { 514 | "collapsed": false 515 | }, 516 | "outputs": [ 517 | { 518 | "data": { 519 | "text/html": [ 520 | "
\n", 521 | "\n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | "
StateHighQHighQNMedQMedQNLowQLowQNdate
20098California248.7712021193.4412724193.887702013-12-27
20863California248.7412025193.4412728193.887702013-12-28
21577California248.7612047193.5512760193.607722013-12-29
22291California248.8212065193.5412779193.807732013-12-30
22801California248.7612082193.5412792193.807732013-12-31
\n", 593 | "
" 594 | ], 595 | "text/plain": [ 596 | " State HighQ HighQN MedQ MedQN LowQ LowQN date\n", 597 | "20098 California 248.77 12021 193.44 12724 193.88 770 2013-12-27\n", 598 | "20863 California 248.74 12025 193.44 12728 193.88 770 2013-12-28\n", 599 | "21577 California 248.76 12047 193.55 12760 193.60 772 2013-12-29\n", 600 | "22291 California 248.82 12065 193.54 12779 193.80 773 2013-12-30\n", 601 | "22801 California 248.76 12082 193.54 12792 193.80 773 2013-12-31" 602 | ] 603 | }, 604 | "execution_count": 9, 605 | "metadata": {}, 606 | "output_type": "execute_result" 607 | } 608 | ], 609 | "source": [ 610 | "california_pd = prices_pd[prices_pd.State == \"California\"].copy(True)\n", 611 | "california_pd.head()" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": 10, 617 | "metadata": { 618 | "collapsed": false 619 | }, 620 | "outputs": [], 621 | "source": [ 622 | "ca_sum = california_pd['HighQ'].sum()" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 11, 628 | "metadata": { 629 | "collapsed": false 630 | }, 631 | "outputs": [], 632 | "source": [ 633 | "ca_count = california_pd['HighQ'].count()" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 12, 639 | "metadata": { 640 | "collapsed": false 641 | }, 642 | "outputs": [ 643 | { 644 | "name": "stdout", 645 | "output_type": "stream", 646 | "text": [ 647 | "Mean weed price in CA is: 245.376124722\n" 648 | ] 649 | } 650 | ], 651 | "source": [ 652 | "ca_mean = ca_sum / ca_count\n", 653 | "print \"Mean weed price in CA is:\", ca_mean" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": {}, 659 | "source": [ 660 | "#### Exercise: Find CA mean for 2013, 2014 & 2015 separately\n", 661 | "\n", 662 | "*Hint:* `california_pd.iloc[0]['date'].year`" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": null, 668 | "metadata": { 669 | "collapsed": true 670 | }, 671 | "outputs": [], 672 | "source": [] 673 | }, 
674 | { 675 | "cell_type": "markdown", 676 | "metadata": {}, 677 | "source": [ 678 | "#### Median\n", 679 | "\n", 680 | "Denotes the value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it. Simply put, it is the *middle* value in the sorted list of numbers." 681 | ] 682 | }, 683 | { 684 | "cell_type": "code", 685 | "execution_count": 13, 686 | "metadata": { 687 | "collapsed": false 688 | }, 689 | "outputs": [ 690 | { 691 | "data": { 692 | "text/plain": [ 693 | "449" 694 | ] 695 | }, 696 | "execution_count": 13, 697 | "metadata": {}, 698 | "output_type": "execute_result" 699 | } 700 | ], 701 | "source": [ 702 | "ca_count" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "If the count n is odd, the median is the value at position (n+1)/2 of the sorted values,\n", 710 | "\n", 711 | "else it is the average of the values at positions n/2 and n/2 + 1" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "metadata": { 718 | "collapsed": false 719 | }, 720 | "outputs": [], 721 | "source": [ 722 | "ca_highq_pd = california_pd.sort(columns=['HighQ'])\n", 723 | "ca_highq_pd.head()" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": 15, 729 | "metadata": { 730 | "collapsed": false 731 | }, 732 | "outputs": [ 733 | { 734 | "name": "stdout", 735 | "output_type": "stream", 736 | "text": [ 737 | "Median price of weed in CA is: 245.31\n" 738 | ] 739 | } 740 | ], 741 | "source": [ 742 | "ca_median = ca_highq_pd.HighQ.iloc[ca_count // 2]\n", 743 | "print \"Median price of weed in CA is:\", ca_median" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "#### Mode\n", 751 | "\n", 752 | "It is the number which appears most often in a set of numbers. 
" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": 16, 758 | "metadata": { 759 | "collapsed": false 760 | }, 761 | "outputs": [ 762 | { 763 | "name": "stdout", 764 | "output_type": "stream", 765 | "text": [ 766 | "The most common price in CA, as indicated by its mode, is: 245.05\n" 767 | ] 768 | } 769 | ], 770 | "source": [ 771 | "ca_mode = ca_highq_pd.HighQ.value_counts().index[0]\n", 772 | "print \"The most common price in CA, as indicated by its mode, is:\", ca_mode" 773 | ] 774 | }, 775 | { 776 | "cell_type": "markdown", 777 | "metadata": {}, 778 | "source": [ 779 | "#### Variance\n", 780 | "\n", 781 | "> Once, two statisticians of height 4 feet and 5 feet had to cross a river of AVERAGE depth 3 feet. A third person came along and said, \"What are you waiting for? You can easily cross the river.\"\n", 782 | "\n", 783 | "It is the average of the squared distances of the data values from the *mean* (the sample version divides by n - 1 rather than n)\n", 784 | "\n", 785 | "" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 17, 791 | "metadata": { 792 | "collapsed": false 793 | }, 794 | "outputs": [], 795 | "source": [ 796 | "california_pd['HighQ_dev'] = (california_pd['HighQ'] - ca_mean) ** 2" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": 18, 802 | "metadata": { 803 | "collapsed": false 804 | }, 805 | "outputs": [ 806 | { 807 | "name": "stdout", 808 | "output_type": "stream", 809 | "text": [ 810 | "Variance of High Quality weed prices in CA is: 2.98268628798\n" 811 | ] 812 | } 813 | ], 814 | "source": [ 815 | "ca_HighQ_variance = california_pd.HighQ_dev.sum() / (ca_count - 1)\n", 816 | "print \"Variance of High Quality weed prices in CA is:\", ca_HighQ_variance" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "#### Standard Deviation\n", 824 | "\n", 825 | "It is the square root of variance. This will have the same units as the data and mean. 
" 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 19, 831 | "metadata": { 832 | "collapsed": false 833 | }, 834 | "outputs": [ 835 | { 836 | "name": "stdout", 837 | "output_type": "stream", 838 | "text": [ 839 | "Standard Deviation of High Quality weed prices in CA is: 1.72704553732\n" 840 | ] 841 | } 842 | ], 843 | "source": [ 844 | "ca_HighQ_SD = np.sqrt(ca_HighQ_variance)\n", 845 | "print \"Standard Deviation of High Quality weed prices in CA is:\", ca_HighQ_SD" 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "metadata": {}, 851 | "source": [ 852 | "#### Using Pandas built-in function" 853 | ] 854 | }, 855 | { 856 | "cell_type": "code", 857 | "execution_count": 20, 858 | "metadata": { 859 | "collapsed": false 860 | }, 861 | "outputs": [ 862 | { 863 | "data": { 864 | "text/html": [ 865 | "
\n", 866 | "\n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | "
HighQHighQNMedQMedQNLowQLowQNHighQ_dev
count449.000000449.000000449.000000449.000000449.000000449.000000449.000000
mean245.37612514947.073497191.26890916769.821826189.783586976.2984412.976043
std1.7270461656.1335651.5240282433.9431911.598252120.2467143.961134
min241.84000012021.000000187.85000012724.000000187.830000770.0000000.000015
25%244.48000013610.000000190.26000014826.000000188.600000878.0000000.106357
50%245.31000015037.000000191.57000016793.000000188.600000982.0000000.729103
75%246.22000016090.000000192.55000018435.000000191.3200001060.0000004.435761
max248.82000018492.000000193.63000022027.000000193.8800001232.00000012.504178
\n", 962 | "
" 963 | ], 964 | "text/plain": [ 965 | " HighQ HighQN MedQ MedQN LowQ \\\n", 966 | "count 449.000000 449.000000 449.000000 449.000000 449.000000 \n", 967 | "mean 245.376125 14947.073497 191.268909 16769.821826 189.783586 \n", 968 | "std 1.727046 1656.133565 1.524028 2433.943191 1.598252 \n", 969 | "min 241.840000 12021.000000 187.850000 12724.000000 187.830000 \n", 970 | "25% 244.480000 13610.000000 190.260000 14826.000000 188.600000 \n", 971 | "50% 245.310000 15037.000000 191.570000 16793.000000 188.600000 \n", 972 | "75% 246.220000 16090.000000 192.550000 18435.000000 191.320000 \n", 973 | "max 248.820000 18492.000000 193.630000 22027.000000 193.880000 \n", 974 | "\n", 975 | " LowQN HighQ_dev \n", 976 | "count 449.000000 449.000000 \n", 977 | "mean 976.298441 2.976043 \n", 978 | "std 120.246714 3.961134 \n", 979 | "min 770.000000 0.000015 \n", 980 | "25% 878.000000 0.106357 \n", 981 | "50% 982.000000 0.729103 \n", 982 | "75% 1060.000000 4.435761 \n", 983 | "max 1232.000000 12.504178 " 984 | ] 985 | }, 986 | "execution_count": 20, 987 | "metadata": {}, 988 | "output_type": "execute_result" 989 | } 990 | ], 991 | "source": [ 992 | "california_pd.describe()" 993 | ] 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": 21, 998 | "metadata": { 999 | "collapsed": false 1000 | }, 1001 | "outputs": [ 1002 | { 1003 | "data": { 1004 | "text/plain": [ 1005 | "0 245.03\n", 1006 | "1 245.05\n", 1007 | "dtype: float64" 1008 | ] 1009 | }, 1010 | "execution_count": 21, 1011 | "metadata": {}, 1012 | "output_type": "execute_result" 1013 | } 1014 | ], 1015 | "source": [ 1016 | "california_pd.HighQ.mode()" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "metadata": {}, 1022 | "source": [ 1023 | "#### Co-variance \n", 1024 | "\n", 1025 | "covariance as a measure of the (average) co-variation between two variables, say x and y. 
Covariance describes both how far the variables are spread out and the nature of their relationship: it measures how much two variables change together. Compare this to variance, which measures how far a single variable is spread around its own mean.\n", 1026 | "\n", 1027 | "\n", 1028 | "
\n", 1030 | "
\n", 1031 | "
\n", 1032 | "
\n", 1033 | "\n", 1034 | "#### Co-variance of weed price in California vs New York" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "code", 1039 | "execution_count": 22, 1040 | "metadata": { 1041 | "collapsed": false 1042 | }, 1043 | "outputs": [ 1044 | { 1045 | "data": { 1046 | "text/html": [ 1047 | "
\n", 1048 | "\n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | "
StateHighQHighQNMedQMedQNLowQLowQNdate
20120New York351.985773268.835786190.314792013-12-27
20885New York351.925775268.835786190.314792013-12-28
21599New York351.995785269.025806190.754802013-12-29
22313New York352.025791268.985814190.754802013-12-30
22823New York351.975794268.935818190.754802013-12-31
\n", 1120 | "
" 1121 | ], 1122 | "text/plain": [ 1123 | " State HighQ HighQN MedQ MedQN LowQ LowQN date\n", 1124 | "20120 New York 351.98 5773 268.83 5786 190.31 479 2013-12-27\n", 1125 | "20885 New York 351.92 5775 268.83 5786 190.31 479 2013-12-28\n", 1126 | "21599 New York 351.99 5785 269.02 5806 190.75 480 2013-12-29\n", 1127 | "22313 New York 352.02 5791 268.98 5814 190.75 480 2013-12-30\n", 1128 | "22823 New York 351.97 5794 268.93 5818 190.75 480 2013-12-31" 1129 | ] 1130 | }, 1131 | "execution_count": 22, 1132 | "metadata": {}, 1133 | "output_type": "execute_result" 1134 | } 1135 | ], 1136 | "source": [ 1137 | "ny_pd = prices_pd[prices_pd['State'] == 'New York'].copy(True)\n", 1138 | "ny_pd.head()" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "code", 1143 | "execution_count": 23, 1144 | "metadata": { 1145 | "collapsed": false 1146 | }, 1147 | "outputs": [], 1148 | "source": [ 1149 | "ny_pd = ny_pd.ix[:,[1,7]]\n", 1150 | "ny_pd.columns = ['NY_HighQ', 'date']" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "code", 1155 | "execution_count": 24, 1156 | "metadata": { 1157 | "collapsed": false 1158 | }, 1159 | "outputs": [ 1160 | { 1161 | "data": { 1162 | "text/html": [ 1163 | "
\n", 1164 | "\n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | "
NY_HighQdate
20120351.982013-12-27
20885351.922013-12-28
21599351.992013-12-29
22313352.022013-12-30
22823351.972013-12-31
\n", 1200 | "
" 1201 | ], 1202 | "text/plain": [ 1203 | " NY_HighQ date\n", 1204 | "20120 351.98 2013-12-27\n", 1205 | "20885 351.92 2013-12-28\n", 1206 | "21599 351.99 2013-12-29\n", 1207 | "22313 352.02 2013-12-30\n", 1208 | "22823 351.97 2013-12-31" 1209 | ] 1210 | }, 1211 | "execution_count": 24, 1212 | "metadata": {}, 1213 | "output_type": "execute_result" 1214 | } 1215 | ], 1216 | "source": [ 1217 | "ny_pd.head()" 1218 | ] 1219 | }, 1220 | { 1221 | "cell_type": "code", 1222 | "execution_count": 25, 1223 | "metadata": { 1224 | "collapsed": false 1225 | }, 1226 | "outputs": [ 1227 | { 1228 | "data": { 1229 | "text/html": [ 1230 | "
\n", 1231 | "\n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | "
CA_HighQdateNY_HighQ
0248.772013-12-27351.98
1248.742013-12-28351.92
2248.762013-12-29351.99
3248.822013-12-30352.02
4248.762013-12-31351.97
\n", 1273 | "
" 1274 | ], 1275 | "text/plain": [ 1276 | " CA_HighQ date NY_HighQ\n", 1277 | "0 248.77 2013-12-27 351.98\n", 1278 | "1 248.74 2013-12-28 351.92\n", 1279 | "2 248.76 2013-12-29 351.99\n", 1280 | "3 248.82 2013-12-30 352.02\n", 1281 | "4 248.76 2013-12-31 351.97" 1282 | ] 1283 | }, 1284 | "execution_count": 25, 1285 | "metadata": {}, 1286 | "output_type": "execute_result" 1287 | } 1288 | ], 1289 | "source": [ 1290 | "ca_ny_pd = pd.merge(california_pd.ix[:,[1,7]].copy(), ny_pd, on=\"date\")\n", 1291 | "ca_ny_pd.rename(columns={\"HighQ\": \"CA_HighQ\"}, inplace=True)\n", 1292 | "ca_ny_pd.head()" 1293 | ] 1294 | }, 1295 | { 1296 | "cell_type": "code", 1297 | "execution_count": 26, 1298 | "metadata": { 1299 | "collapsed": false 1300 | }, 1301 | "outputs": [ 1302 | { 1303 | "data": { 1304 | "text/plain": [ 1305 | "346.9127616926502" 1306 | ] 1307 | }, 1308 | "execution_count": 26, 1309 | "metadata": {}, 1310 | "output_type": "execute_result" 1311 | } 1312 | ], 1313 | "source": [ 1314 | "ny_mean = ca_ny_pd.NY_HighQ.mean()\n", 1315 | "ny_mean" 1316 | ] 1317 | }, 1318 | { 1319 | "cell_type": "code", 1320 | "execution_count": 27, 1321 | "metadata": { 1322 | "collapsed": false 1323 | }, 1324 | "outputs": [ 1325 | { 1326 | "data": { 1327 | "text/html": [ 1328 | "
\n", 1329 | "\n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | "
CA_HighQdateNY_HighQca_dev
0248.772013-12-27351.983.393875
1248.742013-12-28351.923.363875
2248.762013-12-29351.993.383875
3248.822013-12-30352.023.443875
4248.762013-12-31351.973.383875
\n", 1377 | "
" 1378 | ], 1379 | "text/plain": [ 1380 | " CA_HighQ date NY_HighQ ca_dev\n", 1381 | "0 248.77 2013-12-27 351.98 3.393875\n", 1382 | "1 248.74 2013-12-28 351.92 3.363875\n", 1383 | "2 248.76 2013-12-29 351.99 3.383875\n", 1384 | "3 248.82 2013-12-30 352.02 3.443875\n", 1385 | "4 248.76 2013-12-31 351.97 3.383875" 1386 | ] 1387 | }, 1388 | "execution_count": 27, 1389 | "metadata": {}, 1390 | "output_type": "execute_result" 1391 | } 1392 | ], 1393 | "source": [ 1394 | "ca_ny_pd['ca_dev'] = ca_ny_pd['CA_HighQ'] - ca_mean\n", 1395 | "ca_ny_pd.head()" 1396 | ] 1397 | }, 1398 | { 1399 | "cell_type": "code", 1400 | "execution_count": 28, 1401 | "metadata": { 1402 | "collapsed": false 1403 | }, 1404 | "outputs": [ 1405 | { 1406 | "data": { 1407 | "text/html": [ 1408 | "
\n", 1409 | "\n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | "
CA_HighQdateNY_HighQca_devny_dev
0248.772013-12-27351.983.3938755.067238
1248.742013-12-28351.923.3638755.007238
2248.762013-12-29351.993.3838755.077238
3248.822013-12-30352.023.4438755.107238
4248.762013-12-31351.973.3838755.057238
\n", 1463 | "
" 1464 | ], 1465 | "text/plain": [ 1466 | " CA_HighQ date NY_HighQ ca_dev ny_dev\n", 1467 | "0 248.77 2013-12-27 351.98 3.393875 5.067238\n", 1468 | "1 248.74 2013-12-28 351.92 3.363875 5.007238\n", 1469 | "2 248.76 2013-12-29 351.99 3.383875 5.077238\n", 1470 | "3 248.82 2013-12-30 352.02 3.443875 5.107238\n", 1471 | "4 248.76 2013-12-31 351.97 3.383875 5.057238" 1472 | ] 1473 | }, 1474 | "execution_count": 28, 1475 | "metadata": {}, 1476 | "output_type": "execute_result" 1477 | } 1478 | ], 1479 | "source": [ 1480 | "ca_ny_pd['ny_dev'] = ca_ny_pd['NY_HighQ'] - ny_mean\n", 1481 | "ca_ny_pd.head()" 1482 | ] 1483 | }, 1484 | { 1485 | "cell_type": "code", 1486 | "execution_count": 29, 1487 | "metadata": { 1488 | "collapsed": false 1489 | }, 1490 | "outputs": [ 1491 | { 1492 | "name": "stdout", 1493 | "output_type": "stream", 1494 | "text": [ 1495 | "Covariance of the High Quality weed prices in CA and NY is: 5.91681496729\n" 1496 | ] 1497 | } 1498 | ], 1499 | "source": [ 1500 | "ca_ny_cov = (ca_ny_pd['ca_dev'] * ca_ny_pd['ny_dev']).sum() / (ca_count - 1)\n", 1501 | "print \"Covariance of the High Quality weed prices in CA and NY is:\", ca_ny_cov" 1502 | ] 1503 | }, 1504 | { 1505 | "cell_type": "markdown", 1506 | "metadata": {}, 1507 | "source": [ 1508 | "#### Using Pandas built-in function" 1509 | ] 1510 | }, 1511 | { 1512 | "cell_type": "code", 1513 | "execution_count": 30, 1514 | "metadata": { 1515 | "collapsed": false 1516 | }, 1517 | "outputs": [ 1518 | { 1519 | "data": { 1520 | "text/html": [ 1521 | "
\n", 1522 | "\n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | "
CA_HighQNY_HighQca_devny_dev
CA_HighQ2.9826865.9168152.9826865.916815
NY_HighQ5.91681512.2451475.91681512.245147
ca_dev2.9826865.9168152.9826865.916815
ny_dev5.91681512.2451475.91681512.245147
\n", 1563 | "
" 1564 | ], 1565 | "text/plain": [ 1566 | " CA_HighQ NY_HighQ ca_dev ny_dev\n", 1567 | "CA_HighQ 2.982686 5.916815 2.982686 5.916815\n", 1568 | "NY_HighQ 5.916815 12.245147 5.916815 12.245147\n", 1569 | "ca_dev 2.982686 5.916815 2.982686 5.916815\n", 1570 | "ny_dev 5.916815 12.245147 5.916815 12.245147" 1571 | ] 1572 | }, 1573 | "execution_count": 30, 1574 | "metadata": {}, 1575 | "output_type": "execute_result" 1576 | } 1577 | ], 1578 | "source": [ 1579 | "ca_ny_pd.cov()" 1580 | ] 1581 | }, 1582 | { 1583 | "cell_type": "markdown", 1584 | "metadata": {}, 1585 | "source": [ 1586 | "### Correlation\n", 1587 | "\n", 1588 | "Extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.\n", 1589 | "\n", 1590 | "\n", 1591 | "\n", 1592 | "
\n", 1593 | "
\n", 1594 | "
\n", 1595 | "\n", 1596 | "#### Finding correlation between weed prices in New York and California" 1597 | ] 1598 | }, 1599 | { 1600 | "cell_type": "code", 1601 | "execution_count": 31, 1602 | "metadata": { 1603 | "collapsed": false 1604 | }, 1605 | "outputs": [ 1606 | { 1607 | "name": "stdout", 1608 | "output_type": "stream", 1609 | "text": [ 1610 | "Correlation between weed prices in NY and CA: 0.979043961106\n" 1611 | ] 1612 | } 1613 | ], 1614 | "source": [ 1615 | "ca_highq_std = ca_ny_pd.CA_HighQ.std()\n", 1616 | "ny_highq_std = ca_ny_pd.NY_HighQ.std()\n", 1617 | "\n", 1618 | "ca_ny_corr = ca_ny_cov / (ca_highq_std * ny_highq_std)\n", 1619 | "print \"Correlation between weed prices in NY and CA:\", ca_ny_corr" 1620 | ] 1621 | }, 1622 | { 1623 | "cell_type": "code", 1624 | "execution_count": 32, 1625 | "metadata": { 1626 | "collapsed": false 1627 | }, 1628 | "outputs": [ 1629 | { 1630 | "data": { 1631 | "text/html": [ 1632 | "
\n", 1633 | "\n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | "
CA_HighQNY_HighQca_devny_dev
CA_HighQ1.0000000.9790441.0000000.979044
NY_HighQ0.9790441.0000000.9790441.000000
ca_dev1.0000000.9790441.0000000.979044
ny_dev0.9790441.0000000.9790441.000000
\n", 1674 | "
" 1675 | ], 1676 | "text/plain": [ 1677 | " CA_HighQ NY_HighQ ca_dev ny_dev\n", 1678 | "CA_HighQ 1.000000 0.979044 1.000000 0.979044\n", 1679 | "NY_HighQ 0.979044 1.000000 0.979044 1.000000\n", 1680 | "ca_dev 1.000000 0.979044 1.000000 0.979044\n", 1681 | "ny_dev 0.979044 1.000000 0.979044 1.000000" 1682 | ] 1683 | }, 1684 | "execution_count": 32, 1685 | "metadata": {}, 1686 | "output_type": "execute_result" 1687 | } 1688 | ], 1689 | "source": [ 1690 | "ca_ny_pd.corr()" 1691 | ] 1692 | }, 1693 | { 1694 | "cell_type": "markdown", 1695 | "metadata": {}, 1696 | "source": [ 1697 | "# Correlation != Causation\n", 1698 | "\n", 1699 | "correlation between two variables does not necessarily imply that one causes the other.\n", 1700 | "\n", 1701 | "\n", 1702 | "" 1703 | ] 1704 | } 1705 | ], 1706 | "metadata": { 1707 | "kernelspec": { 1708 | "display_name": "Python 2", 1709 | "language": "python", 1710 | "name": "python2" 1711 | }, 1712 | "language_info": { 1713 | "codemirror_mode": { 1714 | "name": "ipython", 1715 | "version": 2 1716 | }, 1717 | "file_extension": ".py", 1718 | "mimetype": "text/x-python", 1719 | "name": "python", 1720 | "nbconvert_exporter": "python", 1721 | "pygments_lexer": "ipython2", 1722 | "version": "2.7.10" 1723 | } 1724 | }, 1725 | "nbformat": 4, 1726 | "nbformat_minor": 0 1727 | } 1728 | -------------------------------------------------------------------------------- /notebooks/6. Hypothesis Testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Hypothesis Testing\n", 8 | "\n", 9 | "\n", 10 | "We would like to know if the effects we see in the sample(observed data) are likely to occur in the population. 
\n", 11 | "\n", 12 | "The way classical hypothesis testing works is by conducting a statistical test to answer the following question:\n", 13 | "> Given the sample and an effect, what is the probability of seeing that effect just by chance?\n", 14 | "\n", 15 | "Here are the steps we would follow:\n", 16 | "\n", 17 | "1. Compute test statistic\n", 18 | "2. Define null hypothesis\n", 19 | "3. Compute p-value\n", 20 | "4. Interpret the result\n", 21 | "\n", 22 | "If the p-value is very low (more often than not, below 0.05), the effect is considered statistically significant. That means the effect is unlikely to have occurred by chance. The inference? The effect is likely to be seen in the population too. \n", 23 | "\n", 24 | "This process is very similar to the *proof by contradiction* paradigm. We first assume that the effect is not real. That's the null hypothesis. The next step is to compute the probability of seeing an effect at least as large under that assumption (the p-value). If the p-value is very low (below 0.05 as a rule of thumb), we reject the null hypothesis. 
" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import numpy as np\n", 36 | "import pandas as pd\n", 37 | "from scipy import stats\n", 38 | "import matplotlib as mpl\n", 39 | "%matplotlib inline" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": { 46 | "collapsed": false 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "import seaborn as sns\n", 51 | "sns.set(color_codes=True)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "metadata": { 58 | "collapsed": true 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "weed_pd = pd.read_csv(\"../data/Weed_Price.csv\", parse_dates=[-1])" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 4, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "weed_pd[\"month\"] = weed_pd.date.apply(lambda x: x.month)\n", 74 | "weed_pd[\"year\"] = weed_pd.date.apply(lambda x: x.year)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 5, 80 | "metadata": { 81 | "collapsed": false 82 | }, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/html": [ 87 | "
\n", 88 | "\n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | "
StateHighQHighQNMedQMedQNLowQLowQNdatemonthyear
0Alabama339.061042198.64933149.491232014-01-0112014
1Alaska288.75252260.60297388.58262014-01-0112014
2Arizona303.311941209.351625189.452222014-01-0112014
3Arkansas361.85576185.62544125.871122014-01-0112014
4California248.7812096193.5612812192.927782014-01-0112014
\n", 172 | "
" 173 | ], 174 | "text/plain": [ 175 | " State HighQ HighQN MedQ MedQN LowQ LowQN date month \\\n", 176 | "0 Alabama 339.06 1042 198.64 933 149.49 123 2014-01-01 1 \n", 177 | "1 Alaska 288.75 252 260.60 297 388.58 26 2014-01-01 1 \n", 178 | "2 Arizona 303.31 1941 209.35 1625 189.45 222 2014-01-01 1 \n", 179 | "3 Arkansas 361.85 576 185.62 544 125.87 112 2014-01-01 1 \n", 180 | "4 California 248.78 12096 193.56 12812 192.92 778 2014-01-01 1 \n", 181 | "\n", 182 | " year \n", 183 | "0 2014 \n", 184 | "1 2014 \n", 185 | "2 2014 \n", 186 | "3 2014 \n", 187 | "4 2014 " 188 | ] 189 | }, 190 | "execution_count": 5, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | } 194 | ], 195 | "source": [ 196 | "weed_pd.head()" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### Let's work on weed prices in California in 2014\n" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 6, 209 | "metadata": { 210 | "collapsed": true 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "weed_ca_2014 = weed_pd[(weed_pd.State==\"California\") & (weed_pd.year==2014)]" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 7, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "Mean: 245.894230769\n", 229 | "Standard Deviation: 1.28990793937\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "#Mean and standard deviation of high quality weed's price\n", 235 | "print \"Mean:\", weed_ca_2014.HighQ.mean()\n", 236 | "print \"Standard Deviation:\", weed_ca_2014.HighQ.std()" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 8, 242 | "metadata": { 243 | "collapsed": false 244 | }, 245 | "outputs": [ 246 | { 247 | "data": { 248 | "text/plain": [ 249 | "(245.761718492726, 246.02674304573577)" 250 | ] 251 | }, 252 | "execution_count": 8, 253 | "metadata": 
{}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "#Confidence interval on the mean\n", 259 | "stats.norm.interval(0.95, loc=weed_ca_2014.HighQ.mean(), scale = weed_ca_2014.HighQ.std()/np.sqrt(len(weed_ca_2014)))" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "### Question: Are high-quality weed prices in Jan 2014 significantly higher than in Jan 2015?" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 9, 272 | "metadata": { 273 | "collapsed": false 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "#Get the data\n", 278 | "weed_ca_jan2014 = np.array(weed_pd[(weed_pd.State==\"California\") & (weed_pd.year==2014) & (weed_pd.month==1)].HighQ)\n", 279 | "weed_ca_jan2015 = np.array(weed_pd[(weed_pd.State==\"California\") & (weed_pd.year==2015) & (weed_pd.month==1)].HighQ)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 10, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [ 289 | { 290 | "name": "stdout", 291 | "output_type": "stream", 292 | "text": [ 293 | "Mean-2014 Jan: 248.445483871\n", 294 | "Mean-2015 Jan: 243.602258065\n" 295 | ] 296 | } 297 | ], 298 | "source": [ 299 | "print \"Mean-2014 Jan:\", weed_ca_jan2014.mean()\n", 300 | "print \"Mean-2015 Jan:\", weed_ca_jan2015.mean()" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 11, 306 | "metadata": { 307 | "collapsed": false 308 | }, 309 | "outputs": [ 310 | { 311 | "name": "stdout", 312 | "output_type": "stream", 313 | "text": [ 314 | "Effect size: 4.84322580645\n" 315 | ] 316 | } 317 | ], 318 | "source": [ 319 | "print \"Effect size:\", weed_ca_jan2014.mean() - weed_ca_jan2015.mean()" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "**Null Hypothesis**: Mean prices aren't significantly different\n", 327 | "\n", 328 | "Perform **t-test** and determine the 
p-value. " 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 12, 334 | "metadata": { 335 | "collapsed": false 336 | }, 337 | "outputs": [ 338 | { 339 | "data": { 340 | "text/plain": [ 341 | "Ttest_indResult(statistic=98.011325238158051, pvalue=6.2979718185084028e-68)" 342 | ] 343 | }, 344 | "execution_count": 12, 345 | "metadata": {}, 346 | "output_type": "execute_result" 347 | } 348 | ], 349 | "source": [ 350 | "stats.ttest_ind(weed_ca_jan2014, weed_ca_jan2015, equal_var=True)" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "The p-value is the probability of observing an effect size at least this large purely by chance, if the null hypothesis were true. And here, the p-value is almost 0.\n", 358 | "\n", 359 | "*Conclusion*: The price difference is statistically significant. But is a price difference of $4.84 a big deal? The price decreased in 2015 by almost 2%. Always remember to look at the effect size. " 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "**Problem** Determine if prices of medium quality weed for Jan 2015 and Feb 2015 are significantly different for New York. " 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "### Assumption of t-test\n", 383 | "\n", 384 | "One assumption is that the data used came from a normal distribution. \n", 385 | "
\n", 386 | "There's a [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro-Wilk) to test for normality. If the p-value is less than 0.05, we reject the hypothesis that the data came from a normal distribution. A larger p-value means the test found no evidence against normality." 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 13, 392 | "metadata": { 393 | "collapsed": false 394 | }, 395 | "outputs": [ 396 | { 397 | "data": { 398 | "text/plain": [ 399 | "(0.9469053149223328, 0.12818680703639984)" 400 | ] 401 | }, 402 | "execution_count": 13, 403 | "metadata": {}, 404 | "output_type": "execute_result" 405 | } 406 | ], 407 | "source": [ 408 | "stats.shapiro(weed_ca_jan2015)" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 14, 414 | "metadata": { 415 | "collapsed": false 416 | }, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "(0.9353488683700562, 0.06141229346394539)" 422 | ] 423 | }, 424 | "execution_count": 14, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "stats.shapiro(weed_ca_jan2014)" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": { 437 | "collapsed": true 438 | }, 439 | "outputs": [], 440 | "source": [ 441 | "#Both p-values are above 0.05, so we seem to be good." 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "### A/B testing\n", 449 | "\n", 450 | "Comparing two versions to check which one performs better. Eg: Show people two variants of the same webpage and find which one provides the better conversion rate (or other relevant metric). [wiki](https://en.wikipedia.org/wiki/A/B_testing)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "**Exercise: Impact of regulation and deregulation.**\n", 458 | "\n", 459 | "Information on regulation of Weed in the US by State [wiki](Impact of regulation and deregulation on a couple of states )\n", 460 | "\n", 461 | "1. 
Alaska legalized it on 4th Nov 2014. Find if prices significantly changed in Dec 2014 compared to Oct 2014. \n", 462 | "2. Maryland decriminalized possessing weed from Oct 1, 2014. Find if prices of weed changed significantly in Oct 2014 compared to Sep 2014" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": { 469 | "collapsed": true 470 | }, 471 | "outputs": [], 472 | "source": [] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "metadata": {}, 477 | "source": [ 478 | "

Something to think about: Which of these give smaller p-values?

\n", 479 | " \n", 480 | " * Smaller effect size\n", 481 | " * Smaller standard error\n", 482 | " * Smaller sample size\n", 483 | " * Higher variance\n", 484 | " \n", 485 | " **Answer:** " 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "### Chi-square tests" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "Chi-square tests are used when the data are frequencies (counts), rather than numerical scores or prices.\n", 500 | "\n", 501 | "The following two tests make use of the chi-square statistic:\n", 502 | "\n", 503 | "1. chi-square test for goodness of fit\n", 504 | "2. chi-square test for independence\n", 505 | "\n", 506 | "Chi-square tests are non-parametric tests. They do not require assumptions about population parameters and they do not test hypotheses about population parameters." 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "

Chi-square test for goodness of fit

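A minimal sketch of a goodness-of-fit test, assuming made-up purchase counts (the numbers and variable names below are illustrative, not taken from the weed price data). It computes the statistic by hand as the sum of (O - E)^2 / E and compares it with `scipy.stats.chisquare`. Because that function compares counts rather than proportions, the baseline counts are first rescaled so that the expected and observed totals agree.

```python
import numpy as np
from scipy import stats

# Hypothetical purchase counts for High, Medium and Low quality (made-up numbers).
baseline_counts = np.array([500.0, 450.0, 50.0])   # period that defines the expected pattern
observed_counts = np.array([960.0, 950.0, 90.0])   # period being tested

# Convert the baseline to proportions, then scale to the observed total,
# so that both arrays describe the same number of purchases.
expected_counts = baseline_counts / baseline_counts.sum() * observed_counts.sum()

# Chi-square statistic by hand: sum of (O - E)^2 / E over the categories.
chi2_manual = np.sum((observed_counts - expected_counts) ** 2 / expected_counts)

# The same test with SciPy; degrees of freedom default to (number of categories - 1).
chi2_scipy, p_value = stats.chisquare(observed_counts, f_exp=expected_counts)
```

With only three categories the test has two degrees of freedom, so `p_value` can be cross-checked against `np.exp(-chi2_manual / 2)`, the chi-square tail probability for that case.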
" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "$$ \\chi^2 = \\sum (O - E)^2/E $$\n", 521 | "\n", 522 | "* O is observed frequency\n", 523 | "* E is expected frequency\n", 524 | "* $ \\chi $ is the chi-square statistic" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "Let's assume the proportion of people who bought High, Medium and Low quality weed in Jan-2014 as the expected proportion. Find if proportion of people who bought weed in Jan 2015 conformed to the norm" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": 16, 537 | "metadata": { 538 | "collapsed": true 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "weed_jan2014 = weed_pd[(weed_pd.year==2014) & (weed_pd.month==1)][[\"HighQN\", \"MedQN\", \"LowQN\"]]\n", 543 | "weed_jan2015 = weed_pd[(weed_pd.year==2015) & (weed_pd.month==1)][[\"HighQN\", \"MedQN\", \"LowQN\"]]" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 17, 549 | "metadata": { 550 | "collapsed": false 551 | }, 552 | "outputs": [], 553 | "source": [ 554 | "Expected = np.array(weed_jan2014.apply(sum, axis=0))\n", 555 | "Observed = np.array(weed_jan2015.apply(sum, axis=0))" 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": 18, 561 | "metadata": { 562 | "collapsed": false 563 | }, 564 | "outputs": [ 565 | { 566 | "name": "stdout", 567 | "output_type": "stream", 568 | "text": [ 569 | "Expected: [2918004 2644757 263958] \n", 570 | "Observed: [4057716 4035049 358088]\n" 571 | ] 572 | } 573 | ], 574 | "source": [ 575 | "print \"Expected:\", Expected, \"\\n\" , \"Observed:\", Observed" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 19, 581 | "metadata": { 582 | "collapsed": false 583 | }, 584 | "outputs": [ 585 | { 586 | "name": "stdout", 587 | "output_type": "stream", 588 | "text": [ 589 | "Expected: [ 0.5007971 0.45390159 
0.04530131] \n", 590 | "Observed: [ 0.48015461 0.47747239 0.042373 ]\n" 591 | ] 592 | } 593 | ], 594 | "source": [ 595 | "print \"Expected:\", Expected/np.sum(Expected.astype(float)), \"\\n\" , \"Observed:\", Observed/np.sum(Observed.astype(float))" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": 20, 601 | "metadata": { 602 | "collapsed": false 603 | }, 604 | "outputs": [ 605 | { 606 | "data": { 607 | "text/plain": [ 608 | "Power_divergenceResult(statistic=1209562.2775169075, pvalue=0.0)" 609 | ] 610 | }, 611 | "execution_count": 20, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "stats.chisquare(Observed, Expected)" 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "*Inference* : We reject the null hypothesis. The proportions in Jan 2015 are different from what was expected." 625 | ] 626 | } 627 | ], 628 | "metadata": { 629 | "kernelspec": { 630 | "display_name": "Python 2", 631 | "language": "python", 632 | "name": "python2" 633 | }, 634 | "language_info": { 635 | "codemirror_mode": { 636 | "name": "ipython", 637 | "version": 2 638 | }, 639 | "file_extension": ".py", 640 | "mimetype": "text/x-python", 641 | "name": "python", 642 | "nbconvert_exporter": "python", 643 | "pygments_lexer": "ipython2", 644 | "version": "2.7.10" 645 | } 646 | }, 647 | "nbformat": 4, 648 | "nbformat_minor": 0 649 | } 650 | -------------------------------------------------------------------------------- /notebooks/8. Closing thoughts and terminology.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Terminology\n", 8 | "\n", 9 | "1. Null Hypothesis\n", 10 | "2. Alternate Hypothesis\n", 11 | "3. p-value (Probability of observing the metric from the data at least as extreme as computed just by chance)\n", 12 | "4. 
Bootstrap\n", 13 | "5. Acceptance Region\n", 14 | "6. Rejection Region\n", 15 | "7. t-test\n", 16 | "8. One-tailed test\n", 17 | "9. Two-tailed test\n", 18 | "10. Significance test\n", 19 | "11. Confidence interval\n", 20 | "12. Power of a test\n", 21 | "13. Type 1 error (Rejecting null hypothesis when it is true). Also called false positive.\n", 22 | "14. Type 2 error (Failing to reject null hypothesis when it is false). Also called false negative.\n", 23 | "\n", 24 | "# Some Practical thoughts \n", 25 | "\n", 26 | "1. Data could be biased. Confidence intervals may then not be representative.\n", 27 | "2. One way to handle biased data is to use bias-corrected confidence intervals. \n", 28 | "3. Outliers can impact confidence intervals. \n", 29 | "4. Too often, people remove outliers. But they might be encoding some necessary information.\n", 30 | "5. One way to handle outliers is to use ranking, instead of actual numbers.\n", 31 | "6. If sample size is small, bootstrapping underestimates the size of the confidence interval. \n", 32 | "7. Better to use significance testing if sample size is small.\n", 33 | "8. Bootstrapping should not be used to find extreme values (Eg: maximum sales of shoes, 5th largest sales of shoes, etc)\n", 34 | "9. Use rank transformation when using bootstrapping, if the data has outliers\n", 35 | "10. Lack of representativeness is a problem for any statistical technique\n", 36 | "11. The experiment should be random. (Eg: When doing A/B testing, randomize the subjects). Experimental bias can lead to wrong inferences. \n", 37 | "12. Resampling time series data is tricky. The assumption we used, that each data point is independent, doesn't hold for time series data. \n", 38 | "13. Rank transformation changes the question. For our shoe sales example, a rank transformed analysis would be: \"Do sales tend to be higher after price optimization?\". (Our analysis was: \"Do post-price-optimization sales have a higher mean?\")\n", 39 | "14. 
Power of a test increases if sample size increases\n", 40 | "\n", 41 | "# Types of Error\n", 42 | "\n", 43 | "1. Sampling Bias\n", 44 | "2. Measurement Error\n", 45 | "3. Random Error\n" 46 | ] 47 | } 48 | ], 49 | "metadata": { 50 | "kernelspec": { 51 | "display_name": "Python 2", 52 | "language": "python", 53 | "name": "python2" 54 | }, 55 | "language_info": { 56 | "codemirror_mode": { 57 | "name": "ipython", 58 | "version": 2 59 | }, 60 | "file_extension": ".py", 61 | "mimetype": "text/x-python", 62 | "name": "python", 63 | "nbconvert_exporter": "python", 64 | "pygments_lexer": "ipython2", 65 | "version": "2.7.10" 66 | } 67 | }, 68 | "nbformat": 4, 69 | "nbformat_minor": 0 70 | } 71 | -------------------------------------------------------------------------------- /notebooks/9. References.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Books, Slides, Articles\n", 8 | "1. Book: [Think Stats](http://greenteapress.com/thinkstats2/)\n", 9 | "2. Book: [All of Statistics](http://www.stat.cmu.edu/~larry/all-of-statistics/)\n", 10 | "3. Book: [Statistics is Easy](http://www.amazon.com/Statistics-Edition-Synthesis-Lectures-Mathematics/dp/160845570X)\n", 11 | "4. Workshop: Computational Statistics Workshop, Allen Downey SciPy 2015 [Repo](https://github.com/AllenDowney/CompStats) [Video](https://www.youtube.com/watch?v=5Vjrqnk7Igs)\n", 12 | "5. 
Slides: [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [] 23 | } 24 | ], 25 | "metadata": { 26 | "kernelspec": { 27 | "display_name": "Python 2", 28 | "language": "python", 29 | "name": "python2" 30 | }, 31 | "language_info": { 32 | "codemirror_mode": { 33 | "name": "ipython", 34 | "version": 2 35 | }, 36 | "file_extension": ".py", 37 | "mimetype": "text/x-python", 38 | "name": "python", 39 | "nbconvert_exporter": "python", 40 | "pygments_lexer": "ipython2", 41 | "version": "2.7.10" 42 | } 43 | }, 44 | "nbformat": 4, 45 | "nbformat_minor": 0 46 | } 47 | -------------------------------------------------------------------------------- /notebooks/img/6sigma.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/6sigma.png -------------------------------------------------------------------------------- /notebooks/img/binomial.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/binomial.gif -------------------------------------------------------------------------------- /notebooks/img/binomial_pmf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/binomial_pmf.png -------------------------------------------------------------------------------- /notebooks/img/correlation.gif: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/correlation.gif -------------------------------------------------------------------------------- /notebooks/img/correlation_not_causation.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/correlation_not_causation.gif -------------------------------------------------------------------------------- /notebooks/img/covariance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/covariance.png -------------------------------------------------------------------------------- /notebooks/img/exponential_pdf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/exponential_pdf.png -------------------------------------------------------------------------------- /notebooks/img/kurtosis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/kurtosis.png -------------------------------------------------------------------------------- /notebooks/img/leastsquare.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/leastsquare.gif -------------------------------------------------------------------------------- /notebooks/img/normal_cdf.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/normal_cdf.png -------------------------------------------------------------------------------- /notebooks/img/normal_pdf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/normal_pdf.png -------------------------------------------------------------------------------- /notebooks/img/normaldist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/normaldist.png -------------------------------------------------------------------------------- /notebooks/img/skewness.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/skewness.png -------------------------------------------------------------------------------- /notebooks/img/uniform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/uniform.png -------------------------------------------------------------------------------- /notebooks/img/variance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rouseguy/intro2stats/1c3c89725ede55fdb4af6e0b846888a2d664a211/notebooks/img/variance.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | appnope==0.1.0 2 | backports.ssl-match-hostname==3.4.0.2 3 | certifi==2015.4.28 4 | decorator==4.0.2 5 | 
funcsigs==0.4 6 | functools32==3.2.3.post2 7 | gnureadline==6.3.3 8 | ipykernel==4.0.3 9 | ipython==4.0.0 10 | ipython-genutils==0.1.0 11 | Jinja2==2.8 12 | jsonschema==2.5.1 13 | jupyter-client==4.0.0 14 | jupyter-core==4.0.4 15 | MarkupSafe==0.23 16 | matplotlib==1.4.3 17 | mistune==0.7.1 18 | mock==1.3.0 19 | nbconvert==4.0.0 20 | nbformat==4.0.0 21 | nose==1.3.7 22 | notebook==4.0.4 23 | numpy==1.9.2 24 | pandas==0.16.2 25 | path.py==7.7 26 | patsy==0.4.0 27 | pbr==1.6.0 28 | pexpect==3.3 29 | pickleshare==0.5 30 | ptyprocess==0.5 31 | Pygments==2.0.2 32 | pyparsing==2.0.3 33 | python-dateutil==2.4.2 34 | pytz==2015.4 35 | pyzmq==14.7.0 36 | scikit-learn==0.16.1 37 | scipy==0.16.0 38 | seaborn==0.6.0 39 | simplegeneric==0.8.1 40 | six==1.9.0 41 | terminado==0.5 42 | tornado==4.2.1 43 | traitlets==4.0.0 44 | vincent==0.4.4 45 | statsmodels==0.6.1 46 | -------------------------------------------------------------------------------- /requirements_linux.txt: -------------------------------------------------------------------------------- 1 | backports.ssl-match-hostname==3.4.0.2 2 | certifi==2015.4.28 3 | decorator==4.0.2 4 | funcsigs==0.4 5 | functools32==3.2.3.post2 6 | ipykernel==4.0.3 7 | ipython==4.0.0 8 | ipython-genutils==0.1.0 9 | Jinja2==2.8 10 | jsonschema==2.5.1 11 | jupyter-client==4.0.0 12 | jupyter-core==4.0.4 13 | MarkupSafe==0.23 14 | matplotlib==1.4.3 15 | mistune==0.7.1 16 | mock==1.3.0 17 | nbconvert==4.0.0 18 | nbformat==4.0.0 19 | nose==1.3.7 20 | notebook==4.0.4 21 | numpy==1.9.2 22 | pandas==0.16.2 23 | path.py==7.7.1 24 | patsy==0.4.0 25 | pbr==1.6.0 26 | pexpect==3.3 27 | pickleshare==0.5 28 | ptyprocess==0.5 29 | Pygments==2.0.2 30 | pyparsing==2.0.3 31 | python-dateutil==2.4.2 32 | pytz==2015.4 33 | pyzmq==14.7.0 34 | scikit-learn==0.16.1 35 | scipy==0.16.0 36 | seaborn==0.6.0 37 | simplegeneric==0.8.1 38 | six==1.9.0 39 | statsmodels==0.6.1 40 | terminado==0.5 41 | tornado==4.2.1 42 | traitlets==4.0.0 43 | vincent==0.4.4 44 | 
wheel==0.24.0 45 | --------------------------------------------------------------------------------