├── 1. Statistical Simulation in R.Rmd ├── 2. Statistical Simulation in Python.ipynb ├── 3. A Practical Guide To AB Tests.ipynb ├── 4. AA Test.ipynb └── README.md /1. Statistical Simulation in R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Statistical Simulation" 3 | author: "Leihua Ye" 4 | date: "9/23/2020" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | library(dplyr) 11 | ``` 12 | 13 | # statistical simulation 101 14 | # part i: basics of sampling 15 | - runif 16 | - sample 17 | - sapply 18 | - apply 19 | - applications 20 | 21 | Check the sample function; it is the foundation of statistical modeling 22 | 23 | sample(x, size, replace = FALSE, prob = NULL) 24 | 25 | x: a vector of one or more elements from which to choose, or a positive integer 26 | size: a non-negative integer giving the number of items to choose 27 | replace: should sampling be with replacement? 28 | prob: a vector of probability weights for obtaining the elements of the vector being sampled 29 | 30 | # how to generate a random sample 31 | runif(n,min=minimal_value,max=max_value) 32 | ```{r} 33 | set.seed(2) 34 | runif(25,min=0,max=10) 35 | 36 | runif(25,min=0,max=10) %>% 37 | round(.,digits = 0) 38 | ``` 39 | 40 | 41 | ```{r} 42 | # flip a coin 43 | sample(c('H','T'),size = 10,replace=TRUE) 44 | sample(c(1:6),size=10,replace=TRUE) 45 | ``` 46 | 47 | # the sample() function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions of numbers 48 | ```{r} 49 | #1. 
random permutation of sequence [1,10] 50 | set.seed(2) 51 | sample(10) 52 | ?sample 53 | ``` 54 | 55 | ```{r} 56 | sample(1:10,4) 57 | ``` 58 | 59 | ```{r} 60 | set.seed(1) 61 | sample(letters,18) 62 | ``` 63 | 64 | ```{r} 65 | # from the R documentation 66 | x<- 1:10 67 | # a random permutation 68 | sample(x) 69 | 70 | # resample with replacement 71 | sample(x,replace=TRUE) 72 | ``` 73 | 74 | ```{r} 75 | # random sample of size 10,000 from sequence [1,5] with equal probabilities 76 | equal_prob_dist = sample(5,10000,prob=rep(0.2,5),replace=TRUE) 77 | hist(equal_prob_dist) 78 | ``` 79 | 80 | ```{r} 81 | # random sample of size 10,000 from sequence [1,5] with unequal probabilities 82 | unequal_prob_dist = sample(5,10000,prob = c(0.1,0.25,0.4,0.25,0.1),replace=TRUE) 83 | hist(unequal_prob_dist) 84 | ``` 85 | By default, every element is equally likely if prob is not specified. 86 | 87 | # To sample rows from a data frame or a list, we can sample the indices into an object rather than the elements of the object itself. Take the indices that correspond to the elements! 88 | ```{r} 89 | head(mtcars) 90 | ``` 91 | ```{r} 92 | # create an index vector for the elements/rows 93 | index <- seq_len(nrow(mtcars)) 94 | 95 | # sample from the index vector 96 | set.seed(12) 97 | 98 | # to obtain a random sample of 10 99 | sample_index <- sample(index,10) 100 | 101 | # to show the sampled elements/rows 102 | mtcars[sample_index,] 103 | ``` 104 | 105 | 106 | 107 | # Part 2: Application 108 | # example 1: dice 109 | ```{r} 110 | # use sample() to run 10,000 trials using two fair dice. what is the probability of rolling a 7?
111 | set.seed(1) 112 | die = 1:6 113 | die1 = sample(die,10000,replace = TRUE,prob=NULL) 114 | die2= sample(die,10000,replace=TRUE,prob = NULL) 115 | outcomes = die1+die2 116 | mean(outcomes == 7) 117 | ``` 118 | 119 | ```{r} 120 | # simulate the process 10000 times, one roll at a time, and check for discrepancy 121 | set.seed(1) 122 | sevens = 0 123 | for (i in 1:10000){ 124 | die_sum = sample(die,1)+sample(die,1) 125 | sevens = sevens+(die_sum==7) 126 | } 127 | print(sevens/10000) 128 | ``` 129 | ```{r} 130 | # check for system time 131 | system.time(for (i in 1:10000){ 132 | die_1 = sample(die,1) 133 | die_2 = sample(die,1) 134 | die_sum = die_1+die_2 135 | is_seven = (die_sum==7) 136 | }) 137 | ``` 138 | 139 | ```{r} 140 | # You have two dice. What is the probability of rolling a 7? 141 | # use sample() to run 20,000 trials using two fair dice. 142 | set.seed(1) 143 | die = 1:6 144 | die_1 = sample(die,20000,replace=TRUE,prob=rep(1/6,6)) 145 | die_2 = sample(die,20000,replace=TRUE,prob=rep(1/6,6)) 146 | outcomes = die_1+die_2 147 | mean(outcomes==7) 148 | ``` 149 | 150 | ```{r} 151 | # What is the probability of rolling a 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13? 152 | sapply(2:13,function(x) mean(outcomes==x)) 153 | #?sapply 154 | ``` 155 | # check for sapply() 156 | sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f). 157 | 158 | 159 | ```{r} 160 | # Use the sample() function to determine the probability of rolling a 7 using three fair six-sided dice.
161 | set.seed(1) 162 | die = 1:6 163 | die1 = sample(die,10000,replace=TRUE, prob = NULL) 164 | die2 = sample(die,10000,replace=TRUE, prob = NULL) 165 | die3 = sample(die,10000,replace = TRUE, prob= NULL) 166 | outcomes2 = die1+die2+die3 167 | mean(outcomes2==7) 168 | ``` 169 | 170 | ```{r} 171 | # Use the sample() function to determine the probability of rolling a 7 using three fair six-sided dice. 172 | set.seed(1) 173 | die=1:6 174 | die_1 = sample(die, 20000,replace=TRUE,prob=NULL)# by default, equal probability 175 | die_2 = sample(die,20000,replace=TRUE, prob=NULL) 176 | die_3 = sample(die,20000,replace=TRUE, prob=NULL) 177 | die_combn = die_1+die_2+die_3 178 | mean(die_combn==7) 179 | ``` 180 | 181 | ```{r} 182 | # Using three dice, what is the probability of rolling a 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13? 183 | sapply(2:13,function(x) mean(die_combn==x)) 184 | ``` 185 | 186 | # example 2 187 | ```{r} 188 | # draw a sample of size 10k with replacement from 1:n and calculate how many observations are included and how many are not included 189 | set.seed(1) 190 | n=10000 191 | 192 | included_obs = length(unique(sample(1:n, replace = TRUE))) 193 | included_obs 194 | 195 | missing_obs = n-included_obs 196 | missing_obs 197 | ``` 198 | 199 | ```{r} 200 | # from 1:100, draw a sample of size 100 with replacement and calculate the share of observations that are included and the share that are not
201 | set.seed(1) 202 | n = 100 203 | included_observations = length(unique(sample(1:n,replace=TRUE, prob=NULL))) 204 | included_observations/n 205 | 206 | (n-included_observations)/n 207 | ``` 208 | 209 | # example 3 210 | ```{r} 211 | # three ways of generating an m*n matrix with randomly assigned 0/1 212 | 213 | #3.1 for loop 214 | # create an empty matrix 215 | m <- 10 216 | n <- 10 217 | m00 <- matrix(0,m,n) 218 | 219 | for (i in 1:m) { 220 | for (j in 1:n) { 221 | m00[i,j] <-sample(c(0,1),1) 222 | } 223 | } 224 | 225 | m00 226 | ``` 227 | 228 | ```{r} 229 | system.time(for (i in 1:m) { 230 | for (j in 1:n) { 231 | m00[i,j] <-sample(c(0,1),1) 232 | } 233 | } 234 | ) 235 | ``` 236 | 237 | 238 | 239 | ```{r} 240 | #3.2 apply() function 241 | m <-10 242 | n<-10 243 | 244 | m0 <- matrix(0,m,n) 245 | 246 | apply(m0,c(1,2),function(x) sample(c(0,1),1)) 247 | 248 | system.time(apply(m0,c(1,2),function(x) sample(c(0,1),1))) 249 | ?apply 250 | ``` 251 | apply(): 252 | - Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. 253 | - apply(X, MARGIN, FUN, ...) 254 | - X: an array, or matrix 255 | - MARGIN: a vector giving the subscripts which the function will be applied over. For a matrix, 1 indicates rows, and 2 indicates columns. c(1,2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names 256 | - FUN: the function to be applied.
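
To make the margin argument (MARGIN in R) concrete, here is a minimal sketch on a small matrix. The object name m_demo is a hypothetical example and not part of the original walkthrough; the calls are base R only.

```{r}
# m_demo is an illustrative 2 x 3 matrix (an assumption, not from the original notes)
m_demo <- matrix(1:6, nrow = 2)

# MARGIN = 1: apply the function to each row
apply(m_demo, 1, sum)

# MARGIN = 2: apply the function to each column
apply(m_demo, 2, sum)

# MARGIN = c(1, 2): apply the function to every single cell,
# which is how the 0/1 matrix above is filled element by element
apply(m_demo, c(1, 2), function(x) x * 10)
```

With MARGIN = c(1, 2) the function receives one element at a time, so the result keeps the shape of the input matrix.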
257 | 258 | ```{r} 259 | #3.3 some other methods 260 | #1 generate a bunch of uniformly distributed [0,1) random numbers; round them to the closest integer (m and n are the matrix dimensions set above) 261 | m1<-round(matrix(runif(m*n),m,n)) 262 | #m1 263 | system.time(m1<-round(matrix(runif(m*n),m,n))) 264 | ``` 265 | 266 | ```{r} 267 | # generate m*n random numbers following a binomial distribution; 268 | # allows for probabilities other than the 0.5 used in m1 269 | m2 <- matrix(rbinom(m*n,1,0.5),m,n) 270 | #m2 271 | ``` 272 | 273 | ```{r} 274 | system.time(m3<-matrix(round(runif(m*n)),m,n)) 275 | ``` 276 | 277 | ```{r} 278 | m4<-matrix(sample(0:1,m*n,replace=TRUE),m,n) 279 | #m4 280 | ``` 281 | 282 | # example 4 283 | Flip a coin 10 times and repeat the process 10,000 times. Show the distribution of the number of heads. 284 | ```{r} 285 | # create an empty vector 286 | total_heads = c() 287 | 288 | # use a for loop to simulate flipping a coin 10 times; repeat it 10,000 times. 289 | for (i in 1:10000){ 290 | sum_heads = sum(round(runif(10,0,1))) 291 | total_heads = c(total_heads, sum_heads) 292 | } 293 | 294 | hist(total_heads) 295 | ``` 296 | 297 | Reference: 298 | 1. https://bookdown.org/rdpeng/rprogdatascience/simulation.html#random-sampling 299 | 2. R documentation of Sampling 300 | 3. https://crumplab.github.io/programmingforpsych/simulating-and-analyzing-data-in-r.html 301 | -------------------------------------------------------------------------------- /2. Statistical Simulation in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 1. Uniform Distribution" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Question 1\n", 15 | "- In R or Python, please answer the following question.
\n", 16 | "- Condition 1: For a sequence of numbers, (a1,a2,a3,a4,...,an), please write a function that randomly returns each element, ai, with probability ai/∑ai. \n", 17 | "- Condition 2: For example, for a sequence (1,2,3,4), the function returns element 1 with a probability 1/10, and 4 with a probability 4/10.\n", 18 | "- Condition 3: You can use any library, but no random.choice()." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 63, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import numpy as np \n", 28 | "\n", 29 | "def weight_func(sequence):\n", 30 | "\n", 31 | "    prob = [] # empty list to store probabilities of each element\n", 32 | "\n", 33 | "    cum_prob = [] # empty list to store cumulative probabilities of each element \n", 34 | " \n", 35 | "    total_sum = sum(sequence) # total sum of the sequence\n", 36 | " \n", 37 | "    uniform = np.random.uniform(0,1) # a random value between 0 and 1 \n", 38 | " \n", 39 | "    for i in range(len(sequence)): # create iterations for the sequence\n", 40 | " \n", 41 | "        prob.append(sequence[i]/total_sum) # append the weighted probability (ai/∑ai) to the list, prob\n", 42 | " \n", 43 | "        cum_prob.append(sum(prob)) # append the cumulative sum to the list, cum_prob \n", 44 | " \n", 45 | "        if uniform < cum_prob[i]: # if the cumulative sum > uniform (the value generated from the uniform distribution)\n", 46 | "            break # end the for loop \n", 47 | " \n", 48 | "    return sequence[i] # and return the value at position i."
49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 66, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "data": { 58 | "text/plain": [ 59 | "1" 60 | ] 61 | }, 62 | "execution_count": 66, 63 | "metadata": {}, 64 | "output_type": "execute_result" 65 | } 66 | ], 67 | "source": [ 68 | "weight_func([1,4,3,2]) # test case" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "---" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "# Question 2: Binomial Distribution" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "An online shopping website (e.g., Amazon, Alibaba, etc.) wants to test two versions of a banner that will appear at the top of the website. The engineering team sets the probability of a visitor seeing version A at 0.6 and version B at 0.4. \n", 90 | "\n", 91 | "After 10,000 visits, 6050 visitors have been exposed to version A and 3950 to version B. \n", 92 | "\n", 93 | "What is the probability of seeing more than 6050 version-A exposures when the randomization process is correct? \n", 94 | "\n", 95 | "In other words, when the probability for version A is indeed 0.6."
96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 68, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "data": { 105 | "text/plain": [ 106 | "0.1498" 107 | ] 108 | }, 109 | "execution_count": 68, 110 | "metadata": {}, 111 | "output_type": "execute_result" 112 | } 113 | ], 114 | "source": [ 115 | "import numpy as np\n", 116 | "\n", 117 | "np.random.seed(123)\n", 118 | "\n", 119 | "def some_funct(number_trials, probability_A):\n", 120 | " \n", 121 | "    binomial_dist = np.random.binomial(n = number_trials,p = probability_A,size=10000) # draw 10000 binomial counts, each with n = number_trials and p = probability_A\n", 122 | " \n", 123 | "    count = 0 # initialize count\n", 124 | " \n", 125 | "    for value in binomial_dist: # iterate over binomial_dist\n", 126 | " \n", 127 | "        if value > 6050: # if value>6050, count+1\n", 128 | " \n", 129 | "            count += 1\n", 130 | " \n", 131 | "    return count/len(binomial_dist) # share of simulated counts that are larger than 6050\n", 132 | "\n", 133 | "some_funct(number_trials=10000, probability_A = 0.6)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "---" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "# Question 3: Poisson Distribution" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "My Medium blog has 500 visits per day on average, and the number of visits follows a Poisson distribution. Out of 1000 simulated days, on what fraction would there be more than 510 visits? 
" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 165, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "import numpy as np \n", 164 | "\n", 165 | "np.random.seed(123)\n", 166 | "\n", 167 | "def Poisson(value_1,value_2): # two arguments\n", 168 | " \n", 169 | " count = 0 # initialize the counter\n", 170 | "\n", 171 | " poisson = np.random.poisson(lam=value_1,size = value_2) # a poisson distribution\n", 172 | "\n", 173 | " for i in poisson: # iteration\n", 174 | " \n", 175 | " if i > 510: # if clause to count numbers\n", 176 | " \n", 177 | " count+=1 \n", 178 | " \n", 179 | " return(count/value_2) # return the value" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 166, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "0.318" 191 | ] 192 | }, 193 | "execution_count": 166, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "Poisson(500,1000)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "---" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "# Question 4: Normal Distribution\n", 214 | "Write a function to generate X samples from a normal distribution and plot the histogram." 
215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 21, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "import numpy as np \n", 224 | "\n", 225 | "np.random.seed(123)\n", 226 | "\n", 227 | "def normal_func(X):\n", 228 | " \n", 229 | "    norm_dist = np.random.normal(loc=10,scale=2,size=100)\n", 230 | " \n", 231 | "    result = np.random.choice(norm_dist, X)\n", 232 | " \n", 233 | "    return(result)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 22, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "data": { 243 | "text/plain": [ 244 | "array([ 7.27305691, 6.98741057, 14.37357218, 14.17422672, 7.57495374,\n", 245 | " 10.39904815, 7.27305691, 7.41182935, 10.565957 , 12.0081078 ])" 246 | ] 247 | }, 248 | "execution_count": 22, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "X_samples = normal_func(10)\n", 255 | "X_samples" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 23, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "text/plain": [ 266 | "(array([5., 0., 0., 0., 2., 0., 1., 0., 0., 2.]),\n", 267 | " array([ 6.98741057, 7.72602673, 8.46464289, 9.20325905, 9.94187521,\n", 268 | " 10.68049138, 11.41910754, 12.1577237 , 12.89633986, 13.63495602,\n", 269 | " 14.37357218]),\n", 270 | " )" 271 | ] 272 | }, 273 | "execution_count": 23, 274 | "metadata": {}, 275 | "output_type": "execute_result" 276 | }, 277 | { 278 | "data": { 279 | "image/png": "[base64-encoded PNG omitted: histogram of X_samples]", 280 | "text/plain": [ 281 | "
" 282 | ] 283 | }, 284 | "metadata": { 285 | "needs_background": "light" 286 | }, 287 | "output_type": "display_data" 288 | } 289 | ], 290 | "source": [ 291 | "import matplotlib.pyplot as plt\n", 292 | "\n", 293 | "plt.hist(X_samples)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "---" 301 | ] 302 | } 303 | ], 304 | "metadata": { 305 | "kernelspec": { 306 | "display_name": "Python 3", 307 | "language": "python", 308 | "name": "python3" 309 | }, 310 | "language_info": { 311 | "codemirror_mode": { 312 | "name": "ipython", 313 | "version": 3 314 | }, 315 | "file_extension": ".py", 316 | "mimetype": "text/x-python", 317 | "name": "python", 318 | "nbconvert_exporter": "python", 319 | "pygments_lexer": "ipython3", 320 | "version": "3.7.4" 321 | } 322 | }, 323 | "nbformat": 4, 324 | "nbformat_minor": 2 325 | } 326 | -------------------------------------------------------------------------------- /3. A Practical Guide To AB Tests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Part 1: Power Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stdout", 17 | "output_type": "stream", 18 | "text": [ 19 | "Sample Size: 1571.000\n" 20 | ] 21 | } 22 | ], 23 | "source": [ 24 | "from statsmodels.stats.power import TTestIndPower\n", 25 | "\n", 26 | "# parameters for power analysis \n", 27 | "# effect: standardized effect size, difference between the two means divided by the standard deviation.\n", 28 | "# effect_size has to be positive.\n", 29 | "\n", 30 | "effect = 0.1\n", 31 | "alpha = 0.05\n", 32 | "power = 0.8\n", 33 | "\n", 34 | "# perform power analysis \n", 35 | "analysis = TTestIndPower()\n", 36 | "result = analysis.solve_power(effect, power = power,nobs1= None, ratio = 1.0, alpha = alpha)\n", 
37 | "print('Sample Size: %.3f' % round(result))" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "# Part 2: Data Generation Process Through Statistical Simulation\n", 45 | "Variables to be simulated: \n", 46 | "1. userid\n", 47 | "2. version\n", 48 | "3. minutes of play \n", 49 | "4. user engagement after 1 day (metric_1)\n", 50 | "5. user engagement after 7 days (metric_2)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# variable 1: userid\n", 60 | "user_id_control = list(range(1,1601))# 1600 control\n", 61 | "user_id_treatment = list(range(1601,3350))# 1749 treated" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "----" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 4, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "#variable 2: version \n", 78 | "import numpy as np\n", 79 | "control_status = ['control']*1600\n", 80 | "treatment_status = ['treatment']*1749 " 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "----" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 5, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# variable 3: minutes of play, which follows a normal distribution with a μ of 30 minutes and a standard deviation σ of 10 (central limit theorem)\n", 97 | "\n", 98 | "# for control group\n", 99 | "\n", 100 | "μ_1 = 30\n", 101 | "\n", 102 | "σ_squared_1 = 10 # passed as scale below, so this is the standard deviation σ, not the variance\n", 103 | "\n", 104 | "np.random.seed(123)\n", 105 | "\n", 106 | "minutes_control = np.random.normal(loc = μ_1, scale = σ_squared_1, size = 1600)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 6, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "# for treatment group, which increases the user engagement by \n", 116 | "# according to the formula (μ_2 - μ_1)/σ = 0.1 with σ = 10, we obtain μ_2 = 31\n", 117 | "\n", 118 | "μ_2 = 31\n", 119 | "\n", 120 | "σ_squared_2 = 10 # passed as scale below, so this is the standard deviation σ, not the variance\n", 121 | "\n", 122 | "np.random.seed(123)\n", 123 | "\n", 124 | "minutes_treat = np.random.normal(loc = μ_2, scale = σ_squared_2, size = 1749)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "----" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 7, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "# variable 4: user engagement after 1 day (metric_1)\n", 141 | "# after day 1, treatment performs better than control --> to simulate the novelty effect\n", 142 | "\n", 143 | "Active_status = [True,False]\n", 144 | "\n", 145 | "# control \n", 146 | "day_1_control = np.random.choice(Active_status, 1600, p=[0.3,0.7])\n", 147 | "\n", 148 | "# treatment\n", 149 | "day_1_treatment = np.random.choice(Active_status, 1749, p=[0.35,0.65])" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "----" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 8, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# variable 5: user engagement after 7 days (metric_2)\n", 166 | "# after day 7, control > treatment --> the novelty effect diminishes and performance is reversed\n", 167 | "\n", 168 | "Active_status = [True,False]\n", 169 | "\n", 170 | "# control \n", 171 | "day_7_control = np.random.choice(Active_status, 1600, p=[0.35,0.65])\n", 172 | "\n", 173 | "# treatment\n", 174 | "day_7_treatment = np.random.choice(Active_status, 1749, p=[0.25,0.75])" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "----" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "# construct the control group " 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "1. 
user_id_control; \n", 196 | "2. control_status\n", 197 | "3. minutes_control\n", 198 | "4. day_1_control\n", 199 | "5. day_7_control" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 10, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "text/html": [ 210 | "
" 328 | ], 329 | "text/plain": [ 330 | " user_id version minutes_play day_1_active day_7_active\n", 331 | "0 1 control 19.143694 False False\n", 332 | "1 2 control 39.973454 False False\n", 333 | "2 3 control 32.829785 False False\n", 334 | "3 4 control 14.937053 False False\n", 335 | "4 5 control 24.213997 True True\n", 336 | "... ... ... ... ... ...\n", 337 | "1595 1596 control 27.154466 True True\n", 338 | "1596 1597 control 46.414042 True True\n", 339 | "1597 1598 control 41.523560 False False\n", 340 | "1598 1599 control 23.981909 False False\n", 341 | "1599 1600 control 17.843379 False False\n", 342 | "\n", 343 | "[1600 rows x 5 columns]" 344 | ] 345 | }, 346 | "execution_count": 10, 347 | "metadata": {}, 348 | "output_type": "execute_result" 349 | } 350 | ], 351 | "source": [ 352 | "# control data\n", 353 | "import pandas as pd\n", 354 | "raw_control = {'user_id':user_id_control,\n", 355 | " 'version':control_status,\n", 356 | " 'minutes_play':minutes_control,\n", 357 | " 'day_1_active':day_1_control,\n", 358 | " 'day_7_active':day_1_control\n", 359 | " }\n", 360 | "\n", 361 | "control_group = pd.DataFrame(data = raw_control)\n", 362 | "control_group" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "1. user_id_treatment\n", 370 | "2. treatment_status \n", 371 | "3. minutes_treat\n", 372 | "4. day_1_treatment\n", 373 | "5. day_7_treatment" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 11, 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "data": { 383 | "text/html": [ 384 | "
" 502 | ], 503 | "text/plain": [ 504 | " user_id version minutes_play day_1_active day_7_active\n", 505 | "0 1601 treatment 25.515817 False False\n", 506 | "1 1602 treatment 41.653829 False False\n", 507 | "2 1603 treatment 25.760776 False False\n", 508 | "3 1604 treatment 36.459880 True False\n", 509 | "4 1605 treatment 25.341199 True False\n", 510 | "... ... ... ... ... ...\n", 511 | "1744 3345 treatment 24.098838 True False\n", 512 | "1745 3346 treatment 29.683718 True False\n", 513 | "1746 3347 treatment 34.013900 True False\n", 514 | "1747 3348 treatment 44.909702 False False\n", 515 | "1748 3349 treatment 51.464873 False False\n", 516 | "\n", 517 | "[1749 rows x 5 columns]" 518 | ] 519 | }, 520 | "execution_count": 11, 521 | "metadata": {}, 522 | "output_type": "execute_result" 523 | } 524 | ], 525 | "source": [ 526 | "# treatment data \n", 527 | "raw_treatment = {'user_id':user_id_treatment,\n", 528 | " 'version':treatment_status,\n", 529 | " 'minutes_play':minutes_treat,\n", 530 | " 'day_1_active':day_1_treatment,\n", 531 | " 'day_7_active':day_7_treatment\n", 532 | " }\n", 533 | "\n", 534 | "treatment_group = pd.DataFrame(data = raw_treatment)\n", 535 | "treatment_group" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 12, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "# combine these two datasets\n", 545 | "two_datasets = control_group.append(treatment_group)\n", 546 | "\n", 547 | "# randomize the orders using df.sample(frac=1)\n", 548 | "# The frac keyword argument: specifies the fraction of rows to return in the random sample\n", 549 | "# so frac=1 means return all rows (in random order).\n", 550 | "final_data = two_datasets.sample(frac=1)" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": 13, 556 | "metadata": {}, 557 | "outputs": [ 558 | { 559 | "name": "stdout", 560 | "output_type": "stream", 561 | "text": [ 562 | "\n", 563 | "Int64Index: 3349 entries, 1558 to 345\n", 564 | 
"Data columns (total 5 columns):\n", 565 | "user_id 3349 non-null int64\n", 566 | "version 3349 non-null object\n", 567 | "minutes_play 3349 non-null float64\n", 568 | "day_1_active 3349 non-null bool\n", 569 | "day_7_active 3349 non-null bool\n", 570 | "dtypes: bool(2), float64(1), int64(1), object(1)\n", 571 | "memory usage: 111.2+ KB\n" 572 | ] 573 | } 574 | ], 575 | "source": [ 576 | "final_data.info()" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 29, 582 | "metadata": {}, 583 | "outputs": [ 584 | { 585 | "data": { 586 | "text/html": [ 587 | "
" 662 | ], 663 | "text/plain": [ 664 | " user_id version minutes_play day_1_active day_7_active \\\n", 665 | "1558 3159 treatment 26.156152 False True \n", 666 | "1223 1224 control 24.313143 False False \n", 667 | "392 393 control 29.013153 True True \n", 668 | "808 809 control 27.797075 True True \n", 669 | "485 2086 treatment 32.152395 False False \n", 670 | "\n", 671 | " minutes_play_integers \n", 672 | "1558 26.0 \n", 673 | "1223 24.0 \n", 674 | "392 29.0 \n", 675 | "808 28.0 \n", 676 | "485 32.0 " 677 | ] 678 | }, 679 | "execution_count": 29, 680 | "metadata": {}, 681 | "output_type": "execute_result" 682 | } 683 | ], 684 | "source": [ 685 | "final_data.head()" 686 | ] 687 | }, 688 | { 689 | "cell_type": "markdown", 690 | "metadata": {}, 691 | "source": [ 692 | "# Part 3: After-Test Data Analysis" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "### 3.1 Count the Number of Users in Each Version " 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": 14, 705 | "metadata": {}, 706 | "outputs": [ 707 | { 708 | "data": { 709 | "text/plain": [ 710 | "version\n", 711 | "control 1600\n", 712 | "treatment 1749\n", 713 | "Name: user_id, dtype: int64" 714 | ] 715 | }, 716 | "execution_count": 14, 717 | "metadata": {}, 718 | "output_type": "execute_result" 719 | } 720 | ], 721 | "source": [ 722 | "# calculate the number of users in each version\n", 723 | "final_data.groupby('version')['user_id'].count()" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": {}, 729 | "source": [ 730 | "# the assignment process looks suspicious as more people assigned to the treatment than the control. 
" 731 | ] 732 | }, 733 | { 734 | "cell_type": "markdown", 735 | "metadata": {}, 736 | "source": [ 737 | "### 3.2 Formally Test for Sample Ratio Mismatch\n", 738 | "- Chi-Square test" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": 30, 744 | "metadata": {}, 745 | "outputs": [ 746 | { 747 | "data": { 748 | "text/plain": [ 749 | "Power_divergenceResult(statistic=6.627462686567164, pvalue=0.010041820594939122)" 750 | ] 751 | }, 752 | "execution_count": 30, 753 | "metadata": {}, 754 | "output_type": "execute_result" 755 | } 756 | ], 757 | "source": [ 758 | "from scipy.stats import chisquare \n", 759 | "chisquare([1600,1749],f_exp = [1675,1675])" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": {}, 765 | "source": [ 766 | "Typically, we set the alpha level at 0.001 to test Sample Ratio Mismatch. Since the p value is 0.01, we have to reject the null hypothesis and conclude no evidence of SRM.\n", 767 | "In other words, the treatment assignment works as expected." 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 31, 773 | "metadata": {}, 774 | "outputs": [ 775 | { 776 | "data": { 777 | "text/html": [ 778 | "
" 853 | ], 854 | "text/plain": [ 855 | " user_id version minutes_play day_1_active day_7_active \\\n", 856 | "1558 3159 treatment 26.156152 False True \n", 857 | "1223 1224 control 24.313143 False False \n", 858 | "392 393 control 29.013153 True True \n", 859 | "808 809 control 27.797075 True True \n", 860 | "485 2086 treatment 32.152395 False False \n", 861 | "\n", 862 | " minutes_play_integers \n", 863 | "1558 26.0 \n", 864 | "1223 24.0 \n", 865 | "392 29.0 \n", 866 | "808 28.0 \n", 867 | "485 32.0 " 868 | ] 869 | }, 870 | "execution_count": 31, 871 | "metadata": {}, 872 | "output_type": "execute_result" 873 | } 874 | ], 875 | "source": [ 876 | "final_data.head()" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "### 3.3 Plot the Distribution of Video Played for Each Group " 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": 39, 889 | "metadata": {}, 890 | "outputs": [ 891 | { 892 | "data": { 893 | "text/plain": [ 894 | "Text(0, 0.5, 'User Count')" 895 | ] 896 | }, 897 | "execution_count": 39, 898 | "metadata": {}, 899 | "output_type": "execute_result" 900 | }, 901 | { 902 | "data": { 903 | "image/png": "\n", 904 | "text/plain": [ 905 | "
" 906 | ] 907 | }, 908 | "metadata": { 909 | "needs_background": "light" 910 | }, 911 | "output_type": "display_data" 912 | } 913 | ], 914 | "source": [ 915 | "%matplotlib inline\n", 916 | "\n", 917 | "final_data['minutes_play_integers'] = round(final_data['minutes_play'])\n", 918 | "plot_df = final_data.groupby('minutes_play_integers')['user_id'].count()\n", 919 | "\n", 920 | "# Plot the distribution of players that played 0 to 50 minutes\n", 921 | "ax = plot_df.head(n=50).plot(x=\"minutes_play_integers\", y=\"user_id\", kind=\"hist\")\n", 922 | "ax.set_xlabel(\"Duration of Video Played in Minutes\")\n", 923 | "ax.set_ylabel(\"User Count\")" 924 | ] 925 | }, 926 | { 927 | "cell_type": "markdown", 928 | "metadata": {}, 929 | "source": [ 930 | "# Metric 1: 1-day retention by AB-Group" 931 | ] 932 | }, 933 | { 934 | "cell_type": "code", 935 | "execution_count": 35, 936 | "metadata": {}, 937 | "outputs": [ 938 | { 939 | "data": { 940 | "text/plain": [ 941 | "0.3248730964467005" 942 | ] 943 | }, 944 | "execution_count": 35, 945 | "metadata": {}, 946 | "output_type": "execute_result" 947 | } 948 | ], 949 | "source": [ 950 | "# 1-day retention\n", 951 | "final_data['day_1_active'].mean()" 952 | ] 953 | }, 954 | { 955 | "cell_type": "code", 956 | "execution_count": 36, 957 | "metadata": {}, 958 | "outputs": [ 959 | { 960 | "data": { 961 | "text/plain": [ 962 | "version\n", 963 | "control 0.296875\n", 964 | "treatment 0.350486\n", 965 | "Name: day_1_active, dtype: float64" 966 | ] 967 | }, 968 | "execution_count": 36, 969 | "metadata": {}, 970 | "output_type": "execute_result" 971 | } 972 | ], 973 | "source": [ 974 | "# 1-day retention by group\n", 975 | "final_data.groupby('version')['day_1_active'].mean()" 976 | ] 977 | }, 978 | { 979 | "cell_type": "markdown", 980 | "metadata": {}, 981 | "source": [ 982 | "# Some interesting questions:\n", 983 | "1. Treatment has a higher retention rate (0.35) than the control (0.29). Is the difference significant?\n", 984 | "2. 
To what extent can we trust the result?\n", 985 | "3. What is the variability of the difference? \n", 986 | "4. In other words, how often would we obtain more extreme values if we repeated the process 100 times?\n", 987 | "--> Solution: bootstrap (resample with replacement over many replications) and check the variability " 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": 20, 993 | "metadata": {}, 994 | "outputs": [ 995 | { 996 | "data": { 997 | "text/plain": [ 998 | "" 999 | ] 1000 | }, 1001 | "execution_count": 20, 1002 | "metadata": {}, 1003 | "output_type": "execute_result" 1004 | }, 1005 | { 1006 | "data": { 1007 | "image/png": "\n", 1008 | "text/plain": [ 1009 | "
" 1010 | ] 1011 | }, 1012 | "metadata": { 1013 | "needs_background": "light" 1014 | }, 1015 | "output_type": "display_data" 1016 | } 1017 | ], 1018 | "source": [ 1019 | "# solution: bootstrap\n", 1020 | "boot_means = []\n", 1021 | "\n", 1022 | "# run the simulation for 10k times \n", 1023 | "for i in range(10000):\n", 1024 | " #frac=1 means randomize the order of all rows \n", 1025 | " boot_sample = final_data.sample(frac=1,replace=True).groupby('version')['day_1_active'].mean()\n", 1026 | " boot_means.append(boot_sample)\n", 1027 | "\n", 1028 | "# a Pandas DataFrame\n", 1029 | "boot_means = pd.DataFrame(boot_means)\n", 1030 | "\n", 1031 | "# kernel density estimate\n", 1032 | "boot_means.plot(kind = 'kde')" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": 22, 1038 | "metadata": {}, 1039 | "outputs": [], 1040 | "source": [ 1041 | "# create a new column, diff, which is the difference between the two variants, scaled by the control group\n", 1042 | "boot_means['diff'] = (boot_means['treatment'] - boot_means['control'])/boot_means['control']*100" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "code", 1047 | "execution_count": 23, 1048 | "metadata": {}, 1049 | "outputs": [ 1050 | { 1051 | "data": { 1052 | "text/plain": [ 1053 | "day_1_active 12.603674\n", 1054 | "day_1_active 20.623621\n", 1055 | "day_1_active 17.682652\n", 1056 | "day_1_active 20.093840\n", 1057 | "day_1_active 15.809040\n", 1058 | " ... 
\n", 1059 | "day_1_active 13.919948\n", 1060 | "day_1_active 17.328947\n", 1061 | "day_1_active 13.024745\n", 1062 | "day_1_active 22.401725\n", 1063 | "day_1_active 19.605505\n", 1064 | "Name: diff, Length: 10000, dtype: float64" 1065 | ] 1066 | }, 1067 | "execution_count": 23, 1068 | "metadata": {}, 1069 | "output_type": "execute_result" 1070 | } 1071 | ], 1072 | "source": [ 1073 | "boot_means['diff']" 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "code", 1078 | "execution_count": 24, 1079 | "metadata": {}, 1080 | "outputs": [ 1081 | { 1082 | "data": { 1083 | "text/plain": [ 1084 | "Text(0.5, 0, '% diff in means')" 1085 | ] 1086 | }, 1087 | "execution_count": 24, 1088 | "metadata": {}, 1089 | "output_type": "execute_result" 1090 | }, 1091 | { 1092 | "data": { 1093 | "image/png": "\n", 1094 | "text/plain": [ 1095 | "
" 1096 | ] 1097 | }, 1098 | "metadata": { 1099 | "needs_background": "light" 1100 | }, 1101 | "output_type": "display_data" 1102 | } 1103 | ], 1104 | "source": [ 1105 | "# plot the bootstrap sample difference \n", 1106 | "ax = boot_means['diff'].plot(kind = 'kde')\n", 1107 | "ax.set_xlabel(\"% diff in means\")" 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": 25, 1113 | "metadata": {}, 1114 | "outputs": [ 1115 | { 1116 | "data": { 1117 | "text/html": [ 1118 | "
" 1212 | ], 1213 | "text/plain": [ 1214 | "version control treatment diff\n", 1215 | "day_1_active 0.317058 0.357019 12.603674\n", 1216 | "day_1_active 0.291641 0.351788 20.623621\n", 1217 | "day_1_active 0.287832 0.338728 17.682652\n", 1218 | "day_1_active 0.268777 0.322785 20.093840\n", 1219 | "day_1_active 0.307161 0.355720 15.809040\n", 1220 | "... ... ... ...\n", 1221 | "day_1_active 0.299690 0.341407 13.919948\n", 1222 | "day_1_active 0.294154 0.345128 17.328947\n", 1223 | "day_1_active 0.314162 0.355081 13.024745\n", 1224 | "day_1_active 0.292459 0.357974 22.401725\n", 1225 | "day_1_active 0.293719 0.351304 19.605505\n", 1226 | "\n", 1227 | "[9996 rows x 3 columns]" 1228 | ] 1229 | }, 1230 | "execution_count": 25, 1231 | "metadata": {}, 1232 | "output_type": "execute_result" 1233 | } 1234 | ], 1235 | "source": [ 1236 | "boot_means[boot_means['diff'] > 0]" 1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": 26, 1242 | "metadata": {}, 1243 | "outputs": [ 1244 | { 1245 | "data": { 1246 | "text/plain": [ 1247 | "0.9996" 1248 | ] 1249 | }, 1250 | "execution_count": 26, 1251 | "metadata": {}, 1252 | "output_type": "execute_result" 1253 | } 1254 | ], 1255 | "source": [ 1256 | "# p value \n", 1257 | "p = (boot_means['diff'] >0).sum()/len(boot_means)\n", 1258 | "p" 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "markdown", 1263 | "metadata": {}, 1264 | "source": [ 1265 | "# Conclusion 1: treatment has a better performance than the control on 1-day user retention 99.96% of the time." 
1266 | ] 1267 | }, 1268 | { 1269 | "cell_type": "markdown", 1270 | "metadata": {}, 1271 | "source": [ 1272 | "# Metric 2: 7-day retention by AB-Group" 1273 | ] 1274 | }, 1275 | { 1276 | "cell_type": "code", 1277 | "execution_count": 37, 1278 | "metadata": {}, 1279 | "outputs": [], 1280 | "source": [ 1281 | "boot_7d = []\n", 1282 | "\n", 1283 | "for i in range(10000):\n", 1284 | " # frac=1 with replace=True --> bootstrap sample with as many rows as the data\n", 1285 | " boot_mean = final_data.sample(frac=1,replace=True).groupby('version')['day_7_active'].mean() \n", 1286 | " boot_7d.append(boot_mean)\n", 1287 | " \n", 1288 | "boot_7d = pd.DataFrame(boot_7d)\n", 1289 | "\n", 1290 | "boot_7d['diff'] = (boot_7d['treatment'] - boot_7d['control'])/boot_7d['control'] *100" 1291 | ] 1292 | }, 1293 | { 1294 | "cell_type": "code", 1295 | "execution_count": 40, 1296 | "metadata": {}, 1297 | "outputs": [ 1298 | { 1299 | "data": { 1300 | "text/plain": [ 1301 | "Text(0.5, 0, '% diff in means')" 1302 | ] 1303 | }, 1304 | "execution_count": 40, 1305 | "metadata": {}, 1306 | "output_type": "execute_result" 1307 | }, 1308 | { 1309 | "data": { 1310 | "image/png": "\n", 1311 | "text/plain": [ 1312 | "
" 1313 | ] 1314 | }, 1315 | "metadata": { 1316 | "needs_background": "light" 1317 | }, 1318 | "output_type": "display_data" 1319 | } 1320 | ], 1321 | "source": [ 1322 | "# Ploting the bootstrap % difference\n", 1323 | "ax = boot_7d['diff'].plot(kind = 'kde')\n", 1324 | "ax.set_xlabel(\"% diff in means\")" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "code", 1329 | "execution_count": 41, 1330 | "metadata": {}, 1331 | "outputs": [ 1332 | { 1333 | "data": { 1334 | "text/plain": [ 1335 | "0.9983" 1336 | ] 1337 | }, 1338 | "execution_count": 41, 1339 | "metadata": {}, 1340 | "output_type": "execute_result" 1341 | } 1342 | ], 1343 | "source": [ 1344 | "# Calculating the probability that 7-day retention is greater when the gate is at level 30\n", 1345 | "p = (boot_7d['diff']>0).sum()/len(boot_7d)\n", 1346 | "\n", 1347 | "1-p" 1348 | ] 1349 | }, 1350 | { 1351 | "cell_type": "markdown", 1352 | "metadata": {}, 1353 | "source": [ 1354 | "# Conclusion 2: control has a better performance than the treatment on 7-day user retention 99.89% of the time." 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "markdown", 1359 | "metadata": {}, 1360 | "source": [ 1361 | "## After double-checking with the true parameters, these two estimates are unbiased estimates." 1362 | ] 1363 | } 1364 | ], 1365 | "metadata": { 1366 | "kernelspec": { 1367 | "display_name": "Python 3", 1368 | "language": "python", 1369 | "name": "python3" 1370 | }, 1371 | "language_info": { 1372 | "codemirror_mode": { 1373 | "name": "ipython", 1374 | "version": 3 1375 | }, 1376 | "file_extension": ".py", 1377 | "mimetype": "text/x-python", 1378 | "name": "python", 1379 | "nbconvert_exporter": "python", 1380 | "pygments_lexer": "ipython3", 1381 | "version": "3.7.4" 1382 | } 1383 | }, 1384 | "nbformat": 4, 1385 | "nbformat_minor": 2 1386 | } 1387 | -------------------------------------------------------------------------------- /4. 
AA Test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Part 1: A/A tests for Normal Distribution" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "- Split the users into two groups and assign the same treatment to both groups \n", 15 | "- In repeated trials, a given metric should come out statistically significant (p < 0.05) about 5% of the time \n", 16 | "- Conduct t-tests to compute p-values; the p-values from repeated trials should follow a uniform distribution " 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "# 1. simulate the hashing process & single iteration" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 4, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "import numpy as np\n", 33 | "\n", 34 | "# 1: population\n", 35 | "np.random.seed(123)\n", 36 | "\n", 37 | "population = np.random.normal(loc = 100, scale = 5, size = 1000)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 5, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "Ttest_indResult(statistic=-0.5697163252872851, pvalue=0.568998328712083)" 49 | ] 50 | }, 51 | "execution_count": 5, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "import random\n", 58 | "\n", 59 | "A_1 = []\n", 60 | "\n", 61 | "A_2 = []\n", 62 | "\n", 63 | "for i in population:\n", 64 | " \n", 65 | " hash_val = random.random()\n", 66 | " \n", 67 | " if hash_val <= 0.5:\n", 68 | " A_2.append(i)\n", 69 | " \n", 70 | " else:\n", 71 | " A_1.append(i)\n", 72 | " \n", 73 | "# two-sample t-test \n", 74 | "from scipy import stats\n", 75 | "\n", 76 | "stats.ttest_ind(A_1,A_2)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "# 
2. run A/A 10,000 times & check False Positive Rate" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 2, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "import random\n", 93 | "from scipy import stats" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### solution 1: eyeball " 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 6, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "count_5_perc = 0 #False Positives\n", 110 | "count_10_perc = 0\n", 111 | "count_20_perc = 0\n", 112 | "count_30_perc = 0\n", 113 | "count_40_perc = 0\n", 114 | "count_50_perc = 0\n", 115 | "count_60_perc = 0\n", 116 | "count_70_perc = 0\n", 117 | "count_80_perc = 0\n", 118 | "count_90_perc = 0\n", 119 | "\n", 120 | "for i in range(10000):\n", 121 | " \n", 122 | " A_1 = []\n", 123 | " \n", 124 | " A_2 = []\n", 125 | " \n", 126 | " for j in population:\n", 127 | " \n", 128 | " hash_val = random.random()\n", 129 | " \n", 130 | " if hash_val <= 0.5:\n", 131 | " A_2.append(j)\n", 132 | " \n", 133 | " else:\n", 134 | " A_1.append(j)\n", 135 | " \n", 136 | " result = stats.ttest_ind(A_1,A_2)\n", 137 | " \n", 138 | " if result.pvalue <= 0.1:\n", 139 | " count_10_perc+=1\n", 140 | " \n", 141 | " elif 0.1 different distributions \n", 454 | "# big p-value --> same distributions\n", 455 | "import numpy as np\n", 456 | "import scipy\n", 457 | "\n", 458 | "\n", 459 | "dddd = np.random.normal(0,1,1000)\n", 460 | "kstest(dddd,'norm')" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 20, 466 | "metadata": {}, 467 | "outputs": [ 468 | { 469 | "data": { 470 | "text/plain": [ 471 | "KstestResult(statistic=0.4999999999999999, pvalue=0.06558641975308652)" 472 | ] 473 | }, 474 | "execution_count": 20, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "from scipy import stats\n", 481 | "x = np.linspace(-25, 
17, 6)\n", 482 | "stats.kstest(x, 'norm')" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 24, 488 | "metadata": {}, 489 | "outputs": [ 490 | { 491 | "data": { 492 | "text/plain": [ 493 | "KstestResult(statistic=0.03859901423041939, pvalue=0.09899489774451381)" 494 | ] 495 | }, 496 | "execution_count": 24, 497 | "metadata": {}, 498 | "output_type": "execute_result" 499 | } 500 | ], 501 | "source": [ 502 | "from scipy.stats import kstest\n", 503 | "import numpy as np\n", 504 | "\n", 505 | "x = np.random.normal(0,1,1000)\n", 506 | "test_stat = kstest(x, 'norm')\n", 507 | "test_stat" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "---" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "# split at 20%" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 21, 527 | "metadata": {}, 528 | "outputs": [], 529 | "source": [ 530 | "# solution 2: collect the p-value of every A/A run\n", 531 | "\n", 532 | "import random\n", 533 | "from scipy import stats\n", 534 | "\n", 535 | "p_values = []\n", 536 | "\n", 537 | "for i in range(10000):\n", 538 | " \n", 539 | " A_1 = []\n", 540 | " \n", 541 | " A_2 = []\n", 542 | " \n", 543 | " for j in population:\n", 544 | " \n", 545 | " hash_val = random.random()\n", 546 | " \n", 547 | " if hash_val <= 0.2:\n", 548 | " A_2.append(j)\n", 549 | " \n", 550 | " else:\n", 551 | " A_1.append(j)\n", 552 | " \n", 553 | " result = stats.ttest_ind(A_1,A_2)\n", 554 | " \n", 555 | " p_values.append(result.pvalue)" 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": 22, 561 | "metadata": {}, 562 | "outputs": [ 563 | { 564 | "data": { 565 | "text/plain": [ 566 | "KstestResult(statistic=0.009005414094706121, pvalue=0.39198360324923237)" 567 | ] 568 | }, 569 | "execution_count": 22, 570 | "metadata": {}, 571 | "output_type": "execute_result" 572 | } 573 | ], 574 | "source": [ 575 | "import scipy\n", 576 | 
"scipy.stats.kstest(p_values,\"uniform\")" 577 | ] 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": {}, 582 | "source": [ 583 | "----" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "metadata": {}, 589 | "source": [ 590 | "# Part 2: A/A tests for Left Skewed Distribution with Heavy Users" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 1, 596 | "metadata": { 597 | "scrolled": true 598 | }, 599 | "outputs": [ 600 | { 601 | "data": { 602 | "text/plain": [ 603 | "
" 604 | ] 605 | }, 606 | "metadata": {}, 607 | "output_type": "display_data" 608 | } 609 | ], 610 | "source": [ 611 | "# generate skewed distribution \n", 612 | "# source: https://stackoverflow.com/questions/24854965/create-random-numbers-with-left-skewed-probability-distribution\n", 613 | "\n", 614 | "from scipy.stats import skewnorm\n", 615 | "import matplotlib.pyplot as plt\n", 616 | "\n", 617 | "numValues = 10000\n", 618 | "maxValue = 1000\n", 619 | "skewness = -50 #Negative values are left skewed, positive values are right skewed.\n", 620 | "\n", 621 | "random_1 = skewnorm.rvs(a = skewness,loc=maxValue, size=numValues) #Skewnorm function\n", 622 | "\n", 623 | "#random_1 = random_1 - min(random_1) #Shift the set so the minimum value is equal to zero.\n", 624 | "#random_1 = random_1 / max(random_1) #Standadize all the vlues between 0 and 1. \n", 625 | "#random_1 = random_1 * maxValue #Multiply the standardized values by the maximum value.\n", 626 | "\n", 627 | "#Plot histogram to check skewness\n", 628 | "plt.hist(random_1,30,density=True, color = 'red', alpha=0.1)\n", 629 | "plt.show()" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 2, 635 | "metadata": {}, 636 | "outputs": [ 637 | { 638 | "data": { 639 | "text/plain": [ 640 | "array([999.9315308 , 999.00985561, 998.7830595 , ..., 999.56253601,\n", 641 | " 999.87125823, 997.82424896])" 642 | ] 643 | }, 644 | "execution_count": 2, 645 | "metadata": {}, 646 | "output_type": "execute_result" 647 | } 648 | ], 649 | "source": [ 650 | "random_1" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": 3, 656 | "metadata": { 657 | "scrolled": true 658 | }, 659 | "outputs": [ 660 | { 661 | "data": { 662 | "text/plain": [ 663 | "Ttest_indResult(statistic=0.7640304333604793, pvalue=0.44486713677102585)" 664 | ] 665 | }, 666 | "execution_count": 3, 667 | "metadata": {}, 668 | "output_type": "execute_result" 669 | } 670 | ], 671 | "source": [ 672 | "#2. 
simulate the hashing process & single iteration\n", 673 | "\n", 674 | "import random\n", 675 | "\n", 676 | "A_1 = []\n", 677 | "\n", 678 | "A_2 = []\n", 679 | "\n", 680 | "for i in random_1:\n", 681 | " \n", 682 | " hash_val = random.random()\n", 683 | " \n", 684 | " if hash_val <= 0.5:\n", 685 | " A_2.append(i)\n", 686 | " \n", 687 | " else:\n", 688 | " A_1.append(i)\n", 689 | " \n", 690 | "# two sample t test \n", 691 | "from scipy import stats\n", 692 | "\n", 693 | "stats.ttest_ind(A_1,A_2)" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": 4, 699 | "metadata": {}, 700 | "outputs": [], 701 | "source": [ 702 | "#2. run A/A 10,000 times & check False Positive Rate\n", 703 | "import random\n", 704 | "from scipy import stats" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": 5, 710 | "metadata": {}, 711 | "outputs": [], 712 | "source": [ 713 | "count_5_perc = 0 #False Positives\n", 714 | "count_10_perc = 0\n", 715 | "count_20_perc = 0\n", 716 | "count_30_perc = 0\n", 717 | "count_40_perc = 0\n", 718 | "count_50_perc = 0\n", 719 | "count_60_perc = 0\n", 720 | "count_70_perc = 0\n", 721 | "count_80_perc = 0\n", 722 | "count_90_perc = 0\n", 723 | "\n", 724 | "for i in range(10000):\n", 725 | " \n", 726 | " A_1 = []\n", 727 | " \n", 728 | " A_2 = []\n", 729 | " \n", 730 | " for j in random_1:\n", 731 | " \n", 732 | " hash_val = random.random()\n", 733 | " \n", 734 | " if hash_val <= 0.05:\n", 735 | " A_2.append(j)\n", 736 | " \n", 737 | " else:\n", 738 | " A_1.append(j)\n", 739 | " \n", 740 | " result = stats.ttest_ind(A_1,A_2)\n", 741 | " \n", 742 | " if result.pvalue <= 0.1:\n", 743 | " count_10_perc+=1\n", 744 | " \n", 745 | " elif 0.1" 907 | ] 908 | }, 909 | "execution_count": 49, 910 | "metadata": {}, 911 | "output_type": "execute_result" 912 | }, 913 | { 914 | "data": { 915 | "image/png": "\n", 916 | "text/plain": [ 917 | "
" 918 | ] 919 | }, 920 | "metadata": { 921 | "needs_background": "light" 922 | }, 923 | "output_type": "display_data" 924 | } 925 | ], 926 | "source": [ 927 | "import numpy as np\n", 928 | "import seaborn as sns\n", 929 | "\n", 930 | "# 1: population\n", 931 | "np.random.seed(123)\n", 932 | "\n", 933 | "regular_user = np.random.normal(loc = 100, scale = 5, size = 1000)\n", 934 | "sns.distplot(regular_user)" 935 | ] 936 | }, 937 | { 938 | "cell_type": "code", 939 | "execution_count": 55, 940 | "metadata": {}, 941 | "outputs": [ 942 | { 943 | "data": { 944 | "text/plain": [ 945 | "" 946 | ] 947 | }, 948 | "execution_count": 55, 949 | "metadata": {}, 950 | "output_type": "execute_result" 951 | }, 952 | { 953 | "data": { 954 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAD4CAYAAAAHHSreAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3deXgc1Zno/+8rtVr7vluSLdmW8Ybxohg7QAKYxSyxyQDBJCHODQmZTJjcJHfmDtz5JTfDb+Z5YDITMvmFkCHADIGAAYcEhTg4LE6AgBfZeJONbXnXYlubJWtf+v390WVopJbVsiVVS3o/z6NHVadOHb1dauntOnXqlKgqxhhjTKAItwMwxhgTfiw5GGOM6ceSgzHGmH4sORhjjOnHkoMxxph+PG4HMBwyMjK0sLDQ7TCMMWZM2bp1a52qZgbbNi6SQ2FhIWVlZW6HYYwxY4qIHB1om3UrGWOM6ceSgzHGmH4sORhjjOnHkoMxxph+LDkYY4zpx5KDMcaYfiw5GGOM6ceSgzHGmH5CSg4islxE9olIhYjcF2R7tIg872zfJCKFTvm1IrJVRHY5368O2GeRU14hIj8REXHK00TkNRE54HxPHZ6XaowxJlSD3iEtIpHAI8C1QCWwRURKVXVPQLW7gUZVnS4iq4CHgDuAOuAzqlotInOB9UCes8+jwD3ARmAdsBz4A3Af8IaqPugkovuAf7jwl2rM6Ht20zFXf/7nL53s6s83Y1coZw6LgQpVPaSqXcAaYGWfOiuBp5zltcAyERFVfV9Vq53yciDGOcvIBZJU9T31P4rul8AtQdp6KqDcGGPMKAklOeQBxwPWK/no03+/OqraAzQB6X3q3Aq8r6qdTv3KAdrMVtUap60aICtYUCJyj4iUiUhZbW1tCC/DGGNMqEJJDhKkrO+Dp89ZR0Tm4O9q+voQ2jwnVX1MVUtUtSQzM+ikgsYYY85TKMmhEigIWM8HqgeqIyIeIBlocNbzgd8AX1LVgwH18wdo86TT7YTz/VSoL8YYY8zwCCU5bAGKRaRIRLzAKqC0T51SYLWzfBvwpqqqiKQAvwfuV9W/nK3sdBedEZElziilLwEvB2lrdUC5McaYUTJocnCuIdyLf6TRXuAFVS0XkQdEZIVT7QkgXUQqgO/iH2GEs9904Hsist35OnsN4RvA40AFcBD/SCWAB4FrReQA/hFSD17oizTGGDM04h8sNLaVlJSoPezHhCMby
mrCmYhsVdWSYNvsDmljjDH9WHIwxhjTjyUHY4wx/VhyMMYY048lB2OMMf1YcjDGGNOPJQdjjDH9WHIwxhjTjyUHY4wx/VhyMMYY048lB2OMMf1YcjDGGNOPJQdjjDH9WHIwxhjTjyUHY4wx/VhyMMYY048lB2OMMf2ElBxEZLmI7BORChG5L8j2aBF53tm+SUQKnfJ0EdkgIi0i8tOA+okBjw3dLiJ1IvJjZ9uXRaQ2YNtXh+elGmOMCZVnsAoiEgk8gv95zpXAFhEpVdU9AdXuBhpVdbqIrAIeAu4AOoDvAXOdLwBU9QwwP+BnbAVeCmjveVW997xflTHGmAsSypnDYqBCVQ+pahewBljZp85K4ClneS2wTEREVVtV9R38SSIoESkGsoC3hxy9McaYERFKcsgDjgesVzplQeuoag/QBKSHGMOd+M8UNKDsVhHZKSJrRaQg2E4ico+IlIlIWW1tbYg/yhhjTChCSQ4SpEzPo85AVgHPBaz/DihU1XnA63x0RvLxxlUfU9USVS3JzMwM8UcZY4wJRSjJoRII/PSeD1QPVEdEPEAy0DBYwyJyCeBR1a1ny1S1XlU7ndVfAItCiNEYY8wwCiU5bAGKRaRIRLz4P+mX9qlTCqx2lm8D3uzTTTSQO/n4WQMikhuwugLYG0I7xhhjhtGgo5VUtUdE7gXWA5HAk6paLiIPAGWqWgo8ATwtIhX4zxhWnd1fRI4ASYBXRG4BrgsY6fQ54MY+P/JbIrIC6HHa+vIFvD5jjDHnYdDkAKCq64B1fcq+H7DcAdw+wL6F52h3apCy+4H7Q4nLGGPMyLA7pI0xxvRjycEYY0w/lhyMMcb0Y8nBGGNMP5YcjDHG9GPJwRhjTD+WHIwxxvRjycEYY0w/Id0EZ4wZulPNHeyuauJ4YxtN7d20dvagCgkxHpJjoyjOSqQoI57IiGDzVhrjLksOxgyjU2c6+N2OGn77fhW7qpoA8EQISbFRJER7EKCysZ3y6mbePlBHbFQknyhM5cqLsoiJinQ3eGMCWHIw5gK1dPawfvcJfru9ir9U1OFTmJuXxH03zKSprZvclBg8ER/vwe3q8VFx6gw7Kpt460AdW4+dZvmcHBZOTkHEziSM+yw5GHMeunt9vH2glt+8X81re07Q0e0jPzWWv7lyOrcsmMT0rEQAnt10LOj+Xk8EsyclM3tSMlc0tvHKzhp+va2SysY2bp43ybqajOssORgTos6eXjYdauC1PSf5/a4aGlq7SImL4taF+Xx2QR6LpqSe16f+/NQ47vnUVNaXn+DtA3U0tnVx5ycmE23dTMZFlhyMCcLnU6qb2jlY28qO46cpO9rI1iMNtHb1EhMVwbJZ2dwyP49Pz8jE67nwQX8RItwwN5eM+Ghe3lHFs5uP8aWlhXYGYVxjycGMuIG6VkbLyvmT2FXVRGVjOzWn2znd3k1bVw9tXb20dvbS3t3j/97VS2tXD+1dvZzp6KGr1/dhGzOyE7hlQR5Xz8zisukZI3bx+BNFaYjAS+9X8fL2Kj67IM+uQRhXWHIw49KZjm62HGlkb00z33t5N72+jx5MGO+NJC7aQ5w3kjiv/3tijIfspGjivR5ivZEkxHiYkhbP1Mx4ZuUkkRwXNWqxlxSm0dDaxZ/215KeEM2nZ9gz0s3oCyk5iMhy4D/wPwnucVV9sM/2aOCX+J/3XA/coapHRCQdWAt8AvhvVb03YJ8/AblAu1N0naqeGqit836FZkJp7uhm/e4T7KxsoleVKWlx/M2V01g0JZXC9HhykmPGxJDRa2ZnU9faxWt7TjAtM5781Di3QzITzKDJQUQigUeAa4FKYIuIlAY86hPgbqBRVaeLyCrgIeAOoAP4HjDX+errC6pa1qdsoLaMGZCqsvVoI+t219DTqyyemsaSonQyE6P5/KWT3Q5vyCJE+Oz8PI43tPFCWSV/e/V0oiJtQgMzekJ5ty0GKlT1kKp2AWuAlX3qrASecpbXAstERFS1VVXfw
Z8kQhW0rSHsbyaYXp+ydmslL71fRU5SLN+6upjPzJtEZmK026FdkFhvJLcuzKeupZP15SfcDsdMMKEkhzzgeMB6pVMWtI6q9gBNQHoIbf+XiGwXke8FJIDzbctMQN29Pn616SjvHz/NsplZfPWKIjLGeFIIND0rgSVT03n3YD3H6lvdDsdMIKEkh2Cf2vU86vT1BVW9GLjC+bprKG2JyD0iUiYiZbW1tYP8KDMe9fh8PPXeEfadOMOKSyaxbFY2EePwJPP6Odkkxnj4/a4afDrYn5UxwyOU5FAJFASs5wPVA9UREQ+QDDScq1FVrXK+nwGexd99FXJbqvqYqpaoaklmpo3mmIhe2VnDodpWbl2Yz5Kp4/fkMtoTyXWzczje2M7Oyia3wzETRCjJYQtQLCJFIuIFVgGlfeqUAqud5duAN1UH/ogjIh4RyXCWo4Cbgd3n05aZmDYeqmfz4QY+VZzBwimpbocz4hZMTmFSSgzry0/Q1eMbfAdjLtCgycHp978XWA/sBV5Q1XIReUBEVjjVngDSRaQC+C5w39n9ReQI8CPgyyJSKSKzgWhgvYjsBLYDVcAvBmvLGIDKxjZe2VnNRdmJXDcnx+1wRkWECDddPImm9m7+crDO7XDMBBDSfQ6qug5Y16fs+wHLHcDtA+xbOECziwaoP2BbxvT4fLy0rYqEaA+fKykYl9cYBlKUEc+snETeOVDH0qnpY+J+DTN22cBpM6a8tb+OE80drJyfR6x34v1zvGpmFu3dvWw6VO92KGacs+RgxoxTzR1s2HeKi/OSmZWb5HY4rshPjWNGdgJvV9TZtQczoiw5mDHjlZ01eCMj+Mwlk9wOxVVXXZRFW1cvmw/b2YMZOZYczJhQcaqFitoWrp6ZRUL0xJ4vckq6f0LAtw/U0dNrZw9mZFhyMGHPp8r68hOkxEVxaVGa2+GEhU8XZ3Kms+fD51QbM9wsOZiwt7uqiarT7VwzKxuPTT4H+KfVyEyI5t2D9dhtQGYk2F+aCWu9PuW1PSfJTopmfkGK2+GEDRFh6bR0qk63c7yhze1wzDhkycGEtfLqJupbu1g2c3zOm3QhFkxOISYqgndtWKsZAZYcTNhSVd4+UEd6vJfZkybm0NVzifZEUjIljd1VTTS1d7sdjhlnLDmYsHWorpWq0+1cXpxhZw0DWDI1HVXYcuSc81waM2SWHEzYevtALfHRHhZOHv8T652vtHgv07MS2Hq00abzNsPKkoMJSyeaO9h/soWlU9Pt8ZiDKClMo6m9m4pTLW6HYsYR+6szYWnjwXo8EcISu69hULNyEonzRlrXkhlWlhxM2Ons7mV75Wnm5acQN8Hvhg6FJzKChZNT2VvTTEtnj9vhmHHC/vJM2NleeZquHt+w3Q397KZjw9JOOFs0JZV3KurYdrSRT82wJyOaC2dnDiasqCqbDzeQmxxDfmqs2+GMGdlJMUxOi2Pr0Ua7Y9oMC0sOJqxUNrZT09TB4qI0xIavDsnCyanUtnRSfbrD7VDMOBBSchCR5SKyT0QqRKTfYztFJFpEnne2bxKRQqc8XUQ2iEiLiPw0oH6ciPxeRD4QkXIReTBg25dFpFZEtjtfX73wl2nGis2HG/B6Ipifb1NlDNXFeclERgjvH290OxQzDgyaHEQkEngEuAGYDdzpPAc60N1Ao6pOBx4GHnLKO4DvAX8XpOl/U9WZwALgMhG5IWDb86o63/l6fEivyIxZnT297KpqYl5eMtH2CMwhi/VGMjMnkR2VTfT6rGvJXJhQzhwWAxWqekhVu4A1wMo+dVYCTznLa4FlIiKq2qqq7+BPEh9S1TZV3eAsdwHbgPwLeB1mHCivbqar18eiKXbT2/laUJBCa2eP3fNgLlgoySEPOB6wXumUBa2jqj1AE5AeSgAikgJ8BngjoPhWEdkpImtFpGCA/e4RkTIRKautrQ3lR5kwt+1YI2nxXianxbkdypg1IzuR2KhI61oyFyyU5BDsqmDfc9ZQ6vRvW
MQDPAf8RFUPOcW/AwpVdR7wOh+dkXy8cdXHVLVEVUsyM23o3lh3uq2Lw7WtLJicYheiL4AnMoKL85PZW9NMZ3ev2+GYMSyU5FAJBH56zweqB6rj/MNPBkK5XfMx4ICq/vhsgarWq2qns/oLYFEI7Zgxbvvx0yiwoMC6lC7UgoIUunuVvSea3Q7FjGGhJIctQLGIFImIF1gFlPapUwqsdpZvA97UQQZbi8g/408i3+5TnhuwugLYG0KMZgxTVbYdO01hejxp8V63wxnzCtLiSIrxsKvKkoM5f4PeIa2qPSJyL7AeiASeVNVyEXkAKFPVUuAJ4GkRqcB/xrDq7P4icgRIArwicgtwHdAM/CPwAbDN6Ub4qTMy6VsisgLocdr68jC9VhOmqk63U9fSyRXFGW6HMi5EiDA3L5nNhxs409FNYkyU2yGZMSik6TNUdR2wrk/Z9wOWO4DbB9i3cIBmg3Ysq+r9wP2hxGXGh52VTURGCHMnJbsdyrhxcV4y7x6s5429p7hlQd/xI8YMzu6QNq7yqbKz8rR/lI3X7m0YLme7ll7ZWeN2KGaMsuRgXHW0vo3mjh7m5dtZw3CKEOHivGTe2l/LmQ57hKgZOksOxlU7K08TFSnMyrFnRA+3uXnJdPX6eH3vSbdDMWOQJQfjml6fsquqiZk5SXg99lYcbgVpceQmx/D7nSfcDsWMQfYXaVxzsLaFtq5eLrEupRERIcINc3N5a38tzda1ZIbIkoNxza7KJqI9EczITnQ7lHHrpnm5dPX6eMO6lswQWXIwruj1KXtqmpmVm4Qn0t6GI2VBQYrTtWSjlszQ2F+lccXhulbau3uZO8kuRI+kiAjhxotzeWt/nXUtmSGx5GBcsbuqCW9kBMXWpTTibrzY37X0+h7rWjKhs+RgRp1PlfKaZmbkJBJlXUojbkFBCpOsa8kMkf1lmlF3tL6N1s4e61IaJRERwg0X5/L2AetaMqGz5GBG3e7qJjwRwkXWpTRqbpibQ1evjw0fnHI7FDNGWHIwo8qnyp7qZoqzE+050aNo4eRUMhOjWV9uN8SZ0FhyMKOqqrGdpvZu61IaZRERwnWzs9nwQS0d9oQ4EwJLDmZU7a5uIlKEmTaX0qhbPjeH9u5e3tpvz1w3g7PkYEaNqrK7qolpWfE2PbcLlkxNJynGw6vWtWRCEFJyEJHlIrJPRCpE5L4g26NF5Hln+yYRKXTK00Vkg4i0iMhP++yzSER2Ofv8RJzHwYlImoi8JiIHnO/2UOFxoqapg8a2bubYQ31cERUZwTWzs3l9z0m6e31uh2PC3KDJQUQigUeAG4DZwJ0iMrtPtbuBRlWdDjwMPOSUdwDfA/4uSNOPAvcAxc7Xcqf8PuANVS0G3nDWzTiwu7oJAWblWpeSW5bPyaG5o4eNh+rdDsWEuVDOHBYDFap6SFW7gDXAyj51VgJPOctrgWUiIqraqqrv4E8SHxKRXCBJVd9TVQV+CdwSpK2nAsrNGFde1UxRRjwJ0SE9ndaMgE/NyCTOG8mru61ryZxbKMkhDzgesF7plAWto6o9QBOQPkiblQO0ma2qNU5bNUBWCDGaMHeyuYPalk7m5FmXkptioiK56qIs1pefpNenbodjwlgoyUGClPV9V4VS50Lq929A5B4RKRORstpaG30R7sqdLqU51qXkuuvn5lDX0sn7xxrdDsWEsVCSQyVQELCeD1QPVEdEPEAy0DBIm/kDtHnS6XY62/0U9JZOVX1MVUtUtSQzMzOEl2HcVF7dzOS0OJJio9wOZcK76qJMvJER1rVkzimU5LAFKBaRIhHxAquA0j51SoHVzvJtwJvOtYSgnO6iMyKyxBml9CXg5SBtrQ4oN2NUfUsnNU0d1qUUJhJjori8OINXy09wjj9TM8ENmhycawj3AuuBvcALqlouIg+IyAqn2hNAuohUAN8lYISRiBwBfgR8WUQqA0Y6fQN4HKgADgJ/cMofBK4VkQPAtc66GcN2VzcDMMfuig4by
+fkUNnYTrnzuzGmr5CGjajqOmBdn7LvByx3ALcPsG/hAOVlwNwg5fXAslDiMmNDeXUTeSmxpMZ53Q7FOK6ZnU3ES7C+/ARz7YzOBGF3SJsRVXW6ncrGdptLKcykxXu5tCjdrjuYAVlyMCPq7D8fu94QfpbPzeHAqRYqTrW4HYoJQ5YczIh6dXcNOUkxZCREux2K6eO6OdkANo23CcqSgxkxJ5o6KDvayNw861IKR7nJscwvSLHkYIKy5GBGzB9216CKXfAMY8vn5rCzsomq0+1uh2LCjCUHM2LW7arhouxEshJj3A7FDOD6OTkArLcL06YPSw5mRJxs9ncp3XhxrtuhmHMoyohnZk6iPePB9GPJwYyIP+zydyndNC/H7VDMIK6fk8OWIw3Unul0OxQTRiw5mBHxe6dLaXpWotuhmEEsn5uDKry+96TboZgwYsnBDDvrUhpbZuYkMiU9zm6IMx9jycEMO+tSGltEhOVzcnj3YB1N7d1uh2PChCUHM+zW7TphXUpjzPVzc+juVTZ8EHSGfDMBWXIww+pkcwdbjjZYl9IYMz8/heykaOtaMh+y5GCGlXUpjU0REcL1c3L40/5TtHf1uh2OCQOWHMywWrfrBDOyE6xLaQxaPieHjm4ff95vj901lhzMMDrldCnddPEkt0Mx52FxURopcVE215IBLDmYYfSH3SesS2kM80RGcO2sbF7fe5KuHp/b4RiXhZQcRGS5iOwTkQoRuS/I9mgRed7ZvklECgO23e+U7xOR652yi0Rke8BXs4h829n2AxGpCth24/C8VDPSSndU2yilMW753BzOdPTw3qF6t0MxLhs0OYhIJPAIcAMwG7gz4DnQZ90NNKrqdOBh4CFn39nAKmAOsBz4mYhEquo+VZ2vqvOBRUAb8JuA9h4+u915RKkJc8fq29h6tJGVC6xLaSy7bHoG8d5IG7VkQjpzWAxUqOohVe0C1gAr+9RZCTzlLK8FlomIOOVrVLVTVQ8DFU57gZYBB1X16Pm+COO+l7dXAbDiEksOY1lMVCRXzczitT0n6PWp2+EYF4WSHPKA4wHrlU5Z0Dqq2gM0Aekh7rsKeK5P2b0islNEnhSR1GBBicg9IlImImW1tTa6wk2qym+3V7G4MI381Di3wzEXaPncHOpauth6tNHtUIyLQkkOEqSs70eKgeqcc18R8QIrgBcDtj8KTAPmAzXAvwcLSlUfU9USVS3JzMwcOHoz4sqrmzlY22pdSuPElRdl4fVEWNfSBBdKcqgECgLW84HqgeqIiAdIBhpC2PcGYJuqfjgdpKqeVNVeVfUBv6B/N5QJM799v4qoSOEmuyt6XEiI9vCp4gzWl59A1bqWJqpQksMWoFhEipxP+quA0j51SoHVzvJtwJvqf1eVAquc0UxFQDGwOWC/O+nTpSQigf9hPgvsDvXFmNHX61N+t7OaKy/KIiXO63Y4ZphcPyeHqtPt7K5qdjsU4xLPYBVUtUdE7gXWA5HAk6paLiIPAGWqWgo8ATwtIhX4zxhWOfuWi8gLwB6gB/imqvYCiEgccC3w9T4/8l9FZD7+7qcjQbabMPLWgVpONnfyVwv6XkoyY9k1s7KJjBBeLa/h4nx7BvhENGhyAHCGk67rU/b9gOUO4PYB9v0X4F+ClLfhv2jdt/yuUGIy4WFtWSWpcVEsm5XtdihmGKXGe1kyNY1Xd5/g76+f6XY4xgV2h7Q5b6fbunhtz0lWzs/D67G30nizfE4OB2tbqTh1xu1QjAvsL9qct5e3V9PV6+P2kny3QzEj4Lo5/mlQbNTSxGTJwZy3F7ceZ3ZuEnMmWZ/0eJSdFMPCySm8ahPxTUiWHMx52VvTzO6qZj5nZw3j2vK5OeyuauZ4Q5vboZhRZsnBnJfnNh/D64lg5XwbpTSeLZ/jH1m+bleNy5GY0WbJwQxZa2cPL22r4qaLc0mNt3sbxrPJ6XFcUpBC6Y6+972a8c6Sgxmy0h3VtHT28MUlk90OxYyCFZdMo
ry6mYpTLW6HYkaRJQczJKrKMxuPMjMnkYWTg86JaMaZz8zLRQQ7e5hgLDmYIdlR2UR5dTNfWDIF/6zsZrzLSorhk9PSKd1eZXMtTSCWHMyQPLPxKPHeSD5r02VMKCsumcSR+jZ2VTW5HYoZJZYcTMhqz3RSuqOazy7MIyE6pJlXzDixfE4u3sgIXt5uXUsThSUHE7JnNh6lq8fHVy4rcjsUM8qS46K48qJMXt5eTU+vz+1wzCiw5GBC0tHdyzMbj3LNrCymZia4HY5xwV8tzKeupZO3K+rcDsWMAksOJiQvb6+ivrWLr1xuZw0T1dUzs0iNi+LXWyvdDsWMAksOZlCqyuNvH2Z2bhJLp/abZd1MEF5PBCsumcQf95ykqb3b7XDMCLPkYAa1Yd8pDpxq4e7Li2z46gT3Vwvz6erx2XQaE4AlB3NOqspP3qggPzWWFfMnuR2Ocdm8/GSmZyVY19IEEFJyEJHlIrJPRCpE5L4g26NF5Hln+yYRKQzYdr9Tvk9Erg8oPyIiu0Rku4iUBZSnichrInLA+W634broLxX1bD9+mm9cOY2oSPssMdGJCLcuzKfsaCOH61rdDseMoEH/2kUkEngEuAGYDdwpIrP7VLsbaFTV6cDDwEPOvrPxP096DrAc+JnT3llXqep8VS0JKLsPeENVi4E3nHXjkv/vzQPkJMVw2yKbmtv43bowj8gIYc2WY26HYkZQKB8FFwMVqnpIVbuANcDKPnVWAk85y2uBZeLvnF4JrFHVTlU9DFQ47Z1LYFtPAbeEEKMZAZsPN7DpcANf//RUoj2Rg+9gJoSspBiunpnFr7dW0tVj9zyMV6EkhzzgeMB6pVMWtI6q9gBNQPog+yrwRxHZKiL3BNTJVtUap60aICtYUCJyj4iUiUhZbW1tCC/DDIWq8qPX9pGR4GXVJ2z2VfNxn188mbqWLl7fe9LtUMwICSU5BBue0nf2rYHqnGvfy1R1If7uqm+KyKdCiOWjRlQfU9USVS3JzMwcyq4mBH/eX8vGQw387dXFxHrtrMF83KdmZDIpOYbnNlvX0ngVygQ5lUBBwHo+0HeClbN1KkXEAyQDDefaV1XPfj8lIr/B3930FnBSRHJVtUZEcoFTQ35V5mOe3TS0P2CfKo9sqCAt3ovI0Pc34WMkf3czc5PY8MEpfvqm/70SzOcvtbPOsSqUM4ctQLGIFImIF/8F5tI+dUqB1c7ybcCb6p/btxRY5YxmKgKKgc0iEi8iiQAiEg9cB+wO0tZq4OXze2nmfO2qbKKmqYNrZmXjibARSia4kin+gYRbjjS4HIkZCYOeOahqj4jcC6wHIoEnVbVcRB4AylS1FHgCeFpEKvCfMaxy9i0XkReAPUAP8E1V7RWRbOA3zg1VHuBZVX3V+ZEPAi+IyN3AMeD2YXy9ZhDdvT5e23uS3OQY5uUnux2OCWMpcV5m5iSy5UgDV8/MsqHO40xI8y6r6jpgXZ+y7wcsdzDAP3FV/RfgX/qUHQIuGaB+PbAslLjM8PtLRR0NrV38j8sKibC7oc0glk7LYO+Jw+yqbGLhFLslaTyxVG8+1NTezYZ9p5idm0RxVqLb4ZgxYFpmPFmJ0bx7qM6eEjfOWHIwH3p1dw2qcOPFuW6HYsYIEWHptHSqT3dwrKHN7XDMMLLkYAA4XNfKjsomrijOHHDkiTHBLChIJSYqgncP1rsdihlGlhwM3b0+fvN+JalxUXx6ht0zYobG64mgZEoa5dVNNLZ2uR2OGSaWHAwb9p2irqWLWxbk4fXYW8IM3WXTMxCEd+wpceOG/SeY4Gqa2nlrfy0LJ6fYRWhz3pJjo5g/OYWyow20dPa4HY4ZBpYcJrAen49fb6sk1uvhxjlD3XAAABHuSURBVLl2EdpcmCuKM+jpVd6zaw/jgiWHCezNvaeoPt3BLfMnERcd0i0vxgwoKzGG2ZOS2Hions7uXrfDMRfIksMEdaSulT/vr2XRlFTmTLI7oc3w+PSMTNq7e3nvk
J09jHWWHCagju5eXtx6nNR4LzfbPQ1mGOWnxnFRdiJvHailvcvOHsYySw4TjKry622VNLV3c/uifKKjbDpuM7yunZ1NR7ePdyrsOStjmSWHCea9Q/WUVzdz/ZwcpqTHux2OGYcmpcRycV4yf6mop66l0+1wzHmy5DCBHG9o4w+7TjArJ5HLp2e4HY4Zx66ZlU13r49HNlS4HYo5T5YcJojmjm5+tekoSbEebltUgNiMq2YEZSZGU1KYytPvHaXi1Bm3wzHnwZLDBNDT6+NXG4/S3t3LF5dMscd+mlFx7ewcYr2R/NPv9tiMrWOQJYdxTlV5eXs1xxvbuX1RAbnJsW6HZCaIhGgP3712Bm8fqOOPe066HY4ZopCSg4gsF5F9IlIhIvcF2R4tIs872zeJSGHAtvud8n0icr1TViAiG0Rkr4iUi8j/DKj/AxGpEpHtzteNF/4yJ66f/ekgW481cvXMLObm2f0MZnTdtWQKM7IT+H9f2WNDW8eYQZODiEQCjwA3ALOBO0Vkdp9qdwONqjodeBh4yNl3Nv5Hhs4BlgM/c9rrAf6Xqs4ClgDf7NPmw6o63/n62BPoTOhe3l7FD9fv45L8ZJbNzHI7HDMBeSIjeGDlXCob2/nh+n1uh2OGIJQzh8VAhaoeUtUuYA2wsk+dlcBTzvJaYJn4r3iuBNaoaqeqHgYqgMWqWqOq2wBU9QywF8i78Jdjznr3YB1//+JOLi1K49aF+XYB2rhmydR0vrR0Cv/17mE22Z3TY0YoySEPOB6wXkn/f+Qf1lHVHqAJSA9lX6cLagGwKaD4XhHZKSJPikjQB9OKyD0iUiYiZbW1drNNoPePNfK1p8oozIjjsbtK8NiD343L7rthJpPT4vj7tTtptVlbx4RQ/msE+8jZd+jBQHXOua+IJAC/Br6tqs1O8aPANGA+UAP8e7CgVPUxVS1R1ZLMTHtAzVl7a5pZ/eRmMhKjeebuS0mOi3I7JGOI83r4t9sv4XhjG/+3tNxGL40BoSSHSqAgYD0fqB6ojoh4gGSg4Vz7ikgU/sTwK1V96WwFVT2pqr2q6gN+gb9by4TgUG0Ldz2xifhoD8/cfSlZSTFuh2TMhz5RmMbfXl3M2q2VPLv5mNvhmEGEkhy2AMUiUiQiXvwXmEv71CkFVjvLtwFvqv+jQSmwyhnNVAQUA5ud6xFPAHtV9UeBDYlI4ExwnwV2D/VFTUSVjW188fFNqMIzX72UgrQ4t0Mypp9vLyvmyosy+UFpOduONbodjjmHQZODcw3hXmA9/gvHL6hquYg8ICIrnGpPAOkiUgF8F7jP2bcceAHYA7wKfFNVe4HLgLuAq4MMWf1XEdklIjuBq4DvDNeLHa9qmtr54uObaOns4em7L2VaZoLbIRkTVESE8OM75pObHMtfP72VY/VtbodkBiDjoe+vpKREy8rK3A7DFUfrW/n8LzbR3N7Nf39lMYum9L9+/+wmO4U37vj8pZODlu8/eYbP/ed7JMVE8eJfLyXbukBdISJbVbUk2DYbxjKG7T95htt//h5tXT08+7UlQRODMeFoRnYi//0/FlPf0sldT2yi3mZvDTuWHMaoXZVN3PGf7wHw/NeXcnG+3f1sxpb5BSn8YnUJR+vbuPXRdzlS1+p2SCaAJYcxaMuRBj7/i43EeT28+NdLmZGd6HZIxpyXT07L4NmvLaGpvZu/evRdth61i9ThwpLDGFO6o5ovPr6JzKRo1n5jqT2wx4x5i6ak8tLfXEZijIc7/vM9HtlQQa9v7F8LHessOYwRPp/yo9f2863n3mdefjIvfn2pzbBqxo2ijHhKv3k518/N4Yfr93HnYxvZd8KeA+EmSw5jQHtXL/c+t42fvHGA2xfl88xXLyU9IdrtsIwZVslxUfz0zgX8++2XsO/kGW74j7e4/6VdnGjqcDu0CcnjdgDm3E40dfC1X5axu7qJ/3PjTL52xVSbRM+MWyLCrYvyWTYrix+/foBnNh7lxbLj3Dwvl9WfL
GR+QYq9/0eJJYcw9vaBWr7z/Hbau3p5/EslLJuV7XZIxoyKlDgvP1gxh69cVsR/vXuYF7Yc57fbq5mcFsfN83K5amYW8wtSiLJJJUeM3QQXhrp6fPz49f08+ueDFGcl8MjnF1J8ASOS7CY4M9Z1dPeyu6qJXVVNHKxtwafg9UQwOS2OvJTYD79S4qI+dmYx0E14xu9cN8HZmUOY+eBEM//rhR2UVzdzR0kBP1gxx575bCa8mKhISgrTKClMo72rl0N1LVScauF4QxtvH6jl7OCmOG8kuckx5CbHkpMcw/yCFKZnJeD12BnGUFlyCBMd3b38558P8ciGChJjPPz8i4tYPjfH7bCMCTux3kjmTEpmziT/jZ/dvT5ONndQ2dhO9el2apo62Hionh6fsnZrJVGRwvSsRGblJjI7N4lZuUnMzk0iNd7r8isJb5YcwsCf99fyg9JyDte1cvO8XP5pxRwbjWRMiKIiI8hPjSM/9aOZiHt9Sl1LJ1Mz49lbc4a9Nc28c6COl7ZVfVhnamY8JVNSKZmSRklhKkUZ8XaxO4AlBxftrmrioVc/4O0DdRRlxPP03Yu5otgeXGTMhYqMELKTYlg5P4+V8z8qr2/pZG/NGXZVNbH1aAN/3HOSF8oqAchI8HLp1HQ+OS2dpVPTJ3yysOTggvePNfKzPx3ktT0nSY2L4ns3z+aLSyYT7bFrC8aMpPSEaC4vjuby4gxgGj6fcqiuhS1HGtl8uIH3Dtbz+501AGQnRbN0ajqfnJbB0mnpE+4ZKZYcRklnTy+v7j7BrzYeY/ORBpJjo/jWsmK+ekURSTH2KE9j3BAR4b8eMT0rkTsXT0ZVOVzXynuH6nnvYD3vVNTx2+3+B1/mpcSy1DmrWDotnUkp43uGAksOI6jXp5QdaaB0RzXrdtXQ2NbNlPQ4/p+bZrFq8WQSou3wGxNORISpmQlMzUzgC5dOQVWpONXCuwf9yeL1vSdZu9XfDVWYHsfSaekscZJFVuL4eiaF/XcaRj6fcri+lfePneadA7X8eX8tjW3dxEZFsmxWFp8rKeDy6RlEREzcfkxjxhIRoTg7keLsRFZ/shCfT9l7opn3Dtaz8VA9r+yo4bnNxwGYlhnPoimpzkgq/6io+DH8ATCkyEVkOfAfQCTwuKo+2Gd7NPBLYBFQD9yhqkecbfcDdwO9wLdUdf252nSeNb0GSAO2AXepateFvczh19TWzbGGNo42tLLvxBm2Hz/NjuOnae7oASAt3stVF2Vx1cwsls3KIs47dt8kxhi/iAj5cBjtV6+YSq9PKa9u4l0nWby+99SHF7hFoCg9nqmZCRRlxDElPZ6ijHimpMeRlRgT9vdeDPofS0QigUeAa4FKYIuIlKrqnoBqdwONqjpdRFYBDwF3iMhsYBUwB5gEvC4iM5x9BmrzIeBhVV0jIj932n50OF5sX6fbuqg900lHt4/Onl46e5zv3T46e3y0d/fS2NZFY2sXDa3dNDr1j9a3fpgEwD8yYkZ2IjfNm8SCghQuKUihOCvBzhCMGeciI4R5+SnMy0/hrz89DVXlRHMH5VXNlFc3s6emicN1rbx9oJbOHt/H9k2NiyIrMYaspGgyEqJJjo0iMcZDUkwUSbEe4rwevJ4IvJ4IoiMjPlz2eiLwREQQIf4zm/QE74hctwzl4+xioEJVDwGIyBpgJRCYHFYCP3CW1wI/Ff8YsJXAGlXtBA6LSIXTHsHaFJG9wNXA5506TzntjkhyWLPlOA/+4YNB68VGRZIW7yUlLoqMhGjmF6QwOS2OgrQ4JqfFUZQRb3cxG2MQEXKTY8lNjuWa2R/Nhebz+ZPGkfpWjtW3cbK5k1NnOjh1ppNTZzo5VNvKmY5uznT2MNQZjf75lrl8ccmUYX4loSWHPOB4wHolcOlAdVS1R0SagHSnfGOfffOc5WBtpgOnVbUnSP2PEZF7gHuc1RYR2TfI68gA6gap4waLa2gsrqEJx7hGLaYvDK16OB4rGCSuux6Cu
86/7QGzSijJIVjfSN/cNlCdgcqDdbadq37/QtXHgMeCbQtGRMoGmmDKTRbX0FhcQxOOcYVjTGBx9RXKFZFKoCBgPR+oHqiOiHiAZKDhHPsOVF4HpDhtDPSzjDHGjLBQksMWoFhEikTEi/8Cc2mfOqXAamf5NuBN9c8FXgqsEpFoZxRSMbB5oDadfTY4beC0+fL5vzxjjDHnY9BuJecawr3AevzDTp9U1XIReQAoU9VS4AngaeeCcwP+f/Y49V7Af/G6B/imqvYCBGvT+ZH/AKwRkX8G3nfaHg4hd0GNMotraCyuoQnHuMIxJrC4PmZcPOzHGGPM8ArvuzCMMca4wpKDMcaYfsZtchCRSBF5X0RecdaLRGSTiBwQkeedC+GjHVOKiKwVkQ9EZK+ILBWRNBF5zYnrNRFJdSGu74hIuYjsFpHnRCTGjeMlIk+KyCkR2R1QFvT4iN9PRKRCRHaKyMJRjuuHzu9xp4j8RkRSArbd78S1T0SuH824Arb9nYioiGQ4664eL6f8b51jUi4i/xpQ7trxEpH5IrJRRLaLSJmILHbKR/N4FYjIBud/QrmI/E+n3N33vqqOyy/gu8CzwCvO+gvAKmf558A3XIjpKeCrzrIXSAH+FbjPKbsPeGiUY8oDDgOxAcfpy24cL+BTwEJgd0BZ0OMD3Aj8Af+9MUuATaMc13WAx1l+KCCu2cAOIBooAg4CkaMVl1NegH+wx1EgI0yO11XA60C0s54VDscL+CNwQ8Ax+pMLxysXWOgsJwL7nePi6nt/XJ45iEg+cBPwuLMu+KflWOtUeQq4ZZRjSsL/5nwCQFW7VPU0/ilGnnIrLocHiBX//SVxQA0uHC9VfQv/aLdAAx2flcAv1W8j/vtjckcrLlX9o350J/9G/PfknI1rjap2quphIHDKmBGPy/Ew8L/5+A2krh4v4BvAg+qfSgdVPRUQl5vHS4EkZzmZj+6rGs3jVaOq25zlM8Be/B/aXH3vj8vkAPwY/x/H2ZmuQp6WYwRNBWqB/3K6ux4XkXggW1VrwP8mAbJGMyhVrQL+DTiGPyk0AVtx/3idNdDxCTati1sxfgX/JzlwOS4RWQFUqeqOPpvcPl4zgCucrso/i8gnwiSubwM/FJHj+P8O7nczLhEpBBYAm3D5vT/ukoOI3AycUtWtgcVBqo72GF4P/lPaR1V1AdCK/1TRVU4/5kr8p/STgHjghiBVw23Mczj8ThGRf8R/D8+vzhYFqTYqcYlIHPCPwPeDbQ5SNprHywOk4u8G+XvgBeeM3u24vgF8R1ULgO/w0X1Vox6XiCQAvwa+rarN56oapGzYYxt3yQG4DFghIkfwPxfiavxnEm5Py1EJVKrqJmd9Lf5kcfLsKaHz/dQA+4+Ua4DDqlqrqt3AS8Ancf94nTXQ8QllWpcRJSKrgZuBL6jTGexyXNPwJ/kdzvs/H9gmIjkux4Xz819yukI24z+rzwiDuFbjf88DvMhHXVqjGpeIROFPDL9S1bPxuPreH3fJQVXvV9V8VS3Ef6f2m6r6BVyelkNVTwDHReQip2gZ/jvHA6cecWO6kGPAEhGJcz7JnY0rXKYxGej4lAJfckZuLAGazp6CjwbxP6zqH4AVqtrWJ95gU8aMOFXdpapZqlrovP8r8V/oPIHLxwv4Lf4Paoj/mS5e/HOpuXa8HNXAp53lq4EDzvKoHS/n7+4JYK+q/ihgk7vv/ZG6Ah8OX8CVfDRaaSr+N10F/k8I0S7EMx8oA3bi/2NJxX895A38b8o3gDQX4von4ANgN/A0/pEjo368gOfwX/foxv+P7e6Bjg/+U+tH8I9u2QWUjHJcFfj7fbc7Xz8PqP+PTlz7cEbCjFZcfbYf4aPRSm4fLy/wjPMe2wZcHQ7HC7gc/zW2Hfj7+Re5cLwux98ttDPg/XSj2+99mz7DGGNMP+OuW8kYY8yFs+RgjDGmH0sOxhhj+rHkYIwxph9LDsYYY/qx5GCMMaYfSw7GGGP6+f8B7EqKh
0xaGS8AAAAASUVORK5CYII=\n", 955 | "text/plain": [ 956 | "
" 957 | ] 958 | }, 959 | "metadata": { 960 | "needs_background": "light" 961 | }, 962 | "output_type": "display_data" 963 | } 964 | ], 965 | "source": [ 966 | "# 5% heavy users \n", 967 | "heavy_users = np.random.normal(loc=120,scale = 25,size = 50)\n", 968 | "sns.distplot(heavy_users)" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 56, 974 | "metadata": {}, 975 | "outputs": [], 976 | "source": [ 977 | "total_user = []\n", 978 | "\n", 979 | "for i in heavy_users:\n", 980 | " total_user.append(i)\n", 981 | "for j in regular_user:\n", 982 | " total_user.append(j)" 983 | ] 984 | }, 985 | { 986 | "cell_type": "code", 987 | "execution_count": 60, 988 | "metadata": {}, 989 | "outputs": [ 990 | { 991 | "data": { 992 | "text/plain": [ 993 | "Ttest_indResult(statistic=0.40481427665266323, pvalue=0.6856966445371957)" 994 | ] 995 | }, 996 | "execution_count": 60, 997 | "metadata": {}, 998 | "output_type": "execute_result" 999 | } 1000 | ], 1001 | "source": [ 1002 | "#1. 
simulate the hashing process & single iteration\n", 1003 | "\n", 1004 | "import random\n", 1005 | "\n", 1006 | "A_1 = []\n", 1007 | "\n", 1008 | "A_2 = []\n", 1009 | "\n", 1010 | "for i in total_user:\n", 1011 | " \n", 1012 | " hash_val = random.random()\n", 1013 | " \n", 1014 | " if hash_val <= 0.5:\n", 1015 | " A_2.append(i)\n", 1016 | " \n", 1017 | " else:\n", 1018 | " A_1.append(i)\n", 1019 | " \n", 1020 | "# two sample t test \n", 1021 | "from scipy import stats\n", 1022 | "\n", 1023 | "stats.ttest_ind(A_1,A_2)" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": 61, 1029 | "metadata": {}, 1030 | "outputs": [], 1031 | "source": [ 1032 | "# no difference " 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": 72, 1038 | "metadata": {}, 1039 | "outputs": [ 1040 | { 1041 | "data": { 1042 | "text/plain": [ 1043 | "" 1044 | ] 1045 | }, 1046 | "execution_count": 72, 1047 | "metadata": {}, 1048 | "output_type": "execute_result" 1049 | }, 1050 | { 1051 | "data": { 1052 | "image/png": "\n", 1053 | "text/plain": [ 1054 | "
" 1055 | ] 1056 | }, 1057 | "metadata": { 1058 | "needs_background": "light" 1059 | }, 1060 | "output_type": "display_data" 1061 | } 1062 | ], 1063 | "source": [ 1064 | "exponential = np.random.exponential(scale=2, size=1000)\n", 1065 | "sns.distplot(exponential)" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "execution_count": 73, 1071 | "metadata": {}, 1072 | "outputs": [ 1073 | { 1074 | "data": { 1075 | "text/plain": [ 1076 | "Ttest_indResult(statistic=0.1348387526112502, pvalue=0.8927665540225909)" 1077 | ] 1078 | }, 1079 | "execution_count": 73, 1080 | "metadata": {}, 1081 | "output_type": "execute_result" 1082 | } 1083 | ], 1084 | "source": [ 1085 | "#1. simulate the hashing process & single iteration\n", 1086 | "\n", 1087 | "import random\n", 1088 | "\n", 1089 | "A_1 = []\n", 1090 | "\n", 1091 | "A_2 = []\n", 1092 | "\n", 1093 | "for i in exponential:\n", 1094 | " \n", 1095 | " hash_val = random.random()\n", 1096 | " \n", 1097 | " if hash_val <= 0.5:\n", 1098 | " A_2.append(i)\n", 1099 | " \n", 1100 | " else:\n", 1101 | " A_1.append(i)\n", 1102 | " \n", 1103 | "# two sample t test \n", 1104 | "from scipy import stats\n", 1105 | "\n", 1106 | "stats.ttest_ind(A_1,A_2)" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "markdown", 1111 | "metadata": {}, 1112 | "source": [ 1113 | "---" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "# Part 4: Check for Variability " 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 23, 1126 | "metadata": {}, 1127 | "outputs": [], 1128 | "source": [ 1129 | "# Bootstrap & 70% sample size\n", 1130 | "\n", 1131 | "boot_diffs = []\n", 1132 | "\n", 1133 | "for i in range(10000):\n", 1134 | " boot_sample_1 = np.random.choice(A_1, replace=False, size = int(0.7*len(A_1)))\n", 1135 | " \n", 1136 | " boot_sample_2 = np.random.choice(A_2, replace= False, size = int(0.7*len(A_2)))\n", 1137 | " \n", 1138 | " result = 
stats.ttest_ind(boot_sample_1,boot_sample_2)\n", 1139 | "\n", 1140 | " boot_diffs.append(result.pvalue)" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "execution_count": 25, 1146 | "metadata": {}, 1147 | "outputs": [ 1148 | { 1149 | "data": { 1150 | "text/plain": [ 1151 | "0.0004" 1152 | ] 1153 | }, 1154 | "execution_count": 25, 1155 | "metadata": {}, 1156 | "output_type": "execute_result" 1157 | } 1158 | ], 1159 | "source": [ 1160 | "count= 0 \n", 1161 | "\n", 1162 | "for i in boot_diffs:\n", 1163 | " if i < 0.05:\n", 1164 | " count+=1\n", 1165 | " \n", 1166 | "count/len(boot_diffs)" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "markdown", 1171 | "metadata": {}, 1172 | "source": [ 1173 | "---" 1174 | ] 1175 | }, 1176 | { 1177 | "cell_type": "code", 1178 | "execution_count": 26, 1179 | "metadata": {}, 1180 | "outputs": [], 1181 | "source": [ 1182 | "# Bootstrap & 80% sample size\n", 1183 | "\n", 1184 | "boot_diffs = []\n", 1185 | "\n", 1186 | "for i in range(10000):\n", 1187 | " boot_sample_1 = np.random.choice(A_1, replace=False, size = int(0.8*len(A_1)))\n", 1188 | " \n", 1189 | " boot_sample_2 = np.random.choice(A_2, replace= False, size = int(0.8*len(A_2)))\n", 1190 | " \n", 1191 | " result = stats.ttest_ind(boot_sample_1,boot_sample_2)\n", 1192 | "\n", 1193 | " boot_diffs.append(result.pvalue)" 1194 | ] 1195 | }, 1196 | { 1197 | "cell_type": "code", 1198 | "execution_count": 27, 1199 | "metadata": {}, 1200 | "outputs": [ 1201 | { 1202 | "data": { 1203 | "text/plain": [ 1204 | "0.0001" 1205 | ] 1206 | }, 1207 | "execution_count": 27, 1208 | "metadata": {}, 1209 | "output_type": "execute_result" 1210 | } 1211 | ], 1212 | "source": [ 1213 | "count= 0 \n", 1214 | "\n", 1215 | "for i in boot_diffs:\n", 1216 | " if i < 0.05:\n", 1217 | " count+=1\n", 1218 | " \n", 1219 | "count/len(boot_diffs)" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "markdown", 1224 | "metadata": {}, 1225 | "source": [ 1226 | "---" 1227 | ] 1228 | }, 1229 | { 1230 | 
"cell_type": "code", 1231 | "execution_count": 28, 1232 | "metadata": {}, 1233 | "outputs": [], 1234 | "source": [ 1235 | "# Bootstrap & 90% sample size\n", 1236 | "\n", 1237 | "boot_diffs = []\n", 1238 | "\n", 1239 | "for i in range(10000):\n", 1240 | " boot_sample_1 = np.random.choice(A_1, replace=False, size = int(0.9*len(A_1)))\n", 1241 | " \n", 1242 | " boot_sample_2 = np.random.choice(A_2, replace= False, size = int(0.9*len(A_2)))\n", 1243 | " \n", 1244 | " result = stats.ttest_ind(boot_sample_1,boot_sample_2)\n", 1245 | "\n", 1246 | " boot_diffs.append(result.pvalue)" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": 29, 1252 | "metadata": {}, 1253 | "outputs": [ 1254 | { 1255 | "data": { 1256 | "text/plain": [ 1257 | "0.0" 1258 | ] 1259 | }, 1260 | "execution_count": 29, 1261 | "metadata": {}, 1262 | "output_type": "execute_result" 1263 | } 1264 | ], 1265 | "source": [ 1266 | "count= 0 \n", 1267 | "\n", 1268 | "for i in boot_diffs:\n", 1269 | " if i < 0.05:\n", 1270 | " count+=1\n", 1271 | " \n", 1272 | "count/len(boot_diffs)" 1273 | ] 1274 | }, 1275 | { 1276 | "cell_type": "markdown", 1277 | "metadata": {}, 1278 | "source": [ 1279 | "# Conclusion 1: for equal split & 90% sample size, the p value is 0.0101 (False Positive Rate);\n", 1280 | "- Note: set replace = False; otherwise, it returns a high p value" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "markdown", 1285 | "metadata": {}, 1286 | "source": [ 1287 | "---" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "code", 1292 | "execution_count": 93, 1293 | "metadata": {}, 1294 | "outputs": [], 1295 | "source": [ 1296 | "# Bootstrap & 100% sample size\n", 1297 | "\n", 1298 | "boot_diffs_2 = []\n", 1299 | "\n", 1300 | "for i in range(10000):\n", 1301 | " \n", 1302 | " boot_sample_1 = np.random.choice(A_1, replace=False, size = len(A_1))\n", 1303 | " \n", 1304 | " boot_sample_2 = np.random.choice(A_2, replace= False, size = len(A_2))\n", 1305 | " \n", 1306 | " result = 
stats.ttest_ind(boot_sample_1,boot_sample_2)\n",
    "\n",
    "    boot_diffs_2.append(result.pvalue)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count= 0 \n",
    "for i in boot_diffs_2:\n",
    "    if i < 0.05:\n",
    "        count+=1\n",
    "count/len(boot_diffs_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0.19427810106851318,\n",
       " 0.19427810106856205,\n",
       " 0.19427810106856214,\n",
       " 0.1942781010685622,\n",
       " 0.19427810106861107,\n",
       " 0.19427810106866003}"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set(boot_diffs_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion 2: for an equal split & 100% sample size, 0 percent of the runs show a significant difference (False Positive Rate)\n",
    "- Note: set replace = False; otherwise, the rate comes out high\n",
    "- Note: with replace = False and 100% sample size, each draw is only a reordering of the same data, so every iteration returns the same p value (about 0.194, up to floating-point rounding)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test for SRM & use a chi-square test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Power_divergenceResult(statistic=25.0, pvalue=5.733031437583875e-07)"
      ]
     },
"execution_count": 30, 1394 | "metadata": {}, 1395 | "output_type": "execute_result" 1396 | } 1397 | ], 1398 | "source": [ 1399 | "from scipy.stats import chisquare \n", 1400 | "half = int((len(A_1)+len(A_2))/2)\n", 1401 | "chisquare([len(A_1),len(A_2)],f_exp = [half,half])" 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "markdown", 1406 | "metadata": {}, 1407 | "source": [ 1408 | "# Conclusion 3: p_value = 0.327 & no SRM" 1409 | ] 1410 | } 1411 | ], 1412 | "metadata": { 1413 | "kernelspec": { 1414 | "display_name": "Python 3", 1415 | "language": "python", 1416 | "name": "python3" 1417 | }, 1418 | "language_info": { 1419 | "codemirror_mode": { 1420 | "name": "ipython", 1421 | "version": 3 1422 | }, 1423 | "file_extension": ".py", 1424 | "mimetype": "text/x-python", 1425 | "name": "python", 1426 | "nbconvert_exporter": "python", 1427 | "pygments_lexer": "ipython3", 1428 | "version": "3.7.4" 1429 | } 1430 | }, 1431 | "nbformat": 4, 1432 | "nbformat_minor": 2 1433 | } 1434 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Statistical Simulation in Python 2 | 3 | # Project Summary 4 | 5 | This repo hosts R/Python code to common statistical distributions. This is an on-going project, and currently it has three parts: 6 | 1. sampling (part 1) 7 | 2. distribution and applications (part 2) 8 | 3. An end-to-end intro to A/B tests 9 | 4. Why A/A Tests? What if we fail A/A tests? 10 | 11 | In part 1, I introduce how to create random numbers, sample with equal and unequal probabilities, their applications. In part 2, I apply these distributions to solve real-life Data Science Interview questions. Part 3 walks through how to conduct an A/B test end-to-end with simulated data. 12 | 13 | The entire write-ups are available here. 

Part 1: https://towardsdatascience.com/statistical-simulation-in-r-part-1-d9cb4dc393c9

Part 2: https://towardsdatascience.com/statistical-simulation-in-python-part-2-91f71f474f77

Part 3: https://towardsdatascience.com/a-practical-guide-to-a-b-tests-in-python-66666f5c3b02?source=friends_link&sk=null

Part 4: https://towardsdatascience.com/an-a-b-test-loses-its-luster-if-a-a-tests-fail-2dd11fa6d241?sk=8d4cebf2d3362704a4121b4518364c36

## Installing

For part 1, please install dplyr in R.

For part 2, please install numpy in Python.

For part 3, please install pandas in Python.

For part 4, please install pandas, numpy, scipy, and seaborn in Python.

## About the Author

Leihua Ye is a Ph.D. researcher at UC Santa Barbara. He has received extensive training in Causal Inference, Research Design, Machine Learning, and Big Data.

He received his B.A. and M.A. from the University of Nottingham.

## Contact

Email: yeleihua@gmail.com

LinkedIn: www.linkedin.com/in/leihuaye

Tech Blog: https://leihua-ye.medium.com
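
## Example: A/A Checks (part 4)

The two checks from the A/A notebook (part 4) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the notebook's own code: it re-randomizes the 50/50 split on every iteration instead of subsampling two fixed groups, it uses `np.random.default_rng(123)` rather than `random.random()`, and the helper name `aa_false_positive_rate` and the choice of 1,000 splits are hypothetical. The population sizes and parameters follow the notebook's simulated regular and heavy users.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # assumed seed, for repeatability only

# Simulated population, following the notebook: 1,000 regular users plus 50 heavy users.
regular_user = rng.normal(loc=100, scale=5, size=1000)
heavy_users = rng.normal(loc=120, scale=25, size=50)
total_user = np.concatenate([heavy_users, regular_user])


def aa_false_positive_rate(values, n_splits=1000, alpha=0.05):
    """Share of random 50/50 splits whose two-sample t-test gives p < alpha.

    Hypothetical helper: each iteration draws a fresh random assignment,
    so the result estimates how often an A/A test flags a difference.
    """
    rejections = 0
    for _ in range(n_splits):
        in_a1 = rng.random(values.size) < 0.5  # fresh random assignment
        p_value = stats.ttest_ind(values[in_a1], values[~in_a1]).pvalue
        rejections += int(p_value < alpha)
    return rejections / n_splits


fpr = aa_false_positive_rate(total_user)

# SRM check for a single split. The expected counts are floats,
# so they always sum to the observed total, even when it is odd.
in_a1 = rng.random(total_user.size) < 0.5
n1, n2 = int(in_a1.sum()), int((~in_a1).sum())
srm = stats.chisquare([n1, n2], f_exp=[(n1 + n2) / 2, (n1 + n2) / 2])

print(f"A/A false positive rate: {fpr:.3f}")
print(f"SRM p-value: {srm.pvalue:.3f}")
```

With a healthy randomization, the false positive rate should land near the chosen `alpha`, and the SRM p-value should usually stay above 0.05. A rate far from `alpha` or a very small SRM p-value would suggest a problem with the assignment.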