├── README.md └── ipython ├── CaseStudy_Churn_Analysis.ipynb ├── Lecture_Bagging_RandomForests.ipynb ├── Lecture_BiasVariance.ipynb ├── Lecture_Binning_NonLinear.ipynb ├── Lecture_Clustering.ipynb ├── Lecture_DecisionTrees.ipynb ├── Lecture_ERM_LogReg.ipynb ├── Lecture_FeatureSelection.ipynb ├── Lecture_GradientTreeBoosting.ipynb ├── Lecture_Metrics_Ranking.ipynb ├── Lecture_NumPyBasics.ipynb ├── Lecture_PandasIntro.ipynb ├── Lecture_PhotoSVD.ipynb ├── Lecture_Regularization.ipynb ├── Lecture_Resampling.ipynb ├── Lecture_SVM.ipynb ├── Lecture_SimpleOverfittingExample.ipynb ├── Lecture_SimpleiPythonExample.ipynb ├── Lecture_TextMining.ipynb ├── Lecture_kNN.ipynb ├── churn_analysis.py ├── course_utils.py ├── data ├── Cell2Cell_data.csv ├── Cell2Cell_info.pdf ├── ads_dataset.txt ├── advertising_events.csv ├── boson_testing_cut.csv ├── boson_training_cut_2000.csv ├── osquery_contributors.html └── spam_ham.csv ├── references ├── churn_architecture.png ├── churn_dataset_info.pdf ├── churn_sampling_scheme.png └── intro ds syllabus - fall 2015.pdf └── utils ├── .eval_plots.py.swp ├── ClassifierBakeoff.py ├── ClassifierBakeoff.pyc ├── bias_variance.py ├── eval_plots.py └── eval_plots.pyc /README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/README.md -------------------------------------------------------------------------------- /ipython/Lecture_ERM_LogReg.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#A Primer on Empirical Risk Minimization" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "##Notations and Definitions" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "

Let's first set up some notation and ideas:

\n", 22 | "" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "

Second, let's define two more things:

\n", 47 | "" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "##Emprical Risk Minimization" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "

The main goal of Supervised Learning can be stated using the Empirical Risk Minimization framework of Statistical Learning.

\n", 79 | "We are looking for a function $f\\in \\mathbb{F}$ that minimizes the expected loss: \n", 80 | "
\n", 81 | "
\n", 82 | "

$E[\\mathbb{L}(f(x),y)]=\\int \\mathbb{L}(f(x),y)\\, P(x,y)\\:\\mathrm{d}x\\mathrm{d}y$
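The integral above is rarely available in closed form, but if we could sample from $P(x,y)$ we could approximate the expectation by averaging — which is exactly what the empirical risk below does, with the training sample standing in for $P$. A toy Monte Carlo illustration (the distribution, candidate function, and squared loss are all invented for this sketch):

```python
import numpy as np

np.random.seed(1)
n = 100000
x = np.random.randn(n)          # pretend P(x) is standard normal
y = 2 * x + np.random.randn(n)  # pretend y is 2x plus unit-variance noise

f = lambda z: 2 * z             # a candidate prediction function
sq_loss = (f(x) - y) ** 2

# Monte Carlo estimate of E[L(f(x), y)]; the true value here is the noise variance, 1.0
print(np.mean(sq_loss))
```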
\n", 83 | "\n", 84 | "
\n", 85 | "Because we don't know the distribution $P(X,Y)$, we can't minimize the expected loss. However, we can minimize the empirical loss, or risk, by computing the average loss over our training data.

Thus, in Supervised Learning, we choose the function $f(X)$ that minimizes the loss over training data:\n", 86 | "

\n", 87 | "
$f^{opt}= \\underset{f \\in \\mathbb{F}} {\\mathrm{argmin}} \\frac{1}{n} \\sum\\limits_{i=1}^n \\mathbb{L}(f(x_i),y_i)$
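To make the objective concrete, here is a minimal sketch of empirical risk minimization over a tiny hypothesis set. The data, the candidate functions, and the use of squared loss are all invented for illustration — they are not from the lecture's dataset:

```python
import numpy as np

# Hypothetical training sample
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

# A tiny hypothesis set F: candidate functions f(x) = w*x
candidates = {w: (lambda x, w=w: w * x) for w in [0.5, 1.0, 1.5]}

def empirical_risk(f, x, y):
    # Average squared loss over the n training points
    return np.mean((f(x) - y) ** 2)

# argmin over F of the average training loss
risks = {w: empirical_risk(f, x, y) for w, f in candidates.items()}
w_opt = min(risks, key=risks.get)
print("w_opt=%s, empirical risk=%.3f" % (w_opt, risks[w_opt]))  # w=1.0 minimizes it here
```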
\n", 88 | "\n", 89 | "\n", 90 | "

" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "#Logistic Regression" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "##Definition" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "

Logistic Regression is a member of the class of generalized linear models (GLMs) that uses the logit as its link function.<br>

\n", 112 | "\n", 113 | "The goal of Logistic Regression is to model the posterior probability of membership in class $c_i$ as a function of $X$. I.e.,\n", 114 | "
\n", 115 | "
\n", 116 | "
\n", 117 | "

$P(c_i|x)=f(x)=\\frac{1}{1+e^{-(\\alpha+\\beta x)}}$<br>
\n", 118 | "
\n", 119 | "
\n", 120 | "To make this a linear model in $X$, we take the log of the odds ratio of $p$ (called the log-odds):\n", 121 | "
\n", 122 | "
\n", 123 | "
\n", 124 | "
$\\ln \\frac{P(c_i|x)}{1-P(c_i|x)} = \\ln \\frac{1}{e^{-(\\alpha+\\beta x)}} = \\alpha+\\beta x$<br>
\n", 125 | "
\n", 126 | "
\n", 127 | "And effectively we do a linear regression against the log-odds of $P(c_i|x)$ (though we don't use least squares).\n", 128 | "

" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "##LogReg as ERM" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "

How do we fit Logistic Regression into the ERM framework?

\n", 143 | "\n", 144 | "We find the parameters $\\alpha$ and $\\beta$ using the method of Maximum Likelihood Estimation.

\n", 145 | "\n", 146 | "If we consider each observation to be an indepedent Bernoulli draw with $p_i=P(y_i|x_i)$, then the likelihood of each draw can be defined as: $p_i^{y_i}(1-p_i)^{1-y_i}$, with $p_i$ given by the inverse logit function. In MLE, we wish to maximize the likelihood of observing the data as a function of the independent parameters of the model (i.e., $\\alpha$ and $\\beta$). The total likelihood function looks like:

\n", 147 | "\n", 148 | "

$L(\\alpha,\\beta|X,Y)=\\prod\\limits_{i=1}^nP(x_i,y_i|\\alpha,\\beta)=\\prod\\limits_{i=1}^np_i^{y_i}(1-p_i)^{1-y_i}$
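This product form also motivates the "little trick" described next: multiplying many probabilities together underflows floating point, while summing their logs stays well-scaled. A small demonstration with made-up per-draw likelihoods:

```python
import numpy as np

np.random.seed(0)
p = np.random.uniform(0.1, 0.9, size=5000)  # made-up per-observation likelihoods

print(np.prod(p))         # underflows to 0.0 -- the raw likelihood is numerically useless
print(np.sum(np.log(p)))  # the log-likelihood is an ordinary, workable number
```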
\n", 149 | "

\n", 150 | "This is actually a difficult equation to maximize directly, so we do a little trick. We take the negative log and call this our loss function for ERM!\n", 151 | "\n", 152 | "
$\\mathbb{L}(f(X),Y)=-\\ln[L(\\alpha,\\beta|X,Y)]=-\\sum\\limits_{i=1}^n \\left[y_i\\,\\ln(p_i)+(1-y_i)\\,\\ln(1-p_i)\\right]$<br>
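This negative log-likelihood is exactly the familiar log-loss. As a sanity check, here is the formula computed by hand and via scikit-learn, using made-up labels and fitted probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])            # made-up labels y_i
p = np.array([0.9, 0.2, 0.7, 0.6, 0.4])  # made-up fitted probabilities p_i

# The loss, written out exactly as in the formula above
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same quantity from scikit-learn (normalize=False returns the sum, not the mean)
assert np.isclose(nll, log_loss(y, p, normalize=False))
```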
\n", 153 | "\n", 154 | "\n", 155 | "\n", 156 | "\n", 157 | "\n", 158 | "\n", 159 | "\n", 160 | "

" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "##Example 1 - Building and Looking at a Model" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 1, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [ 177 | { 178 | "name": "stdout", 179 | "output_type": "stream", 180 | "text": [ 181 | "Optimization terminated successfully.\n", 182 | " Current function value: 0.253781\n", 183 | " Iterations 9\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "'''\n", 189 | "Let's train an actual model and see how well it generalizes\n", 190 | "'''\n", 191 | "import math\n", 192 | "from sklearn.metrics import confusion_matrix, roc_auc_score\n", 193 | "from sklearn import linear_model\n", 194 | "import numpy as np\n", 195 | "import course_utils as bd\n", 196 | "import pandas as pd\n", 197 | "import statsmodels.api as sm\n", 198 | "import matplotlib.pyplot as plt\n", 199 | "import warnings\n", 200 | "warnings.filterwarnings('ignore')\n", 201 | "reload(bd)\n", 202 | "%matplotlib inline\n", 203 | "\n", 204 | "\n", 205 | "#Load data and downsample for a 1/10 pos/neg ratio, then split into a train/test\n", 206 | "f='/Users/briand/Desktop/ds course/datasets/ads_dataset_cut.txt'\n", 207 | "target = 'y_buy'\n", 208 | "tdat = pd.read_csv(f, header = 0, sep = '\\t')\n", 209 | "moddat = bd.downSample(tdat, target, 10)\n", 210 | "\n", 211 | "#We know the dataset is sorted so we can just split by index\n", 212 | "train_split = 0.75\n", 213 | "train = moddat[:int(math.floor(moddat.shape[0]*train_split))]\n", 214 | "test = moddat[int(math.floor(moddat.shape[0]*train_split)):]\n", 215 | "\n", 216 | "#Using Scikit-learn the model is built with two easy steps.\n", 217 | "logreg = linear_model.LogisticRegression(C = 1e30)\n", 218 | "logreg.fit(train.drop(target, 1), train[target])\n", 219 | "\n", 220 | "#But we are going to also build using the statsmodel package\n", 221 | "logit_sm = sm.Logit(train[target], train.drop(target, 1))\n", 222 | "lr_fit = logit_sm.fit()" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "

Logistic Regression has long been used as a tool in statistics and econometrics, so there is a lot of additional information one can get out of a logistic regression model beyond what one might get with standard machine learning tools.<br>

\n", 230 | "We showed how to use scikit-learn to fit a model, but we also used statsmodel. The reason is that statsmodel returns summary statistics on each coefficient fit to the variables. In machine learning, we often only focus on the generalizability of the prediction. But in many analytical applications we want to know how statistically significant are the estimates within our model.\n", 231 | "

" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 2, 237 | "metadata": { 238 | "collapsed": false 239 | }, 240 | "outputs": [ 241 | { 242 | "data": { 243 | "text/html": [ 244 | "\n", 245 | "\n", 246 | "\n", 247 | " \n", 248 | "\n", 249 | "\n", 250 | " \n", 251 | "\n", 252 | "\n", 253 | " \n", 254 | "\n", 255 | "\n", 256 | " \n", 257 | "\n", 258 | "\n", 259 | " \n", 260 | "\n", 261 | "\n", 262 | " \n", 263 | "\n", 264 | "\n", 265 | " \n", 266 | "\n", 267 | "
Logit Regression Results
Dep. Variable: y_buy No. Observations: 2087
Model: Logit Df Residuals: 2074
Method: MLE Df Model: 12
Date: Tue, 10 Nov 2015 Pseudo R-squ.: 0.1523
Time: 20:05:51 Log-Likelihood: -529.64
converged: True LL-Null: -624.83
LLR p-value: 3.151e-34
\n", 268 | "\n", 269 | "\n", 270 | " \n", 271 | "\n", 272 | "\n", 273 | " \n", 274 | "\n", 275 | "\n", 276 | " \n", 277 | "\n", 278 | "\n", 279 | " \n", 280 | "\n", 281 | "\n", 282 | " \n", 283 | "\n", 284 | "\n", 285 | " \n", 286 | "\n", 287 | "\n", 288 | " \n", 289 | "\n", 290 | "\n", 291 | " \n", 292 | "\n", 293 | "\n", 294 | " \n", 295 | "\n", 296 | "\n", 297 | " \n", 298 | "\n", 299 | "\n", 300 | " \n", 301 | "\n", 302 | "\n", 303 | " \n", 304 | "\n", 305 | "\n", 306 | " \n", 307 | "\n", 308 | "\n", 309 | " \n", 310 | "\n", 311 | "
coef std err z P>|z| [95.0% Conf. Int.]
isbuyer 0.9714 0.395 2.462 0.014 0.198 1.745
buy_freq -0.1588 0.203 -0.783 0.433 -0.556 0.239
visit_freq 0.0311 0.021 1.453 0.146 -0.011 0.073
buy_interval -0.0030 0.014 -0.208 0.835 -0.031 0.025
sv_interval 0.0085 0.007 1.210 0.226 -0.005 0.022
expected_time_buy 0.0101 0.011 0.935 0.350 -0.011 0.031
expected_time_visit -0.0328 0.007 -4.434 0.000 -0.047 -0.018
last_buy 0.0115 0.005 2.284 0.022 0.002 0.021
last_visit -0.0597 0.006 -10.295 0.000 -0.071 -0.048
multiple_buy 1.4701 1.100 1.337 0.181 -0.686 3.626
multiple_visit -0.1800 0.220 -0.819 0.413 -0.611 0.251
uniq_urls -0.0105 0.002 -6.466 0.000 -0.014 -0.007
num_checkins -2.867e-05 8.81e-05 -0.325 0.745 -0.000 0.000
" 312 | ], 313 | "text/plain": [ 314 | "\n", 315 | "\"\"\"\n", 316 | " Logit Regression Results \n", 317 | "==============================================================================\n", 318 | "Dep. Variable: y_buy No. Observations: 2087\n", 319 | "Model: Logit Df Residuals: 2074\n", 320 | "Method: MLE Df Model: 12\n", 321 | "Date: Tue, 10 Nov 2015 Pseudo R-squ.: 0.1523\n", 322 | "Time: 20:05:51 Log-Likelihood: -529.64\n", 323 | "converged: True LL-Null: -624.83\n", 324 | " LLR p-value: 3.151e-34\n", 325 | "=======================================================================================\n", 326 | " coef std err z P>|z| [95.0% Conf. Int.]\n", 327 | "---------------------------------------------------------------------------------------\n", 328 | "isbuyer 0.9714 0.395 2.462 0.014 0.198 1.745\n", 329 | "buy_freq -0.1588 0.203 -0.783 0.433 -0.556 0.239\n", 330 | "visit_freq 0.0311 0.021 1.453 0.146 -0.011 0.073\n", 331 | "buy_interval -0.0030 0.014 -0.208 0.835 -0.031 0.025\n", 332 | "sv_interval 0.0085 0.007 1.210 0.226 -0.005 0.022\n", 333 | "expected_time_buy 0.0101 0.011 0.935 0.350 -0.011 0.031\n", 334 | "expected_time_visit -0.0328 0.007 -4.434 0.000 -0.047 -0.018\n", 335 | "last_buy 0.0115 0.005 2.284 0.022 0.002 0.021\n", 336 | "last_visit -0.0597 0.006 -10.295 0.000 -0.071 -0.048\n", 337 | "multiple_buy 1.4701 1.100 1.337 0.181 -0.686 3.626\n", 338 | "multiple_visit -0.1800 0.220 -0.819 0.413 -0.611 0.251\n", 339 | "uniq_urls -0.0105 0.002 -6.466 0.000 -0.014 -0.007\n", 340 | "num_checkins -2.867e-05 8.81e-05 -0.325 0.745 -0.000 0.000\n", 341 | "=======================================================================================\n", 342 | "\"\"\"" 343 | ] 344 | }, 345 | "execution_count": 2, 346 | "metadata": {}, 347 | "output_type": "execute_result" 348 | } 349 | ], 350 | "source": [ 351 | "#Use statsmodel if you want to understand the fit statistics of the LR model\n", 352 | "lr_fit.summary()" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "

\n", 360 | "A Practical Aside

\n", 361 | "\n", 362 | "What exactly does the estimate of $\\beta$ really mean? How can we interpret it?

\n", 363 | "\n", 364 | "Recall that $Ln \\frac{p}{1-p}=\\alpha+\\beta x$. This means that a unit change in the value of $x$ changes the log-odds by the value of $\\beta$. This is a mathematical statement that IMHO does not offer much intuitive value.

\n", 365 | "\n", 366 | "

So what can we learn by looking at betas? (IMHO, not much!)


\n", 367 | "Some helpful tips, garnered from theory and experience:\n", 368 | "\n", 374 | "\n", 375 | "

" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "##Example 2 - Robustness" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "

In this example we test the sensitivity of out-of-sample performance to training set sample size. Our goal is to plot test-set $AUC$ as a function of $N$, the number of samples in the training set. Because we expect a lot of variance in the lower range of $N$, we use a bootstrap algorithm to compute confidence bounds for the $AUC$ measurements.<br>

\n", 390 | "\n", 391 | "To Bootstrap:\n", 392 | "

\n", 397 | "\n", 398 | "\n", 399 | "

" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 11, 405 | "metadata": { 406 | "collapsed": false 407 | }, 408 | "outputs": [ 409 | { 410 | "data": { 411 | "text/plain": [ 412 | "" 413 | ] 414 | }, 415 | "execution_count": 11, 416 | "metadata": {}, 417 | "output_type": "execute_result" 418 | }, 419 | { 420 | "data": { 421 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtQAAAIHCAYAAAC/h8txAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3XeYZFWd//F3TU9imBkGGHKYAQElIyg5lIJIlAVBSSoK\nyurqGlAUw9rmtO66K66yyooJUAEFRFBBBpQoKkkQGGAkh4EhZ7i/P763fnW7pqq7qivcqtvv1/Pc\np8K9de451T3Tnzp17jkgSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkqaAWAbvmXYkB8Hng\nQeAeYC3gcaDU4Nhh4Ee9qVbbfg28ZZyv/TbwyQ7WRZIk5WwR8BQRdB4GfgWs2YFy5wMvAZOaPP4l\nYN0OnLdXbgdeO8Yxw0S7tq7zfL3gWPsevB64GHgMeABYAOzbck3HtjUREJcADwFXAEd0oNy1id+t\nFZs8/tP0V6A+AvhDl89RBu7s8jkkdUmzf+AkFV8C7APMAlYD7ge+2cHyG/VGtnrs5HYr0mMl4K3A\ndeltVtLE6w8EfgacBKwBrAz8G80F6vlE4G/GdsAFwIXAy4jw+25gjyZfP5q1iYD+UAfKkiRJ6lu1\nPa17ATdlHi8H/JDoIV0EfIJq8C0RX10vIoL4D4DZ6b47iB7Xx9NtG2A94CLgEWIYwCnpsRenxz6R\nHnsQ0XN3F3AscG9a9hyiB/0Bojf9bCJsViwAvkT0sD4K/BJYPt03Pz3HO4G7iSEIx2ReWwI+BiwE\nFgM/zbwW4uv9f6T7Pl7nfau1c3psOb2dktk3zOg91CXi/TumzjHNmE/zgfqPjP0B6p3ALUQwPpP4\n4FXxEnA0cDPRw318+vxuRO/0i8TP9P9Y+luLdYjfh8eA36b1yL4v2wKXpuVeDeyS2bcA+Gxa/8eA\n3zCyJ3zHzGvvAN6WPj8N+HfiZ3kfMfxieoN2H0HjHuoFwJGZ4y4B/iM930Jge+Dt6bnvZ+SHqpOA\nzwEzgKepvkePAasS3xhcRfwO3wd8vUEdJElSn7id6ljgGURwPSmz/4fAL4BlgXlE2H5Huu8dRNCa\nn+4/PT2e9NjaIR+nAMel96cSoaOidrhDGXieCMhTiNCzArB/en8m0YP7i8xrFhAhfKO0LadRDWjz\n03P8BFgG2IQI5pW2v58IYKun5/sOcHK6byMi8OyY1vvrad1GC9QnAt9N798JHJDZN8zogfoV6f15\no5Q/mvk0F6hnAC8wMqjWei3x4WcLou3/TYTgipeAs4gPUmsR7+nr0327MHI4w3xG/k5cRoTbKcBO\nRKCs/P6sQXwQqfSU75Y+roTmBcTv3nrE78OFxO8KxPv2GPBmYIj4vdk83fefxAetOcTv0FnAFxu0\n/QgaB+oLqf47OIL4fXgb8WHoc8Tv4TfTtr0urc+M9PjvEx8GYOn3COJ9OSy9P4P4MCpJkvrYIiIs\nLgGeI4LAJum+IeBZIuBVvIsIExBDBf45s2+DtIxJ1B9D/QPgBEb2KlfUC9TPEiGukS2InuqKCxkZ\njjZMyyhl6rNBZv9XgO+l929kZEBeLW3LEDHU4uTMvhlpuY0C9Qyid3H39PE3iBBXMczogXqH9P5o\nbR/NfJoL1Guw9HtS60Tgy5nHyxLvy9rp45cY+cHop8BH0/tlGgfqtYkQukxm/0+oBuqPZu5XnEe1\np/dC4puCincD56b3jyM+3NUqEd+CZH/PtgNuq3MstBaob87s25Ro50qZ5xYDm6X3v0+Ebqg/hvoi\n4ndkboNzS+oTjqGWVJEA+xHDG6YB7yP+oK9M/EGfQnw9XnEH1UC8Wp19k4FVGpzrWCLUXAlcT3wl\nPpoHifBWMYMI5IuIwHoRMSQlO/Y6G07uSOs/d5T9q6f35xG93UvS7Qai93YVop13ZV73FKOPC96f\nCIsXpI9/DuxJtXf1eUYOASHz+PlM2avRvEMzdb+GCKyVxw9T/0LTJUTwG+08tT/jJ9P6ZT8U3Ze5\n/xTR8zuW1dPzP5157h9Uf5bziKE/SzLbDsSQiHrnfTpz3rWoH5JXIn6H/pwp81w6E1zvr6kLxO9v\nvfqN5UjiQ86NxL+VvduunaSuMFBLqichQuWLxPCGxUTAm585Zm2q4fKeOvteIMJFvQvv7id6uNcg\nxt3+D6PP7FFbxjFE0NiaCNK7EAEsG6jXrrn/fNqORvvvTu/fQQwvWD6zzSDaeC8R0ipmMPrMFW8j\nLvK8K33t6URgrnyNfwcj3zeI8cQvpPW5iQj+B45yjlonZ+q9WXqOyuMVGPmBoOIpYnjBaOep/Rkv\nS7T97rpHh2YuuryX6ntcMS/z2juIXvzsz2MW8NUmyr6DuMCy1mIi2G6UKXMO1XH/vZTU3GYtJD4g\nrUR8i3IaI3vyJfUJA7WkrOxFhpXe6huJYP0z4AtE79o84IPAj9PjT0kfz0/3fxE4lej1fDC9zQab\ng6j2lD5ChImX0sf3Uz8EZc0kAtGjREj8dJ12HE4M9ZhBjFP9OSNDyyeJcLIx8VX9T9Pnv5PWvxK4\nVwLekN4/jZgJZQdiGMZnafz/6BrEUJC9iXG7le0rVIcr/IYYRnM4EbRXSM99GvF+JMCHgE+ldZyd\nnm9Hood+LK3MrHJseo4PU/2QsDnVC0ZPIb5J2Jz4BuOLwOVEaG107mbO/w/iwrvPEO/BjsR7XPFj\nYkaT3YlhN9OJ4RHZnvFG5zmZGHN9EPGNyYpp/V8ixrV/g+pwjDWoDs1p1J5p6fkrW7uy79H9af2y\nof7wTP0eZeS/E0mS1IdupzoP9WPAtcAhmf1ziJ7CB4gQ9UlGBvBPpc8/QIx5XS7z2s9QnZFjGyJU\n3pWeayFwVObYo4ne0CVEj+kuLB3aViPGrj4O/J3o7X6RaritjKGuzPJxJhFWoTp+9yiid/VeIkRW\nlIgPB39P34eFxKIkFW9l5Cwft1F/DPXHgD/VeX41Ytz1Runj7YjxuQ+n9flfRr53UJ2H+nHiffw9\nMXRkLPNpPC64nlcT81A/QgznuJwIdRVHE+/HQ8RFfKtn9r3IyG8ZshfclRn5M5zPyJ/XOlTb91vi\ngsfsuOmtiYsPHyLafzbVD2TZMcwQ3wpcnHm8Y9qOR9M6VBZhmUZ8QLw13XcD8F7qe
xvxO5PdXiQC\nfvb8tedeLz0u606qY82z7xHEOPXFxO/CasS/t/uJ9+U6qh/sJEmSuq42ZGXNp7WFZiRJGpV/UCQV\nVSvDHSRJGjcDtaSiGu2CuGYulpMkSZIkSZIkSZIkSZJasBMxy0ivvRy4mpjdpNGMF4PqJKqrEkpS\nRzmGWpKWtojqFIKVaQRXHe0Fbapdbv0PjFzmvVeOJVZ1nA0cX2f/AmL+7+xqi7sx+vLm+xEh/VFi\nTvILWHoxm15IcOy8pC4xUEvS0hJicZFZ6Tabkctbd0M/zEoyj5iPeTRPEnOON2M94AfEvN7LEfNN\nf4ul52bulX54jyUVkIFaUr/7L2JBjkeJFfV2zOzbOn3uUSLwfr1BGXOAX1FdXOZsRq6016xFwK6Z\nx8PE4htQnd+6svDLg8TCLxWT0scLiR7vPxE9vZWFQK4hesMPIhZCuTPz2g2J3uElwPXEyoEVJxEh\n9VdpuZcz+jLubwD+lpZ1IdWe8N+n5z0+LWe9Oq9NiEVXDhnjHBVbEL3XF6aPnwDOoNq2rYklz5cQ\ni/l8k1gtseIl4N3ALWmdPkusonkZsfjMqZnjy8RiQccR7/3txLLdjexD9JwvAS4BNs3s+2ha1mPE\n0Jt6C/dIkiQNjMOIJdAnEctw30ss+w0RrA5L788gVmGsZwVgf2K56JnEMuq/GOWctzMyOGefz4ar\nT7N0oD6BWIVvM+AZYlwywEeI1SfXTx9vRnX1xtohH2WqoXMKEcI/Riyf/Roi6G2Q7j+JWF3vVcTK\nfT+mulx4rQ2IULtreuxHiLA6Od0/2oI4lf1HEh9cKu0ebcjHOsQQkf9I2zSzZv+WRKieRLV3/P2Z\n/S8RP6eZxMqSzxLBfz7xrcHfqC7jXgaeB/6deM92Tttaeb+/T3UM9SuJFQhfTfRavzVtwxTi53UH\n1SE+a9PchwdJkqSB8TDV3sSLiF7iuS2WsUVaTiOLiN7iJel2Rvp8baAeZulAnV2O+wrgTen9mxjZ\ns5w1WqDeifgQkXUyEeYhAvX/ZvbtCdzY4DyfInp1K0pET+zO6eNKYG6kErjnEj3EGzH2GOptgJ8S\n3w48TQTbZRsc+wGq7zXE+7Jd5vFVxIeAin8H/jO9XyYC9TKZ/T8FPpnezy7z/W1GLvkN0RO9M9ED\nfj/xoWMKktQEh3xI6ncfJnouHyHC7XJUA/SRRK/rjcCVwN4NyphB9BwvIoaHXJSW02hMbUJcTLd8\nuh3QQn2zY62fotoruyZwawvlVKzOyOEfEENKKsE9IQJgxdMs3RNcsRrR+1qRpGWvUfPcWBYTQ0M+\n28TxVwBvBlYmPhzsDHwi3bcBMVTlXuLn8gVgxZrX17Yt+/gZRrZ1SXpMxT+INteaBxxD9QPTEuLn\nsxrxM/oA8WHpfqK3v14ZkvT/Gagl9bOdiB7Jg4hx0MsTwasShBcS42RXAr4CnMbIHsqKY4jwtjUR\npHdJy2j1IrUnGdm72srMH3dSf1zyWO4B1mJkXecBd4+zrHmZx6W07PGU9TVi+MlWLbzmKmIIx8bp\n428TH5bWI34un6C1v0u1YX554sNTxTyizbXuIML78pltJtGjDRGid0pfnxC/W5LUkIFaUj+bBbxA\n9IhOBf6NGDtbcTgRpiGCdkIME6g1k+i5fJQYt/zpOsc042rgYGLM8auAN9L8VGzfI8bwrkcE2ewY\n6vuJoQb1XEH0dB9LDEEoExfUVYZutPKh4GdEL/5r07KOIXp5L80cM1Z5lf2PEmOpPzrKsTsAR1H9\nGb2CGPZyefp4JjG05ql037ubaEOpwf2KzxBt24lo688zx1aO/y7wz8QHrBLxIWnvtD4bEO/PNGLM\n9jPkNyuJpAFhoJbUz85Lt5uJ4RpPM3LIwuuJWS8eJ8bSHkyEoFrfIHquFxPh8VzGNyfxp4jgu4QY\nEvCTmv2jlfkfRKD9LRFGv0tcJEla1g/Scg9k5JzJzxEhdE9i9orjgbcQ7wnUn1+5UT1uJj6EfDMt\na++07BeabEPt/v9KX9voNY8Qs4pcR/yMziXGSH813f9h4huGx4hx4KfWlFWv3Nr92cf3UZ0x5EfA\n0dR/n/4MvJN4Lx8mLsysXNw4DfgS8f7cSwwvOq5B+ySpJ/YgLvS4hfq9GMsTX/9dQ/TCbFznGEmS\nxlJm6bHmkjTwhojxjfOJr9+uJuZSzfoa1QUCXg6c36vKSZIKpYyBWlJOujnkY2siUC8ipjI6lbhq\nPmtDqhP+30SE75WQJKl1Li0uKRfdDNRrMLK34C6WXpnsGqrTUW1NXFG9ZhfrJEkqpgXEIiyS1HOT\nxz5k3JrpKfgycVHLX4mLVv5K/aupF9L4CnhJkiSpU25lfNOcdsW2xNX5Fccx+vRKEKtt1VuQwK/x\nOms47woUyHDeFSiY4bwrUDDDeVegQIbzrkDBDOddgYIZzrsCBdNy7uzmkI+rgPWJcdFTiZWyzqo5\nZrl0H8QURhcBT3SxTpIkSVJHdXPIxwvAe4HfEDN+nEgsD3x0uv8EYCPgJOKTwPXEMsKSJEmSOswh\nH51VzrsCBVLOuwIFU867AgVTzrsCBVLOuwIFU867AgVTzrsCBVPY3FnYhkmSJKmv9NUYakmSJKnw\nDNSSJElSGwzUkiRJUhsM1JIkSVIbDNSSJElSGwzUkiRJUhsM1JIkSVIbDNSSJElSGwzUkiRJUhsM\n1JIkSVIbDNSSJElSGwzUkiRJUhsM1JIkSVIbDNSSJElSGwzUkiRJUhsM1JIkSVIbDNSSJElSGwzU\nkiRJUhsM1JIkSVIbDNSSJElSGwzUkiRJUhsM1JIkSVIbDNSSJElSGwzUkiRJUhsM1JIkSVIbDNSS\nJElSGwzUkiRJUhsM1JIkSRNT2bIB2An4NPA/wCHjKWByR6sjSZI0fmVgQc51mEjKdO/9zrPsLYBd\ngFUy28rAD4Bv1Tl+OWBdYCXGmY0N1JIkqRVlBjOEdVOZ/n5PlgHWJILlqplt/QbH7wUcSoxkyG6/\nAk6qc/xBwD/XHDsPWAwcX+f4A4GjgAR4KXP7C+D7dY7fJ61P5bjNge2AsxuUvxawHnA/cHl6+wBw\na4P2/irdKt7Z4LiGDNSSJKkVZRoHvMnAi0TwqfU6YFki3M3I3H49fU2tWcDj7VW1Z8r0PlBPZ+mA\nfDvw2zrHHg58DLgPeIF4759My74xPWZB5jyLgPOI8Jrdbm5Qxz8BS4BNiLCbADsSgX24pmyAPwNP\nEMG7lLm9rUH5t6bbxunjTYFridBcZun35+x06xkDtSRJGo8/ASsyMiAPEV+fP1Hn+H8lQtnTwFOZ\n29cQ4QtiHCtEwPpQWs7N6XZTensW9QP7WMq0H3pLwBxgdWCNtH6X1jluc+DNRD2zvbDXAqfXOX4j\n4A11jr+xzrEAbwG+R/S83pe5fbjB8d9Nt1rD6VbrhnRr1qJ0Oz/z3B0NyoYI/re3UP6NwKcyj4dH\nKTsXBmpJkvJRpv+GN0wCXg5sk27bAvsR
40vL6TGV0HslcAnRhkpAfo7GYXffUc6bDWLDmfOsntZn\ng/R2C+DMOq+fDmxPBO67G9ShzOjv9zRgJvBQnX2vBU4gQvTz6TmeIoY0XE71PSE9x4NED3Aps1V6\nYeuZCiyfOXbtdLsX2L+m7AXAKcCPGd8HC01g/sJIkopmOO8K1DgeeIT4av1k4P1EoJ5ac9xwF+sw\n3rLXBC4iAugTwF+BU4EPNCh7A+B/gXOAq4kA/Fz6mnrmEIF+ZoP9ww2e74Rull227Lpazp32UEuS\nuq1Mf1+wlVf5k4nhEY8TX/F3Upml6z0deCUROhfVec13gM8Q4TIvC8b5uruIWR0g3tP1iQC8KSN7\nvCuuB/5CXIh2d7o9QOOfwyPpVjQLLLszDNSSJBjc0NvNslstfx1gN2AuMbZ4brpdDnw+U145vf8J\n4MNED3Bl+MCjRC/pl+qUvx6wGRHsHqUa8h4lLjSrrfctxPjkyvCNTYhxyJ+kfqC+vsl2LmjyuPHo\nRNmPAlelW63hDpRfz4IuldvtstUhBmpJ6qwyg/kHsEzz9Z5E9AJWAuPz1A8vuxBfq68J/AvV8aO/\nJabAqrUnMTa0clzl9jzgTXWO34sIpR8gvq5/Dng2Lf/ddY7fGjg2c1zlNVcDJ9Y5fj3iYrmvZ9q6\nInAFMRyi1qpEcF1MXCD2t/T+wswxCxj5Pg8TF/LNJoYVNLqgD2A+MVvDnJrtZOC9dY7fjZhu7Arg\nZ8TMCk81KLsVCzpQRtEsGNCy1SEGaknqrDKD9wdwiAhmryJC43PA7+sctwNwBrACccHV4nS7kKUD\ndRl4PfA7IkxXFlO4JD2+nvOJ0AjVMYwJ9Xtfy0Swn0wMVRgiLpL7CxGW67kL+CnRI5zdFjcof2Vg\nV2IYwFPAH4mf7V0Nyr8s3Vr1IjHl2JIxjjufkRfv1Soz8sLBzxCh/irgD+OoV9EsyLsCUt68KFFF\nV867AhNMuYNlbQV8gViM4LdEb+S/E0vZ9rvDiKmtKjMz3EPU/wcNjl+G6IWtvUhtLMPjrF/eZXe7\n/HIXyx7uYtlS0XlRojSgyth70ktlGr/fc4kLtyrzzK6R3r8c+Eqd46cCqxG9tg8Qi1dskb5mqM55\n5hNDJO6h+50FyxJjbtchhgXUOpO4GOtqYtjE8BjlPZ1u6owFeVdAUmcYqKV8lYnVq14k/j3WfrWt\nzppDdW7dg6k/RdargI8Qgfdu4iKu3xNfnddT+zX/QkYPpgcRF6JNIRZ5qGzn0XgoQUWZ0UPYEPBB\n4gPBK4nwfgMxFOIUlg7wT4xRXqd08xzdLLsX5XfLgrwrIKn/OORDRVQmQk5l1a/7gP9jsIZ/lPOu\nQBPmExfG3U6MrV1EvN+nEcG33OHzDTd53CrERWMfAk6i8RCRFYmxwtmyV6Nxh8iXgCOIldqmNFkX\nGIyfpST1QmFzZ2EbJqWGiWVq7wT+J9+qtGQ47wpkNBrXuwoxA8JWVAPmcBfrUe5weWcT8xRfRkyD\ndi9xEd26HT6PJCk4hlrqU+sSwwcazT4AMfvAOcCWPalRe2YR03et3sVzlGn8tXUJmEcsNbw9sB3V\nsc61CzPcT6wA1ysLOlze14nhJisDbwe+QcyzuzZwW4fPJUkqMHuoNahWI6YLe4gYu9tIuSe1ad8a\nxMVtTxFjcxOit3eYkW34KjFN10+B/yTGJB+avr5Zww2eLxHjmu8FTgeOIQL1tBbKLo95RH8azrsC\nkjQB2EMt9YnliQUk3kWMj305S891m7VgjPJKRO/r3R2o23isBnwZ2Bf4MbHi2m1Uw3StbxK97atn\nti2J8cv12vCvRNi+J93uBTYEVmLpZZATYGdiRo3xftheMM7XSZK0FAO11HnrEiuTnUlMn3ZnB8rc\nCLgY+C/ga/R+6rKniGEHH2DsxScg2txKu28m5jjeiQjSs4nV9R4kPogsYGQIvr+FsotkQd4VkCQN\nLod8aJCU6M4FY/OImSluA/ajuixznspdLHu4i2VLktRIYXNnYRvWhnLeFVBuXgfcCJxLLELSKSsC\nnwT27GCZ7RjOuwKSpAmp5dw5aexD1KfKeVdggisB+wBH5XDu3xFzDJ9BzPbQrvWICycXEj3r/TJz\nxIK8KyBJUpHYQz3SNCJs7ENneygHXblH5e8M/JEYU7x3l8/ZTasAvyDGKX8eWDXf6kiS1BcKmzsL\n27AxDDHyW4Qy1VkVEuBWYl7jO4iZJI7sZeX60HCXy/8OMcziduCtxM+nHy1b87jc4LipxCwktcdL\nkjSROW3egJtKrOa2M7ALsWDFLsA16f4FjPwafJgIdRsT8/DO6001+9bOwO+JfwgJscBHQiyEcW6d\n4/8F2LXO8Sek5dTamlga/J8YfYGWPJWAS4nfk08Dj9B4gZTniCW5JUlSGwzU/eMrwHuIpYUvBr4H\nHEHMtTuaF4Fr062R3Ynp1i7LbDew9IpyFWUGZ/xqmWoP7GuAHxKh8hrgOqKH/28NXnsZMd/xpPQ1\npfT+PxqU/0piiM1xLP3hpl8kxIeELwB/Jz507Ub0qv8gv2pJkqS8DeqQj3LN49k0Hqf6MmBOG2WP\nZjIRBt8D/Ii4+OwRIhjWM9xC2b1UAnYgll+uZ7jL5+92+Z1UJnraF1LtpR/Gi1klSRrLoObOMQ1q\nw74K7E8svfxn4HHgg7nWqGplYH6Dfd8iFhLpl1lgXkaEwVuJnvX3NzhuuMv16Hb53TKcdwUkSRog\ng5o7xzSIDdsbeAb4NfAxYjz01FxrNLoyIy94fJhYje/K9Lk9gFk9rlMJOJ9YFe+/iPHloy1mUu5y\nfbpdfrcM510BSZIGyCDmzqYMUsPKRID5HFHvYQbvq/bh9HYV4A3AF4ELiTHKvbYVMCWH8xZJOe8K\nSJI0QAYpd7aknxu2GjC9wb7hHtajk4ZbPP7bxDzGezP2vNjlmsclYBtijLckSVLeXCmxh7YgZk24\ngeKFwQUtHn8m8cv3AWKc883EbBvL1Dm2nN7OJ5a5/jtxoeTLWq+mJEmSmtUvPdSTiNUJLwDuIsZG\nrzDK8eUe1KnfDAGbEFP+1Rvv/A3gImAxcfHjtg2OkyRJykO/5M6O65eG7U7M1nEY/X2BYb8pM/KC\nx1OJMeblvCokSZLUQL/kzo7rl4ZVFv/Q+A3nXQFJkqRROIa6Q7ag/lCOyhLVkiRJ0kDpZogtp7eT\ngH2J6eHuBLbr4jknsnLeFZAkSRpF33We7kHM4nAL8NE6++cC5wFXA9cTF7LV082GfQF4NzEzxVXA\noTjvsSRJ0kTVV4F6CFhITI82hQjNG9YcMwx8Kb0/F3gImFynrHYaNgVYo865K/4X+AWwE46PliRJ\nmuhazp31wmunbE0E6kXp41OB/YAbM8fcC2yW3p9NBOoXmih7ErAcsKTOvpWAU4BV02028CBwLbBn\nekyZ6tCDdwKfAXYlPgQsaOL8kiRJUtcdCHw38/hw4Js1x0wiAuw9wONUA2+tBDiHmLLubuB5YgGR\
neqYCrwM2JcL1WBdeDo+xX5IkSRNHX/VQN1OZjxNDQcrESnm/AzYnwnWttYHngEuIYRrnNyjzubQc\nSZIkaSxl+njShG2JCw4rjmPpCxN/DeyQeXwB8Ko6ZfVilg9JkiSpry5KnEwMy5hPDMOod1HifwCf\nTu+vQizn3Wj+Z0mSJKnb+i537gncRFyceFz63NHpBjGzx9nANcB1xJR19fRdwyRJklRIhc2dhW2Y\nJEmS+opLj0uSJEm9ZKCWJEmS2mCgliRJktpgoJYkSZLaYKCWJEmS2mCgliRJktpgoJYkSZLaYKCW\nJEmS2mCgliRJktpgoJYkSZLaYKCWJEmSAEhK43mVgVqSJEkTXDIVkqOBG/OuSTcleVdAkiRJRZNM\nhuRtkNwGyW8h2YYC587CNkySJEm9lkyC5GBI/g7JxZDskt2ZW7W6rLANkyRJUq8kJUj+CZJrIbkC\nkt3rjJsubO4sbMMkSZLUbUkJkj0g+RMkV0Oy7ygXIBY2dxa2YZIkSeqmpAzJHyG5AZKDYrjH6C/o\nRa3yUNiGSZIkqRuS7SA5H5KFkBwOyVCzL+xqtXJU2IZJkiSpk5ItIfkVJHdAchQkU1otoCvV6gOF\nbZgkSZI6IdkEktMhuQeS90IybbwFdbRafaSwDZMkSVI7kvUh+Qkk90NyDCQz2i2wI9XqQ4VtmCRJ\nksYjmQ/JiZA8CMknIJnVqYI7VE7fKWzDJEmS1IpkDUi+BclDkHwWkjmdPkGrL5jc4QpIkiRJNZJJ\nwExgNjArs432uNG+IeDbwMuhtLinzWig0YTW/SZhcOoqSZI0ASXLA2VgV+DVwByqQXgZ4CngMeDx\nzPZYg/tj7Cu90M2G0GLuHJSQaqCWJEnqK8kywA5EgN4V2BC4FLgAuAR4iGoYfhJKL+VU0VYZqCVJ\nktQNyRCwFbAbEaC3Aa4lAvT5wOVQeja/+nWMgVqSJKkYkinApPxCalICXkGE592AXYC7iAB9AXAR\nlB7Lp26ugZXTAAAgAElEQVRdZaCWJEnqD8lUYDliLHGjbbT904GXiGET9zTY7k1v74PScx2o85pU\nh3DsCrxA9D5fAPweSve1f46+Z6CWJEnqjWRN4CDgVdQPxFOBR4FHRtlG2/8EkX9WBFZPt9Uy97Pb\nKsASGgfvyvbAyAv6kuWB11DthV4RuJBqiL4VShNt+mIDtSRJUvckKwMHAgcDGwO/BBYAD7N0IH6q\nd2E0GQJWYuzgPRdYTITrErA+cQFhZRz0NQN08WC3FDZ3TrRPRpIkqW8kK0ByFCS/g2QJJD+GZJ90\nSMeASSanC6O8GpLtIZmWd436UGFzZ2EbJkmS+lEyG5LDIfkVJI9C8nNI3phOFadiK2zuLGzDJElS\nv0hmQPImSE5PQ/RZkBwKyay8a6aeKmzuLGzDJElSnpJpkOwHycmQPALJbyB5e3qxniamlnPnoAy4\nLuzgcEmS+l9SImayOIyYveIxYnaKelt235P9OUNEMoWY1eLNwH7E4iSnAqdD6cE8a6a+4CwfkiSp\nU5K1gMOBtwJTgB8BdxBzJy8HzM7cr7dNZWTAHiuIPwk8CzyX3j47xuMXmg/syRCwMzE7xwHAQiJE\n/xxK97T6zqjQWs6dk7tUEUmSNJCSmUTgfBuwBfBz4EjgstZ7m5MpNA7dledXBNZN788AphFBfFqd\n+7WPJ0FSG7gbhe9NiKnifgq8GkqLWmuL1Nig9PraQy1JUtckQ0CZCNFvAP4A/BA4G0rP5FixMSRD\nNA7ftUH8dijdklNFNVgKmzv7cPyVJEmDLtkQki9Bcickf4Hk/enCJdJEVtjcWdiGSZLUW8lcSN4L\nyZWQ3APJVyHZNO9aSX2ksLmzsA2TJKn7kmmQ7A/JL9Op4X4Cye7pkAlJIxU2dxa2YZIkdUdSgmRr\nSL4FyWJILkznV56dd82kPlfY3FnYhkmS1FnJOpB8HJK/Q3IzJJ+EZH7etZIGSMu502nzJEkaWMkk\nYGNgJ2DHdJsKnAEcAVzRnwurSMUyKFOCFHb6EkmSmpdMI1YsrATo7YHFwB+Jqe7+CCw0REttcaVE\nSZKKI5lDhOYdiRD9SuDvVAP0JVC6L7/6SYVkoJYkqX3JHGA34HngYeCh9PZhKD3XxfOuSXXoxk7E\nCoJ/otr7fDmUHu/e+SVhoJYkabySZYC9gUOBXYkA+yKwArE89grp9gwjQ3btbYPnSi/UnK8EbEg1\nPO8IzGLk8I2/QOn57rRXUgMGakmSmpdMBl5LhOj9gKuAk4FfQOmROseXiNCbDdmNbrP3lweepBqy\nnwA2AR5jZID+u+OfpdwVNnf6n4skqUOSEiTbQfJNSO6H5Ip0ye3VunjOSTGMJHkZJK+GZDdI1uje\n+SS1oeXcOSjpu7CfFCSpGJIS0RO7PrBeelu5Pxe4Dvhzul0FpftzqOPGRE/0IcCzRE/0KVBa2Pu6\nSOpjDvmQJHXLqKF5fWAScEu6LczcPgRsCmxFTPm2FfAUMbziz9Xb0gNdqPM84GAiSK8AnEIE6Wsc\nWiGpAQO1JKkd4wrNlfuLmwupSQlYh5EBeyvgcUYEbP4MpQfH0YaVgIOInugNgdOIEP1HKL3UenmS\nJhgDtSQVU7IWsBmxwu1kYKjO/Ua3zRwzC3gZHQvNLbevREwRVwnYrwK2BB5l6ZC9uM7rZxEXFR5K\nzNv8ayJE/7a709xJKiADtSQVQzKFCIZ7AXsCqxPzET9LTOX2QodvnwRupauhuVXJJCLkZ3uytwSW\nUA3YdwL7EO/RxUSIPgtKT+ZRY0mFYKCWpMGVrA7sQYToXYlw++t0uwpKL+ZYuT6RTCKGn1RC9jrA\nb4DToPRQnjWTVBiFzZ190FMiSZ2WTIZkR0i+CMlfIXkYklMheSskq+RdO0maoAqbOwvbMEkTTbIK\nJG+D5KdpgP4LJF+AZId0kRFJUr6ch1rSoEmWIS5GW69mm0WM7W1le6nJ454EHsxsi9Pbhzs/C0Qy\nBGxNjPHdixgTfD5wLnAelO7p7PkkSW1yDLWkfpTMJIJkbWheD1gJWMTIeYtvJWZ3GOrgNilzf2Z6\n3rnpbWWbTVzwVhu0R7lferZOe1cCXk+E6NcD9xDjoM8FLoXS8+N5FyVJPWGglpSXZDnqB+b1gDlU\nZ5Co3e7sn4vtksnAiiwdtke7/ywjg/Zc4BXA74kAfS6U7uxpMyRJ7TBQS2pWUgKmANOBZdKt3v1G\n+5cB1qAampehfmBeCNxTzAU1khLRq50N2k8Alzj3sSQNLAO1pGQFYszuNuntXBoH5peAZ4CnM7fN\n3K/c3kt1mMb9/TF3sSRJbTFQSxNLMhnYFNg23bYheo2vAi4HriRCb4NgXHohh0pLktTP+i5Q7wF8\ng7gI6HvAV2r2fxg4LL0/GdiQ6E17pOY4A7VGkawCHEgshHEf1WWSbwFuL9ZX78kaRGiuBOgtgX8A\nVxAB+nLgBoOyJEnj1leBegi4CdgNuJtYMvcQ4MYGx+8D
fCA9vpaBWjWSucABwJuJUHkOcQHYisD6\nmW1N4C5GhuzKtqi/g2eyDLEaXDZAL0OE5kqA/hOUaj+ASpKk8Ws5d3ZzEYGtiXGVi9LHpwL70ThQ\nHwqc0sX6aOAlywP7EyF6G+A84Pi4LT3d4DVTiaWJKwH7FcC+6f1VIbmD+mH7jt7OPJGUiAv7KsM2\ntiW+sbmBCM6/BI4DbnWcsiRJ/aWbgXoNIDtV1F1EUKhnBjFX63u6WB8NpGQ28UHszcBOxIIYJwIH\nQOnJsV9feo74puSmOmVPJxYUqYTtzYA3pvdXguR2qgH7NuAF4t/MZOIbmMl1tvE8Pz099xNUe55P\nBv7a+IOCJEnqF90M1K30ou0L/JGlx05nDWfuL0g3FVIyk/ideBPwWuAi4tuLQ6D0eOfOU3qG6AG+\noU4dZhALkVTC9qbE1z8vpNuLmfuV7dk6z9U7rva556MOrpgnSVIOyunWl7YlvpKvOA74aINjfwEc\nPEpZfsVdeMkMSA6E5OeQPArJuZAcAcmcvGsmSZImlL7KnZOJldHmA1OBq4kxobWWAx4iLrZqpK8a\npk5JpkGyHyQnQ/IIJL+D5ChIVsy7ZpIkacLqu9y5JzF2dSHRQw1wdLpVvI0YLzqavmuYxiuZCsle\nkPwAkochWQDJuyFZOe+aSZIkUeDcWdiGTQzJdEj2heT7kCyG5BJI/hWS1fOumSRJUo3C5s7CNqy4\nkmXTMdGnpMM5LkpD9Fp510ySJGkUhc2dhW1YsSTLQXIYJGekFxb+FpKj05UMJUmSBkFhc2dhGzb4\nkhUheQck50DyGCRnp7NzrJB3zSRJksahsLmzsA0bn2R6BNYkp+XYk9UgeQ8kF6Q90T+H5JB0ERZJ\nkqRB1nLuzCmQtazlNdUHVzKdWGVyLWDNzG32/nLAM8TUhHfU2e5Mb+9KFy/pRL3mAQcQKwluDJwD\nnA78BkpPdeYckiRJuWs5dw5KSC1IoG46LN9DhOK70u3OmtsHoPRS2iO8FrB2zVZ5bg1gCaOH7geg\n1OCTWLI+EaDfCKwDnEmE6Aug9GxH3hJJkqT+YqDuD8lU4A3Aa6gfluuF5Jqw3JF6DAGrsHTgzm4z\nqYbryjYJ2A9YiVjF8nTgIii90Jl6SZIk9S0Ddb6SjYEjgcOBvxE9uv+gGpg7GJY7JZnB0r3c04Bf\nAZf2X30lSZK6akByZ+v6+KLEZFa6XPZlkNwDyRchWS/vWkmSJGlc+jh3tqfPGpaUINkOkhMhWQLJ\nLyHZB5LJeddMkiRJbemz3Nk5fdKwZGVIjoHkBkhuguRYSFbNu1aSJEnqmD7JnZ2XQHJwjFFOpvT4\n1EOQ7AnJaekS2idBslN+c0BLkiSpi4o8D3VyBrApcQHdLcB16XZ9entH4+nfxnXKdYC3p9u9wInA\nqVB6tHPnkCRJUp+ZCLN8JDOADYlwvUnmdiYxs0ZN0C491MJppgP/BBwFbA6cDJwIpWs70gpJkiT1\nu4kQqBsesiIRrCshuxK0n6Tai10J2jeMXN0v2YwI0YcCfyF6o8/s3CqDkiRJGhATOVDXfVmJGCKS\n7c3eFNiAmBf6+nT/KsD3Yyst6kSFJUmSNJAM1E0WNwVYnwjXjwDnQ+nFzpUvSZKkAeXCLpIkSVIb\nWs6dk7pRC0mSJGmiMFBLkiRJbTBQS5IkSW0wUEuSJEltMFBLkiRJbTBQS5IkSW0wUEuSJEltMFBL\nkiRJbTBQS5IkSW0wUEuSJEltMFBLkiRJbTBQS5IkSW0wUEuSJEltMFBLkiRJbTBQS5IkSW0wUEuS\nJEltMFBLkiRJbTBQS5IkSW0wUEuSJEltMFBLkiRJbTBQS5IkSW0wUEuSJEltGC1Q7wEcVOf5A4HX\ndac6kiRJUnFcCqxc5/mVgMt7XJekx+eTJEnSxNRy7hyth3oa8ECd5x8Elm31RJIkSVIRjRaoZwFT\n6jw/BZjenepIkiRJg2W0QH0G8L/AzMxzs4AT0n2SJEmSRjEF+DKwGPhLui0GvkL9nutucgy1JEmS\neqHl3Flq4pgZwHpp4bcCT7V6kg5IaK6ukiRJUjtazp2TR9n3RqoJvQS8BMwBrgYeH0/tJEmSpKIZ\nLVDvy9Jd3isAmwNHAhd0q1KSJElSkc0DruzxOR1DLUmSpF7o6DzUjfyD3l+UKEmSJPWl8QTqVwDP\ndLoikiRJ0iAabQz12XWeWx5YHTi8O9WRJEmSBstoU4KUax4nwEPAzcBz3apQA06bJ0mSpF7oSe7c\nCfhWt09Sw4sSJUmS1Ast587RhnxkbQkcArwJuB04vdUTSZIkSUU0WqB+ORGi3ww8CPyc6P4ud79a\nkiRJ0uB7CTgLWDvz3O051cUhH5IkSeqFjs5DfQDwNHAx8B1gV7wwUJIkSWrZTOAw4FfAk8C3gd17\nXAd7qCVJktQLXc+dKwDvAn7f7RPVMFBLkiSpFwqbOwvbMEmSJPWVjo6hliRJkjQGA7UkSZLUhmYC\n9VeafE6SJElSHX+t89x1Tb52D+DvwC3ARxscU07PcT2woMExjqGWJElSL3Q0d76bCM5PpbeVbRHw\nkyZePwQsBOYDU4CrgQ1rjpkD/A1YM308t0FZBmpJkiT1Qkdz53JEGD4VmJfenw+s2OTrtwPOyzz+\nWLplvQf4bBNlGaglSZLUCx2d5eNRojf6YGL58dekjycB6zRR9hrAnZnHd6XPZa1PzG19IXAV8JYm\nypUkSZL6xuQmjhkGtgJeDnwfmEoM+dh+jNc1k+6nAFsSy5rPAC4DLifGXEuSJEl9r5lAvT/wSuDP\n6eO7ieXIx3I3sFbm8VpEL3XWncBi4Ol0uxjYnPqBejhzfwGNL2CUJEmSmlVOt666Mr2tzPaxLHBt\nE6+bDNxKjLueSv2LEl8BnE9cwDiDuOhxozplOYZakiRJvdCV3PkR4ATgduBdxJCMf23ytXsCNxGz\nfRyXPnd0ulV8mJjp47pRyjVQS5IkqRdazp2lJo/bPd0AfgP8rtUTtSmh+bpKkiRJ49XV3DkXOIC4\nQLHX7KGWJElSL3Q0d54DbJLeXw24DzgbuAH4YCdP1AQDtSRJknqho7nzb5n7Hwd+mN6fRfNLj3eK\ngVqSJEm90NGFXZ7P3N8NODe9/zjwUqsnkiRJkopotHmo7wLeR8wn/Uqqy4jPGON1kiRJkoBViOny\nzqQ6wwfEEuQf7nFdHPIhSZKkXihs7ixswyRJktRXOjqGWpIkSdIYDNSSJEmaqDqygEszgXrHOs/t\n0ImTS5IkSTnZF/hJr0721yaf6ybHUEuSJKlTysADwKvr7Gs5d442/d12wPbASsCHqHaJz8KhIpIk\nSRpMrwZ+BrwZ+FMnChwtUE8lwvNQelvxGHBgJ04uSZIk9dBGwNnAkcCFvTzxvMz9IWC5Xp485ZAP\nSZIktevbwGFjHNOV3HkyMBtYFriBWDnx2G6caBQGakmSJLWrmVk9upI7r0lvDwO+DkwBruvGiUZh\noJYkSVIvdGV
hl8lEiP4nYszJ8+M5kSRJklREzQTqE4BFwEzgYmA+8Gj3qiRJkiS1bTqRX/tSidFn\nB+kGe8QlSZLUrMnAL4Hhcby2K7lzVeBE4Lz08UbEVCO9ZKCWJElSMyYBPwR+TUwD3aqu5M7ziImv\nr00fTwGu78aJRmGgliRJ0lhKwDeBPwAzxllGR3NnZVjHVeltdrnxqzt5oiYYqCVJkjSWzwJ/ob11\nUzo6y8eV6e0TwNzM89viRYmSJEnqL5OAacAe9FFWrfRIbwVcQlTsUuAWYPMe18UeakmSJPVCy7lz\ntNVi7gL+Iz2mRCT+EvAs8GK6r1cSmlvZRpIkSWpHy7lztOnvhoBZdZ4f7wBvSZIkaUL569iH9IxD\nPiRJkpS1FTCnC+V2ZelxSZIkqZ9sQcwz3evr+lq2Yt4VyLCHWpIkSQAbAPcAB3ap/MLmzsI2TJIk\nSU3bHvgH8I4unqOwubOwDZMkSVJTXgPcCxzQ5fMUNncWtmGSJElqyhC9GZJc2NxZ2IZJkiSprzjL\nhyRJkgbe8nlXoIjsoZYkSSq+mcDxxHooeXX82kMtSZKkgfRa4Fpg2fT+S/lWp3jsoZYkSSqmWcC3\ngTuBvXKuC4wjd07uRi0kSZKkJm0ETAU2BR7JuS6FZg+1JEmSesEx1JIkSQWyJrAJMQez1BZ7qCVJ\n0kS0P3Az8Bjwe+CLwH7A3DwrNU5zgEPzrkQTCps7C9swSZKkJqwI7Al8BjgPODLf6rRsL+Kiw28B\npZzrMpbC5s7CNkySJAmYBqzSgXI+Q8zj/BZgA/IPr8sDJwG3EVPhDQLHUEuSJA2YHYCrgaM7UNZZ\nRHjdB/gdsBg4F1i3A2W36lXA9cATwGbEkJVCyvtTS7MSBqeukiRJzZgFfAk4AHgfcHoXzrEqsA1w\nITEOu9b7iN7xWscDz9R5/r0Njv9WneOXJ4L0Rc1Wtk+0nDudh1qSJKn39gBOAM4HNgaWdOk89wFn\njrJ/FWB6necbBcrVWjh+CYMXpgvNMdSSJKlI3gHslnclVFdhc2dhGyZJkqS+4kWJkiRJUi8ZqCVJ\nkrpjEvAvxEWHUu4c8iFJkgbJK4A/AJcAG+ZcF7WmsLmzsA2TJEmFMgX4BDH/83txNMAgajl3Om2e\nJElS5/yYmF96S+COnOsijWAPtSRJGgQr4WJ0g66wubOwDZMkSVJfcdo8SZIkqZcM1CqKGcBniItB\nJEnqtpWBL2OW0gBxyIfGsgLxe/LOvCsiSSq8ZYErgc/mXRF1RWFzZ2Ebpo7aGrgLWCbvikiSCmsy\ncA7wf3jxYVEVNncWtmHquNOBj+RdCUlSIZWA7wHn4hDDIits7ixsw9RxGwIPAHPyrogkqXCOBq4C\nZuZdEXVVYXNnYRumrjiOWPJVkqROmknMM61iK2zuLGzDNG7rAmfi1dWSJKmznIdaE8Jk4EfARcBL\nOddFkiRpINhDraxPABfgB0JJktR5hc2dhW2YWvYq4qLDtfKuiCSp0FYG/hsYyrsi6rm+y517AH8H\nbgE+Wmd/GXgU+Gu6fbJBOX3XMOViGeL36eAWXrM8sHZ3qiNJKigXbpnY+ip3DgELgfnEXI1XE1Oa\nZZWBs5ooq68aptxMAvZq8TXvBc7uQl0kScXkwi3qq4sStyYC9SLgeeBUYL86x/nLqma9BPy6xdd8\nF9gU2LHz1ZEkFUwJ+A6Rj47GDj01qZuBeg3gzszju9LnshJge+AaIiht1MX6aGJ6Fvg08GX88CZJ\nGt3bgS2Ag4jOQKkpk7tYdjOf6v5CXFz2FLAn8EtggwbHDmfuL0g3qRk/Bo4lhouck3NdJEn962Ri\nKOoTeVdEPVVOt760LXBe5vFx1L8wMet2YIU6z/uVy8Q1pUPl7Ed8E2IvtSRJGk1f5c7JwK3ERYlT\nqX9R4ipUA87WxHjrevqqYeqZdYGbgBkdKKsEbNaBciRJUrH1Xe7ckwhEC4keaohB/ken9/8FuJ4I\n25cSvdr19F3D1HVDwB+BD+VdEUmSNKEUNncWtmFq6OO4GqIkqXtWJmaC6ub1ZBpMhc2dhW2Y6toK\nV0OUJHXPTOBPwOfyroj6UmFzZ2EbpqVMAq4DDsm7IpKkQqos3PJ9vFBd9RU2dw5iw+aw9EWYas56\nXS6/BHwdWLHL55Ek9ZcS8D3gXDo3i5SKp+Xc6bih7jkW+CARqhflW5WBs7DL5SfAssQ0jsd2+VyS\n1Gv7Al8DngGeztz+AfhSnePXT1/zdM1r7iDWi6g1DViGWPjkOeAFBqfj6zBi4ZYyLtyiDhqUrzoS\nBqeuEHVdCHwAODvnuqi+1YmhJZsBd+dcF0nqpMlESJ5OBN/K7UPA5XWO3xR4R53jr6L+GON9gR8R\nU+JOSc/3PHAacGid418HfJVqAK/cXgR8sc7xGwNvBJ4kFlip3C4iZgVrxxRgNvFeSI20nDsHJaQO\nWqB+FXAKserjoHxqn4i+QgzNOXqsAyVJDU0iQvUkome71hxgHaoBvHK7GPhzneM3Bg4mvkmcmbm9\nEvhCnePfBPwfSwfwXwOfH2ebNLENWu5s2qCF0q/hlcOtWC2n864APEjj5e4lqd+tkncF+sAkYBaw\nKnENzubADsAmeVZKA23QcmfTBq1h/0V8wq7nZcTXaQrrElPkrZ7T+Y8F3pXTuSVpvGYAxwPX4nz9\nUqcNWu5sWpEa9t/AAuIrsInO1RAlqXWvBG4ghhYun3NdpCIqUu4coUgNm0T0YF9Lfr2y/cLVENXI\nUN4VkPrQEPGt2oPEbBWSuqNIuXOEojWsBHyMuGL5FflWJTeuhqhGNiBmYJmVd0WkPrMhcD4wP+d6\nSEVXtNz5/xW1YUcA9zIxFxg5D1dD1NJWA24Djsq7IpKkCauoubO4DSOmEpqIZuRdgQaWzbsCE9hy\nxByznxjlmA8Rc4dLktQthc2dg9CwycAPiPk1NZi2Aq7H8bt5mA5cSMxa0GjuzxKx+ui9wBnEhVlS\nUa2fdwWkCWwQcue4DELDdicmndfgKhFL8x6Rcz0mot2JGQua+TAzg1iF9B7gLGIhJakophKrB94L\nrJxzXaSJahBy57gMQsNOpHPTv62LvaR52ZG4WHRazvWYiFpdlWoZ4L3AR7pQFykPLyeW+z4HF2yR\n8jQIuXNc+r1hU4GH6NyMFT8HfkaxQt18BmdBm7OB9+ddiQ6YBPwSOIaCLqEqFUQJ+GdiKe734L9X\nKW/9njvHrd8btg+xQEmnTCdC9YXEhVqDbgox7/YBeVekSZsB9zH407YdAvw13U6iWB/QxlIixsRL\ng2BZ4DRiWjxJ+ev33Dlu/d6w/yO+eu6kIeICrauJqcQG2THAbxmsXpdDgdl5V6JNyxKzyFT+WF8K\nzM21RlXdnk1lJeBWYlXS1zBYv3uSpHz1e+4ct35v2DJ0JyCUiNUEb0nPMYjWJr7G9Ir1fE0C3kl/\n9FLvAiyk+3WZDLwVuJm42PR1GKwlSWPr99w5boVtWJMGeTXFXwD/lncl
1Dc2J1bIfG0PzzlEfONw\nIxGwpbzMJC6iHdQOEmmiKGzuLGzDCm574Cb6o1dU+VsHuAt4U07nH8J54pWPqcD7iKnwfkz/DL2S\nVF9hc2dhG1ZwJfzD0e+WozerQ65EDL3o9LUGUj+bBBwO3EZMhbd5vtWR1KTC5s4Exz7WMqj2zi7A\nBnlXognTgXOB5Vt4zfuJWUDW7kqNqvYChrt8jqJbi3gfNTjKwGXAzjnXQ1JrCh2o98y7EnW8hnzG\nwg0BNwD75XDuiejDwOl5V6IJHyfmnW5FiZiF5R5iiM5E80ryG4LSirWJXs4lGKoHSQk7g6RBVOhA\n/Uf66z+m2cCjwJyczv9q4uKuTXI6/0SyDLF64htyrsdo1iIWF1p3nK/fk/h9OqJTFRoQGwP3E0uf\n96u1iCkA30986HkAmJdrjVRPP/19ktSeQgfqW4iv3vvFW4Czcq7D4cQf2hVzrkdWUS9A3I4IXvNz\nrkcjpwCfa7OMDYl/Z/u3X52BsiMRUrfMuyINrAq8PfN4Uwxv/WQesXCSsxlJxVHoQH0ksThIvzgH\nOCzvSgBfA84n5tzN21yiJ7efAn4nfRC4kv6bKWJn4A5gRgfKWp5Y2bJdq3agjF7aH7ib8ffwa+KZ\nC/wn8c3Q5yjGqraSQqED9VRg37wrklqBGO7RD0tTDwFnAFvnXRHgROAbeVeii0rACfTf8sC7AXvn\nXYmMtxFzPvfDh7xW/DNO86ixlYBPEQtWHc/gfXiUNLZCB+p+ciT9dZFaP3z9uxMxx/CgL9et9uwN\n3Ef/feholtckqBkfAF6WdyUkdU2/5c6O6beGbQdsm3cl+shU4HrgwLwroq6YC7yDsT+4bQs8iP82\n2rEq8Ala+5C8Dv19wawkDZp+y50dU9iGFcRHgV/THz3l6rx5xAemE2g8fnxDomfaKd3GbxViqMwn\nW3zdpsRFlRNx2sNumoTfWEgTVWFzZ2EbVhDr4TReRTebmNVmAfUXFdqDmPlG47MK8DfGP1PEXsSy\n1ut0rEYT10rAR4CFwJ/ov4uQJXVfYXNnbcNmAvvkUZEBsRPwXewx7oX3AIf2+JwlYhq1Xv+hHwK+\nRCwwsmmPz52XbYGjunyOlYlvAD7TZjn/SoRyZ5sYn+2BnxCL53wf2Ab/D5UmqgkTqFcgpirq9nLJ\ng2oG8Geil0XdtTkxbriXF+EdSvx8h3p4zqzDiFkOJoJ1iOn03tjFc5xKTLvWbngrAd8CziO/341B\n9nFi8Zzl866IpNxNmEAN8BXgmz2uxyD9kVqLWE56j7wrMgEcSfQwLtuDc80kZlPZoQfnUnglMUZ5\npy6VP5PO9YROJsK/PauSNH4TKlCvAjxMb+cAXUAs+T0odiCCwMvzrkjBlYAfEl8Td9uXgB/14Dwa\naTdipUwvUhtcM4F3Eb34kjSaCRWoIXqov9qjOqxNTOTfiVXkeuko4Co622O1ObGIi6qWJcavdvPC\nvPWJ38HVu3gONXYoMX7chV8Gy+bAt4kOmDOA3fOtjqQBMOEC9drEWOpeLCbyYeJCv0HUyQA2CbiM\n7hf5JtgAABeOSURBVF+oNYjWp7vfmPwbcGwXy9fY5rf5+lnEvyH1xhnAncS/nTVyroukwTHhAjX0\nbpqoq4Bde3SufvYu4BIMBXko4fs+yOYQ/4+8qcfnfQUxk8hEtAExrlySWjEhA3UvrEcsWjHR/2Ne\nmRiTvVneFZEGzBzgSuAb9P6CwU8ClwLTe3zeXplG/B8tSZ2Sd+7smrwbtjfwhZzr0A9+CHwt70pI\nA2Y54HLgv8ln9o1JwE+JOZaLNPvHFOB9RGfHt3Oui6RiyTt3dk1hG5aTt9D6WN8SMa/1zM5Xp7CG\niOkLVVw7EddXNDKbuObgePINs8sAVzD+lRj7SQnYH7gZ+A1x0aEkdVJhc2dhG5aTYWIctLMVdNdu\nxPLF7axct2KH6qLuWA24nVi5sp4ViAtJ+6FneFXgH8DBeVekTd8HrgVen3dFJBVWYXNnMw0bAj5P\n75djHkSTiKvfv0d//KEvsm8BpzO+93kHohfOCxH728uJYQd75V2RJmwGHJF3Jdo0j8FaZEvS4JnQ\ngRri678ju1mRAplJ9PK8N++KFNw0YmaH97f4uiHgL8AhHa+RumFbYgn6rfOuiCSpbRM+UO8M3IKz\ncTRrHaJn7bV5V6Tg1iVmR9mmhdccDVyM3yAMkn2ARRR3No1emkZccOg1G5LyMOEDNcAf6Fyv3hG0\nFoIG0Y40Xpp8bwzbnbI/zS9NvgKxzPUW3auOumSizvfcKSVijPdtwNm4KqikfBiogT2A62l/3GmJ\n+E99yzbLGVSzgbuIwK3/196dR8tR1Qkc/4a8BLJAWAIoBAigbEFANBFZm0UNDhgRQQwgHpfBIxpx\nQYIwzBsYcRSOYVEYV8ABZRRxQUAZR8ImS4IJBIIKMRHCJoSAIYATSOaP3+3z6jW9pvd+3885dV5V\n3VtVtys35/z69q9uNUa1o80XAhc3syFSRqe8QXB/Yq7uuUCuvU2RNMQZUBNBy13A5DqvOZlIHxmq\nP7lfQDy0qNZ7A87uodYYBswBPtnmduxOpMtMx4dwJbWfAXXSiJk+zgPObsB5utFbiNxqgzqp921P\n/H8/gvbmfztDk6ROYUDdIOsAjwC7tvi6nWA48AzdP7WWpOrtCzwIvAw8Bry1vc2RpLYyoG6QvYEH\nWnzNTjGReI3vUE11aZWNiNfZ+/O2Okn+7Z6jSpTfCSwErgMuAj4LTAPGVHHudYATgFPrb6YkNVXN\ncWe3BE1raG1b+4AJRE6f1Ax9wO+AG4kXEkndYCzxpXu7zLIt8DFiZppCRxHzc48h+vlLxKvaf9+C\ntkrS2mp13NkyPTv0riFtC+Bx4mVEl9Oj/3k1pH2bmE99DvB+7OOSukPPxp31fLBZmA+oznUIsBo4\nrd0NkSRJgAF1UTOAaxrVEKkJ9sYZDiRJ6hQG1EWMJqaEmtSgtkiSJKl3GVCXMBO4sop62wCb13kt\nSZIkdS8D6hI2IOZWfkOFet8npoGSJEnS0GRAXcZM4Ogy5SOBZcR0eZIkSRqaDKjrcBhwawuuI0mS\npM5Vc9zpW9oGHANc1e5GSJIkSc3Q7BHqUcByfCBRkiRpqDPlYy2NB05u8jUkSZLU+TouoJ4K/BF4\nCDi1TL3JwCvA+0qUN+ODDW/COSVJktTdOiqgHg48DEwERgDzgZ1L1Psd8CvgyBLnavQHmwmc2eBz\nSpIkqft11EOJU4iAegmwinjgb1qRep8GrgaebmJbCv0M+BQwtoXXlCRJUg9qZkC9JfBoZntp2ldY\nZxpwSdpu1RD7n4CbgE+06HqSJEnqUX1NPHc1wfH5RPrFGmBYWkrpz6zPTks9vgz8GvgG8HKd55Ik\nSVJ3yqWlI+1FBKx5p/HaBxP/AixOywrgKeA9Rc7VrJHrJ6k/MJckSVLv6KiHEvuARcRDiSMp/VBi\n3qW0dpYPiHmnt2rSuSVJktR
9ao47m5ny8Qrx4N9viJk8vgc8CJyYyr/VxGtX66l2N0CSJElqhY4a\nepckSVLP6qhp8yRJkqSeZ0AtSZIk1cGAWpIkSapDMx9KlCRJ6hbPAhu1uxFqqeXAxu1uRCv5UKIk\nSWomY42hp9S/uQ8lSpIkSa1kQC1JkiTVwYBakiRJqoMBtSRJklQHA2pJkiSpDgbUkiRJ3WE2Mb3f\nyCZf5zJgFfC6IvvPLtg3EVjN4JhyOjAXWAE8DlwP7NP4ZnYOA2pJkqTONxGYAvwNeE8TrzMGOBJY\nCBxXULaGylPKfQ6YBfw7sBmwFfBNmttmVcm5ISVJUjN1eqxxJvBL4HTg2rRvXeA5YFKm3qbAi8D4\ntP1FYpR4KfAxYjR5uzLX+RBwH3AssKCg7FLKj1CPI0alj6zuI7Vdw+ah7hY9+8EkSVJH6PRY42Ei\nyH0j8H9E4AzwPWI0OO8kIsUCYCrwBLAzMAq4AniV8gH1/xJB+/rAS8CembJKAfVUIlWkWzIgDKgl\nSZIaqEKssWZNY5a1si8R3K6ftucDJ6f1g4lgO+92BlI1vg98OVO2PeVHqLcmAu4d0vbPgfMz5ZUC\n6mOJAL5b+KZESZKk1hk2rDHLWjkBuJFIpwD4SdoH8aDiaCK/eiKwO/CzVPZ64NHMeZZWuM7xwP3A\nnzPXmQ4MT9uvACMKjhlBBNSrgWVEqsmQiy/72t0ASZIklTQKOJoIUvOjv+sCGwK7EfnOPwY+SDyw\neC2wMtV7gngoMC+7XsyHUp38dfqATYB/IvK3H2FwvjbAtgwE7XcA/wCOAH5azYdTa5nyIUmSmqlT\nY40PEiO/E4hZMzYDNgduBs5LdaYQQfAC4PDMsVOJBxJ3IkaxL6d0ysfbifznSQXXuQK4OtWZRIyS\nv4MYtd4CuAU4J3OezwFPAtPSNUcAhwJfrf2jN5051JIkSQ3UqbHGDcC5RfYfRQTL+fSKh4BneG32\nwUwi2F4KfIIIqLcscr5LiBSPQpOJ/O0N0/ZhxBzTzwFLiEB53YJjpgNzgBfSta8F9ipy7nYzoJYk\nSWqgoRBr7EzkQQ+5HOcSDKglSZIaqFdjjSOIEeSNiDzoa9rbnI5iQC1JktRAvRpr3ECkZywjHhTc\nvL3N6SgG1JIkSQ1krDH0OA+1JEmS1AkMqCVJkqQ6GFBLkiRJdTCgliRJkupgQC1JkiTVwYBakiRJ\nqoMBtSRJUmdbDWxXsK8f+K/WN6WoLYFfEHNdPwqcWFC+mngN+Yq0fDtTdjCwmHhF+Qcy+zcE7gHG\nVLj2BsD5wF/TuR8GZgGbpPIl6RrCuSElSVJzVYo1cg24xtqeo1hA/a+0J6AuNhh7E/B1YDiwGxFY\n5zLlq4FtS5zvPmCXzHHD0v5LgPdXaMtIYA7wG2CntG9T4HRgatpeDBxU4nhf7CJJktRAlWKN/gZc\nY23PUWmEOgcsBU4DniaCyOmZupcB/wncCPwdmA1snSnfCfgfIqD9I3BUwbGXANcTo8yFwenY1L7x\nmX3fAn5Q0P7tS3y2RZn1J9J5pqTrVfIx4ElgdJk6LQmoTfmQJEnqfpsTaQ5bACcQaRU7ZMqnA2cR\nAet84Mq0fwwRTF9BjO4eA1wM7Jw59oPA2UTwfHvBdYcV/IWIL3ctqHcLETD/FNgms/9vxOj07sCr\nxGvSzwdmlP+4ABxCvFr9xSrqCkeoJUlScxWLNXLESHB/Ki+29Jc4X6lj8ufL1dC2akaoVwGjMuX/\nDZyR1i8DfpgpGwO8Akwg8pZvKTj3t4AzM8deVqF9twIXAusCexIj3Q9myvcF+oBxwEXAAiI9BCKQ\nvgm4AziQCKT/jQiyfwP8Dti/xHVvBM6p0LaWjFD31XqAJEnSEDE7LXn9NRzbX6R+sX3VeBUYUbBv\nBBFE5y0HXsps/xV4fVpfQ6SE5K0EniVGs7cB3paOz+tjIGWj8NhijgW+STyQuIgY7Z6UKb8t/X0e\n+Ez6uxPwAHAvEUiT2vt14O1EkD+DGNW+hcGj2nnL0mdoO1M+JEmSOtsjvPahvm2JGSzyNmJwLvE2\nwONpfRiwVaZsLLAx8Fg6983p+PyyPnBSje07HNiMCIY3Be4qUbdYikjeLOKBwpeJlJG5xBeDEQzO\n0c77LfAuyudQK8OUD0mS1EydPMvHOcQo75bEYOghxMOFu2TOuwo4lwg+9yMeIMznUF9GjArvQ8yM\nMYtI04AInpcAx6VjRwCTGZg14zIif7qcndJ5RqbzPM3AtHW7AHsQKR5jgQuIdJDhBed4B3BNZvsB\nIliexODZP7JGAncTedQ7EvdmE+BLwKGpjrN8ZPTsB5MkSR2hk2ON9YCvEcHhc8TI7WGZ8hyRbvEl\nIphdQqRh5F1KzNRxIzFX82wGp1DsAPyKeEDwGWLkd7fMsWdVaN9n0rEvEOkZe2bKDiRmDnkBeIoI\nmgtn/FgXmMfgUfSD0ud9DDi6zLU3IL4gPMLAPNTnESPtYEA9SM9+MEmS1BG6OdbIEQF1KZdSeZR5\nKHLaPEmSJFWlWLqEGsiAWpIkqfuVG1VdU6FcQ4SdQJIkNZOxxtBjyockSZLUCQyoJUmSpDoYUEuS\nJEl1MKCWJEmS6mBALUmSJNXBgFqSJEmqgwG1JElSZ1sCHNzuRpRxOHA/8erv24GdM2UfBl5NZfll\n/0z5+cCzwO+BLTP7pwMXVHHtKcD1wHJgGXBXuiZUfoPkkOPckJIkqZk6OdZYDBzU7kYAfUX2vRF4\nHtibGKidCTwEDE/lHwZuKXG+KalsBPA14KK0fxzwB2Bshfa8nQjQTwE2Tvv2BK5K6znKB9QNm4e6\nW/TsB5MkSR2hk2ONUgH1usQI72NpmQWMTGU3A+9L6/sAq4F3p+2DgXmZ83wEWEiMFP8a2DpTthr4\nJBEkLyrShk8Bv8psDwNeBA5M2x8Gbi3xuY4GzknrU4Hr0vo3gGNKHJN1GwNBeDE5WhRQm/IhSZLU\nnU4nRnl3T8sU4IxUNpsIKAEOAP7CQKrFAakcYBpwGnAEMJ4Ifn9UcJ1pwGRglyJtWEME0XnrpO1d\nM/veDDwN/Cm1Lz96/QCwH7AeEeTfD7wV2IGBUeZSRgN7AVdXqKeMTv7WKEmSul+lWKM/1Slc+muo\nX6puJaVGqB8mRnbz3pnqQgSo96b1G4CPAnek7ZuB92bKPpI5xzrASmCrtL2agcC8mB2BF4ggfSTw\nL0TO9KmpfFtgm7S+KxFEz8wcfzIwnwjixxM52DsCM1I7ryBSQAptmdq2Q5m25TDlY5Ce/WCSJKkj\ndHKsUSqgfpHBDwDuBPwjrY8GXgI2A54g8pSXApuk4/I5xwuJPOTlmWUlMfoLEbRuX6F9RwILgGeI\nFJQFwLEl6n4AmFui7CQiBWRSOsdwYhT+K0XqjgZeIQL5UnKY8iFJkqQyHgcmZra3TvsgguZ7iBHg\nBcAqYiaNzxMj28+meo8A/wxslFnGAHdmzlspwPwp8CZihLk/tWlOmfrDiuzbHPg4cBYx
kn0fMdI9\nF9itSP0XiRH391domzI6+VujJEnqfp0caywmUjvWyyx9wNlEisT4tNxGBKR5XyZm4Dg9bX8S+DuD\nH+R7LxFw5/OjxwFHZcpXA9tVaN9biNHkTYEfE2kaeYcSwTLECPoCIi2k0JVErjZEvvaficD+K8CF\nJa6bn+XjC8TIO0QueT4HPIcpH4P07AeTJEkdoZNjjcVEYJtdziJm+biAGJV+nEi3GJk57p3EKO9+\naXvXtJ0NmAGOI0aEnydGrL+bKXuVygH1rUSgvgy4BBiVKTsXeJLIs15EjGAPLzj+IODagn2zGJif\neosy155MzEP9XLr+nenzQATUj5Q51oBakiSpgYw1hh5zqCVJkqROYEAtSZIk1cGAWpIkSaqDAbUk\nSZJUBwNqSZIkqQ4G1JIkSVId+trdAEmSpA6wHKfOG2qWt7sBrWYHlyRJUit03DzUU4E/Ag8BpxYp\nnwbcC8wj3jd/UJPbo5BrdwN6SK7dDegxuXY3oMfk2t2AHpJrdwN6TK7dDegxuXY3QM0zHHgYmAiM\nAOYDOxfUGZNZf1OqX4wj1I3V3+4G9JD+djegx/S3uwE9pr/dDegh/e1uQI/pb3cDekx/uxvQYzpq\nhHoKESAvAVYBVxEj0lkrM+tjgWea2B5JkiSp4ZoZUG8JPJrZXpr2FXov8CBwAzCjie2RJEmSGm5Y\nE899JJFD/fG0fRzwNuDTJervB3wX2LFI2cPA9o1uoCRJklRgEfCGWg5o5rR5jwFbZba3IkapS7k1\ntWcTYFlBWU0fSpIkSeoFfUSEPxEYSfGHErdnYJR8z1RfkiRJUnIo8CciZeO0tO/EtAB8EbifmDbv\nVmByqxsoSZIkSZIkSRLrAXcRqSELga+UqHch8aKYe4E3t6ZpXaeae5kDnid+HZgHnNGqxnWx4cS9\nurZEuX2zNuXuZw77Zy2WAPcR9+ruEnXsn9VZQvl7mcO+WYsNgauJ2bwWAnsVqWPfrF6l+5nD/lmN\nHRm4R/OIe1Zsprmu7Zuj098+4E5g34LydwPXp/W3pToqrtK9zAG/bGWDesDngCspft/sm7Urdz9z\nJfaruMXAxmXK7Z/Vq3Qvc9g3a3E58JG03geMKyi3b9am0v3MYf+s1TrAEwyeSANq7JvNfvV4rV5M\nf0cSo1fPFpS/h+hMECOwGwKbt6ZpXafSvYTmTpvYayYQ/7m+S/H7Zt+sTaX7SZn9Kq7c/bJ/1qZS\n37NvVmccMSXu99P2K8RIYJZ9s3rV3E+wf9bqEGJSjEcL9tfUNzstoF6HSFN4CriJ+Dkjq9jLYia0\npmldp9K9XAPsTfyMcT2wS0tb131mAacAq0uU2zdrU+l+2j9rswb4LTCXgbn/s+yf1at0L+2b1dsW\neBq4FPgD8B0Gfj3Ns29Wr5r7af+s3THAD4vsr6lvdlpAvRrYg2jw/sRPF4UKv3nV/L71IaLSvfwD\n8fPG7sBFwM9b2bgucxjwNyLPqtw3f/tmdaq5n/bP2uxD5PcdCpxEjGIVsn9Wp9K9tG9Wr4+YEvfi\n9HclMLNIPftmdaq5n/bP2owEDgd+UqK86r7ZaQF13vPAdcBbC/YXvixmQtqn0krdyxUMpIXcAIyg\nfN7gULY38dPPYuBHwEHADwrq2DerV839tH/W5on092ngZ8CUgnL7Z/Uq3Uv7ZvWWpmVO2r6aCASz\n7JvVq+Z+2j9rcyhwD/H/vVDX9s3xRH4KwCjgFuDggjrZBPG98OGFUqq5l5sz8M1rCvFkuyo7gOKz\nUtg3106p+2n/rN5oYP20Pga4HXhnQR37Z3WquZf2zdrcAuyQ1vuBrxaU2zdrU+l+2j9rcxVwQomy\nru2bbyJ+qphPTFl0StqffREMwDeIF8Xcy2u/mSlUcy9PIl6qMx/4PcWnMtJrHcDAE9T2zfqVup/2\nz+ptS9yn+cQ9K/YSLbB/VqOae2nfrM3uxIjqvcA1xGCPfXPtVbqf9s/qjQGeYeBLNNg3JUmSJEmS\nJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJDXfCw04xzuAucQ89HOBAwvKf8vA3KurgfMyZV8A/jWt\nzwCOb0B7JEmSpJZZ0YBz7AG8Lq1PIl5XnHcQ8M3M9svAImCTtP15BgLq9YG7G9AeSepp67S7AZKk\nivYgXnubfTsawGRiFHoecC6wIO2fDzyZ1hcCo4ARaXs68IvMuVcB3wY+W+S6K4BlRFAuSSrBgFqS\nOt8PgFOI1w4vYGAE+VLg48CbgVeANUWOPRK4hwicAfYh0kCyLgaOBTYocvzdwP51tF2Sep4BtSR1\ntnFpuTVtX04EuOOAscBdaf8PgWEFx04C/gM4MbNvC+DZgnoriKB9RpHrPw5MXLumS9LQYEAtSd2l\nMGgutX8CkR5yPLC4ivOeD3wUGFPkvMVGviVJiQG1JHW254HlwL5p+3hgdtq/ApiS9h+TOWZD4Drg\nVOCOgvM9zsADiFnLgR8TQXXW64Ela9VySZIkqQ1eBR7NLCcTudN3MPBQ4rhUd0raN48YYb4t7T+D\nmH5vXmYZn8q+A7wrc72/Z9Y3A1YCZ2b23YAPJUqSJKlHZdMzZgKzqjgmB1xS5fk3AObU2CZJkiSp\naxxNjD4vAK6leCpHMdkXu5QzAzhu7ZomSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZLUKP8PrxWn\nRX/nUKUAAAAASUVORK5CYII=\n", 422 | "text/plain": [ 423 | "" 424 | ] 425 | }, 426 | "metadata": {}, 427 | "output_type": "display_data" 428 | } 429 | ], 430 | "source": [ 431 | "'''\n", 432 | "The datasets train and test are defined in the above examples.\n", 433 | "'''\n", 434 | "\n", 435 | "target='y_buy'\n", 436 | "\n", 437 | "def modAUC(X_train, Y_train, X_test, Y_test):\n", 438 | " '''\n", 439 | " trains a model on train set and returns AUC on test set\n", 440 | " '''\n", 441 | " logreg = linear_model.LogisticRegression(C = 10)\n", 442 | " logreg.fit(X_train, Y_train)\n", 443 | " return roc_auc_score(Y_test, logreg.predict_proba(X_test)[:, 1])\n", 444 | "\n", 445 | "def LrBootstrapper(train, test, nruns, sampsize):\n", 446 | " '''\n", 447 | " Samples with replacement, runs multiple train/eval 
attempts\n", 448 | " returns mean and stdev of AUC\n", 449 | " '''\n", 450 | " auc_res = []\n", 451 | " for i in range(nruns):\n", 452 | " train_samp = train.iloc[np.random.randint(0, len(train), size=sampsize)]\n", 453 | " try:\n", 454 | " auc_res.append(modAUC(train_samp.drop(target,1), train_samp[target], test.drop(target,1), test[target]))\n", 455 | " except:\n", 456 | " oops = 1\n", 457 | " return (np.mean(auc_res), np.percentile(auc_res, 2.5), np.percentile(auc_res, 97.5))\n", 458 | " \n", 459 | "#Run the analysis \n", 460 | "n_seq = np.logspace(3, 7, base=2.0, num=30)\n", 461 | "\n", 462 | "avg = []; lowers = []; uppers = []; sz = []\n", 463 | "for n in n_seq:\n", 464 | " mu, low, up =LrBootstrapper(train, test, 500, int(n))\n", 465 | " avg.append(mu)\n", 466 | " lowers.append(low)\n", 467 | " uppers.append(up)\n", 468 | " sz.append(n) \n", 469 | "\n", 470 | " \n", 471 | "\n", 472 | "\n", 473 | "#Plot the analysis\n", 474 | "#lower = np.ones(len(avg)) * (avg[len(avg)-1]-1.96*stderr[len(avg)-1])\n", 475 | "\n", 476 | "fig = plt.figure(figsize = (12, 8))\n", 477 | "ax = fig.add_subplot(111)\n", 478 | "plt.title('Bootstrapped AUC + Confidence Limits \\n as a Function of N Samples')\n", 479 | "plt.plot(np.log2(n_seq), np.array(avg), label='Avg AUC')\n", 480 | "#plt.plot(np.log2(n_seq), np.array(avg) + 1.96 * np.array(stderr), 'k--+', label = 'Upper 95% CI')\n", 481 | "#plt.plot(np.log2(n_seq), np.array(avg) - 1.96 * np.array(stderr), 'k--', label = 'Lower 95% CI')\n", 482 | "\n", 483 | "plt.plot(np.log2(n_seq), np.array(uppers), 'k--+', label = 'Upper 95% CI')\n", 484 | "plt.plot(np.log2(n_seq), np.array(lowers), 'k--', label = 'Lower 95% CI')\n", 485 | "\n", 486 | "#plt.plot(np.log2(n_seq), lower,'r-')\n", 487 | "\n", 488 | "plt.legend(loc = 4)\n", 489 | "ax.set_xlabel('Log2(N)')\n", 490 | "ax.set_ylabel('Test Set AUC')" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "

We can see in the above plot that Logistic Regression does fairly well with small sample sizes. The lower bound of the $95\\%$ confidence interval at $\\max(N)$ overlaps with the confidence intervals at most levels of $N$, suggesting that, in expectation, the smaller samples could perform as well as the larger samples.<br>

\n", 498 | "\n", 499 | "While this is true, always try to use as much data as you can to reduce the variance!\n", 500 | "\n", 501 | "

" 502 | ] 503 | } 504 | ], 505 | "metadata": { 506 | "kernelspec": { 507 | "display_name": "Python 2", 508 | "language": "python", 509 | "name": "python2" 510 | }, 511 | "language_info": { 512 | "codemirror_mode": { 513 | "name": "ipython", 514 | "version": 2 515 | }, 516 | "file_extension": ".py", 517 | "mimetype": "text/x-python", 518 | "name": "python", 519 | "nbconvert_exporter": "python", 520 | "pygments_lexer": "ipython2", 521 | "version": "2.7.9" 522 | } 523 | }, 524 | "nbformat": 4, 525 | "nbformat_minor": 0 526 | } 527 | -------------------------------------------------------------------------------- /ipython/Lecture_NumPyBasics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "'''\n", 12 | "The core data type in Numpy is the ndarray, which enables fast and space-efficient multidimensional array processing.\n", 13 | "Note: This notebook is adapted from chapter 4 Python for Data Analysis by Wes McKinney and O'Reilly publishing. NumPy has many, \n", 14 | "many features that won't be covered here. The following snippets are just to illustrate basic data types and operations within\n", 15 | "numpy.\n", 16 | "\n", 17 | "Another good resource for learning more about ndarrays is here:\n", 18 | "http://docs.scipy.org/doc/numpy/reference/arrays.html\n", 19 | "'''\n", 20 | "\n", 21 | "#First, import NumPy\n", 22 | "import numpy as np\n", 23 | "\n", 24 | "#It is easy to create Nx1 and NxM arrays from standard Python lists\n", 25 | "l1 = [0,1,2]\n", 26 | "l2 = [3,4,5]\n", 27 | "\n", 28 | "nd1 = np.array(l1)\n", 29 | "nd2 = np.array([l1, l2])" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "The ndarray has dimension n=3 and m=1\n", 44 | "The ndarray has elements of type=int64\n", 45 | "The ndarray has dimension n=2 and m=3\n", 46 | "The ndarray has elements of type=int64\n" 47 | ] 48 | } 49 | ], 50 | "source": [ 51 | "#Now, we can get ask for some basic info to describe the ndarray\n", 52 | "def desc_ndarray(nd):\n", 53 | " try:\n", 54 | " print \"The ndarray has dimension n=%s and m=%s\" % (nd.shape[0],nd.shape[1])\n", 55 | " except:\n", 56 | " print \"The ndarray has dimension n=%s and m=1\" % nd.shape[0]\n", 57 | " print \"The ndarray has elements of type=%s\" % nd.dtype\n", 58 | "\n", 59 | "desc_ndarray(nd1)\n", 60 | "\n", 61 | "desc_ndarray(nd2)\n", 62 | "\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 3, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [ 72 | { 73 | "data": { 74 | "text/plain": [ 75 | "[array([ 0., 0., 0., 0.]),\n", 76 | " array([ 1., 1., 1., 1.]),\n", 77 | " array([ 0.47121338, 1.83328779, 0.4438019 , -0.52309325])]" 78 | ] 79 | }, 80 | "execution_count": 3, 81 | "metadata": {}, 82 | "output_type": "execute_result" 83 | } 84 | ], 85 | "source": [ 86 | "#There are short cuts for creating certain frequently used special ndarrays, i.e.,\n", 87 | "\n", 88 | "k=4\n", 89 | "\n", 90 | "#1. an ndarray of all zeros\n", 91 | "zero = np.zeros(k)\n", 92 | "\n", 93 | "#2. an ndarray of all ones\n", 94 | "one = np.ones(k)\n", 95 | "\n", 96 | "#3. 
an ndarray of random elements (this one is standard normal, but there are many distributions to choose from)\n", 97 | "rand = np.random.randn(k)\n", 98 | "\n", 99 | "[zero, one, rand]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 4, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "[array([[ 0.69394907, 0.85723722],\n", 113 | " [-0.16779156, 0.41709003],\n", 114 | " [-0.94008249, -0.21591983],\n", 115 | " [-0.61305106, 0.41435495]]),\n", 116 | " array([-0.16779156, 0.41709003]),\n", 117 | " 0.41709003439166575]" 118 | ] 119 | }, 120 | "execution_count": 4, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "'''\n", 127 | "For indexing an array:\n", 128 | "1. If nx1 array, follow the same protocol as a regular Python list\n", 129 | "2. If nxm array use the following examples\n", 130 | "'''\n", 131 | "\n", 132 | "arr2d = np.random.randn(4,2)\n", 133 | "\n", 134 | "#A single index gets a full row\n", 135 | "\n", 136 | "#2 indexes returns a value\n", 137 | "[arr2d, arr2d[1], arr2d[1,1]]" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 5, 143 | "metadata": { 144 | "collapsed": false 145 | }, 146 | "outputs": [ 147 | { 148 | "data": { 149 | "text/plain": [ 150 | "[array([-0.4386254 , -0.67720483, -1.19775067, -0.21300288]),\n", 151 | " array([-0.8772508 , -1.35440967, -2.39550135, -0.42600575]),\n", 152 | " array([-0.8772508 , -1.35440967, -2.39550135, -0.42600575]),\n", 153 | " array([-0., -0., -0., -0.])]" 154 | ] 155 | }, 156 | "execution_count": 5, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "'''\n", 163 | "Operations between Arrays and Scalars\n", 164 | "An important feature of ndarrays is they allow batch operations on data without writing any for loops. \n", 165 | "This is called vectorization.\n", 166 | "Any arithmetic operations between equal-size arrays applies the operation elementwise. \n", 167 | "'''\n", 168 | "\n", 169 | "#examples\n", 170 | "\n", 171 | "k = 4\n", 172 | "rand = np.random.randn(k)\n", 173 | "[rand, rand + rand, 2*rand, rand*np.zeros(4)]\n", 174 | "\n" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 7, 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "outputs": [ 184 | { 185 | "data": { 186 | "text/plain": [ 187 | "[array([ 0.19631415, 0.41059714, 4.26249299]),\n", 188 | " array([-1.46310809, 1.15559786, 0.10690073]),\n", 189 | " array([-1.26679394, 1.566195 , 4.36939372])]" 190 | ] 191 | }, 192 | "execution_count": 7, 193 | "metadata": {}, 194 | "output_type": "execute_result" 195 | } 196 | ], 197 | "source": [ 198 | "'''\n", 199 | "Matrix operations\n", 200 | "It is easy to do matrix operations with Nd arrays. The standard arithmetic operators don't work here though. 
And it is important \n", 201 | "to make sure matrix shapes are compatible\n", 202 | "'''\n", 203 | "\n", 204 | "k = 3\n", 205 | "r1 = np.random.randn(k)\n", 206 | "r2 = np.random.randn(k)\n", 207 | "\n", 208 | "#Matrix addition is the standard matrix operator\n", 209 | "[r1, r2 , r1 + r2]\n" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 8, 215 | "metadata": { 216 | "collapsed": false 217 | }, 218 | "outputs": [ 219 | { 220 | "data": { 221 | "text/plain": [ 222 | "[array([[ 0.19631415, 0.41059714, 4.26249299],\n", 223 | " [-1.46310809, 1.15559786, 0.10690073]]),\n", 224 | " array([[ 0.19631415, -1.46310809],\n", 225 | " [ 0.41059714, 1.15559786],\n", 226 | " [ 4.26249299, 0.10690073]])]" 227 | ] 228 | }, 229 | "execution_count": 8, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "#The Transpose can be taken with the attribute T\n", 236 | "arr2d = np.array([r1, r2])\n", 237 | "[arr2d, arr2d.T]" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 9, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [ 247 | { 248 | "data": { 249 | "text/plain": [ 250 | "[array([[ 0.19631415, 0.41059714, 4.26249299],\n", 251 | " [-1.46310809, 1.15559786, 0.10690073]]),\n", 252 | " array([[ 3.85392468e-02, 1.68590015e-01, 1.81688465e+01],\n", 253 | " [ 2.14068529e+00, 1.33540642e+00, 1.14277663e-02]]),\n", 254 | " array([[ 18.37597578, 0.64291997],\n", 255 | " [ 0.64291997, 3.48751947]])]" 256 | ] 257 | }, 258 | "execution_count": 9, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "'''\n", 265 | "Matrix multiplication, like inner products can be done on arrays.\n", 266 | "Just remember that the standard multiplication operator does elementwise multiplication (provided they are the same shape).\n", 267 | "We need the dot method in order to do an inner product\n", 268 | "\n", 269 | "Numpy has a linalg library that can run most matrix operations on ndarrays:\n", 270 | "http://docs.scipy.org/doc/numpy/reference/routines.linalg.html\n", 271 | "\n", 272 | "One can also create a matrix object and use the methods in numpy.matrix to achieve the same thing:\n", 273 | "http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html\n", 274 | "'''\n", 275 | "\n", 276 | "[arr2d, arr2d * arr2d, arr2d.dot(arr2d.T)]" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 11, 282 | "metadata": { 283 | "collapsed": false 284 | }, 285 | "outputs": [ 286 | { 287 | "name": "stdout", 288 | "output_type": "stream", 289 | "text": [ 290 | "10000 loops, best of 3: 119 µs per loop\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "'''\n", 296 | "One important feature of vectorization is that it allows elementwise processing that is much faster than writing a traditional\n", 297 | "loop.\n", 298 | "'''\n", 299 | "import math\n", 300 | "\n", 301 | "#show an example and profile i\n", 302 | "%timeit [math.sqrt(x) for x in range(1000)]" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 12, 308 | "metadata": { 309 | "collapsed": false 310 | }, 311 | "outputs": [ 312 | { 313 | "name": "stdout", 314 | "output_type": "stream", 315 | "text": [ 316 | "The slowest run took 9.83 times longer than the fastest. 
325 |   {
326 |    "cell_type": "code",
327 |    "execution_count": 16,
328 |    "metadata": {
329 |     "collapsed": false
330 |    },
331 |    "outputs": [],
332 |    "source": [
333 |     "'''\n",
334 |     "The last thing we'll cover in this module is the numpy.random library. In general, it is advised to use numpy for\n",
335 |     "random number generation as opposed to python's built in random module.\n",
336 |     "\n",
337 |     "Random number generation has many uses. One common use is generating fake (i.e. random) data to test modeling procedures\n",
338 |     "or to do Monte Carlo simulations\n",
339 |     "'''\n",
340 |     "import matplotlib.pyplot as plt\n",
341 |     "%matplotlib inline\n",
342 |     "\n",
343 |     "\n",
344 |     "#Generate random pairs that have a multivariate normal distribution\n",
345 |     "N = 1000\n",
346 |     "mu = np.array([0,0])\n",
347 |     "cov = 0.5\n",
348 |     "sig = np.array([[1, cov],[cov, 1]]) #Must be square, symmetric and non-negative definite\n",
349 |     "x, y = np.random.multivariate_normal(mu, sig, N).T\n",
350 |     "#Now let's plot and see what that looks like\n",
351 |     "\n",
352 |     "\n",
353 |     "plt.plot(x, y,'x'); plt.axis('equal'); plt.show()\n",
354 |     "\n"
355 |    ]
356 |   },
357 |   {
358 |    "cell_type": "code",
359 |    "execution_count": 18,
360 |    "metadata": {
361 |     "collapsed": false
362 |    },
363 |    "outputs": [],
364 |    "source": [
365 |     "'''\n",
366 |     "One final example (taken from Wes McKinney's book):\n",
367 |     "\n",
368 |     "Let's generate a random walk and visualize it\n",
369 |     "'''\n",
370 |     "import matplotlib.pyplot as plt\n",
371 |     "\n",
372 |     "nsteps = 1000\n",
373 |     "draws = np.random.randint(0, 2, size = nsteps) #randint lets us generate random integers in a range\n",
374 |     "steps = np.where(draws>0, 1, -1) #the where function lets us apply boolean logic to a conditional over an entire array\n",
375 |     "walk = steps.cumsum() #cumsum returns an array the same size as steps holding the cumulative sum of steps up to index i\n",
376 |     "plt.plot(np.arange(len(walk)), walk);plt.show()"
377 |    ]
378 |   },
379 |   {
380 |    "cell_type": "code",
381 |    "execution_count": 30,
382 |    "metadata": {
383 |     "collapsed": false
384 |    },
385 |    "outputs": [],
386 |    "source": []
387 |   },
388 |   {
389 |    "cell_type": "code",
390 |    "execution_count": null,
391 |    "metadata": {
392 |     "collapsed": false
393 |    },
394 |    "outputs": [],
395 |    "source": []
396 |   }
397 |  ],
398 |  "metadata": {
399 |   "kernelspec": {
400 |    "display_name": "Python 2",
401 |    "language": "python",
402 |    "name": "python2"
403 |   },
404 |   "language_info": {
405 |    "codemirror_mode": {
406 |     "name": "ipython",
407 |     "version": 2
408 |    },
409 |    "file_extension": ".py",
410 |    "mimetype": "text/x-python",
411 |    "name": "python",
412 |    "nbconvert_exporter": "python",
413 |    "pygments_lexer": "ipython2",
414 |    "version": "2.7.9"
415 |   }
416 |  },
417 |  "nbformat": 4,
418 |  "nbformat_minor": 0
419 | }
420 | 
--------------------------------------------------------------------------------
/ipython/Lecture_PandasIntro.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {
  7 |     "collapsed": false
  8 |    },
  9 | 
"outputs": [ 10 | { 11 | "data": { 12 | "text/html": [ 13 | "
\n", 14 | "\n", 15 | " \n", 16 | " \n", 17 | " \n", 18 | " \n", 19 | " \n", 20 | " \n", 21 | " \n", 22 | " \n", 23 | " \n", 24 | " \n", 25 | " \n", 26 | " \n", 27 | " \n", 28 | " \n", 29 | " \n", 30 | " \n", 31 | " \n", 32 | " \n", 33 | " \n", 34 | " \n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | "
popstateyear
0 1.5 OH 2000
1 1.7 OH 2001
2 3.6 OH 2002
3 2.4 NV 2001
4 2.9 NV 2002
\n", 56 | "
" 57 | ], 58 | "text/plain": [ 59 | " pop state year\n", 60 | "0 1.5 OH 2000\n", 61 | "1 1.7 OH 2001\n", 62 | "2 3.6 OH 2002\n", 63 | "3 2.4 NV 2001\n", 64 | "4 2.9 NV 2002" 65 | ] 66 | }, 67 | "execution_count": 1, 68 | "metadata": {}, 69 | "output_type": "execute_result" 70 | } 71 | ], 72 | "source": [ 73 | " '''\n", 74 | "Pandas is one of the most powerful contributions to python for quick and easy data analysis. Data Science is dominated by\n", 75 | "one common data structure - the table. Python never had a great native way to manipulate tables in ways that many analysts\n", 76 | "are used to (if you're at all familliar with spreadsheets or relational databases). The basic Pandas data structure is the Data \n", 77 | "Frame which, if you are an R user, should sound familliar.\n", 78 | "\n", 79 | "This module is a very high level treatment of basic data operations one typically uses when manipulating tables in Python. \n", 80 | "To really learn all of the details, refer to the book.\n", 81 | "'''\n", 82 | "\n", 83 | "#To import\n", 84 | "import pandas as pd #Its common to use pd as the abbreviation\n", 85 | "from pandas import Series, DataFrame #Wes McKinney recommends importing these separately - they are used so often and benefit from having their own namespace\n", 86 | "\n", 87 | "'''\n", 88 | "The Series - for this module we'll skip the Series (see book for details), but we will define it. A Series is a one dimensional\n", 89 | "array like object that has an array plus an index, which labels the array entries. Once we present a Data Frame, one can think\n", 90 | "of a series as similar to a Data Frame with just one column.\n", 91 | "'''\n", 92 | "\n", 93 | "'''\n", 94 | "A simple example of the DataFrame - building one from a dictionary\n", 95 | "(note for this to work each list has to be the same length)\n", 96 | "'''\n", 97 | "data = {'state':['OH', 'OH', 'OH', 'NV', 'NV'], \n", 98 | " 'year':[2000, 2001, 2002, 2001, 2002],\n", 99 | " 'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}\n", 100 | "\n", 101 | "frame = pd.DataFrame(data) #This function will turn the dict to the data frame. 
Notice that the keys become columns and an index is created\n", 102 | "\n", 103 | "frame\n" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 2, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "(0 OH\n", 117 | " 1 OH\n", 118 | " 2 OH\n", 119 | " 3 NV\n", 120 | " 4 NV\n", 121 | " Name: state, dtype: object, 0 OH\n", 122 | " 1 OH\n", 123 | " 2 OH\n", 124 | " 3 NV\n", 125 | " 4 NV\n", 126 | " Name: state, dtype: object)" 127 | ] 128 | }, 129 | "execution_count": 2, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "#To retrieve columns...use dict-like notation or use the column name as an attribute\n", 136 | "frame['state'], frame.state" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 3, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/plain": [ 149 | "( pop state year\n", 150 | " 1 1.7 OH 2001, pop 1.7\n", 151 | " state OH\n", 152 | " year 2001\n", 153 | " Name: 1, dtype: object)" 154 | ] 155 | }, 156 | "execution_count": 3, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "#To retrieve a row, you can index it like a list, or use the actual row index name using the .ix method\n", 163 | "frame[1:2], frame.ix[1]" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 4, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [ 173 | { 174 | "data": { 175 | "text/html": [ 176 | "
\n", 177 | "\n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | "
popstateyearbig_pop
0 1.5 OH 2000 False
1 1.7 OH 2001 False
2 3.6 OH 2002 True
3 2.4 NV 2001 False
4 2.9 NV 2002 False
\n", 225 | "
" 226 | ], 227 | "text/plain": [ 228 | " pop state year big_pop\n", 229 | "0 1.5 OH 2000 False\n", 230 | "1 1.7 OH 2001 False\n", 231 | "2 3.6 OH 2002 True\n", 232 | "3 2.4 NV 2001 False\n", 233 | "4 2.9 NV 2002 False" 234 | ] 235 | }, 236 | "execution_count": 4, 237 | "metadata": {}, 238 | "output_type": "execute_result" 239 | } 240 | ], 241 | "source": [ 242 | "#Assigning a new column is easy too\n", 243 | "frame['big_pop'] = (frame['pop']>3)\n", 244 | "frame" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 6, 250 | "metadata": { 251 | "collapsed": false 252 | }, 253 | "outputs": [ 254 | { 255 | "data": { 256 | "text/html": [ 257 | "
\n", 258 | "\n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | "
Rand1OrigOrd
2 0.643863 2
4 0.298427 4
7 0.118110 7
5-0.220349 5
9-0.393215 9
0-0.452396 0
3-0.502908 3
6-0.585668 6
1-0.987663 1
8-1.774430 8
\n", 319 | "
" 320 | ], 321 | "text/plain": [ 322 | " Rand1 OrigOrd\n", 323 | "2 0.643863 2\n", 324 | "4 0.298427 4\n", 325 | "7 0.118110 7\n", 326 | "5 -0.220349 5\n", 327 | "9 -0.393215 9\n", 328 | "0 -0.452396 0\n", 329 | "3 -0.502908 3\n", 330 | "6 -0.585668 6\n", 331 | "1 -0.987663 1\n", 332 | "8 -1.774430 8" 333 | ] 334 | }, 335 | "execution_count": 6, 336 | "metadata": {}, 337 | "output_type": "execute_result" 338 | } 339 | ], 340 | "source": [ 341 | "'''\n", 342 | "One operation on data that is frequent enough to highlight here is sorting\n", 343 | "'''\n", 344 | "import numpy as np\n", 345 | "\n", 346 | "df = pd.DataFrame(np.random.randn(10,1), columns = ['Rand1'])\n", 347 | "df['OrigOrd'] = df.index.values\n", 348 | "df = df.sort_index(by = 'Rand1', ascending = False) #Sorting by a particular column\n", 349 | "df" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 7, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/html": [ 362 | "
\n", 363 | "\n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | "
Rand1OrigOrd
0-0.452396 0
1-0.987663 1
2 0.643863 2
3-0.502908 3
4 0.298427 4
5-0.220349 5
6-0.585668 6
7 0.118110 7
8-1.774430 8
9-0.393215 9
\n", 424 | "
" 425 | ], 426 | "text/plain": [ 427 | " Rand1 OrigOrd\n", 428 | "0 -0.452396 0\n", 429 | "1 -0.987663 1\n", 430 | "2 0.643863 2\n", 431 | "3 -0.502908 3\n", 432 | "4 0.298427 4\n", 433 | "5 -0.220349 5\n", 434 | "6 -0.585668 6\n", 435 | "7 0.118110 7\n", 436 | "8 -1.774430 8\n", 437 | "9 -0.393215 9" 438 | ] 439 | }, 440 | "execution_count": 7, 441 | "metadata": {}, 442 | "output_type": "execute_result" 443 | } 444 | ], 445 | "source": [ 446 | "df = df.sort_index() #Now sorting back, using the index\n", 447 | "df" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 9, 453 | "metadata": { 454 | "collapsed": false 455 | }, 456 | "outputs": [ 457 | { 458 | "data": { 459 | "text/html": [ 460 | "
\n", 461 | "\n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | "
keyrand_floatrand_int
0 a-1.253479 2
1 b 1.285238 1
2 c 0.312605 2
3 d-1.128266 2
4 e-0.096089 0
5 f-0.920994 3
6 g-1.790160 4
7 h-0.273019 4
8 i-0.922280 3
9 j-0.063663 0
\n", 533 | "
" 534 | ], 535 | "text/plain": [ 536 | " key rand_float rand_int\n", 537 | "0 a -1.253479 2\n", 538 | "1 b 1.285238 1\n", 539 | "2 c 0.312605 2\n", 540 | "3 d -1.128266 2\n", 541 | "4 e -0.096089 0\n", 542 | "5 f -0.920994 3\n", 543 | "6 g -1.790160 4\n", 544 | "7 h -0.273019 4\n", 545 | "8 i -0.922280 3\n", 546 | "9 j -0.063663 0" 547 | ] 548 | }, 549 | "execution_count": 9, 550 | "metadata": {}, 551 | "output_type": "execute_result" 552 | } 553 | ], 554 | "source": [ 555 | "'''\n", 556 | "Some of the real power we are after is the ability to condense, merge and concatenate data sets. This is where we\n", 557 | "want Python to have the same data munging functionality we usually get from executing SQL statements on relational\n", 558 | "databases.\n", 559 | "'''\n", 560 | "\n", 561 | "alpha = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']\n", 562 | "df1 = DataFrame({'rand_float':np.random.randn(10), 'key':alpha})\n", 563 | "df2 = DataFrame({'rand_int':np.random.randint(0, 5, size = 10), 'key':alpha})\n", 564 | "\n", 565 | "'''\n", 566 | "So we have two dataframes that share indexes (in this case all of them). We want to combine them. In sql we would execute\n", 567 | "a join, such as Select * from table1 a join table2 b on a.key=b.key;\n", 568 | "'''\n", 569 | "df_merge = pd.merge(df1,df2,on='key')\n", 570 | "df_merge" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": 10, 576 | "metadata": { 577 | "collapsed": false 578 | }, 579 | "outputs": [ 580 | { 581 | "data": { 582 | "text/html": [ 583 | "
\n", 584 | "\n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | "
rand_float
rand_int
0-0.079876
1 1.285238
2-0.689713
3-0.921637
4-1.031590
\n", 618 | "
" 619 | ], 620 | "text/plain": [ 621 | " rand_float\n", 622 | "rand_int \n", 623 | "0 -0.079876\n", 624 | "1 1.285238\n", 625 | "2 -0.689713\n", 626 | "3 -0.921637\n", 627 | "4 -1.031590" 628 | ] 629 | }, 630 | "execution_count": 10, 631 | "metadata": {}, 632 | "output_type": "execute_result" 633 | } 634 | ], 635 | "source": [ 636 | "'''\n", 637 | "Now that we have this merged table, we might want to summarize it within a key grouping\n", 638 | "'''\n", 639 | "df_merge.groupby('rand_int').mean()" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": 11, 645 | "metadata": { 646 | "collapsed": false 647 | }, 648 | "outputs": [ 649 | { 650 | "data": { 651 | "text/html": [ 652 | "
\n", 653 | "\n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | "
rand_float
summeanlenstd
rand_int
0-0.159752-0.079876 2 0.022929
1 1.285238 1.285238 1 NaN
2-2.069140-0.689713 3 0.870288
3-1.843274-0.921637 2 0.000909
4-2.063179-1.031590 2 1.072780
\n", 712 | "
" 713 | ], 714 | "text/plain": [ 715 | " rand_float \n", 716 | " sum mean len std\n", 717 | "rand_int \n", 718 | "0 -0.159752 -0.079876 2 0.022929\n", 719 | "1 1.285238 1.285238 1 NaN\n", 720 | "2 -2.069140 -0.689713 3 0.870288\n", 721 | "3 -1.843274 -0.921637 2 0.000909\n", 722 | "4 -2.063179 -1.031590 2 1.072780" 723 | ] 724 | }, 725 | "execution_count": 11, 726 | "metadata": {}, 727 | "output_type": "execute_result" 728 | } 729 | ], 730 | "source": [ 731 | "'''\n", 732 | "You can have multiple aggregation functions, but the syntax isn't the same\n", 733 | "'''\n", 734 | "df_merge.groupby('rand_int').agg([np.sum, np.mean, len, np.std])" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": null, 740 | "metadata": { 741 | "collapsed": false 742 | }, 743 | "outputs": [], 744 | "source": [] 745 | } 746 | ], 747 | "metadata": { 748 | "kernelspec": { 749 | "display_name": "Python 2", 750 | "language": "python", 751 | "name": "python2" 752 | }, 753 | "language_info": { 754 | "codemirror_mode": { 755 | "name": "ipython", 756 | "version": 2 757 | }, 758 | "file_extension": ".py", 759 | "mimetype": "text/x-python", 760 | "name": "python", 761 | "nbconvert_exporter": "python", 762 | "pygments_lexer": "ipython2", 763 | "version": "2.7.9" 764 | } 765 | }, 766 | "nbformat": 4, 767 | "nbformat_minor": 0 768 | } 769 | -------------------------------------------------------------------------------- /ipython/Lecture_SimpleiPythonExample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "from numpy import math\n", 14 | "\n", 15 | "loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')\n", 16 | "\n", 17 | "#Note, a schema can be found here : https://github.com/herrfz/dataanalysis/blob/master/assignment1/Assignment1.pdf" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": { 24 | "collapsed": false 25 | }, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "array(['Amount.Requested', 'Amount.Funded.By.Investors', 'Interest.Rate',\n", 31 | " 'Loan.Length', 'Loan.Purpose', 'Debt.To.Income.Ratio', 'State',\n", 32 | " 'Home.Ownership', 'Monthly.Income', 'FICO.Range',\n", 33 | " 'Open.CREDIT.Lines', 'Revolving.CREDIT.Balance',\n", 34 | " 'Inquiries.in.the.Last.6.Months', 'Employment.Length'], dtype=object)" 35 | ] 36 | }, 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "output_type": "execute_result" 40 | } 41 | ], 42 | "source": [ 43 | "loansData.columns.values" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 3, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/plain": [ 56 | "81174 6541.67\n", 57 | "99592 4583.33\n", 58 | "80059 11500.00\n", 59 | "15825 3833.33\n", 60 | "33182 3195.00\n", 61 | "Name: Monthly.Income, dtype: float64" 62 | ] 63 | }, 64 | "execution_count": 3, 65 | "metadata": {}, 66 | "output_type": "execute_result" 67 | } 68 | ], 69 | "source": [ 70 | "loansData['Monthly.Income'][0:5] # first five rows of Interest.Rate" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYsAAAEKCAYAAADjDHn2AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X1UVHX+B/D3uFBpoiLJADPmKM8jCJbhQ9lSLhqlZOki\nWIJlZdq22mrH1dpf1pbQ9qQ90Om0uJKm2Nmzie0KoinbHlMsE7dAV7QhHmNNxcAURD6/P4i7Eg9j\nCpe5X9+vc+bofZh7v+/B5sN8P/dOJhEREBERdaJXTw+AiIhcH4sFERE5xWJBREROsVgQEZFTLBZE\nROQUiwURETnFYnGFCQsLwyeffNLTw+hRH374IQYPHgwPDw8cOHCgp4cDACgpKUGvXr3Q1NTU7vbl\ny5dj1qxZOo+K6H9YLBRis9nw8ccft1q3Zs0ajB8/Xlv+6quvcOutt3Z6HGdvXEa3ePFipKWloba2\nFhEREW229+rVC2azGefPn9fWnTt3Dt7e3ujVq2v+k7HZbNixY8dF728ymS7pPKr/LEk/LBYKMZlM\nl/ym0p7uul/zwjdhvYkISktLYbfbO91v4MCByM7O1pazs7MxcODALnt9TSbTz3p9ee8s9TQWC8X9\n9M3twt9o9+7di1GjRqF///7w8fHB4sWLAUD75DFgwAB4eHggPz8fIoLnn38eNpsNZrMZycnJ+P77\n77XjvvfeexgyZAiuu+46bb+W8yxfvhzTp0/HrFmz0L9/f2RkZOCzzz7D2LFj4enpCT8/Pzz++OM4\nd+6cdrxevXrh7bffRmBgIPr164f/+7//w9GjRzF27FgMGDAACQkJrfa/UEdjra+vh4eHB86fP4+I\niAgEBgZ2+LrNmjUL7733Xqt8SUlJrd60KysrERcXBy8vLwQGBuLPf/6ztm358uWIj49HcnIy+vXr\nh7CwMOzbt087dmlpKaZMmQIPDw+8/PLL2vPWrVuHIUOGYNCgQVixYkW7P8u77roLb775ZqttI0aM\nQFZWVod5WsyePRuPPfYYJk+ejH79+mHMmDH4+uuvte2FhYWIiYmBl5cXfHx8kJKSAgCor6/HwoUL\nYbFYYLFY8MQTT6ChoQEAkJeXB6vVipdeegne3t7w8/PDpk2bsGXLFgQFBcHLywupqanaOUQEqamp\nCAgIwHXXXYcZM2bg5MmTTsdOPUxIGTabTbZv395q3V/+8he55ZZbWu3z8ccfi4jImDFjZN26dSIi\ncvr0admzZ4+IiJSUlIjJZJLz589rz0tPT5eAgABxOBxSV1cn9957r8yaNUtERAoLC6Vv376ya9cu\naWhokMWLF4u7u7t2nmeeeUbc3d0lKytLRETOnDkj+/btk/z8fDl//ryUlJRIaGiorFy5UjufyWSS\nqVOnSm1trRQWFspVV10lt912mzgcDjl16pTY7XbJyMho93XobKwtxz569GiHr6PJZJKvvvpKzGaz\nnDp1Sk6cOCFms1m++uorMZlM2n7jx4+Xxx57TOrr66WgoEAGDRokO3bs0DJfc801kp2dLU1NTbJ0\n6VIZM2ZMuz8HERGHwyEmk0keeeQROXv2rBw4cECuvvpqOXTokHa8+++/X0REPvjgAxk9erT23IKC\nAvHy8pJz5861ydJy3JafZXJysnh5eclnn30mjY2Nct9990lCQoKIiHz//ffi4+Mjr776qtTX10tt\nba3k5+eLiMgf/vAHGTt2rBw7dkyOHTsm48aNkz/84Q8iIrJz505xc3OTP/7xj9LY2CjvvvuueHl5\nycyZM6Wurk4KCwuld+/eUlJSIiIiK1eulLFjx0pFRYU0NDTI3LlzJTExscOfB7kGFguFDBkyRPr2\n7SsDBgzQHn369JHx48dr+1z4JnXrrbfKM888I8eOHWt1nJ++wYiI3H777fL2229ry//5z3/E3d1d\nGhsb5dlnn5WZM2dq23744Qe56qqrWhWLX/7yl52O/bXXXpN77rlHWzaZTPLpp59qyzfeeKP86U9/\n0pYXLVokCxcubPdYHY21Jc/FFIsjR47IQw89JO+88468/fbb8sgjj8iRI0e0YlFaWiq/+MUvpK6u\nTnve0qVLZfbs2VrmmJgYbVvLG2aLjopFRUWFti4qKko2btyoHa+lWJw5c0Y8PT3lyJEj2mvx2GOP\ntZvlpz/L2bNny8MPP6xt37Jli4SEhIiIyPr16+WGG25o9zj+/v6SnZ2tLW/dulVsNpuINBeL3r17\nS1NTk4g0Fx2TySR79+7V9r/xxhu1XxZCQkJaZa+srGz18yHXxGkohZhMJmRlZeHkyZPaIy0trcP5\n7vT0dBw+fBihoaGIiorCP/7xjw6PXVVVhSFDhmjL119/PRobG1FdXY2qqipYrVZtW+/eveHl5dXq\n+RduB4DDhw9j8uTJ8PX1Rf/+/fHUU0/h+PHjrfYxm82tjvnT5bq6up891otlMpmQlJSEjIwMrF27\ntt0pqIEDB+Laa69tdZ6Kiop2x9+nTx+cPXvWaaPZx8en1XPay3jNNdcgPj4ea9euhYggMzPzZ10p\n1dHrWFZWhmHDhrX7nMrKyjavaWVlpbbs5eWlTZP17t270/N88803uOeee+Dp6QlPT0/Y7Xa4ubn9\nrJ8P6Y/FQnEdFQoACAgIwPr163Hs2DEsWbIE06dPx5kzZ9pt4vr5+aGkpERbLi0thZubG3x8fODr\n64vy8nJt25kzZ9q88f/0mPPmzYPdbseRI0dw6tQpvPDCC112xU5HY73wzetijB8/Ht9++y3++9//\n4uabb25zjhMnTrR6My8tLW1TFDtyuY3y5ORkvP/++9i+fTv69OmD0aNHX9bxgOYCcGH/4kLtvaZ+\nfn6XfJ6cnJxWv9T88MMP8PX1vaTjkT5YLK5g69atw7FjxwAA/fv3h8lkQq9evTBo0CD06tULR48e\n1fZNTEzEa6+9hpKSEtTV1WHZsmVISEhAr169MG3aNHz00UfYvXs3GhoasHz5cqdX79TV1cHDwwN9\n+vTBoUOH8Pbbbzsd74XH7Oz4nY315/roo4+wefPmNusHDx6McePGYenSpaivr8e///1vrF69Gvff\nf/9FHddsNrd6fX+usWPHwmQyYfHixUhKSrro53X2ut11112oqqrCqlWrUF9fj9raWuzduxdA82v6\n/PPP47vvvsN3332H55577pLv+3j00UexbNkylJaWAgCOHTvW7mtMroXFQnGdXU67detWhIWFwcPD\nA0888QQyMzNx9dVXo0+fPnjqqadw8803w9PTE3v37sWDDz6IWbNm4dZbb8WwYcPQp08fvPHGGwCA\n4cOH44033kBCQgL8/Pzg4eEBb29vXH311R2O4eWXX8b69evRr18/PPLII0hISGi1T3tj/un2jnJ1\nNtaOjt3Reex2O0JDQ9vdtmHDBpSUlMDPzw/33nsvnnvuOdx+++0dju/C5aVLl+L555+Hp6cnXn31\nVafjau94SUlJ+PLLL50WKGevW8uyh4cHtm3bho8++gi+
vr4ICgpCXl4eAODpp5/GqFGjMGLECIwY\nMQKjRo3C008/3e45nGVZsGAB4uLiMHHiRPTr1w9jx47VihK5sM4aGqWlpRIdHS12u12GDx8uq1at\nEpHmZpvFYpHIyEiJjIyULVu2aM9ZsWKFBAQESHBwsGzdulVb//nnn0tYWJgEBATIb3/72y5uvZAr\nqa2tFTc3N+3qF+oe7733XquLF4i6U6fFoqqqSvbv3y8izW8AQUFBUlRUJMuXL5dXXnmlzf6FhYUS\nEREhDQ0N4nA4xN/fX7tC4qabbtIuw4uNjW11ZQUZ3+bNm+X06dNSV1cnc+fO7fCqGuoap0+fltGj\nR8vatWt7eih0heh0GsrHxweRkZEAgL59+yI0NFS72kPamfvMyspCYmIi3N3dYbPZEBAQgPz8fFRV\nVaG2thZRUVEAmj8+b9q0qas/JFEP2rx5s3bD1tGjR5GZmdnTQ1LW1q1b4e3tDV9fX8ycObOnh0NX\niIvuWZSUlGD//v0YM2YMAOCNN95AREQE5syZg5qaGgDNl9ddeDWI1WpFRUVFm/UWi6XVJYZkfO++\n+y5OnjyJmpoabNu2rdO7o+nyTJo0CXV1dfjwww+77LuqiJy5qH9pdXV1mD59OlatWoW+ffti3rx5\ncDgcKCgogK+vLxYtWtTd4yQioh7k5myHc+fOYdq0abj//vsxdepUAIC3t7e2/aGHHsKUKVMANH9i\nKCsr07aVl5fDarXCYrG0ug6/vLwcFoulzbksFkurG32IiMg5f39/HDlypFvP0eknCxHBnDlzYLfb\nsXDhQm19VVWV9vcPP/wQ4eHhAIC4uDhkZmaioaEBDocDxcXFiIqKgo+PD/r166d9Id3atWu1wnOh\nyspKSHPTXcnHM8880+NjYD7mu9KyXQn5LueenYvV6SeLXbt2Yd26dRgxYgRGjhwJAFixYgU2bNiA\ngoICmEwmDB06FO+88w6A5mvS4+Pjtdv309LStOut09LSMHv2bJw5cwZ33nkn7rjjjm6O5nouvANW\nRcxnXCpnA9TPp4dOi8Utt9zS7lcwxMbGdvicZcuWYdmyZW3W33jjjfjyyy8vYYhERNTTeCmFjmbP\nnt3TQ+hWzGdcKmcD1M+nB5OIuMz/guvn/t/DiIhIn/dOfrLQUcv37KiK+YxL5WyA+vn0wGJBRERO\ncRqKiMjgOA1FREQugcVCR6rPmzKfcamcDVA/nx5YLIiIyCn2LIiIDE6P906nXySot4MHD6KgoEDX\ncw4ZMgTjxo3T9ZxEREbicp8sJk9OwPbt38DdfYgu52xq+h6enl+jrOxgt58rLy8P0dHR3X6ensJ8\nxqVyNkD9fFfkJ4vz5wVnz/4WZ88m6HTGQ+jfv+034BIR0f+43CeL2NgZyM6eCkC/YuHnNxUVFYd0\nOh8RUdfifRZEROQSWCx0pPq13sxnXCpnA9TPpwcWCyIicoo9C/YsiMjg2LMgIiKXwGKhI9XnTZnP\nuFTOBqifTw8sFkRE5BR7FuxZEJHBsWdBREQugcVCR6rPmzKfcamcDVA/nx5YLIiIyCn2LNizICKD\nY8+CiIhcAouFjlSfN2U+41I5G6B+Pj2wWBARkVPsWbBnQUQGx54FERG5BBYLHak+b8p8xqVyNkD9\nfHpgsSAiIqfYs2DPgogMjj0LIiJyCSwWOlJ93pT5jEvlbID6+fTAYkFERE51WizKyspw2223Yfjw\n4QgLC8Prr78OADhx4gRiYmIQFBSEiRMnoqamRntOSkoKAgMDERISgtzcXG39vn37EB4ejsDAQCxY\nsKCb4ri26Ojonh5Ct2I+41I5G6B+Pj10Wizc3d3x2muvobCwEHv27MFbb72FgwcPIjU1FTExMTh8\n+DAmTJiA1NRUAEBRURE2btyIoqIi5OTkYP78+VrTZd68eUhPT0dxcTGKi4uRk5PT/emIiKhLdFos\nfHx8EBkZCQDo27cvQkNDUVFRgc2bNyM5ORkAkJycjE2bNgEAsrKykJiYCHd3d9hsNgQEBCA/Px9V\nVVWora1FVFQUACApKUl7zpVE9XlT5jMulbMB6ufTw0X3LEpKSrB//36MHj0a1dXVMJvNAACz2Yzq\n6moAQGVlJaxWq/Ycq9WKioqKNustFgsqKiq6KgMREXUzt4vZqa6uDtOmTcOqVavg4eHRapvJZILJ\nZOqyAR04kA+gEcAhAAMARAKI/nFr3o9/duVyqXbult8+WuY3u3q5ZV13Hb+nl5nPuMvR0dEuNR7m\n63w5Ly8Pa9asAQDYbDboQpxoaGiQiRMnymuvvaatCw4OlqqqKhERqayslODgYBERSUlJkZSUFG2/\nSZMmyZ49e6SqqkpCQkK09evXr5e5c+e2ORcAiY2dIcAGAUSnx0Hx8wt29jIQEbmsi3grv2ydTkOJ\nCObMmQO73Y6FCxdq6+Pi4pCRkQEAyMjIwNSpU7X1mZmZaGhogMPhQHFxMaKiouDj44N+/fohPz8f\nIoK1a9dqz7mStPxmoCrmMy6VswHq59NDp9NQu3btwrp16zBixAiMHDkSQPOlsb///e8RHx+P9PR0\n2Gw2fPDBBwAAu92O+Ph42O12uLm5IS0tTZuiSktLw+zZs3HmzBnceeeduOOOO7o5GhERdRV+NxS/\nG4qIDI7fDUVERC6BxUJHqs+bMp9xqZwNUD+fHlgsiIjIKfYs2LMgIoNjz4KIiFwCi4WOVJ83ZT7j\nUjkboH4+PbBYEBGRU+xZsGdBRAbHngUREbkEFgsdqT5vynzGpXI2QP18emCxICIip9izYM+CiAyO\nPQsiInIJLBY6Un3elPmMS+VsgPr59MBiQURETrFnwZ4FERkcexZEROQSWCx0pPq8KfMZl8rZAPXz\n6YHFgoiInGLPgj0LIjI49iyIiMglsFjoSPV5U+YzLpWzAern0wOLBREROcWeBXsWRGRw7FkQEZFL\nYLHQkerzpsxnXCpnA9TPpwcWCyIicoo9C/YsiMjg2LMgIiKXwGKhI9XnTZnPuFTOBqifTw8sFkRE\n5BR7FuxZEJHBsWdBREQugcVCR6rPmzKfcamcDVA/nx5YLIiIyCmnxeLBBx+E2WxGeHi4tm758uWw\nWq0YOXIkRo4ciezsbG1bSkoKAgMDERISgtzcXG39vn37EB4ejsDAQCxYsKCLYxhDdHR0Tw+hWzGf\ncamcDVA/nx6cFosHHngAOTk5rdaZTCb87ne/w/79+7F//37ExsYCAIqKirBx40YUFRUhJycH8+fP\n15ou8+bNQ3p6OoqLi1FcXNzmmERE5LqcFovx48fD09Ozzfr2Ou9ZWVlITEyEu7s7bDYbAgICkJ+f\nj6qqKtTW1iIqKgoAkJSUhE2bNnXB8I1F9XlT5jMulbMB6ufTwyX3LN544w1ERERgzpw5qKmpAQBU\nVlbCarVq+1itVlRUVLRZb7FYUFFRcRnDJiIiPV1SsZg3bx4cDgcKCgrg6+uLRYsWdfW4lKT6vCnz\nGZfK2QD18+n
B7VKe5O3trf39oYcewpQpUwA0f2IoKyvTtpWXl8NqtcJisaC8vLzVeovF0u6xDxzI\nB9AI4BCAAQAiAUT/uDXvxz+7crlUO3fLR9WWf1hc5jKXueyKy3l5eVizZg0AwGazQRdyERwOh4SF\nhWnLlZWV2t9fffVVSUxMFBGRwsJCiYiIkPr6evn6669l2LBh0tTUJCIiUVFRsmfPHmlqapLY2FjJ\nzs5ucx4AEhs7Q4ANAohOj4Pi5xd8MS/DZdu5c6cu5+kpzGdcKmcTUT/fRb6VXxannywSExPxz3/+\nE9999x0GDx6MZ599Fnl5eSgoKIDJZMLQoUPxzjvvAADsdjvi4+Nht9vh5uaGtLQ0mEwmAEBaWhpm\nz56NM2fO4M4778Qdd9zRnTWQiIi6EL8bit8NRUQGx++GIiIil8BioaOWBpWqmM+4VM4GqJ9PDywW\nRETkFHsW7FkQkcGxZ0FERC6BxUJHqs+bMp9xqZwNUD+fHlgsiIjIKfYs2LMgIoNjz4KIiFwCi4WO\nVJ83ZT7jUjkboH4+PbBYEBGRU+xZsGdBRAbHngUREbkEFgsdqT5vynzGpXI2QP18emCxICIip9iz\nYM+CiAyOPQsiInIJLBY6Un3elPmMS+VsgPr59MBiQURETrFnwZ4FERkcexZEROQSWCx0pPq8KfMZ\nl8rZAPXz6YHFgoiInGLPgj0LIjI49iyIiMglsFjoSPV5U+YzLpWzAern0wOLBREROcWeBXsWRGRw\n7FkQEZFLYLHQkerzpsxnXCpnA9TPpwcWCyIicoo9C/YsiMjg2LMgIiKXwGKhI9XnTZnPuFTOBqif\nTw8sFkRE5BR7FuxZEJHBuUTP4sEHH4TZbEZ4eLi27sSJE4iJiUFQUBAmTpyImpoabVtKSgoCAwMR\nEhKC3Nxcbf2+ffsQHh6OwMBALFiwoItjEBFRd3JaLB544AHk5OS0WpeamoqYmBgcPnwYEyZMQGpq\nKgCgqKgIGzduRFFREXJycjB//nyt2s2bNw/p6ekoLi5GcXFxm2NeCVSfN2U+41I5G6B+Pj04LRbj\nx4+Hp6dnq3WbN29GcnIyACA5ORmbNm0CAGRlZSExMRHu7u6w2WwICAhAfn4+qqqqUFtbi6ioKABA\nUlKS9hwiInJ9l9Tgrq6uhtlsBgCYzWZUV1cDACorK2G1WrX9rFYrKioq2qy3WCyoqKi4nHEbUnR0\ndE8PoVsxn3GpnA1QP58eLvtqKJPJBJPJ1BVjISIiF+V2KU8ym8349ttv4ePjg6qqKnh7ewNo/sRQ\nVlam7VdeXg6r1QqLxYLy8vJW6y0WS7vHPnAgH0AjgEMABgCIBBD949a8H//syuVS7dwt85otv4V0\n9fLKlSsRGRnZbcfv6WXmM+7yhXP6rjAe5nOeZ82aNQAAm80GXchFcDgcEhYWpi0/+eSTkpqaKiIi\nKSkpsmTJEhERKSwslIiICKmvr5evv/5ahg0bJk1NTSIiEhUVJXv27JGmpiaJjY2V7OzsNucBILGx\nMwTYIIDo9Dgofn7BF/MyXLadO3fqcp6ewnzGpXI2EfXzXeRb+eWdw9kOCQkJ4uvrK+7u7mK1WmX1\n6tVy/PhxmTBhggQGBkpMTIycPHlS2/+FF14Qf39/CQ4OlpycHG39559/LmFhYeLv7y+PP/54+4NR\nvFgQEXUHPYoFb8rjTXlEZHAucVMedZ0L501VxHzGpXI2QP18emCxICIipzgNxWkoIjI4TkMREZFL\nYLHQkerzpsxnXCpnA9TPpwcWCyIicoo9C/YsiMjg2LMgIiKXwGKhI9XnTZnPuFTOBqifTw8sFkRE\n5BR7FuxZEJHBsWdBREQugcVCR6rPmzKfcamcDVA/nx5YLIiIyCn2LNizICKDY8+CiIhcAouFjlSf\nN2U+41I5G6B+Pj2wWBARkVPsWbBnQUQGx54FERG5BBYLHak+b8p8xqVyNkD9fHpgsSAiIqfYs2DP\ngogMjj0LIiJyCSwWOlJ93pT5jEvlbID6+fTAYkFERE6xZ8GeBREZHHsWRETkElgsdKT6vCnzGZfK\n2QD18+mBxYKIiJxiz4I9CyIyOPYsiIjIJbBY6Ej1eVPmMy6VswHq59MDiwURETnFngV7FkRkcOxZ\nEBGRS7isYmGz2TBixAiMHDkSUVFRAIATJ04gJiYGQUFBmDhxImpqarT9U1JSEBgYiJCQEOTm5l7e\nyA1I9XlT5jMulbMB6ufTw2UVC5PJhLy8POzfvx979+4FAKSmpiImJgaHDx/GhAkTkJqaCgAoKirC\nxo0bUVRUhJycHMyfPx9NTU2Xn4CIiLrdZU9D/XSebPPmzUhOTgYAJCcnY9OmTQCArKwsJCYmwt3d\nHTabDQEBAVqBuVJER0f39BC6FfMZl8rZAPXz6eGyP1n86le/wqhRo/Duu+8CAKqrq2E2mwEAZrMZ\n1dXVAIDKykpYrVbtuVarFRUVFZdzeiIi0onb5Tx5165d8PX1xbFjxxATE4OQkJBW200mE0wmU4fP\nb2/bgQP5ABoBHAIwAEAkgOgft+b9+GdXLpdq526Z12z5LaSrl1euXInIyMhuO35PLzOfcZcvnNN3\nhfEwn/M8a9asAdDcO9aFdJHly5fLyy+/LMHBwVJVVSUiIpWVlRIcHCwiIikpKZKSkqLtP2nSJNmz\nZ0+rYwCQ2NgZAmwQQHR6HBQ/v+Cuehk6tXPnTl3O01OYz7hUziaifr4ufCvv0CVPQ/3www+ora0F\nAJw+fRq5ubkIDw9HXFwcMjIyAAAZGRmYOnUqACAuLg6ZmZloaGiAw+FAcXGxdgXVlaLlNwRVMZ9x\nqZwNUD+fHi55Gqq6uhr33HMPAKCxsRH33XcfJk6ciFGjRiE+Ph7p6emw2Wz44IMPAAB2ux3x8fGw\n2+1wc3NDWlpap1NURETkOngHt453cOfl5Sn9Gw7zGZfK2QD18/EObiIicgn8ZMHvhiIig+MnCyIi\ncgksFjq68FpvFTGfcamcDVA/nx5YLIiIyCn2LNizICKDY8+CiIhcAouFjlSfN2U+41I5G6B+Pj2w\nWBARkVPsWbBnQUQGx54FERG5BBYLHak+b8p8xqVyNkD9fHpgsSAiIqfYs2DPgogMjj0LIiJyCSwW\nOlJ93pT5jEvlbID6+fTAYkFERE6xZ8GeBREZHHsWRETkElgsdKT6vCnzGZfK2QD18+mBxYKIiJxi\nz4I9CyIyOPYsiIjIJbBY6Ej1eVPmMy6VswHq59MDiwURETnFngV7FkRkcHr0LNy69egGUVXlgMlk\n0vWcHh6e+P77E7qek4joUnEaCoBIAwDR4bFT+3tt7Ul9wulI9XlhlfOpnA1QP58eWCyIiMgp9ixw\nCEAomn/j11P3zzES0ZWB91kQEZFLYLHQVV5PD6BbqT4vrHI+lbMB6ufTA4sFERE5xZ4FexZEZHDs\nWRARkUvQtVjk5OQgJCQEgYGBePHFF/U8tYvI6+kBdCvV54VVzqdyNkD9
fHrQrVicP38ev/nNb5CT\nk4OioiJs2LABBw8e1Ov0LqKgpwfQrQoKmM+oVM4GqJ9PD7oVi7179yIgIAA2mw3u7u5ISEhAVlaW\nXqd3ETU9PYBuVVPDfEalcjZA/Xx60K1YVFRUYPDgwdqy1WpFRUWFXqcnIqLLoNsXCV7sF/W5u/dC\n795/grv7+908omZNTbWoq9PlVABKLvi7m65fXqjHFxeWlJR06/F7msr5VM4GqJ9PF6KT3bt3y6RJ\nk7TlFStWSGpqaqt9/P399fg2Pz744IMPpR7+/v7d/h6u230WjY2NCA4Oxscffww/Pz9ERUVhw4YN\nCA0N1eP0RER0GXSbhnJzc8Obb76JSZMm4fz585gzZw4LBRGRQbjUHdxEROSaXOYObqPcsFdWVobb\nbrsNw4cPR1hYGF5//XUAwIkTJxATE4OgoCBMnDix1aV6KSkpCAwMREhICHJzc7X1+/btQ3h4OAID\nA7FgwQJtfX19PWbMmIHAwECMGTMG33zzjX4B0XxPzMiRIzFlyhQAamWrqanB9OnTERoaCrvdjvz8\nfKXypaSkYPjw4QgPD8fMmTNRX19v6HwPPvggzGYzwsPDtXV65cnIyEBQUBCCgoLw3nvv6ZbvySef\nRGhoKCKR9cbvAAAFhUlEQVQiInDvvffi1KlTrpGv27siF6GxsVH8/f3F4XBIQ0ODRERESFFRUU8P\nq11VVVWyf/9+ERGpra2VoKAgKSoqkieffFJefPFFERFJTU2VJUuWiIhIYWGhRERESENDgzgcDvH3\n95empiYREbnpppskPz9fRERiY2MlOztbRETeeustmTdvnoiIZGZmyowZM3TN+Morr8jMmTNlypQp\nIiJKZUtKSpL09HQRETl37pzU1NQok8/hcMjQoUPl7NmzIiISHx8va9asMXS+Tz75RL744gsJCwvT\n1umR5/jx4zJs2DA5efKknDx5Uvu7Hvlyc3Pl/PnzIiKyZMkSl8nnEsXi008/bXWlVEpKiqSkpPTg\niC7e3XffLdu2bZPg4GD59ttvRaS5oAQHB4tI26u+Jk2aJLt375bKykoJCQnR1m/YsEHmzp2r7bNn\nzx4RaX5Du+666/SKI2VlZTJhwgTZsWOHTJ48WUREmWw1NTUydOjQNutVyXf8+HEJCgqSEydOyLlz\n52Ty5MmSm5tr+HwOh6PVm6keedavXy+PPvqo9py5c+fKhg0bdMl3ob/97W9y3333iUjP53OJaSij\n3rBXUlKC/fv3Y/To0aiurobZbAYAmM1mVFdXAwAqKythtVq157Rk++l6i8WiZb7w9XBzc0P//v1x\n4kT33iPR4oknnsBLL72EXr3+909DlWwOhwODBg3CAw88gBtuuAEPP/wwTp8+rUy+gQMHYtGiRbj+\n+uvh5+eHAQMGICYmRpl8Lbo7z/Hjxzs8lt5Wr16NO++8E0DP53OJYqHnzWldpa6uDtOmTcOqVavg\n4eHRapvJZDJkpr///e/w9vbGyJEjO/y6Y6NmA5ov3/7iiy8wf/58fPHFF7j22muRmpraah8j5zt6\n9ChWrlyJkpISVFZWoq6uDuvWrWu1j5HztUe1PBd64YUXcNVVV2HmzJk9PRQALlIsLBYLysrKtOWy\nsrJWVc/VnDt3DtOmTcOsWbMwdepUAM2/4Xz77bcAgKqqKnh7ewNom628vBxWqxUWiwXl5eVt1rc8\np7S0FEDzG9ypU6cwcODAbs/16aefYvPmzRg6dCgSExOxY8cOzJo1S4lsQPNvT1arFTfddBMAYPr0\n6fjiiy/g4+OjRL7PP/8c48aNg5eXF9zc3HDvvfdi9+7dyuRr0d3/Hr28vHr8PWnNmjXYsmUL3n//\nf99k0dP5XKJYjBo1CsXFxSgpKUFDQwM2btyIuLi4nh5Wu0QEc+bMgd1ux8KFC7X1cXFxyMjIANB8\nlUFLEYmLi0NmZiYaGhrgcDhQXFyMqKgo+Pj4oF+/fsjPz4eIYO3atbj77rvbHOuvf/0rJkyYoEu2\nFStWoKysDA6HA5mZmbj99tuxdu1aJbIBgI+PDwYPHozDhw8DALZv347hw4djypQpSuQLCQnBnj17\ncObMGYgItm/fDrvdrky+Fnr8e5w4cSJyc3NRU1ODkydPYtu2bZg0aZIu+XJycvDSSy8hKysL11xz\nTavcPZrv4tsw3WvLli0SFBQk/v7+smLFip4eTof+9a9/iclkkoiICImMjJTIyEjJzs6W48ePy4QJ\nEyQwMFBiYmJaXVnwwgsviL+/vwQHB0tOTo62/vPPP5ewsDDx9/eXxx9/XFt/9uxZ+fWvfy0BAQEy\nevRocTgcekYUEZG8vDztaiiVshUUFMioUaNkxIgRcs8990hNTY1S+V588UWx2+0SFhYmSUlJ0tDQ\nYOh8CQkJ4uvrK+7u7mK1WmX16tW65Vm9erUEBARIQECArFmzRpd86enpEhAQINdff732/tJyNVNP\n5+NNeURE5JRLTEMREZFrY7EgIiKnWCyIiMgpFgsiInKKxYKIiJxisSAiIqdYLIiIyCkWCyIicur/\nAS6yRgLVeDXAAAAAAElFTkSuQmCC\n", 83 | "text/plain": [ 84 | "" 85 | ] 86 | }, 87 | "metadata": {}, 88 | "output_type": "display_data" 89 | } 90 | ], 91 | "source": [ 92 | "plt.figure()\n", 93 | "inc = loansData['Monthly.Income']\n", 94 | "h = inc.hist()\n", 95 | "plt.title('Histogram of Monthly Income')\n", 96 | "plt.show()" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 4, 102 | "metadata": { 103 | "collapsed": false 104 | }, 105 | "outputs": [ 106 | { 107 | "data": { 108 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAX4AAAEKCAYAAAAVaT4rAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XtYVHX+B/D3IJhi3EQZhBEnBUXEe6llKqZ422rNDG+r\noJvt5rq7uVb066b1rIK1rbdWny4qlJVpz6ZYaq7KqK2rVG7aipc0UG6SqSgIisDn94cxOQqjM3DO\nmTnzfj0PT525nO/nM2fmw/CeAxpEREBERB7DS+sCiIhIXRz8REQehoOfiMjDcPATEXkYDn4iIg/D\nwU9E5GE4+N1QbGwsdu3apXUZmvr000/Rtm1b+Pn54cCBA6qt+9Zbb2HWrFmqrVcXi8WCtm3b1nt9\nUlISXnrpJRUr0sbBgwfRv39/rctwSxz8LsZsNmP79u02l6WlpWHAgAHW7f/9738YOHCg3f3k5ubC\ny8sLNTU1itSptaeffhrLli1DaWkpunfvftP1Xl5e+OGHHxp1zcrKSsybNw/PPvssgF8e4169etnc\n7qeffkLTpk1x1113Ncq6jvZiMBhgMBgcXudW31BcTbdu3RAYGIjPPvtM61LcDge/i3H2RVsfpX4/\nr7q6WpH93g4RwalTpxATE6Pquhs2bEDnzp3Rpk0bm8srKipw6NAh6/aHH36I9u3ba3ocPeX3MidN\nmoS33npL6zLcDge/G7hxgJjNZuzYsQMAkJWVhbvvvhsBAQEIDQ3F008/DQDWnwgCAwPh5+eHffv2\nQUTw17/+FWazGUajEYmJibh48aJ1v++99x7atWuHVq1aWW9Xu87cuXMxduxYTJ48GQEBAUhPT8dX\nX32Fe++9F0FBQQgLC8Mf//hHXL161bo/Ly8vLF++HFFRUfD398fLL7+MEydO4N5770VgYCDGjx9v\nc/vr1VfrlStX4Ofnh+rqanTv3h1RUVEOPZYXLlzAlClTEBISArPZjHnz5lmHZE1NDWbPno3WrVuj\nffv2ePPNN21+atq8eTMGDRp00z4nT56M9PR06/b777+PKVOm2Azfw4cPIy4uDkFBQYiNjcXGjRut\n1yUlJeEPf/gDHnzwQfj7+6Nfv37Wd/i1x7F79+7w8/PDunXrrPf7+9//DqPRiLCwMKSlpdnUVPuc\niY2NtXlHfPXqVbRq1eq24rG4uDi8/PLLuP/+++Hv74/hw4fj7Nmz1uu//PJL3HfffQgKCkJERIT1\nMbD3GKelpaF///74y1/+gqCgIERGRmLPnj1YtWoVIiIiYDQa8d5771nXuHLlCp5++mm0a9cOoaGh\nePLJJ3H58mXr9YMGDcL27dvrfR5RPYRcitlslm3bttlctmrVKrn//vttbrN9+3YREenXr5+sXr1a\nREQuXboke/fuFRGR3NxcMRgMUl1dbb3fihUrJDIyUnJycqSsrEzGjBkjkydPFhGRQ4cOyZ133in/\n/ve/pbKyUp5++mnx8fGxrjNnzhzx8fGRDRs2iIhIRUWFfPPNN7Jv3z6prq6W3Nxc6dy5syxatMi6\nnsFgkNGjR0tpaakcOnRImjZtKoMHD5acnBy5cOGCxMTESHp6ep2Pg71aa/d94sSJeh/H+q6fPHmy\njB49WsrKyiQ3N1c6duwoK1asEBGR5cuXS0xMjBQUFMj58+dlyJAh4uXlZX0M77nnHvnkk0+s+8rJ\nyRGDwSC5ubnStm1bqampkUOHDkl0dLRs27ZNzGaziIhUVlZKhw4dJCUlRa5evSo7duwQPz8/OXr0\nqIiIJCYmSnBwsHz11VdSVVUlkyZNkvHjx9fbS2Zmpnh7e8ucOXOkqqpKNm3aJL6+vlJSUiIiIklJ\nSfLSSy+JiMhrr70m48aNs953/fr10q1btzofs8zMTDGZTNbtQYMGSWRkpHz//fdSUVEhcXFx8txz\nz4nIteeXn5+frFmzRqqqquTs2bPy7bff3vIxXrVqlXh7e0taWprU1NTIiy++KOHh4TJz5kyprKyU\nrVu3ip+fn1y6dElERJ566in59a9/LefPn5fS0lJ56KGH5P/+7/9s6vb395fvvvuuzp6obhz8LqZd\nu3Zy5513SmBgoPXL19dXBgwYYL3N9YN/4MCBMmfOHDlz5ozNfmqH0vWD/4EHHpDly5dbt48ePSo+\nPj5SVVUlr7zyikycONF6XXl5uTRt2tRm8A8aNMhu7QsXLpRHHnnEum0wGGTPnj3W7d69e8trr71m\n3Z49e7Y89dRTde6rvlpr+3Fm8FdVVUnTpk3l8OHD1sveeustiYuLExGRwYMHy9tvv229btu2bTaP\nYVRUlHzxxRfW62sf46qqKhk6dKh88cUXkpycLPPnz7cZ/Lt27ZLQ0FCbWiZMmCBz584VkWuDf/r0\n6dbrNm3aJNHR0fX2kpmZKc2bN7c5tiEhIbJv3z4RuTb4X3zxRRERKSgokDvvvFNKS0tFROTRRx+V\n119/vc7H7MbBHxcXJ/PmzbNuL1u2TEaMGCEiIvPnz5cxY8bctI9bPcarVq2SqKgo63UHDx4Ug8Eg\nP/74o/Wy4OBgOXDggNTU1EiLFi1set+zZ4/cddddNmuGh4fL7t276+yJ6saox8UYDAZs2LAB58+f\nt34tW7as3sx2xYoVOHbsGDp37ow+ffrg888/r3ffRUVFaNeunXU7IiICVVVVKC4uRlFREUwmk/W6\n5s2bIzg42Ob+118PAMeOHcODDz6INm3aICAgAC+88IJNFAAARqPRZp83bpeVlTlcq7N++uknXL16\n9ab9FhQUWNe8/sPNG/sNCgqyicZqGQwGTJkyBatWrcKaNWswefJkm+NVWFh404em7dq1Q2FhofX+\nt/u41AoODoaX1y8vX19f3zrvExYWhv79++OTTz5BSUkJtmzZgkmTJtnd9/VCQ0PrrCsvLw/t27e/\n6fa3eoyBm58TANC6deub1jlz5gzKy8vRu3dvBAUFISgoCCNHjsRPP/1ks2ZpaSkCAwNvuydixu8W\n6hv6ABAZGYkPP/wQZ86cQXJyMsaOHYuKioo6P1gMCwtDbm6udfvUqVPw9vZGaGgo2rRpg/z8fOt1\nFRUVNw3xG/f55JNPIiYmBsePH8eFCxcwb968RjuLqL5arx8ajmrVqhV8fHxu2m/tgG/Tpg3y8vKs\n113//8C1s0iOHTtW577HjBmDTZs2oUOHDjd9wwgLC0NeXp7NcTx58iTCw8Od7sURiYmJWL16Ndat\nW4f77rvvpg+nnREREYETJ07cdPmtHmNHtGrVCs2bN0d2drb1TVBJSYnNN9+CggJUVlaiU6dOTvXh\nqTj43dzq1atx5swZAEBAQAAMBgO8vLzQunVreHl52bw4J0yYgIULFyI3NxdlZWV4/vnnMX78eHh5\neeHRRx/Fxo0b8Z///AeVlZWYO3fuLc8MKSsrg5+fH3x9fXHkyBEsX778lvVev097+7dX6+26cuUK\nLl++bP0CgISEBLzwwgsoKyvDyZMnsXDhQvzmN7+xXrd4
8WIUFhaipKQECxYssPlmN2rUKOzcubPO\ntVq0aIHMzEy8++67N13Xt29f+Pr64rXXXsPVq1dhsVjw2WefYfz48bd8HIBr75DrGrL1uXF/jzzy\nCPbv348lS5ZgypQpt70fe7VNnDgR27Ztw7p161BVVYWzZ8/iwIEDaNKkid3H2BFeXl6YPn06nnrq\nKetzvKCgAFu3brXeZufOnRgyZAh8fHwc3r8n4+B3A/ZO8fziiy8QGxsLPz8/zJo1C2vWrMEdd9wB\nX19fvPDCC+jfvz+CgoKQlZWFadOmYfLkyRg4cCDat28PX19fLF26FADQpUsXLF26FOPHj0dYWBj8\n/PwQEhKCO+64o94a/va3v+HDDz+Ev78/nnjiCYwfP97mNnXVfOP19fVlr9b69n2jLl26wNfX1/qV\nnp6OpUuXokWLFmjfvj0GDBiASZMmYerUqQCA6dOnY9iwYejWrRt69+6NX/3qV2jSpIn1m82DDz6I\nI0eOoKioqM46evXqZXPufu11TZs2xcaNG7F582a0bt0aM2fOxPvvv4+OHTvW+zhcvz137lwkJiYi\nKCgIn3zyyS1P+b3x+mbNmmHMmDHIzc3FmDFj7D5m9uq4fr8RERHYtGkT3njjDQQHB6Nnz544ePAg\nANh9jG/V640WLFiAyMhI9OvXDwEBAYiPj7f5qeuDDz7A73//e7s9UR3sfQAwdepUCQkJkdjYWOtl\nZ8+elaFDh0pUVJTEx8fL+fPnrdfNnz9fIiMjpVOnTjYfgn399dcSGxsrkZGR8qc//anRPqAg5ZSW\nloq3t7fk5uZqXYpmNm3aJO3atbO57O233673A2lX9uqrr9qcFaUHBw4ckPvuu0/rMtyS3cG/a9cu\n2b9/v83gf+aZZ2TBggUiIpKamirJyckicu10wO7du0tlZaXk5ORIhw4dpKamRkSunQZXe8bByJEj\nZfPmzYo0Qw2TkZEhly5dkrKyMvnd734nvXr10rokVVVUVMjnn38uV69elfz8fOnbt6/MmjVL67Ia\n7OzZs2I2m3nmC1nZjXoGDBiAoKAgm8syMjKQmJgI4NqHRuvXrwdw7bcaJ0yYAB8fH5jNZkRGRmLf\nvn0oKipCaWkp+vTpAwCYMmWK9T7kWjIyMhAeHo7w8HCcOHECa9as0bokVYkI5s6di5YtW6JXr17o\n0qULXn31Va3LapB33nkHERERGDlyJO6//36tyyEX4e3oHYqLi61nVhiNRuvpdYWFhejXr5/1diaT\nCQUFBfDx8bH5RD88PNzm1C5yHe+88w7eeecdrcvQTPPmzZGVlaV1GY1q+vTpmD59utZlkItp0Ie7\njf13ZYiISHkOv+M3Go04ffo0QkNDUVRUhJCQEADX3slff95zfn4+TCYTwsPDbc4Pz8/Pr/f85fDw\ncOsvtRAR0e3p0KEDjh8/ftu3d/gd/8MPP2z9Y0zp6ekYPXq09fI1a9agsrISOTk5+P7779GnTx+E\nhobC39/f+kfC3n//fet9blRYWAi59oGzLr/mzJmjeQ3sjf2xP/19OfJ7HsAt3vFPmDABO3fuxE8/\n/YS2bdvi1VdfxXPPPYeEhASsWLECZrMZa9euBQDExMQgISEBMTEx8Pb2xrJly6wx0LJly5CUlISK\nigqMGjUKI0aMcKhIvbj+txn1Rs+9AezP3em9P0fZHfwfffRRnZdv27atzsuff/55PP/88zdd3rt3\nb3z33XdOlEdERI2Nv7mroqSkJK1LUIyeewPYn7vTe3+OMoiIy/xTPQaDAS5UDhGRW3B0dvIdv4os\nFovWJShGz70B7M/d6b0/R3HwExF5GEY9RERujlEPERHZxcGvIj3njHruDWB/7k7v/TmKg5+IyMMw\n4ycicnPM+ImIyC4OfhXpOWfUc28A+3N3eu/PURz8REQehhk/EZGbY8ZPRER2cfCrSM85oxq9+fu3\ntP5zn2p9+fu3VK0/LbE/z+LwP71IpJXS0vMA1I0CS0v5b0qT/jDjJ7dx7V90U/v5weckuT5m/ERE\nZBcHv4r0nDPquTeA/bk7vffnKA5+IiIPw4yf3AYzfqK6MeMnIiK7OPhVpOecUc+9AezP3em9P0dx\n8BMReRhm/OQ2mPET1Y0ZPxER2cXBryI954x67g1gf+5O7/05ioOfiMjDMOMnt8GMn6huzPiJiMgu\nDn4V6Tln1HNvAPtzd3rvz1H8e/zkFH//lj//fXwicjfM+MkpWuXtzPiJbsaMn4iI7OLgV5G+c0aL\n1gUoSt/Hjv15Gg5+IiIP43TGn5KSgtWrV8PLywtdu3bFqlWrcOnSJYwbNw4nT56E2WzG2rVrERgY\naL39ypUr0aRJEyxZsgTDhg27uRhm/G6DGT+R63B0djo1+HNzc/HAAw/g8OHDuOOOOzBu3DiMGjUK\nhw4dQqtWrfDss89iwYIFOH/+PFJTU5GdnY2JEyfiq6++QkFBAYYOHYpjx47By8v2Bw4OfvfBwU/k\nOlT5cNff3x8+Pj4oLy9HVVUVysvLERYWhoyMDCQmJgIAEhMTsX79egDAhg0bMGHCBPj4+MBsNiMy\nMhJZWVnOLO3W9J0zWrQuQFH6Pnbsz9M4NfhbtmyJ2bNnIyIiAmFhYQgMDER8fDyKi4thNBoBAEaj\nEcXFxQCAwsJCmEwm6/1NJhMKCgoaoXwiInKUU7/AdeLECSxatAi5ubkICAjAY489htWrV9vcxmAw\n/BwH1K2+65KSkmA2mwEAgYGB6NGjB+Li4gD88l3bXbdrL3OVehrej+Xn/8b9/HX99o3XN8Z27WVK\n7b++7Ws9a/14K7nN/txr22KxIC0tDQCs89IRTmX8H3/8Mf71r3/h3XffBQC8//772Lt3L3bs2IHM\nzEyEhoaiqKgIgwcPxpEjR5CamgoAeO655wAAI0aMwCuvvIK+ffvaFsOM320w4ydyHapk/NHR0di7\ndy8qKiogIti2bRtiYmLw0EMPIT09HQCQnp6O0aNHAwAefvhhrFmzBpWVlcjJycH333+PPn36OLO0\nW6v9jq1PFq0LUJS+jx378zRORT3du3fHlClTcPfdd8PLywu9evXCE088gdLSUiQkJGDFihXW0zkB\nICYmBgkJCYiJiYG3tzeWLVtmNwYiIiLl8G/1kFMY9RC5Dv6tHiIisouDX0X6zhktWhegKH0fO/bn\naTj4iYg8DDN+cgozfiLXwYyfiIjs4uBXkb5zRovWBShK38eO/XkaDn4iIg/DjJ+cwoyfyHUw4yci\nIrs4+FWk75zRonUBitL3sWN/noaDn4jIwzDjJ6cw4ydyHcz4iYjILg5+Fek7Z7RoXYCi9H3s2J+n\n4eAnIvIwzPjJKcz4iVwHM34iIrKLg19F+s4ZLVoXoCh9Hzv252k4+ImIPAwzfnIKM34i18GMn4iI\n7OLgV5G+c0aL1gUoSt/Hjv15Gg5+IiIPw4yfnMKMn8h1MOMnIiK7OPhVpO+c0aJ1AYrS97Fjf56G\ng5+IyMMw4ye
nMOMnch3M+ImIyC4OfhXpO2e0aF2AovR97Nifp+HgJyLyMMz4ySnM+IlcBzN+IiKy\ni4NfRfrOGS1aF6AofR879udpOPiJiDyM0xl/SUkJHn/8cRw6dAgGgwGrVq1CVFQUxo0bh5MnT8Js\nNmPt2rUIDAwEAKSkpGDlypVo0qQJlixZgmHDht1cDDN+t8GMn8h1qJbx//nPf8aoUaNw+PBhHDx4\nENHR0UhNTUV8fDyOHTuGIUOGIDU1FQCQnZ2Njz/+GNnZ2diyZQtmzJiBmpoaZ5cmIqIGcGrwX7hw\nAbt378a0adMAAN7e3ggICEBGRgYSExMBAImJiVi/fj0AYMOGDZgwYQJ8fHxgNpsRGRmJrKysRmrB\nfeg7Z7RoXYCi9H3s2J+ncWrw5+TkoHXr1pg6dSp69eqF6dOn49KlSyguLobRaAQAGI1GFBcXAwAK\nCwthMpms9zeZTCgoKGiE8omIyFFODf6qqirs378fM2bMwP79+9GiRQtrrFPLYDD8nAPXzd51ehUX\nF6d1CQqK07oARen72LE/T+PtzJ1MJhNMJhPuueceAMDYsWORkpKC0NBQnD59GqGhoSgqKkJISAgA\nIDw8HHl5edb75+fnIzw8vM59JyUlwWw2AwACAwPRo0cP60Gr/XGN266x/Uu8o9Z27WVqr//zlos9\n/tz23G2LxYK0tDQAsM5LRzh9Vs/AgQPx7rvvomPHjpg7dy7Ky8sBAMHBwUhOTkZqaipKSkqQmpqK\n7OxsTJw4EVlZWSgoKMDQoUNx/Pjxm9716/2sHovFct3QdG83n9VjgfLv+rU7q0dPx64u7M+9OTo7\nnXrHDwBLly7FpEmTUFlZiQ4dOmDVqlWorq5GQkICVqxYYT2dEwBiYmKQkJCAmJgYeHt7Y9myZR4Z\n9RARuQL+rR5yCs/jJ3Id/Fs9RERkFwe/imo/nNEni9YFKErfx479eRoOfiIiD8OMn5zCjJ/IdTDj\nJyIiuzj4VaTvnNGidQGK0vexY3+ehoOfiMjDMOMnpzDjJ3IdzPiJiMguDn4V6TtntGhdgKL0fezY\nn6fh4Cci8jDM+MkpzPiJXAczfiIisouDX0X6zhktWhegKH0fO/bnaTj4iYg8DDN+cgozfiLXwYyf\niIjs4uBXkb5zRovWBShK38eO/XkaDn4iIg/DjJ+cwoyfyHUw4yciIrs4+FWk75zRonUBitL3sWN/\nnoaDn4jIwzDjJ6cw4ydyHcz4iYjILg5+Fek7Z7RoXYCi9H3s2J+n4eAnIvIwzPjJKcz4iVwHM34i\nIrKLg19F+s4ZLVoXoCh9Hzv252k4+ImIPAwzfnIKM34i18GMn4iI7OLgV5G+c0aL1gUoSt/Hjv15\nGg5+IiIPw4yfnMKMn8h1qJrxV1dXo2fPnnjooYcAAOfOnUN8fDw6duyIYcOGoaSkxHrblJQUREVF\nITo6Glu3bm3IskRE1AANGvyLFy9GTEzMz+/+gNTUVMTHx+PYsWMYMmQIUlNTAQDZ2dn4+OOPkZ2d\njS1btmDGjBmoqalpePVuRt85o0XrAhSl72PH/jyN04M/Pz8fmzZtwuOPP279ESMjIwOJiYkAgMTE\nRKxfvx4AsGHDBkyYMAE+Pj4wm82IjIxEVlZWI5RPRESOcnrwz5o1C6+//jq8vH7ZRXFxMYxGIwDA\naDSiuLgYAFBYWAiTyWS9nclkQkFBgbNLu624uDitS1BQnNYFKErfx479eRqnBv9nn32GkJAQ9OzZ\ns94PFAwGgzUCqu96IiJSn7czd9qzZw8yMjKwadMmXL58GRcvXsTkyZNhNBpx+vRphIaGoqioCCEh\nIQCA8PBw5OXlWe+fn5+P8PDwOvedlJQEs9kMAAgMDESPHj2s361rczp33V60aJGu+vkl14+DbcZf\n1/WNsV17mVL7r2/bNiN2lce/MbfZn3ttWywWpKWlAYB1XjpEGshisciDDz4oIiLPPPOMpKamiohI\nSkqKJCcni4jIoUOHpHv37nLlyhX54YcfpH379lJTU3PTvhqhHJeWmZmpdQmNBoAAct1X5g3bSnzd\nuKYaX9eek3o6dnVhf+7N0dnZ4PP4d+7ciTfeeAMZGRk4d+4cEhIScOrUKZjNZqxduxaBgYEAgPnz\n52PlypXw9vbG4sWLMXz48Jv2xfP43QfP4ydyHY7OTv4CFzmFg5/IdfCPtLmw63NG/bFoXYCi9H3s\n2J+n4eAnIvIwjHrIKYx6iFwHox4iIrKLg19F+s4ZLVoXoCh9Hzv252k4+ImIPAwzfnIKM34i1+Ho\n7HTqTzYQeQ5v1f+ulJ9fEC5ePKfqmuRZGPWoSN85o0XrAhRShWs/ZWT+/F/lv0pLz6vT2nX0/dzU\nf3+O4uAnIvIwzPjJKZ6U8fNzBXJ1PI+fiIjs4uBXkb5zRovWBSjMonUBitL3c1P//TmKg5+IyMMw\n4yenMONXdk2+DsgRzPiJiMguDn4V6TtntGhdgMIsWhegKH0/N/Xfn6M4+ImIPAwzfnIKM35l1+Tr\ngBzBjJ+IiOzi4FeRvnNGi9YFKMyidQGK0vdzU//9OYqDn4jIwzDjJ6cw41d2Tb4OyBHM+ImIyC4O\nfhXpO2e0aF2AwixaF6AofT839d+fozj4iYg8DDN+nfD3b6nBv9zkGXk7M35ydY7OTg5+nVD/w1bP\nGcIc/OTq+OGuC9N3zmjRugCFWbQuQFH6fm7qvz9HcfATEXkYRj06wahHX2vydUCOYNRDRER2cfCr\nSN85o0XrAhRm0boARen7uan//hzFwU9E5GGY8esEM359rcnXATmCGT8REdnl1ODPy8vD4MGD0aVL\nF8TGxmLJkiUAgHPnziE+Ph4dO3bEsGHDUFJSYr1PSkoKoqKiEB0dja1btzZO9W5G3zmjResCFGbR\nugBF6fu5qf/+HOXU4Pfx8cHChQtx6NAh7N27F//4xz9w+PBhpKamIj4+HseOHcOQIUOQmpoKAMjO\nzsbHH3+M7OxsbNmyBTNmzEBNTU2jNkJERLenUTL+0aNHY+bMmZg5cyZ27twJo9GI06dPIy4uDkeO\nHEFKSgq8vLyQnJwMABgxYgTmzp2Lfv362RbDjN9pzPj1tSZfB+QI1TP+3Nxc/Pe//0Xfvn1RXFwM\no9EIADAajSguLgYAFBYWwmQyWe9jMplQUFDQ0KWJiMgJ3g25c1lZGR599FEsXrwYfn5+NtcZDIaf\n34XWrb7rkpKSYDabAQCBgYHo0aMH4uLiAPyS07nr9qJFixTt55ccWovt2v9Xcr3ay5Taf33bN/6/\nGuup+/y8PgN3ldcL+7PfT1paGgBY56VDxEmVlZUybNgwWbhwofWyTp06SVFRkYiIFBYWSqdOnURE\nJCUlRVJSUqy3Gz58uOzdu/emfTagHLeQmZmp2L4BCCAqft24XqYGa6rZpxr9/bKm2pR8broCvffn\n6HPGqYxfRJCYmIjg4GAsXLjQevmzzz6L4OBgJCcnIzU1FSUlJUhNTUV2
djYmTpyIrKwsFBQUYOjQ\noTh+/PhN7/qZ8TuPGb++1uTrgByhyt/j//LLLzFw4EB069bNOrxTUlLQp08fJCQk4NSpUzCbzVi7\ndi0CAwMBAPPnz8fKlSvh7e2NxYsXY/jw4Q0unn7Bwa+vNfk6IEfwH2JxYRaL5bo8vnFpP/gtsM3i\n1VhTDbVrWqB8f7+sqfbrQMnnpivQe3/8zV0iIrKL7/h1Qvt3/FyzMdfk64AcwXf8RERkFwe/ivT9\n90IsWhegMIvWBShK389N/ffnKA5+IiIPw4xfJ5jx62tNvg7IEcz4iYjILg5+Fek7Z7RoXYDCLFoX\noCh9Pzf135+jOPiJiDwMM36dYMavrzX5OiBHMOMnIiK7OPhVpO+c0aJ1AQqzaF2AovT93NR/f47i\n4Cci8jDM+HWCGb++1uTrgBzBjJ+IiOzi4FeRvnNGi9YFKMyidQGK0vdzU//9OYqDn4jIwzDj1wlm\n/Ppak68DcgQzfiIisouDX0X6zhktWhegMIvWBShK389N/ffnKA5+IiIPw4xfJ5jx62tNvg7IEcz4\niYjILg5+Fek7Z7RoXYDCLCqu5Q2DwaDql6+vn4r9qU/frz3HeWtdABHdqApqx0sVFQZV1yNtMePX\nCWb8XLMRWQo5AAAGmUlEQVSha/K1576Y8RMRkV0c/CrSd85o0boAhVm0LoAaQN+vPcdx8BMReRhm\n/DrBjJ9rNnRNvvbcFzN+IiKyi4NfAf7+LVU/D1t7Fq0LUJhF6wKoAZjx2+LgV0Bp6Xlc+1H9xq/M\nei5vjC8iotvDjF8B6uftgPq5sCf06Flr6uG156mY8RMRkV2qDv4tW7YgOjoaUVFRWLBggZpLuwiL\n1gUoyKJ1AQqzaF0ANQAzfluqDf7q6mrMnDkTW7ZsQXZ2Nj766CMcPnxYreVdxLdaF6AgPfcG6L8/\nffv2Wx6/66k2+LOyshAZGQmz2QwfHx+MHz8eGzZsUGt5F1GidQEK0nNvgP7707eSEh6/66n21zkL\nCgrQtm1b67bJZMK+ffsUXXPhwjexePE7iq5xo2bN+AdPyR2pe1qwn18QLl48p9p6ZEu1KaXFueb/\n/e93yM/Pg5fXnaqtWV1dZOfaXLXK0ECu1gUoLFfrAhSm7mnBpaXqzoPc3Fz4+7f8+VRr9bjqNzjV\nBn94eDjy8vKs23l5eTCZTDa36dChgyLfIKqr1T3Y19TXR7oGa6q1npK91bemGmrXVKO/G9dUk7pr\nusYvHiqrtPS8Kn126NDBodurdh5/VVUVOnXqhO3btyMsLAx9+vTBRx99hM6dO6uxPBER/Uy1d/ze\n3t548803MXz4cFRXV+O3v/0thz4RkQZc6jd3iYhIeS7xm7slJSUYO3YsOnfujJiYGOzdu1frkhrN\n0aNH0bNnT+tXQEAAlixZonVZjSolJQVdunRB165dMXHiRFy5ckXrkhrV4sWL0bVrV8TGxmLx4sVa\nl9Ng06ZNg9FoRNeuXa2XnTt3DvHx8ejYsSOGDRvmtqc/1tXbunXr0KVLFzRp0gT79+/XsLqGq6u/\nZ555Bp07d0b37t0xZswYXLhw4dY7EhcwZcoUWbFihYiIXL16VUpKSjSuSBnV1dUSGhoqp06d0rqU\nRpOTkyN33XWXXL58WUREEhISJC0tTeOqGs93330nsbGxUlFRIVVVVTJ06FA5fvy41mU1yK5du2T/\n/v0SGxtrveyZZ56RBQsWiIhIamqqJCcna1Veg9TV2+HDh+Xo0aMSFxcn33zzjYbVNVxd/W3dulWq\nq6tFRCQ5Ofm2jp3m7/gvXLiA3bt3Y9q0aQCufRYQEBCgcVXK2LZtGzp06GDz+wzuzt/fHz4+Pigv\nL0dVVRXKy8sRHh6udVmN5siRI+jbty+aNWuGJk2aYNCgQfjnP/+pdVkNMmDAAAQFBdlclpGRgcTE\nRABAYmIi1q9fr0VpDVZXb9HR0ejYsaNGFTWuuvqLj4+Hl9e1Ud63b1/k5+ffcj+aD/6cnBy0bt0a\nU6dORa9evTB9+nSUl5drXZYi1qxZg4kTJ2pdRqNq2bIlZs+ejYiICISFhSEwMBBDhw7VuqxGExsb\ni927d+PcuXMoLy/H559/flsvLHdTXFwMo9EIADAajSguLta4InLGypUrMWrUqFveTvPBX1VVhf37\n92PGjBnYv38/WrRogdTUVK3LanSVlZXYuHEjHnvsMa1LaVQnTpzAokWLkJubi8LCQpSVleGDDz7Q\nuqxGEx0djeTkZAwbNgwjR45Ez549re+u9Mp1/nEfcsS8efPQtGnT23pzqfkz2GQywWQy4Z577gEA\njB071u0/gKnL5s2b0bt3b7Ru3VrrUhrV119/jfvuuw/BwcHw9vbGmDFjsGfPHq3LalTTpk3D119/\njZ07dyIwMBCdOnXSuqRGZzQacfr0aQBAUVERQkJCNK6IHJGWloZNmzbd9psuzQd/aGgo2rZti2PH\njgG4loN36dJF46oa30cffYQJEyZoXUaji46Oxt69e1FRUQERwbZt2xATE6N1WY3qxx9/BACcOnUK\nn376qe7iOgB4+OGHkZ5+7TeT09PTMXr0aI0rUobo8Oz1LVu24PXXX8eGDRvQrFmz27uTQh8+O+Tb\nb7+Vu+++W7p16yaPPPKI7s7qKSsrk+DgYLl48aLWpShiwYIFEhMTI7GxsTJlyhSprKzUuqRGNWDA\nAImJiZHu3bvLjh07tC6nwcaPHy9t2rQRHx8fMZlMsnLlSjl79qwMGTJEoqKiJD4+Xs6fP691mU65\nsbcVK1bIp59+KiaTSZo1ayZGo1FGjBihdZlOq6u/yMhIiYiIkB49ekiPHj3kySefvOV++AtcREQe\nRvOoh4iI1MXBT0TkYTj4iYg8DAc/EZGH4eAnIvIwHPxERB6Gg5+IyMNw8BMReZj/B5JE+MQbZdZe\nAAAAAElFTkSuQmCC\n", 109 | "text/plain": [ 110 | "" 111 | ] 112 | }, 113 | "metadata": {}, 114 | "output_type": "display_data" 115 | } 116 | ], 117 | "source": [ 118 | "loansData['Monthly.LogIncome'] = [ math.log(x) for x in inc ]\n", 119 | "plt.figure()\n", 120 | "h = loansData['Monthly.LogIncome'].hist()\n", 121 | "plt.title('Histogram of Log(Monthly Income)')\n", 122 | "plt.show()" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 7, 128 | "metadata": { 129 | "collapsed": false 130 | }, 131 | "outputs": [ 132 | { 133 | "data": { 134 
| "text/plain": [ 135 | "count 2499.000000\n", 136 | "mean 8.501915\n", 137 | "std 0.523019\n", 138 | "min 6.377577\n", 139 | "25% 8.160518\n", 140 | "50% 8.517193\n", 141 | "75% 8.824678\n", 142 | "max 11.540054\n", 143 | "dtype: float64" 144 | ] 145 | }, 146 | "execution_count": 7, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "loansData['Monthly.LogIncome'].describe()" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "collapsed": false 160 | }, 161 | "outputs": [], 162 | "source": [] 163 | } 164 | ], 165 | "metadata": { 166 | "kernelspec": { 167 | "display_name": "Python 2", 168 | "language": "python", 169 | "name": "python2" 170 | }, 171 | "language_info": { 172 | "codemirror_mode": { 173 | "name": "ipython", 174 | "version": 2 175 | }, 176 | "file_extension": ".py", 177 | "mimetype": "text/x-python", 178 | "name": "python", 179 | "nbconvert_exporter": "python", 180 | "pygments_lexer": "ipython2", 181 | "version": "2.7.9" 182 | } 183 | }, 184 | "nbformat": 4, 185 | "nbformat_minor": 0 186 | } 187 | -------------------------------------------------------------------------------- /ipython/Lecture_TextMining.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Data Science\n", 8 | "## Text classification\n", 9 | "Author: Robert Moakler\n", 10 | "***" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "Read in some packages." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": { 24 | "collapsed": true 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "# Import the libraries we will be using\n", 29 | "import numpy as np\n", 30 | "import pandas as pd\n", 31 | "from sklearn.linear_model import LogisticRegression\n", 32 | "from sklearn.naive_bayes import BernoulliNB\n", 33 | "from sklearn import metrics\n", 34 | "from sklearn import cross_validation\n", 35 | "from sklearn.cross_validation import train_test_split\n", 36 | "from sklearn.feature_extraction.text import CountVectorizer\n", 37 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 38 | "\n", 39 | "import matplotlib.pylab as plt\n", 40 | "%matplotlib inline\n", 41 | "plt.rcParams['figure.figsize'] = 10, 8\n", 42 | "\n", 43 | "np.random.seed(36)\n", 44 | "\n", 45 | "# We will want to keep track of some different roc curves, lets do that here\n", 46 | "tprs = []\n", 47 | "fprs = []\n", 48 | "roc_labels = []" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "### Data\n", 56 | "We have a new data set in `data/spam_ham.csv`. Let's take a look at what it contains." 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "!head -2 data/spam_ham.csv" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "Looks like we have two features: some text (looks like an email), and a label for spam or ham. What is the distribution of the target variable?" 
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "code",
 79 |    "execution_count": null,
 80 |    "metadata": {
 81 |     "collapsed": false
 82 |    },
 83 |    "outputs": [],
 84 |    "source": [
 85 |     "!cut -f2 -d',' data/spam_ham.csv | sort | uniq -c | head"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "markdown",
 90 |    "metadata": {},
 91 |    "source": [
 92 |     "It doesn't look like that did what we wanted. Can you see why?\n",
 93 |     "\n",
 94 |     "This file contains **text data**, and the text in the first column can have commas. The command line will have some issues reading this data since it will try to split on all instances of the delimiter. Ideally, we would like to have a way of **encapsulating** the first column. Note that we actually have something like this in the data: the first column is wrapped in single quotes. Python (and pandas) have more explicit ways of dealing with this:"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": null,
100 |    "metadata": {
101 |     "collapsed": false
102 |    },
103 |    "outputs": [],
104 |    "source": [
105 |     "data = pd.read_csv(\"data/spam_ham.csv\", quotechar=\"'\", escapechar=\"\\\\\")"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "markdown",
110 |    "metadata": {},
111 |    "source": [
112 |     "Above, we specify that fields that need to be encapsulated are wrapped in single quotes (`quotechar`). But, what if the text in this field uses single quotes? For example, apostrophes in words like \"can't\" would break the encapsulation. To overcome this, we **escape** single quotes that are actually just text. Here, we specify the escape character as a backslash (`escapechar`). So now, for example, \"can't\" would be written as \"can\\'t\".\n",
113 |     "\n",
114 |     "Let's take another look at our data."
115 |    ]
116 |   },
117 |   {
118 |    "cell_type": "code",
119 |    "execution_count": null,
120 |    "metadata": {
121 |     "collapsed": false
122 |    },
123 |    "outputs": [],
124 |    "source": [
125 |     "data.head()"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "metadata": {},
131 |    "source": [
132 |     "Here, the target is whether or not a record should be considered as spam. This is recorded as the string 'spam' or 'ham'. To make it a little easier for our classifier, let's recode it as `0` or `1`."
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "code",
137 |    "execution_count": null,
138 |    "metadata": {
139 |     "collapsed": false
140 |    },
141 |    "outputs": [],
142 |    "source": [
143 |     "data['spam'] = pd.Series(data['spam'] == 'spam', dtype=int)"
144 |    ]
145 |   },
146 |   {
147 |    "cell_type": "code",
148 |    "execution_count": null,
149 |    "metadata": {
150 |     "collapsed": false
151 |    },
152 |    "outputs": [],
153 |    "source": [
154 |     "data.head()"
155 |    ]
156 |   },
157 |   {
158 |    "cell_type": "markdown",
159 |    "metadata": {},
160 |    "source": [
161 |     "Since we are going to do some modeling, we should split our data into a training and test set."
162 |    ]
163 |   },
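(A brief added note, not from the lecture: the split in the next cell is random, so results change from run to run. A sketch of the same call with a fixed random_state makes the experiment reproducible; train_test_split here comes from sklearn.cross_validation as imported above, though in modern sklearn it lives in sklearn.model_selection.)

# Sketch: a reproducible version of the split performed in the next cell.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    data['text'], data['spam'], train_size=0.75, random_state=42)
print('%d training records, %d test records' % (len(X_tr), len(X_te)))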
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": null,
167 |    "metadata": {
168 |     "collapsed": false
169 |    },
170 |    "outputs": [],
171 |    "source": [
172 |     "X = data['text']\n",
173 |     "Y = data['spam']\n",
174 |     "\n",
175 |     "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=.75)"
176 |    ]
177 |   },
178 |   {
179 |    "cell_type": "markdown",
180 |    "metadata": {},
181 |    "source": [
182 |     "### Text as features\n",
183 |     "How can we turn the large amount of text for each record into useful features?\n",
184 |     "\n",
185 |     "\n",
186 |     "#### Binary representation\n",
187 |     "One way is to create a matrix that uses each word as a feature and keeps track of whether or not a word appears in a document/record. You can do this in sklearn with a `CountVectorizer()`, setting `binary` to `True`. The process is very similar to how you fit a model: you will fit a `CountVectorizer()`. This will figure out what words exist in your data."
188 |    ]
189 |   },
190 |   {
191 |    "cell_type": "code",
192 |    "execution_count": null,
193 |    "metadata": {
194 |     "collapsed": false
195 |    },
196 |    "outputs": [],
197 |    "source": [
198 |     "binary_vectorizer = CountVectorizer(binary=True)\n",
199 |     "binary_vectorizer.fit(X_train)"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "markdown",
204 |    "metadata": {},
205 |    "source": [
206 |     "Let's look at the vocabulary the `CountVectorizer()` learned."
207 |    ]
208 |   },
209 |   {
210 |    "cell_type": "code",
211 |    "execution_count": null,
212 |    "metadata": {
213 |     "collapsed": false
214 |    },
215 |    "outputs": [],
216 |    "source": [
217 |     "binary_vectorizer.vocabulary_.keys()[0:10]"
218 |    ]
219 |   },
220 |   {
221 |    "cell_type": "markdown",
222 |    "metadata": {},
223 |    "source": [
224 |     "Now that we know what words are in the data, we can transform our blobs of text into a clean matrix. Simply `.transform()` the raw data using our fitted `CountVectorizer()`. You will do this for the training and test data. What do you think happens if there are new words in the test data that were not seen in the training data?"
225 |    ]
226 |   },
227 |   {
228 |    "cell_type": "code",
229 |    "execution_count": null,
230 |    "metadata": {
231 |     "collapsed": false
232 |    },
233 |    "outputs": [],
234 |    "source": [
235 |     "X_train_binary = binary_vectorizer.transform(X_train)\n",
236 |     "X_test_binary = binary_vectorizer.transform(X_test)"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "markdown",
241 |    "metadata": {},
242 |    "source": [
243 |     "We can take a look at our new `X_test_binary`."
244 |    ]
245 |   },
246 |   {
247 |    "cell_type": "code",
248 |    "execution_count": null,
249 |    "metadata": {
250 |     "collapsed": false
251 |    },
252 |    "outputs": [],
253 |    "source": [
254 |     "X_test_binary"
255 |    ]
256 |   },
257 |   {
258 |    "cell_type": "markdown",
259 |    "metadata": {},
260 |    "source": [
261 |     "Sparse matrix? Where is our data?\n",
262 |     "\n",
263 |     "If you look at the output above, you will see that it is being stored in a *sparse* matrix (as opposed to the typical dense matrix) that is ~2k rows long and ~70k columns wide. The rows here are records in the original data and the columns are words. Given the shape, this means there are ~140m cells that should have values. However, from the above, we can see that only ~220k (~0.15%) of the cells have values! Why is this?\n",
264 |     "\n",
265 |     "To save space, sklearn uses a sparse matrix. This means that only values that are not zero are stored! This saves a ton of space! This also means that visualizing the data is a little trickier. Let's look at a very small chunk."
266 |    ]
267 |   },
268 |   {
269 |    "cell_type": "code",
270 |    "execution_count": null,
271 |    "metadata": {
272 |     "collapsed": false
273 |    },
274 |    "outputs": [],
275 |    "source": [
276 |     "X_test_binary[0:20, 0:20].todense()"
277 |    ]
278 |   },
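(An added sketch, not from the lecture, showing where the ~0.15% density figure above comes from. It assumes the X_test_binary matrix from the cells above and uses only the .shape and .nnz attributes that scipy sparse matrices expose.)

# Sketch: compute the sparsity of the feature matrix directly.
n_rows, n_cols = X_test_binary.shape
density = float(X_test_binary.nnz) / (n_rows * n_cols)
print('%d x %d matrix, %d stored values, density = %.4f%%'
      % (n_rows, n_cols, X_test_binary.nnz, density * 100))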
Let's look at a very small chunk." 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": { 272 | "collapsed": false 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "X_test_binary[0:20, 0:20].todense()" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "#### Applying a model\n", 284 | "Now that we have a ton of features (since we have a ton of words!), let's try using a logistic regression model to predict spam/ham." 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "collapsed": false 292 | }, 293 | "outputs": [], 294 | "source": [ 295 | "model = LogisticRegression()\n", 296 | "model.fit(X_train_binary, Y_train)\n", 297 | "\n", 298 | "print \"Area under the ROC curve on the test data = %.3f\" % metrics.roc_auc_score(Y_test, model.predict_proba(X_test_binary)[:,1])" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "Is this any good? What do we care about in this case? Let's take a look at our ROC measure in more detail by looking at the actual ROC curve." 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "fpr, tpr, thresholds = metrics.roc_curve(Y_test, model.predict_proba(X_test_binary)[:,1])\n", 317 | "tprs.append(tpr)\n", 318 | "fprs.append(fpr)\n", 319 | "roc_labels.append(\"Default Binary\")\n", 320 | "ax = plt.subplot()\n", 321 | "plt.plot(fpr, tpr)\n", 322 | "plt.xlabel(\"fpr\")\n", 323 | "plt.ylabel(\"tpr\")\n", 324 | "plt.title(\"ROC Curve\")\n", 325 | "plt.xlim([0, 1])\n", 326 | "plt.ylim([0, 1])\n", 327 | "plt.show()" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "#### Counts instead of binary\n", 335 | "Instead of using a 0 or 1 to represent the occurrence of a word, we can use the actual counts. We do this the same way as before, but now we leave `binary` set to `False` (the default value)." 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "# Fit a count vectorizer\n", 347 | "count_vectorizer = CountVectorizer()\n", 348 | "count_vectorizer.fit(X_train)\n", 349 | "\n", 350 | "# Transform text to counts\n", 351 | "X_train_counts = count_vectorizer.transform(X_train)\n", 352 | "X_test_counts = count_vectorizer.transform(X_test)\n", 353 | "\n", 354 | "# Model\n", 355 | "model = LogisticRegression()\n", 356 | "model.fit(X_train_counts, Y_train)\n", 357 | "\n", 358 | "print \"Area under the ROC curve on the test data = %.3f\" % metrics.roc_auc_score(Y_test, model.predict_proba(X_test_counts)[:,1])" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "We can also take a look at the ROC curve."
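One note on the corrected `roc_auc_score` calls above: the function expects the true labels first and a ranking score (such as a column of `predict_proba`) second. A tiny self-contained sketch with made-up numbers:

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # e.g. P(Y=1|X) from predict_proba

# AUC depends only on how the scores rank positives above negatives.
print(metrics.roc_auc_score(y_true, scores))  # 0.75
```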
366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": { 372 | "collapsed": false 373 | }, 374 | "outputs": [], 375 | "source": [ 376 | "fpr, tpr, thresholds = metrics.roc_curve(Y_test, model.predict_proba(X_test_counts)[:,1])\n", 377 | "tprs.append(tpr)\n", 378 | "fprs.append(fpr)\n", 379 | "roc_labels.append(\"Default Counts\")\n", 380 | "ax = plt.subplot()\n", 381 | "plt.plot(fpr, tpr)\n", 382 | "plt.xlabel(\"fpr\")\n", 383 | "plt.ylabel(\"tpr\")\n", 384 | "plt.title(\"ROC Curve\")\n", 385 | "plt.xlim([0, 1])\n", 386 | "plt.ylim([0, 1])\n", 387 | "plt.show()" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "#### Tf-idf\n", 395 | "Another popular technique when dealing with text is to use the term frequency - inverse document frequency (tf-idf) measure." 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": { 402 | "collapsed": false 403 | }, 404 | "outputs": [], 405 | "source": [ 406 | "# Fit a tf-idf vectorizer\n", 407 | "tfidf_vectorizer = TfidfVectorizer()\n", 408 | "tfidf_vectorizer.fit(X_train)\n", 409 | "\n", 410 | "# Transform text to tf-idf features\n", 411 | "X_train_tfidf = tfidf_vectorizer.transform(X_train)\n", 412 | "X_test_tfidf = tfidf_vectorizer.transform(X_test)\n", 413 | "\n", 414 | "# Model\n", 415 | "model = LogisticRegression()\n", 416 | "model.fit(X_train_tfidf, Y_train)\n", 417 | "\n", 418 | "print \"Area under the ROC curve on the test data = %.3f\" % metrics.roc_auc_score(Y_test, model.predict_proba(X_test_tfidf)[:,1])" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "Once again, we can look at the ROC curve." 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": { 432 | "collapsed": false 433 | }, 434 | "outputs": [], 435 | "source": [ 436 | "fpr, tpr, thresholds = metrics.roc_curve(Y_test, model.predict_proba(X_test_tfidf)[:,1])\n", 437 | "tprs.append(tpr)\n", 438 | "fprs.append(fpr)\n", 439 | "roc_labels.append(\"Default Tfidf\")\n", 440 | "ax = plt.subplot()\n", 441 | "plt.plot(fpr, tpr)\n", 442 | "plt.xlabel(\"fpr\")\n", 443 | "plt.ylabel(\"tpr\")\n", 444 | "plt.title(\"ROC Curve\")\n", 445 | "plt.xlim([0, 1])\n", 446 | "plt.ylim([0, 1])\n", 447 | "plt.show()" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "The `CountVectorizer()` and `TfidfVectorizer()` functions have many options. You can restrict the words you would like in the vocabulary. You can add n-grams. You can use stop word lists. Which options you should use generally depends on the type of data you are dealing with. We can discuss and try some of them now." 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "Now that we have a few different feature sets and models, let's look at all of our ROC curves."
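For concreteness, here is one way those options look in code. The particular settings are illustrative choices, not recommendations from the original notebook:

```python
# Unigrams + bigrams, English stop-word removal, and a capped vocabulary.
custom_vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                    stop_words='english',
                                    max_features=20000)
X_train_custom = custom_vectorizer.fit_transform(X_train)
X_test_custom = custom_vectorizer.transform(X_test)
```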
462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": { 468 | "collapsed": false 469 | }, 470 | "outputs": [], 471 | "source": [ 472 | "for fpr, tpr, roc_label in zip(fprs, tprs, roc_labels):\n", 473 | " plt.plot(fpr, tpr, label=roc_label)\n", 474 | "\n", 475 | "plt.xlabel(\"fpr\")\n", 476 | "plt.ylabel(\"tpr\")\n", 477 | "plt.title(\"ROC Curves\")\n", 478 | "plt.legend()\n", 479 | "plt.xlim([0, .07])\n", 480 | "plt.ylim([.98, 1])\n", 481 | "plt.show()" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "#### *A note on cross validation*\n", 489 | "We didn't use cross validation here, but it is definitely possible. The code is a little messier, so we will leave this to a Forum discussion." 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": {}, 495 | "source": [ 496 | "### Naive Bayes\n", 497 | "So far we have been exposed to tree classifiers and logistic regression in class. We have also seen SVMs in the homework. A popular modeling technique (especially in text classification) is the (Bernoulli) naive Bayes classifier.\n", 498 | "\n", 499 | "Using this model in sklearn is just as easy as all the other models." 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": { 506 | "collapsed": false 507 | }, 508 | "outputs": [], 509 | "source": [ 510 | "model = BernoulliNB()\n", 511 | "model.fit(X_train_tfidf, Y_train)" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": { 518 | "collapsed": false 519 | }, 520 | "outputs": [], 521 | "source": [ 522 | "print \"AUC on the tf-idf data = %.3f\" % metrics.roc_auc_score(Y_test, model.predict_proba(X_test_tfidf)[:,1])" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "The past few weeks we have seen that many of the models we are using have different parameters that can be tweaked. In naive Bayes, the parameter that is typically tuned is the Laplace smoothing value `alpha`. We won't discuss this in class, but will post a discussion on the NYU Classes Forum. Also, there is another version of naive Bayes (not discussed in the book) called multinomial naive Bayes, which can handle count features and not just binary features. We will give an additional reading covering that (in the Forum as well).
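To make both points concrete, here is a minimal sketch of multinomial naive Bayes on the count features, with a small and entirely arbitrary grid of `alpha` values (it assumes the objects defined earlier in this notebook):

```python
from sklearn.naive_bayes import MultinomialNB

# Multinomial naive Bayes consumes count features directly.
for alpha in [0.01, 0.1, 1.0]:
    nb = MultinomialNB(alpha=alpha)
    nb.fit(X_train_counts, Y_train)
    auc = metrics.roc_auc_score(Y_test, nb.predict_proba(X_test_counts)[:, 1])
    print("alpha = %.2f, AUC = %.3f" % (alpha, auc))
```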
" 530 | ] 531 | } 532 | ], 533 | "metadata": { 534 | "kernelspec": { 535 | "display_name": "Python 2", 536 | "language": "python", 537 | "name": "python2" 538 | }, 539 | "language_info": { 540 | "codemirror_mode": { 541 | "name": "ipython", 542 | "version": 2 543 | }, 544 | "file_extension": ".py", 545 | "mimetype": "text/x-python", 546 | "name": "python", 547 | "nbconvert_exporter": "python", 548 | "pygments_lexer": "ipython2", 549 | "version": "2.7.9" 550 | } 551 | }, 552 | "nbformat": 4, 553 | "nbformat_minor": 0 554 | } 555 | -------------------------------------------------------------------------------- /ipython/churn_analysis.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This script has a set of reference functions for performing analysis of the churn dataset 3 | ''' 4 | import sys 5 | import pandas as pd 6 | import numpy as np 7 | from matplotlib import pyplot as plt 8 | import sklearn.metrics as skm 9 | sys.path.append("./utils/") 10 | from ClassifierBakeoff import * 11 | 12 | import warnings 13 | warnings.filterwarnings('ignore') 14 | 15 | def getDfSummary(dat): 16 | ''' 17 | Get descriptive stats 18 | ''' 19 | #Get the names of the columns 20 | cols = dat.columns.values 21 | 22 | c_summ = [] 23 | #Outer Loop for the cols 24 | for c in cols: 25 | #Count the NAs 26 | missing = sum(pd.isnull(dat[c])) 27 | #Use describe to get summary statistics, and also drop the 'count' row 28 | sumval = dat[c].describe().drop(['count']) 29 | #Now count distinct values...note that nunique removes missing values for you 30 | distinct = dat[c].nunique() 31 | #Append missing and distinct to sumval 32 | sumval = sumval.append(pd.Series([missing, distinct], index=['missing', 'distinct'])) 33 | #Add each sumval to a list and then convert the entire thing to a DS 34 | c_summ.append(sumval) 35 | 36 | return pd.DataFrame(c_summ, index=cols) 37 | 38 | 39 | 40 | 41 | 42 | def plotCorr(dat, lab, h, w): 43 | ''' 44 | Do a heatmap to visualize the correlation matrix, dropping the label 45 | ''' 46 | 47 | dat = dat.drop(lab, 1) 48 | #Get correlation and 0 out the diagonal (for plotting purposes) 49 | c_dat = dat.corr() 50 | for i in range(c_dat.shape[0]): 51 | c_dat.iloc[i,i] = 0 52 | 53 | c_mat = c_dat.as_matrix() 54 | #c_mat = c_mat[:-1, :-1] 55 | fig, ax = plt.subplots() 56 | heatmap = plt.pcolor(c_mat, cmap = plt.cm.RdBu) 57 | 58 | #Set the tick labels and center them 59 | ax.set_xticks(np.arange(c_dat.shape[0]) + 0.5, minor = False) 60 | ax.set_yticks(np.arange(c_dat.shape[1]) + 0.5, minor = False) 61 | ax.set_xticklabels(c_dat.index.values, minor = False, rotation = 45) 62 | ax.set_yticklabels(c_dat.index.values, minor = False) 63 | heatmap.axes.set_ylim(0, len(c_dat.index)) 64 | heatmap.axes.set_xlim(0, len(c_dat.index)) 65 | plt.colorbar(heatmap, ax = ax) 66 | 67 | #plt.figure(figsize = (h, w)) 68 | fig = plt.gcf() 69 | fig.set_size_inches(h, w) 70 | 71 | 72 | def makeBar(df, h, lab, width): 73 | ''' 74 | Contains 75 | ''' 76 | df_s = df.sort(columns = [h], ascending = False) 77 | 78 | #Get a barplot 79 | ind = np.arange(df_s.shape[0]) 80 | labs = df_s[[lab]].values.ravel() 81 | 82 | fig = plt.figure(facecolor = 'w', figsize = (12, 6)) 83 | ax = plt.subplot(111) 84 | plt.subplots_adjust(bottom = 0.25) 85 | 86 | rec = ax.bar(ind + width, df_s[[h]].values, width, color='r') 87 | 88 | ax.set_xticks(ind + getTickAdj(labs, width)) 89 | ax.set_xticklabels(labs, rotation = 45, size = 14) 90 | 91 | 92 | def getTickAdj(labs, width): 93 | lens = map(len, labs) 94 
| lens = -1 * width * (lens - np.mean(lens)) / np.max(lens) 95 | return lens 96 | 97 | def plotMI(dat, lab, width = 0.35, signed = 0): 98 | ''' 99 | Draw a bar chart of the normalized MI between each X and Y 100 | ''' 101 | X = dat.drop(lab, 1) 102 | Y = dat[[lab]].values 103 | cols = X.columns.values 104 | mis = [] 105 | 106 | #Start by getting MI 107 | for c in cols: 108 | mis.append(skm.normalized_mutual_info_score(Y.ravel(), X[[c]].values.ravel())) 109 | 110 | #Get signs by correlation 111 | corrs = dat.corr()[lab] 112 | corrs = corrs[corrs.index != lab] 113 | df = pd.DataFrame(zip(mis, cols), columns = ['MI', 'Lab']) 114 | df = pd.merge(df, pd.DataFrame(corrs, columns = ['corr']), how = 'inner', left_on = 'Lab', right_index=True) 115 | 116 | if signed == 0: 117 | makeBar(df, 'MI', 'Lab', width) 118 | 119 | else: 120 | makeBarSigned(df, 'MI', 'Lab', width) 121 | 122 | 123 | def makeBarSigned(df, h, lab, width): 124 | ''' 125 | Bar chart of MI per feature, colored by the sign of its correlation with the label 126 | ''' 127 | df_s = df.sort(columns = [h], ascending = False) 128 | 129 | #Get a barplot 130 | ind = np.arange(df_s.shape[0]) 131 | labs = df_s[[lab]].values.ravel() 132 | h_pos = (df_s[['corr']].values.ravel() > 0) * df_s.MI 133 | h_neg = (df_s[['corr']].values.ravel() < 0) * df_s.MI 134 | 135 | fig = plt.figure(facecolor = 'w', figsize = (12, 6)) 136 | ax = plt.subplot(111) 137 | plt.subplots_adjust(bottom = 0.25) 138 | 139 | rec = ax.bar(ind + width, h_pos, width, color='r', label = 'Positive') 140 | rec = ax.bar(ind + width, h_neg, width, color='b', label = 'Negative') 141 | 142 | ax.set_xticks(ind + getTickAdj(labs, width)) 143 | ax.set_xticklabels(labs, rotation = 45, size = 14) 144 | 145 | plt.legend() 146 | 147 | 148 | 149 | def makeGS_Tup(ent, getmin = True): 150 | 151 | ostr = dToString(ent.parameters, ':', '|') 152 | if len(ostr.split('|')) > 2: 153 | sp = ostr.split('|') 154 | if len(sp) == 3: 155 | ostr = '{}|{}\n{}'.format(sp[0], sp[1], sp[2]) 156 | else: 157 | ostr = '{}|{}\n{}|{}'.format(sp[0], sp[1], sp[2], sp[3]) 158 | 159 | #ostr = dToString(ent.parameters, ':', '|') 160 | mu = np.abs(ent.mean_validation_score) #Log-Loss comes in at negative value 161 | sig = ent.cv_validation_scores.std() 162 | stderr = sig/np.sqrt(len(ent.cv_validation_scores)) 163 | 164 | if getmin: 165 | return (mu, ostr, mu + stderr, sig, stderr) #Note, this assumes minimization, thus adding stderr 166 | else: 167 | return (mu, ostr, mu - stderr, sig, stderr) 168 | 169 | 170 | def rankGS_Params(gs_obj_list, getmin = True): 171 | ''' 172 | Takes in the .grid_scores_ attribute of a GridSearchCV object and ranks the parameter settings 173 | ''' 174 | tup_list = [] 175 | 176 | for k in gs_obj_list: 177 | tup_list.append(makeGS_Tup(k, getmin)) 178 | 179 | tup_list.sort() 180 | 181 | if not getmin: 182 | tup_list.reverse() 183 | 184 | return tup_list 185 | 186 | 187 | 188 | def processGsObjList(gs_obj_list, getmin = True): 189 | 190 | rank_list = rankGS_Params(gs_obj_list, getmin) 191 | hts = [] 192 | desc = [] 193 | errs = [] 194 | std1 = rank_list[0][4] 195 | 196 | for tup in rank_list: 197 | hts.append(tup[0]) 198 | desc.append(tup[1]) 199 | errs.append(2 * tup[4]) 200 | 201 | return [hts, desc, errs, std1] 202 | 203 | def plotGridSearchSingle(gs_obj_list, getmin = True): 204 | 205 | hts, desc, errs, std1 = processGsObjList(gs_obj_list, getmin = getmin) 206 | 207 | gridBarH(hts, desc, errs, std1) 208 | 209 | 210 | 211 | def plotGridSearchMulti(tup_list, getmin = True): 212 | ''' 213 | Loop through a list of gs_obj_lists.
The gs_obj_list is in slot 1 of each value in the dict 214 | ''' 215 | m_ht = [] 216 | m_desc = [] 217 | m_errs = [] 218 | 219 | best_min = 1000 #This assumes we are minimizing 220 | 221 | for tup in tup_list: 222 | lab = tup[0] 223 | gs_dict = tup[1] 224 | 225 | for k in gs_dict: 226 | clf = type(k).__name__.split('Classifier')[0] 227 | 228 | hts, desc, errs, std1 = processGsObjList(gs_dict[k][1], getmin = getmin) 229 | for i, d in enumerate(desc): 230 | desc[i] = '{} {} {}'.format(clf, lab, d) 231 | 232 | if hts[0] < best_min: 233 | best_min, best_std1 = hts[0], std1 234 | 235 | m_ht = m_ht + hts 236 | m_desc = m_desc + desc 237 | m_errs = m_errs + errs 238 | 239 | gridBarH(m_ht, m_desc, m_errs, best_std1, int(len(m_ht)), 12) 240 | 241 | 242 | 243 | def gridBarH(hts, desc, errs, std1, h = 6, w = 12): 244 | 245 | fig = plt.figure(facecolor = 'w', figsize = (w, h)) 246 | ax = plt.subplot(111) 247 | plt.subplots_adjust(bottom = 0.25) 248 | 249 | width = 0.5 250 | 251 | pos = np.arange(len(hts)) 252 | 253 | rec = ax.barh(pos, np.array(hts), width, xerr = np.array(errs), color='r') 254 | 255 | ax.set_yticks(pos + width/2) 256 | ax.set_yticklabels(desc, size = 14) 257 | 258 | tmp = list(hts) 259 | tmp.sort() 260 | 261 | x_min = np.array(hts).min() - 2*np.array(hts).std() 262 | x_max = tmp[-2] + 2*np.array(hts).std() 263 | plt.xlim(x_min, x_max) 264 | 265 | 266 | plt.plot(tmp[0] * np.ones(len(tmp)), pos) 267 | plt.plot((tmp[0] + std1) * np.ones(len(tmp)), pos) 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | -------------------------------------------------------------------------------- /ipython/course_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | import sklearn 5 | import math 6 | from sklearn.metrics import roc_curve, auc 7 | import pickle 8 | 9 | def evenSplit(dat,fld): 10 | ''' 11 | Evenly splits the data on a given binary field, returns a shuffled dataframe 12 | ''' 13 | pos=dat[(dat[fld]==1)] 14 | neg=dat[(dat[fld]==0)] 15 | neg_shuf=neg.reindex(np.random.permutation(neg.index)) 16 | fin_temp=pos.append(neg_shuf[:pos.shape[0]],ignore_index=True) 17 | fin_temp=fin_temp.reindex(np.random.permutation(fin_temp.index)) 18 | return fin_temp 19 | 20 | 21 | def trainTest(dat, pct): 22 | ''' 23 | Randomly splits data into train and test 24 | ''' 25 | dat_shuf = dat.reindex(np.random.permutation(dat.index)) 26 | trn = dat_shuf[:int(np.floor(dat_shuf.shape[0]*pct))] 27 | tst = dat_shuf[int(np.floor(dat_shuf.shape[0]*pct)):] 28 | return [trn, tst] 29 | 30 | def downSample(dat,fld,mult): 31 | ''' 32 | Downsamples the negative class to at most mult times the positive class, returns a shuffled dataframe 33 | ''' 34 | pos=dat[(dat[fld]==1)] 35 | neg=dat[(dat[fld]==0)] 36 | neg_shuf=neg.reindex(np.random.permutation(neg.index)) 37 | tot=min(pos.shape[0]*mult,neg.shape[0]) 38 | fin_temp=pos.append(neg_shuf[:tot],ignore_index=True) 39 | fin_temp=fin_temp.reindex(np.random.permutation(fin_temp.index)) 40 | return fin_temp 41 | 42 | 43 | def scaleData(d): 44 | ''' 45 | This function takes data and normalizes it to have the same scale (num-min)/(max-min) 46 | ''' 47 | #Note, by creating df_scale like this we preserve the index 48 | df_scale=pd.DataFrame(d.iloc[:,1],columns=['temp']) 49 | for c in d.columns.values: 50 | df_scale[c]=(d[c]-d[c].min())/(d[c].max()-d[c].min()) 51 | return df_scale.drop('temp',1) 52 | 53 | 54 | def plot_dec_line(mn,mx,b0,b1,a,col,lab): 55 | ''' 56 | This function
plots a line in a 2 dim space 57 | ''' 58 | x = np.random.uniform(mn,mx,100) 59 | dec_line = map(lambda x_i: -1*(x_i*b0/b1+a/b1),x) 60 | plt.plot(x,dec_line,col,label=lab) 61 | 62 | 63 | 64 | def plotSVM(X, Y, my_svm): 65 | ''' 66 | Plots the separating line along with SV's and margin lines 67 | Code here derived or taken from this example http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html 68 | ''' 69 | # get the separating hyperplane 70 | w = my_svm.coef_[0] 71 | a = -w[0] / w[1] 72 | xx = np.linspace(X.iloc[:,0].min(), X.iloc[:,0].max()) 73 | yy = a * xx - (my_svm.intercept_[0]) / w[1] 74 | # plot the parallels to the separating hyperplane that pass through the 75 | # support vectors 76 | b = my_svm.support_vectors_[0] 77 | yy_down = a * xx + (b[1] - a * b[0]) 78 | b = my_svm.support_vectors_[-1] 79 | yy_up = a * xx + (b[1] - a * b[0]) 80 | # plot the line, the points, and the nearest vectors to the plane 81 | plt.plot(xx, yy, 'k-') 82 | plt.plot(xx, yy_down, 'k--') 83 | plt.plot(xx, yy_up, 'k--') 84 | plt.scatter(my_svm.support_vectors_[:, 0], my_svm.support_vectors_[:, 1], s=80, facecolors='none') 85 | plt.plot(X[(Y==-1)].iloc[:,0], X[(Y==-1)].iloc[:,1],'r.') 86 | plt.plot(X[(Y==1)].iloc[:,0], X[(Y==1)].iloc[:,1],'b+') 87 | #plt.axis('tight') 88 | #plt.show() 89 | 90 | 91 | def getP(val): 92 | ''' 93 | Get f(x) where f is the logistic function 94 | ''' 95 | return (1+math.exp(-1*val))**-1 96 | 97 | def getY(val): 98 | ''' 99 | Return a binary indicator based on a binomial draw with prob=f(val), where f is the logistic function. 100 | ''' 101 | return (int(getP(val)>np.random.uniform(0,1,1)[0])) 102 | 103 | def gen_logistic_dataframe(n,alpha,betas): 104 | ''' 105 | A function that generates a random logistic dataset 106 | n is the number of samples 107 | alpha, betas are the logistic truth 108 | ''' 109 | X = np.random.random([n,len(betas)]) 110 | Y = map(getY,X.dot(betas)+alpha) 111 | d = pd.DataFrame(X,columns=['f'+str(j) for j in range(X.shape[1])]) 112 | d['Y'] = Y 113 | return d 114 | 115 | 116 | def plotAUC(truth, pred, lab): 117 | fpr, tpr, thresholds = roc_curve(truth, pred) 118 | roc_auc = auc(fpr, tpr) 119 | c = (np.random.rand(), np.random.rand(), np.random.rand()) 120 | plt.plot(fpr, tpr, color=c, label= lab+' (AUC = %0.2f)' % roc_auc) 121 | plt.plot([0, 1], [0, 1], 'k--') 122 | plt.xlim([0.0, 1.0]) 123 | plt.ylim([0.0, 1.0]) 124 | plt.xlabel('FPR') 125 | plt.ylabel('TPR') 126 | plt.title('ROC') 127 | plt.legend(loc="lower right") 128 | 129 | 130 | 131 | def LogLoss(dat, beta, alpha): 132 | X = dat.drop('Y',1) 133 | Y = dat['Y'] 134 | XB=X.dot(np.array(beta))+alpha*np.ones(len(Y)) 135 | P=(1+np.exp(-1*XB))**-1 136 | return ((Y==1)*np.log(P)+(Y==0)*np.log(1-P)).mean() 137 | 138 | 139 | def plotSVD(sig): 140 | norm = math.sqrt(sum(sig*sig)) 141 | energy_k = [math.sqrt(k)/norm for k in np.cumsum(sig*sig)] 142 | 143 | plt.figure() 144 | ax1 = plt.subplot(211) 145 | ax1.bar(range(len(sig)+1), [0]+list(sig), 0.35) 146 | plt.title('Kth Singular Value') 147 | plt.tick_params(axis='x',which='both',bottom='off',top='off',labelbottom='off') 148 | 149 | ax2 = plt.subplot(212) 150 | plt.plot(range(len(sig)+1), [0]+energy_k) 151 | plt.title('Normalized Sum-of-Squares of Kth Singular Value') 152 | 153 | ax2.set_xlabel('Kth Singular Value') 154 | ax2.set_ylim([0, 1]) 155 | 156 | 157 | def genY(x, err, betas): 158 | ''' 159 | Goal: generate a Y variable as Y=XB+e 160 | Input 161 | 1. an np array x of length n 162 | 2. a random noise vector err of length n 163 | 3.
a (d+1) x 1 vector of coefficients b - each represents ith degree of x 164 | ''' 165 | d = pd.DataFrame(x, columns=['x']) 166 | y = err 167 | for i,b in enumerate(betas): 168 | y = y + b*x**i 169 | d['y'] = y 170 | return d 171 | 172 | 173 | def makePolyFeat(d, deg): 174 | ''' 175 | Goal: Generate features up to X**deg 176 | 1. a data frame with two features X and Y 177 | 4. a degree 'deg' (from which we make polynomial features 178 | 179 | ''' 180 | #Generate Polynomial terms 181 | for i in range(2, deg+1): 182 | d['x'+str(i)] = d['x']**i 183 | return d 184 | 185 | 186 | def save_obj(obj, name ): 187 | with open(name + '.pkl', 'wb') as f: 188 | pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL) 189 | 190 | def load_obj(name ): 191 | with open(name + '.pkl', 'r') as f: 192 | return pickle.load(f) 193 | -------------------------------------------------------------------------------- /ipython/data/Cell2Cell_info.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/data/Cell2Cell_info.pdf -------------------------------------------------------------------------------- /ipython/references/churn_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/references/churn_architecture.png -------------------------------------------------------------------------------- /ipython/references/churn_dataset_info.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/references/churn_dataset_info.pdf -------------------------------------------------------------------------------- /ipython/references/churn_sampling_scheme.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/references/churn_sampling_scheme.png -------------------------------------------------------------------------------- /ipython/references/intro ds syllabus - fall 2015.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/references/intro ds syllabus - fall 2015.pdf -------------------------------------------------------------------------------- /ipython/utils/.eval_plots.py.swp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/utils/.eval_plots.py.swp -------------------------------------------------------------------------------- /ipython/utils/ClassifierBakeoff.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from sklearn.linear_model import LogisticRegression 4 | from sklearn.metrics import roc_auc_score 5 | from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, RandomForestClassifier 6 | from sklearn.tree import DecisionTreeClassifier 7 | 8 | 9 | def liftTable(pred, truth, b): 10 | df = pd.DataFrame({'p':pred + np.random.rand(len(pred))*0.000001, 'y':truth}) 11 | df['b'] = 
b - pd.qcut(df['p'], b, labels=False) 12 | df['n'] = np.ones(df.shape[0]) 13 | df_grp = df.groupby(['b']).sum() 14 | base = np.sum(df_grp['y'])/float(df.shape[0]) 15 | df_grp['n_cum'] = np.cumsum(df_grp['n'])/float(df.shape[0]) 16 | df_grp['y_cum'] = np.cumsum(df_grp['y']) 17 | df_grp['p_y_b'] = df_grp['y']/df_grp['n'] 18 | df_grp['lift_b'] = df_grp['p_y_b']/base 19 | df_grp['cum_lift_b'] = (df_grp['y_cum']/(float(df.shape[0])*df_grp['n_cum']))/base 20 | return df_grp 21 | 22 | 23 | def getMetrics(preds, labels): 24 | ''' 25 | Takes in non-binary predictions and labels and returns AUC, and several Lifts 26 | ''' 27 | auc = roc_auc_score(labels, preds) 28 | ltab = liftTable(preds, labels, 100) 29 | 30 | lift1 = ltab.ix[1].cum_lift_b 31 | lift5 = ltab.ix[5].cum_lift_b 32 | lift10 = ltab.ix[10].cum_lift_b 33 | lift25 = ltab.ix[25].cum_lift_b 34 | 35 | return [auc, lift1, lift5, lift10, lift25] 36 | 37 | 38 | def dToString(d, dm1, dm2): 39 | ''' 40 | Takes key-values and makes a string, d1 seprates k:v, d2 separates pairs 41 | ''' 42 | arg_str = '' 43 | for k in sorted(d.keys()): 44 | if len(arg_str) == 0: 45 | arg_str = '{}{}{}'.format(k, dm1, d[k]) 46 | else: 47 | arg_str = arg_str + '{}{}{}{}'.format(dm2, k, dm1, d[k]) 48 | return arg_str 49 | 50 | def getArgCombos(arg_lists): 51 | ''' 52 | Takes every combination and returns an iterable of dicts 53 | ''' 54 | keys = sorted(arg_lists.keys()) 55 | #Initialize the final iterable 56 | tot = 1 57 | for k in keys: 58 | tot = tot * len(arg_lists[k]) 59 | iter = [] 60 | #Fill it with empty dicts 61 | for i in range(tot): 62 | iter.append({}) 63 | #Now fill each dictionary 64 | kpass = 1 65 | for k in keys: 66 | klist = arg_lists[k] 67 | ktot = len(klist) 68 | for i in range(tot): 69 | iter[i][k] = klist[(i/kpass) % ktot] 70 | kpass = ktot * kpass 71 | return iter 72 | 73 | 74 | class LRAdaptor(object): 75 | ''' 76 | This adapts the LogisticRegression() Classifier so that LR can be used as an init for GBT 77 | This just overwrites the predict method to be predict_proba 78 | ''' 79 | def __init__(self, est): 80 | self.est = est 81 | 82 | def predict(self, X): 83 | return self.est.predict_proba(X)[:,1][:, np.newaxis] 84 | 85 | def fit(self, X, y): 86 | self.est.fit(X, y) 87 | 88 | class GenericClassifier(object): 89 | 90 | def __init__(self, modclass, dictargs): 91 | self.classifier = modclass(**dictargs) 92 | 93 | def fit(self, X, Y): 94 | self.classifier.fit(X,Y) 95 | 96 | def predict_proba(self, Xt): 97 | return self.classifier.predict_proba(Xt) 98 | 99 | 100 | class GenericClassifierOptimizer(object): 101 | 102 | def __init__(self, classtype, arg_lists): 103 | self.name = classtype.__name__ 104 | self.classtype = classtype 105 | self.arg_lists = arg_lists 106 | self.results = self._initDict() 107 | 108 | def _initDict(self): 109 | return {'alg':[], 'opt':[], 'auc':[], 'lift1':[], 'lift5':[], 'lift10':[], 'lift25':[]} 110 | 111 | def _updateResDict(self, opt, perf): 112 | self.results['alg'].append(self.name) 113 | self.results['opt'].append(opt) 114 | self.results['auc'].append(perf[0]) 115 | self.results['lift1'].append(perf[1]) 116 | self.results['lift5'].append(perf[2]) 117 | self.results['lift10'].append(perf[3]) 118 | self.results['lift25'].append(perf[4]) 119 | 120 | def runClassBake(self, X_train, Y_train, X_test, Y_test): 121 | 122 | arg_loop = getArgCombos(self.arg_lists) 123 | 124 | for d in arg_loop: 125 | 126 | mod = GenericClassifier(self.classtype, d) 127 | mod.fit(X_train, Y_train) 128 | 129 | perf = 
getMetrics(mod.predict_proba(X_test)[:,1], Y_test) 130 | self._updateResDict(dToString(d, ':', '|'), perf) 131 | 132 | 133 | 134 | class ClassifierBakeoff(object): 135 | 136 | def __init__(self, X_train, Y_train, X_test, Y_test, setup): 137 | self.instructions = setup 138 | self.X_train = X_train 139 | self.Y_train = Y_train 140 | self.X_test = X_test 141 | self.Y_test = Y_test 142 | self.results = self._initDict() 143 | 144 | def _initDict(self): 145 | return {'alg':[], 'opt':[], 'auc':[], 'lift1':[], 'lift5':[], 'lift10':[], 'lift25':[]} 146 | 147 | def _updateResDict(self, clfr_results): 148 | self.results['alg'] = self.results['alg'] + clfr_results['alg'] 149 | self.results['opt'] = self.results['opt'] + clfr_results['opt'] 150 | self.results['auc'] = self.results['auc'] + clfr_results['auc'] 151 | self.results['lift1'] = self.results['lift1'] + clfr_results['lift1'] 152 | self.results['lift5'] = self.results['lift5'] + clfr_results['lift5'] 153 | self.results['lift10'] = self.results['lift10'] + clfr_results['lift10'] 154 | self.results['lift25'] = self.results['lift25'] + clfr_results['lift25'] 155 | 156 | 157 | def bake(self): 158 | 159 | for clfr in self.instructions: 160 | 161 | classifierBake = GenericClassifierOptimizer(clfr, self.instructions[clfr]) 162 | classifierBake.runClassBake(self.X_train, self.Y_train, self.X_test, self.Y_test) 163 | self._updateResDict(classifierBake.results) 164 | 165 | 166 | 167 | 168 | 169 | 170 | -------------------------------------------------------------------------------- /ipython/utils/ClassifierBakeoff.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/utils/ClassifierBakeoff.pyc -------------------------------------------------------------------------------- /ipython/utils/bias_variance.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pandas as pd 3 | import numpy as np 4 | from matplotlib import pyplot as plt 5 | import sklearn.metrics as skm 6 | import warnings 7 | warnings.filterwarnings('ignore') 8 | from sklearn import linear_model 9 | 10 | def simPolynomial(sigma = 0, betas = [0, 0], n = 100): 11 | 12 | x = np.random.uniform(0, 100, n) 13 | e = np.random.normal(0, sigma, n) 14 | 15 | d = pd.DataFrame(x, columns=['x']) 16 | y = e 17 | for i, b in enumerate(betas): 18 | y = y + b*(x**i) 19 | d['y'] = y 20 | return d 21 | 22 | 23 | def fitLinReg(d, mn, mx, inter): 24 | ''' 25 | Runs a linear regression and fits it on a grid 26 | ''' 27 | 28 | regr = linear_model.LinearRegression(fit_intercept = inter) 29 | regr.fit(d.drop('y', 1), d['y']) 30 | yhat = regr.predict(pd.DataFrame(np.arange(mn, mx, 1))) 31 | 32 | return yhat 33 | 34 | def makePolyFeat(d, deg): 35 | ''' 36 | Goal: Generate features up to X**deg 37 | 1. a data frame with two features X and Y 38 | 4. a degree 'deg' (from which we make polynomial features 39 | 40 | ''' 41 | #Generate Polynomial terms 42 | for i in range(2, deg+1): 43 | d['x'+str(i)] = d['x']**i 44 | return d 45 | 46 | def fitFullReg(d, mn, mx, betas, inter): 47 | ''' 48 | Runs a linear regression and fits it on a grid. 
Creates polynomial features using the dimension of betas 49 | ''' 50 | 51 | regr = linear_model.LinearRegression(fit_intercept = inter) 52 | regr.fit(makePolyFeat(d.drop('y', 1), len(betas)), d['y']) 53 | dt = pd.DataFrame(np.arange(mn, mx, 1), columns = ['x']) 54 | yhat = regr.predict(makePolyFeat(dt, len(betas))) 55 | 56 | return yhat 57 | 58 | 59 | 60 | def plotLinearBiasStage(sigma, betas, ns, fs): 61 | 62 | mn = 0 63 | mx = 101 64 | 65 | d = simPolynomial(sigma, betas, 10000) 66 | plt.figure(figsize = fs) 67 | plt.plot(d['x'], d['y'], 'b.', markersize = 0.75) 68 | 69 | 70 | x = np.arange(mn, mx, 1) 71 | y_real = np.zeros(len(x)) 72 | for i, b in enumerate(betas): 73 | y_real += b*(x**i) 74 | 75 | #plt.plot(x, y_real + 2*sigma, 'k+') 76 | #plt.plot(x, y_real - 2*sigma, 'k--') 77 | plt.plot(x, y_real, 'k*') 78 | 79 | for n in ns: 80 | dn = simPolynomial(sigma, betas, n) 81 | yhat = fitLinReg(dn, mn, mx, True) 82 | plt.plot(x, yhat, label = 'n={}'.format(n)) 83 | 84 | 85 | plt.legend(loc = 4, ncol = 3) 86 | 87 | 88 | 89 | def plotVariance(sigma, betas, ns, fs): 90 | 91 | mn = 0 92 | mx = 101 93 | nworlds = 100 94 | 95 | d = simPolynomial(sigma, betas, 10000) 96 | x = np.arange(mn, mx, 1) 97 | 98 | fig = plt.figure(figsize = fs) 99 | for pos, n in enumerate(ns): 100 | 101 | #First model each world 102 | yhat_lin = [] 103 | yhat_non = [] 104 | for i in range(nworlds): 105 | 106 | dn = simPolynomial(sigma, betas, n) 107 | 108 | yhat_lin.append(fitLinReg(dn, mn, mx, True)) 109 | yhat_non.append(fitFullReg(dn, mn, mx, betas, True)) 110 | 111 | #Now compute appropriate stats and plot 112 | 113 | lin_df = pd.DataFrame(yhat_lin) 114 | non_df = pd.DataFrame(yhat_non) 115 | 116 | lin_sig = lin_df.apply(np.std, axis=0).values 117 | non_sig = non_df.apply(np.std, axis=0).values 118 | lin_mu = lin_df.apply(np.mean, axis=0).values 119 | non_mu = non_df.apply(np.mean, axis=0).values 120 | 121 | #Need to continue from here 122 | 123 | for i in range(nworlds): 124 | 125 | ax1 = fig.add_subplot(2, 3, pos + 1) 126 | plt.title('n={}'.format(n)) 127 | plt.plot(x, yhat_lin[i], '.', color = '0.75') 128 | 129 | if i == nworlds - 1: 130 | plt.plot(x, lin_mu, 'r-') 131 | plt.title('E[std|X] = {}'.format(round(lin_sig.mean(),1))) 132 | 133 | ax1.axes.get_xaxis().set_visible(False) 134 | ax1.set_ylim((-40, 80)) 135 | 136 | ax2 = fig.add_subplot(2, 3, pos + 4) 137 | plt.plot(x, yhat_non[i], '--', color = '0.75') 138 | 139 | if i == nworlds - 1: 140 | plt.plot(x, non_mu, 'r-') 141 | plt.title('E[std|X] = {}'.format(round(non_sig.mean(),1))) 142 | 143 | ax2.set_ylim((-40, 80)) 144 | 145 | if pos != 0: 146 | ax1.axes.get_yaxis().set_visible(False) 147 | ax2.axes.get_yaxis().set_visible(False) 148 | 149 | plt.legend() 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | def getVarianceTrend(sigma, betas): 158 | 159 | mn = 50 160 | mx = 51 161 | nworlds = 100 162 | ns = np.logspace(4, 16, num = 10, base = 2) 163 | 164 | res_dict = {'n':[], 'lin':[], 'quad':[], 'non':[]} 165 | 166 | for pos, n in enumerate(ns): 167 | 168 | yhat_lin = []; yhat_quad = []; yhat_non = [] 169 | 170 | for i in range(nworlds): 171 | 172 | dn = simPolynomial(sigma, betas, n) 173 | 174 | #yhat_lin.append(fitLinReg(dn, mn, mx, True)[0]) 175 | yhat_lin.append(fitFullReg(dn, mn, mx, betas[0:1], True)[0]) 176 | yhat_quad.append(fitFullReg(dn, mn, mx, betas[0:2], True)[0]) 177 | yhat_non.append(fitFullReg(dn, mn, mx, betas, True)[0]) 178 | 179 | res_dict['lin'].append(np.array(yhat_lin).std()) 180 | res_dict['quad'].append(np.array(yhat_quad).std()) 181 | 
res_dict['non'].append(np.array(yhat_non).std()) 182 | res_dict['n'].append(n) 183 | 184 | 185 | return res_dict 186 | 187 | def plotVarianceTrend(res_dict, fs): 188 | 189 | fig = plt.figure(figsize = fs) 190 | 191 | ax1 = fig.add_subplot(2, 1, 1) 192 | x = np.log2(res_dict['n']) 193 | plt.plot(x, np.power(res_dict['lin'], 2), 'b-', label = 'd = 1') 194 | plt.plot(x, np.power(res_dict['quad'], 2), 'r-', label = 'd = 2') 195 | plt.plot(x, np.power(res_dict['non'], 2), 'g-', label = 'd = 4') 196 | 197 | ax1.set_ylim((0, 100)) 198 | 199 | plt.title('Model Variance by Polynomial Order (d) and Sample Size (n)') 200 | plt.legend(loc = 1) 201 | plt.ylabel('Var( E_d[Y|X = 50] )') 202 | 203 | ax2 = fig.add_subplot(2, 1, 2) 204 | filt = (x > 0) 205 | plt.plot(x[filt], 2*np.log2(res_dict['lin']), 'b-', label = 'd = 1') 206 | plt.plot(x[filt], 2*np.log2(res_dict['quad']), 'r-', label = 'd = 2') 207 | plt.plot(x[filt], 2*np.log2(res_dict['non']), 'g-', label = 'd = 4') 208 | 209 | ax2.set_xlim((x[filt].min(), x.max())) 210 | plt.xlabel('Log2(Sample Size)') 211 | plt.ylabel('Log [ Var( E_d[Y|X = 50] ) ]') 212 | plt.legend(loc = 1) 213 | -------------------------------------------------------------------------------- /ipython/utils/eval_plots.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import pandas as pd 4 | 5 | def getMAE(pred, truth): 6 | return np.abs(truth - pred).mean() 7 | 8 | def getLL(pred, truth): 9 | ll_sum = 0 10 | for i in range(len(pred)): 11 | if (pred[i] == 0): 12 | p = 0.0001 13 | elif (pred[i] == 1): 14 | p = 0.9999 15 | else: 16 | p = pred[i] 17 | ll_sum += truth[i]*np.log(p)+(1-truth[i])*np.log(1-p) 18 | return (ll_sum)/len(pred) 19 | 20 | 21 | def plotCalib(truth, pred, bins = 100, f = 0, l = '', w = 8, h = 8, fig_i = 1, fig_j = 1, fig_k = 1): 22 | mae = np.round(getMAE(pred, truth),3) 23 | ll = np.round(getLL(pred, truth), 3) 24 | 25 | d = pd.DataFrame({'p':pred, 'y':truth}) 26 | d['p_bin'] = np.floor(d['p']*bins)/bins 27 | d_bin = d.groupby(['p_bin']).agg([np.mean, len]) 28 | filt = (d_bin['p']['len']>f) 29 | 30 | 31 | if fig_k == 1: 32 | fig = plt.figure(facecolor = 'w', figsize = (w, h)) 33 | 34 | x = d_bin['p']['mean'][filt] 35 | y = d_bin['y']['mean'][filt] 36 | n = d_bin['y']['len'][filt] 37 | 38 | stderr = np.sqrt(y * (1 - y)/n) 39 | 40 | ax = plt.subplot(fig_i, fig_j, fig_k) 41 | #plt.plot(x, y, 'b.', markersize = 9) 42 | plt.errorbar(x, y, yerr = 1.96 * stderr, fmt = 'o') 43 | plt.plot([0.0, 1.0], [0.0, 1.0], 'k-') 44 | plt.title(l + ':' + ' MAE = {}, LL = {}'.format(mae, ll)) 45 | 46 | plt.xlim([0.0, 1.0]) 47 | plt.ylim([0.0, 1.0]) 48 | plt.xlabel('prediction P(Y|X)') 49 | plt.ylabel('actual P(Y|X)') 50 | #plt.legend(loc=4) 51 | 52 | 53 | 54 | def liftTable(pred, truth, b): 55 | df = pd.DataFrame({'p':pred + np.random.rand(len(pred))*0.000001, 'y':truth}) 56 | df['b'] = b - pd.qcut(df['p'], b, labels=False) 57 | df['n'] = np.ones(df.shape[0]) 58 | df_grp = df.groupby(['b']).sum() 59 | tot_y = float(np.sum(df_grp['y'])) 60 | base = tot_y/float(df.shape[0]) 61 | df_grp['n_cum'] = np.cumsum(df_grp['n'])/float(df.shape[0]) 62 | df_grp['y_cum'] = np.cumsum(df_grp['y']) 63 | df_grp['p_y_b'] = df_grp['y']/df_grp['n'] 64 | df_grp['lift_b'] = df_grp['p_y_b']/base 65 | df_grp['cum_lift_b'] = (df_grp['y_cum']/(float(df.shape[0])*df_grp['n_cum']))/base 66 | df_grp['recall'] = df_grp['y_cum']/tot_y 67 | return df_grp 68 | 69 | 70 | def liftRecallCurve(pred, truth, b, h = 6, w = 12, 
title = ''): 71 | 72 | #Get the lift table 73 | lt = liftTable(pred, truth, b) 74 | 75 | fig, ax1 = plt.subplots(figsize = (w, h)) 76 | 77 | ax1.plot(lt['n_cum'], lt['cum_lift_b'], 'b-') 78 | 79 | ax1.set_xlabel('Quantile') 80 | # Make the y-axis label and tick labels match the line color. 81 | ax1.set_ylabel('Lift', color='b') 82 | for tl in ax1.get_yticklabels(): 83 | tl.set_color('b') 84 | 85 | ax2 = ax1.twinx() 86 | ax2.plot(lt['n_cum'], lt['recall'], 'r.') 87 | ax2.set_ylabel('Recall', color='r') 88 | for tl in ax2.get_yticklabels(): 89 | tl.set_color('r') 90 | 91 | plt.title(title) 92 | 93 | plt.show() 94 | 95 | -------------------------------------------------------------------------------- /ipython/utils/eval_plots.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jmportilla/DataScienceCourse/92bb8f50a55a9e844357b795c48aee35b4772aff/ipython/utils/eval_plots.pyc --------------------------------------------------------------------------------