├── LICENSE ├── README.md ├── Section 1 ├── 1.3.ipynb ├── 1.4.ipynb └── 1.5.ipynb ├── Section 2 ├── 2.1.ipynb ├── 2.2.ipynb ├── 2.3.ipynb └── german_credit.csv.txt ├── Section 3 ├── 3.1.ipynb ├── 3.2.ipynb ├── 3.3.ipynb ├── 3.4.ipynb └── 3.5.ipynb ├── Section 4 ├── 4.1.ipynb ├── 4.2.ipynb ├── 4.3.ipynb └── german_credit.csv └── Section 5 ├── 5.1.ipynb ├── 5.2.ipynb ├── 5.3.ipynb ├── fe_data.csv ├── german_credit.csv └── y.csv /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Hands-On-Feature-Engineering-with-Python 2 | Strengthen your machine learning with advanced feature engineering techniques 3 | 4 | This is the code repository for [Hands-On Feature Engineering with Python [Video]](https://www.packtpub.com/big-data-and-business-intelligence/hands-feature-engineering-python-video), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish. 5 | ## About the Video Course 6 | Feature engineering is the most important aspect of machine learning. You know that every day you put off learning the process, you are hurting your model’s performance. Studies repeatedly prove that feature engineering can be much more powerful than the choice of algorithms. Yet the field of feature engineering can seem overwhelming and confusing. 7 | This course offers you the single best solution. In this course, all of the recommendations have been extensively tested and proven on real-world problems. You’ll find everything included: the recommendations, the code, the data sources, and the rationale. You’ll get an over-the-shoulder, step-by-step approach for every situation, and each segment can stand alone, allowing you to jump immediately to the topics most important to you. 8 | By the end of the course, you’ll have a clear, concise path to feature engineering and will be able to get improved results by applying feature engineering techniques to your own datasets. 9 | 10 | ## Table of Contents:
11 | Section 1 - Introduction to Feature Engineering
12 | Section 2 - Implementing Feature Extraction
13 | Section 3 - Implementing Feature Transformation
14 | Section 4 - Implementing Feature Selection
15 | Section 5 - Putting All Together - Building the Application
16 | 17 | 18 | 19 |

What You Will Learn

20 |

21 |

Master the insider tips for world-class feature engineering 23 |
Eliminate frustration and confusion in handling all aspects of features 24 |
Dramatically reduce the time required to move to the modeling steps of the process 25 |
Handle missing values with speed and ease 26 |
Systematically test for feature interaction terms and build new features 27 |
Leverage advanced “target mean encoding” to maximize performance and understanding 28 |
Handle outliers automatically with much less effort 29 |

30 | 31 | ## Instructions and Navigation 32 | ### Assumed Knowledge 33 | Anyone who wants to build faster, more accurate machine learning models will benefit. This course assumes that you have basic familiarity with Python as well as machine learning concepts. The content covers material for beginners through to experts. 34 | 35 | ### Technical Requirements 36 | This course has the following requirements:
37 | Operating system: Windows or Linux
38 | Browser: Mozilla or Chrome
39 | IDE: PyCharm Community edition
40 | Jupyter Notebook or JupyterLab
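
Before opening the notebooks, it may help to confirm that the Python packages they import are available. The following is only a minimal sketch: the package list is taken from the import cells of the notebooks in this repository, while the helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions and not part of the course code.

```python
from importlib.util import find_spec

# Top-level import names used in the notebooks' import cells
# (assumption: this list is illustrative and may not be exhaustive).
REQUIRED = [
    "pandas",
    "numpy",
    "sklearn",
    "scipy",
    "seaborn",
    "matplotlib",
    "scikitplot",
]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All notebook dependencies were found.")
```

If anything is reported as missing, installing it into the same environment that runs Jupyter should be enough to get the notebooks started.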
41 | 42 | 43 | 44 | ## Related Products 45 | * [Feature Engineering Made Easy](https://prod.packtpub.com/in/big-data-and-business-intelligence/feature-engineering-made-easy) 46 | 47 | * [Machine Learning Algorithms in 7 Days [Video]](https://prod.packtpub.com/in/big-data-and-business-intelligence/machine-learning-algorithms-7-days-video) 48 | 49 | * [Introduction to ML Classification Models using scikit-learn [Video]](https://prod.packtpub.com/in/application-development/introduction-ml-classification-models-using-scikit-learn-video) 50 | -------------------------------------------------------------------------------- /Section 1/1.4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 12, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd # data munging\n", 10 | "import seaborn as sns #visualization\n", 11 | "import matplotlib.pyplot as plt #visualization\n", 12 | "from sklearn.ensemble import RandomForestClassifier #for a classification model\n", 13 | "import numpy as np #scientific computing\n", 14 | "from sklearn.model_selection import train_test_split # split the data into training and testing sets" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## set working directory" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import os\n", 31 | "os.chdir('/home/sahibachopra/packt/')" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "## load the data" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "data = pd.read_csv('german_credit.csv')" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 5, 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | 
"data": { 57 | "text/html": [ 58 | "

\n", 59 | "\n", 72 | "\n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | "

	Creditability	Account Balance	Duration of Credit (month)	Payment Status of Previous Credit	Purpose	Credit Amount	Value Savings/Stocks	Length of current employment	Instalment per cent	Sex & Marital Status	...	Duration in Current address	Most valuable available asset	Age (years)	Concurrent Credits	Type of apartment	No of Credits at this Bank	Occupation	No of dependents	Telephone	Foreign Worker
0	1	1	18	4	2	1049	1	2	4	2	...	4	2	21	3	1	1	3	1	1	1
1	1	1	9	4	0	2799	1	3	2	3	...	2	1	36	3	1	2	3	2	1	1
2	1	2	12	2	9	841	2	4	2	2	...	4	1	23	3	1	1	2	1	1	1
3	1	1	12	4	0	2122	1	3	3	3	...	2	1	39	3	1	2	2	2	1	2
4	1	1	12	4	0	2171	1	3	4	3	...	4	2	38	1	2	2	2	1	1	2

\n", 222 | "

5 rows × 21 columns

\n", 223 | "

" 224 | ], 225 | "text/plain": [ 226 | " Creditability Account Balance Duration of Credit (month) \\\n", 227 | "0 1 1 18 \n", 228 | "1 1 1 9 \n", 229 | "2 1 2 12 \n", 230 | "3 1 1 12 \n", 231 | "4 1 1 12 \n", 232 | "\n", 233 | " Payment Status of Previous Credit Purpose Credit Amount \\\n", 234 | "0 4 2 1049 \n", 235 | "1 4 0 2799 \n", 236 | "2 2 9 841 \n", 237 | "3 4 0 2122 \n", 238 | "4 4 0 2171 \n", 239 | "\n", 240 | " Value Savings/Stocks Length of current employment Instalment per cent \\\n", 241 | "0 1 2 4 \n", 242 | "1 1 3 2 \n", 243 | "2 2 4 2 \n", 244 | "3 1 3 3 \n", 245 | "4 1 3 4 \n", 246 | "\n", 247 | " Sex & Marital Status ... Duration in Current address \\\n", 248 | "0 2 ... 4 \n", 249 | "1 3 ... 2 \n", 250 | "2 2 ... 4 \n", 251 | "3 3 ... 2 \n", 252 | "4 3 ... 4 \n", 253 | "\n", 254 | " Most valuable available asset Age (years) Concurrent Credits \\\n", 255 | "0 2 21 3 \n", 256 | "1 1 36 3 \n", 257 | "2 1 23 3 \n", 258 | "3 1 39 3 \n", 259 | "4 2 38 1 \n", 260 | "\n", 261 | " Type of apartment No of Credits at this Bank Occupation \\\n", 262 | "0 1 1 3 \n", 263 | "1 1 2 3 \n", 264 | "2 1 1 2 \n", 265 | "3 1 2 2 \n", 266 | "4 2 2 2 \n", 267 | "\n", 268 | " No of dependents Telephone Foreign Worker \n", 269 | "0 1 1 1 \n", 270 | "1 2 1 1 \n", 271 | "2 1 1 1 \n", 272 | "3 2 1 2 \n", 273 | "4 1 1 2 \n", 274 | "\n", 275 | "[5 rows x 21 columns]" 276 | ] 277 | }, 278 | "execution_count": 5, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "data.head()" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "## split the data into predictor variables xVar and response variables yVar" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 7, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "yVar = data['Creditability']\n", 301 | "xVar = data.loc[:, data.columns != 'Creditability']" 302 | ] 303 | }, 304 | { 305 | 
"cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "## split our data into training and testing sets\n", 309 | "- 20% of the data into the testing set " 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 8, 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | "(800, 20) (800,)\n", 322 | "(200, 20) (200,)\n" 323 | ] 324 | } 325 | ], 326 | "source": [ 327 | "X_train, X_test, y_train, y_test = train_test_split(xVar, yVar, test_size=0.2)\n", 328 | "print (X_train.shape, y_train.shape)\n", 329 | "print (X_test.shape, y_test.shape)" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "# Build a random forest classifier" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 9, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "data": { 346 | "text/plain": [ 347 | "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", 348 | " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", 349 | " min_impurity_decrease=0.0, min_impurity_split=1e-07,\n", 350 | " min_samples_leaf=1, min_samples_split=2,\n", 351 | " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,\n", 352 | " oob_score=False, random_state=0, verbose=0, warm_start=False)" 353 | ] 354 | }, 355 | "execution_count": 9, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "clf = RandomForestClassifier(n_jobs=2, random_state=0)\n", 362 | "\n", 363 | "clf.fit(X_train, y_train)\n", 364 | "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", 365 | " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", 366 | " min_impurity_split=1e-07, min_samples_leaf=1,\n", 367 | " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", 368 | " n_estimators=10, n_jobs=2, oob_score=False, random_state=0,\n", 369 | " verbose=0, 
warm_start=False)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "# Predict " 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 10, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "preds = clf.predict(X_test)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "# Evaluate results\n", 393 | "- 152 / 200 predicted correctly" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 11, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "data": { 403 | "text/html": [ 404 | "

\n", 405 | "\n", 418 | "\n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | "

Predicted Result	0	1
Actual Result
0	30	28
1	20	122

\n", 444 | "

" 445 | ], 446 | "text/plain": [ 447 | "Predicted Result 0 1\n", 448 | "Actual Result \n", 449 | "0 30 28\n", 450 | "1 20 122" 451 | ] 452 | }, 453 | "execution_count": 11, 454 | "metadata": {}, 455 | "output_type": "execute_result" 456 | } 457 | ], 458 | "source": [ 459 | "pd.crosstab(y_test, preds, rownames=['Actual Result'], colnames=['Predicted Result'])" 460 | ] 461 | } 462 | ], 463 | "metadata": { 464 | "kernelspec": { 465 | "display_name": "Python 3", 466 | "language": "python", 467 | "name": "python3" 468 | }, 469 | "language_info": { 470 | "codemirror_mode": { 471 | "name": "ipython", 472 | "version": 3 473 | }, 474 | "file_extension": ".py", 475 | "mimetype": "text/x-python", 476 | "name": "python", 477 | "nbconvert_exporter": "python", 478 | "pygments_lexer": "ipython3", 479 | "version": "3.6.0" 480 | } 481 | }, 482 | "nbformat": 4, 483 | "nbformat_minor": 2 484 | } 485 | -------------------------------------------------------------------------------- /Section 2/2.1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 142, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import seaborn as sns\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "from sklearn.ensemble import RandomForestClassifier\n", 13 | "import numpy as np\n", 14 | "from sklearn.model_selection import train_test_split\n", 15 | "from scipy.io import arff\n", 16 | "from sklearn.metrics import roc_auc_score\n", 17 | "from sklearn.metrics import roc_curve, auc\n", 18 | "import scikitplot as skplt\n", 19 | "from sklearn.decomposition import PCA\n", 20 | "from sklearn.feature_selection import SelectKBest\n", 21 | "from sklearn.pipeline import Pipeline, FeatureUnion, make_union\n", 22 | "from sklearn.linear_model import LogisticRegression\n", 23 | "from sklearn.base import BaseEstimator, TransformerMixin\n", 24 | "from sklearn.preprocessing 
import scale" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 3, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "data = pd.read_csv('german_credit.csv')" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Feature extraction is a process of dimensionality reduction\n", 41 | "- Principal Component Analysis (PCA) helps reduce dimensions by creating a new, smaller set of variables that retains most of the variance in the original set; some information is lost whenever fewer components than original variables are kept \n", 42 | "- This efficient reduction of the number of variables is achieved by obtaining orthogonal linear combinations of the original variables – the so-called Principal Components (PCs). \n", 43 | "- PCA is useful for the compression of data and to find patterns in high-dimensional data. " 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## PCA doesn't improve the performance of Random Forests much because Random Forests check feature importance and build a model based on the important features\n", 51 | "- let's run a logistic regression model instead on the data \n", 52 | "- then run another logistic regression model on the data after we have reduced the number of dimensions using PCA" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 88, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "X, y = data.loc[:, data.columns != 'Creditability'], data['Creditability']" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 89, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "(800, 20) (800,)\n", 74 | "(200, 20) (200,)\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 80 | "print (X_train.shape, y_train.shape)\n", 81 | "print (X_test.shape, y_test.shape)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 
86 | "metadata": {}, 87 | "source": [ 88 | "## run a logistic regression model on the raw data" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 90, 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 100 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 101 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 102 | " verbose=0, warm_start=False)" 103 | ] 104 | }, 105 | "execution_count": 90, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "logreg = LogisticRegression()\n", 112 | "logreg.fit(X_train, y_train)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 91, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "name": "stdout", 122 | "output_type": "stream", 123 | "text": [ 124 | "Accuracy of logistic regression classifier on test set: 0.75\n" 125 | ] 126 | } 127 | ], 128 | "source": [ 129 | "y_pred = logreg.predict(X_test)\n", 130 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 92, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "## run PCA on the raw data to reduce dimensions" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 143, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stderr", 149 | "output_type": "stream", 150 | "text": [ 151 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype int64 was converted to float64 by the scale function.\n", 152 | " warnings.warn(msg, DataConversionWarning)\n" 153 | ] 154 | }, 155 | { 156 | "data": { 157 | "text/plain": [ 158 | "array([[ 0.65465367, -1.25456565, 
-0.24085723, ..., -0.42828957,\n", 159 | " -0.82331789, -0.19601428],\n", 160 | " [ 0.65465367, -1.25456565, -0.9875727 , ..., 2.33486893,\n", 161 | " -0.82331789, -0.19601428],\n", 162 | " [ 0.65465367, -0.45902624, -0.73866754, ..., -0.42828957,\n", 163 | " -0.82331789, -0.19601428],\n", 164 | " ...,\n", 165 | " [-1.52752523, 1.13205258, 0.00804793, ..., -0.42828957,\n", 166 | " 1.21459768, -0.19601428],\n", 167 | " [-1.52752523, -0.45902624, -0.73866754, ..., -0.42828957,\n", 168 | " 1.21459768, -0.19601428],\n", 169 | " [-1.52752523, -1.25456565, 0.75476341, ..., -0.42828957,\n", 170 | " -0.82331789, -0.19601428]])" 171 | ] 172 | }, 173 | "execution_count": 143, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "x = data.values #convert the data into a numpy array\n", 180 | "x = scale(x);x" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 144, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "covar_matrix = PCA(n_components = 20) #we have 20 features" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 145, 195 | "metadata": {}, 196 | "outputs": [ 197 | { 198 | "data": { 199 | "text/plain": [ 200 | "array([12.2, 22.4, 29.5, 35.9, 41.8, 47.5, 53. , 58.2, 63. 
, 67.4, 71.6,\n", 201 | " 75.5, 79.2, 82.8, 86.1, 89.1, 91.8, 94.3, 96.6, 98.8])" 202 | ] 203 | }, 204 | "execution_count": 145, 205 | "metadata": {}, 206 | "output_type": "execute_result" 207 | } 208 | ], 209 | "source": [ 210 | "covar_matrix.fit(x)\n", 211 | "variance = covar_matrix.explained_variance_ratio_ #calculate variance ratios\n", 212 | "\n", 213 | "var=np.cumsum(np.round(covar_matrix.explained_variance_ratio_, decimals=3)*100)\n", 214 | "var #cumulative sum of variance explained with [n] features" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 149, 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "data": { 224 | "text/plain": [ 225 | "[]" 226 | ] 227 | }, 228 | "execution_count": 149, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | }, 232 | { 233 | "data": { 234 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEWCAYAAAB8LwAVAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAAIABJREFUeJzt3XecVNX9//HXh97rsggsnaVJFwQVFHvDgp2osSVo1KhfE6NJTDSJMcYUE/0ZE6IiVqwREiVKUEBUuvTe27L0pbPt8/vj3jXrZnYZdpmZ3dn38/GYx9y599w5Hy6z85lz7r3nmLsjIiJSVJVEByAiIuWTEoSIiESkBCEiIhEpQYiISERKECIiEpEShIiIRKQEIVKOmNk6MzunjO+x38w6HK+YpPJSgpAKL/xSPRR+MWaa2Wgzq1do+/lmNtXM9pnZdjObYmaXFnmPoWbmZvajKOtsb2b5ZvaX4/3vKSt3r+fuaxIdh1R8ShCSLC5x93pAP2AA8DCAmV0FvA28DKQBzYGfA5cU2f8mYFf4HI1vA7uB68ysZpmjFymHlCAkqbj7ZmAC0MPMDPgj8Ct3f97ds9w9392nuPt3C/YxszrAVcBdQLqZ9Y+iqm8TJKEciiSbsCVyh5mtNLPdZvZsGAtm1tHMPjGznWa2w8xeM7NGRd/czE4ws4Nm1rTQupPCFlB1M+sUtoSywvd5s0j9ncLli8xsSdh62mxmP4z6YEqlpwQhScXMWgMXAV8BXYDWwDtH2e1KYD9BS+Mjgi//kuoYQtAaGQu8VUz5YQQtmd7ANcD5BbsDvwFaAt3C+B4turO7bwUmh/sWuAEY6+45wK+Aj4HGYSzPFBPuC8Dt7l4f6AF8UtK/TaQwJQhJFu+b2R5gGjAFeBwo+PWdcZR9bwLedPc84HVghJlVP0r5Ce6+Oyx/oZmlFinzhLvvcfcNwKdAHwB3X+XuE939iLtvJ2jhnFFMPWMIkgJmVhUYAbwSbssB2gIt3f2wu08r5j1ygO5m1sDdd7v73BL+XSLfoAQhyeJyd2/k7m3d/U53PwTsDLe1KG6nsMVxJvBauGocUAu4uJjytYGrC8q7+5fABuBbRYpuLbR8EKgX7p9qZmPD7p69wKtASj
HhjSP4cu8AnAtkufvMcNuPCFojM81ssZndWsx7XEnQolofdkmdUkw5kf+hBCHJbDmwkeBLsjg3Evwd/NPMtgJrCBJEcd1Mw4EGwF/MbGu4T6sSyhf1G8CBXu7egKCFYJEKuvthgi6s68M4Xym0bau7f9fdWwK3h/F0ivAes9z9MiAVeD98P5GoKEFI0vJgLPv7gZ+Z2S1m1sDMqpjZYDMbFRb7NvALgi6ggseVwMWFTxAXchPwItCzUPnTgD5m1jOKsOoTnO/YY2atgAeOUv5l4GbgUoLWBgBmdrWZpYUvdxMknbzCO5pZDTO73swahuct9hYtI1ISJQhJau7+DnAtcCuwBcgEHgPGmdkgoB3wbPiLvOAxHlhF0Of/tfAL/WzgT0XKzwH+TXSXyP6C4FLcLOAD4L2jxP85kA/Mdfd1hTYNAGaY2X5gPHCvu6+N8BY3AuvC7qw7CM9piETDNGGQSPlmZp8Ar7v784mORSoXJQiRcszMBgATgdbuvi/R8Ujloi4mkXLKzMYA/wHuU3KQRFALQkREIlILQkREIqqW6ADKIiUlxdu1a5foMEREKpQ5c+bscPdmRytXoRNEu3btmD17dqLDEBGpUMxsfTTl1MUkIiIRKUGIiEhEShAiIhKREoSIiESkBCEiIhEpQYiISERKECIiEpEShIiIRKQEISIiESlBiIhIREoQIiISkRKEiIhEpAQhIiIRKUGIiEhEShAiIhJRzBKEmb1oZtvMbFGhdU3MbKKZrQyfG4frzcyeNrNVZrbAzPrFKi4REYlOLFsQLwEXFFn3EDDJ3dOBSeFrgAuB9PAxEnguhnGJiEgUYpYg3H0qsKvI6suAMeHyGODyQutf9sB0oJGZtYhVbCIicnTxPgfR3N0zAMLn1HB9K2BjoXKbwnX/w8xGmtlsM5u9ffv2mAYrIlKZlZeT1BZhnUcq6O6j3L2/u/dv1uyoc26LiCSVnLx8Zq7dxba9h2NeV7WY1/BNmWbWwt0zwi6kbeH6TUDrQuXSgC1xjk1EpNxxd9bsOMBnK7YzbdUOvly9kwPZefx8WHduHdw+pnXHO0GMB24CngifxxVaf7eZjQUGAlkFXVEiIpXNrgPZfL5qB9NW7uCzldvZkhW0Fto0qcPlfVsxJD2FUzqmxDyOmCUIM3sDGAqkmNkm4BGCxPCWmd0GbACuDot/CFwErAIOArfEKi4RkfLmSG4ec9bvDhPCDhZtycId6teqxmkdU7jrrBSGdGpGm6Z14hpXzBKEu48oZtPZEco6cFesYhERKU/cnZXb9vNZ2EKYsWYXh3LyqFbF6NumEf93TmcGp6fQq1VDqlVN3KnieHcxiYhUSlmHcvh81Q4mL9/G1BU72BqeZO6QUpdr+qcxOL0Zgzo0oX6t6gmO9L+UIEREYiA/31mSsZcpK7Yzefk25m7YQ16+U79WNYakp3B6ejMGp6eQ1ji+3UbHQglCROQ42XMwm89W7mDy8u1MWbGdHfuPANCjVQO+d0ZHzujSjL6tGyW02+hYKEGIiJRSfr6zcHPW162EeRv3kO/QqE51hqQ3Y2jnZgzpnEJq/VqJDrVUlCBERI5B1sEcJq/YxuTl25m6Yjs7D2RjBr1aNeTus9IZ2qUZvdMaUbVKpPt/KxYlCBGRo9i46yATl2QycUkmM9ftIi/faVK3BqenpzC0SypD0lNoWq9mosM87pQgRESKcHcWbd7LxCVb+XhJJsu27gMgPbUet5/egXO7N6d3WiOqJEEroSRKECIiQHZuPtPX7GTikkz+szSTjKzDVDHo37YJP72oG+d2b067lLqJDjOulCBEpNLKOpTD5OXbmLgkkynLt7PvSC61q1dlSHoK95/bmbO6piZl11G0lCBEpFLJ3HuYjxZv5ePFmUxfs5PcfCelXg0u6tmCc7s3Z3B6CrWqV010mOWCEoSIJL3MvYeZsDCDDxduZdb6XbhDh2Z1uW1Ie87r3pw+rRsnxVVHx5sShIgkpUhJoUvz+t
x3dmcu7nUCnVLrJzrEck8JQkSSRkFS+GBhBrPX71ZSKCMlCBGp0JQUYkcJQkQqHCWF+FCCEJEK4WB2Lh8t3sq7czbz+eodSgpxoAQhIuVWfr4zY+0u3p27iQkLMziQnUda49p8/8xOXNqnpZJCjClBiEi5s27HAd6bu4n3vtrMpt2HqFezGhf3asGV/dIY0K5J0g9xUV4oQYhIuZB1KIcPFmTw7txNzFm/GzMY3CmFH57XhfNPPIHaNXTzWrwpQYhIwuTm5fPZyh28O3cTHy/JJDs3n06p9Xjwgq4M79uKExpWzHkUkoUShIjE3fKt+3hnzkben7eF7fuO0KhOdUYMaM0V/dLoldYQM3UhlQdKECISF4dz8vhgQQavzVjP3A17qFbFOLNrKlf2S+OsrqnUqFYxpuGsTJQgRCSmVm/fz+szNvDOnE1kHcqhQ7O6PHxxN4b3bVWpR0qtCJQgROS4y87NZ+KSTF6bsZ4vVu+kelXj/BNP4PqBbRnUoYm6kCoIJQgROW427T7I2JkbGTtrIzv2H6FVo9o8cH4Xrunfmmb11VqoaJQgRKRM8vKdycu38dqMDXy6fBsGnNU1lesHtuX0zs00jHYFpgQhIqWybd9h3pq1kTdmbmTznkOk1q/J98/sxLUnt6FVo9qJDk+OAyUIEYmauzN7/W5Gf76WjxdnkpvvDO6UwsMXd+Oc7s2pXlVXIiUTJQgROaqcvHwmLNrKC5+tYf6mLBrWrs6tg9sz4uQ2tE+pm+jwJEaUIESkWFmHchg7cwMvfbGOjKzDdEipy2OX9+DKfmka+qISUIIQkf+xfucBRn++jrdmb+Rgdh6ndmzKY5f34MwuqRoorxIpNkGY2T7Ai9vu7g1KW6mZ/R/wnfD9FwK3AC2AsUATYC5wo7tnl7YOETk27s6sdbt5YdoaPl6SSbUqxiW9W3Lb4Pac2LJhosOTBCg2Qbh7fQAz+yWwFXgFMOB6oNSDsJtZK+AeoLu7HzKzt4DrgIuAp9x9rJn9FbgNeK609YhIdHLy8vlwYQYvTFvLgk1ZNKpTnTuHduTbp7SjeQMNlleZRdPFdL67Dyz0+jkzmwE8WcZ6a5tZDlAHyADOAr4Vbh8DPIoShEjM6PyCHE00CSLPzK4n6P5xYASQV9oK3X2zmf0e2AAcAj4G5gB73D03LLYJaBVpfzMbCYwEaNOmTWnDEKm0Nuw8yIufr/3G+YVfD+/B0M46vyDfFE2C+Bbw5/DhwOf895f+MTOzxsBlQHtgD/A2cGGEohHPf7j7KGAUQP/+/Ys9RyIi3zR/4x5GTV3DhEUZVNX5BYnCUROEu68j+EI/Xs4B1rr7dgAzew84FWhkZtXCVkQasOU41ilSKeXnO5NXbONvU9YwY+0u6teqxsjTO3Lzqe00GY8c1VEThJl1JjgX0Nzde5hZL+BSd3+slHVuAAaZWR2CLqazgdnAp8BVBF1ZNwHjSvn+IpXekdw8xs3bwt+nrmHltv20bFiLhy/uxrUDWlO/VvVEhycVRDRdTH8HHgD+BuDuC8zsdaBUCcLdZ5jZOwSXsuYCXxF0GX0AjDWzx8J1L5Tm/UUqs6xDObw+YwOjP1/Ltn1H6NaiAU9d25thvVpqGAw5ZtEkiDruPrPI+O25xRWOhrs/AjxSZPUa4OSyvK9IZbV5zyFGT1vLGzM3cCA7jyHpKfzhmt4M7pSiuRek1KJJEDvMrCPhSWMzu4rgslQRSbAlW/Yyaupq/rUgAwcu6dWC757eQSee5biIJkHcRdAF1NXMNgNrgRtiGpWIFMvdmbZqB6OmruGzlTuoW6MqN53ajlsHt9cw23JcRXMV0xrgHDOrC1Rx932xD0tEisrPdz5eksmzn65i4eYsUuvX5MELuvKtgW1oWFsnnuX4i+YqpprAlUA7oFpBf6a7/zKmkYkIALl5+XywMINnP13Fisz9tGtahyeu6Mnwfq2oWU13PEvsRNPFNA7IIrjb+UhswxGRAtm5+bw3dxPPTV
nN+p0H6dy8Hn++rg8X92xBNV2RJHEQTYJIc/cLYh6JiABwKDuPsbM2MGrqGjKyDtMrrSF/u/Ekzu3WXENhSFxFkyC+MLOe7r4w5tGIVGL7Dufw6vQNvDBtDTv2Z3NyuyY8cWUvTk/XpaqSGNEkiMHAzWa2lqCLyQB3914xjUykkth9IJvRX6zjpc/XsvdwLqd3bsbdZ3bi5PZNEh2aVHLRJIhIA+mJSBlt23eY5z9by6vT13MwO4/zujfn7rM60SutUaJDEwFKnlGugbvvBXRZq8hxlJF1iOcmr2bsrI3k5uVzSe+W3Dm0E11OKPU8XCIxUVIL4nVgGMHVS07QtVTAgQ4xjEsk6ezYf4TnJq/mlenryc93ruyXxveGdqRdSt1EhyYSUUlTjg4Ln9vHLxyR5JN1MIdRn61m9OfrOJyTxxX90rj37HRaN6mT6NBEShTNOYiCSX7Sga8HkHf3qbEKSiQZ7D+Sy+hpaxn12Rr2Hc5lWK8W3HdOZzql1kt0aCJRieZO6u8A9xJM4jMPGAR8STCHtIgUcTgnj1enr+cvk1ez60A253RL5f5zu9C9ZYNEhyZyTKJpQdwLDACmu/uZZtYV+EVswxKpeLJz83lr9kae+WQlmXuPMCQ9hfvP7UzfNo0THZpIqUSTIA67+2Ezw8xquvsyM+sS88hEKoi8fOcfX23mz5NWsHHXIfq3bcyfr+vLoA5NEx2aSJlEkyA2mVkj4H1gopntRvNFi5Cf73y4KIOnJq5g9fYD9GzVkF/d0oMzOjfTnc+SFKIZ7nt4uPiomX0KNAT+HdOoRMoxd+eTZdv4/ccrWJqxl/TUevz1hn6cf+IJSgySVEq6US7Sff4F4zHVA3bFJCKRcmzO+t385sOlzF6/m7ZN6/Cna/twSe+WVNUgepKESmpBRLpBroBulJNKZfX2/Tz572V8tDiTZvVr8uvhPbimf2uqa9htSWIl3SinG+Sk0tu29zB/mrSSN2dtpFa1Ktx/bme+M6Q9dWpEdQuRSIUW7Y1yVxCM6urAZ+7+fkyjEkmw/UdyGTVlNX//bC05efncMLAN3z87nZR6NRMdmkjcRHOj3F+ATsAb4ao7zOxcd78rppGJJEB2bj5vzNzA05NWsvNANhf3asED53XReElSKUXTgjgD6OHuDmBmY/jvyWqRpODufLAwg999tJz1Ow8yqEMTXrywG71ba+htqbyiSRDLgTbA+vB1a2BBzCISibMvV+/kiQlLmb8piy7N6zP65gEM7aJ7GUSiSRBNgaVmNjN8PQCYbmbjAdz90lgFJxJLy7bu5bcTlvHp8u20aFiL313Viyv6pemSVZFQNAni5zGPQiSOMvce5ncfLefduZuoX7MaD13YlZtPbUet6lUTHZpIuRJNgtju7ksKrzCzoe4+OTYhicRGdm4+oz9fy9OTVpKT53xncHvuOrMTjerUSHRoIuVSNAniLTN7GfgdwXwQTwL9gVNiGZjI8TR1xXYe/edi1mw/wDndUvnZsO60baork0RKEk2CGAj8FvgCqA+8BpwWy6BEjpeNuw7y2AdL+GhxJu2a1mH0zQM4s2tqosMSqRCiSRA5wCGgNkELYq2755el0nB02OeBHgQ3391KcLXUm0A7YB1wjbvvLks9Unkdzsnjr1NW89zk1VQx44Hzu/CdIe2pWU3nGUSiFc1AMrMIEsQAgrupR5jZO2Ws98/Av929K9AbWAo8BExy93RgUvha5Ji4Ox8v3so5f5zCn/6zknO6N2fSD87grjM7KTmIHKNoWhC3ufvscHkrcJmZ3VjaCs2sAXA6cDOAu2cD2WZ2GTA0LDYGmAw8WNp6pPJZvX0/v/jnEqau2E7n5vV4/bsDObVjSqLDEqmwShru+yx3/8TdZ5tZe3dfW2jzgTLU2QHYDow2s94Eo8beCzR39wwAd88ws4gdxWY2EhgJ0KZNmzKEIcniwJFcnvlkFS9MW0OtalX5+bDu3HhKW420KlJGFo6g8b8bzOa6e7
+iy5FeH1OFZv2B6cBp7j7DzP4M7AW+7+6NCpXb7e4lTubbv39/nz17dklFJIm5O+Pnb+HxD5eSufcIV5+Uxo8u6Eqz+hpQT6QkZjbH3fsfrVxJXUxWzHKk18diE7DJ3WeEr98hON+QaWYtwtZDC2BbGeqQJLc0Yy+PjF/MzLW76NmqIc/dcBL92pT4e0JEjlFJCcKLWY70OmruvtXMNppZF3dfDpwNLAkfNwFPhM/jSluHJK9D2Xk89Z8VvDBtLQ1qVePx4T25dkBrDY8hEgMlJYgO4XhLVmiZ8HVZJxP6PvCamdUA1gC3EFxR9ZaZ3QZsAK4uYx2SZKat3MFP/rGQDbsOMuLk1jx4QVfdBS0SQyUliMsKLf++yLair4+Ju88juBu7qLPL8r6SnPYczObXHyzl7TmbaJ9Sl7EjBzGoQ9NEhyWS9EqacnRKPAMRKapgjoZHxy9m98Ec7hzakXvOTtegeiJxool1pVzKyDrEz95fxH+WbqNnq4a8fOtAurdskOiwRCoVJQgpV/LznddmbuC3E5aRm5/Pwxd34+ZT21FN9zSIxF3UCcLM6rp7WW6QEynRqm37eejdBcxev5vBnVJ4fHhP2jStk+iwRCqtoyYIMzuVYGC9ekCb8O7n2939zlgHJ5VDdm4+f5uymmc+WUXtGlX5/dW9ubJfK035KZJg0bQgngLOBwqmGJ1vZqfHNCqpNL7asJuH3l3I8sx9DOvVgkcuOVF3QouUE1F1Mbn7xiK/5vJiE45UFgeO5PL7j5fz0hfrOKFBLZ7/dn/O6d480WGJSCHRJIiNYTeThze23UMwPLdIqUxbuYMH313A5j2H+PYpbXng/C7Ur1U90WGJSBHRJIg7COZvaEUwjtLHwF2xDEqS04EjuTwxYRmvTF9Ph2Z1eeeOU+jfrkmiwxKRYhw1Qbj7DuD6OMQiSWzWul388O35bNh1kNsGt+eB87vohjeRcu6oF5eb2ZhwitCC143N7MXYhiXJ4nBOHr/+YAnX/O1L8t0Z+91B/GxYdyUHkQogmi6mXu6+p+CFu+82s74xjEmSxPyNe/jB2/NZtW0/1w9sw08u6kbdmro3U6SiiOavtYqZNXb33QBm1iTK/aSSys7N5/99spJnJ6+mWb2ajLn1ZM7o3CzRYYnIMYrmi/4PwBdm9k74+mrg17ELSSqypRl7+cFb81mSsZcr+rXikUtOpGFtXaEkUhFFc5L6ZTObA5xJMBfEFe6+JOaRSYWSm5fP36au4U//WUHD2tUZdeNJnHfiCYkOS0TKINquomXA7oLyZtbG3TfELCqpUFZt288P3p7P/I17uLhnC351eQ+a1NVEPiIVXTRjMX0feATIJLiD2gimHO0V29CkvMvPd0Z/sY4n/72M2jWq8syIvlzSu2WiwxKR4ySaFsS9QBd33xnrYKTi2LjrID98ez4z1u7i7K6p/OaKnqQ2qJXosETkOIpqqA0gK9aBSMXg7rw5ayO//NcSqpjx5FW9uPqkNI28KpKEokkQa4DJZvYBcKRgpbv/MWZRSbm0c/8RHnpvIROXZHJKh6b8/pretGpUO9FhiUiMRJMgNoSPGuFDKqFPl2/jgbcXsPdQDg9f3I1bT2tPlSpqNYgks2guc/1FPAKR8ulQdh6/mbCUl79cT5fm9XnltpPp1kJzQ4tUBtFcxdQM+BFwIvD1WUh3PyuGcUk5sGhzFve9OY9V2/ZrgD2RSiiaLqbXgDeBYQRDf98EbI9lUJJYefnOqKlr+OPE5TSpW4NXbxvI4PSURIclInEWTYJo6u4vmNm97j4FmGJmU2IdmCTGpt0Huf+t+cxcu4uLep7A48N70qiOTj2JVEbRJIic8DnDzC4GtgBpsQtJEsHdGTdvCz97fxEO/OHq3lzRr5UuXxWpxKJJEI+ZWUPgB8AzQAPg/2IalcRV1sEcHh63iH/O30L/to156to+tG5SJ9FhiUiCRXMV07/CxSyCAfskiXyxegc/eGs+2/cd4YHzu3
NILN0Jhow4Fm+7e164fB5wqZn9MHxdC2hD0Fr6qZmlAe958UNOi0RFCUKSStg99BLBSJ47CPr4zYIpIE+J4hd1aZvUBoxx9x8Xiac98ENggLvvDlsDtaKou2iZA0XqutLdlxcps9TMZgAXAx+Z2Xfc/ZNj/HeIfE3nICSpuPs8d+9DOH0r8AlwfngyOVJy+IJg1E+A64Fppax6EnCVmaXC13NQtwUaEHy5Z5lZc4KhyQvsI5hqtkCmmXUzsyoEI5YW5yPg+4VGqe0bPncA1rj70wQD2vUq5b9FBFCCkCRkZs2A3R7MHdDV3UvqYroHuCU8qX0jcG9p6gzreJhgtrAFBNOFtnD3+cBXBKN3vgh8Xmi3UcCEgpPUBHNV/4sgqZU0ZPqvgOrAAjNbFL4GuBZYFLaWugIvl+bfIlJAJ6lFRCQitSBERCQiJQgREYlICUJERCJSghARkYiUIEREJCIlCBERiUgJQkREIvr/IA9j1t/vDq8AAAAASUVORK5CYII=\n", 235 | "text/plain": [ 236 | "" 237 | ] 238 | }, 239 | "metadata": { 240 | "needs_background": "light" 241 | }, 242 | "output_type": "display_data" 243 | } 244 | ], 245 | "source": [ 246 | "plt.ylabel('% Variance Explained')\n", 247 | "plt.xlabel('# of Features')\n", 248 | "plt.title('PCA Analysis')\n", 249 | "plt.ylim(0,110)\n", 250 | "plt.style.context('seaborn-whitegrid')\n", 251 | "\n", 252 | "\n", 253 | "plt.plot(var)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 137, 259 | "metadata": {}, 260 | "outputs": [ 261 | { 262 | "name": "stdout", 263 | "output_type": "stream", 264 | "text": [ 265 | "Combined space has 14 features\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "pca = PCA(n_components=13)\n", 271 | "\n", 272 | "# Maybe some original features where good, too?\n", 273 | "selection = SelectKBest(k=1)\n", 274 | "\n", 275 | "# Build estimator from PCA and Univariate selection:\n", 276 | "\n", 277 | "combined_features = FeatureUnion([(\"pca\", pca), (\"univ_select\", selection)])\n", 278 | "\n", 279 | "\n", 280 | "\n", 281 | "# Use combined features to transform dataset:\n", 282 | "X_features = combined_features.fit(X, y).transform(X)\n", 283 | "print(\"Combined space has\", X_features.shape[1], \"features\")" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 138, 289 | "metadata": {}, 290 | "outputs": [ 291 | { 292 | "name": "stdout", 293 | "output_type": "stream", 294 | "text": [ 295 | 
"(800, 14) (800,)\n", 296 | "(200, 14) (200,)\n" 297 | ] 298 | } 299 | ], 300 | "source": [ 301 | "X_train, X_test, y_train, y_test = train_test_split(X_features, y, test_size=0.2)\n", 302 | "print (X_train.shape, y_train.shape)\n", 303 | "print (X_test.shape, y_test.shape)" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 139, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 315 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 316 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 317 | " verbose=0, warm_start=False)" 318 | ] 319 | }, 320 | "execution_count": 139, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "logreg = LogisticRegression()\n", 327 | "logreg.fit(X_train, y_train)" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 140, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "name": "stdout", 337 | "output_type": "stream", 338 | "text": [ 339 | "Accuracy of logistic regression classifier on test set: 0.80\n" 340 | ] 341 | } 342 | ], 343 | "source": [ 344 | "y_pred = logreg.predict(X_test)\n", 345 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" 346 | ] 347 | } 348 | ], 349 | "metadata": { 350 | "kernelspec": { 351 | "display_name": "Python 3", 352 | "language": "python", 353 | "name": "python3" 354 | }, 355 | "language_info": { 356 | "codemirror_mode": { 357 | "name": "ipython", 358 | "version": 3 359 | }, 360 | "file_extension": ".py", 361 | "mimetype": "text/x-python", 362 | "name": "python", 363 | "nbconvert_exporter": "python", 364 | "pygments_lexer": "ipython3", 365 | "version": "3.6.0" 366 | } 367 | }, 368 | "nbformat": 4, 369 | "nbformat_minor": 2 370 | } 371 | 
-------------------------------------------------------------------------------- /Section 2/2.2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 10, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import seaborn as sns\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "from sklearn.ensemble import RandomForestClassifier\n", 13 | "import numpy as np\n", 14 | "from sklearn.model_selection import train_test_split\n", 15 | "from scipy.io import arff\n", 16 | "from sklearn.metrics import roc_auc_score\n", 17 | "from sklearn.metrics import roc_curve, auc\n", 18 | "import scikitplot as skplt\n", 19 | "from sklearn.decomposition import PCA\n", 20 | "from sklearn.feature_selection import SelectKBest\n", 21 | "from sklearn.pipeline import Pipeline, FeatureUnion, make_union\n", 22 | "from sklearn.linear_model import LogisticRegression\n", 23 | "from sklearn.base import BaseEstimator, TransformerMixin\n", 24 | "from sklearn.preprocessing import scale\n", 25 | "from sklearn.preprocessing import LabelBinarizer # one hot encoding" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "## Dealing with categorical features\n", 33 | " - Label encoding\n", 34 | " - One Hot encoding" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# load the raw data\n", 44 | "\n", 45 | "df = pd.read_csv('german_credit_raw.csv')" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/html": [ 56 | "

	default	account_check_status	duration_in_month	credit_history	purpose	credit_amount	savings	present_emp_since	installment_as_income_perc	personal_status_sex	...	present_res_since	property	age	other_installment_plans	housing	credits_this_bank	job	people_under_maintenance	telephone	foreign_worker
0	0	< 0 DM	6	critical account/ other credits existing (not ...	domestic appliances	1169	unknown/ no savings account	.. >= 7 years	4	male : single	...	4	real estate	67	none	own	2	skilled employee / official	1	yes, registered under the customers name	yes
1	1	0 <= ... < 200 DM	48	existing credits paid back duly till now	domestic appliances	5951	... < 100 DM	1 <= ... < 4 years	2	female : divorced/separated/married	...	2	real estate	22	none	own	1	skilled employee / official	1	none	yes
2	0	no checking account	12	critical account/ other credits existing (not ...	(vacation - does not exist?)	2096	... < 100 DM	4 <= ... < 7 years	2	male : single	...	3	real estate	49	none	own	1	unskilled - resident	2	none	yes
3	0	< 0 DM	42	existing credits paid back duly till now	radio/television	7882	... < 100 DM	4 <= ... < 7 years	2	male : single	...	4	if not A121 : building society savings agreeme...	45	none	for free	1	skilled employee / official	2	none	yes
4	1	< 0 DM	24	delay in paying off in the past	car (new)	4870	... < 100 DM	1 <= ... < 4 years	3	male : single	...	4	unknown / no property	53	none	for free	2	skilled employee / official	2	none	yes
5 rows × 21 columns
" 222 | ], 223 | "text/plain": [ 224 | " default account_check_status duration_in_month \\\n", 225 | "0 0 < 0 DM 6 \n", 226 | "1 1 0 <= ... < 200 DM 48 \n", 227 | "2 0 no checking account 12 \n", 228 | "3 0 < 0 DM 42 \n", 229 | "4 1 < 0 DM 24 \n", 230 | "\n", 231 | " credit_history \\\n", 232 | "0 critical account/ other credits existing (not ... \n", 233 | "1 existing credits paid back duly till now \n", 234 | "2 critical account/ other credits existing (not ... \n", 235 | "3 existing credits paid back duly till now \n", 236 | "4 delay in paying off in the past \n", 237 | "\n", 238 | " purpose credit_amount savings \\\n", 239 | "0 domestic appliances 1169 unknown/ no savings account \n", 240 | "1 domestic appliances 5951 ... < 100 DM \n", 241 | "2 (vacation - does not exist?) 2096 ... < 100 DM \n", 242 | "3 radio/television 7882 ... < 100 DM \n", 243 | "4 car (new) 4870 ... < 100 DM \n", 244 | "\n", 245 | " present_emp_since installment_as_income_perc \\\n", 246 | "0 .. >= 7 years 4 \n", 247 | "1 1 <= ... < 4 years 2 \n", 248 | "2 4 <= ... < 7 years 2 \n", 249 | "3 4 <= ... < 7 years 2 \n", 250 | "4 1 <= ... < 4 years 3 \n", 251 | "\n", 252 | " personal_status_sex ... present_res_since \\\n", 253 | "0 male : single ... 4 \n", 254 | "1 female : divorced/separated/married ... 2 \n", 255 | "2 male : single ... 3 \n", 256 | "3 male : single ... 4 \n", 257 | "4 male : single ... 4 \n", 258 | "\n", 259 | " property age \\\n", 260 | "0 real estate 67 \n", 261 | "1 real estate 22 \n", 262 | "2 real estate 49 \n", 263 | "3 if not A121 : building society savings agreeme... 
45 \n", 264 | "4 unknown / no property 53 \n", 265 | "\n", 266 | " other_installment_plans housing credits_this_bank \\\n", 267 | "0 none own 2 \n", 268 | "1 none own 1 \n", 269 | "2 none own 1 \n", 270 | "3 none for free 1 \n", 271 | "4 none for free 2 \n", 272 | "\n", 273 | " job people_under_maintenance \\\n", 274 | "0 skilled employee / official 1 \n", 275 | "1 skilled employee / official 1 \n", 276 | "2 unskilled - resident 2 \n", 277 | "3 skilled employee / official 2 \n", 278 | "4 skilled employee / official 2 \n", 279 | "\n", 280 | " telephone foreign_worker \n", 281 | "0 yes, registered under the customers name yes \n", 282 | "1 none yes \n", 283 | "2 none yes \n", 284 | "3 none yes \n", 285 | "4 none yes \n", 286 | "\n", 287 | "[5 rows x 21 columns]" 288 | ] 289 | }, 290 | "execution_count": 3, 291 | "metadata": {}, 292 | "output_type": "execute_result" 293 | } 294 | ], 295 | "source": [ 296 | "df.head()" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "## let's look at the different types of account status" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 4, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "account_check_status\n", 315 | "0 <= ... 
< 200 DM 269\n", 316 | "< 0 DM 274\n", 317 | ">= 200 DM / salary assignments for at least 1 year 63\n", 318 | "no checking account 394\n", 319 | "dtype: int64" 320 | ] 321 | }, 322 | "execution_count": 4, 323 | "metadata": {}, 324 | "output_type": "execute_result" 325 | } 326 | ], 327 | "source": [ 328 | "df.groupby('account_check_status').size()" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 6, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "## convert the data type to category\n", 338 | "df['account_check_status'] = df['account_check_status'].astype('category')" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## Label encoding" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 7, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "df['account_status_cat'] = df['account_check_status'].cat.codes" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 8, 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "data": { 364 | "text/plain": [ 365 | "account_check_status account_status_cat\n", 366 | "0 <= ... < 200 DM 0 269\n", 367 | "< 0 DM 1 274\n", 368 | ">= 200 DM / salary assignments for at least 1 year 2 63\n", 369 | "no checking account 3 394\n", 370 | "dtype: int64" 371 | ] 372 | }, 373 | "execution_count": 8, 374 | "metadata": {}, 375 | "output_type": "execute_result" 376 | } 377 | ], 378 | "source": [ 379 | "df.groupby(['account_check_status', 'account_status_cat']).size()" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "## One hot encoding" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 11, 392 | "metadata": {}, 393 | "outputs": [ 394 | { 395 | "data": { 396 | "text/html": [ 397 | "

	0 <= ... < 200 DM	< 0 DM	>= 200 DM / salary assignments for at least 1 year	no checking account
0	0	1	0	0
1	1	0	0	0
2	0	0	0	1
3	0	1	0	0
4	0	1	0	0
" 460 | ], 461 | "text/plain": [ 462 | " 0 <= ... < 200 DM < 0 DM \\\n", 463 | "0 0 1 \n", 464 | "1 1 0 \n", 465 | "2 0 0 \n", 466 | "3 0 1 \n", 467 | "4 0 1 \n", 468 | "\n", 469 | " >= 200 DM / salary assignments for at least 1 year no checking account \n", 470 | "0 0 0 \n", 471 | "1 0 0 \n", 472 | "2 0 1 \n", 473 | "3 0 0 \n", 474 | "4 0 0 " 475 | ] 476 | }, 477 | "execution_count": 11, 478 | "metadata": {}, 479 | "output_type": "execute_result" 480 | } 481 | ], 482 | "source": [ 483 | "df_one_hot = df.copy()\n", 484 | "\n", 485 | "lb = LabelBinarizer()\n", 486 | "lb_results = lb.fit_transform(df_one_hot['account_check_status'])\n", 487 | "lb_results_df = pd.DataFrame(lb_results, columns=lb.classes_)\n", 488 | "\n", 489 | "lb_results_df.head()" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 12, 495 | "metadata": {}, 496 | "outputs": [], 497 | "source": [ 498 | "## concatenate this data to our data set\n", 499 | "\n", 500 | "final_df = pd.concat([df_one_hot, lb_results_df], axis=1)" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 15, 506 | "metadata": {}, 507 | "outputs": [ 508 | { 509 | "name": "stdout", 510 | "output_type": "stream", 511 | "text": [ 512 | "original df dimensions: (1000, 22)\n", 513 | "one hot encoded df dimensions: (1000, 26)\n" 514 | ] 515 | }, 516 | { 517 | "data": { 518 | "text/html": [ 519 | "

	default	account_check_status	duration_in_month	credit_history	purpose	credit_amount	savings	present_emp_since	installment_as_income_perc	personal_status_sex	...	credits_this_bank	job	people_under_maintenance	telephone	foreign_worker	account_status_cat	0 <= ... < 200 DM	< 0 DM	>= 200 DM / salary assignments for at least 1 year	no checking account
0	0	< 0 DM	6	critical account/ other credits existing (not ...	domestic appliances	1169	unknown/ no savings account	.. >= 7 years	4	male : single	...	2	skilled employee / official	1	yes, registered under the customers name	yes	1	0	1	0	0
1	1	0 <= ... < 200 DM	48	existing credits paid back duly till now	domestic appliances	5951	... < 100 DM	1 <= ... < 4 years	2	female : divorced/separated/married	...	1	skilled employee / official	1	none	yes	0	1	0	0	0
2	0	no checking account	12	critical account/ other credits existing (not ...	(vacation - does not exist?)	2096	... < 100 DM	4 <= ... < 7 years	2	male : single	...	1	unskilled - resident	2	none	yes	3	0	0	0	1
3	0	< 0 DM	42	existing credits paid back duly till now	radio/television	7882	... < 100 DM	4 <= ... < 7 years	2	male : single	...	1	skilled employee / official	2	none	yes	1	0	1	0	0
4	1	< 0 DM	24	delay in paying off in the past	car (new)	4870	... < 100 DM	1 <= ... < 4 years	3	male : single	...	2	skilled employee / official	2	none	yes	1	0	1	0	0
5 rows × 26 columns
" 685 | ], 686 | "text/plain": [ 687 | " default account_check_status duration_in_month \\\n", 688 | "0 0 < 0 DM 6 \n", 689 | "1 1 0 <= ... < 200 DM 48 \n", 690 | "2 0 no checking account 12 \n", 691 | "3 0 < 0 DM 42 \n", 692 | "4 1 < 0 DM 24 \n", 693 | "\n", 694 | " credit_history \\\n", 695 | "0 critical account/ other credits existing (not ... \n", 696 | "1 existing credits paid back duly till now \n", 697 | "2 critical account/ other credits existing (not ... \n", 698 | "3 existing credits paid back duly till now \n", 699 | "4 delay in paying off in the past \n", 700 | "\n", 701 | " purpose credit_amount savings \\\n", 702 | "0 domestic appliances 1169 unknown/ no savings account \n", 703 | "1 domestic appliances 5951 ... < 100 DM \n", 704 | "2 (vacation - does not exist?) 2096 ... < 100 DM \n", 705 | "3 radio/television 7882 ... < 100 DM \n", 706 | "4 car (new) 4870 ... < 100 DM \n", 707 | "\n", 708 | " present_emp_since installment_as_income_perc \\\n", 709 | "0 .. >= 7 years 4 \n", 710 | "1 1 <= ... < 4 years 2 \n", 711 | "2 4 <= ... < 7 years 2 \n", 712 | "3 4 <= ... < 7 years 2 \n", 713 | "4 1 <= ... < 4 years 3 \n", 714 | "\n", 715 | " personal_status_sex ... credits_this_bank \\\n", 716 | "0 male : single ... 2 \n", 717 | "1 female : divorced/separated/married ... 1 \n", 718 | "2 male : single ... 1 \n", 719 | "3 male : single ... 1 \n", 720 | "4 male : single ... 2 \n", 721 | "\n", 722 | " job people_under_maintenance \\\n", 723 | "0 skilled employee / official 1 \n", 724 | "1 skilled employee / official 1 \n", 725 | "2 unskilled - resident 2 \n", 726 | "3 skilled employee / official 2 \n", 727 | "4 skilled employee / official 2 \n", 728 | "\n", 729 | " telephone foreign_worker \\\n", 730 | "0 yes, registered under the customers name yes \n", 731 | "1 none yes \n", 732 | "2 none yes \n", 733 | "3 none yes \n", 734 | "4 none yes \n", 735 | "\n", 736 | " account_status_cat 0 <= ... 
< 200 DM < 0 DM \\\n", 737 | "0 1 0 1 \n", 738 | "1 0 1 0 \n", 739 | "2 3 0 0 \n", 740 | "3 1 0 1 \n", 741 | "4 1 0 1 \n", 742 | "\n", 743 | " >= 200 DM / salary assignments for at least 1 year no checking account \n", 744 | "0 0 0 \n", 745 | "1 0 0 \n", 746 | "2 0 1 \n", 747 | "3 0 0 \n", 748 | "4 0 0 \n", 749 | "\n", 750 | "[5 rows x 26 columns]" 751 | ] 752 | }, 753 | "execution_count": 15, 754 | "metadata": {}, 755 | "output_type": "execute_result" 756 | } 757 | ], 758 | "source": [ 759 | "print('original df dimensions:', df.shape)\n", 760 | "print('one hot encoded df dimensions:', final_df.shape)\n", 761 | "final_df.head()" 762 | ] 763 | } 764 | ], 765 | "metadata": { 766 | "kernelspec": { 767 | "display_name": "Python 3", 768 | "language": "python", 769 | "name": "python3" 770 | }, 771 | "language_info": { 772 | "codemirror_mode": { 773 | "name": "ipython", 774 | "version": 3 775 | }, 776 | "file_extension": ".py", 777 | "mimetype": "text/x-python", 778 | "name": "python", 779 | "nbconvert_exporter": "python", 780 | "pygments_lexer": "ipython3", 781 | "version": "3.6.0" 782 | } 783 | }, 784 | "nbformat": 4, 785 | "nbformat_minor": 2 786 | } 787 | -------------------------------------------------------------------------------- /Section 2/2.3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 8, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import seaborn as sns\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "from sklearn.ensemble import RandomForestClassifier\n", 13 | "import numpy as np\n", 14 | "from sklearn.model_selection import train_test_split\n", 15 | "from scipy.io import arff\n", 16 | "from sklearn.metrics import roc_auc_score\n", 17 | "from sklearn.metrics import roc_curve, auc\n", 18 | "import scikitplot as skplt\n", 19 | "from sklearn.decomposition import PCA\n", 20 | "from 
sklearn.feature_selection import SelectKBest\n", 21 | "from sklearn.pipeline import Pipeline, FeatureUnion, make_union\n", 22 | "from sklearn.linear_model import LogisticRegression\n", 23 | "from sklearn.base import BaseEstimator, TransformerMixin\n", 24 | "from sklearn.preprocessing import scale\n", 25 | "from sklearn.preprocessing import LabelBinarizer # one hot encoding\n", 26 | "from sklearn.preprocessing import PolynomialFeatures # add polynomial features" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## load our cleaned german credit dataset" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 4, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "df = pd.read_csv('german_credit.csv')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 5, 48 | "metadata": { 49 | "scrolled": false 50 | }, 51 | "outputs": [ 52 | { 53 | "data": { 54 | "text/plain": [ 55 | "Creditability int64\n", 56 | "Account Balance int64\n", 57 | "Duration of Credit (month) int64\n", 58 | "Payment Status of Previous Credit int64\n", 59 | "Purpose int64\n", 60 | "Credit Amount int64\n", 61 | "Value Savings/Stocks int64\n", 62 | "Length of current employment int64\n", 63 | "Instalment per cent int64\n", 64 | "Sex & Marital Status int64\n", 65 | "Guarantors int64\n", 66 | "Duration in Current address int64\n", 67 | "Most valuable available asset int64\n", 68 | "Age (years) int64\n", 69 | "Concurrent Credits int64\n", 70 | "Type of apartment int64\n", 71 | "No of Credits at this Bank int64\n", 72 | "Occupation int64\n", 73 | "No of dependents int64\n", 74 | "Telephone int64\n", 75 | "Foreign Worker int64\n", 76 | "dtype: object" 77 | ] 78 | }, 79 | "execution_count": 5, 80 | "metadata": {}, 81 | "output_type": "execute_result" 82 | } 83 | ], 84 | "source": [ 85 | "df.dtypes" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 7, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 
94 | "data": { 95 | "text/html": [ 96 | "

	Age (years)	Duration of Credit (month)	Credit Amount
count	1000.00000	1000.000000	1000.00000
mean	35.54200	20.903000	3271.24800
std	11.35267	12.058814	2822.75176
min	19.00000	4.000000	250.00000
25%	27.00000	12.000000	1365.50000
50%	33.00000	18.000000	2319.50000
75%	42.00000	24.000000	3972.25000
max	75.00000	72.000000	18424.00000
" 171 | ], 172 | "text/plain": [ 173 | " Age (years) Duration of Credit (month) Credit Amount\n", 174 | "count 1000.00000 1000.000000 1000.00000\n", 175 | "mean 35.54200 20.903000 3271.24800\n", 176 | "std 11.35267 12.058814 2822.75176\n", 177 | "min 19.00000 4.000000 250.00000\n", 178 | "25% 27.00000 12.000000 1365.50000\n", 179 | "50% 33.00000 18.000000 2319.50000\n", 180 | "75% 42.00000 24.000000 3972.25000\n", 181 | "max 75.00000 72.000000 18424.00000" 182 | ] 183 | }, 184 | "execution_count": 7, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "## summary statistics of our continuous data\n", 191 | "\n", 192 | "df[['Age (years)', 'Duration of Credit (month)', 'Credit Amount']].describe()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "## merging and expanding numerical data types\n", 200 | " - creating polynomial features" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 13, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "pf = PolynomialFeatures(degree=2, interaction_only=False, \n", 210 | " include_bias=False)\n", 211 | "result = pf.fit_transform(df[['Age (years)', 'Duration of Credit (month)', 'Credit Amount']])" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 14, 217 | "metadata": {}, 218 | "outputs": [ 219 | { 220 | "data": { 221 | "text/html": [ 222 | "

\n", 223 | "\n", 236 | "\n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | "

	0	1	2	3	4	5	6	7	8
0	21.0	18.0	1049.0	441.0	378.0	22029.0	324.0	18882.0	1100401.0
1	36.0	9.0	2799.0	1296.0	324.0	100764.0	81.0	25191.0	7834401.0
2	23.0	12.0	841.0	529.0	276.0	19343.0	144.0	10092.0	707281.0
3	39.0	12.0	2122.0	1521.0	468.0	82758.0	144.0	25464.0	4502884.0
4	38.0	12.0	2171.0	1444.0	456.0	82498.0	144.0	26052.0	4713241.0


" 315 | ], 316 | "text/plain": [ 317 | " 0 1 2 3 4 5 6 7 8\n", 318 | "0 21.0 18.0 1049.0 441.0 378.0 22029.0 324.0 18882.0 1100401.0\n", 319 | "1 36.0 9.0 2799.0 1296.0 324.0 100764.0 81.0 25191.0 7834401.0\n", 320 | "2 23.0 12.0 841.0 529.0 276.0 19343.0 144.0 10092.0 707281.0\n", 321 | "3 39.0 12.0 2122.0 1521.0 468.0 82758.0 144.0 25464.0 4502884.0\n", 322 | "4 38.0 12.0 2171.0 1444.0 456.0 82498.0 144.0 26052.0 4713241.0" 323 | ] 324 | }, 325 | "execution_count": 14, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "pd.DataFrame(result).head()" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 16, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "result = pd.DataFrame(result)\n", 341 | "result.columns = ['Age (years)', 'Duration of Credit (month)', 'Credit Amount', 'Age^2', 'AgexCreditDuration', \n", 342 | " 'AgexCreditAmount', 'CreditDuration^2', 'CreditDurationxCreditAmount', 'CreditAmount^2' ]" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 34, 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "name": "stdout", 352 | "output_type": "stream", 353 | "text": [ 354 | "Accuracy of logistic regression classifier on test set: 0.74\n" 355 | ] 356 | } 357 | ], 358 | "source": [ 359 | "X, y = df.loc[:, df.columns != 'Creditability'], df['Creditability']\n", 360 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 361 | "logreg = LogisticRegression()\n", 362 | "logreg.fit(X_train, y_train)\n", 363 | "y_pred = logreg.predict(X_test)\n", 364 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 35, 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "new_df = df.drop(['Age (years)', 'Duration of Credit (month)', 'Credit Amount'], axis=1)\n", 374 | "new_df = 
pd.concat([new_df, result[['Age^2', 'AgexCreditDuration', \n", 375 | " 'AgexCreditAmount', 'CreditDuration^2', 'CreditDurationxCreditAmount', 'CreditAmount^2']]], axis=1)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 42, 381 | "metadata": {}, 382 | "outputs": [ 383 | { 384 | "name": "stdout", 385 | "output_type": "stream", 386 | "text": [ 387 | "Accuracy of logistic regression classifier on test set: 0.76\n" 388 | ] 389 | } 390 | ], 391 | "source": [ 392 | "X, y = new_df.loc[:, new_df.columns != 'Creditability'], new_df['Creditability']\n", 393 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 394 | "logreg = LogisticRegression()\n", 395 | "logreg.fit(X_train, y_train)\n", 396 | "y_pred = logreg.predict(X_test)\n", 397 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" 398 | ] 399 | } 400 | ], 401 | "metadata": { 402 | "kernelspec": { 403 | "display_name": "Python 3", 404 | "language": "python", 405 | "name": "python3" 406 | }, 407 | "language_info": { 408 | "codemirror_mode": { 409 | "name": "ipython", 410 | "version": 3 411 | }, 412 | "file_extension": ".py", 413 | "mimetype": "text/x-python", 414 | "name": "python", 415 | "nbconvert_exporter": "python", 416 | "pygments_lexer": "ipython3", 417 | "version": "3.6.0" 418 | } 419 | }, 420 | "nbformat": 4, 421 | "nbformat_minor": 2 422 | } 423 | -------------------------------------------------------------------------------- /Section 3/3.1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import seaborn as sns\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "from sklearn.ensemble import RandomForestClassifier\n", 13 | "import numpy as np\n", 14 | "from sklearn.model_selection import 
train_test_split\n", 15 | "from scipy.io import arff\n", 16 | "from sklearn.metrics import roc_auc_score\n", 17 | "from sklearn.metrics import roc_curve, auc\n", 18 | "import scikitplot as skplt\n", 19 | "from sklearn.decomposition import PCA\n", 20 | "from sklearn.feature_selection import SelectKBest\n", 21 | "from sklearn.pipeline import Pipeline, FeatureUnion, make_union\n", 22 | "from sklearn.linear_model import LogisticRegression\n", 23 | "from sklearn.base import BaseEstimator, TransformerMixin\n", 24 | "from sklearn.preprocessing import scale\n", 25 | "from sklearn.preprocessing import LabelBinarizer # one hot encoding\n", 26 | "from sklearn.preprocessing import PolynomialFeatures # add polynomial features\n", 27 | "from sklearn.metrics import classification_report # classification report" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## dealing with target labels \n", 35 | "- In this case, our target labels are imbalanced: we have a lot more creditable people than not-creditable people because lenders are risk-averse\n", 36 | "- Let's look at ways to handle an imbalanced dataset" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Method 1:\n", 44 | "- don't use accuracy as a metric\n", 45 | "- use other metrics instead:\n", 46 | " - confusion matrix\n", 47 | " - precision\n", 48 | " - recall" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "df = pd.read_csv('german_credit.csv')" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "Accuracy of logistic regression classifier on test set: 0.72\n" 70 | ] 71 | }, 72 | { 73 | "data": { 74 | "text/html": [ 75 | "


Predicted Result	0	1
Actual Result
0	24	38
1	17	121


" 116 | ], 117 | "text/plain": [ 118 | "Predicted Result 0 1\n", 119 | "Actual Result \n", 120 | "0 24 38\n", 121 | "1 17 121" 122 | ] 123 | }, 124 | "execution_count": 3, 125 | "metadata": {}, 126 | "output_type": "execute_result" 127 | } 128 | ], 129 | "source": [ 130 | "X, y = df.loc[:, df.columns != 'Creditability'], df['Creditability']\n", 131 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 132 | "logreg = LogisticRegression()\n", 133 | "logreg.fit(X_train, y_train)\n", 134 | "y_pred = logreg.predict(X_test)\n", 135 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 136 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "## recall = tp / (tp + fn) : ability to find true positives\n", 144 | "## precision = tp / (tp + fp) : ability to correctly label a sample as positive\n" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 7, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | " precision recall f1-score support\n", 157 | "\n", 158 | " 0 0.59 0.39 0.47 62\n", 159 | " 1 0.76 0.88 0.81 138\n", 160 | "\n", 161 | "avg / total 0.71 0.72 0.71 200\n", 162 | "\n" 163 | ] 164 | } 165 | ], 166 | "source": [ 167 | "print(classification_report(y_test, y_pred))" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "## Method 2:\n", 175 | "- resample data or use models that resample data like Random Forests\n", 176 | "\n", 177 | "## Method 3:\n", 178 | "- collect more data (if possible)" 179 | ] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python 3", 185 | "language": "python", 186 | "name": "python3" 187 | }, 188 | "language_info": { 189 | 
"codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.6.0" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 2 203 | } 204 | -------------------------------------------------------------------------------- /Section 3/3.3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 27, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import numpy as np\n", 11 | "from sklearn.model_selection import train_test_split\n", 12 | "from sklearn.linear_model import LogisticRegression" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "## imputing missing values\n", 20 | "- Pima Indians diabetes dataset\n", 21 | " - missing values are marked as 0\n", 22 | "- columns with missing values:\n", 23 | " - Glucose\n", 24 | " - Blood pressure\n", 25 | " - Skin Thickness\n", 26 | " - Insulin\n", 27 | " - BMI" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "df = pd.read_csv('diabetes.csv')" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 9, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/html": [ 47 | "


	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1


" 140 | ], 141 | "text/plain": [ 142 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", 143 | "0 6 148 72 35 0 33.6 \n", 144 | "1 1 85 66 29 0 26.6 \n", 145 | "2 8 183 64 0 0 23.3 \n", 146 | "3 1 89 66 23 94 28.1 \n", 147 | "4 0 137 40 35 168 43.1 \n", 148 | "\n", 149 | " DiabetesPedigreeFunction Age Outcome \n", 150 | "0 0.627 50 1 \n", 151 | "1 0.351 31 0 \n", 152 | "2 0.672 32 1 \n", 153 | "3 0.167 21 0 \n", 154 | "4 2.288 33 1 " 155 | ] 156 | }, 157 | "execution_count": 9, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | } 161 | ], 162 | "source": [ 163 | "df.head()" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 10, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/html": [ 174 | "


	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000


" 303 | ], 304 | "text/plain": [ 305 | " Pregnancies Glucose BloodPressure SkinThickness Insulin \\\n", 306 | "count 768.000000 768.000000 768.000000 768.000000 768.000000 \n", 307 | "mean 3.845052 120.894531 69.105469 20.536458 79.799479 \n", 308 | "std 3.369578 31.972618 19.355807 15.952218 115.244002 \n", 309 | "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 310 | "25% 1.000000 99.000000 62.000000 0.000000 0.000000 \n", 311 | "50% 3.000000 117.000000 72.000000 23.000000 30.500000 \n", 312 | "75% 6.000000 140.250000 80.000000 32.000000 127.250000 \n", 313 | "max 17.000000 199.000000 122.000000 99.000000 846.000000 \n", 314 | "\n", 315 | " BMI DiabetesPedigreeFunction Age Outcome \n", 316 | "count 768.000000 768.000000 768.000000 768.000000 \n", 317 | "mean 31.992578 0.471876 33.240885 0.348958 \n", 318 | "std 7.884160 0.331329 11.760232 0.476951 \n", 319 | "min 0.000000 0.078000 21.000000 0.000000 \n", 320 | "25% 27.300000 0.243750 24.000000 0.000000 \n", 321 | "50% 32.000000 0.372500 29.000000 0.000000 \n", 322 | "75% 36.600000 0.626250 41.000000 1.000000 \n", 323 | "max 67.100000 2.420000 81.000000 1.000000 " 324 | ] 325 | }, 326 | "execution_count": 10, 327 | "metadata": {}, 328 | "output_type": "execute_result" 329 | } 330 | ], 331 | "source": [ 332 | "df.describe()" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "## baseline ML model" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 40, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "Accuracy of logistic regression classifier on test set: 0.73\n" 352 | ] 353 | }, 354 | { 355 | "data": { 356 | "text/html": [ 357 | "


Predicted Result	0	1
Actual Result
0	85	10
1	32	27


" 398 | ], 399 | "text/plain": [ 400 | "Predicted Result 0 1\n", 401 | "Actual Result \n", 402 | "0 85 10\n", 403 | "1 32 27" 404 | ] 405 | }, 406 | "execution_count": 40, 407 | "metadata": {}, 408 | "output_type": "execute_result" 409 | } 410 | ], 411 | "source": [ 412 | "X, y = df.loc[:, df.columns != 'Outcome'], df['Outcome']\n", 413 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 414 | "logreg = LogisticRegression()\n", 415 | "logreg.fit(X_train, y_train)\n", 416 | "y_pred = logreg.predict(X_test)\n", 417 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 418 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "## impute missing values using the mean of the column" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 20, 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "missing_columns = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']]\n", 435 | "missing_columns = missing_columns.replace(0, np.nan)" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 19, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "data": { 445 | "text/html": [ 446 | "


	Glucose	BloodPressure	SkinThickness	Insulin	BMI
0	148.0	72.0	35.0	NaN	33.6
1	85.0	66.0	29.0	NaN	26.6
2	183.0	64.0	NaN	NaN	23.3
3	89.0	66.0	23.0	94.0	28.1
4	137.0	40.0	35.0	168.0	43.1


" 515 | ], 516 | "text/plain": [ 517 | " Glucose BloodPressure SkinThickness Insulin BMI\n", 518 | "0 148.0 72.0 35.0 NaN 33.6\n", 519 | "1 85.0 66.0 29.0 NaN 26.6\n", 520 | "2 183.0 64.0 NaN NaN 23.3\n", 521 | "3 89.0 66.0 23.0 94.0 28.1\n", 522 | "4 137.0 40.0 35.0 168.0 43.1" 523 | ] 524 | }, 525 | "execution_count": 19, 526 | "metadata": {}, 527 | "output_type": "execute_result" 528 | } 529 | ], 530 | "source": [ 531 | "missing_columns.head()" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": 21, 537 | "metadata": {}, 538 | "outputs": [], 539 | "source": [ 540 | "means = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].mean()\n", 541 | "missing_columns = missing_columns.fillna(means)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": 23, 547 | "metadata": {}, 548 | "outputs": [ 549 | { 550 | "data": { 551 | "text/plain": [ 552 | "Glucose 120.894531\n", 553 | "BloodPressure 69.105469\n", 554 | "SkinThickness 20.536458\n", 555 | "Insulin 79.799479\n", 556 | "BMI 31.992578\n", 557 | "dtype: float64" 558 | ] 559 | }, 560 | "execution_count": 23, 561 | "metadata": {}, 562 | "output_type": "execute_result" 563 | } 564 | ], 565 | "source": [ 566 | "means" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 22, 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "data": { 576 | "text/html": [ 577 | "


	Glucose	BloodPressure	SkinThickness	Insulin	BMI
0	148.0	72.0	35.000000	79.799479	33.6
1	85.0	66.0	29.000000	79.799479	26.6
2	183.0	64.0	20.536458	79.799479	23.3
3	89.0	66.0	23.000000	94.000000	28.1
4	137.0	40.0	35.000000	168.000000	43.1


" 646 | ], 647 | "text/plain": [ 648 | " Glucose BloodPressure SkinThickness Insulin BMI\n", 649 | "0 148.0 72.0 35.000000 79.799479 33.6\n", 650 | "1 85.0 66.0 29.000000 79.799479 26.6\n", 651 | "2 183.0 64.0 20.536458 79.799479 23.3\n", 652 | "3 89.0 66.0 23.000000 94.000000 28.1\n", 653 | "4 137.0 40.0 35.000000 168.000000 43.1" 654 | ] 655 | }, 656 | "execution_count": 22, 657 | "metadata": {}, 658 | "output_type": "execute_result" 659 | } 660 | ], 661 | "source": [ 662 | "missing_columns.head()" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 29, 668 | "metadata": {}, 669 | "outputs": [], 670 | "source": [ 671 | "df = df.drop(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI'], axis = 1)\n", 672 | "df = pd.concat([df, missing_columns], axis =1)" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": 35, 678 | "metadata": {}, 679 | "outputs": [ 680 | { 681 | "name": "stdout", 682 | "output_type": "stream", 683 | "text": [ 684 | "Accuracy of logistic regression classifier on test set: 0.77\n" 685 | ] 686 | }, 687 | { 688 | "data": { 689 | "text/html": [ 690 | "


Predicted Result	0	1
Actual Result
0	92	12
1	23	27


" 731 | ], 732 | "text/plain": [ 733 | "Predicted Result 0 1\n", 734 | "Actual Result \n", 735 | "0 92 12\n", 736 | "1 23 27" 737 | ] 738 | }, 739 | "execution_count": 35, 740 | "metadata": {}, 741 | "output_type": "execute_result" 742 | } 743 | ], 744 | "source": [ 745 | "X, y = df.loc[:, df.columns != 'Outcome'], df['Outcome']\n", 746 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 747 | "logreg = LogisticRegression()\n", 748 | "logreg.fit(X_train, y_train)\n", 749 | "y_pred = logreg.predict(X_test)\n", 750 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 751 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 752 | ] 753 | }, 754 | { 755 | "cell_type": "markdown", 756 | "metadata": {}, 757 | "source": [ 758 | "## impute missing values using the mode of the column" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 41, 764 | "metadata": {}, 765 | "outputs": [], 766 | "source": [ 767 | "missing_columns = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']]\n", 768 | "missing_columns = missing_columns.replace(0, np.nan)\n", 769 | "\n", 770 | "modes = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].mode()\n", 771 | "missing_columns = missing_columns.fillna(modes)" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 42, 777 | "metadata": {}, 778 | "outputs": [], 779 | "source": [ 780 | "df = df.drop(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI'], axis = 1)\n", 781 | "df = pd.concat([df, missing_columns], axis =1)" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": 43, 787 | "metadata": {}, 788 | "outputs": [ 789 | { 790 | "name": "stdout", 791 | "output_type": "stream", 792 | "text": [ 793 | "Accuracy of logistic regression classifier on test set: 0.76\n" 794 | ] 795 | }, 796 | { 797 | "data": { 798 | 
"text/html": [ 799 | "


Predicted Result	0	1
Actual Result
0	89	8
1	29	28


" 840 | ], 841 | "text/plain": [ 842 | "Predicted Result 0 1\n", 843 | "Actual Result \n", 844 | "0 89 8\n", 845 | "1 29 28" 846 | ] 847 | }, 848 | "execution_count": 43, 849 | "metadata": {}, 850 | "output_type": "execute_result" 851 | } 852 | ], 853 | "source": [ 854 | "X, y = df.loc[:, df.columns != 'Outcome'], df['Outcome']\n", 855 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 856 | "logreg = LogisticRegression()\n", 857 | "logreg.fit(X_train, y_train)\n", 858 | "y_pred = logreg.predict(X_test)\n", 859 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 860 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "## impute missing values using the median" 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": 44, 873 | "metadata": {}, 874 | "outputs": [], 875 | "source": [ 876 | "missing_columns = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']]\n", 877 | "missing_columns = missing_columns.replace(0, np.nan)\n", 878 | "\n", 879 | "medians = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].median()\n", 880 | "missing_columns = missing_columns.fillna(medians)" 881 | ] 882 | }, 883 | { 884 | "cell_type": "code", 885 | "execution_count": 45, 886 | "metadata": {}, 887 | "outputs": [], 888 | "source": [ 889 | "df = df.drop(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI'], axis = 1)\n", 890 | "df = pd.concat([df, missing_columns], axis =1)" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": 48, 896 | "metadata": {}, 897 | "outputs": [ 898 | { 899 | "name": "stdout", 900 | "output_type": "stream", 901 | "text": [ 902 | "Accuracy of logistic regression classifier on test set: 0.78\n" 903 | ] 904 | }, 905 | { 906 | "data": { 907 | 
"text/html": [ 908 | "


Predicted Result	0	1
Actual Result
0	89	13
1	21	31


" 949 | ], 950 | "text/plain": [ 951 | "Predicted Result 0 1\n", 952 | "Actual Result \n", 953 | "0 89 13\n", 954 | "1 21 31" 955 | ] 956 | }, 957 | "execution_count": 48, 958 | "metadata": {}, 959 | "output_type": "execute_result" 960 | } 961 | ], 962 | "source": [ 963 | "X, y = df.loc[:, df.columns != 'Outcome'], df['Outcome']\n", 964 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 965 | "logreg = LogisticRegression()\n", 966 | "logreg.fit(X_train, y_train)\n", 967 | "y_pred = logreg.predict(X_test)\n", 968 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 969 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 970 | ] 971 | } 972 | ], 973 | "metadata": { 974 | "kernelspec": { 975 | "display_name": "Python 3", 976 | "language": "python", 977 | "name": "python3" 978 | }, 979 | "language_info": { 980 | "codemirror_mode": { 981 | "name": "ipython", 982 | "version": 3 983 | }, 984 | "file_extension": ".py", 985 | "mimetype": "text/x-python", 986 | "name": "python", 987 | "nbconvert_exporter": "python", 988 | "pygments_lexer": "ipython3", 989 | "version": "3.6.0" 990 | } 991 | }, 992 | "nbformat": 4, 993 | "nbformat_minor": 2 994 | } 995 | -------------------------------------------------------------------------------- /Section 3/3.4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. 
Expected 96, got 88\n", 13 | " return f(*args, **kwds)\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np\n", 20 | "from sklearn.model_selection import train_test_split\n", 21 | "from sklearn.linear_model import LogisticRegression" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## remove outliers using the inter-quartile range" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "df = pd.read_csv('diabetes.csv')" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/html": [ 48 | "


	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

\n", 176 | "

" 177 | ], 178 | "text/plain": [ 179 | " Pregnancies Glucose BloodPressure SkinThickness Insulin \\\n", 180 | "count 768.000000 768.000000 768.000000 768.000000 768.000000 \n", 181 | "mean 3.845052 120.894531 69.105469 20.536458 79.799479 \n", 182 | "std 3.369578 31.972618 19.355807 15.952218 115.244002 \n", 183 | "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", 184 | "25% 1.000000 99.000000 62.000000 0.000000 0.000000 \n", 185 | "50% 3.000000 117.000000 72.000000 23.000000 30.500000 \n", 186 | "75% 6.000000 140.250000 80.000000 32.000000 127.250000 \n", 187 | "max 17.000000 199.000000 122.000000 99.000000 846.000000 \n", 188 | "\n", 189 | " BMI DiabetesPedigreeFunction Age Outcome \n", 190 | "count 768.000000 768.000000 768.000000 768.000000 \n", 191 | "mean 31.992578 0.471876 33.240885 0.348958 \n", 192 | "std 7.884160 0.331329 11.760232 0.476951 \n", 193 | "min 0.000000 0.078000 21.000000 0.000000 \n", 194 | "25% 27.300000 0.243750 24.000000 0.000000 \n", 195 | "50% 32.000000 0.372500 29.000000 0.000000 \n", 196 | "75% 36.600000 0.626250 41.000000 1.000000 \n", 197 | "max 67.100000 2.420000 81.000000 1.000000 " 198 | ] 199 | }, 200 | "execution_count": 3, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "df.describe()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "## insulin looks like it might have some outliers \n", 214 | "- the mean and quartiles are much smaller than the max value\n", 215 | " - let's try to remove these outliers to check whether doing so helps model performance" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "## calculate the inter-quartile range\n", 223 | "- In descriptive statistics, the interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or 
between upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. (https://en.wikipedia.org/wiki/Interquartile_range)" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 7, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "# calculate interquartile range\n", 233 | "q25, q75 = np.percentile(df['Insulin'], 25), np.percentile(df['Insulin'], 75)\n", 234 | "iqr = q75 - q25\n" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 8, 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "127.25" 246 | ] 247 | }, 248 | "execution_count": 8, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "iqr" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "## let's remove all observations that have insulin higher than 127.25 (the IQR, which equals Q3 here because Q1 = 0; the usual IQR rule would instead cut at Q3 + 1.5 * IQR = 318.125)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 9, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "df2 = df[df['Insulin'] <= 127.25]  # cutoff is the IQR itself, i.e. the 75th percentile here" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 11, 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "name": "stdout", 280 | "output_type": "stream", 281 | "text": [ 282 | "original df size: (768, 9)\n", 283 | "new df size: (576, 9)\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "print('original df size:', df.shape)\n", 289 | "print('new df size:', df2.shape)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "## let's compare models" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "## baseline logistic regression model" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 22, 309 | 
"metadata": {}, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "Accuracy of logistic regression classifier on test set: 0.75\n" 316 | ] 317 | }, 318 | { 319 | "data": { 320 | "text/html": [ 321 | "

\n", 322 | "\n", 335 | "\n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | "

Predicted Result	0	1
Actual Result
0	97	11
1	28	18

\n", 361 | "

" 362 | ], 363 | "text/plain": [ 364 | "Predicted Result 0 1\n", 365 | "Actual Result \n", 366 | "0 97 11\n", 367 | "1 28 18" 368 | ] 369 | }, 370 | "execution_count": 22, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "X, y = df.loc[:, df.columns != 'Outcome'], df['Outcome']\n", 377 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 378 | "logreg = LogisticRegression()\n", 379 | "logreg.fit(X_train, y_train)\n", 380 | "y_pred = logreg.predict(X_test)\n", 381 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 382 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "## logistic regression on data without insulin outliers" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 24, 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "name": "stdout", 399 | "output_type": "stream", 400 | "text": [ 401 | "Accuracy of logistic regression classifier on test set: 0.80\n" 402 | ] 403 | }, 404 | { 405 | "data": { 406 | "text/html": [ 407 | "

\n", 408 | "\n", 421 | "\n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | "

Predicted Result	0	1
Actual Result
0	81	5
1	18	12

\n", 447 | "

" 448 | ], 449 | "text/plain": [ 450 | "Predicted Result 0 1\n", 451 | "Actual Result \n", 452 | "0 81 5\n", 453 | "1 18 12" 454 | ] 455 | }, 456 | "execution_count": 24, 457 | "metadata": {}, 458 | "output_type": "execute_result" 459 | } 460 | ], 461 | "source": [ 462 | "X, y = df2.loc[:, df2.columns != 'Outcome'], df2['Outcome']\n", 463 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 464 | "logreg = LogisticRegression()\n", 465 | "logreg.fit(X_train, y_train)\n", 466 | "y_pred = logreg.predict(X_test)\n", 467 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 468 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 469 | ] 470 | } 471 | ], 472 | "metadata": { 473 | "kernelspec": { 474 | "display_name": "Python 3", 475 | "language": "python", 476 | "name": "python3" 477 | }, 478 | "language_info": { 479 | "codemirror_mode": { 480 | "name": "ipython", 481 | "version": 3 482 | }, 483 | "file_extension": ".py", 484 | "mimetype": "text/x-python", 485 | "name": "python", 486 | "nbconvert_exporter": "python", 487 | "pygments_lexer": "ipython3", 488 | "version": "3.6.0" 489 | } 490 | }, 491 | "nbformat": 4, 492 | "nbformat_minor": 2 493 | } 494 | -------------------------------------------------------------------------------- /Section 3/3.5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. 
Expected 96, got 88\n", 13 | " return f(*args, **kwds)\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np\n", 20 | "from sklearn.model_selection import train_test_split\n", 21 | "from sklearn.linear_model import LogisticRegression" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.read_csv('diabetes.csv')" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## removing outliers using standard deviation\n", 38 | " - Standard deviation is a measure of spread, i.e. how far the individual data points typically lie from the mean.\n", 39 | " - less reliable than IQR because the mean and standard deviation are themselves affected by the outliers\n", 40 | " - assumes the data roughly follow a Gaussian (normal) distribution" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## let's remove outliers in the insulin variable\n", 48 | " - remove points that are above (Mean + 2 * SD) and any points below (Mean - 2 * SD)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 4, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "mean = np.mean(df['Insulin'])\n", 58 | "sd = np.std(df['Insulin'])" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 5, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "df2 = df[(df['Insulin'] > mean - 2 * sd) & (df['Insulin'] < mean + 2 * sd)]" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## compare models" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## baseline logistic regression model" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 9, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | 
"Accuracy of logistic regression classifier on test set: 0.75\n" 94 | ] 95 | }, 96 | { 97 | "data": { 98 | "text/html": [ 99 | "

\n", 100 | "\n", 113 | "\n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | "

Predicted Result	0	1
Actual Result
0	90	9
1	29	26

\n", 139 | "

" 140 | ], 141 | "text/plain": [ 142 | "Predicted Result 0 1\n", 143 | "Actual Result \n", 144 | "0 90 9\n", 145 | "1 29 26" 146 | ] 147 | }, 148 | "execution_count": 9, 149 | "metadata": {}, 150 | "output_type": "execute_result" 151 | } 152 | ], 153 | "source": [ 154 | "X, y = df.loc[:, df.columns != 'Outcome'], df['Outcome']\n", 155 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 156 | "logreg = LogisticRegression()\n", 157 | "logreg.fit(X_train, y_train)\n", 158 | "y_pred = logreg.predict(X_test)\n", 159 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 160 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "## logistic regression model after removing outliers" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 13, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "Accuracy of logistic regression classifier on test set: 0.80\n" 180 | ] 181 | }, 182 | { 183 | "data": { 184 | "text/html": [ 185 | "

\n", 186 | "\n", 199 | "\n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | "

Predicted Result	0	1
Actual Result
0	91	9
1	21	26

\n", 225 | "

" 226 | ], 227 | "text/plain": [ 228 | "Predicted Result 0 1\n", 229 | "Actual Result \n", 230 | "0 91 9\n", 231 | "1 21 26" 232 | ] 233 | }, 234 | "execution_count": 13, 235 | "metadata": {}, 236 | "output_type": "execute_result" 237 | } 238 | ], 239 | "source": [ 240 | "X, y = df2.loc[:, df2.columns != 'Outcome'], df2['Outcome']\n", 241 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 242 | "logreg = LogisticRegression()\n", 243 | "logreg.fit(X_train, y_train)\n", 244 | "y_pred = logreg.predict(X_test)\n", 245 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 246 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "## removing outliers using the median absolute deviation\n", 254 | "- Robust Z-Score method\n", 255 | "- source: https://stackoverflow.com/questions/22354094/pythonic-way-of-detecting-outliers-in-one-dimensional-observation-data?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 15, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "def mad_based_outlier(points, thresh=3.5):  # returns a boolean mask that is True for outliers\n", 265 | " if len(points.shape) == 1:\n", 266 | " points = points[:,None]\n", 267 | " median = np.median(points, axis=0)\n", 268 | " diff = np.sum((points - median)**2, axis=-1)\n", 269 | " diff = np.sqrt(diff)  # absolute deviation of each point from the median\n", 270 | " med_abs_deviation = np.median(diff)\n", 271 | "\n", 272 | " modified_z_score = 0.6745 * diff / med_abs_deviation\n", 273 | "\n", 274 | " return modified_z_score > thresh" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 18, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "df2 = df[mad_based_outlier(df['Insulin'].values)]  # note: the mask is True for outliers, so this keeps only the flagged rows; use ~mad_based_outlier(...) to drop them instead" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 
288 | "execution_count": 19, 289 | "metadata": {}, 290 | "outputs": [ 291 | { 292 | "data": { 293 | "text/plain": [ 294 | "(101, 9)" 295 | ] 296 | }, 297 | "execution_count": 19, 298 | "metadata": {}, 299 | "output_type": "execute_result" 300 | } 301 | ], 302 | "source": [ 303 | "df2.shape" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 20, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "(768, 9)" 315 | ] 316 | }, 317 | "execution_count": 20, 318 | "metadata": {}, 319 | "output_type": "execute_result" 320 | } 321 | ], 322 | "source": [ 323 | "df.shape" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 26, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "name": "stdout", 333 | "output_type": "stream", 334 | "text": [ 335 | "Accuracy of logistic regression classifier on test set: 0.76\n" 336 | ] 337 | }, 338 | { 339 | "data": { 340 | "text/html": [ 341 | "

\n", 342 | "\n", 355 | "\n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | "

Predicted Result	0	1
Actual Result
0	6	1
1	4	10

\n", 381 | "

" 382 | ], 383 | "text/plain": [ 384 | "Predicted Result 0 1\n", 385 | "Actual Result \n", 386 | "0 6 1\n", 387 | "1 4 10" 388 | ] 389 | }, 390 | "execution_count": 26, 391 | "metadata": {}, 392 | "output_type": "execute_result" 393 | } 394 | ], 395 | "source": [ 396 | "X, y = df2.loc[:, df2.columns != 'Outcome'], df2['Outcome']\n", 397 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 398 | "logreg = LogisticRegression()\n", 399 | "logreg.fit(X_train, y_train)\n", 400 | "y_pred = logreg.predict(X_test)\n", 401 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 402 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 403 | ] 404 | } 405 | ], 406 | "metadata": { 407 | "kernelspec": { 408 | "display_name": "Python 3", 409 | "language": "python", 410 | "name": "python3" 411 | }, 412 | "language_info": { 413 | "codemirror_mode": { 414 | "name": "ipython", 415 | "version": 3 416 | }, 417 | "file_extension": ".py", 418 | "mimetype": "text/x-python", 419 | "name": "python", 420 | "nbconvert_exporter": "python", 421 | "pygments_lexer": "ipython3", 422 | "version": "3.6.0" 423 | } 424 | }, 425 | "nbformat": 4, 426 | "nbformat_minor": 2 427 | } 428 | -------------------------------------------------------------------------------- /Section 4/4.1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import seaborn as sns\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "from sklearn.feature_selection import SelectKBest\n", 13 | "from sklearn.feature_selection import chi2 # chi square filter method" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 2, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "data = 
pd.read_csv('german_credit.csv')" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Filter methods\n", 30 | "- feature selection independent of the model used\n", 31 | "- based on general methods like correlation\n", 32 | "- find the best subset of variables\n", 33 | "- helps reduce overfitting\n", 34 | "\n", 35 | "## Types of filter methods\n", 36 | "- information gain\n", 37 | "- chi-square test\n", 38 | "- Fisher score\n", 39 | "- correlation coefficient\n", 40 | "- variance threshold" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## let's select the top 15 of the 20 features with the highest chi-squared statistics" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 5, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "X, y = data.loc[:, data.columns!='Creditability'], data['Creditability']" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 6, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "chisq_selector = SelectKBest(chi2, k=15)\n", 66 | "X_kbest = chisq_selector.fit_transform(X, y)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 7, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "name": "stdout", 76 | "output_type": "stream", 77 | "text": [ 78 | "Original number of features: 20\n", 79 | "Reduced number of features: 15\n" 80 | ] 81 | } 82 | ], 83 | "source": [ 84 | "print('Original number of features:', X.shape[1])\n", 85 | "print('Reduced number of features:', X_kbest.shape[1])\n" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 12, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "data": { 95 | "text/plain": [ 96 | "(1000, 15)" 97 | ] 98 | }, 99 | "execution_count": 12, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "X_kbest.shape" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | 
"execution_count": 21, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "array([ True, True, True, True, True, True, True, True, True,\n", 117 | " False, False, True, True, True, False, True, False, False,\n", 118 | " True, True])" 119 | ] 120 | }, 121 | "execution_count": 21, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "chisq_selector.get_support()" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 22, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/plain": [ 138 | "Index(['Account Balance', 'Duration of Credit (month)',\n", 139 | " 'Payment Status of Previous Credit', 'Purpose', 'Credit Amount',\n", 140 | " 'Value Savings/Stocks', 'Length of current employment',\n", 141 | " 'Instalment per cent', 'Sex & Marital Status', 'Guarantors',\n", 142 | " 'Duration in Current address', 'Most valuable available asset',\n", 143 | " 'Age (years)', 'Concurrent Credits', 'Type of apartment',\n", 144 | " 'No of Credits at this Bank', 'Occupation', 'No of dependents',\n", 145 | " 'Telephone', 'Foreign Worker'],\n", 146 | " dtype='object')" 147 | ] 148 | }, 149 | "execution_count": 22, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "X.columns" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 24, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Index(['Account Balance', 'Duration of Credit (month)',\n", 168 | " 'Payment Status of Previous Credit', 'Purpose', 'Credit Amount',\n", 169 | " 'Value Savings/Stocks', 'Length of current employment',\n", 170 | " 'Instalment per cent', 'Sex & Marital Status',\n", 171 | " 'Most valuable available asset', 'Age (years)', 'Concurrent Credits',\n", 172 | " 'No of Credits at this Bank', 'Telephone', 'Foreign Worker'],\n", 173 | " 
dtype='object')\n" 174 | ] 175 | } 176 | ], 177 | "source": [ 178 | "print(X.columns[chisq_selector.get_support()]) " 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## dropped columns:\n", 186 | "1. Guarantors\n", 187 | "2. Duration in Current address\n", 188 | "3. Type of apartment\n", 189 | "4. Occupation\n", 190 | "5. No of dependents" 191 | ] 192 | } 193 | ], 194 | "metadata": { 195 | "kernelspec": { 196 | "display_name": "Python 3", 197 | "language": "python", 198 | "name": "python3" 199 | }, 200 | "language_info": { 201 | "codemirror_mode": { 202 | "name": "ipython", 203 | "version": 3 204 | }, 205 | "file_extension": ".py", 206 | "mimetype": "text/x-python", 207 | "name": "python", 208 | "nbconvert_exporter": "python", 209 | "pygments_lexer": "ipython3", 210 | "version": "3.6.0" 211 | } 212 | }, 213 | "nbformat": 4, 214 | "nbformat_minor": 2 215 | } 216 | -------------------------------------------------------------------------------- /Section 4/4.2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 5, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import seaborn as sns\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "from sklearn.feature_selection import SelectKBest\n", 13 | "from sklearn.ensemble import RandomForestClassifier\n", 14 | "from sklearn.feature_selection import RFE" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "data = pd.read_csv('german_credit.csv')" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Wrapper methods\n", 31 | "- Wrapper methods evaluate subsets of variables, which, unlike filter approaches, lets them detect possible interactions between variables. 
\n", 32 | "The two main disadvantages of these methods are:\n", 33 | "\n", 34 | "1. The increasing overfitting risk when the number of observations is insufficient.\n", 35 | "2. The significant computation time when the number of variables is large.\n", 36 | "\n", 37 | "source: https://en.wikipedia.org/wiki/Feature_selection#Filter_method\n", 38 | "\n", 39 | "## Types of wrapper methods\n", 40 | "- recursive feature elimination\n", 41 | "- sequential feature selection algorithms\n", 42 | "- genetic algorithms" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "Recursive feature elimination (RFE) repeatedly fits a model, ranks the features by the importance the model assigns to them, and removes the weakest ones until the requested number of features remains. In this way RFE works out a combination of attributes that contributes to the prediction of the target variable (or class). " 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 8, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "[ True True True True True True True True True False True True\n", 62 | " True True False True True False False False]\n", 63 | "[1 1 1 1 1 1 1 1 1 2 1 1 1 1 4 1 1 5 3 6]\n" 64 | ] 65 | } 66 | ], 67 | "source": [ 68 | "X, y = data.loc[:, data.columns!='Creditability'], data['Creditability']\n", 69 | "\n", 70 | "clf = RandomForestClassifier(n_jobs=2, random_state=0)\n", 71 | "# create the RFE model for a random forest classifier\n", 72 | "# and select attributes\n", 73 | "rfe = RFE(clf, n_features_to_select=15)\n", 74 | "rfe = rfe.fit(X, y)\n", 75 | "# print summaries for the selection of attributes\n", 76 | "print(rfe.support_)\n", 77 | "print(rfe.ranking_)" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 17, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "data": { 87 | "text/plain": [ 88 | "array([ True, True, True, True, True, True, True, True, True,\n", 89 | " False, True, True, True, True, False, True, 
True, False,\n", 90 | " False, False])" 91 | ] 92 | }, 93 | "execution_count": 17, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "rfe.get_support()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 19, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "Index(['Account Balance', 'Duration of Credit (month)',\n", 112 | " 'Payment Status of Previous Credit', 'Purpose', 'Credit Amount',\n", 113 | " 'Value Savings/Stocks', 'Length of current employment',\n", 114 | " 'Instalment per cent', 'Sex & Marital Status', 'Guarantors',\n", 115 | " 'Duration in Current address', 'Most valuable available asset',\n", 116 | " 'Age (years)', 'Concurrent Credits', 'Type of apartment',\n", 117 | " 'No of Credits at this Bank', 'Occupation', 'No of dependents',\n", 118 | " 'Telephone', 'Foreign Worker'],\n", 119 | " dtype='object')\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "print(X.columns)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 18, 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "Index(['Account Balance', 'Duration of Credit (month)',\n", 137 | " 'Payment Status of Previous Credit', 'Purpose', 'Credit Amount',\n", 138 | " 'Value Savings/Stocks', 'Length of current employment',\n", 139 | " 'Instalment per cent', 'Sex & Marital Status',\n", 140 | " 'Duration in Current address', 'Most valuable available asset',\n", 141 | " 'Age (years)', 'Concurrent Credits', 'No of Credits at this Bank',\n", 142 | " 'Occupation'],\n", 143 | " dtype='object')\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "print(X.columns[rfe.get_support()]) " 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "dropped columns:\n", 156 | "1. Guarantors\n", 157 | "2. 
Type of apartment\n", 158 | "3. No of dependents\n", 159 | "4. Telephone\n", 160 | "5. Foreign Worker" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 13, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "Features sorted by their rank:\n", 173 | "[(1, 'Account Balance'), (1, 'Age (years)'), (1, 'Concurrent Credits'), (1, 'Credit Amount'), (1, 'Duration in Current address'), (1, 'Duration of Credit (month)'), (1, 'Instalment per cent'), (1, 'Length of current employment'), (1, 'Most valuable available asset'), (1, 'No of Credits at this Bank'), (1, 'Occupation'), (1, 'Payment Status of Previous Credit'), (1, 'Purpose'), (1, 'Sex & Marital Status'), (1, 'Value Savings/Stocks'), (2, 'Guarantors'), (3, 'Telephone'), (4, 'Type of apartment'), (5, 'No of dependents'), (6, 'Foreign Worker')]\n" 174 | ] 175 | } 176 | ], 177 | "source": [ 178 | "names = X.columns\n", 179 | "print (\"Features sorted by their rank:\")\n", 180 | "print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [] 189 | } 190 | ], 191 | "metadata": { 192 | "kernelspec": { 193 | "display_name": "Python 3", 194 | "language": "python", 195 | "name": "python3" 196 | }, 197 | "language_info": { 198 | "codemirror_mode": { 199 | "name": "ipython", 200 | "version": 3 201 | }, 202 | "file_extension": ".py", 203 | "mimetype": "text/x-python", 204 | "name": "python", 205 | "nbconvert_exporter": "python", 206 | "pygments_lexer": "ipython3", 207 | "version": "3.6.0" 208 | } 209 | }, 210 | "nbformat": 4, 211 | "nbformat_minor": 2 212 | } 213 | -------------------------------------------------------------------------------- /Section 4/4.3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | 
"cell_type": "code", 5 | "execution_count": 7, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "from sklearn.ensemble import RandomForestClassifier\n", 11 | "from sklearn import metrics\n", 12 | "import numpy as np" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "data = pd.read_csv('german_credit.csv')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Feature importance:\n", 29 | "- use decision tree classifiers to find the relative importance of the features\n", 30 | "- These importance values can be used to inform a feature selection process." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 6, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | "[0.12689479 0.09685797 0.05581169 0.061145 0.142556 0.04732479\n", 43 | " 0.05096343 0.04116976 0.0362849 0.01631357 0.04774687 0.03821556\n", 44 | " 0.10748535 0.01597799 0.02453943 0.02408641 0.02888386 0.01614243\n", 45 | " 0.01640286 0.00519733]\n" 46 | ] 47 | } 48 | ], 49 | "source": [ 50 | "X, y = data.loc[:, data.columns!='Creditability'], data['Creditability']\n", 51 | "\n", 52 | "model = RandomForestClassifier(n_jobs=2, random_state=0)\n", 53 | "model.fit(X, y)\n", 54 | "# display the relative importance of each attribute\n", 55 | "print(model.feature_importances_)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 11, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/html": [ 66 | "
" 171 | ], 172 | "text/plain": [ 173 | " Gini-importance\n", 174 | "Credit Amount 0.142556\n", 175 | "Account Balance 0.126895\n", 176 | "Age (years) 0.107485\n", 177 | "Duration of Credit (month) 0.096858\n", 178 | "Purpose 0.061145\n", 179 | "Payment Status of Previous Credit 0.055812\n", 180 | "Length of current employment 0.050963\n", 181 | "Duration in Current address 0.047747\n", 182 | "Value Savings/Stocks 0.047325\n", 183 | "Instalment per cent 0.041170\n", 184 | "Most valuable available asset 0.038216\n", 185 | "Sex & Marital Status 0.036285\n", 186 | "Occupation 0.028884\n", 187 | "Type of apartment 0.024539\n", 188 | "No of Credits at this Bank 0.024086\n", 189 | "Telephone 0.016403\n", 190 | "Guarantors 0.016314\n", 191 | "No of dependents 0.016142\n", 192 | "Concurrent Credits 0.015978\n", 193 | "Foreign Worker 0.005197" 194 | ] 195 | }, 196 | "execution_count": 11, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "feats = {} # a dict to hold feature_name: feature_importance\n", 203 | "for feature, importance in zip(X.columns, model.feature_importances_):\n", 204 | " feats[feature] = importance #add the name/value pair \n", 205 | "\n", 206 | "importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})\n", 207 | "importances.sort_values(by='Gini-importance', ascending=False)" 208 | ] 209 | } 210 | ], 211 | "metadata": { 212 | "kernelspec": { 213 | "display_name": "Python 3", 214 | "language": "python", 215 | "name": "python3" 216 | }, 217 | "language_info": { 218 | "codemirror_mode": { 219 | "name": "ipython", 220 | "version": 3 221 | }, 222 | "file_extension": ".py", 223 | "mimetype": "text/x-python", 224 | "name": "python", 225 | "nbconvert_exporter": "python", 226 | "pygments_lexer": "ipython3", 227 | "version": "3.6.0" 228 | } 229 | }, 230 | "nbformat": 4, 231 | "nbformat_minor": 2 232 | } 233 | 
-------------------------------------------------------------------------------- /Section 5/5.1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\n", 13 | " return f(*args, **kwds)\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "from sklearn.model_selection import train_test_split\n", 20 | "from sklearn.linear_model import LogisticRegression\n", 21 | "from sklearn.metrics import classification_report # classification report" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.read_csv('german_credit.csv')" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## build a baseline logistic regression model" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 11, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "Accuracy of logistic regression classifier on test set: 0.74\n" 50 | ] 51 | }, 52 | { 53 | "data": { 54 | "text/html": [ 55 | "
" 96 | ], 97 | "text/plain": [ 98 | "Predicted Result 0 1\n", 99 | "Actual Result \n", 100 | "0 29 34\n", 101 | "1 17 120" 102 | ] 103 | }, 104 | "execution_count": 11, 105 | "metadata": {}, 106 | "output_type": "execute_result" 107 | } 108 | ], 109 | "source": [ 110 | "X, y = df.loc[:, df.columns != 'Creditability'], df['Creditability']\n", 111 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 112 | "logreg = LogisticRegression()\n", 113 | "logreg.fit(X_train, y_train)\n", 114 | "y_pred = logreg.predict(X_test)\n", 115 | "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n", 116 | "pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 12, 122 | "metadata": {}, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | " precision recall f1-score support\n", 129 | "\n", 130 | " 0 0.63 0.46 0.53 63\n", 131 | " 1 0.78 0.88 0.82 137\n", 132 | "\n", 133 | "avg / total 0.73 0.74 0.73 200\n", 134 | "\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "print(classification_report(y_test, y_pred))" 140 | ] 141 | } 142 | ], 143 | "metadata": { 144 | "kernelspec": { 145 | "display_name": "Python 3", 146 | "language": "python", 147 | "name": "python3" 148 | }, 149 | "language_info": { 150 | "codemirror_mode": { 151 | "name": "ipython", 152 | "version": 3 153 | }, 154 | "file_extension": ".py", 155 | "mimetype": "text/x-python", 156 | "name": "python", 157 | "nbconvert_exporter": "python", 158 | "pygments_lexer": "ipython3", 159 | "version": "3.6.0" 160 | } 161 | }, 162 | "nbformat": 4, 163 | "nbformat_minor": 2 164 | } 165 | -------------------------------------------------------------------------------- /Section 5/5.2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 
| "cell_type": "code", 5 | "execution_count": 22, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.\n", 13 | " from numpy.core.umath_tests import inner1d\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "from sklearn.model_selection import train_test_split\n", 20 | "from sklearn.linear_model import LogisticRegression\n", 21 | "from sklearn.metrics import classification_report\n", 22 | "from sklearn.feature_selection import SelectKBest\n", 23 | "from sklearn.feature_selection import chi2\n", 24 | "from sklearn.preprocessing import scale\n", 25 | "from sklearn.decomposition import PCA\n", 26 | "import numpy as np\n", 27 | "import matplotlib.pyplot as plt\n", 28 | "from sklearn.ensemble import RandomForestClassifier\n", 29 | "from sklearn.feature_selection import RFE" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## apply feature engineering to these data points\n", 37 | "- feature importance\n", 38 | "- feature selection: filter method\n", 39 | "- feature extraction / dimensionality reduction: PCA" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "data = pd.read_csv('german_credit.csv')" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 47, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "data": { 58 | "text/plain": [ 59 | "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", 60 | " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", 61 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 62 | " min_samples_leaf=1, 
min_samples_split=2,\n", 63 | " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,\n", 64 | " oob_score=False, random_state=0, verbose=0, warm_start=False)" 65 | ] 66 | }, 67 | "execution_count": 47, 68 | "metadata": {}, 69 | "output_type": "execute_result" 70 | } 71 | ], 72 | "source": [ 73 | "# feature importance\n", 74 | "X, y = data.loc[:, data.columns!='Creditability'], data['Creditability']\n", 75 | "\n", 76 | "model = RandomForestClassifier(n_jobs=2, random_state=0)\n", 77 | "model.fit(X, y)" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 48, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "data": { 87 | "text/html": [ 88 | "
" 193 | ], 194 | "text/plain": [ 195 | " Gini-importance\n", 196 | "Credit Amount 0.142556\n", 197 | "Account Balance 0.126895\n", 198 | "Age (years) 0.107485\n", 199 | "Duration of Credit (month) 0.096858\n", 200 | "Purpose 0.061145\n", 201 | "Payment Status of Previous Credit 0.055812\n", 202 | "Length of current employment 0.050963\n", 203 | "Duration in Current address 0.047747\n", 204 | "Value Savings/Stocks 0.047325\n", 205 | "Instalment per cent 0.041170\n", 206 | "Most valuable available asset 0.038216\n", 207 | "Sex & Marital Status 0.036285\n", 208 | "Occupation 0.028884\n", 209 | "Type of apartment 0.024539\n", 210 | "No of Credits at this Bank 0.024086\n", 211 | "Telephone 0.016403\n", 212 | "Guarantors 0.016314\n", 213 | "No of dependents 0.016142\n", 214 | "Concurrent Credits 0.015978\n", 215 | "Foreign Worker 0.005197" 216 | ] 217 | }, 218 | "execution_count": 48, 219 | "metadata": {}, 220 | "output_type": "execute_result" 221 | } 222 | ], 223 | "source": [ 224 | "feats = {} # a dict to hold feature_name: feature_importance\n", 225 | "for feature, importance in zip(X.columns, model.feature_importances_):\n", 226 | " feats[feature] = importance #add the name/value pair \n", 227 | "\n", 228 | "importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})\n", 229 | "importances.sort_values(by='Gini-importance', ascending=False)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 49, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "X, y = data.loc[:, data.columns!='Creditability'], data['Creditability']" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 50, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "# filter method\n", 248 | "chisq_selector = SelectKBest(chi2, k=15)\n", 249 | "X_kbest = chisq_selector.fit_transform(X, y)" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 6, 255 | "metadata": {}, 
256 | "outputs": [], 257 | "source": [ 258 | "X_new = X[X.columns[chisq_selector.get_support()]]" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 12, 264 | "metadata": {}, 265 | "outputs": [ 266 | { 267 | "name": "stderr", 268 | "output_type": "stream", 269 | "text": [ 270 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype int64 was converted to float64 by the scale function.\n", 271 | " warnings.warn(msg, DataConversionWarning)\n" 272 | ] 273 | } 274 | ], 275 | "source": [ 276 | "# PCA\n", 277 | "x = X.values \n", 278 | "x = scale(x)" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 15, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "covar_matrix = PCA(n_components = 15) # keep the top 15 principal components" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 18, 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "covar_matrix.fit(x)\n", 297 | "variance = covar_matrix.explained_variance_ratio_ # per-component explained variance ratios\n", 298 | "\n", 299 | "var=np.cumsum(np.round(covar_matrix.explained_variance_ratio_, decimals=3)*100) # cumulative % variance explained" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 21, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/plain": [ 310 | "[]" 311 | ] 312 | }, 313 | "execution_count": 21, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | }, 317 | { 318 | "data": { 319 | "image/png": 
"[base64-encoded PNG omitted: line plot titled 'PCA Analysis', x-axis '# of Features', y-axis '% Variance Explained']\n", 320 | "text/plain": [ 321 | "" 322 | ] 323 | }, 324 | "metadata": { 325 | "needs_background": "light" 326 | }, 327 | "output_type": "display_data" 328 | } 329 | ], 330 | "source": [ 331 | "plt.ylabel('% Variance Explained')\n", 332 | "plt.xlabel('# of Features')\n", 333 | "plt.title('PCA Analysis')\n", 334 | "plt.ylim(0,110)\n", 335 | "plt.style.context('seaborn-whitegrid')\n", 336 | "\n", 337 | "\n", 338 | "plt.plot(var)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 28, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "# feature selection using wrapper method\n", 348 | "\n", 349 | "clf = RandomForestClassifier(n_jobs=2, random_state=0)\n", 350 | "# create the RFE model around the random forest classifier\n", 351 | "# and select attributes\n", 352 | "rfe = RFE(clf, n_features_to_select=13)\n", 353 | "rfe = rfe.fit(X_new, y)" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 29, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "X_new_v2 = X_new[X_new.columns[rfe.get_support()]]" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 30, 368 | "metadata": {}, 369 | "outputs": [ 370 | { 371 | "data": { 372 | "text/plain": [ 373 | "(1000, 13)" 374 | ] 375 | }, 376 | "execution_count": 30, 377 | "metadata": {}, 378 | "output_type": "execute_result" 379 | } 380 | ], 381 | "source": [ 382 | "X_new_v2.shape" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 31, 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "name": "stdout", 392 | "output_type": "stream", 393 | "text": [ 394 | "Features sorted by their rank:\n", 395 | "[(1, 'Account Balance'), (1, 
'Age (years)'), (1, 'Concurrent Credits'), (1, 'Credit Amount'), (1, 'Duration of Credit (month)'), (1, 'Instalment per cent'), (1, 'Length of current employment'), (1, 'Most valuable available asset'), (1, 'No of Credits at this Bank'), (1, 'Payment Status of Previous Credit'), (1, 'Purpose'), (1, 'Sex & Marital Status'), (1, 'Value Savings/Stocks'), (2, 'Telephone'), (3, 'Foreign Worker')]\n" 396 | ] 397 | } 398 | ], 399 | "source": [ 400 | "names = X_new.columns\n", 401 | "print (\"Features sorted by their rank:\")\n", 402 | "print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 52, 408 | "metadata": {}, 409 | "outputs": [], 410 | "source": [ 411 | "X_new_v2.to_csv('fe_data.csv', index=False)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 58, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "X_new.to_csv('fe_data.csv', index=False)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 57, 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "pd.DataFrame(y).to_csv('y.csv', index= False)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": 54, 435 | "metadata": {}, 436 | "outputs": [ 437 | { 438 | "data": { 439 | "text/plain": [ 440 | "(1000,)" 441 | ] 442 | }, 443 | "execution_count": 54, 444 | "metadata": {}, 445 | "output_type": "execute_result" 446 | } 447 | ], 448 | "source": [ 449 | "y.shape" 450 | ] 451 | } 452 | ], 453 | "metadata": { 454 | "kernelspec": { 455 | "display_name": "Python 3", 456 | "language": "python", 457 | "name": "python3" 458 | }, 459 | "language_info": { 460 | "codemirror_mode": { 461 | "name": "ipython", 462 | "version": 3 463 | }, 464 | "file_extension": ".py", 465 | "mimetype": "text/x-python", 466 | "name": "python", 467 | "nbconvert_exporter": "python", 468 | "pygments_lexer": "ipython3", 469 | "version": "3.6.0" 470 | } 471 | }, 
472 | "nbformat": 4, 473 | "nbformat_minor": 2 474 | } 475 | -------------------------------------------------------------------------------- /Section 5/5.3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 44, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "from sklearn.model_selection import train_test_split\n", 11 | "from sklearn.linear_model import LogisticRegression\n", 12 | "from sklearn.metrics import classification_report\n", 13 | "import numpy as np\n", 14 | "from sklearn.ensemble import RandomForestClassifier\n", 15 | "import pandas\n", 16 | "from sklearn import model_selection\n", 17 | "from sklearn.linear_model import LogisticRegression\n", 18 | "from sklearn.ensemble import VotingClassifier\n", 19 | "from sklearn.svm import SVC" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 58, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "X = pd.read_csv('fe_data.csv')\n", 29 | "y = pd.read_csv('y.csv') " 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 60, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## ensemble model:\n", 46 | "- Random Forest\n", 47 | "- Logistic Regression\n", 48 | "- Support Vector Classifier" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 80, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "name": "stderr", 58 | "output_type": "stream", 59 | "text": [ 60 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/utils/validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n", 61 | " y = column_or_1d(y, warn=True)\n", 62 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/ipykernel_launcher.py:6: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", 63 | " \n", 64 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/utils/validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", 65 | " y = column_or_1d(y, warn=True)\n", 66 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/preprocessing/label.py:95: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", 67 | " y = column_or_1d(y, warn=True)\n", 68 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/preprocessing/label.py:128: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", 69 | " y = column_or_1d(y, warn=True)\n", 70 | "/home/sahibachopra/miniconda/envs/ai/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. 
Use `array.size > 0` to check that an array is not empty.\n", 71 | " if diff:\n" 72 | ] 73 | }, 74 | { 75 | "data": { 76 | "text/plain": [ 77 | "0.785" 78 | ] 79 | }, 80 | "execution_count": 80, 81 | "metadata": {}, 82 | "output_type": "execute_result" 83 | } 84 | ], 85 | "source": [ 86 | "# baseline model accuracy = 0.74\n", 87 | "model1 = RandomForestClassifier()\n", 88 | "model2= LogisticRegression()\n", 89 | "model3 = SVC()\n", 90 | "\n", 91 | "model = VotingClassifier(estimators=[('rf', model1), ('lr', model2), ('svc', model3)], voting='hard')\n", 92 | "model.fit(X_train,y_train)\n", 93 | "model.score(X_test,y_test)" 94 | ] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 3 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython3", 113 | "version": "3.6.0" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 2 118 | } 119 | -------------------------------------------------------------------------------- /Section 5/y.csv: -------------------------------------------------------------------------------- 1 | Creditability 2 | 1 3 | 1 4 | 1 5 | 1 6 | 1 7 | 1 8 | 1 9 | 1 10 | 1 11 | 1 12 | 1 13 | 1 14 | 1 15 | 1 16 | 1 17 | 1 18 | 1 19 | 1 20 | 1 21 | 1 22 | 1 23 | 1 24 | 0 25 | 1 26 | 1 27 | 1 28 | 1 29 | 1 30 | 1 31 | 1 32 | 1 33 | 1 34 | 1 35 | 1 36 | 1 37 | 1 38 | 1 39 | 1 40 | 1 41 | 1 42 | 1 43 | 1 44 | 1 45 | 1 46 | 1 47 | 0 48 | 1 49 | 0 50 | 1 51 | 1 52 | 1 53 | 1 54 | 1 55 | 1 56 | 1 57 | 1 58 | 1 59 | 1 60 | 1 61 | 1 62 | 1 63 | 1 64 | 1 65 | 1 66 | 1 67 | 1 68 | 1 69 | 1 70 | 1 71 | 1 72 | 1 73 | 1 74 | 1 75 | 1 76 | 1 77 | 1 78 | 1 79 | 1 80 | 1 81 | 1 82 | 1 83 | 1 84 | 1 85 | 1 86 | 1 87 | 1 88 | 1 89 | 1 90 | 1 91 | 1 92 | 1 93 | 1 94 | 1 
95 | 1 96 | 1 97 | 1 98 | 1 99 | 0 100 | 1 101 | 1 102 | 1 103 | 1 104 | 1 105 | 1 106 | 1 107 | 1 108 | 1 109 | 1 110 | 1 111 | 1 112 | 0 113 | 1 114 | 1 115 | 1 116 | 1 117 | 1 118 | 1 119 | 1 120 | 1 121 | 1 122 | 1 123 | 1 124 | 1 125 | 1 126 | 1 127 | 1 128 | 1 129 | 1 130 | 1 131 | 1 132 | 1 133 | 1 134 | 1 135 | 1 136 | 1 137 | 1 138 | 1 139 | 1 140 | 1 141 | 1 142 | 1 143 | 1 144 | 1 145 | 1 146 | 1 147 | 1 148 | 1 149 | 1 150 | 1 151 | 1 152 | 1 153 | 1 154 | 1 155 | 1 156 | 1 157 | 1 158 | 1 159 | 1 160 | 0 161 | 1 162 | 1 163 | 1 164 | 1 165 | 1 166 | 1 167 | 1 168 | 1 169 | 1 170 | 1 171 | 1 172 | 1 173 | 0 174 | 1 175 | 1 176 | 1 177 | 1 178 | 1 179 | 1 180 | 1 181 | 1 182 | 1 183 | 1 184 | 1 185 | 1 186 | 1 187 | 1 188 | 1 189 | 1 190 | 1 191 | 1 192 | 1 193 | 1 194 | 1 195 | 1 196 | 1 197 | 1 198 | 1 199 | 1 200 | 1 201 | 1 202 | 1 203 | 1 204 | 1 205 | 1 206 | 1 207 | 1 208 | 1 209 | 1 210 | 1 211 | 1 212 | 1 213 | 1 214 | 1 215 | 1 216 | 1 217 | 1 218 | 1 219 | 1 220 | 1 221 | 1 222 | 1 223 | 1 224 | 1 225 | 1 226 | 1 227 | 1 228 | 1 229 | 1 230 | 1 231 | 1 232 | 1 233 | 1 234 | 1 235 | 1 236 | 0 237 | 1 238 | 1 239 | 1 240 | 1 241 | 1 242 | 1 243 | 1 244 | 1 245 | 1 246 | 1 247 | 1 248 | 1 249 | 1 250 | 1 251 | 1 252 | 1 253 | 1 254 | 1 255 | 1 256 | 1 257 | 1 258 | 1 259 | 1 260 | 1 261 | 1 262 | 1 263 | 1 264 | 1 265 | 1 266 | 1 267 | 1 268 | 1 269 | 1 270 | 1 271 | 1 272 | 1 273 | 1 274 | 1 275 | 1 276 | 1 277 | 1 278 | 1 279 | 1 280 | 1 281 | 1 282 | 1 283 | 1 284 | 1 285 | 1 286 | 1 287 | 1 288 | 1 289 | 1 290 | 1 291 | 1 292 | 1 293 | 1 294 | 1 295 | 1 296 | 1 297 | 1 298 | 1 299 | 1 300 | 1 301 | 1 302 | 1 303 | 1 304 | 1 305 | 1 306 | 0 307 | 1 308 | 1 309 | 1 310 | 1 311 | 1 312 | 1 313 | 1 314 | 1 315 | 1 316 | 1 317 | 1 318 | 1 319 | 1 320 | 1 321 | 1 322 | 1 323 | 1 324 | 1 325 | 1 326 | 1 327 | 1 328 | 1 329 | 1 330 | 1 331 | 1 332 | 1 333 | 1 334 | 1 335 | 1 336 | 1 337 | 1 338 | 1 339 | 1 340 | 1 341 | 1 342 | 1 343 | 1 344 | 1 345 
| 1 346 | 1 347 | 1 348 | 1 349 | 1 350 | 1 351 | 1 352 | 1 353 | 1 354 | 1 355 | 0 356 | 1 357 | 1 358 | 1 359 | 1 360 | 1 361 | 1 362 | 1 363 | 1 364 | 1 365 | 1 366 | 1 367 | 1 368 | 1 369 | 1 370 | 1 371 | 1 372 | 1 373 | 1 374 | 1 375 | 1 376 | 1 377 | 1 378 | 0 379 | 1 380 | 1 381 | 1 382 | 1 383 | 1 384 | 1 385 | 1 386 | 1 387 | 1 388 | 0 389 | 1 390 | 1 391 | 1 392 | 1 393 | 1 394 | 1 395 | 1 396 | 1 397 | 1 398 | 1 399 | 1 400 | 1 401 | 1 402 | 1 403 | 1 404 | 1 405 | 1 406 | 1 407 | 1 408 | 1 409 | 1 410 | 1 411 | 1 412 | 1 413 | 1 414 | 1 415 | 1 416 | 1 417 | 1 418 | 1 419 | 1 420 | 1 421 | 1 422 | 1 423 | 1 424 | 1 425 | 1 426 | 1 427 | 1 428 | 1 429 | 0 430 | 1 431 | 1 432 | 1 433 | 0 434 | 1 435 | 1 436 | 1 437 | 1 438 | 1 439 | 1 440 | 1 441 | 1 442 | 1 443 | 1 444 | 1 445 | 1 446 | 1 447 | 1 448 | 1 449 | 1 450 | 1 451 | 1 452 | 0 453 | 1 454 | 1 455 | 1 456 | 1 457 | 1 458 | 1 459 | 1 460 | 1 461 | 1 462 | 1 463 | 1 464 | 1 465 | 0 466 | 1 467 | 1 468 | 1 469 | 1 470 | 1 471 | 1 472 | 1 473 | 1 474 | 1 475 | 1 476 | 1 477 | 1 478 | 1 479 | 1 480 | 1 481 | 1 482 | 1 483 | 1 484 | 1 485 | 1 486 | 1 487 | 1 488 | 1 489 | 1 490 | 1 491 | 1 492 | 1 493 | 1 494 | 1 495 | 1 496 | 1 497 | 1 498 | 1 499 | 1 500 | 1 501 | 1 502 | 1 503 | 1 504 | 1 505 | 1 506 | 1 507 | 1 508 | 1 509 | 1 510 | 1 511 | 1 512 | 1 513 | 1 514 | 1 515 | 1 516 | 1 517 | 1 518 | 1 519 | 0 520 | 0 521 | 0 522 | 0 523 | 0 524 | 1 525 | 1 526 | 1 527 | 1 528 | 1 529 | 1 530 | 1 531 | 1 532 | 1 533 | 1 534 | 1 535 | 1 536 | 1 537 | 1 538 | 1 539 | 1 540 | 1 541 | 1 542 | 0 543 | 1 544 | 0 545 | 1 546 | 1 547 | 1 548 | 1 549 | 1 550 | 1 551 | 1 552 | 1 553 | 1 554 | 1 555 | 1 556 | 1 557 | 1 558 | 1 559 | 1 560 | 1 561 | 1 562 | 1 563 | 1 564 | 0 565 | 1 566 | 1 567 | 1 568 | 1 569 | 1 570 | 1 571 | 1 572 | 1 573 | 1 574 | 1 575 | 1 576 | 1 577 | 1 578 | 0 579 | 0 580 | 0 581 | 0 582 | 0 583 | 0 584 | 0 585 | 0 586 | 0 587 | 1 588 | 1 589 | 1 590 | 1 591 | 1 592 | 1 593 | 1 594 | 0 595 
| 0 596 | 1 597 | 1 598 | 1 599 | 1 600 | 1 601 | 0 602 | 0 603 | 1 604 | 1 605 | 1 606 | 0 607 | 1 608 | 1 609 | 1 610 | 1 611 | 1 612 | 0 613 | 1 614 | 1 615 | 1 616 | 1 617 | 1 618 | 1 619 | 1 620 | 0 621 | 1 622 | 1 623 | 1 624 | 1 625 | 1 626 | 1 627 | 1 628 | 1 629 | 1 630 | 1 631 | 1 632 | 1 633 | 1 634 | 1 635 | 1 636 | 1 637 | 1 638 | 0 639 | 1 640 | 1 641 | 1 642 | 1 643 | 1 644 | 1 645 | 0 646 | 1 647 | 1 648 | 1 649 | 1 650 | 1 651 | 1 652 | 1 653 | 1 654 | 1 655 | 1 656 | 1 657 | 1 658 | 0 659 | 1 660 | 1 661 | 1 662 | 1 663 | 1 664 | 1 665 | 1 666 | 1 667 | 1 668 | 1 669 | 1 670 | 1 671 | 1 672 | 1 673 | 1 674 | 1 675 | 1 676 | 1 677 | 1 678 | 1 679 | 1 680 | 1 681 | 1 682 | 1 683 | 1 684 | 1 685 | 1 686 | 1 687 | 1 688 | 1 689 | 1 690 | 1 691 | 0 692 | 1 693 | 1 694 | 1 695 | 1 696 | 1 697 | 1 698 | 1 699 | 1 700 | 1 701 | 1 702 | 0 703 | 1 704 | 1 705 | 1 706 | 1 707 | 1 708 | 1 709 | 1 710 | 0 711 | 1 712 | 1 713 | 1 714 | 1 715 | 1 716 | 1 717 | 1 718 | 1 719 | 0 720 | 0 721 | 0 722 | 0 723 | 1 724 | 1 725 | 1 726 | 1 727 | 1 728 | 1 729 | 1 730 | 1 731 | 1 732 | 1 733 | 1 734 | 1 735 | 1 736 | 1 737 | 1 738 | 1 739 | 1 740 | 1 741 | 1 742 | 1 743 | 1 744 | 0 745 | 1 746 | 1 747 | 1 748 | 1 749 | 1 750 | 1 751 | 1 752 | 1 753 | 0 754 | 0 755 | 0 756 | 0 757 | 0 758 | 0 759 | 0 760 | 0 761 | 0 762 | 0 763 | 0 764 | 0 765 | 0 766 | 0 767 | 0 768 | 0 769 | 0 770 | 0 771 | 0 772 | 0 773 | 0 774 | 0 775 | 0 776 | 0 777 | 0 778 | 0 779 | 0 780 | 0 781 | 0 782 | 0 783 | 0 784 | 0 785 | 0 786 | 0 787 | 0 788 | 0 789 | 0 790 | 0 791 | 0 792 | 0 793 | 0 794 | 0 795 | 0 796 | 0 797 | 0 798 | 0 799 | 0 800 | 0 801 | 0 802 | 0 803 | 0 804 | 0 805 | 0 806 | 0 807 | 0 808 | 0 809 | 0 810 | 0 811 | 0 812 | 0 813 | 0 814 | 0 815 | 0 816 | 0 817 | 0 818 | 0 819 | 0 820 | 0 821 | 0 822 | 0 823 | 0 824 | 0 825 | 0 826 | 0 827 | 0 828 | 0 829 | 0 830 | 0 831 | 0 832 | 0 833 | 0 834 | 0 835 | 0 836 | 0 837 | 0 838 | 0 839 | 0 840 | 0 841 | 0 842 | 0 843 | 0 844 | 0 845 
| 0 846 | 0 847 | 0 848 | 0 849 | 0 850 | 0 851 | 0 852 | 0 853 | 0 854 | 0 855 | 0 856 | 0 857 | 0 858 | 0 859 | 0 860 | 0 861 | 0 862 | 0 863 | 0 864 | 0 865 | 0 866 | 0 867 | 0 868 | 0 869 | 0 870 | 0 871 | 0 872 | 0 873 | 0 874 | 0 875 | 0 876 | 0 877 | 0 878 | 0 879 | 0 880 | 0 881 | 0 882 | 0 883 | 0 884 | 0 885 | 0 886 | 0 887 | 0 888 | 0 889 | 0 890 | 0 891 | 0 892 | 0 893 | 0 894 | 0 895 | 0 896 | 0 897 | 0 898 | 0 899 | 0 900 | 0 901 | 0 902 | 0 903 | 0 904 | 0 905 | 0 906 | 0 907 | 0 908 | 0 909 | 0 910 | 0 911 | 0 912 | 0 913 | 0 914 | 0 915 | 0 916 | 0 917 | 0 918 | 0 919 | 0 920 | 0 921 | 0 922 | 0 923 | 0 924 | 0 925 | 0 926 | 0 927 | 0 928 | 0 929 | 0 930 | 0 931 | 0 932 | 0 933 | 0 934 | 0 935 | 0 936 | 0 937 | 0 938 | 0 939 | 0 940 | 0 941 | 0 942 | 0 943 | 0 944 | 0 945 | 0 946 | 0 947 | 0 948 | 0 949 | 0 950 | 0 951 | 0 952 | 0 953 | 0 954 | 0 955 | 0 956 | 0 957 | 0 958 | 0 959 | 0 960 | 0 961 | 0 962 | 0 963 | 0 964 | 0 965 | 0 966 | 0 967 | 0 968 | 0 969 | 0 970 | 0 971 | 0 972 | 0 973 | 0 974 | 0 975 | 0 976 | 0 977 | 0 978 | 0 979 | 0 980 | 0 981 | 0 982 | 0 983 | 0 984 | 0 985 | 0 986 | 0 987 | 0 988 | 0 989 | 0 990 | 0 991 | 0 992 | 0 993 | 0 994 | 0 995 | 0 996 | 0 997 | 0 998 | 0 999 | 0 1000 | 0 1001 | 0 1002 | --------------------------------------------------------------------------------