├── README.md
├── LICENSE
└── The Skeleton Notebook.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # skeleton-notebook
2 | A skeleton notebook template for supervised machine learning and data science projects.
3 | 
4 | Readme to be updated.
5 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Sajal Sharma
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/The Skeleton Notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Skeleton Notebook"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "As the name suggests, this notebook/template serves as a starting point for most supervised machine learning projects that involve common tasks such as data exploration, cleaning, transformation and preparation, and data modelling (using machine learning or deep learning techniques).\n",
15 | "\n",
16 | "I've tried to build the notebook to provide a set workflow for handling the above tasks. These are arranged in sections, which you are encouraged to expand into sub-sections to handle the appropriate tasks. I've also tried to include common sub-tasks, and the code required to do them (usually Pandas or scikit-learn). \n",
17 | "\n",
18 | "Sections included:\n",
19 | "\n",
20 | "- Housekeeping and Imports\n",
21 | "- Data Loading\n",
22 | "- Data Exploration\n",
23 | "- Data Cleaning\n",
24 | "- Feature Engineering\n",
25 | "- Data Transformation and Preparation\n",
26 | "- Model Exploration and Performance Analysis\n",
27 | "- Final Model Building\n",
28 | "\n",
29 | "It is suggested to use separate notebooks if any of the above tasks are performed in depth. "
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Housekeeping and Imports"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "For importing libraries necessary for the project, and for basic preprocessing functions (ex: type conversion for NLP projects).\n",
\n", 44 | "\n", 45 | "We're going to import commonly used Data Science libraries, so make sure they're available for your Python set-up." 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 2, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "# Import libraries necessary for projects\n", 57 | "import numpy as np \n", 58 | "import pandas as pd\n", 59 | "from time import time\n", 60 | "from IPython.display import display # Allows the use of display() for DataFrames\n", 61 | "\n", 62 | "# Import visualisation libraries\n", 63 | "import seaborn as sns\n", 64 | "import matplotlib.pyplot as plt\n", 65 | "\n", 66 | "# Pretty display for notebooks\n", 67 | "%matplotlib inline\n", 68 | "\n", 69 | "# Make division futuristic for Python 2\n", 70 | "from __future__ import division" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 26, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "#Cell for Housekeeping code" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": { 88 | "collapsed": true 89 | }, 90 | "outputs": [], 91 | "source": [] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## Data Loading" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "For loading data files into appropriate variables." 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 45, 110 | "metadata": { 111 | "collapsed": false 112 | }, 113 | "outputs": [], 114 | "source": [ 115 | "#Loading the data file (ex: csv) using Pandas\n", 116 | "# data = pd.read_csv('') #insert path to file\n", 117 | "\n", 118 | "#Next steps?:\n", 119 | "# Loading the test data?\n", 120 | "# Loading the feaure vectors (X) and the prediction vector (Y) into different variables" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "collapsed": true 128 | }, 129 | "outputs": [], 130 | "source": [] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "## Data Exploration" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Section for **exploratory analysis** on the available data. \n", 144 | "\n", 145 | "The exploration techniques vary for numerical, categorical, or time-series variables. Currently, \n", 146 | "\n", 147 | "Here we typically:\n", 148 | "\n", 149 | "- look at example records in the dataset\n", 150 | "- investigate the datatypes of variables in the dataset\n", 151 | "- calculate and investigate descriptive statistics (ex: central tendencies, variability etc.)\n", 152 | "- investigate distribution of feature vectors (ex: to check for skewness and outliers)\n", 153 | "- investigate distribution of prediction vector\n", 154 | "- check out the relationship (ex: correlation) between different features\n", 155 | "- check out the relationship between feature vectors and prediction vector\n", 156 | "\n", 157 | "Common steps to check the health of the data:\n", 158 | "\n", 159 | "- Check for missing data\n", 160 | "- Check the skewness of the data, outlier detection\n", 161 | "- etc..." 
162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### Look at Example Records" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 28, 174 | "metadata": { 175 | "collapsed": true 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "# data.head(5) #Display out the first 5 records\n", 180 | "\n", 181 | "# Additional:\n", 182 | "# Look at last few records using data.tail()" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": { 189 | "collapsed": true 190 | }, 191 | "outputs": [], 192 | "source": [] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Data-types, completeness Information\n", 199 | "\n", 200 | "Using the Pandas \"info\" function, in addition to the data-type information for the dataset, we can look at counts of available records/missing records too." 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 29, 206 | "metadata": { 207 | "collapsed": true 208 | }, 209 | "outputs": [], 210 | "source": [ 211 | "# data.info()" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": { 218 | "collapsed": true 219 | }, 220 | "outputs": [], 221 | "source": [] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Descriptive Statistics" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 30, 233 | "metadata": { 234 | "collapsed": true 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "# data.describe()\n", 239 | "\n", 240 | "# Additonal: \n", 241 | "# We can also make a guess at the skewness of the data at this stage by looking at the difference between\n", 242 | "# the means and medians of numerical features" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "collapsed": true 250 | }, 251 | "outputs": [], 252 | "source": [] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": { 257 | "collapsed": true 258 | }, 259 | "source": [ 260 | "### Visualizaton: Distribution of features" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "*Section has great potential for expansion.* \n", 268 | "\n", 269 | "Visualization techniques differ depending on the type of the feature vector (i.e. numerical: continuous or discrete, categorical: ordinal etc). Techniques will also depend on the type of data being dealt with, and the insight that we want to extract from it. \n", 270 | "\n", 271 | "Common visualization techniques include:\n", 272 | "- Bar Plots: Visualize the frequency distribution of categorical features.\n", 273 | "- Histograms: Visualize the frequency distribution of numerical features.\n", 274 | "- Box Plots: Visualize a numerical feature, while providing more information like the median, lower/upper quantiles etc..\n", 275 | "- Scatter Plots: Visualize the relationship (usually the correlation) between two features. Can include a goodness of fit line, to serve as a regression plot.\n", 276 | "\n", 277 | "Below are example code snippets to draw these using seaborn." 
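, "\n",
"\n",
"For the plot types above that the snippets below don't cover, a purely illustrative sketch (assuming the `data` DataFrame and placeholder column names) might look like:\n",
"\n",
"```python\n",
"sns.boxplot(x='some_categorical_column', y='some_numerical_column', data=data)   # box plot per category\n",
"data['some_numerical_column'].hist(bins=30)   # histogram via pandas/matplotlib\n",
"```"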
278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 31, 283 | "metadata": { 284 | "collapsed": true 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "#Example: drawing a seaborn barplot\n", 289 | "#sns.barplot(x=\"\",y=\"\",hue=\"\",data=\"\")\n", 290 | "\n", 291 | "#Can also use pandas/matplotlib for histograms (numerical features) or barplots ()" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 32, 297 | "metadata": { 298 | "collapsed": true 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "# Example: drawing a seaborn regplot\n", 303 | "# sns.regplot(data[feature1],data[feature2])" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 33, 309 | "metadata": { 310 | "collapsed": true 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "#Example: drawing a pandas scatter_matrix\n", 315 | "# pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": { 322 | "collapsed": true 323 | }, 324 | "outputs": [], 325 | "source": [] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "### Investigating correlations between features" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": { 338 | "collapsed": true 339 | }, 340 | "outputs": [], 341 | "source": [] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": { 347 | "collapsed": true 348 | }, 349 | "outputs": [], 350 | "source": [] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "### Visualizing prediction vector" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "metadata": { 363 | "collapsed": true 364 | }, 365 | "outputs": [], 366 | "source": [] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "### Investigating missing values" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": { 379 | "collapsed": true 380 | }, 381 | "outputs": [], 382 | "source": [] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "### Outlier Detection" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "The presence of outliers can often skew results which take into consideration these data points. \n", 396 | "\n", 397 | "One approach to detect outliers is to use Tukey's Method for identfying them: An outlier step is calculated as 1.5 times the interquartile range (IQR). 
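Concretely, if Q1 and Q3 are the 25th and 75th percentiles of a feature, values outside [Q1 - 1.5*(Q3 - Q1), Q3 + 1.5*(Q3 - Q1)] are flagged.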
A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.\n", 398 | "\n", 399 | "One such pipeline for detecting outliers is below:" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 47, 405 | "metadata": { 406 | "collapsed": true 407 | }, 408 | "outputs": [], 409 | "source": [ 410 | "# def find_outliers(data):\n", 411 | "\n", 412 | "# #Checking for outliers that occur for more than one feature\n", 413 | "# outliers = []\n", 414 | "\n", 415 | "# # For each feature find the data points with extreme high or low values\n", 416 | "# for feature in [list of features to investigate]:\n", 417 | "\n", 418 | "# # TODO: Calculate Q1 (25th percentile of the data) for the given feature\n", 419 | "# Q1 = np.percentile(data[feature],25)\n", 420 | "\n", 421 | "# # TODO: Calculate Q3 (75th percentile of the data) for the given feature\n", 422 | "# Q3 = np.percentile(data[feature],75)\n", 423 | "\n", 424 | "# # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)\n", 425 | "# step = (Q3-Q1) * 1.5\n", 426 | "\n", 427 | "# # Display the outliers\n", 428 | "# out = data[~((data[feature] >= Q1 - step) & (data[feature] <= Q3 + step))]\n", 429 | "# print \"Number of outliers for the feature '{}': {}\".format(feature, len(out))\n", 430 | "# outliers = outliers + list(out.index.values)\n", 431 | "\n", 432 | "\n", 433 | "# #Creating list of more outliers which are the same for multiple features.\n", 434 | "# outliers = list(set([x for x in outliers if outliers.count(x) > 1])) \n", 435 | " \n", 436 | "# return outliers\n", 437 | " \n", 438 | "# print \"Data points considered outliers for more than one feature: {}\".format(find_outliers(data))" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": { 445 | "collapsed": true 446 | }, 447 | "outputs": [], 448 | "source": [] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "metadata": { 454 | "collapsed": true 455 | }, 456 | "outputs": [], 457 | "source": [] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "## Data Cleaning" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "### Imputing missing values" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": { 477 | "collapsed": true 478 | }, 479 | "outputs": [], 480 | "source": [] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "### Cleaning outliers or error values" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": 48, 492 | "metadata": { 493 | "collapsed": false 494 | }, 495 | "outputs": [], 496 | "source": [ 497 | "# Remove the outliers, if any were specified \n", 498 | "# good_data = data.drop(data.index[outliers]).reset_index(drop = True)\n", 499 | "# print \"The good dataset now has {} observations after removing outliers.\".format(len(good_data))" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": { 506 | "collapsed": true 507 | }, 508 | "outputs": [], 509 | "source": [] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "## Feature Engineering" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "Section to extract more features from those currently 
available." 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 34, 528 | "metadata": { 529 | "collapsed": true 530 | }, 531 | "outputs": [], 532 | "source": [ 533 | "# code " 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "## Data Transformation and Preparation" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "### Transforming Skewed Continous Features " 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "It is common practice to apply a logarthmic transformation to highly skewed continuous feature distributions. A typical flow for this is in a commented code block below." 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 35, 560 | "metadata": { 561 | "collapsed": true 562 | }, 563 | "outputs": [], 564 | "source": [ 565 | "# skewered = [list of skewed continuous features]\n", 566 | "# raw_features[skewed] = data[skewed].apply(lambda x: np.log(x+1))" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": null, 572 | "metadata": { 573 | "collapsed": true 574 | }, 575 | "outputs": [], 576 | "source": [] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "### Normalizing Numerical Features " 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": {}, 588 | "source": [ 589 | "Another common practice is to perform some type of scaling on numerical features. Applying scaling doesn't change the shape of each feature's distribution; but ensures that each feature is treated equally when applying supervised learners. An example workflow of achieving normalisation using the MinMaxScaler module of sklearn is below:" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": 36, 595 | "metadata": { 596 | "collapsed": true 597 | }, 598 | "outputs": [], 599 | "source": [ 600 | "# from sklearn.preprocessing import MinMaxScaler\n", 601 | "\n", 602 | "# scaler = MinMaxScaler()\n", 603 | "# numerical = [list of skewed numerical features]\n", 604 | "# raw_features[numerical] = scaler.fit_transform(data[numerical])" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": 37, 610 | "metadata": { 611 | "collapsed": true 612 | }, 613 | "outputs": [], 614 | "source": [ 615 | "# Checking examples after transformation\n", 616 | "# raw_features.head()" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": null, 622 | "metadata": { 623 | "collapsed": true 624 | }, 625 | "outputs": [], 626 | "source": [] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "### One Hot Encoding Categorical Features" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 38, 638 | "metadata": { 639 | "collapsed": true 640 | }, 641 | "outputs": [], 642 | "source": [ 643 | "# Using Pandas get_dummies function\n", 644 | "# features = pd.get_dummies(raw_features)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 39, 650 | "metadata": { 651 | "collapsed": true 652 | }, 653 | "outputs": [], 654 | "source": [ 655 | "#Encoding categorical prediction vector to numerical ?" 
656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": { 662 | "collapsed": true 663 | }, 664 | "outputs": [], 665 | "source": [] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "metadata": {}, 670 | "source": [ 671 | "It is encouraged to create a pipeline function for data preprocessing, rather than separate script blocks." 672 | ] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "metadata": {}, 677 | "source": [ 678 | "### Shuffle and Split Data" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": 41, 684 | "metadata": { 685 | "collapsed": true 686 | }, 687 | "outputs": [], 688 | "source": [ 689 | "# from sklearn.cross_validation import train_test_split\n", 690 | "\n", 691 | "# X_train, X_test, y_train, y_test = train_test_split(features, prediction_vector, test_size = 0.2, random_state = 0)\n", 692 | "\n", 693 | "# Show the results of the split\n", 694 | "# print \"Training set has {} samples.\".format(X_train.shape[0])\n", 695 | "# print \"Testing set has {} samples.\".format(X_test.shape[0])" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "metadata": { 702 | "collapsed": true 703 | }, 704 | "outputs": [], 705 | "source": [] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "## Model Exploration" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": {}, 717 | "source": [ 718 | "### Naive Predictor Performance" 719 | ] 720 | }, 721 | { 722 | "cell_type": "markdown", 723 | "metadata": {}, 724 | "source": [ 725 | "To set a baseline for the performance of the predictor. \n", 726 | "\n", 727 | "Common techniques:\n", 728 | "- For categorical prediction vector, choose the most common class\n", 729 | "- For numerical prediction vector, choose a measure of central tendency\n", 730 | "\n", 731 | "Then calculate the evalation metric (accuracy, f-score etc)" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": 25, 737 | "metadata": { 738 | "collapsed": true 739 | }, 740 | "outputs": [], 741 | "source": [ 742 | "#Code to implement the above" 743 | ] 744 | }, 745 | { 746 | "cell_type": "markdown", 747 | "metadata": {}, 748 | "source": [ 749 | "### Choosing scoring metrics" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": null, 755 | "metadata": { 756 | "collapsed": true 757 | }, 758 | "outputs": [], 759 | "source": [ 760 | "# from sklearn.metrics import accuracy_score, fbeta_score" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": {}, 766 | "source": [ 767 | "### Creating a Training and Prediction Pipeling" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 42, 773 | "metadata": { 774 | "collapsed": true 775 | }, 776 | "outputs": [], 777 | "source": [ 778 | "#Importing models from sklearn, or tensorflow/keras components" 779 | ] 780 | }, 781 | { 782 | "cell_type": "markdown", 783 | "metadata": {}, 784 | "source": [ 785 | "Change below as seen fit" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "metadata": { 792 | "collapsed": true 793 | }, 794 | "outputs": [], 795 | "source": [ 796 | "# def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): \n", 797 | "# '''\n", 798 | "# inputs:\n", 799 | "# - learner: the learning algorithm to be trained and predicted on\n", 800 | "# - sample_size: the size of samples (number) to be drawn from training set\n", 801 | "# 
- X_train: features training set\n",
802 | "# - y_train: target training set\n",
803 | "# - X_test: features testing set\n",
804 | "# - y_test: target testing set\n",
805 | "# '''\n",
806 | " \n",
807 | "# results = {}\n",
808 | " \n",
809 | "# # TODO: Fit the learner to the training data using slicing with 'sample_size'\n",
810 | "# start = time() # Get start time\n",
811 | "# learner = learner.fit(X_train[:sample_size],y_train[:sample_size])\n",
812 | "# end = time() # Get end time\n",
813 | " \n",
814 | "# # TODO: Calculate the training time\n",
815 | "# results['train_time'] = end - start\n",
816 | " \n",
817 | "# # TODO: Get the predictions on the test set,\n",
818 | "# # then get predictions on the first 300 training samples\n",
819 | "# start = time() # Get start time\n",
820 | "# predictions_test = learner.predict(X_test)\n",
821 | "# predictions_train = learner.predict(X_train[:300])\n",
822 | "# end = time() # Get end time\n",
823 | " \n",
824 | "# # TODO: Calculate the total prediction time\n",
825 | "# results['pred_time'] = end - start\n",
826 | " \n",
827 | "# # TODO: Compute accuracy on the first 300 training samples\n",
828 | "# results['acc_train'] = accuracy_score(y_train[:300],predictions_train)\n",
829 | " \n",
830 | "# # TODO: Compute accuracy on test set\n",
831 | "# results['acc_test'] = accuracy_score(y_test,predictions_test)\n",
832 | " \n",
833 | "# # TODO: Compute F-score on the first 300 training samples\n",
834 | "# results['f_train'] = fbeta_score(y_train[:300],predictions_train,0.5)\n",
835 | " \n",
836 | "# # TODO: Compute F-score on the test set\n",
837 | "# results['f_test'] = fbeta_score(y_test,predictions_test,0.5)\n",
838 | " \n",
839 | "# # Success\n",
840 | "# print \"{} trained on {} samples.\".format(learner.__class__.__name__, sample_size)\n",
841 | " \n",
842 | "# # Return the results\n",
843 | "# return results"
844 | ]
845 | },
846 | {
847 | "cell_type": "markdown",
848 | "metadata": {
849 | "collapsed": true
850 | },
851 | "source": [
852 | "### Model Evaluation"
853 | ]
854 | },
855 | {
856 | "cell_type": "code",
857 | "execution_count": 44,
858 | "metadata": {
859 | "collapsed": false
860 | },
861 | "outputs": [],
862 | "source": [
863 | "# Change the list of classifiers and the code below as seen fit; 
we probably also don't need to see the effects of\n", 864 | "# different sample sizes" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": null, 870 | "metadata": { 871 | "collapsed": true 872 | }, 873 | "outputs": [], 874 | "source": [ 875 | "# # TODO: Import the three supervised learning models from sklearn\n", 876 | "# from sklearn.tree import DecisionTreeClassifier\n", 877 | "# from sklearn.svm import SVC\n", 878 | "# from sklearn.ensemble import AdaBoostClassifier\n", 879 | "\n", 880 | "# # TODO: Initialize the three models, the random states are set to 101 so we know how to reproduce the model later\n", 881 | "# clf_A = DecisionTreeClassifier(random_state=101)\n", 882 | "# clf_B = SVC(random_state = 101)\n", 883 | "# clf_C = AdaBoostClassifier(random_state = 101)\n", 884 | "\n", 885 | "# # TODO: Calculate the number of samples for 1%, 10%, and 100% of the training data\n", 886 | "# samples_1 = int(round(len(X_train) / 100))\n", 887 | "# samples_10 = int(round(len(X_train) / 10))\n", 888 | "# samples_100 = len(X_train)\n", 889 | "\n", 890 | "# # Collect results on the learners in a dictionary\n", 891 | "# results = {}\n", 892 | "# for clf in [clf_A, clf_B, clf_C]:\n", 893 | "# clf_name = clf.__class__.__name__\n", 894 | "# results[clf_name] = {}\n", 895 | "# for i, samples in enumerate([samples_1, samples_10, samples_100]):\n", 896 | "# results[clf_name][i] = \\\n", 897 | "# train_predict(clf, samples, X_train, y_train, X_test, y_test)" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "Printing out the results" 905 | ] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "execution_count": null, 910 | "metadata": { 911 | "collapsed": true 912 | }, 913 | "outputs": [], 914 | "source": [ 915 | "# #Printing out the values\n", 916 | "# for i in results.items():\n", 917 | "# print i[0]\n", 918 | "# display(pd.DataFrame(i[1]).rename(columns={0:'1%', 1:'10%', 2:'100%'}))" 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": null, 924 | "metadata": { 925 | "collapsed": true 926 | }, 927 | "outputs": [], 928 | "source": [] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "## Final Model Building" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "metadata": {}, 940 | "source": [ 941 | "Using grid search (GridSearchCV) with different parameter/value combinations, we can tune our model for even better results.\n", 942 | "\n", 943 | "Example with Adaboost below" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": { 950 | "collapsed": true 951 | }, 952 | "outputs": [], 953 | "source": [ 954 | "# # TODO: Import 'GridSearchCV', 'make_scorer', and any other necessary libraries\n", 955 | "# from sklearn.grid_search import GridSearchCV\n", 956 | "# from sklearn.metrics import make_scorer\n", 957 | "\n", 958 | "# # TODO: Initialize the classifier\n", 959 | "# clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())\n", 960 | "\n", 961 | "# # TODO: Create the parameters list you wish to tune\n", 962 | "# parameters = {'n_estimators':[50, 120], \n", 963 | "# 'learning_rate':[0.1, 0.5, 1.],\n", 964 | "# 'base_estimator__min_samples_split' : np.arange(2, 8, 2),\n", 965 | "# 'base_estimator__max_depth' : np.arange(1, 4, 1)\n", 966 | "# }\n", 967 | "\n", 968 | "# # TODO: Make an fbeta_score scoring object\n", 969 | "# scorer = make_scorer(fbeta_score,beta=0.5)\n", 970 | "\n", 971 | "# # TODO: 
Perform grid search on the classifier using 'scorer' as the scoring method\n",
972 | "# grid_obj = GridSearchCV(clf, parameters, scoring=scorer)\n",
973 | "\n",
974 | "# # TODO: Fit the grid search object to the training data and find the optimal parameters\n",
975 | "# grid_fit = grid_obj.fit(X_train,y_train)\n",
976 | "\n",
977 | "# # Get the estimator\n",
978 | "# best_clf = grid_fit.best_estimator_\n",
979 | "\n",
980 | "# # Make predictions using the unoptimized and optimized models\n",
981 | "# predictions = (clf.fit(X_train, y_train)).predict(X_test)\n",
982 | "# best_predictions = best_clf.predict(X_test)\n",
983 | "\n",
984 | "# # Report the before-and-after scores\n",
985 | "# print \"Unoptimized model\\n------\"\n",
986 | "# print \"Accuracy score on testing data: {:.4f}\".format(accuracy_score(y_test, predictions))\n",
987 | "# print \"F-score on testing data: {:.4f}\".format(fbeta_score(y_test, predictions, beta = 0.5))\n",
988 | "# print \"\\nOptimized Model\\n------\"\n",
989 | "# print \"Final accuracy score on the testing data: {:.4f}\".format(accuracy_score(y_test, best_predictions))\n",
990 | "# print \"Final F-score on the testing data: {:.4f}\".format(fbeta_score(y_test, best_predictions, beta = 0.5))\n",
991 | "# print best_clf"
992 | ]
993 | },
994 | {
995 | "cell_type": "code",
996 | "execution_count": null,
997 | "metadata": {
998 | "collapsed": true
999 | },
1000 | "outputs": [],
1001 | "source": []
1002 | },
1003 | {
1004 | "cell_type": "markdown",
1005 | "metadata": {},
1006 | "source": [
1007 | "Next steps can include feature importance extraction, predictions on the test set, etc."
1008 | ]
1009 | },
1010 | {
1011 | "cell_type": "markdown",
1012 | "metadata": {},
1013 | "source": [
1014 | "## Predictions on Test Set"
1015 | ]
1016 | },
1017 | {
1018 | "cell_type": "code",
1019 | "execution_count": null,
1020 | "metadata": {
1021 | "collapsed": true
1022 | },
1023 | "outputs": [],
1024 | "source": []
1025 | }
1026 | ],
1027 | "metadata": {
1028 | "anaconda-cloud": {},
1029 | "kernelspec": {
1030 | "display_name": "Python [conda env:py27]",
1031 | "language": "python",
1032 | "name": "conda-env-py27-py"
1033 | },
1034 | "language_info": {
1035 | "codemirror_mode": {
1036 | "name": "ipython",
1037 | "version": 2
1038 | },
1039 | "file_extension": ".py",
1040 | "mimetype": "text/x-python",
1041 | "name": "python",
1042 | "nbconvert_exporter": "python",
1043 | "pygments_lexer": "ipython2",
1044 | "version": "2.7.12"
1045 | }
1046 | },
1047 | "nbformat": 4,
1048 | "nbformat_minor": 1
1049 | }
1050 | 
--------------------------------------------------------------------------------