├── 02.1_Numerical_variables.ipynb ├── 02.2_Categorical_variables.ipynb ├── 02.3_Dates.ipynb ├── 02.4_Mixed_variables.ipynb ├── 02.5_Bonus_Learn_more_on_Lending_Club_dataset.ipynb ├── 03.1_Missing_values.ipynb ├── 03.2_Outliers.ipynb ├── 03.3_Labels.ipynb ├── 03.4_Rare_values.ipynb ├── 03.5_Bonus_Machine_Learning_Algorithms_Overview.ipynb ├── 03.6_Bonus_Additional_reading_resources_on_variable_problems.ipynb ├── 04.1_Variable_magnitude.ipynb ├── 04.2_Linear_assumption.ipynb ├── 04.3_Variable_distribution.ipynb ├── 04.4_Bonus_Additional_reading_resources.ipynb ├── 05.1_Complete_Case_Analysis.ipynb ├── 05.2_Mean_and_median_imputation.ipynb ├── 05.3_Random_sample_imputation.ipynb ├── 05.4_Adding_a_variable_to_capture_NA.ipynb ├── 05.5_End_of_distribution_imputation.ipynb ├── 05.6_Arbitrary_value_imputation.ipynb ├── 06.1_Frequent_category_imputation.ipynb ├── 06.2_Random_sample_imputation.ipynb ├── 06.3_Adding_a_variable_to_capture_NA.ipynb ├── 06.4_Adding_a_category_to_capture_NA.ipynb ├── 07.1_Bonus_Overview_of_missing_ value_imputation_methods.ipynb ├── 07.2_Conclusion_when_to_use_each_NA_imputation_methods.ipynb ├── 08.1_Top_coding_bottom_coding_and_zero_coding.ipynb ├── 09.1_Engineering_rare_values_1.ipynb ├── 09.2_Engineering_rare_values_2.ipynb ├── 09.3_Engineering_rare_values_3.ipynb ├── 09.4_Engineering_rare_values.ipynb ├── 10.10_Bonus_Additional_reading_resources.ipynb ├── 10.1_One_hot_encoding.ipynb ├── 10.2_One_hot_encoding_variables_with_many_labels.ipynb ├── 10.3_Ordinal_numbering_encoding.ipynb ├── 10.4_Count_or_frequency_encoding.ipynb ├── 10.5_Target_guided_ordinal_encoding.ipynb ├── 10.6_Mean_encoding.ipynb ├── 10.7_Probability_ratio_encoding.ipynb ├── 10.8_Weight_of_evidence_WOE.ipynb ├── 10.9_Comparison_of_categorical_variable_encoding.ipynb ├── 11.1_Engineering_mixed_variables.ipynb ├── 12.1_Engineering_dates.ipynb ├── 13.1_Normalisation-Standarisation.ipynb ├── 13.2_Scaling_to_minimum_and_maximum_values.ipynb ├── 13.3_Scaling_to_median_and_quantiles.ipynb ├── 14.1_Gaussian_transformations.ipynb ├── 15.1_Equal_frequency_discretisation.ipynb ├── 15.2_Equal_width_discretisation.ipynb ├── 15.3_Domain_knowledge_discretisation.ipynb ├── 15.4_Discretisation_with_Decision_Trees.ipynb ├── 15.5_Bonus_Additional_reading_resources.ipynb ├── 16.2_Classification_I.ipynb ├── 16.3_Regression.ipynb └── README.md /02.4_Mixed_variables.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Mixed variables\n", 8 | "\n", 9 | "Mixed variables are those whose values contain both numbers and labels.\n", 10 | "\n", 11 | "Variables can be mixed for a variety of reasons. For example, when credit agencies gather and store financial information about users, the values of the variables they store are usually numbers. However, in some cases the credit agencies cannot retrieve information for a certain user, for a variety of reasons. In these situations, the agencies code each reason why the information could not be retrieved with a different code or 'label'. In this way, they generate mixed-type variables. These variables contain numbers when the value could be retrieved, and labels otherwise.\n", 12 | "\n", 13 | "As an example, think of the variable 'number_of_open_accounts'. It can take any number, representing the number of different financial accounts of the borrower.
Sometimes, information may not be available for a certain borrower, for a variety of reasons. Each reason will be coded with a different letter, for example: 'A': couldn't identify the person, 'B': no relevant data, 'C': person seems not to have any open account. So the same column may end up holding values like 3, 0, 'A', 1, 'C'.\n", 14 | "\n", 15 | "Another example of a mixed-type variable is missed_payment_status. This variable indicates whether a borrower has missed any payment on a financial product. For example, if the borrower has a credit card, this variable indicates whether they missed a monthly payment on it. Therefore, this variable can take the values 0, 1, 2 or 3, meaning that the customer has missed 0-3 payments on the account. It can also take the value D if the customer defaulted on that account.\n", 16 | "\n", 17 | "Typically, once the customer has missed 3 payments, the lender declares the item defaulted (D), which is why this variable takes the numerical values 0-3 and then D.\n", 18 | "\n", 19 | "\n", 20 | "For this lecture, you will need to download a toy csv file that I created and uploaded at the end of the lecture in Udemy. It is called sample_s2.csv." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "import pandas as pd\n", 32 | "\n", 33 | "import matplotlib.pyplot as plt\n", 34 | "%matplotlib inline" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 7, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
\n", 46 | "\n", 59 | "\n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | "
idopen_il_24m
01077501C
11077430A
21077175A
31076863A
41075358A
\n", 95 | "
" 96 | ], 97 | "text/plain": [ 98 | " id open_il_24m\n", 99 | "0 1077501 C\n", 100 | "1 1077430 A\n", 101 | "2 1077175 A\n", 102 | "3 1076863 A\n", 103 | "4 1075358 A" 104 | ] 105 | }, 106 | "execution_count": 7, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "# open_il_24m indicates:\n", 113 | "# \"Number of installment accounts opened in past 24 months\".\n", 114 | "# Installment accounts are those that, at the moment of acquiring them,\n", 115 | "# there is a set period and amount of repayments agreed between the\n", 116 | "# lender and borrower. An example of this is a car loan, or a student loan.\n", 117 | "# the borrowers know that they are going to pay a certain,\n", 118 | "# fixed amount over, for example 36 months.\n", 119 | "\n", 120 | "data = pd.read_csv('sample_s2.csv')\n", 121 | "data.head()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 8, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "(887379, 2)" 133 | ] 134 | }, 135 | "execution_count": 8, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "data.shape" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 9, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "array(['C', 'A', 'B', '0.0', '1.0', '2.0', '4.0', '3.0', '6.0', '5.0',\n", 153 | " '9.0', '7.0', '8.0', '13.0', '10.0', '19.0', '11.0', '12.0', '14.0',\n", 154 | " '15.0'], dtype=object)" 155 | ] 156 | }, 157 | "execution_count": 9, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | } 161 | ], 162 | "source": [ 163 | "# 'A': couldn't identify the person\n", 164 | "# 'B': no relevant data\n", 165 | "# 'C': person seems not to have any account open\n", 166 | "\n", 167 | "data.open_il_24m.unique()" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 10, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "data": { 177 | "text/plain": [ 178 | "" 179 | ] 180 | }, 181 | "execution_count": 10, 182 | "metadata": {}, 183 | "output_type": "execute_result" 184 | }, 185 | { 186 | "data": { 187 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAZsAAAEVCAYAAAA2IkhQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu8VXWd//HXG0HyAgiKpAJiQjVqqcmgpb/SmJCy1Ck1\n7SI6eBud0SabwpoZSqN0Jq1xUkcbyctkSjYpao7itTFHhLwhXkYSFPECAor3BD+/P77fLYvtOWev\nA2eds928n4/Hepy1v2t9v/uz1t5nf/b6ru9eSxGBmZlZlXr1dABmZtb6nGzMzKxyTjZmZlY5Jxsz\nM6uck42ZmVXOycbMzCrnZGNdTtJFkr7fQ88tST+XtFzS3W0s/7KkG3sgrtskHZXnj5B0R3fHYNaT\nnGzWA5IWSFosaZNC2VGSbuvBsKqyF/ApYGhEjKlfGBG/iIhx6/okkkLSyHVtpztJ2lvSUz0dR1fp\nyS811nlONuuPDYCTejqIzpK0QSerbAssiIhXqojHzNaOk83641+Ab0jarH6BpBH5m3rvQll9t8/v\nJf1Y0guSHpf0sVy+MB81TahrdgtJMyS9JOl2SdsW2v5gXrZM0qOSDiksu0jSeZJ+K+kVYJ824t1a\n0vRcf56ko3P5ROA/gI9KelnS99qou0YXVt7u4yQ9lrftHEnKy0bm2F+U9LykK3L573L1+/PzfFHS\nQEnXSlqSu/CulTS08cvydgzH5xheknSapO0l3SlphaRpkjYsrP9ZSffleO+U9OHCsgWSviHpgRz3\nFZLek49qrwe2zjG/LGnrNmLZT9K9+XkXSvpu3fK98nO+kJcfkcs3knSmpCfy894haaO8bH9Jc3Od\n2yT9Wd22jyw8fvtopXYkJunk/B57RtKRedkxwJeBb+ZtuSaXf0vSorwfH5U0tp19PkDSJfn1ekLS\nP0jqlZfV3u8/zdvySLGdXPfCHM8iSd9X/lJUe39J+lF+H8yX9Oky74OWFxGeWnwCFgB/AfwX8P1c\ndhRwW54fAQTQu1DnNuCoPH8EsBI4knSE9H3gSeAcoC8wDngJ2DSvf1F+/PG8/F+BO/KyTYCFua3e\nwK7A88AOhbovAnuSvgy9p43t+R1wLvAeYBdgCfDJQqx3dLAv1liet/taYDNgeG5rfF72S+A7tTiA\nverqjSw83hz4ArAx0A/4FXBVB/uzPoargf7AjsAbwM3A+4ABwEPAhLzursBiYPf8WkzIr2/fwmt9\nN7A1MAh4GDguL9sbeKrBe2Vv4EN5mz8MPAccmJdtm1/Xw4A+eZt3ycvOydu4TY7rY/m1fz/wCqlr\nsw/wTWAesGE7+/EiVr9H9ya9707NdT8DvAoMrF83P/4A6b21deF9vX0723lJ3uf98nr/B0yse7//\nXX7eL5Lek4Py8t8A55Pey1vm/X1soe6bwNF5P/w18DSgnv4c6OnJRzbrl38C/lbS4LWoOz8ifh4R\nq4ArgGHAqRHxRkTcCPwJKJ7DuC4ifhcRb5A+sD8qaRjwWVI3188jYmVE3Av8Gji4UPfqiPh9RLwV\nEa8Xg8ht7Al8KyJej4j7SEczh6/FNtWcHhEvRMSTwK2kBAbpQ2Nb0ofX6xHR7kn9iFgaEb+OiFcj\n4iVgCvCJTsTwzxGxIiLmAg8CN0bE4xHxIumIZNe83jHA+RExMyJWRcTFpOS0R6GtsyPi6YhYBlxT\n2J6GIuK2iJiT9/0DpIRb244vATdFxC8j4s28zfflI4K/Ak6KiEU5rjvza/9F0nthRkS8CfwI2IiU\njMp4k/Q+ezMifgu8TEoqbVlFSnA7SOoTEQsi4o/1K+WjkEOBUyLipYhYAJwJfLWw2mLgJ/l5rwAe\nBfaTNISU9L4WEa9ExGLgx7m9mici4mf5f+ViYCtgSMntbVlONuuRiHiQ9C1+0lpUf64w/1pur75s\n08LjhYXnfRlYRvq2vS2we+5SeUHSC6TukPe2VbcNWwPL8gd6zROkb9Rr69nC/Kus3o5vAgLuzt1A\nf9VeA5I2lnR+7pJZQTr62kzlzznV78v29u22wMl1+28Yab802p6GJO0u6dbcvfQicBywRV48DHjH\nh3de/p52lm1Nen0AiIi3SK9v2ddraUSsLDxud3siYh7wNeC7wGJJl7fVVZjj7VOMi3e+hxZFRNQt\nr71/+wDPFPb/+aQjnJq3939EvJpnS78GrcrJZv0zmXSIX/zHqp1M37hQVvzwXxvDajOSNiV16TxN\n+qC5PSI2K0ybRsRfF+p2dCnyp4FBkvoVyoYDi9Yx3neIiGcj4uiI2Bo4FjhX7Y9AO5n0jXv3iOhP\n6kKElKy60kJgSt3+2zgiflmibplLvF8GTAeGRcQA4N9ZvQ0Lge3bqPM88Ho7y54mfUADaWg66b1R\ne71eZe3fd+/Ynoi4LCL2ys8ZwBntxFs7aq2pfw9tk2MtLq+9f98Atijs//4RsWMn4l4vOdmsZ/K3\nvyuAEwtlS0j/aF+RtEH+Bt/WB0dnfCafTN4QOA24KyIWko6s3i/pq5L65OnPiyeNG8S/ELgT+GE+\n8f1hYCLwn+sY7ztIOlirT/IvJ314vZUfP0c6p1LTj3QE8oKkQaSkXoWfAcflIxBJ2iSf1O/XsGaK\neXNJAzpYpx/pyPF1SWNIXWc1vwD+QtIhknpL2lzSLvloZSpwltLgjQ0kfVRSX2AaqftprKQ+pKT8\nBuk1BLgP+FKuM57OdT2u8RpI+oCkT+bnfZ30erxVXyl3b00DpkjqpzR45eus+R7aEjgxvz8PBv4M\n+G1EPAPcCJwpqb+kXkqDOToT93rJyWb9dCrp5GbR0cDfA0tJJ6nvrK/USZeRPnCXAbsBXwHI3V/j\nSH3cT5O6HM4g9bWXdRjppO7TpJO1kyPipnWMty1/DsyU9DLp2/5JEfF4XvZd4OLclXII8BPSuYjn\ngbuA/64gHiJiNum1+ikpAc4jnZQuU/cR0jmYx3PcbXUxHQ+cKukl0jm+aYX6T5LOV5xMel3vA3bO\ni78BzAFm5WVnAL0i4lHSa/9vpH3zOeBzEfGnXO+kXFbrTr2qzLZkF5LOz7wg6SrSe+j0/DzPkhLG\nKe3U/VvSEf3jwB2k9+vUwvKZwKjc1hTgoIhYmpcdDmxIGrixHLiSdF7GOqA1uyXNzNZvSsO5j8rd\ncdZFfGRjZmaVc7IxM7PKuRvNzMwq5yMbMzOrnJONmZlVrnfjVdYPW2yxRYwYMaKnwzAze1f5wx/+\n8HxENLwElpNNNmLECGbPnt3TYZiZvatIeqLxWu5GMzOzbuBkY2ZmlXOyMTOzyjnZmJlZ5ZxszMys\ncpUmG6X7oc9Rul/67Fw2SOn+84/lvwML65+idE/5RyXtWyjfLbczT9LZtftMSOqrdI/1eZJmShpR\nqDMhP8djkiZUuZ1mZtax7jiy2ScidomI0fnxJODmiBhFus/6JABJO5AuO78jMJ50o6raXQ7PI11W\nfVSexufyicDyiBhJujXrGbmt2v1EdgfGAJOLSc3MzLpX
T3SjHUC6Lzf574GF8svzPe3nk+7TMUbS\nVkD/iLgr36b1kro6tbauBMbmo559gRkRsSwilgMzWJ2gzMysm1X9o84AbpK0Cjg/Ii4AhuS73UG6\nwdGQPL8N6aZTNU/lsjfzfH15rc5CgIhYme+ZvnmxvI06pY2YdF2Hyxecvl9nmzQzWy9VnWz2iohF\nkrYEZkh6pLgwIkJSj112WtIxwDEAw4cP76kwzMxaXqXdaBGxKP9dTLp97xjgudw1Rv67OK++CBhW\nqD40ly3K8/Xla9SR1BsYQLqtcXtt1cd3QUSMjojRgwc3vLSPmZmtpcqSjaRNJPWrzZPuO/8g6V7u\ntdFhE4Cr8/x04NA8wmw70kCAu3OX2wpJe+TzMYfX1am1dRBwSz6vcwMwTtLAPDBgXC4zM7MeUGU3\n2hDgN3mUcm/gsoj4b0mzgGmSJgJPAIcARMRcSdOAh4CVwAkRsSq3dTxwEbARcH2eAC4ELpU0D1hG\nGs1GRCyTdBowK693akQsq3BbzcysA5Ulm4h4HNi5jfKlwNh26kwBprRRPhvYqY3y14GD22lrKjC1\nc1GbmVkVfAUBMzOrnJONmZlVzsnGzMwq52RjZmaVc7IxM7PKOdmYmVnlnGzMzKxyTjZmZlY5Jxsz\nM6uck42ZmVXOycbMzCrnZGNmZpVzsjEzs8o52ZiZWeWcbMzMrHJONmZmVjknGzMzq5yTjZmZVc7J\nxszMKudkY2ZmlXOyMTOzyjnZmJlZ5ZxszMysck42ZmZWOScbMzOrnJONmZlVzsnGzMwq52RjZmaV\nc7IxM7PKOdmYmVnlnGzMzKxyTjZmZla5ypONpA0k3Svp2vx4kKQZkh7LfwcW1j1F0jxJj0rat1C+\nm6Q5ednZkpTL+0q6IpfPlDSiUGdCfo7HJE2oejvNzKx93XFkcxLwcOHxJODmiBgF3JwfI2kH4FBg\nR2A8cK6kDXKd84CjgVF5Gp/LJwLLI2Ik8GPgjNzWIGAysDswBphcTGpmZta9Kk02koYC+wH/USg+\nALg4z18MHFgovzwi3oiI+cA8YIykrYD+EXFXRARwSV2dWltXAmPzUc++wIyIWBYRy4EZrE5QZmbW\nzao+svkJ8E3grULZkIh4Js8/CwzJ89sACwvrPZXLtsnz9eVr1ImIlcCLwOYdtGVmZj2gsmQj6bPA\n4oj4Q3vr5COVqCqGRiQdI2m2pNlLlizpqTDMzFpelUc2ewL7S1oAXA58UtJ/As/lrjHy38V5/UXA\nsEL9oblsUZ6vL1+jjqTewABgaQdtrSEiLoiI0RExevDgwWu/pWZm1qHKkk1EnBIRQyNiBOnE/y0R\n8RVgOlAbHTYBuDrPTwcOzSPMtiMNBLg7d7mtkLRHPh9zeF2dWlsH5ecI4AZgnKSBeWDAuFxmZmY9\noHcPPOfpwDRJE4EngEMAImKupGnAQ8BK4ISIWJXrHA9cBGwEXJ8ngAuBSyXNA5aRkhoRsUzSacCs\nvN6pEbGs6g0zM7O2dUuyiYjbgNvy/FJgbDvrTQGmtFE+G9ipjfLXgYPbaWsqMHVtYzYzs67TsBtN\n0vaS+ub5vSWdKGmz6kMzM7NWUeacza+BVZJGAheQTrxfVmlUZmbWUsokm7fyb1j+Evi3iPh7YKtq\nwzIzs1ZSJtm8Kekw0qiva3NZn+pCMjOzVlMm2RwJfBSYEhHz87DkS6sNy8zMWkmHo9HyhTC/ExFf\nrpXl65adUXVgZmbWOjo8ssm/c9lW0obdFI+ZmbWgMr+zeRz4vaTpwCu1wog4q7KozMyspZRJNn/M\nUy+gX7XhmJlZK2qYbCLiewCSNo6IV6sPyczMWk2ZKwh8VNJDwCP58c6Szq08MjMzaxllhj7/hHTn\ny6UAEXE/8PEqgzIzs9ZS6hYDEbGwrmhVmyuamZm1ocwAgYWSPgaEpD7AScDD1YZlZmatpMyRzXHA\nCcA2pLtd7pIfm5mZlVLmyObl4hUEzMzMOqtMsnlQ0nPA/+Tpjoh4sdqwzMyslTTsRouIkcBhwBxg\nP+B+SfdVHZiZmbWOhkc2koYCewL/D9gZmAvcUXFcZmbWQsp0oz0JzAJ+EBHHVRyPmZm1oDKj0XYF\nLgG+JOl/JV0iaWLFcZmZWQspc220+yXVLsb5/4CvAJ8ALqw4NjMzaxFlztnMBvoCd5JGo308Ip6o\nOjAzM2sdZc7ZfDoillQeiZmZtawy52z+JOksSbPzdKakAZVHZmZmLaNMspkKvAQckqcVwM+rDMrM\nzFpLmW607SPiC4XH3/OPOs3MrDPKHNm8Jmmv2gNJewKvVReSmZm1mjJHNscBlxTO0ywHJlQXkpmZ\ntZoOk42kXsAHImJnSf0BImJFt0RmZmYto8NutIh4C/hmnl/hRGNmZmujzDmbmyR9Q9IwSYNqU+WR\nmZlZyyhzzuaL+W/x7pwBvK/rwzEzs1bU4ZFNPmfzlYjYrm5qmGgkvUfS3ZLulzRX0vdy+SBJMyQ9\nlv8OLNQ5RdI8SY9K2rdQvpukOXnZ2ZKUy/tKuiKXz5Q0olBnQn6OxyR5QIOZWQ8qc87mp2vZ9hvA\nJyNiZ2AXYLykPYBJwM0RMQq4OT9G0g7AocCOwHjgXEkb5LbOA44GRuVpfC6fCCzPN3j7MXBGbmsQ\nMBnYHRgDTC4mNTMz615lztncLOkLtaOJsiJ5OT/sk6cADgAuzuUXAwfm+QOAyyPijYiYD8wDxkja\nCugfEXdFRJBud1CsU2vrSmBsjnNfYEZELIuI5cAMVicoMzPrZmXO2RwLfB1YJek1QKRc0r9RxXxk\n8gdgJHBORMyUNCQinsmrPAsMyfPbAHcVqj+Vy97M8/XltToLSQGtlPQisHmxvI06xfiOAY4BGD58\neKPN6bQRk65ruM6C0/fr8uc1M2s2DY9sIqJfRPSKiD4R0T8/bphoct1VEbELMJR0lLJT3fIgHe30\niIi4ICJGR8TowYMH91QYZmYtr0w3GpL2l/SjPH22s08SES8At5K6sp7LXWPkv4vzaouAYYVqQ3PZ\nojxfX75GHUm9gQHA0g7aMjOzHtAw2Ug6HTgJeChPJ0n6YYl6gyVtluc3Aj4FPAJMZ/XlbiYAV+f5\n6cCheYTZdqSBAHfnLrcVkvbI52MOr6tTa+sg4JZ8tHQDME7SwDwwYFwuMzOzHlDmnM1ngF3yyDQk\nXQzcC5zSoN5WwMX5vE0vYFpEXCvpf4FpkiYCT5BuW0BEzJU0jZTQVgInRMSq3NbxwEXARsD1eYJ0\na+pLJc0DlpFGsxERyySdBszK650aEctKbKuZmVWgTLIB2Iz0YQ6pq6qhiHgA2LWN8qXA2HbqTAGm\ntFE+G9ipjfLXgYPbaWsq6V48ZmbWw8okmx8C90q6lTQS7ePk38aYmZmV0TDZRMQvJd0G/Dlp5Ni3\nIuLZqgMzM7PWUbYb7aPAXqRk0xv4TWURmZlZyykzGu1c0g3U5gAPAsdKOqfqwMzMrHWUObL5JPBn\neUhxbTTa3Eq
jMjOzllLmR53zgOK1XIblMjMzs1LaPbKRdA3pHE0/4GFJd+fHuwN3d094ZmbWCjrq\nRvtRt0VhZmYtrd1kExG3d2cgZmbWukpdiNPMzGxdONmYmVnl2k02km7Of8/ovnDMzKwVdTRAYCtJ\nHwP2l3Q56bpob4uIeyqNzMzMWkZHyeafgH8k3XjsrLplQfqxp5mZWUMdjUa7ErhS0j9GxGndGJOZ\nmbWYMld9Pk3S/qRbCwDcFhHXVhuWmZm1kjIX4vwh77wt9A+qDszMzFpHmQtx7kfbt4X+dpWBmZlZ\n6yj7O5vNCvOlbgttZmZW49tCm5lZ5Tp7W2jwbaHNzKyTSt0WOiKeAaZXHIuZmbUoXxvNzMwq52Rj\nZmaV6zDZSNpA0iPdFYyZmbWmDpNNRKwCHpU0vJviMTOzFlRmgMBAYK6ku4FXaoURsX9lUZmZWUsp\nk2z+sfIozMyspZX5nc3tkrYFRkXETZI2BjaoPjQzM2sVZS7EeTRwJXB+LtoGuKrKoMzMrLWUGfp8\nArAnsAIgIh4DtqwyKDMzay1lks0bEfGn2gNJvUl36jQzMyulTLK5XdK3gY0kfQr4FXBNo0qShkm6\nVdJDkuZKOimXD5I0Q9Jj+e/AQp1TJM2T9KikfQvlu0mak5edLUm5vK+kK3L5TEkjCnUm5Od4TNKE\nsjvEzMy6XplkMwlYAswBjgV+C/xDiXorgZMjYgdgD+AESTvk9m6OiFHAzfkxedmhwI7AeOBcSbWB\nCOcBRwOj8jQ+l08ElkfESODHwBm5rUHAZGB3YAwwuZjUzMysezVMNvmmaRcDpwHfAy6OiIbdaBHx\nTETck+dfAh4mDS44ILdH/ntgnj8AuDwi3oiI+cA8YIykrYD+EXFXft5L6urU2roSGJuPevYFZkTE\nsohYDsxgdYIyM7Nu1nDos6T9gH8H/ki6n812ko6NiOvLPknu3toVmAkMyVeRBngWGJLntwHuKlR7\nKpe9mefry2t1FgJExEpJLwKbF8vbqGNmZt2szI86zwT2iYh5AJK2B64DSiUbSZsCvwa+FhEr8ukW\nACIiJPXYYANJxwDHAAwf7ivymJlVpcw5m5dqiSZ7HHipTOOS+pASzS8i4r9y8XO5a4z8d3EuXwQM\nK1QfmssW5fn68jXq5FFyA4ClHbS1hoi4ICJGR8TowYMHl9kkMzNbC+0mG0mfl/R5YLak30o6Io/q\nugaY1ajhfO7kQuDhiDirsGg6UBsdNgG4ulB+aB5hth1pIMDductthaQ9cpuH19WptXUQcEs+r3MD\nME7SwDwwYFwuMzOzHtBRN9rnCvPPAZ/I80uAjUq0vSfwVWCOpPty2beB04FpkiYCTwCHAETEXEnT\ngIdII9lOyFedBjgeuCg/7/Ws7sK7ELhU0jxgGWk0GxGxTNJprE6Kp0bEshIxm5lZBdpNNhFx5Lo0\nHBF3kAYUtGVsO3WmAFPaKJ8N7NRG+evAwe20NRWYWjZeMzOrTpnRaNsBfwuMKK7vWwyYmVlZZUaj\nXUXqrroGeKvacMzMrBWVSTavR8TZlUdiZmYtq0yy+VdJk4EbgTdqhbWrA5iZmTVSJtl8iDSq7JOs\n7kaL/NjMzKyhMsnmYOB9xdsMmJmZdUaZKwg8CGxWdSBmZta6yhzZbAY8ImkWa56z8dBnMzMrpUyy\nmVx5FGZm1tIaJpuIuL07AjEzs9ZV5goCL5FGnwFsCPQBXomI/lUGZmZmraPMkU2/2ny+6vIBpNs8\nm5mZlVJmNNrbIrmKdNtlMzOzUsp0o32+8LAXMBp4vbKIzMys5ZQZjVa8r81KYAGpK83MzKyUMuds\n1um+NmZmZu0mG0n/1EG9iIjTKojHzMxaUEdHNq+0UbYJMBHYHHCyMTOzUjq6LfSZtXlJ/YCTgCOB\ny4Ez26tnZmZWr8NzNpIGAV8HvgxcDHwkIpZ3R2BmZtY6Ojpn8y/A54ELgA9FxMvdFpWZmbWUjn7U\neTKwNfAPwNOSVuTpJUkruic8MzNrBR2ds+nU1QXMzMza44RiZmaVc7IxM7PKOdmYmVnlnGzMzKxy\nTjZmZlY5JxszM6uck42ZmVXOycbMzCrnZGNmZpWrLNlImippsaQHC2WDJM2Q9Fj+O7Cw7BRJ8yQ9\nKmnfQvlukubkZWdLUi7vK+mKXD5T0ohCnQn5OR6TNKGqbTQzs3KqPLK5CBhfVzYJuDkiRgE358dI\n2gE4FNgx1zlX0ga5znnA0cCoPNXanAgsj4iRwI+BM3Jbg4DJwO7AGGByMamZmVn3qyzZRMTvgGV1\nxQeQblVA/ntgofzyiHgjIuYD84AxkrYC+kfEXRERwCV1dWptXQmMzUc9+wIzImJZvh3CDN6Z9MzM\nrBt19zmbIRHxTJ5/FhiS57cBFhbWeyqXbZPn68vXqBMRK4EXSXcQba8tMzPrIT02QCAfqURPPT+A\npGMkzZY0e8mSJT0ZiplZS+vuZPNc7hoj/12cyxcBwwrrDc1li/J8ffkadST1BgYASzto6x0i4oKI\nGB0RowcPHrwOm2VmZh3p7mQzHaiNDpsAXF0oPzSPMNuONBDg7tzltkLSHvl8zOF1dWptHQTcko+W\nbgDGSRqYBwaMy2VmZtZD2r152rqS9Etgb2ALSU+RRoidDkyTNBF4AjgEICLmSpoGPASsBE6IiFW5\nqeNJI9s2Aq7PE8CFwKWS5pEGIhya21om6TRgVl7v1IioH6hgZmbdqLJkExGHtbNobDvrTwGmtFE+\nG9ipjfLXgYPbaWsqMLV0sGZmVilfQcDMzCrnZGNmZpVzsjEzs8o52ZiZWeWcbMzMrHJONmZmVjkn\nGzMzq5yTjZmZVc7JxszMKlfZFQSsa4yYdF2Hyxecvl83RWJmtvZ8ZGNmZpVzsjEzs8o52ZiZWeWc\nbMzMrHJONmZmVjknGzMzq5yTjZmZVc7JxszMKudkY2ZmlXOyMTOzyjnZmJlZ5ZxszMysck42ZmZW\nOScbMzOrnJONmZlVzsnGzMwq52RjZmaVc7IxM7PKOdmYmVnlnGzMzKxyTjZmZlY5JxszM6uck42Z\nmVWupZONpPGSHpU0T9Kkno7HzGx91bunA6iKpA2Ac4BPAU8BsyRNj4iHejay7jdi0nUdLl9w+n7d\n0oaZrb9aNtkAY4B5EfE4gKTLgQOA9S7ZNINGyQoaJywnPLN3L0VET8dQCUkHAeMj4qj8+KvA7hHx\nN4V1jgGOyQ8/ADzaoNktgOfXMbR1baMZYmiWNpohhq5ooxliaJY2miGGZmmjGWIo08a2ETG4USOt\nfGTTUERcAFxQdn1JsyNi9Lo857q20QwxNEsbzRBDV7TRDDE0SxvNEEOztNEMMXRVG9DaAwQWAcMK\nj4fmMjMz62atnGxmAaMkbSdpQ+BQYHoPx2Rmtl5q2W60iFgp6W+AG4ANgKkRMXcdmy3d5VZhG80Q\nQ7O00QwxdEUbzRBDs7TRDDE0SxvNEENXtdG6AwTMzKx5tHI3mpmZNQkn
GzMzq5yTjZmZVc7JphMk\n7SXpnJLrjpS0Zxvle0ravuujMzNrXi07Gq2rSNoV+BJwMDAf+K+SVX8CnNJG+Yq87HNrGc8WwNLo\n5pEdkoYA2+SHiyLiuXdjG10RQ7PE0QxtdNX+tNbnZNMGSe8HDsvT88AVpJF7+3SimSERMae+MCLm\nSBpRMo49gNOBZcBpwKWkS0f0knR4RPx32WDW9kNB0i7AvwMDWP2j2KGSXgCOj4h73g1tdEUMzRJH\nM7TRVfszt/VB0nUL335/AtMj4uHuqN9KbTRDDO2KCE91E/AWcDswslD2eCfbeKyDZfNKtjEbGEc6\nqloO7JHLPwjcW7KNXYC7gIeBm/L0SC77SIn695GuKVdfvgdwf8kYeryNroihWeJohja6cH9+K7c1\nCfhKnibVyqqu30ptNEMMHba9LpVbdQIOBC4HFgI/A8YC8zvZxi+Bo9soPwq4omQb9xXmH65bVjbZ\nrOuHSlckzR5voytiaJY4mqGNLtyf/wf0aaN8w46eo6vqt1IbzRBDR5O70doQEVcBV0nahHQ4+TVg\nS0nnAb+JiBtLNPM14DeSvgz8IZeNJr1of1kylLcK86/Vh1myjU0iYmZ9YUTclbevkeslXQdcQkq+\nkK45dzgNwAL/AAAHAElEQVRQthuvGdroihiaJY5maKOr9udbwNbAE3XlW7Hm+7+q+q3URjPE0C5f\nQaAkSQNJ3VlfjIixnai3D7BTfjg3Im7pRN1VwCuAgI2AV2uLgPdERJ8SbZwNbE/bHwrzo3DLhQ7a\n+DRt9+H+thPb0uNtdEUMzRJHM7TRRTGMB34KPMbq9+dwYCTwN9HgvOS61m+lNpohhg7bdrJpfV31\nIWtWBUm9SDc7LL4/Z0XEqu6o30ptNEMM7bbrZGNrS9Ixke4J9K5uoytiaJY4mqGNrtqf1lr8o871\nWL5T6To10RVhNEEbXRFDV7TTDPuiK9rokv0p6dqerN9KbTRFDD6yWX9JOjYizi+xXleN3d8GmBkR\nLxfKx5ftB5Y0BoiImCVpB2A88MjadgdKuiQiDl+buoU29iJ1OTxYZuCIpN1JIwtXSNqINKz0I8BD\nwA8i4sUSbZxIGqiysNG6HbRRu8fT0xFxk6QvAR8jDZG/ICLeLNHG+4DPk84BriKNZLosIlasbVx1\n7W8VEc/0VP1WaqMpYnCyWX9JOjIift5gnW+Rftx6OfBULh5K+qC6PCJOL/E8JwInkD7IdgFOioir\n87J7IuIjJdqYDHya9EPkGcDuwK3Ap4AbImJKg/r1N84TsA9wC0BE7N8ohtzO3RExJs8fnbfrN6Tf\nQ13TaH9ImgvsHOl+SxeQBn1cSRpev3NEfL5EDC+SBo78kTTE/lcRsaRM/IU2fkHalxsDLwCbkq6O\nMRYgIo5oUP9E4LPA74DPAPfmdv6S9KPO2zoTj7VN0pYRsbin4+gS6zJu2tO7ewKeLLFOV4zdnwNs\nmudHkH6selJ+XPb3QnNIN8HbmHTJn/65fCPggRL17wH+E9gb+ET++0ye/0Qn9tm9hflZwOA8vwkw\np0T9h4sx1S27r2wMpC7wccCFwBLScOMJQL+SbTyQ//YGngM2yI9Vcn/OKdTZGLgtzw8v+5rm9QeQ\nrpLxCOlKGUtJX0pOBzZbx/f39SXX6w/8kHSFji/VLTu3ZBvvBc4DzgE2B76b99E0YKuSbQyqmzYH\nFgADgUEl6o+v268XAg8Al5GuaFImhk2BU4G5wIv5vXUXcMS6vBYR4XM2rU7SA+1Mc4AhJZqojbuv\n15lx970id51FxALSB/2nJZ1F+f79lRGxKiJeBf4YuasmIl4rGcdo0u+dvgO8GOmb92sRcXtE3F4y\nBkiXChooaXPSh+2SHMcrwMoS9R+UdGSev1/SaHj7EkkNu66yiIi3IuLGiJhIen3OJXUrPt6J7dgQ\n6EdKFgNyeV+g4ZD6rPY7vb6kDyki4slO1If0Ybwc2DsiBkXE5qQjzuV5WYckfaSdaTfSUXQZPye9\nD38NHCrp15L65mV7lGzjIlJX6ELSEfdrpCO+/yFd1qeM50nv0do0m9T1fE+eb+QHhfkzSV+mPkf6\nUtSwuzz7Bek9tC/wPeBs4KvAPpJ+0FHFhtY1W3lq7on0rXUXYNu6aQSpv75R/fHAPOB60u1hLyB9\ni55H4ZtUgzZuAXapK+tN+u3PqpJtzAQ2zvO9CuUDqDtCaNDOUOBXpN8SNDyya6P+gvzPOD//3SqX\nb0qJI5Mc70WkLrCZpATzOOnySDuXjKHdI4faPirRxt/l530COBG4mXS1jDnA5BL1TyJ9a/4Z6ajk\nyFw+GPhdJ/bno2uzrLDOqvz+urWN6bWSMdxX9/g7wO9JRxal3lusecT7ZEftd9DGyfl/60OFsvmd\n2Jf3tPecnYjh/rrHs/LfXqTzo536f1mjrXWp7Kn5J9Kh9F7tLLusZBu9SN/wvpCnPchdKCXrDwXe\n286yPUu20bed8i2K/5ydiGk/0gn5rtrPGwPbdWL9/sDOwG6U7OIo1H1/F8W8NbB1nt8MOAgY04n6\nO+Y6H1yHGG4EvlncB6Qj7m8BN5Wo/yAwqp1lC0vG8DCFLzC57AhSV9ITJdu4vzD//bplDbtXC+vW\nvgydRTrqLH1NRtI51a/npDWffE4+L2vYNZrXu7P2eQHsTzofWlvWMPl3NHmAgJn1mHxljkmk0Y5b\n5uLngOnA6RGxvEH9g0gf5o+2sezASJeeahTDPwM3RsRNdeXjgX+LiFEl2jgV+OcojLTM5SPzdhzU\nqI26evsD3wZGRMR7S9aZXFd0bkQskfTeHFvDkZeSPgz8BzCKlGz/KiL+T9Jg4LCIOLsz27FG2042\nZtaMyoyWrLJ+T7eRh8ZvHxEPtsK+cLIxs6Yk6cmIGN5T9VupjWaIwVd9NrMeI+mB9hZRYrTkutZv\npTaaIYaOONmYWU8aQhpmW39uRqST1VXXb6U2miGGdjnZmFlPupb0g9/76hdIuq0b6rdSG80QQ7t8\nzsbMzCrnKwiYmVnlnGzMzKxyTjZmZlY5JxszM6uck42ZmVXu/wNCP2yz4fPUrgAAAABJRU5ErkJg\ngg==\n", 188 | "text/plain": [ 189 | "" 190 | ] 191 | }, 192 | "metadata": {}, 193 | "output_type": "display_data" 194 | } 195 | ], 196 | "source": [ 197 | "# Now, let's make a bar plot showing the different number of \n", 198 | "# borrowers for each of the values of the mixed variable\n", 199 | "\n", 200 | "fig = data.open_il_24m.value_counts().plot.bar()\n", 201 | "fig.set_title('Number of installment 
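"# note: value_counts() sorts the categories by frequency, so the letter codes appear alongside the numeric values in the plot\n",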
 202 | "fig.set_ylabel('Number of borrowers')" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": { 208 | "collapsed": true 209 | }, 210 | "source": [ 211 | "This is what a mixed variable looks like!" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "collapsed": true 218 | }, 219 | "source": [ 220 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 221 | ] 222 | } 223 | ], 224 | "metadata": { 225 | "kernelspec": { 226 | "display_name": "Python 3", 227 | "language": "python", 228 | "name": "python3" 229 | }, 230 | "language_info": { 231 | "codemirror_mode": { 232 | "name": "ipython", 233 | "version": 3 234 | }, 235 | "file_extension": ".py", 236 | "mimetype": "text/x-python", 237 | "name": "python", 238 | "nbconvert_exporter": "python", 239 | "pygments_lexer": "ipython3", 240 | "version": "3.6.1" 241 | }, 242 | "toc": { 243 | "nav_menu": {}, 244 | "number_sections": true, 245 | "sideBar": true, 246 | "skip_h1_title": false, 247 | "toc_cell": false, 248 | "toc_position": { 249 | "height": "550px", 250 | "left": "0px", 251 | "right": "869.4px", 252 | "top": "107px", 253 | "width": "151px" 254 | }, 255 | "toc_section_display": "block", 256 | "toc_window_display": true 257 | } 258 | }, 259 | "nbformat": 4, 260 | "nbformat_minor": 1 261 | } 262 | -------------------------------------------------------------------------------- /02.5_Bonus_Learn_more_on_Lending_Club_dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Bonus: Learn more about the Lending Club dataset\n", 8 | "\n", 9 | "Visit the following Jupyter notebooks on Kaggle to learn more about the Lending Club dataset and how different Kagglers approach data analysis.
\n", 10 | "\n", 11 | "- [Initial Loan Book Analysis](https://www.kaggle.com/erykwalczak/initial-loan-book-analysis)\n", 12 | "- [Python for Padawans](https://www.kaggle.com/evanmiller/python-for-padawans)\n", 13 | "- [Loan Book Initial Exploration](https://www.kaggle.com/solegalli/loan-book-initial-exploration/)" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": { 20 | "collapsed": true 21 | }, 22 | "outputs": [], 23 | "source": [] 24 | } 25 | ], 26 | "metadata": { 27 | "kernelspec": { 28 | "display_name": "Python 3", 29 | "language": "python", 30 | "name": "python3" 31 | }, 32 | "language_info": { 33 | "codemirror_mode": { 34 | "name": "ipython", 35 | "version": 3 36 | }, 37 | "file_extension": ".py", 38 | "mimetype": "text/x-python", 39 | "name": "python", 40 | "nbconvert_exporter": "python", 41 | "pygments_lexer": "ipython3", 42 | "version": "3.6.1" 43 | }, 44 | "toc": { 45 | "nav_menu": {}, 46 | "number_sections": true, 47 | "sideBar": true, 48 | "skip_h1_title": false, 49 | "toc_cell": false, 50 | "toc_position": {}, 51 | "toc_section_display": "block", 52 | "toc_window_display": false 53 | } 54 | }, 55 | "nbformat": 4, 56 | "nbformat_minor": 2 57 | } 58 | -------------------------------------------------------------------------------- /03.6_Bonus_Additional_reading_resources_on_variable_problems.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Bonus: Additional reading resources on variable problems\n", 8 | "\n", 9 | "### Further reading on missing data:\n", 10 | "\n", 11 | "- [Missing data imputation, chapter 25](http://www.stat.columbia.edu/~gelman/arm/missing.pdf).\n", 12 | "\n", 13 | "### Further reading on outliers:\n", 14 | "\n", 15 | "- [Why is AdaBoost algorithm sensitive to noisy data and outliers?](https://www.quora.com/Why-is-AdaBoost-algorithm-sensitive-to-noisy-data-and-outliers-And-how)\n", 16 | "- [Why is Logistic Regression robust to outliers compared to least squares?](https://www.quora.com/Why-is-logistic-regression-considered-robust-to-outliers-compared-to-a-least-square-method)\n", 17 | "- [Can Logistic Regression be considered robust to outliers?](https://www.quora.com/Can-Logistic-Regression-be-considered-robust-to-outliers)\n", 18 | "- [The Effects of Outlier Data on Neural Networks Performance](https://www.researchgate.net/profile/Azme_Khamis/publication/26568300_The_Effects_of_Outliers_Data_on_Neural_Network_Performance/links/564802c908ae54697fbc10de/The-Effects-of-Outliers-Data-on-Neural-Network-Performance.pdf)\n", 19 | "- [Outlier Analysis by C. 
Aggarwal](http://charuaggarwal.net/outlierbook.pdf)\n", 20 | "\n", 21 | "### Overview on Variable Problems\n", 22 | "\n", 23 | "- [Identifying common Data Mining Mistakes by SAS](https://www.mwsug.org/proceedings/2007/saspres/MWSUG-2007-SAS01.pdf)\n", 24 | "\n", 25 | "### Overview and Comparison of Machine Learning Algorithms\n", 26 | "\n", 27 | "- [Top 10 Machine Learning Algorithms](https://www.dezyre.com/article/top-10-machine-learning-algorithms/202)\n", 28 | "- [Choosing the Right Algorithm for Machine Learning](http://www.dummies.com/programming/big-data/data-science/choosing-right-algorithm-machine-learning/)\n", 29 | "- [Why does Gradient boosting work so well for so many Kaggle problems?](https://www.quora.com/Why-does-Gradient-boosting-work-so-well-for-so-many-Kaggle-problems)\n", 30 | "\n", 31 | "\n", 32 | "### Kaggle kernels for the Titanic Dataset\n", 33 | "\n", 34 | "- [Exploring Survival on the Titanic](https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic)\n", 35 | "- [Titanic Data Science Solutions](https://www.kaggle.com/startupsci/titanic-data-science-solutions/notebook)\n", 36 | "\n", 37 | "### Kaggle kernels for the Mercedes Benz Dataset\n", 38 | "\n", 39 | "- [Sherlock's Exploration {Season 01} - Categorical](https://www.kaggle.com/remidi/sherlock-s-exploration-season-01-categorical)\n", 40 | "- [Mercedes Benz Data Exploration](https://www.kaggle.com/aditya1702/mercedes-benz-data-exploration)\n", 41 | "- [Simple Exploration Notebook](https://www.kaggle.com/pcjimmmy/simple-exploration-notebook-mercedes)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [] 52 | } 53 | ], 54 | "metadata": { 55 | "kernelspec": { 56 | "display_name": "Python 3", 57 | "language": "python", 58 | "name": "python3" 59 | }, 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 3 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython3", 70 | "version": "3.6.1" 71 | }, 72 | "toc": { 73 | "nav_menu": {}, 74 | "number_sections": true, 75 | "sideBar": true, 76 | "skip_h1_title": false, 77 | "toc_cell": false, 78 | "toc_position": {}, 79 | "toc_section_display": "block", 80 | "toc_window_display": false 81 | } 82 | }, 83 | "nbformat": 4, 84 | "nbformat_minor": 2 85 | } 86 | -------------------------------------------------------------------------------- /04.1_Variable_magnitude.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Variable magnitude\n", 8 | "\n", 9 | "### Does the magnitude of the variable matter?\n", 10 | "\n", 11 | "In Linear Regression models, the scale of variables used to estimate the output matters. Linear models are of the type **y = w x + b**, where the regression coefficient w represents the expected change in y for a one unit change in x (the predictor). Thus, the magnitude of **w** is partly determined by the magnitude of the units being used for **x**. 
If x is a distance variable, just changing the scale from kilometers to miles will change the magnitude of the coefficient: the same model expressed in miles has a coefficient roughly 1.609 times bigger (since one mile is about 1.609 km), even though the predictions are identical.\n", 12 | "\n", 13 | "In addition, in situations where we estimate the outcome y from multiple predictors x1, x2, ...xn, predictors with greater numeric ranges dominate over those with smaller numeric ranges.\n", 14 | "\n", 15 | "Gradient descent converges faster when all the predictors (x1 to xn) are within a similar scale, therefore making feature scaling useful for Neural Networks as well as Logistic Regression.\n", 16 | "\n", 17 | "In Support Vector Machines, feature scaling can decrease the time to find the support vectors.\n", 18 | "\n", 19 | "Finally, methods using Euclidean distances, or distances in general, are also affected by the magnitude of the features, as Euclidean distance is sensitive to variations in the magnitude or scales of the predictors. Therefore, feature scaling is required for methods that utilise distance calculations, like k-nearest neighbours (KNN) and k-means clustering.\n", 20 | "\n", 21 | "For more details on the above, follow the links in the Bonus Lecture of this section.\n", 22 | "\n", 23 | "In summary:\n", 24 | "\n", 25 | "#### Magnitude matters because:\n", 26 | "\n", 27 | "- The regression coefficient is directly influenced by the scale of the variable\n", 28 | "- Variables with bigger magnitudes / value ranges dominate over the ones with smaller magnitudes / value ranges\n", 29 | "- Gradient descent converges faster when features are on similar scales\n", 30 | "- Feature scaling helps decrease the time to find support vectors for SVMs\n", 31 | "- Euclidean distances are sensitive to feature magnitude.\n", 32 | "\n", 33 | "#### The machine learning models affected by the magnitude of the feature are:\n", 34 | "\n", 35 | "- Linear and Logistic Regression\n", 36 | "- Neural Networks\n", 37 | "- Support Vector Machines\n", 38 | "- KNN\n", 39 | "- K-means clustering\n", 40 | "- Linear Discriminant Analysis (LDA)\n", 41 | "- Principal Component Analysis (PCA)\n", 42 | "\n", 43 | "#### Machine learning models insensitive to feature magnitude are the ones based on Trees:\n", 44 | "\n", 45 | "- Classification and Regression Trees\n", 46 | "- Random Forests\n", 47 | "- Gradient Boosted Trees\n", 48 | "\n", 49 | "\n", 50 | "**For more information on whether and why you should scale features prior to using them in machine learning models, refer to the lecture \"Bonus: Additional reading resources on variable problems\" in the previous section of this course.** \n", 51 | "\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "===================================================================================================\n", 59 | "\n", 60 | "## Real Life example: \n", 61 | "\n", 62 | "### Predicting Survival on the Titanic: understanding societal behaviour and beliefs\n", 63 | "\n", 64 | "In one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes like gender, age, and social status, we can make very accurate predictions about which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper class.
Therefore, we can learn about society's priorities and privileges at the time.\n", 65 | "\n", 66 | "====================================================================================================\n", 67 | "\n", 68 | "To download the Titanic data, go to this website:\n", 69 | "https://www.kaggle.com/c/titanic/data\n", 70 | "\n", 71 | "Click on the link 'train.csv', and then click the blue 'download' button towards the right of the screen to download the dataset. Save it in a folder of your choice.\n", 72 | "\n", 73 | "**Note that you need to be logged in to Kaggle in order to download the datasets**.\n", 74 | "\n", 75 | "If you save it in the same directory from which you are running this notebook, and you rename the file to 'titanic.csv', then you can load it the same way I will load it below.\n", 76 | "\n", 77 | "====================================================================================================\n", 78 | "\n", 79 | "In this notebook, I will demonstrate the effect of feature magnitude on the performance of different machine learning algorithms." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 1, 85 | "metadata": { 86 | "collapsed": true 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "import pandas as pd\n", 91 | "import numpy as np\n", 92 | "\n", 93 | "import matplotlib.pyplot as plt\n", 94 | "%matplotlib inline\n", 95 | "\n", 96 | "from sklearn.linear_model import LogisticRegression\n", 97 | "from sklearn.ensemble import AdaBoostClassifier\n", 98 | "from sklearn.ensemble import RandomForestClassifier\n", 99 | "from sklearn.svm import SVC\n", 100 | "from sklearn.neural_network import MLPClassifier\n", 101 | "from sklearn.neighbors import KNeighborsClassifier\n", 102 | "\n", 103 | "from sklearn.preprocessing import MinMaxScaler\n", 104 | "\n", 105 | "from sklearn.metrics import roc_auc_score\n", 106 | "from sklearn.model_selection import train_test_split" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "### Load data with numerical variables only" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 2, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/html": [ 124 | "
\n", 125 | "\n", 138 | "\n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | "
SurvivedPclassAgeFare
00322.07.2500
11138.071.2833
21326.07.9250
31135.053.1000
40335.08.0500
\n", 186 | "
" 187 | ], 188 | "text/plain": [ 189 | " Survived Pclass Age Fare\n", 190 | "0 0 3 22.0 7.2500\n", 191 | "1 1 1 38.0 71.2833\n", 192 | "2 1 3 26.0 7.9250\n", 193 | "3 1 1 35.0 53.1000\n", 194 | "4 0 3 35.0 8.0500" 195 | ] 196 | }, 197 | "execution_count": 2, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "# load the numerical variables of the Titanic Dataset\n", 204 | "data = pd.read_csv('titanic.csv', usecols = ['Pclass', 'Age', 'Fare', 'Survived'])\n", 205 | "data.head()" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 3, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/html": [ 216 | "
\n", 217 | "\n", 230 | "\n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | "
SurvivedPclassAgeFare
count891.000000891.000000714.000000891.000000
mean0.3838382.30864229.69911832.204208
std0.4865920.83607114.52649749.693429
min0.0000001.0000000.4200000.000000
25%0.0000002.00000020.1250007.910400
50%0.0000003.00000028.00000014.454200
75%1.0000003.00000038.00000031.000000
max1.0000003.00000080.000000512.329200
\n", 299 | "
" 300 | ], 301 | "text/plain": [ 302 | " Survived Pclass Age Fare\n", 303 | "count 891.000000 891.000000 714.000000 891.000000\n", 304 | "mean 0.383838 2.308642 29.699118 32.204208\n", 305 | "std 0.486592 0.836071 14.526497 49.693429\n", 306 | "min 0.000000 1.000000 0.420000 0.000000\n", 307 | "25% 0.000000 2.000000 20.125000 7.910400\n", 308 | "50% 0.000000 3.000000 28.000000 14.454200\n", 309 | "75% 1.000000 3.000000 38.000000 31.000000\n", 310 | "max 1.000000 3.000000 80.000000 512.329200" 311 | ] 312 | }, 313 | "execution_count": 3, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "# let's have a look at the values of those variables to get an idea of the magnitudes\n", 320 | "\n", 321 | "data.describe()" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 4, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "Pclass _range: 2\n", 334 | "Age _range: 79.58\n", 335 | "Fare _range: 512.3292\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "# let's now calculate the range\n", 341 | "\n", 342 | "for col in ['Pclass', 'Age', 'Fare']:\n", 343 | " print(col, '_range: ', data[col].max()-data[col].min())" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "The magnitude of the values of the 3 different variables and their ranges of values are quite different. Therefore, feature scaling could benefit the performance of several machine learning algorithms." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 5, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "data": { 360 | "text/plain": [ 361 | "((623, 3), (268, 3))" 362 | ] 363 | }, 364 | "execution_count": 5, 365 | "metadata": {}, 366 | "output_type": "execute_result" 367 | } 368 | ], 369 | "source": [ 370 | "# let's separate into training and testing set\n", 371 | "X_train, X_test, y_train, y_test = train_test_split(\n", 372 | " data[['Pclass', 'Age', 'Fare']].fillna(0),\n", 373 | " data.Survived,\n", 374 | " test_size=0.3,\n", 375 | " random_state=0)\n", 376 | "\n", 377 | "X_train.shape, X_test.shape" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "### Feature Scaling" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "For this demonstration, I will scale the features between 0 and 1. \n", 392 | "To learn more about this scaling visit the scikit-learn website: \n", 393 | "http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html\n", 394 | "\n", 395 | "Briefly, the transformation is given by:\n", 396 | "\n", 397 | "X_std = (X - X.min() / (X.max - X.min())\n", 398 | "\n", 399 | "And to transform the scaled feature back to its initial format:\n", 400 | "\n", 401 | "X_scaled = X_std * (max - min) + min\n", 402 | "\n", 403 | "\n", 404 | "**We'll see more on feature scaling in future lectures**" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 6, 410 | "metadata": { 411 | "collapsed": true 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "# scaling the features between 0 and 1. 
\n", 416 | "\n", 417 | "scaler = MinMaxScaler()\n", 418 | "X_train_scaled = scaler.fit_transform(X_train)\n", 419 | "X_test_scaled = scaler.transform(X_test)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 7, 425 | "metadata": {}, 426 | "outputs": [ 427 | { 428 | "name": "stdout", 429 | "output_type": "stream", 430 | "text": [ 431 | "Mean: [ 0.64365971 0.30131421 0.06335433]\n", 432 | "Standard Deviation: [ 0.41999093 0.21983527 0.09411705]\n", 433 | "Minimum value: [ 0. 0. 0.]\n", 434 | "Maximum value: [ 1. 1. 1.]\n" 435 | ] 436 | } 437 | ], 438 | "source": [ 439 | "#let's have a look at the scaled training dataset\n", 440 | "print('Mean: ', X_train_scaled.mean(axis=0))\n", 441 | "print('Standard Deviation: ', X_train_scaled.std(axis=0))\n", 442 | "print('Minimum value: ', X_train_scaled.min(axis=0))\n", 443 | "print('Maximum value: ', X_train_scaled.max(axis=0))" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "### Logistic Regression" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 8, 456 | "metadata": {}, 457 | "outputs": [ 458 | { 459 | "name": "stdout", 460 | "output_type": "stream", 461 | "text": [ 462 | "Train set\n", 463 | "Logistic Regression roc-auc: 0.7134823539619531\n", 464 | "Test set\n", 465 | "Logistic Regression roc-auc: 0.7080952380952381\n" 466 | ] 467 | } 468 | ], 469 | "source": [ 470 | "# model build on unscaled variables\n", 471 | "\n", 472 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 473 | "logit.fit(X_train, y_train)\n", 474 | "print('Train set')\n", 475 | "pred = logit.predict_proba(X_train)\n", 476 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 477 | "print('Test set')\n", 478 | "pred = logit.predict_proba(X_test)\n", 479 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": 9, 485 | "metadata": {}, 486 | "outputs": [ 487 | { 488 | "data": { 489 | "text/plain": [ 490 | "array([[-0.92574443, -0.01822382, 0.00233696]])" 491 | ] 492 | }, 493 | "execution_count": 9, 494 | "metadata": {}, 495 | "output_type": "execute_result" 496 | } 497 | ], 498 | "source": [ 499 | "logit.coef_" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 10, 505 | "metadata": {}, 506 | "outputs": [ 507 | { 508 | "name": "stdout", 509 | "output_type": "stream", 510 | "text": [ 511 | "Train set\n", 512 | "Logistic Regression roc-auc: 0.7134931997136722\n", 513 | "Test set\n", 514 | "Logistic Regression roc-auc: 0.7080952380952381\n" 515 | ] 516 | } 517 | ], 518 | "source": [ 519 | "# model built on scaled variables\n", 520 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 521 | "logit.fit(X_train_scaled, y_train)\n", 522 | "print('Train set')\n", 523 | "pred = logit.predict_proba(X_train_scaled)\n", 524 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 525 | "print('Test set')\n", 526 | "pred = logit.predict_proba(X_test_scaled)\n", 527 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 11, 533 | "metadata": {}, 534 | "outputs": [ 535 | { 536 | "data": { 537 | "text/plain": [ 538 | "array([[-1.85168371, -1.45774407, 1.1951952 ]])" 539 | ] 540 | }, 541 | "execution_count": 
11, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "logit.coef_" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "We observe that the performance of logistic regression did not change when using the datasets with the features scaled (compare roc-auc values for train and test set for models with and without feature scaling). \n", 555 | "\n", 556 | "However, when looking at the coefficients we do see a big difference in the values. This is because the magnitude of the variable was affecting the coefficients. After scaling, all 3 variables have a roughly similar effect (coefficient) on survival, whereas before scaling, we would be inclined to think that Pclass was driving the Survival outcome." 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "### Support Vector Machines" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 12, 569 | "metadata": {}, 570 | "outputs": [ 571 | { 572 | "name": "stdout", 573 | "output_type": "stream", 574 | "text": [ 575 | "Train set\n", 576 | "SVM roc-auc: 0.9016995292943752\n", 577 | "Test set\n", 578 | "SVM roc-auc: 0.6768154761904762\n" 579 | ] 580 | } 581 | ], 582 | "source": [ 583 | "# model built on unscaled features\n", 584 | "\n", 585 | "SVM_model = SVC(random_state=44, probability=True)\n", 586 | "SVM_model.fit(X_train, y_train)\n", 587 | "print('Train set')\n", 588 | "pred = SVM_model.predict_proba(X_train)\n", 589 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 590 | "print('Test set')\n", 591 | "pred = SVM_model.predict_proba(X_test)\n", 592 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": 13, 598 | "metadata": {}, 599 | "outputs": [ 600 | { 601 | "name": "stdout", 602 | "output_type": "stream", 603 | "text": [ 604 | "Train set\n", 605 | "SVM roc-auc: 0.7047081408212403\n", 606 | "Test set\n", 607 | "SVM roc-auc: 0.6988690476190477\n" 608 | ] 609 | } 610 | ], 611 | "source": [ "# model built on scaled features\n", "\n", 612 | "SVM_model = SVC(random_state=44, probability=True)\n", 613 | "SVM_model.fit(X_train_scaled, y_train)\n", 614 | "print('Train set')\n", 615 | "pred = SVM_model.predict_proba(X_train_scaled)\n", 616 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 617 | "print('Test set')\n", 618 | "pred = SVM_model.predict_proba(X_test_scaled)\n", 619 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "Feature scaling improved the performance of the support vector machine. After feature scaling the model is no longer over-fitting to the training set (compare the roc-auc of 0.901 for the model on unscaled features vs the roc-auc of 0.704). In addition, the roc-auc for the testing set increased as well (0.67 vs 0.69)."
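, "\n", "This makes sense: by default, SVC uses an RBF kernel, which is a function of the distance between observations, exp(-gamma * ||x - x'||^2), so features with larger numeric ranges dominate the distance unless the data is scaled."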
627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": {}, 632 | "source": [ 633 | "### Neural Networks" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 14, 639 | "metadata": {}, 640 | "outputs": [ 641 | { 642 | "name": "stdout", 643 | "output_type": "stream", 644 | "text": [ 645 | "Train set\n", 646 | "Neural Network roc-auc: 0.678190277868159\n", 647 | "Test set\n", 648 | "Neural Network roc-auc: 0.666547619047619\n" 649 | ] 650 | } 651 | ], 652 | "source": [ 653 | "# model built on unscaled features\n", 654 | "\n", 655 | "NN_model = MLPClassifier(random_state=44, solver='sgd')\n", 656 | "NN_model.fit(X_train, y_train)\n", 657 | "print('Train set')\n", 658 | "pred = NN_model.predict_proba(X_train)\n", 659 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 660 | "print('Test set')\n", 661 | "pred = NN_model.predict_proba(X_test)\n", 662 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 15, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "name": "stdout", 672 | "output_type": "stream", 673 | "text": [ 674 | "Train set\n", 675 | "Neural Network roc-auc: 0.7161937918917161\n", 676 | "Test set\n", 677 | "Neural Network roc-auc: 0.7125\n" 678 | ] 679 | } 680 | ], 681 | "source": [ 682 | "# model built on scaled features\n", 683 | "\n", 684 | "NN_model = MLPClassifier(random_state=44, solver='sgd')\n", 685 | "NN_model.fit(X_train_scaled, y_train)\n", 686 | "print('Train set')\n", 687 | "pred = NN_model.predict_proba(X_train_scaled)\n", 688 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 689 | "print('Test set')\n", 690 | "pred = NN_model.predict_proba(X_test_scaled)\n", 691 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "We observe that scaling the features improved the performance of the neural network both for the training and the testing set (compare roc-auc values: training 0.67 vs 0.71; testing: 0.66 vs 0.71). The roc-auc increases in both training and testing sets when the model is trained on a dataset with scaled features." 
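, "\n", "A likely reason: the MLP here uses the 'sgd' solver, and, as discussed at the start of this notebook, gradient descent converges faster when all the predictors are on a similar scale."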
699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "### K-Nearest Neighbours" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 16, 711 | "metadata": {}, 712 | "outputs": [ 713 | { 714 | "name": "stdout", 715 | "output_type": "stream", 716 | "text": [ 717 | "Train set\n", 718 | "KNN roc-auc: 0.8694225721784778\n", 719 | "Test set\n", 720 | "KNN roc-auc: 0.6253571428571428\n" 721 | ] 722 | } 723 | ], 724 | "source": [ 725 | "# model built on unscaled features\n", 726 | "\n", 727 | "KNN = KNeighborsClassifier(n_neighbors=3)\n", 728 | "KNN.fit(X_train, y_train)\n", 729 | "print('Train set')\n", 730 | "pred = KNN.predict_proba(X_train)\n", 731 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 732 | "print('Test set')\n", 733 | "pred = KNN.predict_proba(X_test)\n", 734 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 17, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "name": "stdout", 744 | "output_type": "stream", 745 | "text": [ 746 | "Train set\n", 747 | "KNN roc-auc: 0.8880555736318084\n", 748 | "Test set\n", 749 | "KNN roc-auc: 0.7017559523809525\n" 750 | ] 751 | } 752 | ], 753 | "source": [ 754 | "# model built on scaled features\n", 755 | "\n", 756 | "KNN = KNeighborsClassifier(n_neighbors=3)\n", 757 | "KNN.fit(X_train_scaled, y_train)\n", 758 | "print('Train set')\n", 759 | "pred = KNN.predict_proba(X_train_scaled)\n", 760 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 761 | "print('Test set')\n", 762 | "pred = KNN.predict_proba(X_test_scaled)\n", 763 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 764 | ] 765 | }, 766 | { 767 | "cell_type": "markdown", 768 | "metadata": {}, 769 | "source": [ 770 | "We observe for KNN as well that feature scaling improved the performance of the model. The model built on scaled features generalises better, with a higher roc-auc for the testing set (0.70, vs 0.62 for the model built on unscaled features).\n", 771 | "\n", 772 | "Both KNN models are over-fitting to the train set. Thus, we would need to change the parameters of the model or use fewer features to try to decrease the over-fitting, which is beyond the scope of this demonstration."
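, "\n", "Recall the ranges computed earlier: Fare spans roughly 512 units while Pclass spans only 2, so without scaling, the Euclidean distance that KNN relies on is driven almost entirely by Fare."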
773 | ] 774 | }, 775 | { 776 | "cell_type": "markdown", 777 | "metadata": {}, 778 | "source": [ 779 | "### Random Forests" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": 18, 785 | "metadata": {}, 786 | "outputs": [ 787 | { 788 | "name": "stdout", 789 | "output_type": "stream", 790 | "text": [ 791 | "Train set\n", 792 | "Random Forests roc-auc: 0.9914589705212468\n", 793 | "Test set\n", 794 | "Random Forests roc-auc: 0.7602678571428572\n" 795 | ] 796 | } 797 | ], 798 | "source": [ 799 | "# model built on unscaled features\n", 800 | "\n", 801 | "rf = RandomForestClassifier(n_estimators=700, random_state=39)\n", 802 | "rf.fit(X_train, y_train)\n", 803 | "print('Train set')\n", 804 | "pred = rf.predict_proba(X_train)\n", 805 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 806 | "print('Test set')\n", 807 | "pred = rf.predict_proba(X_test)\n", 808 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": 19, 814 | "metadata": {}, 815 | "outputs": [ 816 | { 817 | "name": "stdout", 818 | "output_type": "stream", 819 | "text": [ 820 | "Train set\n", 821 | "Random Forests roc-auc: 0.9914589705212469\n", 822 | "Test set\n", 823 | "Random Forests roc-auc: 0.7602380952380952\n" 824 | ] 825 | } 826 | ], 827 | "source": [ 828 | "# model built on scaled features\n", 829 | "rf = RandomForestClassifier(n_estimators=700, random_state=39)\n", 830 | "rf.fit(X_train_scaled, y_train)\n", 831 | "print('Train set')\n", 832 | "pred = rf.predict_proba(X_train_scaled)\n", 833 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 834 | "print('Test set')\n", 835 | "pred = rf.predict_proba(X_test_scaled)\n", 836 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": {}, 842 | "source": [ 843 | "As expected, Random Forests shows no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features." 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": 20, 849 | "metadata": {}, 850 | "outputs": [ 851 | { 852 | "name": "stdout", 853 | "output_type": "stream", 854 | "text": [ 855 | "Train set\n", 856 | "AdaBoost roc-auc: 0.847736491616234\n", 857 | "Test set\n", 858 | "AdaBoost roc-auc: 0.7733630952380953\n" 859 | ] 860 | } 861 | ], 862 | "source": [ "# model built on unscaled features\n", "\n", 863 | "ada = AdaBoostClassifier(n_estimators=200, random_state=44)\n", 864 | "ada.fit(X_train, y_train)\n", 865 | "print('Train set')\n", 866 | "pred = ada.predict_proba(X_train)\n", 867 | "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 868 | "print('Test set')\n", 869 | "pred = ada.predict_proba(X_test)\n", 870 | "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 21, 876 | "metadata": {}, 877 | "outputs": [ 878 | { 879 | "name": "stdout", 880 | "output_type": "stream", 881 | "text": [ 882 | "Train set\n", 883 | "AdaBoost roc-auc: 0.847736491616234\n", 884 | "Test set\n", 885 | "AdaBoost roc-auc: 0.7733630952380953\n" 886 | ] 887 | } 888 | ], 889 | "source": [ "# model built on scaled features\n", "\n", 890 | "ada = AdaBoostClassifier(n_estimators=200, random_state=44)\n", 891 | "ada.fit(X_train_scaled, y_train)\n", 892 | "print('Train set')\n", 893 | "pred = ada.predict_proba(X_train_scaled)\n", 894 | "print('AdaBoost roc-auc: 
{}'.format(roc_auc_score(y_train, pred[:,1])))\n", 895 | "print('Test set')\n", 896 | "pred = ada.predict_proba(X_test_scaled)\n", 897 | "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "As expected, AdaBoost shows no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features." 905 | ] 906 | }, 907 | { 908 | "cell_type": "markdown", 909 | "metadata": { 910 | "collapsed": true 911 | }, 912 | "source": [ 913 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 914 | ] 915 | } 916 | ], 917 | "metadata": { 918 | "kernelspec": { 919 | "display_name": "Python 3", 920 | "language": "python", 921 | "name": "python3" 922 | }, 923 | "language_info": { 924 | "codemirror_mode": { 925 | "name": "ipython", 926 | "version": 3 927 | }, 928 | "file_extension": ".py", 929 | "mimetype": "text/x-python", 930 | "name": "python", 931 | "nbconvert_exporter": "python", 932 | "pygments_lexer": "ipython3", 933 | "version": "3.6.1" 934 | }, 935 | "toc": { 936 | "nav_menu": {}, 937 | "number_sections": true, 938 | "sideBar": true, 939 | "skip_h1_title": false, 940 | "toc_cell": false, 941 | "toc_position": {}, 942 | "toc_section_display": "block", 943 | "toc_window_display": true 944 | } 945 | }, 946 | "nbformat": 4, 947 | "nbformat_minor": 2 948 | } 949 | -------------------------------------------------------------------------------- /04.4_Bonus_Additional_reading_resources.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Bonus: Additional reading resources\n", 8 | "\n", 9 | "### Further reading on the importance of feature scaling\n", 10 | "\n", 11 | "- [Should I normalize/standardize/rescale the data?](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)\n", 12 | "- [Efficient BackProp by Yann LeCun](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)\n", 13 | "- [Standardization and Its Effects on K-Means Clustering Algorithm](http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf)\n", 14 | "- [Feature scaling in support vector data description](http://rduin.nl/papers/asci_02_occ.pdf)\n", 15 | "\n", 16 | "### Further reading on linear and non-linear models\n", 17 | "\n", 18 | "- [An introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf)\n", 19 | "- [Elements of Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf)\n", 20 | "\n", 21 | "\n", 22 | "### More on Q-Q Plots\n", 23 | "\n", 24 | "- [Q-Q Plots in wiki](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot)\n", 25 | "- [How to interpret a Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot)\n", 26 | "- [Q-Q Plot visualisation tool](https://xiongge.shinyapps.io/QQplots/)\n", 27 | "\n", 28 | "### Kaggle kernels on the Sale Price dataset\n", 29 | "\n", 30 | "- [Comprehensive Data Exploration](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [] 41 | } 42 | ], 43 | "metadata": { 44 | "kernelspec": { 45 | "display_name": "Python 3", 46 | "language": "python", 47 | "name": "python3" 48 | }, 49 | "language_info": { 50 | 
"codemirror_mode": { 51 | "name": "ipython", 52 | "version": 3 53 | }, 54 | "file_extension": ".py", 55 | "mimetype": "text/x-python", 56 | "name": "python", 57 | "nbconvert_exporter": "python", 58 | "pygments_lexer": "ipython3", 59 | "version": "3.6.1" 60 | }, 61 | "toc": { 62 | "nav_menu": {}, 63 | "number_sections": true, 64 | "sideBar": true, 65 | "skip_h1_title": false, 66 | "toc_cell": false, 67 | "toc_position": {}, 68 | "toc_section_display": "block", 69 | "toc_window_display": false 70 | } 71 | }, 72 | "nbformat": 4, 73 | "nbformat_minor": 2 74 | } 75 | -------------------------------------------------------------------------------- /05.6_Arbitrary_value_imputation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Arbitrary value imputation\n", 8 | "\n", 9 | "Replacing the NA by artitrary values should be used when there are reasons to believe that the NA are not missing at random. In situations like this, we would not like to replace with the median or the mean, and therefore make the NA look like the majority of our observations.\n", 10 | "\n", 11 | "Instead, we want to flag them. We want to capture the missingness somehow.\n", 12 | "\n", 13 | "In previous lectures we saw 2 methods to do this:\n", 14 | "\n", 15 | "1) adding an additional binary variable to indicate whether the value is missing (1) or not (0)\n", 16 | "\n", 17 | "2) replacing the NA by a value at a far end of the distribution\n", 18 | "\n", 19 | "Here, I suggest an alternative to option 2, which I have seen in several Kaggle competitions. It consists of replacing the NA by an arbitrary value. Any of your creation, but ideally different from the median/mean/mode, and not within the normal values of the variable.\n", 20 | "\n", 21 | "The problem consists in deciding which arbitrary value to choose.\n", 22 | "\n", 23 | "### Advantages\n", 24 | "\n", 25 | "- Easy to implement\n", 26 | "- Captures the importance of missingess if there is one\n", 27 | "\n", 28 | "### Disadvantages\n", 29 | "\n", 30 | "- Distorts the original distribution of the variable\n", 31 | "- If missingess is not important, it may mask the predictive power of the original variable by distorting its distribution\n", 32 | "- Hard to decide which value to use\n", 33 | " If the value is outside the distribution it may mask or create outliers\n", 34 | "\n", 35 | "### Final note\n", 36 | "\n", 37 | "When variables are captured by third parties, like credit agencies, they place arbitrary numbers already to signal this missingness. So if not common practice in data competitions, it is common practice in real life data collections." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "===============================================================================\n", 45 | "\n", 46 | "## Real Life example: \n", 47 | "\n", 48 | "### Predicting Survival on the Titanic: understanding society behaviour and beliefs\n", 49 | "\n", 50 | "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. 
Therefore, we can learn about society's priorities and privileges at the time.\n", 51 | "\n", 52 | "=============================================================================\n", 53 | "\n", 54 | "In the following cells, I will show how this procedure impacts features and machine learning using the Titanic dataset from Kaggle.\n", 55 | "\n", 56 | "If you haven't downloaded the datasets yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so." 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 1, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "import pandas as pd\n", 68 | "import numpy as np\n", 69 | "\n", 70 | "# for classification\n", 71 | "from sklearn.linear_model import LogisticRegression\n", 72 | "from sklearn.ensemble import RandomForestClassifier\n", 73 | "\n", 74 | "# to split the datasets\n", 75 | "from sklearn.model_selection import train_test_split\n", 76 | "\n", 77 | "# to evaluate classification models\n", 78 | "from sklearn.metrics import roc_auc_score\n", 79 | "\n", 80 | "import warnings\n", 81 | "warnings.filterwarnings('ignore')" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 2, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/html": [ 92 | "<div>
\n", 93 | "\n", 106 | "\n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | "
SurvivedAgeFare
0022.07.2500
1138.071.2833
2126.07.9250
3135.053.1000
4035.08.0500
\n", 148 | "
" 149 | ], 150 | "text/plain": [ 151 | " Survived Age Fare\n", 152 | "0 0 22.0 7.2500\n", 153 | "1 1 38.0 71.2833\n", 154 | "2 1 26.0 7.9250\n", 155 | "3 1 35.0 53.1000\n", 156 | "4 0 35.0 8.0500" 157 | ] 158 | }, 159 | "execution_count": 2, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "# load the Titanic Dataset with a few variables for demonstration\n", 166 | "\n", 167 | "data = pd.read_csv('titanic.csv', usecols = ['Age', 'Fare','Survived'])\n", 168 | "data.head()" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 3, 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "data": { 178 | "text/plain": [ 179 | "Survived 0.000000\n", 180 | "Age 0.198653\n", 181 | "Fare 0.000000\n", 182 | "dtype: float64" 183 | ] 184 | }, 185 | "execution_count": 3, 186 | "metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "# let's look at the percentage of NA\n", 192 | "data.isnull().mean()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 4, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/plain": [ 203 | "((623, 3), (268, 3))" 204 | ] 205 | }, 206 | "execution_count": 4, 207 | "metadata": {}, 208 | "output_type": "execute_result" 209 | } 210 | ], 211 | "source": [ 212 | "# let's separate into training and testing set\n", 213 | "\n", 214 | "X_train, X_test, y_train, y_test = train_test_split(data, data.Survived, test_size=0.3,\n", 215 | " random_state=0)\n", 216 | "X_train.shape, X_test.shape" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 5, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "def impute_na(df, variable):\n", 228 | " df[variable+'_zero'] = df[variable].fillna(0)\n", 229 | " df[variable+'_hundred']= df[variable].fillna(100)\n", 230 | " " 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 6, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "data": { 240 | "text/html": [ 241 | "
\n", 242 | "\n", 255 | "\n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | "
SurvivedAgeFareAge_zeroAge_hundred
857151.026.550051.051.0
52149.076.729249.049.0
38601.046.90001.01.0
124054.077.287554.054.0
5780NaN14.45830.0100.0
54918.036.75008.08.0
118024.0247.520824.024.0
12020.08.050020.020.0
157030.08.050030.030.0
127124.07.141724.024.0
6531NaN7.82920.0100.0
2350NaN7.55000.0100.0
785025.07.250025.025.0
2411NaN15.50000.0100.0
3510NaN35.00000.0100.0
862148.025.929248.048.0
851074.07.775074.074.0
753023.07.895823.023.0
532017.07.229217.017.0
4850NaN25.46670.0100.0
\n", 429 | "
" 430 | ], 431 | "text/plain": [ 432 | " Survived Age Fare Age_zero Age_hundred\n", 433 | "857 1 51.0 26.5500 51.0 51.0\n", 434 | "52 1 49.0 76.7292 49.0 49.0\n", 435 | "386 0 1.0 46.9000 1.0 1.0\n", 436 | "124 0 54.0 77.2875 54.0 54.0\n", 437 | "578 0 NaN 14.4583 0.0 100.0\n", 438 | "549 1 8.0 36.7500 8.0 8.0\n", 439 | "118 0 24.0 247.5208 24.0 24.0\n", 440 | "12 0 20.0 8.0500 20.0 20.0\n", 441 | "157 0 30.0 8.0500 30.0 30.0\n", 442 | "127 1 24.0 7.1417 24.0 24.0\n", 443 | "653 1 NaN 7.8292 0.0 100.0\n", 444 | "235 0 NaN 7.5500 0.0 100.0\n", 445 | "785 0 25.0 7.2500 25.0 25.0\n", 446 | "241 1 NaN 15.5000 0.0 100.0\n", 447 | "351 0 NaN 35.0000 0.0 100.0\n", 448 | "862 1 48.0 25.9292 48.0 48.0\n", 449 | "851 0 74.0 7.7750 74.0 74.0\n", 450 | "753 0 23.0 7.8958 23.0 23.0\n", 451 | "532 0 17.0 7.2292 17.0 17.0\n", 452 | "485 0 NaN 25.4667 0.0 100.0" 453 | ] 454 | }, 455 | "execution_count": 6, 456 | "metadata": {}, 457 | "output_type": "execute_result" 458 | } 459 | ], 460 | "source": [ 461 | "# let's replace the NA with the median value in the training set\n", 462 | "impute_na(X_train, 'Age')\n", 463 | "impute_na(X_test, 'Age')\n", 464 | "\n", 465 | "X_train.head(20)" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "### Logistic Regression" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 7, 478 | "metadata": {}, 479 | "outputs": [ 480 | { 481 | "name": "stdout", 482 | "output_type": "stream", 483 | "text": [ 484 | "Train set\n", 485 | "Logistic Regression roc-auc: 0.6863462831608859\n", 486 | "Test set\n", 487 | "Logistic Regression roc-auc: 0.7137499999999999\n", 488 | "Train set\n", 489 | "Logistic Regression roc-auc: 0.6803594282119694\n", 490 | "Test set\n", 491 | "Logistic Regression roc-auc: 0.7227976190476191\n" 492 | ] 493 | } 494 | ], 495 | "source": [ 496 | "# we compare the models built using Age filled with zero, vs Age filled with 100\n", 497 | "\n", 498 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 499 | "logit.fit(X_train[['Age_zero','Fare']], y_train)\n", 500 | "print('Train set')\n", 501 | "pred = logit.predict_proba(X_train[['Age_zero','Fare']])\n", 502 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 503 | "print('Test set')\n", 504 | "pred = logit.predict_proba(X_test[['Age_zero','Fare']])\n", 505 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n", 506 | "\n", 507 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 508 | "logit.fit(X_train[['Age_hundred','Fare']], y_train)\n", 509 | "print('Train set')\n", 510 | "pred = logit.predict_proba(X_train[['Age_hundred','Fare']])\n", 511 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 512 | "print('Test set')\n", 513 | "pred = logit.predict_proba(X_test[['Age_hundred','Fare']])\n", 514 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 8, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "name": "stdout", 524 | "output_type": "stream", 525 | "text": [ 526 | "Train set zero imputation\n", 527 | "Random Forests roc-auc: 0.7555855621353116\n", 528 | "Test set zero imputation\n", 529 | "Random Forests zero imputation roc-auc: 0.7490476190476191\n", 530 | "\n", 531 | "Train set median imputation\n", 532 | "Random Forests roc-auc: 
0.7490781111038807\n", 533 | "Test set imputation with 100\n", 534 | "Random Forests roc-auc: 0.7653571428571431\n", 535 | "\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "# random forests\n", 541 | "\n", 542 | "rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)\n", 543 | "rf.fit(X_train[['Age_zero', 'Fare']], y_train)\n", 544 | "print('Train set zero imputation')\n", 545 | "pred = rf.predict_proba(X_train[['Age_zero', 'Fare']])\n", 546 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 547 | "print('Test set zero imputation')\n", 548 | "pred = rf.predict_proba(X_test[['Age_zero', 'Fare']])\n", 549 | "print('Random Forests zero imputation roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n", 550 | "print()\n", 551 | "rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)\n", 552 | "rf.fit(X_train[['Age_hundred', 'Fare']], y_train)\n", 553 | "print('Train set imputation with 100')\n", 554 | "pred = rf.predict_proba(X_train[['Age_hundred', 'Fare']])\n", 555 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 556 | "print('Test set imputation with 100')\n", 557 | "pred = rf.predict_proba(X_test[['Age_hundred', 'Fare']])\n", 558 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n", 559 | "print()" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "We can see that replacing NA with 100 makes the models perform better than replacing NA with 0. As you may remember from the lecture \"Replacing NA by mean or median\", this is because children were more likely to survive than adults. Filling NA with zeroes distorts this relation and makes the models lose predictive power. See below for a recap." 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 9, 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "name": "stdout", 576 | "output_type": "stream", 577 | "text": [ 578 | "Average real survival of children: 0.5740740740740741\n", 579 | "Average survival of children when using Age imputed with zeroes: 0.38857142857142857\n", 580 | "Average survival of children when using Age imputed with 100: 0.5740740740740741\n" 581 | ] 582 | } 583 | ], 584 | "source": [ 585 | "print('Average real survival of children: ', X_train[X_train.Age<15].Survived.mean())\n", 586 | "print('Average survival of children when using Age imputed with zeroes: ', X_train[X_train.Age_zero<15].Survived.mean())\n", 587 | "print('Average survival of children when using Age imputed with 100: ', X_train[X_train.Age_hundred<15].Survived.mean())" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "### Final notes\n", 595 | "\n", 596 | "The arbitrary value has to be determined for each variable specifically. For example, for this dataset, the choices of 0 and 100 to replace NA in Age are valid, because neither value is frequent in the original distribution of the variable, and both lie at the tails of the distribution.\n", 597 | "\n", 598 | "However, if we were to replace NA in Fare, those values are no longer suitable, because fare can take values of up to 500. So we might want to consider using 500 or 1000 to replace NA instead of 100.\n", 599 | "\n", 600 | "As you can see, this is totally arbitrary. And yet, it is used in the industry.\n", 601 | "\n", 602 | "Typical values chosen by companies are -9999 or 9999, or similar." 
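As a side note, the choice of the arbitrary value can be automated. Below is a minimal sketch, not part of the original notebook; the function name and the power-of-10 rule are my own, used here only to illustrate the idea of picking a round value just above the training-set maximum:

```python
def impute_arbitrary(df_train, df_test, variable, value=None):
    """Fill NA in `variable` with an arbitrary value outside its range.

    If no value is given, use the next power of 10 above the
    training-set maximum (e.g. a max of 512 gives 1000), so the
    filler cannot be confused with a genuine observation.
    """
    if value is None:
        value = 10 ** len(str(int(df_train[variable].max())))
    df_train[variable + '_arbitrary'] = df_train[variable].fillna(value)
    df_test[variable + '_arbitrary'] = df_test[variable].fillna(value)
    return value

# hypothetical usage with the X_train / X_test of this notebook:
# impute_arbitrary(X_train, X_test, 'Fare')  # Fare max ~512, so NA -> 1000
```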
603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": { 608 | "collapsed": true 609 | }, 610 | "source": [ 611 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": { 618 | "collapsed": true 619 | }, 620 | "outputs": [], 621 | "source": [] 622 | } 623 | ], 624 | "metadata": { 625 | "kernelspec": { 626 | "display_name": "Python 3", 627 | "language": "python", 628 | "name": "python3" 629 | }, 630 | "language_info": { 631 | "codemirror_mode": { 632 | "name": "ipython", 633 | "version": 3 634 | }, 635 | "file_extension": ".py", 636 | "mimetype": "text/x-python", 637 | "name": "python", 638 | "nbconvert_exporter": "python", 639 | "pygments_lexer": "ipython3", 640 | "version": "3.6.1" 641 | }, 642 | "toc": { 643 | "nav_menu": {}, 644 | "number_sections": true, 645 | "sideBar": true, 646 | "skip_h1_title": false, 647 | "toc_cell": false, 648 | "toc_position": {}, 649 | "toc_section_display": "block", 650 | "toc_window_display": false 651 | } 652 | }, 653 | "nbformat": 4, 654 | "nbformat_minor": 2 655 | } 656 | -------------------------------------------------------------------------------- /06.3_Adding_a_variable_to_capture_NA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Adding a variable to capture NA\n", 8 | "\n", 9 | "In the previous lectures within this section we studied how to replace missing values by the most frequent category or by extracting a random sample of the variable. These 2 methods assume that the missing data are missing completely at random (MCAR), and are suitable when the number of missing values is small; otherwise they may distort the distribution of the target within the labels of the variable.\n", 10 | "\n", 11 | "So what if the number of missing values is not small, or the data are not MCAR?\n", 12 | "\n", 13 | "We can capture the importance of missingness by creating an additional variable indicating whether the data was missing for that observation (1) or not (0). The additional variable is a binary variable: it takes only the values 0 and 1, 0 indicating that a value was present for that observation, and 1 indicating that the value was missing for that observation.\n", 14 | "\n", 15 | "The procedure is exactly the same as for numerical variables.\n", 16 | "\n", 17 | "\n", 18 | "### Advantages\n", 19 | "\n", 20 | "- Easy to implement\n", 21 | "- Captures the importance of missingness if there is one\n", 22 | "\n", 23 | "### Disadvantages\n", 24 | "\n", 25 | "- Expands the feature space\n", 26 | "\n", 27 | "This method of imputation will add 1 variable per variable in the dataset with missing values. So if a dataset contains 10 features, and all of them have missing values, we will end up with a dataset with 20 features. 
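The binary indicator described above is a one-liner in pandas. A minimal, self-contained sketch with toy data (not from the notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'BsmtQual': ['Gd', np.nan, 'TA', np.nan]})

# 1 where the value was missing, 0 where it was present
df['BsmtQual_NA'] = df['BsmtQual'].isnull().astype(int)
```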
That is, the original 10 features, in which we replaced the NA by the frequent label or by random sampling, plus 10 additional features indicating, for each variable, whether the value was missing or not.\n", 28 | "\n", 29 | "This may not be a problem in datasets with tens to a few hundreds of variables, but if your original dataset contains thousands of variables, by creating an additional variable to indicate NA, you will end up with very big datasets.\n", 30 | "\n", 31 | "In addition, data tends to be missing for the same observation on multiple variables, so it may also be the case that many of your added variables will actually be similar to each other." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "===============================================================================\n", 39 | "\n", 40 | "## Real Life example: \n", 41 | "\n", 42 | "### Predicting Sale Price of Houses\n", 43 | "\n", 44 | "The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential homes. Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is over or underestimated, before making a buying judgment.\n", 45 | "\n", 46 | "=============================================================================\n", 47 | "\n", 48 | "In the following cells, I will demonstrate NA imputation by random sampling + adding an additional variable using the House Price dataset from Kaggle.\n", 49 | "\n", 50 | "If you haven't downloaded the datasets yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 1, 56 | "metadata": { 57 | "collapsed": true 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "import pandas as pd\n", 62 | "import numpy as np\n", 63 | "\n", 64 | "import matplotlib.pyplot as plt\n", 65 | "%matplotlib inline\n", 66 | "\n", 67 | "# for regression problems\n", 68 | "from sklearn.linear_model import LinearRegression, Ridge\n", 69 | "\n", 70 | "# to split the datasets\n", 71 | "from sklearn.model_selection import train_test_split\n", 72 | "\n", 73 | "# to evaluate regression models\n", 74 | "from sklearn.metrics import mean_squared_error\n", 75 | "\n", 76 | "import warnings\n", 77 | "warnings.filterwarnings('ignore')" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### House Price dataset" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 2, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "SalePrice 0.000000\n", 96 | "BsmtQual 0.025342\n", 97 | "GarageType 0.055479\n", 98 | "FireplaceQu 0.472603\n", 99 | "dtype: float64" 100 | ] 101 | }, 102 | "execution_count": 2, 103 | "metadata": {}, 104 | "output_type": "execute_result" 105 | } 106 | ], 107 | "source": [ 108 | "# let's load the dataset with a few columns for the demonstration\n", 109 | "cols_to_use = ['BsmtQual', 'FireplaceQu', 'GarageType', 'SalePrice']\n", 110 | "\n", 111 | "data = pd.read_csv('houseprice.csv', usecols=cols_to_use)\n", 112 | "\n", 113 | "# let's inspect the percentage of missing values in each variable\n", 114 | "data.isnull().mean().sort_values(ascending=True)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "**To evaluate whether adding these additional 
variables to indicate missingness improves the performance of the ML algorithms, I will replace missing values by random sampling (see previous lecture) as well.**\n", 122 | "\n", 123 | "### Important note on imputation\n", 124 | "\n", 125 | "Imputation should be done over the training set, and then propagated to the test set. This means that the random sampling of categories should be done from the training set, and used to replace NA both in train and test sets." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 3, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "((1022, 3), (438, 3))" 137 | ] 138 | }, 139 | "execution_count": 3, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "# let's separate into training and testing set\n", 146 | "\n", 147 | "X_train, X_test, y_train, y_test = train_test_split(data[['BsmtQual', 'FireplaceQu', 'GarageType']],\n", 148 | " data.SalePrice, test_size=0.3,\n", 149 | " random_state=0)\n", 150 | "X_train.shape, X_test.shape" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 4, 156 | "metadata": { 157 | "collapsed": true 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "# let's create a variable to replace NA with a random sample of labels available within that variable\n", 162 | "# in the same function we create the additional variable to indicate missingness\n", 163 | "\n", 164 | "# make sure you understand every line of code.\n", 165 | "# If unsure, run them separately in a cell in the notebook until you familiarise yourself with the output\n", 166 | "# of each line\n", 167 | "\n", 168 | "def impute_na(df_train, df_test, variable):\n", 169 | "    # add additional variable to indicate missingness\n", 170 | "    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 1, 0)\n", 171 | "    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 1, 0)\n", 172 | "    \n", 173 | "    # random sampling\n", 174 | "    df_train[variable+'_random'] = df_train[variable]\n", 175 | "    df_test[variable+'_random'] = df_test[variable]\n", 176 | "    \n", 177 | "    # extract random samples from the train set to fill the NA in both train and test sets\n", 178 | "    random_sample_train = df_train[variable].dropna().sample(df_train[variable].isnull().sum(), random_state=0)\n", 179 | "    random_sample_test = df_train[variable].dropna().sample(df_test[variable].isnull().sum(), random_state=0)\n", 180 | "    \n", 181 | "    # pandas needs to have the same index in order to merge datasets\n", 182 | "    random_sample_train.index = df_train[df_train[variable].isnull()].index\n", 183 | "    random_sample_test.index = df_test[df_test[variable].isnull()].index\n", 184 | "    \n", 185 | "    df_train.loc[df_train[variable].isnull(), variable+'_random'] = random_sample_train\n", 186 | "    df_test.loc[df_test[variable].isnull(), variable+'_random'] = random_sample_test" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 5, 192 | "metadata": { 193 | "collapsed": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "# and let's replace the NA\n", 198 | "for variable in ['BsmtQual', 'FireplaceQu', 'GarageType',]:\n", 199 | "    impute_na(X_train, X_test, variable)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 6, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "BsmtQual 24\n", 211 | "FireplaceQu 478\n", 212 | "GarageType 54\n", 213 | "BsmtQual_NA 0\n", 214 | "BsmtQual_random 0\n", 215 | "FireplaceQu_NA 0\n", 216 | 
"FireplaceQu_random 0\n", 217 | "GarageType_NA 0\n", 218 | "GarageType_random 0\n", 219 | "dtype: int64" 220 | ] 221 | }, 222 | "execution_count": 6, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "# let's inspect that NA were replaced\n", 229 | "X_train.isnull().sum()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 7, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/html": [ 240 | "
\n", 241 | "\n", 254 | "\n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | "
BsmtQualFireplaceQuGarageTypeBsmtQual_NABsmtQual_randomFireplaceQu_NAFireplaceQu_randomGarageType_NAGarageType_random
64GdNaNAttchd0Gd1Gd0Attchd
682GdGdAttchd0Gd0Gd0Attchd
960TANaNNaN0TA1TA1Attchd
1384TANaNDetchd0TA1TA0Detchd
1100TANaNDetchd0TA1Gd0Detchd
\n", 332 | "
" 333 | ], 334 | "text/plain": [ 335 | " BsmtQual FireplaceQu GarageType BsmtQual_NA BsmtQual_random \\\n", 336 | "64 Gd NaN Attchd 0 Gd \n", 337 | "682 Gd Gd Attchd 0 Gd \n", 338 | "960 TA NaN NaN 0 TA \n", 339 | "1384 TA NaN Detchd 0 TA \n", 340 | "1100 TA NaN Detchd 0 TA \n", 341 | "\n", 342 | " FireplaceQu_NA FireplaceQu_random GarageType_NA GarageType_random \n", 343 | "64 1 Gd 0 Attchd \n", 344 | "682 0 Gd 0 Attchd \n", 345 | "960 1 TA 1 Attchd \n", 346 | "1384 1 TA 0 Detchd \n", 347 | "1100 1 Gd 0 Detchd " 348 | ] 349 | }, 350 | "execution_count": 7, 351 | "metadata": {}, 352 | "output_type": "execute_result" 353 | } 354 | ], 355 | "source": [ 356 | "X_train.head()" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 8, 362 | "metadata": {}, 363 | "outputs": [ 364 | { 365 | "data": { 366 | "text/html": [ 367 | "
\n", 368 | "\n", 381 | "\n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | "
BsmtQual_NAFireplaceQu_NAGarageType_NA
count1022.0000001022.0000001022.000000
mean0.0234830.4677100.052838
std0.1515070.4992010.223819
min0.0000000.0000000.000000
25%0.0000000.0000000.000000
50%0.0000000.0000000.000000
75%0.0000001.0000000.000000
max1.0000001.0000001.000000
\n", 441 | "
" 442 | ], 443 | "text/plain": [ 444 | " BsmtQual_NA FireplaceQu_NA GarageType_NA\n", 445 | "count 1022.000000 1022.000000 1022.000000\n", 446 | "mean 0.023483 0.467710 0.052838\n", 447 | "std 0.151507 0.499201 0.223819\n", 448 | "min 0.000000 0.000000 0.000000\n", 449 | "25% 0.000000 0.000000 0.000000\n", 450 | "50% 0.000000 0.000000 0.000000\n", 451 | "75% 0.000000 1.000000 0.000000\n", 452 | "max 1.000000 1.000000 1.000000" 453 | ] 454 | }, 455 | "execution_count": 8, 456 | "metadata": {}, 457 | "output_type": "execute_result" 458 | } 459 | ], 460 | "source": [ 461 | "X_train.describe()" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 9, 467 | "metadata": { 468 | "collapsed": true 469 | }, 470 | "outputs": [], 471 | "source": [ 472 | "# let's transform the categories into numbers quick and dirty so we can use them in scikit-learn\n", 473 | "\n", 474 | "# the below function numbers the labels from 0 to n, n being the number of different labels \n", 475 | "# within the variable\n", 476 | "\n", 477 | "for col in ['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',]:\n", 478 | " labels_dict = {k:i for i, k in enumerate(X_train[col].unique(), 0)}\n", 479 | " X_train.loc[:, col] = X_train.loc[:, col].map(labels_dict )\n", 480 | " X_test.loc[:, col] = X_test.loc[:, col].map(labels_dict)" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "### Linear Regression" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 10, 493 | "metadata": {}, 494 | "outputs": [ 495 | { 496 | "name": "stdout", 497 | "output_type": "stream", 498 | "text": [ 499 | "Test set random imputation\n", 500 | "Linear Regression mse: 6456070592.706035\n", 501 | "\n", 502 | "Test set random imputation + additional variable indicating missingness\n", 503 | "Linear Regression mse: 4911877327.956806\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "# Let's evaluate the performance of Linear Regression\n", 509 | "\n", 510 | "# first we build a model using ONLY the variable with the NA replaced by a random sample\n", 511 | "linreg = LinearRegression()\n", 512 | "linreg.fit(X_train[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random']], y_train)\n", 513 | "print('Test set random imputation')\n", 514 | "pred = linreg.predict(X_test[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random']])\n", 515 | "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))\n", 516 | "print()\n", 517 | "\n", 518 | "# second we build a model including the variable that indicates missingness as well\n", 519 | "linreg = LinearRegression()\n", 520 | "linreg.fit(X_train[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',\n", 521 | " 'BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']], y_train)\n", 522 | "print('Test set random imputation + additional variable indicating missingness')\n", 523 | "pred = linreg.predict(X_test[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',\n", 524 | " 'BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']])\n", 525 | "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "Amazing, in this exercise we can see the power of creating that additional variable to capture missingness. The mse on the test set decreased dramatically when we included these variables that indicate that the observations contained missing values. 
That represents a reduction in mse of:" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 11, 538 | "metadata": {}, 539 | "outputs": [ 540 | { 541 | "data": { 542 | "text/plain": [ 543 | "1544193265" 544 | ] 545 | }, 546 | "execution_count": 11, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "6456070592-4911877327" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "About 1.5 billion! Bear in mind that the mse is measured in squared dollars, so this is best read as a substantial relative improvement rather than a literal dollar saving." 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": { 565 | "collapsed": true 566 | }, 567 | "source": [ 568 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 569 | ] 570 | } 571 | ], 572 | "metadata": { 573 | "kernelspec": { 574 | "display_name": "Python 3", 575 | "language": "python", 576 | "name": "python3" 577 | }, 578 | "language_info": { 579 | "codemirror_mode": { 580 | "name": "ipython", 581 | "version": 3 582 | }, 583 | "file_extension": ".py", 584 | "mimetype": "text/x-python", 585 | "name": "python", 586 | "nbconvert_exporter": "python", 587 | "pygments_lexer": "ipython3", 588 | "version": "3.6.1" 589 | }, 590 | "toc": { 591 | "nav_menu": {}, 592 | "number_sections": true, 593 | "sideBar": true, 594 | "skip_h1_title": false, 595 | "toc_cell": false, 596 | "toc_position": {}, 597 | "toc_section_display": "block", 598 | "toc_window_display": false 599 | } 600 | }, 601 | "nbformat": 4, 602 | "nbformat_minor": 2 603 | } 604 | -------------------------------------------------------------------------------- /06.4_Adding_a_category_to_capture_NA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Adding a category to capture NA\n", 8 | "\n", 9 | "This is perhaps the most widely used method of missing data imputation for categorical variables. This method consists of treating missing data as if they were an additional label or category of the variable. All the missing observations are grouped in the newly created label 'Missing'. \n", 10 | "\n", 11 | "The beauty of this technique resides in the fact that it does not assume anything about the missingness of the values. It is very well suited when the number of missing values is high.\n", 12 | "\n", 13 | "\n", 14 | "### Advantages\n", 15 | "\n", 16 | "- Easy to implement\n", 17 | "- Captures the importance of missingness if there is one\n", 18 | "\n", 19 | "### Disadvantages\n", 20 | "\n", 21 | "- If the number of NA is small, creating an additional category may cause trees to over-fit\n", 22 | "\n", 23 | "I would say that for categorical variables this is the method of choice, as it treats missing values as a separate category, without making any assumption on their missingness. It is used widely in data science competitions and business settings. 
See for example the winning solution of the KDD 2009 cup: \"Winning the KDD Cup Orange Challenge with Ensemble Selection\" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf).\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "===============================================================================\n", 31 | "\n", 32 | "## Real Life example: \n", 33 | "\n", 34 | "### Predicting Sale Price of Houses\n", 35 | "\n", 36 | "The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential homes. Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is over or underestimated, before making a buying judgment.\n", 37 | "\n", 38 | "=============================================================================\n", 39 | "\n", 40 | "In the following cells, I will demonstrate NA imputation by adding an additional label using the House Price dataset from Kaggle.\n", 41 | "\n", 42 | "If you haven't downloaded the datasets yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so." 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 1, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "import pandas as pd\n", 54 | "import numpy as np\n", 55 | "\n", 56 | "import matplotlib.pyplot as plt\n", 57 | "%matplotlib inline\n", 58 | "\n", 59 | "# for regression problems\n", 60 | "from sklearn.linear_model import LinearRegression, Ridge\n", 61 | "\n", 62 | "# to split the datasets\n", 63 | "from sklearn.model_selection import train_test_split\n", 64 | "\n", 65 | "# to evaluate regression models\n", 66 | "from sklearn.metrics import mean_squared_error\n", 67 | "\n", 68 | "import warnings\n", 69 | "warnings.filterwarnings('ignore')" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### House Price dataset" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 2, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "SalePrice 0.000000\n", 88 | "BsmtQual 0.025342\n", 89 | "GarageType 0.055479\n", 90 | "FireplaceQu 0.472603\n", 91 | "dtype: float64" 92 | ] 93 | }, 94 | "execution_count": 2, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "# let's load the dataset with a few columns for the demonstration\n", 101 | "cols_to_use = ['BsmtQual', 'FireplaceQu', 'GarageType', 'SalePrice']\n", 102 | "\n", 103 | "data = pd.read_csv('houseprice.csv', usecols=cols_to_use)\n", 104 | "\n", 105 | "# let's inspect the percentage of missing values in each variable\n", 106 | "data.isnull().mean().sort_values(ascending=True)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "data": { 116 | "text/plain": [ 117 | "((1022, 3), (438, 3))" 118 | ] 119 | }, 120 | "execution_count": 3, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "# let's separate into training and testing set\n", 127 | "\n", 128 | "X_train, X_test, y_train, y_test = train_test_split(data[['BsmtQual', 'FireplaceQu', 'GarageType']],\n", 129 | " data.SalePrice, test_size=0.3,\n", 130 | " random_state=0)\n", 131 | "X_train.shape, X_test.shape" 132 | ] 133 | }, 134 | 
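The next cell builds the 'Missing' label with np.where. For reference (this is a sketch, not part of the original notebook, and it assumes the X_train and X_test created above), pandas' fillna achieves the same replacement in one line per dataset:

```python
# equivalent to the np.where approach in the next cell
for variable in ['BsmtQual', 'FireplaceQu', 'GarageType']:
    X_train[variable + '_NA'] = X_train[variable].fillna('Missing')
    X_test[variable + '_NA'] = X_test[variable].fillna('Missing')
```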
{ 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": { 138 | "collapsed": true 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "# let's create a new variable, replacing the NA with the additional label 'Missing'\n", 143 | "\n", 144 | "def impute_na(df_train, df_test, variable):\n", 145 | "    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 'Missing', df_train[variable])\n", 146 | "    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 'Missing', df_test[variable])" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 5, 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "# and let's replace the NA\n", 158 | "for variable in ['BsmtQual', 'FireplaceQu', 'GarageType',]:\n", 159 | "    impute_na(X_train, X_test, variable)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 6, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "BsmtQual 24\n", 171 | "FireplaceQu 478\n", 172 | "GarageType 54\n", 173 | "BsmtQual_NA 0\n", 174 | "FireplaceQu_NA 0\n", 175 | "GarageType_NA 0\n", 176 | "dtype: int64" 177 | ] 178 | }, 179 | "execution_count": 6, 180 | "metadata": {}, 181 | "output_type": "execute_result" 182 | } 183 | ], 184 | "source": [ 185 | "# let's check that data have been completed\n", 186 | "X_train.isnull().sum()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 7, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/html": [ 197 | "<div>
\n", 198 | "\n", 211 | "\n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | "
BsmtQualFireplaceQuGarageTypeBsmtQual_NAFireplaceQu_NAGarageType_NA
64GdNaNAttchdGdMissingAttchd
682GdGdAttchdGdGdAttchd
960TANaNNaNTAMissingMissing
1384TANaNDetchdTAMissingDetchd
1100TANaNDetchdTAMissingDetchd
\n", 271 | "
" 272 | ], 273 | "text/plain": [ 274 | " BsmtQual FireplaceQu GarageType BsmtQual_NA FireplaceQu_NA GarageType_NA\n", 275 | "64 Gd NaN Attchd Gd Missing Attchd\n", 276 | "682 Gd Gd Attchd Gd Gd Attchd\n", 277 | "960 TA NaN NaN TA Missing Missing\n", 278 | "1384 TA NaN Detchd TA Missing Detchd\n", 279 | "1100 TA NaN Detchd TA Missing Detchd" 280 | ] 281 | }, 282 | "execution_count": 7, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "# let's see how the new variable looks like, where data was missing we have\n", 289 | "# not the label 'Missing'\n", 290 | "\n", 291 | "X_train.head()" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 8, 297 | "metadata": {}, 298 | "outputs": [ 299 | { 300 | "data": { 301 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEmCAYAAABs7FscAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEv1JREFUeJzt3X+wZ3Vdx/Hnq+WXk6Ewe6MddvNiLdniCOKGP7IiyYGk\nXKyExaKtoSjDSe2HQTX9sNYoralMKvy59kPcwnTFRqMVIpsSFwFhUWQTCJgVVq0kq1WWd398z3a/\nXvfuvd+79+659/N9PmZ2vud8zjnf876Hy+ue8zm/UlVIktr1VX0XIElaXAa9JDXOoJekxhn0ktQ4\ng16SGmfQS1LjDHpJapxBL0mNM+glqXFH9F0AwMqVK2tycrLvMiRpWbn55ps/U1UTs823JIJ+cnKS\nHTt29F2GJC0rSe6by3x23UhS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIatyRumFoI\nk5e9r+8SALj3inP7LkGSvox79JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS\n1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcc08plhTfGSzpGHu0UtS4wx6SWqcQS9JjTPoJalxBr0k\nNc6gl6TGGfSS1DiDXpIaZ9BLUuPmHPRJViS5Jcm13fjxSa5Lcnf3edzQvJcn2ZXkriRnL0bhkqS5\nGWWP/uXAx4fGLwO2V9VaYHs3TpJ1wEbgFOAc4MokKxamXEnSqOYU9ElWA+cCbxpq3gBs6Ya3AOcN\ntV9dVXur6h5gF3DGwpQrSRrVXPfofx94FfDYUNsJVbW7G/40cEI3fCJw/9B8D3RtXybJJUl2JNmx\nZ8+e0aqWJM3ZrEGf5HuAh6vq5pnmqaoCapQVV9VVVbW+qtZPTEyMsqgkaQRzeUzxtwIvTPIC4Bjg\n2CR/DjyUZFVV7U6yCni4m/9BYM3Q8qu7NklSD2bdo6+qy6tqdVVNMjjJ+sGq+iFgG7Cpm20T8J5u\neBuwMcnRSU4C1gI3LXjlkqQ5OZQXj1wBbE1yMXAfcD5AVe1MshW4E3gUuLSq9h1ypZKkeRkp6Kvq\nBuCGbvizwFkzzLcZ2HyItUmSFoB3xkpS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIa\nZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEG\nvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcbMGfZJjktyU5LYkO5P8etd+fJLrktzdfR43tMzlSXYluSvJ2Yv5A0iS\nDm4ue/R7gedV1anAacA5SZ4FXAZsr6q1wPZunCTrgI3AKcA5wJVJVixG8ZKk2c0a9DXwX93okd2/\nAjYAW7r2LcB53fAG4Oqq2ltV9wC7gDMWtGpJ0pzNqY8+yYoktwIPA9dV1YeBE6pqdzfLp4ETuuET\ngfuHFn+ga5v+nZck2ZFkx549e+b9A0iSDm5OQV9V+6rqNGA1cEaSp06bXgz28uesqq6qqvVVtX5i\nYmKURSVJIxjpqpuq+g/gegZ97w8lWQXQfT7czfYgsGZosdVdmySpB3O56mYiyRO74ccBzwc+AWwD\nNnWzbQLe0w1vAzYmOTrJScBa4KaFLlySNDdHzGGeVcCW7sqZrwK2VtW1Sf4Z2JrkYuA+4HyAqtqZ\nZCtwJ/AocGlV7Vuc8iVJs5k16KvqY8DTD9D+WeCsGZbZDGw+5OokSYfMO2MlqXEGvSQ1zqCXpMYZ\n9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEv\nSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLU\nOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1LhZgz7JmiTXJ7kzyc4kL+/aj09y\nXZK7u8/jhpa5PMmuJHclOXsxfwBJ0sHNZY/+UeBnq2od8Czg0iTrgMuA7VW1FtjejdNN2wicApwD\nXJlkxWIUL0ma3axBX1W7q+qj3fAjwMeBE4ENwJZuti3Aed3wBuDqqtpbVfcAu4AzFrpwSdLcjNRH\nn2QSeDrwYeCEqtrdTfo0cEI3fCJw/9BiD3Rt07/rkiQ7kuzYs2fPiGVLkuZqzkGf5PHANcArqurz\nw9OqqoAaZcVVdVVVra+q9RMTE6MsKkkawZyCPsmRDEL+L6rqXV3zQ0lWddNXAQ937Q8Ca4YWX921\nSZJ6MJerbgK8Gfh4Vf3e0KRtwKZueBPwnqH2jUmOTnISsBa4aeFKliSN4og5zPOtwEXA7Ulu7dp+\nEbgC2JrkYuA+4HyAqtqZZCtwJ4Mrdi6tqn0LXrkkaU5mDfqq+hCQGSafNcMym4HNh1CXJGmBeGes\nJDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS\n4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXO\noJekxhn0ktQ4g16SGm
fQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUuFmDPslbkjyc5I6h\ntuOTXJfk7u7zuKFplyfZleSuJGcvVuGSpLmZyx7924BzprVdBmyvqrXA9m6cJOuAjcAp3TJXJlmx\nYNVKkkY2a9BX1Y3A56Y1bwC2dMNbgPOG2q+uqr1VdQ+wCzhjgWqVJM3DfPvoT6iq3d3wp4ETuuET\ngfuH5nuga5Mk9eSQT8ZWVQE16nJJLkmyI8mOPXv2HGoZkqQZzDfoH0qyCqD7fLhrfxBYMzTf6q7t\nK1TVVVW1vqrWT0xMzLMMSdJs5hv024BN3fAm4D1D7RuTHJ3kJGAtcNOhlShJOhRHzDZDkncAZwIr\nkzwA/CpwBbA1ycXAfcD5AFW1M8lW4E7gUeDSqtq3SLVLkuZg1qCvqgtnmHTWDPNvBjYfSlGSpIXj\nnbGS1DiDXpIaZ9BLUuNm7aOXlrPJy97Xdwnce8W5fZegMecevSQ1zqCXpMYZ9JLUOINekhpn0EtS\n4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXO\noJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ17oi+C5B0eExe9r6+S+DeK87tuwRg\n/LaFe/SS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcYsW9EnOSXJXkl1JLlus9UiSDm5Rgj7JCuAN\nwHcD64ALk6xbjHVJkg5usfbozwB2VdWnquqLwNXAhkValyTpIFJVC/+lyQ8A51TVj3XjFwHPrKqX\nDc1zCXBJN/pNwF0LXsjoVgKf6buIJcJtMcVtMcVtMWUpbIsnVdXEbDP19giEqroKuKqv9R9Ikh1V\ntb7vOpYCt8UUt8UUt8WU5bQtFqvr5kFgzdD46q5NknSYLVbQfwRYm+SkJEcBG4Fti7QuSdJBLErX\nTVU9muRlwAeAFcBbqmrnYqxrgS2prqSeuS2muC2muC2mLJttsSgnYyVJS4d3xkpS4wx6SWqcQS9J\njTPohyR5bpI39F2HJC2ksX9nbJKnAy8BXgzcA7yr34oOrySnH2x6VX30cNWylCS5uKrePDS+Avjl\nqvr1HsuS5mUsgz7JycCF3b/PAO9kcAXSd/ZaWD9+t/s8BlgP3AYEeBqwA3h2T3X17awk3w9cDBwP\nvA34h14r6kmS9wLTL8/7Twa/H39aVf97+KvqR5JnAa8Hvhk4isHl41+oqmN7LWwWYxn0wCeAfwS+\np6p2ASR5Zb8l9WP/H7ck7wJOr6rbu/GnAr/WY2m9qqqXJLkAuB34AvCSqvqnnsvqy6eACeAd3fgF\nwCPAycAbgYt6qqsPf8TgBtC/YrBj9MMMtsOSNq5B/30M/mNdn+T9DJ6umX5L6t037Q95gKq6I8k3\n91lQn5KsBV4OXMNg7+2iJLdU1X/3W1kvnlNV3zI0/t4kH6mqb0myHG6EXFBVtSvJiqraB7w1yS3A\n5X3XdTDjGvTXVtW7k3w1g8cnvwL42iR/DPxNVf1dv+X14mNJ3gT8eTf+g8DHeqynb+8FLq2q7UkC\n/AyDR3uc0m9ZvXh8kq+vqn8DSPL1wOO7aV/sr6xe/Hf3WJdbk/wOsJtlcFHLWN4Zm+SjVXX6tLbj\nGJyQvaCqzuqnsv4kOQZ4KfBtXdONwB9X1d7+qupPkmOr6vPT2k6uqk/2VVNfkrwA+BPgXxkc+Z4E\n/BRwA/DjVfX7/VV3eCV5EvAQg/75VwJPAK7c3wW8VI1r0N9SVU/vu46lIMkGYHVVvaEbv4lBf2wB\nr6qqv+6zvsMtyauq6ne64RdX1V8NTXtNVf1if9X1J8nRwFO60bvG6QQsDI5i9h/RLEfjGvQPAL83\n0/SqmnFaa5L8E7Cxqu7vxm8Fnsfg0Pyt43Z0M3y0N/3I70BHguMiyXOASYa6e6vq7b0VdJhN+724\npqq+v++aRjGuffQrGATZuJ+ABThqf8h3PlRVnwM+153DGDeZYfhA42MhyZ8B3wDcCuzrmgsYm6Dn\ny//bP7m3KuZpXIN+d1W9uu8ilojjhkeGX/fIoAtn3NQMwwcaHxfrgXU1jof/Uw72e7HkjWvQj+We\n2Qw+nOTHq+qNw41JfgK4qaea+nRqks8z+B15XDdMN35Mf2X16g7g6xhcYTKuDvZ7UUv9hqlx7aM/\nvuueGHtJvhZ4N7AX2P+4g2cARwPnVdVDfdWmpSHJ9cBpDP7w//9VWFX1wt6K0kjGMuj1lZI8j6lr\nxHdW1Qf7rEdLR5LvOFB7VY3lIyGWI4Nekho3rn30kmaR5ENV9dwkj/DlJyCXRb+0prhHL0mNW/LP\naJDUryTf0N0ZS5Izk/x0kif2XZfmzqCXNJtrgH1JvhG4ClgD/GW/JWkUBr2k2TxWVY8CLwJeX1U/\nD6zquSaNwKCXNJsvJbkQ2ARc27Ud2WM9GpFBL2k2P8rglZKbq+qeJCcBf9ZzTRqBV91ImrPuvQ1r\nqmqcX0qz7LhHL+mgktyQ5NgkxzN4TMYbk4zNo7xbYNBLms0TurdtfR/w9qp6JvBdPdekERj0kmZz\nRJJVwPlMnYzVMmLQS5rNq4EPALuq6iNJngzc3XNNGoEnYyWpcT7UTNIB7X9RepLXc4C3KlXVT/dQ\nlubBoJc0k493nzt6rUKHzK4bSWqce/SSDijJtoNN91WCy4dBL2kmzwbuB94BfJjBC0e0DNl1I+mA\nkqwAng9cCDwNeB/wjqra2WthGpnX0Us6oKraV1Xvr6pNwLOAXcANSV7Wc2kakV03kmbUvVnqXAZ7\n9ZPAHwJ/02dNGp1dN5IOKMnbgacCfwtcXVV39FyS5smgl3RASR4DvtCNDgdFgKqqYw9/VZoPg16S\nGufJWElqnEEvSY0z6CWpcQa9lrQk+5LcmuS2JB9N8pwF+M7TkrxgWtt5ST6W5BNJ7kjyA4fw/ZNJ\nZrxCJcmZSSrJ9w61XZvkzKHxlUm+lOQn51uHtJ9Br6Xuf6rqtKo6Fbgc+K0F+M7TgP8P+iSnAq8D\nNlTVU4DvBX47yTMWYF0zeQD4pYNMfzHwLwyuX5cOiUGv5eRY4N8BkqxKcmO3t39Hkm/r2v8ryWuT\n7Ezy90nO6F5u/akkL0xyFIM3Jl3QLXsB8HPAa6rqHoDu8zXAz3bfeUOS9d3wyiT3dsOTSf6xO9IY\n9WjjNuA/kzx/hukXdus/McnqkbaSNI1Br6XucV0gfwJ4E/AbXftLgA9U1WnAqcCtXftXAx+sqlOA\nR4DfZPC8lhcBr66qLwK/AryzO1J4J3AKcPO09e4A1s1S28PA86vqdOACBneNjmIz8MvTG5OsAVZV\n1U3A1u67pXnzEQha6v6nC3OSPBt4e5KnAh8B3pLkSODdVbU/6L8IvL8bvh3YW1VfSnI7g1v4F9KR\nwB8lOQ3YB5w8ysJVdWMSkjx32qQLGAQ8wNXAW4DfPdRiNb7co9eyUVX/DKwEJqr
qRuDbgQeBtyX5\n4W62L9XUXYCPAXu7ZR9j5h2bO4Hp/fHPYOrNSo8y9f/KMUPzvBJ4iMERxXrgqHn8WAfaq78Q+JGu\ni2gb8LQka+fx3RJg0GsZSfIUYAXw2SRPAh6qqjcy6NI5fYSvegT4mqHx1wGXJ5ns1jMJvAJ4bTf9\nXqb+EAxfjfMEYHf3R+SirraRVNXfAccxeAwwSU4GHl9VJ1bVZFVNMjgB7UlZzZtBr6Vufx/9rcA7\ngU1VtQ84E7gtyS0Mujr+YITvvB5Yt/9kbNft8wvAe5N8Evgk8NKququb/3XAS7t1rRz6niuBTUlu\nA57C1HNhRrUZWNMNX8hXPh3yGgx6HQKfdSNNk+QK4JnA2d3JW2lZM+glqXFedSMtkiRnA789rfme\nqnpRH/VofLlHL0mN82SsJDXOoJekxhn0ktQ4g16SGvd/wffx/8ADWecAAAAASUVORK5CYII=\n", 302 | "text/plain": [ 303 | "" 304 | ] 305 | }, 306 | "metadata": {}, 307 | "output_type": "display_data" 308 | }, 309 | { 310 | "data": { 311 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEmCAYAAABs7FscAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFWlJREFUeJzt3X+0XWV95/H3pwGBpYNCiWkWCU3axrZBBWpEbWlrpRZa\nrMGqGJw66ZQxjqLizLQMdGZWf610UWdNV6uV1ihqxBFMi0iEGSiNUH+MJYSfEjAlQ6DABBKlU1t0\nUOJ3/jg7crjk5p6be+49uc99v9bKOns/+9lnf/cSP+e5z95nn1QVkqR2fd+oC5AkTS+DXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcQa9JDVuoKBPcn+SryS5PcmWru3oJNcnubd7Paqv/4VJtifZluS06Spe\nkjSxyYzof66qTqyqFd36BcCmqloGbOrWSbIcWAUcD5wOXJxk3hBrliRNwiFT2Hcl8MpueT1wI/Af\nu/bLq+oJYEeS7cDJwJfHe6NjjjmmlixZMoVSJGnuueWWW75WVfMn6jdo0Bfw10n2AB+sqnXAgqra\n2W1/BFjQLR8L/G3fvg91bU+TZA2wBuC4445jy5YtA5YiSQJI8sAg/QYN+lOq6uEkzweuT/LV/o1V\nVUkm9dCc7sNiHcCKFSt84I4kTZOB5uir6uHudRdwJb2pmEeTLAToXnd13R8GFvftvqhrkySNwIRB\nn+TZSf7F3mXgF4C7gI3A6q7bauCqbnkjsCrJYUmWAsuAzcMuXJI0mEGmbhYAVybZ2/+TVXVtkpuB\nDUnOAR4AzgKoqq1JNgB3A08C51bVnmmpXpI0oQmDvqruA07YR/vXgVPH2WctsHbK1UmSpsxvxkpS\n4wx6SWqcQS9JjZvKN2NHbskF18zo8e6/6IwZPZ4kDYMjeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0\nktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9J\njTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktS4\ngYM+ybwktyW5uls/Osn1Se7tXo/q63thku1JtiU5bToKlyQNZjIj+vOAe/rWLwA2VdUyYFO3TpLl\nwCrgeOB04OIk84ZTriRpsgYK+iSLgDOAD/c1rwTWd8vrgTP72i+vqieqagewHTh5OOVKkiZr0BH9\nHwPnA9/ta1tQVTu75UeABd3yscCDff0e6tqeJsmaJFuSbNm9e/fkqpYkDWzCoE/yGmBXVd0yXp+q\nKqAmc+CqWldVK6pqxfz58yezqyRpEg4ZoM9PAa9N8kvA4cCRST4BPJpkYVXtTLIQ2NX1fxhY3Lf/\noq5NkjQCE47oq+rCqlpUVUvoXWT9XFX9KrARWN11Ww1c1S1vBFYlOSzJUmAZsHnolUuSBjLIiH48\nFwEbkpwDPACcBVBVW5NsAO4GngTOrao9U65UknRAJhX0VXUjcGO3/HXg1HH6rQXWTrE2SdIQ+M1Y\nSWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJek\nxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqc\nQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY2bMOiTHJ5kc5I7kmxN8rtd\n+9FJrk9yb/d6VN8+FybZnmRbktOm8wQkSfs3yIj+CeBVVXUCcCJwepKXAxcAm6pqGbCpWyfJcmAV\ncDxwOnBxknnTUbwkaWITBn31/HO3emj3r4CVwPqufT1wZre8Eri8qp6oqh3AduDkoVYtSRrYQHP0\nSeYluR3YBVxfVTcBC6pqZ9flEWBBt3ws8GDf7g91bWPfc02SLUm27N69+4BPQJK0fwMFfVXtqaoT\ngUXAyUleOGZ70RvlD6yq1lXViqpaMX/+/MnsKkmahEnddVNV/xe4gd7c+6NJFgJ0r7u6bg8Di/t2\nW9S1SZJGYJC7buYneV63fATwauCrwEZgdddtNXBVt7wRWJXksCRLgWXA5mEXLkkazCED9FkIrO/u\nnPk+YENVXZ3ky8CGJOcADwBnAVTV1iQbgLuBJ4Fzq2rP9JQvSZrIhEFfVXcCJ+2j/evAqePssxZY\nO+XqJElT5jdjJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJek\nxhn0ktQ4g16SGmfQS1LjBnkevUZkyQXXzOjx7r/ojBk9nqSZ4Yhekhpn0EtS4wx6SWqcQS9JjTPo\nJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuN8BIJGxkc8SDPDEb0kNc6gl6TGGfSS1DiDXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcRMGfZLFSW5IcneSrUnO69qPTnJ9knu716P69rkwyfYk25KcNp0nIEna\nv0FG9E8C/6GqlgMvB85Nshy4ANhUVcuATd063bZVwPHA6cDFSeZNR/GSpIlNGPRVtbOqbu2W/wm4\nBzgWWAms77qtB87sllcCl1fVE1W1A9gOnDzswiVJg5nUHH2SJcBJwE3Agqra2W16BFjQLR8LPNi3\n20NdmyRpBAYO+iTPAa4A3lNV3+jfVlUF1GQOnGRNki1JtuzevXsyu0qSJmGgoE9yKL2Q/+9V9emu\n+dEkC7vtC4FdXfvDwOK+3Rd1bU9TVeuqakVVrZg/f/6B1i9JmsAgd90EuAS4p6r+qG/TRmB1t7wa\nuKqvfVWSw5IsBZYBm4dXsiRpMgZ5
[two base64-encoded PNG outputs removed: bar plots showing the number of observations per label for BsmtQual_NA, FireplaceQu_NA and GarageType_NA]
329 |    ],
330 |    "source": [
331 |     "# let's look at the number of observations per label on each of the variables\n",
332 |     "for col in ['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']:\n",
333 |     "    X_train.groupby([col])[col].count().sort_values(ascending=False).plot.bar()\n",
334 |     "    plt.show()"
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "markdown",
339 |    "metadata": {},
340 |    "source": [
341 |     "In the plots we can see, for each of the 3 categorical variables, the number of observations per label, including the label that captures the missing values."
342 |    ]
343 |   },
344 |   {
345 |    "cell_type": "code",
346 |    "execution_count": 9,
347 |    "metadata": {
348 |     "collapsed": true
349 |    },
350 |    "outputs": [],
351 |    "source": [
352 |     "# let's transform the categories into numbers quick and dirty so we can use them in scikit-learn\n",
353 |     "\n",
354 |     "# the loop below numbers the labels from 0 to n-1, n being the number of\n",
355 |     "# different labels within the variable\n",
356 |     "\n",
357 |     "for col in ['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']:\n",
358 |     "    labels_dict = {k: i for i, k in enumerate(X_train[col].unique())}\n",
359 |     "    X_train.loc[:, col] = X_train.loc[:, col].map(labels_dict)\n",
360 |     "    X_test.loc[:, col] = X_test.loc[:, col].map(labels_dict)"
361 |    ]
362 |   },
363 |   {
364 |    "cell_type": "markdown",
365 |    "metadata": {},
366 |    "source": [
367 |     "### Linear Regression"
368 |    ]
369 |   },
370 |   {
371 |    "cell_type": "code",
372 |    "execution_count": 10,
373 |    "metadata": {},
374 |    "outputs": [
375 |     {
376 |      "name": "stdout",
377 |      "output_type": "stream",
378 |      "text": [
379 |       "Train set, Missing label imputation\n",
380 |       "Linear Regression mse: 4810016310.466396\n",
381 |       "Test set, Missing label imputation\n",
382 |       "Linear Regression mse: 5562566516.826057\n"
383 |      ]
384 |     }
385 |    ],
386 |    "source": [
387 |     "# Let's evaluate the performance of Linear Regression\n",
388 |     "\n",
389 |     "linreg = LinearRegression()\n",
390 |     "linreg.fit(X_train[['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']], y_train)\n",
391 |     "print('Train set, Missing label imputation')\n",
392 |     "pred = linreg.predict(X_train[['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']])\n",
393 |     "print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))\n",
394 |     "print('Test set, Missing label imputation')\n",
395 |     "pred = linreg.predict(X_test[['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']])\n",
396 |     "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))"
397 |    ]
398 |   },
399 |   {
400 |    "cell_type": "markdown",
401 |    "metadata": {},
402 |    "source": [
403 |     "In the previous lectures we trained linear regressions on data where the missing observations were replaced by i) the most frequent category, or ii) random sampling plus an additional variable to indicate missingness. We obtained the following mse for the testing sets:\n",
404 |     "\n",
405 |     "- frequent label imputation mse: 6456070592\n",
406 |     "- random sampling + additional variable: 4911877327\n",
407 |     "- adding a 'Missing' label (this lecture): 5562566516\n",
408 |     "\n",
409 |     "Therefore, adding an additional 'Missing' category lies between the 2 other approaches in terms of performance.\n",
410 |     "\n",
411 |     "A next step could be to investigate which approach works better for each variable individually, to try and optimise the performance of the linear regression even more; a sketch of that idea follows below."
412 |    ]
413 |   },
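  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sketch of that next step. Assume we kept the three differently imputed versions of the training set from this and the previous lectures under the hypothetical names X_train_frequent, X_train_random and X_train_missing (they are not defined in this notebook). We could then cross-validate each variable under each scheme separately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "# hypothetical: the three imputed versions of the same training data\n",
    "versions = {'frequent category': X_train_frequent,\n",
    "            'random sample': X_train_random,\n",
    "            'missing label': X_train_missing}\n",
    "\n",
    "for col in ['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']:\n",
    "    for name, X in versions.items():\n",
    "        # 3-fold cross-validated mse of a linear regression on this column alone\n",
    "        mse = -cross_val_score(LinearRegression(), X[[col]], y_train,\n",
    "                               scoring='neg_mean_squared_error', cv=3).mean()\n",
    "        print(col, name, mse)"
   ]
  },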
414 |   {
415 |    "cell_type": "markdown",
416 |    "metadata": {
417 |     "collapsed": true
418 |    },
419 |    "source": [
420 |     "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**"
421 |    ]
422 |   },
423 |   {
424 |    "cell_type": "code",
425 |    "execution_count": null,
426 |    "metadata": {
427 |     "collapsed": true
428 |    },
429 |    "outputs": [],
430 |    "source": []
431 |   }
432 |  ],
433 |  "metadata": {
434 |   "kernelspec": {
435 |    "display_name": "Python 3",
436 |    "language": "python",
437 |    "name": "python3"
438 |   },
439 |   "language_info": {
440 |    "codemirror_mode": {
441 |     "name": "ipython",
442 |     "version": 3
443 |    },
444 |    "file_extension": ".py",
445 |    "mimetype": "text/x-python",
446 |    "name": "python",
447 |    "nbconvert_exporter": "python",
448 |    "pygments_lexer": "ipython3",
449 |    "version": "3.6.1"
450 |   },
451 |   "toc": {
452 |    "nav_menu": {},
453 |    "number_sections": true,
454 |    "sideBar": true,
455 |    "skip_h1_title": false,
456 |    "toc_cell": false,
457 |    "toc_position": {},
458 |    "toc_section_display": "block",
459 |    "toc_window_display": true
460 |   }
461 |  },
462 |  "nbformat": 4,
463 |  "nbformat_minor": 2
464 | }
--------------------------------------------------------------------------------
/07.2_Conclusion_when_to_use_each_NA_imputation_methods.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Conclusion: when to use each NA imputation method\n",
  8 |     "\n",
  9 |     "\n",
 10 |     "### Which missing value imputation method shall I use, and when?\n",
 11 |     "\n",
 12 |     "There is no straightforward answer to this question; which method to use on which occasion is not set in stone. This is totally up to you.\n",
 13 |     "\n",
 14 |     "Different methods make different assumptions and have different advantages and disadvantages (see the lecture \"Overview of missing value imputation methods\").\n",
 15 |     "\n",
 16 |     "**As a guideline I would say**:\n",
 17 |     "\n",
 18 |     "If missing values are less than 5% of the variable:\n",
 19 |     "\n",
 20 |     "- replace by the mean/median or by a random sample (numerical variables)\n",
 21 |     "- replace by the most frequent category (categorical variables)\n",
 22 |     "\n",
 23 |     "If missing values are more than 5% of the variable:\n",
 24 |     "\n",
 25 |     "- do mean/median imputation and add an additional binary variable to capture the missingness (numerical variables)\n",
 26 |     "- add a 'Missing' label (categorical variables)\n",
 27 |     "\n",
 28 |     "If the number of NA in a variable is small, they are unlikely to have a strong impact on the variable / target that you are trying to predict. Treating them specially will therefore most certainly add noise to the variables, and it is more useful to replace by the mean / a random sample to preserve the variable distribution.\n",
 29 |     "\n",
 30 |     "If the target you are trying to predict is however highly unbalanced, then it might be the case that this small number of NA is indeed informative. You would have to check this out.\n",
 31 |     "\n",
 32 |     "**Exceptions to the guideline**:\n",
 33 |     "\n",
 34 |     "- if you / your company suspect that NA are not missing at random and do not want to attribute the most common occurrence to NA\n",
 35 |     "- if you don't want to increase the feature space by adding an additional variable to indicate missingness\n",
 36 |     "\n",
 37 |     "In these cases, replace by a value at the far end of the distribution or by an arbitrary value."
 38 |    ]
 39 |   },
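  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this guideline in pandas, assuming a dataframe df with a numerical column 'age' and a categorical column 'city' (hypothetical names, not from the course datasets):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame({'age': [25, np.nan, 40, 31],\n",
    "                   'city': ['London', np.nan, 'Paris', 'London']})\n",
    "\n",
    "if df['age'].isnull().mean() < 0.05:\n",
    "    # few NA: mean/median imputation preserves the variable distribution\n",
    "    df['age'] = df['age'].fillna(df['age'].median())\n",
    "else:\n",
    "    # many NA: impute, and also flag the missingness in an additional binary variable\n",
    "    df['age_NA'] = np.where(df['age'].isnull(), 1, 0)\n",
    "    df['age'] = df['age'].fillna(df['age'].median())\n",
    "\n",
    "# categorical variable: add an explicit 'Missing' label\n",
    "df['city'] = df['city'].fillna('Missing')"
   ]
  },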
 40 |   {
 41 |    "cell_type": "markdown",
 42 |    "metadata": {},
 43 |    "source": [
 44 |     "#### Final note\n",
 45 |     "\n",
 46 |     "NA imputation for data competitions and for business settings can be approached differently. In data competitions, a tiny increase in performance can be the difference between 1st and 2nd place. Therefore, you may want to try all of the imputation methods and keep the one that gives the best machine learning model performance. It may be the case that different NA imputation methods help different models make better predictions.\n",
 47 |     "\n",
 48 |     "In business scenarios, scientists don't usually have the time to do lengthy studies, and may therefore choose to streamline the feature engineering procedure. In these cases, it is common practice to follow the guidelines above, taking into account the exceptions, and to apply the same processing to all features.\n",
 49 |     "\n",
 50 |     "This streamlined pre-processing may not lead to the most predictive features possible, yet it makes the delivery of feature engineering and machine learning models substantially faster. Thus, the business can start enjoying the power of machine learning sooner."
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "code",
 55 |    "execution_count": null,
 56 |    "metadata": {
 57 |     "collapsed": true
 58 |    },
 59 |    "outputs": [],
 60 |    "source": []
 61 |   }
 62 |  ],
 63 |  "metadata": {
 64 |   "kernelspec": {
 65 |    "display_name": "Python 3",
 66 |    "language": "python",
 67 |    "name": "python3"
 68 |   },
 69 |   "language_info": {
 70 |    "codemirror_mode": {
 71 |     "name": "ipython",
 72 |     "version": 3
 73 |    },
 74 |    "file_extension": ".py",
 75 |    "mimetype": "text/x-python",
 76 |    "name": "python",
 77 |    "nbconvert_exporter": "python",
 78 |    "pygments_lexer": "ipython3",
 79 |    "version": "3.6.1"
 80 |   },
 81 |   "toc": {
 82 |    "nav_menu": {},
 83 |    "number_sections": true,
 84 |    "sideBar": true,
 85 |    "skip_h1_title": false,
 86 |    "toc_cell": false,
 87 |    "toc_position": {},
 88 |    "toc_section_display": "block",
 89 |    "toc_window_display": false
 90 |   }
 91 |  },
 92 |  "nbformat": 4,
 93 |  "nbformat_minor": 2
 94 | }
--------------------------------------------------------------------------------
/10.10_Bonus_Additional_reading_resources.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Bonus: Additional reading resources\n",
  8 |     "\n",
  9 |     "### Weight of Evidence\n",
 10 |     "\n",
 11 |     "- [WoE Module from StatSoft](http://documentation.statsoft.com/portals/0/formula%20guide/Weight%20of%20Evidence%20Formula%20Guide.pdf)\n",
 12 |     "- [WoE Statistica](http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview)\n",
 13 |     "\n",
 14 |     "### Scorecard development for credit risk\n",
 15 |     "\n",
 16 |     "- [Plug and Score](https://plug-n-score.com/learning/scorecard-development-stages.htm#Binning)\n",
 17 |     "- [Scorecard development from SAS](http://www.sas.com/storefront/aux/en/spcrisks/59376_excerpt.pdf)\n",
 18 |     "\n",
 19 |     "### Winsorisation and top/bottom coding\n",
 20 |     "\n",
 21 |     "- [Winsorisation](http://www.statisticshowto.com/winsorize/)\n",
 22 |     "- [Censoring data](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1431352)\n",
 23 |     "\n",
 24 |     "### Alternative methods of categorical variable encoding\n",
 25 |     "\n",
 26 |     "- [Will's blog](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)\n",
 27 |     "- [Classification of Tutor systems with high cardinality (KDD competition)](http://pslcdatashop.org/KDDCup/workshop/)\n",
 28 |     "- [Light weight solution to KDD data mining challenge](http://pslcdatashop.org/KDDCup/workshop/)\n",
 29 |     "\n",
 30 |     "### Several methods described in this course\n",
 31 |     "\n",
 32 |     "- [The 2009 Knowledge Discovery in 
Data Competition (KDD Cup 2009)](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)\n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": { 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [] 45 | } 46 | ], 47 | "metadata": { 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "codemirror_mode": { 55 | "name": "ipython", 56 | "version": 3 57 | }, 58 | "file_extension": ".py", 59 | "mimetype": "text/x-python", 60 | "name": "python", 61 | "nbconvert_exporter": "python", 62 | "pygments_lexer": "ipython3", 63 | "version": "3.6.1" 64 | }, 65 | "toc": { 66 | "nav_menu": {}, 67 | "number_sections": true, 68 | "sideBar": true, 69 | "skip_h1_title": false, 70 | "toc_cell": false, 71 | "toc_position": {}, 72 | "toc_section_display": "block", 73 | "toc_window_display": false 74 | } 75 | }, 76 | "nbformat": 4, 77 | "nbformat_minor": 2 78 | } 79 | -------------------------------------------------------------------------------- /10.2_One_hot_encoding_variables_with_many_labels.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## One Hot Encoding - variables with many categories\n", 8 | "\n", 9 | "We observed in the previous lecture that if a categorical variable contains multiple labels, then by re-encoding them using one hot encoding we will expand the feature space dramatically.\n", 10 | "\n", 11 | "See below:" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/html": [ 22 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 97 | ], 98 | "text/plain": [ 99 | " X1 X2 X3 X4 X5 X6\n", 100 | "0 v at a d u j\n", 101 | "1 t av e d y l\n", 102 | "2 w n c d x j\n", 103 | "3 t n f d x l\n", 104 | "4 v n f d h d" 105 | ] 106 | }, 107 | "execution_count": 1, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "import pandas as pd\n", 114 | "import numpy as np\n", 115 | "\n", 116 | "# let's load the mercedes benz dataset for demonstration, only the categorical variables\n", 117 | "\n", 118 | "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])\n", 119 | "data.head()" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 2, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "X1 : 27 labels\n", 132 | "X2 : 44 labels\n", 133 | "X3 : 7 labels\n", 134 | "X4 : 4 labels\n", 135 | "X5 : 29 labels\n", 136 | "X6 : 12 labels\n" 137 | ] 138 | } 139 | ], 140 | "source": [ 141 | "# let's have a look at how many labels each variable has\n", 142 | "\n", 143 | "for col in data.columns:\n", 144 | " print(col, ': ', len(data[col].unique()), ' labels')" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 3, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "data": { 154 | "text/plain": [ 155 | "(4209, 117)" 156 | ] 157 | }, 158 | "execution_count": 3, 159 | "metadata": {}, 160 | "output_type": "execute_result" 161 | } 162 | ], 163 | "source": [ 164 | "# let's examine how many columns we will obtain after one hot encoding these variables\n", 165 | "pd.get_dummies(data, drop_first=True).shape" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "We can see that from just 6 initial categorical variables, we end up with 117 new variables. \n", 173 | "\n", 174 | "These numbers are still not huge, and in practice we could work with them relatively easily. However, in business datasets and also other Kaggle or KDD datasets, it is not unusual to find several categorical variables with multiple labels. And if we use one hot encoding on them, we will end up with datasets with thousands of columns.\n", 175 | "\n", 176 | "What can we do instead?\n", 177 | "\n", 178 | "In the winning solution of the KDD 2009 cup: \"Winning the KDD Cup Orange Challenge with Ensemble Selection\" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one hot encoding to the 10 most frequent labels of the variable. This means that they would make one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, that in this case will be dropped. Thus, the 10 new dummy variables indicate if one of the 10 most frequent labels is present (1) or not (0) for a particular observation.\n", 179 | "\n", 180 | "How can we do that in python?" 
181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 4, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "as 1659\n", 192 | "ae 496\n", 193 | "ai 415\n", 194 | "m 367\n", 195 | "ak 265\n", 196 | "r 153\n", 197 | "n 137\n", 198 | "s 94\n", 199 | "f 87\n", 200 | "e 81\n", 201 | "Name: X2, dtype: int64" 202 | ] 203 | }, 204 | "execution_count": 4, 205 | "metadata": {}, 206 | "output_type": "execute_result" 207 | } 208 | ], 209 | "source": [ 210 | "# let's find the top 10 most frequent categories for the variable X2\n", 211 | "\n", 212 | "data.X2.value_counts().sort_values(ascending=False).head(10)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 5, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']" 224 | ] 225 | }, 226 | "execution_count": 5, 227 | "metadata": {}, 228 | "output_type": "execute_result" 229 | } 230 | ], 231 | "source": [ 232 | "# let's make a list with the most frequent categories of the variable\n", 233 | "\n", 234 | "top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]\n", 235 | "top_10" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 6, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/html": [ 246 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 421 | ], 422 | "text/plain": [ 423 | " X2 as ae ai m ak r n s f e\n", 424 | "0 at 0 0 0 0 0 0 0 0 0 0\n", 425 | "1 av 0 0 0 0 0 0 0 0 0 0\n", 426 | "2 n 0 0 0 0 0 0 1 0 0 0\n", 427 | "3 n 0 0 0 0 0 0 1 0 0 0\n", 428 | "4 n 0 0 0 0 0 0 1 0 0 0\n", 429 | "5 e 0 0 0 0 0 0 0 0 0 1\n", 430 | "6 e 0 0 0 0 0 0 0 0 0 1\n", 431 | "7 as 1 0 0 0 0 0 0 0 0 0\n", 432 | "8 as 1 0 0 0 0 0 0 0 0 0\n", 433 | "9 aq 0 0 0 0 0 0 0 0 0 0" 434 | ] 435 | }, 436 | "execution_count": 6, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "# and now we make the 10 binary variables\n", 443 | "\n", 444 | "for label in top_10:\n", 445 | " data[label] = np.where(data['X2']==label, 1, 0)\n", 446 | "\n", 447 | "data[['X2']+top_10].head(10)" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 7, 453 | "metadata": {}, 454 | "outputs": [ 455 | { 456 | "data": { 457 | "text/html": [ 458 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]",
593 |       ],
594 |       "text/plain": [
595 |        "  X1  X2 X3 X4 X5 X6  X2_as  X2_ae  X2_ai  X2_m  X2_ak  X2_r  X2_n  X2_s  \\\n",
596 |        "0  v  at  a  d  u  j      0      0      0     0      0     0     0     0   \n",
597 |        "1  t  av  e  d  y  l      0      0      0     0      0     0     0     0   \n",
598 |        "2  w   n  c  d  x  j      0      0      0     0      0     0     1     0   \n",
599 |        "3  t   n  f  d  x  l      0      0      0     0      0     0     1     0   \n",
600 |        "4  v   n  f  d  h  d      0      0      0     0      0     0     1     0   \n",
601 |        "\n",
602 |        "   X2_f  X2_e  \n",
603 |        "0     0     0  \n",
604 |        "1     0     0  \n",
605 |        "2     0     0  \n",
606 |        "3     0     0  \n",
607 |        "4     0     0  "
608 |       ]
609 |      },
610 |      "execution_count": 7,
611 |      "metadata": {},
612 |      "output_type": "execute_result"
613 |     }
614 |    ],
615 |    "source": [
616 |     "# get whole set of dummy variables, for all the categorical variables\n",
617 |     "\n",
618 |     "def one_hot_top_x(df, variable, top_x_labels):\n",
619 |     "    # function to create dummy variables for the most frequent labels of a variable;\n",
620 |     "    # we can vary the number of most frequent labels that we encode\n",
621 |     "    \n",
622 |     "    for label in top_x_labels:\n",
623 |     "        # note: use df (the function argument) rather than the global dataframe,\n",
624 |     "        # so that the function works on any dataframe we pass in\n",
625 |     "        df[variable+'_'+label] = np.where(df[variable]==label, 1, 0)\n",
626 |     "\n",
627 |     "# read the data again\n",
628 |     "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])\n",
629 |     "\n",
630 |     "# encode X2 into the 10 most frequent categories\n",
631 |     "one_hot_top_x(data, 'X2', top_10)\n",
632 |     "data.head()"
633 |    ]
634 |   },
635 |   {
636 |    "cell_type": "code",
637 |    "execution_count": 8,
638 |    "metadata": {},
639 |    "outputs": [
640 |     {
641 |      "data": {
642 |       "text/html": [
643 |        "<div>
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]",
807 |       ],
808 |       "text/plain": [
809 |        "  X1  X2 X3 X4 X5 X6  X2_as  X2_ae  X2_ai  X2_m  ...  X1_aa  X1_s  X1_b  \\\n",
810 |        "0  v  at  a  d  u  j      0      0      0     0  ...      0     0     0   \n",
811 |        "1  t  av  e  d  y  l      0      0      0     0  ...      0     0     0   \n",
812 |        "2  w   n  c  d  x  j      0      0      0     0  ...      0     0     0   \n",
813 |        "3  t   n  f  d  x  l      0      0      0     0  ...      0     0     0   \n",
814 |        "4  v   n  f  d  h  d      0      0      0     0  ...      0     0     0   \n",
815 |        "\n",
816 |        "   X1_l  X1_v  X1_r  X1_i  X1_a  X1_c  X1_o  \n",
817 |        "0     0     1     0     0     0     0     0  \n",
818 |        "1     0     0     0     0     0     0     0  \n",
819 |        "2     0     0     0     0     0     0     0  \n",
820 |        "3     0     0     0     0     0     0     0  \n",
821 |        "4     0     1     0     0     0     0     0  \n",
822 |        "\n",
823 |        "[5 rows x 26 columns]"
824 |       ]
825 |      },
826 |      "execution_count": 8,
827 |      "metadata": {},
828 |      "output_type": "execute_result"
829 |     }
830 |    ],
831 |    "source": [
832 |     "# find the 10 most frequent categories for X1\n",
833 |     "top_10 = [x for x in data.X1.value_counts().sort_values(ascending=False).head(10).index]\n",
834 |     "\n",
835 |     "# now create dummy variables for the 10 most frequent categories of X1\n",
836 |     "one_hot_top_x(data, 'X1', top_10)\n",
837 |     "data.head()"
838 |    ]
839 |   },
840 |   {
841 |    "cell_type": "markdown",
842 |    "metadata": {},
843 |    "source": [
844 |     "### One hot encoding of the top categories\n",
845 |     "\n",
846 |     "### Advantages\n",
847 |     "\n",
848 |     "- Straightforward to implement\n",
849 |     "- Does not require hours of variable exploration\n",
850 |     "- Does not massively expand the feature space (the number of columns in the dataset)\n",
851 |     "\n",
852 |     "### Disadvantages\n",
853 |     "\n",
854 |     "- Does not add any information that may make the variable more predictive\n",
855 |     "- Does not keep the information of the ignored labels\n",
856 |     "\n",
857 |     "Because it is not unusual that categorical variables have a few dominating categories while the remaining labels add mostly noise, this is a quite simple and straightforward approach that may be useful on many occasions.\n",
858 |     "\n",
859 |     "It is worth noting that the choice of the top 10 labels is totally arbitrary. You could also choose the top 5, or the top 20.\n",
860 |     "\n",
861 |     "This modelling was more than enough for the team to win the KDD 2009 cup. They also did some other powerful feature engineering, as we will see in the following lectures, which improved the performance of the variables dramatically."
862 |    ]
863 |   },
865 |   {
866 |    "cell_type": "code",
867 |    "execution_count": null,
868 |    "metadata": {
869 |     "collapsed": true
870 |    },
871 |    "outputs": [],
872 |    "source": []
873 |   }
874 |  ],
875 |  "metadata": {
876 |   "kernelspec": {
877 |    "display_name": "Python 3",
878 |    "language": "python",
879 |    "name": "python3"
880 |   },
881 |   "language_info": {
882 |    "codemirror_mode": {
883 |     "name": "ipython",
884 |     "version": 3
885 |    },
886 |    "file_extension": ".py",
887 |    "mimetype": "text/x-python",
888 |    "name": "python",
889 |    "nbconvert_exporter": "python",
890 |    "pygments_lexer": "ipython3",
891 |    "version": "3.6.1"
892 |   },
893 |   "toc": {
894 |    "nav_menu": {},
895 |    "number_sections": true,
896 |    "sideBar": true,
897 |    "skip_h1_title": false,
898 |    "toc_cell": false,
899 |    "toc_position": {},
900 |    "toc_section_display": "block",
901 |    "toc_window_display": true
902 |   }
903 |  },
904 |  "nbformat": 4,
905 |  "nbformat_minor": 2
906 | }
--------------------------------------------------------------------------------
/10.3_Ordinal_numbering_encoding.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Ordinal numbering encoding\n",
  8 |     "\n",
  9 |     "**Ordinal categorical variables**\n",
 10 |     "\n",
 11 |     "Categorical variables whose categories can be meaningfully ordered are called ordinal. For example:\n",
 12 |     "\n",
 13 |     "- Student's grade in an exam (A, B, C or Fail).\n",
 14 |     "- Days of the week, with Monday = 1 and Sunday = 7.\n",
 15 |     "- Educational level, with the categories Elementary school, High school, College graduate, PhD ranked from 1 to 4.\n",
 16 |     "\n",
 17 |     "When the categorical variable is ordinal, the most straightforward approach is to replace the labels by some ordinal number.\n",
 18 |     "\n",
 19 |     "### Advantages\n",
 20 |     "\n",
 21 |     "- Keeps the semantic information of the variable (human-readable content)\n",
 22 |     "- Straightforward\n",
 23 |     "\n",
 24 |     "### Disadvantage\n",
 25 |     "\n",
 26 |     "- Does not add any information that could make the variable more predictive\n",
 27 |     "\n",
 28 |     "I will simulate some data below to demonstrate this technique."
 29 |    ]
 30 |   },
 31 |   {
 32 |    "cell_type": "code",
 33 |    "execution_count": 1,
 34 |    "metadata": {
 35 |     "collapsed": true
 36 |    },
 37 |    "outputs": [],
 38 |    "source": [
 39 |     "import pandas as pd\n",
 40 |     "import datetime"
 41 |    ]
 42 |   },
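  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before the date example below, a quick sketch of the same idea on the exam-grade example from the introduction, using pandas' ordered categorical type as an alternative to a hand-written mapping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grades = pd.Series(['A', 'C', 'B', 'Fail', 'A'])\n",
    "\n",
    "# declare the meaningful order explicitly, from worst to best\n",
    "grades = pd.Categorical(grades, categories=['Fail', 'C', 'B', 'A'], ordered=True)\n",
    "\n",
    "# the integer codes respect that order: Fail=0, C=1, B=2, A=3\n",
    "print(grades.codes)"
   ]
  },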
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": 2,
 46 |    "metadata": {},
 47 |    "outputs": [
 48 |     {
 49 |      "data": {
 50 |       "text/html": [
 51 |        "<div> [flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]"
196 |       ],
197 |       "text/plain": [
198 |        "                          day\n",
199 |        "0  2017-11-24 23:37:17.497960\n",
200 |        "1  2017-11-23 23:37:17.497960\n",
201 |        "2  2017-11-22 23:37:17.497960\n",
202 |        "3  2017-11-21 23:37:17.497960\n",
203 |        "4  2017-11-20 23:37:17.497960\n",
204 |        "5  2017-11-19 23:37:17.497960\n",
205 |        "6  2017-11-18 23:37:17.497960\n",
206 |        "7  2017-11-17 23:37:17.497960\n",
207 |        "8  2017-11-16 23:37:17.497960\n",
208 |        "9  2017-11-15 23:37:17.497960\n",
209 |        "10 2017-11-14 23:37:17.497960\n",
210 |        "11 2017-11-13 23:37:17.497960\n",
211 |        "12 2017-11-12 23:37:17.497960\n",
212 |        "13 2017-11-11 23:37:17.497960\n",
213 |        "14 2017-11-10 23:37:17.497960\n",
214 |        "15 2017-11-09 23:37:17.497960\n",
215 |        "16 2017-11-08 23:37:17.497960\n",
216 |        "17 2017-11-07 23:37:17.497960\n",
217 |        "18 2017-11-06 23:37:17.497960\n",
218 |        "19 2017-11-05 23:37:17.497960\n",
219 |        "20 2017-11-04 23:37:17.497960\n",
220 |        "21 2017-11-03 23:37:17.497960\n",
221 |        "22 2017-11-02 23:37:17.497960\n",
222 |        "23 2017-11-01 23:37:17.497960\n",
223 |        "24 2017-10-31 23:37:17.497960\n",
224 |        "25 2017-10-30 23:37:17.497960\n",
225 |        "26 2017-10-29 23:37:17.497960\n",
226 |        "27 2017-10-28 23:37:17.497960\n",
227 |        "28 2017-10-27 23:37:17.497960\n",
228 |        "29 2017-10-26 23:37:17.497960"
229 |       ]
230 |      },
231 |      "execution_count": 2,
232 |      "metadata": {},
233 |      "output_type": "execute_result"
234 |     }
235 |    ],
236 |    "source": [
237 |     "# create a variable with dates, and from that extract the weekday\n",
238 |     "# I create a list of the 30 days leading up to today\n",
239 |     "# and then transform it into a dataframe\n",
240 |     "\n",
241 |     "base = datetime.datetime.today()\n",
242 |     "date_list = [base - datetime.timedelta(days=x) for x in range(0, 30)]\n",
243 |     "df = pd.DataFrame(date_list)\n",
244 |     "df.columns = ['day']\n",
245 |     "df"
246 |    ]
247 |   },
248 |   {
249 |    "cell_type": "code",
250 |    "execution_count": 3,
251 |    "metadata": {},
252 |    "outputs": [
253 |     {
254 |      "data": {
255 |       "text/html": [
256 |        "<div>
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 307 | ], 308 | "text/plain": [ 309 | " day day_of_week\n", 310 | "0 2017-11-24 23:37:17.497960 Friday\n", 311 | "1 2017-11-23 23:37:17.497960 Thursday\n", 312 | "2 2017-11-22 23:37:17.497960 Wednesday\n", 313 | "3 2017-11-21 23:37:17.497960 Tuesday\n", 314 | "4 2017-11-20 23:37:17.497960 Monday" 315 | ] 316 | }, 317 | "execution_count": 3, 318 | "metadata": {}, 319 | "output_type": "execute_result" 320 | } 321 | ], 322 | "source": [ 323 | "# extract the week day name\n", 324 | "\n", 325 | "df['day_of_week'] = df['day'].dt.weekday_name\n", 326 | "df.head()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 4, 332 | "metadata": {}, 333 | "outputs": [ 334 | { 335 | "data": { 336 | "text/html": [ 337 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]",
423 |       ],
424 |       "text/plain": [
425 |        "                         day day_of_week  day_ordinal\n",
426 |        "0 2017-11-24 23:37:17.497960      Friday            5\n",
427 |        "1 2017-11-23 23:37:17.497960    Thursday            4\n",
428 |        "2 2017-11-22 23:37:17.497960   Wednesday            3\n",
429 |        "3 2017-11-21 23:37:17.497960     Tuesday            2\n",
430 |        "4 2017-11-20 23:37:17.497960      Monday            1\n",
431 |        "5 2017-11-19 23:37:17.497960      Sunday            7\n",
432 |        "6 2017-11-18 23:37:17.497960    Saturday            6\n",
433 |        "7 2017-11-17 23:37:17.497960      Friday            5\n",
434 |        "8 2017-11-16 23:37:17.497960    Thursday            4\n",
435 |        "9 2017-11-15 23:37:17.497960   Wednesday            3"
436 |       ]
437 |      },
438 |      "execution_count": 4,
439 |      "metadata": {},
440 |      "output_type": "execute_result"
441 |     }
442 |    ],
443 |    "source": [
444 |     "# Engineer the categorical variable by ordinal number replacement\n",
445 |     "\n",
446 |     "weekday_map = {'Monday':1,\n",
447 |     "               'Tuesday':2,\n",
448 |     "               'Wednesday':3,\n",
449 |     "               'Thursday':4,\n",
450 |     "               'Friday':5,\n",
451 |     "               'Saturday':6,\n",
452 |     "               'Sunday':7\n",
453 |     "}\n",
454 |     "\n",
455 |     "df['day_ordinal'] = df.day_of_week.map(weekday_map)\n",
456 |     "df.head(10)"
457 |    ]
458 |   },
459 |   {
460 |    "cell_type": "markdown",
461 |    "metadata": {},
462 |    "source": [
463 |     "We can now use the variable day_ordinal in sklearn to build machine learning models."
464 |    ]
465 |   },
466 |   {
467 |    "cell_type": "code",
468 |    "execution_count": null,
469 |    "metadata": {
470 |     "collapsed": true
471 |    },
472 |    "outputs": [],
473 |    "source": []
474 |   }
475 |  ],
476 |  "metadata": {
477 |   "kernelspec": {
478 |    "display_name": "Python 3",
479 |    "language": "python",
480 |    "name": "python3"
481 |   },
482 |   "language_info": {
483 |    "codemirror_mode": {
484 |     "name": "ipython",
485 |     "version": 3
486 |    },
487 |    "file_extension": ".py",
488 |    "mimetype": "text/x-python",
489 |    "name": "python",
490 |    "nbconvert_exporter": "python",
491 |    "pygments_lexer": "ipython3",
492 |    "version": "3.6.1"
493 |   },
494 |   "toc": {
495 |    "nav_menu": {},
496 |    "number_sections": true,
497 |    "sideBar": true,
498 |    "skip_h1_title": false,
499 |    "toc_cell": false,
500 |    "toc_position": {},
501 |    "toc_section_display": "block",
502 |    "toc_window_display": true
503 |   }
504 |  },
505 |  "nbformat": 4,
506 |  "nbformat_minor": 2
507 | }
--------------------------------------------------------------------------------
/10.4_Count_or_frequency_encoding.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Count or frequency encoding\n",
  8 |     "\n",
  9 |     "Another way to refer to variables that have a multitude of categories is to call them variables with **high cardinality**.\n",
 10 |     "\n",
 11 |     "We observed in the previous lecture that, if a categorical variable contains multiple labels, re-encoding it with one hot encoding expands the feature space dramatically.\n",
 12 |     "\n",
 13 |     "One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the percentage of observations in that category. The 2 are equivalent.\n",
 14 |     "\n",
 15 |     "There is no rationale behind this transformation, other than its simplicity.\n",
 16 |     "\n",
 17 |     "### Advantages\n",
 18 |     "\n",
 19 |     "- Simple\n",
 20 |     "- Does not expand the feature space\n",
 21 |     "\n",
 22 |     "### Disadvantages\n",
 23 |     "\n",
 24 |     "- If 2 labels appear the same number of times in the dataset (that is, they contain the same number of observations), they will be merged, so we may lose valuable information\n",
 25 |     "- Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power\n",
 26 |     "\n",
 27 |     "Follow this thread in Kaggle for more information:\n",
 28 |     "https://www.kaggle.com/general/16927\n",
 29 |     "\n",
 30 |     "Let's see how this works:"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "code",
 35 |    "execution_count": 1,
 36 |    "metadata": {},
 37 |    "outputs": [
 38 |     {
 39 |      "data": {
 40 |       "text/html": [
 41 |        "<div>
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 122 | ], 123 | "text/plain": [ 124 | " y X1 X2 X3 X4 X5 X6\n", 125 | "0 130.81 v at a d u j\n", 126 | "1 88.53 t av e d y l\n", 127 | "2 76.26 w n c d x j\n", 128 | "3 80.62 t n f d x l\n", 129 | "4 78.02 v n f d h d" 130 | ] 131 | }, 132 | "execution_count": 1, 133 | "metadata": {}, 134 | "output_type": "execute_result" 135 | } 136 | ], 137 | "source": [ 138 | "import pandas as pd\n", 139 | "import numpy as np\n", 140 | "\n", 141 | "from sklearn.model_selection import train_test_split\n", 142 | "\n", 143 | "# let's open the mercedes benz dataset for demonstration\n", 144 | "\n", 145 | "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'y'])\n", 146 | "data.head()" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 2, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "name": "stdout", 156 | "output_type": "stream", 157 | "text": [ 158 | "X1 : 27 labels\n", 159 | "X2 : 44 labels\n", 160 | "X3 : 7 labels\n", 161 | "X4 : 4 labels\n", 162 | "X5 : 29 labels\n", 163 | "X6 : 12 labels\n" 164 | ] 165 | } 166 | ], 167 | "source": [ 168 | "# let's have a look at how many labels\n", 169 | "\n", 170 | "for col in data.columns[1:]:\n", 171 | " print(col, ': ', len(data[col].unique()), ' labels')" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### Important\n", 179 | "\n", 180 | "When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count/total observations) **over the training set**, and then use those numbers to replace the labels in the test set." 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 3, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "((2946, 6), (1263, 6))" 192 | ] 193 | }, 194 | "execution_count": 3, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "X_train, X_test, y_train, y_test = train_test_split(data[['X1', 'X2', 'X3', 'X4', 'X5', 'X6']], data.y,\n", 201 | " test_size=0.3,\n", 202 | " random_state=0)\n", 203 | "X_train.shape, X_test.shape" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 4, 209 | "metadata": { 210 | "scrolled": true 211 | }, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "{'a': 34,\n", 217 | " 'aa': 1,\n", 218 | " 'ac': 10,\n", 219 | " 'ae': 342,\n", 220 | " 'af': 1,\n", 221 | " 'ag': 15,\n", 222 | " 'ah': 3,\n", 223 | " 'ai': 289,\n", 224 | " 'ak': 188,\n", 225 | " 'al': 3,\n", 226 | " 'am': 1,\n", 227 | " 'an': 3,\n", 228 | " 'ao': 10,\n", 229 | " 'ap': 5,\n", 230 | " 'aq': 46,\n", 231 | " 'as': 1155,\n", 232 | " 'at': 5,\n", 233 | " 'au': 3,\n", 234 | " 'av': 2,\n", 235 | " 'aw': 2,\n", 236 | " 'ay': 40,\n", 237 | " 'b': 12,\n", 238 | " 'c': 1,\n", 239 | " 'd': 12,\n", 240 | " 'e': 61,\n", 241 | " 'f': 59,\n", 242 | " 'g': 10,\n", 243 | " 'h': 4,\n", 244 | " 'i': 15,\n", 245 | " 'k': 16,\n", 246 | " 'l': 1,\n", 247 | " 'm': 284,\n", 248 | " 'n': 97,\n", 249 | " 'o': 1,\n", 250 | " 'p': 1,\n", 251 | " 'q': 3,\n", 252 | " 'r': 101,\n", 253 | " 's': 63,\n", 254 | " 't': 17,\n", 255 | " 'x': 8,\n", 256 | " 'y': 8,\n", 257 | " 'z': 14}" 258 | ] 259 | }, 260 | "execution_count": 4, 261 | "metadata": {}, 262 | "output_type": "execute_result" 263 | } 264 | ], 265 | "source": [ 266 | "# let's obtain the counts for each one of the labels in variable X2\n", 267 | "# let's capture this in a dictionary that we can use to 
re-map the labels\n", 268 | "\n", 269 | "X_train.X2.value_counts().to_dict()" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 5, 275 | "metadata": {}, 276 | "outputs": [ 277 | { 278 | "data": { 279 | "text/html": [ 280 | "
\n", 281 | "\n", 294 | "\n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | "
    <tr><th></th><th>X1</th><th>X2</th><th>X3</th><th>X4</th><th>X5</th><th>X6</th></tr>
    <tr><th>3059</th><td>aa</td><td>ai</td><td>c</td><td>d</td><td>q</td><td>g</td></tr>
    <tr><th>3014</th><td>b</td><td>m</td><td>c</td><td>d</td><td>q</td><td>i</td></tr>
    <tr><th>3368</th><td>o</td><td>f</td><td>f</td><td>d</td><td>s</td><td>l</td></tr>
    <tr><th>2772</th><td>aa</td><td>as</td><td>d</td><td>d</td><td>p</td><td>j</td></tr>
    <tr><th>3383</th><td>v</td><td>e</td><td>c</td><td>d</td><td>s</td><td>g</td></tr>
\n", 354 | "
" 355 | ], 356 | "text/plain": [ 357 | " X1 X2 X3 X4 X5 X6\n", 358 | "3059 aa ai c d q g\n", 359 | "3014 b m c d q i\n", 360 | "3368 o f f d s l\n", 361 | "2772 aa as d d p j\n", 362 | "3383 v e c d s g" 363 | ] 364 | }, 365 | "execution_count": 5, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "# lets look at X_train so we can compare then the variable re-coding\n", 372 | "\n", 373 | "X_train.head()" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 6, 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "data": { 383 | "text/html": [ 384 | "
\n", 385 | "\n", 398 | "\n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | "
    <tr><th></th><th>X1</th><th>X2</th><th>X3</th><th>X4</th><th>X5</th><th>X6</th></tr>
    <tr><th>3059</th><td>aa</td><td>289</td><td>c</td><td>d</td><td>q</td><td>g</td></tr>
    <tr><th>3014</th><td>b</td><td>284</td><td>c</td><td>d</td><td>q</td><td>i</td></tr>
    <tr><th>3368</th><td>o</td><td>59</td><td>f</td><td>d</td><td>s</td><td>l</td></tr>
    <tr><th>2772</th><td>aa</td><td>1155</td><td>d</td><td>d</td><td>p</td><td>j</td></tr>
    <tr><th>3383</th><td>v</td><td>61</td><td>c</td><td>d</td><td>s</td><td>g</td></tr>
\n", 458 | "
" 459 | ], 460 | "text/plain": [ 461 | " X1 X2 X3 X4 X5 X6\n", 462 | "3059 aa 289 c d q g\n", 463 | "3014 b 284 c d q i\n", 464 | "3368 o 59 f d s l\n", 465 | "2772 aa 1155 d d p j\n", 466 | "3383 v 61 c d s g" 467 | ] 468 | }, 469 | "execution_count": 6, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "# And now let's replace each label in X2 by its count\n", 476 | "\n", 477 | "# first we make a dictionary that maps each label to the counts\n", 478 | "X_frequency_map = X_train.X2.value_counts().to_dict()\n", 479 | "\n", 480 | "# and now we replace X2 labels both in train and test set with the same map\n", 481 | "X_train.X2 = X_train.X2.map(X_frequency_map)\n", 482 | "X_test.X2 = X_test.X2.map(X_frequency_map)\n", 483 | "\n", 484 | "X_train.head()" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": {}, 490 | "source": [ 491 | "Where in the original dataset, for the observation 1 in the variable 2 before it was 'ai', now it was replaced by the count 289. And so on for the rest of the categories (compare outputs 5 and 6).\n", 492 | "\n", 493 | "### Note\n", 494 | "\n", 495 | "I want you to keep in mind something important:\n", 496 | "\n", 497 | "If a category is present in the test set, that was not present in the train set, this method will generate missing data in the test set. This is why it is extremely important to handle rare categories, as we say in section 6 of this course.\n", 498 | "\n", 499 | "Then we can combine rare label replacement plus categorical encoding with counts like this: we may choose to replace the 10 most frequent labels by their count, and then group all the other labels under one label (for example \"Rare\"), and replace \"Rare\" by its count, to account for what I just mentioned.\n", 500 | "\n", 501 | "In coming sections I will explain more methods of categorical encoding. I want you to keep in mind that There is no rule of thumb to indicate which method you should use to encode categorical variables. It is mostly up to what makes sense for the data, and it also depends on what you are trying to achieve. In general, for data competitions, we value more model predictive power, whereas in business scenarios we want to capture and understand the information, and generally, we want to transform variables in a way that it makes 'Business sense'. 
Some of your common sense and a lot of conversation with the people that understand the data well will be required to encode categorical labels.\n" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": { 508 | "collapsed": true 509 | }, 510 | "outputs": [], 511 | "source": [] 512 | } 513 | ], 514 | "metadata": { 515 | "kernelspec": { 516 | "display_name": "Python 3", 517 | "language": "python", 518 | "name": "python3" 519 | }, 520 | "language_info": { 521 | "codemirror_mode": { 522 | "name": "ipython", 523 | "version": 3 524 | }, 525 | "file_extension": ".py", 526 | "mimetype": "text/x-python", 527 | "name": "python", 528 | "nbconvert_exporter": "python", 529 | "pygments_lexer": "ipython3", 530 | "version": "3.6.1" 531 | }, 532 | "toc": { 533 | "nav_menu": {}, 534 | "number_sections": true, 535 | "sideBar": true, 536 | "skip_h1_title": false, 537 | "toc_cell": false, 538 | "toc_position": {}, 539 | "toc_section_display": "block", 540 | "toc_window_display": true 541 | } 542 | }, 543 | "nbformat": 4, 544 | "nbformat_minor": 2 545 | } 546 | -------------------------------------------------------------------------------- /15.5_Bonus_Additional_reading_resources.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Additional reading resources\n", 8 | "\n", 9 | "### Discretisation\n", 10 | "\n", 11 | "#### Articles\n", 12 | "- [Discretisation: An Enabling Technique](http://www.public.asu.edu/~huanliu/papers/dmkd02.pdf)\n", 13 | "- [Supervised and unsupervised discretisation of continuous features](http://ai.stanford.edu/~ronnyk/disc.pdf)\n", 14 | "- [ChiMerge: Discretisation of Numeric Attributes](https://www.aaai.org/Papers/AAAI/1992/AAAI92-019.pdf)\n", 15 | "\n", 16 | "#### Master thesis\n", 17 | "- [Beating Kaggle the easy way](https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf)\n", 18 | "\n", 19 | "#### Blog\n", 20 | "- [Tips for honing logistic regression models](https://blog.zopa.com/2017/07/20/tips-honing-logistic-regression-models/)\n", 21 | "- [ChiMerge discretisation algorithm](https://alitarhini.wordpress.com/2010/11/02/chimerge-discretization-algorithm/)\n", 22 | "\n", 23 | "#### Other\n", 24 | "- [Score card development stages: Binning](https://plug-n-score.com/learning/scorecard-development-stages.htm#Binning)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [] 35 | } 36 | ], 37 | "metadata": { 38 | "kernelspec": { 39 | "display_name": "Python 3", 40 | "language": "python", 41 | "name": "python3" 42 | }, 43 | "language_info": { 44 | "codemirror_mode": { 45 | "name": "ipython", 46 | "version": 3 47 | }, 48 | "file_extension": ".py", 49 | "mimetype": "text/x-python", 50 | "name": "python", 51 | "nbconvert_exporter": "python", 52 | "pygments_lexer": "ipython3", 53 | "version": "3.6.1" 54 | }, 55 | "toc": { 56 | "nav_menu": {}, 57 | "number_sections": true, 58 | "sideBar": true, 59 | "skip_h1_title": false, 60 | "toc_cell": false, 61 | "toc_position": {}, 62 | "toc_section_display": "block", 63 | "toc_window_display": false 64 | } 65 | }, 66 | "nbformat": 4, 67 | "nbformat_minor": 2 68 | } 69 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 
Feature-Engineering --------------------------------------------------------------------------------
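
The count-encoding workflow spread across the notebook above can be condensed into one self-contained script. This is a minimal sketch, not the notebook's own code: the toy DataFrame, the `frequency_map` name, and the `fillna(0)` guard for unseen labels are illustrative assumptions (the notebook itself suggests grouping infrequent labels under a "Rare" bucket instead).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the Mercedes-Benz column X2 (values made up for illustration)
data = pd.DataFrame({
    'X2': ['as', 'as', 'ai', 'ae', 'm', 'ai', 'as', 'r', 'n', 'aq'] * 30,
    'y': range(300),
})

# split first: the counts must be learned on the training portion only,
# otherwise information from the test set leaks into the encoding
X_train, X_test, y_train, y_test = train_test_split(
    data[['X2']], data['y'], test_size=0.3, random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()

# learn label -> count on the train set
frequency_map = X_train['X2'].value_counts().to_dict()

# replace the labels in both sets with the same map
X_train['X2'] = X_train['X2'].map(frequency_map)
X_test['X2'] = X_test['X2'].map(frequency_map)

# labels seen only in the test set come back as NaN after .map();
# fill them with 0 here (or encode a grouped 'Rare' label instead)
n_unseen = X_test['X2'].isna().sum()
X_test['X2'] = X_test['X2'].fillna(0)

print(X_train['X2'].head())
print('labels unseen in training:', n_unseen)
```

On this toy data every label is frequent enough to land in both splits, so `n_unseen` will typically be 0; on a real column such as X2 of the Mercedes-Benz set, with 44 distinct labels, the guard matters.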