├── 02.1_Numerical_variables.ipynb ├── 02.2_Categorical_variables.ipynb ├── 02.3_Dates.ipynb ├── 02.4_Mixed_variables.ipynb ├── 02.5_Bonus_Learn_more_on_Lending_Club_dataset.ipynb ├── 03.1_Missing_values.ipynb ├── 03.2_Outliers.ipynb ├── 03.3_Labels.ipynb ├── 03.4_Rare_values.ipynb ├── 03.5_Bonus_Machine_Learning_Algorithms_Overview.ipynb ├── 03.6_Bonus_Additional_reading_resources_on_variable_problems.ipynb ├── 04.1_Variable_magnitude.ipynb ├── 04.2_Linear_assumption.ipynb ├── 04.3_Variable_distribution.ipynb ├── 04.4_Bonus_Additional_reading_resources.ipynb ├── 05.1_Complete_Case_Analysis.ipynb ├── 05.2_Mean_and_median_imputation.ipynb ├── 05.3_Random_sample_imputation.ipynb ├── 05.4_Adding_a_variable_to_capture_NA.ipynb ├── 05.5_End_of_distribution_imputation.ipynb ├── 05.6_Arbitrary_value_imputation.ipynb ├── 06.1_Frequent_category_imputation.ipynb ├── 06.2_Random_sample_imputation.ipynb ├── 06.3_Adding_a_variable_to_capture_NA.ipynb ├── 06.4_Adding_a_category_to_capture_NA.ipynb ├── 07.1_Bonus_Overview_of_missing_ value_imputation_methods.ipynb ├── 07.2_Conclusion_when_to_use_each_NA_imputation_methods.ipynb ├── 08.1_Top_coding_bottom_coding_and_zero_coding.ipynb ├── 09.1_Engineering_rare_values_1.ipynb ├── 09.2_Engineering_rare_values_2.ipynb ├── 09.3_Engineering_rare_values_3.ipynb ├── 09.4_Engineering_rare_values.ipynb ├── 10.10_Bonus_Additional_reading_resources.ipynb ├── 10.1_One_hot_encoding.ipynb ├── 10.2_One_hot_encoding_variables_with_many_labels.ipynb ├── 10.3_Ordinal_numbering_encoding.ipynb ├── 10.4_Count_or_frequency_encoding.ipynb ├── 10.5_Target_guided_ordinal_encoding.ipynb ├── 10.6_Mean_encoding.ipynb ├── 10.7_Probability_ratio_encoding.ipynb ├── 10.8_Weight_of_evidence_WOE.ipynb ├── 10.9_Comparison_of_categorical_variable_encoding.ipynb ├── 11.1_Engineering_mixed_variables.ipynb ├── 12.1_Engineering_dates.ipynb ├── 13.1_Normalisation-Standarisation.ipynb ├── 13.2_Scaling_to_minimum_and_maximum_values.ipynb ├── 13.3_Scaling_to_median_and_quantiles.ipynb ├── 14.1_Gaussian_transformations.ipynb ├── 15.1_Equal_frequency_discretisation.ipynb ├── 15.2_Equal_width_discretisation.ipynb ├── 15.3_Domain_knowledge_discretisation.ipynb ├── 15.4_Discretisation_with_Decision_Trees.ipynb ├── 15.5_Bonus_Additional_reading_resources.ipynb ├── 16.2_Classification_I.ipynb ├── 16.3_Regression.ipynb └── README.md /02.4_Mixed_variables.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Mixed variables\n", 8 | "\n", 9 | "Mixed variables are those whose values contain both numbers and labels.\n", 10 | "\n", 11 | "Variables can be mixed for a variety of reasons. For example, when credit agencies gather and store financial information about users, the values of the variables they store are usually numbers. However, in some cases the credit agencies cannot retrieve information for a certain user, for a variety of reasons. In these situations, the agencies code each reason why the information could not be retrieved with a different code or 'label'. In this way, they generate mixed-type variables. These variables contain numbers when the value could be retrieved, and labels otherwise.\n", 12 | "\n", 13 | "As an example, think of the variable 'number_of_open_accounts'. It can take any number, representing the number of different financial accounts of the borrower.
Sometimes, information may not be available for a certain borrower, for a variety of reasons. Each reason will be coded with a different letter, for example: 'A': couldn't identify the person, 'B': no relevant data, 'C': person seems not to have any open account. So the same column may end up holding values like 3, 0, 'A', 1, 'C'.\n", 14 | "\n", 15 | "Another example of a mixed-type variable is missed_payment_status. This variable indicates whether a borrower has missed any payment on a financial product. For example, if the borrower has a credit card, this variable indicates whether they missed a monthly payment on it. Therefore, this variable can take the values 0, 1, 2 or 3, meaning that the customer has missed 0-3 payments on the account. It can also take the value D if the customer defaulted on that account.\n", 16 | "\n", 17 | "Typically, once the customer has missed 3 payments, the lender declares the item defaulted (D), which is why this variable takes the numerical values 0-3 and then D.\n", 18 | "\n", 19 | "\n", 20 | "For this lecture, you will need to download a toy csv file that I created and uploaded at the end of the lecture in Udemy. It is called sample_s2.csv." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "import pandas as pd\n", 32 | "\n", 33 | "import matplotlib.pyplot as plt\n", 34 | "%matplotlib inline" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 7, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
\n", 46 | "\n", 59 | "\n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | "
idopen_il_24m
01077501C
11077430A
21077175A
31076863A
41075358A
\n", 95 | "
" 96 | ], 97 | "text/plain": [ 98 | " id open_il_24m\n", 99 | "0 1077501 C\n", 100 | "1 1077430 A\n", 101 | "2 1077175 A\n", 102 | "3 1076863 A\n", 103 | "4 1075358 A" 104 | ] 105 | }, 106 | "execution_count": 7, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "# open_il_24m indicates:\n", 113 | "# \"Number of installment accounts opened in past 24 months\".\n", 114 | "# Installment accounts are those that, at the moment of acquiring them,\n", 115 | "# there is a set period and amount of repayments agreed between the\n", 116 | "# lender and borrower. An example of this is a car loan, or a student loan.\n", 117 | "# the borrowers know that they are going to pay a certain,\n", 118 | "# fixed amount over, for example 36 months.\n", 119 | "\n", 120 | "data = pd.read_csv('sample_s2.csv')\n", 121 | "data.head()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 8, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "(887379, 2)" 133 | ] 134 | }, 135 | "execution_count": 8, 136 | "metadata": {}, 137 | "output_type": "execute_result" 138 | } 139 | ], 140 | "source": [ 141 | "data.shape" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 9, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "array(['C', 'A', 'B', '0.0', '1.0', '2.0', '4.0', '3.0', '6.0', '5.0',\n", 153 | " '9.0', '7.0', '8.0', '13.0', '10.0', '19.0', '11.0', '12.0', '14.0',\n", 154 | " '15.0'], dtype=object)" 155 | ] 156 | }, 157 | "execution_count": 9, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | } 161 | ], 162 | "source": [ 163 | "# 'A': couldn't identify the person\n", 164 | "# 'B': no relevant data\n", 165 | "# 'C': person seems not to have any account open\n", 166 | "\n", 167 | "data.open_il_24m.unique()" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 10, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "data": { 177 | "text/plain": [ 178 | "" 179 | ] 180 | }, 181 | "execution_count": 10, 182 | "metadata": {}, 183 | "output_type": "execute_result" 184 | }, 185 | { 186 | "data": { 187 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAZsAAAEVCAYAAAA2IkhQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu8VXWd//HXG0HyAgiKpAJiQjVqqcmgpb/SmJCy1Ck1\n7SI6eBud0SabwpoZSqN0Jq1xUkcbyctkSjYpao7itTFHhLwhXkYSFPECAor3BD+/P77fLYvtOWev\nA2eds928n4/Hepy1v2t9v/uz1t5nf/b6ru9eSxGBmZlZlXr1dABmZtb6nGzMzKxyTjZmZlY5Jxsz\nM6uck42ZmVXOycbMzCrnZGNdTtJFkr7fQ88tST+XtFzS3W0s/7KkG3sgrtskHZXnj5B0R3fHYNaT\nnGzWA5IWSFosaZNC2VGSbuvBsKqyF/ApYGhEjKlfGBG/iIhx6/okkkLSyHVtpztJ2lvSUz0dR1fp\nyS811nlONuuPDYCTejqIzpK0QSerbAssiIhXqojHzNaOk83641+Ab0jarH6BpBH5m3rvQll9t8/v\nJf1Y0guSHpf0sVy+MB81TahrdgtJMyS9JOl2SdsW2v5gXrZM0qOSDiksu0jSeZJ+K+kVYJ824t1a\n0vRcf56ko3P5ROA/gI9KelnS99qou0YXVt7u4yQ9lrftHEnKy0bm2F+U9LykK3L573L1+/PzfFHS\nQEnXSlqSu/CulTS08cvydgzH5xheknSapO0l3SlphaRpkjYsrP9ZSffleO+U9OHCsgWSviHpgRz3\nFZLek49qrwe2zjG/LGnrNmLZT9K9+XkXSvpu3fK98nO+kJcfkcs3knSmpCfy894haaO8bH9Jc3Od\n2yT9Wd22jyw8fvtopXYkJunk/B57RtKRedkxwJeBb+ZtuSaXf0vSorwfH5U0tp19PkDSJfn1ekLS\nP0jqlZfV3u8/zdvySLGdXPfCHM8iSd9X/lJUe39J+lF+H8yX9Oky74OWFxGeWnwCFgB/AfwX8P1c\ndhRwW54fAQTQu1DnNuCoPH8EsBI4knSE9H3gSeAcoC8wDngJ2DSvf1F+/PG8/F+BO/KyTYCFua3e\nwK7A88AOhbovAnuSvgy9p43t+R1wLvAeYBdgCfDJQqx3dLAv1liet/taYDNgeG5rfF72S+A7tTiA\nverqjSw83hz4ArAx0A/4FXBVB/uzPoargf7AjsAbwM3A+4ABwEPAhLzursBiYPf8WkzIr2/fwmt9\nN7A1MAh4GDguL9sbeKrBe2Vv4EN5mz8MPAccmJdtm1/Xw4A+eZt3ycvOydu4TY7rY/m1fz/wCqlr\nsw/wTWAesGE7+/EiVr9H9ya9707NdT8DvAoMrF83P/4A6b21deF9vX0723lJ3uf98nr/B0yse7//\nXX7eL5Lek4Py8t8A55Pey1vm/X1soe6bwNF5P/w18DSgnv4c6OnJRzbrl38C/lbS4LWoOz8ifh4R\nq4ArgGHAqRHxRkTcCPwJKJ7DuC4ifhcRb5A+sD8qaRjwWVI3188jYmVE3Av8Gji4UPfqiPh9RLwV\nEa8Xg8ht7Al8KyJej4j7SEczh6/FNtWcHhEvRMSTwK2kBAbpQ2Nb0ofX6xHR7kn9iFgaEb+OiFcj\n4iVgCvCJTsTwzxGxIiLmAg8CN0bE4xHxIumIZNe83jHA+RExMyJWRcTFpOS0R6GtsyPi6YhYBlxT\n2J6GIuK2iJiT9/0DpIRb244vATdFxC8j4s28zfflI4K/Ak6KiEU5rjvza/9F0nthRkS8CfwI2IiU\njMp4k/Q+ezMifgu8TEoqbVlFSnA7SOoTEQsi4o/1K+WjkEOBUyLipYhYAJwJfLWw2mLgJ/l5rwAe\nBfaTNISU9L4WEa9ExGLgx7m9mici4mf5f+ViYCtgSMntbVlONuuRiHiQ9C1+0lpUf64w/1pur75s\n08LjhYXnfRlYRvq2vS2we+5SeUHSC6TukPe2VbcNWwPL8gd6zROkb9Rr69nC/Kus3o5vAgLuzt1A\nf9VeA5I2lnR+7pJZQTr62kzlzznV78v29u22wMl1+28Yab802p6GJO0u6dbcvfQicBywRV48DHjH\nh3de/p52lm1Nen0AiIi3SK9v2ddraUSsLDxud3siYh7wNeC7wGJJl7fVVZjj7VOMi3e+hxZFRNQt\nr71/+wDPFPb/+aQjnJq3939EvJpnS78GrcrJZv0zmXSIX/zHqp1M37hQVvzwXxvDajOSNiV16TxN\n+qC5PSI2K0ybRsRfF+p2dCnyp4FBkvoVyoYDi9Yx3neIiGcj4uiI2Bo4FjhX7Y9AO5n0jXv3iOhP\n6kKElKy60kJgSt3+2zgiflmibplLvF8GTAeGRcQA4N9ZvQ0Lge3bqPM88Ho7y54mfUADaWg66b1R\ne71eZe3fd+/Ynoi4LCL2ys8ZwBntxFs7aq2pfw9tk2MtLq+9f98Atijs//4RsWMn4l4vOdmsZ/K3\nvyuAEwtlS0j/aF+RtEH+Bt/WB0dnfCafTN4QOA24KyIWko6s3i/pq5L65OnPiyeNG8S/ELgT+GE+\n8f1hYCLwn+sY7ztIOlirT/IvJ314vZUfP0c6p1LTj3QE8oKkQaSkXoWfAcflIxBJ2iSf1O/XsGaK\neXNJAzpYpx/pyPF1SWNIXWc1vwD+QtIhknpL2lzSLvloZSpwltLgjQ0kfVRSX2AaqftprKQ+pKT8\nBuk1BLgP+FKuM57OdT2u8RpI+oCkT+bnfZ30erxVXyl3b00DpkjqpzR45eus+R7aEjgxvz8PBv4M\n+G1EPAPcCJwpqb+kXkqDOToT93rJyWb9dCrp5GbR0cDfA0tJJ6nvrK/USZeRPnCXAbsBXwHI3V/j\nSH3cT5O6HM4g9bWXdRjppO7TpJO1kyPipnWMty1/DsyU9DLp2/5JEfF4XvZd4OLclXII8BPSuYjn\ngbuA/64gHiJiNum1+ikpAc4jnZQuU/cR0jmYx3PcbXUxHQ+cKukl0jm+aYX6T5LOV5xMel3vA3bO\ni78BzAFm5WVnAL0i4lHSa/9vpH3zOeBzEfGnXO+kXFbrTr2qzLZkF5LOz7wg6SrSe+j0/DzPkhLG\nKe3U/VvSEf3jwB2k9+vUwvKZwKjc1hTgoIhYmpcdDmxIGrixHLiSdF7GOqA1uyXNzNZvSsO5j8rd\ncdZFfGRjZmaVc7IxM7PKuRvNzMwq5yMbMzOrnJONmZlVrnfjVdYPW2yxRYwYMaKnwzAze1f5wx/+\n8HxENLwElpNNNmLECGbPnt3TYZiZvatIeqLxWu5GMzOzbuBkY2ZmlXOyMTOzyjnZmJlZ5ZxszMys\ncpUmG6X7oc9Rul/67Fw2SOn+84/lvwML65+idE/5RyXtWyjfLbczT9LZtftMSOqrdI/1eZJmShpR\nqDMhP8djkiZUuZ1mZtax7jiy2ScidomI0fnxJODmiBhFus/6JABJO5AuO78jMJ50o6raXQ7PI11W\nfVSexufyicDyiBhJujXrGbmt2v1EdgfGAJOLSc3MzLpX
T3SjHUC6Lzf574GF8svzPe3nk+7TMUbS\nVkD/iLgr36b1kro6tbauBMbmo559gRkRsSwilgMzWJ2gzMysm1X9o84AbpK0Cjg/Ii4AhuS73UG6\nwdGQPL8N6aZTNU/lsjfzfH15rc5CgIhYme+ZvnmxvI06pY2YdF2Hyxecvl9nmzQzWy9VnWz2iohF\nkrYEZkh6pLgwIkJSj112WtIxwDEAw4cP76kwzMxaXqXdaBGxKP9dTLp97xjgudw1Rv67OK++CBhW\nqD40ly3K8/Xla9SR1BsYQLqtcXtt1cd3QUSMjojRgwc3vLSPmZmtpcqSjaRNJPWrzZPuO/8g6V7u\ntdFhE4Cr8/x04NA8wmw70kCAu3OX2wpJe+TzMYfX1am1dRBwSz6vcwMwTtLAPDBgXC4zM7MeUGU3\n2hDgN3mUcm/gsoj4b0mzgGmSJgJPAIcARMRcSdOAh4CVwAkRsSq3dTxwEbARcH2eAC4ELpU0D1hG\nGs1GRCyTdBowK693akQsq3BbzcysA5Ulm4h4HNi5jfKlwNh26kwBprRRPhvYqY3y14GD22lrKjC1\nc1GbmVkVfAUBMzOrnJONmZlVzsnGzMwq52RjZmaVc7IxM7PKOdmYmVnlnGzMzKxyTjZmZlY5Jxsz\nM6uck42ZmVXOycbMzCrnZGNmZpVzsjEzs8o52ZiZWeWcbMzMrHJONmZmVjknGzMzq5yTjZmZVc7J\nxszMKudkY2ZmlXOyMTOzyjnZmJlZ5ZxszMysck42ZmZWOScbMzOrnJONmZlVzsnGzMwq52RjZmaV\nc7IxM7PKOdmYmVnlnGzMzKxyTjZmZla5ypONpA0k3Svp2vx4kKQZkh7LfwcW1j1F0jxJj0rat1C+\nm6Q5ednZkpTL+0q6IpfPlDSiUGdCfo7HJE2oejvNzKx93XFkcxLwcOHxJODmiBgF3JwfI2kH4FBg\nR2A8cK6kDXKd84CjgVF5Gp/LJwLLI2Ik8GPgjNzWIGAysDswBphcTGpmZta9Kk02koYC+wH/USg+\nALg4z18MHFgovzwi3oiI+cA8YIykrYD+EXFXRARwSV2dWltXAmPzUc++wIyIWBYRy4EZrE5QZmbW\nzao+svkJ8E3grULZkIh4Js8/CwzJ89sACwvrPZXLtsnz9eVr1ImIlcCLwOYdtGVmZj2gsmQj6bPA\n4oj4Q3vr5COVqCqGRiQdI2m2pNlLlizpqTDMzFpelUc2ewL7S1oAXA58UtJ/As/lrjHy38V5/UXA\nsEL9oblsUZ6vL1+jjqTewABgaQdtrSEiLoiI0RExevDgwWu/pWZm1qHKkk1EnBIRQyNiBOnE/y0R\n8RVgOlAbHTYBuDrPTwcOzSPMtiMNBLg7d7mtkLRHPh9zeF2dWlsH5ecI4AZgnKSBeWDAuFxmZmY9\noHcPPOfpwDRJE4EngEMAImKupGnAQ8BK4ISIWJXrHA9cBGwEXJ8ngAuBSyXNA5aRkhoRsUzSacCs\nvN6pEbGs6g0zM7O2dUuyiYjbgNvy/FJgbDvrTQGmtFE+G9ipjfLXgYPbaWsqMHVtYzYzs67TsBtN\n0vaS+ub5vSWdKGmz6kMzM7NWUeacza+BVZJGAheQTrxfVmlUZmbWUsokm7fyb1j+Evi3iPh7YKtq\nwzIzs1ZSJtm8Kekw0qiva3NZn+pCMjOzVlMm2RwJfBSYEhHz87DkS6sNy8zMWkmHo9HyhTC/ExFf\nrpXl65adUXVgZmbWOjo8ssm/c9lW0obdFI+ZmbWgMr+zeRz4vaTpwCu1wog4q7KozMyspZRJNn/M\nUy+gX7XhmJlZK2qYbCLiewCSNo6IV6sPyczMWk2ZKwh8VNJDwCP58c6Szq08MjMzaxllhj7/hHTn\ny6UAEXE/8PEqgzIzs9ZS6hYDEbGwrmhVmyuamZm1ocwAgYWSPgaEpD7AScDD1YZlZmatpMyRzXHA\nCcA2pLtd7pIfm5mZlVLmyObl4hUEzMzMOqtMsnlQ0nPA/+Tpjoh4sdqwzMyslTTsRouIkcBhwBxg\nP+B+SfdVHZiZmbWOhkc2koYCewL/D9gZmAvcUXFcZmbWQsp0oz0JzAJ+EBHHVRyPmZm1oDKj0XYF\nLgG+JOl/JV0iaWLFcZmZWQspc220+yXVLsb5/4CvAJ8ALqw4NjMzaxFlztnMBvoCd5JGo308Ip6o\nOjAzM2sdZc7ZfDoillQeiZmZtawy52z+JOksSbPzdKakAZVHZmZmLaNMspkKvAQckqcVwM+rDMrM\nzFpLmW607SPiC4XH3/OPOs3MrDPKHNm8Jmmv2gNJewKvVReSmZm1mjJHNscBlxTO0ywHJlQXkpmZ\ntZoOk42kXsAHImJnSf0BImJFt0RmZmYto8NutIh4C/hmnl/hRGNmZmujzDmbmyR9Q9IwSYNqU+WR\nmZlZyyhzzuaL+W/x7pwBvK/rwzEzs1bU4ZFNPmfzlYjYrm5qmGgkvUfS3ZLulzRX0vdy+SBJMyQ9\nlv8OLNQ5RdI8SY9K2rdQvpukOXnZ2ZKUy/tKuiKXz5Q0olBnQn6OxyR5QIOZWQ8qc87mp2vZ9hvA\nJyNiZ2AXYLykPYBJwM0RMQq4OT9G0g7AocCOwHjgXEkb5LbOA44GRuVpfC6fCCzPN3j7MXBGbmsQ\nMBnYHRgDTC4mNTMz615lztncLOkLtaOJsiJ5OT/sk6cADgAuzuUXAwfm+QOAyyPijYiYD8wDxkja\nCugfEXdFRJBud1CsU2vrSmBsjnNfYEZELIuI5cAMVicoMzPrZmXO2RwLfB1YJek1QKRc0r9RxXxk\n8gdgJHBORMyUNCQinsmrPAsMyfPbAHcVqj+Vy97M8/XltToLSQGtlPQisHmxvI06xfiOAY4BGD58\neKPN6bQRk65ruM6C0/fr8uc1M2s2DY9sIqJfRPSKiD4R0T8/bphoct1VEbELMJR0lLJT3fIgHe30\niIi4ICJGR8TowYMH91QYZmYtr0w3GpL2l/SjPH22s08SES8At5K6sp7LXWPkv4vzaouAYYVqQ3PZ\nojxfX75GHUm9gQHA0g7aMjOzHtAw2Ug6HTgJeChPJ0n6YYl6gyVtluc3Aj4FPAJMZ/XlbiYAV+f5\n6cCheYTZdqSBAHfnLrcVkvbI52MOr6tTa+sg4JZ8tHQDME7SwDwwYFwuMzOzHlDmnM1ngF3yyDQk\nXQzcC5zSoN5WwMX5vE0vYFpEXCvpf4FpkiYCT5BuW0BEzJU0jZTQVgInRMSq3NbxwEXARsD1eYJ0\na+pLJc0DlpFGsxERyySdBszK650aEctKbKuZmVWgTLIB2Iz0YQ6pq6qhiHgA2LWN8qXA2HbqTAGm\ntFE+G9ipjfLXgYPbaWsq6V48ZmbWw8okmx8C90q6lTQS7ePk38aYmZmV0TDZRMQvJd0G/Dlp5Ni3\nIuLZqgMzM7PWUbYb7aPAXqRk0xv4TWURmZlZyykzGu1c0g3U5gAPAsdKOqfqwMzMrHWUObL5JPBn\neUhxbTTa3Eq
jMjOzllLmR53zgOK1XIblMjMzs1LaPbKRdA3pHE0/4GFJd+fHuwN3d094ZmbWCjrq\nRvtRt0VhZmYtrd1kExG3d2cgZmbWukpdiNPMzGxdONmYmVnl2k02km7Of8/ovnDMzKwVdTRAYCtJ\nHwP2l3Q56bpob4uIeyqNzMzMWkZHyeafgH8k3XjsrLplQfqxp5mZWUMdjUa7ErhS0j9GxGndGJOZ\nmbWYMld9Pk3S/qRbCwDcFhHXVhuWmZm1kjIX4vwh77wt9A+qDszMzFpHmQtx7kfbt4X+dpWBmZlZ\n6yj7O5vNCvOlbgttZmZW49tCm5lZ5Tp7W2jwbaHNzKyTSt0WOiKeAaZXHIuZmbUoXxvNzMwq52Rj\nZmaV6zDZSNpA0iPdFYyZmbWmDpNNRKwCHpU0vJviMTOzFlRmgMBAYK6ku4FXaoURsX9lUZmZWUsp\nk2z+sfIozMyspZX5nc3tkrYFRkXETZI2BjaoPjQzM2sVZS7EeTRwJXB+LtoGuKrKoMzMrLWUGfp8\nArAnsAIgIh4DtqwyKDMzay1lks0bEfGn2gNJvUl36jQzMyulTLK5XdK3gY0kfQr4FXBNo0qShkm6\nVdJDkuZKOimXD5I0Q9Jj+e/AQp1TJM2T9KikfQvlu0mak5edLUm5vK+kK3L5TEkjCnUm5Od4TNKE\nsjvEzMy6XplkMwlYAswBjgV+C/xDiXorgZMjYgdgD+AESTvk9m6OiFHAzfkxedmhwI7AeOBcSbWB\nCOcBRwOj8jQ+l08ElkfESODHwBm5rUHAZGB3YAwwuZjUzMysezVMNvmmaRcDpwHfAy6OiIbdaBHx\nTETck+dfAh4mDS44ILdH/ntgnj8AuDwi3oiI+cA8YIykrYD+EXFXft5L6urU2roSGJuPevYFZkTE\nsohYDsxgdYIyM7Nu1nDos6T9gH8H/ki6n812ko6NiOvLPknu3toVmAkMyVeRBngWGJLntwHuKlR7\nKpe9mefry2t1FgJExEpJLwKbF8vbqGNmZt2szI86zwT2iYh5AJK2B64DSiUbSZsCvwa+FhEr8ukW\nACIiJPXYYANJxwDHAAwf7ivymJlVpcw5m5dqiSZ7HHipTOOS+pASzS8i4r9y8XO5a4z8d3EuXwQM\nK1QfmssW5fn68jXq5FFyA4ClHbS1hoi4ICJGR8TowYMHl9kkMzNbC+0mG0mfl/R5YLak30o6Io/q\nugaY1ajhfO7kQuDhiDirsGg6UBsdNgG4ulB+aB5hth1pIMDductthaQ9cpuH19WptXUQcEs+r3MD\nME7SwDwwYFwuMzOzHtBRN9rnCvPPAZ/I80uAjUq0vSfwVWCOpPty2beB04FpkiYCTwCHAETEXEnT\ngIdII9lOyFedBjgeuCg/7/Ws7sK7ELhU0jxgGWk0GxGxTNJprE6Kp0bEshIxm5lZBdpNNhFx5Lo0\nHBF3kAYUtGVsO3WmAFPaKJ8N7NRG+evAwe20NRWYWjZeMzOrTpnRaNsBfwuMKK7vWwyYmVlZZUaj\nXUXqrroGeKvacMzMrBWVSTavR8TZlUdiZmYtq0yy+VdJk4EbgTdqhbWrA5iZmTVSJtl8iDSq7JOs\n7kaL/NjMzKyhMsnmYOB9xdsMmJmZdUaZKwg8CGxWdSBmZta6yhzZbAY8ImkWa56z8dBnMzMrpUyy\nmVx5FGZm1tIaJpuIuL07AjEzs9ZV5goCL5FGnwFsCPQBXomI/lUGZmZmraPMkU2/2ny+6vIBpNs8\nm5mZlVJmNNrbIrmKdNtlMzOzUsp0o32+8LAXMBp4vbKIzMys5ZQZjVa8r81KYAGpK83MzKyUMuds\n1um+NmZmZu0mG0n/1EG9iIjTKojHzMxaUEdHNq+0UbYJMBHYHHCyMTOzUjq6LfSZtXlJ/YCTgCOB\ny4Ez26tnZmZWr8NzNpIGAV8HvgxcDHwkIpZ3R2BmZtY6Ojpn8y/A54ELgA9FxMvdFpWZmbWUjn7U\neTKwNfAPwNOSVuTpJUkruic8MzNrBR2ds+nU1QXMzMza44RiZmaVc7IxM7PKOdmYmVnlnGzMzKxy\nTjZmZlY5JxszM6uck42ZmVXOycbMzCrnZGNmZpWrLNlImippsaQHC2WDJM2Q9Fj+O7Cw7BRJ8yQ9\nKmnfQvlukubkZWdLUi7vK+mKXD5T0ohCnQn5OR6TNKGqbTQzs3KqPLK5CBhfVzYJuDkiRgE358dI\n2gE4FNgx1zlX0ga5znnA0cCoPNXanAgsj4iRwI+BM3Jbg4DJwO7AGGByMamZmVn3qyzZRMTvgGV1\nxQeQblVA/ntgofzyiHgjIuYD84AxkrYC+kfEXRERwCV1dWptXQmMzUc9+wIzImJZvh3CDN6Z9MzM\nrBt19zmbIRHxTJ5/FhiS57cBFhbWeyqXbZPn68vXqBMRK4EXSXcQba8tMzPrIT02QCAfqURPPT+A\npGMkzZY0e8mSJT0ZiplZS+vuZPNc7hoj/12cyxcBwwrrDc1li/J8ffkadST1BgYASzto6x0i4oKI\nGB0RowcPHrwOm2VmZh3p7mQzHaiNDpsAXF0oPzSPMNuONBDg7tzltkLSHvl8zOF1dWptHQTcko+W\nbgDGSRqYBwaMy2VmZtZD2r152rqS9Etgb2ALSU+RRoidDkyTNBF4AjgEICLmSpoGPASsBE6IiFW5\nqeNJI9s2Aq7PE8CFwKWS5pEGIhya21om6TRgVl7v1IioH6hgZmbdqLJkExGHtbNobDvrTwGmtFE+\nG9ipjfLXgYPbaWsqMLV0sGZmVilfQcDMzCrnZGNmZpVzsjEzs8o52ZiZWeWcbMzMrHJONmZmVjkn\nGzMzq5yTjZmZVc7JxszMKlfZFQSsa4yYdF2Hyxecvl83RWJmtvZ8ZGNmZpVzsjEzs8o52ZiZWeWc\nbMzMrHJONmZmVjknGzMzq5yTjZmZVc7JxszMKudkY2ZmlXOyMTOzyjnZmJlZ5ZxszMysck42ZmZW\nOScbMzOrnJONmZlVzsnGzMwq52RjZmaVc7IxM7PKOdmYmVnlnGzMzKxyTjZmZlY5JxszM6uck42Z\nmVWupZONpPGSHpU0T9Kkno7HzGx91bunA6iKpA2Ac4BPAU8BsyRNj4iHejay7jdi0nUdLl9w+n7d\n0oaZrb9aNtkAY4B5EfE4gKTLgQOA9S7ZNINGyQoaJywnPLN3L0VET8dQCUkHAeMj4qj8+KvA7hHx\nN4V1jgGOyQ8/ADzaoNktgOfXMbR1baMZYmiWNpohhq5ooxliaJY2miGGZmmjGWIo08a2ETG4USOt\nfGTTUERcAFxQdn1JsyNi9Lo857q20QwxNEsbzRBDV7TRDDE0SxvNEEOztNEMMXRVG9DaAwQWAcMK\nj4fmMjMz62atnGxmAaMkbSdpQ+BQYHoPx2Rmtl5q2W60iFgp6W+AG4ANgKkRMXcdmy3d5VZhG80Q\nQ7O00QwxdEUbzRBDs7TRDDE0SxvNEENXtdG6AwTMzKx5tHI3mpmZNQkn
GzMzq5yTjZmZVc7JphMk\n7SXpnJLrjpS0Zxvle0ravuujMzNrXi07Gq2rSNoV+BJwMDAf+K+SVX8CnNJG+Yq87HNrGc8WwNLo\n5pEdkoYA2+SHiyLiuXdjG10RQ7PE0QxtdNX+tNbnZNMGSe8HDsvT88AVpJF7+3SimSERMae+MCLm\nSBpRMo49gNOBZcBpwKWkS0f0knR4RPx32WDW9kNB0i7AvwMDWP2j2KGSXgCOj4h73g1tdEUMzRJH\nM7TRVfszt/VB0nUL335/AtMj4uHuqN9KbTRDDO2KCE91E/AWcDswslD2eCfbeKyDZfNKtjEbGEc6\nqloO7JHLPwjcW7KNXYC7gIeBm/L0SC77SIn695GuKVdfvgdwf8kYeryNroihWeJohja6cH9+K7c1\nCfhKnibVyqqu30ptNEMMHba9LpVbdQIOBC4HFgI/A8YC8zvZxi+Bo9soPwq4omQb9xXmH65bVjbZ\nrOuHSlckzR5voytiaJY4mqGNLtyf/wf0aaN8w46eo6vqt1IbzRBDR5O70doQEVcBV0nahHQ4+TVg\nS0nnAb+JiBtLNPM14DeSvgz8IZeNJr1of1kylLcK86/Vh1myjU0iYmZ9YUTclbevkeslXQdcQkq+\nkK45dzgNwAL/AAAHAElEQVRQthuvGdroihiaJY5maKOr9udbwNbAE3XlW7Hm+7+q+q3URjPE0C5f\nQaAkSQNJ3VlfjIixnai3D7BTfjg3Im7pRN1VwCuAgI2AV2uLgPdERJ8SbZwNbE/bHwrzo3DLhQ7a\n+DRt9+H+thPb0uNtdEUMzRJHM7TRRTGMB34KPMbq9+dwYCTwN9HgvOS61m+lNpohhg7bdrJpfV31\nIWtWBUm9SDc7LL4/Z0XEqu6o30ptNEMM7bbrZGNrS9Ixke4J9K5uoytiaJY4mqGNrtqf1lr8o871\nWL5T6To10RVhNEEbXRFDV7TTDPuiK9rokv0p6dqerN9KbTRFDD6yWX9JOjYizi+xXleN3d8GmBkR\nLxfKx5ftB5Y0BoiImCVpB2A88MjadgdKuiQiDl+buoU29iJ1OTxYZuCIpN1JIwtXSNqINKz0I8BD\nwA8i4sUSbZxIGqiysNG6HbRRu8fT0xFxk6QvAR8jDZG/ICLeLNHG+4DPk84BriKNZLosIlasbVx1\n7W8VEc/0VP1WaqMpYnCyWX9JOjIift5gnW+Rftx6OfBULh5K+qC6PCJOL/E8JwInkD7IdgFOioir\n87J7IuIjJdqYDHya9EPkGcDuwK3Ap4AbImJKg/r1N84TsA9wC0BE7N8ohtzO3RExJs8fnbfrN6Tf\nQ13TaH9ImgvsHOl+SxeQBn1cSRpev3NEfL5EDC+SBo78kTTE/lcRsaRM/IU2fkHalxsDLwCbkq6O\nMRYgIo5oUP9E4LPA74DPAPfmdv6S9KPO2zoTj7VN0pYRsbin4+gS6zJu2tO7ewKeLLFOV4zdnwNs\nmudHkH6selJ+XPb3QnNIN8HbmHTJn/65fCPggRL17wH+E9gb+ET++0ye/0Qn9tm9hflZwOA8vwkw\np0T9h4sx1S27r2wMpC7wccCFwBLScOMJQL+SbTyQ//YGngM2yI9Vcn/OKdTZGLgtzw8v+5rm9QeQ\nrpLxCOlKGUtJX0pOBzZbx/f39SXX6w/8kHSFji/VLTu3ZBvvBc4DzgE2B76b99E0YKuSbQyqmzYH\nFgADgUEl6o+v268XAg8Al5GuaFImhk2BU4G5wIv5vXUXcMS6vBYR4XM2rU7SA+1Mc4AhJZqojbuv\n15lx970id51FxALSB/2nJZ1F+f79lRGxKiJeBf4YuasmIl4rGcdo0u+dvgO8GOmb92sRcXtE3F4y\nBkiXChooaXPSh+2SHMcrwMoS9R+UdGSev1/SaHj7EkkNu66yiIi3IuLGiJhIen3OJXUrPt6J7dgQ\n6EdKFgNyeV+g4ZD6rPY7vb6kDyki4slO1If0Ybwc2DsiBkXE5qQjzuV5WYckfaSdaTfSUXQZPye9\nD38NHCrp15L65mV7lGzjIlJX6ELSEfdrpCO+/yFd1qeM50nv0do0m9T1fE+eb+QHhfkzSV+mPkf6\nUtSwuzz7Bek9tC/wPeBs4KvAPpJ+0FHFhtY1W3lq7on0rXUXYNu6aQSpv75R/fHAPOB60u1hLyB9\ni55H4ZtUgzZuAXapK+tN+u3PqpJtzAQ2zvO9CuUDqDtCaNDOUOBXpN8SNDyya6P+gvzPOD//3SqX\nb0qJI5Mc70WkLrCZpATzOOnySDuXjKHdI4faPirRxt/l530COBG4mXS1jDnA5BL1TyJ9a/4Z6ajk\nyFw+GPhdJ/bno2uzrLDOqvz+urWN6bWSMdxX9/g7wO9JRxal3lusecT7ZEftd9DGyfl/60OFsvmd\n2Jf3tPecnYjh/rrHs/LfXqTzo536f1mjrXWp7Kn5J9Kh9F7tLLusZBu9SN/wvpCnPchdKCXrDwXe\n286yPUu20bed8i2K/5ydiGk/0gn5rtrPGwPbdWL9/sDOwG6U7OIo1H1/F8W8NbB1nt8MOAgY04n6\nO+Y6H1yHGG4EvlncB6Qj7m8BN5Wo/yAwqp1lC0vG8DCFLzC57AhSV9ITJdu4vzD//bplDbtXC+vW\nvgydRTrqLH1NRtI51a/npDWffE4+L2vYNZrXu7P2eQHsTzofWlvWMPl3NHmAgJn1mHxljkmk0Y5b\n5uLngOnA6RGxvEH9g0gf5o+2sezASJeeahTDPwM3RsRNdeXjgX+LiFEl2jgV+OcojLTM5SPzdhzU\nqI26evsD3wZGRMR7S9aZXFd0bkQskfTeHFvDkZeSPgz8BzCKlGz/KiL+T9Jg4LCIOLsz27FG2042\nZtaMyoyWrLJ+T7eRh8ZvHxEPtsK+cLIxs6Yk6cmIGN5T9VupjWaIwVd9NrMeI+mB9hZRYrTkutZv\npTaaIYaOONmYWU8aQhpmW39uRqST1VXXb6U2miGGdjnZmFlPupb0g9/76hdIuq0b6rdSG80QQ7t8\nzsbMzCrnKwiYmVnlnGzMzKxyTjZmZlY5JxszM6uck42ZmVXu/wNCP2yz4fPUrgAAAABJRU5ErkJg\ngg==\n", 188 | "text/plain": [ 189 | "" 190 | ] 191 | }, 192 | "metadata": {}, 193 | "output_type": "display_data" 194 | } 195 | ], 196 | "source": [ 197 | "# Now, let's make a bar plot showing the different number of \n", 198 | "# borrowers for each of the values of the mixed variable\n", 199 | "\n", 200 | "fig = data.open_il_24m.value_counts().plot.bar()\n", 201 | "fig.set_title('Number of installment 
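"# note: value_counts() sorts the categories by frequency, so the letter codes appear alongside the numeric values in the plot\n",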
 202 | "fig.set_ylabel('Number of borrowers')" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": { 208 | "collapsed": true 209 | }, 210 | "source": [ 211 | "This is what a mixed variable looks like!" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "collapsed": true 218 | }, 219 | "source": [ 220 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 221 | ] 222 | } 223 | ], 224 | "metadata": { 225 | "kernelspec": { 226 | "display_name": "Python 3", 227 | "language": "python", 228 | "name": "python3" 229 | }, 230 | "language_info": { 231 | "codemirror_mode": { 232 | "name": "ipython", 233 | "version": 3 234 | }, 235 | "file_extension": ".py", 236 | "mimetype": "text/x-python", 237 | "name": "python", 238 | "nbconvert_exporter": "python", 239 | "pygments_lexer": "ipython3", 240 | "version": "3.6.1" 241 | }, 242 | "toc": { 243 | "nav_menu": {}, 244 | "number_sections": true, 245 | "sideBar": true, 246 | "skip_h1_title": false, 247 | "toc_cell": false, 248 | "toc_position": { 249 | "height": "550px", 250 | "left": "0px", 251 | "right": "869.4px", 252 | "top": "107px", 253 | "width": "151px" 254 | }, 255 | "toc_section_display": "block", 256 | "toc_window_display": true 257 | } 258 | }, 259 | "nbformat": 4, 260 | "nbformat_minor": 1 261 | } 262 | -------------------------------------------------------------------------------- /02.5_Bonus_Learn_more_on_Lending_Club_dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Bonus: Learn more about the Lending Club dataset\n", 8 | "\n", 9 | "Visit the following Jupyter notebooks on Kaggle to learn more about the Lending Club dataset and how different Kagglers approach data analysis.
\n", 10 | "\n", 11 | "- [Initial Loan Book Analysis](https://www.kaggle.com/erykwalczak/initial-loan-book-analysis)\n", 12 | "- [Python for Padawans](https://www.kaggle.com/evanmiller/python-for-padawans)\n", 13 | "- [Loan Book Initial Exploration](https://www.kaggle.com/solegalli/loan-book-initial-exploration/)" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": { 20 | "collapsed": true 21 | }, 22 | "outputs": [], 23 | "source": [] 24 | } 25 | ], 26 | "metadata": { 27 | "kernelspec": { 28 | "display_name": "Python 3", 29 | "language": "python", 30 | "name": "python3" 31 | }, 32 | "language_info": { 33 | "codemirror_mode": { 34 | "name": "ipython", 35 | "version": 3 36 | }, 37 | "file_extension": ".py", 38 | "mimetype": "text/x-python", 39 | "name": "python", 40 | "nbconvert_exporter": "python", 41 | "pygments_lexer": "ipython3", 42 | "version": "3.6.1" 43 | }, 44 | "toc": { 45 | "nav_menu": {}, 46 | "number_sections": true, 47 | "sideBar": true, 48 | "skip_h1_title": false, 49 | "toc_cell": false, 50 | "toc_position": {}, 51 | "toc_section_display": "block", 52 | "toc_window_display": false 53 | } 54 | }, 55 | "nbformat": 4, 56 | "nbformat_minor": 2 57 | } 58 | -------------------------------------------------------------------------------- /03.6_Bonus_Additional_reading_resources_on_variable_problems.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Bonus: Additional reading resources on variable problems\n", 8 | "\n", 9 | "### Further reading on missing data:\n", 10 | "\n", 11 | "- [Missing data imputation, chapter 25](http://www.stat.columbia.edu/~gelman/arm/missing.pdf).\n", 12 | "\n", 13 | "### Further reading on outliers:\n", 14 | "\n", 15 | "- [Why is AdaBoost algorithm sensitive to noisy data and outliers?](https://www.quora.com/Why-is-AdaBoost-algorithm-sensitive-to-noisy-data-and-outliers-And-how)\n", 16 | "- [Why is Logistic Regression robust to outliers compared to least squares?](https://www.quora.com/Why-is-logistic-regression-considered-robust-to-outliers-compared-to-a-least-square-method)\n", 17 | "- [Can Logistic Regression be considered robust to outliers?](https://www.quora.com/Can-Logistic-Regression-be-considered-robust-to-outliers)\n", 18 | "- [The Effects of Outlier Data on Neural Networks Performance](https://www.researchgate.net/profile/Azme_Khamis/publication/26568300_The_Effects_of_Outliers_Data_on_Neural_Network_Performance/links/564802c908ae54697fbc10de/The-Effects-of-Outliers-Data-on-Neural-Network-Performance.pdf)\n", 19 | "- [Outlier Analysis by C. 
Aggarwal](http://charuaggarwal.net/outlierbook.pdf)\n", 20 | "\n", 21 | "### Overview on Variable Problems\n", 22 | "\n", 23 | "- [Identifying common Data Mining Mistakes by SAS](https://www.mwsug.org/proceedings/2007/saspres/MWSUG-2007-SAS01.pdf)\n", 24 | "\n", 25 | "### Overview and Comparison of Machine Learning Algorithms\n", 26 | "\n", 27 | "- [Top 10 Machine Learning Algorithms](https://www.dezyre.com/article/top-10-machine-learning-algorithms/202)\n", 28 | "- [Choosing the Right Algorithm for Machine Learning](http://www.dummies.com/programming/big-data/data-science/choosing-right-algorithm-machine-learning/)\n", 29 | "- [Why does Gradient boosting work so well for so many Kaggle problems?](https://www.quora.com/Why-does-Gradient-boosting-work-so-well-for-so-many-Kaggle-problems)\n", 30 | "\n", 31 | "\n", 32 | "### Kaggle kernels for the Titanic Dataset\n", 33 | "\n", 34 | "- [Exploring Survival on the Titanic](https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic)\n", 35 | "- [Titanic Data Science Solutions](https://www.kaggle.com/startupsci/titanic-data-science-solutions/notebook)\n", 36 | "\n", 37 | "### Kaggle kernels for the Mercedes Benz Dataset\n", 38 | "\n", 39 | "- [Sherlock's Exploration {Season 01} - Categorical](https://www.kaggle.com/remidi/sherlock-s-exploration-season-01-categorical)\n", 40 | "- [Mercedes Benz Data Exploration](https://www.kaggle.com/aditya1702/mercedes-benz-data-exploration)\n", 41 | "- [Simple Exploration Notebook](https://www.kaggle.com/pcjimmmy/simple-exploration-notebook-mercedes)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [] 52 | } 53 | ], 54 | "metadata": { 55 | "kernelspec": { 56 | "display_name": "Python 3", 57 | "language": "python", 58 | "name": "python3" 59 | }, 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 3 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython3", 70 | "version": "3.6.1" 71 | }, 72 | "toc": { 73 | "nav_menu": {}, 74 | "number_sections": true, 75 | "sideBar": true, 76 | "skip_h1_title": false, 77 | "toc_cell": false, 78 | "toc_position": {}, 79 | "toc_section_display": "block", 80 | "toc_window_display": false 81 | } 82 | }, 83 | "nbformat": 4, 84 | "nbformat_minor": 2 85 | } 86 | -------------------------------------------------------------------------------- /04.1_Variable_magnitude.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Variable magnitude\n", 8 | "\n", 9 | "### Does the magnitude of the variable matter?\n", 10 | "\n", 11 | "In Linear Regression models, the scale of variables used to estimate the output matters. Linear models are of the type **y = w x + b**, where the regression coefficient w represents the expected change in y for a one unit change in x (the predictor). Thus, the magnitude of **w** is partly determined by the magnitude of the units being used for **x**. 
If x is a distance variable, just changing the scale from kilometers to miles will change the magnitude of the coefficient: the same model expressed in miles has a coefficient roughly 1.609 times bigger (since one mile is about 1.609 km), even though the predictions are identical.\n", 12 | "\n", 13 | "In addition, in situations where we estimate the outcome y from multiple predictors x1, x2, ...xn, predictors with greater numeric ranges dominate over those with smaller numeric ranges.\n", 14 | "\n", 15 | "Gradient descent converges faster when all the predictors (x1 to xn) are within a similar scale, therefore making feature scaling useful for Neural Networks as well as Logistic Regression.\n", 16 | "\n", 17 | "In Support Vector Machines, feature scaling can decrease the time to find the support vectors.\n", 18 | "\n", 19 | "Finally, methods using Euclidean distances, or distances in general, are also affected by the magnitude of the features, as Euclidean distance is sensitive to variations in the magnitude or scales of the predictors. Therefore, feature scaling is required for methods that utilise distance calculations, like k-nearest neighbours (KNN) and k-means clustering.\n", 20 | "\n", 21 | "For more details on the above, follow the links in the Bonus Lecture of this section.\n", 22 | "\n", 23 | "In summary:\n", 24 | "\n", 25 | "#### Magnitude matters because:\n", 26 | "\n", 27 | "- The regression coefficient is directly influenced by the scale of the variable\n", 28 | "- Variables with bigger magnitudes / value ranges dominate over the ones with smaller magnitudes / value ranges\n", 29 | "- Gradient descent converges faster when features are on similar scales\n", 30 | "- Feature scaling helps decrease the time to find support vectors for SVMs\n", 31 | "- Euclidean distances are sensitive to feature magnitude.\n", 32 | "\n", 33 | "#### The machine learning models affected by the magnitude of the feature are:\n", 34 | "\n", 35 | "- Linear and Logistic Regression\n", 36 | "- Neural Networks\n", 37 | "- Support Vector Machines\n", 38 | "- KNN\n", 39 | "- K-means clustering\n", 40 | "- Linear Discriminant Analysis (LDA)\n", 41 | "- Principal Component Analysis (PCA)\n", 42 | "\n", 43 | "#### Machine learning models insensitive to feature magnitude are the ones based on Trees:\n", 44 | "\n", 45 | "- Classification and Regression Trees\n", 46 | "- Random Forests\n", 47 | "- Gradient Boosted Trees\n", 48 | "\n", 49 | "\n", 50 | "**For more information on whether and why you should scale features prior to using them in machine learning models, refer to the lecture \"Bonus: Additional reading resources on variable problems\" in the previous section of this course.** \n", 51 | "\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "===================================================================================================\n", 59 | "\n", 60 | "## Real Life example: \n", 61 | "\n", 62 | "### Predicting Survival on the Titanic: understanding societal behaviour and beliefs\n", 63 | "\n", 64 | "In one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes like gender, age, and social status, we can make very accurate predictions about which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper class.
Therefore, we can learn about society's priorities and privileges at the time.\n", 65 | "\n", 66 | "====================================================================================================\n", 67 | "\n", 68 | "To download the Titanic data, go to this website:\n", 69 | "https://www.kaggle.com/c/titanic/data\n", 70 | "\n", 71 | "Click on the link 'train.csv', and then click the blue 'download' button towards the right of the screen to download the dataset. Save it in a folder of your choice.\n", 72 | "\n", 73 | "**Note that you need to be logged in to Kaggle in order to download the datasets**.\n", 74 | "\n", 75 | "If you save it in the same directory from which you are running this notebook, and you rename the file to 'titanic.csv', then you can load it the same way I will load it below.\n", 76 | "\n", 77 | "====================================================================================================\n", 78 | "\n", 79 | "In this notebook, I will demonstrate the effect of feature magnitude on the performance of different machine learning algorithms." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 1, 85 | "metadata": { 86 | "collapsed": true 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "import pandas as pd\n", 91 | "import numpy as np\n", 92 | "\n", 93 | "import matplotlib.pyplot as plt\n", 94 | "%matplotlib inline\n", 95 | "\n", 96 | "from sklearn.linear_model import LogisticRegression\n", 97 | "from sklearn.ensemble import AdaBoostClassifier\n", 98 | "from sklearn.ensemble import RandomForestClassifier\n", 99 | "from sklearn.svm import SVC\n", 100 | "from sklearn.neural_network import MLPClassifier\n", 101 | "from sklearn.neighbors import KNeighborsClassifier\n", 102 | "\n", 103 | "from sklearn.preprocessing import MinMaxScaler\n", 104 | "\n", 105 | "from sklearn.metrics import roc_auc_score\n", 106 | "from sklearn.model_selection import train_test_split" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "### Load data with numerical variables only" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 2, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/html": [ 124 | "
\n", 125 | "\n", 138 | "\n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | "
SurvivedPclassAgeFare
00322.07.2500
11138.071.2833
21326.07.9250
31135.053.1000
40335.08.0500
\n", 186 | "
" 187 | ], 188 | "text/plain": [ 189 | " Survived Pclass Age Fare\n", 190 | "0 0 3 22.0 7.2500\n", 191 | "1 1 1 38.0 71.2833\n", 192 | "2 1 3 26.0 7.9250\n", 193 | "3 1 1 35.0 53.1000\n", 194 | "4 0 3 35.0 8.0500" 195 | ] 196 | }, 197 | "execution_count": 2, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "# load the numerical variables of the Titanic Dataset\n", 204 | "data = pd.read_csv('titanic.csv', usecols = ['Pclass', 'Age', 'Fare', 'Survived'])\n", 205 | "data.head()" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 3, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/html": [ 216 | "
\n", 217 | "\n", 230 | "\n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | "
SurvivedPclassAgeFare
count891.000000891.000000714.000000891.000000
mean0.3838382.30864229.69911832.204208
std0.4865920.83607114.52649749.693429
min0.0000001.0000000.4200000.000000
25%0.0000002.00000020.1250007.910400
50%0.0000003.00000028.00000014.454200
75%1.0000003.00000038.00000031.000000
max1.0000003.00000080.000000512.329200
\n", 299 | "
" 300 | ], 301 | "text/plain": [ 302 | " Survived Pclass Age Fare\n", 303 | "count 891.000000 891.000000 714.000000 891.000000\n", 304 | "mean 0.383838 2.308642 29.699118 32.204208\n", 305 | "std 0.486592 0.836071 14.526497 49.693429\n", 306 | "min 0.000000 1.000000 0.420000 0.000000\n", 307 | "25% 0.000000 2.000000 20.125000 7.910400\n", 308 | "50% 0.000000 3.000000 28.000000 14.454200\n", 309 | "75% 1.000000 3.000000 38.000000 31.000000\n", 310 | "max 1.000000 3.000000 80.000000 512.329200" 311 | ] 312 | }, 313 | "execution_count": 3, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "# let's have a look at the values of those variables to get an idea of the magnitudes\n", 320 | "\n", 321 | "data.describe()" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 4, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "Pclass _range: 2\n", 334 | "Age _range: 79.58\n", 335 | "Fare _range: 512.3292\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "# let's now calculate the range\n", 341 | "\n", 342 | "for col in ['Pclass', 'Age', 'Fare']:\n", 343 | " print(col, '_range: ', data[col].max()-data[col].min())" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "The magnitude of the values of the 3 different variables and their ranges of values are quite different. Therefore, feature scaling could benefit the performance of several machine learning algorithms." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 5, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "data": { 360 | "text/plain": [ 361 | "((623, 3), (268, 3))" 362 | ] 363 | }, 364 | "execution_count": 5, 365 | "metadata": {}, 366 | "output_type": "execute_result" 367 | } 368 | ], 369 | "source": [ 370 | "# let's separate into training and testing set\n", 371 | "X_train, X_test, y_train, y_test = train_test_split(\n", 372 | " data[['Pclass', 'Age', 'Fare']].fillna(0),\n", 373 | " data.Survived,\n", 374 | " test_size=0.3,\n", 375 | " random_state=0)\n", 376 | "\n", 377 | "X_train.shape, X_test.shape" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "### Feature Scaling" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "For this demonstration, I will scale the features between 0 and 1. \n", 392 | "To learn more about this scaling visit the scikit-learn website: \n", 393 | "http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html\n", 394 | "\n", 395 | "Briefly, the transformation is given by:\n", 396 | "\n", 397 | "X_std = (X - X.min() / (X.max - X.min())\n", 398 | "\n", 399 | "And to transform the scaled feature back to its initial format:\n", 400 | "\n", 401 | "X_scaled = X_std * (max - min) + min\n", 402 | "\n", 403 | "\n", 404 | "**We'll see more on feature scaling in future lectures**" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 6, 410 | "metadata": { 411 | "collapsed": true 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "# scaling the features between 0 and 1. 
\n", 416 | "\n", 417 | "scaler = MinMaxScaler()\n", 418 | "X_train_scaled = scaler.fit_transform(X_train)\n", 419 | "X_test_scaled = scaler.transform(X_test)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 7, 425 | "metadata": {}, 426 | "outputs": [ 427 | { 428 | "name": "stdout", 429 | "output_type": "stream", 430 | "text": [ 431 | "Mean: [ 0.64365971 0.30131421 0.06335433]\n", 432 | "Standard Deviation: [ 0.41999093 0.21983527 0.09411705]\n", 433 | "Minimum value: [ 0. 0. 0.]\n", 434 | "Maximum value: [ 1. 1. 1.]\n" 435 | ] 436 | } 437 | ], 438 | "source": [ 439 | "#let's have a look at the scaled training dataset\n", 440 | "print('Mean: ', X_train_scaled.mean(axis=0))\n", 441 | "print('Standard Deviation: ', X_train_scaled.std(axis=0))\n", 442 | "print('Minimum value: ', X_train_scaled.min(axis=0))\n", 443 | "print('Maximum value: ', X_train_scaled.max(axis=0))" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "### Logistic Regression" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 8, 456 | "metadata": {}, 457 | "outputs": [ 458 | { 459 | "name": "stdout", 460 | "output_type": "stream", 461 | "text": [ 462 | "Train set\n", 463 | "Logistic Regression roc-auc: 0.7134823539619531\n", 464 | "Test set\n", 465 | "Logistic Regression roc-auc: 0.7080952380952381\n" 466 | ] 467 | } 468 | ], 469 | "source": [ 470 | "# model build on unscaled variables\n", 471 | "\n", 472 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 473 | "logit.fit(X_train, y_train)\n", 474 | "print('Train set')\n", 475 | "pred = logit.predict_proba(X_train)\n", 476 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 477 | "print('Test set')\n", 478 | "pred = logit.predict_proba(X_test)\n", 479 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": 9, 485 | "metadata": {}, 486 | "outputs": [ 487 | { 488 | "data": { 489 | "text/plain": [ 490 | "array([[-0.92574443, -0.01822382, 0.00233696]])" 491 | ] 492 | }, 493 | "execution_count": 9, 494 | "metadata": {}, 495 | "output_type": "execute_result" 496 | } 497 | ], 498 | "source": [ 499 | "logit.coef_" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 10, 505 | "metadata": {}, 506 | "outputs": [ 507 | { 508 | "name": "stdout", 509 | "output_type": "stream", 510 | "text": [ 511 | "Train set\n", 512 | "Logistic Regression roc-auc: 0.7134931997136722\n", 513 | "Test set\n", 514 | "Logistic Regression roc-auc: 0.7080952380952381\n" 515 | ] 516 | } 517 | ], 518 | "source": [ 519 | "# model built on scaled variables\n", 520 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 521 | "logit.fit(X_train_scaled, y_train)\n", 522 | "print('Train set')\n", 523 | "pred = logit.predict_proba(X_train_scaled)\n", 524 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 525 | "print('Test set')\n", 526 | "pred = logit.predict_proba(X_test_scaled)\n", 527 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 11, 533 | "metadata": {}, 534 | "outputs": [ 535 | { 536 | "data": { 537 | "text/plain": [ 538 | "array([[-1.85168371, -1.45774407, 1.1951952 ]])" 539 | ] 540 | }, 541 | "execution_count": 
11, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "logit.coef_" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "We observe that the performance of logistic regression did not change when using the datasets with the features scaled (compare roc-auc values for train and test set for models with and without feature scaling). \n", 555 | "\n", 556 | "However, when looking at the coefficients we do see a big difference in the values. This is because the magnitude of the variable was affecting the coefficients. After scaling, all 3 variables have a roughly similar effect (coefficient) on survival, whereas before scaling, we would be inclined to think that Pclass was driving the Survival outcome." 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "### Support Vector Machines" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 12, 569 | "metadata": {}, 570 | "outputs": [ 571 | { 572 | "name": "stdout", 573 | "output_type": "stream", 574 | "text": [ 575 | "Train set\n", 576 | "SVM roc-auc: 0.9016995292943752\n", 577 | "Test set\n", 578 | "SVM roc-auc: 0.6768154761904762\n" 579 | ] 580 | } 581 | ], 582 | "source": [ 583 | "# model built on unscaled features\n", 584 | "\n", 585 | "SVM_model = SVC(random_state=44, probability=True)\n", 586 | "SVM_model.fit(X_train, y_train)\n", 587 | "print('Train set')\n", 588 | "pred = SVM_model.predict_proba(X_train)\n", 589 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 590 | "print('Test set')\n", 591 | "pred = SVM_model.predict_proba(X_test)\n", 592 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": 13, 598 | "metadata": {}, 599 | "outputs": [ 600 | { 601 | "name": "stdout", 602 | "output_type": "stream", 603 | "text": [ 604 | "Train set\n", 605 | "SVM roc-auc: 0.7047081408212403\n", 606 | "Test set\n", 607 | "SVM roc-auc: 0.6988690476190477\n" 608 | ] 609 | } 610 | ], 611 | "source": [ "# model built on scaled features\n", "\n", 612 | "SVM_model = SVC(random_state=44, probability=True)\n", 613 | "SVM_model.fit(X_train_scaled, y_train)\n", 614 | "print('Train set')\n", 615 | "pred = SVM_model.predict_proba(X_train_scaled)\n", 616 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 617 | "print('Test set')\n", 618 | "pred = SVM_model.predict_proba(X_test_scaled)\n", 619 | "print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "Feature scaling improved the performance of the support vector machine. After feature scaling the model is no longer over-fitting to the training set (compare the roc-auc of 0.901 for the model on unscaled features vs the roc-auc of 0.704). In addition, the roc-auc for the testing set increased as well (0.67 vs 0.69)."
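, "\n", "This makes sense: by default, SVC uses an RBF kernel, which is a function of the distance between observations, exp(-gamma * ||x - x'||^2), so features with larger numeric ranges dominate the distance unless the data is scaled."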
627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": {}, 632 | "source": [ 633 | "### Neural Networks" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 14, 639 | "metadata": {}, 640 | "outputs": [ 641 | { 642 | "name": "stdout", 643 | "output_type": "stream", 644 | "text": [ 645 | "Train set\n", 646 | "Neural Network roc-auc: 0.678190277868159\n", 647 | "Test set\n", 648 | "Neural Network roc-auc: 0.666547619047619\n" 649 | ] 650 | } 651 | ], 652 | "source": [ 653 | "# model built on unscaled features\n", 654 | "\n", 655 | "NN_model = MLPClassifier(random_state=44, solver='sgd')\n", 656 | "NN_model.fit(X_train, y_train)\n", 657 | "print('Train set')\n", 658 | "pred = NN_model.predict_proba(X_train)\n", 659 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 660 | "print('Test set')\n", 661 | "pred = NN_model.predict_proba(X_test)\n", 662 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 15, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "name": "stdout", 672 | "output_type": "stream", 673 | "text": [ 674 | "Train set\n", 675 | "Neural Network roc-auc: 0.7161937918917161\n", 676 | "Test set\n", 677 | "Neural Network roc-auc: 0.7125\n" 678 | ] 679 | } 680 | ], 681 | "source": [ 682 | "# model built on scaled features\n", 683 | "\n", 684 | "NN_model = MLPClassifier(random_state=44, solver='sgd')\n", 685 | "NN_model.fit(X_train_scaled, y_train)\n", 686 | "print('Train set')\n", 687 | "pred = NN_model.predict_proba(X_train_scaled)\n", 688 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 689 | "print('Test set')\n", 690 | "pred = NN_model.predict_proba(X_test_scaled)\n", 691 | "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "We observe that scaling the features improved the performance of the neural network both for the training and the testing set (compare roc-auc values: training 0.67 vs 0.71; testing: 0.66 vs 0.71). The roc-auc increases in both training and testing sets when the model is trained on a dataset with scaled features." 
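, "\n", "A likely reason: the MLP here uses the 'sgd' solver, and, as discussed at the start of this notebook, gradient descent converges faster when all the predictors are on a similar scale."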
699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "### K-Nearest Neighbours" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 16, 711 | "metadata": {}, 712 | "outputs": [ 713 | { 714 | "name": "stdout", 715 | "output_type": "stream", 716 | "text": [ 717 | "Train set\n", 718 | "KNN roc-auc: 0.8694225721784778\n", 719 | "Test set\n", 720 | "KNN roc-auc: 0.6253571428571428\n" 721 | ] 722 | } 723 | ], 724 | "source": [ 725 | "# model built on unscaled features\n", 726 | "\n", 727 | "KNN = KNeighborsClassifier(n_neighbors=3)\n", 728 | "KNN.fit(X_train, y_train)\n", 729 | "print('Train set')\n", 730 | "pred = KNN.predict_proba(X_train)\n", 731 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 732 | "print('Test set')\n", 733 | "pred = KNN.predict_proba(X_test)\n", 734 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 17, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "name": "stdout", 744 | "output_type": "stream", 745 | "text": [ 746 | "Train set\n", 747 | "KNN roc-auc: 0.8880555736318084\n", 748 | "Test set\n", 749 | "KNN roc-auc: 0.7017559523809525\n" 750 | ] 751 | } 752 | ], 753 | "source": [ 754 | "# model built on scaled features\n", 755 | "\n", 756 | "KNN = KNeighborsClassifier(n_neighbors=3)\n", 757 | "KNN.fit(X_train_scaled, y_train)\n", 758 | "print('Train set')\n", 759 | "pred = KNN.predict_proba(X_train_scaled)\n", 760 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 761 | "print('Test set')\n", 762 | "pred = KNN.predict_proba(X_test_scaled)\n", 763 | "print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 764 | ] 765 | }, 766 | { 767 | "cell_type": "markdown", 768 | "metadata": {}, 769 | "source": [ 770 | "We observe for KNN as well that feature scaling improved the performance of the model. The model built on scaled features generalises better, with a higher roc-auc for the testing set (0.70, vs 0.62 for the model built on unscaled features).\n", 771 | "\n", 772 | "Both KNN models are over-fitting to the train set. Thus, we would need to change the parameters of the model or use fewer features to try to decrease the over-fitting, which is beyond the scope of this demonstration."
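, "\n", "Recall the ranges computed earlier: Fare spans roughly 512 units while Pclass spans only 2, so without scaling, the Euclidean distance that KNN relies on is driven almost entirely by Fare."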
773 | ] 774 | }, 775 | { 776 | "cell_type": "markdown", 777 | "metadata": {}, 778 | "source": [ 779 | "### Random Forests" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": 18, 785 | "metadata": {}, 786 | "outputs": [ 787 | { 788 | "name": "stdout", 789 | "output_type": "stream", 790 | "text": [ 791 | "Train set\n", 792 | "Random Forests roc-auc: 0.9914589705212468\n", 793 | "Test set\n", 794 | "Random Forests roc-auc: 0.7602678571428572\n" 795 | ] 796 | } 797 | ], 798 | "source": [ 799 | "# model built on unscaled features\n", 800 | "\n", 801 | "rf = RandomForestClassifier(n_estimators=700, random_state=39)\n", 802 | "rf.fit(X_train, y_train)\n", 803 | "print('Train set')\n", 804 | "pred = rf.predict_proba(X_train)\n", 805 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 806 | "print('Test set')\n", 807 | "pred = rf.predict_proba(X_test)\n", 808 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": 19, 814 | "metadata": {}, 815 | "outputs": [ 816 | { 817 | "name": "stdout", 818 | "output_type": "stream", 819 | "text": [ 820 | "Train set\n", 821 | "Random Forests roc-auc: 0.9914589705212469\n", 822 | "Test set\n", 823 | "Random Forests roc-auc: 0.7602380952380952\n" 824 | ] 825 | } 826 | ], 827 | "source": [ 828 | "# model built on scaled features\n", 829 | "rf = RandomForestClassifier(n_estimators=700, random_state=39)\n", 830 | "rf.fit(X_train_scaled, y_train)\n", 831 | "print('Train set')\n", 832 | "pred = rf.predict_proba(X_train_scaled)\n", 833 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 834 | "print('Test set')\n", 835 | "pred = rf.predict_proba(X_test_scaled)\n", 836 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": {}, 842 | "source": [ 843 | "As expected, Random Forests shows no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features." 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": 20, 849 | "metadata": {}, 850 | "outputs": [ 851 | { 852 | "name": "stdout", 853 | "output_type": "stream", 854 | "text": [ 855 | "Train set\n", 856 | "AdaBoost roc-auc: 0.847736491616234\n", 857 | "Test set\n", 858 | "AdaBoost roc-auc: 0.7733630952380953\n" 859 | ] 860 | } 861 | ], 862 | "source": [ "# model built on unscaled features\n", "\n", 863 | "ada = AdaBoostClassifier(n_estimators=200, random_state=44)\n", 864 | "ada.fit(X_train, y_train)\n", 865 | "print('Train set')\n", 866 | "pred = ada.predict_proba(X_train)\n", 867 | "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 868 | "print('Test set')\n", 869 | "pred = ada.predict_proba(X_test)\n", 870 | "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 21, 876 | "metadata": {}, 877 | "outputs": [ 878 | { 879 | "name": "stdout", 880 | "output_type": "stream", 881 | "text": [ 882 | "Train set\n", 883 | "AdaBoost roc-auc: 0.847736491616234\n", 884 | "Test set\n", 885 | "AdaBoost roc-auc: 0.7733630952380953\n" 886 | ] 887 | } 888 | ], 889 | "source": [ "# model built on scaled features\n", "\n", 890 | "ada = AdaBoostClassifier(n_estimators=200, random_state=44)\n", 891 | "ada.fit(X_train_scaled, y_train)\n", 892 | "print('Train set')\n", 893 | "pred = ada.predict_proba(X_train_scaled)\n", 894 | "print('AdaBoost roc-auc: 
{}'.format(roc_auc_score(y_train, pred[:,1])))\n", 895 | "print('Test set')\n", 896 | "pred = ada.predict_proba(X_test_scaled)\n", 897 | "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "As expected, AdaBoost shows no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features." 905 | ] 906 | }, 907 | { 908 | "cell_type": "markdown", 909 | "metadata": { 910 | "collapsed": true 911 | }, 912 | "source": [ 913 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 914 | ] 915 | } 916 | ], 917 | "metadata": { 918 | "kernelspec": { 919 | "display_name": "Python 3", 920 | "language": "python", 921 | "name": "python3" 922 | }, 923 | "language_info": { 924 | "codemirror_mode": { 925 | "name": "ipython", 926 | "version": 3 927 | }, 928 | "file_extension": ".py", 929 | "mimetype": "text/x-python", 930 | "name": "python", 931 | "nbconvert_exporter": "python", 932 | "pygments_lexer": "ipython3", 933 | "version": "3.6.1" 934 | }, 935 | "toc": { 936 | "nav_menu": {}, 937 | "number_sections": true, 938 | "sideBar": true, 939 | "skip_h1_title": false, 940 | "toc_cell": false, 941 | "toc_position": {}, 942 | "toc_section_display": "block", 943 | "toc_window_display": true 944 | } 945 | }, 946 | "nbformat": 4, 947 | "nbformat_minor": 2 948 | } 949 | -------------------------------------------------------------------------------- /04.4_Bonus_Additional_reading_resources.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Bonus: Additional reading resources\n", 8 | "\n", 9 | "### Further reading on the importance of feature scaling\n", 10 | "\n", 11 | "- [Should I normalize/standardize/rescale the data?](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)\n", 12 | "- [Efficient BackProp by Yann LeCun](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)\n", 13 | "- [Standardization and Its Effects on K-Means Clustering Algorithm](http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf)\n", 14 | "- [Feature scaling in support vector data description](http://rduin.nl/papers/asci_02_occ.pdf)\n", 15 | "\n", 16 | "### Further reading on linear and non-linear models\n", 17 | "\n", 18 | "- [An introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf)\n", 19 | "- [Elements of Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf)\n", 20 | "\n", 21 | "\n", 22 | "### More on Q-Q Plots\n", 23 | "\n", 24 | "- [Q-Q Plots in wiki](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot)\n", 25 | "- [How to interpret a Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot)\n", 26 | "- [Q-Q Plot visualisation tool](https://xiongge.shinyapps.io/QQplots/)\n", 27 | "\n", 28 | "### Kaggle kernels on the Sale Price dataset\n", 29 | "\n", 30 | "- [Comprehensive Data Exploration](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [] 41 | } 42 | ], 43 | "metadata": { 44 | "kernelspec": { 45 | "display_name": "Python 3", 46 | "language": "python", 47 | "name": "python3" 48 | }, 49 | "language_info": { 50 | 
"codemirror_mode": { 51 | "name": "ipython", 52 | "version": 3 53 | }, 54 | "file_extension": ".py", 55 | "mimetype": "text/x-python", 56 | "name": "python", 57 | "nbconvert_exporter": "python", 58 | "pygments_lexer": "ipython3", 59 | "version": "3.6.1" 60 | }, 61 | "toc": { 62 | "nav_menu": {}, 63 | "number_sections": true, 64 | "sideBar": true, 65 | "skip_h1_title": false, 66 | "toc_cell": false, 67 | "toc_position": {}, 68 | "toc_section_display": "block", 69 | "toc_window_display": false 70 | } 71 | }, 72 | "nbformat": 4, 73 | "nbformat_minor": 2 74 | } 75 | -------------------------------------------------------------------------------- /05.6_Arbitrary_value_imputation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Arbitrary value imputation\n", 8 | "\n", 9 | "Replacing the NA by artitrary values should be used when there are reasons to believe that the NA are not missing at random. In situations like this, we would not like to replace with the median or the mean, and therefore make the NA look like the majority of our observations.\n", 10 | "\n", 11 | "Instead, we want to flag them. We want to capture the missingness somehow.\n", 12 | "\n", 13 | "In previous lectures we saw 2 methods to do this:\n", 14 | "\n", 15 | "1) adding an additional binary variable to indicate whether the value is missing (1) or not (0)\n", 16 | "\n", 17 | "2) replacing the NA by a value at a far end of the distribution\n", 18 | "\n", 19 | "Here, I suggest an alternative to option 2, which I have seen in several Kaggle competitions. It consists of replacing the NA by an arbitrary value. Any of your creation, but ideally different from the median/mean/mode, and not within the normal values of the variable.\n", 20 | "\n", 21 | "The problem consists in deciding which arbitrary value to choose.\n", 22 | "\n", 23 | "### Advantages\n", 24 | "\n", 25 | "- Easy to implement\n", 26 | "- Captures the importance of missingess if there is one\n", 27 | "\n", 28 | "### Disadvantages\n", 29 | "\n", 30 | "- Distorts the original distribution of the variable\n", 31 | "- If missingess is not important, it may mask the predictive power of the original variable by distorting its distribution\n", 32 | "- Hard to decide which value to use\n", 33 | " If the value is outside the distribution it may mask or create outliers\n", 34 | "\n", 35 | "### Final note\n", 36 | "\n", 37 | "When variables are captured by third parties, like credit agencies, they place arbitrary numbers already to signal this missingness. So if not common practice in data competitions, it is common practice in real life data collections." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "===============================================================================\n", 45 | "\n", 46 | "## Real Life example: \n", 47 | "\n", 48 | "### Predicting Survival on the Titanic: understanding society behaviour and beliefs\n", 49 | "\n", 50 | "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. 
Therefore, we can learn about society's priorities and privileges at the time.\n", 51 | "\n", 52 | "=============================================================================\n", 53 | "\n", 54 | "In the following cells, I will show how this procedure impacts features and machine learning using the Titanic dataset from Kaggle.\n", 55 | "\n", 56 | "If you haven't downloaded the datasets yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so." 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 1, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "import pandas as pd\n", 68 | "import numpy as np\n", 69 | "\n", 70 | "# for classification\n", 71 | "from sklearn.linear_model import LogisticRegression\n", 72 | "from sklearn.ensemble import RandomForestClassifier\n", 73 | "\n", 74 | "# to split the datasets\n", 75 | "from sklearn.model_selection import train_test_split\n", 76 | "\n", 77 | "# to evaluate classification models\n", 78 | "from sklearn.metrics import roc_auc_score\n", 79 | "\n", 80 | "import warnings\n", 81 | "warnings.filterwarnings('ignore')" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 2, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/html": [ 92 | "<div>
\n", 93 | "\n", 106 | "\n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | "
SurvivedAgeFare
0022.07.2500
1138.071.2833
2126.07.9250
3135.053.1000
4035.08.0500
\n", 148 | "
" 149 | ], 150 | "text/plain": [ 151 | " Survived Age Fare\n", 152 | "0 0 22.0 7.2500\n", 153 | "1 1 38.0 71.2833\n", 154 | "2 1 26.0 7.9250\n", 155 | "3 1 35.0 53.1000\n", 156 | "4 0 35.0 8.0500" 157 | ] 158 | }, 159 | "execution_count": 2, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "# load the Titanic Dataset with a few variables for demonstration\n", 166 | "\n", 167 | "data = pd.read_csv('titanic.csv', usecols = ['Age', 'Fare','Survived'])\n", 168 | "data.head()" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 3, 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "data": { 178 | "text/plain": [ 179 | "Survived 0.000000\n", 180 | "Age 0.198653\n", 181 | "Fare 0.000000\n", 182 | "dtype: float64" 183 | ] 184 | }, 185 | "execution_count": 3, 186 | "metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "# let's look at the percentage of NA\n", 192 | "data.isnull().mean()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 4, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/plain": [ 203 | "((623, 3), (268, 3))" 204 | ] 205 | }, 206 | "execution_count": 4, 207 | "metadata": {}, 208 | "output_type": "execute_result" 209 | } 210 | ], 211 | "source": [ 212 | "# let's separate into training and testing set\n", 213 | "\n", 214 | "X_train, X_test, y_train, y_test = train_test_split(data, data.Survived, test_size=0.3,\n", 215 | " random_state=0)\n", 216 | "X_train.shape, X_test.shape" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 5, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "def impute_na(df, variable):\n", 228 | " df[variable+'_zero'] = df[variable].fillna(0)\n", 229 | " df[variable+'_hundred']= df[variable].fillna(100)\n", 230 | " " 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 6, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "data": { 240 | "text/html": [ 241 | "
\n", 242 | "\n", 255 | "\n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | "
SurvivedAgeFareAge_zeroAge_hundred
857151.026.550051.051.0
52149.076.729249.049.0
38601.046.90001.01.0
124054.077.287554.054.0
5780NaN14.45830.0100.0
54918.036.75008.08.0
118024.0247.520824.024.0
12020.08.050020.020.0
157030.08.050030.030.0
127124.07.141724.024.0
6531NaN7.82920.0100.0
2350NaN7.55000.0100.0
785025.07.250025.025.0
2411NaN15.50000.0100.0
3510NaN35.00000.0100.0
862148.025.929248.048.0
851074.07.775074.074.0
753023.07.895823.023.0
532017.07.229217.017.0
4850NaN25.46670.0100.0
\n", 429 | "
" 430 | ], 431 | "text/plain": [ 432 | " Survived Age Fare Age_zero Age_hundred\n", 433 | "857 1 51.0 26.5500 51.0 51.0\n", 434 | "52 1 49.0 76.7292 49.0 49.0\n", 435 | "386 0 1.0 46.9000 1.0 1.0\n", 436 | "124 0 54.0 77.2875 54.0 54.0\n", 437 | "578 0 NaN 14.4583 0.0 100.0\n", 438 | "549 1 8.0 36.7500 8.0 8.0\n", 439 | "118 0 24.0 247.5208 24.0 24.0\n", 440 | "12 0 20.0 8.0500 20.0 20.0\n", 441 | "157 0 30.0 8.0500 30.0 30.0\n", 442 | "127 1 24.0 7.1417 24.0 24.0\n", 443 | "653 1 NaN 7.8292 0.0 100.0\n", 444 | "235 0 NaN 7.5500 0.0 100.0\n", 445 | "785 0 25.0 7.2500 25.0 25.0\n", 446 | "241 1 NaN 15.5000 0.0 100.0\n", 447 | "351 0 NaN 35.0000 0.0 100.0\n", 448 | "862 1 48.0 25.9292 48.0 48.0\n", 449 | "851 0 74.0 7.7750 74.0 74.0\n", 450 | "753 0 23.0 7.8958 23.0 23.0\n", 451 | "532 0 17.0 7.2292 17.0 17.0\n", 452 | "485 0 NaN 25.4667 0.0 100.0" 453 | ] 454 | }, 455 | "execution_count": 6, 456 | "metadata": {}, 457 | "output_type": "execute_result" 458 | } 459 | ], 460 | "source": [ 461 | "# let's replace the NA with the median value in the training set\n", 462 | "impute_na(X_train, 'Age')\n", 463 | "impute_na(X_test, 'Age')\n", 464 | "\n", 465 | "X_train.head(20)" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "### Logistic Regression" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 7, 478 | "metadata": {}, 479 | "outputs": [ 480 | { 481 | "name": "stdout", 482 | "output_type": "stream", 483 | "text": [ 484 | "Train set\n", 485 | "Logistic Regression roc-auc: 0.6863462831608859\n", 486 | "Test set\n", 487 | "Logistic Regression roc-auc: 0.7137499999999999\n", 488 | "Train set\n", 489 | "Logistic Regression roc-auc: 0.6803594282119694\n", 490 | "Test set\n", 491 | "Logistic Regression roc-auc: 0.7227976190476191\n" 492 | ] 493 | } 494 | ], 495 | "source": [ 496 | "# we compare the models built using Age filled with zero, vs Age filled with 100\n", 497 | "\n", 498 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 499 | "logit.fit(X_train[['Age_zero','Fare']], y_train)\n", 500 | "print('Train set')\n", 501 | "pred = logit.predict_proba(X_train[['Age_zero','Fare']])\n", 502 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 503 | "print('Test set')\n", 504 | "pred = logit.predict_proba(X_test[['Age_zero','Fare']])\n", 505 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n", 506 | "\n", 507 | "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n", 508 | "logit.fit(X_train[['Age_hundred','Fare']], y_train)\n", 509 | "print('Train set')\n", 510 | "pred = logit.predict_proba(X_train[['Age_hundred','Fare']])\n", 511 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 512 | "print('Test set')\n", 513 | "pred = logit.predict_proba(X_test[['Age_hundred','Fare']])\n", 514 | "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 8, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "name": "stdout", 524 | "output_type": "stream", 525 | "text": [ 526 | "Train set zero imputation\n", 527 | "Random Forests roc-auc: 0.7555855621353116\n", 528 | "Test set zero imputation\n", 529 | "Random Forests zero imputation roc-auc: 0.7490476190476191\n", 530 | "\n", 531 | "Train set median imputation\n", 532 | "Random Forests roc-auc: 
0.7490781111038807\n", 533 | "Test set imputation with 100\n", 534 | "Random Forests roc-auc: 0.7653571428571431\n", 535 | "\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "# random forests\n", 541 | "\n", 542 | "rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)\n", 543 | "rf.fit(X_train[['Age_zero', 'Fare']], y_train)\n", 544 | "print('Train set zero imputation')\n", 545 | "pred = rf.predict_proba(X_train[['Age_zero', 'Fare']])\n", 546 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 547 | "print('Test set zero imputation')\n", 548 | "pred = rf.predict_proba(X_test[['Age_zero', 'Fare']])\n", 549 | "print('Random Forests zero imputation roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n", 550 | "print()\n", 551 | "rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)\n", 552 | "rf.fit(X_train[['Age_hundred', 'Fare']], y_train)\n", 553 | "print('Train set imputation with 100')\n", 554 | "pred = rf.predict_proba(X_train[['Age_hundred', 'Fare']])\n", 555 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n", 556 | "print('Test set imputation with 100')\n", 557 | "pred = rf.predict_proba(X_test[['Age_hundred', 'Fare']])\n", 558 | "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n", 559 | "print()" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "We can see that replacing NA with 100 makes the models perform better than replacing NA with 0. As you may remember from the lecture \"Replacing NA by mean or median\", this is because children were more likely to survive than adults. Filling NA with zeroes distorts this relation and makes the models lose predictive power. See below for a recap." 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 9, 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "name": "stdout", 576 | "output_type": "stream", 577 | "text": [ 578 | "Average real survival of children: 0.5740740740740741\n", 579 | "Average survival of children when using Age imputed with zeroes: 0.38857142857142857\n", 580 | "Average survival of children when using Age imputed with 100: 0.5740740740740741\n" 581 | ] 582 | } 583 | ], 584 | "source": [ 585 | "print('Average real survival of children: ', X_train[X_train.Age<15].Survived.mean())\n", 586 | "print('Average survival of children when using Age imputed with zeroes: ', X_train[X_train.Age_zero<15].Survived.mean())\n", 587 | "print('Average survival of children when using Age imputed with 100: ', X_train[X_train.Age_hundred<15].Survived.mean())" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "### Final notes\n", 595 | "\n", 596 | "The arbitrary value has to be determined for each variable specifically. For example, for this dataset, the choices of 0 and 100 to replace NA in Age are valid, because neither value is frequent in the original distribution of the variable, and both lie at the tails of the distribution.\n", 597 | "\n", 598 | "However, if we were to replace NA in Fare, those values are no longer suitable, because fare can take values of up to 500. So we might want to consider using 500 or 1000 to replace NA instead of 100.\n", 599 | "\n", 600 | "As you can see, this is totally arbitrary. And yet, it is used in the industry.\n", 601 | "\n", 602 | "Typical values chosen by companies are -9999 or 9999, or similar." 
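As a side note, the choice of the arbitrary value can be automated. Below is a minimal sketch, not part of the original notebook; the function name and the power-of-10 rule are my own, used here only to illustrate the idea of picking a round value just above the training-set maximum:

```python
def impute_arbitrary(df_train, df_test, variable, value=None):
    """Fill NA in `variable` with an arbitrary value outside its range.

    If no value is given, use the next power of 10 above the
    training-set maximum (e.g. a max of 512 gives 1000), so the
    filler cannot be confused with a genuine observation.
    """
    if value is None:
        value = 10 ** len(str(int(df_train[variable].max())))
    df_train[variable + '_arbitrary'] = df_train[variable].fillna(value)
    df_test[variable + '_arbitrary'] = df_test[variable].fillna(value)
    return value

# hypothetical usage with the X_train / X_test of this notebook:
# impute_arbitrary(X_train, X_test, 'Fare')  # Fare max ~512, so NA -> 1000
```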
603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": { 608 | "collapsed": true 609 | }, 610 | "source": [ 611 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": { 618 | "collapsed": true 619 | }, 620 | "outputs": [], 621 | "source": [] 622 | } 623 | ], 624 | "metadata": { 625 | "kernelspec": { 626 | "display_name": "Python 3", 627 | "language": "python", 628 | "name": "python3" 629 | }, 630 | "language_info": { 631 | "codemirror_mode": { 632 | "name": "ipython", 633 | "version": 3 634 | }, 635 | "file_extension": ".py", 636 | "mimetype": "text/x-python", 637 | "name": "python", 638 | "nbconvert_exporter": "python", 639 | "pygments_lexer": "ipython3", 640 | "version": "3.6.1" 641 | }, 642 | "toc": { 643 | "nav_menu": {}, 644 | "number_sections": true, 645 | "sideBar": true, 646 | "skip_h1_title": false, 647 | "toc_cell": false, 648 | "toc_position": {}, 649 | "toc_section_display": "block", 650 | "toc_window_display": false 651 | } 652 | }, 653 | "nbformat": 4, 654 | "nbformat_minor": 2 655 | } 656 | -------------------------------------------------------------------------------- /06.3_Adding_a_variable_to_capture_NA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Adding a variable to capture NA\n", 8 | "\n", 9 | "In the previous lectures within this section we studied how to replace missing values by the most frequent category or by extracting a random sample of the variable. These 2 methods assume that the missing data are missing completely at random (MCAR), and are suitable when the number of missing values is small; otherwise they may distort the distribution of the target within the labels of the variable.\n", 10 | "\n", 11 | "So what if the number of missing values is not small, or the data are not MCAR?\n", 12 | "\n", 13 | "We can capture the importance of missingness by creating an additional variable indicating whether the data was missing for that observation (1) or not (0). The additional variable is a binary variable: it takes only the values 0 and 1, 0 indicating that a value was present for that observation, and 1 indicating that the value was missing for that observation.\n", 14 | "\n", 15 | "The procedure is exactly the same as for numerical variables.\n", 16 | "\n", 17 | "\n", 18 | "### Advantages\n", 19 | "\n", 20 | "- Easy to implement\n", 21 | "- Captures the importance of missingness if there is one\n", 22 | "\n", 23 | "### Disadvantages\n", 24 | "\n", 25 | "- Expands the feature space\n", 26 | "\n", 27 | "This method of imputation will add 1 variable per variable in the dataset with missing values. So if a dataset contains 10 features, and all of them have missing values, we will end up with a dataset with 20 features. 
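The binary indicator described above is a one-liner in pandas. A minimal, self-contained sketch with toy data (not from the notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'BsmtQual': ['Gd', np.nan, 'TA', np.nan]})

# 1 where the value was missing, 0 where it was present
df['BsmtQual_NA'] = df['BsmtQual'].isnull().astype(int)
```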
That is, the original 10 features, in which we replaced the NA by the frequent label or by random sampling, plus 10 additional features indicating, for each variable, whether the value was missing or not.\n", 28 | "\n", 29 | "This may not be a problem in datasets with tens to a few hundreds of variables, but if your original dataset contains thousands of variables, by creating an additional variable to indicate NA, you will end up with very big datasets.\n", 30 | "\n", 31 | "In addition, data tends to be missing for the same observation on multiple variables, so it may also be the case that many of your added variables will actually be similar to each other." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "===============================================================================\n", 39 | "\n", 40 | "## Real Life example: \n", 41 | "\n", 42 | "### Predicting Sale Price of Houses\n", 43 | "\n", 44 | "The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential homes. Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is over or underestimated, before making a buying judgment.\n", 45 | "\n", 46 | "=============================================================================\n", 47 | "\n", 48 | "In the following cells, I will demonstrate NA imputation by random sampling + adding an additional variable using the House Price dataset from Kaggle.\n", 49 | "\n", 50 | "If you haven't downloaded the datasets yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 1, 56 | "metadata": { 57 | "collapsed": true 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "import pandas as pd\n", 62 | "import numpy as np\n", 63 | "\n", 64 | "import matplotlib.pyplot as plt\n", 65 | "%matplotlib inline\n", 66 | "\n", 67 | "# for regression problems\n", 68 | "from sklearn.linear_model import LinearRegression, Ridge\n", 69 | "\n", 70 | "# to split the datasets\n", 71 | "from sklearn.model_selection import train_test_split\n", 72 | "\n", 73 | "# to evaluate regression models\n", 74 | "from sklearn.metrics import mean_squared_error\n", 75 | "\n", 76 | "import warnings\n", 77 | "warnings.filterwarnings('ignore')" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### House Price dataset" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 2, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "SalePrice 0.000000\n", 96 | "BsmtQual 0.025342\n", 97 | "GarageType 0.055479\n", 98 | "FireplaceQu 0.472603\n", 99 | "dtype: float64" 100 | ] 101 | }, 102 | "execution_count": 2, 103 | "metadata": {}, 104 | "output_type": "execute_result" 105 | } 106 | ], 107 | "source": [ 108 | "# let's load the dataset with a few columns for the demonstration\n", 109 | "cols_to_use = ['BsmtQual', 'FireplaceQu', 'GarageType', 'SalePrice']\n", 110 | "\n", 111 | "data = pd.read_csv('houseprice.csv', usecols=cols_to_use)\n", 112 | "\n", 113 | "# let's inspect the percentage of missing values in each variable\n", 114 | "data.isnull().mean().sort_values(ascending=True)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "**To evaluate whether adding these additional 
variables to indicate missingness improves the performance of the ML algorithms, I will replace missing values by random sampling (see previous lecture) as well.**\n", 122 | "\n", 123 | "### Important note on imputation\n", 124 | "\n", 125 | "Imputation should be done over the training set, and then propagated to the test set. This means that the random sampling of categories should be done from the training set, and used to replace NA both in train and test sets." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 3, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "((1022, 3), (438, 3))" 137 | ] 138 | }, 139 | "execution_count": 3, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "# let's separate into training and testing set\n", 146 | "\n", 147 | "X_train, X_test, y_train, y_test = train_test_split(data[['BsmtQual', 'FireplaceQu', 'GarageType']],\n", 148 | " data.SalePrice, test_size=0.3,\n", 149 | " random_state=0)\n", 150 | "X_train.shape, X_test.shape" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 4, 156 | "metadata": { 157 | "collapsed": true 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "# let's create a variable to replace NA with a random sample of labels available within that variable\n", 162 | "# in the same function we create the additional variable to indicate missingness\n", 163 | "\n", 164 | "# make sure you understand every line of code.\n", 165 | "# If unsure, run them separately in a cell in the notebook until you familiarise yourself with the output\n", 166 | "# of each line\n", 167 | "\n", 168 | "def impute_na(df_train, df_test, variable):\n", 169 | "    # add additional variable to indicate missingness\n", 170 | "    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 1, 0)\n", 171 | "    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 1, 0)\n", 172 | "    \n", 173 | "    # random sampling\n", 174 | "    df_train[variable+'_random'] = df_train[variable]\n", 175 | "    df_test[variable+'_random'] = df_test[variable]\n", 176 | "    \n", 177 | "    # extract random samples from the train set to fill the NA in both train and test sets\n", 178 | "    random_sample_train = df_train[variable].dropna().sample(df_train[variable].isnull().sum(), random_state=0)\n", 179 | "    random_sample_test = df_train[variable].dropna().sample(df_test[variable].isnull().sum(), random_state=0)\n", 180 | "    \n", 181 | "    # pandas needs to have the same index in order to merge datasets\n", 182 | "    random_sample_train.index = df_train[df_train[variable].isnull()].index\n", 183 | "    random_sample_test.index = df_test[df_test[variable].isnull()].index\n", 184 | "    \n", 185 | "    df_train.loc[df_train[variable].isnull(), variable+'_random'] = random_sample_train\n", 186 | "    df_test.loc[df_test[variable].isnull(), variable+'_random'] = random_sample_test" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 5, 192 | "metadata": { 193 | "collapsed": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "# and let's replace the NA\n", 198 | "for variable in ['BsmtQual', 'FireplaceQu', 'GarageType',]:\n", 199 | "    impute_na(X_train, X_test, variable)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 6, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "BsmtQual 24\n", 211 | "FireplaceQu 478\n", 212 | "GarageType 54\n", 213 | "BsmtQual_NA 0\n", 214 | "BsmtQual_random 0\n", 215 | "FireplaceQu_NA 0\n", 216 | 
"FireplaceQu_random 0\n", 217 | "GarageType_NA 0\n", 218 | "GarageType_random 0\n", 219 | "dtype: int64" 220 | ] 221 | }, 222 | "execution_count": 6, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "# let's inspect that NA were replaced\n", 229 | "X_train.isnull().sum()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 7, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/html": [ 240 | "
\n", 241 | "\n", 254 | "\n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | "
BsmtQualFireplaceQuGarageTypeBsmtQual_NABsmtQual_randomFireplaceQu_NAFireplaceQu_randomGarageType_NAGarageType_random
64GdNaNAttchd0Gd1Gd0Attchd
682GdGdAttchd0Gd0Gd0Attchd
960TANaNNaN0TA1TA1Attchd
1384TANaNDetchd0TA1TA0Detchd
1100TANaNDetchd0TA1Gd0Detchd
\n", 332 | "
" 333 | ], 334 | "text/plain": [ 335 | " BsmtQual FireplaceQu GarageType BsmtQual_NA BsmtQual_random \\\n", 336 | "64 Gd NaN Attchd 0 Gd \n", 337 | "682 Gd Gd Attchd 0 Gd \n", 338 | "960 TA NaN NaN 0 TA \n", 339 | "1384 TA NaN Detchd 0 TA \n", 340 | "1100 TA NaN Detchd 0 TA \n", 341 | "\n", 342 | " FireplaceQu_NA FireplaceQu_random GarageType_NA GarageType_random \n", 343 | "64 1 Gd 0 Attchd \n", 344 | "682 0 Gd 0 Attchd \n", 345 | "960 1 TA 1 Attchd \n", 346 | "1384 1 TA 0 Detchd \n", 347 | "1100 1 Gd 0 Detchd " 348 | ] 349 | }, 350 | "execution_count": 7, 351 | "metadata": {}, 352 | "output_type": "execute_result" 353 | } 354 | ], 355 | "source": [ 356 | "X_train.head()" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 8, 362 | "metadata": {}, 363 | "outputs": [ 364 | { 365 | "data": { 366 | "text/html": [ 367 | "
\n", 368 | "\n", 381 | "\n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | "
BsmtQual_NAFireplaceQu_NAGarageType_NA
count1022.0000001022.0000001022.000000
mean0.0234830.4677100.052838
std0.1515070.4992010.223819
min0.0000000.0000000.000000
25%0.0000000.0000000.000000
50%0.0000000.0000000.000000
75%0.0000001.0000000.000000
max1.0000001.0000001.000000
\n", 441 | "
" 442 | ], 443 | "text/plain": [ 444 | " BsmtQual_NA FireplaceQu_NA GarageType_NA\n", 445 | "count 1022.000000 1022.000000 1022.000000\n", 446 | "mean 0.023483 0.467710 0.052838\n", 447 | "std 0.151507 0.499201 0.223819\n", 448 | "min 0.000000 0.000000 0.000000\n", 449 | "25% 0.000000 0.000000 0.000000\n", 450 | "50% 0.000000 0.000000 0.000000\n", 451 | "75% 0.000000 1.000000 0.000000\n", 452 | "max 1.000000 1.000000 1.000000" 453 | ] 454 | }, 455 | "execution_count": 8, 456 | "metadata": {}, 457 | "output_type": "execute_result" 458 | } 459 | ], 460 | "source": [ 461 | "X_train.describe()" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 9, 467 | "metadata": { 468 | "collapsed": true 469 | }, 470 | "outputs": [], 471 | "source": [ 472 | "# let's transform the categories into numbers quick and dirty so we can use them in scikit-learn\n", 473 | "\n", 474 | "# the below function numbers the labels from 0 to n, n being the number of different labels \n", 475 | "# within the variable\n", 476 | "\n", 477 | "for col in ['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',]:\n", 478 | " labels_dict = {k:i for i, k in enumerate(X_train[col].unique(), 0)}\n", 479 | " X_train.loc[:, col] = X_train.loc[:, col].map(labels_dict )\n", 480 | " X_test.loc[:, col] = X_test.loc[:, col].map(labels_dict)" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "### Linear Regression" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 10, 493 | "metadata": {}, 494 | "outputs": [ 495 | { 496 | "name": "stdout", 497 | "output_type": "stream", 498 | "text": [ 499 | "Test set random imputation\n", 500 | "Linear Regression mse: 6456070592.706035\n", 501 | "\n", 502 | "Test set random imputation + additional variable indicating missingness\n", 503 | "Linear Regression mse: 4911877327.956806\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "# Let's evaluate the performance of Linear Regression\n", 509 | "\n", 510 | "# first we build a model using ONLY the variable with the NA replaced by a random sample\n", 511 | "linreg = LinearRegression()\n", 512 | "linreg.fit(X_train[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random']], y_train)\n", 513 | "print('Test set random imputation')\n", 514 | "pred = linreg.predict(X_test[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random']])\n", 515 | "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))\n", 516 | "print()\n", 517 | "\n", 518 | "# second we build a model including the variable that indicates missingness as well\n", 519 | "linreg = LinearRegression()\n", 520 | "linreg.fit(X_train[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',\n", 521 | " 'BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']], y_train)\n", 522 | "print('Test set random imputation + additional variable indicating missingness')\n", 523 | "pred = linreg.predict(X_test[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',\n", 524 | " 'BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']])\n", 525 | "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "Amazing, in this exercise we can see the power of creating that additional variable to capture missingness. The mse on the test set decreased dramatically when we included these variables that indicate that the observations contained missing values. 
That represents a reduction in mse of:" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 11, 538 | "metadata": {}, 539 | "outputs": [ 540 | { 541 | "data": { 542 | "text/plain": [ 543 | "1544193265" 544 | ] 545 | }, 546 | "execution_count": 11, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "6456070592-4911877327" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "About 1.5 billion! Bear in mind that the mse is measured in squared dollars, so this is best read as a substantial relative improvement rather than a literal dollar saving." 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": { 565 | "collapsed": true 566 | }, 567 | "source": [ 568 | "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**" 569 | ] 570 | } 571 | ], 572 | "metadata": { 573 | "kernelspec": { 574 | "display_name": "Python 3", 575 | "language": "python", 576 | "name": "python3" 577 | }, 578 | "language_info": { 579 | "codemirror_mode": { 580 | "name": "ipython", 581 | "version": 3 582 | }, 583 | "file_extension": ".py", 584 | "mimetype": "text/x-python", 585 | "name": "python", 586 | "nbconvert_exporter": "python", 587 | "pygments_lexer": "ipython3", 588 | "version": "3.6.1" 589 | }, 590 | "toc": { 591 | "nav_menu": {}, 592 | "number_sections": true, 593 | "sideBar": true, 594 | "skip_h1_title": false, 595 | "toc_cell": false, 596 | "toc_position": {}, 597 | "toc_section_display": "block", 598 | "toc_window_display": false 599 | } 600 | }, 601 | "nbformat": 4, 602 | "nbformat_minor": 2 603 | } 604 | -------------------------------------------------------------------------------- /06.4_Adding_a_category_to_capture_NA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Adding a category to capture NA\n", 8 | "\n", 9 | "This is perhaps the most widely used method of missing data imputation for categorical variables. This method consists of treating missing data as if they were an additional label or category of the variable. All the missing observations are grouped in the newly created label 'Missing'. \n", 10 | "\n", 11 | "The beauty of this technique resides in the fact that it does not assume anything about the missingness of the values. It is very well suited when the number of missing values is high.\n", 12 | "\n", 13 | "\n", 14 | "### Advantages\n", 15 | "\n", 16 | "- Easy to implement\n", 17 | "- Captures the importance of missingness if there is one\n", 18 | "\n", 19 | "### Disadvantages\n", 20 | "\n", 21 | "- If the number of NA is small, creating an additional category may cause trees to over-fit\n", 22 | "\n", 23 | "I would say that for categorical variables this is the method of choice, as it treats missing values as a separate category, without making any assumption on their missingness. It is used widely in data science competitions and business settings. 
See for example the winning solution of the KDD 2009 cup: \"Winning the KDD Cup Orange Challenge with Ensemble Selection\" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf).\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "===============================================================================\n", 31 | "\n", 32 | "## Real Life example: \n", 33 | "\n", 34 | "### Predicting Sale Price of Houses\n", 35 | "\n", 36 | "The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential homes. Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is over or underestimated, before making a buying judgment.\n", 37 | "\n", 38 | "=============================================================================\n", 39 | "\n", 40 | "In the following cells, I will demonstrate NA imputation by adding an additional label using the House Price dataset from Kaggle.\n", 41 | "\n", 42 | "If you haven't downloaded the datasets yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so." 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 1, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "import pandas as pd\n", 54 | "import numpy as np\n", 55 | "\n", 56 | "import matplotlib.pyplot as plt\n", 57 | "%matplotlib inline\n", 58 | "\n", 59 | "# for regression problems\n", 60 | "from sklearn.linear_model import LinearRegression, Ridge\n", 61 | "\n", 62 | "# to split the datasets\n", 63 | "from sklearn.model_selection import train_test_split\n", 64 | "\n", 65 | "# to evaluate regression models\n", 66 | "from sklearn.metrics import mean_squared_error\n", 67 | "\n", 68 | "import warnings\n", 69 | "warnings.filterwarnings('ignore')" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### House Price dataset" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 2, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "SalePrice 0.000000\n", 88 | "BsmtQual 0.025342\n", 89 | "GarageType 0.055479\n", 90 | "FireplaceQu 0.472603\n", 91 | "dtype: float64" 92 | ] 93 | }, 94 | "execution_count": 2, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "# let's load the dataset with a few columns for the demonstration\n", 101 | "cols_to_use = ['BsmtQual', 'FireplaceQu', 'GarageType', 'SalePrice']\n", 102 | "\n", 103 | "data = pd.read_csv('houseprice.csv', usecols=cols_to_use)\n", 104 | "\n", 105 | "# let's inspect the percentage of missing values in each variable\n", 106 | "data.isnull().mean().sort_values(ascending=True)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "data": { 116 | "text/plain": [ 117 | "((1022, 3), (438, 3))" 118 | ] 119 | }, 120 | "execution_count": 3, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "# let's separate into training and testing set\n", 127 | "\n", 128 | "X_train, X_test, y_train, y_test = train_test_split(data[['BsmtQual', 'FireplaceQu', 'GarageType']],\n", 129 | " data.SalePrice, test_size=0.3,\n", 130 | " random_state=0)\n", 131 | "X_train.shape, X_test.shape" 132 | ] 133 | }, 134 | 
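The next cell builds the 'Missing' label with np.where. For reference (this is a sketch, not part of the original notebook, and it assumes the X_train and X_test created above), pandas' fillna achieves the same replacement in one line per dataset:

```python
# equivalent to the np.where approach in the next cell
for variable in ['BsmtQual', 'FireplaceQu', 'GarageType']:
    X_train[variable + '_NA'] = X_train[variable].fillna('Missing')
    X_test[variable + '_NA'] = X_test[variable].fillna('Missing')
```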
{ 135 | "cell_type": "code", 136 | "execution_count": 4, 137 | "metadata": { 138 | "collapsed": true 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "# let's create a new variable, replacing the NA with the additional label 'Missing'\n", 143 | "\n", 144 | "def impute_na(df_train, df_test, variable):\n", 145 | "    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 'Missing', df_train[variable])\n", 146 | "    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 'Missing', df_test[variable])" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 5, 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "# and let's replace the NA\n", 158 | "for variable in ['BsmtQual', 'FireplaceQu', 'GarageType',]:\n", 159 | "    impute_na(X_train, X_test, variable)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 6, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "BsmtQual 24\n", 171 | "FireplaceQu 478\n", 172 | "GarageType 54\n", 173 | "BsmtQual_NA 0\n", 174 | "FireplaceQu_NA 0\n", 175 | "GarageType_NA 0\n", 176 | "dtype: int64" 177 | ] 178 | }, 179 | "execution_count": 6, 180 | "metadata": {}, 181 | "output_type": "execute_result" 182 | } 183 | ], 184 | "source": [ 185 | "# let's check that data have been completed\n", 186 | "X_train.isnull().sum()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 7, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/html": [ 197 | "<div>
\n", 198 | "\n", 211 | "\n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | "
BsmtQualFireplaceQuGarageTypeBsmtQual_NAFireplaceQu_NAGarageType_NA
64GdNaNAttchdGdMissingAttchd
682GdGdAttchdGdGdAttchd
960TANaNNaNTAMissingMissing
1384TANaNDetchdTAMissingDetchd
1100TANaNDetchdTAMissingDetchd
\n", 271 | "
" 272 | ], 273 | "text/plain": [ 274 | " BsmtQual FireplaceQu GarageType BsmtQual_NA FireplaceQu_NA GarageType_NA\n", 275 | "64 Gd NaN Attchd Gd Missing Attchd\n", 276 | "682 Gd Gd Attchd Gd Gd Attchd\n", 277 | "960 TA NaN NaN TA Missing Missing\n", 278 | "1384 TA NaN Detchd TA Missing Detchd\n", 279 | "1100 TA NaN Detchd TA Missing Detchd" 280 | ] 281 | }, 282 | "execution_count": 7, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "# let's see how the new variable looks like, where data was missing we have\n", 289 | "# not the label 'Missing'\n", 290 | "\n", 291 | "X_train.head()" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 8, 297 | "metadata": {}, 298 | "outputs": [ 299 | { 300 | "data": { 301 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEmCAYAAABs7FscAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEv1JREFUeJzt3X+wZ3Vdx/Hnq+WXk6Ewe6MddvNiLdniCOKGP7IiyYGk\nXKyExaKtoSjDSe2HQTX9sNYoralMKvy59kPcwnTFRqMVIpsSFwFhUWQTCJgVVq0kq1WWd398z3a/\nXvfuvd+79+659/N9PmZ2vud8zjnf876Hy+ue8zm/UlVIktr1VX0XIElaXAa9JDXOoJekxhn0ktQ4\ng16SGmfQS1LjDHpJapxBL0mNM+glqXFH9F0AwMqVK2tycrLvMiRpWbn55ps/U1UTs823JIJ+cnKS\nHTt29F2GJC0rSe6by3x23UhS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIatyRumFoI\nk5e9r+8SALj3inP7LkGSvox79JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS\n1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcc08plhTfGSzpGHu0UtS4wx6SWqcQS9JjTPoJalxBr0k\nNc6gl6TGGfSS1DiDXpIaZ9BLUuPmHPRJViS5Jcm13fjxSa5Lcnf3edzQvJcn2ZXkriRnL0bhkqS5\nGWWP/uXAx4fGLwO2V9VaYHs3TpJ1wEbgFOAc4MokKxamXEnSqOYU9ElWA+cCbxpq3gBs6Ya3AOcN\ntV9dVXur6h5gF3DGwpQrSRrVXPfofx94FfDYUNsJVbW7G/40cEI3fCJw/9B8D3RtXybJJUl2JNmx\nZ8+e0aqWJM3ZrEGf5HuAh6vq5pnmqaoCapQVV9VVVbW+qtZPTEyMsqgkaQRzeUzxtwIvTPIC4Bjg\n2CR/DjyUZFVV7U6yCni4m/9BYM3Q8qu7NklSD2bdo6+qy6tqdVVNMjjJ+sGq+iFgG7Cpm20T8J5u\neBuwMcnRSU4C1gI3LXjlkqQ5OZQXj1wBbE1yMXAfcD5AVe1MshW4E3gUuLSq9h1ypZKkeRkp6Kvq\nBuCGbvizwFkzzLcZ2HyItUmSFoB3xkpS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIa\nZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEG\nvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcbMGfZJjktyU5LYkO5P8etd+fJLrktzdfR43tMzlSXYluSvJ2Yv5A0iS\nDm4ue/R7gedV1anAacA5SZ4FXAZsr6q1wPZunCTrgI3AKcA5wJVJVixG8ZKk2c0a9DXwX93okd2/\nAjYAW7r2LcB53fAG4Oqq2ltV9wC7gDMWtGpJ0pzNqY8+yYoktwIPA9dV1YeBE6pqdzfLp4ETuuET\ngfuHFn+ga5v+nZck2ZFkx549e+b9A0iSDm5OQV9V+6rqNGA1cEaSp06bXgz28uesqq6qqvVVtX5i\nYmKURSVJIxjpqpuq+g/gegZ97w8lWQXQfT7czfYgsGZosdVdmySpB3O56mYiyRO74ccBzwc+AWwD\nNnWzbQLe0w1vAzYmOTrJScBa4KaFLlySNDdHzGGeVcCW7sqZrwK2VtW1Sf4Z2JrkYuA+4HyAqtqZ\nZCtwJ/AocGlV7Vuc8iVJs5k16KvqY8DTD9D+WeCsGZbZDGw+5OokSYfMO2MlqXEGvSQ1zqCXpMYZ\n9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEv\nSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLU\nOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1LhZgz7JmiTXJ7kzyc4kL+/aj09y\nXZK7u8/jhpa5PMmuJHclOXsxfwBJ0sHNZY/+UeBnq2od8Czg0iTrgMuA7VW1FtjejdNN2wicApwD\nXJlkxWIUL0ma3axBX1W7q+qj3fAjwMeBE4ENwJZuti3Aed3wBuDqqtpbVfcAu4AzFrpwSdLcjNRH\nn2QSeDrwYeCEqtrdTfo0cEI3fCJw/9BiD3Rt07/rkiQ7kuzYs2fPiGVLkuZqzkGf5PHANcArqurz\nw9OqqoAaZcVVdVVVra+q9RMTE6MsKkkawZyCPsmRDEL+L6rqXV3zQ0lWddNXAQ937Q8Ca4YWX921\nSZJ6MJerbgK8Gfh4Vf3e0KRtwKZueBPwnqH2jUmOTnISsBa4aeFKliSN4og5zPOtwEXA7Ulu7dp+\nEbgC2JrkYuA+4HyAqtqZZCtwJ4Mrdi6tqn0LXrkkaU5mDfqq+hCQGSafNcMym4HNh1CXJGmBeGes\nJDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS\n4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXO\noJekxhn0ktQ4g16SGm
fQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUuFmDPslbkjyc5I6h\ntuOTXJfk7u7zuKFplyfZleSuJGcvVuGSpLmZyx7924BzprVdBmyvqrXA9m6cJOuAjcAp3TJXJlmx\nYNVKkkY2a9BX1Y3A56Y1bwC2dMNbgPOG2q+uqr1VdQ+wCzhjgWqVJM3DfPvoT6iq3d3wp4ETuuET\ngfuH5nuga5Mk9eSQT8ZWVQE16nJJLkmyI8mOPXv2HGoZkqQZzDfoH0qyCqD7fLhrfxBYMzTf6q7t\nK1TVVVW1vqrWT0xMzLMMSdJs5hv024BN3fAm4D1D7RuTHJ3kJGAtcNOhlShJOhRHzDZDkncAZwIr\nkzwA/CpwBbA1ycXAfcD5AFW1M8lW4E7gUeDSqtq3SLVLkuZg1qCvqgtnmHTWDPNvBjYfSlGSpIXj\nnbGS1DiDXpIaZ9BLUuNm7aOXlrPJy97Xdwnce8W5fZegMecevSQ1zqCXpMYZ9JLUOINekhpn0EtS\n4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXO\noJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ17oi+C5B0eExe9r6+S+DeK87tuwRg\n/LaFe/SS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcYsW9EnOSXJXkl1JLlus9UiSDm5Rgj7JCuAN\nwHcD64ALk6xbjHVJkg5usfbozwB2VdWnquqLwNXAhkValyTpIFJVC/+lyQ8A51TVj3XjFwHPrKqX\nDc1zCXBJN/pNwF0LXsjoVgKf6buIJcJtMcVtMcVtMWUpbIsnVdXEbDP19giEqroKuKqv9R9Ikh1V\ntb7vOpYCt8UUt8UUt8WU5bQtFqvr5kFgzdD46q5NknSYLVbQfwRYm+SkJEcBG4Fti7QuSdJBLErX\nTVU9muRlwAeAFcBbqmrnYqxrgS2prqSeuS2muC2muC2mLJttsSgnYyVJS4d3xkpS4wx6SWqcQS9J\njTPohyR5bpI39F2HJC2ksX9nbJKnAy8BXgzcA7yr34oOrySnH2x6VX30cNWylCS5uKrePDS+Avjl\nqvr1HsuS5mUsgz7JycCF3b/PAO9kcAXSd/ZaWD9+t/s8BlgP3AYEeBqwA3h2T3X17awk3w9cDBwP\nvA34h14r6kmS9wLTL8/7Twa/H39aVf97+KvqR5JnAa8Hvhk4isHl41+oqmN7LWwWYxn0wCeAfwS+\np6p2ASR5Zb8l9WP/H7ck7wJOr6rbu/GnAr/WY2m9qqqXJLkAuB34AvCSqvqnnsvqy6eACeAd3fgF\nwCPAycAbgYt6qqsPf8TgBtC/YrBj9MMMtsOSNq5B/30M/mNdn+T9DJ6umX5L6t037Q95gKq6I8k3\n91lQn5KsBV4OXMNg7+2iJLdU1X/3W1kvnlNV3zI0/t4kH6mqb0myHG6EXFBVtSvJiqraB7w1yS3A\n5X3XdTDjGvTXVtW7k3w1g8cnvwL42iR/DPxNVf1dv+X14mNJ3gT8eTf+g8DHeqynb+8FLq2q7UkC\n/AyDR3uc0m9ZvXh8kq+vqn8DSPL1wOO7aV/sr6xe/Hf3WJdbk/wOsJtlcFHLWN4Zm+SjVXX6tLbj\nGJyQvaCqzuqnsv4kOQZ4KfBtXdONwB9X1d7+qupPkmOr6vPT2k6uqk/2VVNfkrwA+BPgXxkc+Z4E\n/BRwA/DjVfX7/VV3eCV5EvAQg/75VwJPAK7c3wW8VI1r0N9SVU/vu46lIMkGYHVVvaEbv4lBf2wB\nr6qqv+6zvsMtyauq6ne64RdX1V8NTXtNVf1if9X1J8nRwFO60bvG6QQsDI5i9h/RLEfjGvQPAL83\n0/SqmnFaa5L8E7Cxqu7vxm8Fnsfg0Pyt43Z0M3y0N/3I70BHguMiyXOASYa6e6vq7b0VdJhN+724\npqq+v++aRjGuffQrGATZuJ+ABThqf8h3PlRVnwM+153DGDeZYfhA42MhyZ8B3wDcCuzrmgsYm6Dn\ny//bP7m3KuZpXIN+d1W9uu8ilojjhkeGX/fIoAtn3NQMwwcaHxfrgXU1jof/Uw72e7HkjWvQj+We\n2Qw+nOTHq+qNw41JfgK4qaea+nRqks8z+B15XDdMN35Mf2X16g7g6xhcYTKuDvZ7UUv9hqlx7aM/\nvuueGHtJvhZ4N7AX2P+4g2cARwPnVdVDfdWmpSHJ9cBpDP7w//9VWFX1wt6K0kjGMuj1lZI8j6lr\nxHdW1Qf7rEdLR5LvOFB7VY3lIyGWI4Nekho3rn30kmaR5ENV9dwkj/DlJyCXRb+0prhHL0mNW/LP\naJDUryTf0N0ZS5Izk/x0kif2XZfmzqCXNJtrgH1JvhG4ClgD/GW/JWkUBr2k2TxWVY8CLwJeX1U/\nD6zquSaNwKCXNJsvJbkQ2ARc27Ud2WM9GpFBL2k2P8rglZKbq+qeJCcBf9ZzTRqBV91ImrPuvQ1r\nqmqcX0qz7LhHL+mgktyQ5NgkxzN4TMYbk4zNo7xbYNBLms0TurdtfR/w9qp6JvBdPdekERj0kmZz\nRJJVwPlMnYzVMmLQS5rNq4EPALuq6iNJngzc3XNNGoEnYyWpcT7UTNIB7X9RepLXc4C3KlXVT/dQ\nlubBoJc0k493nzt6rUKHzK4bSWqce/SSDijJtoNN91WCy4dBL2kmzwbuB94BfJjBC0e0DNl1I+mA\nkqwAng9cCDwNeB/wjqra2WthGpnX0Us6oKraV1Xvr6pNwLOAXcANSV7Wc2kakV03kmbUvVnqXAZ7\n9ZPAHwJ/02dNGp1dN5IOKMnbgacCfwtcXVV39FyS5smgl3RASR4DvtCNDgdFgKqqYw9/VZoPg16S\nGufJWElqnEEvSY0z6CWpcQa9lrQk+5LcmuS2JB9N8pwF+M7TkrxgWtt5ST6W5BNJ7kjyA4fw/ZNJ\nZrxCJcmZSSrJ9w61XZvkzKHxlUm+lOQn51uHtJ9Br6Xuf6rqtKo6Fbgc+K0F+M7TgP8P+iSnAq8D\nNlTVU4DvBX47yTMWYF0zeQD4pYNMfzHwLwyuX5cOiUGv5eRY4N8BkqxKcmO3t39Hkm/r2v8ryWuT\n7Ezy90nO6F5u/akkL0xyFIM3Jl3QLXsB8HPAa6rqHoDu8zXAz3bfeUOS9d3wyiT3dsOTSf6xO9IY\n9WjjNuA/kzx/hukXdus/McnqkbaSNI1Br6XucV0gfwJ4E/AbXftLgA9U1WnAqcCtXftXAx+sqlOA\nR4DfZPC8lhcBr66qLwK/AryzO1J4J3AKcPO09e4A1s1S28PA86vqdOACBneNjmIz8MvTG5OsAVZV\n1U3A1u67pXnzEQha6v6nC3OSPBt4e5KnAh8B3pLkSODdVbU/6L8IvL8bvh3YW1VfSnI7g1v4F9KR\nwB8lOQ3YB5w8ysJVdWMSkjx32qQLGAQ8wNXAW4DfPdRiNb7co9eyUVX/DKwEJqr
qRuDbgQeBtyX5\n4W62L9XUXYCPAXu7ZR9j5h2bO4Hp/fHPYOrNSo8y9f/KMUPzvBJ4iMERxXrgqHn8WAfaq78Q+JGu\ni2gb8LQka+fx3RJg0GsZSfIUYAXw2SRPAh6qqjcy6NI5fYSvegT4mqHx1wGXJ5ns1jMJvAJ4bTf9\nXqb+EAxfjfMEYHf3R+SirraRVNXfAccxeAwwSU4GHl9VJ1bVZFVNMjgB7UlZzZtBr6Vufx/9rcA7\ngU1VtQ84E7gtyS0Mujr+YITvvB5Yt/9kbNft8wvAe5N8Evgk8NKququb/3XAS7t1rRz6niuBTUlu\nA57C1HNhRrUZWNMNX8hXPh3yGgx6HQKfdSNNk+QK4JnA2d3JW2lZM+glqXFedSMtkiRnA789rfme\nqnpRH/VofLlHL0mN82SsJDXOoJekxhn0ktQ4g16SGvd/wffx/8ADWecAAAAASUVORK5CYII=\n", 302 | "text/plain": [ 303 | "" 304 | ] 305 | }, 306 | "metadata": {}, 307 | "output_type": "display_data" 308 | }, 309 | { 310 | "data": { 311 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEmCAYAAABs7FscAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFWlJREFUeJzt3X+0XWV95/H3pwGBpYNCiWkWCU3axrZBBWpEbWlrpRZa\nrMGqGJw66ZQxjqLizLQMdGZWf610UWdNV6uV1ihqxBFMi0iEGSiNUH+MJYSfEjAlQ6DABBKlU1t0\nUOJ3/jg7crjk5p6be+49uc99v9bKOns/+9lnf/cSP+e5z95nn1QVkqR2fd+oC5AkTS+DXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcQa9JDVuoKBPcn+SryS5PcmWru3oJNcnubd7Paqv/4VJtifZluS06Spe\nkjSxyYzof66qTqyqFd36BcCmqloGbOrWSbIcWAUcD5wOXJxk3hBrliRNwiFT2Hcl8MpueT1wI/Af\nu/bLq+oJYEeS7cDJwJfHe6NjjjmmlixZMoVSJGnuueWWW75WVfMn6jdo0Bfw10n2AB+sqnXAgqra\n2W1/BFjQLR8L/G3fvg91bU+TZA2wBuC4445jy5YtA5YiSQJI8sAg/QYN+lOq6uEkzweuT/LV/o1V\nVUkm9dCc7sNiHcCKFSt84I4kTZOB5uir6uHudRdwJb2pmEeTLAToXnd13R8GFvftvqhrkySNwIRB\nn+TZSf7F3mXgF4C7gI3A6q7bauCqbnkjsCrJYUmWAsuAzcMuXJI0mEGmbhYAVybZ2/+TVXVtkpuB\nDUnOAR4AzgKoqq1JNgB3A08C51bVnmmpXpI0oQmDvqruA07YR/vXgVPH2WctsHbK1UmSpsxvxkpS\n4wx6SWqcQS9JjZvKN2NHbskF18zo8e6/6IwZPZ4kDYMjeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0\nktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9J\njTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktS4\ngYM+ybwktyW5uls/Osn1Se7tXo/q63thku1JtiU5bToKlyQNZjIj+vOAe/rWLwA2VdUyYFO3TpLl\nwCrgeOB04OIk84ZTriRpsgYK+iSLgDOAD/c1rwTWd8vrgTP72i+vqieqagewHTh5OOVKkiZr0BH9\nHwPnA9/ta1tQVTu75UeABd3yscCDff0e6tqeJsmaJFuSbNm9e/fkqpYkDWzCoE/yGmBXVd0yXp+q\nKqAmc+CqWldVK6pqxfz58yezqyRpEg4ZoM9PAa9N8kvA4cCRST4BPJpkYVXtTLIQ2NX1fxhY3Lf/\noq5NkjQCE47oq+rCqlpUVUvoXWT9XFX9KrARWN11Ww1c1S1vBFYlOSzJUmAZsHnolUuSBjLIiH48\nFwEbkpwDPACcBVBVW5NsAO4GngTOrao9U65UknRAJhX0VXUjcGO3/HXg1HH6rQXWTrE2SdIQ+M1Y\nSWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJek\nxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqc\nQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY2bMOiTHJ5kc5I7kmxN8rtd\n+9FJrk9yb/d6VN8+FybZnmRbktOm8wQkSfs3yIj+CeBVVXUCcCJwepKXAxcAm6pqGbCpWyfJcmAV\ncDxwOnBxknnTUbwkaWITBn31/HO3emj3r4CVwPqufT1wZre8Eri8qp6oqh3AduDkoVYtSRrYQHP0\nSeYluR3YBVxfVTcBC6pqZ9flEWBBt3ws8GDf7g91bWPfc02SLUm27N69+4BPQJK0fwMFfVXtqaoT\ngUXAyUleOGZ70RvlD6yq1lXViqpaMX/+/MnsKkmahEnddVNV/xe4gd7c+6NJFgJ0r7u6bg8Di/t2\nW9S1SZJGYJC7buYneV63fATwauCrwEZgdddtNXBVt7wRWJXksCRLgWXA5mEXLkkazCED9FkIrO/u\nnPk+YENVXZ3ky8CGJOcADwBnAVTV1iQbgLuBJ4Fzq2rP9JQvSZrIhEFfVXcCJ+2j/evAqePssxZY\nO+XqJElT5jdjJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJek\nxhn0ktQ4g16SGmfQS1LjBnkevUZkyQXXzOjx7r/ojBk9nqSZ4Yhekhpn0EtS4wx6SWqcQS9JjTPo\nJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuN8BIJGxkc8SDPDEb0kNc6gl6TGGfSS1DiDXpIaZ9BL\nUuMMeklqnEEvSY0z6CWpcRMGfZLFSW5IcneSrUnO69qPTnJ9knu716P69rkwyfYk25KcNp0nIEna\nv0FG9E8C/6GqlgMvB85Nshy4ANhUVcuATd063bZVwPHA6cDFSeZNR/GSpIlNGPRVtbOqbu2W/wm4\nBzgWWAms77qtB87sllcCl1fVE1W1A9gOnDzswiVJg5nUHH2SJcBJwE3Agqra2W16BFjQLR8LPNi3\n20NdmyRpBAYO+iTPAa4A3lNV3+jfVlUF1GQOnGRNki1JtuzevXsyu0qSJmGgoE9yKL2Q/+9V9emu\n+dEkC7vtC4FdXfvDwOK+3Rd1bU9TVeuqakVVrZg/f/6B1i9JmsAgd90EuAS4p6r+qG/TRmB1t7wa\nuKqvfVWSw5IsBZYBm4dXsiRpMgZ5
[two base64-encoded PNG outputs removed: bar plots showing the number of observations per label for BsmtQual_NA, FireplaceQu_NA and GarageType_NA]
329 |    ],
330 |    "source": [
331 |     "# let's look at the number of observations per label on each of the variables\n",
332 |     "for col in ['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']:\n",
333 |     "    X_train.groupby([col])[col].count().sort_values(ascending=False).plot.bar()\n",
334 |     "    plt.show()"
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "markdown",
339 |    "metadata": {},
340 |    "source": [
341 |     "In the plots we can see, for each of the 3 categorical variables, the number of observations per label, including the label that captures the missing values."
342 |    ]
343 |   },
344 |   {
345 |    "cell_type": "code",
346 |    "execution_count": 9,
347 |    "metadata": {
348 |     "collapsed": true
349 |    },
350 |    "outputs": [],
351 |    "source": [
352 |     "# let's transform the categories into numbers quick and dirty so we can use them in scikit-learn\n",
353 |     "\n",
354 |     "# the loop below numbers the labels from 0 to n-1, n being the number of\n",
355 |     "# different labels within the variable\n",
356 |     "\n",
357 |     "for col in ['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']:\n",
358 |     "    labels_dict = {k: i for i, k in enumerate(X_train[col].unique())}\n",
359 |     "    X_train.loc[:, col] = X_train.loc[:, col].map(labels_dict)\n",
360 |     "    X_test.loc[:, col] = X_test.loc[:, col].map(labels_dict)"
361 |    ]
362 |   },
363 |   {
364 |    "cell_type": "markdown",
365 |    "metadata": {},
366 |    "source": [
367 |     "### Linear Regression"
368 |    ]
369 |   },
370 |   {
371 |    "cell_type": "code",
372 |    "execution_count": 10,
373 |    "metadata": {},
374 |    "outputs": [
375 |     {
376 |      "name": "stdout",
377 |      "output_type": "stream",
378 |      "text": [
379 |       "Train set, Missing label imputation\n",
380 |       "Linear Regression mse: 4810016310.466396\n",
381 |       "Test set, Missing label imputation\n",
382 |       "Linear Regression mse: 5562566516.826057\n"
383 |      ]
384 |     }
385 |    ],
386 |    "source": [
387 |     "# Let's evaluate the performance of Linear Regression\n",
388 |     "\n",
389 |     "linreg = LinearRegression()\n",
390 |     "linreg.fit(X_train[['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']], y_train)\n",
391 |     "print('Train set, Missing label imputation')\n",
392 |     "pred = linreg.predict(X_train[['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']])\n",
393 |     "print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))\n",
394 |     "print('Test set, Missing label imputation')\n",
395 |     "pred = linreg.predict(X_test[['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']])\n",
396 |     "print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))"
397 |    ]
398 |   },
399 |   {
400 |    "cell_type": "markdown",
401 |    "metadata": {},
402 |    "source": [
403 |     "In the previous lectures we trained linear regressions on data where the missing observations were replaced by i) the most frequent category, or ii) random sampling plus an additional variable to indicate missingness. We obtained the following mse for the testing sets:\n",
404 |     "\n",
405 |     "- frequent label imputation mse: 6456070592\n",
406 |     "- random sampling + additional variable: 4911877327\n",
407 |     "- adding a 'Missing' label (this lecture): 5562566516\n",
408 |     "\n",
409 |     "Therefore, adding an additional 'Missing' category lies between the 2 other approaches in terms of performance.\n",
410 |     "\n",
411 |     "A next step could be to investigate which approach works better for each variable individually, to try and optimise the performance of the linear regression even more; a sketch of that idea follows below."
412 |    ]
413 |   },
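  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sketch of that next step. Assume we kept the three differently imputed versions of the training set from this and the previous lectures under the hypothetical names X_train_frequent, X_train_random and X_train_missing (they are not defined in this notebook). We could then cross-validate each variable under each scheme separately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "# hypothetical: the three imputed versions of the same training data\n",
    "versions = {'frequent category': X_train_frequent,\n",
    "            'random sample': X_train_random,\n",
    "            'missing label': X_train_missing}\n",
    "\n",
    "for col in ['BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']:\n",
    "    for name, X in versions.items():\n",
    "        # 3-fold cross-validated mse of a linear regression on this column alone\n",
    "        mse = -cross_val_score(LinearRegression(), X[[col]], y_train,\n",
    "                               scoring='neg_mean_squared_error', cv=3).mean()\n",
    "        print(col, name, mse)"
   ]
  },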
414 |   {
415 |    "cell_type": "markdown",
416 |    "metadata": {
417 |     "collapsed": true
418 |    },
419 |    "source": [
420 |     "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**"
421 |    ]
422 |   },
423 |   {
424 |    "cell_type": "code",
425 |    "execution_count": null,
426 |    "metadata": {
427 |     "collapsed": true
428 |    },
429 |    "outputs": [],
430 |    "source": []
431 |   }
432 |  ],
433 |  "metadata": {
434 |   "kernelspec": {
435 |    "display_name": "Python 3",
436 |    "language": "python",
437 |    "name": "python3"
438 |   },
439 |   "language_info": {
440 |    "codemirror_mode": {
441 |     "name": "ipython",
442 |     "version": 3
443 |    },
444 |    "file_extension": ".py",
445 |    "mimetype": "text/x-python",
446 |    "name": "python",
447 |    "nbconvert_exporter": "python",
448 |    "pygments_lexer": "ipython3",
449 |    "version": "3.6.1"
450 |   },
451 |   "toc": {
452 |    "nav_menu": {},
453 |    "number_sections": true,
454 |    "sideBar": true,
455 |    "skip_h1_title": false,
456 |    "toc_cell": false,
457 |    "toc_position": {},
458 |    "toc_section_display": "block",
459 |    "toc_window_display": true
460 |   }
461 |  },
462 |  "nbformat": 4,
463 |  "nbformat_minor": 2
464 | }
--------------------------------------------------------------------------------
/07.2_Conclusion_when_to_use_each_NA_imputation_methods.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Conclusion: when to use each NA imputation method\n",
  8 |     "\n",
  9 |     "\n",
 10 |     "### Which missing value imputation method shall I use, and when?\n",
 11 |     "\n",
 12 |     "There is no straightforward answer to this question; which method to use on which occasion is not set in stone. This is totally up to you.\n",
 13 |     "\n",
 14 |     "Different methods make different assumptions and have different advantages and disadvantages (see the lecture \"Overview of missing value imputation methods\").\n",
 15 |     "\n",
 16 |     "**As a guideline I would say**:\n",
 17 |     "\n",
 18 |     "If missing values are less than 5% of the variable:\n",
 19 |     "\n",
 20 |     "- replace by the mean/median or by a random sample (numerical variables)\n",
 21 |     "- replace by the most frequent category (categorical variables)\n",
 22 |     "\n",
 23 |     "If missing values are more than 5% of the variable:\n",
 24 |     "\n",
 25 |     "- do mean/median imputation and add an additional binary variable to capture the missingness (numerical variables)\n",
 26 |     "- add a 'Missing' label (categorical variables)\n",
 27 |     "\n",
 28 |     "If the number of NA in a variable is small, they are unlikely to have a strong impact on the variable / target that you are trying to predict. Treating them specially will therefore most certainly add noise to the variables, and it is more useful to replace by the mean / a random sample to preserve the variable distribution.\n",
 29 |     "\n",
 30 |     "If the target you are trying to predict is however highly unbalanced, then it might be the case that this small number of NA is indeed informative. You would have to check this out.\n",
 31 |     "\n",
 32 |     "**Exceptions to the guideline**:\n",
 33 |     "\n",
 34 |     "- if you / your company suspect that NA are not missing at random and do not want to attribute the most common occurrence to NA\n",
 35 |     "- if you don't want to increase the feature space by adding an additional variable to indicate missingness\n",
 36 |     "\n",
 37 |     "In these cases, replace by a value at the far end of the distribution or by an arbitrary value."
 38 |    ]
 39 |   },
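  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this guideline in pandas, assuming a dataframe df with a numerical column 'age' and a categorical column 'city' (hypothetical names, not from the course datasets):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame({'age': [25, np.nan, 40, 31],\n",
    "                   'city': ['London', np.nan, 'Paris', 'London']})\n",
    "\n",
    "if df['age'].isnull().mean() < 0.05:\n",
    "    # few NA: mean/median imputation preserves the variable distribution\n",
    "    df['age'] = df['age'].fillna(df['age'].median())\n",
    "else:\n",
    "    # many NA: impute, and also flag the missingness in an additional binary variable\n",
    "    df['age_NA'] = np.where(df['age'].isnull(), 1, 0)\n",
    "    df['age'] = df['age'].fillna(df['age'].median())\n",
    "\n",
    "# categorical variable: add an explicit 'Missing' label\n",
    "df['city'] = df['city'].fillna('Missing')"
   ]
  },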
 40 |   {
 41 |    "cell_type": "markdown",
 42 |    "metadata": {},
 43 |    "source": [
 44 |     "#### Final note\n",
 45 |     "\n",
 46 |     "NA imputation for data competitions and for business settings can be approached differently. In data competitions, a tiny increase in performance can be the difference between 1st and 2nd place. Therefore, you may want to try all of the imputation methods and keep the one that gives the best machine learning model performance. It may be the case that different NA imputation methods help different models make better predictions.\n",
 47 |     "\n",
 48 |     "In business scenarios, scientists don't usually have the time to do lengthy studies, and may therefore choose to streamline the feature engineering procedure. In these cases, it is common practice to follow the guidelines above, taking into account the exceptions, and to apply the same processing to all features.\n",
 49 |     "\n",
 50 |     "This streamlined pre-processing may not lead to the most predictive features possible, yet it makes the delivery of feature engineering and machine learning models substantially faster. Thus, the business can start enjoying the power of machine learning sooner."
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "code",
 55 |    "execution_count": null,
 56 |    "metadata": {
 57 |     "collapsed": true
 58 |    },
 59 |    "outputs": [],
 60 |    "source": []
 61 |   }
 62 |  ],
 63 |  "metadata": {
 64 |   "kernelspec": {
 65 |    "display_name": "Python 3",
 66 |    "language": "python",
 67 |    "name": "python3"
 68 |   },
 69 |   "language_info": {
 70 |    "codemirror_mode": {
 71 |     "name": "ipython",
 72 |     "version": 3
 73 |    },
 74 |    "file_extension": ".py",
 75 |    "mimetype": "text/x-python",
 76 |    "name": "python",
 77 |    "nbconvert_exporter": "python",
 78 |    "pygments_lexer": "ipython3",
 79 |    "version": "3.6.1"
 80 |   },
 81 |   "toc": {
 82 |    "nav_menu": {},
 83 |    "number_sections": true,
 84 |    "sideBar": true,
 85 |    "skip_h1_title": false,
 86 |    "toc_cell": false,
 87 |    "toc_position": {},
 88 |    "toc_section_display": "block",
 89 |    "toc_window_display": false
 90 |   }
 91 |  },
 92 |  "nbformat": 4,
 93 |  "nbformat_minor": 2
 94 | }
--------------------------------------------------------------------------------
/10.10_Bonus_Additional_reading_resources.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Bonus: Additional reading resources\n",
  8 |     "\n",
  9 |     "### Weight of Evidence\n",
 10 |     "\n",
 11 |     "- [WoE Module from StatSoft](http://documentation.statsoft.com/portals/0/formula%20guide/Weight%20of%20Evidence%20Formula%20Guide.pdf)\n",
 12 |     "- [WoE Statistica](http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview)\n",
 13 |     "\n",
 14 |     "### Scorecard development for credit risk\n",
 15 |     "\n",
 16 |     "- [Plug and Score](https://plug-n-score.com/learning/scorecard-development-stages.htm#Binning)\n",
 17 |     "- [Scorecard development from SAS](http://www.sas.com/storefront/aux/en/spcrisks/59376_excerpt.pdf)\n",
 18 |     "\n",
 19 |     "### Winsorisation and top/bottom coding\n",
 20 |     "\n",
 21 |     "- [Winsorisation](http://www.statisticshowto.com/winsorize/)\n",
 22 |     "- [Censoring data](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1431352)\n",
 23 |     "\n",
 24 |     "### Alternative methods of categorical variable encoding\n",
 25 |     "\n",
 26 |     "- [Will's blog](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)\n",
 27 |     "- [Classification of Tutor systems with high cardinality (KDD competition)](http://pslcdatashop.org/KDDCup/workshop/)\n",
 28 |     "- [Light weight solution to KDD data mining challenge](http://pslcdatashop.org/KDDCup/workshop/)\n",
 29 |     "\n",
 30 |     "### Several methods described in this course\n",
 31 |     "\n",
 32 |     "- [The 2009 Knowledge Discovery in 
Data Competition (KDD Cup 2009)](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)\n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": { 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [] 45 | } 46 | ], 47 | "metadata": { 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "codemirror_mode": { 55 | "name": "ipython", 56 | "version": 3 57 | }, 58 | "file_extension": ".py", 59 | "mimetype": "text/x-python", 60 | "name": "python", 61 | "nbconvert_exporter": "python", 62 | "pygments_lexer": "ipython3", 63 | "version": "3.6.1" 64 | }, 65 | "toc": { 66 | "nav_menu": {}, 67 | "number_sections": true, 68 | "sideBar": true, 69 | "skip_h1_title": false, 70 | "toc_cell": false, 71 | "toc_position": {}, 72 | "toc_section_display": "block", 73 | "toc_window_display": false 74 | } 75 | }, 76 | "nbformat": 4, 77 | "nbformat_minor": 2 78 | } 79 | -------------------------------------------------------------------------------- /10.2_One_hot_encoding_variables_with_many_labels.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## One Hot Encoding - variables with many categories\n", 8 | "\n", 9 | "We observed in the previous lecture that if a categorical variable contains multiple labels, then by re-encoding them using one hot encoding we will expand the feature space dramatically.\n", 10 | "\n", 11 | "See below:" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/html": [ 22 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 97 | ], 98 | "text/plain": [ 99 | " X1 X2 X3 X4 X5 X6\n", 100 | "0 v at a d u j\n", 101 | "1 t av e d y l\n", 102 | "2 w n c d x j\n", 103 | "3 t n f d x l\n", 104 | "4 v n f d h d" 105 | ] 106 | }, 107 | "execution_count": 1, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "import pandas as pd\n", 114 | "import numpy as np\n", 115 | "\n", 116 | "# let's load the mercedes benz dataset for demonstration, only the categorical variables\n", 117 | "\n", 118 | "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])\n", 119 | "data.head()" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 2, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "X1 : 27 labels\n", 132 | "X2 : 44 labels\n", 133 | "X3 : 7 labels\n", 134 | "X4 : 4 labels\n", 135 | "X5 : 29 labels\n", 136 | "X6 : 12 labels\n" 137 | ] 138 | } 139 | ], 140 | "source": [ 141 | "# let's have a look at how many labels each variable has\n", 142 | "\n", 143 | "for col in data.columns:\n", 144 | " print(col, ': ', len(data[col].unique()), ' labels')" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 3, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "data": { 154 | "text/plain": [ 155 | "(4209, 117)" 156 | ] 157 | }, 158 | "execution_count": 3, 159 | "metadata": {}, 160 | "output_type": "execute_result" 161 | } 162 | ], 163 | "source": [ 164 | "# let's examine how many columns we will obtain after one hot encoding these variables\n", 165 | "pd.get_dummies(data, drop_first=True).shape" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "We can see that from just 6 initial categorical variables, we end up with 117 new variables. \n", 173 | "\n", 174 | "These numbers are still not huge, and in practice we could work with them relatively easily. However, in business datasets and also other Kaggle or KDD datasets, it is not unusual to find several categorical variables with multiple labels. And if we use one hot encoding on them, we will end up with datasets with thousands of columns.\n", 175 | "\n", 176 | "What can we do instead?\n", 177 | "\n", 178 | "In the winning solution of the KDD 2009 cup: \"Winning the KDD Cup Orange Challenge with Ensemble Selection\" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one hot encoding to the 10 most frequent labels of the variable. This means that they would make one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, that in this case will be dropped. Thus, the 10 new dummy variables indicate if one of the 10 most frequent labels is present (1) or not (0) for a particular observation.\n", 179 | "\n", 180 | "How can we do that in python?" 
181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 4, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "as 1659\n", 192 | "ae 496\n", 193 | "ai 415\n", 194 | "m 367\n", 195 | "ak 265\n", 196 | "r 153\n", 197 | "n 137\n", 198 | "s 94\n", 199 | "f 87\n", 200 | "e 81\n", 201 | "Name: X2, dtype: int64" 202 | ] 203 | }, 204 | "execution_count": 4, 205 | "metadata": {}, 206 | "output_type": "execute_result" 207 | } 208 | ], 209 | "source": [ 210 | "# let's find the top 10 most frequent categories for the variable X2\n", 211 | "\n", 212 | "data.X2.value_counts().sort_values(ascending=False).head(10)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 5, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']" 224 | ] 225 | }, 226 | "execution_count": 5, 227 | "metadata": {}, 228 | "output_type": "execute_result" 229 | } 230 | ], 231 | "source": [ 232 | "# let's make a list with the most frequent categories of the variable\n", 233 | "\n", 234 | "top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]\n", 235 | "top_10" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 6, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/html": [ 246 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 421 | ], 422 | "text/plain": [ 423 | " X2 as ae ai m ak r n s f e\n", 424 | "0 at 0 0 0 0 0 0 0 0 0 0\n", 425 | "1 av 0 0 0 0 0 0 0 0 0 0\n", 426 | "2 n 0 0 0 0 0 0 1 0 0 0\n", 427 | "3 n 0 0 0 0 0 0 1 0 0 0\n", 428 | "4 n 0 0 0 0 0 0 1 0 0 0\n", 429 | "5 e 0 0 0 0 0 0 0 0 0 1\n", 430 | "6 e 0 0 0 0 0 0 0 0 0 1\n", 431 | "7 as 1 0 0 0 0 0 0 0 0 0\n", 432 | "8 as 1 0 0 0 0 0 0 0 0 0\n", 433 | "9 aq 0 0 0 0 0 0 0 0 0 0" 434 | ] 435 | }, 436 | "execution_count": 6, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "# and now we make the 10 binary variables\n", 443 | "\n", 444 | "for label in top_10:\n", 445 | " data[label] = np.where(data['X2']==label, 1, 0)\n", 446 | "\n", 447 | "data[['X2']+top_10].head(10)" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 7, 453 | "metadata": {}, 454 | "outputs": [ 455 | { 456 | "data": { 457 | "text/html": [ 458 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]",
593 |       ],
594 |       "text/plain": [
595 |        "  X1  X2 X3 X4 X5 X6  X2_as  X2_ae  X2_ai  X2_m  X2_ak  X2_r  X2_n  X2_s  \\\n",
596 |        "0  v  at  a  d  u  j      0      0      0     0      0     0     0     0   \n",
597 |        "1  t  av  e  d  y  l      0      0      0     0      0     0     0     0   \n",
598 |        "2  w   n  c  d  x  j      0      0      0     0      0     0     1     0   \n",
599 |        "3  t   n  f  d  x  l      0      0      0     0      0     0     1     0   \n",
600 |        "4  v   n  f  d  h  d      0      0      0     0      0     0     1     0   \n",
601 |        "\n",
602 |        "   X2_f  X2_e  \n",
603 |        "0     0     0  \n",
604 |        "1     0     0  \n",
605 |        "2     0     0  \n",
606 |        "3     0     0  \n",
607 |        "4     0     0  "
608 |       ]
609 |      },
610 |      "execution_count": 7,
611 |      "metadata": {},
612 |      "output_type": "execute_result"
613 |     }
614 |    ],
615 |    "source": [
616 |     "# get whole set of dummy variables, for all the categorical variables\n",
617 |     "\n",
618 |     "def one_hot_top_x(df, variable, top_x_labels):\n",
619 |     "    # function to create dummy variables for the most frequent labels of a variable;\n",
620 |     "    # we can vary the number of most frequent labels that we encode\n",
621 |     "    \n",
622 |     "    for label in top_x_labels:\n",
623 |     "        # note: use df (the function argument) rather than the global dataframe,\n",
624 |     "        # so that the function works on any dataframe we pass in\n",
625 |     "        df[variable+'_'+label] = np.where(df[variable]==label, 1, 0)\n",
626 |     "\n",
627 |     "# read the data again\n",
628 |     "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])\n",
629 |     "\n",
630 |     "# encode X2 into the 10 most frequent categories\n",
631 |     "one_hot_top_x(data, 'X2', top_10)\n",
632 |     "data.head()"
633 |    ]
634 |   },
635 |   {
636 |    "cell_type": "code",
637 |    "execution_count": 8,
638 |    "metadata": {},
639 |    "outputs": [
640 |     {
641 |      "data": {
642 |       "text/html": [
643 |        "<div>
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]",
807 |       ],
808 |       "text/plain": [
809 |        "  X1  X2 X3 X4 X5 X6  X2_as  X2_ae  X2_ai  X2_m  ...  X1_aa  X1_s  X1_b  \\\n",
810 |        "0  v  at  a  d  u  j      0      0      0     0  ...      0     0     0   \n",
811 |        "1  t  av  e  d  y  l      0      0      0     0  ...      0     0     0   \n",
812 |        "2  w   n  c  d  x  j      0      0      0     0  ...      0     0     0   \n",
813 |        "3  t   n  f  d  x  l      0      0      0     0  ...      0     0     0   \n",
814 |        "4  v   n  f  d  h  d      0      0      0     0  ...      0     0     0   \n",
815 |        "\n",
816 |        "   X1_l  X1_v  X1_r  X1_i  X1_a  X1_c  X1_o  \n",
817 |        "0     0     1     0     0     0     0     0  \n",
818 |        "1     0     0     0     0     0     0     0  \n",
819 |        "2     0     0     0     0     0     0     0  \n",
820 |        "3     0     0     0     0     0     0     0  \n",
821 |        "4     0     1     0     0     0     0     0  \n",
822 |        "\n",
823 |        "[5 rows x 26 columns]"
824 |       ]
825 |      },
826 |      "execution_count": 8,
827 |      "metadata": {},
828 |      "output_type": "execute_result"
829 |     }
830 |    ],
831 |    "source": [
832 |     "# find the 10 most frequent categories for X1\n",
833 |     "top_10 = [x for x in data.X1.value_counts().sort_values(ascending=False).head(10).index]\n",
834 |     "\n",
835 |     "# now create dummy variables for the 10 most frequent categories of X1\n",
836 |     "one_hot_top_x(data, 'X1', top_10)\n",
837 |     "data.head()"
838 |    ]
839 |   },
840 |   {
841 |    "cell_type": "markdown",
842 |    "metadata": {},
843 |    "source": [
844 |     "### One hot encoding of the top categories\n",
845 |     "\n",
846 |     "### Advantages\n",
847 |     "\n",
848 |     "- Straightforward to implement\n",
849 |     "- Does not require hours of variable exploration\n",
850 |     "- Does not massively expand the feature space (the number of columns in the dataset)\n",
851 |     "\n",
852 |     "### Disadvantages\n",
853 |     "\n",
854 |     "- Does not add any information that may make the variable more predictive\n",
855 |     "- Does not keep the information of the ignored labels\n",
856 |     "\n",
857 |     "Because it is not unusual that categorical variables have a few dominating categories while the remaining labels add mostly noise, this is a quite simple and straightforward approach that may be useful on many occasions.\n",
858 |     "\n",
859 |     "It is worth noting that the choice of the top 10 labels is totally arbitrary. You could also choose the top 5, or the top 20.\n",
860 |     "\n",
861 |     "This modelling was more than enough for the team to win the KDD 2009 cup. They also did some other powerful feature engineering, as we will see in the following lectures, which improved the performance of the variables dramatically."
862 |    ]
863 |   },
865 |   {
866 |    "cell_type": "code",
867 |    "execution_count": null,
868 |    "metadata": {
869 |     "collapsed": true
870 |    },
871 |    "outputs": [],
872 |    "source": []
873 |   }
874 |  ],
875 |  "metadata": {
876 |   "kernelspec": {
877 |    "display_name": "Python 3",
878 |    "language": "python",
879 |    "name": "python3"
880 |   },
881 |   "language_info": {
882 |    "codemirror_mode": {
883 |     "name": "ipython",
884 |     "version": 3
885 |    },
886 |    "file_extension": ".py",
887 |    "mimetype": "text/x-python",
888 |    "name": "python",
889 |    "nbconvert_exporter": "python",
890 |    "pygments_lexer": "ipython3",
891 |    "version": "3.6.1"
892 |   },
893 |   "toc": {
894 |    "nav_menu": {},
895 |    "number_sections": true,
896 |    "sideBar": true,
897 |    "skip_h1_title": false,
898 |    "toc_cell": false,
899 |    "toc_position": {},
900 |    "toc_section_display": "block",
901 |    "toc_window_display": true
902 |   }
903 |  },
904 |  "nbformat": 4,
905 |  "nbformat_minor": 2
906 | }
--------------------------------------------------------------------------------
/10.3_Ordinal_numbering_encoding.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Ordinal numbering encoding\n",
  8 |     "\n",
  9 |     "**Ordinal categorical variables**\n",
 10 |     "\n",
 11 |     "Categorical variables whose categories can be meaningfully ordered are called ordinal. For example:\n",
 12 |     "\n",
 13 |     "- Student's grade in an exam (A, B, C or Fail).\n",
 14 |     "- Days of the week, with Monday = 1 and Sunday = 7.\n",
 15 |     "- Educational level, with the categories Elementary school, High school, College graduate, PhD ranked from 1 to 4.\n",
 16 |     "\n",
 17 |     "When the categorical variable is ordinal, the most straightforward approach is to replace the labels by some ordinal number.\n",
 18 |     "\n",
 19 |     "### Advantages\n",
 20 |     "\n",
 21 |     "- Keeps the semantic information of the variable (human-readable content)\n",
 22 |     "- Straightforward\n",
 23 |     "\n",
 24 |     "### Disadvantage\n",
 25 |     "\n",
 26 |     "- Does not add any information that could make the variable more predictive\n",
 27 |     "\n",
 28 |     "I will simulate some data below to demonstrate this technique."
 29 |    ]
 30 |   },
 31 |   {
 32 |    "cell_type": "code",
 33 |    "execution_count": 1,
 34 |    "metadata": {
 35 |     "collapsed": true
 36 |    },
 37 |    "outputs": [],
 38 |    "source": [
 39 |     "import pandas as pd\n",
 40 |     "import datetime"
 41 |    ]
 42 |   },
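  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before the date example below, a quick sketch of the same idea on the exam-grade example from the introduction, using pandas' ordered categorical type as an alternative to a hand-written mapping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grades = pd.Series(['A', 'C', 'B', 'Fail', 'A'])\n",
    "\n",
    "# declare the meaningful order explicitly, from worst to best\n",
    "grades = pd.Categorical(grades, categories=['Fail', 'C', 'B', 'A'], ordered=True)\n",
    "\n",
    "# the integer codes respect that order: Fail=0, C=1, B=2, A=3\n",
    "print(grades.codes)"
   ]
  },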
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": 2,
 46 |    "metadata": {},
 47 |    "outputs": [
 48 |     {
 49 |      "data": {
 50 |       "text/html": [
 51 |        "<div> [flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]"
196 |       ],
197 |       "text/plain": [
198 |        "                          day\n",
199 |        "0  2017-11-24 23:37:17.497960\n",
200 |        "1  2017-11-23 23:37:17.497960\n",
201 |        "2  2017-11-22 23:37:17.497960\n",
202 |        "3  2017-11-21 23:37:17.497960\n",
203 |        "4  2017-11-20 23:37:17.497960\n",
204 |        "5  2017-11-19 23:37:17.497960\n",
205 |        "6  2017-11-18 23:37:17.497960\n",
206 |        "7  2017-11-17 23:37:17.497960\n",
207 |        "8  2017-11-16 23:37:17.497960\n",
208 |        "9  2017-11-15 23:37:17.497960\n",
209 |        "10 2017-11-14 23:37:17.497960\n",
210 |        "11 2017-11-13 23:37:17.497960\n",
211 |        "12 2017-11-12 23:37:17.497960\n",
212 |        "13 2017-11-11 23:37:17.497960\n",
213 |        "14 2017-11-10 23:37:17.497960\n",
214 |        "15 2017-11-09 23:37:17.497960\n",
215 |        "16 2017-11-08 23:37:17.497960\n",
216 |        "17 2017-11-07 23:37:17.497960\n",
217 |        "18 2017-11-06 23:37:17.497960\n",
218 |        "19 2017-11-05 23:37:17.497960\n",
219 |        "20 2017-11-04 23:37:17.497960\n",
220 |        "21 2017-11-03 23:37:17.497960\n",
221 |        "22 2017-11-02 23:37:17.497960\n",
222 |        "23 2017-11-01 23:37:17.497960\n",
223 |        "24 2017-10-31 23:37:17.497960\n",
224 |        "25 2017-10-30 23:37:17.497960\n",
225 |        "26 2017-10-29 23:37:17.497960\n",
226 |        "27 2017-10-28 23:37:17.497960\n",
227 |        "28 2017-10-27 23:37:17.497960\n",
228 |        "29 2017-10-26 23:37:17.497960"
229 |       ]
230 |      },
231 |      "execution_count": 2,
232 |      "metadata": {},
233 |      "output_type": "execute_result"
234 |     }
235 |    ],
236 |    "source": [
237 |     "# create a variable with dates, and from that extract the weekday\n",
238 |     "# I create a list of the 30 days leading up to today\n",
239 |     "# and then transform it into a dataframe\n",
240 |     "\n",
241 |     "base = datetime.datetime.today()\n",
242 |     "date_list = [base - datetime.timedelta(days=x) for x in range(0, 30)]\n",
243 |     "df = pd.DataFrame(date_list)\n",
244 |     "df.columns = ['day']\n",
245 |     "df"
246 |    ]
247 |   },
248 |   {
249 |    "cell_type": "code",
250 |    "execution_count": 3,
251 |    "metadata": {},
252 |    "outputs": [
253 |     {
254 |      "data": {
255 |       "text/html": [
256 |        "<div>
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 307 | ], 308 | "text/plain": [ 309 | " day day_of_week\n", 310 | "0 2017-11-24 23:37:17.497960 Friday\n", 311 | "1 2017-11-23 23:37:17.497960 Thursday\n", 312 | "2 2017-11-22 23:37:17.497960 Wednesday\n", 313 | "3 2017-11-21 23:37:17.497960 Tuesday\n", 314 | "4 2017-11-20 23:37:17.497960 Monday" 315 | ] 316 | }, 317 | "execution_count": 3, 318 | "metadata": {}, 319 | "output_type": "execute_result" 320 | } 321 | ], 322 | "source": [ 323 | "# extract the week day name\n", 324 | "\n", 325 | "df['day_of_week'] = df['day'].dt.weekday_name\n", 326 | "df.head()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 4, 332 | "metadata": {}, 333 | "outputs": [ 334 | { 335 | "data": { 336 | "text/html": [ 337 | "
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]",
423 |       ],
424 |       "text/plain": [
425 |        "                         day day_of_week  day_ordinal\n",
426 |        "0 2017-11-24 23:37:17.497960      Friday            5\n",
427 |        "1 2017-11-23 23:37:17.497960    Thursday            4\n",
428 |        "2 2017-11-22 23:37:17.497960   Wednesday            3\n",
429 |        "3 2017-11-21 23:37:17.497960     Tuesday            2\n",
430 |        "4 2017-11-20 23:37:17.497960      Monday            1\n",
431 |        "5 2017-11-19 23:37:17.497960      Sunday            7\n",
432 |        "6 2017-11-18 23:37:17.497960    Saturday            6\n",
433 |        "7 2017-11-17 23:37:17.497960      Friday            5\n",
434 |        "8 2017-11-16 23:37:17.497960    Thursday            4\n",
435 |        "9 2017-11-15 23:37:17.497960   Wednesday            3"
436 |       ]
437 |      },
438 |      "execution_count": 4,
439 |      "metadata": {},
440 |      "output_type": "execute_result"
441 |     }
442 |    ],
443 |    "source": [
444 |     "# Engineer the categorical variable by ordinal number replacement\n",
445 |     "\n",
446 |     "weekday_map = {'Monday':1,\n",
447 |     "               'Tuesday':2,\n",
448 |     "               'Wednesday':3,\n",
449 |     "               'Thursday':4,\n",
450 |     "               'Friday':5,\n",
451 |     "               'Saturday':6,\n",
452 |     "               'Sunday':7\n",
453 |     "}\n",
454 |     "\n",
455 |     "df['day_ordinal'] = df.day_of_week.map(weekday_map)\n",
456 |     "df.head(10)"
457 |    ]
458 |   },
459 |   {
460 |    "cell_type": "markdown",
461 |    "metadata": {},
462 |    "source": [
463 |     "We can now use the variable day_ordinal in sklearn to build machine learning models."
464 |    ]
465 |   },
466 |   {
467 |    "cell_type": "code",
468 |    "execution_count": null,
469 |    "metadata": {
470 |     "collapsed": true
471 |    },
472 |    "outputs": [],
473 |    "source": []
474 |   }
475 |  ],
476 |  "metadata": {
477 |   "kernelspec": {
478 |    "display_name": "Python 3",
479 |    "language": "python",
480 |    "name": "python3"
481 |   },
482 |   "language_info": {
483 |    "codemirror_mode": {
484 |     "name": "ipython",
485 |     "version": 3
486 |    },
487 |    "file_extension": ".py",
488 |    "mimetype": "text/x-python",
489 |    "name": "python",
490 |    "nbconvert_exporter": "python",
491 |    "pygments_lexer": "ipython3",
492 |    "version": "3.6.1"
493 |   },
494 |   "toc": {
495 |    "nav_menu": {},
496 |    "number_sections": true,
497 |    "sideBar": true,
498 |    "skip_h1_title": false,
499 |    "toc_cell": false,
500 |    "toc_position": {},
501 |    "toc_section_display": "block",
502 |    "toc_window_display": true
503 |   }
504 |  },
505 |  "nbformat": 4,
506 |  "nbformat_minor": 2
507 | }
--------------------------------------------------------------------------------
/10.4_Count_or_frequency_encoding.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Count or frequency encoding\n",
  8 |     "\n",
  9 |     "Another way to refer to variables that have a multitude of categories is to call them variables with **high cardinality**.\n",
 10 |     "\n",
 11 |     "We observed in the previous lecture that, if a categorical variable contains multiple labels, re-encoding it with one hot encoding expands the feature space dramatically.\n",
 12 |     "\n",
 13 |     "One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the percentage of observations in that category. The 2 are equivalent.\n",
 14 |     "\n",
 15 |     "There is no rationale behind this transformation, other than its simplicity.\n",
 16 |     "\n",
 17 |     "### Advantages\n",
 18 |     "\n",
 19 |     "- Simple\n",
 20 |     "- Does not expand the feature space\n",
 21 |     "\n",
 22 |     "### Disadvantages\n",
 23 |     "\n",
 24 |     "- If 2 labels appear the same number of times in the dataset (that is, they contain the same number of observations), they will be merged, so we may lose valuable information\n",
 25 |     "- Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power\n",
 26 |     "\n",
 27 |     "Follow this thread in Kaggle for more information:\n",
 28 |     "https://www.kaggle.com/general/16927\n",
 29 |     "\n",
 30 |     "Let's see how this works:"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "code",
 35 |    "execution_count": 1,
 36 |    "metadata": {},
 37 |    "outputs": [
 38 |     {
 39 |      "data": {
 40 |       "text/html": [
 41 |        "<div>
[flattened HTML table output removed; the same dataframe appears in the text/plain rendering below]
" 122 | ], 123 | "text/plain": [ 124 | " y X1 X2 X3 X4 X5 X6\n", 125 | "0 130.81 v at a d u j\n", 126 | "1 88.53 t av e d y l\n", 127 | "2 76.26 w n c d x j\n", 128 | "3 80.62 t n f d x l\n", 129 | "4 78.02 v n f d h d" 130 | ] 131 | }, 132 | "execution_count": 1, 133 | "metadata": {}, 134 | "output_type": "execute_result" 135 | } 136 | ], 137 | "source": [ 138 | "import pandas as pd\n", 139 | "import numpy as np\n", 140 | "\n", 141 | "from sklearn.model_selection import train_test_split\n", 142 | "\n", 143 | "# let's open the mercedes benz dataset for demonstration\n", 144 | "\n", 145 | "data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'y'])\n", 146 | "data.head()" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 2, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "name": "stdout", 156 | "output_type": "stream", 157 | "text": [ 158 | "X1 : 27 labels\n", 159 | "X2 : 44 labels\n", 160 | "X3 : 7 labels\n", 161 | "X4 : 4 labels\n", 162 | "X5 : 29 labels\n", 163 | "X6 : 12 labels\n" 164 | ] 165 | } 166 | ], 167 | "source": [ 168 | "# let's have a look at how many labels\n", 169 | "\n", 170 | "for col in data.columns[1:]:\n", 171 | " print(col, ': ', len(data[col].unique()), ' labels')" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### Important\n", 179 | "\n", 180 | "When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count/total observations) **over the training set**, and then use those numbers to replace the labels in the test set." 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 3, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "((2946, 6), (1263, 6))" 192 | ] 193 | }, 194 | "execution_count": 3, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "X_train, X_test, y_train, y_test = train_test_split(data[['X1', 'X2', 'X3', 'X4', 'X5', 'X6']], data.y,\n", 201 | " test_size=0.3,\n", 202 | " random_state=0)\n", 203 | "X_train.shape, X_test.shape" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 4, 209 | "metadata": { 210 | "scrolled": true 211 | }, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "{'a': 34,\n", 217 | " 'aa': 1,\n", 218 | " 'ac': 10,\n", 219 | " 'ae': 342,\n", 220 | " 'af': 1,\n", 221 | " 'ag': 15,\n", 222 | " 'ah': 3,\n", 223 | " 'ai': 289,\n", 224 | " 'ak': 188,\n", 225 | " 'al': 3,\n", 226 | " 'am': 1,\n", 227 | " 'an': 3,\n", 228 | " 'ao': 10,\n", 229 | " 'ap': 5,\n", 230 | " 'aq': 46,\n", 231 | " 'as': 1155,\n", 232 | " 'at': 5,\n", 233 | " 'au': 3,\n", 234 | " 'av': 2,\n", 235 | " 'aw': 2,\n", 236 | " 'ay': 40,\n", 237 | " 'b': 12,\n", 238 | " 'c': 1,\n", 239 | " 'd': 12,\n", 240 | " 'e': 61,\n", 241 | " 'f': 59,\n", 242 | " 'g': 10,\n", 243 | " 'h': 4,\n", 244 | " 'i': 15,\n", 245 | " 'k': 16,\n", 246 | " 'l': 1,\n", 247 | " 'm': 284,\n", 248 | " 'n': 97,\n", 249 | " 'o': 1,\n", 250 | " 'p': 1,\n", 251 | " 'q': 3,\n", 252 | " 'r': 101,\n", 253 | " 's': 63,\n", 254 | " 't': 17,\n", 255 | " 'x': 8,\n", 256 | " 'y': 8,\n", 257 | " 'z': 14}" 258 | ] 259 | }, 260 | "execution_count": 4, 261 | "metadata": {}, 262 | "output_type": "execute_result" 263 | } 264 | ], 265 | "source": [ 266 | "# let's obtain the counts for each one of the labels in variable X2\n", 267 | "# let's capture this in a dictionary that we can use to 
re-map the labels\n", 268 | "\n", 269 | "X_train.X2.value_counts().to_dict()" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 5, 275 | "metadata": {}, 276 | "outputs": [ 277 | { 278 | "data": { 279 | "text/html": [ 280 | "
\n", 281 | "\n", 294 | "\n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | "
    <tr><th></th><th>X1</th><th>X2</th><th>X3</th><th>X4</th><th>X5</th><th>X6</th></tr>
    <tr><th>3059</th><td>aa</td><td>ai</td><td>c</td><td>d</td><td>q</td><td>g</td></tr>
    <tr><th>3014</th><td>b</td><td>m</td><td>c</td><td>d</td><td>q</td><td>i</td></tr>
    <tr><th>3368</th><td>o</td><td>f</td><td>f</td><td>d</td><td>s</td><td>l</td></tr>
    <tr><th>2772</th><td>aa</td><td>as</td><td>d</td><td>d</td><td>p</td><td>j</td></tr>
    <tr><th>3383</th><td>v</td><td>e</td><td>c</td><td>d</td><td>s</td><td>g</td></tr>
\n", 354 | "
" 355 | ], 356 | "text/plain": [ 357 | " X1 X2 X3 X4 X5 X6\n", 358 | "3059 aa ai c d q g\n", 359 | "3014 b m c d q i\n", 360 | "3368 o f f d s l\n", 361 | "2772 aa as d d p j\n", 362 | "3383 v e c d s g" 363 | ] 364 | }, 365 | "execution_count": 5, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "# lets look at X_train so we can compare then the variable re-coding\n", 372 | "\n", 373 | "X_train.head()" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 6, 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "data": { 383 | "text/html": [ 384 | "
\n", 385 | "\n", 398 | "\n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | "
    <tr><th></th><th>X1</th><th>X2</th><th>X3</th><th>X4</th><th>X5</th><th>X6</th></tr>
    <tr><th>3059</th><td>aa</td><td>289</td><td>c</td><td>d</td><td>q</td><td>g</td></tr>
    <tr><th>3014</th><td>b</td><td>284</td><td>c</td><td>d</td><td>q</td><td>i</td></tr>
    <tr><th>3368</th><td>o</td><td>59</td><td>f</td><td>d</td><td>s</td><td>l</td></tr>
    <tr><th>2772</th><td>aa</td><td>1155</td><td>d</td><td>d</td><td>p</td><td>j</td></tr>
    <tr><th>3383</th><td>v</td><td>61</td><td>c</td><td>d</td><td>s</td><td>g</td></tr>
\n", 458 | "
" 459 | ], 460 | "text/plain": [ 461 | " X1 X2 X3 X4 X5 X6\n", 462 | "3059 aa 289 c d q g\n", 463 | "3014 b 284 c d q i\n", 464 | "3368 o 59 f d s l\n", 465 | "2772 aa 1155 d d p j\n", 466 | "3383 v 61 c d s g" 467 | ] 468 | }, 469 | "execution_count": 6, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "# And now let's replace each label in X2 by its count\n", 476 | "\n", 477 | "# first we make a dictionary that maps each label to the counts\n", 478 | "X_frequency_map = X_train.X2.value_counts().to_dict()\n", 479 | "\n", 480 | "# and now we replace X2 labels both in train and test set with the same map\n", 481 | "X_train.X2 = X_train.X2.map(X_frequency_map)\n", 482 | "X_test.X2 = X_test.X2.map(X_frequency_map)\n", 483 | "\n", 484 | "X_train.head()" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": {}, 490 | "source": [ 491 | "Where in the original dataset, for the observation 1 in the variable 2 before it was 'ai', now it was replaced by the count 289. And so on for the rest of the categories (compare outputs 5 and 6).\n", 492 | "\n", 493 | "### Note\n", 494 | "\n", 495 | "I want you to keep in mind something important:\n", 496 | "\n", 497 | "If a category is present in the test set, that was not present in the train set, this method will generate missing data in the test set. This is why it is extremely important to handle rare categories, as we say in section 6 of this course.\n", 498 | "\n", 499 | "Then we can combine rare label replacement plus categorical encoding with counts like this: we may choose to replace the 10 most frequent labels by their count, and then group all the other labels under one label (for example \"Rare\"), and replace \"Rare\" by its count, to account for what I just mentioned.\n", 500 | "\n", 501 | "In coming sections I will explain more methods of categorical encoding. I want you to keep in mind that There is no rule of thumb to indicate which method you should use to encode categorical variables. It is mostly up to what makes sense for the data, and it also depends on what you are trying to achieve. In general, for data competitions, we value more model predictive power, whereas in business scenarios we want to capture and understand the information, and generally, we want to transform variables in a way that it makes 'Business sense'. 
Some of your common sense and a lot of conversation with the people that understand the data well will be required to encode categorical labels.\n" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": { 508 | "collapsed": true 509 | }, 510 | "outputs": [], 511 | "source": [] 512 | } 513 | ], 514 | "metadata": { 515 | "kernelspec": { 516 | "display_name": "Python 3", 517 | "language": "python", 518 | "name": "python3" 519 | }, 520 | "language_info": { 521 | "codemirror_mode": { 522 | "name": "ipython", 523 | "version": 3 524 | }, 525 | "file_extension": ".py", 526 | "mimetype": "text/x-python", 527 | "name": "python", 528 | "nbconvert_exporter": "python", 529 | "pygments_lexer": "ipython3", 530 | "version": "3.6.1" 531 | }, 532 | "toc": { 533 | "nav_menu": {}, 534 | "number_sections": true, 535 | "sideBar": true, 536 | "skip_h1_title": false, 537 | "toc_cell": false, 538 | "toc_position": {}, 539 | "toc_section_display": "block", 540 | "toc_window_display": true 541 | } 542 | }, 543 | "nbformat": 4, 544 | "nbformat_minor": 2 545 | } 546 | -------------------------------------------------------------------------------- /15.5_Bonus_Additional_reading_resources.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Additional reading resources\n", 8 | "\n", 9 | "### Discretisation\n", 10 | "\n", 11 | "#### Articles\n", 12 | "- [Discretisation: An Enabling Technique](http://www.public.asu.edu/~huanliu/papers/dmkd02.pdf)\n", 13 | "- [Supervised and unsupervised discretisation of continuous features](http://ai.stanford.edu/~ronnyk/disc.pdf)\n", 14 | "- [ChiMerge: Discretisation of Numeric Attributes](https://www.aaai.org/Papers/AAAI/1992/AAAI92-019.pdf)\n", 15 | "\n", 16 | "#### Master thesis\n", 17 | "- [Beating Kaggle the easy way](https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf)\n", 18 | "\n", 19 | "#### Blog\n", 20 | "- [Tips for honing logistic regression models](https://blog.zopa.com/2017/07/20/tips-honing-logistic-regression-models/)\n", 21 | "- [ChiMerge discretisation algorithm](https://alitarhini.wordpress.com/2010/11/02/chimerge-discretization-algorithm/)\n", 22 | "\n", 23 | "#### Other\n", 24 | "- [Score card development stages: Binning](https://plug-n-score.com/learning/scorecard-development-stages.htm#Binning)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [] 35 | } 36 | ], 37 | "metadata": { 38 | "kernelspec": { 39 | "display_name": "Python 3", 40 | "language": "python", 41 | "name": "python3" 42 | }, 43 | "language_info": { 44 | "codemirror_mode": { 45 | "name": "ipython", 46 | "version": 3 47 | }, 48 | "file_extension": ".py", 49 | "mimetype": "text/x-python", 50 | "name": "python", 51 | "nbconvert_exporter": "python", 52 | "pygments_lexer": "ipython3", 53 | "version": "3.6.1" 54 | }, 55 | "toc": { 56 | "nav_menu": {}, 57 | "number_sections": true, 58 | "sideBar": true, 59 | "skip_h1_title": false, 60 | "toc_cell": false, 61 | "toc_position": {}, 62 | "toc_section_display": "block", 63 | "toc_window_display": false 64 | } 65 | }, 66 | "nbformat": 4, 67 | "nbformat_minor": 2 68 | } 69 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 
Feature-Engineering --------------------------------------------------------------------------------
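
The count-encoding workflow spread across the notebook above can be condensed into one self-contained script. This is a minimal sketch, not the notebook's own code: the toy DataFrame, the `frequency_map` name, and the `fillna(0)` guard for unseen labels are illustrative assumptions (the notebook itself suggests grouping infrequent labels under a "Rare" bucket instead).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the Mercedes-Benz column X2 (values made up for illustration)
data = pd.DataFrame({
    'X2': ['as', 'as', 'ai', 'ae', 'm', 'ai', 'as', 'r', 'n', 'aq'] * 30,
    'y': range(300),
})

# split first: the counts must be learned on the training portion only,
# otherwise information from the test set leaks into the encoding
X_train, X_test, y_train, y_test = train_test_split(
    data[['X2']], data['y'], test_size=0.3, random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()

# learn label -> count on the train set
frequency_map = X_train['X2'].value_counts().to_dict()

# replace the labels in both sets with the same map
X_train['X2'] = X_train['X2'].map(frequency_map)
X_test['X2'] = X_test['X2'].map(frequency_map)

# labels seen only in the test set come back as NaN after .map();
# fill them with 0 here (or encode a grouped 'Rare' label instead)
n_unseen = X_test['X2'].isna().sum()
X_test['X2'] = X_test['X2'].fillna(0)

print(X_train['X2'].head())
print('labels unseen in training:', n_unseen)
```

On this toy data every label is frequent enough to land in both splits, so `n_unseen` will typically be 0; on a real column such as X2 of the Mercedes-Benz set, with 44 distinct labels, the guard matters.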