├── Machine Learning Project Checklist.pdf
├── README.md
├── assets
├── ml_proj_checklist.png
└── proj_template.png
├── boston.csv
├── cambridge.csv
├── data_cleaning_for_ml_lab_EXERCISES.ipynb
├── data_cleaning_for_ml_lab_SOLUTIONS.ipynb
└── ml_project_checklist_template.ipynb
/Machine Learning Project Checklist.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pdeguzman96/data_cleaning_workshop/3bb8874b917f56ce543c27be3356c938dd553129/Machine Learning Project Checklist.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Machine Learning Project Checklist
2 |
3 | **Summary:** [This checklist](http://bit.ly/ml_proj_checklist) was created to help ML students/practitioners structure their projects and problems in a way that makes sense to me.
4 |
5 | ---
6 |
7 | When I first started learning Python for Machine Learning and worked on my first few projects, I found it overwhelming because...
8 | - it was difficult to remember all of the steps I needed to take in order to make my data ML-friendly,
9 | - I couldn't easily remember the functions, methods, and estimators from pandas, numpy, and sklearn, and
10 | - it was tedious and time-consuming to try to understand large (>50 feature) datasets
11 |
12 | So, I created the [ML checklist](http://bit.ly/ml_proj_checklist) (Pictured Below) to be a handy tool for whenever I start to feel lost creating an ML project.
13 |
14 |
16 |
17 | In this repo, I also created...
18 | 1. `ml_project_checklist_template.ipynb`: (Pictured below) a Jupyter .ipynb that you can use as a template for your project or Kaggle competition
19 | 2. `data_cleaning_for_ml_lab_EXERCISES.ipynb`: An exercise lab that you can complete for data cleaning practice, originally made for a workshop that I gave
20 | 3. `data_cleaning_for_ml_lab_SOLUTIONS.ipynb`: A solutions file for the exercises above
21 | 4. `boston.csv` and `cambridge.csv`: Airbnb datasets from [here](http://insideairbnb.com/get-the-data.html) used for the exercises
22 | 5. I also included a PDF version of the checklist.
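
As a quick orientation, the two CSVs can be loaded and stacked with pandas. The following is only a minimal sketch, assuming the files sit in the working directory as they do in a clone of this repo. The helper name `load_listings` and the `source_city` column are illustrative and not part of the repo or the notebooks.

```python
from pathlib import Path

import pandas as pd


def load_listings(paths):
    """Read each CSV and stack the rows into one DataFrame.

    `paths` maps a label (e.g. a city name) to a CSV path. The label is
    stored in a hypothetical `source_city` column so rows stay traceable
    after the frames are combined.
    """
    frames = []
    for label, path in paths.items():
        frame = pd.read_csv(Path(path))
        frame["source_city"] = label
        frames.append(frame)
    # ignore_index=True rebuilds a clean 0..n-1 index for the combined frame
    return pd.concat(frames, axis=0, ignore_index=True)
```

With the repo cloned, `load_listings({"boston": "boston.csv", "cambridge": "cambridge.csv"})` should return the combined listings. The lab notebook builds the same kind of frame step by step, so treat this as a shortcut for exploration rather than a replacement for the exercises.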
23 |
24 |
26 |
27 | ---
28 |
29 | I hope you find these resources as useful as I do!
30 |
31 | Happy learning :).
--------------------------------------------------------------------------------
/assets/ml_proj_checklist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pdeguzman96/data_cleaning_workshop/3bb8874b917f56ce543c27be3356c938dd553129/assets/ml_proj_checklist.png
--------------------------------------------------------------------------------
/assets/proj_template.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pdeguzman96/data_cleaning_workshop/3bb8874b917f56ce543c27be3356c938dd553129/assets/proj_template.png
--------------------------------------------------------------------------------
/data_cleaning_for_ml_lab_EXERCISES.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Cleaning for Machine Learning Lab w/ Template"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "[Full Project Checklist Here](https://docs.google.com/spreadsheets/d/1y4EdxeAliOQw9CDHx0_brjmk-LUb3gfX52zLGSqLg_g/edit?usp=sharing)"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "In this notebook, we will be cleaning, exploring, preprocessing, and modeling [Airbnb listing data](http://insideairbnb.com/get-the-data.html) from Boston and Cambridge.\n",
22 | "\n",
23 | "The purpose of this notebook is to \n",
24 | "1. practice data cleaning for ML and \n",
25 | "2. show how to effectively use this template to bring some structure to your ML projects."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "### Instructions"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "1. Edit all cells that say \"TO DO\""
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": null,
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "## TO DO: Add 1 + 1"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "2. Read, but do not edit, cells that say \"DO NOT CHANGE\""
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 87,
61 | "metadata": {},
62 | "outputs": [
63 | {
64 | "data": {
65 | "text/plain": [
66 | "True"
67 | ]
68 | },
69 | "execution_count": 87,
70 | "metadata": {},
71 | "output_type": "execute_result"
72 | }
73 | ],
74 | "source": [
75 | "## DO NOT CHANGE:\n",
76 | "10%2==0"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "## Prerequisite: Business and Data Understanding\n",
84 | "\n",
85 | "Before doing any data cleaning or exploration, do the best you can to identify your goals, questions, and purpose of this analysis. Additionally, try to get your hands on a Data Dictionary or schema if you can. You, ideally, will be able to answer questions like this...\n",
86 | "\n",
87 | "- Business Questions:\n",
88 | " - What's the goal of this analysis?\n",
89 | " - What're some questions I want to answer?\n",
90 | " - Do I need machine learning?\n",
91 | "- Data Questions:\n",
92 | " - How many features should I expect?\n",
93 | " - How much text, categorical, or image data do I have? All of these need to be turned into numbers somehow.\n",
94 | " - Do I already have the datasets that I need?\n",
95 | "\n",
96 | "Honestly, taking 1-2 hours to answer these can go a long way.\n",
97 | "\n",
98 | "> **One of the worst feelings you can get in these situations is feeling overwhelmed and lost while trying to understand a big and messy dataset. You're doing yourself a favor by studying the data before you dive in.**"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "---\n",
106 | "\n",
107 | "### Let's Get Started...\n",
108 | "\n",
109 | "---"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "## Table of Contents \n",
117 | "\n",
118 | "#### I. [Import Data & Libraries](#idl)\n",
119 | "#### II. [Exploratory Data Analysis](#eda)\n",
120 | "#### III. [Train/Test Split](#tts)\n",
121 | "#### IV. [Prepare for ML](#pfm)\n",
122 | "#### V. [Pick your Models](#pym)\n",
123 | "#### VI. [Model Selection](#ms)\n",
124 | "#### VII. [Model Tuning](#mt)\n",
125 | "#### VIII. [Pick the Best Model](#pbm)\n",
126 | "\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "## I. Import Data & Libraries "
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "### Import Libraries"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "## DO NOT CHANGE\n",
150 | "\n",
151 | "# Data manipulation\n",
152 | "import pandas as pd\n",
153 | "import numpy as np\n",
154 | "\n",
155 | "# More Data Preprocessing & Machine Learning\n",
156 | "from sklearn.model_selection import train_test_split\n",
157 | "from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, StandardScaler\n",
158 | "from sklearn.impute import SimpleImputer\n",
159 | "\n",
160 | "# Data Viz\n",
161 | "import seaborn as sns\n",
162 | "import matplotlib.pyplot as plt\n",
163 | "%matplotlib inline"
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "### Downloading/Importing Data"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {},
177 | "outputs": [],
178 | "source": [
179 | "## DO NOT CHANGE\n",
180 | "boston_url = \"https://github.com/pdeguzman96/data_cleaning_workshop/blob/master/boston.csv?raw=true\"\n",
181 | "cambridge_url = \"https://github.com/pdeguzman96/data_cleaning_workshop/blob/master/cambridge.csv?raw=true\""
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "## TO DO: import the data using the links above (Hint: pd.read_csv may be helpful)\n",
191 | "\n",
192 | "## if you have the csv files saved into your directory (which you should if you downloaded the whole Github repo)\n",
193 | " ## Just replace the urls with the filepath\n",
194 | "boston_df = \n",
195 | "cambridge_df = "
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "scrolled": false
203 | },
204 | "outputs": [],
205 | "source": [
206 | "## DO NOT CHANGE\n",
207 | "## TO DO: Skim through all the columns. There are a lot of columns that we don't need right now.\n",
208 | "pd.options.display.max_rows = boston_df.shape[1]\n",
209 | "boston_df.head(2).T"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "Dropping columns that we're not going to use for this notebook."
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "## DO NOT CHANGE\n",
226 | "## These are urls, irrelevant dates, text data, names, zipcode, repetitive information, columns with 1 value\n",
227 | "\n",
228 | "drop = ['listing_url', 'scrape_id', 'last_scraped', 'summary', 'space', 'description', 'neighborhood_overview',\n",
229 | " 'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url',\n",
230 | " 'xl_picture_url', 'host_id', 'host_url', 'host_about', 'host_thumbnail_url', 'host_picture_url',\n",
231 | " 'calendar_updated', 'calendar_last_scraped', 'license', 'name', 'host_name', 'zipcode', 'id','city', 'state',\n",
232 | " 'market','jurisdiction_names', 'host_location', 'street', 'experiences_offered','country_code','country',\n",
233 | " 'has_availability','is_business_travel_ready', 'host_neighbourhood','neighbourhood_cleansed','smart_location',\n",
234 | " 'neighbourhood']"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "## TO DO: drop the columns above from boston_df and cambridge_df"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": null,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "## TO DO: concatenate the dataframes together (hint: pd.concat using axis=0 and ignore_index=True). Store in df\n",
253 | "df = "
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "## II. Exploratory Data Analysis\n",
261 | "**[Back to top](#toc)**\n",
262 | "\n",
263 | "This section is where you're going to really try to get a feel of what you're dealing with. You'll be doing lots of cleaning and visualizing before you're ready for ML.\n",
264 | "\n",
265 | "This is usually the most time-consuming section before you get to a simple working ML algorithm."
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "### A. Duplicate Value Check\n",
273 | "\n",
274 | "We don't need/want any rows that are purely identical to one another."
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "## TO DO: drop any duplicate rows"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "Were there any duplicates?"
291 | ]
292 | },
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "### B. Separate Data Types"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "Generally, there are six types of data you will run into.\n",
305 | "\n",
306 | "1. Numerical\n",
307 | "2. Categorical\n",
308 | "3. Date/Time\n",
309 | "4. Text\n",
310 | "5. Image\n",
311 | "6. Sound\n",
312 | "\n",
313 | "We don't have any Image or Sound data, and we removed Text data to keep this simple, so we only have to deal with Numerical, Categorical, and Date/Time."
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "**Let's start by separating our data into Numerical and Categorical.**"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": [
329 | "## TO DO: create a dataframe of only categorical variables (Hint: df.select_dtypes(['object', 'bool']))\n",
330 | "cat_df =\n",
331 | "\n",
332 | "## TO DO: create a dataframe of only numerical variables (Hint: data types \"int\" or \"float\")\n",
333 | "num_df ="
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "---\n",
341 | "So now we need to account for all of the following possible data types...\n",
342 | "\n",
343 | "1. Numerical *(1.3, -2.345, 6,423.1)*\n",
344 | "2. Categorical\n",
345 | " - Binary *(True/False, 0/1, Heads/Tails)*\n",
346 | " - Ordinal *(Low, Medium, High)*\n",
347 | " - Nominal *(Red, Blue, Purple)*\n",
348 | "3. Date/Time \n",
349 | "\n",
350 | "---\n",
351 | "\n",
352 | "> As you take an inventory of your data, use this next section to look through your data to **identify anything you have to fix** in order for your data to be **ready for EDA**."
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "#### Skim through the Numerical data"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": null,
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "# Glance at the numerical data\n",
369 | "num_df.head().T"
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "The numerical features should look OK. Nothing obvious that we have to fix other than missing values, which we will deal with later."
377 | ]
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "metadata": {},
382 | "source": [
383 | "#### Skim through the Categorical Data\n",
384 | "\n",
385 | "Most (if not all) problems will come from this subset."
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {},
392 | "outputs": [],
393 | "source": [
394 | "# Skim the output to look for things to fix\n",
395 | "cat_df.head().T"
396 | ]
397 | },
398 | {
399 | "cell_type": "markdown",
400 | "metadata": {},
401 | "source": [
402 | "**We have a lot of work to do for these categorical columns.**\n",
403 | "\n",
404 | "**Here's what we're going to take care of below...**\n",
405 | "1. Numerical data stored as Categorical (strings)\n",
406 | " - Convert some of these to numerical columns (e.g. `price` features and `host_response_rate`)\n",
407 | "2. Binary data needs to be binarized into 1's and 0's\n",
408 | " - We can Binarize the Binary/Boolean columns (such as `requires_license`)\n",
409 | "3. Ordinal data should generally be encoded to retain its order information (e.g. {1,2,3} to encode {low, med, high})\n",
410 | " - The only ordinal-looking column I see is `host_response_time`, but let's treat it as nominal for simplicity\n",
411 | "4. Nominal data to be unpacked, then later one hot encoded\n",
412 | " - `host_verifications` and `amenities` have multiple items that need to be extracted into their own columns\n",
413 | " - All other categorical columns, like `neighborhood`, `cancellation_policy`, `property_type` should be one hot encoded.\n",
414 | "5. Date/Time features need to be engineered\n",
415 | " - Using these dates we can engineer features from the dates columns. We'll do this later"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
422 | "### C. Initial Data Cleaning (for exploration)\n",
423 | "\n",
424 | "Before we're ready to perform some EDA (Exploratory Data Analysis), we should address the points stated above."
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "#### Convert Numerical Features to Numerical Data Types (if they were typed as objects instead of numbers)"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": null,
437 | "metadata": {},
438 | "outputs": [],
439 | "source": [
440 | "## DO NOT CHANGE\n",
441 | "\n",
442 | "# Getting all the features that should be numerical, but are typed as objects (strings)\n",
443 | "cat_to_num = ['host_response_rate', 'price', 'weekly_price',\n",
444 | " 'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']\n",
445 | "\n",
446 | "# Keeping changes in a temporary copied DataFrame\n",
447 | "# Setting deep=True creates a \"deepcopy\", which guarantees that you're creating a new object\n",
448 | " # Sometimes when you copy an object into a new variable, this new variable just points back to the copied object\n",
449 | " # This can have unintended consequences - if you edit one variable, you also might edit the other\n",
450 | "cat_to_num_df = cat_df[cat_to_num].copy(deep=True)"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": null,
456 | "metadata": {},
457 | "outputs": [],
458 | "source": [
459 | "## TO DO: Take a peek at the data in cat_to_num_df using head(), What do you see?"
460 | ]
461 | },
462 | {
463 | "cell_type": "code",
464 | "execution_count": null,
465 | "metadata": {},
466 | "outputs": [],
467 | "source": [
468 | "## TO DO: remove the percent sign, then convert to a number (Hint: str.replace() & astype() will be useful)\n",
469 | "## Overwrite the old values with the updated values in 'host_response_rate'\n"
470 | ]
471 | },
472 | {
473 | "cell_type": "markdown",
474 | "metadata": {},
475 | "source": [
476 | "For the rest of the columns regarding price, remove the \"$\" and \",\" then convert to float."
477 | ]
478 | },
479 | {
480 | "cell_type": "code",
481 | "execution_count": null,
482 | "metadata": {},
483 | "outputs": [],
484 | "source": [
485 | "## DO NOT CHANGE\n",
486 | "price_cols = ['price', 'weekly_price','monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']"
487 | ]
488 | },
489 | {
490 | "cell_type": "code",
491 | "execution_count": null,
492 | "metadata": {},
493 | "outputs": [],
494 | "source": [
495 | "## TO DO: For each of the price columns, remove commas and dollar signs, then convert it to float\n",
496 | "for col in price_cols:"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "execution_count": null,
502 | "metadata": {},
503 | "outputs": [],
504 | "source": [
505 | "## TO DO: Append the new cat_to_num_df data to the num_df DataFrame using pd.concat and axis=1\n",
506 | "num_df ="
507 | ]
508 | },
509 | {
510 | "cell_type": "code",
511 | "execution_count": null,
512 | "metadata": {},
513 | "outputs": [],
514 | "source": [
515 | "## TO DO: Drop the old columns from the cat_df DataFrame using the appropriate axis\n",
516 | "cat_df = "
517 | ]
518 | },
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "#### Convert Binary Columns to Boolean (Not necessary for exploration, but we have to do this later anyway)"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "metadata": {},
530 | "outputs": [],
531 | "source": [
532 | "bi_cols = []\n",
533 | "\n",
534 | "## TO DO: Loop through each column and store all columns with only 2 values in the bi_cols list\n",
535 | "## Hint: the nunique() method will be helpful\n",
536 | "\n",
537 | "for col in cat_df.columns:\n",
538 | " \n",
539 | "## TO DO: Take a peek at first few rows of the columns in bi_cols. What do you see?"
540 | ]
541 | },
542 | {
543 | "cell_type": "code",
544 | "execution_count": null,
545 | "metadata": {},
546 | "outputs": [],
547 | "source": [
548 | "## TO DO: Convert all binary columns to 1's and 0's. (Hint: the .map() method with a dictionary is helpful and fast)\n",
549 | "## Make sure you overwrite the old columns with these new ones in cat_df"
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": null,
555 | "metadata": {},
556 | "outputs": [],
557 | "source": [
558 | "## TO DO: Take a peek at the bi_cols in cat_df using head to see if everything looks okay"
559 | ]
560 | },
561 | {
562 | "cell_type": "markdown",
563 | "metadata": {},
564 | "source": [
565 | "#### Nominal (Extracting Multiple Values in one Feature)\n",
566 | "\n",
567 | "- host_verifications\n",
568 | "- amenities"
569 | ]
570 | },
571 | {
572 | "cell_type": "code",
573 | "execution_count": 25,
574 | "metadata": {},
575 | "outputs": [
576 | {
577 | "data": {
578 | "text/html": [
579 | "
"
1182 | ]
1183 | },
1184 | "metadata": {
1185 | "needs_background": "light"
1186 | },
1187 | "output_type": "display_data"
1188 | }
1189 | ],
1190 | "source": [
1191 | "## DO NOT CHANGE \n",
1192 | "fgrid = sns.FacetGrid(eda_viz, col='room_type', height=6,)\n",
1193 | "fgrid.map(sns.boxplot, 'last_review_discrete', 'price', 'host_is_superhost', \n",
1194 | " order=labels, hue_order = [0,1])\n",
1195 | "\n",
1196 | "for ax in fgrid.axes.flat:\n",
1197 | " plt.setp(ax.get_xticklabels(), rotation=45)\n",
1198 | " ax.set(xlabel=None, ylabel=None)\n",
1199 | "\n",
1200 | "l = plt.legend(loc='upper right')\n",
1201 | "l.get_texts()[0].set_text('Is not Superhost')\n",
1202 | "l.get_texts()[1].set_text('Is Superhost')\n",
1203 | "\n",
1204 | "fgrid.fig.tight_layout(w_pad=1)"
1205 | ]
1206 | },
1207 | {
1208 | "cell_type": "markdown",
1209 | "metadata": {},
1210 | "source": [
1211 | "Looking at the faceted plots above, it seems that units that haven't been reviewed for a long time are priced slightly higher than units with more recent reviews. Also, superhosts' prices (dark blue) appear higher than non-superhosts' prices (light blue), suggesting that listings from superhosts (hosts who are top-rated and most experienced) tend to be priced higher than listings from other hosts.\n",
1212 | "\n",
1213 | "However, we haven't verified any of these with meaningful statistical tests. This is all descriptive analysis."
1214 | ]
1215 | },
1216 | {
1217 | "cell_type": "markdown",
1218 | "metadata": {},
1219 | "source": [
1220 | "---"
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "code",
1225 | "execution_count": null,
1226 | "metadata": {},
1227 | "outputs": [],
1228 | "source": [
1229 | "## SKIP FOR NOW. Come back to this when you've finished the notebook. \n",
1230 | " ## Depending on what you want to try, this might take a while\n",
1231 | " \n",
1232 | "## TO DO: Think of your own EDA question. Try to answer it below. \n",
1233 | " ## You can create features for this, but don't add them to cleaned_df, add them to a copied DF called eda\n",
1234 | "## Some ideas...\n",
1235 | " ## What kinds of amenities do the expensive listings usually have?\n",
1236 | " ## Do hosts with many listings have higher or lower reviews than hosts with only a few listings?"
1237 | ]
1238 | },
1239 | {
1240 | "cell_type": "markdown",
1241 | "metadata": {},
1242 | "source": [
1243 | "### E. Assess Missing Values\n",
1244 | "\n",
1245 | "> **Do not fill or impute them yet at this point! We want to fill missing values after we train/test split.**\n",
1246 | "\n",
1247 | "In this section, we need to come up with a strategy on how we're going to tackle our missing values. Most ML algorithms (except fancy ones like [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html)) cannot handle NA values, so we need to deal with them.\n",
1248 | "\n",
1249 | "You have two options, and I'll describe some strategies for each option below:\n",
1250 | "1. **Remove them**\n",
1251 | " - Are there many missing values in a particular **column**? Perhaps it's not very useful if there are too many missing.\n",
1252 | " - Are there many missing values in a particular **row**? Perhaps this missingness is caused by something reasonable or by a data-collection failure. Investigate these in case you can reasonably identify a reason why they're missing before you drop them.\n",
1253 | " - Do some rows not contain your *response variable* of interest? Perhaps you want to predict `price` of an Airbnb listing. If so, supervised learning methods require the label (`price`) to be there, so we can disregard these rows.\n",
1254 | "2. **Fill them** (**Warning**: it's generally **not great practice** to fill missing values **before** you **train/test split** your data. You *can* fill missing values now if it's a one-off analysis, but if this is something you want to implement in practice, you want to be able to test your entire preprocessing workflow to evaluate how good it is. Think of your strategy for filling missing values as another hyperparameter that you want to tune.)\n",
1255 | " - Infer the value of missing values from other columns. (e.g. If `state` is missing, but `city` is San Francisco, `state` is probably `CA`)\n",
1256 | " - Fill numerical values with mean, median, or mode.\n",
1257 | " - Fill categorical values with the most frequent value.\n",
1258 | " - Use machine learning techniques to predict missing values. (Check out [IterativeImputer](https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html) from sklearn for a method of doing this.)"
1259 | ]
1260 | },
1261 | {
1262 | "cell_type": "markdown",
1263 | "metadata": {},
1264 | "source": [
1265 | "---\n",
1266 | "\n",
1267 | "**Here's our approach for the section below...**\n",
1268 | "\n",
1269 | "1. We assess missing values per column to see if we can drop any features.\n",
1270 | "2. We assess missing values per row to see if we can find any patterns in how these values may be missing.\n",
1271 | "3. We strategize how we want to fill our categorical features.\n",
1272 | "4. We strategize how we want to fill our remaining numerical features."
1273 | ]
1274 | },
1275 | {
1276 | "cell_type": "markdown",
1277 | "metadata": {},
1278 | "source": [
1279 | "#### 1. Assessing Missing Values per Column"
1280 | ]
1281 | },
1282 | {
1283 | "cell_type": "code",
1284 | "execution_count": null,
1285 | "metadata": {},
1286 | "outputs": [],
1287 | "source": [
1288 | "## TO DO: Let's assume we want to predict the price feature. \n",
1289 | " ## If price is the variable we want to predict, then we have to disregard rows that don't have it\n",
1290 | " ## Drop all rows with missing values in the 'price' column\n",
1291 | "cleaned_df = "
1292 | ]
1293 | },
1294 | {
1295 | "cell_type": "code",
1296 | "execution_count": null,
1297 | "metadata": {},
1298 | "outputs": [],
1299 | "source": [
1300 | "## TO DO: Calculate the proportion/percentage of NA values per column\n",
1301 | "## TO DO: There should be 5 columns with much more than 80% of their values missing. Which 5 columns are they?\n",
1302 | "## TO DO (OPTIONAL): Create a bar chart to visualize the proportion of NA values per column\n",
1303 | " # Hint: matplotlib's bar or barh are useful for this."
1304 | ]
1305 | },
1306 | {
1307 | "cell_type": "code",
1308 | "execution_count": null,
1309 | "metadata": {},
1310 | "outputs": [],
1311 | "source": [
1312 | "## TO DO: Drop the 5 missing columns identified above from cleaned_df"
1313 | ]
1314 | },
1315 | {
1316 | "cell_type": "markdown",
1317 | "metadata": {},
1318 | "source": [
1319 | "#### 2. Assessing Missing Values per Row"
1320 | ]
1321 | },
1322 | {
1323 | "cell_type": "code",
1324 | "execution_count": null,
1325 | "metadata": {},
1326 | "outputs": [],
1327 | "source": [
1328 | "## TO DO: Create a temporary column called \"sum_na_row\" in cleaned_df that contains the number of NA values per row\n",
1329 | "cleaned_df['sum_na_row'] = "
1330 | ]
1331 | },
1332 | {
1333 | "cell_type": "code",
1334 | "execution_count": null,
1335 | "metadata": {},
1336 | "outputs": [],
1337 | "source": [
1338 | "## TO DO: Use matplotlib or seaborn to plot the distribution of this new column. \n",
1339 | " ## Is there anything in this distribution that looks odd to you??\n",
1340 | "## Hint: seaborn's histplot() function is nice and easy for this (distplot() in older seaborn versions). Alternatively, countplot() may work, too"
1341 | ]
1342 | },
1343 | {
1344 | "cell_type": "markdown",
1345 | "metadata": {},
1346 | "source": [
1347 | "Look at the distribution of missing values per row. This distribution looks a little odd. Look at how few missing values per row we have between 5-9, then hundreds more from 10-14.\n",
1348 | "\n",
1349 | "This may be systematically created. Let's investigate if there are specific columns that are consistently empty for these rows."
1350 | ]
1351 | },
1352 | {
1353 | "cell_type": "code",
1354 | "execution_count": null,
1355 | "metadata": {},
1356 | "outputs": [],
1357 | "source": [
1358 | "## Note: If you were able to spot the same sudden jump in missing values that I did,\n",
1359 | " ## you may have noticed that there are a lot of rows with 10 or more missing values,\n",
1360 | " ## but there's very few with 5-9 missing values. \n",
1361 | "\n",
1362 | "## TO DO: filter cleaned_df for only rows with 10 or more missing values. store this in a temporary DataFrame\n",
1363 | "temp = \n",
1364 | "\n",
1365 | "## TO DO: get the names of the columns that contain missing values from this temporary DF\n",
1366 | "## Hint: DF.isna().any() can be useful here. \n",
1367 | "na_cols = \n",
1368 | "\n",
1369 | "# Take a peek at what these features look like. Transposed for readability\n",
1370 | "temp[na_cols].transpose()"
1371 | ]
1372 | },
1373 | {
1374 | "cell_type": "markdown",
1375 | "metadata": {},
1376 | "source": [
1377 | "Do you see any patterns in the missing-ness of our data? Which features contain many missing values?"
1378 | ]
1379 | },
1380 | {
1381 | "cell_type": "markdown",
1382 | "metadata": {},
1383 | "source": [
1384 | "> It looks like a huge portion of missing values are coming from `review`-related features. Why could this be?\n",
1385 | "---\n",
1386 | "\n",
1387 | "**My Guess:**\n",
1388 | "\n",
1389 | "NA values related to reviews are most likely missing because these particular listings do not have any reviews. This could be useful information, so I'll encode these values as `0` so they're different from the values that we do have.\n",
1390 | "\n",
1391 | "Although `0` may be misleading, I believe filling with `0` is better than removing the rows or imputing from other values, because it keeps these listings distinct from the units that actually have reviews."
1392 | ]
1393 | },
1394 | {
1395 | "cell_type": "code",
1396 | "execution_count": null,
1397 | "metadata": {},
1398 | "outputs": [],
1399 | "source": [
1400 | "## DO NOT CHANGE \n",
1401 | "# Collecting the numerical review-related columns\n",
1402 | "zero_fill_cols = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',\n",
1403 | " 'review_scores_checkin', 'review_scores_communication', 'review_scores_location',\n",
1404 | " 'review_scores_value', 'reviews_per_month', 'first_review_days', 'last_review_days']"
1405 | ]
1406 | },
1407 | {
1408 | "cell_type": "markdown",
1409 | "metadata": {},
1410 | "source": [
1411 | "#### 3. Categorical Features with Missing Values\n",
1412 | "\n",
1413 | "Dealing with missing categorical data can be tricky. Here are some ways you can deal with them:\n",
1414 | "- Fill with mode/most frequent value (e.g. if 70% of a column is \"red\", maybe you fill the remaining NA values with \"red\")\n",
1415 | "- Infer their value from other columns (e.g. if one feature helps you make an educated guess about the missing value)\n",
1416 | "- Create a dummy variable (e.g. if the value is missing, another dummy feature will have 1 for the missing value. Else, it will be 0)\n",
1417 | "\n",
1418 | "> Let's take the simple most frequent approach. We'll tackle this using the `SimpleImputer` from sklearn after we train/test split our data."
1419 | ]
1420 | },
1421 | {
1422 | "cell_type": "code",
1423 | "execution_count": null,
1424 | "metadata": {},
1425 | "outputs": [],
1426 | "source": [
1427 | "## TO DO (OPTIONAL): isolate all categorical columns (i.e. columns of dtype 'object'), \n",
1428 | " ## then make countplots or barplots for each one to visualize how these features are distributed"
1429 | ]
1430 | },
1431 | {
1432 | "cell_type": "markdown",
1433 | "metadata": {},
1434 | "source": [
1435 | "#### 4. Now what should we do about imputing the rest of our missing numerical features below?"
1436 | ]
1437 | },
1438 | {
1439 | "cell_type": "markdown",
1440 | "metadata": {},
1441 | "source": [
1442 | "- A lot of people like the simple approach of filling them with the **mean** or **median** of the features.\n",
1443 | "- There are also some advanced methods of imputing missing values using Machine Learning. An example is a neat experimental estimator in sklearn called `IterativeImputer` that uses machine learning to predict and impute many features at once. "
1444 | ]
1445 | },
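{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of `IterativeImputer` on a tiny made-up array (the numbers below are illustrative assumptions, not values from our listings data). Because the estimator is experimental, sklearn requires an explicit enabling import before it can be used.\n",
"\n",
"```python\n",
"import numpy as np\n",
"# this import enables the experimental estimator; it must come before IterativeImputer\n",
"from sklearn.experimental import enable_iterative_imputer  # noqa: F401\n",
"from sklearn.impute import IterativeImputer\n",
"\n",
"# toy data (hypothetical): the second column is roughly twice the first\n",
"X_toy = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])\n",
"\n",
"# each feature with missing values is modeled from the other features\n",
"toy_imputer = IterativeImputer(random_state=0)\n",
"X_toy_filled = toy_imputer.fit_transform(X_toy)\n",
"print(X_toy_filled)\n",
"```\n",
"\n",
"In a real project you would fit it on the training split only, just like the simpler imputers."
]
},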
1446 | {
1447 | "cell_type": "code",
1448 | "execution_count": 61,
1449 | "metadata": {
1450 | "scrolled": true
1451 | },
1452 | "outputs": [
1453 | {
1454 | "data": {
1623 | "text/plain": [
1624 | " host_listings_count host_total_listings_count bathrooms bedrooms beds \\\n",
1625 | "0 6.0 6.0 1.0 1.0 1.0 \n",
1626 | "1 6.0 6.0 1.0 1.0 2.0 \n",
1627 | "2 12.0 12.0 1.0 1.0 1.0 \n",
1628 | "3 12.0 12.0 1.0 1.0 1.0 \n",
1629 | "4 11.0 11.0 1.0 0.0 1.0 \n",
1630 | "\n",
1631 | " review_scores_rating review_scores_accuracy review_scores_cleanliness \\\n",
1632 | "0 95.0 10.0 10.0 \n",
1633 | "1 96.0 10.0 10.0 \n",
1634 | "2 93.0 9.0 9.0 \n",
1635 | "3 95.0 10.0 9.0 \n",
1636 | "4 87.0 9.0 9.0 \n",
1637 | "\n",
1638 | " review_scores_checkin review_scores_communication ... \\\n",
1639 | "0 10.0 10.0 ... \n",
1640 | "1 10.0 10.0 ... \n",
1641 | "2 10.0 10.0 ... \n",
1642 | "3 10.0 10.0 ... \n",
1643 | "4 9.0 8.0 ... \n",
1644 | "\n",
1645 | " host_response_rate security_deposit cleaning_fee host_since_days \\\n",
1646 | "0 1.0 0.0 60.0 3984.0 \n",
1647 | "1 1.0 0.0 80.0 3984.0 \n",
1648 | "2 1.0 1000.0 250.0 3831.0 \n",
1649 | "3 1.0 1000.0 250.0 3831.0 \n",
1650 | "4 1.0 500.0 150.0 3775.0 \n",
1651 | "\n",
1652 | " first_review_days last_review_days host_response_time host_is_superhost \\\n",
1653 | "0 3954.0 78.0 within an hour 0.0 \n",
1654 | "1 3816.0 76.0 within an hour 0.0 \n",
1655 | "2 1984.0 109.0 within a few hours 1.0 \n",
1656 | "3 3770.0 125.0 within a few hours 1.0 \n",
1657 | "4 1469.0 179.0 within an hour 0.0 \n",
1658 | "\n",
1659 | " host_has_profile_pic host_identity_verified \n",
1660 | "0 1.0 1.0 \n",
1661 | "1 1.0 1.0 \n",
1662 | "2 1.0 0.0 \n",
1663 | "3 1.0 0.0 \n",
1664 | "4 1.0 0.0 \n",
1665 | "\n",
1666 | "[5 rows x 23 columns]"
1667 | ]
1668 | },
1669 | "execution_count": 61,
1670 | "metadata": {},
1671 | "output_type": "execute_result"
1672 | }
1673 | ],
1674 | "source": [
1675 | "## DO NOT CHANGE \n",
1676 | "\n",
1677 | "# Getting indices of columns that still contain missing values\n",
1678 | "columns_idxs_missing = np.where(cleaned_df.isna().any())[0]\n",
1679 | "# Getting the names of these columns\n",
1680 | "cols_missing = cleaned_df.columns[columns_idxs_missing]\n",
1681 | "# Taking a peek at what's left\n",
1682 | "cleaned_df[cols_missing].head()"
1683 | ]
1684 | },
1685 | {
1686 | "cell_type": "markdown",
1687 | "metadata": {},
1688 | "source": [
1689 | "> To keep things simple, let's just fill the rest of these values with the median."
1690 | ]
1691 | },
1692 | {
1693 | "cell_type": "code",
1694 | "execution_count": null,
1695 | "metadata": {},
1696 | "outputs": [],
1697 | "source": [
1698 | "## TO DO: Drop the sum_na_row feature we made from cleaned_df\n",
1699 | "## We don't need this column anymore\n",
1700 | "cleaned_df = "
1701 | ]
1702 | },
1703 | {
1704 | "cell_type": "code",
1705 | "execution_count": null,
1706 | "metadata": {},
1707 | "outputs": [],
1708 | "source": [
1709 | "## TO DO: notice how we kept the review-related columns and the categorical columns in the \n",
1710 | " ## arrays \"zero_fill_cols\" and \"cat_cols\". \n",
1711 | " ## Identify all remaining columns that aren't in these two lists and store them in an array called median_fill_cols\n",
1712 | " ## Also remove \"price\" from this array.\n",
1713 | " ## I'll explain why we do these steps later in the notebook when we fill our missing values.\n",
1714 | " \n",
1715 | " \n",
1716 | " \n",
1717 | "median_fill_cols = "
1718 | ]
1719 | },
1720 | {
1721 | "cell_type": "markdown",
1722 | "metadata": {},
1723 | "source": [
1724 | "## III. Train/Test Split\n",
1725 | "**[Back to top](#toc)**\n",
1726 | "\n",
1727 | "Now we split our data into training, testing, and (optionally) validation sets.\n",
1728 | "\n",
1729 | "However, if you plan to use a validation set or K-Fold Cross Validation, just create your validation sets later when you're evaluating your ML models."
1730 | ]
1731 | },
1732 | {
1733 | "cell_type": "code",
1734 | "execution_count": null,
1735 | "metadata": {},
1736 | "outputs": [],
1737 | "source": [
1738 | "## TO DO: store cleaned_df without the price column in a variable called X. \n",
1739 | "X = \n",
1740 | "## TO DO: store cleaned_df['price'] in a variable called y\n",
1741 | "y = \n",
1742 | "\n",
1743 | "## TO DO: Split your data using train_test_split using a train_size of 80%\n",
1744 | "## TO DO: store all these in the variables below\n",
1745 | "X_train, X_test, y_train, y_test = "
1746 | ]
1747 | },
1748 | {
1749 | "cell_type": "code",
1750 | "execution_count": null,
1751 | "metadata": {},
1752 | "outputs": [],
1753 | "source": [
1754 | "## RUN THIS, BUT DO NOT CHANGE\n",
1755 | "# Setting this option to None to suppress a warning that we don't need to worry about right now\n",
1756 | "pd.options.mode.chained_assignment = None"
1757 | ]
1758 | },
1759 | {
1760 | "cell_type": "markdown",
1761 | "metadata": {},
1762 | "source": [
1763 | "## IV. Prepare for ML \n",
1764 | "**[Back to top](#toc)**"
1765 | ]
1766 | },
1767 | {
1768 | "cell_type": "markdown",
1769 | "metadata": {},
1770 | "source": [
1771 | "Now that we've already split our data and engineered the features that we want, all we have to do is prepare our data for our models."
1772 | ]
1773 | },
1774 | {
1775 | "cell_type": "markdown",
1776 | "metadata": {},
1777 | "source": [
1778 | "### A. Dealing with Missing Data\n",
1779 | "\n",
1780 | "We deal with missing data *after* we've split our data because we want testing to simulate real world conditions as closely as we can: fill values (such as a median) should be learned from the training data only, so that nothing from the test set leaks into them. When data is coming/streaming in, we have to be ready with our methods for dealing with missing data.\n",
1781 | "\n",
1782 | "Below, rather than using pandas' `fillna` method, we will take advantage of sklearn's `SimpleImputer` estimator (imputing is just another way of saying you're going to fill/infer missing values in this case)."
1783 | ]
1784 | },
1785 | {
1786 | "cell_type": "markdown",
1787 | "metadata": {},
1788 | "source": [
1789 | "---\n",
1790 | "\n",
1791 | "**A Brief Note on sklearn Estimators/Transformers**\n",
1792 | "\n",
1793 | "Many of sklearn's objects are called \"estimators\": objects that estimate (learn) some parameters about your data when you call `fit`. Estimators that can then modify your data with `transform` are called \"transformers\"; they produce a transformed (e.g. normalized, standardized, filled NA's with mean, etc) version of your data. Other estimators, such as models, use `predict` to produce a prediction instead.\n",
1794 | "\n",
1795 | "---"
1796 | ]
1797 | },
1798 | {
1799 | "cell_type": "markdown",
1800 | "metadata": {},
1801 | "source": [
1802 | "We will *fit* three `SimpleImputer` objects on **`X_train` only** according to each of our three strategies above. Then, we will use these imputers to transform **both our `X_train` and `X_test`.** As a reminder, this is what we will do...\n",
1803 | "\n",
1804 | "1. Fill categorical features stored in `cat_cols` with their mode/most frequent value\n",
1805 | "2. Fill review-related features stored in `zero_fill_cols` with a constant value: 0.\n",
1806 | "3. Fill all remaining numerical features stored in `median_fill_cols` with their median.\n",
1807 | "\n",
1808 | "> This is why we stored these column names in the **Assess Missing Values** section. We want to easily change each of these columns for both our X_train and X_test datasets. Also, remember how we dropped `price` from `median_fill_cols`? We needed to remove it because there is no `price` in `X_train` and `X_test`."
1809 | ]
1810 | },
1811 | {
1812 | "cell_type": "markdown",
1813 | "metadata": {},
1814 | "source": [
1815 | "**First, let's start with imputing our categorical variables.**"
1816 | ]
1817 | },
1818 | {
1819 | "cell_type": "code",
1820 | "execution_count": 68,
1821 | "metadata": {},
1822 | "outputs": [],
1823 | "source": [
1824 | "## DO NOT CHANGE - use this as an example of what you have to do in the cells below for numerical variables\n",
1825 | " ## Notice how we're looping through our columns, imputing one at a time.\n",
1826 | " ## Normally, we would fit and transform features all at once with sklearn's ColumnTransformer, but \n",
1827 | " ## this is fine since we're just practicing\n",
1828 | "\n",
1829 | "# looping through our columns\n",
1830 | "for col in cat_cols:\n",
1831 | " # instantiating/creating an imputer with an impute strategy of \"most frequent\"\n",
1832 | " imputer = SimpleImputer(strategy='most_frequent')\n",
1833 | " \n",
1834 | " # fit this imputer to the training column. \n",
1835 | " # This stores the most frequent value in the imputer for transforming\n",
1836 | " imputer.fit(X_train[[col]])\n",
1837 | " \n",
1838 | " # using transform to fill NA values with the most frequent value; .ravel() flattens the 2D (n_rows, 1) result before updating our DFs\n",
1839 | " X_train[col] = imputer.transform(X_train[[col]]).ravel()\n",
1840 | " X_test[col] = imputer.transform(X_test[[col]]).ravel()"
1841 | ]
1842 | },
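{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ColumnTransformer` mentioned in the comment above can apply all three strategies in one step. Here is a rough sketch on a made-up frame; the column names and values are assumptions for illustration only, not columns from our listings data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# toy train/test frames (hypothetical columns and values)\n",
"toy_train = pd.DataFrame({'cat_feature': ['a', 'a', np.nan, 'b'],\n",
"                          'review_feature': [4.5, np.nan, 3.0, 5.0],\n",
"                          'num_feature': [1.0, 2.0, np.nan, 10.0]})\n",
"toy_test = pd.DataFrame({'cat_feature': [np.nan, 'b'],\n",
"                         'review_feature': [np.nan, 2.0],\n",
"                         'num_feature': [np.nan, 3.0]})\n",
"toy_cols = ['cat_feature', 'review_feature', 'num_feature']\n",
"\n",
"# one imputer per group of columns; the output columns follow this order\n",
"ct = ColumnTransformer([\n",
"    ('mode', SimpleImputer(strategy='most_frequent'), ['cat_feature']),\n",
"    ('zero', SimpleImputer(strategy='constant', fill_value=0), ['review_feature']),\n",
"    ('median', SimpleImputer(strategy='median'), ['num_feature']),\n",
"])\n",
"\n",
"# fit on the training frame only, then reuse the learned fill values on the test frame\n",
"toy_train_filled = pd.DataFrame(ct.fit_transform(toy_train), columns=toy_cols, index=toy_train.index)\n",
"toy_test_filled = pd.DataFrame(ct.transform(toy_test), columns=toy_cols, index=toy_test.index)\n",
"print(toy_test_filled)\n",
"```\n",
"\n",
"The per-column loop in this lab does the same work in smaller steps, which makes each fit/transform call easier to follow while practicing."
]
},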
1843 | {
1844 | "cell_type": "markdown",
1845 | "metadata": {},
1846 | "source": [
1847 | "---"
1848 | ]
1849 | },
1850 | {
1851 | "cell_type": "markdown",
1852 | "metadata": {},
1853 | "source": [
1854 | "**Now let's impute our numerical variables.**"
1855 | ]
1856 | },
1857 | {
1858 | "cell_type": "code",
1859 | "execution_count": null,
1860 | "metadata": {},
1861 | "outputs": [],
1862 | "source": [
1863 | "## TO DO: impute the zero_fill_cols features using an imputer with strategy = \"constant\" and fill_value = 0\n",
1864 | "## Use what we did above for cat_cols as a reference"
1865 | ]
1866 | },
1867 | {
1868 | "cell_type": "code",
1869 | "execution_count": null,
1870 | "metadata": {},
1871 | "outputs": [],
1872 | "source": [
1873 | "## TO DO: impute the median_fill_cols using an imputer with strategy = \"median\"\n",
1874 | "## Use what we did above for cat_cols as a reference"
1875 | ]
1876 | },
1877 | {
1878 | "cell_type": "markdown",
1879 | "metadata": {},
1880 | "source": [
1881 | "### B. Feature Engineering "
1882 | ]
1883 | },
1884 | {
1885 | "cell_type": "markdown",
1886 | "metadata": {},
1887 | "source": [
1888 | "> Use this section as an opportunity to create useful features for your ML model. Note that any features you create might create NA or Infinite values, which have to be taken care of before using the data in most ML models.\n",
1889 | "\n",
1890 | "**An easy idea**: Ratio of capacity to beds."
1891 | ]
1892 | },
1893 | {
1894 | "cell_type": "code",
1895 | "execution_count": null,
1896 | "metadata": {},
1897 | "outputs": [],
1898 | "source": [
1899 | "## TO DO: In X_train, create a new feature called \"capacity_to_beds\" by dividing the \"accommodates\" feature by \"beds\"\n",
1900 | "\n",
1901 | "## TO DO: Do the same thing for X_test. Can you think of anything that can go wrong if you do this?"
1902 | ]
1903 | },
1904 | {
1905 | "cell_type": "markdown",
1906 | "metadata": {},
1907 | "source": [
1908 | "Be careful with ratios because\n",
1909 | "1. dividing by zero might create infinite values and \n",
1910 | "2. any operations with NA values create more NA values. This *shouldn't* be a problem because we already took care of NA values, but try to remember this.\n",
1911 | "\n",
1912 | "I think filling these values with zero is reasonable for now."
1913 | ]
1914 | },
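{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see both pitfalls on a small scale, here is a sketch with made-up numbers (the `guests` and `beds` columns below are hypothetical, not taken from our data):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# toy columns (hypothetical values)\n",
"toy = pd.DataFrame({'guests': [4.0, 2.0, 0.0], 'beds': [2.0, 0.0, 0.0]})\n",
"toy['ratio'] = toy['guests'] / toy['beds']\n",
"\n",
"# 4/2 is an ordinary number, 2/0 becomes inf, and 0/0 becomes NaN\n",
"print(toy['ratio'].tolist())\n",
"# count how many infinite and how many missing values the division produced\n",
"print(np.isinf(toy['ratio']).sum(), toy['ratio'].isna().sum())\n",
"```\n",
"\n",
"pandas does not raise an error for either case, so it is worth checking a new ratio column for both kinds of values before modeling."
]
},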
1915 | {
1916 | "cell_type": "code",
1917 | "execution_count": null,
1918 | "metadata": {},
1919 | "outputs": [],
1920 | "source": [
1921 | "## TO DO: Fill infinite values in this new column with zero in X_train and X_test. \n",
1922 | "## (Hint: np.where and np.isinf can be helpful)\n"
1923 | ]
1924 | },
1925 | {
1926 | "cell_type": "markdown",
1927 | "metadata": {},
1928 | "source": [
1929 | "### C. Transform Data"
1930 | ]
1931 | },
1932 | {
1933 | "cell_type": "markdown",
1934 | "metadata": {},
1935 | "source": [
1936 | "**Transforming Numerical Data - Log Transform**"
1937 | ]
1938 | },
1939 | {
1940 | "cell_type": "markdown",
1941 | "metadata": {},
1942 | "source": [
1943 | "Now is a good time to do any numerical data transformations if you haven't done them already.\n",
1944 | "\n",
1945 | "An example could be to log-transform salary or price fields to make the distributions look more normal. Here's one way you can do that."
1946 | ]
1947 | },
1948 | {
1949 | "cell_type": "code",
1950 | "execution_count": 71,
1951 | "metadata": {},
1952 | "outputs": [
1953 | {
1954 | "name": "stderr",
1955 | "output_type": "stream",
1956 | "text": [
1957 | "/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/pandas/core/series.py:853: RuntimeWarning: divide by zero encountered in log\n",
1958 | " result = getattr(ufunc, method)(*inputs, **kwargs)\n"
1959 | ]
1960 | },
1961 | {
1962 | "data": {
1963 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtEAAAEXCAYAAABrkBgzAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nOzdeXxddZ3/8dcne5MmaZqka9qmdKG0pWUpZd8EBVwo44AWXFBBfuPIjI4zP0d/Kq44bqOOA+owgiLKJi5ULeICiGUpLWXrTroladM2bfZmTz6/P85JuU1vkps0yb1J3s/HIw/vPed7zv3cKz33c7/n+/18zd0REREREZHYJcU7ABERERGRkUZJtIiIiIhIPymJFhERERHpJyXRIiIiIiL9pCRaRERERKSflESLiIiIiPSTkmgZFGb2ETM7YGYNZpYf73hGEzObbmbPmlm9md0e73hEZGQys3Fm9lszqzWzX8Q7ntHAzMab2WNmVmdm98U7HhleSqIFADPbbWZNYRJcbWa/N7MZMR6bCnwbeIu7j3f3w0Mc6xfM7GdD/Bp/DD+LBjNrM7PWiOd3DOVrR/GPwG53z3b3zwzza4vICGNmT4XX8fRuu64FJgP57n6dmX3AzNYM8msP+jmjvMaPIq7HreE1uuv5b4fytaO4HhgP5Ln7+4b5tSXOlERLpHe4+3hgKnAA+O8Yj5sMZACb+vuCFki4/w7dvesHwXjgIeCrXc/d/dbu7c0sZQjDmQVsHsiBQxyXiCQYMysGLgQcuLrb7lnAdndvH6TXisv1xd1vjrg+fwP4ecT1+R3d2w/D9Xmbu3f090Bdn0e+hEteJP7cvRl4BFjYtc3M0s3sW2ZWGg7b+GF4a3A+sC1sVmNmT4TtzzOzdeFtw3Vmdl7EuZ4ys9vN7BmgETjJzHLN7G4zqzCzvWb2FTNL7m/sZnZKeP4aM9tkZldH7MsPb2XWhTF9ZaA9JmZ2pZmVmNnnzOwA8AMzKwxv61WaWZWZPWpmUyOOed7MPh/+b52ZrTazvHBflpk9GB5XY2ZrzSzPzB4A3g18LuxluTD83O8MP6tyM/tmeDegp7i6tn3WzA6Fn+9bzWyFme0ws8Nm9q8D+RxEJOG8H3ge+AlwY9dGM/sicBvw7vBa8lHgh8C54fOasF3Ua32475LwmvPvZrYf+HF/AjOzaWa2KrzOlZjZhyP2jTOze8Me9C1m9kkzKx/IB2BmC8ys3cw+bGZlwGozSzGzX4bvqcbMnjSzkyOOedDMvmtmj1swdO4ZM5sV7ks2szvCa3utmb1iZieb2deBTwI3hp/he8K2X4z4/O4xs+xe4uradlN4bT5sZh8ys3PNbGMY67cH8jnI0NOvIDmOmWUSJG7PR2z+OnAScBrQBtwP3ObunzazRcAuYIK7t5vZROD3wD8DDwDXAb83s7kRQz3eB1xFkIAb8AuC3u+5QBbwO6AM+J9+xJ0K/Ba4B3gLcAHwqJktc/dtwJ3AEWAKUAw8DuyJ/ZM5TjGQCswAkoEcgi+lPwJpwE+B7wArI465AXgrsB/4E/Ax4AvAzQT/HqcTfL6nA63ufr2ZGbDR3b8Svs9vAEuAU8PX/R3BhbxrvHT3uC4m6C1pC9/7R8LP6E/heeYDz5nZg+6+9wQ+DxGJv/cTDK9bCzxvZpPd/YC7f97MHJjr7u8FMLMjwM3ufkHE8VGv9cCnw/1TgIkE15T+dsQ9QHDHchqwAPiTme10978Anye4dp1E8B2wup/n7i4ZOBs4maBXHmAVwQ+LdoJr873AORHH3ABcCbwWxvpF4APA24EzgTlAA3AKUO3u/x5+pgXufjOAmf0j8C6CuwHVBJ/ft4GuHwzd45oVblsSvvcrwtf+A8G1Owt4xcwecve1J/iZyGBzd/3pD2A3wcWhhuACsw84NdxnBMnnnIj25wK7wsfFBBeDlPD5+4AXup3/OeAD4eOngC9
F7JsMtADjIrZdDzzZQ6xfAH4WZfuFBMlpUsS2B8L2yQRfCCdH7PsKsCaGz+ZnwBe6bbsy/ExSeznuHKAi4vnzwL9FPP8E8Jvw8T8CfwUWRznPg8BnI57vBd4U8XwFsLWnuMJttV2fC1AY/v+1NKLNJuDKeP93qD/96W/gfwQdB20ESR3AVuBfIvYfc+0kSBDXRDzv61p/CdAKZPQSwzHnjNg+A+gAsiO2/Qfwk/DxTuCKiH03A+UxvOevdJ0jYtuC8Bo3rZfjpgCdXe8lvM7eEbH/ncDL4eO3htfI5YB1O8/XgB9FPH8G+FDE86UEd1wtWlwR2/Ijth0BVkQ8/z3wD/H+70t/x/9pOIdEusbdJwDpwK3AX81sCkHSlQm8GN5aqiH4lVzYw3mmcXwP7x6CXtYuZRGPZxH0nFZEnP9/gEn9jH8aUObunVFet5CgpzfydY8+NrP/Z29MTPlhjK+3393bIs6RHd66KzWzOoIe6YLux0Q8biSYkAJwN0ES/Uh4u/SrFmU4S9grPYVjP9/un+0xcYUqIz6XpvB/D0Tsb4qIRURGphuBP7r7ofD5/UQM6YhBLNf6Sg+G/PXXNKDK3esjtkVeu6bR8/X5PRHX58difL1Od98XcY6UcJjKzvD6vJUgsY2sJtXT9fkxgmv0/wAHzOz7ZtbT9bL7998eYBxB7/1xcYU6/NgJ+U3o+jwiKImW47h7h7v/iqDX4ALgEME/4kXuPiH8y/VgUkc0+wgS40gzCXpQj75MxOMygp7ogojz57j7on6Gvg+YYcdOVOx63UqCHvaiiH1Hq4+4e+TEwX+I8fW82/NPhec/y91zCIaUWEwncm9x99vcfQFwEcEQmJVR2jnBhT7y8+3tsxWRMSAct/wu4GIz2x+OWf4XYKmZLe3hsO7Xiliu9QO9vuwDJnaNDw5FXrsq6Pn6HDlx8KoYX697nB8kuCZfCuQS9ABDDNdoD3zb3U8nGHaxlGAoXjTdv/9mEnymVT3EJSOYkmg5jgVWAHnAlrAH83+B75jZpLDNdDO7oodTrAbmm9kN4a//dxNMUvxdtMbuXkHQa/ufZpZjZklmNsfMLu4lzCQzy4j4SycYA3gE+KSZpZrZJcA7gAc9mDn9K+ALZpZpZgsIxg4OpmyC3osaMysAPhvrgWZ2uZktDH8A1BEk/D3N9n4A+LwFEyUnAZ8hGHIiImPXNQTXjIUE45lPIxi7+zd6vtYdAIrMLA1gANf6nli363OGu5cBzwL/EW5bAtwE/Dw85mHg0xZMqJ5OcDd0MGUDzcBhgnHGX4n1QDM7x8yWWVBN4wjBkJbers//ZmYzwx8MXwHuDztAZJRREi2RfmtmDQRJ3O3Aje7eVbbu34ESgokqdcCfCSZGHCe8LfV24F8JLlifBN4ecYsxmvcTTMbbTDAZ4xGCUns9uZ7g133X3w53byUo6XQVQY/K94H3u/vW8JhbCXog9gP3EVzsWnp5jf76FsHwjcPAGvo3MWY68ChQD2wMj324h7a3EXxOm4CXCcbgfWNgIYvIKHEj8GN3L3X3/V1/wB3Aeyx6ObUnCK4j+82s6/oc87W+F+dx7PW5KXz96wnm0OwDfg183t3/FB7zJaCcYJL6nwm+Awbz+nw3wR3J/QQTB/tTmWkCQbWTGoKx23uA7/XQ9gcEHTbPAjsIeqA/MaCIJeGZfhzJWGVBeaIp7t6fMYMiIjLEzOwjwEp37+2OpEhcqSdaxoywHueScLjKcoJbib+Od1wiImOdmU01s/PD4XwnE9zJ1PVZEprqRMtYkk0whGMacBD4T4IhFCIiEl9pBNUvZhMMm3iQYEieSMLScA4RERERkX7ScA4RERERkX4aUcM5CgoKvLi4ON5hiIj024svvnjI3XtaoGhU0jVbREaqWK7ZIyqJLi4uZv369fEOQ0Sk38ys+yqeo56u2SIyUsVyzY5pOIeZXWlm28ysxMw+FWV/upk9FO5fa2bF4fZ8M3syXKrzjm7HpJnZXWa
23cy2mtnfx/a2RERERETiq8+eaDNLBu4E3kxQCH2dma1y980RzW4Cqt19rpmtBL4OvJtgdaDPAYvDv0ifAQ66+/xwlbaJiIiIiIiMALH0RC8HStx9Z7gi3IPAim5tVgD3ho8fAS4zM3P3I+6+hiCZ7u5DwH9AsNRoH6vZiYiIiIgkjFiS6OlAWcTz8nBb1Dbu3g7UAvk9ndDMJoQPv2xmG8zsF2Y2uYe2t5jZejNbX1lZGUO4IiIiIiJDK5Yk2qJs615cOpY2kVKAIuAZdz8DeA74VrSG7n6Xuy9z92WFhWNqYruIiIiIJKhYkuhyYEbE8yJgX09tzCwFyAWqejnnYaCRN5b0/AVwRgyxiIiIiIjEXSxJ9DpgnpnNNrM0YCWwqlubVcCN4eNrgSe8l6UQw32/BS4JN10GbO6pvYiIiIhIIumzOoe7t5vZrcDjQDJwj7tvMrMvAevdfRVwN3CfmZUQ9ECv7DrezHYDOUCamV0DvCWs7PHv4THfBSqBDw7uWxMRERERGRoxLbbi7quB1d223RbxuBm4rodji3vYvge4KNZARUREREQSxYhasXCsuH9t6XHbbjh7ZhwiERERkeGkHGDkiGnFQhERGb3M7B4zO2hmG3vY/x4zezX8e9bMlg53jCIiiUZJtIiI/AS4spf9u4CL3X0J8GXgruEISkQkkWk4h4jIGOfuT5tZcS/7n414+jxBqVMRkTFNPdEiItIfNwGP9bRTq8yKyFihJFpERGJiZpcSJNH/3lMbrTIrImOFhnOIiEifzGwJ8CPgKnc/HO94RETiTT3RIiLSKzObCfwKeJ+7b493PCIiiUA90SIiY5yZPQBcAhSYWTnweSAVwN1/CNwG5APfNzOAdndfFp9oRUQSg5JoEZExzt2v72P/zcDNwxSOiMiIoOEcIiIiIiL9pCRaRERERKSflESLiIiIiPSTkmgRERERkX5SEi0iIiIi0k9KokVERERE+klJtIiIiIhIP8WURJvZlWa2zcxKzOxTUfanm9lD4f61ZlYcbs83syfNrMHM7ujh3KvMbOOJvAkRERERkeHUZxJtZsnAncBVwELgejNb2K3ZTUC1u88FvgN8PdzeDHwO+Lcezv1OoGFgoYuIiIiIxEcsPdHLgRJ33+nurcCDwIpubVYA94aPHwEuMzNz9yPuvoYgmT6GmY0HPgF8ZcDRi4iIiIjEQSxJ9HSgLOJ5ebgtaht3bwdqgfw+zvtl4D+Bxt4amdktZrbezNZXVlbGEK6IiIiIyNCKJYm2KNt8AG3eaGx2GjDX3X/d14u7+13uvszdlxUWFvbVXERERERkyMWSRJcDMyKeFwH7empjZilALlDVyznPBc40s93AGmC+mT0VW8giIiIiIvEVSxK9DphnZrPNLA1YCazq1mYVcGP4+FrgCXfvsSfa3X/g7tPcvRi4ANju7pf0N3gRERERkXhI6auBu7eb2a3A40AycI+7bzKzLwHr3X0VcDdwn5mVEPRAr+w6PuxtzgHSzOwa4C3uvnnw34qIiIiIyPDoM4kGcPfVwOpu226LeNwMXNfDscV9nHs3sDiWOEREREQS2f1rS6Nuv+HsmcMciQw1rVgoIiIiItJPSqJFRERERPpJSbSIiIiISD8piRYRERER6Scl0SIiIiIi/aQkWkRERESkn5REi4iIiIj0k5JoEREREZF+UhItIiIiItJPSqJFRMY4M7vHzA6a2cYe9puZfc/MSszsVTM7Y7hjFBFJNEqiRUTkJ8CVvey/CpgX/t0C/GAYYhIRSWhKokVExjh3fxqo6qXJCuCnHngemGBmU4cnOhGRxKQkWkRE+jIdKIt4Xh5uO46Z3WJm681sfWVl5bAEJyISD0qiRUSkLxZlm0dr6O53ufsyd19WWFg4xGGJiMSPkmgREelLOTAj4nkRsC9OsYiIJAQl0SIi0pdVwPvDKh3nALXuXhHvoERE4ikl3gGIiEh8mdkDwCVAgZmVA58HUgHc/YfAauCtQAn
QCHwwPpGKiCQOJdEiImOcu1/fx34HPjpM4YiIjAgxDecwsyvNbFtYaP9TUfanm9lD4f61ZlYcbs83syfNrMHM7ohon2lmvzezrWa2ycy+NlhvSERERERkqPWZRJtZMnAnQbH9hcD1ZrawW7ObgGp3nwt8B/h6uL0Z+Bzwb1FO/S13XwCcDpxvZlcN7C2IiIiIiAyvWIZzLAdK3H0ngJk9SFB4f3NEmxXAF8LHjwB3mJm5+xFgjZnNjTyhuzcCT4aPW81sA8FsbxkC968tPW7bDWfPjEMkIiIiIqNDLMM5Yimyf7SNu7cDtUB+LAGY2QTgHcBfetivwv0iIiIiklBiSaJjKbIfcyH+Yw4ySwEeAL7X1dN93ElUuF9EREREEkwsSXQsRfaPtgkT41ygKoZz3wW87u7fjaGtiIiIiEhCiCWJXgfMM7PZZpYGrCQovB9pFXBj+Pha4ImwJFKPzOwrBMn2x/sXsoiIiIhIfPU5sdDd283sVuBxIBm4x903mdmXgPXuvgq4G7jPzEoIeqBXdh1vZruBHCDNzK4B3gLUAZ8BtgIbzAzgDnf/0WC+ORERERGRoRDTYivuvppgxarIbbdFPG4Gruvh2OIeThttHLWIiIiISMKLabEVERERERF5g5JoEREREZF+UhItIiIiItJPSqJFRERERPpJSbSIiIiISD8piRYRERER6Scl0SIiIiIi/aQkWkRERESkn5REi4iIiIj0k5JoEREREZF+UhItIiIiItJPSqJFRERERPpJSbSIiIiISD8piRYRERER6Scl0SIiIiIi/aQkWkREMLMrzWybmZWY2aei7J9pZk+a2Utm9qqZvTUecYqIJIqUeAcg8Gp5DS+V1tDa3smiaTnxDkdExhgzSwbuBN4MlAPrzGyVu2+OaPZZ4GF3/4GZLQRWA8XDHqyISIKIqSc6hh6KdDN7KNy/1syKw+35Yc9Fg5nd0e2YM83stfCY75mZDcYbGmncnVt++iKfX7WJ21dv4aP3b8Dd4x2WiIwty4ESd9/p7q3Ag8CKbm0c6PqVnwvsG8b4REQSTp9JdEQPxVXAQuD6sBci0k1AtbvPBb4DfD3c3gx8Dvi3KKf+AXALMC/8u3Igb2Ck23agnv11zXzx6kV8ecUiqhvbqGxoiXdYIjK2TAfKIp6Xh9sifQF4r5mVE/RC/9PwhCYikphi6YmOpYdiBXBv+PgR4DIzM3c/4u5rCJLpo8xsKpDj7s950O36U+CaE3kjI9XT2ysBeMuiyZw7pwCA0sON8QxJRMaeaHcCu98Sux74ibsXAW8F7jOz475DzOwWM1tvZusrKyuHIFQRkcQQSxIdSw/F0Tbu3g7UAvl9nLO8j3OOCX/dXsn8yeOZmjuOOYVZ5GWmskdJtIgMr3JgRsTzIo4frnET8DCAuz8HZAAF3U/k7ne5+zJ3X1ZYWDhE4YqIxF8sSXQsPRSxtBlQ+9Hcq9HY2s66XdVcNC/4ojEzzpyVx56qI3GOTETGmHXAPDObbWZpwEpgVbc2pcBlAGZ2CkESPbouyiIi/RBLEh1LD8XRNmaWQjDppKqPcxb1cU5gdPdqrN1ZRWtHJxef/Mb7OmNWHocaWmloaY9jZCIyloR3EG8FHge2EFTh2GRmXzKzq8Nm/wp82MxeAR4APuCaBS0iY1gsJe6O9lAAewl6KG7o1mYVcCPwHHAt8ERvF1d3rzCzejM7B1gLvB/47wHEP6L9dXslGalJnFU88ei2ZbOCx2VVjZwyVeXuRGR4uPtqggmDkdtui3i8GTh/uOMSEUlUfSbR7t5uZl09FMnAPV09FMB6d18F3E0wyaSEoAd6ZdfxZraboCxSmpldA7wlvBh/BPgJMA54LPwbU57eXsk5J+WTkZp8dNuSolySzdhz+IiSaBEREZEEFdNiKzH0UDQD1/VwbHEP29cDi2MNdLQ51NDCzkNHWLl8xjHbM1KTmTYhQ5MLRURERBKYlv2Ok417awE4dfqE4/bNys+ivKa
J9s7O4Q5LRERERGKgJDpONu2rA2DR9OOHbEzOyaCj06ltbBvusEREREQkBkqi4+S18lqK8zPJyUg9bt/ErDQAqo60DndYIiIiIhIDJdFxsnFfLYun50bddzSJblQSLSIiIpKIlETHQfWRVsqrm3pMorMzUkhOMvVEi4iIiCQoJdFx0DUeevG06El0khl5mWlKokVEREQSVEwl7qR3968tPW7bDWfP7LH9xn1BZY7F03OiHgswMSuVaiXRIiIiIglJPdFx8NreWoryxjEhM63HNhOz0jh8pBWtqisiIiKSeJREx8GmvbWc2sN46C4TM9Noae+kqa1jmKISERERkVgpiR5mdc1t7D7c2OOkwi4Ts9IBlbkTERERSURKoofZa+VdKxX2lUSrVrSIiIhIolISPcw27KnGDE6befxy35HysoJFWJREi4iIiCQeJdHDbENpNXMLx0ddqTBSekoyWekpSqJFREREEpCS6GHk7rxUVsMZM/Niaj8xM1WrFoqIiIgkICXRw2jnoSPUNLZxxqzeh3J0mZiVplrRIiIiIglISfQw2rCnGiD2nuisNGoa2+joVK1oERERkUSiJHoYbSitIScjhTmF42NqPzErDQdqNKRDREREJKEoiR5GL5VWc9rMPJKSLKb2eV1l7pREi4iIiCSUmJJoM7vSzLaZWYmZfSrK/nQzeyjcv9bMiiP2fTrcvs3MrojY/i9mtsnMNprZA2aWMRhvKFHVN7ex7UA9Z/RR2i5SXrgseE1j21CFJSIiIiID0GcSbWbJwJ3AVcBC4HozW9it2U1AtbvPBb4DfD08diGwElgEXAl838ySzWw68M/AMndfDCSH7UatV8pqcY99PDRATkYqSQbV6okWERERSSix9EQvB0rcfae7twIPAiu6tVkB3Bs+fgS4zMws3P6gu7e4+y6gJDwfQAowzsxSgExg34m9lcT2+Kb9pKckccas2JPo5CQjZ1yqeqJFREREEkwsSfR0oCzieXm4LWobd28HaoH8no51973At4BSoAKodfc/RntxM7vFzNab2frKysoYwk08bR2d/P61Ci5fOJnx6Sn9OjYvM0090SIiIiIJJpYkOtosuO4113pqE3W7meUR9FLPBqYBWWb23mgv7u53ufsyd19WWFgYQ7iJ52+vV1J1pJVrTuv+26NveZnqiRYRERFJNLEk0eXAjIjnRRw/9OJom3B4Ri5Q1cuxlwO73L3S3duAXwHnDeQNjAS/eWkfEzJTuXh+/38ETMhMo66pjdb2ziGITEQk0NcE8rDNu8xsczgp/P7hjlFEJJHEkkSvA+aZ2WwzSyOYALiqW5tVwI3h42uBJ9zdw+0rw+ods4F5wAsEwzjOMbPMcOz0ZcCWE387iedISzt/2nyAt506lbSU/lcUzMsMakVX1DYNfnAiIsQ2gdzM5gGfBs5390XAx4c9UJEE5u4crG/W2g5jSJ8DdN293cxuBR4nqKJxj7tvMrMvAevdfRVwN3CfmZUQ9ECvDI/dZGYPA5uBduCj7t4BrDWzR4AN4faXgLsG/+3F32Mb99PU1sE1p/d/KAcEwzkAyqubmJWfNZihiYh0OTqBHMDMuiaQb45o82HgTnevBnD3g8MepUiCuu/5PXz3T9s5fKSV1GTjXy6fz4SwTK2MXjHNcnP31cDqbttui3jcDFzXw7G3A7dH2f554PP9CXakKatq5Pbfb2bBlGzO7Edpu0hdtaLLqxsHMzQRkUjRJoGf3a3NfAAze4agQ+UL7v6H7icys1uAWwBmzpw5JMGKJJL65ja+/thW5hRmcctFJ/GNx7fxl60H+fsziuIdmgwxrVg4RJpaO/g/971Ie6fzg/eeGfMqhd3ljEvFgL3VGs4hIkMmlgnkKQRD8i4Brgd+ZGbHrR41GiaDi/THL9aX09DSzpevWcz/uXgO58yeyIY91Ryoa453aDLElEQPgea2Dm69fwNb9tfxXytPY3bBwIdhJCcZueNSKVcSLSJDJ9YJ5I+
6e1tY938bQVItMmZ1dDo/eXY3y2blsaQo+E15ycmTSEtJ4o+bD8Q5OhlqSqIH2cG6Zn7w1A6e3HaQL169iDctmHzC55yQmaYkWkSGUiwTyH8DXApgZgUEwzt2DmuUIgnmia0HKa1q5IPnzz66LSs9hQvnFbKloo796o0e1ZRED6KGlnb+9287aWxt52c3n837zy0elPPmZaZqTLSIDJlwkayuCeRbgIe7JpCb2dVhs8eBw2a2GXgS+L/ufjg+EYskhnuf3c203AyuWHRsh9lZxXkYsHFvbXwCk2GhJHoQ/e7VfTS3dXLTBSdx3pyCQTtvXlYa++ua+1Ur2t35/lMllBxsGLQ4RGT0cvfV7j7f3eeEE8Jx99vCCkx44BPuvtDdT3X3B+MbsUh8NbS089zOw1xz+nRSko9Np7IzUikuyFISPcopiR4kWyrqeLW8lktOLmRKbsagnjsvM5VOh/21sd8Wemzjfr7xh218YdWmQY1FREREYN2uKjo6nQvmRu80Wzwth4P1LRwcwJCOptYOGlraae/UQmuJTEn0IGjr6OTRl/cyJSeDi08e/NnoE/pZ5q6to5NvPr6N1GRjTckh9hw+MugxiYiIjGXPlBwiLSWJM2ZFL2G7aFouABv31fXrvIcaWvjaH7bw1dVbuO3RTfx5iyYoJiol0YNg6/566prbuXLxFFKSBv8j7aoVXRZjEv3QujJ2HTrCt991GvlZaTyxVWsiiIiIDKZndxzmzJl5ZKQmR92fMy6VWRMz2bSvf0M6Vr9WQZIZbzt1KsX5Wax5/RC1jW2DEbIMspgWW5HevVJWQ3Z6CnMnje+13f1rSwd0/gmZqaSnJMU0vrmxtZ3v/vl1lhdP5O1LprK3pomvPbaVsqpGZkzMHNDri4iIyBuqj7SyuaKOf33z/F7bLZqey+rXKjjc0BLTeZ/adpCt++u5ctEUzp9bwEmFWfz3EyXc/0IpH7lkzmCELoNIPdEnqLaxjW0H6llSlEuSDWxBlb4kmTGncDyvx5BE/3VbJYcaWvjY5fMwM953zizGpSbz/E5NohcRERkMXd+p5/UwHrrL4mk5QGxDOto6Ovny7zaTn5XGeXPyAZiaO465k8bz42d20dLecYJRy2BTEn2CHttYQWhbnMQAACAASURBVEens3TGcQt3Dap5k8fz+oG+k+hndhwiKy2Z5bMnAkG9ypkTM6nox6REERER6dmzOw6TlZbMkqLcXttNyEyjKG9cTFU6fvfqPnZUHuGqxVOPqfZx4dwCDta3sOrl7usfSbxpOMcJ+s3Le8nPSmP6hHFD+jrzJ2fz6Mv7aGhpZ3x6z/+3PVtymOWzJ5Ia8Q9wck46JZUNdHQ6yQNcflxEREQCz+44dNx3bU8WT8vlD5v29zms8r7n9nBSQRYLpmYfs33upPEsmJLNz9aWct2yGT0cHYg2bPSGs2f2GaMMjHqiT8D+2mbW7qritBkTsCEaytGla7x1b+OiK2qb2HnoCOd3u700KSeDjk7n8JHYxmSJiIhIdJX1LeyoPMK54ZCLviwKh3Q8vml/j2027q1lQ2kN7zln1nFDQ82MKxdP4dXyGk0wTDBKok/Ak9sO4g6Lp/d+O2cwzJ8c/DLdfqC+xzbPlARjtLr/w56cHdStPlinJFpEROREvFxWA8AZM6OXtusuf3w603IzWP1aRY9tfr52DxmpSVx7RlHU/RfMLcAdntt5qP8By5BREn0Cnik5xKTsdCZlpw/5a82cmElaHxU6ni05xMSsNE6ZknPM9sLsdAw4MICC7yIiIvKGl8uqSUmyfnWgLZqey4bSGipqm47bV9vUxm9e2seKpdPJzUyNevzSGRPISks+2lkmiUFjogeos9N5dsdhLplfGHUox0DL2fUkOSmo0NFTT7S788yOQ5x7Uj5J3cY9p6UkkZeVxoF69USLiIiciJfLalgwNbvH+tDRLJ6Wy582H+D3r1Zw84UnHbPv/rWlNLV18N5zZvV4fGpyEmeflM8
zJeqJTiTqiR6grfvrqTrS2md5m8E0b1LPFTp2VB7hQF0L582NPkZrcnb6gJYeFRERkUBHp/NKWS2nz4htKEeXwux0lhdP5Id/3UF98xvjmg81tPD9J0u49ORCTu2j0sd5c/LZeegI+2qO782W+FASPUBdvwbP7yFpHQrzJ49nb00TR1raj9u35vXKIJ450ZP6STkZHGpoob2zc0hjFBERGa12VDbQ0NLOaQMoa/vZt5/CoYZW7nxyx9Ft3/nTdhrbOvjM207p8/gL5gXf7+qNThwxJdFmdqWZbTOzEjP7VJT96Wb2ULh/rZkVR+z7dLh9m5ldEbF9gpk9YmZbzWyLmZ07GG9ouKwpOcScwiym5g5tabtIcycFkwujjYv+y9aDnFSYRXFBVtRjJ+ek0+lwqKF1SGMUEREZrV4uDSYVnjaz/0n0kqIJvPOM6dyzZhfb9tfzbMkhHnihlPeePfPo93tvTp6cTcH4NJ7doXHRiaLPJNrMkoE7gauAhcD1ZrawW7ObgGp3nwt8B/h6eOxCYCWwCLgS+H54PoD/Av7g7guApcCWE387w6O1vZMXdlUdV0puqM2fHJS56z4uuq65jed3HubNp0zu8djJOV0VOjSkQ0REZCBeKqsmd1wqs/Ojd1j15ZNXLCA5ybjiu09zw4/Wkp2Ryscv733p8C5mxnlzClhTcgh3H9Dry+CKZWLhcqDE3XcCmNmDwApgc0SbFcAXwsePAHdYMNtuBfCgu7cAu8ysBFhuZpuAi4APALh7K5BwXaQ9FS1/qbSapraOYU+iuyp0bNpXx3UR25/eXklbh3P5wp6T6ILxXRU6NLlQRERkIF4qrWHpjAnHTeCP1ZTcDH7w3jPYUlHP9LxxLJuVR15WWszHn3NSPqte2UdpVSOzBpjIy+CJJYmeDpRFPC8Hzu6pjbu3m1ktkB9uf77bsdOBJqAS+LGZLQVeBD7m7ke6v7iZ3QLcAjBzZmKsuvPczsOYwTmzh288NEBKchIXzi1g9WsVfPZtpxxdFvQvWw6Sl5naa83K1OQk8senDUmZO62QJCIio92PnwmGYUybMO7o995AvusuOXkSl5w8aUAxLCsOvufX7a5WEp0AYhkTHe3nVvf7CD216Wl7CnAG8AN3Px04Ahw31hrA3e9y92XuvqywsDCGcIfeut1VLJiS02M9x6F03bIZHKxv4W+vBxML2js6eWLrQS5dMKnPJb0nZWdQqTJ3IiIi/ba3ugkHZuT1vHT3UJtbOJ7ccams310VtxjkDbEk0eVA5GLtRcC+ntqYWQqQC1T1cmw5UO7ua8PtjxAk1QmvraOTDXtqWF7cv/I2g+VNCyYxMSuNh9eXcf/aUr72h63UNrWRkZLcZ23qiVlpVDe2aiyViIhIP5VVB6XlZuQNX0GB7pKSjGWz8linJDohxJJErwPmmdlsM0sjmCi4qlubVcCN4eNrgSc8yNRWASvD6h2zgXnAC+6+Hygzs5PDYy7j2DHWCWvTvjqa2jo4a/bEuLx+WkoS15w2nT9vOcDhhhb+suUgyUnGvEnj+zw2LyuN9k6nPkqJPBEREelZWVUj+VlpZKbHd526ZcUT2VF5hMMNurMcb30m0e7eDtwKPE5QQeNhd99kZl8ys6vDZncD+eHEwU8QDs1w903AwwQJ8h+Aj7p7R3jMPwE/N7NXgdOArw7e2xo663YFv/6WF8cniQZ411lFtHU433vidfYcPsI1p00jPYaVk/LC4Sc1RxJuDqeIiEjCcnfKqhuZMTF+Qzm6dI2LfnFPdZwjkZh+Trn7amB1t223RTxuhmMKRkS2ux24Pcr2l4Fl/Qk2Ebywu4pZ+ZlMCkvGxcOCKTmcMXMCW/fX84HzZjG7h9rQ3eVlBjOAqxrb+mgpIiIiXSpqm6lvbo/rUI4up07PJS05ifV7qnnLoinxDmdMi+89iRGm0531u6u4rJd6zMPlxx9czi9fLCcjhh7oLl1JdHWjeqJFRERi9XJZsMhKIvREZ6Q
ms6QoV+OiE4CW/e6HyvoWqhvb4jqUo0vuuNR+JdAQjKcen55CtYZziIiIxOzlshpSkowpufG7Cx1pWfFENu6tpbmto+/GMmSURPfD7sNBGet4TSocDHmZqVSpJ1pERCRmL5VWM23COFKSEiNtWj47j7YO17joOEuM/xpGiD2HGynMTqc4P/63cwYqLyuNGo2JFpFuzOxKM9tmZiVmFrVuf9juWjNzMxtxc1pEBqKto5PX9tYmxHjoLstn55OSZEfXjJD40Jjofth96AhFEzN54IWyvhsnqLzMNDburaWj0/tcnEVExgYzSwbuBN5MUMd/nZmtcvfN3dplA/8MrD3+LCKj07b99TS3dVKUAOOhu4xPT+GMmXk8U6IkOp6URMeoprGVmqY2LhjBvdAAEzPT6HSoqG2iKI6rLolIQlkOlLj7TgAzexBYwfH1+78MfAP4t+ENb2CiLUA1kGWaZWx7qTQYMhHPlQqjOX9uAd/9y3aqj7SSl5UW73DGJA3niFHXeOjiEb5W/YSsoFZ0WVVTnCMRkQQyHYi8xVYebjvKzE4HZrj773o7kZndYmbrzWx9ZWXl4EcqMszW7a5mck760bUWEsUF8wpwh2d3HI53KGOWeqJjtPtQI+kpSQkzM3egJoZl7sqrG4H8+AYjIoki2tguP7rTLAn4DvCBvk7k7ncBdwEsW7bM+2guktDcnXW7qzireCJmJzYEcrDvjCwtyiU7PYU1JZW8bcnUEwlNBkg90THaffgIs/IzSTrBf0TxlpuZigFl1eqJFpGjyoEZEc+LgH0Rz7OBxcBTZrYbOAdYpcmFMtqVVzdRUdvM8gSsypWSnMQ5c/JZo3HRcaMkOgaNLe0crG8Z8UM5AFKSksgZl0p5VWO8QxGRxLEOmGdms80sDVgJrOra6e617l7g7sXuXgw8D1zt7uvjE67I8Oha0OSsBFgfIpoL5hZQVtVE6WF9p8eDhnPEYHf4H+doSKIhqNBRVq1/cCIScPd2M7sVeBxIBu5x901m9iVgvbuv6v0MIvE1VJNI1+2uIjsjhfmTs3mptOaEzzfYLppfCMAfN+/n5gtPinM0Y4+S6BjsPnyElCSjKIFqRJ6IvMxUTSwUkWO4+2pgdbdtt/XQ9pLhiEkk3l7YVcWyWXkJWxJ2dkEWS4py+dWGvUqi40DDOWKw5/ARivLGkZI8Oj6uvKw0DtQ3a7lQERGRHhxuaGFH5ZGEX6X4706fzuaKOrbtr493KGPO6MgKh1B7Ryf7apuZkUBF1k9UflYa7l0VOkRERKS79eGS2ssTdDx0l3csnUZykvHrl/bGO5QxR0l0H/bXNdPR6QlXZP1E5I9PB4KyfSIiInK8F3ZVkZaSxKlFufEOpVcF49O5eH4hj768l05XVcnhpCS6D2VhFYvRMh4agp5oeGMBGRERETnW09srWV48kfSU5HiH0qdrTp9ORW0zuw7pe304aWJhH8qrm8jOSCF3XHxXKoo283igMtOSyc5IYY9K4oiIiBxnb00Trx9s4N1nzei7cQJ4y8LJ5GSk8OyOw8wpHB/vcMaMmHqizexKM9tmZiVm9qko+9PN7KFw/1ozK47Y9+lw+zYzu6Lbcclm9pKZ9bqMbDyVVTdSlJd5wisVJRIzozg/Sz3RIiIiUTy9PViy/uKwhFyiy0hN5kMXzGZLRR37alR9a7j0mUSbWTJwJ3AVsBC43swWdmt2E1Dt7nMJlob9enjsQoKi/YuAK4Hvh+fr8jFgy4m+iaHS1NrBoYZWZoyioRxdZuVnqidaREQkiqe2HWRabgZzJ42cXt0Pnj+bjNQknth6MN6hjBmx9EQvB0rcfae7twIPAiu6tVkB3Bs+fgS4zIKu2xXAg+7e4u67gJLwfJhZEfA24Ecn/jaGRlf1iqJRNKmwS3F+FuXVjbS2d8Y7FBERkYTR1tHJMyWHufjkSSPqLnTuuFTOm1PA5oo6KmrVGz0cYkmipwNlEc/Lw21R27h7O1AL5Pdx7HeBTwI
Jm8WVVQf/EY6mSYVdZuVn0unBuC8REREJbNhTTUNL+4gZyhHp/DkFpKck8ect6o0eDrEk0dF+hnWvodJTm6jbzeztwEF3f7HPFze7xczWm9n6ysrKvqMdROXVjRSOTycjNfFn5vZXcUGwhLnGRYuIiLzhr9srSUkyzp+bH+9Q+m1cWjIXzitgS0Udpfp+H3KxVOcoByKnpxYB+3poU25mKUAuUNXLsVcDV5vZW4EMIMfMfubu7+3+4u5+F3AXwLJly4atAKK7U17dxLwRNB6qP2blB0NU9hw6AifHORgREZFBFq2q1Q1nz+z1GHfnj5sPcOasPLIz4luVa6DOn1vA8zur+MOm/XxYS4EPqVh6otcB88xstpmlEUwUXNWtzSrgxvDxtcAT7u7h9pVh9Y7ZwDzgBXf/tLsXuXtxeL4noiXQ8VRZ30JDSzvTR+FQDoDC8elkpiWzW5MLRUREANi4t46Sgw2sOK37qNWRIz0lmTctmMTuw41sO6ClwIdSnz3R7t5uZrcCjwPJwD3uvsnMvgSsd/dVwN3AfWZWQtADvTI8dpOZPQxsBtqBj7p7xxC9l0G1JVyDfkpuRpwjGRpmxqz8LPbodo+IiCSgntZH6Ks3+URe53ev7iM5yXjbqVMH9TWG21nFE3mm5BB/2nyAL169aERNkBxJYlpsxd1XA6u7bbst4nEzcF0Px94O3N7LuZ8CnooljuG0paIOgCk5ozOJBijOz2Tbfv1KFRER6eh0XimvZcGUbHIzR+ZQji7JScYlJxfyyw17WVNyiAvnjbxJkiOBVizswdaKOnLHpZKZNno/oln5Wfx5ywHaOzpJSdYK8CIiMna9frCeIy3tnDEzL96hxKSvlYyXFk3gj5sO8L9/26Ukeogoc+rBlor6Ud0LDUFPdFuHs6+mOd6hiIiIxNVLpTVkpiUzb/LoKCiQkpzEOXPyeXp7pe46DxEl0VG0tHewo7Jh1I6H7jJvcjYAW/bXxTkSERGR+KlubGVzRR1LiyaQkjR6UqOziyeSkZrE3Wt2xjuUUWn0jlU4ASUHG2jvdKaO8iR60bQckpOMV8truGLRlHiHIyIiEhd/2XIQAy4a5gVW+hqScaIy01O49swiHl5XzqevOoW8rLQhfb2xZvT83BpEWyvCyhyjfDhHRmoy8ydn82p5bbxDERERiYsDdc28VFrNuSflkztuZE8ojGblWTNp7ejkd69VxDuUUUc90VFsqagjPSWJ/PHp8Q5lyC0tyuUPm/bj7iqBIyIiY84fNx8gLSXpmGW+h7qHeDgtmpbD/Mnj+fWGct53zqx4hzOqqCc6iq3765k/OZvkpNGfVJ5alEtNYxtlVU3xDkVERGRY3fvsbrZU1HHR/EIy00dnv6KZ8XenF7GhtIbdh7Q2xGBSEh3F1v11nDI1O95hDIulRRMAeHVvzYDP4e4EC1SKiIjEX2t7J02tHbS2d/b4/fTAC6V8ftUmFk7N4aJRXgLumtOnYQa/fmlvvEMZVUbnz64TcLC+mUMNrSyYkhPvUIbF/MnZpCUn8Wp5LW9fMq1fx7o7q17Zxzce38YpU3O4emn/jhcRERksne5s3FvLy2U1vH6ggY4wec5MS2ZGXiZTJwTznNo6Onl8036e23mYS08u5NKTJ436O89Tc8dx7kn5/OblvXz88nkavjlIlER30zWp8JSpOewagbc9+juOKy0liVOm5fBqef96olvbO7n5p+t5ensl41KTeX7nYZYW5TIrP6tf5xERETlRDS3tPPJiGdsPNJCTkcK5c4JJgu2dzqGGFkqrGnn9YD1PbasEYE5hFv/0pnn84yVz+NWGgffOjqSx0393+nT+7yOvsqG0hjNnjYwFZRKdkuhuupb7PmVq9ohMogdiyfRcfv3SXjo7naQYf43/ckM5T2+v5NNXLSA1OYn/+svrPPryPj566dxR/4teREQSx6Z9tdzxxOs0tnZw9dJpLJ89kaQoPa0dnc7lCyfR1u7MzM+MQ6QDNxj
J+hWLp/CZX29k9WsVSqIHicZEd7N1fz1TczOYkDl2aikuKcqloaWdnTH+aGht7+SOJ0pYOmMCt1x0Ehmpybx9yVT21zXz3M7DQxytiAwFM7vSzLaZWYmZfSrK/k+Y2WYze9XM/mJmmuYvcVdysIH33/0CZsY/XDyHc07Kj5pAAyQnGVNzx424BHqw5GSkcuG8Ah57rYLOTs1jGgxKorvZUlHHgiljY1Jhl6UzgsmFL+yqiqn9LzeUs7em6ZhxVQun5jCnMItnSg5pkqHICGNmycCdwFXAQuB6M1vYrdlLwDJ3XwI8AnxjeKMUOVZ5dSPvu3stZnDT+bOZNmFcvENKeG9bMpV9tc28VDbwYgLyBiXREVrbOyk52MApU8fGpMIu8yaNZ05hFr/aUN5n265e6NNmTOCSiJqaZsbSognUNrWxv655KMMVkcG3HChx953u3go8CKyIbODuT7p7Y/j0eaBomGMUOaqptYMP//RFGlra+emHzqYge/Sv6zAYLl84mbTkJFZr4ZVBoSQ6Qtdy3wvGWBJtZly3bAbr91Szs7Kh17bReqG7nBz24G/bXz9ksYrIkJgOlEU8Lw+39eQm4LFoO8zsFjNbb2brKysrBzFEkYC785nfvMbW/XV87/rTWThtbH1nnwgN6RhcSqIjbN0fTiocY8M5AN55+nSSDB55sefe6Mhe6MiVnbpkZ6QyfcI4tiqJFhlpog0ijfoNa2bvBZYB34y2393vcvdl7r6ssHB0196V+Lj/hVJ+tWEvH7tsHpeePCne4Yw4GtIxeFSdI8KWijrSUpKYXTD2yrRNyslg3qRsfvb8HqZNGHd0YsYNZ8882uaRF4Ne6Nv/bnGPNSYXTMnmia0HOdzQMiaWTRcZJcqBGRHPi4B93RuZ2eXAZ4CL3b1lmGKTBBStWkTk98VQOVjfzA+e2sFF8wv55zfNG/LXG40ih3SoSseJUU90hGC57/GkJI/Nj+WMWXnUNbdTcvD4IR2t7Z3c+WTPvdBdFkzJweFoLU4RGRHWAfPMbLaZpQErgVWRDczsdOB/gKvd/WAcYpQxrqPT+cX6cjLTkvnWdUtiLskqx9KQjsETU7YYQ+mjdDN7KNy/1syKI/Z9Oty+zcyuCLfNMLMnzWyLmW0ys48N1hs6EVsq6jlljKxUGM0pU7LJTk9h1Sv7aGhpP2bff/1le49joSNNnZBBdnoKT2zTd6zISOHu7cCtwOPAFuBhd99kZl8ys6vDZt8ExgO/MLOXzWxVD6cTGRJPbjvI3pomvvp3pzIpOyPe4YxoGtIxOPoczhFR+ujNBLf81pnZKnffHNHsJqDa3eea2Urg68C7wxJJK4FFwDTgz2Y2H2gH/tXdN5hZNvCimf2p2zmHVWV9C4caWsbcpMJIKclJvPecWfxozU5++txubr7gJADueOJ17nxyB9edWdRrLzRAkhknT8nm6W2VtHd0jtlefZGRxt1XA6u7bbst4vHlwx6UjEoDGQpSVtXIU9sOcvqMCVx16tShCm3M0JCOwRFLhtNn6aPw+b3h40eAyyzorlwBPOjuLe6+CygBlrt7hbtvAHD3eoKej95mgg+5yJUKx7IZEzNZedZM9lY38bU/bOHibz7Jt/64nXeePp2v/f2SXnuhu8ybnE19Szuv9HMpcRERke5a2zt5eH0ZORmpvGPptHiHMypoSMfgiCWJjqX00dE24W3BWiA/lmPDoR+nA2ujvfhwlUvauK8WCBYNGetOmZrD+86ZxdKiCSyensutl87lm9ctjXk57zkFWZjB314/NMSRiojIaPfYxgoOH2nl788sIiM1Od7hjBoa0nHiYqnOEUvpo57a9HqsmY0Hfgl83N3ror24u98F3AWwbNmyIfu5tHFvLTMmjhtTy333ZsHUHBZMzRnQbOvM9BROnZ7LmtcP8fHL5w9BdCIiMhZsP1DP2l1VXDC3gDmF40/4fNGGkoxVXUM6fvvKPg3pGKBYeqJjKX10tI2ZpQC5QFVvx5p
ZKkEC/XN3/9VAgh9Mr+2t5dTpufEOY9S4YG4BL5XVUN/cFu9QRERkBGpsaeeXG8qZlJ3OmxdOjnc4o05ORipvXjiZ37y8l+a2jniHMyLFkkT3WfoofH5j+Pha4Al393D7yrB6x2xgHvBCOF76bmCLu397MN7IiahpbKWsqolTp0+IdyijxgXzCujodJ7fWRXvUERkjKlpbOXxTfvZdegIdfohPyK5O4++so/Glg7etWwGqZqkPiSuXz6TmsY2Ht+0P96hjEh9Dudw93Yz6yp9lAzc01X6CFjv7qsIEuL7zKyEoAd6ZXjsJjN7GNhMUJHjo+7eYWYXAO8DXjOzl8OX+n/h7PBht3FvMJJkLPVED/UtrTNn5TEuNZlnSg6pB0FEhk1NYyvX/vC5o/XukwxuPK+YeZPG9qTxkeYXL5bz2t5a3rJwMtMmjIt3OKPWeXPymTFxHA+8UMqK0+Ja32FEimnFwhhKHzUD1/Vw7O3A7d22rSH6eOm4eG1vMKlw8XRNKhws6SnJLJ89kb+9rkVXRGR4NLd1cPO96yk93MgdN5zOi7urWb2xgofWlfHRS+eSpzkvI8LmfXV87jcbOakwi4v6KKsqJyYpyVh51ky++fg2dh06MiZXbD4Ruj+CJhX25v61pcf89ceF8wrYUXmE8urGIYpOROQNn/3NRl4srebb717K25dMY97kbN5z9iw6Op2fr91DW0dnvEOUPtQ1t/GPP3+RCZmprDxrJkkxlFWVE3PdmUUkJxkPrtOky/5SEo0mFQ6VrmEcv3+1Is6RiMhot3V/HY+8WM4tF53E25e8UUu4YHw671o2g301zTy343AcI5S+tLZ38tGfb6Csuok7bjiD8ekx3SyXEzQpJ4O3LJzM/WtLqW3SHIL+GPNJdG1jG6VVjSxWEj3oZuVncdqMCTz6cvdiLiIig+s7f9pOdnoKH7l4znH7Tpmaw7xJ4/nb65U0trbHITrpS2en88lHXuFvrx/ia+88lbOKJ8Y7pDHl1jfNpb65nXvW7Ip3KCPKmE+iuxZZUU/00Fhx2jQ2V9RRcrA+3qGIyCj1Wnktj286wE0Xzu5xWN6bFkziSGsHP39et6wTTac7X/ztJn7z8j4+eeXJXLdsRt8HyaBaNC2XKxZN5p41u6htVG90rMZ8Et21NPXiaUqih8LblkwlyWCVeqNFZIh858/byR2XyocumN1jm1n5WcwtHM//PL2DplbVxE0Une78+qW93PvcHm656KSodxJkeHz88vnUt7Rz95qd8Q5lxBjzSfSzJYeZP3k8eVmaVDgUJmVncN6cAh59ZR9B6XARkcGzpaKOJ7Ye5MMXziYnI7XXtpcumMShhlYeeEG90YmgraOTB9eV8eKeaj522Tw+fdUCTBMJh0X3ogH3ry3llKk5vPXUKfxozS7KqlQQIBZjOoluau3ghd1VXDhPJXSG0tWnTWPP4UZe3FM94HNE+wcvIvK/f9tJZloy7z1nVp9tZxdksbx4Inev2aVKHXFW39zGj/62k017a3nrqVP5lzfPVwIdZ/evLWXxtFw6Op333f0CP3t+T7xDSnhjOol+YXcVre2dXDivIN6hjGpvPXUqBePT+OrqLXR2qjdaRAZHRW0Tq17ex7uWzYi5ROmHLzqJvTVNrH5NVYPipbSq8f+3d+/RUdXXAse/e2byIA8IefBKgPCIIMorsEAFX9BSsa201rY+WrmVql21Wnvb1drbta70Vtt6e6/Vam1LraW1Fa3ah49braAVRAuCgAgECSGG8AoJkJB3MrPvH3OCkzBJZpLAnEn2Z61ZM+ec35zZv3ns85tzfuf8eOSfezlc08gNc8cwf6Jtg90iIyWRT0wbSWlVHW/a1Wy6NaAb0WvfP0qiz8PccVmxDqVfS0vy8Z0rJvNO2Qn+vOVArMMxxvQTK98sJaDKsi76Qne0cPIwJuSk8qvXS6yLWS+0+AO8W36C9cWVvFpUwc6DNdQ1dX3lk1Z/gBVr97Ji7V48Ard
cMoEpdj6S6xSOGcrkEen8Y8dhNu47FutwXG1AX4Rx3Z6jzMnPZFCiN9ahxI2edqP4TGEeP1uzh7uf2cAgSAAAFTdJREFU20F1fcup9/z6uWP6MjxjzABR09jCExvKWDx1JKMzUyJ+nscj3HzxeO7683bWF1cx345ERsUfUJ7ZvJ/7X3n/tGsK/2nTfi4uyOYzs/JYMHkYyQnBPN/iD/BaUQU/eXk3eypqOW/UYK6emWfbXpcSEa4uzGPF2hKWrXybJ26+gKl59mcnnAHbiD5c3cj7R2q5ZlZerEMZEDwe4aoZuTzyWjGrNpZx/dwxpxKsMcZE6zfr9nGysbVHV3P41Mxc/veV93n4tT3Mm5hlfXEj1NDs5/ZVW1i96wijhw7i6sJccocMIsHnoexYPQI8/+5B1hRV4PUIYzNTSEv2UXT4JM2tAcZnp/LLLxRSVdts77nLpSX5uGlePivWlfD5FW9x44X5jAn5s2o7wIIGbCN67Z6jAHZS4VmUmzGIqwvz+MuWcn75+l6+GMGJQMYY09HxumZ+88Y+Fp8/okcDZSUneLntsgksf34na/dUcuk5th3ozon6Zpb9bhPvlB1n+SenkOD1tGsIT8hJ4/q5Y/juleeyvriSjfuOUVxRS01jC0svHEvhmKF8ZMpwErweOzE8TmSkJLJs3jgeW7+PX68t4cppI7lgXKb9AQoxYBvRq3ceISc9ickj0mMdyoAya+xQMlIS+OOGD3hg9R7KjtXzb/PymTJycNgfZm1TKweON1DT0EJ6su+M/njDJXb7t22M+6xYV0Jdcyvf+Og5PV7HdXPH8Ogb+/jvl4q4eGI2Ho81DDpTXd/C9b/eQHFFLY9cX8jiqSM7bQh7PcIl5+Rwif0x6Rey0pK47fKJPL2pnOe3HWT34Ro+MW1UrMNyjQHZiC6trGP1riPccskE+0cVAxNy0rhjQQFr91Ty3LaDPL25nOy0RKbnZZCc6AWFAyca2H+snqq65lPPS03yMWl4GgsnD7frehszQFWcbGTl+lKumj6Kc4b3fCdIks/LNxedwzee2sYL2w9x1XRrGIRT29TK0t9upLiilhU3zuKyScNiHZKJUm/3/Kck+vjihWN5a28Vq3cd4cE1e6huaOHLF49j5JBBfRRlfBqQjegV60rweTzcNC8/1qH0W939aDNSErlq+igevn4mrxZV8NbeKnYdqqHZHwCFkRnJLDpvOKMzU8gbmsI/dhzmwPEG3i2vZlt5NReNz+LqwlzrV23MAKKq3PXsdgKq3PmRnu+FbnPV9Fx+9XoJ9/29iEsLchiS0vVgLQNNTWMLN/32bbYfqOYXNxRaA3oA84gwb2I20/KG8PKOI6x8s5Tfv1XKp2fmcuulE5iQkxbrEGNiwDWiK0428szmcj4zK5dhg5NjHc6Al52WxOdmj+Zzs0d3Wa62MXjppEXnjWD1ziOsK67kEw+9wYPXzuA8u0SSMQPCHzaU8WpRBcs/OYVx2am9Xp/XI/zo6ql89pdv8e1nt/HLL8yyo5OOqtomlv52I7sPn+Sh62ay6LwRsQ7JuEB6cgLXzMrjwWtn8Oi6Ep58ez9Pby5n0ZTh3DB3LPMHWNeoAdeIXrm+lBZ/gJsvHh/rUEwPDBmUwGdm5TEtbwgvbj/EkofXc/Ml47ljQUGPLpdU29TKm8WVlFbV8fruo/i8HnLSk8jNGMTwM/Any/pdG9MzRYdruPfFnVxyTg43XpjfZ+udOWYody2ezD0v7mLlm6V8aV7k15x2o86OAkaTZ947UM3tq7Zw8EQDK26czeW2B9p0MDozhe8vOZ/bFxawcn0pT2ws4+UdRxiaksCssZnMGjuUIYMS+v32bUA1ol/bXcGv15Vw5dSRjB+ghx76i4Lh6by8sIB7/28Xv/jnXp7bepAvzcvn6sI8MrvoL62qlFbV82pRBa8VVbBhXxUt/uCACwlewR9Q2gZVHDE4meqGFq6aMYrcjIHd78uYWNq47xg3/34TaUkJ/OS
aaX2+p2vZ/HH8q6SKe17chQBLL8ofkHukG1v8PLZ+Hz995X0yUxN5fNlc5ozLjHVYxsWy05L41scmcfvCidz9tx28XXqM1buOsGbXEfKzU2lo8bNoyvCoruUeTyJqRIvIFcCDgBd4VFV/3GF5EvB7YBZQBXxeVUudZd8FlgF+4A5VfTmSdfa19cWV3Pr4ZiaNSOeHn556Jl/KRKE3JzwMTU3kfz47nc/OyuPHLxVxz4u7uO+lIqaMHMykEekMS08myeehJaAcq2ui/HgD28urT52sOHFYGjfNG8flk4cxZdRgnt96kIBCVV0Teytq2VZezX0vFXHfS0XMyc9kycxRfHzqyG6HFz5R30xJZR0lR+sorqiluKKWIzWNVNU2UdvUSqLPQ2qSjxGDkxkxJJmZYzI4Z3g63gF0CMy4T2/y/JnS2OJn1cYyfvT3IvIyBrHyS3POyBEiEeGBa2fyjae2svz5nbx3sIbvXDGZnPSkPn+tvtbqD3DgRAOlVfUcrm5g3Z6jNPsDeEXweT2kJHpJTfSxvbyarLREMlMT251LUtvUyvbyatbtOcqqjWUcr29h8fkj+OGnp9oJ3CZiST4v0/IymJaXQVVtE++UHWfnoRp+8MJOfvDCTiaPSOfSSTnMHpvJ9NFDyElL6hd/VKW7YU9FxAu8D3wUKAfeBq5T1Z0hZb4KTFPVr4jItcCnVfXzIjIFWAXMAUYBq4G2s0G6XGc4s2fP1k2bNkVcuUBAWVdcyRMbPmD1rgoKhqWx6uYLIk4Mdi1Ld+t4mGj34ZPc++JODpxo4HBNE/VNrSggAhmDEhg+OJmpuUOYNjqDSwtyGJPV/p9xuM97/sRs/rb1AH/deoC9R+vweYQJOWnkZ6cwNCURn1doaVVONDRTcbKJ0so6jtd/OIpXglcYl53KqIxBZKYmUlZVT1NrgJqGFg7XNNLUGgCCF7afMTqDwjEZFAxPZ1x2KsMGJ5ExKJFEn+e0uFSVFr/S7A/Q3Bpy8/tpajcdvG/xK16P4PPIqXuf19N+nteZ7/GcmvaGTCeETHuEfpEAzyYR2ayqs2MdRzi9yfNdrTfanA1wpKaRLWUneKfsOM9uLqeqrpl5E7N4+LrCqBt10XafCgSUB9bs4aFX9+DzCFdOHcnFBTmcN2owI4ckk5rkI8F7+u/xTPEHlNrGVk42tXCivoXy48GrFpUdq2f/8Xo+qKpn/7F6WgPRDV+eluQjOcFDQ7OfumY/EMyTHz13ODfNH8fcCK4FHOn2Mdz7HennYtvg+FdV28SuQzXsPHSS/cfq8TttzvRkH+OzUxmfk0Z+VirZ6YkMTUkkIyWBoSmJpCX5SPJ5SPR5SPJ5SfR5zvqOpkhydiR7oucAxapa4qz0SWAJENrgXQIsdx4/AzwswV/gEuBJVW0C9olIsbM+Ilhnr63fW8nSxzaSmZrIl+eP49ZLJ9g/634kXIK94vyR7ab9zsal44/vjeJKKO7+NcZkpXD7wgK+tmAiOw7W8Pf3DrH7cHDv8snGVvwBxecVVIMbpoJh6WSlJZKdlsTSi/IZm5XSbqMbGrOqcqyumbzMQWz+4DjvfHCCh18rpuP20OcRREAQcKrR7DS+Y6mt8Z3QoTHetu0VQh+3b3Sfmt9WL+exKigavHfeB1VF6bCMtuWh0yHlnMeELAsoBDRYPnCqrDoxC14RPBIcXdPjPA5dFlA99adn292LztwbGxs9zvPa3Z6YKDQ0+7nox6/iDygJXmH+xGy+cukE5pylAR48HuHfP3oOS2aM4vG3PuDZzeX8bevBdmUSfZ5gI9TnCRtT6Hf71LyQ73ibgCqBQPvvZkDbpoPftXqngdtRerKPMZkpnDsyncXnjyA/K5WxWSmMyhjEKzuPkOD1EFCl1a/UN7dS19RK4dihHKtrpqqumcraJhpbAqQkeskYlMD5uUOYljeErLQknthQRsnRut69kT1gDeb+KSstifkFOcwvyKHFH6D8eAO5GcmnjtZuKKniL1s
ORLQun0dI9HnwiDjblOB2xRNyD23TwW1Bsz9AU0uAp269gGl5GX1ev0ga0bnA/pDpcmBuZ2VUtVVEqoEsZ/6/Ojw313nc3ToBEJFbgFucyVoR2R1BzO18AGwB/iPaJ0I2UBn901zH6tFDN/Tiufd0vsg+D3eJuh6yvEev4+YhOnuT59u9d32Rs9sUAyt7+uSg0z7b3vym+1ivfj/v9WEgHfTp77qP32835xyLLXpnLa7pXWyQO5FNBDk7kkZ0uL//Hfc8dFams/nhjoeF3ZuhqiuAFV0FeKaIyCa3Hn6NhtXDXawe7tJf6tFLvcnz7WfEMGd35ObP1q2xuTUusNh6yq2xuTUuOBVbfnflIuncVQ6EXsQ3DzjYWRkR8QFDgGNdPDeSdRpjjDk7epPnjTFmQIqkEf02UCAi40QkEbgWeK5DmeeApc7ja4BXnX5yzwHXikiSiIwDCoCNEa7TGGPM2dGbPG+MMQNSt905nL5vXwNeJnjpo8dUdYeI/BewSVWfA34DPO6cOHiMYALGKfcngientAK3qaofINw6+756veaKQ5J9wOrhLlYPd+kv9eix3uR5l3PzZ+vW2NwaF1hsPeXW2NwaF0QYW7eXuDPGGGOMMca0d/YueGmMMcYYY0w/YY1oY4wxxhhjomSN6E6IyBUisltEikXkrljH0xUReUxEKkTkvZB5mSLyiojsce6HOvNFRH7m1OtdESmMXeQfEpHRIvKaiOwSkR0i8nVnfrzVI1lENorINqce33fmjxORDU49nnJO3sI56fYppx4bRCQ/lvF3JCJeEdkiIi8403FXDxEpFZHtIrJVRDY58+Lqe2Wi59YcHi5fu0FnOdgNOsurbtExT7pFuNznFiKSISLPiEiR85270AUxTXLeq7ZbjYjc2dVzrBEdhgSHwP05sBiYAlwnwSHM3WolcEWHeXcBa1S1AFjjTEOwTgXO7RbgF2cpxu60At9U1XOBC4DbnPc83urRBCxQ1enADOAKEbkAuA/4qVOP48Ayp/wy4LiqTgR+6pRzk68Du0Km47Uel6vqjJBrksbb98pEweU5fCWn52s36CwHu0FnedUtOuZJN+mY+9ziQeAlVZ0MTMcF75+q7nbeqxnALKAe+EtXz7FGdHinhsBV1WagbQhcV1LVtZx+vdYlwO+cx78DPhUy//ca9C8gQ0RGEmOqekhV33EenyT4g8ol/uqhqlrrTCY4NwUWEBwqGU6vR1v9ngEWipyF8Y0jICJ5wMeBR51pIQ7r0Ym4+l6ZqLk2h3eSr2Ouixwcc13k1ZjrmCdN90RkMHAJwSv+oKrNqnoitlGdZiGwV1U/6KqQNaLDCzcEriuSSRSGq+ohCCZHYJgz3/V1c7oCzAQ2EIf1cA7tbQUqgFeAvcAJVW11ioTG2m4oZaBtKGU3eAD4NhBwprOIz3oo8A8R2SzBIakhDr9XJir2OfZChxzsCh3zqqq6JbaOedJNwuU+NxgPHAV+63SDeVREUmMdVAfXAqu6K2SN6PAiGt42Trm6biKSBjwL3KmqNV0VDTPPFfVQVb9zOCiP4B6xc8MVc+5dWQ8R+QRQoaqbQ2eHKerqejjmqWohwUP7t4nIJV2UdXM9TOTsc+yhKHLwWdUxr4rI+bGOqZM86SbR5L6zyQcUAr9Q1ZlAHR92qYs551yfq4Cnuytrjejw+sOw5EfaDkM79xXOfNfWTUQSCCbvP6rqn53ZcVePNs7hqX8S7F+YIcGhkqF9rG4dSnkecJWIlBI8FL6A4B6XeKsHqnrQua8g2L9tDnH8vTIRsc+xBzrJwa4Sklfd0K/8tDwpIn+IbUgf6iT3uUE5UB5yNOEZgo1qt1gMvKOqR7oraI3o8PrDsOShQ/QuBf4WMv9G5yoEFwDVbYe1Y8npP/sbYJeq3h+yKN7qkSMiGc7jQcBHCPYtfI3gUMlwej1cN5Syqn5XVfNUNZ/g9/9VVb2BOKuHiKSKSHrbY2AR8B5x9r0yUesPOfys6iIHx1w
nebUotlF1mie/EOOwgC5zX8yp6mFgv4hMcmYtJDiytVtcRwRdOQBQVbuFuQFXAu8T7M/6vVjH002sq4BDQAvBf3jLCPZHXQPsce4znbJC8Kz1vcB2YHas43fimk/wcOu7wFbndmUc1mMasMWpx3vAfzrzxwMbgWKCh4iSnPnJznSxs3x8rOsQpk6XAS/EYz2ceLc5tx1tv+V4+17ZrUefvStzeLh8HeuYnLjC5uBYx+XEFjavuukWmifdcOss97nlRvAqK5ucz/SvwNBYx+TElQJUAUMiKW/DfhtjjDHGGBMl685hjDHGGGNMlKwRbYwxxhhjTJSsEW2MMcYYY0yUrBFtjDHGGGNMlKwRbYwxxhhjTJSsEW2MMcYYY0yUrBFtXE9ElovIt87Aet/s4/UlichqEdkqIp/vy3UbY4wxxl183Rcxpn9S1Yv6eJUzgQRVndHH6zXGGGOMy9ieaOM6InKjiLwrIttE5PEOyyaIyEsisllE1onIZGf+J0Vkg4hscfYGD3fmLxeRx0TknyJSIiJ3hKyr1rm/zFn+jIgUicgfnSFwEZErnXlviMjPROSFTmIeBvwBmOHsiZ4gIrNE5HUn1pdFZGRXdTDGGGNM/LBGtHEVETkP+B6wQFWnA1/vUGQFcLuqzgK+BTzizH8DuEBVZwJPAt8Oec5k4GPAHOBuEUkI89IzgTuBKQSHS50nIsnAr4DFqjofyOksblWtAL4MrHP2RJcBDwHXOLE+BtzbTR2MMcYYEyesO4dxmwXAM6paCaCqx5ydwohIGnAR8HTbPCDJuc8DnnL29iYC+0LW+aKqNgFNIlIBDAfKO7zuRlUtd15nK5AP1AIlqtq2rlXALRHWYxJwPvCKE6sXONRNHYwxxhgTJ6wRbdxGAO1kmQc40Umf44eA+1X1ORG5DFgesqwp5LGf8N/7cGUkTLlICbBDVS9sN1NkMJ3XwRhjjDFxwrpzGLdZA3xORLIARCSzbYGq1gD7ROSzzjIRkenO4iHAAefx0j6KpQgYLyL5znQ0V9zYDeSIyIUAIpIgIud1UwdjjDHGxAlrRBtXUdUdBPsOvy4i24D7OxS5AVjmLNsBLHHmLyfYRWIdUNlHsTQAXwVeEpE3gCNAdYTPbQauAe5zYt1KsBtHV3UwxhhjTJwQ1c6OnBtjRCRNVWudq3X8HNijqj+NdVzGGGOMiS3bE21M1252TjTcQbDLyK9iHI8xxhhjXMD2RBsTJRH5Eqdfem+9qt4Wi3iMMcYYc/ZZI9oYY4wxxpgoWXcOY4wxxhhjomSNaGOMMcYYY6JkjWhjjDHGGGOiZI1oY4wxxhhjovT/2DWG8Y+zPc0AAAAASUVORK5CYII=\n",
1964 | "text/plain": [
1965 | "<Figure size 864x288 with 2 Axes>"
1966 | ]
1967 | },
1968 | "metadata": {
1969 | "needs_background": "light"
1970 | },
1971 | "output_type": "display_data"
1972 | }
1973 | ],
1974 | "source": [
1975 | "## DO NOT CHANGE\n",
1976 | "plt.figure(figsize=(12,4))\n",
1977 | "\n",
1978 | "# Creating plot on the left\n",
1979 | "plt.subplot(121)\n",
1980 | "sns.distplot(X_train['cleaning_fee'])\n",
1981 | "plt.title('Before Log-Transform')\n",
1982 | "\n",
1983 | "# Creating plot on the right\n",
1984 | "plt.subplot(122)\n",
1985 | "log_transform_train = np.where(np.isinf(np.log(X_train['cleaning_fee'])), 0, np.log(X_train['cleaning_fee']))\n",
1986 | "log_transform_test = np.where(np.isinf(np.log(X_test['cleaning_fee'])), 0, np.log(X_test['cleaning_fee']))\n",
1987 | "sns.distplot(log_transform_train)\n",
1988 | "plt.title('After Log-Transform')\n",
1989 | "\n",
1990 | "plt.show()"
1991 | ]
1992 | },
1993 | {
1994 | "cell_type": "code",
1995 | "execution_count": null,
1996 | "metadata": {},
1997 | "outputs": [],
1998 | "source": [
1999 | "## TO DO: update the \"cleaning_fee\" features in X_train and X_test with their log-transformed value\n",
2000 | "## Careful if you try to compute the log yourself - taking the log of 0 will create an infinite value\n",
2001 | " ## if you choose to compute log yourself with np.log, make sure you fill np.inf values with 0"
2002 | ]
2003 | },
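One way to sidestep the infinite-value problem mentioned in the cell above is `np.log1p`, which computes log(1 + x) and therefore maps 0 to 0. The sketch below compares it with plain `np.log` followed by replacing the infinities. The `fees` values are hypothetical and not taken from the dataset, and `log1p` is a slightly different transform from the one plotted earlier, so treat this as an illustration rather than the expected answer.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaning fees, including a zero (illustrative values only)
fees = pd.Series([0.0, 10.0, 50.0, 200.0], name="cleaning_fee")

# Option A: log(1 + x) is defined at 0, so no infinities appear
log1p_fees = np.log1p(fees)

# Option B: plain log, then replace the -inf produced by log(0) with 0
with np.errstate(divide="ignore"):  # silence the divide-by-zero warning
    raw_log = np.log(fees)
log_fees = raw_log.replace(-np.inf, 0.0)
```

Either result can be assigned back to the column; whichever option is used should be applied identically to the train and test sets.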
2004 | {
2005 | "cell_type": "markdown",
2006 | "metadata": {},
2007 | "source": [
2008 | "**Transforming Numerical Data - Standardization**\n",
2009 | "\n",
2010 | "Many Machine Learning models, such as Linear Regression, Logistic Regression, and Neural Networks, perform better after you've standardized the features. It isn't always required (for example, it doesn't really matter for Random Forests).\n",
2011 | "\n",
2012 | "In this section, we will standardize our data anyway. We don't have to standardize our binary features since they only take the values 0 or 1, but we should standardize everything else."
2013 | ]
2014 | },
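The fit-on-train, transform-both pattern described above can be sketched on a toy DataFrame. The column names and values here are hypothetical; only `StandardScaler` and its `fit`/`transform` methods come from scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for X_train / X_test (values are made up)
train = pd.DataFrame({"price": [100.0, 200.0, 300.0], "is_superhost": [0, 1, 0]})
test = pd.DataFrame({"price": [400.0], "is_superhost": [1]})

cols = ["price"]  # the binary column is left out on purpose

scaler = StandardScaler()
scaler.fit(train[cols])                      # learn mean/std from the training set only
train[cols] = scaler.transform(train[cols])  # training column now has mean 0
test[cols] = scaler.transform(test[cols])    # test set reuses the training mean/std
```

Fitting on the training set alone keeps test-set statistics from leaking into preprocessing, which is why the test column is transformed but never fitted.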
2015 | {
2016 | "cell_type": "code",
2017 | "execution_count": null,
2018 | "metadata": {},
2019 | "outputs": [],
2020 | "source": [
2021 | "## DO NOT CHANGE\n",
2022 | "temp_df = X_train.select_dtypes(['float', 'int'])\n",
2023 | "\n",
2024 | "# Gathering binary features\n",
2025 | "bi_cols = []\n",
2026 | "for col in temp_df.columns:\n",
2027 | " if temp_df[col].nunique() == 2:\n",
2028 | " bi_cols.append(col)"
2029 | ]
2030 | },
2031 | {
2032 | "cell_type": "code",
2033 | "execution_count": null,
2034 | "metadata": {},
2035 | "outputs": [],
2036 | "source": [
2037 | "## TO DO: store all the columns we need to standardize in cols_to_standardize.\n",
2038 | "## Hint: bi_cols contains all the columns that you don't need to standardize. np.setdiff1d may be helpful\n",
2039 | "cols_to_standardize = "
2040 | ]
2041 | },
2042 | {
2043 | "cell_type": "code",
2044 | "execution_count": null,
2045 | "metadata": {},
2046 | "outputs": [],
2047 | "source": [
2048 | "## TO DO: instantiate the StandardScaler() (make sure you include the parentheses to create the object)\n",
2049 | " ## store this object in scaler below\n",
2050 | "scaler = \n",
2051 | "## TO DO: fit the scaler to X_train's cols_to_standardize only\n",
2052 | "\n",
2053 | "## TO DO: transform (DO NOT FIT_TRANSFORM) X_train and X_test's cols_to_standardize and update the DataFrames\n",
2054 | "X_train[cols_to_standardize] = \n",
2055 | "X_test[cols_to_standardize] = "
2056 | ]
2057 | },
2058 | {
2059 | "cell_type": "markdown",
2060 | "metadata": {},
2061 | "source": [
2062 | "#### Encoding Categorical Data\n",
2063 | "\n",
2064 | "We still have to convert our categorical data into numbers. Here we're going to simply OneHotEncode (very similar to pd.get_dummies) our `cat_cols`."
2065 | ]
2066 | },
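As a rough illustration of what `handle_unknown='ignore'` does, the sketch below fits an encoder on a toy training column and then transforms a test column containing a category the encoder has never seen. The category values are hypothetical. `.toarray()` is called on the default sparse output so that the example does not depend on the name of the dense-output argument, which differs across scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (values are made up for illustration)
train = pd.DataFrame({"room_type": ["Entire home", "Private room", "Entire home"]})
test = pd.DataFrame({"room_type": ["Private room", "Boat"]})  # "Boat" is unseen

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["room_type"]])          # the encoder expects 2-D input

dummy_cols = list(ohe.categories_[0])  # categories learned from the training data
train_dummies = ohe.transform(train[["room_type"]]).toarray()
test_dummies = ohe.transform(test[["room_type"]]).toarray()
```

The unseen category is encoded as a row of zeros instead of raising an error, so the train and test sets end up with the same dummy columns.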
2067 | {
2068 | "cell_type": "code",
2069 | "execution_count": 76,
2070 | "metadata": {},
2071 | "outputs": [
2072 | {
2073 | "name": "stdout",
2074 | "output_type": "stream",
2075 | "text": [
2076 | "Unique Values per categorical column...\n",
2077 | "host_response_time: 4\n",
2078 | "property_type: 19\n",
2079 | "room_type: 4\n",
2080 | "bed_type: 4\n",
2081 | "cancellation_policy: 6\n",
2082 | "last_review_discrete: 6\n"
2083 | ]
2084 | }
2085 | ],
2086 | "source": [
2087 | "## DO NOT CHANGE\n",
2088 | "print('Unique Values per categorical column...')\n",
2089 | "for col in cat_cols:\n",
2090 | " print(f'{col}: {X_train[col].nunique()}')"
2091 | ]
2092 | },
2093 | {
2094 | "cell_type": "code",
2095 | "execution_count": null,
2096 | "metadata": {},
2097 | "outputs": [],
2098 | "source": [
2099 | "## TO DO: Finish the code\n",
2100 | "for col in cat_cols:\n",
2101 | " ## TO DO: instantiate the OneHotEncoder with handle_unknown = 'ignore' and sparse=False (sparse_output=False in scikit-learn 1.2+). Store object in ohe\n",
2102 | " ohe = \n",
2103 | " ## TO DO: fit ohe to the current column \"col\" in X_train\n",
2104 | " ohe.fit\n",
2105 | " \n",
2106 | " # This extracts the names of the dummy columns from ohe\n",
2107 | " dummy_cols = list(ohe.categories_[0])\n",
2108 | " \n",
2109 | " # This creates new dummy columns in X_train and X_test that we will fill\n",
2110 | " for dummy in dummy_cols:\n",
2111 | " X_train[dummy] = 0\n",
2112 | " X_test[dummy] = 0\n",
2113 | " \n",
2114 | " ## TO DO: transform the X_train and X_test column \"col\" and update the dummy_cols we created above\n",
2115 | " X_train[dummy_cols] = \n",
2116 | " X_test[dummy_cols] = "
2117 | ]
2118 | },
2119 | {
2120 | "cell_type": "code",
2121 | "execution_count": null,
2122 | "metadata": {},
2123 | "outputs": [],
2124 | "source": [
2125 | "## TO DO: drop the original cat_cols from X_train and X_test"
2126 | ]
2127 | },
2128 | {
2129 | "cell_type": "markdown",
2130 | "metadata": {},
2131 | "source": [
2132 | "**Text Data**\n",
2133 | "\n",
2134 | "We omitted text data at the beginning of the notebook, but a good place to start when working with text data is [NLTK](https://www.nltk.org/), the Natural Language Toolkit."
2135 | ]
2136 | },
2137 | {
2138 | "cell_type": "markdown",
2139 | "metadata": {},
2140 | "source": [
2141 | "### D. Feature Selection"
2142 | ]
2143 | },
2144 | {
2145 | "cell_type": "markdown",
2146 | "metadata": {},
2147 | "source": [
2148 | "Depending on how big your dataset is, you may want to reduce the number of features you have for performance purposes. Here we will simply reduce the number of features by removing highly correlated ones."
2149 | ]
2150 | },
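One possible way to find and drop highly correlated features, using `df.corr()` and `np.triu`, is sketched below on a toy DataFrame. The column names and values are hypothetical, the 0.8 threshold is the one used in this exercise, and this is not necessarily the approach taken in the solutions notebook.

```python
import numpy as np
import pandas as pd

# Toy features: "b" is an exact multiple of "a"; "c" is only weakly related
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Absolute correlations cover both the > 0.8 and < -0.8 cases
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is counted once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Drop the second feature of every pair whose |correlation| exceeds 0.8
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = df.drop(columns=to_drop)
```

Computing `to_drop` from the training set and then dropping the same columns from the test set keeps the two DataFrames aligned.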
2151 | {
2152 | "cell_type": "code",
2153 | "execution_count": null,
2154 | "metadata": {},
2155 | "outputs": [],
2156 | "source": [
2157 | "## TO DO (this may be tough - check the solution for help)\n",
2158 | " ## identify pairs of features that have a correlation higher than 0.8 or lower than -0.8 from X_train\n",
2159 | " ## remove ONLY ONE feature from each pair from both X_train and X_test\n",
2160 | " ## Hint 1: df.corr() and np.triu() can be helpful here\n",
2161 | " ## Hint 2: you should end up removing 33 features (unless you added features beyond what was given)"
2162 | ]
2163 | },
2164 | {
2165 | "cell_type": "markdown",
2166 | "metadata": {},
2167 | "source": [
2168 | "## **Congrats! Now you have train and test datasets that are ready for Machine Learning modeling!**\n",
2169 | " \n",
2170 | " > **This is the end of the exercises. Below is some very simple ML modeling with the data that we prepared together**"
2171 | ]
2172 | },
2173 | {
2174 | "cell_type": "markdown",
2175 | "metadata": {},
2176 | "source": [
2177 | "---"
2178 | ]
2179 | },
2180 | {
2181 | "cell_type": "markdown",
2182 | "metadata": {},
2183 | "source": [
2184 | "## V. Pick your Models \n",
2185 | "**[Back to top](#toc)**"
2186 | ]
2187 | },
2188 | {
2189 | "cell_type": "code",
2190 | "execution_count": 81,
2191 | "metadata": {},
2192 | "outputs": [],
2193 | "source": [
2194 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
2195 | "from sklearn.svm import SVR\n",
2196 | "\n",
2197 | "from sklearn.metrics import mean_squared_error, mean_absolute_error"
2198 | ]
2199 | },
2200 | {
2201 | "cell_type": "code",
2202 | "execution_count": 82,
2203 | "metadata": {},
2204 | "outputs": [],
2205 | "source": [
2206 | "rf = RandomForestRegressor()\n",
2207 | "gbr = GradientBoostingRegressor()\n",
2208 | "svr = SVR()\n",
2209 | "\n",
2210 | "models = [rf, gbr, svr]"
2211 | ]
2212 | },
2213 | {
2214 | "cell_type": "markdown",
2215 | "metadata": {},
2216 | "source": [
2217 | "## VI. Model Selection \n",
2218 | "**[Back to top](#toc)**\n",
2219 | "\n",
2220 | "Evaluate your models, and pick the 2-3 best performing ones for tuning."
2221 | ]
2222 | },
2223 | {
2224 | "cell_type": "code",
2225 | "execution_count": 83,
2226 | "metadata": {},
2227 | "outputs": [],
2228 | "source": [
2229 | "results = []\n",
2230 | "for model in models:\n",
2231 | " model.fit(X_train, y_train)\n",
2232 | " y_preds = model.predict(X_test)\n",
2233 | " \n",
2234 | " mse = mean_squared_error(y_test, y_preds)\n",
2235 | " mae = mean_absolute_error(y_test, y_preds)\n",
2236 | " \n",
2237 | " metrics = {}\n",
2238 | " metrics['model'] = model.__class__.__name__\n",
2239 | " metrics['mse'] = mse\n",
2240 | " metrics['mae'] = mae\n",
2241 | " results.append(metrics)"
2242 | ]
2243 | },
2244 | {
2245 | "cell_type": "code",
2246 | "execution_count": 84,
2247 | "metadata": {
2248 | "scrolled": true
2249 | },
2250 | "outputs": [
2251 | {
2252 | "data": {
2253 | "text/html": [
2254 | "<div>\n",
2255 | "\n",
2268 | "<table border=\"1\" class=\"dataframe\">\n",
2269 | " <thead>\n",
2270 | " <tr style=\"text-align: right;\">\n",
2271 | " <th></th>\n",
2272 | " <th>model</th>\n",
2273 | " <th>mse</th>\n",
2274 | " <th>mae</th>\n",
2275 | " </tr>\n",
2276 | " </thead>\n",
2277 | " <tbody>\n",
2278 | " <tr>\n",
2279 | " <th>0</th>\n",
2280 | " <td>RandomForestRegressor</td>\n",
2281 | " <td>61360.79</td>\n",
2282 | " <td>56.51</td>\n",
2283 | " </tr>\n",
2284 | " <tr>\n",
2285 | " <th>1</th>\n",
2286 | " <td>GradientBoostingRegressor</td>\n",
2287 | " <td>91765.90</td>\n",
2288 | " <td>68.03</td>\n",
2289 | " </tr>\n",
2290 | " <tr>\n",
2291 | " <th>2</th>\n",
2292 | " <td>SVR</td>\n",
2293 | " <td>112711.50</td>\n",
2294 | " <td>67.50</td>\n",
2295 | " </tr>\n",
2296 | " </tbody>\n",
2297 | "</table>\n",
2298 | "</div>"
2299 | ],
2300 | "text/plain": [
2301 | " model mse mae\n",
2302 | "0 RandomForestRegressor 61360.79 56.51\n",
2303 | "1 GradientBoostingRegressor 91765.90 68.03\n",
2304 | "2 SVR 112711.50 67.50"
2305 | ]
2306 | },
2307 | "execution_count": 84,
2308 | "metadata": {},
2309 | "output_type": "execute_result"
2310 | }
2311 | ],
2312 | "source": [
2313 | "pd.set_option('display.float_format', lambda x: '%7.2f' % x)\n",
2314 | "\n",
2315 | "pd.DataFrame(results, index=np.arange(len(results))).round(50)"
2316 | ]
2317 | },
2318 | {
2319 | "cell_type": "markdown",
2320 | "metadata": {},
2321 | "source": [
2322 | "## VII. Model Tuning \n",
2323 | "**[Back to top](#toc)**\n",
2324 | "\n",
2325 | "Tune the hyperparameters of your models, or even try adding new engineered features, or different transformations."
2326 | ]
2327 | },
2328 | {
2329 | "cell_type": "markdown",
2330 | "metadata": {},
2331 | "source": [
2332 | "## VIII. Pick the Best Model \n",
2333 | "**[Back to top](#toc)**"
2334 | ]
2335 | }
2336 | ],
2337 | "metadata": {
2338 | "kernelspec": {
2339 | "display_name": "Python 3",
2340 | "language": "python",
2341 | "name": "python3"
2342 | },
2343 | "language_info": {
2344 | "codemirror_mode": {
2345 | "name": "ipython",
2346 | "version": 3
2347 | },
2348 | "file_extension": ".py",
2349 | "mimetype": "text/x-python",
2350 | "name": "python",
2351 | "nbconvert_exporter": "python",
2352 | "pygments_lexer": "ipython3",
2353 | "version": "3.7.3"
2354 | }
2355 | },
2356 | "nbformat": 4,
2357 | "nbformat_minor": 2
2358 | }
2359 |
--------------------------------------------------------------------------------
/ml_project_checklist_template.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Machine Learning Project Checklist Template"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "[Full Project Checklist Here](https://docs.google.com/spreadsheets/d/1y4EdxeAliOQw9CDHx0_brjmk-LUb3gfX52zLGSqLg_g/edit?usp=sharing)"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Prerequisite: Business and Data Understanding\n",
22 | "\n",
23 | "Before doing any data cleaning or exploration, do the best you can to identify the goals, questions, and purpose of this analysis. Additionally, try to get your hands on a Data Dictionary or schema if you can. Ideally, you will be able to answer questions like these...\n",
24 | "\n",
25 | "- Business Questions:\n",
26 | " - What's the goal of this analysis?\n",
27 | " - What're some questions I want to answer?\n",
28 | " - Do I need machine learning?\n",
29 | "- Data Questions:\n",
30 | " - How many features should I expect?\n",
31 | " - How much text, categorical, or image data do I have? All of these need to be turned into numbers somehow.\n",
32 | " - Do I already have the datasets that I need?\n",
33 | "\n",
34 | "Honestly, taking 1-2 hours to answer these can go a long way.\n",
35 | "\n",
36 | "> **One of the worst feelings you can get in these situations is feeling overwhelmed and lost while trying to understand a big and messy dataset. You're doing yourself a favor by studying the data before you dive in.**"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Table of Contents \n",
44 | "\n",
45 | "#### I. [Import Data & Libraries](#idl)\n",
46 | "#### II. [Exploratory Data Analysis](#eda)\n",
47 | "#### III. [Train/Test Split](#tts)\n",
48 | "#### IV. [Prepare for ML](#pfm)\n",
49 | "#### V. [Pick your Models](#pym)\n",
50 | "#### VI. [Model Selection](#ms)\n",
51 | "#### VII. [Model Tuning](#mt)\n",
52 | "#### VIII. [Pick the Best Model](#pbm)\n",
53 | "\n"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "## I. Import Data & Libraries "
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": null,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": []
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "## II. Exploratory Data Analysis\n",
75 | "**[Back to top](#toc)**"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {},
82 | "outputs": [],
83 | "source": []
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "## III. Train/Test Split\n",
90 | "**[Back to top](#toc)**"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": []
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "## IV. Prepare for ML \n",
105 | "**[Back to top](#toc)**"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": []
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "## V. Pick your Models \n",
120 | "**[Back to top](#toc)**"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {},
127 | "outputs": [],
128 | "source": []
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "## VI. Model Selection \n",
135 | "**[Back to top](#toc)**"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": []
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "## VII. Model Tuning \n",
150 | "**[Back to top](#toc)**"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": []
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "## VIII. Pick the Best Model \n",
165 | "**[Back to top](#toc)**"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": []
174 | }
175 | ],
176 | "metadata": {
177 | "kernelspec": {
178 | "display_name": "Python 3",
179 | "language": "python",
180 | "name": "python3"
181 | },
182 | "language_info": {
183 | "codemirror_mode": {
184 | "name": "ipython",
185 | "version": 3
186 | },
187 | "file_extension": ".py",
188 | "mimetype": "text/x-python",
189 | "name": "python",
190 | "nbconvert_exporter": "python",
191 | "pygments_lexer": "ipython3",
192 | "version": "3.7.3"
193 | }
194 | },
195 | "nbformat": 4,
196 | "nbformat_minor": 2
197 | }
198 |
--------------------------------------------------------------------------------