├── .DS_Store
├── Code
│   ├── .DS_Store
│   ├── A2_Pandas
│   │   ├── .DS_Store
│   │   └── P1_Getting_Knowing_Data
│   │       ├── .DS_Store
│   │       ├── .ipynb_checkpoints
│   │       │   ├── Introduction-to-pandas-checkpoint.ipynb
│   │       │   └── Pandas_Basic_1-checkpoint.ipynb
│   │       ├── Introduction-to-pandas.ipynb
│   │       ├── Pandas_Basic_1.ipynb
│   │       ├── chipotle.tsv
│   │       ├── data
│   │       │   ├── car-sales-missing-data.csv
│   │       │   └── car-sales.csv
│   │       ├── pandas-exercises-solutions.ipynb
│   │       └── pandas-exercises.ipynb
│   ├── A3_Numpy
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   ├── 2. NumPy-checkpoint.ipynb
│   │   │   ├── 3. NumPy exercises-checkpoint.ipynb
│   │   │   └── Introduction-to-Numpy-checkpoint.ipynb
│   │   ├── 2. NumPy.ipynb
│   │   ├── 3. NumPy exercises.ipynb
│   │   ├── Introduction-to-Numpy.ipynb
│   │   └── numpy-images
│   │       ├── car-photo.png
│   │       ├── dog-photo.png
│   │       └── panda.png
│   ├── A4_Matplotlib
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   └── Introduction_to_Matplotlib-checkpoint.ipynb
│   │   ├── Introduction_to_Matplotlib.ipynb
│   │   ├── Introduction_to_Matplotlib_ZTM.ipynb
│   │   ├── data
│   │   │   ├── .DS_Store
│   │   │   ├── california_cities.csv
│   │   │   ├── car-sales.csv
│   │   │   └── heart-disease.csv
│   │   └── images
│   │       ├── .DS_Store
│   │       └── simple-plot.jpg
│   ├── A5_Scikit_Learn
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   ├── 1-Get-Data-Ready-checkpoint.ipynb
│   │   │   ├── Introduction-to-scikit-learn-checkpoint.ipynb
│   │   │   └── sklearn-workflow-1-Get-Data-Ready-checkpoint.ipynb
│   │   ├── 1-Get-Data-Ready.ipynb
│   │   ├── Introduction-to-scikit-learn.ipynb
│   │   ├── data
│   │   │   ├── car-sales-extended-missing-data.csv
│   │   │   ├── car-sales-extended.csv
│   │   │   ├── car-sales-missing-data.csv
│   │   │   ├── car-sales.csv
│   │   │   └── heart-disease.csv
│   │   ├── gs_random_forest_model_1.joblib
│   │   ├── gs_random_forest_model_1.pkl
│   │   └── random_forest_model_1.pkl
│   ├── A6_Kaggle
│   │   ├── .ipynb_checkpoints
│   │   │   └── Day12_Housing Prices Competition-checkpoint.ipynb
│   │   └── Day12_Housing Prices Competition.ipynb
│   ├── A6_Seaborn
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   └── Introduction_to_Seaborn-checkpoint.ipynb
│   │   ├── Seaborn_Tutorial.ipynb
│   │   └── data
│   │       └── cereal.csv
│   ├── P00_Project_Template
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   └── Project_Template_Heart_Disease_Classification-checkpoint.ipynb
│   │   ├── Project_Template_Heart_Disease_Classification.ipynb
│   │   └── data
│   │       └── heart.csv
│   ├── P01_Pre_Processing
│   │   ├── .ipynb_checkpoints
│   │   │   └── data_preprocessing_template-checkpoint.ipynb
│   │   ├── Data.csv
│   │   └── data_preprocessing_template.ipynb
│   ├── P02_Linear_Regression
│   │   ├── .ipynb_checkpoints
│   │   │   ├── polynomial_regression-checkpoint.ipynb
│   │   │   └── simple_linear_regression-checkpoint.ipynb
│   │   ├── Position_Salaries.csv
│   │   ├── multiple_linear_regression.ipynb
│   │   ├── multiple_linear_regression_Backward_Elimination.ipynb
│   │   ├── polynomial_regression.ipynb
│   │   └── simple_linear_regression.ipynb
│   └── Project
│       ├── .DS_Store
│       └── Housing Corporation
│           ├── .DS_Store
│           ├── .ipynb_checkpoints
│           │   ├── Housing Corporation-checkpoint.ipynb
│           │   └── Icon
│           ├── Housing Corporation.ipynb
│           ├── Icon
│           └── datasets
│               ├── .DS_Store
│               ├── Icon
│               └── housing
│                   ├── .DS_Store
│                   ├── Icon
│                   └── housing.csv
├── Pages
│   ├── .DS_Store
│   ├── A00_Reading_List.md
│   ├── A01_Interview_Question.md
│   ├── A01_Job_Description.md
│   ├── A02_Pandas_Cheat_Sheet.md
│   ├── A03_Numpy_Cheat_Sheet.md
│   ├── A04_Conda_CLI.md
│   ├── A05_Matplotlib.md
│   ├── A05_Statistics.md
│   ├── A06_SkLearn.md
│   ├── A8_Daily_Lessons.md
│   ├── P00_Introduction.md
│   ├── P01_Data_Pre_Processing.md
│   ├── P02_Regression.md
│   ├── Project_Guideline.md
│   └── Resources
│       ├── .DS_Store
│       └── Interview
│           └── ML_cheatsheets.pdf
└── README.md
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/.DS_Store
--------------------------------------------------------------------------------
/Code/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/.DS_Store
--------------------------------------------------------------------------------
/Code/A2_Pandas/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A2_Pandas/.DS_Store
--------------------------------------------------------------------------------
/Code/A2_Pandas/P1_Getting_Knowing_Data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A2_Pandas/P1_Getting_Knowing_Data/.DS_Store
--------------------------------------------------------------------------------
/Code/A2_Pandas/P1_Getting_Knowing_Data/.ipynb_checkpoints/Introduction-to-pandas-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/Code/A2_Pandas/P1_Getting_Knowing_Data/.ipynb_checkpoints/Pandas_Basic_1-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 4
6 | }
7 |
--------------------------------------------------------------------------------
/Code/A2_Pandas/P1_Getting_Knowing_Data/Introduction-to-pandas.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "impressive-touch",
6 | "metadata": {},
7 | "source": [
8 | "## Introduction to Pandas"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "cellular-fleet",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "id": "mathematical-floating",
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "# 2 main datatypes: Series & DataFrame\n",
29 | "# Series = 1-D\n",
30 | "# DF. = 2-D\n",
31 | "\n",
32 | "series = pd.Series([\"BMW\", \"Toyota\", \"Honda\"])\n",
33 | "colours = pd.Series([\"Red\", \"Blue\", \"White\"])"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "id": "right-effort",
40 | "metadata": {},
41 | "outputs": [
42 | {
43 | "data": {
44 | "text/html": [
45 | "
\n",
46 | "\n",
59 | "
\n",
60 | " \n",
61 | " \n",
62 | " | \n",
63 | " Car make | \n",
64 | " Colour | \n",
65 | "
\n",
66 | " \n",
67 | " \n",
68 | " \n",
69 | " 0 | \n",
70 | " BMW | \n",
71 | " Red | \n",
72 | "
\n",
73 | " \n",
74 | " 1 | \n",
75 | " Toyota | \n",
76 | " Blue | \n",
77 | "
\n",
78 | " \n",
79 | " 2 | \n",
80 | " Honda | \n",
81 | " White | \n",
82 | "
\n",
83 | " \n",
84 | "
\n",
85 | "
"
86 | ],
87 | "text/plain": [
88 | " Car make Colour\n",
89 | "0 BMW Red\n",
90 | "1 Toyota Blue\n",
91 | "2 Honda White"
92 | ]
93 | },
94 | "execution_count": 3,
95 | "metadata": {},
96 | "output_type": "execute_result"
97 | }
98 | ],
99 | "source": [
100 | "car_data = pd.DataFrame({\"Car make\": series, \"Colour\" : colours})\n",
101 | "car_data"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "id": "unnecessary-newsletter",
108 | "metadata": {},
109 | "outputs": [],
110 | "source": [
111 | "# Import data\n",
112 | "car_sales = pd.read_csv(\"./data/car-sales.csv\")"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "id": "present-fisher",
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "car_sales"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "id": "abroad-province",
128 | "metadata": {},
129 | "source": [
130 | "## Describe Data"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "id": "reduced-maine",
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "# Attributes\n",
141 | "car_sales.dtypes"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "id": "professional-acoustic",
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "car_sales.columns"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "id": "infectious-satin",
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "car_colums = car_sales.columns\n",
162 | "car_colums"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "id": "lyric-savannah",
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "car_sales.index"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "id": "hairy-brake",
179 | "metadata": {},
180 | "outputs": [],
181 | "source": [
182 | "# Function\n",
183 | "car_sales.describe()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "id": "expensive-taiwan",
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "car_sales.info()"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "id": "anticipated-rebecca",
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "car_sales.mean()"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": null,
209 | "id": "herbal-moment",
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "car_prices = pd.Series([3000,1500,112045])\n",
214 | "car_prices.mean()"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": null,
220 | "id": "sunrise-softball",
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "car_sales[\"Doors\"].sum()"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "id": "severe-mother",
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "len(car_sales)"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "id": "aware-chicago",
240 | "metadata": {},
241 | "source": [
242 | "## Viewing and selecting data"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": null,
248 | "id": "civic-gamma",
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "car_sales.head()"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": null,
258 | "id": "interior-european",
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "car_sales.tail()"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": null,
268 | "id": "favorite-necklace",
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "# .loc and .iloc\n",
273 | "animals = pd.Series([\"cat\", \"dog\", \"Bird\", \"panda\", \"snake\"], index=[0,3,9,8,3])"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "id": "seasonal-region",
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "animals"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "id": "later-boxing",
290 | "metadata": {},
291 | "outputs": [],
292 | "source": [
293 | "# loc refers to index\n",
294 | "animals.loc[3]"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "id": "cordless-aurora",
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "# .iloc refers to position\n",
305 | "animals.iloc[3]"
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "id": "animated-constitution",
311 | "metadata": {},
312 | "source": [
313 | "### Boolean Indexing"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "id": "governing-headline",
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "car_sales[car_sales[\"Odometer (KM)\"] > 100000]"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "id": "internal-attribute",
330 | "metadata": {},
331 | "outputs": [],
332 | "source": [
333 | "pd.crosstab(car_sales[\"Make\"], car_sales[\"Doors\"])"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": null,
339 | "id": "iraqi-protocol",
340 | "metadata": {},
341 | "outputs": [],
342 | "source": [
343 | "# Groupy\n",
344 | "car_sales.groupby([\"Make\"]).mean()"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": null,
350 | "id": "integrated-transparency",
351 | "metadata": {},
352 | "outputs": [],
353 | "source": [
354 | "car_sales[\"Odometer (KM)\"].plot()"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": null,
360 | "id": "accessory-scene",
361 | "metadata": {},
362 | "outputs": [],
363 | "source": [
364 | "car_sales[\"Odometer (KM)\"].hist() #150000 & 200000 consider as Outliner"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "id": "linear-emission",
371 | "metadata": {},
372 | "outputs": [],
373 | "source": [
374 | "car_sales['Price'] = car_sales['Price'].replace('[\\$\\,\\.]',\"\",regex=True).astype(int)"
375 | ]
376 | },
377 | {
378 | "cell_type": "code",
379 | "execution_count": null,
380 | "id": "lesbian-consent",
381 | "metadata": {},
382 | "outputs": [],
383 | "source": [
384 | "car_sales.plot()"
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": null,
390 | "id": "descending-detection",
391 | "metadata": {},
392 | "outputs": [],
393 | "source": [
394 | "car_sales[\"Make\"] = car_sales[\"Make\"].str.lower()"
395 | ]
396 | },
397 | {
398 | "cell_type": "code",
399 | "execution_count": null,
400 | "id": "frequent-indonesia",
401 | "metadata": {},
402 | "outputs": [],
403 | "source": [
404 | "car_sales_missing = pd.read_csv(\"./data/car-sales-missing-data.csv\")\n",
405 | "car_sales_missing"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "id": "assumed-processor",
412 | "metadata": {},
413 | "outputs": [],
414 | "source": [
415 | "car_sales_missing[\"Odometer\"].fillna(car_sales_missing[\"Odometer\"].mean(), inplace = True)\n",
416 | "car_sales_missing"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": null,
422 | "id": "changing-perception",
423 | "metadata": {},
424 | "outputs": [],
425 | "source": [
426 | "car_sales_missing.dropna(inplace=True)\n",
427 | "car_sales_missing"
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "id": "about-professional",
433 | "metadata": {},
434 | "source": [
435 | "## Create new Columns for Pandas DF"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "id": "amended-sharing",
442 | "metadata": {},
443 | "outputs": [],
444 | "source": [
445 | "#columns from series\n",
446 | "\n",
447 | "seat_columns = pd.Series([5,5,5,5,5])\n",
448 | "\n",
449 | "# New column added\n",
450 | "car_sales[\"Seats\"] = seat_columns\n",
451 | "car_sales"
452 | ]
453 | },
454 | {
455 | "cell_type": "code",
456 | "execution_count": null,
457 | "id": "younger-grocery",
458 | "metadata": {},
459 | "outputs": [],
460 | "source": [
461 | "car_sales[\"Seats\"].fillna(5, inplace = True)"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": null,
467 | "id": "reflected-remainder",
468 | "metadata": {},
469 | "outputs": [],
470 | "source": [
471 | "car_sales"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "id": "multiple-generation",
478 | "metadata": {},
479 | "outputs": [],
480 | "source": [
481 | "fuel_economy = [7.5, 9.2, 5.0, 9.6, 8.7, 4.7, 7.6,8.7,3.0,4.5]\n",
482 | "car_sales[\"Fuel per 100KM\"] = fuel_economy\n",
483 | "car_sales"
484 | ]
485 | },
486 | {
487 | "cell_type": "code",
488 | "execution_count": null,
489 | "id": "healthy-mortgage",
490 | "metadata": {},
491 | "outputs": [],
492 | "source": [
493 | "car_sales[\"Total fuel used (L)\"] = car_sales[\"Odometer (KM)\"]/100*car_sales[\"Fuel per 100KM\"]"
494 | ]
495 | },
496 | {
497 | "cell_type": "code",
498 | "execution_count": null,
499 | "id": "fiscal-serial",
500 | "metadata": {},
501 | "outputs": [],
502 | "source": [
503 | "car_sales"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "id": "marine-catholic",
510 | "metadata": {},
511 | "outputs": [],
512 | "source": [
513 | "car_sales[\"Number of wheels\"] = 4\n",
514 | "car_sales[\"Passed road safety\"] = True\n",
515 | "car_sales"
516 | ]
517 | },
518 | {
519 | "cell_type": "markdown",
520 | "id": "fancy-courtesy",
521 | "metadata": {},
522 | "source": [
523 | "## Sampling Data"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "id": "suitable-survey",
530 | "metadata": {},
531 | "outputs": [],
532 | "source": [
533 | "#Shuffle all the row\n",
534 | "car_sales_shuffled = car_sales.sample(frac=1)\n",
535 | "car_sales_shuffled"
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": null,
541 | "id": "amino-remove",
542 | "metadata": {},
543 | "outputs": [],
544 | "source": [
545 | "# Take a sample of the data to practise\n",
546 | "\n",
547 | "#Only select 20% of data\n",
548 | "car_sales_shuffled.sample(frac = 0.2)"
549 | ]
550 | },
551 | {
552 | "cell_type": "code",
553 | "execution_count": null,
554 | "id": "ready-commissioner",
555 | "metadata": {},
556 | "outputs": [],
557 | "source": [
558 | "# Revert the original index\n",
559 | "car_sales_shuffled.reset_index(drop=True, inplace=True)"
560 | ]
561 | },
562 | {
563 | "cell_type": "code",
564 | "execution_count": null,
565 | "id": "beneficial-screen",
566 | "metadata": {},
567 | "outputs": [],
568 | "source": [
569 | "car_sales_shuffled"
570 | ]
571 | },
572 | {
573 | "cell_type": "code",
574 | "execution_count": null,
575 | "id": "grand-uniform",
576 | "metadata": {},
577 | "outputs": [],
578 | "source": [
579 | "car_sales[\"Odometer (KM)\"] = car_sales[\"Odometer (KM)\"].apply(lambda x: x/1.6)"
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "execution_count": null,
585 | "id": "private-fossil",
586 | "metadata": {},
587 | "outputs": [],
588 | "source": [
589 | "car_sales"
590 | ]
591 | },
592 | {
593 | "cell_type": "code",
594 | "execution_count": null,
595 | "id": "expensive-friday",
596 | "metadata": {},
597 | "outputs": [],
598 | "source": []
599 | }
600 | ],
601 | "metadata": {
602 | "kernelspec": {
603 | "display_name": "Python 3",
604 | "language": "python",
605 | "name": "python3"
606 | },
607 | "language_info": {
608 | "codemirror_mode": {
609 | "name": "ipython",
610 | "version": 3
611 | },
612 | "file_extension": ".py",
613 | "mimetype": "text/x-python",
614 | "name": "python",
615 | "nbconvert_exporter": "python",
616 | "pygments_lexer": "ipython3",
617 | "version": "3.8.8"
618 | }
619 | },
620 | "nbformat": 4,
621 | "nbformat_minor": 5
622 | }
623 |
--------------------------------------------------------------------------------
/Code/A2_Pandas/P1_Getting_Knowing_Data/data/car-sales-missing-data.csv:
--------------------------------------------------------------------------------
1 | Make,Colour,Odometer,Doors,Price
2 | Toyota,White,150043,4,"$4,000"
3 | Honda,Red,87899,4,"$5,000"
4 | Toyota,Blue,,3,"$7,000"
5 | BMW,Black,11179,5,"$22,000"
6 | Nissan,White,213095,4,"$3,500"
7 | Toyota,Green,,4,"$4,500"
8 | Honda,,,4,"$7,500"
9 | Honda,Blue,,4,
10 | Toyota,White,60000,,
11 | ,White,31600,4,"$9,700"
--------------------------------------------------------------------------------
/Code/A2_Pandas/P1_Getting_Knowing_Data/data/car-sales.csv:
--------------------------------------------------------------------------------
1 | Make,Colour,Odometer (KM),Doors,Price
2 | Toyota,White,150043,4,"$4,000.00"
3 | Honda,Red,87899,4,"$5,000.00"
4 | Toyota,Blue,32549,3,"$7,000.00"
5 | BMW,Black,11179,5,"$22,000.00"
6 | Nissan,White,213095,4,"$3,500.00"
7 | Toyota,Green,99213,4,"$4,500.00"
8 | Honda,Blue,45698,4,"$7,500.00"
9 | Honda,Blue,54738,4,"$7,000.00"
10 | Toyota,White,60000,4,"$6,250.00"
11 | Nissan,White,31600,4,"$9,700.00"
--------------------------------------------------------------------------------
/Code/A2_Pandas/P1_Getting_Knowing_Data/pandas-exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Pandas Practice\n",
8 | "\n",
9 | "This notebook is dedicated to practicing different tasks with pandas. The solutions are available in a solutions notebook, however, you should always try to figure them out yourself first.\n",
10 | "\n",
11 | "It should be noted there may be more than one different way to answer a question or complete an exercise.\n",
12 | "\n",
13 | "Exercises are based off (and directly taken from) the quick introduction to pandas notebook.\n",
14 | "\n",
15 | "Different tasks will be detailed by comments or text.\n",
16 | "\n",
17 | "For further reference and resources, it's advised to check out the [pandas documnetation](https://pandas.pydata.org/pandas-docs/stable/)."
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 1,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "# Import pandas\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 2,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "# Create a series of three different colours\n"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "# View the series of different colours\n"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 4,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "# Create a series of three different car types and view it\n"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 5,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "# Combine the Series of cars and colours into a DataFrame\n"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 6,
68 | "metadata": {},
69 | "outputs": [],
70 | "source": [
71 | "# Import \"../data/car-sales.csv\" and turn it into a DataFrame\n"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "**Note:** Since you've imported `../data/car-sales.csv` as a DataFrame, we'll now refer to this DataFrame as 'the car sales DataFrame'."
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 7,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "# Export the DataFrame you created to a .csv file\n"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 8,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "# Find the different datatypes of the car data DataFrame\n"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 9,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "# Describe your current car sales DataFrame using describe()\n"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 10,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "# Get information about your DataFrame using info()\n"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "What does it show you?"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 11,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "# Create a Series of different numbers and find the mean of them\n"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 12,
136 | "metadata": {},
137 | "outputs": [],
138 | "source": [
139 | "# Create a Series of different numbers and find the sum of them\n"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 13,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "# List out all the column names of the car sales DataFrame\n"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 14,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "# Find the length of the car sales DataFrame\n"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 15,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "# Show the first 5 rows of the car sales DataFrame\n"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 16,
172 | "metadata": {},
173 | "outputs": [],
174 | "source": [
175 | "# Show the first 7 rows of the car sales DataFrame\n"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 17,
181 | "metadata": {},
182 | "outputs": [],
183 | "source": [
184 | "# Show the bottom 5 rows of the car sales DataFrame\n"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 18,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "# Use .loc to select the row at index 3 of the car sales DataFrame\n"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": 19,
199 | "metadata": {},
200 | "outputs": [],
201 | "source": [
202 | "# Use .iloc to select the row at position 3 of the car sales DataFrame\n"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "Notice how they're the same? Why do you think this is? \n",
210 | "\n",
211 | "Check the pandas documentation for [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) and [.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html). Think about a different situation each could be used for and try them out."
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 20,
217 | "metadata": {},
218 | "outputs": [],
219 | "source": [
220 | "# Select the \"Odometer (KM)\" column from the car sales DataFrame\n"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 21,
226 | "metadata": {},
227 | "outputs": [],
228 | "source": [
229 | "# Find the mean of the \"Odometer (KM)\" column in the car sales DataFrame\n"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 22,
235 | "metadata": {},
236 | "outputs": [],
237 | "source": [
238 | "# Select the rows with over 100,000 kilometers on the Odometer\n"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": 23,
244 | "metadata": {},
245 | "outputs": [],
246 | "source": [
247 | "# Create a crosstab of the Make and Doors columns\n"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 24,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "# Group columns of the car sales DataFrame by the Make column and find the average\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 25,
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "# Import Matplotlib and create a plot of the Odometer column\n",
266 | "# Don't forget to use %matplotlib inline\n"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 26,
272 | "metadata": {},
273 | "outputs": [],
274 | "source": [
275 | "# Create a histogram of the Odometer column using hist()\n"
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 27,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "# Try to plot the Price column using plot()\n"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "Why didn't it work? Can you think of a solution?\n",
292 | "\n",
293 | "You might want to search for \"how to convert a pandas string columb to numbers\".\n",
294 | "\n",
295 | "And if you're still stuck, check out this [Stack Overflow question and answer on turning a price column into integers](https://stackoverflow.com/questions/44469313/price-column-object-to-int-in-pandas).\n",
296 | "\n",
297 | "See how you can provide the example code there to the problem here."
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 28,
303 | "metadata": {},
304 | "outputs": [],
305 | "source": [
306 | "# Remove the punctuation from price column\n"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 29,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "# Check the changes to the price column\n"
316 | ]
317 | },
318 | {
319 | "cell_type": "code",
320 | "execution_count": 30,
321 | "metadata": {},
322 | "outputs": [],
323 | "source": [
324 | "# Remove the two extra zeros at the end of the price column\n"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": 31,
330 | "metadata": {},
331 | "outputs": [],
332 | "source": [
333 | "# Check the changes to the Price column\n"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 32,
339 | "metadata": {},
340 | "outputs": [],
341 | "source": [
342 | "# Change the datatype of the Price column to integers\n"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": 33,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": [
351 | "# Lower the strings of the Make column\n"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "If you check the car sales DataFrame, you'll notice the Make column hasn't been lowered.\n",
359 | "\n",
360 | "How could you make these changes permanent?\n",
361 | "\n",
362 | "Try it out."
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": 34,
368 | "metadata": {},
369 | "outputs": [],
370 | "source": [
371 | "# Make lowering the case of the Make column permanent\n"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": 35,
377 | "metadata": {},
378 | "outputs": [],
379 | "source": [
380 | "# Check the car sales DataFrame\n"
381 | ]
382 | },
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "Notice how the Make column stays lowered after reassigning.\n",
388 | "\n",
389 | "Now let's deal with missing data."
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 36,
395 | "metadata": {},
396 | "outputs": [],
397 | "source": [
398 | "# Import the car sales DataFrame with missing data (\"../data/car-sales-missing-data.csv\")\n",
399 | "\n",
400 | "\n",
401 | "# Check out the new DataFrame\n"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "Notice the missing values are represented as `NaN` in pandas DataFrames.\n",
409 | "\n",
410 | "Let's try fill them."
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": 37,
416 | "metadata": {},
417 | "outputs": [],
418 | "source": [
419 | "# Fill the Odometer (KM) column missing values with the mean of the column inplace\n"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": 38,
425 | "metadata": {},
426 | "outputs": [],
427 | "source": [
428 | "# View the car sales missing DataFrame and verify the changes\n"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": 39,
434 | "metadata": {},
435 | "outputs": [],
436 | "source": [
437 | "# Remove the rest of the missing data inplace\n"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": 40,
443 | "metadata": {},
444 | "outputs": [],
445 | "source": [
446 | "# Verify the missing values are removed by viewing the DataFrame\n"
447 | ]
448 | },
449 | {
450 | "cell_type": "markdown",
451 | "metadata": {},
452 | "source": [
453 | "We'll now start to add columns to our DataFrame."
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": 41,
459 | "metadata": {},
460 | "outputs": [],
461 | "source": [
462 | "# Create a \"Seats\" column where every row has a value of 5\n"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": 42,
468 | "metadata": {},
469 | "outputs": [],
470 | "source": [
471 | "# Create a column called \"Engine Size\" with random values between 1.3 and 4.5\n",
472 | "# Remember: If you're doing it from a Python list, the list has to be the same length\n",
473 | "# as the DataFrame\n"
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": 43,
479 | "metadata": {},
480 | "outputs": [],
481 | "source": [
482 | "# Create a column which represents the price of a car per kilometer\n",
483 | "# Then view the DataFrame\n"
484 | ]
485 | },
486 | {
487 | "cell_type": "code",
488 | "execution_count": 44,
489 | "metadata": {},
490 | "outputs": [],
491 | "source": [
492 | "# Remove the last column you added using .drop()\n"
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": 45,
498 | "metadata": {},
499 | "outputs": [],
500 | "source": [
501 | "# Shuffle the DataFrame using sample() with the frac parameter set to 1\n",
502 | "# Save the the shuffled DataFrame to a new variable\n"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "Notice how the index numbers get moved around. The [`sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) function is a great way to get random samples from your DataFrame. It's also another great way to shuffle the rows by setting `frac=1`."
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "execution_count": 46,
515 | "metadata": {},
516 | "outputs": [],
517 | "source": [
518 | "# Reset the indexes of the shuffled DataFrame\n"
519 | ]
520 | },
521 | {
522 | "cell_type": "markdown",
523 | "metadata": {},
524 | "source": [
525 | "Notice the index numbers have been changed to have order (start from 0)."
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": 47,
531 | "metadata": {},
532 | "outputs": [],
533 | "source": [
534 | "# Change the Odometer values from kilometers to miles using a Lambda function\n",
535 | "# Then view the DataFrame\n"
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": 48,
541 | "metadata": {},
542 | "outputs": [],
543 | "source": [
544 | "# Change the title of the Odometer (KM) to represent miles instead of kilometers\n"
545 | ]
546 | },
547 | {
548 | "cell_type": "markdown",
549 | "metadata": {},
550 | "source": [
551 | "## Extensions\n",
552 | "\n",
553 | "For more exercises, check out the pandas documentation, particularly the [10-minutes to pandas section](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). \n",
554 | "\n",
555 | "One great exercise would be to retype out the entire section into a Jupyter Notebook of your own.\n",
556 | "\n",
557 | "Get hands-on with the code and see what it does.\n",
558 | "\n",
559 | "The next place you should check out are the [top questions and answers on Stack Overflow for pandas](https://stackoverflow.com/questions/tagged/pandas?sort=MostVotes&edited=true). Often, these contain some of the most useful and common pandas functions. Be sure to play around with the different filters!\n",
560 | "\n",
561 | "Finally, always remember, the best way to learn something new to is try it. Make mistakes. Ask questions, get things wrong, take note of the things you do most often. And don't worry if you keep making the same mistake, pandas has many ways to do the same thing and is a big library. So it'll likely take a while before you get the hang of it."
562 | ]
563 | }
564 | ],
565 | "metadata": {
566 | "kernelspec": {
567 | "display_name": "Python 3",
568 | "language": "python",
569 | "name": "python3"
570 | },
571 | "language_info": {
572 | "codemirror_mode": {
573 | "name": "ipython",
574 | "version": 3
575 | },
576 | "file_extension": ".py",
577 | "mimetype": "text/x-python",
578 | "name": "python",
579 | "nbconvert_exporter": "python",
580 | "pygments_lexer": "ipython3",
581 | "version": "3.8.3"
582 | }
583 | },
584 | "nbformat": 4,
585 | "nbformat_minor": 2
586 | }
587 |
--------------------------------------------------------------------------------
/Code/A3_Numpy/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/.DS_Store
--------------------------------------------------------------------------------
/Code/A3_Numpy/numpy-images/car-photo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/numpy-images/car-photo.png
--------------------------------------------------------------------------------
/Code/A3_Numpy/numpy-images/dog-photo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/numpy-images/dog-photo.png
--------------------------------------------------------------------------------
/Code/A3_Numpy/numpy-images/panda.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/numpy-images/panda.png
--------------------------------------------------------------------------------
/Code/A4_Matplotlib/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/.DS_Store
--------------------------------------------------------------------------------
/Code/A4_Matplotlib/data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/data/.DS_Store
--------------------------------------------------------------------------------
/Code/A4_Matplotlib/data/car-sales.csv:
--------------------------------------------------------------------------------
1 | Make,Colour,Odometer (KM),Doors,Price
2 | Toyota,White,150043,4,"$4,000.00"
3 | Honda,Red,87899,4,"$5,000.00"
4 | Toyota,Blue,32549,3,"$7,000.00"
5 | BMW,Black,11179,5,"$22,000.00"
6 | Nissan,White,213095,4,"$3,500.00"
7 | Toyota,Green,99213,4,"$4,500.00"
8 | Honda,Blue,45698,4,"$7,500.00"
9 | Honda,Blue,54738,4,"$7,000.00"
10 | Toyota,White,60000,4,"$6,250.00"
11 | Nissan,White,31600,4,"$9,700.00"
--------------------------------------------------------------------------------
/Code/A4_Matplotlib/data/heart-disease.csv:
--------------------------------------------------------------------------------
1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1
10 | 52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1
12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1
13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1
15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1
16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1
17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1
18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1
19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1
20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1
21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1
22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1
23 | 44,1,2,130,233,0,1,179,1,0.4,2,0,2,1
24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1
25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1
26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1
27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1
28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1
29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1
30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1
31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1
32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1
33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1
34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1
35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1
36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1
37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1
38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1
39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1
40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1
41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1
42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1
43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1
44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1
45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1
46 | 39,1,2,140,321,0,0,182,0,0,2,0,2,1
47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1
48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1
49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1
50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1
51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1
52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1
53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1
54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1
55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1
56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1
57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1
58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1
59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1
60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1
61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1
62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1
63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1
64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1
65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1
66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1
67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1
68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1
69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1
70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1
71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1
72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1
73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1
74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1
75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1
76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1
77 | 55,0,1,135,250,0,0,161,0,1.4,1,0,2,1
78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1
79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1
80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1
81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1
82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1
83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1
84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1
85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1
86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1
87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1
88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1
89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1
90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1
91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1
92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1
93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1
94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1
95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1
96 | 45,0,1,112,160,0,1,138,0,0,1,0,2,1
97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1
98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1
99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1
100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1
101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1
102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1
103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1
104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1
105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1
106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1
107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1
108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1
109 | 45,0,0,138,236,0,0,152,1,0.2,1,0,2,1
110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1
111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1
112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1
113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1
114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1
115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1
116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1
117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1
118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1
119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1
120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1
121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1
122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1
123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1
124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1
125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1
126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1
127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1
128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1
129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1
130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1
131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1
132 | 54,0,2,160,201,0,1,163,0,0,2,1,2,1
133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1
134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1
135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1
136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1
137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1
138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1
139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1
140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1
141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1
142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1
143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1
144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1
145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1
146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1
147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1
148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1
149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1
150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1
151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1
152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1
153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1
154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1
155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1
156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1
157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1
158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1
159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1
160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1
161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1
162 | 56,1,1,120,240,0,1,169,0,0,0,0,2,1
163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1
164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1
165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0
168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0
169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0
170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0
171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0
173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0
174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0
175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0
176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0
177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0
178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0
179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0
180 | 43,1,0,120,177,0,0,120,1,2.5,1,0,3,0
181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0
182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0
183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0
184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0
185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0
186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0
187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0
188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0
189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0
190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0
191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0
192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0
193 | 58,1,0,128,216,0,0,131,1,2.2,1,3,3,0
194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0
195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0
196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0
197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0
198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0
199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0
200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0
201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0
202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0
203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0
205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0
206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0
207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0
208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0
209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0
210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0
211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0
212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0
213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0
214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0
215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0
216 | 56,1,0,125,249,1,0,144,1,1.2,1,1,2,0
217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0
218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0
219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0
220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0
221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0
222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0
223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0
224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0
225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0
226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0
227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0
229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0
230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0
231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0
232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0
233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0
234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0
235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0
236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0
237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0
238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0
239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0
240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0
241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0
242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0
243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0
244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0
245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0
246 | 56,1,0,132,184,0,0,105,1,2.1,1,1,1,0
247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0
248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0
249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0
250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0
251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0
252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0
253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0
254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0
256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0
257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0
258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0
259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0
260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0
261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0
262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0
263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0
264 | 53,1,0,123,282,0,1,95,1,2,1,2,3,0
265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0
266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0
267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0
268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0
269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0
270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0
271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0
272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0
273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0
274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0
275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0
276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0
277 | 52,1,0,125,212,0,1,168,0,1,2,2,3,0
278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0
279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0
280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0
281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0
282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0
283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0
284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0
285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0
286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0
287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0
288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0
289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0
290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0
291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0
292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0
293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0
294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0
296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0
297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0
298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0
299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0
300 | 57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0
305 |
--------------------------------------------------------------------------------
/Code/A4_Matplotlib/images/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/images/.DS_Store
--------------------------------------------------------------------------------
/Code/A4_Matplotlib/images/simple-plot.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/images/simple-plot.jpg
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/.DS_Store
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/.ipynb_checkpoints/1-Get-Data-Ready-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "sublime-register",
6 | "metadata": {},
7 | "source": [
8 | "# Introduction to Scikit-Learn (sklearn)\n",
9 | "\n",
10 | "This notebook demonstates some of the most useful functions of the Sklearn Lib\n",
11 | "\n",
12 | "Cover:\n",
13 | "\n",
14 | "0. End-to_end Scikit-Learn Workflow\n",
15 | "1. Getting Data Ready\n",
16 | "2. Choose the right estimator/algorithm for our problems\n",
17 | "3. Fit the model/algorithm and use it to make predictions on our data\n",
18 | "4. Evaluation a model\n",
19 | "5. Improve a model\n",
20 | "6. Save and load a trained model\n",
21 | "7. Put it all together!"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "id": "raising-nutrition",
27 | "metadata": {},
28 | "source": [
29 | "## 0. An end-to-end scikit-learn workflow"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 1,
35 | "id": "annoying-macedonia",
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "# 1. Get the data ready\n",
40 | "\n",
41 | "# Standard import\n",
42 | "import pandas as pd\n",
43 | "import numpy as np\n",
44 | "import matplotlib.pyplot as plt\n",
45 | "\n",
46 | "%matplotlib inline"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 2,
52 | "id": "abroad-prediction",
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "data": {
57 | "text/html": [
58 | "\n",
59 | "\n",
72 | "
\n",
73 | " \n",
74 | " \n",
75 | " | \n",
76 | " Make | \n",
77 | " Colour | \n",
78 | " Odometer (KM) | \n",
79 | " Doors | \n",
80 | " Price | \n",
81 | "
\n",
82 | " \n",
83 | " \n",
84 | " \n",
85 | " 0 | \n",
86 | " Honda | \n",
87 | " White | \n",
88 | " 35431.0 | \n",
89 | " 4.0 | \n",
90 | " 15323.0 | \n",
91 | "
\n",
92 | " \n",
93 | " 1 | \n",
94 | " BMW | \n",
95 | " Blue | \n",
96 | " 192714.0 | \n",
97 | " 5.0 | \n",
98 | " 19943.0 | \n",
99 | "
\n",
100 | " \n",
101 | " 2 | \n",
102 | " Honda | \n",
103 | " White | \n",
104 | " 84714.0 | \n",
105 | " 4.0 | \n",
106 | " 28343.0 | \n",
107 | "
\n",
108 | " \n",
109 | " 3 | \n",
110 | " Toyota | \n",
111 | " White | \n",
112 | " 154365.0 | \n",
113 | " 4.0 | \n",
114 | " 13434.0 | \n",
115 | "
\n",
116 | " \n",
117 | " 4 | \n",
118 | " Nissan | \n",
119 | " Blue | \n",
120 | " 181577.0 | \n",
121 | " 3.0 | \n",
122 | " 14043.0 | \n",
123 | "
\n",
124 | " \n",
125 | "
\n",
126 | "
"
127 | ],
128 | "text/plain": [
129 | " Make Colour Odometer (KM) Doors Price\n",
130 | "0 Honda White 35431.0 4.0 15323.0\n",
131 | "1 BMW Blue 192714.0 5.0 19943.0\n",
132 | "2 Honda White 84714.0 4.0 28343.0\n",
133 | "3 Toyota White 154365.0 4.0 13434.0\n",
134 | "4 Nissan Blue 181577.0 3.0 14043.0"
135 | ]
136 | },
137 | "execution_count": 2,
138 | "metadata": {},
139 | "output_type": "execute_result"
140 | }
141 | ],
142 | "source": [
143 | "car_sales = pd.read_csv(\"./data/car-sales-extended-missing-data.csv\")\n",
144 | "car_sales.head()"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 3,
150 | "id": "liable-mortgage",
151 | "metadata": {},
152 | "outputs": [
153 | {
154 | "data": {
155 | "text/plain": [
156 | "1000"
157 | ]
158 | },
159 | "execution_count": 3,
160 | "metadata": {},
161 | "output_type": "execute_result"
162 | }
163 | ],
164 | "source": [
165 | "len(car_sales)"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 4,
171 | "id": "relevant-space",
172 | "metadata": {
173 | "scrolled": true
174 | },
175 | "outputs": [
176 | {
177 | "data": {
178 | "text/plain": [
179 | "Make object\n",
180 | "Colour object\n",
181 | "Odometer (KM) float64\n",
182 | "Doors float64\n",
183 | "Price float64\n",
184 | "dtype: object"
185 | ]
186 | },
187 | "execution_count": 4,
188 | "metadata": {},
189 | "output_type": "execute_result"
190 | }
191 | ],
192 | "source": [
193 | "car_sales.dtypes"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "id": "coastal-nudist",
199 | "metadata": {},
200 | "source": [
201 | "## What if there were missing values ?\n",
202 | "1. Fill them with some values (a.k.a `imputation`).\n",
203 | "2. Remove the samples with missing data altogether."
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 5,
209 | "id": "naval-terry",
210 | "metadata": {
211 | "scrolled": true
212 | },
213 | "outputs": [
214 | {
215 | "data": {
216 | "text/plain": [
217 | "Make 49\n",
218 | "Colour 50\n",
219 | "Odometer (KM) 50\n",
220 | "Doors 50\n",
221 | "Price 50\n",
222 | "dtype: int64"
223 | ]
224 | },
225 | "execution_count": 5,
226 | "metadata": {},
227 | "output_type": "execute_result"
228 | }
229 | ],
230 | "source": [
231 | "car_sales.isna().sum()"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 7,
237 | "id": "under-ferry",
238 | "metadata": {},
239 | "outputs": [],
240 | "source": [
241 | "# Drop the rows with missing in the \"Price\" column\n",
242 | "car_sales.dropna(subset=[\"Price\"], inplace=True)"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "id": "greenhouse-ticket",
248 | "metadata": {},
249 | "source": [
250 | "## 1. Getting Data Ready: \n",
251 | "\n",
252 | "Three main thins we have to do:\n",
253 | "1. Split the data into features and labels (Usually `X` and `y`)\n",
254 | "2. Filling (also called imputing) or disregarding missing values\n",
255 | "3. Converting non-numerical values to numerical values (a.k.a. feature encoding)"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 8,
261 | "id": "prescription-vertical",
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "# Create X (features matrix)\n",
266 | "X = car_sales.drop(\"Price\", axis = 1) # Remove 'target' column\n",
267 | "\n",
268 | "# Create y (lables)\n",
269 | "y = car_sales[\"Price\"]"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "id": "incorporated-september",
275 | "metadata": {},
276 | "source": [
277 | "**Note**: We split data into train & test to perform filling missing values on them separately."
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 10,
283 | "id": "scenic-baking",
284 | "metadata": {},
285 | "outputs": [],
286 | "source": [
287 | "np.random.seed(42)\n",
288 | "\n",
289 | "# Split the data into training and test sets\n",
290 | "from sklearn.model_selection import train_test_split\n",
291 | "\n",
292 | "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": 11,
298 | "id": "raising-gibraltar",
299 | "metadata": {},
300 | "outputs": [
301 | {
302 | "data": {
303 | "text/plain": [
304 | "((760, 4), (190, 4), (760,), (190,))"
305 | ]
306 | },
307 | "execution_count": 11,
308 | "metadata": {},
309 | "output_type": "execute_result"
310 | }
311 | ],
312 | "source": [
313 | "X_train.shape, X_test.shape, y_train.shape, y_test.shape"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "id": "parliamentary-click",
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "# Inspect whether \"Door\" is categorical feature or not\n",
324 | "# Although \"Door\" contains numerical values\n",
325 | "car_sales[\"Doors\"].value_counts()\n",
326 | "\n",
327 | "# Conclusion: \"Door\" is categorical feature since it has only 3 options"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": 12,
333 | "id": "alternate-indian",
334 | "metadata": {},
335 | "outputs": [],
336 | "source": [
337 | "# Fill missing values with Scikit-Learn \n",
338 | "from sklearn.impute import SimpleImputer #Help fill the missing values\n",
339 | "from sklearn.compose import ColumnTransformer\n",
340 | "\n",
341 | "# Fill Categorical values with 'missing' & numerical values with mean\n",
342 | "\n",
343 | "cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"missing\")\n",
344 | "door_imputer = SimpleImputer(strategy=\"constant\", fill_value=4) #Since \"Door\" col, although type is numerical, but actually cat\n",
345 | "num_imputer = SimpleImputer(strategy=\"mean\")"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 13,
351 | "id": "blank-study",
352 | "metadata": {},
353 | "outputs": [],
354 | "source": [
355 | "# Define different column features\n",
356 | "categorical_features = [\"Make\", \"Colour\"]\n",
357 | "door_feature = [\"Doors\"]\n",
358 | "numerical_feature = [\"Odometer (KM)\"]"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 15,
364 | "id": "superb-forwarding",
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "imputer = ColumnTransformer([\n",
369 | " (\"cat_imputer\", cat_imputer, categorical_features),\n",
370 | " (\"door_imputer\", door_imputer, door_feature),\n",
371 | " (\"num_imputer\", num_imputer, numerical_feature)])"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "id": "governing-oxford",
377 | "metadata": {},
378 | "source": [
379 | "**Note:** We use fit_transform() on the training data and transform() on the testing data. \n",
380 | "* In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). \n",
381 | "* Then we take those same patterns and fill the test set (transform only)."
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 16,
387 | "id": "matched-english",
388 | "metadata": {},
389 | "outputs": [
390 | {
391 | "data": {
392 | "text/plain": [
393 | "array([['Honda', 'White', 4.0, 71934.0],\n",
394 | " ['Toyota', 'Red', 4.0, 162665.0],\n",
395 | " ['Honda', 'White', 4.0, 42844.0],\n",
396 | " ...,\n",
397 | " ['Toyota', 'White', 4.0, 196225.0],\n",
398 | " ['Honda', 'Blue', 4.0, 133117.0],\n",
399 | " ['Honda', 'missing', 4.0, 150582.0]], dtype=object)"
400 | ]
401 | },
402 | "execution_count": 16,
403 | "metadata": {},
404 | "output_type": "execute_result"
405 | }
406 | ],
407 | "source": [
408 | "# learn the patterns in the training set and transform it via imputation (fit, then transform)\n",
409 | "filled_X_train = imputer.fit_transform(X_train)\n",
410 | "# take those same patterns and fill the test set (transform only)\n",
411 | "filled_X_test = imputer.transform(X_test)\n",
412 | "\n",
413 | "# Check filled X_train\n",
414 | "filled_X_train"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 17,
420 | "id": "pointed-darkness",
421 | "metadata": {},
422 | "outputs": [
423 | {
424 | "data": {
425 | "text/plain": [
426 | "Make 0\n",
427 | "Colour 0\n",
428 | "Doors 0\n",
429 | "Odometer (KM) 0\n",
430 | "dtype: int64"
431 | ]
432 | },
433 | "execution_count": 17,
434 | "metadata": {},
435 | "output_type": "execute_result"
436 | }
437 | ],
438 | "source": [
439 | "# Get our transformed data array's back into DataFrame's\n",
440 | "car_sales_filled_train = pd.DataFrame(filled_X_train, \n",
441 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n",
442 | "\n",
443 | "car_sales_filled_test = pd.DataFrame(filled_X_test, \n",
444 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n",
445 | "\n",
446 | "# Check missing data in training set\n",
447 | "car_sales_filled_train.isna().sum()"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": 21,
453 | "id": "distributed-grounds",
454 | "metadata": {},
455 | "outputs": [
456 | {
457 | "data": {
458 | "text/plain": [
459 | "array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
460 | " 0.00000e+00, 7.19340e+04],\n",
461 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
462 | " 0.00000e+00, 1.62665e+05],\n",
463 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
464 | " 0.00000e+00, 4.28440e+04],\n",
465 | " ...,\n",
466 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
467 | " 0.00000e+00, 1.96225e+05],\n",
468 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
469 | " 0.00000e+00, 1.33117e+05],\n",
470 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
471 | " 0.00000e+00, 1.50582e+05]])"
472 | ]
473 | },
474 | "execution_count": 21,
475 | "metadata": {},
476 | "output_type": "execute_result"
477 | }
478 | ],
479 | "source": [
480 | "# Turn the categories into numbers\n",
481 | "from sklearn.preprocessing import OneHotEncoder\n",
482 | "\n",
483 | "\n",
484 | "categorical_features = [\"Make\", \"Colour\", \"Doors\"] \n",
485 | "\n",
486 | "one_hot = OneHotEncoder()\n",
487 | "transformer = ColumnTransformer([(\"one_hot\", \n",
488 | " one_hot,\n",
489 | " categorical_features)], remainder=\"passthrough\")\n",
490 | "\n",
491 | "# Fill train and test values separately\n",
492 | "transformed_X_train = transformer.fit_transform(car_sales_filled_train)\n",
493 | "transformed_X_test = transformer.transform(car_sales_filled_test)\n",
494 | "\n",
495 | "transformed_X_train.toarray()"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": 22,
501 | "id": "historic-sarah",
502 | "metadata": {},
503 | "outputs": [
504 | {
505 | "data": {
506 | "text/plain": [
507 | "0.21735623151692096"
508 | ]
509 | },
510 | "execution_count": 22,
511 | "metadata": {},
512 | "output_type": "execute_result"
513 | }
514 | ],
515 | "source": [
516 | "# 2. Chose the right model and hyper-parameters\n",
517 | "\n",
518 | "from sklearn.ensemble import RandomForestRegressor\n",
519 | "\n",
520 | "model = RandomForestRegressor()\n",
521 | "model.fit(transformed_X_train, y_train)\n",
522 | "model.score(transformed_X_test, y_test)"
523 | ]
524 | },
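{
"cell_type": "markdown",
"id": "pipeline-sketch-note",
"metadata": {},
"source": [
"A minimal sketch, assuming the same `X_train`/`X_test`/`y_train`/`y_test` as above: the imputation, encoding and modelling steps can also be chained into a single estimator with `Pipeline` + `ColumnTransformer`, which handles the fit-on-train / transform-on-test bookkeeping automatically (a preview of step 7, \"Put it all together!\")."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pipeline-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch; the names below are illustrative, not reused from the cells above\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"# Impute and one-hot encode each column group in one step\n",
"cat_pipe = Pipeline([(\"impute\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
"                     (\"encode\", OneHotEncoder(handle_unknown=\"ignore\"))])\n",
"door_pipe = Pipeline([(\"impute\", SimpleImputer(strategy=\"constant\", fill_value=4)),\n",
"                      (\"encode\", OneHotEncoder(handle_unknown=\"ignore\"))])\n",
"\n",
"preprocess = ColumnTransformer([\n",
"    (\"cat\", cat_pipe, [\"Make\", \"Colour\"]),\n",
"    (\"door\", door_pipe, [\"Doors\"]),\n",
"    (\"num\", SimpleImputer(strategy=\"mean\"), [\"Odometer (KM)\"])])\n",
"\n",
"# Fits the preprocessors on the training set only, then reuses them on the test set\n",
"pipe = Pipeline([(\"preprocess\", preprocess), (\"model\", RandomForestRegressor())])\n",
"pipe.fit(X_train, y_train)\n",
"pipe.score(X_test, y_test)"
]
},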
525 | {
526 | "cell_type": "code",
527 | "execution_count": null,
528 | "id": "complex-camera",
529 | "metadata": {},
530 | "outputs": [],
531 | "source": []
532 | }
533 | ],
534 | "metadata": {
535 | "kernelspec": {
536 | "display_name": "Python 3",
537 | "language": "python",
538 | "name": "python3"
539 | },
540 | "language_info": {
541 | "codemirror_mode": {
542 | "name": "ipython",
543 | "version": 3
544 | },
545 | "file_extension": ".py",
546 | "mimetype": "text/x-python",
547 | "name": "python",
548 | "nbconvert_exporter": "python",
549 | "pygments_lexer": "ipython3",
550 | "version": "3.8.8"
551 | }
552 | },
553 | "nbformat": 4,
554 | "nbformat_minor": 5
555 | }
556 |
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/1-Get-Data-Ready.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "sublime-register",
6 | "metadata": {},
7 | "source": [
8 | "# Introduction to Scikit-Learn (sklearn)\n",
9 | "\n",
10 | "This notebook demonstates some of the most useful functions of the Sklearn Lib\n",
11 | "\n",
12 | "Cover:\n",
13 | "\n",
14 | "0. End-to_end Scikit-Learn Workflow\n",
15 | "1. Getting Data Ready\n",
16 | "2. Choose the right estimator/algorithm for our problems\n",
17 | "3. Fit the model/algorithm and use it to make predictions on our data\n",
18 | "4. Evaluation a model\n",
19 | "5. Improve a model\n",
20 | "6. Save and load a trained model\n",
21 | "7. Put it all together!"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "id": "raising-nutrition",
27 | "metadata": {},
28 | "source": [
29 | "## 0. An end-to-end scikit-learn workflow"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 1,
35 | "id": "annoying-macedonia",
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "# 1. Get the data ready\n",
40 | "\n",
41 | "# Standard import\n",
42 | "import pandas as pd\n",
43 | "import numpy as np\n",
44 | "import matplotlib.pyplot as plt\n",
45 | "\n",
46 | "%matplotlib inline"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 2,
52 | "id": "abroad-prediction",
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "data": {
57 | "text/html": [
58 | "\n",
59 | "\n",
72 | "
\n",
73 | " \n",
74 | " \n",
75 | " | \n",
76 | " Make | \n",
77 | " Colour | \n",
78 | " Odometer (KM) | \n",
79 | " Doors | \n",
80 | " Price | \n",
81 | "
\n",
82 | " \n",
83 | " \n",
84 | " \n",
85 | " 0 | \n",
86 | " Honda | \n",
87 | " White | \n",
88 | " 35431.0 | \n",
89 | " 4.0 | \n",
90 | " 15323.0 | \n",
91 | "
\n",
92 | " \n",
93 | " 1 | \n",
94 | " BMW | \n",
95 | " Blue | \n",
96 | " 192714.0 | \n",
97 | " 5.0 | \n",
98 | " 19943.0 | \n",
99 | "
\n",
100 | " \n",
101 | " 2 | \n",
102 | " Honda | \n",
103 | " White | \n",
104 | " 84714.0 | \n",
105 | " 4.0 | \n",
106 | " 28343.0 | \n",
107 | "
\n",
108 | " \n",
109 | " 3 | \n",
110 | " Toyota | \n",
111 | " White | \n",
112 | " 154365.0 | \n",
113 | " 4.0 | \n",
114 | " 13434.0 | \n",
115 | "
\n",
116 | " \n",
117 | " 4 | \n",
118 | " Nissan | \n",
119 | " Blue | \n",
120 | " 181577.0 | \n",
121 | " 3.0 | \n",
122 | " 14043.0 | \n",
123 | "
\n",
124 | " \n",
125 | "
\n",
126 | "
"
127 | ],
128 | "text/plain": [
129 | " Make Colour Odometer (KM) Doors Price\n",
130 | "0 Honda White 35431.0 4.0 15323.0\n",
131 | "1 BMW Blue 192714.0 5.0 19943.0\n",
132 | "2 Honda White 84714.0 4.0 28343.0\n",
133 | "3 Toyota White 154365.0 4.0 13434.0\n",
134 | "4 Nissan Blue 181577.0 3.0 14043.0"
135 | ]
136 | },
137 | "execution_count": 2,
138 | "metadata": {},
139 | "output_type": "execute_result"
140 | }
141 | ],
142 | "source": [
143 | "car_sales = pd.read_csv(\"./data/car-sales-extended-missing-data.csv\")\n",
144 | "car_sales.head()"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 3,
150 | "id": "liable-mortgage",
151 | "metadata": {},
152 | "outputs": [
153 | {
154 | "data": {
155 | "text/plain": [
156 | "1000"
157 | ]
158 | },
159 | "execution_count": 3,
160 | "metadata": {},
161 | "output_type": "execute_result"
162 | }
163 | ],
164 | "source": [
165 | "len(car_sales)"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 4,
171 | "id": "relevant-space",
172 | "metadata": {
173 | "scrolled": true
174 | },
175 | "outputs": [
176 | {
177 | "data": {
178 | "text/plain": [
179 | "Make object\n",
180 | "Colour object\n",
181 | "Odometer (KM) float64\n",
182 | "Doors float64\n",
183 | "Price float64\n",
184 | "dtype: object"
185 | ]
186 | },
187 | "execution_count": 4,
188 | "metadata": {},
189 | "output_type": "execute_result"
190 | }
191 | ],
192 | "source": [
193 | "car_sales.dtypes"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "id": "coastal-nudist",
199 | "metadata": {},
200 | "source": [
201 | "## What if there were missing values ?\n",
202 | "1. Fill them with some values (a.k.a `imputation`).\n",
203 | "2. Remove the samples with missing data altogether."
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 5,
209 | "id": "naval-terry",
210 | "metadata": {
211 | "scrolled": true
212 | },
213 | "outputs": [
214 | {
215 | "data": {
216 | "text/plain": [
217 | "Make 49\n",
218 | "Colour 50\n",
219 | "Odometer (KM) 50\n",
220 | "Doors 50\n",
221 | "Price 50\n",
222 | "dtype: int64"
223 | ]
224 | },
225 | "execution_count": 5,
226 | "metadata": {},
227 | "output_type": "execute_result"
228 | }
229 | ],
230 | "source": [
231 | "car_sales.isna().sum()"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 7,
237 | "id": "under-ferry",
238 | "metadata": {},
239 | "outputs": [],
240 | "source": [
241 | "# Drop the rows with missing in the \"Price\" column\n",
242 | "car_sales.dropna(subset=[\"Price\"], inplace=True)"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "id": "greenhouse-ticket",
248 | "metadata": {},
249 | "source": [
250 | "## 1. Getting Data Ready: \n",
251 | "\n",
252 | "Three main thins we have to do:\n",
253 | "1. Split the data into features and labels (Usually `X` and `y`)\n",
254 | "2. Filling (also called imputing) or disregarding missing values\n",
255 | "3. Converting non-numerical values to numerical values (a.k.a. feature encoding)"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 8,
261 | "id": "prescription-vertical",
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "# Create X (features matrix)\n",
266 | "X = car_sales.drop(\"Price\", axis = 1) # Remove 'target' column\n",
267 | "\n",
268 | "# Create y (lables)\n",
269 | "y = car_sales[\"Price\"]"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "id": "incorporated-september",
275 | "metadata": {},
276 | "source": [
277 | "**Note**: We split data into train & test to perform filling missing values on them separately."
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 10,
283 | "id": "scenic-baking",
284 | "metadata": {},
285 | "outputs": [],
286 | "source": [
287 | "np.random.seed(42)\n",
288 | "\n",
289 | "# Split the data into training and test sets\n",
290 | "from sklearn.model_selection import train_test_split\n",
291 | "\n",
292 | "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": 11,
298 | "id": "raising-gibraltar",
299 | "metadata": {},
300 | "outputs": [
301 | {
302 | "data": {
303 | "text/plain": [
304 | "((760, 4), (190, 4), (760,), (190,))"
305 | ]
306 | },
307 | "execution_count": 11,
308 | "metadata": {},
309 | "output_type": "execute_result"
310 | }
311 | ],
312 | "source": [
313 | "X_train.shape, X_test.shape, y_train.shape, y_test.shape"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "id": "parliamentary-click",
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "# Inspect whether \"Door\" is categorical feature or not\n",
324 | "# Although \"Door\" contains numerical values\n",
325 | "car_sales[\"Doors\"].value_counts()\n",
326 | "\n",
327 | "# Conclusion: \"Door\" is categorical feature since it has only 3 options"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": 12,
333 | "id": "alternate-indian",
334 | "metadata": {},
335 | "outputs": [],
336 | "source": [
337 | "# Fill missing values with Scikit-Learn \n",
338 | "from sklearn.impute import SimpleImputer #Help fill the missing values\n",
339 | "from sklearn.compose import ColumnTransformer\n",
340 | "\n",
341 | "# Fill Categorical values with 'missing' & numerical values with mean\n",
342 | "\n",
343 | "cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"missing\")\n",
344 | "door_imputer = SimpleImputer(strategy=\"constant\", fill_value=4) #Since \"Door\" col, although type is numerical, but actually cat\n",
345 | "num_imputer = SimpleImputer(strategy=\"mean\")"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 13,
351 | "id": "blank-study",
352 | "metadata": {},
353 | "outputs": [],
354 | "source": [
355 | "# Define different column features\n",
356 | "categorical_features = [\"Make\", \"Colour\"]\n",
357 | "door_feature = [\"Doors\"]\n",
358 | "numerical_feature = [\"Odometer (KM)\"]"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 15,
364 | "id": "superb-forwarding",
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "imputer = ColumnTransformer([\n",
369 | " (\"cat_imputer\", cat_imputer, categorical_features),\n",
370 | " (\"door_imputer\", door_imputer, door_feature),\n",
371 | " (\"num_imputer\", num_imputer, numerical_feature)])"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "id": "governing-oxford",
377 | "metadata": {},
378 | "source": [
379 | "**Note:** We use fit_transform() on the training data and transform() on the testing data. \n",
380 | "* In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). \n",
381 | "* Then we take those same patterns and fill the test set (transform only)."
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 16,
387 | "id": "matched-english",
388 | "metadata": {},
389 | "outputs": [
390 | {
391 | "data": {
392 | "text/plain": [
393 | "array([['Honda', 'White', 4.0, 71934.0],\n",
394 | " ['Toyota', 'Red', 4.0, 162665.0],\n",
395 | " ['Honda', 'White', 4.0, 42844.0],\n",
396 | " ...,\n",
397 | " ['Toyota', 'White', 4.0, 196225.0],\n",
398 | " ['Honda', 'Blue', 4.0, 133117.0],\n",
399 | " ['Honda', 'missing', 4.0, 150582.0]], dtype=object)"
400 | ]
401 | },
402 | "execution_count": 16,
403 | "metadata": {},
404 | "output_type": "execute_result"
405 | }
406 | ],
407 | "source": [
408 | "# learn the patterns in the training set and transform it via imputation (fit, then transform)\n",
409 | "filled_X_train = imputer.fit_transform(X_train)\n",
410 | "# take those same patterns and fill the test set (transform only)\n",
411 | "filled_X_test = imputer.transform(X_test)\n",
412 | "\n",
413 | "# Check filled X_train\n",
414 | "filled_X_train"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 17,
420 | "id": "pointed-darkness",
421 | "metadata": {},
422 | "outputs": [
423 | {
424 | "data": {
425 | "text/plain": [
426 | "Make 0\n",
427 | "Colour 0\n",
428 | "Doors 0\n",
429 | "Odometer (KM) 0\n",
430 | "dtype: int64"
431 | ]
432 | },
433 | "execution_count": 17,
434 | "metadata": {},
435 | "output_type": "execute_result"
436 | }
437 | ],
438 | "source": [
439 | "# Get our transformed data array's back into DataFrame's\n",
440 | "car_sales_filled_train = pd.DataFrame(filled_X_train, \n",
441 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n",
442 | "\n",
443 | "car_sales_filled_test = pd.DataFrame(filled_X_test, \n",
444 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n",
445 | "\n",
446 | "# Check missing data in training set\n",
447 | "car_sales_filled_train.isna().sum()"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": 21,
453 | "id": "distributed-grounds",
454 | "metadata": {},
455 | "outputs": [
456 | {
457 | "data": {
458 | "text/plain": [
459 | "array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
460 | " 0.00000e+00, 7.19340e+04],\n",
461 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
462 | " 0.00000e+00, 1.62665e+05],\n",
463 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
464 | " 0.00000e+00, 4.28440e+04],\n",
465 | " ...,\n",
466 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
467 | " 0.00000e+00, 1.96225e+05],\n",
468 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
469 | " 0.00000e+00, 1.33117e+05],\n",
470 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n",
471 | " 0.00000e+00, 1.50582e+05]])"
472 | ]
473 | },
474 | "execution_count": 21,
475 | "metadata": {},
476 | "output_type": "execute_result"
477 | }
478 | ],
479 | "source": [
480 | "# Turn the categories into numbers\n",
481 | "from sklearn.preprocessing import OneHotEncoder\n",
482 | "\n",
483 | "\n",
484 | "categorical_features = [\"Make\", \"Colour\", \"Doors\"] \n",
485 | "\n",
486 | "one_hot = OneHotEncoder()\n",
487 | "transformer = ColumnTransformer([(\"one_hot\", \n",
488 | " one_hot,\n",
489 | " categorical_features)], remainder=\"passthrough\")\n",
490 | "\n",
491 | "# Fill train and test values separately\n",
492 | "transformed_X_train = transformer.fit_transform(car_sales_filled_train)\n",
493 | "transformed_X_test = transformer.transform(car_sales_filled_test)\n",
494 | "\n",
495 | "transformed_X_train.toarray()"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": 22,
501 | "id": "historic-sarah",
502 | "metadata": {},
503 | "outputs": [
504 | {
505 | "data": {
506 | "text/plain": [
507 | "0.21735623151692096"
508 | ]
509 | },
510 | "execution_count": 22,
511 | "metadata": {},
512 | "output_type": "execute_result"
513 | }
514 | ],
515 | "source": [
516 | "# 2. Chose the right model and hyper-parameters\n",
517 | "\n",
518 | "from sklearn.ensemble import RandomForestRegressor\n",
519 | "\n",
520 | "model = RandomForestRegressor()\n",
521 | "model.fit(transformed_X_train, y_train)\n",
522 | "model.score(transformed_X_test, y_test)"
523 | ]
524 | },
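{
"cell_type": "markdown",
"id": "pipeline-sketch-note",
"metadata": {},
"source": [
"A minimal sketch, assuming the same `X_train`/`X_test`/`y_train`/`y_test` as above: the imputation, encoding and modelling steps can also be chained into a single estimator with `Pipeline` + `ColumnTransformer`, which handles the fit-on-train / transform-on-test bookkeeping automatically (a preview of step 7, \"Put it all together!\")."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pipeline-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch; the names below are illustrative, not reused from the cells above\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"# Impute and one-hot encode each column group in one step\n",
"cat_pipe = Pipeline([(\"impute\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
"                     (\"encode\", OneHotEncoder(handle_unknown=\"ignore\"))])\n",
"door_pipe = Pipeline([(\"impute\", SimpleImputer(strategy=\"constant\", fill_value=4)),\n",
"                      (\"encode\", OneHotEncoder(handle_unknown=\"ignore\"))])\n",
"\n",
"preprocess = ColumnTransformer([\n",
"    (\"cat\", cat_pipe, [\"Make\", \"Colour\"]),\n",
"    (\"door\", door_pipe, [\"Doors\"]),\n",
"    (\"num\", SimpleImputer(strategy=\"mean\"), [\"Odometer (KM)\"])])\n",
"\n",
"# Fits the preprocessors on the training set only, then reuses them on the test set\n",
"pipe = Pipeline([(\"preprocess\", preprocess), (\"model\", RandomForestRegressor())])\n",
"pipe.fit(X_train, y_train)\n",
"pipe.score(X_test, y_test)"
]
},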
525 | {
526 | "cell_type": "code",
527 | "execution_count": null,
528 | "id": "complex-camera",
529 | "metadata": {},
530 | "outputs": [],
531 | "source": []
532 | }
533 | ],
534 | "metadata": {
535 | "kernelspec": {
536 | "display_name": "Python 3",
537 | "language": "python",
538 | "name": "python3"
539 | },
540 | "language_info": {
541 | "codemirror_mode": {
542 | "name": "ipython",
543 | "version": 3
544 | },
545 | "file_extension": ".py",
546 | "mimetype": "text/x-python",
547 | "name": "python",
548 | "nbconvert_exporter": "python",
549 | "pygments_lexer": "ipython3",
550 | "version": "3.8.8"
551 | }
552 | },
553 | "nbformat": 4,
554 | "nbformat_minor": 5
555 | }
556 |
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/data/car-sales-missing-data.csv:
--------------------------------------------------------------------------------
1 | Make,Colour,Odometer,Doors,Price
2 | Toyota,White,150043,4,"$4,000"
3 | Honda,Red,87899,4,"$5,000"
4 | Toyota,Blue,,3,"$7,000"
5 | BMW,Black,11179,5,"$22,000"
6 | Nissan,White,213095,4,"$3,500"
7 | Toyota,Green,,4,"$4,500"
8 | Honda,,,4,"$7,500"
9 | Honda,Blue,,4,
10 | Toyota,White,60000,,
11 | ,White,31600,4,"$9,700"
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/data/car-sales.csv:
--------------------------------------------------------------------------------
1 | Make,Colour,Odometer (KM),Doors,Price
2 | Toyota,White,150043,4,"$4,000.00"
3 | Honda,Red,87899,4,"$5,000.00"
4 | Toyota,Blue,32549,3,"$7,000.00"
5 | BMW,Black,11179,5,"$22,000.00"
6 | Nissan,White,213095,4,"$3,500.00"
7 | Toyota,Green,99213,4,"$4,500.00"
8 | Honda,Blue,45698,4,"$7,500.00"
9 | Honda,Blue,54738,4,"$7,000.00"
10 | Toyota,White,60000,4,"$6,250.00"
11 | Nissan,White,31600,4,"$9,700.00"
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/data/heart-disease.csv:
--------------------------------------------------------------------------------
1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1
10 | 52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1
12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1
13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1
15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1
16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1
17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1
18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1
19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1
20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1
21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1
22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1
23 | 44,1,2,130,233,0,1,179,1,0.4,2,0,2,1
24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1
25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1
26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1
27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1
28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1
29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1
30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1
31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1
32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1
33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1
34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1
35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1
36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1
37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1
38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1
39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1
40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1
41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1
42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1
43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1
44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1
45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1
46 | 39,1,2,140,321,0,0,182,0,0,2,0,2,1
47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1
48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1
49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1
50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1
51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1
52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1
53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1
54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1
55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1
56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1
57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1
58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1
59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1
60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1
61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1
62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1
63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1
64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1
65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1
66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1
67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1
68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1
69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1
70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1
71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1
72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1
73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1
74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1
75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1
76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1
77 | 55,0,1,135,250,0,0,161,0,1.4,1,0,2,1
78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1
79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1
80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1
81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1
82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1
83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1
84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1
85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1
86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1
87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1
88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1
89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1
90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1
91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1
92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1
93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1
94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1
95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1
96 | 45,0,1,112,160,0,1,138,0,0,1,0,2,1
97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1
98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1
99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1
100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1
101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1
102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1
103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1
104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1
105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1
106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1
107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1
108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1
109 | 45,0,0,138,236,0,0,152,1,0.2,1,0,2,1
110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1
111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1
112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1
113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1
114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1
115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1
116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1
117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1
118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1
119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1
120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1
121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1
122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1
123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1
124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1
125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1
126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1
127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1
128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1
129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1
130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1
131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1
132 | 54,0,2,160,201,0,1,163,0,0,2,1,2,1
133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1
134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1
135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1
136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1
137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1
138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1
139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1
140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1
141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1
142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1
143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1
144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1
145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1
146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1
147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1
148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1
149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1
150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1
151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1
152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1
153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1
154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1
155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1
156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1
157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1
158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1
159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1
160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1
161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1
162 | 56,1,1,120,240,0,1,169,0,0,0,0,2,1
163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1
164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1
165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0
168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0
169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0
170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0
171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0
173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0
174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0
175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0
176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0
177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0
178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0
179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0
180 | 43,1,0,120,177,0,0,120,1,2.5,1,0,3,0
181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0
182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0
183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0
184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0
185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0
186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0
187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0
188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0
189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0
190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0
191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0
192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0
193 | 58,1,0,128,216,0,0,131,1,2.2,1,3,3,0
194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0
195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0
196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0
197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0
198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0
199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0
200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0
201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0
202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0
203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0
205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0
206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0
207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0
208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0
209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0
210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0
211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0
212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0
213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0
214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0
215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0
216 | 56,1,0,125,249,1,0,144,1,1.2,1,1,2,0
217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0
218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0
219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0
220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0
221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0
222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0
223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0
224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0
225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0
226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0
227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0
229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0
230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0
231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0
232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0
233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0
234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0
235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0
236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0
237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0
238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0
239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0
240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0
241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0
242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0
243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0
244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0
245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0
246 | 56,1,0,132,184,0,0,105,1,2.1,1,1,1,0
247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0
248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0
249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0
250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0
251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0
252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0
253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0
254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0
256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0
257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0
258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0
259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0
260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0
261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0
262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0
263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0
264 | 53,1,0,123,282,0,1,95,1,2,1,2,3,0
265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0
266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0
267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0
268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0
269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0
270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0
271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0
272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0
273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0
274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0
275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0
276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0
277 | 52,1,0,125,212,0,1,168,0,1,2,2,3,0
278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0
279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0
280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0
281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0
282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0
283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0
284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0
285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0
286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0
287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0
288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0
289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0
290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0
291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0
292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0
293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0
294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0
296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0
297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0
298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0
299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0
300 | 57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0
305 |
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/gs_random_forest_model_1.joblib:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/gs_random_forest_model_1.joblib
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/gs_random_forest_model_1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/gs_random_forest_model_1.pkl
--------------------------------------------------------------------------------
/Code/A5_Scikit_Learn/random_forest_model_1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/random_forest_model_1.pkl
--------------------------------------------------------------------------------
/Code/A6_Kaggle/Day12_Housing Prices Competition.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "07d7bd2d",
7 | "metadata": {
8 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
10 | "execution": {
11 | "iopub.execute_input": "2021-08-15T03:43:11.802107Z",
12 | "iopub.status.busy": "2021-08-15T03:43:11.800938Z",
13 | "iopub.status.idle": "2021-08-15T03:43:11.814520Z",
14 | "shell.execute_reply": "2021-08-15T03:43:11.815049Z",
15 | "shell.execute_reply.started": "2021-08-15T02:46:18.327501Z"
16 | },
17 | "papermill": {
18 | "duration": 0.02756,
19 | "end_time": "2021-08-15T03:43:11.815353",
20 | "exception": false,
21 | "start_time": "2021-08-15T03:43:11.787793",
22 | "status": "completed"
23 | },
24 | "tags": []
25 | },
26 | "outputs": [
27 | {
28 | "name": "stdout",
29 | "output_type": "stream",
30 | "text": [
31 | "/kaggle/input/home-data-for-ml-course/sample_submission.csv\n",
32 | "/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz\n",
33 | "/kaggle/input/home-data-for-ml-course/train.csv.gz\n",
34 | "/kaggle/input/home-data-for-ml-course/data_description.txt\n",
35 | "/kaggle/input/home-data-for-ml-course/test.csv.gz\n",
36 | "/kaggle/input/home-data-for-ml-course/train.csv\n",
37 | "/kaggle/input/home-data-for-ml-course/test.csv\n"
38 | ]
39 | }
40 | ],
41 | "source": [
42 | "# This Python 3 environment comes with many helpful analytics libraries installed\n",
43 | "# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n",
44 | "# For example, here's several helpful packages to load\n",
45 | "\n",
46 | "import numpy as np # linear algebra\n",
47 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
48 | "\n",
49 | "# Input data files are available in the read-only \"../input/\" directory\n",
50 | "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n",
51 | "\n",
52 | "import os\n",
53 | "for dirname, _, filenames in os.walk('/kaggle/input'):\n",
54 | " for filename in filenames:\n",
55 | " print(os.path.join(dirname, filename))\n",
56 | "\n",
57 | "# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n",
58 | "# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 2,
64 | "id": "7def9f9a",
65 | "metadata": {
66 | "execution": {
67 | "iopub.execute_input": "2021-08-15T03:43:11.837576Z",
68 | "iopub.status.busy": "2021-08-15T03:43:11.836898Z",
69 | "iopub.status.idle": "2021-08-15T03:43:13.077546Z",
70 | "shell.execute_reply": "2021-08-15T03:43:13.076802Z",
71 | "shell.execute_reply.started": "2021-08-15T03:10:19.101028Z"
72 | },
73 | "papermill": {
74 | "duration": 1.252221,
75 | "end_time": "2021-08-15T03:43:13.077693",
76 | "exception": false,
77 | "start_time": "2021-08-15T03:43:11.825472",
78 | "status": "completed"
79 | },
80 | "tags": []
81 | },
82 | "outputs": [],
83 | "source": [
84 | "# Import helpful libraries\n",
85 | "import pandas as pd\n",
86 | "import numpy as np\n",
87 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
88 | "from sklearn.metrics import mean_absolute_error, mean_squared_error #Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.\n",
89 | "from sklearn.model_selection import train_test_split\n",
90 | "\n",
91 | "# Load the data, and separate the target\n",
92 | "iowa_file_path = '../input/home-data-for-ml-course/train.csv'\n",
93 | "home_data = pd.read_csv(iowa_file_path)\n",
94 | "y = home_data.SalePrice\n",
95 | "\n",
96 | "# Create X (After completing the exercise, you can return to modify this line!)\n",
97 | "#features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\n"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 3,
103 | "id": "fed68f38",
104 | "metadata": {
105 | "execution": {
106 | "iopub.execute_input": "2021-08-15T03:43:13.101919Z",
107 | "iopub.status.busy": "2021-08-15T03:43:13.101196Z",
108 | "iopub.status.idle": "2021-08-15T03:43:13.104152Z",
109 | "shell.execute_reply": "2021-08-15T03:43:13.104647Z",
110 | "shell.execute_reply.started": "2021-08-15T03:01:48.233770Z"
111 | },
112 | "papermill": {
113 | "duration": 0.017924,
114 | "end_time": "2021-08-15T03:43:13.104807",
115 | "exception": false,
116 | "start_time": "2021-08-15T03:43:13.086883",
117 | "status": "completed"
118 | },
119 | "tags": []
120 | },
121 | "outputs": [],
122 | "source": [
123 | "# Create X (After completing the exercise, you can return to modify this line!)\n",
124 | "features = [\n",
125 | " 'MSSubClass',\n",
126 | " 'LotArea',\n",
127 | " 'OverallQual',\n",
128 | " 'OverallCond',\n",
129 | " 'YearBuilt',\n",
130 | " 'YearRemodAdd', \n",
131 | " '1stFlrSF',\n",
132 | " '2ndFlrSF' ,\n",
133 | " 'LowQualFinSF',\n",
134 | " 'GrLivArea',\n",
135 | " 'FullBath',\n",
136 | " 'HalfBath',\n",
137 | " 'BedroomAbvGr',\n",
138 | " 'KitchenAbvGr', \n",
139 | " 'TotRmsAbvGrd',\n",
140 | " 'Fireplaces', \n",
141 | " 'WoodDeckSF' ,\n",
142 | " 'OpenPorchSF',\n",
143 | " 'EnclosedPorch',\n",
144 | " '3SsnPorch', \n",
145 | " 'ScreenPorch',\n",
146 | " 'PoolArea',\n",
147 | " 'MiscVal',\n",
148 | " 'MoSold',\n",
149 | " 'YrSold'\n",
150 | "]"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 4,
156 | "id": "ac795a42",
157 | "metadata": {
158 | "execution": {
159 | "iopub.execute_input": "2021-08-15T03:43:13.126743Z",
160 | "iopub.status.busy": "2021-08-15T03:43:13.126039Z",
161 | "iopub.status.idle": "2021-08-15T03:43:20.892772Z",
162 | "shell.execute_reply": "2021-08-15T03:43:20.893262Z",
163 | "shell.execute_reply.started": "2021-08-15T03:40:09.224892Z"
164 | },
165 | "papermill": {
166 | "duration": 7.779176,
167 | "end_time": "2021-08-15T03:43:20.893480",
168 | "exception": false,
169 | "start_time": "2021-08-15T03:43:13.114304",
170 | "status": "completed"
171 | },
172 | "tags": []
173 | },
174 | "outputs": [
175 | {
176 | "name": "stdout",
177 | "output_type": "stream",
178 | "text": [
179 | "Validation RMSE for Random Forest Model: 26,895\n",
180 | "Validation RMSE for Gradient Boosting Model: 24,597\n",
181 | "Validation RMSE for Mean Prediction of 2 Models: 23,834\n"
182 | ]
183 | }
184 | ],
185 | "source": [
186 | "# Select columns corresponding to features, and preview the data\n",
187 | "X = home_data[features]\n",
188 | "\n",
189 | "\n",
190 | "# Split into validation and training data\n",
191 | "train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)\n",
192 | "\n",
193 | "# Define a random forest model\n",
194 | "rf_model = RandomForestRegressor(random_state=1, n_estimators=700)\n",
195 | "rf_model.fit(train_X, train_y)\n",
196 | "rf_val_predictions = rf_model.predict(val_X)\n",
197 | "rf_val_rmse = np.sqrt(mean_squared_error(rf_val_predictions, val_y))\n",
198 | "\n",
199 | "gbm_model = GradientBoostingRegressor(random_state=1, n_estimators=500)\n",
200 | "gbm_model.fit(train_X, train_y)\n",
201 | "gbm_val_predictions = gbm_model.predict(val_X)\n",
202 | "gbm_val_rmse = np.sqrt(mean_squared_error(gbm_val_predictions, val_y))\n",
203 | "\n",
204 | "mean_2model_val_predictions = (rf_val_predictions + gbm_val_predictions)/2\n",
205 | "mean_2model_val_rmse = np.sqrt(mean_squared_error(mean_2model_val_predictions, val_y))\n",
206 | "\n",
207 | "print(\"Validation RMSE for Random Forest Model: {:,.0f}\".format(rf_val_rmse))\n",
208 | "print(\"Validation RMSE for Gradient Boosting Model: {:,.0f}\".format(gbm_val_rmse))\n",
209 | "print(\"Validation RMSE for Mean Prediction of 2 Models: {:,.0f}\".format(mean_2model_val_rmse))"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 5,
215 | "id": "87e88081",
216 | "metadata": {
217 | "execution": {
218 | "iopub.execute_input": "2021-08-15T03:43:20.917251Z",
219 | "iopub.status.busy": "2021-08-15T03:43:20.916646Z",
220 | "iopub.status.idle": "2021-08-15T03:43:20.918748Z",
221 | "shell.execute_reply": "2021-08-15T03:43:20.919237Z",
222 | "shell.execute_reply.started": "2021-08-15T02:52:49.030875Z"
223 | },
224 | "papermill": {
225 | "duration": 0.016227,
226 | "end_time": "2021-08-15T03:43:20.919401",
227 | "exception": false,
228 | "start_time": "2021-08-15T03:43:20.903174",
229 | "status": "completed"
230 | },
231 | "tags": []
232 | },
233 | "outputs": [],
234 | "source": [
235 | "#?RandomForestRegressor "
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 6,
241 | "id": "4b3573a6",
242 | "metadata": {
243 | "execution": {
244 | "iopub.execute_input": "2021-08-15T03:43:20.941585Z",
245 | "iopub.status.busy": "2021-08-15T03:43:20.940666Z",
246 | "iopub.status.idle": "2021-08-15T03:43:30.827997Z",
247 | "shell.execute_reply": "2021-08-15T03:43:30.828507Z",
248 | "shell.execute_reply.started": "2021-08-15T03:41:31.585294Z"
249 | },
250 | "papermill": {
251 | "duration": 9.899838,
252 | "end_time": "2021-08-15T03:43:30.828673",
253 | "exception": false,
254 | "start_time": "2021-08-15T03:43:20.928835",
255 | "status": "completed"
256 | },
257 | "tags": []
258 | },
259 | "outputs": [
260 | {
261 | "data": {
262 | "text/plain": [
263 | "GradientBoostingRegressor(n_estimators=500, random_state=1)"
264 | ]
265 | },
266 | "execution_count": 6,
267 | "metadata": {},
268 | "output_type": "execute_result"
269 | }
270 | ],
271 | "source": [
272 | "# To improve accuracy, create a new Random Forest model which you will train on all training data\n",
273 | "rf_model.fit(X,y)\n",
274 | "gbm_model.fit(X,y)"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": 7,
280 | "id": "701f28ab",
281 | "metadata": {
282 | "execution": {
283 | "iopub.execute_input": "2021-08-15T03:43:30.853856Z",
284 | "iopub.status.busy": "2021-08-15T03:43:30.853151Z",
285 | "iopub.status.idle": "2021-08-15T03:43:31.128150Z",
286 | "shell.execute_reply": "2021-08-15T03:43:31.128649Z",
287 | "shell.execute_reply.started": "2021-08-15T03:42:36.364752Z"
288 | },
289 | "papermill": {
290 | "duration": 0.290491,
291 | "end_time": "2021-08-15T03:43:31.128831",
292 | "exception": false,
293 | "start_time": "2021-08-15T03:43:30.838340",
294 | "status": "completed"
295 | },
296 | "tags": []
297 | },
298 | "outputs": [],
299 | "source": [
300 | "# path to file you will use for predictions\n",
301 | "test_data_path = '../input/home-data-for-ml-course/test.csv'\n",
302 | "\n",
303 | "# read test data file using pandas\n",
304 | "test_data = pd.read_csv(test_data_path)\n",
305 | "test_data = test_data.fillna(-1)\n",
306 | "# create test_X which comes from test_data but includes only the columns you used for prediction.\n",
307 | "# The list of columns is stored in a variable called features\n",
308 | "test_X = test_data[features]\n",
309 | "\n",
310 | "# make predictions which we will submit. \n",
311 | "test_preds1 = rf_model.predict(test_X)\n",
312 | "test_preds2 = gbm_model.predict(test_X)\n",
313 | "\n",
314 | "test_preds = (test_preds1 + test_preds2)/2"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 8,
320 | "id": "4a01247c",
321 | "metadata": {
322 | "execution": {
323 | "iopub.execute_input": "2021-08-15T03:43:31.154336Z",
324 | "iopub.status.busy": "2021-08-15T03:43:31.153705Z",
325 | "iopub.status.idle": "2021-08-15T03:43:31.165500Z",
326 | "shell.execute_reply": "2021-08-15T03:43:31.164833Z",
327 | "shell.execute_reply.started": "2021-08-15T03:42:39.697331Z"
328 | },
329 | "papermill": {
330 | "duration": 0.027107,
331 | "end_time": "2021-08-15T03:43:31.165641",
332 | "exception": false,
333 | "start_time": "2021-08-15T03:43:31.138534",
334 | "status": "completed"
335 | },
336 | "tags": []
337 | },
338 | "outputs": [],
339 | "source": [
340 | "# Run the code to save predictions in the format used for competition scoring\n",
341 | "\n",
342 | "output = pd.DataFrame({'Id': test_data.Id,\n",
343 | " 'SalePrice': test_preds})\n",
344 | "output.to_csv('submission.csv', index=False)"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": 9,
350 | "id": "795c92eb",
351 | "metadata": {
352 | "execution": {
353 | "iopub.execute_input": "2021-08-15T03:43:31.190219Z",
354 | "iopub.status.busy": "2021-08-15T03:43:31.189600Z",
355 | "iopub.status.idle": "2021-08-15T03:43:31.201419Z",
356 | "shell.execute_reply": "2021-08-15T03:43:31.200904Z",
357 | "shell.execute_reply.started": "2021-08-15T03:42:41.777270Z"
358 | },
359 | "papermill": {
360 | "duration": 0.026074,
361 | "end_time": "2021-08-15T03:43:31.201571",
362 | "exception": false,
363 | "start_time": "2021-08-15T03:43:31.175497",
364 | "status": "completed"
365 | },
366 | "tags": []
367 | },
368 | "outputs": [
369 | {
370 | "data": {
371 | "text/html": [
372 | "\n",
373 | "\n",
386 | "
\n",
387 | " \n",
388 | " \n",
389 | " | \n",
390 | " Id | \n",
391 | " SalePrice | \n",
392 | "
\n",
393 | " \n",
394 | " \n",
395 | " \n",
396 | " 0 | \n",
397 | " 1461 | \n",
398 | " 127695.257638 | \n",
399 | "
\n",
400 | " \n",
401 | " 1 | \n",
402 | " 1462 | \n",
403 | " 158944.116706 | \n",
404 | "
\n",
405 | " \n",
406 | " 2 | \n",
407 | " 1463 | \n",
408 | " 177126.990807 | \n",
409 | "
\n",
410 | " \n",
411 | " 3 | \n",
412 | " 1464 | \n",
413 | " 191648.425977 | \n",
414 | "
\n",
415 | " \n",
416 | " 4 | \n",
417 | " 1465 | \n",
418 | " 196206.122025 | \n",
419 | "
\n",
420 | " \n",
421 | "
\n",
422 | "
"
423 | ],
424 | "text/plain": [
425 | " Id SalePrice\n",
426 | "0 1461 127695.257638\n",
427 | "1 1462 158944.116706\n",
428 | "2 1463 177126.990807\n",
429 | "3 1464 191648.425977\n",
430 | "4 1465 196206.122025"
431 | ]
432 | },
433 | "execution_count": 9,
434 | "metadata": {},
435 | "output_type": "execute_result"
436 | }
437 | ],
438 | "source": [
439 | "output.head()"
440 | ]
441 | }
442 | ],
443 | "metadata": {
444 | "kernelspec": {
445 | "display_name": "Python 3",
446 | "language": "python",
447 | "name": "python3"
448 | },
449 | "language_info": {
450 | "codemirror_mode": {
451 | "name": "ipython",
452 | "version": 3
453 | },
454 | "file_extension": ".py",
455 | "mimetype": "text/x-python",
456 | "name": "python",
457 | "nbconvert_exporter": "python",
458 | "pygments_lexer": "ipython3",
459 | "version": "3.8.8"
460 | },
461 | "papermill": {
462 | "default_parameters": {},
463 | "duration": 28.760462,
464 | "end_time": "2021-08-15T03:43:32.700610",
465 | "environment_variables": {},
466 | "exception": null,
467 | "input_path": "__notebook__.ipynb",
468 | "output_path": "__notebook__.ipynb",
469 | "parameters": {},
470 | "start_time": "2021-08-15T03:43:03.940148",
471 | "version": "2.3.3"
472 | }
473 | },
474 | "nbformat": 4,
475 | "nbformat_minor": 5
476 | }
477 |
--------------------------------------------------------------------------------
/Code/A6_Seaborn/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A6_Seaborn/.DS_Store
--------------------------------------------------------------------------------
/Code/A6_Seaborn/data/cereal.csv:
--------------------------------------------------------------------------------
1 | name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
2 | 100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973
3 | 100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679
4 | All-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505
5 | All-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912
6 | Almond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843
7 | Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1,0.75,29.509541
8 | Apple Jacks,K,C,110,2,0,125,1,11,14,30,25,2,1,1,33.174094
9 | Basic 4,G,C,130,3,2,210,2,18,8,100,25,3,1.33,0.75,37.038562
10 | Bran Chex,R,C,90,2,1,200,4,15,6,125,25,1,1,0.67,49.120253
11 | Bran Flakes,P,C,90,3,0,210,5,13,5,190,25,3,1,0.67,53.313813
12 | Cap'n'Crunch,Q,C,120,1,2,220,0,12,12,35,25,2,1,0.75,18.042851
13 | Cheerios,G,C,110,6,2,290,2,17,1,105,25,1,1,1.25,50.764999
14 | Cinnamon Toast Crunch,G,C,120,1,3,210,0,13,9,45,25,2,1,0.75,19.823573
15 | Clusters,G,C,110,3,2,140,2,13,7,105,25,3,1,0.5,40.400208
16 | Cocoa Puffs,G,C,110,1,1,180,0,12,13,55,25,2,1,1,22.736446
17 | Corn Chex,R,C,110,2,0,280,0,22,3,25,25,1,1,1,41.445019
18 | Corn Flakes,K,C,100,2,0,290,1,21,2,35,25,1,1,1,45.863324
19 | Corn Pops,K,C,110,1,0,90,1,13,12,20,25,2,1,1,35.782791
20 | Count Chocula,G,C,110,1,1,180,0,12,13,65,25,2,1,1,22.396513
21 | Cracklin' Oat Bran,K,C,110,3,3,140,4,10,7,160,25,3,1,0.5,40.448772
22 | Cream of Wheat (Quick),N,H,100,3,0,80,1,21,0,-1,0,2,1,1,64.533816
23 | Crispix,K,C,110,2,0,220,1,21,3,30,25,3,1,1,46.895644
24 | Crispy Wheat & Raisins,G,C,100,2,1,140,2,11,10,120,25,3,1,0.75,36.176196
25 | Double Chex,R,C,100,2,0,190,1,18,5,80,25,3,1,0.75,44.330856
26 | Froot Loops,K,C,110,2,1,125,1,11,13,30,25,2,1,1,32.207582
27 | Frosted Flakes,K,C,110,1,0,200,1,14,11,25,25,1,1,0.75,31.435973
28 | Frosted Mini-Wheats,K,C,100,3,0,0,3,14,7,100,25,2,1,0.8,58.345141
29 | Fruit & Fibre Dates; Walnuts; and Oats,P,C,120,3,2,160,5,12,10,200,25,3,1.25,0.67,40.917047
30 | Fruitful Bran,K,C,120,3,0,240,5,14,12,190,25,3,1.33,0.67,41.015492
31 | Fruity Pebbles,P,C,110,1,1,135,0,13,12,25,25,2,1,0.75,28.025765
32 | Golden Crisp,P,C,100,2,0,45,0,11,15,40,25,1,1,0.88,35.252444
33 | Golden Grahams,G,C,110,1,1,280,0,15,9,45,25,2,1,0.75,23.804043
34 | Grape Nuts Flakes,P,C,100,3,1,140,3,15,5,85,25,3,1,0.88,52.076897
35 | Grape-Nuts,P,C,110,3,0,170,3,17,3,90,25,3,1,0.25,53.371007
36 | Great Grains Pecan,P,C,120,3,3,75,3,13,4,100,25,3,1,0.33,45.811716
37 | Honey Graham Ohs,Q,C,120,1,2,220,1,12,11,45,25,2,1,1,21.871292
38 | Honey Nut Cheerios,G,C,110,3,1,250,1.5,11.5,10,90,25,1,1,0.75,31.072217
39 | Honey-comb,P,C,110,1,0,180,0,14,11,35,25,1,1,1.33,28.742414
40 | Just Right Crunchy Nuggets,K,C,110,2,1,170,1,17,6,60,100,3,1,1,36.523683
41 | Just Right Fruit & Nut,K,C,140,3,1,170,2,20,9,95,100,3,1.3,0.75,36.471512
42 | Kix,G,C,110,2,1,260,0,21,3,40,25,2,1,1.5,39.241114
43 | Life,Q,C,100,4,2,150,2,12,6,95,25,2,1,0.67,45.328074
44 | Lucky Charms,G,C,110,2,1,180,0,12,12,55,25,2,1,1,26.734515
45 | Maypo,A,H,100,4,1,0,0,16,3,95,25,2,1,1,54.850917
46 | Muesli Raisins; Dates; & Almonds,R,C,150,4,3,95,3,16,11,170,25,3,1,1,37.136863
47 | Muesli Raisins; Peaches; & Pecans,R,C,150,4,3,150,3,16,11,170,25,3,1,1,34.139765
48 | Mueslix Crispy Blend,K,C,160,3,2,150,3,17,13,160,25,3,1.5,0.67,30.313351
49 | Multi-Grain Cheerios,G,C,100,2,1,220,2,15,6,90,25,1,1,1,40.105965
50 | Nut&Honey Crunch,K,C,120,2,1,190,0,15,9,40,25,2,1,0.67,29.924285
51 | Nutri-Grain Almond-Raisin,K,C,140,3,2,220,3,21,7,130,25,3,1.33,0.67,40.692320
52 | Nutri-grain Wheat,K,C,90,3,0,170,3,18,2,90,25,3,1,1,59.642837
53 | Oatmeal Raisin Crisp,G,C,130,3,2,170,1.5,13.5,10,120,25,3,1.25,0.5,30.450843
54 | Post Nat. Raisin Bran,P,C,120,3,1,200,6,11,14,260,25,3,1.33,0.67,37.840594
55 | Product 19,K,C,100,3,0,320,1,20,3,45,100,3,1,1,41.503540
56 | Puffed Rice,Q,C,50,1,0,0,0,13,0,15,0,3,0.5,1,60.756112
57 | Puffed Wheat,Q,C,50,2,0,0,1,10,0,50,0,3,0.5,1,63.005645
58 | Quaker Oat Squares,Q,C,100,4,1,135,2,14,6,110,25,3,1,0.5,49.511874
59 | Quaker Oatmeal,Q,H,100,5,2,0,2.7,-1,-1,110,0,1,1,0.67,50.828392
60 | Raisin Bran,K,C,120,3,1,210,5,14,12,240,25,2,1.33,0.75,39.259197
61 | Raisin Nut Bran,G,C,100,3,2,140,2.5,10.5,8,140,25,3,1,0.5,39.703400
62 | Raisin Squares,K,C,90,2,0,0,2,15,6,110,25,3,1,0.5,55.333142
63 | Rice Chex,R,C,110,1,0,240,0,23,2,30,25,1,1,1.13,41.998933
64 | Rice Krispies,K,C,110,2,0,290,0,22,3,35,25,1,1,1,40.560159
65 | Shredded Wheat,N,C,80,2,0,0,3,16,0,95,0,1,0.83,1,68.235885
66 | Shredded Wheat 'n'Bran,N,C,90,3,0,0,4,19,0,140,0,1,1,0.67,74.472949
67 | Shredded Wheat spoon size,N,C,90,3,0,0,3,20,0,120,0,1,1,0.67,72.801787
68 | Smacks,K,C,110,2,1,70,1,9,15,40,25,2,1,0.75,31.230054
69 | Special K,K,C,110,6,0,230,1,16,3,55,25,1,1,1,53.131324
70 | Strawberry Fruit Wheats,N,C,90,2,0,15,3,15,5,90,25,2,1,1,59.363993
71 | Total Corn Flakes,G,C,110,2,1,200,0,21,3,35,100,3,1,1,38.839746
72 | Total Raisin Bran,G,C,140,3,1,190,4,15,14,230,100,3,1.5,1,28.592785
73 | Total Whole Grain,G,C,100,3,1,200,3,16,3,110,100,3,1,1,46.658844
74 | Triples,G,C,110,2,1,250,0,21,3,60,25,3,1,0.75,39.106174
75 | Trix,G,C,110,1,1,140,0,13,12,25,25,2,1,1,27.753301
76 | Wheat Chex,R,C,100,3,1,230,3,17,3,115,25,1,1,0.67,49.787445
77 | Wheaties,G,C,100,3,1,200,3,17,3,110,25,1,1,1,51.592193
78 | Wheaties Honey Gold,G,C,110,2,1,200,1,16,8,60,25,1,1,0.75,36.187559
79 |
--------------------------------------------------------------------------------
/Code/P00_Project_Template/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/P00_Project_Template/.DS_Store
--------------------------------------------------------------------------------
/Code/P00_Project_Template/.ipynb_checkpoints/Project_Template_Heart_Disease_Classification-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 5
6 | }
7 |
--------------------------------------------------------------------------------
/Code/P00_Project_Template/data/heart.csv:
--------------------------------------------------------------------------------
1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1
10 | 52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1
12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1
13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1
15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1
16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1
17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1
18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1
19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1
20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1
21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1
22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1
23 | 44,1,2,130,233,0,1,179,1,0.4,2,0,2,1
24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1
25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1
26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1
27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1
28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1
29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1
30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1
31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1
32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1
33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1
34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1
35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1
36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1
37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1
38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1
39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1
40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1
41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1
42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1
43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1
44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1
45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1
46 | 39,1,2,140,321,0,0,182,0,0,2,0,2,1
47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1
48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1
49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1
50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1
51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1
52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1
53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1
54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1
55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1
56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1
57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1
58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1
59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1
60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1
61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1
62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1
63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1
64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1
65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1
66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1
67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1
68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1
69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1
70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1
71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1
72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1
73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1
74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1
75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1
76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1
77 | 55,0,1,135,250,0,0,161,0,1.4,1,0,2,1
78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1
79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1
80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1
81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1
82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1
83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1
84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1
85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1
86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1
87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1
88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1
89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1
90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1
91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1
92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1
93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1
94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1
95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1
96 | 45,0,1,112,160,0,1,138,0,0,1,0,2,1
97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1
98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1
99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1
100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1
101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1
102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1
103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1
104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1
105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1
106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1
107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1
108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1
109 | 45,0,0,138,236,0,0,152,1,0.2,1,0,2,1
110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1
111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1
112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1
113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1
114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1
115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1
116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1
117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1
118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1
119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1
120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1
121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1
122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1
123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1
124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1
125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1
126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1
127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1
128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1
129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1
130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1
131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1
132 | 54,0,2,160,201,0,1,163,0,0,2,1,2,1
133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1
134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1
135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1
136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1
137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1
138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1
139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1
140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1
141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1
142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1
143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1
144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1
145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1
146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1
147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1
148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1
149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1
150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1
151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1
152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1
153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1
154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1
155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1
156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1
157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1
158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1
159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1
160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1
161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1
162 | 56,1,1,120,240,0,1,169,0,0,0,0,2,1
163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1
164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1
165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0
168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0
169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0
170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0
171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0
173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0
174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0
175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0
176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0
177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0
178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0
179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0
180 | 43,1,0,120,177,0,0,120,1,2.5,1,0,3,0
181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0
182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0
183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0
184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0
185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0
186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0
187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0
188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0
189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0
190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0
191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0
192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0
193 | 58,1,0,128,216,0,0,131,1,2.2,1,3,3,0
194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0
195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0
196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0
197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0
198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0
199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0
200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0
201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0
202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0
203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0
205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0
206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0
207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0
208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0
209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0
210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0
211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0
212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0
213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0
214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0
215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0
216 | 56,1,0,125,249,1,0,144,1,1.2,1,1,2,0
217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0
218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0
219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0
220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0
221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0
222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0
223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0
224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0
225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0
226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0
227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0
229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0
230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0
231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0
232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0
233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0
234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0
235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0
236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0
237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0
238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0
239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0
240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0
241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0
242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0
243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0
244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0
245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0
246 | 56,1,0,132,184,0,0,105,1,2.1,1,1,1,0
247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0
248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0
249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0
250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0
251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0
252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0
253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0
254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0
256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0
257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0
258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0
259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0
260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0
261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0
262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0
263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0
264 | 53,1,0,123,282,0,1,95,1,2,1,2,3,0
265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0
266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0
267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0
268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0
269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0
270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0
271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0
272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0
273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0
274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0
275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0
276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0
277 | 52,1,0,125,212,0,1,168,0,1,2,2,3,0
278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0
279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0
280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0
281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0
282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0
283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0
284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0
285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0
286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0
287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0
288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0
289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0
290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0
291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0
292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0
293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0
294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0
296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0
297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0
298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0
299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0
300 | 57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0
305 |
--------------------------------------------------------------------------------
/Code/P01_Pre_Processing/Data.csv:
--------------------------------------------------------------------------------
1 | Country,Age,Salary,Purchased
2 | France,44,72000,No
3 | Spain,27,48000,Yes
4 | Germany,30,54000,No
5 | Spain,38,61000,No
6 | Germany,40,,Yes
7 | France,35,58000,Yes
8 | Spain,,52000,No
9 | France,48,79000,Yes
10 | Germany,50,83000,No
11 | France,37,67000,Yes
--------------------------------------------------------------------------------
/Code/P02_Linear_Regression/Position_Salaries.csv:
--------------------------------------------------------------------------------
1 | Position,Level,Salary
2 | Business Analyst,1,45000
3 | Junior Consultant,2,50000
4 | Senior Consultant,3,60000
5 | Manager,4,80000
6 | Country Manager,5,110000
7 | Region Manager,6,150000
8 | Partner,7,200000
9 | Senior Partner,8,300000
10 | C-level,9,500000
11 | CEO,10,1000000
--------------------------------------------------------------------------------
/Code/Project/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/.DS_Store
--------------------------------------------------------------------------------
/Code/Project/Housing Corporation/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/.DS_Store
--------------------------------------------------------------------------------
/Code/Project/Housing Corporation/.ipynb_checkpoints/Icon:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/.ipynb_checkpoints/Icon
--------------------------------------------------------------------------------
/Code/Project/Housing Corporation/Icon:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/Icon
--------------------------------------------------------------------------------
/Code/Project/Housing Corporation/datasets/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/.DS_Store
--------------------------------------------------------------------------------
/Code/Project/Housing Corporation/datasets/Icon:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/Icon
--------------------------------------------------------------------------------
/Code/Project/Housing Corporation/datasets/housing/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/housing/.DS_Store
--------------------------------------------------------------------------------
/Code/Project/Housing Corporation/datasets/housing/Icon:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/housing/Icon
--------------------------------------------------------------------------------
/Pages/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Pages/.DS_Store
--------------------------------------------------------------------------------
/Pages/A00_Reading_List.md:
--------------------------------------------------------------------------------
1 | # Machine Learning & Data Science Reading List
2 | ## Table of contents
3 | - [AI Application](#ai-application)
4 | - [Machine Learning Interview](#machine-learning-interview)
5 |
6 |
7 | ## AI Application
8 | - [How the biggest companies in the world design Machine Learning-powered applications](https://www.linkedin.com/pulse/how-biggest-companies-world-design-machine-daniel-bourke/?fbclid=IwAR1QdQIaeK72nKjpePNKv6sAZayqCp669lg2EjwfvtBx7v6orN1Kw5QOc5c)
9 |
10 | [(Back to top)](#table-of-contents)
11 | ## Machine Learning Interview
12 | - [Introduction to Machine Learning Interviews Book](https://huyenchip.com/ml-interviews-book/)
13 |
14 | [(Back to top)](#table-of-contents)
15 |
--------------------------------------------------------------------------------
/Pages/A01_Interview_Question.md:
--------------------------------------------------------------------------------
1 | # Data Scientist - Interview Questions
2 |
3 | ## Table of contents
4 | - [1. Interview Questions](#1-interview-questions)
5 | - [2. SQL Questions](#2-sql-questions)
6 |
7 |
8 | ## 1. Interview Questions
9 | ### 1.1. Data Scientist
10 | - **[Facebook] Product Generalist** (i.e. solving a business case study)
11 | - How to design the friends you may know feature -- how to recommend friends;
12 | - How can we tell if two users on Instagram are best friends
13 | - How would you build a model to decide what to show a user and how would you evaluate its success
14 | - How would you create a model to find bad sellers on marketplace?
15 | - **[Facebook] Coding Exercise (in SQL)**: joins (LEFT, RIGHT, UNION), group by, date manipulation
16 | - **[Facebook] Quantitative Analysis**
17 | - How to test out the assumptions; how to decide next steps if the metrics show only positive signals in certain features
18 | - How can you tell if your model is working?
19 | - **[Facebook] Applied Data (stats questions)**: AB Testing
20 |
21 | ### 1.2. Machine Learning Engineer
22 | - **Coding Interview**:
23 | - [Data Structure] Difference Stack vs Queue, Deque Implementation, Linked List Reversal
24 | - [Easy] Reverse a linked list, Convert decimal to hexadecimal without using built-in methods (str, int etc.), pairs of numbers that sum up to K
25 | - [Medium] Verify binary search tree
26 | - [Hard] Min edit distance
27 | - **Technical Interview**:
28 | - Fundamental ML questions:
29 | - Non-deep and deep methods
30 | - Basic ML models or algorithms: Formula for gradient descent (see the quick reference below), Linear and Non-Linear classifiers, K-means, Random Forest, Clustering, Nearest Neighbors, Decision Tree
31 | - Basic DL: Explain how CNN works, Recurrent neural network
32 | - Metric Understanding: ROC
33 | - What is overfitting?
34 | - Difference between Bagging and Boosting
35 | - Regularization: Difference between L1 and L2 regularization (see the quick reference below)
36 | - System Design:
37 | - How to search efficiently
38 | - Given the salaries of people from ten professions and the salary of a new person, design an algorithm to predict the profession of this new person.
39 | - Case Study:
40 | - How would you apply A/B testing to a food ordering service
41 | - How surge pricing works for both customers and drivers
42 | - Implement Huffman code for a given English sentence
43 | - **Interview with Hiring Manager**: explain your Machine learning projects
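44 | 
45 | Quick reference for two of the items above (standard textbook forms, not tied to any particular interview):
46 | - **Gradient descent update rule**: $\theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate and $J(\theta)$ is the cost function.
47 | - **L1 vs L2 regularization**: L1 (Lasso) adds $\lambda \sum_j |\theta_j|$ to the cost and tends to drive some weights to exactly zero; L2 (Ridge) adds $\lambda \sum_j \theta_j^2$ and shrinks weights smoothly without zeroing them.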
44 |
45 | ### 1.3. Grab: Machine Learning Engineer
46 | - General mobility industry and economics oriented questions
47 | - How does surge pricing work for both customers and drivers?
48 |
49 | - Formula for gradient descent
50 | - Supervised and unsupervised ML methods, detailed question about different classification and clustering algos
51 | - What is overfitting and how you deal with it
52 | - How to solve the issue if the features are highly correlated?
53 | - What is a good way to detect anomalies?
54 | - What's the ROC Curve? What does an ROC curve plot?
55 | - What's the difference between bagging and boosting?
56 |
57 | - How do you find the average number of bookings for a given day? What factors do you think will play a crucial role?
58 | - How do you think Grab can implement surge pricing differently from Uber? What factors do you think will play a role here?
59 | ## 2. SQL Questions
60 | #### SQL#1: Facebook
61 | ```SQL
62 | Given the following data:
63 |
64 | Table:
65 | searches
66 | Columns:
67 | date STRING date of the search,
68 | search_id INT the unique identifier of each search,
69 | user_id INT the unique identifier of the searcher,
70 | age_group STRING ('<30', '30-50', '50+'),
71 | search_query STRING the text of the search query
72 |
73 | Sample Rows:
74 | date | search_id | user_id | age_group | search_query
75 | --------------------------------------------------------------------
76 | '2020-01-01' | 101 | 9991 | '<30' | 'justin bieber'
77 | '2020-01-01' | 102 | 9991 | '<30' | 'menlo park'
78 | '2020-01-01' | 103 | 5555 | '30-50' | 'john'
79 | '2020-01-01' | 104 | 1234 | '50+' | 'funny cats'
80 |
81 |
82 | Table:
83 | search_results
84 | Columns:
85 | date STRING date of the search action,
86 | search_id INT the unique identifier of each search,
87 | result_id INT the unique identifier of the result,
88 | result_type STRING (page, event, group, person, post, etc.),
89 | clicked BOOLEAN did the user click on the result?
90 |
91 | Sample Rows:
92 | date | search_id | result_id | result_type | clicked
93 | --------------------------------------------------------------------
94 | '2020-01-01' | 101 | 1001 | 'page' | TRUE
95 | '2020-01-01' | 101 | 1002 | 'event' | FALSE
96 | '2020-01-01' | 101 | 1003 | 'event' | FALSE
97 | '2020-01-01' | 101 | 1004 | 'group' | FALSE
98 |
99 |
100 | Over the last 7 days, how many users made more than 10 searches?
101 |
102 | You notice that the number of users that clicked on a search result
103 | about a Facebook Event increased 10% week-over-week. How would you
104 | investigate? How do you decide if this is a good thing or a bad thing?
105 |
106 | The Events team wants to up-rank Events such that they show up higher
107 | in Search. How would you determine if this is a good idea or not?
108 | ```
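109 | 
110 | A minimal pandas sketch of the first question (pandas being this repo's main tool; in the interview itself you would write SQL with `GROUP BY` / `HAVING`). It assumes a `searches` DataFrame shaped like the sample rows above; the toy values below are made up:
111 | 
112 | ```Python
113 | import pandas as pd
114 | 
115 | # Toy stand-in for the `searches` table above (hypothetical values)
116 | searches = pd.DataFrame({
117 |     "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
118 |     "search_id": [101, 102, 103],
119 |     "user_id": [9991, 9991, 5555],
120 | })
121 | 
122 | # Keep only the searches from the last 7 days (relative to the most recent date)
123 | last_7_days = searches[searches["date"] >= searches["date"].max() - pd.Timedelta(days=6)]
124 | 
125 | # Count searches per user, then count the users with more than 10 searches
126 | searches_per_user = last_7_days.groupby("user_id")["search_id"].count()
127 | print((searches_per_user > 10).sum())
128 | ```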
109 | [(Back to top)](#table-of-contents)
110 |
--------------------------------------------------------------------------------
/Pages/A01_Job_Description.md:
--------------------------------------------------------------------------------
1 | # Data Science Job Description
2 | ## #1 Dell - Junior Analyst, Business Intelligence (Data Science)
3 | - Learn and apply database tools and statistical predictive models by analyzing large datasets with a variety of tools
4 | - Ask and answer questions in large datasets and have a strong desire to create strategies and solutions that challenge and expand the thinking of everyone around you
5 | - Deep dive into data to find answers to yet unknown questions and have a natural desire to go beneath the surface of a problem
6 | - Ask relevant questions and possess the skills to build algorithms necessary to find meaningful answers
7 | - Creatively visualize and effectively communicate data findings and insights in a variety of formats
8 |
9 | ## #2 Facebook - Data Scientist
10 | - Defining new opportunities for product impact; influencing product and sales to solve the most impactful market problems.
11 | - Apply your expertise in quantitative analysis and the presentation of data to see beyond the numbers and understand how our users interact with our growth products.
12 | - Work as a key member of the product team to solve problems and identify trends and opportunities.
13 | - Inform, influence, support, and execute our product decisions and product launches.
14 | - Set KPIs and goals, design and evaluate experiments, monitor key product metrics, understand root causes of changes in metrics.
15 | - Exploratory analysis to discover new opportunities: understanding ecosystems, user behaviours, and long-term trends; identifying levers to help move key metrics.
16 |
17 | # Essential Requirements
18 | - Core statistical knowledge
19 | - JD: #1
20 | - Programming languages (R, Python, SAS)
21 | - JD: #1, #2
22 | - BI, Data Mining, and Machine Learning experience
23 | - JD: #1
24 | - Proficient in SQL
25 | - JD: #1, #2
26 | - Proven experience leading data-driven projects from definition to execution: defining metrics, experiment design, communicating actionable insights.
27 | - JD: #2
28 | - Experience in Spark (Scala / pySpark)
29 | - JD: #3
30 | # Desirable Requirements
31 | - Self driven, able to work independently yet acts as a team player
32 | - Strong communication skills, willingness to learn and develop their ability to apply data science principles through a business lens
33 | - Ability to prioritize, re-prioritize, and handle multiple competing priorities simultaneously
34 |
35 | # Internship
36 | ## SAP - AI Engineer
37 | #### PURPOSE AND OBJECTIVES
38 |
39 | The Artificial Intelligence CoE team in IES Technology Services is pushing the SAP internal adoption of Machine Learning through all business processes. We are implementing standard product functionality from the SAP AI portfolio as well as custom-built AI scenarios for internal business cases. In this field we have a lot of existing projects on the horizon and plenty of room for creativity and freedom to experiment.
40 |
41 | In Singapore, we are looking for AI Engineer Interns to work on the technical stack and also be involved in end-user and business stakeholder communications.
42 |
43 | #### EXPECTATIONS AND TASKS
44 |
45 | - Collaborate with other AI Scientists, Engineers, and Product Owners etc.
46 | - Communicate with stakeholders to understand SAP business processes.
47 | - Support the end-to-end MLOps lifecycle (Python, Jenkins, Docker, Kubernetes, etc)
48 | - Apply existing SAP AI or open-sourced solutions to solve business problems.
49 | - Learn the latest advancements in applied NLP and DevOps.
50 |
51 | #### EDUCATION AND QUALIFICATIONS / SKILLS AND COMPETENCIES
52 |
53 | Required skills
54 | - Programming experience in Python language and packages such as Pandas, scikit-learn etc.
55 | - Understanding of basic Machine Learning algorithms and concepts
56 | - Experience with development tools like VScode, Pycharm, Jupyter, Docker, Git, etc.
57 | - Curiosity to learn about different AI applications.
58 | - Able to take ownership of one’s work and collaborate with others.
59 | Preferred skills
60 | - Pursuing an undergraduate or graduate degree
61 | - Driven and dynamic with strong problem-solving skills
62 | - Good communication skills with end-users and business stakeholders
63 | - Additional certificates and courses focusing on machine learning are a plus but not a must
64 |
65 | ## YARA - Data Science (ML Developer)
66 | There are more than 500 million smallholder farms globally. 2.5 billion people depend on Smallholder Communities for their food and livelihoods. Smallholder regions are characterized by low living standards, high rates of illiteracy and low agricultural productivity. Yara's mission is "Responsibly Feed the World and Protect the Planet". Key to achieving this is enabling thriving Smallholder Communities. At Yara, the Smallholders Digital Team is part of the Crop and Digital Solutions Unit.
67 |
68 | #### About Crop and Digital Solutions
69 | Yara aims to be the crop nutrition company for the future and is leading the development of sustainable agriculture and digital tools to contribute to solving global agricultural challenges. We have a worldwide presence with sales teams in ~150 countries and around 17,000 employees. Yara Farming Solutions will lead the transformation towards more sustainable and efficient food production by innovating our offerings and the way we work. Crop and Digital Solutions is responsible for developing and scaling new “on-farm” digital and integrated tools and solutions for an efficient and transparent food system.
70 |
71 |
72 | #### Responsibilities
73 | - Support the development of the data resources of the analytics and insights side of the smallholder solutions.
74 | - Research new ways to interpret and utilise the data resources to provide meaningful, actionable insights.
75 | - Support cross-functional teams with different parts of the development centre to support the data needs of different teams.
76 | - Support the development, testing and implementation of various scripts, algorithms and code when necessary.
77 | #### Required Profile
78 |
79 | - Strong personal/professional interest in the LSM segments (developing countries, low-income markets, etc) and agriculture in general.
80 | - Exposure to Computer Vision and Data Science concepts preferred.
81 | - Prior experience with at least one of Python development, data analysis or machine learning.
82 |
83 | #### Additional information
84 |
85 | We strive to reflect the diversity in society and encourage all qualified applicants from all backgrounds to apply. We are committed to creating a work environment that supports gender equality and allows combining career progress with the needs of a family or other personal circumstances.
86 |
87 | #### Why us?
88 | - Evolving tech development division of an established agricultural products and services company.
89 | - Explore and develop digital, software, hardware products, which provide value to farmers, smallholder communities and the value chain.
90 | - Be part of our mission to build sustainable solutions that benefit humanity and the environment.
91 | - Full-time, permanent and freelance contract options available with competitive remuneration + benefits.
92 | - Support for personal development, training and continuous learning.
93 | - Commitment to using new technologies and frameworks, meetups, and knowledge sharing.
94 |
95 | ## ByteDance - Big Data Engineer
96 | #### About the Ad Data team
97 | The Ad Data team is committed to empowering our global team's monetization products through acquiring, building, and managing key ads data assets and providing scalable data warehouse, product, and service solutions.
98 | #### Responsibilities
99 |
100 | - Responsible for building offline and real-time data pipelines for various businesses of advertising;
101 | - Process data processing requests and execute data model design, implementation and maintenance;
102 | - Research on business logic and data architecture design to address business value.
103 |
104 | #### Qualifications
105 |
106 | - Undergraduate or Postgraduate currently pursuing a degree/master in software development, computer science, computer engineering, information systems or a related technical discipline;
107 | - Strong interest in computer science and internet technology;
108 | - Know relevant technologies of the Hadoop ecosystem, such as the principles of MapReduce, Spark, Hive, and Flink;
109 | - Familiar with SQL, be able to use SQL to perform data analysis, or proficient in Java, Python and Shell and other programming languages for data processing;
110 | - Good communication, proactive at work, a strong sense of responsibility, and good teamwork.
111 |
112 | ## Grab - Data Analytics & Data Science
113 | Our Data Analytics & Data Science team focuses on gaining a true understanding of our users. We apply analytics in product development, which enables us to build the best app ever! Our business analytics team directs itself by following 3 core philosophies.
114 |
115 | We focus on big data, act as a bridge between the online and offline sides of the business, and use data to align our operations with strategic goals. If you love finding new ways of interpreting data then you’re a perfect fit!
116 |
117 | Requirements:
118 |
119 | - Affinity with resolving complex people and business issues
120 | - Full-time student majoring in Computer Sciences, Engineering, Statistics, Data Science or related fields seeking a matriculated internship
121 | - You are enrolled in a Singapore university, or Singaporean/PR studying abroad
122 | - Previous internship experience in Data Analytics, Computer Science or other related field is an advantage
123 | - Be able to commit full-time from Jan - June 2022, for a minimum of 20 weeks
124 | - Agile and able to work in a fast-paced environment
125 | - Excellent communication, presentation and project management skills
126 |
--------------------------------------------------------------------------------
/Pages/A02_Pandas_Cheat_Sheet.md:
--------------------------------------------------------------------------------
1 |
2 | # Pandas Cheat Sheet
3 | # Table of contents
4 | - [Table of contents](#table-of-contents)
5 | - [Import & Export Data](#import-export-data)
6 | - [Getting and knowing](#getting-and-knowing)
7 | - [loc vs iloc](#loc-vs-iloc)
8 | - [Access Rows of Data Frame](#access-rows-of-data-frame)
9 | - [Access Columns of Data Frame](#access-columns-of-data-frame)
10 | - [Manipulating Data](#manipulating-data)
11 | - [Grouping](#grouping)
12 | - [Basic Grouping](#basic-grouping)
13 |
14 |
15 | # Import & Export Data
16 | ### Import with Different Separator
17 | ```Python
18 | users = pd.read_csv('user.csv', sep='|')
19 | chipo = pd.read_csv(url, sep = "\t")
20 | ```
21 |
22 |
23 | #### Renaming Index
24 | ```Python
25 | users = pd.read_csv('u.user', sep='|', index_col='user_id')
26 | ```
27 | ### Export
28 | ```Python
29 | users.to_csv("exported-users.csv")
30 | ```
31 |
32 | # Getting and knowing
33 | ### shape : Return (Row, Column)
34 | ```Python
35 | df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
36 | 'col3': [5, 6]})
37 | df.shape
38 | (2, 3) # df.shape[0] = 2 row, df.shape[1] = 3 col
39 | ```
40 | ### info() : Return index dtype, columns, non-null values & memory usage.
41 | ```Python
42 | df.info()
43 | ```
44 | - This shows the dtype of each column and how many non-null values the DataFrame contains
45 | ```Python
46 |
47 | RangeIndex: 4622 entries, 0 to 4621
48 | Data columns (total 5 columns):
49 | # Column Non-Null Count Dtype
50 | --- ------ -------------- -----
51 | 0 order_id 4622 non-null int64
52 | 1 quantity 4622 non-null int64
53 | 2 item_name 4622 non-null object
54 | 3 choice_description 3376 non-null object
55 | 4 item_price 4622 non-null object
56 | dtypes: int64(2), object(3)
57 | memory usage: 180.7+ KB
58 | ```
59 |
60 | ### describe() : Generate descriptive statistics.
61 | ```Python
62 | chipo.describe() #Notice: by default, only the numeric columns are returned.
63 | chipo.describe(include = "all") #Notice: include = "all" returns all columns, including the non-numeric ones.
64 | ```
65 |
66 |
67 | ### dtype : Return data type of specific column
68 | - `df.col_name.dtype` return the data type of that column
69 | ```Python
70 | df.item_price.dtype
71 | #'O' (Python) objects
72 | ```
73 |
74 | - Please note: dtype will return below special character
75 | ```Python
76 | 'b' boolean
77 | 'i' (signed) integer
78 | 'u' unsigned integer
79 | 'f' floating-point
80 | 'c' complex-floating point
81 | 'O' (Python) objects
82 | 'S', 'a' (byte-)string
83 | 'U' Unicode
84 | 'V' raw data (void
85 | ```
86 |
87 | ## loc vs iloc
88 | ### loc
89 | - `loc`: is **label-based**, which means that we have to specify the names of the rows and columns that we want to select.
90 | #### Find all the rows based on 1 or more conditions in a column
91 | ```Python
92 | # select all rows with a condition
93 | data.loc[data.age >= 15]
94 | # select all rows with multiple conditions
95 | data.loc[(data.age >= 12) & (data.gender == 'M')]
96 | ```
97 | 
98 |
99 | #### Select only required columns with conditions
100 | ```Python
101 | # Update the values of multiple columns on selected rows
102 | chipo.loc[(chipo.quantity == 7) & (chipo.item_name == 'Bottled Water'), ['item_name', 'item_price']] = ['Tra Xanh', 0]
103 | # Select only required columns with a condition
104 | chipo.loc[(chipo.quantity > 5), ['item_name', 'quantity', 'item_price']]
105 | ```
106 |
107 |
108 | ### iloc
109 | - `iloc` is **integer position-based**: we specify the rows and columns by their integer positions.
110 | - `.iloc[]` allowed inputs are:
111 | #### Selecting Rows
112 | - An integer, e.g. `dataset.iloc[0]` > return row 0 in `Series` format
113 | ```Python
114 | Country France
115 | Age 44
116 | Salary 72000
117 | Purchased No
118 | ```
119 | - A list or array of integers, e.g.`dataset.iloc[[0]]` > return row 0 in DataFrame format
120 | ```Python
121 | Country Age Salary Purchased
122 | 0 France 44.0 72000.0 No
123 | ```
124 | - A slice object with ints, e.g. `dataset.iloc[:3]` > return rows 0 up to (but not including) row 3 in DataFrame format
125 | ```Python
126 | Country Age Salary Purchased
127 | 0 France 44.0 72000.0 No
128 | 1 Spain 27.0 48000.0 Yes
129 | 2 Germany 30.0 54000.0 No
130 | ```
131 | #### Selecting Rows & Columns
132 | - Select the first 3 rows & all columns except the last one: `X = dataset.iloc[:3, :-1]`
133 | ```Python
134 | Country Age Salary
135 | 0 France 44.0 72000.0
136 | 1 Spain 27.0 48000.0
137 | 2 Germany 30.0 54000.0
138 | ```
139 | ### Numpy representation of DF
140 | - `DataFrame.values`: Return a Numpy representation of the DataFrame (i.e: Only the values in the DataFrame will be returned, the axes labels will be removed)
141 | - For ex: `X = dataset.iloc[:3, :-1].values`
142 | ```Python
143 | [['France' 44.0 72000.0]
144 | ['Spain' 27.0 48000.0]
145 | ['Germany' 30.0 54000.0]]
146 | ```
147 |
148 | ## Access Rows of Data Frame
149 | ### Check index of DF
150 | ```Python
151 | df.index
152 | #RangeIndex(start=0, stop=4622, step=1)
153 | ```
154 |
155 | [(Back to top)](#table-of-contents)
156 |
157 | ## Access Columns of Data Frame
158 | ### Print the name of all the columns
159 | ```Python
160 | list(df.columns)
161 | #['order_id', 'quantity', 'item_name', 'choice_description','item_price', 'revenue']
162 | ```
163 | ### Access column
164 | ```Python
165 | # Counting how many values in the column
166 | df.col_name.count()
167 | # Take the mean of values in the column
168 | df["col_name"].mean()
169 | ```
170 | ### value_counts() : Return a Series containing counts of unique values
171 | ```Python
172 | index = pd.Index([3, 1, 2, 3, 4, np.nan])
173 | #dropna=False will also consider NaN as a unique value
174 | index.value_counts(dropna=False)
175 | #Return:
176 | 3.0 2
177 | 2.0 1
178 | NaN 1
179 | 4.0 1
180 | 1.0 1
181 | dtype: int64
182 | ```
183 | ### Calculate total unique values in a column
184 | ```Python
185 | #How many unique values
186 | index.value_counts().count()
187 |
188 | index.nunique()
189 | #4 - NaN is excluded by default; use nunique(dropna=False) to count it as well (→ 5)
190 | ```
191 |
192 | [(Back to top)](#table-of-contents)
193 | # Manipulating Data
194 | ## Missing Values
195 | ### Filling Missing Values with fillna()
196 | - To fill `nan` values with a given value (here, the column mean)
197 | ```Python
198 | car_sales_missing["Odometer"].fillna(car_sales_missing["Odometer"].mean(), inplace = True)
199 | ```
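The `inplace=True` call above mutates the DataFrame directly. An equivalent assignment-based sketch (same assumed `car_sales_missing` DataFrame) that avoids in-place mutation:
```Python
# Fill NaNs in "Odometer" with the column mean, then assign the result back
car_sales_missing["Odometer"] = car_sales_missing["Odometer"].fillna(
    car_sales_missing["Odometer"].mean())
```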
200 | ### Dropping Missing Values with dropna()
201 | - To drop columns containing Missing Values
202 | ```Python
203 | car_sales_missing.dropna(inplace=True)
204 | ```
205 | ## Drop a column
206 |
207 | ```Python
208 | car_sales.drop("Passed road safety", axis = 1) # axis = 1 if you want to drop a column
209 | ```
210 | [(Back to top)](#table-of-contents)
211 | # Grouping
212 |
213 |
214 | ## Basic Grouping
215 | - Group by the "item_name" column & take the sum of the "quantity" column
216 | - Method #1 : `df.groupby("item_name")`
217 |
218 | ```Python
219 | df.groupby("item_name")["quantity"].sum()
220 | ```
221 |
222 | ```Python
223 | item_name
224 | Chicken Bowl 761
225 | Chicken Burrito 591
226 | Name: quantity, dtype: int64
227 | ```
228 |
229 | - Method #2: `df.groupby(by=['order_id'])`
230 |
231 | ```Python
232 | order_revenue = df.groupby(by=["order_id"])["revenue"].sum()
233 | ```
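To aggregate several columns in one pass, `groupby` can be combined with `.agg()` — a sketch assuming the same `df` and a numeric `item_price` column:
```Python
# Sum the quantity and average the price per item in a single groupby
df.groupby("item_name").agg({"quantity": "sum", "item_price": "mean"})
```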
234 | [(Back to top)](#table-of-contents)
235 |
236 |
237 |
238 |
239 |
--------------------------------------------------------------------------------
/Pages/A03_Numpy_Cheat_Sheet.md:
--------------------------------------------------------------------------------
1 | # Numpy Cheat Sheet
2 | # Table of contents
3 | - [Table of contents](#table-of-contents)
4 | - [Introduction to Numpy](#introduction-to-numpy)
5 | - [Numpy Data Types and Attributes](#numpy-data-types-and-attributes)
6 |
7 | # Introduction to Numpy
8 | ### Why is Numpy important?
9 | - How many decimal numbers can we store with `n` bits?
10 |     - Each bit is a position storing a 0 or 1, so `n` bits give `2^n` distinct values.
11 |     - For example: 3 bits can represent 2^3 = 8 decimal numbers.
12 | - Numpy allows you to specify more precisely how much memory you need for storing the data
13 | ```Python
14 | #Python costs 28 bytes to store x = 5 since it is Integer Object
15 | import sys
16 | x = 5
17 | sys.getsizeof(x) #return 28 - means variable x = 5 costs 28 bytes of memory
18 |
19 | #Numpy : allow you to specify more precisely number of bits (memory) you need for storing the data
20 | np.int8 #8-bit
21 | ```
22 | - Numpy is built for **Array Processing**
23 | - The built-in Python `List` is NOT optimized for high-level processing: a list is an object holding pointers, so its elements are not stored next to each other in memory
24 | - In contrast, Numpy stores `Array Elements` in **Contiguous Positions** in memory
25 |
26 | ### Numpy is more efficient for storing and manipulating data
27 |
28 |
29 |
30 | - `Numpy array` : essentially contains a single pointer to one contiguous block of data
31 | - `Python list` : contains a pointer to a block of pointers, each of which in turn points to a full Python object
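A rough benchmark sketch of this difference using the standard `timeit` module (exact timings are machine-dependent):
```Python
import timeit
import numpy as np

py_list = list(range(1_000_000))
np_arr = np.arange(1_000_000)

# Summing a list walks a block of pointers to separate int objects
print(timeit.timeit(lambda: sum(py_list), number=10))
# Summing an ndarray scans one contiguous block of raw numbers
print(timeit.timeit(lambda: np_arr.sum(), number=10))
```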
32 |
33 | # Numpy Data Types and Attributes
34 | - Main Numpy Data Type is `ndarray`
35 | - Attributes: `shape, ndim, size, dtype`
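A minimal sketch of these attributes (the array values are illustrative):
```Python
import numpy as np

a2 = np.array([[1, 2, 3],
               [4, 5, 6]])

print(a2.shape)  # (2, 3) - 2 rows, 3 columns
print(a2.ndim)   # 2 - number of dimensions
print(a2.size)   # 6 - total number of elements
print(a2.dtype)  # int64 - element data type (platform-dependent)
```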
36 |
--------------------------------------------------------------------------------
/Pages/A04_Conda_CLI.md:
--------------------------------------------------------------------------------
1 |
2 | | Usage | Command | Description |
3 | | -------------------| ---------------------------------- | ----------------|
4 | | Create |`conda create --prefix ./env pandas numpy matplotlib scikit-learn jupyter`| Create Conda env & install packages |
5 | | List Env | `conda env list` | List all envs; the active one is marked with `*` |
6 | | Activate | `conda activate ./env` | Activate Conda virtual env |
7 | | Install package | `conda install jupyter` | |
8 | | Update package | `conda update scikit-learn=0.22` | Can specify the version also |
9 | | List Installed Package | `conda list`||
10 | | Un-install Package | `conda uninstall python scikit-learn`| To uninstall packages to re-install with the Latest version|
11 | | Open Jupyter Notebook | `jupyter notebook`||
12 |
13 |
14 |
15 | ## Sharing Conda Environment
16 | - Share a `.yml` (pronounced YAM-L) file of your Conda environment
17 | - `.yml` is basically a text file with instructions to tell Conda how to set up an environment.
18 | - Step 1: Export `.yml` file:
19 | - `conda env export --prefix {Path to env folder} > environment.yml`
20 | - Step 2: New PC, create an environment called `env_from_file` from `environment.yml`:
21 | - `conda env create --file environment.yml --name env_from_file`
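For reference, a minimal `environment.yml` might look like the following (the name, channel and package list are illustrative, not necessarily what `conda env export` produces):
```
name: env_from_file
channels:
  - defaults
dependencies:
  - python=3.8
  - pandas
  - numpy
  - matplotlib
  - scikit-learn
  - jupyter
```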
22 |
23 | ## Jupyter Notebook
24 | | Usage | Command | Description |
25 | | -------------------| ---------------------------------- | ----------------|
26 | | Run Cell |`Shift + Enter`| Run the current cell and select the cell below |
27 | | Switch to Markdown | Exit Edit Mode `ESC` > press `m` | |
28 | | Show Function Description | `Shift + Tab` | |
29 | | Install a conda package into the current env from inside a notebook | `import sys` then `!conda install --yes --prefix {sys.prefix} seaborn` | |
30 |
31 | ### Jupyter Magic Function
32 | | Function | Command | Description |
33 | | -------------------| ---------------------------------- | ----------------|
34 | | Matplotlib | `%matplotlib inline` | will make your plot outputs appear and be stored within the notebook. |
35 |
--------------------------------------------------------------------------------
/Pages/A05_Matplotlib.md:
--------------------------------------------------------------------------------
1 |
2 | # Matplotlib Cheat Sheet
3 | # Table of contents
4 | - [Table of contents](#table-of-contents)
5 | - [Introduction to Matplotlib](#introduction-to-matplotlib)
6 | - [Plotting from an IPython notebook](#plotting-from-an-ipython-notebook)
7 | - [Matplotlib Two Interfaces: MATLAB-style & Object-Oriented Interfaces](#matplotlib-two-interfaces)
8 | - [Matplotlib Workflow](#matplotlib-workflow)
9 | - [Subplots](#subplots)
10 | - [Scatter, Bar & Histogram Plot](#scatter-bar-and-histogram-plot)
11 |
12 | # Introduction to Matplotlib
13 | - Matplotlib is a multi-platform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack
14 | - Newer tools like `ggplot` and `ggvis` in the R language, along with web visualization toolkits based on `D3.js` and `HTML5 canvas`, often make Matplotlib feel clunky and old-fashioned
15 | - Hence, nowadays, cleaner and more modern APIs, for example `Seaborn`, `ggpy`, `HoloViews` and `Altair`, have been developed to drive Matplotlib
16 | ```Python
17 | import matplotlib.pyplot as plt
18 | ```
19 | - The `plt` interface is what we will use most often
20 | #### Setting Styles
21 | ```Python
22 | # See the different styles avail
23 | plt.style.available
24 | # Set Style
25 | plt.style.use('seaborn-whitegrid')
26 | ```
27 | ## Plotting from an IPython notebook
28 | - `%matplotlib notebook` will lead to **interactive** plots embedded within the notebook
29 | - `%matplotlib inline` will lead to **static images** of your plot embedded in the notebook
30 |
31 |
32 |
33 | - A `Figure` can contain multiple subplots
34 | - `Axes 0` and `Axes 1` are `AxesSubplot` objects stacked together
35 | ## Matplotlib Two Interfaces
36 | ### Pyplot API vs Object-Oriented API
37 | * Quickly → use Pyplot Method
38 | * Advanced → use Object-Oriented Method
39 | - In general, try to use the `object-oriented interface` (more flexible) over the `pyplot` interface (i.e: `plt.plot()`)
40 |
41 | ```Python
42 | x = [1,2,3,4]
43 | y = [11,22,33,44]
44 | ```
45 | - **MATLAB-style or PyPlot API**: Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact
46 | ```Python
47 | # Pyplot API
48 | plt.plot(x,y, color='blue')
49 |
50 | plt.title("A Sine Curve") #in OO, use the ax.set() method to set all these properties at once
51 | plt.xlabel("x")
52 | plt.ylabel("sin(x)")
53 | plt.xlim([1,3])
54 | plt.ylim(bottom=20)
55 | ```
56 | - **Object-oriented**: plotting functions are methods of explicit `Figure` and `Axes` objects.
57 | ```Python
58 | # [Recommended] Object-oriented interface
59 | fig, ax = plt.subplots() #create figure + set of subplots, by default, nrow =1, ncol=1
60 | ax.plot(x,y) #add some data
61 | plt.show()
62 | ```
63 | ##### Matplotlib Gotchas
64 |
65 | While most `plt` functions translate directly to `ax` methods (such as plt.plot() → ax.plot(), plt.legend() → ax.legend(), etc.), this is not the case for all commands. In particular, functions to set limits, labels, and titles are slightly modified. For transitioning between MATLAB-style functions and object-oriented methods, make the following changes:
66 | - `plt.xlabel()` → `ax.set_xlabel()`
67 | - `plt.ylabel()` → `ax.set_ylabel()`
68 | - `plt.xlim()` → `ax.set_xlim()`
69 | - `plt.ylim()` → `ax.set_ylim()`
70 | - `plt.title()` → `ax.set_title()`
71 | In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the `ax.set()` method:
72 |
73 | ```Python
74 | ax.set(xlim=(0, 10), ylim=(-2, 2),
75 | xlabel='x', ylabel='sin(x)',
76 | title='A Simple Plot');
77 | ```
78 |
79 | ## Matplotlib Workflow
80 | ```Python
81 | # 0. Import and get matplotlib ready
82 | %matplotlib inline
83 | import matplotlib.pyplot as plt
84 |
85 | # 1. Prepare data
86 | x = [1, 2, 3, 4]
87 | y = [11, 22, 33, 44]
88 |
89 | # 2. Setup plot
90 | fig, ax = plt.subplots(figsize=(5,5)) #Figure size = Width & Height of the Plot
91 |
92 | # 3. Plot data
93 | ax.plot(x, y)
94 |
95 | # 4. Customize plot
96 | ax.set(title="Sample Simple Plot",
97 | xlabel="x-axis",
98 | ylabel="y-axis",
99 | xlim=(0, 10), ylim=(-2, 2))
100 |
101 | # 5. Save & Show
102 | fig.savefig("../images/simple-plot.png")
103 | ```
104 |
105 | [(Back to top)](#table-of-contents)
106 |
107 | # Subplots
108 | - Option #1: plot multiple subplots in the same figure by unpacking each subplot into its own variable (an index-based Option #2 sketch follows the code below)
109 | ```Python
110 | # Option 1: Create multiple subplots
111 | fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2,
112 | ncols=2,
113 | figsize=(10, 5))
114 | # Plot data to each axis
115 | ax1.plot(x, x/2);
116 | ax2.scatter(np.random.random(10), np.random.random(10));
117 | ax3.bar(nut_butter_prices.keys(), nut_butter_prices.values());
118 | ax4.hist(np.random.randn(1000));
119 | ```
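Option #2 keeps the axes in an array and indexes into it — a sketch with the same assumed `x`, `nut_butter_prices` and NumPy import as Option #1:
```Python
# Option 2: Create multiple subplots and index into the returned axes array
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 5))

# Plot data to each axis via its (row, column) index
ax[0, 0].plot(x, x/2);
ax[0, 1].scatter(np.random.random(10), np.random.random(10));
ax[1, 0].bar(nut_butter_prices.keys(), nut_butter_prices.values());
ax[1, 1].hist(np.random.randn(1000));
```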
120 |
121 |
122 | [(Back to top)](#table-of-contents)
123 |
124 | # Scatter Bar and Histogram Plot
125 | ## Scatter
126 | ```Python
127 | #<--- Method 1: Pyplot --->:
128 | df.plot(kind = 'scatter',
129 | x = 'age',
130 | y = 'chol',
131 |         c = 'target', #c = color each dot based on the 'target' column
132 | figsize=(10,6));
133 | ```
134 | ```Python
135 | #<--- Method 2: OO --->:
136 | ## OO Method from Scratch
137 | fig, ax = plt.subplots(figsize=(10,6))
138 |
139 | ## Plot the data
140 | scatter = ax.scatter(x=over_50["age"],
141 | y=over_50["chol"],
142 | c=over_50["target"]);
143 | # Customize the plot
144 | ax.set(title="Heart Disease and Cholesterol Levels",
145 | xlabel="Age",
146 | ylabel="Cholesterol");
147 | # Add a legend
148 | ax.legend(*scatter.legend_elements(), title="target"); # * unpacks the (handles, labels) pair returned by legend_elements()
149 |
150 | #Add a horizontal line
151 | ax.axhline(over_50["chol"].mean(), linestyle = "--");
152 | ```
153 |
154 |
155 |
156 | ## Bar
157 | * Vertical
158 | * Horizontal
159 | ```Python
160 | #<--- Method 1: Pyplot --->:
161 | df.plot.bar();
162 | ```
163 | ```Python
164 | #<--- Method 2: OO --->:
165 | fig, ax = plt.subplots()
166 | ax.bar(x, y)
167 | ax.set(title="Dan's Nut Butter Store", ylabel="Price ($)");
168 | ```
169 | ## Histogram
170 | ```Python
171 | # Create Histogram of Age to see the distribution of age
172 |
173 | heart_disease["age"].plot.hist(bins=10);
174 | ```
175 | [(Back to top)](#table-of-contents)
176 |
--------------------------------------------------------------------------------
/Pages/A05_Statistics.md:
--------------------------------------------------------------------------------
1 | # Statistics
2 | # Table of contents
3 | - [Standard Deviation & Variance](#standard-deviation-and-variance)
4 |
5 |
6 |
7 | # Standard Deviation and Variance
8 |
9 | ## Standard Deviation
10 | - Standard Deviation is a measure of how spread out numbers are.
11 | ```Python
12 | # Standard deviation = a measure of how spread out a group of numbers is from the mean
13 | np.std(a2)
14 | # Standard deviation = Square Root of Variance
15 | np.sqrt(np.var(a2))
16 | ```
17 |
18 | ## Variance
19 | - The average of the squared differences from the Mean.
20 | ```Python
21 | # Variance = measure of the average degree to which each number differs from the mean
22 | # Higher variance = wider range of numbers
23 | # Lower variance = lower range of numbers
24 | np.var(a2)
25 | ```
26 |
27 | ### Example:
28 | 
29 | - The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm
30 | - Mean = (600 + 470 + 170 + 430 + 300)/5 = 394mm
31 |
32 | 
33 | - `Variance` = 21704
34 | - `Standard Deviation` = sqrt(variance) = 147 mm
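A quick sanity check of these numbers with NumPy (a sketch; `heights` holds the five shoulder heights above):
```Python
import numpy as np

heights = np.array([600, 470, 170, 430, 300])

print(heights.mean())   # 394.0
print(np.var(heights))  # 21704.0 (population variance)
print(np.std(heights))  # ~147.32, i.e. sqrt(21704)
```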
35 |
36 | 
37 |
38 | - We can show which heights are within one Standard Deviation (147mm) of the Mean:
39 |     - **With the Standard Deviation we have a "standard" way of knowing what is normal**, and what is extra large or extra small
40 |
41 | 
42 | - Credit: [Math is Fun](https://www.mathsisfun.com/data/standard-deviation.html)
43 |
--------------------------------------------------------------------------------
/Pages/A8_Daily_Lessons.md:
--------------------------------------------------------------------------------
1 | # Daily Lessons
2 |
3 | # Day 1:
4 | - **Math**: [Geometric Sequences](https://www.mathsisfun.com/algebra/sequences-sums-geometric.html)
5 | - **Python**: List (`enumerate`), Dict (`keys()`, `values()`,`items()`)
6 | - **LeetCode**: [50. Pow(x, n)](https://leetcode.com/problems/powx-n/): using recursion with `Pow(x,n) = Pow(x,n//2)`; remember to handle the n = odd and n = even cases (see the sketch below)
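A minimal recursive sketch of that idea (hypothetical helper; also handles negative `n`):
```Python
def my_pow(x: float, n: int) -> float:
    # Pow(x, n) = Pow(x, n//2) squared, times an extra x when n is odd
    if n < 0:
        return 1 / my_pow(x, -n)
    if n == 0:
        return 1.0
    half = my_pow(x, n // 2)
    return half * half * (x if n % 2 else 1)

print(my_pow(2.0, 10))  # 1024.0
```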
7 | # Day 2:
8 | - **Math**: [Modular Arithmetic](https://brilliant.org/wiki/modular-arithmetic/)
9 | - **Congruence** `a ≡ b (mod n)` For a positive integer n, the integers *a and b are congruent mod n* if their remainders when divided by n are the same.
10 | - For example: 52≡24(mod7): 52 and 24 are congruent (mod 7) because (52 mod 7) = 3 and (24 mod 7) = 3.
11 | - **Properties of multiplication** in Modular Arithmetic:
12 |     - `(a mod n) mod n = a mod n`: This is obvious because a mod n ∈ [0, n−1], and so the second mod cannot have an effect.
13 | - `(A^2) mod C = (A * A) mod C = ((A mod C) * (A mod C)) mod C`
14 | - **LeetCode**: [Fast Exponentiation](https://youtu.be/-3Lt-EwR_Hw)
15 | # Day 3:
16 | - **Python**:
17 | - Nested List Comprehension `[[item if not item.isspace() else -1 for item in row] for row in board]` to build 2D matrix
18 | - String Formatting with Padding 0: For example, convert integer 2 to "02" `f"{month:02d}"`
19 | - Math's Ceil & Floor: `math.ceil()`, `math.floor()`
20 | - **Math**:
21 |     - `Modular Multiplicative Inverse (MMI)`: **MMI(a, n) = x** s.t. `a*x ≡ 1 (mod n)`
22 |         - For example: a = 3, n = 11 => x = 4, as (3*4) mod 11 = 1
23 | - `Euclidean Algorithm` to find GCD of A & B & `Extended Euclidean Algorithm` to find **MMI(A, B)**
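A sketch of the Extended Euclidean Algorithm and the MMI built on top of it (hypothetical helpers):
```Python
def extended_gcd(a: int, b: int):
    # Returns (g, x, y) such that a*x + b*y = g = gcd(a, b)
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def mmi(a: int, n: int) -> int:
    # Modular multiplicative inverse: the x with a*x ≡ 1 (mod n)
    g, x, _ = extended_gcd(a, n)
    if g != 1:
        raise ValueError("inverse does not exist")
    return x % n

print(mmi(3, 11))  # 4, since (3*4) mod 11 = 1
```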
24 | # Day 4:
25 | - **LeetCode**: `Best Time to Buy and Sell Stock` (keep track of the buying price, compare to the next days), `Climbing Stairs` (at T(n): first step = 1, remaining steps = T(n-1); or first step = 2, remaining steps = T(n-2). This recurrence relation is similar to the Fibonacci numbers)
26 |
27 | # Day 5:
28 | - **LeetCode**: `3 Sum`, `Longest Palindromic Substring` and `Container With Most Water`
29 | # Day 6:
30 | - **LeetCode**: `Number of Islands`, `Design Circular Queue`
31 | # Day 7:
32 | - **Data Science**: Understand about confusion matrix of classifier, Precision & Recall, F1
33 |
--------------------------------------------------------------------------------
/Pages/P00_Introduction.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 | # Table of contents
3 | - [Table of contents](#table-of-contents)
4 | - [Why Learn Machine Learning?](#why-learn-machine-learning)
5 | - [Terms](#terms)
6 | - [AI](#ai)
7 | - [Machine Learning](#machine-learning)
8 | - [Deep Learning](#deep-learning)
9 | - [Data Science](#data-science)
10 | - [Machine Learning Framework](#machine-learning-framework)
11 | - [Main Types of ML Problems](#main-types-of-ml-problems)
12 | - [Evaluation](#evaluation)
13 | - [Features](#features)
14 | - [Modelling](#modelling)
15 | - [Splitting Data](#splitting-data)
16 | - [Modelling](#modelling)
17 | - [Tuning](#tuning)
18 | - [Comparison](#comparison)
19 |
20 | # Why Learn Machine Learning?
21 |
22 |
23 | - **Spread Sheets (Excel, CSV)**: store the data a business needs → humans can analyse the data to make business decisions
24 | - **Relational DB (MySQL)**: a better way to organize things → humans can analyse the data to make business decisions
25 | - **Big Data (NoSQL)**: FB, Amazon, Twitter accumulating more and more data like "user actions, user purchasing history", where you can store unstructured data → need Machine Learning instead of humans to make business decisions
26 |
27 | [(Back to top)](#table-of-contents)
28 |
29 | # Terms
30 | ## AI
31 | ## Machine Learning
32 |
33 |
34 |
35 | - [A subset of AI](https://teachablemachine.withgoogle.com/): ML uses algorithms (computer programs) to learn patterns in data, then uses what it has learned to make predictions or classifications on similar data.
36 |     - ML suits the things that are hard to describe explicitly for a computer to perform, like:
37 | - How to ask Computers to classify Cat/Dog images, or Product Reviews
38 |
39 | ### Difference between ML and Normal Algorithms
40 | - Normal Algorithm: a set of instructions on how to accomplish a task: start with `given input + set of instructions` → output
41 | - ML Algorithm : start with `given input + given output` → set of instructions between I/P and O/P
42 |
43 |
44 | ### Types of ML Problems
45 |
46 |
47 |
48 | - **Supervised**: Data with Label
49 | - **Unsupervised**: Data without Label like CSV without Column Names
50 |     - *Clustering*: Machine decides clusters/groups
51 | - *Association Rule Learning*: Associate different things to predict what customers might buy in the future
52 | - **Reinforcement**: teach Machine to try and error (with reward and penalty)
53 |
54 | ## Deep Learning
55 | ## Data Science
56 | - `Data Analysis`: analyse data to gain understanding of your data
57 | - `Data Science` : running experiments on a set of data to figure out actionable insights within it
58 | - Example: to build ML Models
59 |
60 | [(Back to top)](#table-of-contents)
61 |
62 | # Machine Learning Framework
63 | 
64 |
65 | - Readings: [ (1) ](https://www.mrdbourke.com/a-6-step-field-guide-for-building-machine-learning-projects/), [ (2) ](https://whimsical.com/6-step-field-guide-to-machine-learning-projects-flowcharts-9g65jgoRYTxMXxDosndYTB)
66 | ### Step 1: Problem Definition - Rephrase business problem as a machine learning problem
67 | - What problem are we trying to solve ?
68 | - Supervised
69 | - Un-supervised
70 | - Classification
71 | - Regression
72 | ### Step 2: Data
73 | - What kind of Data we have ?
74 | ### Step 3: Evaluation
75 | - What defines success for us? Knowing which metrics you should be paying attention to gives you an idea of how to evaluate your machine learning project.
76 | ### Step 4: Features
77 | - What features does your data have, and which can you use to build your model? Turning features → patterns
78 | - **Three main types of features**:
79 | - `Categorical` features — One or the other(s)
80 |         - For example, in our heart disease problem, the sex of the patient. Or for an online store, whether or not someone has made a purchase.
81 | - `Continuous (or numerical)` features: A numerical value such as average heart rate or the number of times logged in.
82 | - `Derived` features — Features you create from the data. Often referred to as feature engineering.
83 | - `Feature engineering` is how a subject matter expert takes their knowledge and encodes it into the data. You might combine the number of times logged in with timestamps to make a feature called time since last login. Or turn dates from numbers into “is a weekday (yes)” and “is a weekday (no)”.
84 | ### Step 5: Models
85 | - Figure out right models for your problems
86 | ### Step 6: Experimentation
87 | - How can we improve, or what can we do better?
88 |
89 | ## Main Types of ML Problems
90 | 
91 | ### Supervised Learning:
92 | - (Input & Output) Data + Label → Classifications, Regressions
93 | ### Un-Supervised Learning:
94 | - (Only Input) Data → Clustering
95 | ### Transfer Learning:
96 | - (My problem similar to others) Leverage from Other ML Models
97 | ### Reinforcement Learning:
98 | - Punishing & rewarding the ML model by updating its score
99 |
100 | ## Evaluation
101 |
102 | | Classification | Regression | Recommendation |
103 | | -------------------| ---------------------------------- | ----------------|
104 | | Accuracy | Mean Absolute Error (MAE) | Precision at K |
105 | | Precision | Mean Squared Error (MSE) | |
106 | | Recall | Root Mean Squared Error (RMSE) | |
107 |
108 | [(Back to top)](#table-of-contents)
109 |
110 | ## Features
111 | - Numerical Features
112 | - Categorical Features
113 |
114 |
115 |
116 | [(Back to top)](#table-of-contents)
117 |
118 | ## Modelling
119 | ### Splitting Data
120 |
121 |
122 |
123 | - 3 sets: Training, Validation (model hyperparameter tuning and experimentation evaluation) & Test Sets (model testing and comparison)
124 |
125 | ### Modelling
126 | - Choose models that work for your problem → train the model
127 | - Goal: Minimise time between experiments
128 | - Start small and add complexity (use small parts of your training set to start with)
129 | - Choose less complicated models to start with first
130 |
131 |
132 | ### Tuning
133 | - Happens on Validation or Training Sets
134 |
135 | ### Comparison
136 | - Measure Model Performance via Test Set
137 | - Avoid `Overfitting` & `Underfitting`
138 | #### Overfitting
139 | - Great performance on the training data but poor performance on test data means your model doesn't generalize well
140 | - Solution: Try a simpler model, or make sure the test data is of the same style as the data your model is training on
141 | #### Underfitting
142 | - Poor performance on training data means the model hasn't learned properly and is underfitting
143 | - Solution: Try a different model, improve the existing one through hyperparameter tuning, or collect more data.
144 |
145 |
146 |
147 |
--------------------------------------------------------------------------------
/Pages/P01_Data_Pre_Processing.md:
--------------------------------------------------------------------------------
1 | # Data Preprocessing
2 | # Table of contents
3 | - [Table of contents](#table-of-contents)
4 | - [Introduction](#introduction)
5 | - [Data Preprocessing](#data-preprocessing)
6 | - [Import Dataset](#import-dataset)
7 | - [Select Data](#select-data)
8 | - [Using Index: iloc](#using-index-iloc)
9 | - [Numpy representation of DF](#numpy-representation-of-df)
10 | - [Handle Missing Data](#handle-missing-data)
11 | - [Encode Categorical Data](#encode-categorical-data)
12 | - [Encode Independent Variables](#encode-independent-variables)
13 | - [Encode Dependent Variables](#encode-dependent-variables)
14 | - [Splitting Training set and Test set](#splitting-training-set-and-test-set)
15 | - [Feature Scaling](#feature-scaling)
16 | - [Standardisation Feature Scaling](#standardisation-feature-scaling)
17 | - [Resources](#resources)
18 |
19 |
20 | # Data Preprocessing
21 | ## Import Dataset
22 | ```python
23 | dataset = pd.read_csv("data.csv")
24 |
25 | Country Age Salary Purchased
26 | 0 France 44.0 72000.0 No
27 | 1 Spain 27.0 48000.0 Yes
28 | 2 Germany 30.0 54000.0 No
29 | 3 Spain 38.0 61000.0 No
30 | 4 Germany 40.0 NaN Yes
31 | 5 France 35.0 58000.0 Yes
32 | 6 Spain NaN 52000.0 No
33 | 7 France 48.0 79000.0 Yes
34 | 8 Germany 50.0 83000.0 No
35 | 9 France 37.0 67000.0 Yes
36 | ```
37 | ## Select Data
38 | ### Using Index iloc
39 | - `.iloc[]` allowed inputs are:
40 | #### Selecting Rows
41 | - An integer, e.g. `dataset.iloc[0]` > return row 0 in `Series` format
42 | ```
43 | Country France
44 | Age 44
45 | Salary 72000
46 | Purchased No
47 | ```
48 | - A list or array of integers, e.g.`dataset.iloc[[0]]` > return row 0 in DataFrame format
49 | ```
50 | Country Age Salary Purchased
51 | 0 France 44.0 72000.0 No
52 | ```
53 | - A slice object with ints, e.g. `dataset.iloc[:3]` > return rows 0 up to (but not including) row 3 in DataFrame format
54 | ```
55 | Country Age Salary Purchased
56 | 0 France 44.0 72000.0 No
57 | 1 Spain 27.0 48000.0 Yes
58 | 2 Germany 30.0 54000.0 No
59 | ```
60 | #### Selecting Rows & Columns
61 | - Select the first 3 rows & all columns except the last one: `X = dataset.iloc[:3, :-1]`
62 | ```
63 | Country Age Salary
64 | 0 France 44.0 72000.0
65 | 1 Spain 27.0 48000.0
66 | 2 Germany 30.0 54000.0
67 | ```
68 | ### Numpy representation of DF
69 | - `DataFrame.values`: Return a Numpy representation of the DataFrame (i.e: Only the values in the DataFrame will be returned, the axes labels will be removed)
70 | - For ex: `X = dataset.iloc[:3, :-1].values`
71 | ```
72 | [['France' 44.0 72000.0]
73 | ['Spain' 27.0 48000.0]
74 | ['Germany' 30.0 54000.0]]
75 | ```
76 | [(Back to top)](#table-of-contents)
77 |
78 | ## Handle Missing Data
79 | ### SimpleImputer
80 | - sklearn.impute.`SimpleImputer(missing_values={should be set to np.nan}, strategy={"mean", "median", "most_frequent", ...})`
81 | - imputer.`fit(X[:, 1:3])`: Fit the imputer on X.
82 | - imputer.`transform(X[:, 1:3])`: Impute all missing values in X.
83 |
84 | ```Python
85 | from sklearn.impute import SimpleImputer
86 |
87 | #Create an instance of Class SimpleImputer: np.nan is the empty value in the dataset
88 | imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
89 |
90 | #Replace missing value from numerical Col 1 'Age', Col 2 'Salary'
91 | imputer.fit(X[:, 1:3])
92 |
93 | #transform will replace & return the new updated columns
94 | X[:, 1:3] = imputer.transform(X[:, 1:3])
95 | ```
96 |
97 | ## Encode Categorical Data
98 | ### Encode Independent Variables
99 | - For the independent categorical variables, we will convert each category into a vector of 0s & 1s
100 | - Using the `ColumnTransformer` class &
101 | - `OneHotEncoder`: an encoding technique for features that are nominal (do not have any order)
102 | 
103 |
104 | ```Python
105 | from sklearn.compose import ColumnTransformer
106 | from sklearn.preprocessing import OneHotEncoder
107 | ```
108 | - `transformers`: specify what kind of transformation, and which cols
109 |     - Tuple `(name of the transformation e.g. 'encoder', instance of Class OneHotEncoder, [cols to transform])`
110 | - `remainder="passthrough"` > to keep the cols which are not transformed. Otherwise, the remaining cols will not be included in the output
111 | ```Python
112 | ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])] , remainder="passthrough" )
113 | ```
114 | - Fit and Transform with input = X in the Instance `ct` of class `ColumnTransformer`
115 | ```Python
116 | #fit and transform with input = X
117 | #np.array: need to convert output of fit_transform() from matrix to np.array
118 | X = np.array(ct.fit_transform(X))
119 | ```
120 | - Before converting categorical column [0] `Country`
121 | ```
122 | Country Age Salary Purchased
123 | 0 France 44.0 72000.0 No
124 | 1 Spain 27.0 48000.0 Yes
125 | ```
126 | - After converting, France = [1.0, 0, 0] vector
127 | ```
128 | [[1.0 0.0 0.0 44.0 72000.0]
129 | [0.0 0.0 1.0 27.0 48000.0]
130 | [0.0 1.0 0.0 30.0 54000.0]
131 | ```
132 |
133 | ### Encode Dependent Variables
134 | - For the dependent variable, since it is the Label > we use `Label Encoder`
135 | ```Python
136 | from sklearn.preprocessing import LabelEncoder
137 | le = LabelEncoder()
138 | #output of fit_transform of Label Encoder is already a Numpy Array
139 | y = le.fit_transform(y)
140 |
141 | #y = [0 1 0 0 1 1 0 1 0 1]
142 | ```
143 |
144 | # Splitting Training set and Test set
145 | - Using the `train_test_split` of SkLearn - Model Selection
146 | - Recommend Split: `test_size = 0.2`
147 | - `random_state = 1`: fixing the seed for random state so that we can have the same training & test sets anytime
148 | ```Python
149 | from sklearn.model_selection import train_test_split
150 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
151 | ```
152 | [(Back to top)](#table-of-contents)
153 |
154 | # Feature Scaling
155 | - What? Feature Scaling (FS): scale all the features to the same scale, to prevent one feature from dominating the others and the rest from being neglected by the ML Model
156 | - Note #1: FS does **not need to be applied every time** or in all ML Models (e.g. Multi-Regression Models)
157 |     - Why no FS for a Multi-Regression Model: y = b0 + b1 * x1 + b2 * x2 + b3 * x3 — the coefficients (b0, b1, b2, b3) can compensate for the features' different scales, so there is no need for FS.
158 | - Note #2: For dummy variables from Categorial Features Encoding, **no need to apply FS**
159 |
160 | - Note #3: **FS MUST be done AFTER splitting** Training & Test sets
161 |
162 | - Why ?
163 |     - The Test Set is supposed to be a brand-new set, so we are not supposed to let it interact with the Training Set
164 |     - FS is a technique that computes statistics (e.g. the mean & standard deviation) of the features in order to scale them
165 |     - If we apply FS before splitting the Training & Test sets, those statistics will include the values of both the Training Set and the Test Set
166 |     - FS MUST be done AFTER Splitting => Otherwise, we will cause **Information Leakage**
167 | ## How ?
168 | - There are 2 main Feature Scaling Techniques: Standardisation & Normalisation
169 | - `Standardisation`: This centers the dataset at 0 (i.e. a mean of 0) and changes the standard deviation to 1.
170 |     - *Usage*: applicable in all situations
171 | - `Normalisation`: This rescales the dataset into the range [0, 1]
172 |     - *Usage*: typically applied when the features do **not** follow a normal distribution
173 |
174 | 
175 |
176 | ## Standardisation Feature Scaling:
177 | - We will use `StandardScaler` from `sklearn.preprocessing`
178 | ```Python
179 | from sklearn.preprocessing import StandardScaler
180 | sc = StandardScaler()
181 | ```
182 | - For `X_train`: apply `StandardScaler` by using `fit_transform`
183 | ```Python
184 | X_train[:,3:] = sc.fit_transform(X_train[:,3:])
185 | ```
186 | - For `X_test`: apply the `StandardScaler` using only `transform`, because we want to apply the SAME scale as `X_train`
187 | ```Python
188 | #only use Transform to use the SAME scaler as the Training Set
189 | X_test[:,3:] = sc.transform(X_test[:,3:])
190 | ```
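Normalisation follows the same fit/transform pattern — a sketch using `MinMaxScaler` (same assumed `X_train`/`X_test` split as above):
```Python
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()  # rescales each feature into the range [0, 1]
X_train[:,3:] = mms.fit_transform(X_train[:,3:])
X_test[:,3:] = mms.transform(X_test[:,3:])  # reuse the training-set scale
```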
191 |
192 |
193 | [(Back to top)](#table-of-contents)
194 |
195 | # Resources:
196 | ### Podcast:
197 | https://www.superdatascience.com/podcast/sds-041-inspiring-journey-totally-different-background-data-science
198 |
199 |
200 |
201 |
202 |
--------------------------------------------------------------------------------
/Pages/P02_Regression.md:
--------------------------------------------------------------------------------
1 | # Regression
2 | # Table of contents
3 |
4 | - [Table of contents](#table-of-contents)
5 | - [Introduction to Regressions](#introduction-to-regressions)
6 | - [Simple Linear Regression](#simple-linear-regression)
7 | - [Outline: Building a Model](#outline-building-a-model)
8 | - [Creating a Model](#creating-a-model)
9 | - [Predicting a Test Result](#predicting-a-test-result)
10 | - [Visualising the Test set results](#visualising-the-test-set-results)
11 | - [Getting Linear Regression Equation](#getting-linear-regression-equation)
12 | - [Evaluating the Algorithm](#evaluating-the-algorithm)
13 | - [R Square or Adjusted R Square](#r-square-or-adjusted-r-square)
14 | - [Mean Square Error (MSE)/Root Mean Square Error (RMSE)](#mean-square-error-and-root-mean-square-error)
15 | - [Mean Absolute Error (MAE)](#mean-absolute-error)
16 | - [Multiple Linear Regression](#multiple-linear-regression)
17 | - [Assumptions of Linear Regression](#assumptions-of-linear-regression)
18 | - [Dummy Variables](#dummy-variables)
19 | - [Understanding P-value](#understanding-p-value)
20 | - [Building a Model](#building-a-model)
21 | - [Polynomial Linear Regression](#polynomial-linear-regression)
22 |
23 |
24 | # Introduction to Regressions
25 | - Simple Linear Regression : `y = b0 + b1*x1`
26 | - Multiple Linear Regression : `y = b0 + b1*x1 + b2*x2 + ... + bn*xn`
27 | - Polynomial Linear Regression: `y = b0 + b1*x1 + b2*x1^(2) + ... + bn*x1^(n)`
28 |
29 | # Simple Linear Regression
30 | ## Outline Building a Model
31 | - Importing libraries and datasets
32 | - Splitting the dataset
33 | - Training the simple Linear Regression model on the Training set
34 | - Predicting and visualizing the test set results
35 | - Visualizing the training set results
36 | - Making a single prediction
37 | - Getting the final linear regression equation (with values of the coefficients)
38 | ```
39 | y = b0 + b1 * x1
40 | ```
41 | - y: Dependent Variable (DV)
42 | - x: Independent Variable (IV)
43 | - b0: Intercept Coefficient
44 | - b1: Slope of Line Coefficient
45 | 
46 |
47 |
48 | ## Creating a Model
49 | - Using `sklearn.linear_model`, `LinearRegression` model
50 | ```Python
51 | from sklearn.linear_model import LinearRegression
52 |
53 | #To Create Instance of Simple Linear Regression Model
54 | regressor = LinearRegression()
55 |
56 | #To fit the X_train and y_train
57 | regressor.fit(X_train, y_train)
58 | ```
59 | ## Predicting a Test Result
60 | ```Python
61 | y_pred = regressor.predict(X_test)
62 | ```
63 | ### Predict a single value
64 | **Important note:** "predict" method always expects a 2D array as the format of its inputs.
65 | - And putting 12 into a double pair of square brackets makes the input exactly a 2D array:
66 | - `regressor.predict([[12]])`
67 |
68 | ```Python
69 | print(f"Predicted Salary of Employee with 12 years of EXP: {regressor.predict([[12]])}" )
70 |
71 | #Output: Predicted Salary of Employee with 12 years of EXP: [137605.23485427]
72 | ```
73 | ## Visualising the Test set results
74 | ```Python
75 | #Plot the actual test values
76 | plt.scatter(X_test, y_test, color = 'red', label = 'Actual Values')
77 | #Plot the regression line
78 | plt.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Linear Regression')
79 | #Label the Plot
80 | plt.title('Salary vs Experience (Test Set)')
81 | plt.xlabel('Years of Experience')
82 | plt.ylabel('Salary')
83 | #Show the plot
84 | plt.show()
85 | ```
86 | 
87 |
88 | ## Getting Linear Regression Equation
89 | - General Formula: `y_pred = model.intercept_ + model.coef_ * x`
90 | ```Python
91 | print(f"b0 : {regressor.intercept_}")
92 | print(f"b1 : {regressor.coef_}")
93 |
94 | b0 : 25609.89799835482
95 | b1 : [9332.94473799]
96 | ```
97 |
98 | Linear Regression Equation: `Salary = 25609 + 9332.94×YearsExperience`
99 |
100 | ## Evaluating the Algorithm
101 | - Evaluation metrics let us compare how well different algorithms perform on a particular dataset.
102 | - For regression algorithms, three evaluation metrics are commonly used:
103 | 1. R Square/Adjusted R Square > Percentage of the output variability
104 | 2. Mean Square Error(MSE)/Root Mean Square Error(RMSE) > to compare performance between different regression models
105 | 3. Mean Absolute Error(MAE) > to compare performance between different regression models
106 |
107 | ### R Square or Adjusted R Square
108 | #### R Square: Coefficient of determination
109 | - R Square measures how much of the **variability** in the dependent variable can be explained by the model.
110 |     - `Variance` is a statistical measure defined as the average of the squared differences between each individual point and the expected value.
111 | - R Square values lie between 0 and 1, and a bigger value indicates a better fit between prediction and actual value.
112 | - However, it does **not take the overfitting problem into consideration**:
113 |     - If your regression model has many independent variables, the model may be too complicated: it can fit the training data very well
114 |     - but perform badly on testing data.
115 |     - Solution: Adjusted R Square
116 |
117 | 
118 |
119 | #### Adjusted R Square
120 | - Adjusted R Square is introduced because R Square can be increased simply by adding more variables, which may lead to **over-fitting** of the model
121 | - It penalises additional independent variables added to the model and adjusts the metric to **prevent the overfitting issue**.
122 |
123 | #### Calculate R Square and Adjusted R Square using Python
124 | - In Python, you can calculate R Square using `Statsmodel` or `Sklearn` Package
125 | ```Python
126 | import statsmodels.api as sm
127 |
128 | X_addC = sm.add_constant(X)
129 |
130 | result = sm.OLS(Y, X_addC).fit()
131 |
132 | print(result.rsquared, result.rsquared_adj)
133 | # 0.79180307318 0.790545085707
134 |
135 | ```
136 | - Around 79% of the dependent variability can be explained by the model, and the adjusted R Square is roughly the same as the R Square, meaning the model is quite robust
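With `Sklearn`, the same metric is available as `r2_score` (a sketch assuming `Y_test` and `Y_predicted` from a fitted model):
```Python
from sklearn.metrics import r2_score
print(r2_score(Y_test, Y_predicted))
```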
137 |
138 | ### Mean Square Error and Root Mean Square Error
139 | - While **R Square** is a **relative measure** of how well the model fits dependent variables
140 | - **Mean Square Error (MSE)** is an **absolute measure** of the goodness for the fit.
141 | - **Root Mean Square Error(RMSE)** is the square root of MSE.
142 |     - It is used more commonly than MSE because, firstly, the MSE value can sometimes be too big to compare easily.
143 |     - Secondly, MSE is calculated from the square of the error, so taking the square root brings it back to the same scale as the prediction error, making it easier to interpret.
144 |
145 | 
146 |
147 | ```Python
148 | from sklearn.metrics import mean_squared_error
149 | import math
150 | print(mean_squared_error(Y_test, Y_predicted))
151 | print(math.sqrt(mean_squared_error(Y_test, Y_predicted)))
152 | # MSE: 2017904593.23
153 | # RMSE: 44921.092965684235
154 | ```
155 | ### Mean Absolute Error
156 | - Compared to MSE or RMSE, MAE is a more direct representation of the sum of the error terms.
157 |
158 | 
159 |
160 | ```Python
161 | from sklearn.metrics import mean_absolute_error
162 | print(mean_absolute_error(Y_test, Y_predicted))
163 | #MAE: 26745.1109986
164 | ```
165 |
166 | [(Back to top)](#table-of-contents)
167 |
168 | # Multiple Linear Regression
169 | ### Assumptions of Linear Regression:
170 | Before choosing Linear Regression, need to consider below assumptions
171 | 1. Linearity
172 | 2. Homoscedasticity
173 | 3. Multivariate normality
174 | 4. Independence of errors
175 | 5. Lack of multicollinearity
176 |
177 | ## Dummy Variables
178 | - Since `State` is a categorical variable => we need to convert it into `dummy variables`
179 | - Do not include all the dummy variables in our Regression Model => **always omit one dummy variable**
180 | - Why? The `dummy variable trap`
181 | 
182 |
183 | ## Understanding P value
184 | - Ho : `Null Hypothesis (Universe)`
185 | - H1 : `Alternative Hypothesis (Universe)`
186 | - For example:
187 | - Assume `Null Hypothesis` is true (or we are living in Null Universe)
188 |
189 | 
190 |
191 | [(Back to top)](#table-of-contents)
192 | ## Building a Model
193 | - 5 methods of Building Models
194 | ### Method 1: All-in
195 | - Throw in all variables in the dataset
196 | - Usage:
197 | - Prior knowledge about this problem; OR
198 | - You have to (Company Framework required)
199 | - Prepare for Backward Elimination
200 | ### Method 2 [Stepwise Regression]: Backward Elimination (Fastest)
201 | - Step 1: Select a significance level (SL) to stay in the model (e.g: SL = 0.05)
202 | ```Python
203 | # Building the optimal model using Backward Elimination
204 | import statsmodels.api as sm
205 |
206 | # Avoiding the Dummy Variable Trap by excluding the first column of Dummy Variable
207 | # Note: in general you don't have to remove manually a dummy variable column because Scikit-Learn takes care of it.
208 | X = X[:, 1:]
209 |
210 | #Append full column of "1"s to First Column of X using np.append
211 | #Since y = b0*(1) + b1 * x1 + b2 * x2 + .. + bn * xn, b0 is constant and can be re-written as b0 * (1)
212 | #np.append(arr = the array will add to, values = column to be added, axis = row/column)
213 | # np.ones((row, column)).astype(int) => .astype(int) to convert array of 1 into integer type to avoid data type error
214 | X = np.append(arr = np.ones((50,1)).astype(int), values = X, axis = 1)
215 |
216 | #Initialize X_opt with Original X by including all the column from #0 to #5
217 | X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5]], dtype=float)
218 | #If you are using the google colab to write your code,
219 | # the datatype of all the features is not set to float hence this step is important: X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5]], dtype=float)
220 | ```
221 | - Step 2: Fit the full model with all possible predictors
222 | ```Python
223 | #OrdinaryLeastSquares
224 | regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
225 | regressor_OLS.summary()
226 | ```
227 | - Step 3: Consider Predictor with Highest P-value
228 | - If P > SL, go to Step 4, otherwise go to [**FIN** : Your Model Is Ready]
229 | - Step 4: Remove the predictor
230 | ```Python
231 | #Remove column = 2 from X_opt since Column 2 has Highest P value (0.99) and > SL (0.05).
232 | X_opt = np.array(X[:, [0, 1, 3, 4, 5]], dtype=float)
233 | #OrdinaryLeastSquares
234 | regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
235 | regressor_OLS.summary()
236 | ```
237 | - Step 5: Re-Fit model without this variable
238 |
239 | ### Method 3 [Stepwise Regression]: Forward Selection
240 | - Step 1: Select a significance level (SL) to enter in the model (e.g: SL = 0.05)
241 | - Step 2: Fit all simple regression models (y ~ xn). Select the one with Lowest P-value for the independent variable.
242 | - Step 3: Keep this variable and fit all possible regression models with one extra predictor added to the one(s) you already have.
243 | - Step 4: Consider the predictor with the Lowest P-value. If P < SL (i.e: the model is good), go to STEP 3 (to add a 3rd variable into the model, and so on with all the variables we have left), otherwise go to [**FIN** : Keep the previous model]
244 | ### Method 4 [Stepwise Regression]: Bidirectional Elimination
245 | - Step 1: Select a significant level to enter and to stay in the model: `e.g: SLENTER = 0.05, SLSTAY = 0.05`
246 | - Step 2: Perform the next step of Forward Selection (new variables must have: P < SLENTER to enter)
247 | - Step 3: Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay) => Step 2.
248 | - Step 4: No variables can enter and no old variables can exit => [**FIN** : Your Model Is Ready]
249 |
250 | ### Method 5: Score Comparison
251 | - Step 1: Select a criterion of goodness of fit (e.g. Akaike criterion)
252 | - Step 2: Construct all possible regression Models: `2^(N) - 1` total combinations, where N: total number of variables
253 | - Step 3: Select the one with best criterion => [**FIN** : Your Model Is Ready]
254 |
255 | ### Code Implementation
256 | - Note: Backward Elimination is irrelevant in Python, because the Scikit-Learn library automatically takes care of selecting the statistically significant features when training the model to make accurate predictions.
257 | ##### Step 1: Splitting the dataset into the Training set and Test set
258 | ```Python
259 | #no need Feature Scaling (FS) for Multi-Regression Model: y = b0 + b1 * x1 + b2 * x2 + b3 * x3,
260 | # since we have the coefficients (b0, b1, b2, b3) to compensate, so there is no need FS.
261 | from sklearn.model_selection import train_test_split
262 |
263 | # NOT have to remove manually a dummy variable column because Scikit-Learn takes care of it.
264 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
265 | ```
266 |
267 | ##### Step 2: Training the Multiple Linear Regression model on the Training set
268 | ```Python
269 | #LinearRegression will take care "Dummy variable trap" & feature selection
270 | from sklearn.linear_model import LinearRegression
271 | regressor = LinearRegression()
272 | regressor.fit(X_train, y_train)
273 | ```
274 |
275 | ##### Step 3: Predicting the Test set results
276 | ```Python
277 | y_pred = regressor.predict(X_test)
278 | ```
279 |
280 |
281 | ##### Step 4: Displaying Y_Pred vs Y_test
282 | - Since this is multiple linear regression, so can not visualize by drawing the graph
283 |
284 | ```Python
285 | #To display the y_pred vs y_test vectors side by side
286 | np.set_printoptions(precision=2) #To round up value to 2 decimal places
287 |
288 | #np.concatenate((tuple of rows/columns you want to concatenate), axis = 0 for rows and 1 for columns)
289 | #y_pred.reshape(len(y_pred),1) : to convert y_pred to column vector by using .reshape()
290 |
291 | print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1))
292 | ```
293 | ##### Step 5: Getting the final linear regression equation with the values of the coefficients
294 | ```Python
295 | print(regressor.coef_)
296 | print(regressor.intercept_)
297 |
298 | [ 8.66e+01 -8.73e+02 7.86e+02 7.73e-01 3.29e-02 3.66e-02]
299 | 42467.52924853204
300 | ```
301 |
302 | Equation:
303 | Profit = 86.6 x DummyState1 - 873 x DummyState2 + 786 x DummyState3 + 0.773 x R&D Spend + 0.0329 x Administration + 0.0366 x Marketing Spend + 42467.53
304 |
305 | [(Back to top)](#table-of-contents)
306 |
307 |
308 | # Polynomial Linear Regression
309 | - Polynomial Linear Regression: `y = b0 + b1*x1 + b2*x1^(2) + ... + bn*x1^(n)`
310 | - Used for datasets with a non-linear relation that can still be modelled as a polynomial, e.g. a salary scale (see the sketch below).
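A minimal end-to-end sketch (hypothetical `X`, `y` arrays; degree 4 and the predicted value 6.5 are illustrative choices):
```Python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Expand x into [1, x, x^2, x^3, x^4], then fit a plain linear model on it
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

# Predict for a single new value (e.g. position level 6.5)
lin_reg.predict(poly_reg.transform([[6.5]]))
```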
311 |
312 |
313 |
314 | [(Back to top)](#table-of-contents)
315 |
--------------------------------------------------------------------------------
/Pages/Project_Guideline.md:
--------------------------------------------------------------------------------
1 | # End-to-End Machine Learning Project Guideline
2 |
3 |
4 | ## Table of contents
5 | - [1. Project Environment Setup](#1-project-environment-setup)
6 | - [1.1 Setup Conda Env](#11-setup-conda-env)
7 |
8 |
9 |
10 | ## 1. Project Environment Setup
11 |
12 |
13 | ### 1.1. Setup Conda Env
14 | #### 1.1.1. Create Conda Env from Scratch
15 | `conda create --prefix ./env pandas numpy matplotlib scikit-learn jupyter`
16 | #### 1.1.2. Create Conda Env from a base env
17 | - **Step 1**: Go to Base Env folder and export the base conda env to `environment.yml` file
18 |     - *Note*: open the `environment.yml` file with `vim environment.yml` → To exit Vim: press `ESC`, then type `:q` and press Enter
19 | ```Python
20 | conda env list #to list down current env
21 | conda activate /Users/quannguyen/Data_Science/Conda/env #Activate the base conda env
22 | conda env export > environment.yml #Export base conda env to environment.yml file
23 | conda deactivate #de-activate env once done
24 | ```
25 | - **Step 2**: Go to current project folder and create the env based on `environment.yml` file
26 | ```python
27 | conda env create --prefix ./env -f environment.yml
28 | ```
29 | [(Back to top)](#table-of-contents)
30 |
--------------------------------------------------------------------------------
/Pages/Resources/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Pages/Resources/.DS_Store
--------------------------------------------------------------------------------
/Pages/Resources/Interview/ML_cheatsheets.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Pages/Resources/Interview/ML_cheatsheets.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Science Handbook
2 | # Table of contents
3 | - [0. Introduction](./Pages/P00_Introduction.md)
4 | - [1. Data Preprocessing](./Pages/P01_Data_Pre_Processing.md)
5 | - [2. Regression](./Pages/P02_Regression.md)
6 | - [Project Guideline](./Pages/Project_Guideline.md)
7 | - [Appendix](#appendix)
8 | - [Resources](#resources)
9 |
10 | ## Appendix
11 | - [Job Description for Data Science](./Pages/A01_Job_Description.md)
12 | - [Interview Question for Data Science](./Pages/A01_Interview_Question.md)
13 | - [Pandas Cheat Sheet](./Pages/A02_Pandas_Cheat_Sheet.md)
14 | - [Numpy Cheat Sheet](./Pages/A03_Numpy_Cheat_Sheet.md)
15 | - [Matplotlib Cheat Sheet](./Pages/A05_Matplotlib.md)
16 | - [Sklearn Cheat Sheet](./Pages/A06_SkLearn.md)
17 | - [Conda CLI Cheat Sheet](./Pages/A04_Conda_CLI.md)
18 | - [Statistics](./Pages/A05_Statistics.md)
19 | - [Kaggle 30 Days of Machine Learning](./Pages/A07_Kaggle_30_ML.md)
20 | - [Daily Lessons](./Pages/A8_Daily_Lessons.md)
21 | ## Resources:
22 | - [AI Road Map](https://i.am.ai/roadmap/#note)
23 | - [AI Free Course](https://learn.aisingapore.org/professionals/): Intel AI Academy
24 | - [Reading List](./Pages/A00_Reading_List.md)
25 | - [Deep Learning Monitor: Latest Deep Learning Research Papers](https://deeplearn.org/)
26 | ### Podcast:
27 | - [1. Super Data Science](https://www.superdatascience.com/podcast/sds-041-inspiring-journey-totally-different-background-data-science)
28 | - [2. Visual ML ](https://vas3k.com/blog/machine_learning/)
29 |
30 |
31 |
32 |
33 |
--------------------------------------------------------------------------------