├── .DS_Store
├── Code
│   ├── .DS_Store
│   ├── A2_Pandas
│   │   ├── .DS_Store
│   │   └── P1_Getting_Knowing_Data
│   │       ├── .DS_Store
│   │       ├── .ipynb_checkpoints
│   │       │   ├── Introduction-to-pandas-checkpoint.ipynb
│   │       │   └── Pandas_Basic_1-checkpoint.ipynb
│   │       ├── Introduction-to-pandas.ipynb
│   │       ├── Pandas_Basic_1.ipynb
│   │       ├── chipotle.tsv
│   │       ├── data
│   │       │   ├── car-sales-missing-data.csv
│   │       │   └── car-sales.csv
│   │       ├── pandas-exercises-solutions.ipynb
│   │       └── pandas-exercises.ipynb
│   ├── A3_Numpy
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   ├── 2. NumPy-checkpoint.ipynb
│   │   │   ├── 3. NumPy exercises-checkpoint.ipynb
│   │   │   └── Introduction-to-Numpy-checkpoint.ipynb
│   │   ├── 2. NumPy.ipynb
│   │   ├── 3. NumPy exercises.ipynb
│   │   ├── Introduction-to-Numpy.ipynb
│   │   └── numpy-images
│   │       ├── car-photo.png
│   │       ├── dog-photo.png
│   │       └── panda.png
│   ├── A4_Matplotlib
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   └── Introduction_to_Matplotlib-checkpoint.ipynb
│   │   ├── Introduction_to_Matplotlib.ipynb
│   │   ├── Introduction_to_Matplotlib_ZTM.ipynb
│   │   ├── data
│   │   │   ├── .DS_Store
│   │   │   ├── california_cities.csv
│   │   │   ├── car-sales.csv
│   │   │   └── heart-disease.csv
│   │   └── images
│   │       ├── .DS_Store
│   │       └── simple-plot.jpg
│   ├── A5_Scikit_Learn
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   ├── 1-Get-Data-Ready-checkpoint.ipynb
│   │   │   ├── Introduction-to-scikit-learn-checkpoint.ipynb
│   │   │   └── sklearn-workflow-1-Get-Data-Ready-checkpoint.ipynb
│   │   ├── 1-Get-Data-Ready.ipynb
│   │   ├── Introduction-to-scikit-learn.ipynb
│   │   ├── data
│   │   │   ├── car-sales-extended-missing-data.csv
│   │   │   ├── car-sales-extended.csv
│   │   │   ├── car-sales-missing-data.csv
│   │   │   ├── car-sales.csv
│   │   │   └── heart-disease.csv
│   │   ├── gs_random_forest_model_1.joblib
│   │   ├── gs_random_forest_model_1.pkl
│   │   └── random_forest_model_1.pkl
│   ├── A6_Kaggle
│   │   ├── .ipynb_checkpoints
│   │   │   └── Day12_Housing Prices Competition-checkpoint.ipynb
│   │   └── Day12_Housing Prices Competition.ipynb
│   ├── A6_Seaborn
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   └── Introduction_to_Seaborn-checkpoint.ipynb
│   │   ├── Seaborn_Tutorial.ipynb
│   │   └── data
│   │       └── cereal.csv
│   ├── P00_Project_Template
│   │   ├── .DS_Store
│   │   ├── .ipynb_checkpoints
│   │   │   └── Project_Template_Heart_Disease_Classification-checkpoint.ipynb
│   │   ├── Project_Template_Heart_Disease_Classification.ipynb
│   │   └── data
│   │       └── heart.csv
│   ├── P01_Pre_Processing
│   │   ├── .ipynb_checkpoints
│   │   │   └── data_preprocessing_template-checkpoint.ipynb
│   │   ├── Data.csv
│   │   └── data_preprocessing_template.ipynb
│   ├── P02_Linear_Regression
│   │   ├── .ipynb_checkpoints
│   │   │   ├── polynomial_regression-checkpoint.ipynb
│   │   │   └── simple_linear_regression-checkpoint.ipynb
│   │   ├── Position_Salaries.csv
│   │   ├── multiple_linear_regression.ipynb
│   │   ├── multiple_linear_regression_Backward_Elimination.ipynb
│   │   ├── polynomial_regression.ipynb
│   │   └── simple_linear_regression.ipynb
│   └── Project
│       ├── .DS_Store
│       └── Housing Corporation
│           ├── .DS_Store
│           ├── .ipynb_checkpoints
│           │   ├── Housing Corporation-checkpoint.ipynb
│           │   └── Icon
│           ├── Housing Corporation.ipynb
│           ├── Icon
│           └── datasets
│               ├── .DS_Store
│               ├── Icon
│               └── housing
│                   ├── .DS_Store
│                   ├── Icon
│                   └── housing.csv
├── Pages
│   ├── .DS_Store
│   ├── A00_Reading_List.md
│   ├── A01_Interview_Question.md
│   ├── A01_Job_Description.md
│   ├── A02_Pandas_Cheat_Sheet.md
│   ├── A03_Numpy_Cheat_Sheet.md
│   ├── A04_Conda_CLI.md
│   ├── A05_Matplotlib.md
│   ├── A05_Statistics.md
│   ├── A06_SkLearn.md
│   ├── A8_Daily_Lessons.md
│   ├── P00_Introduction.md
│   ├── P01_Data_Pre_Processing.md
│   ├── P02_Regression.md
│   ├── Project_Guideline.md
│   └── Resources
│       ├── .DS_Store
│       └── Interview
│           └── ML_cheatsheets.pdf
└── README.md

/.DS_Store:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/.DS_Store -------------------------------------------------------------------------------- /Code/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/.DS_Store -------------------------------------------------------------------------------- /Code/A2_Pandas/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A2_Pandas/.DS_Store -------------------------------------------------------------------------------- /Code/A2_Pandas/P1_Getting_Knowing_Data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A2_Pandas/P1_Getting_Knowing_Data/.DS_Store -------------------------------------------------------------------------------- /Code/A2_Pandas/P1_Getting_Knowing_Data/.ipynb_checkpoints/Introduction-to-pandas-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 5 6 | } 7 | -------------------------------------------------------------------------------- /Code/A2_Pandas/P1_Getting_Knowing_Data/.ipynb_checkpoints/Pandas_Basic_1-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 4 6 | } 7 | -------------------------------------------------------------------------------- /Code/A2_Pandas/P1_Getting_Knowing_Data/Introduction-to-pandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "impressive-touch", 6 | "metadata": {}, 7 | "source": [ 8 | "## Introduction to Pandas" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "cellular-fleet", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "id": "mathematical-floating", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "# 2 main datatypes: Series & DataFrame\n", 29 | "# Series = 1-D\n", 30 | "# DF. = 2-D\n", 31 | "\n", 32 | "series = pd.Series([\"BMW\", \"Toyota\", \"Honda\"])\n", 33 | "colours = pd.Series([\"Red\", \"Blue\", \"White\"])" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "id": "right-effort", 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
\n", 46 | "\n", 59 | "\n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | "
Car makeColour
0BMWRed
1ToyotaBlue
2HondaWhite
\n", 85 | "
" 86 | ], 87 | "text/plain": [ 88 | " Car make Colour\n", 89 | "0 BMW Red\n", 90 | "1 Toyota Blue\n", 91 | "2 Honda White" 92 | ] 93 | }, 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "car_data = pd.DataFrame({\"Car make\": series, \"Colour\" : colours})\n", 101 | "car_data" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "id": "unnecessary-newsletter", 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "# Import data\n", 112 | "car_sales = pd.read_csv(\"./data/car-sales.csv\")" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "id": "present-fisher", 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "car_sales" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "abroad-province", 128 | "metadata": {}, 129 | "source": [ 130 | "## Describe Data" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "reduced-maine", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "# Attributes\n", 141 | "car_sales.dtypes" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "id": "professional-acoustic", 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "car_sales.columns" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "infectious-satin", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "car_colums = car_sales.columns\n", 162 | "car_colums" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "id": "lyric-savannah", 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "car_sales.index" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "hairy-brake", 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "# Function\n", 183 | "car_sales.describe()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "id": "expensive-taiwan", 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "car_sales.info()" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "id": "anticipated-rebecca", 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "car_sales.mean()" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "id": "herbal-moment", 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "car_prices = pd.Series([3000,1500,112045])\n", 214 | "car_prices.mean()" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "id": "sunrise-softball", 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "car_sales[\"Doors\"].sum()" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "id": "severe-mother", 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "len(car_sales)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "id": "aware-chicago", 240 | "metadata": {}, 241 | "source": [ 242 | "## Viewing and selecting data" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "id": "civic-gamma", 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "car_sales.head()" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 
258 | "id": "interior-european", 259 | "metadata": {}, 260 | "outputs": [], 261 | "source": [ 262 | "car_sales.tail()" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "id": "favorite-necklace", 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "# .loc and .iloc\n", 273 | "animals = pd.Series([\"cat\", \"dog\", \"Bird\", \"panda\", \"snake\"], index=[0,3,9,8,3])" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "id": "seasonal-region", 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "animals" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "id": "later-boxing", 290 | "metadata": {}, 291 | "outputs": [], 292 | "source": [ 293 | "# loc refers to index\n", 294 | "animals.loc[3]" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "id": "cordless-aurora", 301 | "metadata": {}, 302 | "outputs": [], 303 | "source": [ 304 | "# .iloc refers to position\n", 305 | "animals.iloc[3]" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "id": "animated-constitution", 311 | "metadata": {}, 312 | "source": [ 313 | "### Boolean Indexing" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "governing-headline", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "car_sales[car_sales[\"Odometer (KM)\"] > 100000]" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "id": "internal-attribute", 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [ 333 | "pd.crosstab(car_sales[\"Make\"], car_sales[\"Doors\"])" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "id": "iraqi-protocol", 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "# Groupy\n", 344 | "car_sales.groupby([\"Make\"]).mean()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "id": "integrated-transparency", 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "car_sales[\"Odometer (KM)\"].plot()" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "id": "accessory-scene", 361 | "metadata": {}, 362 | "outputs": [], 363 | "source": [ 364 | "car_sales[\"Odometer (KM)\"].hist() #150000 & 200000 consider as Outliner" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "id": "linear-emission", 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "car_sales['Price'] = car_sales['Price'].replace('[\\$\\,\\.]',\"\",regex=True).astype(int)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "lesbian-consent", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "car_sales.plot()" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "id": "descending-detection", 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "car_sales[\"Make\"] = car_sales[\"Make\"].str.lower()" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "id": "frequent-indonesia", 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "car_sales_missing = pd.read_csv(\"./data/car-sales-missing-data.csv\")\n", 405 | "car_sales_missing" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "id": 
"assumed-processor", 412 | "metadata": {}, 413 | "outputs": [], 414 | "source": [ 415 | "car_sales_missing[\"Odometer\"].fillna(car_sales_missing[\"Odometer\"].mean(), inplace = True)\n", 416 | "car_sales_missing" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "id": "changing-perception", 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "car_sales_missing.dropna(inplace=True)\n", 427 | "car_sales_missing" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "id": "about-professional", 433 | "metadata": {}, 434 | "source": [ 435 | "## Create new Columns for Pandas DF" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "id": "amended-sharing", 442 | "metadata": {}, 443 | "outputs": [], 444 | "source": [ 445 | "#columns from series\n", 446 | "\n", 447 | "seat_columns = pd.Series([5,5,5,5,5])\n", 448 | "\n", 449 | "# New column added\n", 450 | "car_sales[\"Seats\"] = seat_columns\n", 451 | "car_sales" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "id": "younger-grocery", 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "car_sales[\"Seats\"].fillna(5, inplace = True)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "id": "reflected-remainder", 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "car_sales" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "id": "multiple-generation", 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "fuel_economy = [7.5, 9.2, 5.0, 9.6, 8.7, 4.7, 7.6,8.7,3.0,4.5]\n", 482 | "car_sales[\"Fuel per 100KM\"] = fuel_economy\n", 483 | "car_sales" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "id": "healthy-mortgage", 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "car_sales[\"Total fuel used (L)\"] = car_sales[\"Odometer (KM)\"]/100*car_sales[\"Fuel per 100KM\"]" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": null, 499 | "id": "fiscal-serial", 500 | "metadata": {}, 501 | "outputs": [], 502 | "source": [ 503 | "car_sales" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": null, 509 | "id": "marine-catholic", 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "car_sales[\"Number of wheels\"] = 4\n", 514 | "car_sales[\"Passed road safety\"] = True\n", 515 | "car_sales" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "id": "fancy-courtesy", 521 | "metadata": {}, 522 | "source": [ 523 | "## Sampling Data" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "id": "suitable-survey", 530 | "metadata": {}, 531 | "outputs": [], 532 | "source": [ 533 | "#Shuffle all the row\n", 534 | "car_sales_shuffled = car_sales.sample(frac=1)\n", 535 | "car_sales_shuffled" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "id": "amino-remove", 542 | "metadata": {}, 543 | "outputs": [], 544 | "source": [ 545 | "# Take a sample of the data to practise\n", 546 | "\n", 547 | "#Only select 20% of data\n", 548 | "car_sales_shuffled.sample(frac = 0.2)" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "id": "ready-commissioner", 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "# Revert the original index\n", 559 | 
"car_sales_shuffled.reset_index(drop=True, inplace=True)" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "id": "beneficial-screen", 566 | "metadata": {}, 567 | "outputs": [], 568 | "source": [ 569 | "car_sales_shuffled" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": null, 575 | "id": "grand-uniform", 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [ 579 | "car_sales[\"Odometer (KM)\"] = car_sales[\"Odometer (KM)\"].apply(lambda x: x/1.6)" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "id": "private-fossil", 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [ 589 | "car_sales" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": null, 595 | "id": "expensive-friday", 596 | "metadata": {}, 597 | "outputs": [], 598 | "source": [] 599 | } 600 | ], 601 | "metadata": { 602 | "kernelspec": { 603 | "display_name": "Python 3", 604 | "language": "python", 605 | "name": "python3" 606 | }, 607 | "language_info": { 608 | "codemirror_mode": { 609 | "name": "ipython", 610 | "version": 3 611 | }, 612 | "file_extension": ".py", 613 | "mimetype": "text/x-python", 614 | "name": "python", 615 | "nbconvert_exporter": "python", 616 | "pygments_lexer": "ipython3", 617 | "version": "3.8.8" 618 | } 619 | }, 620 | "nbformat": 4, 621 | "nbformat_minor": 5 622 | } 623 | -------------------------------------------------------------------------------- /Code/A2_Pandas/P1_Getting_Knowing_Data/data/car-sales-missing-data.csv: -------------------------------------------------------------------------------- 1 | Make,Colour,Odometer,Doors,Price 2 | Toyota,White,150043,4,"$4,000" 3 | Honda,Red,87899,4,"$5,000" 4 | Toyota,Blue,,3,"$7,000" 5 | BMW,Black,11179,5,"$22,000" 6 | Nissan,White,213095,4,"$3,500" 7 | Toyota,Green,,4,"$4,500" 8 | Honda,,,4,"$7,500" 9 | Honda,Blue,,4, 10 | Toyota,White,60000,, 11 | ,White,31600,4,"$9,700" -------------------------------------------------------------------------------- /Code/A2_Pandas/P1_Getting_Knowing_Data/data/car-sales.csv: -------------------------------------------------------------------------------- 1 | Make,Colour,Odometer (KM),Doors,Price 2 | Toyota,White,150043,4,"$4,000.00" 3 | Honda,Red,87899,4,"$5,000.00" 4 | Toyota,Blue,32549,3,"$7,000.00" 5 | BMW,Black,11179,5,"$22,000.00" 6 | Nissan,White,213095,4,"$3,500.00" 7 | Toyota,Green,99213,4,"$4,500.00" 8 | Honda,Blue,45698,4,"$7,500.00" 9 | Honda,Blue,54738,4,"$7,000.00" 10 | Toyota,White,60000,4,"$6,250.00" 11 | Nissan,White,31600,4,"$9,700.00" -------------------------------------------------------------------------------- /Code/A2_Pandas/P1_Getting_Knowing_Data/pandas-exercises.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Pandas Practice\n", 8 | "\n", 9 | "This notebook is dedicated to practicing different tasks with pandas. 
The solutions are available in a solutions notebook, however, you should always try to figure them out yourself first.\n", 10 | "\n", 11 | "It should be noted there may be more than one different way to answer a question or complete an exercise.\n", 12 | "\n", 13 | "Exercises are based off (and directly taken from) the quick introduction to pandas notebook.\n", 14 | "\n", 15 | "Different tasks will be detailed by comments or text.\n", 16 | "\n", 17 | "For further reference and resources, it's advised to check out the [pandas documnetation](https://pandas.pydata.org/pandas-docs/stable/)." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 1, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "# Import pandas\n" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# Create a series of three different colours\n" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "# View the series of different colours\n" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 4, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "# Create a series of three different car types and view it\n" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 5, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "# Combine the Series of cars and colours into a DataFrame\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 6, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# Import \"../data/car-sales.csv\" and turn it into a DataFrame\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "**Note:** Since you've imported `../data/car-sales.csv` as a DataFrame, we'll now refer to this DataFrame as 'the car sales DataFrame'." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 7, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "# Export the DataFrame you created to a .csv file\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 8, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Find the different datatypes of the car data DataFrame\n" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 9, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "# Describe your current car sales DataFrame using describe()\n" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 10, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "# Get information about your DataFrame using info()\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "What does it show you?" 
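
A hint for the `.info()` question above (the full answer is in the solutions notebook) — a minimal sketch, assuming the DataFrame is named `car_sales`:

# .info() prints the index range, each column's name, non-null count and dtype,
# plus approximate memory usage
car_sales.info()
# comparing the non-null counts against len(car_sales) is a quick way to spot missing values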
122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 11, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "# Create a Series of different numbers and find the mean of them\n" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 12, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "# Create a Series of different numbers and find the sum of them\n" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 13, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "# List out all the column names of the car sales DataFrame\n" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 14, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "# Find the length of the car sales DataFrame\n" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 15, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "# Show the first 5 rows of the car sales DataFrame\n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 16, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "# Show the first 7 rows of the car sales DataFrame\n" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 17, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "# Show the bottom 5 rows of the car sales DataFrame\n" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 18, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "# Use .loc to select the row at index 3 of the car sales DataFrame\n" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 19, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "# Use .iloc to select the row at position 3 of the car sales DataFrame\n" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "Notice how they're the same? Why do you think this is? \n", 210 | "\n", 211 | "Check the pandas documentation for [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) and [.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html). Think about a different situation each could be used for and try them out." 
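
To make the `.loc`/`.iloc` distinction concrete, a minimal sketch (the `animals` Series with a non-sequential index is a hypothetical example):

import pandas as pd

animals = pd.Series(["cat", "dog", "bird"], index=[0, 3, 9])
animals.loc[3]   # label-based: the row whose index label is 3 -> "dog"
animals.iloc[2]  # position-based: the third row, whatever its label -> "bird"

With the default RangeIndex (0, 1, 2, ...), labels and positions coincide, which is why `.loc[3]` and `.iloc[3]` select the same row here.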
212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 20, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "# Select the \"Odometer (KM)\" column from the car sales DataFrame\n" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 21, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "# Find the mean of the \"Odometer (KM)\" column in the car sales DataFrame\n" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 22, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "# Select the rows with over 100,000 kilometers on the Odometer\n" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 23, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "# Create a crosstab of the Make and Doors columns\n" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 24, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "# Group columns of the car sales DataFrame by the Make column and find the average\n" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 25, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "# Import Matplotlib and create a plot of the Odometer column\n", 266 | "# Don't forget to use %matplotlib inline\n" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 26, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "# Create a histogram of the Odometer column using hist()\n" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 27, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "# Try to plot the Price column using plot()\n" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "Why didn't it work? Can you think of a solution?\n", 292 | "\n", 293 | "You might want to search for \"how to convert a pandas string columb to numbers\".\n", 294 | "\n", 295 | "And if you're still stuck, check out this [Stack Overflow question and answer on turning a price column into integers](https://stackoverflow.com/questions/44469313/price-column-object-to-int-in-pandas).\n", 296 | "\n", 297 | "See how you can provide the example code there to the problem here." 
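
One possible fix, following the same pattern used in the introduction notebook (a sketch, not the only valid answer): strip the currency punctuation with a regex, then cast.

# "$4,000.00" is a string; drop "$", "," and "." so the column can be converted and plotted
car_sales["Price"] = car_sales["Price"].replace('[\$\,\.]', "", regex=True).astype(int)
# note: removing the "." keeps the cents digits, which is why a later exercise
# asks you to remove the two extra zeros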
298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 28, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "# Remove the punctuation from price column\n" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 29, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "# Check the changes to the price column\n" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 30, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "# Remove the two extra zeros at the end of the price column\n" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 31, 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [ 333 | "# Check the changes to the Price column\n" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 32, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "# Change the datatype of the Price column to integers\n" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 33, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "# Lower the strings of the Make column\n" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "If you check the car sales DataFrame, you'll notice the Make column hasn't been lowered.\n", 359 | "\n", 360 | "How could you make these changes permanent?\n", 361 | "\n", 362 | "Try it out." 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 34, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "# Make lowering the case of the Make column permanent\n" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 35, 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "# Check the car sales DataFrame\n" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "Notice how the Make column stays lowered after reassigning.\n", 388 | "\n", 389 | "Now let's deal with missing data." 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 36, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "# Import the car sales DataFrame with missing data (\"../data/car-sales-missing-data.csv\")\n", 399 | "\n", 400 | "\n", 401 | "# Check out the new DataFrame\n" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "Notice the missing values are represented as `NaN` in pandas DataFrames.\n", 409 | "\n", 410 | "Let's try fill them." 
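
A minimal sketch of the two options shown in the introduction notebook — impute with a column statistic, then drop whatever is still missing (note the odometer column is named "Odometer" in this CSV):

# fill missing odometer readings with the column mean, in place
car_sales_missing["Odometer"].fillna(car_sales_missing["Odometer"].mean(), inplace=True)

# drop any rows that still contain missing values
car_sales_missing.dropna(inplace=True)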
411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 37, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "# Fill the Odometer (KM) column missing values with the mean of the column inplace\n" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 38, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "# View the car sales missing DataFrame and verify the changes\n" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 39, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "# Remove the rest of the missing data inplace\n" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 40, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "# Verify the missing values are removed by viewing the DataFrame\n" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "We'll now start to add columns to our DataFrame." 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 41, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "# Create a \"Seats\" column where every row has a value of 5\n" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 42, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "# Create a column called \"Engine Size\" with random values between 1.3 and 4.5\n", 472 | "# Remember: If you're doing it from a Python list, the list has to be the same length\n", 473 | "# as the DataFrame\n" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": 43, 479 | "metadata": {}, 480 | "outputs": [], 481 | "source": [ 482 | "# Create a column which represents the price of a car per kilometer\n", 483 | "# Then view the DataFrame\n" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": 44, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "# Remove the last column you added using .drop()\n" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 45, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "# Shuffle the DataFrame using sample() with the frac parameter set to 1\n", 502 | "# Save the the shuffled DataFrame to a new variable\n" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "Notice how the index numbers get moved around. The [`sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) function is a great way to get random samples from your DataFrame. It's also another great way to shuffle the rows by setting `frac=1`." 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 46, 515 | "metadata": {}, 516 | "outputs": [], 517 | "source": [ 518 | "# Reset the indexes of the shuffled DataFrame\n" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": {}, 524 | "source": [ 525 | "Notice the index numbers have been changed to have order (start from 0)." 
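
A sketch of the shuffle-then-reset pattern this exercise builds on (names follow the introduction notebook):

# sample(frac=1) returns every row in random order
car_sales_shuffled = car_sales.sample(frac=1)

# reset_index rebuilds an ordered 0..n-1 index; drop=True discards the shuffled labels
car_sales_shuffled.reset_index(drop=True, inplace=True)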
526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 47, 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "# Change the Odometer values from kilometers to miles using a Lambda function\n", 535 | "# Then view the DataFrame\n" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 48, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "# Change the title of the Odometer (KM) to represent miles instead of kilometers\n" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "## Extensions\n", 552 | "\n", 553 | "For more exercises, check out the pandas documentation, particularly the [10-minutes to pandas section](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). \n", 554 | "\n", 555 | "One great exercise would be to retype out the entire section into a Jupyter Notebook of your own.\n", 556 | "\n", 557 | "Get hands-on with the code and see what it does.\n", 558 | "\n", 559 | "The next place you should check out are the [top questions and answers on Stack Overflow for pandas](https://stackoverflow.com/questions/tagged/pandas?sort=MostVotes&edited=true). Often, these contain some of the most useful and common pandas functions. Be sure to play around with the different filters!\n", 560 | "\n", 561 | "Finally, always remember, the best way to learn something new to is try it. Make mistakes. Ask questions, get things wrong, take note of the things you do most often. And don't worry if you keep making the same mistake, pandas has many ways to do the same thing and is a big library. So it'll likely take a while before you get the hang of it." 562 | ] 563 | } 564 | ], 565 | "metadata": { 566 | "kernelspec": { 567 | "display_name": "Python 3", 568 | "language": "python", 569 | "name": "python3" 570 | }, 571 | "language_info": { 572 | "codemirror_mode": { 573 | "name": "ipython", 574 | "version": 3 575 | }, 576 | "file_extension": ".py", 577 | "mimetype": "text/x-python", 578 | "name": "python", 579 | "nbconvert_exporter": "python", 580 | "pygments_lexer": "ipython3", 581 | "version": "3.8.3" 582 | } 583 | }, 584 | "nbformat": 4, 585 | "nbformat_minor": 2 586 | } 587 | -------------------------------------------------------------------------------- /Code/A3_Numpy/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/.DS_Store -------------------------------------------------------------------------------- /Code/A3_Numpy/numpy-images/car-photo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/numpy-images/car-photo.png -------------------------------------------------------------------------------- /Code/A3_Numpy/numpy-images/dog-photo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/numpy-images/dog-photo.png -------------------------------------------------------------------------------- /Code/A3_Numpy/numpy-images/panda.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A3_Numpy/numpy-images/panda.png -------------------------------------------------------------------------------- /Code/A4_Matplotlib/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/.DS_Store -------------------------------------------------------------------------------- /Code/A4_Matplotlib/data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/data/.DS_Store -------------------------------------------------------------------------------- /Code/A4_Matplotlib/data/car-sales.csv: -------------------------------------------------------------------------------- 1 | Make,Colour,Odometer (KM),Doors,Price 2 | Toyota,White,150043,4,"$4,000.00" 3 | Honda,Red,87899,4,"$5,000.00" 4 | Toyota,Blue,32549,3,"$7,000.00" 5 | BMW,Black,11179,5,"$22,000.00" 6 | Nissan,White,213095,4,"$3,500.00" 7 | Toyota,Green,99213,4,"$4,500.00" 8 | Honda,Blue,45698,4,"$7,500.00" 9 | Honda,Blue,54738,4,"$7,000.00" 10 | Toyota,White,60000,4,"$6,250.00" 11 | Nissan,White,31600,4,"$9,700.00" -------------------------------------------------------------------------------- /Code/A4_Matplotlib/data/heart-disease.csv: -------------------------------------------------------------------------------- 1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target 2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1 3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1 4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1 5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1 6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1 7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1 8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1 9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1 10 | 52,1,2,172,199,1,1,162,0,0.5,2,0,3,1 11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1 12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1 13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1 14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1 15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1 16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1 17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1 18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1 19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1 20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1 21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1 22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1 23 | 44,1,2,130,233,0,1,179,1,0.4,2,0,2,1 24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1 25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1 26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1 27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1 28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1 29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1 30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1 31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1 32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1 33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1 34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1 35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1 36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1 37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1 38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1 39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1 40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1 41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1 42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1 43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1 44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1 45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1 46 | 
39,1,2,140,321,0,0,182,0,0,2,0,2,1 47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1 48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1 49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1 50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1 51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1 52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1 53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1 54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1 55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1 56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1 57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1 58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1 59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1 60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1 61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1 62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1 63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1 64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1 65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1 66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1 67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1 68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1 69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1 70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1 71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1 72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1 73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1 74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1 75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1 76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1 77 | 55,0,1,135,250,0,0,161,0,1.4,1,0,2,1 78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1 79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1 80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1 81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1 82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1 83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1 84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1 85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1 86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1 87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1 88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1 89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1 90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1 91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1 92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1 93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1 94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1 95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1 96 | 45,0,1,112,160,0,1,138,0,0,1,0,2,1 97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1 98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1 99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1 100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1 101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1 102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1 103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1 104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1 105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1 106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1 107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1 108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1 109 | 45,0,0,138,236,0,0,152,1,0.2,1,0,2,1 110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1 111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1 112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1 113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1 114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1 115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1 116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1 117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1 118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1 119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1 120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1 121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1 122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1 123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1 124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1 125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1 126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1 127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1 128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1 129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1 130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1 131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1 132 | 
54,0,2,160,201,0,1,163,0,0,2,1,2,1 133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1 134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1 135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1 136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1 137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1 138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1 139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1 140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1 141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1 142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1 143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1 144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1 145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1 146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1 147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1 148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1 149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1 150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1 151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1 152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1 153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1 154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1 155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1 156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1 157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1 158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1 159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1 160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1 161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1 162 | 56,1,1,120,240,0,1,169,0,0,0,0,2,1 163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1 164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1 165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0 168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0 169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0 170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0 171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0 172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0 173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0 174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0 175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0 176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0 177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0 178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0 179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0 180 | 43,1,0,120,177,0,0,120,1,2.5,1,0,3,0 181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0 182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0 183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0 184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0 185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0 186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0 187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0 188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0 189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0 190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0 191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0 192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0 193 | 58,1,0,128,216,0,0,131,1,2.2,1,3,3,0 194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0 195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0 196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0 197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0 198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0 199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0 200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0 201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0 202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0 203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0 204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0 205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0 206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0 207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0 208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0 209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0 210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0 211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0 212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0 213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0 214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0 215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0 216 | 
56,1,0,125,249,1,0,144,1,1.2,1,1,2,0 217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0 218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0 219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0 220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0 221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0 222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0 223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0 224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0 225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0 226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0 227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0 228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0 229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0 230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0 231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0 232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0 233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0 234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0 235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0 236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0 237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0 238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0 239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0 240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0 241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0 242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0 243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0 244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0 245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0 246 | 56,1,0,132,184,0,0,105,1,2.1,1,1,1,0 247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0 248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0 249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0 250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0 251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0 252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0 253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0 254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0 255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0 256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0 257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0 258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0 259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0 260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0 261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0 262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0 263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0 264 | 53,1,0,123,282,0,1,95,1,2,1,2,3,0 265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0 266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0 267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0 268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0 269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0 270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0 271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0 272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0 273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0 274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0 275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0 276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0 277 | 52,1,0,125,212,0,1,168,0,1,2,2,3,0 278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0 279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0 280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0 281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0 282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0 283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0 284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0 285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0 286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0 287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0 288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0 289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0 290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0 291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0 292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0 293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0 294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0 295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0 296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0 297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0 298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0 299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0 300 | 
57,0,0,140,241,0,1,123,1,0.2,1,0,3,0 301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0 302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0 303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0 304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0 305 | -------------------------------------------------------------------------------- /Code/A4_Matplotlib/images/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/images/.DS_Store -------------------------------------------------------------------------------- /Code/A4_Matplotlib/images/simple-plot.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A4_Matplotlib/images/simple-plot.jpg -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/.DS_Store -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/.ipynb_checkpoints/1-Get-Data-Ready-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "sublime-register", 6 | "metadata": {}, 7 | "source": [ 8 | "# Introduction to Scikit-Learn (sklearn)\n", 9 | "\n", 10 | "This notebook demonstates some of the most useful functions of the Sklearn Lib\n", 11 | "\n", 12 | "Cover:\n", 13 | "\n", 14 | "0. End-to_end Scikit-Learn Workflow\n", 15 | "1. Getting Data Ready\n", 16 | "2. Choose the right estimator/algorithm for our problems\n", 17 | "3. Fit the model/algorithm and use it to make predictions on our data\n", 18 | "4. Evaluation a model\n", 19 | "5. Improve a model\n", 20 | "6. Save and load a trained model\n", 21 | "7. Put it all together!" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "raising-nutrition", 27 | "metadata": {}, 28 | "source": [ 29 | "## 0. An end-to-end scikit-learn workflow" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 1, 35 | "id": "annoying-macedonia", 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "# 1. Get the data ready\n", 40 | "\n", 41 | "# Standard import\n", 42 | "import pandas as pd\n", 43 | "import numpy as np\n", 44 | "import matplotlib.pyplot as plt\n", 45 | "\n", 46 | "%matplotlib inline" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "id": "abroad-prediction", 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "data": { 57 | "text/html": [ 58 | "
\n", 59 | "\n", 72 | "\n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | "
MakeColourOdometer (KM)DoorsPrice
0HondaWhite35431.04.015323.0
1BMWBlue192714.05.019943.0
2HondaWhite84714.04.028343.0
3ToyotaWhite154365.04.013434.0
4NissanBlue181577.03.014043.0
\n", 126 | "
" 127 | ], 128 | "text/plain": [ 129 | " Make Colour Odometer (KM) Doors Price\n", 130 | "0 Honda White 35431.0 4.0 15323.0\n", 131 | "1 BMW Blue 192714.0 5.0 19943.0\n", 132 | "2 Honda White 84714.0 4.0 28343.0\n", 133 | "3 Toyota White 154365.0 4.0 13434.0\n", 134 | "4 Nissan Blue 181577.0 3.0 14043.0" 135 | ] 136 | }, 137 | "execution_count": 2, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "car_sales = pd.read_csv(\"./data/car-sales-extended-missing-data.csv\")\n", 144 | "car_sales.head()" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 3, 150 | "id": "liable-mortgage", 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/plain": [ 156 | "1000" 157 | ] 158 | }, 159 | "execution_count": 3, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "len(car_sales)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 4, 171 | "id": "relevant-space", 172 | "metadata": { 173 | "scrolled": true 174 | }, 175 | "outputs": [ 176 | { 177 | "data": { 178 | "text/plain": [ 179 | "Make object\n", 180 | "Colour object\n", 181 | "Odometer (KM) float64\n", 182 | "Doors float64\n", 183 | "Price float64\n", 184 | "dtype: object" 185 | ] 186 | }, 187 | "execution_count": 4, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "car_sales.dtypes" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "coastal-nudist", 199 | "metadata": {}, 200 | "source": [ 201 | "## What if there were missing values ?\n", 202 | "1. Fill them with some values (a.k.a `imputation`).\n", 203 | "2. Remove the samples with missing data altogether." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 5, 209 | "id": "naval-terry", 210 | "metadata": { 211 | "scrolled": true 212 | }, 213 | "outputs": [ 214 | { 215 | "data": { 216 | "text/plain": [ 217 | "Make 49\n", 218 | "Colour 50\n", 219 | "Odometer (KM) 50\n", 220 | "Doors 50\n", 221 | "Price 50\n", 222 | "dtype: int64" 223 | ] 224 | }, 225 | "execution_count": 5, 226 | "metadata": {}, 227 | "output_type": "execute_result" 228 | } 229 | ], 230 | "source": [ 231 | "car_sales.isna().sum()" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 7, 237 | "id": "under-ferry", 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "# Drop the rows with missing in the \"Price\" column\n", 242 | "car_sales.dropna(subset=[\"Price\"], inplace=True)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "greenhouse-ticket", 248 | "metadata": {}, 249 | "source": [ 250 | "## 1. Getting Data Ready: \n", 251 | "\n", 252 | "Three main thins we have to do:\n", 253 | "1. Split the data into features and labels (Usually `X` and `y`)\n", 254 | "2. Filling (also called imputing) or disregarding missing values\n", 255 | "3. Converting non-numerical values to numerical values (a.k.a. 
feature encoding)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 8, 261 | "id": "prescription-vertical", 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "# Create X (features matrix)\n", 266 | "X = car_sales.drop(\"Price\", axis = 1) # Remove 'target' column\n", 267 | "\n", 268 | "# Create y (labels)\n", 269 | "y = car_sales[\"Price\"]" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "id": "incorporated-september", 275 | "metadata": {}, 276 | "source": [ 277 | "**Note**: We split the data into train & test sets *before* filling missing values, so the fill values are learned from the training set alone and nothing leaks from the test set into training." 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 10, 283 | "id": "scenic-baking", 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "np.random.seed(42)\n", 288 | "\n", 289 | "# Split the data into training and test sets\n", 290 | "from sklearn.model_selection import train_test_split\n", 291 | "\n", 292 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 11, 298 | "id": "raising-gibraltar", 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "((760, 4), (190, 4), (760,), (190,))" 305 | ] 306 | }, 307 | "execution_count": 11, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "X_train.shape, X_test.shape, y_train.shape, y_test.shape" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "parliamentary-click", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "# Inspect whether \"Doors\" is a categorical feature or not\n", 324 | "# Although \"Doors\" contains numerical values\n", 325 | "car_sales[\"Doors\"].value_counts()\n", 326 | "\n", 327 | "# Conclusion: \"Doors\" is a categorical feature since it has only 3 distinct values" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 12, 333 | "id": "alternate-indian", 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "# Fill missing values with Scikit-Learn \n", 338 | "from sklearn.impute import SimpleImputer # Helps fill in missing values\n", 339 | "from sklearn.compose import ColumnTransformer\n", 340 | "\n", 341 | "# Fill categorical values with 'missing' & numerical values with the mean\n", 342 | "\n", 343 | "cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"missing\")\n", 344 | "door_imputer = SimpleImputer(strategy=\"constant\", fill_value=4) # The \"Doors\" column is numerical in type but actually categorical\n", 345 | "num_imputer = SimpleImputer(strategy=\"mean\")" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 13, 351 | "id": "blank-study", 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "# Define different column features\n", 356 | "categorical_features = [\"Make\", \"Colour\"]\n", 357 | "door_feature = [\"Doors\"]\n", 358 | "numerical_feature = [\"Odometer (KM)\"]" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 15, 364 | "id": "superb-forwarding", 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "imputer = ColumnTransformer([\n", 369 | " (\"cat_imputer\", cat_imputer, categorical_features),\n", 370 | " (\"door_imputer\", door_imputer, door_feature),\n", 371 | " (\"num_imputer\", num_imputer, numerical_feature)])" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "id": "governing-oxford",
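For reference, the data-prep steps above condense into one runnable sketch (a hedged summary, not a cell from the original notebook; it assumes the same ./data/car-sales-extended-missing-data.csv and uses random_state=42 in place of the notebook's np.random.seed(42)):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Load the data and drop rows where the label ("Price") is missing
car_sales = pd.read_csv("./data/car-sales-extended-missing-data.csv")
car_sales.dropna(subset=["Price"], inplace=True)

# Split into features and labels
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# Split first, so imputation statistics come from the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same imputation strategies as the cells above
imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make", "Colour"]),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

filled_X_train = imputer.fit_transform(X_train)  # fit on train, then transform
filled_X_test = imputer.transform(X_test)        # transform only, reusing train statistics
```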
377 | "metadata": {}, 378 | "source": [ 379 | "**Note:** We use fit_transform() on the training data and transform() on the testing data. \n", 380 | "* In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). \n", 381 | "* Then we take those same patterns and fill the test set (transform only)." 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 16, 387 | "id": "matched-english", 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "data": { 392 | "text/plain": [ 393 | "array([['Honda', 'White', 4.0, 71934.0],\n", 394 | " ['Toyota', 'Red', 4.0, 162665.0],\n", 395 | " ['Honda', 'White', 4.0, 42844.0],\n", 396 | " ...,\n", 397 | " ['Toyota', 'White', 4.0, 196225.0],\n", 398 | " ['Honda', 'Blue', 4.0, 133117.0],\n", 399 | " ['Honda', 'missing', 4.0, 150582.0]], dtype=object)" 400 | ] 401 | }, 402 | "execution_count": 16, 403 | "metadata": {}, 404 | "output_type": "execute_result" 405 | } 406 | ], 407 | "source": [ 408 | "# learn the patterns in the training set and transform it via imputation (fit, then transform)\n", 409 | "filled_X_train = imputer.fit_transform(X_train)\n", 410 | "# take those same patterns and fill the test set (transform only)\n", 411 | "filled_X_test = imputer.transform(X_test)\n", 412 | "\n", 413 | "# Check filled X_train\n", 414 | "filled_X_train" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 17, 420 | "id": "pointed-darkness", 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/plain": [ 426 | "Make 0\n", 427 | "Colour 0\n", 428 | "Doors 0\n", 429 | "Odometer (KM) 0\n", 430 | "dtype: int64" 431 | ] 432 | }, 433 | "execution_count": 17, 434 | "metadata": {}, 435 | "output_type": "execute_result" 436 | } 437 | ], 438 | "source": [ 439 | "# Get our transformed data arrays back into DataFrames\n", 440 | "car_sales_filled_train = pd.DataFrame(filled_X_train, \n", 441 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n", 442 | "\n", 443 | "car_sales_filled_test = pd.DataFrame(filled_X_test, \n", 444 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n", 445 | "\n", 446 | "# Check missing data in training set\n", 447 | "car_sales_filled_train.isna().sum()" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 21, 453 | "id": "distributed-grounds", 454 | "metadata": {}, 455 | "outputs": [ 456 | { 457 | "data": { 458 | "text/plain": [ 459 | "array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 460 | " 0.00000e+00, 7.19340e+04],\n", 461 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 462 | " 0.00000e+00, 1.62665e+05],\n", 463 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 464 | " 0.00000e+00, 4.28440e+04],\n", 465 | " ...,\n", 466 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 467 | " 0.00000e+00, 1.96225e+05],\n", 468 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 469 | " 0.00000e+00, 1.33117e+05],\n", 470 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 471 | " 0.00000e+00, 1.50582e+05]])" 472 | ] 473 | }, 474 | "execution_count": 21, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "# Turn the categories into numbers\n", 481 | "from sklearn.preprocessing import OneHotEncoder\n", 482 | "\n", 483 | "\n", 484 | "categorical_features = [\"Make\", \"Colour\", \"Doors\"] \n", 485 | "\n", 486 | "one_hot = OneHotEncoder()\n", 487 |
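# Note (an aside, assuming scikit-learn's standard OneHotEncoder behaviour,
# not part of the original cell): OneHotEncoder() with default settings raises
# an error at transform time if the test set contains a category never seen
# during fit. A common safeguard is
#   one_hot = OneHotEncoder(handle_unknown="ignore")
# which encodes unseen categories as all-zero rows instead of raising.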
"transformer = ColumnTransformer([(\"one_hot\", \n", 488 | " one_hot,\n", 489 | " categorical_features)], remainder=\"passthrough\")\n", 490 | "\n", 491 | "# Fill train and test values separately\n", 492 | "transformed_X_train = transformer.fit_transform(car_sales_filled_train)\n", 493 | "transformed_X_test = transformer.transform(car_sales_filled_test)\n", 494 | "\n", 495 | "transformed_X_train.toarray()" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 22, 501 | "id": "historic-sarah", 502 | "metadata": {}, 503 | "outputs": [ 504 | { 505 | "data": { 506 | "text/plain": [ 507 | "0.21735623151692096" 508 | ] 509 | }, 510 | "execution_count": 22, 511 | "metadata": {}, 512 | "output_type": "execute_result" 513 | } 514 | ], 515 | "source": [ 516 | "# 2. Chose the right model and hyper-parameters\n", 517 | "\n", 518 | "from sklearn.ensemble import RandomForestRegressor\n", 519 | "\n", 520 | "model = RandomForestRegressor()\n", 521 | "model.fit(transformed_X_train, y_train)\n", 522 | "model.score(transformed_X_test, y_test)" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "id": "complex-camera", 529 | "metadata": {}, 530 | "outputs": [], 531 | "source": [] 532 | } 533 | ], 534 | "metadata": { 535 | "kernelspec": { 536 | "display_name": "Python 3", 537 | "language": "python", 538 | "name": "python3" 539 | }, 540 | "language_info": { 541 | "codemirror_mode": { 542 | "name": "ipython", 543 | "version": 3 544 | }, 545 | "file_extension": ".py", 546 | "mimetype": "text/x-python", 547 | "name": "python", 548 | "nbconvert_exporter": "python", 549 | "pygments_lexer": "ipython3", 550 | "version": "3.8.8" 551 | } 552 | }, 553 | "nbformat": 4, 554 | "nbformat_minor": 5 555 | } 556 | -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/1-Get-Data-Ready.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "sublime-register", 6 | "metadata": {}, 7 | "source": [ 8 | "# Introduction to Scikit-Learn (sklearn)\n", 9 | "\n", 10 | "This notebook demonstates some of the most useful functions of the Sklearn Lib\n", 11 | "\n", 12 | "Cover:\n", 13 | "\n", 14 | "0. End-to_end Scikit-Learn Workflow\n", 15 | "1. Getting Data Ready\n", 16 | "2. Choose the right estimator/algorithm for our problems\n", 17 | "3. Fit the model/algorithm and use it to make predictions on our data\n", 18 | "4. Evaluation a model\n", 19 | "5. Improve a model\n", 20 | "6. Save and load a trained model\n", 21 | "7. Put it all together!" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "raising-nutrition", 27 | "metadata": {}, 28 | "source": [ 29 | "## 0. An end-to-end scikit-learn workflow" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 1, 35 | "id": "annoying-macedonia", 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "# 1. Get the data ready\n", 40 | "\n", 41 | "# Standard import\n", 42 | "import pandas as pd\n", 43 | "import numpy as np\n", 44 | "import matplotlib.pyplot as plt\n", 45 | "\n", 46 | "%matplotlib inline" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "id": "abroad-prediction", 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "data": { 57 | "text/html": [ 58 | "
\n", 59 | "\n", 72 | "\n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | "
MakeColourOdometer (KM)DoorsPrice
0HondaWhite35431.04.015323.0
1BMWBlue192714.05.019943.0
2HondaWhite84714.04.028343.0
3ToyotaWhite154365.04.013434.0
4NissanBlue181577.03.014043.0
\n", 126 | "
" 127 | ], 128 | "text/plain": [ 129 | " Make Colour Odometer (KM) Doors Price\n", 130 | "0 Honda White 35431.0 4.0 15323.0\n", 131 | "1 BMW Blue 192714.0 5.0 19943.0\n", 132 | "2 Honda White 84714.0 4.0 28343.0\n", 133 | "3 Toyota White 154365.0 4.0 13434.0\n", 134 | "4 Nissan Blue 181577.0 3.0 14043.0" 135 | ] 136 | }, 137 | "execution_count": 2, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "car_sales = pd.read_csv(\"./data/car-sales-extended-missing-data.csv\")\n", 144 | "car_sales.head()" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 3, 150 | "id": "liable-mortgage", 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/plain": [ 156 | "1000" 157 | ] 158 | }, 159 | "execution_count": 3, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "len(car_sales)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 4, 171 | "id": "relevant-space", 172 | "metadata": { 173 | "scrolled": true 174 | }, 175 | "outputs": [ 176 | { 177 | "data": { 178 | "text/plain": [ 179 | "Make object\n", 180 | "Colour object\n", 181 | "Odometer (KM) float64\n", 182 | "Doors float64\n", 183 | "Price float64\n", 184 | "dtype: object" 185 | ] 186 | }, 187 | "execution_count": 4, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "car_sales.dtypes" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "coastal-nudist", 199 | "metadata": {}, 200 | "source": [ 201 | "## What if there were missing values ?\n", 202 | "1. Fill them with some values (a.k.a `imputation`).\n", 203 | "2. Remove the samples with missing data altogether." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 5, 209 | "id": "naval-terry", 210 | "metadata": { 211 | "scrolled": true 212 | }, 213 | "outputs": [ 214 | { 215 | "data": { 216 | "text/plain": [ 217 | "Make 49\n", 218 | "Colour 50\n", 219 | "Odometer (KM) 50\n", 220 | "Doors 50\n", 221 | "Price 50\n", 222 | "dtype: int64" 223 | ] 224 | }, 225 | "execution_count": 5, 226 | "metadata": {}, 227 | "output_type": "execute_result" 228 | } 229 | ], 230 | "source": [ 231 | "car_sales.isna().sum()" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 7, 237 | "id": "under-ferry", 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "# Drop the rows with missing in the \"Price\" column\n", 242 | "car_sales.dropna(subset=[\"Price\"], inplace=True)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "greenhouse-ticket", 248 | "metadata": {}, 249 | "source": [ 250 | "## 1. Getting Data Ready: \n", 251 | "\n", 252 | "Three main thins we have to do:\n", 253 | "1. Split the data into features and labels (Usually `X` and `y`)\n", 254 | "2. Filling (also called imputing) or disregarding missing values\n", 255 | "3. Converting non-numerical values to numerical values (a.k.a. 
feature encoding)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 8, 261 | "id": "prescription-vertical", 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "# Create X (features matrix)\n", 266 | "X = car_sales.drop(\"Price\", axis = 1) # Remove 'target' column\n", 267 | "\n", 268 | "# Create y (labels)\n", 269 | "y = car_sales[\"Price\"]" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "id": "incorporated-september", 275 | "metadata": {}, 276 | "source": [ 277 | "**Note**: We split the data into train & test sets *before* filling missing values, so the fill values are learned from the training set alone and nothing leaks from the test set into training." 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 10, 283 | "id": "scenic-baking", 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "np.random.seed(42)\n", 288 | "\n", 289 | "# Split the data into training and test sets\n", 290 | "from sklearn.model_selection import train_test_split\n", 291 | "\n", 292 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 11, 298 | "id": "raising-gibraltar", 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "((760, 4), (190, 4), (760,), (190,))" 305 | ] 306 | }, 307 | "execution_count": 11, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "X_train.shape, X_test.shape, y_train.shape, y_test.shape" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "parliamentary-click", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "# Inspect whether \"Doors\" is a categorical feature or not\n", 324 | "# Although \"Doors\" contains numerical values\n", 325 | "car_sales[\"Doors\"].value_counts()\n", 326 | "\n", 327 | "# Conclusion: \"Doors\" is a categorical feature since it has only 3 distinct values" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 12, 333 | "id": "alternate-indian", 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "# Fill missing values with Scikit-Learn \n", 338 | "from sklearn.impute import SimpleImputer # Helps fill in missing values\n", 339 | "from sklearn.compose import ColumnTransformer\n", 340 | "\n", 341 | "# Fill categorical values with 'missing' & numerical values with the mean\n", 342 | "\n", 343 | "cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"missing\")\n", 344 | "door_imputer = SimpleImputer(strategy=\"constant\", fill_value=4) # The \"Doors\" column is numerical in type but actually categorical\n", 345 | "num_imputer = SimpleImputer(strategy=\"mean\")" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 13, 351 | "id": "blank-study", 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "# Define different column features\n", 356 | "categorical_features = [\"Make\", \"Colour\"]\n", 357 | "door_feature = [\"Doors\"]\n", 358 | "numerical_feature = [\"Odometer (KM)\"]" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 15, 364 | "id": "superb-forwarding", 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "imputer = ColumnTransformer([\n", 369 | " (\"cat_imputer\", cat_imputer, categorical_features),\n", 370 | " (\"door_imputer\", door_imputer, door_feature),\n", 371 | " (\"num_imputer\", num_imputer, numerical_feature)])" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "id": "governing-oxford",
377 | "metadata": {}, 378 | "source": [ 379 | "**Note:** We use fit_transform() on the training data and transform() on the testing data. \n", 380 | "* In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). \n", 381 | "* Then we take those same patterns and fill the test set (transform only)." 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 16, 387 | "id": "matched-english", 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "data": { 392 | "text/plain": [ 393 | "array([['Honda', 'White', 4.0, 71934.0],\n", 394 | " ['Toyota', 'Red', 4.0, 162665.0],\n", 395 | " ['Honda', 'White', 4.0, 42844.0],\n", 396 | " ...,\n", 397 | " ['Toyota', 'White', 4.0, 196225.0],\n", 398 | " ['Honda', 'Blue', 4.0, 133117.0],\n", 399 | " ['Honda', 'missing', 4.0, 150582.0]], dtype=object)" 400 | ] 401 | }, 402 | "execution_count": 16, 403 | "metadata": {}, 404 | "output_type": "execute_result" 405 | } 406 | ], 407 | "source": [ 408 | "# learn the patterns in the training set and transform it via imputation (fit, then transform)\n", 409 | "filled_X_train = imputer.fit_transform(X_train)\n", 410 | "# take those same patterns and fill the test set (transform only)\n", 411 | "filled_X_test = imputer.transform(X_test)\n", 412 | "\n", 413 | "# Check filled X_train\n", 414 | "filled_X_train" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 17, 420 | "id": "pointed-darkness", 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/plain": [ 426 | "Make 0\n", 427 | "Colour 0\n", 428 | "Doors 0\n", 429 | "Odometer (KM) 0\n", 430 | "dtype: int64" 431 | ] 432 | }, 433 | "execution_count": 17, 434 | "metadata": {}, 435 | "output_type": "execute_result" 436 | } 437 | ], 438 | "source": [ 439 | "# Get our transformed data arrays back into DataFrames\n", 440 | "car_sales_filled_train = pd.DataFrame(filled_X_train, \n", 441 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n", 442 | "\n", 443 | "car_sales_filled_test = pd.DataFrame(filled_X_test, \n", 444 | " columns=[\"Make\", \"Colour\", \"Doors\", \"Odometer (KM)\"])\n", 445 | "\n", 446 | "# Check missing data in training set\n", 447 | "car_sales_filled_train.isna().sum()" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 21, 453 | "id": "distributed-grounds", 454 | "metadata": {}, 455 | "outputs": [ 456 | { 457 | "data": { 458 | "text/plain": [ 459 | "array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 460 | " 0.00000e+00, 7.19340e+04],\n", 461 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 462 | " 0.00000e+00, 1.62665e+05],\n", 463 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 464 | " 0.00000e+00, 4.28440e+04],\n", 465 | " ...,\n", 466 | " [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 467 | " 0.00000e+00, 1.96225e+05],\n", 468 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 469 | " 0.00000e+00, 1.33117e+05],\n", 470 | " [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,\n", 471 | " 0.00000e+00, 1.50582e+05]])" 472 | ] 473 | }, 474 | "execution_count": 21, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "# Turn the categories into numbers\n", 481 | "from sklearn.preprocessing import OneHotEncoder\n", 482 | "\n", 483 | "\n", 484 | "categorical_features = [\"Make\", \"Colour\", \"Doors\"] \n", 485 | "\n", 486 | "one_hot = OneHotEncoder()\n", 487 |
"transformer = ColumnTransformer([(\"one_hot\", \n", 488 | " one_hot,\n", 489 | " categorical_features)], remainder=\"passthrough\")\n", 490 | "\n", 491 | "# Fill train and test values separately\n", 492 | "transformed_X_train = transformer.fit_transform(car_sales_filled_train)\n", 493 | "transformed_X_test = transformer.transform(car_sales_filled_test)\n", 494 | "\n", 495 | "transformed_X_train.toarray()" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 22, 501 | "id": "historic-sarah", 502 | "metadata": {}, 503 | "outputs": [ 504 | { 505 | "data": { 506 | "text/plain": [ 507 | "0.21735623151692096" 508 | ] 509 | }, 510 | "execution_count": 22, 511 | "metadata": {}, 512 | "output_type": "execute_result" 513 | } 514 | ], 515 | "source": [ 516 | "# 2. Chose the right model and hyper-parameters\n", 517 | "\n", 518 | "from sklearn.ensemble import RandomForestRegressor\n", 519 | "\n", 520 | "model = RandomForestRegressor()\n", 521 | "model.fit(transformed_X_train, y_train)\n", 522 | "model.score(transformed_X_test, y_test)" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "id": "complex-camera", 529 | "metadata": {}, 530 | "outputs": [], 531 | "source": [] 532 | } 533 | ], 534 | "metadata": { 535 | "kernelspec": { 536 | "display_name": "Python 3", 537 | "language": "python", 538 | "name": "python3" 539 | }, 540 | "language_info": { 541 | "codemirror_mode": { 542 | "name": "ipython", 543 | "version": 3 544 | }, 545 | "file_extension": ".py", 546 | "mimetype": "text/x-python", 547 | "name": "python", 548 | "nbconvert_exporter": "python", 549 | "pygments_lexer": "ipython3", 550 | "version": "3.8.8" 551 | } 552 | }, 553 | "nbformat": 4, 554 | "nbformat_minor": 5 555 | } 556 | -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/data/car-sales-missing-data.csv: -------------------------------------------------------------------------------- 1 | Make,Colour,Odometer,Doors,Price 2 | Toyota,White,150043,4,"$4,000" 3 | Honda,Red,87899,4,"$5,000" 4 | Toyota,Blue,,3,"$7,000" 5 | BMW,Black,11179,5,"$22,000" 6 | Nissan,White,213095,4,"$3,500" 7 | Toyota,Green,,4,"$4,500" 8 | Honda,,,4,"$7,500" 9 | Honda,Blue,,4, 10 | Toyota,White,60000,, 11 | ,White,31600,4,"$9,700" -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/data/car-sales.csv: -------------------------------------------------------------------------------- 1 | Make,Colour,Odometer (KM),Doors,Price 2 | Toyota,White,150043,4,"$4,000.00" 3 | Honda,Red,87899,4,"$5,000.00" 4 | Toyota,Blue,32549,3,"$7,000.00" 5 | BMW,Black,11179,5,"$22,000.00" 6 | Nissan,White,213095,4,"$3,500.00" 7 | Toyota,Green,99213,4,"$4,500.00" 8 | Honda,Blue,45698,4,"$7,500.00" 9 | Honda,Blue,54738,4,"$7,000.00" 10 | Toyota,White,60000,4,"$6,250.00" 11 | Nissan,White,31600,4,"$9,700.00" -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/data/heart-disease.csv: -------------------------------------------------------------------------------- 1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target 2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1 3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1 4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1 5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1 6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1 7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1 8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1 9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1 10 | 
52,1,2,172,199,1,1,162,0,0.5,2,0,3,1 11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1 12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1 13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1 14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1 15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1 16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1 17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1 18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1 19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1 20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1 21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1 22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1 23 | 44,1,2,130,233,0,1,179,1,0.4,2,0,2,1 24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1 25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1 26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1 27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1 28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1 29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1 30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1 31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1 32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1 33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1 34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1 35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1 36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1 37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1 38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1 39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1 40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1 41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1 42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1 43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1 44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1 45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1 46 | 39,1,2,140,321,0,0,182,0,0,2,0,2,1 47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1 48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1 49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1 50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1 51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1 52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1 53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1 54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1 55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1 56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1 57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1 58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1 59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1 60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1 61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1 62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1 63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1 64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1 65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1 66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1 67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1 68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1 69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1 70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1 71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1 72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1 73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1 74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1 75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1 76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1 77 | 55,0,1,135,250,0,0,161,0,1.4,1,0,2,1 78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1 79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1 80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1 81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1 82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1 83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1 84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1 85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1 86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1 87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1 88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1 89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1 90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1 91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1 92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1 93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1 94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1 95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1 96 | 
45,0,1,112,160,0,1,138,0,0,1,0,2,1 97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1 98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1 99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1 100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1 101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1 102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1 103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1 104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1 105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1 106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1 107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1 108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1 109 | 45,0,0,138,236,0,0,152,1,0.2,1,0,2,1 110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1 111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1 112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1 113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1 114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1 115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1 116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1 117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1 118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1 119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1 120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1 121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1 122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1 123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1 124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1 125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1 126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1 127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1 128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1 129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1 130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1 131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1 132 | 54,0,2,160,201,0,1,163,0,0,2,1,2,1 133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1 134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1 135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1 136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1 137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1 138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1 139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1 140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1 141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1 142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1 143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1 144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1 145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1 146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1 147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1 148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1 149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1 150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1 151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1 152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1 153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1 154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1 155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1 156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1 157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1 158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1 159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1 160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1 161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1 162 | 56,1,1,120,240,0,1,169,0,0,0,0,2,1 163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1 164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1 165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0 168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0 169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0 170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0 171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0 172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0 173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0 174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0 175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0 176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0 177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0 178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0 179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0 180 | 
43,1,0,120,177,0,0,120,1,2.5,1,0,3,0 181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0 182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0 183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0 184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0 185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0 186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0 187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0 188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0 189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0 190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0 191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0 192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0 193 | 58,1,0,128,216,0,0,131,1,2.2,1,3,3,0 194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0 195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0 196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0 197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0 198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0 199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0 200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0 201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0 202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0 203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0 204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0 205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0 206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0 207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0 208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0 209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0 210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0 211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0 212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0 213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0 214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0 215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0 216 | 56,1,0,125,249,1,0,144,1,1.2,1,1,2,0 217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0 218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0 219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0 220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0 221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0 222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0 223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0 224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0 225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0 226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0 227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0 228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0 229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0 230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0 231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0 232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0 233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0 234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0 235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0 236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0 237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0 238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0 239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0 240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0 241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0 242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0 243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0 244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0 245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0 246 | 56,1,0,132,184,0,0,105,1,2.1,1,1,1,0 247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0 248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0 249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0 250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0 251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0 252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0 253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0 254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0 255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0 256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0 257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0 258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0 259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0 260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0 261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0 262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0 263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0 264 | 
53,1,0,123,282,0,1,95,1,2,1,2,3,0 265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0 266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0 267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0 268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0 269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0 270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0 271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0 272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0 273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0 274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0 275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0 276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0 277 | 52,1,0,125,212,0,1,168,0,1,2,2,3,0 278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0 279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0 280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0 281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0 282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0 283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0 284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0 285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0 286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0 287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0 288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0 289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0 290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0 291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0 292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0 293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0 294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0 295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0 296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0 297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0 298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0 299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0 300 | 57,0,0,140,241,0,1,123,1,0.2,1,0,3,0 301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0 302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0 303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0 304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0 305 | -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/gs_random_forest_model_1.joblib: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/gs_random_forest_model_1.joblib -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/gs_random_forest_model_1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/gs_random_forest_model_1.pkl -------------------------------------------------------------------------------- /Code/A5_Scikit_Learn/random_forest_model_1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A5_Scikit_Learn/random_forest_model_1.pkl -------------------------------------------------------------------------------- /Code/A6_Kaggle/.ipynb_checkpoints/Day12_Housing Prices Competition-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "07d7bd2d", 7 | "metadata": { 8 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", 10 | "execution": { 11 | "iopub.execute_input": "2021-08-15T03:43:11.802107Z", 12 | "iopub.status.busy": "2021-08-15T03:43:11.800938Z", 13 | "iopub.status.idle": "2021-08-15T03:43:11.814520Z", 14 | "shell.execute_reply": 
"2021-08-15T03:43:11.815049Z", 15 | "shell.execute_reply.started": "2021-08-15T02:46:18.327501Z" 16 | }, 17 | "papermill": { 18 | "duration": 0.02756, 19 | "end_time": "2021-08-15T03:43:11.815353", 20 | "exception": false, 21 | "start_time": "2021-08-15T03:43:11.787793", 22 | "status": "completed" 23 | }, 24 | "tags": [] 25 | }, 26 | "outputs": [ 27 | { 28 | "name": "stdout", 29 | "output_type": "stream", 30 | "text": [ 31 | "/kaggle/input/home-data-for-ml-course/sample_submission.csv\n", 32 | "/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz\n", 33 | "/kaggle/input/home-data-for-ml-course/train.csv.gz\n", 34 | "/kaggle/input/home-data-for-ml-course/data_description.txt\n", 35 | "/kaggle/input/home-data-for-ml-course/test.csv.gz\n", 36 | "/kaggle/input/home-data-for-ml-course/train.csv\n", 37 | "/kaggle/input/home-data-for-ml-course/test.csv\n" 38 | ] 39 | } 40 | ], 41 | "source": [ 42 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 43 | "# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n", 44 | "# For example, here's several helpful packages to load\n", 45 | "\n", 46 | "import numpy as np # linear algebra\n", 47 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 48 | "\n", 49 | "# Input data files are available in the read-only \"../input/\" directory\n", 50 | "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n", 51 | "\n", 52 | "import os\n", 53 | "for dirname, _, filenames in os.walk('/kaggle/input'):\n", 54 | " for filename in filenames:\n", 55 | " print(os.path.join(dirname, filename))\n", 56 | "\n", 57 | "# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n", 58 | "# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "id": "7def9f9a", 65 | "metadata": { 66 | "execution": { 67 | "iopub.execute_input": "2021-08-15T03:43:11.837576Z", 68 | "iopub.status.busy": "2021-08-15T03:43:11.836898Z", 69 | "iopub.status.idle": "2021-08-15T03:43:13.077546Z", 70 | "shell.execute_reply": "2021-08-15T03:43:13.076802Z", 71 | "shell.execute_reply.started": "2021-08-15T03:10:19.101028Z" 72 | }, 73 | "papermill": { 74 | "duration": 1.252221, 75 | "end_time": "2021-08-15T03:43:13.077693", 76 | "exception": false, 77 | "start_time": "2021-08-15T03:43:11.825472", 78 | "status": "completed" 79 | }, 80 | "tags": [] 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# Import helpful libraries\n", 85 | "import pandas as pd\n", 86 | "import numpy as np\n", 87 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", 88 | "from sklearn.metrics import mean_absolute_error, mean_squared_error #Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.\n", 89 | "from sklearn.model_selection import train_test_split\n", 90 | "\n", 91 | "# Load the data, and separate the target\n", 92 | "iowa_file_path = '../input/home-data-for-ml-course/train.csv'\n", 93 | "home_data = pd.read_csv(iowa_file_path)\n", 94 | "y = home_data.SalePrice\n", 95 | "\n", 96 | "# Create X (After completing the exercise, you can return to modify this line!)\n", 97 | "#features = ['LotArea', 'YearBuilt', 
'1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\n" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 3, 103 | "id": "fed68f38", 104 | "metadata": { 105 | "execution": { 106 | "iopub.execute_input": "2021-08-15T03:43:13.101919Z", 107 | "iopub.status.busy": "2021-08-15T03:43:13.101196Z", 108 | "iopub.status.idle": "2021-08-15T03:43:13.104152Z", 109 | "shell.execute_reply": "2021-08-15T03:43:13.104647Z", 110 | "shell.execute_reply.started": "2021-08-15T03:01:48.233770Z" 111 | }, 112 | "papermill": { 113 | "duration": 0.017924, 114 | "end_time": "2021-08-15T03:43:13.104807", 115 | "exception": false, 116 | "start_time": "2021-08-15T03:43:13.086883", 117 | "status": "completed" 118 | }, 119 | "tags": [] 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "# Create X (After completing the exercise, you can return to modify this line!)\n", 124 | "features = [\n", 125 | " 'MSSubClass',\n", 126 | " 'LotArea',\n", 127 | " 'OverallQual',\n", 128 | " 'OverallCond',\n", 129 | " 'YearBuilt',\n", 130 | " 'YearRemodAdd', \n", 131 | " '1stFlrSF',\n", 132 | " '2ndFlrSF' ,\n", 133 | " 'LowQualFinSF',\n", 134 | " 'GrLivArea',\n", 135 | " 'FullBath',\n", 136 | " 'HalfBath',\n", 137 | " 'BedroomAbvGr',\n", 138 | " 'KitchenAbvGr', \n", 139 | " 'TotRmsAbvGrd',\n", 140 | " 'Fireplaces', \n", 141 | " 'WoodDeckSF' ,\n", 142 | " 'OpenPorchSF',\n", 143 | " 'EnclosedPorch',\n", 144 | " '3SsnPorch', \n", 145 | " 'ScreenPorch',\n", 146 | " 'PoolArea',\n", 147 | " 'MiscVal',\n", 148 | " 'MoSold',\n", 149 | " 'YrSold'\n", 150 | "]" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 4, 156 | "id": "ac795a42", 157 | "metadata": { 158 | "execution": { 159 | "iopub.execute_input": "2021-08-15T03:43:13.126743Z", 160 | "iopub.status.busy": "2021-08-15T03:43:13.126039Z", 161 | "iopub.status.idle": "2021-08-15T03:43:20.892772Z", 162 | "shell.execute_reply": "2021-08-15T03:43:20.893262Z", 163 | "shell.execute_reply.started": "2021-08-15T03:40:09.224892Z" 164 | }, 165 | "papermill": { 166 | "duration": 7.779176, 167 | "end_time": "2021-08-15T03:43:20.893480", 168 | "exception": false, 169 | "start_time": "2021-08-15T03:43:13.114304", 170 | "status": "completed" 171 | }, 172 | "tags": [] 173 | }, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "Validation RMSE for Random Forest Model: 26,895\n", 180 | "Validation RMSE for Gradient Boosting Model: 24,597\n", 181 | "Validation RMSE for Mean Prediction of 2 Models: 23,834\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "# Select columns corresponding to features, and preview the data\n", 187 | "X = home_data[features]\n", 188 | "\n", 189 | "\n", 190 | "# Split into validation and training data\n", 191 | "train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)\n", 192 | "\n", 193 | "# Define a random forest model\n", 194 | "rf_model = RandomForestRegressor(random_state=1, n_estimators=700)\n", 195 | "rf_model.fit(train_X, train_y)\n", 196 | "rf_val_predictions = rf_model.predict(val_X)\n", 197 | "rf_val_rmse = np.sqrt(mean_squared_error(rf_val_predictions, val_y))\n", 198 | "\n", 199 | "gbm_model = GradientBoostingRegressor(random_state=1, n_estimators=500)\n", 200 | "gbm_model.fit(train_X, train_y)\n", 201 | "gbm_val_predictions = gbm_model.predict(val_X)\n", 202 | "gbm_val_rmse = np.sqrt(mean_squared_error(gbm_val_predictions, val_y))\n", 203 | "\n", 204 | "mean_2model_val_predictions = (rf_val_predictions + gbm_val_predictions)/2\n", 205 
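# (Aside, not from the original notebook: averaging the predictions of two
# diverse models, as in the line above, is a simple blending ensemble. Because
# the two models' errors are only partly correlated, the mean prediction often
# beats either model alone, as the validation RMSEs printed below suggest. A
# weighted average, e.g. 0.4 * rf + 0.6 * gbm, would be a natural next experiment.)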
| "mean_2model_val_rmse = np.sqrt(mean_squared_error(mean_2model_val_predictions, val_y))\n", 206 | "\n", 207 | "print(\"Validation RMSE for Random Forest Model: {:,.0f}\".format(rf_val_rmse))\n", 208 | "print(\"Validation RMSE for Gradient Boosting Model: {:,.0f}\".format(gbm_val_rmse))\n", 209 | "print(\"Validation RMSE for Mean Prediction of 2 Models: {:,.0f}\".format(mean_2model_val_rmse))" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 5, 215 | "id": "87e88081", 216 | "metadata": { 217 | "execution": { 218 | "iopub.execute_input": "2021-08-15T03:43:20.917251Z", 219 | "iopub.status.busy": "2021-08-15T03:43:20.916646Z", 220 | "iopub.status.idle": "2021-08-15T03:43:20.918748Z", 221 | "shell.execute_reply": "2021-08-15T03:43:20.919237Z", 222 | "shell.execute_reply.started": "2021-08-15T02:52:49.030875Z" 223 | }, 224 | "papermill": { 225 | "duration": 0.016227, 226 | "end_time": "2021-08-15T03:43:20.919401", 227 | "exception": false, 228 | "start_time": "2021-08-15T03:43:20.903174", 229 | "status": "completed" 230 | }, 231 | "tags": [] 232 | }, 233 | "outputs": [], 234 | "source": [ 235 | "#?RandomForestRegressor " 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 6, 241 | "id": "4b3573a6", 242 | "metadata": { 243 | "execution": { 244 | "iopub.execute_input": "2021-08-15T03:43:20.941585Z", 245 | "iopub.status.busy": "2021-08-15T03:43:20.940666Z", 246 | "iopub.status.idle": "2021-08-15T03:43:30.827997Z", 247 | "shell.execute_reply": "2021-08-15T03:43:30.828507Z", 248 | "shell.execute_reply.started": "2021-08-15T03:41:31.585294Z" 249 | }, 250 | "papermill": { 251 | "duration": 9.899838, 252 | "end_time": "2021-08-15T03:43:30.828673", 253 | "exception": false, 254 | "start_time": "2021-08-15T03:43:20.928835", 255 | "status": "completed" 256 | }, 257 | "tags": [] 258 | }, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "text/plain": [ 263 | "GradientBoostingRegressor(n_estimators=500, random_state=1)" 264 | ] 265 | }, 266 | "execution_count": 6, 267 | "metadata": {}, 268 | "output_type": "execute_result" 269 | } 270 | ], 271 | "source": [ 272 | "# To improve accuracy, refit both models on all of the training data\n", 273 | "rf_model.fit(X, y)\n", 274 | "gbm_model.fit(X, y)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 7, 280 | "id": "701f28ab", 281 | "metadata": { 282 | "execution": { 283 | "iopub.execute_input": "2021-08-15T03:43:30.853856Z", 284 | "iopub.status.busy": "2021-08-15T03:43:30.853151Z", 285 | "iopub.status.idle": "2021-08-15T03:43:31.128150Z", 286 | "shell.execute_reply": "2021-08-15T03:43:31.128649Z", 287 | "shell.execute_reply.started": "2021-08-15T03:42:36.364752Z" 288 | }, 289 | "papermill": { 290 | "duration": 0.290491, 291 | "end_time": "2021-08-15T03:43:31.128831", 292 | "exception": false, 293 | "start_time": "2021-08-15T03:43:30.838340", 294 | "status": "completed" 295 | }, 296 | "tags": [] 297 | }, 298 | "outputs": [], 299 | "source": [ 300 | "# path to file you will use for predictions\n", 301 | "test_data_path = '../input/home-data-for-ml-course/test.csv'\n", 302 | "\n", 303 | "# read test data file using pandas\n", 304 | "test_data = pd.read_csv(test_data_path)\n", 305 | "test_data = test_data.fillna(-1)\n", 306 | "# create test_X which comes from test_data but includes only the columns you used for prediction.\n", 307 | "# The list of columns is stored in a variable called features\n", 308 | "test_X = test_data[features]\n", 309 | "\n", 310 | "# make
predictions which we will submit. \n", 311 | "test_preds1 = rf_model.predict(test_X)\n", 312 | "test_preds2 = gbm_model.predict(test_X)\n", 313 | "\n", 314 | "test_preds = (test_preds1 + test_preds2)/2" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 8, 320 | "id": "4a01247c", 321 | "metadata": { 322 | "execution": { 323 | "iopub.execute_input": "2021-08-15T03:43:31.154336Z", 324 | "iopub.status.busy": "2021-08-15T03:43:31.153705Z", 325 | "iopub.status.idle": "2021-08-15T03:43:31.165500Z", 326 | "shell.execute_reply": "2021-08-15T03:43:31.164833Z", 327 | "shell.execute_reply.started": "2021-08-15T03:42:39.697331Z" 328 | }, 329 | "papermill": { 330 | "duration": 0.027107, 331 | "end_time": "2021-08-15T03:43:31.165641", 332 | "exception": false, 333 | "start_time": "2021-08-15T03:43:31.138534", 334 | "status": "completed" 335 | }, 336 | "tags": [] 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "# Run the code to save predictions in the format used for competition scoring\n", 341 | "\n", 342 | "output = pd.DataFrame({'Id': test_data.Id,\n", 343 | " 'SalePrice': test_preds})\n", 344 | "output.to_csv('submission.csv', index=False)" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 9, 350 | "id": "795c92eb", 351 | "metadata": { 352 | "execution": { 353 | "iopub.execute_input": "2021-08-15T03:43:31.190219Z", 354 | "iopub.status.busy": "2021-08-15T03:43:31.189600Z", 355 | "iopub.status.idle": "2021-08-15T03:43:31.201419Z", 356 | "shell.execute_reply": "2021-08-15T03:43:31.200904Z", 357 | "shell.execute_reply.started": "2021-08-15T03:42:41.777270Z" 358 | }, 359 | "papermill": { 360 | "duration": 0.026074, 361 | "end_time": "2021-08-15T03:43:31.201571", 362 | "exception": false, 363 | "start_time": "2021-08-15T03:43:31.175497", 364 | "status": "completed" 365 | }, 366 | "tags": [] 367 | }, 368 | "outputs": [ 369 | { 370 | "data": { 371 | "text/html": [ 372 | "
\n", 373 | "\n", 386 | "\n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | "
IdSalePrice
01461127695.257638
11462158944.116706
21463177126.990807
31464191648.425977
41465196206.122025
\n", 422 | "
" 423 | ], 424 | "text/plain": [ 425 | " Id SalePrice\n", 426 | "0 1461 127695.257638\n", 427 | "1 1462 158944.116706\n", 428 | "2 1463 177126.990807\n", 429 | "3 1464 191648.425977\n", 430 | "4 1465 196206.122025" 431 | ] 432 | }, 433 | "execution_count": 9, 434 | "metadata": {}, 435 | "output_type": "execute_result" 436 | } 437 | ], 438 | "source": [ 439 | "output.head()" 440 | ] 441 | } 442 | ], 443 | "metadata": { 444 | "kernelspec": { 445 | "display_name": "Python 3", 446 | "language": "python", 447 | "name": "python3" 448 | }, 449 | "language_info": { 450 | "codemirror_mode": { 451 | "name": "ipython", 452 | "version": 3 453 | }, 454 | "file_extension": ".py", 455 | "mimetype": "text/x-python", 456 | "name": "python", 457 | "nbconvert_exporter": "python", 458 | "pygments_lexer": "ipython3", 459 | "version": "3.8.8" 460 | }, 461 | "papermill": { 462 | "default_parameters": {}, 463 | "duration": 28.760462, 464 | "end_time": "2021-08-15T03:43:32.700610", 465 | "environment_variables": {}, 466 | "exception": null, 467 | "input_path": "__notebook__.ipynb", 468 | "output_path": "__notebook__.ipynb", 469 | "parameters": {}, 470 | "start_time": "2021-08-15T03:43:03.940148", 471 | "version": "2.3.3" 472 | } 473 | }, 474 | "nbformat": 4, 475 | "nbformat_minor": 5 476 | } 477 | -------------------------------------------------------------------------------- /Code/A6_Kaggle/Day12_Housing Prices Competition.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "07d7bd2d", 7 | "metadata": { 8 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", 10 | "execution": { 11 | "iopub.execute_input": "2021-08-15T03:43:11.802107Z", 12 | "iopub.status.busy": "2021-08-15T03:43:11.800938Z", 13 | "iopub.status.idle": "2021-08-15T03:43:11.814520Z", 14 | "shell.execute_reply": "2021-08-15T03:43:11.815049Z", 15 | "shell.execute_reply.started": "2021-08-15T02:46:18.327501Z" 16 | }, 17 | "papermill": { 18 | "duration": 0.02756, 19 | "end_time": "2021-08-15T03:43:11.815353", 20 | "exception": false, 21 | "start_time": "2021-08-15T03:43:11.787793", 22 | "status": "completed" 23 | }, 24 | "tags": [] 25 | }, 26 | "outputs": [ 27 | { 28 | "name": "stdout", 29 | "output_type": "stream", 30 | "text": [ 31 | "/kaggle/input/home-data-for-ml-course/sample_submission.csv\n", 32 | "/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz\n", 33 | "/kaggle/input/home-data-for-ml-course/train.csv.gz\n", 34 | "/kaggle/input/home-data-for-ml-course/data_description.txt\n", 35 | "/kaggle/input/home-data-for-ml-course/test.csv.gz\n", 36 | "/kaggle/input/home-data-for-ml-course/train.csv\n", 37 | "/kaggle/input/home-data-for-ml-course/test.csv\n" 38 | ] 39 | } 40 | ], 41 | "source": [ 42 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 43 | "# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n", 44 | "# For example, here's several helpful packages to load\n", 45 | "\n", 46 | "import numpy as np # linear algebra\n", 47 | "import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n", 48 | "\n", 49 | "# Input data files are available in the read-only \"../input/\" directory\n", 50 | "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n", 51 | "\n", 52 | "import os\n", 53 | "for dirname, _, filenames in os.walk('/kaggle/input'):\n", 54 | " for filename in filenames:\n", 55 | " print(os.path.join(dirname, filename))\n", 56 | "\n", 57 | "# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n", 58 | "# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "id": "7def9f9a", 65 | "metadata": { 66 | "execution": { 67 | "iopub.execute_input": "2021-08-15T03:43:11.837576Z", 68 | "iopub.status.busy": "2021-08-15T03:43:11.836898Z", 69 | "iopub.status.idle": "2021-08-15T03:43:13.077546Z", 70 | "shell.execute_reply": "2021-08-15T03:43:13.076802Z", 71 | "shell.execute_reply.started": "2021-08-15T03:10:19.101028Z" 72 | }, 73 | "papermill": { 74 | "duration": 1.252221, 75 | "end_time": "2021-08-15T03:43:13.077693", 76 | "exception": false, 77 | "start_time": "2021-08-15T03:43:11.825472", 78 | "status": "completed" 79 | }, 80 | "tags": [] 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# Import helpful libraries\n", 85 | "import pandas as pd\n", 86 | "import numpy as np\n", 87 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", 88 | "from sklearn.metrics import mean_absolute_error, mean_squared_error #Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.\n", 89 | "from sklearn.model_selection import train_test_split\n", 90 | "\n", 91 | "# Load the data, and separate the target\n", 92 | "iowa_file_path = '../input/home-data-for-ml-course/train.csv'\n", 93 | "home_data = pd.read_csv(iowa_file_path)\n", 94 | "y = home_data.SalePrice\n", 95 | "\n", 96 | "# Create X (After completing the exercise, you can return to modify this line!)\n", 97 | "#features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\n" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 3, 103 | "id": "fed68f38", 104 | "metadata": { 105 | "execution": { 106 | "iopub.execute_input": "2021-08-15T03:43:13.101919Z", 107 | "iopub.status.busy": "2021-08-15T03:43:13.101196Z", 108 | "iopub.status.idle": "2021-08-15T03:43:13.104152Z", 109 | "shell.execute_reply": "2021-08-15T03:43:13.104647Z", 110 | "shell.execute_reply.started": "2021-08-15T03:01:48.233770Z" 111 | }, 112 | "papermill": { 113 | "duration": 0.017924, 114 | "end_time": "2021-08-15T03:43:13.104807", 115 | "exception": false, 116 | "start_time": "2021-08-15T03:43:13.086883", 117 | "status": "completed" 118 | }, 119 | "tags": [] 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "# Create X (After completing the exercise, you can return to modify this line!)\n", 124 | "features = [\n", 125 | " 'MSSubClass',\n", 126 | " 'LotArea',\n", 127 | " 'OverallQual',\n", 128 | " 'OverallCond',\n", 129 | " 'YearBuilt',\n", 130 | " 'YearRemodAdd', \n", 131 | " '1stFlrSF',\n", 132 | " '2ndFlrSF' ,\n", 133 | " 'LowQualFinSF',\n", 134 | " 'GrLivArea',\n", 135 | " 'FullBath',\n", 136 | " 'HalfBath',\n", 137 | " 'BedroomAbvGr',\n", 138 | " 'KitchenAbvGr', \n", 139 | " 
'TotRmsAbvGrd',\n", 140 | " 'Fireplaces', \n", 141 | " 'WoodDeckSF' ,\n", 142 | " 'OpenPorchSF',\n", 143 | " 'EnclosedPorch',\n", 144 | " '3SsnPorch', \n", 145 | " 'ScreenPorch',\n", 146 | " 'PoolArea',\n", 147 | " 'MiscVal',\n", 148 | " 'MoSold',\n", 149 | " 'YrSold'\n", 150 | "]" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 4, 156 | "id": "ac795a42", 157 | "metadata": { 158 | "execution": { 159 | "iopub.execute_input": "2021-08-15T03:43:13.126743Z", 160 | "iopub.status.busy": "2021-08-15T03:43:13.126039Z", 161 | "iopub.status.idle": "2021-08-15T03:43:20.892772Z", 162 | "shell.execute_reply": "2021-08-15T03:43:20.893262Z", 163 | "shell.execute_reply.started": "2021-08-15T03:40:09.224892Z" 164 | }, 165 | "papermill": { 166 | "duration": 7.779176, 167 | "end_time": "2021-08-15T03:43:20.893480", 168 | "exception": false, 169 | "start_time": "2021-08-15T03:43:13.114304", 170 | "status": "completed" 171 | }, 172 | "tags": [] 173 | }, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "Validation RMSE for Random Forest Model: 26,895\n", 180 | "Validation RMSE for Gradient Boosting Model: 24,597\n", 181 | "Validation RMSE for Mean Prediction of 2 Models: 23,834\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "# Select columns corresponding to features, and preview the data\n", 187 | "X = home_data[features]\n", 188 | "\n", 189 | "\n", 190 | "# Split into validation and training data\n", 191 | "train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)\n", 192 | "\n", 193 | "# Define a random forest model\n", 194 | "rf_model = RandomForestRegressor(random_state=1, n_estimators=700)\n", 195 | "rf_model.fit(train_X, train_y)\n", 196 | "rf_val_predictions = rf_model.predict(val_X)\n", 197 | "rf_val_rmse = np.sqrt(mean_squared_error(rf_val_predictions, val_y))\n", 198 | "\n", 199 | "gbm_model = GradientBoostingRegressor(random_state=1, n_estimators=500)\n", 200 | "gbm_model.fit(train_X, train_y)\n", 201 | "gbm_val_predictions = gbm_model.predict(val_X)\n", 202 | "gbm_val_rmse = np.sqrt(mean_squared_error(gbm_val_predictions, val_y))\n", 203 | "\n", 204 | "mean_2model_val_predictions = (rf_val_predictions + gbm_val_predictions)/2\n", 205 | "mean_2model_val_rmse = np.sqrt(mean_squared_error(mean_2model_val_predictions, val_y))\n", 206 | "\n", 207 | "print(\"Validation RMSE for Random Forest Model: {:,.0f}\".format(rf_val_rmse))\n", 208 | "print(\"Validation RMSE for Gradient Boosting Model: {:,.0f}\".format(gbm_val_rmse))\n", 209 | "print(\"Validation RMSE for Mean Prediction of 2 Models: {:,.0f}\".format(mean_2model_val_rmse))" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 5, 215 | "id": "87e88081", 216 | "metadata": { 217 | "execution": { 218 | "iopub.execute_input": "2021-08-15T03:43:20.917251Z", 219 | "iopub.status.busy": "2021-08-15T03:43:20.916646Z", 220 | "iopub.status.idle": "2021-08-15T03:43:20.918748Z", 221 | "shell.execute_reply": "2021-08-15T03:43:20.919237Z", 222 | "shell.execute_reply.started": "2021-08-15T02:52:49.030875Z" 223 | }, 224 | "papermill": { 225 | "duration": 0.016227, 226 | "end_time": "2021-08-15T03:43:20.919401", 227 | "exception": false, 228 | "start_time": "2021-08-15T03:43:20.903174", 229 | "status": "completed" 230 | }, 231 | "tags": [] 232 | }, 233 | "outputs": [], 234 | "source": [ 235 | "#?RandomForestRegressor " 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 6, 241 | "id": "4b3573a6", 242 | "metadata": { 
243 | "execution": { 244 | "iopub.execute_input": "2021-08-15T03:43:20.941585Z", 245 | "iopub.status.busy": "2021-08-15T03:43:20.940666Z", 246 | "iopub.status.idle": "2021-08-15T03:43:30.827997Z", 247 | "shell.execute_reply": "2021-08-15T03:43:30.828507Z", 248 | "shell.execute_reply.started": "2021-08-15T03:41:31.585294Z" 249 | }, 250 | "papermill": { 251 | "duration": 9.899838, 252 | "end_time": "2021-08-15T03:43:30.828673", 253 | "exception": false, 254 | "start_time": "2021-08-15T03:43:20.928835", 255 | "status": "completed" 256 | }, 257 | "tags": [] 258 | }, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "text/plain": [ 263 | "GradientBoostingRegressor(n_estimators=500, random_state=1)" 264 | ] 265 | }, 266 | "execution_count": 6, 267 | "metadata": {}, 268 | "output_type": "execute_result" 269 | } 270 | ], 271 | "source": [ 272 | "# To improve accuracy, create a new Random Forest model which you will train on all training data\n", 273 | "rf_model.fit(X,y)\n", 274 | "gbm_model.fit(X,y)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 7, 280 | "id": "701f28ab", 281 | "metadata": { 282 | "execution": { 283 | "iopub.execute_input": "2021-08-15T03:43:30.853856Z", 284 | "iopub.status.busy": "2021-08-15T03:43:30.853151Z", 285 | "iopub.status.idle": "2021-08-15T03:43:31.128150Z", 286 | "shell.execute_reply": "2021-08-15T03:43:31.128649Z", 287 | "shell.execute_reply.started": "2021-08-15T03:42:36.364752Z" 288 | }, 289 | "papermill": { 290 | "duration": 0.290491, 291 | "end_time": "2021-08-15T03:43:31.128831", 292 | "exception": false, 293 | "start_time": "2021-08-15T03:43:30.838340", 294 | "status": "completed" 295 | }, 296 | "tags": [] 297 | }, 298 | "outputs": [], 299 | "source": [ 300 | "# path to file you will use for predictions\n", 301 | "test_data_path = '../input/home-data-for-ml-course/test.csv'\n", 302 | "\n", 303 | "# read test data file using pandas\n", 304 | "test_data = pd.read_csv(test_data_path)\n", 305 | "test_data = test_data.fillna(-1)\n", 306 | "# create test_X which comes from test_data but includes only the columns you used for prediction.\n", 307 | "# The list of columns is stored in a variable called features\n", 308 | "test_X = test_data[features]\n", 309 | "\n", 310 | "# make predictions which we will submit. 
\n", 311 | "test_preds1 = rf_model.predict(test_X)\n", 312 | "test_preds2 = gbm_model.predict(test_X)\n", 313 | "\n", 314 | "test_preds = (test_preds1 + test_preds2)/2" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 8, 320 | "id": "4a01247c", 321 | "metadata": { 322 | "execution": { 323 | "iopub.execute_input": "2021-08-15T03:43:31.154336Z", 324 | "iopub.status.busy": "2021-08-15T03:43:31.153705Z", 325 | "iopub.status.idle": "2021-08-15T03:43:31.165500Z", 326 | "shell.execute_reply": "2021-08-15T03:43:31.164833Z", 327 | "shell.execute_reply.started": "2021-08-15T03:42:39.697331Z" 328 | }, 329 | "papermill": { 330 | "duration": 0.027107, 331 | "end_time": "2021-08-15T03:43:31.165641", 332 | "exception": false, 333 | "start_time": "2021-08-15T03:43:31.138534", 334 | "status": "completed" 335 | }, 336 | "tags": [] 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "# Run the code to save predictions in the format used for competition scoring\n", 341 | "\n", 342 | "output = pd.DataFrame({'Id': test_data.Id,\n", 343 | " 'SalePrice': test_preds})\n", 344 | "output.to_csv('submission.csv', index=False)" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 9, 350 | "id": "795c92eb", 351 | "metadata": { 352 | "execution": { 353 | "iopub.execute_input": "2021-08-15T03:43:31.190219Z", 354 | "iopub.status.busy": "2021-08-15T03:43:31.189600Z", 355 | "iopub.status.idle": "2021-08-15T03:43:31.201419Z", 356 | "shell.execute_reply": "2021-08-15T03:43:31.200904Z", 357 | "shell.execute_reply.started": "2021-08-15T03:42:41.777270Z" 358 | }, 359 | "papermill": { 360 | "duration": 0.026074, 361 | "end_time": "2021-08-15T03:43:31.201571", 362 | "exception": false, 363 | "start_time": "2021-08-15T03:43:31.175497", 364 | "status": "completed" 365 | }, 366 | "tags": [] 367 | }, 368 | "outputs": [ 369 | { 370 | "data": { 371 | "text/html": [ 372 | "
\n", 373 | "\n", 386 | "\n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | "
IdSalePrice
01461127695.257638
11462158944.116706
21463177126.990807
31464191648.425977
41465196206.122025
\n", 422 | "
" 423 | ], 424 | "text/plain": [ 425 | " Id SalePrice\n", 426 | "0 1461 127695.257638\n", 427 | "1 1462 158944.116706\n", 428 | "2 1463 177126.990807\n", 429 | "3 1464 191648.425977\n", 430 | "4 1465 196206.122025" 431 | ] 432 | }, 433 | "execution_count": 9, 434 | "metadata": {}, 435 | "output_type": "execute_result" 436 | } 437 | ], 438 | "source": [ 439 | "output.head()" 440 | ] 441 | } 442 | ], 443 | "metadata": { 444 | "kernelspec": { 445 | "display_name": "Python 3", 446 | "language": "python", 447 | "name": "python3" 448 | }, 449 | "language_info": { 450 | "codemirror_mode": { 451 | "name": "ipython", 452 | "version": 3 453 | }, 454 | "file_extension": ".py", 455 | "mimetype": "text/x-python", 456 | "name": "python", 457 | "nbconvert_exporter": "python", 458 | "pygments_lexer": "ipython3", 459 | "version": "3.8.8" 460 | }, 461 | "papermill": { 462 | "default_parameters": {}, 463 | "duration": 28.760462, 464 | "end_time": "2021-08-15T03:43:32.700610", 465 | "environment_variables": {}, 466 | "exception": null, 467 | "input_path": "__notebook__.ipynb", 468 | "output_path": "__notebook__.ipynb", 469 | "parameters": {}, 470 | "start_time": "2021-08-15T03:43:03.940148", 471 | "version": "2.3.3" 472 | } 473 | }, 474 | "nbformat": 4, 475 | "nbformat_minor": 5 476 | } 477 | -------------------------------------------------------------------------------- /Code/A6_Seaborn/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/A6_Seaborn/.DS_Store -------------------------------------------------------------------------------- /Code/A6_Seaborn/data/cereal.csv: -------------------------------------------------------------------------------- 1 | name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating 2 | 100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973 3 | 100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679 4 | All-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505 5 | All-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912 6 | Almond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843 7 | Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1,0.75,29.509541 8 | Apple Jacks,K,C,110,2,0,125,1,11,14,30,25,2,1,1,33.174094 9 | Basic 4,G,C,130,3,2,210,2,18,8,100,25,3,1.33,0.75,37.038562 10 | Bran Chex,R,C,90,2,1,200,4,15,6,125,25,1,1,0.67,49.120253 11 | Bran Flakes,P,C,90,3,0,210,5,13,5,190,25,3,1,0.67,53.313813 12 | Cap'n'Crunch,Q,C,120,1,2,220,0,12,12,35,25,2,1,0.75,18.042851 13 | Cheerios,G,C,110,6,2,290,2,17,1,105,25,1,1,1.25,50.764999 14 | Cinnamon Toast Crunch,G,C,120,1,3,210,0,13,9,45,25,2,1,0.75,19.823573 15 | Clusters,G,C,110,3,2,140,2,13,7,105,25,3,1,0.5,40.400208 16 | Cocoa Puffs,G,C,110,1,1,180,0,12,13,55,25,2,1,1,22.736446 17 | Corn Chex,R,C,110,2,0,280,0,22,3,25,25,1,1,1,41.445019 18 | Corn Flakes,K,C,100,2,0,290,1,21,2,35,25,1,1,1,45.863324 19 | Corn Pops,K,C,110,1,0,90,1,13,12,20,25,2,1,1,35.782791 20 | Count Chocula,G,C,110,1,1,180,0,12,13,65,25,2,1,1,22.396513 21 | Cracklin' Oat Bran,K,C,110,3,3,140,4,10,7,160,25,3,1,0.5,40.448772 22 | Cream of Wheat (Quick),N,H,100,3,0,80,1,21,0,-1,0,2,1,1,64.533816 23 | Crispix,K,C,110,2,0,220,1,21,3,30,25,3,1,1,46.895644 24 | Crispy Wheat & Raisins,G,C,100,2,1,140,2,11,10,120,25,3,1,0.75,36.176196 25 | Double Chex,R,C,100,2,0,190,1,18,5,80,25,3,1,0.75,44.330856 26 | Froot 
Loops,K,C,110,2,1,125,1,11,13,30,25,2,1,1,32.207582 27 | Frosted Flakes,K,C,110,1,0,200,1,14,11,25,25,1,1,0.75,31.435973 28 | Frosted Mini-Wheats,K,C,100,3,0,0,3,14,7,100,25,2,1,0.8,58.345141 29 | Fruit & Fibre Dates; Walnuts; and Oats,P,C,120,3,2,160,5,12,10,200,25,3,1.25,0.67,40.917047 30 | Fruitful Bran,K,C,120,3,0,240,5,14,12,190,25,3,1.33,0.67,41.015492 31 | Fruity Pebbles,P,C,110,1,1,135,0,13,12,25,25,2,1,0.75,28.025765 32 | Golden Crisp,P,C,100,2,0,45,0,11,15,40,25,1,1,0.88,35.252444 33 | Golden Grahams,G,C,110,1,1,280,0,15,9,45,25,2,1,0.75,23.804043 34 | Grape Nuts Flakes,P,C,100,3,1,140,3,15,5,85,25,3,1,0.88,52.076897 35 | Grape-Nuts,P,C,110,3,0,170,3,17,3,90,25,3,1,0.25,53.371007 36 | Great Grains Pecan,P,C,120,3,3,75,3,13,4,100,25,3,1,0.33,45.811716 37 | Honey Graham Ohs,Q,C,120,1,2,220,1,12,11,45,25,2,1,1,21.871292 38 | Honey Nut Cheerios,G,C,110,3,1,250,1.5,11.5,10,90,25,1,1,0.75,31.072217 39 | Honey-comb,P,C,110,1,0,180,0,14,11,35,25,1,1,1.33,28.742414 40 | Just Right Crunchy Nuggets,K,C,110,2,1,170,1,17,6,60,100,3,1,1,36.523683 41 | Just Right Fruit & Nut,K,C,140,3,1,170,2,20,9,95,100,3,1.3,0.75,36.471512 42 | Kix,G,C,110,2,1,260,0,21,3,40,25,2,1,1.5,39.241114 43 | Life,Q,C,100,4,2,150,2,12,6,95,25,2,1,0.67,45.328074 44 | Lucky Charms,G,C,110,2,1,180,0,12,12,55,25,2,1,1,26.734515 45 | Maypo,A,H,100,4,1,0,0,16,3,95,25,2,1,1,54.850917 46 | Muesli Raisins; Dates; & Almonds,R,C,150,4,3,95,3,16,11,170,25,3,1,1,37.136863 47 | Muesli Raisins; Peaches; & Pecans,R,C,150,4,3,150,3,16,11,170,25,3,1,1,34.139765 48 | Mueslix Crispy Blend,K,C,160,3,2,150,3,17,13,160,25,3,1.5,0.67,30.313351 49 | Multi-Grain Cheerios,G,C,100,2,1,220,2,15,6,90,25,1,1,1,40.105965 50 | Nut&Honey Crunch,K,C,120,2,1,190,0,15,9,40,25,2,1,0.67,29.924285 51 | Nutri-Grain Almond-Raisin,K,C,140,3,2,220,3,21,7,130,25,3,1.33,0.67,40.692320 52 | Nutri-grain Wheat,K,C,90,3,0,170,3,18,2,90,25,3,1,1,59.642837 53 | Oatmeal Raisin Crisp,G,C,130,3,2,170,1.5,13.5,10,120,25,3,1.25,0.5,30.450843 54 | Post Nat. 
Raisin Bran,P,C,120,3,1,200,6,11,14,260,25,3,1.33,0.67,37.840594 55 | Product 19,K,C,100,3,0,320,1,20,3,45,100,3,1,1,41.503540 56 | Puffed Rice,Q,C,50,1,0,0,0,13,0,15,0,3,0.5,1,60.756112 57 | Puffed Wheat,Q,C,50,2,0,0,1,10,0,50,0,3,0.5,1,63.005645 58 | Quaker Oat Squares,Q,C,100,4,1,135,2,14,6,110,25,3,1,0.5,49.511874 59 | Quaker Oatmeal,Q,H,100,5,2,0,2.7,-1,-1,110,0,1,1,0.67,50.828392 60 | Raisin Bran,K,C,120,3,1,210,5,14,12,240,25,2,1.33,0.75,39.259197 61 | Raisin Nut Bran,G,C,100,3,2,140,2.5,10.5,8,140,25,3,1,0.5,39.703400 62 | Raisin Squares,K,C,90,2,0,0,2,15,6,110,25,3,1,0.5,55.333142 63 | Rice Chex,R,C,110,1,0,240,0,23,2,30,25,1,1,1.13,41.998933 64 | Rice Krispies,K,C,110,2,0,290,0,22,3,35,25,1,1,1,40.560159 65 | Shredded Wheat,N,C,80,2,0,0,3,16,0,95,0,1,0.83,1,68.235885 66 | Shredded Wheat 'n'Bran,N,C,90,3,0,0,4,19,0,140,0,1,1,0.67,74.472949 67 | Shredded Wheat spoon size,N,C,90,3,0,0,3,20,0,120,0,1,1,0.67,72.801787 68 | Smacks,K,C,110,2,1,70,1,9,15,40,25,2,1,0.75,31.230054 69 | Special K,K,C,110,6,0,230,1,16,3,55,25,1,1,1,53.131324 70 | Strawberry Fruit Wheats,N,C,90,2,0,15,3,15,5,90,25,2,1,1,59.363993 71 | Total Corn Flakes,G,C,110,2,1,200,0,21,3,35,100,3,1,1,38.839746 72 | Total Raisin Bran,G,C,140,3,1,190,4,15,14,230,100,3,1.5,1,28.592785 73 | Total Whole Grain,G,C,100,3,1,200,3,16,3,110,100,3,1,1,46.658844 74 | Triples,G,C,110,2,1,250,0,21,3,60,25,3,1,0.75,39.106174 75 | Trix,G,C,110,1,1,140,0,13,12,25,25,2,1,1,27.753301 76 | Wheat Chex,R,C,100,3,1,230,3,17,3,115,25,1,1,0.67,49.787445 77 | Wheaties,G,C,100,3,1,200,3,17,3,110,25,1,1,1,51.592193 78 | Wheaties Honey Gold,G,C,110,2,1,200,1,16,8,60,25,1,1,0.75,36.187559 79 | -------------------------------------------------------------------------------- /Code/P00_Project_Template/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/P00_Project_Template/.DS_Store -------------------------------------------------------------------------------- /Code/P00_Project_Template/.ipynb_checkpoints/Project_Template_Heart_Disease_Classification-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 5 6 | } 7 | -------------------------------------------------------------------------------- /Code/P00_Project_Template/data/heart.csv: -------------------------------------------------------------------------------- 1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target 2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1 3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1 4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1 5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1 6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1 7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1 8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1 9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1 10 | 52,1,2,172,199,1,1,162,0,0.5,2,0,3,1 11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1 12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1 13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1 14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1 15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1 16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1 17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1 18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1 19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1 20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1 21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1 22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1 23 | 
44,1,2,130,233,0,1,179,1,0.4,2,0,2,1 24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1 25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1 26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1 27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1 28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1 29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1 30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1 31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1 32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1 33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1 34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1 35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1 36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1 37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1 38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1 39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1 40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1 41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1 42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1 43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1 44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1 45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1 46 | 39,1,2,140,321,0,0,182,0,0,2,0,2,1 47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1 48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1 49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1 50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1 51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1 52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1 53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1 54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1 55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1 56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1 57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1 58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1 59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1 60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1 61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1 62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1 63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1 64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1 65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1 66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1 67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1 68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1 69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1 70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1 71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1 72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1 73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1 74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1 75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1 76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1 77 | 55,0,1,135,250,0,0,161,0,1.4,1,0,2,1 78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1 79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1 80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1 81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1 82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1 83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1 84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1 85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1 86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1 87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1 88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1 89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1 90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1 91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1 92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1 93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1 94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1 95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1 96 | 45,0,1,112,160,0,1,138,0,0,1,0,2,1 97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1 98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1 99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1 100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1 101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1 102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1 103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1 104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1 105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1 106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1 107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1 108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1 109 | 
45,0,0,138,236,0,0,152,1,0.2,1,0,2,1 110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1 111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1 112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1 113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1 114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1 115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1 116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1 117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1 118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1 119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1 120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1 121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1 122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1 123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1 124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1 125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1 126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1 127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1 128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1 129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1 130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1 131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1 132 | 54,0,2,160,201,0,1,163,0,0,2,1,2,1 133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1 134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1 135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1 136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1 137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1 138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1 139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1 140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1 141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1 142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1 143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1 144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1 145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1 146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1 147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1 148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1 149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1 150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1 151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1 152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1 153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1 154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1 155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1 156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1 157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1 158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1 159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1 160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1 161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1 162 | 56,1,1,120,240,0,1,169,0,0,0,0,2,1 163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1 164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1 165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0 168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0 169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0 170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0 171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0 172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0 173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0 174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0 175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0 176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0 177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0 178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0 179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0 180 | 43,1,0,120,177,0,0,120,1,2.5,1,0,3,0 181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0 182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0 183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0 184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0 185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0 186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0 187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0 188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0 189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0 190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0 191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0 192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0 193 | 
58,1,0,128,216,0,0,131,1,2.2,1,3,3,0 194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0 195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0 196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0 197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0 198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0 199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0 200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0 201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0 202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0 203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0 204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0 205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0 206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0 207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0 208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0 209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0 210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0 211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0 212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0 213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0 214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0 215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0 216 | 56,1,0,125,249,1,0,144,1,1.2,1,1,2,0 217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0 218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0 219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0 220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0 221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0 222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0 223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0 224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0 225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0 226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0 227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0 228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0 229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0 230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0 231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0 232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0 233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0 234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0 235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0 236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0 237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0 238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0 239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0 240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0 241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0 242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0 243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0 244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0 245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0 246 | 56,1,0,132,184,0,0,105,1,2.1,1,1,1,0 247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0 248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0 249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0 250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0 251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0 252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0 253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0 254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0 255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0 256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0 257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0 258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0 259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0 260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0 261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0 262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0 263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0 264 | 53,1,0,123,282,0,1,95,1,2,1,2,3,0 265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0 266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0 267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0 268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0 269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0 270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0 271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0 272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0 273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0 274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0 275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0 276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0 277 | 
52,1,0,125,212,0,1,168,0,1,2,2,3,0 278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0 279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0 280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0 281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0 282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0 283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0 284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0 285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0 286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0 287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0 288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0 289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0 290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0 291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0 292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0 293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0 294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0 295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0 296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0 297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0 298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0 299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0 300 | 57,0,0,140,241,0,1,123,1,0.2,1,0,3,0 301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0 302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0 303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0 304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0 305 | -------------------------------------------------------------------------------- /Code/P01_Pre_Processing/Data.csv: -------------------------------------------------------------------------------- 1 | Country,Age,Salary,Purchased 2 | France,44,72000,No 3 | Spain,27,48000,Yes 4 | Germany,30,54000,No 5 | Spain,38,61000,No 6 | Germany,40,,Yes 7 | France,35,58000,Yes 8 | Spain,,52000,No 9 | France,48,79000,Yes 10 | Germany,50,83000,No 11 | France,37,67000,Yes -------------------------------------------------------------------------------- /Code/P02_Linear_Regression/Position_Salaries.csv: -------------------------------------------------------------------------------- 1 | Position,Level,Salary 2 | Business Analyst,1,45000 3 | Junior Consultant,2,50000 4 | Senior Consultant,3,60000 5 | Manager,4,80000 6 | Country Manager,5,110000 7 | Region Manager,6,150000 8 | Partner,7,200000 9 | Senior Partner,8,300000 10 | C-level,9,500000 11 | CEO,10,1000000 -------------------------------------------------------------------------------- /Code/Project/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/.DS_Store -------------------------------------------------------------------------------- /Code/Project/Housing Corporation/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/.DS_Store -------------------------------------------------------------------------------- /Code/Project/Housing Corporation/.ipynb_checkpoints/Icon : -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/.ipynb_checkpoints/Icon -------------------------------------------------------------------------------- /Code/Project/Housing Corporation/Icon : -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/Icon 
-------------------------------------------------------------------------------- /Code/Project/Housing Corporation/datasets/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/.DS_Store -------------------------------------------------------------------------------- /Code/Project/Housing Corporation/datasets/Icon : -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/Icon -------------------------------------------------------------------------------- /Code/Project/Housing Corporation/datasets/housing/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/housing/.DS_Store -------------------------------------------------------------------------------- /Code/Project/Housing Corporation/datasets/housing/Icon : -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Code/Project/Housing Corporation/datasets/housing/Icon -------------------------------------------------------------------------------- /Pages/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Pages/.DS_Store -------------------------------------------------------------------------------- /Pages/A00_Reading_List.md: -------------------------------------------------------------------------------- 1 | # Machine Learning & Data Science Reading List 2 | ## Table of contents 3 | - [AI Application](#ai-application) 4 | - [Machine Learning Interview](#machine-learning-interview) 5 | 6 | 7 | ## AI Application 8 | - [How the biggest companies in the world design Machine Learning-powered applications](https://www.linkedin.com/pulse/how-biggest-companies-world-design-machine-daniel-bourke/?fbclid=IwAR1QdQIaeK72nKjpePNKv6sAZayqCp669lg2EjwfvtBx7v6orN1Kw5QOc5c) 9 | 10 | [(Back to top)](#table-of-contents) 11 | ## Machine Learning Interview 12 | - [Introduction to Machine Learning Interviews Book](https://huyenchip.com/ml-interviews-book/) 13 | 14 | [(Back to top)](#table-of-contents) 15 | -------------------------------------------------------------------------------- /Pages/A01_Interview_Question.md: -------------------------------------------------------------------------------- 1 | # Data Scientist - Interview Questions 2 | 3 | ## Table of contents 4 | - [1. Interview Questions](#1-interview-questions) 5 | - [2. SQL Questions](#2-sql-questions) 6 | 7 | 8 | ## 1. Interview Questions 9 | ### 1.1. Data Scientist 10 | - **[Facebook] Product Generalist** (i.e. solving a business case study) 11 | - How to design the friends you may know feature -- how to recommend friends; 12 | - How can we tell if two users on Instagram are best friends 13 | - How would you build a model to decide what to show a user and how would you evaluate its success 14 | - How would you create a model to find bad sellers on marketplace? 
15 | - **[Facebook] Coding Exercise (in SQL)**: joins (LEFT, RIGHT, UNION), group by, date manipulation
16 | - **[Facebook] Quantitative Analysis**
17 | - How to test out the assumptions; how to decide next steps if the metrics show only positive signals in certain features
18 | - How can you tell if your model is working?
19 | - **[Facebook] Applied Data (stats questions)**: AB Testing
20 |
21 | ### 1.2. Machine Learning Engineer
22 | - **Coding Interview**:
23 | - [Data Structure] Difference between Stack and Queue, Dequeue Implementation, Linked List Reversal
24 | - [Easy] Reverse a linked list, Convert decimal to hexadecimal without using built-in methods (str, int etc.), pairs of numbers that sum up to K
25 | - [Medium] Verify binary search tree
26 | - [Hard] Min edit distance
27 | - **Technical Interview**:
28 | - Fundamental ML questions:
29 | - Non-deep and deep methods
30 | - Basic ML models or algorithms: Formula for gradient descent, Linear and Non-Linear classifiers, K-means, Random Forest, Clustering, Nearest Neighbors, Decision Tree
31 | - Basic DL: Explain how CNN works, Recurrent neural network
32 | - Metric Understanding: ROC
33 | - What is overfitting?
34 | - Difference between Bagging and Boosting
35 | - Regularization: Difference between L1 and L2 regularization
36 | - System Design:
37 | - How to search efficiently
38 | - Given the salaries of people from ten professions and the salary of a new person, design an algorithm to predict the profession of this new person.
39 | - Case Study:
40 | - How would you apply A/B testing to a food ordering service
41 | - How surge pricing works for both customers and drivers
42 | - Implement Huffman code for a given English sentence
43 | - **Interview with Hiring Manager**: explain your Machine Learning projects
44 |
45 | ### 1.3. Grab: Machine Learning Engineer
46 | - General mobility industry and economics oriented questions
47 | - How does surge pricing work for both customers and drivers?
48 |
49 | - Formula for gradient descent
50 | - Supervised and unsupervised ML methods, detailed questions about different classification and clustering algos
51 | - What is overfitting and how do you deal with it?
52 | - How to solve the issue if the features are highly correlated?
53 | - What is a good way to detect anomalies?
54 | - What's the ROC Curve? What does an ROC curve plot?
55 | - What's the difference between bagging and boosting?
56 |
57 | - How do you find the average number of bookings for a given day? What factors do you think will play a crucial role?
58 | - How do you think Grab can implement surge pricing differently from Uber? What factors do you think will play a role here?
59 | ## 2. 
SQL Questions
60 | #### SQL#1: Facebook
61 | ```SQL
62 | Given the following data:
63 |
64 | Table:
65 |     searches
66 | Columns:
67 |     date        STRING   date of the search,
68 |     search_id   INT      the unique identifier of each search,
69 |     user_id     INT      the unique identifier of the searcher,
70 |     age_group   STRING   ('<30', '30-50', '50+'),
71 |     search_query STRING  the text of the search query
72 |
73 | Sample Rows:
74 | date         | search_id | user_id | age_group | search_query
75 | --------------------------------------------------------------------
76 | '2020-01-01' | 101       | 9991    | '<30'     | 'justin bieber'
77 | '2020-01-01' | 102       | 9991    | '<30'     | 'menlo park'
78 | '2020-01-01' | 103       | 5555    | '30-50'   | 'john'
79 | '2020-01-01' | 104       | 1234    | '50+'     | 'funny cats'
80 |
81 |
82 | Table:
83 |     search_results
84 | Columns:
85 |     date        STRING   date of the search action,
86 |     search_id   INT      the unique identifier of each search,
87 |     result_id   INT      the unique identifier of the result,
88 |     result_type STRING   (page, event, group, person, post, etc.),
89 |     clicked     BOOLEAN  did the user click on the result?
90 |
91 | Sample Rows:
92 | date         | search_id | result_id | result_type | clicked
93 | --------------------------------------------------------------------
94 | '2020-01-01' | 101       | 1001      | 'page'      | TRUE
95 | '2020-01-01' | 101       | 1002      | 'event'     | FALSE
96 | '2020-01-01' | 101       | 1003      | 'event'     | FALSE
97 | '2020-01-01' | 101       | 1004      | 'group'     | FALSE
98 |
99 |
100 | Over the last 7 days, how many users made more than 10 searches?
101 |
102 | You notice that the number of users that clicked on a search result
103 | about a Facebook Event increased 10% week-over-week. How would you
104 | investigate? How do you decide if this is a good thing or a bad thing?
105 |
106 | The Events team wants to up-rank Events such that they show up higher
107 | in Search. How would you determine if this is a good idea or not?
108 | ```
109 | [(Back to top)](#table-of-contents)
110 | -------------------------------------------------------------------------------- /Pages/A01_Job_Description.md: -------------------------------------------------------------------------------- 1 | # Data Science Job Description
2 | ## #1 Dell - Junior Analyst, Business Intelligence (Data Science)
3 | - Learn and apply database tools and statistical predictive models by analyzing large datasets with a variety of tools
4 | - Ask and answer questions in large datasets and have a strong desire to create strategies and solutions that challenge and expand the thinking of everyone around you
5 | - Deep dive into data to find answers to yet unknown questions and have a natural desire to go beneath the surface of a problem
6 | - Ask relevant questions and possess the skills to build algorithms necessary to find meaningful answers
7 | - Creatively visualize and effectively communicate data findings and insights in a variety of formats
8 |
9 | ## #2 Facebook - Data Scientist
10 | - Defining new opportunities for product impact; influencing product and sales to solve the most impactful market problems.
11 | - Apply your expertise in quantitative analysis and the presentation of data to see beyond the numbers and understand how our users interact with our growth products.
12 | - Work as a key member of the product team to solve problems and identify trends and opportunities.
13 | - Inform, influence, support, and execute our product decisions and product launches. 
14 | - Set KPIs and goals, design and evaluate experiments, monitor key product metrics, understand root causes of changes in metrics.
15 | - Exploratory analysis to discover new opportunities: understanding ecosystems, user behaviours, and long-term trends; identifying levers to help move key metrics.
16 |
17 | # Essential Requirements
18 | - Core statistical knowledge
19 | - JD: #1
20 | - Programming languages (R, Python, SAS)
21 | - JD: #1, #2
22 | - BI, Data Mining, and Machine Learning experience
23 | - JD: #1
24 | - Proficient in SQL
25 | - JD: #1, #2
26 | - Proven experience leading data-driven projects from definition to execution: defining metrics, experiment design, communicating actionable insights.
27 | - JD: #2
28 | - Experience in Spark (Scala / pySpark)
29 | - JD: #3
30 | # Desirable Requirements
31 | - Self-driven, able to work independently yet act as a team player
32 | - Strong communication skills, willingness to learn and develop the ability to apply data science principles through a business lens
33 | - Ability to prioritize, re-prioritize, and handle multiple competing priorities simultaneously
34 |
35 | # Internship
36 | ## SAP - AI Engineer
37 | #### PURPOSE AND OBJECTIVES
38 |
39 | The Artificial Intelligence CoE team in IES Technology Services is pushing the SAP internal adoption of Machine Learning through all business processes. We are implementing standard product functionality from the SAP AI portfolio as well as custom-built AI scenarios for internal business cases. In this field we have a lot of existing projects on the horizon and a lot of room for creativity and freedom to experiment.
40 |
41 | In Singapore, we are looking for AI Engineer Interns to work on the technical stack and also be involved in end-user and business stakeholder communications.
42 |
43 | #### EXPECTATIONS AND TASKS
44 |
45 | - Collaborate with other AI Scientists, Engineers, Product Owners, etc.
46 | - Communicate with stakeholders to understand SAP business processes.
47 | - Support the end-to-end MLOps lifecycle (Python, Jenkins, Docker, Kubernetes, etc.)
48 | - Apply existing SAP AI or open-sourced solutions to solve business problems.
49 | - Learn the latest advancements in applied NLP and DevOps.
50 |
51 | #### EDUCATION AND QUALIFICATIONS / SKILLS AND COMPETENCIES
52 |
53 | Required skills
54 | - Programming experience in the Python language and packages such as Pandas, scikit-learn, etc.
55 | - Understanding of basic Machine Learning algorithms and concepts
56 | - Experience with development tools like VSCode, PyCharm, Jupyter, Docker, Git, etc.
57 | - Curiosity to learn about different AI applications.
58 | - Able to take ownership of one’s work and collaborate with others.
59 | Preferred skills
60 | - Pursuing an undergraduate or graduate degree
61 | - Driven and dynamic with strong problem-solving skills
62 | - Good communication skills with end-users and business stakeholders
63 | - Additional certificates and courses focusing on machine learning are a plus but not a must
64 |
65 | ## YARA - Data Science (ML Developer)
66 | There are more than 500 million smallholder farms globally. 2.5 billion people depend on Smallholder Communities for their food and livelihoods. Smallholder regions are characterized by low living standards, high rates of illiteracy and low agricultural productivity. Yara's mission is "Responsibly Feed the World and Protect the Planet". Key to achieving this is enabling thriving Smallholder Communities. 
At Yara, the Smallholders Digital Team is part of the Crop and Digital Solutions Unit.
67 |
68 | #### About Crop and Digital Solutions
69 | Yara aims to be the crop nutrition company for the future and is leading the development of sustainable agriculture and digital tools to contribute to solving global agricultural challenges. We have a worldwide presence with sales teams in ~150 countries and around 17,000 employees. Yara Farming Solutions will lead the transformation towards more sustainable and efficient food production by innovating our offerings and the way we work. Crop and Digital Solutions is responsible for developing and scaling new “on-farm” digital and integrated tools and solutions for an efficient and transparent food system.
70 |
71 |
72 | #### Responsibilities
73 | - Support the development of the data resources of the analytics and insights side of the smallholder solutions.
74 | - Research new ways to interpret and utilise the data resources to provide meaningful, actionable insights.
75 | - Work with cross-functional teams across the development centre to support the data needs of different teams.
76 | - Support the development, testing and implementation of various scripts, algorithms and code when necessary.
77 | #### Required Profile
78 |
79 | - Strong personal/professional interest in the LSM segments (developing countries, low-income markets, etc.) and agriculture in general.
80 | - Exposure to Computer Vision and Data Science concepts preferred.
81 | - Prior experience with at least one of Python development, data analysis or machine learning.
82 |
83 | Additional information
84 |
85 | We strive to reflect the diversity in society and encourage all qualified applicants from all backgrounds to apply. We are committed to creating a work environment that supports gender equality and allows combining career progress with the needs of a family or other personal circumstances.
86 |
87 | #### Why us?
88 | - Evolving tech development division of an established agricultural products and services company.
89 | - Explore and develop digital, software and hardware products which provide value to farmers, smallholder communities and the value chain.
90 | - Be part of our mission to build sustainable solutions that benefit humanity and the environment.
91 | - Full-time, permanent and freelance contract options available with competitive remuneration + benefits.
92 | - Support for personal development, training and continuous learning.
93 | - Commitment to using new technologies and frameworks, meetups, and knowledge sharing.
94 |
95 | ## ByteDance - Big Data Engineer
96 | #### About the Ad Data team
97 | The Ad Data team is committed to empowering our global team's monetization products through acquiring, building, and managing key ads data assets and providing scalable data warehouse, product, and service solutions.
98 | #### Responsibilities
99 |
100 | - Responsible for building offline and real-time data pipelines for various advertising businesses;
101 | - Handle data processing requests and execute data model design, implementation and maintenance;
102 | - Research business logic and data architecture design to deliver business value. 
103 |
104 | #### Qualifications
105 |
106 | - Undergraduate or Postgraduate currently pursuing a degree/master's in software development, computer science, computer engineering, information systems or a related technical discipline;
107 | - Strong interest in computer science and internet technology;
108 | - Know relevant technologies of the Hadoop ecosystem, such as the principles of MapReduce, Spark, Hive, and Flink;
109 | - Familiar with SQL and able to use it to perform data analysis, or proficient in Java, Python, Shell and other programming languages for data processing;
110 | - Good at communication, proactive at work, with a strong sense of responsibility and good teamwork ability.
111 |
112 | ## Grab - Data Analytics & Data Science
113 | Our Data Analytics & Data Science team focuses on gaining a true understanding of our users. We apply analytics in product development, which enables us to build the best app ever! Our business analytics team directs itself by following 3 core philosophies.
114 |
115 | We focus on big data, act as a bridge between the online and offline sides of the business, and use data to align our operations with strategic goals. If you love to find new ways of interpreting data then you’re a perfect fit!
116 |
117 | Requirements:
118 |
119 | - Affinity with resolving complex people and business issues
120 | - Full-time student majoring in Computer Science, Engineering, Statistics, Data Science or related fields seeking a matriculated internship
121 | - You are enrolled in a Singapore university, or Singaporean/PR studying abroad
122 | - Previous internship experience in Data Analytics, Computer Science or another related field is an advantage
123 | - Be able to commit full-time from Jan - June 2022, for a minimum of 20 weeks
124 | - Agile and able to work in a fast-paced environment
125 | - Excellent communication, presentation and project management skills
126 | -------------------------------------------------------------------------------- /Pages/A02_Pandas_Cheat_Sheet.md: -------------------------------------------------------------------------------- 1 |
2 | # Pandas Cheat Sheet
3 | # Table of contents
4 | - [Table of contents](#table-of-contents)
5 | - [Import & Export Data](#import-export-data)
6 | - [Getting and knowing](#getting-and-knowing)
7 | - [loc vs iloc](#loc-vs-iloc)
8 | - [Access Rows of Data Frame](#access-rows-of-data-frame)
9 | - [Access Columns of Data Frame](#access-columns-of-data-frame)
10 | - [Manipulating Data](#manipulating-data)
11 | - [Grouping](#grouping)
12 | - [Basic Grouping](#basic-grouping)
13 |
14 |
15 | # Import Export Data
16 | ### Import with Different Separator
17 | ```Python
18 | users = pd.read_csv('user.csv', sep='|')
19 | chipo = pd.read_csv(url, sep = "\t")
20 | ```
21 | pandas-anatomy-of-a-dataframe
22 |
23 | #### Renaming Index
24 | ```Python
25 | users = pd.read_csv('u.user', sep='|', index_col='user_id')
26 | ```
27 | ### Export
28 | ```Python
29 | users.to_csv("exported-users.csv")
30 | ```
31 |
32 | # Getting and knowing
33 | ### shape : Return (Row, Column)
34 | ```Python
35 | df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
36 | 'col3': [5, 6]})
37 | df.shape
38 | (2, 3) # df.shape[0] = 2 rows, df.shape[1] = 3 cols
39 | ```
40 | ### info() : Return index dtype, columns, non-null values & memory usage. 
41 | ```Python
42 | df.info()
43 | ```
44 | - This tells us the dtype of each column and how many non-null values the DataFrame has
45 | ```Python
46 | <class 'pandas.core.frame.DataFrame'>
47 | RangeIndex: 4622 entries, 0 to 4621
48 | Data columns (total 5 columns):
49 |  #   Column              Non-Null Count  Dtype
50 | ---  ------              --------------  -----
51 |  0   order_id            4622 non-null   int64
52 |  1   quantity            4622 non-null   int64
53 |  2   item_name           4622 non-null   object
54 |  3   choice_description  3376 non-null   object
55 |  4   item_price          4622 non-null   object
56 | dtypes: int64(2), object(3)
57 | memory usage: 180.7+ KB
58 | ```
59 |
60 | ### describe() : Generate descriptive statistics.
61 | ```Python
62 | chipo.describe() #Notice: by default, only the numeric columns are returned.
63 | chipo.describe(include = "all") #Notice: include="all" returns descriptive statistics for all columns, not just the numeric ones.
64 | ```
65 |
66 |
67 | ### dtype : Return data type of specific column
68 | - `df.col_name.dtype` returns the data type of that column
69 | ```Python
70 | df.item_price.dtype
71 | #'O' (Python) objects
72 | ```
73 |
74 | - Please note: dtype returns one of the special characters below
75 | ```Python
76 | 'b' boolean
77 | 'i' (signed) integer
78 | 'u' unsigned integer
79 | 'f' floating-point
80 | 'c' complex-floating point
81 | 'O' (Python) objects
82 | 'S', 'a' (byte-)string
83 | 'U' Unicode
84 | 'V' raw data (void)
85 | ```
86 |
87 | ## loc vs iloc
88 | ### loc
89 | - `loc`: is **label-based**, which means that we have to specify the "name of the rows and columns" that we need to filter out.
90 | #### Find all the rows based on 1 or more conditions in a column
91 | ```Python
92 | # select all rows with a condition
93 | data.loc[data.age >= 15]
94 | # select all rows with multiple conditions
95 | data.loc[(data.age >= 12) & (data.gender == 'M')]
96 | ```
97 | ![image](https://user-images.githubusercontent.com/64508435/106067849-7abaec00-613a-11eb-8cbe-f9aa5e2c6202.png)
98 |
99 | #### Select only required columns with conditions
100 | ```Python
101 | # Update the values of multiple columns on selected rows
102 | chipo.loc[(chipo.quantity == 7) & (chipo.item_name == 'Bottled Water'), ['item_name', 'item_price']] = ['Tra Xanh', 0]
103 | # Select only required columns with a condition
104 | chipo.loc[(chipo.quantity > 5), ['item_name', 'quantity', 'item_price']]
105 | ```
106 | Screenshot 2021-01-28 at 7 26 04 AM
107 |
108 | ### iloc
109 | - `iloc`: is **integer index-based**, which means that we have to specify the integer positions of the rows and columns that we need to filter out.
110 | - `.iloc[]` allowed inputs are:
111 | #### Selecting Rows
112 | - An integer, e.g. `dataset.iloc[0]` > returns row 0 as a Series
113 | ```Python
114 | Country France
115 | Age 44
116 | Salary 72000
117 | Purchased No
118 | ```
119 | - A list or array of integers, e.g. `dataset.iloc[[0]]` > returns row 0 in DataFrame format
120 | ```Python
121 | Country Age Salary Purchased
122 | 0 France 44.0 72000.0 No
123 | ```
124 | - A slice object with ints, e.g. 
`dataset.iloc[:3]` > returns rows 0 up to (but not including) row 3 in DataFrame format
125 | ```Python
126 | Country Age Salary Purchased
127 | 0 France 44.0 72000.0 No
128 | 1 Spain 27.0 48000.0 Yes
129 | 2 Germany 30.0 54000.0 No
130 | ```
131 | #### Selecting Rows & Columns
132 | - Select the first 3 rows & every column up to the last column (not included) `X = dataset.iloc[:3, :-1]`
133 | ```Python
134 | Country Age Salary
135 | 0 France 44.0 72000.0
136 | 1 Spain 27.0 48000.0
137 | 2 Germany 30.0 54000.0
138 | ```
139 | ### Numpy representation of DF
140 | - `DataFrame.values`: Return a Numpy representation of the DataFrame (i.e. only the values in the DataFrame will be returned; the axes labels will be removed)
141 | - For ex: `X = dataset.iloc[:3, :-1].values`
142 | ```Python
143 | [['France' 44.0 72000.0]
144 | ['Spain' 27.0 48000.0]
145 | ['Germany' 30.0 54000.0]]
146 | ```
147 |
148 | ## Access Rows of Data Frame
149 | ### Check index of DF
150 | ```Python
151 | df.index
152 | #RangeIndex(start=0, stop=4622, step=1)
153 | ```
154 |
155 | [(Back to top)](#table-of-contents)
156 |
157 | ## Access Columns of Data Frame
158 | ### Print the name of all the columns
159 | ```Python
160 | list(df.columns)
161 | #['order_id', 'quantity', 'item_name', 'choice_description','item_price', 'revenue']
162 | ```
163 | ### Access column
164 | ```Python
165 | # Count how many values are in the column
166 | df.col_name.count()
167 | # Take the mean of the values in the column
168 | df["col_name"].mean()
169 | ```
170 | ### value_counts() : Return a Series containing counts of unique values
171 | ```Python
172 | index = pd.Index([3, 1, 2, 3, 4, np.nan])
173 | #dropna=False will also consider NaN as a unique value
174 | index.value_counts(dropna=False)
175 | #Return:
176 | 3.0 2
177 | 2.0 1
178 | NaN 1
179 | 4.0 1
180 | 1.0 1
181 | dtype: int64
182 | ```
183 | ### Calculate total unique values in a column
184 | ```Python
185 | #How many unique values
186 | index.value_counts().count()
187 |
188 | index.nunique()
189 | #5
190 | ```
191 |
192 | [(Back to top)](#table-of-contents)
193 | # Manipulating Data
194 | ## Missing Values
195 | ### Filling Missing Values with fillna()
196 | - To fill `nan` values with a given value (e.g. the column mean)
197 | ```Python
198 | car_sales_missing["Odometer"].fillna(car_sales_missing["Odometer"].mean(), inplace = True)
199 | ```
200 | ### Dropping Missing Values with dropna()
201 | - To drop rows containing missing values
202 | ```Python
203 | car_sales_missing.dropna(inplace=True)
204 | ```
205 | ## Drop a column
206 |
207 | ```Python
208 | car_sales.drop("Passed road safety", axis = 1) # axis = 1 if you want to drop a column
209 | ```
210 | [(Back to top)](#table-of-contents)
211 | # Grouping
212 | Screenshot 2021-01-23 at 10 47 21 PM
213 |
214 | ## Basic Grouping
215 | - Group by the "item_name" column & take the sum of the "quantity" column
216 | - Method #1 : `df.groupby("item_name")`
217 |
218 | ```Python
219 | df.groupby("item_name")["quantity"].sum()
220 | ```
221 |
222 | ```Python
223 | item_name
224 | Chicken Bowl 761
225 | Chicken Burrito 591
226 | Name: quantity, dtype: int64
227 | ```
228 |
229 | - Method #2: `df.groupby(by=['order_id'])`
230 |
231 | ```Python
232 | order_revenue = df.groupby(by=["order_id"])["revenue"].sum()
233 | ```
234 | [(Back to top)](#table-of-contents)
235 |
236 |
237 |
238 |
239 | -------------------------------------------------------------------------------- /Pages/A03_Numpy_Cheat_Sheet.md: -------------------------------------------------------------------------------- 1 | # Numpy Cheat Sheet
2 | # Table 
of contents 3 | - [Table of contents](#table-of-contents) 4 | - [Introduction to Numpy](#introduction-to-numpy) 5 | - [Numpy Data Types and Attributes](numpy-data-types-and-attributes) 6 | 7 | # Introduction to Numpy 8 | ### Why is Numpy important? 9 | - How many decimal numbers we can store with `n bits` ? 10 | - `n bits` is equal to 3 positions to store 0 & 1. 11 | - Formula: 2^(n) = 8 decimal numbers 12 | - Numpy allow you to specify more precisely number of memory you need for storing the data 13 | ```Python 14 | #Python costs 28 bytes to store x = 5 since it is Integer Object 15 | import sys 16 | x = 5 17 | sys.getsizeof(x) #return 28 - means variable x = 5 costs 28 bytes of memory 18 | 19 | #Numpy : allow you to specify more precisely number of bits (memory) you need for storing the data 20 | np.int8 #8-bit 21 | ``` 22 | - Numpy is **Array Processing** 23 | - Built-in DS in Python `List` NOT optimized for High-Level Processing as List in Python is Object and they will not store elements in separate position in Memory 24 | - In constrast, Numpy will store `Array Elements` in **Continuous Positions** in memory 25 | 26 | ### Numpy is more efficient for storing and manipulating data 27 | 28 | 29 | 30 | - `Numpy array` : essentially contains a single pointer to one contiguous block of data 31 | - `Python list` : contains a pointer to a block of pointers, each of which in turn points to a full Python object 32 | 33 | # Numpy Data Types and Attributes 34 | - Main Numpy Data Type is `ndarray` 35 | - Attributes: `shape, ndim, size, dtype` 36 | -------------------------------------------------------------------------------- /Pages/A04_Conda_CLI.md: -------------------------------------------------------------------------------- 1 | 2 | | Usage | Command | Description | 3 | | -------------------| ---------------------------------- | ----------------| 4 | | Create |`conda create --prefix ./env pandas numpy matplotlib scikit-learn jupyter`| Create Conda env & install packages | 5 | | List Env | `conda env list` | Listdown env currently activated | 6 | | Activate | `conda activate ./env` | Activate Conda virtual env | 7 | | Install package | `conda install jupyter` | | 8 | | Update package | `conda update scikit-learn=0.22` | Can specify the version also | 9 | | List Installed Package | `conda list`|| 10 | | Un-install Package | `conda uninstall python scikit-learn`| To uninstall packages to re-install with the Latest version| 11 | | Open Jupyter Notebook | `jupyter notebook`|| 12 | 13 | 14 | 15 | ## Sharing Conda Environment 16 | - Share a `.yml` (pronounced YAM-L) file of your Conda environment 17 | - `.yml` is basically a text file with instructions to tell Conda how to set up an environment. 18 | - Step 1: Export `.yml` file: 19 | - `conda env export --prefix {Path to env folder} > environment.yml` 20 | - Step 2: New PC, create an environment called `env_from_file` from `environment.yml`: 21 | - `conda env create --file environment.yml --name env_from_file` 22 | 23 | ## Jupyter Notebook 24 | | Usage | Command | Description | 25 | | -------------------| ---------------------------------- | ----------------| 26 | | Run Cell |`Shift + Enter`| Create Conda env & install packages | 27 | | Switch to Markdown | Exit Edit Mode `ESC` > press `m` | | 28 | | Show Function Description | `Shift + Tab`|Screenshot 2021-03-18 at 8 09 57 AM| 29 | | How to install a conda package into the current env from Jupyter's Notebook|`import sys`
`!conda install --yes --prefix {sys.prefix} seaborn`|| 30 | 31 | ### Jupyter Magic Function 32 | | Function | Command | Description | 33 | | -------------------| ---------------------------------- | ----------------| 34 | | Matplotlib | `%matplotlib inline` | will make your plot outputs appear and be stored within the notebook. | 35 | -------------------------------------------------------------------------------- /Pages/A05_Matplotlib.md: -------------------------------------------------------------------------------- 1 | 2 | # Matplotlib Cheat Sheet 3 | # Table of contents 4 | - [Table of contents](#table-of-contents) 5 | - [Introduction to Matplotlib](#introduction-to-matplotlib) 6 | - [Plotting from an IPython notebook](#plotting-from-an-ipython-notebook) 7 | - [Matplotlib Two Interfaces: MATLAB-style & Object-Oriented Interfaces](#matplotlib-two-interfaces) 8 | - [Matplotlib Workflow](#matplotlib-workflow) 9 | - [Subplots](#subplots) 10 | - [Scatter, Bar & Histogram Plot](#scatter-bar-and-histogram-plot) 11 | 12 | # Introduction to Matplotlib 13 | - Matplotlib is a multi-platform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack 14 | - Newer tools like `ggplot` and `ggvis` in the R language, along with web visualization toolkits based on `D3js` and `HTML5 canvas`, often make Matplotlib feel clunky and old-fashioned 15 | - Hence, nowadays, cleaner, more modern APIs, for example, `Seaborn`, `ggpy`, `HoloViews`, `Altai`, has been developed to drive Matplotlib 16 | ```Python 17 | import matplotlib.pyplot as plt 18 | ``` 19 | - The `plt` interface is what we will use most often 20 | #### Setting Styles 21 | ```Python 22 | # See the different styles avail 23 | plt.style.available 24 | # Set Style 25 | plt.style.use('seaborn-whitegrid') 26 | ``` 27 | ## Plotting from an IPython notebook 28 | - `%matplotlib notebook` will lead to **interactive** plots embedded within the notebook 29 | - `%matplotlib inline` will lead to **static images** of your plot embedded in the notebook 30 | 31 | matplotlib-anatomy-of-a-plot 32 | 33 | - `Figure` can contains multiple Subplot 34 | - `Axes 0` and `Axes 1` are `AxesSubplot` stacked together 35 | ## Matplotlib Two Interfaces 36 | ### Pyplot API vs Object-Oriented API 37 | * Quickly → use Pyplot Method 38 | * Advanced → use Object-Oriented Method 39 | - In general, try to use the `object-oriented interface` (more flexible) over the `pyplot` interface (i.e: `plt.plot()`) 40 | 41 | ```Python 42 | x = [1,2,3,4] 43 | y = [11,22,33,44] 44 | ``` 45 | - **MATLAB-style or PyPlot API**: Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact 46 | ```Python 47 | # Pyplot API 48 | plt.plot(x,y, color='blue') 49 | 50 | plt.title("A Sine Curve") #in OO, use the ax.set() method to set all these properties at once 51 | plt.xlabel("x") 52 | plt.ylabel("sin(x)") 53 | plt.xlim([1,3]) 54 | plt.ylim([20,]) 55 | ``` 56 | - **Object-oriented**: plotting functions are methods of explicit `Figure` and `Axes` objects. 57 | ```Python 58 | # [Recommended] Object-oriented interface 59 | fig, ax = plt.subplots() #create figure + set of subplots, by default, nrow =1, ncol=1 60 | ax.plot(x,y) #add some data 61 | plt.show() 62 | ``` 63 | ##### Matplotlib Gotchas 64 | 65 | While most `plt` functions translate directly to `ax` methods (such as plt.plot() → ax.plot(), plt.legend() → ax.legend(), etc.), this is not the case for all commands. 
In particular, functions to set limits, labels, and titles are slightly modified. For transitioning between MATLAB-style functions and object-oriented methods, make the following changes: 66 | - `plt.xlabel()` → `ax.set_xlabel()` 67 | - `plt.ylabel()` → `ax.set_ylabel()` 68 | - `plt.xlim()` → `ax.set_xlim()` 69 | - `plt.ylim()` → `ax.set_ylim()` 70 | - `plt.title()` → `ax.set_title()` 71 | In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the ax.set() 72 | 73 | ```Python 74 | ax.set(xlim=(0, 10), ylim=(-2, 2), 75 | xlabel='x', ylabel='sin(x)', 76 | title='A Simple Plot'); 77 | ``` 78 | 79 | ## Matplotlib Workflow 80 | ```Python 81 | # 0. Import and get matplotlib ready 82 | %matplotlib inline 83 | import matplotlib.pyplot as plt 84 | 85 | # 1. Prepare data 86 | x = [1, 2, 3, 4] 87 | y = [11, 22, 33, 44] 88 | 89 | # 2. Setup plot 90 | fig, ax = plt.subplots(figsize=(5,5)) #Figure size = Width & Height of the Plot 91 | 92 | # 3. Plot data 93 | ax.plot(x, y) 94 | 95 | # 4. Customize plot 96 | ax.set(title="Sample Simple Plot", 97 | xlabel="x-axis", 98 | ylabel="y-axis", 99 | xlim=(0, 10), ylim=(-2, 2)) 100 | 101 | # 5. Save & Show 102 | fig.savefig("../images/simple-plot.png") 103 | ``` 104 | 105 | [(Back to top)](#table-of-contents) 106 | 107 | # Subplots 108 | - Option #1: to plot multiple subplots in same figure 109 | ```Python 110 | # Option 1: Create multiple subplots 111 | fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, 112 | ncols=2, 113 | figsize=(10, 5)) 114 | # Plot data to each axis 115 | ax1.plot(x, x/2); 116 | ax2.scatter(np.random.random(10), np.random.random(10)); 117 | ax3.bar(nut_butter_prices.keys(), nut_butter_prices.values()); 118 | ax4.hist(np.random.randn(1000)); 119 | ``` 120 | Screenshot 2021-03-23 at 8 52 20 AM 121 | 122 | [(Back to top)](#table-of-contents) 123 | 124 | # Scatter Bar and Histogram Plot 125 | ## Scatter 126 | ```Python 127 | #<--- Method 1: Pytlot --->: 128 | df.plot(kind = 'scatter', 129 | x = 'age', 130 | y = 'chol', 131 | c = 'target', #c = color the dot based on over_50['target'] columns 132 | figsize=(10,6)); 133 | ``` 134 | ```Python 135 | #<--- Method 2: OO --->: 136 | ## OO Method from Scratch 137 | fig, ax = plt.subplots(figsize=(10,6)) 138 | 139 | ## Plot the data 140 | scatter = ax.scatter(x=over_50["age"], 141 | y=over_50["chol"], 142 | c=over_50["target"]); 143 | # Customize the plot 144 | ax.set(title="Heart Disease and Cholesterol Levels", 145 | xlabel="Age", 146 | ylabel="Cholesterol"); 147 | # Add a legend 148 | ax.legend(*scatter.legend_elements(), title="target"); # * to unpack all the values of Title="target" 149 | 150 | #Add a horizontal line 151 | ax.axhline(over_50["chol"].mean(), linestyle = "--"); 152 | ``` 153 | 154 | 155 | 156 | ## Bar 157 | * Vertical 158 | * Horizontal 159 | ```Python 160 | #<--- Method 1: Pytlot --->: 161 | df.plot.bar(); 162 | ``` 163 | ```Python 164 | #<--- Method 2: OO --->: 165 | fig, ax = plt.subplots() 166 | ax.bar(x, y) 167 | ax.set(title="Dan's Nut Butter Store", ylabel="Price ($)"); 168 | ``` 169 | ## Histogram 170 | ```Python 171 | # Create Histogram of Age to see the distribution of age 172 | 173 | heart_disease["age"].plot.hist(bins=10); 174 | ``` 175 | [(Back to top)](#table-of-contents) 176 | -------------------------------------------------------------------------------- /Pages/A05_Statistics.md: -------------------------------------------------------------------------------- 1 | # Statistics 2 | # Table 
of contents 3 | - [Standard Deviation & Variance](standard-deviation-and-variance) 4 | 5 | 6 | 7 | # Standard Deviation and Variance 8 | 9 | ## Standard Deviation 10 | - Standard Deviation is a measure of how spread out numbers are. 11 | ```Python 12 | # Standard deviation = a measure of how spread out a group of numbers is from the mean 13 | np.std(a2) 14 | # Standar deviation = Square Root of Variance 15 | np.sqrt(np.var(a2)) 16 | ``` 17 | 18 | ## Variance 19 | - The average of the squared differences from the Mean. 20 | ```Python 21 | # Varainace = measure of the average degree to which each number is different to the mean 22 | # Higher variance = wider range of numbers 23 | # Lower variance = lower range of numbers 24 | np.var(a2) 25 | ``` 26 | 27 | ### Example: 28 | ![image](https://user-images.githubusercontent.com/64508435/111798728-4d521980-8905-11eb-890a-afe682a02c3e.png) 29 | - The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm 30 | - Mean = (600 + 470 + 170 + 430 + 300)/5 = 394mm 31 | 32 | ![image](https://user-images.githubusercontent.com/64508435/111799090-a8840c00-8905-11eb-8064-6890d95abca4.png) 33 | - `Variance` = 21704 34 | - `Standard Deviation` = sqrt(variance) = 147 mm 35 | 36 | ![Screenshot 2021-03-19 at 10 51 56 PM](https://user-images.githubusercontent.com/64508435/111799145-b6399180-8905-11eb-990b-2c72b9067520.png) 37 | 38 | - we can show which heights are within one Standard Deviation (147mm) of the Mean: 39 | - **Standard Deviation we have a "standard" way of knowing what is normal**, and what is extra large or extra small 40 | 41 | ![image](https://user-images.githubusercontent.com/64508435/111799454-fd278700-8905-11eb-98c1-f9866d34f27b.png) 42 | - Credit: [Math is Fun](https://www.mathsisfun.com/data/standard-deviation.html) 43 | -------------------------------------------------------------------------------- /Pages/A8_Daily_Lessons.md: -------------------------------------------------------------------------------- 1 | # Daily Lessons 2 | 3 | # Day 1: 4 | - **Math**: [Geometric Sequences](https://www.mathsisfun.com/algebra/sequences-sums-geometric.html) 5 | - **Python**: List (`enumarate`), Dict (`keys()`, `values()`,`items()`) 6 | - **LeetCode**: [50. Pow(x, n)](https://leetcode.com/problems/powx-n/): using Recursive with `Pow(x,n) = Pow(x,n//2)`, rmb for n = odd and even cases 7 | # Day 2: 8 | - **Math**: [Modular Arithmetic](https://brilliant.org/wiki/modular-arithmetic/) 9 | - **Congruence** `a ≡ b (mod n)` For a positive integer n, the integers *a and b are congruent mod n* if their remainders when divided by n are the same. 10 | - For example: 52≡24(mod7): 52 and 24 are congruent (mod 7) because (52 mod 7) = 3 and (24 mod 7) = 3. 11 | - **Properties of multiplication** in Modular Arithmetic: 12 | - `(a mod n) mod n = a mod n`: This is obvious because a mod n ∈ [0,𝑛−1] and so the second modmod cannot have an effect. 
13 | - `(A^2) mod C = (A * A) mod C = ((A mod C) * (A mod C)) mod C` 14 | - **LeetCode**: [Fast Exponentiation](https://youtu.be/-3Lt-EwR_Hw) 15 | # Day 3: 16 | - **Python**: 17 | - Nested List Comprehension `[[item if not item.isspace() else -1 for item in row] for row in board]` to build 2D matrix 18 | - String Formatting with Padding 0: For example, convert integer 2 to "02" `f"{month:02d}"` 19 | - Math's Ceil & Floor: `math.ceil()`, `math.floor()` 20 | - **Math**: 21 | - `Modular Multiplicative Inverse (MMI)`: **MMI(a, b) = x** s.t `a*x ≡ 1 (mod n)` 22 | - For example: a = 3, m = 11 => x = 4 as (3*4) mod 11 = 1 23 | - `Euclidean Algorithm` to find GCD of A & B & `Extended Euclidean Algorithm` to find **MMI(A, B)** 24 | # Day 4: 25 | - **LeetCode**: `Best Time to Buy and Sell Stock` (Keep track on the buying price, compare to the next days), `Climbing Stairs` (At T(n): first step = 1, remaining steps = T(n-1) or first step = 2, remaing steps = T(n-2). This recurrence relationship is similar to Fibonacci number) 26 | 27 | # Day 5: 28 | - **LeetCode**: `3 Sum`, `Longest Palindromic Substring` and `Container With Most Water` 29 | # Day 6: 30 | - **LeetCode**: `Number of Islands`, `Design Circular Queue` 31 | # Day 7: 32 | - **Data Science**: Understand about confusion matrix of classifier, Precision & Recall, F1 33 | -------------------------------------------------------------------------------- /Pages/P00_Introduction.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | # Table of contents 3 | - [Table of contents](#table-of-contents) 4 | - [Why need to learn Machine Learning ?](#why-need-to-learn-machine-learning) 5 | - [Terms](#terms) 6 | - [AI](#ai) 7 | - [Machine Learning](#machine-learning) 8 | - [Deep Learning](#deep-learning) 9 | - [Data Science](#data-science) 10 | - [Machine Learning Framework](#machine-learning-framework) 11 | - [Main Types of ML Problems](#main-types-of-ml-problems) 12 | - [Evaluation](#evaluation) 13 | - [Features](#features) 14 | - [Modelling](#modelling) 15 | - [Splitting Data](#splitting-data) 16 | - [Modelling](#modelling) 17 | - [Tuning](#tuning) 18 | - [Comparison](#comparison) 19 | 20 | # Why need to learn Machine Learning ? 21 | 22 | 23 | - **Spread Sheets (Excel, CSV)**: store data that business needs → Human can analyse data to make business decision 24 | - **Relational DB (MySQL)**: a better way to organize things → Human can analyse data to make business decision 25 | - **Big Data (NoSQL)**: FB, Amazon, Twitter accumulating more and more data like "User actions, user purchasing history", where you can store un-structure data → need Machine Learning instead of Human to make business decision 26 | 27 | [(Back to top)](#table-of-contents) 28 | 29 | # Terms 30 | ## AI 31 | ## Machine Learning 32 | 33 | 34 | 35 | - [A subset of AI](https://teachablemachine.withgoogle.com/): ML uses Algorithms or Computer Programs to learn different patterns of data & then take those algorithms & what it learned to make prediction or classification on similar data. 
36 | - The things hard to describe for computers to perform like 37 | - How to ask Computers to classify Cat/Dog images, or Product Reviews 38 | 39 | ### Difference between ML and Normal Algorithms 40 | - Normal Algorithm: a set of instructions on how to accomplish a task: start with `given input + set of instructions` → output 41 | - ML Algorithm : start with `given input + given output` → set of instructions between I/P and O/P 42 | 43 | 44 | ### Types of ML Problems 45 | 46 | 47 | 48 | - **Supervised**: Data with Label 49 | - **Unsupervised**: Data without Label like CSV without Column Names 50 | - *Clustering*: Machine decicdes clusters/groups 51 | - *Association Rule Learning*: Associate different things to predict what customers might buy in the future 52 | - **Reinforcement**: teach Machine to try and error (with reward and penalty) 53 | 54 | ## Deep Learning 55 | ## Data Science 56 | - `Data Analysis`: analyse data to gain understanding of your data 57 | - `Data Science` : running experiments on set of data to figure actionable insights within it 58 | - Example: to build ML Models 59 | 60 | [(Back to top)](#table-of-contents) 61 | 62 | # Machine Learning Framework 63 | ![Screenshot 2021-03-05 at 7 00 17 AM](https://user-images.githubusercontent.com/64508435/110042238-7361b080-7d80-11eb-825d-f8fc4d4c2cf2.png) 64 | 65 | - Readings: [ (1) ](https://www.mrdbourke.com/a-6-step-field-guide-for-building-machine-learning-projects/), [ (2) ](https://whimsical.com/6-step-field-guide-to-machine-learning-projects-flowcharts-9g65jgoRYTxMXxDosndYTB) 66 | ### Step 1: Problem Definition - Rephrase business problem as a machine learning problem 67 | - What problem are we trying to solve ? 68 | - Supervised 69 | - Un-supervised 70 | - Classification 71 | - Regression 72 | ### Step 2: Data 73 | - What kind of Data we have ? 74 | ### Step 3: Evaluation 75 | - What defines success for us ? knowing what metrics you should be paying attention to gives you an idea of how to evaluate your machine learning project. 76 | ### Step 4: Features 77 | - What features does your data have and which can you use to build your model ? turning features → patterns 78 | - **Three main types of features**: 79 | - `Categorical` features — One or the other(s) 80 | - For example, in our heart disease problem, the sex of the patient. Or for an online store, whether or not someone has made a purchase or not. 81 | - `Continuous (or numerical)` features: A numerical value such as average heart rate or the number of times logged in. 82 | - `Derived` features — Features you create from the data. Often referred to as feature engineering. 83 | - `Feature engineering` is how a subject matter expert takes their knowledge and encodes it into the data. You might combine the number of times logged in with timestamps to make a feature called time since last login. Or turn dates from numbers into “is a weekday (yes)” and “is a weekday (no)”. 84 | ### Step 5: Models 85 | - Figure out right models for your problems 86 | ### Step 6: Experimentation 87 | - How to improve or what can do better ? 
88 | 89 | ## Main Types of ML Problems 90 | ![Screenshot 2021-03-09 at 8 23 37 AM](https://user-images.githubusercontent.com/64508435/110399393-c1dcbb00-80b0-11eb-8c0d-4b21f02fc3e4.png) 91 | ### Supervised Learning: 92 | - (Input & Output) Data + Label → Classifications, Regressions 93 | ### Un-Supervised Learning: 94 | - (Only Input) Data → Clustering 95 | ### Transfer Learning: 96 | - (My problem similar to others) Leverage from Other ML Models 97 | ### Reinforcement Learning: 98 | - Purnishing & Rewarding the ML Learning model by updating the scores of ML 99 | 100 | ## Evaluation 101 | 102 | | Classification | Regression | Recommendation | 103 | | -------------------| ---------------------------------- | ----------------| 104 | | Accuracy | Mean Absolute Error (MAE) | Precision at K | 105 | | Precision | Mean Squared Error (MSE) | | 106 | | Recall | Root Mean Squared Error (RMSE) | | 107 | 108 | [(Back to top)](#table-of-contents) 109 | 110 | ## Features 111 | - Numerical Features 112 | - Categorical Features 113 | 114 | 115 | 116 | [(Back to top)](#table-of-contents) 117 | 118 | ## Modelling 119 | ### Splitting Data 120 | 121 | 122 | 123 | - 3 sets: Trainning, Validation (model hyperparameter tuning and experimentation evaluation) & Test Sets (model testing and comparison) 124 | 125 | ### Modelling 126 | - Chosen models work for your problem → train the model 127 | - Goal: Minimise time between experiments 128 | - Start small and add up complexity (use small parts of your training sets to start with) 129 | - Choosing the less complicated models to start first 130 | 131 | 132 | ### Tuning 133 | - Happens on Validation or Training Sets 134 | 135 | ### Comparison 136 | - Measure Model Performance via Test Set 137 | - Advoid `Overfitting` & `Underfitting` 138 | #### Overfitting 139 | - Great performance on the training data but poor performance on test data means your model doesn’t generalize well 140 | - Solution: Try simpler model or making sure your the test data is of the same style your model is training on 141 | ### Underfitting 142 | - Poor performance on training data means the model hasn’t learned properly and is underfitting 143 | - Solution: Try a different model, improve the existing one through hyperparameter or collect more data. 
144 | 145 | 146 | 147 | -------------------------------------------------------------------------------- /Pages/P01_Data_Pre_Processing.md: -------------------------------------------------------------------------------- 1 | # Data Preprocessing 2 | # Table of contents 3 | - [Table of contents](#table-of-contents) 4 | - [Introduction](#introduction) 5 | - [Data Preprocessing](#data-preprocessing) 6 | - [Import Dataset](#import-dataset) 7 | - [Select Data](#select-data) 8 | - [Using Index: iloc](#using-index-iloc) 9 | - [Numpy representation of DF](#numpy-representation-of-df) 10 | - [Handle Missing Data](#handle-missing-data) 11 | - [Encode Categorical Data](#encode-categorical-data) 12 | - [Encode Independent Variables](#encode-independent-variables) 13 | - [Encode Dependent Variables](#encode-dependent-variables) 14 | - [Splitting Training set and Test set](#splitting-training-set-and-test-set) 15 | - [Feature Scaling](#feature-scaling) 16 | - [Standardisation Feature Scaling](#standardisation-feature-scaling) 17 | - [Resources](#resources) 18 | 19 | 20 | # Data Preprocessing 21 | ## Import Dataset 22 | ```python 23 | dataset = pd.read_csv("data.csv") 24 | 25 | Country Age Salary Purchased 26 | 0 France 44.0 72000.0 No 27 | 1 Spain 27.0 48000.0 Yes 28 | 2 Germany 30.0 54000.0 No 29 | 3 Spain 38.0 61000.0 No 30 | 4 Germany 40.0 NaN Yes 31 | 5 France 35.0 58000.0 Yes 32 | 6 Spain NaN 52000.0 No 33 | 7 France 48.0 79000.0 Yes 34 | 8 Germany 50.0 83000.0 No 35 | 9 France 37.0 67000.0 Yes 36 | ``` 37 | ## Select Data 38 | ### Using Index iloc 39 | - `.iloc[]` allowed inputs are: 40 | #### Selecting Rows 41 | - An integer, e.g. `dataset.iloc[0]` > return row 0 in `` 42 | ``` 43 | Country France 44 | Age 44 45 | Salary 72000 46 | Purchased No 47 | ``` 48 | - A list or array of integers, e.g.`dataset.iloc[[0]]` > return row 0 in DataFrame format 49 | ``` 50 | Country Age Salary Purchased 51 | 0 France 44.0 72000.0 No 52 | ``` 53 | - A slice object with ints, e.g. `dataset.iloc[:3]` > return row 0 up to row 3 in DataFrame format 54 | ``` 55 | Country Age Salary Purchased 56 | 0 France 44.0 72000.0 No 57 | 1 Spain 27.0 48000.0 Yes 58 | 2 Germany 30.0 54000.0 No 59 | ``` 60 | #### Selecting Rows & Columns 61 | - Select First 3 Rows & up to Last Columns (not included) `X = dataset.iloc[:3, :-1]` 62 | ``` 63 | Country Age Salary 64 | 0 France 44.0 72000.0 65 | 1 Spain 27.0 48000.0 66 | 2 Germany 30.0 54000.0 67 | ``` 68 | ### Numpy representation of DF 69 | - `DataFrame.values`: Return a Numpy representation of the DataFrame (i.e: Only the values in the DataFrame will be returned, the axes labels will be removed) 70 | - For ex: `X = dataset.iloc[:3, :-1].values` 71 | ``` 72 | [['France' 44.0 72000.0] 73 | ['Spain' 27.0 48000.0] 74 | ['Germany' 30.0 54000.0]] 75 | ``` 76 | [(Back to top)](#table-of-contents) 77 | 78 | ## Handle Missing Data 79 | ### SimpleImputer 80 | - sklearn.impute.`SimpleImputer(missing_values={should be set to np.nan} strategy={"mean",“median”, “most_frequent”, ..})` 81 | - imputer.`fit(X[:, 1:3])`: Fit the imputer on X. 82 | - imputer.`transform(X[:, 1:3])`: Impute all missing values in X. 
83 | 84 | ```Python 85 | from sklearn.impute import SimpleImputer 86 | 87 | #Create an instance of Class SimpleImputer: np.nan is the empty value in the dataset 88 | imputer = SimpleImputer(missing_values=np.nan, strategy='mean') 89 | 90 | #Replace missing value from numerical Col 1 'Age', Col 2 'Salary' 91 | imputer.fit(X[:, 1:3]) 92 | 93 | #transform will replace & return the new updated columns 94 | X[:, 1:3] = imputer.transform(X[:, 1:3]) 95 | ``` 96 | 97 | ## Encode Categorical Data 98 | ### Encode Independent Variables 99 | - Since for the independent variable, we will convert into vector of 0 & 1 100 | - Using the `ColumnTransformer` class & 101 | - `OneHotEncoder`: encoding technique for features are nominal(do not have any order) 102 | ![image](https://user-images.githubusercontent.com/64508435/104794298-a86e6f80-57e1-11eb-8ffc-aee2178762d1.png) 103 | 104 | ```Python 105 | from sklearn.compose import ColumnTransformer 106 | from sklearn.preprocessing import OneHotEncoder 107 | ``` 108 | - `transformers`: specify what kind of transformation, and which cols 109 | - Tuple `('encoder' encoding transformation, instance of Class OneHotEncoder, [cols to transform])` 110 | - `remainder ="passthrough"` > to keep the cols which not be transformed. Otherwise, the remaining cols will not be included 111 | ```Python 112 | ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])] , remainder="passthrough" ) 113 | ``` 114 | - Fit and Transform with input = X in the Instance `ct` of class `ColumnTransformer` 115 | ```Python 116 | #fit and transform with input = X 117 | #np.array: need to convert output of fit_transform() from matrix to np.array 118 | X = np.array(ct.fit_transform(X)) 119 | ``` 120 | - Before converting categorical column [0] `Country` 121 | ``` 122 | Country Age Salary Purchased 123 | 0 France 44.0 72000.0 No 124 | 1 Spain 27.0 48000.0 Yes 125 | ``` 126 | - After converting, France = [1.0, 0, 0] vector 127 | ``` 128 | [[1.0 0.0 0.0 44.0 72000.0] 129 | [0.0 0.0 1.0 27.0 48000.0] 130 | [0.0 1.0 0.0 30.0 54000.0] 131 | ``` 132 | 133 | ### Encode Dependent Variables 134 | - For the dependent variable, since it is the Label > we use `Label Encoder` 135 | ```Python 136 | from sklearn.preprocessing import LabelEncoder 137 | le = LabelEncoder() 138 | #output of fit_transform of Label Encoder is already a Numpy Array 139 | y = le.fit_transform(y) 140 | 141 | #y = [0 1 0 0 1 1 0 1 0 1] 142 | ``` 143 | 144 | # Splitting Training set and Test set 145 | - Using the `train_test_split` of SkLearn - Model Selection 146 | - Recommend Split: `test_size = 0.2` 147 | - `random_state = 1`: fixing the seed for random state so that we can have the same training & test sets anytime 148 | ```Python 149 | from sklearn.model_selection import train_test_split 150 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1) 151 | ``` 152 | [(Back to top)](#table-of-contents) 153 | 154 | # Feature Scaling 155 | - What ? Feature Scaling (FS): scale all the features in the same scale to prevent 1 feature dominates the others & then neglected by ML Model 156 | - Note #1: FS **no need to apply in all the times** in all ML Models (like Multi-Regression Models) 157 | - Why no need FS for Multi-Regression Model: y = b0 + b1 * x1 + b2 * x2 + b3 * x3, since we have the coefficients (b0, b1, b2, b3) to compensate, so there is no need FS. 
158 | - Note #2: For dummy variables from Categorial Features Encoding, **no need to apply FS** 159 | Screenshot 2021-01-16 at 11 35 13 AM 160 | - Note #3: **FS MUST be done AFTER splitting** Training & Test sets 161 | 162 | - Why ? 163 | - Test Set suppose to the brand-new set, which we are not supposed to work with the Training Set 164 | - FS is technique to get the mean & median of features in order to scale 165 | - If we apply FS before splitting Training & Test sets, it will include the mean & median of both Training Set and Test Set 166 | - FS MUST be done AFTER Splitting => Otherwise, we will cause **Information Leakage** 167 | ## How ? 168 | - There are 2 main Feature Scaling Technique: Standardisation & Normalisation 169 | - `Standardisation`: This makes the dataset, center at 0 i.e mean at 0, and changes the standard deviation value to 1. 170 | - *Usage*: apply all the situations 171 | - `Normalisation`: This makes the dataset in range [0, 1] 172 | - *Usage*: apply when the all the features in the data set have the **normal distribution** 173 | 174 | ![Screenshot 2021-01-16 at 10 59 20 AM](https://user-images.githubusercontent.com/64508435/104795502-e40d3780-57e9-11eb-91ce-bb68c43a715f.png) 175 | 176 | ## Standardisation Feature Scaling: 177 | - We will use `StandardScaler` from `sklearn.preprocessing` 178 | ```Python 179 | from sklearn.preprocessing import StandardScaler 180 | sc = StandardScaler() 181 | ``` 182 | - For `X_train`: apply `StandardScaler` by using `fit_transform` 183 | ```Python 184 | X_train[:,3:] = sc.fit_transform(X_train[:,3:]) 185 | ``` 186 | - For `X_test`: apply `StandardScaler` only use `transform`, because we want to apply the SAME scale as `X_train` 187 | ```Python 188 | #only use Transform to use the SAME scaler as the Training Set 189 | X_test[:,3:] = sc.transform(X_test[:,3:]) 190 | ``` 191 | 192 | 193 | [(Back to top)](#table-of-contents) 194 | 195 | # Resources: 196 | ### Podcast: 197 | https://www.superdatascience.com/podcast/sds-041-inspiring-journey-totally-different-background-data-science 198 | 199 | 200 | 201 | 202 | -------------------------------------------------------------------------------- /Pages/P02_Regression.md: -------------------------------------------------------------------------------- 1 | # Regression 2 | # Table of contents 3 | 4 | - [Table of contents](#table-of-contents) 5 | - [Introduction to Regressions](#introduction-to-regressions) 6 | - [Simple Linear Regression](#simple-linear-regression) 7 | - [Outline: Building a Model](#outline-building-a-model) 8 | - [Creating a Model](#creating-a-model) 9 | - [Predicting a Test Result](#predicting-a-test-result) 10 | - [Visualising the Test set results](#visualising-the-test-set-results) 11 | - [Getting Linear Regression Equation](#getting-linear-regression-equation) 12 | - [Evaluating the Algorithm](#evaluating-the-algorithm) 13 | - [R Square or Adjusted R Square](#r-square-or-adjusted-r-square) 14 | - [Mean Square Error (MSE)/Root Mean Square Error (RMSE)](#mean-square-error-and-root-mean-square-error) 15 | - [Mean Absolute Error (MAE)](#mean-absolute-error) 16 | - [Multiple Linear Regression](#multiple-linear-regression) 17 | - [Assumptions of Linear Regression](#assumptions-of-linear-regression) 18 | - [Dummy Variables](#dummy-variables) 19 | - [Understanding P-value](#understanding-p-value) 20 | - [Building a Model](#building-a-model) 21 | - [Polynomial Linear Regression](#polynomial-linear-regression) 22 | 23 | 24 | # Introduction to Regressions 25 | - Simple Linear 
Regression : `y = b0 + b1*x1` 26 | - Multiple Linear Regression : `y = b0 + b1*x1 + b2*x2 + ... + bn*xn` 27 | - Polynomial Linear Regression: `y = b0 + b1*x1 + b2*x1^(2) + ... + bn*x1^(n)` 28 | 29 | # Simple Linear Regression 30 | ## Outline Building a Model 31 | - Importing libraries and datasets 32 | - Splitting the dataset 33 | - Training the simple Linear Regression model on the Training set 34 | - Predicting and visualizing the test set results 35 | - Visualizing the training set results 36 | - Making a single prediction 37 | - Getting the final linear regression equation (with values of the coefficients) 38 | ``` 39 | y = bo + b1 * x1 40 | ``` 41 | - y: Dependent Variable (DV) 42 | - x: InDependent Variable (IV) 43 | - b0: Intercept Coefficient 44 | - b1: Slope of Line Coefficient 45 |

46 | 47 | 48 | ## Creating a Model 49 | - Using `sklearn.linear_model`, `LinearRegression` model 50 | ```Python 51 | from sklearn.linear_model import LinearRegression 52 | 53 | #To Create Instance of Simple Linear Regression Model 54 | regressor = LinearRegression() 55 | 56 | #To fit the X_train and y_train 57 | regressor.fit(X_train, y_train) 58 | ``` 59 | ## Predicting a Test Result 60 | ```Python 61 | y_pred = regressor.predict(X_test) 62 | ``` 63 | ### Predict a single value 64 | **Important note:** "predict" method always expects a 2D array as the format of its inputs. 65 | - And putting 12 into a double pair of square brackets makes the input exactly a 2D array: 66 | - `regressor.predict([[12]])` 67 | 68 | ```Python 69 | print(f"Predicted Salary of Employee with 12 years of EXP: {regressor.predict([[12]])}" ) 70 | 71 | #Output: Predicted Salary of Employee with 12 years of EXP: [137605.23485427] 72 | ``` 73 | ## Visualising the Test set results 74 | ```Python 75 | #Plot predicted values 76 | plt.scatter(X_test, y_test, color = 'red', label = 'Predicted Value') 77 | #Plot the regression line 78 | plt.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Linear Regression') 79 | #Label the Plot 80 | plt.title('Salary vs Experience (Test Set)') 81 | plt.xlabel('Years of Experience') 82 | plt.ylabel('Salary') 83 | #Show the plot 84 | plt.show() 85 | ``` 86 | ![download](https://user-images.githubusercontent.com/64508435/105365689-7c1b7e80-5c39-11eb-8e44-12866fb7eb3d.png) 87 | 88 | ## Getting Linear Regression Equation 89 | - General Formula: `y_pred = model.intercept_ + model.coef_ * x` 90 | ```Python 91 | print(f"b0 : {regressor.intercept_}") 92 | print(f"b1 : {regressor.coef_}") 93 | 94 | b0 : 25609.89799835482 95 | b1 : [9332.94473799] 96 | ``` 97 | 98 | Linear Regression Equation: `Salary = 25609 + 9332.94×YearsExperience` 99 | 100 | ## Evaluating the Algorithm 101 | - compare how well different algorithms perform on a particular dataset. 102 | - For regression algorithms, three evaluation metrics are commonly used: 103 | 1. R Square/Adjusted R Square > Percentage of the output variability 104 | 2. Mean Square Error(MSE)/Root Mean Square Error(RMSE) > to compare performance between different regression models 105 | 3. Mean Absolute Error(MAE) > to compare performance between different regression models 106 | 107 | ### R Square or Adjusted R Square 108 | #### R Square: Coefficient of determination 109 | - R Square measures how much of **variability** in predicted variable can be explained by the model. 110 | - `Variance` is a measure in statistics defined as the average of the square of differences between individual point and the expected value. 111 | - R Square value: between 0 to 1 and bigger value indicates a better fit between prediction and actual value. 112 | - However, it does **not take into consideration of overfitting problem**. 113 | - If your regression model has many independent variables, because the model is too complicated, it may fit very well to the training data 114 | - but performs badly for testing data. 115 | - Solution: Adjusted R Square 116 | 117 |

118 | 119 | #### Adjusted R Square 120 | - is introduced Since R-square can be increased by adding more number of variable and may lead to the **over-fitting** of the model 121 | - Will penalise additional independent variables added to the model and adjust the metric to **prevent overfitting issue**. 122 | 123 | #### Calculate R Square and Adjusted R Square using Python 124 | - In Python, you can calculate R Square using `Statsmodel` or `Sklearn` Package 125 | ```Python 126 | import statsmodels.api as sm 127 | 128 | X_addC = sm.add_constant(X) 129 | 130 | result = sm.OLS(Y, X_addC).fit() 131 | 132 | print(result.rsquared, result.rsquared_adj) 133 | # 0.79180307318 0.790545085707 134 | 135 | ``` 136 | - around 79% of dependent variability can be explain by the model and adjusted R Square is roughly the same as R Square meaning the model is quite robust 137 | 138 | ### Mean Square Error and Root Mean Square Error 139 | - While **R Square** is a **relative measure** of how well the model fits dependent variables 140 | - **Mean Square Error (MSE)** is an **absolute measure** of the goodness for the fit. 141 | - **Root Mean Square Error(RMSE)** is the square root of MSE. 142 | - It is used more commonly than MSE because firstly sometimes MSE value can be too big to compare easily. 143 | - Secondly, MSE is calculated by the square of error, and thus square root brings it back to the same level of prediction error and make it easier for interpretation. 144 | 145 |

146 | 147 | ```Python 148 | from sklearn.metrics import mean_squared_error 149 | import math 150 | print(mean_squared_error(Y_test, Y_predicted)) 151 | print(math.sqrt(mean_squared_error(Y_test, Y_predicted))) 152 | # MSE: 2017904593.23 153 | # RMSE: 44921.092965684235 154 | ``` 155 | ### Mean Absolute Error 156 | - Compare to MSE or RMSE, MAE is a more direct representation of sum of error terms. 157 | 158 |

159 | 160 | ```Python 161 | from sklearn.metrics import mean_absolute_error 162 | print(mean_absolute_error(Y_test, Y_predicted)) 163 | #MAE: 26745.1109986 164 | ``` 165 | 166 | [(Back to top)](#table-of-contents) 167 | 168 | # Multiple Linear Regression 169 | ### Assumptions of Linear Regression: 170 | Before choosing Linear Regression, need to consider below assumptions 171 | 1. Linearity 172 | 2. Homoscedasticity 173 | 3. Multivariate normality 174 | 4. Independence of errors 175 | 5. Lack of multicollinearity 176 | 177 | ## Dummy Variables 178 | - Since `State` is categorical variable => we need to convert it into `dummy variable` 179 | - No need to include all dummy variable to our Regression Model => **Only omit one dummy variable** 180 | - Why ? `dummy variable trap` 181 | ![Screenshot 2021-01-28 at 9 22 08 PM](https://user-images.githubusercontent.com/64508435/106144509-3e75a300-61af-11eb-8240-53bed739b2a1.png) 182 | 183 | ## Understanding P value 184 | - Ho : `Null Hypothesis (Universe)` 185 | - H1 : `Alternative Hypothesis (Universe)` 186 | - For example: 187 | - Assume `Null Hypothesis` is true (or we are living in Null Universe) 188 | 189 | ![Screenshot 2021-01-28 at 9 45 01 PM](https://user-images.githubusercontent.com/64508435/106148653-4421b780-61b4-11eb-91b4-4db1247a1a2a.png) 190 | 191 | [(Back to top)](#table-of-contents) 192 | ## Building a Model 193 | - 5 methods of Building Models 194 | ### Method 1: All-in 195 | - Throw in all variables in the dataset 196 | - Usage: 197 | - Prior knowledge about this problem; OR 198 | - You have to (Company Framework required) 199 | - Prepare for Backward Elimination 200 | ### Method 2 [Stepwise Regression]: Backward Elimination (Fastest) 201 | - Step 1: Select a significance level (SL) to stay in the model (e.g: SL = 0.05) 202 | ```Python 203 | # Building the optimal model using Backward Elimination 204 | import statsmodels.api as sm 205 | 206 | # Avoiding the Dummy Variable Trap by excluding the first column of Dummy Variable 207 | # Note: in general you don't have to remove manually a dummy variable column because Scikit-Learn takes care of it. 208 | X = X[:, 1:] 209 | 210 | #Append full column of "1"s to First Column of X using np.append 211 | #Since y = b0*(1) + b1 * x1 + b2 * x2 + .. + bn * xn, b0 is constant and can be re-written as b0 * (1) 212 | #np.append(arr = the array will add to, values = column to be added, axis = row/column) 213 | # np.ones((row, column)).astype(int) => .astype(int) to convert array of 1 into integer type to avoid data type error 214 | X = np.append(arr = np.ones((50,1)).astype(int), values = X, axis = 1) 215 | 216 | #Initialize X_opt with Original X by including all the column from #0 to #5 217 | X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5]], dtype=float) 218 | #If you are using the google colab to write your code, 219 | # the datatype of all the features is not set to float hence this step is important: X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5]], dtype=float) 220 | ``` 221 | - Step 2: Fit the full model with all possible predictors 222 | ```Python 223 | #OrdinaryLeastSquares 224 | regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit() 225 | regressor_OLS.summary() 226 | ``` 227 | - Step 3: Consider Predictor with Highest P-value 228 | - If P > SL, go to Step 4, otherwise go to [**FIN** : Your Model Is Ready] 229 | - Step 4: Remove the predictor 230 | ```Python 231 | #Remove column = 2 from X_opt since Column 2 has Highest P value (0.99) and > SL (0.05). 
232 | X_opt = np.array(X[:, [0, 1, 3, 4, 5]], dtype=float) 233 | #OrdinaryLeastSquares 234 | regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit() 235 | regressor_OLS.summary() 236 | ``` 237 | - Step 5: Re-Fit model without this variable 238 | 239 | ### Method 3 [Stepwise Regression]: Forward Selection 240 | - Step 1: Select a significance level (SL) to enter in the model (e.g: SL = 0.05) 241 | - Step 2: Fit all simple regression models (y ~ xn). Select the one with Lowest P-value for the independent variable. 242 | - Step 3: Keep this variable and fit all possible regression models with one extra predictor added to the one(s) you already have. 243 | - Step 4: Consider the predicotr with Lowest P-value. If P < SL (i.e: model is good), go STEP 3 (to add 3rd variable into the model and so on with all variables we have left), otherwise go to [**FIN** : Keep the previous model] 244 | ### Method 4 [Stepwise Regression]: Bidirectional Elemination 245 | - Step 1: Select a significant level to enter and to stay in the model: `e.g: SLENTER = 0.05, SLSTAY = 0.05` 246 | - Step 2: Perform the next step of Forward Selection (new variables must have: P < SLENTER to enter) 247 | - Step 3: Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay) => Step 2. 248 | - Step 4: No variables can enter and no old variables can exit => [**FIN** : Your Model Is Ready] 249 | 250 | ### Method 5: Score Comparison 251 | - Step 1: Select a criterion of goodness of ift (e.g Akaike criterion) 252 | - Step 2: Construct all possible regression Models: `2^(N) - 1` total combinations, where N: total number of variables 253 | - Step 3: Select the one with best criterion => [**FIN** : Your Model Is Ready] 254 | 255 | ### Code Implementation 256 | - Note: Backward Elimination is irrelevant in Python, because the Scikit-Learn library automatically takes care of selecting the statistically significant features when training the model to make accurate predictions. 257 | ##### Step 1: Splitting the dataset into the Training set and Test set 258 | ```Python 259 | #no need Feature Scaling (FS) for Multi-Regression Model: y = b0 + b1 * x1 + b2 * x2 + b3 * x3, 260 | # since we have the coefficients (b0, b1, b2, b3) to compensate, so there is no need FS. 261 | from sklearn.model_selection import train_test_split 262 | 263 | # NOT have to remove manually a dummy variable column because Scikit-Learn takes care of it. 
264 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0) 265 | ``` 266 | 267 | ##### Step 2: Training the Multiple Linear Regression model on the Training set 268 | ```Python 269 | #LinearRegression will take care "Dummy variable trap" & feature selection 270 | from sklearn.linear_model import LinearRegression 271 | regressor = LinearRegression() 272 | regressor.fit(X_train, y_train) 273 | ``` 274 | 275 | ##### Step 3: Predicting the Test set results 276 | ```Python 277 | y_pred = regressor.predict(X_test) 278 | ``` 279 | 280 | 281 | ##### Step 4: Displaying Y_Pred vs Y_test 282 | - Since this is multiple linear regression, so can not visualize by drawing the graph 283 | 284 | ```Python 285 | #To display the y_pred vs y_test vectors side by side 286 | np.set_printoptions(precision=2) #To round up value to 2 decimal places 287 | 288 | #np.concatenate((tuple of rows/columns you want to concatenate), axis = 0 for rows and 1 for columns) 289 | #y_pred.reshape(len(y_pred),1) : to convert y_pred to column vector by using .reshape() 290 | 291 | print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1)) 292 | ``` 293 | ##### Step 5: Getting the final linear regression equation with the values of the coefficients 294 | ```Python 295 | print(regressor.coef_) 296 | print(regressor.intercept_) 297 | 298 | [ 8.66e+01 -8.73e+02 7.86e+02 7.73e-01 3.29e-02 3.66e-02] 299 | 42467.52924853204 300 | ``` 301 | 302 | Equation: 303 | Profit = 86.6 x DummyState1 - 873 x DummyState2 + 786 x DummyState3 - 0.773 x R&D Spend + 0.0329 x Administration + 0.0366 x Marketing Spend + 42467.53 304 | 305 | [(Back to top)](#table-of-contents) 306 | 307 | 308 | # Polynomial Linear Regression 309 | - Polynomial Linear Regression: `y = b0 + b1*x1 + b2*x1^(2) + ... + bn*x1^(n)` 310 | - Used for dataset with non-linear relation, but polynomial linear relation like salary scale. 311 | 312 | 313 | 314 | [(Back to top)](#table-of-contents) 315 | -------------------------------------------------------------------------------- /Pages/Project_Guideline.md: -------------------------------------------------------------------------------- 1 | # End-to-End Machine Learning Project Guideline 2 | Screenshot 2021-06-24 at 23 01 18 3 | 4 | ## Table of contents 5 | - [1. Project Environment Setup](#1-project-environment-setup) 6 | - [1.1 Setup Conda Env](#11-setup-conda-env) 7 | 8 | 9 | 10 | ## 1. Project Environment Setup 11 | Screenshot 2021-06-24 at 23 00 56 12 | 13 | ### 1.1. Setup Conda Env 14 | #### 1.1.1. Create Conda Env from Stratch 15 | `conda create --prefix ./env pandas numpy matplotlib scikit-learn jupyter` 16 | #### 1.1.2. 
Create Conda Env from a base env 17 | - **Step 1**: Go to Base Env folder and export the base conda env to `environment.yml` file 18 | - *Note*: open `environment.yml` file by `vim environment.yml` to open the file → To exit Vim: `press ESC then ; then q to exit` 19 | ```Python 20 | conda env list #to list down current env 21 | conda activate /Users/quannguyen/Data_Science/Conda/env #Activate the base conda env 22 | conda env export > environment.yml #Export base conda env to environment.yml file 23 | conda deactivate #de-activate env once done 24 | ``` 25 | - **Step 2**: Go to current project folder and create the env based on `environment.yml` file 26 | ```python 27 | conda env create --prefix ./env -f environment.yml 28 | ``` 29 | [(Back to top)](#table-of-contents) 30 | -------------------------------------------------------------------------------- /Pages/Resources/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Pages/Resources/.DS_Store -------------------------------------------------------------------------------- /Pages/Resources/Interview/ML_cheatsheets.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodexploreRepo/data-science/b545e58709583f79f883b5ec1a0b926cc0c04b19/Pages/Resources/Interview/ML_cheatsheets.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science Handbook 2 | # Table of contents 3 | - [0. Introduction](./Pages/P00_Introduction.md) 4 | - [1. Data Preprocessing](./Pages/P01_Data_Pre_Processing.md) 5 | - [2. Regression](./Pages/P02_Regression.md) 6 | - [Project Guideline](./Pages/Project_Guideline.md) 7 | - [Appendix](#appendix) 8 | - [Resources](#resources) 9 | 10 | ## Appendix 11 | - [Job Description for Data Science](./Pages/A01_Job_Description.md) 12 | - [Interview Question for Data Science](./Pages/A01_Interview_Question.md) 13 | - [Pandas Cheat Sheet](./Pages/A02_Pandas_Cheat_Sheet.md) 14 | - [Numpy Cheat Sheet](./Pages/A03_Numpy_Cheat_Sheet.md) 15 | - [Matplotlib Cheat Sheet](./Pages/A05_Matplotlib.md) 16 | - [Sklearn Cheat Sheet](./Pages/A06_SkLearn.md) 17 | - [Conda CLI Cheat Sheet](./Pages/A04_Conda_CLI.md) 18 | - [Statistics](./Pages/A05_Statistics.md) 19 | - [Kaggle 30 Days of Machine Learning](./Pages/A07_Kaggle_30_ML.md) 20 | - [Daily Lessons](./Pages/A8_Daily_Lessons.md) 21 | ## Resources: 22 | - [AI Road Map](https://i.am.ai/roadmap/#note) 23 | - [AI Free Course](https://learn.aisingapore.org/professionals/): Intel AI Academy 24 | - [Reading List](./Pages/A00_Reading_List.md) 25 | - [Deep Learning Monitor: Latest Deep Learning Research Papers](https://deeplearn.org/) 26 | ### Podcast: 27 | - [1. Super Data Science](https://www.superdatascience.com/podcast/sds-041-inspiring-journey-totally-different-background-data-science) 28 | - [2. Visual ML ](https://vas3k.com/blog/machine_learning/) 29 | 30 | 31 | 32 | 33 | --------------------------------------------------------------------------------