├── DA0101EN-Review-Data-Wrangling.ipynb ├── DA0101EN-Review-Exploratory-Data-Analysis.ipynb ├── DA0101EN-Review-Introduction.ipynb ├── DA0101EN-Review-Model-Development.ipynb ├── DA0101EN-Review-Model-Evaluation-and-Refinement.ipynb ├── House Sales_in_King_Count_USA.ipynb ├── README.md ├── data-wrangling.ipynb ├── exploratory-data-analysis.ipynb ├── model-development.ipynb ├── model-evaluation-and-refinement.ipynb └── review-introduction.ipynb /DA0101EN-Review-Data-Wrangling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 93 | "You can find the \"Automobile Data Set\" from the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data. \n", 94 | "We will be using this data set throughout this course.\n", 95 | "
" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | ".replace(A, B, inplace = True)\n", 237 | "to replace A by B" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "import numpy as np\n", 249 | "\n", 250 | "# replace \"?\" to NaN\n", 251 | "df.replace(\"?\", np.nan, inplace = True)\n", 252 | "df.head(5)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "dentify_missing_values\n", 260 | "\n", 261 | "
\n", 296 | "Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, \"True\" represents a missing value, \"False\" means the value is present in the dataset. In the body of the for loop the method \".value_counts()\" counts the number of \"True\" values. \n", 297 | "
" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "for column in missing_data.columns.values.tolist():\n", 309 | " print(column)\n", 310 | " print (missing_data[column].value_counts())\n", 311 | " print(\"\") " 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "Based on the summary above, each column has 205 rows of data, seven columns containing missing data:\n", 319 | "The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).
\n", 675 | "\n", 676 | "In Pandas, we use \n", 677 | ".dtype() to check the data type
\n", 678 | ".astype() to change the data type
" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the \"astype()\" method.
" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "\n", 760 | "Data is usually collected from different agencies with different formats.\n", 761 | "(Data Standardization is also a term for a particular type of data normalization, where we subtract the mean and divide by the standard deviation)\n", 762 | "
\n", 763 | " \n", 764 | "What is Standardization?\n", 765 | "Standardization is the process of transforming data into a common format which allows the researcher to make the meaningful comparison.\n", 766 | "
\n", 767 | "\n", 768 | "Example\n", 769 | "Transform mpg to L/100km:
\n", 770 | "In our dataset, the fuel consumption columns \"city-mpg\" and \"highway-mpg\" are represented by mpg (miles per gallon) unit. Assume we are developing an application in a country that accept the fuel consumption with L/100km standard
\n", 771 | "We will need to apply data transformation to transform mpg into L/100km?
\n" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "The formula for unit conversion is
\n", 779 | "L/100km = 235 / mpg\n", 780 | "
We can do many mathematical operations directly in Pandas.
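For instance, a hedged sketch of the same formula applied to the \"highway-mpg\" column (the notebook performs the \"city-mpg\" conversion in the cells below; the new column name is an assumption):

```python
# Apply the L/100km conversion to "highway-mpg" as well (235 divided by mpg).
df["highway-L/100km"] = 235 / df["highway-mpg"]
```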
" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": null, 786 | "metadata": { 787 | "collapsed": false 788 | }, 789 | "outputs": [], 790 | "source": [ 791 | "df.head()" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": null, 797 | "metadata": { 798 | "collapsed": false 799 | }, 800 | "outputs": [], 801 | "source": [ 802 | "# Convert mpg to L/100km by mathematical operation (235 divided by mpg)\n", 803 | "df['city-L/100km'] = 235/df[\"city-mpg\"]\n", 804 | "\n", 805 | "# check your transformed data \n", 806 | "df.head()" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling variable so the variable values range from 0 to 1\n", 859 | "
\n", 860 | "\n", 861 | "Example\n", 862 | "To demonstrate normalization, let's say we want to scale the columns \"length\", \"width\" and \"height\"
\n", 863 | "Target:would like to Normalize those variables so their value ranges from 0 to 1.
\n", 864 | "Approach: replace original value by (original value)/(maximum value)
" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": null, 870 | "metadata": { 871 | "collapsed": false 872 | }, 873 | "outputs": [], 874 | "source": [ 875 | "# replace (original value) by (original value)/(maximum value)\n", 876 | "df['length'] = df['length']/df['length'].max()\n", 877 | "df['width'] = df['width']/df['width'].max()" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "\n", 931 | " Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.\n", 932 | "
\n", 933 | "\n", 934 | "Example: \n", 935 | "In our dataset, \"horsepower\" is a real valued variable ranging from 48 to 288, it has 57 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis?
\n", 936 | "\n", 937 | "We will use the Pandas method 'cut' to segment the 'horsepower' column into 3 bins
\n", 938 | "\n" 939 | ] 940 | }, 941 | { 942 | "cell_type": "markdown", 943 | "metadata": {}, 944 | "source": [ 945 | "We would like 3 bins of equal size bandwidth so we use numpy's linspace(start_value, end_value, numbers_generated
function.
Since we want to include the minimum value of horsepower we want to set start_value=min(df[\"horsepower\"]).
\n", 996 | "Since we want to include the maximum value of horsepower we want to set end_value=max(df[\"horsepower\"]).
\n", 997 | "Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated=4.
" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "markdown", 1002 | "metadata": {}, 1003 | "source": [ 1004 | "We build a bin array, with a minimum value to a maximum value, with bandwidth calculated above. The bins will be values used to determine when one bin ends and another begins." 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": null, 1010 | "metadata": { 1011 | "collapsed": false 1012 | }, 1013 | "outputs": [], 1014 | "source": [ 1015 | "bins = np.linspace(min(df[\"horsepower\"]), max(df[\"horsepower\"]), 4)\n", 1016 | "bins" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "metadata": {}, 1022 | "source": [ 1023 | " We set group names:" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": null, 1029 | "metadata": { 1030 | "collapsed": true 1031 | }, 1032 | "outputs": [], 1033 | "source": [ 1034 | "group_names = ['Low', 'Medium', 'High']" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | " We apply the function \"cut\" the determine what each value of \"df['horsepower']\" belongs to. " 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "code", 1046 | "execution_count": null, 1047 | "metadata": { 1048 | "collapsed": false 1049 | }, 1050 | "outputs": [], 1051 | "source": [ 1052 | "df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )\n", 1053 | "df[['horsepower','horsepower-binned']].head(20)" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "markdown", 1058 | "metadata": {}, 1059 | "source": [ 1060 | "Lets see the number of vehicles in each bin." 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "code", 1065 | "execution_count": null, 1066 | "metadata": {}, 1067 | "outputs": [], 1068 | "source": [ 1069 | "df[\"horsepower-binned\"].value_counts()" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "Lets plot the distribution of each bin." 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": null, 1082 | "metadata": {}, 1083 | "outputs": [], 1084 | "source": [ 1085 | "%matplotlib inline\n", 1086 | "import matplotlib as plt\n", 1087 | "from matplotlib import pyplot\n", 1088 | "pyplot.bar(group_names, df[\"horsepower-binned\"].value_counts())\n", 1089 | "\n", 1090 | "# set x/y labels and plot title\n", 1091 | "plt.pyplot.xlabel(\"horsepower\")\n", 1092 | "plt.pyplot.ylabel(\"count\")\n", 1093 | "plt.pyplot.title(\"horsepower bins\")" 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "\n", 1101 | " Check the dataframe above carefully, you will find the last column provides the bins for \"horsepower\" with 3 categories (\"Low\",\"Medium\" and \"High\"). \n", 1102 | "
\n", 1103 | "\n", 1104 | " We successfully narrow the intervals from 57 to 3!\n", 1105 | "
" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "markdown", 1110 | "metadata": {}, 1111 | "source": [ 1112 | "\n", 1153 | " An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. \n", 1154 | "
\n", 1155 | "\n", 1156 | "Why we use indicator variables?\n", 1157 | "\n", 1158 | " So we can use categorical variables for regression analysis in the later modules.\n", 1159 | "
\n", 1160 | "Example\n", 1161 | "\n", 1162 | " We see the column \"fuel-type\" has two unique values, \"gas\" or \"diesel\". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert \"fuel-type\" into indicator variables.\n", 1163 | "
\n", 1164 | "\n", 1165 | "\n", 1166 | " We will use the panda's method 'get_dummies' to assign numerical values to different categories of fuel type. \n", 1167 | "
" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": null, 1173 | "metadata": { 1174 | "collapsed": false 1175 | }, 1176 | "outputs": [], 1177 | "source": [ 1178 | "df.columns" 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "markdown", 1183 | "metadata": {}, 1184 | "source": [ 1185 | "get indicator variables and assign it to data frame \"dummy_variable_1\" " 1186 | ] 1187 | }, 1188 | { 1189 | "cell_type": "code", 1190 | "execution_count": null, 1191 | "metadata": { 1192 | "collapsed": false 1193 | }, 1194 | "outputs": [], 1195 | "source": [ 1196 | "dummy_variable_1 = pd.get_dummies(df[\"fuel-type\"])\n", 1197 | "dummy_variable_1.head()" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "markdown", 1202 | "metadata": {}, 1203 | "source": [ 1204 | "change column names for clarity " 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "code", 1209 | "execution_count": null, 1210 | "metadata": { 1211 | "collapsed": false 1212 | }, 1213 | "outputs": [], 1214 | "source": [ 1215 | "dummy_variable_1.rename(columns={'fuel-type-diesel':'gas', 'fuel-type-diesel':'diesel'}, inplace=True)\n", 1216 | "dummy_variable_1.head()" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "markdown", 1221 | "metadata": {}, 1222 | "source": [ 1223 | "We now have the value 0 to represent \"gas\" and 1 to represent \"diesel\" in the column \"fuel-type\". We will now insert this column back into our original dataset. " 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": null, 1229 | "metadata": { 1230 | "collapsed": true 1231 | }, 1232 | "outputs": [], 1233 | "source": [ 1234 | "# merge data frame \"df\" and \"dummy_variable_1\" \n", 1235 | "df = pd.concat([df, dummy_variable_1], axis=1)\n", 1236 | "\n", 1237 | "# drop original column \"fuel-type\" from \"df\"\n", 1238 | "df.drop(\"fuel-type\", axis = 1, inplace=True)" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "code", 1243 | "execution_count": null, 1244 | "metadata": { 1245 | "collapsed": false 1246 | }, 1247 | "outputs": [], 1248 | "source": [ 1249 | "df.head()" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "markdown", 1254 | "metadata": {}, 1255 | "source": [ 1256 | "The last two columns are now the indicator variable representation of the fuel-type variable. It's all 0s and 1s now." 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "markdown", 1261 | "metadata": {}, 1262 | "source": [ 1263 | "Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "metadata": {}, 1389 | "source": [ 1390 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 1392 | ] 1393 | } 1394 | ], 1395 | "metadata": { 1396 | "anaconda-cloud": {}, 1397 | "kernelspec": { 1398 | "display_name": "Python 3", 1399 | "language": "python", 1400 | "name": "python3" 1401 | }, 1402 | "language_info": { 1403 | "codemirror_mode": { 1404 | "name": "ipython", 1405 | "version": 3 1406 | }, 1407 | "file_extension": ".py", 1408 | "mimetype": "text/x-python", 1409 | "name": "python", 1410 | "nbconvert_exporter": "python", 1411 | "pygments_lexer": "ipython3", 1412 | "version": "3.6.7" 1413 | } 1414 | }, 1415 | "nbformat": 4, 1416 | "nbformat_minor": 2 1417 | } 1418 | -------------------------------------------------------------------------------- /DA0101EN-Review-Exploratory-Data-Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.
\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "# list the data types for each column\n", 188 | "print(df.dtypes)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "Find the correlation between the following columns: bore, stroke,compression-ratio , and horsepower.
\n", 248 | "Hint: if you would like to select those columns use the following syntax: df[['bore','stroke' ,'compression-ratio','horsepower']]
\n", 249 | "Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type \"int64\" or \"float64\". A great way to visualize these variables is by using scatterplots with fitted lines.
\n", 283 | "\n", 284 | "In order to start understanding the (linear) relationship between an individual variable and the price. We can do this by using \"regplot\", which plots the scatterplot plus the fitted regression line for the data.
" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | " Let's see several examples of different linear relationships:" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.
" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | " We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": { 340 | "collapsed": false 341 | }, 342 | "outputs": [], 343 | "source": [ 344 | "df[[\"engine-size\", \"price\"]].corr()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "Highway mpg is a potential predictor variable of price " 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": { 358 | "collapsed": false 359 | }, 360 | "outputs": [], 361 | "source": [ 362 | "sns.regplot(x=\"highway-mpg\", y=\"price\", data=df)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.
" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": { 383 | "collapsed": false 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "df[['highway-mpg', 'price']].corr()" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it's it is not a reliable variable.
" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616 " 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": { 433 | "collapsed": false 434 | }, 435 | "outputs": [], 436 | "source": [ 437 | "df[['peak-rpm','price']].corr()" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "Find the correlation between x=\"stroke\", y=\"price\".
\n", 448 | "Hint: if you would like to select those columns use the following syntax: df[[\"stroke\",\"price\"]]
\n", 449 | "Given the correlation results between \"price\" and \"stroke\" do you expect a linear relationship?
\n", 486 | "Verify your results using the function \"regplot()\".
\n", 487 | "These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type \"object\" or \"int64\". A good way to visualize categorical variables is by using boxplots.
" 524 | ] 525 | }, 526 | { 527 | "cell_type": "markdown", 528 | "metadata": {}, 529 | "source": [ 530 | "Let's look at the relationship between \"body-style\" and \"price\"." 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": { 537 | "collapsed": false, 538 | "scrolled": true 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "sns.boxplot(x=\"body-style\", y=\"price\", data=df)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "We see that the distributions of price between the different body-style categories have a significant overlap, and so body-style would not be a good predictor of price. Let's examine engine \"engine-location\" and \"price\":
" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": { 556 | "collapsed": false, 557 | "scrolled": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "sns.boxplot(x=\"engine-location\", y=\"price\", data=df)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "Here we see that the distribution of price between these two engine-location categories, front and rear, are distinct enough to take engine-location as a potential good predictor of price.
" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | " Let's examine \"drive-wheels\" and \"price\"." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": { 582 | "collapsed": false, 583 | "scrolled": false 584 | }, 585 | "outputs": [], 586 | "source": [ 587 | "# drive-wheels\n", 588 | "sns.boxplot(x=\"drive-wheels\", y=\"price\", data=df)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "Here we see that the distribution of price between the different drive-wheels categories differs; as such drive-wheels could potentially be a predictor of price.
" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "Let's first take a look at the variables by utilizing a description method.
\n", 610 | "\n", 611 | "The describe function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.
\n", 612 | "\n", 613 | "This will show:\n", 614 | "Value-counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the \"value_counts\" method on the column 'drive-wheels'. Don’t forget the method \"value_counts\" only works on Pandas series, not Pandas Dataframes. As a result, we only include one bracket \"df['drive-wheels']\" not two brackets \"df[['drive-wheels']]\".
" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": { 679 | "collapsed": false 680 | }, 681 | "outputs": [], 682 | "source": [ 683 | "df['drive-wheels'].value_counts()" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "We can convert the series to a Dataframe as follows :" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "metadata": { 697 | "collapsed": false 698 | }, 699 | "outputs": [], 700 | "source": [ 701 | "df['drive-wheels'].value_counts().to_frame()" 702 | ] 703 | }, 704 | { 705 | "cell_type": "markdown", 706 | "metadata": {}, 707 | "source": [ 708 | "Let's repeat the above steps but save the results to the dataframe \"drive_wheels_counts\" and rename the column 'drive-wheels' to 'value_counts'." 709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": null, 714 | "metadata": { 715 | "collapsed": false 716 | }, 717 | "outputs": [], 718 | "source": [ 719 | "drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()\n", 720 | "drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)\n", 721 | "drive_wheels_counts" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | " Now let's rename the index to 'drive-wheels':" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": { 735 | "collapsed": false 736 | }, 737 | "outputs": [], 738 | "source": [ 739 | "drive_wheels_counts.index.name = 'drive-wheels'\n", 740 | "drive_wheels_counts" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": {}, 746 | "source": [ 747 | "We can repeat the above process for the variable 'engine-location'." 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": { 754 | "collapsed": false 755 | }, 756 | "outputs": [], 757 | "source": [ 758 | "# engine-location as variable\n", 759 | "engine_loc_counts = df['engine-location'].value_counts().to_frame()\n", 760 | "engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)\n", 761 | "engine_loc_counts.index.name = 'engine-location'\n", 762 | "engine_loc_counts.head(10)" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "Examining the value counts of the engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 198 with an engine in the front, this result is skewed. Thus, we are not able to draw any conclusions about the engine location.
" 770 | ] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "metadata": {}, 775 | "source": [ 776 | "The \"groupby\" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.
\n", 784 | "\n", 785 | "For example, let's group by the variable \"drive-wheels\". We see that there are 3 different categories of drive wheels.
" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "metadata": { 792 | "collapsed": false 793 | }, 794 | "outputs": [], 795 | "source": [ 796 | "df['drive-wheels'].unique()" 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "metadata": {}, 802 | "source": [ 803 | "If we want to know, on average, which type of drive wheel is most valuable, we can group \"drive-wheels\" and then average them.
\n", 804 | "\n", 805 | "We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable \"df_group_one\".
" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": null, 811 | "metadata": { 812 | "collapsed": true 813 | }, 814 | "outputs": [], 815 | "source": [ 816 | "df_group_one = df[['drive-wheels','body-style','price']]" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "We can then calculate the average price for each of the different categories of data." 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": null, 829 | "metadata": { 830 | "collapsed": false 831 | }, 832 | "outputs": [], 833 | "source": [ 834 | "# grouping results\n", 835 | "df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()\n", 836 | "df_group_one" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": {}, 842 | "source": [ 843 | "From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.
\n", 844 | "\n", 845 | "You can also group with multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combinations 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.
" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": null, 851 | "metadata": { 852 | "collapsed": false 853 | }, 854 | "outputs": [], 855 | "source": [ 856 | "# grouping results\n", 857 | "df_gptest = df[['drive-wheels','body-style','price']]\n", 858 | "grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()\n", 859 | "grouped_test1" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row. We can convert the dataframe to a pivot table using the method \"pivot \" to create a pivot table from the groups.
\n", 867 | "\n", 868 | "In this case, we will leave the drive-wheel variable as the rows of the table, and pivot body-style to become the columns of the table:
" 869 | ] 870 | }, 871 | { 872 | "cell_type": "code", 873 | "execution_count": null, 874 | "metadata": { 875 | "collapsed": false 876 | }, 877 | "outputs": [], 878 | "source": [ 879 | "grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')\n", 880 | "grouped_pivot" 881 | ] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": {}, 886 | "source": [ 887 | "Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.
" 888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": null, 893 | "metadata": { 894 | "collapsed": false, 895 | "scrolled": true 896 | }, 897 | "outputs": [], 898 | "source": [ 899 | "grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0\n", 900 | "grouped_pivot" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "Use the \"groupby\" function to find the average \"price\" of each car based on \"body-style\" ?
\n", 911 | "The heatmap plots the target variable (price) proportional to colour with respect to the variables 'drive-wheel' and 'body-style' in the vertical and horizontal axis respectively. This allows us to visualize how the price is related to 'drive-wheel' and 'body-style'.
\n", 994 | "\n", 995 | "The default labels convey no useful information to us. Let's change that:
" 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "metadata": { 1002 | "collapsed": false 1003 | }, 1004 | "outputs": [], 1005 | "source": [ 1006 | "fig, ax = plt.subplots()\n", 1007 | "im = ax.pcolor(grouped_pivot, cmap='RdBu')\n", 1008 | "\n", 1009 | "#label names\n", 1010 | "row_labels = grouped_pivot.columns.levels[1]\n", 1011 | "col_labels = grouped_pivot.index\n", 1012 | "\n", 1013 | "#move ticks and labels to the center\n", 1014 | "ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)\n", 1015 | "ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)\n", 1016 | "\n", 1017 | "#insert labels\n", 1018 | "ax.set_xticklabels(row_labels, minor=False)\n", 1019 | "ax.set_yticklabels(col_labels, minor=False)\n", 1020 | "\n", 1021 | "#rotate label if too long\n", 1022 | "plt.xticks(rotation=90)\n", 1023 | "\n", 1024 | "fig.colorbar(im)\n", 1025 | "plt.show()" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python Visualizations course.
\n", 1033 | "\n", 1034 | "The main question we want to answer in this module, is \"What are the main characteristics which have the most impact on the car price?\".
\n", 1035 | "\n", 1036 | "To get a better measure of the important characteristics, we look at the correlation of these variables with the car price, in other words: how is the car price dependent on this variable?
" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "markdown", 1041 | "metadata": {}, 1042 | "source": [ 1043 | "Correlation: a measure of the extent of interdependence between variables.
\n", 1051 | "\n", 1052 | "Causation: the relationship between cause and effect between two variables.
\n", 1053 | "\n", 1054 | "It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler the determining causation as causation may require independent experimentation.
" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "markdown", 1059 | "metadata": {}, 1060 | "source": [ 1061 | "The Pearson Correlation measures the linear dependence between two variables X and Y.
\n", 1063 | "The resulting coefficient is a value between -1 and 1 inclusive, where:
\n", 1064 | "Pearson Correlation is the default method of the function \"corr\". Like before we can calculate the Pearson Correlation of the of the 'int64' or 'float64' variables.
" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "code", 1080 | "execution_count": null, 1081 | "metadata": { 1082 | "collapsed": false 1083 | }, 1084 | "outputs": [], 1085 | "source": [ 1086 | "df.corr()" 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "markdown", 1091 | "metadata": {}, 1092 | "source": [ 1093 | " sometimes we would like to know the significant of the correlation estimate. " 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "P-value: \n", 1101 | "What is this P-value? The P-value is the probability value that the correlation between these two variables is statistically significant. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.
\n", 1102 | "\n", 1103 | "By convention, when the\n", 1104 | "Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585)
" 1162 | ] 1163 | }, 1164 | { 1165 | "cell_type": "markdown", 1166 | "metadata": {}, 1167 | "source": [ 1168 | "Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1)
" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "markdown", 1201 | "metadata": {}, 1202 | "source": [ 1203 | "Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).
" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "markdown", 1230 | "metadata": {}, 1231 | "source": [ 1232 | "Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).
" 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "markdown", 1299 | "metadata": {}, 1300 | "source": [ 1301 | "Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).
" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "markdown", 1329 | "metadata": {}, 1330 | "source": [ 1331 | "Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).
" 1359 | ] 1360 | }, 1361 | { 1362 | "cell_type": "markdown", 1363 | "metadata": {}, 1364 | "source": [ 1365 | " We can relate the process for each 'City-mpg' and 'Highway-mpg':" 1366 | ] 1367 | }, 1368 | { 1369 | "cell_type": "markdown", 1370 | "metadata": {}, 1371 | "source": [ 1372 | "Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of ~ -0.687 shows that the relationship is negative and moderately strong.
" 1393 | ] 1394 | }, 1395 | { 1396 | "cell_type": "markdown", 1397 | "metadata": {}, 1398 | "source": [ 1399 | "The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:
\n", 1435 | "\n", 1436 | "F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
\n", 1437 | "\n", 1438 | "P-value: P-value tells how statistically significant is our calculated score value.
\n", 1439 | "\n", 1440 | "If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.
" 1441 | ] 1442 | }, 1443 | { 1444 | "cell_type": "markdown", 1445 | "metadata": {}, 1446 | "source": [ 1447 | "Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average before hand.
\n", 1455 | "\n", 1456 | "Let's see if different types 'drive-wheels' impact 'price', we group the data.
" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "markdown", 1461 | "metadata": {}, 1462 | "source": [ 1463 | " Let's see if different types 'drive-wheels' impact 'price', we group the data." 1464 | ] 1465 | }, 1466 | { 1467 | "cell_type": "code", 1468 | "execution_count": null, 1469 | "metadata": { 1470 | "collapsed": false 1471 | }, 1472 | "outputs": [], 1473 | "source": [ 1474 | "grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])\n", 1475 | "grouped_test2.head(2)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": null, 1481 | "metadata": {}, 1482 | "outputs": [], 1483 | "source": [ 1484 | "df_gptest" 1485 | ] 1486 | }, 1487 | { 1488 | "cell_type": "markdown", 1489 | "metadata": {}, 1490 | "source": [ 1491 | " We can obtain the values of the method group using the method \"get_group\". " 1492 | ] 1493 | }, 1494 | { 1495 | "cell_type": "code", 1496 | "execution_count": null, 1497 | "metadata": { 1498 | "collapsed": false 1499 | }, 1500 | "outputs": [], 1501 | "source": [ 1502 | "grouped_test2.get_group('4wd')['price']" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "markdown", 1507 | "metadata": {}, 1508 | "source": [ 1509 | "we can use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-value." 1510 | ] 1511 | }, 1512 | { 1513 | "cell_type": "code", 1514 | "execution_count": null, 1515 | "metadata": { 1516 | "collapsed": false 1517 | }, 1518 | "outputs": [], 1519 | "source": [ 1520 | "# ANOVA\n", 1521 | "f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price']) \n", 1522 | " \n", 1523 | "print( \"ANOVA results: F=\", f_val, \", P =\", p_val) " 1524 | ] 1525 | }, 1526 | { 1527 | "cell_type": "markdown", 1528 | "metadata": {}, 1529 | "source": [ 1530 | "This is a great result, with a large F test score showing a strong correlation and a P value of almost 0 implying almost certain statistical significance. But does this mean all three tested groups are all this highly correlated? 
" 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "markdown", 1535 | "metadata": {}, 1536 | "source": [ 1537 | "#### Separately: fwd and rwd" 1538 | ] 1539 | }, 1540 | { 1541 | "cell_type": "code", 1542 | "execution_count": null, 1543 | "metadata": { 1544 | "collapsed": false 1545 | }, 1546 | "outputs": [], 1547 | "source": [ 1548 | "f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price']) \n", 1549 | " \n", 1550 | "print( \"ANOVA results: F=\", f_val, \", P =\", p_val )" 1551 | ] 1552 | }, 1553 | { 1554 | "cell_type": "markdown", 1555 | "metadata": {}, 1556 | "source": [ 1557 | " Let's examine the other groups " 1558 | ] 1559 | }, 1560 | { 1561 | "cell_type": "markdown", 1562 | "metadata": {}, 1563 | "source": [ 1564 | "#### 4wd and rwd" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "code", 1569 | "execution_count": null, 1570 | "metadata": { 1571 | "collapsed": false, 1572 | "scrolled": true 1573 | }, 1574 | "outputs": [], 1575 | "source": [ 1576 | "f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price']) \n", 1577 | " \n", 1578 | "print( \"ANOVA results: F=\", f_val, \", P =\", p_val) " 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "markdown", 1583 | "metadata": {}, 1584 | "source": [ 1585 | "We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:
\n", 1613 | "\n", 1614 | "Continuous numerical variables:\n", 1615 | "As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.
" 1633 | ] 1634 | }, 1635 | { 1636 | "cell_type": "markdown", 1637 | "metadata": {}, 1638 | "source": [ 1639 | "Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 1661 | ] 1662 | }, 1663 | { 1664 | "cell_type": "markdown", 1665 | "metadata": {}, 1666 | "source": [ 1667 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 1669 | ] 1670 | } 1671 | ], 1672 | "metadata": { 1673 | "anaconda-cloud": {}, 1674 | "kernelspec": { 1675 | "display_name": "Python 3", 1676 | "language": "python", 1677 | "name": "python3" 1678 | }, 1679 | "language_info": { 1680 | "codemirror_mode": { 1681 | "name": "ipython", 1682 | "version": 3 1683 | }, 1684 | "file_extension": ".py", 1685 | "mimetype": "text/x-python", 1686 | "name": "python", 1687 | "nbconvert_exporter": "python", 1688 | "pygments_lexer": "ipython3", 1689 | "version": "3.6.7" 1690 | } 1691 | }, 1692 | "nbformat": 4, 1693 | "nbformat_minor": 2 1694 | } 1695 | -------------------------------------------------------------------------------- /DA0101EN-Review-Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 31 | "In this section, you will learn how to approach data acquisition in various ways, and obtain necessary insights from a dataset. By the end of this lab, you will successfully load the data into Jupyter Notebook, and gain some fundamental insights via Pandas Library.\n", 32 | "
" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "\n",
58 | "There are various formats for a dataset, .csv, .json, .xlsx etc. The dataset can be stored in different places, on your local machine or sometimes online.
\n",
59 | "In this section, you will learn how to load a dataset into our Jupyter Notebook.
\n",
60 | "In our case, the Automobile Dataset is an online source, and it is in CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.\n",
61 | "
\n",
87 | "We use pandas.read_csv()
function to read the csv file. In the bracket, we put the file path along with a quotation mark, so that pandas will read the file into a data frame from that address. The file path can be either an URL or your local file address.
\n",
88 | "Because the data does not include headers, we can add an argument headers = None
inside the read_csv()
method, so that pandas will not automatically set the first row as a header.
\n",
89 | "You can also assign the dataset to any variable you create.\n",
90 | "
dataframe.head(n)
method to check the top n rows of the dataframe; where n is an integer. Contrary to dataframe.head(n)
, dataframe.tail(n)
will show you the bottom n rows of the dataframe.\n"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {
127 | "collapsed": false
128 | },
129 | "outputs": [],
130 | "source": [
131 | "# show the first 5 rows using dataframe.head() method\n",
132 | "print(\"The first 5 rows of the dataframe\") \n",
133 | "df.head(5)"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "\n", 187 | "Take a look at our dataset; pandas automatically set the header by an integer from 0.\n", 188 | "
\n", 189 | "\n", 190 | "To better describe our data we can introduce a header, this information is available at: https://archive.ics.uci.edu/ml/datasets/Automobile\n", 191 | "
\n", 192 | "\n", 193 | "Thus, we have to add headers manually.\n", 194 | "
\n", 195 | "\n",
196 | "Firstly, we create a list \"headers\" that include all column names in order.\n",
197 | "Then, we use dataframe.columns = headers
to replace the headers by the list we created.\n",
198 | "
\n",
301 | "Correspondingly, Pandas enables us to save the dataset to csv by using the dataframe.to_csv()
method, you can add the file path and name along with quotation marks in the brackets.\n",
302 | "
\n", 304 | " For example, if you would save the dataframe df as automobile.csv to your local machine, you may use the syntax below:\n", 305 | "
" 306 | ] 307 | }, 308 | { 309 | "cell_type": "raw", 310 | "metadata": {}, 311 | "source": [ 312 | "df.to_csv(\"automobile.csv\", index=False)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | " We can also read and save other file formats, we can use similar functions to **`pd.read_csv()`** and **`df.to_csv()`** for other data formats, the functions are listed in the following table:\n" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "\n",
346 | "After reading data into Pandas dataframe, it is time for us to explore the dataset.
\n",
347 | "There are several ways to obtain essential insights of the data to help us better understand our dataset.\n",
348 | "
\n",
357 | "Data has a variety of types.
\n",
358 | "The main types stored in Pandas dataframes are object, float, int, bool and datetime64. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:\n",
359 | "
\n",
396 | "As a result, as shown above, it is clear to see that the data type of \"symboling\" and \"curb-weight\" are int64
, \"normalized-losses\" is object
, and \"wheel-base\" is float64
, etc.\n",
397 | "
\n", 399 | "These data types can be changed; we will learn how to accomplish this in a later module.\n", 400 | "
" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "NaN
(Not a Number) values."
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "metadata": {
429 | "collapsed": false
430 | },
431 | "outputs": [],
432 | "source": [
433 | "df.describe()"
434 | ]
435 | },
436 | {
437 | "cell_type": "markdown",
438 | "metadata": {},
439 | "source": [
440 | "\n",
441 | "This shows the statistical summary of all numeric-typed (int, float) columns.
\n",
442 | "For example, the attribute \"symboling\" has 205 counts, the mean value of this column is 0.83, the standard deviation is 1.25, the minimum value is -2, 25th percentile is 0, 50th percentile is 1, 75th percentile is 2, and the maximum value is 3.\n",
443 | "
\n",
444 | "However, what if we would also like to check all the columns including those that are of type object.\n",
445 | "
\n",
446 | "\n",
447 | "You can add an argument include = \"all\"
inside the bracket. Let's try it again.\n",
448 | "
\n",
468 | "Now, it provides the statistical summary of all the columns, including object-typed attributes.
\n",
469 | "We can now see how many unique values, which is the top value and the frequency of top value in the object-typed columns.
\n",
470 | "Some values in the table above show as \"NaN\", this is because those numbers are not available regarding a particular column type.
\n",
471 | "
\n", 482 | "You can select the columns of a data frame by indicating the name of each column, for example, you can select the three columns as follows:\n", 483 | "
\n", 484 | "\n",
485 | " dataframe[[' column 1 ',column 2', 'column 3']]
\n",
486 | "
\n", 488 | "Where \"column\" is the name of the column, you can apply the method \".describe()\" to get the statistics of those columns as follows:\n", 489 | "
\n", 490 | "\n",
491 | " dataframe[[' column 1 ',column 2', 'column 3'] ].describe()
\n",
492 | "
\n",
559 | "Here we are able to see the information of our dataframe, with the top 30 rows and the bottom 30 rows.\n",
560 | "
\n",
561 | "And, it also shows us the whole data frame has 205 rows and 26 columns in total.\n",
562 | "
Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 591 | ] 592 | }, 593 | { 594 | "cell_type": "markdown", 595 | "metadata": {}, 596 | "source": [ 597 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 599 | ] 600 | } 601 | ], 602 | "metadata": { 603 | "anaconda-cloud": {}, 604 | "kernelspec": { 605 | "display_name": "Python 3", 606 | "language": "python", 607 | "name": "python3" 608 | }, 609 | "language_info": { 610 | "codemirror_mode": { 611 | "name": "ipython", 612 | "version": 3 613 | }, 614 | "file_extension": ".py", 615 | "mimetype": "text/x-python", 616 | "name": "python", 617 | "nbconvert_exporter": "python", 618 | "pygments_lexer": "ipython3", 619 | "version": "3.6.7" 620 | } 621 | }, 622 | "nbformat": 4, 623 | "nbformat_minor": 2 624 | } 625 | -------------------------------------------------------------------------------- /DA0101EN-Review-Model-Evaluation-and-Refinement.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "An important step in testing your model is to split your data into training and testing data. We will place the target data price in a separate dataframe y:
" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "y_data = df['price']" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "drop price data in x data" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": true 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "x_data=df.drop('price',axis=1)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "Now we randomly split our data into training and testing data using the function train_test_split. " 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": { 242 | "collapsed": false 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "from sklearn.model_selection import train_test_split\n", 247 | "\n", 248 | "\n", 249 | "x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)\n", 250 | "\n", 251 | "\n", 252 | "print(\"number of test samples :\", x_test.shape[0])\n", 253 | "print(\"number of training samples:\",x_train.shape[0])\n" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "The test_size parameter sets the proportion of data that is split into the testing set. In the above, the testing set is set to 10% of the total dataset. " 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "It turns out that the test data sometimes referred to as the out of sample data is a much better measure of how well your model performs in the real world. One reason for this is overfitting; let's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression so we will explore overfitting in that context.
" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "Let's create Multiple linear regression objects and train the model using 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg' as features." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "metadata": { 626 | "collapsed": false 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "lr = LinearRegression()\n", 631 | "lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "metadata": {}, 637 | "source": [ 638 | "Prediction using training data:" 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": null, 644 | "metadata": { 645 | "collapsed": false 646 | }, 647 | "outputs": [], 648 | "source": [ 649 | "yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])\n", 650 | "yhat_train[0:5]" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "metadata": {}, 656 | "source": [ 657 | "Prediction using test data: " 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": { 664 | "collapsed": false 665 | }, 666 | "outputs": [], 667 | "source": [ 668 | "yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])\n", 669 | "yhat_test[0:5]" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "Let's perform some model evaluation using our training and testing data separately. First we import the seaborn and matplotlibb library for plotting." 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": { 683 | "collapsed": true 684 | }, 685 | "outputs": [], 686 | "source": [ 687 | "import matplotlib.pyplot as plt\n", 688 | "%matplotlib inline\n", 689 | "import seaborn as sns" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": {}, 695 | "source": [ 696 | "Let's examine the distribution of the predicted values of the training data." 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": { 703 | "collapsed": false 704 | }, 705 | "outputs": [], 706 | "source": [ 707 | "Title = 'Distribution Plot of Predicted Value Using Training Data vs Training Data Distribution'\n", 708 | "DistributionPlot(y_train, yhat_train, \"Actual Values (Train)\", \"Predicted Values (Train)\", Title)" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "metadata": {}, 714 | "source": [ 715 | "Figure 1: Plot of predicted values using the training data compared to the training data. " 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "So far the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values. 
" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": { 729 | "collapsed": false 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "Title='Distribution Plot of Predicted Value Using Test Data vs Data Distribution of Test Data'\n", 734 | "DistributionPlot(y_test,yhat_test,\"Actual Values (Test)\",\"Predicted Values (Test)\",Title)" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": {}, 740 | "source": [ 741 | "Figur 2: Plot of predicted value using the test data compared to the test data. " 742 | ] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": {}, 747 | "source": [ 748 | "Comparing Figure 1 and Figure 2; it is evident the distribution of the test data in Figure 1 is much better at fitting the data. This difference in Figure 2 is apparent where the ranges are from 5000 to 15 000. This is where the distribution shape is exceptionally different. Let's see if polynomial regression also exhibits a drop in the prediction accuracy when analysing the test dataset.
" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "metadata": { 755 | "collapsed": false 756 | }, 757 | "outputs": [], 758 | "source": [ 759 | "from sklearn.preprocessing import PolynomialFeatures" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": {}, 765 | "source": [ 766 | "Overfitting occurs when the model fits the noise, not the underlying process. Therefore when testing your model using the test-set, your model does not perform as well as it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 5 polynomial model.
" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "Let's use 55 percent of the data for testing and the rest for training:" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": { 781 | "collapsed": false 782 | }, 783 | "outputs": [], 784 | "source": [ 785 | "x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)" 786 | ] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": {}, 791 | "source": [ 792 | "We will perform a degree 5 polynomial transformation on the feature 'horse power'. " 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": null, 798 | "metadata": { 799 | "collapsed": false 800 | }, 801 | "outputs": [], 802 | "source": [ 803 | "pr = PolynomialFeatures(degree=5)\n", 804 | "x_train_pr = pr.fit_transform(x_train[['horsepower']])\n", 805 | "x_test_pr = pr.fit_transform(x_test[['horsepower']])\n", 806 | "pr" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "Now let's create a linear regression model \"poly\" and train it." 814 | ] 815 | }, 816 | { 817 | "cell_type": "code", 818 | "execution_count": null, 819 | "metadata": { 820 | "collapsed": false 821 | }, 822 | "outputs": [], 823 | "source": [ 824 | "poly = LinearRegression()\n", 825 | "poly.fit(x_train_pr, y_train)" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "We can see the output of our model using the method \"predict.\" then assign the values to \"yhat\"." 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": { 839 | "collapsed": false 840 | }, 841 | "outputs": [], 842 | "source": [ 843 | "yhat = poly.predict(x_test_pr)\n", 844 | "yhat[0:5]" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "Let's take the first five predicted values and compare it to the actual targets. " 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": null, 857 | "metadata": { 858 | "collapsed": false 859 | }, 860 | "outputs": [], 861 | "source": [ 862 | "print(\"Predicted values:\", yhat[0:4])\n", 863 | "print(\"True values:\", y_test[0:4].values)" 864 | ] 865 | }, 866 | { 867 | "cell_type": "markdown", 868 | "metadata": {}, 869 | "source": [ 870 | "We will use the function \"PollyPlot\" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function." 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": null, 876 | "metadata": { 877 | "collapsed": false, 878 | "scrolled": false 879 | }, 880 | "outputs": [], 881 | "source": [ 882 | "PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)" 883 | ] 884 | }, 885 | { 886 | "cell_type": "markdown", 887 | "metadata": {}, 888 | "source": [ 889 | "Figur 4 A polynomial regression model, red dots represent training data, green dots represent test data, and the blue line represents the model prediction. " 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "metadata": {}, 895 | "source": [ 896 | "We see that the estimated function appears to track the data but around 200 horsepower, the function begins to diverge from the data points. 
" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": {}, 902 | "source": [ 903 | " R^2 of the training data:" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": { 910 | "collapsed": false 911 | }, 912 | "outputs": [], 913 | "source": [ 914 | "poly.score(x_train_pr, y_train)" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "metadata": {}, 920 | "source": [ 921 | " R^2 of the test data:" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": null, 927 | "metadata": { 928 | "collapsed": false 929 | }, 930 | "outputs": [], 931 | "source": [ 932 | "poly.score(x_test_pr, y_test)" 933 | ] 934 | }, 935 | { 936 | "cell_type": "markdown", 937 | "metadata": {}, 938 | "source": [ 939 | "We see the R^2 for the training data is 0.5567 while the R^2 on the test data was -29.87. The lower the R^2, the worse the model, a Negative R^2 is a sign of overfitting." 940 | ] 941 | }, 942 | { 943 | "cell_type": "markdown", 944 | "metadata": {}, 945 | "source": [ 946 | "Let's see how the R^2 changes on the test data for different order polynomials and plot the results:" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": null, 952 | "metadata": { 953 | "collapsed": false 954 | }, 955 | "outputs": [], 956 | "source": [ 957 | "Rsqu_test = []\n", 958 | "\n", 959 | "order = [1, 2, 3, 4]\n", 960 | "for n in order:\n", 961 | " pr = PolynomialFeatures(degree=n)\n", 962 | " \n", 963 | " x_train_pr = pr.fit_transform(x_train[['horsepower']])\n", 964 | " \n", 965 | " x_test_pr = pr.fit_transform(x_test[['horsepower']]) \n", 966 | " \n", 967 | " lr.fit(x_train_pr, y_train)\n", 968 | " \n", 969 | " Rsqu_test.append(lr.score(x_test_pr, y_test))\n", 970 | "\n", 971 | "plt.plot(order, Rsqu_test)\n", 972 | "plt.xlabel('order')\n", 973 | "plt.ylabel('R^2')\n", 974 | "plt.title('R^2 Using Test Data')\n", 975 | "plt.text(3, 0.75, 'Maximum R^2 ') " 976 | ] 977 | }, 978 | { 979 | "cell_type": "markdown", 980 | "metadata": {}, 981 | "source": [ 982 | "We see the R^2 gradually increases until an order three polynomial is used. Then the R^2 dramatically decreases at four." 983 | ] 984 | }, 985 | { 986 | "cell_type": "markdown", 987 | "metadata": {}, 988 | "source": [ 989 | "The following function will be used in the next section; please run the cell." 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": null, 995 | "metadata": { 996 | "collapsed": true 997 | }, 998 | "outputs": [], 999 | "source": [ 1000 | "def f(order, test_data):\n", 1001 | " x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_data, random_state=0)\n", 1002 | " pr = PolynomialFeatures(degree=order)\n", 1003 | " x_train_pr = pr.fit_transform(x_train[['horsepower']])\n", 1004 | " x_test_pr = pr.fit_transform(x_test[['horsepower']])\n", 1005 | " poly = LinearRegression()\n", 1006 | " poly.fit(x_train_pr,y_train)\n", 1007 | " PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train,y_test, poly, pr)" 1008 | ] 1009 | }, 1010 | { 1011 | "cell_type": "markdown", 1012 | "metadata": {}, 1013 | "source": [ 1014 | "The following interface allows you to experiment with different polynomial orders and different amounts of data. 
" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "code", 1019 | "execution_count": null, 1020 | "metadata": { 1021 | "collapsed": false 1022 | }, 1023 | "outputs": [], 1024 | "source": [ 1025 | "interact(f, order=(0, 6, 1), test_data=(0.05, 0.95, 0.05))" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 1627 | ] 1628 | }, 1629 | { 1630 | "cell_type": "markdown", 1631 | "metadata": {}, 1632 | "source": [ 1633 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 1635 | ] 1636 | } 1637 | ], 1638 | "metadata": { 1639 | "anaconda-cloud": {}, 1640 | "kernelspec": { 1641 | "display_name": "Python 3", 1642 | "language": "python", 1643 | "name": "python3" 1644 | }, 1645 | "language_info": { 1646 | "codemirror_mode": { 1647 | "name": "ipython", 1648 | "version": 3 1649 | }, 1650 | "file_extension": ".py", 1651 | "mimetype": "text/x-python", 1652 | "name": "python", 1653 | "nbconvert_exporter": "python", 1654 | "pygments_lexer": "ipython3", 1655 | "version": "3.7.3" 1656 | } 1657 | }, 1658 | "nbformat": 4, 1659 | "nbformat_minor": 2 1660 | } 1661 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data-Analysis-with-Python-by-IBM-on-Coursera 2 | Answer keys for course - Data Analysis with Python by IBM on Coursera 3 | --------------------------------------------------------------------------------