├── DA0101EN-Review-Data-Wrangling.ipynb ├── DA0101EN-Review-Exploratory-Data-Analysis.ipynb ├── DA0101EN-Review-Introduction.ipynb ├── DA0101EN-Review-Model-Development.ipynb ├── DA0101EN-Review-Model-Evaluation-and-Refinement.ipynb ├── House Sales_in_King_Count_USA.ipynb ├── README.md ├── data-wrangling.ipynb ├── exploratory-data-analysis.ipynb ├── model-development.ipynb ├── model-evaluation-and-refinement.ipynb └── review-introduction.ipynb /DA0101EN-Review-Data-Wrangling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 8 | " \n", 9 | " \n", 10 | " \n", 11 | "
" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "\n", 19 | "\n", 20 | "

Data Analysis with Python

" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "

Data Wrangling

" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "

Welcome!

\n", 35 | "\n", 36 | "By the end of this notebook, you will have learned the basics of Data Wrangling! " 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "

Table of Contents

\n", 44 | "\n", 45 | "
\n", 46 | "\n", 59 | " \n", 60 | "Estimated Time Needed: 30 min\n", 61 | "
\n", 62 | " \n", 63 | "
" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "

What is the purpose of Data Wrangling?

" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis." 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "

What is the fuel consumption (L/100km) rate for the diesel car?

" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "

Import data

\n", 92 | "

\n", 93 | "You can find the \"Automobile Data Set\" from the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data. \n", 94 | "We will be using this data set throughout this course.\n", 95 | "

" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "

Import pandas

" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "import pandas as pd\n", 114 | "import matplotlib.pylab as plt" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "

Reading the data set from the URL and adding the related headers.

" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "URL of the dataset" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "This dataset was hosted on IBM Cloud object click HERE for free storage " 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": { 142 | "collapsed": true 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "filename = \"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv\"" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | " Python list headers containing name of headers " 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": true 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "headers = [\"symboling\",\"normalized-losses\",\"make\",\"fuel-type\",\"aspiration\", \"num-of-doors\",\"body-style\",\n", 165 | " \"drive-wheels\",\"engine-location\",\"wheel-base\", \"length\",\"width\",\"height\",\"curb-weight\",\"engine-type\",\n", 166 | " \"num-of-cylinders\", \"engine-size\",\"fuel-system\",\"bore\",\"stroke\",\"compression-ratio\",\"horsepower\",\n", 167 | " \"peak-rpm\",\"city-mpg\",\"highway-mpg\",\"price\"]" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "Use the Pandas method read_csv() to load the data from the web address. Set the parameter \"names\" equal to the Python list \"headers\"." 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "outputs": [], 184 | "source": [ 185 | "df = pd.read_csv(filename, names = headers)" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | " Use the method head() to display the first five rows of the dataframe. " 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": { 199 | "collapsed": false 200 | }, 201 | "outputs": [], 202 | "source": [ 203 | "# To see what the data set looks like, we'll use the head() method.\n", 204 | "df.head()" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis. \n", 212 | "
So, how do we identify all those missing values and deal with them?
\n", 213 | "\n", 214 | "\n", 215 | "How to work with missing data?\n", 216 | "\n", 217 | "Steps for working with missing data:\n", 218 | "
    \n", 219 | "
  1. Identify missing data
  2. \n", 220 | "
  3. Deal with missing data (see the preview sketch below)
  4. \n", 221 | "
  5. Correct data format
  6. \n", 222 | "
" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "

Identify and handle missing values

\n", 230 | "\n", 231 | "\n", 232 | "

Identify missing values

\n", 233 | "

Convert \"?\" to NaN

\n", 234 | "In the car dataset, missing data comes with the question mark \"?\".\n", 235 | "We replace \"?\" with NaN (Not a Number), which is Python's default missing value marker, for reasons of computational speed and convenience. Here we use the function: \n", 236 | "
.replace(A, B, inplace = True) 
\n", 237 | "to replace A by B" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "import numpy as np\n", 249 | "\n", 250 | "# replace \"?\" to NaN\n", 251 | "df.replace(\"?\", np.nan, inplace = True)\n", 252 | "df.head(5)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "dentify_missing_values\n", 260 | "\n", 261 | "

Evaluating for Missing Data

\n", 262 | "\n", 263 | "The missing values are converted to Python's default. We use Python's built-in functions to identify these missing values. There are two methods to detect missing data:\n", 264 | "
    \n", 265 | "
  1. .isnull()
  2. \n", 266 | "
  3. .notnull()
  4. \n", 267 | "
\n", 268 | "The output is a boolean value indicating whether the value that is passed into the argument is in fact missing data." 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "collapsed": false 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "missing_data = df.isnull()\n", 280 | "missing_data.head(5)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "\"True\" stands for missing value, while \"False\" stands for not missing value." 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "

Count missing values in each column

\n", 295 | "

\n", 296 | "Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, \"True\" represents a missing value, \"False\" means the value is present in the dataset. In the body of the for loop the method \".value_counts()\" counts the number of \"True\" values. \n", 297 | "

" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "for column in missing_data.columns.values.tolist():\n", 309 | " print(column)\n", 310 | " print (missing_data[column].value_counts())\n", 311 | " print(\"\") " 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "Based on the summary above, each column has 205 rows of data, seven columns containing missing data:\n", 319 | "
    \n", 320 | "
  1. \"normalized-losses\": 41 missing data
  2. \n", 321 | "
  3. \"num-of-doors\": 2 missing data
  4. \n", 322 | "
  5. \"bore\": 4 missing data
  6. \n", 323 | "
  7. \"stroke\" : 4 missing data
  8. \n", 324 | "
  9. \"horsepower\": 2 missing data
  10. \n", 325 | "
  11. \"peak-rpm\": 2 missing data
  12. \n", 326 | "
  13. \"price\": 4 missing data
  14. \n", 327 | "
" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "

Deal with missing data

\n", 335 | "How to deal with missing data?\n", 336 | "\n", 337 | "
    \n", 338 | "
  1. drop data
    \n", 339 | " a. drop the whole row
    \n", 340 | " b. drop the whole column\n", 341 | "
  2. \n", 342 | "
  3. replace data
    \n", 343 | " a. replace it by mean
    \n", 344 | " b. replace it by frequency
    \n", 345 | " c. replace it based on other functions\n", 346 | "
  4. \n", 347 | "
" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.\n", 355 | "We have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. We will apply each method to many different columns:\n", 356 | "\n", 357 | "Replace by mean:\n", 358 | "\n", 365 | "\n", 366 | "Replace by frequency:\n", 367 | "\n", 374 | "\n", 375 | "Drop the whole row:\n", 376 | "" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "

Calculate the average of the \"normalized-losses\" column

" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "avg_norm_loss = df[\"normalized-losses\"].astype(\"float\").mean(axis=0)\n", 401 | "print(\"Average of normalized-losses:\", avg_norm_loss)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "

Replace \"NaN\" by mean value in \"normalized-losses\" column

" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": { 415 | "collapsed": true 416 | }, 417 | "outputs": [], 418 | "source": [ 419 | "df[\"normalized-losses\"].replace(np.nan, avg_norm_loss, inplace=True)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "

Calculate the mean value for the 'bore' column

" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": { 433 | "collapsed": true 434 | }, 435 | "outputs": [], 436 | "source": [ 437 | "avg_bore=df['bore'].astype('float').mean(axis=0)\n", 438 | "print(\"Average of bore:\", avg_bore)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "

Replace NaN with the mean value

" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": { 452 | "collapsed": true 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "df[\"bore\"].replace(np.nan, avg_bore, inplace=True)" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "
\n", 464 | "

Question #1:

\n", 465 | "\n", 466 | "According to the example above, replace NaN in \"stroke\" column by mean.\n", 467 | "
" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": { 474 | "collapsed": false 475 | }, 476 | "outputs": [], 477 | "source": [ 478 | "# Write your code below and press Shift+Enter to execute \n" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "Double-click here for the solution.\n", 486 | "\n", 487 | "\n" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": {}, 502 | "source": [ 503 | "

Calculate the mean value for the 'horsepower' column:

" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": null, 509 | "metadata": { 510 | "collapsed": true 511 | }, 512 | "outputs": [], 513 | "source": [ 514 | "avg_horsepower = df['horsepower'].astype('float').mean(axis=0)\n", 515 | "print(\"Average horsepower:\", avg_horsepower)" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "

Replace \"NaN\" by mean value:

" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "metadata": { 529 | "collapsed": true 530 | }, 531 | "outputs": [], 532 | "source": [ 533 | "df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "

Calculate the mean value for the 'peak-rpm' column:

" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": null, 546 | "metadata": { 547 | "collapsed": true 548 | }, 549 | "outputs": [], 550 | "source": [ 551 | "avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)\n", 552 | "print(\"Average peak rpm:\", avg_peakrpm)" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "

Replace NaN with the mean value:

" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": { 566 | "collapsed": true 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "To see which values are present in a particular column, we can use the \".value_counts()\" method:" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": { 584 | "collapsed": false 585 | }, 586 | "outputs": [], 587 | "source": [ 588 | "df['num-of-doors'].value_counts()" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "We can see that four doors are the most common type. We can also use the \".idxmax()\" method to calculate for us the most common type automatically:" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": { 602 | "collapsed": false 603 | }, 604 | "outputs": [], 605 | "source": [ 606 | "df['num-of-doors'].value_counts().idxmax()" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "The replacement procedure is very similar to what we have seen previously" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": null, 619 | "metadata": { 620 | "collapsed": false 621 | }, 622 | "outputs": [], 623 | "source": [ 624 | "#replace the missing 'num-of-doors' values by the most frequent \n", 625 | "df[\"num-of-doors\"].replace(np.nan, \"four\", inplace=True)" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "Finally, let's drop all rows that do not have price data:" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": { 639 | "collapsed": true 640 | }, 641 | "outputs": [], 642 | "source": [ 643 | "# simply drop whole row with NaN in \"price\" column\n", 644 | "df.dropna(subset=[\"price\"], axis=0, inplace=True)\n", 645 | "\n", 646 | "# reset index, because we droped two rows\n", 647 | "df.reset_index(drop=True, inplace=True)" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "metadata": { 654 | "collapsed": false 655 | }, 656 | "outputs": [], 657 | "source": [ 658 | "df.head()" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "Good! Now, we obtain the dataset with no missing values." 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "metadata": {}, 671 | "source": [ 672 | "

Correct data format

\n", 673 | "We are almost there!\n", 674 | "

The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).

\n", 675 | "\n", 676 | "In Pandas, we use \n", 677 | "

.dtypes to check the data type

\n", 678 | "

.astype() to change the data type

" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "

Let's list the data types for each column

" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "metadata": { 692 | "collapsed": false 693 | }, 694 | "outputs": [], 695 | "source": [ 696 | "df.dtypes" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "

As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the \"astype()\" method.

" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "

Convert data types to proper format

" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "metadata": { 717 | "collapsed": false 718 | }, 719 | "outputs": [], 720 | "source": [ 721 | "df[[\"bore\", \"stroke\"]] = df[[\"bore\", \"stroke\"]].astype(\"float\")\n", 722 | "df[[\"normalized-losses\"]] = df[[\"normalized-losses\"]].astype(\"int\")\n", 723 | "df[[\"price\"]] = df[[\"price\"]].astype(\"float\")\n", 724 | "df[[\"peak-rpm\"]] = df[[\"peak-rpm\"]].astype(\"float\")" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": {}, 730 | "source": [ 731 | "

Let us list the columns after the conversion

" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": null, 737 | "metadata": { 738 | "collapsed": false 739 | }, 740 | "outputs": [], 741 | "source": [ 742 | "df.dtypes" 743 | ] 744 | }, 745 | { 746 | "cell_type": "markdown", 747 | "metadata": {}, 748 | "source": [ 749 | "Wonderful!\n", 750 | "\n", 751 | "Now, we finally obtain the cleaned dataset with no missing values and all data in its proper format." 752 | ] 753 | }, 754 | { 755 | "cell_type": "markdown", 756 | "metadata": {}, 757 | "source": [ 758 | "

Data Standardization

\n", 759 | "

\n", 760 | "Data is usually collected from different agencies with different formats.\n", 761 | "(Data Standardization is also a term for a particular type of data normalization, where we subtract the mean and divide by the standard deviation)\n", 762 | "

\n", 763 | " \n", 764 | "What is Standardization?\n", 765 | "

Standardization is the process of transforming data into a common format that allows the researcher to make meaningful comparisons.\n", 766 | "

\n", 767 | "\n", 768 | "Example\n", 769 | "

Transform mpg to L/100km:

\n", 770 | "

In our dataset, the fuel consumption columns \"city-mpg\" and \"highway-mpg\" are represented in the mpg (miles per gallon) unit. Assume we are developing an application in a country that adopts the L/100km fuel consumption standard.

\n", 771 | "

We will need to apply a data transformation to convert mpg to L/100km.

\n" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "

The formula for unit conversion is

\n", 779 | "L/100km = 235 / mpg\n", 780 | "

We can do many mathematical operations directly in Pandas.

" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": null, 786 | "metadata": { 787 | "collapsed": false 788 | }, 789 | "outputs": [], 790 | "source": [ 791 | "df.head()" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": null, 797 | "metadata": { 798 | "collapsed": false 799 | }, 800 | "outputs": [], 801 | "source": [ 802 | "# Convert mpg to L/100km by mathematical operation (235 divided by mpg)\n", 803 | "df['city-L/100km'] = 235/df[\"city-mpg\"]\n", 804 | "\n", 805 | "# check your transformed data \n", 806 | "df.head()" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "
\n", 814 | "

Question #2:

\n", 815 | "\n", 816 | "According to the example above, transform mpg to L/100km in the column of \"highway-mpg\", and change the name of column to \"highway-L/100km\".\n", 817 | "
" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "metadata": { 824 | "collapsed": false 825 | }, 826 | "outputs": [], 827 | "source": [ 828 | "# Write your code below and press Shift+Enter to execute \n" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "Double-click here for the solution.\n", 836 | "\n", 837 | "\n" 849 | ] 850 | }, 851 | { 852 | "cell_type": "markdown", 853 | "metadata": {}, 854 | "source": [ 855 | "

Data Normalization

\n", 856 | "\n", 857 | "Why normalization?\n", 858 | "

Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so its average is 0, scaling the variable so its variance is 1, or scaling the variable so its values range from 0 to 1.\n", 859 | "

\n", 860 | "\n", 861 | "Example\n", 862 | "

To demonstrate normalization, let's say we want to scale the columns \"length\", \"width\", and \"height\".

\n", 863 | "

Target: normalize those variables so their values range from 0 to 1.

\n", 864 | "

Approach: replace the original value by (original value)/(maximum value)

" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": null, 870 | "metadata": { 871 | "collapsed": false 872 | }, 873 | "outputs": [], 874 | "source": [ 875 | "# replace (original value) by (original value)/(maximum value)\n", 876 | "df['length'] = df['length']/df['length'].max()\n", 877 | "df['width'] = df['width']/df['width'].max()" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "
\n", 885 | "

Question #3:

\n", 886 | "\n", 887 | "According to the example above, normalize the column \"height\".\n", 888 | "
" 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": null, 894 | "metadata": { 895 | "collapsed": false 896 | }, 897 | "outputs": [], 898 | "source": [ 899 | "# Write your code below and press Shift+Enter to execute \n" 900 | ] 901 | }, 902 | { 903 | "cell_type": "markdown", 904 | "metadata": {}, 905 | "source": [ 906 | "Double-click here for the solution.\n", 907 | "\n", 908 | "" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "metadata": {}, 920 | "source": [ 921 | "Here we can see, we've normalized \"length\", \"width\" and \"height\" in the range of [0,1]." 922 | ] 923 | }, 924 | { 925 | "cell_type": "markdown", 926 | "metadata": {}, 927 | "source": [ 928 | "

Binning

\n", 929 | "Why binning?\n", 930 | "

\n", 931 | " Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.\n", 932 | "

\n", 933 | "\n", 934 | "Example: \n", 935 | "

In our dataset, \"horsepower\" is a real valued variable ranging from 48 to 288, it has 57 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis?

\n", 936 | "\n", 937 | "

We will use the Pandas method 'cut' to segment the 'horsepower' column into 3 bins.

\n", 938 | "\n" 939 | ] 940 | }, 941 | { 942 | "cell_type": "markdown", 943 | "metadata": {}, 944 | "source": [ 945 | "

Example of Binning Data in Pandas

" 946 | ] 947 | }, 948 | { 949 | "cell_type": "markdown", 950 | "metadata": {}, 951 | "source": [ 952 | " Convert data to correct format " 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": null, 958 | "metadata": { 959 | "collapsed": false 960 | }, 961 | "outputs": [], 962 | "source": [ 963 | "df[\"horsepower\"]=df[\"horsepower\"].astype(int, copy=True)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "Lets plot the histogram of horspower, to see what the distribution of horsepower looks like." 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": {}, 977 | "outputs": [], 978 | "source": [ 979 | "%matplotlib inline\n", 980 | "import matplotlib as plt\n", 981 | "from matplotlib import pyplot\n", 982 | "plt.pyplot.hist(df[\"horsepower\"])\n", 983 | "\n", 984 | "# set x/y labels and plot title\n", 985 | "plt.pyplot.xlabel(\"horsepower\")\n", 986 | "plt.pyplot.ylabel(\"count\")\n", 987 | "plt.pyplot.title(\"horsepower bins\")" 988 | ] 989 | }, 990 | { 991 | "cell_type": "markdown", 992 | "metadata": {}, 993 | "source": [ 994 | "

We would like 3 bins of equal bandwidth, so we use NumPy's linspace(start_value, end_value, numbers_generated) function.

\n", 995 | "

Since we want to include the minimum value of horsepower, we set start_value=min(df[\"horsepower\"]).

\n", 996 | "

Since we want to include the maximum value of horsepower, we set end_value=max(df[\"horsepower\"]).

\n", 997 | "

Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated=4.

" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "markdown", 1002 | "metadata": {}, 1003 | "source": [ 1004 | "We build a bin array, with a minimum value to a maximum value, with bandwidth calculated above. The bins will be values used to determine when one bin ends and another begins." 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": null, 1010 | "metadata": { 1011 | "collapsed": false 1012 | }, 1013 | "outputs": [], 1014 | "source": [ 1015 | "bins = np.linspace(min(df[\"horsepower\"]), max(df[\"horsepower\"]), 4)\n", 1016 | "bins" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "metadata": {}, 1022 | "source": [ 1023 | " We set group names:" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": null, 1029 | "metadata": { 1030 | "collapsed": true 1031 | }, 1032 | "outputs": [], 1033 | "source": [ 1034 | "group_names = ['Low', 'Medium', 'High']" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | " We apply the function \"cut\" the determine what each value of \"df['horsepower']\" belongs to. " 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "code", 1046 | "execution_count": null, 1047 | "metadata": { 1048 | "collapsed": false 1049 | }, 1050 | "outputs": [], 1051 | "source": [ 1052 | "df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )\n", 1053 | "df[['horsepower','horsepower-binned']].head(20)" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "markdown", 1058 | "metadata": {}, 1059 | "source": [ 1060 | "Lets see the number of vehicles in each bin." 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "code", 1065 | "execution_count": null, 1066 | "metadata": {}, 1067 | "outputs": [], 1068 | "source": [ 1069 | "df[\"horsepower-binned\"].value_counts()" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "Lets plot the distribution of each bin." 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": null, 1082 | "metadata": {}, 1083 | "outputs": [], 1084 | "source": [ 1085 | "%matplotlib inline\n", 1086 | "import matplotlib as plt\n", 1087 | "from matplotlib import pyplot\n", 1088 | "pyplot.bar(group_names, df[\"horsepower-binned\"].value_counts())\n", 1089 | "\n", 1090 | "# set x/y labels and plot title\n", 1091 | "plt.pyplot.xlabel(\"horsepower\")\n", 1092 | "plt.pyplot.ylabel(\"count\")\n", 1093 | "plt.pyplot.title(\"horsepower bins\")" 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "

\n", 1101 | " Check the dataframe above carefully, you will find the last column provides the bins for \"horsepower\" with 3 categories (\"Low\",\"Medium\" and \"High\"). \n", 1102 | "

\n", 1103 | "

\n", 1104 | " We successfully narrow the intervals from 57 to 3!\n", 1105 | "

" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "markdown", 1110 | "metadata": {}, 1111 | "source": [ 1112 | "

Bins visualization

\n", 1113 | "Normally, a histogram is used to visualize the distribution of bins we created above. " 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "code", 1118 | "execution_count": null, 1119 | "metadata": { 1120 | "collapsed": false 1121 | }, 1122 | "outputs": [], 1123 | "source": [ 1124 | "%matplotlib inline\n", 1125 | "import matplotlib as plt\n", 1126 | "from matplotlib import pyplot\n", 1127 | "\n", 1128 | "a = (0,1,2)\n", 1129 | "\n", 1130 | "# draw historgram of attribute \"horsepower\" with bins = 3\n", 1131 | "plt.pyplot.hist(df[\"horsepower\"], bins = 3)\n", 1132 | "\n", 1133 | "# set x/y labels and plot title\n", 1134 | "plt.pyplot.xlabel(\"horsepower\")\n", 1135 | "plt.pyplot.ylabel(\"count\")\n", 1136 | "plt.pyplot.title(\"horsepower bins\")" 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "markdown", 1141 | "metadata": {}, 1142 | "source": [ 1143 | "The plot above shows the binning result for attribute \"horsepower\". " 1144 | ] 1145 | }, 1146 | { 1147 | "cell_type": "markdown", 1148 | "metadata": {}, 1149 | "source": [ 1150 | "

Indicator variable (or dummy variable)

\n", 1151 | "What is an indicator variable?\n", 1152 | "

\n", 1153 | " An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. \n", 1154 | "

\n", 1155 | "\n", 1156 | "Why we use indicator variables?\n", 1157 | "

\n", 1158 | " So we can use categorical variables for regression analysis in the later modules.\n", 1159 | "

\n", 1160 | "Example\n", 1161 | "

\n", 1162 | " We see the column \"fuel-type\" has two unique values, \"gas\" or \"diesel\". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert \"fuel-type\" into indicator variables.\n", 1163 | "

\n", 1164 | "\n", 1165 | "

\n", 1166 | " We will use the panda's method 'get_dummies' to assign numerical values to different categories of fuel type. \n", 1167 | "

" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": null, 1173 | "metadata": { 1174 | "collapsed": false 1175 | }, 1176 | "outputs": [], 1177 | "source": [ 1178 | "df.columns" 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "markdown", 1183 | "metadata": {}, 1184 | "source": [ 1185 | "get indicator variables and assign it to data frame \"dummy_variable_1\" " 1186 | ] 1187 | }, 1188 | { 1189 | "cell_type": "code", 1190 | "execution_count": null, 1191 | "metadata": { 1192 | "collapsed": false 1193 | }, 1194 | "outputs": [], 1195 | "source": [ 1196 | "dummy_variable_1 = pd.get_dummies(df[\"fuel-type\"])\n", 1197 | "dummy_variable_1.head()" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "markdown", 1202 | "metadata": {}, 1203 | "source": [ 1204 | "change column names for clarity " 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "code", 1209 | "execution_count": null, 1210 | "metadata": { 1211 | "collapsed": false 1212 | }, 1213 | "outputs": [], 1214 | "source": [ 1215 | "dummy_variable_1.rename(columns={'fuel-type-diesel':'gas', 'fuel-type-diesel':'diesel'}, inplace=True)\n", 1216 | "dummy_variable_1.head()" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "markdown", 1221 | "metadata": {}, 1222 | "source": [ 1223 | "We now have the value 0 to represent \"gas\" and 1 to represent \"diesel\" in the column \"fuel-type\". We will now insert this column back into our original dataset. " 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": null, 1229 | "metadata": { 1230 | "collapsed": true 1231 | }, 1232 | "outputs": [], 1233 | "source": [ 1234 | "# merge data frame \"df\" and \"dummy_variable_1\" \n", 1235 | "df = pd.concat([df, dummy_variable_1], axis=1)\n", 1236 | "\n", 1237 | "# drop original column \"fuel-type\" from \"df\"\n", 1238 | "df.drop(\"fuel-type\", axis = 1, inplace=True)" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "code", 1243 | "execution_count": null, 1244 | "metadata": { 1245 | "collapsed": false 1246 | }, 1247 | "outputs": [], 1248 | "source": [ 1249 | "df.head()" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "markdown", 1254 | "metadata": {}, 1255 | "source": [ 1256 | "The last two columns are now the indicator variable representation of the fuel-type variable. It's all 0s and 1s now." 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "markdown", 1261 | "metadata": {}, 1262 | "source": [ 1263 | "
\n", 1264 | "

Question #4:

\n", 1265 | "\n", 1266 | "As above, create indicator variable to the column of \"aspiration\": \"std\" to 0, while \"turbo\" to 1.\n", 1267 | "
" 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "code", 1272 | "execution_count": null, 1273 | "metadata": { 1274 | "collapsed": false 1275 | }, 1276 | "outputs": [], 1277 | "source": [ 1278 | "# Write your code below and press Shift+Enter to execute \n" 1279 | ] 1280 | }, 1281 | { 1282 | "cell_type": "markdown", 1283 | "metadata": {}, 1284 | "source": [ 1285 | "Double-click here for the solution.\n", 1286 | "\n", 1287 | "" 1299 | ] 1300 | }, 1301 | { 1302 | "cell_type": "markdown", 1303 | "metadata": {}, 1304 | "source": [ 1305 | "
\n", 1306 | "

Question #5:

\n", 1307 | "\n", 1308 | "Merge the new dataframe to the original dataframe then drop the column 'aspiration'\n", 1309 | "
" 1310 | ] 1311 | }, 1312 | { 1313 | "cell_type": "code", 1314 | "execution_count": null, 1315 | "metadata": { 1316 | "collapsed": false 1317 | }, 1318 | "outputs": [], 1319 | "source": [ 1320 | "# Write your code below and press Shift+Enter to execute \n" 1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "markdown", 1325 | "metadata": {}, 1326 | "source": [ 1327 | "Double-click here for the solution.\n", 1328 | "\n", 1329 | "" 1338 | ] 1339 | }, 1340 | { 1341 | "cell_type": "markdown", 1342 | "metadata": {}, 1343 | "source": [ 1344 | "save the new csv " 1345 | ] 1346 | }, 1347 | { 1348 | "cell_type": "code", 1349 | "execution_count": null, 1350 | "metadata": { 1351 | "collapsed": true 1352 | }, 1353 | "outputs": [], 1354 | "source": [ 1355 | "df.to_csv('clean_df.csv')" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "markdown", 1360 | "metadata": {}, 1361 | "source": [ 1362 | "

Thank you for completing this notebook

" 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "markdown", 1367 | "metadata": {}, 1368 | "source": [ 1369 | "
\n", 1370 | "\n", 1371 | "

\n", 1372 | "
" 1373 | ] 1374 | }, 1375 | { 1376 | "cell_type": "markdown", 1377 | "metadata": {}, 1378 | "source": [ 1379 | "

About the Authors:

\n", 1380 | "\n", 1381 | "This notebook was written by Mahdi Noorian PhD, Joseph Santarcangelo, Bahare Talayian, Eric Xiao, Steven Dong, Parizad, Hima Vsudevan and Fiorella Wenver and Yi Yao.\n", 1382 | "\n", 1383 | "

Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.

" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "metadata": {}, 1389 | "source": [ 1390 | "
\n", 1391 | "

Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.

" 1392 | ] 1393 | } 1394 | ], 1395 | "metadata": { 1396 | "anaconda-cloud": {}, 1397 | "kernelspec": { 1398 | "display_name": "Python 3", 1399 | "language": "python", 1400 | "name": "python3" 1401 | }, 1402 | "language_info": { 1403 | "codemirror_mode": { 1404 | "name": "ipython", 1405 | "version": 3 1406 | }, 1407 | "file_extension": ".py", 1408 | "mimetype": "text/x-python", 1409 | "name": "python", 1410 | "nbconvert_exporter": "python", 1411 | "pygments_lexer": "ipython3", 1412 | "version": "3.6.7" 1413 | } 1414 | }, 1415 | "nbformat": 4, 1416 | "nbformat_minor": 2 1417 | } 1418 | -------------------------------------------------------------------------------- /DA0101EN-Review-Exploratory-Data-Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 8 | " \n", 9 | " \n", 10 | " \n", 11 | "
\n" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "\n", 19 | "\n", 20 | "

Data Analysis with Python

" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Exploratory Data Analysis" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "

Welcome!

\n", 35 | "In this section, we will explore several methods to see if certain characteristics or features can be used to predict car price. " 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "

Table of content

\n", 43 | "\n", 44 | "
\n", 45 | "
    \n", 46 | "
  1. Import Data from Module 2
  2. \n", 47 | "
  3. Analyzing Individual Feature Patterns using Visualization
  4. \n", 48 | "
  5. Descriptive Statistical Analysis
  6. \n", 49 | "
  7. Basics of Grouping
  8. \n", 50 | "
  9. Correlation and Causation
  10. \n", 51 | "
  11. ANOVA
  12. \n", 52 | "
\n", 53 | " \n", 54 | "Estimated Time Needed: 30 min\n", 55 | "
\n", 56 | " \n", 57 | "
" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "

What are the main characteristics which have the most impact on the car price?

" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "

1. Import Data from Module 2

" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "

Setup

" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | " Import libraries " 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "collapsed": true 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "import pandas as pd\n", 97 | "import numpy as np" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | " load data and store in dataframe df:" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "This dataset was hosted on IBM Cloud object click HERE for free storage" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'\n", 123 | "df = pd.read_csv(path)\n", 124 | "df.head()" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "

2. Analyzing Individual Feature Patterns using Visualization

" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "To install seaborn we use the pip which is the python package manager." 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "%%capture\n", 148 | "! pip install seaborn" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | " Import visualization packages \"Matplotlib\" and \"Seaborn\", don't forget about \"%matplotlib inline\" to plot in a Jupyter notebook." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "import matplotlib.pyplot as plt\n", 167 | "import seaborn as sns\n", 168 | "%matplotlib inline " 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "

How to choose the right visualization method?

\n", 176 | "

When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.

\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "# list the data types for each column\n", 188 | "print(df.dtypes)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "
\n", 196 | "

Question #1:

\n", 197 | "\n", 198 | "What is the data type of the column \"peak-rpm\"? \n", 199 | "
" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "Double-click here for the solution.\n", 207 | "\n", 208 | "" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "for example, we can calculate the correlation between variables of type \"int64\" or \"float64\" using the method \"corr\":" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "df.corr()" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "The diagonal elements are always one; we will study correlation more precisely Pearson correlation in-depth at the end of the notebook." 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "
\n", 245 | "

Question #2:

\n", 246 | "\n", 247 | "

Find the correlation between the following columns: bore, stroke, compression-ratio, and horsepower.

\n", 248 | "

Hint: if you would like to select those columns, use the following syntax: df[['bore','stroke','compression-ratio','horsepower']]

\n", 249 | "
" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": { 256 | "collapsed": true 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "# Write your code below and press Shift+Enter to execute \n" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "Double-click here for the solution.\n", 268 | "\n", 269 | "" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "

Continuous numerical variables:

\n", 281 | "\n", 282 | "

Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type \"int64\" or \"float64\". A great way to visualize these variables is by using scatterplots with fitted lines.

\n", 283 | "\n", 284 | "

In order to start understanding the (linear) relationship between an individual variable and the price, we can use \"regplot\", which plots the scatterplot plus the fitted regression line for the data.

" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | " Let's see several examples of different linear relationships:" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "

Positive linear relationship

" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "Let's find the scatterplot of \"engine-size\" and \"price\" " 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": { 312 | "collapsed": false, 313 | "scrolled": true 314 | }, 315 | "outputs": [], 316 | "source": [ 317 | "# Engine size as potential predictor variable of price\n", 318 | "sns.regplot(x=\"engine-size\", y=\"price\", data=df)\n", 319 | "plt.ylim(0,)" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "

As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.

" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | " We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": { 340 | "collapsed": false 341 | }, 342 | "outputs": [], 343 | "source": [ 344 | "df[[\"engine-size\", \"price\"]].corr()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "Highway mpg is a potential predictor variable of price " 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": { 358 | "collapsed": false 359 | }, 360 | "outputs": [], 361 | "source": [ 362 | "sns.regplot(x=\"highway-mpg\", y=\"price\", data=df)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "

As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.

" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": { 383 | "collapsed": false 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "df[['highway-mpg', 'price']].corr()" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "

Weak Linear Relationship

" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "Let's see if \"Peak-rpm\" as a predictor variable of \"price\"." 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": { 408 | "collapsed": false 409 | }, 410 | "outputs": [], 411 | "source": [ 412 | "sns.regplot(x=\"peak-rpm\", y=\"price\", data=df)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "

Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it is not a reliable variable.

" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616 " 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": { 433 | "collapsed": false 434 | }, 435 | "outputs": [], 436 | "source": [ 437 | "df[['peak-rpm','price']].corr()" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "
\n", 445 | "

Question 3 a):

\n", 446 | "\n", 447 | "

Find the correlation between x=\"stroke\" and y=\"price\".

\n", 448 | "

Hint: if you would like to select those columns, use the following syntax: df[[\"stroke\",\"price\"]]

\n", 449 | "
" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": { 456 | "collapsed": false 457 | }, 458 | "outputs": [], 459 | "source": [ 460 | "# Write your code below and press Shift+Enter to execute\n" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "Double-click here for the solution.\n", 468 | "\n", 469 | "" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "
\n", 483 | "

Question 3 b):

\n", 484 | "\n", 485 | "

Given the correlation results between \"price\" and \"stroke\", do you expect a linear relationship?

\n", 486 | "

Verify your results using the function \"regplot()\".

\n", 487 | "
" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": { 494 | "collapsed": false 495 | }, 496 | "outputs": [], 497 | "source": [ 498 | "# Write your code below and press Shift+Enter to execute \n" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "Double-click here for the solution.\n", 506 | "\n", 507 | "" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "

Categorical variables

\n", 522 | "\n", 523 | "

These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type \"object\" or \"int64\". A good way to visualize categorical variables is by using boxplots.

" 524 | ] 525 | }, 526 | { 527 | "cell_type": "markdown", 528 | "metadata": {}, 529 | "source": [ 530 | "Let's look at the relationship between \"body-style\" and \"price\"." 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": { 537 | "collapsed": false, 538 | "scrolled": true 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "sns.boxplot(x=\"body-style\", y=\"price\", data=df)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "

We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price. Let's examine \"engine-location\" and \"price\":

" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": { 556 | "collapsed": false, 557 | "scrolled": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "sns.boxplot(x=\"engine-location\", y=\"price\", data=df)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "

Here we see that the distributions of price between these two engine-location categories, front and rear, are distinct enough to take engine-location as a potentially good predictor of price.

" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | " Let's examine \"drive-wheels\" and \"price\"." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": { 582 | "collapsed": false, 583 | "scrolled": false 584 | }, 585 | "outputs": [], 586 | "source": [ 587 | "# drive-wheels\n", 588 | "sns.boxplot(x=\"drive-wheels\", y=\"price\", data=df)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "

Here we see that the distribution of price between the different drive-wheels categories differs; as such, drive-wheels could potentially be a predictor of price.

" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "

3. Descriptive Statistical Analysis

" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": {}, 608 | "source": [ 609 | "

Let's first take a look at the variables by utilizing a description method.

\n", 610 | "\n", 611 | "

The describe function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.

\n", 612 | "\n", 613 | "This will show:\n", 614 | "