├── DA0101EN-Review-Data-Wrangling.ipynb ├── DA0101EN-Review-Exploratory-Data-Analysis.ipynb ├── DA0101EN-Review-Introduction.ipynb ├── DA0101EN-Review-Model-Development.ipynb ├── DA0101EN-Review-Model-Evaluation-and-Refinement.ipynb ├── House Sales_in_King_Count_USA.ipynb ├── README.md ├── data-wrangling.ipynb ├── exploratory-data-analysis.ipynb ├── model-development.ipynb ├── model-evaluation-and-refinement.ipynb └── review-introduction.ipynb /DA0101EN-Review-Data-Wrangling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 93 | "You can find the \"Automobile Data Set\" from the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data. \n", 94 | "We will be using this data set throughout this course.\n", 95 | "
" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | ".replace(A, B, inplace = True)\n", 237 | "to replace A by B" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "import numpy as np\n", 249 | "\n", 250 | "# replace \"?\" to NaN\n", 251 | "df.replace(\"?\", np.nan, inplace = True)\n", 252 | "df.head(5)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "dentify_missing_values\n", 260 | "\n", 261 | "
\n", 296 | "Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, \"True\" represents a missing value, \"False\" means the value is present in the dataset. In the body of the for loop the method \".value_counts()\" counts the number of \"True\" values. \n", 297 | "
" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "for column in missing_data.columns.values.tolist():\n", 309 | " print(column)\n", 310 | " print (missing_data[column].value_counts())\n", 311 | " print(\"\") " 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "Based on the summary above, each column has 205 rows of data, seven columns containing missing data:\n", 319 | "The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).
\n", 675 | "\n", 676 | "In Pandas, we use \n", 677 | ".dtype() to check the data type
\n", 678 | ".astype() to change the data type
" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the \"astype()\" method.
" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "\n", 760 | "Data is usually collected from different agencies with different formats.\n", 761 | "(Data Standardization is also a term for a particular type of data normalization, where we subtract the mean and divide by the standard deviation)\n", 762 | "
\n", 763 | " \n", 764 | "What is Standardization?\n", 765 | "Standardization is the process of transforming data into a common format which allows the researcher to make the meaningful comparison.\n", 766 | "
\n", 767 | "\n", 768 | "Example\n", 769 | "Transform mpg to L/100km:
\n", 770 | "In our dataset, the fuel consumption columns \"city-mpg\" and \"highway-mpg\" are represented by mpg (miles per gallon) unit. Assume we are developing an application in a country that accept the fuel consumption with L/100km standard
\n", 771 | "We will need to apply data transformation to transform mpg into L/100km?
\n" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "The formula for unit conversion is
\n", 779 | "L/100km = 235 / mpg\n", 780 | "
We can do many mathematical operations directly in Pandas.
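For instance, a hedged sketch of the same formula applied to the \"highway-mpg\" column (the notebook performs the \"city-mpg\" conversion in the cells below; the new column name is an assumption):

```python
# Apply the L/100km conversion to "highway-mpg" as well (235 divided by mpg).
df["highway-L/100km"] = 235 / df["highway-mpg"]
```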
" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": null, 786 | "metadata": { 787 | "collapsed": false 788 | }, 789 | "outputs": [], 790 | "source": [ 791 | "df.head()" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": null, 797 | "metadata": { 798 | "collapsed": false 799 | }, 800 | "outputs": [], 801 | "source": [ 802 | "# Convert mpg to L/100km by mathematical operation (235 divided by mpg)\n", 803 | "df['city-L/100km'] = 235/df[\"city-mpg\"]\n", 804 | "\n", 805 | "# check your transformed data \n", 806 | "df.head()" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling variable so the variable values range from 0 to 1\n", 859 | "
\n", 860 | "\n", 861 | "Example\n", 862 | "To demonstrate normalization, let's say we want to scale the columns \"length\", \"width\" and \"height\"
\n", 863 | "Target:would like to Normalize those variables so their value ranges from 0 to 1.
\n", 864 | "Approach: replace original value by (original value)/(maximum value)
" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": null, 870 | "metadata": { 871 | "collapsed": false 872 | }, 873 | "outputs": [], 874 | "source": [ 875 | "# replace (original value) by (original value)/(maximum value)\n", 876 | "df['length'] = df['length']/df['length'].max()\n", 877 | "df['width'] = df['width']/df['width'].max()" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "\n", 931 | " Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.\n", 932 | "
\n", 933 | "\n", 934 | "Example: \n", 935 | "In our dataset, \"horsepower\" is a real valued variable ranging from 48 to 288, it has 57 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis?
\n", 936 | "\n", 937 | "We will use the Pandas method 'cut' to segment the 'horsepower' column into 3 bins
\n", 938 | "\n" 939 | ] 940 | }, 941 | { 942 | "cell_type": "markdown", 943 | "metadata": {}, 944 | "source": [ 945 | "We would like 3 bins of equal size bandwidth so we use numpy's linspace(start_value, end_value, numbers_generated
function.
Since we want to include the minimum value of horsepower we want to set start_value=min(df[\"horsepower\"]).
\n", 996 | "Since we want to include the maximum value of horsepower we want to set end_value=max(df[\"horsepower\"]).
\n", 997 | "Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated=4.
" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "markdown", 1002 | "metadata": {}, 1003 | "source": [ 1004 | "We build a bin array, with a minimum value to a maximum value, with bandwidth calculated above. The bins will be values used to determine when one bin ends and another begins." 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": null, 1010 | "metadata": { 1011 | "collapsed": false 1012 | }, 1013 | "outputs": [], 1014 | "source": [ 1015 | "bins = np.linspace(min(df[\"horsepower\"]), max(df[\"horsepower\"]), 4)\n", 1016 | "bins" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "metadata": {}, 1022 | "source": [ 1023 | " We set group names:" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": null, 1029 | "metadata": { 1030 | "collapsed": true 1031 | }, 1032 | "outputs": [], 1033 | "source": [ 1034 | "group_names = ['Low', 'Medium', 'High']" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | " We apply the function \"cut\" the determine what each value of \"df['horsepower']\" belongs to. " 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "code", 1046 | "execution_count": null, 1047 | "metadata": { 1048 | "collapsed": false 1049 | }, 1050 | "outputs": [], 1051 | "source": [ 1052 | "df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )\n", 1053 | "df[['horsepower','horsepower-binned']].head(20)" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "markdown", 1058 | "metadata": {}, 1059 | "source": [ 1060 | "Lets see the number of vehicles in each bin." 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "code", 1065 | "execution_count": null, 1066 | "metadata": {}, 1067 | "outputs": [], 1068 | "source": [ 1069 | "df[\"horsepower-binned\"].value_counts()" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "Lets plot the distribution of each bin." 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": null, 1082 | "metadata": {}, 1083 | "outputs": [], 1084 | "source": [ 1085 | "%matplotlib inline\n", 1086 | "import matplotlib as plt\n", 1087 | "from matplotlib import pyplot\n", 1088 | "pyplot.bar(group_names, df[\"horsepower-binned\"].value_counts())\n", 1089 | "\n", 1090 | "# set x/y labels and plot title\n", 1091 | "plt.pyplot.xlabel(\"horsepower\")\n", 1092 | "plt.pyplot.ylabel(\"count\")\n", 1093 | "plt.pyplot.title(\"horsepower bins\")" 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "\n", 1101 | " Check the dataframe above carefully, you will find the last column provides the bins for \"horsepower\" with 3 categories (\"Low\",\"Medium\" and \"High\"). \n", 1102 | "
\n", 1103 | "\n", 1104 | " We successfully narrow the intervals from 57 to 3!\n", 1105 | "
" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "markdown", 1110 | "metadata": {}, 1111 | "source": [ 1112 | "\n", 1153 | " An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. \n", 1154 | "
\n", 1155 | "\n", 1156 | "Why we use indicator variables?\n", 1157 | "\n", 1158 | " So we can use categorical variables for regression analysis in the later modules.\n", 1159 | "
\n", 1160 | "Example\n", 1161 | "\n", 1162 | " We see the column \"fuel-type\" has two unique values, \"gas\" or \"diesel\". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert \"fuel-type\" into indicator variables.\n", 1163 | "
\n", 1164 | "\n", 1165 | "\n", 1166 | " We will use the panda's method 'get_dummies' to assign numerical values to different categories of fuel type. \n", 1167 | "
" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": null, 1173 | "metadata": { 1174 | "collapsed": false 1175 | }, 1176 | "outputs": [], 1177 | "source": [ 1178 | "df.columns" 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "markdown", 1183 | "metadata": {}, 1184 | "source": [ 1185 | "get indicator variables and assign it to data frame \"dummy_variable_1\" " 1186 | ] 1187 | }, 1188 | { 1189 | "cell_type": "code", 1190 | "execution_count": null, 1191 | "metadata": { 1192 | "collapsed": false 1193 | }, 1194 | "outputs": [], 1195 | "source": [ 1196 | "dummy_variable_1 = pd.get_dummies(df[\"fuel-type\"])\n", 1197 | "dummy_variable_1.head()" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "markdown", 1202 | "metadata": {}, 1203 | "source": [ 1204 | "change column names for clarity " 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "code", 1209 | "execution_count": null, 1210 | "metadata": { 1211 | "collapsed": false 1212 | }, 1213 | "outputs": [], 1214 | "source": [ 1215 | "dummy_variable_1.rename(columns={'fuel-type-diesel':'gas', 'fuel-type-diesel':'diesel'}, inplace=True)\n", 1216 | "dummy_variable_1.head()" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "markdown", 1221 | "metadata": {}, 1222 | "source": [ 1223 | "We now have the value 0 to represent \"gas\" and 1 to represent \"diesel\" in the column \"fuel-type\". We will now insert this column back into our original dataset. " 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": null, 1229 | "metadata": { 1230 | "collapsed": true 1231 | }, 1232 | "outputs": [], 1233 | "source": [ 1234 | "# merge data frame \"df\" and \"dummy_variable_1\" \n", 1235 | "df = pd.concat([df, dummy_variable_1], axis=1)\n", 1236 | "\n", 1237 | "# drop original column \"fuel-type\" from \"df\"\n", 1238 | "df.drop(\"fuel-type\", axis = 1, inplace=True)" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "code", 1243 | "execution_count": null, 1244 | "metadata": { 1245 | "collapsed": false 1246 | }, 1247 | "outputs": [], 1248 | "source": [ 1249 | "df.head()" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "markdown", 1254 | "metadata": {}, 1255 | "source": [ 1256 | "The last two columns are now the indicator variable representation of the fuel-type variable. It's all 0s and 1s now." 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "markdown", 1261 | "metadata": {}, 1262 | "source": [ 1263 | "Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "metadata": {}, 1389 | "source": [ 1390 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 1392 | ] 1393 | } 1394 | ], 1395 | "metadata": { 1396 | "anaconda-cloud": {}, 1397 | "kernelspec": { 1398 | "display_name": "Python 3", 1399 | "language": "python", 1400 | "name": "python3" 1401 | }, 1402 | "language_info": { 1403 | "codemirror_mode": { 1404 | "name": "ipython", 1405 | "version": 3 1406 | }, 1407 | "file_extension": ".py", 1408 | "mimetype": "text/x-python", 1409 | "name": "python", 1410 | "nbconvert_exporter": "python", 1411 | "pygments_lexer": "ipython3", 1412 | "version": "3.6.7" 1413 | } 1414 | }, 1415 | "nbformat": 4, 1416 | "nbformat_minor": 2 1417 | } 1418 | -------------------------------------------------------------------------------- /DA0101EN-Review-Exploratory-Data-Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.
\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "# list the data types for each column\n", 188 | "print(df.dtypes)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "Find the correlation between the following columns: bore, stroke,compression-ratio , and horsepower.
\n", 248 | "Hint: if you would like to select those columns use the following syntax: df[['bore','stroke' ,'compression-ratio','horsepower']]
\n", 249 | "Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type \"int64\" or \"float64\". A great way to visualize these variables is by using scatterplots with fitted lines.
\n", 283 | "\n", 284 | "In order to start understanding the (linear) relationship between an individual variable and the price. We can do this by using \"regplot\", which plots the scatterplot plus the fitted regression line for the data.
" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | " Let's see several examples of different linear relationships:" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.
" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | " We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": { 340 | "collapsed": false 341 | }, 342 | "outputs": [], 343 | "source": [ 344 | "df[[\"engine-size\", \"price\"]].corr()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "Highway mpg is a potential predictor variable of price " 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": { 358 | "collapsed": false 359 | }, 360 | "outputs": [], 361 | "source": [ 362 | "sns.regplot(x=\"highway-mpg\", y=\"price\", data=df)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.
" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": { 383 | "collapsed": false 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "df[['highway-mpg', 'price']].corr()" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it's it is not a reliable variable.
" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616 " 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": { 433 | "collapsed": false 434 | }, 435 | "outputs": [], 436 | "source": [ 437 | "df[['peak-rpm','price']].corr()" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "Find the correlation between x=\"stroke\", y=\"price\".
\n", 448 | "Hint: if you would like to select those columns use the following syntax: df[[\"stroke\",\"price\"]]
\n", 449 | "Given the correlation results between \"price\" and \"stroke\" do you expect a linear relationship?
\n", 486 | "Verify your results using the function \"regplot()\".
\n", 487 | "These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type \"object\" or \"int64\". A good way to visualize categorical variables is by using boxplots.
" 524 | ] 525 | }, 526 | { 527 | "cell_type": "markdown", 528 | "metadata": {}, 529 | "source": [ 530 | "Let's look at the relationship between \"body-style\" and \"price\"." 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": { 537 | "collapsed": false, 538 | "scrolled": true 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "sns.boxplot(x=\"body-style\", y=\"price\", data=df)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "We see that the distributions of price between the different body-style categories have a significant overlap, and so body-style would not be a good predictor of price. Let's examine engine \"engine-location\" and \"price\":
" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": { 556 | "collapsed": false, 557 | "scrolled": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "sns.boxplot(x=\"engine-location\", y=\"price\", data=df)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "Here we see that the distribution of price between these two engine-location categories, front and rear, are distinct enough to take engine-location as a potential good predictor of price.
" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | " Let's examine \"drive-wheels\" and \"price\"." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": { 582 | "collapsed": false, 583 | "scrolled": false 584 | }, 585 | "outputs": [], 586 | "source": [ 587 | "# drive-wheels\n", 588 | "sns.boxplot(x=\"drive-wheels\", y=\"price\", data=df)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "Here we see that the distribution of price between the different drive-wheels categories differs; as such drive-wheels could potentially be a predictor of price.
" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "Let's first take a look at the variables by utilizing a description method.
\n", 610 | "\n", 611 | "The describe function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.
\n", 612 | "\n", 613 | "This will show:\n", 614 | "Value-counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the \"value_counts\" method on the column 'drive-wheels'. Don’t forget the method \"value_counts\" only works on Pandas series, not Pandas Dataframes. As a result, we only include one bracket \"df['drive-wheels']\" not two brackets \"df[['drive-wheels']]\".
" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": { 679 | "collapsed": false 680 | }, 681 | "outputs": [], 682 | "source": [ 683 | "df['drive-wheels'].value_counts()" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "We can convert the series to a Dataframe as follows :" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "metadata": { 697 | "collapsed": false 698 | }, 699 | "outputs": [], 700 | "source": [ 701 | "df['drive-wheels'].value_counts().to_frame()" 702 | ] 703 | }, 704 | { 705 | "cell_type": "markdown", 706 | "metadata": {}, 707 | "source": [ 708 | "Let's repeat the above steps but save the results to the dataframe \"drive_wheels_counts\" and rename the column 'drive-wheels' to 'value_counts'." 709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": null, 714 | "metadata": { 715 | "collapsed": false 716 | }, 717 | "outputs": [], 718 | "source": [ 719 | "drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()\n", 720 | "drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)\n", 721 | "drive_wheels_counts" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | " Now let's rename the index to 'drive-wheels':" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": { 735 | "collapsed": false 736 | }, 737 | "outputs": [], 738 | "source": [ 739 | "drive_wheels_counts.index.name = 'drive-wheels'\n", 740 | "drive_wheels_counts" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": {}, 746 | "source": [ 747 | "We can repeat the above process for the variable 'engine-location'." 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": { 754 | "collapsed": false 755 | }, 756 | "outputs": [], 757 | "source": [ 758 | "# engine-location as variable\n", 759 | "engine_loc_counts = df['engine-location'].value_counts().to_frame()\n", 760 | "engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)\n", 761 | "engine_loc_counts.index.name = 'engine-location'\n", 762 | "engine_loc_counts.head(10)" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "Examining the value counts of the engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 198 with an engine in the front, this result is skewed. Thus, we are not able to draw any conclusions about the engine location.
" 770 | ] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "metadata": {}, 775 | "source": [ 776 | "The \"groupby\" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.
\n", 784 | "\n", 785 | "For example, let's group by the variable \"drive-wheels\". We see that there are 3 different categories of drive wheels.
" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "metadata": { 792 | "collapsed": false 793 | }, 794 | "outputs": [], 795 | "source": [ 796 | "df['drive-wheels'].unique()" 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "metadata": {}, 802 | "source": [ 803 | "If we want to know, on average, which type of drive wheel is most valuable, we can group \"drive-wheels\" and then average them.
\n", 804 | "\n", 805 | "We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable \"df_group_one\".
" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": null, 811 | "metadata": { 812 | "collapsed": true 813 | }, 814 | "outputs": [], 815 | "source": [ 816 | "df_group_one = df[['drive-wheels','body-style','price']]" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "We can then calculate the average price for each of the different categories of data." 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": null, 829 | "metadata": { 830 | "collapsed": false 831 | }, 832 | "outputs": [], 833 | "source": [ 834 | "# grouping results\n", 835 | "df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()\n", 836 | "df_group_one" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": {}, 842 | "source": [ 843 | "From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.
\n", 844 | "\n", 845 | "You can also group with multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combinations 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.
" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": null, 851 | "metadata": { 852 | "collapsed": false 853 | }, 854 | "outputs": [], 855 | "source": [ 856 | "# grouping results\n", 857 | "df_gptest = df[['drive-wheels','body-style','price']]\n", 858 | "grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()\n", 859 | "grouped_test1" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row. We can convert the dataframe to a pivot table using the method \"pivot \" to create a pivot table from the groups.
\n", 867 | "\n", 868 | "In this case, we will leave the drive-wheel variable as the rows of the table, and pivot body-style to become the columns of the table:
" 869 | ] 870 | }, 871 | { 872 | "cell_type": "code", 873 | "execution_count": null, 874 | "metadata": { 875 | "collapsed": false 876 | }, 877 | "outputs": [], 878 | "source": [ 879 | "grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')\n", 880 | "grouped_pivot" 881 | ] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": {}, 886 | "source": [ 887 | "Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.
" 888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": null, 893 | "metadata": { 894 | "collapsed": false, 895 | "scrolled": true 896 | }, 897 | "outputs": [], 898 | "source": [ 899 | "grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0\n", 900 | "grouped_pivot" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "Use the \"groupby\" function to find the average \"price\" of each car based on \"body-style\" ?
\n", 911 | "The heatmap plots the target variable (price) proportional to colour with respect to the variables 'drive-wheel' and 'body-style' in the vertical and horizontal axis respectively. This allows us to visualize how the price is related to 'drive-wheel' and 'body-style'.
\n", 994 | "\n", 995 | "The default labels convey no useful information to us. Let's change that:
" 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "metadata": { 1002 | "collapsed": false 1003 | }, 1004 | "outputs": [], 1005 | "source": [ 1006 | "fig, ax = plt.subplots()\n", 1007 | "im = ax.pcolor(grouped_pivot, cmap='RdBu')\n", 1008 | "\n", 1009 | "#label names\n", 1010 | "row_labels = grouped_pivot.columns.levels[1]\n", 1011 | "col_labels = grouped_pivot.index\n", 1012 | "\n", 1013 | "#move ticks and labels to the center\n", 1014 | "ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)\n", 1015 | "ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)\n", 1016 | "\n", 1017 | "#insert labels\n", 1018 | "ax.set_xticklabels(row_labels, minor=False)\n", 1019 | "ax.set_yticklabels(col_labels, minor=False)\n", 1020 | "\n", 1021 | "#rotate label if too long\n", 1022 | "plt.xticks(rotation=90)\n", 1023 | "\n", 1024 | "fig.colorbar(im)\n", 1025 | "plt.show()" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python Visualizations course.
\n", 1033 | "\n", 1034 | "The main question we want to answer in this module, is \"What are the main characteristics which have the most impact on the car price?\".
\n", 1035 | "\n", 1036 | "To get a better measure of the important characteristics, we look at the correlation of these variables with the car price, in other words: how is the car price dependent on this variable?
" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "markdown", 1041 | "metadata": {}, 1042 | "source": [ 1043 | "Correlation: a measure of the extent of interdependence between variables.
\n", 1051 | "\n", 1052 | "Causation: the relationship between cause and effect between two variables.
\n", 1053 | "\n", 1054 | "It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler the determining causation as causation may require independent experimentation.
" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "markdown", 1059 | "metadata": {}, 1060 | "source": [ 1061 | "The Pearson Correlation measures the linear dependence between two variables X and Y.
\n", 1063 | "The resulting coefficient is a value between -1 and 1 inclusive, where:
\n", 1064 | "Pearson Correlation is the default method of the function \"corr\". Like before we can calculate the Pearson Correlation of the of the 'int64' or 'float64' variables.
" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "code", 1080 | "execution_count": null, 1081 | "metadata": { 1082 | "collapsed": false 1083 | }, 1084 | "outputs": [], 1085 | "source": [ 1086 | "df.corr()" 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "markdown", 1091 | "metadata": {}, 1092 | "source": [ 1093 | " sometimes we would like to know the significant of the correlation estimate. " 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "metadata": {}, 1099 | "source": [ 1100 | "P-value: \n", 1101 | "What is this P-value? The P-value is the probability value that the correlation between these two variables is statistically significant. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.
\n", 1102 | "\n", 1103 | "By convention, when the\n", 1104 | "Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585)
" 1162 | ] 1163 | }, 1164 | { 1165 | "cell_type": "markdown", 1166 | "metadata": {}, 1167 | "source": [ 1168 | "Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1)
" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "markdown", 1201 | "metadata": {}, 1202 | "source": [ 1203 | "Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).
" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "markdown", 1230 | "metadata": {}, 1231 | "source": [ 1232 | "Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).
" 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "markdown", 1299 | "metadata": {}, 1300 | "source": [ 1301 | "Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).
" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "markdown", 1329 | "metadata": {}, 1330 | "source": [ 1331 | "Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).
" 1359 | ] 1360 | }, 1361 | { 1362 | "cell_type": "markdown", 1363 | "metadata": {}, 1364 | "source": [ 1365 | " We can relate the process for each 'City-mpg' and 'Highway-mpg':" 1366 | ] 1367 | }, 1368 | { 1369 | "cell_type": "markdown", 1370 | "metadata": {}, 1371 | "source": [ 1372 | "Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of ~ -0.687 shows that the relationship is negative and moderately strong.
" 1393 | ] 1394 | }, 1395 | { 1396 | "cell_type": "markdown", 1397 | "metadata": {}, 1398 | "source": [ 1399 | "The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:
\n", 1435 | "\n", 1436 | "F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
\n", 1437 | "\n", 1438 | "P-value: P-value tells how statistically significant is our calculated score value.
\n", 1439 | "\n", 1440 | "If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.
" 1441 | ] 1442 | }, 1443 | { 1444 | "cell_type": "markdown", 1445 | "metadata": {}, 1446 | "source": [ 1447 | "Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average before hand.
\n", 1455 | "\n", 1456 | "Let's see if different types 'drive-wheels' impact 'price', we group the data.
" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "markdown", 1461 | "metadata": {}, 1462 | "source": [ 1463 | " Let's see if different types 'drive-wheels' impact 'price', we group the data." 1464 | ] 1465 | }, 1466 | { 1467 | "cell_type": "code", 1468 | "execution_count": null, 1469 | "metadata": { 1470 | "collapsed": false 1471 | }, 1472 | "outputs": [], 1473 | "source": [ 1474 | "grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])\n", 1475 | "grouped_test2.head(2)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": null, 1481 | "metadata": {}, 1482 | "outputs": [], 1483 | "source": [ 1484 | "df_gptest" 1485 | ] 1486 | }, 1487 | { 1488 | "cell_type": "markdown", 1489 | "metadata": {}, 1490 | "source": [ 1491 | " We can obtain the values of the method group using the method \"get_group\". " 1492 | ] 1493 | }, 1494 | { 1495 | "cell_type": "code", 1496 | "execution_count": null, 1497 | "metadata": { 1498 | "collapsed": false 1499 | }, 1500 | "outputs": [], 1501 | "source": [ 1502 | "grouped_test2.get_group('4wd')['price']" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "markdown", 1507 | "metadata": {}, 1508 | "source": [ 1509 | "we can use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-value." 1510 | ] 1511 | }, 1512 | { 1513 | "cell_type": "code", 1514 | "execution_count": null, 1515 | "metadata": { 1516 | "collapsed": false 1517 | }, 1518 | "outputs": [], 1519 | "source": [ 1520 | "# ANOVA\n", 1521 | "f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price']) \n", 1522 | " \n", 1523 | "print( \"ANOVA results: F=\", f_val, \", P =\", p_val) " 1524 | ] 1525 | }, 1526 | { 1527 | "cell_type": "markdown", 1528 | "metadata": {}, 1529 | "source": [ 1530 | "This is a great result, with a large F test score showing a strong correlation and a P value of almost 0 implying almost certain statistical significance. But does this mean all three tested groups are all this highly correlated? 
" 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "markdown", 1535 | "metadata": {}, 1536 | "source": [ 1537 | "#### Separately: fwd and rwd" 1538 | ] 1539 | }, 1540 | { 1541 | "cell_type": "code", 1542 | "execution_count": null, 1543 | "metadata": { 1544 | "collapsed": false 1545 | }, 1546 | "outputs": [], 1547 | "source": [ 1548 | "f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price']) \n", 1549 | " \n", 1550 | "print( \"ANOVA results: F=\", f_val, \", P =\", p_val )" 1551 | ] 1552 | }, 1553 | { 1554 | "cell_type": "markdown", 1555 | "metadata": {}, 1556 | "source": [ 1557 | " Let's examine the other groups " 1558 | ] 1559 | }, 1560 | { 1561 | "cell_type": "markdown", 1562 | "metadata": {}, 1563 | "source": [ 1564 | "#### 4wd and rwd" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "code", 1569 | "execution_count": null, 1570 | "metadata": { 1571 | "collapsed": false, 1572 | "scrolled": true 1573 | }, 1574 | "outputs": [], 1575 | "source": [ 1576 | "f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price']) \n", 1577 | " \n", 1578 | "print( \"ANOVA results: F=\", f_val, \", P =\", p_val) " 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "markdown", 1583 | "metadata": {}, 1584 | "source": [ 1585 | "We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:
\n", 1613 | "\n", 1614 | "Continuous numerical variables:\n", 1615 | "As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.
" 1633 | ] 1634 | }, 1635 | { 1636 | "cell_type": "markdown", 1637 | "metadata": {}, 1638 | "source": [ 1639 | "Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 1661 | ] 1662 | }, 1663 | { 1664 | "cell_type": "markdown", 1665 | "metadata": {}, 1666 | "source": [ 1667 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 1669 | ] 1670 | } 1671 | ], 1672 | "metadata": { 1673 | "anaconda-cloud": {}, 1674 | "kernelspec": { 1675 | "display_name": "Python 3", 1676 | "language": "python", 1677 | "name": "python3" 1678 | }, 1679 | "language_info": { 1680 | "codemirror_mode": { 1681 | "name": "ipython", 1682 | "version": 3 1683 | }, 1684 | "file_extension": ".py", 1685 | "mimetype": "text/x-python", 1686 | "name": "python", 1687 | "nbconvert_exporter": "python", 1688 | "pygments_lexer": "ipython3", 1689 | "version": "3.6.7" 1690 | } 1691 | }, 1692 | "nbformat": 4, 1693 | "nbformat_minor": 2 1694 | } 1695 | -------------------------------------------------------------------------------- /DA0101EN-Review-Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 31 | "In this section, you will learn how to approach data acquisition in various ways, and obtain necessary insights from a dataset. By the end of this lab, you will successfully load the data into Jupyter Notebook, and gain some fundamental insights via Pandas Library.\n", 32 | "
" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "\n",
58 | "There are various formats for a dataset, .csv, .json, .xlsx etc. The dataset can be stored in different places, on your local machine or sometimes online.
\n",
59 | "In this section, you will learn how to load a dataset into our Jupyter Notebook.
\n",
60 | "In our case, the Automobile Dataset is an online source, and it is in CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.\n",
61 | "
\n",
87 | "We use pandas.read_csv()
function to read the csv file. In the bracket, we put the file path along with a quotation mark, so that pandas will read the file into a data frame from that address. The file path can be either an URL or your local file address.
\n",
88 | "Because the data does not include headers, we can add an argument headers = None
inside the read_csv()
method, so that pandas will not automatically set the first row as a header.
\n",
89 | "You can also assign the dataset to any variable you create.\n",
90 | "
dataframe.head(n)
method to check the top n rows of the dataframe; where n is an integer. Contrary to dataframe.head(n)
, dataframe.tail(n)
will show you the bottom n rows of the dataframe.\n"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {
127 | "collapsed": false
128 | },
129 | "outputs": [],
130 | "source": [
131 | "# show the first 5 rows using dataframe.head() method\n",
132 | "print(\"The first 5 rows of the dataframe\") \n",
133 | "df.head(5)"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "\n", 187 | "Take a look at our dataset; pandas automatically set the header by an integer from 0.\n", 188 | "
\n", 189 | "\n", 190 | "To better describe our data we can introduce a header, this information is available at: https://archive.ics.uci.edu/ml/datasets/Automobile\n", 191 | "
\n", 192 | "\n", 193 | "Thus, we have to add headers manually.\n", 194 | "
\n", 195 | "\n",
196 | "Firstly, we create a list \"headers\" that include all column names in order.\n",
197 | "Then, we use dataframe.columns = headers
to replace the headers by the list we created.\n",
198 | "
\n",
301 | "Correspondingly, Pandas enables us to save the dataset to csv by using the dataframe.to_csv()
method, you can add the file path and name along with quotation marks in the brackets.\n",
302 | "
\n", 304 | " For example, if you would save the dataframe df as automobile.csv to your local machine, you may use the syntax below:\n", 305 | "
" 306 | ] 307 | }, 308 | { 309 | "cell_type": "raw", 310 | "metadata": {}, 311 | "source": [ 312 | "df.to_csv(\"automobile.csv\", index=False)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | " We can also read and save other file formats, we can use similar functions to **`pd.read_csv()`** and **`df.to_csv()`** for other data formats, the functions are listed in the following table:\n" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "\n",
346 | "After reading data into Pandas dataframe, it is time for us to explore the dataset.
\n",
347 | "There are several ways to obtain essential insights of the data to help us better understand our dataset.\n",
348 | "
\n",
357 | "Data has a variety of types.
\n",
358 | "The main types stored in Pandas dataframes are object, float, int, bool and datetime64. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:\n",
359 | "
\n",
396 | "As a result, as shown above, it is clear to see that the data type of \"symboling\" and \"curb-weight\" are int64
, \"normalized-losses\" is object
, and \"wheel-base\" is float64
, etc.\n",
397 | "
\n", 399 | "These data types can be changed; we will learn how to accomplish this in a later module.\n", 400 | "
" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "NaN
(Not a Number) values."
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "metadata": {
429 | "collapsed": false
430 | },
431 | "outputs": [],
432 | "source": [
433 | "df.describe()"
434 | ]
435 | },
436 | {
437 | "cell_type": "markdown",
438 | "metadata": {},
439 | "source": [
440 | "\n",
441 | "This shows the statistical summary of all numeric-typed (int, float) columns.
\n",
442 | "For example, the attribute \"symboling\" has 205 counts, the mean value of this column is 0.83, the standard deviation is 1.25, the minimum value is -2, 25th percentile is 0, 50th percentile is 1, 75th percentile is 2, and the maximum value is 3.\n",
443 | "
\n",
444 | "However, what if we would also like to check all the columns including those that are of type object.\n",
445 | "
\n",
446 | "\n",
447 | "You can add an argument include = \"all\"
inside the bracket. Let's try it again.\n",
448 | "
\n",
468 | "Now, it provides the statistical summary of all the columns, including object-typed attributes.
\n",
469 | "We can now see how many unique values, which is the top value and the frequency of top value in the object-typed columns.
\n",
470 | "Some values in the table above show as \"NaN\", this is because those numbers are not available regarding a particular column type.
\n",
471 | "
\n", 482 | "You can select the columns of a data frame by indicating the name of each column, for example, you can select the three columns as follows:\n", 483 | "
\n", 484 | "\n",
485 | " dataframe[[' column 1 ',column 2', 'column 3']]
\n",
486 | "
\n", 488 | "Where \"column\" is the name of the column, you can apply the method \".describe()\" to get the statistics of those columns as follows:\n", 489 | "
\n", 490 | "\n",
491 | " dataframe[[' column 1 ',column 2', 'column 3'] ].describe()
\n",
492 | "
\n",
559 | "Here we are able to see the information of our dataframe, with the top 30 rows and the bottom 30 rows.\n",
560 | "
\n",
561 | "And, it also shows us the whole data frame has 205 rows and 26 columns in total.\n",
562 | "
Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 591 | ] 592 | }, 593 | { 594 | "cell_type": "markdown", 595 | "metadata": {}, 596 | "source": [ 597 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 599 | ] 600 | } 601 | ], 602 | "metadata": { 603 | "anaconda-cloud": {}, 604 | "kernelspec": { 605 | "display_name": "Python 3", 606 | "language": "python", 607 | "name": "python3" 608 | }, 609 | "language_info": { 610 | "codemirror_mode": { 611 | "name": "ipython", 612 | "version": 3 613 | }, 614 | "file_extension": ".py", 615 | "mimetype": "text/x-python", 616 | "name": "python", 617 | "nbconvert_exporter": "python", 618 | "pygments_lexer": "ipython3", 619 | "version": "3.6.7" 620 | } 621 | }, 622 | "nbformat": 4, 623 | "nbformat_minor": 2 624 | } 625 | -------------------------------------------------------------------------------- /DA0101EN-Review-Model-Evaluation-and-Refinement.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "An important step in testing your model is to split your data into training and testing data. We will place the target data price in a separate dataframe y:
" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "y_data = df['price']" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "drop price data in x data" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": true 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "x_data=df.drop('price',axis=1)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "Now we randomly split our data into training and testing data using the function train_test_split. " 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": { 242 | "collapsed": false 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "from sklearn.model_selection import train_test_split\n", 247 | "\n", 248 | "\n", 249 | "x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)\n", 250 | "\n", 251 | "\n", 252 | "print(\"number of test samples :\", x_test.shape[0])\n", 253 | "print(\"number of training samples:\",x_train.shape[0])\n" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "The test_size parameter sets the proportion of data that is split into the testing set. In the above, the testing set is set to 10% of the total dataset. " 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "It turns out that the test data sometimes referred to as the out of sample data is a much better measure of how well your model performs in the real world. One reason for this is overfitting; let's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression so we will explore overfitting in that context.
" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "Let's create Multiple linear regression objects and train the model using 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg' as features." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "metadata": { 626 | "collapsed": false 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "lr = LinearRegression()\n", 631 | "lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "metadata": {}, 637 | "source": [ 638 | "Prediction using training data:" 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": null, 644 | "metadata": { 645 | "collapsed": false 646 | }, 647 | "outputs": [], 648 | "source": [ 649 | "yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])\n", 650 | "yhat_train[0:5]" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "metadata": {}, 656 | "source": [ 657 | "Prediction using test data: " 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": { 664 | "collapsed": false 665 | }, 666 | "outputs": [], 667 | "source": [ 668 | "yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])\n", 669 | "yhat_test[0:5]" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "Let's perform some model evaluation using our training and testing data separately. First we import the seaborn and matplotlibb library for plotting." 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": { 683 | "collapsed": true 684 | }, 685 | "outputs": [], 686 | "source": [ 687 | "import matplotlib.pyplot as plt\n", 688 | "%matplotlib inline\n", 689 | "import seaborn as sns" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": {}, 695 | "source": [ 696 | "Let's examine the distribution of the predicted values of the training data." 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": { 703 | "collapsed": false 704 | }, 705 | "outputs": [], 706 | "source": [ 707 | "Title = 'Distribution Plot of Predicted Value Using Training Data vs Training Data Distribution'\n", 708 | "DistributionPlot(y_train, yhat_train, \"Actual Values (Train)\", \"Predicted Values (Train)\", Title)" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "metadata": {}, 714 | "source": [ 715 | "Figure 1: Plot of predicted values using the training data compared to the training data. " 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "So far the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values. 
" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": { 729 | "collapsed": false 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "Title='Distribution Plot of Predicted Value Using Test Data vs Data Distribution of Test Data'\n", 734 | "DistributionPlot(y_test,yhat_test,\"Actual Values (Test)\",\"Predicted Values (Test)\",Title)" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": {}, 740 | "source": [ 741 | "Figur 2: Plot of predicted value using the test data compared to the test data. " 742 | ] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": {}, 747 | "source": [ 748 | "Comparing Figure 1 and Figure 2; it is evident the distribution of the test data in Figure 1 is much better at fitting the data. This difference in Figure 2 is apparent where the ranges are from 5000 to 15 000. This is where the distribution shape is exceptionally different. Let's see if polynomial regression also exhibits a drop in the prediction accuracy when analysing the test dataset.
" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "metadata": { 755 | "collapsed": false 756 | }, 757 | "outputs": [], 758 | "source": [ 759 | "from sklearn.preprocessing import PolynomialFeatures" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": {}, 765 | "source": [ 766 | "Overfitting occurs when the model fits the noise, not the underlying process. Therefore when testing your model using the test-set, your model does not perform as well as it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 5 polynomial model.
" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "Let's use 55 percent of the data for testing and the rest for training:" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": { 781 | "collapsed": false 782 | }, 783 | "outputs": [], 784 | "source": [ 785 | "x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)" 786 | ] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": {}, 791 | "source": [ 792 | "We will perform a degree 5 polynomial transformation on the feature 'horse power'. " 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": null, 798 | "metadata": { 799 | "collapsed": false 800 | }, 801 | "outputs": [], 802 | "source": [ 803 | "pr = PolynomialFeatures(degree=5)\n", 804 | "x_train_pr = pr.fit_transform(x_train[['horsepower']])\n", 805 | "x_test_pr = pr.fit_transform(x_test[['horsepower']])\n", 806 | "pr" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "Now let's create a linear regression model \"poly\" and train it." 814 | ] 815 | }, 816 | { 817 | "cell_type": "code", 818 | "execution_count": null, 819 | "metadata": { 820 | "collapsed": false 821 | }, 822 | "outputs": [], 823 | "source": [ 824 | "poly = LinearRegression()\n", 825 | "poly.fit(x_train_pr, y_train)" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "We can see the output of our model using the method \"predict.\" then assign the values to \"yhat\"." 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": { 839 | "collapsed": false 840 | }, 841 | "outputs": [], 842 | "source": [ 843 | "yhat = poly.predict(x_test_pr)\n", 844 | "yhat[0:5]" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "Let's take the first five predicted values and compare it to the actual targets. " 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": null, 857 | "metadata": { 858 | "collapsed": false 859 | }, 860 | "outputs": [], 861 | "source": [ 862 | "print(\"Predicted values:\", yhat[0:4])\n", 863 | "print(\"True values:\", y_test[0:4].values)" 864 | ] 865 | }, 866 | { 867 | "cell_type": "markdown", 868 | "metadata": {}, 869 | "source": [ 870 | "We will use the function \"PollyPlot\" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function." 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": null, 876 | "metadata": { 877 | "collapsed": false, 878 | "scrolled": false 879 | }, 880 | "outputs": [], 881 | "source": [ 882 | "PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)" 883 | ] 884 | }, 885 | { 886 | "cell_type": "markdown", 887 | "metadata": {}, 888 | "source": [ 889 | "Figur 4 A polynomial regression model, red dots represent training data, green dots represent test data, and the blue line represents the model prediction. " 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "metadata": {}, 895 | "source": [ 896 | "We see that the estimated function appears to track the data but around 200 horsepower, the function begins to diverge from the data points. 
" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": {}, 902 | "source": [ 903 | " R^2 of the training data:" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": { 910 | "collapsed": false 911 | }, 912 | "outputs": [], 913 | "source": [ 914 | "poly.score(x_train_pr, y_train)" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "metadata": {}, 920 | "source": [ 921 | " R^2 of the test data:" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": null, 927 | "metadata": { 928 | "collapsed": false 929 | }, 930 | "outputs": [], 931 | "source": [ 932 | "poly.score(x_test_pr, y_test)" 933 | ] 934 | }, 935 | { 936 | "cell_type": "markdown", 937 | "metadata": {}, 938 | "source": [ 939 | "We see the R^2 for the training data is 0.5567 while the R^2 on the test data was -29.87. The lower the R^2, the worse the model, a Negative R^2 is a sign of overfitting." 940 | ] 941 | }, 942 | { 943 | "cell_type": "markdown", 944 | "metadata": {}, 945 | "source": [ 946 | "Let's see how the R^2 changes on the test data for different order polynomials and plot the results:" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": null, 952 | "metadata": { 953 | "collapsed": false 954 | }, 955 | "outputs": [], 956 | "source": [ 957 | "Rsqu_test = []\n", 958 | "\n", 959 | "order = [1, 2, 3, 4]\n", 960 | "for n in order:\n", 961 | " pr = PolynomialFeatures(degree=n)\n", 962 | " \n", 963 | " x_train_pr = pr.fit_transform(x_train[['horsepower']])\n", 964 | " \n", 965 | " x_test_pr = pr.fit_transform(x_test[['horsepower']]) \n", 966 | " \n", 967 | " lr.fit(x_train_pr, y_train)\n", 968 | " \n", 969 | " Rsqu_test.append(lr.score(x_test_pr, y_test))\n", 970 | "\n", 971 | "plt.plot(order, Rsqu_test)\n", 972 | "plt.xlabel('order')\n", 973 | "plt.ylabel('R^2')\n", 974 | "plt.title('R^2 Using Test Data')\n", 975 | "plt.text(3, 0.75, 'Maximum R^2 ') " 976 | ] 977 | }, 978 | { 979 | "cell_type": "markdown", 980 | "metadata": {}, 981 | "source": [ 982 | "We see the R^2 gradually increases until an order three polynomial is used. Then the R^2 dramatically decreases at four." 983 | ] 984 | }, 985 | { 986 | "cell_type": "markdown", 987 | "metadata": {}, 988 | "source": [ 989 | "The following function will be used in the next section; please run the cell." 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": null, 995 | "metadata": { 996 | "collapsed": true 997 | }, 998 | "outputs": [], 999 | "source": [ 1000 | "def f(order, test_data):\n", 1001 | " x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_data, random_state=0)\n", 1002 | " pr = PolynomialFeatures(degree=order)\n", 1003 | " x_train_pr = pr.fit_transform(x_train[['horsepower']])\n", 1004 | " x_test_pr = pr.fit_transform(x_test[['horsepower']])\n", 1005 | " poly = LinearRegression()\n", 1006 | " poly.fit(x_train_pr,y_train)\n", 1007 | " PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train,y_test, poly, pr)" 1008 | ] 1009 | }, 1010 | { 1011 | "cell_type": "markdown", 1012 | "metadata": {}, 1013 | "source": [ 1014 | "The following interface allows you to experiment with different polynomial orders and different amounts of data. 
" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "code", 1019 | "execution_count": null, 1020 | "metadata": { 1021 | "collapsed": false 1022 | }, 1023 | "outputs": [], 1024 | "source": [ 1025 | "interact(f, order=(0, 6, 1), test_data=(0.05, 0.95, 0.05))" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "Joseph Santarcangelo is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
" 1627 | ] 1628 | }, 1629 | { 1630 | "cell_type": "markdown", 1631 | "metadata": {}, 1632 | "source": [ 1633 | "Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the MIT License.
" 1635 | ] 1636 | } 1637 | ], 1638 | "metadata": { 1639 | "anaconda-cloud": {}, 1640 | "kernelspec": { 1641 | "display_name": "Python 3", 1642 | "language": "python", 1643 | "name": "python3" 1644 | }, 1645 | "language_info": { 1646 | "codemirror_mode": { 1647 | "name": "ipython", 1648 | "version": 3 1649 | }, 1650 | "file_extension": ".py", 1651 | "mimetype": "text/x-python", 1652 | "name": "python", 1653 | "nbconvert_exporter": "python", 1654 | "pygments_lexer": "ipython3", 1655 | "version": "3.7.3" 1656 | } 1657 | }, 1658 | "nbformat": 4, 1659 | "nbformat_minor": 2 1660 | } 1661 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data-Analysis-with-Python-by-IBM-on-Coursera 2 | Answer keys for course - Data Analysis with Python by IBM on Coursera 3 | --------------------------------------------------------------------------------