├── LICENSE.txt ├── Lab 10 - Ridge Regression and the Lasso in Python.ipynb ├── Lab 11 - PCR and PLS Regression in Python.ipynb ├── Lab 12 - Polynomial Regression and Step Functions in Python.ipynb ├── Lab 13 - Splines in Python.ipynb ├── Lab 14 - Decision Trees in Python.ipynb ├── Lab 15 - Support Vector Machines in Python.ipynb ├── Lab 16 - Multiclass SVMs and Applications to Real Data in Python.ipynb ├── Lab 18 - PCA in Python.ipynb ├── Lab 2 - Linear Regression in Python.ipynb ├── Lab 3 - K-Nearest Neighbors in Python.ipynb ├── Lab 4 - Logistic Regression in Python.ipynb ├── Lab 5 - LDA and QDA in Python.ipynb ├── Lab 7 - Cross-Validation in Python.ipynb ├── Lab 8 - Subset Selection in Python.ipynb ├── Lab 9 - Linear Model Selection in Python.ipynb ├── README.md └── data ├── Auto.csv ├── Boston.csv ├── Caravan.csv ├── Carseats.csv ├── Hitters.csv ├── Khan_xtest.csv ├── Khan_xtrain.csv ├── Khan_ytest.csv ├── Khan_ytrain.csv ├── OJ.csv ├── Smarket 2.csv ├── Smarket.csv └── Wage.csv /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 R. Jordan Crouser. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Lab 13 - Splines in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Splines and GAMs is a python adaptation of p. 293-297 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was originally written by Jordi Warmenhoven, and was adapted by R. Jordan Crouser at Smith College in Spring 2016. 
" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np\n", 20 | "import matplotlib as mpl\n", 21 | "import matplotlib.pyplot as plt\n", 22 | "\n", 23 | "from sklearn.preprocessing import PolynomialFeatures\n", 24 | "import statsmodels.api as sm\n", 25 | "import statsmodels.formula.api as smf\n", 26 | "\n", 27 | "%matplotlib inline\n", 28 | "\n", 29 | "# Read in the data\n", 30 | "df = pd.read_csv('Wage.csv')\n", 31 | "\n", 32 | "# Generate a sequence of age values spanning the range\n", 33 | "age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "# 7.8.2 Splines\n", 41 | "\n", 42 | "In order to fit regression splines in python, we use the ${\\tt dmatrix}$ module from the ${\\tt patsy}$ library. In lecture, we saw that regression splines can be fit by constructing an appropriate matrix of basis functions. The ${\\tt bs()}$ function generates the entire matrix of basis functions for splines with the specified set of knots. Fitting ${\\tt wage}$ to ${\\tt age}$ using a regression spline is simple:" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": { 49 | "collapsed": false 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "from patsy import dmatrix\n", 54 | "\n", 55 | "# Specifying 3 knots\n", 56 | "transformed_x1 = dmatrix(\"bs(df.age, knots=(25,40,60), degree=3, include_intercept=False)\",\n", 57 | " {\"df.age\": df.age}, return_type='dataframe')\n", 58 | "\n", 59 | "# Build a regular linear model from the splines\n", 60 | "fit1 = sm.GLM(df.wage, transformed_x1).fit()\n", 61 | "fit1.params" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "Here we have prespecified knots at ages 25, 40, and 60. This produces a\n", 69 | "spline with six basis functions. (Recall that a cubic spline with three knots\n", 70 | "has seven degrees of freedom; these degrees of freedom are used up by an\n", 71 | "intercept, plus six basis functions.) We could also use the ${\\tt df}$ option to\n", 72 | "produce a spline with knots at uniform quantiles of the data:" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": { 79 | "collapsed": false 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "# Specifying 6 degrees of freedom \n", 84 | "transformed_x2 = dmatrix(\"bs(df.age, df=6, include_intercept=False)\",\n", 85 | " {\"df.age\": df.age}, return_type='dataframe')\n", 86 | "fit2 = sm.GLM(df.wage, transformed_x2).fit()\n", 87 | "fit2.params" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "In this case python chooses knots which correspond\n", 95 | "to the 25th, 50th, and 75th percentiles of ${\\tt age}$. The function ${\\tt bs()}$ also has\n", 96 | "a ${\\tt degree}$ argument, so we can fit splines of any degree, rather than the\n", 97 | "default degree of 3 (which yields a cubic spline).\n", 98 | "\n", 99 | "In order to instead fit a natural spline, we use the ${\\tt cr()}$ function. 
Here\n", 100 | "we fit a natural spline with four degrees of freedom:" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "# Specifying 4 degrees of freedom\n", 112 | "transformed_x3 = dmatrix(\"cr(df.age, df=4)\", {\"df.age\": df.age}, return_type='dataframe')\n", 113 | "fit3 = sm.GLM(df.wage, transformed_x3).fit()\n", 114 | "fit3.params" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "As with the ${\\tt bs()}$ function, we could instead specify the knots directly using\n", 122 | "the ${\\tt knots}$ option.\n", 123 | "\n", 124 | "Let's see how these three models stack up:" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "# Generate a sequence of age values spanning the range\n", 136 | "age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)\n", 137 | "\n", 138 | "# Make some predictions\n", 139 | "pred1 = fit1.predict(dmatrix(\"bs(age_grid, knots=(25,40,60), include_intercept=False)\",\n", 140 | " {\"age_grid\": age_grid}, return_type='dataframe'))\n", 141 | "pred2 = fit2.predict(dmatrix(\"bs(age_grid, df=6, include_intercept=False)\",\n", 142 | " {\"age_grid\": age_grid}, return_type='dataframe'))\n", 143 | "pred3 = fit3.predict(dmatrix(\"cr(age_grid, df=4)\", {\"age_grid\": age_grid}, return_type='dataframe'))\n", 144 | "\n", 145 | "# Plot the splines and error bands\n", 146 | "plt.scatter(df.age, df.wage, facecolor='None', edgecolor='k', alpha=0.1)\n", 147 | "plt.plot(age_grid, pred1, color='b', label='Specifying three knots')\n", 148 | "plt.plot(age_grid, pred2, color='r', label='Specifying df=6')\n", 149 | "plt.plot(age_grid, pred3, color='g', label='Natural spline df=4')\n", 150 | "plt.legend()\n", 151 | "plt.xlim(15,85)\n", 152 | "plt.ylim(0,350)\n", 153 | "plt.xlabel('age')\n", 154 | "plt.ylabel('wage')" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "To get credit for this lab, post your answer to the following question:\n", 162 | " - How would you choose whether to use a polynomial, step, or spline function for each predictor when building a GAM?" 163 | ] 164 | } 165 | ], 166 | "metadata": { 167 | "kernelspec": { 168 | "display_name": "Python [python3]", 169 | "language": "python", 170 | "name": "Python [python3]" 171 | }, 172 | "language_info": { 173 | "codemirror_mode": { 174 | "name": "ipython", 175 | "version": 3 176 | }, 177 | "file_extension": ".py", 178 | "mimetype": "text/x-python", 179 | "name": "python", 180 | "nbconvert_exporter": "python", 181 | "pygments_lexer": "ipython3", 182 | "version": "3.5.1" 183 | } 184 | }, 185 | "nbformat": 4, 186 | "nbformat_minor": 0 187 | } 188 | -------------------------------------------------------------------------------- /Lab 14 - Decision Trees in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Decision Trees is a Python adaptation of p. 324-331 of \"Introduction to Statistical Learning with\n", 8 | "Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Original adaptation by J. Warmenhoven, updated by R. 
Jordan Crouser at Smith\n", 9 | "College for SDS293: Machine Learning (Spring 2016)." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "import numpy as np\n", 22 | "import matplotlib as mpl\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "import graphviz\n", 25 | "\n", 26 | "%matplotlib inline" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# 8.3.1 Fitting Classification Trees\n", 34 | "\n", 35 | "The ${\\tt sklearn}$ library has a lot of useful tools for constructing classification and regression trees:" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "from sklearn.cross_validation import train_test_split\n", 47 | "from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz\n", 48 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", 49 | "from sklearn.metrics import confusion_matrix, mean_squared_error" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "We'll start by using **classification trees** to analyze the ${\\tt Carseats}$ data set. In these\n", 57 | "data, ${\\tt Sales}$ is a continuous variable, and so we begin by converting it to a\n", 58 | "binary variable. We use the ${\\tt ifelse()}$ function to create a variable, called\n", 59 | "${\\tt High}$, which takes on a value of ${\\tt Yes}$ if the ${\\tt Sales}$ variable exceeds 8, and\n", 60 | "takes on a value of ${\\tt No}$ otherwise. We'll append this onto our dataFrame using the ${\\tt .map()}$ function, and then do a little data cleaning to tidy things up:" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "df3 = pd.read_csv('Carseats.csv').drop('Unnamed: 0', axis=1)\n", 72 | "df3['High'] = df3.Sales.map(lambda x: 1 if x>8 else 0)\n", 73 | "df3.ShelveLoc = pd.factorize(df3.ShelveLoc)[0]\n", 74 | "df3.Urban = df3.Urban.map({'No':0, 'Yes':1})\n", 75 | "df3.US = df3.US.map({'No':0, 'Yes':1})\n", 76 | "df3.info()" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "In order to properly evaluate the performance of a classification tree on\n", 84 | "the data, we must estimate the test error rather than simply computing\n", 85 | "the training error. We first split the observations into a training set and a test\n", 86 | "set:" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "X = df3.drop(['Sales', 'High'], axis=1)\n", 98 | "y = df3.High\n", 99 | "\n", 100 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "We now use the ${\\tt DecisionTreeClassifier()}$ function to fit a classification tree in order to predict\n", 108 | "${\\tt High}$ using all variables but ${\\tt Sales}$ (that would be a little silly...). 
Unfortunately, manual pruning is not implemented in ${\\tt sklearn}$: http://scikit-learn.org/stable/modules/tree.html\n", 109 | "\n", 110 | "However, we can limit the depth of a tree using the ${\\tt max\\_depth}$ parameter:" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [], 120 | "source": [ 121 | "clf = DecisionTreeClassifier(max_depth=6)\n", 122 | "clf.fit(X_train, y_train)\n", 123 | "clf.score(X_train, y_train)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "We see that the training accuracy is 95.5%.\n", 131 | "\n", 132 | "One of the most attractive properties of trees is that they can be\n", 133 | "graphically displayed. Unfortunately, this is a bit of a roundabout process in ${\\tt sklearn}$. We use the ${\\tt export\\_graphviz()}$ function to export the tree structure to a temporary ${\\tt .dot}$ file,\n", 134 | "and the ${\\tt graphviz.Source()}$ function to display the image:" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "export_graphviz(clf, out_file=\"mytree.dot\", feature_names=X_train.columns)\n", 146 | "with open(\"mytree.dot\") as f:\n", 147 | " dot_graph = f.read()\n", 148 | "graphviz.Source(dot_graph)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "The most important indicator of ${\\tt High}$ sales appears to be ${\\tt Price}$." 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "Finally, let's evaluate the tree's performance on\n", 163 | "the test data. The ${\\tt predict()}$ function can be used for this purpose. We can then build a confusion matrix, which shows that we are making correct predictions for\n", 164 | "around 74.5% of the test data set:" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": { 171 | "collapsed": false 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "pred = clf.predict(X_test)\n", 176 | "cm = pd.DataFrame(confusion_matrix(y_test, pred).T, index=['No', 'Yes'], columns=['No', 'Yes'])\n", 177 | "print(cm)\n", 178 | "# 99+50/200 = 0.745" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "# 8.3.2 Fitting Regression Trees\n", 186 | "\n", 187 | "Now let's try fitting a **regression tree** to the ${\\tt Boston}$ data set from the ${\\tt MASS}$ library. First, we create a\n", 188 | "training set, and fit the tree to the training data using ${\\tt medv}$ (median home value) as our response:" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "boston_df = pd.read_csv('Boston.csv')\n", 200 | "X = boston_df.drop('medv', axis=1)\n", 201 | "y = boston_df.medv\n", 202 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)\n", 203 | "\n", 204 | "# Pruning not supported. 
Choosing max depth 2)\n", 205 | "regr2 = DecisionTreeRegressor(max_depth=2)\n", 206 | "regr2.fit(X_train, y_train)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "Let's take a look at the tree:" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": false 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "export_graphviz(regr2, out_file=\"mytree.dot\", feature_names=X_train.columns)\n", 225 | "with open(\"mytree.dot\") as f:\n", 226 | " dot_graph = f.read()\n", 227 | "graphviz.Source(dot_graph)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "The variable ${\\tt lstat}$ measures the percentage of individuals with lower\n", 235 | "socioeconomic status. The tree indicates that lower values of ${\\tt lstat}$ correspond\n", 236 | "to more expensive houses. The tree predicts a median house price\n", 237 | "of \\$45,766 for larger homes (${\\tt rm}>=7.435$) in suburbs in which residents have high socioeconomic\n", 238 | "status (${\\tt lstat}<7.81$).\n", 239 | "\n", 240 | "Now let's see how it does on the test data:" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": { 247 | "collapsed": false 248 | }, 249 | "outputs": [], 250 | "source": [ 251 | "pred = regr2.predict(X_test)\n", 252 | "\n", 253 | "plt.scatter(pred, y_test, label='medv')\n", 254 | "plt.plot([0, 1], [0, 1], '--k', transform=plt.gca().transAxes)\n", 255 | "plt.xlabel('pred')\n", 256 | "plt.ylabel('y_test')\n", 257 | "\n", 258 | "mean_squared_error(y_test, pred)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "The test set MSE associated with the regression tree is\n", 266 | "28.8. The square root of the MSE is therefore around 5.37, indicating\n", 267 | "that this model leads to test predictions that are within around \\$5,370 of\n", 268 | "the true median home value for the suburb.\n", 269 | " \n", 270 | "# 8.3.3 Bagging and Random Forests\n", 271 | "\n", 272 | "Let's see if we can improve on this result using **bagging** and **random forests**. The exact results obtained in this section may\n", 273 | "depend on the version of ${\\tt python}$ and the version of the ${\\tt RandomForestRegressor}$ package\n", 274 | "installed on your computer, so don't stress out if you don't match up exactly with the book. Recall that **bagging** is simply a special case of\n", 275 | "a **random forest** with $m = p$. Therefore, the ${\\tt RandomForestRegressor()}$ function can\n", 276 | "be used to perform both random forests and bagging. Let's start with bagging:" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": { 283 | "collapsed": false 284 | }, 285 | "outputs": [], 286 | "source": [ 287 | "# Bagging: using all features\n", 288 | "regr1 = RandomForestRegressor(max_features=13, random_state=1)\n", 289 | "regr1.fit(X_train, y_train)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "The argument ${\\tt max\\_features=13}$ indicates that all 13 predictors should be considered\n", 297 | "for each split of the tree -- in other words, that bagging should be done. How\n", 298 | "well does this bagged model perform on the test set?" 
299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": { 305 | "collapsed": false 306 | }, 307 | "outputs": [], 308 | "source": [ 309 | "pred = regr1.predict(X_test)\n", 310 | "plt.scatter(pred, y_test, label='medv')\n", 311 | "plt.plot([0, 1], [0, 1], '--k', transform=plt.gca().transAxes)\n", 312 | "plt.xlabel('pred')\n", 313 | "plt.ylabel('y_test')\n", 314 | "mean_squared_error(y_test, pred)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "The test setMSE associated with the bagged regression tree is significantly lower than our single tree!" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "We can grow a random forest in exactly the same way, except that\n", 329 | "we'll use a smaller value of the ${\\tt max\\_features}$ argument. Here we'll\n", 330 | "use ${\\tt max\\_features = 6}$:" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": { 337 | "collapsed": false 338 | }, 339 | "outputs": [], 340 | "source": [ 341 | "# Random forests: using 6 features\n", 342 | "regr2 = RandomForestRegressor(max_features=6, random_state=1)\n", 343 | "regr2.fit(X_train, y_train)\n", 344 | "\n", 345 | "pred = regr2.predict(X_test)\n", 346 | "mean_squared_error(y_test, pred)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "The test set MSE is even lower; this indicates that random forests yielded an\n", 354 | "improvement over bagging in this case.\n", 355 | "\n", 356 | "Using the ${\\tt feature\\_importances\\_}$ attribute of the ${\\tt RandomForestRegressor}$, we can view the importance of each\n", 357 | "variable:" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": { 364 | "collapsed": false 365 | }, 366 | "outputs": [], 367 | "source": [ 368 | "Importance = pd.DataFrame({'Importance':regr2.feature_importances_*100}, index=X.columns)\n", 369 | "Importance.sort_values(by='Importance', axis=0, ascending=True).plot(kind='barh', color='r', )\n", 370 | "plt.xlabel('Variable Importance')\n", 371 | "plt.gca().legend_ = None" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "The results indicate that across all of the trees considered in the random\n", 379 | "forest, the wealth level of the community (${\\tt lstat}$) and the house size (${\\tt rm}$)\n", 380 | "are by far the two most important variables." 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "# 8.3.4 Boosting\n", 388 | "\n", 389 | "Now we'll use the ${\\tt GradientBoostingRegressor}$ package to fit **boosted\n", 390 | "regression trees** to the ${\\tt Boston}$ data set. 
The\n", 391 | "argument ${\\tt n_estimators=500}$ indicates that we want 500 trees, and the option\n", 392 | "${\\tt interaction.depth=4}$ limits the depth of each tree:" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": { 399 | "collapsed": false 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "regr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, max_depth=4, random_state=1)\n", 404 | "regr.fit(X_train, y_train)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "Let's check out the feature importances again:" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "collapsed": false 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "feature_importance = regr.feature_importances_*100\n", 423 | "rel_imp = pd.Series(feature_importance, index=X.columns).sort_values(inplace=False)\n", 424 | "rel_imp.T.plot(kind='barh', color='r', )\n", 425 | "plt.xlabel('Variable Importance')\n", 426 | "plt.gca().legend_ = None" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "We see that ${\\tt lstat}$ and ${\\tt rm}$ are again the most important variables by far. Now let's use the boosted model to predict ${\\tt medv}$ on the test set:" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "collapsed": false 441 | }, 442 | "outputs": [], 443 | "source": [ 444 | "mean_squared_error(y_test, regr.predict(X_test))" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "The test MSE obtained is similar to the test MSE for random forests\n", 452 | "and superior to that for bagging. If we want to, we can perform boosting\n", 453 | "with a different value of the shrinkage parameter $\\lambda$. 
Here we take $\\lambda = 0.2$:" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": { 460 | "collapsed": false 461 | }, 462 | "outputs": [], 463 | "source": [ 464 | "regr2 = GradientBoostingRegressor(n_estimators=500, learning_rate=0.2, max_depth=4, random_state=1)\n", 465 | "regr2.fit(X_train, y_train)\n", 466 | "mean_squared_error(y_test, regr2.predict(X_test))" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "In this case, using $\\lambda = 0.2$ leads to a slightly lower test MSE than $\\lambda = 0.01$.\n", 474 | "\n", 475 | "To get credit for this lab, post your responses to the following questions:\n", 476 | " - What's one real-world scenario where you might try using Bagging?\n", 477 | " - What's one real-world scenario where you might try using Random Forests?\n", 478 | " - What's one real-world scenario where you might try using Boosting?\n", 479 | " \n", 480 | "to Piazza: https://piazza.com/class/igwiv4w3ctb6rg?cid=53" 481 | ] 482 | } 483 | ], 484 | "metadata": { 485 | "anaconda-cloud": {}, 486 | "kernelspec": { 487 | "display_name": "Python [conda root]", 488 | "language": "python", 489 | "name": "conda-root-py" 490 | }, 491 | "language_info": { 492 | "codemirror_mode": { 493 | "name": "ipython", 494 | "version": 2 495 | }, 496 | "file_extension": ".py", 497 | "mimetype": "text/x-python", 498 | "name": "python", 499 | "nbconvert_exporter": "python", 500 | "pygments_lexer": "ipython2", 501 | "version": "2.7.11" 502 | } 503 | }, 504 | "nbformat": 4, 505 | "nbformat_minor": 0 506 | } 507 | -------------------------------------------------------------------------------- /Lab 18 - PCA in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Principal Components Analysis is a python adaptation of p. 401-404,\n", 8 | "408-410 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James,\n", 9 | "Daniela Witten, Trevor Hastie and Robert Tibshirani. Original adaptation by J. Warmenhoven, updated by R. Jordan Crouser at Smith College for\n", 10 | "SDS293: Machine Learning (Spring 2016)." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": { 17 | "collapsed": true 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "import pandas as pd\n", 22 | "import numpy as np\n", 23 | "import matplotlib as mpl\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "\n", 26 | "%matplotlib inline" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# 10.4: Principal Components Analysis\n", 34 | "\n", 35 | "In this lab, we perform PCA on the ${\\tt USArrests}$ data set. 
The rows of the data set contain the 50 states, in\n", 36 | "alphabetical order:" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "df = pd.read_csv('USArrests.csv', index_col=0)\n", 48 | "df.head()" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "The columns of the data set contain four variables relating to various crimes:" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "df.info()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "Let's start by taking a quick look at the column means of the data:" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "df.mean()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "We see right away the the data have **vastly** different means. We can also examine the variances of the four variables:" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "df.var()" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Not surprisingly, the variables also have vastly different variances: the\n", 110 | "${\\tt UrbanPop}$ variable measures the percentage of the population in each state\n", 111 | "living in an urban area, which is not a comparable number to the number\n", 112 | "of crimes committeed in each state per 100,000 individuals. If we failed to scale the\n", 113 | "variables before performing PCA, then most of the principal components\n", 114 | "that we observed would be driven by the ${\\tt Assault}$ variable, since it has by\n", 115 | "far the largest mean and variance. \n", 116 | "\n", 117 | "Thus, it is important to standardize the\n", 118 | "variables to have mean zero and standard deviation 1 before performing\n", 119 | "PCA. We can do this using the ${\\tt scale()}$ function from ${\\tt sklearn}$:" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": { 126 | "collapsed": true 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "from sklearn.preprocessing import scale\n", 131 | "X = pd.DataFrame(scale(df), index=df.index, columns=df.columns)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "Now we'll use the ${\\tt PCA()}$ function from ${\\tt sklearn}$ to compute the loading vectors:" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": { 145 | "collapsed": false 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "from sklearn.decomposition import PCA\n", 150 | "\n", 151 | "pca_loadings = pd.DataFrame(PCA().fit(X).components_.T, index=df.columns, columns=['V1', 'V2', 'V3', 'V4'])\n", 152 | "pca_loadings" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "We see that there are four distinct principal components. 
This is to be\n", 160 | "expected because there are in general ${\\tt min(n − 1, p)}$ informative principal\n", 161 | "components in a data set with $n$ observations and $p$ variables.\n", 162 | "\n", 163 | "Using the ${\\tt fit_transform()}$ function, we can get the principal component scores of the original data. We'll take a look at the first few states:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "# Fit the PCA model and transform X to get the principal components\n", 175 | "pca = PCA()\n", 176 | "df_plot = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4'], index=X.index)\n", 177 | "df_plot.head()" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "We can construct a **biplot** of the first two principal components using our loading vectors:" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": { 191 | "collapsed": false 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "fig , ax1 = plt.subplots(figsize=(9,7))\n", 196 | "\n", 197 | "ax1.set_xlim(-3.5,3.5)\n", 198 | "ax1.set_ylim(-3.5,3.5)\n", 199 | "\n", 200 | "# Plot Principal Components 1 and 2\n", 201 | "for i in df_plot.index:\n", 202 | " ax1.annotate(i, (-df_plot.PC1.loc[i], -df_plot.PC2.loc[i]), ha='center')\n", 203 | "\n", 204 | "# Plot reference lines\n", 205 | "ax1.hlines(0,-3.5,3.5, linestyles='dotted', colors='grey')\n", 206 | "ax1.vlines(0,-3.5,3.5, linestyles='dotted', colors='grey')\n", 207 | "\n", 208 | "ax1.set_xlabel('First Principal Component')\n", 209 | "ax1.set_ylabel('Second Principal Component')\n", 210 | " \n", 211 | "# Plot Principal Component loading vectors, using a second y-axis.\n", 212 | "ax2 = ax1.twinx().twiny() \n", 213 | "\n", 214 | "ax2.set_ylim(-1,1)\n", 215 | "ax2.set_xlim(-1,1)\n", 216 | "ax2.set_xlabel('Principal Component loading vectors', color='red')\n", 217 | "\n", 218 | "# Plot labels for vectors. Variable 'a' is a small offset parameter to separate arrow tip and text.\n", 219 | "a = 1.07 \n", 220 | "for i in pca_loadings[['V1', 'V2']].index:\n", 221 | " ax2.annotate(i, (-pca_loadings.V1.loc[i]*a, -pca_loadings.V2.loc[i]*a), color='red')\n", 222 | "\n", 223 | "# Plot vectors\n", 224 | "ax2.arrow(0,0,-pca_loadings.V1[0], -pca_loadings.V2[0])\n", 225 | "ax2.arrow(0,0,-pca_loadings.V1[1], -pca_loadings.V2[1])\n", 226 | "ax2.arrow(0,0,-pca_loadings.V1[2], -pca_loadings.V2[2])\n", 227 | "ax2.arrow(0,0,-pca_loadings.V1[3], -pca_loadings.V2[3])" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "The ${\\tt PCA()}$ function also outputs the variance explained by of each principal\n", 235 | "component. 
We can access these values as follows:" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": { 242 | "collapsed": false 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "pca.explained_variance_" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "We can also get the proportion of variance explained:" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": { 260 | "collapsed": false 261 | }, 262 | "outputs": [], 263 | "source": [ 264 | "pca.explained_variance_ratio_" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "We see that the first principal component explains 62.0% of the variance\n", 272 | "in the data, the next principal component explains 24.7% of the variance,\n", 273 | "and so forth. We can plot the PVE explained by each component as follows:" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": false 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "plt.figure(figsize=(7,5))\n", 285 | "plt.plot([1,2,3,4], pca.explained_variance_ratio_, '-o')\n", 286 | "plt.ylabel('Proportion of Variance Explained')\n", 287 | "plt.xlabel('Principal Component')\n", 288 | "plt.xlim(0.75,4.25)\n", 289 | "plt.ylim(0,1.05)\n", 290 | "plt.xticks([1,2,3,4])" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "We can also use the function ${\\tt cumsum()}$, which computes the cumulative sum of the elements of a numeric vector, to plot the cumulative PVE:" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "plt.figure(figsize=(7,5))\n", 309 | "plt.plot([1,2,3,4], np.cumsum(pca.explained_variance_ratio_), '-s')\n", 310 | "plt.ylabel('Proportion of Variance Explained')\n", 311 | "plt.xlabel('Principal Component')\n", 312 | "plt.xlim(0.75,4.25)\n", 313 | "plt.ylim(0,1.05)\n", 314 | "plt.xticks([1,2,3,4])\n", 315 | "\n" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "# 10.6: NCI60 Data Example\n", 323 | "\n", 324 | "Let's return to the ${\\tt NCI60}$ cancer cell line microarray data, which\n", 325 | "consists of 6,830 gene expression measurements on 64 cancer cell lines:" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "df2 = pd.read_csv('NCI60.csv').drop('Unnamed: 0', axis=1)\n", 337 | "df2.columns = np.arange(df2.columns.size)\n", 338 | "df2.info()" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": { 345 | "collapsed": false 346 | }, 347 | "outputs": [], 348 | "source": [ 349 | "# Read in the labels to check our work later\n", 350 | "y = pd.read_csv('NCI60_y.csv', usecols=[1], skiprows=1, names=['type'])" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "# 10.6.1 PCA on the NCI60 Data\n", 358 | "\n", 359 | "We first perform PCA on the data after scaling the variables (genes) to\n", 360 | "have standard deviation one, although one could reasonably argue that it\n", 361 | "is better not to scale the genes:" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | 
"execution_count": null, 367 | "metadata": { 368 | "collapsed": true 369 | }, 370 | "outputs": [], 371 | "source": [ 372 | "# Scale the data\n", 373 | "X = pd.DataFrame(scale(df2))\n", 374 | "X.shape\n", 375 | "\n", 376 | "# Fit the PCA model and transform X to get the principal components\n", 377 | "pca2 = PCA()\n", 378 | "df2_plot = pd.DataFrame(pca2.fit_transform(X))" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "We now plot the first few principal component score vectors, in order to\n", 386 | "visualize the data. The observations (cell lines) corresponding to a given\n", 387 | "cancer type will be plotted in the same color, so that we can see to what\n", 388 | "extent the observations within a cancer type are similar to each other:" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": { 395 | "collapsed": false 396 | }, 397 | "outputs": [], 398 | "source": [ 399 | "fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))\n", 400 | "\n", 401 | "color_idx = pd.factorize(y.type)[0]\n", 402 | "cmap = mpl.cm.hsv\n", 403 | "\n", 404 | "# Left plot\n", 405 | "ax1.scatter(df2_plot.iloc[:,0], df2_plot.iloc[:,1], c=color_idx, cmap=cmap, alpha=0.5, s=50)\n", 406 | "ax1.set_ylabel('Principal Component 2')\n", 407 | "\n", 408 | "# Right plot\n", 409 | "ax2.scatter(df2_plot.iloc[:,0], df2_plot.iloc[:,2], c=color_idx, cmap=cmap, alpha=0.5, s=50)\n", 410 | "ax2.set_ylabel('Principal Component 3')\n", 411 | "\n", 412 | "# Custom legend for the classes (y) since we do not create scatter plots per class (which could have their own labels).\n", 413 | "handles = []\n", 414 | "labels = pd.factorize(y.type.unique())\n", 415 | "norm = mpl.colors.Normalize(vmin=0.0, vmax=14.0)\n", 416 | "\n", 417 | "for i, v in zip(labels[0], labels[1]):\n", 418 | " handles.append(mpl.patches.Patch(color=cmap(norm(i)), label=v, alpha=0.5))\n", 419 | "\n", 420 | "ax2.legend(handles=handles, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)\n", 421 | "\n", 422 | "# xlabel for both plots\n", 423 | "for ax in fig.axes:\n", 424 | " ax.set_xlabel('Principal Component 1') " 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the\n", 432 | "first few principal component score vectors. 
This indicates that cell lines\n", 433 | "from the same cancer type tend to have pretty similar gene expression\n", 434 | "levels.\n", 435 | "\n", 436 | "We can generate a summary of the proportion of variance explained (PVE)\n", 437 | "of the first few principal components:" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": { 444 | "collapsed": false 445 | }, 446 | "outputs": [], 447 | "source": [ 448 | "pd.DataFrame([df2_plot.iloc[:,:5].std(axis=0, ddof=0).as_matrix(),\n", 449 | " pca2.explained_variance_ratio_[:5],\n", 450 | " np.cumsum(pca2.explained_variance_ratio_[:5])],\n", 451 | " index=['Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion'],\n", 452 | " columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "Using the ${\\tt plot()}$ function, we can also plot the variance explained by the\n", 460 | "first few principal components:" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [], 470 | "source": [ 471 | "df2_plot.iloc[:,:10].var(axis=0, ddof=0).plot(kind='bar', rot=0)\n", 472 | "plt.ylabel('Variances')" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": {}, 478 | "source": [ 479 | "However, it is generally more informative to\n", 480 | "plot the PVE of each principal component (i.e. a **scree plot**) and the cumulative\n", 481 | "PVE of each principal component. This can be done with just a\n", 482 | "little tweaking:" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "collapsed": false 490 | }, 491 | "outputs": [], 492 | "source": [ 493 | "fig , (ax1,ax2) = plt.subplots(1,2, figsize=(15,5))\n", 494 | "\n", 495 | "# Left plot\n", 496 | "ax1.plot(pca2.explained_variance_ratio_, '-o')\n", 497 | "ax1.set_ylabel('Proportion of Variance Explained')\n", 498 | "ax1.set_ylim(ymin=-0.01)\n", 499 | "\n", 500 | "# Right plot\n", 501 | "ax2.plot(np.cumsum(pca2.explained_variance_ratio_), '-ro')\n", 502 | "ax2.set_ylabel('Cumulative Proportion of Variance Explained')\n", 503 | "ax2.set_ylim(ymax=1.05)\n", 504 | "\n", 505 | "for ax in fig.axes:\n", 506 | " ax.set_xlabel('Principal Component')\n", 507 | " ax.set_xlim(-1,65) " 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "We see that together, the first seven principal components\n", 515 | "explain around 40% of the variance in the data. This is not a huge amount\n", 516 | "of the variance. However, looking at the scree plot, we see that while each\n", 517 | "of the first seven principal components explain a substantial amount of\n", 518 | "variance, there is a marked decrease in the variance explained by further\n", 519 | "principal components. That is, there is an **elbow** in the plot after approximately\n", 520 | "the seventh principal component. This suggests that there may\n", 521 | "be little benefit to examining more than seven or so principal components\n", 522 | "(phew! even examining seven principal components may be difficult)." 
523 | ] 524 | } 525 | ], 526 | "metadata": { 527 | "kernelspec": { 528 | "display_name": "Python 3", 529 | "language": "python", 530 | "name": "python3" 531 | }, 532 | "language_info": { 533 | "codemirror_mode": { 534 | "name": "ipython", 535 | "version": 3 536 | }, 537 | "file_extension": ".py", 538 | "mimetype": "text/x-python", 539 | "name": "python", 540 | "nbconvert_exporter": "python", 541 | "pygments_lexer": "ipython3", 542 | "version": "3.5.1" 543 | } 544 | }, 545 | "nbformat": 4, 546 | "nbformat_minor": 0 547 | } 548 | -------------------------------------------------------------------------------- /Lab 3 - K-Nearest Neighbors in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on K-Nearest Neighbors is a python adaptation of p. 163-167 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Originally adapted by Jordi Warmenhoven (github.com/JWarmenhoven/ISLR-python), modified by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "# 4.6.5: K-Nearest Neighbors" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "In this lab, we will perform KNN on the ${\\tt Smarket}$ dataset from ${\\tt ISLR}$. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the\n", 34 | "beginning of 2001 until the end of 2005. For each date, we have recorded\n", 35 | "the percentage returns for each of the five previous trading days, ${\\tt Lag1}$\n", 36 | "through ${\\tt Lag5}$. We have also recorded ${\\tt Volume}$ (the number of shares traded on the previous day, in billions), ${\\tt Today}$ (the percentage return on the date\n", 37 | "in question) and ${\\tt Direction}$ (whether the market was ${\\tt Up}$ or ${\\tt Down}$ on this\n", 38 | "date). We can use the ${\\tt head(...)}$ function to look at the first few rows:" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": { 45 | "collapsed": false 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)\n", 50 | "df.head()" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "Today we're going to try to predict ${\\tt Direction}$ using percentage returns from the previous two days (${\\tt Lag1}$ and ${\\tt Lag2}$). We'll build our model using the ${\\tt KNeighborsClassifier()}$ function, which is part of the\n", 58 | "${\\tt neighbors}$ submodule of SciKitLearn (${\\tt sklearn}$). 
We'll also grab a couple of useful tools from the ${\\tt metrics}$ submodule:" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": { 65 | "collapsed": true 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "from sklearn import neighbors\n", 70 | "from sklearn.metrics import confusion_matrix, classification_report" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "This function works rather differently from the other model-fitting\n", 78 | "functions that we have encountered thus far. Rather than a two-step\n", 79 | "approach in which we first fit the model and then we use the model to make\n", 80 | "predictions, ${\\tt knn()}$ forms predictions using a single command. The function\n", 81 | "requires four inputs.\n", 82 | " 1. A matrix containing the predictors associated with the training data,\n", 83 | "labeled ${\\tt X\\_train}$ below.\n", 84 | " 2. A matrix containing the predictors associated with the data for which\n", 85 | "we wish to make predictions, labeled ${\\tt X\\_test}$ below.\n", 86 | " 3. A vector containing the class labels for the training observations,\n", 87 | "labeled ${\\tt y\\_train}$ below.\n", 88 | " 4. A value for $K$, the number of nearest neighbors to be used by the\n", 89 | "classifier.\n", 90 | "\n", 91 | "We'll first create a vector corresponding to the observations from 2001 through 2004, which we'll use to train the model. We will then use this vector to create a held out data set of observations from 2005 on which we will test. We'll also pull out our training and test labels." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "X_train = df[:'2004'][['Lag1','Lag2']]\n", 103 | "y_train = df[:'2004']['Direction']\n", 104 | "\n", 105 | "X_test = df['2005':][['Lag1','Lag2']]\n", 106 | "y_test = df['2005':]['Direction']" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Now the ${\\tt neighbors.KNeighborsClassifier()}$ function can be used to predict the market’s movement for\n", 114 | "the dates in 2005." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "knn = neighbors.KNeighborsClassifier(n_neighbors=1)\n", 126 | "pred = knn.fit(X_train, y_train).predict(X_test)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "The ${\\tt confusion\\_matrix()}$ function can be used to produce a **confusion matrix** in order to determine how many observations were correctly or incorrectly classified. The ${\\tt classification\\_report()}$ function gives us some summary statistics on the classifier's performance:" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "print(confusion_matrix(y_test, pred).T)\n", 145 | "print(classification_report(y_test, pred, digits=3))" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "The results using $K = 1$ are not very good, since only 50% of the observations\n", 153 | "are correctly predicted. Of course, it may be that $K = 1$ results in an\n", 154 | "overly flexible fit to the data. 
Below, we repeat the analysis using $K = 3$." 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": false 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "knn = neighbors.KNeighborsClassifier(n_neighbors=3)\n", 166 | "pred = knn.fit(X_train, y_train).predict(X_test)\n", 167 | "print(confusion_matrix(y_test, pred).T)\n", 168 | "print(classification_report(y_test, pred, digits=3))" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "The results have improved slightly. Try looping through a few other $K$ values to see if you can get any further improvement:" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": { 182 | "collapsed": false 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "for k_val in range(10):\n", 187 | " # Your code here" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "It looks like for classifying this dataset, ${KNN}$ might not be the right approach." 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "# 4.6.6: An Application to Caravan Insurance Data\n", 202 | "Let's see how the ${\\tt KNN}$ approach performs on the ${\\tt Caravan}$ data set, which is\n", 203 | "part of the ${\\tt ISLR}$ library. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is\n", 204 | "${\\tt Purchase}$, which indicates whether or not a given individual purchases a\n", 205 | "caravan insurance policy. In this data set, only 6% of people purchased\n", 206 | "caravan insurance." 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": { 213 | "collapsed": false 214 | }, 215 | "outputs": [], 216 | "source": [ 217 | "df2 = pd.read_csv('Caravan.csv')\n", 218 | "df2[\"Purchase\"].value_counts()" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "Because the ${\\tt KNN}$ classifier predicts the class of a given test observation by\n", 226 | "identifying the observations that are nearest to it, the scale of the variables\n", 227 | "matters. Any variables that are on a large scale will have a much larger\n", 228 | "effect on the distance between the observations, and hence on the ${\\tt KNN}$\n", 229 | "classifier, than variables that are on a small scale. \n", 230 | "\n", 231 | "For instance, imagine a\n", 232 | "data set that contains two variables, salary and age (measured in dollars\n", 233 | "and years, respectively). As far as ${\\tt KNN}$ is concerned, a difference of \\$1,000\n", 234 | "in salary is enormous compared to a difference of 50 years in age. Consequently,\n", 235 | "salary will drive the ${\\tt KNN}$ classification results, and age will have\n", 236 | "almost no effect. \n", 237 | "\n", 238 | "This is contrary to our intuition that a salary difference\n", 239 | "of \\$1,000 is quite small compared to an age difference of 50 years. 
Furthermore,\n", 240 | "the importance of scale to the ${\\tt KNN}$ classifier leads to another issue:\n", 241 | "if we measured salary in Japanese yen, or if we measured age in minutes,\n", 242 | "then we’d get quite different classification results from what we get if these\n", 243 | "two variables are measured in dollars and years.\n", 244 | "\n", 245 | "A good way to handle this problem is to **standardize** the data so that all\n", 246 | "variables are given a mean of zero and a standard deviation of one. Then\n", 247 | "all variables will be on a comparable scale. The ${\\tt scale()}$ function from the ${\\tt preprocessing}$ submodule of SciKitLearn does just\n", 248 | "this. In standardizing the data, we exclude column 86, because that is the\n", 249 | "qualitative ${\\tt Purchase}$ variable." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": { 256 | "collapsed": false 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "from sklearn import preprocessing\n", 261 | "y = df2.Purchase\n", 262 | "X = df2.drop('Purchase', axis=1).astype('float64')\n", 263 | "X_scaled = preprocessing.scale(X)\n", 264 | "print(np.std(X_scaled))" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "Now every column of ${\\tt X\\_scaled}$ has a standard deviation of one and\n", 272 | "a mean of zero.\n", 273 | "\n", 274 | "We'll now split the observations into a test set, containing the first 1,000\n", 275 | "observations, and a training set, containing the remaining observations." 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": { 282 | "collapsed": false 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "X_train = X_scaled[1000:,:]\n", 287 | "y_train = y[1000:]\n", 288 | "X_test = X_scaled[:1000,:]\n", 289 | "y_test = y[:1000]" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "Let's fit a ${\\tt KNN}$ model on the training data using $K = 1$, and evaluate its\n", 297 | "performance on the test data." 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "knn = neighbors.KNeighborsClassifier(n_neighbors=1)\n", 309 | "pred = knn.fit(X_train, y_train).predict(X_test)\n", 310 | "print(classification_report(y_test, pred, digits=3))" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "The KNN error rate on the 1,000 test observations is just under 12%. At first glance, this may appear to be fairly good. However, since only 6% of customers purchased insurance, we could get the error rate down to 6% by always predicting ${\\tt No}$ regardless of the values of the predictors!\n", 318 | "\n", 319 | "Suppose that there is some non-trivial cost to trying to sell insurance\n", 320 | "to a given individual. For instance, perhaps a salesperson must visit each\n", 321 | "potential customer. If the company tries to sell insurance to a random\n", 322 | "selection of customers, then the success rate will be only 6%, which may\n", 323 | "be far too low given the costs involved. \n", 324 | "\n", 325 | "Instead, the company would like\n", 326 | "to try to sell insurance only to customers who are likely to buy it. So the\n", 327 | "overall error rate is not of interest. 
Instead, the fraction of individuals that\n", 328 | "are correctly predicted to buy insurance is of interest.\n", 329 | "\n", 330 | "It turns out that ${\\tt KNN}$ with $K = 1$ does far better than random guessing\n", 331 | "among the customers that are predicted to buy insurance:" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [], 341 | "source": [ 342 | "print(confusion_matrix(y_test, pred).T)" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "Among 77 such\n", 350 | "customers, 9, or 11.7%, actually do purchase insurance. This is double the\n", 351 | "rate that one would obtain from random guessing. Let's see if increasing $K$ helps! Try out a few different $K$ values below. Feeling adventurous? Write a function that figures out the best value for $K$." 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": { 358 | "collapsed": false 359 | }, 360 | "outputs": [], 361 | "source": [ 362 | "# Your code here" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "It appears that ${\\tt KNN}$ is finding some real patterns in a difficult data set! To get credit for this lab, post a response to the Piazza prompt available at: https://piazza.com/class/igwiv4w3ctb6rg?cid=10" 370 | ] 371 | } 372 | ], 373 | "metadata": { 374 | "kernelspec": { 375 | "display_name": "Python 2", 376 | "language": "python", 377 | "name": "python2" 378 | }, 379 | "language_info": { 380 | "codemirror_mode": { 381 | "name": "ipython", 382 | "version": 2 383 | }, 384 | "file_extension": ".py", 385 | "mimetype": "text/x-python", 386 | "name": "python", 387 | "nbconvert_exporter": "python", 388 | "pygments_lexer": "ipython2", 389 | "version": "2.7.11" 390 | } 391 | }, 392 | "nbformat": 4, 393 | "nbformat_minor": 0 394 | } 395 | -------------------------------------------------------------------------------- /Lab 4 - Logistic Regression in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Logistic Regression is a Python adaptation from p. 154-161 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np\n", 20 | "import statsmodels.api as sm" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "# 4.6.2 Logistic Regression\n", 28 | "\n", 29 | "Let's return to the ${\\tt Smarket}$ data from ${\\tt ISLR}$. " 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/html": [ 42 | "
\n", 43 | "\n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | "
Lag1Lag2Lag3Lag4Lag5VolumeToday
count1250.0000001250.0000001250.0000001250.0000001250.000001250.0000001250.000000
mean0.0038340.0039190.0017160.0016360.005611.4783050.003138
std1.1362991.1362801.1387031.1387741.147550.3603571.136334
min-4.922000-4.922000-4.922000-4.922000-4.922000.356070-4.922000
25%-0.639500-0.639500-0.640000-0.640000-0.640001.257400-0.639500
50%0.0390000.0390000.0385000.0385000.038501.4229500.038500
75%0.5967500.5967500.5967500.5967500.597001.6416750.596750
max5.7330005.7330005.7330005.7330005.733003.1524705.733000
\n", 139 | "
" 140 | ], 141 | "text/plain": [ 142 | " Lag1 Lag2 Lag3 Lag4 Lag5 \\\n", 143 | "count 1250.000000 1250.000000 1250.000000 1250.000000 1250.00000 \n", 144 | "mean 0.003834 0.003919 0.001716 0.001636 0.00561 \n", 145 | "std 1.136299 1.136280 1.138703 1.138774 1.14755 \n", 146 | "min -4.922000 -4.922000 -4.922000 -4.922000 -4.92200 \n", 147 | "25% -0.639500 -0.639500 -0.640000 -0.640000 -0.64000 \n", 148 | "50% 0.039000 0.039000 0.038500 0.038500 0.03850 \n", 149 | "75% 0.596750 0.596750 0.596750 0.596750 0.59700 \n", 150 | "max 5.733000 5.733000 5.733000 5.733000 5.73300 \n", 151 | "\n", 152 | " Volume Today \n", 153 | "count 1250.000000 1250.000000 \n", 154 | "mean 1.478305 0.003138 \n", 155 | "std 0.360357 1.136334 \n", 156 | "min 0.356070 -4.922000 \n", 157 | "25% 1.257400 -0.639500 \n", 158 | "50% 1.422950 0.038500 \n", 159 | "75% 1.641675 0.596750 \n", 160 | "max 3.152470 5.733000 " 161 | ] 162 | }, 163 | "execution_count": 2, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)\n", 170 | "df.describe()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "In this lab, we will fit a logistic regression model in order to predict ${\\tt Direction}$ using ${\\tt Lag1}$ through ${\\tt Lag5}$ and ${\\tt Volume}$. We'll build our model using the ${\\tt glm()}$ function, which is part of the\n", 178 | "${\\tt formula}$ submodule of (${\\tt statsmodels}$)." 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 3, 184 | "metadata": { 185 | "collapsed": true 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "import statsmodels.formula.api as smf" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "We can use an ${\\tt R}$-like formula string to separate the predictors from the response." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 4, 202 | "metadata": { 203 | "collapsed": false 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "formula = 'Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume'" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "The ${\\tt glm()}$ function fits **generalized linear models**, a class of models that includes logistic regression. The syntax of the ${\\tt glm()}$ function is similar to that of ${\\tt lm()}$, except that we must pass in the argument ${\\tt family=sm.families.Binomial()}$ in order to tell ${\\tt R}$ to run a logistic regression rather than some other type of generalized linear model." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 5, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | " Generalized Linear Model Regression Results \n", 229 | "================================================================================================\n", 230 | "Dep. Variable: ['Direction[Down]', 'Direction[Up]'] No. Observations: 1250\n", 231 | "Model: GLM Df Residuals: 1243\n", 232 | "Model Family: Binomial Df Model: 6\n", 233 | "Link Function: logit Scale: 1.0\n", 234 | "Method: IRLS Log-Likelihood: -863.79\n", 235 | "Date: Wed, 10 Feb 2016 Deviance: 1727.6\n", 236 | "Time: 14:00:03 Pearson chi2: 1.25e+03\n", 237 | "No. 
Iterations: 6                                         \n", 238 | "==============================================================================\n", 239 | " coef std err z P>|z| [95.0% Conf. Int.]\n", 240 | "------------------------------------------------------------------------------\n", 241 | "Intercept 0.1260 0.241 0.523 0.601 -0.346 0.598\n", 242 | "Lag1 0.0731 0.050 1.457 0.145 -0.025 0.171\n", 243 | "Lag2 0.0423 0.050 0.845 0.398 -0.056 0.140\n", 244 | "Lag3 -0.0111 0.050 -0.222 0.824 -0.109 0.087\n", 245 | "Lag4 -0.0094 0.050 -0.187 0.851 -0.107 0.089\n", 246 | "Lag5 -0.0103 0.050 -0.208 0.835 -0.107 0.087\n", 247 | "Volume -0.1354 0.158 -0.855 0.392 -0.446 0.175\n", 248 | "==============================================================================\n" 249 | ] 250 | } 251 | ], 252 | "source": [ 253 | "model = smf.glm(formula=formula, data=df, family=sm.families.Binomial())\n", 254 | "result = model.fit()\n", 255 | "print(result.summary())" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "The smallest p-value here is associated with ${\\tt Lag1}$. The negative coefficient\n", 263 | "for this predictor suggests that if the market had a positive return yesterday,\n", 264 | "then it is less likely to go up today. However, at a value of 0.145, the p-value\n", 265 | "is still relatively large, and so there is no clear evidence of a real association\n", 266 | "between ${\\tt Lag1}$ and ${\\tt Direction}$.\n", 267 | "\n", 268 | "We use the ${\\tt .params}$ attribute in order to access just the coefficients for this\n", 269 | "fitted model. Similarly, we can use ${\\tt .pvalues}$ to get the p-values for the coefficients, and ${\\tt .model.endog_names}$ to get the **endogenous** (or dependent) variables." 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": { 276 | "collapsed": false 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "print(\"Coefficients\")\n", 281 | "print(result.params)\n", 282 | "print\n", 283 | "print(\"p-Values\")\n", 284 | "print(result.pvalues)\n", 285 | "print\n", 286 | "print(\"Dependent variables\")\n", 287 | "print(result.model.endog_names)" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "Note that the dependent variable has been converted from nominal into two dummy variables: ${\\tt ['Direction[Down]', 'Direction[Up]']}$.\n", 295 | "\n", 296 | "The ${\\tt predict()}$ function can be used to predict the probability that the\n", 297 | "market will go down, given values of the predictors. If no data set is supplied to the\n", 298 | "${\\tt predict()}$ function, then the probabilities are computed for the training\n", 299 | "data that was used to fit the logistic regression model. " 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": { 306 | "collapsed": false 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "predictions = result.predict()\n", 311 | "print(predictions[0:10])" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "Here we have printed only the first ten probabilities. Note: these values correspond to the probability of the market going down, rather than up. If we print the model's encoding of the response values alongside the original nominal response, we see that Python has created a dummy variable with\n", 319 | "a 1 for ${\\tt Down}$."
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": { 326 | "collapsed": false 327 | }, 328 | "outputs": [], 329 | "source": [ 330 | "print np.column_stack((df.as_matrix(columns=[\"Direction\"]).flatten(), result.model.endog))" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "In order to make a prediction as to whether the market will go up or\n", 338 | "down on a particular day, we must convert these predicted probabilities\n", 339 | "into class labels, ${\\tt Up}$ or ${\\tt Down}$. The following two commands create a vector\n", 340 | "of class predictions based on whether the predicted probability of a market\n", 341 | "increase is greater than or less than 0.5." 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": { 348 | "collapsed": false 349 | }, 350 | "outputs": [], 351 | "source": [ 352 | "predictions_nominal = [ \"Up\" if x < 0.5 else \"Down\" for x in predictions]" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "This transforms to ${\\tt Up}$ all of the elements for which the predicted probability of a\n", 360 | "market increase exceeds 0.5 (i.e. probability of a decrease is below 0.5). Given these predictions, the ${\\tt confusion\\_matrix()}$ function can be used to produce a confusion matrix in order to determine how many\n", 361 | "observations were correctly or incorrectly classified." 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": { 368 | "collapsed": false 369 | }, 370 | "outputs": [], 371 | "source": [ 372 | "from sklearn.metrics import confusion_matrix, classification_report\n", 373 | "print confusion_matrix(df[\"Direction\"], predictions_nominal)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "The diagonal elements of the confusion matrix indicate correct predictions,\n", 381 | "while the off-diagonals represent incorrect predictions. Hence our model\n", 382 | "correctly predicted that the market would go up on 507 days and that\n", 383 | "it would go down on 145 days, for a total of 507 + 145 = 652 correct\n", 384 | "predictions. The ${\\tt mean()}$ function can be used to compute the fraction of\n", 385 | "days for which the prediction was correct. In this case, logistic regression\n", 386 | "correctly predicted the movement of the market 52.2% of the time. this is confirmed by checking the output of the ${\\tt classification\\_report()}$ function." 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": { 393 | "collapsed": false 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "print classification_report(df[\"Direction\"], predictions_nominal, digits=3)" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "At first glance, it appears that the logistic regression model is working\n", 405 | "a little better than random guessing. But remember, this result is misleading\n", 406 | "because we trained and tested the model on the same set of 1,250 observations.\n", 407 | "In other words, 100− 52.2 = 47.8% is the **training error rate**. As we\n", 408 | "have seen previously, the training error rate is often overly optimistic — it\n", 409 | "tends to underestimate the _test_ error rate. 
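As a rough check (this cell is not part of the original lab, and it assumes the ${\\tt df}$ and ${\\tt predictions\\_nominal}$ objects defined above), the training accuracy can be computed directly with ${\\tt np.mean()}$:

```python
# A minimal sketch, assuming df and predictions_nominal from the cells above:
# the training accuracy is the fraction of days whose label was predicted correctly.
train_acc = np.mean(df["Direction"] == predictions_nominal)
print(train_acc)      # roughly 0.522, matching the figure quoted above
print(1 - train_acc)  # the corresponding training error rate of about 47.8%
```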
\n", 410 | "\n", 411 | "In order to better assess the accuracy\n", 412 | "of the logistic regression model in this setting, we can fit the model\n", 413 | "using part of the data, and then examine how well it predicts the held out\n", 414 | "data. This will yield a more realistic error rate, in the sense that in practice\n", 415 | "we will be interested in our model’s performance not on the data that\n", 416 | "we used to fit the model, but rather on days in the future for which the\n", 417 | "market’s movements are unknown.\n", 418 | "\n", 419 | "Like we did with KNN, we will first create a vector corresponding\n", 420 | "to the observations from 2001 through 2004. We will then use this vector\n", 421 | "to create a held out data set of observations from 2005." 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": null, 427 | "metadata": { 428 | "collapsed": false 429 | }, 430 | "outputs": [], 431 | "source": [ 432 | "x_train = df[:'2004'][:]\n", 433 | "y_train = df[:'2004']['Direction']\n", 434 | "\n", 435 | "x_test = df['2005':][:]\n", 436 | "y_test = df['2005':]['Direction']" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "We now fit a logistic regression model using only the subset of the observations\n", 444 | "that correspond to dates before 2005, using the subset argument.\n", 445 | "We then obtain predicted probabilities of the stock market going up for\n", 446 | "each of the days in our test set—that is, for the days in 2005." 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": null, 452 | "metadata": { 453 | "collapsed": false 454 | }, 455 | "outputs": [], 456 | "source": [ 457 | "model = smf.glm(formula=formula, data=x_train, family=sm.families.Binomial())\n", 458 | "result = model.fit()" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "Notice that we have trained and tested our model on two completely separate\n", 466 | "data sets: training was performed using only the dates before 2005,\n", 467 | "and testing was performed using only the dates in 2005. Finally, we compute\n", 468 | "the predictions for 2005 and compare them to the actual movements\n", 469 | "of the market over that time period." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "metadata": { 476 | "collapsed": false 477 | }, 478 | "outputs": [], 479 | "source": [ 480 | "predictions = result.predict(x_test)\n", 481 | "predictions_nominal = [ \"Up\" if x < 0.5 else \"Down\" for x in predictions]\n", 482 | "print classification_report(y_test, predictions_nominal, digits=3)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "The results are rather disappointing: the test error\n", 490 | "rate (1 - ${\\tt recall}$) is 52%, which is worse than random guessing! 
Of course this result\n", 491 | "is not all that surprising, given that one would not generally expect to be\n", 492 | "able to use previous days’ returns to predict future market performance.\n", 493 | "(After all, if it were possible to do so, then the authors of this book [along with your professor] would probably\n", 494 | "be out striking it rich rather than teaching statistics.)\n", 495 | "\n", 496 | "We recall that the logistic regression model had very underwhelming pvalues\n", 497 | "associated with all of the predictors, and that the smallest p-value,\n", 498 | "though not very small, corresponded to ${\\tt Lag1}$. Perhaps by removing the\n", 499 | "variables that appear not to be helpful in predicting ${\\tt Direction}$, we can\n", 500 | "obtain a more effective model. After all, using predictors that have no\n", 501 | "relationship with the response tends to cause a deterioration in the test\n", 502 | "error rate (since such predictors cause an increase in variance without a\n", 503 | "corresponding decrease in bias), and so removing such predictors may in\n", 504 | "turn yield an improvement. \n", 505 | "\n", 506 | "In the space below, refit a logistic regression using just ${\\tt Lag1}$ and ${\\tt Lag2}$, which seemed to have the highest predictive power in the original logistic regression model." 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": null, 512 | "metadata": { 513 | "collapsed": false 514 | }, 515 | "outputs": [], 516 | "source": [ 517 | "model = # Write your code to fit the new model here\n", 518 | "\n", 519 | "# This will test your new model\n", 520 | "result = model.fit()\n", 521 | "predictions = result.predict(x_test)\n", 522 | "predictions_nominal = [ \"Up\" if x < 0.5 else \"Down\" for x in predictions]\n", 523 | "print classification_report(y_test, predictions_nominal, digits=3)" 524 | ] 525 | }, 526 | { 527 | "cell_type": "markdown", 528 | "metadata": {}, 529 | "source": [ 530 | "Now the results appear to be more promising: 56% of the daily movements\n", 531 | "have been correctly predicted. The confusion matrix suggests that on days\n", 532 | "when logistic regression predicts that the market will decline, it is only\n", 533 | "correct 50% of the time. However, on days when it predicts an increase in\n", 534 | "the market, it has a 58% accuracy rate.\n", 535 | "\n", 536 | "Finally, suppose that we want to predict the returns associated with **particular\n", 537 | "values** of ${\\tt Lag1}$ and ${\\tt Lag2}$. In particular, we want to predict Direction on a\n", 538 | "day when ${\\tt Lag1}$ and ${\\tt Lag2}$ equal 1.2 and 1.1, respectively, and on a day when\n", 539 | "they equal 1.5 and −0.8. We can do this by passing a new data frame containing our test values to the ${\\tt predict()}$ function." 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": { 546 | "collapsed": false 547 | }, 548 | "outputs": [], 549 | "source": [ 550 | "print result.predict(pd.DataFrame([[1.2,1.1],[1.5,-0.8]], columns = [\"Lag1\",\"Lag2\"]))" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "To get credit for this lab, play around with a few other values for ${\\tt Lag1}$ and ${\\tt Lag2}$, and then post to Piazza about what you found. If you're feeling adventurous, try fitting models with other subsets of variables to see if you can find a letter one!" 
558 | ] 559 | } 560 | ], 561 | "metadata": { 562 | "kernelspec": { 563 | "display_name": "Python 2", 564 | "language": "python", 565 | "name": "python2" 566 | }, 567 | "language_info": { 568 | "codemirror_mode": { 569 | "name": "ipython", 570 | "version": 2 571 | }, 572 | "file_extension": ".py", 573 | "mimetype": "text/x-python", 574 | "name": "python", 575 | "nbconvert_exporter": "python", 576 | "pygments_lexer": "ipython2", 577 | "version": "2.7.11" 578 | } 579 | }, 580 | "nbformat": 4, 581 | "nbformat_minor": 0 582 | } 583 | -------------------------------------------------------------------------------- /Lab 5 - LDA and QDA in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Logistic Regression is a Python adaptation of p. 161-163 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np\n", 20 | "\n", 21 | "from sklearn.lda import LDA\n", 22 | "from sklearn.qda import QDA\n", 23 | "from sklearn.metrics import confusion_matrix, classification_report, precision_score\n", 24 | "\n", 25 | "%matplotlib inline" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# 4.6.3 Linear Discriminant Analysis" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "Let's return to the ${\\tt Smarket}$ data from ${\\tt ISLR}$. " 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "collapsed": false 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)\n", 51 | "df.head()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "Now we will perform LDA on the ${\\tt Smarket}$ data from the ${\\tt ISLR}$ package. In ${\\tt Python}$, we can fit a LDA model using the ${\\tt LDA()}$ function, which is part of the ${\\tt lda}$ module of the ${\\tt sklearn}$ library. As we did with logistic regression and KNN, we'll fit the model using only the observations before 2005, and then test the model on the data from 2005." 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": { 65 | "collapsed": false 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "X_train = df[:'2004'][['Lag1','Lag2']]\n", 70 | "y_train = df[:'2004']['Direction']\n", 71 | "\n", 72 | "X_test = df['2005':][['Lag1','Lag2']]\n", 73 | "y_test = df['2005':]['Direction']\n", 74 | "\n", 75 | "lda = LDA()\n", 76 | "model = lda.fit(X_train, y_train)\n", 77 | "\n", 78 | "print(model.priors_)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "The LDA output indicates prior probabilities of ${\\hat{\\pi}}_1 = 0.492$ and ${\\hat{\\pi}}_2 = 0.508$; in other words,\n", 86 | "49.2% of the training observations correspond to days during which the\n", 87 | "market went down." 
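As a quick sanity check (an addition, not in the original lab, assuming the ${\\tt y\\_train}$ series defined above), these priors are simply the class proportions of the training set:

```python
# Sketch: LDA's estimated priors are just the training-set class frequencies.
print(y_train.value_counts(normalize=True))
```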
88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "print(model.means_)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "The above provides the group means; these are the average\n", 106 | "of each predictor within each class, and are used by LDA as estimates\n", 107 | "of $\\mu_k$. These suggest that there is a tendency for the previous 2 days’\n", 108 | "returns to be negative on days when the market increases, and a tendency\n", 109 | "for the previous days’ returns to be positive on days when the market\n", 110 | "declines. " 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [], 120 | "source": [ 121 | "print(model.coef_)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "The coefficients of linear discriminants output provides the linear\n", 129 | "combination of ${\\tt Lag1}$ and ${\\tt Lag2}$ that are used to form the LDA decision rule.\n", 130 | "\n", 131 | "If $−0.0554\\times{\\tt Lag1}−0.0443\\times{\\tt Lag2}$ is large, then the LDA classifier will\n", 132 | "predict a market increase, and if it is small, then the LDA classifier will\n", 133 | "predict a market decline. **Note**: these coefficients differ from those produced by ${\\tt R}$." 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "The ${\\tt predict()}$ function returns a list of LDA’s predictions about the movement of the market on the test data:" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "pred=model.predict(X_test)\n", 152 | "print(np.unique(pred, return_counts=True))" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "The model assigned 70 observations to the \"Down\" class, and 182 observations to the \"Up\" class. Let's check out the confusion matrix to see how this model is doing. We'll want to compare the **predicted class** (which we can find in ${\\tt pred}$) to the **true class** (found in ${\\tt y\\_test})$." 
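Before looking at the full report, a single summary number can be computed directly (an addition, assuming ${\\tt pred}$ and ${\\tt y\\_test}$ from above):

```python
# Sketch: overall test accuracy of the LDA predictions on the 2005 data.
print(np.mean(pred == y_test.values))
```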
160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "print(confusion_matrix(pred, y_test))\n", 171 | "print(classification_report(y_test, pred, digits=3))" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "We can also get the predicted _probabilities_ using the ${\\tt predict\\_proba()}$ function:" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "pred_p = model.predict_proba(X_test)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "Applying a 50% threshold to the posterior probabilities allows us to recreate\n", 197 | "the predictions:" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": false 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "print(np.unique(pred_p[:,1]>0.5, return_counts=True))" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "Notice that the posterior probability output by the model corresponds to\n", 216 | "the probability that the market will **increase**:" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "collapsed": false 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "print np.stack((pred_p[10:20,1], pred[10:20])).T" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "If we wanted to use a posterior probability threshold other than 50% in\n", 235 | "order to make predictions, then we could easily do so. For instance, suppose\n", 236 | "that we wish to predict a market decrease only if we are very certain that the\n", 237 | "market will indeed decrease on that day—say, if the posterior probability\n", 238 | "is at least 90%:" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "print(np.unique(pred_p[:,1]>0.9, return_counts=True))" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "No days in 2005 meet that threshold! In fact, the greatest posterior probability\n", 257 | "of decrease in all of 2005 was 54.2%:" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": { 264 | "collapsed": false 265 | }, 266 | "outputs": [], 267 | "source": [ 268 | "max(pred_p[:,1])" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "# 4.6.4 Quadratic Discriminant Analysis\n", 276 | "We will now fit a QDA model to the ${\\tt Smarket}$ data. QDA is implemented\n", 277 | "in ${\\tt sklearn}$ using the ${\\tt QDA()}$ function, which is part of the ${\\tt qda}$ module. The\n", 278 | "syntax is identical to that of ${\\tt LDA()}$." 
279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "collapsed": false 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "qda = QDA()\n", 290 | "model2 = qda.fit(X_train, y_train)\n", 291 | "print model2.priors_\n", 292 | "print model2.means_" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "The output contains the group means. But it does not contain the coefficients\n", 300 | "of the linear discriminants, because the QDA classifier involves a\n", 301 | "_quadratic_, rather than a linear, function of the predictors. The ${\\tt predict()}$\n", 302 | "function works in exactly the same fashion as for LDA." 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "collapsed": false 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "pred2=model2.predict(X_test)\n", 314 | "print(np.unique(pred2, return_counts=True))\n", 315 | "print(confusion_matrix(pred2, y_test))\n", 316 | "print(classification_report(y_test, pred2, digits=3))" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "Interestingly, the QDA predictions are accurate almost 60% of the time,\n", 324 | "even though the 2005 data was not used to fit the model. This level of accuracy\n", 325 | "is quite impressive for stock market data, which is known to be quite\n", 326 | "hard to model accurately. \n", 327 | "\n", 328 | "This suggests that the quadratic form assumed\n", 329 | "by QDA may capture the true relationship more accurately than the linear\n", 330 | "forms assumed by LDA and logistic regression. However, we recommend\n", 331 | "evaluating this method’s performance on a larger test set before betting\n", 332 | "that this approach will consistently beat the market!" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "# An Application to Carseats Data\n", 340 | "Let's see how the ${\\tt LDA/QDA}$ approach performs on the ${\\tt Carseats}$ data set, which is\n", 341 | "included with ${\\tt ISLR}$. \n", 342 | "\n", 343 | "Recall: this is a simulated data set containing sales of child car seats at 400 different stores." 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": { 350 | "collapsed": false 351 | }, 352 | "outputs": [], 353 | "source": [ 354 | "df2 = pd.read_csv('Carseats.csv')\n", 355 | "df2.head()" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "See if you can build a model that predicts ${\\tt ShelveLoc}$, the shelf location (Bad, Good, or Medium) of the product at each store. Don't forget to hold out some of the data for testing!" 
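One possible starting point is sketched below; it is not part of the lab, and the predictor list (assuming the usual ISLR ${\\tt Carseats}$ columns) and the 300/100 train/test split are arbitrary choices.

```python
# Sketch only: an arbitrary feature set and train/test split for the Carseats exercise.
predictors = ['Sales', 'CompPrice', 'Income', 'Advertising', 'Population',
              'Price', 'Age', 'Education']
X2_train, y2_train = df2[predictors][:300], df2['ShelveLoc'][:300]
X2_test, y2_test = df2[predictors][300:], df2['ShelveLoc'][300:]

lda2 = LDA().fit(X2_train, y2_train)
print(classification_report(y2_test, lda2.predict(X2_test), digits=3))
```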
363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": { 369 | "collapsed": false 370 | }, 371 | "outputs": [], 372 | "source": [ 373 | "# Your code here" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "To get credit for this lab, please post your answers to the following questions:\n", 381 | "\n", 382 | "- What was your approach to building the model?\n", 383 | "- How did your model perform?\n", 384 | "- Was anything easier or more challenging than you anticipated?\n", 385 | "\n", 386 | "to Piazza: https://piazza.com/class/igwiv4w3ctb6rg?cid=23" 387 | ] 388 | } 389 | ], 390 | "metadata": { 391 | "kernelspec": { 392 | "display_name": "Python 2", 393 | "language": "python", 394 | "name": "python2" 395 | }, 396 | "language_info": { 397 | "codemirror_mode": { 398 | "name": "ipython", 399 | "version": 2 400 | }, 401 | "file_extension": ".py", 402 | "mimetype": "text/x-python", 403 | "name": "python", 404 | "nbconvert_exporter": "python", 405 | "pygments_lexer": "ipython2", 406 | "version": "2.7.11" 407 | } 408 | }, 409 | "nbformat": 4, 410 | "nbformat_minor": 0 411 | } 412 | -------------------------------------------------------------------------------- /Lab 7 - Cross-Validation in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Cross-Validation is a python adaptation of p. 190-194 of \"Introduction to Statistical Learning\n", 8 | "with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Written\n", 9 | "by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).\n", 10 | "\n", 11 | "# 5.3.1 The Validation Set Approach" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 80, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "import statsmodels.api as sm" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "In this section, we'll explore the use of the validation set approach in order to estimate the\n", 32 | "test error rates that result from fitting various linear models on the ${\\tt Auto}$ data set." 
33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 81, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [ 42 | { 43 | "name": "stdout", 44 | "output_type": "stream", 45 | "text": [ 46 | "\n", 47 | "Int64Index: 392 entries, 0 to 396\n", 48 | "Data columns (total 9 columns):\n", 49 | "mpg 392 non-null float64\n", 50 | "cylinders 392 non-null int64\n", 51 | "displacement 392 non-null float64\n", 52 | "horsepower 392 non-null float64\n", 53 | "weight 392 non-null int64\n", 54 | "acceleration 392 non-null float64\n", 55 | "year 392 non-null int64\n", 56 | "origin 392 non-null int64\n", 57 | "name 392 non-null object\n", 58 | "dtypes: float64(4), int64(4), object(1)\n", 59 | "memory usage: 30.6+ KB\n" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "df1 = pd.read_csv('Auto.csv', na_values='?').dropna()\n", 65 | "df1.info()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "We begin by using the ${\\tt sample()}$ function to split the set of observations\n", 73 | "into two halves, by selecting a random subset of 196 observations out of\n", 74 | "the original 392 observations. We refer to these observations as the training\n", 75 | "set.\n", 76 | "\n", 77 | "We'll use the ${\\tt random\\_state}$ parameter in order to set a seed for\n", 78 | "${\\tt python}$’s random number generator, so that you'll obtain precisely the same results as those shown below. It is generally a good idea to set a random seed when performing an analysis such as cross-validation\n", 79 | "that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 84, 85 | "metadata": { 86 | "collapsed": true 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "train = df1.sample(196, random_state = 1)\n", 91 | "test = df1[~df1.isin(train)].dropna(how = 'all')" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "We then use the ${\\tt sm.OLS.from\\_formula()}$ to fit a linear regression to predict ${\\tt mpg}$ from ${\\tt horsepower}$ using only\n", 99 | "the observations corresponding to the training set." 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 85, 105 | "metadata": { 106 | "collapsed": true 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "lm = sm.OLS.from_formula('mpg~horsepower', train)\n", 111 | "result = lm.fit()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "We now use the ${\\tt predict()}$ function to estimate the response for the test\n", 119 | "observations, and we use some ${\\tt numpy}$ functions to caclulate the MSE." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 86, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "23.361902892587235\n" 134 | ] 135 | } 136 | ], 137 | "source": [ 138 | "pred = result.predict(test)\n", 139 | "\n", 140 | "MSE = np.mean(np.square(np.subtract(test[\"mpg\"], pred)))\n", 141 | " \n", 142 | "print(MSE)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Therefore, the estimated test MSE for the linear regression fit is 23.36. We\n", 150 | "can use the ${\\tt np.power()}$ function to estimate the test error for the polynomial\n", 151 | "and cubic regressions." 
152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 88, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [ 161 | { 162 | "name": "stdout", 163 | "output_type": "stream", 164 | "text": [ 165 | "20.252690858350192\n", 166 | "20.325609366115582\n" 167 | ] 168 | } 169 | ], 170 | "source": [ 171 | "lm2 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2]]), train)\n", 172 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm2.fit().predict(test)))))\n", 173 | "\n", 174 | "lm3 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2,3]]), train)\n", 175 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm3.fit().predict(test)))))" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "These error rates are 20.25 and 20.33, respectively. If we choose a different\n", 183 | "training set instead, then we will obtain somewhat different errors on the\n", 184 | "validation set. We can test this out by setting a different random seed:" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 89, 190 | "metadata": { 191 | "collapsed": false 192 | }, 193 | "outputs": [ 194 | { 195 | "name": "stdout", 196 | "output_type": "stream", 197 | "text": [ 198 | "23.214272449679587\n", 199 | "19.525710963117103\n", 200 | "19.667628097077426\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "train = df1.sample(196, random_state = 2)\n", 206 | "\n", 207 | "lm = sm.OLS.from_formula('mpg~horsepower', train)\n", 208 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm.fit().predict(test)))))\n", 209 | "\n", 210 | "lm2 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2]]), train)\n", 211 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm2.fit().predict(test)))))\n", 212 | "\n", 213 | "lm3 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2,3]]), train)\n", 214 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm3.fit().predict(test)))))" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "Using this split of the observations into a training set and a validation\n", 222 | "set, we find that the validation set error rates for the models with linear,\n", 223 | "quadratic, and cubic terms are 23.21, 19.53, and 19.67, respectively.\n", 224 | "\n", 225 | "These results are consistent with our previous findings: a model that\n", 226 | "predicts ${\\tt mpg}$ using a quadratic function of ${\\tt horsepower}$ performs better than\n", 227 | "a model that involves only a linear function of ${\\tt horsepower}$, and there is\n", 228 | "little evidence in favor of a model that uses a cubic function of ${\\tt horsepower}$." 
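Note that the final cell of this notebook (shown below) raises an ${\\tt IndexError}$: indexing a DataFrame with an integer array, as in ${\\tt df1[train\\_index]}$, selects columns rather than rows. A working leave-one-out loop might instead look like the following sketch, which is an addition rather than part of the original lab and assumes the pre-0.18 ${\\tt sklearn.cross\\_validation}$ API used here.

```python
# Sketch, assuming df1, sm and np from above: LOOCV for the linear fit,
# selecting rows positionally with .iloc so df1's non-consecutive index is fine.
from sklearn.cross_validation import LeaveOneOut

loo = LeaveOneOut(len(df1))
errors = []
for train_index, test_index in loo:
    train_cv = df1.iloc[train_index]
    test_cv = df1.iloc[test_index]
    fit = sm.OLS.from_formula('mpg~horsepower', train_cv).fit()
    errors.append(np.mean(np.square(test_cv["mpg"].values - fit.predict(test_cv))))

print(np.mean(errors))  # should land near 24.23, the LOOCV MSE reported in ISLR
```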
229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 108, 234 | "metadata": { 235 | "collapsed": false 236 | }, 237 | "outputs": [ 238 | { 239 | "ename": "IndexError", 240 | "evalue": "indices are out-of-bounds", 241 | "output_type": "error", 242 | "traceback": [ 243 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 244 | "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", 245 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mloo\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mLeaveOneOut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtrain_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_index\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mloo\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mdf1\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 246 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1961\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mSeries\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mIndex\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1962\u001b[0m \u001b[0;31m# either boolean or fancy integer index\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1963\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1964\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mDataFrame\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1965\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_frame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 247 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m_getitem_array\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2006\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2007\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_convert_to_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2008\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtake\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconvert\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2009\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2010\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 248 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36mtake\u001b[0;34m(self, indices, axis, convert, is_copy)\u001b[0m\n\u001b[1;32m 1369\u001b[0m new_data = self._data.take(indices,\n\u001b[1;32m 1370\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_block_manager_axis\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1371\u001b[0;31m convert=True, verify=True)\n\u001b[0m\u001b[1;32m 1372\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_constructor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnew_data\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__finalize__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1373\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 249 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/internals.py\u001b[0m in \u001b[0;36mtake\u001b[0;34m(self, indexer, axis, verify, convert)\u001b[0m\n\u001b[1;32m 3617\u001b[0m \u001b[0mn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3618\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mconvert\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3619\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmaybe_convert_indices\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3620\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3621\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mverify\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 250 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/indexing.py\u001b[0m in \u001b[0;36mmaybe_convert_indices\u001b[0;34m(indices, n)\u001b[0m\n\u001b[1;32m 1748\u001b[0m \u001b[0mmask\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mindices\u001b[0m \u001b[0;34m>=\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mindices\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1749\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1750\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mIndexError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"indices are 
out-of-bounds\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1751\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mindices\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1752\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 251 | "\u001b[0;31mIndexError\u001b[0m: indices are out-of-bounds" 252 | ] 253 | } 254 | ], 255 | "source": [ 256 | "from sklearn.cross_validation import LeaveOneOut\n", 257 | "\n", 258 | "loo = LeaveOneOut(10)\n", 259 | "for train_index, test_index in loo:\n", 260 | " df1[train_index]" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": true 268 | }, 269 | "outputs": [], 270 | "source": [] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python 3", 276 | "language": "python", 277 | "name": "python3" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.5.1" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 0 294 | } 295 | -------------------------------------------------------------------------------- /Lab 8 - Subset Selection in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Subset Selection is a Python adaptation of p. 244-247 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline\n", 19 | "import pandas as pd\n", 20 | "import numpy as np\n", 21 | "import itertools\n", 22 | "import time\n", 23 | "import statsmodels.api as sm\n", 24 | "import matplotlib.pyplot as plt" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "# 6.5.1 Best Subset Selection\n", 32 | "\n", 33 | "Here we apply the best subset selection approach to the Hitters data. We\n", 34 | "wish to predict a baseball player’s Salary on the basis of various statistics\n", 35 | "associated with performance in the previous year. Let's take a quick look:" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "df = pd.read_csv('Hitters.csv')\n", 47 | "df.head()" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "First of all, we note that the ${\\tt Salary}$ variable is missing for some of the\n", 55 | "players. The ${\\tt isnull()}$ function can be used to identify the missing observations. 
It returns a vector of the same length as the input vector, with a ${\\tt TRUE}$ value\n", 56 | "for any elements that are missing, and a ${\\tt FALSE}$ value for non-missing elements.\n", 57 | "The ${\\tt sum()}$ function can then be used to count all of the missing elements:" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "print(df[\"Salary\"].isnull().sum())" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "We see that ${\\tt Salary}$ is missing for 59 players. The ${\\tt dropna()}$ function\n", 76 | "removes all of the rows that have missing values in any variable:" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "# Print the dimensions of the original Hitters data (322 rows x 20 columns)\n", 88 | "print(df.shape)\n", 89 | "\n", 90 | "# Drop any rows the contain missing values, along with the player names\n", 91 | "df = df.dropna().drop('Player', axis=1)\n", 92 | "\n", 93 | "# Print the dimensions of the modified Hitters data (263 rows x 20 columns)\n", 94 | "print(df.shape)\n", 95 | "\n", 96 | "# One last check: should return 0\n", 97 | "print(df[\"Salary\"].isnull().sum())" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])\n", 109 | "\n", 110 | "y = df.Salary\n", 111 | "\n", 112 | "# Drop the column with the independent variable (Salary), and columns for which we created dummy variables\n", 113 | "X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')\n", 114 | "\n", 115 | "# Define the feature set X.\n", 116 | "X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "We can perform best subset selection by identifying the best model that contains a given number of predictors, where **best** is quantified using RSS. 
We'll define a helper function that outputs the best set of variables for\n", 124 | "each model size:" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": true 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "def processSubset(feature_set):\n", 136 | "    # Fit model on feature_set and calculate RSS\n", 137 | "    model = sm.OLS(y,X[list(feature_set)])\n", 138 | "    regr = model.fit()\n", 139 | "    RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()\n", 140 | "    return {\"model\":regr, \"RSS\":RSS}" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "def getBest(k):\n", 152 | "    \n", 153 | "    tic = time.time()\n", 154 | "    \n", 155 | "    results = []\n", 156 | "    \n", 157 | "    for combo in itertools.combinations(X.columns, k):\n", 158 | "        results.append(processSubset(combo))\n", 159 | "    \n", 160 | "    # Wrap everything up in a nice dataframe\n", 161 | "    models = pd.DataFrame(results)\n", 162 | "    \n", 163 | "    # Choose the model with the lowest RSS\n", 164 | "    best_model = models.loc[models['RSS'].argmin()]\n", 165 | "    \n", 166 | "    toc = time.time()\n", 167 | "    print(\"Processed \", models.shape[0], \"models on\", k, \"predictors in\", (toc-tic), \"seconds.\")\n", 168 | "    \n", 169 | "    # Return the best model, along with some other useful information about the model\n", 170 | "    return best_model" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "This returns a ${\\tt DataFrame}$ containing the best model that we generated, along with some extra information about the model. Now we want to call that function for each number of predictors $k$:" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "# Could take quite a while to complete...\n", 189 | "\n", 190 | "models = pd.DataFrame(columns=[\"RSS\", \"model\"])\n", 191 | "\n", 192 | "tic = time.time()\n", 193 | "for i in range(1,8):\n", 194 | "    models.loc[i] = getBest(i)\n", 195 | "\n", 196 | "toc = time.time()\n", 197 | "print(\"Total elapsed time:\", (toc-tic), \"seconds.\")" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "Now we have one big $\tt{DataFrame}$ that contains the best models we've generated. Let's take a look at the first few:" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "models" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "If we want to access the details of each model, no problem! We can get a full rundown of a single model using the ${\\tt summary()}$ function:" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "collapsed": false 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "print(models.loc[2, \"model\"].summary())" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "This output indicates that the best two-variable model\n", 241 | "contains only ${\\tt Hits}$ and ${\\tt CRBI}$. To save time, we only generated results\n", 242 | "up to the best 7-variable model. 
You can use the functions we defined above to explore as many variables as are desired." 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "print(getBest(19)[\"model\"].summary())" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Rather than letting the results of our call to the ${\\tt summary()}$ function print to the screen, we can access just the parts we need using the model's attributes. For example, if we want the $R^2$ value:" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "models.loc[2, \"model\"].rsquared" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "Excellent! In addition to the verbose output we get when we print the summary to the screen, fitting the ${\\tt OLM}$ also produced many other useful statistics such as adjusted $R^2$, AIC, and BIC. We can examine these to try to select the best overall model. Let's start by looking at $R^2$ across all our models:" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "collapsed": false 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "# Gets the second element from each row ('model') and pulls out its rsquared attribute\n", 290 | "models.apply(lambda row: row[1].rsquared, axis=1)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "As expected, the $R^2$ statistic increases monotonically as more\n", 298 | "variables are included.\n", 299 | "\n", 300 | "Plotting RSS, adjusted $R^2$, AIC, and BIC for all of the models at once will\n", 301 | "help us decide which model to select. 
Note the ${\\tt type=\"l\"}$ option tells ${\\tt R}$ to\n", 302 | "connect the plotted points with lines:" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "collapsed": false 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "plt.figure(figsize=(20,10))\n", 314 | "plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})\n", 315 | "\n", 316 | "# Set up a 2x2 grid so we can look at 4 plots at once\n", 317 | "plt.subplot(2, 2, 1)\n", 318 | "\n", 319 | "# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.\n", 320 | "# The argmax() function can be used to identify the location of the maximum point of a vector\n", 321 | "plt.plot(models[\"RSS\"])\n", 322 | "plt.xlabel('# Predictors')\n", 323 | "plt.ylabel('RSS')\n", 324 | "\n", 325 | "# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.\n", 326 | "# The argmax() function can be used to identify the location of the maximum point of a vector\n", 327 | "\n", 328 | "rsquared_adj = models.apply(lambda row: row[1].rsquared_adj, axis=1)\n", 329 | "\n", 330 | "plt.subplot(2, 2, 2)\n", 331 | "plt.plot(rsquared_adj)\n", 332 | "plt.plot(rsquared_adj.argmax(), rsquared_adj.max(), \"or\")\n", 333 | "plt.xlabel('# Predictors')\n", 334 | "plt.ylabel('adjusted rsquared')\n", 335 | "\n", 336 | "# We'll do the same for AIC and BIC, this time looking for the models with the SMALLEST statistic\n", 337 | "aic = models.apply(lambda row: row[1].aic, axis=1)\n", 338 | "\n", 339 | "plt.subplot(2, 2, 3)\n", 340 | "plt.plot(aic)\n", 341 | "plt.plot(aic.argmin(), aic.min(), \"or\")\n", 342 | "plt.xlabel('# Predictors')\n", 343 | "plt.ylabel('AIC')\n", 344 | "\n", 345 | "bic = models.apply(lambda row: row[1].bic, axis=1)\n", 346 | "\n", 347 | "plt.subplot(2, 2, 4)\n", 348 | "plt.plot(bic)\n", 349 | "plt.plot(bic.argmin(), bic.min(), \"or\")\n", 350 | "plt.xlabel('# Predictors')\n", 351 | "plt.ylabel('BIC')" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "Recall that in the second step of our selection process, we narrowed the field down to just one model on any $k<=p$ predictors. We see that according to BIC, the best performer is the model with 6 variables. According to AIC and adjusted $R^2$ something a bit more complex might be better. Again, no one measure is going to give us an entirely accurate picture... but they all agree that a model with 5 or fewer predictors is insufficient." 
359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "# 6.5.2 Forward and Backward Stepwise Selection\n", 366 | "We can also use a similar approach to perform forward stepwise\n", 367 | "or backward stepwise selection, using a slight modification of the functions we defined above:" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": { 374 | "collapsed": true 375 | }, 376 | "outputs": [], 377 | "source": [ 378 | "def forward(predictors):\n", 379 | "\n", 380 | "    # Pull out predictors we still need to process\n", 381 | "    remaining_predictors = [p for p in X.columns if p not in predictors]\n", 382 | "    \n", 383 | "    tic = time.time()\n", 384 | "    \n", 385 | "    results = []\n", 386 | "    \n", 387 | "    for p in remaining_predictors:\n", 388 | "        results.append(processSubset(predictors+[p]))\n", 389 | "    \n", 390 | "    # Wrap everything up in a nice dataframe\n", 391 | "    models = pd.DataFrame(results)\n", 392 | "    \n", 393 | "    # Choose the model with the lowest RSS\n", 394 | "    best_model = models.loc[models['RSS'].argmin()]\n", 395 | "    \n", 396 | "    toc = time.time()\n", 397 | "    print(\"Processed \", models.shape[0], \"models on\", len(predictors)+1, \"predictors in\", (toc-tic), \"seconds.\")\n", 398 | "    \n", 399 | "    # Return the best model, along with some other useful information about the model\n", 400 | "    return best_model" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "Now let's see how much faster it runs!" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": { 414 | "collapsed": false 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "models2 = pd.DataFrame(columns=[\"RSS\", \"model\"])\n", 419 | "\n", 420 | "tic = time.time()\n", 421 | "predictors = []\n", 422 | "\n", 423 | "for i in range(1,len(X.columns)+1): \n", 424 | "    models2.loc[i] = forward(predictors)\n", 425 | "    predictors = models2.loc[i][\"model\"].model.exog_names\n", 426 | "\n", 427 | "toc = time.time()\n", 428 | "print(\"Total elapsed time:\", (toc-tic), \"seconds.\")" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "Phew! That's a lot better. Let's take a look at the first couple of models chosen by forward selection:" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": { 442 | "collapsed": false 443 | }, 444 | "outputs": [], 445 | "source": [ 446 | "print(models2.loc[1, \"model\"].summary())\n", 447 | "print(models2.loc[2, \"model\"].summary())" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "We see that using forward stepwise selection, the best one-variable\n", 455 | "model contains only ${\tt Hits}$, and the best two-variable model additionally\n", 456 | "includes ${\tt CRBI}$.
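As a quick sanity check, we can also compare the selected columns for the two-variable models directly (a small sketch using the ${\tt models}$ and ${\tt models2}$ DataFrames above):\n",
"\n",
"```python\n",
"# Sketch: do best subset and forward selection pick the same 2 predictors?\n",
"print(set(models.loc[2, \"model\"].model.exog_names) == set(models2.loc[2, \"model\"].model.exog_names))\n",
"```\n",
"\n",
"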
Let's see how the models stack up against best subset selection:" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": { 463 | "collapsed": false 464 | }, 465 | "outputs": [], 466 | "source": [ 467 | "print(models.loc[6, \"model\"].summary())\n", 468 | "print(models2.loc[6, \"model\"].summary())" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": { 474 | "collapsed": true 475 | }, 476 | "source": [ 477 | "For this data, the best one-variable through six-variable\n", 478 | "models are each identical for best subset and forward selection.\n", 479 | "\n", 480 | "# Backward Selection\n", 481 | "Not much has to change to implement backward selection... this time we just start with the full set of predictors and drop one at a time!" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": { 488 | "collapsed": true 489 | }, 490 | "outputs": [], 491 | "source": [ 492 | "def backward(predictors):\n", 493 | "    \n", 494 | "    tic = time.time()\n", 495 | "    \n", 496 | "    results = []\n", 497 | "    \n", 498 | "    for combo in itertools.combinations(predictors, len(predictors)-1):\n", 499 | "        results.append(processSubset(combo))\n", 500 | "    \n", 501 | "    # Wrap everything up in a nice dataframe\n", 502 | "    models = pd.DataFrame(results)\n", 503 | "    \n", 504 | "    # Choose the model with the lowest RSS\n", 505 | "    best_model = models.loc[models['RSS'].argmin()]\n", 506 | "    \n", 507 | "    toc = time.time()\n", 508 | "    print(\"Processed \", models.shape[0], \"models on\", len(predictors)-1, \"predictors in\", (toc-tic), \"seconds.\")\n", 509 | "    \n", 510 | "    # Return the best model, along with some other useful information about the model\n", 511 | "    return best_model" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": { 518 | "collapsed": false 519 | }, 520 | "outputs": [], 521 | "source": [ 522 | "models3 = pd.DataFrame(columns=[\"RSS\", \"model\"], index = range(1,len(X.columns)))\n", 523 | "\n", 524 | "tic = time.time()\n", 525 | "predictors = X.columns\n", 526 | "\n", 527 | "while(len(predictors) > 1): \n", 528 | "    models3.loc[len(predictors)-1] = backward(predictors)\n", 529 | "    predictors = models3.loc[len(predictors)-1][\"model\"].model.exog_names\n", 530 | "\n", 531 | "toc = time.time()\n", 532 | "print(\"Total elapsed time:\", (toc-tic), \"seconds.\")" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "For this data, the best one-variable through six-variable\n", 540 | "models are each identical for best subset and forward selection.\n", 541 | "However, the best seven-variable models identified by forward stepwise selection,\n", 542 | "backward stepwise selection, and best subset selection are different:" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "metadata": { 549 | "collapsed": false 550 | }, 551 | "outputs": [], 552 | "source": [ 553 | "print(models.loc[7, \"model\"].params)" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": null, 559 | "metadata": { 560 | "collapsed": false 561 | }, 562 | "outputs": [], 563 | "source": [ 564 | "print(models2.loc[7, \"model\"].params)" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": { 571 | "collapsed": false 572 | }, 573 | "outputs": [], 574 | "source": [ 575 | "print(models3.loc[7, \"model\"].params)" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 |
"execution_count": null, 581 | "metadata": { 582 | "collapsed": true 583 | }, 584 | "outputs": [], 585 | "source": [] 586 | } 587 | ], 588 | "metadata": { 589 | "kernelspec": { 590 | "display_name": "Python 3", 591 | "language": "python", 592 | "name": "python3" 593 | }, 594 | "language_info": { 595 | "codemirror_mode": { 596 | "name": "ipython", 597 | "version": 3 598 | }, 599 | "file_extension": ".py", 600 | "mimetype": "text/x-python", 601 | "name": "python", 602 | "nbconvert_exporter": "python", 603 | "pygments_lexer": "ipython3", 604 | "version": "3.5.1" 605 | } 606 | }, 607 | "nbformat": 4, 608 | "nbformat_minor": 0 609 | } 610 | -------------------------------------------------------------------------------- /Lab 9 - Linear Model Selection in Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This lab on Model Validation using Validation and Cross-Validation is a Python adaptation of p. 248-251 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "%matplotlib inline\n", 19 | "import pandas as pd\n", 20 | "import numpy as np\n", 21 | "import itertools\n", 22 | "import statsmodels.api as sm\n", 23 | "import matplotlib.pyplot as plt" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "# Model selection using the Validation Set Approach\n", 31 | "\n", 32 | "In Lab 8, we saw that it is possible to choose among a set of models of different\n", 33 | "sizes using $C_p$, BIC, and adjusted $R^2$. We will now consider how to do this\n", 34 | "using the validation set and cross-validation approaches.\n", 35 | "\n", 36 | "As in Lab 8, we'll be working with the ${\\tt Hitters}$ dataset from ${\\tt ISLR}$. 
Since we're trying to predict ${\tt Salary}$ and we know from last time that some salaries are missing, let's first drop all the rows with missing values and do a little cleanup:" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "df = pd.read_csv('Hitters.csv')\n", 48 | "\n", 49 | "# Drop any rows that contain missing values, along with the player names\n", 50 | "df = df.dropna().drop('Player', axis=1)\n", 51 | "\n", 52 | "# Get dummy variables\n", 53 | "dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])\n", 54 | "\n", 55 | "# Extract the dependent variable (the response we want to predict)\n", 56 | "y = pd.DataFrame(df.Salary)\n", 57 | "\n", 58 | "# Drop the column with the dependent variable (Salary), and columns for which we created dummy variables\n", 59 | "X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')\n", 60 | "\n", 61 | "# Define the feature set X.\n", 62 | "X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "In order for the validation set approach to yield accurate estimates of the test\n", 70 | "error, we must use *only the training observations* to perform all aspects of\n", 71 | "model-fitting — including variable selection. Therefore, the determination of\n", 72 | "which model of a given size is best must be made using *only the training\n", 73 | "observations*. This point is subtle but important. If the full data set is used\n", 74 | "to perform the best subset selection step, the validation set errors and\n", 75 | "cross-validation errors that we obtain will not be accurate estimates of the\n", 76 | "test error.\n", 77 | "\n", 78 | "In order to use the validation set approach, we begin by splitting the\n", 79 | "observations into a training set and a test set. We do this by creating\n", 80 | "a random vector, ${\tt train}$, of elements equal to ${\tt True}$ if the corresponding\n", 81 | "observation is in the training set, and ${\tt False}$ otherwise. The vector ${\tt test}$ has\n", 82 | "a ${\tt True}$ if the observation is in the test set, and a ${\tt False}$ otherwise. Note that the\n", 83 | "${\tt np.invert()}$ in the command to create ${\tt test}$ switches the ${\tt True}$ values to ${\tt False}$ and\n", 84 | "vice versa. We also set a random seed so that the user will obtain the same\n", 85 | "training set/test set split." 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "np.random.seed(seed=12)\n", 97 | "train = np.random.choice([True, False], size = len(y), replace = True)\n", 98 | "test = np.invert(train)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "We'll define our helper function to output the best set of variables for each model size like we did in Lab 8.
Note that we'll need to modify this to take in both test and training sets, because we want the returned error to be the **test** error:" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": { 112 | "collapsed": true 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "def processSubset(feature_set, X_train, y_train, X_test, y_test):\n", 117 | "    # Fit model on feature_set and calculate RSS\n", 118 | "    model = sm.OLS(y_train,X_train[list(feature_set)])\n", 119 | "    regr = model.fit()\n", 120 | "    RSS = ((regr.predict(X_test[list(feature_set)]) - y_test) ** 2).sum()\n", 121 | "    return {\"model\":regr, \"RSS\":RSS}" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "And another function to perform forward selection:" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": { 135 | "collapsed": true 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "def forward(predictors, X_train, y_train, X_test, y_test):\n", 140 | "\n", 141 | "    # Pull out predictors we still need to process\n", 142 | "    remaining_predictors = [p for p in X_train.columns if p not in predictors]\n", 143 | "    \n", 144 | "    results = []\n", 145 | "    \n", 146 | "    for p in remaining_predictors:\n", 147 | "        results.append(processSubset(predictors+[p], X_train, y_train, X_test, y_test))\n", 148 | "    \n", 149 | "    # Wrap everything up in a nice dataframe\n", 150 | "    models = pd.DataFrame(results)\n", 151 | "    \n", 152 | "    # Choose the model with the lowest RSS\n", 153 | "    best_model = models.loc[models['RSS'].argmin()]\n", 154 | "    \n", 155 | "    # Return the best model, along with some other useful information about the model\n", 156 | "    return best_model" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Now we'll apply our ${\tt forward()}$ function to the training set in order to perform forward selection for all model sizes:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "models_train = pd.DataFrame(columns=[\"RSS\", \"model\"])\n", 175 | "\n", 176 | "predictors = []\n", 177 | "\n", 178 | "for i in range(1,len(X.columns)+1): \n", 179 | "    models_train.loc[i] = forward(predictors, X[train], y[train][\"Salary\"], X[test], y[test][\"Salary\"])\n", 180 | "    predictors = models_train.loc[i][\"model\"].model.exog_names" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": { 186 | "collapsed": true 187 | }, 188 | "source": [ 189 | "Now let's plot the errors and find the model that minimizes them:" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": { 196 | "collapsed": false 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "plt.plot(models_train[\"RSS\"])\n", 201 | "plt.xlabel('# Predictors')\n", 202 | "plt.ylabel('RSS')\n", 203 | "plt.plot(models_train[\"RSS\"].argmin(), models_train[\"RSS\"].min(), \"or\")" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "Voila! We find that the best model (according to the validation set approach) is the one that contains 10 predictors.\n", 211 | "\n", 212 | "Now that we know what we're looking for, let's perform forward selection on the full dataset and select the best 10-predictor model.
It is important that we make use of the *full\n", 213 | "data set* in order to obtain more accurate coefficient estimates. Note that\n", 214 | "we perform forward selection on the full data set and select the best 10-predictor\n", 215 | "model, rather than simply using the predictors that we obtained\n", 216 | "from the training set, because the best 10-predictor model on the full data\n", 217 | "set may differ from the corresponding model on the training set." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": false 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "models_full = pd.DataFrame(columns=[\"RSS\", \"model\"])\n", 229 | "\n", 230 | "predictors = []\n", 231 | "\n", 232 | "for i in range(1,20): \n", 233 | "    models_full.loc[i] = forward(predictors, X, y[\"Salary\"], X, y[\"Salary\"])\n", 234 | "    predictors = models_full.loc[i][\"model\"].model.exog_names" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "In fact, we see that the best ten-variable model on the full data set has a\n", 242 | "**different set of predictors** than the best ten-variable model on the training\n", 243 | "set:" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "print(models_train.loc[10, \"model\"].model.exog_names)\n", 255 | "print(models_full.loc[10, \"model\"].model.exog_names)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "# Model selection using Cross-Validation\n", 263 | "\n", 264 | "Now let's try to choose among the models of different sizes using cross-validation.\n", 265 | "This approach is somewhat involved, as we must perform forward selection within each of the $k$ training sets. Despite this, we see that\n", 266 | "with its clever subsetting syntax, ${\tt python}$ makes this job quite easy. First, we\n", 267 | "create a vector that assigns each observation to one of $k = 10$ folds, and\n", 268 | "we create a DataFrame in which we will store the results:" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "collapsed": false 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "k=10 # number of folds\n", 280 | "np.random.seed(seed=1)\n", 281 | "folds = np.random.choice(k, size = len(y), replace = True)\n", 282 | "\n", 283 | "# Create a DataFrame to store the results of our upcoming calculations\n", 284 | "cv_errors = pd.DataFrame(columns=range(1,k+1), index=range(1,20))\n", 285 | "cv_errors = cv_errors.fillna(0)\n", 286 | "cv_errors" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": { 292 | "collapsed": true 293 | }, 294 | "source": [ 295 | "Now let's write a for loop that performs cross-validation. In the $j^{th}$ fold, the\n", 296 | "elements of folds that equal $j$ are in the test set, and the remainder are in\n", 297 | "the training set. We make our predictions for each model size, compute the test errors on the appropriate subset,\n", 298 | "and store them in the appropriate slot in the ${\tt cv\_errors}$ DataFrame."
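,
"\n",
"Before running the full loop, it may help to see the subsetting trick on its own (a small illustration using the ${\tt folds}$ vector and feature matrix ${\tt X}$ defined above):\n",
"\n",
"```python\n",
"# Sketch: observations assigned to fold 0 form the held-out set; the rest form the training set\n",
"X_test_fold = X[folds == 0]\n",
"X_train_fold = X[folds != 0]\n",
"print(X_train_fold.shape, X_test_fold.shape)\n",
"```"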
299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": { 305 | "collapsed": false 306 | }, 307 | "outputs": [], 308 | "source": [ 309 | "models_cv = pd.DataFrame(columns=[\"RSS\", \"model\"])\n", 310 | " \n", 311 | "# Outer loop iterates over all folds\n", 312 | "for j in range(1,k+1):\n", 313 | "\n", 314 | "    # Reset predictors\n", 315 | "    predictors = []\n", 316 | "    \n", 317 | "    # Inner loop iterates over each size i\n", 318 | "    for i in range(1,len(X.columns)+1): \n", 319 | "    \n", 320 | "        # Then perform forward selection on the full dataset minus the jth fold, and test on the jth fold\n", 321 | "        models_cv.loc[i] = forward(predictors, X[folds != (j-1)], y[folds != (j-1)][\"Salary\"], X[folds == (j-1)], y[folds == (j-1)][\"Salary\"])\n", 322 | "        \n", 323 | "        # Save the cross-validated error for this fold\n", 324 | "        cv_errors[j][i] = models_cv.loc[i][\"RSS\"]\n", 325 | "\n", 326 | "        # Extract the predictors\n", 327 | "        predictors = models_cv.loc[i][\"model\"].model.exog_names\n", 328 | "        " 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": { 335 | "collapsed": false 336 | }, 337 | "outputs": [], 338 | "source": [ 339 | "cv_errors" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": { 345 | "collapsed": true 346 | }, 347 | "source": [ 348 | "This has filled up the ${\tt cv\_errors}$ DataFrame such that the $(i,j)^{th}$ element corresponds\n", 349 | "to the test MSE for the best $i$-variable model evaluated on the $j^{th}$ cross-validation\n", 350 | "fold. We can then use the ${\tt apply()}$ function to take the ${\tt mean}$ across the columns of this\n", 351 | "DataFrame. This will give us a vector for which the $i^{th}$ element is the cross-validation\n", 352 | "error for the $i$-variable model." 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": { 359 | "collapsed": false 360 | }, 361 | "outputs": [], 362 | "source": [ 363 | "cv_mean = cv_errors.apply(np.mean, axis=1)\n", 364 | "\n", 365 | "plt.plot(cv_mean)\n", 366 | "plt.xlabel('# Predictors')\n", 367 | "plt.ylabel('CV Error')\n", 368 | "plt.plot(cv_mean.argmin(), cv_mean.min(), \"or\")" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "We see that cross-validation selects a 9-predictor model. Now let's go back to our results on the full data set in order to obtain the 9-predictor model."
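,
"\n",
"(To read the winning size off programmatically rather than from the plot, a one-liner like the following should work; this is just a sketch using the ${\tt cv\_mean}$ Series computed above.)\n",
"\n",
"```python\n",
"print(cv_mean.idxmin())  # model size with the lowest mean cross-validation error\n",
"```"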
376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": { 382 | "collapsed": false 383 | }, 384 | "outputs": [], 385 | "source": [ 386 | "print(models_full.loc[9, \"model\"].summary())" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "For comparison, let's also take a look at the statistics from last lab:" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": { 400 | "collapsed": false 401 | }, 402 | "outputs": [], 403 | "source": [ 404 | "plt.figure(figsize=(20,10))\n", 405 | "plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})\n", 406 | "\n", 407 | "# Set up a 2x2 grid so we can look at 4 plots at once\n", 408 | "plt.subplot(2, 2, 1)\n", 409 | "\n", 410 | "# Start with a simple line plot of RSS. We don't mark a \"best\" point here,\n", 411 | "# since RSS always decreases as more predictors are added\n", 412 | "plt.plot(models_full[\"RSS\"])\n", 413 | "plt.xlabel('# Predictors')\n", 414 | "plt.ylabel('RSS')\n", 415 | "\n", 416 | "# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.\n", 417 | "# The argmax() function can be used to identify the location of the maximum point of a vector\n", 418 | "\n", 419 | "rsquared_adj = models_full.apply(lambda row: row[1].rsquared_adj, axis=1)\n", 420 | "\n", 421 | "plt.subplot(2, 2, 2)\n", 422 | "plt.plot(rsquared_adj)\n", 423 | "plt.plot(rsquared_adj.argmax(), rsquared_adj.max(), \"or\")\n", 424 | "plt.xlabel('# Predictors')\n", 425 | "plt.ylabel('adjusted rsquared')\n", 426 | "\n", 427 | "# We'll do the same for AIC and BIC, this time looking for the models with the SMALLEST statistic\n", 428 | "aic = models_full.apply(lambda row: row[1].aic, axis=1)\n", 429 | "\n", 430 | "plt.subplot(2, 2, 3)\n", 431 | "plt.plot(aic)\n", 432 | "plt.plot(aic.argmin(), aic.min(), \"or\")\n", 433 | "plt.xlabel('# Predictors')\n", 434 | "plt.ylabel('AIC')\n", 435 | "\n", 436 | "bic = models_full.apply(lambda row: row[1].bic, axis=1)\n", 437 | "\n", 438 | "plt.subplot(2, 2, 4)\n", 439 | "plt.plot(bic)\n", 440 | "plt.plot(bic.argmin(), bic.min(), \"or\")\n", 441 | "plt.xlabel('# Predictors')\n", 442 | "plt.ylabel('BIC')" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Notice how some of the indicators are similar to the cross-validated model, and others are very different?\n", 450 | "\n", 451 | "# Your turn!\n", 452 | "\n", 453 | "Now it's time to test out these approaches (best / forward / backward selection) and evaluation methods (adjusted training error, validation set, cross validation) on other datasets. You may want to work with a team on this portion of the lab.\n", 454 | "\n", 455 | "You may use any of the datasets included in ${\tt ISLR}$, or choose one from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html). Download a dataset, and try to determine the optimal set of parameters to use to model it!"
456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "collapsed": true 463 | }, 464 | "outputs": [], 465 | "source": [ 466 | "# Your code here" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "To get credit for this lab, please post your answers to the following questions:\n", 474 | " - What dataset did you choose?\n", 475 | " - Which selection techniques did you try?\n", 476 | " - Which evaluation techniques did you try?\n", 477 | " - What did you determine was the best set of parameters to model this data?\n", 478 | " - How well did this model perform?\n", 479 | " \n", 480 | "to Piazza: https://piazza.com/class/igwiv4w3ctb6rg?cid=35" 481 | ] 482 | } 483 | ], 484 | "metadata": { 485 | "kernelspec": { 486 | "display_name": "Python 3", 487 | "language": "python", 488 | "name": "python3" 489 | }, 490 | "language_info": { 491 | "codemirror_mode": { 492 | "name": "ipython", 493 | "version": 3 494 | }, 495 | "file_extension": ".py", 496 | "mimetype": "text/x-python", 497 | "name": "python", 498 | "nbconvert_exporter": "python", 499 | "pygments_lexer": "ipython3", 500 | "version": "3.5.1" 501 | } 502 | }, 503 | "nbformat": 4, 504 | "nbformat_minor": 0 505 | } 506 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # islr-python 2 | 3 | This project is a python adaptation of the lab example in 4 | "Introduction to Statistical Learning with Applications in R" 5 | by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. 6 | 7 | Some of the original adaptations were produced by J. Warmenhoven, and updated by R. Jordan Crouser at Smith College for 8 | SDS293: Machine Learning (Spring 2016). 
9 | -------------------------------------------------------------------------------- /data/Auto.csv: -------------------------------------------------------------------------------- 1 | mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name 2 | 18,8,307,130,3504,12,70,1,chevrolet chevelle malibu 3 | 15,8,350,165,3693,11.5,70,1,buick skylark 320 4 | 18,8,318,150,3436,11,70,1,plymouth satellite 5 | 16,8,304,150,3433,12,70,1,amc rebel sst 6 | 17,8,302,140,3449,10.5,70,1,ford torino 7 | 15,8,429,198,4341,10,70,1,ford galaxie 500 8 | 14,8,454,220,4354,9,70,1,chevrolet impala 9 | 14,8,440,215,4312,8.5,70,1,plymouth fury iii 10 | 14,8,455,225,4425,10,70,1,pontiac catalina 11 | 15,8,390,190,3850,8.5,70,1,amc ambassador dpl 12 | 15,8,383,170,3563,10,70,1,dodge challenger se 13 | 14,8,340,160,3609,8,70,1,plymouth 'cuda 340 14 | 15,8,400,150,3761,9.5,70,1,chevrolet monte carlo 15 | 14,8,455,225,3086,10,70,1,buick estate wagon (sw) 16 | 24,4,113,95,2372,15,70,3,toyota corona mark ii 17 | 22,6,198,95,2833,15.5,70,1,plymouth duster 18 | 18,6,199,97,2774,15.5,70,1,amc hornet 19 | 21,6,200,85,2587,16,70,1,ford maverick 20 | 27,4,97,88,2130,14.5,70,3,datsun pl510 21 | 26,4,97,46,1835,20.5,70,2,volkswagen 1131 deluxe sedan 22 | 25,4,110,87,2672,17.5,70,2,peugeot 504 23 | 24,4,107,90,2430,14.5,70,2,audi 100 ls 24 | 25,4,104,95,2375,17.5,70,2,saab 99e 25 | 26,4,121,113,2234,12.5,70,2,bmw 2002 26 | 21,6,199,90,2648,15,70,1,amc gremlin 27 | 10,8,360,215,4615,14,70,1,ford f250 28 | 10,8,307,200,4376,15,70,1,chevy c20 29 | 11,8,318,210,4382,13.5,70,1,dodge d200 30 | 9,8,304,193,4732,18.5,70,1,hi 1200d 31 | 27,4,97,88,2130,14.5,71,3,datsun pl510 32 | 28,4,140,90,2264,15.5,71,1,chevrolet vega 2300 33 | 25,4,113,95,2228,14,71,3,toyota corona 34 | 19,6,232,100,2634,13,71,1,amc gremlin 35 | 16,6,225,105,3439,15.5,71,1,plymouth satellite custom 36 | 17,6,250,100,3329,15.5,71,1,chevrolet chevelle malibu 37 | 19,6,250,88,3302,15.5,71,1,ford torino 500 38 | 18,6,232,100,3288,15.5,71,1,amc matador 39 | 14,8,350,165,4209,12,71,1,chevrolet impala 40 | 14,8,400,175,4464,11.5,71,1,pontiac catalina brougham 41 | 14,8,351,153,4154,13.5,71,1,ford galaxie 500 42 | 14,8,318,150,4096,13,71,1,plymouth fury iii 43 | 12,8,383,180,4955,11.5,71,1,dodge monaco (sw) 44 | 13,8,400,170,4746,12,71,1,ford country squire (sw) 45 | 13,8,400,175,5140,12,71,1,pontiac safari (sw) 46 | 18,6,258,110,2962,13.5,71,1,amc hornet sportabout (sw) 47 | 22,4,140,72,2408,19,71,1,chevrolet vega (sw) 48 | 19,6,250,100,3282,15,71,1,pontiac firebird 49 | 18,6,250,88,3139,14.5,71,1,ford mustang 50 | 23,4,122,86,2220,14,71,1,mercury capri 2000 51 | 28,4,116,90,2123,14,71,2,opel 1900 52 | 30,4,79,70,2074,19.5,71,2,peugeot 304 53 | 30,4,88,76,2065,14.5,71,2,fiat 124b 54 | 31,4,71,65,1773,19,71,3,toyota corolla 1200 55 | 35,4,72,69,1613,18,71,3,datsun 1200 56 | 27,4,97,60,1834,19,71,2,volkswagen model 111 57 | 26,4,91,70,1955,20.5,71,1,plymouth cricket 58 | 24,4,113,95,2278,15.5,72,3,toyota corona hardtop 59 | 25,4,97.5,80,2126,17,72,1,dodge colt hardtop 60 | 23,4,97,54,2254,23.5,72,2,volkswagen type 3 61 | 20,4,140,90,2408,19.5,72,1,chevrolet vega 62 | 21,4,122,86,2226,16.5,72,1,ford pinto runabout 63 | 13,8,350,165,4274,12,72,1,chevrolet impala 64 | 14,8,400,175,4385,12,72,1,pontiac catalina 65 | 15,8,318,150,4135,13.5,72,1,plymouth fury iii 66 | 14,8,351,153,4129,13,72,1,ford galaxie 500 67 | 17,8,304,150,3672,11.5,72,1,amc ambassador sst 68 | 11,8,429,208,4633,11,72,1,mercury marquis 69 | 13,8,350,155,4502,13.5,72,1,buick lesabre custom 70 | 
12,8,350,160,4456,13.5,72,1,oldsmobile delta 88 royale 71 | 13,8,400,190,4422,12.5,72,1,chrysler newport royal 72 | 19,3,70,97,2330,13.5,72,3,mazda rx2 coupe 73 | 15,8,304,150,3892,12.5,72,1,amc matador (sw) 74 | 13,8,307,130,4098,14,72,1,chevrolet chevelle concours (sw) 75 | 13,8,302,140,4294,16,72,1,ford gran torino (sw) 76 | 14,8,318,150,4077,14,72,1,plymouth satellite custom (sw) 77 | 18,4,121,112,2933,14.5,72,2,volvo 145e (sw) 78 | 22,4,121,76,2511,18,72,2,volkswagen 411 (sw) 79 | 21,4,120,87,2979,19.5,72,2,peugeot 504 (sw) 80 | 26,4,96,69,2189,18,72,2,renault 12 (sw) 81 | 22,4,122,86,2395,16,72,1,ford pinto (sw) 82 | 28,4,97,92,2288,17,72,3,datsun 510 (sw) 83 | 23,4,120,97,2506,14.5,72,3,toyouta corona mark ii (sw) 84 | 28,4,98,80,2164,15,72,1,dodge colt (sw) 85 | 27,4,97,88,2100,16.5,72,3,toyota corolla 1600 (sw) 86 | 13,8,350,175,4100,13,73,1,buick century 350 87 | 14,8,304,150,3672,11.5,73,1,amc matador 88 | 13,8,350,145,3988,13,73,1,chevrolet malibu 89 | 14,8,302,137,4042,14.5,73,1,ford gran torino 90 | 15,8,318,150,3777,12.5,73,1,dodge coronet custom 91 | 12,8,429,198,4952,11.5,73,1,mercury marquis brougham 92 | 13,8,400,150,4464,12,73,1,chevrolet caprice classic 93 | 13,8,351,158,4363,13,73,1,ford ltd 94 | 14,8,318,150,4237,14.5,73,1,plymouth fury gran sedan 95 | 13,8,440,215,4735,11,73,1,chrysler new yorker brougham 96 | 12,8,455,225,4951,11,73,1,buick electra 225 custom 97 | 13,8,360,175,3821,11,73,1,amc ambassador brougham 98 | 18,6,225,105,3121,16.5,73,1,plymouth valiant 99 | 16,6,250,100,3278,18,73,1,chevrolet nova custom 100 | 18,6,232,100,2945,16,73,1,amc hornet 101 | 18,6,250,88,3021,16.5,73,1,ford maverick 102 | 23,6,198,95,2904,16,73,1,plymouth duster 103 | 26,4,97,46,1950,21,73,2,volkswagen super beetle 104 | 11,8,400,150,4997,14,73,1,chevrolet impala 105 | 12,8,400,167,4906,12.5,73,1,ford country 106 | 13,8,360,170,4654,13,73,1,plymouth custom suburb 107 | 12,8,350,180,4499,12.5,73,1,oldsmobile vista cruiser 108 | 18,6,232,100,2789,15,73,1,amc gremlin 109 | 20,4,97,88,2279,19,73,3,toyota carina 110 | 21,4,140,72,2401,19.5,73,1,chevrolet vega 111 | 22,4,108,94,2379,16.5,73,3,datsun 610 112 | 18,3,70,90,2124,13.5,73,3,maxda rx3 113 | 19,4,122,85,2310,18.5,73,1,ford pinto 114 | 21,6,155,107,2472,14,73,1,mercury capri v6 115 | 26,4,98,90,2265,15.5,73,2,fiat 124 sport coupe 116 | 15,8,350,145,4082,13,73,1,chevrolet monte carlo s 117 | 16,8,400,230,4278,9.5,73,1,pontiac grand prix 118 | 29,4,68,49,1867,19.5,73,2,fiat 128 119 | 24,4,116,75,2158,15.5,73,2,opel manta 120 | 20,4,114,91,2582,14,73,2,audi 100ls 121 | 19,4,121,112,2868,15.5,73,2,volvo 144ea 122 | 15,8,318,150,3399,11,73,1,dodge dart custom 123 | 24,4,121,110,2660,14,73,2,saab 99le 124 | 20,6,156,122,2807,13.5,73,3,toyota mark ii 125 | 11,8,350,180,3664,11,73,1,oldsmobile omega 126 | 20,6,198,95,3102,16.5,74,1,plymouth duster 127 | 19,6,232,100,2901,16,74,1,amc hornet 128 | 15,6,250,100,3336,17,74,1,chevrolet nova 129 | 31,4,79,67,1950,19,74,3,datsun b210 130 | 26,4,122,80,2451,16.5,74,1,ford pinto 131 | 32,4,71,65,1836,21,74,3,toyota corolla 1200 132 | 25,4,140,75,2542,17,74,1,chevrolet vega 133 | 16,6,250,100,3781,17,74,1,chevrolet chevelle malibu classic 134 | 16,6,258,110,3632,18,74,1,amc matador 135 | 18,6,225,105,3613,16.5,74,1,plymouth satellite sebring 136 | 16,8,302,140,4141,14,74,1,ford gran torino 137 | 13,8,350,150,4699,14.5,74,1,buick century luxus (sw) 138 | 14,8,318,150,4457,13.5,74,1,dodge coronet custom (sw) 139 | 14,8,302,140,4638,16,74,1,ford gran torino (sw) 140 | 
14,8,304,150,4257,15.5,74,1,amc matador (sw) 141 | 29,4,98,83,2219,16.5,74,2,audi fox 142 | 26,4,79,67,1963,15.5,74,2,volkswagen dasher 143 | 26,4,97,78,2300,14.5,74,2,opel manta 144 | 31,4,76,52,1649,16.5,74,3,toyota corona 145 | 32,4,83,61,2003,19,74,3,datsun 710 146 | 28,4,90,75,2125,14.5,74,1,dodge colt 147 | 24,4,90,75,2108,15.5,74,2,fiat 128 148 | 26,4,116,75,2246,14,74,2,fiat 124 tc 149 | 24,4,120,97,2489,15,74,3,honda civic 150 | 26,4,108,93,2391,15.5,74,3,subaru 151 | 31,4,79,67,2000,16,74,2,fiat x1.9 152 | 19,6,225,95,3264,16,75,1,plymouth valiant custom 153 | 18,6,250,105,3459,16,75,1,chevrolet nova 154 | 15,6,250,72,3432,21,75,1,mercury monarch 155 | 15,6,250,72,3158,19.5,75,1,ford maverick 156 | 16,8,400,170,4668,11.5,75,1,pontiac catalina 157 | 15,8,350,145,4440,14,75,1,chevrolet bel air 158 | 16,8,318,150,4498,14.5,75,1,plymouth grand fury 159 | 14,8,351,148,4657,13.5,75,1,ford ltd 160 | 17,6,231,110,3907,21,75,1,buick century 161 | 16,6,250,105,3897,18.5,75,1,chevroelt chevelle malibu 162 | 15,6,258,110,3730,19,75,1,amc matador 163 | 18,6,225,95,3785,19,75,1,plymouth fury 164 | 21,6,231,110,3039,15,75,1,buick skyhawk 165 | 20,8,262,110,3221,13.5,75,1,chevrolet monza 2+2 166 | 13,8,302,129,3169,12,75,1,ford mustang ii 167 | 29,4,97,75,2171,16,75,3,toyota corolla 168 | 23,4,140,83,2639,17,75,1,ford pinto 169 | 20,6,232,100,2914,16,75,1,amc gremlin 170 | 23,4,140,78,2592,18.5,75,1,pontiac astro 171 | 24,4,134,96,2702,13.5,75,3,toyota corona 172 | 25,4,90,71,2223,16.5,75,2,volkswagen dasher 173 | 24,4,119,97,2545,17,75,3,datsun 710 174 | 18,6,171,97,2984,14.5,75,1,ford pinto 175 | 29,4,90,70,1937,14,75,2,volkswagen rabbit 176 | 19,6,232,90,3211,17,75,1,amc pacer 177 | 23,4,115,95,2694,15,75,2,audi 100ls 178 | 23,4,120,88,2957,17,75,2,peugeot 504 179 | 22,4,121,98,2945,14.5,75,2,volvo 244dl 180 | 25,4,121,115,2671,13.5,75,2,saab 99le 181 | 33,4,91,53,1795,17.5,75,3,honda civic cvcc 182 | 28,4,107,86,2464,15.5,76,2,fiat 131 183 | 25,4,116,81,2220,16.9,76,2,opel 1900 184 | 25,4,140,92,2572,14.9,76,1,capri ii 185 | 26,4,98,79,2255,17.7,76,1,dodge colt 186 | 27,4,101,83,2202,15.3,76,2,renault 12tl 187 | 17.5,8,305,140,4215,13,76,1,chevrolet chevelle malibu classic 188 | 16,8,318,150,4190,13,76,1,dodge coronet brougham 189 | 15.5,8,304,120,3962,13.9,76,1,amc matador 190 | 14.5,8,351,152,4215,12.8,76,1,ford gran torino 191 | 22,6,225,100,3233,15.4,76,1,plymouth valiant 192 | 22,6,250,105,3353,14.5,76,1,chevrolet nova 193 | 24,6,200,81,3012,17.6,76,1,ford maverick 194 | 22.5,6,232,90,3085,17.6,76,1,amc hornet 195 | 29,4,85,52,2035,22.2,76,1,chevrolet chevette 196 | 24.5,4,98,60,2164,22.1,76,1,chevrolet woody 197 | 29,4,90,70,1937,14.2,76,2,vw rabbit 198 | 33,4,91,53,1795,17.4,76,3,honda civic 199 | 20,6,225,100,3651,17.7,76,1,dodge aspen se 200 | 18,6,250,78,3574,21,76,1,ford granada ghia 201 | 18.5,6,250,110,3645,16.2,76,1,pontiac ventura sj 202 | 17.5,6,258,95,3193,17.8,76,1,amc pacer d/l 203 | 29.5,4,97,71,1825,12.2,76,2,volkswagen rabbit 204 | 32,4,85,70,1990,17,76,3,datsun b-210 205 | 28,4,97,75,2155,16.4,76,3,toyota corolla 206 | 26.5,4,140,72,2565,13.6,76,1,ford pinto 207 | 20,4,130,102,3150,15.7,76,2,volvo 245 208 | 13,8,318,150,3940,13.2,76,1,plymouth volare premier v8 209 | 19,4,120,88,3270,21.9,76,2,peugeot 504 210 | 19,6,156,108,2930,15.5,76,3,toyota mark ii 211 | 16.5,6,168,120,3820,16.7,76,2,mercedes-benz 280s 212 | 16.5,8,350,180,4380,12.1,76,1,cadillac seville 213 | 13,8,350,145,4055,12,76,1,chevy c10 214 | 13,8,302,130,3870,15,76,1,ford f108 215 | 
13,8,318,150,3755,14,76,1,dodge d100 216 | 31.5,4,98,68,2045,18.5,77,3,honda accord cvcc 217 | 30,4,111,80,2155,14.8,77,1,buick opel isuzu deluxe 218 | 36,4,79,58,1825,18.6,77,2,renault 5 gtl 219 | 25.5,4,122,96,2300,15.5,77,1,plymouth arrow gs 220 | 33.5,4,85,70,1945,16.8,77,3,datsun f-10 hatchback 221 | 17.5,8,305,145,3880,12.5,77,1,chevrolet caprice classic 222 | 17,8,260,110,4060,19,77,1,oldsmobile cutlass supreme 223 | 15.5,8,318,145,4140,13.7,77,1,dodge monaco brougham 224 | 15,8,302,130,4295,14.9,77,1,mercury cougar brougham 225 | 17.5,6,250,110,3520,16.4,77,1,chevrolet concours 226 | 20.5,6,231,105,3425,16.9,77,1,buick skylark 227 | 19,6,225,100,3630,17.7,77,1,plymouth volare custom 228 | 18.5,6,250,98,3525,19,77,1,ford granada 229 | 16,8,400,180,4220,11.1,77,1,pontiac grand prix lj 230 | 15.5,8,350,170,4165,11.4,77,1,chevrolet monte carlo landau 231 | 15.5,8,400,190,4325,12.2,77,1,chrysler cordoba 232 | 16,8,351,149,4335,14.5,77,1,ford thunderbird 233 | 29,4,97,78,1940,14.5,77,2,volkswagen rabbit custom 234 | 24.5,4,151,88,2740,16,77,1,pontiac sunbird coupe 235 | 26,4,97,75,2265,18.2,77,3,toyota corolla liftback 236 | 25.5,4,140,89,2755,15.8,77,1,ford mustang ii 2+2 237 | 30.5,4,98,63,2051,17,77,1,chevrolet chevette 238 | 33.5,4,98,83,2075,15.9,77,1,dodge colt m/m 239 | 30,4,97,67,1985,16.4,77,3,subaru dl 240 | 30.5,4,97,78,2190,14.1,77,2,volkswagen dasher 241 | 22,6,146,97,2815,14.5,77,3,datsun 810 242 | 21.5,4,121,110,2600,12.8,77,2,bmw 320i 243 | 21.5,3,80,110,2720,13.5,77,3,mazda rx-4 244 | 43.1,4,90,48,1985,21.5,78,2,volkswagen rabbit custom diesel 245 | 36.1,4,98,66,1800,14.4,78,1,ford fiesta 246 | 32.8,4,78,52,1985,19.4,78,3,mazda glc deluxe 247 | 39.4,4,85,70,2070,18.6,78,3,datsun b210 gx 248 | 36.1,4,91,60,1800,16.4,78,3,honda civic cvcc 249 | 19.9,8,260,110,3365,15.5,78,1,oldsmobile cutlass salon brougham 250 | 19.4,8,318,140,3735,13.2,78,1,dodge diplomat 251 | 20.2,8,302,139,3570,12.8,78,1,mercury monarch ghia 252 | 19.2,6,231,105,3535,19.2,78,1,pontiac phoenix lj 253 | 20.5,6,200,95,3155,18.2,78,1,chevrolet malibu 254 | 20.2,6,200,85,2965,15.8,78,1,ford fairmont (auto) 255 | 25.1,4,140,88,2720,15.4,78,1,ford fairmont (man) 256 | 20.5,6,225,100,3430,17.2,78,1,plymouth volare 257 | 19.4,6,232,90,3210,17.2,78,1,amc concord 258 | 20.6,6,231,105,3380,15.8,78,1,buick century special 259 | 20.8,6,200,85,3070,16.7,78,1,mercury zephyr 260 | 18.6,6,225,110,3620,18.7,78,1,dodge aspen 261 | 18.1,6,258,120,3410,15.1,78,1,amc concord d/l 262 | 19.2,8,305,145,3425,13.2,78,1,chevrolet monte carlo landau 263 | 17.7,6,231,165,3445,13.4,78,1,buick regal sport coupe (turbo) 264 | 18.1,8,302,139,3205,11.2,78,1,ford futura 265 | 17.5,8,318,140,4080,13.7,78,1,dodge magnum xe 266 | 30,4,98,68,2155,16.5,78,1,chevrolet chevette 267 | 27.5,4,134,95,2560,14.2,78,3,toyota corona 268 | 27.2,4,119,97,2300,14.7,78,3,datsun 510 269 | 30.9,4,105,75,2230,14.5,78,1,dodge omni 270 | 21.1,4,134,95,2515,14.8,78,3,toyota celica gt liftback 271 | 23.2,4,156,105,2745,16.7,78,1,plymouth sapporo 272 | 23.8,4,151,85,2855,17.6,78,1,oldsmobile starfire sx 273 | 23.9,4,119,97,2405,14.9,78,3,datsun 200-sx 274 | 20.3,5,131,103,2830,15.9,78,2,audi 5000 275 | 17,6,163,125,3140,13.6,78,2,volvo 264gl 276 | 21.6,4,121,115,2795,15.7,78,2,saab 99gle 277 | 16.2,6,163,133,3410,15.8,78,2,peugeot 604sl 278 | 31.5,4,89,71,1990,14.9,78,2,volkswagen scirocco 279 | 29.5,4,98,68,2135,16.6,78,3,honda accord lx 280 | 21.5,6,231,115,3245,15.4,79,1,pontiac lemans v6 281 | 19.8,6,200,85,2990,18.2,79,1,mercury zephyr 6 282 | 
22.3,4,140,88,2890,17.3,79,1,ford fairmont 4 283 | 20.2,6,232,90,3265,18.2,79,1,amc concord dl 6 284 | 20.6,6,225,110,3360,16.6,79,1,dodge aspen 6 285 | 17,8,305,130,3840,15.4,79,1,chevrolet caprice classic 286 | 17.6,8,302,129,3725,13.4,79,1,ford ltd landau 287 | 16.5,8,351,138,3955,13.2,79,1,mercury grand marquis 288 | 18.2,8,318,135,3830,15.2,79,1,dodge st. regis 289 | 16.9,8,350,155,4360,14.9,79,1,buick estate wagon (sw) 290 | 15.5,8,351,142,4054,14.3,79,1,ford country squire (sw) 291 | 19.2,8,267,125,3605,15,79,1,chevrolet malibu classic (sw) 292 | 18.5,8,360,150,3940,13,79,1,chrysler lebaron town @ country (sw) 293 | 31.9,4,89,71,1925,14,79,2,vw rabbit custom 294 | 34.1,4,86,65,1975,15.2,79,3,maxda glc deluxe 295 | 35.7,4,98,80,1915,14.4,79,1,dodge colt hatchback custom 296 | 27.4,4,121,80,2670,15,79,1,amc spirit dl 297 | 25.4,5,183,77,3530,20.1,79,2,mercedes benz 300d 298 | 23,8,350,125,3900,17.4,79,1,cadillac eldorado 299 | 27.2,4,141,71,3190,24.8,79,2,peugeot 504 300 | 23.9,8,260,90,3420,22.2,79,1,oldsmobile cutlass salon brougham 301 | 34.2,4,105,70,2200,13.2,79,1,plymouth horizon 302 | 34.5,4,105,70,2150,14.9,79,1,plymouth horizon tc3 303 | 31.8,4,85,65,2020,19.2,79,3,datsun 210 304 | 37.3,4,91,69,2130,14.7,79,2,fiat strada custom 305 | 28.4,4,151,90,2670,16,79,1,buick skylark limited 306 | 28.8,6,173,115,2595,11.3,79,1,chevrolet citation 307 | 26.8,6,173,115,2700,12.9,79,1,oldsmobile omega brougham 308 | 33.5,4,151,90,2556,13.2,79,1,pontiac phoenix 309 | 41.5,4,98,76,2144,14.7,80,2,vw rabbit 310 | 38.1,4,89,60,1968,18.8,80,3,toyota corolla tercel 311 | 32.1,4,98,70,2120,15.5,80,1,chevrolet chevette 312 | 37.2,4,86,65,2019,16.4,80,3,datsun 310 313 | 28,4,151,90,2678,16.5,80,1,chevrolet citation 314 | 26.4,4,140,88,2870,18.1,80,1,ford fairmont 315 | 24.3,4,151,90,3003,20.1,80,1,amc concord 316 | 19.1,6,225,90,3381,18.7,80,1,dodge aspen 317 | 34.3,4,97,78,2188,15.8,80,2,audi 4000 318 | 29.8,4,134,90,2711,15.5,80,3,toyota corona liftback 319 | 31.3,4,120,75,2542,17.5,80,3,mazda 626 320 | 37,4,119,92,2434,15,80,3,datsun 510 hatchback 321 | 32.2,4,108,75,2265,15.2,80,3,toyota corolla 322 | 46.6,4,86,65,2110,17.9,80,3,mazda glc 323 | 27.9,4,156,105,2800,14.4,80,1,dodge colt 324 | 40.8,4,85,65,2110,19.2,80,3,datsun 210 325 | 44.3,4,90,48,2085,21.7,80,2,vw rabbit c (diesel) 326 | 43.4,4,90,48,2335,23.7,80,2,vw dasher (diesel) 327 | 36.4,5,121,67,2950,19.9,80,2,audi 5000s (diesel) 328 | 30,4,146,67,3250,21.8,80,2,mercedes-benz 240d 329 | 44.6,4,91,67,1850,13.8,80,3,honda civic 1500 gl 330 | 33.8,4,97,67,2145,18,80,3,subaru dl 331 | 29.8,4,89,62,1845,15.3,80,2,vokswagen rabbit 332 | 32.7,6,168,132,2910,11.4,80,3,datsun 280-zx 333 | 23.7,3,70,100,2420,12.5,80,3,mazda rx-7 gs 334 | 35,4,122,88,2500,15.1,80,2,triumph tr7 coupe 335 | 32.4,4,107,72,2290,17,80,3,honda accord 336 | 27.2,4,135,84,2490,15.7,81,1,plymouth reliant 337 | 26.6,4,151,84,2635,16.4,81,1,buick skylark 338 | 25.8,4,156,92,2620,14.4,81,1,dodge aries wagon (sw) 339 | 23.5,6,173,110,2725,12.6,81,1,chevrolet citation 340 | 30,4,135,84,2385,12.9,81,1,plymouth reliant 341 | 39.1,4,79,58,1755,16.9,81,3,toyota starlet 342 | 39,4,86,64,1875,16.4,81,1,plymouth champ 343 | 35.1,4,81,60,1760,16.1,81,3,honda civic 1300 344 | 32.3,4,97,67,2065,17.8,81,3,subaru 345 | 37,4,85,65,1975,19.4,81,3,datsun 210 mpg 346 | 37.7,4,89,62,2050,17.3,81,3,toyota tercel 347 | 34.1,4,91,68,1985,16,81,3,mazda glc 4 348 | 34.7,4,105,63,2215,14.9,81,1,plymouth horizon 4 349 | 34.4,4,98,65,2045,16.2,81,1,ford escort 4w 350 | 29.9,4,98,65,2380,20.7,81,1,ford 
escort 2h 351 | 33,4,105,74,2190,14.2,81,2,volkswagen jetta 352 | 33.7,4,107,75,2210,14.4,81,3,honda prelude 353 | 32.4,4,108,75,2350,16.8,81,3,toyota corolla 354 | 32.9,4,119,100,2615,14.8,81,3,datsun 200sx 355 | 31.6,4,120,74,2635,18.3,81,3,mazda 626 356 | 28.1,4,141,80,3230,20.4,81,2,peugeot 505s turbo diesel 357 | 30.7,6,145,76,3160,19.6,81,2,volvo diesel 358 | 25.4,6,168,116,2900,12.6,81,3,toyota cressida 359 | 24.2,6,146,120,2930,13.8,81,3,datsun 810 maxima 360 | 22.4,6,231,110,3415,15.8,81,1,buick century 361 | 26.6,8,350,105,3725,19,81,1,oldsmobile cutlass ls 362 | 20.2,6,200,88,3060,17.1,81,1,ford granada gl 363 | 17.6,6,225,85,3465,16.6,81,1,chrysler lebaron salon 364 | 28,4,112,88,2605,19.6,82,1,chevrolet cavalier 365 | 27,4,112,88,2640,18.6,82,1,chevrolet cavalier wagon 366 | 34,4,112,88,2395,18,82,1,chevrolet cavalier 2-door 367 | 31,4,112,85,2575,16.2,82,1,pontiac j2000 se hatchback 368 | 29,4,135,84,2525,16,82,1,dodge aries se 369 | 27,4,151,90,2735,18,82,1,pontiac phoenix 370 | 24,4,140,92,2865,16.4,82,1,ford fairmont futura 371 | 36,4,105,74,1980,15.3,82,2,volkswagen rabbit l 372 | 37,4,91,68,2025,18.2,82,3,mazda glc custom l 373 | 31,4,91,68,1970,17.6,82,3,mazda glc custom 374 | 38,4,105,63,2125,14.7,82,1,plymouth horizon miser 375 | 36,4,98,70,2125,17.3,82,1,mercury lynx l 376 | 36,4,120,88,2160,14.5,82,3,nissan stanza xe 377 | 36,4,107,75,2205,14.5,82,3,honda accord 378 | 34,4,108,70,2245,16.9,82,3,toyota corolla 379 | 38,4,91,67,1965,15,82,3,honda civic 380 | 32,4,91,67,1965,15.7,82,3,honda civic (auto) 381 | 38,4,91,67,1995,16.2,82,3,datsun 310 gx 382 | 25,6,181,110,2945,16.4,82,1,buick century limited 383 | 38,6,262,85,3015,17,82,1,oldsmobile cutlass ciera (diesel) 384 | 26,4,156,92,2585,14.5,82,1,chrysler lebaron medallion 385 | 22,6,232,112,2835,14.7,82,1,ford granada l 386 | 32,4,144,96,2665,13.9,82,3,toyota celica gt 387 | 36,4,135,84,2370,13,82,1,dodge charger 2.2 388 | 27,4,151,90,2950,17.3,82,1,chevrolet camaro 389 | 27,4,140,86,2790,15.6,82,1,ford mustang gl 390 | 44,4,97,52,2130,24.6,82,2,vw pickup 391 | 32,4,135,84,2295,11.6,82,1,dodge rampage 392 | 28,4,120,79,2625,18.6,82,1,ford ranger 393 | 31,4,119,82,2720,19.4,82,1,chevy s-10 -------------------------------------------------------------------------------- /data/Carseats.csv: -------------------------------------------------------------------------------- 1 | "Sales","CompPrice","Income","Advertising","Population","Price","ShelveLoc","Age","Education","Urban","US" 2 | 9.5,138,73,11,276,120,"Bad",42,17,"Yes","Yes" 3 | 11.22,111,48,16,260,83,"Good",65,10,"Yes","Yes" 4 | 10.06,113,35,10,269,80,"Medium",59,12,"Yes","Yes" 5 | 7.4,117,100,4,466,97,"Medium",55,14,"Yes","Yes" 6 | 4.15,141,64,3,340,128,"Bad",38,13,"Yes","No" 7 | 10.81,124,113,13,501,72,"Bad",78,16,"No","Yes" 8 | 6.63,115,105,0,45,108,"Medium",71,15,"Yes","No" 9 | 11.85,136,81,15,425,120,"Good",67,10,"Yes","Yes" 10 | 6.54,132,110,0,108,124,"Medium",76,10,"No","No" 11 | 4.69,132,113,0,131,124,"Medium",76,17,"No","Yes" 12 | 9.01,121,78,9,150,100,"Bad",26,10,"No","Yes" 13 | 11.96,117,94,4,503,94,"Good",50,13,"Yes","Yes" 14 | 3.98,122,35,2,393,136,"Medium",62,18,"Yes","No" 15 | 10.96,115,28,11,29,86,"Good",53,18,"Yes","Yes" 16 | 11.17,107,117,11,148,118,"Good",52,18,"Yes","Yes" 17 | 8.71,149,95,5,400,144,"Medium",76,18,"No","No" 18 | 7.58,118,32,0,284,110,"Good",63,13,"Yes","No" 19 | 12.29,147,74,13,251,131,"Good",52,10,"Yes","Yes" 20 | 13.91,110,110,0,408,68,"Good",46,17,"No","Yes" 21 | 8.73,129,76,16,58,121,"Medium",69,12,"Yes","Yes" 22 | 
6.41,125,90,2,367,131,"Medium",35,18,"Yes","Yes" 23 | 12.13,134,29,12,239,109,"Good",62,18,"No","Yes" 24 | 5.08,128,46,6,497,138,"Medium",42,13,"Yes","No" 25 | 5.87,121,31,0,292,109,"Medium",79,10,"Yes","No" 26 | 10.14,145,119,16,294,113,"Bad",42,12,"Yes","Yes" 27 | 14.9,139,32,0,176,82,"Good",54,11,"No","No" 28 | 8.33,107,115,11,496,131,"Good",50,11,"No","Yes" 29 | 5.27,98,118,0,19,107,"Medium",64,17,"Yes","No" 30 | 2.99,103,74,0,359,97,"Bad",55,11,"Yes","Yes" 31 | 7.81,104,99,15,226,102,"Bad",58,17,"Yes","Yes" 32 | 13.55,125,94,0,447,89,"Good",30,12,"Yes","No" 33 | 8.25,136,58,16,241,131,"Medium",44,18,"Yes","Yes" 34 | 6.2,107,32,12,236,137,"Good",64,10,"No","Yes" 35 | 8.77,114,38,13,317,128,"Good",50,16,"Yes","Yes" 36 | 2.67,115,54,0,406,128,"Medium",42,17,"Yes","Yes" 37 | 11.07,131,84,11,29,96,"Medium",44,17,"No","Yes" 38 | 8.89,122,76,0,270,100,"Good",60,18,"No","No" 39 | 4.95,121,41,5,412,110,"Medium",54,10,"Yes","Yes" 40 | 6.59,109,73,0,454,102,"Medium",65,15,"Yes","No" 41 | 3.24,130,60,0,144,138,"Bad",38,10,"No","No" 42 | 2.07,119,98,0,18,126,"Bad",73,17,"No","No" 43 | 7.96,157,53,0,403,124,"Bad",58,16,"Yes","No" 44 | 10.43,77,69,0,25,24,"Medium",50,18,"Yes","No" 45 | 4.12,123,42,11,16,134,"Medium",59,13,"Yes","Yes" 46 | 4.16,85,79,6,325,95,"Medium",69,13,"Yes","Yes" 47 | 4.56,141,63,0,168,135,"Bad",44,12,"Yes","Yes" 48 | 12.44,127,90,14,16,70,"Medium",48,15,"No","Yes" 49 | 4.38,126,98,0,173,108,"Bad",55,16,"Yes","No" 50 | 3.91,116,52,0,349,98,"Bad",69,18,"Yes","No" 51 | 10.61,157,93,0,51,149,"Good",32,17,"Yes","No" 52 | 1.42,99,32,18,341,108,"Bad",80,16,"Yes","Yes" 53 | 4.42,121,90,0,150,108,"Bad",75,16,"Yes","No" 54 | 7.91,153,40,3,112,129,"Bad",39,18,"Yes","Yes" 55 | 6.92,109,64,13,39,119,"Medium",61,17,"Yes","Yes" 56 | 4.9,134,103,13,25,144,"Medium",76,17,"No","Yes" 57 | 6.85,143,81,5,60,154,"Medium",61,18,"Yes","Yes" 58 | 11.91,133,82,0,54,84,"Medium",50,17,"Yes","No" 59 | 0.91,93,91,0,22,117,"Bad",75,11,"Yes","No" 60 | 5.42,103,93,15,188,103,"Bad",74,16,"Yes","Yes" 61 | 5.21,118,71,4,148,114,"Medium",80,13,"Yes","No" 62 | 8.32,122,102,19,469,123,"Bad",29,13,"Yes","Yes" 63 | 7.32,105,32,0,358,107,"Medium",26,13,"No","No" 64 | 1.82,139,45,0,146,133,"Bad",77,17,"Yes","Yes" 65 | 8.47,119,88,10,170,101,"Medium",61,13,"Yes","Yes" 66 | 7.8,100,67,12,184,104,"Medium",32,16,"No","Yes" 67 | 4.9,122,26,0,197,128,"Medium",55,13,"No","No" 68 | 8.85,127,92,0,508,91,"Medium",56,18,"Yes","No" 69 | 9.01,126,61,14,152,115,"Medium",47,16,"Yes","Yes" 70 | 13.39,149,69,20,366,134,"Good",60,13,"Yes","Yes" 71 | 7.99,127,59,0,339,99,"Medium",65,12,"Yes","No" 72 | 9.46,89,81,15,237,99,"Good",74,12,"Yes","Yes" 73 | 6.5,148,51,16,148,150,"Medium",58,17,"No","Yes" 74 | 5.52,115,45,0,432,116,"Medium",25,15,"Yes","No" 75 | 12.61,118,90,10,54,104,"Good",31,11,"No","Yes" 76 | 6.2,150,68,5,125,136,"Medium",64,13,"No","Yes" 77 | 8.55,88,111,23,480,92,"Bad",36,16,"No","Yes" 78 | 10.64,102,87,10,346,70,"Medium",64,15,"Yes","Yes" 79 | 7.7,118,71,12,44,89,"Medium",67,18,"No","Yes" 80 | 4.43,134,48,1,139,145,"Medium",65,12,"Yes","Yes" 81 | 9.14,134,67,0,286,90,"Bad",41,13,"Yes","No" 82 | 8.01,113,100,16,353,79,"Bad",68,11,"Yes","Yes" 83 | 7.52,116,72,0,237,128,"Good",70,13,"Yes","No" 84 | 11.62,151,83,4,325,139,"Good",28,17,"Yes","Yes" 85 | 4.42,109,36,7,468,94,"Bad",56,11,"Yes","Yes" 86 | 2.23,111,25,0,52,121,"Bad",43,18,"No","No" 87 | 8.47,125,103,0,304,112,"Medium",49,13,"No","No" 88 | 8.7,150,84,9,432,134,"Medium",64,15,"Yes","No" 89 | 11.7,131,67,7,272,126,"Good",54,16,"No","Yes" 90 | 
6.56,117,42,7,144,111,"Medium",62,10,"Yes","Yes" 91 | 7.95,128,66,3,493,119,"Medium",45,16,"No","No" 92 | 5.33,115,22,0,491,103,"Medium",64,11,"No","No" 93 | 4.81,97,46,11,267,107,"Medium",80,15,"Yes","Yes" 94 | 4.53,114,113,0,97,125,"Medium",29,12,"Yes","No" 95 | 8.86,145,30,0,67,104,"Medium",55,17,"Yes","No" 96 | 8.39,115,97,5,134,84,"Bad",55,11,"Yes","Yes" 97 | 5.58,134,25,10,237,148,"Medium",59,13,"Yes","Yes" 98 | 9.48,147,42,10,407,132,"Good",73,16,"No","Yes" 99 | 7.45,161,82,5,287,129,"Bad",33,16,"Yes","Yes" 100 | 12.49,122,77,24,382,127,"Good",36,16,"No","Yes" 101 | 4.88,121,47,3,220,107,"Bad",56,16,"No","Yes" 102 | 4.11,113,69,11,94,106,"Medium",76,12,"No","Yes" 103 | 6.2,128,93,0,89,118,"Medium",34,18,"Yes","No" 104 | 5.3,113,22,0,57,97,"Medium",65,16,"No","No" 105 | 5.07,123,91,0,334,96,"Bad",78,17,"Yes","Yes" 106 | 4.62,121,96,0,472,138,"Medium",51,12,"Yes","No" 107 | 5.55,104,100,8,398,97,"Medium",61,11,"Yes","Yes" 108 | 0.16,102,33,0,217,139,"Medium",70,18,"No","No" 109 | 8.55,134,107,0,104,108,"Medium",60,12,"Yes","No" 110 | 3.47,107,79,2,488,103,"Bad",65,16,"Yes","No" 111 | 8.98,115,65,0,217,90,"Medium",60,17,"No","No" 112 | 9,128,62,7,125,116,"Medium",43,14,"Yes","Yes" 113 | 6.62,132,118,12,272,151,"Medium",43,14,"Yes","Yes" 114 | 6.67,116,99,5,298,125,"Good",62,12,"Yes","Yes" 115 | 6.01,131,29,11,335,127,"Bad",33,12,"Yes","Yes" 116 | 9.31,122,87,9,17,106,"Medium",65,13,"Yes","Yes" 117 | 8.54,139,35,0,95,129,"Medium",42,13,"Yes","No" 118 | 5.08,135,75,0,202,128,"Medium",80,10,"No","No" 119 | 8.8,145,53,0,507,119,"Medium",41,12,"Yes","No" 120 | 7.57,112,88,2,243,99,"Medium",62,11,"Yes","Yes" 121 | 7.37,130,94,8,137,128,"Medium",64,12,"Yes","Yes" 122 | 6.87,128,105,11,249,131,"Medium",63,13,"Yes","Yes" 123 | 11.67,125,89,10,380,87,"Bad",28,10,"Yes","Yes" 124 | 6.88,119,100,5,45,108,"Medium",75,10,"Yes","Yes" 125 | 8.19,127,103,0,125,155,"Good",29,15,"No","Yes" 126 | 8.87,131,113,0,181,120,"Good",63,14,"Yes","No" 127 | 9.34,89,78,0,181,49,"Medium",43,15,"No","No" 128 | 11.27,153,68,2,60,133,"Good",59,16,"Yes","Yes" 129 | 6.52,125,48,3,192,116,"Medium",51,14,"Yes","Yes" 130 | 4.96,133,100,3,350,126,"Bad",55,13,"Yes","Yes" 131 | 4.47,143,120,7,279,147,"Bad",40,10,"No","Yes" 132 | 8.41,94,84,13,497,77,"Medium",51,12,"Yes","Yes" 133 | 6.5,108,69,3,208,94,"Medium",77,16,"Yes","No" 134 | 9.54,125,87,9,232,136,"Good",72,10,"Yes","Yes" 135 | 7.62,132,98,2,265,97,"Bad",62,12,"Yes","Yes" 136 | 3.67,132,31,0,327,131,"Medium",76,16,"Yes","No" 137 | 6.44,96,94,14,384,120,"Medium",36,18,"No","Yes" 138 | 5.17,131,75,0,10,120,"Bad",31,18,"No","No" 139 | 6.52,128,42,0,436,118,"Medium",80,11,"Yes","No" 140 | 10.27,125,103,12,371,109,"Medium",44,10,"Yes","Yes" 141 | 12.3,146,62,10,310,94,"Medium",30,13,"No","Yes" 142 | 6.03,133,60,10,277,129,"Medium",45,18,"Yes","Yes" 143 | 6.53,140,42,0,331,131,"Bad",28,15,"Yes","No" 144 | 7.44,124,84,0,300,104,"Medium",77,15,"Yes","No" 145 | 0.53,122,88,7,36,159,"Bad",28,17,"Yes","Yes" 146 | 9.09,132,68,0,264,123,"Good",34,11,"No","No" 147 | 8.77,144,63,11,27,117,"Medium",47,17,"Yes","Yes" 148 | 3.9,114,83,0,412,131,"Bad",39,14,"Yes","No" 149 | 10.51,140,54,9,402,119,"Good",41,16,"No","Yes" 150 | 7.56,110,119,0,384,97,"Medium",72,14,"No","Yes" 151 | 11.48,121,120,13,140,87,"Medium",56,11,"Yes","Yes" 152 | 10.49,122,84,8,176,114,"Good",57,10,"No","Yes" 153 | 10.77,111,58,17,407,103,"Good",75,17,"No","Yes" 154 | 7.64,128,78,0,341,128,"Good",45,13,"No","No" 155 | 5.93,150,36,7,488,150,"Medium",25,17,"No","Yes" 156 | 
6.89,129,69,10,289,110,"Medium",50,16,"No","Yes" 157 | 7.71,98,72,0,59,69,"Medium",65,16,"Yes","No" 158 | 7.49,146,34,0,220,157,"Good",51,16,"Yes","No" 159 | 10.21,121,58,8,249,90,"Medium",48,13,"No","Yes" 160 | 12.53,142,90,1,189,112,"Good",39,10,"No","Yes" 161 | 9.32,119,60,0,372,70,"Bad",30,18,"No","No" 162 | 4.67,111,28,0,486,111,"Medium",29,12,"No","No" 163 | 2.93,143,21,5,81,160,"Medium",67,12,"No","Yes" 164 | 3.63,122,74,0,424,149,"Medium",51,13,"Yes","No" 165 | 5.68,130,64,0,40,106,"Bad",39,17,"No","No" 166 | 8.22,148,64,0,58,141,"Medium",27,13,"No","Yes" 167 | 0.37,147,58,7,100,191,"Bad",27,15,"Yes","Yes" 168 | 6.71,119,67,17,151,137,"Medium",55,11,"Yes","Yes" 169 | 6.71,106,73,0,216,93,"Medium",60,13,"Yes","No" 170 | 7.3,129,89,0,425,117,"Medium",45,10,"Yes","No" 171 | 11.48,104,41,15,492,77,"Good",73,18,"Yes","Yes" 172 | 8.01,128,39,12,356,118,"Medium",71,10,"Yes","Yes" 173 | 12.49,93,106,12,416,55,"Medium",75,15,"Yes","Yes" 174 | 9.03,104,102,13,123,110,"Good",35,16,"Yes","Yes" 175 | 6.38,135,91,5,207,128,"Medium",66,18,"Yes","Yes" 176 | 0,139,24,0,358,185,"Medium",79,15,"No","No" 177 | 7.54,115,89,0,38,122,"Medium",25,12,"Yes","No" 178 | 5.61,138,107,9,480,154,"Medium",47,11,"No","Yes" 179 | 10.48,138,72,0,148,94,"Medium",27,17,"Yes","Yes" 180 | 10.66,104,71,14,89,81,"Medium",25,14,"No","Yes" 181 | 7.78,144,25,3,70,116,"Medium",77,18,"Yes","Yes" 182 | 4.94,137,112,15,434,149,"Bad",66,13,"Yes","Yes" 183 | 7.43,121,83,0,79,91,"Medium",68,11,"Yes","No" 184 | 4.74,137,60,4,230,140,"Bad",25,13,"Yes","No" 185 | 5.32,118,74,6,426,102,"Medium",80,18,"Yes","Yes" 186 | 9.95,132,33,7,35,97,"Medium",60,11,"No","Yes" 187 | 10.07,130,100,11,449,107,"Medium",64,10,"Yes","Yes" 188 | 8.68,120,51,0,93,86,"Medium",46,17,"No","No" 189 | 6.03,117,32,0,142,96,"Bad",62,17,"Yes","No" 190 | 8.07,116,37,0,426,90,"Medium",76,15,"Yes","No" 191 | 12.11,118,117,18,509,104,"Medium",26,15,"No","Yes" 192 | 8.79,130,37,13,297,101,"Medium",37,13,"No","Yes" 193 | 6.67,156,42,13,170,173,"Good",74,14,"Yes","Yes" 194 | 7.56,108,26,0,408,93,"Medium",56,14,"No","No" 195 | 13.28,139,70,7,71,96,"Good",61,10,"Yes","Yes" 196 | 7.23,112,98,18,481,128,"Medium",45,11,"Yes","Yes" 197 | 4.19,117,93,4,420,112,"Bad",66,11,"Yes","Yes" 198 | 4.1,130,28,6,410,133,"Bad",72,16,"Yes","Yes" 199 | 2.52,124,61,0,333,138,"Medium",76,16,"Yes","No" 200 | 3.62,112,80,5,500,128,"Medium",69,10,"Yes","Yes" 201 | 6.42,122,88,5,335,126,"Medium",64,14,"Yes","Yes" 202 | 5.56,144,92,0,349,146,"Medium",62,12,"No","No" 203 | 5.94,138,83,0,139,134,"Medium",54,18,"Yes","No" 204 | 4.1,121,78,4,413,130,"Bad",46,10,"No","Yes" 205 | 2.05,131,82,0,132,157,"Bad",25,14,"Yes","No" 206 | 8.74,155,80,0,237,124,"Medium",37,14,"Yes","No" 207 | 5.68,113,22,1,317,132,"Medium",28,12,"Yes","No" 208 | 4.97,162,67,0,27,160,"Medium",77,17,"Yes","Yes" 209 | 8.19,111,105,0,466,97,"Bad",61,10,"No","No" 210 | 7.78,86,54,0,497,64,"Bad",33,12,"Yes","No" 211 | 3.02,98,21,11,326,90,"Bad",76,11,"No","Yes" 212 | 4.36,125,41,2,357,123,"Bad",47,14,"No","Yes" 213 | 9.39,117,118,14,445,120,"Medium",32,15,"Yes","Yes" 214 | 12.04,145,69,19,501,105,"Medium",45,11,"Yes","Yes" 215 | 8.23,149,84,5,220,139,"Medium",33,10,"Yes","Yes" 216 | 4.83,115,115,3,48,107,"Medium",73,18,"Yes","Yes" 217 | 2.34,116,83,15,170,144,"Bad",71,11,"Yes","Yes" 218 | 5.73,141,33,0,243,144,"Medium",34,17,"Yes","No" 219 | 4.34,106,44,0,481,111,"Medium",70,14,"No","No" 220 | 9.7,138,61,12,156,120,"Medium",25,14,"Yes","Yes" 221 | 10.62,116,79,19,359,116,"Good",58,17,"Yes","Yes" 222 | 
10.59,131,120,15,262,124,"Medium",30,10,"Yes","Yes" 223 | 6.43,124,44,0,125,107,"Medium",80,11,"Yes","No" 224 | 7.49,136,119,6,178,145,"Medium",35,13,"Yes","Yes" 225 | 3.45,110,45,9,276,125,"Medium",62,14,"Yes","Yes" 226 | 4.1,134,82,0,464,141,"Medium",48,13,"No","No" 227 | 6.68,107,25,0,412,82,"Bad",36,14,"Yes","No" 228 | 7.8,119,33,0,245,122,"Good",56,14,"Yes","No" 229 | 8.69,113,64,10,68,101,"Medium",57,16,"Yes","Yes" 230 | 5.4,149,73,13,381,163,"Bad",26,11,"No","Yes" 231 | 11.19,98,104,0,404,72,"Medium",27,18,"No","No" 232 | 5.16,115,60,0,119,114,"Bad",38,14,"No","No" 233 | 8.09,132,69,0,123,122,"Medium",27,11,"No","No" 234 | 13.14,137,80,10,24,105,"Good",61,15,"Yes","Yes" 235 | 8.65,123,76,18,218,120,"Medium",29,14,"No","Yes" 236 | 9.43,115,62,11,289,129,"Good",56,16,"No","Yes" 237 | 5.53,126,32,8,95,132,"Medium",50,17,"Yes","Yes" 238 | 9.32,141,34,16,361,108,"Medium",69,10,"Yes","Yes" 239 | 9.62,151,28,8,499,135,"Medium",48,10,"Yes","Yes" 240 | 7.36,121,24,0,200,133,"Good",73,13,"Yes","No" 241 | 3.89,123,105,0,149,118,"Bad",62,16,"Yes","Yes" 242 | 10.31,159,80,0,362,121,"Medium",26,18,"Yes","No" 243 | 12.01,136,63,0,160,94,"Medium",38,12,"Yes","No" 244 | 4.68,124,46,0,199,135,"Medium",52,14,"No","No" 245 | 7.82,124,25,13,87,110,"Medium",57,10,"Yes","Yes" 246 | 8.78,130,30,0,391,100,"Medium",26,18,"Yes","No" 247 | 10,114,43,0,199,88,"Good",57,10,"No","Yes" 248 | 6.9,120,56,20,266,90,"Bad",78,18,"Yes","Yes" 249 | 5.04,123,114,0,298,151,"Bad",34,16,"Yes","No" 250 | 5.36,111,52,0,12,101,"Medium",61,11,"Yes","Yes" 251 | 5.05,125,67,0,86,117,"Bad",65,11,"Yes","No" 252 | 9.16,137,105,10,435,156,"Good",72,14,"Yes","Yes" 253 | 3.72,139,111,5,310,132,"Bad",62,13,"Yes","Yes" 254 | 8.31,133,97,0,70,117,"Medium",32,16,"Yes","No" 255 | 5.64,124,24,5,288,122,"Medium",57,12,"No","Yes" 256 | 9.58,108,104,23,353,129,"Good",37,17,"Yes","Yes" 257 | 7.71,123,81,8,198,81,"Bad",80,15,"Yes","Yes" 258 | 4.2,147,40,0,277,144,"Medium",73,10,"Yes","No" 259 | 8.67,125,62,14,477,112,"Medium",80,13,"Yes","Yes" 260 | 3.47,108,38,0,251,81,"Bad",72,14,"No","No" 261 | 5.12,123,36,10,467,100,"Bad",74,11,"No","Yes" 262 | 7.67,129,117,8,400,101,"Bad",36,10,"Yes","Yes" 263 | 5.71,121,42,4,188,118,"Medium",54,15,"Yes","Yes" 264 | 6.37,120,77,15,86,132,"Medium",48,18,"Yes","Yes" 265 | 7.77,116,26,6,434,115,"Medium",25,17,"Yes","Yes" 266 | 6.95,128,29,5,324,159,"Good",31,15,"Yes","Yes" 267 | 5.31,130,35,10,402,129,"Bad",39,17,"Yes","Yes" 268 | 9.1,128,93,12,343,112,"Good",73,17,"No","Yes" 269 | 5.83,134,82,7,473,112,"Bad",51,12,"No","Yes" 270 | 6.53,123,57,0,66,105,"Medium",39,11,"Yes","No" 271 | 5.01,159,69,0,438,166,"Medium",46,17,"Yes","No" 272 | 11.99,119,26,0,284,89,"Good",26,10,"Yes","No" 273 | 4.55,111,56,0,504,110,"Medium",62,16,"Yes","No" 274 | 12.98,113,33,0,14,63,"Good",38,12,"Yes","No" 275 | 10.04,116,106,8,244,86,"Medium",58,12,"Yes","Yes" 276 | 7.22,135,93,2,67,119,"Medium",34,11,"Yes","Yes" 277 | 6.67,107,119,11,210,132,"Medium",53,11,"Yes","Yes" 278 | 6.93,135,69,14,296,130,"Medium",73,15,"Yes","Yes" 279 | 7.8,136,48,12,326,125,"Medium",36,16,"Yes","Yes" 280 | 7.22,114,113,2,129,151,"Good",40,15,"No","Yes" 281 | 3.42,141,57,13,376,158,"Medium",64,18,"Yes","Yes" 282 | 2.86,121,86,10,496,145,"Bad",51,10,"Yes","Yes" 283 | 11.19,122,69,7,303,105,"Good",45,16,"No","Yes" 284 | 7.74,150,96,0,80,154,"Good",61,11,"Yes","No" 285 | 5.36,135,110,0,112,117,"Medium",80,16,"No","No" 286 | 6.97,106,46,11,414,96,"Bad",79,17,"No","No" 287 | 7.6,146,26,11,261,131,"Medium",39,10,"Yes","Yes" 288 | 
7.53,117,118,11,429,113,"Medium",67,18,"No","Yes" 289 | 6.88,95,44,4,208,72,"Bad",44,17,"Yes","Yes" 290 | 6.98,116,40,0,74,97,"Medium",76,15,"No","No" 291 | 8.75,143,77,25,448,156,"Medium",43,17,"Yes","Yes" 292 | 9.49,107,111,14,400,103,"Medium",41,11,"No","Yes" 293 | 6.64,118,70,0,106,89,"Bad",39,17,"Yes","No" 294 | 11.82,113,66,16,322,74,"Good",76,15,"Yes","Yes" 295 | 11.28,123,84,0,74,89,"Good",59,10,"Yes","No" 296 | 12.66,148,76,3,126,99,"Good",60,11,"Yes","Yes" 297 | 4.21,118,35,14,502,137,"Medium",79,10,"No","Yes" 298 | 8.21,127,44,13,160,123,"Good",63,18,"Yes","Yes" 299 | 3.07,118,83,13,276,104,"Bad",75,10,"Yes","Yes" 300 | 10.98,148,63,0,312,130,"Good",63,15,"Yes","No" 301 | 9.4,135,40,17,497,96,"Medium",54,17,"No","Yes" 302 | 8.57,116,78,1,158,99,"Medium",45,11,"Yes","Yes" 303 | 7.41,99,93,0,198,87,"Medium",57,16,"Yes","Yes" 304 | 5.28,108,77,13,388,110,"Bad",74,14,"Yes","Yes" 305 | 10.01,133,52,16,290,99,"Medium",43,11,"Yes","Yes" 306 | 11.93,123,98,12,408,134,"Good",29,10,"Yes","Yes" 307 | 8.03,115,29,26,394,132,"Medium",33,13,"Yes","Yes" 308 | 4.78,131,32,1,85,133,"Medium",48,12,"Yes","Yes" 309 | 5.9,138,92,0,13,120,"Bad",61,12,"Yes","No" 310 | 9.24,126,80,19,436,126,"Medium",52,10,"Yes","Yes" 311 | 11.18,131,111,13,33,80,"Bad",68,18,"Yes","Yes" 312 | 9.53,175,65,29,419,166,"Medium",53,12,"Yes","Yes" 313 | 6.15,146,68,12,328,132,"Bad",51,14,"Yes","Yes" 314 | 6.8,137,117,5,337,135,"Bad",38,10,"Yes","Yes" 315 | 9.33,103,81,3,491,54,"Medium",66,13,"Yes","No" 316 | 7.72,133,33,10,333,129,"Good",71,14,"Yes","Yes" 317 | 6.39,131,21,8,220,171,"Good",29,14,"Yes","Yes" 318 | 15.63,122,36,5,369,72,"Good",35,10,"Yes","Yes" 319 | 6.41,142,30,0,472,136,"Good",80,15,"No","No" 320 | 10.08,116,72,10,456,130,"Good",41,14,"No","Yes" 321 | 6.97,127,45,19,459,129,"Medium",57,11,"No","Yes" 322 | 5.86,136,70,12,171,152,"Medium",44,18,"Yes","Yes" 323 | 7.52,123,39,5,499,98,"Medium",34,15,"Yes","No" 324 | 9.16,140,50,10,300,139,"Good",60,15,"Yes","Yes" 325 | 10.36,107,105,18,428,103,"Medium",34,12,"Yes","Yes" 326 | 2.66,136,65,4,133,150,"Bad",53,13,"Yes","Yes" 327 | 11.7,144,69,11,131,104,"Medium",47,11,"Yes","Yes" 328 | 4.69,133,30,0,152,122,"Medium",53,17,"Yes","No" 329 | 6.23,112,38,17,316,104,"Medium",80,16,"Yes","Yes" 330 | 3.15,117,66,1,65,111,"Bad",55,11,"Yes","Yes" 331 | 11.27,100,54,9,433,89,"Good",45,12,"Yes","Yes" 332 | 4.99,122,59,0,501,112,"Bad",32,14,"No","No" 333 | 10.1,135,63,15,213,134,"Medium",32,10,"Yes","Yes" 334 | 5.74,106,33,20,354,104,"Medium",61,12,"Yes","Yes" 335 | 5.87,136,60,7,303,147,"Medium",41,10,"Yes","Yes" 336 | 7.63,93,117,9,489,83,"Bad",42,13,"Yes","Yes" 337 | 6.18,120,70,15,464,110,"Medium",72,15,"Yes","Yes" 338 | 5.17,138,35,6,60,143,"Bad",28,18,"Yes","No" 339 | 8.61,130,38,0,283,102,"Medium",80,15,"Yes","No" 340 | 5.97,112,24,0,164,101,"Medium",45,11,"Yes","No" 341 | 11.54,134,44,4,219,126,"Good",44,15,"Yes","Yes" 342 | 7.5,140,29,0,105,91,"Bad",43,16,"Yes","No" 343 | 7.38,98,120,0,268,93,"Medium",72,10,"No","No" 344 | 7.81,137,102,13,422,118,"Medium",71,10,"No","Yes" 345 | 5.99,117,42,10,371,121,"Bad",26,14,"Yes","Yes" 346 | 8.43,138,80,0,108,126,"Good",70,13,"No","Yes" 347 | 4.81,121,68,0,279,149,"Good",79,12,"Yes","No" 348 | 8.97,132,107,0,144,125,"Medium",33,13,"No","No" 349 | 6.88,96,39,0,161,112,"Good",27,14,"No","No" 350 | 12.57,132,102,20,459,107,"Good",49,11,"Yes","Yes" 351 | 9.32,134,27,18,467,96,"Medium",49,14,"No","Yes" 352 | 8.64,111,101,17,266,91,"Medium",63,17,"No","Yes" 353 | 10.44,124,115,16,458,105,"Medium",62,16,"No","Yes" 354 | 
13.44,133,103,14,288,122,"Good",61,17,"Yes","Yes" 355 | 9.45,107,67,12,430,92,"Medium",35,12,"No","Yes" 356 | 5.3,133,31,1,80,145,"Medium",42,18,"Yes","Yes" 357 | 7.02,130,100,0,306,146,"Good",42,11,"Yes","No" 358 | 3.58,142,109,0,111,164,"Good",72,12,"Yes","No" 359 | 13.36,103,73,3,276,72,"Medium",34,15,"Yes","Yes" 360 | 4.17,123,96,10,71,118,"Bad",69,11,"Yes","Yes" 361 | 3.13,130,62,11,396,130,"Bad",66,14,"Yes","Yes" 362 | 8.77,118,86,7,265,114,"Good",52,15,"No","Yes" 363 | 8.68,131,25,10,183,104,"Medium",56,15,"No","Yes" 364 | 5.25,131,55,0,26,110,"Bad",79,12,"Yes","Yes" 365 | 10.26,111,75,1,377,108,"Good",25,12,"Yes","No" 366 | 10.5,122,21,16,488,131,"Good",30,14,"Yes","Yes" 367 | 6.53,154,30,0,122,162,"Medium",57,17,"No","No" 368 | 5.98,124,56,11,447,134,"Medium",53,12,"No","Yes" 369 | 14.37,95,106,0,256,53,"Good",52,17,"Yes","No" 370 | 10.71,109,22,10,348,79,"Good",74,14,"No","Yes" 371 | 10.26,135,100,22,463,122,"Medium",36,14,"Yes","Yes" 372 | 7.68,126,41,22,403,119,"Bad",42,12,"Yes","Yes" 373 | 9.08,152,81,0,191,126,"Medium",54,16,"Yes","No" 374 | 7.8,121,50,0,508,98,"Medium",65,11,"No","No" 375 | 5.58,137,71,0,402,116,"Medium",78,17,"Yes","No" 376 | 9.44,131,47,7,90,118,"Medium",47,12,"Yes","Yes" 377 | 7.9,132,46,4,206,124,"Medium",73,11,"Yes","No" 378 | 16.27,141,60,19,319,92,"Good",44,11,"Yes","Yes" 379 | 6.81,132,61,0,263,125,"Medium",41,12,"No","No" 380 | 6.11,133,88,3,105,119,"Medium",79,12,"Yes","Yes" 381 | 5.81,125,111,0,404,107,"Bad",54,15,"Yes","No" 382 | 9.64,106,64,10,17,89,"Medium",68,17,"Yes","Yes" 383 | 3.9,124,65,21,496,151,"Bad",77,13,"Yes","Yes" 384 | 4.95,121,28,19,315,121,"Medium",66,14,"Yes","Yes" 385 | 9.35,98,117,0,76,68,"Medium",63,10,"Yes","No" 386 | 12.85,123,37,15,348,112,"Good",28,12,"Yes","Yes" 387 | 5.87,131,73,13,455,132,"Medium",62,17,"Yes","Yes" 388 | 5.32,152,116,0,170,160,"Medium",39,16,"Yes","No" 389 | 8.67,142,73,14,238,115,"Medium",73,14,"No","Yes" 390 | 8.14,135,89,11,245,78,"Bad",79,16,"Yes","Yes" 391 | 8.44,128,42,8,328,107,"Medium",35,12,"Yes","Yes" 392 | 5.47,108,75,9,61,111,"Medium",67,12,"Yes","Yes" 393 | 6.1,153,63,0,49,124,"Bad",56,16,"Yes","No" 394 | 4.53,129,42,13,315,130,"Bad",34,13,"Yes","Yes" 395 | 5.57,109,51,10,26,120,"Medium",30,17,"No","Yes" 396 | 5.35,130,58,19,366,139,"Bad",33,16,"Yes","Yes" 397 | 12.57,138,108,17,203,128,"Good",33,14,"Yes","Yes" 398 | 6.14,139,23,3,37,120,"Medium",55,11,"No","Yes" 399 | 7.41,162,26,12,368,159,"Medium",40,18,"Yes","Yes" 400 | 5.94,100,79,7,284,95,"Bad",50,12,"Yes","Yes" 401 | 9.71,134,37,0,27,120,"Good",49,16,"Yes","Yes" 402 | -------------------------------------------------------------------------------- /data/Hitters.csv: -------------------------------------------------------------------------------- 1 | Player,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague 2 | -Andy Allanson,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A 3 | -Alan Ashby,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475,N 4 | -Alvin Davis,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480,A 5 | -Andre Dawson,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500,N 6 | -Andres Galarraga,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N 7 | -Alfredo Griffin,594,169,4,74,51,35,11,4408,1133,19,501,336,194,A,W,282,421,25,750,A 8 | -Al Newman,185,37,1,23,8,21,2,214,42,1,30,9,24,N,E,76,127,7,70,A 9 | -Argenis Salazar,298,73,0,24,24,7,3,509,108,0,41,37,12,A,W,121,283,9,100,A 10 | -Andres 
Thomas,323,81,6,26,32,8,2,341,86,6,32,34,8,N,W,143,290,19,75,N 11 | -Andre Thornton,401,92,17,49,66,65,13,5206,1332,253,784,890,866,A,E,0,0,0,1100,A 12 | -Alan Trammell,574,159,21,107,75,59,10,4631,1300,90,702,504,488,A,E,238,445,22,517.143,A 13 | -Alex Trevino,202,53,4,31,26,27,9,1876,467,15,192,186,161,N,W,304,45,11,512.5,N 14 | -Andy VanSlyke,418,113,13,48,61,47,4,1512,392,41,205,204,203,N,E,211,11,7,550,N 15 | -Alan Wiggins,239,60,0,30,11,22,6,1941,510,4,309,103,207,A,E,121,151,6,700,A 16 | -Bill Almon,196,43,7,29,27,30,13,3231,825,36,376,290,238,N,E,80,45,8,240,N 17 | -Billy Beane,183,39,3,20,15,11,3,201,42,3,20,16,11,A,W,118,0,0,,A 18 | -Buddy Bell,568,158,20,89,75,73,15,8068,2273,177,1045,993,732,N,W,105,290,10,775,N 19 | -Buddy Biancalana,190,46,2,24,8,15,5,479,102,5,65,23,39,A,W,102,177,16,175,A 20 | -Bruce Bochte,407,104,6,57,43,65,12,5233,1478,100,643,658,653,A,W,912,88,9,,A 21 | -Bruce Bochy,127,32,8,16,22,14,8,727,180,24,67,82,56,N,W,202,22,2,135,N 22 | -Barry Bonds,413,92,16,72,48,65,1,413,92,16,72,48,65,N,E,280,9,5,100,N 23 | -Bobby Bonilla,426,109,3,55,43,62,1,426,109,3,55,43,62,A,W,361,22,2,115,N 24 | -Bob Boone,22,10,1,4,2,1,6,84,26,2,9,9,3,A,W,812,84,11,,A 25 | -Bob Brenly,472,116,16,60,62,74,6,1924,489,67,242,251,240,N,W,518,55,3,600,N 26 | -Bill Buckner,629,168,18,73,102,40,18,8424,2464,164,1008,1072,402,A,E,1067,157,14,776.667,A 27 | -Brett Butler,587,163,4,92,51,70,6,2695,747,17,442,198,317,A,E,434,9,3,765,A 28 | -Bob Dernier,324,73,4,32,18,22,7,1931,491,13,291,108,180,N,E,222,3,3,708.333,N 29 | -Bo Diaz,474,129,10,50,56,40,10,2331,604,61,246,327,166,N,W,732,83,13,750,N 30 | -Bill Doran,550,152,6,92,37,81,5,2308,633,32,349,182,308,N,W,262,329,16,625,N 31 | -Brian Downing,513,137,20,90,95,90,14,5201,1382,166,763,734,784,A,W,267,5,3,900,A 32 | -Bobby Grich,313,84,9,42,30,39,17,6890,1833,224,1033,864,1087,A,W,127,221,7,,A 33 | -Billy Hatcher,419,108,6,55,36,22,3,591,149,8,80,46,31,N,W,226,7,4,110,N 34 | -Bob Horner,517,141,27,70,87,52,9,3571,994,215,545,652,337,N,W,1378,102,8,,N 35 | -Brook Jacoby,583,168,17,83,80,56,5,1646,452,44,219,208,136,A,E,109,292,25,612.5,A 36 | -Bob Kearney,204,49,6,23,25,12,7,1309,308,27,126,132,66,A,W,419,46,5,300,A 37 | -Bill Madlock,379,106,10,38,60,30,14,6207,1906,146,859,803,571,N,W,72,170,24,850,N 38 | -Bobby Meacham,161,36,0,19,10,17,4,1053,244,3,156,86,107,A,E,70,149,12,,A 39 | -Bob Melvin,268,60,5,24,25,15,2,350,78,5,34,29,18,N,W,442,59,6,90,N 40 | -Ben Oglivie,346,98,5,31,53,30,16,5913,1615,235,784,901,560,A,E,0,0,0,,A 41 | -Bip Roberts,241,61,1,34,12,14,1,241,61,1,34,12,14,N,W,166,172,10,,N 42 | -BillyJo Robidoux,181,41,1,15,21,33,2,232,50,4,20,29,45,A,E,326,29,5,67.5,A 43 | -Bill Russell,216,54,0,21,18,15,18,7318,1926,46,796,627,483,N,W,103,84,5,,N 44 | -Billy Sample,200,57,6,23,14,14,9,2516,684,46,371,230,195,N,W,69,1,1,,N 45 | -Bill Schroeder,217,46,7,32,19,9,4,694,160,32,86,76,32,A,E,307,25,1,180,A 46 | -Butch Wynegar,194,40,7,19,29,30,11,4183,1069,64,486,493,608,A,E,325,22,2,,A 47 | -Chris Bando,254,68,2,28,26,22,6,999,236,21,108,117,118,A,E,359,30,4,305,A 48 | -Chris Brown,416,132,7,57,49,33,3,932,273,24,113,121,80,N,W,73,177,18,215,N 49 | -Carmen Castillo,205,57,8,34,32,9,5,756,192,32,117,107,51,A,E,58,4,4,247.5,A 50 | -Cecil Cooper,542,140,12,46,75,41,16,7099,2130,235,987,1089,431,A,E,697,61,9,,A 51 | -Chili Davis,526,146,13,71,70,84,6,2648,715,77,352,342,289,N,W,303,9,9,815,N 52 | -Carlton Fisk,457,101,14,42,63,22,17,6521,1767,281,1003,977,619,A,W,389,39,4,875,A 53 | -Curt 
Ford,214,53,2,30,29,23,2,226,59,2,32,32,27,N,E,109,7,3,70,N 54 | -Cliff Johnson,19,7,0,1,2,1,4,41,13,1,3,4,4,A,E,0,0,0,,A 55 | -Carney Lansford,591,168,19,80,72,39,9,4478,1307,113,634,563,319,A,W,67,147,4,1200,A 56 | -Chet Lemon,403,101,12,45,53,39,12,5150,1429,166,747,666,526,A,E,316,6,5,675,A 57 | -Candy Maldonado,405,102,18,49,85,20,6,950,231,29,99,138,64,N,W,161,10,3,415,N 58 | -Carmelo Martinez,244,58,9,28,25,35,4,1335,333,49,164,179,194,N,W,142,14,2,340,N 59 | -Charlie Moore,235,61,3,24,39,21,14,3926,1029,35,441,401,333,A,E,425,43,4,,A 60 | -Craig Reynolds,313,78,6,32,41,12,12,3742,968,35,409,321,170,N,W,106,206,7,416.667,N 61 | -Cal Ripken,627,177,25,98,81,70,6,3210,927,133,529,472,313,A,E,240,482,13,1350,A 62 | -Cory Snyder,416,113,24,58,69,16,1,416,113,24,58,69,16,A,E,203,70,10,90,A 63 | -Chris Speier,155,44,6,21,23,15,16,6631,1634,98,698,661,777,N,E,53,88,3,275,N 64 | -Curt Wilkerson,236,56,0,27,15,11,4,1115,270,1,116,64,57,A,W,125,199,13,230,A 65 | -Dave Anderson,216,53,1,31,15,22,4,926,210,9,118,69,114,N,W,73,152,11,225,N 66 | -Doug Baker,24,3,0,1,0,2,3,159,28,0,20,12,9,A,W,80,4,0,,A 67 | -Don Baylor,585,139,31,93,94,62,17,7546,1982,315,1141,1179,727,A,E,0,0,0,950,A 68 | -Dann Bilardello,191,37,4,12,17,14,4,773,163,16,61,74,52,N,E,391,38,8,,N 69 | -Daryl Boston,199,53,5,29,22,21,3,514,120,8,57,40,39,A,W,152,3,5,75,A 70 | -Darnell Coles,521,142,20,67,86,45,4,815,205,22,99,103,78,A,E,107,242,23,105,A 71 | -Dave Collins,419,113,1,44,27,44,12,4484,1231,32,612,344,422,A,E,211,2,1,,A 72 | -Dave Concepcion,311,81,3,42,30,26,17,8247,2198,100,950,909,690,N,W,153,223,10,320,N 73 | -Darren Daulton,138,31,8,18,21,38,3,244,53,12,33,32,55,N,E,244,21,4,,N 74 | -Doug DeCinces,512,131,26,69,96,52,14,5347,1397,221,712,815,548,A,W,119,216,12,850,A 75 | -Darrell Evans,507,122,29,78,85,91,18,7761,1947,347,1175,1152,1380,A,E,808,108,2,535,A 76 | -Dwight Evans,529,137,26,86,97,97,15,6661,1785,291,1082,949,989,A,E,280,10,5,933.333,A 77 | -Damaso Garcia,424,119,6,57,46,13,9,3651,1046,32,461,301,112,A,E,224,286,8,850,N 78 | -Dan Gladden,351,97,4,55,29,39,4,1258,353,16,196,110,117,N,W,226,7,3,210,A 79 | -Danny Heep,195,55,5,24,33,30,8,1313,338,25,144,149,153,N,E,83,2,1,,N 80 | -Dave Henderson,388,103,15,59,47,39,6,2174,555,80,285,274,186,A,W,182,9,4,325,A 81 | -Donnie Hill,339,96,4,37,29,23,4,1064,290,11,123,108,55,A,W,104,213,9,275,A 82 | -Dave Kingman,561,118,35,70,94,33,16,6677,1575,442,901,1210,608,A,W,463,32,8,,A 83 | -Davey Lopes,255,70,7,49,35,43,15,6311,1661,154,1019,608,820,N,E,51,54,8,450,N 84 | -Don Mattingly,677,238,31,117,113,53,5,2223,737,93,349,401,171,A,E,1377,100,6,1975,A 85 | -Darryl Motley,227,46,7,23,20,12,5,1325,324,44,156,158,67,A,W,92,2,2,,A 86 | -Dale Murphy,614,163,29,89,83,75,11,5017,1388,266,813,822,617,N,W,303,6,6,1900,N 87 | -Dwayne Murphy,329,83,9,50,39,56,9,3828,948,145,575,528,635,A,W,276,6,2,600,A 88 | -Dave Parker,637,174,31,89,116,56,14,6727,2024,247,978,1093,495,N,W,278,9,9,1041.667,N 89 | -Dan Pasqua,280,82,16,44,45,47,2,428,113,25,61,70,63,A,E,148,4,2,110,A 90 | -Darrell Porter,155,41,12,21,29,22,16,5409,1338,181,746,805,875,A,W,165,9,1,260,A 91 | -Dick Schofield,458,114,13,67,57,48,4,1350,298,28,160,123,122,A,W,246,389,18,475,A 92 | -Don Slaught,314,83,13,39,46,16,5,1457,405,28,156,159,76,A,W,533,40,4,431.5,A 93 | -Darryl Strawberry,475,123,27,76,93,72,4,1810,471,108,292,343,267,N,E,226,10,6,1220,N 94 | -Dale Sveum,317,78,7,35,35,32,1,317,78,7,35,35,32,A,E,45,122,26,70,A 95 | -Danny Tartabull,511,138,25,76,96,61,3,592,164,28,87,110,71,A,W,157,7,8,145,A 96 | -Dickie 
Thon,278,69,3,24,21,29,8,2079,565,32,258,192,162,N,W,142,210,10,,N 97 | -Denny Walling,382,119,13,54,58,36,12,2133,594,41,287,294,227,N,W,59,156,9,595,N 98 | -Dave Winfield,565,148,24,90,104,77,14,7287,2083,305,1135,1234,791,A,E,292,9,5,1861.46,A 99 | -Enos Cabell,277,71,2,27,29,14,15,5952,1647,60,753,596,259,N,W,360,32,5,,N 100 | -Eric Davis,415,115,27,97,71,68,3,711,184,45,156,119,99,N,W,274,2,7,300,N 101 | -Eddie Milner,424,110,15,70,47,36,7,2130,544,38,335,174,258,N,W,292,6,3,490,N 102 | -Eddie Murray,495,151,17,61,84,78,10,5624,1679,275,884,1015,709,A,E,1045,88,13,2460,A 103 | -Ernest Riles,524,132,9,69,47,54,2,972,260,14,123,92,90,A,E,212,327,20,,A 104 | -Ed Romero,233,49,2,41,23,18,8,1350,336,7,166,122,106,A,E,102,132,10,375,A 105 | -Ernie Whitt,395,106,16,48,56,35,10,2303,571,86,266,323,248,A,E,709,41,7,,A 106 | -Fred Lynn,397,114,23,67,67,53,13,5589,1632,241,906,926,716,A,E,244,2,4,,A 107 | -Floyd Rayford,210,37,8,15,19,15,6,994,244,36,107,114,53,A,E,40,115,15,,A 108 | -Franklin Stubbs,420,95,23,55,58,37,3,646,139,31,77,77,61,N,W,206,10,7,,N 109 | -Frank White,566,154,22,76,84,43,14,6100,1583,131,743,693,300,A,W,316,439,10,750,A 110 | -George Bell,641,198,31,101,108,41,5,2129,610,92,297,319,117,A,E,269,17,10,1175,A 111 | -Glenn Braggs,215,51,4,19,18,11,1,215,51,4,19,18,11,A,E,116,5,12,70,A 112 | -George Brett,441,128,16,70,73,80,14,6675,2095,209,1072,1050,695,A,W,97,218,16,1500,A 113 | -Greg Brock,325,76,16,33,52,37,5,1506,351,71,195,219,214,N,W,726,87,3,385,A 114 | -Gary Carter,490,125,24,81,105,62,13,6063,1646,271,847,999,680,N,E,869,62,8,1925.571,N 115 | -Glenn Davis,574,152,31,91,101,64,3,985,260,53,148,173,95,N,W,1253,111,11,215,N 116 | -George Foster,284,64,14,30,42,24,18,7023,1925,348,986,1239,666,N,E,96,4,4,,N 117 | -Gary Gaetti,596,171,34,91,108,52,6,2862,728,107,361,401,224,A,W,118,334,21,900,A 118 | -Greg Gagne,472,118,12,63,54,30,4,793,187,14,102,80,50,A,W,228,377,26,155,A 119 | -George Hendrick,283,77,14,45,47,26,16,6840,1910,259,915,1067,546,A,W,144,6,5,700,A 120 | -Glenn Hubbard,408,94,4,42,36,66,9,3573,866,59,429,365,410,N,W,282,487,19,535,N 121 | -Garth Iorg,327,85,3,30,44,20,8,2140,568,16,216,208,93,A,E,91,185,12,362.5,A 122 | -Gary Matthews,370,96,21,49,46,60,15,6986,1972,231,1070,955,921,N,E,137,5,9,733.333,N 123 | -Graig Nettles,354,77,16,36,55,41,20,8716,2172,384,1172,1267,1057,N,W,83,174,16,200,N 124 | -Gary Pettis,539,139,5,93,58,69,5,1469,369,12,247,126,198,A,W,462,9,7,400,A 125 | -Gary Redus,340,84,11,62,33,47,5,1516,376,42,284,141,219,N,E,185,8,4,400,A 126 | -Garry Templeton,510,126,2,42,44,35,11,5562,1578,44,703,519,256,N,W,207,358,20,737.5,N 127 | -Gorman Thomas,315,59,16,45,36,58,13,4677,1051,268,681,782,697,A,W,0,0,0,,A 128 | -Greg Walker,282,78,13,37,51,29,5,1649,453,73,211,280,138,A,W,670,57,5,500,A 129 | -Gary Ward,380,120,5,54,51,31,8,3118,900,92,444,419,240,A,W,237,8,1,600,A 130 | -Glenn Wilson,584,158,15,70,84,42,5,2358,636,58,265,316,134,N,E,331,20,4,662.5,N 131 | -Harold Baines,570,169,21,72,88,38,7,3754,1077,140,492,589,263,A,W,295,15,5,950,A 132 | -Hubie Brooks,306,104,14,50,58,25,7,2954,822,55,313,377,187,N,E,116,222,15,750,N 133 | -Howard Johnson,220,54,10,30,39,31,5,1185,299,40,145,154,128,N,E,50,136,20,297.5,N 134 | -Hal McRae,278,70,7,22,37,18,18,7186,2081,190,935,1088,643,A,W,0,0,0,325,A 135 | -Harold Reynolds,445,99,1,46,24,29,4,618,129,1,72,31,48,A,W,278,415,16,87.5,A 136 | -Harry Spilman,143,39,5,18,30,15,9,639,151,16,80,97,61,N,W,138,15,1,175,N 137 | -Herm Winningham,185,40,4,23,11,18,3,524,125,7,58,37,47,N,E,97,2,2,90,N 138 | 
-Jesse Barfield,589,170,40,107,108,69,6,2325,634,128,371,376,238,A,E,368,20,3,1237.5,A 139 | -Juan Beniquez,343,103,6,48,36,40,15,4338,1193,70,581,421,325,A,E,211,56,13,430,A 140 | -Juan Bonilla,284,69,1,33,18,25,5,1407,361,6,139,98,111,A,E,122,140,5,,N 141 | -John Cangelosi,438,103,2,65,32,71,2,440,103,2,67,32,71,A,W,276,7,9,100,N 142 | -Jose Canseco,600,144,33,85,117,65,2,696,173,38,101,130,69,A,W,319,4,14,165,A 143 | -Joe Carter,663,200,29,108,121,32,4,1447,404,57,210,222,68,A,E,241,8,6,250,A 144 | -Jack Clark,232,55,9,34,23,45,12,4405,1213,194,702,705,625,N,E,623,35,3,1300,N 145 | -Jose Cruz,479,133,10,48,72,55,17,7472,2147,153,980,1032,854,N,W,237,5,4,773.333,N 146 | -Julio Cruz,209,45,0,38,19,42,10,3859,916,23,557,279,478,A,W,132,205,5,,A 147 | -Jody Davis,528,132,21,61,74,41,6,2641,671,97,273,383,226,N,E,885,105,8,1008.333,N 148 | -Jim Dwyer,160,39,8,18,31,22,14,2128,543,56,304,268,298,A,E,33,3,0,275,A 149 | -Julio Franco,599,183,10,80,74,32,5,2482,715,27,330,326,158,A,E,231,374,18,775,A 150 | -Jim Gantner,497,136,7,58,38,26,11,3871,1066,40,450,367,241,A,E,304,347,10,850,A 151 | -Johnny Grubb,210,70,13,32,51,28,15,4040,1130,97,544,462,551,A,E,0,0,0,365,A 152 | -Jerry Hairston,225,61,5,32,26,26,11,1568,408,25,202,185,257,A,W,132,9,0,,A 153 | -Jack Howell,151,41,4,26,21,19,2,288,68,9,45,39,35,A,W,28,56,2,95,A 154 | -John Kruk,278,86,4,33,38,45,1,278,86,4,33,38,45,N,W,102,4,2,110,N 155 | -Jeffrey Leonard,341,95,6,48,42,20,10,2964,808,81,379,428,221,N,W,158,4,5,100,N 156 | -Jim Morrison,537,147,23,58,88,47,10,2744,730,97,302,351,174,N,E,92,257,20,277.5,N 157 | -John Moses,399,102,3,56,34,34,5,670,167,4,89,48,54,A,W,211,9,3,80,A 158 | -Jerry Mumphrey,309,94,5,37,32,26,13,4618,1330,57,616,522,436,N,E,161,3,3,600,N 159 | -Joe Orsulak,401,100,2,60,19,28,4,876,238,2,126,44,55,N,E,193,11,4,,N 160 | -Jorge Orta,336,93,9,35,46,23,15,5779,1610,128,730,741,497,A,W,0,0,0,,A 161 | -Jim Presley,616,163,27,83,107,32,3,1437,377,65,181,227,82,A,W,110,308,15,200,A 162 | -Jamie Quirk,219,47,8,24,26,17,12,1188,286,23,100,125,63,A,W,260,58,4,,A 163 | -Johnny Ray,579,174,7,67,78,58,6,3053,880,32,366,337,218,N,E,280,479,5,657,N 164 | -Jeff Reed,165,39,2,13,9,16,3,196,44,2,18,10,18,A,W,332,19,2,75,N 165 | -Jim Rice,618,200,20,98,110,62,13,7127,2163,351,1104,1289,564,A,E,330,16,8,2412.5,A 166 | -Jerry Royster,257,66,5,31,26,32,14,3910,979,33,518,324,382,N,W,87,166,14,250,A 167 | -John Russell,315,76,13,35,60,25,3,630,151,24,68,94,55,N,E,498,39,13,155,N 168 | -Juan Samuel,591,157,16,90,78,26,4,2020,541,52,310,226,91,N,E,290,440,25,640,N 169 | -John Shelby,404,92,11,54,49,18,6,1354,325,30,188,135,63,A,E,222,5,5,300,A 170 | -Joel Skinner,315,73,5,23,37,16,4,450,108,6,38,46,28,A,W,227,15,3,110,A 171 | -Jeff Stone,249,69,6,32,19,20,4,702,209,10,97,48,44,N,E,103,8,2,,N 172 | -Jim Sundberg,429,91,12,41,42,57,13,5590,1397,83,578,579,644,A,W,686,46,4,825,N 173 | -Jim Traber,212,54,13,28,44,18,2,233,59,13,31,46,20,A,E,243,23,5,,A 174 | -Jose Uribe,453,101,3,46,43,61,3,948,218,6,96,72,91,N,W,249,444,16,195,N 175 | -Jerry Willard,161,43,4,17,26,22,3,707,179,21,77,99,76,A,W,300,12,2,,A 176 | -Joel Youngblood,184,47,5,20,28,18,11,3327,890,74,419,382,304,N,W,49,2,0,450,N 177 | -Kevin Bass,591,184,20,83,79,38,5,1689,462,40,219,195,82,N,W,303,12,5,630,N 178 | -Kal Daniels,181,58,6,34,23,22,1,181,58,6,34,23,22,N,W,88,0,3,86.5,N 179 | -Kirk Gibson,441,118,28,84,86,68,8,2723,750,126,433,420,309,A,E,190,2,2,1300,A 180 | -Ken Griffey,490,150,21,69,58,35,14,6126,1839,121,983,707,600,A,E,96,5,3,1000,N 181 | -Keith 
Hernandez,551,171,13,94,83,94,13,6090,1840,128,969,900,917,N,E,1199,149,5,1800,N 182 | -Kent Hrbek,550,147,29,85,91,71,6,2816,815,117,405,474,319,A,W,1218,104,10,1310,A 183 | -Ken Landreaux,283,74,4,34,29,22,10,3919,1062,85,505,456,283,N,W,145,5,7,737.5,N 184 | -Kevin McReynolds,560,161,26,89,96,66,4,1789,470,65,233,260,155,N,W,332,9,8,625,N 185 | -Kevin Mitchell,328,91,12,51,43,33,2,342,94,12,51,44,33,N,E,145,59,8,125,N 186 | -Keith Moreland,586,159,12,72,79,53,9,3082,880,83,363,477,295,N,E,181,13,4,1043.333,N 187 | -Ken Oberkfell,503,136,5,62,48,83,10,3423,970,20,408,303,414,N,W,65,258,8,725,N 188 | -Ken Phelps,344,85,24,69,64,88,7,911,214,64,150,156,187,A,W,0,0,0,300,A 189 | -Kirby Puckett,680,223,31,119,96,34,3,1928,587,35,262,201,91,A,W,429,8,6,365,A 190 | -Kurt Stillwell,279,64,0,31,26,30,1,279,64,0,31,26,30,N,W,107,205,16,75,N 191 | -Leon Durham,484,127,20,66,65,67,7,3006,844,116,436,458,377,N,E,1231,80,7,1183.333,N 192 | -Len Dykstra,431,127,8,77,45,58,2,667,187,9,117,64,88,N,E,283,8,3,202.5,N 193 | -Larry Herndon,283,70,8,33,37,27,12,4479,1222,94,557,483,307,A,E,156,2,2,225,A 194 | -Lee Lacy,491,141,11,77,47,37,15,4291,1240,84,615,430,340,A,E,239,8,2,525,A 195 | -Len Matuszek,199,52,9,26,28,21,6,805,191,30,113,119,87,N,W,235,22,5,265,N 196 | -Lloyd Moseby,589,149,21,89,86,64,7,3558,928,102,513,471,351,A,E,371,6,6,787.5,A 197 | -Lance Parrish,327,84,22,53,62,38,10,4273,1123,212,577,700,334,A,E,483,48,6,800,N 198 | -Larry Parrish,464,128,28,67,94,52,13,5829,1552,210,740,840,452,A,W,0,0,0,587.5,A 199 | -Luis Rivera,166,34,0,20,13,17,1,166,34,0,20,13,17,N,E,64,119,9,,N 200 | -Larry Sheets,338,92,18,42,60,21,3,682,185,36,88,112,50,A,E,0,0,0,145,A 201 | -Lonnie Smith,508,146,8,80,44,46,9,3148,915,41,571,289,326,A,W,245,5,9,,A 202 | -Lou Whitaker,584,157,20,95,73,63,10,4704,1320,93,724,522,576,A,E,276,421,11,420,A 203 | -Mike Aldrete,216,54,2,27,25,33,1,216,54,2,27,25,33,N,W,317,36,1,75,N 204 | -Marty Barrett,625,179,4,94,60,65,5,1696,476,12,216,163,166,A,E,303,450,14,575,A 205 | -Mike Brown,243,53,4,18,26,27,4,853,228,23,101,110,76,N,E,107,3,3,,N 206 | -Mike Davis,489,131,19,77,55,34,7,2051,549,62,300,263,153,A,W,310,9,9,780,A 207 | -Mike Diaz,209,56,12,22,36,19,2,216,58,12,24,37,19,N,E,201,6,3,90,N 208 | -Mariano Duncan,407,93,8,47,30,30,2,969,230,14,121,69,68,N,W,172,317,25,150,N 209 | -Mike Easler,490,148,14,64,78,49,13,3400,1000,113,445,491,301,A,E,0,0,0,700,N 210 | -Mike Fitzgerald,209,59,6,20,37,27,4,884,209,14,66,106,92,N,E,415,35,3,,N 211 | -Mel Hall,442,131,18,68,77,33,6,1416,398,47,210,203,136,A,E,233,7,7,550,A 212 | -Mickey Hatcher,317,88,3,40,32,19,8,2543,715,28,269,270,118,A,W,220,16,4,,A 213 | -Mike Heath,288,65,8,30,36,27,9,2815,698,55,315,325,189,N,E,259,30,10,650,A 214 | -Mike Kingery,209,54,3,25,14,12,1,209,54,3,25,14,12,A,W,102,6,3,68,A 215 | -Mike LaValliere,303,71,3,18,30,36,3,344,76,3,20,36,45,N,E,468,47,6,100,N 216 | -Mike Marshall,330,77,19,47,53,27,6,1928,516,90,247,288,161,N,W,149,8,6,670,N 217 | -Mike Pagliarulo,504,120,28,71,71,54,3,1085,259,54,150,167,114,A,E,103,283,19,175,A 218 | -Mark Salas,258,60,8,28,33,18,3,638,170,17,80,75,36,A,W,358,32,8,137,A 219 | -Mike Schmidt,20,1,0,0,0,0,2,41,9,2,6,7,4,N,E,78,220,6,2127.333,N 220 | -Mike Scioscia,374,94,5,36,26,62,7,1968,519,26,181,199,288,N,W,756,64,15,875,N 221 | -Mickey Tettleton,211,43,10,26,35,39,3,498,116,14,59,55,78,A,W,463,32,8,120,A 222 | -Milt Thompson,299,75,6,38,23,26,3,580,160,8,71,33,44,N,E,212,1,2,140,N 223 | -Mitch Webster,576,167,8,89,49,57,4,822,232,19,132,83,79,N,E,325,12,8,210,N 224 | -Mookie 
Wilson,381,110,9,61,45,32,7,3015,834,40,451,249,168,N,E,228,7,5,800,N 225 | -Marvell Wynne,288,76,7,34,37,15,4,1644,408,16,198,120,113,N,W,203,3,3,240,N 226 | -Mike Young,369,93,9,43,42,49,5,1258,323,54,181,177,157,A,E,149,1,6,350,A 227 | -Nick Esasky,330,76,12,35,41,47,4,1367,326,55,167,198,167,N,W,512,30,5,,N 228 | -Ozzie Guillen,547,137,2,58,47,12,2,1038,271,3,129,80,24,A,W,261,459,22,175,A 229 | -Oddibe McDowell,572,152,18,105,49,65,2,978,249,36,168,91,101,A,W,325,13,3,200,A 230 | -Omar Moreno,359,84,4,46,27,21,12,4992,1257,37,699,386,387,N,W,151,8,5,,N 231 | -Ozzie Smith,514,144,0,67,54,79,9,4739,1169,13,583,374,528,N,E,229,453,15,1940,N 232 | -Ozzie Virgil,359,80,15,45,48,63,7,1493,359,61,176,202,175,N,W,682,93,13,700,N 233 | -Phil Bradley,526,163,12,88,50,77,4,1556,470,38,245,167,174,A,W,250,11,1,750,A 234 | -Phil Garner,313,83,9,43,41,30,14,5885,1543,104,751,714,535,N,W,58,141,23,450,N 235 | -Pete Incaviglia,540,135,30,82,88,55,1,540,135,30,82,88,55,A,W,157,6,14,172,A 236 | -Paul Molitor,437,123,9,62,55,40,9,4139,1203,79,676,390,364,A,E,82,170,15,1260,A 237 | -Pete O'Brien,551,160,23,86,90,87,5,2235,602,75,278,328,273,A,W,1224,115,11,,A 238 | -Pete Rose,237,52,0,15,25,30,24,14053,4256,160,2165,1314,1566,N,W,523,43,6,750,N 239 | -Pat Sheridan,236,56,6,41,19,21,5,1257,329,24,166,125,105,A,E,172,1,4,190,A 240 | -Pat Tabler,473,154,6,61,48,29,6,1966,566,29,250,252,178,A,E,846,84,9,580,A 241 | -Rafael Belliard,309,72,0,33,31,26,5,354,82,0,41,32,26,N,E,117,269,12,130,N 242 | -Rick Burleson,271,77,5,35,29,33,12,4933,1358,48,630,435,403,A,W,62,90,3,450,A 243 | -Randy Bush,357,96,7,50,45,39,5,1394,344,43,178,192,136,A,W,167,2,4,300,A 244 | -Rick Cerone,216,56,4,22,18,15,12,2796,665,43,266,304,198,A,E,391,44,4,250,A 245 | -Ron Cey,256,70,13,42,36,44,16,7058,1845,312,965,1128,990,N,E,41,118,8,1050,A 246 | -Rob Deer,466,108,33,75,86,72,3,652,142,44,102,109,102,A,E,286,8,8,215,A 247 | -Rick Dempsey,327,68,13,42,29,45,18,3949,939,78,438,380,466,A,E,659,53,7,400,A 248 | -Rich Gedman,462,119,16,49,65,37,7,2131,583,69,244,288,150,A,E,866,65,6,,A 249 | -Ron Hassey,341,110,9,45,49,46,9,2331,658,50,249,322,274,A,E,251,9,4,560,A 250 | -Rickey Henderson,608,160,28,130,74,89,8,4071,1182,103,862,417,708,A,E,426,4,6,1670,A 251 | -Reggie Jackson,419,101,18,65,58,92,20,9528,2510,548,1509,1659,1342,A,W,0,0,0,487.5,A 252 | -Ricky Jones,33,6,0,2,4,7,1,33,6,0,2,4,7,A,W,205,5,4,,A 253 | -Ron Kittle,376,82,21,42,60,35,5,1770,408,115,238,299,157,A,W,0,0,0,425,A 254 | -Ray Knight,486,145,11,51,76,40,11,3967,1102,67,410,497,284,N,E,88,204,16,500,A 255 | -Randy Kutcher,186,44,7,28,16,11,1,186,44,7,28,16,11,N,W,99,3,1,,N 256 | -Rudy Law,307,80,1,42,36,29,7,2421,656,18,379,198,184,A,W,145,2,2,,A 257 | -Rick Leach,246,76,5,35,39,13,6,912,234,12,102,96,80,A,E,44,0,1,250,A 258 | -Rick Manning,205,52,8,31,27,17,12,5134,1323,56,643,445,459,A,E,155,3,2,400,A 259 | -Rance Mulliniks,348,90,11,50,45,43,10,2288,614,43,295,273,269,A,E,60,176,6,450,A 260 | -Ron Oester,523,135,8,52,44,52,9,3368,895,39,377,284,296,N,W,367,475,19,750,N 261 | -Rey Quinones,312,68,2,32,22,24,1,312,68,2,32,22,24,A,E,86,150,15,70,A 262 | -Rafael Ramirez,496,119,8,57,33,21,7,3358,882,36,365,280,165,N,W,155,371,29,875,N 263 | -Ronn Reynolds,126,27,3,8,10,5,4,239,49,3,16,13,14,N,E,190,2,9,190,N 264 | -Ron Roenicke,275,68,5,42,42,61,6,961,238,16,128,104,172,N,E,181,3,2,191,N 265 | -Ryne Sandberg,627,178,14,68,76,46,6,3146,902,74,494,345,242,N,E,309,492,5,740,N 266 | -Rafael Santana,394,86,1,38,28,36,4,1089,267,3,94,71,76,N,E,203,369,16,250,N 267 | -Rick 
Schu,208,57,8,32,25,18,3,653,170,17,98,54,62,N,E,42,94,13,140,N 268 | -Ruben Sierra,382,101,16,50,55,22,1,382,101,16,50,55,22,A,W,200,7,6,97.5,A 269 | -Roy Smalley,459,113,20,59,57,68,12,5348,1369,155,713,660,735,A,W,0,0,0,740,A 270 | -Robby Thompson,549,149,7,73,47,42,1,549,149,7,73,47,42,N,W,255,450,17,140,N 271 | -Rob Wilfong,288,63,3,25,33,16,10,2682,667,38,315,259,204,A,W,135,257,7,341.667,A 272 | -Reggie Williams,303,84,4,35,32,23,2,312,87,4,39,32,23,N,W,179,5,3,,N 273 | -Robin Yount,522,163,9,82,46,62,13,7037,2019,153,1043,827,535,A,E,352,9,1,1000,A 274 | -Steve Balboni,512,117,29,54,88,43,6,1750,412,100,204,276,155,A,W,1236,98,18,100,A 275 | -Scott Bradley,220,66,5,20,28,13,3,290,80,5,27,31,15,A,W,281,21,3,90,A 276 | -Sid Bream,522,140,16,73,77,60,4,730,185,22,93,106,86,N,E,1320,166,17,200,N 277 | -Steve Buechele,461,112,18,54,54,35,2,680,160,24,76,75,49,A,W,111,226,11,135,A 278 | -Shawon Dunston,581,145,17,66,68,21,2,831,210,21,106,86,40,N,E,320,465,32,155,N 279 | -Scott Fletcher,530,159,3,82,50,47,6,1619,426,11,218,149,163,A,W,196,354,15,475,A 280 | -Steve Garvey,557,142,21,58,81,23,18,8759,2583,271,1138,1299,478,N,W,1160,53,7,1450,N 281 | -Steve Jeltz,439,96,0,44,36,65,4,711,148,1,68,56,99,N,E,229,406,22,150,N 282 | -Steve Lombardozzi,453,103,8,53,33,52,2,507,123,8,63,39,58,A,W,289,407,6,105,A 283 | -Spike Owen,528,122,1,67,45,51,4,1716,403,12,211,146,155,A,W,209,372,17,350,A 284 | -Steve Sax,633,210,6,91,56,59,6,3070,872,19,420,230,274,N,W,367,432,16,90,N 285 | -Tony Armas,16,2,0,1,0,0,2,28,4,0,1,0,0,A,E,247,4,8,,A 286 | -Tony Bernazard,562,169,17,88,73,53,8,3181,841,61,450,342,373,A,E,351,442,17,530,A 287 | -Tom Brookens,281,76,3,42,25,20,8,2658,657,48,324,300,179,A,E,106,144,7,341.667,A 288 | -Tom Brunansky,593,152,23,69,75,53,6,2765,686,133,369,384,321,A,W,315,10,6,940,A 289 | -Tony Fernandez,687,213,10,91,65,27,4,1518,448,15,196,137,89,A,E,294,445,13,350,A 290 | -Tim Flannery,368,103,3,48,28,54,8,1897,493,9,207,162,198,N,W,209,246,3,326.667,N 291 | -Tom Foley,263,70,1,26,23,30,4,888,220,9,83,82,86,N,E,81,147,4,250,N 292 | -Tony Gwynn,642,211,14,107,59,52,5,2364,770,27,352,230,193,N,W,337,19,4,740,N 293 | -Terry Harper,265,68,8,26,30,29,7,1337,339,32,135,163,128,N,W,92,5,3,425,A 294 | -Toby Harrah,289,63,7,36,41,44,17,7402,1954,195,1115,919,1153,A,W,166,211,7,,A 295 | -Tommy Herr,559,141,2,48,61,73,8,3162,874,16,421,349,359,N,E,352,414,9,925,N 296 | -Tim Hulett,520,120,17,53,44,21,4,927,227,22,106,80,52,A,W,70,144,11,185,A 297 | -Terry Kennedy,19,4,1,2,3,1,1,19,4,1,2,3,1,N,W,692,70,8,920,A 298 | -Tito Landrum,205,43,2,24,17,20,7,854,219,12,105,99,71,N,E,131,6,1,286.667,N 299 | -Tim Laudner,193,47,10,21,29,24,6,1136,256,42,129,139,106,A,W,299,13,5,245,A 300 | -Tom O'Malley,181,46,1,19,18,17,5,937,238,9,88,95,104,A,E,37,98,9,,A 301 | -Tom Paciorek,213,61,4,17,22,3,17,4061,1145,83,488,491,244,A,W,178,45,4,235,A 302 | -Tony Pena,510,147,10,56,52,53,7,2872,821,63,307,340,174,N,E,810,99,18,1150,N 303 | -Terry Pendleton,578,138,1,56,59,34,3,1399,357,7,149,161,87,N,E,133,371,20,160,N 304 | -Tony Perez,200,51,2,14,29,25,23,9778,2732,379,1272,1652,925,N,W,398,29,7,,N 305 | -Tony Phillips,441,113,5,76,52,76,5,1546,397,17,226,149,191,A,W,160,290,11,425,A 306 | -Terry Puhl,172,42,3,17,14,15,10,4086,1150,57,579,363,406,N,W,65,0,0,900,N 307 | -Tim Raines,580,194,9,91,62,78,8,3372,1028,48,604,314,469,N,E,270,13,6,,N 308 | -Ted Simmons,127,32,4,14,25,12,19,8396,2402,242,1048,1348,819,N,W,167,18,6,500,N 309 | -Tim Teufel,279,69,4,35,31,32,4,1359,355,31,180,148,158,N,E,133,173,9,277.5,N 310 | 
-Tim Wallach,480,112,18,50,71,44,7,3031,771,110,338,406,239,N,E,94,270,16,750,N 311 | -Vince Coleman,600,139,0,94,29,60,2,1236,309,1,201,69,110,N,E,300,12,9,160,N 312 | -Von Hayes,610,186,19,107,98,74,6,2728,753,69,399,366,286,N,E,1182,96,13,1300,N 313 | -Vance Law,360,81,5,37,44,37,7,2268,566,41,279,257,246,N,E,170,284,3,525,N 314 | -Wally Backman,387,124,1,67,27,36,7,1775,506,6,272,125,194,N,E,186,290,17,550,N 315 | -Wade Boggs,580,207,8,107,71,105,5,2778,978,32,474,322,417,A,E,121,267,19,1600,A 316 | -Will Clark,408,117,11,66,41,34,1,408,117,11,66,41,34,N,W,942,72,11,120,N 317 | -Wally Joyner,593,172,22,82,100,57,1,593,172,22,82,100,57,A,W,1222,139,15,165,A 318 | -Wayne Krenchicki,221,53,2,21,23,22,8,1063,283,15,107,124,106,N,E,325,58,6,,N 319 | -Willie McGee,497,127,7,65,48,37,5,2703,806,32,379,311,138,N,E,325,9,3,700,N 320 | -Willie Randolph,492,136,5,76,50,94,12,5511,1511,39,897,451,875,A,E,313,381,20,875,A 321 | -Wayne Tolleson,475,126,3,61,43,52,6,1700,433,7,217,93,146,A,W,37,113,7,385,A 322 | -Willie Upshaw,573,144,9,85,60,78,8,3198,857,97,470,420,332,A,E,1314,131,12,960,A 323 | -Willie Wilson,631,170,9,77,44,31,11,4908,1457,30,775,357,249,A,W,408,4,3,1000,A 324 | -------------------------------------------------------------------------------- /data/Khan_ytest.csv: -------------------------------------------------------------------------------- 1 | "","x" 2 | "1",3 3 | "2",2 4 | "3",4 5 | "4",2 6 | "5",1 7 | "6",3 8 | "7",4 9 | "8",2 10 | "9",3 11 | "10",1 12 | "11",3 13 | "12",4 14 | "13",1 15 | "14",2 16 | "15",2 17 | "16",2 18 | "17",4 19 | "18",3 20 | "19",4 21 | "20",3 22 | -------------------------------------------------------------------------------- /data/Khan_ytrain.csv: -------------------------------------------------------------------------------- 1 | "","x" 2 | "1",2 3 | "2",2 4 | "3",2 5 | "4",2 6 | "5",2 7 | "6",2 8 | "7",2 9 | "8",2 10 | "9",2 11 | "10",2 12 | "11",2 13 | "12",2 14 | "13",2 15 | "14",2 16 | "15",2 17 | "16",2 18 | "17",2 19 | "18",2 20 | "19",2 21 | "20",2 22 | "21",2 23 | "22",2 24 | "23",2 25 | "24",4 26 | "25",4 27 | "26",4 28 | "27",4 29 | "28",4 30 | "29",4 31 | "30",4 32 | "31",4 33 | "32",4 34 | "33",4 35 | "34",4 36 | "35",4 37 | "36",4 38 | "37",4 39 | "38",4 40 | "39",4 41 | "40",4 42 | "41",4 43 | "42",4 44 | "43",4 45 | "44",3 46 | "45",3 47 | "46",3 48 | "47",3 49 | "48",3 50 | "49",3 51 | "50",3 52 | "51",3 53 | "52",3 54 | "53",3 55 | "54",3 56 | "55",3 57 | "56",1 58 | "57",1 59 | "58",1 60 | "59",1 61 | "60",1 62 | "61",1 63 | "62",1 64 | "63",1 65 | --------------------------------------------------------------------------------