├── LICENSE.txt
├── Lab 10 - Ridge Regression and the Lasso in Python.ipynb
├── Lab 11 - PCR and PLS Regression in Python.ipynb
├── Lab 12 - Polynomial Regression and Step Functions in Python.ipynb
├── Lab 13 - Splines in Python.ipynb
├── Lab 14 - Decision Trees in Python.ipynb
├── Lab 15 - Support Vector Machines in Python.ipynb
├── Lab 16 - Multiclass SVMs and Applications to Real Data in Python.ipynb
├── Lab 18 - PCA in Python.ipynb
├── Lab 2 - Linear Regression in Python.ipynb
├── Lab 3 - K-Nearest Neighbors in Python.ipynb
├── Lab 4 - Logistic Regression in Python.ipynb
├── Lab 5 - LDA and QDA in Python.ipynb
├── Lab 7 - Cross-Validation in Python.ipynb
├── Lab 8 - Subset Selection in Python.ipynb
├── Lab 9 - Linear Model Selection in Python.ipynb
├── README.md
└── data
├── Auto.csv
├── Boston.csv
├── Caravan.csv
├── Carseats.csv
├── Hitters.csv
├── Khan_xtest.csv
├── Khan_xtrain.csv
├── Khan_ytest.csv
├── Khan_ytrain.csv
├── OJ.csv
├── Smarket 2.csv
├── Smarket.csv
└── Wage.csv
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 R. Jordan Crouser.
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Lab 13 - Splines in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Splines and GAMs is a python adaptation of p. 293-297 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was originally written by Jordi Warmenhoven, and was adapted by R. Jordan Crouser at Smith College in Spring 2016. "
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import numpy as np\n",
20 | "import matplotlib as mpl\n",
21 | "import matplotlib.pyplot as plt\n",
22 | "\n",
23 | "from sklearn.preprocessing import PolynomialFeatures\n",
24 | "import statsmodels.api as sm\n",
25 | "import statsmodels.formula.api as smf\n",
26 | "\n",
27 | "%matplotlib inline\n",
28 | "\n",
29 | "# Read in the data\n",
30 | "df = pd.read_csv('Wage.csv')\n",
31 | "\n",
32 | "# Generate a sequence of age values spanning the range\n",
33 | "age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "# 7.8.2 Splines\n",
41 | "\n",
42 | "In order to fit regression splines in python, we use the ${\\tt dmatrix}$ module from the ${\\tt patsy}$ library. In lecture, we saw that regression splines can be fit by constructing an appropriate matrix of basis functions. The ${\\tt bs()}$ function generates the entire matrix of basis functions for splines with the specified set of knots. Fitting ${\\tt wage}$ to ${\\tt age}$ using a regression spline is simple:"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "metadata": {
49 | "collapsed": false
50 | },
51 | "outputs": [],
52 | "source": [
53 | "from patsy import dmatrix\n",
54 | "\n",
55 | "# Specifying 3 knots\n",
56 | "transformed_x1 = dmatrix(\"bs(df.age, knots=(25,40,60), degree=3, include_intercept=False)\",\n",
57 | " {\"df.age\": df.age}, return_type='dataframe')\n",
58 | "\n",
59 | "# Build a regular linear model from the splines\n",
60 | "fit1 = sm.GLM(df.wage, transformed_x1).fit()\n",
61 | "fit1.params"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "Here we have prespecified knots at ages 25, 40, and 60. This produces a\n",
69 | "spline with six basis functions. (Recall that a cubic spline with three knots\n",
70 | "has seven degrees of freedom; these degrees of freedom are used up by an\n",
71 | "intercept, plus six basis functions.) We could also use the ${\\tt df}$ option to\n",
72 | "produce a spline with knots at uniform quantiles of the data:"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {
79 | "collapsed": false
80 | },
81 | "outputs": [],
82 | "source": [
83 | "# Specifying 6 degrees of freedom \n",
84 | "transformed_x2 = dmatrix(\"bs(df.age, df=6, include_intercept=False)\",\n",
85 | " {\"df.age\": df.age}, return_type='dataframe')\n",
86 | "fit2 = sm.GLM(df.wage, transformed_x2).fit()\n",
87 | "fit2.params"
88 | ]
89 | },
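{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick aside on the degrees-of-freedom accounting, added for clarity.) A degree-$d$ spline with $K$ interior knots spans $K + d + 1$ basis functions, intercept included. With $d = 3$ and $K = 3$ that gives $3 + 3 + 1 = 7$ degrees of freedom above: one intercept (which ${\\tt dmatrix}$ supplies on its own, hence ${\\tt include\\_intercept=False}$ inside ${\\tt bs()}$) plus six spline columns. Conversely, asking for ${\\tt df=6}$ here means $6 = K + 3$ spline columns, i.e. three interior knots."
]
},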
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "In this case python chooses knots which correspond\n",
95 | "to the 25th, 50th, and 75th percentiles of ${\\tt age}$. The function ${\\tt bs()}$ also has\n",
96 | "a ${\\tt degree}$ argument, so we can fit splines of any degree, rather than the\n",
97 | "default degree of 3 (which yields a cubic spline).\n",
98 | "\n",
99 | "In order to instead fit a natural spline, we use the ${\\tt cr()}$ function. Here\n",
100 | "we fit a natural spline with four degrees of freedom:"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": null,
106 | "metadata": {
107 | "collapsed": false
108 | },
109 | "outputs": [],
110 | "source": [
111 | "# Specifying 4 degrees of freedom\n",
112 | "transformed_x3 = dmatrix(\"cr(df.age, df=4)\", {\"df.age\": df.age}, return_type='dataframe')\n",
113 | "fit3 = sm.GLM(df.wage, transformed_x3).fit()\n",
114 | "fit3.params"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "As with the ${\\tt bs()}$ function, we could instead specify the knots directly using\n",
122 | "the ${\\tt knots}$ option.\n",
123 | "\n",
124 | "Let's see how these three models stack up:"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {
131 | "collapsed": false
132 | },
133 | "outputs": [],
134 | "source": [
135 | "# Generate a sequence of age values spanning the range\n",
136 | "age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)\n",
137 | "\n",
138 | "# Make some predictions\n",
139 | "pred1 = fit1.predict(dmatrix(\"bs(age_grid, knots=(25,40,60), include_intercept=False)\",\n",
140 | " {\"age_grid\": age_grid}, return_type='dataframe'))\n",
141 | "pred2 = fit2.predict(dmatrix(\"bs(age_grid, df=6, include_intercept=False)\",\n",
142 | " {\"age_grid\": age_grid}, return_type='dataframe'))\n",
143 | "pred3 = fit3.predict(dmatrix(\"cr(age_grid, df=4)\", {\"age_grid\": age_grid}, return_type='dataframe'))\n",
144 | "\n",
145 | "# Plot the splines and error bands\n",
146 | "plt.scatter(df.age, df.wage, facecolor='None', edgecolor='k', alpha=0.1)\n",
147 | "plt.plot(age_grid, pred1, color='b', label='Specifying three knots')\n",
148 | "plt.plot(age_grid, pred2, color='r', label='Specifying df=6')\n",
149 | "plt.plot(age_grid, pred3, color='g', label='Natural spline df=4')\n",
150 | "plt.legend()\n",
151 | "plt.xlim(15,85)\n",
152 | "plt.ylim(0,350)\n",
153 | "plt.xlabel('age')\n",
154 | "plt.ylabel('wage')"
155 | ]
156 | },
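{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, ${\\tt cr()}$ (like ${\\tt bs()}$) also accepts a ${\\tt knots}$ argument. The original lab doesn't show an example, so here is a small sketch of a natural spline with the same three knots we specified earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch (not in the original lab): a natural spline with explicitly specified knots,\n",
"# mirroring the knots=(25,40,60) choice we made with bs() above.\n",
"transformed_x4 = dmatrix(\"cr(df.age, knots=(25,40,60))\", {\"df.age\": df.age}, return_type='dataframe')\n",
"fit4 = sm.GLM(df.wage, transformed_x4).fit()\n",
"fit4.params"
]
},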
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "To get credit for this lab, post your answer to the following question:\n",
162 | " - How would you choose whether to use a polynomial, step, or spline function for each predictor when building a GAM?"
163 | ]
164 | }
165 | ],
166 | "metadata": {
167 | "kernelspec": {
168 | "display_name": "Python [python3]",
169 | "language": "python",
170 | "name": "Python [python3]"
171 | },
172 | "language_info": {
173 | "codemirror_mode": {
174 | "name": "ipython",
175 | "version": 3
176 | },
177 | "file_extension": ".py",
178 | "mimetype": "text/x-python",
179 | "name": "python",
180 | "nbconvert_exporter": "python",
181 | "pygments_lexer": "ipython3",
182 | "version": "3.5.1"
183 | }
184 | },
185 | "nbformat": 4,
186 | "nbformat_minor": 0
187 | }
188 |
--------------------------------------------------------------------------------
/Lab 14 - Decision Trees in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Decision Trees is a Python adaptation of p. 324-331 of \"Introduction to Statistical Learning with\n",
8 | "Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Original adaptation by J. Warmenhoven, updated by R. Jordan Crouser at Smith\n",
9 | "College for SDS293: Machine Learning (Spring 2016)."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {
16 | "collapsed": false
17 | },
18 | "outputs": [],
19 | "source": [
20 | "import pandas as pd\n",
21 | "import numpy as np\n",
22 | "import matplotlib as mpl\n",
23 | "import matplotlib.pyplot as plt\n",
24 | "import graphviz\n",
25 | "\n",
26 | "%matplotlib inline"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "# 8.3.1 Fitting Classification Trees\n",
34 | "\n",
35 | "The ${\\tt sklearn}$ library has a lot of useful tools for constructing classification and regression trees:"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "metadata": {
42 | "collapsed": false
43 | },
44 | "outputs": [],
45 | "source": [
46 | "from sklearn.cross_validation import train_test_split\n",
47 | "from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz\n",
48 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
49 | "from sklearn.metrics import confusion_matrix, mean_squared_error"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "We'll start by using **classification trees** to analyze the ${\\tt Carseats}$ data set. In these\n",
57 | "data, ${\\tt Sales}$ is a continuous variable, and so we begin by converting it to a\n",
58 | "binary variable. We use the ${\\tt ifelse()}$ function to create a variable, called\n",
59 | "${\\tt High}$, which takes on a value of ${\\tt Yes}$ if the ${\\tt Sales}$ variable exceeds 8, and\n",
60 | "takes on a value of ${\\tt No}$ otherwise. We'll append this onto our dataFrame using the ${\\tt .map()}$ function, and then do a little data cleaning to tidy things up:"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": null,
66 | "metadata": {
67 | "collapsed": false
68 | },
69 | "outputs": [],
70 | "source": [
71 | "df3 = pd.read_csv('Carseats.csv').drop('Unnamed: 0', axis=1)\n",
72 | "df3['High'] = df3.Sales.map(lambda x: 1 if x>8 else 0)\n",
73 | "df3.ShelveLoc = pd.factorize(df3.ShelveLoc)[0]\n",
74 | "df3.Urban = df3.Urban.map({'No':0, 'Yes':1})\n",
75 | "df3.US = df3.US.map({'No':0, 'Yes':1})\n",
76 | "df3.info()"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "In order to properly evaluate the performance of a classification tree on\n",
84 | "the data, we must estimate the test error rather than simply computing\n",
85 | "the training error. We first split the observations into a training set and a test\n",
86 | "set:"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {
93 | "collapsed": false
94 | },
95 | "outputs": [],
96 | "source": [
97 | "X = df3.drop(['Sales', 'High'], axis=1)\n",
98 | "y = df3.High\n",
99 | "\n",
100 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "We now use the ${\\tt DecisionTreeClassifier()}$ function to fit a classification tree in order to predict\n",
108 | "${\\tt High}$ using all variables but ${\\tt Sales}$ (that would be a little silly...). Unfortunately, manual pruning is not implemented in ${\\tt sklearn}$: http://scikit-learn.org/stable/modules/tree.html\n",
109 | "\n",
110 | "However, we can limit the depth of a tree using the ${\\tt max\\_depth}$ parameter:"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {
117 | "collapsed": false
118 | },
119 | "outputs": [],
120 | "source": [
121 | "clf = DecisionTreeClassifier(max_depth=6)\n",
122 | "clf.fit(X_train, y_train)\n",
123 | "clf.score(X_train, y_train)"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "We see that the training accuracy is 95.5%.\n",
131 | "\n",
132 | "One of the most attractive properties of trees is that they can be\n",
133 | "graphically displayed. Unfortunately, this is a bit of a roundabout process in ${\\tt sklearn}$. We use the ${\\tt export\\_graphviz()}$ function to export the tree structure to a temporary ${\\tt .dot}$ file,\n",
134 | "and the ${\\tt graphviz.Source()}$ function to display the image:"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {
141 | "collapsed": false
142 | },
143 | "outputs": [],
144 | "source": [
145 | "export_graphviz(clf, out_file=\"mytree.dot\", feature_names=X_train.columns)\n",
146 | "with open(\"mytree.dot\") as f:\n",
147 | " dot_graph = f.read()\n",
148 | "graphviz.Source(dot_graph)"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "The most important indicator of ${\\tt High}$ sales appears to be ${\\tt Price}$."
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "Finally, let's evaluate the tree's performance on\n",
163 | "the test data. The ${\\tt predict()}$ function can be used for this purpose. We can then build a confusion matrix, which shows that we are making correct predictions for\n",
164 | "around 74.5% of the test data set:"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": null,
170 | "metadata": {
171 | "collapsed": false
172 | },
173 | "outputs": [],
174 | "source": [
175 | "pred = clf.predict(X_test)\n",
176 | "cm = pd.DataFrame(confusion_matrix(y_test, pred).T, index=['No', 'Yes'], columns=['No', 'Yes'])\n",
177 | "print(cm)\n",
178 | "# 99+50/200 = 0.745"
179 | ]
180 | },
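{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A note added to this adaptation:* the claim above that pruning is unavailable was true of ${\\tt sklearn}$ when this lab was written, but newer releases (0.22 and later) support minimal cost-complexity pruning through the ${\\tt ccp\\_alpha}$ parameter. Here is a rough sketch, assuming a recent version of ${\\tt sklearn}$; the particular alpha chosen is arbitrary and only meant to illustrate the API:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch (assumes scikit-learn >= 0.22): minimal cost-complexity pruning.\n",
"# cost_complexity_pruning_path() computes candidate alphas on the training data;\n",
"# refitting with a larger ccp_alpha yields a smaller, pruned tree.\n",
"path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)\n",
"alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # an arbitrary mid-path choice\n",
"pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)\n",
"pruned.score(X_test, y_test)"
]
},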
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "# 8.3.2 Fitting Regression Trees\n",
186 | "\n",
187 | "Now let's try fitting a **regression tree** to the ${\\tt Boston}$ data set from the ${\\tt MASS}$ library. First, we create a\n",
188 | "training set, and fit the tree to the training data using ${\\tt medv}$ (median home value) as our response:"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {
195 | "collapsed": false
196 | },
197 | "outputs": [],
198 | "source": [
199 | "boston_df = pd.read_csv('Boston.csv')\n",
200 | "X = boston_df.drop('medv', axis=1)\n",
201 | "y = boston_df.medv\n",
202 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)\n",
203 | "\n",
204 | "# Pruning not supported. Choosing max depth 2)\n",
205 | "regr2 = DecisionTreeRegressor(max_depth=2)\n",
206 | "regr2.fit(X_train, y_train)"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "Let's take a look at the tree:"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "metadata": {
220 | "collapsed": false
221 | },
222 | "outputs": [],
223 | "source": [
224 | "export_graphviz(regr2, out_file=\"mytree.dot\", feature_names=X_train.columns)\n",
225 | "with open(\"mytree.dot\") as f:\n",
226 | " dot_graph = f.read()\n",
227 | "graphviz.Source(dot_graph)"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "The variable ${\\tt lstat}$ measures the percentage of individuals with lower\n",
235 | "socioeconomic status. The tree indicates that lower values of ${\\tt lstat}$ correspond\n",
236 | "to more expensive houses. The tree predicts a median house price\n",
237 | "of \\$45,766 for larger homes (${\\tt rm}>=7.435$) in suburbs in which residents have high socioeconomic\n",
238 | "status (${\\tt lstat}<7.81$).\n",
239 | "\n",
240 | "Now let's see how it does on the test data:"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {
247 | "collapsed": false
248 | },
249 | "outputs": [],
250 | "source": [
251 | "pred = regr2.predict(X_test)\n",
252 | "\n",
253 | "plt.scatter(pred, y_test, label='medv')\n",
254 | "plt.plot([0, 1], [0, 1], '--k', transform=plt.gca().transAxes)\n",
255 | "plt.xlabel('pred')\n",
256 | "plt.ylabel('y_test')\n",
257 | "\n",
258 | "mean_squared_error(y_test, pred)"
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {},
264 | "source": [
265 | "The test set MSE associated with the regression tree is\n",
266 | "28.8. The square root of the MSE is therefore around 5.37, indicating\n",
267 | "that this model leads to test predictions that are within around \\$5,370 of\n",
268 | "the true median home value for the suburb.\n",
269 | " \n",
270 | "# 8.3.3 Bagging and Random Forests\n",
271 | "\n",
272 | "Let's see if we can improve on this result using **bagging** and **random forests**. The exact results obtained in this section may\n",
273 | "depend on the version of ${\\tt python}$ and the version of the ${\\tt RandomForestRegressor}$ package\n",
274 | "installed on your computer, so don't stress out if you don't match up exactly with the book. Recall that **bagging** is simply a special case of\n",
275 | "a **random forest** with $m = p$. Therefore, the ${\\tt RandomForestRegressor()}$ function can\n",
276 | "be used to perform both random forests and bagging. Let's start with bagging:"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": null,
282 | "metadata": {
283 | "collapsed": false
284 | },
285 | "outputs": [],
286 | "source": [
287 | "# Bagging: using all features\n",
288 | "regr1 = RandomForestRegressor(max_features=13, random_state=1)\n",
289 | "regr1.fit(X_train, y_train)"
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "The argument ${\\tt max\\_features=13}$ indicates that all 13 predictors should be considered\n",
297 | "for each split of the tree -- in other words, that bagging should be done. How\n",
298 | "well does this bagged model perform on the test set?"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": null,
304 | "metadata": {
305 | "collapsed": false
306 | },
307 | "outputs": [],
308 | "source": [
309 | "pred = regr1.predict(X_test)\n",
310 | "plt.scatter(pred, y_test, label='medv')\n",
311 | "plt.plot([0, 1], [0, 1], '--k', transform=plt.gca().transAxes)\n",
312 | "plt.xlabel('pred')\n",
313 | "plt.ylabel('y_test')\n",
314 | "mean_squared_error(y_test, pred)"
315 | ]
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "The test setMSE associated with the bagged regression tree is significantly lower than our single tree!"
322 | ]
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "We can grow a random forest in exactly the same way, except that\n",
329 | "we'll use a smaller value of the ${\\tt max\\_features}$ argument. Here we'll\n",
330 | "use ${\\tt max\\_features = 6}$:"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [],
340 | "source": [
341 | "# Random forests: using 6 features\n",
342 | "regr2 = RandomForestRegressor(max_features=6, random_state=1)\n",
343 | "regr2.fit(X_train, y_train)\n",
344 | "\n",
345 | "pred = regr2.predict(X_test)\n",
346 | "mean_squared_error(y_test, pred)"
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 | "The test set MSE is even lower; this indicates that random forests yielded an\n",
354 | "improvement over bagging in this case.\n",
355 | "\n",
356 | "Using the ${\\tt feature\\_importances\\_}$ attribute of the ${\\tt RandomForestRegressor}$, we can view the importance of each\n",
357 | "variable:"
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "metadata": {
364 | "collapsed": false
365 | },
366 | "outputs": [],
367 | "source": [
368 | "Importance = pd.DataFrame({'Importance':regr2.feature_importances_*100}, index=X.columns)\n",
369 | "Importance.sort_values(by='Importance', axis=0, ascending=True).plot(kind='barh', color='r', )\n",
370 | "plt.xlabel('Variable Importance')\n",
371 | "plt.gca().legend_ = None"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "metadata": {},
377 | "source": [
378 | "The results indicate that across all of the trees considered in the random\n",
379 | "forest, the wealth level of the community (${\\tt lstat}$) and the house size (${\\tt rm}$)\n",
380 | "are by far the two most important variables."
381 | ]
382 | },
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "# 8.3.4 Boosting\n",
388 | "\n",
389 | "Now we'll use the ${\\tt GradientBoostingRegressor}$ package to fit **boosted\n",
390 | "regression trees** to the ${\\tt Boston}$ data set. The\n",
391 | "argument ${\\tt n_estimators=500}$ indicates that we want 500 trees, and the option\n",
392 | "${\\tt interaction.depth=4}$ limits the depth of each tree:"
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": null,
398 | "metadata": {
399 | "collapsed": false
400 | },
401 | "outputs": [],
402 | "source": [
403 | "regr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, max_depth=4, random_state=1)\n",
404 | "regr.fit(X_train, y_train)"
405 | ]
406 | },
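{
"cell_type": "markdown",
"metadata": {},
"source": [
"*An aside added to this adaptation:* ${\\tt GradientBoostingRegressor}$ also exposes a ${\\tt staged\\_predict()}$ method, which yields the model's predictions after each additional tree. A small sketch of how one might use it to watch the test error evolve as trees are added:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: test MSE as a function of the number of boosting stages.\n",
"# staged_predict() is a generator; its i-th element is the prediction using the first i+1 trees.\n",
"test_mse = [mean_squared_error(y_test, y_pred) for y_pred in regr.staged_predict(X_test)]\n",
"plt.plot(np.arange(len(test_mse)) + 1, test_mse)\n",
"plt.xlabel('Number of trees')\n",
"plt.ylabel('Test MSE')"
]
},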
407 | {
408 | "cell_type": "markdown",
409 | "metadata": {},
410 | "source": [
411 | "Let's check out the feature importances again:"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": null,
417 | "metadata": {
418 | "collapsed": false
419 | },
420 | "outputs": [],
421 | "source": [
422 | "feature_importance = regr.feature_importances_*100\n",
423 | "rel_imp = pd.Series(feature_importance, index=X.columns).sort_values(inplace=False)\n",
424 | "rel_imp.T.plot(kind='barh', color='r', )\n",
425 | "plt.xlabel('Variable Importance')\n",
426 | "plt.gca().legend_ = None"
427 | ]
428 | },
429 | {
430 | "cell_type": "markdown",
431 | "metadata": {},
432 | "source": [
433 | "We see that ${\\tt lstat}$ and ${\\tt rm}$ are again the most important variables by far. Now let's use the boosted model to predict ${\\tt medv}$ on the test set:"
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": null,
439 | "metadata": {
440 | "collapsed": false
441 | },
442 | "outputs": [],
443 | "source": [
444 | "mean_squared_error(y_test, regr.predict(X_test))"
445 | ]
446 | },
447 | {
448 | "cell_type": "markdown",
449 | "metadata": {},
450 | "source": [
451 | "The test MSE obtained is similar to the test MSE for random forests\n",
452 | "and superior to that for bagging. If we want to, we can perform boosting\n",
453 | "with a different value of the shrinkage parameter $\\lambda$. Here we take $\\lambda = 0.2$:"
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": null,
459 | "metadata": {
460 | "collapsed": false
461 | },
462 | "outputs": [],
463 | "source": [
464 | "regr2 = GradientBoostingRegressor(n_estimators=500, learning_rate=0.2, max_depth=4, random_state=1)\n",
465 | "regr2.fit(X_train, y_train)\n",
466 | "mean_squared_error(y_test, regr2.predict(X_test))"
467 | ]
468 | },
469 | {
470 | "cell_type": "markdown",
471 | "metadata": {},
472 | "source": [
473 | "In this case, using $\\lambda = 0.2$ leads to a slightly lower test MSE than $\\lambda = 0.01$.\n",
474 | "\n",
475 | "To get credit for this lab, post your responses to the following questions:\n",
476 | " - What's one real-world scenario where you might try using Bagging?\n",
477 | " - What's one real-world scenario where you might try using Random Forests?\n",
478 | " - What's one real-world scenario where you might try using Boosting?\n",
479 | " \n",
480 | "to Piazza: https://piazza.com/class/igwiv4w3ctb6rg?cid=53"
481 | ]
482 | }
483 | ],
484 | "metadata": {
485 | "anaconda-cloud": {},
486 | "kernelspec": {
487 | "display_name": "Python [conda root]",
488 | "language": "python",
489 | "name": "conda-root-py"
490 | },
491 | "language_info": {
492 | "codemirror_mode": {
493 | "name": "ipython",
494 | "version": 2
495 | },
496 | "file_extension": ".py",
497 | "mimetype": "text/x-python",
498 | "name": "python",
499 | "nbconvert_exporter": "python",
500 | "pygments_lexer": "ipython2",
501 | "version": "2.7.11"
502 | }
503 | },
504 | "nbformat": 4,
505 | "nbformat_minor": 0
506 | }
507 |
--------------------------------------------------------------------------------
/Lab 18 - PCA in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Principal Components Analysis is a python adaptation of p. 401-404,\n",
8 | "408-410 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James,\n",
9 | "Daniela Witten, Trevor Hastie and Robert Tibshirani. Original adaptation by J. Warmenhoven, updated by R. Jordan Crouser at Smith College for\n",
10 | "SDS293: Machine Learning (Spring 2016)."
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": null,
16 | "metadata": {
17 | "collapsed": true
18 | },
19 | "outputs": [],
20 | "source": [
21 | "import pandas as pd\n",
22 | "import numpy as np\n",
23 | "import matplotlib as mpl\n",
24 | "import matplotlib.pyplot as plt\n",
25 | "\n",
26 | "%matplotlib inline"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "# 10.4: Principal Components Analysis\n",
34 | "\n",
35 | "In this lab, we perform PCA on the ${\\tt USArrests}$ data set. The rows of the data set contain the 50 states, in\n",
36 | "alphabetical order:"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {
43 | "collapsed": false
44 | },
45 | "outputs": [],
46 | "source": [
47 | "df = pd.read_csv('USArrests.csv', index_col=0)\n",
48 | "df.head()"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "The columns of the data set contain four variables relating to various crimes:"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {
62 | "collapsed": false
63 | },
64 | "outputs": [],
65 | "source": [
66 | "df.info()"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "Let's start by taking a quick look at the column means of the data:"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [],
83 | "source": [
84 | "df.mean()"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "We see right away the the data have **vastly** different means. We can also examine the variances of the four variables:"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {
98 | "collapsed": false
99 | },
100 | "outputs": [],
101 | "source": [
102 | "df.var()"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "Not surprisingly, the variables also have vastly different variances: the\n",
110 | "${\\tt UrbanPop}$ variable measures the percentage of the population in each state\n",
111 | "living in an urban area, which is not a comparable number to the number\n",
112 | "of crimes committeed in each state per 100,000 individuals. If we failed to scale the\n",
113 | "variables before performing PCA, then most of the principal components\n",
114 | "that we observed would be driven by the ${\\tt Assault}$ variable, since it has by\n",
115 | "far the largest mean and variance. \n",
116 | "\n",
117 | "Thus, it is important to standardize the\n",
118 | "variables to have mean zero and standard deviation 1 before performing\n",
119 | "PCA. We can do this using the ${\\tt scale()}$ function from ${\\tt sklearn}$:"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {
126 | "collapsed": true
127 | },
128 | "outputs": [],
129 | "source": [
130 | "from sklearn.preprocessing import scale\n",
131 | "X = pd.DataFrame(scale(df), index=df.index, columns=df.columns)"
132 | ]
133 | },
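{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick verification, added for clarity.) After scaling, every column should have mean zero and standard deviation one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# scale() standardizes using the population standard deviation (ddof=0),\n",
"# so each column should now have mean ~0 and standard deviation exactly 1.\n",
"print(X.mean())\n",
"print(X.std(ddof=0))"
]
},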
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "Now we'll use the ${\\tt PCA()}$ function from ${\\tt sklearn}$ to compute the loading vectors:"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {
145 | "collapsed": false
146 | },
147 | "outputs": [],
148 | "source": [
149 | "from sklearn.decomposition import PCA\n",
150 | "\n",
151 | "pca_loadings = pd.DataFrame(PCA().fit(X).components_.T, index=df.columns, columns=['V1', 'V2', 'V3', 'V4'])\n",
152 | "pca_loadings"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "We see that there are four distinct principal components. This is to be\n",
160 | "expected because there are in general ${\\tt min(n − 1, p)}$ informative principal\n",
161 | "components in a data set with $n$ observations and $p$ variables.\n",
162 | "\n",
163 | "Using the ${\\tt fit_transform()}$ function, we can get the principal component scores of the original data. We'll take a look at the first few states:"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {
170 | "collapsed": false
171 | },
172 | "outputs": [],
173 | "source": [
174 | "# Fit the PCA model and transform X to get the principal components\n",
175 | "pca = PCA()\n",
176 | "df_plot = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4'], index=X.index)\n",
177 | "df_plot.head()"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "We can construct a **biplot** of the first two principal components using our loading vectors:"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {
191 | "collapsed": false
192 | },
193 | "outputs": [],
194 | "source": [
195 | "fig , ax1 = plt.subplots(figsize=(9,7))\n",
196 | "\n",
197 | "ax1.set_xlim(-3.5,3.5)\n",
198 | "ax1.set_ylim(-3.5,3.5)\n",
199 | "\n",
200 | "# Plot Principal Components 1 and 2\n",
201 | "for i in df_plot.index:\n",
202 | " ax1.annotate(i, (-df_plot.PC1.loc[i], -df_plot.PC2.loc[i]), ha='center')\n",
203 | "\n",
204 | "# Plot reference lines\n",
205 | "ax1.hlines(0,-3.5,3.5, linestyles='dotted', colors='grey')\n",
206 | "ax1.vlines(0,-3.5,3.5, linestyles='dotted', colors='grey')\n",
207 | "\n",
208 | "ax1.set_xlabel('First Principal Component')\n",
209 | "ax1.set_ylabel('Second Principal Component')\n",
210 | " \n",
211 | "# Plot Principal Component loading vectors, using a second y-axis.\n",
212 | "ax2 = ax1.twinx().twiny() \n",
213 | "\n",
214 | "ax2.set_ylim(-1,1)\n",
215 | "ax2.set_xlim(-1,1)\n",
216 | "ax2.set_xlabel('Principal Component loading vectors', color='red')\n",
217 | "\n",
218 | "# Plot labels for vectors. Variable 'a' is a small offset parameter to separate arrow tip and text.\n",
219 | "a = 1.07 \n",
220 | "for i in pca_loadings[['V1', 'V2']].index:\n",
221 | " ax2.annotate(i, (-pca_loadings.V1.loc[i]*a, -pca_loadings.V2.loc[i]*a), color='red')\n",
222 | "\n",
223 | "# Plot vectors\n",
224 | "ax2.arrow(0,0,-pca_loadings.V1[0], -pca_loadings.V2[0])\n",
225 | "ax2.arrow(0,0,-pca_loadings.V1[1], -pca_loadings.V2[1])\n",
226 | "ax2.arrow(0,0,-pca_loadings.V1[2], -pca_loadings.V2[2])\n",
227 | "ax2.arrow(0,0,-pca_loadings.V1[3], -pca_loadings.V2[3])"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "The ${\\tt PCA()}$ function also outputs the variance explained by of each principal\n",
235 | "component. We can access these values as follows:"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": null,
241 | "metadata": {
242 | "collapsed": false
243 | },
244 | "outputs": [],
245 | "source": [
246 | "pca.explained_variance_"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 | "We can also get the proportion of variance explained:"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": null,
259 | "metadata": {
260 | "collapsed": false
261 | },
262 | "outputs": [],
263 | "source": [
264 | "pca.explained_variance_ratio_"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "metadata": {},
270 | "source": [
271 | "We see that the first principal component explains 62.0% of the variance\n",
272 | "in the data, the next principal component explains 24.7% of the variance,\n",
273 | "and so forth. We can plot the PVE explained by each component as follows:"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {
280 | "collapsed": false
281 | },
282 | "outputs": [],
283 | "source": [
284 | "plt.figure(figsize=(7,5))\n",
285 | "plt.plot([1,2,3,4], pca.explained_variance_ratio_, '-o')\n",
286 | "plt.ylabel('Proportion of Variance Explained')\n",
287 | "plt.xlabel('Principal Component')\n",
288 | "plt.xlim(0.75,4.25)\n",
289 | "plt.ylim(0,1.05)\n",
290 | "plt.xticks([1,2,3,4])"
291 | ]
292 | },
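{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick check, added for clarity.) The ${\\tt explained\\_variance\\_ratio\\_}$ values are simply the ${\\tt explained\\_variance\\_}$ values divided by the total variance, so each PVE can be recomputed by hand:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# PVE of component m = variance explained by component m / total variance.\n",
"# (All four components are kept here, so the total variance equals the sum of their variances.)\n",
"pve_by_hand = pca.explained_variance_ / pca.explained_variance_.sum()\n",
"np.allclose(pve_by_hand, pca.explained_variance_ratio_)"
]
},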
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "We can also use the function ${\\tt cumsum()}$, which computes the cumulative sum of the elements of a numeric vector, to plot the cumulative PVE:"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": null,
303 | "metadata": {
304 | "collapsed": false
305 | },
306 | "outputs": [],
307 | "source": [
308 | "plt.figure(figsize=(7,5))\n",
309 | "plt.plot([1,2,3,4], np.cumsum(pca.explained_variance_ratio_), '-s')\n",
310 | "plt.ylabel('Proportion of Variance Explained')\n",
311 | "plt.xlabel('Principal Component')\n",
312 | "plt.xlim(0.75,4.25)\n",
313 | "plt.ylim(0,1.05)\n",
314 | "plt.xticks([1,2,3,4])\n",
315 | "\n"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "# 10.6: NCI60 Data Example\n",
323 | "\n",
324 | "Let's return to the ${\\tt NCI60}$ cancer cell line microarray data, which\n",
325 | "consists of 6,830 gene expression measurements on 64 cancer cell lines:"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {
332 | "collapsed": false
333 | },
334 | "outputs": [],
335 | "source": [
336 | "df2 = pd.read_csv('NCI60.csv').drop('Unnamed: 0', axis=1)\n",
337 | "df2.columns = np.arange(df2.columns.size)\n",
338 | "df2.info()"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {
345 | "collapsed": false
346 | },
347 | "outputs": [],
348 | "source": [
349 | "# Read in the labels to check our work later\n",
350 | "y = pd.read_csv('NCI60_y.csv', usecols=[1], skiprows=1, names=['type'])"
351 | ]
352 | },
353 | {
354 | "cell_type": "markdown",
355 | "metadata": {},
356 | "source": [
357 | "# 10.6.1 PCA on the NCI60 Data\n",
358 | "\n",
359 | "We first perform PCA on the data after scaling the variables (genes) to\n",
360 | "have standard deviation one, although one could reasonably argue that it\n",
361 | "is better not to scale the genes:"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {
368 | "collapsed": true
369 | },
370 | "outputs": [],
371 | "source": [
372 | "# Scale the data\n",
373 | "X = pd.DataFrame(scale(df2))\n",
374 | "X.shape\n",
375 | "\n",
376 | "# Fit the PCA model and transform X to get the principal components\n",
377 | "pca2 = PCA()\n",
378 | "df2_plot = pd.DataFrame(pca2.fit_transform(X))"
379 | ]
380 | },
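{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick dimension check, added for clarity.) With $n = 64$ cell lines and $p = 6830$ genes, PCA returns at most ${\\tt min(n, p)} = 64$ score columns, and because the data are centered only ${\\tt min(n-1, p)} = 63$ of them carry any variance; the last component's variance is essentially zero:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The score matrix has one column per component (at most n = 64 here);\n",
"# the final component's variance is numerically ~0 because the data were centered.\n",
"print(df2_plot.shape)\n",
"print(pca2.explained_variance_[-1])"
]
},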
381 | {
382 | "cell_type": "markdown",
383 | "metadata": {},
384 | "source": [
385 | "We now plot the first few principal component score vectors, in order to\n",
386 | "visualize the data. The observations (cell lines) corresponding to a given\n",
387 | "cancer type will be plotted in the same color, so that we can see to what\n",
388 | "extent the observations within a cancer type are similar to each other:"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": null,
394 | "metadata": {
395 | "collapsed": false
396 | },
397 | "outputs": [],
398 | "source": [
399 | "fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))\n",
400 | "\n",
401 | "color_idx = pd.factorize(y.type)[0]\n",
402 | "cmap = mpl.cm.hsv\n",
403 | "\n",
404 | "# Left plot\n",
405 | "ax1.scatter(df2_plot.iloc[:,0], df2_plot.iloc[:,1], c=color_idx, cmap=cmap, alpha=0.5, s=50)\n",
406 | "ax1.set_ylabel('Principal Component 2')\n",
407 | "\n",
408 | "# Right plot\n",
409 | "ax2.scatter(df2_plot.iloc[:,0], df2_plot.iloc[:,2], c=color_idx, cmap=cmap, alpha=0.5, s=50)\n",
410 | "ax2.set_ylabel('Principal Component 3')\n",
411 | "\n",
412 | "# Custom legend for the classes (y) since we do not create scatter plots per class (which could have their own labels).\n",
413 | "handles = []\n",
414 | "labels = pd.factorize(y.type.unique())\n",
415 | "norm = mpl.colors.Normalize(vmin=0.0, vmax=14.0)\n",
416 | "\n",
417 | "for i, v in zip(labels[0], labels[1]):\n",
418 | " handles.append(mpl.patches.Patch(color=cmap(norm(i)), label=v, alpha=0.5))\n",
419 | "\n",
420 | "ax2.legend(handles=handles, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)\n",
421 | "\n",
422 | "# xlabel for both plots\n",
423 | "for ax in fig.axes:\n",
424 | " ax.set_xlabel('Principal Component 1') "
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the\n",
432 | "first few principal component score vectors. This indicates that cell lines\n",
433 | "from the same cancer type tend to have pretty similar gene expression\n",
434 | "levels.\n",
435 | "\n",
436 | "We can generate a summary of the proportion of variance explained (PVE)\n",
437 | "of the first few principal components:"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {
444 | "collapsed": false
445 | },
446 | "outputs": [],
447 | "source": [
448 | "pd.DataFrame([df2_plot.iloc[:,:5].std(axis=0, ddof=0).as_matrix(),\n",
449 | " pca2.explained_variance_ratio_[:5],\n",
450 | " np.cumsum(pca2.explained_variance_ratio_[:5])],\n",
451 | " index=['Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion'],\n",
452 | " columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])"
453 | ]
454 | },
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {},
458 | "source": [
459 | "Using the ${\\tt plot()}$ function, we can also plot the variance explained by the\n",
460 | "first few principal components:"
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": null,
466 | "metadata": {
467 | "collapsed": false
468 | },
469 | "outputs": [],
470 | "source": [
471 | "df2_plot.iloc[:,:10].var(axis=0, ddof=0).plot(kind='bar', rot=0)\n",
472 | "plt.ylabel('Variances')"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "However, it is generally more informative to\n",
480 | "plot the PVE of each principal component (i.e. a **scree plot**) and the cumulative\n",
481 | "PVE of each principal component. This can be done with just a\n",
482 | "little tweaking:"
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": null,
488 | "metadata": {
489 | "collapsed": false
490 | },
491 | "outputs": [],
492 | "source": [
493 | "fig , (ax1,ax2) = plt.subplots(1,2, figsize=(15,5))\n",
494 | "\n",
495 | "# Left plot\n",
496 | "ax1.plot(pca2.explained_variance_ratio_, '-o')\n",
497 | "ax1.set_ylabel('Proportion of Variance Explained')\n",
498 | "ax1.set_ylim(ymin=-0.01)\n",
499 | "\n",
500 | "# Right plot\n",
501 | "ax2.plot(np.cumsum(pca2.explained_variance_ratio_), '-ro')\n",
502 | "ax2.set_ylabel('Cumulative Proportion of Variance Explained')\n",
503 | "ax2.set_ylim(ymax=1.05)\n",
504 | "\n",
505 | "for ax in fig.axes:\n",
506 | " ax.set_xlabel('Principal Component')\n",
507 | " ax.set_xlim(-1,65) "
508 | ]
509 | },
510 | {
511 | "cell_type": "markdown",
512 | "metadata": {},
513 | "source": [
514 | "We see that together, the first seven principal components\n",
515 | "explain around 40% of the variance in the data. This is not a huge amount\n",
516 | "of the variance. However, looking at the scree plot, we see that while each\n",
517 | "of the first seven principal components explain a substantial amount of\n",
518 | "variance, there is a marked decrease in the variance explained by further\n",
519 | "principal components. That is, there is an **elbow** in the plot after approximately\n",
520 | "the seventh principal component. This suggests that there may\n",
521 | "be little benefit to examining more than seven or so principal components\n",
522 | "(phew! even examining seven principal components may be difficult)."
523 | ]
524 | }
525 | ],
526 | "metadata": {
527 | "kernelspec": {
528 | "display_name": "Python 3",
529 | "language": "python",
530 | "name": "python3"
531 | },
532 | "language_info": {
533 | "codemirror_mode": {
534 | "name": "ipython",
535 | "version": 3
536 | },
537 | "file_extension": ".py",
538 | "mimetype": "text/x-python",
539 | "name": "python",
540 | "nbconvert_exporter": "python",
541 | "pygments_lexer": "ipython3",
542 | "version": "3.5.1"
543 | }
544 | },
545 | "nbformat": 4,
546 | "nbformat_minor": 0
547 | }
548 |
--------------------------------------------------------------------------------
/Lab 3 - K-Nearest Neighbors in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on K-Nearest Neighbors is a python adaptation of p. 163-167 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Originally adapted by Jordi Warmenhoven (github.com/JWarmenhoven/ISLR-python), modified by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import numpy as np"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "# 4.6.5: K-Nearest Neighbors"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "In this lab, we will perform KNN on the ${\\tt Smarket}$ dataset from ${\\tt ISLR}$. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the\n",
34 | "beginning of 2001 until the end of 2005. For each date, we have recorded\n",
35 | "the percentage returns for each of the five previous trading days, ${\\tt Lag1}$\n",
36 | "through ${\\tt Lag5}$. We have also recorded ${\\tt Volume}$ (the number of shares traded on the previous day, in billions), ${\\tt Today}$ (the percentage return on the date\n",
37 | "in question) and ${\\tt Direction}$ (whether the market was ${\\tt Up}$ or ${\\tt Down}$ on this\n",
38 | "date). We can use the ${\\tt head(...)}$ function to look at the first few rows:"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {
45 | "collapsed": false
46 | },
47 | "outputs": [],
48 | "source": [
49 | "df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)\n",
50 | "df.head()"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "Today we're going to try to predict ${\\tt Direction}$ using percentage returns from the previous two days (${\\tt Lag1}$ and ${\\tt Lag2}$). We'll build our model using the ${\\tt KNeighborsClassifier()}$ function, which is part of the\n",
58 | "${\\tt neighbors}$ submodule of SciKitLearn (${\\tt sklearn}$). We'll also grab a couple of useful tools from the ${\\tt metrics}$ submodule:"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {
65 | "collapsed": true
66 | },
67 | "outputs": [],
68 | "source": [
69 | "from sklearn import neighbors\n",
70 | "from sklearn.metrics import confusion_matrix, classification_report"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "This function works rather differently from the other model-fitting\n",
78 | "functions that we have encountered thus far. Rather than a two-step\n",
79 | "approach in which we first fit the model and then we use the model to make\n",
80 | "predictions, ${\\tt knn()}$ forms predictions using a single command. The function\n",
81 | "requires four inputs.\n",
82 | " 1. A matrix containing the predictors associated with the training data,\n",
83 | "labeled ${\\tt X\\_train}$ below.\n",
84 | " 2. A matrix containing the predictors associated with the data for which\n",
85 | "we wish to make predictions, labeled ${\\tt X\\_test}$ below.\n",
86 | " 3. A vector containing the class labels for the training observations,\n",
87 | "labeled ${\\tt y\\_train}$ below.\n",
88 | " 4. A value for $K$, the number of nearest neighbors to be used by the\n",
89 | "classifier.\n",
90 | "\n",
91 | "We'll first create a vector corresponding to the observations from 2001 through 2004, which we'll use to train the model. We will then use this vector to create a held out data set of observations from 2005 on which we will test. We'll also pull out our training and test labels."
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {
98 | "collapsed": false
99 | },
100 | "outputs": [],
101 | "source": [
102 | "X_train = df[:'2004'][['Lag1','Lag2']]\n",
103 | "y_train = df[:'2004']['Direction']\n",
104 | "\n",
105 | "X_test = df['2005':][['Lag1','Lag2']]\n",
106 | "y_test = df['2005':]['Direction']"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "Now the ${\\tt neighbors.KNeighborsClassifier()}$ function can be used to predict the market’s movement for\n",
114 | "the dates in 2005."
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {
121 | "collapsed": false
122 | },
123 | "outputs": [],
124 | "source": [
125 | "knn = neighbors.KNeighborsClassifier(n_neighbors=1)\n",
126 | "pred = knn.fit(X_train, y_train).predict(X_test)"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "The ${\\tt confusion\\_matrix()}$ function can be used to produce a **confusion matrix** in order to determine how many observations were correctly or incorrectly classified. The ${\\tt classification\\_report()}$ function gives us some summary statistics on the classifier's performance:"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {
140 | "collapsed": false
141 | },
142 | "outputs": [],
143 | "source": [
144 | "print(confusion_matrix(y_test, pred).T)\n",
145 | "print(classification_report(y_test, pred, digits=3))"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "The results using $K = 1$ are not very good, since only 50% of the observations\n",
153 | "are correctly predicted. Of course, it may be that $K = 1$ results in an\n",
154 | "overly flexible fit to the data. Below, we repeat the analysis using $K = 3$."
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {
161 | "collapsed": false
162 | },
163 | "outputs": [],
164 | "source": [
165 | "knn = neighbors.KNeighborsClassifier(n_neighbors=3)\n",
166 | "pred = knn.fit(X_train, y_train).predict(X_test)\n",
167 | "print(confusion_matrix(y_test, pred).T)\n",
168 | "print(classification_report(y_test, pred, digits=3))"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "The results have improved slightly. Try looping through a few other $K$ values to see if you can get any further improvement:"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": null,
181 | "metadata": {
182 | "collapsed": false
183 | },
184 | "outputs": [],
185 | "source": [
186 | "for k_val in range(10):\n",
187 | " # Your code here"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "It looks like for classifying this dataset, ${KNN}$ might not be the right approach."
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "# 4.6.6: An Application to Caravan Insurance Data\n",
202 | "Let's see how the ${\\tt KNN}$ approach performs on the ${\\tt Caravan}$ data set, which is\n",
203 | "part of the ${\\tt ISLR}$ library. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is\n",
204 | "${\\tt Purchase}$, which indicates whether or not a given individual purchases a\n",
205 | "caravan insurance policy. In this data set, only 6% of people purchased\n",
206 | "caravan insurance."
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "metadata": {
213 | "collapsed": false
214 | },
215 | "outputs": [],
216 | "source": [
217 | "df2 = pd.read_csv('Caravan.csv')\n",
218 | "df2[\"Purchase\"].value_counts()"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "Because the ${\\tt KNN}$ classifier predicts the class of a given test observation by\n",
226 | "identifying the observations that are nearest to it, the scale of the variables\n",
227 | "matters. Any variables that are on a large scale will have a much larger\n",
228 | "effect on the distance between the observations, and hence on the ${\\tt KNN}$\n",
229 | "classifier, than variables that are on a small scale. \n",
230 | "\n",
231 | "For instance, imagine a\n",
232 | "data set that contains two variables, salary and age (measured in dollars\n",
233 | "and years, respectively). As far as ${\\tt KNN}$ is concerned, a difference of \\$1,000\n",
234 | "in salary is enormous compared to a difference of 50 years in age. Consequently,\n",
235 | "salary will drive the ${\\tt KNN}$ classification results, and age will have\n",
236 | "almost no effect. \n",
237 | "\n",
238 | "This is contrary to our intuition that a salary difference\n",
239 | "of \\$1,000 is quite small compared to an age difference of 50 years. Furthermore,\n",
240 | "the importance of scale to the ${\\tt KNN}$ classifier leads to another issue:\n",
241 | "if we measured salary in Japanese yen, or if we measured age in minutes,\n",
242 | "then we’d get quite different classification results from what we get if these\n",
243 | "two variables are measured in dollars and years.\n",
244 | "\n",
245 | "A good way to handle this problem is to **standardize** the data so that all\n",
246 | "variables are given a mean of zero and a standard deviation of one. Then\n",
247 | "all variables will be on a comparable scale. The ${\\tt scale()}$ function from the ${\\tt preprocessing}$ submodule of SciKitLearn does just\n",
248 | "this. In standardizing the data, we exclude column 86, because that is the\n",
249 | "qualitative ${\\tt Purchase}$ variable."
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "metadata": {
256 | "collapsed": false
257 | },
258 | "outputs": [],
259 | "source": [
260 | "from sklearn import preprocessing\n",
261 | "y = df2.Purchase\n",
262 | "X = df2.drop('Purchase', axis=1).astype('float64')\n",
263 | "X_scaled = preprocessing.scale(X)\n",
264 | "print(np.std(X_scaled))"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "metadata": {},
270 | "source": [
271 | "Now every column of ${\\tt X\\_scaled}$ has a standard deviation of one and\n",
272 | "a mean of zero.\n",
273 | "\n",
274 | "We'll now split the observations into a test set, containing the first 1,000\n",
275 | "observations, and a training set, containing the remaining observations."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": null,
281 | "metadata": {
282 | "collapsed": false
283 | },
284 | "outputs": [],
285 | "source": [
286 | "X_train = X_scaled[1000:,:]\n",
287 | "y_train = y[1000:]\n",
288 | "X_test = X_scaled[:1000,:]\n",
289 | "y_test = y[:1000]"
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "Let's fit a ${\\tt KNN}$ model on the training data using $K = 1$, and evaluate its\n",
297 | "performance on the test data."
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": null,
303 | "metadata": {
304 | "collapsed": false
305 | },
306 | "outputs": [],
307 | "source": [
308 | "knn = neighbors.KNeighborsClassifier(n_neighbors=1)\n",
309 | "pred = knn.fit(X_train, y_train).predict(X_test)\n",
310 | "print(classification_report(y_test, pred, digits=3))"
311 | ]
312 | },
313 | {
314 | "cell_type": "markdown",
315 | "metadata": {},
316 | "source": [
317 | "The KNN error rate on the 1,000 test observations is just under 12%. At first glance, this may appear to be fairly good. However, since only 6% of customers purchased insurance, we could get the error rate down to 6% by always predicting ${\\tt No}$ regardless of the values of the predictors!\n",
318 | "\n",
319 | "Suppose that there is some non-trivial cost to trying to sell insurance\n",
320 | "to a given individual. For instance, perhaps a salesperson must visit each\n",
321 | "potential customer. If the company tries to sell insurance to a random\n",
322 | "selection of customers, then the success rate will be only 6%, which may\n",
323 | "be far too low given the costs involved. \n",
324 | "\n",
325 | "Instead, the company would like\n",
326 | "to try to sell insurance only to customers who are likely to buy it. So the\n",
327 | "overall error rate is not of interest. Instead, the fraction of individuals that\n",
328 | "are correctly predicted to buy insurance is of interest.\n",
329 | "\n",
330 | "It turns out that ${\\tt KNN}$ with $K = 1$ does far better than random guessing\n",
331 | "among the customers that are predicted to buy insurance:"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {
338 | "collapsed": false
339 | },
340 | "outputs": [],
341 | "source": [
342 | "print(confusion_matrix(y_test, pred).T)"
343 | ]
344 | },
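{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Added for illustration.) The quantity discussed below — the fraction of customers *predicted* to buy who actually do buy — is just the precision of the ${\\tt Yes}$ class, which ${\\tt sklearn}$ can compute directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Precision of the 'Yes' class: of those predicted to purchase, how many actually did?\n",
"# (The lab text below reports 9 of 77, about 11.7%.)\n",
"from sklearn.metrics import precision_score\n",
"precision_score(y_test, pred, pos_label='Yes')"
]
},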
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "Among 77 such\n",
350 | "customers, 9, or 11.7%, actually do purchase insurance. This is double the\n",
351 | "rate that one would obtain from random guessing. Let's see if increasing $K$ helps! Try out a few different $K$ values below. Feeling adventurous? Write a function that figures out the best value for $K$."
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": null,
357 | "metadata": {
358 | "collapsed": false
359 | },
360 | "outputs": [],
361 | "source": [
362 | "# Your code here"
363 | ]
364 | },
365 | {
366 | "cell_type": "markdown",
367 | "metadata": {},
368 | "source": [
369 | "It appears that ${\\tt KNN}$ is finding some real patterns in a difficult data set! To get credit for this lab, post a response to the Piazza prompt available at: https://piazza.com/class/igwiv4w3ctb6rg?cid=10"
370 | ]
371 | }
372 | ],
373 | "metadata": {
374 | "kernelspec": {
375 | "display_name": "Python 2",
376 | "language": "python",
377 | "name": "python2"
378 | },
379 | "language_info": {
380 | "codemirror_mode": {
381 | "name": "ipython",
382 | "version": 2
383 | },
384 | "file_extension": ".py",
385 | "mimetype": "text/x-python",
386 | "name": "python",
387 | "nbconvert_exporter": "python",
388 | "pygments_lexer": "ipython2",
389 | "version": "2.7.11"
390 | }
391 | },
392 | "nbformat": 4,
393 | "nbformat_minor": 0
394 | }
395 |
--------------------------------------------------------------------------------
/Lab 4 - Logistic Regression in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Logistic Regression is a Python adaptation from p. 154-161 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import numpy as np\n",
20 | "import statsmodels.api as sm"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "# 4.6.2 Logistic Regression\n",
28 | "\n",
29 | "Let's return to the ${\\tt Smarket}$ data from ${\\tt ISLR}$. "
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 2,
35 | "metadata": {
36 | "collapsed": false
37 | },
38 | "outputs": [
39 | {
40 | "data": {
41 | "text/html": [
42 | "
\n",
43 | "
\n",
44 | " \n",
45 | " \n",
46 | " | \n",
47 | " Lag1 | \n",
48 | " Lag2 | \n",
49 | " Lag3 | \n",
50 | " Lag4 | \n",
51 | " Lag5 | \n",
52 | " Volume | \n",
53 | " Today | \n",
54 | "
\n",
55 | " \n",
56 | " \n",
57 | " \n",
58 | " count | \n",
59 | " 1250.000000 | \n",
60 | " 1250.000000 | \n",
61 | " 1250.000000 | \n",
62 | " 1250.000000 | \n",
63 | " 1250.00000 | \n",
64 | " 1250.000000 | \n",
65 | " 1250.000000 | \n",
66 | "
\n",
67 | " \n",
68 | " mean | \n",
69 | " 0.003834 | \n",
70 | " 0.003919 | \n",
71 | " 0.001716 | \n",
72 | " 0.001636 | \n",
73 | " 0.00561 | \n",
74 | " 1.478305 | \n",
75 | " 0.003138 | \n",
76 | "
\n",
77 | " \n",
78 | " std | \n",
79 | " 1.136299 | \n",
80 | " 1.136280 | \n",
81 | " 1.138703 | \n",
82 | " 1.138774 | \n",
83 | " 1.14755 | \n",
84 | " 0.360357 | \n",
85 | " 1.136334 | \n",
86 | "
\n",
87 | " \n",
88 | " min | \n",
89 | " -4.922000 | \n",
90 | " -4.922000 | \n",
91 | " -4.922000 | \n",
92 | " -4.922000 | \n",
93 | " -4.92200 | \n",
94 | " 0.356070 | \n",
95 | " -4.922000 | \n",
96 | "
\n",
97 | " \n",
98 | " 25% | \n",
99 | " -0.639500 | \n",
100 | " -0.639500 | \n",
101 | " -0.640000 | \n",
102 | " -0.640000 | \n",
103 | " -0.64000 | \n",
104 | " 1.257400 | \n",
105 | " -0.639500 | \n",
106 | "
\n",
107 | " \n",
108 | " 50% | \n",
109 | " 0.039000 | \n",
110 | " 0.039000 | \n",
111 | " 0.038500 | \n",
112 | " 0.038500 | \n",
113 | " 0.03850 | \n",
114 | " 1.422950 | \n",
115 | " 0.038500 | \n",
116 | "
\n",
117 | " \n",
118 | " 75% | \n",
119 | " 0.596750 | \n",
120 | " 0.596750 | \n",
121 | " 0.596750 | \n",
122 | " 0.596750 | \n",
123 | " 0.59700 | \n",
124 | " 1.641675 | \n",
125 | " 0.596750 | \n",
126 | "
\n",
127 | " \n",
128 | " max | \n",
129 | " 5.733000 | \n",
130 | " 5.733000 | \n",
131 | " 5.733000 | \n",
132 | " 5.733000 | \n",
133 | " 5.73300 | \n",
134 | " 3.152470 | \n",
135 | " 5.733000 | \n",
136 | "
\n",
137 | " \n",
138 | "
\n",
139 | "
"
140 | ],
141 | "text/plain": [
142 | " Lag1 Lag2 Lag3 Lag4 Lag5 \\\n",
143 | "count 1250.000000 1250.000000 1250.000000 1250.000000 1250.00000 \n",
144 | "mean 0.003834 0.003919 0.001716 0.001636 0.00561 \n",
145 | "std 1.136299 1.136280 1.138703 1.138774 1.14755 \n",
146 | "min -4.922000 -4.922000 -4.922000 -4.922000 -4.92200 \n",
147 | "25% -0.639500 -0.639500 -0.640000 -0.640000 -0.64000 \n",
148 | "50% 0.039000 0.039000 0.038500 0.038500 0.03850 \n",
149 | "75% 0.596750 0.596750 0.596750 0.596750 0.59700 \n",
150 | "max 5.733000 5.733000 5.733000 5.733000 5.73300 \n",
151 | "\n",
152 | " Volume Today \n",
153 | "count 1250.000000 1250.000000 \n",
154 | "mean 1.478305 0.003138 \n",
155 | "std 0.360357 1.136334 \n",
156 | "min 0.356070 -4.922000 \n",
157 | "25% 1.257400 -0.639500 \n",
158 | "50% 1.422950 0.038500 \n",
159 | "75% 1.641675 0.596750 \n",
160 | "max 3.152470 5.733000 "
161 | ]
162 | },
163 | "execution_count": 2,
164 | "metadata": {},
165 | "output_type": "execute_result"
166 | }
167 | ],
168 | "source": [
169 | "df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)\n",
170 | "df.describe()"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "In this lab, we will fit a logistic regression model in order to predict ${\\tt Direction}$ using ${\\tt Lag1}$ through ${\\tt Lag5}$ and ${\\tt Volume}$. We'll build our model using the ${\\tt glm()}$ function, which is part of the\n",
178 | "${\\tt formula}$ submodule of (${\\tt statsmodels}$)."
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 3,
184 | "metadata": {
185 | "collapsed": true
186 | },
187 | "outputs": [],
188 | "source": [
189 | "import statsmodels.formula.api as smf"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "We can use an ${\\tt R}$-like formula string to separate the predictors from the response."
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 4,
202 | "metadata": {
203 | "collapsed": false
204 | },
205 | "outputs": [],
206 | "source": [
207 | "formula = 'Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume'"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "The ${\\tt glm()}$ function fits **generalized linear models**, a class of models that includes logistic regression. The syntax of the ${\\tt glm()}$ function is similar to that of ${\\tt lm()}$, except that we must pass in the argument ${\\tt family=sm.families.Binomial()}$ in order to tell ${\\tt R}$ to run a logistic regression rather than some other type of generalized linear model."
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 5,
220 | "metadata": {
221 | "collapsed": false
222 | },
223 | "outputs": [
224 | {
225 | "name": "stdout",
226 | "output_type": "stream",
227 | "text": [
228 | " Generalized Linear Model Regression Results \n",
229 | "================================================================================================\n",
230 | "Dep. Variable: ['Direction[Down]', 'Direction[Up]'] No. Observations: 1250\n",
231 | "Model: GLM Df Residuals: 1243\n",
232 | "Model Family: Binomial Df Model: 6\n",
233 | "Link Function: logit Scale: 1.0\n",
234 | "Method: IRLS Log-Likelihood: -863.79\n",
235 | "Date: Wed, 10 Feb 2016 Deviance: 1727.6\n",
236 | "Time: 14:00:03 Pearson chi2: 1.25e+03\n",
237 | "No. Iterations: 6 \n",
238 | "==============================================================================\n",
239 | " coef std err z P>|z| [95.0% Conf. Int.]\n",
240 | "------------------------------------------------------------------------------\n",
241 | "Intercept 0.1260 0.241 0.523 0.601 -0.346 0.598\n",
242 | "Lag1 0.0731 0.050 1.457 0.145 -0.025 0.171\n",
243 | "Lag2 0.0423 0.050 0.845 0.398 -0.056 0.140\n",
244 | "Lag3 -0.0111 0.050 -0.222 0.824 -0.109 0.087\n",
245 | "Lag4 -0.0094 0.050 -0.187 0.851 -0.107 0.089\n",
246 | "Lag5 -0.0103 0.050 -0.208 0.835 -0.107 0.087\n",
247 | "Volume -0.1354 0.158 -0.855 0.392 -0.446 0.175\n",
248 | "==============================================================================\n"
249 | ]
250 | }
251 | ],
252 | "source": [
253 | "model = smf.glm(formula=formula, data=df, family=sm.families.Binomial())\n",
254 | "result = model.fit()\n",
255 | "print(result.summary())"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "The smallest p-value here is associated with ${\\tt Lag1}$. The negative coefficient\n",
263 | "for this predictor suggests that if the market had a positive return yesterday,\n",
264 | "then it is less likely to go up today. However, at a value of 0.145, the p-value\n",
265 | "is still relatively large, and so there is no clear evidence of a real association\n",
266 | "between ${\\tt Lag1}$ and ${\\tt Direction}$.\n",
267 | "\n",
268 | "We use the ${\\tt .params}$ attribute in order to access just the coefficients for this\n",
269 | "fitted model. Similarly, we can use ${\\tt .pvalues}$ to get the p-values for the coefficients, and ${\\tt .model.endog_names}$ to get the **endogenous** (or dependent) variables."
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {
276 | "collapsed": false
277 | },
278 | "outputs": [],
279 | "source": [
280 | "print(\"Coeffieients\")\n",
281 | "print(result.params)\n",
282 | "print\n",
283 | "print(\"p-Values\")\n",
284 | "print(result.pvalues)\n",
285 | "print\n",
286 | "print(\"Dependent variables\")\n",
287 | "print(result.model.endog_names)"
288 | ]
289 | },
290 | {
291 | "cell_type": "markdown",
292 | "metadata": {},
293 | "source": [
294 | "Note that the dependent variable has been converted from nominal into two dummy variables: ${\\tt ['Direction[Down]', 'Direction[Up]']}$.\n",
295 | "\n",
296 | "The ${\\tt predict()}$ function can be used to predict the probability that the\n",
297 | "market will go down, given values of the predictors. If no data set is supplied to the\n",
298 | "${\\tt predict()}$ function, then the probabilities are computed for the training\n",
299 | "data that was used to fit the logistic regression model. "
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "metadata": {
306 | "collapsed": false
307 | },
308 | "outputs": [],
309 | "source": [
310 | "predictions = result.predict()\n",
311 | "print(predictions[0:10])"
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "Here we have printe only the first ten probabilities. Note: these values correspond to the probability of the market going down, rather than up. If we print the model's encoding of the response values alongside the original nominal response, we see that Python has created a dummy variable with\n",
319 | "a 1 for ${\\tt Down}$."
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {
326 | "collapsed": false
327 | },
328 | "outputs": [],
329 | "source": [
330 | "print np.column_stack((df.as_matrix(columns=[\"Direction\"]).flatten(), result.model.endog))"
331 | ]
332 | },
333 | {
334 | "cell_type": "markdown",
335 | "metadata": {},
336 | "source": [
337 | "In order to make a prediction as to whether the market will go up or\n",
338 | "down on a particular day, we must convert these predicted probabilities\n",
339 | "into class labels, ${\\tt Up}$ or ${\\tt Down}$. The following two commands create a vector\n",
340 | "of class predictions based on whether the predicted probability of a market\n",
341 | "increase is greater than or less than 0.5."
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {
348 | "collapsed": false
349 | },
350 | "outputs": [],
351 | "source": [
352 | "predictions_nominal = [ \"Up\" if x < 0.5 else \"Down\" for x in predictions]"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "This transforms to ${\\tt Up}$ all of the elements for which the predicted probability of a\n",
360 | "market increase exceeds 0.5 (i.e. probability of a decrease is below 0.5). Given these predictions, the ${\\tt confusion\\_matrix()}$ function can be used to produce a confusion matrix in order to determine how many\n",
361 | "observations were correctly or incorrectly classified."
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {
368 | "collapsed": false
369 | },
370 | "outputs": [],
371 | "source": [
372 | "from sklearn.metrics import confusion_matrix, classification_report\n",
373 | "print confusion_matrix(df[\"Direction\"], predictions_nominal)"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "The diagonal elements of the confusion matrix indicate correct predictions,\n",
381 | "while the off-diagonals represent incorrect predictions. Hence our model\n",
382 | "correctly predicted that the market would go up on 507 days and that\n",
383 | "it would go down on 145 days, for a total of 507 + 145 = 652 correct\n",
384 | "predictions. The ${\\tt mean()}$ function can be used to compute the fraction of\n",
385 | "days for which the prediction was correct. In this case, logistic regression\n",
386 | "correctly predicted the movement of the market 52.2% of the time. this is confirmed by checking the output of the ${\\tt classification\\_report()}$ function."
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {
393 | "collapsed": false
394 | },
395 | "outputs": [],
396 | "source": [
397 | "print classification_report(df[\"Direction\"], predictions_nominal, digits=3)"
398 | ]
399 | },
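{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same number can be computed directly with ${\tt np.mean()}$ (a quick sketch, assuming ${\tt df}$ and ${\tt predictions\_nominal}$ are defined as above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Fraction of days on which the predicted label matches the observed Direction\n",
"print(np.mean(df[\"Direction\"] == predictions_nominal))"
]
},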
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "At first glance, it appears that the logistic regression model is working\n",
405 | "a little better than random guessing. But remember, this result is misleading\n",
406 | "because we trained and tested the model on the same set of 1,250 observations.\n",
407 | "In other words, 100− 52.2 = 47.8% is the **training error rate**. As we\n",
408 | "have seen previously, the training error rate is often overly optimistic — it\n",
409 | "tends to underestimate the _test_ error rate. \n",
410 | "\n",
411 | "In order to better assess the accuracy\n",
412 | "of the logistic regression model in this setting, we can fit the model\n",
413 | "using part of the data, and then examine how well it predicts the held out\n",
414 | "data. This will yield a more realistic error rate, in the sense that in practice\n",
415 | "we will be interested in our model’s performance not on the data that\n",
416 | "we used to fit the model, but rather on days in the future for which the\n",
417 | "market’s movements are unknown.\n",
418 | "\n",
419 | "Like we did with KNN, we will first create a vector corresponding\n",
420 | "to the observations from 2001 through 2004. We will then use this vector\n",
421 | "to create a held out data set of observations from 2005."
422 | ]
423 | },
424 | {
425 | "cell_type": "code",
426 | "execution_count": null,
427 | "metadata": {
428 | "collapsed": false
429 | },
430 | "outputs": [],
431 | "source": [
432 | "x_train = df[:'2004'][:]\n",
433 | "y_train = df[:'2004']['Direction']\n",
434 | "\n",
435 | "x_test = df['2005':][:]\n",
436 | "y_test = df['2005':]['Direction']"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "We now fit a logistic regression model using only the subset of the observations\n",
444 | "that correspond to dates before 2005, using the subset argument.\n",
445 | "We then obtain predicted probabilities of the stock market going up for\n",
446 | "each of the days in our test set—that is, for the days in 2005."
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": null,
452 | "metadata": {
453 | "collapsed": false
454 | },
455 | "outputs": [],
456 | "source": [
457 | "model = smf.glm(formula=formula, data=x_train, family=sm.families.Binomial())\n",
458 | "result = model.fit()"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "Notice that we have trained and tested our model on two completely separate\n",
466 | "data sets: training was performed using only the dates before 2005,\n",
467 | "and testing was performed using only the dates in 2005. Finally, we compute\n",
468 | "the predictions for 2005 and compare them to the actual movements\n",
469 | "of the market over that time period."
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": null,
475 | "metadata": {
476 | "collapsed": false
477 | },
478 | "outputs": [],
479 | "source": [
480 | "predictions = result.predict(x_test)\n",
481 | "predictions_nominal = [ \"Up\" if x < 0.5 else \"Down\" for x in predictions]\n",
482 | "print classification_report(y_test, predictions_nominal, digits=3)"
483 | ]
484 | },
485 | {
486 | "cell_type": "markdown",
487 | "metadata": {},
488 | "source": [
489 | "The results are rather disappointing: the test error\n",
490 | "rate (1 - ${\\tt recall}$) is 52%, which is worse than random guessing! Of course this result\n",
491 | "is not all that surprising, given that one would not generally expect to be\n",
492 | "able to use previous days’ returns to predict future market performance.\n",
493 | "(After all, if it were possible to do so, then the authors of this book [along with your professor] would probably\n",
494 | "be out striking it rich rather than teaching statistics.)\n",
495 | "\n",
496 | "We recall that the logistic regression model had very underwhelming pvalues\n",
497 | "associated with all of the predictors, and that the smallest p-value,\n",
498 | "though not very small, corresponded to ${\\tt Lag1}$. Perhaps by removing the\n",
499 | "variables that appear not to be helpful in predicting ${\\tt Direction}$, we can\n",
500 | "obtain a more effective model. After all, using predictors that have no\n",
501 | "relationship with the response tends to cause a deterioration in the test\n",
502 | "error rate (since such predictors cause an increase in variance without a\n",
503 | "corresponding decrease in bias), and so removing such predictors may in\n",
504 | "turn yield an improvement. \n",
505 | "\n",
506 | "In the space below, refit a logistic regression using just ${\\tt Lag1}$ and ${\\tt Lag2}$, which seemed to have the highest predictive power in the original logistic regression model."
507 | ]
508 | },
509 | {
510 | "cell_type": "code",
511 | "execution_count": null,
512 | "metadata": {
513 | "collapsed": false
514 | },
515 | "outputs": [],
516 | "source": [
517 | "model = # Write your code to fit the new model here\n",
518 | "\n",
519 | "# This will test your new model\n",
520 | "result = model.fit()\n",
521 | "predictions = result.predict(x_test)\n",
522 | "predictions_nominal = [ \"Up\" if x < 0.5 else \"Down\" for x in predictions]\n",
523 | "print classification_report(y_test, predictions_nominal, digits=3)"
524 | ]
525 | },
526 | {
527 | "cell_type": "markdown",
528 | "metadata": {},
529 | "source": [
530 | "Now the results appear to be more promising: 56% of the daily movements\n",
531 | "have been correctly predicted. The confusion matrix suggests that on days\n",
532 | "when logistic regression predicts that the market will decline, it is only\n",
533 | "correct 50% of the time. However, on days when it predicts an increase in\n",
534 | "the market, it has a 58% accuracy rate.\n",
535 | "\n",
536 | "Finally, suppose that we want to predict the returns associated with **particular\n",
537 | "values** of ${\\tt Lag1}$ and ${\\tt Lag2}$. In particular, we want to predict Direction on a\n",
538 | "day when ${\\tt Lag1}$ and ${\\tt Lag2}$ equal 1.2 and 1.1, respectively, and on a day when\n",
539 | "they equal 1.5 and −0.8. We can do this by passing a new data frame containing our test values to the ${\\tt predict()}$ function."
540 | ]
541 | },
542 | {
543 | "cell_type": "code",
544 | "execution_count": null,
545 | "metadata": {
546 | "collapsed": false
547 | },
548 | "outputs": [],
549 | "source": [
550 | "print result.predict(pd.DataFrame([[1.2,1.1],[1.5,-0.8]], columns = [\"Lag1\",\"Lag2\"]))"
551 | ]
552 | },
553 | {
554 | "cell_type": "markdown",
555 | "metadata": {},
556 | "source": [
557 | "To get credit for this lab, play around with a few other values for ${\\tt Lag1}$ and ${\\tt Lag2}$, and then post to Piazza about what you found. If you're feeling adventurous, try fitting models with other subsets of variables to see if you can find a letter one!"
558 | ]
559 | }
560 | ],
561 | "metadata": {
562 | "kernelspec": {
563 | "display_name": "Python 2",
564 | "language": "python",
565 | "name": "python2"
566 | },
567 | "language_info": {
568 | "codemirror_mode": {
569 | "name": "ipython",
570 | "version": 2
571 | },
572 | "file_extension": ".py",
573 | "mimetype": "text/x-python",
574 | "name": "python",
575 | "nbconvert_exporter": "python",
576 | "pygments_lexer": "ipython2",
577 | "version": "2.7.11"
578 | }
579 | },
580 | "nbformat": 4,
581 | "nbformat_minor": 0
582 | }
583 |
--------------------------------------------------------------------------------
/Lab 5 - LDA and QDA in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Logistic Regression is a Python adaptation of p. 161-163 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import numpy as np\n",
20 | "\n",
21 | "from sklearn.lda import LDA\n",
22 | "from sklearn.qda import QDA\n",
23 | "from sklearn.metrics import confusion_matrix, classification_report, precision_score\n",
24 | "\n",
25 | "%matplotlib inline"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "# 4.6.3 Linear Discriminant Analysis"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "Let's return to the ${\\tt Smarket}$ data from ${\\tt ISLR}$. "
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": null,
45 | "metadata": {
46 | "collapsed": false
47 | },
48 | "outputs": [],
49 | "source": [
50 | "df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)\n",
51 | "df.head()"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "Now we will perform LDA on the ${\\tt Smarket}$ data from the ${\\tt ISLR}$ package. In ${\\tt Python}$, we can fit a LDA model using the ${\\tt LDA()}$ function, which is part of the ${\\tt lda}$ module of the ${\\tt sklearn}$ library. As we did with logistic regression and KNN, we'll fit the model using only the observations before 2005, and then test the model on the data from 2005."
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {
65 | "collapsed": false
66 | },
67 | "outputs": [],
68 | "source": [
69 | "X_train = df[:'2004'][['Lag1','Lag2']]\n",
70 | "y_train = df[:'2004']['Direction']\n",
71 | "\n",
72 | "X_test = df['2005':][['Lag1','Lag2']]\n",
73 | "y_test = df['2005':]['Direction']\n",
74 | "\n",
75 | "lda = LDA()\n",
76 | "model = lda.fit(X_train, y_train)\n",
77 | "\n",
78 | "print(model.priors_)"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "The LDA output indicates prior probabilities of ${\\hat{\\pi}}_1 = 0.492$ and ${\\hat{\\pi}}_2 = 0.508$; in other words,\n",
86 | "49.2% of the training observations correspond to days during which the\n",
87 | "market went down."
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "collapsed": false
95 | },
96 | "outputs": [],
97 | "source": [
98 | "print(model.means_)"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "The above provides the group means; these are the average\n",
106 | "of each predictor within each class, and are used by LDA as estimates\n",
107 | "of $\\mu_k$. These suggest that there is a tendency for the previous 2 days’\n",
108 | "returns to be negative on days when the market increases, and a tendency\n",
109 | "for the previous days’ returns to be positive on days when the market\n",
110 | "declines. "
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {
117 | "collapsed": false
118 | },
119 | "outputs": [],
120 | "source": [
121 | "print(model.coef_)"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "The coefficients of linear discriminants output provides the linear\n",
129 | "combination of ${\\tt Lag1}$ and ${\\tt Lag2}$ that are used to form the LDA decision rule.\n",
130 | "\n",
131 | "If $−0.0554\\times{\\tt Lag1}−0.0443\\times{\\tt Lag2}$ is large, then the LDA classifier will\n",
132 | "predict a market increase, and if it is small, then the LDA classifier will\n",
133 | "predict a market decline. **Note**: these coefficients differ from those produced by ${\\tt R}$."
134 | ]
135 | },
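{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this linear rule in action, we can look at the decision values themselves (a small sketch, assuming ${\tt model}$ and ${\tt X\_test}$ from above); positive values correspond to a predicted increase and negative values to a predicted decline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Decision values for the first few test days; the sign determines the predicted class\n",
"scores = model.decision_function(X_test)\n",
"print(scores[0:5])"
]
},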
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "The ${\\tt predict()}$ function returns a list of LDA’s predictions about the movement of the market on the test data:"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {
147 | "collapsed": false
148 | },
149 | "outputs": [],
150 | "source": [
151 | "pred=model.predict(X_test)\n",
152 | "print(np.unique(pred, return_counts=True))"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "The model assigned 70 observations to the \"Down\" class, and 182 observations to the \"Up\" class. Let's check out the confusion matrix to see how this model is doing. We'll want to compare the **predicted class** (which we can find in ${\\tt pred}$) to the **true class** (found in ${\\tt y\\_test})$."
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "metadata": {
166 | "collapsed": false
167 | },
168 | "outputs": [],
169 | "source": [
170 | "print(confusion_matrix(pred, y_test))\n",
171 | "print(classification_report(y_test, pred, digits=3))"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "We can also get the predicted _probabilities_ using the ${\\tt predict\\_proba()}$ function:"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {
185 | "collapsed": false
186 | },
187 | "outputs": [],
188 | "source": [
189 | "pred_p = model.predict_proba(X_test)"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "Applying a 50% threshold to the posterior probabilities allows us to recreate\n",
197 | "the predictions:"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {
204 | "collapsed": false
205 | },
206 | "outputs": [],
207 | "source": [
208 | "print(np.unique(pred_p[:,1]>0.5, return_counts=True))"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "Notice that the posterior probability output by the model corresponds to\n",
216 | "the probability that the market will **increase**:"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {
223 | "collapsed": false
224 | },
225 | "outputs": [],
226 | "source": [
227 | "print np.stack((pred_p[10:20,1], pred[10:20])).T"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "If we wanted to use a posterior probability threshold other than 50% in\n",
235 | "order to make predictions, then we could easily do so. For instance, suppose\n",
236 | "that we wish to predict a market decrease only if we are very certain that the\n",
237 | "market will indeed decrease on that day—say, if the posterior probability\n",
238 | "is at least 90%:"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": null,
244 | "metadata": {
245 | "collapsed": false
246 | },
247 | "outputs": [],
248 | "source": [
249 | "print(np.unique(pred_p[:,1]>0.9, return_counts=True))"
250 | ]
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {},
255 | "source": [
256 | "No days in 2005 meet that threshold! In fact, the greatest posterior probability\n",
257 | "of decrease in all of 2005 was 54.2%:"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {
264 | "collapsed": false
265 | },
266 | "outputs": [],
267 | "source": [
268 | "max(pred_p[:,1])"
269 | ]
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "# 4.6.4 Quadratic Discriminant Analysis\n",
276 | "We will now fit a QDA model to the ${\\tt Smarket}$ data. QDA is implemented\n",
277 | "in ${\\tt sklearn}$ using the ${\\tt QDA()}$ function, which is part of the ${\\tt qda}$ module. The\n",
278 | "syntax is identical to that of ${\\tt LDA()}$."
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {
285 | "collapsed": false
286 | },
287 | "outputs": [],
288 | "source": [
289 | "qda = QDA()\n",
290 | "model2 = qda.fit(X_train, y_train)\n",
291 | "print model2.priors_\n",
292 | "print model2.means_"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "The output contains the group means. But it does not contain the coefficients\n",
300 | "of the linear discriminants, because the QDA classifier involves a\n",
301 | "_quadratic_, rather than a linear, function of the predictors. The ${\\tt predict()}$\n",
302 | "function works in exactly the same fashion as for LDA."
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {
309 | "collapsed": false
310 | },
311 | "outputs": [],
312 | "source": [
313 | "pred2=model2.predict(X_test)\n",
314 | "print(np.unique(pred2, return_counts=True))\n",
315 | "print(confusion_matrix(pred2, y_test))\n",
316 | "print(classification_report(y_test, pred2, digits=3))"
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "Interestingly, the QDA predictions are accurate almost 60% of the time,\n",
324 | "even though the 2005 data was not used to fit the model. This level of accuracy\n",
325 | "is quite impressive for stock market data, which is known to be quite\n",
326 | "hard to model accurately. \n",
327 | "\n",
328 | "This suggests that the quadratic form assumed\n",
329 | "by QDA may capture the true relationship more accurately than the linear\n",
330 | "forms assumed by LDA and logistic regression. However, we recommend\n",
331 | "evaluating this method’s performance on a larger test set before betting\n",
332 | "that this approach will consistently beat the market!"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "# An Application to Carseats Data\n",
340 | "Let's see how the ${\\tt LDA/QDA}$ approach performs on the ${\\tt Carseats}$ data set, which is\n",
341 | "included with ${\\tt ISLR}$. \n",
342 | "\n",
343 | "Recall: this is a simulated data set containing sales of child car seats at 400 different stores."
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": null,
349 | "metadata": {
350 | "collapsed": false
351 | },
352 | "outputs": [],
353 | "source": [
354 | "df2 = pd.read_csv('Carseats.csv')\n",
355 | "df2.head()"
356 | ]
357 | },
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {},
361 | "source": [
362 | "See if you can build a model that predicts ${\\tt ShelveLoc}$, the shelf location (Bad, Good, or Medium) of the product at each store. Don't forget to hold out some of the data for testing!"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {
369 | "collapsed": false
370 | },
371 | "outputs": [],
372 | "source": [
373 | "# Your code here"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "To get credit for this lab, please post your answers to the following questions:\n",
381 | "\n",
382 | "- What was your approach to building the model?\n",
383 | "- How did your model perform?\n",
384 | "- Was anything easier or more challenging than you anticipated?\n",
385 | "\n",
386 | "to Piazza: https://piazza.com/class/igwiv4w3ctb6rg?cid=23"
387 | ]
388 | }
389 | ],
390 | "metadata": {
391 | "kernelspec": {
392 | "display_name": "Python 2",
393 | "language": "python",
394 | "name": "python2"
395 | },
396 | "language_info": {
397 | "codemirror_mode": {
398 | "name": "ipython",
399 | "version": 2
400 | },
401 | "file_extension": ".py",
402 | "mimetype": "text/x-python",
403 | "name": "python",
404 | "nbconvert_exporter": "python",
405 | "pygments_lexer": "ipython2",
406 | "version": "2.7.11"
407 | }
408 | },
409 | "nbformat": 4,
410 | "nbformat_minor": 0
411 | }
412 |
--------------------------------------------------------------------------------
/Lab 7 - Cross-Validation in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Cross-Validation is a python adaptation of p. 190-194 of \"Introduction to Statistical Learning\n",
8 | "with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Written\n",
9 | "by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).\n",
10 | "\n",
11 | "# 5.3.1 The Validation Set Approach"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 80,
17 | "metadata": {
18 | "collapsed": true
19 | },
20 | "outputs": [],
21 | "source": [
22 | "import pandas as pd\n",
23 | "import numpy as np\n",
24 | "import statsmodels.api as sm"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "In this section, we'll explore the use of the validation set approach in order to estimate the\n",
32 | "test error rates that result from fitting various linear models on the ${\\tt Auto}$ data set."
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 81,
38 | "metadata": {
39 | "collapsed": false
40 | },
41 | "outputs": [
42 | {
43 | "name": "stdout",
44 | "output_type": "stream",
45 | "text": [
46 | "\n",
47 | "Int64Index: 392 entries, 0 to 396\n",
48 | "Data columns (total 9 columns):\n",
49 | "mpg 392 non-null float64\n",
50 | "cylinders 392 non-null int64\n",
51 | "displacement 392 non-null float64\n",
52 | "horsepower 392 non-null float64\n",
53 | "weight 392 non-null int64\n",
54 | "acceleration 392 non-null float64\n",
55 | "year 392 non-null int64\n",
56 | "origin 392 non-null int64\n",
57 | "name 392 non-null object\n",
58 | "dtypes: float64(4), int64(4), object(1)\n",
59 | "memory usage: 30.6+ KB\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "df1 = pd.read_csv('Auto.csv', na_values='?').dropna()\n",
65 | "df1.info()"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "We begin by using the ${\\tt sample()}$ function to split the set of observations\n",
73 | "into two halves, by selecting a random subset of 196 observations out of\n",
74 | "the original 392 observations. We refer to these observations as the training\n",
75 | "set.\n",
76 | "\n",
77 | "We'll use the ${\\tt random\\_state}$ parameter in order to set a seed for\n",
78 | "${\\tt python}$’s random number generator, so that you'll obtain precisely the same results as those shown below. It is generally a good idea to set a random seed when performing an analysis such as cross-validation\n",
79 | "that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 84,
85 | "metadata": {
86 | "collapsed": true
87 | },
88 | "outputs": [],
89 | "source": [
90 | "train = df1.sample(196, random_state = 1)\n",
91 | "test = df1[~df1.isin(train)].dropna(how = 'all')"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "We then use the ${\\tt sm.OLS.from\\_formula()}$ to fit a linear regression to predict ${\\tt mpg}$ from ${\\tt horsepower}$ using only\n",
99 | "the observations corresponding to the training set."
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 85,
105 | "metadata": {
106 | "collapsed": true
107 | },
108 | "outputs": [],
109 | "source": [
110 | "lm = sm.OLS.from_formula('mpg~horsepower', train)\n",
111 | "result = lm.fit()"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "metadata": {},
117 | "source": [
118 | "We now use the ${\\tt predict()}$ function to estimate the response for the test\n",
119 | "observations, and we use some ${\\tt numpy}$ functions to caclulate the MSE."
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 86,
125 | "metadata": {
126 | "collapsed": false
127 | },
128 | "outputs": [
129 | {
130 | "name": "stdout",
131 | "output_type": "stream",
132 | "text": [
133 | "23.361902892587235\n"
134 | ]
135 | }
136 | ],
137 | "source": [
138 | "pred = result.predict(test)\n",
139 | "\n",
140 | "MSE = np.mean(np.square(np.subtract(test[\"mpg\"], pred)))\n",
141 | " \n",
142 | "print(MSE)"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "Therefore, the estimated test MSE for the linear regression fit is 23.36. We\n",
150 | "can use the ${\\tt np.power()}$ function to estimate the test error for the polynomial\n",
151 | "and cubic regressions."
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 88,
157 | "metadata": {
158 | "collapsed": false
159 | },
160 | "outputs": [
161 | {
162 | "name": "stdout",
163 | "output_type": "stream",
164 | "text": [
165 | "20.252690858350192\n",
166 | "20.325609366115582\n"
167 | ]
168 | }
169 | ],
170 | "source": [
171 | "lm2 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2]]), train)\n",
172 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm2.fit().predict(test)))))\n",
173 | "\n",
174 | "lm3 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2,3]]), train)\n",
175 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm3.fit().predict(test)))))"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "These error rates are 20.25 and 20.33, respectively. If we choose a different\n",
183 | "training set instead, then we will obtain somewhat different errors on the\n",
184 | "validation set. We can test this out by setting a different random seed:"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 89,
190 | "metadata": {
191 | "collapsed": false
192 | },
193 | "outputs": [
194 | {
195 | "name": "stdout",
196 | "output_type": "stream",
197 | "text": [
198 | "23.214272449679587\n",
199 | "19.525710963117103\n",
200 | "19.667628097077426\n"
201 | ]
202 | }
203 | ],
204 | "source": [
205 | "train = df1.sample(196, random_state = 2)\n",
206 | "\n",
207 | "lm = sm.OLS.from_formula('mpg~horsepower', train)\n",
208 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm.fit().predict(test)))))\n",
209 | "\n",
210 | "lm2 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2]]), train)\n",
211 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm2.fit().predict(test)))))\n",
212 | "\n",
213 | "lm3 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2,3]]), train)\n",
214 | "print(np.mean(np.square(np.subtract(test[\"mpg\"], lm3.fit().predict(test)))))"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "Using this split of the observations into a training set and a validation\n",
222 | "set, we find that the validation set error rates for the models with linear,\n",
223 | "quadratic, and cubic terms are 23.21, 19.53, and 19.67, respectively.\n",
224 | "\n",
225 | "These results are consistent with our previous findings: a model that\n",
226 | "predicts ${\\tt mpg}$ using a quadratic function of ${\\tt horsepower}$ performs better than\n",
227 | "a model that involves only a linear function of ${\\tt horsepower}$, and there is\n",
228 | "little evidence in favor of a model that uses a cubic function of ${\\tt horsepower}$."
229 | ]
230 | },
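{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a first step toward **leave-one-out cross-validation**: a rough sketch using the older ${\tt sklearn.cross\_validation}$ API that iterates over leave-one-out splits of the first 10 observations. Note that we use ${\tt .iloc}$ to index the ${\tt DataFrame}$ by position."
]
},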
231 | {
232 | "cell_type": "code",
233 | "execution_count": 108,
234 | "metadata": {
235 | "collapsed": false
236 | },
237 | "outputs": [
238 | {
239 | "ename": "IndexError",
240 | "evalue": "indices are out-of-bounds",
241 | "output_type": "error",
242 | "traceback": [
243 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
244 | "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
245 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mloo\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mLeaveOneOut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtrain_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_index\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mloo\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mdf1\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
246 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1961\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mSeries\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mIndex\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1962\u001b[0m \u001b[0;31m# either boolean or fancy integer index\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1963\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1964\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mDataFrame\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1965\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_frame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
247 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m_getitem_array\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2006\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2007\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_convert_to_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2008\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtake\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconvert\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2009\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2010\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
248 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36mtake\u001b[0;34m(self, indices, axis, convert, is_copy)\u001b[0m\n\u001b[1;32m 1369\u001b[0m new_data = self._data.take(indices,\n\u001b[1;32m 1370\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_block_manager_axis\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1371\u001b[0;31m convert=True, verify=True)\n\u001b[0m\u001b[1;32m 1372\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_constructor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnew_data\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__finalize__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1373\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
249 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/internals.py\u001b[0m in \u001b[0;36mtake\u001b[0;34m(self, indexer, axis, verify, convert)\u001b[0m\n\u001b[1;32m 3617\u001b[0m \u001b[0mn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3618\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mconvert\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3619\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmaybe_convert_indices\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3620\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3621\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mverify\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
250 | "\u001b[0;32m/Users/jcrouser/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/indexing.py\u001b[0m in \u001b[0;36mmaybe_convert_indices\u001b[0;34m(indices, n)\u001b[0m\n\u001b[1;32m 1748\u001b[0m \u001b[0mmask\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mindices\u001b[0m \u001b[0;34m>=\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mindices\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1749\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1750\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mIndexError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"indices are out-of-bounds\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1751\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mindices\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1752\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
251 | "\u001b[0;31mIndexError\u001b[0m: indices are out-of-bounds"
252 | ]
253 | }
254 | ],
255 | "source": [
256 | "from sklearn.cross_validation import LeaveOneOut\n",
257 | "\n",
258 | "loo = LeaveOneOut(10)\n",
259 | "for train_index, test_index in loo:\n",
260 | " df1[train_index]"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {
267 | "collapsed": true
268 | },
269 | "outputs": [],
270 | "source": []
271 | }
272 | ],
273 | "metadata": {
274 | "kernelspec": {
275 | "display_name": "Python 3",
276 | "language": "python",
277 | "name": "python3"
278 | },
279 | "language_info": {
280 | "codemirror_mode": {
281 | "name": "ipython",
282 | "version": 3
283 | },
284 | "file_extension": ".py",
285 | "mimetype": "text/x-python",
286 | "name": "python",
287 | "nbconvert_exporter": "python",
288 | "pygments_lexer": "ipython3",
289 | "version": "3.5.1"
290 | }
291 | },
292 | "nbformat": 4,
293 | "nbformat_minor": 0
294 | }
295 |
--------------------------------------------------------------------------------
/Lab 8 - Subset Selection in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Subset Selection is a Python adaptation of p. 244-247 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "%matplotlib inline\n",
19 | "import pandas as pd\n",
20 | "import numpy as np\n",
21 | "import itertools\n",
22 | "import time\n",
23 | "import statsmodels.api as sm\n",
24 | "import matplotlib.pyplot as plt"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "# 6.5.1 Best Subset Selection\n",
32 | "\n",
33 | "Here we apply the best subset selection approach to the Hitters data. We\n",
34 | "wish to predict a baseball player’s Salary on the basis of various statistics\n",
35 | "associated with performance in the previous year. Let's take a quick look:"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "metadata": {
42 | "collapsed": false
43 | },
44 | "outputs": [],
45 | "source": [
46 | "df = pd.read_csv('Hitters.csv')\n",
47 | "df.head()"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "First of all, we note that the ${\\tt Salary}$ variable is missing for some of the\n",
55 | "players. The ${\\tt isnull()}$ function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a ${\\tt TRUE}$ value\n",
56 | "for any elements that are missing, and a ${\\tt FALSE}$ value for non-missing elements.\n",
57 | "The ${\\tt sum()}$ function can then be used to count all of the missing elements:"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {
64 | "collapsed": false
65 | },
66 | "outputs": [],
67 | "source": [
68 | "print(df[\"Salary\"].isnull().sum())"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "We see that ${\\tt Salary}$ is missing for 59 players. The ${\\tt dropna()}$ function\n",
76 | "removes all of the rows that have missing values in any variable:"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [],
86 | "source": [
87 | "# Print the dimensions of the original Hitters data (322 rows x 20 columns)\n",
88 | "print(df.shape)\n",
89 | "\n",
90 | "# Drop any rows the contain missing values, along with the player names\n",
91 | "df = df.dropna().drop('Player', axis=1)\n",
92 | "\n",
93 | "# Print the dimensions of the modified Hitters data (263 rows x 20 columns)\n",
94 | "print(df.shape)\n",
95 | "\n",
96 | "# One last check: should return 0\n",
97 | "print(df[\"Salary\"].isnull().sum())"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [],
107 | "source": [
108 | "dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])\n",
109 | "\n",
110 | "y = df.Salary\n",
111 | "\n",
112 | "# Drop the column with the independent variable (Salary), and columns for which we created dummy variables\n",
113 | "X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')\n",
114 | "\n",
115 | "# Define the feature set X.\n",
116 | "X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "We can perform best subset selection by identifying the best model that contains a given number of predictors, where **best** is quantified using RSS. We'll define a helper function to outputs the best set of variables for\n",
124 | "each model size:"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {
131 | "collapsed": true
132 | },
133 | "outputs": [],
134 | "source": [
135 | "def processSubset(feature_set):\n",
136 | " # Fit model on feature_set and calculate RSS\n",
137 | " model = sm.OLS(y,X[list(feature_set)])\n",
138 | " regr = model.fit()\n",
139 | " RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()\n",
140 | " return {\"model\":regr, \"RSS\":RSS}"
141 | ]
142 | },
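{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we define a second helper function that loops over every combination of $k$ predictors, calls ${\tt processSubset()}$ on each one, and keeps the model with the lowest RSS:"
]
},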
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {
147 | "collapsed": false
148 | },
149 | "outputs": [],
150 | "source": [
151 | "def getBest(k):\n",
152 | " \n",
153 | " tic = time.time()\n",
154 | " \n",
155 | " results = []\n",
156 | " \n",
157 | " for combo in itertools.combinations(X.columns, k):\n",
158 | " results.append(processSubset(combo))\n",
159 | " \n",
160 | " # Wrap everything up in a nice dataframe\n",
161 | " models = pd.DataFrame(results)\n",
162 | " \n",
163 | " # Choose the model with the highest RSS\n",
164 | " best_model = models.loc[models['RSS'].argmin()]\n",
165 | " \n",
166 | " toc = time.time()\n",
167 | " print(\"Processed \", models.shape[0], \"models on\", k, \"predictors in\", (toc-tic), \"seconds.\")\n",
168 | " \n",
169 | " # Return the best model, along with some other useful information about the model\n",
170 | " return best_model"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "This returns a ${\\tt DataFrame}$ containing the best model that we generated, along with some extra information about the model. Now we want to call that function for each number of predictors $k$:"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": null,
183 | "metadata": {
184 | "collapsed": false
185 | },
186 | "outputs": [],
187 | "source": [
188 | "# Could take quite awhile to complete...\n",
189 | "\n",
190 | "models = pd.DataFrame(columns=[\"RSS\", \"model\"])\n",
191 | "\n",
192 | "tic = time.time()\n",
193 | "for i in range(1,8):\n",
194 | " models.loc[i] = getBest(i)\n",
195 | "\n",
196 | "toc = time.time()\n",
197 | "print(\"Total elapsed time:\", (toc-tic), \"seconds.\")"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "Now we have one big $\\tt{DataFrame}$ that contains the best models we've generated. Let's take a look at the first few:"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {
211 | "collapsed": false
212 | },
213 | "outputs": [],
214 | "source": [
215 | "models"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "If we want to access the details of each model, no problem! We can get a full rundown of a single model using the ${\\tt summary()}$ function:"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {
229 | "collapsed": false
230 | },
231 | "outputs": [],
232 | "source": [
233 | "print(models.loc[2, \"model\"].summary())"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "This output indicates that the best two-variable model\n",
241 | "contains only ${\\tt Hits}$ and ${\\tt CRBI}$. To save time, we only generated results\n",
242 | "up to the best 11-variable model. You can use the functions we defined above to explore as many variables as are desired."
243 | ]
244 | },
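245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "For example, the next cell asks for the best 19-variable model; since there is only one way to choose all 19 predictors, this simply fits and summarizes the full model:"
250 | ]
251 | },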
245 | {
246 | "cell_type": "code",
247 | "execution_count": null,
248 | "metadata": {
249 | "collapsed": false
250 | },
251 | "outputs": [],
252 | "source": [
253 | "print(getBest(19)[\"model\"].summary())"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "Rather than letting the results of our call to the ${\\tt summary()}$ function print to the screen, we can access just the parts we need using the model's attributes. For example, if we want the $R^2$ value:"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {
267 | "collapsed": false
268 | },
269 | "outputs": [],
270 | "source": [
271 | "models.loc[2, \"model\"].rsquared"
272 | ]
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "Excellent! In addition to the verbose output we get when we print the summary to the screen, fitting the ${\\tt OLM}$ also produced many other useful statistics such as adjusted $R^2$, AIC, and BIC. We can examine these to try to select the best overall model. Let's start by looking at $R^2$ across all our models:"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {
285 | "collapsed": false
286 | },
287 | "outputs": [],
288 | "source": [
289 | "# Gets the second element from each row ('model') and pulls out its rsquared attribute\n",
290 | "models.apply(lambda row: row[1].rsquared, axis=1)"
291 | ]
292 | },
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "As expected, the $R^2$ statistic increases monotonically as more\n",
298 | "variables are included.\n",
299 | "\n",
300 | "Plotting RSS, adjusted $R^2$, AIC, and BIC for all of the models at once will\n",
301 | "help us decide which model to select. Note the ${\\tt type=\"l\"}$ option tells ${\\tt R}$ to\n",
302 | "connect the plotted points with lines:"
303 | ]
304 | },
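305 | {
306 | "cell_type": "markdown",
307 | "metadata": {},
308 | "source": [
309 | "Before plotting, a quick reminder of what these criteria measure: ${\tt statsmodels}$ computes AIC and BIC from the maximized log-likelihood $\hat{L}$ of each fit, roughly $AIC = 2k - 2\ln\hat{L}$ and $BIC = k\ln(n) - 2\ln\hat{L}$ (where $k$ is the number of estimated parameters and $n$ is the number of observations), while adjusted $R^2$ penalizes $R^2$ for the number of predictors used. Smaller AIC and BIC (and larger adjusted $R^2$) indicate a better trade-off between fit and complexity, and BIC's $\ln(n)$ penalty tends to favor smaller models than AIC:"
310 | ]
311 | },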
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {
309 | "collapsed": false
310 | },
311 | "outputs": [],
312 | "source": [
313 | "plt.figure(figsize=(20,10))\n",
314 | "plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})\n",
315 | "\n",
316 | "# Set up a 2x2 grid so we can look at 4 plots at once\n",
317 | "plt.subplot(2, 2, 1)\n",
318 | "\n",
319 | "# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.\n",
320 | "# The argmax() function can be used to identify the location of the maximum point of a vector\n",
321 | "plt.plot(models[\"RSS\"])\n",
322 | "plt.xlabel('# Predictors')\n",
323 | "plt.ylabel('RSS')\n",
324 | "\n",
325 | "# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.\n",
326 | "# The argmax() function can be used to identify the location of the maximum point of a vector\n",
327 | "\n",
328 | "rsquared_adj = models.apply(lambda row: row[1].rsquared_adj, axis=1)\n",
329 | "\n",
330 | "plt.subplot(2, 2, 2)\n",
331 | "plt.plot(rsquared_adj)\n",
332 | "plt.plot(rsquared_adj.argmax(), rsquared_adj.max(), \"or\")\n",
333 | "plt.xlabel('# Predictors')\n",
334 | "plt.ylabel('adjusted rsquared')\n",
335 | "\n",
336 | "# We'll do the same for AIC and BIC, this time looking for the models with the SMALLEST statistic\n",
337 | "aic = models.apply(lambda row: row[1].aic, axis=1)\n",
338 | "\n",
339 | "plt.subplot(2, 2, 3)\n",
340 | "plt.plot(aic)\n",
341 | "plt.plot(aic.argmin(), aic.min(), \"or\")\n",
342 | "plt.xlabel('# Predictors')\n",
343 | "plt.ylabel('AIC')\n",
344 | "\n",
345 | "bic = models.apply(lambda row: row[1].bic, axis=1)\n",
346 | "\n",
347 | "plt.subplot(2, 2, 4)\n",
348 | "plt.plot(bic)\n",
349 | "plt.plot(bic.argmin(), bic.min(), \"or\")\n",
350 | "plt.xlabel('# Predictors')\n",
351 | "plt.ylabel('BIC')"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "Recall that in the second step of our selection process, we narrowed the field down to just one model on any $k<=p$ predictors. We see that according to BIC, the best performer is the model with 6 variables. According to AIC and adjusted $R^2$ something a bit more complex might be better. Again, no one measure is going to give us an entirely accurate picture... but they all agree that a model with 5 or fewer predictors is insufficient."
359 | ]
360 | },
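361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "If you want to see which six predictors that model actually uses, one option is to read them off the fitted model's ${\tt exog\_names}$ attribute (the same trick we'll use again below when comparing selection methods):"
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {
372 | "collapsed": false
373 | },
374 | "outputs": [],
375 | "source": [
376 | "print(models.loc[6, \"model\"].model.exog_names)"
377 | ]
378 | },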
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "# 6.5.2 Forward and Backward Stepwise Selection\n",
366 | "We can also use a similar approach to perform forward stepwise\n",
367 | "or backward stepwise selection, using a slight modification of the functions we defined above:"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": null,
373 | "metadata": {
374 | "collapsed": true
375 | },
376 | "outputs": [],
377 | "source": [
378 | "def forward(predictors):\n",
379 | "\n",
380 | " # Pull out predictors we still need to process\n",
381 | " remaining_predictors = [p for p in X.columns if p not in predictors]\n",
382 | " \n",
383 | " tic = time.time()\n",
384 | " \n",
385 | " results = []\n",
386 | " \n",
387 | " for p in remaining_predictors:\n",
388 | " results.append(processSubset(predictors+[p]))\n",
389 | " \n",
390 | " # Wrap everything up in a nice dataframe\n",
391 | " models = pd.DataFrame(results)\n",
392 | " \n",
393 | " # Choose the model with the highest RSS\n",
394 | " best_model = models.loc[models['RSS'].argmin()]\n",
395 | " \n",
396 | " toc = time.time()\n",
397 | " print(\"Processed \", models.shape[0], \"models on\", len(predictors)+1, \"predictors in\", (toc-tic), \"seconds.\")\n",
398 | " \n",
399 | " # Return the best model, along with some other useful information about the model\n",
400 | " return best_model"
401 | ]
402 | },
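403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "The payoff is computational: instead of searching all $\binom{p}{k}$ subsets at each size, forward stepwise only tries adding each remaining predictor to the current model, so the whole search fits on the order of $p(p+1)/2$ models rather than $2^p$."
408 | ]
409 | },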
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "Now let's see how much faster it runs!"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": null,
413 | "metadata": {
414 | "collapsed": false
415 | },
416 | "outputs": [],
417 | "source": [
418 | "models2 = pd.DataFrame(columns=[\"RSS\", \"model\"])\n",
419 | "\n",
420 | "tic = time.time()\n",
421 | "predictors = []\n",
422 | "\n",
423 | "for i in range(1,len(X.columns)+1): \n",
424 | " models2.loc[i] = forward(predictors)\n",
425 | " predictors = models2.loc[i][\"model\"].model.exog_names\n",
426 | "\n",
427 | "toc = time.time()\n",
428 | "print(\"Total elapsed time:\", (toc-tic), \"seconds.\")"
429 | ]
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "Phew! That's a lot better. Let's take a look:"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {
442 | "collapsed": false
443 | },
444 | "outputs": [],
445 | "source": [
446 | "print(models.loc[1, \"model\"].summary())\n",
447 | "print(models.loc[2, \"model\"].summary())"
448 | ]
449 | },
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "We see that using forward stepwise selection, the best one-variable\n",
455 | "model contains only ${\\tt Hits}$, and the best two-variable model additionally\n",
456 | "includes ${\\tt CRBI}$. Let's see how the models stack up against best subset selection:"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": null,
462 | "metadata": {
463 | "collapsed": false
464 | },
465 | "outputs": [],
466 | "source": [
467 | "print(models.loc[6, \"model\"].summary())\n",
468 | "print(models2.loc[6, \"model\"].summary())"
469 | ]
470 | },
471 | {
472 | "cell_type": "markdown",
473 | "metadata": {
474 | "collapsed": true
475 | },
476 | "source": [
477 | "For this data, the best one-variable through six-variable\n",
478 | "models are each identical for best subset and forward selection.\n",
479 | "\n",
480 | "# Backward Selection\n",
481 | "Not much has to change to implement backward selection... just looping through the predictors in reverse!"
482 | ]
483 | },
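484 | {
485 | "cell_type": "markdown",
486 | "metadata": {},
487 | "source": [
488 | "One caveat worth remembering: backward selection starts from the model containing all $p$ predictors and removes one at a time, so it can only be used when $n > p$ (otherwise the full model can't be fit). Like forward selection, it fits on the order of $p(p+1)/2$ models rather than $2^p$."
489 | ]
490 | },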
484 | {
485 | "cell_type": "code",
486 | "execution_count": null,
487 | "metadata": {
488 | "collapsed": true
489 | },
490 | "outputs": [],
491 | "source": [
492 | "def backward(predictors):\n",
493 | " \n",
494 | " tic = time.time()\n",
495 | " \n",
496 | " results = []\n",
497 | " \n",
498 | " for combo in itertools.combinations(predictors, len(predictors)-1):\n",
499 | " results.append(processSubset(combo))\n",
500 | " \n",
501 | " # Wrap everything up in a nice dataframe\n",
502 | " models = pd.DataFrame(results)\n",
503 | " \n",
504 | " # Choose the model with the highest RSS\n",
505 | " best_model = models.loc[models['RSS'].argmin()]\n",
506 | " \n",
507 | " toc = time.time()\n",
508 | " print(\"Processed \", models.shape[0], \"models on\", len(predictors)-1, \"predictors in\", (toc-tic), \"seconds.\")\n",
509 | " \n",
510 | " # Return the best model, along with some other useful information about the model\n",
511 | " return best_model"
512 | ]
513 | },
514 | {
515 | "cell_type": "code",
516 | "execution_count": null,
517 | "metadata": {
518 | "collapsed": false
519 | },
520 | "outputs": [],
521 | "source": [
522 | "models3 = pd.DataFrame(columns=[\"RSS\", \"model\"], index = range(1,len(X.columns)))\n",
523 | "\n",
524 | "tic = time.time()\n",
525 | "predictors = X.columns\n",
526 | "\n",
527 | "while(len(predictors) > 1): \n",
528 | " models3.loc[len(predictors)-1] = backward(predictors)\n",
529 | " predictors = models3.loc[len(predictors)-1][\"model\"].model.exog_names\n",
530 | "\n",
531 | "toc = time.time()\n",
532 | "print(\"Total elapsed time:\", (toc-tic), \"seconds.\")"
533 | ]
534 | },
535 | {
536 | "cell_type": "markdown",
537 | "metadata": {},
538 | "source": [
539 | "For this data, the best one-variable through six-variable\n",
540 | "models are each identical for best subset and forward selection.\n",
541 | "However, the best seven-variable models identified by forward stepwise selection,\n",
542 | "backward stepwise selection, and best subset selection are different:"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": null,
548 | "metadata": {
549 | "collapsed": false
550 | },
551 | "outputs": [],
552 | "source": [
553 | "print(models.loc[7, \"model\"].params)"
554 | ]
555 | },
556 | {
557 | "cell_type": "code",
558 | "execution_count": null,
559 | "metadata": {
560 | "collapsed": false
561 | },
562 | "outputs": [],
563 | "source": [
564 | "print(models2.loc[7, \"model\"].params)"
565 | ]
566 | },
567 | {
568 | "cell_type": "code",
569 | "execution_count": null,
570 | "metadata": {
571 | "collapsed": false
572 | },
573 | "outputs": [],
574 | "source": [
575 | "print(models3.loc[7, \"model\"].params)"
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": null,
581 | "metadata": {
582 | "collapsed": true
583 | },
584 | "outputs": [],
585 | "source": []
586 | }
587 | ],
588 | "metadata": {
589 | "kernelspec": {
590 | "display_name": "Python 3",
591 | "language": "python",
592 | "name": "python3"
593 | },
594 | "language_info": {
595 | "codemirror_mode": {
596 | "name": "ipython",
597 | "version": 3
598 | },
599 | "file_extension": ".py",
600 | "mimetype": "text/x-python",
601 | "name": "python",
602 | "nbconvert_exporter": "python",
603 | "pygments_lexer": "ipython3",
604 | "version": "3.5.1"
605 | }
606 | },
607 | "nbformat": 4,
608 | "nbformat_minor": 0
609 | }
610 |
--------------------------------------------------------------------------------
/Lab 9 - Linear Model Selection in Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This lab on Model Validation using Validation and Cross-Validation is a Python adaptation of p. 248-251 of \"Introduction to Statistical Learning with Applications in R\" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "%matplotlib inline\n",
19 | "import pandas as pd\n",
20 | "import numpy as np\n",
21 | "import itertools\n",
22 | "import statsmodels.api as sm\n",
23 | "import matplotlib.pyplot as plt"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "# Model selection using the Validation Set Approach\n",
31 | "\n",
32 | "In Lab 8, we saw that it is possible to choose among a set of models of different\n",
33 | "sizes using $C_p$, BIC, and adjusted $R^2$. We will now consider how to do this\n",
34 | "using the validation set and cross-validation approaches.\n",
35 | "\n",
36 | "As in Lab 8, we'll be working with the ${\\tt Hitters}$ dataset from ${\\tt ISLR}$. Since we're trying to predict ${\\tt Salary}$ and we know from last time that some are missing, let's first drop all the rows with missing values and do a little cleanup:"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {
43 | "collapsed": false
44 | },
45 | "outputs": [],
46 | "source": [
47 | "df = pd.read_csv('Hitters.csv')\n",
48 | "\n",
49 | "# Drop any rows the contain missing values, along with the player names\n",
50 | "df = df.dropna().drop('Player', axis=1)\n",
51 | "\n",
52 | "# Get dummy variables\n",
53 | "dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])\n",
54 | "\n",
55 | "# Extract independent variable\n",
56 | "y = pd.DataFrame(df.Salary)\n",
57 | "\n",
58 | "# Drop the column with the independent variable (Salary), and columns for which we created dummy variables\n",
59 | "X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')\n",
60 | "\n",
61 | "# Define the feature set X.\n",
62 | "X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)"
63 | ]
64 | },
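65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "For context: ${\tt Hitters}$ contains 322 players, and ${\tt Salary}$ is missing for 59 of them, so after dropping incomplete rows we're left with 263 observations and 19 candidate predictors (once the three categorical variables are encoded as dummies)."
70 | ]
71 | },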
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "In order for the validation set approach to yield accurate estimates of the test\n",
70 | "error, we must use *only the training observations* to perform all aspects of\n",
71 | "model-fitting — including variable selection. Therefore, the determination of\n",
72 | "which model of a given size is best must be made using *only the training\n",
73 | "observations*. This point is subtle but important. If the full data set is used\n",
74 | "to perform the best subset selection step, the validation set errors and\n",
75 | "cross-validation errors that we obtain will not be accurate estimates of the\n",
76 | "test error.\n",
77 | "\n",
78 | "In order to use the validation set approach, we begin by splitting the\n",
79 | "observations into a training set and a test set. We do this by creating\n",
80 | "a random vector, train, of elements equal to TRUE if the corresponding\n",
81 | "observation is in the training set, and FALSE otherwise. The vector test has\n",
82 | "a TRUE if the observation is in the test set, and a FALSE otherwise. Note the\n",
83 | "${\\tt np.invert()}$ in the command to create test causes TRUEs to be switched to FALSEs and\n",
84 | "vice versa. We also set a random seed so that the user will obtain the same\n",
85 | "training set/test set split."
86 | ]
87 | },
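88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "Because ${\tt np.random.choice()}$ draws ${\tt True}$ and ${\tt False}$ with equal probability by default, the split below puts roughly half of the observations in the training set; if you wanted a different proportion, you could pass the ${\tt p}$ argument to ${\tt np.random.choice()}$."
93 | ]
94 | },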
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {
92 | "collapsed": false
93 | },
94 | "outputs": [],
95 | "source": [
96 | "np.random.seed(seed=12)\n",
97 | "train = np.random.choice([True, False], size = len(y), replace = True)\n",
98 | "test = np.invert(train)"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "We'll define our helper function to outputs the best set of variables for each model size like we did in Lab 8. Not that we'll need to modify this to take in both test and training sets, because we want the returned error to be the **test** error:"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {
112 | "collapsed": true
113 | },
114 | "outputs": [],
115 | "source": [
116 | "def processSubset(feature_set, X_train, y_train, X_test, y_test):\n",
117 | " # Fit model on feature_set and calculate RSS\n",
118 | " model = sm.OLS(y_train,X_train[list(feature_set)])\n",
119 | " regr = model.fit()\n",
120 | " RSS = ((regr.predict(X_test[list(feature_set)]) - y_test) ** 2).sum()\n",
121 | " return {\"model\":regr, \"RSS\":RSS}"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "And another function to perform forward selection:"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": null,
134 | "metadata": {
135 | "collapsed": true
136 | },
137 | "outputs": [],
138 | "source": [
139 | "def forward(predictors, X_train, y_train, X_test, y_test):\n",
140 | "\n",
141 | " # Pull out predictors we still need to process\n",
142 | " remaining_predictors = [p for p in X_train.columns if p not in predictors]\n",
143 | " \n",
144 | " results = []\n",
145 | " \n",
146 | " for p in remaining_predictors:\n",
147 | " results.append(processSubset(predictors+[p], X_train, y_train, X_test, y_test))\n",
148 | " \n",
149 | " # Wrap everything up in a nice dataframe\n",
150 | " models = pd.DataFrame(results)\n",
151 | " \n",
152 | " # Choose the model with the highest RSS\n",
153 | " best_model = models.loc[models['RSS'].argmin()]\n",
154 | " \n",
155 | " # Return the best model, along with some other useful information about the model\n",
156 | " return best_model"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "Now, we'll call our ${\\tt forward()}$ to the training set in order to perform forward selection for all nodel sizes:"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {
170 | "collapsed": false
171 | },
172 | "outputs": [],
173 | "source": [
174 | "models_train = pd.DataFrame(columns=[\"RSS\", \"model\"])\n",
175 | "\n",
176 | "predictors = []\n",
177 | "\n",
178 | "for i in range(1,len(X.columns)+1): \n",
179 | " models_train.loc[i] = forward(predictors, X[train], y[train][\"Salary\"], X[test], y[test][\"Salary\"])\n",
180 | " predictors = models_train.loc[i][\"model\"].model.exog_names"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {
186 | "collapsed": true
187 | },
188 | "source": [
189 | "Now let's plot the errors, and find the model that minimizes it:"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": null,
195 | "metadata": {
196 | "collapsed": false
197 | },
198 | "outputs": [],
199 | "source": [
200 | "plt.plot(models_train[\"RSS\"])\n",
201 | "plt.xlabel('# Predictors')\n",
202 | "plt.ylabel('RSS')\n",
203 | "plt.plot(models_train[\"RSS\"].argmin(), models_train[\"RSS\"].min(), \"or\")"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "Viola! We find that the best model (according to the validation set approach) is the one that contains 10 predictors.\n",
211 | "\n",
212 | "Now that we know what we're looking for, let's perform forward selection on the full dataset and select the best 10-predictor model. It is important that we make use of the *full\n",
213 | "data set* in order to obtain more accurate coefficient estimates. Note that\n",
214 | "we perform best subset selection on the full data set and select the best 10-predictor\n",
215 | "model, rather than simply using the predictors that we obtained\n",
216 | "from the training set, because the best 10-predictor model on the full data\n",
217 | "set may differ from the corresponding model on the training set."
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {
224 | "collapsed": false
225 | },
226 | "outputs": [],
227 | "source": [
228 | "models_full = pd.DataFrame(columns=[\"RSS\", \"model\"])\n",
229 | "\n",
230 | "predictors = []\n",
231 | "\n",
232 | "for i in range(1,20): \n",
233 | " models_full.loc[i] = forward(predictors, X, y[\"Salary\"], X, y[\"Salary\"])\n",
234 | " predictors = models_full.loc[i][\"model\"].model.exog_names"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "In fact, we see that the best ten-variable model on the full data set has a\n",
242 | "**different set of predictors** than the best ten-variable model on the training\n",
243 | "set:"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": null,
249 | "metadata": {
250 | "collapsed": false
251 | },
252 | "outputs": [],
253 | "source": [
254 | "print(models_train.loc[10, \"model\"].model.exog_names)\n",
255 | "print(models_full.loc[10, \"model\"].model.exog_names)"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "# Model selection using Cross-Validation\n",
263 | "\n",
264 | "Now let's try to choose among the models of different sizes using cross-validation.\n",
265 | "This approach is somewhat involved, as we must perform forward selection within each of the $k$ training sets. Despite this, we see that\n",
266 | "with its clever subsetting syntax, ${\\tt python}$ makes this job quite easy. First, we\n",
267 | "create a vector that assigns each observation to one of $k = 10$ folds, and\n",
268 | "we create a DataFrame in which we will store the results:"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": null,
274 | "metadata": {
275 | "collapsed": false
276 | },
277 | "outputs": [],
278 | "source": [
279 | "k=10 # number of folds\n",
280 | "np.random.seed(seed=1)\n",
281 | "folds = np.random.choice(k, size = len(y), replace = True)\n",
282 | "\n",
283 | "# Create a DataFrame to store the results of our upcoming calculations\n",
284 | "cv_errors = pd.DataFrame(columns=range(1,k+1), index=range(1,20))\n",
285 | "cv_errors = cv_errors.fillna(0)\n",
286 | "cv_errors"
287 | ]
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {
292 | "collapsed": true
293 | },
294 | "source": [
295 | "Now let's write a for loop that performs cross-validation. In the $j^{th}$ fold, the\n",
296 | "elements of folds that equal $j$ are in the test set, and the remainder are in\n",
297 | "the training set. We make our predictions for each model size, compute the test errors on the appropriate subset,\n",
298 | "and store them in the appropriate slot in the matrix ${\\tt cv.errors}$."
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": null,
304 | "metadata": {
305 | "collapsed": false
306 | },
307 | "outputs": [],
308 | "source": [
309 | "models_cv = pd.DataFrame(columns=[\"RSS\", \"model\"])\n",
310 | " \n",
311 | "# Outer loop iterates over all folds\n",
312 | "for j in range(1,k+1):\n",
313 | "\n",
314 | " # Reset predictors\n",
315 | " predictors = []\n",
316 | " \n",
317 | " # Inner loop iterates over each size i\n",
318 | " for i in range(1,len(X.columns)+1): \n",
319 | " \n",
320 | " # The perform forward selection on the full dataset minus the jth fold, test on jth fold\n",
321 | " models_cv.loc[i] = forward(predictors, X[folds != (j-1)], y[folds != (j-1)][\"Salary\"], X[folds == (j-1)], y[folds == (j-1)][\"Salary\"])\n",
322 | " \n",
323 | " # Save the cross-validated error for this fold\n",
324 | " cv_errors[j][i] = models_cv.loc[i][\"RSS\"]\n",
325 | "\n",
326 | " # Extract the predictors\n",
327 | " predictors = models_cv.loc[i][\"model\"].model.exog_names\n",
328 | " "
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {
335 | "collapsed": false
336 | },
337 | "outputs": [],
338 | "source": [
339 | "cv_errors"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {
345 | "collapsed": true
346 | },
347 | "source": [
348 | "This has filled up the ${\\tt cv\\_errors}$ DataFrame such that the $(i,j)^{th}$ element corresponds\n",
349 | "to the test MSE for the $i^{th}$ cross-validation fold for the best $j$-variable\n",
350 | "model. We can then use the ${\\tt apply()}$ function to take the ${\\tt mean}$ over the columns of this\n",
351 | "matrix. This will give us a vector for which the $j^{th}$ element is the cross-validation\n",
352 | "error for the $j$-variable model."
353 | ]
354 | },
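355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "In other words, for each model size we're computing the usual $k$-fold cross-validation estimate\n",
360 | "\n",
361 | "$$CV_{(k)} = \frac{1}{k}\sum_{j=1}^{k} \mathrm{Err}_j,$$\n",
362 | "\n",
363 | "where $\mathrm{Err}_j$ is the error measured on the $j^{th}$ held-out fold (here, the test RSS returned by ${\tt processSubset()}$)."
364 | ]
365 | },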
355 | {
356 | "cell_type": "code",
357 | "execution_count": null,
358 | "metadata": {
359 | "collapsed": false
360 | },
361 | "outputs": [],
362 | "source": [
363 | "cv_mean = cv_errors.apply(np.mean, axis=1)\n",
364 | "\n",
365 | "plt.plot(cv_mean)\n",
366 | "plt.xlabel('# Predictors')\n",
367 | "plt.ylabel('CV Error')\n",
368 | "plt.plot(cv_mean.argmin(), cv_mean.min(), \"or\")"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "We see that cross-validation selects a 9-predictor model. Now let's go back to our results on the full data set in order to obtain the 9-predictor model."
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": null,
381 | "metadata": {
382 | "collapsed": false
383 | },
384 | "outputs": [],
385 | "source": [
386 | "print(models_full.loc[9, \"model\"].summary())"
387 | ]
388 | },
389 | {
390 | "cell_type": "markdown",
391 | "metadata": {},
392 | "source": [
393 | "For comparison, let's also take a look at the statistics from last lab:"
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {
400 | "collapsed": false
401 | },
402 | "outputs": [],
403 | "source": [
404 | "plt.figure(figsize=(20,10))\n",
405 | "plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})\n",
406 | "\n",
407 | "# Set up a 2x2 grid so we can look at 4 plots at once\n",
408 | "plt.subplot(2, 2, 1)\n",
409 | "\n",
410 | "# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.\n",
411 | "# The argmax() function can be used to identify the location of the maximum point of a vector\n",
412 | "plt.plot(models_full[\"RSS\"])\n",
413 | "plt.xlabel('# Predictors')\n",
414 | "plt.ylabel('RSS')\n",
415 | "\n",
416 | "# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.\n",
417 | "# The argmax() function can be used to identify the location of the maximum point of a vector\n",
418 | "\n",
419 | "rsquared_adj = models_full.apply(lambda row: row[1].rsquared_adj, axis=1)\n",
420 | "\n",
421 | "plt.subplot(2, 2, 2)\n",
422 | "plt.plot(rsquared_adj)\n",
423 | "plt.plot(rsquared_adj.argmax(), rsquared_adj.max(), \"or\")\n",
424 | "plt.xlabel('# Predictors')\n",
425 | "plt.ylabel('adjusted rsquared')\n",
426 | "\n",
427 | "# We'll do the same for AIC and BIC, this time looking for the models with the SMALLEST statistic\n",
428 | "aic = models_full.apply(lambda row: row[1].aic, axis=1)\n",
429 | "\n",
430 | "plt.subplot(2, 2, 3)\n",
431 | "plt.plot(aic)\n",
432 | "plt.plot(aic.argmin(), aic.min(), \"or\")\n",
433 | "plt.xlabel('# Predictors')\n",
434 | "plt.ylabel('AIC')\n",
435 | "\n",
436 | "bic = models_full.apply(lambda row: row[1].bic, axis=1)\n",
437 | "\n",
438 | "plt.subplot(2, 2, 4)\n",
439 | "plt.plot(bic)\n",
440 | "plt.plot(bic.argmin(), bic.min(), \"or\")\n",
441 | "plt.xlabel('# Predictors')\n",
442 | "plt.ylabel('BIC')"
443 | ]
444 | },
445 | {
446 | "cell_type": "markdown",
447 | "metadata": {},
448 | "source": [
449 | "Notice how some of the indicators are similar the cross-validated model, and others are very different?\n",
450 | "\n",
451 | "# Your turn!\n",
452 | "\n",
453 | "Now it's time to test out these approaches (best / forward / backward selection) and evaluation methods (adjusted training error, validation set, cross validation) on other datasets. You may want to work with a team on this portion of the lab.\n",
454 | "\n",
455 | "You may use any of the datasets included in ${\\tt ISLR}$, or choose one from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html). Download a dataset, and try to determine the optimal set of parameters to use to model it!"
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "execution_count": null,
461 | "metadata": {
462 | "collapsed": true
463 | },
464 | "outputs": [],
465 | "source": [
466 | "# Your code here"
467 | ]
468 | },
469 | {
470 | "cell_type": "markdown",
471 | "metadata": {},
472 | "source": [
473 | "To get credit for this lab, please post your answers to the following questions:\n",
474 | " - What dataset did you choose?\n",
475 | " - Which selection techniques did you try?\n",
476 | " - Which evaluation techniques did you try?\n",
477 | " - What did you determine was the best set of parameters to model this data?\n",
478 | " - How well did this model perform?\n",
479 | " \n",
480 | "to Piazza: https://piazza.com/class/igwiv4w3ctb6rg?cid=35"
481 | ]
482 | }
483 | ],
484 | "metadata": {
485 | "kernelspec": {
486 | "display_name": "Python 3",
487 | "language": "python",
488 | "name": "python3"
489 | },
490 | "language_info": {
491 | "codemirror_mode": {
492 | "name": "ipython",
493 | "version": 3
494 | },
495 | "file_extension": ".py",
496 | "mimetype": "text/x-python",
497 | "name": "python",
498 | "nbconvert_exporter": "python",
499 | "pygments_lexer": "ipython3",
500 | "version": "3.5.1"
501 | }
502 | },
503 | "nbformat": 4,
504 | "nbformat_minor": 0
505 | }
506 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # islr-python
2 |
3 | This project is a Python adaptation of the lab examples in
4 | "Introduction to Statistical Learning with Applications in R"
5 | by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
6 |
7 | Some of the original adaptations were produced by J. Warmenhoven, and updated by R. Jordan Crouser at Smith College for
8 | SDS293: Machine Learning (Spring 2016).
9 |
--------------------------------------------------------------------------------
/data/Auto.csv:
--------------------------------------------------------------------------------
1 | mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
2 | 18,8,307,130,3504,12,70,1,chevrolet chevelle malibu
3 | 15,8,350,165,3693,11.5,70,1,buick skylark 320
4 | 18,8,318,150,3436,11,70,1,plymouth satellite
5 | 16,8,304,150,3433,12,70,1,amc rebel sst
6 | 17,8,302,140,3449,10.5,70,1,ford torino
7 | 15,8,429,198,4341,10,70,1,ford galaxie 500
8 | 14,8,454,220,4354,9,70,1,chevrolet impala
9 | 14,8,440,215,4312,8.5,70,1,plymouth fury iii
10 | 14,8,455,225,4425,10,70,1,pontiac catalina
11 | 15,8,390,190,3850,8.5,70,1,amc ambassador dpl
12 | 15,8,383,170,3563,10,70,1,dodge challenger se
13 | 14,8,340,160,3609,8,70,1,plymouth 'cuda 340
14 | 15,8,400,150,3761,9.5,70,1,chevrolet monte carlo
15 | 14,8,455,225,3086,10,70,1,buick estate wagon (sw)
16 | 24,4,113,95,2372,15,70,3,toyota corona mark ii
17 | 22,6,198,95,2833,15.5,70,1,plymouth duster
18 | 18,6,199,97,2774,15.5,70,1,amc hornet
19 | 21,6,200,85,2587,16,70,1,ford maverick
20 | 27,4,97,88,2130,14.5,70,3,datsun pl510
21 | 26,4,97,46,1835,20.5,70,2,volkswagen 1131 deluxe sedan
22 | 25,4,110,87,2672,17.5,70,2,peugeot 504
23 | 24,4,107,90,2430,14.5,70,2,audi 100 ls
24 | 25,4,104,95,2375,17.5,70,2,saab 99e
25 | 26,4,121,113,2234,12.5,70,2,bmw 2002
26 | 21,6,199,90,2648,15,70,1,amc gremlin
27 | 10,8,360,215,4615,14,70,1,ford f250
28 | 10,8,307,200,4376,15,70,1,chevy c20
29 | 11,8,318,210,4382,13.5,70,1,dodge d200
30 | 9,8,304,193,4732,18.5,70,1,hi 1200d
31 | 27,4,97,88,2130,14.5,71,3,datsun pl510
32 | 28,4,140,90,2264,15.5,71,1,chevrolet vega 2300
33 | 25,4,113,95,2228,14,71,3,toyota corona
34 | 19,6,232,100,2634,13,71,1,amc gremlin
35 | 16,6,225,105,3439,15.5,71,1,plymouth satellite custom
36 | 17,6,250,100,3329,15.5,71,1,chevrolet chevelle malibu
37 | 19,6,250,88,3302,15.5,71,1,ford torino 500
38 | 18,6,232,100,3288,15.5,71,1,amc matador
39 | 14,8,350,165,4209,12,71,1,chevrolet impala
40 | 14,8,400,175,4464,11.5,71,1,pontiac catalina brougham
41 | 14,8,351,153,4154,13.5,71,1,ford galaxie 500
42 | 14,8,318,150,4096,13,71,1,plymouth fury iii
43 | 12,8,383,180,4955,11.5,71,1,dodge monaco (sw)
44 | 13,8,400,170,4746,12,71,1,ford country squire (sw)
45 | 13,8,400,175,5140,12,71,1,pontiac safari (sw)
46 | 18,6,258,110,2962,13.5,71,1,amc hornet sportabout (sw)
47 | 22,4,140,72,2408,19,71,1,chevrolet vega (sw)
48 | 19,6,250,100,3282,15,71,1,pontiac firebird
49 | 18,6,250,88,3139,14.5,71,1,ford mustang
50 | 23,4,122,86,2220,14,71,1,mercury capri 2000
51 | 28,4,116,90,2123,14,71,2,opel 1900
52 | 30,4,79,70,2074,19.5,71,2,peugeot 304
53 | 30,4,88,76,2065,14.5,71,2,fiat 124b
54 | 31,4,71,65,1773,19,71,3,toyota corolla 1200
55 | 35,4,72,69,1613,18,71,3,datsun 1200
56 | 27,4,97,60,1834,19,71,2,volkswagen model 111
57 | 26,4,91,70,1955,20.5,71,1,plymouth cricket
58 | 24,4,113,95,2278,15.5,72,3,toyota corona hardtop
59 | 25,4,97.5,80,2126,17,72,1,dodge colt hardtop
60 | 23,4,97,54,2254,23.5,72,2,volkswagen type 3
61 | 20,4,140,90,2408,19.5,72,1,chevrolet vega
62 | 21,4,122,86,2226,16.5,72,1,ford pinto runabout
63 | 13,8,350,165,4274,12,72,1,chevrolet impala
64 | 14,8,400,175,4385,12,72,1,pontiac catalina
65 | 15,8,318,150,4135,13.5,72,1,plymouth fury iii
66 | 14,8,351,153,4129,13,72,1,ford galaxie 500
67 | 17,8,304,150,3672,11.5,72,1,amc ambassador sst
68 | 11,8,429,208,4633,11,72,1,mercury marquis
69 | 13,8,350,155,4502,13.5,72,1,buick lesabre custom
70 | 12,8,350,160,4456,13.5,72,1,oldsmobile delta 88 royale
71 | 13,8,400,190,4422,12.5,72,1,chrysler newport royal
72 | 19,3,70,97,2330,13.5,72,3,mazda rx2 coupe
73 | 15,8,304,150,3892,12.5,72,1,amc matador (sw)
74 | 13,8,307,130,4098,14,72,1,chevrolet chevelle concours (sw)
75 | 13,8,302,140,4294,16,72,1,ford gran torino (sw)
76 | 14,8,318,150,4077,14,72,1,plymouth satellite custom (sw)
77 | 18,4,121,112,2933,14.5,72,2,volvo 145e (sw)
78 | 22,4,121,76,2511,18,72,2,volkswagen 411 (sw)
79 | 21,4,120,87,2979,19.5,72,2,peugeot 504 (sw)
80 | 26,4,96,69,2189,18,72,2,renault 12 (sw)
81 | 22,4,122,86,2395,16,72,1,ford pinto (sw)
82 | 28,4,97,92,2288,17,72,3,datsun 510 (sw)
83 | 23,4,120,97,2506,14.5,72,3,toyouta corona mark ii (sw)
84 | 28,4,98,80,2164,15,72,1,dodge colt (sw)
85 | 27,4,97,88,2100,16.5,72,3,toyota corolla 1600 (sw)
86 | 13,8,350,175,4100,13,73,1,buick century 350
87 | 14,8,304,150,3672,11.5,73,1,amc matador
88 | 13,8,350,145,3988,13,73,1,chevrolet malibu
89 | 14,8,302,137,4042,14.5,73,1,ford gran torino
90 | 15,8,318,150,3777,12.5,73,1,dodge coronet custom
91 | 12,8,429,198,4952,11.5,73,1,mercury marquis brougham
92 | 13,8,400,150,4464,12,73,1,chevrolet caprice classic
93 | 13,8,351,158,4363,13,73,1,ford ltd
94 | 14,8,318,150,4237,14.5,73,1,plymouth fury gran sedan
95 | 13,8,440,215,4735,11,73,1,chrysler new yorker brougham
96 | 12,8,455,225,4951,11,73,1,buick electra 225 custom
97 | 13,8,360,175,3821,11,73,1,amc ambassador brougham
98 | 18,6,225,105,3121,16.5,73,1,plymouth valiant
99 | 16,6,250,100,3278,18,73,1,chevrolet nova custom
100 | 18,6,232,100,2945,16,73,1,amc hornet
101 | 18,6,250,88,3021,16.5,73,1,ford maverick
102 | 23,6,198,95,2904,16,73,1,plymouth duster
103 | 26,4,97,46,1950,21,73,2,volkswagen super beetle
104 | 11,8,400,150,4997,14,73,1,chevrolet impala
105 | 12,8,400,167,4906,12.5,73,1,ford country
106 | 13,8,360,170,4654,13,73,1,plymouth custom suburb
107 | 12,8,350,180,4499,12.5,73,1,oldsmobile vista cruiser
108 | 18,6,232,100,2789,15,73,1,amc gremlin
109 | 20,4,97,88,2279,19,73,3,toyota carina
110 | 21,4,140,72,2401,19.5,73,1,chevrolet vega
111 | 22,4,108,94,2379,16.5,73,3,datsun 610
112 | 18,3,70,90,2124,13.5,73,3,maxda rx3
113 | 19,4,122,85,2310,18.5,73,1,ford pinto
114 | 21,6,155,107,2472,14,73,1,mercury capri v6
115 | 26,4,98,90,2265,15.5,73,2,fiat 124 sport coupe
116 | 15,8,350,145,4082,13,73,1,chevrolet monte carlo s
117 | 16,8,400,230,4278,9.5,73,1,pontiac grand prix
118 | 29,4,68,49,1867,19.5,73,2,fiat 128
119 | 24,4,116,75,2158,15.5,73,2,opel manta
120 | 20,4,114,91,2582,14,73,2,audi 100ls
121 | 19,4,121,112,2868,15.5,73,2,volvo 144ea
122 | 15,8,318,150,3399,11,73,1,dodge dart custom
123 | 24,4,121,110,2660,14,73,2,saab 99le
124 | 20,6,156,122,2807,13.5,73,3,toyota mark ii
125 | 11,8,350,180,3664,11,73,1,oldsmobile omega
126 | 20,6,198,95,3102,16.5,74,1,plymouth duster
127 | 19,6,232,100,2901,16,74,1,amc hornet
128 | 15,6,250,100,3336,17,74,1,chevrolet nova
129 | 31,4,79,67,1950,19,74,3,datsun b210
130 | 26,4,122,80,2451,16.5,74,1,ford pinto
131 | 32,4,71,65,1836,21,74,3,toyota corolla 1200
132 | 25,4,140,75,2542,17,74,1,chevrolet vega
133 | 16,6,250,100,3781,17,74,1,chevrolet chevelle malibu classic
134 | 16,6,258,110,3632,18,74,1,amc matador
135 | 18,6,225,105,3613,16.5,74,1,plymouth satellite sebring
136 | 16,8,302,140,4141,14,74,1,ford gran torino
137 | 13,8,350,150,4699,14.5,74,1,buick century luxus (sw)
138 | 14,8,318,150,4457,13.5,74,1,dodge coronet custom (sw)
139 | 14,8,302,140,4638,16,74,1,ford gran torino (sw)
140 | 14,8,304,150,4257,15.5,74,1,amc matador (sw)
141 | 29,4,98,83,2219,16.5,74,2,audi fox
142 | 26,4,79,67,1963,15.5,74,2,volkswagen dasher
143 | 26,4,97,78,2300,14.5,74,2,opel manta
144 | 31,4,76,52,1649,16.5,74,3,toyota corona
145 | 32,4,83,61,2003,19,74,3,datsun 710
146 | 28,4,90,75,2125,14.5,74,1,dodge colt
147 | 24,4,90,75,2108,15.5,74,2,fiat 128
148 | 26,4,116,75,2246,14,74,2,fiat 124 tc
149 | 24,4,120,97,2489,15,74,3,honda civic
150 | 26,4,108,93,2391,15.5,74,3,subaru
151 | 31,4,79,67,2000,16,74,2,fiat x1.9
152 | 19,6,225,95,3264,16,75,1,plymouth valiant custom
153 | 18,6,250,105,3459,16,75,1,chevrolet nova
154 | 15,6,250,72,3432,21,75,1,mercury monarch
155 | 15,6,250,72,3158,19.5,75,1,ford maverick
156 | 16,8,400,170,4668,11.5,75,1,pontiac catalina
157 | 15,8,350,145,4440,14,75,1,chevrolet bel air
158 | 16,8,318,150,4498,14.5,75,1,plymouth grand fury
159 | 14,8,351,148,4657,13.5,75,1,ford ltd
160 | 17,6,231,110,3907,21,75,1,buick century
161 | 16,6,250,105,3897,18.5,75,1,chevroelt chevelle malibu
162 | 15,6,258,110,3730,19,75,1,amc matador
163 | 18,6,225,95,3785,19,75,1,plymouth fury
164 | 21,6,231,110,3039,15,75,1,buick skyhawk
165 | 20,8,262,110,3221,13.5,75,1,chevrolet monza 2+2
166 | 13,8,302,129,3169,12,75,1,ford mustang ii
167 | 29,4,97,75,2171,16,75,3,toyota corolla
168 | 23,4,140,83,2639,17,75,1,ford pinto
169 | 20,6,232,100,2914,16,75,1,amc gremlin
170 | 23,4,140,78,2592,18.5,75,1,pontiac astro
171 | 24,4,134,96,2702,13.5,75,3,toyota corona
172 | 25,4,90,71,2223,16.5,75,2,volkswagen dasher
173 | 24,4,119,97,2545,17,75,3,datsun 710
174 | 18,6,171,97,2984,14.5,75,1,ford pinto
175 | 29,4,90,70,1937,14,75,2,volkswagen rabbit
176 | 19,6,232,90,3211,17,75,1,amc pacer
177 | 23,4,115,95,2694,15,75,2,audi 100ls
178 | 23,4,120,88,2957,17,75,2,peugeot 504
179 | 22,4,121,98,2945,14.5,75,2,volvo 244dl
180 | 25,4,121,115,2671,13.5,75,2,saab 99le
181 | 33,4,91,53,1795,17.5,75,3,honda civic cvcc
182 | 28,4,107,86,2464,15.5,76,2,fiat 131
183 | 25,4,116,81,2220,16.9,76,2,opel 1900
184 | 25,4,140,92,2572,14.9,76,1,capri ii
185 | 26,4,98,79,2255,17.7,76,1,dodge colt
186 | 27,4,101,83,2202,15.3,76,2,renault 12tl
187 | 17.5,8,305,140,4215,13,76,1,chevrolet chevelle malibu classic
188 | 16,8,318,150,4190,13,76,1,dodge coronet brougham
189 | 15.5,8,304,120,3962,13.9,76,1,amc matador
190 | 14.5,8,351,152,4215,12.8,76,1,ford gran torino
191 | 22,6,225,100,3233,15.4,76,1,plymouth valiant
192 | 22,6,250,105,3353,14.5,76,1,chevrolet nova
193 | 24,6,200,81,3012,17.6,76,1,ford maverick
194 | 22.5,6,232,90,3085,17.6,76,1,amc hornet
195 | 29,4,85,52,2035,22.2,76,1,chevrolet chevette
196 | 24.5,4,98,60,2164,22.1,76,1,chevrolet woody
197 | 29,4,90,70,1937,14.2,76,2,vw rabbit
198 | 33,4,91,53,1795,17.4,76,3,honda civic
199 | 20,6,225,100,3651,17.7,76,1,dodge aspen se
200 | 18,6,250,78,3574,21,76,1,ford granada ghia
201 | 18.5,6,250,110,3645,16.2,76,1,pontiac ventura sj
202 | 17.5,6,258,95,3193,17.8,76,1,amc pacer d/l
203 | 29.5,4,97,71,1825,12.2,76,2,volkswagen rabbit
204 | 32,4,85,70,1990,17,76,3,datsun b-210
205 | 28,4,97,75,2155,16.4,76,3,toyota corolla
206 | 26.5,4,140,72,2565,13.6,76,1,ford pinto
207 | 20,4,130,102,3150,15.7,76,2,volvo 245
208 | 13,8,318,150,3940,13.2,76,1,plymouth volare premier v8
209 | 19,4,120,88,3270,21.9,76,2,peugeot 504
210 | 19,6,156,108,2930,15.5,76,3,toyota mark ii
211 | 16.5,6,168,120,3820,16.7,76,2,mercedes-benz 280s
212 | 16.5,8,350,180,4380,12.1,76,1,cadillac seville
213 | 13,8,350,145,4055,12,76,1,chevy c10
214 | 13,8,302,130,3870,15,76,1,ford f108
215 | 13,8,318,150,3755,14,76,1,dodge d100
216 | 31.5,4,98,68,2045,18.5,77,3,honda accord cvcc
217 | 30,4,111,80,2155,14.8,77,1,buick opel isuzu deluxe
218 | 36,4,79,58,1825,18.6,77,2,renault 5 gtl
219 | 25.5,4,122,96,2300,15.5,77,1,plymouth arrow gs
220 | 33.5,4,85,70,1945,16.8,77,3,datsun f-10 hatchback
221 | 17.5,8,305,145,3880,12.5,77,1,chevrolet caprice classic
222 | 17,8,260,110,4060,19,77,1,oldsmobile cutlass supreme
223 | 15.5,8,318,145,4140,13.7,77,1,dodge monaco brougham
224 | 15,8,302,130,4295,14.9,77,1,mercury cougar brougham
225 | 17.5,6,250,110,3520,16.4,77,1,chevrolet concours
226 | 20.5,6,231,105,3425,16.9,77,1,buick skylark
227 | 19,6,225,100,3630,17.7,77,1,plymouth volare custom
228 | 18.5,6,250,98,3525,19,77,1,ford granada
229 | 16,8,400,180,4220,11.1,77,1,pontiac grand prix lj
230 | 15.5,8,350,170,4165,11.4,77,1,chevrolet monte carlo landau
231 | 15.5,8,400,190,4325,12.2,77,1,chrysler cordoba
232 | 16,8,351,149,4335,14.5,77,1,ford thunderbird
233 | 29,4,97,78,1940,14.5,77,2,volkswagen rabbit custom
234 | 24.5,4,151,88,2740,16,77,1,pontiac sunbird coupe
235 | 26,4,97,75,2265,18.2,77,3,toyota corolla liftback
236 | 25.5,4,140,89,2755,15.8,77,1,ford mustang ii 2+2
237 | 30.5,4,98,63,2051,17,77,1,chevrolet chevette
238 | 33.5,4,98,83,2075,15.9,77,1,dodge colt m/m
239 | 30,4,97,67,1985,16.4,77,3,subaru dl
240 | 30.5,4,97,78,2190,14.1,77,2,volkswagen dasher
241 | 22,6,146,97,2815,14.5,77,3,datsun 810
242 | 21.5,4,121,110,2600,12.8,77,2,bmw 320i
243 | 21.5,3,80,110,2720,13.5,77,3,mazda rx-4
244 | 43.1,4,90,48,1985,21.5,78,2,volkswagen rabbit custom diesel
245 | 36.1,4,98,66,1800,14.4,78,1,ford fiesta
246 | 32.8,4,78,52,1985,19.4,78,3,mazda glc deluxe
247 | 39.4,4,85,70,2070,18.6,78,3,datsun b210 gx
248 | 36.1,4,91,60,1800,16.4,78,3,honda civic cvcc
249 | 19.9,8,260,110,3365,15.5,78,1,oldsmobile cutlass salon brougham
250 | 19.4,8,318,140,3735,13.2,78,1,dodge diplomat
251 | 20.2,8,302,139,3570,12.8,78,1,mercury monarch ghia
252 | 19.2,6,231,105,3535,19.2,78,1,pontiac phoenix lj
253 | 20.5,6,200,95,3155,18.2,78,1,chevrolet malibu
254 | 20.2,6,200,85,2965,15.8,78,1,ford fairmont (auto)
255 | 25.1,4,140,88,2720,15.4,78,1,ford fairmont (man)
256 | 20.5,6,225,100,3430,17.2,78,1,plymouth volare
257 | 19.4,6,232,90,3210,17.2,78,1,amc concord
258 | 20.6,6,231,105,3380,15.8,78,1,buick century special
259 | 20.8,6,200,85,3070,16.7,78,1,mercury zephyr
260 | 18.6,6,225,110,3620,18.7,78,1,dodge aspen
261 | 18.1,6,258,120,3410,15.1,78,1,amc concord d/l
262 | 19.2,8,305,145,3425,13.2,78,1,chevrolet monte carlo landau
263 | 17.7,6,231,165,3445,13.4,78,1,buick regal sport coupe (turbo)
264 | 18.1,8,302,139,3205,11.2,78,1,ford futura
265 | 17.5,8,318,140,4080,13.7,78,1,dodge magnum xe
266 | 30,4,98,68,2155,16.5,78,1,chevrolet chevette
267 | 27.5,4,134,95,2560,14.2,78,3,toyota corona
268 | 27.2,4,119,97,2300,14.7,78,3,datsun 510
269 | 30.9,4,105,75,2230,14.5,78,1,dodge omni
270 | 21.1,4,134,95,2515,14.8,78,3,toyota celica gt liftback
271 | 23.2,4,156,105,2745,16.7,78,1,plymouth sapporo
272 | 23.8,4,151,85,2855,17.6,78,1,oldsmobile starfire sx
273 | 23.9,4,119,97,2405,14.9,78,3,datsun 200-sx
274 | 20.3,5,131,103,2830,15.9,78,2,audi 5000
275 | 17,6,163,125,3140,13.6,78,2,volvo 264gl
276 | 21.6,4,121,115,2795,15.7,78,2,saab 99gle
277 | 16.2,6,163,133,3410,15.8,78,2,peugeot 604sl
278 | 31.5,4,89,71,1990,14.9,78,2,volkswagen scirocco
279 | 29.5,4,98,68,2135,16.6,78,3,honda accord lx
280 | 21.5,6,231,115,3245,15.4,79,1,pontiac lemans v6
281 | 19.8,6,200,85,2990,18.2,79,1,mercury zephyr 6
282 | 22.3,4,140,88,2890,17.3,79,1,ford fairmont 4
283 | 20.2,6,232,90,3265,18.2,79,1,amc concord dl 6
284 | 20.6,6,225,110,3360,16.6,79,1,dodge aspen 6
285 | 17,8,305,130,3840,15.4,79,1,chevrolet caprice classic
286 | 17.6,8,302,129,3725,13.4,79,1,ford ltd landau
287 | 16.5,8,351,138,3955,13.2,79,1,mercury grand marquis
288 | 18.2,8,318,135,3830,15.2,79,1,dodge st. regis
289 | 16.9,8,350,155,4360,14.9,79,1,buick estate wagon (sw)
290 | 15.5,8,351,142,4054,14.3,79,1,ford country squire (sw)
291 | 19.2,8,267,125,3605,15,79,1,chevrolet malibu classic (sw)
292 | 18.5,8,360,150,3940,13,79,1,chrysler lebaron town @ country (sw)
293 | 31.9,4,89,71,1925,14,79,2,vw rabbit custom
294 | 34.1,4,86,65,1975,15.2,79,3,maxda glc deluxe
295 | 35.7,4,98,80,1915,14.4,79,1,dodge colt hatchback custom
296 | 27.4,4,121,80,2670,15,79,1,amc spirit dl
297 | 25.4,5,183,77,3530,20.1,79,2,mercedes benz 300d
298 | 23,8,350,125,3900,17.4,79,1,cadillac eldorado
299 | 27.2,4,141,71,3190,24.8,79,2,peugeot 504
300 | 23.9,8,260,90,3420,22.2,79,1,oldsmobile cutlass salon brougham
301 | 34.2,4,105,70,2200,13.2,79,1,plymouth horizon
302 | 34.5,4,105,70,2150,14.9,79,1,plymouth horizon tc3
303 | 31.8,4,85,65,2020,19.2,79,3,datsun 210
304 | 37.3,4,91,69,2130,14.7,79,2,fiat strada custom
305 | 28.4,4,151,90,2670,16,79,1,buick skylark limited
306 | 28.8,6,173,115,2595,11.3,79,1,chevrolet citation
307 | 26.8,6,173,115,2700,12.9,79,1,oldsmobile omega brougham
308 | 33.5,4,151,90,2556,13.2,79,1,pontiac phoenix
309 | 41.5,4,98,76,2144,14.7,80,2,vw rabbit
310 | 38.1,4,89,60,1968,18.8,80,3,toyota corolla tercel
311 | 32.1,4,98,70,2120,15.5,80,1,chevrolet chevette
312 | 37.2,4,86,65,2019,16.4,80,3,datsun 310
313 | 28,4,151,90,2678,16.5,80,1,chevrolet citation
314 | 26.4,4,140,88,2870,18.1,80,1,ford fairmont
315 | 24.3,4,151,90,3003,20.1,80,1,amc concord
316 | 19.1,6,225,90,3381,18.7,80,1,dodge aspen
317 | 34.3,4,97,78,2188,15.8,80,2,audi 4000
318 | 29.8,4,134,90,2711,15.5,80,3,toyota corona liftback
319 | 31.3,4,120,75,2542,17.5,80,3,mazda 626
320 | 37,4,119,92,2434,15,80,3,datsun 510 hatchback
321 | 32.2,4,108,75,2265,15.2,80,3,toyota corolla
322 | 46.6,4,86,65,2110,17.9,80,3,mazda glc
323 | 27.9,4,156,105,2800,14.4,80,1,dodge colt
324 | 40.8,4,85,65,2110,19.2,80,3,datsun 210
325 | 44.3,4,90,48,2085,21.7,80,2,vw rabbit c (diesel)
326 | 43.4,4,90,48,2335,23.7,80,2,vw dasher (diesel)
327 | 36.4,5,121,67,2950,19.9,80,2,audi 5000s (diesel)
328 | 30,4,146,67,3250,21.8,80,2,mercedes-benz 240d
329 | 44.6,4,91,67,1850,13.8,80,3,honda civic 1500 gl
330 | 33.8,4,97,67,2145,18,80,3,subaru dl
331 | 29.8,4,89,62,1845,15.3,80,2,vokswagen rabbit
332 | 32.7,6,168,132,2910,11.4,80,3,datsun 280-zx
333 | 23.7,3,70,100,2420,12.5,80,3,mazda rx-7 gs
334 | 35,4,122,88,2500,15.1,80,2,triumph tr7 coupe
335 | 32.4,4,107,72,2290,17,80,3,honda accord
336 | 27.2,4,135,84,2490,15.7,81,1,plymouth reliant
337 | 26.6,4,151,84,2635,16.4,81,1,buick skylark
338 | 25.8,4,156,92,2620,14.4,81,1,dodge aries wagon (sw)
339 | 23.5,6,173,110,2725,12.6,81,1,chevrolet citation
340 | 30,4,135,84,2385,12.9,81,1,plymouth reliant
341 | 39.1,4,79,58,1755,16.9,81,3,toyota starlet
342 | 39,4,86,64,1875,16.4,81,1,plymouth champ
343 | 35.1,4,81,60,1760,16.1,81,3,honda civic 1300
344 | 32.3,4,97,67,2065,17.8,81,3,subaru
345 | 37,4,85,65,1975,19.4,81,3,datsun 210 mpg
346 | 37.7,4,89,62,2050,17.3,81,3,toyota tercel
347 | 34.1,4,91,68,1985,16,81,3,mazda glc 4
348 | 34.7,4,105,63,2215,14.9,81,1,plymouth horizon 4
349 | 34.4,4,98,65,2045,16.2,81,1,ford escort 4w
350 | 29.9,4,98,65,2380,20.7,81,1,ford escort 2h
351 | 33,4,105,74,2190,14.2,81,2,volkswagen jetta
352 | 33.7,4,107,75,2210,14.4,81,3,honda prelude
353 | 32.4,4,108,75,2350,16.8,81,3,toyota corolla
354 | 32.9,4,119,100,2615,14.8,81,3,datsun 200sx
355 | 31.6,4,120,74,2635,18.3,81,3,mazda 626
356 | 28.1,4,141,80,3230,20.4,81,2,peugeot 505s turbo diesel
357 | 30.7,6,145,76,3160,19.6,81,2,volvo diesel
358 | 25.4,6,168,116,2900,12.6,81,3,toyota cressida
359 | 24.2,6,146,120,2930,13.8,81,3,datsun 810 maxima
360 | 22.4,6,231,110,3415,15.8,81,1,buick century
361 | 26.6,8,350,105,3725,19,81,1,oldsmobile cutlass ls
362 | 20.2,6,200,88,3060,17.1,81,1,ford granada gl
363 | 17.6,6,225,85,3465,16.6,81,1,chrysler lebaron salon
364 | 28,4,112,88,2605,19.6,82,1,chevrolet cavalier
365 | 27,4,112,88,2640,18.6,82,1,chevrolet cavalier wagon
366 | 34,4,112,88,2395,18,82,1,chevrolet cavalier 2-door
367 | 31,4,112,85,2575,16.2,82,1,pontiac j2000 se hatchback
368 | 29,4,135,84,2525,16,82,1,dodge aries se
369 | 27,4,151,90,2735,18,82,1,pontiac phoenix
370 | 24,4,140,92,2865,16.4,82,1,ford fairmont futura
371 | 36,4,105,74,1980,15.3,82,2,volkswagen rabbit l
372 | 37,4,91,68,2025,18.2,82,3,mazda glc custom l
373 | 31,4,91,68,1970,17.6,82,3,mazda glc custom
374 | 38,4,105,63,2125,14.7,82,1,plymouth horizon miser
375 | 36,4,98,70,2125,17.3,82,1,mercury lynx l
376 | 36,4,120,88,2160,14.5,82,3,nissan stanza xe
377 | 36,4,107,75,2205,14.5,82,3,honda accord
378 | 34,4,108,70,2245,16.9,82,3,toyota corolla
379 | 38,4,91,67,1965,15,82,3,honda civic
380 | 32,4,91,67,1965,15.7,82,3,honda civic (auto)
381 | 38,4,91,67,1995,16.2,82,3,datsun 310 gx
382 | 25,6,181,110,2945,16.4,82,1,buick century limited
383 | 38,6,262,85,3015,17,82,1,oldsmobile cutlass ciera (diesel)
384 | 26,4,156,92,2585,14.5,82,1,chrysler lebaron medallion
385 | 22,6,232,112,2835,14.7,82,1,ford granada l
386 | 32,4,144,96,2665,13.9,82,3,toyota celica gt
387 | 36,4,135,84,2370,13,82,1,dodge charger 2.2
388 | 27,4,151,90,2950,17.3,82,1,chevrolet camaro
389 | 27,4,140,86,2790,15.6,82,1,ford mustang gl
390 | 44,4,97,52,2130,24.6,82,2,vw pickup
391 | 32,4,135,84,2295,11.6,82,1,dodge rampage
392 | 28,4,120,79,2625,18.6,82,1,ford ranger
393 | 31,4,119,82,2720,19.4,82,1,chevy s-10
--------------------------------------------------------------------------------
/data/Carseats.csv:
--------------------------------------------------------------------------------
1 | "Sales","CompPrice","Income","Advertising","Population","Price","ShelveLoc","Age","Education","Urban","US"
2 | 9.5,138,73,11,276,120,"Bad",42,17,"Yes","Yes"
3 | 11.22,111,48,16,260,83,"Good",65,10,"Yes","Yes"
4 | 10.06,113,35,10,269,80,"Medium",59,12,"Yes","Yes"
5 | 7.4,117,100,4,466,97,"Medium",55,14,"Yes","Yes"
6 | 4.15,141,64,3,340,128,"Bad",38,13,"Yes","No"
7 | 10.81,124,113,13,501,72,"Bad",78,16,"No","Yes"
8 | 6.63,115,105,0,45,108,"Medium",71,15,"Yes","No"
9 | 11.85,136,81,15,425,120,"Good",67,10,"Yes","Yes"
10 | 6.54,132,110,0,108,124,"Medium",76,10,"No","No"
11 | 4.69,132,113,0,131,124,"Medium",76,17,"No","Yes"
12 | 9.01,121,78,9,150,100,"Bad",26,10,"No","Yes"
13 | 11.96,117,94,4,503,94,"Good",50,13,"Yes","Yes"
14 | 3.98,122,35,2,393,136,"Medium",62,18,"Yes","No"
15 | 10.96,115,28,11,29,86,"Good",53,18,"Yes","Yes"
16 | 11.17,107,117,11,148,118,"Good",52,18,"Yes","Yes"
17 | 8.71,149,95,5,400,144,"Medium",76,18,"No","No"
18 | 7.58,118,32,0,284,110,"Good",63,13,"Yes","No"
19 | 12.29,147,74,13,251,131,"Good",52,10,"Yes","Yes"
20 | 13.91,110,110,0,408,68,"Good",46,17,"No","Yes"
21 | 8.73,129,76,16,58,121,"Medium",69,12,"Yes","Yes"
22 | 6.41,125,90,2,367,131,"Medium",35,18,"Yes","Yes"
23 | 12.13,134,29,12,239,109,"Good",62,18,"No","Yes"
24 | 5.08,128,46,6,497,138,"Medium",42,13,"Yes","No"
25 | 5.87,121,31,0,292,109,"Medium",79,10,"Yes","No"
26 | 10.14,145,119,16,294,113,"Bad",42,12,"Yes","Yes"
27 | 14.9,139,32,0,176,82,"Good",54,11,"No","No"
28 | 8.33,107,115,11,496,131,"Good",50,11,"No","Yes"
29 | 5.27,98,118,0,19,107,"Medium",64,17,"Yes","No"
30 | 2.99,103,74,0,359,97,"Bad",55,11,"Yes","Yes"
31 | 7.81,104,99,15,226,102,"Bad",58,17,"Yes","Yes"
32 | 13.55,125,94,0,447,89,"Good",30,12,"Yes","No"
33 | 8.25,136,58,16,241,131,"Medium",44,18,"Yes","Yes"
34 | 6.2,107,32,12,236,137,"Good",64,10,"No","Yes"
35 | 8.77,114,38,13,317,128,"Good",50,16,"Yes","Yes"
36 | 2.67,115,54,0,406,128,"Medium",42,17,"Yes","Yes"
37 | 11.07,131,84,11,29,96,"Medium",44,17,"No","Yes"
38 | 8.89,122,76,0,270,100,"Good",60,18,"No","No"
39 | 4.95,121,41,5,412,110,"Medium",54,10,"Yes","Yes"
40 | 6.59,109,73,0,454,102,"Medium",65,15,"Yes","No"
41 | 3.24,130,60,0,144,138,"Bad",38,10,"No","No"
42 | 2.07,119,98,0,18,126,"Bad",73,17,"No","No"
43 | 7.96,157,53,0,403,124,"Bad",58,16,"Yes","No"
44 | 10.43,77,69,0,25,24,"Medium",50,18,"Yes","No"
45 | 4.12,123,42,11,16,134,"Medium",59,13,"Yes","Yes"
46 | 4.16,85,79,6,325,95,"Medium",69,13,"Yes","Yes"
47 | 4.56,141,63,0,168,135,"Bad",44,12,"Yes","Yes"
48 | 12.44,127,90,14,16,70,"Medium",48,15,"No","Yes"
49 | 4.38,126,98,0,173,108,"Bad",55,16,"Yes","No"
50 | 3.91,116,52,0,349,98,"Bad",69,18,"Yes","No"
51 | 10.61,157,93,0,51,149,"Good",32,17,"Yes","No"
52 | 1.42,99,32,18,341,108,"Bad",80,16,"Yes","Yes"
53 | 4.42,121,90,0,150,108,"Bad",75,16,"Yes","No"
54 | 7.91,153,40,3,112,129,"Bad",39,18,"Yes","Yes"
55 | 6.92,109,64,13,39,119,"Medium",61,17,"Yes","Yes"
56 | 4.9,134,103,13,25,144,"Medium",76,17,"No","Yes"
57 | 6.85,143,81,5,60,154,"Medium",61,18,"Yes","Yes"
58 | 11.91,133,82,0,54,84,"Medium",50,17,"Yes","No"
59 | 0.91,93,91,0,22,117,"Bad",75,11,"Yes","No"
60 | 5.42,103,93,15,188,103,"Bad",74,16,"Yes","Yes"
61 | 5.21,118,71,4,148,114,"Medium",80,13,"Yes","No"
62 | 8.32,122,102,19,469,123,"Bad",29,13,"Yes","Yes"
63 | 7.32,105,32,0,358,107,"Medium",26,13,"No","No"
64 | 1.82,139,45,0,146,133,"Bad",77,17,"Yes","Yes"
65 | 8.47,119,88,10,170,101,"Medium",61,13,"Yes","Yes"
66 | 7.8,100,67,12,184,104,"Medium",32,16,"No","Yes"
67 | 4.9,122,26,0,197,128,"Medium",55,13,"No","No"
68 | 8.85,127,92,0,508,91,"Medium",56,18,"Yes","No"
69 | 9.01,126,61,14,152,115,"Medium",47,16,"Yes","Yes"
70 | 13.39,149,69,20,366,134,"Good",60,13,"Yes","Yes"
71 | 7.99,127,59,0,339,99,"Medium",65,12,"Yes","No"
72 | 9.46,89,81,15,237,99,"Good",74,12,"Yes","Yes"
73 | 6.5,148,51,16,148,150,"Medium",58,17,"No","Yes"
74 | 5.52,115,45,0,432,116,"Medium",25,15,"Yes","No"
75 | 12.61,118,90,10,54,104,"Good",31,11,"No","Yes"
76 | 6.2,150,68,5,125,136,"Medium",64,13,"No","Yes"
77 | 8.55,88,111,23,480,92,"Bad",36,16,"No","Yes"
78 | 10.64,102,87,10,346,70,"Medium",64,15,"Yes","Yes"
79 | 7.7,118,71,12,44,89,"Medium",67,18,"No","Yes"
80 | 4.43,134,48,1,139,145,"Medium",65,12,"Yes","Yes"
81 | 9.14,134,67,0,286,90,"Bad",41,13,"Yes","No"
82 | 8.01,113,100,16,353,79,"Bad",68,11,"Yes","Yes"
83 | 7.52,116,72,0,237,128,"Good",70,13,"Yes","No"
84 | 11.62,151,83,4,325,139,"Good",28,17,"Yes","Yes"
85 | 4.42,109,36,7,468,94,"Bad",56,11,"Yes","Yes"
86 | 2.23,111,25,0,52,121,"Bad",43,18,"No","No"
87 | 8.47,125,103,0,304,112,"Medium",49,13,"No","No"
88 | 8.7,150,84,9,432,134,"Medium",64,15,"Yes","No"
89 | 11.7,131,67,7,272,126,"Good",54,16,"No","Yes"
90 | 6.56,117,42,7,144,111,"Medium",62,10,"Yes","Yes"
91 | 7.95,128,66,3,493,119,"Medium",45,16,"No","No"
92 | 5.33,115,22,0,491,103,"Medium",64,11,"No","No"
93 | 4.81,97,46,11,267,107,"Medium",80,15,"Yes","Yes"
94 | 4.53,114,113,0,97,125,"Medium",29,12,"Yes","No"
95 | 8.86,145,30,0,67,104,"Medium",55,17,"Yes","No"
96 | 8.39,115,97,5,134,84,"Bad",55,11,"Yes","Yes"
97 | 5.58,134,25,10,237,148,"Medium",59,13,"Yes","Yes"
98 | 9.48,147,42,10,407,132,"Good",73,16,"No","Yes"
99 | 7.45,161,82,5,287,129,"Bad",33,16,"Yes","Yes"
100 | 12.49,122,77,24,382,127,"Good",36,16,"No","Yes"
101 | 4.88,121,47,3,220,107,"Bad",56,16,"No","Yes"
102 | 4.11,113,69,11,94,106,"Medium",76,12,"No","Yes"
103 | 6.2,128,93,0,89,118,"Medium",34,18,"Yes","No"
104 | 5.3,113,22,0,57,97,"Medium",65,16,"No","No"
105 | 5.07,123,91,0,334,96,"Bad",78,17,"Yes","Yes"
106 | 4.62,121,96,0,472,138,"Medium",51,12,"Yes","No"
107 | 5.55,104,100,8,398,97,"Medium",61,11,"Yes","Yes"
108 | 0.16,102,33,0,217,139,"Medium",70,18,"No","No"
109 | 8.55,134,107,0,104,108,"Medium",60,12,"Yes","No"
110 | 3.47,107,79,2,488,103,"Bad",65,16,"Yes","No"
111 | 8.98,115,65,0,217,90,"Medium",60,17,"No","No"
112 | 9,128,62,7,125,116,"Medium",43,14,"Yes","Yes"
113 | 6.62,132,118,12,272,151,"Medium",43,14,"Yes","Yes"
114 | 6.67,116,99,5,298,125,"Good",62,12,"Yes","Yes"
115 | 6.01,131,29,11,335,127,"Bad",33,12,"Yes","Yes"
116 | 9.31,122,87,9,17,106,"Medium",65,13,"Yes","Yes"
117 | 8.54,139,35,0,95,129,"Medium",42,13,"Yes","No"
118 | 5.08,135,75,0,202,128,"Medium",80,10,"No","No"
119 | 8.8,145,53,0,507,119,"Medium",41,12,"Yes","No"
120 | 7.57,112,88,2,243,99,"Medium",62,11,"Yes","Yes"
121 | 7.37,130,94,8,137,128,"Medium",64,12,"Yes","Yes"
122 | 6.87,128,105,11,249,131,"Medium",63,13,"Yes","Yes"
123 | 11.67,125,89,10,380,87,"Bad",28,10,"Yes","Yes"
124 | 6.88,119,100,5,45,108,"Medium",75,10,"Yes","Yes"
125 | 8.19,127,103,0,125,155,"Good",29,15,"No","Yes"
126 | 8.87,131,113,0,181,120,"Good",63,14,"Yes","No"
127 | 9.34,89,78,0,181,49,"Medium",43,15,"No","No"
128 | 11.27,153,68,2,60,133,"Good",59,16,"Yes","Yes"
129 | 6.52,125,48,3,192,116,"Medium",51,14,"Yes","Yes"
130 | 4.96,133,100,3,350,126,"Bad",55,13,"Yes","Yes"
131 | 4.47,143,120,7,279,147,"Bad",40,10,"No","Yes"
132 | 8.41,94,84,13,497,77,"Medium",51,12,"Yes","Yes"
133 | 6.5,108,69,3,208,94,"Medium",77,16,"Yes","No"
134 | 9.54,125,87,9,232,136,"Good",72,10,"Yes","Yes"
135 | 7.62,132,98,2,265,97,"Bad",62,12,"Yes","Yes"
136 | 3.67,132,31,0,327,131,"Medium",76,16,"Yes","No"
137 | 6.44,96,94,14,384,120,"Medium",36,18,"No","Yes"
138 | 5.17,131,75,0,10,120,"Bad",31,18,"No","No"
139 | 6.52,128,42,0,436,118,"Medium",80,11,"Yes","No"
140 | 10.27,125,103,12,371,109,"Medium",44,10,"Yes","Yes"
141 | 12.3,146,62,10,310,94,"Medium",30,13,"No","Yes"
142 | 6.03,133,60,10,277,129,"Medium",45,18,"Yes","Yes"
143 | 6.53,140,42,0,331,131,"Bad",28,15,"Yes","No"
144 | 7.44,124,84,0,300,104,"Medium",77,15,"Yes","No"
145 | 0.53,122,88,7,36,159,"Bad",28,17,"Yes","Yes"
146 | 9.09,132,68,0,264,123,"Good",34,11,"No","No"
147 | 8.77,144,63,11,27,117,"Medium",47,17,"Yes","Yes"
148 | 3.9,114,83,0,412,131,"Bad",39,14,"Yes","No"
149 | 10.51,140,54,9,402,119,"Good",41,16,"No","Yes"
150 | 7.56,110,119,0,384,97,"Medium",72,14,"No","Yes"
151 | 11.48,121,120,13,140,87,"Medium",56,11,"Yes","Yes"
152 | 10.49,122,84,8,176,114,"Good",57,10,"No","Yes"
153 | 10.77,111,58,17,407,103,"Good",75,17,"No","Yes"
154 | 7.64,128,78,0,341,128,"Good",45,13,"No","No"
155 | 5.93,150,36,7,488,150,"Medium",25,17,"No","Yes"
156 | 6.89,129,69,10,289,110,"Medium",50,16,"No","Yes"
157 | 7.71,98,72,0,59,69,"Medium",65,16,"Yes","No"
158 | 7.49,146,34,0,220,157,"Good",51,16,"Yes","No"
159 | 10.21,121,58,8,249,90,"Medium",48,13,"No","Yes"
160 | 12.53,142,90,1,189,112,"Good",39,10,"No","Yes"
161 | 9.32,119,60,0,372,70,"Bad",30,18,"No","No"
162 | 4.67,111,28,0,486,111,"Medium",29,12,"No","No"
163 | 2.93,143,21,5,81,160,"Medium",67,12,"No","Yes"
164 | 3.63,122,74,0,424,149,"Medium",51,13,"Yes","No"
165 | 5.68,130,64,0,40,106,"Bad",39,17,"No","No"
166 | 8.22,148,64,0,58,141,"Medium",27,13,"No","Yes"
167 | 0.37,147,58,7,100,191,"Bad",27,15,"Yes","Yes"
168 | 6.71,119,67,17,151,137,"Medium",55,11,"Yes","Yes"
169 | 6.71,106,73,0,216,93,"Medium",60,13,"Yes","No"
170 | 7.3,129,89,0,425,117,"Medium",45,10,"Yes","No"
171 | 11.48,104,41,15,492,77,"Good",73,18,"Yes","Yes"
172 | 8.01,128,39,12,356,118,"Medium",71,10,"Yes","Yes"
173 | 12.49,93,106,12,416,55,"Medium",75,15,"Yes","Yes"
174 | 9.03,104,102,13,123,110,"Good",35,16,"Yes","Yes"
175 | 6.38,135,91,5,207,128,"Medium",66,18,"Yes","Yes"
176 | 0,139,24,0,358,185,"Medium",79,15,"No","No"
177 | 7.54,115,89,0,38,122,"Medium",25,12,"Yes","No"
178 | 5.61,138,107,9,480,154,"Medium",47,11,"No","Yes"
179 | 10.48,138,72,0,148,94,"Medium",27,17,"Yes","Yes"
180 | 10.66,104,71,14,89,81,"Medium",25,14,"No","Yes"
181 | 7.78,144,25,3,70,116,"Medium",77,18,"Yes","Yes"
182 | 4.94,137,112,15,434,149,"Bad",66,13,"Yes","Yes"
183 | 7.43,121,83,0,79,91,"Medium",68,11,"Yes","No"
184 | 4.74,137,60,4,230,140,"Bad",25,13,"Yes","No"
185 | 5.32,118,74,6,426,102,"Medium",80,18,"Yes","Yes"
186 | 9.95,132,33,7,35,97,"Medium",60,11,"No","Yes"
187 | 10.07,130,100,11,449,107,"Medium",64,10,"Yes","Yes"
188 | 8.68,120,51,0,93,86,"Medium",46,17,"No","No"
189 | 6.03,117,32,0,142,96,"Bad",62,17,"Yes","No"
190 | 8.07,116,37,0,426,90,"Medium",76,15,"Yes","No"
191 | 12.11,118,117,18,509,104,"Medium",26,15,"No","Yes"
192 | 8.79,130,37,13,297,101,"Medium",37,13,"No","Yes"
193 | 6.67,156,42,13,170,173,"Good",74,14,"Yes","Yes"
194 | 7.56,108,26,0,408,93,"Medium",56,14,"No","No"
195 | 13.28,139,70,7,71,96,"Good",61,10,"Yes","Yes"
196 | 7.23,112,98,18,481,128,"Medium",45,11,"Yes","Yes"
197 | 4.19,117,93,4,420,112,"Bad",66,11,"Yes","Yes"
198 | 4.1,130,28,6,410,133,"Bad",72,16,"Yes","Yes"
199 | 2.52,124,61,0,333,138,"Medium",76,16,"Yes","No"
200 | 3.62,112,80,5,500,128,"Medium",69,10,"Yes","Yes"
201 | 6.42,122,88,5,335,126,"Medium",64,14,"Yes","Yes"
202 | 5.56,144,92,0,349,146,"Medium",62,12,"No","No"
203 | 5.94,138,83,0,139,134,"Medium",54,18,"Yes","No"
204 | 4.1,121,78,4,413,130,"Bad",46,10,"No","Yes"
205 | 2.05,131,82,0,132,157,"Bad",25,14,"Yes","No"
206 | 8.74,155,80,0,237,124,"Medium",37,14,"Yes","No"
207 | 5.68,113,22,1,317,132,"Medium",28,12,"Yes","No"
208 | 4.97,162,67,0,27,160,"Medium",77,17,"Yes","Yes"
209 | 8.19,111,105,0,466,97,"Bad",61,10,"No","No"
210 | 7.78,86,54,0,497,64,"Bad",33,12,"Yes","No"
211 | 3.02,98,21,11,326,90,"Bad",76,11,"No","Yes"
212 | 4.36,125,41,2,357,123,"Bad",47,14,"No","Yes"
213 | 9.39,117,118,14,445,120,"Medium",32,15,"Yes","Yes"
214 | 12.04,145,69,19,501,105,"Medium",45,11,"Yes","Yes"
215 | 8.23,149,84,5,220,139,"Medium",33,10,"Yes","Yes"
216 | 4.83,115,115,3,48,107,"Medium",73,18,"Yes","Yes"
217 | 2.34,116,83,15,170,144,"Bad",71,11,"Yes","Yes"
218 | 5.73,141,33,0,243,144,"Medium",34,17,"Yes","No"
219 | 4.34,106,44,0,481,111,"Medium",70,14,"No","No"
220 | 9.7,138,61,12,156,120,"Medium",25,14,"Yes","Yes"
221 | 10.62,116,79,19,359,116,"Good",58,17,"Yes","Yes"
222 | 10.59,131,120,15,262,124,"Medium",30,10,"Yes","Yes"
223 | 6.43,124,44,0,125,107,"Medium",80,11,"Yes","No"
224 | 7.49,136,119,6,178,145,"Medium",35,13,"Yes","Yes"
225 | 3.45,110,45,9,276,125,"Medium",62,14,"Yes","Yes"
226 | 4.1,134,82,0,464,141,"Medium",48,13,"No","No"
227 | 6.68,107,25,0,412,82,"Bad",36,14,"Yes","No"
228 | 7.8,119,33,0,245,122,"Good",56,14,"Yes","No"
229 | 8.69,113,64,10,68,101,"Medium",57,16,"Yes","Yes"
230 | 5.4,149,73,13,381,163,"Bad",26,11,"No","Yes"
231 | 11.19,98,104,0,404,72,"Medium",27,18,"No","No"
232 | 5.16,115,60,0,119,114,"Bad",38,14,"No","No"
233 | 8.09,132,69,0,123,122,"Medium",27,11,"No","No"
234 | 13.14,137,80,10,24,105,"Good",61,15,"Yes","Yes"
235 | 8.65,123,76,18,218,120,"Medium",29,14,"No","Yes"
236 | 9.43,115,62,11,289,129,"Good",56,16,"No","Yes"
237 | 5.53,126,32,8,95,132,"Medium",50,17,"Yes","Yes"
238 | 9.32,141,34,16,361,108,"Medium",69,10,"Yes","Yes"
239 | 9.62,151,28,8,499,135,"Medium",48,10,"Yes","Yes"
240 | 7.36,121,24,0,200,133,"Good",73,13,"Yes","No"
241 | 3.89,123,105,0,149,118,"Bad",62,16,"Yes","Yes"
242 | 10.31,159,80,0,362,121,"Medium",26,18,"Yes","No"
243 | 12.01,136,63,0,160,94,"Medium",38,12,"Yes","No"
244 | 4.68,124,46,0,199,135,"Medium",52,14,"No","No"
245 | 7.82,124,25,13,87,110,"Medium",57,10,"Yes","Yes"
246 | 8.78,130,30,0,391,100,"Medium",26,18,"Yes","No"
247 | 10,114,43,0,199,88,"Good",57,10,"No","Yes"
248 | 6.9,120,56,20,266,90,"Bad",78,18,"Yes","Yes"
249 | 5.04,123,114,0,298,151,"Bad",34,16,"Yes","No"
250 | 5.36,111,52,0,12,101,"Medium",61,11,"Yes","Yes"
251 | 5.05,125,67,0,86,117,"Bad",65,11,"Yes","No"
252 | 9.16,137,105,10,435,156,"Good",72,14,"Yes","Yes"
253 | 3.72,139,111,5,310,132,"Bad",62,13,"Yes","Yes"
254 | 8.31,133,97,0,70,117,"Medium",32,16,"Yes","No"
255 | 5.64,124,24,5,288,122,"Medium",57,12,"No","Yes"
256 | 9.58,108,104,23,353,129,"Good",37,17,"Yes","Yes"
257 | 7.71,123,81,8,198,81,"Bad",80,15,"Yes","Yes"
258 | 4.2,147,40,0,277,144,"Medium",73,10,"Yes","No"
259 | 8.67,125,62,14,477,112,"Medium",80,13,"Yes","Yes"
260 | 3.47,108,38,0,251,81,"Bad",72,14,"No","No"
261 | 5.12,123,36,10,467,100,"Bad",74,11,"No","Yes"
262 | 7.67,129,117,8,400,101,"Bad",36,10,"Yes","Yes"
263 | 5.71,121,42,4,188,118,"Medium",54,15,"Yes","Yes"
264 | 6.37,120,77,15,86,132,"Medium",48,18,"Yes","Yes"
265 | 7.77,116,26,6,434,115,"Medium",25,17,"Yes","Yes"
266 | 6.95,128,29,5,324,159,"Good",31,15,"Yes","Yes"
267 | 5.31,130,35,10,402,129,"Bad",39,17,"Yes","Yes"
268 | 9.1,128,93,12,343,112,"Good",73,17,"No","Yes"
269 | 5.83,134,82,7,473,112,"Bad",51,12,"No","Yes"
270 | 6.53,123,57,0,66,105,"Medium",39,11,"Yes","No"
271 | 5.01,159,69,0,438,166,"Medium",46,17,"Yes","No"
272 | 11.99,119,26,0,284,89,"Good",26,10,"Yes","No"
273 | 4.55,111,56,0,504,110,"Medium",62,16,"Yes","No"
274 | 12.98,113,33,0,14,63,"Good",38,12,"Yes","No"
275 | 10.04,116,106,8,244,86,"Medium",58,12,"Yes","Yes"
276 | 7.22,135,93,2,67,119,"Medium",34,11,"Yes","Yes"
277 | 6.67,107,119,11,210,132,"Medium",53,11,"Yes","Yes"
278 | 6.93,135,69,14,296,130,"Medium",73,15,"Yes","Yes"
279 | 7.8,136,48,12,326,125,"Medium",36,16,"Yes","Yes"
280 | 7.22,114,113,2,129,151,"Good",40,15,"No","Yes"
281 | 3.42,141,57,13,376,158,"Medium",64,18,"Yes","Yes"
282 | 2.86,121,86,10,496,145,"Bad",51,10,"Yes","Yes"
283 | 11.19,122,69,7,303,105,"Good",45,16,"No","Yes"
284 | 7.74,150,96,0,80,154,"Good",61,11,"Yes","No"
285 | 5.36,135,110,0,112,117,"Medium",80,16,"No","No"
286 | 6.97,106,46,11,414,96,"Bad",79,17,"No","No"
287 | 7.6,146,26,11,261,131,"Medium",39,10,"Yes","Yes"
288 | 7.53,117,118,11,429,113,"Medium",67,18,"No","Yes"
289 | 6.88,95,44,4,208,72,"Bad",44,17,"Yes","Yes"
290 | 6.98,116,40,0,74,97,"Medium",76,15,"No","No"
291 | 8.75,143,77,25,448,156,"Medium",43,17,"Yes","Yes"
292 | 9.49,107,111,14,400,103,"Medium",41,11,"No","Yes"
293 | 6.64,118,70,0,106,89,"Bad",39,17,"Yes","No"
294 | 11.82,113,66,16,322,74,"Good",76,15,"Yes","Yes"
295 | 11.28,123,84,0,74,89,"Good",59,10,"Yes","No"
296 | 12.66,148,76,3,126,99,"Good",60,11,"Yes","Yes"
297 | 4.21,118,35,14,502,137,"Medium",79,10,"No","Yes"
298 | 8.21,127,44,13,160,123,"Good",63,18,"Yes","Yes"
299 | 3.07,118,83,13,276,104,"Bad",75,10,"Yes","Yes"
300 | 10.98,148,63,0,312,130,"Good",63,15,"Yes","No"
301 | 9.4,135,40,17,497,96,"Medium",54,17,"No","Yes"
302 | 8.57,116,78,1,158,99,"Medium",45,11,"Yes","Yes"
303 | 7.41,99,93,0,198,87,"Medium",57,16,"Yes","Yes"
304 | 5.28,108,77,13,388,110,"Bad",74,14,"Yes","Yes"
305 | 10.01,133,52,16,290,99,"Medium",43,11,"Yes","Yes"
306 | 11.93,123,98,12,408,134,"Good",29,10,"Yes","Yes"
307 | 8.03,115,29,26,394,132,"Medium",33,13,"Yes","Yes"
308 | 4.78,131,32,1,85,133,"Medium",48,12,"Yes","Yes"
309 | 5.9,138,92,0,13,120,"Bad",61,12,"Yes","No"
310 | 9.24,126,80,19,436,126,"Medium",52,10,"Yes","Yes"
311 | 11.18,131,111,13,33,80,"Bad",68,18,"Yes","Yes"
312 | 9.53,175,65,29,419,166,"Medium",53,12,"Yes","Yes"
313 | 6.15,146,68,12,328,132,"Bad",51,14,"Yes","Yes"
314 | 6.8,137,117,5,337,135,"Bad",38,10,"Yes","Yes"
315 | 9.33,103,81,3,491,54,"Medium",66,13,"Yes","No"
316 | 7.72,133,33,10,333,129,"Good",71,14,"Yes","Yes"
317 | 6.39,131,21,8,220,171,"Good",29,14,"Yes","Yes"
318 | 15.63,122,36,5,369,72,"Good",35,10,"Yes","Yes"
319 | 6.41,142,30,0,472,136,"Good",80,15,"No","No"
320 | 10.08,116,72,10,456,130,"Good",41,14,"No","Yes"
321 | 6.97,127,45,19,459,129,"Medium",57,11,"No","Yes"
322 | 5.86,136,70,12,171,152,"Medium",44,18,"Yes","Yes"
323 | 7.52,123,39,5,499,98,"Medium",34,15,"Yes","No"
324 | 9.16,140,50,10,300,139,"Good",60,15,"Yes","Yes"
325 | 10.36,107,105,18,428,103,"Medium",34,12,"Yes","Yes"
326 | 2.66,136,65,4,133,150,"Bad",53,13,"Yes","Yes"
327 | 11.7,144,69,11,131,104,"Medium",47,11,"Yes","Yes"
328 | 4.69,133,30,0,152,122,"Medium",53,17,"Yes","No"
329 | 6.23,112,38,17,316,104,"Medium",80,16,"Yes","Yes"
330 | 3.15,117,66,1,65,111,"Bad",55,11,"Yes","Yes"
331 | 11.27,100,54,9,433,89,"Good",45,12,"Yes","Yes"
332 | 4.99,122,59,0,501,112,"Bad",32,14,"No","No"
333 | 10.1,135,63,15,213,134,"Medium",32,10,"Yes","Yes"
334 | 5.74,106,33,20,354,104,"Medium",61,12,"Yes","Yes"
335 | 5.87,136,60,7,303,147,"Medium",41,10,"Yes","Yes"
336 | 7.63,93,117,9,489,83,"Bad",42,13,"Yes","Yes"
337 | 6.18,120,70,15,464,110,"Medium",72,15,"Yes","Yes"
338 | 5.17,138,35,6,60,143,"Bad",28,18,"Yes","No"
339 | 8.61,130,38,0,283,102,"Medium",80,15,"Yes","No"
340 | 5.97,112,24,0,164,101,"Medium",45,11,"Yes","No"
341 | 11.54,134,44,4,219,126,"Good",44,15,"Yes","Yes"
342 | 7.5,140,29,0,105,91,"Bad",43,16,"Yes","No"
343 | 7.38,98,120,0,268,93,"Medium",72,10,"No","No"
344 | 7.81,137,102,13,422,118,"Medium",71,10,"No","Yes"
345 | 5.99,117,42,10,371,121,"Bad",26,14,"Yes","Yes"
346 | 8.43,138,80,0,108,126,"Good",70,13,"No","Yes"
347 | 4.81,121,68,0,279,149,"Good",79,12,"Yes","No"
348 | 8.97,132,107,0,144,125,"Medium",33,13,"No","No"
349 | 6.88,96,39,0,161,112,"Good",27,14,"No","No"
350 | 12.57,132,102,20,459,107,"Good",49,11,"Yes","Yes"
351 | 9.32,134,27,18,467,96,"Medium",49,14,"No","Yes"
352 | 8.64,111,101,17,266,91,"Medium",63,17,"No","Yes"
353 | 10.44,124,115,16,458,105,"Medium",62,16,"No","Yes"
354 | 13.44,133,103,14,288,122,"Good",61,17,"Yes","Yes"
355 | 9.45,107,67,12,430,92,"Medium",35,12,"No","Yes"
356 | 5.3,133,31,1,80,145,"Medium",42,18,"Yes","Yes"
357 | 7.02,130,100,0,306,146,"Good",42,11,"Yes","No"
358 | 3.58,142,109,0,111,164,"Good",72,12,"Yes","No"
359 | 13.36,103,73,3,276,72,"Medium",34,15,"Yes","Yes"
360 | 4.17,123,96,10,71,118,"Bad",69,11,"Yes","Yes"
361 | 3.13,130,62,11,396,130,"Bad",66,14,"Yes","Yes"
362 | 8.77,118,86,7,265,114,"Good",52,15,"No","Yes"
363 | 8.68,131,25,10,183,104,"Medium",56,15,"No","Yes"
364 | 5.25,131,55,0,26,110,"Bad",79,12,"Yes","Yes"
365 | 10.26,111,75,1,377,108,"Good",25,12,"Yes","No"
366 | 10.5,122,21,16,488,131,"Good",30,14,"Yes","Yes"
367 | 6.53,154,30,0,122,162,"Medium",57,17,"No","No"
368 | 5.98,124,56,11,447,134,"Medium",53,12,"No","Yes"
369 | 14.37,95,106,0,256,53,"Good",52,17,"Yes","No"
370 | 10.71,109,22,10,348,79,"Good",74,14,"No","Yes"
371 | 10.26,135,100,22,463,122,"Medium",36,14,"Yes","Yes"
372 | 7.68,126,41,22,403,119,"Bad",42,12,"Yes","Yes"
373 | 9.08,152,81,0,191,126,"Medium",54,16,"Yes","No"
374 | 7.8,121,50,0,508,98,"Medium",65,11,"No","No"
375 | 5.58,137,71,0,402,116,"Medium",78,17,"Yes","No"
376 | 9.44,131,47,7,90,118,"Medium",47,12,"Yes","Yes"
377 | 7.9,132,46,4,206,124,"Medium",73,11,"Yes","No"
378 | 16.27,141,60,19,319,92,"Good",44,11,"Yes","Yes"
379 | 6.81,132,61,0,263,125,"Medium",41,12,"No","No"
380 | 6.11,133,88,3,105,119,"Medium",79,12,"Yes","Yes"
381 | 5.81,125,111,0,404,107,"Bad",54,15,"Yes","No"
382 | 9.64,106,64,10,17,89,"Medium",68,17,"Yes","Yes"
383 | 3.9,124,65,21,496,151,"Bad",77,13,"Yes","Yes"
384 | 4.95,121,28,19,315,121,"Medium",66,14,"Yes","Yes"
385 | 9.35,98,117,0,76,68,"Medium",63,10,"Yes","No"
386 | 12.85,123,37,15,348,112,"Good",28,12,"Yes","Yes"
387 | 5.87,131,73,13,455,132,"Medium",62,17,"Yes","Yes"
388 | 5.32,152,116,0,170,160,"Medium",39,16,"Yes","No"
389 | 8.67,142,73,14,238,115,"Medium",73,14,"No","Yes"
390 | 8.14,135,89,11,245,78,"Bad",79,16,"Yes","Yes"
391 | 8.44,128,42,8,328,107,"Medium",35,12,"Yes","Yes"
392 | 5.47,108,75,9,61,111,"Medium",67,12,"Yes","Yes"
393 | 6.1,153,63,0,49,124,"Bad",56,16,"Yes","No"
394 | 4.53,129,42,13,315,130,"Bad",34,13,"Yes","Yes"
395 | 5.57,109,51,10,26,120,"Medium",30,17,"No","Yes"
396 | 5.35,130,58,19,366,139,"Bad",33,16,"Yes","Yes"
397 | 12.57,138,108,17,203,128,"Good",33,14,"Yes","Yes"
398 | 6.14,139,23,3,37,120,"Medium",55,11,"No","Yes"
399 | 7.41,162,26,12,368,159,"Medium",40,18,"Yes","Yes"
400 | 5.94,100,79,7,284,95,"Bad",50,12,"Yes","Yes"
401 | 9.71,134,37,0,27,120,"Good",49,16,"Yes","Yes"
402 |
--------------------------------------------------------------------------------
/data/Hitters.csv:
--------------------------------------------------------------------------------
1 | Player,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
2 | -Andy Allanson,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
3 | -Alan Ashby,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475,N
4 | -Alvin Davis,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480,A
5 | -Andre Dawson,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500,N
6 | -Andres Galarraga,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N
7 | -Alfredo Griffin,594,169,4,74,51,35,11,4408,1133,19,501,336,194,A,W,282,421,25,750,A
8 | -Al Newman,185,37,1,23,8,21,2,214,42,1,30,9,24,N,E,76,127,7,70,A
9 | -Argenis Salazar,298,73,0,24,24,7,3,509,108,0,41,37,12,A,W,121,283,9,100,A
10 | -Andres Thomas,323,81,6,26,32,8,2,341,86,6,32,34,8,N,W,143,290,19,75,N
11 | -Andre Thornton,401,92,17,49,66,65,13,5206,1332,253,784,890,866,A,E,0,0,0,1100,A
12 | -Alan Trammell,574,159,21,107,75,59,10,4631,1300,90,702,504,488,A,E,238,445,22,517.143,A
13 | -Alex Trevino,202,53,4,31,26,27,9,1876,467,15,192,186,161,N,W,304,45,11,512.5,N
14 | -Andy VanSlyke,418,113,13,48,61,47,4,1512,392,41,205,204,203,N,E,211,11,7,550,N
15 | -Alan Wiggins,239,60,0,30,11,22,6,1941,510,4,309,103,207,A,E,121,151,6,700,A
16 | -Bill Almon,196,43,7,29,27,30,13,3231,825,36,376,290,238,N,E,80,45,8,240,N
17 | -Billy Beane,183,39,3,20,15,11,3,201,42,3,20,16,11,A,W,118,0,0,,A
18 | -Buddy Bell,568,158,20,89,75,73,15,8068,2273,177,1045,993,732,N,W,105,290,10,775,N
19 | -Buddy Biancalana,190,46,2,24,8,15,5,479,102,5,65,23,39,A,W,102,177,16,175,A
20 | -Bruce Bochte,407,104,6,57,43,65,12,5233,1478,100,643,658,653,A,W,912,88,9,,A
21 | -Bruce Bochy,127,32,8,16,22,14,8,727,180,24,67,82,56,N,W,202,22,2,135,N
22 | -Barry Bonds,413,92,16,72,48,65,1,413,92,16,72,48,65,N,E,280,9,5,100,N
23 | -Bobby Bonilla,426,109,3,55,43,62,1,426,109,3,55,43,62,A,W,361,22,2,115,N
24 | -Bob Boone,22,10,1,4,2,1,6,84,26,2,9,9,3,A,W,812,84,11,,A
25 | -Bob Brenly,472,116,16,60,62,74,6,1924,489,67,242,251,240,N,W,518,55,3,600,N
26 | -Bill Buckner,629,168,18,73,102,40,18,8424,2464,164,1008,1072,402,A,E,1067,157,14,776.667,A
27 | -Brett Butler,587,163,4,92,51,70,6,2695,747,17,442,198,317,A,E,434,9,3,765,A
28 | -Bob Dernier,324,73,4,32,18,22,7,1931,491,13,291,108,180,N,E,222,3,3,708.333,N
29 | -Bo Diaz,474,129,10,50,56,40,10,2331,604,61,246,327,166,N,W,732,83,13,750,N
30 | -Bill Doran,550,152,6,92,37,81,5,2308,633,32,349,182,308,N,W,262,329,16,625,N
31 | -Brian Downing,513,137,20,90,95,90,14,5201,1382,166,763,734,784,A,W,267,5,3,900,A
32 | -Bobby Grich,313,84,9,42,30,39,17,6890,1833,224,1033,864,1087,A,W,127,221,7,,A
33 | -Billy Hatcher,419,108,6,55,36,22,3,591,149,8,80,46,31,N,W,226,7,4,110,N
34 | -Bob Horner,517,141,27,70,87,52,9,3571,994,215,545,652,337,N,W,1378,102,8,,N
35 | -Brook Jacoby,583,168,17,83,80,56,5,1646,452,44,219,208,136,A,E,109,292,25,612.5,A
36 | -Bob Kearney,204,49,6,23,25,12,7,1309,308,27,126,132,66,A,W,419,46,5,300,A
37 | -Bill Madlock,379,106,10,38,60,30,14,6207,1906,146,859,803,571,N,W,72,170,24,850,N
38 | -Bobby Meacham,161,36,0,19,10,17,4,1053,244,3,156,86,107,A,E,70,149,12,,A
39 | -Bob Melvin,268,60,5,24,25,15,2,350,78,5,34,29,18,N,W,442,59,6,90,N
40 | -Ben Oglivie,346,98,5,31,53,30,16,5913,1615,235,784,901,560,A,E,0,0,0,,A
41 | -Bip Roberts,241,61,1,34,12,14,1,241,61,1,34,12,14,N,W,166,172,10,,N
42 | -BillyJo Robidoux,181,41,1,15,21,33,2,232,50,4,20,29,45,A,E,326,29,5,67.5,A
43 | -Bill Russell,216,54,0,21,18,15,18,7318,1926,46,796,627,483,N,W,103,84,5,,N
44 | -Billy Sample,200,57,6,23,14,14,9,2516,684,46,371,230,195,N,W,69,1,1,,N
45 | -Bill Schroeder,217,46,7,32,19,9,4,694,160,32,86,76,32,A,E,307,25,1,180,A
46 | -Butch Wynegar,194,40,7,19,29,30,11,4183,1069,64,486,493,608,A,E,325,22,2,,A
47 | -Chris Bando,254,68,2,28,26,22,6,999,236,21,108,117,118,A,E,359,30,4,305,A
48 | -Chris Brown,416,132,7,57,49,33,3,932,273,24,113,121,80,N,W,73,177,18,215,N
49 | -Carmen Castillo,205,57,8,34,32,9,5,756,192,32,117,107,51,A,E,58,4,4,247.5,A
50 | -Cecil Cooper,542,140,12,46,75,41,16,7099,2130,235,987,1089,431,A,E,697,61,9,,A
51 | -Chili Davis,526,146,13,71,70,84,6,2648,715,77,352,342,289,N,W,303,9,9,815,N
52 | -Carlton Fisk,457,101,14,42,63,22,17,6521,1767,281,1003,977,619,A,W,389,39,4,875,A
53 | -Curt Ford,214,53,2,30,29,23,2,226,59,2,32,32,27,N,E,109,7,3,70,N
54 | -Cliff Johnson,19,7,0,1,2,1,4,41,13,1,3,4,4,A,E,0,0,0,,A
55 | -Carney Lansford,591,168,19,80,72,39,9,4478,1307,113,634,563,319,A,W,67,147,4,1200,A
56 | -Chet Lemon,403,101,12,45,53,39,12,5150,1429,166,747,666,526,A,E,316,6,5,675,A
57 | -Candy Maldonado,405,102,18,49,85,20,6,950,231,29,99,138,64,N,W,161,10,3,415,N
58 | -Carmelo Martinez,244,58,9,28,25,35,4,1335,333,49,164,179,194,N,W,142,14,2,340,N
59 | -Charlie Moore,235,61,3,24,39,21,14,3926,1029,35,441,401,333,A,E,425,43,4,,A
60 | -Craig Reynolds,313,78,6,32,41,12,12,3742,968,35,409,321,170,N,W,106,206,7,416.667,N
61 | -Cal Ripken,627,177,25,98,81,70,6,3210,927,133,529,472,313,A,E,240,482,13,1350,A
62 | -Cory Snyder,416,113,24,58,69,16,1,416,113,24,58,69,16,A,E,203,70,10,90,A
63 | -Chris Speier,155,44,6,21,23,15,16,6631,1634,98,698,661,777,N,E,53,88,3,275,N
64 | -Curt Wilkerson,236,56,0,27,15,11,4,1115,270,1,116,64,57,A,W,125,199,13,230,A
65 | -Dave Anderson,216,53,1,31,15,22,4,926,210,9,118,69,114,N,W,73,152,11,225,N
66 | -Doug Baker,24,3,0,1,0,2,3,159,28,0,20,12,9,A,W,80,4,0,,A
67 | -Don Baylor,585,139,31,93,94,62,17,7546,1982,315,1141,1179,727,A,E,0,0,0,950,A
68 | -Dann Bilardello,191,37,4,12,17,14,4,773,163,16,61,74,52,N,E,391,38,8,,N
69 | -Daryl Boston,199,53,5,29,22,21,3,514,120,8,57,40,39,A,W,152,3,5,75,A
70 | -Darnell Coles,521,142,20,67,86,45,4,815,205,22,99,103,78,A,E,107,242,23,105,A
71 | -Dave Collins,419,113,1,44,27,44,12,4484,1231,32,612,344,422,A,E,211,2,1,,A
72 | -Dave Concepcion,311,81,3,42,30,26,17,8247,2198,100,950,909,690,N,W,153,223,10,320,N
73 | -Darren Daulton,138,31,8,18,21,38,3,244,53,12,33,32,55,N,E,244,21,4,,N
74 | -Doug DeCinces,512,131,26,69,96,52,14,5347,1397,221,712,815,548,A,W,119,216,12,850,A
75 | -Darrell Evans,507,122,29,78,85,91,18,7761,1947,347,1175,1152,1380,A,E,808,108,2,535,A
76 | -Dwight Evans,529,137,26,86,97,97,15,6661,1785,291,1082,949,989,A,E,280,10,5,933.333,A
77 | -Damaso Garcia,424,119,6,57,46,13,9,3651,1046,32,461,301,112,A,E,224,286,8,850,N
78 | -Dan Gladden,351,97,4,55,29,39,4,1258,353,16,196,110,117,N,W,226,7,3,210,A
79 | -Danny Heep,195,55,5,24,33,30,8,1313,338,25,144,149,153,N,E,83,2,1,,N
80 | -Dave Henderson,388,103,15,59,47,39,6,2174,555,80,285,274,186,A,W,182,9,4,325,A
81 | -Donnie Hill,339,96,4,37,29,23,4,1064,290,11,123,108,55,A,W,104,213,9,275,A
82 | -Dave Kingman,561,118,35,70,94,33,16,6677,1575,442,901,1210,608,A,W,463,32,8,,A
83 | -Davey Lopes,255,70,7,49,35,43,15,6311,1661,154,1019,608,820,N,E,51,54,8,450,N
84 | -Don Mattingly,677,238,31,117,113,53,5,2223,737,93,349,401,171,A,E,1377,100,6,1975,A
85 | -Darryl Motley,227,46,7,23,20,12,5,1325,324,44,156,158,67,A,W,92,2,2,,A
86 | -Dale Murphy,614,163,29,89,83,75,11,5017,1388,266,813,822,617,N,W,303,6,6,1900,N
87 | -Dwayne Murphy,329,83,9,50,39,56,9,3828,948,145,575,528,635,A,W,276,6,2,600,A
88 | -Dave Parker,637,174,31,89,116,56,14,6727,2024,247,978,1093,495,N,W,278,9,9,1041.667,N
89 | -Dan Pasqua,280,82,16,44,45,47,2,428,113,25,61,70,63,A,E,148,4,2,110,A
90 | -Darrell Porter,155,41,12,21,29,22,16,5409,1338,181,746,805,875,A,W,165,9,1,260,A
91 | -Dick Schofield,458,114,13,67,57,48,4,1350,298,28,160,123,122,A,W,246,389,18,475,A
92 | -Don Slaught,314,83,13,39,46,16,5,1457,405,28,156,159,76,A,W,533,40,4,431.5,A
93 | -Darryl Strawberry,475,123,27,76,93,72,4,1810,471,108,292,343,267,N,E,226,10,6,1220,N
94 | -Dale Sveum,317,78,7,35,35,32,1,317,78,7,35,35,32,A,E,45,122,26,70,A
95 | -Danny Tartabull,511,138,25,76,96,61,3,592,164,28,87,110,71,A,W,157,7,8,145,A
96 | -Dickie Thon,278,69,3,24,21,29,8,2079,565,32,258,192,162,N,W,142,210,10,,N
97 | -Denny Walling,382,119,13,54,58,36,12,2133,594,41,287,294,227,N,W,59,156,9,595,N
98 | -Dave Winfield,565,148,24,90,104,77,14,7287,2083,305,1135,1234,791,A,E,292,9,5,1861.46,A
99 | -Enos Cabell,277,71,2,27,29,14,15,5952,1647,60,753,596,259,N,W,360,32,5,,N
100 | -Eric Davis,415,115,27,97,71,68,3,711,184,45,156,119,99,N,W,274,2,7,300,N
101 | -Eddie Milner,424,110,15,70,47,36,7,2130,544,38,335,174,258,N,W,292,6,3,490,N
102 | -Eddie Murray,495,151,17,61,84,78,10,5624,1679,275,884,1015,709,A,E,1045,88,13,2460,A
103 | -Ernest Riles,524,132,9,69,47,54,2,972,260,14,123,92,90,A,E,212,327,20,,A
104 | -Ed Romero,233,49,2,41,23,18,8,1350,336,7,166,122,106,A,E,102,132,10,375,A
105 | -Ernie Whitt,395,106,16,48,56,35,10,2303,571,86,266,323,248,A,E,709,41,7,,A
106 | -Fred Lynn,397,114,23,67,67,53,13,5589,1632,241,906,926,716,A,E,244,2,4,,A
107 | -Floyd Rayford,210,37,8,15,19,15,6,994,244,36,107,114,53,A,E,40,115,15,,A
108 | -Franklin Stubbs,420,95,23,55,58,37,3,646,139,31,77,77,61,N,W,206,10,7,,N
109 | -Frank White,566,154,22,76,84,43,14,6100,1583,131,743,693,300,A,W,316,439,10,750,A
110 | -George Bell,641,198,31,101,108,41,5,2129,610,92,297,319,117,A,E,269,17,10,1175,A
111 | -Glenn Braggs,215,51,4,19,18,11,1,215,51,4,19,18,11,A,E,116,5,12,70,A
112 | -George Brett,441,128,16,70,73,80,14,6675,2095,209,1072,1050,695,A,W,97,218,16,1500,A
113 | -Greg Brock,325,76,16,33,52,37,5,1506,351,71,195,219,214,N,W,726,87,3,385,A
114 | -Gary Carter,490,125,24,81,105,62,13,6063,1646,271,847,999,680,N,E,869,62,8,1925.571,N
115 | -Glenn Davis,574,152,31,91,101,64,3,985,260,53,148,173,95,N,W,1253,111,11,215,N
116 | -George Foster,284,64,14,30,42,24,18,7023,1925,348,986,1239,666,N,E,96,4,4,,N
117 | -Gary Gaetti,596,171,34,91,108,52,6,2862,728,107,361,401,224,A,W,118,334,21,900,A
118 | -Greg Gagne,472,118,12,63,54,30,4,793,187,14,102,80,50,A,W,228,377,26,155,A
119 | -George Hendrick,283,77,14,45,47,26,16,6840,1910,259,915,1067,546,A,W,144,6,5,700,A
120 | -Glenn Hubbard,408,94,4,42,36,66,9,3573,866,59,429,365,410,N,W,282,487,19,535,N
121 | -Garth Iorg,327,85,3,30,44,20,8,2140,568,16,216,208,93,A,E,91,185,12,362.5,A
122 | -Gary Matthews,370,96,21,49,46,60,15,6986,1972,231,1070,955,921,N,E,137,5,9,733.333,N
123 | -Graig Nettles,354,77,16,36,55,41,20,8716,2172,384,1172,1267,1057,N,W,83,174,16,200,N
124 | -Gary Pettis,539,139,5,93,58,69,5,1469,369,12,247,126,198,A,W,462,9,7,400,A
125 | -Gary Redus,340,84,11,62,33,47,5,1516,376,42,284,141,219,N,E,185,8,4,400,A
126 | -Garry Templeton,510,126,2,42,44,35,11,5562,1578,44,703,519,256,N,W,207,358,20,737.5,N
127 | -Gorman Thomas,315,59,16,45,36,58,13,4677,1051,268,681,782,697,A,W,0,0,0,,A
128 | -Greg Walker,282,78,13,37,51,29,5,1649,453,73,211,280,138,A,W,670,57,5,500,A
129 | -Gary Ward,380,120,5,54,51,31,8,3118,900,92,444,419,240,A,W,237,8,1,600,A
130 | -Glenn Wilson,584,158,15,70,84,42,5,2358,636,58,265,316,134,N,E,331,20,4,662.5,N
131 | -Harold Baines,570,169,21,72,88,38,7,3754,1077,140,492,589,263,A,W,295,15,5,950,A
132 | -Hubie Brooks,306,104,14,50,58,25,7,2954,822,55,313,377,187,N,E,116,222,15,750,N
133 | -Howard Johnson,220,54,10,30,39,31,5,1185,299,40,145,154,128,N,E,50,136,20,297.5,N
134 | -Hal McRae,278,70,7,22,37,18,18,7186,2081,190,935,1088,643,A,W,0,0,0,325,A
135 | -Harold Reynolds,445,99,1,46,24,29,4,618,129,1,72,31,48,A,W,278,415,16,87.5,A
136 | -Harry Spilman,143,39,5,18,30,15,9,639,151,16,80,97,61,N,W,138,15,1,175,N
137 | -Herm Winningham,185,40,4,23,11,18,3,524,125,7,58,37,47,N,E,97,2,2,90,N
138 | -Jesse Barfield,589,170,40,107,108,69,6,2325,634,128,371,376,238,A,E,368,20,3,1237.5,A
139 | -Juan Beniquez,343,103,6,48,36,40,15,4338,1193,70,581,421,325,A,E,211,56,13,430,A
140 | -Juan Bonilla,284,69,1,33,18,25,5,1407,361,6,139,98,111,A,E,122,140,5,,N
141 | -John Cangelosi,438,103,2,65,32,71,2,440,103,2,67,32,71,A,W,276,7,9,100,N
142 | -Jose Canseco,600,144,33,85,117,65,2,696,173,38,101,130,69,A,W,319,4,14,165,A
143 | -Joe Carter,663,200,29,108,121,32,4,1447,404,57,210,222,68,A,E,241,8,6,250,A
144 | -Jack Clark,232,55,9,34,23,45,12,4405,1213,194,702,705,625,N,E,623,35,3,1300,N
145 | -Jose Cruz,479,133,10,48,72,55,17,7472,2147,153,980,1032,854,N,W,237,5,4,773.333,N
146 | -Julio Cruz,209,45,0,38,19,42,10,3859,916,23,557,279,478,A,W,132,205,5,,A
147 | -Jody Davis,528,132,21,61,74,41,6,2641,671,97,273,383,226,N,E,885,105,8,1008.333,N
148 | -Jim Dwyer,160,39,8,18,31,22,14,2128,543,56,304,268,298,A,E,33,3,0,275,A
149 | -Julio Franco,599,183,10,80,74,32,5,2482,715,27,330,326,158,A,E,231,374,18,775,A
150 | -Jim Gantner,497,136,7,58,38,26,11,3871,1066,40,450,367,241,A,E,304,347,10,850,A
151 | -Johnny Grubb,210,70,13,32,51,28,15,4040,1130,97,544,462,551,A,E,0,0,0,365,A
152 | -Jerry Hairston,225,61,5,32,26,26,11,1568,408,25,202,185,257,A,W,132,9,0,,A
153 | -Jack Howell,151,41,4,26,21,19,2,288,68,9,45,39,35,A,W,28,56,2,95,A
154 | -John Kruk,278,86,4,33,38,45,1,278,86,4,33,38,45,N,W,102,4,2,110,N
155 | -Jeffrey Leonard,341,95,6,48,42,20,10,2964,808,81,379,428,221,N,W,158,4,5,100,N
156 | -Jim Morrison,537,147,23,58,88,47,10,2744,730,97,302,351,174,N,E,92,257,20,277.5,N
157 | -John Moses,399,102,3,56,34,34,5,670,167,4,89,48,54,A,W,211,9,3,80,A
158 | -Jerry Mumphrey,309,94,5,37,32,26,13,4618,1330,57,616,522,436,N,E,161,3,3,600,N
159 | -Joe Orsulak,401,100,2,60,19,28,4,876,238,2,126,44,55,N,E,193,11,4,,N
160 | -Jorge Orta,336,93,9,35,46,23,15,5779,1610,128,730,741,497,A,W,0,0,0,,A
161 | -Jim Presley,616,163,27,83,107,32,3,1437,377,65,181,227,82,A,W,110,308,15,200,A
162 | -Jamie Quirk,219,47,8,24,26,17,12,1188,286,23,100,125,63,A,W,260,58,4,,A
163 | -Johnny Ray,579,174,7,67,78,58,6,3053,880,32,366,337,218,N,E,280,479,5,657,N
164 | -Jeff Reed,165,39,2,13,9,16,3,196,44,2,18,10,18,A,W,332,19,2,75,N
165 | -Jim Rice,618,200,20,98,110,62,13,7127,2163,351,1104,1289,564,A,E,330,16,8,2412.5,A
166 | -Jerry Royster,257,66,5,31,26,32,14,3910,979,33,518,324,382,N,W,87,166,14,250,A
167 | -John Russell,315,76,13,35,60,25,3,630,151,24,68,94,55,N,E,498,39,13,155,N
168 | -Juan Samuel,591,157,16,90,78,26,4,2020,541,52,310,226,91,N,E,290,440,25,640,N
169 | -John Shelby,404,92,11,54,49,18,6,1354,325,30,188,135,63,A,E,222,5,5,300,A
170 | -Joel Skinner,315,73,5,23,37,16,4,450,108,6,38,46,28,A,W,227,15,3,110,A
171 | -Jeff Stone,249,69,6,32,19,20,4,702,209,10,97,48,44,N,E,103,8,2,,N
172 | -Jim Sundberg,429,91,12,41,42,57,13,5590,1397,83,578,579,644,A,W,686,46,4,825,N
173 | -Jim Traber,212,54,13,28,44,18,2,233,59,13,31,46,20,A,E,243,23,5,,A
174 | -Jose Uribe,453,101,3,46,43,61,3,948,218,6,96,72,91,N,W,249,444,16,195,N
175 | -Jerry Willard,161,43,4,17,26,22,3,707,179,21,77,99,76,A,W,300,12,2,,A
176 | -Joel Youngblood,184,47,5,20,28,18,11,3327,890,74,419,382,304,N,W,49,2,0,450,N
177 | -Kevin Bass,591,184,20,83,79,38,5,1689,462,40,219,195,82,N,W,303,12,5,630,N
178 | -Kal Daniels,181,58,6,34,23,22,1,181,58,6,34,23,22,N,W,88,0,3,86.5,N
179 | -Kirk Gibson,441,118,28,84,86,68,8,2723,750,126,433,420,309,A,E,190,2,2,1300,A
180 | -Ken Griffey,490,150,21,69,58,35,14,6126,1839,121,983,707,600,A,E,96,5,3,1000,N
181 | -Keith Hernandez,551,171,13,94,83,94,13,6090,1840,128,969,900,917,N,E,1199,149,5,1800,N
182 | -Kent Hrbek,550,147,29,85,91,71,6,2816,815,117,405,474,319,A,W,1218,104,10,1310,A
183 | -Ken Landreaux,283,74,4,34,29,22,10,3919,1062,85,505,456,283,N,W,145,5,7,737.5,N
184 | -Kevin McReynolds,560,161,26,89,96,66,4,1789,470,65,233,260,155,N,W,332,9,8,625,N
185 | -Kevin Mitchell,328,91,12,51,43,33,2,342,94,12,51,44,33,N,E,145,59,8,125,N
186 | -Keith Moreland,586,159,12,72,79,53,9,3082,880,83,363,477,295,N,E,181,13,4,1043.333,N
187 | -Ken Oberkfell,503,136,5,62,48,83,10,3423,970,20,408,303,414,N,W,65,258,8,725,N
188 | -Ken Phelps,344,85,24,69,64,88,7,911,214,64,150,156,187,A,W,0,0,0,300,A
189 | -Kirby Puckett,680,223,31,119,96,34,3,1928,587,35,262,201,91,A,W,429,8,6,365,A
190 | -Kurt Stillwell,279,64,0,31,26,30,1,279,64,0,31,26,30,N,W,107,205,16,75,N
191 | -Leon Durham,484,127,20,66,65,67,7,3006,844,116,436,458,377,N,E,1231,80,7,1183.333,N
192 | -Len Dykstra,431,127,8,77,45,58,2,667,187,9,117,64,88,N,E,283,8,3,202.5,N
193 | -Larry Herndon,283,70,8,33,37,27,12,4479,1222,94,557,483,307,A,E,156,2,2,225,A
194 | -Lee Lacy,491,141,11,77,47,37,15,4291,1240,84,615,430,340,A,E,239,8,2,525,A
195 | -Len Matuszek,199,52,9,26,28,21,6,805,191,30,113,119,87,N,W,235,22,5,265,N
196 | -Lloyd Moseby,589,149,21,89,86,64,7,3558,928,102,513,471,351,A,E,371,6,6,787.5,A
197 | -Lance Parrish,327,84,22,53,62,38,10,4273,1123,212,577,700,334,A,E,483,48,6,800,N
198 | -Larry Parrish,464,128,28,67,94,52,13,5829,1552,210,740,840,452,A,W,0,0,0,587.5,A
199 | -Luis Rivera,166,34,0,20,13,17,1,166,34,0,20,13,17,N,E,64,119,9,,N
200 | -Larry Sheets,338,92,18,42,60,21,3,682,185,36,88,112,50,A,E,0,0,0,145,A
201 | -Lonnie Smith,508,146,8,80,44,46,9,3148,915,41,571,289,326,A,W,245,5,9,,A
202 | -Lou Whitaker,584,157,20,95,73,63,10,4704,1320,93,724,522,576,A,E,276,421,11,420,A
203 | -Mike Aldrete,216,54,2,27,25,33,1,216,54,2,27,25,33,N,W,317,36,1,75,N
204 | -Marty Barrett,625,179,4,94,60,65,5,1696,476,12,216,163,166,A,E,303,450,14,575,A
205 | -Mike Brown,243,53,4,18,26,27,4,853,228,23,101,110,76,N,E,107,3,3,,N
206 | -Mike Davis,489,131,19,77,55,34,7,2051,549,62,300,263,153,A,W,310,9,9,780,A
207 | -Mike Diaz,209,56,12,22,36,19,2,216,58,12,24,37,19,N,E,201,6,3,90,N
208 | -Mariano Duncan,407,93,8,47,30,30,2,969,230,14,121,69,68,N,W,172,317,25,150,N
209 | -Mike Easler,490,148,14,64,78,49,13,3400,1000,113,445,491,301,A,E,0,0,0,700,N
210 | -Mike Fitzgerald,209,59,6,20,37,27,4,884,209,14,66,106,92,N,E,415,35,3,,N
211 | -Mel Hall,442,131,18,68,77,33,6,1416,398,47,210,203,136,A,E,233,7,7,550,A
212 | -Mickey Hatcher,317,88,3,40,32,19,8,2543,715,28,269,270,118,A,W,220,16,4,,A
213 | -Mike Heath,288,65,8,30,36,27,9,2815,698,55,315,325,189,N,E,259,30,10,650,A
214 | -Mike Kingery,209,54,3,25,14,12,1,209,54,3,25,14,12,A,W,102,6,3,68,A
215 | -Mike LaValliere,303,71,3,18,30,36,3,344,76,3,20,36,45,N,E,468,47,6,100,N
216 | -Mike Marshall,330,77,19,47,53,27,6,1928,516,90,247,288,161,N,W,149,8,6,670,N
217 | -Mike Pagliarulo,504,120,28,71,71,54,3,1085,259,54,150,167,114,A,E,103,283,19,175,A
218 | -Mark Salas,258,60,8,28,33,18,3,638,170,17,80,75,36,A,W,358,32,8,137,A
219 | -Mike Schmidt,20,1,0,0,0,0,2,41,9,2,6,7,4,N,E,78,220,6,2127.333,N
220 | -Mike Scioscia,374,94,5,36,26,62,7,1968,519,26,181,199,288,N,W,756,64,15,875,N
221 | -Mickey Tettleton,211,43,10,26,35,39,3,498,116,14,59,55,78,A,W,463,32,8,120,A
222 | -Milt Thompson,299,75,6,38,23,26,3,580,160,8,71,33,44,N,E,212,1,2,140,N
223 | -Mitch Webster,576,167,8,89,49,57,4,822,232,19,132,83,79,N,E,325,12,8,210,N
224 | -Mookie Wilson,381,110,9,61,45,32,7,3015,834,40,451,249,168,N,E,228,7,5,800,N
225 | -Marvell Wynne,288,76,7,34,37,15,4,1644,408,16,198,120,113,N,W,203,3,3,240,N
226 | -Mike Young,369,93,9,43,42,49,5,1258,323,54,181,177,157,A,E,149,1,6,350,A
227 | -Nick Esasky,330,76,12,35,41,47,4,1367,326,55,167,198,167,N,W,512,30,5,,N
228 | -Ozzie Guillen,547,137,2,58,47,12,2,1038,271,3,129,80,24,A,W,261,459,22,175,A
229 | -Oddibe McDowell,572,152,18,105,49,65,2,978,249,36,168,91,101,A,W,325,13,3,200,A
230 | -Omar Moreno,359,84,4,46,27,21,12,4992,1257,37,699,386,387,N,W,151,8,5,,N
231 | -Ozzie Smith,514,144,0,67,54,79,9,4739,1169,13,583,374,528,N,E,229,453,15,1940,N
232 | -Ozzie Virgil,359,80,15,45,48,63,7,1493,359,61,176,202,175,N,W,682,93,13,700,N
233 | -Phil Bradley,526,163,12,88,50,77,4,1556,470,38,245,167,174,A,W,250,11,1,750,A
234 | -Phil Garner,313,83,9,43,41,30,14,5885,1543,104,751,714,535,N,W,58,141,23,450,N
235 | -Pete Incaviglia,540,135,30,82,88,55,1,540,135,30,82,88,55,A,W,157,6,14,172,A
236 | -Paul Molitor,437,123,9,62,55,40,9,4139,1203,79,676,390,364,A,E,82,170,15,1260,A
237 | -Pete O'Brien,551,160,23,86,90,87,5,2235,602,75,278,328,273,A,W,1224,115,11,,A
238 | -Pete Rose,237,52,0,15,25,30,24,14053,4256,160,2165,1314,1566,N,W,523,43,6,750,N
239 | -Pat Sheridan,236,56,6,41,19,21,5,1257,329,24,166,125,105,A,E,172,1,4,190,A
240 | -Pat Tabler,473,154,6,61,48,29,6,1966,566,29,250,252,178,A,E,846,84,9,580,A
241 | -Rafael Belliard,309,72,0,33,31,26,5,354,82,0,41,32,26,N,E,117,269,12,130,N
242 | -Rick Burleson,271,77,5,35,29,33,12,4933,1358,48,630,435,403,A,W,62,90,3,450,A
243 | -Randy Bush,357,96,7,50,45,39,5,1394,344,43,178,192,136,A,W,167,2,4,300,A
244 | -Rick Cerone,216,56,4,22,18,15,12,2796,665,43,266,304,198,A,E,391,44,4,250,A
245 | -Ron Cey,256,70,13,42,36,44,16,7058,1845,312,965,1128,990,N,E,41,118,8,1050,A
246 | -Rob Deer,466,108,33,75,86,72,3,652,142,44,102,109,102,A,E,286,8,8,215,A
247 | -Rick Dempsey,327,68,13,42,29,45,18,3949,939,78,438,380,466,A,E,659,53,7,400,A
248 | -Rich Gedman,462,119,16,49,65,37,7,2131,583,69,244,288,150,A,E,866,65,6,,A
249 | -Ron Hassey,341,110,9,45,49,46,9,2331,658,50,249,322,274,A,E,251,9,4,560,A
250 | -Rickey Henderson,608,160,28,130,74,89,8,4071,1182,103,862,417,708,A,E,426,4,6,1670,A
251 | -Reggie Jackson,419,101,18,65,58,92,20,9528,2510,548,1509,1659,1342,A,W,0,0,0,487.5,A
252 | -Ricky Jones,33,6,0,2,4,7,1,33,6,0,2,4,7,A,W,205,5,4,,A
253 | -Ron Kittle,376,82,21,42,60,35,5,1770,408,115,238,299,157,A,W,0,0,0,425,A
254 | -Ray Knight,486,145,11,51,76,40,11,3967,1102,67,410,497,284,N,E,88,204,16,500,A
255 | -Randy Kutcher,186,44,7,28,16,11,1,186,44,7,28,16,11,N,W,99,3,1,,N
256 | -Rudy Law,307,80,1,42,36,29,7,2421,656,18,379,198,184,A,W,145,2,2,,A
257 | -Rick Leach,246,76,5,35,39,13,6,912,234,12,102,96,80,A,E,44,0,1,250,A
258 | -Rick Manning,205,52,8,31,27,17,12,5134,1323,56,643,445,459,A,E,155,3,2,400,A
259 | -Rance Mulliniks,348,90,11,50,45,43,10,2288,614,43,295,273,269,A,E,60,176,6,450,A
260 | -Ron Oester,523,135,8,52,44,52,9,3368,895,39,377,284,296,N,W,367,475,19,750,N
261 | -Rey Quinones,312,68,2,32,22,24,1,312,68,2,32,22,24,A,E,86,150,15,70,A
262 | -Rafael Ramirez,496,119,8,57,33,21,7,3358,882,36,365,280,165,N,W,155,371,29,875,N
263 | -Ronn Reynolds,126,27,3,8,10,5,4,239,49,3,16,13,14,N,E,190,2,9,190,N
264 | -Ron Roenicke,275,68,5,42,42,61,6,961,238,16,128,104,172,N,E,181,3,2,191,N
265 | -Ryne Sandberg,627,178,14,68,76,46,6,3146,902,74,494,345,242,N,E,309,492,5,740,N
266 | -Rafael Santana,394,86,1,38,28,36,4,1089,267,3,94,71,76,N,E,203,369,16,250,N
267 | -Rick Schu,208,57,8,32,25,18,3,653,170,17,98,54,62,N,E,42,94,13,140,N
268 | -Ruben Sierra,382,101,16,50,55,22,1,382,101,16,50,55,22,A,W,200,7,6,97.5,A
269 | -Roy Smalley,459,113,20,59,57,68,12,5348,1369,155,713,660,735,A,W,0,0,0,740,A
270 | -Robby Thompson,549,149,7,73,47,42,1,549,149,7,73,47,42,N,W,255,450,17,140,N
271 | -Rob Wilfong,288,63,3,25,33,16,10,2682,667,38,315,259,204,A,W,135,257,7,341.667,A
272 | -Reggie Williams,303,84,4,35,32,23,2,312,87,4,39,32,23,N,W,179,5,3,,N
273 | -Robin Yount,522,163,9,82,46,62,13,7037,2019,153,1043,827,535,A,E,352,9,1,1000,A
274 | -Steve Balboni,512,117,29,54,88,43,6,1750,412,100,204,276,155,A,W,1236,98,18,100,A
275 | -Scott Bradley,220,66,5,20,28,13,3,290,80,5,27,31,15,A,W,281,21,3,90,A
276 | -Sid Bream,522,140,16,73,77,60,4,730,185,22,93,106,86,N,E,1320,166,17,200,N
277 | -Steve Buechele,461,112,18,54,54,35,2,680,160,24,76,75,49,A,W,111,226,11,135,A
278 | -Shawon Dunston,581,145,17,66,68,21,2,831,210,21,106,86,40,N,E,320,465,32,155,N
279 | -Scott Fletcher,530,159,3,82,50,47,6,1619,426,11,218,149,163,A,W,196,354,15,475,A
280 | -Steve Garvey,557,142,21,58,81,23,18,8759,2583,271,1138,1299,478,N,W,1160,53,7,1450,N
281 | -Steve Jeltz,439,96,0,44,36,65,4,711,148,1,68,56,99,N,E,229,406,22,150,N
282 | -Steve Lombardozzi,453,103,8,53,33,52,2,507,123,8,63,39,58,A,W,289,407,6,105,A
283 | -Spike Owen,528,122,1,67,45,51,4,1716,403,12,211,146,155,A,W,209,372,17,350,A
284 | -Steve Sax,633,210,6,91,56,59,6,3070,872,19,420,230,274,N,W,367,432,16,90,N
285 | -Tony Armas,16,2,0,1,0,0,2,28,4,0,1,0,0,A,E,247,4,8,,A
286 | -Tony Bernazard,562,169,17,88,73,53,8,3181,841,61,450,342,373,A,E,351,442,17,530,A
287 | -Tom Brookens,281,76,3,42,25,20,8,2658,657,48,324,300,179,A,E,106,144,7,341.667,A
288 | -Tom Brunansky,593,152,23,69,75,53,6,2765,686,133,369,384,321,A,W,315,10,6,940,A
289 | -Tony Fernandez,687,213,10,91,65,27,4,1518,448,15,196,137,89,A,E,294,445,13,350,A
290 | -Tim Flannery,368,103,3,48,28,54,8,1897,493,9,207,162,198,N,W,209,246,3,326.667,N
291 | -Tom Foley,263,70,1,26,23,30,4,888,220,9,83,82,86,N,E,81,147,4,250,N
292 | -Tony Gwynn,642,211,14,107,59,52,5,2364,770,27,352,230,193,N,W,337,19,4,740,N
293 | -Terry Harper,265,68,8,26,30,29,7,1337,339,32,135,163,128,N,W,92,5,3,425,A
294 | -Toby Harrah,289,63,7,36,41,44,17,7402,1954,195,1115,919,1153,A,W,166,211,7,,A
295 | -Tommy Herr,559,141,2,48,61,73,8,3162,874,16,421,349,359,N,E,352,414,9,925,N
296 | -Tim Hulett,520,120,17,53,44,21,4,927,227,22,106,80,52,A,W,70,144,11,185,A
297 | -Terry Kennedy,19,4,1,2,3,1,1,19,4,1,2,3,1,N,W,692,70,8,920,A
298 | -Tito Landrum,205,43,2,24,17,20,7,854,219,12,105,99,71,N,E,131,6,1,286.667,N
299 | -Tim Laudner,193,47,10,21,29,24,6,1136,256,42,129,139,106,A,W,299,13,5,245,A
300 | -Tom O'Malley,181,46,1,19,18,17,5,937,238,9,88,95,104,A,E,37,98,9,,A
301 | -Tom Paciorek,213,61,4,17,22,3,17,4061,1145,83,488,491,244,A,W,178,45,4,235,A
302 | -Tony Pena,510,147,10,56,52,53,7,2872,821,63,307,340,174,N,E,810,99,18,1150,N
303 | -Terry Pendleton,578,138,1,56,59,34,3,1399,357,7,149,161,87,N,E,133,371,20,160,N
304 | -Tony Perez,200,51,2,14,29,25,23,9778,2732,379,1272,1652,925,N,W,398,29,7,,N
305 | -Tony Phillips,441,113,5,76,52,76,5,1546,397,17,226,149,191,A,W,160,290,11,425,A
306 | -Terry Puhl,172,42,3,17,14,15,10,4086,1150,57,579,363,406,N,W,65,0,0,900,N
307 | -Tim Raines,580,194,9,91,62,78,8,3372,1028,48,604,314,469,N,E,270,13,6,,N
308 | -Ted Simmons,127,32,4,14,25,12,19,8396,2402,242,1048,1348,819,N,W,167,18,6,500,N
309 | -Tim Teufel,279,69,4,35,31,32,4,1359,355,31,180,148,158,N,E,133,173,9,277.5,N
310 | -Tim Wallach,480,112,18,50,71,44,7,3031,771,110,338,406,239,N,E,94,270,16,750,N
311 | -Vince Coleman,600,139,0,94,29,60,2,1236,309,1,201,69,110,N,E,300,12,9,160,N
312 | -Von Hayes,610,186,19,107,98,74,6,2728,753,69,399,366,286,N,E,1182,96,13,1300,N
313 | -Vance Law,360,81,5,37,44,37,7,2268,566,41,279,257,246,N,E,170,284,3,525,N
314 | -Wally Backman,387,124,1,67,27,36,7,1775,506,6,272,125,194,N,E,186,290,17,550,N
315 | -Wade Boggs,580,207,8,107,71,105,5,2778,978,32,474,322,417,A,E,121,267,19,1600,A
316 | -Will Clark,408,117,11,66,41,34,1,408,117,11,66,41,34,N,W,942,72,11,120,N
317 | -Wally Joyner,593,172,22,82,100,57,1,593,172,22,82,100,57,A,W,1222,139,15,165,A
318 | -Wayne Krenchicki,221,53,2,21,23,22,8,1063,283,15,107,124,106,N,E,325,58,6,,N
319 | -Willie McGee,497,127,7,65,48,37,5,2703,806,32,379,311,138,N,E,325,9,3,700,N
320 | -Willie Randolph,492,136,5,76,50,94,12,5511,1511,39,897,451,875,A,E,313,381,20,875,A
321 | -Wayne Tolleson,475,126,3,61,43,52,6,1700,433,7,217,93,146,A,W,37,113,7,385,A
322 | -Willie Upshaw,573,144,9,85,60,78,8,3198,857,97,470,420,332,A,E,1314,131,12,960,A
323 | -Willie Wilson,631,170,9,77,44,31,11,4908,1457,30,775,357,249,A,W,408,4,3,1000,A
324 |
--------------------------------------------------------------------------------
/data/Khan_ytest.csv:
--------------------------------------------------------------------------------
1 | "","x"
2 | "1",3
3 | "2",2
4 | "3",4
5 | "4",2
6 | "5",1
7 | "6",3
8 | "7",4
9 | "8",2
10 | "9",3
11 | "10",1
12 | "11",3
13 | "12",4
14 | "13",1
15 | "14",2
16 | "15",2
17 | "16",2
18 | "17",4
19 | "18",3
20 | "19",4
21 | "20",3
22 |
--------------------------------------------------------------------------------
/data/Khan_ytrain.csv:
--------------------------------------------------------------------------------
1 | "","x"
2 | "1",2
3 | "2",2
4 | "3",2
5 | "4",2
6 | "5",2
7 | "6",2
8 | "7",2
9 | "8",2
10 | "9",2
11 | "10",2
12 | "11",2
13 | "12",2
14 | "13",2
15 | "14",2
16 | "15",2
17 | "16",2
18 | "17",2
19 | "18",2
20 | "19",2
21 | "20",2
22 | "21",2
23 | "22",2
24 | "23",2
25 | "24",4
26 | "25",4
27 | "26",4
28 | "27",4
29 | "28",4
30 | "29",4
31 | "30",4
32 | "31",4
33 | "32",4
34 | "33",4
35 | "34",4
36 | "35",4
37 | "36",4
38 | "37",4
39 | "38",4
40 | "39",4
41 | "40",4
42 | "41",4
43 | "42",4
44 | "43",4
45 | "44",3
46 | "45",3
47 | "46",3
48 | "47",3
49 | "48",3
50 | "49",3
51 | "50",3
52 | "51",3
53 | "52",3
54 | "53",3
55 | "54",3
56 | "55",3
57 | "56",1
58 | "57",1
59 | "58",1
60 | "59",1
61 | "60",1
62 | "61",1
63 | "62",1
64 | "63",1
65 |
--------------------------------------------------------------------------------