├── .gitignore ├── LICENSE ├── README.md ├── R_Exercises ├── Exercise1 │ └── Exercise1.Rmd ├── Exercise2 │ └── Exercise2.Rmd ├── Exercise3 │ ├── Exercise3.Rmd │ ├── ex_p1.JPG │ ├── ex_p2.jpg │ ├── ex_p3_part1.jpg │ ├── ex_p3_part2.jpg │ ├── ex_p4_part1.jpg │ ├── ex_p4_part2.jpg │ ├── ex_p4_part3.jpg │ └── ex_p4_part4.jpg ├── Exercise4 │ └── Exercise4.Rmd ├── Exercise5 │ └── Exercise5.Rmd ├── Exercise6 │ └── Exercise6.Rmd └── Exercise7 │ └── Exercise7.Rmd ├── R_Labs ├── Lab1 │ └── Lab1.Rmd ├── Lab2 │ └── Lab2.Rmd ├── Lab3 │ └── Lab3.Rmd ├── Lab4 │ └── Lab4.rmd ├── Lab5 │ └── Lab5.Rmd ├── Lab6 │ └── Lab6.Rmd └── Lab7 │ └── Lab7.Rmd └── data ├── Advertising.csv ├── Auto.csv ├── Auto.data ├── Ch10Ex11.csv ├── College.csv ├── Credit.csv ├── Heart.csv ├── Income1.csv └── Income2.csv /.gitignore: -------------------------------------------------------------------------------- 1 | # History files 2 | .Rhistory 3 | .Rdata 4 | *.Rproj 5 | *.md 6 | *.Md 7 | *.html 8 | *.png 9 | *figure/* 10 | 11 | # Example code in package build process 12 | *-Ex.R 13 | .Rproj.user 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2013 John St. John 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | IntroToStatisticalLearningR 2 | =========================== 3 | 4 | My work through the different examples given in www.StatLearning.com 5 | -------------------------------------------------------------------------------- /R_Exercises/Exercise1/Exercise1.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Exercise 1 2 | ======================================================== 3 | 4 | 5 | Conceptual 6 | --- 7 | 8 | 1. **_In general, do we expect the performance of a flexible statistical learning method to perform better or worse than an inflexible method when:_** 9 | 1. **_The sample size n is extremely large, and the number of predictors p is small?_** In this case, since we have so much data, I would expect that a more flexible model would perform better. 10 | 2. 
**_p is extremely large, and n is small?_** In this case, we are very prone to overfitting, a more inflexible method is much preffered. 11 | 3. **_relationship between predictors and response is highly non-linear?_** In this case, inflexible methods might force an unwarented linearity on the data, and underfit, so more flexible methods would be appropriate. 12 | 4. **_variance of the error terms is extremely high?_** In this case, highly flexible methods would be prone to fitting the error, and perform worse than less flexible methods. 13 | 2. **_Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p._** Note that inference is how Y changes as a function of X, while prediction is determining what Y is given X. 14 | 1. **_We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary._** This is an inference problem, we want to know how various variables effect salary, rather than simply being able to predict salary. We are inferring relationships to a continuous variable, so it is regression rather than classification. `n = 500, p = 3` (I think the 4th variable, CEO salary, being the output we want to predict, is not counted in p) 15 | 2. **_We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables._** This is a prediction problem, we only care about whether the outcome is success or failure. The outcome is binary, so this is a classification problem. `n = 20, p = 10+3 = 13` And one more for the outcome variable, which I am not counting in `p`. 16 | 3. **_We are interesting in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market._** This is an inference problem, we want to know the relationship between the US dollar and weekly changes in the world stock market. We are predicting a continuous variable, so it is a prediction problem. 17 | 3. **_We now revisit the bias-variance decomposition._** 18 | 1. **_Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a sin- gle plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one._** I am just going to describe what I would draw here. First off the Bayes error curve is easy. This is simply the costant horizontal line representing the Variance around the optimal decision boundary. This doesn't change for a given underlying true datagenerating function. The squared bias is going to decrease until it hits a minimum and then stay there, the variance will then take over and the more flexible methods will be fitting variance, which is going to start low, then go up in frequency. 
The training error is going to drop and get very low as flexiblilty increases, however the testing error is going to have a U shape, starting high probably, dropping down to some optimal minimum, then rising again as the more flexible models begin to overfit to the underlying variance in the training set. 19 | 2. **_Explain why each of the five curves has the shape displayed in part (1)._** See previous description that also had explanations. 20 | 4. **_You will now think of some real-life applications for statistical learning_** 21 | 1. **_Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer._** a) Predicting stock price gain vs loss, stock price gain or loss, daily news + twitter + other features, prediction. b) Determining which genes have expression that is useful for determining response to a drug, drug response or no response, gene expression levels for all genes, inference. c) Predicting which links a user will click on, success or failure of click, user history, context of add, etc, prediction. 22 | 2. **_Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer._** a) Predicting stock price change amplitude, stock price change, daily news + twitter + other features, prediction. b) Determining which genes have expression that is useful that correlate with PFS, expression correlation to PFS, gene expression levels for all genes, inference. c) Predicting what market price a house will sell for, house value in dolars, neighborhood + schools + other recent sales of similar homes in area, prediction. 23 | 3. **_Describe three real-life applications in which cluster analysis might be useful._** a) identify cancer sub-types and what genes drive those, classification, inference. b) identify web usage outlier weeks, web usage over each week, prediction c) identify groups of users that might have similar behaviour that is distinct in some useful way from other users, behaviour data of some sort, inference. 24 | 5. **_What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?_** More flexible approaches tend to be more useful for prediction than inference. Inference requires knowledge of how the result is a function of the data, Prediction can be a black box. Also more flexible methods are required when the underlying data varries significantly from linear assumptions. 25 | 6. **_Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a para- metric approach to regression or classification (as opposed to a non- parametric approach)? What are its disadvantages?_** Parametric approaches do not have as much of an issue with overfitting. They are more resitrictive, but as a result fewer observations of underlying data are required to fit them fairly well. Nonparametric methods are sometimes required when the underlying data is very different from what can be fit with parametric techniques, or when the underlying distribution is unknown but obviously not normally distributed. 26 | 7. 
**_The table below provides a training data set containing 6 observations, 3 predictors, and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors._** 27 | 1. **_Compute the Euclidean distance between each observation and the test point, X1=X2=X3=0._** euclidian distance is `sqrt((x1-x2)^2+...)`. 1) 0-0 0-3 0-0 = sqrt(9) = 3 2) 2 3) sqrt(10) = 3.16 4) sqrt(5) = 2.24 5) sqrt(2) = 1.141 6) sqrt(5) = 2.24 28 | 2. **_What is our prediction with K = 1? Why?_** just the closest point which is point 5, that has class Green. 29 | 3. **_What is our prediction with K = 3? Why?_** average of 3 closest points, 5,2, and 4/6 are equidistant, so include them both I guess. I have actually just searched this, the R version of KNN includes all ties by default, or chooses randomly K points when there are ties, and other methods simply do K-1,K-2 and so on until no more ties exist. I will do the R version (3 Red + 2Green) = Red 30 | 4. **_If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?_** The higher K is the closer the decision boundary is to being linear. Lower K can fit more irregular data like this. 31 | 32 | Applied 33 | -------- 34 | 35 | 8. 36 | 1. up until part 3. 37 | ```{r} 38 | college=read.csv("~/src/IntroToStatisticalLearningR/data/College.csv") 39 | rownames(college) <- college[,1] 40 | college <- college[,-1] 41 | summary(college) 42 | ``` 43 | Show a pairs plot 44 | 45 | ```{r fig.width=7, fig.height=6} 46 | pairs(college[,1:10]) 47 | ``` 48 | ```{r} 49 | Elite=rep("No",nrow(college)) 50 | Elite[college$Top10perc >50]="Yes" 51 | Elite=as.factor(Elite) 52 | college=data.frame(college,Elite) 53 | summary(Elite) 54 | ``` 55 | Elite college vs Outstate tuition 56 | ```{r fig.width=7, fig.height=6} 57 | plot(college$Elite, college$Outstate) 58 | ``` 59 | Some histograms of different variables 60 | ```{r fig.width=7, fig.height=6} 61 | par(mfrow=c(2,2)) 62 | hist(college$PhD) 63 | hist(college$Accept) 64 | hist(college$Enroll) 65 | hist(college$S.F.Ratio) 66 | ``` 67 | Hmm, lets plot enrollment vs applications by elite status 68 | ```{r fig.width=7, fig.height=6} 69 | par(mfrow=c(1,2)) 70 | plot(college$Enroll[college$Elite == 'Yes'], college$Apps[college$Elite == 'Yes'], main="Elite") 71 | plot(college$Enroll[college$Elite == 'No'], college$Apps[college$Elite == 'No'], main="Not Elite") 72 | ``` 73 | And with ratios 74 | ```{r fig.width=7, fig.height=6} 75 | par(mfrow=c(1,2)) 76 | hist(college$Enroll[college$Elite == 'Yes']/college$Apps[college$Elite == 'Yes'], main="Elite") 77 | hist(college$Enroll[college$Elite == 'No']/college$Apps[college$Elite == 'No'], main="Not Elite") 78 | ``` 79 | Doesn't seem to be the strongest signal with enrollment vs application number vs elite status. 80 | 81 | 9. 82 | ```{r} 83 | Auto = read.table("~/src/IntroToStatisticalLearningR/data/Auto.data", head=T, na.strings="?") 84 | Auto <- na.omit(Auto) 85 | summary(Auto) 86 | ``` 87 | 1. mpg, cylinders, displacement, horsepower, weight, acceleration, are quantative. origin, year, and name are qualitative. 88 | 2. mpg: 9-46, cylinders: 3-8, displacement: 68-455, horsepower: 46-230, weight:1613-5140, acceleration:8-24.8 89 | 3. means: `r apply(Auto[,1:6],2,mean)`, sds: `r apply(Auto[,1:6],2,sd)` 90 | 4. range: `r apply(Auto[-(10:85),1:6],2,range)` means: `r apply(Auto[,1:6],2,mean)`, sds: `r apply(Auto[,1:6],2,sd)` 91 | 5. 
92 | ```{r fig.width=11, fig.height=11} 93 | pairs(Auto) 94 | ``` 95 | I like the association with acceleration and year, the interesting thing is that it has the same trend as mpg and year. It basically seems that cars are becoming both more efficient, and simultaniously more fun, on average at least. 96 | 97 | 6. It looks like some linear combination of year, acceleration, and maybe one of displacement, horsepower or weight would combine to be a pretty good predictor. I would probably chose weight acceleration year as the three inputs. 98 | 99 | 100 | 10. 101 | ```{r} 102 | library(MASS) 103 | dim(Boston) 104 | summary(Boston) 105 | #?Boston 106 | ``` 107 | 1. There are 506 rows by 14 columns in the Boston dataset 108 | 2. 109 | 110 | ```{r fig.width=11, fig.height=11} 111 | pairs(Boston) 112 | ``` 113 | One interesting tidbit, as the proportion of black people goes up, the pupil teacher ratio goes down (more pupils per teacher) 114 | 115 | 3. As the proportion of owner occupied units built prior to 1940 goes up, crime rate also goes up 116 | 4. 117 | 118 | ```{r fig.width=11, fig.height=11} 119 | par(mfrow=c(2,2)) 120 | hist(Boston$crim, main="Crime Rates") 121 | hist(Boston$tax, main="Tax Rates") 122 | hist(Boston$ptratio, main="Pupil Teacher Ratio") 123 | library(scatterplot3d) 124 | scatterplot3d(log(Boston$crim),Boston$tax,Boston$ptratio,main="Boston Crime, Tax vs Pupil Teacher Ratio") 125 | ``` 126 | Most areas of boston have low crime rates, however a small number of areas in boston have very high crime rates, that variable exponentially declines. Tax rates are clearly bimodal. The highest tax rates are very seperate from the lower tax rates. The pupil teacher ratio seems to be normally distributed except for a handful of areas with very high pt ratios. 127 | 5. `r sum(Boston$chas)` suburbs border the boston river. 128 | 6. `r median(Boston$ptratio)` is the median pupil teacher ratio 129 | 7. These are the suburbs with the lowest median value of owner-occupied homes. 130 | ```{r} 131 | subset(Boston, medv == min(medv)) 132 | ``` 133 | 134 | 8. `r dim(subset(Boston, rm > 7))[1]` suburbs average more than 7 rooms per dwelling. 135 | 9. 136 | ```{r} 137 | library(plyr) 138 | Boston$O8 <- factor(ifelse(Boston$rm > 8, 1, 0)) 139 | ddply(Boston, .(O8), summary ) 140 | ``` 141 | Median crime is slightly higher in these over 8 room neighboorhods, but the mean is lower. Tax is also lower. pt ratio is lower in both mean and median. median value of homes the mean value of home is way higher. the age is also higher which seems interesting. These appear to be expensive, old neighborhoods. Proportion of non-retail business is substantially lower as well, these seem to be more residential in nature. -------------------------------------------------------------------------------- /R_Exercises/Exercise3/Exercise3.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to statistical learning exercise 3 2 | ======================================================== 3 | # Conceptual Section 4 | ****** 5 | ## Problem 1 6 | See work in ex_p1.jpg 7 | 8 | ******* 9 | ## Problem 2 10 | See work in ex_p2.jpg 11 | Also note that in answer to my question on there, yes we can remove that last term. Remember that we are maximizing (k), so we can remove any term we want that does not interact with k, and we are ok. The final term does have something to do with x, but not k, so it can be removed and the equation is still proportional. 
We could remove the summation term because for any particular k, it is the same, since it is a marginalization over all k. 12 | 13 | ******* 14 | ## Problem 3 15 | see the saved parts 1 and 2 images. The key thing here is that we can't remove the final term as we did in the previus part (the x^2/2sigma) that is now something dependent on class k, so we can't claim proportionality and remove the term when we want to identify the max. 16 | 17 | ******** 18 | ## Problem 4 19 | See the included `ex_p4_*` jpgs. 20 | 21 | ******** 22 | ## Problem 5 23 | ### Part a 24 | When the bayes decision boundary is linear (the optimal classifier) we would still predict QDA to fit the training set better since it can fit more of the error in the data. On the test set on the other hand, QDA will probably perform worse since it is modeling the error whenver it deviates from the simpler linear best fit. 25 | ### Part b 26 | If the bayes decision boundary is non-linear, we would expect QDA to perform better on both the training and test set, depending on the degree of non-linearity, and the number of cases in the test set. If the number of the samples are small, or the underlying model is nearly linear, it is still possible for LDA to perform better. 27 | ### Part c 28 | The test prediction accuracy of LDA and QDA should improve as n increases. Depending on the underlying model, if it is non-linear, then at some point QDA will learn things about the data that LDA can't model, and QDA will be better. On the other hand LDA will still do better if the data is modeled well by it, or n is on the smaller size. It will take a lot more observations to fit QDA equally well to LDA since QDA is quadratic and LDA is linear. 29 | ### Part d 30 | TRUE: QDA can modle a linear boundary, so it will fit whatever linearness is in the training data. It can do more though, so it can additionally fit some of the additional residual error that the linear model wouldn't be able to handle as well. Thus it is superior on fittin the training data. The testing data on the other hand is a different story. 31 | 32 | ********* 33 | ## Problem 6 34 | ### Part a 35 | $p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$ and we have that $\hat\beta_0=-6, \hat\beta_1=0.05, \hat\beta_2=1$. This gives us $p(gets\ an\ A)=$ `r exp(-6+40*0.05+1*3.5)/(1+exp(-6+40*0.05+1*3.5))`. 36 | ### Part b 37 | $log(\frac{p(X)}{1-p(X)})=\beta_0+\beta_1X+\beta_2X$ and we want $p(X)=0.5$. $log(\frac{0.5}{1-0.5})=0$. So we need to solve $\frac{6-3.5}{0.05}=hours\ required=$ `r (6-3.5)/0.05`. Indeed plugging 50 hours into the above equation comes out to `r exp(-6+50*0.05+1*3.5)/(1+exp(-6+50*0.05+1*3.5))`. Apparently 10 more hours of work would have given this student a coin toss chance at getting an A! 38 | 39 | ********* 40 | ## Problem 7 41 | $P(Yes|X) = \frac{P(X|Yes)P(Yes)}{P(X)}$ And we are given that $P(Yes)=0.8$, Also we can find $P(X)$ by marginalizing over the two possibilities, $Yes$ and $No$. $P(X)=P(X|Yes)P(Yes)+P(X|No)P(No)$. $P(X|Yes)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/2\sigma^2}$ $x=7, \mu_{Yes}=10, \mu_{No}=0, P(Yes)=.8, P(No)=.2, \sigma^2_{Yes}=\sigma^2_{No}=36$. This gives us $P(Yes|X)=$ `r pxyes <- 1/sqrt(2*pi*36)*exp(-((7-10)**2)/(2*36)); pxno <- 1/sqrt(2*pi*36)*exp(-((7-0)**2)/(2*36)); pyes=.8; pno=.2; (pxyes*pyes)/(pxyes*pyes+pxno*pno)`. 42 | 43 | ********** 44 | ## Problem 8 45 | KNN with $K=1$ by design will not missclassify anything in the training set. 
The real test for KNN, especially at the most permissive setting of $K=1$, comes when you try to classify new observations. Logistic regression, on the other hand, fits a parametric model to the training set, so the training error provides _some_ indication of how well the model fits the data, and a little insight into how it might perform on future data. 46 | 47 | Let's consider a dataset with 1000 examples, split in half: 500 examples to train on and 500 to test on. With KNN and $K=1$, an average error of 18% means 180 misclassified examples out of the 1000, and all of them must come from the 500 test cases, since $K=1$ misclassifies nothing in the training set. The test error for KNN is therefore 180/500, or 36%, compared to the 30% test error of logistic regression. KNN misclassified 180 test cases, while logistic regression misclassified fewer (150) in this example. 48 | 49 | 50 | ************* 51 | ## Problem 9 52 | ### Part a 53 | $\frac{p}{1-p}=odds$. For example, odds of 0.37 correspond to roughly 37 events for every 100 non-events, i.e. a probability of $\frac{37}{137}$, or about 27%. 54 | ### Part b 55 | The odds for an event with probability $p$ are $\frac{p}{1-p}$. So if the probability is 0.16, then the odds for this event are `r .16/(1-.16)`. 56 | 57 | 58 | ************* 59 | # Applied Section 60 | ************* 61 | 62 | ## Problem 10 63 | 64 | > This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010. 65 | 66 | ```{r} 67 | library(MASS) 68 | library(ISLR) 69 | ``` 70 | 71 | ### Part a 72 | > Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns? 73 | 74 | ```{r fig.width=11, fig.height=11} 75 | pairs(Weekly) 76 | cor(Weekly[,-ncol(Weekly)]) 77 | ``` 78 | 79 | There is high correlation between Volume and Year: volume increases roughly exponentially as a function of year. Certain years seem to have more or less variation than other years; notice the violin shape of the various Lag features plotted against Year. There does seem to be some autocorrelation in the variability of the Lag variables across years. Perhaps some years people are more skittish than others, and this takes a while to wear off: cyclical skittishness of some sort. There appears to be very little if any correlation between each Lag and the other lags. Direction appears slightly skewed by a few of the lags, perhaps Lag5 and Lag1. 80 | 81 | ### Part b 82 | > Use the full dataset to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones? 83 | 84 | ```{r} 85 | logit.fit = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, family=binomial, data=Weekly) 86 | contrasts(Weekly$Direction) 87 | summary(logit.fit) 88 | ``` 89 | 90 | The Lag2 variable, and the intercept, appear to be significant. 91 | 92 | ### Part c 93 | > Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
94 | 95 | ```{r} 96 | glm.probs=predict(logit.fit,Weekly,type="response") 97 | glm.pred=rep("Down",nrow(Weekly)) 98 | glm.pred[glm.probs > 0.50]="Up" 99 | table(glm.pred,Weekly$Direction) 100 | mean(glm.pred==Weekly$Direction) 101 | ``` 102 | 103 | The confusion matrix is telling us that the model does not fir the data particularly well. The "Up" direction is guessed most of the time. Most of the mistakes come from guessing that the market is going to go up when it really is going to go down. When Up is guessed, it is right `r 557/(430+557)` of the time, when down is guessed it is right `r 54/(54+48)` of the time. 104 | 105 | ### Part d 106 | > Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010). 107 | 108 | ```{r} 109 | train=Weekly$Year <= 2008 110 | Weekly.test=Weekly[!train,] 111 | logit.fit = glm(Direction ~ Lag2, family=binomial, data=Weekly, subset=train) 112 | contrasts(Weekly$Direction) 113 | summary(logit.fit) 114 | glm.probs=predict(logit.fit,Weekly.test,type="response") 115 | glm.pred=rep("Down",nrow(Weekly.test)) 116 | glm.pred[glm.probs > 0.50]="Up" 117 | table(glm.pred,Weekly.test$Direction) 118 | mean(glm.pred==Weekly.test$Direction) 119 | ``` 120 | 121 | 122 | 123 | ### Part e 124 | > Repeat (d) using LDA. 125 | 126 | ```{r} 127 | lda.fit = lda(Direction ~ Lag2, data=Weekly, subset=train) 128 | lda.fit 129 | lda.class=predict(lda.fit,Weekly.test)$class 130 | table(lda.class,Weekly.test$Direction) 131 | mean(lda.class==Weekly.test$Direction) 132 | ``` 133 | 134 | This one performed identicaly to logistic regression. 135 | 136 | ### Part f 137 | > Repeat (d) using QDA. 138 | 139 | ```{r} 140 | qda.fit = qda(Direction ~ Lag2, data=Weekly, subset=train) 141 | qda.fit 142 | qda.class=predict(qda.fit,Weekly.test)$class 143 | table(qda.class,Weekly.test$Direction) 144 | mean(qda.class==Weekly.test$Direction) 145 | ``` 146 | 147 | Interestingly it seems that QDA overfit this variable. LDA/logistic regression performs better on the test data. 148 | 149 | ### Part g 150 | > Repeat (d) using KNN with K = 1. 151 | 152 | ```{r} 153 | library(class) 154 | train.X=Weekly[train,"Lag2",drop=F] 155 | test.X=Weekly[!train,"Lag2",drop=F] 156 | train.Direction=Weekly[train,"Direction",drop=T] 157 | test.Direction=Weekly[!train,"Direction",drop=T] 158 | set.seed(1) 159 | knn.pred=knn(train.X,test.X,train.Direction,k=1) 160 | table(knn.pred,test.Direction) 161 | mean(knn.pred==test.Direction) 162 | ``` 163 | 164 | ### Part h 165 | > Which of these methods appears to provide the best results on this data? 166 | 167 | KNN is totally random looking. QDA appears to overfit the data slightly more than LDA and Logistic Regression, which perform equally well on the test data. 168 | 169 | ### Part i 170 | > Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier. 171 | 172 | KNN with `K=4` performs pretty well at 0.57. 
173 | ```{r} 174 | set.seed(1) 175 | knn.pred=knn(train.X,test.X,train.Direction,k=4) 176 | table(knn.pred,test.Direction) 177 | mean(knn.pred==test.Direction) 178 | ``` 179 | 180 | QDA appears to perform worse as we add in more variables, with Lag1, Lag2 and Volume it goes down to 0.46. With Lag1 and Lag2 it is a little better, at 0.55, but still Lag2 by itself is pretty good. 181 | ```{r} 182 | qda.fit = qda(Direction ~ Lag2, data=Weekly, subset=train) 183 | qda.fit 184 | qda.class=predict(qda.fit,Weekly.test)$class 185 | table(qda.class,Weekly.test$Direction) 186 | mean(qda.class==Weekly.test$Direction) 187 | ``` 188 | 189 | Logistic regression also seems to perform worse with more variables thrown in. Lag2 seems to be a pretty good fit. 190 | 191 | ```{r} 192 | train=Weekly$Year <= 2008 193 | Weekly.test=Weekly[!train,] 194 | logit.fit = glm(Direction ~ Lag1+Lag2+Volume, family=binomial, data=Weekly, subset=train) 195 | contrasts(Weekly$Direction) 196 | summary(logit.fit) 197 | glm.probs=predict(logit.fit,Weekly.test,type="response") 198 | glm.pred=rep("Down",nrow(Weekly.test)) 199 | glm.pred[glm.probs > 0.50]="Up" 200 | table(glm.pred,Weekly.test$Direction) 201 | mean(glm.pred==Weekly.test$Direction) 202 | ``` 203 | 204 | ************** 205 | ## Problem 11 206 | > In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set. 207 | ### Part a) 208 | > Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. 209 | 210 | ```{r} 211 | library(MASS) 212 | library(ISLR) 213 | Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg),1,0) 214 | ``` 215 | 216 | ### Part b) 217 | > Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this ques- tion. Describe your findings. 218 | 219 | ```{r fig.width=11,fig.height=11} 220 | pairs(Auto[,-9]) 221 | ``` 222 | Horsepower, displacement, weight, and acceleration look the most promissing. However these variables are all fairly correlated/anti-correlated. 223 | 224 | ### Part c) 225 | > Split the data into a training set and a test set. 226 | 227 | ```{r} 228 | set.seed(1) 229 | rands <- rnorm(nrow(Auto)) 230 | test <- rands > quantile(rands,0.75) 231 | train <- !test 232 | Auto.train <- Auto[train,] 233 | Auto.test <- Auto[test,] 234 | ``` 235 | 236 | 237 | ### Part d) 238 | > Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? 239 | 240 | ```{r} 241 | lda.fit = lda(mpg01 ~ horsepower+weight+acceleration, data=Auto.train) 242 | lda.fit 243 | lda.class=predict(lda.fit,Auto.test)$class 244 | table(lda.class,Auto.test$mpg01) 245 | mean(lda.class==Auto.test$mpg01) 246 | ``` 247 | LDA achieved 88.8% test accuracy. 248 | 249 | ### Part e) 250 | > Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? 
251 | 252 | ```{r} 253 | qda.fit = qda(mpg01 ~ horsepower+weight+acceleration, data=Auto.train) 254 | qda.fit 255 | qda.class=predict(qda.fit,Auto.test)$class 256 | table(qda.class,Auto.test$mpg01) 257 | mean(qda.class==Auto.test$mpg01) 258 | ``` 259 | QDA performed a little better, and achieved 92.9% accuracy on the test set. 260 | 261 | ### Part f) 262 | > Perform logistic regression on the training data in order to pre- dict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? 263 | 264 | ```{r} 265 | logit.fit = glm(mpg01 ~ horsepower+weight+acceleration, family=binomial, data=Auto.train) 266 | summary(logit.fit) 267 | glm.probs=predict(logit.fit,Auto.test,type="response") 268 | glm.pred=rep(0,nrow(Auto.test)) 269 | glm.pred[glm.probs > 0.50]=1 270 | table(glm.pred,Auto.test$mpg01) 271 | mean(glm.pred==Auto.test$mpg01) 272 | ``` 273 | Recompiling this a few times, I see that the accuracy and everything fluctuates a bit. Sometimes LDA and Logistic Regression do the same, sometimes Logistic Regression does a little worse, sometimes LDA a little better. QDA seems to fairly consistently perform the best. 274 | 275 | ### Part g) 276 | > Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set? 277 | 278 | ```{r} 279 | set.seed(1) 280 | train.Auto = Auto.train[,c("horsepower","weight","acceleration")] 281 | test.Auto = Auto.test[,c("horsepower","weight","acceleration")] 282 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=1) 283 | table(knn.pred,Auto.test$mpg01) 284 | mean(knn.pred==Auto.test$mpg01) 285 | 286 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=2) 287 | table(knn.pred,Auto.test$mpg01) 288 | mean(knn.pred==Auto.test$mpg01) 289 | 290 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=3) 291 | table(knn.pred,Auto.test$mpg01) 292 | mean(knn.pred==Auto.test$mpg01) 293 | 294 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=4) 295 | table(knn.pred,Auto.test$mpg01) 296 | mean(knn.pred==Auto.test$mpg01) 297 | 298 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=5) 299 | table(knn.pred,Auto.test$mpg01) 300 | mean(knn.pred==Auto.test$mpg01) 301 | 302 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=11) 303 | table(knn.pred,Auto.test$mpg01) 304 | mean(knn.pred==Auto.test$mpg01) 305 | ``` 306 | 307 | Interestingly at least in one case, KNN with K=3 outperforms all other models. K=4 and K=5 perform similarly well. 308 | 309 | ************ 310 | ## Problem 12 311 | ### Part a) 312 | > Write a function, Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute 23 and print out the results Hint: Recall that x^a raises x to the power a. Use the print() function to output the result. 313 | 314 | ```{r} 315 | Power <- function(){ 316 | print(2^3) 317 | } 318 | Power() 319 | ``` 320 | 321 | ### Part b) 322 | > Create a new function, Power2(), that allows you to pass any two numbers, x and a, and prints out the value of x^a. You can do this by beginning your function with the line You should be able to call your function by entering, for instance, 323 | `Power2(3,8)` on the command line. This should output the value of 38, namely, 6561. 
324 | 325 | ```{r} 326 | Power2 <- function(x,a){ 327 | print(x^a) 328 | } 329 | Power2(3,8) 330 | ``` 331 | 332 | ### Part c) 333 | > Using the Power2() function that you just wrote, compute 103, 817, and 1313. 334 | 335 | ```{r} 336 | Power2(10,3) 337 | Power2(8,17) 338 | Power2(131,3) 339 | ``` 340 | 341 | ### Part d) 342 | > Now create a new function, Power3(), that actually returns the result x^a as an R object, rather than simply printing it to the screen. That is, if you store the value x^a in an object called result within your function, then you can simply return() this result, using the following line: 343 | return(result) 344 | The line above should be the last line in your function, before the } symbol. 345 | 346 | ```{r} 347 | Power3 <- function(x,a){ 348 | return(x^a) 349 | } 350 | ``` 351 | 352 | ### Part e) 353 | > NowusingthePower3()function,createaplotoff(x)=x2.The x-axis should display a range of integers from 1 to 10, and the y-axis should display x2. Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the x-axis, the y-axis, or both on the log-scale. You can do this by using log="x", log="y", or log="xy" as arguments to the plot() function. 354 | 355 | ```{r fig.width=7,fig.height=5} 356 | plot(seq(1,10), 357 | sapply(seq(1,10), function(x) Power3(x,2)), 358 | log="y", 359 | main="Plotting x vs x**2", 360 | xlab="x", 361 | ylab="x**2") 362 | ``` 363 | 364 | ### Part f) 365 | > Create a function, PlotPower(), that allows you to create a plot of x against x^a for a fixed a and for a range of values of x. For instance, if you call 366 | > PlotPower(1:10,3) 367 | then a plot should be created with an x-axis taking on values 1,2,...,10, and a y-axis taking on values 13,23,...,103. 368 | 369 | ```{r fig.width=7,fig.height=5} 370 | PlotPower <- function(x,a){ 371 | plot(x, 372 | sapply(x, function(z) Power3(z,a)), 373 | log="y", 374 | main=sprintf("Plotting x vs x**%d",a), 375 | xlab="x", 376 | ylab=sprintf("x**%d",a)) 377 | } 378 | PlotPower(1:10,3) 379 | ``` 380 | 381 | ************* 382 | ## Problem 13 383 | > Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings. 
384 | 385 | ```{r fig.width=15, fig.height=15} 386 | Boston$crim01 <- as.numeric(Boston$crim > median(Boston$crim)) 387 | # as.numeric converts FALSE to 0 and TRUE to 1 388 | 389 | set.seed(1) 390 | rands <- rnorm(nrow(Boston)) 391 | test <- rands > quantile(rands,0.75) 392 | train <- !test 393 | Boston.train <- Boston[train,] 394 | Boston.test <- Boston[test,] 395 | 396 | Boston.train.fact <- Boston.train 397 | Boston.train.fact$crim01 <- factor(Boston.train.fact$crim01) 398 | library(GGally) 399 | ggpairs(Boston.train.fact, colour='crim01') 400 | #pairs(Boston.train) 401 | 402 | #We should explore "black" 403 | # "ptratio" "rad" "dis" "nox, and "zn" "lstat" "rm" 404 | 405 | ######################## 406 | # Logistic Regression 407 | glm.fit=glm(crim01~lstat+rm+zn+nox+dis+rad+ptratio+black+medv+age+chas+indus+tax, data=Boston.train) 408 | summary(glm.fit) 409 | #NOX,RAD,MEDV,AGE,TAX look good 410 | glm.probs=predict(glm.fit,Boston.test,type="response") 411 | glm.pred=rep(0,nrow(Boston.test)) 412 | glm.pred[glm.probs > 0.50]=1 413 | table(glm.pred,Boston.test$crim01) 414 | mean(glm.pred==Boston.test$crim01) 415 | 416 | glm.fit=glm(crim01~nox+rad+medv+age+tax, data=Boston.train) 417 | summary(glm.fit) 418 | #NOX,RAD,MEDV,AGE,TAX look good 419 | glm.probs=predict(glm.fit,Boston.test,type="response") 420 | glm.pred=rep(0,nrow(Boston.test)) 421 | glm.pred[glm.probs > 0.50]=1 422 | table(glm.pred,Boston.test$crim01) 423 | mean(glm.pred==Boston.test$crim01) 424 | 425 | #ptratio helps a bit, but the nox*dis helps quite a bit 426 | glm.fit=glm(crim01~nox*dis+medv:tax+rad+age, data=Boston.train) 427 | summary(glm.fit) 428 | #NOX,RAD,MEDV,AGE,TAX look good 429 | glm.probs=predict(glm.fit,Boston.test,type="response") 430 | glm.pred=rep(0,nrow(Boston.test)) 431 | glm.pred[glm.probs > 0.50]=1 432 | table(glm.pred,Boston.test$crim01) 433 | mean(glm.pred==Boston.test$crim01) 434 | 435 | #indus brings it back down a bit 436 | glm.fit=glm(crim01~nox+rad+medv+age+tax+ptratio+indus, data=Boston.train) 437 | summary(glm.fit) 438 | #NOX,RAD,MEDV,AGE,TAX look good 439 | glm.probs=predict(glm.fit,Boston.test,type="response") 440 | glm.pred=rep(0,nrow(Boston.test)) 441 | glm.pred[glm.probs > 0.50]=1 442 | table(glm.pred,Boston.test$crim01) 443 | mean(glm.pred==Boston.test$crim01) 444 | 445 | #indus by itslef doesn't help much 446 | glm.fit=glm(crim01~nox+rad+medv+age+tax+indus, data=Boston.train) 447 | summary(glm.fit) 448 | #NOX,RAD,MEDV,AGE,TAX look good 449 | glm.probs=predict(glm.fit,Boston.test,type="response") 450 | glm.pred=rep(0,nrow(Boston.test)) 451 | glm.pred[glm.probs > 0.50]=1 452 | table(glm.pred,Boston.test$crim01) 453 | mean(glm.pred==Boston.test$crim01) 454 | 455 | 456 | ######################## 457 | # LDA 458 | lda.fit=lda(crim01~nox+rad+medv+age+tax+ptratio, data=Boston.train) 459 | lda.fit 460 | #NOX,RAD,MEDV,AGE,TAX look good, ptratio seems to help also 461 | lda.pred=predict(lda.fit,Boston.test)$class 462 | table(lda.pred,Boston.test$crim01) 463 | mean(lda.pred==Boston.test$crim01) 464 | 465 | ######################## 466 | # KNN 467 | set.seed(1) 468 | train.Boston = Boston.train[,c("nox","rad","medv","age","tax","ptratio")] 469 | test.Boston = Boston.test[,c("nox","rad","medv","age","tax","ptratio")] 470 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=1) 471 | table(knn.pred,Boston.test$crim01) 472 | mean(knn.pred==Boston.test$crim01) 473 | 474 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=2) 475 | table(knn.pred,Boston.test$crim01) 476 | 
mean(knn.pred==Boston.test$crim01) 477 | 478 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=3) 479 | table(knn.pred,Boston.test$crim01) 480 | mean(knn.pred==Boston.test$crim01) 481 | 482 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=4) 483 | table(knn.pred,Boston.test$crim01) 484 | mean(knn.pred==Boston.test$crim01) 485 | 486 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=5) 487 | table(knn.pred,Boston.test$crim01) 488 | mean(knn.pred==Boston.test$crim01) 489 | 490 | 491 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=11) 492 | table(knn.pred,Boston.test$crim01) 493 | mean(knn.pred==Boston.test$crim01) 494 | 495 | 496 | 497 | ``` 498 | 499 | So the best I could get LDA/logistic regression was 89%. Using the features optimized with logistic regression I was able to get KNN to perform better, returning a model that got up to 92%. K=1 got to 93%, but K=3 was nearly as good, and the higher K might be more robust going forward. 500 | -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p1.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p1.JPG -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p2.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p3_part1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p3_part1.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p3_part2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p3_part2.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p4_part1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part1.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p4_part2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part2.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p4_part3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part3.jpg -------------------------------------------------------------------------------- 
/R_Exercises/Exercise3/ex_p4_part4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part4.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise4/Exercise4.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to statistical learning exercise 4 2 | ======================================================== 3 | 4 | # Conceptual Section 5 | ********* 6 | ## Problem 1. 7 | > Using basic statistical properties of the variance, as well as single-variable calculus, derive (5.6). In other words, prove that α given by (5.6) does indeed minimize $Var(\alpha X + (1 - \alpha)Y)$. 8 | 9 | Expanding the variance gives $Var(\alpha X+(1-\alpha)Y)=\alpha^2\sigma^2_X+(1-\alpha)^2\sigma^2_Y+2\alpha(1-\alpha)\sigma_{XY}$; setting its derivative with respect to $\alpha$ equal to zero and solving yields the minimizer given in (5.6): 10 | $\alpha=\frac{\sigma_Y^2-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}=\frac{Var(Y)-Cov(X,Y)}{Var(X)+Var(Y)-2Cov(X,Y)}$ 11 | 12 | ******** 13 | ## Problem 2. 14 | > We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations. 15 | 16 | ### Part a) 17 | > What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer. 18 | 19 | There are $n$ observations in the original sample. Since bootstrap sampling draws items with replacement, we are sampling from the same pool with the same probabilities every time. There are $n-1$ items among the $n$ that are not $j$, so there is an $\frac{n-1}{n}$ chance that the first item is not $j$. 20 | 21 | ### Part b) 22 | > What is the probability that the second bootstrap observation is not the jth observation from the original sample? 23 | 24 | Since we draw with replacement, it is the same as above. 25 | 26 | ### Part c) 27 | > Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$. 28 | 29 | Note that $\frac{n-1}{n}=1-\frac{1}{n}$. Also, with the bootstrap we make $n$ draws. That means there are $n$ chances to draw something other than $j$, all of which have to succeed for $j$ not to be in the bootstrap sample. This is a simple product of $n$ of these probabilities, which can be written as $(1-\frac{1}{n})^n$. 30 | 31 | ### Part d) 32 | > When n = 5, what is the probability that the jth observation is in the bootstrap sample? 33 | 34 | This is 1 minus the probability that the jth observation is _not_ in the bootstrap sample: `r 1-((1-1/5)^5)` 35 | 36 | ### Part e) 37 | > When n = 100, what is the probability that the jth observation is in the bootstrap sample? 38 | 39 | Calculated as above: `r 1-((1-1/100)^100)` 40 | 41 | ### Part f) 42 | > When n = 10,000, what is the probability that the jth observation is in the bootstrap sample? 43 | 44 | `r 1-((1-1/10000)^10000)` 45 | 46 | ### Part g) 47 | > Create a plot that displays, for each integer value of n from 1 to 100,000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe. 48 | 49 | ```{r fig.width=7, fig.height=5} 50 | x=seq(1,100000) 51 | y=sapply(x,function(n){1-((1-(1/n))^n)}) 52 | plot(x,y,xlab="n",ylab="Probability jth observation is in the bootstrap sample",log="x") 53 | ``` 54 | 55 | The probability seems to converge on something around 0.63 fairly quickly, by about n=100, and then stay there!
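That plateau is consistent with the limit $\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^n=e^{-1}$, so the inclusion probability tends to $1-e^{-1}\approx 0.632$. The short chunk below (an added check, not part of the original exercise) compares the limiting value with the exact finite-$n$ probability at a large $n$:

```{r}
# limiting inclusion probability 1 - 1/e vs. the exact value at n = 100,000
c(limit = 1 - exp(-1), n_1e5 = 1 - (1 - 1/1e5)^1e5)
```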
56 | 57 | That is very odd that there is always a 63% chance that any particular thing will be in the bootstrap sample even with large datasets. 58 | 59 | ### Part h) 60 | > We will now investigate numerically the probability that a boot- strap sample of size n = 100 contains the jth observation. Here j = 4. We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample. 61 | 62 | ```{r} 63 | store=rep(NA, 10000) 64 | for(i in 1:10000){ 65 | store[i]=sum(sample(1:100, rep=TRUE)==4)>0 66 | } 67 | mean(store) 68 | ``` 69 | 70 | 71 | > Comment on the results obtained. 72 | 73 | This made a list of length 10,000, and each time sampled 0-100 with replacement and checked to see if 4 is in the list. Interestingly 63% of the time, the list contains the number 4. 74 | 75 | 76 | ************ 77 | ## Problem 3. 78 | > We now review k-fold cross-validation. 79 | ### Part a) 80 | > Explain how k-fold cross-validation is implemented. 81 | 82 | You take your dataset, and do a train/test split where you train on $\frac{k-1}{k}$ and test on the remaining $\frac{1}{k}$ of the dataset. You re-do this procedure $k$ times and then can explore the variability in the obtained results on the various test sets. 83 | 84 | ### Part b) 85 | > What are the advantages and disadvantages of k-fold cross validation relative to: 86 | i. The validation set approach? 87 | ii. LOOCV? 88 | 89 | k fold cv allows you to use more of your data in training than the validation set approach. Also you get to see how well the model performs on more of the dataset, so you get to see the variability in test errors on different subsets of data. 90 | 91 | LOOCV is a special instance of k fold cv where k=n. Lower values of k are faster to compute since you do not need to do n different fits. There of course is the special case though where you can do the computational shoortcut with LOOCV on least-squares fit models given in equation 5.2. More generally though smaller k (typically 5 or 10) has much better performance than k=n. 92 | 93 | k-fold cv has another benefit though described in section 5.1.4. LOOCV has higher variance than k fold cv with $k Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our prediction. 98 | 99 | One way to do this would be with the bootstrap. We can train on a bunch of different random samplings of the original data, and see how much the estimates change. 100 | 101 | ************ 102 | # Applied 103 | ************* 104 | ## Problem 5. 105 | > In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis. 106 | 107 | ### Part a) 108 | > Fit a multiple logistic regression model that uses income and balance to predict the probability of default, using only the observations. 109 | 110 | ```{r} 111 | library(ISLR) 112 | set.seed(1) 113 | glm.fit=glm(default~income+balance,data=Default, family="binomial") 114 | ``` 115 | 116 | ### Part b) 117 | > Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: 118 | i. Split the sample set into a training set and a validation set. 119 | ii. 
Fit a multiple logistic regression model using only the train- 120 | ing observations. 121 | iii. Obtain a prediction of default status for each individual in 122 | the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability equals 0.5. 123 | iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. 124 | 125 | ```{r} 126 | set.seed(1) 127 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 128 | Default.train=Default[train,] 129 | Default.test=Default[-train,] 130 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 131 | glm.probs=predict(glm.fit,Default.test,type="response") 132 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 133 | mean(glm.pred!=Default.test$default) 134 | ``` 135 | 136 | ### Part c) 137 | > Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Com- ment on the results obtained. 138 | 139 | ```{r} 140 | set.seed(15) 141 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 142 | Default.train=Default[train,] 143 | Default.test=Default[-train,] 144 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 145 | glm.probs=predict(glm.fit,Default.test,type="response") 146 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 147 | mean(glm.pred!=Default.test$default) 148 | 149 | set.seed(5) 150 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 151 | Default.train=Default[train,] 152 | Default.test=Default[-train,] 153 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 154 | glm.probs=predict(glm.fit,Default.test,type="response") 155 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 156 | mean(glm.pred!=Default.test$default) 157 | 158 | set.seed(31) 159 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 160 | Default.train=Default[train,] 161 | Default.test=Default[-train,] 162 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 163 | glm.probs=predict(glm.fit,Default.test,type="response") 164 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 165 | mean(glm.pred!=Default.test$default) 166 | ``` 167 | 168 | ### Part d) 169 | > Now consider a logistic regression model that predicts the prob- ability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the val- idation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate. 
170 | 171 | ```{r} 172 | set.seed(15) 173 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 174 | Default.train=Default[train,] 175 | Default.test=Default[-train,] 176 | glm.fit=glm(default~income+balance+student,data=Default.train, family="binomial") 177 | glm.probs=predict(glm.fit,Default.test,type="response") 178 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 179 | mean(glm.pred!=Default.test$default) 180 | 181 | set.seed(5) 182 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 183 | Default.train=Default[train,] 184 | Default.test=Default[-train,] 185 | glm.fit=glm(default~income+balance+student,data=Default.train, family="binomial") 186 | glm.probs=predict(glm.fit,Default.test,type="response") 187 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 188 | mean(glm.pred!=Default.test$default) 189 | 190 | set.seed(31) 191 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 192 | Default.train=Default[train,] 193 | Default.test=Default[-train,] 194 | glm.fit=glm(default~income+balance+student,data=Default.train, family="binomial") 195 | glm.probs=predict(glm.fit,Default.test,type="response") 196 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 197 | mean(glm.pred!=Default.test$default) 198 | ``` 199 | 200 | It does not look like including this variable helps the model much. The three tests I tried with both models produce similar ranges of test error. 201 | 202 | ********** 203 | ## Problem 6. 204 | > We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression co- efficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis. 205 | 206 | 207 | 208 | ### Part a) 209 | > Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors. 210 | 211 | ```{r} 212 | 213 | 214 | set.seed(1) 215 | glm.fit=glm(default~income+balance,data=Default, family="binomial") 216 | summary(glm.fit)$coef[,1] 217 | ``` 218 | 219 | ### Part b) 220 | > Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model. 221 | 222 | ```{r} 223 | boot.fn=function(data,index){ 224 | coefficients(glm(default~income+balance, data=data, subset=index, family="binomial")) 225 | } 226 | 227 | boot.fn(Default,1:nrow(Default)) 228 | ``` 229 | 230 | ### Part c) 231 | > Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance. 232 | 233 | ```{r} 234 | library(boot) 235 | #boot(Default,boot.fn,1000) 236 | ``` 237 | 238 | ``` 239 | ## 240 | ## ORDINARY NONPARAMETRIC BOOTSTRAP 241 | ## 242 | ## 243 | ## Call: 244 | ## boot(data = Default, statistic = boot.fn, R = 1000) 245 | ## 246 | ## 247 | ## Bootstrap Statistics : 248 | ## original bias std. 
error 249 | ## t1* -1.154e+01 -8.008e-03 4.239e-01 250 | ## t2* 2.081e-05 5.871e-08 4.583e-06 251 | ## t3* 5.647e-03 2.300e-06 2.268e-04 252 | ``` 253 | 254 | ### Part d) 255 | > Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function. 256 | 257 | 258 | These bootstrap estimates actually match up with the glm summary estimates. That is a really good sign. 259 | 260 | ********** 261 | ## Problem 7. 262 | > In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alterna- tively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this ap- proach in order to compute the LOOCV error for a simple logistic regression model on the Default data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4). 263 | 264 | 265 | 266 | ### Part a) 267 | > Fit a logistic regression model that predicts the probability of default using balance. 268 | 269 | ```{r} 270 | glm.fit=glm(default~balance,data=Default,family="binomial") 271 | ``` 272 | 273 | 274 | ### Part b) 275 | > Fit a logistic regression model that predicts the probability of default using balance using all but the first observation. 276 | 277 | ```{r} 278 | glm.fit2=update(glm.fit,subset=-1) 279 | ``` 280 | 281 | ### Part c) 282 | > Use the model from (b) to predict the default status of the first observation. You can do this by predicting that the first observation will default if P (default|balance) > 0.5. Was this observation correctly classified? 283 | 284 | ```{r} 285 | Default.test=Default[1,,drop=F] 286 | glm.probs=predict(glm.fit2,Default.test,type="response") 287 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 288 | mean(glm.pred==Default.test$default) 289 | ``` 290 | This observation was correctly calssified. 291 | 292 | ### Part d) 293 | > Write a for loop from i=1 to i=n, where n is the number of observations in the data set, that performs each of the following steps: 294 | i. Fit a logistic regression model using all but the ith observation to predict probability of default using balance. 295 | ii. Compute the posterior probability of default for the ith observation. 296 | iii. Use the posterior probability of default for the ith observation in order to predict whether or not the observation defaults. 297 | iv. Determine whether or not an error was made in predicting the default status for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0. 298 | 299 | ```{r} 300 | library(multicore) 301 | # predictions=unlist(mclapply(seq(nrow(Default)), function(i){ 302 | # glm.fit2=update(glm.fit,subset=-i) 303 | # Default.test=Default[i,,drop=F] 304 | # glm.probs=predict(glm.fit2,Default.test,type="response") 305 | # glm.pred=ifelse(glm.probs>.5,"Yes","No") 306 | # mean(glm.pred==Default.test$default) 307 | # },mc.cores=8)) 308 | ``` 309 | 310 | 311 | ### Part e) 312 | > Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results. 313 | 314 | ``` 315 | # 1 - mean(predictions) 316 | ## [1] 0.0275 317 | ``` 318 | 319 | 320 | 321 | *********** 322 | 323 | ## Problem 8. 324 | > We will now perform cross-validation on a simulated data set. 
325 | 326 | ### Part a) Generate a simulated data set as follows: 327 | 328 | ```{r} 329 | set.seed(1) 330 | y=rnorm(100) 331 | x=rnorm(100) 332 | y=x-2*x^2+rnorm(100) 333 | ``` 334 | 335 | > In this data set, what is n and what is p? Write out the model used to generate the data in equation form. 336 | 337 | In this dataset, n is 100 and p is 2. 338 | 339 | ### Part b) 340 | > Create a scatterplot of X against Y . Comment on what you find. 341 | 342 | ```{r} 343 | plot(x,y) 344 | ``` 345 | 346 | x is quadratic in terms of y. 347 | 348 | ### Part c) 349 | > Set a random seed, and then compute the LOOCV errors that 350 | result from fitting the following four models using least squares: 351 | i. Y = β0 + β1X + ǫ 352 | ii. Y = β0 + β1X + β2X2 + ǫ 353 | iii. Y = β0 +β1X +β2X2 +β3X3 +ǫ 354 | iv. Y = β0 +β1X +β2X2 +β3X3 +β4X4 +ǫ. 355 | 356 | ```{r} 357 | dat=data.frame(x=x,y=y) 358 | fit.errors = unlist(mclapply(seq(4),function(i){ 359 | glm.fit.i=glm(y~poly(x,i),data=dat) 360 | cv.err=cv.glm(dat,glm.fit.i) 361 | cv.err$delta[1] 362 | })) 363 | names(fit.errors)<-sprintf("poly_%d",seq(4)) 364 | fit.errors 365 | ``` 366 | 367 | ### Part d) 368 | > Repeat c) using another random seed, and report your results. Are your results the same as what you got in c)? Why? 369 | 370 | ```{r} 371 | set.seed(131) 372 | fit.errors = unlist(mclapply(seq(4),function(i){ 373 | glm.fit.i=glm(y~poly(x,i),data=dat) 374 | cv.err=cv.glm(dat,glm.fit.i) 375 | cv.err$delta[1] 376 | })) 377 | names(fit.errors)<-sprintf("poly_%d",seq(4)) 378 | fit.errors 379 | ``` 380 | 381 | The results are the same because LOOCV does not have a randomness factor involved, it is the same with any iteration given the same undelrying data and model. 382 | 383 | 384 | ### Part e) 385 | > Which of the models in c) had the smallest LOOCV error? Is this what you expected? Explain your answer. 386 | 387 | The `poly(x,2)` model had the smallest LOOCV error which is encouraging becuase this is what was used to generate the data! 388 | 389 | ### Part f) 390 | > Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results? 391 | 392 | ```{r} 393 | glm.fit.i=glm(y~poly(x,4),data=dat) 394 | summary(glm.fit.i) 395 | ``` 396 | 397 | Yes when we do a poly(x,4) we see that the x and x**2 terms are the two that end up statistically significant. 398 | 399 | ************** 400 | ## Probelm 9. 401 | > We will now consider the Boston housing data set, from the MASS library. 402 | 403 | ### Part a) 404 | > Based on this data set, provide an estimate for the population mean of medv. Call this estimate μˆ. 405 | 406 | ```{r} 407 | library(MASS) 408 | mu=mean(Boston$medv) 409 | mu 410 | ``` 411 | 412 | ### Part b) 413 | > Provide an estimate of the standard error of μˆ. Interpret this result. 414 | Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations. 415 | 416 | ```{r} 417 | sd(Boston$medv)/sqrt(length(Boston$medv)) 418 | ``` 419 | 420 | ### Part c) 421 | > Now estimate the standard error of μˆ using the bootstrap. How does this compare to your answer from (b)? 
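Before using `boot()` in the next chunk, a quick hand-rolled version of the same idea — resample with replacement, recompute the mean, take the standard deviation — can serve as a sanity check. A small sketch (the Boston data from MASS is loaded in part (a)):

```{r}
set.seed(1)
# 1000 bootstrap resamples of medv, each the same size as the original sample
boot.means <- replicate(1000, mean(sample(Boston$medv, replace = TRUE)))
sd(boot.means)   # should land close to the formula-based SE from (b)
```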
422 | 423 | ```{r} 424 | boot.fn<-function(data,index){ 425 | mean(data[index]) 426 | } 427 | boot(Boston$medv,boot.fn,1000,parallel ="multicore") 428 | ``` 429 | 430 | ### Part d) 431 | > Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). 432 | Hint: You can approximate a 95% confidence interval using the formula [μˆ − 2SE(μˆ), μˆ + 2SE(μˆ)]. 433 | 434 | ```{r} 435 | t.test(Boston$medv) 436 | mu=22.53 437 | se=0.4016 438 | mu-2*se 439 | mu+2*se 440 | ``` 441 | 442 | They are very similar, the bootstrap estimate is slightly tighter than the one we just calculated with the mean and std error from bootstrap. (23.33 vs 23.34) the lower bound is the same. They are probably basically the same. 443 | 444 | ### Part e) 445 | > Based on this data set, provide an estimate, $\hat\mu_{med}$, for the median value of medv in the population. 446 | 447 | `r median(Boston$medv)` 448 | 449 | 450 | ### Part f) 451 | > Wenowwouldliketoestimatethestandarderrorofμˆmed.Unfor- tunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings. 452 | 453 | ```{r} 454 | boot.fn<-function(data,index){ 455 | median(data[index]) 456 | } 457 | boot(Boston$medv,boot.fn,1000,parallel ="multicore") 458 | ``` 459 | 460 | Interestingly the std error of the median is lower than that of the mean! Cool. 461 | 462 | ### Part g) 463 | > Based on this data set, provide an estimate for the tenth per- centile of medv in Boston suburbs. Call this quantity μˆ0.1. (You can use the quantile() function.) 464 | 465 | `r quantile(Boston$medv,p=0.1)` 466 | 467 | 468 | ### Part h) 469 | > Use the bootstrap to estimate the standard error of μˆ0.1. Com- ment on your findings. 470 | 471 | ```{r} 472 | boot.fn<-function(data,index){ 473 | quantile(data[index],p=0.1) 474 | } 475 | boot(Boston$medv,boot.fn,1000,parallel ="multicore") 476 | ``` 477 | 478 | The lower 10% of the data has a higher std error than the mean and the median, that is interesting. Apparently these outliers must be more sensitive to which subset is chosen than the mean and median. 479 | 480 | 481 | -------------------------------------------------------------------------------- /R_Exercises/Exercise5/Exercise5.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Exercises 5: 2 | ======================================================== 3 | 4 | # Conceptual Section 5 | ******************* 6 | ## Problem 1 7 | 8 | > We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers: 9 | 10 | ### Part a) 11 | > Which of the three models with k predictors has the smallest training RSS? 12 | 13 | Best subset selection will have the best training RSS. Although it is possible that either of the other two will chose comparably good models, they will not chose better models on the training data. 
Best subset selection exhaustively searches all possible models with k predictors chosing the smallest training RSS while the other two methods heuristically explore a subset of that space, either by starting with teh best k-1 model and chosing the best k given a fixed k-1 (forward) or in reverse starting at the best k+1 and chosing the best single feature to remove resulting in the best model with that constraint. 14 | 15 | ### Part b) 16 | > Which of the three models with k predictors has the smallest test RSS? 17 | 18 | It is possible to overfit with any of these methods. There are probably cases where the best model trained on the training set (the one exhaustively chosen by best subset) happens to not perform as well on the test set as the best model chosen by forward or backward selection. 19 | 20 | ### Part c) 21 | > True or False: 22 | 23 | > i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. 24 | 25 | TRUE: the k+1 variable model contains all k features chosen in the k variable model, plus the best aditional feature. 26 | 27 | > ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1) variable model identified by backward stepwise selection. 28 | 29 | TRUE: the k variable model contains all but one feature in the k+1 best model, minus the single feature resulting in the smallest gain in RSS. 30 | 31 | > iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1) variable model identified by forward stepwise selection. 32 | 33 | FALSE: it is possible for disjoint sets to be identified by forward and backward selection. 34 | 35 | > iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. 36 | 37 | FALSE: it is possible for disjoint sets to be identified by foward and backward selection. 38 | 39 | > v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection. 40 | 41 | FALSE: again these two methods are not guarenteed to chose the same k or k+1 features, they may be disjoint sets. 42 | 43 | ******************* 44 | ## Problem 2 45 | 46 | > For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. 47 | 48 | 49 | ### Part a) 50 | > The lasso, relative to least squares, is: 51 | i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. 52 | ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. 53 | iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. 54 | iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. 55 | 56 | iii is the correct answer. The lasso is a more restrictive model, and thus it has the possibility of reducing overfitting and variance in predictions. As long as it does not result in too high of a bias due to its added constraints, it will outperform least squares which might be fitting spurious parameters. 
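This trade-off is easy to see on a coefficient path: as the lasso penalty grows, coefficients are shrunk and eventually zeroed, trading a little bias for a drop in variance. A small illustrative sketch on simulated data (not part of the original answer; glmnet assumed available):

```{r}
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(100)   # only two predictors truly matter
lasso.path <- glmnet(x, y, alpha = 1)          # alpha = 1 gives the lasso
plot(lasso.path, xvar = "lambda", label = TRUE)  # coefficients shrink toward 0 as lambda grows
```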
57 | 58 | 59 | ### Part b) 60 | > Repeat (a) for ridge regression relative to least squares. 61 | 62 | Again iii is the correct answer. Ridge regression is not as restrictive as the lasso, but it is still less flexible than least squares, and the same reasoning as outlined above applies. 63 | 64 | ### Part c) 65 | > Repeat (a) for non-linear methods relative to least squares. 66 | 67 | ii is the correct answer. Non-linear methods are generally more flexible than least squares. They perform better when the linearity assumption is strongly broken. These methods will have more variance due to their more sensitive fits to the underlying data, and to perform well will need to have a substantial drop in bias. 68 | 69 | 70 | ****************** 71 | 72 | ## Problem 3 73 | > Suppose we estimate the regression coefficients in a linear regression model by minimizing 74 | > $\sum_{i=1}^n ( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2 \text{ subject to } \sum_{j=1}^{p}|\beta_j|\leq s$ 75 | 76 | > for a particular value of s. For parts (a) through (e), indicate which 77 | of i. through v. is correct. Justify your answer. 78 | 79 | ### Part a) 80 | > As we increase s from 0, the training RSS will: 81 | i. Increase initially, and then eventually start decreasing in an 82 | inverted U shape. 83 | ii. Decrease initially, and then eventually start increasing in a U shape. 84 | iii. Steadily increase. 85 | iv. Steadily decrease. 86 | v. Remain constant. 87 | 88 | The training RSS will steadily decrease (iv) as s increases. Increasing s relaxes the $\ell_1$ budget on the $\beta$ coefficients, so the model becomes more flexible and can fit the training data at least as well as before; for large enough s we recover the ordinary least squares fit. 89 | 90 | 91 | ### Part b) 92 | > Repeat (a) for test RSS. 93 | 94 | ii. When s is very small the model is heavily constrained and underfits, so the test RSS starts high; as s grows the fit improves, but once the budget is loose enough the model starts fitting noise and the test RSS increases again, making a U shape. 95 | 96 | ### Part c) 97 | > Repeat (a) for variance. 98 | 99 | The variance will steadily increase as the constraint is relaxed and the model becomes more flexible. 100 | 101 | ### Part d) 102 | > Repeat (a) for (squared) bias. 103 | 104 | The squared bias will steadily decrease as the model becomes more flexible (s increased). 105 | 106 | ### Part e) 107 | > Repeat (a) for Bayes error rate. 108 | 109 | The Bayes (irreducible) error does not depend on the model we are fitting, so it remains constant (v). 110 | 111 | 112 | 113 | ****************** 114 | 115 | ## Problem 4 116 | > Suppose we estimate the regression coefficients in a linear regression model by minimizing 117 | > $\sum_{i=1}^n ( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2 + \lambda\sum_{j=1}^{p}\beta_j^2$ 118 | 119 | > for a particular value of $\lambda$. For parts (a) through (e), indicate which 120 | of i. through v. is correct. Justify your answer. 121 | 122 | ### Part a) 123 | > As we increase $\lambda$ from 0, the training RSS will: 124 | i. Increase initially, and then eventually start decreasing in an 125 | inverted U shape. 126 | ii. Decrease initially, and then eventually start increasing in a U shape. 127 | iii. Steadily increase. 128 | iv. Steadily decrease. 129 | v. Remain constant. 130 | 131 | The training RSS will steadily increase (iii) as $\lambda$ increases. Increasing $\lambda$ places a heavier penalty on the model, shrinking the $\beta$ coefficients toward 0 (this is a ridge or $\ell_2$ penalty) and pulling the fit away from the least squares solution. 132 | 133 | 134 | ### Part b) 135 | > Repeat (a) for test RSS. 136 | 137 | ii.
Initially as spurious coefficients are forced to 0, the test RSS will improve as the model has less overfitting. However eventually necessary coefficients will be removed from the model, and the test RSS will again increase, making a U shape. 138 | 139 | ### Part c) 140 | > Repeat (a) for variance. 141 | 142 | The variance will decrease as more penalty is placed on the model. 143 | 144 | ### Part d) 145 | > Repeat (a) for (squared) bias. 146 | 147 | The squared bias will increase as the model becomes less flexible ($\lambda$ increased) 148 | 149 | ### Part e) 150 | > Repeat (a) for Bayes error rate. 151 | 152 | This is an optimal theoretical perfectly predicting construct not dependent on the model we are fitting to the data. 153 | 154 | ********************* 155 | 156 | ## Problem 6 157 | > We will now explore (6.12) and (6.13) further. 158 | 159 | 160 | 6.12: $\sum_{j=1}^{p}(y_j-\beta_j)^2 + \alpha\sum_{j=1}^p\beta_j^2$ 161 | 162 | 6.13: $\sum_{j=1}^{p}(y_j-\beta_j)^2 + \alpha\sum_{j=1}|\beta_j|$ 163 | 164 | 6.14: $\hat\beta_j^R=\frac{y_j}{1+\alpha}$ 165 | 166 | 6.15: $\hat\beta_j^L=\begin{cases}y_j-\alpha/2 & \text{if } y_j > \alpha/2;\\ y_j + \alpha/2 & \text{if } y_j < -\alpha/2; \\ 0 & \text{if } |y_j| \leq \alpha/2. \end{cases}$ 167 | 168 | 169 | 170 | ### Part a) 171 | > Consider (6.12) with $p = 1$. For some choice of $y_1$, $x_1$, and $\alpha > 0$, plot (6.12) as a function of $\beta_1$. Your plot should confirm that (6.12) is solved by (6.14). 172 | 173 | ```{r fig.height=11,fig.width=11} 174 | par(mfrow=c(2,2)) 175 | for(A in c(0,1,5,10)){ 176 | y1=5 177 | x1=1 # special case where x1 is 1 178 | b1=seq(-1,6,by=0.05) 179 | yhat=((y1-b1)^2) + (A*b1^2) 180 | plot(b1,yhat) 181 | points(b1[which.min(yhat)],yhat[which.min(yhat)], col="green",cex=4,pch=20) 182 | abline(v=y1/(1+A),col="red",lwd=3) 183 | } 184 | ``` 185 | ### Part b) 186 | > Consider (6.13) with $p = 1$. For some choice of $y_1$, $x_1$, and $\alpha > 0$, plot (6.13) as a function of $\beta_1$. Your plot should confirm that (6.13) is solved by (6.15). 
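Before plotting, it helps to see where (6.15) comes from. A sketch of the argument for the $y_1 > \alpha/2$ case:

$$\frac{d}{d\beta_1}\left[(y_1-\beta_1)^2 + \alpha\beta_1\right] = -2(y_1-\beta_1) + \alpha = 0 \implies \hat\beta_1 = y_1 - \alpha/2,$$

which is consistent with having assumed $\beta_1 > 0$ (so $|\beta_1| = \beta_1$) exactly when $y_1 > \alpha/2$. The $y_1 < -\alpha/2$ case is symmetric, and in between the penalty dominates and the minimum sits at $\hat\beta_1 = 0$.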
187 | 188 | 189 | ```{r fig.height=11,fig.width=11} 190 | opt.y.lasso=function(y,a){ 191 | if(y>(a/2)){ 192 | return(y-(a/2)) 193 | } 194 | 195 | if(y < (-a/2)){ 196 | return(y+(a/2)) 197 | } 198 | if(abs(y) <= (a/2)){ 199 | return(0) 200 | } 201 | } 202 | 203 | par(mfrow=c(2,2)) 204 | for(A in c(0,1,5,10)){ 205 | y1=5 206 | x1=1 # special case where x1 is 1 207 | b1=seq(-1,6,by=0.05) 208 | yhat=((y1-b1)^2) + A*abs(b1) 209 | plot(b1,yhat) 210 | points(b1[which.min(yhat)],yhat[which.min(yhat)], col="green",cex=4,pch=20) 211 | abline(v=opt.y.lasso(y1,A),col="red",lwd=3) 212 | } 213 | ``` 214 | 215 | 216 | 217 | 218 | ************************* 219 | 220 | # Applied section 221 | ## Problem 8 222 | > create a simluated dataset 223 | 224 | ### Parts a-c) 225 | ```{r} 226 | set.seed(1) 227 | X=rnorm(100,mean=0,sd=1) 228 | e=rnorm(100,mean=0,sd=0.5) 229 | B0=150.3 230 | B1=50.5 231 | B2=-10.1 232 | B3=-34.2 233 | 234 | Y=B0+ B1*X + B2*(X^2) + B3*(X^3) + e 235 | 236 | dat=data.frame(Y=Y,X=X) 237 | library(leaps) 238 | library(ISLR) 239 | 240 | regfit.full=regsubsets(Y~poly(X,10,raw=T),data=dat,nvmax=10) 241 | reg.summary=summary(regfit.full) 242 | ``` 243 | 244 | ```{r fig.height=11,fig.width=11} 245 | par(mfrow=c(2,2)) 246 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 247 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 248 | points(which.max(reg.summary$adjr2), 249 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 250 | col="red",cex=2,pch=20) 251 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 252 | type="l") 253 | points(which.min(reg.summary$cp), 254 | reg.summary$cp[which.min(reg.summary$cp)], 255 | col="red",cex=2,pch=20) 256 | plot(reg.summary$bic,xlab="Number of Variables", 257 | ylab="BIC", type="l") 258 | points(which.min(reg.summary$bic), 259 | reg.summary$bic[which.min(reg.summary$bic)], 260 | col="red",cex=2,pch=20) 261 | 262 | # BIC 263 | coef(regfit.full,3) 264 | # Cp/adjusted R2 265 | coef(regfit.full,4) 266 | ``` 267 | 268 | Each method chose at least a superset of the correct X polynomials. BIC chose the correct ones (X,X^2,X^3). Cp and adjusted R^2 added in X^9 269 | 270 | 271 | ### Part d 272 | 273 | ```{r fig.height=11,fig.width=11} 274 | regfit.full=regsubsets(Y~poly(X,10,raw=T),data=dat,nvmax=10,method="forward") 275 | reg.summary=summary(regfit.full) 276 | 277 | par(mfrow=c(2,2)) 278 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 279 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 280 | points(which.max(reg.summary$adjr2), 281 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 282 | col="red",cex=2,pch=20) 283 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 284 | type="l") 285 | points(which.min(reg.summary$cp), 286 | reg.summary$cp[which.min(reg.summary$cp)], 287 | col="red",cex=2,pch=20) 288 | plot(reg.summary$bic,xlab="Number of Variables", 289 | ylab="BIC", type="l") 290 | points(which.min(reg.summary$bic), 291 | reg.summary$bic[which.min(reg.summary$bic)], 292 | col="red",cex=2,pch=20) 293 | 294 | # BIC 295 | coef(regfit.full,3) 296 | # Cp 297 | coef(regfit.full,4) 298 | # Adjusted R2 299 | coef(regfit.full,4) 300 | ``` 301 | 302 | The same number of parameters were chosen by each method, however now the spurious parameters changed with Cp and Adjusted R2. Now Cp and adjusted R^2 use X^5 as the extra. 
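Rather than reading the sizes off the plots, the model size chosen by each criterion can be pulled straight from the `reg.summary` object defined just above; a small sketch:

```{r}
# model sizes picked by each criterion, and the coefficients of the BIC choice
which.min(reg.summary$bic)
which.min(reg.summary$cp)
which.max(reg.summary$adjr2)
coef(regfit.full, which.min(reg.summary$bic))
```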
303 | 304 | 305 | ### Part e 306 | > same but with lasso 307 | 308 | ```{r fig.height=5,fig.width=7} 309 | library(glmnet) 310 | dat.mat=model.matrix(Y~poly(X,10,raw=T),data=dat)[,-1] 311 | 312 | cv.out=cv.glmnet(dat.mat,Y,alpha=1) 313 | plot(cv.out) 314 | bestlam=cv.out$lambda.min 315 | bestlam 316 | 317 | lasso.mod=glmnet(dat.mat,Y,alpha=1,lambda=bestlam) 318 | coef(lasso.mod) 319 | 320 | ``` 321 | 322 | Using the optimal lambda chosen by CV, lasso regression choses up to a 3nd degree polynomial (X,X^2,X^3), but includes X^5 as an extra term. 323 | 324 | ### Part f 325 | > redo with different model and repeat c and e. 326 | 327 | ```{r} 328 | set.seed(1) 329 | X=rnorm(100,mean=0,sd=1) 330 | e=rnorm(100,mean=0,sd=0.5) 331 | B0=150.3 332 | B7=33.3 333 | 334 | 335 | Y=B0+ B7*(X^7) + e 336 | 337 | dat=data.frame(Y=Y,X=X) 338 | 339 | ``` 340 | 341 | ```{r fig.height=11,fig.width=11} 342 | regfit.full=regsubsets(Y~poly(X,10,raw=T),data=dat,nvmax=10) 343 | reg.summary=summary(regfit.full) 344 | 345 | par(mfrow=c(2,2)) 346 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 347 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 348 | points(which.max(reg.summary$adjr2), 349 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 350 | col="red",cex=2,pch=20) 351 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 352 | type="l") 353 | points(which.min(reg.summary$cp), 354 | reg.summary$cp[which.min(reg.summary$cp)], 355 | col="red",cex=2,pch=20) 356 | plot(reg.summary$bic,xlab="Number of Variables", 357 | ylab="BIC", type="l") 358 | points(which.min(reg.summary$bic), 359 | reg.summary$bic[which.min(reg.summary$bic)], 360 | col="red",cex=2,pch=20) 361 | 362 | # adj R^2 363 | coef(regfit.full,4) 364 | 365 | # Cp 366 | coef(regfit.full,2) 367 | 368 | # BIC 369 | coef(regfit.full,1) 370 | ``` 371 | 372 | ```{r fig.height=5,fig.width=7} 373 | library(glmnet) 374 | dat.mat=model.matrix(Y~poly(X,10,raw=T),data=dat)[,-1] 375 | 376 | cv.out=cv.glmnet(dat.mat,Y,alpha=1) 377 | plot(cv.out) 378 | bestlam=cv.out$lambda.min 379 | bestlam 380 | 381 | lasso.mod=glmnet(dat.mat,Y,alpha=1,lambda=bestlam) 382 | coef(lasso.mod) 383 | 384 | ``` 385 | 386 | This model was difficult for the methods to deal with. BIC chose the correct model though, with the seventh degree polynomial being the only one included! Cp added in X^2, and adjusted R^2 added in X^1, X^2, and X^3. 387 | 388 | The lasso on the ohter hand chose two features as well, like Cp. It chose X^7 along with X^9 as the spurious feature though. 389 | 390 | **NOTE: I redid this section after realizing in the next chapter that the poly function returns a linear combination of terms, which has the interesting effect of causing the above models to chose $\beta^{1..7}$ rather than only $\beta^7$!! This is something to be aware of, and I completely missed this the first time though.** 391 | 392 | 393 | *************** 394 | ## Problem 9 395 | > In this exercise, we will predict the number of applications received using the other variables in the College data set. 396 | 397 | ### Part a) 398 | > Split the data set into a training set and a test set. 399 | 400 | ```{r} 401 | set.seed(1) 402 | train=sample(c(TRUE,FALSE),nrow(College),rep=TRUE) 403 | test=(!train) 404 | 405 | College.train=College[train,,drop=F] 406 | College.test=College[test,,drop=F] 407 | 408 | ``` 409 | 410 | ### Part b) 411 | > Fit a linear model using least squares on the training set, and report the test error obtained. 
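A small aside before the fits: each of the chunks below reports a test $R^2$ computed from RSS and TSS. That three-line computation could be wrapped once in a helper (the name `test.rsq.fn` is purely illustrative and is not used in the chunks that follow):

```{r}
# illustrative helper: test R^2 from predictions and actuals
test.rsq.fn <- function(pred, actual) {
  1 - sum((pred - actual)^2) / sum((actual - mean(actual))^2)
}
```

e.g. `test.rsq.fn(pred, College.test$Apps)` would replace the repeated RSS/TSS lines.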
412 | 413 | ```{r} 414 | lm.fit=lm(Apps~.,data=College.train) 415 | summary(lm.fit) 416 | pred=predict(lm.fit,College.test) 417 | rss=sum((pred-College.test$Apps)^2) 418 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 419 | test.rsq=1-(rss/tss) 420 | test.rsq 421 | ``` 422 | 423 | where test.rsq is the $R^2$ statistic. 424 | 425 | ### Part c) 426 | > Fit a ridge regression model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained. 427 | 428 | ```{r} 429 | 430 | ### Scale the training data, and scale the test data using the centers/scale 431 | # learned on the training data. 432 | College.train.X=scale(model.matrix(Apps~.,data=College.train)[,-1],scale=T,center=T) 433 | College.train.Y=College.train$Apps 434 | 435 | College.test.X=scale(model.matrix(Apps~.,data=College.test)[,-1], 436 | attr(College.train.X,"scaled:center"), 437 | attr(College.train.X,"scaled:scale")) 438 | 439 | College.test.Y=College.test$Apps 440 | 441 | cv.out=cv.glmnet(College.train.X,College.train.Y,alpha=0) 442 | bestlam=cv.out$lambda.min 443 | bestlam 444 | 445 | lasso.mod=glmnet(College.train.X,College.train.Y,alpha=0,lambda=bestlam) 446 | pred=predict(lasso.mod,College.test.X,s=bestlam) 447 | rss=sum((pred-College.test$Apps)^2) 448 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 449 | test.rsq=1-(rss/tss) 450 | test.rsq 451 | 452 | ``` 453 | 454 | 455 | ### Part d) 456 | > Fit a lasso model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates. 457 | 458 | ```{r} 459 | 460 | cv.out=cv.glmnet(College.train.X,College.train.Y,alpha=1) 461 | bestlam=cv.out$lambda.min 462 | bestlam 463 | 464 | lasso.mod=glmnet(College.train.X,College.train.Y,alpha=1,lambda=bestlam) 465 | pred=predict(lasso.mod,College.test.X,s=bestlam) 466 | rss=sum((pred-College.test$Apps)^2) 467 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 468 | test.rsq=1-(rss/tss) 469 | test.rsq 470 | 471 | #Number of coefficients equal to 0 472 | sum(coef(lasso.mod)[,1]==0) 473 | 474 | names(coef(lasso.mod)[, 1][coef(lasso.mod)[, 1] == 0]) 475 | ``` 476 | 477 | 478 | ### Part e) 479 | > Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. 480 | 481 | ```{r} 482 | library(pls) 483 | set.seed(1) 484 | pcr.fit=pcr(Apps~.,data=College.train, scale=TRUE, validation="CV") 485 | summary(pcr.fit) #lowest at M=17 486 | pred=predict(pcr.fit,College.test,ncomp=17) 487 | rss=sum((pred-College.test$Apps)^2) 488 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 489 | test.rsq=1-(rss/tss) 490 | test.rsq 491 | ``` 492 | 493 | 494 | ### Part f) 495 | > Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. 496 | 497 | 498 | ```{r} 499 | library(pls) 500 | set.seed(1) 501 | pls.fit=plsr(Apps~.,data=College.train, scale=TRUE, validation="CV") 502 | summary(pls.fit) #pretty much lowest at 9 comps, certainly closest to lowest there 503 | pred=predict(pls.fit,College.test,ncomp=9) 504 | rss=sum((pred-College.test$Apps)^2) 505 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 506 | test.rsq=1-(rss/tss) 507 | test.rsq 508 | ``` 509 | 510 | ### Part g) 511 | > Comment on the results obtained. How accurately can we predict the number of college applications received? 
Is there much difference among the test errors resulting from these five approaches? 512 | 513 | Ordinary least squares, PLS regression, lasso, and PCR regression performed (more or less equally) best. These methods ended up using the same underlying data essentially, since the optimal PCR regression used the same number of underlying variables. PLS regression was able to cut out a few things, chosing a model that used 9 of the possible 17 components, and 83% of the variance, while still performing pretty much as well. Interestingly the Lasso, while not performing quite as well, still performed pretty comparably 0.8995 vs 0.9052 (a difference of `r 0.9052 - 0.8995`). The lasso though only set 3 variables to 0 (Enroll (students enrolled), Terminal (pct fac w/ terminal degree), and S.F. Ratio(student/factulty ratio)). It is interesting that most of the variables seem to contribute interesting information to the model. Ridge regression performed the poorest at $R^2=0.84$. 514 | 515 | 516 | **************** 517 | ## Problem 10 518 | > We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set. 519 | 520 | ### Parts a-e) 521 | > Generate a data set with p = 20 features, n = 1000 observations, and an associated quantitative response vector generated according to the model $Y=X\beta+\epsilon$ 522 | where $\beta$ has some elements that are exactly equal to zero. 523 | 524 | ```{r, fig.height=5, fig.width=7} 525 | library(leaps) 526 | 527 | set.seed(1) 528 | X=matrix(rnorm(1000*20),ncol=20,nrow=1000) 529 | colnames(X) <- sprintf("Feature_%d",1:20) 530 | beta=rnorm(20,sd=10) 531 | beta[c(3,7,9,11,13,18)]=0 532 | e=rnorm(1000) 533 | Y=as.vector(X%*%beta+e) 534 | train=sample(1:nrow(X),100) ### FIXME 100 train 900 test 535 | test=(-train) 536 | 537 | X.train=X[train,] 538 | X.test=X[test,] 539 | Y.train=Y[train] 540 | Y.test=Y[test] 541 | 542 | dat.train=cbind(data.frame(Y=Y.train),X.train) 543 | dat.test=cbind(data.frame(Y=Y.test),X.test) 544 | 545 | regfit.best=regsubsets(Y~.,dat=dat.train,nvmax=20) 546 | 547 | predict.regsubsets=function(object,newdata,id,...){ 548 | form=as.formula(object$call[[2]]) ## extract formula 549 | mat=model.matrix(form,newdata) 550 | coefi=coef(object,id=id) 551 | xvars=names(coefi) 552 | mat[,xvars]%*%coefi 553 | } 554 | 555 | mse=function(pred,real){ 556 | mean((pred-real)^2) 557 | } 558 | 559 | test.mse <- sapply(1:20, function(id){ 560 | pred=predict.regsubsets(regfit.best,dat.test,id) 561 | mse(pred,Y.test) 562 | }) 563 | 564 | plot(seq(1:20),test.mse,xlab="Number of Features", 565 | ylab="Test MSE") 566 | points(which.min(test.mse),test.mse[which.min(test.mse)], 567 | col="red",cex=2,pch=20) 568 | 569 | coef(regfit.best,id=which.min(test.mse)) 570 | ``` 571 | 572 | My 0 beta features are 3,7,9,11,13, and 18. 573 | The lowest test MSE was found at 14/20 features. This is indeed the correct number of features which is encouraging (20-6). Best subset selection selected against Features 3, 7, 9, 11, 13, 18, so it did really well at finding the true underlying model! 
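The excluded features can also be recovered programmatically rather than by inspection; a quick check using the objects defined above:

```{r}
best.id <- which.min(test.mse)
chosen  <- names(coef(regfit.best, id = best.id))
setdiff(colnames(X), chosen)   # should return exactly the zero-beta features
```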
574 | 575 | ### Part g) 576 | 577 | ```{r fig.width=7,fig.height=5} 578 | beta.rsqb.diffs <- sapply(1:20,function(r){ 579 | coefi<-coef(regfit.best,id=r) 580 | ncoefi<-names(coefi) 581 | beta.est <- sapply(1:20,function(i){ 582 | id<-sprintf("Feature_%d",i) 583 | if(id %in% names(coefi)){ 584 | return(coefi[id]) 585 | }else{ 586 | return(0) 587 | } 588 | }) 589 | 590 | return(sqrt(sum((beta-beta.est)^2))) 591 | }) 592 | 593 | plot(seq(1:20),beta.rsqb.diffs,xlab="Number of Features", 594 | ylab="Root Squared Diff Of Betas") 595 | points(which.min(beta.rsqb.diffs),beta.rsqb.diffs[which.min(beta.rsqb.diffs)], 596 | col="red",cex=2,pch=20) 597 | 598 | ``` 599 | 600 | The minimum value is the same as before, 14 features, however the cool thing is how much more pronounced the answer is. The dip is really strong between 13 and 14, and then stays small going out to 20. 601 | 602 | 603 | ************ 604 | ## Problem 11 605 | 606 | ### Part a) 607 | 608 | I am going to evalueate each of these methods with 10 Fold CV on the Boston dataset. 609 | 610 | ```{r} 611 | library(MASS) 612 | library(ISLR) 613 | 614 | ## Best Subset 615 | k=10 616 | set.seed(1) 617 | p=ncol(Boston)-1 618 | folds=sample(1:k,nrow(Boston),replace=TRUE) 619 | 620 | cv.errors=c() 621 | for(j in 1:k){ 622 | Boston.sub=Boston[folds!=j,] 623 | #now do CV on this CV subset to choose the best model, and apply 624 | # it to the whole thing. 625 | cv.err=matrix(NA,k,p,dimnames=list(NULL,paste(1:p))) 626 | folds.sub=sample(1:k,nrow(Boston.sub),replace=TRUE) 627 | 628 | for(q in 1:k){ 629 | best.fit=regsubsets(crim~.,data=Boston.sub[folds.sub!=q,],nvmax=p) 630 | for(i in 1:p){ 631 | pred=predict.regsubsets(best.fit,Boston.sub[folds.sub==q,],id=i) 632 | cv.err[q,i]=mean((Boston.sub$crim[folds.sub==q]-pred)^2) 633 | } 634 | } 635 | 636 | best.k = which.min(apply(cv.err,2,mean)) 637 | 638 | best.fit.all=regsubsets(crim~.,data=Boston.sub,nvmax=p) 639 | pred=predict.regsubsets(best.fit.all,Boston[folds==j,],id=best.k) 640 | 641 | cv.errors=c(cv.errors,mean((Boston$crim[folds==j]-pred)^2)) 642 | } 643 | 644 | mean(cv.errors) 645 | 646 | ## Ridge regression (alpha=0) 647 | 648 | cv.errors = sapply(1:k, function(j){ 649 | Boston.X=as.matrix(Boston[,-1]) 650 | Boston.Y=Boston[,1] 651 | 652 | cv.out=cv.glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=0) 653 | bestlam=cv.out$lambda.min 654 | bestlam 655 | 656 | lasso.mod=glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=0,lambda=bestlam) 657 | pred=predict(lasso.mod,Boston.X[folds==j,],s=bestlam) 658 | return(mean((Boston.Y[folds==j]-pred)^2)) 659 | }) 660 | 661 | mean(cv.errors) 662 | 663 | ## Lasso (alpha=1) 664 | cv.errors = sapply(1:k, function(j){ 665 | Boston.X=as.matrix(Boston[,-1]) 666 | Boston.Y=Boston[,1] 667 | 668 | cv.out=cv.glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=1) 669 | bestlam=cv.out$lambda.min 670 | bestlam 671 | 672 | lasso.mod=glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=1,lambda=bestlam) 673 | pred=predict(lasso.mod,Boston.X[folds==j,],s=bestlam) 674 | return(mean((Boston.Y[folds==j]-pred)^2)) 675 | }) 676 | 677 | mean(cv.errors) 678 | 679 | ## PCR 680 | cv.errors = sapply(1:k, function(j){ 681 | 682 | pcr.fit=pcr(crim~.,data=Boston[folds!=j,],scale=TRUE,validation="CV") 683 | res=RMSEP(pcr.fit) 684 | pcr.best=which.min(res$val[1,,])-1 685 | 686 | pred=predict(pcr.fit,Boston[folds==j,],ncomp=pcr.best) 687 | return(mean((Boston[folds==j,1]-pred)^2)) 688 | }) 689 | 690 | mean(cv.errors) 691 | 692 | ``` 693 | 694 | 695 | Of these above methods on the Boston 
dataset using CV for building multiple training/testing splits, and using CV within each CV iteration for choosing optimal parameters for each model, PCR and ridge regression perform best. Lasso performs nearly as well, and best subset selection performs slightly worse than the others. 696 | 697 | The best method, PCR regression, does include all features. It selects a subset of linear combinations of all features though, so some of the variance in some of the features is likely not included, although some information from each feature will make it into the final model regardless of the parameter that was selected in each CV iteration. 698 | 699 | The same thing goes for the second best method, ridge regression, this one also uses some information from each feature, although it might heavily discount the contribution from some of the features. 700 | 701 | 702 | -------------------------------------------------------------------------------- /R_Exercises/Exercise6/Exercise6.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Exercises 5: 2 | ======================================================== 3 | 4 | # Conceptual Section 5 | ******************* 6 | ## Problem 1 7 | 8 | ******************* 9 | ## Problem 2 10 | 11 | See section 7.5 previously on smoothing splines. A typicall smoothing spline uses the second derivative of g (the change in slope). So minimizing the second derivative of g results in less roughness of the line. The first derivative of g represents the slope. The third derivative is weird. This is the rate at which acceleration is changing in physics. It is the rate of change of the second derivative, called "jerk". Smoothing splines are like regular splines but they have a knot at every training datapoint, and they smooth the fit by minimizing this alpha term over the second derivative. 12 | 13 | ### Part a 14 | I think this should predict everythiing to be zero. No matter how bad the fit is, the infinite penalty of a non-zero prediction on X will overrule everything else. 15 | 16 | ### Part b 17 | This is minimizing slope (first derivative). Basically g must be flat! So this should just be the mean of all points. 18 | 19 | ### Part c 20 | This is minimizing change in slope (second derivative). This is allowed to be a line so it will likely be a nice linear regression fit. 21 | 22 | ### Part d 23 | This is minimizing "jerk" the rate of change in slope. Hmm.. So the rate of change in slope is allowed to be constant, but not changing. This should be the closest you can get to fitting all points perfectly given some acceleration of line change. 24 | 25 | ### Part e 26 | Now there is no smoothing parameter, so that part of the equation is ignored. Basically this is a set of straight lines connecting the dots! 
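A rough way to see the large-penalty behaviour described in (c): `smooth.spline()` penalizes the second derivative, so a very large smoothing parameter should flatten the fit toward the least squares line, while a tiny one lets it wiggle. A small sketch on simulated data (not part of the original answer):

```{r}
set.seed(1)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit.rough  <- smooth.spline(x, y, lambda = 1e-6)  # almost no penalty: wiggly fit
fit.smooth <- smooth.spline(x, y, lambda = 10)    # heavy penalty: close to a straight line
plot(x, y, col = "lightgrey")
lines(fit.rough, col = "red", lwd = 2)
lines(fit.smooth, col = "blue", lwd = 2)
abline(lm(y ~ x), lty = 3)  # the heavily penalized fit should approach this line
```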
27 | ******************* 28 | 29 | ## Problem 3 30 | 31 | ## Problem 4 32 | 33 | ## Problem 5 34 | 35 | 36 | # Applied section 37 | ******************* 38 | 39 | ## Problem 6 40 | 41 | ### Part a 42 | ```{r} 43 | library(ISLR) 44 | 45 | k=10 46 | max.poly=15 47 | set.seed(1) 48 | folds=sample(1:k,nrow(Wage),replace=TRUE) 49 | cv.errors=matrix(NA,k,max.poly,dimnames=list(NULL,paste(1:max.poly))) 50 | 51 | for(j in 1:k){ 52 | for(i in 1:max.poly){ 53 | lm.fit=lm(wage~poly(age,i,raw=T),data=Wage[folds!=j,]) 54 | 55 | pred=predict(lm.fit,Wage[folds==j,]) 56 | cv.errors[j,i]=mean((Wage$wage[folds==j]-pred)^2) 57 | } 58 | } 59 | 60 | mean.cv.errors=apply(cv.errors,2,mean) 61 | mean.cv.errors 62 | which.min(mean.cv.errors) 63 | 64 | fit.1=lm(wage~poly(age,1,raw=T),data=Wage) 65 | fit.2=lm(wage~poly(age,2,raw=T),data=Wage) 66 | fit.3=lm(wage~poly(age,3,raw=T),data=Wage) 67 | fit.4=lm(wage~poly(age,4,raw=T),data=Wage) 68 | fit.5=lm(wage~poly(age,5,raw=T),data=Wage) 69 | fit.6=lm(wage~poly(age,6,raw=T),data=Wage) 70 | fit.7=lm(wage~poly(age,7,raw=T),data=Wage) 71 | fit.8=lm(wage~poly(age,8,raw=T),data=Wage) 72 | fit.9=lm(wage~poly(age,9,raw=T),data=Wage) 73 | fit.10=lm(wage~poly(age,10,raw=T),data=Wage) 74 | fit.11=lm(wage~poly(age,11,raw=T),data=Wage) 75 | fit.12=lm(wage~poly(age,12,raw=T),data=Wage) 76 | fit.13=lm(wage~poly(age,13,raw=T),data=Wage) 77 | fit.14=lm(wage~poly(age,14,raw=T),data=Wage) 78 | fit.15=lm(wage~poly(age,15,raw=T),data=Wage) 79 | anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9,fit.10,fit.11,fit.12,fit.13,fit.14,fit.15) 80 | ``` 81 | 82 | There appears to be support both in CV and anova for a 9th degree polynomial on age relative to wage! 83 | 84 | ```{r, fig.width=11, fig.height=11} 85 | age.range=data.frame(age=seq(min(Wage$age),max(Wage$age),by=0.1)) 86 | plot(Wage$age, Wage$wage, xlab="Age", ylab="Wage", main="Wage vs Age") 87 | lines(age.range$age,predict(fit.9,age.range),col="red",lwd=3) 88 | ``` 89 | 90 | ### Part b 91 | ```{r} 92 | library(ISLR) 93 | 94 | k=10 95 | max.cut=19 96 | set.seed(1) 97 | folds=sample(1:k,nrow(Wage),replace=TRUE) 98 | cv.errors=matrix(NA,k,max.cut-1,dimnames=list(NULL,paste(2:max.cut-1))) 99 | 100 | for(j in 1:k){ 101 | for(i in 2:max.cut){ 102 | lm.fit=lm(wage~cut(age,i,labels = FALSE),data=Wage[folds!=j,]) 103 | 104 | pred=predict(lm.fit,newdata=Wage[folds==j,]) 105 | cv.errors[j,i-1]=mean((Wage$wage[folds==j]-pred)^2) 106 | } 107 | } 108 | 109 | mean.cv.errors=apply(cv.errors,2,mean) 110 | mean.cv.errors 111 | mean.cv.errors[which.min(mean.cv.errors)] 112 | 113 | fit.2=lm(wage~cut(age,2),data=Wage) 114 | fit.3=lm(wage~cut(age,3),data=Wage) 115 | fit.4=lm(wage~cut(age,4),data=Wage) 116 | fit.5=lm(wage~cut(age,5),data=Wage) 117 | fit.6=lm(wage~cut(age,6),data=Wage) 118 | fit.7=lm(wage~cut(age,7),data=Wage) 119 | fit.8=lm(wage~cut(age,8),data=Wage) 120 | fit.9=lm(wage~cut(age,9),data=Wage) 121 | fit.10=lm(wage~cut(age,10),data=Wage) 122 | fit.11=lm(wage~cut(age,11),data=Wage) 123 | fit.12=lm(wage~cut(age,12),data=Wage) 124 | fit.13=lm(wage~cut(age,13),data=Wage) 125 | fit.14=lm(wage~cut(age,14),data=Wage) 126 | fit.15=lm(wage~cut(age,15),data=Wage) 127 | fit.16=lm(wage~cut(age,16),data=Wage) 128 | fit.17=lm(wage~cut(age,17),data=Wage) 129 | fit.18=lm(wage~cut(age,18),data=Wage) 130 | anova(fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9,fit.10,fit.11,fit.12,fit.13,fit.14,fit.15,fit.16,fit.17,fit.18) 131 | ``` 132 | 133 | There appears to be optimal support for 6 cutpoints in CV and up to 15 cutpoints in anova. 
I will plot both, red is the 6 cutpoints, and blue is 15. 134 | 135 | ```{r, fig.width=11, fig.height=11} 136 | age.range=data.frame(age=seq(min(Wage$age),max(Wage$age),by=0.1)) 137 | plot(Wage$age, Wage$wage, xlab="Age", ylab="Wage", main="Wage vs Age") 138 | lines(age.range$age,predict(fit.6,age.range),col="red",lwd=3) 139 | lines(age.range$age,predict(fit.15,age.range),col="blue",lwd=3) 140 | ``` 141 | 142 | 143 | 144 | 145 | 146 | ## Problem 7 147 | Here I will use a gam function 148 | 149 | ```{r,fig.width=11,fig.height=11} 150 | pairs(wage~age+jobclass+maritl,data=Wage) 151 | ``` 152 | 153 | ```{r} 154 | library(gam) 155 | gam.m0=gam(wage~lo(year,span=0.7)+s(age,5)+education,data=Wage) 156 | gam.m1=gam(wage~lo(year,span=0.7)+s(age,5)+education+jobclass,data=Wage) 157 | gam.m2=gam(wage~lo(year,span=0.7)+s(age,5)+education+maritl,data=Wage) 158 | gam.m3=gam(wage~lo(year,span=0.7)+s(age,5)+education+jobclass+maritl,data=Wage) 159 | anova(gam.m0,gam.m1,gam.m2,gam.m3,test="F") 160 | anova(gam.m1,gam.m3,test="F") 161 | ``` 162 | 163 | It seems that together jobclass and marital status provide useful information beyond what you get from year, age, and education alone. 164 | 165 | ```{r,fig.width=11,fig.height=11} 166 | par(mfrow=c(3,3)) 167 | plot(gam.m3,se=T,col="blue") 168 | ``` 169 | 170 | ******** 171 | ## Problem 8 172 | 173 | ```{r fig.width=11,fig.height=11} 174 | pairs(Auto) 175 | gam.m0=gam(mpg~displacement+acceleration+year,data=Auto) 176 | gam.m1=gam(mpg~s(displacement,5)+s(acceleration,3)+year,data=Auto) 177 | gam.m2=gam(mpg~s(displacement,5)+s(acceleration,4)+year,data=Auto) 178 | anova(gam.m0,gam.m1,gam.m2) 179 | ``` 180 | 181 | There is pretty strong support for a non-linear relationship in the anova test. We can visually see this non-linearity in the scatterplot matrix. Year appears pretty linear though. 182 | 183 | ```{r fig.width=11,fig.height=5} 184 | par(mfrow=c(1,3)) 185 | plot(gam.m1,se=T,col="red") 186 | ``` 187 | 188 | *********** 189 | ## Problem 9 190 | ### Part a 191 | ```{r fig.width=7, fig.height=5} 192 | library(MASS) 193 | gam.fit0=lm(nox~poly(dis,3,raw=T),data=Boston) 194 | r=range(Boston$dis) 195 | d.grid=seq(r[1],r[2],by=(r[2]-r[1])/200) 196 | preds=predict(gam.fit0,newdata=list(dis=d.grid),se=T) 197 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 198 | plot(Boston$dis,Boston$nox,xlab="weighted mean distance to employment",ylab="nitrogen oxides concentration",col="lightgrey") 199 | lines(d.grid,preds$fit,lwd=2,col="blue") 200 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 201 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 202 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 203 | title("Cubic relationship between mean distance to employment\nand nitrogen oxide contamination") 204 | ``` 205 | ### Part b 206 | > Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares. 
207 | 208 | ```{r fig.height=11, fig.width=11} 209 | par(mfrow=c(4,3)) 210 | for(i in seq(1,12)){ 211 | gam.fit=lm(nox~poly(dis,i,raw=T),data=Boston) 212 | preds=predict(gam.fit,newdata=list(dis=d.grid),se=T) 213 | pred.fit=predict(gam.fit,newdata=Boston) 214 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 215 | rss=sum((Boston$nox-pred.fit)^2) 216 | plot(Boston$dis,Boston$nox,xlab="dis",ylab="nox",col="lightgrey",main=sprintf("Poly=%d, rss=%0.6f",i,rss)) 217 | lines(d.grid,preds$fit,lwd=2,col="blue") 218 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 219 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 220 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 221 | } 222 | ``` 223 | 224 | ### Part c 225 | 226 | 227 | ```{r} 228 | k=10 229 | max.poly=12 230 | set.seed(1) 231 | folds=sample(1:k,nrow(Boston),replace=TRUE) 232 | cv.errors=matrix(NA,k,max.poly,dimnames=list(NULL,paste(1:max.poly))) 233 | 234 | for(j in 1:k){ 235 | for(i in 1:max.poly){ 236 | gam.fit=lm(nox~poly(dis,i,raw=T),data=Boston[folds!=j,]) 237 | 238 | pred=predict(gam.fit,Boston[folds==j,]) 239 | cv.errors[j,i]=mean((Boston$nox[folds==j]-pred)^2) 240 | } 241 | } 242 | 243 | mean.cv.errors=apply(cv.errors,2,mean) 244 | mean.cv.errors 245 | mean.cv.errors[which.min(mean.cv.errors)] 246 | 247 | ``` 248 | 249 | 250 | 251 | The minimal mean of mean squared errors is with a 4th degree polynomial through our CV iterations. 252 | 253 | 254 | ### Part d 255 | ```{r} 256 | gam.fit=lm(nox~bs(dis,df=4),data=Boston) 257 | preds=predict(gam.fit,newdata=list(dis=d.grid),se=T) 258 | pred.fit=predict(gam.fit,newdata=Boston) 259 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 260 | rss=sum((Boston$nox-pred.fit)^2) 261 | plot(Boston$dis,Boston$nox,xlab="dis",ylab="nox",col="lightgrey",main=sprintf("Df=%d, rss=%0.6f",4,rss)) 262 | lines(d.grid,preds$fit,lwd=2,col="blue") 263 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 264 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 265 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 266 | ``` 267 | 268 | Here I include the intercept in the bases so that the line fits the data in its raw form. 269 | 270 | ### Part e 271 | 272 | ```{r fig.width=11,fig.height=11} 273 | par(mfrow=c(3,3)) 274 | for(i in seq(3,11)){ 275 | #add one for the intercept 276 | gam.fit=lm(nox~bs(dis,df=i),data=Boston) 277 | preds=predict(gam.fit,newdata=list(dis=d.grid),se=T) 278 | pred.fit=predict(gam.fit,newdata=Boston) 279 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 280 | rss=sum((Boston$nox-pred.fit)^2) 281 | plot(Boston$dis,Boston$nox,xlab="dis",ylab="nox",col="lightgrey",main=sprintf("DF=%d, rss=%0.6f",i,rss)) 282 | lines(d.grid,preds$fit,lwd=2,col="blue") 283 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 284 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 285 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 286 | } 287 | ``` 288 | 289 | The results start getting pretty noisy looking around 10 degrees of freedom, I bet the best results will be around 4 or 5 degrees of freedom in CV validation. 
290 | 291 | 292 | ```{r} 293 | k=10 294 | max.df=12 295 | set.seed(1) 296 | folds=sample(1:k,nrow(Boston),replace=TRUE) 297 | cv.errors=matrix(NA,k,max.df-2,dimnames=list(NULL,paste(3:max.df))) 298 | 299 | for(j in 1:k){ 300 | for(i in 3:max.df){ 301 | gam.fit=lm(nox~bs(dis,df=i),data=Boston[folds!=j,]) 302 | 303 | pred=predict(gam.fit,Boston[folds==j,]) 304 | cv.errors[j,i-2]=mean((Boston$nox[folds==j]-pred)^2) 305 | } 306 | } 307 | 308 | mean.cv.errors=apply(cv.errors,2,mean) 309 | mean.cv.errors 310 | mean.cv.errors[which.min(mean.cv.errors)] 311 | 312 | ``` 313 | 314 | As I suspected 5 was chosen by CV as the optimal degree of freedom. 315 | 316 | ********** 317 | ## Problem 10 318 | 319 | ### Part a 320 | ```{r fig.width=11,fig.height=11} 321 | library(leaps) 322 | set.seed(1) 323 | test=sample(1:nrow(College),nrow(College)/4) 324 | train=(-test) 325 | College.test=College[test,,drop=F] 326 | College.train=College[train,,drop=F] 327 | 328 | #predict Outstate using other variables 329 | regfit.fwd=regsubsets(Outstate~.,data=College.train,nvmax=ncol(College)+1,method="forward") 330 | reg.summary=summary(regfit.fwd) 331 | 332 | par(mfrow=c(2,2)) 333 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 334 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l",main=sprintf("Max Adjusted RSq: %d",which.max(reg.summary$adjr2))) 335 | points(which.max(reg.summary$adjr2), 336 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 337 | col="red",cex=2,pch=20) 338 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 339 | type="l",main=sprintf("Min Cp: %d",which.min(reg.summary$cp))) 340 | points(which.min(reg.summary$cp), 341 | reg.summary$cp[which.min(reg.summary$cp)], 342 | col="red",cex=2,pch=20) 343 | plot(reg.summary$bic,xlab="Number of Variables", 344 | ylab="BIC", type="l",main=sprintf("Min BIC: %d",which.min(reg.summary$bic))) 345 | points(which.min(reg.summary$bic), 346 | reg.summary$bic[which.min(reg.summary$bic)], 347 | col="red",cex=2,pch=20) 348 | 349 | reg.summary 350 | ``` 351 | 352 | ### Part b 353 | 354 | The 12 feature model that BIC choses seems to be pretty good. 355 | 356 | This model uses the following features: 357 | 358 | * Private 359 | * Accept 360 | * Top10perc 361 | * F.Undergrad 362 | * Room.Board 363 | * Personal 364 | * PhD 365 | * Terminal 366 | * S.F.Ratio 367 | * perc.alumni 368 | * Expend 369 | * Grad.Rate 370 | 371 | ```{r fig.width=11,fig.height=11} 372 | good.features=c("Outstate", 373 | "Private", 374 | "Accept", 375 | "Top10perc", 376 | "F.Undergrad", 377 | "Room.Board", 378 | "Personal", 379 | "PhD", 380 | "Terminal", 381 | "S.F.Ratio", 382 | "perc.alumni", 383 | "Expend", 384 | "Grad.Rate") 385 | 386 | pairs(College.train[,good.features,drop=F]) 387 | 388 | #standard lm fit using these features for comparison 389 | lm.fit=lm(Outstate~.,data=College.train[,good.features]) 390 | 391 | ``` 392 | 393 | The highly non-linear features by eye relative to the response seem to be Accept, F.Undergrad, and Top10perc. PhD might be anotehr good candidate, and perhaps Terminal. 
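These eyeball judgements could also be screened one variable at a time before committing to the full GAM below; a rough sketch for a single candidate (using `Accept` as the example, following the same `anova()` comparison pattern used elsewhere in this file):

```{r}
# does a smooth term for Accept improve on a purely linear term?
lin.accept <- gam(Outstate ~ Accept, data = College.train)
sm.accept  <- gam(Outstate ~ s(Accept, 4), data = College.train)
anova(lin.accept, sm.accept, test = "F")
```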
394 | 395 | ```{r fig.width=11,fig.height=11} 396 | gam.fit=gam(Outstate~ 397 | ns(Accept,5)+ 398 | ns(Top10perc,3)+ 399 | ns(F.Undergrad,5)+ 400 | ns(PhD,3)+ 401 | ns(Terminal,3)+ 402 | Private + 403 | Room.Board + 404 | Personal + 405 | S.F.Ratio + 406 | perc.alumni + 407 | Expend + 408 | Grad.Rate 409 | ,data=College.train) 410 | anova(lm.fit,gam.fit) 411 | 412 | par(mfrow=c(4,3)) 413 | plot.gam(gam.fit,se=T,residuals=T,col="blue") 414 | 415 | summary(lm.fit) 416 | summary(gam.fit) 417 | 418 | 419 | gam.pred=predict(gam.fit,newdata=College.test) 420 | lm.pred=predict(lm.fit,newdata=College.test) 421 | 422 | sqrt(mean((College.test$Outstate-gam.pred)^2)) 423 | sqrt(mean((College.test$Outstate-lm.pred)^2)) 424 | ``` 425 | 426 | The root mean squared error is lower for the GAM than the standard linear model. I used the features described previosly to test out non-linear relationships. 427 | 428 | 429 | ## Problem 11 430 | 431 | ```{r fig.width=11, fig.height=5} 432 | n=100 433 | beta0_t=5 434 | beta1_t=-0.55 435 | beta2_t=1.35 436 | set.seed(10) 437 | x1=rnorm(n,sd=1.1) 438 | x2=rnorm(n,sd=2.3) 439 | e=rnorm(n,mean=0,sd=0.5) 440 | y=x1*beta1_t+x2*beta2_t+beta0_t+e 441 | 442 | res=matrix(NA,3,1000,dimnames=list(c("B0","B1","B2"),paste(1:1000))) 443 | 444 | b0hat=150 445 | b1hat=100 446 | b2hat=-100 447 | 448 | for(i in seq(1000)){ 449 | a=y-b1hat*x1 450 | b2hat=lm(a~x2)$coef[2] 451 | a=y-b2hat*x2 452 | fit2=lm(a~x1) 453 | b1hat=fit2$coef[2] 454 | b0hat=fit2$coef[1] 455 | res["B0",i]=b0hat 456 | res["B1",i]=b1hat 457 | res["B2",i]=b2hat 458 | } 459 | 460 | res[,c(1,2,3,4,5,6,7,8,9,1000),drop=F] 461 | 462 | r=range(res) 463 | plot(seq(1000),res[1,],type="l",lwd=3,ylim=r,col="blue",xlab="iteration",ylab="coefficient estimate") 464 | lines(res[2,],lwd=3,col="red") 465 | lines(res[3,],lwd=3,col="green") 466 | legend("topright", 467 | c("B0","B1","B2"), # puts text in the legend 468 | lty=c(1,1,1), # gives the legend appropriate symbols (lines) 469 | lwd=c(3,3,3),col=c("blue","red","green")) # gives the legend lines the correct color and width 470 | fit=lm(y~x1+x2) 471 | abline(h=fit$coef[1],lty=3,lwd=1,col="blue") 472 | abline(h=fit$coef[2],lty=3,lwd=1,col="red") 473 | abline(h=fit$coef[3],lty=3,lwd=1,col="green") 474 | summary(lm(y~x1+x2)) 475 | 476 | ``` 477 | 478 | In this dataset by the 3rd iteration results were nearly as good as they would get, and by the fourth they were pretty much converged. 479 | 480 | ******* 481 | ## Problem 12 482 | 483 | ```{r fig.width=11, fig.height=5} 484 | n=100 485 | p=100 486 | set.seed(5) 487 | bhats=rnorm(101,sd=100) 488 | btarg=rnorm(101,sd=10) 489 | X=cbind(rep(1,100),matrix(rnorm(100*100),100,100)) 490 | e=rnorm(n,mean=0,sd=0.5) 491 | res=matrix(NA,101,1000,dimnames=list(paste(0:100),paste(1:1000))) 492 | 493 | ## rows from left dot product with columns from right 494 | y=as.vector(btarg%*% t(X))+e 495 | for(i in seq(1000)){ 496 | for (j in 2:101){ 497 | a=y-as.vector((bhats[-j]%*% t(X[,-j]))) 498 | bhats[j]=lm(a~X[,j,drop=T])$coef[2] 499 | } 500 | bhats[1]=mean(y-as.vector(bhats[-1]%*% t(X[,-1]))) 501 | res[,i]=bhats 502 | } 503 | 504 | rmse_betas=apply(res,2,function(c)mean(sqrt((btarg-c)^2))) 505 | 506 | plot(1:1000,rmse_betas,type="l",col="blue",lwd=2,main="Beta RMSE by iteration",xlab="iteration",ylab="Coefficient RMSE") 507 | 508 | which.min(rmse_betas) 509 | min(rmse_betas) 510 | 511 | ``` 512 | 513 | As you can see, the RMSE on the betas is decreasing as a function of the iteration. 
The minimal value is at iteration 1000 (I stopped it there due to runtime), and it decreases slowly but steadily until then. 514 | 515 | -------------------------------------------------------------------------------- /R_Exercises/Exercise7/Exercise7.Rmd: -------------------------------------------------------------------------------- 1 | ISLR Exercise 7: Decision Trees 2 | ======================================================== 3 | ******** 4 | ******** 5 | ## Conceptual 6 | ### Problem 1 7 | > Draw an example (of your own invention) of a partition of two- dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all as- pects of your figures, including the regions R1, R2, . . ., the cutpoints t1,t2,..., and so forth. 8 | _Hint: Your result should look something like Figures 8.1 and 8.2._ 9 | 10 | 11 | ***** 12 | ### Problem 2 13 | > It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form 14 | $$f(X)=\sum_{j=1}^pf_j(X_j).$$ 15 | Explain why this is the case. You can begin with (8.12) in Algorithm 8.2 16 | 17 | 18 | ***** 19 | ### Problem 3 20 | > Consider the Gini index, classification error, and cross-entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of pˆm1. The x- axis should display pˆm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. 21 | _Hint: In a setting with two classes, $$\hat p_{m1} = 1 - \hat p_{m2}$$. You could make this plot by hand, but it will be much easier to make in R._ 22 | 23 | The Gini index 24 | $$G=\sum_{k=1}^K\hat p_{mk}(1 - \hat p_{mk})$$ 25 | 26 | cross-entropy 27 | $$D=-\sum_{k=1}^K\hat p_{mk}\text{log} \hat p_{mk}$$ 28 | 29 | ```{r fig.width=8, fig.height=8} 30 | gini=function(m1){ 31 | return(2*(m1*(1-m1))) 32 | } 33 | 34 | ent=function(m1){ 35 | m2=1-m1 36 | return(-((m1*log(m1))+(m2*log(m2)))) 37 | } 38 | 39 | classerr=function(m1){ 40 | m2=1-m1 41 | return(1-max(m1,m2)) 42 | #return(min((1-m1),m1)) 43 | #return(m1) 44 | } 45 | 46 | err=seq(0,1,by=0.01) 47 | c.err=sapply(err,classerr) 48 | g=sapply(err,gini) 49 | e=sapply(err,ent) 50 | d=data.frame(Gini.Index=g,Cross.Entropy=e) 51 | plot(err,c.err,type='l',col="red",xlab="m1",ylim=c(0,0.8),ylab="value") 52 | matlines(err,d,col=c("green","blue")) 53 | 54 | 55 | ``` 56 | 57 | ***** 58 | ### Problem 4 59 | > This question relates to the plots in Figure 8.12. 60 | 61 | 62 | ***** 63 | #### Part a 64 | > Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.12. The num- bers inside the boxes indicate the mean of Y within each region. 65 | 66 | 67 | ***** 68 | #### Part b 69 | > Create a diagram similar to the left-hand panel of Figure 8.12, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region. 70 | 71 | 72 | ***** 73 | ### Problem 5 74 | > Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 75 | 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. 
76 | There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches? 77 | 78 | First the mean probability based classification: 79 | `r x=c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75); mean(x)>0.5` 80 | 81 | Second the majority vote based classification: 82 | `r mean(x>0.5)>0.5` 83 | 84 | ***** 85 | ### Problem 6 86 | > Provide a detailed explanation of the algorithm that is used to fit a regression tree. 87 | 88 | 1. First we do recursive binary splitting on the data. This is a top-down approach where recursively and greedily we find the best single partitioning of the data such that the reduction of RSS is the greatest. This process is applied to each of the split parts seperately until some minimal number of observations is present on each of the leaves. 89 | 90 | 2. apply cost complexity pruning of this larger tree formed in step 1 to obtain a sequence of best subtrees as a function of a parameter, $\alpha$. Each value of $\alpha$ corresponds to a different subtree which minimizes the equation $$\sum_{m=i}^{|T|}\sum_{i:x_i\in R_m}(y_i - \hat y_{R_m})^2 + \alpha |T|$$. Here $|T|$ is the number of terminal nodes on the tree. When $\alpha=0$ we have the original tree, and as $\alpha$ increases we get a more pruned version of the tree. 91 | 92 | 3. using K-fold CV, choose $\alpha$. For each fold, repeat steps 1 and 2, and then evaluate the MSE as a function of $\alpha$ on the held out fold. Chose an $\alpha$ that minimizes the average error. 93 | 94 | 4. Given the $\alpha$ chosen in step 3, return the tree calculated using the formula laid out in step 2 on the entire dataset with that chosen value of $\alhpa$. 95 | 96 | ***** 97 | ***** 98 | ## Applied 99 | 100 | **** 101 | ### Problem 7 102 | > In the lab, we applied random forests to the Boston data using mtry=6 and using ntree=25 and ntree=500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained. 103 | 104 | `mtry` is the number of variables randomly sampled as candidates for each split. There are 13 variables to look at in the boston dataset. This defaults to `r sqrt(13)` for a dataset of this size. 
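A quick aside (my own note, not from the book): for a quantitative response like `medv`, `randomForest` actually defaults to `mtry = floor(p/3)` (so 4 here), while `sqrt(p)` is the default for classification. A minimal sketch to check what default it picks, assuming only the `MASS` and `randomForest` packages:

```{r}
# Sketch: inspect the default mtry randomForest uses for a regression response.
# For regression the default is floor(p/3); sqrt(p) applies to classification.
library(MASS)          # Boston data
library(randomForest)
set.seed(1)
rf.default=randomForest(medv~.,data=Boston)
rf.default$mtry        # 4 for Boston's 13 predictors, i.e. floor(13/3)
```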
105 | 106 | ```{r fig.width=11, fig.height=11} 107 | library(ISLR) 108 | library(MASS) 109 | library(randomForest) 110 | library(tree) 111 | 112 | mtry=c(3,4,6) 113 | ntree=c(10,30,50,75,100,500) 114 | x=matrix(rep(NA,length(mtry)*length(ntree)),length(ntree),length(mtry)) 115 | set.seed(1) 116 | train=sample(1:nrow(Boston), nrow(Boston)/2) 117 | boston.test=Boston[-train,'medv'] 118 | 119 | for(i in 1:length(ntree)){ 120 | for(j in 1:length(mtry)){ 121 | rf.boston=randomForest(medv~.,data=Boston, 122 | subset=train,mtry=mtry[j],ntree=ntree[i], 123 | importance=TRUE) 124 | yhat.rf=predict(rf.boston,newdata=Boston[-train,]) 125 | err=sqrt(mean((yhat.rf-boston.test)^2)) 126 | x[i,j]=err 127 | } 128 | } 129 | 130 | cols=c("red","green","blue","orange") 131 | 132 | plot(ntree,x[,1],xlab="Number of trees",ylim=c(3,5),ylab="Test RMSE",col=cols[1],type='l') 133 | for(j in 2:length(mtry)){ 134 | lines(ntree,x[,j],col=cols[j]) 135 | } 136 | legend("topright",sprintf("mtry=%g",mtry),lty = 1,col=cols) 137 | ``` 138 | Larger trees definitely had a slight advantage. The default choice of 4 did pretty well, and perhaps bumping up that value a bit helps even more, especially at larger numbers of trees. The default value of 4 maximixed its performance at fewer numbers of trees. Overall 6 looks like a good choice for mtry, and 500 a good choice for ntree on this dataset and train/test split. 139 | 140 | ****** 141 | ### Problem 8 142 | > In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable. 143 | 144 | 145 | #### Part a 146 | > Split the data set into a training set and a test set. 147 | 148 | ```{r} 149 | set.seed(1) 150 | train=sample(1:nrow(Carseats),nrow(Carseats)/2) 151 | library(tree) 152 | Carseats.train=Carseats[train,] 153 | Carseats.test=Carseats[-train,] 154 | ``` 155 | 156 | #### Part b 157 | > Fit a regression tree to the training set. Plot the tree, and interpret the results. What test error rate do you obtain? 158 | 159 | ```{r fig.width=11, fig.height=11} 160 | tree.carseats=tree(Sales~.,Carseats.train) 161 | summary(tree.carseats) 162 | plot(tree.carseats) 163 | text(tree.carseats,pretty=0) 164 | sales.est=predict(tree.carseats,Carseats.test) 165 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/sum((Carseats.test$Sales-mean(Carseats.test$Sales))^2)) 166 | test.R2 167 | ``` 168 | 169 | #### Part c 170 | > Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test error rate? 171 | 172 | ```{r fig.width=11, fig.height=11} 173 | cv.carseats=cv.tree(tree.carseats) 174 | plot(cv.carseats$size,cv.carseats$dev,type="b") 175 | min.carseats=which.min(cv.carseats$dev) 176 | #8 is min 177 | prune.carseats=prune.tree(tree.carseats,best=min.carseats) 178 | plot(prune.carseats) 179 | text(prune.carseats ,pretty=0) 180 | sales.est=predict(prune.carseats,Carseats.test) 181 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/sum((Carseats.test$Sales-mean(Carseats.test$Sales))^2)) 182 | test.R2 183 | ``` 184 | 185 | The error rate is actually not better with the pruned tree.. interesting. 186 | 187 | #### Part d 188 | > Use the bagging approach in order to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important. 
189 | 190 | ```{r fig.width=11, fig.height=11} 191 | library(randomForest) 192 | set.seed(1) 193 | bag.carseats=randomForest(Sales~.,data=Carseats,subset=train, 194 | mtry=ncol(Carseats)-1,importance =TRUE) 195 | importance(bag.carseats) 196 | varImpPlot(bag.carseats) 197 | sales.est=predict(bag.carseats,Carseats.test) 198 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/sum((Carseats.test$Sales-mean(Carseats.test$Sales))^2)) 199 | test.R2 200 | ``` 201 | 202 | #### Part e 203 | > Use random forests to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained. 204 | 205 | ```{r} 206 | rf.carseats=randomForest(Sales~.,data=Carseats,subset=train,importance=T) 207 | importance(rf.carseats) 208 | 209 | mtotry=2:6 210 | errs=rep(NA,length(mtotry)) 211 | for(i in 1:length(mtotry)){ 212 | m=mtotry[i] 213 | rf.carseats=randomForest(Sales~.,data=Carseats, 214 | subset=train,mtry=mtotry[i], 215 | importance=T) 216 | sales.est=predict(rf.carseats,Carseats.test) 217 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/ 218 | sum((Carseats.test$Sales- 219 | mean(Carseats.test$Sales))^2)) 220 | errs[i]=test.R2 221 | } 222 | errs 223 | ``` 224 | 225 | 226 | **** 227 | ### Problem 9 228 | > This problem involves the OJ data set which is part of the ISLR package. 229 | 230 | 231 | #### Part a 232 | > Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations. 233 | 234 | ```{r} 235 | set.seed(10) 236 | train=sample(1:nrow(OJ),800) 237 | train.OJ=OJ[train,] 238 | test.OJ=OJ[-train,] 239 | 240 | ``` 241 | 242 | #### Part b 243 | > Fit a tree to the training data, with Purchase as the response and the other variables except for Buy as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have? 244 | 245 | ```{r} 246 | tree.oj=tree(Purchase~.,data=train.OJ) 247 | summary(tree.oj) 248 | ``` 249 | 250 | The training error rate is 0.1625, and there are 7 terminal nodes. 251 | 252 | 253 | #### Part c 254 | > Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed. 255 | 256 | ```{r} 257 | tree.oj 258 | ``` 259 | 260 | Node 4 shows the split which occures of LoyalCH is less first less than 0.45956 and then less than 0.276142. The predicted outcome is MM. There is a deviance of 100. Smaller values of deviance ar indicative of how pure this node is. Finally there is the probability confidence bound on this prediction. 261 | 262 | #### Part d 263 | > Create a plot of the tree, and interpret the results. 264 | 265 | ```{r fig.width=11, fig.height=11} 266 | plot(tree.oj) 267 | text(tree.oj,pretty=0) 268 | ``` 269 | 270 | #### Part e 271 | > Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate? 272 | 273 | ```{r} 274 | preds=predict(tree.oj,test.OJ,type="class") 275 | table(test.OJ$Purchase,preds) 276 | test.err=(155+66)/(155+22+27+66) 277 | test.err 278 | ``` 279 | 280 | #### Part f 281 | > Apply the cv.tree() function to the training set in order to determine the optimal tree size. 
282 | 283 | ```{r} 284 | cv.oj=cv.tree(tree.oj,FUN=prune.misclass) 285 | ``` 286 | 287 | #### Part g 288 | > Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis. 289 | 290 | ```{r fig.width=11, fig.height=11} 291 | plot(cv.oj$size ,cv.oj$dev ,type="b") 292 | ``` 293 | 294 | 295 | #### Part h 296 | > Which tree size corresponds to the lowest cross-validated classification error rate? 297 | 298 | ```{r} 299 | msize=cv.oj$size[which.min(cv.oj$dev)] 300 | msize 301 | ``` 302 | 303 | #### Part i 304 | > Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes. 305 | 306 | ```{r} 307 | prune.oj=prune.misclass(tree.oj,best=msize) 308 | ``` 309 | 310 | #### Part j 311 | > Compare the training error rates between the pruned and un- pruned trees. Which is higher? 312 | 313 | ```{r} 314 | 315 | prune.pred=predict(prune.oj,test.OJ,type="class") 316 | table(prune.pred,test.OJ$Purchase) 317 | ``` 318 | 319 | 320 | #### Part k 321 | > Compare the test error rates between the pruned and unpruned trees. Which is higher? 322 | 323 | ```{r} 324 | prune.test.err=(151+68)/(151+68+26+25) 325 | 1-prune.test.err 326 | 1-test.err 327 | ``` 328 | 329 | The classification accurazy is slightly worse in the pruned tree. 330 | 331 | **** 332 | ### Problem 10 333 | > We now use boosting to predict Salary in the Hitters data set. 334 | 335 | #### Part a 336 | > Remove the observations for whom the salary information is unknown, and then log-transform the salaries. 337 | 338 | ```{r} 339 | H=Hitters[!is.na(Hitters$Salary),,drop=F] 340 | H$Salary=log(H$Salary) 341 | ``` 342 | 343 | 344 | #### Part b 345 | > Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations. 346 | 347 | ```{r} 348 | H.train=H[1:200,] 349 | H.test=H[201:nrow(H),] 350 | ``` 351 | 352 | #### Part c 353 | > Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter $\lambda$. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis. 354 | 355 | ```{r,fig.width=7,fig.height=7} 356 | library(gbm) 357 | set.seed(1) 358 | shrinkage=c(0.00001,0.0001,0.001,0.01,0.1,1) 359 | errs=rep(NA,length(shrinkage)) 360 | for (i in 1:length(shrinkage)){ 361 | s=shrinkage[i] 362 | boost.H=gbm(Salary~., data=H.train, 363 | distribution="gaussian", 364 | n.trees=1000, 365 | shrinkage = s, 366 | interaction.depth=1, 367 | n.cores=10) 368 | yhat.boost=predict(boost.H,newdata=H.train, n.trees=1000) 369 | errs[i]=mean((yhat.boost-H.train$Salary)^2) 370 | } 371 | plot(log(shrinkage),errs) 372 | ``` 373 | 374 | #### Part d 375 | > Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis. 
376 | 377 | ```{r,fig.width=7,fig.height=7} 378 | library(gbm) 379 | set.seed(1) 380 | errs.test=rep(NA,length(shrinkage)) 381 | for (i in 1:length(shrinkage)){ 382 | s=shrinkage[i] 383 | boost.H=gbm(Salary~., data=H.train, 384 | distribution="gaussian", 385 | n.trees=1000, 386 | shrinkage = s, 387 | interaction.depth=1, 388 | n.cores=10) 389 | yhat.boost=predict(boost.H,newdata=H.test, n.trees=1000) 390 | errs.test[i]=mean((yhat.boost-H.test$Salary)^2) 391 | } 392 | plot(log(shrinkage),errs.test) 393 | ``` 394 | 395 | #### Part e 396 | > Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6. 397 | 398 | ```{r fig.width=11, fig.height=11} 399 | boost.H=gbm(Salary~., data=H.train, 400 | distribution="gaussian", 401 | n.trees=1000, 402 | shrinkage = shrinkage[which.min(errs)], 403 | interaction.depth=1, 404 | n.cores=10) 405 | 406 | boost.mse=errs[which.min(errs)] 407 | library(leaps) 408 | fit=regsubsets(Salary~.,data=H.train,nvmax=19) 409 | fit.summ=summary(fit) 410 | to.inc=fit.summ$which[which.min(fit.summ$cp),][2:20] 411 | features=c(features,"Division","Salary") 412 | fit.lm=lm(Salary~.,data=H.train[,colnames(H.train)%in%features]) 413 | yhat=predict(fit.lm,H.test[,colnames(H.train)%in%features]) 414 | best.sub=mean((yhat-H.test$Salary)^2) 415 | 416 | cols.bad=c("League","Division","NewLeague") 417 | n.H=model.matrix(~.,H)[,-1] 418 | n.H.train=n.H[1:200,] 419 | n.H.test=n.H[201:nrow(n.H),] 420 | 421 | library(glmnet) 422 | fit=cv.glmnet(n.H.train[,colnames(n.H)!="Sallary"],n.H.train[,"Salary"]) 423 | fit=glmnet(n.H.train[,colnames(n.H)!="Sallary"],n.H.train[,"Salary"],lambda=fit$lambda.1se) 424 | pred=predict(fit,n.H.test[,colnames(n.H)!="Sallary"]) 425 | best.lasso=mean((pred[,1]-H.test$Salary)^2) 426 | 427 | 428 | #boost 429 | boost.mse 430 | 431 | #Best subset lm: 432 | best.sub 433 | 434 | #best lasso: 435 | best.lasso 436 | 437 | ``` 438 | 439 | the lasso is the best by a really little bit on the test data, but boosting came in close. 440 | 441 | #### Part f 442 | > Which variables appear to be the most important predictors in the boosted model? 443 | 444 | ```{r fig.width=11, fig.height=11} 445 | summary(boost.H) 446 | ``` 447 | 448 | CAtBat and PutOuts were the top two predictors by a lot. Next at about half of the importance was RBI and Walks. 449 | 450 | #### Part g 451 | > Now apply bagging to the training set. What is the test set MSE for this approach? 452 | 453 | ```{r} 454 | library(randomForest) 455 | set.seed(1) 456 | bag.H=randomForest(Salary~.,data=H.train, 457 | mtry=ncol(H.train)-1, 458 | importance=TRUE) 459 | preds=predict(bag.H,newdata=H.test) 460 | mean((preds-H.test$Salary)^2) 461 | ``` 462 | 463 | 464 | **** 465 | ### Problem 11 466 | > This question uses the Caravan data set. 467 | 468 | 469 | #### Part a 470 | > Create a training set consisting of the first 1,000 observations, 471 | and a test set consisting of the remaining observations. 472 | 473 | ```{r} 474 | train.C=Caravan[1:1000,] 475 | test.C=Caravan[1001:nrow(Caravan),] 476 | ``` 477 | 478 | 479 | #### Part b 480 | > Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important? 
481 | 482 | ```{r fig.width=11, fig.height=11} 483 | boost.C=gbm(I(Purchase=="Yes")~., data=train.C, 484 | distribution="bernoulli", 485 | n.trees=1000, 486 | shrinkage = 0.01, 487 | interaction.depth=1, 488 | n.cores=10) 489 | 490 | summary(boost.C) 491 | 492 | The most important predictors are PPEARSAUT, MOPLHOOG and MKOOPKLA, followed pretty closely by a group of others. 493 | ``` 494 | 495 | #### Part c 496 | > Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20 %. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set? 497 | 498 | ```{r} 499 | preds=predict(boost.C,test.C,type="response",n.trees=1000) 500 | yhat=ifelse(preds>.2,"Yes","No") 501 | table(yhat,test.C$Purchase) 502 | #the following is the fraction of people predicted to make 503 | # a purchase who actually do 504 | 34/154 505 | 506 | ``` 507 | 508 | **** 509 | ### Problem 12 510 | > Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance? 511 | 512 | 513 | 514 | 515 | -------------------------------------------------------------------------------- /R_Labs/Lab1/Lab1.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning R Lab 1 2 | ======================================================== 3 | 4 | Here is an example of making a randomly correlated variable: 5 | 6 | ```{r} 7 | x=rnorm(50) 8 | y=x+rnorm(50, mean=50, sd=.1) 9 | cor(x,y) 10 | ``` 11 | 12 | And here is a visualization of this correlation 13 | 14 | ```{r fig.width=7, fig.height=6} 15 | plot(x,y) 16 | ``` 17 | 18 | Here is a sequence of 60 numbers between -pi and pi 19 | ```{r} 20 | x=seq(-pi,pi,length=50) 21 | ``` 22 | 23 | Now exploring contour 24 | 25 | ```{r fig.width=7, fig.height=6} 26 | y=x 27 | f=outer(x,y,function(x,y)cos(y)/(1+x^2)) 28 | contour(x,y,f) 29 | contour(x,y,f,nlevels=45,add=T) 30 | ``` 31 | 32 | And another contor figure, this time with a new transform of the data 33 | 34 | ```{r fig.width=7, fig.height=6} 35 | fa=(f-t(f))/2 36 | contour(x,y,f,nlevels=15) 37 | ``` 38 | 39 | We can also do this with an "image" plot to show a heatmap 40 | ```{r fig.width=7, fig.height=6} 41 | image(x,y,fa) 42 | ``` 43 | 44 | Or a 3d perspective plot 45 | ```{r fig.width=7, fig.height=6} 46 | persp(x,y,fa) 47 | ``` 48 | 49 | Which can be rotated, I think, with theta 50 | ```{r fig.width=7, fig.height=6} 51 | persp(x,y,fa, theta=30) 52 | ``` 53 | 54 | And further rotated with phi 55 | ```{r fig.width=7, fig.height=6} 56 | persp(x,y,fa, theta=30, phi=20) 57 | ``` 58 | 59 | Another angle at phi=70 60 | 61 | ```{r fig.width=7, fig.height=6} 62 | persp(x,y,fa, theta=30, phi=70) 63 | ``` 64 | 65 | And once again with phi=40 66 | ```{r fig.width=7, fig.height=6} 67 | persp(x,y,fa, theta=30, phi=40) 68 | ``` 69 | 70 | 71 | Loading data 72 | -------- 73 | 74 | Here are different ways to load data into R. interesting use of the `fix()` command which loads the data into a spreadsheet like window. 
I commented this out because it blocks execution and doesn't put anything into the result, but it looks like a spreadsheet page. 75 | 76 | ```{r} 77 | Auto=read.table("~/src/IntroToStatisticalLearningR/data/Auto.data",header=T,na.strings="?") 78 | dim(Auto) 79 | Auto=na.omit(Auto) 80 | dim(Auto) 81 | 82 | ``` 83 | 84 | You can also attach data so that it is easier to plot and whatnot, turning cylindars into a dataframe makes the output a little nicer, a boxplot actually! 85 | 86 | ```{r fig.width=7, fig.height=6} 87 | attach(Auto) 88 | cylinders=as.factor(cylinders) 89 | plot(cylinders, mpg) 90 | ``` 91 | 92 | Lets slap some lipstick on this plot! 93 | ```{r fig.width=7, fig.height=6} 94 | plot(cylinders, mpg, col="red", varwidth=T, xlab="cylinders ", ylab="MPG") 95 | ``` 96 | 97 | And we can do a histogram on this variable as well. `col=2` is the same as `col="red"` btw. 98 | ```{r fig.width=7, fig.height=6} 99 | hist(mpg,col=2,breaks=15) 100 | ``` 101 | 102 | We could also do a pairs plot which is really nice. 103 | ```{r fig.width=7, fig.height=6} 104 | pairs(Auto) 105 | ``` 106 | 107 | And we can limit the pairing to specific variables. 108 | ```{r fig.width=7, fig.height=6} 109 | pairs(~mpg + displacement + horsepower + weight + acceleration, Auto) 110 | ``` 111 | 112 | Now this is cool, you can plot variables, and then run identify. In rstudio you click on points you are interested in, then hit `ESC` on the keyboard. This will then give you a list of the points, and name the points on the figure! Cant show it here. 113 | 114 | ``` 115 | plot(horsepower, mpg) 116 | identify(horsepower,mpg,name) 117 | ``` 118 | 119 | 120 | FIN! 121 | 122 | -------------------------------------------------------------------------------- /R_Labs/Lab2/Lab2.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning in R Lab 2 (Chapter 3) 2 | ======================================================== 3 | 4 | Load some R libs 5 | 6 | ```{r} 7 | library(MASS) 8 | library(ISLR) 9 | lm.fit=lm(medv~lstat,data=Boston) 10 | summary(lm.fit) 11 | ``` 12 | 13 | Show some QC figures on this lm fit to the data. As expected there is a very significant correlation between median value of homes in a neighborhood, and proportion of the population that is lower status. There is a lot of interesting stuff here. For example we see that there are some points with high leverage that are also nearly 3 on the residuals axis. **Not sure what absolute values of leverage are considered really high, I will need to look into the chapter more deeply for that. Looks like Cook's distance doesn't come into play, would those be the particularly dangerous points? Perhaps that is the key determinate.** From earlier in the chapter, just before the section on colinearity, it says that the average value of leverage is `(p+1)/n`. Also it always varies between 0 and 1. Figure 3.13 which shows an example of a dangerous situation, has a point with a leverage of a little over 0.25. In this case we are looking at one variable, so I think `p=1 n=506` either `r 2/506` or `r 3/506` should be the mean value of the leverage. We have a few points that are around 0.025 which is an order of magnitude higher than what we should expect for the mean, not sure how significant this is though. 
Here are some interesting online resources: http://www.statmethods.net/stats/rdiagnostics.html 14 | 15 | ```{r fig.width=11, fig.height=11} 16 | par(mfrow=c(2,2)) 17 | plot(lm.fit) 18 | ``` 19 | 20 | And we can do a confidence interval on the coefficients. Remember that confidence intervals and prediction intervals are different. Prediction intervals as we will see are more conservative. 21 | 22 | ```{r} 23 | confint(lm.fit) 24 | 25 | predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval="confidence") 26 | predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval="prediction") 27 | ``` 28 | 29 | Here is a plot of the points, with the fitted line 30 | ```{r fig.width=7, fig.height=5} 31 | plot(Boston$lstat,Boston$medv) 32 | abline(lm.fit, lwd=3, col="red") 33 | ``` 34 | 35 | Plot of predicted fit vs residuals and studentized versions. Remember that the sutentized residuals are useful for determining outliers. Studentized residuals are devided by the standard error to make something like a Z score, so you can look at `sigma` levels. Values greater than 3 on studentized residuals are poential outliers. That corresponds to 3 `sigma` I think. 36 | 37 | ```{r fig.width=11, fig.height=5} 38 | par(mfrow=c(1,2)) 39 | plot(predict(lm.fit), residuals(lm.fit), main="Residuals") 40 | plot(predict(lm.fit), rstudent(lm.fit), main="Student Fit") 41 | ``` 42 | 43 | There is some evidence of non-linearity based on the resudials plot. Lets look into the leverage statistic using the hatvals function. 44 | 45 | ```{r fig.width=7, fig.height=5} 46 | plot(hatvalues(lm.fit)) 47 | ``` 48 | 49 | To see which index point has the max leverage, we can use which.max to return the index of the max value. 50 | ```{r} 51 | which.max(hatvalues(lm.fit)) 52 | ``` 53 | 54 | Multiple Linear Regression 55 | -------------------------- 56 | ```{r} 57 | lm.fit = lm(medv~lstat+age, data=Boston) 58 | summary(lm.fit) 59 | ``` 60 | 61 | We can also look at all variables at the same time with this shorthand syntax. 62 | ```{r} 63 | lm.fit = lm(medv~., data=Boston) 64 | summary(lm.fit) 65 | ``` 66 | 67 | Interesting that when we include all variables, age, which used to be significant, is no longer called that way. 68 | 69 | One cool thing is that we can access certain elements of the lm summary like so: 70 | ```{r} 71 | summary(lm.fit)$r.sq 72 | ``` 73 | 74 | Also the vif function from the car package can calcualte soemthing called the "variance inflation" factor. **How do we know what a good vs bad inflation factor is?** 75 | ```{r} 76 | library(car) 77 | vif(lm.fit) 78 | ``` 79 | 80 | Let's try a regression excluding one variable. There are two ways to do this, we can specify a new regression with the special -age syntax after the . syntax, otherwise we can use the "update" method to update the previous fit removing the age variable. We can remove multiple variables with this syntax as well. 81 | 82 | ```{r} 83 | lm.fit1=lm(medv~.-age,data=Boston) 84 | summary(lm.fit1) 85 | lm.fit2=update(lm.fit, ~.-age) 86 | summary(lm.fit2) 87 | lm.fit3=lm(medv~.-age-indus,data=Boston) 88 | summary(lm.fit3) 89 | ``` 90 | 91 | And here is a plot including the significant variables identified previously (-age,-industry) 92 | ```{r fig.width=11, fig.height=11} 93 | par(mfrow=c(2,2)) 94 | plot(lm.fit3) 95 | ``` 96 | 97 | 98 | Interaction Terms 99 | ---------------- 100 | When we use `lstat*age` in the formula, the individual terms `lstat` and `age` are automatically included. 
So for example we can do this to explore the version of lstat and age as interactive terms. 101 | 102 | ```{r} 103 | summary(lm(medv~lstat*age,data=Boston)) 104 | 105 | ``` 106 | 107 | Non-linear transformations of predictors 108 | ------------------- 109 | We can look at things like something squared using the `I()` function which helps when you want to use a special symbol in your equation. 110 | ```{r} 111 | lm.fit2=lm(medv~lstat+I(lstat^2), data=Boston) 112 | summary(lm.fit2) 113 | ``` 114 | 115 | We can use the `anova()` function to further quantify how much better the quadratic fit is superior to the linear fit. 116 | ```{r} 117 | lm.fit=lm(medv~lstat,data=Boston) 118 | anova(lm.fit,lm.fit2) 119 | ``` 120 | 121 | And here is a plot of the fits for lstat^2 122 | ```{r fig.width=11, fig.height=11} 123 | par(mfrow=c(2,2)) 124 | plot(lm.fit2) 125 | ``` 126 | 127 | We can also fit higher order polynomials using the `poly()` function, which does the work of writing out all of the decreasing polynomial terms for us given a variable, and the numbers of polynomials you want to fit. We could also do something like a log transform in the linear model 128 | 129 | ```{r} 130 | lm.fit5=lm(medv~poly(lstat,5),data=Boston) 131 | summary(lm.fit5) 132 | summary(lm(medv~log(rm),data=Boston)) 133 | ``` 134 | 135 | Qualitative Predictors 136 | ------------------ 137 | 138 | R actually automatically will create dummy variables when you have a predictor. We also add a few specific interaction terms to the full model, namely Income and Advertising, along with Price and Age. 139 | 140 | ```{r} 141 | lm.fit=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats) 142 | summary(lm.fit) 143 | ``` 144 | 145 | We can see what the dummy coding is for a variable using the `contrasts()` function. 146 | 147 | ```{r} 148 | contrasts(Carseats$ShelveLoc) 149 | ``` 150 | 151 | Cool! contrasts can be set for any factor, you can use the above function to specify what the contrasts are for a given factor in a dataframe. This sounds very handy. 152 | 153 | So to interpret the output of the `lm` above with interpreting the dummy variables, keep in mind that `ShelveLocGood` being a positive and significant association represents an improvement over the default case, `ShelveLocBad`, similarly `ShelveLocMedium` represents a slightly less, yet still positive and significant improvement. 
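One more small illustration (a sketch of my own, not part of the lab; `Carseats2` is just a throwaway name): `relevel()` changes which level is the baseline, which changes the dummy coding `lm()` uses and therefore how the coefficients read.

```{r}
# Sketch: re-basing a factor with relevel() changes the dummy coding, so the
# ShelveLoc coefficients become contrasts against "Medium" instead of "Bad".
library(ISLR)
Carseats2=Carseats
Carseats2$ShelveLoc=relevel(Carseats2$ShelveLoc,ref="Medium")
contrasts(Carseats2$ShelveLoc)
coef(lm(Sales~ShelveLoc,data=Carseats2))
```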
154 | 155 | 156 | ```{r} 157 | LoadLibraries <- function(){ 158 | library(ISLR) 159 | library(MASS) 160 | print("The libraries have been loaded.") 161 | } 162 | LoadLibraries() 163 | ``` 164 | 165 | -------------------------------------------------------------------------------- /R_Labs/Lab3/Lab3.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning 2 | ======================================================== 3 | 4 | 5 | 6 | ```{r} 7 | library(ISLR) 8 | names(Smarket) 9 | dim(Smarket) 10 | summary(Smarket) 11 | cor(Smarket[-9]) 12 | ``` 13 | 14 | Plot of volume 15 | 16 | ```{r fig.width=7, fig.height=6} 17 | attach(Smarket) 18 | plot(Volume) 19 | ``` 20 | 21 | 22 | ##Logistic Regression 23 | ```{r} 24 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5, data=Smarket, family=binomial) 25 | summary(glm.fit) 26 | coef(glm.fit) 27 | #all coefficient summaries 28 | summary(glm.fit)$coef 29 | #probabilities 30 | summary(glm.fit)$coef[,4] 31 | glm.probs=predict(glm.fit,type="response") 32 | glm.probs[1:10] 33 | contrasts(Direction) 34 | ``` 35 | 36 | Note that the p values in the probabilities are for the stock market going up, since we can see that the dummy variable reproted by `contrasts(Direction)` has 1 on `Up`, and 0 on `Down`. 37 | 38 | Come up with predictions based on this model (on the training data), and make a confusion matrix. 39 | ```{r} 40 | glm.pred=rep("Down",1250) 41 | glm.pred[glm.probs>.5]="Up" 42 | table(glm.pred,Direction) 43 | (550+116)/1250 44 | mean(glm.pred==Direction) 45 | ``` 46 | 47 | However this is the training error rate, lets create a test set and try again. 48 | 49 | ```{r} 50 | train=(Year<2005) 51 | Smarket.2005=Smarket[!train,] 52 | dim(Smarket.2005) 53 | Direction.2005=Direction[!train] 54 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket,family=binomial,subset=train) 55 | glm.probs=predict(glm.fit,Smarket.2005,type="response") 56 | glm.pred=rep("Down",252) 57 | glm.pred[glm.probs>0.5]="Up" 58 | table(glm.pred,Direction.2005) 59 | mean(glm.pred==Direction.2005) 60 | mean(glm.pred!=Direction.2005) 61 | ``` 62 | 63 | Yikes, the last line with `!=Direction.2005` computes the test error, which is worse than random chance! Oh well.. 64 | 65 | Lets see what happens if we get rid of some of the lower p-value predictors in the model, those likely contribute noise. 66 | 67 | ```{r} 68 | glm.fit=glm(Direction~Lag1+Lag2, data=Smarket,family=binomial,subset=train) 69 | glm.probs=predict(glm.fit,Smarket.2005,type="response") 70 | glm.pred=rep("Down",252) 71 | glm.pred[glm.probs>0.5]="Up" 72 | table(glm.pred,Direction.2005) 73 | mean(glm.pred==Direction.2005) 74 | mean(glm.pred!=Direction.2005) 75 | ``` 76 | 77 | To predict on some new days, say two days, you can do the following: 78 | ```{r} 79 | predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5), Lag2=c(1.1,-0.8)),type="response") 80 | ``` 81 | 82 | We would guess that the market is going to go down these days. 
83 | 84 | ### LDA 85 | ```{r} 86 | library(MASS) 87 | lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train) 88 | lda.fit 89 | ``` 90 | 91 | ```{r fig.height=5, fig.width=7} 92 | plot(lda.fit) 93 | ``` 94 | 95 | ```{r} 96 | lda.pred=predict(lda.fit, Smarket.2005) 97 | names(lda.pred) 98 | lda.class=lda.pred$class 99 | table(lda.class,Direction.2005) 100 | mean(lda.class==Direction.2005) 101 | sum(lda.pred$posterior[,1]>=.5) 102 | sum(lda.pred$posterior[,1]<.5) 103 | lda.pred$posterior[1:20,1] 104 | lda.class[1:20] 105 | sum(lda.pred$posterior[,1]>.54) 106 | max(lda.pred$posterior[,1]) 107 | min(lda.pred$posterior[,1]) 108 | ``` 109 | 110 | ### QDA 111 | ```{r} 112 | qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train) 113 | qda.fit 114 | qda.class=predict(qda.fit,Smarket.2005)$class 115 | table(qda.class,Direction.2005) 116 | mean(qda.class==Direction.2005) 117 | ``` 118 | 119 | 120 | 60% accuracy on stock market data? wow. There is no default `plot()` that takes qda.fitted results. 121 | 122 | ### K-Nearest Neighbors 123 | ```{r} 124 | library(class) 125 | train.X=cbind(Lag1,Lag2)[train,] 126 | test.X=cbind(Lag1,Lag2)[!train,] 127 | train.Direction=Direction[train] 128 | set.seed(1) 129 | knn.pred=knn(train.X,test.X,train.Direction,k=1) 130 | table(knn.pred,Direction.2005) 131 | mean(knn.pred==Direction.2005) 132 | 133 | knn.pred=knn(train.X,test.X,train.Direction,k=3) 134 | table(knn.pred,Direction.2005) 135 | mean(knn.pred==Direction.2005) 136 | ``` 137 | 138 | ### Caravan Insurance Data 139 | Caravan insurance is just insurance for caravans... Thought it might be something else for some reason. 140 | 141 | ```{r} 142 | dim(Caravan) 143 | attach(Caravan) 144 | sp=summary(Purchase) 145 | sp 146 | sp["Yes"]/sum(sp) 147 | ``` 148 | 149 | Since KNN is based on distance, and different variables can have very different scales, things need to be scaled. Consider salary and age, salary can change in thousands easily, age will mostly range 0-100. Consider that a salary difference of 1000 should be small compared to an age difference of 50 years, need to _standardize_ the data so that KNN knows these scales. 150 | 151 | 152 | The scale function in R does this automagically! 153 | 154 | Col 86 is Purchase which is qualative, and will be left out for this. 155 | ```{r} 156 | standardized.X=scale(Caravan[-86]) 157 | var(Caravan[,1]) 158 | var(Caravan[,2]) 159 | var(standardized.X[,1]) 160 | var(standardized.X[,2]) 161 | 162 | test=1:1000 163 | train.X=standardized.X[-test,] 164 | test.X=standardized.X[test,] 165 | train.Y=Purchase[-test] 166 | test.Y=Purchase[test] 167 | 168 | set.seed(1) 169 | knn.pred=knn(train.X,test.X,train.Y,k=1) 170 | mean(test.Y!=knn.pred) 171 | mean(test.Y!="No") 172 | ``` 173 | 174 | Keep in mind that although 12% error sounds really good, if we just always predicted "No" we would have only 6% error. 175 | 176 | ```{r} 177 | table(knn.pred,test.Y) 178 | 9/(68+9) 179 | 180 | knn.pred=knn(train.X,test.X,train.Y,k=3) 181 | table(knn.pred,test.Y) 182 | 5/26 183 | 184 | knn.pred=knn(train.X,test.X,train.Y,k=5) 185 | table(knn.pred,test.Y) 186 | 4/15 187 | 188 | ``` 189 | 190 | We could also try with logistic regression. Since there are so few positives, the predictor doesn't do so well with the default probability cutoff of 0.5 if we are hoping to identify the people we want to spend time trying to sell insurance to. 
191 | 192 | ```{r} 193 | glm.fit=glm(Purchase~.,data=Caravan,family=binomial,subset=-test) 194 | glm.probs=predict(glm.fit,Caravan[test,],type="response") 195 | glm.pred=rep("No",1000) 196 | glm.pred[glm.probs >.5]="Yes" 197 | table(glm.pred,test.Y) 198 | #yikes, all of our guesses on "yes" are wrong! 199 | 200 | #try with a different p cutoff 201 | glm.pred=rep("No",1000) 202 | glm.pred[glm.probs >.25]="Yes" 203 | table(glm.pred,test.Y) 204 | 11/(22+11) 205 | ``` 206 | 207 | ```{r fig.height=5,fig.width=7} 208 | plot(factor(test.Y),glm.probs) 209 | abline(0.25,0,col="red",lwd=2) 210 | ``` 211 | Probably a better plot would be one that shows the TP rate vs the FP rate or something, bet it has a dip around 0.25 or something. 212 | -------------------------------------------------------------------------------- /R_Labs/Lab4/Lab4.rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Lab 4: Cross validation and the bootstrap 2 | ======================================================== 3 | 4 | ************* 5 | ## Train/Test split 6 | ```{r} 7 | library(ISLR) 8 | set.seed(1) 9 | train=sample(392,196) 10 | lm.fit=lm(mpg~horsepower,data=Auto,subset=train) 11 | attach(Auto) 12 | mean((mpg-predict(lm.fit,Auto))[-train]^2) 13 | 14 | lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train) 15 | mean((mpg-predict(lm.fit2,Auto))[-train]^2) 16 | 17 | lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train) 18 | mean((mpg-predict(lm.fit3,Auto))[-train]^2) 19 | 20 | set.seed(2) 21 | train=sample(392,196) 22 | lm.fit=lm(mpg~horsepower,data=Auto,subset=train) 23 | mean((mpg-predict(lm.fit,Auto))[-train]^2) 24 | 25 | lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train) 26 | mean((mpg-predict(lm.fit2,Auto))[-train]^2) 27 | 28 | lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train) 29 | mean((mpg-predict(lm.fit3,Auto))[-train]^2) 30 | ``` 31 | 32 | Little improvement for cubic function, but quadratic improves over linear. Interesting that different test sets perform so differently. 33 | 34 | ************* 35 | ## LOOCV 36 | 37 | So glm has a nice cv function which is handy. Note that GLM can do linear models as well as a bunch of others, so it can be a drop in replacement for lm. 38 | 39 | ```{r} 40 | glm.fit=glm(mpg~horsepower,data=Auto) 41 | coef(glm.fit) 42 | #vs 43 | lm.fit=lm(mpg~horsepower,data=Auto) 44 | coef(lm.fit) 45 | ``` 46 | 47 | cv and all for glm is part of the `boot` library. 48 | 49 | ```{r} 50 | library(boot) 51 | glm.fit=glm(mpg~horsepower,data=Auto) 52 | cv.err=cv.glm(Auto,glm.fit) 53 | cv.err$delta 54 | ``` 55 | 56 | Lets use CV to find which polynomial fit is optimal for this horsepower data. Lets do this in parallelz b/c it takes a while otherwise. 57 | 58 | ```{r} 59 | library(multicore) 60 | cv.error=unlist(mclapply(1:5,function(i){ 61 | glm.fit=glm(mpg~poly(horsepower,i),data=Auto) 62 | cv.glm(Auto,glm.fit)$delta[1] 63 | }, mc.cores=5)) 64 | 65 | cv.error 66 | ``` 67 | 68 | ************* 69 | ## K fold CV 70 | We can also do this with k fold cv which goes faster. 
71 | 72 | ```{r} 73 | set.seed(17) 74 | cv.error.10=unlist(mclapply(1:10,function(i){ 75 | glm.fit=glm(mpg~poly(horsepower,i),data=Auto) 76 | cv.glm(Auto,glm.fit,K=10)$delta[1] 77 | }, mc.cores=10)) 78 | 79 | cv.error.10 80 | ``` 81 | 82 | from the text, this is an interesting point about what the two $delta values mean: 83 | 84 | > We saw in Section 5.3.2 that the two numbers associated with delta are essentially the same when LOOCV is performed. When we instead perform k-fold CV, then the two numbers associated with delta differ slightly. The first is the standard k-fold CV estimate, as in (5.3). The second is a bias- corrected version. On this data set, the two estimates are very similar to each other. 85 | 86 | Also note that cv.glm does not use the computational speed up that is possible for LOOCV with least-squares fit models given in equation formula 5.2. This would have actually made LOOCV faster than K-fold CV rather than the other way around! 87 | 88 | ************* 89 | ## The bootstrap 90 | 91 | ```{r} 92 | alpha.fn=function(data,index){ 93 | X=data$X[index] 94 | Y=data$Y[index] 95 | return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y))) 96 | } 97 | #using all 100 observations to get alpha: 98 | alpha.fn(Portfolio,1:100) 99 | 100 | #or we can sample bootstrap style 101 | set.seed(1) 102 | alpha.fn(Portfolio,sample(100,100,replace=T)) 103 | 104 | #and we can use the boot function to automate this thousands of times 105 | boot(Portfolio,alpha.fn,R=1000) 106 | ``` 107 | 108 | now were going to use the bootstrap to help determine accuracy of lm fit. 109 | 110 | ```{r} 111 | boot.fn=function(data,index){ 112 | return(coef(lm(mpg~horsepower,data=data,subset=index))) 113 | } 114 | 115 | #simply compute coefficient estimates 116 | boot.fn(Auto,1:392) 117 | 118 | set.seed(1) 119 | 120 | #one bootstrap round 121 | boot.fn(Auto,sample(392,392,replace=T)) 122 | 123 | #now do a thousand! 124 | boot(Auto,boot.fn,1000) 125 | 126 | #however in the simple case of linear regression, we can also get these 127 | # estimates with the summary() function from the fit itself 128 | # as was described in section 3.1.2 129 | summary(lm(mpg~horsepower,data=Auto))$coef 130 | ``` 131 | 132 | Interestingly, the formula given in equation 3.8 that the summary function uses to calculate the estimate of the beta standard errors rely on certain assumptions about the underlying data. Like the population $\sigma^2$ which is estimated from the RSS. This $\sigma^2$ relies on the model being correct! The non-linear relationship in the data causes inflated residuals and an inflated $\hat\sigma^2$. Also the standard formulas assume that $x_i$ are fixed and that $\epsilon_i$ is the sole source of variability, which is weird. The bootstrap does not have these assumptions, so it is probably more accurate in its estimates of the errors around $\hat\beta_0, \hat\beta_1$. 133 | 134 | Here is an example where the model is closer to the correct one, how the boostrap and summary estimates should be closer. 
135 | 136 | ```{r} 137 | boot.fn=function(data,index){ 138 | coefficients(lm(mpg~horsepower+I(horsepower^2), data=data, subset=index)) 139 | } 140 | 141 | set.seed(1) 142 | boot(Auto,boot.fn,1000) 143 | summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef 144 | ``` 145 | 146 | 147 | 148 | -------------------------------------------------------------------------------- /R_Labs/Lab5/Lab5.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Lab 5: Feature selection, subset selection, etc 2 | ======================================================== 3 | 4 | # Lab 1: Subset selection methods 5 | 6 | ## Best subset selection 7 | ```{r} 8 | library(ISLR) 9 | #fix(Hitters) 10 | names(Hitters) 11 | dim(Hitters) 12 | sum(is.na(Hitters$Salary)) 13 | Hitters=na.omit(Hitters) 14 | sum(is.na(Hitters)) 15 | ``` 16 | 17 | And now lets do best subset selection with the leaps library 18 | 19 | ```{r} 20 | library(leaps) 21 | regfit.full=regsubsets(Salary~.,Hitters) 22 | summary(regfit.full) 23 | regfit.full=regsubsets(Salary~.,data=Hitters,nvmax=19) 24 | reg.summary=summary(regfit.full) 25 | names(reg.summary) 26 | reg.summary$rsq 27 | ``` 28 | 29 | Some r plots showing various optimal numbers of features given different penalties for overfitting. 30 | ```{r fig.height=11,fig.width=11} 31 | par(mfrow=c(2,2)) 32 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 33 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 34 | points(which.max(reg.summary$adjr2), 35 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 36 | col="red",cex=2,pch=20) 37 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 38 | type="l") 39 | points(which.min(reg.summary$cp), 40 | reg.summary$cp[which.min(reg.summary$cp)], 41 | col="red",cex=2,pch=20) 42 | plot(reg.summary$bic,xlab="Number of Variables", 43 | ylab="BIC", type="l") 44 | points(which.min(reg.summary$bic), 45 | reg.summary$bic[which.min(reg.summary$bic)], 46 | col="red",cex=2,pch=20) 47 | ``` 48 | 49 | We can also do the default plots for this package which show which things were selected given different numbers of features and different penalty terms. 50 | ```{r fig.height=11,fig.width=11} 51 | par(mfrow=c(2,2)) 52 | plot(regfit.full,scale="r2") 53 | plot(regfit.full,scale="adjr2") 54 | plot(regfit.full,scale="Cp") 55 | plot(regfit.full,scale="bic") 56 | ``` 57 | 58 | The coeficient function can take as an argument the model (in terms of number of features) and it will output the coefficient estimates for that model. 59 | ```{r} 60 | coef(regfit.full,6) 61 | ``` 62 | 63 | 64 | ## Forward and backward stepwise selection 65 | 66 | ```{r} 67 | regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward") 68 | summary(regfit.fwd) 69 | regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward") 70 | summary(regfit.bwd) 71 | coef(regfit.full,7) 72 | coef(regfit.fwd,7) 73 | coef(regfit.bwd,7) 74 | ``` 75 | Note how different selection methods produce different sets of data. 76 | 77 | ## Choosing among models using the validation set approach and cv. 
78 | 79 | ```{r} 80 | set.seed(1) 81 | train=sample(c(TRUE,FALSE),nrow(Hitters),rep=TRUE) 82 | test=(!train) 83 | regfit.best=regsubsets(Salary~.,data=Hitters[train,],nvmax=19) 84 | test.mat=model.matrix(Salary~.,data=Hitters[test,]) 85 | val.errors=rep(NA,19) 86 | for(i in 1:19){ 87 | coefi=coef(regfit.best,id=i) 88 | pred=test.mat[,names(coefi)]%*%coefi # dot product of coeficients is the 89 | # prediction 90 | val.errors[i]=mean((Hitters$Salary[test]-pred)^2) 91 | } 92 | val.errors 93 | which.min(val.errors) 94 | coef(regfit.best,which.min(val.errors)) 95 | ``` 96 | 97 | Function to do the predicting we did above 98 | ```{r} 99 | predict.regsubsets=function(object,newdata,id,...){ 100 | form=as.formula(object$call[[2]]) ## extract formula 101 | mat=model.matrix(form,newdata) 102 | coefi=coef(object,id=id) 103 | xvars=names(coefi) 104 | mat[,xvars]%*%coefi 105 | } 106 | ``` 107 | 108 | 109 | ```{r} 110 | regfit.best=regsubsets(Salary~.,data=Hitters,nvmax=19) 111 | coef(regfit.best,10) 112 | ``` 113 | 114 | 115 | 116 | ### Now doing with cv 117 | 118 | ```{r} 119 | k=10 120 | set.seed(1) 121 | folds=sample(1:k,nrow(Hitters),replace=TRUE) 122 | cv.errors=matrix(NA,k,19,dimnames=list(NULL,paste(1:19))) 123 | 124 | for(j in 1:k){ 125 | best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],nvmax=19) 126 | for(i in 1:19){ 127 | pred=predict(best.fit,Hitters[folds==j,],id=i) 128 | cv.errors[j,i]=mean((Hitters$Salary[folds==j]-pred)^2) 129 | } 130 | } 131 | 132 | mean.cv.errors=apply(cv.errors,2,mean) 133 | mean.cv.errors 134 | which.min(mean.cv.errors) 135 | ``` 136 | So the above stores the k fold cv results in a matrix. For fold j, there are 19 optimal variable subset models to test (hence the 10X19 matrix). Now to find which performed the best, we average the error across each of the 10 cv rounds for a given number of variables that we are interested in testing in our model. Plotting these averages out, we see that 11 is the best model. 137 | 138 | ```{r fig.width=7,fig.height=5} 139 | par(mfrow=c(1,1)) 140 | plot(mean.cv.errors,type='b') 141 | points(which.min(mean.cv.errors),mean.cv.errors[which.min(mean.cv.errors)], 142 | col="red",cex=2,pch=20) 143 | ``` 144 | 145 | And now we train the best modle on all of the datas 146 | ```{r} 147 | reg.best=regsubsets(Salary~.,data=Hitters,nvmax=19) 148 | coef(reg.best,which.min(mean.cv.errors)) 149 | ``` 150 | 151 | ************************* 152 | # Lab 2: Ridge Regression and the Lasso 153 | 154 | ```{r} 155 | x=model.matrix(Salary~.,Hitters)[,-1] 156 | y=Hitters$Salary 157 | ``` 158 | Note that above model.matrix is being used for the side effect that it converts categorical variables into sets of dummy variables. So for example NewLeague could take on the value A and N. model.matrix took this, chose n, and made a new column called "NewLeagueN" with the binary values 0 and 1. This is required prior to running glmnet because it needs numerical /quantitative inputs. 159 | 160 | ## Ridge regression 161 | glmnet takes the alpha argument which you can use to tell it what kind of model to fit. For example alpha=1 is a lasso model, and alpha=0 is a ridge regression model. 162 | 163 | ```{r} 164 | library(glmnet) 165 | grid=10^seq(10,-2,length=100) 166 | #spreads out the range 10 to -2 to 100 167 | #equally spaced intermediate values 168 | ridge.mod=glmnet(x,y,alpha=0,lambda=grid) 169 | ``` 170 | 171 | By default glmnet does ridge regression on an automagically selected range of $\lambda$ values. 
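To see what that automatic sequence looks like, here is a minimal sketch (it rebuilds the design matrix under new names so the chunk stands on its own):

```{r}
# Sketch: with no lambda argument glmnet picks its own decreasing sequence,
# 100 values by default, with the range chosen from the data.
library(ISLR)
library(glmnet)
Hitters2=na.omit(Hitters)
x2=model.matrix(Salary~.,Hitters2)[,-1]
y2=Hitters2$Salary
ridge.default=glmnet(x2,y2,alpha=0)
length(ridge.default$lambda)   # 100 by default (nlambda)
range(ridge.default$lambda)    # data-driven range of the penalty
```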
172 | 173 | Glmnet also standardizes variables which may or may not be problematic, we may want to do that ourselves first in some way for example. To turn this setting off we can do `standardize=FALSE`. 174 | 175 | The coefficients are stored in there for each value of lambda in our previous grid. So this should be a #variable by 100 matrix. 176 | ```{r} 177 | dim(coef(ridge.mod)) 178 | ``` 179 | 180 | for the 50th lambda we can see some info 181 | 182 | ```{r} 183 | ridge.mod$lambda[50] 184 | coef(ridge.mod)[,50] 185 | #calculate the l2 norm by the following 186 | sqrt(sum(coef(ridge.mod)[-1,50]^2)) 187 | ``` 188 | 189 | vs when a lower value of lambda is used, 190 | ```{r} 191 | ridge.mod$lambda[60] 192 | coef(ridge.mod)[,60] 193 | #calculate the l2 norm by the following 194 | sqrt(sum(coef(ridge.mod)[-1,60]^2)) 195 | ``` 196 | 197 | We can use predict to get ridge regression coefficients for a new value of $\lambda=50$. 198 | ```{r} 199 | predict(ridge.mod,s=50,type="coefficients")[1:20,] 200 | ``` 201 | 202 | Note how as $\lambda$ gets smaller, fewer of the coefficients are nearly 0. Basically smaller $\lambda$'s mean lower constraints and the closer the model is to ordinary least-squares. 203 | 204 | Here is another method of doing subset selection, prior we did this with a vector of TRUE/FALSE, now we do with a list of indices. 205 | ```{r} 206 | set.seed(1) 207 | train=sample(1:nrow(x),nrow(x)/2) 208 | test=(-train) 209 | y.test=y[test] 210 | 211 | 212 | ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid,thresh=1e-12) 213 | ridge.pred=predict(ridge.mod,s=4,newx=x[test,]) 214 | mean((ridge.pred-y.test)^2) 215 | 216 | #if we fit with *only* the intercept, and no other beta coefficients 217 | # then the outcome would be the mean of the training data, and 218 | # it would just be the mean of the training cases. 219 | mean((mean(y[train])-y.test)^2) 220 | 221 | ### 222 | # we can also get this with a super-high lambda value, which 223 | # essentially sets all betas to nearly 0. 224 | ridge.pred=predict(ridge.mod,s=1e10,newx=x[test,]) 225 | mean((ridge.pred-y.test)^2) 226 | 227 | ridge.pred=predict(ridge.mod,s=0,newx=x[test,],exact=T) 228 | #need to use exact to get the answer close to least-squares due to 229 | # numerical approximation. 230 | mean((ridge.pred-y.test)^2) 231 | lm(y~x,subset=train) 232 | predict(ridge.mod,s=0,exact=T,type="coefficients")[1:20,] 233 | ``` 234 | 235 | Lets use CV and do some better selection of $\lambda$ 236 | 237 | ```{r fig.width=7,fig.height=5} 238 | set.seed(1) 239 | cv.out=cv.glmnet(x[train,],y[train],alpha=0) 240 | plot(cv.out) 241 | bestlam=cv.out$lambda.min 242 | bestlam 243 | ``` 244 | 245 | Lets see how this does on the test data! 246 | ```{r} 247 | ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,]) 248 | mean((ridge.pred-y.test)^2) 249 | ``` 250 | 251 | seems to perform better than $\lambda=4$. 252 | 253 | Let's see what the coefficients are like for the entire dataset now. 
254 | ```{r} 255 | out=glmnet(x,y,alpha=0) 256 | predict(out,type="coefficients",s=bestlam)[1:20,] 257 | ``` 258 | 259 | ## The Lasso 260 | 261 | ```{r fig.height=5,fig.width=7} 262 | lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid) 263 | plot(lasso.mod) 264 | ``` 265 | 266 | ```{r fig.height=5,fig.width=7} 267 | cv.out=cv.glmnet(x[train,],y[train],alpha=1) 268 | plot(cv.out) 269 | bestlam=cv.out$lambda.min 270 | bestlam 271 | lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,]) 272 | mean((lasso.pred-y.test)^2) 273 | out=glmnet(x,y,alpha=1,lambda=grid) 274 | lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,] 275 | lasso.coef 276 | lasso.coef[lasso.coef!=0] 277 | ``` 278 | 279 | Note that a bunch of the variables are exactly 0! Much easier to interpret, basically subset selection happened on the variables which is pretty awesome. 280 | 281 | The output best lasso model has only 7 variables, and discards 12! 282 | 283 | 284 | 285 | ************************* 286 | # Lab 3: PCR and PLS Regression 287 | 288 | ## Principal Components Regression 289 | ```{r fig.height=5, fig.width=7} 290 | library(pls) 291 | set.seed(2) 292 | pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation="CV") 293 | summary(pcr.fit) 294 | 295 | validationplot(pcr.fit, val.type="MSEP") 296 | ``` 297 | 298 | ```{r fig.height=5, fig.width=7} 299 | set.seed(1) 300 | pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, subset=train, validation="CV") 301 | 302 | validationplot(pcr.fit, val.type="MSEP") 303 | ``` 304 | 305 | 306 | ```{r} 307 | pcr.pred=predict(pcr.fit,x[test,],ncomp=7) 308 | mean((pcr.pred-y.test)^2) 309 | ``` 310 | 311 | 312 | Comparable performance to ridge regression and lasso, but harder to interpret b/c doesn't give us selected variables or coefficients! 313 | 314 | ```{r} 315 | pcr.fit=pcr(y~x,scale=TRUE,ncomp=7) 316 | summary(pcr.fit) 317 | ``` 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | -------------------------------------------------------------------------------- /R_Labs/Lab6/Lab6.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Lab 6 2 | ======================================================== 3 | 4 | ## Polynomial regression and step functions (and rug plots!) 5 | ```{r} 6 | library(ISLR) 7 | attach(Wage) 8 | fit=lm(wage~poly(age,4),data=Wage) 9 | coef(summary(fit)) 10 | ``` 11 | 12 | But this does a linear combination of the polynomials, not the raw polynomials! This is orthoganal though, so it is a different basis and the overal fit will be equivalent, but it does result in different coefficients! 13 | 14 | ```{r} 15 | library(ISLR) 16 | attach(Wage) 17 | fit2=lm(wage~poly(age,4,raw=T),data=Wage) 18 | coef(summary(fit2)) 19 | 20 | # or alternatively 21 | fit2a=lm(wage~age+I(age^2)+I(age^3)+I(age^4)) 22 | coef(fit2a) 23 | 24 | #and a third way! 
25 | fit2b=lm(wage~cbind(age,age^2,age^3,age^4)) 26 | coef(fit2b) 27 | 28 | 29 | ## 30 | # get predictions for a range of ages and std errors around those 31 | # predictions 32 | agelims=range(age) #returns the range of this list (2 values), lower->upper 33 | age.grid=seq(from=agelims[1],to=agelims[2]) 34 | preds=predict(fit,newdata=list(age=age.grid),se=TRUE) 35 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 36 | ``` 37 | 38 | ```{r,fig.width=11,fig.height=5} 39 | par(mfrow=c(1,2),mar=c(4.5,4.5,1,1),oma=c(0,0,4,0)) 40 | plot(age,wage,xlim=agelims,cex=0.5,col="darkgrey") 41 | title("Degree-4 Polynomial", outer=T) 42 | lines(age.grid,preds$fit,lwd=2,col="blue") 43 | matlines(age.grid,se.bands,lwd=1,col="blue",lty=3) 44 | ``` 45 | 46 | Note that the right panel of this figure will be filled in later. 47 | 48 | 49 | ```{r} 50 | ##nearly identical predictions from orthogonal and raw data 51 | preds2=predict(fit2,newdata=list(age=age.grid),se=TRUE) 52 | max(abs(preds$fit-preds2$fit)) 53 | ``` 54 | 55 | 56 | One way to decide which degree of polynomial to use is with a hypothesis test. Note that this requires that the models are nested for anova to be the same as the results we get out of the summary function to the lm fit. Anova is more general though. 57 | 58 | ```{r} 59 | fit.1=lm(wage~age,data=Wage) 60 | fit.2=lm(wage~poly(age,2),data=Wage) 61 | fit.3=lm(wage~poly(age,3),data=Wage) 62 | fit.4=lm(wage~poly(age,4),data=Wage) 63 | fit.5=lm(wage~poly(age,5),data=Wage) 64 | 65 | anova(fit.1,fit.2,fit.3,fit.4,fit.5) 66 | 67 | #alternatively we could just have looked at the p values on the coefficients for the 5th degree model. 68 | coef(summary(fit.5)) 69 | 70 | 71 | ## more general anova test 72 | fit.1=lm(wage~education+age,data=Wage) 73 | fit.2=lm(wage~education+poly(age,2),data=Wage) 74 | fit.3=lm(wage~education+poly(age,3),data=Wage) 75 | anova(fit.1,fit.2,fit.3) 76 | ``` 77 | 78 | Lets move on to predict which people make more than 250k. 79 | ```{r} 80 | fit=glm(I(wage>250)~poly(age,4),data=Wage,family=binomial) 81 | preds3=predict(fit,newdata=list(age=age.grid),se=T) 82 | 83 | ## Need to transform the SE estimates, we have a fit to a logit 84 | pfit=exp(preds3$fit)/(1+exp(preds3$fit)) 85 | se.bands.logit=cbind(preds3$fit+2*preds3$se.fit, preds3$fit-2*preds3$se.fit) 86 | se.bands2=exp(se.bands.logit)/(1+exp(se.bands.logit)) 87 | 88 | #alternatively we could have gotten this directly by saying 89 | # type="response" to the predict function: 90 | #preds=predict(fit,newdata=list(age=age.grid),type="response",se=T) 91 | # however in this case the confidence intervals are not sensible because 92 | # they should represent probabilities but can come out negative! 93 | # With the above transformation this is not an issue, and the probabilities 94 | # remain well behaved. 
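# (Added sketch of a sanity check: the manually transformed fit should match
#  the probabilities returned by predict(..., type="response").)
# all.equal(as.numeric(pfit),
#           as.numeric(predict(fit,newdata=list(age=age.grid),type="response")))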
95 | ``` 96 | 97 | Now for the full figure 7.1 plot 98 | ```{r,fig.width=11,fig.height=5} 99 | #previous section 100 | par(mfrow=c(1,2),mar=c(4.5,4.5,1,1),oma=c(0,0,4,0)) 101 | plot(age,wage,xlim=agelims,cex=0.5,col="darkgrey") 102 | title("Degree-4 Polynomial (matching fig 7.1)", outer=T) 103 | lines(age.grid,preds$fit,lwd=2,col="blue") 104 | matlines(age.grid,se.bands,lwd=1,col="blue",lty=3) 105 | 106 | #new data for prediction of income over 250k 107 | plot(age,I(wage>250),xlim=agelims,type="n",ylim=c(0,.2)) 108 | #add in density ticks (jitter helps with this) at top and bottom of panel (0 and 0.2 look good I guess) 109 | points(jitter(age),I((wage>250)/5),cex=.5,pch="|",col="darkgrey") 110 | 111 | #add in probability of being a really high wage earner, along with 112 | # std error on this polynomial logistic regression probability fit. 113 | lines(age.grid,pfit,lwd=2,col="blue") 114 | matlines(age.grid,se.bands2,lwd=1,col="blue",lty=3) 115 | ``` 116 | 117 | Note the above plot type is often called a "rug plot" 118 | 119 | 120 | Step functions can be fit with the cut function 121 | 122 | ```{r} 123 | table(cut(age,4)) 124 | fit=lm(wage~cut(age,4),data=Wage) 125 | coef(summary(fit)) 126 | ``` 127 | 128 | NOTE: 129 | > The age<33.5 category is left out, so the intercept coefficient of $94,160 can be interpreted as the average salary for those under 33.5 years of age, and the other coefficients can be interpreted as the average additional salary for those in the other age groups. We can produce predictions and plots just as we did in the case of the polynomial fit. 130 | 131 | ## Splines 132 | 133 | ```{r fig.height=5,fig.width=7} 134 | library(splines) 135 | fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage) 136 | pred=predict(fit,newdata=list(age=age.grid),se=T) 137 | plot(age,wage,col="gray") 138 | lines(age.grid,pred$fit,lwd=2) 139 | lines(age.grid,pred$fit+2*pred$se,lty="dashed") 140 | lines(age.grid,pred$fit-2*pred$se,lty="dashed") 141 | dim(bs(age,knots=c(25,40,60)))#or specified specifically 142 | dim(bs(age,df=6))#knots can be chosen automagically at uniform quantiles in the data 143 | attr(bs(age,df=6),"knots") 144 | fit2=lm(wage~ns(age,df=4),data=Wage) 145 | pred2=predict(fit2,newdata=list(age=age.grid),se=T) 146 | lines(age.grid,pred2$fit,col="red",lwd=2) 147 | lines(age.grid,pred2$fit+2*pred2$se,col="red",lty="dashed") 148 | lines(age.grid,pred2$fit-2*pred2$se,col="red",lty="dashed") 149 | ``` 150 | 151 | ### Smooth.spline, and figure 7.8 replication: 152 | 153 | ```{r fig.height=5,fig.width=7} 154 | plot(age,wage,xlim=agelims,cex=.5,col="darkgrey") 155 | title("Smoothing Spline") 156 | fit=smooth.spline(age,wage,df=16) 157 | fit2=smooth.spline(age,wage,cv=TRUE) 158 | fit2$df 159 | lines(fit,col="red",lwd=2) 160 | lines(fit2,col="blue",lwd=2) 161 | legend("topright",legend=c("16 DF",sprintf("%.1f DF",fit2$df)), 162 | col=c("red","blue"),lty=1,lwd=2,cex=.8) 163 | ``` 164 | 165 | ### LOESS-- local regression. 
166 | 
167 | ```{r fig.height=5,fig.width=7}
168 | plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
169 | title("Local Regression")
170 | fit=loess(wage~age,span=.2,data=Wage)
171 | fit2=loess(wage~age,span=.5,data=Wage)
172 | lines(age.grid,predict(fit,data.frame(age=age.grid)),
173 | col="red",lwd=2)
174 | lines(age.grid,predict(fit2,data.frame(age=age.grid)),
175 | col="blue",lwd=2)
176 | legend("topright",legend=c("Span=0.2","Span=0.5"),
177 | col=c("red","blue"),lty=1,lwd=2,cex=.8)
178 | #span is the fraction of the data used in each local fit
179 | ```
180 | 
181 | ## GAMs
182 | 
183 | Figure 7.11
184 | ```{r fig.width=11, fig.height=5}
185 | library(gam)
186 | gam1=lm(wage~ns(year,4)+ns(age,5)+education,data=Wage)
187 | gam.m3=gam(wage~s(year,4)+s(age,5)+education,data=Wage)
188 | par(mfrow=c(1,3))
189 | plot(gam.m3,se=TRUE,col="blue")
190 | ```
191 | 
192 | Figure 7.12
193 | ```{r fig.width=11, fig.height=5}
194 | par(mfrow=c(1,3))
195 | plot.gam(gam1,se=TRUE,col="red") #must call plot.gam explicitly since gam1 is an
196 | # lm fit rather than of the gam class, so the plot() generic would not dispatch
197 | # to plot.gam for the gam1 object.
198 | ```
199 | 
200 | ANOVA to choose among the GAM models
201 | ```{r}
202 | gam.m1=gam(wage~s(age,5)+education,data=Wage)
203 | gam.m2=gam(wage~year+s(age,5)+education,data=Wage)
204 | anova(gam.m1,gam.m2,gam.m3,test="F")
205 | ```
206 | 
207 | Looks like good evidence for including year, but a linear function of year is sufficient (the spline in year gives no significant improvement over the linear term).
208 | 
209 | ```{r}
210 | summary(gam.m3) # the nonparametric-effects p-values test a null of a linear
211 | # relationship against the alternative of a nonlinear one. COOL!
212 | preds=predict(gam.m2,newdata=Wage)
213 | head(preds)
214 | gam.lo=gam(wage~s(year,df=4)+lo(age,span=0.7)+education,data=Wage)
215 | ```
216 | 
217 | ```{r fig.width=11,fig.height=5}
218 | par(mfrow=c(1,3))
219 | plot.gam(gam.lo,se=TRUE,col="green")
220 | ```
221 | 
222 | ```{r fig.width=7, fig.height=5}
223 | gam.lo.i=gam(wage~lo(year,age,span=0.5)+education,data=Wage)
224 | library(akima)
225 | plot(gam.lo.i)
226 | ```
227 | 
228 | ```{r fig.height=5, fig.width=11}
229 | gam.lr=gam(I(wage>250)~year+s(age,df=5)+education,family=binomial,data=Wage)
230 | par(mfrow=c(1,3))
231 | plot(gam.lr,se=T,col="green")
232 | ```
233 | 
234 | 
235 | ```{r}
236 | table(education,I(wage>250))
237 | ```
238 | 
239 | ```{r fig.height=5,fig.width=11}
240 | par(mfrow=c(1,3))
241 | gam.lr.s=gam(I(wage>250)~year+s(age,df=5)+education,family=binomial,data=Wage,subset=(education!="1. 
< HS Grad")) 242 | plot(gam.lr.s,se=T,col="green") 243 | ``` 244 | 245 | 246 | -------------------------------------------------------------------------------- /R_Labs/Lab7/Lab7.Rmd: -------------------------------------------------------------------------------- 1 | ISLR Lab 7: Decision Trees 2 | ======================================================== 3 | 4 | ## Classification tree fitting 5 | ```{r} 6 | library(tree) 7 | library(ISLR) 8 | attach(Carseats) 9 | High=ifelse(Sales<=8,"No","Yes") 10 | Carseats=data.frame(Carseats,High) 11 | tree.carseats=tree(High~.-Sales,Carseats) 12 | summary(tree.carseats) 13 | ``` 14 | 15 | Plot of the carseats tree model: 16 | ```{r fig.width=11, fig.height=11} 17 | plot(tree.carseats) 18 | text(tree.carseats,pretty=0) 19 | ``` 20 | 21 | ```{r} 22 | tree.carseats 23 | set.seed(2) 24 | train=sample(1:nrow(Carseats),200) 25 | Carseats.test=Carseats[-train,] 26 | High.test=High[-train] 27 | tree.carseats=tree(High~.-Sales,Carseats,subset=train) 28 | tree.pred=predict(tree.carseats,Carseats.test,type="class") 29 | table(tree.pred,High.test) 30 | (86+57)/200 31 | 32 | set.seed(3) 33 | cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass) 34 | names(cv.carseats) 35 | cv.carseats 36 | ``` 37 | 38 | ```{r fig.height=7, fig.width=11} 39 | par(mfrow=c(1,2)) 40 | plot(cv.carseats$size,cv.carseats$dev,type="b") 41 | plot(cv.carseats$k,cv.carseats$dev, type="b") 42 | ``` 43 | 44 | ```{r fig.height=11, fig.width=11} 45 | prune.carseats=prune.misclass(tree.carseats,best=9) 46 | plot(prune.carseats) 47 | text(prune.carseats,pretty=0) 48 | 49 | tree.pred=predict(prune.carseats,Carseats.test,type="class") 50 | table(tree.pred,High.test) 51 | (94+60)/200 52 | ``` 53 | 54 | 55 | 56 | ## Regression tree fitting. 57 | ```{r} 58 | library(MASS) 59 | set.seed(1) 60 | train=sample(1:nrow(Boston), nrow(Boston)/2) 61 | tree.boston=tree(medv~.,Boston,subset=train) 62 | summary(tree.boston) 63 | ``` 64 | 65 | ```{r fig.width=11, fig.height=11} 66 | plot(tree.boston) 67 | text(tree.boston,pretty=0) 68 | 69 | cv.boston=cv.tree(tree.boston) 70 | plot(cv.boston$size, cv.boston$dev, type='b') 71 | 72 | prune.boston=prune.tree(tree.boston,best=5) 73 | plot(prune.boston) 74 | text(prune.boston,pretty=0) 75 | 76 | yhat=predict(tree.boston,newdata=Boston[-train,]) 77 | boston.test=Boston[-train,'medv'] 78 | plot(yhat,boston.test) 79 | abline(0,1) 80 | mean((yhat-boston.test)^2) 81 | ``` 82 | 83 | 84 | ## Random forest 85 | note: bagging is a special case of random forest where m=p 86 | ```{r} 87 | library(randomForest) 88 | 89 | ## do bagging 90 | set.seed(1) 91 | bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE) 92 | bag.boston 93 | yhat.bag=predict(bag.boston,newdata=Boston[-train,]) 94 | plot(yhat.bag, boston.test) 95 | abline(0,1) 96 | mean((yhat.bag-boston.test)^2) 97 | 98 | #change number of trees used 99 | bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25,importance=TRUE) 100 | yhat.bag=predict(bag.boston,newdata=Boston[-train,]) 101 | mean((yhat.bag-boston.test)^2) 102 | 103 | 104 | # do actual random forest 105 | set.seed(1) 106 | rf.boston=randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE) 107 | yhat.rf=predict(rf.boston,newdata=Boston[-train,]) 108 | mean((yhat.rf-boston.test)^2) 109 | ##see importance of variables 110 | importance(rf.boston) 111 | ``` 112 | 113 | 114 | ```{r fig.width=11, fig.height=11} 115 | varImpPlot(rf.boston) 116 | ``` 117 | 118 | 119 | ## Boosting 120 | ```{r fig.height=8,fig.width=8} 
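# (Added comments, not in the original lab.) gbm() below fits a boosted
# ensemble of regression trees:
#   distribution="gaussian"  -> squared-error loss, i.e. a regression problem
#   n.trees=5000             -> number of trees grown sequentially
#   interaction.depth=4      -> maximum depth of each individual tree
# shrinkage (the learning rate) is left at its default here; a larger
# value of 0.2 is tried in a later chunk.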
121 | library(gbm)
122 | set.seed(1)
123 | boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
124 | summary(boost.boston)
125 | ```
126 | 
127 | 
128 | ```{r fig.height=11, fig.width=11}
129 | par(mfrow=c(1,2))
130 | plot(boost.boston,i="rm")
131 | plot(boost.boston,i="lstat")
132 | ```
133 | 
134 | 
135 | ```{r}
136 | yhat.boost=predict(boost.boston,newdata=Boston[-train,], n.trees=5000)
137 | mean((yhat.boost-boston.test)^2)
138 | boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
139 | yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
140 | mean((yhat.boost-boston.test)^2)
141 | ```
142 | 
--------------------------------------------------------------------------------
/data/Advertising.csv:
--------------------------------------------------------------------------------
1 | "","TV","Radio","Newspaper","Sales"
2 | "1",230.1,37.8,69.2,22.1
3 | "2",44.5,39.3,45.1,10.4
4 | "3",17.2,45.9,69.3,9.3
5 | "4",151.5,41.3,58.5,18.5
6 | "5",180.8,10.8,58.4,12.9
7 | "6",8.7,48.9,75,7.2
8 | "7",57.5,32.8,23.5,11.8
9 | "8",120.2,19.6,11.6,13.2
10 | "9",8.6,2.1,1,4.8
11 | "10",199.8,2.6,21.2,10.6
12 | "11",66.1,5.8,24.2,8.6
13 | "12",214.7,24,4,17.4
14 | "13",23.8,35.1,65.9,9.2
15 | "14",97.5,7.6,7.2,9.7
16 | "15",204.1,32.9,46,19
17 | "16",195.4,47.7,52.9,22.4
18 | "17",67.8,36.6,114,12.5
19 | "18",281.4,39.6,55.8,24.4
20 | "19",69.2,20.5,18.3,11.3
21 | "20",147.3,23.9,19.1,14.6
22 | "21",218.4,27.7,53.4,18
23 | "22",237.4,5.1,23.5,12.5
24 | "23",13.2,15.9,49.6,5.6
25 | "24",228.3,16.9,26.2,15.5
26 | "25",62.3,12.6,18.3,9.7
27 | "26",262.9,3.5,19.5,12
28 | "27",142.9,29.3,12.6,15
29 | "28",240.1,16.7,22.9,15.9
30 | "29",248.8,27.1,22.9,18.9
31 | "30",70.6,16,40.8,10.5
32 | "31",292.9,28.3,43.2,21.4
33 | "32",112.9,17.4,38.6,11.9
34 | "33",97.2,1.5,30,9.6
35 | "34",265.6,20,0.3,17.4
36 | "35",95.7,1.4,7.4,9.5
37 | "36",290.7,4.1,8.5,12.8
38 | "37",266.9,43.8,5,25.4
39 | "38",74.7,49.4,45.7,14.7
40 | "39",43.1,26.7,35.1,10.1
41 | "40",228,37.7,32,21.5
42 | "41",202.5,22.3,31.6,16.6
43 | "42",177,33.4,38.7,17.1
44 | "43",293.6,27.7,1.8,20.7
45 | "44",206.9,8.4,26.4,12.9
46 | "45",25.1,25.7,43.3,8.5
47 | "46",175.1,22.5,31.5,14.9
48 | "47",89.7,9.9,35.7,10.6
49 | "48",239.9,41.5,18.5,23.2
50 | "49",227.2,15.8,49.9,14.8
51 | "50",66.9,11.7,36.8,9.7
52 | "51",199.8,3.1,34.6,11.4
53 | "52",100.4,9.6,3.6,10.7
54 | "53",216.4,41.7,39.6,22.6
55 | "54",182.6,46.2,58.7,21.2
56 | "55",262.7,28.8,15.9,20.2
57 | "56",198.9,49.4,60,23.7
58 | "57",7.3,28.1,41.4,5.5
59 | "58",136.2,19.2,16.6,13.2
60 | "59",210.8,49.6,37.7,23.8
61 | "60",210.7,29.5,9.3,18.4
62 | "61",53.5,2,21.4,8.1
63 | "62",261.3,42.7,54.7,24.2
64 | "63",239.3,15.5,27.3,15.7
65 | "64",102.7,29.6,8.4,14
66 | "65",131.1,42.8,28.9,18
67 | "66",69,9.3,0.9,9.3
68 | "67",31.5,24.6,2.2,9.5
69 | "68",139.3,14.5,10.2,13.4
70 | "69",237.4,27.5,11,18.9
71 | "70",216.8,43.9,27.2,22.3
72 | "71",199.1,30.6,38.7,18.3
73 | "72",109.8,14.3,31.7,12.4
74 | "73",26.8,33,19.3,8.8
75 | "74",129.4,5.7,31.3,11
76 | "75",213.4,24.6,13.1,17
77 | "76",16.9,43.7,89.4,8.7
78 | "77",27.5,1.6,20.7,6.9
79 | "78",120.5,28.5,14.2,14.2
80 | "79",5.4,29.9,9.4,5.3
81 | "80",116,7.7,23.1,11
82 | "81",76.4,26.7,22.3,11.8
83 | "82",239.8,4.1,36.9,12.3
84 | "83",75.3,20.3,32.5,11.3
85 | "84",68.4,44.5,35.6,13.6
86 | "85",213.5,43,33.8,21.7
87 | "86",193.2,18.4,65.7,15.2
88 | "87",76.3,27.5,16,12
89 | "88",110.7,40.6,63.2,16
90 | "89",88.3,25.5,73.4,12.9
91 | "90",109.8,47.8,51.4,16.7 92 | "91",134.3,4.9,9.3,11.2 93 | "92",28.6,1.5,33,7.3 94 | "93",217.7,33.5,59,19.4 95 | "94",250.9,36.5,72.3,22.2 96 | "95",107.4,14,10.9,11.5 97 | "96",163.3,31.6,52.9,16.9 98 | "97",197.6,3.5,5.9,11.7 99 | "98",184.9,21,22,15.5 100 | "99",289.7,42.3,51.2,25.4 101 | "100",135.2,41.7,45.9,17.2 102 | "101",222.4,4.3,49.8,11.7 103 | "102",296.4,36.3,100.9,23.8 104 | "103",280.2,10.1,21.4,14.8 105 | "104",187.9,17.2,17.9,14.7 106 | "105",238.2,34.3,5.3,20.7 107 | "106",137.9,46.4,59,19.2 108 | "107",25,11,29.7,7.2 109 | "108",90.4,0.3,23.2,8.7 110 | "109",13.1,0.4,25.6,5.3 111 | "110",255.4,26.9,5.5,19.8 112 | "111",225.8,8.2,56.5,13.4 113 | "112",241.7,38,23.2,21.8 114 | "113",175.7,15.4,2.4,14.1 115 | "114",209.6,20.6,10.7,15.9 116 | "115",78.2,46.8,34.5,14.6 117 | "116",75.1,35,52.7,12.6 118 | "117",139.2,14.3,25.6,12.2 119 | "118",76.4,0.8,14.8,9.4 120 | "119",125.7,36.9,79.2,15.9 121 | "120",19.4,16,22.3,6.6 122 | "121",141.3,26.8,46.2,15.5 123 | "122",18.8,21.7,50.4,7 124 | "123",224,2.4,15.6,11.6 125 | "124",123.1,34.6,12.4,15.2 126 | "125",229.5,32.3,74.2,19.7 127 | "126",87.2,11.8,25.9,10.6 128 | "127",7.8,38.9,50.6,6.6 129 | "128",80.2,0,9.2,8.8 130 | "129",220.3,49,3.2,24.7 131 | "130",59.6,12,43.1,9.7 132 | "131",0.7,39.6,8.7,1.6 133 | "132",265.2,2.9,43,12.7 134 | "133",8.4,27.2,2.1,5.7 135 | "134",219.8,33.5,45.1,19.6 136 | "135",36.9,38.6,65.6,10.8 137 | "136",48.3,47,8.5,11.6 138 | "137",25.6,39,9.3,9.5 139 | "138",273.7,28.9,59.7,20.8 140 | "139",43,25.9,20.5,9.6 141 | "140",184.9,43.9,1.7,20.7 142 | "141",73.4,17,12.9,10.9 143 | "142",193.7,35.4,75.6,19.2 144 | "143",220.5,33.2,37.9,20.1 145 | "144",104.6,5.7,34.4,10.4 146 | "145",96.2,14.8,38.9,11.4 147 | "146",140.3,1.9,9,10.3 148 | "147",240.1,7.3,8.7,13.2 149 | "148",243.2,49,44.3,25.4 150 | "149",38,40.3,11.9,10.9 151 | "150",44.7,25.8,20.6,10.1 152 | "151",280.7,13.9,37,16.1 153 | "152",121,8.4,48.7,11.6 154 | "153",197.6,23.3,14.2,16.6 155 | "154",171.3,39.7,37.7,19 156 | "155",187.8,21.1,9.5,15.6 157 | "156",4.1,11.6,5.7,3.2 158 | "157",93.9,43.5,50.5,15.3 159 | "158",149.8,1.3,24.3,10.1 160 | "159",11.7,36.9,45.2,7.3 161 | "160",131.7,18.4,34.6,12.9 162 | "161",172.5,18.1,30.7,14.4 163 | "162",85.7,35.8,49.3,13.3 164 | "163",188.4,18.1,25.6,14.9 165 | "164",163.5,36.8,7.4,18 166 | "165",117.2,14.7,5.4,11.9 167 | "166",234.5,3.4,84.8,11.9 168 | "167",17.9,37.6,21.6,8 169 | "168",206.8,5.2,19.4,12.2 170 | "169",215.4,23.6,57.6,17.1 171 | "170",284.3,10.6,6.4,15 172 | "171",50,11.6,18.4,8.4 173 | "172",164.5,20.9,47.4,14.5 174 | "173",19.6,20.1,17,7.6 175 | "174",168.4,7.1,12.8,11.7 176 | "175",222.4,3.4,13.1,11.5 177 | "176",276.9,48.9,41.8,27 178 | "177",248.4,30.2,20.3,20.2 179 | "178",170.2,7.8,35.2,11.7 180 | "179",276.7,2.3,23.7,11.8 181 | "180",165.6,10,17.6,12.6 182 | "181",156.6,2.6,8.3,10.5 183 | "182",218.5,5.4,27.4,12.2 184 | "183",56.2,5.7,29.7,8.7 185 | "184",287.6,43,71.8,26.2 186 | "185",253.8,21.3,30,17.6 187 | "186",205,45.1,19.6,22.6 188 | "187",139.5,2.1,26.6,10.3 189 | "188",191.1,28.7,18.2,17.3 190 | "189",286,13.9,3.7,15.9 191 | "190",18.7,12.1,23.4,6.7 192 | "191",39.5,41.1,5.8,10.8 193 | "192",75.5,10.8,6,9.9 194 | "193",17.2,4.1,31.6,5.9 195 | "194",166.8,42,3.6,19.6 196 | "195",149.7,35.6,6,17.3 197 | "196",38.2,3.7,13.8,7.6 198 | "197",94.2,4.9,8.1,9.7 199 | "198",177,9.3,6.4,12.8 200 | "199",283.6,42,66.2,25.5 201 | "200",232.1,8.6,8.7,13.4 202 | -------------------------------------------------------------------------------- /data/Auto.csv: 
-------------------------------------------------------------------------------- 1 | mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name 2 | 18,8,307,130,3504,12,70,1,chevrolet chevelle malibu 3 | 15,8,350,165,3693,11.5,70,1,buick skylark 320 4 | 18,8,318,150,3436,11,70,1,plymouth satellite 5 | 16,8,304,150,3433,12,70,1,amc rebel sst 6 | 17,8,302,140,3449,10.5,70,1,ford torino 7 | 15,8,429,198,4341,10,70,1,ford galaxie 500 8 | 14,8,454,220,4354,9,70,1,chevrolet impala 9 | 14,8,440,215,4312,8.5,70,1,plymouth fury iii 10 | 14,8,455,225,4425,10,70,1,pontiac catalina 11 | 15,8,390,190,3850,8.5,70,1,amc ambassador dpl 12 | 15,8,383,170,3563,10,70,1,dodge challenger se 13 | 14,8,340,160,3609,8,70,1,plymouth 'cuda 340 14 | 15,8,400,150,3761,9.5,70,1,chevrolet monte carlo 15 | 14,8,455,225,3086,10,70,1,buick estate wagon (sw) 16 | 24,4,113,95,2372,15,70,3,toyota corona mark ii 17 | 22,6,198,95,2833,15.5,70,1,plymouth duster 18 | 18,6,199,97,2774,15.5,70,1,amc hornet 19 | 21,6,200,85,2587,16,70,1,ford maverick 20 | 27,4,97,88,2130,14.5,70,3,datsun pl510 21 | 26,4,97,46,1835,20.5,70,2,volkswagen 1131 deluxe sedan 22 | 25,4,110,87,2672,17.5,70,2,peugeot 504 23 | 24,4,107,90,2430,14.5,70,2,audi 100 ls 24 | 25,4,104,95,2375,17.5,70,2,saab 99e 25 | 26,4,121,113,2234,12.5,70,2,bmw 2002 26 | 21,6,199,90,2648,15,70,1,amc gremlin 27 | 10,8,360,215,4615,14,70,1,ford f250 28 | 10,8,307,200,4376,15,70,1,chevy c20 29 | 11,8,318,210,4382,13.5,70,1,dodge d200 30 | 9,8,304,193,4732,18.5,70,1,hi 1200d 31 | 27,4,97,88,2130,14.5,71,3,datsun pl510 32 | 28,4,140,90,2264,15.5,71,1,chevrolet vega 2300 33 | 25,4,113,95,2228,14,71,3,toyota corona 34 | 25,4,98,?,2046,19,71,1,ford pinto 35 | 19,6,232,100,2634,13,71,1,amc gremlin 36 | 16,6,225,105,3439,15.5,71,1,plymouth satellite custom 37 | 17,6,250,100,3329,15.5,71,1,chevrolet chevelle malibu 38 | 19,6,250,88,3302,15.5,71,1,ford torino 500 39 | 18,6,232,100,3288,15.5,71,1,amc matador 40 | 14,8,350,165,4209,12,71,1,chevrolet impala 41 | 14,8,400,175,4464,11.5,71,1,pontiac catalina brougham 42 | 14,8,351,153,4154,13.5,71,1,ford galaxie 500 43 | 14,8,318,150,4096,13,71,1,plymouth fury iii 44 | 12,8,383,180,4955,11.5,71,1,dodge monaco (sw) 45 | 13,8,400,170,4746,12,71,1,ford country squire (sw) 46 | 13,8,400,175,5140,12,71,1,pontiac safari (sw) 47 | 18,6,258,110,2962,13.5,71,1,amc hornet sportabout (sw) 48 | 22,4,140,72,2408,19,71,1,chevrolet vega (sw) 49 | 19,6,250,100,3282,15,71,1,pontiac firebird 50 | 18,6,250,88,3139,14.5,71,1,ford mustang 51 | 23,4,122,86,2220,14,71,1,mercury capri 2000 52 | 28,4,116,90,2123,14,71,2,opel 1900 53 | 30,4,79,70,2074,19.5,71,2,peugeot 304 54 | 30,4,88,76,2065,14.5,71,2,fiat 124b 55 | 31,4,71,65,1773,19,71,3,toyota corolla 1200 56 | 35,4,72,69,1613,18,71,3,datsun 1200 57 | 27,4,97,60,1834,19,71,2,volkswagen model 111 58 | 26,4,91,70,1955,20.5,71,1,plymouth cricket 59 | 24,4,113,95,2278,15.5,72,3,toyota corona hardtop 60 | 25,4,97.5,80,2126,17,72,1,dodge colt hardtop 61 | 23,4,97,54,2254,23.5,72,2,volkswagen type 3 62 | 20,4,140,90,2408,19.5,72,1,chevrolet vega 63 | 21,4,122,86,2226,16.5,72,1,ford pinto runabout 64 | 13,8,350,165,4274,12,72,1,chevrolet impala 65 | 14,8,400,175,4385,12,72,1,pontiac catalina 66 | 15,8,318,150,4135,13.5,72,1,plymouth fury iii 67 | 14,8,351,153,4129,13,72,1,ford galaxie 500 68 | 17,8,304,150,3672,11.5,72,1,amc ambassador sst 69 | 11,8,429,208,4633,11,72,1,mercury marquis 70 | 13,8,350,155,4502,13.5,72,1,buick lesabre custom 71 | 12,8,350,160,4456,13.5,72,1,oldsmobile delta 88 royale 72 | 
13,8,400,190,4422,12.5,72,1,chrysler newport royal 73 | 19,3,70,97,2330,13.5,72,3,mazda rx2 coupe 74 | 15,8,304,150,3892,12.5,72,1,amc matador (sw) 75 | 13,8,307,130,4098,14,72,1,chevrolet chevelle concours (sw) 76 | 13,8,302,140,4294,16,72,1,ford gran torino (sw) 77 | 14,8,318,150,4077,14,72,1,plymouth satellite custom (sw) 78 | 18,4,121,112,2933,14.5,72,2,volvo 145e (sw) 79 | 22,4,121,76,2511,18,72,2,volkswagen 411 (sw) 80 | 21,4,120,87,2979,19.5,72,2,peugeot 504 (sw) 81 | 26,4,96,69,2189,18,72,2,renault 12 (sw) 82 | 22,4,122,86,2395,16,72,1,ford pinto (sw) 83 | 28,4,97,92,2288,17,72,3,datsun 510 (sw) 84 | 23,4,120,97,2506,14.5,72,3,toyouta corona mark ii (sw) 85 | 28,4,98,80,2164,15,72,1,dodge colt (sw) 86 | 27,4,97,88,2100,16.5,72,3,toyota corolla 1600 (sw) 87 | 13,8,350,175,4100,13,73,1,buick century 350 88 | 14,8,304,150,3672,11.5,73,1,amc matador 89 | 13,8,350,145,3988,13,73,1,chevrolet malibu 90 | 14,8,302,137,4042,14.5,73,1,ford gran torino 91 | 15,8,318,150,3777,12.5,73,1,dodge coronet custom 92 | 12,8,429,198,4952,11.5,73,1,mercury marquis brougham 93 | 13,8,400,150,4464,12,73,1,chevrolet caprice classic 94 | 13,8,351,158,4363,13,73,1,ford ltd 95 | 14,8,318,150,4237,14.5,73,1,plymouth fury gran sedan 96 | 13,8,440,215,4735,11,73,1,chrysler new yorker brougham 97 | 12,8,455,225,4951,11,73,1,buick electra 225 custom 98 | 13,8,360,175,3821,11,73,1,amc ambassador brougham 99 | 18,6,225,105,3121,16.5,73,1,plymouth valiant 100 | 16,6,250,100,3278,18,73,1,chevrolet nova custom 101 | 18,6,232,100,2945,16,73,1,amc hornet 102 | 18,6,250,88,3021,16.5,73,1,ford maverick 103 | 23,6,198,95,2904,16,73,1,plymouth duster 104 | 26,4,97,46,1950,21,73,2,volkswagen super beetle 105 | 11,8,400,150,4997,14,73,1,chevrolet impala 106 | 12,8,400,167,4906,12.5,73,1,ford country 107 | 13,8,360,170,4654,13,73,1,plymouth custom suburb 108 | 12,8,350,180,4499,12.5,73,1,oldsmobile vista cruiser 109 | 18,6,232,100,2789,15,73,1,amc gremlin 110 | 20,4,97,88,2279,19,73,3,toyota carina 111 | 21,4,140,72,2401,19.5,73,1,chevrolet vega 112 | 22,4,108,94,2379,16.5,73,3,datsun 610 113 | 18,3,70,90,2124,13.5,73,3,maxda rx3 114 | 19,4,122,85,2310,18.5,73,1,ford pinto 115 | 21,6,155,107,2472,14,73,1,mercury capri v6 116 | 26,4,98,90,2265,15.5,73,2,fiat 124 sport coupe 117 | 15,8,350,145,4082,13,73,1,chevrolet monte carlo s 118 | 16,8,400,230,4278,9.5,73,1,pontiac grand prix 119 | 29,4,68,49,1867,19.5,73,2,fiat 128 120 | 24,4,116,75,2158,15.5,73,2,opel manta 121 | 20,4,114,91,2582,14,73,2,audi 100ls 122 | 19,4,121,112,2868,15.5,73,2,volvo 144ea 123 | 15,8,318,150,3399,11,73,1,dodge dart custom 124 | 24,4,121,110,2660,14,73,2,saab 99le 125 | 20,6,156,122,2807,13.5,73,3,toyota mark ii 126 | 11,8,350,180,3664,11,73,1,oldsmobile omega 127 | 20,6,198,95,3102,16.5,74,1,plymouth duster 128 | 21,6,200,?,2875,17,74,1,ford maverick 129 | 19,6,232,100,2901,16,74,1,amc hornet 130 | 15,6,250,100,3336,17,74,1,chevrolet nova 131 | 31,4,79,67,1950,19,74,3,datsun b210 132 | 26,4,122,80,2451,16.5,74,1,ford pinto 133 | 32,4,71,65,1836,21,74,3,toyota corolla 1200 134 | 25,4,140,75,2542,17,74,1,chevrolet vega 135 | 16,6,250,100,3781,17,74,1,chevrolet chevelle malibu classic 136 | 16,6,258,110,3632,18,74,1,amc matador 137 | 18,6,225,105,3613,16.5,74,1,plymouth satellite sebring 138 | 16,8,302,140,4141,14,74,1,ford gran torino 139 | 13,8,350,150,4699,14.5,74,1,buick century luxus (sw) 140 | 14,8,318,150,4457,13.5,74,1,dodge coronet custom (sw) 141 | 14,8,302,140,4638,16,74,1,ford gran torino (sw) 142 | 14,8,304,150,4257,15.5,74,1,amc matador (sw) 
143 | 29,4,98,83,2219,16.5,74,2,audi fox 144 | 26,4,79,67,1963,15.5,74,2,volkswagen dasher 145 | 26,4,97,78,2300,14.5,74,2,opel manta 146 | 31,4,76,52,1649,16.5,74,3,toyota corona 147 | 32,4,83,61,2003,19,74,3,datsun 710 148 | 28,4,90,75,2125,14.5,74,1,dodge colt 149 | 24,4,90,75,2108,15.5,74,2,fiat 128 150 | 26,4,116,75,2246,14,74,2,fiat 124 tc 151 | 24,4,120,97,2489,15,74,3,honda civic 152 | 26,4,108,93,2391,15.5,74,3,subaru 153 | 31,4,79,67,2000,16,74,2,fiat x1.9 154 | 19,6,225,95,3264,16,75,1,plymouth valiant custom 155 | 18,6,250,105,3459,16,75,1,chevrolet nova 156 | 15,6,250,72,3432,21,75,1,mercury monarch 157 | 15,6,250,72,3158,19.5,75,1,ford maverick 158 | 16,8,400,170,4668,11.5,75,1,pontiac catalina 159 | 15,8,350,145,4440,14,75,1,chevrolet bel air 160 | 16,8,318,150,4498,14.5,75,1,plymouth grand fury 161 | 14,8,351,148,4657,13.5,75,1,ford ltd 162 | 17,6,231,110,3907,21,75,1,buick century 163 | 16,6,250,105,3897,18.5,75,1,chevroelt chevelle malibu 164 | 15,6,258,110,3730,19,75,1,amc matador 165 | 18,6,225,95,3785,19,75,1,plymouth fury 166 | 21,6,231,110,3039,15,75,1,buick skyhawk 167 | 20,8,262,110,3221,13.5,75,1,chevrolet monza 2+2 168 | 13,8,302,129,3169,12,75,1,ford mustang ii 169 | 29,4,97,75,2171,16,75,3,toyota corolla 170 | 23,4,140,83,2639,17,75,1,ford pinto 171 | 20,6,232,100,2914,16,75,1,amc gremlin 172 | 23,4,140,78,2592,18.5,75,1,pontiac astro 173 | 24,4,134,96,2702,13.5,75,3,toyota corona 174 | 25,4,90,71,2223,16.5,75,2,volkswagen dasher 175 | 24,4,119,97,2545,17,75,3,datsun 710 176 | 18,6,171,97,2984,14.5,75,1,ford pinto 177 | 29,4,90,70,1937,14,75,2,volkswagen rabbit 178 | 19,6,232,90,3211,17,75,1,amc pacer 179 | 23,4,115,95,2694,15,75,2,audi 100ls 180 | 23,4,120,88,2957,17,75,2,peugeot 504 181 | 22,4,121,98,2945,14.5,75,2,volvo 244dl 182 | 25,4,121,115,2671,13.5,75,2,saab 99le 183 | 33,4,91,53,1795,17.5,75,3,honda civic cvcc 184 | 28,4,107,86,2464,15.5,76,2,fiat 131 185 | 25,4,116,81,2220,16.9,76,2,opel 1900 186 | 25,4,140,92,2572,14.9,76,1,capri ii 187 | 26,4,98,79,2255,17.7,76,1,dodge colt 188 | 27,4,101,83,2202,15.3,76,2,renault 12tl 189 | 17.5,8,305,140,4215,13,76,1,chevrolet chevelle malibu classic 190 | 16,8,318,150,4190,13,76,1,dodge coronet brougham 191 | 15.5,8,304,120,3962,13.9,76,1,amc matador 192 | 14.5,8,351,152,4215,12.8,76,1,ford gran torino 193 | 22,6,225,100,3233,15.4,76,1,plymouth valiant 194 | 22,6,250,105,3353,14.5,76,1,chevrolet nova 195 | 24,6,200,81,3012,17.6,76,1,ford maverick 196 | 22.5,6,232,90,3085,17.6,76,1,amc hornet 197 | 29,4,85,52,2035,22.2,76,1,chevrolet chevette 198 | 24.5,4,98,60,2164,22.1,76,1,chevrolet woody 199 | 29,4,90,70,1937,14.2,76,2,vw rabbit 200 | 33,4,91,53,1795,17.4,76,3,honda civic 201 | 20,6,225,100,3651,17.7,76,1,dodge aspen se 202 | 18,6,250,78,3574,21,76,1,ford granada ghia 203 | 18.5,6,250,110,3645,16.2,76,1,pontiac ventura sj 204 | 17.5,6,258,95,3193,17.8,76,1,amc pacer d/l 205 | 29.5,4,97,71,1825,12.2,76,2,volkswagen rabbit 206 | 32,4,85,70,1990,17,76,3,datsun b-210 207 | 28,4,97,75,2155,16.4,76,3,toyota corolla 208 | 26.5,4,140,72,2565,13.6,76,1,ford pinto 209 | 20,4,130,102,3150,15.7,76,2,volvo 245 210 | 13,8,318,150,3940,13.2,76,1,plymouth volare premier v8 211 | 19,4,120,88,3270,21.9,76,2,peugeot 504 212 | 19,6,156,108,2930,15.5,76,3,toyota mark ii 213 | 16.5,6,168,120,3820,16.7,76,2,mercedes-benz 280s 214 | 16.5,8,350,180,4380,12.1,76,1,cadillac seville 215 | 13,8,350,145,4055,12,76,1,chevy c10 216 | 13,8,302,130,3870,15,76,1,ford f108 217 | 13,8,318,150,3755,14,76,1,dodge d100 218 | 
31.5,4,98,68,2045,18.5,77,3,honda accord cvcc 219 | 30,4,111,80,2155,14.8,77,1,buick opel isuzu deluxe 220 | 36,4,79,58,1825,18.6,77,2,renault 5 gtl 221 | 25.5,4,122,96,2300,15.5,77,1,plymouth arrow gs 222 | 33.5,4,85,70,1945,16.8,77,3,datsun f-10 hatchback 223 | 17.5,8,305,145,3880,12.5,77,1,chevrolet caprice classic 224 | 17,8,260,110,4060,19,77,1,oldsmobile cutlass supreme 225 | 15.5,8,318,145,4140,13.7,77,1,dodge monaco brougham 226 | 15,8,302,130,4295,14.9,77,1,mercury cougar brougham 227 | 17.5,6,250,110,3520,16.4,77,1,chevrolet concours 228 | 20.5,6,231,105,3425,16.9,77,1,buick skylark 229 | 19,6,225,100,3630,17.7,77,1,plymouth volare custom 230 | 18.5,6,250,98,3525,19,77,1,ford granada 231 | 16,8,400,180,4220,11.1,77,1,pontiac grand prix lj 232 | 15.5,8,350,170,4165,11.4,77,1,chevrolet monte carlo landau 233 | 15.5,8,400,190,4325,12.2,77,1,chrysler cordoba 234 | 16,8,351,149,4335,14.5,77,1,ford thunderbird 235 | 29,4,97,78,1940,14.5,77,2,volkswagen rabbit custom 236 | 24.5,4,151,88,2740,16,77,1,pontiac sunbird coupe 237 | 26,4,97,75,2265,18.2,77,3,toyota corolla liftback 238 | 25.5,4,140,89,2755,15.8,77,1,ford mustang ii 2+2 239 | 30.5,4,98,63,2051,17,77,1,chevrolet chevette 240 | 33.5,4,98,83,2075,15.9,77,1,dodge colt m/m 241 | 30,4,97,67,1985,16.4,77,3,subaru dl 242 | 30.5,4,97,78,2190,14.1,77,2,volkswagen dasher 243 | 22,6,146,97,2815,14.5,77,3,datsun 810 244 | 21.5,4,121,110,2600,12.8,77,2,bmw 320i 245 | 21.5,3,80,110,2720,13.5,77,3,mazda rx-4 246 | 43.1,4,90,48,1985,21.5,78,2,volkswagen rabbit custom diesel 247 | 36.1,4,98,66,1800,14.4,78,1,ford fiesta 248 | 32.8,4,78,52,1985,19.4,78,3,mazda glc deluxe 249 | 39.4,4,85,70,2070,18.6,78,3,datsun b210 gx 250 | 36.1,4,91,60,1800,16.4,78,3,honda civic cvcc 251 | 19.9,8,260,110,3365,15.5,78,1,oldsmobile cutlass salon brougham 252 | 19.4,8,318,140,3735,13.2,78,1,dodge diplomat 253 | 20.2,8,302,139,3570,12.8,78,1,mercury monarch ghia 254 | 19.2,6,231,105,3535,19.2,78,1,pontiac phoenix lj 255 | 20.5,6,200,95,3155,18.2,78,1,chevrolet malibu 256 | 20.2,6,200,85,2965,15.8,78,1,ford fairmont (auto) 257 | 25.1,4,140,88,2720,15.4,78,1,ford fairmont (man) 258 | 20.5,6,225,100,3430,17.2,78,1,plymouth volare 259 | 19.4,6,232,90,3210,17.2,78,1,amc concord 260 | 20.6,6,231,105,3380,15.8,78,1,buick century special 261 | 20.8,6,200,85,3070,16.7,78,1,mercury zephyr 262 | 18.6,6,225,110,3620,18.7,78,1,dodge aspen 263 | 18.1,6,258,120,3410,15.1,78,1,amc concord d/l 264 | 19.2,8,305,145,3425,13.2,78,1,chevrolet monte carlo landau 265 | 17.7,6,231,165,3445,13.4,78,1,buick regal sport coupe (turbo) 266 | 18.1,8,302,139,3205,11.2,78,1,ford futura 267 | 17.5,8,318,140,4080,13.7,78,1,dodge magnum xe 268 | 30,4,98,68,2155,16.5,78,1,chevrolet chevette 269 | 27.5,4,134,95,2560,14.2,78,3,toyota corona 270 | 27.2,4,119,97,2300,14.7,78,3,datsun 510 271 | 30.9,4,105,75,2230,14.5,78,1,dodge omni 272 | 21.1,4,134,95,2515,14.8,78,3,toyota celica gt liftback 273 | 23.2,4,156,105,2745,16.7,78,1,plymouth sapporo 274 | 23.8,4,151,85,2855,17.6,78,1,oldsmobile starfire sx 275 | 23.9,4,119,97,2405,14.9,78,3,datsun 200-sx 276 | 20.3,5,131,103,2830,15.9,78,2,audi 5000 277 | 17,6,163,125,3140,13.6,78,2,volvo 264gl 278 | 21.6,4,121,115,2795,15.7,78,2,saab 99gle 279 | 16.2,6,163,133,3410,15.8,78,2,peugeot 604sl 280 | 31.5,4,89,71,1990,14.9,78,2,volkswagen scirocco 281 | 29.5,4,98,68,2135,16.6,78,3,honda accord lx 282 | 21.5,6,231,115,3245,15.4,79,1,pontiac lemans v6 283 | 19.8,6,200,85,2990,18.2,79,1,mercury zephyr 6 284 | 22.3,4,140,88,2890,17.3,79,1,ford fairmont 4 285 | 
20.2,6,232,90,3265,18.2,79,1,amc concord dl 6 286 | 20.6,6,225,110,3360,16.6,79,1,dodge aspen 6 287 | 17,8,305,130,3840,15.4,79,1,chevrolet caprice classic 288 | 17.6,8,302,129,3725,13.4,79,1,ford ltd landau 289 | 16.5,8,351,138,3955,13.2,79,1,mercury grand marquis 290 | 18.2,8,318,135,3830,15.2,79,1,dodge st. regis 291 | 16.9,8,350,155,4360,14.9,79,1,buick estate wagon (sw) 292 | 15.5,8,351,142,4054,14.3,79,1,ford country squire (sw) 293 | 19.2,8,267,125,3605,15,79,1,chevrolet malibu classic (sw) 294 | 18.5,8,360,150,3940,13,79,1,chrysler lebaron town @ country (sw) 295 | 31.9,4,89,71,1925,14,79,2,vw rabbit custom 296 | 34.1,4,86,65,1975,15.2,79,3,maxda glc deluxe 297 | 35.7,4,98,80,1915,14.4,79,1,dodge colt hatchback custom 298 | 27.4,4,121,80,2670,15,79,1,amc spirit dl 299 | 25.4,5,183,77,3530,20.1,79,2,mercedes benz 300d 300 | 23,8,350,125,3900,17.4,79,1,cadillac eldorado 301 | 27.2,4,141,71,3190,24.8,79,2,peugeot 504 302 | 23.9,8,260,90,3420,22.2,79,1,oldsmobile cutlass salon brougham 303 | 34.2,4,105,70,2200,13.2,79,1,plymouth horizon 304 | 34.5,4,105,70,2150,14.9,79,1,plymouth horizon tc3 305 | 31.8,4,85,65,2020,19.2,79,3,datsun 210 306 | 37.3,4,91,69,2130,14.7,79,2,fiat strada custom 307 | 28.4,4,151,90,2670,16,79,1,buick skylark limited 308 | 28.8,6,173,115,2595,11.3,79,1,chevrolet citation 309 | 26.8,6,173,115,2700,12.9,79,1,oldsmobile omega brougham 310 | 33.5,4,151,90,2556,13.2,79,1,pontiac phoenix 311 | 41.5,4,98,76,2144,14.7,80,2,vw rabbit 312 | 38.1,4,89,60,1968,18.8,80,3,toyota corolla tercel 313 | 32.1,4,98,70,2120,15.5,80,1,chevrolet chevette 314 | 37.2,4,86,65,2019,16.4,80,3,datsun 310 315 | 28,4,151,90,2678,16.5,80,1,chevrolet citation 316 | 26.4,4,140,88,2870,18.1,80,1,ford fairmont 317 | 24.3,4,151,90,3003,20.1,80,1,amc concord 318 | 19.1,6,225,90,3381,18.7,80,1,dodge aspen 319 | 34.3,4,97,78,2188,15.8,80,2,audi 4000 320 | 29.8,4,134,90,2711,15.5,80,3,toyota corona liftback 321 | 31.3,4,120,75,2542,17.5,80,3,mazda 626 322 | 37,4,119,92,2434,15,80,3,datsun 510 hatchback 323 | 32.2,4,108,75,2265,15.2,80,3,toyota corolla 324 | 46.6,4,86,65,2110,17.9,80,3,mazda glc 325 | 27.9,4,156,105,2800,14.4,80,1,dodge colt 326 | 40.8,4,85,65,2110,19.2,80,3,datsun 210 327 | 44.3,4,90,48,2085,21.7,80,2,vw rabbit c (diesel) 328 | 43.4,4,90,48,2335,23.7,80,2,vw dasher (diesel) 329 | 36.4,5,121,67,2950,19.9,80,2,audi 5000s (diesel) 330 | 30,4,146,67,3250,21.8,80,2,mercedes-benz 240d 331 | 44.6,4,91,67,1850,13.8,80,3,honda civic 1500 gl 332 | 40.9,4,85,?,1835,17.3,80,2,renault lecar deluxe 333 | 33.8,4,97,67,2145,18,80,3,subaru dl 334 | 29.8,4,89,62,1845,15.3,80,2,vokswagen rabbit 335 | 32.7,6,168,132,2910,11.4,80,3,datsun 280-zx 336 | 23.7,3,70,100,2420,12.5,80,3,mazda rx-7 gs 337 | 35,4,122,88,2500,15.1,80,2,triumph tr7 coupe 338 | 23.6,4,140,?,2905,14.3,80,1,ford mustang cobra 339 | 32.4,4,107,72,2290,17,80,3,honda accord 340 | 27.2,4,135,84,2490,15.7,81,1,plymouth reliant 341 | 26.6,4,151,84,2635,16.4,81,1,buick skylark 342 | 25.8,4,156,92,2620,14.4,81,1,dodge aries wagon (sw) 343 | 23.5,6,173,110,2725,12.6,81,1,chevrolet citation 344 | 30,4,135,84,2385,12.9,81,1,plymouth reliant 345 | 39.1,4,79,58,1755,16.9,81,3,toyota starlet 346 | 39,4,86,64,1875,16.4,81,1,plymouth champ 347 | 35.1,4,81,60,1760,16.1,81,3,honda civic 1300 348 | 32.3,4,97,67,2065,17.8,81,3,subaru 349 | 37,4,85,65,1975,19.4,81,3,datsun 210 mpg 350 | 37.7,4,89,62,2050,17.3,81,3,toyota tercel 351 | 34.1,4,91,68,1985,16,81,3,mazda glc 4 352 | 34.7,4,105,63,2215,14.9,81,1,plymouth horizon 4 353 | 
34.4,4,98,65,2045,16.2,81,1,ford escort 4w 354 | 29.9,4,98,65,2380,20.7,81,1,ford escort 2h 355 | 33,4,105,74,2190,14.2,81,2,volkswagen jetta 356 | 34.5,4,100,?,2320,15.8,81,2,renault 18i 357 | 33.7,4,107,75,2210,14.4,81,3,honda prelude 358 | 32.4,4,108,75,2350,16.8,81,3,toyota corolla 359 | 32.9,4,119,100,2615,14.8,81,3,datsun 200sx 360 | 31.6,4,120,74,2635,18.3,81,3,mazda 626 361 | 28.1,4,141,80,3230,20.4,81,2,peugeot 505s turbo diesel 362 | 30.7,6,145,76,3160,19.6,81,2,volvo diesel 363 | 25.4,6,168,116,2900,12.6,81,3,toyota cressida 364 | 24.2,6,146,120,2930,13.8,81,3,datsun 810 maxima 365 | 22.4,6,231,110,3415,15.8,81,1,buick century 366 | 26.6,8,350,105,3725,19,81,1,oldsmobile cutlass ls 367 | 20.2,6,200,88,3060,17.1,81,1,ford granada gl 368 | 17.6,6,225,85,3465,16.6,81,1,chrysler lebaron salon 369 | 28,4,112,88,2605,19.6,82,1,chevrolet cavalier 370 | 27,4,112,88,2640,18.6,82,1,chevrolet cavalier wagon 371 | 34,4,112,88,2395,18,82,1,chevrolet cavalier 2-door 372 | 31,4,112,85,2575,16.2,82,1,pontiac j2000 se hatchback 373 | 29,4,135,84,2525,16,82,1,dodge aries se 374 | 27,4,151,90,2735,18,82,1,pontiac phoenix 375 | 24,4,140,92,2865,16.4,82,1,ford fairmont futura 376 | 36,4,105,74,1980,15.3,82,2,volkswagen rabbit l 377 | 37,4,91,68,2025,18.2,82,3,mazda glc custom l 378 | 31,4,91,68,1970,17.6,82,3,mazda glc custom 379 | 38,4,105,63,2125,14.7,82,1,plymouth horizon miser 380 | 36,4,98,70,2125,17.3,82,1,mercury lynx l 381 | 36,4,120,88,2160,14.5,82,3,nissan stanza xe 382 | 36,4,107,75,2205,14.5,82,3,honda accord 383 | 34,4,108,70,2245,16.9,82,3,toyota corolla 384 | 38,4,91,67,1965,15,82,3,honda civic 385 | 32,4,91,67,1965,15.7,82,3,honda civic (auto) 386 | 38,4,91,67,1995,16.2,82,3,datsun 310 gx 387 | 25,6,181,110,2945,16.4,82,1,buick century limited 388 | 38,6,262,85,3015,17,82,1,oldsmobile cutlass ciera (diesel) 389 | 26,4,156,92,2585,14.5,82,1,chrysler lebaron medallion 390 | 22,6,232,112,2835,14.7,82,1,ford granada l 391 | 32,4,144,96,2665,13.9,82,3,toyota celica gt 392 | 36,4,135,84,2370,13,82,1,dodge charger 2.2 393 | 27,4,151,90,2950,17.3,82,1,chevrolet camaro 394 | 27,4,140,86,2790,15.6,82,1,ford mustang gl 395 | 44,4,97,52,2130,24.6,82,2,vw pickup 396 | 32,4,135,84,2295,11.6,82,1,dodge rampage 397 | 28,4,120,79,2625,18.6,82,1,ford ranger 398 | 31,4,119,82,2720,19.4,82,1,chevy s-10 399 | -------------------------------------------------------------------------------- /data/Heart.csv: -------------------------------------------------------------------------------- 1 | row.names,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd 1,160,12,5.73,23.11,Present,49,25.3,97.2,52,1 2,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1 3,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0 4,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1 5,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1 6,132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0 7,142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0 8,114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1 9,114,0,3.83,19.4,Present,49,24.86,2.49,29,0 10,132,0,5.8,30.96,Present,69,30.11,0,53,1 11,206,6,2.95,32.27,Absent,72,26.81,56.06,60,1 12,134,14.1,4.44,22.39,Present,65,23.09,0,40,1 13,118,0,1.88,10.05,Absent,59,21.57,0,17,0 14,132,0,1.87,17.21,Absent,49,23.63,0.97,15,0 15,112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0 16,117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0 17,120,7.5,15.33,22,Absent,60,25.31,34.49,49,0 18,146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1 19,158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1 20,124,14,6.23,35.96,Present,45,30.09,0,59,1 
21,106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1 22,132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0 23,150,0.3,6.38,33.99,Present,62,24.64,0,50,0 24,138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0 25,142,18.2,4.34,24.38,Absent,61,26.19,0,50,0 26,124,4,12.42,31.29,Present,54,23.23,2.06,42,1 27,118,6,9.65,33.91,Absent,60,38.8,0,48,0 28,145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1 29,144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0 30,146,0,6.62,25.69,Absent,60,28.07,8.23,63,1 31,136,2.52,3.95,25.63,Absent,51,21.86,0,45,1 32,158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1 33,122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1 34,126,8.75,6.53,34.02,Absent,49,30.25,0,41,1 35,148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0 36,122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1 37,140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0 38,110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0 39,130,0,2.82,19.63,Present,70,24.86,0,29,0 40,136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1 41,118,0.28,5.8,33.7,Present,60,30.98,0,41,1 42,144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0 43,120,0,1.07,16.02,Absent,47,22.15,0,15,0 44,130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1 45,114,0,2.99,9.74,Absent,54,46.58,0,17,0 46,128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0 47,162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1 48,116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1 49,114,0,1.94,11.02,Absent,54,20.17,38.98,16,0 50,126,3.8,3.88,31.79,Absent,57,30.53,0,30,0 51,122,0,5.75,30.9,Present,46,29.01,4.11,42,0 52,134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0 53,152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1 54,134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1 55,156,3,1.82,27.55,Absent,60,23.91,54,53,0 56,152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0 57,118,0,2.99,16.17,Absent,49,23.83,3.22,28,0 58,126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1 59,103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0 60,121,0.8,5.29,18.95,Present,47,22.51,0,61,0 61,142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0 62,138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0 63,152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0 64,140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0 65,130,0,1.82,10.45,Absent,57,22.07,2.06,17,0 66,136,7.36,2.19,28.11,Present,61,25,61.71,54,0 67,124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0 68,112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0 69,118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0 70,122,0,3.37,16.1,Absent,67,21.06,0,32,1 71,118,0,3.67,12.13,Absent,51,19.15,0.6,15,0 72,130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0 73,130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0 74,126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0 75,128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0 76,136,0,4.12,17.42,Absent,52,21.66,12.86,40,0 77,134,0,5.9,30.84,Absent,49,29.16,0,55,0 78,140,0.6,5.56,33.39,Present,58,27.19,0,55,1 79,168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1 80,108,0.4,5.91,22.92,Present,57,25.72,72,39,0 81,114,3,7.04,22.64,Present,55,22.59,0,45,1 82,140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1 83,148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1 84,148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1 85,128,0,2.43,13.15,Present,63,20.75,0,17,0 86,130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0 87,126,10.5,4.49,17.33,Absent,67,19.37,0,49,1 88,140,0,5.08,27.33,Present,41,27.83,1.25,38,0 89,126,0.9,5.64,17.78,Present,55,21.94,0,41,0 90,122,0.72,4.04,32.38,Absent,34,28.34,0,55,0 91,116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0 92,120,3.7,4.02,39.66,Absent,61,30.57,0,64,1 93,143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0 94,118,4,3.95,18.96,Absent,54,25.15,8.33,49,1 
95,194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0 96,134,3,4.37,23.07,Absent,56,20.54,9.65,62,0 97,138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0 98,136,0,5,27.58,Present,49,27.59,1.47,39,0 99,122,3.2,11.32,35.36,Present,55,27.07,0,51,1 100,164,12,3.91,19.59,Absent,51,23.44,19.75,39,0 101,136,8,7.85,23.81,Present,51,22.69,2.78,50,0 102,166,0.07,4.03,29.29,Absent,53,28.37,0,27,0 103,118,0,4.34,30.12,Present,52,32.18,3.91,46,0 104,128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0 105,118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0 106,158,3.6,2.97,30.11,Absent,63,26.64,108,64,0 107,108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1 108,170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1 109,118,1,5.76,22.1,Absent,62,23.48,7.71,42,0 110,124,0,3.04,17.33,Absent,49,22.04,0,18,0 111,114,0,8.01,21.64,Absent,66,25.51,2.49,16,0 112,168,9,8.53,24.48,Present,69,26.18,4.63,54,1 113,134,2,3.66,14.69,Absent,52,21.03,2.06,37,0 114,174,0,8.46,35.1,Present,35,25.27,0,61,1 115,116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1 116,128,0,10.58,31.81,Present,46,28.41,14.66,48,0 117,140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1 118,154,0.7,5.91,25,Absent,13,20.6,0,42,0 119,150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1 120,130,0,3.92,25.55,Absent,68,28.02,0.68,27,0 121,128,2,6.13,21.31,Absent,66,22.86,11.83,60,0 122,120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0 123,120,0,5.01,26.13,Absent,64,26.21,12.24,33,0 124,138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1 125,153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0 126,123,8.6,11.17,35.28,Present,70,33.14,0,59,1 127,148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0 128,136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0 129,134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1 130,152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0 131,158,13.5,5.04,30.79,Absent,54,24.79,21.5,62,0 132,132,2,3.08,35.39,Absent,45,31.44,79.82,58,1 133,134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1 134,142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0 135,134,6,3.3,28.45,Absent,65,26.09,58.11,40,0 136,122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1 137,116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0 138,128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0 139,120,0,3.68,12.24,Absent,51,20.52,0.51,20,0 140,124,0,3.95,36.35,Present,59,32.83,9.59,54,0 141,160,14,5.9,37.12,Absent,58,33.87,3.52,54,1 142,130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1 143,128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0 144,130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0 145,109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0 146,144,0,3.84,18.72,Absent,56,22.1,4.8,40,0 147,118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0 148,136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1 149,136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1 150,124,15.5,5.05,24.06,Absent,46,23.22,0,61,1 151,148,6,6.49,26.47,Absent,48,24.7,0,55,0 152,128,6.6,3.58,20.71,Absent,55,24.15,0,52,0 153,122,0.28,4.19,19.97,Absent,61,25.63,0,24,0 154,108,0,2.74,11.17,Absent,53,22.61,0.95,20,0 155,124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1 156,138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1 157,127,0,2.81,15.7,Absent,42,22.03,1.03,17,0 158,174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0 159,122,0,3.05,23.51,Absent,46,25.81,0,38,0 160,144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1 161,126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0 162,208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1 163,138,0,2.68,17.04,Absent,42,22.16,0,16,0 164,148,0,3.84,17.26,Absent,70,20,0,21,0 165,122,0,3.08,16.3,Absent,43,22.13,0,16,0 166,132,7,3.2,23.26,Absent,77,23.64,23.14,49,0 167,110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1 
168,160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1 169,126,0.54,4.39,21.13,Present,45,25.99,0,25,0 170,162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0 171,194,2.55,6.89,33.88,Present,69,29.33,0,41,0 172,118,0.75,2.58,20.25,Absent,59,24.46,0,32,0 173,124,0,4.79,34.71,Absent,49,26.09,9.26,47,0 174,160,0,2.42,34.46,Absent,48,29.83,1.03,61,0 175,128,0,2.51,29.35,Present,53,22.05,1.37,62,0 176,122,4,5.24,27.89,Present,45,26.52,0,61,1 177,132,2,2.7,21.57,Present,50,27.95,9.26,37,0 178,120,0,2.42,16.66,Absent,46,20.16,0,17,0 179,128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0 180,108,15,4.91,34.65,Absent,41,27.96,14.4,56,0 181,166,0,4.31,34.27,Absent,45,30.14,13.27,56,0 182,152,0,6.06,41.05,Present,51,40.34,0,51,0 183,170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1 184,156,4,2.05,19.48,Present,50,21.48,27.77,39,1 185,116,8,6.73,28.81,Present,41,26.74,40.94,48,1 186,122,4.4,3.18,11.59,Present,59,21.94,0,33,1 187,150,20,6.4,35.04,Absent,53,28.88,8.33,63,0 188,129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0 189,134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0 190,126,0,5.98,29.06,Present,56,25.39,11.52,64,1 191,142,0,3.72,25.68,Absent,48,24.37,5.25,40,1 192,128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1 193,102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1 194,130,0,4.89,25.98,Absent,72,30.42,14.71,23,0 195,138,0.05,2.79,10.35,Absent,46,21.62,0,18,0 196,138,0,1.96,11.82,Present,54,22.01,8.13,21,0 197,128,0,3.09,20.57,Absent,54,25.63,0.51,17,0 198,162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0 199,160,3,9.19,26.47,Present,39,28.25,14.4,54,1 200,148,0,4.66,24.39,Absent,50,25.26,4.03,27,0 201,124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0 202,136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1 203,134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0 204,128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0 205,122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0 206,152,3,4.64,31.29,Absent,41,29.34,4.53,40,0 207,162,0,5.09,24.6,Present,64,26.71,3.81,18,0 208,124,4,6.65,30.84,Present,54,28.4,33.51,60,0 209,136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0 210,136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0 211,134,0.05,8.03,27.95,Absent,48,26.88,0,60,0 212,122,1,5.88,34.81,Present,69,31.27,15.94,40,1 213,116,3,3.05,30.31,Absent,41,23.63,0.86,44,0 214,132,0,0.98,21.39,Absent,62,26.75,0,53,0 215,134,0,2.4,21.11,Absent,57,22.45,1.37,18,0 216,160,7.77,8.07,34.8,Absent,64,31.15,0,62,1 217,180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1 218,124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0 219,114,0,4.97,9.69,Absent,26,22.6,0,25,0 220,208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0 221,138,0,3.14,12,Absent,54,20.28,0,16,0 222,164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1 223,144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0 224,136,7.5,7.39,28.04,Present,50,25.01,0,45,1 225,132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0 226,143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0 227,112,4.46,7.18,26.25,Present,69,27.29,0,32,1 228,134,10,3.79,34.72,Absent,42,28.33,28.8,52,1 229,138,2,5.11,31.4,Present,49,27.25,2.06,64,1 230,188,0,5.47,32.44,Present,71,28.99,7.41,50,1 231,110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1 232,136,13.2,7.18,35.95,Absent,48,29.19,0,62,0 233,130,1.75,5.46,34.34,Absent,53,29.42,0,58,1 234,122,0,3.76,24.59,Absent,56,24.36,0,30,0 235,138,0,3.24,27.68,Absent,60,25.7,88.66,29,0 236,130,18,4.13,27.43,Absent,54,27.44,0,51,1 237,126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0 238,176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0 239,122,0,5.49,19.56,Absent,57,23.12,14.02,27,0 240,124,0,3.23,9.64,Absent,59,22.7,0,16,0 
241,140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1 242,128,6,4.37,22.98,Present,50,26.01,0,47,0 243,190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0 244,144,0.76,10.53,35.66,Absent,63,34.35,0,55,1 245,126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1 246,128,0,2.63,23.88,Absent,45,21.59,6.54,57,0 247,136,0.4,3.91,21.1,Present,63,22.3,0,56,1 248,158,4,4.18,28.61,Present,42,25.11,0,60,0 249,160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0 250,124,6,5.21,33.02,Present,64,29.37,7.61,58,1 251,158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0 252,128,0,6.34,11.87,Absent,57,23.14,0,17,0 253,166,3,3.82,26.75,Absent,45,20.86,0,63,1 254,146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0 255,161,9,4.65,15.16,Present,58,23.76,43.2,46,0 256,164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1 257,146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1 258,142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0 259,138,12,5.13,28.34,Absent,59,24.49,32.81,58,1 260,154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0 261,118,0,2.39,12.13,Absent,49,18.46,0.26,17,1 263,124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0 264,124,1.04,2.84,16.42,Present,46,20.17,0,61,0 265,136,5,4.19,23.99,Present,68,27.8,25.86,35,0 266,132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1 267,118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0 268,118,0.12,4.16,9.37,Absent,57,19.61,0,17,0 269,134,12,4.96,29.79,Absent,53,24.86,8.23,57,0 270,114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0 271,136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1 272,130,0,4.16,39.43,Present,46,30.01,0,55,1 273,136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1 274,136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0 275,154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0 276,108,0.8,2.47,17.53,Absent,47,22.18,0,55,1 277,136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1 278,174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1 279,124,4.25,8.22,30.77,Absent,56,25.8,0,43,0 280,114,0,2.63,9.69,Absent,45,17.89,0,16,0 281,118,0.12,3.26,12.26,Absent,55,22.65,0,16,0 282,106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1 283,146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0 284,206,0,4.17,33.23,Absent,69,27.36,6.17,50,1 285,134,3,3.17,17.91,Absent,35,26.37,15.12,27,0 286,148,15,4.98,36.94,Present,72,31.83,66.27,41,1 287,126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0 288,134,0,3.69,13.92,Absent,43,27.66,0,19,0 289,134,0.02,2.8,18.84,Absent,45,24.82,0,17,0 290,123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0 291,112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1 292,112,0,1.71,15.96,Absent,42,22.03,3.5,16,0 293,101,0.48,7.26,13,Absent,50,19.82,5.19,16,0 294,150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0 295,170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1 296,134,0,5.63,29.12,Absent,68,32.33,2.02,34,0 297,142,0,4.19,18.04,Absent,56,23.65,20.78,42,1 298,132,0.1,3.28,10.73,Absent,73,20.42,0,17,0 299,136,0,2.28,18.14,Absent,55,22.59,0,17,0 300,132,12,4.51,21.93,Absent,61,26.07,64.8,46,1 301,166,4.1,4,34.3,Present,32,29.51,8.23,53,0 302,138,0,3.96,24.7,Present,53,23.8,0,45,0 303,138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1 304,170,0,3.12,37.15,Absent,47,35.42,0,53,0 305,128,0,8.41,28.82,Present,60,26.86,0,59,1 306,136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0 307,128,0,3.22,26.55,Present,39,26.59,16.71,49,0 308,150,14.4,5.04,26.52,Present,60,28.84,0,45,0 309,132,8.4,3.57,13.68,Absent,42,18.75,15.43,59,1 310,142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0 311,130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0 312,174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1 313,114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0 314,162,1.5,2.46,19.39,Present,49,24.32,0,59,1 
315,174,0,3.27,35.4,Absent,58,37.71,24.95,44,0 316,190,5.15,6.03,36.59,Absent,42,30.31,72,50,0 317,154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0 318,124,0,2.28,24.86,Present,50,22.24,8.26,38,0 319,114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0 320,168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1 321,142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0 322,154,0,4.81,28.11,Present,56,25.67,75.77,59,0 323,146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0 324,166,6,3.02,29.3,Absent,35,24.38,38.06,61,0 325,140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1 326,136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0 327,156,0,3.47,21.1,Absent,73,28.4,0,36,1 328,132,0,6.63,29.58,Present,37,29.41,2.57,62,0 329,128,0,2.98,12.59,Absent,65,20.74,2.06,19,0 330,106,5.6,3.2,12.3,Absent,49,20.29,0,39,0 331,144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0 332,154,0.31,2.33,16.48,Absent,33,24,11.83,17,0 333,126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0 334,134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1 335,152,19.45,4.22,29.81,Absent,28,23.95,0,59,1 336,146,1.35,6.39,34.21,Absent,51,26.43,0,59,1 337,162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0 338,130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1 339,138,6,7.24,37.05,Absent,38,28.69,0,59,0 340,148,0,5.32,26.71,Present,52,32.21,32.78,27,0 341,124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0 342,118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0 343,116,4.28,7.02,19.99,Present,68,23.31,0,52,1 344,162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1 345,138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0 346,137,1.2,3.14,23.87,Absent,66,24.13,45,37,0 347,198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1 348,154,4.5,4.75,23.52,Present,43,25.76,0,53,1 349,128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0 350,130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1 351,162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0 352,120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0 353,136,3.99,2.58,16.38,Present,53,22.41,27.67,36,0 354,176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1 355,134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1 356,122,1.7,5.28,32.23,Present,51,24.08,0,54,0 357,134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1 358,134,0,2.43,22.24,Absent,52,26.49,41.66,24,0 359,136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0 360,132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0 361,152,1.68,3.58,25.43,Absent,50,27.03,0,32,0 362,132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1 363,124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0 364,140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0 365,166,0.6,2.42,34.03,Present,53,26.96,54,60,0 366,156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1 367,132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0 368,150,0,4.99,27.73,Absent,57,30.92,8.33,24,0 369,134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0 370,126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0 371,148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0 372,148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1 373,132,6,5.97,25.73,Present,66,24.18,145.29,41,0 374,128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0 375,128,5.16,4.9,31.35,Present,57,26.42,0,64,0 376,140,0,2.4,27.89,Present,70,30.74,144,29,0 377,126,0,5.29,27.64,Absent,25,27.62,2.06,45,0 378,114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0 379,118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0 380,126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0 381,154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1 382,112,1.44,2.71,22.92,Absent,59,24.81,0,52,0 383,140,8,4.42,33.15,Present,47,32.77,66.86,44,0 384,140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1 385,128,2.6,4.94,21.36,Absent,61,21.3,0,31,0 386,126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0 
387,160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1 388,144,0,4.17,29.63,Present,52,21.83,0,59,0 389,148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1 390,146,0,4.92,18.53,Absent,57,24.2,34.97,26,0 391,164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1 392,130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1 393,154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1 394,178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0 395,180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0 396,134,12.5,2.73,39.35,Absent,48,35.58,0,48,0 397,142,0,3.54,16.64,Absent,58,25.97,8.36,27,0 398,162,7,7.67,34.34,Present,33,30.77,0,62,0 399,218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1 400,126,8.75,6.06,32.72,Present,33,27,62.43,55,1 401,126,0,3.57,26.01,Absent,61,26.3,7.97,47,0 402,134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0 403,132,0,4.17,36.57,Absent,57,30.61,18,49,0 404,178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1 405,208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1 406,160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0 407,116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0 408,180,25.01,3.7,38.11,Present,57,30.54,0,61,1 409,200,19.2,4.43,40.6,Present,55,32.04,36,60,1 410,112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0 411,120,0,3.1,26.97,Absent,41,24.8,0,16,0 412,178,20,9.78,33.55,Absent,37,27.29,2.88,62,1 413,166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0 414,164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1 415,216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1 416,146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0 417,134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1 418,158,16,5.56,29.35,Absent,36,25.92,58.32,60,0 419,176,0,3.14,31.04,Present,45,30.18,4.63,45,0 420,132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0 421,126,0,4.55,29.18,Absent,48,24.94,36,41,0 422,120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0 423,174,0,3.86,21.73,Absent,42,23.37,0,63,0 424,150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1 425,176,6,3.98,17.2,Present,52,21.07,4.11,61,1 426,142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1 427,132,0,3.3,21.61,Absent,42,24.92,32.61,33,0 428,142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0 429,146,1.16,2.28,34.53,Absent,50,28.71,45,49,0 430,132,7.2,3.65,17.16,Present,56,23.25,0,34,0 431,120,0,3.57,23.22,Absent,58,27.2,0,32,0 432,118,0,3.89,15.96,Absent,65,20.18,0,16,0 433,108,0,1.43,26.26,Absent,42,19.38,0,16,0 434,136,0,4,19.06,Absent,40,21.94,2.06,16,0 435,120,0,2.46,13.39,Absent,47,22.01,0.51,18,0 436,132,0,3.55,8.66,Present,61,18.5,3.87,16,0 437,136,0,1.77,20.37,Absent,45,21.51,2.06,16,0 438,138,0,1.86,18.35,Present,59,25.38,6.51,17,0 439,138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0 440,130,1.22,3.3,13.65,Absent,50,21.4,3.81,31,0 441,130,4,2.4,17.42,Absent,60,22.05,0,40,0 442,110,0,7.14,28.28,Absent,57,29,0,32,0 443,120,0,3.98,13.19,Present,47,21.89,0,16,0 444,166,6,8.8,37.89,Absent,39,28.7,43.2,52,0 445,134,0.57,4.75,23.07,Absent,67,26.33,0,37,0 446,142,3,3.69,25.1,Absent,60,30.08,38.88,27,0 447,136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0 448,142,0,4.32,25.22,Absent,47,28.92,6.53,34,1 449,130,0,1.88,12.51,Present,52,20.28,0,17,0 450,124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0 451,144,4,5.03,25.78,Present,57,27.55,90,48,1 452,136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0 453,120,0,2.77,13.35,Absent,67,23.37,1.03,18,0 454,154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0 455,124,1.6,7.22,39.68,Present,36,31.5,0,51,1 456,146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1 457,128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1 458,170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0 459,214,0.4,5.98,31.72,Absent,64,28.45,0,58,0 460,182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1 
461,108,3,1.59,15.23,Absent,40,20.09,26.64,55,0 462,118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0 463,132,0,4.82,33.41,Present,62,14.7,0,46,1 -------------------------------------------------------------------------------- /data/Income1.csv: -------------------------------------------------------------------------------- 1 | "","Education","Income" 2 | "1",10,26.6588387834389 3 | "2",10.4013377926421,27.3064353457772 4 | "3",10.8428093645485,22.1324101716143 5 | "4",11.2441471571906,21.1698405046065 6 | "5",11.6454849498328,15.1926335164307 7 | "6",12.0869565217391,26.3989510407284 8 | "7",12.4882943143813,17.435306578572 9 | "8",12.8896321070234,25.5078852305278 10 | "9",13.2909698996656,36.884594694235 11 | "10",13.7324414715719,39.666108747637 12 | "11",14.133779264214,34.3962805641312 13 | "12",14.5351170568562,41.4979935356871 14 | "13",14.9765886287625,44.9815748660704 15 | "14",15.3779264214047,47.039595257834 16 | "15",15.7792642140468,48.2525782901863 17 | "16",16.2207357859532,57.0342513373801 18 | "17",16.6220735785953,51.4909192102538 19 | "18",17.0234113712375,61.3366205527288 20 | "19",17.4648829431438,57.581988179306 21 | "20",17.866220735786,68.5537140185881 22 | "21",18.2675585284281,64.310925303692 23 | "22",18.7090301003344,68.9590086393083 24 | "23",19.1103678929766,74.6146392793647 25 | "24",19.5117056856187,71.8671953042483 26 | "25",19.9130434782609,76.098135379724 27 | "26",20.3545150501672,75.77521802986 28 | "27",20.7558528428094,72.4860553152424 29 | "28",21.1571906354515,77.3550205741877 30 | "29",21.5986622073579,72.1187904524136 31 | "30",22,80.2605705009016 32 | -------------------------------------------------------------------------------- /data/Income2.csv: -------------------------------------------------------------------------------- 1 | "","Education","Seniority","Income" 2 | "1",21.5862068965517,113.103448275862,99.9171726114381 3 | "2",18.2758620689655,119.310344827586,92.579134855529 4 | "3",12.0689655172414,100.689655172414,34.6787271520874 5 | "4",17.0344827586207,187.586206896552,78.7028062353695 6 | "5",19.9310344827586,20,68.0099216471551 7 | "6",18.2758620689655,26.2068965517241,71.5044853814318 8 | "7",19.9310344827586,150.344827586207,87.9704669939115 9 | "8",21.1724137931034,82.0689655172414,79.8110298331255 10 | "9",20.3448275862069,88.2758620689655,90.00632710858 11 | "10",10,113.103448275862,45.6555294997364 12 | "11",13.7241379310345,51.0344827586207,31.9138079371295 13 | "12",18.6896551724138,144.137931034483,96.2829968022869 14 | "13",11.6551724137931,20,27.9825049000603 15 | "14",16.6206896551724,94.4827586206897,66.601792415137 16 | "15",10,187.586206896552,41.5319924201478 17 | "16",20.3448275862069,94.4827586206897,89.00070081522 18 | "17",14.1379310344828,20,28.8163007592387 19 | "18",16.6206896551724,44.8275862068966,57.6816942573605 20 | "19",16.6206896551724,175.172413793103,70.1050960424457 21 | "20",20.3448275862069,187.586206896552,98.8340115435447 22 | "21",18.2758620689655,100.689655172414,74.7046991976891 23 | "22",14.551724137931,137.931034482759,53.5321056283034 24 | "23",17.448275862069,94.4827586206897,72.0789236655191 25 | "24",10.4137931034483,32.4137931034483,18.5706650327685 26 | "25",21.5862068965517,20,78.8057842852386 27 | "26",11.2413793103448,44.8275862068966,21.388561306174 28 | "27",19.9310344827586,168.965517241379,90.8140351180409 29 | "28",11.6551724137931,57.2413793103448,22.6361626208955 30 | "29",12.0689655172414,32.4137931034483,17.613593041445 31 | 
"30",17.0344827586207,106.896551724138,74.6109601985289 32 | --------------------------------------------------------------------------------