├── .gitignore ├── LICENSE ├── README.md ├── R_Exercises ├── Exercise1 │ └── Exercise1.Rmd ├── Exercise2 │ └── Exercise2.Rmd ├── Exercise3 │ ├── Exercise3.Rmd │ ├── ex_p1.JPG │ ├── ex_p2.jpg │ ├── ex_p3_part1.jpg │ ├── ex_p3_part2.jpg │ ├── ex_p4_part1.jpg │ ├── ex_p4_part2.jpg │ ├── ex_p4_part3.jpg │ └── ex_p4_part4.jpg ├── Exercise4 │ └── Exercise4.Rmd ├── Exercise5 │ └── Exercise5.Rmd ├── Exercise6 │ └── Exercise6.Rmd └── Exercise7 │ └── Exercise7.Rmd ├── R_Labs ├── Lab1 │ └── Lab1.Rmd ├── Lab2 │ └── Lab2.Rmd ├── Lab3 │ └── Lab3.Rmd ├── Lab4 │ └── Lab4.rmd ├── Lab5 │ └── Lab5.Rmd ├── Lab6 │ └── Lab6.Rmd └── Lab7 │ └── Lab7.Rmd └── data ├── Advertising.csv ├── Auto.csv ├── Auto.data ├── Ch10Ex11.csv ├── College.csv ├── Credit.csv ├── Heart.csv ├── Income1.csv └── Income2.csv /.gitignore: -------------------------------------------------------------------------------- 1 | # History files 2 | .Rhistory 3 | .Rdata 4 | *.Rproj 5 | *.md 6 | *.Md 7 | *.html 8 | *.png 9 | *figure/* 10 | 11 | # Example code in package build process 12 | *-Ex.R 13 | .Rproj.user 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2013 John St. John 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | IntroToStatisticalLearningR 2 | =========================== 3 | 4 | My work through the different examples given in www.StatLearning.com 5 | -------------------------------------------------------------------------------- /R_Exercises/Exercise1/Exercise1.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Exercise 1 2 | ======================================================== 3 | 4 | 5 | Conceptual 6 | --- 7 | 8 | 1. **_In general, do we expect the performance of a flexible statistical learning method to perform better or worse than an inflexible method when:_** 9 | 1. **_The sample size n is extremely large, and the number of predictors p is small?_** In this case, since we have so much data, I would expect that a more flexible model would perform better. 10 | 2. 
**_p is extremely large, and n is small?_** In this case, we are very prone to overfitting, a more inflexible method is much preffered. 11 | 3. **_relationship between predictors and response is highly non-linear?_** In this case, inflexible methods might force an unwarented linearity on the data, and underfit, so more flexible methods would be appropriate. 12 | 4. **_variance of the error terms is extremely high?_** In this case, highly flexible methods would be prone to fitting the error, and perform worse than less flexible methods. 13 | 2. **_Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p._** Note that inference is how Y changes as a function of X, while prediction is determining what Y is given X. 14 | 1. **_We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary._** This is an inference problem, we want to know how various variables effect salary, rather than simply being able to predict salary. We are inferring relationships to a continuous variable, so it is regression rather than classification. `n = 500, p = 3` (I think the 4th variable, CEO salary, being the output we want to predict, is not counted in p) 15 | 2. **_We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables._** This is a prediction problem, we only care about whether the outcome is success or failure. The outcome is binary, so this is a classification problem. `n = 20, p = 10+3 = 13` And one more for the outcome variable, which I am not counting in `p`. 16 | 3. **_We are interesting in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market._** This is an inference problem, we want to know the relationship between the US dollar and weekly changes in the world stock market. We are predicting a continuous variable, so it is a prediction problem. 17 | 3. **_We now revisit the bias-variance decomposition._** 18 | 1. **_Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a sin- gle plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one._** I am just going to describe what I would draw here. First off the Bayes error curve is easy. This is simply the costant horizontal line representing the Variance around the optimal decision boundary. This doesn't change for a given underlying true datagenerating function. The squared bias is going to decrease until it hits a minimum and then stay there, the variance will then take over and the more flexible methods will be fitting variance, which is going to start low, then go up in frequency. 
The training error is going to drop and get very low as flexiblilty increases, however the testing error is going to have a U shape, starting high probably, dropping down to some optimal minimum, then rising again as the more flexible models begin to overfit to the underlying variance in the training set. 19 | 2. **_Explain why each of the five curves has the shape displayed in part (1)._** See previous description that also had explanations. 20 | 4. **_You will now think of some real-life applications for statistical learning_** 21 | 1. **_Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer._** a) Predicting stock price gain vs loss, stock price gain or loss, daily news + twitter + other features, prediction. b) Determining which genes have expression that is useful for determining response to a drug, drug response or no response, gene expression levels for all genes, inference. c) Predicting which links a user will click on, success or failure of click, user history, context of add, etc, prediction. 22 | 2. **_Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer._** a) Predicting stock price change amplitude, stock price change, daily news + twitter + other features, prediction. b) Determining which genes have expression that is useful that correlate with PFS, expression correlation to PFS, gene expression levels for all genes, inference. c) Predicting what market price a house will sell for, house value in dolars, neighborhood + schools + other recent sales of similar homes in area, prediction. 23 | 3. **_Describe three real-life applications in which cluster analysis might be useful._** a) identify cancer sub-types and what genes drive those, classification, inference. b) identify web usage outlier weeks, web usage over each week, prediction c) identify groups of users that might have similar behaviour that is distinct in some useful way from other users, behaviour data of some sort, inference. 24 | 5. **_What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?_** More flexible approaches tend to be more useful for prediction than inference. Inference requires knowledge of how the result is a function of the data, Prediction can be a black box. Also more flexible methods are required when the underlying data varries significantly from linear assumptions. 25 | 6. **_Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a para- metric approach to regression or classification (as opposed to a non- parametric approach)? What are its disadvantages?_** Parametric approaches do not have as much of an issue with overfitting. They are more resitrictive, but as a result fewer observations of underlying data are required to fit them fairly well. Nonparametric methods are sometimes required when the underlying data is very different from what can be fit with parametric techniques, or when the underlying distribution is unknown but obviously not normally distributed. 26 | 7. 
**_The table below provides a training data set containing 6 observations, 3 predictors, and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors._** 27 | 1. **_Compute the Euclidean distance between each observation and the test point, X1=X2=X3=0._** euclidian distance is `sqrt((x1-x2)^2+...)`. 1) 0-0 0-3 0-0 = sqrt(9) = 3 2) 2 3) sqrt(10) = 3.16 4) sqrt(5) = 2.24 5) sqrt(2) = 1.141 6) sqrt(5) = 2.24 28 | 2. **_What is our prediction with K = 1? Why?_** just the closest point which is point 5, that has class Green. 29 | 3. **_What is our prediction with K = 3? Why?_** average of 3 closest points, 5,2, and 4/6 are equidistant, so include them both I guess. I have actually just searched this, the R version of KNN includes all ties by default, or chooses randomly K points when there are ties, and other methods simply do K-1,K-2 and so on until no more ties exist. I will do the R version (3 Red + 2Green) = Red 30 | 4. **_If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?_** The higher K is the closer the decision boundary is to being linear. Lower K can fit more irregular data like this. 31 | 32 | Applied 33 | -------- 34 | 35 | 8. 36 | 1. up until part 3. 37 | ```{r} 38 | college=read.csv("~/src/IntroToStatisticalLearningR/data/College.csv") 39 | rownames(college) <- college[,1] 40 | college <- college[,-1] 41 | summary(college) 42 | ``` 43 | Show a pairs plot 44 | 45 | ```{r fig.width=7, fig.height=6} 46 | pairs(college[,1:10]) 47 | ``` 48 | ```{r} 49 | Elite=rep("No",nrow(college)) 50 | Elite[college$Top10perc >50]="Yes" 51 | Elite=as.factor(Elite) 52 | college=data.frame(college,Elite) 53 | summary(Elite) 54 | ``` 55 | Elite college vs Outstate tuition 56 | ```{r fig.width=7, fig.height=6} 57 | plot(college$Elite, college$Outstate) 58 | ``` 59 | Some histograms of different variables 60 | ```{r fig.width=7, fig.height=6} 61 | par(mfrow=c(2,2)) 62 | hist(college$PhD) 63 | hist(college$Accept) 64 | hist(college$Enroll) 65 | hist(college$S.F.Ratio) 66 | ``` 67 | Hmm, lets plot enrollment vs applications by elite status 68 | ```{r fig.width=7, fig.height=6} 69 | par(mfrow=c(1,2)) 70 | plot(college$Enroll[college$Elite == 'Yes'], college$Apps[college$Elite == 'Yes'], main="Elite") 71 | plot(college$Enroll[college$Elite == 'No'], college$Apps[college$Elite == 'No'], main="Not Elite") 72 | ``` 73 | And with ratios 74 | ```{r fig.width=7, fig.height=6} 75 | par(mfrow=c(1,2)) 76 | hist(college$Enroll[college$Elite == 'Yes']/college$Apps[college$Elite == 'Yes'], main="Elite") 77 | hist(college$Enroll[college$Elite == 'No']/college$Apps[college$Elite == 'No'], main="Not Elite") 78 | ``` 79 | Doesn't seem to be the strongest signal with enrollment vs application number vs elite status. 80 | 81 | 9. 82 | ```{r} 83 | Auto = read.table("~/src/IntroToStatisticalLearningR/data/Auto.data", head=T, na.strings="?") 84 | Auto <- na.omit(Auto) 85 | summary(Auto) 86 | ``` 87 | 1. mpg, cylinders, displacement, horsepower, weight, acceleration, are quantative. origin, year, and name are qualitative. 88 | 2. mpg: 9-46, cylinders: 3-8, displacement: 68-455, horsepower: 46-230, weight:1613-5140, acceleration:8-24.8 89 | 3. means: `r apply(Auto[,1:6],2,mean)`, sds: `r apply(Auto[,1:6],2,sd)` 90 | 4. range: `r apply(Auto[-(10:85),1:6],2,range)` means: `r apply(Auto[,1:6],2,mean)`, sds: `r apply(Auto[,1:6],2,sd)` 91 | 5. 
92 | ```{r fig.width=11, fig.height=11} 93 | pairs(Auto) 94 | ``` 95 | I like the association with acceleration and year, the interesting thing is that it has the same trend as mpg and year. It basically seems that cars are becoming both more efficient, and simultaniously more fun, on average at least. 96 | 97 | 6. It looks like some linear combination of year, acceleration, and maybe one of displacement, horsepower or weight would combine to be a pretty good predictor. I would probably chose weight acceleration year as the three inputs. 98 | 99 | 100 | 10. 101 | ```{r} 102 | library(MASS) 103 | dim(Boston) 104 | summary(Boston) 105 | #?Boston 106 | ``` 107 | 1. There are 506 rows by 14 columns in the Boston dataset 108 | 2. 109 | 110 | ```{r fig.width=11, fig.height=11} 111 | pairs(Boston) 112 | ``` 113 | One interesting tidbit, as the proportion of black people goes up, the pupil teacher ratio goes down (more pupils per teacher) 114 | 115 | 3. As the proportion of owner occupied units built prior to 1940 goes up, crime rate also goes up 116 | 4. 117 | 118 | ```{r fig.width=11, fig.height=11} 119 | par(mfrow=c(2,2)) 120 | hist(Boston$crim, main="Crime Rates") 121 | hist(Boston$tax, main="Tax Rates") 122 | hist(Boston$ptratio, main="Pupil Teacher Ratio") 123 | library(scatterplot3d) 124 | scatterplot3d(log(Boston$crim),Boston$tax,Boston$ptratio,main="Boston Crime, Tax vs Pupil Teacher Ratio") 125 | ``` 126 | Most areas of boston have low crime rates, however a small number of areas in boston have very high crime rates, that variable exponentially declines. Tax rates are clearly bimodal. The highest tax rates are very seperate from the lower tax rates. The pupil teacher ratio seems to be normally distributed except for a handful of areas with very high pt ratios. 127 | 5. `r sum(Boston$chas)` suburbs border the boston river. 128 | 6. `r median(Boston$ptratio)` is the median pupil teacher ratio 129 | 7. These are the suburbs with the lowest median value of owner-occupied homes. 130 | ```{r} 131 | subset(Boston, medv == min(medv)) 132 | ``` 133 | 134 | 8. `r dim(subset(Boston, rm > 7))[1]` suburbs average more than 7 rooms per dwelling. 135 | 9. 136 | ```{r} 137 | library(plyr) 138 | Boston$O8 <- factor(ifelse(Boston$rm > 8, 1, 0)) 139 | ddply(Boston, .(O8), summary ) 140 | ``` 141 | Median crime is slightly higher in these over 8 room neighboorhods, but the mean is lower. Tax is also lower. pt ratio is lower in both mean and median. median value of homes the mean value of home is way higher. the age is also higher which seems interesting. These appear to be expensive, old neighborhoods. Proportion of non-retail business is substantially lower as well, these seem to be more residential in nature. -------------------------------------------------------------------------------- /R_Exercises/Exercise3/Exercise3.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to statistical learning exercise 3 2 | ======================================================== 3 | # Conceptual Section 4 | ****** 5 | ## Problem 1 6 | See work in ex_p1.jpg 7 | 8 | ******* 9 | ## Problem 2 10 | See work in ex_p2.jpg 11 | Also note that in answer to my question on there, yes we can remove that last term. Remember that we are maximizing (k), so we can remove any term we want that does not interact with k, and we are ok. The final term does have something to do with x, but not k, so it can be removed and the equation is still proportional. 
We could remove the summation term because for any particular k, it is the same, since it is a marginalization over all k. 12 | 13 | ******* 14 | ## Problem 3 15 | see the saved parts 1 and 2 images. The key thing here is that we can't remove the final term as we did in the previus part (the x^2/2sigma) that is now something dependent on class k, so we can't claim proportionality and remove the term when we want to identify the max. 16 | 17 | ******** 18 | ## Problem 4 19 | See the included `ex_p4_*` jpgs. 20 | 21 | ******** 22 | ## Problem 5 23 | ### Part a 24 | When the bayes decision boundary is linear (the optimal classifier) we would still predict QDA to fit the training set better since it can fit more of the error in the data. On the test set on the other hand, QDA will probably perform worse since it is modeling the error whenver it deviates from the simpler linear best fit. 25 | ### Part b 26 | If the bayes decision boundary is non-linear, we would expect QDA to perform better on both the training and test set, depending on the degree of non-linearity, and the number of cases in the test set. If the number of the samples are small, or the underlying model is nearly linear, it is still possible for LDA to perform better. 27 | ### Part c 28 | The test prediction accuracy of LDA and QDA should improve as n increases. Depending on the underlying model, if it is non-linear, then at some point QDA will learn things about the data that LDA can't model, and QDA will be better. On the other hand LDA will still do better if the data is modeled well by it, or n is on the smaller size. It will take a lot more observations to fit QDA equally well to LDA since QDA is quadratic and LDA is linear. 29 | ### Part d 30 | TRUE: QDA can modle a linear boundary, so it will fit whatever linearness is in the training data. It can do more though, so it can additionally fit some of the additional residual error that the linear model wouldn't be able to handle as well. Thus it is superior on fittin the training data. The testing data on the other hand is a different story. 31 | 32 | ********* 33 | ## Problem 6 34 | ### Part a 35 | $p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$ and we have that $\hat\beta_0=-6, \hat\beta_1=0.05, \hat\beta_2=1$. This gives us $p(gets\ an\ A)=$ `r exp(-6+40*0.05+1*3.5)/(1+exp(-6+40*0.05+1*3.5))`. 36 | ### Part b 37 | $log(\frac{p(X)}{1-p(X)})=\beta_0+\beta_1X+\beta_2X$ and we want $p(X)=0.5$. $log(\frac{0.5}{1-0.5})=0$. So we need to solve $\frac{6-3.5}{0.05}=hours\ required=$ `r (6-3.5)/0.05`. Indeed plugging 50 hours into the above equation comes out to `r exp(-6+50*0.05+1*3.5)/(1+exp(-6+50*0.05+1*3.5))`. Apparently 10 more hours of work would have given this student a coin toss chance at getting an A! 38 | 39 | ********* 40 | ## Problem 7 41 | $P(Yes|X) = \frac{P(X|Yes)P(Yes)}{P(X)}$ And we are given that $P(Yes)=0.8$, Also we can find $P(X)$ by marginalizing over the two possibilities, $Yes$ and $No$. $P(X)=P(X|Yes)P(Yes)+P(X|No)P(No)$. $P(X|Yes)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/2\sigma^2}$ $x=7, \mu_{Yes}=10, \mu_{No}=0, P(Yes)=.8, P(No)=.2, \sigma^2_{Yes}=\sigma^2_{No}=36$. This gives us $P(Yes|X)=$ `r pxyes <- 1/sqrt(2*pi*36)*exp(-((7-10)**2)/(2*36)); pxno <- 1/sqrt(2*pi*36)*exp(-((7-0)**2)/(2*36)); pyes=.8; pno=.2; (pxyes*pyes)/(pxyes*pyes+pxno*pno)`. 42 | 43 | ********** 44 | ## Problem 8 45 | KNN with $K=1$ by design will not missclassify anything in the training set. 
The real test for KNN, especially at the most permissive setting of $K=1$, comes when you try to classify new observations. Logistic regression, on the other hand, fits a parametric model to the training set, so the training error provides _some_ indication of how well the model fits the data, and a little insight into how it might perform on future data. 46 | 47 | Let's consider a dataset with 1000 examples, split in half: 500 examples to train on and 500 to test on. With KNN and $K=1$, an average error of 18% means 180 misclassified examples out of the 1000, and all of them must come from the 500 test cases, since $K=1$ misclassifies nothing in the training set. The test error for KNN is therefore 180/500, or 36%, compared to the 30% test error of logistic regression. KNN misclassified 180 test cases, while logistic regression misclassified fewer (150) in this example. 48 | 49 | 50 | ************* 51 | ## Problem 9 52 | ### Part a 53 | $\frac{p}{1-p}=odds$. For example, odds of 0.37 correspond to roughly 37 events for every 100 non-events, i.e. a probability of $\frac{37}{137}$, or about 27%. 54 | ### Part b 55 | The odds for an event with probability $p$ are $\frac{p}{1-p}$. So if the probability is 0.16, then the odds for this event are `r .16/(1-.16)`. 56 | 57 | 58 | ************* 59 | # Applied Section 60 | ************* 61 | 62 | ## Problem 10 63 | 64 | > This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010. 65 | 66 | ```{r} 67 | library(MASS) 68 | library(ISLR) 69 | ``` 70 | 71 | ### Part a 72 | > Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns? 73 | 74 | ```{r fig.width=11, fig.height=11} 75 | pairs(Weekly) 76 | cor(Weekly[,-ncol(Weekly)]) 77 | ``` 78 | 79 | There is high correlation between Volume and Year: volume increases roughly exponentially as a function of year. Certain years seem to have more or less variation than other years; notice the violin shape of the various Lag features plotted against Year. There does seem to be some autocorrelation in the variability of the Lag variables across years. Perhaps some years people are more skittish than others, and this takes a while to wear off: cyclical skittishness of some sort. There appears to be very little if any correlation between each Lag and the other lags. Direction appears slightly skewed by a few of the lags, perhaps Lag5 and Lag1. 80 | 81 | ### Part b 82 | > Use the full dataset to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones? 83 | 84 | ```{r} 85 | logit.fit = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, family=binomial, data=Weekly) 86 | contrasts(Weekly$Direction) 87 | summary(logit.fit) 88 | ``` 89 | 90 | The Lag2 variable, and the intercept, appear to be significant. 91 | 92 | ### Part c 93 | > Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
94 | 95 | ```{r} 96 | glm.probs=predict(logit.fit,Weekly,type="response") 97 | glm.pred=rep("Down",nrow(Weekly)) 98 | glm.pred[glm.probs > 0.50]="Up" 99 | table(glm.pred,Weekly$Direction) 100 | mean(glm.pred==Weekly$Direction) 101 | ``` 102 | 103 | The confusion matrix is telling us that the model does not fir the data particularly well. The "Up" direction is guessed most of the time. Most of the mistakes come from guessing that the market is going to go up when it really is going to go down. When Up is guessed, it is right `r 557/(430+557)` of the time, when down is guessed it is right `r 54/(54+48)` of the time. 104 | 105 | ### Part d 106 | > Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010). 107 | 108 | ```{r} 109 | train=Weekly$Year <= 2008 110 | Weekly.test=Weekly[!train,] 111 | logit.fit = glm(Direction ~ Lag2, family=binomial, data=Weekly, subset=train) 112 | contrasts(Weekly$Direction) 113 | summary(logit.fit) 114 | glm.probs=predict(logit.fit,Weekly.test,type="response") 115 | glm.pred=rep("Down",nrow(Weekly.test)) 116 | glm.pred[glm.probs > 0.50]="Up" 117 | table(glm.pred,Weekly.test$Direction) 118 | mean(glm.pred==Weekly.test$Direction) 119 | ``` 120 | 121 | 122 | 123 | ### Part e 124 | > Repeat (d) using LDA. 125 | 126 | ```{r} 127 | lda.fit = lda(Direction ~ Lag2, data=Weekly, subset=train) 128 | lda.fit 129 | lda.class=predict(lda.fit,Weekly.test)$class 130 | table(lda.class,Weekly.test$Direction) 131 | mean(lda.class==Weekly.test$Direction) 132 | ``` 133 | 134 | This one performed identicaly to logistic regression. 135 | 136 | ### Part f 137 | > Repeat (d) using QDA. 138 | 139 | ```{r} 140 | qda.fit = qda(Direction ~ Lag2, data=Weekly, subset=train) 141 | qda.fit 142 | qda.class=predict(qda.fit,Weekly.test)$class 143 | table(qda.class,Weekly.test$Direction) 144 | mean(qda.class==Weekly.test$Direction) 145 | ``` 146 | 147 | Interestingly it seems that QDA overfit this variable. LDA/logistic regression performs better on the test data. 148 | 149 | ### Part g 150 | > Repeat (d) using KNN with K = 1. 151 | 152 | ```{r} 153 | library(class) 154 | train.X=Weekly[train,"Lag2",drop=F] 155 | test.X=Weekly[!train,"Lag2",drop=F] 156 | train.Direction=Weekly[train,"Direction",drop=T] 157 | test.Direction=Weekly[!train,"Direction",drop=T] 158 | set.seed(1) 159 | knn.pred=knn(train.X,test.X,train.Direction,k=1) 160 | table(knn.pred,test.Direction) 161 | mean(knn.pred==test.Direction) 162 | ``` 163 | 164 | ### Part h 165 | > Which of these methods appears to provide the best results on this data? 166 | 167 | KNN is totally random looking. QDA appears to overfit the data slightly more than LDA and Logistic Regression, which perform equally well on the test data. 168 | 169 | ### Part i 170 | > Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier. 171 | 172 | KNN with `K=4` performs pretty well at 0.57. 
173 | ```{r} 174 | set.seed(1) 175 | knn.pred=knn(train.X,test.X,train.Direction,k=4) 176 | table(knn.pred,test.Direction) 177 | mean(knn.pred==test.Direction) 178 | ``` 179 | 180 | QDA appears to perform worse as we add in more variables, with Lag1, Lag2 and Volume it goes down to 0.46. With Lag1 and Lag2 it is a little better, at 0.55, but still Lag2 by itself is pretty good. 181 | ```{r} 182 | qda.fit = qda(Direction ~ Lag2, data=Weekly, subset=train) 183 | qda.fit 184 | qda.class=predict(qda.fit,Weekly.test)$class 185 | table(qda.class,Weekly.test$Direction) 186 | mean(qda.class==Weekly.test$Direction) 187 | ``` 188 | 189 | Logistic regression also seems to perform worse with more variables thrown in. Lag2 seems to be a pretty good fit. 190 | 191 | ```{r} 192 | train=Weekly$Year <= 2008 193 | Weekly.test=Weekly[!train,] 194 | logit.fit = glm(Direction ~ Lag1+Lag2+Volume, family=binomial, data=Weekly, subset=train) 195 | contrasts(Weekly$Direction) 196 | summary(logit.fit) 197 | glm.probs=predict(logit.fit,Weekly.test,type="response") 198 | glm.pred=rep("Down",nrow(Weekly.test)) 199 | glm.pred[glm.probs > 0.50]="Up" 200 | table(glm.pred,Weekly.test$Direction) 201 | mean(glm.pred==Weekly.test$Direction) 202 | ``` 203 | 204 | ************** 205 | ## Problem 11 206 | > In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set. 207 | ### Part a) 208 | > Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. 209 | 210 | ```{r} 211 | library(MASS) 212 | library(ISLR) 213 | Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg),1,0) 214 | ``` 215 | 216 | ### Part b) 217 | > Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this ques- tion. Describe your findings. 218 | 219 | ```{r fig.width=11,fig.height=11} 220 | pairs(Auto[,-9]) 221 | ``` 222 | Horsepower, displacement, weight, and acceleration look the most promissing. However these variables are all fairly correlated/anti-correlated. 223 | 224 | ### Part c) 225 | > Split the data into a training set and a test set. 226 | 227 | ```{r} 228 | set.seed(1) 229 | rands <- rnorm(nrow(Auto)) 230 | test <- rands > quantile(rands,0.75) 231 | train <- !test 232 | Auto.train <- Auto[train,] 233 | Auto.test <- Auto[test,] 234 | ``` 235 | 236 | 237 | ### Part d) 238 | > Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? 239 | 240 | ```{r} 241 | lda.fit = lda(mpg01 ~ horsepower+weight+acceleration, data=Auto.train) 242 | lda.fit 243 | lda.class=predict(lda.fit,Auto.test)$class 244 | table(lda.class,Auto.test$mpg01) 245 | mean(lda.class==Auto.test$mpg01) 246 | ``` 247 | LDA achieved 88.8% test accuracy. 248 | 249 | ### Part e) 250 | > Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? 
251 | 252 | ```{r} 253 | qda.fit = qda(mpg01 ~ horsepower+weight+acceleration, data=Auto.train) 254 | qda.fit 255 | qda.class=predict(qda.fit,Auto.test)$class 256 | table(qda.class,Auto.test$mpg01) 257 | mean(qda.class==Auto.test$mpg01) 258 | ``` 259 | QDA performed a little better, and achieved 92.9% accuracy on the test set. 260 | 261 | ### Part f) 262 | > Perform logistic regression on the training data in order to pre- dict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? 263 | 264 | ```{r} 265 | logit.fit = glm(mpg01 ~ horsepower+weight+acceleration, family=binomial, data=Auto.train) 266 | summary(logit.fit) 267 | glm.probs=predict(logit.fit,Auto.test,type="response") 268 | glm.pred=rep(0,nrow(Auto.test)) 269 | glm.pred[glm.probs > 0.50]=1 270 | table(glm.pred,Auto.test$mpg01) 271 | mean(glm.pred==Auto.test$mpg01) 272 | ``` 273 | Recompiling this a few times, I see that the accuracy and everything fluctuates a bit. Sometimes LDA and Logistic Regression do the same, sometimes Logistic Regression does a little worse, sometimes LDA a little better. QDA seems to fairly consistently perform the best. 274 | 275 | ### Part g) 276 | > Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set? 277 | 278 | ```{r} 279 | set.seed(1) 280 | train.Auto = Auto.train[,c("horsepower","weight","acceleration")] 281 | test.Auto = Auto.test[,c("horsepower","weight","acceleration")] 282 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=1) 283 | table(knn.pred,Auto.test$mpg01) 284 | mean(knn.pred==Auto.test$mpg01) 285 | 286 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=2) 287 | table(knn.pred,Auto.test$mpg01) 288 | mean(knn.pred==Auto.test$mpg01) 289 | 290 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=3) 291 | table(knn.pred,Auto.test$mpg01) 292 | mean(knn.pred==Auto.test$mpg01) 293 | 294 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=4) 295 | table(knn.pred,Auto.test$mpg01) 296 | mean(knn.pred==Auto.test$mpg01) 297 | 298 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=5) 299 | table(knn.pred,Auto.test$mpg01) 300 | mean(knn.pred==Auto.test$mpg01) 301 | 302 | knn.pred=knn(train.Auto,test.Auto,Auto.train$mpg01,k=11) 303 | table(knn.pred,Auto.test$mpg01) 304 | mean(knn.pred==Auto.test$mpg01) 305 | ``` 306 | 307 | Interestingly at least in one case, KNN with K=3 outperforms all other models. K=4 and K=5 perform similarly well. 308 | 309 | ************ 310 | ## Problem 12 311 | ### Part a) 312 | > Write a function, Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute 23 and print out the results Hint: Recall that x^a raises x to the power a. Use the print() function to output the result. 313 | 314 | ```{r} 315 | Power <- function(){ 316 | print(2^3) 317 | } 318 | Power() 319 | ``` 320 | 321 | ### Part b) 322 | > Create a new function, Power2(), that allows you to pass any two numbers, x and a, and prints out the value of x^a. You can do this by beginning your function with the line You should be able to call your function by entering, for instance, 323 | `Power2(3,8)` on the command line. This should output the value of 38, namely, 6561. 
324 | 325 | ```{r} 326 | Power2 <- function(x,a){ 327 | print(x^a) 328 | } 329 | Power2(3,8) 330 | ``` 331 | 332 | ### Part c) 333 | > Using the Power2() function that you just wrote, compute 103, 817, and 1313. 334 | 335 | ```{r} 336 | Power2(10,3) 337 | Power2(8,17) 338 | Power2(131,3) 339 | ``` 340 | 341 | ### Part d) 342 | > Now create a new function, Power3(), that actually returns the result x^a as an R object, rather than simply printing it to the screen. That is, if you store the value x^a in an object called result within your function, then you can simply return() this result, using the following line: 343 | return(result) 344 | The line above should be the last line in your function, before the } symbol. 345 | 346 | ```{r} 347 | Power3 <- function(x,a){ 348 | return(x^a) 349 | } 350 | ``` 351 | 352 | ### Part e) 353 | > NowusingthePower3()function,createaplotoff(x)=x2.The x-axis should display a range of integers from 1 to 10, and the y-axis should display x2. Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the x-axis, the y-axis, or both on the log-scale. You can do this by using log="x", log="y", or log="xy" as arguments to the plot() function. 354 | 355 | ```{r fig.width=7,fig.height=5} 356 | plot(seq(1,10), 357 | sapply(seq(1,10), function(x) Power3(x,2)), 358 | log="y", 359 | main="Plotting x vs x**2", 360 | xlab="x", 361 | ylab="x**2") 362 | ``` 363 | 364 | ### Part f) 365 | > Create a function, PlotPower(), that allows you to create a plot of x against x^a for a fixed a and for a range of values of x. For instance, if you call 366 | > PlotPower(1:10,3) 367 | then a plot should be created with an x-axis taking on values 1,2,...,10, and a y-axis taking on values 13,23,...,103. 368 | 369 | ```{r fig.width=7,fig.height=5} 370 | PlotPower <- function(x,a){ 371 | plot(x, 372 | sapply(x, function(z) Power3(z,a)), 373 | log="y", 374 | main=sprintf("Plotting x vs x**%d",a), 375 | xlab="x", 376 | ylab=sprintf("x**%d",a)) 377 | } 378 | PlotPower(1:10,3) 379 | ``` 380 | 381 | ************* 382 | ## Problem 13 383 | > Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings. 
384 | 385 | ```{r fig.width=15, fig.height=15} 386 | Boston$crim01 <- as.numeric(Boston$crim > median(Boston$crim)) 387 | # as.numeric converts FALSE to 0 and TRUE to 1 388 | 389 | set.seed(1) 390 | rands <- rnorm(nrow(Boston)) 391 | test <- rands > quantile(rands,0.75) 392 | train <- !test 393 | Boston.train <- Boston[train,] 394 | Boston.test <- Boston[test,] 395 | 396 | Boston.train.fact <- Boston.train 397 | Boston.train.fact$crim01 <- factor(Boston.train.fact$crim01) 398 | library(GGally) 399 | ggpairs(Boston.train.fact, colour='crim01') 400 | #pairs(Boston.train) 401 | 402 | #We should explore "black" 403 | # "ptratio" "rad" "dis" "nox, and "zn" "lstat" "rm" 404 | 405 | ######################## 406 | # Logistic Regression 407 | glm.fit=glm(crim01~lstat+rm+zn+nox+dis+rad+ptratio+black+medv+age+chas+indus+tax, data=Boston.train) 408 | summary(glm.fit) 409 | #NOX,RAD,MEDV,AGE,TAX look good 410 | glm.probs=predict(glm.fit,Boston.test,type="response") 411 | glm.pred=rep(0,nrow(Boston.test)) 412 | glm.pred[glm.probs > 0.50]=1 413 | table(glm.pred,Boston.test$crim01) 414 | mean(glm.pred==Boston.test$crim01) 415 | 416 | glm.fit=glm(crim01~nox+rad+medv+age+tax, data=Boston.train) 417 | summary(glm.fit) 418 | #NOX,RAD,MEDV,AGE,TAX look good 419 | glm.probs=predict(glm.fit,Boston.test,type="response") 420 | glm.pred=rep(0,nrow(Boston.test)) 421 | glm.pred[glm.probs > 0.50]=1 422 | table(glm.pred,Boston.test$crim01) 423 | mean(glm.pred==Boston.test$crim01) 424 | 425 | #ptratio helps a bit, but the nox*dis helps quite a bit 426 | glm.fit=glm(crim01~nox*dis+medv:tax+rad+age, data=Boston.train) 427 | summary(glm.fit) 428 | #NOX,RAD,MEDV,AGE,TAX look good 429 | glm.probs=predict(glm.fit,Boston.test,type="response") 430 | glm.pred=rep(0,nrow(Boston.test)) 431 | glm.pred[glm.probs > 0.50]=1 432 | table(glm.pred,Boston.test$crim01) 433 | mean(glm.pred==Boston.test$crim01) 434 | 435 | #indus brings it back down a bit 436 | glm.fit=glm(crim01~nox+rad+medv+age+tax+ptratio+indus, data=Boston.train) 437 | summary(glm.fit) 438 | #NOX,RAD,MEDV,AGE,TAX look good 439 | glm.probs=predict(glm.fit,Boston.test,type="response") 440 | glm.pred=rep(0,nrow(Boston.test)) 441 | glm.pred[glm.probs > 0.50]=1 442 | table(glm.pred,Boston.test$crim01) 443 | mean(glm.pred==Boston.test$crim01) 444 | 445 | #indus by itslef doesn't help much 446 | glm.fit=glm(crim01~nox+rad+medv+age+tax+indus, data=Boston.train) 447 | summary(glm.fit) 448 | #NOX,RAD,MEDV,AGE,TAX look good 449 | glm.probs=predict(glm.fit,Boston.test,type="response") 450 | glm.pred=rep(0,nrow(Boston.test)) 451 | glm.pred[glm.probs > 0.50]=1 452 | table(glm.pred,Boston.test$crim01) 453 | mean(glm.pred==Boston.test$crim01) 454 | 455 | 456 | ######################## 457 | # LDA 458 | lda.fit=lda(crim01~nox+rad+medv+age+tax+ptratio, data=Boston.train) 459 | lda.fit 460 | #NOX,RAD,MEDV,AGE,TAX look good, ptratio seems to help also 461 | lda.pred=predict(lda.fit,Boston.test)$class 462 | table(lda.pred,Boston.test$crim01) 463 | mean(lda.pred==Boston.test$crim01) 464 | 465 | ######################## 466 | # KNN 467 | set.seed(1) 468 | train.Boston = Boston.train[,c("nox","rad","medv","age","tax","ptratio")] 469 | test.Boston = Boston.test[,c("nox","rad","medv","age","tax","ptratio")] 470 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=1) 471 | table(knn.pred,Boston.test$crim01) 472 | mean(knn.pred==Boston.test$crim01) 473 | 474 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=2) 475 | table(knn.pred,Boston.test$crim01) 476 | 
mean(knn.pred==Boston.test$crim01) 477 | 478 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=3) 479 | table(knn.pred,Boston.test$crim01) 480 | mean(knn.pred==Boston.test$crim01) 481 | 482 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=4) 483 | table(knn.pred,Boston.test$crim01) 484 | mean(knn.pred==Boston.test$crim01) 485 | 486 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=5) 487 | table(knn.pred,Boston.test$crim01) 488 | mean(knn.pred==Boston.test$crim01) 489 | 490 | 491 | knn.pred=knn(train.Boston,test.Boston,Boston.train$crim01,k=11) 492 | table(knn.pred,Boston.test$crim01) 493 | mean(knn.pred==Boston.test$crim01) 494 | 495 | 496 | 497 | ``` 498 | 499 | So the best I could get LDA/logistic regression was 89%. Using the features optimized with logistic regression I was able to get KNN to perform better, returning a model that got up to 92%. K=1 got to 93%, but K=3 was nearly as good, and the higher K might be more robust going forward. 500 | -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p1.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p1.JPG -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p2.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p3_part1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p3_part1.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p3_part2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p3_part2.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p4_part1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part1.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p4_part2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part2.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise3/ex_p4_part3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part3.jpg -------------------------------------------------------------------------------- 
/R_Exercises/Exercise3/ex_p4_part4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jstjohn/IntroToStatisticalLearningR-/b93bf196a4e0dfd0fb070dd68643d031d77902b0/R_Exercises/Exercise3/ex_p4_part4.jpg -------------------------------------------------------------------------------- /R_Exercises/Exercise4/Exercise4.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to statistical learning exercise 4 2 | ======================================================== 3 | 4 | # Conceptual Section 5 | ********* 6 | ## Problem 1. 7 | > Using basic statistical properties of the variance, as well as single-variable calculus, derive (5.6). In other words, prove that α given by (5.6) does indeed minimize $Var(\alpha X + (1 - \alpha)Y)$. 8 | 9 | Expanding the variance gives $Var(\alpha X+(1-\alpha)Y)=\alpha^2\sigma^2_X+(1-\alpha)^2\sigma^2_Y+2\alpha(1-\alpha)\sigma_{XY}$; setting its derivative with respect to $\alpha$ equal to zero and solving yields the minimizer given in (5.6): 10 | $\alpha=\frac{\sigma_Y^2-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}=\frac{Var(Y)-Cov(X,Y)}{Var(X)+Var(Y)-2Cov(X,Y)}$ 11 | 12 | ******** 13 | ## Problem 2. 14 | > We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations. 15 | 16 | ### Part a) 17 | > What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer. 18 | 19 | There are $n$ observations in the original sample. Since bootstrap sampling draws items with replacement, we are sampling from the same pool with the same probabilities every time. There are $n-1$ items among the $n$ that are not $j$, so there is an $\frac{n-1}{n}$ chance that the first item is not $j$. 20 | 21 | ### Part b) 22 | > What is the probability that the second bootstrap observation is not the jth observation from the original sample? 23 | 24 | Since we draw with replacement, it is the same as above. 25 | 26 | ### Part c) 27 | > Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$. 28 | 29 | Note that $\frac{n-1}{n}=1-\frac{1}{n}$. Also, with the bootstrap we make $n$ draws. That means there are $n$ chances to draw something other than $j$, all of which have to succeed for $j$ not to be in the bootstrap sample. This is a simple product of $n$ of these probabilities, which can be written as $(1-\frac{1}{n})^n$. 30 | 31 | ### Part d) 32 | > When n = 5, what is the probability that the jth observation is in the bootstrap sample? 33 | 34 | This is 1 minus the probability that the jth observation is _not_ in the bootstrap sample: `r 1-((1-1/5)^5)` 35 | 36 | ### Part e) 37 | > When n = 100, what is the probability that the jth observation is in the bootstrap sample? 38 | 39 | Calculated as above: `r 1-((1-1/100)^100)` 40 | 41 | ### Part f) 42 | > When n = 10,000, what is the probability that the jth observation is in the bootstrap sample? 43 | 44 | `r 1-((1-1/10000)^10000)` 45 | 46 | ### Part g) 47 | > Create a plot that displays, for each integer value of n from 1 to 100,000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe. 48 | 49 | ```{r fig.width=7, fig.height=5} 50 | x=seq(1,100000) 51 | y=sapply(x,function(n){1-((1-(1/n))^n)}) 52 | plot(x,y,xlab="n",ylab="Probability jth observation is in the bootstrap sample",log="x") 53 | ``` 54 | 55 | The probability seems to converge on something around 0.63 fairly quickly, by about n=100, and then stay there!
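That plateau is consistent with the limit $\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^n=e^{-1}$, so the inclusion probability tends to $1-e^{-1}\approx 0.632$. The short chunk below (an added check, not part of the original exercise) compares the limiting value with the exact finite-$n$ probability at a large $n$:

```{r}
# limiting inclusion probability 1 - 1/e vs. the exact value at n = 100,000
c(limit = 1 - exp(-1), n_1e5 = 1 - (1 - 1/1e5)^1e5)
```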
56 | 57 | That is very odd that there is always a 63% chance that any particular thing will be in the bootstrap sample even with large datasets. 58 | 59 | ### Part h) 60 | > We will now investigate numerically the probability that a boot- strap sample of size n = 100 contains the jth observation. Here j = 4. We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample. 61 | 62 | ```{r} 63 | store=rep(NA, 10000) 64 | for(i in 1:10000){ 65 | store[i]=sum(sample(1:100, rep=TRUE)==4)>0 66 | } 67 | mean(store) 68 | ``` 69 | 70 | 71 | > Comment on the results obtained. 72 | 73 | This made a list of length 10,000, and each time sampled 0-100 with replacement and checked to see if 4 is in the list. Interestingly 63% of the time, the list contains the number 4. 74 | 75 | 76 | ************ 77 | ## Problem 3. 78 | > We now review k-fold cross-validation. 79 | ### Part a) 80 | > Explain how k-fold cross-validation is implemented. 81 | 82 | You take your dataset, and do a train/test split where you train on $\frac{k-1}{k}$ and test on the remaining $\frac{1}{k}$ of the dataset. You re-do this procedure $k$ times and then can explore the variability in the obtained results on the various test sets. 83 | 84 | ### Part b) 85 | > What are the advantages and disadvantages of k-fold cross validation relative to: 86 | i. The validation set approach? 87 | ii. LOOCV? 88 | 89 | k fold cv allows you to use more of your data in training than the validation set approach. Also you get to see how well the model performs on more of the dataset, so you get to see the variability in test errors on different subsets of data. 90 | 91 | LOOCV is a special instance of k fold cv where k=n. Lower values of k are faster to compute since you do not need to do n different fits. There of course is the special case though where you can do the computational shoortcut with LOOCV on least-squares fit models given in equation 5.2. More generally though smaller k (typically 5 or 10) has much better performance than k=n. 92 | 93 | k-fold cv has another benefit though described in section 5.1.4. LOOCV has higher variance than k fold cv with $k Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our prediction. 98 | 99 | One way to do this would be with the bootstrap. We can train on a bunch of different random samplings of the original data, and see how much the estimates change. 100 | 101 | ************ 102 | # Applied 103 | ************* 104 | ## Problem 5. 105 | > In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis. 106 | 107 | ### Part a) 108 | > Fit a multiple logistic regression model that uses income and balance to predict the probability of default, using only the observations. 109 | 110 | ```{r} 111 | library(ISLR) 112 | set.seed(1) 113 | glm.fit=glm(default~income+balance,data=Default, family="binomial") 114 | ``` 115 | 116 | ### Part b) 117 | > Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: 118 | i. Split the sample set into a training set and a validation set. 119 | ii. 
Fit a multiple logistic regression model using only the train- 120 | ing observations. 121 | iii. Obtain a prediction of default status for each individual in 122 | the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability equals 0.5. 123 | iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. 124 | 125 | ```{r} 126 | set.seed(1) 127 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 128 | Default.train=Default[train,] 129 | Default.test=Default[-train,] 130 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 131 | glm.probs=predict(glm.fit,Default.test,type="response") 132 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 133 | mean(glm.pred!=Default.test$default) 134 | ``` 135 | 136 | ### Part c) 137 | > Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Com- ment on the results obtained. 138 | 139 | ```{r} 140 | set.seed(15) 141 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 142 | Default.train=Default[train,] 143 | Default.test=Default[-train,] 144 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 145 | glm.probs=predict(glm.fit,Default.test,type="response") 146 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 147 | mean(glm.pred!=Default.test$default) 148 | 149 | set.seed(5) 150 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 151 | Default.train=Default[train,] 152 | Default.test=Default[-train,] 153 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 154 | glm.probs=predict(glm.fit,Default.test,type="response") 155 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 156 | mean(glm.pred!=Default.test$default) 157 | 158 | set.seed(31) 159 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 160 | Default.train=Default[train,] 161 | Default.test=Default[-train,] 162 | glm.fit=glm(default~income+balance,data=Default.train, family="binomial") 163 | glm.probs=predict(glm.fit,Default.test,type="response") 164 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 165 | mean(glm.pred!=Default.test$default) 166 | ``` 167 | 168 | ### Part d) 169 | > Now consider a logistic regression model that predicts the prob- ability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the val- idation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate. 
170 | 171 | ```{r} 172 | set.seed(15) 173 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 174 | Default.train=Default[train,] 175 | Default.test=Default[-train,] 176 | glm.fit=glm(default~income+balance+student,data=Default.train, family="binomial") 177 | glm.probs=predict(glm.fit,Default.test,type="response") 178 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 179 | mean(glm.pred!=Default.test$default) 180 | 181 | set.seed(5) 182 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 183 | Default.train=Default[train,] 184 | Default.test=Default[-train,] 185 | glm.fit=glm(default~income+balance+student,data=Default.train, family="binomial") 186 | glm.probs=predict(glm.fit,Default.test,type="response") 187 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 188 | mean(glm.pred!=Default.test$default) 189 | 190 | set.seed(31) 191 | train=sample(nrow(Default),nrow(Default)-nrow(Default)/4) 192 | Default.train=Default[train,] 193 | Default.test=Default[-train,] 194 | glm.fit=glm(default~income+balance+student,data=Default.train, family="binomial") 195 | glm.probs=predict(glm.fit,Default.test,type="response") 196 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 197 | mean(glm.pred!=Default.test$default) 198 | ``` 199 | 200 | It does not look like including this variable helps the model much. The three tests I tried with both models produce similar ranges of test error. 201 | 202 | ********** 203 | ## Problem 6. 204 | > We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression co- efficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis. 205 | 206 | 207 | 208 | ### Part a) 209 | > Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors. 210 | 211 | ```{r} 212 | 213 | 214 | set.seed(1) 215 | glm.fit=glm(default~income+balance,data=Default, family="binomial") 216 | summary(glm.fit)$coef[,1] 217 | ``` 218 | 219 | ### Part b) 220 | > Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model. 221 | 222 | ```{r} 223 | boot.fn=function(data,index){ 224 | coefficients(glm(default~income+balance, data=data, subset=index, family="binomial")) 225 | } 226 | 227 | boot.fn(Default,1:nrow(Default)) 228 | ``` 229 | 230 | ### Part c) 231 | > Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance. 232 | 233 | ```{r} 234 | library(boot) 235 | #boot(Default,boot.fn,1000) 236 | ``` 237 | 238 | ``` 239 | ## 240 | ## ORDINARY NONPARAMETRIC BOOTSTRAP 241 | ## 242 | ## 243 | ## Call: 244 | ## boot(data = Default, statistic = boot.fn, R = 1000) 245 | ## 246 | ## 247 | ## Bootstrap Statistics : 248 | ## original bias std. 
error 249 | ## t1* -1.154e+01 -8.008e-03 4.239e-01 250 | ## t2* 2.081e-05 5.871e-08 4.583e-06 251 | ## t3* 5.647e-03 2.300e-06 2.268e-04 252 | ``` 253 | 254 | ### Part d) 255 | > Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function. 256 | 257 | 258 | These bootstrap estimates actually match up with the glm summary estimates. That is a really good sign. 259 | 260 | ********** 261 | ## Problem 7. 262 | > In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alterna- tively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this ap- proach in order to compute the LOOCV error for a simple logistic regression model on the Default data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4). 263 | 264 | 265 | 266 | ### Part a) 267 | > Fit a logistic regression model that predicts the probability of default using balance. 268 | 269 | ```{r} 270 | glm.fit=glm(default~balance,data=Default,family="binomial") 271 | ``` 272 | 273 | 274 | ### Part b) 275 | > Fit a logistic regression model that predicts the probability of default using balance using all but the first observation. 276 | 277 | ```{r} 278 | glm.fit2=update(glm.fit,subset=-1) 279 | ``` 280 | 281 | ### Part c) 282 | > Use the model from (b) to predict the default status of the first observation. You can do this by predicting that the first observation will default if P (default|balance) > 0.5. Was this observation correctly classified? 283 | 284 | ```{r} 285 | Default.test=Default[1,,drop=F] 286 | glm.probs=predict(glm.fit2,Default.test,type="response") 287 | glm.pred=ifelse(glm.probs>.5,"Yes","No") 288 | mean(glm.pred==Default.test$default) 289 | ``` 290 | This observation was correctly calssified. 291 | 292 | ### Part d) 293 | > Write a for loop from i=1 to i=n, where n is the number of observations in the data set, that performs each of the following steps: 294 | i. Fit a logistic regression model using all but the ith observation to predict probability of default using balance. 295 | ii. Compute the posterior probability of default for the ith observation. 296 | iii. Use the posterior probability of default for the ith observation in order to predict whether or not the observation defaults. 297 | iv. Determine whether or not an error was made in predicting the default status for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0. 298 | 299 | ```{r} 300 | library(multicore) 301 | # predictions=unlist(mclapply(seq(nrow(Default)), function(i){ 302 | # glm.fit2=update(glm.fit,subset=-i) 303 | # Default.test=Default[i,,drop=F] 304 | # glm.probs=predict(glm.fit2,Default.test,type="response") 305 | # glm.pred=ifelse(glm.probs>.5,"Yes","No") 306 | # mean(glm.pred==Default.test$default) 307 | # },mc.cores=8)) 308 | ``` 309 | 310 | 311 | ### Part e) 312 | > Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results. 313 | 314 | ``` 315 | # 1 - mean(predictions) 316 | ## [1] 0.0275 317 | ``` 318 | 319 | 320 | 321 | *********** 322 | 323 | ## Problem 8. 324 | > We will now perform cross-validation on a simulated data set. 
325 | 326 | ### Part a) Generate a simulated data set as follows: 327 | 328 | ```{r} 329 | set.seed(1) 330 | y=rnorm(100) 331 | x=rnorm(100) 332 | y=x-2*x^2+rnorm(100) 333 | ``` 334 | 335 | > In this data set, what is n and what is p? Write out the model used to generate the data in equation form. 336 | 337 | In this dataset, n is 100 and p is 2. 338 | 339 | ### Part b) 340 | > Create a scatterplot of X against Y . Comment on what you find. 341 | 342 | ```{r} 343 | plot(x,y) 344 | ``` 345 | 346 | x is quadratic in terms of y. 347 | 348 | ### Part c) 349 | > Set a random seed, and then compute the LOOCV errors that 350 | result from fitting the following four models using least squares: 351 | i. Y = β0 + β1X + ǫ 352 | ii. Y = β0 + β1X + β2X2 + ǫ 353 | iii. Y = β0 +β1X +β2X2 +β3X3 +ǫ 354 | iv. Y = β0 +β1X +β2X2 +β3X3 +β4X4 +ǫ. 355 | 356 | ```{r} 357 | dat=data.frame(x=x,y=y) 358 | fit.errors = unlist(mclapply(seq(4),function(i){ 359 | glm.fit.i=glm(y~poly(x,i),data=dat) 360 | cv.err=cv.glm(dat,glm.fit.i) 361 | cv.err$delta[1] 362 | })) 363 | names(fit.errors)<-sprintf("poly_%d",seq(4)) 364 | fit.errors 365 | ``` 366 | 367 | ### Part d) 368 | > Repeat c) using another random seed, and report your results. Are your results the same as what you got in c)? Why? 369 | 370 | ```{r} 371 | set.seed(131) 372 | fit.errors = unlist(mclapply(seq(4),function(i){ 373 | glm.fit.i=glm(y~poly(x,i),data=dat) 374 | cv.err=cv.glm(dat,glm.fit.i) 375 | cv.err$delta[1] 376 | })) 377 | names(fit.errors)<-sprintf("poly_%d",seq(4)) 378 | fit.errors 379 | ``` 380 | 381 | The results are the same because LOOCV does not have a randomness factor involved, it is the same with any iteration given the same undelrying data and model. 382 | 383 | 384 | ### Part e) 385 | > Which of the models in c) had the smallest LOOCV error? Is this what you expected? Explain your answer. 386 | 387 | The `poly(x,2)` model had the smallest LOOCV error which is encouraging becuase this is what was used to generate the data! 388 | 389 | ### Part f) 390 | > Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results? 391 | 392 | ```{r} 393 | glm.fit.i=glm(y~poly(x,4),data=dat) 394 | summary(glm.fit.i) 395 | ``` 396 | 397 | Yes when we do a poly(x,4) we see that the x and x**2 terms are the two that end up statistically significant. 398 | 399 | ************** 400 | ## Probelm 9. 401 | > We will now consider the Boston housing data set, from the MASS library. 402 | 403 | ### Part a) 404 | > Based on this data set, provide an estimate for the population mean of medv. Call this estimate μˆ. 405 | 406 | ```{r} 407 | library(MASS) 408 | mu=mean(Boston$medv) 409 | mu 410 | ``` 411 | 412 | ### Part b) 413 | > Provide an estimate of the standard error of μˆ. Interpret this result. 414 | Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations. 415 | 416 | ```{r} 417 | sd(Boston$medv)/sqrt(length(Boston$medv)) 418 | ``` 419 | 420 | ### Part c) 421 | > Now estimate the standard error of μˆ using the bootstrap. How does this compare to your answer from (b)? 
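Before using `boot()` in the next chunk, a quick hand-rolled version of the same idea — resample with replacement, recompute the mean, take the standard deviation — can serve as a sanity check. A small sketch (the Boston data from MASS is loaded in part (a)):

```{r}
set.seed(1)
# 1000 bootstrap resamples of medv, each the same size as the original sample
boot.means <- replicate(1000, mean(sample(Boston$medv, replace = TRUE)))
sd(boot.means)   # should land close to the formula-based SE from (b)
```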
422 | 423 | ```{r} 424 | boot.fn<-function(data,index){ 425 | mean(data[index]) 426 | } 427 | boot(Boston$medv,boot.fn,1000,parallel ="multicore") 428 | ``` 429 | 430 | ### Part d) 431 | > Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). 432 | Hint: You can approximate a 95% confidence interval using the formula [μˆ − 2SE(μˆ), μˆ + 2SE(μˆ)]. 433 | 434 | ```{r} 435 | t.test(Boston$medv) 436 | mu=22.53 437 | se=0.4016 438 | mu-2*se 439 | mu+2*se 440 | ``` 441 | 442 | They are very similar, the bootstrap estimate is slightly tighter than the one we just calculated with the mean and std error from bootstrap. (23.33 vs 23.34) the lower bound is the same. They are probably basically the same. 443 | 444 | ### Part e) 445 | > Based on this data set, provide an estimate, $\hat\mu_{med}$, for the median value of medv in the population. 446 | 447 | `r median(Boston$medv)` 448 | 449 | 450 | ### Part f) 451 | > Wenowwouldliketoestimatethestandarderrorofμˆmed.Unfor- tunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings. 452 | 453 | ```{r} 454 | boot.fn<-function(data,index){ 455 | median(data[index]) 456 | } 457 | boot(Boston$medv,boot.fn,1000,parallel ="multicore") 458 | ``` 459 | 460 | Interestingly the std error of the median is lower than that of the mean! Cool. 461 | 462 | ### Part g) 463 | > Based on this data set, provide an estimate for the tenth per- centile of medv in Boston suburbs. Call this quantity μˆ0.1. (You can use the quantile() function.) 464 | 465 | `r quantile(Boston$medv,p=0.1)` 466 | 467 | 468 | ### Part h) 469 | > Use the bootstrap to estimate the standard error of μˆ0.1. Com- ment on your findings. 470 | 471 | ```{r} 472 | boot.fn<-function(data,index){ 473 | quantile(data[index],p=0.1) 474 | } 475 | boot(Boston$medv,boot.fn,1000,parallel ="multicore") 476 | ``` 477 | 478 | The lower 10% of the data has a higher std error than the mean and the median, that is interesting. Apparently these outliers must be more sensitive to which subset is chosen than the mean and median. 479 | 480 | 481 | -------------------------------------------------------------------------------- /R_Exercises/Exercise5/Exercise5.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Exercises 5: 2 | ======================================================== 3 | 4 | # Conceptual Section 5 | ******************* 6 | ## Problem 1 7 | 8 | > We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers: 9 | 10 | ### Part a) 11 | > Which of the three models with k predictors has the smallest training RSS? 12 | 13 | Best subset selection will have the best training RSS. Although it is possible that either of the other two will chose comparably good models, they will not chose better models on the training data. 
Best subset selection exhaustively searches all possible models with k predictors chosing the smallest training RSS while the other two methods heuristically explore a subset of that space, either by starting with teh best k-1 model and chosing the best k given a fixed k-1 (forward) or in reverse starting at the best k+1 and chosing the best single feature to remove resulting in the best model with that constraint. 14 | 15 | ### Part b) 16 | > Which of the three models with k predictors has the smallest test RSS? 17 | 18 | It is possible to overfit with any of these methods. There are probably cases where the best model trained on the training set (the one exhaustively chosen by best subset) happens to not perform as well on the test set as the best model chosen by forward or backward selection. 19 | 20 | ### Part c) 21 | > True or False: 22 | 23 | > i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection. 24 | 25 | TRUE: the k+1 variable model contains all k features chosen in the k variable model, plus the best aditional feature. 26 | 27 | > ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1) variable model identified by backward stepwise selection. 28 | 29 | TRUE: the k variable model contains all but one feature in the k+1 best model, minus the single feature resulting in the smallest gain in RSS. 30 | 31 | > iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1) variable model identified by forward stepwise selection. 32 | 33 | FALSE: it is possible for disjoint sets to be identified by forward and backward selection. 34 | 35 | > iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection. 36 | 37 | FALSE: it is possible for disjoint sets to be identified by foward and backward selection. 38 | 39 | > v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection. 40 | 41 | FALSE: again these two methods are not guarenteed to chose the same k or k+1 features, they may be disjoint sets. 42 | 43 | ******************* 44 | ## Problem 2 45 | 46 | > For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. 47 | 48 | 49 | ### Part a) 50 | > The lasso, relative to least squares, is: 51 | i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. 52 | ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. 53 | iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. 54 | iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. 55 | 56 | iii is the correct answer. The lasso is a more restrictive model, and thus it has the possibility of reducing overfitting and variance in predictions. As long as it does not result in too high of a bias due to its added constraints, it will outperform least squares which might be fitting spurious parameters. 
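This trade-off is easy to see on a coefficient path: as the lasso penalty grows, coefficients are shrunk and eventually zeroed, trading a little bias for a drop in variance. A small illustrative sketch on simulated data (not part of the original answer; glmnet assumed available):

```{r}
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(100)   # only two predictors truly matter
lasso.path <- glmnet(x, y, alpha = 1)          # alpha = 1 gives the lasso
plot(lasso.path, xvar = "lambda", label = TRUE)  # coefficients shrink toward 0 as lambda grows
```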
57 | 58 | 59 | ### Part b) 60 | > Repeat (a) for ridge regression relative to least squares. 61 | 62 | Again iii is the correct answer. Ridge regression is not as restrictive as the lasso, but it is still less flexible than least squares, and the same reasoning as outlined above applies. 63 | 64 | ### Part c) 65 | > Repeat (a) for non-linear methods relative to least squares. 66 | 67 | ii is the correct answer. Non-linear methods are generally more flexible than least squares. They perform better when the linearity assumption is strongly broken. These methods will have more variance due to their more sensitive fits to the underlying data, and to perform well will need to have a substantial drop in bias. 68 | 69 | 70 | ****************** 71 | 72 | ## Problem 3 73 | > Suppose we estimate the regression coefficients in a linear regression model by minimizing 74 | > $\sum_{i=1}^n ( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2 \text{ subject to } \sum_{j=1}^{p}|\beta_j|\leq s$ 75 | 76 | > for a particular value of s. For parts (a) through (e), indicate which 77 | of i. through v. is correct. Justify your answer. 78 | 79 | ### Part a) 80 | > As we increase s from 0, the training RSS will: 81 | i. Increase initially, and then eventually start decreasing in an 82 | inverted U shape. 83 | ii. Decrease initially, and then eventually start increasing in a U shape. 84 | iii. Steadily increase. 85 | iv. Steadily decrease. 86 | v. Remain constant. 87 | 88 | The training RSS will steadily decrease (iv) as s increases. Increasing s relaxes the $\ell_1$ budget on the $\beta$ coefficients, so the model becomes more flexible and can fit the training data at least as well as before; for large enough s we recover the ordinary least squares fit. 89 | 90 | 91 | ### Part b) 92 | > Repeat (a) for test RSS. 93 | 94 | ii. When s is very small the model is heavily constrained and underfits, so the test RSS starts high; as s grows the fit improves, but once the budget is loose enough the model starts fitting noise and the test RSS increases again, making a U shape. 95 | 96 | ### Part c) 97 | > Repeat (a) for variance. 98 | 99 | The variance will steadily increase as the constraint is relaxed and the model becomes more flexible. 100 | 101 | ### Part d) 102 | > Repeat (a) for (squared) bias. 103 | 104 | The squared bias will steadily decrease as the model becomes more flexible (s increased). 105 | 106 | ### Part e) 107 | > Repeat (a) for Bayes error rate. 108 | 109 | The Bayes (irreducible) error does not depend on the model we are fitting, so it remains constant (v). 110 | 111 | 112 | 113 | ****************** 114 | 115 | ## Problem 4 116 | > Suppose we estimate the regression coefficients in a linear regression model by minimizing 117 | > $\sum_{i=1}^n ( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2 + \lambda\sum_{j=1}^{p}\beta_j^2$ 118 | 119 | > for a particular value of $\lambda$. For parts (a) through (e), indicate which 120 | of i. through v. is correct. Justify your answer. 121 | 122 | ### Part a) 123 | > As we increase $\lambda$ from 0, the training RSS will: 124 | i. Increase initially, and then eventually start decreasing in an 125 | inverted U shape. 126 | ii. Decrease initially, and then eventually start increasing in a U shape. 127 | iii. Steadily increase. 128 | iv. Steadily decrease. 129 | v. Remain constant. 130 | 131 | The training RSS will steadily increase (iii) as $\lambda$ increases. Increasing $\lambda$ places a heavier penalty on the model, shrinking the $\beta$ coefficients toward 0 (this is a ridge or $\ell_2$ penalty) and pulling the fit away from the least squares solution. 132 | 133 | 134 | ### Part b) 135 | > Repeat (a) for test RSS. 136 | 137 | ii.
Initially as spurious coefficients are forced to 0, the test RSS will improve as the model has less overfitting. However eventually necessary coefficients will be removed from the model, and the test RSS will again increase, making a U shape. 138 | 139 | ### Part c) 140 | > Repeat (a) for variance. 141 | 142 | The variance will decrease as more penalty is placed on the model. 143 | 144 | ### Part d) 145 | > Repeat (a) for (squared) bias. 146 | 147 | The squared bias will increase as the model becomes less flexible ($\lambda$ increased) 148 | 149 | ### Part e) 150 | > Repeat (a) for Bayes error rate. 151 | 152 | This is an optimal theoretical perfectly predicting construct not dependent on the model we are fitting to the data. 153 | 154 | ********************* 155 | 156 | ## Problem 6 157 | > We will now explore (6.12) and (6.13) further. 158 | 159 | 160 | 6.12: $\sum_{j=1}^{p}(y_j-\beta_j)^2 + \alpha\sum_{j=1}^p\beta_j^2$ 161 | 162 | 6.13: $\sum_{j=1}^{p}(y_j-\beta_j)^2 + \alpha\sum_{j=1}|\beta_j|$ 163 | 164 | 6.14: $\hat\beta_j^R=\frac{y_j}{1+\alpha}$ 165 | 166 | 6.15: $\hat\beta_j^L=\begin{cases}y_j-\alpha/2 & \text{if } y_j > \alpha/2;\\ y_j + \alpha/2 & \text{if } y_j < -\alpha/2; \\ 0 & \text{if } |y_j| \leq \alpha/2. \end{cases}$ 167 | 168 | 169 | 170 | ### Part a) 171 | > Consider (6.12) with $p = 1$. For some choice of $y_1$, $x_1$, and $\alpha > 0$, plot (6.12) as a function of $\beta_1$. Your plot should confirm that (6.12) is solved by (6.14). 172 | 173 | ```{r fig.height=11,fig.width=11} 174 | par(mfrow=c(2,2)) 175 | for(A in c(0,1,5,10)){ 176 | y1=5 177 | x1=1 # special case where x1 is 1 178 | b1=seq(-1,6,by=0.05) 179 | yhat=((y1-b1)^2) + (A*b1^2) 180 | plot(b1,yhat) 181 | points(b1[which.min(yhat)],yhat[which.min(yhat)], col="green",cex=4,pch=20) 182 | abline(v=y1/(1+A),col="red",lwd=3) 183 | } 184 | ``` 185 | ### Part b) 186 | > Consider (6.13) with $p = 1$. For some choice of $y_1$, $x_1$, and $\alpha > 0$, plot (6.13) as a function of $\beta_1$. Your plot should confirm that (6.13) is solved by (6.15). 
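Before plotting, it helps to see where (6.15) comes from. A sketch of the argument for the $y_1 > \alpha/2$ case:

$$\frac{d}{d\beta_1}\left[(y_1-\beta_1)^2 + \alpha\beta_1\right] = -2(y_1-\beta_1) + \alpha = 0 \implies \hat\beta_1 = y_1 - \alpha/2,$$

which is consistent with having assumed $\beta_1 > 0$ (so $|\beta_1| = \beta_1$) exactly when $y_1 > \alpha/2$. The $y_1 < -\alpha/2$ case is symmetric, and in between the penalty dominates and the minimum sits at $\hat\beta_1 = 0$.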
187 | 188 | 189 | ```{r fig.height=11,fig.width=11} 190 | opt.y.lasso=function(y,a){ 191 | if(y>(a/2)){ 192 | return(y-(a/2)) 193 | } 194 | 195 | if(y < (-a/2)){ 196 | return(y+(a/2)) 197 | } 198 | if(abs(y) <= (a/2)){ 199 | return(0) 200 | } 201 | } 202 | 203 | par(mfrow=c(2,2)) 204 | for(A in c(0,1,5,10)){ 205 | y1=5 206 | x1=1 # special case where x1 is 1 207 | b1=seq(-1,6,by=0.05) 208 | yhat=((y1-b1)^2) + A*abs(b1) 209 | plot(b1,yhat) 210 | points(b1[which.min(yhat)],yhat[which.min(yhat)], col="green",cex=4,pch=20) 211 | abline(v=opt.y.lasso(y1,A),col="red",lwd=3) 212 | } 213 | ``` 214 | 215 | 216 | 217 | 218 | ************************* 219 | 220 | # Applied section 221 | ## Problem 8 222 | > create a simluated dataset 223 | 224 | ### Parts a-c) 225 | ```{r} 226 | set.seed(1) 227 | X=rnorm(100,mean=0,sd=1) 228 | e=rnorm(100,mean=0,sd=0.5) 229 | B0=150.3 230 | B1=50.5 231 | B2=-10.1 232 | B3=-34.2 233 | 234 | Y=B0+ B1*X + B2*(X^2) + B3*(X^3) + e 235 | 236 | dat=data.frame(Y=Y,X=X) 237 | library(leaps) 238 | library(ISLR) 239 | 240 | regfit.full=regsubsets(Y~poly(X,10,raw=T),data=dat,nvmax=10) 241 | reg.summary=summary(regfit.full) 242 | ``` 243 | 244 | ```{r fig.height=11,fig.width=11} 245 | par(mfrow=c(2,2)) 246 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 247 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 248 | points(which.max(reg.summary$adjr2), 249 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 250 | col="red",cex=2,pch=20) 251 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 252 | type="l") 253 | points(which.min(reg.summary$cp), 254 | reg.summary$cp[which.min(reg.summary$cp)], 255 | col="red",cex=2,pch=20) 256 | plot(reg.summary$bic,xlab="Number of Variables", 257 | ylab="BIC", type="l") 258 | points(which.min(reg.summary$bic), 259 | reg.summary$bic[which.min(reg.summary$bic)], 260 | col="red",cex=2,pch=20) 261 | 262 | # BIC 263 | coef(regfit.full,3) 264 | # Cp/adjusted R2 265 | coef(regfit.full,4) 266 | ``` 267 | 268 | Each method chose at least a superset of the correct X polynomials. BIC chose the correct ones (X,X^2,X^3). Cp and adjusted R^2 added in X^9 269 | 270 | 271 | ### Part d 272 | 273 | ```{r fig.height=11,fig.width=11} 274 | regfit.full=regsubsets(Y~poly(X,10,raw=T),data=dat,nvmax=10,method="forward") 275 | reg.summary=summary(regfit.full) 276 | 277 | par(mfrow=c(2,2)) 278 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 279 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 280 | points(which.max(reg.summary$adjr2), 281 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 282 | col="red",cex=2,pch=20) 283 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 284 | type="l") 285 | points(which.min(reg.summary$cp), 286 | reg.summary$cp[which.min(reg.summary$cp)], 287 | col="red",cex=2,pch=20) 288 | plot(reg.summary$bic,xlab="Number of Variables", 289 | ylab="BIC", type="l") 290 | points(which.min(reg.summary$bic), 291 | reg.summary$bic[which.min(reg.summary$bic)], 292 | col="red",cex=2,pch=20) 293 | 294 | # BIC 295 | coef(regfit.full,3) 296 | # Cp 297 | coef(regfit.full,4) 298 | # Adjusted R2 299 | coef(regfit.full,4) 300 | ``` 301 | 302 | The same number of parameters were chosen by each method, however now the spurious parameters changed with Cp and Adjusted R2. Now Cp and adjusted R^2 use X^5 as the extra. 
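Rather than reading the sizes off the plots, the model size chosen by each criterion can be pulled straight from the `reg.summary` object defined just above; a small sketch:

```{r}
# model sizes picked by each criterion, and the coefficients of the BIC choice
which.min(reg.summary$bic)
which.min(reg.summary$cp)
which.max(reg.summary$adjr2)
coef(regfit.full, which.min(reg.summary$bic))
```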
303 | 304 | 305 | ### Part e 306 | > same but with lasso 307 | 308 | ```{r fig.height=5,fig.width=7} 309 | library(glmnet) 310 | dat.mat=model.matrix(Y~poly(X,10,raw=T),data=dat)[,-1] 311 | 312 | cv.out=cv.glmnet(dat.mat,Y,alpha=1) 313 | plot(cv.out) 314 | bestlam=cv.out$lambda.min 315 | bestlam 316 | 317 | lasso.mod=glmnet(dat.mat,Y,alpha=1,lambda=bestlam) 318 | coef(lasso.mod) 319 | 320 | ``` 321 | 322 | Using the optimal lambda chosen by CV, lasso regression choses up to a 3nd degree polynomial (X,X^2,X^3), but includes X^5 as an extra term. 323 | 324 | ### Part f 325 | > redo with different model and repeat c and e. 326 | 327 | ```{r} 328 | set.seed(1) 329 | X=rnorm(100,mean=0,sd=1) 330 | e=rnorm(100,mean=0,sd=0.5) 331 | B0=150.3 332 | B7=33.3 333 | 334 | 335 | Y=B0+ B7*(X^7) + e 336 | 337 | dat=data.frame(Y=Y,X=X) 338 | 339 | ``` 340 | 341 | ```{r fig.height=11,fig.width=11} 342 | regfit.full=regsubsets(Y~poly(X,10,raw=T),data=dat,nvmax=10) 343 | reg.summary=summary(regfit.full) 344 | 345 | par(mfrow=c(2,2)) 346 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 347 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 348 | points(which.max(reg.summary$adjr2), 349 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 350 | col="red",cex=2,pch=20) 351 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 352 | type="l") 353 | points(which.min(reg.summary$cp), 354 | reg.summary$cp[which.min(reg.summary$cp)], 355 | col="red",cex=2,pch=20) 356 | plot(reg.summary$bic,xlab="Number of Variables", 357 | ylab="BIC", type="l") 358 | points(which.min(reg.summary$bic), 359 | reg.summary$bic[which.min(reg.summary$bic)], 360 | col="red",cex=2,pch=20) 361 | 362 | # adj R^2 363 | coef(regfit.full,4) 364 | 365 | # Cp 366 | coef(regfit.full,2) 367 | 368 | # BIC 369 | coef(regfit.full,1) 370 | ``` 371 | 372 | ```{r fig.height=5,fig.width=7} 373 | library(glmnet) 374 | dat.mat=model.matrix(Y~poly(X,10,raw=T),data=dat)[,-1] 375 | 376 | cv.out=cv.glmnet(dat.mat,Y,alpha=1) 377 | plot(cv.out) 378 | bestlam=cv.out$lambda.min 379 | bestlam 380 | 381 | lasso.mod=glmnet(dat.mat,Y,alpha=1,lambda=bestlam) 382 | coef(lasso.mod) 383 | 384 | ``` 385 | 386 | This model was difficult for the methods to deal with. BIC chose the correct model though, with the seventh degree polynomial being the only one included! Cp added in X^2, and adjusted R^2 added in X^1, X^2, and X^3. 387 | 388 | The lasso on the ohter hand chose two features as well, like Cp. It chose X^7 along with X^9 as the spurious feature though. 389 | 390 | **NOTE: I redid this section after realizing in the next chapter that the poly function returns a linear combination of terms, which has the interesting effect of causing the above models to chose $\beta^{1..7}$ rather than only $\beta^7$!! This is something to be aware of, and I completely missed this the first time though.** 391 | 392 | 393 | *************** 394 | ## Problem 9 395 | > In this exercise, we will predict the number of applications received using the other variables in the College data set. 396 | 397 | ### Part a) 398 | > Split the data set into a training set and a test set. 399 | 400 | ```{r} 401 | set.seed(1) 402 | train=sample(c(TRUE,FALSE),nrow(College),rep=TRUE) 403 | test=(!train) 404 | 405 | College.train=College[train,,drop=F] 406 | College.test=College[test,,drop=F] 407 | 408 | ``` 409 | 410 | ### Part b) 411 | > Fit a linear model using least squares on the training set, and report the test error obtained. 
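A small aside before the fits: each of the chunks below reports a test $R^2$ computed from RSS and TSS. That three-line computation could be wrapped once in a helper (the name `test.rsq.fn` is purely illustrative and is not used in the chunks that follow):

```{r}
# illustrative helper: test R^2 from predictions and actuals
test.rsq.fn <- function(pred, actual) {
  1 - sum((pred - actual)^2) / sum((actual - mean(actual))^2)
}
```

e.g. `test.rsq.fn(pred, College.test$Apps)` would replace the repeated RSS/TSS lines.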
412 | 413 | ```{r} 414 | lm.fit=lm(Apps~.,data=College.train) 415 | summary(lm.fit) 416 | pred=predict(lm.fit,College.test) 417 | rss=sum((pred-College.test$Apps)^2) 418 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 419 | test.rsq=1-(rss/tss) 420 | test.rsq 421 | ``` 422 | 423 | where test.rsq is the $R^2$ statistic. 424 | 425 | ### Part c) 426 | > Fit a ridge regression model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained. 427 | 428 | ```{r} 429 | 430 | ### Scale the training data, and scale the test data using the centers/scale 431 | # learned on the training data. 432 | College.train.X=scale(model.matrix(Apps~.,data=College.train)[,-1],scale=T,center=T) 433 | College.train.Y=College.train$Apps 434 | 435 | College.test.X=scale(model.matrix(Apps~.,data=College.test)[,-1], 436 | attr(College.train.X,"scaled:center"), 437 | attr(College.train.X,"scaled:scale")) 438 | 439 | College.test.Y=College.test$Apps 440 | 441 | cv.out=cv.glmnet(College.train.X,College.train.Y,alpha=0) 442 | bestlam=cv.out$lambda.min 443 | bestlam 444 | 445 | lasso.mod=glmnet(College.train.X,College.train.Y,alpha=0,lambda=bestlam) 446 | pred=predict(lasso.mod,College.test.X,s=bestlam) 447 | rss=sum((pred-College.test$Apps)^2) 448 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 449 | test.rsq=1-(rss/tss) 450 | test.rsq 451 | 452 | ``` 453 | 454 | 455 | ### Part d) 456 | > Fit a lasso model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates. 457 | 458 | ```{r} 459 | 460 | cv.out=cv.glmnet(College.train.X,College.train.Y,alpha=1) 461 | bestlam=cv.out$lambda.min 462 | bestlam 463 | 464 | lasso.mod=glmnet(College.train.X,College.train.Y,alpha=1,lambda=bestlam) 465 | pred=predict(lasso.mod,College.test.X,s=bestlam) 466 | rss=sum((pred-College.test$Apps)^2) 467 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 468 | test.rsq=1-(rss/tss) 469 | test.rsq 470 | 471 | #Number of coefficients equal to 0 472 | sum(coef(lasso.mod)[,1]==0) 473 | 474 | names(coef(lasso.mod)[, 1][coef(lasso.mod)[, 1] == 0]) 475 | ``` 476 | 477 | 478 | ### Part e) 479 | > Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. 480 | 481 | ```{r} 482 | library(pls) 483 | set.seed(1) 484 | pcr.fit=pcr(Apps~.,data=College.train, scale=TRUE, validation="CV") 485 | summary(pcr.fit) #lowest at M=17 486 | pred=predict(pcr.fit,College.test,ncomp=17) 487 | rss=sum((pred-College.test$Apps)^2) 488 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 489 | test.rsq=1-(rss/tss) 490 | test.rsq 491 | ``` 492 | 493 | 494 | ### Part f) 495 | > Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. 496 | 497 | 498 | ```{r} 499 | library(pls) 500 | set.seed(1) 501 | pls.fit=plsr(Apps~.,data=College.train, scale=TRUE, validation="CV") 502 | summary(pls.fit) #pretty much lowest at 9 comps, certainly closest to lowest there 503 | pred=predict(pls.fit,College.test,ncomp=9) 504 | rss=sum((pred-College.test$Apps)^2) 505 | tss=sum((College.test$Apps-mean(College.test$Apps))^2) 506 | test.rsq=1-(rss/tss) 507 | test.rsq 508 | ``` 509 | 510 | ### Part g) 511 | > Comment on the results obtained. How accurately can we predict the number of college applications received? 
Is there much difference among the test errors resulting from these five approaches? 512 | 513 | Ordinary least squares, PLS regression, lasso, and PCR regression performed (more or less equally) best. These methods ended up using the same underlying data essentially, since the optimal PCR regression used the same number of underlying variables. PLS regression was able to cut out a few things, chosing a model that used 9 of the possible 17 components, and 83% of the variance, while still performing pretty much as well. Interestingly the Lasso, while not performing quite as well, still performed pretty comparably 0.8995 vs 0.9052 (a difference of `r 0.9052 - 0.8995`). The lasso though only set 3 variables to 0 (Enroll (students enrolled), Terminal (pct fac w/ terminal degree), and S.F. Ratio(student/factulty ratio)). It is interesting that most of the variables seem to contribute interesting information to the model. Ridge regression performed the poorest at $R^2=0.84$. 514 | 515 | 516 | **************** 517 | ## Problem 10 518 | > We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set. 519 | 520 | ### Parts a-e) 521 | > Generate a data set with p = 20 features, n = 1000 observations, and an associated quantitative response vector generated according to the model $Y=X\beta+\epsilon$ 522 | where $\beta$ has some elements that are exactly equal to zero. 523 | 524 | ```{r, fig.height=5, fig.width=7} 525 | library(leaps) 526 | 527 | set.seed(1) 528 | X=matrix(rnorm(1000*20),ncol=20,nrow=1000) 529 | colnames(X) <- sprintf("Feature_%d",1:20) 530 | beta=rnorm(20,sd=10) 531 | beta[c(3,7,9,11,13,18)]=0 532 | e=rnorm(1000) 533 | Y=as.vector(X%*%beta+e) 534 | train=sample(1:nrow(X),100) ### FIXME 100 train 900 test 535 | test=(-train) 536 | 537 | X.train=X[train,] 538 | X.test=X[test,] 539 | Y.train=Y[train] 540 | Y.test=Y[test] 541 | 542 | dat.train=cbind(data.frame(Y=Y.train),X.train) 543 | dat.test=cbind(data.frame(Y=Y.test),X.test) 544 | 545 | regfit.best=regsubsets(Y~.,dat=dat.train,nvmax=20) 546 | 547 | predict.regsubsets=function(object,newdata,id,...){ 548 | form=as.formula(object$call[[2]]) ## extract formula 549 | mat=model.matrix(form,newdata) 550 | coefi=coef(object,id=id) 551 | xvars=names(coefi) 552 | mat[,xvars]%*%coefi 553 | } 554 | 555 | mse=function(pred,real){ 556 | mean((pred-real)^2) 557 | } 558 | 559 | test.mse <- sapply(1:20, function(id){ 560 | pred=predict.regsubsets(regfit.best,dat.test,id) 561 | mse(pred,Y.test) 562 | }) 563 | 564 | plot(seq(1:20),test.mse,xlab="Number of Features", 565 | ylab="Test MSE") 566 | points(which.min(test.mse),test.mse[which.min(test.mse)], 567 | col="red",cex=2,pch=20) 568 | 569 | coef(regfit.best,id=which.min(test.mse)) 570 | ``` 571 | 572 | My 0 beta features are 3,7,9,11,13, and 18. 573 | The lowest test MSE was found at 14/20 features. This is indeed the correct number of features which is encouraging (20-6). Best subset selection selected against Features 3, 7, 9, 11, 13, 18, so it did really well at finding the true underlying model! 
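The excluded features can also be recovered programmatically rather than by inspection; a quick check using the objects defined above:

```{r}
best.id <- which.min(test.mse)
chosen  <- names(coef(regfit.best, id = best.id))
setdiff(colnames(X), chosen)   # should return exactly the zero-beta features
```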
574 | 575 | ### Part g) 576 | 577 | ```{r fig.width=7,fig.height=5} 578 | beta.rsqb.diffs <- sapply(1:20,function(r){ 579 | coefi<-coef(regfit.best,id=r) 580 | ncoefi<-names(coefi) 581 | beta.est <- sapply(1:20,function(i){ 582 | id<-sprintf("Feature_%d",i) 583 | if(id %in% names(coefi)){ 584 | return(coefi[id]) 585 | }else{ 586 | return(0) 587 | } 588 | }) 589 | 590 | return(sqrt(sum((beta-beta.est)^2))) 591 | }) 592 | 593 | plot(seq(1:20),beta.rsqb.diffs,xlab="Number of Features", 594 | ylab="Root Squared Diff Of Betas") 595 | points(which.min(beta.rsqb.diffs),beta.rsqb.diffs[which.min(beta.rsqb.diffs)], 596 | col="red",cex=2,pch=20) 597 | 598 | ``` 599 | 600 | The minimum value is the same as before, 14 features, however the cool thing is how much more pronounced the answer is. The dip is really strong between 13 and 14, and then stays small going out to 20. 601 | 602 | 603 | ************ 604 | ## Problem 11 605 | 606 | ### Part a) 607 | 608 | I am going to evalueate each of these methods with 10 Fold CV on the Boston dataset. 609 | 610 | ```{r} 611 | library(MASS) 612 | library(ISLR) 613 | 614 | ## Best Subset 615 | k=10 616 | set.seed(1) 617 | p=ncol(Boston)-1 618 | folds=sample(1:k,nrow(Boston),replace=TRUE) 619 | 620 | cv.errors=c() 621 | for(j in 1:k){ 622 | Boston.sub=Boston[folds!=j,] 623 | #now do CV on this CV subset to choose the best model, and apply 624 | # it to the whole thing. 625 | cv.err=matrix(NA,k,p,dimnames=list(NULL,paste(1:p))) 626 | folds.sub=sample(1:k,nrow(Boston.sub),replace=TRUE) 627 | 628 | for(q in 1:k){ 629 | best.fit=regsubsets(crim~.,data=Boston.sub[folds.sub!=q,],nvmax=p) 630 | for(i in 1:p){ 631 | pred=predict.regsubsets(best.fit,Boston.sub[folds.sub==q,],id=i) 632 | cv.err[q,i]=mean((Boston.sub$crim[folds.sub==q]-pred)^2) 633 | } 634 | } 635 | 636 | best.k = which.min(apply(cv.err,2,mean)) 637 | 638 | best.fit.all=regsubsets(crim~.,data=Boston.sub,nvmax=p) 639 | pred=predict.regsubsets(best.fit.all,Boston[folds==j,],id=best.k) 640 | 641 | cv.errors=c(cv.errors,mean((Boston$crim[folds==j]-pred)^2)) 642 | } 643 | 644 | mean(cv.errors) 645 | 646 | ## Ridge regression (alpha=0) 647 | 648 | cv.errors = sapply(1:k, function(j){ 649 | Boston.X=as.matrix(Boston[,-1]) 650 | Boston.Y=Boston[,1] 651 | 652 | cv.out=cv.glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=0) 653 | bestlam=cv.out$lambda.min 654 | bestlam 655 | 656 | lasso.mod=glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=0,lambda=bestlam) 657 | pred=predict(lasso.mod,Boston.X[folds==j,],s=bestlam) 658 | return(mean((Boston.Y[folds==j]-pred)^2)) 659 | }) 660 | 661 | mean(cv.errors) 662 | 663 | ## Lasso (alpha=1) 664 | cv.errors = sapply(1:k, function(j){ 665 | Boston.X=as.matrix(Boston[,-1]) 666 | Boston.Y=Boston[,1] 667 | 668 | cv.out=cv.glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=1) 669 | bestlam=cv.out$lambda.min 670 | bestlam 671 | 672 | lasso.mod=glmnet(Boston.X[folds!=j,],Boston.Y[folds!=j],alpha=1,lambda=bestlam) 673 | pred=predict(lasso.mod,Boston.X[folds==j,],s=bestlam) 674 | return(mean((Boston.Y[folds==j]-pred)^2)) 675 | }) 676 | 677 | mean(cv.errors) 678 | 679 | ## PCR 680 | cv.errors = sapply(1:k, function(j){ 681 | 682 | pcr.fit=pcr(crim~.,data=Boston[folds!=j,],scale=TRUE,validation="CV") 683 | res=RMSEP(pcr.fit) 684 | pcr.best=which.min(res$val[1,,])-1 685 | 686 | pred=predict(pcr.fit,Boston[folds==j,],ncomp=pcr.best) 687 | return(mean((Boston[folds==j,1]-pred)^2)) 688 | }) 689 | 690 | mean(cv.errors) 691 | 692 | ``` 693 | 694 | 695 | Of these above methods on the Boston 
dataset using CV for building multiple training/testing splits, and using CV within each CV iteration for choosing optimal parameters for each model, PCR and ridge regression perform best. Lasso performs nearly as well, and best subset selection performs slightly worse than the others. 696 | 697 | The best method, PCR regression, does include all features. It selects a subset of linear combinations of all features though, so some of the variance in some of the features is likely not included, although some information from each feature will make it into the final model regardless of the parameter that was selected in each CV iteration. 698 | 699 | The same thing goes for the second best method, ridge regression, this one also uses some information from each feature, although it might heavily discount the contribution from some of the features. 700 | 701 | 702 | -------------------------------------------------------------------------------- /R_Exercises/Exercise6/Exercise6.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Exercises 5: 2 | ======================================================== 3 | 4 | # Conceptual Section 5 | ******************* 6 | ## Problem 1 7 | 8 | ******************* 9 | ## Problem 2 10 | 11 | See section 7.5 previously on smoothing splines. A typicall smoothing spline uses the second derivative of g (the change in slope). So minimizing the second derivative of g results in less roughness of the line. The first derivative of g represents the slope. The third derivative is weird. This is the rate at which acceleration is changing in physics. It is the rate of change of the second derivative, called "jerk". Smoothing splines are like regular splines but they have a knot at every training datapoint, and they smooth the fit by minimizing this alpha term over the second derivative. 12 | 13 | ### Part a 14 | I think this should predict everythiing to be zero. No matter how bad the fit is, the infinite penalty of a non-zero prediction on X will overrule everything else. 15 | 16 | ### Part b 17 | This is minimizing slope (first derivative). Basically g must be flat! So this should just be the mean of all points. 18 | 19 | ### Part c 20 | This is minimizing change in slope (second derivative). This is allowed to be a line so it will likely be a nice linear regression fit. 21 | 22 | ### Part d 23 | This is minimizing "jerk" the rate of change in slope. Hmm.. So the rate of change in slope is allowed to be constant, but not changing. This should be the closest you can get to fitting all points perfectly given some acceleration of line change. 24 | 25 | ### Part e 26 | Now there is no smoothing parameter, so that part of the equation is ignored. Basically this is a set of straight lines connecting the dots! 
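A rough way to see the large-penalty behaviour described in (c): `smooth.spline()` penalizes the second derivative, so a very large smoothing parameter should flatten the fit toward the least squares line, while a tiny one lets it wiggle. A small sketch on simulated data (not part of the original answer):

```{r}
set.seed(1)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit.rough  <- smooth.spline(x, y, lambda = 1e-6)  # almost no penalty: wiggly fit
fit.smooth <- smooth.spline(x, y, lambda = 10)    # heavy penalty: close to a straight line
plot(x, y, col = "lightgrey")
lines(fit.rough, col = "red", lwd = 2)
lines(fit.smooth, col = "blue", lwd = 2)
abline(lm(y ~ x), lty = 3)  # the heavily penalized fit should approach this line
```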
27 | ******************* 28 | 29 | ## Problem 3 30 | 31 | ## Problem 4 32 | 33 | ## Problem 5 34 | 35 | 36 | # Applied section 37 | ******************* 38 | 39 | ## Problem 6 40 | 41 | ### Part a 42 | ```{r} 43 | library(ISLR) 44 | 45 | k=10 46 | max.poly=15 47 | set.seed(1) 48 | folds=sample(1:k,nrow(Wage),replace=TRUE) 49 | cv.errors=matrix(NA,k,max.poly,dimnames=list(NULL,paste(1:max.poly))) 50 | 51 | for(j in 1:k){ 52 | for(i in 1:max.poly){ 53 | lm.fit=lm(wage~poly(age,i,raw=T),data=Wage[folds!=j,]) 54 | 55 | pred=predict(lm.fit,Wage[folds==j,]) 56 | cv.errors[j,i]=mean((Wage$wage[folds==j]-pred)^2) 57 | } 58 | } 59 | 60 | mean.cv.errors=apply(cv.errors,2,mean) 61 | mean.cv.errors 62 | which.min(mean.cv.errors) 63 | 64 | fit.1=lm(wage~poly(age,1,raw=T),data=Wage) 65 | fit.2=lm(wage~poly(age,2,raw=T),data=Wage) 66 | fit.3=lm(wage~poly(age,3,raw=T),data=Wage) 67 | fit.4=lm(wage~poly(age,4,raw=T),data=Wage) 68 | fit.5=lm(wage~poly(age,5,raw=T),data=Wage) 69 | fit.6=lm(wage~poly(age,6,raw=T),data=Wage) 70 | fit.7=lm(wage~poly(age,7,raw=T),data=Wage) 71 | fit.8=lm(wage~poly(age,8,raw=T),data=Wage) 72 | fit.9=lm(wage~poly(age,9,raw=T),data=Wage) 73 | fit.10=lm(wage~poly(age,10,raw=T),data=Wage) 74 | fit.11=lm(wage~poly(age,11,raw=T),data=Wage) 75 | fit.12=lm(wage~poly(age,12,raw=T),data=Wage) 76 | fit.13=lm(wage~poly(age,13,raw=T),data=Wage) 77 | fit.14=lm(wage~poly(age,14,raw=T),data=Wage) 78 | fit.15=lm(wage~poly(age,15,raw=T),data=Wage) 79 | anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9,fit.10,fit.11,fit.12,fit.13,fit.14,fit.15) 80 | ``` 81 | 82 | There appears to be support both in CV and anova for a 9th degree polynomial on age relative to wage! 83 | 84 | ```{r, fig.width=11, fig.height=11} 85 | age.range=data.frame(age=seq(min(Wage$age),max(Wage$age),by=0.1)) 86 | plot(Wage$age, Wage$wage, xlab="Age", ylab="Wage", main="Wage vs Age") 87 | lines(age.range$age,predict(fit.9,age.range),col="red",lwd=3) 88 | ``` 89 | 90 | ### Part b 91 | ```{r} 92 | library(ISLR) 93 | 94 | k=10 95 | max.cut=19 96 | set.seed(1) 97 | folds=sample(1:k,nrow(Wage),replace=TRUE) 98 | cv.errors=matrix(NA,k,max.cut-1,dimnames=list(NULL,paste(2:max.cut-1))) 99 | 100 | for(j in 1:k){ 101 | for(i in 2:max.cut){ 102 | lm.fit=lm(wage~cut(age,i,labels = FALSE),data=Wage[folds!=j,]) 103 | 104 | pred=predict(lm.fit,newdata=Wage[folds==j,]) 105 | cv.errors[j,i-1]=mean((Wage$wage[folds==j]-pred)^2) 106 | } 107 | } 108 | 109 | mean.cv.errors=apply(cv.errors,2,mean) 110 | mean.cv.errors 111 | mean.cv.errors[which.min(mean.cv.errors)] 112 | 113 | fit.2=lm(wage~cut(age,2),data=Wage) 114 | fit.3=lm(wage~cut(age,3),data=Wage) 115 | fit.4=lm(wage~cut(age,4),data=Wage) 116 | fit.5=lm(wage~cut(age,5),data=Wage) 117 | fit.6=lm(wage~cut(age,6),data=Wage) 118 | fit.7=lm(wage~cut(age,7),data=Wage) 119 | fit.8=lm(wage~cut(age,8),data=Wage) 120 | fit.9=lm(wage~cut(age,9),data=Wage) 121 | fit.10=lm(wage~cut(age,10),data=Wage) 122 | fit.11=lm(wage~cut(age,11),data=Wage) 123 | fit.12=lm(wage~cut(age,12),data=Wage) 124 | fit.13=lm(wage~cut(age,13),data=Wage) 125 | fit.14=lm(wage~cut(age,14),data=Wage) 126 | fit.15=lm(wage~cut(age,15),data=Wage) 127 | fit.16=lm(wage~cut(age,16),data=Wage) 128 | fit.17=lm(wage~cut(age,17),data=Wage) 129 | fit.18=lm(wage~cut(age,18),data=Wage) 130 | anova(fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9,fit.10,fit.11,fit.12,fit.13,fit.14,fit.15,fit.16,fit.17,fit.18) 131 | ``` 132 | 133 | There appears to be optimal support for 6 cutpoints in CV and up to 15 cutpoints in anova. 
I will plot both, red is the 6 cutpoints, and blue is 15. 134 | 135 | ```{r, fig.width=11, fig.height=11} 136 | age.range=data.frame(age=seq(min(Wage$age),max(Wage$age),by=0.1)) 137 | plot(Wage$age, Wage$wage, xlab="Age", ylab="Wage", main="Wage vs Age") 138 | lines(age.range$age,predict(fit.6,age.range),col="red",lwd=3) 139 | lines(age.range$age,predict(fit.15,age.range),col="blue",lwd=3) 140 | ``` 141 | 142 | 143 | 144 | 145 | 146 | ## Problem 7 147 | Here I will use a gam function 148 | 149 | ```{r,fig.width=11,fig.height=11} 150 | pairs(wage~age+jobclass+maritl,data=Wage) 151 | ``` 152 | 153 | ```{r} 154 | library(gam) 155 | gam.m0=gam(wage~lo(year,span=0.7)+s(age,5)+education,data=Wage) 156 | gam.m1=gam(wage~lo(year,span=0.7)+s(age,5)+education+jobclass,data=Wage) 157 | gam.m2=gam(wage~lo(year,span=0.7)+s(age,5)+education+maritl,data=Wage) 158 | gam.m3=gam(wage~lo(year,span=0.7)+s(age,5)+education+jobclass+maritl,data=Wage) 159 | anova(gam.m0,gam.m1,gam.m2,gam.m3,test="F") 160 | anova(gam.m1,gam.m3,test="F") 161 | ``` 162 | 163 | It seems that together jobclass and marital status provide useful information beyond what you get from year, age, and education alone. 164 | 165 | ```{r,fig.width=11,fig.height=11} 166 | par(mfrow=c(3,3)) 167 | plot(gam.m3,se=T,col="blue") 168 | ``` 169 | 170 | ******** 171 | ## Problem 8 172 | 173 | ```{r fig.width=11,fig.height=11} 174 | pairs(Auto) 175 | gam.m0=gam(mpg~displacement+acceleration+year,data=Auto) 176 | gam.m1=gam(mpg~s(displacement,5)+s(acceleration,3)+year,data=Auto) 177 | gam.m2=gam(mpg~s(displacement,5)+s(acceleration,4)+year,data=Auto) 178 | anova(gam.m0,gam.m1,gam.m2) 179 | ``` 180 | 181 | There is pretty strong support for a non-linear relationship in the anova test. We can visually see this non-linearity in the scatterplot matrix. Year appears pretty linear though. 182 | 183 | ```{r fig.width=11,fig.height=5} 184 | par(mfrow=c(1,3)) 185 | plot(gam.m1,se=T,col="red") 186 | ``` 187 | 188 | *********** 189 | ## Problem 9 190 | ### Part a 191 | ```{r fig.width=7, fig.height=5} 192 | library(MASS) 193 | gam.fit0=lm(nox~poly(dis,3,raw=T),data=Boston) 194 | r=range(Boston$dis) 195 | d.grid=seq(r[1],r[2],by=(r[2]-r[1])/200) 196 | preds=predict(gam.fit0,newdata=list(dis=d.grid),se=T) 197 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 198 | plot(Boston$dis,Boston$nox,xlab="weighted mean distance to employment",ylab="nitrogen oxides concentration",col="lightgrey") 199 | lines(d.grid,preds$fit,lwd=2,col="blue") 200 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 201 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 202 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 203 | title("Cubic relationship between mean distance to employment\nand nitrogen oxide contamination") 204 | ``` 205 | ### Part b 206 | > Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares. 
207 | 208 | ```{r fig.height=11, fig.width=11} 209 | par(mfrow=c(4,3)) 210 | for(i in seq(1,12)){ 211 | gam.fit=lm(nox~poly(dis,i,raw=T),data=Boston) 212 | preds=predict(gam.fit,newdata=list(dis=d.grid),se=T) 213 | pred.fit=predict(gam.fit,newdata=Boston) 214 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 215 | rss=sum((Boston$nox-pred.fit)^2) 216 | plot(Boston$dis,Boston$nox,xlab="dis",ylab="nox",col="lightgrey",main=sprintf("Poly=%d, rss=%0.6f",i,rss)) 217 | lines(d.grid,preds$fit,lwd=2,col="blue") 218 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 219 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 220 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 221 | } 222 | ``` 223 | 224 | ### Part c 225 | 226 | 227 | ```{r} 228 | k=10 229 | max.poly=12 230 | set.seed(1) 231 | folds=sample(1:k,nrow(Boston),replace=TRUE) 232 | cv.errors=matrix(NA,k,max.poly,dimnames=list(NULL,paste(1:max.poly))) 233 | 234 | for(j in 1:k){ 235 | for(i in 1:max.poly){ 236 | gam.fit=lm(nox~poly(dis,i,raw=T),data=Boston[folds!=j,]) 237 | 238 | pred=predict(gam.fit,Boston[folds==j,]) 239 | cv.errors[j,i]=mean((Boston$nox[folds==j]-pred)^2) 240 | } 241 | } 242 | 243 | mean.cv.errors=apply(cv.errors,2,mean) 244 | mean.cv.errors 245 | mean.cv.errors[which.min(mean.cv.errors)] 246 | 247 | ``` 248 | 249 | 250 | 251 | The minimal mean of mean squared errors is with a 4th degree polynomial through our CV iterations. 252 | 253 | 254 | ### Part d 255 | ```{r} 256 | gam.fit=lm(nox~bs(dis,df=4),data=Boston) 257 | preds=predict(gam.fit,newdata=list(dis=d.grid),se=T) 258 | pred.fit=predict(gam.fit,newdata=Boston) 259 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 260 | rss=sum((Boston$nox-pred.fit)^2) 261 | plot(Boston$dis,Boston$nox,xlab="dis",ylab="nox",col="lightgrey",main=sprintf("Df=%d, rss=%0.6f",4,rss)) 262 | lines(d.grid,preds$fit,lwd=2,col="blue") 263 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 264 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 265 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 266 | ``` 267 | 268 | Here I include the intercept in the bases so that the line fits the data in its raw form. 269 | 270 | ### Part e 271 | 272 | ```{r fig.width=11,fig.height=11} 273 | par(mfrow=c(3,3)) 274 | for(i in seq(3,11)){ 275 | #add one for the intercept 276 | gam.fit=lm(nox~bs(dis,df=i),data=Boston) 277 | preds=predict(gam.fit,newdata=list(dis=d.grid),se=T) 278 | pred.fit=predict(gam.fit,newdata=Boston) 279 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 280 | rss=sum((Boston$nox-pred.fit)^2) 281 | plot(Boston$dis,Boston$nox,xlab="dis",ylab="nox",col="lightgrey",main=sprintf("DF=%d, rss=%0.6f",i,rss)) 282 | lines(d.grid,preds$fit,lwd=2,col="blue") 283 | matlines(d.grid,se.bands,lwd=1,col="blue",lty=3) 284 | points(jitter(Boston$dis),rep(max(Boston$nox),nrow(Boston)),cex=.5,pch="|",col="darkgrey") 285 | points(rep(max(Boston$dis),nrow(Boston)),jitter(Boston$nox),cex=.5,pch="-",col="darkgrey") 286 | } 287 | ``` 288 | 289 | The results start getting pretty noisy looking around 10 degrees of freedom, I bet the best results will be around 4 or 5 degrees of freedom in CV validation. 
290 | 291 | 292 | ```{r} 293 | k=10 294 | max.df=12 295 | set.seed(1) 296 | folds=sample(1:k,nrow(Boston),replace=TRUE) 297 | cv.errors=matrix(NA,k,max.df-2,dimnames=list(NULL,paste(3:max.df))) 298 | 299 | for(j in 1:k){ 300 | for(i in 3:max.df){ 301 | gam.fit=lm(nox~bs(dis,df=i),data=Boston[folds!=j,]) 302 | 303 | pred=predict(gam.fit,Boston[folds==j,]) 304 | cv.errors[j,i-2]=mean((Boston$nox[folds==j]-pred)^2) 305 | } 306 | } 307 | 308 | mean.cv.errors=apply(cv.errors,2,mean) 309 | mean.cv.errors 310 | mean.cv.errors[which.min(mean.cv.errors)] 311 | 312 | ``` 313 | 314 | As I suspected 5 was chosen by CV as the optimal degree of freedom. 315 | 316 | ********** 317 | ## Problem 10 318 | 319 | ### Part a 320 | ```{r fig.width=11,fig.height=11} 321 | library(leaps) 322 | set.seed(1) 323 | test=sample(1:nrow(College),nrow(College)/4) 324 | train=(-test) 325 | College.test=College[test,,drop=F] 326 | College.train=College[train,,drop=F] 327 | 328 | #predict Outstate using other variables 329 | regfit.fwd=regsubsets(Outstate~.,data=College.train,nvmax=ncol(College)+1,method="forward") 330 | reg.summary=summary(regfit.fwd) 331 | 332 | par(mfrow=c(2,2)) 333 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 334 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l",main=sprintf("Max Adjusted RSq: %d",which.max(reg.summary$adjr2))) 335 | points(which.max(reg.summary$adjr2), 336 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 337 | col="red",cex=2,pch=20) 338 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 339 | type="l",main=sprintf("Min Cp: %d",which.min(reg.summary$cp))) 340 | points(which.min(reg.summary$cp), 341 | reg.summary$cp[which.min(reg.summary$cp)], 342 | col="red",cex=2,pch=20) 343 | plot(reg.summary$bic,xlab="Number of Variables", 344 | ylab="BIC", type="l",main=sprintf("Min BIC: %d",which.min(reg.summary$bic))) 345 | points(which.min(reg.summary$bic), 346 | reg.summary$bic[which.min(reg.summary$bic)], 347 | col="red",cex=2,pch=20) 348 | 349 | reg.summary 350 | ``` 351 | 352 | ### Part b 353 | 354 | The 12 feature model that BIC choses seems to be pretty good. 355 | 356 | This model uses the following features: 357 | 358 | * Private 359 | * Accept 360 | * Top10perc 361 | * F.Undergrad 362 | * Room.Board 363 | * Personal 364 | * PhD 365 | * Terminal 366 | * S.F.Ratio 367 | * perc.alumni 368 | * Expend 369 | * Grad.Rate 370 | 371 | ```{r fig.width=11,fig.height=11} 372 | good.features=c("Outstate", 373 | "Private", 374 | "Accept", 375 | "Top10perc", 376 | "F.Undergrad", 377 | "Room.Board", 378 | "Personal", 379 | "PhD", 380 | "Terminal", 381 | "S.F.Ratio", 382 | "perc.alumni", 383 | "Expend", 384 | "Grad.Rate") 385 | 386 | pairs(College.train[,good.features,drop=F]) 387 | 388 | #standard lm fit using these features for comparison 389 | lm.fit=lm(Outstate~.,data=College.train[,good.features]) 390 | 391 | ``` 392 | 393 | The highly non-linear features by eye relative to the response seem to be Accept, F.Undergrad, and Top10perc. PhD might be anotehr good candidate, and perhaps Terminal. 
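These eyeball judgements could also be screened one variable at a time before committing to the full GAM below; a rough sketch for a single candidate (using `Accept` as the example, following the same `anova()` comparison pattern used elsewhere in this file):

```{r}
# does a smooth term for Accept improve on a purely linear term?
lin.accept <- gam(Outstate ~ Accept, data = College.train)
sm.accept  <- gam(Outstate ~ s(Accept, 4), data = College.train)
anova(lin.accept, sm.accept, test = "F")
```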
394 | 395 | ```{r fig.width=11,fig.height=11} 396 | gam.fit=gam(Outstate~ 397 | ns(Accept,5)+ 398 | ns(Top10perc,3)+ 399 | ns(F.Undergrad,5)+ 400 | ns(PhD,3)+ 401 | ns(Terminal,3)+ 402 | Private + 403 | Room.Board + 404 | Personal + 405 | S.F.Ratio + 406 | perc.alumni + 407 | Expend + 408 | Grad.Rate 409 | ,data=College.train) 410 | anova(lm.fit,gam.fit) 411 | 412 | par(mfrow=c(4,3)) 413 | plot.gam(gam.fit,se=T,residuals=T,col="blue") 414 | 415 | summary(lm.fit) 416 | summary(gam.fit) 417 | 418 | 419 | gam.pred=predict(gam.fit,newdata=College.test) 420 | lm.pred=predict(lm.fit,newdata=College.test) 421 | 422 | sqrt(mean((College.test$Outstate-gam.pred)^2)) 423 | sqrt(mean((College.test$Outstate-lm.pred)^2)) 424 | ``` 425 | 426 | The root mean squared error is lower for the GAM than the standard linear model. I used the features described previosly to test out non-linear relationships. 427 | 428 | 429 | ## Problem 11 430 | 431 | ```{r fig.width=11, fig.height=5} 432 | n=100 433 | beta0_t=5 434 | beta1_t=-0.55 435 | beta2_t=1.35 436 | set.seed(10) 437 | x1=rnorm(n,sd=1.1) 438 | x2=rnorm(n,sd=2.3) 439 | e=rnorm(n,mean=0,sd=0.5) 440 | y=x1*beta1_t+x2*beta2_t+beta0_t+e 441 | 442 | res=matrix(NA,3,1000,dimnames=list(c("B0","B1","B2"),paste(1:1000))) 443 | 444 | b0hat=150 445 | b1hat=100 446 | b2hat=-100 447 | 448 | for(i in seq(1000)){ 449 | a=y-b1hat*x1 450 | b2hat=lm(a~x2)$coef[2] 451 | a=y-b2hat*x2 452 | fit2=lm(a~x1) 453 | b1hat=fit2$coef[2] 454 | b0hat=fit2$coef[1] 455 | res["B0",i]=b0hat 456 | res["B1",i]=b1hat 457 | res["B2",i]=b2hat 458 | } 459 | 460 | res[,c(1,2,3,4,5,6,7,8,9,1000),drop=F] 461 | 462 | r=range(res) 463 | plot(seq(1000),res[1,],type="l",lwd=3,ylim=r,col="blue",xlab="iteration",ylab="coefficient estimate") 464 | lines(res[2,],lwd=3,col="red") 465 | lines(res[3,],lwd=3,col="green") 466 | legend("topright", 467 | c("B0","B1","B2"), # puts text in the legend 468 | lty=c(1,1,1), # gives the legend appropriate symbols (lines) 469 | lwd=c(3,3,3),col=c("blue","red","green")) # gives the legend lines the correct color and width 470 | fit=lm(y~x1+x2) 471 | abline(h=fit$coef[1],lty=3,lwd=1,col="blue") 472 | abline(h=fit$coef[2],lty=3,lwd=1,col="red") 473 | abline(h=fit$coef[3],lty=3,lwd=1,col="green") 474 | summary(lm(y~x1+x2)) 475 | 476 | ``` 477 | 478 | In this dataset by the 3rd iteration results were nearly as good as they would get, and by the fourth they were pretty much converged. 479 | 480 | ******* 481 | ## Problem 12 482 | 483 | ```{r fig.width=11, fig.height=5} 484 | n=100 485 | p=100 486 | set.seed(5) 487 | bhats=rnorm(101,sd=100) 488 | btarg=rnorm(101,sd=10) 489 | X=cbind(rep(1,100),matrix(rnorm(100*100),100,100)) 490 | e=rnorm(n,mean=0,sd=0.5) 491 | res=matrix(NA,101,1000,dimnames=list(paste(0:100),paste(1:1000))) 492 | 493 | ## rows from left dot product with columns from right 494 | y=as.vector(btarg%*% t(X))+e 495 | for(i in seq(1000)){ 496 | for (j in 2:101){ 497 | a=y-as.vector((bhats[-j]%*% t(X[,-j]))) 498 | bhats[j]=lm(a~X[,j,drop=T])$coef[2] 499 | } 500 | bhats[1]=mean(y-as.vector(bhats[-1]%*% t(X[,-1]))) 501 | res[,i]=bhats 502 | } 503 | 504 | rmse_betas=apply(res,2,function(c)mean(sqrt((btarg-c)^2))) 505 | 506 | plot(1:1000,rmse_betas,type="l",col="blue",lwd=2,main="Beta RMSE by iteration",xlab="iteration",ylab="Coefficient RMSE") 507 | 508 | which.min(rmse_betas) 509 | min(rmse_betas) 510 | 511 | ``` 512 | 513 | As you can see, the RMSE on the betas is decreasing as a function of the iteration. 
The minimal value is at iteration 1000 (I stopped it there due to runtime), and it decreases slowly but steadily until then. 514 | 515 | -------------------------------------------------------------------------------- /R_Exercises/Exercise7/Exercise7.Rmd: -------------------------------------------------------------------------------- 1 | ISLR Exercise 7: Decision Trees 2 | ======================================================== 3 | ******** 4 | ******** 5 | ## Conceptual 6 | ### Problem 1 7 | > Draw an example (of your own invention) of a partition of two- dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all as- pects of your figures, including the regions R1, R2, . . ., the cutpoints t1,t2,..., and so forth. 8 | _Hint: Your result should look something like Figures 8.1 and 8.2._ 9 | 10 | 11 | ***** 12 | ### Problem 2 13 | > It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form 14 | $$f(X)=\sum_{j=1}^pf_j(X_j).$$ 15 | Explain why this is the case. You can begin with (8.12) in Algorithm 8.2 16 | 17 | 18 | ***** 19 | ### Problem 3 20 | > Consider the Gini index, classification error, and cross-entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of pˆm1. The x- axis should display pˆm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. 21 | _Hint: In a setting with two classes, $$\hat p_{m1} = 1 - \hat p_{m2}$$. You could make this plot by hand, but it will be much easier to make in R._ 22 | 23 | The Gini index 24 | $$G=\sum_{k=1}^K\hat p_{mk}(1 - \hat p_{mk})$$ 25 | 26 | cross-entropy 27 | $$D=-\sum_{k=1}^K\hat p_{mk}\text{log} \hat p_{mk}$$ 28 | 29 | ```{r fig.width=8, fig.height=8} 30 | gini=function(m1){ 31 | return(2*(m1*(1-m1))) 32 | } 33 | 34 | ent=function(m1){ 35 | m2=1-m1 36 | return(-((m1*log(m1))+(m2*log(m2)))) 37 | } 38 | 39 | classerr=function(m1){ 40 | m2=1-m1 41 | return(1-max(m1,m2)) 42 | #return(min((1-m1),m1)) 43 | #return(m1) 44 | } 45 | 46 | err=seq(0,1,by=0.01) 47 | c.err=sapply(err,classerr) 48 | g=sapply(err,gini) 49 | e=sapply(err,ent) 50 | d=data.frame(Gini.Index=g,Cross.Entropy=e) 51 | plot(err,c.err,type='l',col="red",xlab="m1",ylim=c(0,0.8),ylab="value") 52 | matlines(err,d,col=c("green","blue")) 53 | 54 | 55 | ``` 56 | 57 | ***** 58 | ### Problem 4 59 | > This question relates to the plots in Figure 8.12. 60 | 61 | 62 | ***** 63 | #### Part a 64 | > Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.12. The num- bers inside the boxes indicate the mean of Y within each region. 65 | 66 | 67 | ***** 68 | #### Part b 69 | > Create a diagram similar to the left-hand panel of Figure 8.12, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region. 70 | 71 | 72 | ***** 73 | ### Problem 5 74 | > Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 75 | 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. 
76 | There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches? 77 | 78 | First the mean probability based classification: 79 | `r x=c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75); mean(x)>0.5` 80 | 81 | Second the majority vote based classification: 82 | `r mean(x>0.5)>0.5` 83 | 84 | ***** 85 | ### Problem 6 86 | > Provide a detailed explanation of the algorithm that is used to fit a regression tree. 87 | 88 | 1. First we do recursive binary splitting on the data. This is a top-down approach where recursively and greedily we find the best single partitioning of the data such that the reduction of RSS is the greatest. This process is applied to each of the split parts seperately until some minimal number of observations is present on each of the leaves. 89 | 90 | 2. apply cost complexity pruning of this larger tree formed in step 1 to obtain a sequence of best subtrees as a function of a parameter, $\alpha$. Each value of $\alpha$ corresponds to a different subtree which minimizes the equation $$\sum_{m=i}^{|T|}\sum_{i:x_i\in R_m}(y_i - \hat y_{R_m})^2 + \alpha |T|$$. Here $|T|$ is the number of terminal nodes on the tree. When $\alpha=0$ we have the original tree, and as $\alpha$ increases we get a more pruned version of the tree. 91 | 92 | 3. using K-fold CV, choose $\alpha$. For each fold, repeat steps 1 and 2, and then evaluate the MSE as a function of $\alpha$ on the held out fold. Chose an $\alpha$ that minimizes the average error. 93 | 94 | 4. Given the $\alpha$ chosen in step 3, return the tree calculated using the formula laid out in step 2 on the entire dataset with that chosen value of $\alhpa$. 95 | 96 | ***** 97 | ***** 98 | ## Applied 99 | 100 | **** 101 | ### Problem 7 102 | > In the lab, we applied random forests to the Boston data using mtry=6 and using ntree=25 and ntree=500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained. 103 | 104 | `mtry` is the number of variables randomly sampled as candidates for each split. There are 13 variables to look at in the boston dataset. This defaults to `r sqrt(13)` for a dataset of this size. 
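A quick aside (my own note, not from the book): for a quantitative response like `medv`, `randomForest` actually defaults to `mtry = floor(p/3)` (so 4 here), while `sqrt(p)` is the default for classification. A minimal sketch to check what default it picks, assuming only the `MASS` and `randomForest` packages:

```{r}
# Sketch: inspect the default mtry randomForest uses for a regression response.
# For regression the default is floor(p/3); sqrt(p) applies to classification.
library(MASS)          # Boston data
library(randomForest)
set.seed(1)
rf.default=randomForest(medv~.,data=Boston)
rf.default$mtry        # 4 for Boston's 13 predictors, i.e. floor(13/3)
```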
105 | 106 | ```{r fig.width=11, fig.height=11} 107 | library(ISLR) 108 | library(MASS) 109 | library(randomForest) 110 | library(tree) 111 | 112 | mtry=c(3,4,6) 113 | ntree=c(10,30,50,75,100,500) 114 | x=matrix(rep(NA,length(mtry)*length(ntree)),length(ntree),length(mtry)) 115 | set.seed(1) 116 | train=sample(1:nrow(Boston), nrow(Boston)/2) 117 | boston.test=Boston[-train,'medv'] 118 | 119 | for(i in 1:length(ntree)){ 120 | for(j in 1:length(mtry)){ 121 | rf.boston=randomForest(medv~.,data=Boston, 122 | subset=train,mtry=mtry[j],ntree=ntree[i], 123 | importance=TRUE) 124 | yhat.rf=predict(rf.boston,newdata=Boston[-train,]) 125 | err=sqrt(mean((yhat.rf-boston.test)^2)) 126 | x[i,j]=err 127 | } 128 | } 129 | 130 | cols=c("red","green","blue","orange") 131 | 132 | plot(ntree,x[,1],xlab="Number of trees",ylim=c(3,5),ylab="Test RMSE",col=cols[1],type='l') 133 | for(j in 2:length(mtry)){ 134 | lines(ntree,x[,j],col=cols[j]) 135 | } 136 | legend("topright",sprintf("mtry=%g",mtry),lty = 1,col=cols) 137 | ``` 138 | Larger trees definitely had a slight advantage. The default choice of 4 did pretty well, and perhaps bumping up that value a bit helps even more, especially at larger numbers of trees. The default value of 4 maximixed its performance at fewer numbers of trees. Overall 6 looks like a good choice for mtry, and 500 a good choice for ntree on this dataset and train/test split. 139 | 140 | ****** 141 | ### Problem 8 142 | > In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable. 143 | 144 | 145 | #### Part a 146 | > Split the data set into a training set and a test set. 147 | 148 | ```{r} 149 | set.seed(1) 150 | train=sample(1:nrow(Carseats),nrow(Carseats)/2) 151 | library(tree) 152 | Carseats.train=Carseats[train,] 153 | Carseats.test=Carseats[-train,] 154 | ``` 155 | 156 | #### Part b 157 | > Fit a regression tree to the training set. Plot the tree, and interpret the results. What test error rate do you obtain? 158 | 159 | ```{r fig.width=11, fig.height=11} 160 | tree.carseats=tree(Sales~.,Carseats.train) 161 | summary(tree.carseats) 162 | plot(tree.carseats) 163 | text(tree.carseats,pretty=0) 164 | sales.est=predict(tree.carseats,Carseats.test) 165 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/sum((Carseats.test$Sales-mean(Carseats.test$Sales))^2)) 166 | test.R2 167 | ``` 168 | 169 | #### Part c 170 | > Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test error rate? 171 | 172 | ```{r fig.width=11, fig.height=11} 173 | cv.carseats=cv.tree(tree.carseats) 174 | plot(cv.carseats$size,cv.carseats$dev,type="b") 175 | min.carseats=which.min(cv.carseats$dev) 176 | #8 is min 177 | prune.carseats=prune.tree(tree.carseats,best=min.carseats) 178 | plot(prune.carseats) 179 | text(prune.carseats ,pretty=0) 180 | sales.est=predict(prune.carseats,Carseats.test) 181 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/sum((Carseats.test$Sales-mean(Carseats.test$Sales))^2)) 182 | test.R2 183 | ``` 184 | 185 | The error rate is actually not better with the pruned tree.. interesting. 186 | 187 | #### Part d 188 | > Use the bagging approach in order to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important. 
189 | 190 | ```{r fig.width=11, fig.height=11} 191 | library(randomForest) 192 | set.seed(1) 193 | bag.carseats=randomForest(Sales~.,data=Carseats,subset=train, 194 | mtry=ncol(Carseats)-1,importance =TRUE) 195 | importance(bag.carseats) 196 | varImpPlot(bag.carseats) 197 | sales.est=predict(bag.carseats,Carseats.test) 198 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/sum((Carseats.test$Sales-mean(Carseats.test$Sales))^2)) 199 | test.R2 200 | ``` 201 | 202 | #### Part e 203 | > Use random forests to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained. 204 | 205 | ```{r} 206 | rf.carseats=randomForest(Sales~.,data=Carseats,subset=train,importance=T) 207 | importance(rf.carseats) 208 | 209 | mtotry=2:6 210 | errs=rep(NA,length(mtotry)) 211 | for(i in 1:length(mtotry)){ 212 | m=mtotry[i] 213 | rf.carseats=randomForest(Sales~.,data=Carseats, 214 | subset=train,mtry=mtotry[i], 215 | importance=T) 216 | sales.est=predict(rf.carseats,Carseats.test) 217 | test.R2=1-(sum((sales.est-Carseats.test$Sales)^2)/ 218 | sum((Carseats.test$Sales- 219 | mean(Carseats.test$Sales))^2)) 220 | errs[i]=test.R2 221 | } 222 | errs 223 | ``` 224 | 225 | 226 | **** 227 | ### Problem 9 228 | > This problem involves the OJ data set which is part of the ISLR package. 229 | 230 | 231 | #### Part a 232 | > Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations. 233 | 234 | ```{r} 235 | set.seed(10) 236 | train=sample(1:nrow(OJ),800) 237 | train.OJ=OJ[train,] 238 | test.OJ=OJ[-train,] 239 | 240 | ``` 241 | 242 | #### Part b 243 | > Fit a tree to the training data, with Purchase as the response and the other variables except for Buy as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have? 244 | 245 | ```{r} 246 | tree.oj=tree(Purchase~.,data=train.OJ) 247 | summary(tree.oj) 248 | ``` 249 | 250 | The training error rate is 0.1625, and there are 7 terminal nodes. 251 | 252 | 253 | #### Part c 254 | > Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed. 255 | 256 | ```{r} 257 | tree.oj 258 | ``` 259 | 260 | Node 4 shows the split which occures of LoyalCH is less first less than 0.45956 and then less than 0.276142. The predicted outcome is MM. There is a deviance of 100. Smaller values of deviance ar indicative of how pure this node is. Finally there is the probability confidence bound on this prediction. 261 | 262 | #### Part d 263 | > Create a plot of the tree, and interpret the results. 264 | 265 | ```{r fig.width=11, fig.height=11} 266 | plot(tree.oj) 267 | text(tree.oj,pretty=0) 268 | ``` 269 | 270 | #### Part e 271 | > Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate? 272 | 273 | ```{r} 274 | preds=predict(tree.oj,test.OJ,type="class") 275 | table(test.OJ$Purchase,preds) 276 | test.err=(155+66)/(155+22+27+66) 277 | test.err 278 | ``` 279 | 280 | #### Part f 281 | > Apply the cv.tree() function to the training set in order to determine the optimal tree size. 
282 | 283 | ```{r} 284 | cv.oj=cv.tree(tree.oj,FUN=prune.misclass) 285 | ``` 286 | 287 | #### Part g 288 | > Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis. 289 | 290 | ```{r fig.width=11, fig.height=11} 291 | plot(cv.oj$size ,cv.oj$dev ,type="b") 292 | ``` 293 | 294 | 295 | #### Part h 296 | > Which tree size corresponds to the lowest cross-validated classification error rate? 297 | 298 | ```{r} 299 | msize=cv.oj$size[which.min(cv.oj$dev)] 300 | msize 301 | ``` 302 | 303 | #### Part i 304 | > Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes. 305 | 306 | ```{r} 307 | prune.oj=prune.misclass(tree.oj,best=msize) 308 | ``` 309 | 310 | #### Part j 311 | > Compare the training error rates between the pruned and un- pruned trees. Which is higher? 312 | 313 | ```{r} 314 | 315 | prune.pred=predict(prune.oj,test.OJ,type="class") 316 | table(prune.pred,test.OJ$Purchase) 317 | ``` 318 | 319 | 320 | #### Part k 321 | > Compare the test error rates between the pruned and unpruned trees. Which is higher? 322 | 323 | ```{r} 324 | prune.test.err=(151+68)/(151+68+26+25) 325 | 1-prune.test.err 326 | 1-test.err 327 | ``` 328 | 329 | The classification accurazy is slightly worse in the pruned tree. 330 | 331 | **** 332 | ### Problem 10 333 | > We now use boosting to predict Salary in the Hitters data set. 334 | 335 | #### Part a 336 | > Remove the observations for whom the salary information is unknown, and then log-transform the salaries. 337 | 338 | ```{r} 339 | H=Hitters[!is.na(Hitters$Salary),,drop=F] 340 | H$Salary=log(H$Salary) 341 | ``` 342 | 343 | 344 | #### Part b 345 | > Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations. 346 | 347 | ```{r} 348 | H.train=H[1:200,] 349 | H.test=H[201:nrow(H),] 350 | ``` 351 | 352 | #### Part c 353 | > Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter $\lambda$. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis. 354 | 355 | ```{r,fig.width=7,fig.height=7} 356 | library(gbm) 357 | set.seed(1) 358 | shrinkage=c(0.00001,0.0001,0.001,0.01,0.1,1) 359 | errs=rep(NA,length(shrinkage)) 360 | for (i in 1:length(shrinkage)){ 361 | s=shrinkage[i] 362 | boost.H=gbm(Salary~., data=H.train, 363 | distribution="gaussian", 364 | n.trees=1000, 365 | shrinkage = s, 366 | interaction.depth=1, 367 | n.cores=10) 368 | yhat.boost=predict(boost.H,newdata=H.train, n.trees=1000) 369 | errs[i]=mean((yhat.boost-H.train$Salary)^2) 370 | } 371 | plot(log(shrinkage),errs) 372 | ``` 373 | 374 | #### Part d 375 | > Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis. 
376 | 377 | ```{r,fig.width=7,fig.height=7} 378 | library(gbm) 379 | set.seed(1) 380 | errs.test=rep(NA,length(shrinkage)) 381 | for (i in 1:length(shrinkage)){ 382 | s=shrinkage[i] 383 | boost.H=gbm(Salary~., data=H.train, 384 | distribution="gaussian", 385 | n.trees=1000, 386 | shrinkage = s, 387 | interaction.depth=1, 388 | n.cores=10) 389 | yhat.boost=predict(boost.H,newdata=H.test, n.trees=1000) 390 | errs.test[i]=mean((yhat.boost-H.test$Salary)^2) 391 | } 392 | plot(log(shrinkage),errs.test) 393 | ``` 394 | 395 | #### Part e 396 | > Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6. 397 | 398 | ```{r fig.width=11, fig.height=11} 399 | boost.H=gbm(Salary~., data=H.train, 400 | distribution="gaussian", 401 | n.trees=1000, 402 | shrinkage = shrinkage[which.min(errs)], 403 | interaction.depth=1, 404 | n.cores=10) 405 | 406 | boost.mse=errs[which.min(errs)] 407 | library(leaps) 408 | fit=regsubsets(Salary~.,data=H.train,nvmax=19) 409 | fit.summ=summary(fit) 410 | to.inc=fit.summ$which[which.min(fit.summ$cp),][2:20] 411 | features=c(features,"Division","Salary") 412 | fit.lm=lm(Salary~.,data=H.train[,colnames(H.train)%in%features]) 413 | yhat=predict(fit.lm,H.test[,colnames(H.train)%in%features]) 414 | best.sub=mean((yhat-H.test$Salary)^2) 415 | 416 | cols.bad=c("League","Division","NewLeague") 417 | n.H=model.matrix(~.,H)[,-1] 418 | n.H.train=n.H[1:200,] 419 | n.H.test=n.H[201:nrow(n.H),] 420 | 421 | library(glmnet) 422 | fit=cv.glmnet(n.H.train[,colnames(n.H)!="Sallary"],n.H.train[,"Salary"]) 423 | fit=glmnet(n.H.train[,colnames(n.H)!="Sallary"],n.H.train[,"Salary"],lambda=fit$lambda.1se) 424 | pred=predict(fit,n.H.test[,colnames(n.H)!="Sallary"]) 425 | best.lasso=mean((pred[,1]-H.test$Salary)^2) 426 | 427 | 428 | #boost 429 | boost.mse 430 | 431 | #Best subset lm: 432 | best.sub 433 | 434 | #best lasso: 435 | best.lasso 436 | 437 | ``` 438 | 439 | the lasso is the best by a really little bit on the test data, but boosting came in close. 440 | 441 | #### Part f 442 | > Which variables appear to be the most important predictors in the boosted model? 443 | 444 | ```{r fig.width=11, fig.height=11} 445 | summary(boost.H) 446 | ``` 447 | 448 | CAtBat and PutOuts were the top two predictors by a lot. Next at about half of the importance was RBI and Walks. 449 | 450 | #### Part g 451 | > Now apply bagging to the training set. What is the test set MSE for this approach? 452 | 453 | ```{r} 454 | library(randomForest) 455 | set.seed(1) 456 | bag.H=randomForest(Salary~.,data=H.train, 457 | mtry=ncol(H.train)-1, 458 | importance=TRUE) 459 | preds=predict(bag.H,newdata=H.test) 460 | mean((preds-H.test$Salary)^2) 461 | ``` 462 | 463 | 464 | **** 465 | ### Problem 11 466 | > This question uses the Caravan data set. 467 | 468 | 469 | #### Part a 470 | > Create a training set consisting of the first 1,000 observations, 471 | and a test set consisting of the remaining observations. 472 | 473 | ```{r} 474 | train.C=Caravan[1:1000,] 475 | test.C=Caravan[1001:nrow(Caravan),] 476 | ``` 477 | 478 | 479 | #### Part b 480 | > Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important? 
481 | 482 | ```{r fig.width=11, fig.height=11} 483 | boost.C=gbm(I(Purchase=="Yes")~., data=train.C, 484 | distribution="bernoulli", 485 | n.trees=1000, 486 | shrinkage = 0.01, 487 | interaction.depth=1, 488 | n.cores=10) 489 | 490 | summary(boost.C) 491 | 492 | The most important predictors are PPEARSAUT, MOPLHOOG and MKOOPKLA, followed pretty closely by a group of others. 493 | ``` 494 | 495 | #### Part c 496 | > Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20 %. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set? 497 | 498 | ```{r} 499 | preds=predict(boost.C,test.C,type="response",n.trees=1000) 500 | yhat=ifelse(preds>.2,"Yes","No") 501 | table(yhat,test.C$Purchase) 502 | #the following is the fraction of people predicted to make 503 | # a purchase who actually do 504 | 34/154 505 | 506 | ``` 507 | 508 | **** 509 | ### Problem 12 510 | > Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance? 511 | 512 | 513 | 514 | 515 | -------------------------------------------------------------------------------- /R_Labs/Lab1/Lab1.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning R Lab 1 2 | ======================================================== 3 | 4 | Here is an example of making a randomly correlated variable: 5 | 6 | ```{r} 7 | x=rnorm(50) 8 | y=x+rnorm(50, mean=50, sd=.1) 9 | cor(x,y) 10 | ``` 11 | 12 | And here is a visualization of this correlation 13 | 14 | ```{r fig.width=7, fig.height=6} 15 | plot(x,y) 16 | ``` 17 | 18 | Here is a sequence of 60 numbers between -pi and pi 19 | ```{r} 20 | x=seq(-pi,pi,length=50) 21 | ``` 22 | 23 | Now exploring contour 24 | 25 | ```{r fig.width=7, fig.height=6} 26 | y=x 27 | f=outer(x,y,function(x,y)cos(y)/(1+x^2)) 28 | contour(x,y,f) 29 | contour(x,y,f,nlevels=45,add=T) 30 | ``` 31 | 32 | And another contor figure, this time with a new transform of the data 33 | 34 | ```{r fig.width=7, fig.height=6} 35 | fa=(f-t(f))/2 36 | contour(x,y,f,nlevels=15) 37 | ``` 38 | 39 | We can also do this with an "image" plot to show a heatmap 40 | ```{r fig.width=7, fig.height=6} 41 | image(x,y,fa) 42 | ``` 43 | 44 | Or a 3d perspective plot 45 | ```{r fig.width=7, fig.height=6} 46 | persp(x,y,fa) 47 | ``` 48 | 49 | Which can be rotated, I think, with theta 50 | ```{r fig.width=7, fig.height=6} 51 | persp(x,y,fa, theta=30) 52 | ``` 53 | 54 | And further rotated with phi 55 | ```{r fig.width=7, fig.height=6} 56 | persp(x,y,fa, theta=30, phi=20) 57 | ``` 58 | 59 | Another angle at phi=70 60 | 61 | ```{r fig.width=7, fig.height=6} 62 | persp(x,y,fa, theta=30, phi=70) 63 | ``` 64 | 65 | And once again with phi=40 66 | ```{r fig.width=7, fig.height=6} 67 | persp(x,y,fa, theta=30, phi=40) 68 | ``` 69 | 70 | 71 | Loading data 72 | -------- 73 | 74 | Here are different ways to load data into R. interesting use of the `fix()` command which loads the data into a spreadsheet like window. 
I commented this out because it blocks execution and doesn't put anything into the result, but it looks like a spreadsheet page. 75 | 76 | ```{r} 77 | Auto=read.table("~/src/IntroToStatisticalLearningR/data/Auto.data",header=T,na.strings="?") 78 | dim(Auto) 79 | Auto=na.omit(Auto) 80 | dim(Auto) 81 | 82 | ``` 83 | 84 | You can also attach data so that it is easier to plot and whatnot, turning cylindars into a dataframe makes the output a little nicer, a boxplot actually! 85 | 86 | ```{r fig.width=7, fig.height=6} 87 | attach(Auto) 88 | cylinders=as.factor(cylinders) 89 | plot(cylinders, mpg) 90 | ``` 91 | 92 | Lets slap some lipstick on this plot! 93 | ```{r fig.width=7, fig.height=6} 94 | plot(cylinders, mpg, col="red", varwidth=T, xlab="cylinders ", ylab="MPG") 95 | ``` 96 | 97 | And we can do a histogram on this variable as well. `col=2` is the same as `col="red"` btw. 98 | ```{r fig.width=7, fig.height=6} 99 | hist(mpg,col=2,breaks=15) 100 | ``` 101 | 102 | We could also do a pairs plot which is really nice. 103 | ```{r fig.width=7, fig.height=6} 104 | pairs(Auto) 105 | ``` 106 | 107 | And we can limit the pairing to specific variables. 108 | ```{r fig.width=7, fig.height=6} 109 | pairs(~mpg + displacement + horsepower + weight + acceleration, Auto) 110 | ``` 111 | 112 | Now this is cool, you can plot variables, and then run identify. In rstudio you click on points you are interested in, then hit `ESC` on the keyboard. This will then give you a list of the points, and name the points on the figure! Cant show it here. 113 | 114 | ``` 115 | plot(horsepower, mpg) 116 | identify(horsepower,mpg,name) 117 | ``` 118 | 119 | 120 | FIN! 121 | 122 | -------------------------------------------------------------------------------- /R_Labs/Lab2/Lab2.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning in R Lab 2 (Chapter 3) 2 | ======================================================== 3 | 4 | Load some R libs 5 | 6 | ```{r} 7 | library(MASS) 8 | library(ISLR) 9 | lm.fit=lm(medv~lstat,data=Boston) 10 | summary(lm.fit) 11 | ``` 12 | 13 | Show some QC figures on this lm fit to the data. As expected there is a very significant correlation between median value of homes in a neighborhood, and proportion of the population that is lower status. There is a lot of interesting stuff here. For example we see that there are some points with high leverage that are also nearly 3 on the residuals axis. **Not sure what absolute values of leverage are considered really high, I will need to look into the chapter more deeply for that. Looks like Cook's distance doesn't come into play, would those be the particularly dangerous points? Perhaps that is the key determinate.** From earlier in the chapter, just before the section on colinearity, it says that the average value of leverage is `(p+1)/n`. Also it always varies between 0 and 1. Figure 3.13 which shows an example of a dangerous situation, has a point with a leverage of a little over 0.25. In this case we are looking at one variable, so I think `p=1 n=506` either `r 2/506` or `r 3/506` should be the mean value of the leverage. We have a few points that are around 0.025 which is an order of magnitude higher than what we should expect for the mean, not sure how significant this is though. 
Here are some interesting online resources: http://www.statmethods.net/stats/rdiagnostics.html 14 | 15 | ```{r fig.width=11, fig.height=11} 16 | par(mfrow=c(2,2)) 17 | plot(lm.fit) 18 | ``` 19 | 20 | And we can do a confidence interval on the coefficients. Remember that confidence intervals and prediction intervals are different. Prediction intervals as we will see are more conservative. 21 | 22 | ```{r} 23 | confint(lm.fit) 24 | 25 | predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval="confidence") 26 | predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval="prediction") 27 | ``` 28 | 29 | Here is a plot of the points, with the fitted line 30 | ```{r fig.width=7, fig.height=5} 31 | plot(Boston$lstat,Boston$medv) 32 | abline(lm.fit, lwd=3, col="red") 33 | ``` 34 | 35 | Plot of predicted fit vs residuals and studentized versions. Remember that the sutentized residuals are useful for determining outliers. Studentized residuals are devided by the standard error to make something like a Z score, so you can look at `sigma` levels. Values greater than 3 on studentized residuals are poential outliers. That corresponds to 3 `sigma` I think. 36 | 37 | ```{r fig.width=11, fig.height=5} 38 | par(mfrow=c(1,2)) 39 | plot(predict(lm.fit), residuals(lm.fit), main="Residuals") 40 | plot(predict(lm.fit), rstudent(lm.fit), main="Student Fit") 41 | ``` 42 | 43 | There is some evidence of non-linearity based on the resudials plot. Lets look into the leverage statistic using the hatvals function. 44 | 45 | ```{r fig.width=7, fig.height=5} 46 | plot(hatvalues(lm.fit)) 47 | ``` 48 | 49 | To see which index point has the max leverage, we can use which.max to return the index of the max value. 50 | ```{r} 51 | which.max(hatvalues(lm.fit)) 52 | ``` 53 | 54 | Multiple Linear Regression 55 | -------------------------- 56 | ```{r} 57 | lm.fit = lm(medv~lstat+age, data=Boston) 58 | summary(lm.fit) 59 | ``` 60 | 61 | We can also look at all variables at the same time with this shorthand syntax. 62 | ```{r} 63 | lm.fit = lm(medv~., data=Boston) 64 | summary(lm.fit) 65 | ``` 66 | 67 | Interesting that when we include all variables, age, which used to be significant, is no longer called that way. 68 | 69 | One cool thing is that we can access certain elements of the lm summary like so: 70 | ```{r} 71 | summary(lm.fit)$r.sq 72 | ``` 73 | 74 | Also the vif function from the car package can calcualte soemthing called the "variance inflation" factor. **How do we know what a good vs bad inflation factor is?** 75 | ```{r} 76 | library(car) 77 | vif(lm.fit) 78 | ``` 79 | 80 | Let's try a regression excluding one variable. There are two ways to do this, we can specify a new regression with the special -age syntax after the . syntax, otherwise we can use the "update" method to update the previous fit removing the age variable. We can remove multiple variables with this syntax as well. 81 | 82 | ```{r} 83 | lm.fit1=lm(medv~.-age,data=Boston) 84 | summary(lm.fit1) 85 | lm.fit2=update(lm.fit, ~.-age) 86 | summary(lm.fit2) 87 | lm.fit3=lm(medv~.-age-indus,data=Boston) 88 | summary(lm.fit3) 89 | ``` 90 | 91 | And here is a plot including the significant variables identified previously (-age,-industry) 92 | ```{r fig.width=11, fig.height=11} 93 | par(mfrow=c(2,2)) 94 | plot(lm.fit3) 95 | ``` 96 | 97 | 98 | Interaction Terms 99 | ---------------- 100 | When we use `lstat*age` in the formula, the individual terms `lstat` and `age` are automatically included. 
So for example we can do this to explore the version of lstat and age as interactive terms. 101 | 102 | ```{r} 103 | summary(lm(medv~lstat*age,data=Boston)) 104 | 105 | ``` 106 | 107 | Non-linear transformations of predictors 108 | ------------------- 109 | We can look at things like something squared using the `I()` function which helps when you want to use a special symbol in your equation. 110 | ```{r} 111 | lm.fit2=lm(medv~lstat+I(lstat^2), data=Boston) 112 | summary(lm.fit2) 113 | ``` 114 | 115 | We can use the `anova()` function to further quantify how much better the quadratic fit is superior to the linear fit. 116 | ```{r} 117 | lm.fit=lm(medv~lstat,data=Boston) 118 | anova(lm.fit,lm.fit2) 119 | ``` 120 | 121 | And here is a plot of the fits for lstat^2 122 | ```{r fig.width=11, fig.height=11} 123 | par(mfrow=c(2,2)) 124 | plot(lm.fit2) 125 | ``` 126 | 127 | We can also fit higher order polynomials using the `poly()` function, which does the work of writing out all of the decreasing polynomial terms for us given a variable, and the numbers of polynomials you want to fit. We could also do something like a log transform in the linear model 128 | 129 | ```{r} 130 | lm.fit5=lm(medv~poly(lstat,5),data=Boston) 131 | summary(lm.fit5) 132 | summary(lm(medv~log(rm),data=Boston)) 133 | ``` 134 | 135 | Qualitative Predictors 136 | ------------------ 137 | 138 | R actually automatically will create dummy variables when you have a predictor. We also add a few specific interaction terms to the full model, namely Income and Advertising, along with Price and Age. 139 | 140 | ```{r} 141 | lm.fit=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats) 142 | summary(lm.fit) 143 | ``` 144 | 145 | We can see what the dummy coding is for a variable using the `contrasts()` function. 146 | 147 | ```{r} 148 | contrasts(Carseats$ShelveLoc) 149 | ``` 150 | 151 | Cool! contrasts can be set for any factor, you can use the above function to specify what the contrasts are for a given factor in a dataframe. This sounds very handy. 152 | 153 | So to interpret the output of the `lm` above with interpreting the dummy variables, keep in mind that `ShelveLocGood` being a positive and significant association represents an improvement over the default case, `ShelveLocBad`, similarly `ShelveLocMedium` represents a slightly less, yet still positive and significant improvement. 
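One more small illustration (a sketch of my own, not part of the lab; `Carseats2` is just a throwaway name): `relevel()` changes which level is the baseline, which changes the dummy coding `lm()` uses and therefore how the coefficients read.

```{r}
# Sketch: re-basing a factor with relevel() changes the dummy coding, so the
# ShelveLoc coefficients become contrasts against "Medium" instead of "Bad".
library(ISLR)
Carseats2=Carseats
Carseats2$ShelveLoc=relevel(Carseats2$ShelveLoc,ref="Medium")
contrasts(Carseats2$ShelveLoc)
coef(lm(Sales~ShelveLoc,data=Carseats2))
```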
154 | 155 | 156 | ```{r} 157 | LoadLibraries <- function(){ 158 | library(ISLR) 159 | library(MASS) 160 | print("The libraries have been loaded.") 161 | } 162 | LoadLibraries() 163 | ``` 164 | 165 | -------------------------------------------------------------------------------- /R_Labs/Lab3/Lab3.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning 2 | ======================================================== 3 | 4 | 5 | 6 | ```{r} 7 | library(ISLR) 8 | names(Smarket) 9 | dim(Smarket) 10 | summary(Smarket) 11 | cor(Smarket[-9]) 12 | ``` 13 | 14 | Plot of volume 15 | 16 | ```{r fig.width=7, fig.height=6} 17 | attach(Smarket) 18 | plot(Volume) 19 | ``` 20 | 21 | 22 | ##Logistic Regression 23 | ```{r} 24 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5, data=Smarket, family=binomial) 25 | summary(glm.fit) 26 | coef(glm.fit) 27 | #all coefficient summaries 28 | summary(glm.fit)$coef 29 | #probabilities 30 | summary(glm.fit)$coef[,4] 31 | glm.probs=predict(glm.fit,type="response") 32 | glm.probs[1:10] 33 | contrasts(Direction) 34 | ``` 35 | 36 | Note that the p values in the probabilities are for the stock market going up, since we can see that the dummy variable reproted by `contrasts(Direction)` has 1 on `Up`, and 0 on `Down`. 37 | 38 | Come up with predictions based on this model (on the training data), and make a confusion matrix. 39 | ```{r} 40 | glm.pred=rep("Down",1250) 41 | glm.pred[glm.probs>.5]="Up" 42 | table(glm.pred,Direction) 43 | (550+116)/1250 44 | mean(glm.pred==Direction) 45 | ``` 46 | 47 | However this is the training error rate, lets create a test set and try again. 48 | 49 | ```{r} 50 | train=(Year<2005) 51 | Smarket.2005=Smarket[!train,] 52 | dim(Smarket.2005) 53 | Direction.2005=Direction[!train] 54 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket,family=binomial,subset=train) 55 | glm.probs=predict(glm.fit,Smarket.2005,type="response") 56 | glm.pred=rep("Down",252) 57 | glm.pred[glm.probs>0.5]="Up" 58 | table(glm.pred,Direction.2005) 59 | mean(glm.pred==Direction.2005) 60 | mean(glm.pred!=Direction.2005) 61 | ``` 62 | 63 | Yikes, the last line with `!=Direction.2005` computes the test error, which is worse than random chance! Oh well.. 64 | 65 | Lets see what happens if we get rid of some of the lower p-value predictors in the model, those likely contribute noise. 66 | 67 | ```{r} 68 | glm.fit=glm(Direction~Lag1+Lag2, data=Smarket,family=binomial,subset=train) 69 | glm.probs=predict(glm.fit,Smarket.2005,type="response") 70 | glm.pred=rep("Down",252) 71 | glm.pred[glm.probs>0.5]="Up" 72 | table(glm.pred,Direction.2005) 73 | mean(glm.pred==Direction.2005) 74 | mean(glm.pred!=Direction.2005) 75 | ``` 76 | 77 | To predict on some new days, say two days, you can do the following: 78 | ```{r} 79 | predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5), Lag2=c(1.1,-0.8)),type="response") 80 | ``` 81 | 82 | We would guess that the market is going to go down these days. 
83 | 84 | ### LDA 85 | ```{r} 86 | library(MASS) 87 | lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train) 88 | lda.fit 89 | ``` 90 | 91 | ```{r fig.height=5, fig.width=7} 92 | plot(lda.fit) 93 | ``` 94 | 95 | ```{r} 96 | lda.pred=predict(lda.fit, Smarket.2005) 97 | names(lda.pred) 98 | lda.class=lda.pred$class 99 | table(lda.class,Direction.2005) 100 | mean(lda.class==Direction.2005) 101 | sum(lda.pred$posterior[,1]>=.5) 102 | sum(lda.pred$posterior[,1]<.5) 103 | lda.pred$posterior[1:20,1] 104 | lda.class[1:20] 105 | sum(lda.pred$posterior[,1]>.54) 106 | max(lda.pred$posterior[,1]) 107 | min(lda.pred$posterior[,1]) 108 | ``` 109 | 110 | ### QDA 111 | ```{r} 112 | qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train) 113 | qda.fit 114 | qda.class=predict(qda.fit,Smarket.2005)$class 115 | table(qda.class,Direction.2005) 116 | mean(qda.class==Direction.2005) 117 | ``` 118 | 119 | 120 | 60% accuracy on stock market data? wow. There is no default `plot()` that takes qda.fitted results. 121 | 122 | ### K-Nearest Neighbors 123 | ```{r} 124 | library(class) 125 | train.X=cbind(Lag1,Lag2)[train,] 126 | test.X=cbind(Lag1,Lag2)[!train,] 127 | train.Direction=Direction[train] 128 | set.seed(1) 129 | knn.pred=knn(train.X,test.X,train.Direction,k=1) 130 | table(knn.pred,Direction.2005) 131 | mean(knn.pred==Direction.2005) 132 | 133 | knn.pred=knn(train.X,test.X,train.Direction,k=3) 134 | table(knn.pred,Direction.2005) 135 | mean(knn.pred==Direction.2005) 136 | ``` 137 | 138 | ### Caravan Insurance Data 139 | Caravan insurance is just insurance for caravans... Thought it might be something else for some reason. 140 | 141 | ```{r} 142 | dim(Caravan) 143 | attach(Caravan) 144 | sp=summary(Purchase) 145 | sp 146 | sp["Yes"]/sum(sp) 147 | ``` 148 | 149 | Since KNN is based on distance, and different variables can have very different scales, things need to be scaled. Consider salary and age, salary can change in thousands easily, age will mostly range 0-100. Consider that a salary difference of 1000 should be small compared to an age difference of 50 years, need to _standardize_ the data so that KNN knows these scales. 150 | 151 | 152 | The scale function in R does this automagically! 153 | 154 | Col 86 is Purchase which is qualative, and will be left out for this. 155 | ```{r} 156 | standardized.X=scale(Caravan[-86]) 157 | var(Caravan[,1]) 158 | var(Caravan[,2]) 159 | var(standardized.X[,1]) 160 | var(standardized.X[,2]) 161 | 162 | test=1:1000 163 | train.X=standardized.X[-test,] 164 | test.X=standardized.X[test,] 165 | train.Y=Purchase[-test] 166 | test.Y=Purchase[test] 167 | 168 | set.seed(1) 169 | knn.pred=knn(train.X,test.X,train.Y,k=1) 170 | mean(test.Y!=knn.pred) 171 | mean(test.Y!="No") 172 | ``` 173 | 174 | Keep in mind that although 12% error sounds really good, if we just always predicted "No" we would have only 6% error. 175 | 176 | ```{r} 177 | table(knn.pred,test.Y) 178 | 9/(68+9) 179 | 180 | knn.pred=knn(train.X,test.X,train.Y,k=3) 181 | table(knn.pred,test.Y) 182 | 5/26 183 | 184 | knn.pred=knn(train.X,test.X,train.Y,k=5) 185 | table(knn.pred,test.Y) 186 | 4/15 187 | 188 | ``` 189 | 190 | We could also try with logistic regression. Since there are so few positives, the predictor doesn't do so well with the default probability cutoff of 0.5 if we are hoping to identify the people we want to spend time trying to sell insurance to. 
191 | 192 | ```{r} 193 | glm.fit=glm(Purchase~.,data=Caravan,family=binomial,subset=-test) 194 | glm.probs=predict(glm.fit,Caravan[test,],type="response") 195 | glm.pred=rep("No",1000) 196 | glm.pred[glm.probs >.5]="Yes" 197 | table(glm.pred,test.Y) 198 | #yikes, all of our guesses on "yes" are wrong! 199 | 200 | #try with a different p cutoff 201 | glm.pred=rep("No",1000) 202 | glm.pred[glm.probs >.25]="Yes" 203 | table(glm.pred,test.Y) 204 | 11/(22+11) 205 | ``` 206 | 207 | ```{r fig.height=5,fig.width=7} 208 | plot(factor(test.Y),glm.probs) 209 | abline(0.25,0,col="red",lwd=2) 210 | ``` 211 | Probably a better plot would be one that shows the TP rate vs the FP rate or something, bet it has a dip around 0.25 or something. 212 | -------------------------------------------------------------------------------- /R_Labs/Lab4/Lab4.rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Lab 4: Cross validation and the bootstrap 2 | ======================================================== 3 | 4 | ************* 5 | ## Train/Test split 6 | ```{r} 7 | library(ISLR) 8 | set.seed(1) 9 | train=sample(392,196) 10 | lm.fit=lm(mpg~horsepower,data=Auto,subset=train) 11 | attach(Auto) 12 | mean((mpg-predict(lm.fit,Auto))[-train]^2) 13 | 14 | lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train) 15 | mean((mpg-predict(lm.fit2,Auto))[-train]^2) 16 | 17 | lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train) 18 | mean((mpg-predict(lm.fit3,Auto))[-train]^2) 19 | 20 | set.seed(2) 21 | train=sample(392,196) 22 | lm.fit=lm(mpg~horsepower,data=Auto,subset=train) 23 | mean((mpg-predict(lm.fit,Auto))[-train]^2) 24 | 25 | lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train) 26 | mean((mpg-predict(lm.fit2,Auto))[-train]^2) 27 | 28 | lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train) 29 | mean((mpg-predict(lm.fit3,Auto))[-train]^2) 30 | ``` 31 | 32 | Little improvement for cubic function, but quadratic improves over linear. Interesting that different test sets perform so differently. 33 | 34 | ************* 35 | ## LOOCV 36 | 37 | So glm has a nice cv function which is handy. Note that GLM can do linear models as well as a bunch of others, so it can be a drop in replacement for lm. 38 | 39 | ```{r} 40 | glm.fit=glm(mpg~horsepower,data=Auto) 41 | coef(glm.fit) 42 | #vs 43 | lm.fit=lm(mpg~horsepower,data=Auto) 44 | coef(lm.fit) 45 | ``` 46 | 47 | cv and all for glm is part of the `boot` library. 48 | 49 | ```{r} 50 | library(boot) 51 | glm.fit=glm(mpg~horsepower,data=Auto) 52 | cv.err=cv.glm(Auto,glm.fit) 53 | cv.err$delta 54 | ``` 55 | 56 | Lets use CV to find which polynomial fit is optimal for this horsepower data. Lets do this in parallelz b/c it takes a while otherwise. 57 | 58 | ```{r} 59 | library(multicore) 60 | cv.error=unlist(mclapply(1:5,function(i){ 61 | glm.fit=glm(mpg~poly(horsepower,i),data=Auto) 62 | cv.glm(Auto,glm.fit)$delta[1] 63 | }, mc.cores=5)) 64 | 65 | cv.error 66 | ``` 67 | 68 | ************* 69 | ## K fold CV 70 | We can also do this with k fold cv which goes faster. 
71 | 72 | ```{r} 73 | set.seed(17) 74 | cv.error.10=unlist(mclapply(1:10,function(i){ 75 | glm.fit=glm(mpg~poly(horsepower,i),data=Auto) 76 | cv.glm(Auto,glm.fit,K=10)$delta[1] 77 | }, mc.cores=10)) 78 | 79 | cv.error.10 80 | ``` 81 | 82 | from the text, this is an interesting point about what the two $delta values mean: 83 | 84 | > We saw in Section 5.3.2 that the two numbers associated with delta are essentially the same when LOOCV is performed. When we instead perform k-fold CV, then the two numbers associated with delta differ slightly. The first is the standard k-fold CV estimate, as in (5.3). The second is a bias- corrected version. On this data set, the two estimates are very similar to each other. 85 | 86 | Also note that cv.glm does not use the computational speed up that is possible for LOOCV with least-squares fit models given in equation formula 5.2. This would have actually made LOOCV faster than K-fold CV rather than the other way around! 87 | 88 | ************* 89 | ## The bootstrap 90 | 91 | ```{r} 92 | alpha.fn=function(data,index){ 93 | X=data$X[index] 94 | Y=data$Y[index] 95 | return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y))) 96 | } 97 | #using all 100 observations to get alpha: 98 | alpha.fn(Portfolio,1:100) 99 | 100 | #or we can sample bootstrap style 101 | set.seed(1) 102 | alpha.fn(Portfolio,sample(100,100,replace=T)) 103 | 104 | #and we can use the boot function to automate this thousands of times 105 | boot(Portfolio,alpha.fn,R=1000) 106 | ``` 107 | 108 | now were going to use the bootstrap to help determine accuracy of lm fit. 109 | 110 | ```{r} 111 | boot.fn=function(data,index){ 112 | return(coef(lm(mpg~horsepower,data=data,subset=index))) 113 | } 114 | 115 | #simply compute coefficient estimates 116 | boot.fn(Auto,1:392) 117 | 118 | set.seed(1) 119 | 120 | #one bootstrap round 121 | boot.fn(Auto,sample(392,392,replace=T)) 122 | 123 | #now do a thousand! 124 | boot(Auto,boot.fn,1000) 125 | 126 | #however in the simple case of linear regression, we can also get these 127 | # estimates with the summary() function from the fit itself 128 | # as was described in section 3.1.2 129 | summary(lm(mpg~horsepower,data=Auto))$coef 130 | ``` 131 | 132 | Interestingly, the formula given in equation 3.8 that the summary function uses to calculate the estimate of the beta standard errors rely on certain assumptions about the underlying data. Like the population $\sigma^2$ which is estimated from the RSS. This $\sigma^2$ relies on the model being correct! The non-linear relationship in the data causes inflated residuals and an inflated $\hat\sigma^2$. Also the standard formulas assume that $x_i$ are fixed and that $\epsilon_i$ is the sole source of variability, which is weird. The bootstrap does not have these assumptions, so it is probably more accurate in its estimates of the errors around $\hat\beta_0, \hat\beta_1$. 133 | 134 | Here is an example where the model is closer to the correct one, how the boostrap and summary estimates should be closer. 
135 | 136 | ```{r} 137 | boot.fn=function(data,index){ 138 | coefficients(lm(mpg~horsepower+I(horsepower^2), data=data, subset=index)) 139 | } 140 | 141 | set.seed(1) 142 | boot(Auto,boot.fn,1000) 143 | summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef 144 | ``` 145 | 146 | 147 | 148 | -------------------------------------------------------------------------------- /R_Labs/Lab5/Lab5.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Lab 5: Feature selection, subset selection, etc 2 | ======================================================== 3 | 4 | # Lab 1: Subset selection methods 5 | 6 | ## Best subset selection 7 | ```{r} 8 | library(ISLR) 9 | #fix(Hitters) 10 | names(Hitters) 11 | dim(Hitters) 12 | sum(is.na(Hitters$Salary)) 13 | Hitters=na.omit(Hitters) 14 | sum(is.na(Hitters)) 15 | ``` 16 | 17 | And now lets do best subset selection with the leaps library 18 | 19 | ```{r} 20 | library(leaps) 21 | regfit.full=regsubsets(Salary~.,Hitters) 22 | summary(regfit.full) 23 | regfit.full=regsubsets(Salary~.,data=Hitters,nvmax=19) 24 | reg.summary=summary(regfit.full) 25 | names(reg.summary) 26 | reg.summary$rsq 27 | ``` 28 | 29 | Some r plots showing various optimal numbers of features given different penalties for overfitting. 30 | ```{r fig.height=11,fig.width=11} 31 | par(mfrow=c(2,2)) 32 | plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") 33 | plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") 34 | points(which.max(reg.summary$adjr2), 35 | reg.summary$adjr2[which.max(reg.summary$adjr2)], 36 | col="red",cex=2,pch=20) 37 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 38 | type="l") 39 | points(which.min(reg.summary$cp), 40 | reg.summary$cp[which.min(reg.summary$cp)], 41 | col="red",cex=2,pch=20) 42 | plot(reg.summary$bic,xlab="Number of Variables", 43 | ylab="BIC", type="l") 44 | points(which.min(reg.summary$bic), 45 | reg.summary$bic[which.min(reg.summary$bic)], 46 | col="red",cex=2,pch=20) 47 | ``` 48 | 49 | We can also do the default plots for this package which show which things were selected given different numbers of features and different penalty terms. 50 | ```{r fig.height=11,fig.width=11} 51 | par(mfrow=c(2,2)) 52 | plot(regfit.full,scale="r2") 53 | plot(regfit.full,scale="adjr2") 54 | plot(regfit.full,scale="Cp") 55 | plot(regfit.full,scale="bic") 56 | ``` 57 | 58 | The coeficient function can take as an argument the model (in terms of number of features) and it will output the coefficient estimates for that model. 59 | ```{r} 60 | coef(regfit.full,6) 61 | ``` 62 | 63 | 64 | ## Forward and backward stepwise selection 65 | 66 | ```{r} 67 | regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward") 68 | summary(regfit.fwd) 69 | regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward") 70 | summary(regfit.bwd) 71 | coef(regfit.full,7) 72 | coef(regfit.fwd,7) 73 | coef(regfit.bwd,7) 74 | ``` 75 | Note how different selection methods produce different sets of data. 76 | 77 | ## Choosing among models using the validation set approach and cv. 
78 | 79 | ```{r} 80 | set.seed(1) 81 | train=sample(c(TRUE,FALSE),nrow(Hitters),rep=TRUE) 82 | test=(!train) 83 | regfit.best=regsubsets(Salary~.,data=Hitters[train,],nvmax=19) 84 | test.mat=model.matrix(Salary~.,data=Hitters[test,]) 85 | val.errors=rep(NA,19) 86 | for(i in 1:19){ 87 | coefi=coef(regfit.best,id=i) 88 | pred=test.mat[,names(coefi)]%*%coefi # dot product of coeficients is the 89 | # prediction 90 | val.errors[i]=mean((Hitters$Salary[test]-pred)^2) 91 | } 92 | val.errors 93 | which.min(val.errors) 94 | coef(regfit.best,which.min(val.errors)) 95 | ``` 96 | 97 | Function to do the predicting we did above 98 | ```{r} 99 | predict.regsubsets=function(object,newdata,id,...){ 100 | form=as.formula(object$call[[2]]) ## extract formula 101 | mat=model.matrix(form,newdata) 102 | coefi=coef(object,id=id) 103 | xvars=names(coefi) 104 | mat[,xvars]%*%coefi 105 | } 106 | ``` 107 | 108 | 109 | ```{r} 110 | regfit.best=regsubsets(Salary~.,data=Hitters,nvmax=19) 111 | coef(regfit.best,10) 112 | ``` 113 | 114 | 115 | 116 | ### Now doing with cv 117 | 118 | ```{r} 119 | k=10 120 | set.seed(1) 121 | folds=sample(1:k,nrow(Hitters),replace=TRUE) 122 | cv.errors=matrix(NA,k,19,dimnames=list(NULL,paste(1:19))) 123 | 124 | for(j in 1:k){ 125 | best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],nvmax=19) 126 | for(i in 1:19){ 127 | pred=predict(best.fit,Hitters[folds==j,],id=i) 128 | cv.errors[j,i]=mean((Hitters$Salary[folds==j]-pred)^2) 129 | } 130 | } 131 | 132 | mean.cv.errors=apply(cv.errors,2,mean) 133 | mean.cv.errors 134 | which.min(mean.cv.errors) 135 | ``` 136 | So the above stores the k fold cv results in a matrix. For fold j, there are 19 optimal variable subset models to test (hence the 10X19 matrix). Now to find which performed the best, we average the error across each of the 10 cv rounds for a given number of variables that we are interested in testing in our model. Plotting these averages out, we see that 11 is the best model. 137 | 138 | ```{r fig.width=7,fig.height=5} 139 | par(mfrow=c(1,1)) 140 | plot(mean.cv.errors,type='b') 141 | points(which.min(mean.cv.errors),mean.cv.errors[which.min(mean.cv.errors)], 142 | col="red",cex=2,pch=20) 143 | ``` 144 | 145 | And now we train the best modle on all of the datas 146 | ```{r} 147 | reg.best=regsubsets(Salary~.,data=Hitters,nvmax=19) 148 | coef(reg.best,which.min(mean.cv.errors)) 149 | ``` 150 | 151 | ************************* 152 | # Lab 2: Ridge Regression and the Lasso 153 | 154 | ```{r} 155 | x=model.matrix(Salary~.,Hitters)[,-1] 156 | y=Hitters$Salary 157 | ``` 158 | Note that above model.matrix is being used for the side effect that it converts categorical variables into sets of dummy variables. So for example NewLeague could take on the value A and N. model.matrix took this, chose n, and made a new column called "NewLeagueN" with the binary values 0 and 1. This is required prior to running glmnet because it needs numerical /quantitative inputs. 159 | 160 | ## Ridge regression 161 | glmnet takes the alpha argument which you can use to tell it what kind of model to fit. For example alpha=1 is a lasso model, and alpha=0 is a ridge regression model. 162 | 163 | ```{r} 164 | library(glmnet) 165 | grid=10^seq(10,-2,length=100) 166 | #spreads out the range 10 to -2 to 100 167 | #equally spaced intermediate values 168 | ridge.mod=glmnet(x,y,alpha=0,lambda=grid) 169 | ``` 170 | 171 | By default glmnet does ridge regression on an automagically selected range of $\lambda$ values. 
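To see what that automatic sequence looks like, here is a minimal sketch (it rebuilds the design matrix under new names so the chunk stands on its own):

```{r}
# Sketch: with no lambda argument glmnet picks its own decreasing sequence,
# 100 values by default, with the range chosen from the data.
library(ISLR)
library(glmnet)
Hitters2=na.omit(Hitters)
x2=model.matrix(Salary~.,Hitters2)[,-1]
y2=Hitters2$Salary
ridge.default=glmnet(x2,y2,alpha=0)
length(ridge.default$lambda)   # 100 by default (nlambda)
range(ridge.default$lambda)    # data-driven range of the penalty
```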
172 | 173 | Glmnet also standardizes variables which may or may not be problematic, we may want to do that ourselves first in some way for example. To turn this setting off we can do `standardize=FALSE`. 174 | 175 | The coefficients are stored in there for each value of lambda in our previous grid. So this should be a #variable by 100 matrix. 176 | ```{r} 177 | dim(coef(ridge.mod)) 178 | ``` 179 | 180 | for the 50th lambda we can see some info 181 | 182 | ```{r} 183 | ridge.mod$lambda[50] 184 | coef(ridge.mod)[,50] 185 | #calculate the l2 norm by the following 186 | sqrt(sum(coef(ridge.mod)[-1,50]^2)) 187 | ``` 188 | 189 | vs when a lower value of lambda is used, 190 | ```{r} 191 | ridge.mod$lambda[60] 192 | coef(ridge.mod)[,60] 193 | #calculate the l2 norm by the following 194 | sqrt(sum(coef(ridge.mod)[-1,60]^2)) 195 | ``` 196 | 197 | We can use predict to get ridge regression coefficients for a new value of $\lambda=50$. 198 | ```{r} 199 | predict(ridge.mod,s=50,type="coefficients")[1:20,] 200 | ``` 201 | 202 | Note how as $\lambda$ gets smaller, fewer of the coefficients are nearly 0. Basically smaller $\lambda$'s mean lower constraints and the closer the model is to ordinary least-squares. 203 | 204 | Here is another method of doing subset selection, prior we did this with a vector of TRUE/FALSE, now we do with a list of indices. 205 | ```{r} 206 | set.seed(1) 207 | train=sample(1:nrow(x),nrow(x)/2) 208 | test=(-train) 209 | y.test=y[test] 210 | 211 | 212 | ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid,thresh=1e-12) 213 | ridge.pred=predict(ridge.mod,s=4,newx=x[test,]) 214 | mean((ridge.pred-y.test)^2) 215 | 216 | #if we fit with *only* the intercept, and no other beta coefficients 217 | # then the outcome would be the mean of the training data, and 218 | # it would just be the mean of the training cases. 219 | mean((mean(y[train])-y.test)^2) 220 | 221 | ### 222 | # we can also get this with a super-high lambda value, which 223 | # essentially sets all betas to nearly 0. 224 | ridge.pred=predict(ridge.mod,s=1e10,newx=x[test,]) 225 | mean((ridge.pred-y.test)^2) 226 | 227 | ridge.pred=predict(ridge.mod,s=0,newx=x[test,],exact=T) 228 | #need to use exact to get the answer close to least-squares due to 229 | # numerical approximation. 230 | mean((ridge.pred-y.test)^2) 231 | lm(y~x,subset=train) 232 | predict(ridge.mod,s=0,exact=T,type="coefficients")[1:20,] 233 | ``` 234 | 235 | Lets use CV and do some better selection of $\lambda$ 236 | 237 | ```{r fig.width=7,fig.height=5} 238 | set.seed(1) 239 | cv.out=cv.glmnet(x[train,],y[train],alpha=0) 240 | plot(cv.out) 241 | bestlam=cv.out$lambda.min 242 | bestlam 243 | ``` 244 | 245 | Lets see how this does on the test data! 246 | ```{r} 247 | ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,]) 248 | mean((ridge.pred-y.test)^2) 249 | ``` 250 | 251 | seems to perform better than $\lambda=4$. 252 | 253 | Let's see what the coefficients are like for the entire dataset now. 
254 | ```{r} 255 | out=glmnet(x,y,alpha=0) 256 | predict(out,type="coefficients",s=bestlam)[1:20,] 257 | ``` 258 | 259 | ## The Lasso 260 | 261 | ```{r fig.height=5,fig.width=7} 262 | lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid) 263 | plot(lasso.mod) 264 | ``` 265 | 266 | ```{r fig.height=5,fig.width=7} 267 | cv.out=cv.glmnet(x[train,],y[train],alpha=1) 268 | plot(cv.out) 269 | bestlam=cv.out$lambda.min 270 | bestlam 271 | lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,]) 272 | mean((lasso.pred-y.test)^2) 273 | out=glmnet(x,y,alpha=1,lambda=grid) 274 | lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,] 275 | lasso.coef 276 | lasso.coef[lasso.coef!=0] 277 | ``` 278 | 279 | Note that a bunch of the variables are exactly 0! Much easier to interpret, basically subset selection happened on the variables which is pretty awesome. 280 | 281 | The output best lasso model has only 7 variables, and discards 12! 282 | 283 | 284 | 285 | ************************* 286 | # Lab 3: PCR and PLS Regression 287 | 288 | ## Principal Components Regression 289 | ```{r fig.height=5, fig.width=7} 290 | library(pls) 291 | set.seed(2) 292 | pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation="CV") 293 | summary(pcr.fit) 294 | 295 | validationplot(pcr.fit, val.type="MSEP") 296 | ``` 297 | 298 | ```{r fig.height=5, fig.width=7} 299 | set.seed(1) 300 | pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, subset=train, validation="CV") 301 | 302 | validationplot(pcr.fit, val.type="MSEP") 303 | ``` 304 | 305 | 306 | ```{r} 307 | pcr.pred=predict(pcr.fit,x[test,],ncomp=7) 308 | mean((pcr.pred-y.test)^2) 309 | ``` 310 | 311 | 312 | Comparable performance to ridge regression and lasso, but harder to interpret b/c doesn't give us selected variables or coefficients! 313 | 314 | ```{r} 315 | pcr.fit=pcr(y~x,scale=TRUE,ncomp=7) 316 | summary(pcr.fit) 317 | ``` 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | -------------------------------------------------------------------------------- /R_Labs/Lab6/Lab6.Rmd: -------------------------------------------------------------------------------- 1 | Introduction to Statistical Learning Lab 6 2 | ======================================================== 3 | 4 | ## Polynomial regression and step functions (and rug plots!) 5 | ```{r} 6 | library(ISLR) 7 | attach(Wage) 8 | fit=lm(wage~poly(age,4),data=Wage) 9 | coef(summary(fit)) 10 | ``` 11 | 12 | But this does a linear combination of the polynomials, not the raw polynomials! This is orthoganal though, so it is a different basis and the overal fit will be equivalent, but it does result in different coefficients! 13 | 14 | ```{r} 15 | library(ISLR) 16 | attach(Wage) 17 | fit2=lm(wage~poly(age,4,raw=T),data=Wage) 18 | coef(summary(fit2)) 19 | 20 | # or alternatively 21 | fit2a=lm(wage~age+I(age^2)+I(age^3)+I(age^4)) 22 | coef(fit2a) 23 | 24 | #and a third way! 
25 | fit2b=lm(wage~cbind(age,age^2,age^3,age^4)) 26 | coef(fit2b) 27 | 28 | 29 | ## 30 | # get predictions for a range of ages and std errors around those 31 | # predictions 32 | agelims=range(age) #returns the range of this list (2 values), lower->upper 33 | age.grid=seq(from=agelims[1],to=agelims[2]) 34 | preds=predict(fit,newdata=list(age=age.grid),se=TRUE) 35 | se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit) 36 | ``` 37 | 38 | ```{r,fig.width=11,fig.height=5} 39 | par(mfrow=c(1,2),mar=c(4.5,4.5,1,1),oma=c(0,0,4,0)) 40 | plot(age,wage,xlim=agelims,cex=0.5,col="darkgrey") 41 | title("Degree-4 Polynomial", outer=T) 42 | lines(age.grid,preds$fit,lwd=2,col="blue") 43 | matlines(age.grid,se.bands,lwd=1,col="blue",lty=3) 44 | ``` 45 | 46 | Note that the right panel of this figure will be filled in later. 47 | 48 | 49 | ```{r} 50 | ##nearly identical predictions from orthogonal and raw data 51 | preds2=predict(fit2,newdata=list(age=age.grid),se=TRUE) 52 | max(abs(preds$fit-preds2$fit)) 53 | ``` 54 | 55 | 56 | One way to decide which degree of polynomial to use is with a hypothesis test. Note that this requires that the models are nested for anova to be the same as the results we get out of the summary function to the lm fit. Anova is more general though. 57 | 58 | ```{r} 59 | fit.1=lm(wage~age,data=Wage) 60 | fit.2=lm(wage~poly(age,2),data=Wage) 61 | fit.3=lm(wage~poly(age,3),data=Wage) 62 | fit.4=lm(wage~poly(age,4),data=Wage) 63 | fit.5=lm(wage~poly(age,5),data=Wage) 64 | 65 | anova(fit.1,fit.2,fit.3,fit.4,fit.5) 66 | 67 | #alternatively we could just have looked at the p values on the coefficients for the 5th degree model. 68 | coef(summary(fit.5)) 69 | 70 | 71 | ## more general anova test 72 | fit.1=lm(wage~education+age,data=Wage) 73 | fit.2=lm(wage~education+poly(age,2),data=Wage) 74 | fit.3=lm(wage~education+poly(age,3),data=Wage) 75 | anova(fit.1,fit.2,fit.3) 76 | ``` 77 | 78 | Lets move on to predict which people make more than 250k. 79 | ```{r} 80 | fit=glm(I(wage>250)~poly(age,4),data=Wage,family=binomial) 81 | preds3=predict(fit,newdata=list(age=age.grid),se=T) 82 | 83 | ## Need to transform the SE estimates, we have a fit to a logit 84 | pfit=exp(preds3$fit)/(1+exp(preds3$fit)) 85 | se.bands.logit=cbind(preds3$fit+2*preds3$se.fit, preds3$fit-2*preds3$se.fit) 86 | se.bands2=exp(se.bands.logit)/(1+exp(se.bands.logit)) 87 | 88 | #alternatively we could have gotten this directly by saying 89 | # type="response" to the predict function: 90 | #preds=predict(fit,newdata=list(age=age.grid),type="response",se=T) 91 | # however in this case the confidence intervals are not sensible because 92 | # they should represent probabilities but can come out negative! 93 | # With the above transformation this is not an issue, and the probabilities 94 | # remain well behaved. 
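# (Added sketch of a sanity check: the manually transformed fit should match
#  the probabilities returned by predict(..., type="response").)
# all.equal(as.numeric(pfit),
#           as.numeric(predict(fit,newdata=list(age=age.grid),type="response")))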
95 | ``` 96 | 97 | Now for the full figure 7.1 plot 98 | ```{r,fig.width=11,fig.height=5} 99 | #previous section 100 | par(mfrow=c(1,2),mar=c(4.5,4.5,1,1),oma=c(0,0,4,0)) 101 | plot(age,wage,xlim=agelims,cex=0.5,col="darkgrey") 102 | title("Degree-4 Polynomial (matching fig 7.1)", outer=T) 103 | lines(age.grid,preds$fit,lwd=2,col="blue") 104 | matlines(age.grid,se.bands,lwd=1,col="blue",lty=3) 105 | 106 | #new data for prediction of income over 250k 107 | plot(age,I(wage>250),xlim=agelims,type="n",ylim=c(0,.2)) 108 | #add in density ticks (jitter helps with this) at top and bottom of panel (0 and 0.2 look good I guess) 109 | points(jitter(age),I((wage>250)/5),cex=.5,pch="|",col="darkgrey") 110 | 111 | #add in probability of being a really high wage earner, along with 112 | # std error on this polynomial logistic regression probability fit. 113 | lines(age.grid,pfit,lwd=2,col="blue") 114 | matlines(age.grid,se.bands2,lwd=1,col="blue",lty=3) 115 | ``` 116 | 117 | Note the above plot type is often called a "rug plot" 118 | 119 | 120 | Step functions can be fit with the cut function 121 | 122 | ```{r} 123 | table(cut(age,4)) 124 | fit=lm(wage~cut(age,4),data=Wage) 125 | coef(summary(fit)) 126 | ``` 127 | 128 | NOTE: 129 | > The age<33.5 category is left out, so the intercept coefficient of $94,160 can be interpreted as the average salary for those under 33.5 years of age, and the other coefficients can be interpreted as the average additional salary for those in the other age groups. We can produce predictions and plots just as we did in the case of the polynomial fit. 130 | 131 | ## Splines 132 | 133 | ```{r fig.height=5,fig.width=7} 134 | library(splines) 135 | fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage) 136 | pred=predict(fit,newdata=list(age=age.grid),se=T) 137 | plot(age,wage,col="gray") 138 | lines(age.grid,pred$fit,lwd=2) 139 | lines(age.grid,pred$fit+2*pred$se,lty="dashed") 140 | lines(age.grid,pred$fit-2*pred$se,lty="dashed") 141 | dim(bs(age,knots=c(25,40,60)))#or specified specifically 142 | dim(bs(age,df=6))#knots can be chosen automagically at uniform quantiles in the data 143 | attr(bs(age,df=6),"knots") 144 | fit2=lm(wage~ns(age,df=4),data=Wage) 145 | pred2=predict(fit2,newdata=list(age=age.grid),se=T) 146 | lines(age.grid,pred2$fit,col="red",lwd=2) 147 | lines(age.grid,pred2$fit+2*pred2$se,col="red",lty="dashed") 148 | lines(age.grid,pred2$fit-2*pred2$se,col="red",lty="dashed") 149 | ``` 150 | 151 | ### Smooth.spline, and figure 7.8 replication: 152 | 153 | ```{r fig.height=5,fig.width=7} 154 | plot(age,wage,xlim=agelims,cex=.5,col="darkgrey") 155 | title("Smoothing Spline") 156 | fit=smooth.spline(age,wage,df=16) 157 | fit2=smooth.spline(age,wage,cv=TRUE) 158 | fit2$df 159 | lines(fit,col="red",lwd=2) 160 | lines(fit2,col="blue",lwd=2) 161 | legend("topright",legend=c("16 DF",sprintf("%.1f DF",fit2$df)), 162 | col=c("red","blue"),lty=1,lwd=2,cex=.8) 163 | ``` 164 | 165 | ### LOESS-- local regression. 
166 | 
167 | ```{r fig.height=5,fig.width=7}
168 | plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
169 | title("Local Regression")
170 | fit=loess(wage~age,span=.2,data=Wage)
171 | fit2=loess(wage~age,span=.5,data=Wage)
172 | lines(age.grid,predict(fit,data.frame(age=age.grid)),
173 | col="red",lwd=2)
174 | lines(age.grid,predict(fit2,data.frame(age=age.grid)),
175 | col="blue",lwd=2)
176 | legend("topright",legend=c("Span=0.2","Span=0.5"),
177 | col=c("red","blue"),lty=1,lwd=2,cex=.8)
178 | #span is the fraction of the data used in each local fit
179 | ```
180 | 
181 | ## GAMs
182 | 
183 | Figure 7.11
184 | ```{r fig.width=11, fig.height=5}
185 | library(gam)
186 | gam1=lm(wage~ns(year,4)+ns(age,5)+education,data=Wage)
187 | gam.m3=gam(wage~s(year,4)+s(age,5)+education,data=Wage)
188 | par(mfrow=c(1,3))
189 | plot(gam.m3,se=TRUE,col="blue")
190 | ```
191 | 
192 | Figure 7.12
193 | ```{r fig.width=11, fig.height=5}
194 | par(mfrow=c(1,3))
195 | plot.gam(gam1,se=TRUE,col="red") #must call plot.gam explicitly since gam1 is an
196 | # lm fit rather than of the gam class, so the plot() generic would not dispatch
197 | # to plot.gam for the gam1 object.
198 | ```
199 | 
200 | ANOVA to choose among the GAM models
201 | ```{r}
202 | gam.m1=gam(wage~s(age,5)+education,data=Wage)
203 | gam.m2=gam(wage~year+s(age,5)+education,data=Wage)
204 | anova(gam.m1,gam.m2,gam.m3,test="F")
205 | ```
206 | 
207 | Looks like good evidence for including year, but a linear function of year is sufficient (the spline in year gives no significant improvement over the linear term).
208 | 
209 | ```{r}
210 | summary(gam.m3) # the nonparametric-effects p-values test a null of a linear
211 | # relationship against the alternative of a nonlinear one. COOL!
212 | preds=predict(gam.m2,newdata=Wage)
213 | head(preds)
214 | gam.lo=gam(wage~s(year,df=4)+lo(age,span=0.7)+education,data=Wage)
215 | ```
216 | 
217 | ```{r fig.width=11,fig.height=5}
218 | par(mfrow=c(1,3))
219 | plot.gam(gam.lo,se=TRUE,col="green")
220 | ```
221 | 
222 | ```{r fig.width=7, fig.height=5}
223 | gam.lo.i=gam(wage~lo(year,age,span=0.5)+education,data=Wage)
224 | library(akima)
225 | plot(gam.lo.i)
226 | ```
227 | 
228 | ```{r fig.height=5, fig.width=11}
229 | gam.lr=gam(I(wage>250)~year+s(age,df=5)+education,family=binomial,data=Wage)
230 | par(mfrow=c(1,3))
231 | plot(gam.lr,se=T,col="green")
232 | ```
233 | 
234 | 
235 | ```{r}
236 | table(education,I(wage>250))
237 | ```
238 | 
239 | ```{r fig.height=5,fig.width=11}
240 | par(mfrow=c(1,3))
241 | gam.lr.s=gam(I(wage>250)~year+s(age,df=5)+education,family=binomial,data=Wage,subset=(education!="1. 
< HS Grad")) 242 | plot(gam.lr.s,se=T,col="green") 243 | ``` 244 | 245 | 246 | -------------------------------------------------------------------------------- /R_Labs/Lab7/Lab7.Rmd: -------------------------------------------------------------------------------- 1 | ISLR Lab 7: Decision Trees 2 | ======================================================== 3 | 4 | ## Classification tree fitting 5 | ```{r} 6 | library(tree) 7 | library(ISLR) 8 | attach(Carseats) 9 | High=ifelse(Sales<=8,"No","Yes") 10 | Carseats=data.frame(Carseats,High) 11 | tree.carseats=tree(High~.-Sales,Carseats) 12 | summary(tree.carseats) 13 | ``` 14 | 15 | Plot of the carseats tree model: 16 | ```{r fig.width=11, fig.height=11} 17 | plot(tree.carseats) 18 | text(tree.carseats,pretty=0) 19 | ``` 20 | 21 | ```{r} 22 | tree.carseats 23 | set.seed(2) 24 | train=sample(1:nrow(Carseats),200) 25 | Carseats.test=Carseats[-train,] 26 | High.test=High[-train] 27 | tree.carseats=tree(High~.-Sales,Carseats,subset=train) 28 | tree.pred=predict(tree.carseats,Carseats.test,type="class") 29 | table(tree.pred,High.test) 30 | (86+57)/200 31 | 32 | set.seed(3) 33 | cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass) 34 | names(cv.carseats) 35 | cv.carseats 36 | ``` 37 | 38 | ```{r fig.height=7, fig.width=11} 39 | par(mfrow=c(1,2)) 40 | plot(cv.carseats$size,cv.carseats$dev,type="b") 41 | plot(cv.carseats$k,cv.carseats$dev, type="b") 42 | ``` 43 | 44 | ```{r fig.height=11, fig.width=11} 45 | prune.carseats=prune.misclass(tree.carseats,best=9) 46 | plot(prune.carseats) 47 | text(prune.carseats,pretty=0) 48 | 49 | tree.pred=predict(prune.carseats,Carseats.test,type="class") 50 | table(tree.pred,High.test) 51 | (94+60)/200 52 | ``` 53 | 54 | 55 | 56 | ## Regression tree fitting. 57 | ```{r} 58 | library(MASS) 59 | set.seed(1) 60 | train=sample(1:nrow(Boston), nrow(Boston)/2) 61 | tree.boston=tree(medv~.,Boston,subset=train) 62 | summary(tree.boston) 63 | ``` 64 | 65 | ```{r fig.width=11, fig.height=11} 66 | plot(tree.boston) 67 | text(tree.boston,pretty=0) 68 | 69 | cv.boston=cv.tree(tree.boston) 70 | plot(cv.boston$size, cv.boston$dev, type='b') 71 | 72 | prune.boston=prune.tree(tree.boston,best=5) 73 | plot(prune.boston) 74 | text(prune.boston,pretty=0) 75 | 76 | yhat=predict(tree.boston,newdata=Boston[-train,]) 77 | boston.test=Boston[-train,'medv'] 78 | plot(yhat,boston.test) 79 | abline(0,1) 80 | mean((yhat-boston.test)^2) 81 | ``` 82 | 83 | 84 | ## Random forest 85 | note: bagging is a special case of random forest where m=p 86 | ```{r} 87 | library(randomForest) 88 | 89 | ## do bagging 90 | set.seed(1) 91 | bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE) 92 | bag.boston 93 | yhat.bag=predict(bag.boston,newdata=Boston[-train,]) 94 | plot(yhat.bag, boston.test) 95 | abline(0,1) 96 | mean((yhat.bag-boston.test)^2) 97 | 98 | #change number of trees used 99 | bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25,importance=TRUE) 100 | yhat.bag=predict(bag.boston,newdata=Boston[-train,]) 101 | mean((yhat.bag-boston.test)^2) 102 | 103 | 104 | # do actual random forest 105 | set.seed(1) 106 | rf.boston=randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE) 107 | yhat.rf=predict(rf.boston,newdata=Boston[-train,]) 108 | mean((yhat.rf-boston.test)^2) 109 | ##see importance of variables 110 | importance(rf.boston) 111 | ``` 112 | 113 | 114 | ```{r fig.width=11, fig.height=11} 115 | varImpPlot(rf.boston) 116 | ``` 117 | 118 | 119 | ## Boosting 120 | ```{r fig.height=8,fig.width=8} 
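# (Added comments, not in the original lab.) gbm() below fits a boosted
# ensemble of regression trees:
#   distribution="gaussian"  -> squared-error loss, i.e. a regression problem
#   n.trees=5000             -> number of trees grown sequentially
#   interaction.depth=4      -> maximum depth of each individual tree
# shrinkage (the learning rate) is left at its default here; a larger
# value of 0.2 is tried in a later chunk.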
121 | library(gbm)
122 | set.seed(1)
123 | boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
124 | summary(boost.boston)
125 | ```
126 | 
127 | 
128 | ```{r fig.height=11, fig.width=11}
129 | par(mfrow=c(1,2))
130 | plot(boost.boston,i="rm")
131 | plot(boost.boston,i="lstat")
132 | ```
133 | 
134 | 
135 | ```{r}
136 | yhat.boost=predict(boost.boston,newdata=Boston[-train,], n.trees=5000)
137 | mean((yhat.boost-boston.test)^2)
138 | boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
139 | yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
140 | mean((yhat.boost-boston.test)^2)
141 | ```
142 | 
--------------------------------------------------------------------------------
/data/Advertising.csv:
--------------------------------------------------------------------------------
1 | "","TV","Radio","Newspaper","Sales"
2 | "1",230.1,37.8,69.2,22.1
3 | "2",44.5,39.3,45.1,10.4
4 | "3",17.2,45.9,69.3,9.3
5 | "4",151.5,41.3,58.5,18.5
6 | "5",180.8,10.8,58.4,12.9
7 | "6",8.7,48.9,75,7.2
8 | "7",57.5,32.8,23.5,11.8
9 | "8",120.2,19.6,11.6,13.2
10 | "9",8.6,2.1,1,4.8
11 | "10",199.8,2.6,21.2,10.6
12 | "11",66.1,5.8,24.2,8.6
13 | "12",214.7,24,4,17.4
14 | "13",23.8,35.1,65.9,9.2
15 | "14",97.5,7.6,7.2,9.7
16 | "15",204.1,32.9,46,19
17 | "16",195.4,47.7,52.9,22.4
18 | "17",67.8,36.6,114,12.5
19 | "18",281.4,39.6,55.8,24.4
20 | "19",69.2,20.5,18.3,11.3
21 | "20",147.3,23.9,19.1,14.6
22 | "21",218.4,27.7,53.4,18
23 | "22",237.4,5.1,23.5,12.5
24 | "23",13.2,15.9,49.6,5.6
25 | "24",228.3,16.9,26.2,15.5
26 | "25",62.3,12.6,18.3,9.7
27 | "26",262.9,3.5,19.5,12
28 | "27",142.9,29.3,12.6,15
29 | "28",240.1,16.7,22.9,15.9
30 | "29",248.8,27.1,22.9,18.9
31 | "30",70.6,16,40.8,10.5
32 | "31",292.9,28.3,43.2,21.4
33 | "32",112.9,17.4,38.6,11.9
34 | "33",97.2,1.5,30,9.6
35 | "34",265.6,20,0.3,17.4
36 | "35",95.7,1.4,7.4,9.5
37 | "36",290.7,4.1,8.5,12.8
38 | "37",266.9,43.8,5,25.4
39 | "38",74.7,49.4,45.7,14.7
40 | "39",43.1,26.7,35.1,10.1
41 | "40",228,37.7,32,21.5
42 | "41",202.5,22.3,31.6,16.6
43 | "42",177,33.4,38.7,17.1
44 | "43",293.6,27.7,1.8,20.7
45 | "44",206.9,8.4,26.4,12.9
46 | "45",25.1,25.7,43.3,8.5
47 | "46",175.1,22.5,31.5,14.9
48 | "47",89.7,9.9,35.7,10.6
49 | "48",239.9,41.5,18.5,23.2
50 | "49",227.2,15.8,49.9,14.8
51 | "50",66.9,11.7,36.8,9.7
52 | "51",199.8,3.1,34.6,11.4
53 | "52",100.4,9.6,3.6,10.7
54 | "53",216.4,41.7,39.6,22.6
55 | "54",182.6,46.2,58.7,21.2
56 | "55",262.7,28.8,15.9,20.2
57 | "56",198.9,49.4,60,23.7
58 | "57",7.3,28.1,41.4,5.5
59 | "58",136.2,19.2,16.6,13.2
60 | "59",210.8,49.6,37.7,23.8
61 | "60",210.7,29.5,9.3,18.4
62 | "61",53.5,2,21.4,8.1
63 | "62",261.3,42.7,54.7,24.2
64 | "63",239.3,15.5,27.3,15.7
65 | "64",102.7,29.6,8.4,14
66 | "65",131.1,42.8,28.9,18
67 | "66",69,9.3,0.9,9.3
68 | "67",31.5,24.6,2.2,9.5
69 | "68",139.3,14.5,10.2,13.4
70 | "69",237.4,27.5,11,18.9
71 | "70",216.8,43.9,27.2,22.3
72 | "71",199.1,30.6,38.7,18.3
73 | "72",109.8,14.3,31.7,12.4
74 | "73",26.8,33,19.3,8.8
75 | "74",129.4,5.7,31.3,11
76 | "75",213.4,24.6,13.1,17
77 | "76",16.9,43.7,89.4,8.7
78 | "77",27.5,1.6,20.7,6.9
79 | "78",120.5,28.5,14.2,14.2
80 | "79",5.4,29.9,9.4,5.3
81 | "80",116,7.7,23.1,11
82 | "81",76.4,26.7,22.3,11.8
83 | "82",239.8,4.1,36.9,12.3
84 | "83",75.3,20.3,32.5,11.3
85 | "84",68.4,44.5,35.6,13.6
86 | "85",213.5,43,33.8,21.7
87 | "86",193.2,18.4,65.7,15.2
88 | "87",76.3,27.5,16,12
89 | "88",110.7,40.6,63.2,16
90 | "89",88.3,25.5,73.4,12.9
91 | "90",109.8,47.8,51.4,16.7 92 | "91",134.3,4.9,9.3,11.2 93 | "92",28.6,1.5,33,7.3 94 | "93",217.7,33.5,59,19.4 95 | "94",250.9,36.5,72.3,22.2 96 | "95",107.4,14,10.9,11.5 97 | "96",163.3,31.6,52.9,16.9 98 | "97",197.6,3.5,5.9,11.7 99 | "98",184.9,21,22,15.5 100 | "99",289.7,42.3,51.2,25.4 101 | "100",135.2,41.7,45.9,17.2 102 | "101",222.4,4.3,49.8,11.7 103 | "102",296.4,36.3,100.9,23.8 104 | "103",280.2,10.1,21.4,14.8 105 | "104",187.9,17.2,17.9,14.7 106 | "105",238.2,34.3,5.3,20.7 107 | "106",137.9,46.4,59,19.2 108 | "107",25,11,29.7,7.2 109 | "108",90.4,0.3,23.2,8.7 110 | "109",13.1,0.4,25.6,5.3 111 | "110",255.4,26.9,5.5,19.8 112 | "111",225.8,8.2,56.5,13.4 113 | "112",241.7,38,23.2,21.8 114 | "113",175.7,15.4,2.4,14.1 115 | "114",209.6,20.6,10.7,15.9 116 | "115",78.2,46.8,34.5,14.6 117 | "116",75.1,35,52.7,12.6 118 | "117",139.2,14.3,25.6,12.2 119 | "118",76.4,0.8,14.8,9.4 120 | "119",125.7,36.9,79.2,15.9 121 | "120",19.4,16,22.3,6.6 122 | "121",141.3,26.8,46.2,15.5 123 | "122",18.8,21.7,50.4,7 124 | "123",224,2.4,15.6,11.6 125 | "124",123.1,34.6,12.4,15.2 126 | "125",229.5,32.3,74.2,19.7 127 | "126",87.2,11.8,25.9,10.6 128 | "127",7.8,38.9,50.6,6.6 129 | "128",80.2,0,9.2,8.8 130 | "129",220.3,49,3.2,24.7 131 | "130",59.6,12,43.1,9.7 132 | "131",0.7,39.6,8.7,1.6 133 | "132",265.2,2.9,43,12.7 134 | "133",8.4,27.2,2.1,5.7 135 | "134",219.8,33.5,45.1,19.6 136 | "135",36.9,38.6,65.6,10.8 137 | "136",48.3,47,8.5,11.6 138 | "137",25.6,39,9.3,9.5 139 | "138",273.7,28.9,59.7,20.8 140 | "139",43,25.9,20.5,9.6 141 | "140",184.9,43.9,1.7,20.7 142 | "141",73.4,17,12.9,10.9 143 | "142",193.7,35.4,75.6,19.2 144 | "143",220.5,33.2,37.9,20.1 145 | "144",104.6,5.7,34.4,10.4 146 | "145",96.2,14.8,38.9,11.4 147 | "146",140.3,1.9,9,10.3 148 | "147",240.1,7.3,8.7,13.2 149 | "148",243.2,49,44.3,25.4 150 | "149",38,40.3,11.9,10.9 151 | "150",44.7,25.8,20.6,10.1 152 | "151",280.7,13.9,37,16.1 153 | "152",121,8.4,48.7,11.6 154 | "153",197.6,23.3,14.2,16.6 155 | "154",171.3,39.7,37.7,19 156 | "155",187.8,21.1,9.5,15.6 157 | "156",4.1,11.6,5.7,3.2 158 | "157",93.9,43.5,50.5,15.3 159 | "158",149.8,1.3,24.3,10.1 160 | "159",11.7,36.9,45.2,7.3 161 | "160",131.7,18.4,34.6,12.9 162 | "161",172.5,18.1,30.7,14.4 163 | "162",85.7,35.8,49.3,13.3 164 | "163",188.4,18.1,25.6,14.9 165 | "164",163.5,36.8,7.4,18 166 | "165",117.2,14.7,5.4,11.9 167 | "166",234.5,3.4,84.8,11.9 168 | "167",17.9,37.6,21.6,8 169 | "168",206.8,5.2,19.4,12.2 170 | "169",215.4,23.6,57.6,17.1 171 | "170",284.3,10.6,6.4,15 172 | "171",50,11.6,18.4,8.4 173 | "172",164.5,20.9,47.4,14.5 174 | "173",19.6,20.1,17,7.6 175 | "174",168.4,7.1,12.8,11.7 176 | "175",222.4,3.4,13.1,11.5 177 | "176",276.9,48.9,41.8,27 178 | "177",248.4,30.2,20.3,20.2 179 | "178",170.2,7.8,35.2,11.7 180 | "179",276.7,2.3,23.7,11.8 181 | "180",165.6,10,17.6,12.6 182 | "181",156.6,2.6,8.3,10.5 183 | "182",218.5,5.4,27.4,12.2 184 | "183",56.2,5.7,29.7,8.7 185 | "184",287.6,43,71.8,26.2 186 | "185",253.8,21.3,30,17.6 187 | "186",205,45.1,19.6,22.6 188 | "187",139.5,2.1,26.6,10.3 189 | "188",191.1,28.7,18.2,17.3 190 | "189",286,13.9,3.7,15.9 191 | "190",18.7,12.1,23.4,6.7 192 | "191",39.5,41.1,5.8,10.8 193 | "192",75.5,10.8,6,9.9 194 | "193",17.2,4.1,31.6,5.9 195 | "194",166.8,42,3.6,19.6 196 | "195",149.7,35.6,6,17.3 197 | "196",38.2,3.7,13.8,7.6 198 | "197",94.2,4.9,8.1,9.7 199 | "198",177,9.3,6.4,12.8 200 | "199",283.6,42,66.2,25.5 201 | "200",232.1,8.6,8.7,13.4 202 | -------------------------------------------------------------------------------- /data/Auto.csv: 
-------------------------------------------------------------------------------- 1 | mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name 2 | 18,8,307,130,3504,12,70,1,chevrolet chevelle malibu 3 | 15,8,350,165,3693,11.5,70,1,buick skylark 320 4 | 18,8,318,150,3436,11,70,1,plymouth satellite 5 | 16,8,304,150,3433,12,70,1,amc rebel sst 6 | 17,8,302,140,3449,10.5,70,1,ford torino 7 | 15,8,429,198,4341,10,70,1,ford galaxie 500 8 | 14,8,454,220,4354,9,70,1,chevrolet impala 9 | 14,8,440,215,4312,8.5,70,1,plymouth fury iii 10 | 14,8,455,225,4425,10,70,1,pontiac catalina 11 | 15,8,390,190,3850,8.5,70,1,amc ambassador dpl 12 | 15,8,383,170,3563,10,70,1,dodge challenger se 13 | 14,8,340,160,3609,8,70,1,plymouth 'cuda 340 14 | 15,8,400,150,3761,9.5,70,1,chevrolet monte carlo 15 | 14,8,455,225,3086,10,70,1,buick estate wagon (sw) 16 | 24,4,113,95,2372,15,70,3,toyota corona mark ii 17 | 22,6,198,95,2833,15.5,70,1,plymouth duster 18 | 18,6,199,97,2774,15.5,70,1,amc hornet 19 | 21,6,200,85,2587,16,70,1,ford maverick 20 | 27,4,97,88,2130,14.5,70,3,datsun pl510 21 | 26,4,97,46,1835,20.5,70,2,volkswagen 1131 deluxe sedan 22 | 25,4,110,87,2672,17.5,70,2,peugeot 504 23 | 24,4,107,90,2430,14.5,70,2,audi 100 ls 24 | 25,4,104,95,2375,17.5,70,2,saab 99e 25 | 26,4,121,113,2234,12.5,70,2,bmw 2002 26 | 21,6,199,90,2648,15,70,1,amc gremlin 27 | 10,8,360,215,4615,14,70,1,ford f250 28 | 10,8,307,200,4376,15,70,1,chevy c20 29 | 11,8,318,210,4382,13.5,70,1,dodge d200 30 | 9,8,304,193,4732,18.5,70,1,hi 1200d 31 | 27,4,97,88,2130,14.5,71,3,datsun pl510 32 | 28,4,140,90,2264,15.5,71,1,chevrolet vega 2300 33 | 25,4,113,95,2228,14,71,3,toyota corona 34 | 25,4,98,?,2046,19,71,1,ford pinto 35 | 19,6,232,100,2634,13,71,1,amc gremlin 36 | 16,6,225,105,3439,15.5,71,1,plymouth satellite custom 37 | 17,6,250,100,3329,15.5,71,1,chevrolet chevelle malibu 38 | 19,6,250,88,3302,15.5,71,1,ford torino 500 39 | 18,6,232,100,3288,15.5,71,1,amc matador 40 | 14,8,350,165,4209,12,71,1,chevrolet impala 41 | 14,8,400,175,4464,11.5,71,1,pontiac catalina brougham 42 | 14,8,351,153,4154,13.5,71,1,ford galaxie 500 43 | 14,8,318,150,4096,13,71,1,plymouth fury iii 44 | 12,8,383,180,4955,11.5,71,1,dodge monaco (sw) 45 | 13,8,400,170,4746,12,71,1,ford country squire (sw) 46 | 13,8,400,175,5140,12,71,1,pontiac safari (sw) 47 | 18,6,258,110,2962,13.5,71,1,amc hornet sportabout (sw) 48 | 22,4,140,72,2408,19,71,1,chevrolet vega (sw) 49 | 19,6,250,100,3282,15,71,1,pontiac firebird 50 | 18,6,250,88,3139,14.5,71,1,ford mustang 51 | 23,4,122,86,2220,14,71,1,mercury capri 2000 52 | 28,4,116,90,2123,14,71,2,opel 1900 53 | 30,4,79,70,2074,19.5,71,2,peugeot 304 54 | 30,4,88,76,2065,14.5,71,2,fiat 124b 55 | 31,4,71,65,1773,19,71,3,toyota corolla 1200 56 | 35,4,72,69,1613,18,71,3,datsun 1200 57 | 27,4,97,60,1834,19,71,2,volkswagen model 111 58 | 26,4,91,70,1955,20.5,71,1,plymouth cricket 59 | 24,4,113,95,2278,15.5,72,3,toyota corona hardtop 60 | 25,4,97.5,80,2126,17,72,1,dodge colt hardtop 61 | 23,4,97,54,2254,23.5,72,2,volkswagen type 3 62 | 20,4,140,90,2408,19.5,72,1,chevrolet vega 63 | 21,4,122,86,2226,16.5,72,1,ford pinto runabout 64 | 13,8,350,165,4274,12,72,1,chevrolet impala 65 | 14,8,400,175,4385,12,72,1,pontiac catalina 66 | 15,8,318,150,4135,13.5,72,1,plymouth fury iii 67 | 14,8,351,153,4129,13,72,1,ford galaxie 500 68 | 17,8,304,150,3672,11.5,72,1,amc ambassador sst 69 | 11,8,429,208,4633,11,72,1,mercury marquis 70 | 13,8,350,155,4502,13.5,72,1,buick lesabre custom 71 | 12,8,350,160,4456,13.5,72,1,oldsmobile delta 88 royale 72 | 
13,8,400,190,4422,12.5,72,1,chrysler newport royal 73 | 19,3,70,97,2330,13.5,72,3,mazda rx2 coupe 74 | 15,8,304,150,3892,12.5,72,1,amc matador (sw) 75 | 13,8,307,130,4098,14,72,1,chevrolet chevelle concours (sw) 76 | 13,8,302,140,4294,16,72,1,ford gran torino (sw) 77 | 14,8,318,150,4077,14,72,1,plymouth satellite custom (sw) 78 | 18,4,121,112,2933,14.5,72,2,volvo 145e (sw) 79 | 22,4,121,76,2511,18,72,2,volkswagen 411 (sw) 80 | 21,4,120,87,2979,19.5,72,2,peugeot 504 (sw) 81 | 26,4,96,69,2189,18,72,2,renault 12 (sw) 82 | 22,4,122,86,2395,16,72,1,ford pinto (sw) 83 | 28,4,97,92,2288,17,72,3,datsun 510 (sw) 84 | 23,4,120,97,2506,14.5,72,3,toyouta corona mark ii (sw) 85 | 28,4,98,80,2164,15,72,1,dodge colt (sw) 86 | 27,4,97,88,2100,16.5,72,3,toyota corolla 1600 (sw) 87 | 13,8,350,175,4100,13,73,1,buick century 350 88 | 14,8,304,150,3672,11.5,73,1,amc matador 89 | 13,8,350,145,3988,13,73,1,chevrolet malibu 90 | 14,8,302,137,4042,14.5,73,1,ford gran torino 91 | 15,8,318,150,3777,12.5,73,1,dodge coronet custom 92 | 12,8,429,198,4952,11.5,73,1,mercury marquis brougham 93 | 13,8,400,150,4464,12,73,1,chevrolet caprice classic 94 | 13,8,351,158,4363,13,73,1,ford ltd 95 | 14,8,318,150,4237,14.5,73,1,plymouth fury gran sedan 96 | 13,8,440,215,4735,11,73,1,chrysler new yorker brougham 97 | 12,8,455,225,4951,11,73,1,buick electra 225 custom 98 | 13,8,360,175,3821,11,73,1,amc ambassador brougham 99 | 18,6,225,105,3121,16.5,73,1,plymouth valiant 100 | 16,6,250,100,3278,18,73,1,chevrolet nova custom 101 | 18,6,232,100,2945,16,73,1,amc hornet 102 | 18,6,250,88,3021,16.5,73,1,ford maverick 103 | 23,6,198,95,2904,16,73,1,plymouth duster 104 | 26,4,97,46,1950,21,73,2,volkswagen super beetle 105 | 11,8,400,150,4997,14,73,1,chevrolet impala 106 | 12,8,400,167,4906,12.5,73,1,ford country 107 | 13,8,360,170,4654,13,73,1,plymouth custom suburb 108 | 12,8,350,180,4499,12.5,73,1,oldsmobile vista cruiser 109 | 18,6,232,100,2789,15,73,1,amc gremlin 110 | 20,4,97,88,2279,19,73,3,toyota carina 111 | 21,4,140,72,2401,19.5,73,1,chevrolet vega 112 | 22,4,108,94,2379,16.5,73,3,datsun 610 113 | 18,3,70,90,2124,13.5,73,3,maxda rx3 114 | 19,4,122,85,2310,18.5,73,1,ford pinto 115 | 21,6,155,107,2472,14,73,1,mercury capri v6 116 | 26,4,98,90,2265,15.5,73,2,fiat 124 sport coupe 117 | 15,8,350,145,4082,13,73,1,chevrolet monte carlo s 118 | 16,8,400,230,4278,9.5,73,1,pontiac grand prix 119 | 29,4,68,49,1867,19.5,73,2,fiat 128 120 | 24,4,116,75,2158,15.5,73,2,opel manta 121 | 20,4,114,91,2582,14,73,2,audi 100ls 122 | 19,4,121,112,2868,15.5,73,2,volvo 144ea 123 | 15,8,318,150,3399,11,73,1,dodge dart custom 124 | 24,4,121,110,2660,14,73,2,saab 99le 125 | 20,6,156,122,2807,13.5,73,3,toyota mark ii 126 | 11,8,350,180,3664,11,73,1,oldsmobile omega 127 | 20,6,198,95,3102,16.5,74,1,plymouth duster 128 | 21,6,200,?,2875,17,74,1,ford maverick 129 | 19,6,232,100,2901,16,74,1,amc hornet 130 | 15,6,250,100,3336,17,74,1,chevrolet nova 131 | 31,4,79,67,1950,19,74,3,datsun b210 132 | 26,4,122,80,2451,16.5,74,1,ford pinto 133 | 32,4,71,65,1836,21,74,3,toyota corolla 1200 134 | 25,4,140,75,2542,17,74,1,chevrolet vega 135 | 16,6,250,100,3781,17,74,1,chevrolet chevelle malibu classic 136 | 16,6,258,110,3632,18,74,1,amc matador 137 | 18,6,225,105,3613,16.5,74,1,plymouth satellite sebring 138 | 16,8,302,140,4141,14,74,1,ford gran torino 139 | 13,8,350,150,4699,14.5,74,1,buick century luxus (sw) 140 | 14,8,318,150,4457,13.5,74,1,dodge coronet custom (sw) 141 | 14,8,302,140,4638,16,74,1,ford gran torino (sw) 142 | 14,8,304,150,4257,15.5,74,1,amc matador (sw) 
143 | 29,4,98,83,2219,16.5,74,2,audi fox 144 | 26,4,79,67,1963,15.5,74,2,volkswagen dasher 145 | 26,4,97,78,2300,14.5,74,2,opel manta 146 | 31,4,76,52,1649,16.5,74,3,toyota corona 147 | 32,4,83,61,2003,19,74,3,datsun 710 148 | 28,4,90,75,2125,14.5,74,1,dodge colt 149 | 24,4,90,75,2108,15.5,74,2,fiat 128 150 | 26,4,116,75,2246,14,74,2,fiat 124 tc 151 | 24,4,120,97,2489,15,74,3,honda civic 152 | 26,4,108,93,2391,15.5,74,3,subaru 153 | 31,4,79,67,2000,16,74,2,fiat x1.9 154 | 19,6,225,95,3264,16,75,1,plymouth valiant custom 155 | 18,6,250,105,3459,16,75,1,chevrolet nova 156 | 15,6,250,72,3432,21,75,1,mercury monarch 157 | 15,6,250,72,3158,19.5,75,1,ford maverick 158 | 16,8,400,170,4668,11.5,75,1,pontiac catalina 159 | 15,8,350,145,4440,14,75,1,chevrolet bel air 160 | 16,8,318,150,4498,14.5,75,1,plymouth grand fury 161 | 14,8,351,148,4657,13.5,75,1,ford ltd 162 | 17,6,231,110,3907,21,75,1,buick century 163 | 16,6,250,105,3897,18.5,75,1,chevroelt chevelle malibu 164 | 15,6,258,110,3730,19,75,1,amc matador 165 | 18,6,225,95,3785,19,75,1,plymouth fury 166 | 21,6,231,110,3039,15,75,1,buick skyhawk 167 | 20,8,262,110,3221,13.5,75,1,chevrolet monza 2+2 168 | 13,8,302,129,3169,12,75,1,ford mustang ii 169 | 29,4,97,75,2171,16,75,3,toyota corolla 170 | 23,4,140,83,2639,17,75,1,ford pinto 171 | 20,6,232,100,2914,16,75,1,amc gremlin 172 | 23,4,140,78,2592,18.5,75,1,pontiac astro 173 | 24,4,134,96,2702,13.5,75,3,toyota corona 174 | 25,4,90,71,2223,16.5,75,2,volkswagen dasher 175 | 24,4,119,97,2545,17,75,3,datsun 710 176 | 18,6,171,97,2984,14.5,75,1,ford pinto 177 | 29,4,90,70,1937,14,75,2,volkswagen rabbit 178 | 19,6,232,90,3211,17,75,1,amc pacer 179 | 23,4,115,95,2694,15,75,2,audi 100ls 180 | 23,4,120,88,2957,17,75,2,peugeot 504 181 | 22,4,121,98,2945,14.5,75,2,volvo 244dl 182 | 25,4,121,115,2671,13.5,75,2,saab 99le 183 | 33,4,91,53,1795,17.5,75,3,honda civic cvcc 184 | 28,4,107,86,2464,15.5,76,2,fiat 131 185 | 25,4,116,81,2220,16.9,76,2,opel 1900 186 | 25,4,140,92,2572,14.9,76,1,capri ii 187 | 26,4,98,79,2255,17.7,76,1,dodge colt 188 | 27,4,101,83,2202,15.3,76,2,renault 12tl 189 | 17.5,8,305,140,4215,13,76,1,chevrolet chevelle malibu classic 190 | 16,8,318,150,4190,13,76,1,dodge coronet brougham 191 | 15.5,8,304,120,3962,13.9,76,1,amc matador 192 | 14.5,8,351,152,4215,12.8,76,1,ford gran torino 193 | 22,6,225,100,3233,15.4,76,1,plymouth valiant 194 | 22,6,250,105,3353,14.5,76,1,chevrolet nova 195 | 24,6,200,81,3012,17.6,76,1,ford maverick 196 | 22.5,6,232,90,3085,17.6,76,1,amc hornet 197 | 29,4,85,52,2035,22.2,76,1,chevrolet chevette 198 | 24.5,4,98,60,2164,22.1,76,1,chevrolet woody 199 | 29,4,90,70,1937,14.2,76,2,vw rabbit 200 | 33,4,91,53,1795,17.4,76,3,honda civic 201 | 20,6,225,100,3651,17.7,76,1,dodge aspen se 202 | 18,6,250,78,3574,21,76,1,ford granada ghia 203 | 18.5,6,250,110,3645,16.2,76,1,pontiac ventura sj 204 | 17.5,6,258,95,3193,17.8,76,1,amc pacer d/l 205 | 29.5,4,97,71,1825,12.2,76,2,volkswagen rabbit 206 | 32,4,85,70,1990,17,76,3,datsun b-210 207 | 28,4,97,75,2155,16.4,76,3,toyota corolla 208 | 26.5,4,140,72,2565,13.6,76,1,ford pinto 209 | 20,4,130,102,3150,15.7,76,2,volvo 245 210 | 13,8,318,150,3940,13.2,76,1,plymouth volare premier v8 211 | 19,4,120,88,3270,21.9,76,2,peugeot 504 212 | 19,6,156,108,2930,15.5,76,3,toyota mark ii 213 | 16.5,6,168,120,3820,16.7,76,2,mercedes-benz 280s 214 | 16.5,8,350,180,4380,12.1,76,1,cadillac seville 215 | 13,8,350,145,4055,12,76,1,chevy c10 216 | 13,8,302,130,3870,15,76,1,ford f108 217 | 13,8,318,150,3755,14,76,1,dodge d100 218 | 
31.5,4,98,68,2045,18.5,77,3,honda accord cvcc 219 | 30,4,111,80,2155,14.8,77,1,buick opel isuzu deluxe 220 | 36,4,79,58,1825,18.6,77,2,renault 5 gtl 221 | 25.5,4,122,96,2300,15.5,77,1,plymouth arrow gs 222 | 33.5,4,85,70,1945,16.8,77,3,datsun f-10 hatchback 223 | 17.5,8,305,145,3880,12.5,77,1,chevrolet caprice classic 224 | 17,8,260,110,4060,19,77,1,oldsmobile cutlass supreme 225 | 15.5,8,318,145,4140,13.7,77,1,dodge monaco brougham 226 | 15,8,302,130,4295,14.9,77,1,mercury cougar brougham 227 | 17.5,6,250,110,3520,16.4,77,1,chevrolet concours 228 | 20.5,6,231,105,3425,16.9,77,1,buick skylark 229 | 19,6,225,100,3630,17.7,77,1,plymouth volare custom 230 | 18.5,6,250,98,3525,19,77,1,ford granada 231 | 16,8,400,180,4220,11.1,77,1,pontiac grand prix lj 232 | 15.5,8,350,170,4165,11.4,77,1,chevrolet monte carlo landau 233 | 15.5,8,400,190,4325,12.2,77,1,chrysler cordoba 234 | 16,8,351,149,4335,14.5,77,1,ford thunderbird 235 | 29,4,97,78,1940,14.5,77,2,volkswagen rabbit custom 236 | 24.5,4,151,88,2740,16,77,1,pontiac sunbird coupe 237 | 26,4,97,75,2265,18.2,77,3,toyota corolla liftback 238 | 25.5,4,140,89,2755,15.8,77,1,ford mustang ii 2+2 239 | 30.5,4,98,63,2051,17,77,1,chevrolet chevette 240 | 33.5,4,98,83,2075,15.9,77,1,dodge colt m/m 241 | 30,4,97,67,1985,16.4,77,3,subaru dl 242 | 30.5,4,97,78,2190,14.1,77,2,volkswagen dasher 243 | 22,6,146,97,2815,14.5,77,3,datsun 810 244 | 21.5,4,121,110,2600,12.8,77,2,bmw 320i 245 | 21.5,3,80,110,2720,13.5,77,3,mazda rx-4 246 | 43.1,4,90,48,1985,21.5,78,2,volkswagen rabbit custom diesel 247 | 36.1,4,98,66,1800,14.4,78,1,ford fiesta 248 | 32.8,4,78,52,1985,19.4,78,3,mazda glc deluxe 249 | 39.4,4,85,70,2070,18.6,78,3,datsun b210 gx 250 | 36.1,4,91,60,1800,16.4,78,3,honda civic cvcc 251 | 19.9,8,260,110,3365,15.5,78,1,oldsmobile cutlass salon brougham 252 | 19.4,8,318,140,3735,13.2,78,1,dodge diplomat 253 | 20.2,8,302,139,3570,12.8,78,1,mercury monarch ghia 254 | 19.2,6,231,105,3535,19.2,78,1,pontiac phoenix lj 255 | 20.5,6,200,95,3155,18.2,78,1,chevrolet malibu 256 | 20.2,6,200,85,2965,15.8,78,1,ford fairmont (auto) 257 | 25.1,4,140,88,2720,15.4,78,1,ford fairmont (man) 258 | 20.5,6,225,100,3430,17.2,78,1,plymouth volare 259 | 19.4,6,232,90,3210,17.2,78,1,amc concord 260 | 20.6,6,231,105,3380,15.8,78,1,buick century special 261 | 20.8,6,200,85,3070,16.7,78,1,mercury zephyr 262 | 18.6,6,225,110,3620,18.7,78,1,dodge aspen 263 | 18.1,6,258,120,3410,15.1,78,1,amc concord d/l 264 | 19.2,8,305,145,3425,13.2,78,1,chevrolet monte carlo landau 265 | 17.7,6,231,165,3445,13.4,78,1,buick regal sport coupe (turbo) 266 | 18.1,8,302,139,3205,11.2,78,1,ford futura 267 | 17.5,8,318,140,4080,13.7,78,1,dodge magnum xe 268 | 30,4,98,68,2155,16.5,78,1,chevrolet chevette 269 | 27.5,4,134,95,2560,14.2,78,3,toyota corona 270 | 27.2,4,119,97,2300,14.7,78,3,datsun 510 271 | 30.9,4,105,75,2230,14.5,78,1,dodge omni 272 | 21.1,4,134,95,2515,14.8,78,3,toyota celica gt liftback 273 | 23.2,4,156,105,2745,16.7,78,1,plymouth sapporo 274 | 23.8,4,151,85,2855,17.6,78,1,oldsmobile starfire sx 275 | 23.9,4,119,97,2405,14.9,78,3,datsun 200-sx 276 | 20.3,5,131,103,2830,15.9,78,2,audi 5000 277 | 17,6,163,125,3140,13.6,78,2,volvo 264gl 278 | 21.6,4,121,115,2795,15.7,78,2,saab 99gle 279 | 16.2,6,163,133,3410,15.8,78,2,peugeot 604sl 280 | 31.5,4,89,71,1990,14.9,78,2,volkswagen scirocco 281 | 29.5,4,98,68,2135,16.6,78,3,honda accord lx 282 | 21.5,6,231,115,3245,15.4,79,1,pontiac lemans v6 283 | 19.8,6,200,85,2990,18.2,79,1,mercury zephyr 6 284 | 22.3,4,140,88,2890,17.3,79,1,ford fairmont 4 285 | 
20.2,6,232,90,3265,18.2,79,1,amc concord dl 6 286 | 20.6,6,225,110,3360,16.6,79,1,dodge aspen 6 287 | 17,8,305,130,3840,15.4,79,1,chevrolet caprice classic 288 | 17.6,8,302,129,3725,13.4,79,1,ford ltd landau 289 | 16.5,8,351,138,3955,13.2,79,1,mercury grand marquis 290 | 18.2,8,318,135,3830,15.2,79,1,dodge st. regis 291 | 16.9,8,350,155,4360,14.9,79,1,buick estate wagon (sw) 292 | 15.5,8,351,142,4054,14.3,79,1,ford country squire (sw) 293 | 19.2,8,267,125,3605,15,79,1,chevrolet malibu classic (sw) 294 | 18.5,8,360,150,3940,13,79,1,chrysler lebaron town @ country (sw) 295 | 31.9,4,89,71,1925,14,79,2,vw rabbit custom 296 | 34.1,4,86,65,1975,15.2,79,3,maxda glc deluxe 297 | 35.7,4,98,80,1915,14.4,79,1,dodge colt hatchback custom 298 | 27.4,4,121,80,2670,15,79,1,amc spirit dl 299 | 25.4,5,183,77,3530,20.1,79,2,mercedes benz 300d 300 | 23,8,350,125,3900,17.4,79,1,cadillac eldorado 301 | 27.2,4,141,71,3190,24.8,79,2,peugeot 504 302 | 23.9,8,260,90,3420,22.2,79,1,oldsmobile cutlass salon brougham 303 | 34.2,4,105,70,2200,13.2,79,1,plymouth horizon 304 | 34.5,4,105,70,2150,14.9,79,1,plymouth horizon tc3 305 | 31.8,4,85,65,2020,19.2,79,3,datsun 210 306 | 37.3,4,91,69,2130,14.7,79,2,fiat strada custom 307 | 28.4,4,151,90,2670,16,79,1,buick skylark limited 308 | 28.8,6,173,115,2595,11.3,79,1,chevrolet citation 309 | 26.8,6,173,115,2700,12.9,79,1,oldsmobile omega brougham 310 | 33.5,4,151,90,2556,13.2,79,1,pontiac phoenix 311 | 41.5,4,98,76,2144,14.7,80,2,vw rabbit 312 | 38.1,4,89,60,1968,18.8,80,3,toyota corolla tercel 313 | 32.1,4,98,70,2120,15.5,80,1,chevrolet chevette 314 | 37.2,4,86,65,2019,16.4,80,3,datsun 310 315 | 28,4,151,90,2678,16.5,80,1,chevrolet citation 316 | 26.4,4,140,88,2870,18.1,80,1,ford fairmont 317 | 24.3,4,151,90,3003,20.1,80,1,amc concord 318 | 19.1,6,225,90,3381,18.7,80,1,dodge aspen 319 | 34.3,4,97,78,2188,15.8,80,2,audi 4000 320 | 29.8,4,134,90,2711,15.5,80,3,toyota corona liftback 321 | 31.3,4,120,75,2542,17.5,80,3,mazda 626 322 | 37,4,119,92,2434,15,80,3,datsun 510 hatchback 323 | 32.2,4,108,75,2265,15.2,80,3,toyota corolla 324 | 46.6,4,86,65,2110,17.9,80,3,mazda glc 325 | 27.9,4,156,105,2800,14.4,80,1,dodge colt 326 | 40.8,4,85,65,2110,19.2,80,3,datsun 210 327 | 44.3,4,90,48,2085,21.7,80,2,vw rabbit c (diesel) 328 | 43.4,4,90,48,2335,23.7,80,2,vw dasher (diesel) 329 | 36.4,5,121,67,2950,19.9,80,2,audi 5000s (diesel) 330 | 30,4,146,67,3250,21.8,80,2,mercedes-benz 240d 331 | 44.6,4,91,67,1850,13.8,80,3,honda civic 1500 gl 332 | 40.9,4,85,?,1835,17.3,80,2,renault lecar deluxe 333 | 33.8,4,97,67,2145,18,80,3,subaru dl 334 | 29.8,4,89,62,1845,15.3,80,2,vokswagen rabbit 335 | 32.7,6,168,132,2910,11.4,80,3,datsun 280-zx 336 | 23.7,3,70,100,2420,12.5,80,3,mazda rx-7 gs 337 | 35,4,122,88,2500,15.1,80,2,triumph tr7 coupe 338 | 23.6,4,140,?,2905,14.3,80,1,ford mustang cobra 339 | 32.4,4,107,72,2290,17,80,3,honda accord 340 | 27.2,4,135,84,2490,15.7,81,1,plymouth reliant 341 | 26.6,4,151,84,2635,16.4,81,1,buick skylark 342 | 25.8,4,156,92,2620,14.4,81,1,dodge aries wagon (sw) 343 | 23.5,6,173,110,2725,12.6,81,1,chevrolet citation 344 | 30,4,135,84,2385,12.9,81,1,plymouth reliant 345 | 39.1,4,79,58,1755,16.9,81,3,toyota starlet 346 | 39,4,86,64,1875,16.4,81,1,plymouth champ 347 | 35.1,4,81,60,1760,16.1,81,3,honda civic 1300 348 | 32.3,4,97,67,2065,17.8,81,3,subaru 349 | 37,4,85,65,1975,19.4,81,3,datsun 210 mpg 350 | 37.7,4,89,62,2050,17.3,81,3,toyota tercel 351 | 34.1,4,91,68,1985,16,81,3,mazda glc 4 352 | 34.7,4,105,63,2215,14.9,81,1,plymouth horizon 4 353 | 
34.4,4,98,65,2045,16.2,81,1,ford escort 4w 354 | 29.9,4,98,65,2380,20.7,81,1,ford escort 2h 355 | 33,4,105,74,2190,14.2,81,2,volkswagen jetta 356 | 34.5,4,100,?,2320,15.8,81,2,renault 18i 357 | 33.7,4,107,75,2210,14.4,81,3,honda prelude 358 | 32.4,4,108,75,2350,16.8,81,3,toyota corolla 359 | 32.9,4,119,100,2615,14.8,81,3,datsun 200sx 360 | 31.6,4,120,74,2635,18.3,81,3,mazda 626 361 | 28.1,4,141,80,3230,20.4,81,2,peugeot 505s turbo diesel 362 | 30.7,6,145,76,3160,19.6,81,2,volvo diesel 363 | 25.4,6,168,116,2900,12.6,81,3,toyota cressida 364 | 24.2,6,146,120,2930,13.8,81,3,datsun 810 maxima 365 | 22.4,6,231,110,3415,15.8,81,1,buick century 366 | 26.6,8,350,105,3725,19,81,1,oldsmobile cutlass ls 367 | 20.2,6,200,88,3060,17.1,81,1,ford granada gl 368 | 17.6,6,225,85,3465,16.6,81,1,chrysler lebaron salon 369 | 28,4,112,88,2605,19.6,82,1,chevrolet cavalier 370 | 27,4,112,88,2640,18.6,82,1,chevrolet cavalier wagon 371 | 34,4,112,88,2395,18,82,1,chevrolet cavalier 2-door 372 | 31,4,112,85,2575,16.2,82,1,pontiac j2000 se hatchback 373 | 29,4,135,84,2525,16,82,1,dodge aries se 374 | 27,4,151,90,2735,18,82,1,pontiac phoenix 375 | 24,4,140,92,2865,16.4,82,1,ford fairmont futura 376 | 36,4,105,74,1980,15.3,82,2,volkswagen rabbit l 377 | 37,4,91,68,2025,18.2,82,3,mazda glc custom l 378 | 31,4,91,68,1970,17.6,82,3,mazda glc custom 379 | 38,4,105,63,2125,14.7,82,1,plymouth horizon miser 380 | 36,4,98,70,2125,17.3,82,1,mercury lynx l 381 | 36,4,120,88,2160,14.5,82,3,nissan stanza xe 382 | 36,4,107,75,2205,14.5,82,3,honda accord 383 | 34,4,108,70,2245,16.9,82,3,toyota corolla 384 | 38,4,91,67,1965,15,82,3,honda civic 385 | 32,4,91,67,1965,15.7,82,3,honda civic (auto) 386 | 38,4,91,67,1995,16.2,82,3,datsun 310 gx 387 | 25,6,181,110,2945,16.4,82,1,buick century limited 388 | 38,6,262,85,3015,17,82,1,oldsmobile cutlass ciera (diesel) 389 | 26,4,156,92,2585,14.5,82,1,chrysler lebaron medallion 390 | 22,6,232,112,2835,14.7,82,1,ford granada l 391 | 32,4,144,96,2665,13.9,82,3,toyota celica gt 392 | 36,4,135,84,2370,13,82,1,dodge charger 2.2 393 | 27,4,151,90,2950,17.3,82,1,chevrolet camaro 394 | 27,4,140,86,2790,15.6,82,1,ford mustang gl 395 | 44,4,97,52,2130,24.6,82,2,vw pickup 396 | 32,4,135,84,2295,11.6,82,1,dodge rampage 397 | 28,4,120,79,2625,18.6,82,1,ford ranger 398 | 31,4,119,82,2720,19.4,82,1,chevy s-10 399 | -------------------------------------------------------------------------------- /data/Heart.csv: -------------------------------------------------------------------------------- 1 | row.names,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd 1,160,12,5.73,23.11,Present,49,25.3,97.2,52,1 2,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1 3,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0 4,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1 5,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1 6,132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0 7,142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0 8,114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1 9,114,0,3.83,19.4,Present,49,24.86,2.49,29,0 10,132,0,5.8,30.96,Present,69,30.11,0,53,1 11,206,6,2.95,32.27,Absent,72,26.81,56.06,60,1 12,134,14.1,4.44,22.39,Present,65,23.09,0,40,1 13,118,0,1.88,10.05,Absent,59,21.57,0,17,0 14,132,0,1.87,17.21,Absent,49,23.63,0.97,15,0 15,112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0 16,117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0 17,120,7.5,15.33,22,Absent,60,25.31,34.49,49,0 18,146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1 19,158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1 20,124,14,6.23,35.96,Present,45,30.09,0,59,1 
21,106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1 22,132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0 23,150,0.3,6.38,33.99,Present,62,24.64,0,50,0 24,138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0 25,142,18.2,4.34,24.38,Absent,61,26.19,0,50,0 26,124,4,12.42,31.29,Present,54,23.23,2.06,42,1 27,118,6,9.65,33.91,Absent,60,38.8,0,48,0 28,145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1 29,144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0 30,146,0,6.62,25.69,Absent,60,28.07,8.23,63,1 31,136,2.52,3.95,25.63,Absent,51,21.86,0,45,1 32,158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1 33,122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1 34,126,8.75,6.53,34.02,Absent,49,30.25,0,41,1 35,148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0 36,122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1 37,140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0 38,110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0 39,130,0,2.82,19.63,Present,70,24.86,0,29,0 40,136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1 41,118,0.28,5.8,33.7,Present,60,30.98,0,41,1 42,144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0 43,120,0,1.07,16.02,Absent,47,22.15,0,15,0 44,130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1 45,114,0,2.99,9.74,Absent,54,46.58,0,17,0 46,128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0 47,162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1 48,116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1 49,114,0,1.94,11.02,Absent,54,20.17,38.98,16,0 50,126,3.8,3.88,31.79,Absent,57,30.53,0,30,0 51,122,0,5.75,30.9,Present,46,29.01,4.11,42,0 52,134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0 53,152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1 54,134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1 55,156,3,1.82,27.55,Absent,60,23.91,54,53,0 56,152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0 57,118,0,2.99,16.17,Absent,49,23.83,3.22,28,0 58,126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1 59,103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0 60,121,0.8,5.29,18.95,Present,47,22.51,0,61,0 61,142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0 62,138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0 63,152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0 64,140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0 65,130,0,1.82,10.45,Absent,57,22.07,2.06,17,0 66,136,7.36,2.19,28.11,Present,61,25,61.71,54,0 67,124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0 68,112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0 69,118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0 70,122,0,3.37,16.1,Absent,67,21.06,0,32,1 71,118,0,3.67,12.13,Absent,51,19.15,0.6,15,0 72,130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0 73,130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0 74,126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0 75,128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0 76,136,0,4.12,17.42,Absent,52,21.66,12.86,40,0 77,134,0,5.9,30.84,Absent,49,29.16,0,55,0 78,140,0.6,5.56,33.39,Present,58,27.19,0,55,1 79,168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1 80,108,0.4,5.91,22.92,Present,57,25.72,72,39,0 81,114,3,7.04,22.64,Present,55,22.59,0,45,1 82,140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1 83,148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1 84,148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1 85,128,0,2.43,13.15,Present,63,20.75,0,17,0 86,130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0 87,126,10.5,4.49,17.33,Absent,67,19.37,0,49,1 88,140,0,5.08,27.33,Present,41,27.83,1.25,38,0 89,126,0.9,5.64,17.78,Present,55,21.94,0,41,0 90,122,0.72,4.04,32.38,Absent,34,28.34,0,55,0 91,116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0 92,120,3.7,4.02,39.66,Absent,61,30.57,0,64,1 93,143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0 94,118,4,3.95,18.96,Absent,54,25.15,8.33,49,1 
95,194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0 96,134,3,4.37,23.07,Absent,56,20.54,9.65,62,0 97,138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0 98,136,0,5,27.58,Present,49,27.59,1.47,39,0 99,122,3.2,11.32,35.36,Present,55,27.07,0,51,1 100,164,12,3.91,19.59,Absent,51,23.44,19.75,39,0 101,136,8,7.85,23.81,Present,51,22.69,2.78,50,0 102,166,0.07,4.03,29.29,Absent,53,28.37,0,27,0 103,118,0,4.34,30.12,Present,52,32.18,3.91,46,0 104,128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0 105,118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0 106,158,3.6,2.97,30.11,Absent,63,26.64,108,64,0 107,108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1 108,170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1 109,118,1,5.76,22.1,Absent,62,23.48,7.71,42,0 110,124,0,3.04,17.33,Absent,49,22.04,0,18,0 111,114,0,8.01,21.64,Absent,66,25.51,2.49,16,0 112,168,9,8.53,24.48,Present,69,26.18,4.63,54,1 113,134,2,3.66,14.69,Absent,52,21.03,2.06,37,0 114,174,0,8.46,35.1,Present,35,25.27,0,61,1 115,116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1 116,128,0,10.58,31.81,Present,46,28.41,14.66,48,0 117,140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1 118,154,0.7,5.91,25,Absent,13,20.6,0,42,0 119,150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1 120,130,0,3.92,25.55,Absent,68,28.02,0.68,27,0 121,128,2,6.13,21.31,Absent,66,22.86,11.83,60,0 122,120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0 123,120,0,5.01,26.13,Absent,64,26.21,12.24,33,0 124,138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1 125,153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0 126,123,8.6,11.17,35.28,Present,70,33.14,0,59,1 127,148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0 128,136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0 129,134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1 130,152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0 131,158,13.5,5.04,30.79,Absent,54,24.79,21.5,62,0 132,132,2,3.08,35.39,Absent,45,31.44,79.82,58,1 133,134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1 134,142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0 135,134,6,3.3,28.45,Absent,65,26.09,58.11,40,0 136,122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1 137,116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0 138,128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0 139,120,0,3.68,12.24,Absent,51,20.52,0.51,20,0 140,124,0,3.95,36.35,Present,59,32.83,9.59,54,0 141,160,14,5.9,37.12,Absent,58,33.87,3.52,54,1 142,130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1 143,128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0 144,130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0 145,109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0 146,144,0,3.84,18.72,Absent,56,22.1,4.8,40,0 147,118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0 148,136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1 149,136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1 150,124,15.5,5.05,24.06,Absent,46,23.22,0,61,1 151,148,6,6.49,26.47,Absent,48,24.7,0,55,0 152,128,6.6,3.58,20.71,Absent,55,24.15,0,52,0 153,122,0.28,4.19,19.97,Absent,61,25.63,0,24,0 154,108,0,2.74,11.17,Absent,53,22.61,0.95,20,0 155,124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1 156,138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1 157,127,0,2.81,15.7,Absent,42,22.03,1.03,17,0 158,174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0 159,122,0,3.05,23.51,Absent,46,25.81,0,38,0 160,144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1 161,126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0 162,208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1 163,138,0,2.68,17.04,Absent,42,22.16,0,16,0 164,148,0,3.84,17.26,Absent,70,20,0,21,0 165,122,0,3.08,16.3,Absent,43,22.13,0,16,0 166,132,7,3.2,23.26,Absent,77,23.64,23.14,49,0 167,110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1 
168,160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1 169,126,0.54,4.39,21.13,Present,45,25.99,0,25,0 170,162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0 171,194,2.55,6.89,33.88,Present,69,29.33,0,41,0 172,118,0.75,2.58,20.25,Absent,59,24.46,0,32,0 173,124,0,4.79,34.71,Absent,49,26.09,9.26,47,0 174,160,0,2.42,34.46,Absent,48,29.83,1.03,61,0 175,128,0,2.51,29.35,Present,53,22.05,1.37,62,0 176,122,4,5.24,27.89,Present,45,26.52,0,61,1 177,132,2,2.7,21.57,Present,50,27.95,9.26,37,0 178,120,0,2.42,16.66,Absent,46,20.16,0,17,0 179,128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0 180,108,15,4.91,34.65,Absent,41,27.96,14.4,56,0 181,166,0,4.31,34.27,Absent,45,30.14,13.27,56,0 182,152,0,6.06,41.05,Present,51,40.34,0,51,0 183,170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1 184,156,4,2.05,19.48,Present,50,21.48,27.77,39,1 185,116,8,6.73,28.81,Present,41,26.74,40.94,48,1 186,122,4.4,3.18,11.59,Present,59,21.94,0,33,1 187,150,20,6.4,35.04,Absent,53,28.88,8.33,63,0 188,129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0 189,134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0 190,126,0,5.98,29.06,Present,56,25.39,11.52,64,1 191,142,0,3.72,25.68,Absent,48,24.37,5.25,40,1 192,128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1 193,102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1 194,130,0,4.89,25.98,Absent,72,30.42,14.71,23,0 195,138,0.05,2.79,10.35,Absent,46,21.62,0,18,0 196,138,0,1.96,11.82,Present,54,22.01,8.13,21,0 197,128,0,3.09,20.57,Absent,54,25.63,0.51,17,0 198,162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0 199,160,3,9.19,26.47,Present,39,28.25,14.4,54,1 200,148,0,4.66,24.39,Absent,50,25.26,4.03,27,0 201,124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0 202,136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1 203,134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0 204,128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0 205,122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0 206,152,3,4.64,31.29,Absent,41,29.34,4.53,40,0 207,162,0,5.09,24.6,Present,64,26.71,3.81,18,0 208,124,4,6.65,30.84,Present,54,28.4,33.51,60,0 209,136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0 210,136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0 211,134,0.05,8.03,27.95,Absent,48,26.88,0,60,0 212,122,1,5.88,34.81,Present,69,31.27,15.94,40,1 213,116,3,3.05,30.31,Absent,41,23.63,0.86,44,0 214,132,0,0.98,21.39,Absent,62,26.75,0,53,0 215,134,0,2.4,21.11,Absent,57,22.45,1.37,18,0 216,160,7.77,8.07,34.8,Absent,64,31.15,0,62,1 217,180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1 218,124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0 219,114,0,4.97,9.69,Absent,26,22.6,0,25,0 220,208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0 221,138,0,3.14,12,Absent,54,20.28,0,16,0 222,164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1 223,144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0 224,136,7.5,7.39,28.04,Present,50,25.01,0,45,1 225,132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0 226,143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0 227,112,4.46,7.18,26.25,Present,69,27.29,0,32,1 228,134,10,3.79,34.72,Absent,42,28.33,28.8,52,1 229,138,2,5.11,31.4,Present,49,27.25,2.06,64,1 230,188,0,5.47,32.44,Present,71,28.99,7.41,50,1 231,110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1 232,136,13.2,7.18,35.95,Absent,48,29.19,0,62,0 233,130,1.75,5.46,34.34,Absent,53,29.42,0,58,1 234,122,0,3.76,24.59,Absent,56,24.36,0,30,0 235,138,0,3.24,27.68,Absent,60,25.7,88.66,29,0 236,130,18,4.13,27.43,Absent,54,27.44,0,51,1 237,126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0 238,176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0 239,122,0,5.49,19.56,Absent,57,23.12,14.02,27,0 240,124,0,3.23,9.64,Absent,59,22.7,0,16,0 
241,140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1 242,128,6,4.37,22.98,Present,50,26.01,0,47,0 243,190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0 244,144,0.76,10.53,35.66,Absent,63,34.35,0,55,1 245,126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1 246,128,0,2.63,23.88,Absent,45,21.59,6.54,57,0 247,136,0.4,3.91,21.1,Present,63,22.3,0,56,1 248,158,4,4.18,28.61,Present,42,25.11,0,60,0 249,160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0 250,124,6,5.21,33.02,Present,64,29.37,7.61,58,1 251,158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0 252,128,0,6.34,11.87,Absent,57,23.14,0,17,0 253,166,3,3.82,26.75,Absent,45,20.86,0,63,1 254,146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0 255,161,9,4.65,15.16,Present,58,23.76,43.2,46,0 256,164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1 257,146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1 258,142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0 259,138,12,5.13,28.34,Absent,59,24.49,32.81,58,1 260,154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0 261,118,0,2.39,12.13,Absent,49,18.46,0.26,17,1 263,124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0 264,124,1.04,2.84,16.42,Present,46,20.17,0,61,0 265,136,5,4.19,23.99,Present,68,27.8,25.86,35,0 266,132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1 267,118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0 268,118,0.12,4.16,9.37,Absent,57,19.61,0,17,0 269,134,12,4.96,29.79,Absent,53,24.86,8.23,57,0 270,114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0 271,136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1 272,130,0,4.16,39.43,Present,46,30.01,0,55,1 273,136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1 274,136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0 275,154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0 276,108,0.8,2.47,17.53,Absent,47,22.18,0,55,1 277,136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1 278,174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1 279,124,4.25,8.22,30.77,Absent,56,25.8,0,43,0 280,114,0,2.63,9.69,Absent,45,17.89,0,16,0 281,118,0.12,3.26,12.26,Absent,55,22.65,0,16,0 282,106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1 283,146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0 284,206,0,4.17,33.23,Absent,69,27.36,6.17,50,1 285,134,3,3.17,17.91,Absent,35,26.37,15.12,27,0 286,148,15,4.98,36.94,Present,72,31.83,66.27,41,1 287,126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0 288,134,0,3.69,13.92,Absent,43,27.66,0,19,0 289,134,0.02,2.8,18.84,Absent,45,24.82,0,17,0 290,123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0 291,112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1 292,112,0,1.71,15.96,Absent,42,22.03,3.5,16,0 293,101,0.48,7.26,13,Absent,50,19.82,5.19,16,0 294,150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0 295,170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1 296,134,0,5.63,29.12,Absent,68,32.33,2.02,34,0 297,142,0,4.19,18.04,Absent,56,23.65,20.78,42,1 298,132,0.1,3.28,10.73,Absent,73,20.42,0,17,0 299,136,0,2.28,18.14,Absent,55,22.59,0,17,0 300,132,12,4.51,21.93,Absent,61,26.07,64.8,46,1 301,166,4.1,4,34.3,Present,32,29.51,8.23,53,0 302,138,0,3.96,24.7,Present,53,23.8,0,45,0 303,138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1 304,170,0,3.12,37.15,Absent,47,35.42,0,53,0 305,128,0,8.41,28.82,Present,60,26.86,0,59,1 306,136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0 307,128,0,3.22,26.55,Present,39,26.59,16.71,49,0 308,150,14.4,5.04,26.52,Present,60,28.84,0,45,0 309,132,8.4,3.57,13.68,Absent,42,18.75,15.43,59,1 310,142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0 311,130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0 312,174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1 313,114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0 314,162,1.5,2.46,19.39,Present,49,24.32,0,59,1 
315,174,0,3.27,35.4,Absent,58,37.71,24.95,44,0 316,190,5.15,6.03,36.59,Absent,42,30.31,72,50,0 317,154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0 318,124,0,2.28,24.86,Present,50,22.24,8.26,38,0 319,114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0 320,168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1 321,142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0 322,154,0,4.81,28.11,Present,56,25.67,75.77,59,0 323,146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0 324,166,6,3.02,29.3,Absent,35,24.38,38.06,61,0 325,140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1 326,136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0 327,156,0,3.47,21.1,Absent,73,28.4,0,36,1 328,132,0,6.63,29.58,Present,37,29.41,2.57,62,0 329,128,0,2.98,12.59,Absent,65,20.74,2.06,19,0 330,106,5.6,3.2,12.3,Absent,49,20.29,0,39,0 331,144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0 332,154,0.31,2.33,16.48,Absent,33,24,11.83,17,0 333,126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0 334,134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1 335,152,19.45,4.22,29.81,Absent,28,23.95,0,59,1 336,146,1.35,6.39,34.21,Absent,51,26.43,0,59,1 337,162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0 338,130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1 339,138,6,7.24,37.05,Absent,38,28.69,0,59,0 340,148,0,5.32,26.71,Present,52,32.21,32.78,27,0 341,124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0 342,118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0 343,116,4.28,7.02,19.99,Present,68,23.31,0,52,1 344,162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1 345,138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0 346,137,1.2,3.14,23.87,Absent,66,24.13,45,37,0 347,198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1 348,154,4.5,4.75,23.52,Present,43,25.76,0,53,1 349,128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0 350,130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1 351,162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0 352,120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0 353,136,3.99,2.58,16.38,Present,53,22.41,27.67,36,0 354,176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1 355,134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1 356,122,1.7,5.28,32.23,Present,51,24.08,0,54,0 357,134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1 358,134,0,2.43,22.24,Absent,52,26.49,41.66,24,0 359,136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0 360,132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0 361,152,1.68,3.58,25.43,Absent,50,27.03,0,32,0 362,132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1 363,124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0 364,140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0 365,166,0.6,2.42,34.03,Present,53,26.96,54,60,0 366,156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1 367,132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0 368,150,0,4.99,27.73,Absent,57,30.92,8.33,24,0 369,134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0 370,126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0 371,148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0 372,148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1 373,132,6,5.97,25.73,Present,66,24.18,145.29,41,0 374,128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0 375,128,5.16,4.9,31.35,Present,57,26.42,0,64,0 376,140,0,2.4,27.89,Present,70,30.74,144,29,0 377,126,0,5.29,27.64,Absent,25,27.62,2.06,45,0 378,114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0 379,118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0 380,126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0 381,154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1 382,112,1.44,2.71,22.92,Absent,59,24.81,0,52,0 383,140,8,4.42,33.15,Present,47,32.77,66.86,44,0 384,140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1 385,128,2.6,4.94,21.36,Absent,61,21.3,0,31,0 386,126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0 
387,160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1 388,144,0,4.17,29.63,Present,52,21.83,0,59,0 389,148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1 390,146,0,4.92,18.53,Absent,57,24.2,34.97,26,0 391,164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1 392,130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1 393,154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1 394,178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0 395,180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0 396,134,12.5,2.73,39.35,Absent,48,35.58,0,48,0 397,142,0,3.54,16.64,Absent,58,25.97,8.36,27,0 398,162,7,7.67,34.34,Present,33,30.77,0,62,0 399,218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1 400,126,8.75,6.06,32.72,Present,33,27,62.43,55,1 401,126,0,3.57,26.01,Absent,61,26.3,7.97,47,0 402,134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0 403,132,0,4.17,36.57,Absent,57,30.61,18,49,0 404,178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1 405,208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1 406,160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0 407,116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0 408,180,25.01,3.7,38.11,Present,57,30.54,0,61,1 409,200,19.2,4.43,40.6,Present,55,32.04,36,60,1 410,112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0 411,120,0,3.1,26.97,Absent,41,24.8,0,16,0 412,178,20,9.78,33.55,Absent,37,27.29,2.88,62,1 413,166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0 414,164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1 415,216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1 416,146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0 417,134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1 418,158,16,5.56,29.35,Absent,36,25.92,58.32,60,0 419,176,0,3.14,31.04,Present,45,30.18,4.63,45,0 420,132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0 421,126,0,4.55,29.18,Absent,48,24.94,36,41,0 422,120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0 423,174,0,3.86,21.73,Absent,42,23.37,0,63,0 424,150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1 425,176,6,3.98,17.2,Present,52,21.07,4.11,61,1 426,142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1 427,132,0,3.3,21.61,Absent,42,24.92,32.61,33,0 428,142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0 429,146,1.16,2.28,34.53,Absent,50,28.71,45,49,0 430,132,7.2,3.65,17.16,Present,56,23.25,0,34,0 431,120,0,3.57,23.22,Absent,58,27.2,0,32,0 432,118,0,3.89,15.96,Absent,65,20.18,0,16,0 433,108,0,1.43,26.26,Absent,42,19.38,0,16,0 434,136,0,4,19.06,Absent,40,21.94,2.06,16,0 435,120,0,2.46,13.39,Absent,47,22.01,0.51,18,0 436,132,0,3.55,8.66,Present,61,18.5,3.87,16,0 437,136,0,1.77,20.37,Absent,45,21.51,2.06,16,0 438,138,0,1.86,18.35,Present,59,25.38,6.51,17,0 439,138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0 440,130,1.22,3.3,13.65,Absent,50,21.4,3.81,31,0 441,130,4,2.4,17.42,Absent,60,22.05,0,40,0 442,110,0,7.14,28.28,Absent,57,29,0,32,0 443,120,0,3.98,13.19,Present,47,21.89,0,16,0 444,166,6,8.8,37.89,Absent,39,28.7,43.2,52,0 445,134,0.57,4.75,23.07,Absent,67,26.33,0,37,0 446,142,3,3.69,25.1,Absent,60,30.08,38.88,27,0 447,136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0 448,142,0,4.32,25.22,Absent,47,28.92,6.53,34,1 449,130,0,1.88,12.51,Present,52,20.28,0,17,0 450,124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0 451,144,4,5.03,25.78,Present,57,27.55,90,48,1 452,136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0 453,120,0,2.77,13.35,Absent,67,23.37,1.03,18,0 454,154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0 455,124,1.6,7.22,39.68,Present,36,31.5,0,51,1 456,146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1 457,128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1 458,170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0 459,214,0.4,5.98,31.72,Absent,64,28.45,0,58,0 460,182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1 
461,108,3,1.59,15.23,Absent,40,20.09,26.64,55,0 462,118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0 463,132,0,4.82,33.41,Present,62,14.7,0,46,1 -------------------------------------------------------------------------------- /data/Income1.csv: -------------------------------------------------------------------------------- 1 | "","Education","Income" 2 | "1",10,26.6588387834389 3 | "2",10.4013377926421,27.3064353457772 4 | "3",10.8428093645485,22.1324101716143 5 | "4",11.2441471571906,21.1698405046065 6 | "5",11.6454849498328,15.1926335164307 7 | "6",12.0869565217391,26.3989510407284 8 | "7",12.4882943143813,17.435306578572 9 | "8",12.8896321070234,25.5078852305278 10 | "9",13.2909698996656,36.884594694235 11 | "10",13.7324414715719,39.666108747637 12 | "11",14.133779264214,34.3962805641312 13 | "12",14.5351170568562,41.4979935356871 14 | "13",14.9765886287625,44.9815748660704 15 | "14",15.3779264214047,47.039595257834 16 | "15",15.7792642140468,48.2525782901863 17 | "16",16.2207357859532,57.0342513373801 18 | "17",16.6220735785953,51.4909192102538 19 | "18",17.0234113712375,61.3366205527288 20 | "19",17.4648829431438,57.581988179306 21 | "20",17.866220735786,68.5537140185881 22 | "21",18.2675585284281,64.310925303692 23 | "22",18.7090301003344,68.9590086393083 24 | "23",19.1103678929766,74.6146392793647 25 | "24",19.5117056856187,71.8671953042483 26 | "25",19.9130434782609,76.098135379724 27 | "26",20.3545150501672,75.77521802986 28 | "27",20.7558528428094,72.4860553152424 29 | "28",21.1571906354515,77.3550205741877 30 | "29",21.5986622073579,72.1187904524136 31 | "30",22,80.2605705009016 32 | -------------------------------------------------------------------------------- /data/Income2.csv: -------------------------------------------------------------------------------- 1 | "","Education","Seniority","Income" 2 | "1",21.5862068965517,113.103448275862,99.9171726114381 3 | "2",18.2758620689655,119.310344827586,92.579134855529 4 | "3",12.0689655172414,100.689655172414,34.6787271520874 5 | "4",17.0344827586207,187.586206896552,78.7028062353695 6 | "5",19.9310344827586,20,68.0099216471551 7 | "6",18.2758620689655,26.2068965517241,71.5044853814318 8 | "7",19.9310344827586,150.344827586207,87.9704669939115 9 | "8",21.1724137931034,82.0689655172414,79.8110298331255 10 | "9",20.3448275862069,88.2758620689655,90.00632710858 11 | "10",10,113.103448275862,45.6555294997364 12 | "11",13.7241379310345,51.0344827586207,31.9138079371295 13 | "12",18.6896551724138,144.137931034483,96.2829968022869 14 | "13",11.6551724137931,20,27.9825049000603 15 | "14",16.6206896551724,94.4827586206897,66.601792415137 16 | "15",10,187.586206896552,41.5319924201478 17 | "16",20.3448275862069,94.4827586206897,89.00070081522 18 | "17",14.1379310344828,20,28.8163007592387 19 | "18",16.6206896551724,44.8275862068966,57.6816942573605 20 | "19",16.6206896551724,175.172413793103,70.1050960424457 21 | "20",20.3448275862069,187.586206896552,98.8340115435447 22 | "21",18.2758620689655,100.689655172414,74.7046991976891 23 | "22",14.551724137931,137.931034482759,53.5321056283034 24 | "23",17.448275862069,94.4827586206897,72.0789236655191 25 | "24",10.4137931034483,32.4137931034483,18.5706650327685 26 | "25",21.5862068965517,20,78.8057842852386 27 | "26",11.2413793103448,44.8275862068966,21.388561306174 28 | "27",19.9310344827586,168.965517241379,90.8140351180409 29 | "28",11.6551724137931,57.2413793103448,22.6361626208955 30 | "29",12.0689655172414,32.4137931034483,17.613593041445 31 | 
"30",17.0344827586207,106.896551724138,74.6109601985289 32 | --------------------------------------------------------------------------------