├── 10.R.RData ├── 5.R.RData ├── 7.R.RData ├── HW_Nonlinearity.R ├── HW_PCA.R ├── HW_SVM.R ├── HW_TheBootstrap.R ├── ch10.Principal Components.html ├── ch10.Rmd ├── ch2.R ├── ch3.R ├── ch4.R ├── ch5.R ├── ch6.ModelSelection.html ├── ch6.Rmd ├── ch7.NonlinearModels.html ├── ch7.Rmd ├── ch8.DecisionTrees.html ├── ch8.Rmd ├── ch9.Rmd └── ch9.SVM.html /10.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kjy/StatisticalLearning_classstuff/936249d6e84f4b0888ebe08eba1b23681221a16c/10.R.RData -------------------------------------------------------------------------------- /5.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kjy/StatisticalLearning_classstuff/936249d6e84f4b0888ebe08eba1b23681221a16c/5.R.RData -------------------------------------------------------------------------------- /7.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kjy/StatisticalLearning_classstuff/936249d6e84f4b0888ebe08eba1b23681221a16c/7.R.RData -------------------------------------------------------------------------------- /HW_Nonlinearity.R: -------------------------------------------------------------------------------- 1 | ##Load the data from the file 7.R.RData, and plot it using plot(x,y). 2 | #Q1. What is the slope coefficient in a linear regression of y on x (to within 10%)? -0.67483 3 | load("/Users/karenyang/Desktop/Statisical_Learning_Rcode/7.R.RData") 4 | ls() 5 | x 6 | y 7 | plot(x,y) # shows a negative relationship 8 | model1 <- lm(y ~ x) 9 | summary(model1) 10 | 11 | Call: 12 | lm(formula = y ~ x) 13 | 14 | Residuals: 15 | Min 1Q Median 3Q Max 16 | -0.71289 -0.26943 -0.02448 0.21068 0.83582 17 | 18 | Coefficients: 19 | Estimate Std. Error t value Pr(>|t|) 20 | (Intercept) 95.43627 7.14200 13.36 <2e-16 *** 21 | x -0.67483 0.05073 -13.30 <2e-16 *** 22 | --- 23 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 24 | 25 | Residual standard error: 0.3376 on 98 degrees of freedom 26 | Multiple R-squared: 0.6436, Adjusted R-squared: 0.64 27 | F-statistic: 177 on 1 and 98 DF, p-value: < 2.2e-16 28 | 29 | Q2. For the model y ~ 1+x+x^2, what is the coefficient of x (to within 10%)? 7.771e+01 30 | fit1 = lm(y ~ x + I(x^2)) 31 | summary(fit1) 32 | Call: 33 | lm(formula = y ~ x + I(x^2)) 34 | 35 | Residuals: 36 | Min 1Q Median 3Q Max 37 | -0.65698 -0.18190 -0.01938 0.16355 0.86149 38 | 39 | Coefficients: 40 | Estimate Std. Error t value Pr(>|t|) 41 | (Intercept) -5.421e+03 1.547e+03 -3.505 0.000692 *** 42 | x 7.771e+01 2.197e+01 3.536 0.000624 *** 43 | I(x^2) -2.784e-01 7.805e-02 -3.567 0.000563 *** 44 | --- 45 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 46 | 47 | Residual standard error: 0.3191 on 97 degrees of freedom 48 | Multiple R-squared: 0.6849, Adjusted R-squared: 0.6784 49 | F-statistic: 105.4 on 2 and 97 DF, p-value: < 2.2e-16 -------------------------------------------------------------------------------- /HW_PCA.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | load("/Users/karenyang/Desktop/Statistical_Learning_Rcode/10.R.RData") 4 | ls() 5 | [1] "x" "x.test" "y" "y.test" 6 | 7 | 8 | Suppose we want to fit a linear regression, but the number of variables is much larger than the number of observations. In some cases, we may improve the fit by reducing the dimension of the features before. 
9 | In this problem, we use a data set with n = 300 and p = 200, so we have more observations than variables, but not by much. Load the data x, y, x.test, and y.test from 10.R.RData. 10 | First, concatenate x and x.test using the rbind functions and perform a principal components analysis on the concatenated data frame. 11 | Q.1 To within 10% relative error, what proportion of the variance is explained by the first five principal components? 12 | xvars = rbind(x,x.test) # bind the variables 13 | 14 | dataset1 = data.frame(xvars) #create a dataset 15 | 16 | pca.out = prcomp(dataset1, scale = TRUE) # prinicipal component analysis model 17 | pca.out$sdev # gives standard deviations 18 | 19 | screeplot(pca.out) # Scree plot shows variance explained per principal component 20 | (pca.out$sdev)^2/ sum(pca.out$sdev^2) 21 | sum(0.1278392623, 0.1056409183, 0.0693007523, 0.0363725007, 0.0107030317) #Take sum of first five 22 | [1] 0.3498565 #cumulative sum 23 | #alternative method is to use the cumulative sum function, which gives the same 0.3498565 on fifth entry 24 | cumsum((pca.out$sdev)^2) / sum(pca.out$sdev^2) 25 | [1] 0.1278393 0.2334802 0.3027809 0.3391534 0.3498565 26 | 27 | #or just use 28 | summary(pca.out) 29 | Importance of components: 30 | PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 31 | Standard deviation 5.0565 4.5965 3.7229 2.69713 1.4631 1.16827 1.15848 1.15544 1.14591 32 | Proportion of Variance 0.1278 0.1056 0.0693 0.03637 0.0107 0.00682 0.00671 0.00668 0.00657 33 | Cumulative Proportion 0.1278 0.2335 0.3028 0.33915 0.3499 0.35668 0.36339 0.37007 0.37663 34 | 35 | 36 | #Q.2 The previous answer suggests that a relatively small number of '"latent variables" account for a substantial fraction of the features' variability. We might believe that these 37 | latent variables are more important than linear combinations of the features that have low variance. 38 | We can try forgetting about the raw features and using the first five principal components (computed on rbind(x,x.test)) instead as low-dimensional derived features. What is the mean-squared test error if we regress y on the first five principal components, and use the resulting model to predict y.test? 39 | xols<-pca.out$x[1:300,1:5] 40 | fit0 <- lm(y ~ xols) 41 | summary(fit0) 42 | Call: 43 | lm(formula = y ~ xols) 44 | 45 | Residuals: 46 | Min 1Q Median 3Q Max 47 | -3.3289 -0.6992 0.0319 0.8075 2.5240 48 | 49 | Coefficients: 50 | Estimate Std. Error t value Pr(>|t|) 51 | (Intercept) 0.09541 0.06107 1.562 0.119314 52 | xolsPC1 0.07608 0.01159 6.564 2.36e-10 *** 53 | xolsPC2 -0.02276 0.01314 -1.732 0.084309 . 54 | xolsPC3 -0.04023 0.01538 -2.616 0.009352 ** 55 | xolsPC4 -0.06368 0.02237 -2.847 0.004722 ** 56 | xolsPC5 -0.16069 0.04299 -3.738 0.000223 *** 57 | --- 58 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 59 | 60 | Residual standard error: 1.056 on 294 degrees of freedom 61 | Multiple R-squared: 0.1906, Adjusted R-squared: 0.1769 62 | F-statistic: 13.85 on 5 and 294 DF, p-value: 3.704e-12 63 | 64 | yhat0 = predict(fit0, x.test) 65 | mean((yhat0-y.test)^2) 66 | 67 | [1] 1.40799 #not correct 68 | [1] 1.413063 #not correct 69 | 70 | 71 | 72 | 73 | # Q.3 Now, try an OLS linear regression of y on the matrix x. What is the mean squared prediction error if we use the fitted model to predict y.test from x.test? 
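# Revisiting Q.2 before answering Q.3: predict(fit0, x.test) above likely failed because
# fit0 was fit on the matrix xols; x.test has no column named "xols", so predict() falls
# back to the training scores in the workspace and effectively returns the training fitted
# values. A hedged sketch that keeps the PC scores in a data frame so predict() can match
# them by name (pcs, train.pcs, fit0b, yhat0b are illustrative names):
pcs = data.frame(pca.out$x[, 1:5])                       # PC scores for training + test rows
train.pcs = data.frame(pcs[1:300, ], y = y)              # first 300 rows correspond to x
fit0b = lm(y ~ ., data = train.pcs)                      # regress y on the first five PCs
yhat0b = predict(fit0b, newdata = pcs[301:nrow(pcs), ])  # remaining rows correspond to x.test
mean((yhat0b - y.test)^2)                                # mean squared test error for Q.2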
74 | fit<-lm(y~.,x) # Run linear model 75 | summary(fit) 76 | yhat = predict(fit, newdata=x.test) # Use linear model and new dataset to get the predicted values, yhat 77 | mean((yhat-y.test)^2) #Calculate mean squared error (difference between predicted y and true y values is error) 78 | [1] 3.657197 #Grader shows answer to be 3.90714 -------------------------------------------------------------------------------- /HW_SVM.R: -------------------------------------------------------------------------------- 1 | In this problem, you will use simulation to evaluate (by Monte Carlo) the expected misclassification error rate given a particular generating model.  Let yi be equally divided between classes 0 and 1, and let xi ∈ ℝ10 be normally distributed. Given yi=0, xi ∼ N10(0, I10).  Given yi=1, xi ∼N10(μ, I10) with μ=(1,1,1,1,1,0,0,0,0,0). 2 | Now, we would like to know the expected test error rate if we fit an SVM to a sample of 50 random training points from class 1 and 50 more from class 0.  We can calculate this to high precision by 1) generating a random training sample to train on, 2) evaluating the number of mistakes we make on a large test set, and then 3) repeating (1-2) many times and averaging the error rate for each trial. 3 | 4 | install.packages("MASS") # Use this package for the mvrnorm() function 5 | library(MASS) 6 | 7 | install.packages("e1071") # Use this package for the SVM() function 8 | library(e1071) 9 | 10 | #Create a dataset of 100 observations (with 50 1's and 50 0's) with 10 predictors 11 | set.seed(10111) 12 | # n is number of obs., mu is vector mean, sigma is Identity matrix (variance-covariance) 13 | x1 = mvrnorm(n = 50, mu = rep(0,10), Sigma=diag(10)) 14 | dim(x1) 15 | [1] 50 10 16 | x2 = mvrnorm(n = 50, mu = rep(0,10), Sigma=diag(10)) 17 | xvar_1 = rbind(x1,x2) 18 | dim(xvar_1) 19 | [1] 100 10 20 | data1 = matrix(xvar_1) 21 | y = rep(c(0,1), c(50,50)) #Make a vector of 100 obs. with 50 1's and 50 0's 22 | y = matrix(y) 23 | dim (y) 24 | #Put the y variable in with the X variables to form a training dataset 25 | train_data = data.frame(xvar_1, y = as.factor(y)) 26 | dim (train_data) 27 | [1] 100 11 28 | 29 | #Create a large dataset with 1000 observations for the test datset with 10 predictors 30 | x3 = mvrnorm(500,rep(0,10),diag(10)) 31 | x4 = mvrnorm(500,rep(c(1,0),c(5,5)),diag(10)) 32 | xvar_2 = rbind(x3,x4) 33 | y1 = rep(c(0,1),c(500,500)) 34 | test_data = data.frame(xvar_2,y1=as.factor(y1)) 35 | dim (test_data) 36 | [1] 1000 11 #Test data set with 1000 observations, Y variable with 500 1's and 500 0's, and 10 X variables 37 | 38 | 39 | # Q9.1 Use svm in the e1071 package with the default settings (the default kernel is a radial kernel). What is the expected test error rate of this method (to within 10%)? 
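# Before answering Q9.1, note the training set built above: per the problem statement,
# class 1 should be drawn from N10(mu, I10) with mu = (1,1,1,1,1,0,0,0,0,0), but x2 was
# generated with mean 0, so the two training classes are identically distributed and any
# classifier trained on them should give test error near 0.5, which matches the results
# below. A hedged corrected draw (reusing the same object names for illustration):
x2 = mvrnorm(n = 50, mu = rep(c(1, 0), c(5, 5)), Sigma = diag(10))
train_data = data.frame(rbind(x1, x2), y = as.factor(rep(c(0, 1), c(50, 50))))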
40 | sample1_train <- train_data[sample(1:nrow(train_data), 50, 41 | replace=TRUE),] 42 | svmModel1 = svm(sample1_train$y ~., data=sample1_train) 43 | svmModel1 44 | svmPredicted1 <- predict(svmModel1, newdata = test_data) 45 | svmPredicted1 46 | errorrateSVM1 <- (sum(svmPredicted1 != test_data$y))/length(test_data$y) 47 | errorrateSVM1 48 | [1] 0.421 49 | 50 | 51 | sample2_train <- train_data[sample(1:nrow(train_data), 50, 52 | replace=TRUE),] 53 | svmModel2 = svm(sample2_train$y ~., data=sample2_train) 54 | svmModel2 55 | svmPredicted2 <- predict(svmModel2, newdata = test_data) 56 | svmPredicted2 57 | errorrateSVM2 <- (sum(svmPredicted2 != test_data$y))/length(test_data$y) 58 | errorrateSVM2 59 | [1] 0.465 60 | 61 | 62 | sample3_train <- train_data[sample(1:nrow(train_data), 50, 63 | replace=TRUE),] 64 | svmModel3 = svm(sample2_train$y ~., data=sample3_train) 65 | svmModel3 66 | svmPredicted3 <- predict(svmModel3, newdata = test_data) 67 | svmPredicted3 68 | errorrateSVM3 <- (sum(svmPredicted3 != test_data$y))/length(test_data$y) 69 | errorrateSVM3 70 | [1] 0.476 71 | #Take the average of the test errors for SVM 72 | sum(errorrateSVM1, errorrateSVM2, errorrateSVM3)/3 73 | [1] 0.454 74 | 75 | #USe Monte Carlo method to repeat above for 1000 iterations on SVM with default setting radial kernel 76 | error_vector <- c(); 77 | for (i in 1:1000) { 78 | ## Generate traning data 79 | sample_train <- train_data[sample(1:nrow(train_data), 50, replace = TRUE),] 80 | 81 | ## Fit a model on trainning data 82 | svmModel = svm(sample_train$y ~., data=sample_train) 83 | 84 | ## Generate test data 85 | #Use test_data from above 86 | 87 | ## Predict on test data 88 | svmPredicted <- predict(svmModel, newdata = test_data) 89 | 90 | ## Compare prediction with y 91 | errorrateSVM <- (sum(svmPredicted != test_data$y))/length(test_data$y) 92 | 93 | ## Save the result in a vector 94 | error_vector <- c(error_vector, errorrateSVM) 95 | 96 | } 97 | 98 | #Random check the values in the vector 99 | > head(error_vector) 100 | [1] 0.620 0.604 0.534 0.492 0.599 0.526 101 | 102 | error_vector[999] 103 | [1] 0.511 104 | 105 | error_vector[750] 106 | [1] 0.526 107 | 108 | # Take the average across the 1000 test errors 109 | mean(error_vector) 110 | [1] 0.513473 111 | 112 | 113 | # Q.9.2 Now fit an svm with a linear kernel (kernel = "linear"). What is the expected test error rate to within 10%? 
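# Note for Q9.1/Q9.2: the loops above resample rows of one fixed 100-row training set;
# the problem statement instead asks to regenerate a fresh training sample (50 points per
# class) on every Monte Carlo iteration. A hedged sketch of that scheme (object names are
# illustrative; assumes MASS, e1071 and test_data from above):
err <- numeric(200)                        # 200 iterations for illustration; increase for precision
for (i in seq_along(err)) {
  xtr <- rbind(mvrnorm(50, rep(0, 10), diag(10)),            # class 0: N10(0, I10)
               mvrnorm(50, rep(c(1, 0), c(5, 5)), diag(10))) # class 1: N10(mu, I10)
  tr  <- data.frame(xtr, y = as.factor(rep(c(0, 1), c(50, 50))))
  fit <- svm(y ~ ., data = tr)             # default radial kernel; add kernel = "linear" for Q9.2
  err[i] <- mean(predict(fit, newdata = test_data) != test_data$y1)
}
mean(err)                                  # Monte Carlo estimate of the expected test error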
114 | sample6_train <- train_data[sample(1:nrow(train_data), 50, 115 | replace=TRUE),] 116 | svmModel6 = svm(sample6_train$y ~., data=sample6_train, kernel = "linear") 117 | svmModel6 118 | svmPredicted6 <- predict(svmModel6, newdata = test_data) 119 | svmPredicted6 120 | errorrateSVM6 <- (sum(svmPredicted6 != test_data$y))/length(test_data$y) 121 | errorrateSVM6 122 | [1] 0.451 123 | 124 | sample7_train <- train_data[sample(1:nrow(train_data), 50, 125 | replace=TRUE),] 126 | svmModel7 = svm(sample7_train$y ~., data=sample7_train, kernel = "linear") 127 | svmModel7 128 | svmPredicted7 <- predict(svmModel7, newdata = test_data) 129 | svmPredicted7 130 | errorrateSVM7 <- (sum(svmPredicted7 != test_data$y))/length(test_data$y) 131 | errorrateSVM7 132 | [1] 0.537 133 | 134 | #Take the average of the test errors for SVM models with linear kernel 135 | sum(errorrateSVM6, errorrateSVM7)/2 136 | [1] 0.494 137 | 138 | #USe Monte Carlo method to repeat above for 1000 iterations on SVM with linear kernel 139 | error_vector1 <- c(); 140 | for (i in 1:1000) { 141 | ## Generate traning data 142 | sample_train <- train_data[sample(1:nrow(train_data), 50, replace = TRUE),] 143 | 144 | ## Fit a model on trainning data 145 | svmModel = svm(sample_train$y ~., data=sample_train, kernel = "linear") 146 | 147 | ## Generate test data 148 | #Use test_data from above 149 | 150 | ## Predict on test data 151 | svmPredicted <- predict(svmModel, newdata = test_data) 152 | 153 | ## Compare prediction with y 154 | errorrateSVM <- (sum(svmPredicted != test_data$y))/length(test_data$y) 155 | 156 | ## Save the result in a vector 157 | error_vector1 <- c(error_vector1, errorrateSVM) 158 | 159 | } 160 | 161 | # Random check the values in the vector 162 | head(error_vector1) 163 | [1] 0.553 0.608 0.461 0.483 0.542 0.469 164 | 165 | error_vector1[999] 166 | [1] 0.382 167 | 168 | error_vector1[750] 169 | [1] 0.415 170 | 171 | # Take the average across the 1000 test errors 172 | mean(error_vector1) 173 | [1] 0.483152 174 | 175 | 176 | # Q. 9.3 What is the expected test error for logistic regression? 
(to within 10%) 177 | 178 | sample11_train <- train_data[sample(1:nrow(train_data), 50, 179 | replace=TRUE),] 180 | 181 | logistic1 = glm(sample11_train$y ~., family = "binomial", data=sample11_train) 182 | summary(logistic1) 183 | fitted <- predict(logistic1, newdata=test_data, type = 'response') 184 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 185 | errorratelog1 <-missClass(test_data$y,fitted) 186 | [1] 0.576 187 | 188 | 189 | sample12_train <- train_data[sample(1:nrow(train_data), 50, 190 | replace=TRUE),] 191 | logistic2 = glm(sample12_train$y ~., family = "binomial", data=sample12_train) 192 | summary(logistic2) 193 | fitted1 <- predict(logistic2, newdata=test_data, type = 'response') 194 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 195 | errorratelog2 <- missClass(test_data$y,fitted1) 196 | [1] 0.561 197 | 198 | sample13_train <- train_data[sample(1:nrow(train_data), 50, 199 | replace=TRUE),] 200 | logistic3 = glm(sample13_train$y ~., family = "binomial", data=sample13_train) 201 | summary(logistic3) 202 | fitted2 <- predict(logistic3, newdata=test_data, type = 'response') 203 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 204 | errorratelog3 <- missClass(test_data$y,fitted2) 205 | [1] 0.425 206 | #Take the average of the test errors for binary logistic 207 | sum(errorratelog1, errorratelog2, errorratelog3)/3 208 | 209 | 210 | 211 | #USe Monte Carlo method to repeat above for 1000 iterations on binary logistic regression model 212 | 213 | #Use a function for missclassification error rate 214 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 215 | #Set up a vector to store the error rates 216 | error_vector2 <- c(); 217 | 218 | for (i in 1:1000) { 219 | ## Generate traning data 220 | sample_train <- train_data[sample(1:nrow(train_data), 50, replace = TRUE),] 221 | 222 | ## Fit a model on trainning data 223 | LogisticModel = glm(sample_train$y ~., family = "binomial", data=sample_train) 224 | 225 | ## Generate test data 226 | #Use test_data from above 227 | 228 | ## Predict on test data 229 | fitted <- predict(LogisticModel, newdata=test_data, type = 'response') 230 | 231 | ## Compare prediction with y 232 | errorrate_logistic <- missClass(test_data$y,fitted) 233 | 234 | ## Save the result in a vector 235 | error_vector2 <- c(error_vector2, errorrate_logistic) 236 | 237 | } 238 | #(Don't worry if you get errors saying the logistic regression did not converge.)Warning messages: 239 | 1: glm.fit: algorithm did not converge 240 | 2: glm.fit: fitted probabilities numerically 0 or 1 occurred 241 | 3: glm.fit: fitted probabilities numerically 0 or 1 occurred 242 | 4: glm.fit: fitted probabilities numerically 0 or 1 occurred 243 | 244 | # Random check the values in the vector 245 | head(error_vector2) 246 | [1] 0.444 0.280 0.419 0.593 0.501 0.686 247 | error_vector2[999] 248 | [1] 0.419 249 | error_vector2[750] 250 | [1] 0.548 251 | 252 | # Take the average across the 1000 test errors 253 | mean(error_vector2) 254 | [1] 0.476648 255 | -------------------------------------------------------------------------------- /HW_TheBootstrap.R: -------------------------------------------------------------------------------- 1 | setwd("/Users/karenyang/Downloads/") 2 | load("/Users/karenyang/Downloads/5.R.RData") 3 | ls() 4 | head(Xy) 5 | tail(Xy) 6 | dim(Xy) 7 | [1] 1000 3 8 | 9 | attach(Xy) 10 | 
search() 11 | 12 | # Q1. Download the file 5.R.RData and load it into R using load("5.R.RData"). 13 | # Consider the linear regression model of y on X1 and X2. 14 | # To within 10%, what is the standard error for β1? 0.026 15 | model1 <- lm(y ~ X1 + X2) 16 | summary(model1) 17 | Call: 18 | lm(formula = y ~ X1 + X2) 19 | 20 | Residuals: 21 | Min 1Q Median 3Q Max 22 | -1.44171 -0.25468 -0.01736 0.33081 1.45860 23 | 24 | Coefficients: 25 | Estimate Std. Error t value Pr(>|t|) 26 | (Intercept) 0.26583 0.01988 13.372 < 2e-16 *** 27 | X1 0.14533 0.02593 5.604 2.71e-08 *** 28 | X2 0.31337 0.02923 10.722 < 2e-16 *** 29 | --- 30 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 31 | 32 | Residual standard error: 0.5451 on 997 degrees of freedom 33 | Multiple R-squared: 0.1171, Adjusted R-squared: 0.1154 34 | F-statistic: 66.14 on 2 and 997 DF, p-value: < 2.2e-16 35 | 36 | # Q2 Next, plot the data using matplot(Xy,type="l"). 37 | # Which of the following do you think is most likely given what you see? Our estimate of s.e.(β^1) is too low. 38 | ?matplot 39 | matplot(Xy, type="l") 40 | 41 | # Q3 Now, use the (standard) bootstrap to estimate s.e.(β^1). To within 10%, what do you get? 42 | 43 | library(boot) 44 | alpha = function(x,y){ 45 | vx = var(x) 46 | vy = var(y) 47 | cxy= cov(x,y) 48 | (vy-cxy)/(vx+vy-2*cxy) 49 | } 50 | alpha(Xy$X1,Xy$y) 51 | [1] 0.4167192 52 | alpha.fn = function(data,index){ 53 | with(data[index,],alpha(Xy$X1,Xy$y)) 54 | } 55 | 56 | alpha.fn<-function(data, index) { 57 | fit1<-lm(y~., data=Xy[index,]) 58 | coefficients(fit1)[['X1']] 59 | } 60 | 61 | set.seed(1) 62 | alpha.fn (Xy,sample(1:100,100,replace=TRUE)) 63 | [1] 0.1059068 64 | 65 | boot.out=boot(Xy,alpha.fn,R=1000) 66 | boot.out 67 | 68 | ORDINARY NONPARAMETRIC BOOTSTRAP 69 | 70 | 71 | Call: 72 | boot(data = Xy, statistic = alpha.fn, R = 1000) 73 | 74 | 75 | Bootstrap Statistics : 76 | original bias std. error 77 | t1* 0.1453263 0.0001885914 0.02873965 78 | 79 | # Q4. Finally, use the block bootstrap to estimate s.e.(β^1). Use blocks of size 100. To within 10%, what do you get? 80 | ?tsboot 81 | # Generate R bootstrap replicates of a statistic applied to a time series. 82 | # The replicate time series can be generated using fixed or random block lengths 83 | # or can be model based replicates. 84 | 85 | 86 | tsboot.out = tsboot(Xy, se_stat, R = 1000, l = 100, sim = "fixed") 87 | tsboot.out 88 | -------------------------------------------------------------------------------- /ch10.Rmd: -------------------------------------------------------------------------------- 1 | Principal Components 2 | ==================== 3 | We will use the `USArrests` data (which is in R) 4 | ```{r} 5 | dimnames(USArrests) 6 | apply(USArrests,2,mean) 7 | apply(USArrests,2, var) 8 | ``` 9 | 10 | We see that `Assault` has a much larger variance than the other variables. It would dominate the principal components, so we choose to standardize the variables when we perform PCA 11 | 12 | ```{r} 13 | pca.out=prcomp(USArrests, scale=TRUE) 14 | pca.out 15 | names(pca.out) 16 | biplot(pca.out, scale=0) 17 | ``` 18 | 19 | K-Means Clustering 20 | ================== 21 | K-means works in any dimension, but is most fun to demonstrate in two, because we can plot pictures. 22 | Lets make some data with clusters. We do this by shifting the means of the points around. 
23 | ```{r} 24 | set.seed(101) 25 | x=matrix(rnorm(100*2),100,2) 26 | xmean=matrix(rnorm(8,sd=4),4,2) 27 | which=sample(1:4,100,replace=TRUE) 28 | x=x+xmean[which,] 29 | plot(x,col=which,pch=19) 30 | ``` 31 | We know the "true" cluster IDs, but we wont tell that to the `kmeans` algorithm. 32 | 33 | ```{r} 34 | km.out=kmeans(x,4,nstart=15) 35 | km.out 36 | plot(x,col=km.out$cluster,cex=2,pch=1,lwd=2) 37 | points(x,col=which,pch=19) 38 | points(x,col=c(4,3,2,1)[which],pch=19) 39 | ``` 40 | 41 | Hierarchical Clustering 42 | ======================= 43 | We will use these same data and use hierarchical clustering 44 | 45 | ```{r} 46 | hc.complete=hclust(dist(x),method="complete") 47 | plot(hc.complete) 48 | hc.single=hclust(dist(x),method="single") 49 | plot(hc.single) 50 | hc.average=hclust(dist(x),method="average") 51 | plot(hc.average) 52 | 53 | ``` 54 | Lets compare this with the actualy clusters in the data. We will use the function `cutree` to cut the tree at level 4. 55 | This will produce a vector of numbers from 1 to 4, saying which branch each observation is on. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and its a little too complicated for this demonstration. 56 | 57 | We can use `table` to see how well they match: 58 | ```{r} 59 | hc.cut=cutree(hc.complete,4) 60 | table(hc.cut,which) 61 | table(hc.cut,km.out$cluster) 62 | ``` 63 | or we can use our group membership as labels for the leaves of the dendrogram: 64 | ```{r} 65 | plot(hc.complete,labels=which) 66 | ``` 67 | 68 | 69 | -------------------------------------------------------------------------------- /ch2.R: -------------------------------------------------------------------------------- 1 | ### Week1 Chapter 2 Statistical Learning using R 2 | ### vectors, data, matrices, subsetting 3 | # Note: there are no scalars in R 4 | 5 | setwd("/Users/karenyang/Desktop") 6 | 7 | install.packages("ISLR") #Auto data located in ISLR package 8 | library(ISLR) # Load the library ISLR 9 | 10 | 11 | 12 | x=c(2,7,5) #Assign 3 numbers to vector 3 13 | x 14 | [1] 2 7 5 15 | 16 | y=seq(from=4,length=3,by=3) #Another way to make a vector, using a sequence 17 | ?seq #Opens up R help on sequence 18 | 19 | y 20 | [1] 4 7 10 21 | 22 | x+y #Sum of 2 vectors of the same length, element by element 23 | [1] 6 14 15 24 | 25 | x/y #Division of 2 vectors, element-wise 26 | [1] 0.5 1.0 0.5 27 | 28 | x^y #x to the power of y, element-wise 29 | [1] 16 823543 9765625 30 | 31 | x[2] #Use square brace to subset: extract 2nd element of x 32 | [1] 7 33 | 34 | x[2:3] #Use colon to indicate start at element 2 and stop at element 3 of subsetted vector, return subset 35 | [1] 7 5 36 | 37 | x[-2] #Remove element 2 from x and return the subsetted vector as output 38 | [1] 2 5 39 | 40 | x[-c(1,2)] #Remove a collection of indices 1 and 2 and return the subsetted vector 41 | [1] 5 42 | 43 | 44 | z=matrix(seq(1,12),4,3) #A 2-way array is a matrix, make a 4x3 matrix and fill values 1-12 column-wise 45 | z 46 | [,1] [,2] [,3] 47 | [1,] 1 5 9 48 | [2,] 2 6 10 49 | [3,] 3 7 11 50 | [4,] 4 8 12 51 | 52 | 53 | z[3:4,2:3] #Subset elements of matrix, 3-4 rows and 2-3 columns 54 | [,1] [,2] 55 | [1,] 7 11 56 | [2,] 8 12 57 | 58 | 59 | z[,2:3] #Subset columns 2 and 3 of z and return subsetted matrix 60 | [,1] [,2] 61 | [1,] 5 9 62 | [2,] 6 10 63 | [3,] 7 11 64 | [4,] 8 12 65 | 66 | z[,1] #Subset the 1st column of z. Note that it returns a vector and drops its matrix status. 
67 | [1] 1 2 3 4 68 | 69 | z[,1,drop=FALSE] #Drop=FALSE keeps it as a 1-column matrix 70 | [,1] 71 | [1,] 1 72 | [2,] 2 73 | [3,] 3 74 | [4,] 4 75 | 76 | 77 | dim(z) #Dimensions of matrix is 4 rows and 3 columns 78 | [1] 4 3 79 | 80 | 81 | ls() #Tells you what is available in your working directory 82 | [1] "Auto" "x" "y" "z" 83 | 84 | rm(y) #Remove y 85 | 86 | ls() #Verify that y is gone 87 | [1] "Auto" "x" "z" 88 | 89 | 90 | ### Generating random data, graphics 91 | x=runif(50) #Random uniform from 0 to 1 92 | y=rnorm(50) #Random normal from 0 to 1 93 | plot(x,y) #plots x and y 94 | plot(x,y,xlab="Random Uniform",ylab="Random Normal",pch="*",col="blue") 95 | par(mfrow=c(2,1)) #Make a panel of plots with 2 rows and 1 column 96 | plot(x,y) 97 | hist(y) #Make a histogram of y 98 | par(mfrow=c(1,1)) #Reset the mfrow command 99 | 100 | 101 | ### Reading in data 102 | #Auto=read.csv("Auto.csv") #Already have data via ISLR package so just call data 103 | data(Auto) 104 | 105 | names(Auto) #Gives names of variables 106 | [1] "mpg" "cylinders" "displacement" "horsepower" "weight" 107 | [6] "acceleration" "year" "origin" "name" 108 | 109 | 110 | dim(Auto) 111 | [1] 392 9 112 | 113 | class(Auto) #What type of object 114 | [1] "data.frame" 115 | 116 | 117 | str(Auto) #Structure of data 118 | 'data.frame': 392 obs. of 9 variables: 119 | $ mpg : num 18 15 18 16 17 15 14 14 14 15 ... 120 | $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ... 121 | $ displacement: num 307 350 318 304 302 429 454 440 455 390 ... 122 | $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ... 123 | $ weight : num 3504 3693 3436 3433 3449 ... 124 | $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ... 125 | $ year : num 70 70 70 70 70 70 70 70 70 70 ... 126 | $ origin : num 1 1 1 1 1 1 1 1 1 1 ... 127 | $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...ls() 128 | 129 | summary(Auto) #Gives summary of data 130 | mpg cylinders displacement horsepower weight 131 | Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613 132 | 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225 133 | Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804 134 | Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978 135 | 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615 136 | Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140 137 | 138 | acceleration year origin name 139 | Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5 140 | 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5 141 | Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5 142 | Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4 143 | 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4 144 | Max. :24.80 Max. :82.00 Max. 
:3.000 chevrolet chevette: 4 145 | (Other) :365 146 | 147 | 148 | plot(Auto$cylinders,Auto$mpg) 149 | plot(Auto$cyl,Auto$mpg) 150 | 151 | attach(Auto) #Attach dataframe, creates a workspace 152 | search() #You can see "Auto" available for direct use so you don't need to use $cylinders or $mpg 153 | [1] ".GlobalEnv" "Auto" "package:ISLR" "tools:RGUI" 154 | [5] "package:stats" "package:graphics" "package:grDevices" "package:utils" 155 | [9] "package:datasets" "package:methods" "Autoloads" "package:base" 156 | 157 | plot(cylinders,mpg) #Use variables directly 158 | cylinders=as.factor(cylinders) #Cast as factor 159 | plot(cylinders,mpg,xlab="Cylinders",ylab="Mpg",col="red") 160 | pdf(file="../mpg.pdf") 161 | plot(cylinders,mpg,xlab="Cylinders",ylab="Mpg",col="red") 162 | dev.off() 163 | quartz 164 | 2 165 | pairs(Auto,col="brown") 166 | pairs(mpg~cylinders+acceleration+weight,Auto) 167 | 168 | #TO SUBSET DATA TO OBTAIN ROWS THAT FALL OUTSIDE OF ROWS 10-85 169 | Auto2 = Auto[-10:-85, ] 170 | 171 | 172 | #TO PARTITION DATASET INTO TRAIN AND TEST SETS 173 | # load the library that contains the data set 174 | library(ISLR) 175 | # readme for data set 176 | ?Auto 177 | # readme for sample() function 178 | ?sample 179 | 180 | # create a vector of row indexes 181 | training <- sample(nrow(Auto), size=200) 182 | 183 | # create training data set with 200 obs 184 | train_set <- Auto[training,] 185 | 186 | # create test data set with the remaing obs 187 | test_set <- Auto[-training,] 188 | 189 | q() -------------------------------------------------------------------------------- /ch3.R: -------------------------------------------------------------------------------- 1 | ### Week2 Chapter 3 Statistical Learning 2 | 3 | library(MASS) #has the datasets 4 | library(ISLR) # has datasets that are used in the book 5 | ### Simple linear regression 6 | names(Boston) #Boston dataset, names of variables 7 | [1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" 8 | [10] "tax" "ptratio" "black" "lstat" "medv" 9 | 10 | ?Boston #Gives description of the dataset, The Boston data frame has 506 rows and 14 columns 11 | plot(medv~lstat,Boston) #(outcome)medv = median value of owner-occupied homes in \$1000s. (predictor)lstat=lower status of the population (percent). 12 | 13 | fit1=lm(medv~lstat,data=Boston) # ~ means "as modeled as" 14 | fit1 15 | Call: 16 | lm(formula = medv ~ lstat, data = Boston) 17 | 18 | Coefficients: 19 | (Intercept) lstat 20 | 34.55 -0.95 #coefficient shows negative relationship 21 | 22 | summary(fit1) 23 | Call: 24 | lm(formula = medv ~ lstat, data = Boston) 25 | 26 | Residuals: 27 | Min 1Q Median 3Q Max 28 | -15.168 -3.990 -1.318 2.034 24.500 29 | 30 | Coefficients: 31 | Estimate Std. Error t value Pr(>|t|) 32 | (Intercept) 34.55384 0.56263 61.41 <2e-16 *** 33 | lstat -0.95005 0.03873 -24.53 <2e-16 *** #statistically significant 34 | --- 35 | Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 36 | 37 | Residual standard error: 6.216 on 504 degrees of freedom 38 | Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432 39 | F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16 40 | 41 | 42 | abline(fit1,col="red") 43 | names(fit1) 44 | [1] "coefficients" "residuals" "effects" "rank" "fitted.values" 45 | [6] "assign" "qr" 46 | 47 | confint(fit1) #Gives the confidence intervals 48 | 2.5 % 97.5 % #This is the 95% confidence interval 49 | (Intercept) 33.448457 35.6592247 50 | lstat -1.026148 -0.8739505 51 | 52 | 53 | predict(fit1,data.frame(lstat=c(5,10,15)),interval="confidence") #Predict function is a method to query a linear model fit 54 | fit lwr upr #Query the 3 values (5, 10, 15) and get predicted values and confidence interval for each 55 | 1 29.80359 29.00741 30.59978 56 | 2 25.05335 24.47413 25.63256 57 | 3 20.30310 19.73159 20.87461 58 | 59 | 60 | 61 | ### Multiple linear regression 62 | fit2=lm(medv~lstat+age,data=Boston) #2 predictors (lstat and age), outcome is medv 63 | summary(fit2) 64 | Call: 65 | lm(formula = medv ~ lstat + age, data = Boston) 66 | 67 | Residuals: 68 | Min 1Q Median 3Q Max 69 | -15.981 -3.978 -1.283 1.968 23.158 70 | 71 | Coefficients: 72 | Estimate Std. Error t value Pr(>|t|) 73 | (Intercept) 33.22276 0.73085 45.458 < 2e-16 *** 74 | lstat -1.03207 0.04819 -21.416 < 2e-16 *** #Statistically significant 75 | age 0.03454 0.01223 2.826 0.00491 ** #Statistically significant but less so than lstat 76 | --- 77 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 78 | 79 | Residual standard error: 6.173 on 503 degrees of freedom 80 | Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495 #R-squared is the percentage of variance explained 81 | F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16 82 | 83 | 84 | fit3=lm(medv~.,Boston) #Use all other variables in the dataset, except medv, which is the outcome/response variable 85 | summary(fit3) 86 | Call: 87 | lm(formula = medv ~ ., data = Boston) # Look at the call to the function--a model with all the predictors 88 | 89 | Residuals: 90 | Min 1Q Median 3Q Max 91 | -15.595 -2.730 -0.518 1.777 26.199 92 | 93 | Coefficients: 94 | Estimate Std. Error t value Pr(>|t|) 95 | (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 *** 96 | crim -1.080e-01 3.286e-02 -3.287 0.001087 ** 97 | zn 4.642e-02 1.373e-02 3.382 0.000778 *** 98 | indus 2.056e-02 6.150e-02 0.334 0.738288 99 | chas 2.687e+00 8.616e-01 3.118 0.001925 ** 100 | nox -1.777e+01 3.820e+00 -4.651 4.25e-06 *** 101 | rm 3.810e+00 4.179e-01 9.116 < 2e-16 *** 102 | age 6.922e-04 1.321e-02 0.052 0.958229 #Age no longer statistically significant 103 | dis -1.476e+00 1.995e-01 -7.398 6.01e-13 *** 104 | rad 3.060e-01 6.635e-02 4.613 5.07e-06 *** 105 | tax -1.233e-02 3.760e-03 -3.280 0.001112 ** 106 | ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 *** 107 | black 9.312e-03 2.686e-03 3.467 0.000573 *** 108 | lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 *** 109 | --- 110 | Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 111 | 112 | Residual standard error: 4.745 on 492 degrees of freedom 113 | Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338 114 | F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16 115 | 116 | 117 | par(mfrow=c(2,2)) #Make a 2x2 layout for the plot, 118 | plot(fit3) #plot residuals against fitted values, take a look at the variance, look for non-linearities 119 | 120 | fit4=update(fit3,~.-age-indus) #update function, use model in fit3 but remove age and indus 121 | summary(fit4) #indus = proportion of non-retail business acres per town. 122 | Call: 123 | lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 124 | tax + ptratio + black + lstat, data = Boston) 125 | 126 | Residuals: 127 | Min 1Q Median 3Q Max 128 | -15.5984 -2.7386 -0.5046 1.7273 26.2373 129 | 130 | Coefficients: 131 | Estimate Std. Error t value Pr(>|t|) 132 | (Intercept) 36.341145 5.067492 7.171 2.73e-12 *** 133 | crim -0.108413 0.032779 -3.307 0.001010 ** 134 | zn 0.045845 0.013523 3.390 0.000754 *** 135 | chas 2.718716 0.854240 3.183 0.001551 ** 136 | nox -17.376023 3.535243 -4.915 1.21e-06 *** 137 | rm 3.801579 0.406316 9.356 < 2e-16 *** #Everything is significant in model 138 | dis -1.492711 0.185731 -8.037 6.84e-15 *** 139 | rad 0.299608 0.063402 4.726 3.00e-06 *** 140 | tax -0.011778 0.003372 -3.493 0.000521 *** 141 | ptratio -0.946525 0.129066 -7.334 9.24e-13 *** 142 | black 0.009291 0.002674 3.475 0.000557 *** 143 | lstat -0.522553 0.047424 -11.019 < 2e-16 *** 144 | --- 145 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 146 | 147 | Residual standard error: 4.736 on 494 degrees of freedom 148 | Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348 149 | F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16 150 | 151 | 152 | ### Nonlinear terms and Interactions 153 | fit5=lm(medv~lstat*age,Boston) #Use interaction term (lstat*age) 154 | summary(fit5) #Model will include 3 predictors (lstat, age, and lstat*age) 155 | Call: 156 | lm(formula = medv ~ lstat * age, data = Boston) 157 | 158 | Residuals: 159 | Min 1Q Median 3Q Max 160 | -15.806 -4.045 -1.333 2.085 27.552 161 | 162 | Coefficients: 163 | Estimate Std. Error t value Pr(>|t|) 164 | (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 *** 165 | lstat -1.3921168 0.1674555 -8.313 8.78e-16 *** #main effect (statistically significant) 166 | age -0.0007209 0.0198792 -0.036 0.9711 #main effect (not statistically significant) 167 | lstat:age 0.0041560 0.0018518 2.244 0.0252 * #interaction effect (denoted by colon) 168 | --- 169 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 170 | 171 | Residual standard error: 6.149 on 502 degrees of freedom 172 | Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531 173 | F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16 174 | 175 | #Use a quadratic term since scatterplot of medv and lstat looked nonlinear 176 | fit6=lm(medv~lstat +I(lstat^2),Boston); summary(fit6) #Need to put quadratic term inside the Identity function I() 177 | 178 | Call: 179 | lm(formula = medv ~ lstat + I(lstat^2), data = Boston) #Notice that Identity function I() shows up in the call 180 | 181 | Residuals: 182 | Min 1Q Median 3Q Max 183 | -15.2834 -3.8313 -0.5295 2.3095 25.4148 184 | 185 | Coefficients: 186 | Estimate Std. 
Error t value Pr(>|t|) 187 | (Intercept) 42.862007 0.872084 49.15 <2e-16 *** 188 | lstat -2.332821 0.123803 -18.84 <2e-16 *** #Linear is significant 189 | I(lstat^2) 0.043547 0.003745 11.63 <2e-16 *** #quadratic is significant 190 | --- 191 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 192 | 193 | Residual standard error: 5.524 on 503 degrees of freedom 194 | Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393 195 | F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16 196 | 197 | 198 | attach(Boston) #Named variables are now available in workspace 199 | 200 | par(mfrow=c(1,1)) #Make a 1x1 layout 201 | plot(medv~lstat) #plot as before the same 2 variables, medv is outcome and lstat is predictor 202 | #Now, include the fitted values of the quadratic fit, fit6, and put in the plot 203 | #Do not use abline since that is used for a straight line 204 | #For each value of lstat, you get a fitted value from the model fit6 205 | points(lstat,fitted(fit6),col="red",pch=20) #Include the quadratic fit with a series of points 206 | 207 | 208 | #Use poly function to fit a polynomial 209 | fit7=lm(medv~poly(lstat,4)) #4th degree polynomial, non-linear in the variable yet linear in the coefficient 210 | points(lstat,fitted(fit7),col="blue",pch=20) #pch means plotting character, 20 is a round ball symbol 211 | 212 | #To see all the plotting characters available to you 213 | plot(1:20,1:20,pch=1:20,cex=2) #cex = 2 means double the size of the plotting character 214 | 215 | 216 | ###Qualitative predictors 217 | fix(Carseats) #pch means plotting character, 20 is a round ball #Throws up a dataframe so that you can see it 218 | names(Carseats) #show all the variable names 219 | [1] "Sales" "CompPrice" "Income" "Advertising" "Population" "Price" 220 | [7] "ShelveLoc" "Age" "Education" "Urban" "US" 221 | 222 | summary(Carseats) 223 | fit1=lm(Sales~.+Income:Advertising+Age:Price,Carseats) 224 | summary(fit1) 225 | Call: 226 | lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats) 227 | 228 | Residuals: 229 | Min 1Q Median 3Q Max 230 | -2.9208 -0.7503 0.0177 0.6754 3.3413 231 | 232 | Coefficients: 233 | Estimate Std. Error t value Pr(>|t|) 234 | (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 *** 235 | CompPrice 0.0929371 0.0041183 22.567 < 2e-16 *** 236 | Income 0.0108940 0.0026044 4.183 3.57e-05 *** 237 | Advertising 0.0702462 0.0226091 3.107 0.002030 ** 238 | Population 0.0001592 0.0003679 0.433 0.665330 239 | Price -0.1008064 0.0074399 -13.549 < 2e-16 *** 240 | ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 *** 241 | ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 *** 242 | Age -0.0579466 0.0159506 -3.633 0.000318 *** 243 | Education -0.0208525 0.0196131 -1.063 0.288361 244 | UrbanYes 0.1401597 0.1124019 1.247 0.213171 245 | USYes -0.1575571 0.1489234 -1.058 0.290729 246 | Income:Advertising 0.0007510 0.0002784 2.698 0.007290 ** 247 | Price:Age 0.0001068 0.0001333 0.801 0.423812 248 | --- 249 | Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 250 | 251 | Residual standard error: 1.011 on 386 degrees of freedom 252 | Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719 253 | F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16 254 | 255 | contrasts(Carseats$ShelveLoc) #ShelveLoc is a qualitative variable 256 | #Contrast function will show you how R will code it 257 | Good Medium #3 level factor means 2 dummy variables need to be made 258 | Bad 0 0 259 | Good 1 0 260 | Medium 0 1 261 | 262 | ###Writing R functions 263 | regplot=function(x,y){ 264 | fit=lm(y~x) 265 | plot(x,y) 266 | abline(fit,col="red") 267 | } 268 | attach(Carseats) #Make the Carseats data available in the data space 269 | regplot(Price,Sales) #Call the regplot function with Price, Sales as variables passed in as arguments 270 | 271 | regplot=function(x,y,...){ #. . . means there are unnamed arguments, can later add extra arguments (regplot call below) 272 | fit=lm(y~x) 273 | plot(x,y,...) 274 | abline(fit,col="red") 275 | } 276 | regplot(Price,Sales,xlab="Price",ylab="Sales",col="blue",pch=20) 277 | 278 | ##QUESTION What is the difference between lm(y ~ x*z) and lm(y ~ I(x*z)), when x and z are both numeric variables? See examples below. 279 | lm(medv~lstat*age) gives interaction effects and main effects 280 | lm(medv~I(lstat*age)) gives interaction effects only 281 | 282 | 283 | > fit15 = lm(medv~lstat*age) 284 | > summary(fit15) 285 | 286 | Call: 287 | lm(formula = medv ~ lstat * age) 288 | 289 | Residuals: 290 | Min 1Q Median 3Q Max 291 | -15.806 -4.045 -1.333 2.085 27.552 292 | 293 | Coefficients: 294 | Estimate Std. Error t value Pr(>|t|) 295 | (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 *** 296 | lstat -1.3921168 0.1674555 -8.313 8.78e-16 *** #main effects 297 | age -0.0007209 0.0198792 -0.036 0.9711 #main effects 298 | lstat:age 0.0041560 0.0018518 2.244 0.0252 * #interaction effects 299 | --- 300 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 301 | 302 | Residual standard error: 6.149 on 502 degrees of freedom 303 | Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531 304 | F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16 305 | 306 | > fit16 = lm(medv~I(lstat*age)) 307 | > summary(fit16) 308 | 309 | Call: 310 | lm(formula = medv ~ I(lstat * age)) 311 | 312 | Residuals: 313 | Min 1Q Median 3Q Max 314 | -13.347 -4.372 -1.534 1.914 27.193 315 | 316 | Coefficients: 317 | Estimate Std. Error t value Pr(>|t|) 318 | (Intercept) 30.1588631 0.4828240 62.46 <2e-16 *** 319 | I(lstat * age) -0.0077146 0.0003799 -20.31 <2e-16 *** #Interaction effects only 320 | --- 321 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 322 | 323 | Residual standard error: 6.827 on 504 degrees of freedom 324 | Multiple R-squared: 0.4501, Adjusted R-squared: 0.449 325 | F-statistic: 412.4 on 1 and 504 DF, p-value: < 2.2e-16 326 | 327 | 328 | 329 | 330 | 331 | -------------------------------------------------------------------------------- /ch4.R: -------------------------------------------------------------------------------- 1 | ### Week 3 Chapter 4 Statistical Learning 2 | 3 | require(ISLR) #require command is similar to libary command; load library, has the dataset on Stock Market 4 | names(Smarket) #gives names of variables; Direction is the response(outcome) variable 5 | [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5" "Volume" 6 | [8] "Today" "Direction" 7 | 8 | summary(Smarket) 9 | Year Lag1 Lag2 Lag3 10 | Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. 
:-4.922000 11 | 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000 12 | Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500 13 | Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716 14 | 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750 15 | Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000 16 | Lag4 Lag5 Volume Today Direction 17 | Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000 Down:602 18 | 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500 Up :648 19 | Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500 20 | Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138 21 | 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750 22 | Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000 23 | 24 | #Lag1-Lag5 gives previous day's price 25 | 26 | ?Smarket # Daily percentage returns for the S&P 500 stock index between 2001 and 2005. 27 | 28 | pairs(Smarket,col=Smarket$Direction) #Create a matrix of scatterplots and make color contrast for the binary classes 29 | # Binary Logistic regression model, using Direction as outcome variable 30 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, 31 | data=Smarket,family=binomial) 32 | summary(glm.fit) #Gives summary output 33 | Call: 34 | glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 35 | Volume, family = binomial, data = Smarket) 36 | 37 | Deviance Residuals: 38 | Min 1Q Median 3Q Max 39 | -1.446 -1.203 1.065 1.145 1.326 40 | 41 | Coefficients: 42 | Estimate Std. Error z value Pr(>|z|) 43 | (Intercept) -0.126000 0.240736 -0.523 0.601 44 | Lag1 -0.073074 0.050167 -1.457 0.145 45 | Lag2 -0.042301 0.050086 -0.845 0.398 #Nothing is significant in model 46 | Lag3 0.011085 0.049939 0.222 0.824 47 | Lag4 0.009359 0.049974 0.187 0.851 48 | Lag5 0.010313 0.049511 0.208 0.835 49 | Volume 0.135441 0.158360 0.855 0.392 50 | 51 | (Dispersion parameter for binomial family taken to be 1) 52 | 53 | Null deviance: 1731.2 on 1249 degrees of freedom #Null deviance is the deviance of mean 54 | Residual deviance: 1727.6 on 1243 degrees of freedom #Deviance of residuals 55 | AIC: 1741.6 56 | 57 | Number of Fisher Scoring iterations: 3 58 | 59 | 60 | glm.probs=predict(glm.fit,type="response") #Give predicted probabilities with command type = "response" 61 | glm.probs[1:5] #Show the 1st five observations (rows) of predicted values 62 | 1 2 3 4 5 63 | 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 64 | 65 | glm.pred=ifelse(glm.probs>0.5,"Up","Down") #Set threshold at 0.5. 
If True that > 0.5, then "Up"; else, "Down" 66 | 67 | attach(Smarket) 68 | table(glm.pred,Direction) #glm.pred is the predicted values while Direction is the true values 69 | 70 | Direction #This is a Confusion Matrix 71 | glm.pred Down Up 72 | Down 145 141 # The diagonal values are the correct classifications 73 | Up 457 507 # The off-diagonal values are the incorrect classifications 74 | 75 | mean(glm.pred==Direction) #This gives the classification rate of model 76 | [1] 0.5216 # Performed slightly better than chance (0.50) 77 | 78 | 79 | 80 | # Make training and test set 81 | train = Year<2005 #Create training data set 82 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, 83 | data=Smarket,family=binomial, subset=train) #The subset = train where train is < 2005 84 | 85 | #Use the remaining !train for the test data set 86 | glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") type = "response" #gives predicted probabilities 87 | 88 | glm.pred=ifelse(glm.probs >0.5,"Up","Down") #Make an up/down variable, if true, then "Up", else false, then "Down" 89 | Direction.2005=Smarket$Direction[!train] #Make a variable of the true values 90 | 91 | table(glm.pred,Direction.2005) #Give table of classification of test data set with predicted against true values 92 | Direction.2005 93 | glm.pred Down Up #This is a Confusion Matrix 94 | Down 77 97 #Off-diagonal values are the mistakes 95 | Up 34 44 #Diagonal values are where we got it right 96 | 97 | mean(glm.pred==Direction.2005) #Classification rate 98 | [1] 0.4801587 #The mean is less than 50%, which is the null rate, which could be overfitting of model 99 | 100 | 101 | #Fit smaller model (only include 2 predictors, lag1 and lag2) 102 | glm.fit=glm(Direction~Lag1+Lag2, 103 | data=Smarket,family=binomial, subset=train) 104 | summary(glm.fit) 105 | Call: 106 | glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Smarket, 107 | subset = train) 108 | 109 | Deviance Residuals: 110 | Min 1Q Median 3Q Max 111 | -1.345 -1.188 1.074 1.164 1.326 112 | 113 | Coefficients: 114 | Estimate Std. 
Error z value Pr(>|z|) 115 | (Intercept) 0.03222 0.06338 0.508 0.611 116 | Lag1 -0.05562 0.05171 -1.076 0.282 117 | Lag2 -0.04449 0.05166 -0.861 0.389 118 | 119 | (Dispersion parameter for binomial family taken to be 1) 120 | 121 | Null deviance: 1383.3 on 997 degrees of freedom #Deviance of the mean 122 | Residual deviance: 1381.4 on 995 degrees of freedom #Residual deviance 123 | AIC: 1387.4 124 | 125 | Number of Fisher Scoring iterations: 3 126 | 127 | glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") 128 | glm.pred=ifelse(glm.probs >0.5,"Up","Down") 129 | table(glm.pred,Direction.2005) 130 | 131 | Direction.2005 #Confusion Matrix 132 | glm.pred Down Up 133 | Down 35 35 #Correct classifications are on the diagonal 134 | Up 76 106 #Incorrect classifcations are on the off-diagonal 135 | mean(glm.pred==Direction.2005) #This is the classification rate of model 136 | [1] 0.5595238 #Mean shows a better classification rate than previous dataset 137 | 138 | 106/(76+106) 139 | [1] 0.5824176 #Correct classification rate for "Up" when predicting "Up" is 0.58 140 | 141 | 142 | 143 | require (ISLR) # This has the dataset on the Stock market 144 | require(MASS) # This has the Linear Discriminant Analysis commands that we need 145 | ## Linear Discriminant Analysis 146 | lda.fit=lda(Direction~Lag1+Lag2,data=Smarket, subset=Year<2005) 147 | lda.fit 148 | Call: 149 | lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = Year < 150 | 2005) 151 | 152 | Prior probabilities of groups: #Gives proportions of Up and Down in dataset 153 | Down Up 154 | 0.491984 0.508016 155 | 156 | Group means: #Summary of group means 157 | Lag1 Lag2 158 | Down 0.04279022 0.03389409 159 | Up -0.03954635 -0.03132544 160 | 161 | Coefficients of linear discriminants: #LDA coefficients 162 | LD1 163 | Lag1 -0.6420190 164 | Lag2 -0.5135293 165 | 166 | plot(lda.fit) 167 | #Create a test dataset 168 | Smarket.2005=subset(Smarket,Year==2005) #dataframe is Smarket, subset out the observations with Year is 2005 169 | lda.pred=predict(lda.fit,Smarket.2005) 170 | lda.pred[1:5,] #Look at the first 5 observations of the predicted values 171 | Error in lda.pred[1:5, ] : incorrect number of dimensions #Error because data is not in matrix form 172 | class(lda.pred) #What form is the dataset in? List. 173 | [1] "list" 174 | data.frame(lda.pred)[1:5,] #Cast to a dataframe and then try to look at first 5 observations 175 | class posterior.Down posterior.Up LD1 176 | 999 Up 0.4901792 0.5098208 0.08293096 177 | 1000 Up 0.4792185 0.5207815 0.59114102 178 | 1001 Up 0.4668185 0.5331815 1.16723063 179 | 1002 Up 0.4740011 0.5259989 0.83335022 180 | 1003 Up 0.4927877 0.5072123 -0.03792892 181 | table(lda.pred$class,Smarket.2005$Direction) #Create a table of the predicted classes against the true classes 182 | Down Up #This is a Confusion Matrix 183 | Down 35 35 #Diagonal elements are the correct classification 184 | Up 76 106 #Off-diagonal elements are the mistakes 185 | mean(lda.pred$class==Smarket.2005$Direction) #Classification rate of the model, which is 56% 186 | [1] 0.5595238 #Model performs 6% better than chance (flipping a coin) 187 | 188 | 189 | 190 | ## K-Nearest Neighbors (very handy to have in toolbox, effective most of the time) 191 | library(class) 192 | ?knn 193 | k-nearest neighbour classification for test set from training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. 
If there are ties for the kth nearest vector, all candidates are included in the vote. 194 | attach(Smarket) #Put variable names in workspace 195 | ls() # Check to see it is there 196 | Xlag=cbind(Lag1,Lag2) #Make a matrix with lag1 and lag2 197 | Xlag[1:5, ] #Take a look at first five observations 198 | train=Year<2005 # Designate train as Year less than 2005 199 | #Make a k-Nearest Neighbor model where knn(x-variables from train, x-variables from test, class labels from train, k = , ) 200 | knn.pred=knn(Xlag[train,],Xlag[!train,],Direction[train],k=1) 201 | table(knn.pred,Direction[!train]) #Generate a Confusion Matrix with predicted againsted true values 202 | knn.pred Down Up 203 | Down 43 58 204 | Up 68 83 205 | mean(knn.pred==Direction[!train]) #Obtain the mean, which gives model performance on prediction 206 | [1] 0.5 #Model prediction rate is 50%, which is the same as flipping a coin 207 | -------------------------------------------------------------------------------- /ch5.R: -------------------------------------------------------------------------------- 1 | ### Week 4 Chapter 5 Statistical Learning 2 | 3 | # LOOCV, K-Fold Cross-Validation, and the Bootstrap (Resampling methods) 4 | require(ISLR) # require function is bool and will return false if pkg does not exist 5 | require(boot) # Use the boot package 6 | ?cv.glm # This function calculates the estimated K-fold cross-validation prediction error for generalized linear models. 7 | plot(mpg~horsepower,data=Auto) # As horsepower increases, mpg decreases 8 | 9 | ## LOOCV (Leave One Out Cross-Validation) 10 | glm.fit=glm(mpg~horsepower, data=Auto) # Fits a linear model if family is not specified (can use glm) 11 | # Fits a model repeatedly n times while leaving out 1 observation each time. Makes a fit on all other data. 12 | # Makes a prediction on the x observation that was left out. 13 | cv.glm(Auto,glm.fit)$delta #pretty slow (doesnt use formula (5.2) on page 180) 14 | [1] 24.23151 24.23114 15 | # Delta is vector of length two. The first component is the raw cross-validation estimate of prediction error. 16 | # The second component is the adjusted cross-validation estimate. 17 | # The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation. 18 | 19 | ##Lets write a simple function to use formula (5.2) Misclassification error: 1/n* SUMMATION(yi - yhat(-i))^2 20 | loocv=function(fit){ # How much does i contribute to its own fit? A measure of self-influence of i. 21 | h=lm.influence(fit)$h # This is the formula computed: 1/n* SUMMATION(yi - yhati)^2/(1-Hii)^2 22 | mean((residuals(fit)/(1-h))^2) # h = Hii is the diagonal of the hat matrix 23 | } 24 | ## Now we try it out 25 | loocv(glm.fit) 26 | [1] 24.23151 27 | 28 | cv.error=rep(0,5) # Create vector to collect the errors 29 | degree=1:5 # Now fit polynomials of different degrees to the data 30 | for(d in degree){ 31 | glm.fit=glm(mpg~poly(horsepower,d), data=Auto) 32 | cv.error[d]=loocv(glm.fit) # Compute error and put in error vector 33 | } 34 | plot(degree,cv.error,type="b") # Plot error against degree. It shows quadratic does the best in terms of lowest error 35 | 36 | ## 10-fold CV Divide data into 10 parts. 1 part is test set. 9 parts is training set. Fits the model 10 times, switching out test set. 
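# A hedged, manual sketch of what cv.glm(..., K=10) does for a single model
# (the fold assignment and object names below are illustrative):
set.seed(17)
folds = sample(rep(1:10, length = nrow(Auto)))   # randomly assign each observation to one of 10 folds
cv10 = sapply(1:10, function(k){
  fit = glm(mpg ~ horsepower, data = Auto[folds != k, ])               # train on the other 9 folds
  mean((Auto$mpg[folds == k] - predict(fit, Auto[folds == k, ]))^2)    # test MSE on the held-out fold
})
mean(cv10)   # 10-fold CV estimate for the degree-1 fit; compare with cv.error10[1] below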
37 | 38 | cv.error10=rep(0,5) # Create a vector to accept the errors 39 | for(d in degree){ 40 | glm.fit=glm(mpg~poly(horsepower,d), data=Auto) 41 | cv.error10[d]=cv.glm(Auto,glm.fit,K=10)$delta[1] # Specify K = 10 for 10 folds 42 | } 43 | lines(degree,cv.error10,type="b",col="red") # Similar story as before. 44 | 45 | # If given a choice, pick K-fold CV over LOOCV since it is cheaper to compute. 46 | 47 | 48 | 49 | ## Bootstrap--repeated, random sampling of training observations. Invented by Brad Efron. 50 | ## Some observations can be represented more than once and some not at all. 51 | ## Minimum risk investment - Section 5.2 2 investments called X and Y 52 | 53 | alpha=function(x,y){ # VAR(X), VAR(Y) 54 | vx=var(x) # alpha = VAR(Y) - COV(X,Y)/VAR(X)+VAR(Y)-2COV(X,Y) 55 | vy=var(y) # What is the standard error (sampling variablity) of alpha? Don't know. 56 | cxy=cov(x,y) # Need bootstrap 57 | (vy-cxy)/(vx+vy-2*cxy) 58 | } 59 | alpha(Portfolio$X,Portfolio$Y) # Portfolio is the dataset 60 | [1] 0.5758321 61 | 62 | ## What is the standard error of alpha? 63 | ## Need a wrapper that allows the bootstrap to work. Takes a dataframe and an index (over the rows, 1 to n, can be repeats) 64 | alpha.fn=function(data, index){ # index tells you which observations gets represented in the sampling 65 | with(data[index,],alpha(X,Y)) # with is a function that takes dataframe as 1st argument. It says 66 | } # using this data in the dataframe, execute these commands. 67 | 68 | alpha.fn(Portfolio,1:100) 69 | [1] 0.5758321 70 | 71 | #Now, run the bootstrap. Set the random number seed. 72 | set.seed(1) 73 | alpha.fn (Portfolio,sample(1:100,100,replace=TRUE)) # Sample 1 to 100 with replacement 74 | [1] 0.5963833 75 | 76 | boot.out=boot(Portfolio,alpha.fn,R=1000) # Do a 1000 bootstraps 77 | boot.out # Gives a summary of the bootstrap 78 | ORDINARY NONPARAMETRIC BOOTSTRAP 79 | 80 | 81 | Call: 82 | boot(data = Portfolio, statistic = alpha.fn, R = 1000) 83 | 84 | 85 | Bootstrap Statistics : 86 | original bias std. error # Gives estimate, bias, and standard error 87 | t1* 0.5758321 -7.315422e-05 0.08861826 88 | 89 | plot(boot.out) # histogram (symmetrical distribution, Gaussian), QQ-plot (lines up as a straight line: Gaussian) 90 | 91 | # The bootstrap can give you reliable estimates of standard error from nasty statistics. 92 | -------------------------------------------------------------------------------- /ch6.Rmd: -------------------------------------------------------------------------------- 1 | Model Selection 2 | ================ 3 | 4 | This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages, 5 | and a very nice way of distributing an analysis. It has some very simple syntax rules. 6 | 7 | 8 | ```{r} 9 | library(ISLR) 10 | summary(Hitters) 11 | ``` 12 | There are some missing values here, so before we proceed we will remove them: 13 | 14 | ```{r} 15 | Hitters=na.omit(Hitters) 16 | with(Hitters,sum(is.na(Salary))) 17 | ``` 18 | 19 | 20 | 21 | Best Subset regression 22 | ------------------------ 23 | We will now use the package `leaps` to evaluate all the best-subset models. 24 | ```{r} 25 | library(leaps) 26 | regfit.full=regsubsets(Salary~.,data=Hitters) 27 | summary(regfit.full) 28 | ``` 29 | It gives by default best-subsets up to size 8; lets increase that to 19, i.e. 
all the variables 30 | ```{r} 31 | regfit.full=regsubsets(Salary~.,data=Hitters, nvmax=19) 32 | reg.summary=summary(regfit.full) 33 | names(reg.summary) 34 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp") 35 | which.min(reg.summary$cp) 36 | points(10,reg.summary$cp[10],pch=20,col="red") 37 | ``` 38 | There is a plot method for the `regsubsets` object. 39 | ```{r} 40 | plot(regfit.full,scale="Cp") 41 | coef(regfit.full,10) 42 | ``` 43 | 44 | 45 | 46 | Forward Stepwise Selection 47 | -------------------------- 48 | Here we use the `regsubsets` function but specify the `method="forward"` option: 49 | ```{r} 50 | regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward") 51 | summary(regfit.fwd) 52 | plot(regfit.fwd,scale="Cp") 53 | ``` 54 | 55 | 56 | 57 | 58 | Model Selection Using a Validation Set 59 | --------------------------------------- 60 | Lets make a training and validation set, so that we can choose a good subset model. 61 | We will do it using a slightly different approach from what was done in the book. 62 | ```{r} 63 | dim(Hitters) 64 | set.seed(1) 65 | train=sample(seq(263),180,replace=FALSE) 66 | train 67 | regfit.fwd=regsubsets(Salary~.,data=Hitters[train,],nvmax=19,method="forward") 68 | ``` 69 | Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no predict method for `regsubsets`. 70 | ```{r} 71 | val.errors=rep(NA,19) 72 | x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index! 73 | for(i in 1:19){ 74 | coefi=coef(regfit.fwd,id=i) 75 | pred=x.test[,names(coefi)]%*%coefi 76 | val.errors[i]=mean((Hitters$Salary[-train]-pred)^2) 77 | } 78 | plot(sqrt(val.errors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b") 79 | points(sqrt(regfit.fwd$rss[-1]/180),col="blue",pch=19,type="b") 80 | legend("topright",legend=c("Training","Validation"),col=c("blue","black"),pch=19) 81 | ``` 82 | As we expect, the training error goes down monotonically as the model gets bigger, but not so 83 | for the validation error. 84 | 85 | This was a little tedious - not having a predict method for `regsubsets`. So we will write one! 86 | ```{r} 87 | predict.regsubsets=function(object,newdata,id,...){ 88 | form=as.formula(object$call[[2]]) 89 | mat=model.matrix(form,newdata) 90 | coefi=coef(object,id=id) 91 | mat[,names(coefi)]%*%coefi 92 | } 93 | ``` 94 | 95 | 96 | 97 | 98 | Model Selection by Cross-Validation 99 | ----------------------------------- 100 | We will do 10-fold cross-validation. It's really easy! 101 | ```{r} 102 | set.seed(11) 103 | folds=sample(rep(1:10,length=nrow(Hitters))) 104 | folds 105 | table(folds) 106 | cv.errors=matrix(NA,10,19) 107 | for(k in 1:10){ 108 | best.fit=regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax=19,method="forward") 109 | for(i in 1:19){ 110 | pred=predict(best.fit,Hitters[folds==k,],id=i) 111 | cv.errors[k,i]=mean( (Hitters$Salary[folds==k]-pred)^2) 112 | } 113 | } 114 | rmse.cv=sqrt(apply(cv.errors,2,mean)) 115 | plot(rmse.cv,pch=19,type="b") 116 | ``` 117 | 118 | 119 | 120 | Ridge Regression and the Lasso 121 | ------------------------------- 122 | We will use the package `glmnet`, which does not use the model formula language, so we will set up an `x` and `y`. 123 | ```{r} 124 | library(glmnet) 125 | x=model.matrix(Salary~.-1,data=Hitters) 126 | y=Hitters$Salary 127 | ``` 128 | First we will fit a ridge-regression model.
This is achieved by calling `glmnet` with `alpha=0` (see the helpfile). There is also a `cv.glmnet` function which will do the cross-validation for us. 129 | ```{r} 130 | fit.ridge=glmnet(x,y,alpha=0) 131 | plot(fit.ridge,xvar="lambda",label=TRUE) 132 | cv.ridge=cv.glmnet(x,y,alpha=0) 133 | plot(cv.ridge) 134 | ``` 135 | Now we fit a lasso model; for this we use the default `alpha=1` 136 | ```{r} 137 | fit.lasso=glmnet(x,y) 138 | plot(fit.lasso,xvar="lambda",label=TRUE) 139 | cv.lasso=cv.glmnet(x,y) 140 | plot(cv.lasso) 141 | coef(cv.lasso) 142 | ``` 143 | 144 | Suppose we want to use our earlier train/validation division to select the `lambda` for the lasso. 145 | This is easy to do. 146 | ```{r} 147 | lasso.tr=glmnet(x[train,],y[train]) 148 | lasso.tr 149 | pred=predict(lasso.tr,x[-train,]) 150 | dim(pred) 151 | rmse= sqrt(apply((y[-train]-pred)^2,2,mean)) 152 | plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)") 153 | lam.best=lasso.tr$lambda[order(rmse)[1]] 154 | lam.best 155 | coef(lasso.tr,s=lam.best) 156 | ``` 157 | -------------------------------------------------------------------------------- /ch7.Rmd: -------------------------------------------------------------------------------- 1 | Nonlinear Models 2 | ======================================================== 3 | Here we explore the use of nonlinear models using some tools in R 4 | 5 | ```{r} 6 | require(ISLR) 7 | attach(Wage) 8 | ``` 9 | 10 | Polynomials 11 | ------------ 12 | 13 | First we will use polynomials, and focus on a single predictor age: 14 | 15 | ```{r} 16 | fit=lm(wage~poly(age,4),data=Wage) 17 | summary(fit) 18 | ``` 19 | 20 | The `poly()` function generates a basis of *orthogonal polynomials*. 21 | Lets make a plot of the fitted function, along with the standard errors of the fit. 22 | 23 | ```{r fig.width=7, fig.height=6} 24 | agelims=range(age) 25 | age.grid=seq(from=agelims[1],to=agelims[2]) 26 | preds=predict(fit,newdata=list(age=age.grid),se=TRUE) 27 | se.bands=cbind(preds$fit+2*preds$se,preds$fit-2*preds$se) 28 | plot(age,wage,col="darkgrey") 29 | lines(age.grid,preds$fit,lwd=2,col="blue") 30 | matlines(age.grid,se.bands,col="blue",lty=2) 31 | ``` 32 | 33 | There are other more direct ways of doing this in R. For example 34 | 35 | ```{r} 36 | fita=lm(wage~age+I(age^2)+I(age^3)+I(age^4),data=Wage) 37 | summary(fita) 38 | ``` 39 | 40 | Here `I()` is a *wrapper* function; we need it because `age^2` means something to the formula language, 41 | while `I(age^2)` is protected. 42 | The coefficients are different to those we got before! However, the fits are the same: 43 | 44 | ```{r} 45 | plot(fitted(fit),fitted(fita)) 46 | ``` 47 | 48 | By using orthogonal polynomials in this simple way, it turns out that we can separately test 49 | for each coefficient. So if we look at the summary again, we can see that the linear, quadratic 50 | and cubic terms are significant, but not the quartic. 51 | 52 | ```{r} 53 | summary(fit) 54 | ``` 55 | 56 | This only works with linear regression, and if there is a single predictor. In general we would use `anova()` 57 | as this next example demonstrates. 58 | 59 | ```{r} 60 | fita=lm(wage~education,data=Wage) 61 | fitb=lm(wage~education+age,data=Wage) 62 | fitc=lm(wage~education+poly(age,2),data=Wage) 63 | fitd=lm(wage~education+poly(age,3),data=Wage) 64 | anova(fita,fitb,fitc,fitd) 65 | 66 | ``` 67 | 68 | ### Polynomial logistic regression 69 | 70 | Now we fit a logistic regression model to a binary response variable, 71 | constructed from `wage`. 
We code the big earners (`>250K`) as 1, else 0. 72 | 73 | ```{r} 74 | fit=glm(I(wage>250) ~ poly(age,3), data=Wage, family=binomial) 75 | summary(fit) 76 | preds=predict(fit,list(age=age.grid),se=T) 77 | se.bands=preds$fit + cbind(fit=0,lower=-2*preds$se,upper=2*preds$se) 78 | se.bands[1:5,] 79 | ``` 80 | 81 | We have done the computations on the logit scale. To transform we need to apply the inverse logit 82 | mapping 83 | $$p=\frac{e^\eta}{1+e^\eta}.$$ 84 | (Here we have used the ability of MarkDown to interpret TeX expressions.) 85 | We can do this simultaneously for all three columns of `se.bands`: 86 | 87 | ```{r} 88 | prob.bands=exp(se.bands)/(1+exp(se.bands)) 89 | matplot(age.grid,prob.bands,col="blue",lwd=c(2,1,1),lty=c(1,2,2),type="l",ylim=c(0,.1)) 90 | points(jitter(age),I(wage>250)/10,pch="|",cex=.5) 91 | ``` 92 | 93 | Splines 94 | ------- 95 | Splines are more flexible than polynomials, but the idea is rather similar. 96 | Here we will explore cubic splines. 97 | 98 | ```{r} 99 | require(splines) 100 | fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage) 101 | plot(age,wage,col="darkgrey") 102 | lines(age.grid,predict(fit,list(age=age.grid)),col="darkgreen",lwd=2) 103 | abline(v=c(25,40,60),lty=2,col="darkgreen") 104 | ``` 105 | 106 | The smoothing spline does not require knot selection, but it does have a smoothing parameter, 107 | which can conveniently be specified via the effective degrees of freedom or `df`. 108 | 109 | ```{r} 110 | fit=smooth.spline(age,wage,df=16) 111 | lines(fit,col="red",lwd=2) 112 | ``` 113 | 114 | Or we can use LOO cross-validation to select the smoothing parameter for us automatically: 115 | 116 | ```{r} 117 | fit=smooth.spline(age,wage,cv=TRUE) 118 | lines(fit,col="purple",lwd=2) 119 | fit 120 | ``` 121 | 122 | Generalized Additive Models 123 | --------------------------- 124 | 125 | So far we have focused on fitting models with mostly single nonlinear terms. 126 | The `gam` package makes it easier to work with multiple nonlinear terms. In addition 127 | it knows how to plot these functions and their standard errors. 128 | 129 | ```{r fig.width=10, fig.height=5} 130 | require(gam) 131 | gam1=gam(wage~s(age,df=4)+s(year,df=4)+education,data=Wage) 132 | par(mfrow=c(1,3)) 133 | plot(gam1,se=T) 134 | gam2=gam(I(wage>250)~s(age,df=4)+s(year,df=4)+education,data=Wage,family=binomial) 135 | plot(gam2) 136 | ``` 137 | 138 | Lets see if we need a nonlinear term for year 139 | 140 | ```{r} 141 | gam2a=gam(I(wage>250)~s(age,df=4)+year+education,data=Wage,family=binomial) 142 | anova(gam2a,gam2,test="Chisq") 143 | ``` 144 | 145 | One nice feature of the `gam` package is that it knows how to plot the functions nicely, 146 | even for models fit by `lm` and `glm`. 147 | 148 | ```{r fig.width=10, fig.height=5} 149 | par(mfrow=c(1,3)) 150 | lm1=lm(wage~ns(age,df=4)+ns(year,df=4)+education,data=Wage) 151 | plot.gam(lm1,se=T) 152 | ``` 153 | 154 | 155 | 156 | 157 | 158 | 159 | -------------------------------------------------------------------------------- /ch8.Rmd: -------------------------------------------------------------------------------- 1 | Decision Trees 2 | ======================================================== 3 | 4 | We will have a look at the `Carseats` data using the `tree` package in R, as in the lab in the book. 5 | We create a binary response variable `High` (for high sales), and we include it in the same dataframe.
6 | ```{r} 7 | require(ISLR) 8 | require(tree) 9 | attach(Carseats) 10 | hist(Sales) 11 | High=ifelse(Sales<=8,"No","Yes") 12 | Carseats=data.frame(Carseats, High) 13 | ``` 14 | Now we fit a tree to these data, and summarize and plot it. Notice that we have to _exclude_ `Sales` from the right-hand side of the formula, because the response is derived from it. 15 | ```{r} 16 | tree.carseats=tree(High~.-Sales,data=Carseats) 17 | summary(tree.carseats) 18 | plot(tree.carseats) 19 | text(tree.carseats,pretty=0) 20 | ``` 21 | For a detailed summary of the tree, print it: 22 | ```{r} 23 | tree.carseats 24 | ``` 25 | Lets create a training and test set (250,150) split of the 400 observations, grow the tree on the training set, and evaluate its performance on the test set. 26 | ```{r} 27 | set.seed(1011) 28 | train=sample(1:nrow(Carseats),250) 29 | tree.carseats=tree(High~.-Sales,Carseats,subset=train) 30 | plot(tree.carseats);text(tree.carseats,pretty=0) 31 | tree.pred=predict(tree.carseats,Carseats[-train,],type="class") 32 | with(Carseats[-train,],table(tree.pred,High)) 33 | (72+33)/150 34 | ``` 35 | This tree was grown to full depth, and might be too variable. We now use CV to prune it. 36 | ```{r} 37 | cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass) 38 | cv.carseats 39 | plot(cv.carseats) 40 | prune.carseats=prune.misclass(tree.carseats,best=13) 41 | plot(prune.carseats);text(prune.carseats,pretty=0) 42 | ``` 43 | Now lets evaluate this pruned tree on the test data. 44 | ```{r} 45 | tree.pred=predict(prune.carseats,Carseats[-train,],type="class") 46 | with(Carseats[-train,],table(tree.pred,High)) 47 | (72+32)/150 48 | ``` 49 | It has done about the same as our original tree. So pruning did not hurt us wrt misclassification errors, and gave us a simpler tree. 50 | 51 | Random Forests and Boosting 52 | ============================ 53 | 54 | These methods use trees as building blocks to build more complex models. Here we will use the Boston housing data to explore random forests and boosting. These data are in the `MASS` package. 55 | It gives housing values and other statistics in each of 506 suburbs of Boston based on a 1970 census. 56 | 57 | Random Forests 58 | -------------- 59 | Random forests build lots of bushy trees, and then average them to reduce the variance. 60 | 61 | ```{r} 62 | require(randomForest) 63 | require(MASS) 64 | set.seed(101) 65 | dim(Boston) 66 | train=sample(1:nrow(Boston),300) 67 | ?Boston 68 | ``` 69 | Lets fit a random forest and see how well it performs. We will use the response `medv`, the median housing value (in \$1K dollars) 70 | 71 | ```{r} 72 | rf.boston=randomForest(medv~.,data=Boston,subset=train) 73 | rf.boston 74 | ``` 75 | The MSR and % variance explained are based on OOB or _out-of-bag_ estimates, a very clever device in random forests to get honest error estimates. The model reports that `mtry=4`, which is the number of variables randomly chosen at each split. Since $p=13$ here, we could try all 13 possible values of `mtry`. We will do so, record the results, and make a plot. 
76 | 77 | ```{r} 78 | oob.err=double(13) 79 | test.err=double(13) 80 | for(mtry in 1:13){ 81 | fit=randomForest(medv~.,data=Boston,subset=train,mtry=mtry,ntree=400) 82 | oob.err[mtry]=fit$mse[400] 83 | pred=predict(fit,Boston[-train,]) 84 | test.err[mtry]=with(Boston[-train,],mean((medv-pred)^2)) 85 | cat(mtry," ") 86 | } 87 | matplot(1:mtry,cbind(test.err,oob.err),pch=19,col=c("red","blue"),type="b",ylab="Mean Squared Error") 88 | legend("topright",legend=c("Test","OOB"),pch=19,col=c("red","blue")) 89 | ``` 90 | 91 | Not too difficult! Although the test-error curve drops below the OOB curve, these are estimates based on data, and so have their own standard errors (which are typically quite large). Notice that the points at the end with `mtry=13` correspond to bagging. 92 | 93 | Boosting 94 | -------- 95 | Boosting builds lots of smaller trees. Unlike random forests, each new tree in boosting tries to patch up the deficiencies of the current ensemble. 96 | ```{r} 97 | require(gbm) 98 | boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=10000,shrinkage=0.01,interaction.depth=4) 99 | summary(boost.boston) 100 | plot(boost.boston,i="lstat") 101 | plot(boost.boston,i="rm") 102 | ``` 103 | Lets make a prediction on the test set. With boosting, the number of trees is a tuning parameter, and if we have too many we can overfit. So we should use cross-validation to select the number of trees. We will leave this as an exercise. Instead, we will compute the test error as a function of the number of trees, and make a plot. 104 | 105 | ```{r} 106 | n.trees=seq(from=100,to=10000,by=100) 107 | predmat=predict(boost.boston,newdata=Boston[-train,],n.trees=n.trees) 108 | dim(predmat) 109 | berr=with(Boston[-train,],apply( (predmat-medv)^2,2,mean)) 110 | plot(n.trees,berr,pch=19,ylab="Mean Squared Error", xlab="# Trees",main="Boosting Test Error") 111 | abline(h=min(test.err),col="red") 112 | ``` 113 | 114 | 115 | 116 | -------------------------------------------------------------------------------- /ch9.Rmd: -------------------------------------------------------------------------------- 1 | SVM 2 | ======================================================== 3 | To demonstrate the SVM, it is easiest to work in low dimensions, so we can see the data. 4 | 5 | Linear SVM classifier 6 | --------------------- 7 | Lets generate some data in two dimensions, and make them a little separated. 8 | ```{r} 9 | set.seed(10111) 10 | x=matrix(rnorm(40),20,2) 11 | y=rep(c(-1,1),c(10,10)) 12 | x[y==1,]=x[y==1,]+1 13 | plot(x,col=y+3,pch=19) 14 | ``` 15 | Now we will load the package `e1071` which contains the `svm` function we will use. We then compute the fit. Notice that we have to specify a `cost` parameter, which is a tuning parameter. 16 | ```{r} 17 | library(e1071) 18 | dat=data.frame(x,y=as.factor(y)) 19 | svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE) 20 | print(svmfit) 21 | plot(svmfit,dat) 22 | ``` 23 | As mentioned in the chapter, the plot function is somewhat crude, and plots X2 on the horizontal axis (unlike what R would do automatically for a matrix). Lets see how we might make our own plot. 24 | 25 | The first thing we will do is make a grid of values for X1 and X2. We will write a function to do that, 26 | in case we want to reuse it. It uses the handy function `expand.grid`, and produces the coordinates of `n*n` points on a lattice covering the domain of `x`. Having made the lattice, we make a prediction at each point on the lattice.
We then plot the lattice, color-coded according to the classification. Now we can see the decision boundary. 27 | 28 | The support points (points on the margin, or on the wrong side of the margin) are indexed in the `$index` component of the fit. 29 | 30 | ```{r} 31 | make.grid=function(x,n=75){ 32 | grange=apply(x,2,range) 33 | x1=seq(from=grange[1,1],to=grange[2,1],length=n) 34 | x2=seq(from=grange[1,2],to=grange[2,2],length=n) 35 | expand.grid(X1=x1,X2=x2) 36 | } 37 | xgrid=make.grid(x) 38 | ygrid=predict(svmfit,xgrid) 39 | plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=20,cex=.2) 40 | points(x,col=y+3,pch=19) 41 | points(x[svmfit$index,],pch=5,cex=2) 42 | ``` 43 | The `svm` function is not too friendly, in that we have to do some work to get back the linear coefficients, as described in the text. Probably the reason is that this only makes sense for linear kernels, and the function is more general. Here we will use a formula to extract the coefficients; for those interested in where this comes from, have a look in chapter 12 of ESL ("Elements of Statistical Learning"). 44 | 45 | We extract the linear coefficients, and then using simple algebra, we include the decision boundary and the two margins. 46 | 47 | ```{r} 48 | beta=drop(t(svmfit$coefs)%*%x[svmfit$index,]) 49 | beta0=svmfit$rho 50 | plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=20,cex=.2) 51 | points(x,col=y+3,pch=19) 52 | points(x[svmfit$index,],pch=5,cex=2) 53 | abline(beta0/beta[2],-beta[1]/beta[2]) 54 | abline((beta0-1)/beta[2],-beta[1]/beta[2],lty=2) 55 | abline((beta0+1)/beta[2],-beta[1]/beta[2],lty=2) 56 | ``` 57 | Just like for the other models in this book, the tuning parameter `C` has to be selected. 58 | Different values will give different solutions. Rerun the code above, but using `C=1`, and see what we mean. One can use cross-validation to do this. 59 | 60 | 61 | Nonlinear SVM 62 | -------------- 63 | Instead, we will run the SVM on some data where a non-linear boundary is called for. We will use the mixture data from ESL 64 | 65 | ```{r} 66 | load(url("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/ESL.mixture.rda")) 67 | names(ESL.mixture) 68 | rm(x,y) 69 | attach(ESL.mixture) 70 | ``` 71 | These data are also two dimensional. Lets plot them and fit a nonlinear SVM, using a radial kernel. 72 | ```{r} 73 | plot(x,col=y+1) 74 | dat=data.frame(y=factor(y),x) 75 | fit=svm(factor(y)~.,data=dat,scale=FALSE,kernel="radial",cost=5) 76 | ``` 77 | Now we are going to create a grid, as before, and make predictions on the grid. 78 | These data have the grid points for each variable included on the data frame. 79 | ```{r} 80 | xgrid=expand.grid(X1=px1,X2=px2) 81 | ygrid=predict(fit,xgrid) 82 | plot(xgrid,col=as.numeric(ygrid),pch=20,cex=.2) 83 | points(x,col=y+1,pch=19) 84 | ``` 85 | We can go further, and have the predict function produce the actual function estimates at each of our grid points. We can include the actual decision boundary on the plot by making use of the contour function. On the dataframe is also `prob`, which is the true probability of class 1 for these data, at the gridpoints. If we plot its 0.5 contour, that will give us the _Bayes Decision Boundary_, which is the best one could ever do. 
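As a quick optional aside (a minimal sketch, not part of the original lab), we can also ask how often the radial SVM's grid classification `ygrid` from the previous chunk agrees with the Bayes rule, which assigns class 1 wherever `prob` exceeds 0.5. This assumes `prob` is stored in the same order as `expand.grid(X1=px1,X2=px2)` (X1 varying fastest), consistent with the `matrix(prob,69,99)` reshape used below; `bayes.class` is just a helper name introduced here.

```{r}
bayes.class=as.numeric(prob>.5)                      # Bayes rule on the grid: class 1 where P(class 1) > 0.5
mean(as.character(ygrid)==as.character(bayes.class)) # fraction of grid points where the SVM agrees with the Bayes rule
```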
86 | ```{r} 87 | func=predict(fit,xgrid,decision.values=TRUE) 88 | func=attributes(func)$decision 89 | xgrid=expand.grid(X1=px1,X2=px2) 90 | ygrid=predict(fit,xgrid) 91 | plot(xgrid,col=as.numeric(ygrid),pch=20,cex=.2) 92 | points(x,col=y+1,pch=19) 93 | 94 | contour(px1,px2,matrix(func,69,99),level=0,add=TRUE) 95 | contour(px1,px2,matrix(prob,69,99),level=.5,add=TRUE,col="blue",lwd=2) 96 | ``` 97 | We see in this case that the radial kernel has done an excellent job. --------------------------------------------------------------------------------