├── 10.R.RData ├── 5.R.RData ├── 7.R.RData ├── HW_Nonlinearity.R ├── HW_PCA.R ├── HW_SVM.R ├── HW_TheBootstrap.R ├── ch10.Principal Components.html ├── ch10.Rmd ├── ch2.R ├── ch3.R ├── ch4.R ├── ch5.R ├── ch6.ModelSelection.html ├── ch6.Rmd ├── ch7.NonlinearModels.html ├── ch7.Rmd ├── ch8.DecisionTrees.html ├── ch8.Rmd ├── ch9.Rmd └── ch9.SVM.html /10.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kjy/StatisticalLearning_classstuff/936249d6e84f4b0888ebe08eba1b23681221a16c/10.R.RData -------------------------------------------------------------------------------- /5.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kjy/StatisticalLearning_classstuff/936249d6e84f4b0888ebe08eba1b23681221a16c/5.R.RData -------------------------------------------------------------------------------- /7.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kjy/StatisticalLearning_classstuff/936249d6e84f4b0888ebe08eba1b23681221a16c/7.R.RData -------------------------------------------------------------------------------- /HW_Nonlinearity.R: -------------------------------------------------------------------------------- 1 | ##Load the data from the file 7.R.RData, and plot it using plot(x,y). 2 | #Q1. What is the slope coefficient in a linear regression of y on x (to within 10%)? -0.67483 3 | load("/Users/karenyang/Desktop/Statisical_Learning_Rcode/7.R.RData") 4 | ls() 5 | x 6 | y 7 | plot(x,y) # shows a negative relationship 8 | model1 <- lm(y ~ x) 9 | summary(model1) 10 | 11 | Call: 12 | lm(formula = y ~ x) 13 | 14 | Residuals: 15 | Min 1Q Median 3Q Max 16 | -0.71289 -0.26943 -0.02448 0.21068 0.83582 17 | 18 | Coefficients: 19 | Estimate Std. Error t value Pr(>|t|) 20 | (Intercept) 95.43627 7.14200 13.36 <2e-16 *** 21 | x -0.67483 0.05073 -13.30 <2e-16 *** 22 | --- 23 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 24 | 25 | Residual standard error: 0.3376 on 98 degrees of freedom 26 | Multiple R-squared: 0.6436, Adjusted R-squared: 0.64 27 | F-statistic: 177 on 1 and 98 DF, p-value: < 2.2e-16 28 | 29 | Q2. For the model y ~ 1+x+x^2, what is the coefficient of x (to within 10%)? 7.771e+01 30 | fit1 = lm(y ~ x + I(x^2)) 31 | summary(fit1) 32 | Call: 33 | lm(formula = y ~ x + I(x^2)) 34 | 35 | Residuals: 36 | Min 1Q Median 3Q Max 37 | -0.65698 -0.18190 -0.01938 0.16355 0.86149 38 | 39 | Coefficients: 40 | Estimate Std. Error t value Pr(>|t|) 41 | (Intercept) -5.421e+03 1.547e+03 -3.505 0.000692 *** 42 | x 7.771e+01 2.197e+01 3.536 0.000624 *** 43 | I(x^2) -2.784e-01 7.805e-02 -3.567 0.000563 *** 44 | --- 45 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 46 | 47 | Residual standard error: 0.3191 on 97 degrees of freedom 48 | Multiple R-squared: 0.6849, Adjusted R-squared: 0.6784 49 | F-statistic: 105.4 on 2 and 97 DF, p-value: < 2.2e-16 -------------------------------------------------------------------------------- /HW_PCA.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | load("/Users/karenyang/Desktop/Statistical_Learning_Rcode/10.R.RData") 4 | ls() 5 | [1] "x" "x.test" "y" "y.test" 6 | 7 | 8 | Suppose we want to fit a linear regression, but the number of variables is much larger than the number of observations. In some cases, we may improve the fit by reducing the dimension of the features before. 
9 | In this problem, we use a data set with n = 300 and p = 200, so we have more observations than variables, but not by much. Load the data x, y, x.test, and y.test from 10.R.RData. 10 | First, concatenate x and x.test using the rbind functions and perform a principal components analysis on the concatenated data frame. 11 | Q.1 To within 10% relative error, what proportion of the variance is explained by the first five principal components? 12 | xvars = rbind(x,x.test) # bind the variables 13 | 14 | dataset1 = data.frame(xvars) #create a dataset 15 | 16 | pca.out = prcomp(dataset1, scale = TRUE) # prinicipal component analysis model 17 | pca.out$sdev # gives standard deviations 18 | 19 | screeplot(pca.out) # Scree plot shows variance explained per principal component 20 | (pca.out$sdev)^2/ sum(pca.out$sdev^2) 21 | sum(0.1278392623, 0.1056409183, 0.0693007523, 0.0363725007, 0.0107030317) #Take sum of first five 22 | [1] 0.3498565 #cumulative sum 23 | #alternative method is to use the cumulative sum function, which gives the same 0.3498565 on fifth entry 24 | cumsum((pca.out$sdev)^2) / sum(pca.out$sdev^2) 25 | [1] 0.1278393 0.2334802 0.3027809 0.3391534 0.3498565 26 | 27 | #or just use 28 | summary(pca.out) 29 | Importance of components: 30 | PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 31 | Standard deviation 5.0565 4.5965 3.7229 2.69713 1.4631 1.16827 1.15848 1.15544 1.14591 32 | Proportion of Variance 0.1278 0.1056 0.0693 0.03637 0.0107 0.00682 0.00671 0.00668 0.00657 33 | Cumulative Proportion 0.1278 0.2335 0.3028 0.33915 0.3499 0.35668 0.36339 0.37007 0.37663 34 | 35 | 36 | #Q.2 The previous answer suggests that a relatively small number of '"latent variables" account for a substantial fraction of the features' variability. We might believe that these 37 | latent variables are more important than linear combinations of the features that have low variance. 38 | We can try forgetting about the raw features and using the first five principal components (computed on rbind(x,x.test)) instead as low-dimensional derived features. What is the mean-squared test error if we regress y on the first five principal components, and use the resulting model to predict y.test? 39 | xols<-pca.out$x[1:300,1:5] 40 | fit0 <- lm(y ~ xols) 41 | summary(fit0) 42 | Call: 43 | lm(formula = y ~ xols) 44 | 45 | Residuals: 46 | Min 1Q Median 3Q Max 47 | -3.3289 -0.6992 0.0319 0.8075 2.5240 48 | 49 | Coefficients: 50 | Estimate Std. Error t value Pr(>|t|) 51 | (Intercept) 0.09541 0.06107 1.562 0.119314 52 | xolsPC1 0.07608 0.01159 6.564 2.36e-10 *** 53 | xolsPC2 -0.02276 0.01314 -1.732 0.084309 . 54 | xolsPC3 -0.04023 0.01538 -2.616 0.009352 ** 55 | xolsPC4 -0.06368 0.02237 -2.847 0.004722 ** 56 | xolsPC5 -0.16069 0.04299 -3.738 0.000223 *** 57 | --- 58 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 59 | 60 | Residual standard error: 1.056 on 294 degrees of freedom 61 | Multiple R-squared: 0.1906, Adjusted R-squared: 0.1769 62 | F-statistic: 13.85 on 5 and 294 DF, p-value: 3.704e-12 63 | 64 | yhat0 = predict(fit0, x.test) 65 | mean((yhat0-y.test)^2) 66 | 67 | [1] 1.40799 #not correct 68 | [1] 1.413063 #not correct 69 | 70 | 71 | 72 | 73 | # Q.3 Now, try an OLS linear regression of y on the matrix x. What is the mean squared prediction error if we use the fitted model to predict y.test from x.test? 
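# Revisiting Q.2 before answering Q.3: predict(fit0, x.test) above likely failed because
# fit0 was fit on the matrix xols; x.test has no column named "xols", so predict() falls
# back to the training scores in the workspace and effectively returns the training fitted
# values. A hedged sketch that keeps the PC scores in a data frame so predict() can match
# them by name (pcs, train.pcs, fit0b, yhat0b are illustrative names):
pcs = data.frame(pca.out$x[, 1:5])                       # PC scores for training + test rows
train.pcs = data.frame(pcs[1:300, ], y = y)              # first 300 rows correspond to x
fit0b = lm(y ~ ., data = train.pcs)                      # regress y on the first five PCs
yhat0b = predict(fit0b, newdata = pcs[301:nrow(pcs), ])  # remaining rows correspond to x.test
mean((yhat0b - y.test)^2)                                # mean squared test error for Q.2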
74 | fit<-lm(y~.,x) # Run linear model 75 | summary(fit) 76 | yhat = predict(fit, newdata=x.test) # Use linear model and new dataset to get the predicted values, yhat 77 | mean((yhat-y.test)^2) #Calculate mean squared error (difference between predicted y and true y values is error) 78 | [1] 3.657197 #Grader shows answer to be 3.90714 -------------------------------------------------------------------------------- /HW_SVM.R: -------------------------------------------------------------------------------- 1 | In this problem, you will use simulation to evaluate (by Monte Carlo) the expected misclassification error rate given a particular generating model.  Let yi be equally divided between classes 0 and 1, and let xi ∈ ℝ10 be normally distributed. Given yi=0, xi ∼ N10(0, I10).  Given yi=1, xi ∼N10(μ, I10) with μ=(1,1,1,1,1,0,0,0,0,0). 2 | Now, we would like to know the expected test error rate if we fit an SVM to a sample of 50 random training points from class 1 and 50 more from class 0.  We can calculate this to high precision by 1) generating a random training sample to train on, 2) evaluating the number of mistakes we make on a large test set, and then 3) repeating (1-2) many times and averaging the error rate for each trial. 3 | 4 | install.packages("MASS") # Use this package for the mvrnorm() function 5 | library(MASS) 6 | 7 | install.packages("e1071") # Use this package for the SVM() function 8 | library(e1071) 9 | 10 | #Create a dataset of 100 observations (with 50 1's and 50 0's) with 10 predictors 11 | set.seed(10111) 12 | # n is number of obs., mu is vector mean, sigma is Identity matrix (variance-covariance) 13 | x1 = mvrnorm(n = 50, mu = rep(0,10), Sigma=diag(10)) 14 | dim(x1) 15 | [1] 50 10 16 | x2 = mvrnorm(n = 50, mu = rep(0,10), Sigma=diag(10)) 17 | xvar_1 = rbind(x1,x2) 18 | dim(xvar_1) 19 | [1] 100 10 20 | data1 = matrix(xvar_1) 21 | y = rep(c(0,1), c(50,50)) #Make a vector of 100 obs. with 50 1's and 50 0's 22 | y = matrix(y) 23 | dim (y) 24 | #Put the y variable in with the X variables to form a training dataset 25 | train_data = data.frame(xvar_1, y = as.factor(y)) 26 | dim (train_data) 27 | [1] 100 11 28 | 29 | #Create a large dataset with 1000 observations for the test datset with 10 predictors 30 | x3 = mvrnorm(500,rep(0,10),diag(10)) 31 | x4 = mvrnorm(500,rep(c(1,0),c(5,5)),diag(10)) 32 | xvar_2 = rbind(x3,x4) 33 | y1 = rep(c(0,1),c(500,500)) 34 | test_data = data.frame(xvar_2,y1=as.factor(y1)) 35 | dim (test_data) 36 | [1] 1000 11 #Test data set with 1000 observations, Y variable with 500 1's and 500 0's, and 10 X variables 37 | 38 | 39 | # Q9.1 Use svm in the e1071 package with the default settings (the default kernel is a radial kernel). What is the expected test error rate of this method (to within 10%)? 
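# Before answering Q9.1, note the training set built above: per the problem statement,
# class 1 should be drawn from N10(mu, I10) with mu = (1,1,1,1,1,0,0,0,0,0), but x2 was
# generated with mean 0, so the two training classes are identically distributed and any
# classifier trained on them should give test error near 0.5, which matches the results
# below. A hedged corrected draw (reusing the same object names for illustration):
x2 = mvrnorm(n = 50, mu = rep(c(1, 0), c(5, 5)), Sigma = diag(10))
train_data = data.frame(rbind(x1, x2), y = as.factor(rep(c(0, 1), c(50, 50))))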
40 | sample1_train <- train_data[sample(1:nrow(train_data), 50, 41 | replace=TRUE),] 42 | svmModel1 = svm(sample1_train$y ~., data=sample1_train) 43 | svmModel1 44 | svmPredicted1 <- predict(svmModel1, newdata = test_data) 45 | svmPredicted1 46 | errorrateSVM1 <- (sum(svmPredicted1 != test_data$y))/length(test_data$y) 47 | errorrateSVM1 48 | [1] 0.421 49 | 50 | 51 | sample2_train <- train_data[sample(1:nrow(train_data), 50, 52 | replace=TRUE),] 53 | svmModel2 = svm(sample2_train$y ~., data=sample2_train) 54 | svmModel2 55 | svmPredicted2 <- predict(svmModel2, newdata = test_data) 56 | svmPredicted2 57 | errorrateSVM2 <- (sum(svmPredicted2 != test_data$y))/length(test_data$y) 58 | errorrateSVM2 59 | [1] 0.465 60 | 61 | 62 | sample3_train <- train_data[sample(1:nrow(train_data), 50, 63 | replace=TRUE),] 64 | svmModel3 = svm(sample2_train$y ~., data=sample3_train) 65 | svmModel3 66 | svmPredicted3 <- predict(svmModel3, newdata = test_data) 67 | svmPredicted3 68 | errorrateSVM3 <- (sum(svmPredicted3 != test_data$y))/length(test_data$y) 69 | errorrateSVM3 70 | [1] 0.476 71 | #Take the average of the test errors for SVM 72 | sum(errorrateSVM1, errorrateSVM2, errorrateSVM3)/3 73 | [1] 0.454 74 | 75 | #USe Monte Carlo method to repeat above for 1000 iterations on SVM with default setting radial kernel 76 | error_vector <- c(); 77 | for (i in 1:1000) { 78 | ## Generate traning data 79 | sample_train <- train_data[sample(1:nrow(train_data), 50, replace = TRUE),] 80 | 81 | ## Fit a model on trainning data 82 | svmModel = svm(sample_train$y ~., data=sample_train) 83 | 84 | ## Generate test data 85 | #Use test_data from above 86 | 87 | ## Predict on test data 88 | svmPredicted <- predict(svmModel, newdata = test_data) 89 | 90 | ## Compare prediction with y 91 | errorrateSVM <- (sum(svmPredicted != test_data$y))/length(test_data$y) 92 | 93 | ## Save the result in a vector 94 | error_vector <- c(error_vector, errorrateSVM) 95 | 96 | } 97 | 98 | #Random check the values in the vector 99 | > head(error_vector) 100 | [1] 0.620 0.604 0.534 0.492 0.599 0.526 101 | 102 | error_vector[999] 103 | [1] 0.511 104 | 105 | error_vector[750] 106 | [1] 0.526 107 | 108 | # Take the average across the 1000 test errors 109 | mean(error_vector) 110 | [1] 0.513473 111 | 112 | 113 | # Q.9.2 Now fit an svm with a linear kernel (kernel = "linear"). What is the expected test error rate to within 10%? 
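# Note for Q9.1/Q9.2: the loops above resample rows of one fixed 100-row training set;
# the problem statement instead asks to regenerate a fresh training sample (50 points per
# class) on every Monte Carlo iteration. A hedged sketch of that scheme (object names are
# illustrative; assumes MASS, e1071 and test_data from above):
err <- numeric(200)                        # 200 iterations for illustration; increase for precision
for (i in seq_along(err)) {
  xtr <- rbind(mvrnorm(50, rep(0, 10), diag(10)),            # class 0: N10(0, I10)
               mvrnorm(50, rep(c(1, 0), c(5, 5)), diag(10))) # class 1: N10(mu, I10)
  tr  <- data.frame(xtr, y = as.factor(rep(c(0, 1), c(50, 50))))
  fit <- svm(y ~ ., data = tr)             # default radial kernel; add kernel = "linear" for Q9.2
  err[i] <- mean(predict(fit, newdata = test_data) != test_data$y1)
}
mean(err)                                  # Monte Carlo estimate of the expected test error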
114 | sample6_train <- train_data[sample(1:nrow(train_data), 50, 115 | replace=TRUE),] 116 | svmModel6 = svm(sample6_train$y ~., data=sample6_train, kernel = "linear") 117 | svmModel6 118 | svmPredicted6 <- predict(svmModel6, newdata = test_data) 119 | svmPredicted6 120 | errorrateSVM6 <- (sum(svmPredicted6 != test_data$y))/length(test_data$y) 121 | errorrateSVM6 122 | [1] 0.451 123 | 124 | sample7_train <- train_data[sample(1:nrow(train_data), 50, 125 | replace=TRUE),] 126 | svmModel7 = svm(sample7_train$y ~., data=sample7_train, kernel = "linear") 127 | svmModel7 128 | svmPredicted7 <- predict(svmModel7, newdata = test_data) 129 | svmPredicted7 130 | errorrateSVM7 <- (sum(svmPredicted7 != test_data$y))/length(test_data$y) 131 | errorrateSVM7 132 | [1] 0.537 133 | 134 | #Take the average of the test errors for SVM models with linear kernel 135 | sum(errorrateSVM6, errorrateSVM7)/2 136 | [1] 0.494 137 | 138 | #USe Monte Carlo method to repeat above for 1000 iterations on SVM with linear kernel 139 | error_vector1 <- c(); 140 | for (i in 1:1000) { 141 | ## Generate traning data 142 | sample_train <- train_data[sample(1:nrow(train_data), 50, replace = TRUE),] 143 | 144 | ## Fit a model on trainning data 145 | svmModel = svm(sample_train$y ~., data=sample_train, kernel = "linear") 146 | 147 | ## Generate test data 148 | #Use test_data from above 149 | 150 | ## Predict on test data 151 | svmPredicted <- predict(svmModel, newdata = test_data) 152 | 153 | ## Compare prediction with y 154 | errorrateSVM <- (sum(svmPredicted != test_data$y))/length(test_data$y) 155 | 156 | ## Save the result in a vector 157 | error_vector1 <- c(error_vector1, errorrateSVM) 158 | 159 | } 160 | 161 | # Random check the values in the vector 162 | head(error_vector1) 163 | [1] 0.553 0.608 0.461 0.483 0.542 0.469 164 | 165 | error_vector1[999] 166 | [1] 0.382 167 | 168 | error_vector1[750] 169 | [1] 0.415 170 | 171 | # Take the average across the 1000 test errors 172 | mean(error_vector1) 173 | [1] 0.483152 174 | 175 | 176 | # Q. 9.3 What is the expected test error for logistic regression? 
(to within 10%) 177 | 178 | sample11_train <- train_data[sample(1:nrow(train_data), 50, 179 | replace=TRUE),] 180 | 181 | logistic1 = glm(sample11_train$y ~., family = "binomial", data=sample11_train) 182 | summary(logistic1) 183 | fitted <- predict(logistic1, newdata=test_data, type = 'response') 184 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 185 | errorratelog1 <-missClass(test_data$y,fitted) 186 | [1] 0.576 187 | 188 | 189 | sample12_train <- train_data[sample(1:nrow(train_data), 50, 190 | replace=TRUE),] 191 | logistic2 = glm(sample12_train$y ~., family = "binomial", data=sample12_train) 192 | summary(logistic2) 193 | fitted1 <- predict(logistic2, newdata=test_data, type = 'response') 194 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 195 | errorratelog2 <- missClass(test_data$y,fitted1) 196 | [1] 0.561 197 | 198 | sample13_train <- train_data[sample(1:nrow(train_data), 50, 199 | replace=TRUE),] 200 | logistic3 = glm(sample13_train$y ~., family = "binomial", data=sample13_train) 201 | summary(logistic3) 202 | fitted2 <- predict(logistic3, newdata=test_data, type = 'response') 203 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 204 | errorratelog3 <- missClass(test_data$y,fitted2) 205 | [1] 0.425 206 | #Take the average of the test errors for binary logistic 207 | sum(errorratelog1, errorratelog2, errorratelog3)/3 208 | 209 | 210 | 211 | #USe Monte Carlo method to repeat above for 1000 iterations on binary logistic regression model 212 | 213 | #Use a function for missclassification error rate 214 | missClass <- function(values, prediction){sum(((prediction > 0.5)*1) != values)/length(values)} 215 | #Set up a vector to store the error rates 216 | error_vector2 <- c(); 217 | 218 | for (i in 1:1000) { 219 | ## Generate traning data 220 | sample_train <- train_data[sample(1:nrow(train_data), 50, replace = TRUE),] 221 | 222 | ## Fit a model on trainning data 223 | LogisticModel = glm(sample_train$y ~., family = "binomial", data=sample_train) 224 | 225 | ## Generate test data 226 | #Use test_data from above 227 | 228 | ## Predict on test data 229 | fitted <- predict(LogisticModel, newdata=test_data, type = 'response') 230 | 231 | ## Compare prediction with y 232 | errorrate_logistic <- missClass(test_data$y,fitted) 233 | 234 | ## Save the result in a vector 235 | error_vector2 <- c(error_vector2, errorrate_logistic) 236 | 237 | } 238 | #(Don't worry if you get errors saying the logistic regression did not converge.)Warning messages: 239 | 1: glm.fit: algorithm did not converge 240 | 2: glm.fit: fitted probabilities numerically 0 or 1 occurred 241 | 3: glm.fit: fitted probabilities numerically 0 or 1 occurred 242 | 4: glm.fit: fitted probabilities numerically 0 or 1 occurred 243 | 244 | # Random check the values in the vector 245 | head(error_vector2) 246 | [1] 0.444 0.280 0.419 0.593 0.501 0.686 247 | error_vector2[999] 248 | [1] 0.419 249 | error_vector2[750] 250 | [1] 0.548 251 | 252 | # Take the average across the 1000 test errors 253 | mean(error_vector2) 254 | [1] 0.476648 255 | -------------------------------------------------------------------------------- /HW_TheBootstrap.R: -------------------------------------------------------------------------------- 1 | setwd("/Users/karenyang/Downloads/") 2 | load("/Users/karenyang/Downloads/5.R.RData") 3 | ls() 4 | head(Xy) 5 | tail(Xy) 6 | dim(Xy) 7 | [1] 1000 3 8 | 9 | attach(Xy) 10 | 
search() 11 | 12 | # Q1. Download the file 5.R.RData and load it into R using load("5.R.RData"). 13 | # Consider the linear regression model of y on X1 and X2. 14 | # To within 10%, what is the standard error for β1? 0.026 15 | model1 <- lm(y ~ X1 + X2) 16 | summary(model1) 17 | Call: 18 | lm(formula = y ~ X1 + X2) 19 | 20 | Residuals: 21 | Min 1Q Median 3Q Max 22 | -1.44171 -0.25468 -0.01736 0.33081 1.45860 23 | 24 | Coefficients: 25 | Estimate Std. Error t value Pr(>|t|) 26 | (Intercept) 0.26583 0.01988 13.372 < 2e-16 *** 27 | X1 0.14533 0.02593 5.604 2.71e-08 *** 28 | X2 0.31337 0.02923 10.722 < 2e-16 *** 29 | --- 30 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 31 | 32 | Residual standard error: 0.5451 on 997 degrees of freedom 33 | Multiple R-squared: 0.1171, Adjusted R-squared: 0.1154 34 | F-statistic: 66.14 on 2 and 997 DF, p-value: < 2.2e-16 35 | 36 | # Q2 Next, plot the data using matplot(Xy,type="l"). 37 | # Which of the following do you think is most likely given what you see? Our estimate of s.e.(β^1) is too low. 38 | ?matplot 39 | matplot(Xy, type="l") 40 | 41 | # Q3 Now, use the (standard) bootstrap to estimate s.e.(β^1). To within 10%, what do you get? 42 | 43 | library(boot) 44 | alpha = function(x,y){ 45 | vx = var(x) 46 | vy = var(y) 47 | cxy= cov(x,y) 48 | (vy-cxy)/(vx+vy-2*cxy) 49 | } 50 | alpha(Xy$X1,Xy$y) 51 | [1] 0.4167192 52 | alpha.fn = function(data,index){ 53 | with(data[index,],alpha(Xy$X1,Xy$y)) 54 | } 55 | 56 | alpha.fn<-function(data, index) { 57 | fit1<-lm(y~., data=Xy[index,]) 58 | coefficients(fit1)[['X1']] 59 | } 60 | 61 | set.seed(1) 62 | alpha.fn (Xy,sample(1:100,100,replace=TRUE)) 63 | [1] 0.1059068 64 | 65 | boot.out=boot(Xy,alpha.fn,R=1000) 66 | boot.out 67 | 68 | ORDINARY NONPARAMETRIC BOOTSTRAP 69 | 70 | 71 | Call: 72 | boot(data = Xy, statistic = alpha.fn, R = 1000) 73 | 74 | 75 | Bootstrap Statistics : 76 | original bias std. error 77 | t1* 0.1453263 0.0001885914 0.02873965 78 | 79 | # Q4. Finally, use the block bootstrap to estimate s.e.(β^1). Use blocks of size 100. To within 10%, what do you get? 80 | ?tsboot 81 | # Generate R bootstrap replicates of a statistic applied to a time series. 82 | # The replicate time series can be generated using fixed or random block lengths 83 | # or can be model based replicates. 84 | 85 | 86 | tsboot.out = tsboot(Xy, se_stat, R = 1000, l = 100, sim = "fixed") 87 | tsboot.out 88 | -------------------------------------------------------------------------------- /ch10.Rmd: -------------------------------------------------------------------------------- 1 | Principal Components 2 | ==================== 3 | We will use the `USArrests` data (which is in R) 4 | ```{r} 5 | dimnames(USArrests) 6 | apply(USArrests,2,mean) 7 | apply(USArrests,2, var) 8 | ``` 9 | 10 | We see that `Assault` has a much larger variance than the other variables. It would dominate the principal components, so we choose to standardize the variables when we perform PCA 11 | 12 | ```{r} 13 | pca.out=prcomp(USArrests, scale=TRUE) 14 | pca.out 15 | names(pca.out) 16 | biplot(pca.out, scale=0) 17 | ``` 18 | 19 | K-Means Clustering 20 | ================== 21 | K-means works in any dimension, but is most fun to demonstrate in two, because we can plot pictures. 22 | Lets make some data with clusters. We do this by shifting the means of the points around. 
23 | ```{r} 24 | set.seed(101) 25 | x=matrix(rnorm(100*2),100,2) 26 | xmean=matrix(rnorm(8,sd=4),4,2) 27 | which=sample(1:4,100,replace=TRUE) 28 | x=x+xmean[which,] 29 | plot(x,col=which,pch=19) 30 | ``` 31 | We know the "true" cluster IDs, but we wont tell that to the `kmeans` algorithm. 32 | 33 | ```{r} 34 | km.out=kmeans(x,4,nstart=15) 35 | km.out 36 | plot(x,col=km.out$cluster,cex=2,pch=1,lwd=2) 37 | points(x,col=which,pch=19) 38 | points(x,col=c(4,3,2,1)[which],pch=19) 39 | ``` 40 | 41 | Hierarchical Clustering 42 | ======================= 43 | We will use these same data and use hierarchical clustering 44 | 45 | ```{r} 46 | hc.complete=hclust(dist(x),method="complete") 47 | plot(hc.complete) 48 | hc.single=hclust(dist(x),method="single") 49 | plot(hc.single) 50 | hc.average=hclust(dist(x),method="average") 51 | plot(hc.average) 52 | 53 | ``` 54 | Lets compare this with the actualy clusters in the data. We will use the function `cutree` to cut the tree at level 4. 55 | This will produce a vector of numbers from 1 to 4, saying which branch each observation is on. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and its a little too complicated for this demonstration. 56 | 57 | We can use `table` to see how well they match: 58 | ```{r} 59 | hc.cut=cutree(hc.complete,4) 60 | table(hc.cut,which) 61 | table(hc.cut,km.out$cluster) 62 | ``` 63 | or we can use our group membership as labels for the leaves of the dendrogram: 64 | ```{r} 65 | plot(hc.complete,labels=which) 66 | ``` 67 | 68 | 69 | -------------------------------------------------------------------------------- /ch2.R: -------------------------------------------------------------------------------- 1 | ### Week1 Chapter 2 Statistical Learning using R 2 | ### vectors, data, matrices, subsetting 3 | # Note: there are no scalars in R 4 | 5 | setwd("/Users/karenyang/Desktop") 6 | 7 | install.packages("ISLR") #Auto data located in ISLR package 8 | library(ISLR) # Load the library ISLR 9 | 10 | 11 | 12 | x=c(2,7,5) #Assign 3 numbers to vector 3 13 | x 14 | [1] 2 7 5 15 | 16 | y=seq(from=4,length=3,by=3) #Another way to make a vector, using a sequence 17 | ?seq #Opens up R help on sequence 18 | 19 | y 20 | [1] 4 7 10 21 | 22 | x+y #Sum of 2 vectors of the same length, element by element 23 | [1] 6 14 15 24 | 25 | x/y #Division of 2 vectors, element-wise 26 | [1] 0.5 1.0 0.5 27 | 28 | x^y #x to the power of y, element-wise 29 | [1] 16 823543 9765625 30 | 31 | x[2] #Use square brace to subset: extract 2nd element of x 32 | [1] 7 33 | 34 | x[2:3] #Use colon to indicate start at element 2 and stop at element 3 of subsetted vector, return subset 35 | [1] 7 5 36 | 37 | x[-2] #Remove element 2 from x and return the subsetted vector as output 38 | [1] 2 5 39 | 40 | x[-c(1,2)] #Remove a collection of indices 1 and 2 and return the subsetted vector 41 | [1] 5 42 | 43 | 44 | z=matrix(seq(1,12),4,3) #A 2-way array is a matrix, make a 4x3 matrix and fill values 1-12 column-wise 45 | z 46 | [,1] [,2] [,3] 47 | [1,] 1 5 9 48 | [2,] 2 6 10 49 | [3,] 3 7 11 50 | [4,] 4 8 12 51 | 52 | 53 | z[3:4,2:3] #Subset elements of matrix, 3-4 rows and 2-3 columns 54 | [,1] [,2] 55 | [1,] 7 11 56 | [2,] 8 12 57 | 58 | 59 | z[,2:3] #Subset columns 2 and 3 of z and return subsetted matrix 60 | [,1] [,2] 61 | [1,] 5 9 62 | [2,] 6 10 63 | [3,] 7 11 64 | [4,] 8 12 65 | 66 | z[,1] #Subset the 1st column of z. Note that it returns a vector and drops its matrix status. 
67 | [1] 1 2 3 4 68 | 69 | z[,1,drop=FALSE] #Drop=FALSE keeps it as a 1-column matrix 70 | [,1] 71 | [1,] 1 72 | [2,] 2 73 | [3,] 3 74 | [4,] 4 75 | 76 | 77 | dim(z) #Dimensions of matrix is 4 rows and 3 columns 78 | [1] 4 3 79 | 80 | 81 | ls() #Tells you what is available in your working directory 82 | [1] "Auto" "x" "y" "z" 83 | 84 | rm(y) #Remove y 85 | 86 | ls() #Verify that y is gone 87 | [1] "Auto" "x" "z" 88 | 89 | 90 | ### Generating random data, graphics 91 | x=runif(50) #Random uniform from 0 to 1 92 | y=rnorm(50) #Random normal from 0 to 1 93 | plot(x,y) #plots x and y 94 | plot(x,y,xlab="Random Uniform",ylab="Random Normal",pch="*",col="blue") 95 | par(mfrow=c(2,1)) #Make a panel of plots with 2 rows and 1 column 96 | plot(x,y) 97 | hist(y) #Make a histogram of y 98 | par(mfrow=c(1,1)) #Reset the mfrow command 99 | 100 | 101 | ### Reading in data 102 | #Auto=read.csv("Auto.csv") #Already have data via ISLR package so just call data 103 | data(Auto) 104 | 105 | names(Auto) #Gives names of variables 106 | [1] "mpg" "cylinders" "displacement" "horsepower" "weight" 107 | [6] "acceleration" "year" "origin" "name" 108 | 109 | 110 | dim(Auto) 111 | [1] 392 9 112 | 113 | class(Auto) #What type of object 114 | [1] "data.frame" 115 | 116 | 117 | str(Auto) #Structure of data 118 | 'data.frame': 392 obs. of 9 variables: 119 | $ mpg : num 18 15 18 16 17 15 14 14 14 15 ... 120 | $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ... 121 | $ displacement: num 307 350 318 304 302 429 454 440 455 390 ... 122 | $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ... 123 | $ weight : num 3504 3693 3436 3433 3449 ... 124 | $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ... 125 | $ year : num 70 70 70 70 70 70 70 70 70 70 ... 126 | $ origin : num 1 1 1 1 1 1 1 1 1 1 ... 127 | $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...ls() 128 | 129 | summary(Auto) #Gives summary of data 130 | mpg cylinders displacement horsepower weight 131 | Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613 132 | 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225 133 | Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804 134 | Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978 135 | 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615 136 | Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140 137 | 138 | acceleration year origin name 139 | Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5 140 | 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5 141 | Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5 142 | Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4 143 | 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4 144 | Max. :24.80 Max. :82.00 Max. 
:3.000 chevrolet chevette: 4 145 | (Other) :365 146 | 147 | 148 | plot(Auto$cylinders,Auto$mpg) 149 | plot(Auto$cyl,Auto$mpg) 150 | 151 | attach(Auto) #Attach dataframe, creates a workspace 152 | search() #You can see "Auto" available for direct use so you don't need to use $cylinders or $mpg 153 | [1] ".GlobalEnv" "Auto" "package:ISLR" "tools:RGUI" 154 | [5] "package:stats" "package:graphics" "package:grDevices" "package:utils" 155 | [9] "package:datasets" "package:methods" "Autoloads" "package:base" 156 | 157 | plot(cylinders,mpg) #Use variables directly 158 | cylinders=as.factor(cylinders) #Cast as factor 159 | plot(cylinders,mpg,xlab="Cylinders",ylab="Mpg",col="red") 160 | pdf(file="../mpg.pdf") 161 | plot(cylinders,mpg,xlab="Cylinders",ylab="Mpg",col="red") 162 | dev.off() 163 | quartz 164 | 2 165 | pairs(Auto,col="brown") 166 | pairs(mpg~cylinders+acceleration+weight,Auto) 167 | 168 | #TO SUBSET DATA TO OBTAIN ROWS THAT FALL OUTSIDE OF ROWS 10-85 169 | Auto2 = Auto[-10:-85, ] 170 | 171 | 172 | #TO PARTITION DATASET INTO TRAIN AND TEST SETS 173 | # load the library that contains the data set 174 | library(ISLR) 175 | # readme for data set 176 | ?Auto 177 | # readme for sample() function 178 | ?sample 179 | 180 | # create a vector of row indexes 181 | training <- sample(nrow(Auto), size=200) 182 | 183 | # create training data set with 200 obs 184 | train_set <- Auto[training,] 185 | 186 | # create test data set with the remaing obs 187 | test_set <- Auto[-training,] 188 | 189 | q() -------------------------------------------------------------------------------- /ch3.R: -------------------------------------------------------------------------------- 1 | ### Week2 Chapter 3 Statistical Learning 2 | 3 | library(MASS) #has the datasets 4 | library(ISLR) # has datasets that are used in the book 5 | ### Simple linear regression 6 | names(Boston) #Boston dataset, names of variables 7 | [1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" 8 | [10] "tax" "ptratio" "black" "lstat" "medv" 9 | 10 | ?Boston #Gives description of the dataset, The Boston data frame has 506 rows and 14 columns 11 | plot(medv~lstat,Boston) #(outcome)medv = median value of owner-occupied homes in \$1000s. (predictor)lstat=lower status of the population (percent). 12 | 13 | fit1=lm(medv~lstat,data=Boston) # ~ means "as modeled as" 14 | fit1 15 | Call: 16 | lm(formula = medv ~ lstat, data = Boston) 17 | 18 | Coefficients: 19 | (Intercept) lstat 20 | 34.55 -0.95 #coefficient shows negative relationship 21 | 22 | summary(fit1) 23 | Call: 24 | lm(formula = medv ~ lstat, data = Boston) 25 | 26 | Residuals: 27 | Min 1Q Median 3Q Max 28 | -15.168 -3.990 -1.318 2.034 24.500 29 | 30 | Coefficients: 31 | Estimate Std. Error t value Pr(>|t|) 32 | (Intercept) 34.55384 0.56263 61.41 <2e-16 *** 33 | lstat -0.95005 0.03873 -24.53 <2e-16 *** #statistically significant 34 | --- 35 | Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 36 | 37 | Residual standard error: 6.216 on 504 degrees of freedom 38 | Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432 39 | F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16 40 | 41 | 42 | abline(fit1,col="red") 43 | names(fit1) 44 | [1] "coefficients" "residuals" "effects" "rank" "fitted.values" 45 | [6] "assign" "qr" 46 | 47 | confint(fit1) #Gives the confidence intervals 48 | 2.5 % 97.5 % #This is the 95% confidence interval 49 | (Intercept) 33.448457 35.6592247 50 | lstat -1.026148 -0.8739505 51 | 52 | 53 | predict(fit1,data.frame(lstat=c(5,10,15)),interval="confidence") #Predict function is a method to query a linear model fit 54 | fit lwr upr #Query the 3 values (5, 10, 15) and get predicted values and confidence interval for each 55 | 1 29.80359 29.00741 30.59978 56 | 2 25.05335 24.47413 25.63256 57 | 3 20.30310 19.73159 20.87461 58 | 59 | 60 | 61 | ### Multiple linear regression 62 | fit2=lm(medv~lstat+age,data=Boston) #2 predictors (lstat and age), outcome is medv 63 | summary(fit2) 64 | Call: 65 | lm(formula = medv ~ lstat + age, data = Boston) 66 | 67 | Residuals: 68 | Min 1Q Median 3Q Max 69 | -15.981 -3.978 -1.283 1.968 23.158 70 | 71 | Coefficients: 72 | Estimate Std. Error t value Pr(>|t|) 73 | (Intercept) 33.22276 0.73085 45.458 < 2e-16 *** 74 | lstat -1.03207 0.04819 -21.416 < 2e-16 *** #Statistically significant 75 | age 0.03454 0.01223 2.826 0.00491 ** #Statistically significant but less so than lstat 76 | --- 77 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 78 | 79 | Residual standard error: 6.173 on 503 degrees of freedom 80 | Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495 #R-squared is the percentage of variance explained 81 | F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16 82 | 83 | 84 | fit3=lm(medv~.,Boston) #Use all other variables in the dataset, except medv, which is the outcome/response variable 85 | summary(fit3) 86 | Call: 87 | lm(formula = medv ~ ., data = Boston) # Look at the call to the function--a model with all the predictors 88 | 89 | Residuals: 90 | Min 1Q Median 3Q Max 91 | -15.595 -2.730 -0.518 1.777 26.199 92 | 93 | Coefficients: 94 | Estimate Std. Error t value Pr(>|t|) 95 | (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 *** 96 | crim -1.080e-01 3.286e-02 -3.287 0.001087 ** 97 | zn 4.642e-02 1.373e-02 3.382 0.000778 *** 98 | indus 2.056e-02 6.150e-02 0.334 0.738288 99 | chas 2.687e+00 8.616e-01 3.118 0.001925 ** 100 | nox -1.777e+01 3.820e+00 -4.651 4.25e-06 *** 101 | rm 3.810e+00 4.179e-01 9.116 < 2e-16 *** 102 | age 6.922e-04 1.321e-02 0.052 0.958229 #Age no longer statistically significant 103 | dis -1.476e+00 1.995e-01 -7.398 6.01e-13 *** 104 | rad 3.060e-01 6.635e-02 4.613 5.07e-06 *** 105 | tax -1.233e-02 3.760e-03 -3.280 0.001112 ** 106 | ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 *** 107 | black 9.312e-03 2.686e-03 3.467 0.000573 *** 108 | lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 *** 109 | --- 110 | Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 111 | 112 | Residual standard error: 4.745 on 492 degrees of freedom 113 | Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338 114 | F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16 115 | 116 | 117 | par(mfrow=c(2,2)) #Make a 2x2 layout for the plot, 118 | plot(fit3) #plot residuals against fitted values, take a look at the variance, look for non-linearities 119 | 120 | fit4=update(fit3,~.-age-indus) #update function, use model in fit3 but remove age and indus 121 | summary(fit4) #indus = proportion of non-retail business acres per town. 122 | Call: 123 | lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 124 | tax + ptratio + black + lstat, data = Boston) 125 | 126 | Residuals: 127 | Min 1Q Median 3Q Max 128 | -15.5984 -2.7386 -0.5046 1.7273 26.2373 129 | 130 | Coefficients: 131 | Estimate Std. Error t value Pr(>|t|) 132 | (Intercept) 36.341145 5.067492 7.171 2.73e-12 *** 133 | crim -0.108413 0.032779 -3.307 0.001010 ** 134 | zn 0.045845 0.013523 3.390 0.000754 *** 135 | chas 2.718716 0.854240 3.183 0.001551 ** 136 | nox -17.376023 3.535243 -4.915 1.21e-06 *** 137 | rm 3.801579 0.406316 9.356 < 2e-16 *** #Everything is significant in model 138 | dis -1.492711 0.185731 -8.037 6.84e-15 *** 139 | rad 0.299608 0.063402 4.726 3.00e-06 *** 140 | tax -0.011778 0.003372 -3.493 0.000521 *** 141 | ptratio -0.946525 0.129066 -7.334 9.24e-13 *** 142 | black 0.009291 0.002674 3.475 0.000557 *** 143 | lstat -0.522553 0.047424 -11.019 < 2e-16 *** 144 | --- 145 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 146 | 147 | Residual standard error: 4.736 on 494 degrees of freedom 148 | Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348 149 | F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16 150 | 151 | 152 | ### Nonlinear terms and Interactions 153 | fit5=lm(medv~lstat*age,Boston) #Use interaction term (lstat*age) 154 | summary(fit5) #Model will include 3 predictors (lstat, age, and lstat*age) 155 | Call: 156 | lm(formula = medv ~ lstat * age, data = Boston) 157 | 158 | Residuals: 159 | Min 1Q Median 3Q Max 160 | -15.806 -4.045 -1.333 2.085 27.552 161 | 162 | Coefficients: 163 | Estimate Std. Error t value Pr(>|t|) 164 | (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 *** 165 | lstat -1.3921168 0.1674555 -8.313 8.78e-16 *** #main effect (statistically significant) 166 | age -0.0007209 0.0198792 -0.036 0.9711 #main effect (not statistically significant) 167 | lstat:age 0.0041560 0.0018518 2.244 0.0252 * #interaction effect (denoted by colon) 168 | --- 169 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 170 | 171 | Residual standard error: 6.149 on 502 degrees of freedom 172 | Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531 173 | F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16 174 | 175 | #Use a quadratic term since scatterplot of medv and lstat looked nonlinear 176 | fit6=lm(medv~lstat +I(lstat^2),Boston); summary(fit6) #Need to put quadratic term inside the Identity function I() 177 | 178 | Call: 179 | lm(formula = medv ~ lstat + I(lstat^2), data = Boston) #Notice that Identity function I() shows up in the call 180 | 181 | Residuals: 182 | Min 1Q Median 3Q Max 183 | -15.2834 -3.8313 -0.5295 2.3095 25.4148 184 | 185 | Coefficients: 186 | Estimate Std. 
Error t value Pr(>|t|) 187 | (Intercept) 42.862007 0.872084 49.15 <2e-16 *** 188 | lstat -2.332821 0.123803 -18.84 <2e-16 *** #Linear is significant 189 | I(lstat^2) 0.043547 0.003745 11.63 <2e-16 *** #quadratic is significant 190 | --- 191 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 192 | 193 | Residual standard error: 5.524 on 503 degrees of freedom 194 | Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393 195 | F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16 196 | 197 | 198 | attach(Boston) #Named variables are now available in workspace 199 | 200 | par(mfrow=c(1,1)) #Make a 1x1 layout 201 | plot(medv~lstat) #plot as before the same 2 variables, medv is outcome and lstat is predictor 202 | #Now, include the fitted values of the quadratic fit, fit6, and put in the plot 203 | #Do not use abline since that is used for a straight line 204 | #For each value of lstat, you get a fitted value from the model fit6 205 | points(lstat,fitted(fit6),col="red",pch=20) #Include the quadratic fit with a series of points 206 | 207 | 208 | #Use poly function to fit a polynomial 209 | fit7=lm(medv~poly(lstat,4)) #4th degree polynomial, non-linear in the variable yet linear in the coefficient 210 | points(lstat,fitted(fit7),col="blue",pch=20) #pch means plotting character, 20 is a round ball symbol 211 | 212 | #To see all the plotting characters available to you 213 | plot(1:20,1:20,pch=1:20,cex=2) #cex = 2 means double the size of the plotting character 214 | 215 | 216 | ###Qualitative predictors 217 | fix(Carseats) #pch means plotting character, 20 is a round ball #Throws up a dataframe so that you can see it 218 | names(Carseats) #show all the variable names 219 | [1] "Sales" "CompPrice" "Income" "Advertising" "Population" "Price" 220 | [7] "ShelveLoc" "Age" "Education" "Urban" "US" 221 | 222 | summary(Carseats) 223 | fit1=lm(Sales~.+Income:Advertising+Age:Price,Carseats) 224 | summary(fit1) 225 | Call: 226 | lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats) 227 | 228 | Residuals: 229 | Min 1Q Median 3Q Max 230 | -2.9208 -0.7503 0.0177 0.6754 3.3413 231 | 232 | Coefficients: 233 | Estimate Std. Error t value Pr(>|t|) 234 | (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 *** 235 | CompPrice 0.0929371 0.0041183 22.567 < 2e-16 *** 236 | Income 0.0108940 0.0026044 4.183 3.57e-05 *** 237 | Advertising 0.0702462 0.0226091 3.107 0.002030 ** 238 | Population 0.0001592 0.0003679 0.433 0.665330 239 | Price -0.1008064 0.0074399 -13.549 < 2e-16 *** 240 | ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 *** 241 | ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 *** 242 | Age -0.0579466 0.0159506 -3.633 0.000318 *** 243 | Education -0.0208525 0.0196131 -1.063 0.288361 244 | UrbanYes 0.1401597 0.1124019 1.247 0.213171 245 | USYes -0.1575571 0.1489234 -1.058 0.290729 246 | Income:Advertising 0.0007510 0.0002784 2.698 0.007290 ** 247 | Price:Age 0.0001068 0.0001333 0.801 0.423812 248 | --- 249 | Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 250 | 251 | Residual standard error: 1.011 on 386 degrees of freedom 252 | Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719 253 | F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16 254 | 255 | contrasts(Carseats$ShelveLoc) #ShelveLoc is a qualitative variable 256 | #Contrast function will show you how R will code it 257 | Good Medium #3 level factor means 2 dummy variables need to be made 258 | Bad 0 0 259 | Good 1 0 260 | Medium 0 1 261 | 262 | ###Writing R functions 263 | regplot=function(x,y){ 264 | fit=lm(y~x) 265 | plot(x,y) 266 | abline(fit,col="red") 267 | } 268 | attach(Carseats) #Make the Carseats data available in the data space 269 | regplot(Price,Sales) #Call the regplot function with Price, Sales as variables passed in as arguments 270 | 271 | regplot=function(x,y,...){ #. . . means there are unnamed arguments, can later add extra arguments (regplot call below) 272 | fit=lm(y~x) 273 | plot(x,y,...) 274 | abline(fit,col="red") 275 | } 276 | regplot(Price,Sales,xlab="Price",ylab="Sales",col="blue",pch=20) 277 | 278 | ##QUESTION What is the difference between lm(y ~ x*z) and lm(y ~ I(x*z)), when x and z are both numeric variables? See examples below. 279 | lm(medv~lstat*age) gives interaction effects and main effects 280 | lm(medv~I(lstat*age)) gives interaction effects only 281 | 282 | 283 | > fit15 = lm(medv~lstat*age) 284 | > summary(fit15) 285 | 286 | Call: 287 | lm(formula = medv ~ lstat * age) 288 | 289 | Residuals: 290 | Min 1Q Median 3Q Max 291 | -15.806 -4.045 -1.333 2.085 27.552 292 | 293 | Coefficients: 294 | Estimate Std. Error t value Pr(>|t|) 295 | (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 *** 296 | lstat -1.3921168 0.1674555 -8.313 8.78e-16 *** #main effects 297 | age -0.0007209 0.0198792 -0.036 0.9711 #main effects 298 | lstat:age 0.0041560 0.0018518 2.244 0.0252 * #interaction effects 299 | --- 300 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 301 | 302 | Residual standard error: 6.149 on 502 degrees of freedom 303 | Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531 304 | F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16 305 | 306 | > fit16 = lm(medv~I(lstat*age)) 307 | > summary(fit16) 308 | 309 | Call: 310 | lm(formula = medv ~ I(lstat * age)) 311 | 312 | Residuals: 313 | Min 1Q Median 3Q Max 314 | -13.347 -4.372 -1.534 1.914 27.193 315 | 316 | Coefficients: 317 | Estimate Std. Error t value Pr(>|t|) 318 | (Intercept) 30.1588631 0.4828240 62.46 <2e-16 *** 319 | I(lstat * age) -0.0077146 0.0003799 -20.31 <2e-16 *** #Interaction effects only 320 | --- 321 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 322 | 323 | Residual standard error: 6.827 on 504 degrees of freedom 324 | Multiple R-squared: 0.4501, Adjusted R-squared: 0.449 325 | F-statistic: 412.4 on 1 and 504 DF, p-value: < 2.2e-16 326 | 327 | 328 | 329 | 330 | 331 | -------------------------------------------------------------------------------- /ch4.R: -------------------------------------------------------------------------------- 1 | ### Week 3 Chapter 4 Statistical Learning 2 | 3 | require(ISLR) #require command is similar to libary command; load library, has the dataset on Stock Market 4 | names(Smarket) #gives names of variables; Direction is the response(outcome) variable 5 | [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5" "Volume" 6 | [8] "Today" "Direction" 7 | 8 | summary(Smarket) 9 | Year Lag1 Lag2 Lag3 10 | Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. 
:-4.922000 11 | 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000 12 | Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500 13 | Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716 14 | 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750 15 | Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000 16 | Lag4 Lag5 Volume Today Direction 17 | Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000 Down:602 18 | 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500 Up :648 19 | Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500 20 | Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138 21 | 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750 22 | Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000 23 | 24 | #Lag1-Lag5 gives previous day's price 25 | 26 | ?Smarket # Daily percentage returns for the S&P 500 stock index between 2001 and 2005. 27 | 28 | pairs(Smarket,col=Smarket$Direction) #Create a matrix of scatterplots and make color contrast for the binary classes 29 | # Binary Logistic regression model, using Direction as outcome variable 30 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, 31 | data=Smarket,family=binomial) 32 | summary(glm.fit) #Gives summary output 33 | Call: 34 | glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 35 | Volume, family = binomial, data = Smarket) 36 | 37 | Deviance Residuals: 38 | Min 1Q Median 3Q Max 39 | -1.446 -1.203 1.065 1.145 1.326 40 | 41 | Coefficients: 42 | Estimate Std. Error z value Pr(>|z|) 43 | (Intercept) -0.126000 0.240736 -0.523 0.601 44 | Lag1 -0.073074 0.050167 -1.457 0.145 45 | Lag2 -0.042301 0.050086 -0.845 0.398 #Nothing is significant in model 46 | Lag3 0.011085 0.049939 0.222 0.824 47 | Lag4 0.009359 0.049974 0.187 0.851 48 | Lag5 0.010313 0.049511 0.208 0.835 49 | Volume 0.135441 0.158360 0.855 0.392 50 | 51 | (Dispersion parameter for binomial family taken to be 1) 52 | 53 | Null deviance: 1731.2 on 1249 degrees of freedom #Null deviance is the deviance of mean 54 | Residual deviance: 1727.6 on 1243 degrees of freedom #Deviance of residuals 55 | AIC: 1741.6 56 | 57 | Number of Fisher Scoring iterations: 3 58 | 59 | 60 | glm.probs=predict(glm.fit,type="response") #Give predicted probabilities with command type = "response" 61 | glm.probs[1:5] #Show the 1st five observations (rows) of predicted values 62 | 1 2 3 4 5 63 | 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 64 | 65 | glm.pred=ifelse(glm.probs>0.5,"Up","Down") #Set threshold at 0.5. 
If True that > 0.5, then "Up"; else, "Down" 66 | 67 | attach(Smarket) 68 | table(glm.pred,Direction) #glm.pred is the predicted values while Direction is the true values 69 | 70 | Direction #This is a Confusion Matrix 71 | glm.pred Down Up 72 | Down 145 141 # The diagonal values are the correct classifications 73 | Up 457 507 # The off-diagonal values are the incorrect classifications 74 | 75 | mean(glm.pred==Direction) #This gives the classification rate of model 76 | [1] 0.5216 # Performed slightly better than chance (0.50) 77 | 78 | 79 | 80 | # Make training and test set 81 | train = Year<2005 #Create training data set 82 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, 83 | data=Smarket,family=binomial, subset=train) #The subset = train where train is < 2005 84 | 85 | #Use the remaining !train for the test data set 86 | glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") type = "response" #gives predicted probabilities 87 | 88 | glm.pred=ifelse(glm.probs >0.5,"Up","Down") #Make an up/down variable, if true, then "Up", else false, then "Down" 89 | Direction.2005=Smarket$Direction[!train] #Make a variable of the true values 90 | 91 | table(glm.pred,Direction.2005) #Give table of classification of test data set with predicted against true values 92 | Direction.2005 93 | glm.pred Down Up #This is a Confusion Matrix 94 | Down 77 97 #Off-diagonal values are the mistakes 95 | Up 34 44 #Diagonal values are where we got it right 96 | 97 | mean(glm.pred==Direction.2005) #Classification rate 98 | [1] 0.4801587 #The mean is less than 50%, which is the null rate, which could be overfitting of model 99 | 100 | 101 | #Fit smaller model (only include 2 predictors, lag1 and lag2) 102 | glm.fit=glm(Direction~Lag1+Lag2, 103 | data=Smarket,family=binomial, subset=train) 104 | summary(glm.fit) 105 | Call: 106 | glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Smarket, 107 | subset = train) 108 | 109 | Deviance Residuals: 110 | Min 1Q Median 3Q Max 111 | -1.345 -1.188 1.074 1.164 1.326 112 | 113 | Coefficients: 114 | Estimate Std. 
Error z value Pr(>|z|) 115 | (Intercept) 0.03222 0.06338 0.508 0.611 116 | Lag1 -0.05562 0.05171 -1.076 0.282 117 | Lag2 -0.04449 0.05166 -0.861 0.389 118 | 119 | (Dispersion parameter for binomial family taken to be 1) 120 | 121 | Null deviance: 1383.3 on 997 degrees of freedom #Deviance of the mean 122 | Residual deviance: 1381.4 on 995 degrees of freedom #Residual deviance 123 | AIC: 1387.4 124 | 125 | Number of Fisher Scoring iterations: 3 126 | 127 | glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") 128 | glm.pred=ifelse(glm.probs >0.5,"Up","Down") 129 | table(glm.pred,Direction.2005) 130 | 131 | Direction.2005 #Confusion Matrix 132 | glm.pred Down Up 133 | Down 35 35 #Correct classifications are on the diagonal 134 | Up 76 106 #Incorrect classifcations are on the off-diagonal 135 | mean(glm.pred==Direction.2005) #This is the classification rate of model 136 | [1] 0.5595238 #Mean shows a better classification rate than previous dataset 137 | 138 | 106/(76+106) 139 | [1] 0.5824176 #Correct classification rate for "Up" when predicting "Up" is 0.58 140 | 141 | 142 | 143 | require (ISLR) # This has the dataset on the Stock market 144 | require(MASS) # This has the Linear Discriminant Analysis commands that we need 145 | ## Linear Discriminant Analysis 146 | lda.fit=lda(Direction~Lag1+Lag2,data=Smarket, subset=Year<2005) 147 | lda.fit 148 | Call: 149 | lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = Year < 150 | 2005) 151 | 152 | Prior probabilities of groups: #Gives proportions of Up and Down in dataset 153 | Down Up 154 | 0.491984 0.508016 155 | 156 | Group means: #Summary of group means 157 | Lag1 Lag2 158 | Down 0.04279022 0.03389409 159 | Up -0.03954635 -0.03132544 160 | 161 | Coefficients of linear discriminants: #LDA coefficients 162 | LD1 163 | Lag1 -0.6420190 164 | Lag2 -0.5135293 165 | 166 | plot(lda.fit) 167 | #Create a test dataset 168 | Smarket.2005=subset(Smarket,Year==2005) #dataframe is Smarket, subset out the observations with Year is 2005 169 | lda.pred=predict(lda.fit,Smarket.2005) 170 | lda.pred[1:5,] #Look at the first 5 observations of the predicted values 171 | Error in lda.pred[1:5, ] : incorrect number of dimensions #Error because data is not in matrix form 172 | class(lda.pred) #What form is the dataset in? List. 173 | [1] "list" 174 | data.frame(lda.pred)[1:5,] #Cast to a dataframe and then try to look at first 5 observations 175 | class posterior.Down posterior.Up LD1 176 | 999 Up 0.4901792 0.5098208 0.08293096 177 | 1000 Up 0.4792185 0.5207815 0.59114102 178 | 1001 Up 0.4668185 0.5331815 1.16723063 179 | 1002 Up 0.4740011 0.5259989 0.83335022 180 | 1003 Up 0.4927877 0.5072123 -0.03792892 181 | table(lda.pred$class,Smarket.2005$Direction) #Create a table of the predicted classes against the true classes 182 | Down Up #This is a Confusion Matrix 183 | Down 35 35 #Diagonal elements are the correct classification 184 | Up 76 106 #Off-diagonal elements are the mistakes 185 | mean(lda.pred$class==Smarket.2005$Direction) #Classification rate of the model, which is 56% 186 | [1] 0.5595238 #Model performs 6% better than chance (flipping a coin) 187 | 188 | 189 | 190 | ## K-Nearest Neighbors (very handy to have in toolbox, effective most of the time) 191 | library(class) 192 | ?knn 193 | k-nearest neighbour classification for test set from training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. 
If there are ties for the kth nearest vector, all candidates are included in the vote. 194 | attach(Smarket) #Put variable names in workspace 195 | ls() # Check to see it is there 196 | Xlag=cbind(Lag1,Lag2) #Make a matrix with lag1 and lag2 197 | Xlag[1:5, ] #Take a look at first five observations 198 | train=Year<2005 # Designate train as Year less than 2005 199 | #Make a k-Nearest Neighbor model where knn(x-variables from train, x-variables from test, class labels from train, k = , ) 200 | knn.pred=knn(Xlag[train,],Xlag[!train,],Direction[train],k=1) 201 | table(knn.pred,Direction[!train]) #Generate a Confusion Matrix with predicted againsted true values 202 | knn.pred Down Up 203 | Down 43 58 204 | Up 68 83 205 | mean(knn.pred==Direction[!train]) #Obtain the mean, which gives model performance on prediction 206 | [1] 0.5 #Model prediction rate is 50%, which is the same as flipping a coin 207 | -------------------------------------------------------------------------------- /ch5.R: -------------------------------------------------------------------------------- 1 | ### Week 4 Chapter 5 Statistical Learning 2 | 3 | # LOOCV, K-Fold Cross-Validation, and the Bootstrap (Resampling methods) 4 | require(ISLR) # require function is bool and will return false if pkg does not exist 5 | require(boot) # Use the boot package 6 | ?cv.glm # This function calculates the estimated K-fold cross-validation prediction error for generalized linear models. 7 | plot(mpg~horsepower,data=Auto) # As horsepower increases, mpg decreases 8 | 9 | ## LOOCV (Leave One Out Cross-Validation) 10 | glm.fit=glm(mpg~horsepower, data=Auto) # Fits a linear model if family is not specified (can use glm) 11 | # Fits a model repeatedly n times while leaving out 1 observation each time. Makes a fit on all other data. 12 | # Makes a prediction on the x observation that was left out. 13 | cv.glm(Auto,glm.fit)$delta #pretty slow (doesnt use formula (5.2) on page 180) 14 | [1] 24.23151 24.23114 15 | # Delta is vector of length two. The first component is the raw cross-validation estimate of prediction error. 16 | # The second component is the adjusted cross-validation estimate. 17 | # The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation. 18 | 19 | ##Lets write a simple function to use formula (5.2) Misclassification error: 1/n* SUMMATION(yi - yhat(-i))^2 20 | loocv=function(fit){ # How much does i contribute to its own fit? A measure of self-influence of i. 21 | h=lm.influence(fit)$h # This is the formula computed: 1/n* SUMMATION(yi - yhati)^2/(1-Hii)^2 22 | mean((residuals(fit)/(1-h))^2) # h = Hii is the diagonal of the hat matrix 23 | } 24 | ## Now we try it out 25 | loocv(glm.fit) 26 | [1] 24.23151 27 | 28 | cv.error=rep(0,5) # Create vector to collect the errors 29 | degree=1:5 # Now fit polynomials of different degrees to the data 30 | for(d in degree){ 31 | glm.fit=glm(mpg~poly(horsepower,d), data=Auto) 32 | cv.error[d]=loocv(glm.fit) # Compute error and put in error vector 33 | } 34 | plot(degree,cv.error,type="b") # Plot error against degree. It shows quadratic does the best in terms of lowest error 35 | 36 | ## 10-fold CV Divide data into 10 parts. 1 part is test set. 9 parts is training set. Fits the model 10 times, switching out test set. 
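# A hedged, manual sketch of what cv.glm(..., K=10) does for a single model
# (the fold assignment and object names below are illustrative):
set.seed(17)
folds = sample(rep(1:10, length = nrow(Auto)))   # randomly assign each observation to one of 10 folds
cv10 = sapply(1:10, function(k){
  fit = glm(mpg ~ horsepower, data = Auto[folds != k, ])               # train on the other 9 folds
  mean((Auto$mpg[folds == k] - predict(fit, Auto[folds == k, ]))^2)    # test MSE on the held-out fold
})
mean(cv10)   # 10-fold CV estimate for the degree-1 fit; compare with cv.error10[1] below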
37 | 38 | cv.error10=rep(0,5) # Create a vector to accept the errors 39 | for(d in degree){ 40 | glm.fit=glm(mpg~poly(horsepower,d), data=Auto) 41 | cv.error10[d]=cv.glm(Auto,glm.fit,K=10)$delta[1] # Specify K = 10 for 10 folds 42 | } 43 | lines(degree,cv.error10,type="b",col="red") # Similar story as before. 44 | 45 | # If given a choice, pick K-fold CV over LOOCV since it is cheaper to compute. 46 | 47 | 48 | 49 | ## Bootstrap--repeated, random sampling of training observations. Invented by Brad Efron. 50 | ## Some observations can be represented more than once and some not at all. 51 | ## Minimum risk investment - Section 5.2 2 investments called X and Y 52 | 53 | alpha=function(x,y){ # VAR(X), VAR(Y) 54 | vx=var(x) # alpha = VAR(Y) - COV(X,Y)/VAR(X)+VAR(Y)-2COV(X,Y) 55 | vy=var(y) # What is the standard error (sampling variablity) of alpha? Don't know. 56 | cxy=cov(x,y) # Need bootstrap 57 | (vy-cxy)/(vx+vy-2*cxy) 58 | } 59 | alpha(Portfolio$X,Portfolio$Y) # Portfolio is the dataset 60 | [1] 0.5758321 61 | 62 | ## What is the standard error of alpha? 63 | ## Need a wrapper that allows the bootstrap to work. Takes a dataframe and an index (over the rows, 1 to n, can be repeats) 64 | alpha.fn=function(data, index){ # index tells you which observations gets represented in the sampling 65 | with(data[index,],alpha(X,Y)) # with is a function that takes dataframe as 1st argument. It says 66 | } # using this data in the dataframe, execute these commands. 67 | 68 | alpha.fn(Portfolio,1:100) 69 | [1] 0.5758321 70 | 71 | #Now, run the bootstrap. Set the random number seed. 72 | set.seed(1) 73 | alpha.fn (Portfolio,sample(1:100,100,replace=TRUE)) # Sample 1 to 100 with replacement 74 | [1] 0.5963833 75 | 76 | boot.out=boot(Portfolio,alpha.fn,R=1000) # Do a 1000 bootstraps 77 | boot.out # Gives a summary of the bootstrap 78 | ORDINARY NONPARAMETRIC BOOTSTRAP 79 | 80 | 81 | Call: 82 | boot(data = Portfolio, statistic = alpha.fn, R = 1000) 83 | 84 | 85 | Bootstrap Statistics : 86 | original bias std. error # Gives estimate, bias, and standard error 87 | t1* 0.5758321 -7.315422e-05 0.08861826 88 | 89 | plot(boot.out) # histogram (symmetrical distribution, Gaussian), QQ-plot (lines up as a straight line: Gaussian) 90 | 91 | # The bootstrap can give you reliable estimates of standard error from nasty statistics. 92 | -------------------------------------------------------------------------------- /ch6.Rmd: -------------------------------------------------------------------------------- 1 | Model Selection 2 | ================ 3 | 4 | This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages, 5 | and a very nice way of distributing an analysis. It has some very simple syntax rules. 6 | 7 | 8 | ```{r} 9 | library(ISLR) 10 | summary(Hitters) 11 | ``` 12 | There are some missing values here, so before we proceed we will remove them: 13 | 14 | ```{r} 15 | Hitters=na.omit(Hitters) 16 | with(Hitters,sum(is.na(Salary))) 17 | ``` 18 | 19 | 20 | 21 | Best Subset regression 22 | ------------------------ 23 | We will now use the package `leaps` to evaluate all the best-subset models. 24 | ```{r} 25 | library(leaps) 26 | regfit.full=regsubsets(Salary~.,data=Hitters) 27 | summary(regfit.full) 28 | ``` 29 | It gives by default best-subsets up to size 8; lets increase that to 19, i.e. 
all the variables 30 | ```{r} 31 | regfit.full=regsubsets(Salary~.,data=Hitters, nvmax=19) 32 | reg.summary=summary(regfit.full) 33 | names(reg.summary) 34 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp") 35 | which.min(reg.summary$cp) 36 | points(10,reg.summary$cp[10],pch=20,col="red") 37 | ``` 38 | There is a plot method for the `regsubsets` object. 39 | ```{r} 40 | plot(regfit.full,scale="Cp") 41 | coef(regfit.full,10) 42 | ``` 43 | 44 | 45 | 46 | Forward Stepwise Selection 47 | -------------------------- 48 | Here we use the `regsubsets` function but specify the `method="forward"` option: 49 | ```{r} 50 | regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward") 51 | summary(regfit.fwd) 52 | plot(regfit.fwd,scale="Cp") 53 | ``` 54 | 55 | 56 | 57 | 58 | Model Selection Using a Validation Set 59 | --------------------------------------- 60 | Lets make a training and validation set, so that we can choose a good subset model. 61 | We will do it using a slightly different approach from what was done in the book. 62 | ```{r} 63 | dim(Hitters) 64 | set.seed(1) 65 | train=sample(seq(263),180,replace=FALSE) 66 | train 67 | regfit.fwd=regsubsets(Salary~.,data=Hitters[train,],nvmax=19,method="forward") 68 | ``` 69 | Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no predict method for `regsubsets`. 70 | ```{r} 71 | val.errors=rep(NA,19) 72 | x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index! 73 | for(i in 1:19){ 74 | coefi=coef(regfit.fwd,id=i) 75 | pred=x.test[,names(coefi)]%*%coefi 76 | val.errors[i]=mean((Hitters$Salary[-train]-pred)^2) 77 | } 78 | plot(sqrt(val.errors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b") 79 | points(sqrt(regfit.fwd$rss[-1]/180),col="blue",pch=19,type="b") 80 | legend("topright",legend=c("Training","Validation"),col=c("blue","black"),pch=19) 81 | ``` 82 | As we expect, the training error goes down monotonically as the model gets bigger, but not so 83 | for the validation error. 84 | 85 | This was a little tedious - not having a predict method for `regsubsets`. So we will write one! 86 | ```{r} 87 | predict.regsubsets=function(object,newdata,id,...){ 88 | form=as.formula(object$call[[2]]) 89 | mat=model.matrix(form,newdata) 90 | coefi=coef(object,id=id) 91 | mat[,names(coefi)]%*%coefi 92 | } 93 | ``` 94 | 95 | 96 | 97 | 98 | Model Selection by Cross-Validation 99 | ----------------------------------- 100 | We will do 10-fold cross-validation. It's really easy! 101 | ```{r} 102 | set.seed(11) 103 | folds=sample(rep(1:10,length=nrow(Hitters))) 104 | folds 105 | table(folds) 106 | cv.errors=matrix(NA,10,19) 107 | for(k in 1:10){ 108 | best.fit=regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax=19,method="forward") 109 | for(i in 1:19){ 110 | pred=predict(best.fit,Hitters[folds==k,],id=i) 111 | cv.errors[k,i]=mean( (Hitters$Salary[folds==k]-pred)^2) 112 | } 113 | } 114 | rmse.cv=sqrt(apply(cv.errors,2,mean)) 115 | plot(rmse.cv,pch=19,type="b") 116 | ``` 117 | 118 | 119 | 120 | Ridge Regression and the Lasso 121 | ------------------------------- 122 | We will use the package `glmnet`, which does not use the model formula language, so we will set up an `x` and `y`. 123 | ```{r} 124 | library(glmnet) 125 | x=model.matrix(Salary~.-1,data=Hitters) 126 | y=Hitters$Salary 127 | ``` 128 | First we will fit a ridge-regression model.
This is achieved by calling `glmnet` with `alpha=0` (see the helpfile). There is also a `cv.glmnet` function which will do the cross-validation for us. 129 | ```{r} 130 | fit.ridge=glmnet(x,y,alpha=0) 131 | plot(fit.ridge,xvar="lambda",label=TRUE) 132 | cv.ridge=cv.glmnet(x,y,alpha=0) 133 | plot(cv.ridge) 134 | ``` 135 | Now we fit a lasso model; for this we use the default `alpha=1` 136 | ```{r} 137 | fit.lasso=glmnet(x,y) 138 | plot(fit.lasso,xvar="lambda",label=TRUE) 139 | cv.lasso=cv.glmnet(x,y) 140 | plot(cv.lasso) 141 | coef(cv.lasso) 142 | ``` 143 | 144 | Suppose we want to use our earlier train/validation division to select the `lambda` for the lasso. 145 | This is easy to do. 146 | ```{r} 147 | lasso.tr=glmnet(x[train,],y[train]) 148 | lasso.tr 149 | pred=predict(lasso.tr,x[-train,]) 150 | dim(pred) 151 | rmse= sqrt(apply((y[-train]-pred)^2,2,mean)) 152 | plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)") 153 | lam.best=lasso.tr$lambda[order(rmse)[1]] 154 | lam.best 155 | coef(lasso.tr,s=lam.best) 156 | ``` 157 | -------------------------------------------------------------------------------- /ch7.Rmd: -------------------------------------------------------------------------------- 1 | Nonlinear Models 2 | ======================================================== 3 | Here we explore the use of nonlinear models using some tools in R 4 | 5 | ```{r} 6 | require(ISLR) 7 | attach(Wage) 8 | ``` 9 | 10 | Polynomials 11 | ------------ 12 | 13 | First we will use polynomials, and focus on a single predictor age: 14 | 15 | ```{r} 16 | fit=lm(wage~poly(age,4),data=Wage) 17 | summary(fit) 18 | ``` 19 | 20 | The `poly()` function generates a basis of *orthogonal polynomials*. 21 | Lets make a plot of the fitted function, along with the standard errors of the fit. 22 | 23 | ```{r fig.width=7, fig.height=6} 24 | agelims=range(age) 25 | age.grid=seq(from=agelims[1],to=agelims[2]) 26 | preds=predict(fit,newdata=list(age=age.grid),se=TRUE) 27 | se.bands=cbind(preds$fit+2*preds$se,preds$fit-2*preds$se) 28 | plot(age,wage,col="darkgrey") 29 | lines(age.grid,preds$fit,lwd=2,col="blue") 30 | matlines(age.grid,se.bands,col="blue",lty=2) 31 | ``` 32 | 33 | There are other more direct ways of doing this in R. For example 34 | 35 | ```{r} 36 | fita=lm(wage~age+I(age^2)+I(age^3)+I(age^4),data=Wage) 37 | summary(fita) 38 | ``` 39 | 40 | Here `I()` is a *wrapper* function; we need it because `age^2` means something to the formula language, 41 | while `I(age^2)` is protected. 42 | The coefficients are different to those we got before! However, the fits are the same: 43 | 44 | ```{r} 45 | plot(fitted(fit),fitted(fita)) 46 | ``` 47 | 48 | By using orthogonal polynomials in this simple way, it turns out that we can separately test 49 | for each coefficient. So if we look at the summary again, we can see that the linear, quadratic 50 | and cubic terms are significant, but not the quartic. 51 | 52 | ```{r} 53 | summary(fit) 54 | ``` 55 | 56 | This only works with linear regression, and if there is a single predictor. In general we would use `anova()` 57 | as this next example demonstrates. 58 | 59 | ```{r} 60 | fita=lm(wage~education,data=Wage) 61 | fitb=lm(wage~education+age,data=Wage) 62 | fitc=lm(wage~education+poly(age,2),data=Wage) 63 | fitd=lm(wage~education+poly(age,3),data=Wage) 64 | anova(fita,fitb,fitc,fitd) 65 | 66 | ``` 67 | 68 | ### Polynomial logistic regression 69 | 70 | Now we fit a logistic regression model to a binary response variable, 71 | constructed from `wage`. 
We code the big earners (`>250K`) as 1, else 0. 72 | 73 | ```{r} 74 | fit=glm(I(wage>250) ~ poly(age,3), data=Wage, family=binomial) 75 | summary(fit) 76 | preds=predict(fit,list(age=age.grid),se=T) 77 | se.bands=preds$fit + cbind(fit=0,lower=-2*preds$se,upper=2*preds$se) 78 | se.bands[1:5,] 79 | ``` 80 | 81 | We have done the computations on the logit scale. To transform we need to apply the inverse logit 82 | mapping 83 | $$p=\frac{e^\eta}{1+e^\eta}.$$ 84 | (Here we have used the ability of MarkDown to interpret TeX expressions.) 85 | We can do this simultaneously for all three columns of `se.bands`: 86 | 87 | ```{r} 88 | prob.bands=exp(se.bands)/(1+exp(se.bands)) 89 | matplot(age.grid,prob.bands,col="blue",lwd=c(2,1,1),lty=c(1,2,2),type="l",ylim=c(0,.1)) 90 | points(jitter(age),I(wage>250)/10,pch="|",cex=.5) 91 | ``` 92 | 93 | Splines 94 | ------- 95 | Splines are more flexible than polynomials, but the idea is rather similar. 96 | Here we will explore cubic splines. 97 | 98 | ```{r} 99 | require(splines) 100 | fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage) 101 | plot(age,wage,col="darkgrey") 102 | lines(age.grid,predict(fit,list(age=age.grid)),col="darkgreen",lwd=2) 103 | abline(v=c(25,40,60),lty=2,col="darkgreen") 104 | ``` 105 | 106 | The smoothing spline does not require knot selection, but it does have a smoothing parameter, 107 | which can conveniently be specified via the effective degrees of freedom or `df`. 108 | 109 | ```{r} 110 | fit=smooth.spline(age,wage,df=16) 111 | lines(fit,col="red",lwd=2) 112 | ``` 113 | 114 | Or we can use LOO cross-validation to select the smoothing parameter for us automatically: 115 | 116 | ```{r} 117 | fit=smooth.spline(age,wage,cv=TRUE) 118 | lines(fit,col="purple",lwd=2) 119 | fit 120 | ``` 121 | 122 | Generalized Additive Models 123 | --------------------------- 124 | 125 | So far we have focused on fitting models with mostly single nonlinear terms. 126 | The `gam` package makes it easier to work with multiple nonlinear terms. In addition 127 | it knows how to plot these functions and their standard errors. 128 | 129 | ```{r fig.width=10, fig.height=5} 130 | require(gam) 131 | gam1=gam(wage~s(age,df=4)+s(year,df=4)+education,data=Wage) 132 | par(mfrow=c(1,3)) 133 | plot(gam1,se=T) 134 | gam2=gam(I(wage>250)~s(age,df=4)+s(year,df=4)+education,data=Wage,family=binomial) 135 | plot(gam2) 136 | ``` 137 | 138 | Lets see if we need a nonlinear term for year 139 | 140 | ```{r} 141 | gam2a=gam(I(wage>250)~s(age,df=4)+year+education,data=Wage,family=binomial) 142 | anova(gam2a,gam2,test="Chisq") 143 | ``` 144 | 145 | One nice feature of the `gam` package is that it knows how to plot the functions nicely, 146 | even for models fit by `lm` and `glm`. 147 | 148 | ```{r fig.width=10, fig.height=5} 149 | par(mfrow=c(1,3)) 150 | lm1=lm(wage~ns(age,df=4)+ns(year,df=4)+education,data=Wage) 151 | plot.gam(lm1,se=T) 152 | ``` 153 | 154 | 155 | 156 | 157 | 158 | 159 | -------------------------------------------------------------------------------- /ch8.Rmd: -------------------------------------------------------------------------------- 1 | Decision Trees 2 | ======================================================== 3 | 4 | We will have a look at the `Carseats` data using the `tree` package in R, as in the lab in the book. 5 | We create a binary response variable `High` (for high sales), and we include it in the same dataframe.
6 | ```{r} 7 | require(ISLR) 8 | require(tree) 9 | attach(Carseats) 10 | hist(Sales) 11 | High=ifelse(Sales<=8,"No","Yes") 12 | Carseats=data.frame(Carseats, High) 13 | ``` 14 | Now we fit a tree to these data, and summarize and plot it. Notice that we have to _exclude_ `Sales` from the right-hand side of the formula, because the response is derived from it. 15 | ```{r} 16 | tree.carseats=tree(High~.-Sales,data=Carseats) 17 | summary(tree.carseats) 18 | plot(tree.carseats) 19 | text(tree.carseats,pretty=0) 20 | ``` 21 | For a detailed summary of the tree, print it: 22 | ```{r} 23 | tree.carseats 24 | ``` 25 | Lets create a training and test set (250,150) split of the 400 observations, grow the tree on the training set, and evaluate its performance on the test set. 26 | ```{r} 27 | set.seed(1011) 28 | train=sample(1:nrow(Carseats),250) 29 | tree.carseats=tree(High~.-Sales,Carseats,subset=train) 30 | plot(tree.carseats);text(tree.carseats,pretty=0) 31 | tree.pred=predict(tree.carseats,Carseats[-train,],type="class") 32 | with(Carseats[-train,],table(tree.pred,High)) 33 | (72+33)/150 34 | ``` 35 | This tree was grown to full depth, and might be too variable. We now use CV to prune it. 36 | ```{r} 37 | cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass) 38 | cv.carseats 39 | plot(cv.carseats) 40 | prune.carseats=prune.misclass(tree.carseats,best=13) 41 | plot(prune.carseats);text(prune.carseats,pretty=0) 42 | ``` 43 | Now lets evaluate this pruned tree on the test data. 44 | ```{r} 45 | tree.pred=predict(prune.carseats,Carseats[-train,],type="class") 46 | with(Carseats[-train,],table(tree.pred,High)) 47 | (72+32)/150 48 | ``` 49 | It has done about the same as our original tree. So pruning did not hurt us wrt misclassification errors, and gave us a simpler tree. 50 | 51 | Random Forests and Boosting 52 | ============================ 53 | 54 | These methods use trees as building blocks to build more complex models. Here we will use the Boston housing data to explore random forests and boosting. These data are in the `MASS` package. 55 | It gives housing values and other statistics in each of 506 suburbs of Boston based on a 1970 census. 56 | 57 | Random Forests 58 | -------------- 59 | Random forests build lots of bushy trees, and then average them to reduce the variance. 60 | 61 | ```{r} 62 | require(randomForest) 63 | require(MASS) 64 | set.seed(101) 65 | dim(Boston) 66 | train=sample(1:nrow(Boston),300) 67 | ?Boston 68 | ``` 69 | Lets fit a random forest and see how well it performs. We will use the response `medv`, the median housing value (in \$1K dollars) 70 | 71 | ```{r} 72 | rf.boston=randomForest(medv~.,data=Boston,subset=train) 73 | rf.boston 74 | ``` 75 | The MSR and % variance explained are based on OOB or _out-of-bag_ estimates, a very clever device in random forests to get honest error estimates. The model reports that `mtry=4`, which is the number of variables randomly chosen at each split. Since $p=13$ here, we could try all 13 possible values of `mtry`. We will do so, record the results, and make a plot. 
76 | 77 | ```{r} 78 | oob.err=double(13) 79 | test.err=double(13) 80 | for(mtry in 1:13){ 81 | fit=randomForest(medv~.,data=Boston,subset=train,mtry=mtry,ntree=400) 82 | oob.err[mtry]=fit$mse[400] 83 | pred=predict(fit,Boston[-train,]) 84 | test.err[mtry]=with(Boston[-train,],mean((medv-pred)^2)) 85 | cat(mtry," ") 86 | } 87 | matplot(1:mtry,cbind(test.err,oob.err),pch=19,col=c("red","blue"),type="b",ylab="Mean Squared Error") 88 | legend("topright",legend=c("Test","OOB"),pch=19,col=c("red","blue")) 89 | ``` 90 | 91 | Not too difficult! Although the test-error curve drops below the OOB curve, these are estimates based on data, and so have their own standard errors (which are typically quite large). Notice that the points at the end with `mtry=13` correspond to bagging. 92 | 93 | Boosting 94 | -------- 95 | Boosting builds lots of smaller trees. Unlike random forests, each new tree in boosting tries to patch up the deficiencies of the current ensemble. 96 | ```{r} 97 | require(gbm) 98 | boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=10000,shrinkage=0.01,interaction.depth=4) 99 | summary(boost.boston) 100 | plot(boost.boston,i="lstat") 101 | plot(boost.boston,i="rm") 102 | ``` 103 | Lets make a prediction on the test set. With boosting, the number of trees is a tuning parameter, and if we have too many we can overfit. So we should use cross-validation to select the number of trees. We will leave this as an exercise. Instead, we will compute the test error as a function of the number of trees, and make a plot. 104 | 105 | ```{r} 106 | n.trees=seq(from=100,to=10000,by=100) 107 | predmat=predict(boost.boston,newdata=Boston[-train,],n.trees=n.trees) 108 | dim(predmat) 109 | berr=with(Boston[-train,],apply( (predmat-medv)^2,2,mean)) 110 | plot(n.trees,berr,pch=19,ylab="Mean Squared Error", xlab="# Trees",main="Boosting Test Error") 111 | abline(h=min(test.err),col="red") 112 | ``` 113 | 114 | 115 | 116 | -------------------------------------------------------------------------------- /ch9.Rmd: -------------------------------------------------------------------------------- 1 | SVM 2 | ======================================================== 3 | To demonstrate the SVM, it is easiest to work in low dimensions, so we can see the data. 4 | 5 | Linear SVM classifier 6 | --------------------- 7 | Lets generate some data in two dimensions, and make them a little separated. 8 | ```{r} 9 | set.seed(10111) 10 | x=matrix(rnorm(40),20,2) 11 | y=rep(c(-1,1),c(10,10)) 12 | x[y==1,]=x[y==1,]+1 13 | plot(x,col=y+3,pch=19) 14 | ``` 15 | Now we will load the package `e1071` which contains the `svm` function we will use. We then compute the fit. Notice that we have to specify a `cost` parameter, which is a tuning parameter. 16 | ```{r} 17 | library(e1071) 18 | dat=data.frame(x,y=as.factor(y)) 19 | svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE) 20 | print(svmfit) 21 | plot(svmfit,dat) 22 | ``` 23 | As mentioned in the chapter, the plot function is somewhat crude, and plots X2 on the horizontal axis (unlike what R would do automatically for a matrix). Lets see how we might make our own plot. 24 | 25 | The first thing we will do is make a grid of values for X1 and X2. We will write a function to do that, 26 | in case we want to reuse it. It uses the handy function `expand.grid`, and produces the coordinates of `n*n` points on a lattice covering the domain of `x`. Having made the lattice, we make a prediction at each point on the lattice.
We then plot the lattice, color-coded according to the classification. Now we can see the decision boundary. 27 | 28 | The support points (points on the margin, or on the wrong side of the margin) are indexed in the `$index` component of the fit. 29 | 30 | ```{r} 31 | make.grid=function(x,n=75){ 32 | grange=apply(x,2,range) 33 | x1=seq(from=grange[1,1],to=grange[2,1],length=n) 34 | x2=seq(from=grange[1,2],to=grange[2,2],length=n) 35 | expand.grid(X1=x1,X2=x2) 36 | } 37 | xgrid=make.grid(x) 38 | ygrid=predict(svmfit,xgrid) 39 | plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=20,cex=.2) 40 | points(x,col=y+3,pch=19) 41 | points(x[svmfit$index,],pch=5,cex=2) 42 | ``` 43 | The `svm` function is not too friendly, in that we have to do some work to get back the linear coefficients, as described in the text. Probably the reason is that this only makes sense for linear kernels, and the function is more general. Here we will use a formula to extract the coefficients; for those interested in where this comes from, have a look in chapter 12 of ESL ("Elements of Statistical Learning"). 44 | 45 | We extract the linear coefficients, and then using simple algebra, we include the decision boundary and the two margins. 46 | 47 | ```{r} 48 | beta=drop(t(svmfit$coefs)%*%x[svmfit$index,]) 49 | beta0=svmfit$rho 50 | plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=20,cex=.2) 51 | points(x,col=y+3,pch=19) 52 | points(x[svmfit$index,],pch=5,cex=2) 53 | abline(beta0/beta[2],-beta[1]/beta[2]) 54 | abline((beta0-1)/beta[2],-beta[1]/beta[2],lty=2) 55 | abline((beta0+1)/beta[2],-beta[1]/beta[2],lty=2) 56 | ``` 57 | Just like for the other models in this book, the tuning parameter `C` has to be selected. 58 | Different values will give different solutions. Rerun the code above, but using `C=1`, and see what we mean. One can use cross-validation to do this. 59 | 60 | 61 | Nonlinear SVM 62 | -------------- 63 | Instead, we will run the SVM on some data where a non-linear boundary is called for. We will use the mixture data from ESL 64 | 65 | ```{r} 66 | load(url("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/ESL.mixture.rda")) 67 | names(ESL.mixture) 68 | rm(x,y) 69 | attach(ESL.mixture) 70 | ``` 71 | These data are also two dimensional. Lets plot them and fit a nonlinear SVM, using a radial kernel. 72 | ```{r} 73 | plot(x,col=y+1) 74 | dat=data.frame(y=factor(y),x) 75 | fit=svm(factor(y)~.,data=dat,scale=FALSE,kernel="radial",cost=5) 76 | ``` 77 | Now we are going to create a grid, as before, and make predictions on the grid. 78 | These data have the grid points for each variable included on the data frame. 79 | ```{r} 80 | xgrid=expand.grid(X1=px1,X2=px2) 81 | ygrid=predict(fit,xgrid) 82 | plot(xgrid,col=as.numeric(ygrid),pch=20,cex=.2) 83 | points(x,col=y+1,pch=19) 84 | ``` 85 | We can go further, and have the predict function produce the actual function estimates at each of our grid points. We can include the actual decision boundary on the plot by making use of the contour function. On the dataframe is also `prob`, which is the true probability of class 1 for these data, at the gridpoints. If we plot its 0.5 contour, that will give us the _Bayes Decision Boundary_, which is the best one could ever do. 
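As a quick optional aside (a minimal sketch, not part of the original lab), we can also ask how often the radial SVM's grid classification `ygrid` from the previous chunk agrees with the Bayes rule, which assigns class 1 wherever `prob` exceeds 0.5. This assumes `prob` is stored in the same order as `expand.grid(X1=px1,X2=px2)` (X1 varying fastest), consistent with the `matrix(prob,69,99)` reshape used below; `bayes.class` is just a helper name introduced here.

```{r}
bayes.class=as.numeric(prob>.5)                      # Bayes rule on the grid: class 1 where P(class 1) > 0.5
mean(as.character(ygrid)==as.character(bayes.class)) # fraction of grid points where the SVM agrees with the Bayes rule
```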
86 | ```{r} 87 | func=predict(fit,xgrid,decision.values=TRUE) 88 | func=attributes(func)$decision 89 | xgrid=expand.grid(X1=px1,X2=px2) 90 | ygrid=predict(fit,xgrid) 91 | plot(xgrid,col=as.numeric(ygrid),pch=20,cex=.2) 92 | points(x,col=y+1,pch=19) 93 | 94 | contour(px1,px2,matrix(func,69,99),level=0,add=TRUE) 95 | contour(px1,px2,matrix(prob,69,99),level=.5,add=TRUE,col="blue",lwd=2) 96 | ``` 97 | We see in this case that the radial kernel has done an excellent job. --------------------------------------------------------------------------------