├── .gitattributes ├── 01 - Introduction └── Quiz.md ├── 02 - Overview of Statistical Learning ├── 01 - Introduction to Regression Models Quiz.md ├── 02 - Dimensionality and Structured Models Quiz.md ├── 03 - Model Selection and Bias-Variance Tradeoff Quiz.md ├── 04 - Classification Quiz.md ├── 05 - Introduction to R Quiz.md └── 06 - Chapter 2 Quiz.md ├── 03 - Linear Regression ├── 01 - Simple Linear Regression Quiz.md ├── 02 - Hypothesis Testing and Confidence Intervals Quiz.md ├── 03 - Multiple Linear Regression Quiz.md ├── 04 - Some important questions Quiz.md ├── 05 - Extension of the linear model.md ├── 06 - Linear Regression in R.md └── 07 - Chapter 3 Quiz.md ├── 04 - Classification ├── 01 - Introduction to Classification Problems Quiz.md ├── 02 - Logistic Regression Quiz.md ├── 03 - Multivariate Logistic Regression Quiz.md ├── 04 - Logistic Regression - Case-Control Sampling and Multiclass Quiz.md ├── 05 - Discriminant Analysis Quiz.md ├── 06 - Gaussian Discriminant Analysis - One Variable - Quiz.md ├── 07 - Gaussian Discriminant Analysis - Many Variables Quiz.md ├── 08 - Quadratic Discriminant Analysis and Naive Bayes Quiz.md ├── 09 - Classification in R Quiz.md └── 10 - Chapter 4 Quiz.md ├── 05 - Resampling Methods ├── 01 - Cross Validation Quiz.md ├── 02 - K-Fold Cross Validation Quiz.md ├── 03 - Cross-Validation - The wrong and the right way Quiz.md ├── 04 - The Bootstrap Quiz.md ├── 05 - More on the Bootstrap Quiz.md ├── 06 - Resampling in R Quiz.md └── 07 - Chapter 5 Quiz.md ├── 06 - Linear Model Selection and Regularization ├── 01 - Introduction and Best-Subset Selection Quiz.md ├── 02 - Stepwise Selection Quiz.md ├── 03 - Backward Stepwise Selection Quiz.md ├── 04 - Estimating Test Error Quiz.md ├── 04 - Validation and Cross-Validation Quiz.md ├── 05 - Shrinkage Methods and Ridge Regression Quiz.md ├── 06 - The Lasso Quiz.md ├── 07 - Tuning Parameter Selection Quiz.md ├── 08 - Dimension Reduction Methods Quiz.md ├── 09 - Principal Components Regression and Partial Least Squares Quiz.md ├── 10 - Model Selection in R Quiz.md └── 11 - Chapter 6 Quiz.md ├── 07 - Moving Beyond Linearity ├── 01 - Polynomials and Step Functions Quiz.md ├── 02 - Piecewise-Polynomials and Splines Quiz.md ├── 03 - Smoothing Splines Quiz.md ├── 04 - Generalized Additive Models and Local Regression Quiz.md ├── 05 - Nonlinear Functions in R Quiz.md └── 06 - Chapter 7 Quiz.md ├── 08 - Tree-Based Methods ├── 01 - Tree-Based Methods Quiz.md ├── 02 - More Details on Trees Quiz.md ├── 03 - Classification Trees Quiz.md ├── 04 - Bagging and Random Forest Quiz.md ├── 05 - Boosting Quiz.md ├── 06 - Tree-Based Methods in R Quiz.md └── 07 - Chapter 8 Quiz.md ├── 09 - Support Vector Machines ├── 01 - Optimal Separating Hyperplanes Quiz.md ├── 02 - Supper Vector Classifier Quiz.md ├── 03 - Feature Expansion and the SVM Quiz.md ├── 04 - Example and Comparison with Logistic Regression Quiz.md ├── 05 - SVM in R Quiz.md └── 06 - Chapter 9 Quiz.md ├── 10 - Unsupervised Learning ├── 01 - Principal Components Quiz.md ├── 02 - Higher Order Principal Component Quiz.md ├── 03 - K-Means Clustering Quiz.md ├── 04 - Hierarchical Clustering Quiz.md ├── 05 - Breast Cancer Example Quiz.md ├── 06 - Unsupervised in R Quiz.md └── 07 - Chapter 10 Quiz.md ├── Lectures Files ├── 5.R.RData ├── 7.R.RData ├── ch10.Rmd ├── ch10.html ├── ch2.R ├── ch3.R ├── ch4.R ├── ch5.R ├── ch6.Rmd ├── ch6.html ├── ch7.Rmd ├── ch7.html ├── ch8.Rmd ├── ch8.html ├── ch9.Rmd └── ch9.html ├── Pdf ├── classification-handout.pdf ├── cv_boot-handout.pdf ├── 
introduction-handout.pdf ├── linear_regression-handout.pdf ├── model_selection-handout.pdf ├── nonlinear-handout.pdf ├── statistical_learning-handout.pdf ├── svm-handout.pdf ├── trees-handout.pdf └── unsupervised-handout.pdf └── README.md /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /01 - Introduction/Quiz.md: -------------------------------------------------------------------------------- 1 | # Quiz 2 | 3 | ## 1.2.R1 4 | 5 | Which of the following are supervised learning problems? More than one box can be checked. 6 | 7 | - **Predict whether a website user will click on an ad** 8 | - Find clusters of genes that interact with each other 9 | - **Classify a handwritten digit as 0-9 from labeled examples** 10 | - **Find stocks that are likely to rise** 11 | 12 | ## 1.2.R2 13 | 14 | True or False: The only goal of any supervised learning study is to be able to predict the response very accurately. 15 | 16 | - True 17 | - **False** -------------------------------------------------------------------------------- /02 - Overview of Statistical Learning/01 - Introduction to Regression Models Quiz.md: -------------------------------------------------------------------------------- 1 | # Introduction to Regression Models Quiz 2 | 3 | ## 2.1 R1 4 | 5 | In the expression Sales ≈ f(TV, Radio, Newspaper), "Sales" is the: 6 | 7 | - **Response** 8 | - Training Data 9 | - Independent Variable 10 | - Feature -------------------------------------------------------------------------------- /02 - Overview of Statistical Learning/02 - Dimensionality and Structured Models Quiz.md: -------------------------------------------------------------------------------- 1 | # Dimensionality and Structured Models Quiz 2 | 3 | ## 2.2 R1 4 | 5 | A hypercube with side length 1 in d dimensions is defined to be the set of points (x1, x2, ..., xd) such that 0 <= x_j <= 1 for all j = 1, 2, ..., d. The boundary of the hypercube is defined to be the set of all points such that there exists a j for which 0 <= x_j <= 0.05 or 0.95 <= x_j <= 1 (namely, the boundary is the set of all points that have at least one dimension in the most extreme 10% of possible values). What proportion of the points in a hypercube of dimension 50 are in the boundary? (hint: you may want to calculate the volume of the non-boundary region) 6 | 7 | Please give your answer as a value between 0 and 1 with 3 significant digits. If you think the answer is 50.52%, you should say 0.505: 8 | 9 | The volume of the interior of the hypercube is 0.9^50 ≈ 0.005. Thus, the volume of the boundary is 1 - 0.005 = **0.995**. 10 | -------------------------------------------------------------------------------- /02 - Overview of Statistical Learning/03 - Model Selection and Bias-Variance Tradeoff Quiz.md: -------------------------------------------------------------------------------- 1 | # Model Selection and Bias-Variance Tradeoff Quiz 2 | 3 | ## 2.3 R1 4 | 5 | True or False: A fitted model with more predictors will necessarily have a lower Training Set Error than a model with fewer predictors. 6 | 7 | **False** 8 | 9 | ## 2.3 R2 10 | 11 | While doing a homework assignment, you fit a Linear Model to your data set. You are thinking about changing the Linear Model to a Quadratic one. 
Which of the following is most likely true: 12 | 13 | - Using the Quadratic Model will decrease your Irreducible Error. 14 | - **Using the Quadratic Model will decrease the Bias of your model.** 15 | - Using the Quadratic Model will decrease the Variance of your model 16 | - Using the Quadratic Model will decrease your Reducible Error 17 | -------------------------------------------------------------------------------- /02 - Overview of Statistical Learning/04 - Classification Quiz.md: -------------------------------------------------------------------------------- 1 | # Classification Quiz 2 | 3 | ## 2.4.R1 4 | 5 | Look at the graph given on page 30 of the Chapter 2 lecture slides. Which of the following is most likely true of what would happen to the Test Error curve as we move 1/K further above 1? 6 | 7 | - The Test Errors will increase 8 | - The Test Errors will decrease 9 | - Not enough information is given to decide 10 | - **It does not make sense to have 1/K > 1** -------------------------------------------------------------------------------- /02 - Overview of Statistical Learning/05 - Introduction to R Quiz.md: -------------------------------------------------------------------------------- 1 | # Introduction to R Quiz 2 | 3 | ## 2.R.R1 4 | 5 | You are doing an analysis in R and need to use the 'summary()' function, but you are not exactly sure how it works. Which of the following commands should you run? (There is more than one correct answer, so any one these will earn the point). 6 | 7 | - help(summary) 8 | - ?summary 9 | - man(summary) 10 | - **?summary()** -------------------------------------------------------------------------------- /02 - Overview of Statistical Learning/06 - Chapter 2 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 2 Quiz 2 | 3 | ## 2.Q.1 4 | 5 | For each of the following parts, indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible model. 6 | 7 | The sample size n is extremely large, and the number of predictors p is small: 8 | 9 | **Flexible is better** 10 | 11 | ## 2.Q.2 12 | 13 | The number of predictors p is extremely large, and the sample size n is small: 14 | 15 | **Flexible is worse** 16 | 17 | ## 2.Q.3 18 | 19 | The relationship between the predictors and response is highly non-linear: 20 | 21 | **Flexible is better** 22 | 23 | ## 2.Q.4 24 | 25 | The variance of the error terms, i.e. \sigma^2 = \text{Var}(\epsilon), is extremely high: 26 | 27 | **Flexible is worse** -------------------------------------------------------------------------------- /03 - Linear Regression/01 - Simple Linear Regression Quiz.md: -------------------------------------------------------------------------------- 1 | # Simple Linear Regression Quiz 2 | 3 | ## 3.1.R1 4 | 5 | Why is linear regression important to understand? Select all that apply: 6 | 7 | - The linear model is often 8 | - **Linear regression is very extensible and can be used to capture nonlinear effects** 9 | - **Simple methods can outperform more complex ones if the data are noisy** 10 | - **Understanding simpler methods sheds light on more complex ones** 11 | 12 | ## 3.1.R2 13 | 14 | You may want to reread the paragraph on confidence intervals on page 66 of the textbook before trying this queston (the distinctions are subtle). 15 | 16 | Which of the following are true statements? 
Select all that apply: 17 | 18 | - **A 95% confidence interval is a random interval that contains the true parameter 95% of the time** 19 | - The true parameter is a random value that has 95% chance of falling in the 95% confidence interval 20 | - I perform a linear regression and get a 95% confidence interval from 0.4 to 0.5. There is a 95% probability that the true parameter is between 0.4 and 0.5. 21 | - **The true parameter (unknown to me) is 0.5. If I sample data and construct a 95% confidence interval, the interval will contain 0.5 95% of the time.** 22 | -------------------------------------------------------------------------------- /03 - Linear Regression/02 - Hypothesis Testing and Confidence Intervals Quiz.md: -------------------------------------------------------------------------------- 1 | # Hypothesis Testing and Confidence Intervals Quiz 2 | 3 | ## 3.2.R1 4 | 5 | We run a linear regression and the slope estimate is 0.5 with estimated standard error of 0.2. What is the largest value of b for which we would NOT reject the null hypothesis that \beta_1=b? (assume normal approximation to t distribution, and that we are using the 5% significance level for a two-sided test; need two significant digits of accuracy) 6 | 7 | **0.892** 8 | 9 | ## 3.2.R2 10 | 11 | Which of the following indicates a fairly strong relationship between X and Y? 12 | 13 | - **R^2 = 0.9** 14 | - The p-value for the null hypothesis \beta_1=0 is 0.0001 15 | - The t-statistic for the null hypothesis \beta_1=0 is 30 -------------------------------------------------------------------------------- /03 - Linear Regression/03 - Multiple Linear Regression Quiz.md: -------------------------------------------------------------------------------- 1 | # Multiple Linear Regression Quiz 2 | 3 | # 3.3.R1 4 | 5 | Suppose we are interested in learning about a relationship between X_1 and Y, which we would ideally like to interpret as causal. 6 | 7 | True or False? The estimate \hat\beta_1 in a linear regression that controls for many variables (that is, a regression with many predictors in addition to X_1) is usually a more reliable measure of a causal relationship than \hat\beta_1 from a univariate regression on X_1. 8 | 9 | - True 10 | - **False** -------------------------------------------------------------------------------- /03 - Linear Regression/04 - Some important questions Quiz.md: -------------------------------------------------------------------------------- 1 | # Some Important Questions Quiz 2 | 3 | ## 3.4.R1 4 | 5 | According to the balance vs ethnicity model, what is the predicted balance for an Asian in the data set? (within 0.01 accuracy) 6 | 7 | **512.31** 8 | 9 | ## 3.4.R2 10 | 11 | What is the predicted balance for an African American? (within .01 accuracy) 12 | 13 | **531** 14 | -------------------------------------------------------------------------------- /03 - Linear Regression/05 - Extension of the linear model.md: -------------------------------------------------------------------------------- 1 | # Extensions of the linear model 2 | 3 | ## 3.5.R1 4 | 5 | According to the model for sales vs TV interacted with radio, what is the effect of an additional $1 of radio advertising if TV=$50? (with 4 decimal accuracy) 6 | 7 | **.0839** 8 | 9 | ## 3.5.R2 10 | 11 | What if TV=$250? 
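(How these two answers arise: in the interaction model \text{sales} = \beta_0 + \beta_1\text{TV} + \beta_2\text{radio} + \beta_3(\text{TV}\times\text{radio}), the effect of one additional dollar of radio advertising is \beta_2 + \beta_3\text{TV}. Using the approximate estimates from the textbook's TV*radio fit, \hat\beta_2 \approx 0.0289 and \hat\beta_3 \approx 0.0011 — quoted here only for illustration, and consistent with the two answers in this file — plugging in TV = 50 reproduces the answer above, and TV = 250 gives the answer below.)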
(with 4 decimal accuracy) 12 | 13 | **.3039** 14 | -------------------------------------------------------------------------------- /03 - Linear Regression/06 - Linear Regression in R.md: -------------------------------------------------------------------------------- 1 | # Linear Regression in R 2 | 3 | ## 3.R.R1 4 | 5 | What is the difference between lm(y ~ x*z) and lm(y ~ I(x*z)), when x and z are both numeric variables? 6 | 7 | - The first one includes an interaction term between x and z, whereas the second uses the product of x and z as a predictor in the model. 8 | - The second one includes an interaction term between x and z, whereas the first uses the product of x and z as a predictor in the model. 9 | - The first includes only an interaction term for x and z, while the second includes both interaction effects and main effects. 10 | - **The second includes only an interaction term for x and z, while the first includes both interaction effects and main effects.** 11 | -------------------------------------------------------------------------------- /03 - Linear Regression/07 - Chapter 3 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 3 Quiz 2 | 3 | ## Multiple Choice 4 | 5 | Which of the following statements are true? 6 | 7 | - **In the balance vs. income * student model plotted on slide 44, the estimate of beta3 is negative.** 8 | - One advantage of using linear models is that the true regression function is often linear. 9 | - If the F statistic is significant, all of the predictors have statistically significant effects. 10 | - In a linear regression with several variables, a variable has a positive regression coefficient if and only if its correlation with the response is positive. 11 | -------------------------------------------------------------------------------- /04 - Classification/01 - Introduction to Classification Problems Quiz.md: -------------------------------------------------------------------------------- 1 | # Introduction to Classification Problem Quiz 2 | 3 | ## 4.1 R1 4 | 5 | Which of the following is the best example of a Qualitative Variable? 6 | 7 | - Height 8 | - Age 9 | - Speed 10 | - **Color** 11 | 12 | ## 4.1 R2 13 | 14 | Judging from the plots on page 2 of the notes, which should be the better predictor of Default: Income or Balance? 15 | 16 | - Income. 17 | - **Balance.** 18 | - Both are equally good. 19 | - Not enough information is given to decide. -------------------------------------------------------------------------------- /04 - Classification/02 - Logistic Regression Quiz.md: -------------------------------------------------------------------------------- 1 | # Logistic Regression Quiz 2 | 3 | ## 4.2.R1 4 | 5 | Using the model on page 8 of the notes, what value of Balance will give a predicted Default rate of 50%? (within 3 units of accuracy) 6 | 7 | Enter the value of Balance below: 8 | 9 | **1936.6** -------------------------------------------------------------------------------- /04 - Classification/03 - Multivariate Logistic Regression Quiz.md: -------------------------------------------------------------------------------- 1 | # Multivariate Logistic Regression Quiz 2 | 3 | ## 4.3.R1 4 | 5 | Suppose we collect data for a group of students in a statistics class with variables X_1 hours studied, X_2 undergrad GPA, and Y= receive an A. We fit a logistic regression and produce estimated coefficients \hat\beta_o = -6, \hat\beta_1 = 0.05, \hat\beta_2 = 1. 
6 | 7 | Estimate the probability that a student who studies for 40h and has an undergrad GPA of 3.5 gets an A in the class (within 0.01 accuracy): 8 | 9 | **0.3775** 10 | 11 | ## 4.3.R2 12 | 13 | How many hours would that student need to study to have a 50% chance of getting an A in the class?: 14 | 15 | **50** -------------------------------------------------------------------------------- /04 - Classification/04 - Logistic Regression - Case-Control Sampling and Multiclass Quiz.md: -------------------------------------------------------------------------------- 1 | # Logistic Regression - Case-Control Sampling and Multiclass Quiz 2 | 3 | ## 4.4 R1 4 | 5 | In which of the following problems is Case/Control Sampling LEAST likely to make a positive impact? 6 | 7 | - **Predicting a shopper's gender based on the products they buy** 8 | - Finding predictors for a certain type of cancer 9 | - Predicting if an email is Spam or Not Spam -------------------------------------------------------------------------------- /04 - Classification/05 - Discriminant Analysis Quiz.md: -------------------------------------------------------------------------------- 1 | # Discriminant Analysis Quiz 2 | 3 | ## 4.5 R1 4 | 5 | Suppose that in Ad Clicks (a problem where you try to model if a user will click on a particular ad) it is well known that the majority of the time an ad is shown it will not be clicked. What is another way of saying that? 6 | 7 | - **Ad Clicks have a low Prior Probability** 8 | - Ad Clicks have a high Prior Probability. 9 | - Ad Clicks have a low Density. 10 | - Ad Clicks have a high Density. -------------------------------------------------------------------------------- /04 - Classification/06 - Gaussian Discriminant Analysis - One Variable - Quiz.md: -------------------------------------------------------------------------------- 1 | # Gaussian Discriminant Analysis - One Variable Quiz 2 | 3 | # 4.6.R1 4 | 5 | Which of the following is NOT a linear function in x: 6 | 7 | - f(x) = a + b^2x 8 | - The discriminant function from LDA 9 | - \delta_k(x) = x\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} +\log(\pi_k) 10 | - \text{logit}(P(y = 1 | x)) where P(y=1 | x) is as in logistic regression 11 | - P(y=1 | x) from logistic regression 12 | -------------------------------------------------------------------------------- /04 - Classification/07 - Gaussian Discriminant Analysis - Many Variables Quiz.md: -------------------------------------------------------------------------------- 1 | # Gaussian Discriminant Analysis - Many Variables 2 | 3 | ## 4.7.R1 4 | 5 | Why does Total Error keep going down on the graph on page 34 of the notes, even though the False Negative Rate increases? 6 | 7 | - The False Negative Rate does not affect Total Error. 8 | - A higher False Negative Rate generally decreases Total Error. 9 | - **Positive responses are so uncommon that their impact on the Total Error is small.** 10 | - All of the above -------------------------------------------------------------------------------- /04 - Classification/08 - Quadratic Discriminant Analysis and Naive Bayes Quiz.md: -------------------------------------------------------------------------------- 1 | # Quadratic Discriminant Analysis and Naive Bayes Quiz 2 | 3 | ## 4.8.R1 4 | 5 | Which of the following statements best explains the relationship between Quadratic Discriminant Analysis and naive Bayes with Gaussian distributions in each class? 
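(A brief justification, using only standard parameter counts rather than anything from the lecture: with p predictors, QDA estimates a full p x p covariance matrix separately for each class — on the order of p(p+1)/2 parameters per class — whereas naive Bayes with Gaussian densities assumes the predictors are independent within each class and so estimates only p variances per class. Gaussian naive Bayes is therefore a restricted special case of QDA, which is what makes QDA the more flexible family.)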
6 | 7 | - **Quadratic Discriminant Analysis is a more flexible class of models than naive Bayes** 8 | - Quadratic Discriminant Analysis is a less flexible class of models than naive Bayes 9 | - Quadratic Discriminant Analysis is an equivalently flexible class of models to naive Bayes 10 | - For some problems Quadratic Discriminant Analysis is more flexible than naive Bayes, for others the opposite is true. -------------------------------------------------------------------------------- /04 - Classification/09 - Classification in R Quiz.md: -------------------------------------------------------------------------------- 1 | # Classification in R 2 | 3 | ## 4.R.R1 4 | 5 | In ch4.R, line 13 is "attach(Smarket)." If that line was omitted from the script, which of the following lines would cause an error?: 6 | 7 | - **line 15: mean(glm.pred==Direction)** 8 | - line 18: glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial, subset=train) 9 | - line 22: Direction.2005=Smarket$Direction[!train] 10 | - line 30: table(glm.pred,Direction.2005) -------------------------------------------------------------------------------- /04 - Classification/10 - Chapter 4 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 4 Quiz 2 | 3 | ## 4.Q.1 4 | 5 | Which of the following tools would be well suited for predicting if a student will get an A in a class based on the student's height, and parents’ income? Select all that apply: 6 | 7 | - **Linear Discriminant Analysis** 8 | - **Linear Regression** 9 | - **Logistic Regression** 10 | - Random Guess -------------------------------------------------------------------------------- /05 - Resampling Methods/01 - Cross Validation Quiz.md: -------------------------------------------------------------------------------- 1 | # Cross-Validation Quiz 2 | 3 | ## 5.1.R1 4 | 5 | When we fit a model to data, which is typically larger? 6 | 7 | - **Test Error** 8 | - Training Error 9 | 10 | ## 5.1.R2 11 | 12 | What are reasons why test error could be LESS than training error? 13 | 14 | - **By chance, the test set has easier cases than the training set.** 15 | - The model is highly complex, so training error systematically overestimates test error 16 | - The model is not very complex, so training error systematically overestimates test error -------------------------------------------------------------------------------- /05 - Resampling Methods/02 - K-Fold Cross Validation Quiz.md: -------------------------------------------------------------------------------- 1 | # K-Fold Cross Validation Quiz 2 | 3 | ## 5.2.R1 4 | 5 | Suppose we want to use cross-validation to estimate the error of the following procedure: 6 | 7 | Step 1: Find the k variables most correlated with y 8 | 9 | Step 2: Fit a linear regression using those variables as predictors 10 | 11 | We will estimate the error for each k from 1 to p, and then choose the best k. 12 | 13 | True or false: a correct cross-validation procedure will possibly choose a different set of k variables for every fold. 
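Because this point trips people up, here is a minimal R sketch (not part of the course code; `X`, `y`, `k` and the function name are placeholders, with `X` an n x p matrix and `y` a numeric vector) of the correct procedure, in which the k most-correlated variables are re-screened inside every training fold:

```r
# Cross-validation that repeats the screening step (Step 1) inside each fold
cv_error_screened <- function(X, y, k, nfolds = 10) {
  n <- nrow(X)
  folds <- sample(rep(1:nfolds, length.out = n))    # random fold labels
  errs <- numeric(nfolds)
  for (f in 1:nfolds) {
    train <- folds != f
    # Step 1: pick the k variables most correlated with y, using training rows only
    cors <- abs(cor(X[train, , drop = FALSE], y[train]))
    keep <- order(cors, decreasing = TRUE)[1:k]     # may differ from fold to fold
    # Step 2: least squares on the selected variables
    dtrain <- data.frame(y = y[train], X[train, keep, drop = FALSE])
    dtest  <- data.frame(X[!train, keep, drop = FALSE])
    names(dtrain)[-1] <- names(dtest) <- paste0("V", seq_len(k))
    fit <- lm(y ~ ., data = dtrain)
    errs[f] <- mean((y[!train] - predict(fit, newdata = dtest))^2)
  }
  mean(errs)   # CV estimate of test error for this choice of k
}
```

Because the screening uses only the training rows, each fold is free to keep a different set of k variables — which is why the answer that follows is TRUE.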
14 | 15 | - **TRUE** 16 | - FALSE -------------------------------------------------------------------------------- /05 - Resampling Methods/03 - Cross-Validation - The wrong and the right way Quiz.md: -------------------------------------------------------------------------------- 1 | # Cross-Validation: the wrong and right way Quiz 2 | 3 | ## 5.3.R1 4 | 5 | Suppose that we perform forward stepwise regression and use cross-validation to choose the best model size. 6 | 7 | Using the full data set to choose the sequence of models is the WRONG way to do cross-validation (we need to redo the model selection step within each training fold). If we do cross-validation the WRONG way, which of the following is true? 8 | 9 | - **The selected model will probably be too complex** 10 | - The selected model will probably be too simple -------------------------------------------------------------------------------- /05 - Resampling Methods/04 - The Bootstrap Quiz.md: -------------------------------------------------------------------------------- 1 | # The Bootstrap Quiz 2 | 3 | # 5.4.R1 4 | 5 | One way of carrying out the bootstrap is to average equally over all possible bootstrap samples from the original data set (where two bootstrap data sets are different if they have the same data points but in different order). Unlike the usual implementation of the bootstrap, this method has the advantage of not introducing extra noise due to resampling randomly. (You can use "^" to denote power, as in "n^2") 6 | 7 | To carry out this implementation on a data set with n data points, how many bootstrap data sets would we need to average over? 8 | 9 | **n^n** -------------------------------------------------------------------------------- /05 - Resampling Methods/05 - More on the Bootstrap Quiz.md: -------------------------------------------------------------------------------- 1 | # More on the Bootstrap Quiz 2 | 3 | ## 5.5.R1 4 | 5 | If we have n data points, what is the probability that a given data point does not appear in a bootstrap sample? 6 | 7 | **(1-1/n)^n** -------------------------------------------------------------------------------- /05 - Resampling Methods/06 - Resampling in R Quiz.md: -------------------------------------------------------------------------------- 1 | # Resampling in R Quiz 2 | 3 | ## 5.R.R1 4 | 5 | Download the file 5.R.RData and load it into R using load("5.R.RData"). Consider the linear regression model of y on X1 and X2. What is the standard error for \hat\beta_1? 6 | 7 | **0.02593** 8 | 9 | ## 5.R.R2 10 | 11 | Next, plot the data using matplot(Xy,type="l"). Which of the following do you think is most likely given what you see? 12 | 13 | - Our estimate of s.e.(\hat\beta_1) is too high. 14 | - **Our estimate of s.e.(\hat\beta_1) is too low.** 15 | - Our estimate of s.e.(\hat\beta_1) is about right. 16 | 17 | ## 5.R.R3 18 | 19 | Now, use the (standard) bootstrap to estimate s.e.(\hat\beta_1). To within 10%, what do you get? 20 | 21 | **0.0274** 22 | 23 | ## 5.R.R4 24 | 25 | Finally, use the block bootstrap to estimate s.e.(\hat\beta_1). Use blocks of 100 contiguous observations, and resample ten whole blocks with replacement then paste them together to construct each bootstrap time series. For example, one of your bootstrap resamples could be: 26 | 27 | ``` 28 | new.rows = c(101:200, 401:500, 101:200, 901:1000, 301:400, 1:100, 1:100, 801:900, 201:300, 701:800) 29 | 30 | new.Xy = Xy[new.rows, ] 31 | ``` 32 | 33 | To within 10%, what do you get? 
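A sketch of how the block bootstrap above might be coded (assuming `Xy` from 5.R.RData is a data frame with 1000 rows and columns y, X1 and X2; `B`, `beta1` and `block.starts` are names chosen here for illustration):

```r
# Block bootstrap for the s.e. of the X1 coefficient: resample 10 blocks of 100 contiguous rows
set.seed(1)
B <- 1000                              # number of bootstrap time series
block.starts <- seq(1, 901, by = 100)  # first row of each of the ten blocks
beta1 <- numeric(B)
for (b in 1:B) {
  starts   <- sample(block.starts, 10, replace = TRUE)
  new.rows <- as.vector(sapply(starts, function(s) s:(s + 99)))
  new.Xy   <- Xy[new.rows, ]
  beta1[b] <- coef(lm(y ~ X1 + X2, data = new.Xy))["X1"]
}
sd(beta1)                              # block-bootstrap estimate of s.e.(\hat\beta_1)
```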
34 | 35 | **0.2** 36 | -------------------------------------------------------------------------------- /05 - Resampling Methods/07 - Chapter 5 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 5 Quiz 2 | 3 | ## 5.Q.1 4 | 5 | If we use ten-fold cross-validation as a means of model selection, the cross-validation estimate of test error is: 6 | 7 | - biased upward 8 | - biased downward 9 | - unbiased 10 | - **potentially any of the above** 11 | 12 | ## 5.Q.2 13 | 14 | Why can't we use the standard bootstrap for some time series data? 15 | 16 | - **The data points in most time series aren't i.i.d.** 17 | - Some points will be used twice in the same sample 18 | - **The standard bootstrap doesn't accurately mimic the real-world data-generating mechanism** 19 | -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/01 - Introduction and Best-Subset Selection Quiz.md: -------------------------------------------------------------------------------- 1 | # Introduction and Best-Subset Selection 2 | 3 | ## 6.1.R1 4 | 5 | Which of the following modeling techniques performs Feature Selection? 6 | 7 | - Linear Discriminant Analysis 8 | - Least Squares 9 | - **Linear Regression with Forward Selection** 10 | - Support Vector Machines -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/02 - Stepwise Selection Quiz.md: -------------------------------------------------------------------------------- 1 | # Stepwise Selection Quiz 2 | 3 | ## 6.2.R1 4 | 5 | We perform best subset and forward stepwise selection on a single dataset. For both approaches, we obtain p + 1 models, containing 0, 1, ..., p predictors. 6 | 7 | Which of the two models with k predictors is guaranteed to have training RSS no larger than the other model? 8 | 9 | - **Best Subset** 10 | - Forward Stepwise 11 | - They always have the same training RSS 12 | - Not enough information is given to know 13 | 14 | ## 6.2.R2 15 | 16 | Which of the two models with k predictors has the smallest test RSS? 17 | 18 | - Best Subset 19 | - Forward Stepwise 20 | - They always have the same test RSS 21 | - **Not enough information is given to know** 22 | -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/03 - Backward Stepwise Selection Quiz.md: -------------------------------------------------------------------------------- 1 | # Backward Stepwise Selection Quiz 2 | 3 | ## 6.3.R1 4 | 5 | You are trying to fit a model and are given p=30 predictor variables to choose from. Ultimately, you want your model to be interpretable, so you decide to use Best Subset Selection. 6 | 7 | How many different models will you end up considering?: 8 | 9 | **2^30** 10 | 11 | ## 6.3.R2 12 | 13 | How many would you fit using Forward Selection?: 14 | 15 | **1+30*(30+1)/2** 16 | -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/04 - Estimating Test Error Quiz.md: -------------------------------------------------------------------------------- 1 | # Estimating Test Error Quiz 2 | 3 | ## 6.4.R1 4 | 5 | You are fitting a linear model to data assumed to have Gaussian errors. The model has up to p = 5 predictors and n = 100 observations. 
Which of the following is most likely true of the relationship between C_p and AIC in terms of using the statistic to select a number of predictors to include? 6 | 7 | - C_p will select a model with more predictors than AIC 8 | - C_p will select a model with fewer predictors than AIC 9 | - **C_p will select the same model as AIC** 10 | - Not enough information is given to decide -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/04 - Validation and Cross-Validation Quiz.md: -------------------------------------------------------------------------------- 1 | # Validation and Cross-Validation 2 | 3 | ## 6.5.R1 4 | 5 | You are doing a simulation in order to compare the effect of using Cross-Validation or a Validation set. For each iteration of the simulation, you generate new data and then use both Cross-Validation and a Validation set in order to determine the optimal number of predictors. Which of the following is most likely? 6 | 7 | - The Cross-Validation method will result in a higher variance of optimal number of predictors 8 | - **The Validation set method will result in a higher variance of optimal number of predictors** 9 | - Both methods will produce results with the same variance of optimal number of predictors 10 | - Not enough information is given to decide -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/05 - Shrinkage Methods and Ridge Regression Quiz.md: -------------------------------------------------------------------------------- 1 | # Shrinkage Methods and Ridge Regression Quiz 2 | 3 | ## 6.6.R1 4 | 5 | \sqrt{\sum_{j=1}^p\beta_j^2} is equivalent to: 6 | 7 | - X\hat\beta 8 | - \hat\beta^R 9 | - C_p statistic 10 | - **\|\beta\|_2** 11 | 12 | ## 6.6.R2 13 | 14 | You perform ridge regression on a problem where your third predictor, x3, is measured in dollars. You decide to refit the model after changing x3 to be measured in cents. Which of the following is true?: 15 | 16 | - \hat\beta_3 and \hat y will remain the same. 17 | - \hat\beta_3 will change but \hat y will remain the same. 18 | - \hat\beta_3 will remain the same but \hat y will change. 19 | - **\hat\beta_3 and \hat y will both change.** 20 | -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/06 - The Lasso Quiz.md: -------------------------------------------------------------------------------- 1 | # The Lasso Quiz 2 | 3 | ## 6.7 R1 4 | 5 | Which of the following is NOT a benefit of the sparsity imposed by the Lasso? 6 | 7 | - Sparse models are generally easier to interpret 8 | - The Lasso does variable selection by default 9 | - **Using the Lasso penalty helps to decrease the bias of the fits** 10 | - Using the Lasso penalty helps to decrease the variance of the fits 11 | -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/07 - Tuning Parameter Selection Quiz.md: -------------------------------------------------------------------------------- 1 | # Tuning Parameter Selection Quiz 2 | 3 | ## 6.8 R1 4 | 5 | Which of the following would be the worst metric to use to select \lambda in the Lasso? 
6 | 7 | - Cross-Validated error 8 | - Validation set error 9 | - **RSS** -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/08 - Dimension Reduction Methods Quiz.md: -------------------------------------------------------------------------------- 1 | # Dimension Reduction Methods Quiz 2 | 3 | ## 6.9.R1 4 | 5 | We compute the principal components of our p predictor variables. The RSS in a simple linear regression of Y onto the largest principal component will always be no larger than the RSS in a simple regression of Y onto the second largest principal component. True or False? (You may want to watch 6.10 as well before answering - sorry!) 6 | 7 | - True 8 | - **False** -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/09 - Principal Components Regression and Partial Least Squares Quiz.md: -------------------------------------------------------------------------------- 1 | # Principal Components Regression and Partial Least Squares 2 | 3 | ## 6.10.R1 4 | 5 | You are working on a regression problem with many variables, so you decide to do Principal Components Analysis first and then fit the regression to the first 2 principal components. Which of the following would you expect to happen?: 6 | 7 | - A subset of the features will be selected 8 | - Model Bias will decrease relative to the full least squares model 9 | - **Variance of fitted values will decrease relative to the full least squares model** 10 | - Model interpretability will improve relative to the full least squares model 11 | -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/10 - Model Selection in R Quiz.md: -------------------------------------------------------------------------------- 1 | # R Model Selection in R 2 | 3 | ## 6.R.R1 4 | 5 | One of the functions in the glmnet package is cv.glmnet(). This function, like many functions in R, will return a list object that contains various outputs of interest. What is the name of the component that contains a vector of the mean cross-validated errors? 6 | 7 | **cvm** 8 | -------------------------------------------------------------------------------- /06 - Linear Model Selection and Regularization/11 - Chapter 6 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 6 Quiz 2 | 3 | ## 6.Q.1 4 | 5 | Suppose we estimate the regression coefficients in a linear regression model by minimizing 6 | 7 | displaystyle\sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2 + \lambda\sum_{j=1}^p\beta_j^2 8 | 9 | for a particular value of lambda. 
For each of the following, select the correct answer: 10 | 11 | As we increase lambda from 0, the training RSS will: 12 | 13 | **Steadily Increase** 14 | 15 | ## 6.Q.2 16 | 17 | As we increase lambda from 0, the test RSS will: 18 | 19 | **Decrease initially, and then eventually start increasing in a U shape** 20 | 21 | ## 6.Q.3 22 | 23 | As we increase lambda from 0, the variance will: 24 | 25 | **Steadily Decrease** 26 | 27 | ## 6.Q.4 28 | 29 | As we increase lambda from 0, the (squared) bias will: 30 | 31 | **Steadily Increase** 32 | 33 | ## 6.Q.5 34 | 35 | As we increase lambda from 0, the irreducible error will: 36 | 37 | **Remain constant** -------------------------------------------------------------------------------- /07 - Moving Beyond Linearity/01 - Polynomials and Step Functions Quiz.md: -------------------------------------------------------------------------------- 1 | # Polynomials and Step Functions Quiz 2 | 3 | ## 7.1.R1 4 | 5 | Which of the following can we add to linear models to capture 6 | 7 | nonlinear effects? 8 | 9 | - **Spline terms** 10 | - **Polynomial terms** 11 | - **Interactions** 12 | - Arbitrary linear combinations of the variables 13 | - **Step functions** -------------------------------------------------------------------------------- /07 - Moving Beyond Linearity/02 - Piecewise-Polynomials and Splines Quiz.md: -------------------------------------------------------------------------------- 1 | # Piecewise-Polynomials and Splines Quiz 2 | 3 | ## 7.2.R1 4 | 5 | Why are natural cubic splines typically preferred over global polynomials of degree d? 6 | 7 | - Polynomials have too many degrees of freedom 8 | - **Polynomials tend to extrapolate very badly** 9 | - Polynomials are not as continuous as splines 10 | 11 | ## 7.2.R2 12 | 13 | Let 1\{x \leq t\} denote a function which is 1 if x \leq t and 0 otherwise. 14 | 15 | Which of the following is a basis for linear splines with a knot at t? Select all that apply: 16 | 17 | - **1, x, (x - t)1\{x > t\}** 18 | - **1, x, (x - t)1\{x \leq t\}** 19 | - 1\{x > t\}, 1\{x \leq t\}, (x - t)1\{x > t\} 20 | - **1, (x - t)1\{x \leq t\}, (x - t)1\{x > t\}** -------------------------------------------------------------------------------- /07 - Moving Beyond Linearity/03 - Smoothing Splines Quiz.md: -------------------------------------------------------------------------------- 1 | # Smoothing Splines 2 | 3 | ## 7.3.R1 4 | 5 | In terms of model complexity, which is more similar to a smoothing 6 | 7 | spline with 100 knots and 5 effective degrees of freedom? 8 | 9 | - **A natural cubic spline with 5 knots** 10 | - A natural cubic spline with 100 knots -------------------------------------------------------------------------------- /07 - Moving Beyond Linearity/04 - Generalized Additive Models and Local Regression Quiz.md: -------------------------------------------------------------------------------- 1 | # Generalized Additive Models and Local Regression 2 | 3 | ## 7.4.R1 4 | 5 | True or False: In the GAM y \sim f_1(X_1) + f_2(X_2) + e, as we make f_1 and f_2 more and more complex we can approximate any regression function to arbitrary precision. 
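(Why this turns out to be False: an additive fit f_1(X_1) + f_2(X_2) can make each coordinate function arbitrarily flexible, but it can never produce an interaction between X_1 and X_2. For example, no choice of f_1 and f_2 can represent the surface f(X_1, X_2) = X_1 X_2, since any additive function satisfies f(1,1) - f(1,0) - f(0,1) + f(0,0) = 0 while X_1 X_2 gives 1, so a two-term GAM cannot approximate every regression function.)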
6 | 7 | - True 8 | - **False** 9 | -------------------------------------------------------------------------------- /07 - Moving Beyond Linearity/05 - Nonlinear Functions in R Quiz.md: -------------------------------------------------------------------------------- 1 | # Nonlinear Functions in R 2 | 3 | ## 7.R.R1 4 | 5 | Load the data from the file 7.R.RData, and plot it using plot(x,y). What is the slope coefficient in a linear regression of y on x (to within 10%)? 6 | 7 | **-0.6748** 8 | 9 | ## 7.R.R2 10 | 11 | For the model y ~ 1+x+x^2, what is the coefficient of x (to within 10%)? 12 | 13 | **77.7** 14 | -------------------------------------------------------------------------------- /07 - Moving Beyond Linearity/06 - Chapter 7 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 7 Quiz 2 | 3 | ## 7.Q.1 4 | 5 | Suppose we want to fit a generalized additive model (with a continuous response) for y against X_1 and X_2. Suppose that we are using a cubic spline with four knots for each variable (so our model can be expressed as a linear regression after the right basis expansion). 6 | 7 | Suppose that we fit our model by the following three steps: 8 | 9 | 1) First fit our cubic spline model for y against X_1, obtaining the fit \hat f_1(x) and residuals r_i = y_i - \hat f_1(X_{i,1}). 10 | 11 | 2) Then, fit a cubic spline model for r against X_2 to obtain \hat f_2(x). 12 | 13 | 3) Finally construct fitted values \hat y_i = \hat f_1(X_{i,1}) + \hat f_2(X_{i,2}). 14 | 15 | Will we get the same fitted values as we would if we fit the additive model for y against X_1 and X_2 jointly? 16 | 17 | - yes, no matter what 18 | - only if X_1 and X_2 are uncorrelated 19 | - **not necessarily, even if X_1 and X_2 are uncorrelated.** -------------------------------------------------------------------------------- /08 - Tree-Based Methods/01 - Tree-Based Methods Quiz.md: -------------------------------------------------------------------------------- 1 | # Tree-Based Methods Quiz 2 | 3 | ## 8.1.R1 4 | 5 | Using the decision tree on page 5 of the notes, what would you predict for the log salary of a player who has played for 4 years and has 150 hits?: 6 | 7 | - **5.11** 8 | - 5.55 9 | - 6.0 10 | - 6.74 -------------------------------------------------------------------------------- /08 - Tree-Based Methods/02 - More Details on Trees Quiz.md: -------------------------------------------------------------------------------- 1 | # More Details on Trees 2 | 3 | ## 8.2.R1 4 | 5 | Imagine that you are doing cost complexity pruning as defined on page 18 of the notes. You fit two trees to the same data: T_1 is fit at alpha = 1 and T_2 is fit at alpha = 2. Which of the following is true? 6 | 7 | - **T_1 will have at least as many nodes as T_2** 8 | - T_1 will have at most as many nodes as T_2 9 | - Not enough information is given in the problem to decide 10 | -------------------------------------------------------------------------------- /08 - Tree-Based Methods/03 - Classification Trees Quiz.md: -------------------------------------------------------------------------------- 1 | # Classification Trees Quiz 2 | 3 | ## 8.3.R1 4 | 5 | You have a bag of marbles with 64 red marbles and 36 blue marbles. 6 | 7 | What is the value of the Gini Index for that bag? Give your answer to the nearest hundredth: 8 | 9 | **.4608** 10 | 11 | ## 8.3.R2 12 | 13 | What is the value of the Cross-Entropy? 
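Both of these values can be checked with one line of R each (the class proportions 0.64 and 0.36 come directly from the marble counts):

```r
p <- c(64, 36) / 100   # class proportions: red, blue
sum(p * (1 - p))       # Gini index: 0.4608
-sum(p * log(p))       # cross-entropy with natural log: about 0.653
```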
Give your answer to the nearest hundredth (using log base e, as in R): 14 | 15 | **.653** 16 | -------------------------------------------------------------------------------- /08 - Tree-Based Methods/04 - Bagging and Random Forest Quiz.md: -------------------------------------------------------------------------------- 1 | # Bagging and Random Forest Quiz 2 | 3 | ## 8.4.R1 4 | 5 | Suppose we produce ten bootstrap samples from a data set containing red and green classes. We then apply a classification tree to each bootstrap sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 6 | 7 | 0.1,0.15,0.2,0.2,0.55,0.6,0.6,0.65,0.7, \text{ and } 0.75 8 | 9 | There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in the notes. The second approach is to classify based on the average probability. 10 | 11 | What is the final classification under the majority vote method?: 12 | 13 | **red** 14 | 15 | ## 8.4.R2 16 | 17 | What is the final classification under the average probability method?: 18 | 19 | **green** -------------------------------------------------------------------------------- /08 - Tree-Based Methods/05 - Boosting Quiz.md: -------------------------------------------------------------------------------- 1 | # Boosting Quiz 2 | 3 | ## 8.5.R1 4 | 5 | In order to perform Boosting, we need to select 3 parameters: number of samples B, tree depth d, and step size lambda. 6 | 7 | How many parameters do we need to select in order to perform Random Forests?: 8 | 9 | **2** 10 | -------------------------------------------------------------------------------- /08 - Tree-Based Methods/06 - Tree-Based Methods in R Quiz.md: -------------------------------------------------------------------------------- 1 | # Tree-Based Methods in R 2 | 3 | ## 8.R.R1 4 | 5 | You are trying to reproduce the results of the R labs, so you run the following command in R: 6 | 7 | > library(tree) 8 | 9 | As a response, you see the following error message: 10 | 11 | Error in library(tree) : there is no package called ‘tree’ 12 | 13 | What went wrong? 14 | 15 | - You meant to use 'require(tree)' 16 | - You meant to use 'library("tree")' 17 | - **The tree package is not installed on your computer** 18 | - Nothing is wrong, that error message could not be produced by R -------------------------------------------------------------------------------- /08 - Tree-Based Methods/07 - Chapter 8 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 8 Quiz 2 | 3 | ## 8.Q1 4 | 5 | The tree building algorithm given on pg 13 is described as a Greedy Algorithm. Which of the following is also an example of a Greedy Algorithm?: 6 | 7 | - The Lasso 8 | - Support Vector Machines 9 | - The Bootstrap 10 | - **Forward Stepwise Selection** 11 | 12 | ## 8.Q2 13 | 14 | Examine the plot on pg 23. Assume that we wanted to select a model using the one-standard-error rule on the Cross-Validated error. What tree size would end up being selected?: 15 | 16 | - 1 17 | - **2** 18 | - 3 19 | - 10 20 | 21 | ## 8.Q3 22 | 23 | Suppose I have two qualitative predictor variables, each with three levels, and a quantitative response. I am considering fitting either a tree or an additive model. For the additive model, I will use a piecewise-constant function for each variable, with a separate constant for each level. 
Which model is capable of fitting a richer class of functions: 24 | 25 | - **Tree** 26 | - Additive Model 27 | - They are equivalent 28 | -------------------------------------------------------------------------------- /09 - Support Vector Machines/01 - Optimal Separating Hyperplanes Quiz.md: -------------------------------------------------------------------------------- 1 | # Optimal Separating Hyperplanes Quiz 2 | 3 | ## 9.1.R1 4 | 5 | If beta is not a unit vector but instead has length 2, then \sum_{j=1}^p \beta_j X_j is 6 | 7 | - **twice the signed Euclidean distance from the separating hyperplane \sum_{j=1}^p \beta_j X_j = 0** 8 | - half the signed Euclidean distance from X to the separating hyperplane* 9 | - exactly the signed Euclidean distance from the separating hyperplane 10 | -------------------------------------------------------------------------------- /09 - Support Vector Machines/02 - Supper Vector Classifier Quiz.md: -------------------------------------------------------------------------------- 1 | # Support Vector Classifier Quiz 2 | 3 | ## 9.2.R1 4 | 5 | If we increase C (the error budget) in an SVM, do you expect the standard error of beta to increase or decrease? 6 | 7 | - Increase 8 | - **Decrease** -------------------------------------------------------------------------------- /09 - Support Vector Machines/03 - Feature Expansion and the SVM Quiz.md: -------------------------------------------------------------------------------- 1 | # Feature Expansion and the SVM Quiz 2 | 3 | ## 9.3.R1 4 | 5 | True or False: If no linear boundary can perfectly classify all the training data, this means we need to use a feature expansion. 6 | 7 | - True 8 | - **False** 9 | 10 | ## 9.3.R2 11 | 12 | True or False: The computational effort required to solve a kernel support vector machine becomes greater and greater as the dimension of the basis increases. 13 | 14 | - True 15 | - **False** -------------------------------------------------------------------------------- /09 - Support Vector Machines/04 - Example and Comparison with Logistic Regression Quiz.md: -------------------------------------------------------------------------------- 1 | # Example and Comparison with Logistic Regression Quiz 2 | 3 | ## 9.4.R1 4 | 5 | Recall that we obtain the ROC curve by classifying test points based on whether \hat f(x) > t, and varying t. 6 | 7 | How large is the AUC (area under the ROC curve) for a classifier based on a completely random function \hat f(x) (that is, one for which the orderings of the \hat f(x_i) are completely random)? 8 | 9 | **0.5** -------------------------------------------------------------------------------- /09 - Support Vector Machines/05 - SVM in R Quiz.md: -------------------------------------------------------------------------------- 1 | # SVM in R Quiz 2 | 3 | in this problem, you will use simulation to evaluate (by Monte Carlo) the expected misclassification error rate given a particular generating model. Let y_i be equally divided between classes 0 and 1, and let x_i \in \mathbb{R}^{10} be normally distributed. 4 | 5 | Given y_i, x_i \sim N_{10}(0, I_{10}). Given y_i = 1,x_i \sim N_{10}( \mu, I_{10}) with \mu = (1,1,1,1,1,0,0,0,0,0). 6 | 7 | The notation just means its a ten-dimensional Gaussian distribution; you can use the mvrnorm function in the MASS package to help generate the data. Now, we would like to know the expected test error rate if we fit an SVM to a sample of 50 random training points from class 1 and 50 more from class 0. 
We can calculate this to high precision by 1) generating a random training sample to train on, 2) evaluating the number of mistakes we make on a large test set, and then 3) repeating (1-2) many times and averaging the error rate for each trial. 8 | 9 | Aside: in real life we don't know the generating distribution, so we have to use resampling methods instead of the procedure described above. 10 | 11 | For all of the following, please enter your error rate as a number between zero and 1 (e.g., 0.21 instead of 21 if the error rate is 21%). 12 | 13 | ## 9.R.1 14 | 15 | Use svm in the e1071 package with the default settings (the default kernel is a radial kernel). What is the expected test error rate of this method (to within 10%)? 16 | 17 | **0.16350** 18 | 19 | ## 9.R.2 20 | 21 | Now fit an svm with a linear kernel (kernel = "linear"). What is the expected test error rate to within 10%? 22 | 23 | **0.15791** 24 | 25 | ## 9.R.3 26 | 27 | What is the expected test error for logistic regression? (to within 10%) 28 | 29 | (Don't worry if you get errors saying the logistic regression did not converge.) 30 | 31 | **0.15750** 32 | -------------------------------------------------------------------------------- /09 - Support Vector Machines/06 - Chapter 9 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 9 Quiz 2 | 3 | ## 9.Q.1 4 | 5 | Suppose that after our computer works for an hour to fit an SVM on a large data set, we notice that x_4, the feature vector for the fourth example, was recorded incorrectly (say, one of the decimal points is obviously in the wrong place). 6 | 7 | However, your co-worker notices that the pair (x_4,y_4) did not turn out to be a support point in the original fit. He says there is no need to re-fit the SVM on the corrected data set, because changing the value of a non-support point can't possibly change the fit. 8 | 9 | Is your co-worker correct? 10 | 11 | - Yes 12 | - **No** 13 | -------------------------------------------------------------------------------- /10 - Unsupervised Learning/01 - Principal Components Quiz.md: -------------------------------------------------------------------------------- 1 | # Principal Component Quiz 2 | 3 | ## 10.1.R1 4 | 5 | You are analyzing a dataset where each observation is an age, height, length, and width of a particular turtle. You want to know if the data can be well described by fewer than four dimensions (maybe for plotting), so you decide to do Principal Component Analysis. Which of the following is most likely to be the loadings of the first Principal Component? 6 | 7 | - (1, 1, 1, 1) 8 | - **(.5, .5, .5, .5)** 9 | - (.71, -.71, 0, 0) 10 | - (1, -1, -1, -1) -------------------------------------------------------------------------------- /10 - Unsupervised Learning/02 - Higher Order Principal Component Quiz.md: -------------------------------------------------------------------------------- 1 | # Higher Order Principal Component Quiz 2 | 3 | ## 10.2.R1 4 | 5 | Suppose we have a data set where each data point represents a single student's scores on a math test, a physics test, a reading comprehension test, and a vocabulary test. 6 | 7 | We find the first two principal components, which capture 90% of the variability in the data, and interpret their loadings. We conclude that the first principal component represents overall academic ability, and the second represents a contrast between quantitative ability and verbal ability. 
8 | 9 | What loadings would be consistent with that interpretation? Choose all that apply. 10 | 11 | - (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0) 12 | - (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71) 13 | - **(0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)** 14 | - **(0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)** 15 | - (0.71, 0.71, 0, 0) and (0, 0, 0.71, 0.71) 16 | - (0.71, 0, -0.71, 0) and (0, 0.71, 0, -0.71) -------------------------------------------------------------------------------- /10 - Unsupervised Learning/03 - K-Means Clustering Quiz.md: -------------------------------------------------------------------------------- 1 | # K-Means Clustering Quiz 2 | 3 | ## 10.3.R1 4 | 5 | True or False: If we use k-means clustering, will we get the same cluster assignments for each point, whether or not we standardize the variables. 6 | 7 | - True 8 | - **False** 9 | -------------------------------------------------------------------------------- /10 - Unsupervised Learning/04 - Hierarchical Clustering Quiz.md: -------------------------------------------------------------------------------- 1 | # Hierarchical Clustering Quiz 2 | 3 | ## 10.4.R1 4 | 5 | True or False: If we cut the dendrogram at a lower point, we will tend to get more clusters (and cannot get fewer clusters). 6 | 7 | - **True** 8 | - False -------------------------------------------------------------------------------- /10 - Unsupervised Learning/05 - Breast Cancer Example Quiz.md: -------------------------------------------------------------------------------- 1 | # Breast Cancer Example Quiz 2 | 3 | ## 10.5.R1 4 | 5 | In the heat map for breast cancer data, which of the following depended on the output of hierarchical clustering? 6 | 7 | - **The ordering of the rows** 8 | - **The ordering of the columns** 9 | - The coloring of the cells as red or green -------------------------------------------------------------------------------- /10 - Unsupervised Learning/06 - Unsupervised in R Quiz.md: -------------------------------------------------------------------------------- 1 | # Unsupervised Learning in R Quiz 2 | 3 | ## 10.R.1 4 | 5 | Suppose we want to fit a linear regression, but the number of variables is much larger than the number of observations. In some cases, we may improve the fit by reducing the dimension of the features before. 6 | 7 | In this problem, we use a data set with n = 300 and p = 200, so we have more observations than variables, but not by much. Load the data x, y, x.test, and y.test from 10.R.RData. 8 | 9 | First, concatenate x and x.test using the rbind functions and perform a principal components analysis on the concatenated data frame (use the "scale=TRUE" option). To within 10% relative error, what proportion of the variance is explained by the first five principal components? 10 | 11 | **0.3498565** 12 | 13 | ## 10.R.2 14 | 15 | The previous answer suggests that a relatively small number of "latent variables" account for a substantial fraction of the features' variability. We might believe that these latent variables are more important than linear combinations of the features that have low variance. 16 | 17 | We can try forgetting about the raw features and using the first five principal components (computed on rbind(x,x.test)) instead as low-dimensional derived features. What is the mean-squared test error if we regress y on the first five principal components, and use the resulting model to predict y.test? 
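One way this can be computed (a sketch, assuming x, y, x.test and y.test are loaded from 10.R.RData, with y and y.test numeric vectors; object names such as `pca`, `scores` and `dtrain` are chosen here for illustration):

```r
# PCA on the concatenated predictors, as described in 10.R.1
pca <- prcomp(rbind(x, x.test), scale = TRUE)
sum(pca$sdev[1:5]^2) / sum(pca$sdev^2)   # proportion of variance explained by first 5 PCs (10.R.1)

# Regress y on the first five principal component scores of the training rows (10.R.2)
n      <- nrow(x)
scores <- pca$x[, 1:5]
dtrain <- data.frame(y = y, scores[1:n, ])
fit    <- lm(y ~ ., data = dtrain)

# Predict y.test from the scores of the test rows and compute the test MSE
dtest <- data.frame(scores[-(1:n), ])
mean((y.test - predict(fit, newdata = dtest))^2)
```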
18 | 19 | **0.9923** 20 | 21 | ## 10.R.3 22 | 23 | Now, try an OLS linear regression of y on the matrix x. What is the mean squared prediction error if we use the fitted model to predict y.test from x.test? 24 | 25 | **3.90714** 26 | -------------------------------------------------------------------------------- /10 - Unsupervised Learning/07 - Chapter 10 Quiz.md: -------------------------------------------------------------------------------- 1 | # Chapter 10 Quiz 2 | 3 | ## 10.Q.1 4 | 5 | K-Means is a seemingly complicated clustering algorithm. Here is a simpler one: 6 | 7 | Given k, the number of clusters, and n, the number of observations, try all possible assignments of the n observations into k clusters. Then, select one of the assignments that minimizes Within-Cluster Variation as defined on page 30. 8 | 9 | Assume that you implemented the most naive version of the above algorithm. Here, by naive we mean that you try all possible assignments even though some of them might be redundant (for example, the algorithm tries assigning all of the observations to cluster 1 and it also tries to assign them all to cluster 2, even though those are effectively the same solution). 10 | 11 | In terms of n and k, how many potential solutions will your algorithm try? 12 | 13 | **k^n** 14 | -------------------------------------------------------------------------------- /Lectures Files/5.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Lectures Files/5.R.RData -------------------------------------------------------------------------------- /Lectures Files/7.R.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Lectures Files/7.R.RData -------------------------------------------------------------------------------- /Lectures Files/ch10.Rmd: -------------------------------------------------------------------------------- 1 | Principal Components 2 | ==================== 3 | We will use the `USArrests` data (which is in R) 4 | ```{r} 5 | dimnames(USArrests) 6 | apply(USArrests,2,mean) 7 | apply(USArrests,2, var) 8 | ``` 9 | 10 | We see that `Assault` has a much larger variance than the other variables. It would dominate the principal components, so we choose to standardize the variables when we perform PCA. 11 | 12 | ```{r} 13 | pca.out=prcomp(USArrests, scale=TRUE) 14 | pca.out 15 | names(pca.out) 16 | biplot(pca.out, scale=0) 17 | ``` 18 | 19 | K-Means Clustering 20 | ================== 21 | K-means works in any dimension, but is most fun to demonstrate in two, because we can plot pictures. 22 | Lets make some data with clusters. We do this by shifting the means of the points around. 23 | ```{r} 24 | set.seed(101) 25 | x=matrix(rnorm(100*2),100,2) 26 | xmean=matrix(rnorm(8,sd=4),4,2) 27 | which=sample(1:4,100,replace=TRUE) 28 | x=x+xmean[which,] 29 | plot(x,col=which,pch=19) 30 | ``` 31 | We know the "true" cluster IDs, but we won't tell that to the `kmeans` algorithm. 
32 | 33 | ```{r} 34 | km.out=kmeans(x,4,nstart=15) 35 | km.out 36 | plot(x,col=km.out$cluster,cex=2,pch=1,lwd=2) 37 | points(x,col=which,pch=19) 38 | points(x,col=c(4,3,2,1)[which],pch=19) 39 | ``` 40 | 41 | Hierarchical Clustering 42 | ======================= 43 | We will use these same data and use hierarchical clustering 44 | 45 | ```{r} 46 | hc.complete=hclust(dist(x),method="complete") 47 | plot(hc.complete) 48 | hc.single=hclust(dist(x),method="single") 49 | plot(hc.single) 50 | hc.average=hclust(dist(x),method="average") 51 | plot(hc.average) 52 | 53 | ``` 54 | Lets compare this with the actualy clusters in the data. We will use the function `cutree` to cut the tree at level 4. 55 | This will produce a vector of numbers from 1 to 4, saying which branch each observation is on. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and its a little too complicated for this demonstration. 56 | 57 | We can use `table` to see how well they match: 58 | ```{r} 59 | hc.cut=cutree(hc.complete,4) 60 | table(hc.cut,which) 61 | table(hc.cut,km.out$cluster) 62 | ``` 63 | or we can use our group membership as labels for the leaves of the dendrogram: 64 | ```{r} 65 | plot(hc.complete,labels=which) 66 | ``` 67 | 68 | 69 | -------------------------------------------------------------------------------- /Lectures Files/ch2.R: -------------------------------------------------------------------------------- 1 | ### vectors, data, matrices, subsetting 2 | x=c(2,7,5) 3 | x 4 | y=seq(from=4,length=3,by=3) 5 | ?seq 6 | y 7 | x+y 8 | x/y 9 | x^y 10 | x[2] 11 | x[2:3] 12 | x[-2] 13 | x[-c(1,2)] 14 | z=matrix(seq(1,12),4,3) 15 | z 16 | z[3:4,2:3] 17 | z[,2:3] 18 | z[,1] 19 | z[,1,drop=FALSE] 20 | dim(z) 21 | ls() 22 | rm(y) 23 | ls() 24 | ### Generating random data, graphics 25 | x=runif(50) 26 | y=rnorm(50) 27 | plot(x,y) 28 | plot(x,y,xlab="Random Uniform",ylab="Random Normal",pch="*",col="blue") 29 | par(mfrow=c(2,1)) 30 | plot(x,y) 31 | hist(y) 32 | par(mfrow=c(1,1)) 33 | ### Reading in data 34 | Auto=read.csv("Auto.csv") 35 | pwd() 36 | Auto=read.csv("../Auto.csv") 37 | names(Auto) 38 | dim(Auto) 39 | class(Auto) 40 | summary(Auto) 41 | plot(Auto$cylinders,Auto$mpg) 42 | plot(Auto$cyl,Auto$mpg) 43 | attach(Auto) 44 | search() 45 | plot(cylinders,mpg) 46 | cylinders=as.factor(cylinders) 47 | plot(cylinders,mpg,xlab="Cylinders",ylab="Mpg",col="red") 48 | pdf(file="../mpg.pdf") 49 | plot(cylinders,mpg,xlab="Cylinders",ylab="Mpg",col="red") 50 | dev.off() 51 | pairs(Auto,col="brown") 52 | pairs(mpg~cylinders+acceleration+weight,Auto) 53 | q() -------------------------------------------------------------------------------- /Lectures Files/ch3.R: -------------------------------------------------------------------------------- 1 | library(MASS) 2 | library(ISLR) 3 | ### Simple linear regression 4 | names(Boston) 5 | ?Boston 6 | plot(medv~lstat,Boston) 7 | fit1=lm(medv~lstat,data=Boston) 8 | fit1 9 | summary(fit1) 10 | abline(fit1,col="red") 11 | names(fit1) 12 | confint(fit1) 13 | predict(fit1,data.frame(lstat=c(5,10,15)),interval="confidence") 14 | ### Multiple linear regression 15 | fit2=lm(medv~lstat+age,data=Boston) 16 | summary(fit2) 17 | fit3=lm(medv~.,Boston) 18 | summary(fit3) 19 | par(mfrow=c(2,2)) 20 | plot(fit3) 21 | fit4=update(fit3,~.-age-indus) 22 | summary(fit4) 23 | ### Nonlinear terms and Interactions 24 | fit5=lm(medv~lstat*age,Boston) 25 | summary(fit5) 26 | fit6=lm(medv~lstat +I(lstat^2),Boston); summary(fit6) 27 | 
attach(Boston) 28 | par(mfrow=c(1,1)) 29 | plot(medv~lstat) 30 | points(lstat,fitted(fit6),col="red",pch=20) 31 | fit7=lm(medv~poly(lstat,4)) 32 | points(lstat,fitted(fit7),col="blue",pch=20) 33 | plot(1:20,1:20,pch=1:20,cex=2) 34 | ###Qualitative predictors 35 | fix(Carseats) 36 | names(Carseats) 37 | summary(Carseats) 38 | fit1=lm(Sales~.+Income:Advertising+Age:Price,Carseats) 39 | summary(fit1) 40 | contrasts(Carseats$ShelveLoc) 41 | ###Writing R functions 42 | regplot=function(x,y){ 43 | fit=lm(y~x) 44 | plot(x,y) 45 | abline(fit,col="red") 46 | } 47 | attach(Carseats) 48 | regplot(Price,Sales) 49 | regplot=function(x,y,...){ 50 | fit=lm(y~x) 51 | plot(x,y,...) 52 | abline(fit,col="red") 53 | } 54 | regplot(Price,Sales,xlab="Price",ylab="Sales",col="blue",pch=20) 55 | 56 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /Lectures Files/ch4.R: -------------------------------------------------------------------------------- 1 | require(ISLR) 2 | names(Smarket) 3 | summary(Smarket) 4 | ?Smarket 5 | pairs(Smarket,col=Smarket$Direction) 6 | # Logistic regression 7 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, 8 | data=Smarket,family=binomial) 9 | summary(glm.fit) 10 | glm.probs=predict(glm.fit,type="response") 11 | glm.probs[1:5] 12 | glm.pred=ifelse(glm.probs>0.5,"Up","Down") 13 | attach(Smarket) 14 | table(glm.pred,Direction) 15 | mean(glm.pred==Direction) 16 | # Make training and test set 17 | train = Year<2005 18 | glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, 19 | data=Smarket,family=binomial, subset=train) 20 | glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") 21 | glm.pred=ifelse(glm.probs >0.5,"Up","Down") 22 | Direction.2005=Smarket$Direction[!train] 23 | table(glm.pred,Direction.2005) 24 | mean(glm.pred==Direction.2005) 25 | #Fit smaller model 26 | glm.fit=glm(Direction~Lag1+Lag2, 27 | data=Smarket,family=binomial, subset=train) 28 | glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") 29 | glm.pred=ifelse(glm.probs >0.5,"Up","Down") 30 | table(glm.pred,Direction.2005) 31 | mean(glm.pred==Direction.2005) 32 | 106/(76+106) 33 | 34 | 35 | require(MASS) 36 | 37 | ## Linear Discriminant Analysis 38 | lda.fit=lda(Direction~Lag1+Lag2,data=Smarket, subset=Year<2005) 39 | lda.fit 40 | plot(lda.fit) 41 | Smarket.2005=subset(Smarket,Year==2005) 42 | lda.pred=predict(lda.fit,Smarket.2005) 43 | lda.pred[1:5,] 44 | class(lda.pred) 45 | data.frame(lda.pred)[1:5,] 46 | table(lda.pred$class,Smarket.2005$Direction) 47 | mean(lda.pred$class==Smarket.2005$Direction) 48 | 49 | ## K-Nearest Neighbors 50 | library(class) 51 | ?knn 52 | attach(Smarket) 53 | Xlag=cbind(Lag1,Lag2) 54 | train=Year<2005 55 | knn.pred=knn(Xlag[train,],Xlag[!train,],Direction[train],k=1) 56 | table(knn.pred,Direction[!train]) 57 | mean(knn.pred==Direction[!train]) 58 | -------------------------------------------------------------------------------- /Lectures Files/ch5.R: -------------------------------------------------------------------------------- 1 | require(ISLR) 2 | require(boot) 3 | ?cv.glm 4 | plot(mpg~horsepower,data=Auto) 5 | 6 | ## LOOCV 7 | glm.fit=glm(mpg~horsepower, data=Auto) 8 | cv.glm(Auto,glm.fit)$delta #pretty slow (doesnt use formula (5.2) on page 180) 9 | 10 | ##Lets write a simple function to use formula (5.2) 11 | loocv=function(fit){ 12 | h=lm.influence(fit)$h 13 | mean((residuals(fit)/(1-h))^2) 14 | } 15 | 16 | ## Now we try it out 17 | loocv(glm.fit) 18 | 19 | 20 | cv.error=rep(0,5) 21 | degree=1:5 22 
| for(d in degree){ 23 | glm.fit=glm(mpg~poly(horsepower,d), data=Auto) 24 | cv.error[d]=loocv(glm.fit) 25 | } 26 | plot(degree,cv.error,type="b") 27 | 28 | ## 10-fold CV 29 | 30 | cv.error10=rep(0,5) 31 | for(d in degree){ 32 | glm.fit=glm(mpg~poly(horsepower,d), data=Auto) 33 | cv.error10[d]=cv.glm(Auto,glm.fit,K=10)$delta[1] 34 | } 35 | lines(degree,cv.error10,type="b",col="red") 36 | 37 | 38 | ## Bootstrap 39 | ## Minimum risk investment - Section 5.2 40 | 41 | alpha=function(x,y){ 42 | vx=var(x) 43 | vy=var(y) 44 | cxy=cov(x,y) 45 | (vy-cxy)/(vx+vy-2*cxy) 46 | } 47 | alpha(Portfolio$X,Portfolio$Y) 48 | 49 | ## What is the standard error of alpha? 50 | 51 | alpha.fn=function(data, index){ 52 | with(data[index,],alpha(X,Y)) 53 | } 54 | 55 | alpha.fn(Portfolio,1:100) 56 | 57 | set.seed(1) 58 | alpha.fn (Portfolio,sample(1:100,100,replace=TRUE)) 59 | 60 | boot.out=boot(Portfolio,alpha.fn,R=1000) 61 | boot.out 62 | plot(boot.out) 63 | -------------------------------------------------------------------------------- /Lectures Files/ch6.Rmd: -------------------------------------------------------------------------------- 1 | Model Selection 2 | ================ 3 | 4 | This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages, 5 | and a very nice way of distributing an analysis. It has some very simple syntax rules. 6 | 7 | 8 | ```{r} 9 | library(ISLR) 10 | summary(Hitters) 11 | ``` 12 | There are some missing values here, so before we proceed we will remove them: 13 | 14 | ```{r} 15 | Hitters=na.omit(Hitters) 16 | with(Hitters,sum(is.na(Salary))) 17 | ``` 18 | 19 | 20 | 21 | Best Subset regression 22 | ------------------------ 23 | We will now use the package `leaps` to evaluate all the best-subset models. 24 | ```{r} 25 | library(leaps) 26 | regfit.full=regsubsets(Salary~.,data=Hitters) 27 | summary(regfit.full) 28 | ``` 29 | It gives by default best-subsets up to size 8; lets increase that to 19, i.e. all the variables 30 | ```{r} 31 | regfit.full=regsubsets(Salary~.,data=Hitters, nvmax=19) 32 | reg.summary=summary(regfit.full) 33 | names(reg.summary) 34 | plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp") 35 | which.min(reg.summary$cp) 36 | points(10,reg.summary$cp[10],pch=20,col="red") 37 | ``` 38 | There is a plot method for the `regsubsets` object 39 | ```{r} 40 | plot(regfit.full,scale="Cp") 41 | coef(regfit.full,10) 42 | ``` 43 | 44 | 45 | 46 | Forward Stepwise Selection 47 | -------------------------- 48 | Here we use the `regsubsets` function but specify the `method="forward" option: 49 | ```{r} 50 | regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward") 51 | summary(regfit.fwd) 52 | plot(regfit.fwd,scale="Cp") 53 | ``` 54 | 55 | 56 | 57 | 58 | Model Selection Using a Validation Set 59 | --------------------------------------- 60 | Lets make a training and validation set, so that we can choose a good subset model. 61 | We will do it using a slightly different approach from what was done in the the book. 62 | ```{r} 63 | dim(Hitters) 64 | set.seed(1) 65 | train=sample(seq(263),180,replace=FALSE) 66 | train 67 | regfit.fwd=regsubsets(Salary~.,data=Hitters[train,],nvmax=19,method="forward") 68 | ``` 69 | Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no predict method for `regsubsets`. 
70 | ```{r} 71 | val.errors=rep(NA,19) 72 | x.test=model.matrix(Salary~.,data=Hitters[-train,])# notice the -index! 73 | for(i in 1:19){ 74 | coefi=coef(regfit.fwd,id=i) 75 | pred=x.test[,names(coefi)]%*%coefi 76 | val.errors[i]=mean((Hitters$Salary[-train]-pred)^2) 77 | } 78 | plot(sqrt(val.errors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b") 79 | points(sqrt(regfit.fwd$rss[-1]/180),col="blue",pch=19,type="b") 80 | legend("topright",legend=c("Training","Validation"),col=c("blue","black"),pch=19) 81 | ``` 82 | As we expect, the training error goes down monotonically as the model gets bigger, but not so 83 | for the validation error. 84 | 85 | This was a little tedious - not having a predict method for `regsubsets`. So we will write one! 86 | ```{r} 87 | predict.regsubsets=function(object,newdata,id,...){ 88 | form=as.formula(object$call[[2]]) 89 | mat=model.matrix(form,newdata) 90 | coefi=coef(object,id=id) 91 | mat[,names(coefi)]%*%coefi 92 | } 93 | ``` 94 | 95 | 96 | 97 | 98 | Model Selection by Cross-Validation 99 | ----------------------------------- 100 | We will do 10-fold cross-validation. Its really easy! 101 | ```{r} 102 | set.seed(11) 103 | folds=sample(rep(1:10,length=nrow(Hitters))) 104 | folds 105 | table(folds) 106 | cv.errors=matrix(NA,10,19) 107 | for(k in 1:10){ 108 | best.fit=regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax=19,method="forward") 109 | for(i in 1:19){ 110 | pred=predict(best.fit,Hitters[folds==k,],id=i) 111 | cv.errors[k,i]=mean( (Hitters$Salary[folds==k]-pred)^2) 112 | } 113 | } 114 | rmse.cv=sqrt(apply(cv.errors,2,mean)) 115 | plot(rmse.cv,pch=19,type="b") 116 | ``` 117 | 118 | 119 | 120 | Ridge Regression and the Lasso 121 | ------------------------------- 122 | We will use the package `glmnet`, which does not use the model formula language, so we will set up an `x` and `y`. 123 | ```{r} 124 | library(glmnet) 125 | x=model.matrix(Salary~.-1,data=Hitters) 126 | y=Hitters$Salary 127 | ``` 128 | First we will fit a ridge-regression model. This is achieved by calling `glmnet` with `alpha=0` (see the helpfile). There is also a `cv.glmnet` function which will do the cross-validation for us. 129 | ```{r} 130 | fit.ridge=glmnet(x,y,alpha=0) 131 | plot(fit.ridge,xvar="lambda",label=TRUE) 132 | cv.ridge=cv.glmnet(x,y,alpha=0) 133 | plot(cv.ridge) 134 | ``` 135 | Now we fit a lasso model; for this we use the default `alpha=1` 136 | ```{r} 137 | fit.lasso=glmnet(x,y) 138 | plot(fit.lasso,xvar="lambda",label=TRUE) 139 | cv.lasso=cv.glmnet(x,y) 140 | plot(cv.lasso) 141 | coef(cv.lasso) 142 | ``` 143 | 144 | Suppose we want to use our earlier train/validation division to select the `lambda` for the lasso. 145 | This is easy to do. 
146 | ```{r} 147 | lasso.tr=glmnet(x[train,],y[train]) 148 | lasso.tr 149 | pred=predict(lasso.tr,x[-train,]) 150 | dim(pred) 151 | rmse= sqrt(apply((y[-train]-pred)^2,2,mean)) 152 | plot(log(lasso.tr$lambda),rmse,type="b",xlab="Log(lambda)") 153 | lam.best=lasso.tr$lambda[order(rmse)[1]] 154 | lam.best 155 | coef(lasso.tr,s=lam.best) 156 | ``` 157 | -------------------------------------------------------------------------------- /Lectures Files/ch7.Rmd: -------------------------------------------------------------------------------- 1 | Nonlinear Models 2 | ======================================================== 3 | Here we explore the use of nonlinear models using some tools in R 4 | 5 | ```{r} 6 | require(ISLR) 7 | attach(Wage) 8 | ``` 9 | 10 | Polynomials 11 | ------------ 12 | 13 | First we will use polynomials, and focus on a single predictor age: 14 | 15 | ```{r} 16 | fit=lm(wage~poly(age,4),data=Wage) 17 | summary(fit) 18 | ``` 19 | 20 | The `poly()` function generates a basis of *orthogonal polynomials*. 21 | Lets make a plot of the fitted function, along with the standard errors of the fit. 22 | 23 | ```{r fig.width=7, fig.height=6} 24 | agelims=range(age) 25 | age.grid=seq(from=agelims[1],to=agelims[2]) 26 | preds=predict(fit,newdata=list(age=age.grid),se=TRUE) 27 | se.bands=cbind(preds$fit+2*preds$se,preds$fit-2*preds$se) 28 | plot(age,wage,col="darkgrey") 29 | lines(age.grid,preds$fit,lwd=2,col="blue") 30 | matlines(age.grid,se.bands,col="blue",lty=2) 31 | ``` 32 | 33 | There are other more direct ways of doing this in R. For example 34 | 35 | ```{r} 36 | fita=lm(wage~age+I(age^2)+I(age^3)+I(age^4),data=Wage) 37 | summary(fita) 38 | ``` 39 | 40 | Here `I()` is a *wrapper* function; we need it because `age^2` means something to the formula language, 41 | while `I(age^2)` is protected. 42 | The coefficients are different to those we got before! However, the fits are the same: 43 | 44 | ```{r} 45 | plot(fitted(fit),fitted(fita)) 46 | ``` 47 | 48 | By using orthogonal polynomials in this simple way, it turns out that we can separately test 49 | for each coefficient. So if we look at the summary again, we can see that the linear, quadratic 50 | and cubic terms are significant, but not the quartic. 51 | 52 | ```{r} 53 | summary(fit) 54 | ``` 55 | 56 | This only works with linear regression, and if there is a single predictor. In general we would use `anova()` 57 | as this next example demonstrates. 58 | 59 | ```{r} 60 | fita=lm(wage~education,data=Wage) 61 | fitb=lm(wage~education+age,data=Wage) 62 | fitc=lm(wage~education+poly(age,2),data=Wage) 63 | fitd=lm(wage~education+poly(age,3),data=Wage) 64 | anova(fita,fitb,fitc,fitd) 65 | 66 | ``` 67 | 68 | ### Polynomial logistic regression 69 | 70 | Now we fit a logistic regression model to a binary response variable, 71 | constructed from `wage`. We code the big earners (`>250K`) as 1, else 0. 72 | 73 | ```{r} 74 | fit=glm(I(wage>250) ~ poly(age,3), data=Wage, family=binomial) 75 | summary(fit) 76 | preds=predict(fit,list(age=age.grid),se=T) 77 | se.bands=preds$fit + cbind(fit=0,lower=-2*preds$se,upper=2*preds$se) 78 | se.bands[1:5,] 79 | ``` 80 | 81 | We have done the computations on the logit scale. To transform we need to apply the inverse logit 82 | mapping 83 | $$p=\frac{e^\eta}{1+e^\eta}.$$ 84 | (Here we have used the ability of MarkDown to interpret TeX expressions.) 
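As a small hedged aside (not part of the original lecture code), base R's `plogis()` is the logistic CDF and computes exactly this inverse-logit mapping, so it can stand in for the explicit `exp(eta)/(1+exp(eta))` arithmetic used below:

```{r}
# plogis(eta) equals exp(eta)/(1+exp(eta)) -- the inverse logit
eta=1.5
all.equal(plogis(eta),exp(eta)/(1+exp(eta)))
```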
85 | We can do this simultaneously for all three columns of `se.bands`: 86 | 87 | ```{r} 88 | prob.bands=exp(se.bands)/(1+exp(se.bands)) 89 | matplot(age.grid,prob.bands,col="blue",lwd=c(2,1,1),lty=c(1,2,2),type="l",ylim=c(0,.1)) 90 | points(jitter(age),I(wage>250)/10,pch="|",cex=.5) 91 | ``` 92 | 93 | Splines 94 | ------- 95 | Splines are more flexible than polynomials, but the idea is rather similar. 96 | Here we will explore cubic splines. 97 | 98 | ```{r} 99 | require(splines) 100 | fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage) 101 | plot(age,wage,col="darkgrey") 102 | lines(age.grid,predict(fit,list(age=age.grid)),col="darkgreen",lwd=2) 103 | abline(v=c(25,40,60),lty=2,col="darkgreen") 104 | ``` 105 | 106 | The smoothing splines does not require knot selection, but it does have a smoothing parameter, 107 | which can conveniently be specified via the effective degrees of freedom or `df`. 108 | 109 | ```{r} 110 | fit=smooth.spline(age,wage,df=16) 111 | lines(fit,col="red",lwd=2) 112 | ``` 113 | 114 | Or we can use LOO cross-validation to select the smoothing parameter for us automatically: 115 | 116 | ```{r} 117 | fit=smooth.spline(age,wage,cv=TRUE) 118 | lines(fit,col="purple",lwd=2) 119 | fit 120 | ``` 121 | 122 | Generalized Additive Models 123 | --------------------------- 124 | 125 | So far we have focused on fitting models with mostly single nonlinear terms. 126 | The `gam` package makes it easier to work with multiple nonlinear terms. In addition 127 | it knows how to plot these functions and their standard errors. 128 | 129 | ```{r fig.width=10, fig.height=5} 130 | require(gam) 131 | gam1=gam(wage~s(age,df=4)+s(year,df=4)+education,data=Wage) 132 | par(mfrow=c(1,3)) 133 | plot(gam1,se=T) 134 | gam2=gam(I(wage>250)~s(age,df=4)+s(year,df=4)+education,data=Wage,family=binomial) 135 | plot(gam2) 136 | ``` 137 | 138 | Lets see if we need a nonlinear terms for year 139 | 140 | ```{r} 141 | gam2a=gam(I(wage>250)~s(age,df=4)+year+education,data=Wage,family=binomial) 142 | anova(gam2a,gam2,test="Chisq") 143 | ``` 144 | 145 | One nice feature of the `gam` package is that it knows how to plot the functions nicely, 146 | even for models fit by `lm` and `glm`. 147 | 148 | ```{r fig.width=10, fig.height=5} 149 | par(mfrow=c(1,3)) 150 | lm1=lm(wage~ns(age,df=4)+ns(year,df=4)+education,data=Wage) 151 | plot.gam(lm1,se=T) 152 | ``` 153 | 154 | 155 | 156 | 157 | 158 | 159 | -------------------------------------------------------------------------------- /Lectures Files/ch8.Rmd: -------------------------------------------------------------------------------- 1 | Decision Trees 2 | ======================================================== 3 | 4 | We will have a look at the `Carseats` data using the `tree` package in R, as in the lab in the book. 5 | We create a binary response variable `High` (for high sales), and we include it in the same dataframe. 6 | ```{r} 7 | require(ISLR) 8 | require(tree) 9 | attach(Carseats) 10 | hist(Sales) 11 | High=ifelse(Sales<=8,"No","Yes") 12 | Carseats=data.frame(Carseats, High) 13 | ``` 14 | Now we fit a tree to these data, and summarize and plot it. Notice that we have to _exclude_ `Sales` from the right-hand side of the formula, because the response is derived from it. 
15 | ```{r} 16 | tree.carseats=tree(High~.-Sales,data=Carseats) 17 | summary(tree.carseats) 18 | plot(tree.carseats) 19 | text(tree.carseats,pretty=0) 20 | ``` 21 | For a detailed summary of the tree, print it: 22 | ```{r} 23 | tree.carseats 24 | ``` 25 | Lets create a training and test set (250,150) split of the 400 observations, grow the tree on the training set, and evaluate its performance on the test set. 26 | ```{r} 27 | set.seed(1011) 28 | train=sample(1:nrow(Carseats),250) 29 | tree.carseats=tree(High~.-Sales,Carseats,subset=train) 30 | plot(tree.carseats);text(tree.carseats,pretty=0) 31 | tree.pred=predict(tree.carseats,Carseats[-train,],type="class") 32 | with(Carseats[-train,],table(tree.pred,High)) 33 | (72+33)/150 34 | ``` 35 | This tree was grown to full depth, and might be too variable. We now use CV to prune it. 36 | ```{r} 37 | cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass) 38 | cv.carseats 39 | plot(cv.carseats) 40 | prune.carseats=prune.misclass(tree.carseats,best=13) 41 | plot(prune.carseats);text(prune.carseats,pretty=0) 42 | ``` 43 | Now lets evaluate this pruned tree on the test data. 44 | ```{r} 45 | tree.pred=predict(prune.carseats,Carseats[-train,],type="class") 46 | with(Carseats[-train,],table(tree.pred,High)) 47 | (72+32)/150 48 | ``` 49 | It has done about the same as our original tree. So pruning did not hurt us wrt misclassification errors, and gave us a simpler tree. 50 | 51 | Random Forests and Boosting 52 | ============================ 53 | 54 | These methods use trees as building blocks to build more complex models. Here we will use the Boston housing data to explore random forests and boosting. These data are in the `MASS` package. 55 | It gives housing values and other statistics in each of 506 suburbs of Boston based on a 1970 census. 56 | 57 | Random Forests 58 | -------------- 59 | Random forests build lots of bushy trees, and then average them to reduce the variance. 60 | 61 | ```{r} 62 | require(randomForest) 63 | require(MASS) 64 | set.seed(101) 65 | dim(Boston) 66 | train=sample(1:nrow(Boston),300) 67 | ?Boston 68 | ``` 69 | Lets fit a random forest and see how well it performs. We will use the response `medv`, the median housing value (in \$1K dollars) 70 | 71 | ```{r} 72 | rf.boston=randomForest(medv~.,data=Boston,subset=train) 73 | rf.boston 74 | ``` 75 | The MSR and % variance explained are based on OOB or _out-of-bag_ estimates, a very clever device in random forests to get honest error estimates. The model reports that `mtry=4`, which is the number of variables randomly chosen at each split. Since $p=13$ here, we could try all 13 possible values of `mtry`. We will do so, record the results, and make a plot. 76 | 77 | ```{r} 78 | oob.err=double(13) 79 | test.err=double(13) 80 | for(mtry in 1:13){ 81 | fit=randomForest(medv~.,data=Boston,subset=train,mtry=mtry,ntree=400) 82 | oob.err[mtry]=fit$mse[400] 83 | pred=predict(fit,Boston[-train,]) 84 | test.err[mtry]=with(Boston[-train,],mean((medv-pred)^2)) 85 | cat(mtry," ") 86 | } 87 | matplot(1:mtry,cbind(test.err,oob.err),pch=19,col=c("red","blue"),type="b",ylab="Mean Squared Error") 88 | legend("topright",legend=c("OOB","Test"),pch=19,col=c("red","blue")) 89 | ``` 90 | 91 | Not too difficult! Although the test-error curve drops below the OOB curve, these are estimates based on data, and so have their own standard errors (which are typically quite large). Notice that the points at the end with `mtry=13` correspond to bagging. 
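As a hedged addendum (not in the original lecture code), the bagging special case noted above can be fit directly by letting every split consider all of the predictors:

```{r}
# Bagging is a random forest with mtry equal to the number of predictors (13 for Boston)
bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=400)
bag.boston
```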
92 | 93 | Boosting 94 | -------- 95 | Boosting builds lots of smaller trees. Unlike random forests, each new tree in boosting tries to patch up the deficiencies of the current ensemble. 96 | ```{r} 97 | require(gbm) 98 | boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=10000,shrinkage=0.01,interaction.depth=4) 99 | summary(boost.boston) 100 | plot(boost.boston,i="lstat") 101 | plot(boost.boston,i="rm") 102 | ``` 103 | Lets make a prediction on the test set. With boosting, the number of trees is a tuning parameter, and if we have too many we can overfit. So we should use cross-validation to select the number of trees. We will leave this as an exercise. Instead, we will compute the test error as a function of the number of trees, and make a plot. 104 | 105 | ```{r} 106 | n.trees=seq(from=100,to=10000,by=100) 107 | predmat=predict(boost.boston,newdata=Boston[-train,],n.trees=n.trees) 108 | dim(predmat) 109 | berr=with(Boston[-train,],apply( (predmat-medv)^2,2,mean)) 110 | plot(n.trees,berr,pch=19,ylab="Mean Squared Error", xlab="# Trees",main="Boosting Test Error") 111 | abline(h=min(test.err),col="red") 112 | ``` 113 | 114 | 115 | 116 | -------------------------------------------------------------------------------- /Lectures Files/ch9.Rmd: -------------------------------------------------------------------------------- 1 | SVM 2 | ======================================================== 3 | To demonstrate the SVM, it is easiest to work in low dimensions, so we can see the data. 4 | 5 | Linear SVM classifier 6 | --------------------- 7 | Lets generate some data in two dimensions, and make them a little separated. 8 | ```{r} 9 | set.seed(10111) 10 | x=matrix(rnorm(40),20,2) 11 | y=rep(c(-1,1),c(10,10)) 12 | x[y==1,]=x[y==1,]+1 13 | plot(x,col=y+3,pch=19) 14 | ``` 15 | 16 | Now we will load the package `e1071` which contains the `svm` function we will use. We then compute the fit. Notice that we have to specify a `cost` parameter, which is a tuning parameter. 17 | ```{r} 18 | library(e1071) 19 | dat=data.frame(x,y=as.factor(y)) 20 | svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE) 21 | print(svmfit) 22 | plot(svmfit,dat) 23 | ``` 24 | 25 | As mentioned in the the chapter, the plot function is somewhat crude, and plots X2 on the horizontal axis (unlike what R would do automatically for a matrix). Lets see how we might make our own plot. 26 | 27 | The first thing we will do is make a grid of values for X1 and X2. We will write a function to do that, 28 | in case we want to reuse it. It uses the handy function `expand.grid`, and produces the coordinates of `n*n` points on a lattice covering the domain of `x`. Having made the lattice, we make a prediction at each point on the lattice. We then plot the lattice, color-coded according to the classification. Now we can see the decision boundary. 29 | 30 | The support points (points on the margin, or on the wrong side of the margin) are indexed in the `$index` component of the fit. 
31 | 32 | ```{r} 33 | make.grid=function(x,n=75){ 34 | grange=apply(x,2,range) 35 | x1=seq(from=grange[1,1],to=grange[2,1],length=n) 36 | x2=seq(from=grange[1,2],to=grange[2,2],length=n) 37 | expand.grid(X1=x1,X2=x2) 38 | } 39 | xgrid=make.grid(x) 40 | ygrid=predict(svmfit,xgrid) 41 | plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=20,cex=.2) 42 | points(x,col=y+3,pch=19) 43 | points(x[svmfit$index,],pch=5,cex=2) 44 | ``` 45 | 46 | The `svm` function is not too friendly, in that we have to do some work to get back the linear coefficients, as described in the text. Probably the reason is that this only makes sense for linear kernels, and the function is more general. Here we will use a formula to extract the coefficients; for those interested in where this comes from, have a look in chapter 12 of ESL ("Elements of Statistical Learning"). 47 | 48 | We extract the linear coefficients, and then using simple algebra, we include the decision boundary and the two margins. 49 | 50 | ```{r} 51 | beta=drop(t(svmfit$coefs)%*%x[svmfit$index,]) 52 | beta0=svmfit$rho 53 | plot(xgrid,col=c("red","blue")[as.numeric(ygrid)],pch=20,cex=.2) 54 | points(x,col=y+3,pch=19) 55 | points(x[svmfit$index,],pch=5,cex=2) 56 | abline(beta0/beta[2],-beta[1]/beta[2]) 57 | abline((beta0-1)/beta[2],-beta[1]/beta[2],lty=2) 58 | abline((beta0+1)/beta[2],-beta[1]/beta[2],lty=2) 59 | ``` 60 | 61 | Just like for the other models in this book, the tuning parameter `C` has to be selected. 62 | Different values will give different solutions. Rerun the code above, but using `C=1`, and see what we mean. One can use cross-validation to do this. 63 | 64 | 65 | Nonlinear SVM 66 | -------------- 67 | Instead, we will run the SVM on some data where a non-linear boundary is called for. We will use the mixture data from ESL 68 | 69 | ```{r} 70 | load(url("http://www.stanford.edu/~hastie/ElemStatLearn/datasets/ESL.mixture.rda")) 71 | names(ESL.mixture) 72 | rm(x,y) 73 | attach(ESL.mixture) 74 | ``` 75 | 76 | These data are also two dimensional. Lets plot them and fit a nonlinear SVM, using a radial kernel. 77 | ```{r} 78 | plot(x,col=y+1) 79 | dat=data.frame(y=factor(y),x) 80 | fit=svm(factor(y)~.,data=dat,scale=FALSE,kernel="radial",cost=5) 81 | ``` 82 | 83 | Now we are going to create a grid, as before, and make predictions on the grid. 84 | These data have the grid points for each variable included on the data frame. 85 | ```{r} 86 | xgrid=expand.grid(X1=px1,X2=px2) 87 | ygrid=predict(fit,xgrid) 88 | plot(xgrid,col=as.numeric(ygrid),pch=20,cex=.2) 89 | points(x,col=y+1,pch=19) 90 | ``` 91 | 92 | We can go further, and have the predict function produce the actual function estimates at each of our grid points. We can include the actual decision boundary on the plot by making use of the contour function. On the dataframe is also `prob`, which is the true probability of class 1 for these data, at the gridpoints. If we plot its 0.5 contour, that will give us the _Bayes Decision Boundary_, which is the best one could ever do. 93 | ```{r} 94 | func=predict(fit,xgrid,decision.values=TRUE) 95 | func=attributes(func)$decision 96 | xgrid=expand.grid(X1=px1,X2=px2) 97 | ygrid=predict(fit,xgrid) 98 | plot(xgrid,col=as.numeric(ygrid),pch=20,cex=.2) 99 | points(x,col=y+1,pch=19) 100 | 101 | contour(px1,px2,matrix(func,69,99),level=0,add=TRUE) 102 | contour(px1,px2,matrix(prob,69,99),level=.5,add=TRUE,col="blue",lwd=2) 103 | ``` 104 | 105 | We see in this case that the radial kernel has done an excellent job. 
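As a hedged closing sketch (not part of the original lecture code), the `cost` and `gamma` tuning parameters discussed above can be chosen by cross-validation with `tune` from `e1071`, assuming `dat` is still the mixture-data frame built earlier:

```{r}
# 10-fold CV (tune's default) over a small grid of cost and gamma values
set.seed(1)
tune.out=tune(svm,factor(y)~.,data=dat,kernel="radial",
              ranges=list(cost=c(0.1,1,5,10),gamma=c(0.5,1,2)))
summary(tune.out)
tune.out$best.parameters
```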
-------------------------------------------------------------------------------- /Pdf/classification-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/classification-handout.pdf -------------------------------------------------------------------------------- /Pdf/cv_boot-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/cv_boot-handout.pdf -------------------------------------------------------------------------------- /Pdf/introduction-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/introduction-handout.pdf -------------------------------------------------------------------------------- /Pdf/linear_regression-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/linear_regression-handout.pdf -------------------------------------------------------------------------------- /Pdf/model_selection-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/model_selection-handout.pdf -------------------------------------------------------------------------------- /Pdf/nonlinear-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/nonlinear-handout.pdf -------------------------------------------------------------------------------- /Pdf/statistical_learning-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/statistical_learning-handout.pdf -------------------------------------------------------------------------------- /Pdf/svm-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/svm-handout.pdf -------------------------------------------------------------------------------- /Pdf/trees-handout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/trees-handout.pdf -------------------------------------------------------------------------------- /Pdf/unsupervised-handout.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/AlessandroCorradini/Stanford-University-Statistical-Learning/1086c0ef3bf680e5cee528be97fb33e205554d06/Pdf/unsupervised-handout.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Statistical Learning - Stanford University 2 | 3 | [Statistical Learning](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about) is an introductory-level course in supervised learning, with a focus on regression and classification methods, offered for free by Stanford University. 4 | 5 | ## Content 6 | 7 | - Introduction 8 | - Overview of Statistical Learning 9 | - Linear Regression 10 | - Classification 11 | - Resampling Methods 12 | - Linear Model Selection and Regularization 13 | - Moving Beyond Linearity 14 | - Tree-Based Methods 15 | - Support Vector Machines 16 | - Unsupervised Learning 17 | 18 | ## Certificate of Completion 19 | 20 | You can see the [Certificate of Completion](https://github.com/AlessandroCorradini/Certificates/blob/master/Stanford%20University%20-%20Statistical%20Learning.pdf) and other certificates in my [Certificates Repo](https://github.com/AlessandroCorradini/Certificates), which contains all the certificates I have obtained through my journey as a self-made Data Scientist and better developer. 21 | 22 | 
23 | 24 | ### ⚠️ Disclaimer ⚠️ 25 | 26 | **Please don't fork or copy this repository.** 27 | 28 | **Statistical Learning is a very easy and straightforward course. Data Science, however, is one of the hardest subfields of Computer Science and requires a lot of study and hard work. You can complete this course with minimal effort.** --------------------------------------------------------------------------------