├── README.md
├── index.html
└── project.Rmd
/README.md:
--------------------------------------------------------------------------------
Practical Machine Learning
==========================

Course project for Practical Machine Learning: https://www.coursera.org/course/predmachlearn

[View the HTML file](http://justmarkham.github.io/PracticalMachineLearning/)
--------------------------------------------------------------------------------
/project.Rmd:
--------------------------------------------------------------------------------
---
title: "Practical Machine Learning - Course Project"
output: html_document
---

## Introduction

For this project, we are given data from accelerometers on the belt, forearm, arm, and dumbbell of 6 research study participants. Our training data consists of accelerometer measurements along with a label identifying the quality of the activity the participant was performing. Our testing data consists of the same accelerometer measurements without the identifying label, and our goal is to predict the labels for the test set observations.

Below is the code I used when building the model, estimating the out-of-sample error, and making predictions, along with a description of each step of the process.

## Data Preparation
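If the two CSV files are not already in the working directory, they can be fetched first. This is a small reproducibility sketch; the cloudfront URLs are the ones the course pointed to at the time and should be treated as an assumption to verify before relying on them:

```{r}
# download the course data if missing (URLs are the course-provided
# ones, included here as an assumption; adjust if they have moved)
if (!file.exists("pml-training.csv")) {
    download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                  destfile="pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
    download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                  destfile="pml-testing.csv")
}
```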
I load the caret package and read in the training and testing data:

```{r}
library(caret)
ptrain <- read.csv("pml-training.csv")
ptest <- read.csv("pml-testing.csv")

# as of R 4.0, read.csv no longer converts strings to factors, so make
# the label a factor explicitly for caret and confusionMatrix
ptrain$classe <- factor(ptrain$classe)
```
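As a quick sanity check, the dimensions of each set and the distribution of the outcome label confirm that the files were read as expected:

```{r}
# row/column counts of each set, and the balance of the five classes
dim(ptrain)
dim(ptest)
table(ptrain$classe)
```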
Because I want to be able to estimate the out-of-sample error, I randomly split the full training data (ptrain) into a smaller training set (ptrain1) and a validation set (ptrain2):

```{r}
set.seed(10)
inTrain <- createDataPartition(y=ptrain$classe, p=0.7, list=FALSE)
ptrain1 <- ptrain[inTrain, ]
ptrain2 <- ptrain[-inTrain, ]
```
I am now going to reduce the number of features by removing variables with nearly zero variance, variables that are almost always NA, and variables that don't make intuitive sense for prediction. Note that I decide which ones to remove by analyzing ptrain1, and perform the identical removals on ptrain2:

```{r}
# remove variables with nearly zero variance
nzv <- nearZeroVar(ptrain1)
ptrain1 <- ptrain1[, -nzv]
ptrain2 <- ptrain2[, -nzv]

# remove variables that are almost always NA
mostlyNA <- sapply(ptrain1, function(x) mean(is.na(x))) > 0.95
ptrain1 <- ptrain1[, !mostlyNA]
ptrain2 <- ptrain2[, !mostlyNA]

# remove variables that don't make intuitive sense for prediction
# (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
# cvtd_timestamp), which happen to be the first five variables
ptrain1 <- ptrain1[, -(1:5)]
ptrain2 <- ptrain2[, -(1:5)]
```
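After these removals, a one-line check verifies that the training and validation sets still contain identical columns, so the validation predictions will be comparable:

```{r}
# ptrain1 and ptrain2 should have the same predictors after reduction
dim(ptrain1)
dim(ptrain2)
identical(names(ptrain1), names(ptrain2))
```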
## Model Building

I decided to start with a Random Forest model, to see if it would have acceptable performance. I fit the model on ptrain1, and instruct the "train" function to use 3-fold cross-validation to select optimal tuning parameters for the model.

```{r}
# instruct train to use 3-fold CV to select optimal tuning parameters
fitControl <- trainControl(method="cv", number=3, verboseIter=FALSE)

# fit model on ptrain1
fit <- train(classe ~ ., data=ptrain1, method="rf", trControl=fitControl)

# print final model to see tuning parameters it chose
fit$finalModel
```
I see that it decided to use 500 trees and to try 27 variables at each split (mtry = 27).
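These choices can also be read off programmatically, which is handy if the chunk is re-run on different data (500 trees is randomForest's default; mtry is the cross-validated choice):

```{r}
# number of trees and the mtry value selected by cross-validation
fit$finalModel$ntree
fit$bestTune
```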
## Model Evaluation and Selection

Now, I use the fitted model to predict the label ("classe") in ptrain2, and show the confusion matrix to compare the predicted versus the actual labels:

```{r}
# use model to predict classe in validation set (ptrain2)
preds <- predict(fit, newdata=ptrain2)

# show confusion matrix (predictions first, reference second) to get
# an estimate of the out-of-sample error
confusionMatrix(preds, ptrain2$classe)
```
The accuracy is 99.8%, so my estimated out-of-sample error is 0.2%.
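The same estimate can be computed directly, without reading it off the confusion matrix output:

```{r}
# estimated out-of-sample error: share of misclassified validation rows
mean(preds != ptrain2$classe)
```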
This is an excellent result, so rather than trying additional algorithms, I will use Random Forests to predict on the test set.

## Re-training the Selected Model

Before predicting on the test set, it is important to train the model on the full training set (ptrain), rather than using a model trained on a reduced training set (ptrain1), in order to produce the most accurate predictions. Therefore, I now repeat everything I did above on ptrain and ptest:
```{r}
# remove variables with nearly zero variance
nzv <- nearZeroVar(ptrain)
ptrain <- ptrain[, -nzv]
ptest <- ptest[, -nzv]

# remove variables that are almost always NA
mostlyNA <- sapply(ptrain, function(x) mean(is.na(x))) > 0.95
ptrain <- ptrain[, !mostlyNA]
ptest <- ptest[, !mostlyNA]

# remove variables that don't make intuitive sense for prediction
# (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
# cvtd_timestamp), which happen to be the first five variables
ptrain <- ptrain[, -(1:5)]
ptest <- ptest[, -(1:5)]

# re-fit model using full training set (ptrain)
fitControl <- trainControl(method="cv", number=3, verboseIter=FALSE)
fit <- train(classe ~ ., data=ptrain, method="rf", trControl=fitControl)
```
## Making Test Set Predictions

Now, I use the model fit on ptrain to predict the label for the observations in ptest, and write those predictions to individual files:

```{r}
# predict on test set
preds <- predict(fit, newdata=ptest)

# convert predictions to character vector
preds <- as.character(preds)

# create function to write predictions to files
pml_write_files <- function(x) {
    n <- length(x)
    for(i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}

# create prediction files to submit
pml_write_files(preds)
```
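As a final check that the submission files were written, the generated files can be listed (this simply inspects the working directory, and assumes no other files match the pattern):

```{r}
# one submission file per test observation
list.files(pattern = "^problem_id_.*\\.txt$")
```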
--------------------------------------------------------------------------------