├── README.md
├── codebook.md
└── run_analysis.R

/README.md:
--------------------------------------------------------------------------------
# Getting and Cleaning Data Course Project
Getting and Cleaning Data Course Project for Coursera

## run_analysis.R
This R script takes the data from the UCI HAR Dataset and does the following:

1. Reads the training and test data for the variables (sensor measurements), subjects, and activities. It merges the training and test sets for each of the three, then combines the variable, subject, and activity datasets into one dataframe.
2. Extracts the variables that contain the mean and standard deviation summary statistics for each measurement and stores them in a new dataframe.
3. From the dataframe in #2, it replaces the activity code with the activity name for clarity, using the data dictionary provided in **activity_labels.txt**.
4. Renames the variables to be more descriptive.
5. From the dataframe in #4, the script creates an output dataset, **dataset2.txt**, with the average of each variable grouped by activity and subject.

Additional detail is provided as comments in the script.

## codebook.md
A codebook is provided as an overview of the variables contained in **dataset2.txt**.

### Dataset Reference
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

--------------------------------------------------------------------------------
/codebook.md:
--------------------------------------------------------------------------------
# Codebook

## Mean Summary of a Dataset of Human Activity Recognition Using Smartphones

This dataset is a mean summary of an experimental dataset obtained with a smartphone on human activity
recognition.
More information on the experiment and initial analysis can be found at
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Each row of this mean summary dataset provides

1. The activity
2. The subject
3. The variable name
4. The mean of all measurements of that variable for that subject and activity

The variable names use the following nomenclature:
- *time-domain* or *frequency-domain* indicates which domain the variable is in
- *Body* or *Gravity* indicates which motion component of the signal the measurement comes from
- *accelerometer* or *gyroscope* indicates which sensor the measurement comes from
- *Jerk* is the time derivative of the linear acceleration (accelerometer) or of the angular velocity (gyroscope)
- *magnitude* is the magnitude of the 3-dimensional signal, calculated with the Euclidean norm
- *mean* or *stdev* refers to the statistic calculated
- *x-axis*, *y-axis*, or *z-axis* refers to the axis that the sensor was measuring

e.g. The variable **frequency-domain_Body_gyroscope_Jerk_magnitude_mean** refers to the mean value of the magnitude of the Jerk in the frequency domain from the gyroscope for the body motion component.

### Units
- Accelerometer values are in units of **g** (standard gravity), approximately **9.81 m/s^2**.
- Gyroscope values are angular velocities in **radians per second**.

--------------------------------------------------------------------------------
/run_analysis.R:
--------------------------------------------------------------------------------
# 1. Merges the training and the test sets to create one data set.

# Assumes the working directory contains the unzipped "UCI HAR Dataset" folder.
# Paths are built with file.path() so the working directory never changes.
data_dir <- "UCI HAR Dataset"

# Read in training data
xtrain <- read.table(file.path(data_dir, "train", "X_train.txt"), header = FALSE)
ytrain <- read.table(file.path(data_dir, "train", "y_train.txt"), header = FALSE, col.names = "Activity_Code")
subject_train <- read.table(file.path(data_dir, "train", "subject_train.txt"), header = FALSE, col.names = "Subject")

# Read in testing data
xtest <- read.table(file.path(data_dir, "test", "X_test.txt"), header = FALSE)
ytest <- read.table(file.path(data_dir, "test", "y_test.txt"), header = FALSE, col.names = "Activity_Code")
subject_test <- read.table(file.path(data_dir, "test", "subject_test.txt"), header = FALSE, col.names = "Subject")

# Read in column headers (column 2 holds the feature names)
features <- read.table(file.path(data_dir, "features.txt"), header = FALSE, stringsAsFactors = FALSE)

# Concatenating training and testing dataframes
x <- rbind(xtrain, xtest)
y <- rbind(ytrain, ytest)
subject <- rbind(subject_train, subject_test)

# Assigning column headers
colnames(x) <- features[, 2]

# Concatenating activity, subject, sensor data
data <- cbind(y, subject, x)

# 2. Extracts only the measurements on the mean and standard deviation for each measurement.

# Literal substrings to match; fixed = TRUE treats the parentheses as plain
# characters, so "meanFreq()" columns are not selected
pattern1 <- "mean()"
pattern2 <- "std()"
col1 <- grep(pattern1, colnames(data), fixed = TRUE)
col2 <- grep(pattern2, colnames(data), fixed = TRUE)
# Sort column index, keeping Activity_Code (1) and Subject (2)
cols <- sort(c(1, 2, col1, col2))
# Extract only select columns
tidy_data <- data[, cols]

# 3. Uses descriptive activity names to name the activities in the data set
activity_labels <- read.table(file.path(data_dir, "activity_labels.txt"), header = FALSE,
                              col.names = c("Activity_Code", "Activity"))
tidy <- merge(activity_labels, tidy_data, by = "Activity_Code", all = TRUE)
tidy <- subset(tidy, select = -Activity_Code)

# 4. Appropriately labels the data set with descriptive variable names.

# Grab column names and replace substrings using regular expressions
headers <- colnames(tidy)
headers <- gsub("-", "", headers)
headers <- gsub("BodyBody", "Body", headers)
headers <- gsub("Body", "Body_", headers)
headers <- gsub("Gravity", "Gravity_", headers)
headers <- gsub("Jerk", "Jerk_", headers)
headers <- gsub("mean()", "mean_", headers, fixed = TRUE)
headers <- gsub("std()", "stdev_", headers, fixed = TRUE)
headers <- gsub("^t", "time-domain_", headers)
headers <- gsub("^f", "frequency-domain_", headers)
headers <- gsub("Acc", "accelerometer_", headers)
headers <- gsub("X$", "x-axis", headers)
headers <- gsub("Y$", "y-axis", headers)
headers <- gsub("Z$", "z-axis", headers)
headers <- gsub("Gyro", "gyroscope_", headers)
headers <- gsub("Mag", "magnitude_", headers)
headers <- gsub("_$", "", headers)
# Reassign column names
colnames(tidy) <- headers

# 5. From the data set in step 4, creates a second, independent tidy data set with the
# average of each variable for each activity and each subject.

# Reshape dataframe from wide to long
library(reshape2)
melted <- melt(tidy, id.vars = c("Activity", "Subject"))
# Split-Apply-Combine dataframe
library(dplyr)
tidy2 <- summarise(group_by(melted, Activity, Subject, variable), mean = mean(value))
# Write to file
write.table(tidy2, file = "dataset2.txt", row.names = FALSE)

--------------------------------------------------------------------------------
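
The renaming in step 4 can be checked without the dataset. The following is a minimal sketch, not part of the repository: `rename_feature` is a hypothetical helper that wraps the same substitutions from `run_analysis.R`, in the same order, and the three raw names are examples in the style of `features.txt`. It uses base R only.

```r
# Hypothetical helper (not in run_analysis.R): applies the script's
# substitutions, in the same order, to a character vector of raw feature names.
rename_feature <- function(headers) {
  headers <- gsub("-", "", headers)                          # drop separators
  headers <- gsub("BodyBody", "Body", headers)               # collapse repeated "Body"
  headers <- gsub("Body", "Body_", headers)
  headers <- gsub("Gravity", "Gravity_", headers)
  headers <- gsub("Jerk", "Jerk_", headers)
  headers <- gsub("mean()", "mean_", headers, fixed = TRUE)  # literal match
  headers <- gsub("std()", "stdev_", headers, fixed = TRUE)  # literal match
  headers <- gsub("^t", "time-domain_", headers)             # leading t
  headers <- gsub("^f", "frequency-domain_", headers)        # leading f
  headers <- gsub("Acc", "accelerometer_", headers)
  headers <- gsub("X$", "x-axis", headers)
  headers <- gsub("Y$", "y-axis", headers)
  headers <- gsub("Z$", "z-axis", headers)
  headers <- gsub("Gyro", "gyroscope_", headers)
  headers <- gsub("Mag", "magnitude_", headers)
  gsub("_$", "", headers)                                    # strip trailing underscore
}

# Example raw names, assumed to follow the features.txt naming style
raw <- c("tBodyAcc-mean()-X",
         "tGravityAcc-std()-Z",
         "fBodyBodyGyroJerkMag-mean()")
renamed <- rename_feature(raw)
print(renamed)
```

The order of the substitutions matters: hyphens are removed before `time-domain_` and `x-axis` are introduced, so the hyphens in the final names survive. The last example should yield the variable name used as the illustration in the codebook.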