├── README.md
├── codebook.md
└── run_analysis.R


/README.md:
--------------------------------------------------------------------------------
 1 | # Getting and Cleaning Data Course Project
 2 | Getting and Cleaning Data Course Project for Coursera 
 3 | 
 4 | ## run_analysis.R
 5 | This R script takes the data from the UCI HAR Dataset and does the following:
 6 | 
 7 | 1. Reads the training and test data for the measurement variables, subjects, and activities. It row-binds each training set to its matching test set, then column-binds the activity, subject, and variable data into a single dataframe.
 8 | 2. Extracts the variables that hold the mean and standard deviation of each measurement (column names containing `mean()` or `std()`) into a new dataframe.
 9 | 3. From the dataframe in #2, replaces each activity code with its activity name, using the lookup table provided in **activity_labels.txt**.
10 | 4. Renames the variables to be more descriptive.
11 | 5. From the dataframe in #4, writes an output dataset, **dataset2.txt**, containing the average of each variable grouped by activity and subject.
12 | 
13 | Additional detail is provided as comments in the script.
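
The column selection in step 2 relies on literal (fixed) string matching. The sketch below, run on a few sample feature names, illustrates why that matters: read as a regular expression, `mean()` is just `mean` followed by an empty group, so it would also pick up `meanFreq()` columns. The vector `feature_names` and the two result names are illustrative only and do not appear in the script.

```r
# A few sample feature names in the style of features.txt
feature_names <- c(
  "tBodyAcc-mean()-X",
  "tBodyAcc-std()-Y",
  "fBodyAcc-meanFreq()-X",
  "angle(tBodyAccMean,gravity)"
)

# Literal match: the parentheses must be present in the name
literal_hits <- grep("mean()", feature_names, fixed = TRUE)

# Regex match: "()" is an empty group, so any name containing "mean" matches
regex_hits <- grep("mean()", feature_names)

print(feature_names[literal_hits])
print(feature_names[regex_hits])
```

With `fixed = TRUE` only the `mean()` column is selected, which is the behaviour the script depends on.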
14 | 
15 | ## codebook.md
16 | A codebook is provided as an overview of the variables contained in **dataset2.txt**. 
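
As a rough sketch of how the output could be read back into R: `write.table` writes a header row and space-separated fields by default, so `read.table(..., header = TRUE)` should recover the four columns. The toy rows and the temporary file below are stand-ins for the real **dataset2.txt**, not actual results.

```r
# Toy stand-in with the same column layout as dataset2.txt (values are made up)
toy <- data.frame(
  Activity = c("LAYING", "WALKING"),
  Subject  = c(1L, 2L),
  variable = c("time-domain_Body_accelerometer_mean_x-axis",
               "time-domain_Body_accelerometer_stdev_x-axis"),
  mean     = c(0.25, -0.5),
  stringsAsFactors = FALSE
)

# Write it the same way the script does, to a temporary file
path <- tempfile(fileext = ".txt")
write.table(toy, file = path, row.names = FALSE)

# Read it back; header = TRUE restores the column names
back <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
print(back)
```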
17 | 
18 | ### Dataset Reference
19 | http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
20 | 


--------------------------------------------------------------------------------
/codebook.md:
--------------------------------------------------------------------------------
 1 | # Codebook
 2 | 
 3 | ## Mean Summary of a Dataset of Human Activity Recognition Using Smartphones
 4 | 
 5 | This dataset is a mean summary of an experimental dataset obtained with a smartphone on human activity 
 6 | recognition. More information on the experiment and initial analysis can be found at 
 7 | http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
 8 | 
 9 | Each row of this mean summary dataset provides:
10 | 
11 | 1. The activity
12 | 2. The subject
13 | 3. The variable name
14 | 4. The mean of all measurements of that variable for that subject and activity
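
The row layout above can be illustrated with a small base-R sketch. The script itself uses `reshape2::melt` and `dplyr`; the version below swaps in `aggregate` so it runs without extra packages, and the data frame `wide` with its column names and values is entirely made up.

```r
# Made-up wide table: one column per variable, repeated measurements per row
wide <- data.frame(
  Activity = c("WALKING", "WALKING", "SITTING"),
  Subject  = c(1, 1, 2),
  a_mean   = c(0.2, 0.4, 0.1),
  b_stdev  = c(1.0, 3.0, 5.0),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (Activity, Subject, variable, value)
measure_cols <- setdiff(names(wide), c("Activity", "Subject"))
long <- data.frame(
  Activity = rep(wide$Activity, times = length(measure_cols)),
  Subject  = rep(wide$Subject, times = length(measure_cols)),
  variable = rep(measure_cols, each = nrow(wide)),
  value    = unlist(wide[measure_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Average every variable within each activity/subject pair
summary_long <- aggregate(value ~ Activity + Subject + variable,
                          data = long, FUN = mean)
names(summary_long)[names(summary_long) == "value"] <- "mean"
print(summary_long)
```

Each row of `summary_long` then carries an activity, a subject, a variable name, and the mean for that combination, which is the shape described in the list above.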
15 | 
16 | The variable names are built from the following components:
17 | - *time-domain* or *frequency-domain* to indicate which domain the variable is in
18 | - *Body* or *Gravity* to indicate which motion component of the accelerometer signal the measurement is coming from
19 | - *accelerometer* or *gyroscope* to indicate which sensor the measurement is coming from
20 | - *Jerk* is the time derivative of the linear acceleration (accelerometer) or of the angular velocity (gyroscope)
21 | - *magnitude* is the amplitude of the 3-dimensional signal using the Euclidean norm
22 | - *mean* or *stdev* refers to the statistic calculated 
23 | - *x-axis* or *y-axis* or *z-axis* refers to the axis that the sensor was measuring
24 | 
25 | For example, the variable **frequency-domain_Body_gyroscope_Jerk_magnitude_mean** is the mean of the Jerk magnitude in the frequency domain, measured by the gyroscope, for the body motion component. 
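
The mapping from a raw feature name to its descriptive name can be sketched as a chain of substitutions. The helper `verbose_name` below is a hypothetical wrapper, not part of run_analysis.R; it applies the same `gsub` calls, in the same order, that the script applies to the column headers.

```r
# Hypothetical helper: rename one or more raw feature names
verbose_name <- function(name) {
  name <- gsub("-", "", name)
  name <- gsub("BodyBody", "Body", name)
  name <- gsub("Body", "Body_", name)
  name <- gsub("Gravity", "Gravity_", name)
  name <- gsub("Jerk", "Jerk_", name)
  name <- gsub("mean()", "mean_", name, fixed = TRUE)
  name <- gsub("std()", "stdev_", name, fixed = TRUE)
  name <- gsub("^t", "time-domain_", name)
  name <- gsub("^f", "frequency-domain_", name)
  name <- gsub("Acc", "accelerometer_", name)
  name <- gsub("X$", "x-axis", name)
  name <- gsub("Y$", "y-axis", name)
  name <- gsub("Z$", "z-axis", name)
  name <- gsub("Gyro", "gyroscope_", name)
  name <- gsub("Mag", "magnitude_", name)
  gsub("_$", "", name)  # drop a trailing underscore, if any
}

print(verbose_name("fBodyBodyGyroJerkMag-mean()"))
print(verbose_name(c("tBodyAcc-mean()-X", "tGravityAcc-std()-Z")))
```

The hyphens are removed first, so the hyphens introduced later by `time-domain_`, `frequency-domain_`, and the axis suffixes survive in the final name.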
26 | 
27 | ### Units
28 | - Accelerometer measurements are in standard gravity units, **g**, where 1 g is approximately **9.81 m/s^2**. 
29 | - Gyroscope measurements are angular velocities in **radians per second**.
30 | 


--------------------------------------------------------------------------------
/run_analysis.R:
--------------------------------------------------------------------------------
 1 | # 1. Merges the training and the test sets to create one data set.
 2 | 
 3 | # Read in training data (assumes the working directory contains the "UCI HAR Dataset" folder)
 4 | setwd("./UCI HAR Dataset/train")
 5 | xtrain <- read.table("X_train.txt", header=FALSE, sep="")
 6 | ytrain <- read.table("y_train.txt", header=FALSE, sep="", col.names="Activity_Code")
 7 | subject_train <- read.table("subject_train.txt", header=FALSE, sep="", col.names="Subject")
 8 | 
 9 | # Read in testing data
10 | setwd("../test")
11 | xtest <- read.table("X_test.txt", header=FALSE, sep="")
12 | ytest <- read.table("y_test.txt", header=FALSE, sep="", col.names="Activity_Code")
13 | subject_test <- read.table("subject_test.txt", header=FALSE, sep="", col.names="Subject")
14 | 
15 | # Read in column headers; features.txt sits in the dataset root, one level above test/
16 | setwd("..")
17 | features <- read.table("features.txt", header=FALSE, sep="", stringsAsFactors=FALSE)
18 | 
19 | # Concatenating training and testing dataframes
20 | x <- rbind(xtrain, xtest)
21 | y <- rbind(ytrain, ytest)
22 | subject <- rbind(subject_train, subject_test)
23 | 
24 | # Assigning column headers
25 | colnames(x) <- features[,2]
26 | 
27 | # Concatenating subject, activity, sensor data
28 | data <- cbind(y,subject,x)
29 | 
30 | # 2. Extracts only the measurements on the mean and standard deviation for each measurement.
31 | 
32 | # Literal patterns (fixed=TRUE, not regular expressions): match mean() and std() but not meanFreq()
33 | pattern1 <- "mean()"
34 | pattern2 <- "std()"
35 | col1 <- grep(pattern1, colnames(data), fixed=TRUE)
36 | col2 <- grep(pattern2, colnames(data), fixed=TRUE)
37 | # Keep the Activity_Code (1) and Subject (2) columns, then sort the column indices
38 | cols <- sort(c(1,2,col1,col2))
39 | # Extract only the selected columns
40 | tidy_data <- data[,cols]
41 | 
42 | # 3. Uses descriptive activity names to name the activities in the data set
43 | activity_labels <- read.csv("activity_labels.txt", header=F, sep="", col.names=c("Activity_Code","Activity"))
44 | tidy <- merge(activity_labels, tidy_data, by="Activity_Code", all=TRUE)
45 | tidy <- subset(tidy, select=-Activity_Code)
46 | 
47 | # 4. Appropriately labels the data set with descriptive variable names.
48 | 
49 | # Grab column names and replace substrings using regular expressions
50 | headers <- colnames(tidy)
51 | headers <- gsub("-","",headers)
52 | headers <- gsub("BodyBody","Body",headers)
53 | headers <- gsub("Body","Body_",headers)
54 | headers <- gsub("Gravity","Gravity_",headers)
55 | headers <- gsub("Jerk","Jerk_",headers)
56 | headers <- gsub("mean()","mean_",headers, fixed = TRUE)
57 | headers <- gsub("std()","stdev_",headers, fixed = TRUE)
58 | headers <- gsub("^t","time-domain_",headers)
59 | headers <- gsub("^f","frequency-domain_",headers)
60 | headers <- gsub("Acc","accelerometer_",headers)
61 | headers <- gsub("X$","x-axis",headers)
62 | headers <- gsub("Y$","y-axis",headers)
63 | headers <- gsub("Z$","z-axis",headers)
64 | headers <- gsub("Gyro","gyroscope_",headers)
65 | headers <- gsub("Mag","magnitude_",headers)
66 | headers <- gsub("_$","",headers)
67 | # Reassign column names
68 | colnames(tidy) <- headers
69 | 
70 | # 5. From the data set in step 4, creates a second, independent tidy data set with the 
71 | # average of each variable for each activity and each subject.
72 | 
73 | # Reshape dataframe from wide to long
74 | library(reshape2)
75 | melted <- melt(tidy, id.vars=c("Activity","Subject"))
76 | # Split-Apply-Combine dataframe
77 | library(dplyr)
78 | tidy2 <- summarise(group_by(melted, Activity, Subject, variable), mean=mean(value))
79 | # Write to file
80 | write.table(tidy2, file="dataset2.txt", row.names=F)
81 | 


--------------------------------------------------------------------------------