-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmain_analysis.Rmd
234 lines (171 loc) · 6.33 KB
/
main_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
main_analysis
============
Preliminaries
-------------
Load packages.
```{r loading packages}
packages <- c("data.table", "reshape2")
sapply(packages, require, character.only=TRUE, quietly=TRUE)
```
Set path.
```{r set path}
path <- getwd()
path
```
Get the data
------------
Download the file. Put it in the `Data` folder.
```{r downloading file}
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
f <- "Dataset.zip"
if (!file.exists(path)) {dir.create(path)}
download.file(url, file.path(path, f))
```
Unzip the file.
```{r unzip it}
unzip(f)
```
The archive put the files in a folder named `Dataset`. Set this folder as the input path.
```{r set input path}
pathIn <- file.path(path, "UCI HAR Dataset")
list.files(pathIn, recursive=TRUE)
```
Read the files
--------------
Read the subject files.
```{r read subject files}
fileToDataTable <- function (f) {
df <- read.table(f)
dt <- data.table(df)
}
dtSubjectTrain <- fileToDataTable(file.path(pathIn, "train", "subject_train.txt"))
dtSubjectTest <- fileToDataTable(file.path(pathIn, "test" , "subject_test.txt" ))
```
Read the label files.
```{r read label files}
dtActivityTrain <- fileToDataTable(file.path(pathIn, "train", "y_train.txt"))
dtActivityTest <- fileToDataTable(file.path(pathIn, "test" , "y_test.txt" ))
```
Read the data files.
```{r read data files}
dtTrain <- fileToDataTable(file.path(pathIn, "train", "X_train.txt"))
dtTest <- fileToDataTable(file.path(pathIn, "test" , "X_test.txt" ))
```
Read the key files.
```{r read key files}
dtFeatures <- fileToDataTable(file.path(pathIn, "features.txt"))
dtActivityNames <- fileToDataTable(file.path(pathIn, "activity_labels.txt"))
```
1. Merge the training and the test sets
------------------------------------
Concatenate the data tables.
```{r concatenate data tables}
dtSubject <- rbind(dtSubjectTrain, dtSubjectTest)
setnames(dtSubject, "V1", "subject")
dtActivity <- rbind(dtActivityTrain, dtActivityTest)
setnames(dtActivity, "V1", "activityNum")
dt <- rbind(dtTrain, dtTest)
```
Merge columns.
```{r merge columns}
dtSubject <- cbind(dtSubject, dtActivity)
dt <- cbind(dtSubject, dt)
```
Set key.
```{r setting key}
setkey(dt, subject, activityNum)
```
2. Extract only the mean and standard deviation
-----------------------------------------------
Read the `features.txt` file. This tells which variables in `dt` are measurements for the mean and standard deviation.
```{r reading features.txt}
setnames(dtFeatures, names(dtFeatures), c("featureNum", "featureName"))
```
Subset only measurements for the mean and standard deviation.
```{r getting mean and standard deviation}
dtFeatures <- dtFeatures[grepl("mean\\(\\)|std\\(\\)", featureName)]
```
Convert the column numbers to a vector of variable names matching columns in `dt`.
```{r onverting column numbers}
dtFeatures$featureCode <- dtFeatures[, paste0("V", featureNum)]
head(dtFeatures)
dtFeatures$featureCode
```
Subset these variables using variable names.
```{r subsetting}
select <- c(key(dt), dtFeatures$featureCode)
dt <- dt[, select, with=FALSE]
```
3. Use descriptive activity names
---------------------------------
Use the `activity_labels.txt` file. This will be used to add descriptive names to the activities.
```{r reading label files}
setnames(dtActivityNames, names(dtActivityNames), c("activityNum", "activityName"))
```
4. Appropriately labels the data set with descriptive variable names.
---------------------------------------------------------------------
Merge activity labels.
```{r merging labels}
dt <- merge(dt, dtActivityNames, by="activityNum", all.x=TRUE)
```
Add `activityName` as a key.
```{r adding activity name as key}
setkey(dt, subject, activityNum, activityName)
```
Melt the data table to reshape it from a short and wide format to a tall and narrow format.
```{r melting the data table}
dt <- data.table(melt(dt, key(dt), variable.name="featureCode"))
```
Merge activity name.
```{r merge activity name}
dt <- merge(dt, dtFeatures[, list(featureNum, featureCode, featureName)], by="featureCode", all.x=TRUE)
```
Create a new variable, `activity` that is equivalent to `activityName` as a factor class.
Create a new variable, `feature` that is equivalent to `featureName` as a factor class.
```{r creating new variables}
dt$activity <- factor(dt$activityName)
dt$feature <- factor(dt$featureName)
```
Separate features from `featureName` using the helper function `grepthis`.
```{r grepthis}
grepthis <- function (regex) {
grepl(regex, dt$feature)
}
## Features with 2 categories
n <- 2
y <- matrix(seq(1, n), nrow=n)
x <- matrix(c(grepthis("^t"), grepthis("^f")), ncol=nrow(y))
dt$featDomain <- factor(x %*% y, labels=c("Time", "Freq"))
x <- matrix(c(grepthis("Acc"), grepthis("Gyro")), ncol=nrow(y))
dt$featInstrument <- factor(x %*% y, labels=c("Accelerometer", "Gyroscope"))
x <- matrix(c(grepthis("BodyAcc"), grepthis("GravityAcc")), ncol=nrow(y))
dt$featAcceleration <- factor(x %*% y, labels=c(NA, "Body", "Gravity"))
x <- matrix(c(grepthis("mean()"), grepthis("std()")), ncol=nrow(y))
dt$featVariable <- factor(x %*% y, labels=c("Mean", "SD"))
## Features with 1 category
dt$featJerk <- factor(grepthis("Jerk"), labels=c(NA, "Jerk"))
dt$featMagnitude <- factor(grepthis("Mag"), labels=c(NA, "Magnitude"))
## Features with 3 categories
n <- 3
y <- matrix(seq(1, n), nrow=n)
x <- matrix(c(grepthis("-X"), grepthis("-Y"), grepthis("-Z")), ncol=nrow(y))
dt$featAxis <- factor(x %*% y, labels=c(NA, "X", "Y", "Z"))
```
Check to make sure all possible combinations of `feature` are accounted for by all possible combinations of the factor class variables.
```{r combining}
r1 <- nrow(dt[, .N, by=c("feature")])
r2 <- nrow(dt[, .N, by=c("featDomain", "featAcceleration", "featInstrument", "featJerk", "featMagnitude", "featVariable", "featAxis")])
r1 == r2
```
5. Create a new independent tidy data set
---------------------------------------------
Create a data set with the average of each variable for each activity and each subject.
```{r creating a new data set}
setkey(dt, subject, activity, featDomain, featAcceleration, featInstrument, featJerk, featMagnitude, featVariable, featAxis)
dtTidy <- dt[, list(count = .N, average = mean(value)), by=key(dt)]
```
Make codebook.
```{r creating a codebook}
knit("createCodebook.Rmd", output="codebook.md", encoding="ISO8859-1", quiet = TRUE)
markdownToHTML("codebook.md", "codebook.html")
```