---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
Before we download the data, let's first record the date: `r Sys.Date()`. Next, we get the data from the web. The file comes from a repository in my GitHub account that was forked from the original.
```{r download}
source_url <- "https://raw.githubusercontent.com/ASonCourse/RepData_PeerAssessment1/master/activity.zip"
download.file(source_url, destfile = "activity.zip",
              method = "curl")
```
The downloaded file was compressed, so we will unzip it now:
```{r unzip}
unzip("activity.zip", exdir = ".")
```
It turns out the unzipped file is a comma separated file. We can import that into R easily:
```{r import}
activity <- read.csv("activity.csv", stringsAsFactors = FALSE)
```
The `date` variable in `activity` is of type character and needs to be converted to class `Date`:
```{r char_to_date}
activity$date <- as.Date(activity$date, format = "%Y-%m-%d")
```
This concludes the loading and preprocessing of the data. [Element #1 required for a complete submission ("Code for reading in the dataset and/or processing the data.")]
## What is mean total number of steps taken per day?
Since we'll be using `group_by()` and related commands to figure this out we need to install and load the `dplyr` package:
```{r install_dplyr, message=FALSE, warning=FALSE}
options(repos = "https://cran.rstudio.com")
if (!require(dplyr)) install.packages("dplyr")
library(dplyr)
```
NOTE:
Installing packages from within an R Markdown (`knitr`) document is not advised. The consensus seems to be that users should manage their own package library, and we should assume that people know how to install a missing package. Here we've opted to download and install the package _only_ if it's necessary. Many users will have `dplyr` installed already. See this [resource](https://stackoverflow.com/questions/33969024/install-packages-fails-in-knitr-document-trying-to-use-cran-without-setting-a) for more on the subject.
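One way to follow that advice is to check for a package and stop with a clear message instead of installing it. The following is a minimal sketch; `need_package()` is a hypothetical helper and not part of the original analysis:

```r
# Hypothetical helper: verify that a package is installed without installing it.
need_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is required; install it with install.packages(\"",
         pkg, "\").", call. = FALSE)
  }
  invisible(TRUE)
}

need_package("stats")  # 'stats' ships with R, so this passes silently
```

With a check like this, the decision to install anything stays with the reader.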
Calculate mean total number of steps per day:
```{r mean_total_steps}
mean_total_steps <- activity %>%
  group_by(date) %>%
  summarise(daily_steps = sum(steps)) %>%
  summarise(mean_total_steps = mean(daily_steps, na.rm = TRUE))
# Extract the single value from the one-row summary before rounding:
mean_steps <- round(mean_total_steps$mean_total_steps, 0)
```
Result: `r as.integer(mean_steps)` steps. How about the median? Let's calculate the median as well:
```{r median_total_steps}
median_total_steps <- activity %>%
  group_by(date) %>%
  summarise(daily_steps = sum(steps)) %>%
  summarise(median_total_steps = median(daily_steps, na.rm = TRUE))
# Round the median itself (not 'mean_total_steps'):
median_steps <- round(median_total_steps$median_total_steps, 0)
```
Result: `r as.integer(median_steps)` steps.
So, the mean and the median are very close to each other. [Element #3 required for a complete submission ("Mean and median number of steps taken each day.")]
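The chunks above call `sum(steps)` without `na.rm`, so a day with missing values gets an NA total and is only dropped later by `na.rm = TRUE` in `mean()` or `median()`. The order matters. A small base R sketch on made-up data shows how the result would change if the NA's were removed inside `sum()` instead:

```r
# Toy data (made up for illustration): day "B" has no recorded steps at all.
toy <- data.frame(
  date  = rep(c("A", "B", "C"), each = 2),
  steps = c(100, 200, NA, NA, 300, 500)
)

# sum() without na.rm returns NA for a day that contains missing values...
totals_na   <- tapply(toy$steps, toy$date, sum)
# ...whereas na.rm = TRUE turns a fully missing day into a total of 0.
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean(totals_na, na.rm = TRUE)  # day B is excluded: (300 + 800) / 2
mean(totals_zero)              # day B counts as 0: (300 + 0 + 800) / 3
```

Treating a day without measurements as a day with zero steps would pull the mean down, which is why the report keeps the NA totals and excludes them afterwards.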
How about a plot? Yes, let's create a histogram of the total number of steps taken each day:
```{r histogram_daily_steps}
steps_by_day <- activity %>%
group_by(date) %>%
summarise (daily_steps = sum(steps))
hist(steps_by_day$daily_steps,
main = "Daily Total Steps",
xlab = "Steps / Day")
```
An alternative way of visualizing the total number of steps taken each day is by means of a boxplot:
```{r boxplot_daily_steps}
boxplot(steps_by_day$daily_steps,
main = "Daily Total Steps")
```
The histogram shows that on most days the number of steps lies between 10,000 and 15,000. The boxplot clearly shows the median and that there are a couple of outliers. [Element #2 required for a complete submission ("Histogram of the total number of steps taken each day.")]
## What is the average daily activity pattern?
To show the daily activity pattern we have to group and summarise the data by interval first:
```{r steps_by_interval}
steps_by_interval <- activity %>%
group_by(interval) %>%
summarise (av_steps = mean(steps, na.rm = TRUE))
```
Once we've done that it's relatively easy to create a time-series plot:
```{r daily_pattern_plot}
plot(steps_by_interval$av_steps, x = steps_by_interval$interval,
type = "l",
main = "Daily Activity",
xlab = "Intervals",
ylab = "Average Steps")
```
[Element #4 required for a complete submission ("Time series plot of the average number of steps taken.")]
What's the 5-minute interval with the highest number of steps (on average)? Judging from the plot we can guess that it should be around 800. Luckily, we can figure it out exactly:
```{r max_interval}
max_steps <- steps_by_interval[which.max(steps_by_interval$av_steps),]
max_steps_round <- round(max_steps$av_steps, 0)
```
So, the highest number of steps (on average) is `r max_steps_round`. The corresponding interval is `r max_steps$interval`. [Element #5 required for a complete submission ("The 5-minute interval that, on average, contains the maximum number of steps.")]
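The interval identifiers in this dataset appear to encode the time of day as hours and minutes, so that 835 would stand for 08:35. That is an assumption about the data, not something stated above. Under that assumption, a small hypothetical helper (here called `interval_to_time()`) turns an identifier into a clock time:

```r
# Hypothetical helper; assumes intervals are coded as HHMM integers (e.g. 835).
interval_to_time <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100, interval %% 100)
}

interval_to_time(c(0, 55, 100, 835, 2355))
```

If the assumption holds, `interval_to_time(max_steps$interval)` would make the result above easier to read.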
## Imputing missing values
So, there are missing values in the dataset (NA's). How many of them are there, exactly? Simply calling `summary()` on `activity` will tell us:
```{r total_number_NAs_1}
summary(activity)
```
Alternatively, we can keep only the `activity` observations where `steps` is NA and then count the rows of the filtered dataframe:
```{r total_number_NAs_2}
na_activity <- activity %>% filter(is.na(steps))
total_number_NAs <- nrow(na_activity)
```
The result of this operation is: `r total_number_NAs` NA's. In relative terms, though, this is only a fraction of the original dataset (`r round(100 * total_number_NAs / nrow(activity), 1)` percent).
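Base R offers a shorter route to the same numbers: `is.na()` yields a logical vector, so `sum()` counts the missing values and `mean()` gives their share. A sketch on a made-up vector:

```r
# Toy vector (made up): two of five observations are missing.
steps <- c(12, NA, 0, NA, 7)

n_missing     <- sum(is.na(steps))   # number of NA values
share_missing <- mean(is.na(steps))  # proportion of NA values
```

Applied to the real data, this would read `sum(is.na(activity$steps))`.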
How to deal with these NA's? Well, that depends. We have to know a bit more about the NA's. How are they spread across the dates / days? The following code answers that question:
```{r NAs_by_day}
na_by_day <- na_activity %>%
group_by(date) %>%
summarise(na_count = n())
na_by_day
```
The NA's seem to occur on only 8 days. Judging by the number of NA's for each of these days (288, which is every 5-minute interval in a day: 12 x 24) we can conclude that data for complete days are missing. How are these 8 days without data spread across the total number of days? Which days are missing? Always the same days or does it vary? Let's find out:
```{r NAs_weekday_spread}
na_by_day$weekday <- weekdays(na_by_day$date)
na_by_day$weekday <- as.factor(na_by_day$weekday)
summary(na_by_day)
different_days <- n_distinct(na_by_day$weekday)
```
So, the missing data seems to be spread fairly evenly over the measurement period and the weekdays (only Tuesday has no missing data; `n_distinct()` returned `r different_days`).
Is it a good idea to impute the missing data by calculating typical averages for each day of the week? This seems reasonable, since data for _complete_ days is missing.
This can be done by computing the average number of steps per 5-minute interval for each day of the week (by grouping and summarising). Joining these weekday averages to the full dataframe on the weekday makes the matching average available in every row. The NA values can then be replaced by the average for their weekday.
A dataset with imputed data in place of the NA's can be created like this:
```{r activity_imputed}
# Copy activity to activity_imputed:
activity_imputed <- activity
# Create column 'steps.imp' (set to '0' for clarity):
activity_imputed$steps.imp <- 0
# Add column 'weekday' (generated from 'date'):
activity_imputed$weekday <- as.factor(weekdays(activity_imputed$date))
# Create separate dataframe 'activity_weekday':
activity_weekday <- activity %>%
mutate(weekday = as.factor(weekdays(date))) %>%
group_by(weekday) %>%
summarise(daily_steps = mean(steps, na.rm = TRUE))
# Now join the dataframes 'activity_imputed' and
# 'activity_weekday' on column 'weekday' (which is in both):
activity_imputed <- inner_join(activity_imputed, activity_weekday, by = "weekday")
# Replace value in 'steps.imp' with value in 'daily_steps' if
# 'steps' is NA...
activity_imputed$steps.imp[is.na(activity_imputed$steps)] <-
activity_imputed$daily_steps[is.na(activity_imputed$steps)]
# Checking the result can be done by sampling the dataframe:
# activity_imputed[sample(nrow(activity_imputed), 10), ]
# Rows with steps equaling NA should show 'daily_steps' as
# value in column 'steps.imp'. Other rows should show 0.
# Everything seems in order...
# Replace value in 'steps.imp' with value in 'steps' if
# 'steps' is NOT equal to NA...
activity_imputed$steps.imp[!is.na(activity_imputed$steps)] <-
activity_imputed$steps[!is.na(activity_imputed$steps)]
# Make variable 'steps.imp' a numeric again (if necessary):
activity_imputed$steps.imp <- as.numeric(activity_imputed$steps.imp)
# Clean the dataframe by removing 'daily_steps' column,
# since it's no longer relevant:
activity_imputed$daily_steps <- NULL
# For display purposes we eliminate the column 'weekday':
display_activity_imputed <- activity_imputed
display_activity_imputed$weekday <- NULL
# Final check:
summary(display_activity_imputed)
# No NA's in the 'steps.imp' column!
# NB: values in 'steps.imp' are now fractions and should be
# rounded to the nearest integer before comparing with
# results from non-imputed data.
```
[Element #6 required for a complete submission (“Code to describe and show a strategy for imputing missing data.”)]
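The join-based approach above can be condensed. As an alternative sketch (not the method used in this report), base R's `ave()` applies a function within groups, so each NA can be replaced by the mean of its own weekday in one step. The toy data is made up for illustration:

```r
# Toy data (made up): two weekdays, one missing value in each.
toy <- data.frame(
  weekday = c("Monday", "Monday", "Monday", "Friday", "Friday", "Friday"),
  steps   = c(10, NA, 30, 0, 4, NA)
)

# Replace each NA by the mean of the observed values for the same weekday.
toy$steps.imp <- ave(toy$steps, toy$weekday,
                     FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

toy
```

Observed values pass through unchanged, and the original `steps` column is kept for comparison, as in the report's own `steps.imp` column.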
Has imputing the data changed the characteristics of the dataset? Let's make a histogram:
```{r histogram_imputed}
steps_by_day.imp <- activity_imputed %>%
group_by(date) %>%
summarise (daily_steps = sum(steps.imp))
hist(round(steps_by_day.imp$daily_steps, 0),
main = "Daily Total Steps \n (with imputed data)",
xlab = "Steps / Day")
```
So, on most days the number of steps still lies between 10,000 and 15,000. There's hardly any visible difference when compared to the histogram that was created using the original (non-imputed) dataset. [Element #7 required for a complete submission (“Histogram of the total number of steps taken each day after missing values are imputed.”)]
Are the mean and the median substantially different? Let's calculate those for the imputed dataset and compare the values to those calculated with the original data:
```{r mean_total_steps_imp}
mean_total_steps.imp <- activity_imputed %>%
  group_by(date) %>%
  summarise(daily_steps = sum(steps.imp)) %>%
  # Use mean() here; the median is calculated in the next chunk:
  summarise(mean_total_steps = mean(daily_steps, na.rm = TRUE))
mean_steps_imp <- as.integer(round(mean_total_steps.imp$mean_total_steps, 0))
difference_mean <- mean_steps_imp - mean_steps
perc_diff_mean <- round(difference_mean / mean_steps * 100, 1)
```
The mean total number of steps for the imputed dataset is `r mean_steps_imp`. This is only slightly more than the mean for the original data (`r as.integer(round(mean_steps, 0))`). In relative terms the difference of `r difference_mean` represents a (negligible) difference of only `r perc_diff_mean` percent.
```{r median_total_steps_imp}
median_total_steps.imp <- activity_imputed %>%
group_by(date) %>%
summarise (daily_steps = sum(steps.imp)) %>%
summarise(median_total_steps = median(daily_steps, na.rm = TRUE))
median_steps_imp <- as.integer(round(median_total_steps.imp$median_total_steps, 0))
difference_median <- median_steps_imp - median_steps
perc_diff_median <- round(difference_median / median_steps * 100, 1)
```
The median total number of steps for the imputed dataset is `r median_steps_imp`. This is only slightly more than the median for the original data (`r as.integer(round(median_steps, 0))`). In relative terms the difference of `r difference_median` represents a (negligible) difference of only `r perc_diff_median` percent.
## Are there differences in activity patterns between weekdays and weekends?
Let's make a panel plot so that we can make a visual comparison:
```{r activity_pattern_plots, message=FALSE, warning=FALSE}
# Using 'activity_imputed'...
# Adding 'day_type' (factor) variable.
# First setting 'day_type' to "Weekend":
activity_imputed$day_type[activity_imputed$weekday == "Saturday" |
activity_imputed$weekday == "Sunday"] <- "Weekend"
# Now setting the complement to "Weekday":
activity_imputed$day_type[!(activity_imputed$weekday == "Saturday" |
activity_imputed$weekday == "Sunday")] <- "Weekday"
# Coercing 'day_type' to factor variable:
activity_imputed$day_type <- as.factor(activity_imputed$day_type)
# Checking the result by sampling the data:
# activity_imputed[sample(nrow(activity_imputed), 10), ]
# Everything seems to be in order — only Saturdays / Sundays are
# of type "Weekend".
# What is the average daily activity pattern during the weekends?
steps_by_interval_weekend <- activity_imputed %>%
filter(day_type == "Weekend") %>%
group_by(interval) %>%
summarise (av_steps = mean(steps.imp, na.rm = TRUE))
# What is the average daily activity pattern during weekdays?
steps_by_interval_weekdays <- activity_imputed %>%
filter(day_type == "Weekday") %>%
group_by(interval) %>%
summarise (av_steps = mean(steps.imp, na.rm = TRUE))
# Creating 'steps_by_interval.imp', since we need that for the panel plot:
steps_by_interval.imp <- activity_imputed %>%
  # First group activity by interval and day_type
  # (day_type is needed for plotting)...
  group_by(interval, day_type) %>%
  # Then summarise by averaging steps...
  summarise(av_steps = round(mean(steps.imp, na.rm = TRUE), 0))
# Panelplot? Showing weekend and weekdays side by side...
# Judging by the look of the example plot lattice should be used here.
if (!require(lattice)) install.packages("lattice")
library(lattice)
# This code produces the required plot:
xyplot(av_steps ~ interval | day_type, data = steps_by_interval.imp,
type = "l", layout = c(1,2),
ylab = "Number of steps",
xlab = "Interval",
main = "Activity Pattern")
```
[Element #8 required for a complete submission (“Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends.”)]
There does indeed seem to be a difference in the daily activity when we compare weekends to weekdays.
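One caveat about the classification used for this plot: `weekdays()` returns day names in the current locale, so comparing against "Saturday" and "Sunday" assumes an English locale. A locale-independent sketch (an alternative, not the code used above) relies on the numeric weekday from `as.POSIXlt()`. The dates are chosen for illustration only:

```r
# Example dates (chosen for illustration): 2012-10-05 is a Friday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday numbers the days 0-6 starting on Sunday, independent of locale.
wday <- as.POSIXlt(dates)$wday
day_type <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"))

data.frame(dates, day_type)
```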
This (HTML) document was generated by `knitr` from an R Markdown file. This .Rmd file was produced with RStudio. Knitting the R Markdown file reproduces this report exactly.
[Element #9 required for a complete submission (“All of the R code needed to reproduce the results (numbers, plots, etc.) in the report.”)]