---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
```{r echo = TRUE}
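# Use an English locale so that weekdays() returns English day names later on
# (the locale name is platform-dependent)
Sys.setlocale(category = "LC_ALL", locale = "en_US.utf8")
library(dplyr)
library(lattice)
# Extract and read the raw data (columns: steps, date, interval)
unzip("activity.zip")
activity <- read.csv("activity.csv")
# as_tibble() replaces tbl_df(), which is deprecated in current dplyr
activity_df <- as_tibble(activity)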
```
## What is mean total number of steps taken per day?
```{r steps_perday, echo=TRUE}
# Total steps per day; a day that contains missing values sums to NA
activity_perday <- summarise(group_by(activity_df, date), steps = sum(steps))
barplot(activity_perday$steps, names.arg = activity_perday$date,
        ylab = "Number of Steps",
        main = "Total Number of Steps Taken per Day"
)
```
The mean of the total number of steps per day is **`r mean(activity_perday$steps, na.rm = TRUE)`**, while the median is **`r median(activity_perday$steps, na.rm = TRUE)`**.
## What is the average daily activity pattern?
```{r average_interval, echo = TRUE}
activity_interval <- summarise(group_by(activity_df, interval), steps = mean(steps, na.rm = TRUE))
plot(activity_interval$interval, activity_interval$steps,
     type = "l",
     ylab = "Number of Steps",
     xlab = "Time Intervals",
     main = "Average Daily Steps by Time (5 Mins per Interval)"
)
```
Averaged across all days, the maximum number of steps occurs in interval **`r filter(activity_interval, steps == max(steps))$interval`**, with an average of **`r filter(activity_interval, steps == max(steps))$steps`** steps.
## Imputing missing values
The total number of rows with NAs is **`r sum(!complete.cases(activity_df))`**.
Each missing value is filled with the average number of steps for its 5-minute interval, taken across all days.
```{r echo = TRUE}
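activity_filled <- activity_df
# Look up the mean steps for each row's interval
interval_means <- activity_interval$steps[match(activity_filled$interval,
                                                activity_interval$interval)]
# Replace only the missing values, in one vectorized step
missing_steps <- is.na(activity_filled$steps)
activity_filled$steps[missing_steps] <- interval_means[missing_steps]
rm(interval_means, missing_steps)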
```
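The fill rule can be checked by hand on a small example. The sketch below uses a toy data frame whose values are invented for illustration (they are not taken from `activity.csv`), and the names `toy`, `interval_mean`, and `toy_filled` are hypothetical. It uses base R's `ave()` rather than dplyr, so it is one possible way to express the rule, not necessarily the one used above.

```{r impute_sketch, echo = TRUE}
# Toy data (hypothetical values): three days, two intervals per day
toy <- data.frame(
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, 4, 2, NA, 6, 8)
)

# Mean steps per interval, ignoring missing values, aligned row by row
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(s) mean(s, na.rm = TRUE))

# Replace only the missing entries; observed values are left untouched
toy_filled <- toy
toy_filled$steps <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
toy_filled$steps
```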
Below is a bar plot of the total number of steps per day for the filled dataset.
```{r filled_data,echo = TRUE}
activity_filled_perday <- summarise(group_by(activity_filled, date), steps = sum(steps))
barplot(activity_filled_perday$steps, names.arg = activity_filled_perday$date,
        ylab = "Number of Steps",
        main = "Total Number of Steps Taken per Day (Filled)"
)
```
The mean of the total number of steps per day is now **`r mean(activity_filled_perday$steps, na.rm = TRUE)`**, while the median is **`r median(activity_filled_perday$steps, na.rm = TRUE)`**. Because each missing value is replaced by its interval average, these values should differ little from those of the original dataset.
## Are there differences in activity patterns between weekdays and weekends?
```{r weekend,echo = TRUE}
activity_week <- activity_filled
# Abbreviated day names rely on the English locale set in the first chunk
day_names <- weekdays(as.Date(activity_week$date, format = "%Y-%m-%d"), abbreviate = TRUE)
activity_week$week <- ifelse(day_names %in% c("Sat", "Sun"), "weekend", "weekday")
rm(day_names)
# Average steps per interval, separately for weekdays and weekends
activity_week_interval <- summarise(group_by(activity_week, interval, week), steps = mean(steps))
xyplot(steps ~ interval | week, data = activity_week_interval,
       layout = c(1, 2),
       type = "l",
       ylab = "Number of Steps",
       xlab = "Time Intervals",
       main = "Average Daily Steps by Time (5 Mins per Interval)"
)
```
The chart above suggests that there are indeed some differences in activity patterns between weekdays and weekends.
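
The weekday/weekend split above matches on the English day names `"Sat"` and `"Sun"`, so it depends on the session locale. As a minimal sketch of a locale-independent alternative, the numeric `wday` field of `as.POSIXlt()` (0 = Sunday, ..., 6 = Saturday) can be used instead. The dates and the names `dates`, `day_number`, and `week` below are hypothetical and chosen only for illustration.

```{r weekend_sketch, echo = TRUE}
# Hypothetical dates: a Friday, Saturday, Sunday, and Monday in October 2012
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric (0 = Sunday, ..., 6 = Saturday), so no day names are compared
day_number <- as.POSIXlt(dates)$wday

# Label Saturdays and Sundays as weekend, everything else as weekday
week <- factor(ifelse(day_number %in% c(0, 6), "weekend", "weekday"),
               levels = c("weekday", "weekend"))
data.frame(date = dates, week = week)
```

Storing the label as a factor with explicit levels also fixes the panel order in a conditioned plot such as `xyplot(steps ~ interval | week, ...)`.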