---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
```{r}
library(lubridate)
library(datasets)
library(lattice)
library(dplyr)
raw_data <- read.csv("activity.csv")
raw_data$date <- ymd(raw_data$date)
```
## What is mean total number of steps taken per day?
```{r}
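# Keep only rows with no missing values.
# The mask is not named `filter`, so it does not shadow dplyr::filter().
complete_rows <- complete.cases(raw_data)
data <- raw_data[complete_rows,]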
data_by_date_grouped <- group_by(data, date)
data_by_date <- summarize(data_by_date_grouped, daily_total = sum(steps))
hist(data_by_date$daily_total, main = "Total steps taken each day", xlab = "Number of steps", ylab = "Frequency (Number of days)")
```
The mean number of steps taken per day:
```{r}
mean(data_by_date$daily_total)
```
The median number of steps taken per day:
```{r}
median(data_by_date$daily_total)
```
## What is the average daily activity pattern?
```{r}
data_by_interval_grouped <- group_by(data, interval)
data_by_interval <- summarize(data_by_interval_grouped, interval_average = mean(steps))
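# Plot against the interval identifier, not the row index, so the x-axis matches the data.
plot(data_by_interval$interval, data_by_interval$interval_average, type = "l", main = "Average Daily Activity Pattern", xlab = "Time Interval", ylab = "Average Number of Steps")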
```
The 5-minute interval with the highest average number of steps:
```{r}
data_by_interval$interval[which.max(data_by_interval$interval_average)]
```
## Imputing missing values
Total number of missing values:
```{r}
sum(is.na(raw_data))
```
New dataset with missing values imputed, replacing each NA with the mean of its 5-minute interval:
```{r}
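# Rows whose step count is missing.
data_na <- raw_data[is.na(raw_data$steps),]
# Replace each missing value with the mean of its 5-minute interval.
data_na$steps <- data_by_interval$interval_average[match(data_na$interval, data_by_interval$interval)]
# Combine the complete rows with the imputed rows.
data_imputed <- rbind(data, data_na)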
data_imputed_by_date_grouped <- group_by(data_imputed, date)
data_imputed_by_date <- summarize(data_imputed_by_date_grouped, daily_total = sum(steps))
hist(data_imputed_by_date$daily_total, main = "Total steps taken each day (Imputed)", xlab = "Number of steps", ylab = "Frequency (Number of days)")
```
The mean number of steps taken per day from imputed data:
```{r}
mean(data_imputed_by_date$daily_total)
```
The median number of steps taken per day from imputed data:
```{r}
median(data_imputed_by_date$daily_total)
```
The mean and median of the imputed data differ little from those computed with the NAs dropped, so imputing the missing values has only a small effect on these estimates.
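One way to see why interval-mean imputation can leave the daily mean nearly unchanged is a small sketch on made-up data. It assumes, as a simplification that may not hold for every dataset, that the missing values cover a whole day. The `toy` data frame and every name in this block are hypothetical and are not part of the analysis above.

```r
# Toy data (hypothetical): 3 days x 2 intervals; the third day is entirely missing.
toy <- data.frame(
  steps    = c(10, 30, 20, 40, NA, NA),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Mean steps per interval, ignoring missing values.
interval_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Fill each NA with the mean of its interval (looked up by interval name).
toy_imputed <- toy
missing <- is.na(toy$steps)
toy_imputed$steps[missing] <- interval_means[as.character(toy$interval[missing])]

# Daily totals before imputation (complete days only) and after.
totals_before <- tapply(toy$steps[!missing], toy$date[!missing], sum)
totals_after  <- tapply(toy_imputed$steps, toy_imputed$date, sum)

mean(totals_before)
mean(totals_after)
```

In this toy case the imputed day's total equals the mean of the observed daily totals, so both means come out the same. With partially missing days the two values would generally differ.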
## Are there differences in activity patterns between weekdays and weekends?
```{r}
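# Label every row of the imputed data by the day type of its own date.
# Indexing `data` here instead would mislabel the imputed rows, since `data` has fewer rows.
# weekdays() returns locale-dependent names; this comparison assumes an English locale.
data_imputed$daytype <- ifelse(weekdays(data_imputed$date) %in% c("Saturday", "Sunday"), "Weekend", "Weekday")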
data_imputed_by_interval_grouped <- group_by(data_imputed, interval, daytype)
data_imputed_by_interval <- summarize(data_imputed_by_interval_grouped, interval_average = mean(steps))
xyplot(interval_average ~ interval | daytype, data_imputed_by_interval, type = "l", layout = c(1, 2), xlab = "Time Interval", ylab = "Number of Steps")
```
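The chunk above relies on `weekdays()`, which returns day names in the session's locale, so the comparison with "Saturday" and "Sunday" can fail silently on a non-English system. A minimal sketch of a locale-independent alternative, using the numeric ISO weekday from `format()`; the `dates`, `iso_weekday`, and `daytype` names are illustrative only:

```r
# Hypothetical example dates; 2012-10-06 and 2012-10-07 fall on a weekend.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# "%u" formats the ISO weekday number: "1" (Monday) through "7" (Sunday).
iso_weekday <- as.integer(format(dates, "%u"))

# Days numbered 6 and 7 are Saturday and Sunday.
daytype <- factor(ifelse(iso_weekday >= 6, "Weekend", "Weekday"),
                  levels = c("Weekday", "Weekend"))
daytype
```

Storing the result as a factor also fixes the panel order when it is used as a conditioning variable in `xyplot()`.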