---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
We load the data and create a clean version of the dataset that keeps only the complete rows of the original (i.e. rows with no NA values). We also load the graphics libraries we are going to use and set the global options.
```{r}
library(knitr)
library(ggplot2)
opts_chunk$set(echo = TRUE, cache = TRUE, cache.path = "cache/", fig.path = "figure/")
# numbers should be shown up to 4 digits precision
options(scipen=1, digits=4)
data=read.csv("activity.csv", header=TRUE)
clean_data=na.omit(data)
nrow(clean_data)
```
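
To make the effect of `na.omit` concrete, here is a minimal sketch on a made-up data frame. The `toy` values are illustrative assumptions, not rows from `activity.csv`, and the block is a plain, non-evaluated listing so it does not touch the variables used in the analysis. It suggests that a row is dropped as soon as any of its columns is NA, and that `complete.cases` selects the same rows.

```r
# toy data frame with made-up values (not from activity.csv)
toy <- data.frame(
  steps    = c(NA, 10, 0, NA),
  date     = c("2012-10-01", "2012-10-02", "2012-10-02", "2012-10-03"),
  interval = c(0, 0, 5, 0)
)

# na.omit drops every row that has at least one NA
toy_clean <- na.omit(toy)

# complete.cases returns a logical vector marking the rows without NA
same_rows <- toy[complete.cases(toy), ]

nrow(toy_clean)                               # 2
identical(toy_clean$steps, same_rows$steps)   # TRUE
```
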
## What is mean total number of steps taken per day?
First, using the clean dataset (with no missing values), we group the data by date and sum the steps for each day.
```{r}
groupped=aggregate(clean_data$steps, by=list(date=clean_data$date), FUN=sum)
colnames(groupped) = c("date","steps")
```
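
As a possible alternative, the same grouping can be sketched with the formula interface of `aggregate`, which names the output columns itself. The `toy` data and the `toy_daily` name below are made up for illustration; note that the formula interface drops rows with NA by default, so it behaves like the clean dataset here.

```r
# toy data with made-up values (not from activity.csv)
toy <- data.frame(
  steps = c(5, 15, 0, 20),
  date  = c("2012-10-02", "2012-10-02", "2012-10-03", "2012-10-03")
)

# the formula interface keeps the column names "date" and "steps",
# so no separate colnames() call is needed
toy_daily <- aggregate(steps ~ date, data = toy, FUN = sum)
toy_daily
```
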
An extract of the `groupped` dataset:
```{r}
head(groupped)
```
Plot a histogram of the total number of steps taken per day:
```{r}
# qplot() is deprecated in recent ggplot2 releases; this is the equivalent ggplot() call
ggplot(groupped, aes(steps)) + geom_histogram(binwidth=500)
```
Calculate the mean and median of the total number of steps taken per day:
```{r}
mean_steps = mean(groupped$steps)
median_steps = median(groupped$steps)
```
**The mean of the total number of steps taken per day is `r mean_steps`.
The median of the total number of steps taken per day is `r median_steps`.**

## What is the average daily activity pattern?
We group the data by interval and, for each interval, take the average number of steps across all days.
```{r}
groupped=aggregate(clean_data$steps, by=list(interval=clean_data$interval), FUN=mean)
colnames(groupped) =c("interval","steps")
```
Now do the actual plot
```{r}
ggplot(groupped, aes(interval, steps))+geom_line(color="blue")+ggtitle("Average number of steps taken by interval")
```
Find which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps:
```{r}
max_interval = groupped[which(groupped$steps==max(groupped$steps)),1]
```
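
The lookup above returns every interval that ties for the maximum. A small sketch with made-up averages (the `toy_avg` values are assumptions for illustration) shows how that differs from `which.max`, which returns only the first maximum:

```r
# toy interval averages with a deliberate tie (made-up values)
toy_avg <- data.frame(
  interval = c(0, 5, 10, 15),
  steps    = c(1.5, 7.0, 3.2, 7.0)
)

# which() with == keeps every interval that reaches the maximum
all_max <- toy_avg$interval[which(toy_avg$steps == max(toy_avg$steps))]

# which.max() keeps only the first interval that reaches the maximum
first_max <- toy_avg$interval[which.max(toy_avg$steps)]

all_max     # 5 15
first_max   # 5
```

With real-valued averages a tie is unlikely, so either form should give a single interval in practice.
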
**Interval `r max_interval` contains the maximum number of steps.**

## Imputing missing values
First, let's calculate the total number of rows with missing values in the dataset:
```{r}
total_missing_rows= nrow(data)-nrow(clean_data)
```
**Overall, we have `r total_missing_rows` rows with NA.**
We will replace each missing value in the number of steps with the median number of steps for the same day.
First, let's calculate the median per day:
```{r}
medians=aggregate(clean_data$steps, by=list(date=clean_data$date), FUN=median)
# name the columns before printing so the extract shows "date" and "steps"
colnames(medians) = c("date","steps")
head(medians)
```
Now replace the NA values with the calculated medians. Note that some days have no non-NA values at all, so there is no entry in `medians` for them. For these days, we set the number of steps to 0.
```{r}
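# iterate over the rows whose steps value is missing
for (i in which(is.na(data$steps))) {
  index_in_medians = which(medians$date == data$date[i])
  if (length(index_in_medians) == 0) {
    # replace with 0, we have no data at all for this date
    data$steps[i] = 0
  } else {
    # replace with the median for this date; assign into data directly
    # so the earlier median_steps value is not overwritten
    data$steps[i] = medians[index_in_medians, 2]
  }
}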
```
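
The same rule (use the daily median, fall back to 0 for days with no data) could also be written without a loop. The following is only a sketch on made-up data; `toy`, `toy_medians`, `fill` and `missing` are hypothetical names, and `match` is used to look up each row's daily median.

```r
# toy data with made-up values (not from activity.csv)
toy <- data.frame(
  steps = c(NA, 4, 8, NA, NA),
  date  = c("2012-10-02", "2012-10-02", "2012-10-02", "2012-10-03", "2012-10-03"),
  stringsAsFactors = FALSE
)

# per-day medians; the formula interface ignores rows with NA,
# so a day with no data at all gets no entry
toy_medians <- aggregate(steps ~ date, data = toy, FUN = median)

# look up each row's daily median; days without an entry give NA
fill <- toy_medians$steps[match(toy$date, toy_medians$date)]
fill[is.na(fill)] <- 0   # fallback: 0 steps for days with no data

# overwrite only the rows that were missing
missing <- is.na(toy$steps)
toy$steps[missing] <- fill[missing]
toy$steps   # 6 4 8 0 0
```
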
Let's make sure we have no NA values left:
```{r}
sum(is.na(data))
```
All good!
Using the imputed dataset, we group the data by date again:
```{r}
groupped=aggregate(data$steps, by=list(date=data$date), FUN=sum)
colnames(groupped) = c("date","steps")
```
Plot a histogram of the total number of steps taken per day:
```{r}
# qplot() is deprecated in recent ggplot2 releases; this is the equivalent ggplot() call
ggplot(groupped, aes(steps)) + geom_histogram(binwidth=500)
```
Calculate the mean and median of the total number of steps taken per day:
```{r}
mean_steps_imputed = mean(groupped$steps)
median_steps_imputed = median(groupped$steps)
```
**After imputing, the mean of the total number of steps taken per day is `r mean_steps_imputed` (it was `r mean_steps` before).
The median of the total number of steps taken per day is `r median_steps_imputed` (it was `r median_steps` before).**
Imputing reduced the estimates of both the mean and the median.

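
One plausible reason for the drop is that days filled in with 0 steps enter the daily totals as zero-step days. A small arithmetic sketch with made-up totals (not values from the dataset) illustrates the direction of the effect:

```r
# made-up daily totals (not from activity.csv)
totals <- c(9000, 11000, 13000)

# the same totals plus one day that was filled in with 0 steps
totals_with_zero_day <- c(totals, 0)

mean(totals)                   # 11000
mean(totals_with_zero_day)     # 8250
median(totals)                 # 11000
median(totals_with_zero_day)   # 10000
```
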
## Are there differences in activity patterns between weekdays and weekends?
First, we create a column that indicates whether each day falls on a weekend or a weekday:
```{r}
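# store dates as Date rather than POSIXlt, which is a list and awkward as a data frame column
data$date=as.Date(data$date)
# weekdays() returns day names in the session locale; this comparison assumes English names
data$day =ifelse(weekdays(data$date) == "Sunday" | weekdays(data$date) =="Saturday","Weekend", "Weekday")
data$day=as.factor(data$day)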
```
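
Because `weekdays` depends on the session locale, a locale-independent variant may be worth considering. The sketch below is an assumption about how that could look, not the method used in this report: it relies on the `wday` field of `POSIXlt`, which counts days since Sunday, and checks it on a few known dates.

```r
# a few known dates: 2012-10-05 is a Friday, 2012-10-06 a Saturday,
# 2012-10-07 a Sunday and 2012-10-08 a Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday counts days since Sunday (0 = Sunday, ..., 6 = Saturday)
wday <- as.POSIXlt(dates)$wday

day_type <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"))
as.character(day_type)   # "Weekday" "Weekend" "Weekend" "Weekday"
```
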
Let's aggregate by interval and day type, averaging the steps in each group:
```{r}
groupped=aggregate(data$steps, by=list(interval=data$interval, day=data$day), FUN=mean)
colnames(groupped) =c("interval","day","steps")
```
Create plots for weekends and weekdays:
```{r}
ggplot(groupped, aes(interval, steps, color=day))+geom_line()+geom_smooth(method="loess")+ggtitle("Average number of steps taken by interval")+facet_grid(day ~ .)
```
It seems that the time series are indeed different for weekends and weekdays: weekday activity starts earlier (around interval 500), its peak is higher (around 200 steps vs. around 150 steps), and the smoothed curve is somewhat different.