---
title: "PA1_template"
author: "Simon Monk"
date: "April 17, 2016"
output: html_document
---
Load the required R packages.
```{r, results='hide', message = FALSE, warning=FALSE, echo = TRUE}
library(ggplot2)
library(dplyr)
library(sqldf)
```
Set the working directory and read in the pre-processed data set.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
setwd('C:/Users/simon.monk/Documents/Data Science Course/Reproducible Research')
data <- read.csv("activity.csv")
```
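The `setwd()` call above points at a folder that exists only on one machine. As a minimal sketch of an alternative, the path can be built with `file.path()` from a single directory variable. The names `data_dir`, `csv_path` and `activity` are hypothetical, and the toy file written to a temporary directory only stands in for the real `activity.csv`.

```r
# Hypothetical data location; a temporary directory is used here so the
# sketch is self-contained. In practice this would be the project folder.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "activity.csv")

# Write a tiny stand-in file with the three columns the analysis uses.
write.csv(
  data.frame(steps = c(NA, 3), date = "2012-10-01", interval = c(0, 5)),
  csv_path,
  row.names = FALSE
)

# Read it back without changing the working directory.
activity <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Only `data_dir` needs to change when the report is run elsewhere.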
Remove the rows with missing step counts, then rebuild the date factor so that dates whose observations were all NA no longer appear as levels.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
data.RemovedNAs <- data[!is.na(data$steps), ]
data.RemovedNAs$date <- factor(data.RemovedNAs$date)
```
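To see why the date factor is rebuilt, here is a small sketch on made-up data (the `toy` values are hypothetical, not taken from `activity.csv`). Subsetting a factor keeps all of its original levels, and `tapply()` would then report an `NA` total for each date that has no rows left.

```r
# Toy data (hypothetical): the second date has only missing step counts.
toy <- data.frame(
  steps = c(10, 20, NA, NA, 5, 0),
  date  = factor(c("2012-10-01", "2012-10-01",
                   "2012-10-02", "2012-10-02",
                   "2012-10-03", "2012-10-03"))
)

# Keep only the rows with an observed step count.
toy_complete <- toy[!is.na(toy$steps), ]

# Subsetting keeps every original level, even those with no rows left.
levels_before <- levels(toy_complete$date)

# Re-creating the factor drops the levels that are no longer used.
toy_complete$date <- factor(toy_complete$date)
levels_after <- levels(toy_complete$date)
```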
## What is mean total number of steps taken per day?
Calculate the number of steps per day and plot them in a histogram.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
sumByDay <- as.data.frame(tapply(data.RemovedNAs$steps, as.factor(data.RemovedNAs$date), sum))
names(sumByDay) <- c("Steps")
qplot(sumByDay$Steps, geom="histogram", ylab="Number of Days", xlab="Number of Steps", binwidth = 500)
```
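The per-day totals come from `tapply()`, which applies a function to each group of a vector. A minimal sketch on hypothetical toy data shows the shape of the result and one way to keep the dates as an ordinary column; `daily` and `daily_df` are illustrative names only.

```r
# Toy data (hypothetical values).
toy <- data.frame(
  steps = c(10, 20, 5, 0, 7),
  date  = c("2012-10-01", "2012-10-01",
            "2012-10-03", "2012-10-03", "2012-10-03")
)

# tapply() returns a named one-dimensional array: one total per date.
daily <- tapply(toy$steps, toy$date, sum)

# Turn it into a data frame with the date as a regular column.
daily_df <- data.frame(date = names(daily), Steps = as.vector(daily))
```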
Calculate the mean and median total number of steps per day:
```{r, message = FALSE, warning=FALSE, echo = TRUE}
print(mean(sumByDay$Steps))
print(median(sumByDay$Steps))
```
## What is the average daily activity pattern?
Calculate the steps per interval averaged across all days and plot the results.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
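# Mean steps for each 5-minute interval, averaged across all days
meanStepsPerInterval <- tapply(data.RemovedNAs$steps, as.factor(data.RemovedNAs$interval), mean)
meanStepsPerInterval <- as.data.frame(meanStepsPerInterval)
# The row names hold the interval labels; store them as integers for plotting
meanStepsPerInterval$interval <- as.integer(rownames(meanStepsPerInterval))
# Use the full column name rather than relying on partial matching of `$mean`
plot(meanStepsPerInterval$interval, meanStepsPerInterval$meanStepsPerInterval, type = 'l', ylab = "Mean Steps", xlab = "Interval")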
```
Find the interval with the highest average number of steps.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
meanStepsPerInterval[which.max(meanStepsPerInterval$meanStepsPerInterval), ]
```
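As an alternative sketch (not the approach used in this report), `aggregate()` with a formula returns the interval means as a data frame with named columns, so no row-name handling is needed. The toy values and the names `interval_means`, `mean_steps` and `busiest` are hypothetical.

```r
# Toy data (hypothetical): two days, three intervals each.
toy <- data.frame(
  steps    = c(0, 4, 10, 2, 6, 20),
  interval = c(0, 5, 10, 0, 5, 10)
)

# One row per interval, with the mean of steps in the second column.
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)
names(interval_means)[2] <- "mean_steps"

# which.max() gives the row index of the largest mean.
busiest <- interval_means$interval[which.max(interval_means$mean_steps)]
```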
## Imputing missing values
Calculate and report the total number of missing values in the dataset.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
nrow(data[is.na(data$steps), ])
```
For all data points with NA steps, impute the average number of steps for that interval.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
meanStepsPerInterval$interval <- as.integer(meanStepsPerInterval$interval)
data.filledIn <- merge(data, meanStepsPerInterval, by = "interval")
data.filledIn$steps[is.na(data.filledIn$steps)] <- data.filledIn$meanStepsPerInterval[is.na(data.filledIn$steps)]
```
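The merge-then-fill step can be traced on a small example. This is a sketch on hypothetical toy data, using `aggregate()` to build the lookup table and the illustrative column name `mean_steps`. Note that `merge()` sorts the result by the key column by default, so the row order differs from the input.

```r
# Toy data (hypothetical): two intervals, some step counts missing.
toy <- data.frame(
  steps    = c(2, NA, 4, 8, NA, 10),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Per-interval mean of the observed values; the formula interface of
# aggregate() drops rows with NA by default.
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)
names(interval_means)[2] <- "mean_steps"

# Attach the matching interval mean to every row.
filled <- merge(toy, interval_means, by = "interval")

# Replace only the missing step counts with their interval mean.
missing <- is.na(filled$steps)
filled$steps[missing] <- filled$mean_steps[missing]
```

Observed values are left untouched; only the rows flagged in `missing` change.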
Calculate and plot a histogram of the number of steps per day using the newly created dataset.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
sumByDay.filledIn <- as.data.frame(tapply(data.filledIn$steps, as.factor(data.filledIn$date), sum))
names(sumByDay.filledIn) <- c("Steps")
qplot(sumByDay.filledIn$Steps, geom="histogram", ylab="Number of Days", xlab="Number of Steps", binwidth = 500)
```
Calculate the mean and median total number of steps per day:
```{r, message = FALSE, warning=FALSE, echo = TRUE}
print(mean(sumByDay.filledIn$Steps))
print(median(sumByDay.filledIn$Steps))
```
## Are there differences in activity patterns between weekdays and weekends?
Create a factor variable that classifies each date as either a "weekday" or "weekend".
```{r, message = FALSE, warning=FALSE, echo = TRUE}
# Note: weekdays() returns day names in the current locale; the names below assume English
data.filledIn$dow <- weekdays(as.Date(data.filledIn$date))
WeekdayFlags <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
# Use `%in%` to create a logical vector (TRUE for Monday to Friday),
# then convert it to a `factor` with explicit `levels` and `labels`
data.filledIn$dtype <- factor(data.filledIn$dow %in% WeekdayFlags,
                              levels = c(FALSE, TRUE), labels = c('weekend', 'weekday'))
```
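Because `weekdays()` returns names in the session's locale, matching against English day names can fail on a non-English system. A minimal sketch of a locale-independent variant, assuming the same weekend/weekday labels, uses the numeric day of the week from `as.POSIXlt()` (0 is Sunday, 6 is Saturday). The names `dates`, `day_number` and `dtype_toy` are hypothetical.

```r
# Four example dates: a Monday, a Friday, a Saturday and a Sunday.
dates <- as.Date(c("2012-10-01", "2012-10-05", "2012-10-06", "2012-10-07"))

# Numeric day of the week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
day_number <- as.POSIXlt(dates)$wday

# TRUE for Monday to Friday, then labelled as in the chunk above.
dtype_toy <- factor(!(day_number %in% c(0, 6)),
                    levels = c(FALSE, TRUE),
                    labels = c("weekend", "weekday"))
```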
Calculate the steps per interval averaged across day type ("weekday" or "weekend") and plot the results in a time series.
```{r, message = FALSE, warning=FALSE, echo = TRUE}
averages <- aggregate(steps ~ interval + dtype, data=data.filledIn, mean)
ggplot(averages, aes(interval, steps)) + geom_line(colour="red") + facet_grid(dtype ~ .) +
  xlab("Interval (5 minutes)") + ylab("Average Steps")
```
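The formula `steps ~ interval + dtype` tells `aggregate()` to compute one mean for every combination of interval and day type. A small sketch on hypothetical toy data shows the long-format result that `facet_grid()` then splits into panels; `toy_averages` is an illustrative name.

```r
# Toy data (hypothetical values).
toy <- data.frame(
  steps    = c(2, 4, 10, 20, 6, 8),
  interval = c(0, 0, 0, 0, 5, 5),
  dtype    = factor(c("weekday", "weekday", "weekend",
                      "weekend", "weekday", "weekend"),
                    levels = c("weekend", "weekday"))
)

# One row per (interval, dtype) pair, with the mean of steps.
toy_averages <- aggregate(steps ~ interval + dtype, data = toy, FUN = mean)

# Look up a single combination, e.g. interval 0 on weekdays.
weekday_0 <- toy_averages$steps[toy_averages$interval == 0 &
                                toy_averages$dtype == "weekday"]
```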