forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
145 lines (103 loc) · 4.9 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
##Q1 - Loading and preprocessing the data
We assume that the file activity.csv is already in the directly.
Read in the CSV file now.
```{r, echo=TRUE}
activity.data <- read.csv("activity.csv",header=TRUE)
```
Clean the data by removing NA values from column step. Data should now be in a format ready for further analysis.
```{r,echo=TRUE}
activity.data.clean <- activity.data[complete.cases(activity.data),]
head(activity.data.clean)
```
##Q2 - What is mean total number of steps taken per day?
Get required data to make histogram of the total number of steps taken each day.
```{r, echo=TRUE}
tempdata <- tapply(activity.data.clean$steps,activity.data.clean$date,sum)
tempdata <- tempdata[!is.na(tempdata)]
```
Now plot the histogram.
```{r,echo=TRUE}
hist(tempdata,ylim=c(0,20),breaks=10, xlab="steps", main="Histogram of the total number of steps taken each day")
```
Calculate and report the mean and median of the total number of steps taken per day.
```{r,echo=TRUE}
meandata <- as.integer(mean(tempdata))
mediandata <- as.integer(median(tempdata))
```
The mean of the number of steps taken per day is `r meandata`.
The median of the number of steps taken per day is `r mediandata`.
##Q3 - What is the average daily activity pattern?
Extract data for average steps per day group by interval type. Then plot the result.
```{r,echo=TRUE}
tempdata <- tapply(activity.data.clean$steps,activity.data.clean$interval,mean)
#Plot the result
plot(tempdata,type="l",xaxt="n",xlab="5-minute Interval",ylab="Average number of steps taken",main="Time Series Plot")
axis(1, at=seq(0,288,by=288/4),labels=c(0,600,1200,1800,2400))
```
Find the 5-minute interval which has the max number of steps.
```{r,echo=TRUE}
m <- max(tempdata)
maxstep <- names(which(tempdata==m))
```
The 5-minute interval which has the max number of steps is `r maxstep`.
##Q4.Imputing Missing Values
Find the total number of missing values in the dataset.
```{r,echo=TRUE}
missingvalue <- nrow(activity.data)-nrow(activity.data.clean)
```
The total number of missing values = `r missingvalue`.
Fill in the missing value using mean value of steps as replacement
```{r,echo=TRUE}
#Work on duplicate data, preserve original.
data <- activity.data
#Create interval mean data frame with interval and mean.steps columns
interval.mean.df <- data.frame(interval=unique(activity.data$interval), mean.steps=tempdata)
#Find missing interval from activity data
missing.interval <- activity.data[is.na(activity.data$steps),"interval"]
#where to pick up the values from interval.mean.df?
indices.in.meandata <- match(missing.interval,interval.mean.df$interval)
#Now write the missing values. data is now the new data set with missing values filled in
data[is.na(data$steps),"steps"] <- interval.mean.df[indices.in.meandata,"mean.steps"]
```
Now make the histogram of total number of steps taken each day with the new data.
```{r,echo=TRUE}
tempdata <- tapply(data$steps,data$date,sum)
hist(tempdata,ylim=c(0,30),breaks=10,xlab="steps",main="Histogram of the total number of steps taken each day")
```
Calculate mean and median total number of steps taken per day.
```{r,echo=TRUE}
meandata.no.missing <- as.integer(mean(tempdata))
mediandata.no.missing <- as.integer(median(tempdata))
delta.mean <- meandata.no.missing - meandata
delta.median <- mediandata.no.missing - mediandata
```
The mean of the number of steps taken per day is `r meandata.no.missing`.
The median of the number of steps taken per day is `r mediandata.no.missing`.
Difference between the 2 means are `r delta.mean`.
Difference between the 2 medians are `r delta.median`.
There is virtually no difference with the first part of the assignment. Recall that in the first part of my assignment I have removed the missing data. Imputing missing values in this case did not make any difference.
##Q5.Are there differences in activity patterns between weekdays and weekends?
```{r,echo=TRUE}
data$date <- as.Date(data$date, "%Y-%m-%d")
data$dayofweek <- weekdays(data$date)
data$dayofweek <- ifelse(data$dayofweek %in% c("Saturday","Sunday"),"Weekend","Weekday")
data$dayofweek <- as.factor(data$dayofweek)
```
```{r,echo=TRUE}
weekend.data <- subset(data,data$dayofweek == "Weekend")
weekend.data <- tapply(weekend.data$steps,weekend.data$interval,mean)
weekend.df <- data.frame(steps=as.integer(names(weekend.data)),mean=weekend.data,dayofweek="Weekend")
weekday.data <- subset(data,data$dayofweek == "Weekday")
weekday.data <- tapply(weekday.data$steps,weekday.data$interval,mean)
weekday.df <- data.frame(steps=as.integer(names(weekday.data)),mean=weekday.data,dayofweek="Weekday")
c.data <- rbind(weekend.df,weekday.df)
c.data$steps <- as.integer(c.data$steps)
#Plot the result
library(ggplot2)
x=seq(0,2400,600)
g<-ggplot(data=c.data,aes(x=steps,y=mean))
g<-g+geom_line(aes(group=dayofweek))+facet_grid(dayofweek~.)
g<-g+scale_x_continuous(breaks=x,labels=as.character(x))
g<-g+ggtitle("Time Series Plot")
g
```