Zelos Zhu 10/22/2018
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(wordcountaddin)
library(patchwork)
wordcountaddin::text_stats("p8105_mtp_zdz2101.Rmd")
Method | koRpus | stringi |
---|---|---|
Word count | 498 | 471 |
Character count | 2900 | 2899 |
Sentence count | 29 | Not available |
Reading time | 2.5 minutes | 2.4 minutes |
accelero_data <- read_csv("data/p8105_mtp_data.csv") %>%
gather(., key = "activity", value = "activity_value", 3:1442) %>%
mutate(day = factor(day, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")),
day_id = (7 * week) - (7 - as.numeric(day)),
minute = as.numeric(str_replace_all(activity, "activity.","")),
hour = floor((minute - 1)/60)) %>%
select(day_id, week, day, hour, activity, minute, activity_value) %>%
arrange(week,day)
sample_n(accelero_data, 10) #show day_id works
## # A tibble: 10 x 7
## day_id week day hour activity minute activity_value
## <dbl> <int> <fct> <dbl> <chr> <dbl> <dbl>
## 1 70 10 Saturday 16 activity.1015 1015 580.
## 2 234 34 Tuesday 17 activity.1063 1063 364.
## 3 172 25 Wednesday 23 activity.1395 1395 1
## 4 175 25 Saturday 2 activity.141 141 1
## 5 133 19 Saturday 17 activity.1060 1060 1
## 6 87 13 Tuesday 4 activity.275 275 1
## 7 31 5 Tuesday 9 activity.543 543 238.
## 8 69 10 Friday 21 activity.1296 1296 1
## 9 183 27 Sunday 11 activity.703 703 542.
## 10 185 27 Tuesday 5 activity.322 322 150.
#day_id is just a linear transform of week and day, sunday week 1 is day 1, sunday week 2 is day 8, etc.
#hour is made by minute: 1-60 are midnight, 61-120 are 1am, etc. I used another clever trick to transform hour to be 0-23
This dataset has 473760 observations and 7 variables. The number of observations is expected as there is 1 reading/minute, 60 minutes/hour, 24 hours/day, 7 days/week, and 47 weeks worth of accelerometer readings (product of these 4 numbers gets you 473760). Variables are:
- day_id: day of study, assuming week starts on sunday
- week: week of study
- day: weekday, recoded to factor
- hour: hour of day (military time)
- activity: essentially is the column names of the original file, used as "key" for the gather function, left here to "preserve original observations"
- minute: minute of day
- activity_value: accelerometer activity reading
day_aggregate_df <- accelero_data %>%
group_by(day_id, week, day) %>%
summarize(total_activity = sum(activity_value),
mean = mean(activity_value),
min = quantile(activity_value, 0),
q1 = quantile(activity_value, 0.25),
median = quantile(activity_value, 0.50),
q3 = quantile(activity_value, 0.75),
max = quantile(activity_value, 1)
)
unreasonable_days <- day_aggregate_df$day_id[which(day_aggregate_df$total_activity == 1440)]
day_aggregate_plot <- day_aggregate_df %>%
filter(total_activity != 1440) %>% #---------remove days w/ invalid readings
ggplot(., aes(x = day_id, y = total_activity)) +
geom_point(aes(color = day), alpha = 0.5) +
geom_smooth() +
ylab("Total Activity") +
xlab("Day of Study") +
theme(legend.position = "bottom") +
ggtitle("Total activity of each day") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
legend.title=element_text(size=14),
legend.text=element_text(size=14))
by_day_plot <- day_aggregate_df %>%
filter(total_activity != 1440) %>%
ggplot(., aes(x = day_id, y = total_activity, color = day)) +
geom_point(alpha = 0.5) +
geom_smooth() +
ylab("Total Activity") +
xlab("Day of Study") +
theme(legend.position = "none") +
ggtitle("Total activity measure by weekdays") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
facet_wrap(~day, nrow = 2)
day_aggregate_plot + by_day_plot
Plotting total activity by day, as shown on the left, we see an upward trend in activity throughout the course of the study. It dips around day 150, but the trend overall throughout 47 weeks is still positive. Based on the figure on the right, the patient's activity still increases over time regardless of what day it is suggesting weekday does not play a role in our findings. It could be argued the patient's activity on Thursdays looks stagnant.
I discovered 18 days: day_id's 7, 8, 15, 18, 19, 20, 23, 25, 26, 27, 55, 78, 83, 84, 86, 133, 140, 219 where total activity is "1440" meaning there was no activity over the entirety of those days. I suspect the patient was not wearing the accelerometer or accelerometer malfunction, so I excluded these days from figures.
hour_aggregate <- accelero_data %>%
filter(!day_id %in% unreasonable_days) %>%
group_by(day_id, week, day, hour) %>%
summarize(hourly_activity_total = sum(activity_value),
hourly_activity_mean = mean(activity_value)) %>%
mutate(sleep = ifelse(hour %in% c(0:6,23), "asleep", "awake"))
hourly_profiles <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
geom_boxplot() +
xlab("Hour of the Day") +
ylab("Total Activity") +
ggtitle("Total activity based on hour-by-hour profiles") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14))
hourly_profiles_by_day <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
geom_boxplot() +
xlab("Hour of the Day") +
ylab("Total Activity") +
ggtitle("Hour-by-hour profiles by day") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
facet_wrap(~day, nrow = 2)
hourly_profiles/hourly_profiles_by_day
It seems logical to get a sense of when our patient is awake and doing things. Assuming 8 hours of sleep, I made the 8 hours with the lowest median total activity readings a different color to signify "asleep" hours. In general, he starts his day around 7-8am, ramps up activity which stabilizes from 10am until 8pm, then "winds down" his day and probably asleep by 11pm-12am. This routine does not seem to change too much by day of the week either though there is a handful of Friday nights and Sunday afternoons where he is clocking in some high activity.
top_one_percent <- quantile(accelero_data$activity_value, 0.99)
peaks_df<- accelero_data %>%
filter(activity_value > top_one_percent) %>%
select(day_id, hour, minute, activity_value) %>%
group_by(day_id) %>%
arrange(day_id, minute)
peak_times_plot <- ggplot(peaks_df, aes(x = minute, y = activity_value, color = as.factor(day_id))) +
geom_line() +
scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) +
xlab("Hour of Day") +
ylab("Activity Value") +
ggtitle("Spaghetti plot of Top 1% of Accelerometer Readings") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
geom_hline(yintercept = 2500, color = "black", linetype = "dashed") +
theme(legend.position="none")
exercise_df <- peaks_df %>%
filter(activity_value > 2500) %>%
mutate(study = ifelse(day_id > 165, "Second Half", "First Half"))
study_half_counts <- exercise_df %>%
group_by(study) %>%
count()
exercise_hyp_plot <- ggplot(exercise_df, aes(x = minute, y = activity_value, color = study)) +
geom_jitter(alpha = 0.5) +
scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) +
xlab("Hour of Day") +
ylab("Activity Value") +
ggtitle("Scatter plot of top readings based on two periods") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
facet_grid(~study)
peak_times_plot/exercise_hyp_plot
Attempting to track exercise, I assumed the highest 1% of activity is during exercise. It was clear from the top figure there was still "noise" below the activity value ~2500; each "spaghetti" line was meant to distinguish a particular day. So I filtered for readings above 2500 (arbitrary, easily adjustable). The bottom panels show scatter plots of activity against time of day where the left is data from the first half of the study, days 1 - 165, and the right is data from the second half of the study, days 165 - 329.
I suspect the patient likes to exercise during three periods: 11-1pm, 3-6pm or around 8pm, the 3-6pm slot becoming much more popular as time passed. The second half plot has much more points, generally higher too, than the first which may suggest the patient is exercising more frequently and intensely over time.