Skip to content

Latest commit

 

History

History
230 lines (189 loc) · 11 KB

p8105_mtp_zdz2101.md

File metadata and controls

230 lines (189 loc) · 11 KB

p8105_mtp_zdz2101

Zelos Zhu 10/22/2018

Load Packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readr)
library(wordcountaddin)
library(patchwork)
wordcountaddin::text_stats("p8105_mtp_zdz2101.Rmd")
Method koRpus stringi
Word count 498 471
Character count 2900 2899
Sentence count 29 Not available
Reading time 2.5 minutes 2.4 minutes

Read Data/Tidying

accelero_data <- read_csv("data/p8105_mtp_data.csv") %>%
  gather(., key = "activity", value = "activity_value", 3:1442) %>%
  mutate(day = factor(day, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")),
         day_id = (7 * week) - (7 - as.numeric(day)),
         minute = as.numeric(str_replace_all(activity, "activity.","")),
         hour = floor((minute - 1)/60)) %>%
  select(day_id, week, day, hour, activity, minute, activity_value) %>%
  arrange(week,day)

sample_n(accelero_data, 10) #show day_id works
## # A tibble: 10 x 7
##    day_id  week day        hour activity      minute activity_value
##     <dbl> <int> <fct>     <dbl> <chr>          <dbl>          <dbl>
##  1     70    10 Saturday     16 activity.1015   1015           580.
##  2    234    34 Tuesday      17 activity.1063   1063           364.
##  3    172    25 Wednesday    23 activity.1395   1395             1 
##  4    175    25 Saturday      2 activity.141     141             1 
##  5    133    19 Saturday     17 activity.1060   1060             1 
##  6     87    13 Tuesday       4 activity.275     275             1 
##  7     31     5 Tuesday       9 activity.543     543           238.
##  8     69    10 Friday       21 activity.1296   1296             1 
##  9    183    27 Sunday       11 activity.703     703           542.
## 10    185    27 Tuesday       5 activity.322     322           150.
#day_id is just a linear transform of week and day, sunday week 1 is day 1, sunday week 2 is day 8, etc.
#hour is made by minute: 1-60 are midnight, 61-120 are 1am, etc. I used another clever trick to transform hour to be 0-23

This dataset has 473760 observations and 7 variables. The number of observations is expected as there is 1 reading/minute, 60 minutes/hour, 24 hours/day, 7 days/week, and 47 weeks worth of accelerometer readings (product of these 4 numbers gets you 473760). Variables are:

  1. day_id: day of study, assuming week starts on sunday
  2. week: week of study
  3. day: weekday, recoded to factor
  4. hour: hour of day (military time)
  5. activity: essentially is the column names of the original file, used as "key" for the gather function, left here to "preserve original observations"
  6. minute: minute of day
  7. activity_value: accelerometer activity reading

Aggregate data by day and discover some trends

day_aggregate_df <- accelero_data %>%
  group_by(day_id, week, day) %>%
  summarize(total_activity = sum(activity_value),
            mean = mean(activity_value),
            min = quantile(activity_value, 0),
            q1 = quantile(activity_value, 0.25),
            median = quantile(activity_value, 0.50),
            q3 = quantile(activity_value, 0.75),
            max = quantile(activity_value, 1)
            )
  
unreasonable_days <- day_aggregate_df$day_id[which(day_aggregate_df$total_activity == 1440)]

day_aggregate_plot <- day_aggregate_df %>%
  filter(total_activity != 1440) %>% #---------remove days w/ invalid readings
  ggplot(., aes(x = day_id, y = total_activity)) + 
  geom_point(aes(color = day), alpha = 0.5) + 
  geom_smooth() +
  ylab("Total Activity") +
  xlab("Day of Study") + 
  theme(legend.position = "bottom") +
  ggtitle("Total activity of each day") + 
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        legend.title=element_text(size=14), 
        legend.text=element_text(size=14))

by_day_plot <- day_aggregate_df %>%
  filter(total_activity != 1440) %>%
  ggplot(., aes(x = day_id, y = total_activity, color = day)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth() + 
  ylab("Total Activity") +
  xlab("Day of Study") + 
  theme(legend.position = "none") +
  ggtitle("Total activity measure by weekdays") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  facet_wrap(~day, nrow = 2)

day_aggregate_plot + by_day_plot

Plotting total activity by day, as shown on the left, we see an upward trend in activity throughout the course of the study. It dips around day 150, but the trend overall throughout 47 weeks is still positive. Based on the figure on the right, the patient's activity still increases over time regardless of what day it is suggesting weekday does not play a role in our findings. It could be argued the patient's activity on Thursdays looks stagnant.

I discovered 18 days: day_id's 7, 8, 15, 18, 19, 20, 23, 25, 26, 27, 55, 78, 83, 84, 86, 133, 140, 219 where total activity is "1440" meaning there was no activity over the entirety of those days. I suspect the patient was not wearing the accelerometer or accelerometer malfunction, so I excluded these days from figures.

Aggregate by hour/24 hour profiles

hour_aggregate <- accelero_data %>%
  filter(!day_id %in% unreasonable_days) %>%
  group_by(day_id, week, day, hour) %>%
  summarize(hourly_activity_total = sum(activity_value),
            hourly_activity_mean = mean(activity_value)) %>%
  mutate(sleep = ifelse(hour %in% c(0:6,23), "asleep", "awake"))

hourly_profiles <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
  geom_boxplot() + 
  xlab("Hour of the Day") + 
  ylab("Total Activity") +
  ggtitle("Total activity based on hour-by-hour profiles") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14))

hourly_profiles_by_day <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
  geom_boxplot() + 
  xlab("Hour of the Day") + 
  ylab("Total Activity") +
  ggtitle("Hour-by-hour profiles by day") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  facet_wrap(~day, nrow = 2)

hourly_profiles/hourly_profiles_by_day

It seems logical to get a sense of when our patient is awake and doing things. Assuming 8 hours of sleep, I made the 8 hours with the lowest median total activity readings a different color to signify "asleep" hours. In general, he starts his day around 7-8am, ramps up activity which stabilizes from 10am until 8pm, then "winds down" his day and probably asleep by 11pm-12am. This routine does not seem to change too much by day of the week either though there is a handful of Friday nights and Sunday afternoons where he is clocking in some high activity.

Additional Exploratory Stuff

top_one_percent <- quantile(accelero_data$activity_value, 0.99)

peaks_df<- accelero_data %>%
  filter(activity_value > top_one_percent) %>%
  select(day_id, hour, minute, activity_value) %>%
  group_by(day_id) %>%
  arrange(day_id, minute)

peak_times_plot <- ggplot(peaks_df, aes(x = minute, y = activity_value, color = as.factor(day_id))) +
  geom_line() + 
  scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) + 
  xlab("Hour of Day") +
  ylab("Activity Value") +
  ggtitle("Spaghetti plot of Top 1% of Accelerometer Readings") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  geom_hline(yintercept =  2500, color = "black", linetype = "dashed") +
  theme(legend.position="none")

exercise_df <- peaks_df %>%
  filter(activity_value > 2500) %>%
  mutate(study = ifelse(day_id > 165, "Second Half", "First Half"))

study_half_counts <- exercise_df %>%
  group_by(study) %>%
  count()

exercise_hyp_plot <- ggplot(exercise_df, aes(x = minute, y = activity_value, color = study)) + 
  geom_jitter(alpha = 0.5) + 
  scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) + 
  xlab("Hour of Day") +
  ylab("Activity Value") + 
  ggtitle("Scatter plot of top readings based on two periods") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  facet_grid(~study)

peak_times_plot/exercise_hyp_plot

Attempting to track exercise, I assumed the highest 1% of activity is during exercise. It was clear from the top figure there was still "noise" below the activity value ~2500; each "spaghetti" line was meant to distinguish a particular day. So I filtered for readings above 2500 (arbitrary, easily adjustable). The bottom panels show scatter plots of activity against time of day where the left is data from the first half of the study, days 1 - 165, and the right is data from the second half of the study, days 165 - 329.

I suspect the patient likes to exercise during three periods: 11-1pm, 3-6pm or around 8pm, the 3-6pm slot becoming much more popular as time passed. The second half plot has much more points, generally higher too, than the first which may suggest the patient is exercising more frequently and intensely over time.