---
title: "Bellabeat Marketing Fitness Tracker Usage Case Study"
author: "Eric Chan"
date: "2022-03-02"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.path = "figure/fig-")
```
# Case background
This is the capstone project of the Google Data Analytics Professional Certificate.
Bellabeat, a health-focused company for women, is looking to enter the global smart device market. The founder believes that analysing smart device fitness data could unlock new growth opportunities.
https://bellabeat.com
I have been asked to analyse smart device data to gain insight into how consumers use their devices, and to provide high-level recommendations for Bellabeat’s marketing strategy.
# 1. Ask
## Business Tasks
- Find the average exercise level (MET minutes) and exercise duration per week. Suggest a marketing persona for content marketing.
- Find the peak exercise times during the day and during the week. Suggest timing for social media marketing posts and premium membership events.
- Find the sleep efficiency of the group and its correlations with other factors. Suggest types of content and events the marketing team could provide to increase customer loyalty and conversion to premium membership.
## Key stakeholders
- Urška Sršen, Co-founder
- Sando Mur, Co-founder
- Bellabeat marketing analytics team
# 2. Prepare
## Description of data source used
The dataset used is the FitBit Fitness Tracker Data (CC0: Public Domain), made available on Kaggle by the user Möbius.
- Reliability -- Second-party data, collected by Kaggle user Möbius, who is a data scientist.
- Original -- Data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 Mar 2016 and 12 May 2016. The CSV files in the dataset have been transformed and merged.
- Comprehensive -- Includes minute-level output for physical activity, heart rate, and sleep monitoring.
- Not current -- Data was collected in 2016. Usage patterns may have changed since then, e.g. with the rise of hybrid work.
- Cited -- Usability score on Kaggle is 10.
Dataset URL
https://www.kaggle.com/arashnic/fitbit
Metadata PDF - Data descriptions
https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf
Fitbit - How activity, steps, distance etc are collected
https://help.fitbit.com/articles/en_US/Help_article/1141.htm
Fitbit - How sleeps are tracked
https://help.fitbit.com/articles/en_US/Help_article/1314.htm
## Import data
```{r}
# Install packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
install.packages("lubridate", repos = "http://cran.us.r-project.org")
install.packages("hms", repos = "http://cran.us.r-project.org")
install.packages("skimr", repos = "http://cran.us.r-project.org")
install.packages("janitor", repos = "http://cran.us.r-project.org")
install.packages("scales", repos = "http://cran.us.r-project.org")
install.packages("multimode", repos = "http://cran.us.r-project.org")
# Load packages
library(tidyverse)
library(lubridate)
library(hms)
library(skimr)
library(janitor)
library(scales)
library(multimode)
# Install ggthemr package
# Apply to all plots
library(devtools)
devtools::install_github('Mikata-Project/ggthemr')
library(ggthemr)
ggthemr('fresh')
```
```{r}
# Import datasets
dailyActivity_raw <- read_csv("input/dailyActivity_merged.csv")
minuteMET_raw <- read_csv("input/minuteMETsNarrow_merged.csv")
dailySleep_raw <- read_csv("input/sleepDay_merged.csv")
hourlyIntensity_raw <- read_csv("input/hourlyIntensities_merged.csv")
dailyCalories_raw <- read_csv("input/dailyCalories_merged.csv")
weight_raw <- read_csv("input/weightLogInfo_merged.csv")
```
## Explore datasets
```{r}
head(dailyActivity_raw)
head(minuteMET_raw)
head(dailySleep_raw)
head(hourlyIntensity_raw)
head(weight_raw)
```
```{r}
# Number of participants in the datasets
n_distinct(dailyActivity_raw$Id)
n_distinct(minuteMET_raw$Id)
n_distinct(dailySleep_raw$Id)
n_distinct(hourlyIntensity_raw$Id)
n_distinct(weight_raw$Id)
```
There are 33 respondents in the activity datasets.
Only 24 respondents appear in the sleep dataset and 8 in the weight dataset. Both are below the common rule-of-thumb minimum of 30 (based on the Central Limit Theorem), so they may not be representative.
As a preliminary step, I will look at the sleep dataset to see if there are any patterns. A larger sample is needed before drawing any conclusions.
I will drop the weight dataset since the sample size is too small to be representative.
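
The participant counts above can also be tabulated in one step. The following is a minimal sketch on toy data frames, not the real CSV files; `toy_datasets` and `min_participants` are hypothetical names, and the threshold is lowered to 3 only so that the toy data trips it.

```{r}
# Toy stand-ins for the imported datasets (hypothetical values, not the real data)
toy_datasets <- list(
  activity = data.frame(Id = c(1, 1, 2, 3, 3, 4)),
  weight   = data.frame(Id = c(1, 1, 2))
)

# Count distinct participants per dataset
toy_participants <- sapply(toy_datasets, function(df) length(unique(df$Id)))

# Flag datasets below an assumed minimum sample size
min_participants <- 3  # illustrative threshold; the text above uses 30
toy_small <- names(toy_participants)[toy_participants < min_participants]
toy_participants
toy_small
```

With the real data, the same pattern could be applied to a named list of the `*_raw` data frames.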
# 3. Process
## Clean datasets
### Check the data for errors
```{r}
# Check for duplicated observations
sum(duplicated(dailyActivity_raw))
sum(duplicated(minuteMET_raw))
sum(duplicated(dailySleep_raw))
sum(duplicated(hourlyIntensity_raw))
```
```{r}
# Check for missing values
sum(is.na(dailyActivity_raw))
sum(is.na(minuteMET_raw))
sum(is.na(dailySleep_raw))
sum(is.na(hourlyIntensity_raw))
```
```{r}
# Remove duplicated observations
dailySleep <- dailySleep_raw %>%
unique()
sum(duplicated(dailySleep))
```
## Transform datasets
### Fix date and time formats
```{r}
# Fix datetime format
dailyActivity <- dailyActivity_raw
dailyActivity$ActivityDate <- mdy(dailyActivity$ActivityDate)
# Add weekday column
dailyActivity <- dailyActivity %>%
mutate(weekday = wday(ActivityDate, label = TRUE, abbr = TRUE))
# Fix datetime format
minuteMET <- minuteMET_raw
minuteMET$ActivityMinute <- mdy_hms(minuteMET$ActivityMinute)
# Split into date and time-of-day columns, keeping the full datetime
# (derived directly from the parsed datetime rather than by splitting text)
minuteMET <- minuteMET %>%
  mutate(datetime = ActivityMinute,
         date = as_date(ActivityMinute),
         time = as_hms(ActivityMinute)) %>%
  select(-ActivityMinute)
# Start from the de-duplicated sleep data
# Fix datetime format, remove blank time, to date
dailySleep <- dailySleep_raw %>%
  distinct()
dailySleep$SleepDay <- mdy_hms(dailySleep$SleepDay)
dailySleep$SleepDay <- as_date(dailySleep$SleepDay)
# Fix datetime format
hourlyIntensity <- hourlyIntensity_raw
hourlyIntensity$ActivityHour <- mdy_hms(hourlyIntensity$ActivityHour)
# Add weekday
# Sum intensity across the whole sample for each hour
# Add time of day
hourlyIntensity <- hourlyIntensity %>%
  mutate(datetime = ActivityHour,
         weekday = wday(ActivityHour, label = TRUE, abbr = TRUE)) %>%
  group_by(datetime, weekday) %>%
  summarise(total_int = sum(TotalIntensity), .groups = "drop") %>%
  mutate(time = as_hms(datetime))
```
### Exercise distance
The NHS recommends that adults do at least 150 minutes of moderate exercise, or 75 minutes of vigorous exercise, a week.
Exercise distance is the moderate exercise distance plus the vigorous exercise distance. Exercise distance and exercise minutes are used to find correlations with other factors.
Since 75 minutes of vigorous exercise is treated as equivalent to 150 minutes of moderate exercise, adjusted exercise minutes (2 * vigorous minutes + 1 * moderate minutes) are used for comparison.
https://www.nhs.uk/live-well/exercise/
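
As a quick worked example of the adjustment described above, the sketch below applies the 2:1 weighting to made-up minute counts. `adjusted_minutes` is a hypothetical helper used only for illustration.

```{r}
# Adjusted exercise minutes: vigorous minutes count double
adjusted_minutes <- function(vigorous_min, moderate_min) {
  2 * vigorous_min + moderate_min
}

# 75 vigorous minutes alone are equivalent to the 150-minute guideline
adjusted_minutes(75, 0)

# A mixed week (hypothetical): 40 vigorous + 60 moderate minutes
adjusted_minutes(40, 60)
```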
```{r}
# Exclude columns that won't be used
# Calculate distance during exercise
# Calculate device usage minutes
# Calculate adjusted exercise minutes
dailyActivity <- dailyActivity %>%
select(-TrackerDistance, -LoggedActivitiesDistance) %>%
mutate(exercise_distance = VeryActiveDistance + ModeratelyActiveDistance) %>%
mutate(total_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes) %>%
mutate(adjusted_exercise_minutes = VeryActiveMinutes * 2 + FairlyActiveMinutes)
# Remove row if exercise distance is zero
# Remove row if adjusted exercise minutes is zero
dailyActivity <- dailyActivity %>%
filter(exercise_distance != 0) %>%
filter(adjusted_exercise_minutes != 0)
```
### MET minute
Moderate exercise corresponds to 3 to 6 METs; vigorous exercise to more than 6 METs.
Any minute recorded above 3 METs is counted as exercise.
Since the number of days each respondent participated in the survey varies, average weekly exercise MET is calculated as total daily exercise MET / number of days participated * 7.
https://www.hsph.harvard.edu/nutritionsource/staying-active/
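
The weekly scaling can be checked on a small made-up series. This is a sketch under the assumption that every recorded day counts as a day participated, including days with no exercise; `toy_daily_met` holds hypothetical values.

```{r}
# Hypothetical daily exercise MET minutes for one participant over 10 recorded days
toy_daily_met <- c(300, 0, 450, 520, 0, 610, 280, 390, 0, 450)

# Scale the per-day average up to a 7-day week
toy_weekly_met <- sum(toy_daily_met) / length(toy_daily_met) * 7
toy_weekly_met
```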
```{r}
# Find total MET during exercise in a day
# Correct to actual MET value according to metadata PDF, prepare for calculations
minuteMET$METs <- minuteMET$METs / 10
dailyMET <- minuteMET
dailyMET <- dailyMET %>%
group_by(Id, date) %>%
filter(METs > 3) %>%
summarise(exerciseMET = sum(METs))
# Add weekday
dailyMET$weekday <- wday(dailyMET$date, label = TRUE, abbr = TRUE)
# Identify duration of exercise
minute_exerciseMET <- minuteMET
minute_exerciseMET <- minute_exerciseMET %>%
filter(METs > 3)
# Categorise exercise level
minute_exerciseMET <- minute_exerciseMET %>%
mutate(
exerciseLevel = case_when(
METs <= 6 ~ "moderate",
METs > 6 ~ "vigorous"
)
)
# Find user average weekly MET
weeklyMET <- dailyMET
weeklyMET <- weeklyMET %>%
group_by(Id) %>%
summarise(num_days = n_distinct(date), avg_weekly_MET = sum(exerciseMET) / num_days * 7 )
```
### Daily intensity
```{r}
# Fix datetime format
dailyIntensity <- hourlyIntensity_raw
dailyIntensity$ActivityHour <- mdy_hms(dailyIntensity$ActivityHour)
# Sum hourly intensity into a daily total per participant
dailyIntensity <- dailyIntensity %>%
  mutate(date = as_date(ActivityHour)) %>%
  group_by(Id, date) %>%
  summarise(total_int = sum(TotalIntensity), .groups = "drop")
```
### Fitness device usage duration
Device usage time is the time participants spent wearing the device, taken as the total activity minutes recorded in a day.
Usage of at least 0.75 of a day counts as high; at least half a day as medium; at least 0.25 of a day as low; and less than 0.25 of a day as very low.
```{r}
# Find device usage minutes
# Identify usage level
usage <- dailyActivity %>%
select(Id, total_minutes) %>%
group_by(Id) %>%
summarise(usage_day = mean(total_minutes) / 1440) %>%
mutate(usage_level = case_when(
usage_day >= 0.75 ~ "high usage",
usage_day >= 0.5 ~ "medium usage",
usage_day >= 0.25 ~ "low usage",
usage_day >= 0 ~ "very low usage"
)
)
```
### Sleep efficiency
Sleep efficiency is a way to measure the quality of sleep. It is calculated as total minutes asleep / total time in bed.
A sleep efficiency over 85% is considered good sleep.
```{r}
# Find sleep efficiency
dailySleep <- dailySleep %>%
mutate(sleepEfficiency = TotalMinutesAsleep / TotalTimeInBed)
# Identify good or bad sleep
dailySleep <- dailySleep %>%
mutate(good_bad_sleep = case_when(
sleepEfficiency <= 0.85 ~ "bad sleep",
TRUE ~ "good sleep"
)
)
```
### Merge dataframes
```{r}
# Merge daily dataframes
dailyActivity <- dailyActivity %>%
rename(date = ActivityDate)
dailySleep <- dailySleep %>%
rename(date = SleepDay)
daily_merged <- merge(dailyActivity, dailyMET, by = c("Id" , "date"))
daily_merged <- merge(daily_merged, dailySleep, by = c("Id", "date"))
daily_merged <- merge(daily_merged, dailyIntensity, by = c("Id", "date"))
```
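Note that `merge()` performs an inner join by default, so any day without a matching record in every table is dropped from `daily_merged`. The sketch below shows the effect on toy tables (hypothetical values) and contrasts it with `all.x = TRUE`, which keeps unmatched activity rows.

```{r}
# Toy daily tables (hypothetical values); only some Id/date pairs appear in both
toy_act <- data.frame(
  Id = c(1, 1, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  steps = c(8000, 9500, 4000, 12000)
)
toy_slp <- data.frame(
  Id = c(1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13")),
  asleep = c(400, 380, 420)
)

# Default inner join: rows without a match on both sides are dropped
toy_merged <- merge(toy_act, toy_slp, by = c("Id", "date"))
nrow(toy_merged)

# all.x = TRUE keeps every activity row and fills missing sleep values with NA
toy_left <- merge(toy_act, toy_slp, by = c("Id", "date"), all.x = TRUE)
sum(is.na(toy_left$asleep))
```

Comparing `nrow()` before and after each merge is a simple way to see how many observations the sleep data removes from the analysis.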
### Recommended weekly exercise minutes
```{r}
# Find user average weekly exercise minutes
weekly_exercise_minutes <- dailyActivity
weekly_exercise_minutes <- weekly_exercise_minutes %>%
group_by(Id) %>%
summarise(num_days = n_distinct(date), avg_weekly_exer_min = sum(adjusted_exercise_minutes) / num_days * 7 )
```
# 4. Analyse and visualise
## Supporting visualisations and key findings
### Weekly MET minute pattern
```{r}
# Weekly MET
# Find mean weekly MET in the group
mean_avg_weekly_MET <- mean(weeklyMET$avg_weekly_MET)
mean_avg_weekly_MET
summary(weeklyMET$avg_weekly_MET)
ggplot(weeklyMET, aes(x=avg_weekly_MET)) +
geom_density() +
geom_vline(xintercept = 500, colour = "red", linetype = "dashed") +
geom_vline(xintercept = mean_avg_weekly_MET, colour = "blue", linetype = "dashed") +
geom_text(aes(x = 1000, y = 0.00005), label = "Minimum recommended \n weekly MET = 500", colour = "red", angle = 90) +
geom_text(aes(x = 4300, y = 0.00005), label = "Mean = 4004", colour = "blue", angle = 90) +
labs(x = "Average MET minute per week",
y = "Density") +
labs(title = "MET minute distribution",
subtitle = "The whole sample meets the minimum recommended weekly MET minutes")
```
Metabolic equivalent of task (MET) -- 1 MET is the amount of energy used at rest. MET is used to indicate physical intensity.
Moderate intensity (3 - 6 METs) e.g. walking briskly
Vigorous intensity (6+ METs) e.g. running, HIIT
https://www.hsph.harvard.edu/nutritionsource/staying-active/
MET minutes per week indicate how much energy is used when performing various activities throughout the week. The US Department of Health and Human Services recommends at least 500 MET minutes per week for adults.
https://health.gov/sites/default/files/2019-09/Physical_Activity_Guidelines_2nd_edition.pdf
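
As a worked example, MET minutes are the intensity in METs multiplied by the duration in minutes. The sketch below uses a hypothetical week of brisk walks at an assumed 4 METs; `met_minutes` is an illustrative helper, not part of the analysis.

```{r}
# MET minutes = intensity in METs x duration in minutes
met_minutes <- function(mets, minutes) mets * minutes

# Hypothetical week: five 30-minute brisk walks at an assumed 4 METs each
toy_week <- 5 * met_minutes(mets = 4, minutes = 30)
toy_week

# Compare with the recommended weekly minimum
toy_week >= 500
```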
Respondents' minimum weekly MET minute = 615 MET minute
Respondents' mean weekly MET minute = 4004 MET minute
Recommended weekly MET minute = 500 MET minute
All respondents meet the recommended MET minute per week.
Data shows the group is physically very active.
Bellabeat's current fitness device product lines are designed like fashion accessories. Bellabeat's target customers, e.g. women working in offices, university students, etc., would be expected to span a spectrum of physical activity levels rather than being very active only. The small sample size of 33 may not be showing the whole picture.
### Exercise minutes per week pattern
```{r}
# Create a function to calculate percentage
percent_x <- function(data, x) {
lens = length(data)
ls_x = data[x]
len = length(ls_x)
len / lens * 100
}
```
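An equivalent and shorter way to get such a percentage is to take the mean of a logical vector, since `TRUE` counts as 1 and `FALSE` as 0. The sketch below uses a hypothetical helper, `percent_true`, which also ignores missing values; it is an alternative for illustration, not the function used in the rest of this analysis.

```{r}
# Share of values meeting a condition, as a percentage; NAs are ignored
percent_true <- function(condition) {
  mean(condition, na.rm = TRUE) * 100
}

# Hypothetical weekly exercise minutes
toy_minutes <- c(100, 200, 300, 400)
percent_true(toy_minutes > 150)
```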
```{r}
# Mean of weighted exercise minutes
mean_wk_exer_min <- mean(weekly_exercise_minutes$avg_weekly_exer_min)
mean_wk_exer_min
# Percentage above recommended exercise minutes
percent_x(weekly_exercise_minutes$avg_weekly_exer_min, weekly_exercise_minutes$avg_weekly_exer_min > 150)
# density plot, x = wt exer min, vline = recommended wk exer min
ggplot(weekly_exercise_minutes, aes(x = avg_weekly_exer_min)) +
geom_density() +
geom_vline(xintercept = 150, colour = "red", linetype = "dashed") +
geom_vline(xintercept = mean_wk_exer_min, colour = "blue", linetype = "dashed") +
geom_text(aes(x = 225, y = 0.0004), label = "Minimum recommended \n weekly exercise minutes = 150", colour = "red", angle = 90) +
geom_text(aes(x = 600, y = 0.0002), label = "Mean = 518", colour = "blue", angle = 90, vjust = -0.4, hjust = 0) +
labs(x = "Average exercise minute per week",
y = "Density") +
labs(title = "Exercise minute distribution",
subtitle = "Majority of sample meets the minimum recommended weekly exercise minutes")
summary(weekly_exercise_minutes$avg_weekly_exer_min)
```
The NHS recommends that adults do at least 150 minutes of moderate intensity activity, or 75 minutes of vigorous activity, a week, spread evenly over 4 to 5 days a week or every day.
https://www.nhs.uk/live-well/exercise/
For comparison, 1 vigorous activity minute is counted as 2 moderate activity minutes.
Recommended exercise minutes per week = 150 minutes
Mean of the group = 518 minutes
Percentage above the weekly recommended 150 minutes = 85%
85% of respondents meet the recommended exercise minutes, and the group average is much higher than the recommended baseline.
Respondents may have been doing high intensity exercise in general, such as interval runs or HIIT.
### Device usage pattern
```{r}
# Plot usage
ggplot(usage, aes(x = usage_day * 24)) +
geom_histogram(binwidth = 1) +
labs(x = "Device usage (hrs)",
y = "Count") +
labs(title = "Device usage in hour",
subtitle = "How long do they use the device each day? ")
# Find mode
locmodes(usage$usage_day, mod0=2, display=TRUE)
# Translate mode in hours
0.6898284 * 24
0.9926501 * 24
```
There are 2 groups of usage pattern -- one wears the device for 16.6 hours, and the other for 23.8 hours per day on average.
Both groups show very high engagement.
I would presume that one group wears the device all day, even while asleep, to get the most out of the data captured by the fitness device, while the other group takes the device off to sleep.
### Physical intensity in a week and in a day
```{r}
# hour intensity through out the week
# plot x = time, y = weighted exercise minutes, colour by intensity, facet by day of week
# find app notification best time for workout, recovery, time to bed
ggplot(hourlyIntensity, aes(x = weekday, y = total_int)) +
geom_col() +
labs(x = "Weekday",
y = "Total physical intensity") +
labs(title = "Physical intensity distribution in a week",
subtitle = "Tuesday and Wednesday are popular for workout")
ggplot(hourlyIntensity, aes(x = time, y = total_int)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, size = 8)) +
labs(x = "Time",
y = "Total physical intensity") +
labs(title = "Physical intensity distribution in a day",
subtitle = "5 pm to 7 pm are popular for workout")
ggplot(hourlyIntensity, aes(x = time, y = total_int)) +
geom_col() +
facet_wrap(~weekday) +
theme(axis.text.x = element_text(angle = 90, size = 4)) +
labs(x = "Time",
y = "Total physical intensity") +
labs(title = "Physical intensity distribution in a week (detailed) ",
subtitle = "Tue, Wed after work and Sat noon are popular for workout")
```
Tuesday and Wednesday are the most popular weekdays for workouts.
5pm to 7pm is the most popular time slot for workouts in a day, presumably because respondents work out right after work, before dinner.
Saturday 1pm is the most popular time for workouts at the weekend. This could reflect the different kinds of activity and places respondents choose for weekend workouts.
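
The peak times above were read off the charts. They could also be extracted programmatically, as in the minimal sketch below, which uses toy hourly totals (hypothetical values and column names) rather than `hourlyIntensity` itself.

```{r}
# Toy hourly intensity totals (hypothetical values)
toy_hourly <- data.frame(
  weekday = c("Tue", "Tue", "Tue", "Sat", "Sat", "Sat"),
  hour = c(12, 17, 18, 12, 13, 18),
  total_int = c(200, 650, 700, 400, 720, 300)
)

# Sum intensity per hour across all days, then pick the hour with the highest total
toy_by_hour <- aggregate(total_int ~ hour, data = toy_hourly, FUN = sum)
toy_peak_hour <- toy_by_hour$hour[which.max(toy_by_hour$total_int)]
toy_peak_hour
```

Adding `weekday` to the formula (`total_int ~ weekday + hour`) would give the peak per day of the week in the same way.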
### Correlation between adjusted exercise minutes and exercise distance
```{r}
ggplot(daily_merged, aes(x = adjusted_exercise_minutes, y = exercise_distance)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
labs(x = "Adjusted exercise minutes",
y = "Exercise distance (km)") +
labs(title = "Adjusted exercise minutes vs Exercise distance (km)",
subtitle = "r = 0.77, => High positive correlation")
# Find correlation coefficient
x <- daily_merged$adjusted_exercise_minutes
y <- daily_merged$exercise_distance
cor.test(x, y, method = "pearson")
```
Correlation coefficient (r) = 0.77
Exercise duration and exercise distance are highly positively correlated.
The graph shows that, on average, 5km took about 200 adjusted exercise minutes, so outdoor running may not be a popular exercise in the group.
They might be doing exercise that requires moving around at medium to high intensity, e.g. cross training or functional training.
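
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this against `cor()` on made-up paired observations; the `toy_` values are hypothetical and unrelated to the dataset.

```{r}
# Hypothetical paired observations
toy_x <- c(10, 20, 30, 40, 50)        # adjusted exercise minutes
toy_y <- c(0.5, 1.1, 1.4, 2.2, 2.4)   # exercise distance (km)

# Pearson's r: covariance divided by the product of standard deviations
toy_r_manual <- cov(toy_x, toy_y) / (sd(toy_x) * sd(toy_y))
toy_r_builtin <- cor(toy_x, toy_y, method = "pearson")

# Both approaches should agree
all.equal(toy_r_manual, toy_r_builtin)
```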
### Correlations between exercise, calories burnt and sleep efficiency
```{r}
# Plot adjusted exercise mins vs sleep efficiency
ggplot(daily_merged, aes(x = adjusted_exercise_minutes, y = sleepEfficiency)) +
geom_point() +
labs(x = "Adjusted exercise minutes",
y = "Sleep efficiency") +
labs(title = "Adjusted exercise minutes vs Sleep efficiency",
subtitle = "r = -0.002, => No correlation")
# Find correlation coefficient
x <- daily_merged$adjusted_exercise_minutes
y <- daily_merged$sleepEfficiency
cor.test(x, y, method = "pearson")
# Plot total steps vs sleep efficiency
ggplot(daily_merged, aes(x = TotalSteps, y = sleepEfficiency)) +
geom_point() +
labs(x = "Total steps",
y = "Sleep efficiency") +
labs(title = "Total steps vs Sleep efficiency",
subtitle = "r = -0.062, => No correlation")
# Find correlation coefficient
x <- daily_merged$TotalSteps
y <- daily_merged$sleepEfficiency
cor.test(x, y, method = "pearson")
# Plot exercise distance vs sleep efficiency
ggplot(daily_merged, aes(x = exercise_distance, y = sleepEfficiency)) +
geom_point() +
labs(x = "Exercise distance (km)",
y = "Sleep efficiency") +
labs(title = "Exercise distance (km) vs Sleep efficiency",
subtitle = "r = -0.15, => No correlation")
# Find correlation coefficient
x <- daily_merged$exercise_distance
y <- daily_merged$sleepEfficiency
cor.test(x, y, method = "pearson")
# Plot total intensity vs sleep efficiency
ggplot(daily_merged, aes(x = total_int, y = sleepEfficiency)) +
geom_point() +
labs(x = "Total intensity",
y = "Sleep efficiency") +
labs(title = "Total intensity vs Sleep efficiency",
subtitle = "r = 0.065, => No correlation")
# Find correlation coefficient
x <- daily_merged$total_int
y <- daily_merged$sleepEfficiency
cor.test(x, y, method = "pearson")
```
The data show little or no linear correlation between sleep efficiency and exercise duration, distance, number of steps, or intensity.
### Correlation between calories burnt, adjusted exercise minutes and sleep efficiency
```{r}
# bad sleep group clustered at lower end of calories burnt
ggplot(daily_merged, aes(x = adjusted_exercise_minutes, y = Calories)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
labs(x = "Adjusted exercise minutes",
y = "Calories") +
labs(title = "Adjusted exercise minutes vs Calories burnt",
subtitle = "r = 0.54 => Moderate positive correlation")
# Find correlation coefficient
x <- daily_merged$adjusted_exercise_minutes
y <- daily_merged$Calories
cor.test(x, y, method = "pearson")
# Facet view
ggplot(daily_merged, aes(x = adjusted_exercise_minutes, y = Calories, colour = good_bad_sleep)) +
geom_point() +
facet_grid(~ good_bad_sleep) +
labs(x = "Adjusted exercise minutes",
y = "Calories") +
labs(title = "Adjusted exercise minutes vs Calories burnt",
subtitle = "Categorised by quality of sleep") +
labs(colour = "Good or bad sleep") +
theme(legend.title = element_blank(), legend.position = "right")
# Percentage of good sleep efficiency
percent_x(daily_merged$sleepEfficiency, daily_merged$sleepEfficiency > 0.85)
```
Correlation coefficient (r) = 0.54
Adjusted exercise minutes and calories burnt are moderately positively correlated.
Faceting by sleep efficiency, the respondents with bad sleep cluster at the lower end of calories burnt compared with others doing the same exercise minutes.
However, since the sample size of the sleep dataset is only 24, a larger sample size is needed to reach a solid conclusion.
# 5. Share and act
## Summary of Analysis
- The respondents are physically active, with 85% meeting the recommended exercise duration (150 minutes per week) and all meeting the MET recommendation (500 MET minutes per week). They are likely to have regular exercise habits.
- Most people exercise from 5pm to 7pm on weekdays, and at 1pm on Saturday.
- Cross training or functional training may be more popular than outdoor running.
- High calories burnt during workouts could be correlated with good sleep.
## High level marketing strategy recommendations
### 1. Content marketing
People with regular exercise habits are likely to have good knowledge of fitness. Bellabeat's social media marketing could post more in-depth, edgy sports science and nutrition content, in order to provide new value to its customers.
### 2. Member subscription conversion
Bellabeat could set up a freemium membership model to maximise customer lifetime value (CLV). Paid members could receive exclusive content and attend virtual events, timed according to the usage analysis, e.g. a cross training virtual class on Tuesday at 6pm.
### 3. Best time for notification
Provide good UX by delivering relevant content according to usage patterns, e.g. recovery or nutrition tips after a workout at 7pm.
## Marketing v2 -- what's next?
The sample data dates from 2016 and lacks demographic information such as gender and age.
Bellabeat could collect new data with a larger sample size, ideally with demographics in line with its target customers, e.g. women working in offices and university students, and check whether the usage patterns are still in line with the assumptions above.
Check whether there is any correlation between calories burnt and sleep efficiency. Provide relevant content to improve the sleep and wellness of Bellabeat's customers.
=====
Thank you for reading.
If you have any questions, please feel free to contact me on GitHub or Kaggle!