---
title: "Cyclistic Case Study Report"
author: "P. Katekomol"
date: "11/4/2021"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
![](...xxx...xxx...)
## Overview
This report is a part of the "[Google Data Analytics Professional Certificate's](https://www.coursera.org/professional-certificates/google-data-analytics?utm_source=gg&utm_medium=sem&utm_campaign=15-GoogleDataAnalytics-ROW&utm_content=15-GoogleDataAnalytics-ROW&campaignid=12566515400&adgroupid=117869292685&device=c&keyword=google%20data%20analytics%20professional%20certificate&matchtype=p&network=g&devicemodel=&adpostion=&creativeid=507290840624&hide_mobile_promo&gclid=Cj0KCQjwlOmLBhCHARIsAGiJg7nCDJoKyJBvSNg_ZxabdVBTbqVrKCXJUKI_nAEzpv0AmrXkmcYWZ9kaAqIPEALw_wcB)" capstone project offered by [Coursera](https://www.coursera.org/).
In this report, the chosen case study is the first problem in track#1, also known as '**Cyclistic**,' an imaginary bike-sharing company in Chicago. The company offers both traditional and assistive bikes to both casual riders and riders who are annual members.
### The quoted scenario provided by the course:
"You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations."
## 1. "Ask"
The questions that the stakeholders want answered in this scenario are:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
From these questions, the top-priority task in this project is to identify the differences in bike usage between annual members and casual riders, and then use those differences to figure out how to retain existing annual members and attract new ones in order to maximize Cyclistic's profit.
## 2. "Prepare"
__Note__: All data preparation processes and the tools involved were run under the **Windows 10 (Home)** operating system.
The [**data**](https://divvy-tripdata.s3.amazonaws.com/index.html) used in this scenario was provided by the course under this [**license**](https://www.divvybikes.com/data-license-agreement). There were 12 data files for the data collected in the last 12 months. Each file was originally compressed in a .zip file format. The .zip files were named according to the month in the year the data was collected. For example, the file containing the data from **September 2021** was named "**202109-divvy-tripdata.zip**."
From the data provided, the latest data was from **September 2021**, and the oldest **October 2020**. Therefore, the first and the last downloaded files were "**202109-divvy-tripdata.zip**" and "**202010-divvy-tripdata.zip**", respectively.
The following is the complete list of downloaded files in descending chronological order:
[202109-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202109-divvy-tripdata.zip)\
[202108-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202108-divvy-tripdata.zip)\
[202107-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202107-divvy-tripdata.zip)\
[202106-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202106-divvy-tripdata.zip)\
[202105-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202105-divvy-tripdata.zip)\
[202104-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202104-divvy-tripdata.zip)\
[202103-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202103-divvy-tripdata.zip)\
[202102-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202102-divvy-tripdata.zip)\
[202101-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202101-divvy-tripdata.zip)\
[202012-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202012-divvy-tripdata.zip)\
[202011-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202011-divvy-tripdata.zip)\
[202010-divvy-tripdata.zip](https://divvy-tripdata.s3.amazonaws.com/202010-divvy-tripdata.zip)\
Once unzipped, the data files were in **comma-separated values (.csv)** format. The files were **renamed** with **number prefixes from "1_" to "12_"**, from the first downloaded file containing the most up-to-date data to the last. The added prefixes make it easier to view the files in chronological order in a folder under the Windows 10 operating system.
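For reproducibility, the same renaming could also be scripted instead of done by hand in Explorer. The following is a minimal sketch (not part of the original workflow), assuming the unzipped .csv files sit in the current working directory:
```{r A sketch of scripted file renaming, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: add the "1_".."12_" prefixes from R instead of renaming by hand.
#Sorting the names in decreasing order puts the newest month (202109) first,
#so it receives the prefix "1_".
csv_files <- sort(list.files(pattern = "-divvy-tripdata\\.csv$"), decreasing = TRUE)
file.rename(csv_files, paste0(seq_along(csv_files), "_", csv_files))
```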
The following picture shows the unzipped and renamed files automatically sorted by the given prefixes in a folder under Windows 10 operating system:
![*__Figure 2.1: The list of unzipped files seen in a folder on a Windows 10 machine__*](...xxx...xxx...)
Some files are noticeably smaller than the rest, which suggests a significantly smaller number of observations in those files:
![*__Figure 2.2: Files that are relatively small compared to the rest__*](...xxx...xxx...)
At this point, the data was ready to be processed. It is worth noting that this data was, per the scenario, provided directly by Cyclistic, which makes it **secondary data**: the analyst did not collect it himself; it was supplied by the party/organization that collected it and that is directly involved in the analysis process as a stakeholder.
## 3. "Process"
The tools used in this project were **RStudio** (2021.09.0 Build 351) running on **R** (version 4.1.1) and **Microsoft Excel** (Office 365 version), all of which, as stated above, ran under the Windows 10 (Home) operating system.
R was chosen because it can handle large data sets much more quickly than spreadsheets; it can also produce visualizations and well-formatted reports, complete with code chunks and pictures. Excel, in this case, was used to create temporary summary tables and visualizations that were later ported into RStudio to create this report.
### 3.1 Importing data into RStudio
Before importing any data into RStudio, the working directory must be set to the directory in which the data is stored. The quickest way to do this is to use Windows' GUI to navigate to the folder containing the files, copy the path string from the address bar, for example "__C:\\Users\\...\\Cyclistic_data__", and paste it into the '**setwd()**' function, changing every **backslash** (__\\__) into a **slash** (__/__) to comply with R syntax:
```{r Setting the working directory to where the data files are stored, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.1.1
setwd("C:/Users/.../Cyclistic_data--Example file location")
```
Next, use the "**read_csv()**" function (from the **readr** package, loaded as part of the **tidyverse**) to import the data from each .csv file and store it in R as a data frame. In this case, each data frame was named according to the month of data stored in the corresponding file. For example, the data collected in **September 2021** was stored in a data frame named "**SEP21**."
The following code chunk shows the data import step:
```{r Import .csv files as R\'s environment objects, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.1.2
#read_csv() comes from the readr package; load the tidyverse first so that
#this and the dplyr/ggplot2 functions used later are available
library(tidyverse)
SEP21 <- read_csv("1_202109-divvy-tripdata.csv")
```
The data in the rest of the downloaded files was imported with the same code and stored in data frames named **SEP21, AUG21, ..., OCT20**, totaling 12 objects in R's environment.
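As an aside, the 12 individual read_csv() calls could also be replaced by a single loop. The following is a minimal sketch under the same "N_yyyymm-divvy-tripdata.csv" naming convention; the list object and regular expression shown here are illustrative, not part of the original workflow:
```{r A sketch of importing all files in one pass, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: read every renamed .csv file into a named list of data frames.
#The list names are taken from the yyyymm part of each file name.
library(readr)
csv_files <- list.files(pattern = "-divvy-tripdata\\.csv$")
trip_list <- lapply(csv_files, read_csv)
names(trip_list) <- sub("^\\d+_(\\d{6}).*$", "\\1", csv_files)
```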
### 3.2 Data Integrity Checkups
#### 3.2.1 Glimpses of the Data Integrity
Next, the data frames were subjected to an integrity checkup. First, the "**glimpse()**" function was used to get an overview of each data object, including the number of rows and columns and the data type of each column.
<a name="chunk3211">
```{r Use glimpse() to check the data frame\'s integrity, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.1.1
> glimpse(SEP21)
Rows: 756,147
Columns: 13
$ ride_id <chr> "9DC7B962304CBFD8", "F930E2C6872~
$ rideable_type <chr> "electric_bike", "electric_bike"~
$ started_at <dttm> 2021-09-28 16:07:10, 2021-09-28~
$ ended_at <dttm> 2021-09-28 16:09:54, 2021-09-28~
$ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, ~
$ start_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, ~
$ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, ~
$ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, ~
$ start_lat <dbl> 41.89000, 41.94000, 41.81000, 41~
$ start_lng <dbl> -87.68000, -87.64000, -87.72000,~
$ end_lat <dbl> 41.89, 41.98, 41.80, 41.81, 41.8~
$ end_lng <dbl> -87.67, -87.67, -87.72, -87.72, ~
$ member_casual <chr> "casual", "casual", "casual", "c~
```
</a>
From the above result, each of the **13 variables** in **SEP21** seems to have been assigned its correct data type. Every data frame was checked with the **glimpse()** function, and the numbers of rows and columns, along with other exploratory information, were recorded in a temporary summary table, in this case in **Excel**:
![*__Figure 3.2.1.1: A temporary table summarizing the exploratory results of the data sets in an Excel table__*](...xxx...xxx...)
From the above table, each data frame contains a varying number of rows, but, most importantly, they all have the same number of columns, presumably with the same data type in each column.
#### 3.2.2 Inspecting and comparing data type in each column of every data frame
To make sure that the data type of each column is consistent across every data frame, a data type checkup was performed by using the "**sapply()**" function to apply "**typeof()**" to every column of a data frame. The resulting data type information was stored in a separate object named after its original data frame. For example, the data type object extracted from the data frame **SEP21** was named **dtSEP21**.
The data type objects were then compared pairwise with the "**identical()**" function. All 12 objects were compared following the **transitive property**, which states that "**if a = b and b = c, then a = c**." The following code chunk demonstrates the comparison process and its results:
```{r The data type comparison process, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.2.1
> dtSEP21 <- sapply(SEP21,typeof)
> dtAUG21 <- sapply(AUG21,typeof)
> dtJUL21 <- sapply(JUL21,typeof)
> dtJUN21 <- sapply(JUN21,typeof)
> dtMAY21 <- sapply(MAY21,typeof)
> dtAPR21 <- sapply(APR21,typeof)
> dtMAR21 <- sapply(MAR21,typeof)
> dtFEB21 <- sapply(FEB21,typeof)
> dtJAN21 <- sapply(JAN21,typeof)
> dtDEC20 <- sapply(DEC20,typeof)
> dtNOV20 <- sapply(NOV20,typeof)
> dtOCT20 <- sapply(OCT20,typeof)
> identical(dtSEP21,dtAUG21)
[1] TRUE
> identical(dtAUG21,dtJUL21)
[1] TRUE
> identical(dtJUL21,dtJUN21)
[1] TRUE
> identical(dtJUN21,dtMAY21)
[1] TRUE
> identical(dtMAY21,dtAPR21)
[1] TRUE
> identical(dtAPR21,dtMAR21)
[1] TRUE
> identical(dtMAR21,dtFEB21)
[1] TRUE
> identical(dtFEB21,dtJAN21)
[1] TRUE
> identical(dtJAN21,dtDEC20)
[1] TRUE
> identical(dtDEC20,dtNOV20)
[1] FALSE
> identical(dtNOV20,dtOCT20)
[1] TRUE
```
From the result of the code chunk above, the data type integrity was broken in the data frames **NOV20** and **OCT20**, the last two data frames. Further inspection with **glimpse()** revealed that **start_station_id** and **end_station_id** had been assigned the **double** type instead of the **character** type. This was fixed by re-assigning the correct data types to those columns in both data frames with the following code chunks:
```{r Assigning the correct data type to columns in NOV20 and OCT20 data frames, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.2.2 - check the data types of NOV20 and OCT20 data frames
> glimpse(NOV20)
Rows: 259,716
Columns: 13
$ ride_id <chr> "BD0A6FF6FFF9B921", "96A7A7A4BDE4~
$ rideable_type <chr> "electric_bike", "electric_bike",~
$ started_at <dttm> 2020-11-01 13:36:00, 2020-11-01 ~
$ ended_at <dttm> 2020-11-01 13:45:40, 2020-11-01 ~
$ start_station_name <chr> "Dearborn St & Erie St", "Frankli~
$ start_station_id <dbl> 110, 672, 76, 659, 2, 72, 76, NA,~
$ end_station_name <chr> "St. Clair St & Erie St", "Noble ~
$ end_station_id <dbl> 211, 29, 41, 185, 2, 76, 72, NA, ~
$ start_lat <dbl> 41.89418, 41.89096, 41.88098, 41.~
$ start_lng <dbl> -87.62913, -87.63534, -87.61675, ~
$ end_lat <dbl> 41.89443, 41.90067, 41.87205, 41.~
$ end_lng <dbl> -87.62338, -87.66248, -87.62955, ~
$ member_casual <chr> "casual", "casual", "casual", "ca~
> glimpse(OCT20)
Rows: 388,653
Columns: 13
$ ride_id <chr> "ACB6B40CF5B9044C", "DF450C72FD10~
$ rideable_type <chr> "electric_bike", "electric_bike",~
$ started_at <dttm> 2020-10-31 19:39:43, 2020-10-31 ~
$ ended_at <dttm> 2020-10-31 19:57:12, 2020-11-01 ~
$ start_station_name <chr> "Lakeview Ave & Fullerton Pkwy", ~
$ start_station_id <dbl> 313, 227, 102, 165, 190, 359, 313~
$ end_station_name <chr> "Rush St & Hubbard St", "Kedzie A~
$ end_station_id <dbl> 125, 260, 423, 256, 185, 53, 125,~
$ start_lat <dbl> 41.92610, 41.94817, 41.77346, 41.~
$ start_lng <dbl> -87.63898, -87.66391, -87.58537, ~
$ end_lat <dbl> 41.89035, 41.92953, 41.79145, 41.~
$ end_lng <dbl> -87.62607, -87.70782, -87.60005, ~
$ member_casual <chr> "casual", "casual", "casual", "ca~
```
```{r Assigning the correct data types to columns in NOV20 and OCT20 data frames, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.2.3 - Assigning the correct data types
> NOV20 <- transform(NOV20, start_station_id = as.character(start_station_id))
> NOV20 <- transform(NOV20, end_station_id = as.character(end_station_id))
> OCT20 <- transform(OCT20, start_station_id = as.character(start_station_id))
> OCT20 <- transform(OCT20, end_station_id = as.character(end_station_id))
```
```{r Check to confirm that the data type transformation was successful, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.2.4 - Confirming the data type transformation
> glimpse(NOV20)
Rows: 259,716
Columns: 13
$ ride_id <chr> "BD0A6FF6FFF9B921", "96A7A7A4BDE4~
$ rideable_type <chr> "electric_bike", "electric_bike",~
$ started_at <dttm> 2020-11-01 13:36:00, 2020-11-01 ~
$ ended_at <dttm> 2020-11-01 13:45:40, 2020-11-01 ~
$ start_station_name <chr> "Dearborn St & Erie St", "Frankli~
$ start_station_id <chr> "110", "672", "76", "659", "2", "~
$ end_station_name <chr> "St. Clair St & Erie St", "Noble ~
$ end_station_id <chr> "211", "29", "41", "185", "2", "7~
$ start_lat <dbl> 41.89418, 41.89096, 41.88098, 41.~
$ start_lng <dbl> -87.62913, -87.63534, -87.61675, ~
$ end_lat <dbl> 41.89443, 41.90067, 41.87205, 41.~
$ end_lng <dbl> -87.62338, -87.66248, -87.62955, ~
$ member_casual <chr> "casual", "casual", "casual", "ca~
> glimpse(OCT20)
Rows: 388,653
Columns: 13
$ ride_id <chr> "ACB6B40CF5B9044C", "DF450C72FD10~
$ rideable_type <chr> "electric_bike", "electric_bike",~
$ started_at <dttm> 2020-10-31 19:39:43, 2020-10-31 ~
$ ended_at <dttm> 2020-10-31 19:57:12, 2020-11-01 ~
$ start_station_name <chr> "Lakeview Ave & Fullerton Pkwy", ~
$ start_station_id <chr> "313", "227", "102", "165", "190"~
$ end_station_name <chr> "Rush St & Hubbard St", "Kedzie A~
$ end_station_id <chr> "125", "260", "423", "256", "185"~
$ start_lat <dbl> 41.92610, 41.94817, 41.77346, 41.~
$ start_lng <dbl> -87.63898, -87.66391, -87.58537, ~
$ end_lat <dbl> 41.89035, 41.92953, 41.79145, 41.~
$ end_lng <dbl> -87.62607, -87.70782, -87.60005, ~
$ member_casual <chr> "casual", "casual", "casual", "ca~
```
From the above code chunk, the data type transformation from "**double**" to "**character**" was successful. Another data type checkup was done to confirm that the data type integrity was intact across all 12 data frames. The following code chunk shows the data type extraction and comparison for the two newly fixed data frames, using the **DEC20** data frame as a reference starting point:
```{r Data type extraction of the newly fixed data frames and data type comparison, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.2.5: data type extraction and comparison
> dtNOV20 <- sapply(NOV20,typeof)
> dtOCT20 <- sapply(OCT20,typeof)
> identical(dtDEC20,dtNOV20)
[1] TRUE
> identical(dtNOV20,dtOCT20)
[1] TRUE
```
The above code chunk confirmed that, by the transitive property, the data types of every column were exactly the same across all 12 data frames at this point.
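As a side note, the same conclusion could be reached in a single call rather than eleven pairwise comparisons. The following is a minimal sketch that reuses the dtSEP21 object created earlier:
```{r A compact data type comparison sketch, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: compare every data frame's column types against dtSEP21 at once.
frames <- list(SEP21, AUG21, JUL21, JUN21, MAY21, APR21, MAR21, FEB21,
               JAN21, DEC20, NOV20, OCT20)
all(sapply(frames, function(df) identical(sapply(df, typeof), dtSEP21)))
#TRUE means all 12 data frames share exactly the same column types.
```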
#### 3.2.3 Handling NAs
As seen earlier in [code chunk 3.2.1.1](#chunk3211), several **NA**s (missing, "not available" values) were readily visible when the **glimpse()** function was applied to the **SEP21** data frame. The same was true of the other data frames.
In this case, rows containing NAs weren't removed because it was suspected that they were not input errors, and learning about their occurrence pattern could reveal more insights about the data.
Initially, the occurrence of NAs in different columns of each data frame was recorded with the following code, demonstrated with the "**SEP21**" data frame:
```{r The occurrence of NAs in different columns of each data frame, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.3.1: Find the column index with NAs in them
> which(colSums(is.na(SEP21))!=0)
start_station_name start_station_id end_station_name
5 6 7
end_station_id end_lat end_lng
8 11 12
```
From the code above, the columns of the **SEP21** data frame with NAs in them were:
column index: **5**, column name: **start_station_name**\
column index: **6**, column name: **start_station_id**\
column index: **7**, column name: **end_station_name**\
column index: **8**, column name: **end_station_id**\
column index: **11**, column name: **end_lat**\
column index: **12**, column name: **end_lng**\
This specific information about NAs was also obtained from the rest of the data frames and was added to the Excel temporary summary table:
![*__Figure 3.2.3.1: A temporary table summarizing the exploratory results of the data sets in an Excel table, including column indices of each data frame containing NAs__*](...xxx...xxx...)
From the above table, it can be seen that every data frame has the same columns containing NA values. These columns hold the information about the starting and ending stations of the bikes. NAs in these columns can mean that riders neither took off from stations nor parked the bikes at stations.
Another interesting observation is that there are also some NAs in the columns containing the actual position of the bikes after each trip. This can mean that the bikes' positioning devices ran out of battery before they reached their destinations.
These observations concerning NAs could reveal more insights about customer behaviors and maintenance requirements. Further investigations and their results will be shown later in this report.
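For completeness, the number of missing values in each column, rather than just the affected column indices, can be tallied with a call like the following sketch (shown for **SEP21**; the same call applies to the other data frames):
```{r A sketch counting missing values per column, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: count, rather than just locate, the NAs in each column.
colSums(is.na(SEP21))
```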
#### 3.2.4 Check for Duplicated Rows
Checking for duplicated data entries should be one of the first tasks in data processing. However, the nature of this particular data allows for many duplicated values in all but the "**ride_id**" column. Each ride id represents one unique bike trip, so no two ride ids should be identical; the opposite, where at least two rows have unique ride ids but the rest of their fields contain exactly the same information, is nevertheless possible.
In this case, the "**red flag**" appears only when duplicated ride ids are found. The following code chunk checked for duplicated ride ids, as an example, in the data frame **SEP21**:
```{r Check for duplicated ride ids, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.4.1: Check for duplicated ride ids
> SEP21 %>% group_by(ride_id) %>% filter(n()>1)
# A tibble: 0 x 13
# Groups: ride_id [0]
# ... with 13 variables: ride_id <chr>, rideable_type <chr>,
# started_at <dttm>, ended_at <dttm>,
# start_station_name <chr>, start_station_id <chr>,
# end_station_name <chr>, end_station_id <chr>,
# start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
# end_lng <dbl>, member_casual <chr>
>
```
A tibble of 0 rows and 13 columns was returned from the code above, meaning there were no duplicated ride ids in the **SEP21** data frame. After every data frame was checked, none of them showed any duplicated ride ids.
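An equivalent, lighter-weight check is to count duplicated ride ids directly; a minimal sketch:
```{r A lighter duplicate check sketch, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: a result of 0 means every ride_id in the data frame is unique.
sum(duplicated(SEP21$ride_id))
```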
#### 3.2.5 Combining All Data Frames
At this point, the data integrity had been checked and confirmed to be consistent. There was nothing to add to or remove from the data frames just yet; they were ready to be combined into one big data frame containing the complete information for the analysis.
The following code combined all data frames into a single data frame named "**alldata**":
```{r Combining all data frames into one complete data frame ready for analysis, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.5.1: Combining all data frames into one complete data frame ready for analysis
> alldata <- bind_rows(SEP21,AUG21,JUL21,JUN21,MAY21,APR21,MAR21,FEB21,JAN21,DEC20,NOV20,OCT20)
>
> glimpse(alldata)
Rows: 5,136,261
Columns: 13
$ ride_id <chr> "9DC7B962304CBFD8", "F930E2C6872D6~
$ rideable_type <chr> "electric_bike", "electric_bike", ~
$ started_at <dttm> 2021-09-28 16:07:10, 2021-09-28 1~
$ ended_at <dttm> 2021-09-28 16:09:54, 2021-09-28 1~
$ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ start_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ start_lat <dbl> 41.89000, 41.94000, 41.81000, 41.8~
$ start_lng <dbl> -87.68000, -87.64000, -87.72000, -~
$ end_lat <dbl> 41.89, 41.98, 41.80, 41.81, 41.88,~
$ end_lng <dbl> -87.67, -87.67, -87.72, -87.72, -8~
$ member_casual <chr> "casual", "casual", "casual", "cas~
```
The result of "**glimpse(alldata)**" returned a data frame of **5,136,261** rows and **13** columns. The number 5,136,261 was equal to the sum of all rows in all of the smaller data frames.
![*__Figure 3.2.5.1: The total number of rows in the temporary excel table matches that calculated with the glimpse() function applied to the "alldata" data frame__*](...xxx...xxx...)
The data type integrity of the **alldata** data frame was also preserved.
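The row total can also be cross-checked directly in R; a minimal sketch:
```{r Cross-checking the combined row count, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: the combined row count should equal the sum of the parts,
#confirming that bind_rows() neither dropped nor duplicated any rows.
nrow(alldata) == sum(sapply(list(SEP21, AUG21, JUL21, JUN21, MAY21, APR21,
                                 MAR21, FEB21, JAN21, DEC20, NOV20, OCT20),
                            nrow))
```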
#### 3.2.6 Change Data Type of Some Variables
In this case, some variables contained categorical values that could be converted into R's "**factor**" type. Doing so reduces data redundancy and saves some memory. The following code performed the conversion to "**factor**":
```{r Convert data type to "factor", message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.6.1: convert data type to 'factor'
> alldata <- alldata %>%
mutate(rideable_type=as.factor(rideable_type)) %>%
mutate(member_casual=as.factor(member_casual))
```
The result of the following code confirmed that the "**rideable_type**" and "**member_casual**" columns had been converted to "**factor**".
```{r Check the data type conversion, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.6.2: Confirming the data type conversion
> summary(alldata)
ride_id rideable_type
Length:5136261 classic_bike :2750831
Class :character docked_bike : 677980
Mode :character electric_bike:1707450
started_at ended_at
Min. :2020-10-01 00:00:06 Min. :2020-10-01 00:05:09
1st Qu.:2021-04-11 18:50:57 1st Qu.:2021-04-11 19:15:05
Median :2021-06-21 18:01:31 Median :2021-06-21 18:20:59
Mean :2021-05-25 22:30:57 Mean :2021-05-25 22:51:34
3rd Qu.:2021-08-11 21:13:51 3rd Qu.:2021-08-11 21:33:57
Max. :2021-09-30 23:59:48 Max. :2021-10-01 22:55:35
start_station_name start_station_id end_station_name
Length:5136261 Length:5136261 Length:5136261
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
end_station_id start_lat start_lng
Length:5136261 Min. :41.64 Min. :-87.84
Class :character 1st Qu.:41.88 1st Qu.:-87.66
Mode :character Median :41.90 Median :-87.64
Mean :41.90 Mean :-87.65
3rd Qu.:41.93 3rd Qu.:-87.63
Max. :42.08 Max. :-87.52
end_lat end_lng member_casual
Min. :41.51 Min. :-88.07 casual:2358287
1st Qu.:41.88 1st Qu.:-87.66 member:2777974
Median :41.90 Median :-87.64
Mean :41.90 Mean :-87.65
3rd Qu.:41.93 3rd Qu.:-87.63
Max. :42.17 Max. :-87.44
NA's :4821 NA's :4821
```
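The memory-saving claim behind the factor conversion can be spot-checked with base R's object.size(); a minimal sketch:
```{r Spot-checking the memory impact of factors, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: compare the converted factor column against a character copy of it.
format(object.size(alldata$rideable_type), units = "MB")
format(object.size(as.character(alldata$rideable_type)), units = "MB")
```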
#### 3.2.7 Create a Calculated Column: "ride_duration_min"
In the "**alldata**" data frame, the information about the starting and ending time for each trip was available. It would be useful to extract the information about the duration of each trip from this available information and store it in a new column.
However, before creating a column containing the ride durations, the "**impossible values**" in the existing date-time columns should be considered.
Logically, all the values in the "**started_at**" column must be less than their counterparts in the "**ended_at**" column. In this case, if any observation went in the opposite direction, a simple value swapping between the two columns would be applied.
The following code checked if there are any so-called "**impossible values**":
```{r Check if there are any "Impossible Values", message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.7.1: Checking for "Impossible Values"
> alldata %>% filter(started_at > ended_at)
# A tibble: 3,284 x 13
ride_id rideable_type started_at ended_at
<chr> <fct> <dttm> <dttm>
1 3A47A9240B~ classic_bike 2021-09-29 14:52:22 2021-09-29 14:52:09
2 FA44EBC79C~ electric_bike 2021-09-28 17:43:33 2021-09-28 17:43:31
3 D5E5F806D0~ classic_bike 2021-09-01 18:42:09 2021-09-01 18:41:40
4 2EDC266B1A~ classic_bike 2021-09-04 09:42:26 2021-09-04 09:41:54
5 137BEA38C2~ classic_bike 2021-09-02 18:38:19 2021-09-02 18:37:48
6 7F7679196D~ classic_bike 2021-09-01 06:55:19 2021-09-01 06:55:11
7 D4A976387C~ electric_bike 2021-09-04 17:20:37 2021-09-04 17:20:35
8 3E68DE6382~ electric_bike 2021-09-29 17:44:02 2021-09-29 17:40:51
9 88BA6AC732~ classic_bike 2021-09-29 15:26:23 2021-09-29 15:26:22
10 6ACD4446FA~ electric_bike 2021-09-29 17:32:54 2021-09-29 17:28:32
# ... with 3,274 more rows, and 9 more variables:
# start_station_name <chr>, start_station_id <chr>,
# end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
# start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
# member_casual <fct>
```
As the result shows, 3,284 rows contained illogical date-time values.
The following code checked for any time value in **started_at** that was greater than the value in **ended_at** in the same row and swapped them:
```{r check for impossible values and swap them, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.7.2: Checking if values in started_at are >
#values in ended_at (which is illogical) and swapping them
for (i in 1:nrow(alldata)) {
if(alldata[i,3] > alldata[i,4]) {
c <- alldata[i,3]
alldata[i,3] <- alldata[i,4]
alldata[i,4] <- c
}
}
```
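Because the loop above visits each of the roughly 5 million rows one at a time, it can take a long time to run. The same swap could also be expressed in vectorized form; a minimal sketch:
```{r A vectorized alternative to the swap loop, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: swap the two timestamps wherever started_at is later than ended_at,
#without looping over individual rows in R.
swap <- alldata$started_at > alldata$ended_at
tmp <- alldata$started_at[swap]
alldata$started_at[swap] <- alldata$ended_at[swap]
alldata$ended_at[swap] <- tmp
```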
Next, the following code confirmed that there were no more impossible values:
```{r confirm that there\'re no more impossible values, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.7.3: Confirm that there are no more impossible values:
> alldata %>% filter(started_at > ended_at)
# A tibble: 0 x 13
# ... with 13 variables: ride_id <chr>, rideable_type <fct>,
# started_at <dttm>, ended_at <dttm>, start_station_name <chr>,
# start_station_id <chr>, end_station_name <chr>,
# end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,
# end_lat <dbl>, end_lng <dbl>, member_casual <fct>
```
From the result, there were no more impossible values. It would be OK to proceed to create the calculated column "**ride_duration_min**" for trip duration at this point.
The following code calculated each trip's duration in minutes and populated the result in the new column "**ride_duration_min**":
```{r Create a calculated column for time difference, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.7.4: Create a calculated column to store ride duration in minutes
> alldata <- alldata %>%
mutate(ride_duration_min=abs(difftime(alldata$ended_at,
alldata$started_at,units="mins")))
> glimpse(alldata)
Rows: 5,136,261
Columns: 14
$ ride_id <chr> "9DC7B962304CBFD8", "F930E2C6872D6B32",~
$ rideable_type <fct> electric_bike, electric_bike, electric_~
$ started_at <dttm> 2021-09-28 16:07:10, 2021-09-28 14:24:~
$ ended_at <dttm> 2021-09-28 16:09:54, 2021-09-28 14:40:~
$ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ start_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ start_lat <dbl> 41.89000, 41.94000, 41.81000, 41.80000,~
$ start_lng <dbl> -87.68000, -87.64000, -87.72000, -87.72~
$ end_lat <dbl> 41.89, 41.98, 41.80, 41.81, 41.88, 41.8~
$ end_lng <dbl> -87.67, -87.67, -87.72, -87.72, -87.71,~
$ member_casual <fct> casual, casual, casual, casual, casual,~
$ ride_duration_min <drtn> 2.7333333 mins, 15.2333333 mins, 3.683~
```
The calculation resulted in an additional column at the last position called "**ride_duration_min**", storing the duration of each bike trip in minutes.
At this point, the types of information available were comprehensive: numerical, positional, categorical, and chronological. The data frame was saved as a .csv file with the following code:
```{r Saving the data file to .csv, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 3.2.7.5: Saving data to a .csv file
> write_csv(alldata,"alldata.csv")
```
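As a side note, an R-native copy could be kept alongside the .csv, since a .csv round trip flattens the factor and difftime column types back to plain text and numbers; a minimal sketch:
```{r A sketch of saving an R-native copy, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: an .rds copy preserves the factor and difftime column types.
saveRDS(alldata, "alldata.rds")
```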
The next step of data analysis could finally be commenced.
## 4. "Analyze"
In this part of the report, insights extracted from the data through various visualizations are discussed, along with the code and techniques used to create them.
**Note**: Only relevant visualizations are shown here. The full documentation for this phase can be accessed via this link:
### 4.1 Overview
The data about the monthly number of rides recorded in the temporary table in Excel can be quickly visualized to demonstrate that there are more bike trips in warm months than in cold months. This trend is to be expected from the data and can serve as another layer of confirmation that the data collected was valid.
![*__Figure 4.1.1: A quick visualization that shows number of monthly rides from October 2020 to September 2021. Notice how the number increases and decreases in different times of year.__*](...xxx...xxx...)
### 4.2 Comparing number of rides between "casual" and "member" customers
The main questions of this data analysis project are:\
**1. How do annual members and casual riders use Cyclistic bikes differently?**\
**2. Why would casual riders buy Cyclistic annual memberships?**\
**3. How can Cyclistic use digital media to influence casual riders to become members?**\
\
Questions #1 and #2 can be reframed as: "**How can casual customers be converted into member customers?**" Answering this question requires comparing insights extracted from both types of customers.
#### 4.2.1 Counting Numbers of Rides from Each Type of Customer
First, basic information about the number of rides from each customer type was obtained by simply counting the ride entries in each group with the following code:
```{r Counting number of rides from each group, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.1.1: counting the number of rides from each customer type
> table(alldata$member_casual)
casual member
2358287 2777974
```
Over the entire year, there were 2,358,287 casual rides and 2,777,974 member rides. These figures were visualized as a pie chart with the following code:
```{r Plotting a pie chart for number of rides from each customer type, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.1.2: Number of rides from each type of customer as a pie chart
ggplot(data=alldata)+
geom_bar(mapping=aes(x="",fill=member_casual))+
coord_polar("y")+
annotate("text",label="Member",x=1,y=900000,size=6)+
annotate("text",label="2,777,974",x=1,y=1200000,size=6)+
annotate("text",label="54.1%",x=1,y=1500000,size=6)+
annotate("text",label="Casual",x=1,y=4220000,size=6)+
annotate("text",label="2,358,287",x=1,y=3950000,size=6)+
annotate("text",label="45.9%",x=1,y=3650000,size=6)+
theme(plot.background=element_blank(),
panel.background=element_blank(),
axis.title=element_blank(),
axis.ticks=element_blank(),
axis.text=element_blank(),
legend.position="none")+
ggtitle(label="Number of Rides from Each Customer Type",
subtitle="Source: Cyclistic")
```
![*__Figure 4.2.1.1: A pie chart showing numbers of rides and their percentage values from each customer type__*](...xxx...xxx...)
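The percentages annotated on the chart can be reproduced directly from the counts; a minimal sketch:
```{r Reproducing the pie chart percentages, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: share of rides per customer type, rounded to one decimal place
#(roughly 45.9% casual and 54.1% member, matching the chart labels).
round(prop.table(table(alldata$member_casual)) * 100, 1)
```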
#### 4.2.2 Looking at the Statistics of the Time Each Customer Type Spent Riding
Next, two sets of statistics were compared to reveal the difference in the time each type of customer spent riding. A box plot was created from the data of both customer types with the following code:
```{r The first box plot, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.2.1: The first box plot
ggplot(data=alldata)+
geom_boxplot(mapping=aes(x=member_casual,y=ride_duration_min))+
ggtitle("Bike Trips Duration (mins)",
subtitle="Source: Cyclistic")+
scale_x_discrete(name=element_blank(),
labels=c("Casual Riders","Member Riders"))+
ylab("Ride Duration (mins)")
```
![*__Figure 4.2.2.1: The first box plot that doesn't give any statistical view. There are simply too many outliers.__*](...xxx...xxx...g)
Not much information could be gained from this plot, save for the obvious: there were so many outliers with values far above the bulk of the data that the two boxes were crushed flat against the x-axis.
A procedure to spot and remove outliers was implemented using the [interquartile range](https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/). This technique finds the interquartile range of a list of numbers and defines a boundary outside of which values are treated as removable outliers.
The following formula defines the range outside of which outliers can be removed:\
\
*__Formula 4.2.2.1: Setting the interquartile boundary for removing outliers__*
$$[Q_1 - 1.5\,IQR,\; Q_3 + 1.5\,IQR]$$
**Where**:\
Q1 is the 1^st^ quartile value\
Q3 is the 3^rd^ quartile value\
IQR is the width of the interquartile range (Q3 − Q1)\
\
Before removing the outliers, some statistics of the ride duration were checked with the following code:
```{r Check the statistics of the ride duration, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.2.2: Check the stats of the column ride_duration_mins
> summary(alldata$ride_duration_min)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.10 12.60 24.83 22.82 55944.15
```
The statistics summary of the ride duration showed that the shortest trip time was zero and the longest was 55,944.15 minutes, or roughly 932 hours (about 39 days).
In this case, the zeros at the lower end were not sensible; there should be no ride duration of "**0 minutes**". The zero durations were most probably data entry errors. The extremely long maximum duration, however, was still possible, whether from a genuinely long rental or from a customer forgetting to return the bike.
The data was split into two new sets, "**casualdata**" and "**memberdata**", one for each customer group, with the following code:
```{r Splitting data into casualdata and memberdata, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.2.3: Splitting data into "casualdata" and "memberdata" data sets:
casualdata <- alldata %>%
filter(member_casual=="casual")
memberdata <- alldata %>%
filter(member_casual=="member")
```
The zero durations were removed from both data sets with the following code:
```{r Removing zeros from both data sets, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.2.4: Removing zeros from both data sets
casualdata <- casualdata %>%
filter(!(ride_duration_min==0))
memberdata <- memberdata %>%
filter(!(ride_duration_min==0))
```
Outliers were spotted and removed according to each group's interquartile range (IQR), and a new data set was created to store the result with the following code:
```{r Spotting and removing the outliers from the data ,message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.2.5 Spotting and removing outliers from the data
#install a statistical package to help deal with outliers
install.packages("ggstatsplot")
#find the quantile range for "casualdata":
Q_casual <- quantile(casualdata$ride_duration_min,probs=c(.25,.75),na.rm=F)
#find the IQR of "casualdata"
iqr_casual <- IQR(casualdata$ride_duration_min)
#find the upper cut-off value:
casual_up <- Q_casual[2]+1.5*iqr_casual
#find the lower cut-off value:
casual_low <- Q_casual[1]-1.5*iqr_casual
#removing the outliers from "casualdata":
casualdata_adj <- casualdata %>%
mutate(ride_duration_min=ifelse(ride_duration_min>casual_up,NA,
ride_duration_min)) %>%
mutate(ride_duration_min=ifelse(ride_duration_min<casual_low,NA,
ride_duration_min))
#find the quantile range for "memberdata":
Q_member <- quantile(memberdata$ride_duration_min,probs=c(.25,.75),na.rm=F)
#find the IQR of "memberdata"
iqr_member <- IQR(memberdata$ride_duration_min)
#find the upper cut-off value:
member_up <- Q_member[2]+1.5*iqr_member
#find the lower cut-off value:
member_low <- Q_member[1]-1.5*iqr_member
#removing the outliers from "memberdata":
memberdata_adj <- memberdata %>%
mutate(ride_duration_min=ifelse(ride_duration_min>member_up,NA,
ride_duration_min)) %>%
mutate(ride_duration_min=ifelse(ride_duration_min<member_low,NA,
ride_duration_min))
#merging the outlier-free data sets:
alldata_adj <- bind_rows(casualdata_adj,memberdata_adj)
#Sorting the data by time back to its original order
#This will affect the map plot
alldata_adj <- alldata_adj[order(alldata_adj$started_at,
                                 decreasing=TRUE,na.last=FALSE),]
#creating box plot of both data sets together:
ggplot(data=alldata_adj)+
geom_boxplot(mapping=aes(x=member_casual,y=ride_duration_min))+
ggtitle("Bike Trips Duration (mins)--adjusted",
subtitle="Source: Cyclistic")+
scale_x_discrete(name=element_blank(),labels=c("Casual Riders",
"Member Riders"))+
ylab("Ride Duration (mins)")+
annotate("text",label="0.02",x=1.1,y=1.5)+
annotate("text",label="8.90",x=1.1,y=10.9)+
annotate("text",label="15.12",x=1.1,y=17)+
annotate("text",label="25.53",x=1.1,y=27.5)+
annotate("text",label="50.47",x=1.1,y=52)+
annotate("segment",x=1.05,xend=1,y=0,yend=0)+
annotate("segment",x=1.05,xend=1,y=50.47,yend=50.47)+
annotate("text",label="0.02",x=2.1,y=1.5)+
annotate("text",label="5.67",x=2.1,y=7.5)+
annotate("text",label="9.55",x=2.1,y=11.5)+
annotate("text",label="15.75",x=2.1,y=17.5)+
annotate("text",label="30.87",x=2.1,y=32)+
annotate("segment",x=2.05,xend=2,y=0,yend=0)+
annotate("segment",x=2.05,xend=2,y=30.87,yend=30.87)
```
![*__Figure 4.2.2.2: A new box plot with statistics of ride duration from both customer groups__*](...xxx...xxx...)
Records with zero durations were removed from the original data set, and the outlier durations were replaced with NA rather than dropping the rows. The adjusted data was named "**alldata_adj**". From this point onward, any new modification of the data set would be based on **alldata_adj**.
The new box plot showed that the **casual** riders spent more time on the bikes than the **member** riders on average. This is probably because they had no time constraints or fixed destinations in their commutes.
Some outliers were still present after the first removal, but they no longer interfered with the interpretation of the result.
Still, the **minimum duration** was shown to be **0.02 minutes** (about one second), which is hardly more plausible than a 0-minute duration. However, these strange observations might be due to the fact that the data came from an imaginary source.
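If desired, the prevalence of such near-instant trips could be gauged before deciding how to treat them; a minimal sketch using an arbitrary 30-second cutoff:
```{r Checking the prevalence of near-zero trips, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: count trips shorter than 30 seconds (0.5 minutes) in the
#adjusted data set; the 0.5-minute threshold is arbitrary.
sum(as.numeric(alldata_adj$ride_duration_min) < 0.5, na.rm = TRUE)
```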
#### 4.2.3 The Most Popular Bike Type
There are three types of bikes in the data set provided by Cyclistic, according to the following code:
```{r Types of bikes provided by Cyclistic, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.3.1: Showing 3 types of bikes
> unique(alldata_adj$rideable_type)
[1] "electric_bike" "docked_bike" "classic_bike"
```
From the data provided, Cyclistic offers 3 types of bike: **electric**, **docked**, and **classic**. It is clear that the electric and classic bikes differ, but it is not clear what exactly the **docked** type is. Again, this is an imaginary data set, and its observations do not have to match reality.
Learning about the popularity of these bikes among the customer groups can help with future marketing campaigns.
The following code creates a grouped bar chart of the number of rides for each bike type from the two customer groups:
```{r The popularity of different bike types ,message=FALSE, warning=FALSE, eval=FALSE }
#Code chunk 4.2.3.2: The popularity of different bike types among two customer groups
ggplot(data=alldata_adj)+
geom_bar(mapping=aes(x=rideable_type,fill=member_casual),
position="dodge")+
scale_x_discrete(name=element_blank(),labels=c("Classic Bike","Docked Bike",
"Electric Bike"))+
scale_y_continuous(name=expression(Number ~ of ~ Rides ~ (x10^5)),
labels=function(x) x / 100000,
limits=c(0,1700000))+
labs(fill="Customer Type")+
annotate("text",label="1,120,625",x=0.75,y=1180000)+
annotate("text",label="1,630,033",x=1.2,y=1700000)+
annotate("text",label="407,763",x=1.75,y=480000)+
annotate("text",label="270,169",x=2.25,y=350000)+
annotate("text",label="829,649",x=2.75,y=905000)+
annotate("text",label="877,563",x=3.25,y=957563)+
ggtitle("Number of Rides of Different Bike Types",
subtitle="Source: Cyclistic")
```
![*__Figure 4.2.3.1: Number of Rides of Different Bike Types from both Customer Groups__*](...xxx...xxx...)
From the bar chart above, it appears that the "**classic bike**" was the most popular bike type for both groups of customers. The least popular was the **docked** type, which may refer to bikes that must be picked up and returned ("docked") at designated stations.
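For reference, the counts annotated on the bars can be tabulated directly; a minimal sketch:
```{r Tallying rides per bike type and customer group, message=FALSE, warning=FALSE, eval=FALSE}
#A sketch: number of rides per bike type, split by customer group.
alldata_adj %>% count(rideable_type, member_casual)
```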
#### 4.2.4 Studying the Starting and Ending Locations of Different Customer Groups
Another aspect of customer behavior that can be studied is where riders in each group start and end their bike trips. Luckily, Cyclistic also supplied geolocation data for the starting and ending point of each bike trip.
**Map plots** are required to visualize this difference.
__NOTE__: Plotting maps requires installing the "**ggmap**" package and a **Google Maps API key**.
\
The following code plots map charts showing the starting and ending locations for each customer group:
```{r Plotting map plots to show starting and ending locations of different customer groups,message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.4.1: Creating map charts showing the starting and the ending points of each bike trip in each customer group
#Installing the ggmap package
install.packages("ggmap")
#Loading the ggmap package
library(ggmap)
#Registering Google Map API key
register_google(key="...xxx...xxx...xxx...")
#(The key can be obtained for free with limited usage from
#Google Cloud service)
#Getting coordinates information of "Chicago, Illinois"
chg <-geocode("Chicago, IL")
#Plotting map chart for starting locations
ggmap(get_map(chg,maptype="toner-lite"))+
theme(plot.background=element_blank(),
panel.background=element_blank(),
axis.title=element_blank(),
axis.ticks=element_blank(),
axis.text=element_blank())+
geom_point(data=alldata_adj,
mapping=aes(x=start_lng,y=start_lat,
color=member_casual),alpha=0.6)+
ggtitle("Bike Trip Starting Locations",
subtitle="Source: Cyclistic")+
scale_color_discrete(name="Customer Type")+
theme(legend.key=element_blank())
#Plotting map chart for ending locations
ggmap(get_map(chg,maptype="toner-lite"))+
theme(plot.background=element_blank(),
panel.background=element_blank(),
axis.title=element_blank(),
axis.ticks=element_blank(),
axis.text=element_blank())+
geom_point(data=alldata_adj,
mapping=aes(x=end_lng,y=end_lat,
color=member_casual),alpha=0.6)+
ggtitle("Bike Trip Ending Locations",
subtitle="Source: Cyclistic")+
scale_color_discrete(name="Customer Type")+
theme(legend.key=element_blank())
```
![*__Figure 4.2.4.1: Starting locations in Chicago, IL for both customer groups__* ](...xxx...xxx...)
![*__Figure 4.2.4.2: Ending locations in Chicago, IL for both customer groups__* ](...xxx...xxx...)
From the resulting map charts, it is apparent that the "**member**" customers have more fixed trip destinations than the "**casual**" group. The members' destinations are also more tightly clustered in the center of Chicago than those of the casual riders.
According to the information originally provided by Cyclistic, casual customers use the bikes as part of their leisure activities. This agrees with the observation above and suggests that casual customers ride for fun and sightseeing, going anywhere they please in Chicago, while members ride regularly to and from workplaces that are fixed points on the map.
#### 4.2.5 The Study of Time: How Many People Ride on Each Day of the Week and Each Hour of the Day?
Another revealing insight comes from the relationship between time and the number of rides. On what day of the week and at what time of day do people ride most? These questions can be answered by visualizing the date-time data originally provided by Cyclistic.
First, the number of rides on different days of the week was studied. The day of the week of each ride was extracted from the "**started_at**" column for this visualization.
__NOTE__: The "**lubridate**" package is required to work with datetime data.
The following code created a grouped bar chart visualizing total annual number of rides according to each day of the week:
```{r Visualizing number of rides according to each day of the week, message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.5.1: Creating a grouped bar chart visualizing total annual number of rides on different days of the week
#Installing "lubridate" package:
install.packages("lubridate")
#Loading the package:
library(lubridate)
#Counting number of rides on each day for "casual" customers
> alldata_adj %>% group_by(weekdays(alldata_adj$started_at)) %>%
+ filter(member_casual=="casual") %>% tally()
# A tibble: 7 x 2
`weekdays(alldata_adj$started_at)` n
<chr> <int>
1 Friday 339374
2 Monday 266417
3 Saturday 523829
4 Sunday 444997
5 Thursday 274664
6 Tuesday 251483
7 Wednesday 257273
#Counting number of rides on each day for "member" customers
> alldata_adj %>% group_by(weekdays(alldata_adj$started_at)) %>%
+ filter(member_casual=="member") %>% tally()
# A tibble: 7 x 2
`weekdays(alldata_adj$started_at)` n
<chr> <int>
1 Friday 405902
2 Monday 375841
3 Saturday 400268
4 Sunday 345273
5 Thursday 420539
6 Tuesday 406833
7 Wednesday 423109
#Getting statistics from "casual" customers
> alldata_adj %>% group_by(weekdays(alldata_adj$started_at)) %>%
+ filter(member_casual=="casual") %>% tally() %>% summary()
weekdays(alldata_adj$started_at) n
Length:7 Min. :251483
Class :character 1st Qu.:261845
Mode :character Median :274664
Mean :336862
3rd Qu.:392186
Max. :523829
#Getting statistics from "member" customers
> alldata_adj %>% group_by(weekdays(alldata_adj$started_at)) %>%
+ filter(member_casual=="member") %>% tally() %>% summary()
weekdays(alldata_adj$started_at) n
Length:7 Min. :345273
Class :character 1st Qu.:388055
Mode :character Median :405902
Mean :396824
3rd Qu.:413686
Max. :423109
#Plotting the chart
ggplot(data=alldata_adj)+
geom_bar(mapping=aes(x=weekdays(alldata_adj$started_at),
fill=member_casual),
position="dodge")+
scale_x_discrete(name=element_blank())+
scale_y_continuous(name=expression(Number ~ of ~ Rides ~ (x10^5)),
labels=function(x) x / 100000,
limits=c(0,700000))+
labs(fill="Customer Type")+
annotate("text",label="casual riders",
x=4,y=700000)+
annotate("text",label="max = 523,829 (SAT)",
x=4,y=650000)+
annotate("text",label="min = 251,483 (TUE)",
x=4,y=600000)+
annotate("text",label="avg = 336,862",
x=4,y=550000)+
annotate("text",label="member riders",
x=6,y=700000)+
annotate("text",label="max = 423,109 (WED)",
x=6,y=650000)+
annotate("text",label="min = 345,273 (SUN)",
x=6,y=600000)+
annotate("text",label="avg = 396,824",
x=6,y=550000)+
ggtitle("Number of Rides on Different Days of Week",
subtitle="Source: Cyclistic")
```
![*__Figure 4.2.5.1: Grouped bar plot showing the total annual number of rides according to different days of the week__*](...xxx...xxx...)
From the weekday bar chart, there is an obvious spike in casual rides on Saturdays and Sundays. The number of rides from the member group, on the other hand, looks roughly constant throughout the week and drops to its lowest on Sundays.
This observation may hint that most of the casual customers are visitors using the bikes for sightseeing trips around Chicago, while most of the member customers live and work regular jobs in Chicago. However, there is not enough data to conclude this at the moment.
To dig a little deeper into the time-related insights, another grouped bar chart was created to show the total number of rides at each hour of the day with the following code:
```{r Visualizing number of rides according to each hour of day ,message=FALSE, warning=FALSE, eval=FALSE}
#Code chunk 4.2.5.2: Create a data set with hourly data (Every 1 hour)
alldata_adjh <- alldata_adj %>%
  mutate(hours=hour(alldata_adj$started_at))
#Plot a bar chart to show hourly customers for the entire year
#!!must create a set of hours in the day first
hr <- c(0:23)
#Start plotting
ggplot(data=alldata_adjh)+
geom_bar(mapping=aes(x=hours,fill=member_casual),
position="dodge")+
scale_x_continuous(name="Hour of Day",