---
title: "Data Types"
output: html_notebook
---
<!-- This file by Charlotte Wickham is licensed under a Creative Commons Attribution 4.0 International License, adapted from the original work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->
```{r setup}
library(tidyverse)
library(lubridate)
# Example of a factor
eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green"))
# An example for times/dates
library(fivethirtyeight)
births <- US_births_1994_2003 %>%
  select(date, births)
```
## Warm-up / Review
Using the `gss_cat` data, find the average hours of TV watched (`tvhours`) for each category of marital status (`marital`).
```{r}
gss_cat
```
## Your Turn 1
What kind of object is the `marital` variable?
```{r}
gss_cat
```
Brainstorm with your neighbor: list everything you know about that kind of object.
# Factors
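As a quick reference, here is a minimal sketch of what a factor stores, using the toy `eyes` factor from the setup chunk (redefined here so the chunk runs on its own). It only uses base R and is meant as an illustration, not a complete tour of factors.

```{r}
# Same toy factor as in the setup chunk
eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green"))

class(eyes)       # the object's class
levels(eyes)      # the allowed values, including the unused "brown"
as.integer(eyes)  # the integer codes stored under the hood
table(eyes)       # counts per level; unused levels are counted as 0
```

Note that `"brown"` remains a level even though no observation uses it.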
## Your Turn 2
Fix your summary of average hours of TV watched (`tvhours`) by marital status (`marital`) so that it drops missing values in `tvhours`, then create a plot to examine the results.
```{r}
gss_cat %>%
  group_by(marital) %>%
  summarise(avg_tvhours = mean(tvhours))
```
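The reason the summary needs fixing is that `mean()` returns `NA` as soon as its input contains a missing value. A small sketch with made-up numbers (the `hours` vector is hypothetical, not taken from `gss_cat`) shows the behaviour:

```{r}
# Made-up values for illustration only
hours <- c(2, NA, 4)

mean(hours)                # NA: one missing value poisons the mean
mean(hours, na.rm = TRUE)  # 3: the mean of the non-missing values
```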
## Your Turn 3
Fill in the blanks (`___`) to explore the average hours of TV watched by religion.
```{r, error = TRUE}
gss_cat %>%
  drop_na(___) %>%
  group_by(___) %>%
  summarise(___) %>%
  ggplot() +
    geom_point(mapping = aes(x = ___, y = ___))
```
## Quiz
Why is this plot not very useful?
```{r}
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(denom) %>%
  summarise(avg_tvhours = mean(tvhours)) %>%
  ggplot() +
    geom_point(mapping = aes(x = avg_tvhours,
                             y = fct_reorder(denom, avg_tvhours)))
```
## Your Turn 4
Edit the code to also relabel some other Baptist denominations:
* "Baptist-dk which" -> "Baptist - Don't Know"
* "Other baptists" -> "Baptist - Other"
```{r}
gss_cat %>%
  mutate(denom = fct_recode(denom,
    "Baptist - Southern" = "Southern baptist")
  ) %>%
  pull(denom) %>%
  levels()
```
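For reference, `fct_recode()` takes `"new label" = "old label"` pairs and accepts several pairs in one call. The sketch below uses the toy `eyes` factor rather than `gss_cat`, so the new labels here are purely illustrative:

```{r}
library(forcats)

eyes <- factor(x = c("blue", "green", "green"),
               levels = c("blue", "brown", "green"))

# Each argument is "new label" = "old label"; unlisted levels are left alone
eyes2 <- fct_recode(eyes,
  "Blue"  = "blue",
  "Green" = "green")

levels(eyes2)
```

Levels that are not mentioned (`"brown"` here) keep their original label and position.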
## Your Turn 5
What does the function `detect_denom()` do?
```{r}
detect_denom <- function(x){
  case_when(
    str_detect(x, "[Bb]ap") ~ "Baptist",
    str_detect(x, "[Pp]res") ~ "Presbyterian",
    str_detect(x, "[Ll]uth") ~ "Lutheran",
    str_detect(x, "[Mm]eth") ~ "Methodist",
    TRUE ~ x
  )
}
gss_cat %>% pull(denom) %>% levels() %>% detect_denom()
```
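If `case_when()` is unfamiliar, the following sketch may help. It evaluates each condition in order and uses the right-hand side of the first one that is `TRUE`; a final `TRUE ~ ...` acts as a fallback. The `drinks` vector is made up for illustration.

```{r}
library(dplyr)
library(stringr)

# Made-up strings for illustration only
drinks <- c("Apple juice", "apple cider", "Orange soda")

case_when(
  str_detect(drinks, "[Aa]pple") ~ "apple",  # first matching condition wins
  TRUE ~ drinks                              # fallback: keep the original value
)
```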
# Strings
With your neighbor, predict what these might return:
```{r}
strings <- c("Apple", "Pineapple", "Orange")
str_detect(strings, pattern = "pp")
str_detect(strings, pattern = "apple")
str_detect(strings, pattern = "[Aa]pple")
```
Then run them!
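For comparison, here is the same idea on a different, made-up vector. `str_detect()` returns one logical value per string, matching is case sensitive, and a character class such as `[Bb]` matches either case:

```{r}
library(stringr)

# Made-up strings for illustration only
words <- c("Banana", "bandana", "Cherry")

str_detect(words, pattern = "an")      # TRUE  TRUE FALSE
str_detect(words, pattern = "ban")     # FALSE TRUE FALSE (capital B does not match)
str_detect(words, pattern = "[Bb]an")  # TRUE  TRUE FALSE
```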
# Times and Dates
## Your Turn 7
For each of the following formats (of the same date), pick the right function from the `ymd()` family to parse it:
```{r}
"2018 Feb 01"
"2-1-18"
"01/02/2018"
```
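As a hint, lubridate names its parsing functions after the order of the components in the input: `y` for year, `m` for month, `d` for day. The sketch below uses a different date (15 March 2017, chosen arbitrarily) written three ways:

```{r}
library(lubridate)

# The letters in the function name give the order of year, month and day
ymd("2017-03-15")
mdy("March 15, 2017")
dmy("15/03/2017")
```

All three calls return the same `Date`.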
## Your Turn 8
Fill in the blanks to:
* Extract the month from date.
* Extract the year from date.
* Find the total births for each year/month.
* Plot the results as a line chart.
```{r, error = TRUE}
births %>%
  mutate(year = ___,
         month = ___) %>%
  group_by(___, ___) %>%
  summarise(total_births = ___) %>%
  ggplot() +
    geom_line(aes(x = month, y = total_births, group = year))
```
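lubridate's accessor functions pull individual components out of a date. A minimal sketch on a single, arbitrarily chosen date (not from `births`):

```{r}
library(lubridate)

d <- ymd("2017-03-15")

year(d)                 # 2017
month(d)                # 3
month(d, label = TRUE)  # the month as an ordered factor with a text label
```

The same functions work on whole columns inside `mutate()`.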
# Take Aways
dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data.

Package | Data Type
--------- | --------
forcats | factors
stringr | strings
hms | times
lubridate | dates and times
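
The hms package is listed above but not used in the exercises. As a small sketch, assuming the package is installed, a time of day can be built from its components; the time itself is made up:

```{r}
# hms::hms() is called with its namespace because lubridate also exports a function named hms()
lunch <- hms::hms(seconds = 56, minutes = 34, hours = 12)

lunch              # prints as 12:34:56
as.numeric(lunch)  # stored as seconds since midnight
```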