---
title: "Pivot"
date-modified: 'today'
date-format: long
format:
html:
footer: "CC BY 4.0 John R Little"
license: CC BY-NC
bibliography: references.bib
---
Often our goal is to reshape a data frame into a *tidy data* [@wickham2014] format. This shape is often called **tall data or long data**. The {[tidyr](https://tidyr.tidyverse.org/)} package has functions to reshape data into tall or wide formats. When coding in the tidyverse context, tall data is much easier to iterate over, without ever writing *for* loops or other kinds of flow control. Later, in the [advanced section, we'll introduce the {purrr} package](purrr.html "purrr") for more powerful iteration.
[![Image credit: Tidy data from the book \_R for Data Science\_](images/tidy_data.png){fig-alt="Tidy data"}](images/tidy_data.png)
Tidy data[^1]
[^1]: A robust discussion of *tidy data* can be found in *R for Data Science* [@wickham2023]: <https://r4ds.had.co.nz/tidy-data.html>
- Every variable is a column
- Every row is an observation
- Every cell is a single value
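
As a minimal sketch of these three rules, the made-up table below is written out twice, once wide and once tidy. The names `untidy`, `tidy`, `student`, `test`, and `score` are illustrative and not part of the workshop data. In the wide version the column names `test_1` and `test_2` are really values of an implicit `test` variable; the tidy version gives that variable its own column.

```{r}
library(tibble)

# Hypothetical wide data: one row per student, one column per test
untidy <- tibble(
  student = c("A", "B"),
  test_1  = c(90, 85),
  test_2  = c(80, 95)
)

# The same values in tidy form: one row per student-test observation
tidy <- tibble(
  student = c("A", "A", "B", "B"),
  test    = c("test_1", "test_2", "test_1", "test_2"),
  score   = c(90, 80, 85, 95)
)

untidy
tidy
```
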
Many of the [{dplyr} functions](https://dplyr.tidyverse.org/reference/index.html) help with making data tidy. In this chapter, we focus on two [{tidyr} functions](https://tidyr.tidyverse.org/reference/index.html): `pivot_longer()` and `pivot_wider()`.
See also the [*Pivot Vignette*](https://tidyr.tidyverse.org/articles/pivot.html).
## Load library packages
Load the {tidyverse} meta package, which loads the core tidyverse packages, including {tidyr}.
```{r}
#| warning: false
#| message: false
library(tidyverse)
```
## Data
Our practice datasets, `relig_income` and `fish_encounters`, ship with {tidyr}. Calling `data()` loads them into the environment.
```{r}
data(relig_income)
data(fish_encounters)
```
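
Before reshaping, it can help to check the dimensions and column names of each dataset. The sketch below uses base R's `dim()` and `names()` and assumes the datasets as they ship with {tidyr}.

```{r}
library(tidyr)

# relig_income: one row per religion, one column per income category
dim(relig_income)
names(relig_income)

# fish_encounters: one row per recorded fish-station encounter
dim(fish_encounters)
names(fish_encounters)
```
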
## Longer
> `pivot_longer()`
::: column-margin
```{=html}
<iframe width="300" height="200" src="https://www.youtube.com/embed/sspFC2m8fog" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
```
:::
We can start with the `tidyr::relig_income` data frame. This is wide data and does not conform to tidy data [@wickham2014] principles: each row holds 10 observations, one for each income category. That makes it harder to iterate by row.
```{r}
relig_income
```
We can pivot this to tall data (i.e. `pivot_longer`) so that we have one observation per row, for a total of 180 rows (18 religions × 10 income categories).
```{r}
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count")
```
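
As a quick check on that row count, the sketch below stores the pivoted result in `relig_long` (a name introduced here for illustration) and compares its number of rows with the number of religions times the number of income columns. It also names the `cols` argument, which the chunk above passes by position.

```{r}
library(tidyr)

relig_long <- relig_income %>%
  pivot_longer(
    cols = !religion,      # every column except religion
    names_to = "income",   # old column names go here
    values_to = "count"    # old cell values go here
  )

# Rows in the long data = religions x income columns
nrow(relig_long) == nrow(relig_income) * (ncol(relig_income) - 1)
```
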
## Wider
Of course, sometimes we want wide data. There are a variety of reasons to wrangle data into a wide-data format. We use `pivot_wider` to accomplish this.
```{r}
fish_encounters
```
```{r}
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
```
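
The wide result contains `NA` wherever a fish was not recorded at a station. One option, shown in the {tidyr} pivot vignette, is the `values_fill` argument. The sketch below fills those gaps with `0`, on the assumption that a missing record means the fish was not seen. The name `fish_wide` is introduced here for illustration.

```{r}
library(tidyr)

fish_wide <- fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0        # assumption: no record means not seen
  )

fish_wide
```
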
## Why pivot data?
Your analysis may require data in a particular shape. For example, ggplot2 generally prefers long, tidy data. Once the data are properly shaped, analysis and variations on it become easier. Below is a quick example that pivots the data into a long, tidy shape and passes the result to ggplot2 to draw a bar plot. This is only a first draft and the plot needs refining, but those refinements are easier to make with tall data.
```{r}
#| code-fold: true
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count") %>%
  ggplot(aes(religion, count, fill = income)) +
  geom_col()
```
Next, I use the {forcats} package to convert some variables to categorical vectors (i.e. factors) so that I can redraw the plot for clarity. This will allow me to arrange

- the order of the bars
- the order of the stacked elements of each bar
- the order of the legend

I will also change the discrete color scheme mapped to the `fill` aesthetic by using the `scale_fill_viridis_d()` function.
```{r}
#| code-fold: true
inc_levels <- c("Don't know/refused",
                "<$10k", "$10-20k", "$20-30k", "$30-40k",
                "$40-50k", "$50-75k", "$75-100k", "$100-150k",
                ">150k")
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count") %>%
  mutate(income = fct_relevel(income, inc_levels)) %>%
  ggplot(aes(fct_reorder(religion, count),
             count, fill = fct_rev(income))) +
  geom_col() +
  scale_fill_viridis_d(direction = -1) +
  coord_flip()
```
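
To see what the factor helpers in that chunk do on their own, here is a minimal sketch with a made-up factor. The names `size` and `size_ordered` are hypothetical. `fct_relevel()` sets the level order by hand and `fct_rev()` reverses it, which is how the stacking and legend order above are controlled.

```{r}
library(forcats)

# Hypothetical factor; levels default to alphabetical order
size <- factor(c("small", "large", "medium", "small"))
levels(size)

# Put the levels in a meaningful order
size_ordered <- fct_relevel(size, "small", "medium", "large")
levels(size_ordered)

# Reverse that order, e.g. to flip a legend
levels(fct_rev(size_ordered))
```
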
Nonetheless, the un-pivoted, wide data can be subset and visualized, even though this is not ideal when attempting more complex visualization variations. Here, un-pivoted, I will make a bar chart of religious affiliation for incomes between \$40k and \$50k. Since I'm using just one variable, the code is not hard to compose. But note, the data are not in a tidy-data format.
```{r}
#| code-fold: true
relig_income %>%
  ggplot(aes(fct_reorder(religion, `$40-50k`), `$40-50k`)) +
  geom_col() +
  coord_flip()
```
Tidy data, produced here with `pivot_longer`, will be easier to manipulate with `ggplot2`. For example, you can subset the data with a single `filter` function, which makes it easier to chart different income categories. Below, the code is easier to read and easier to modify if I want to use a different income value.
> `filter(income == "$40-50k")`
```{r}
#| code-fold: true
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count") %>%
  filter(income == "$40-50k") %>%
  ggplot(aes(fct_reorder(religion, count), count)) +
  geom_col() +
  coord_flip()
```
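
Because the income category is now a value in a column rather than a column name, the chart can be wrapped in a small function. This is only a sketch: `plot_income()` and its `bracket` argument are hypothetical names introduced here, not part of any package.

```{r}
library(tidyverse)

# Hypothetical helper: bar chart of counts for one income category
plot_income <- function(bracket) {
  relig_income %>%
    pivot_longer(-religion, names_to = "income", values_to = "count") %>%
    filter(income == bracket) %>%
    ggplot(aes(fct_reorder(religion, count), count)) +
    geom_col() +
    coord_flip() +
    labs(x = NULL, title = bracket)
}

plot_income("$20-30k")
```

Changing the chart to another income category then only requires a different string, for example `plot_income(">150k")`.
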
Getting more complex, a natural step is to make comparisons across incomes. To do this, we use `ggplot2::facet_wrap()`.
```{r}
#| code-fold: true
#| fig-height: 9
#| fig-width: 10
#| warning: false
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count") %>%
  mutate(income = fct_relevel(income, inc_levels)) %>%
  ggplot(aes(fct_reorder(religion, count),
             count)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(vars(income), nrow = 2)
```
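Another variation: lump all but the four largest religions into an "Other" category, then highlight two income categories in color. Again, ggplot2 affordances are easier to leverage with tall data.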
```{r}
#| code-fold: true
#| fig-height: 4
#| fig-width: 10
#| warning: false
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count") %>%
  mutate(religion = fct_lump_n(religion, 4, w = count)) %>%
  mutate(income = fct_relevel(income, inc_levels)) %>%
  group_by(religion, income) %>%
  # drop the grouping after summarising to avoid the regrouping message
  summarise(sumcount = sum(count), .groups = "drop") %>%
  ggplot(aes(fct_reorder(religion, sumcount),
             sumcount)) +
  geom_col(fill = "grey80", show.legend = FALSE) +
  geom_col(data = . %>% filter(income == "$40-50k"),
           fill = "firebrick") +
  geom_col(data = . %>% filter(income == ">150k"),
           fill = "forestgreen") +
  coord_flip() +
  facet_wrap(vars(income), nrow = 2)
```
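
The lumping step in that chunk may be easier to follow on a toy example. In this sketch, `group`, `weight`, and `lumped` are made-up names. `fct_lump_n()` keeps the `n` levels with the largest total weight and collapses the rest into `"Other"`.

```{r}
library(forcats)

# Hypothetical categories with a weight for each row
group  <- factor(c("a", "b", "c", "d", "e"))
weight <- c(50, 30, 10, 5, 5)

# Keep the 2 heaviest levels; lump the rest into "Other"
lumped <- fct_lump_n(group, n = 2, w = weight)
levels(lumped)
```
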