-
Notifications
You must be signed in to change notification settings - Fork 0
/
IMD_D2_M2_Data_Cleaning.Rmd
159 lines (117 loc) · 5.5 KB
/
IMD_D2_M2_Data_Cleaning.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
title: "Day 2 Module 2"
author: "Lauren Pandori & John Paul Schmit"
date: "12/15/2021"
output: html_document
---
#### Data Cleaning {.tabset}
<details open><summary class='drop'> Data Cleaning</summary>
Data wrangling is all the steps you have to take to prepare your data for analysis or plotting. Data munging or data cleaning are other terms used to describe this process.Material presented today can be found in Hadley Wickham's book "R for Data Science", which is free and online. <https://r4ds.had.co.nz/wrangle-intro.html>
If you're starting from this tab of the lesson, you'll need to load two packages before proceeding.
```{r echo=TRUE, message = FALSE, warning = FALSE}
library('janitor')
library('tidyverse')
```
Major Components to Data Wrangling
- Step 1: Import (covered yesterday)
- Step 2+3: Tidy & Transform (get data into a useful format)
</details>
<br>
</details><br><details open><summary class='drop'> Column Names</summary>
Sometimes you get data where the column names are messy. For an example of data that needs tidying before exploration and analyses, we'll use fake data. Copy + paste the code chunk below into your script, and run it in the console to generate a dataframe called "messy_data".
Here, we'll clarify that we are using the word "data.frame" or "dataframe" to describe the data. The word "tibble" also shows up. We'll use these terms interchangeably.
```{r, echo = TRUE}
messy_data <- tibble(TrailName = c('Bayside', 'Coastal', 'Tidepools',
'Monument', 'Tidepools', NA),
Visitor_count = c(200, 300, 400, 500, 400, NA),
`empty row` = NA)
```
This data.frame captures the hypothetical number of visitors across four trails at a park. There are a few qualities that make it difficult to interpret. Seeing this data, what would you change to make it most consistent?
```{r, echo = TRUE}
messy_data
```
There are four things I suggest fixing with these data:
1. The column names are inconsistently formatted
2. There is an empty row
3. There is an empty column
4. There is a duplicated row
The Janitor package has ways to fix all of these data situations and aid in data cleaning. Let's proceed with changing the column names.
The clean_names() function in the janitor package will consistently format column names.
```{r, echo = TRUE}
clean_data <- clean_names(messy_data)
clean_data
```
</details>
<br>
<details open><summary class='drop'> Empty Rows and Columns </summary>
Now, we'll deal with the empty row and empty column. To remove empty rows or columns, we can use the remove_empty() function.
The arguments for the function are as follows:
```{r eval=FALSE, include=TRUE}
remove_empty(
# first argument is the data
dat,
# the next specifies if you want to get rid of empty rows, columns, or both
which = c("rows", "cols"),
# this last one asks whether (TRUE) or not (FALSE) you want R to tell you which rows or columns were removed
quiet = TRUE)
```
Let's give it a try
Removing empty rows
```{r, echo = TRUE}
clean_data2 <- remove_empty(clean_data, which = c('rows'), quiet = FALSE)
clean_data2
```
Removing empty rows and columns
```{r, echo = TRUE}
clean_data2 <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data2
```
</details>
<br>
<details open><summary class='drop'> The Pipe - Sequences of Functions </summary>
What happens if you'd like to both clean_names() and remove_empty() on the same data?
Solution 1 - run as a sequence
```{r, echo = TRUE}
clean_data <- clean_names(messy_data)
clean_data <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data
```
Solution 2 - use the pipe operator
Pipe operators (%>% or |>) can be read like the word "then".
For example, clean_names() %>% remove_empty() can be read as "clean names, then remove empty".
They allow for a more clean workflow.
```{r, echo = TRUE}
clean_data <- clean_names(messy_data) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE)
clean_data
# note that replacing %>% with |> also works
```
Note that there is no data argument in the remove_empty() function. This is because, when using a pipe, the data argument is inherited from the prior function.
</details>
<br>
<details open><summary class='drop'> Removing Duplicate Entries </summary>
A final function we'll discuss is distinct() from the dplyr package, which is park of tidyverse. It removes duplicate rows from data.
```{r, echo = TRUE}
clean_data <- clean_names(messy_data) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
distinct()
clean_data
```
A word of caution - sometimes you may not want to remove duplicate rows, or may not want to remove empty rows, because they are important to the data. You know your data best. Use functions responsibly.
</details>
<br>
<details open><summary class='drop'> Challenge </summary>
The code chunk below creates messy data, called "messy_data2". Use functions from the Janitor package to tidy the data, and name the result "clean_data2". See if you can use the pipe operator to streamline the script.
```{r, echo = TRUE}
messy_data2 <- tibble(`Park Code` = c(NA, NA, 'CABR', 'CHIS', 'SAMO', 'SAMO'),
visitor_count = c(NA, NA, 400, 500, 400, 400),
`empty row` = NA)
```
</details>
<br>
<details open><summary class='drop'> Solution </summary>
```{r, echo = TRUE}
clean_data2 <- clean_names(messy_data2) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
distinct()
```