This repository has been archived by the owner on Apr 7, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
04-exemplar.Rmd
298 lines (185 loc) · 18.8 KB
/
04-exemplar.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
# Exemplar RAP {#exemplar}
Chapter \@ref(why) considered why RAP is a useful paradigm.
In this Chapter we demonstrate a RAP package developed in collaboration with the Department for Culture Media and Sport (DCMS). This package enshrines all the pertinent business knowledge in one corpus.
```{r fig.align='center', echo=FALSE, include=identical(knitr:::pandoc_to(), 'html')}
knitr::include_graphics('images/eesectors_package.png', dpi = NA)
```
## Package Purpose
In this exemplar project Matt Upson aimed at a high level of automation to demonstrate what is possible, and because DCMS had a skilled data scientist on hand to maintain and develop the project. Nonetheless, in the course of the work, statisticians at DCMS continue to undertake training in R, and the [Better Use of Data Team](https://data.blog.gov.uk/) spent time to ensure that the software development practices such as managing [software dependencies](https://www.gov.uk/service-manual/technology/managing-software-dependencies), [version control](https://www.gov.uk/service-manual/technology/maintaining-version-control-in-coding), [package development](http://r-pkgs.had.co.nz/), [unit testing](http://r-pkgs.had.co.nz/tests.html), style [guide](http://adv-r.had.co.nz/Style.html), [open by default](https://www.gov.uk/service-manual/technology/making-source-code-open-and-reusable) and [continuous integration](https://www.r-bloggers.com/continuous-integration-for-r-packages/) are embedded within the team that owns the publication.
We’re continuing to support DCMS in the development of this prototype pipeline, with the expectation that it will be used operationally in 2017. If you want to learn more about this project, the source code for the eesectors R package is maintained on [github.com](https://github.com/ukgovdatascience/eesectors). The README provides instructions on how to test the package using the openly published data from the 2016 publication.
## Tidy data
> Tidy data are all alike; every messy data is messy in its own way. - Hadley Tolstoy
What is the [simplest representation](http://vita.had.co.nz/papers/tidy-data.html) of the data possible? Prior to any analysis we must tidy our data: structuring our data to facilitate analysis.
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. You and your team trying to RAP should spend time reading this [paper](http://vita.had.co.nz/papers/tidy-data.pdf) and hold a seminar discussing it. It's important to involve the analysts involved in the traditional production of this report as they will be familiar with the inputs and outputs of the report.
With the heuristic of a tidy dataset in your mind, proceed, as a team, to look through the chapter or report you are attempting to produce using RAP. As you work through, note down what variables you would need to produce each table or figure, what would the input dataframe look like? (Say what you see.) After looking at all the figures and tables, is there one tidy daaset that could be used as input? Sketch out what it looks like.
### eesectors tidy data
We demonstrate this process using the DCMS publication, refer to [Chapter 3 - GVA](https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/544103/DCMS_Sectors_Economic_Estimates_-_August_2016.pdf).
What data do you need to produce this table?
```{r fig.align='center', echo=FALSE, include=identical(knitr:::pandoc_to(), 'html'), fig.link='https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/544103/DCMS_Sectors_Economic_Estimates_-_August_2016.pdf'}
knitr::include_graphics('images/gva_table_3_1.png', dpi = NA)
```
Variables: Year, Sector, GVA
What data do you need to produce this figure?
```{r fig.align='center', echo=FALSE, include=identical(knitr:::pandoc_to(), 'html'), fig.link='https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/544103/DCMS_Sectors_Economic_Estimates_-_August_2016.pdf'}
knitr::include_graphics('images/gva_fig_3_1.png', dpi = NA)
```
The GVA of each Sector by Year.
Variables: Year, Sector, GVA
What data do you need to produce this figure?
```{r fig.align='center', echo=FALSE, include=identical(knitr:::pandoc_to(), 'html'), fig.link='https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/544103/DCMS_Sectors_Economic_Estimates_-_August_2016.pdf'}
knitr::include_graphics('images/gva_fig_3_2.png', dpi = NA)
```
Total GVA across all sectors.
Variables: Year, Sector, GVA
What data do you need to produce this figure?
```{r fig.align='center', echo=FALSE, include=identical(knitr:::pandoc_to(), 'html'), fig.link='https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/544103/DCMS_Sectors_Economic_Estimates_-_August_2016.pdf'}
knitr::include_graphics('images/gva_fig_3_3.png', dpi = NA)
```
For each Year by Sector we need the GVA.
Variables: Year, Sector, GVA
### What does our eesectors tidy data look like?
Remember, for tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Our tidy data is of the form **Year - Sector - GVA**:
```{r table2, echo=FALSE, message=FALSE, warnings=FALSE, results='asis'}
tabl <- "
| Year | Sector | GVA |
|---------------|:-------------:|------:|
| 2010 | creative | 65188 |
| 2010 | culture | 20291 |
| 2010 | digital | 97303 |
| 2011 | creative | 69398 |
| 2011 | culture | 20954 |
| 2011 | digital | 107303 |
"
cat(tabl) # output the table in a format good for HTML/PDF/docx conversion
```
*This data is for demonstration purposes only.*
#### Another worked example - what does our SEN tidy data look like?
We repeat the process above for a different publication related to [Special Educational Needs data](https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/633031/SFR37_2017_Main_Text.pdf) to demonstrate the thought process. We suggest you attempt to do this independently without peeking at the solution below, that way you can test your understanding.
Look at the final report; work through and think about what data you need to produce each figure or table (write out the variables then sketch the minimal tidy data set required to build it). Ideally there will be one minimal tidy data set that we can build as input for our functions to produce these figures, tables or statistics. If a report covers a broad topic it might not be possible to have one minimal tidy data set (it’s OK to have more than one). We can create our own [custom class](http://adv-r.had.co.nz/OO-essentials.html) of object to cope and keep things simple for the user of our package.
Here we draw our tables in a pseudo csv format to digitise for sharing. Sketching with pencil and paper is also fine and much clearer! I also use shorthand for some of the variable names given in the publication.
##### Figure A
year, all students, total sen, sen without statement or EHC plan, sen with statement or EHC plan ...
##### Figure B
This digs deeper than Fig A by counting and categorising students (converted into percentage) by their primary type of need. Thus our minimal table above will not meet the needs for Figure B. We’ll add in some example made-up data to check I understand the data correctly (the type of the data is the important thing e.g. date, integer, string). It’s important here to have expert domain knowledge as one might misunderstand due to esoteric language use.
year, sen_status, sen_category, sen_primary_type_need, count
2016, 0, NA, NA, 3e6
2016, 1, “SEN Support", “Specific Learning Difficulty", 5000
2016, 1, “Statement or EHC Plan", “Specific Learning Difficulty", 1500 2016, 1, “SEN Support", “Moderate Learning Difficulty", 5000
2017, ...
** Question: using the data table above can you produce both Figure A and B? **
With our data structured like this we have all the data we need to produce Figure B and Figure A.
##### Figure C
Again we dig deeper and ask what’s their school type? We don’t have this in our previous minimal data table so we need to include this variable in our dataframe.
year, school_type, sen_status, sen_category, sen_primary_type_need, count
2016, “State-funded primary”, 0, NA, NA, 3e6
2016, “State-funded primary”, 1, “SEN Support", “Specific Learning Difficulty", 5000
2016, “State-funded primary”, 1, “Statement or EHC Plan", “Specific Learning Difficulty", 1500 2016, “State-funded primary”, 1, “SEN Support", “Moderate Learning Difficulty", 5000
...
As you can imagine the table can end up being quite long!
** Question: using the data table above can you produce both Figure A, B and C? **
Yes.
Continue this thought process for the rest of the document. However, bear in mind that you have the added insight of where the data comes from and in what format, this might affect your using more than one data class for the package. For example you could call the one we described above as your "year-sch-sen” class, and have another data class dedicated to being the input for some of the other figures in the chapter.
The data might come from an SQL query or a bunch of disparate spreadsheets. In the later case we can write some functions to extract and combine the data into a minimal tidy data table for use in our package. See eesectors [README](https://github.com/DCMSstats/eesectors/blob/master/README.md) for an example.
### How to build your tidy data?
With the minimal tidy dataset idea in place, you can begin to think about how you might construct this tidy dataset from the data stores you have availiable.
As we are working in R we can formalise this minimal tidy dataset as a [class](http://adv-r.had.co.nz/OO-essentials.html). For our `eesectors` package we create our long data `year_sector_data` class as the fundamental input to create all our figures and tables for the output report.
## `eesectors` Package Exploration
The following is an exploration of the `eesectors` package to help familiarise users with the key principles so that they can automate report production through package development in R using `knitr`. This examines the package in more detail compared to the README so that data scientists looking to implement RAP can note some of the characteristics of the code employed.
### Installation
The package can then be installed using
`devtools::install_github('ukgovdatascience/eesectors')`. Some users may
not be able to use the `devtools::install_github()` commands as a result
of network security settings. If this is the case, `eesectors` can be
installed by downloading the [zip of the
repository](https://github.com/ukgovdatascience/govstyle/archive/master.zip)
and installing the package locally using
`devtools::install_local(<path to zip file>)`.
#### Version control
As the code is stored on Github we can access the current master version as well as all [historic versions](https://github.com/ukgovdatascience/eesectors/releases). This allows me to reproduce a report from last year if required. I can look at what release version was used and install that accordingly using the [additional arguments](ftp://cran.r-project.org/pub/R/web/packages/githubinstall/vignettes/githubinstall.html) for `install_github`.
### Loading the package
Installation means the package is on our computer but it is not loaded into the computer's working memory. We also load any additional packages that might be useful for exploring the package or data therein.
```{r, eval=FALSE}
devtools::install_github('ukgovdatascience/eesectors')
```
```{r, echo = FALSE}
library(eesectors)
```
This makes all the functions within the package available for use. It also provides us with some R [data objects](https://github.com/ukgovdatascience/eesectors/tree/master/data), such as aggregated data sets ready for visualisations or analysis within the report.
> Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. - Hadely Wickham
### Explore the package
A good place to start is the package [README](https://github.com/ukgovdatascience/eesectors).
#### Status badges
The [status badges](https://stackoverflow.com/questions/35563012/what-are-the-status-tags-like-build-passing) provide useful information. They are found in the top left of the README and should be green and say passing. This indicates that this package will run OK on Windows and linux or mac. Essentially the package is likely to build correctly on your machine when you install it. You can carry out these build tests locally using the [`devtools` package](https://github.com/hadley/devtools).
#### Look at the output first
If you go to Chapter 3 of the [DCMS publication](https://www.gov.uk/government/statistics/dcms-sectors-economic-estimates-2016) it is apparent that most of the content is either data tables of summary statistics or visualisation of the data. This makes automation particularly useful here and likely to make time savings. Chapter 3 seems to be fairly typical in its length (if not a bit shroter compared to other Chapters).
This package seems to work by taking the necessary data inputs as arguments in a function then outputting the relevant figures. The names of the functions match the figures they produce. Prior to this step we have to get the data in the correct format.
If you look at the functions within the package within R Studio using the package navigator it is evident that there are a function of families dedicated to reading Excel spreadsheets and collecting the data in a tidy .Rds format. These are given the funciton name-prefix of `extract_` (try to give your functions [good names](http://adv-r.had.co.nz/Style.html)).
The `GVA_by_sector_2016` provides test data to work with during development. This will be important for the development of other packages for different reports. You need a precise understanding of how you go from raw data, to aggregated data (such as `GVA_by_sector_2016`) to the final figure. What are your inputs (arguments) and outputs? In some cases where your master data is stored in a particularly difficult for a machine to read you may prefer having a human to this extraction step.
```{r}
dplyr::glimpse(GVA_by_sector_2016)
x <- GVA_by_sector_2016
```
#### Automating QA
Human's are not particularly good at Quality Assurance (QA), especially when working with massive spreadsheets it's easy for errors to creep in. We can automate alot of the sense checking and update this if things change or a human provides another creative test to use for sense checking. If you can describe the test to a colleague then you can code it.
The author uses messages to tell us what checks are being conducted or we can look at the body of the function if we are interested. This is useful if you are considering developing your own package, it will help you struture the message which are useful for the user.
```{r}
gva <- year_sector_data(GVA_by_sector_2016)
```
This is a semi-automated process so the user should check the Checks and ensure they meet their usual checks that would be conducted manually. If a new check or test becomes necessary then it should be implemented by changing the code.
```{r}
body(year_sector_data)
```
The function is structured to tell the user what check is being made and then running that check given the input `x`. If the input fails a check the function is stopped with a useful diagnostic message for the user. This is achieved using `if` and the opposite of the desired feature of `x`.
```{r eval=FALSE}
message("Checking x has correct columns...")
if (length(colnames(x)) != 3)
stop("x must have three columns: sector, year, and one of GVA, export, or x")
```
For example, if `x` does not have exactly three columns we `stop`.
#### Output of this function
The output object is different to the input as expected, yet it does contain the initial data.
```{r}
identical(gva$df, x)
```
The rest of the list contains other details that could be changed at a later date if required, demonstrating defensive programming. For example, the sectors that are of interest to DCMS have changed and may change again.
```{r eval=FALSE}
?year_sector_data
```
Let's take a closer look at this function using the help and other standard function exploration functions.
The help says it produces a custom class of object with five slots.
```{r}
isS4(gva)
class(gva)
```
It's not actually an S4 object, by slots the author means a list of objects. This approach is sensible and easy to work with, as most users are familiar with [S3](http://adv-r.had.co.nz/S3.html).
#### The input
The input, which is likely a bunch of [not tidy or messy](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) spreadsheets needs to be wrangled and aggregated (if necessary) for input into the functions prefixed by `figure`.
```{r}
dplyr::glimpse(GVA_by_sector_2016)
```
#### The R output
> We build our functions to use the same simple, tidy, data. - Matt Upson
With the data in the appropriate form to be received as an argument or input for the `figure` family of functions, we can proceed to plot.
```{r}
figure3.1(x = gva)
```
Again we can look at the details of the plot. We could change the body of the function to affect change to the default plot or we can pass additional `ggplot` arguments to it.
Reading the code we see it filters the data, makes the variables it needs, refactors the `sector` variable and then plots it.
```{r}
body(figure3.1)
```
We can inspect and change an argument if we feel inclined or if a new colour scheme becomes preferred for example. However, there is no `...` in the body of the function itself so where does this argument get passed to?
This all looks straight forward and we can inspect the other functions for generating the figures or plot output.
```{r}
body(figure3.2)
body(figure3.3)
```
#### Error handling
A point of interest in the code with which some users may be unfamiliar is `tryCatch` which is a function that allows the function to catch conditions such as warnings, errors and messages. We see this towards the end of the function where if either of these conditions are produced then an informative message is produced (in that it tells you in what function there was a problem). The structure here is simple and could be copied and pasted for use in automating other figures of other chapters or statistical reports.
For a comprehensive introduction see [Hadley's Chapter](http://adv-r.had.co.nz/Exceptions-Debugging.html#condition-handling).
## Chapter plenary
We have explored the `eesectors` package from the perspective of someone wishing to develop our own semi-automated chapter production through the development of a package in R. This package provides a useful tempplate where one could copy the foundations of the package and workflow.