---
title: "Coffee and Coding http://cnc.machinegurning.com"
author: "Matthew Upson matthew.upson@digital.cabinet-office.gov.uk"
date: '2016-10-26'
output:
html_document: default
html_notebook: default
---
```{r setup, include=FALSE}
checkpoint::setSnapshot('2016-10-14')
knitr::opts_chunk$set(
error = TRUE
)
```
## Test Driven Development (TDD), with `testthat` and `purrr`
Imagine that we want to normalise a vector, which is a common task in machine learning.
Let's think for a second what we would need to do.
Mathematically it might look something like this:
$$
\vec{y} = \frac{\vec{x} - \min(\vec{x})}{\max(\vec{x}) - \min(\vec{x})}
$$
Where:
* $\vec{y}$ is the normalised vector,
* $\vec{x}$ is our original vector.
Before setting out to write the code, there are a few things we already expect to be true of $\vec{y}$.
We can list some of these expectations, for example:
* $\vec{y}$ should all be between $0$ and $1$
* $\vec{y}$ should be the same length as $\vec{x}$
* $\vec{y}$ should be numeric (class `double`/`numeric` in R)
Let's try this. First, let's create a vector `x` to play with, drawn from a uniform distribution with a minimum of 10 and a maximum of 100.
```{r}
x <- runif(100, min = 10, max = 100)
x
```
Now to normalise:
```{r}
y <- (x - min(x))/(max(x) - min(x))
y
```
Now we can test our assumptions:
What is the range of $\vec{y}$?
```{r}
range(y)
```
How long is $\vec{y}$?
```{r}
length(y)
```
What class is $\vec{y}$?
```{r}
class(y)
```
All good so far, but this is quite long-winded, and if we had to check that our code worked like this every time, we would have to write a lot of code.
We would also start to break one of the tenets of good software development: *DRY* or *D*on't *R*epeat *Y*ourself.
A better solution would be to generalise our code into a function.
In this way we can hide away the complexities of the process, keeping our code nice and tidy, and just call our function every time we want to normalise something.
This is also important when we come to talk about applying functions over a list using `purrr`.
But, we can go further than this.
As well as formalising our code into a function, we can also formalise our testing process, so that these tests get completed automatically.
This is the idea behind Test Driven Development, or TDD.
The 'Driven' part is that we write our tests FIRST, not after we write the code we want to test.
In this way, we formalise our expectations of what our code will do.
The process then becomes: we write the tests for a *unit* of code, and these tests will fail until we write a complete function whose output passes them.
Only then do we move on to write the tests for the next unit, which of course will fail until we write the next unit of code.
`testthat` is a unit testing framework for R, which makes TDD very easy.
Here I will re-use the normalisation example but with a TDD approach.
First I define my tests, which encapsulate our expectations of what our code should do.
Then I write a function to run our normalisation, thereby passing those tests.
A simple test from `testthat` looks like this:
```{r,eval=FALSE}
library(testthat)
expect_equal(
range(normalise(x)),
c(0, 1)
)
```
This test encapsulates our expectation that after normalisation, our vector will have a minimum of $0$ and maximum of $1$.
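`expect_equal()` is just one of a family of `expect_*()` functions. As a rough sketch (assuming a reasonably recent version of `testthat`), we could also check the length and class of the output directly:
```{r,eval=FALSE}
# Check that the output has the same length as the input
expect_length(normalise(x), length(x))
# Check that the output is numeric
expect_is(normalise(x), 'numeric')
```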
Usually we want to package up a series of tests together to test a particular function.
Here's how we can do that:
```{r}
library(testthat)
test_that(
"Test the normalise function",
{
# Tests run in their own environment, so we need to recreate a vector to test on:
x <- runif(100, min = 10, max = 100)
# Check that the min = 0 and the max = 1
expect_equal(
range(normalise(x)),
c(0, 1)
)
# Check that y has the same length as x
expect_equal(
length(normalise(x)),
length(x)
)
# Check that the resulting vector is numeric
expect_is(
range(normalise(x)),
'numeric'
)
# Here I go one step further, and check the output against a specific and
# easily checkable vector. I just did the sums in my head!
test_vector = c(1, 2, 3)
expect_equal(
normalise(test_vector),
c(0, 0.5, 1)
)
}
)
```
So we run this code, and the tests fail.
Why?
Because `could not find function "normalise"`.
Of course - we have not written it yet!
So let's write it now.
We define our function to take one input.
```{r}
normalise <- function(x) {
# Run the normalisation as before
y <- (x - min(x))/(max(x) - min(x))
# Return the y object (if we didn't do this, nothing would happen!)
return(y)
}
```
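Before re-running the full test suite, we can sanity-check the function by hand on the same small vector used in the tests; the result should be `0, 0.5, 1`:
```{r}
normalise(c(1, 2, 3))
```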
So with that written, let's try again with our tests.
```{r}
test_that(
"Test the normalise function",
{
# Tests run in their own environment, so we need to recreate a vector to test on:
x <- runif(100, min = 10, max = 100)
# Check that the min = 0 and the max = 1
expect_equal(
range(normalise(x)),
c(0, 1)
)
# Check that y has the same length as x
expect_equal(
length(normalise(x)),
length(x)
)
# Check that the resulting vector is numeric
expect_is(
range(normalise(x)),
'numeric'
)
# Here I go one step further, and check the output against a specific and
# easily checkable vector. I just did the sums in my head!
test_vector = c(1, 2, 3)
expect_equal(
normalise(test_vector),
c(0, 0.5, 1)
)
}
)
```
This time we don't get anything returned.
These tests are silent.
If they pass, they pass silently; if they fail they let us know.
So all good.
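As an aside, in a real project you would not usually keep tests inline like this. A common pattern (sketched here with hypothetical file paths) is to save them in a separate script and run them with `testthat::test_file()`, or run a whole directory of test scripts with `testthat::test_dir()`:
```{r,eval=FALSE}
# Run the tests saved in a single script (hypothetical path)
testthat::test_file("tests/test_normalise.R")
# Run every test script in a directory (hypothetical path)
testthat::test_dir("tests")
```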
At this point, we might start to think about the other ways that we might test our function.
What would happen for instance if someone passed something to it that it was not expecting, for example:
* A single scalar instead of a vector.
How might we want our function to behave in this case?
Let's say we want to handle this case very simply, by raising an error.
We expand our test suite to include this:
```{r}
test_that(
"Test the normalise function",
{
# Tests run in their own environment, so we need to recreate a vector to test on:
x <- runif(100, min = 10, max = 100)
# Check that the min = 0 and the max = 1
expect_equal(
range(normalise(x)),
c(0, 1)
)
# Check that y has the same length as x
expect_equal(
length(normalise(x)),
length(x)
)
# Check that the resulting vector is numeric
expect_is(
range(normalise(x)),
'numeric'
)
# Here I go one step further, and check the output against a specific and
# easily checkable vector. I just did the sums in my head!
test_vector = c(1, 2, 3)
expect_equal(
normalise(test_vector),
c(0, 0.5, 1)
)
# Add a case where we expect an error to be raised
expect_error(
normalise(1)
)
}
)
```
These tests fail.
Why?
The error message says `normalise(1) did not throw an error`; so we know what we have to do...
```{r}
normalise <- function(x) {
stopifnot(length(x) > 1)
# Run the normalisation as before
y <- (x - min(x))/(max(x) - min(x))
# Return the y object (if we didn't do this, nothing would happen!)
return(y)
}
```
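`stopifnot()` keeps the function short, but its error message (`length(x) > 1 is not TRUE`) is a little cryptic. As a sketch of an alternative, we could write the check out explicitly with `stop()`; the test above only requires that *some* error is thrown, so either version would pass:
```{r,eval=FALSE}
normalise <- function(x) {
  # Fail early with an informative message if the input is not a numeric
  # vector with at least two elements
  if (!is.numeric(x) || length(x) < 2) {
    stop("x must be a numeric vector with at least two elements")
  }
  y <- (x - min(x))/(max(x) - min(x))
  return(y)
}
```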
Now that we have updated the function, let's run the tests again...
```{r}
test_that(
"Test the normalise function",
{
# Tests run in their own environment, so we need to recreate a vector to test on:
x <- runif(100, min = 10, max = 100)
# Check that the min = 0 and the max = 1
expect_equal(
range(normalise(x)),
c(0, 1)
)
# Check that y has the same length as x
expect_equal(
length(normalise(x)),
length(x)
)
# Check that the resulting vector is numeric
expect_is(
range(normalise(x)),
'numeric'
)
# Here I go one step further, and check the output against a specific and
# easily checkable vector. I just did the sums in my head!
test_vector = c(1, 2, 3)
expect_equal(
normalise(test_vector),
c(0, 0.5, 1)
)
# Add a case where we expect an error to be raised
expect_error(
normalise(1)
)
}
)
```
So all the tests pass silently!
## Running functions over a list
R contains a myriad of functions for applying functions over its various data structures: `apply`, `sapply`, `tapply`, `vapply`, `mapply`, etc.
Each is a minor variation on the theme.
`purrr` simplifies this somewhat by providing a simpler, more consistent set of `map()` functions.
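To make the difference concrete, here is a small example of the same operation in base R and in `purrr`: `sapply()` guesses its return type, whereas `map()` always returns a list and `map_dbl()` always returns a double vector.
```{r}
values <- list(a = 1:3, b = 4:6)
# Base R: the return type of sapply() depends on what it is given
sapply(values, mean)
# purrr: map() always returns a list, map_dbl() always a double vector
purrr::map(values, mean)
purrr::map_dbl(values, mean)
```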
So let's expand our TDD normalisation example.
Imagine this scenario:
* We have a list of $n$ elements each containing a numeric vector.
* We want to normalise each vector in the list, without combining the data together (we'll deal with this scenario later).
First let's create such a list.
Actually it turns out that we can use `purrr` to help us create this list in the first place.
Don't worry if you don't follow this just yet (you can always come back to this example once we have talked more about `purrr::map()`).
```{r,message=FALSE,warning=FALSE}
library(dplyr)
library(purrr)
```
```{r}
# map() runs over each element of .x, and the anonymous function defined in .f
# uses each element to set the length of the vector of uniformly distributed
# numbers.
list_x <- purrr::map(.x = 2:10, .f = function(x) runif(x,min = 0, max = 10))
list_x
```
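As an aside, `purrr` also accepts a one-sided formula as a shorthand for an anonymous function, with `.x` standing for the current element; the following call is equivalent to the one above (not re-run here, so that `list_x` keeps the values printed above):
```{r,eval=FALSE}
list_x <- purrr::map(.x = 2:10, .f = ~ runif(.x, min = 0, max = 10))
```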
Just as I used `map()` to create this list, I can use it to normalise the items in the list.
```{r}
purrr::map(.x = list_x, .f = normalise)
```
We can see from the first element of the list that the values have been converted to values between $0$ and $1$; and since we have tested our function, we can be reasonably sure that it is doing what we expect.
But what if we don't want to return a list at the end?
There are a number of variants of `map()` that we can use instead.
Probably the most useful is `purrr::map_df()`, which will coerce our list into a `data.frame`, a much more convenient structure to work with.
If we run it out of the box, it will fail, because the output of `normalise()` is not a `data.frame`:
```{r}
purrr::map_df(.x = list_x, .f = normalise)
```
We can fix this by wrapping `normalise` into an anonymous function which converts the output to a `data.frame`.
```{r}
df_x <- purrr::map_df(
.x = list_x,
.f = function(x) data.frame(value = normalise(x)),
.id = 'list_item'
)
df_x
```
So here we are returned a `data.frame` which contains all the values together in a single column. An additional column (specified with `.id`) denotes which vector in the list each value has come from.
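Since `dplyr` is loaded, we can quickly check this structure by summarising the range of `value` within each `list_item`; this is just a sanity check rather than a formal test:
```{r}
df_x %>%
  group_by(list_item) %>%
  summarise(min = min(value), max = max(value))
```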
Of course, we know that we expect a `data.frame` to be the output of this bit of code, so we might have written a test for this beforehand with:
```{r}
test_that(
'Test that map_df works with normalise',
{
# We expect df_x to be a data.frame
expect_is(
df_x,
'data.frame'
)
# We also know that the range of value should be 0, 1
expect_equal(
range(df_x$value),
c(0, 1)
)
}
)
```
The tests pass - all is good!
## Using purrr to iteratively apply models
One of the most useful things you can do with `purrr` is to apply a number of models over splits of a dataset.
I explain this in a lot more detail on my blog <http://www.machinegurning.com/rstats/iterating/>, but I will give a brief overview of it here.
Here, I use the pipe (`%>%`), which comes from the `magrittr` package and is re-exported by `dplyr`.
If you've not seen this before, it operates much like the Unix pipe `|`: it passes the output of one expression as the input to the next.
There's a lot of material available online about this, so I won't try to replicate it - Google it!
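As a tiny illustration, these two lines are equivalent ways of looking at the first three rows of a dataset:
```{r,eval=FALSE}
head(mtcars, 3)
mtcars %>% head(3)
```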
So let's start with the well-known `mtcars` dataset.
For info I will print it here:
```{r}
mtcars
```
You can read about the data by running `?mtcars`, but essentially it describes the characteristics of a number of classic cars.
Let's say that we want to apply a model to these data, and we have a hunch that miles per gallon (`mpg`) is correlated with the weight of the car (`wt`).
We can produce a single linear model like so:
```{r}
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
```
All very simple stuff, so far so good.
Now let's say we have an interest in the number of cylinders that the cars have, and we want to run a separate model for each number of cylinders.
We can do this by splitting the `mtcars` dataset on `cyl`.
```{r}
mtcars %>%
split(.$cyl)
```
You can see that this code has returned a list of three `data.frame`s, each with a different value of `cyl`.
Now that we have this in a list, we can start to use `purrr::map()` in the same way as before to run an operation on each of the elements of the list.
In this case, a linear model. Note that the `~` defines an anonymous function, and within it `.` refers to the current element of the list (here, one of the `data.frame`s), so we write `data = .` to tell `lm()` where the data should go:
```{r}
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .))
```
Here we are returned the three fitted models (printing each one shows its call and coefficients).
This output is not very informative; really we want to apply the `summary()` method to each of these models, so we can just chain another `map()` call on to the end to do this:
```{r}
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary)
```
And let's say we are only really interested in the $R^2$; well, we can extract this too.
If we pass a string to `map()` instead of a function, it will extract the element with that name from each item of the list.
```{r}
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map("r.squared")
```
So finally we get the three $R^2$ values of the models, and we can keep going.
If we want all these elements rounded, and coerced into a numeric vector (rather than a list), we can use `map_dbl`.
```{r}
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map("r.squared") %>%
map_dbl(~round(.x, 2))
```
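Since the typed `map_*()` variants also accept a string for extraction, an equivalent and slightly shorter version of the last two steps would be:
```{r}
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared") %>%
  round(2)
```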
For more detail on this, I encourage you to read my blog post: <http://www.machinegurning.com/rstats/iterating/>.
## Links for further reading
### machinegurning.com
* Test Driven Development: http://www.machinegurning.com/rstats/test-driven-development/
* Using purrr:map_df: http://www.machinegurning.com/rstats/map_df/
* Iteratively applying models with purrr: http://www.machinegurning.com/rstats/iterating/
## Reproducibility info
```{r}
sessionInfo()
```