---
output:
bookdown::html_document2:
fig_caption: yes
editor_options:
chunk_output_type: console
---
```{r echo = FALSE, cache = FALSE}
source("utils.R", local = TRUE)
```
Scatter Plots {#CHAPTER-SCATTER}
=============
Scatter plots are used to display the relationship between two continuous variables. In a scatter plot, each observation in a data set is represented by a point. Often, a scatter plot will also have a line showing the predicted values based on some statistical model. Adding this line is easy to do with R and the ggplot2 package, and can help to make sense of data when the trends aren't immediately obvious just by looking at the points.
With large data sets, plotting every single observation in the data set can result in overplotting, when points overlap and obscure one another. To deal with the problem of overplotting, you'll probably want to summarize the data before displaying it. We'll also see how to do that in this chapter.
Making a Basic Scatter Plot {#RECIPE-SCATTER-BASIC-SCATTER}
---------------------------
### Problem
You want to make a scatter plot using two continuous variables.
### Solution
Use `geom_point()`, and map one variable to `x` and one variable to `y`.
We will use the `heightweight` data set. There are a number of columns in this data set, but we'll only use two in this example (Figure \@ref(fig:FIG-SCATTER-BASIC)):
```{r FIG-SCATTER-BASIC, fig.cap="A basic scatter plot", fig.width=4, fig.height=4}
library(gcookbook) # Load gcookbook for the heightweight data set
library(dplyr)
# Show the two columns we'll use in the plot
heightweight %>%
  select(ageYear, heightIn)

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point()
```
### Discussion
You can draw the points with a different shape by setting `shape`. A common alternative to the default solid circles (shape #19) is hollow ones (#21), as seen in Figure \@ref(fig:FIG-SCATTER-BASIC-SHAPE-SIZE) (left):
```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 21)
```
The size of the points can be controlled with the `size` aesthetic. The default value of size is 2 (`size = 2`). The following code will set `size = 1.5` to create smaller points (Figure \@ref(fig:FIG-SCATTER-BASIC-SHAPE-SIZE), right):
```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(size = 1.5)
```
```{r FIG-SCATTER-BASIC-SHAPE-SIZE, echo=FALSE, fig.show="hold", fig.cap="Scatter plot with hollow circles (shape 21, left); With smaller points (right)", fig.width=4, fig.height=4}
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 21)

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(size = 1.5)
```
Grouping Points Together using Shapes or Colors {#RECIPE-SCATTER-GROUPED-SCATTER}
-------------------------------------------------------
### Problem
You want to visually group points by some variable (the grouping variable), using different shapes or colors.
### Solution
Map the grouping variable to the aesthetic of `shape` or `colour`. We'll use three columns from the `heightweight` data set for this example:
```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
# Show the three columns we'll use
heightweight %>%
  select(sex, ageYear, heightIn)
```
We can use the aesthetics of `colour` or `shape` to visually differentiate the data points belonging to different categories of `sex`. We do this by mapping `sex` to one of the aesthetics `colour` or `shape` (Figure \@ref(fig:FIG-SCATTER-SHAPE-COLOR)):
```{r FIG-SCATTER-SHAPE-COLOR, echo=FALSE, fig.show="hold", fig.cap="Grouping points by a variable mapped to colour (left), or to shape (right)"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
  geom_point()

ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
  geom_point()
```
### Discussion
The grouping variable you choose must be categorical -- in other words, a factor or character vector. If the grouping variable is a numeric vector, you should convert it to a factor first.
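As a minimal sketch of that conversion, the following uses a small made-up data frame (`df` and its columns are hypothetical and not part of `heightweight`) in which the grouping variable is stored as a number:

```r
# Hypothetical data: `group` holds category codes stored as numbers
df <- data.frame(
  x     = c(1.2, 2.4, 3.1, 4.8),
  y     = c(10, 12, 9, 15),
  group = c(1, 2, 1, 2)
)

# Convert the numeric codes to a factor so that they are treated as
# discrete categories rather than as a continuous variable
df$group_f <- factor(df$group)

is.factor(df$group_f)
levels(df$group_f)
```

Mapping `group_f` rather than `group` to `colour` or `shape` should then give one legend entry per level.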
It is possible to map a variable to both `shape` and `colour`, or, if you have multiple grouping variables, to map each grouping variable to a different aesthetic. Here, we'll map the variable `sex` to both `shape` and `colour` aesthetics (Figure \@ref(fig:FIG-SCATTER-SHAPE-COLOR-BOTH), left):
```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point()
```
You may want to use shapes and colors other than the defaults. You can select other shapes for the levels of the grouping variable using `scale_shape_manual()`, and other colors using `scale_colour_brewer()` or `scale_colour_manual()` (Figure \@ref(fig:FIG-SCATTER-SHAPE-COLOR-BOTH), right):
```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point() +
  scale_shape_manual(values = c(1,2)) +
  scale_colour_brewer(palette = "Set1")
```
```{r FIG-SCATTER-SHAPE-COLOR-BOTH, echo=FALSE, fig.show="hold", fig.cap="Mapping to both shape and colour (left); With manually set shapes and colors (right)"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point()

ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point() +
  scale_shape_manual(values = c(1,2)) +
  scale_colour_brewer(palette = "Set1")
```
### See Also
To use different shapes, see Recipe \@ref(RECIPE-SCATTER-SHAPES).
For more on using different colors, see Chapter \@ref(CHAPTER-COLORS).
Using Different Point Shapes {#RECIPE-SCATTER-SHAPES}
----------------------------
### Problem
You want to change the default scatterplot shapes for the data points.
### Solution
You can set the shape of all the data points at once (Figure \@ref(fig:FIG-SCATTER-SHAPES), left) by setting a shape in `geom_point()`:
```{r, eval=FALSE}
library(gcookbook) # Load gcookbook for the heightweight data set
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 3)
```
If you have mapped a variable to `shape`, you can use `scale_shape_manual()` to manually change the shapes mapped to the levels of that variable:
```{r, eval=FALSE}
# Use slightly larger points and use custom values for the shape scale
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))
```
```{r FIG-SCATTER-SHAPES, echo=FALSE, fig.show="hold", fig.cap="Scatter plot with the shape aesthetic set to a custom value (left); With a variable mapped to shape, using a custom shape palette (right)", fig.width=8.5, fig.height=3.5}
library(gcookbook) # Load gcookbook for the heightweight data set
p1 <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 3)

p2 <- ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))
library(patchwork)
p1 + plot_spacer() + p2 + plot_layout(widths = c(5, 1, 5))
```
### Discussion
Figure \@ref(fig:FIG-SCATTER-SHAPES-CHART) shows the shapes that are already built into R. Some of the point shapes (0–14) only have an outline; some (15–20) have solid fill; and some (21–25) have an outline and fill that can be controlled separately. You can also use characters for points.
For shapes 0–20, the color of the entire point -- even the points that have solid fill -- is controlled by the `colour` aesthetic. For shapes 21–25, the outline is controlled by `colour` and the fill is controlled by `fill`.
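To make the distinction between `colour` and `fill` concrete, here is a small sketch that draws three points with shape 21. The data frame `pts` is made up for illustration, and the colors are arbitrary choices:

```r
library(ggplot2)

# Hypothetical data, used only to draw three points
pts <- data.frame(x = 1:3, y = c(2, 3, 1))

# Shape 21 is a circle with a separate outline and interior:
# `colour` sets the outline and `fill` sets the interior
p <- ggplot(pts, aes(x = x, y = y)) +
  geom_point(shape = 21, size = 4, colour = "black", fill = "gold")
p
```

With a shape in the range 0–20, the `fill` setting would have no visible effect and only `colour` would matter.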
```{r FIG-SCATTER-SHAPES-CHART, echo=FALSE, fig.cap="Shapes in R graphics"}
pchShow <- function(extras = c("*",".", "o","O","0","+","-","|","%","#"),
                    cex = 3, ## good for both .Device=="postscript" and "x11"
                    col = "red3", bg = "gold", coltext = "brown", cextext = 1.2,
                    main = paste("plot symbols : points (... pch = *, cex =",
                                 cex,")"))
{
  nex <- length(extras)
  np <- 26 + nex
  ipch <- 0:(np - 1)
  k <- floor(sqrt(np))
  dd <- c(-1,1)/2
  rx <- dd + range(ix <- ipch %/% k)
  ry <- dd + range(iy <- 3 + (k - 1) - ipch %% k)
  pch <- as.list(ipch) # list with integers & strings
  if (nex > 0) pch[26 + 1:nex] <- as.list(extras)
  plot(rx, ry, type ="n", axes = FALSE, xlab = "", ylab = "",
       main = main)
  abline(v = ix, h = iy, col = "lightgray", lty = "dotted")
  for (i in 1:np) {
    pc <- pch[[i]]
    ## 'col' symbols with a 'bg'-colored interior (where available) :
    points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex)
    if (cextext > 0)
      text(ix[i] - 0.3, iy[i], pc, col = coltext, cex = cextext)
  }
}

par(mar = c(0,0,0,0))
pchShow(main = NULL)
```
It's possible to have one variable represented by the shape of a point, and another variable represented by the fill (empty or filled) of the point. To do this, you need to first choose point shapes that have both colour and fill, and set these in `scale_shape_manual()`. You then need to choose a fill palette that includes `NA` and another color (the `NA` will result in a hollow shape) and use these in `scale_fill_manual()`.
For example, we'll take the `heightweight` data set and add another column that indicates whether the child weighed 100 pounds or more (Figure \@ref(fig:FIG-SCATTER-SHAPES-FILL)):
```{r FIG-SCATTER-SHAPES-FILL, fig.cap="A variable mapped to shape and another mapped to fill"}
# Using the heightweight data set, create a new column that indicates if the
# child weighs < 100 or >= 100 pounds. We'll save this modified dataset as 'hw'.
hw <- heightweight %>%
  mutate(weightgroup = ifelse(weightLb < 100, "< 100", ">= 100"))

# Specify shapes with fill and color, and specify fill colors that include an empty (NA) color
ggplot(hw, aes(x = ageYear, y = heightIn, shape = sex, fill = weightgroup)) +
  geom_point(size = 2.5) +
  scale_shape_manual(values = c(21, 24)) +
  scale_fill_manual(
    values = c(NA, "black"),
    guide = guide_legend(override.aes = list(shape = 21))
  )
```
### See Also
For more on using different colors, see Chapter \@ref(CHAPTER-COLORS).
For more information about recoding a continuous variable to a categorical one, see Recipe \@ref(RECIPE-DATAPREP-RECODE-CONTINUOUS).
Mapping a Continuous Variable to Color or Size {#RECIPE-SCATTER-CONTINUOUS-SCATTER}
----------------------------------------------
### Problem
You want to represent a third continuous variable using color or size.
### Solution
Map the continuous variable to `size` or `colour`. We will use the `heightweight` data set for this example. There are many columns in this data set, but we'll only use four of them in this example:
```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
# Show the four columns we'll use
heightweight %>%
  select(sex, ageYear, heightIn, weightLb)
```
The basic scatter plot in Recipe \@ref(RECIPE-SCATTER-BASIC-SCATTER) shows the relationship between the continuous variables `ageYear` and `heightIn`. We can represent a third continuous variable, `weightLb`, by mapping this variable to another aesthetic property, such as `colour` or `size` (Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-COLOR-SIZE)):
```{r FIG-SCATTER-CONTINUOUS-COLOR-SIZE, fig.show="hold", fig.cap="A continuous variable mapped to colour (left); Mapped to size (right)"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = weightLb)) +
  geom_point()

ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
  geom_point()
```
### Discussion
A basic scatter plot shows the relationship between two continuous variables: one mapped to the x-axis, and one to the y-axis. When there are more than two continuous variables, these additional variables must be mapped to other aesthetics, like `size` and `color`.
Humans can easily perceive small differences in spatial position, so we can interpret the variables mapped to *x* and *y* coordinates with high precision. Humans aren't as good at perceiving small differences in `size` and `color` though, so we will interpret variables mapped to these aesthetic attributes with much lower precision. Therefore, when you map a variable to `size` or `color`, make sure it is a variable where high precision is not very important for correctly interpreting the data.
There is another consideration when mapping a variable to `size`, which is that the results can be perceptually misleading. While the largest dots in Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-COLOR-SIZE) are about 36 times the size of the smallest ones, they are only supposed to represent about 3.5 times the weight of the smallest dots.
This relative misrepresentation of size happens because, by default, ggplot2 scales the diameter of points to range from 1 to 6mm, regardless of the actual data values. For example, if the data values range from 0 to 10, the smallest value of 0 will be represented on the plot with a point that is 1mm wide, while the largest value of 10 will be represented on the plot with a point that is 6mm wide. Similarly, if the data values range from 100 to 110, the smallest value of 100 will still be represented by a point that is 1mm wide, and the largest value of 110 will be represented by a point that is 6mm wide. Thus, regardless of the actual data values, the largest point will have a diameter that is 6 times the diameter of the smallest point, and will have 36 times the area.
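The arithmetic in the two examples above can be checked with a short sketch. The helper `rescale_size()` is a hypothetical, simplified linear rescaling written for illustration; it is not ggplot2's own code, and ggplot2 may interpolate differently between the endpoints, but the endpoints are what matter here:

```r
# Simplified linear rescaling of data values onto a size range (in mm).
# Illustrative helper only, not part of ggplot2.
rescale_size <- function(x, to = c(1, 6)) {
  (x - min(x)) / (max(x) - min(x)) * (to[2] - to[1]) + to[1]
}

sizes_a <- rescale_size(c(0, 5, 10))       # data values from 0 to 10
sizes_b <- rescale_size(c(100, 105, 110))  # data values from 100 to 110

range(sizes_a)  # smallest and largest size for the first data set
range(sizes_b)  # the same endpoints, although the data are very different

# Ratio of largest to smallest diameter, and the corresponding area ratio
diameter_ratio <- max(sizes_a) / min(sizes_a)
area_ratio <- diameter_ratio^2
c(diameter = diameter_ratio, area = area_ratio)
```

In the second data set the largest value is only 10% greater than the smallest, yet its point would be drawn with 36 times the area.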
If it is important for the size of the points to accurately reflect the proportional differences of your data values, you should first decide if you want the diameter of the points to represent the data values, or if you want the area of the points to represent the data values. Figure \@ref(fig:FIG-SCATTER-SIZE-AREA) shows the difference between these representations.
```{r FIG-SCATTER-SIZE-AREA, fig.show="hold", fig.cap="Value mapped to diameter of points (left); Value mapped to area of points (right)"}
range(heightweight$weightLb)
size_range <- range(heightweight$weightLb) / max(heightweight$weightLb) * 6
size_range
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
  geom_point() +
  scale_size_continuous(range = size_range)

ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
  geom_point() +
  scale_size_area()
```
See Recipe \@ref(RECIPE-SCATTER-BALLOON) for details on making the area of points proportional to the data values.
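The difference between the two representations can also be seen numerically. The sketch below uses made-up data values and assumes a largest diameter of 6mm; the variable names are hypothetical:

```r
values <- c(50, 100, 200)  # hypothetical data values
max_size <- 6              # assumed largest diameter, in mm

# Diameter proportional to the data value
diam_by_diameter <- values / max(values) * max_size

# Area proportional to the data value, so the diameter follows the square root
diam_by_area <- sqrt(values / max(values)) * max_size

circle_area <- function(d) pi * (d / 2)^2

# Area of each point relative to the smallest point
rel_area_diameter <- circle_area(diam_by_diameter) / circle_area(diam_by_diameter[1])
rel_area_area     <- circle_area(diam_by_area) / circle_area(diam_by_area[1])

rbind(values, rel_area_diameter, rel_area_area)
```

When the value is mapped to diameter, doubling the value quadruples the area of the point; when it is mapped to area, doubling the value doubles the area.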
When it comes to color, there are actually two aesthetic attributes that can be used: `color` and `fill`. You will use `color` for most point shapes. However, shapes 21–25 have an outline with a solid region in the middle, where the color is controlled by `fill`. These outlined shapes can be useful when using a color scale with light colors as in Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-FILL), because the outline sets the shapes off from the background. In this example, we also set the fill gradient to go from black to white and make the points larger so that the fill is easier to see:
```{r FIG-SCATTER-CONTINUOUS-FILL, echo=FALSE, fig.show="hold", fig.cap="Outlined points with a continuous variable mapped to fill (left); With a discrete legend instead of continuous colorbar (right)"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, fill = weightLb)) +
  geom_point(shape = 21, size = 2.5) +
  scale_fill_gradient(low = "black", high = "white")

# Using guide_legend() will result in a discrete legend instead of a colorbar legend
ggplot(heightweight, aes(x = ageYear, y = heightIn, fill = weightLb)) +
  geom_point(shape = 21, size = 2.5) +
  scale_fill_gradient(
    low = "black", high = "white",
    breaks = seq(70, 170, by = 20),
    guide = guide_legend()
  )
```
Mapping a continuous variable to an aesthetic doesn't prevent us from mapping a categorical variable to other aesthetics. In Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-SIZE-CATEGORICAL-COLOR), we'll map `weightLb` to `size`, and also map `sex` to `color`. Because there is a fair amount of overplotting (where the points overlap), we'll make the points 50% transparent by setting `alpha = .5`. We'll also use `scale_size_area()` to make the area of the points proportional to the data values (see Recipe \@ref(RECIPE-SCATTER-BALLOON)), and manually change the color palette:
```{r FIG-SCATTER-CONTINUOUS-SIZE-CATEGORICAL-COLOR, echo=FALSE, fig.cap="Continuous variable mapped to size and categorical variable mapped to colour"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb, colour = sex)) +
  geom_point(alpha = .5) +
  scale_size_area() +  # Make area proportional to numeric value
  scale_colour_brewer(palette = "Set1")
```
When a variable is mapped to `size`, it's a good idea to *not* map a variable to `shape`. This is because it is difficult to compare the sizes of different shapes; for example, a size 4 triangle could appear larger than a size 3.5 circle. Also, some of the shapes really are different sizes: shapes 16 and 19 are both circles, but at any given numeric size, shape 19 circles are visually larger than shape 16 circles.
### See Also
To use different colors from the default, see Recipe \@ref(RECIPE-COLORS-PALETTE-CONTINUOUS).
See Recipe \@ref(RECIPE-SCATTER-BALLOON) for creating a balloon plot.
Dealing with Overplotting {#RECIPE-SCATTER-OVERPLOT}
-------------------------
### Problem
You have many points that overlap and obscure each other when plotted.
### Solution
With large data sets, the points in a scatter plot may overlap and obscure each other and prevent the viewer from accurately assessing the distribution of the data. This is called *overplotting*. If the amount of overplotting is low, you may be able to alleviate the problem by using smaller points, or by using a different shape (like shape 1, a hollow circle) through which other points can be seen. Figure \@ref(fig:FIG-SCATTER-BASIC-SHAPE-SIZE) in Recipe \@ref(RECIPE-SCATTER-BASIC-SCATTER) demonstrates both of these solutions.
If there's a high degree of overplotting, there are a number of possible solutions:
* Make the points semi-transparent
* Bin the data into rectangles (better for quantitative analysis)
* Bin the data into hexagons
* Use box plots
### Discussion
The scatter plot in Figure \@ref(fig:FIG-SCATTER-OVERPLOT) contains about 54,000 points. They are heavily overplotted, making it impossible to get a sense of the relative density of points in different areas of the graph:
```{r, eval=FALSE}
# We'll use the diamonds data set and create a base plot called `diamonds_sp`
diamonds_sp <- ggplot(diamonds, aes(x = carat, y = price))
diamonds_sp +
  geom_point()
```
```{r FIG-SCATTER-OVERPLOT, echo=FALSE, dev="png", dpi=300, fig.cap="Overplotting, with about 54,000 points", fig.width=4, fig.height=4}
# We'll use the diamonds data set and create a base plot called `diamonds_sp`
diamonds_sp <- ggplot(diamonds, aes(x = carat, y = price))
diamonds_sp +
  geom_point() +
  theme(
    axis.text = element_text(size = 6),
    axis.title = element_text(size = 8)
  )
```
We can make the points semitransparent using the `alpha` aesthetic, as in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-ALPHA). Here, we'll make them 90% transparent and then 99% transparent, by setting `alpha = .1` and `alpha = .01`:
```{r FIG-SCATTER-OVERPLOT-ALPHA, dev="png", dpi=300, fig.show="hold", fig.cap="Semitransparent points with alpha=.1 (left); With alpha=.01 (right)", fig.width=4, fig.height=4}
diamonds_sp +
  geom_point(alpha = .1)

diamonds_sp +
  geom_point(alpha = .01)
```
Now we can see that there appear to be vertical bands at nice round values of carats, indicating that diamonds tend to be cut to those sizes. Still, the data is so dense that even when the points are 99% transparent, much of the graph appears black and the data distribution is still somewhat obscured.
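One way to see why the densest regions still appear black is to work out how transparency accumulates. Assuming standard alpha compositing (an assumption about how the graphics device blends overlapping points), *n* stacked points that each have opacity `alpha` combine to an opacity of `1 - (1 - alpha)^n`. The helper `combined_opacity()` below is hypothetical and exists only for this illustration:

```r
# Combined opacity of n identical, fully overlapping points,
# assuming standard alpha compositing
combined_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

n_points <- c(1, 10, 100, 500)
data.frame(
  n          = n_points,
  alpha_0.1  = round(combined_opacity(0.1, n_points), 3),
  alpha_0.01 = round(combined_opacity(0.01, n_points), 3)
)
```

Under this assumption, even with `alpha = .01` a few hundred overlapping points are nearly opaque, so areas containing thousands of diamonds cannot be told apart from one another.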
> **Note**
>
> For most plots, vector formats (such as PDF, EPS, and SVG) result in smaller output files than bitmap formats (such as TIFF and PNG). But in cases where there are tens of thousands of points, vector output files can be very large and slow to render -- the scatter plot here with 99% transparent points is a 1.5 MB PDF! In these cases, high-resolution bitmaps will be smaller and faster to display on computer screens. See Chapter \@ref(CHAPTER-OUTPUT) for more information.
Another solution is to *bin* the points into rectangles and map the density of the points to the fill color of the rectangles, as shown in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BIN2D). With the binned visualization, the vertical bands are barely visible. The density of points in the lower-left corner is much greater, which tells us that the vast majority of diamonds are small and inexpensive.
By default, `stat_bin_2d()` divides the space into 30 groups in the *x* and *y* directions, for a total of 900 bins. In the second version, we increase the number of bins with `bins = 50`.
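The idea behind the binning can be sketched in base R with `cut()` and `table()`. This uses simulated data rather than `diamonds`, and the exact bin boundaries chosen by `cut()` are not guaranteed to match those used by ggplot2:

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for two continuous variables (not the diamonds data)
n <- 5000
x <- rexp(n)
y <- x * 2 + rnorm(n)

# Cut each variable into 30 equal-width intervals and count the
# observations that fall into each of the 30 x 30 = 900 cells
bin_counts <- table(cut(x, breaks = 30), cut(y, breaks = 30))

dim(bin_counts)  # 30 rows by 30 columns
sum(bin_counts)  # every observation lands in exactly one cell
max(bin_counts)  # count in the densest cell
```

The count in each cell is the quantity that gets mapped to the fill color, and the largest count determines the upper end of the fill scale.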
The default colors are somewhat difficult to distinguish because they don't vary much in luminosity. In the second version we set the colors by using `scale_fill_gradient()` and by specifying the low and high colors. By default, the legend doesn't show an entry for the lowest values. This is because the range of the color scale starts not from zero, but from the smallest nonzero quantity in a bin -- probably 1, in this case. To make the legend show a zero, we can manually set the range from 0 to the maximum, 6000, using `limits` (Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BIN2D), right):
```{r, eval=FALSE}
diamonds_sp +
  stat_bin2d()

diamonds_sp +
  stat_bin2d(bins = 50) +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 6000))
```
(ref:cap-FIG-SCATTER-OVERPLOT-BIN2D) Binning data with `stat_bin2d()` (left); With more bins, manually specified colors, and legend limits (right)
```{r FIG-SCATTER-OVERPLOT-BIN2D, echo=FALSE, fig.show="hold", fig.cap="(ref:cap-FIG-SCATTER-OVERPLOT-BIN2D)"}
diamonds_sp +
  stat_bin2d() +
  theme(
    axis.text = element_text(size = 6),
    axis.title = element_text(size = 8)
  )

diamonds_sp +
  stat_bin2d(bins = 50) +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 6000)) +
  theme(
    axis.text = element_text(size = 6),
    axis.title = element_text(size = 8)
  )
```
Another alternative is to bin the data into hexagons instead of rectangles, with `stat_binhex()` (Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BINHEX)). It works just like `stat_bin2d()`. To use `stat_binhex()`, you must first install the hexbin package, with the command `install.packages("hexbin")`:
(ref:cap-FIG-SCATTER-OVERPLOT-BINHEX) Binning data with `stat_binhex()` (left); Cells outside of the range shown in grey (right)
```{r FIG-SCATTER-OVERPLOT-BINHEX, fig.show="hold", fig.cap="(ref:cap-FIG-SCATTER-OVERPLOT-BINHEX)"}
library(hexbin) # Load the hexbin library to access stat_binhex()
diamonds_sp +
  stat_binhex() +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 8000))

diamonds_sp +
  stat_binhex() +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 5000))
```
For both of these methods, if you manually specify the range and there is a bin that falls outside that range because it has too many or too few points, that bin will show up as grey rather than the color at the high or low end of the range, as seen in the graph on the right in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BINHEX).
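The grey bins arise because out-of-range values are treated as missing. The base R sketch below, using made-up bin counts, contrasts that behavior with clamping the values to the nearest limit instead. In a real plot, passing `oob = scales::squish` to the fill scale is one way to request clamping; treat that as an option to verify against your installed versions of ggplot2 and scales:

```r
bin_counts <- c(10, 2500, 5200, 7900)  # hypothetical counts per bin
limits <- c(0, 5000)

# Censoring: out-of-range values become NA, which is what gets drawn in grey
censored <- ifelse(bin_counts < limits[1] | bin_counts > limits[2], NA, bin_counts)

# Squishing: out-of-range values are clamped to the nearest limit
squished <- pmin(pmax(bin_counts, limits[1]), limits[2])

rbind(bin_counts, censored, squished)
```

With squishing, the two bins above 5000 would take the color at the high end of the scale rather than grey, at the cost of no longer being distinguishable from a bin with exactly 5000 points.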
Overplotting can also occur when the data is *discrete* on one or both axes, as shown in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-JITTER). In these cases, you can randomly *jitter* the points with `position_jitter()`. By default the amount of jitter is 40% of the resolution of the data in each direction, but these amounts can be controlled with `width` and `height`:
```{r FIG-SCATTER-OVERPLOT-JITTER, fig.show="hold", fig.cap="Data with a discrete x variable (left); Jittered (middle); Jittered horizontally only (right)", fig.width=4, fig.height=4}
# We'll use the ChickWeight data set and create a base plot called `cw_sp` (for ChickWeight scatter plot)
cw_sp <- ggplot(ChickWeight, aes(x = Time, y = weight))
cw_sp +
  geom_point()

cw_sp +
  geom_point(position = "jitter")  # Could also use geom_jitter(), which is equivalent

cw_sp +
  geom_point(position = position_jitter(width = .5, height = 0))
```
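The default jitter amount described above can be worked out by hand. The sketch below takes the resolution of `Time` to be the smallest gap between its distinct values and then adds uniform noise within 40% of that resolution. It illustrates the idea only and is not ggplot2's implementation:

```r
# Resolution: the smallest gap between distinct values of the variable
time_values <- sort(unique(ChickWeight$Time))
res <- min(diff(time_values))

# Default jitter: uniform noise within +/- 40% of the resolution
jitter_width <- 0.4 * res

set.seed(42)  # make the random jitter reproducible
jittered_time <- ChickWeight$Time +
  runif(nrow(ChickWeight), min = -jitter_width, max = jitter_width)

res
range(jittered_time - ChickWeight$Time)
```

Because most measurements in `ChickWeight` are two days apart but the last two are only one day apart, the resolution is 1, and the default horizontal jitter stays within 0.4 days of the true value.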
When the data has one discrete axis and one continuous axis, it might make sense to use box plots, as shown in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BOXPLOT). This will convey a different story than a standard scatter plot because a box plot will obscure the *number* of data points at each location on the discrete axis. This may be problematic in some cases, but desirable in others.
When we look at the `ChickWeight` data we know that we conceptually want to treat `Time` as a discrete variable. However since `Time` is taken as a numerical variable by default, ggplot doesn't know to group the data to form each boxplot box. If you don't tell ggplot how to group the data, you get a result like the graph on the right in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BOXPLOT). To tell it how to group the data, use `aes(group = ...)`. In this case, we'll group by each distinct value of `Time`:
```{r FIG-SCATTER-OVERPLOT-BOXPLOT, fig.show="hold", fig.cap="Grouping into box plots (left); What happens if you don't specify groups (right)", fig.width=4, fig.height=4, warning=FALSE}
cw_sp +
  geom_boxplot(aes(group = Time))

cw_sp +
  geom_boxplot()  # Without groups
```
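What grouping by `Time` means can be shown outside of ggplot2. This sketch splits the weights by each distinct value of `Time` and computes a five-number summary per group, which is roughly what each box summarizes (the exact quantile and whisker rules used for the plot may differ):

```r
# Split the weights into one group per distinct value of Time
weight_by_time <- split(ChickWeight$weight, ChickWeight$Time)

length(weight_by_time)  # number of groups, and so the number of boxes

# Minimum, lower hinge, median, upper hinge, and maximum for each group
group_stats <- sapply(weight_by_time, fivenum)
round(group_stats[, 1:3], 1)
```

Without the grouping, all of the weights would be pooled into a single summary, which corresponds to the single box in the ungrouped version of the plot.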
### See Also
Instead of binning the data, it may be useful to display a 2D density estimate. To do this, see Recipe \@ref(RECIPE-DISTRIBUTION-DENSITY2D).
Adding Fitted Regression Model Lines {#RECIPE-SCATTER-FITLINES}
------------------------------------
### Problem
You want to add lines from a fitted regression model to a scatter plot.
### Solution
To add a linear regression line to a scatter plot, add `stat_smooth()` and tell it to use `method = lm`. This instructs ggplot to fit the data with the `lm()` (linear model) function. First we'll save the base plot object in `hw_sp`, then we'll add different components to it:
```{r, eval=FALSE}
library(gcookbook) # Load gcookbook for the heightweight data set
# We'll use the heightweight data set and create a base plot called `hw_sp` (for heightweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn))

hw_sp +
  geom_point() +
  stat_smooth(method = lm)
```
By default, `stat_smooth()` also adds a 95% confidence region for the regression fit. The confidence level can be changed by setting `level`, or the region can be disabled with `se = FALSE` (Figure \@ref(fig:FIG-SCATTER-FIT-LM)):
```{r, eval=FALSE}
# 99% confidence region
hw_sp +
  geom_point() +
  stat_smooth(method = lm, level = 0.99)

# No confidence region
hw_sp +
  geom_point() +
  stat_smooth(method = lm, se = FALSE)
```
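To see what the confidence level changes, it can help to fit the model directly and ask `predict()` for confidence intervals. The sketch below uses the built-in `women` data set as a stand-in for `heightweight`, and the three `height` values in `new_data` are arbitrary. It is meant to convey the idea rather than to reproduce exactly what `stat_smooth()` computes:

```r
# `women` is a small built-in data set, used here as a stand-in
fit <- lm(weight ~ height, data = women)

new_data <- data.frame(height = c(60, 65, 70))

# Confidence intervals for the fitted mean at two confidence levels
ci_95 <- predict(fit, newdata = new_data, interval = "confidence", level = 0.95)
ci_99 <- predict(fit, newdata = new_data, interval = "confidence", level = 0.99)

# Width of each interval at the three heights
width_95 <- ci_95[, "upr"] - ci_95[, "lwr"]
width_99 <- ci_99[, "upr"] - ci_99[, "lwr"]
cbind(width_95, width_99)
```

The fitted values are the same in both cases; only the width of the region around the line changes, with the 99% region being wider.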
The default color of the fit line is blue. This can be changed by setting `colour`. As with any other line, the attributes `linetype` and `size` (`linewidth` in ggplot2 3.4.0 and later) can also be set. To emphasize the line, you can make the dots less prominent by changing the `colour` of the points (Figure \@ref(fig:FIG-SCATTER-FIT-LM), bottom right):
```{r, eval=FALSE}
hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = lm, se = FALSE, colour = "black")
```
```{r FIG-SCATTER-FIT-LM, echo=FALSE, fig.show="hold", fig.cap="An lm fit with the default 95\\% confidence region (top left); A 99\\% confidence region (top right); No confidence region (bottom left); In black with grey points (bottom right)", fig.width=4, fig.height=4}
library(gcookbook) # Load gcookbook for the heightweight data set
# We'll use the heightweight data set and create a base plot called `hw_sp` (for heightweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn))
hw_sp +
  geom_point() +
  stat_smooth(method = lm)

hw_sp +
  geom_point() +
  stat_smooth(method = lm, level = 0.99)

hw_sp +
  geom_point() +
  stat_smooth(method = lm, se = FALSE)

hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = lm, se = FALSE, colour = "black")
```
### Discussion
The linear regression line is not the only way of fitting a model to the data -- in fact, it's not even the default. If you add `stat_smooth()` without specifying the method, it will use a LOESS (locally weighted polynomial) curve by default for data sets with fewer than 1,000 observations, as shown in Figure \@ref(fig:FIG-SCATTER-FIT-LOESS):
```{r, eval=FALSE}
hw_sp +
geom_point(colour = "grey60") +
stat_smooth()
# Equivalent to:
hw_sp +
geom_point(colour = "grey60") +
stat_smooth(method = loess)
```
```{r FIG-SCATTER-FIT-LOESS, echo=FALSE, fig.cap="A LOESS fit", fig.width=4, fig.height=4}
hw_sp +
geom_point(colour = "grey60") +
stat_smooth(method = loess)
```
It may be useful to specify additional parameters for the modeling function, which in this case is `loess()`. If, for example, you wanted to use `loess(degree = 1)`, you would call `stat_smooth(method = loess, method.args = list(degree = 1))`. The same could be done for other modeling functions like `lm()` or `glm()`.
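As a sketch of how this looks in a complete plot, the following compares a locally linear fit (`degree = 1`) with a locally quadratic one (`degree = 2`, which is the documented default for `loess()`). The object names `p_degree1` and `p_degree2` are hypothetical, and `formula = y ~ x` is spelled out only to avoid the message that `stat_smooth()` prints when it picks the formula itself:

```{r, eval=FALSE}
library(ggplot2)
library(gcookbook)  # For the heightweight data set

hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn))

# Locally linear fit: degree = 1 is handed to loess() through method.args
p_degree1 <- hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = loess, formula = y ~ x, method.args = list(degree = 1))

# Locally quadratic fit, stated explicitly for comparison
p_degree2 <- hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = loess, formula = y ~ x, method.args = list(degree = 2))

p_degree1
p_degree2
```

Anything placed in the `method.args` list is passed on as a named argument to the modeling function, so the names in the list must match that function's own argument names.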
Another common type of model fit is a logistic regression. Logistic regression isn't appropriate for `heightweight`, but it's perfect for the `biopsy` data set in the `MASS` package. In the `biopsy` data, there are nine different measured attributes of breast cancer biopsies, as well as the class of the tumor, which is either benign or malignant. To prepare the data for logistic regression, we must convert the factor `class`, with the levels `benign` and `malignant`, to a vector with numeric values of 0 and 1. We'll make a copy of the `biopsy` data frame called `biopsy_mod`, then store the numeric coded `class` in a column called `classn`:
```{r}
library(MASS) # Load MASS for the biopsy data set
biopsy_mod <- biopsy %>%
mutate(classn = recode(class, benign = 0, malignant = 1))
biopsy_mod
```
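If dplyr is not available, the same coding can be sketched in base R. This is an alternative to the `recode()` call above rather than a description of how it works, and the name `classn_base` is hypothetical, used only for this comparison:

```{r, eval=FALSE}
library(MASS)  # For the biopsy data set

# Base R alternative: TRUE/FALSE from the comparison becomes 1/0
classn_base <- as.numeric(biopsy$class == "malignant")

# Cross-tabulate the factor against the numeric coding; each level of
# `class` should fall entirely in one column
table(biopsy$class, classn_base)
```

A cross-tabulation like this is a quick way to confirm that `benign` maps to 0 and `malignant` maps to 1 before fitting the model.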
Although there are many attributes we could examine, for this example we'll just look at the relationship of `V1` (clump thickness) and the `class` of the tumor. Because there is a large degree of overplotting, we'll jitter the points and make them semitransparent (`alpha = 0.4`), hollow (`shape = 21`), and slightly smaller (`size = 1.5`). Then we'll add a fitted logistic regression line (Figure \@ref(fig:FIG-SCATTER-FIT-LOGISTIC)) by telling `stat_smooth()` to use the `glm()` function with `family = binomial`:
```{r FIG-SCATTER-FIT-LOGISTIC, fig.cap="A logistic model"}
ggplot(biopsy_mod, aes(x = V1, y = classn)) +
geom_point(
position = position_jitter(width = 0.3, height = 0.06),
alpha = 0.4,
shape = 21,
size = 1.5
) +
stat_smooth(method = glm, method.args = list(family = binomial))
```
If your scatter plot has points grouped by a factor, and that factor is mapped to an aesthetic such as `colour` or `shape`, one fit line will be drawn for each factor level. First we'll make the base plot object `hw_sp`, with `sex` mapped to `colour` and a ColorBrewer palette for the two groups, then we'll add the LOESS lines to it (Figure \@ref(fig:FIG-SCATTER-FIT-GROUP)):
```{r FIG-SCATTER-FIT-GROUP-1, eval=FALSE}
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
geom_point() +
scale_colour_brewer(palette = "Set1")
hw_sp +
geom_smooth()
```
Notice that the blue line, for males, doesn't run all the way to the right side of the plot. There are two reasons for this. The first is that by default, `stat_smooth()` limits the prediction to within the range of the predictor data on the x-axis. The second is that even if `stat_smooth()` is asked to extrapolate, the `loess()` function only offers prediction within the *x* range of the data.
If you want the lines to extrapolate from the data, as shown in the right-hand image of Figure \@ref(fig:FIG-SCATTER-FIT-GROUP), you must use a model method that allows extrapolation, like `lm()`, and pass `stat_smooth()` the option `fullrange = TRUE`:
```{r FIG-SCATTER-FIT-GROUP-2, eval=FALSE}
hw_sp +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE)
```
```{r FIG-SCATTER-FIT-GROUP, ref.label=c("FIG-SCATTER-FIT-GROUP-1", "FIG-SCATTER-FIT-GROUP-2"), echo=FALSE, fig.show="hold", fig.cap="LOESS fit lines for each group (left); Extrapolated linear fit lines (right)", message=FALSE}
```
In this example with the `heightweight` data set, the default settings for `stat_smooth()` (with `loess` and no extrapolation) may make more sense than the extrapolated linear predictions, because humans don't grow linearly and we don't grow forever.
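To see the effect of `fullrange` in numbers rather than in a picture, the computed layer data can be inspected with `layer_data()`. This is a minimal sketch; the helper `x_range_by_group()` and the plot names are hypothetical, and the plots contain only the smoothing layer so that it is layer 1:

```{r, eval=FALSE}
library(ggplot2)
library(gcookbook)  # For the heightweight data set

base <- ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex))

p_default <- base +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE)
p_full <- base +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, fullrange = TRUE)

# layer_data() returns the values computed for a layer; summarise the
# x range of the fitted line within each group
x_range_by_group <- function(p) {
  d <- layer_data(p, 1)
  aggregate(x ~ group, data = d, FUN = range)
}

x_range_by_group(p_default)  # Each line spans only its own group's ages
x_range_by_group(p_full)     # Both lines span the same x range
```

With the default settings, each group's line starts and stops at that group's own minimum and maximum `ageYear`; with `fullrange = TRUE`, both lines cover the same interval.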
Adding Fitted Lines from an Existing Model {#RECIPE-SCATTER-FITLINES-MODEL}
------------------------------------------
### Problem
You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.
### Solution
Usually the easiest way to overlay a fitted model is to simply ask `stat_smooth()` to do it for you, as described in Recipe \@ref(RECIPE-SCATTER-FITLINES). Sometimes, however, you may want to create the model yourself and then add it to your graph. This allows you to be sure that the model you're using for other calculations is the same one that you see.
In this example, we'll build a quadratic model using `lm()` with `ageYear` as a predictor of `heightIn`. Then we'll use the `predict()` function and find the predicted values of `heightIn` across the range of values for the predictor, `ageYear`:
```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
model <- lm(heightIn ~ ageYear + I(ageYear^2), heightweight)
model
# Create a data frame with ageYear column, interpolating across range
xmin <- min(heightweight$ageYear)
xmax <- max(heightweight$ageYear)
predicted <- data.frame(ageYear = seq(xmin, xmax, length.out = 100))
# Calculate predicted values of heightIn
predicted$heightIn <- predict(model, predicted)
predicted
```
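It can help to see that `predict()` is doing nothing more than evaluating the fitted formula. The following sketch recomputes a few predictions by hand from the model coefficients; the names `b`, `ages`, `by_hand`, and `by_predict` are introduced here for illustration:

```{r, eval=FALSE}
library(gcookbook)  # For the heightweight data set

model <- lm(heightIn ~ ageYear + I(ageYear^2), heightweight)

# Coefficients in order: intercept, linear term, quadratic term
b <- unname(coef(model))

# Evaluate the fitted quadratic by hand on a small grid of ages
ages <- seq(min(heightweight$ageYear), max(heightweight$ageYear), length.out = 5)
by_hand <- b[1] + b[2] * ages + b[3] * ages^2

# predict() should give the same values (up to floating-point error)
by_predict <- unname(predict(model, data.frame(ageYear = ages)))
all.equal(by_hand, by_predict)
```

Note that the new data frame only needs an `ageYear` column; the `I(ageYear^2)` term is computed from it by the model formula.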
We can now plot the data points along with the values predicted from the model (as you'll see in Figure \@ref(fig:FIG-SCATTER-FIT-MODEL)):
```{r FIG-SCATTER-FIT-MODEL-1, eval=FALSE}
# Create a base plot called `hw_sp` (for heightweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point(colour = "grey40")
hw_sp +
geom_line(data = predicted, size = 1)
```
### Discussion
Any model object (e.g., `lm`) can be used, so long as it has a corresponding `predict()` method. For example, `lm` has `predict.lm()`, `loess` has `predict.loess()`, and so on. Adding lines from a model can be simplified by using the function `predictvals()`, defined below. If you simply pass a model to `predictvals()`, the function will do the work of finding the variable names and the range of the predictor, and will return a data frame with predictor and predicted values. That data frame can then be passed to `geom_line()` to draw the fitted line, as we did earlier:
```{r}
# Given a model, predict values of yvar from xvar
# This function supports one predictor and one predicted variable
# xrange: If NULL, determine the x range from the model object. If a vector with
# two numbers, use those as the min and max of the prediction range.
# samples: Number of samples across the x range.
# ...: Further arguments to be passed to predict()
predictvals <- function(model, xvar, yvar, xrange = NULL, samples = 100, ...) {
# If xrange isn't passed in, determine xrange from the models.
# Different ways of extracting the x range, depending on model type
if (is.null(xrange)) {
if (any(class(model) %in% c("lm", "glm")))
xrange <- range(model$model[[xvar]])
else if (any(class(model) %in% "loess"))
xrange <- range(model$x)
}
newdata <- data.frame(x = seq(xrange[1], xrange[2], length.out = samples))
names(newdata) <- xvar
newdata[[yvar]] <- predict(model, newdata = newdata, ...)
newdata
}
```
With the heightweight data set, we'll make a linear model with `lm()` and a LOESS model with `loess()` (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL)):
```{r}
modlinear <- lm(heightIn ~ ageYear, heightweight)
modloess <- loess(heightIn ~ ageYear, heightweight)
```
Then we can call `predictvals()` on each model, and pass the resulting data frames to `geom_line()`:
```{r FIG-SCATTER-FIT-MODEL-2, eval=FALSE}
lm_predicted <- predictvals(modlinear, "ageYear", "heightIn")
loess_predicted <- predictvals(modloess, "ageYear", "heightIn")
hw_sp +
geom_line(data = lm_predicted, colour = "red", size = .8) +
geom_line(data = loess_predicted, colour = "blue", size = .8)
```
```{r FIG-SCATTER-FIT-MODEL, ref.label=c("FIG-SCATTER-FIT-MODEL-1", "FIG-SCATTER-FIT-MODEL-2"), echo=FALSE, fig.show="hold", fig.cap="A quadratic prediction line from an lm object (left); Prediction lines from linear (in red) and LOESS (in blue) models (right)", fig.width=4, fig.height=4}
```
For `glm` models that use a nonlinear link function, you need to pass `type = "response"` to the `predictvals()` function. This is because the default behavior of `predict()` for a `glm` is to return predicted values on the scale of the linear predictors, instead of on the scale of the response (*y*) variable.
To illustrate this, we'll use the `biopsy` data set from the `MASS` package. As we did in Recipe \@ref(RECIPE-SCATTER-FITLINES), we'll use `V1` to predict `class`. Since logistic regressions require numeric values from 0 to 1, we need to convert the factor `class` to 0s and 1s:
```{r}
library(MASS) # Load MASS for the biopsy data set
# Using the biopsy data set, create a new column, `classn`, that stores the
# factor `class` as a number: 0 if `class` is "benign", 1 if it is "malignant".
# Save the result as `biopsy_mod`.
biopsy_mod <- biopsy %>%
mutate(classn = recode(class, benign = 0, malignant = 1))
biopsy_mod
```
Next, we'll perform the logistic regression:
```{r}
fitlogistic <- glm(classn ~ V1, biopsy_mod, family = binomial)
```
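The difference between the two prediction scales can be checked directly. The sketch below rebuilds the same model with base R so that it is self-contained; `biopsy_chk`, `fit_chk`, `on_link`, and `on_response` are hypothetical names used only here:

```{r, eval=FALSE}
library(MASS)  # For the biopsy data set

# Same model as above, rebuilt here so the sketch is self-contained
biopsy_chk <- biopsy
biopsy_chk$classn <- as.numeric(biopsy_chk$class == "malignant")
fit_chk <- glm(classn ~ V1, biopsy_chk, family = binomial)

newdata <- data.frame(V1 = 1:10)

# Default: predictions on the scale of the linear predictor (log-odds)
on_link <- predict(fit_chk, newdata)

# type = "response": predictions as probabilities between 0 and 1
on_response <- predict(fit_chk, newdata, type = "response")

range(on_link)
range(on_response)

# For the logit link, plogis() maps the linear predictor to the response scale
all.equal(plogis(on_link), on_response)
```

Because the points in the scatter plot are plotted at 0 and 1, only the response-scale predictions line up with them; the log-odds values fall outside that interval.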
Finally, we'll make the graph with jittered points and the `fitlogistic` line. We'll make the line in a shade of blue by specifying a color in RGB values, and slightly thicker, with `size = 1` (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-LOGISTIC)):
```{r FIG-SCATTER-FIT-MODEL-LOGISTIC, fig.cap="A fitted logistic model"}
# Get predicted values
glm_predicted <- predictvals(fitlogistic, "V1", "classn", type = "response")
ggplot(biopsy_mod, aes(x = V1, y = classn)) +
geom_point(
position = position_jitter(width = .3, height = .08),
alpha = 0.4,
shape = 21,
size = 1.5
) +
geom_line(data = glm_predicted, colour = "#1177FF", size = 1)
```
Adding Fitted Lines from Multiple Existing Models {#RECIPE-SCATTER-FITLINES-MODEL-MULTI}
-------------------------------------------------
### Problem
You have already created fitted regression model objects for several groups in a data set, and you want to plot the lines for each of those models.
### Solution
Use the `predictvals()` function from the previous recipe (Recipe \@ref(RECIPE-SCATTER-FITLINES-MODEL)) along with functions from the `dplyr` package, including `group_by()` and `do()`.
With the `heightweight` data set, we'll make a linear model for each of the levels of `sex`. The model building is done for each subset of the data frame by specifying the model computation we want within the `do()` function.
The following code splits the `heightweight` data frame into two data frames, by the grouping variable `sex`. This creates a data frame subset for males and a data frame subset for females. We then apply `lm(heightIn ~ ageYear, .)` to each subset. The `.` in `lm(heightIn ~ ageYear, .)` represents the data frame we are piping in from the previous line -- in this case, each of the two data frame subsets we have just created. Finally, this code returns a data frame which contains a column, `model`, which is a list of the model objects corresponding to each level of the grouping variable `sex`, female and male.
```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
library(dplyr)
# Create an lm model object for each value of sex; this returns a data frame
models <- heightweight %>%
group_by(sex) %>%
do(model = lm(heightIn ~ ageYear, .)) %>%
ungroup()
# Print the data frame
models
# Print out the model column of the data frame
models$model
```
Now that we have the list of model objects, we can run `predictvals()`, as defined in Recipe \@ref(RECIPE-SCATTER-FITLINES-MODEL), to get the predicted values from each model:
```{r}
predvals <- models %>%
group_by(sex) %>%
do(predictvals(.$model[[1]], xvar = "ageYear", yvar = "heightIn"))
```
Finally, we can plot the data with the predicted values (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-MULTI)):
```{r eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
geom_point() +
geom_line(data = predvals)
# Using facets instead of colors for the groups
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point() +
geom_line(data = predvals) +
facet_grid(. ~ sex)
```
```{r FIG-SCATTER-FIT-MODEL-MULTI, echo=FALSE, fig.cap="Predictions from two separate lm objects, one for each subset of data (left); With facets (right)", fig.width = 12}
p1 <- ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
geom_point() +
geom_line(data = predvals)
# Using facets instead of colors for the groups
p2 <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point() +
geom_line(data = predvals) +
facet_grid(. ~ sex)
p1 + plot_spacer() + p2 + plot_layout(widths = c(5, 1, 9))
```
### Discussion
The `group_by()` and `do()` calls are used for splitting the data into parts, running functions on those parts, and then reassembling the output.
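The same split-apply-combine pattern can be sketched in base R with `split()`, `lapply()`, and `rbind()`. This is a rough equivalent for illustration, not a description of what dplyr does internally, and the names `parts`, `models_list`, `pred_list`, and `predvals_base` are hypothetical:

```{r, eval=FALSE}
library(gcookbook)  # For the heightweight data set

# Split: one data frame per level of sex
parts <- split(heightweight, heightweight$sex)

# Apply: fit a separate linear model to each part
models_list <- lapply(parts, function(d) lm(heightIn ~ ageYear, data = d))

# Apply again: predicted values across each group's own age range
pred_list <- lapply(names(parts), function(s) {
  d <- parts[[s]]
  out <- data.frame(
    ageYear = seq(min(d$ageYear), max(d$ageYear), length.out = 100)
  )
  out$heightIn <- predict(models_list[[s]], newdata = out)
  out$sex <- s  # Stored as a character vector in this sketch
  out
})

# Combine: stack the per-group results back into one data frame
predvals_base <- do.call(rbind, pred_list)
str(predvals_base)
```

The result has the same three columns as the data frame built with `group_by()` and `do()`, with one block of rows per group.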
With the preceding code, the *x* range of the predicted values for each group spans the *x* range of each group, and no further; for the males, the prediction line stops at the oldest male, while for females, the prediction line continues further right, to the oldest female. To form prediction lines that have the same *x* range across all groups, we can simply pass in `xrange`, like this:
```{r}
predvals <- models %>%
group_by(sex) %>%
do(predictvals(
.$model[[1]],
xvar = "ageYear",
yvar = "heightIn",
xrange = range(heightweight$ageYear))
)
```
Then we can plot it, the same as we did before:
```{r FIG-SCATTER-FIT-MODEL-MULTI-RANGE, fig.cap="Predictions for each group extend to the full x range of all groups together"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
geom_point() +
geom_line(data = predvals)
```
As you can see in Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-MULTI-RANGE), the line for males now extends as far to the right as the line for females. Keep in mind that extrapolating past the data isn't always appropriate, though; whether or not it's justified will depend on the nature of your data and the assumptions you bring to the table.
Adding Annotations with Model Coefficients {#RECIPE-SCATTER-FITLINES-TEXT}
------------------------------------------
### Problem
You want to add numerical information about a model to a plot.
### Solution
To add text to a plot, add an annotation with `annotate()`. In this example, we'll create a linear model and use the `predictvals()` function defined in Recipe \@ref(RECIPE-SCATTER-FITLINES-MODEL) to create a prediction line from the model. Then we'll add an annotation:
```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
model <- lm(heightIn ~ ageYear, heightweight)
summary(model)
```
This shows that the *r^2* value is 0.4249. We'll create a graph and manually add the text using `annotate()` (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-TEXT)):
```{r eval=FALSE}
# First generate prediction data
pred <- predictvals(model, "ageYear", "heightIn")
# Save a base plot
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point() +
geom_line(data = pred)
hw_sp +
annotate("text", x = 16.5, y = 52, label = "r^2=0.42")
```
Instead of using a plain text string, it's also possible to enter formulas using R's math expression syntax, by using `parse = TRUE`:
```{r eval=FALSE}
hw_sp +
annotate("text", x = 16.5, y = 52, label = "r^2 == 0.42", parse = TRUE)
```
```{r FIG-SCATTER-FIT-MODEL-TEXT, echo=FALSE, fig.show="hold", fig.cap="Plain text (left); Math expression (right)", fig.width=3.5, fig.height=3.5}
pred <- predictvals(model, "ageYear", "heightIn")
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point() +
geom_line(data = pred)
hw_sp +
annotate("text", x = 16.5, y = 52, label = "r^2=0.42")
hw_sp +
annotate("text", x = 16.5, y = 52, label = "r^2 == 0.42", parse = TRUE)
```
### Discussion
Text geoms in ggplot do not take expression objects directly; instead, they take character strings that can be turned into expressions with R's `parse()` function.
If you use a mathematical expression, the syntax must be correct for the expression to be a valid R expression object. You can test the validity by wrapping the object in the `expression()` function and seeing if it throws an error (make sure *not* to use quotes around the expression). In the example here, `==` is a valid construct in an expression to express equality, but `=` is not:
```{r, eval=FALSE}
expression(r^2 == 0.42) # Valid
expression(r^2 = 0.42) # Not valid
#> Error: unexpected '=' in "expression(r^2 ="
```
It's possible to automatically extract values from the model object and build an expression using those values. In this example, we'll create a string that, when parsed, yields a valid expression:
```{r}
# Use sprintf() to construct our string.
# The %.3g and %.2g are replaced with numbers with 3 significant digits and 2
# significant digits, respectively. The numbers are supplied after the string.
eqn <- sprintf(
"italic(y) == %.3g + %.3g * italic(x) * ',' ~~ italic(r)^2 ~ '=' ~ %.2g",
coef(model)[1],
coef(model)[2],
summary(model)$r.squared
)
eqn
# Test validity by using parse()
parse(text = eqn)
```
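The `%g` conversions are easier to follow in isolation. The following short sketch uses made-up numbers, not the coefficients of `model`, to show how the significant-digit formats behave:

```{r, eval=FALSE}
# %.3g keeps three significant digits and drops trailing zeros
sprintf("%.3g", 37.4356)   # "37.4"
sprintf("%.3g", 1.6)       # "1.6"

# %.2g keeps two significant digits
sprintf("%.2g", 0.4249)    # "0.42"

# Several placeholders are filled in order from the remaining arguments
sprintf("%.3g + %.3g * x", 37.4356, 1.7483)   # "37.4 + 1.75 * x"
```

Because the placeholders are filled by position, the order of the values after the format string must match the order in which they appear in the equation.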
Now that we have the expression string, we can add it to the plot. In this example we'll put the text in the bottom-right corner, by setting `x = Inf` and `y = -Inf` and using horizontal and vertical adjustments so that the text all fits inside the plotting area (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-TEXT-AUTO)):
```{r FIG-SCATTER-FIT-MODEL-TEXT-AUTO, fig.cap="Scatter plot with automatically generated expression", fig.width=4, fig.height=4}
hw_sp +
annotate(
"text",
x = Inf, y = -Inf,
label = eqn, parse = TRUE,
hjust = 1.1, vjust = -.5
)
```
### See Also
The math expression syntax in R can be a bit tricky. See Recipe \@ref(RECIPE-ANNOTATE-TEXT-MATH) for more information.
Adding Marginal Rugs to a Scatter Plot {#RECIPE-SCATTER-RUG}
--------------------------------------
### Problem
You want to add marginal rugs to a scatter plot.
### Solution
Use `geom_rug()`. For this example (Figure \@ref(fig:FIG-SCATTER-RUG)), we'll use the `faithful` data set. This data set has two columns with data about the Old Faithful geyser: `eruptions`, which is the length of each eruption, and `waiting`, which is the length of time until the next eruption:
```{r FIG-SCATTER-RUG, fig.cap="Marginal rug added to a scatter plot", fig.width=5, fig.height=5}
ggplot(faithful, aes(x = eruptions, y = waiting)) +
geom_point() +
geom_rug()
```
### Discussion
A marginal rug plot is essentially a one-dimensional scatter plot that can be used to visualize the distribution of data on each axis.
In this particular data set, the marginal rug is not as informative as it could be. The resolution of the `waiting` variable is in whole minutes, and because of this, the rug lines have a lot of overplotting. To reduce overplotting, we can jitter the line positions and make them slightly thinner by specifying size (Figure \@ref(fig:FIG-SCATTER-RUG-JITTER)). This helps the viewer see the distribution more clearly:
```{r FIG-SCATTER-RUG-JITTER, fig.cap="Marginal rug with thinner, jittered lines", fig.width=5, fig.height=5}
ggplot(faithful, aes(x = eruptions, y = waiting)) +
geom_point() +
geom_rug(position = "jitter", size = 0.2)
```
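The amount of overplotting in the rug can be gauged before plotting by counting distinct values. This is a small sketch using only base R and the built-in `faithful` data set:

```{r, eval=FALSE}
# waiting is recorded in whole minutes, so many observations share a value
all(faithful$waiting == round(faithful$waiting))

# Compare the number of observations with the number of distinct values
nrow(faithful)
length(unique(faithful$waiting))
length(unique(faithful$eruptions))

# The most heavily repeated waiting times
head(sort(table(faithful$waiting), decreasing = TRUE))
```

When the number of distinct values is much smaller than the number of rows, many rug lines are drawn on top of one another, which is the situation where jittering is most useful.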
### See Also
For more about overplotting, see Recipe \@ref(RECIPE-SCATTER-OVERPLOT).
Labeling Points in a Scatter Plot {#RECIPE-SCATTER-LABELS}
---------------------------------
### Problem
You want to add labels to points in a scatter plot.
### Solution
For annotating just one or a few points, you can use `annotate()` or `geom_text()`. For this example, we'll use the `countries` data set and visualize the relationship between health expenditures and infant mortality rate per 1,000 live births. To keep things manageable, we'll filter the data to only look at data from 2009 for a subset of countries that spent more than 2,000 USD per capita:
```{r}
library(gcookbook) # Load gcookbook for the countries data set
library(dplyr)
# Filter the data to only look at 2009 data for countries that spent > 2000 USD per capita