ch05.Rmd

---
output:
  bookdown::html_document2:
    fig_caption: yes
editor_options:
  chunk_output_type: console
---

```{r echo = FALSE, cache = FALSE}
source("utils.R", local = TRUE)
```

Scatter Plots {#CHAPTER-SCATTER}
=============

Scatter plots are used to display the relationship between two continuous variables. In a scatter plot, each observation in a data set is represented by a point. Often, a scatter plot will also have a line showing the predicted values based on some statistical model. Adding this line is easy to do with R and the ggplot2 package, and can help to make sense of data when the trends aren't immediately obvious just by looking at the points.

With large data sets, plotting every single observation in the data set can result in overplotting, when points overlap and obscure one another. To deal with the problem of overplotting, you'll probably want to summarize the data before displaying it. We'll also see how to do that in this chapter.


Making a Basic Scatter Plot {#RECIPE-SCATTER-BASIC-SCATTER}
---------------------------

### Problem

You want to make a scatter plot using two continuous variables.

### Solution

Use `geom_point()`, and map one variable to `x` and one variable to `y`.

We will use the `heightweight` data set. There are a number of columns in this data set, but we'll only use two in this example (Figure \@ref(fig:FIG-SCATTER-BASIC)):

```{r FIG-SCATTER-BASIC, fig.cap="A basic scatter plot", fig.width=4, fig.height=4}
library(gcookbook) # Load gcookbook for the heightweight data set
library(dplyr)

# Show the head of the two columns we'll use in the plot
heightweight %>%
  select(ageYear, heightIn)

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point()
```

### Discussion

Instead of points, you can use different shapes for your scatter plot by using the `shape` aesthetic. A common alternative to the default solid circles (shape #19) is hollow ones (#21), as seen in Figure \@ref(fig:FIG-SCATTER-BASIC-SHAPE-SIZE) (left):

```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 21)
```

The size of the points can be controlled with the `size` aesthetic. The default value of size is 2 (`size = 2`). The following code will set `size = 1.5` to create smaller points (Figure \@ref(fig:FIG-SCATTER-BASIC-SHAPE-SIZE), right):

```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(size = 1.5)
```

```{r FIG-SCATTER-BASIC-SHAPE-SIZE, echo=FALSE, fig.show="hold", fig.cap="Scatter plot with hollow circles (shape 21, left); With smaller points (right)", fig.width=4, fig.height=4}

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 21)

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(size = 1.5)
```


Grouping Points Together using Shapes or Colors {#RECIPE-SCATTER-GROUPED-SCATTER}
-------------------------------------------------------

### Problem

You want to visually group points by some variable (the grouping variable), using different shapes or colors.

### Solution

Map the grouping variable to the aesthetic of `shape` or `colour`. We'll use three columns from the `heightweight` data set for this example:

```{r}
library(gcookbook) # Load gcookbook for the heightweight data set

# Show the head of the three columns we'll use
heightweight %>%
  select(sex, ageYear, heightIn)
```

We can use the aesthetics of `colour` or `shape` to visually differentiate the data points belonging to different categories of `sex`. We do this by mapping `sex` to one of the aesthetics `colour` or `shape` (Figure\@ref(fig:FIG-SCATTER-SHAPE-COLOR)):

```{r FIG-SCATTER-SHAPE-COLOR, echo=FALSE, fig.show="hold", fig.cap="Grouping points by a variable mapped to colour (left), or to shape (right)"}

ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
  geom_point()

ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
  geom_point()
```

### Discussion

The grouping variable you choose must be categorical -- in other words, a factor or character vector. If the grouping variable is a numeric vector, you should convert it to a factor first.

It is possible to map a variable to both `shape` and `colour`, or, if you have multiple grouping variables, to map each grouping variable to a different aesthetic. Here, we'll map the variable `sex` to both `shape` and `colour` aesthetics (Figure \@ref(fig:FIG-SCATTER-SHAPE-COLOR-BOTH), left):

```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point()
```

You may want to use different shapes and colors than are given by the default settings. You can select other shapes for the grouping variables using `scale_shape_manual()`, and select other colors using `scale_colour_brewer()` or `scale_colour_manual()`. (Figure \@ref(fig:FIG-SCATTER-SHAPE-COLOR-BOTH), right):

```{r, eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point() +
  scale_shape_manual(values = c(1,2)) +
  scale_colour_brewer(palette = "Set1")
```

```{r FIG-SCATTER-SHAPE-COLOR-BOTH, echo=FALSE, fig.show="hold", fig.cap="Mapping to both shape and colour (left); With manually set shapes and colors (right)"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point()

ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point() +
  scale_shape_manual(values = c(1,2)) +
  scale_colour_brewer(palette = "Set1")
```

### See Also

To use different shapes, see Recipe \@ref(RECIPE-SCATTER-SHAPES).

For more on using different colors, see Chapter \@ref(CHAPTER-COLORS).


Using Different Point Shapes {#RECIPE-SCATTER-SHAPES}
----------------------------

### Problem

You want to change the default scatterplot shapes for the data points.

### Solution

You can set the shape of all the data points at once (Figure \@ref(fig:FIG-SCATTER-SHAPES), left) by setting a shape in `geom_point()`:

```{r, eval=FALSE}
library(gcookbook) # Load gcookbook for the heightweight data set

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 3)
```

If you have mapped a variable to `shape`, you can use `scale_shape_manual()` to manually change the shapes mapped to the levels of that variable:

```{r, eval=FALSE}
# Use slightly larger points and use custom values for the shape scale
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))
```

```{r FIG-SCATTER-SHAPES, echo=FALSE, fig.show="hold", fig.cap="Scatter plot with the shape aesthetic set to a custom value (left); With a variable mapped to shape, using a custom shape palette (right)", fig.width=8.5, fig.height=3.5}
library(gcookbook) # Load gcookbook for the heightweight data set

p1 <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(shape = 3)

p2 <- ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))

library(patchwork)
p1 + plot_spacer() + p2 + plot_layout(widths = c(5, 1, 5))
```

### Discussion

Figure \@ref(fig:FIG-SCATTER-SHAPES-CHART) shows the shapes that are already built into R. Some of the point shapes (1–14) only have an outline; some (15–20) have solid fill; and some (21–25) have an outline and fill that can be controlled separately. You can also use characters for points.

For shapes 1–20, the color of the entire point -- even the points that have solid fill -- is controlled by the `colour` aesthetic. For shapes 21–25, the outline is controlled by `colour` and the fill is controlled by `fill`.

```{r FIG-SCATTER-SHAPES-CHART, echo=FALSE, fig.cap="Shapes in R graphics"}
pchShow <- function(extras = c("*",".", "o","O","0","+","-","|","%","#"),
  cex = 3, ## good for both .Device=="postscript" and "x11"
  col = "red3", bg = "gold", coltext = "brown", cextext = 1.2,
  main = paste("plot symbols :  points (...  pch = *, cex =",
    cex,")"))
{
  nex <- length(extras)
  np  <- 26 + nex
  ipch <- 0:(np - 1)
  k <- floor(sqrt(np))
  dd <- c(-1,1)/2
  rx <- dd + range(ix <- ipch %/% k)
  ry <- dd + range(iy <- 3 + (k - 1) - ipch %% k)
  pch <- as.list(ipch) # list with integers & strings
  if (nex > 0) pch[26 + 1:nex] <- as.list(extras)
  plot(rx, ry, type ="n", axes = FALSE, xlab = "", ylab = "",
    main = main)
  abline(v = ix, h = iy, col = "lightgray", lty = "dotted")
  for (i in 1:np) {
    pc <- pch[[i]]
    ## 'col' symbols with a 'bg'-colored interior (where available) :
    points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex)
    if (cextext > 0)
      text(ix[i] - 0.3, iy[i], pc, col = coltext, cex = cextext)
  }
}

par(mar = c(0,0,0,0))
pchShow(main = NULL)
```

It's possible to have one variable represented by the shape of a point, and and another variable represented by the fill (empty or filled) of the point. To do this, you need to first choose point shapes that have both colour and fill, and set these in `scale_shape_manual`. You then need to choose a fill palette that includes `NA` and another color (the `NA` will result in a hollow shape) and use these in `scale_fill_manual()`.

For example, we'll take the `heightweight` data set and add another column that indicates whether the child weighed 100 pounds or more (Figure \@ref(fig:FIG-SCATTER-SHAPES-FILL)):

```{r FIG-SCATTER-SHAPES-FILL, fig.cap="A variable mapped to shape and another mapped to fill"}
# Using the heightweight data set, create a new column that indicates if the
# child weighs < 100 or >= 100 pounds. We'll save this modified dataset as 'hw'.
hw <- heightweight %>%
  mutate(weightgroup = ifelse(weightLb < 100, "< 100", ">= 100"))

# Specify shapes with fill and color, and specify fill colors that includes an empty (NA) color
ggplot(hw, aes(x = ageYear, y = heightIn, shape = sex, fill = weightgroup)) +
  geom_point(size = 2.5) +
  scale_shape_manual(values = c(21, 24)) +
  scale_fill_manual(
    values = c(NA, "black"),
    guide = guide_legend(override.aes = list(shape = 21))
  )
```

### See Also

For more on using different colors, see Chapter \@ref(CHAPTER-COLORS).

For more information about recoding a continuous variable to a categorical one, see Recipe \@ref(RECIPE-DATAPREP-RECODE-CONTINUOUS).


Mapping a Continuous Variable to Color or Size {#RECIPE-SCATTER-CONTINUOUS-SCATTER}
----------------------------------------------

### Problem

You want to represent a third continuous variable using color or size.

### Solution

Map the continuous variable to `size` or `colour`. We will use the `heightweight` data set for this example. There are many columns in this data set, but we'll only use four of them in this example:

```{r}
library(gcookbook) # Load gcookbook for the heightweight data set

# Show the head of the four columns we'll use
heightweight %>%
  select(sex, ageYear, heightIn, weightLb)
```

The basic scatter plot in Recipe \@ref(RECIPE-SCATTER-BASIC-SCATTER) shows the relationship between the continuous variables `ageYear` and `heightIn`. We can represent a third continuous variable, `weightLb`, by mapping this variable to another aesthetic property, such as `colour` or `size` (Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-COLOR-SIZE):

```{r FIG-SCATTER-CONTINUOUS-COLOR-SIZE, fig.show="hold", fig.cap="A continuous variable mapped to colour (left); Mapped to size (right)"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = weightLb)) +
  geom_point()

ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
  geom_point()
```

### Discussion

A basic scatter plot shows the relationship between two continuous variables: one mapped to the x-axis, and one to the y-axis. When there are more than two continuous variables, these additional variables must be mapped to other aesthetics, like `size` and `color`.

Humans can easily perceive small differences in spatial position, so we can interpret the variables mapped to *x* and *y* coordinates with high precision. Humans aren't as good at perceiving small differences in `size` and `color` though, so we will interpret variables mapped to these aesthetic attributes with much lower precision. Therefore, when you map a variable to `size` or `color`, make sure it is a variable where high precision is not very important for correctly intepreting the data.

There is another consideration when mapping a variable to `size`, which is that the results can be perceptually misleading. While the largest dots in Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-COLOR-SIZE) are about 36 times the size of the smallest ones, they are only supposed to represent about 3.5 times the weight of the smallest dots.

This relative misrepresentation of size happens because the default values in ggplot2 for the diameter of points ranges from 1 to 6mm, regardless of the actual data values. For example, if the data values range from 0 to 10, the smallest value of 0 will be represented on the plot with a point that is 1mm wide, while the largest value of 10 will be represented on the plot with a point that is 6mm wide. Similarly, if the data values range from 100 to 110, the smallest value of 100 will still be represented by a point that is 1mm wide, and the largest value of 110 will be represented by a point that is 6mm wide. Thus regardless of the actual data values, the largest point will have a diameter that is 6 times the diameter of the smallest point, and will be 36 times the area.

If it is important for the size of the points to accurately reflect the proportional differences of your data values, you should first decide if you want the diameter of the points to represent the data values, or if you want to area of the points to represent the data values. Figure \@ref(fig:FIG-SCATTER-SIZE-AREA) shows the difference between these representations.

```{r FIG-SCATTER-SIZE-AREA, fig.show="hold", fig.cap="Value mapped to diameter of points (left); Value mapped to area of points (right)"}

range(heightweight$weightLb)
size_range <- range(heightweight$weightLb) / max(heightweight$weightLb) *  6
size_range

ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
  geom_point() +
  scale_size_continuous(range = size_range)

ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
  geom_point() +
  scale_size_area()
```

See Recipe \@ref(RECIPE-SCATTER-BALLOON) for details on making the area of points proportional to the data values.

When it comes to color, there are actually two aesthetic attributes that can be used: `color` and `fill`. You will use `color` for most point shapes. However, shapes 21–25 have an outline with a solid region in the middle where the color is controlled by fill. These outlined shapes can be useful when using a color scale with light colors as in Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-FILL), because the outline sets the shapes off from the background. In this example, we also set the fill gradient to go from black to white and make the points larger so that the fill is easier to see:

```{r FIG-SCATTER-CONTINUOUS-FILL, echo=FALSE, fig.show="hold", fig.cap="Outlined points with a continuous variable mapped to fill (left); With a discrete legend instead of continuous colorbar (right)"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, fill = weightLb)) +
  geom_point(shape = 21, size = 2.5) +
  scale_fill_gradient(low = "black", high = "white")

# Using guide_legend() will result in a discrete legend instead of a colorbar legend
ggplot(heightweight, aes(x = ageYear, y = heightIn, fill = weightLb)) +
  geom_point(shape = 21, size = 2.5) +
  scale_fill_gradient(
    low = "black", high = "white",
    breaks = seq(70, 170, by = 20),
    guide = guide_legend()
  )
```

Mapping a continuous variable to an aesthetic doesn't prevent us from mapping a categorical variable to other aesthetics. In Figure \@ref(fig:FIG-SCATTER-CONTINUOUS-SIZE-CATEGORICAL-COLOR), we'll map `weightLb` to `size`, and also map `sex` to `color`. Because there is a fair amount of overplotting (where the points overlap), we'll make the points 50% transparent by setting `alpha = .5`. We'll also use `scale_size_area()` to make the area of the points proportional to the data values (see Recipe \@ref(RECIPE-SCATTER-BALLOON)), and manually change the color palette:

```{r FIG-SCATTER-CONTINUOUS-SIZE-CATEGORICAL-COLOR, echo=FALSE, fig.cap="Continuous variable mapped to size and categorical variable mapped to colour"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb, colour = sex)) +
  geom_point(alpha = .5) +
  scale_size_area() +     # Make area proportional to numeric value
  scale_colour_brewer(palette = "Set1")
```

When a variable is mapped to `size`, it's a good idea to *not* map a variable to `shape`. This is because it is difficult to compare the sizes of different shapes; for example, a size 4 triangle could appear larger than a size 3.5 circle. Also, some of the shapes really are different sizes: shapes 16 and 19 are both circles, but at any given numeric size, shape 19 circles are visually larger than shape 16 circles.

### See Also

To use different colors from the default, see Recipe \@ref(RECIPE-COLORS-PALETTE-CONTINUOUS).

See Recipe \@ref(RECIPE-SCATTER-BALLOON) for creating a balloon plot.


Dealing with Overplotting {#RECIPE-SCATTER-OVERPLOT}
-------------------------

### Problem

You have many points that overlap and obscure each other when plotted.

### Solution

With large data sets, the points in a scatter plot may overlap and obscure each other and prevent the viewer from accurately assessing the distribution of the data. This is called *overplotting*. If the amount of overplotting is low, you may be able to alleviate the problem by using smaller points, or by using a different shape (like shape 1, a hollow circle) through which other points can be seen. Figure \@ref(fig:FIG-SCATTER-BASIC-SHAPE-SIZE) in Recipe \@ref(RECIPE-SCATTER-BASIC-SCATTER) demonstrates both of these solutions.

If there's a high degree of overplotting, there are a number of possible solutions:

*   Make the points semi-transparent
*   Bin the data into rectangles (better for quantitative analysis)
*   Bin the data into hexagons
*   Use box plots

### Discussion

The scatter plot in Figure \@ref(fig:FIG-SCATTER-OVERPLOT) contains about 54,000 points. They are heavily overplotted, making it impossible to get a sense of the relative density of points in different areas of the graph:

```{r, eval=FALSE}
# We'll use the diamonds data set and create a base plot called `diamonds_sp`
diamonds_sp <- ggplot(diamonds, aes(x = carat, y = price))

diamonds_sp +
  geom_point()
```

```{r FIG-SCATTER-OVERPLOT, echo=FALSE, dev="png", dpi=300, fig.cap="Overplotting, with about 54,000 points", fig.width=4, fig.height=4}
# We'll use the diamonds data set and create a base plot called `diamonds_sp`
diamonds_sp <- ggplot(diamonds, aes(x = carat, y = price))

diamonds_sp +
  geom_point() +
  theme(
    axis.text = element_text(size = 6),
    axis.title = element_text(size = 8)
  )
```

We can make the points semitransparent using the `alpha` aesthetic, as in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-ALPHA). Here, we'll make them 90% transparent and then 99% transparent, by setting `alpha = .1` and `alpha = .01`:

```{r FIG-SCATTER-OVERPLOT-ALPHA, dev="png", dpi=300, fig.show="hold", fig.cap="Semitransparent points with alpha=.1 (left); With alpha=.01 (right)", fig.width=4, fig.height=4}
diamonds_sp +
  geom_point(alpha = .1)

diamonds_sp +
  geom_point(alpha = .01)
```

Now we can see that there appear to be vertical bands at nice round values of carats, indicating that diamonds tend to be cut to those sizes. Still, the data is so dense that even when the points are 99% transparent, much of the graph appears black and the data distribution is still somewhat obscured.

> **Note**
>
> For most plots, vector formats (such as PDF, EPS, and SVG) result in smaller output files than bitmap formats (such as TIFF and PNG). But in cases where there are tens of thousands of points, vector output files can be very large and slow to render -- the scatter plot here with 99% transparent points is a 1.5 MB PDF! In these cases, high-resolution bitmaps will be smaller and faster to display on computer screens. See Chapter \@ref(CHAPTER-OUTPUT) for more information.

Another solution is to *bin* the points into rectangles and map the density of the points to the fill color of the rectangles, as shown in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BIN2D). With the binned visualization, the vertical bands are barely visible. The density of points in the lower-left corner is much greater, which tells us that the vast majority of diamonds are small and inexpensive.

By default, `stat_bin_2d()` divides the space into 30 groups in the *x* and *y* directions, for a total of 900 bins. In the second version, we increase the number of bins with `bins = 50`.

The default colors are somewhat difficult to distinguish because they don't vary much in luminosity. In the second version we set the colors by using `scale_fill_gradient()` and by specifying the low and high colors. By default, the legend doesn't show an entry for the lowest values. This is because the range of the color scale starts not from zero, but from the smallest nonzero quantity in a bin -- probably 1, in this case. To make the legend show a zero (as in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BIN2D), right), we can manually set the range from 0 to the maximum, 6000, using limits (Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BIN2D), left):

```{r, eval=FALSE}
diamonds_sp +
  stat_bin2d()

diamonds_sp +
  stat_bin2d(bins = 50) +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 6000))
```

(ref:cap-FIG-SCATTER-OVERPLOT-BIN2D) Binning data with `stat_bin2d()` (left); With more bins, manually specified colors, and legend breaks (right)

```{r FIG-SCATTER-OVERPLOT-BIN2D, echo=FALSE, fig.show="hold", fig.cap="(ref:cap-FIG-SCATTER-OVERPLOT-BIN2D)"}
diamonds_sp +
  stat_bin2d() +
  theme(
    axis.text = element_text(size = 6),
    axis.title = element_text(size = 8)
  )

diamonds_sp +
  stat_bin2d(bins = 50) +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 6000)) +
  theme(
    axis.text = element_text(size = 6),
    axis.title = element_text(size = 8)
  )
```

Another alternative is to bin the data into hexagons instead of rectangles, with `stat_binhex()` (Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BINHEX)). It works just like `stat_bin2d()`. To use `stat_binhex()`, you must first install the hexbin package, with the command `install.packages("hexbin")`:

(ref:cap-FIG-SCATTER-OVERPLOT-BINHEX) Binning data with `stat_binhex()` (left); Cells outside of the range shown in grey (right)

```{r FIG-SCATTER-OVERPLOT-BINHEX, fig.show="hold", fig.cap="(ref:cap-FIG-SCATTER-OVERPLOT-BINHEX)"}
library(hexbin) # Load the hexbin library to access stat_binhex()

diamonds_sp +
  stat_binhex() +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 8000))

diamonds_sp +
  stat_binhex() +
  scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 5000))
```

For both of these methods, if you manually specify the range and there is a bin that falls outside that range because it has too many or too few points, that bin will show up as grey rather than the color at the high or low end of the range, as seen in the graph on the right in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BINHEX).

Overplotting can also occur when the data is *discrete* on one or both axes, as shown in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-JITTER). In these cases, you can randomly *jitter* the points with `position_jitter()`. By default the amount of jitter is 40% of the resolution of the data in each direction, but these amounts can be controlled with `width` and `height`:

```{r FIG-SCATTER-OVERPLOT-JITTER, fig.show="hold", fig.cap="Data with a discrete x variable (left); Jittered (middle); Jittered horizontally only (right)", fig.width=4, fig.height=4}

# We'll use the ChickWeight data set and create a base plot called `cw_sp` (for ChickWeight scatter plot)
cw_sp <- ggplot(ChickWeight, aes(x = Time, y = weight))

cw_sp +
  geom_point()

cw_sp +
  geom_point(position = "jitter")  # Could also use geom_jitter(), which is equivalent

cw_sp +
  geom_point(position = position_jitter(width = .5, height = 0))
```

When the data has one discrete axis and one continuous axis, it might make sense to use box plots, as shown in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BOXPLOT). This will convey a different story than a standard scatter plot because a box plot will obscure the *number* of data points at each location on the discrete axis. This may be problematic in some cases, but desirable in others.

When we look at the `ChickWeight` data we know that we conceptually want to treat `Time` as a discrete variable. However since `Time` is taken as a numerical variable by default, ggplot doesn't know to group the data to form each boxplot box. If you don't tell ggplot how to group the data, you get a result like the graph on the right in Figure \@ref(fig:FIG-SCATTER-OVERPLOT-BOXPLOT). To tell it how to group the data, use `aes(group = ...)`. In this case, we'll group by each distinct value of `Time`:

```{r FIG-SCATTER-OVERPLOT-BOXPLOT, fig.show="hold", fig.cap="Grouping into box plots (left); What happens if you don't specify groups (right)", fig.width=4, fig.height=4, warning=FALSE}
cw_sp +
  geom_boxplot(aes(group = Time))

cw_sp +
  geom_boxplot()  # Without groups
```


### See Also

Instead of binning the data, it may be useful to display a 2D density estimate. To do this, see Recipe \@ref(RECIPE-DISTRIBUTION-DENSITY2D).


Adding Fitted Regression Model Lines {#RECIPE-SCATTER-FITLINES}
------------------------------------

### Problem

You want to add lines from a fitted regression model to a scatter plot.

### Solution

To add a linear regression line to a scatter plot, add `stat_smooth()` and tell it to use `method = lm`. This instructs ggplot to fit the data with the `lm()` (linear model) function. First we'll save the base plot object in `sp`, then we'll add different components to it:

```{r, eval=FALSE}
library(gcookbook) # Load gcookbook for the heightweight data set

# We'll use the heightweight data set and create a base plot called `hw_sp` (for heighweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn))

hw_sp +
  geom_point() +
  stat_smooth(method = lm)
```

By default, `stat_smooth()` also adds a 95% confidence region for the regression fit. The confidence interval can be changed by modifying the value for `level`, or it can be disabled with `se = FALSE` (Figure \@ref(fig:FIG-SCATTER-FIT-LM)):

```{r, eval=FALSE}
# 99% confidence region
hw_sp +
  geom_point() +
  stat_smooth(method = lm, level = 0.99)

# No confidence region
hw_sp +
  geom_point() +
  stat_smooth(method = lm, se = FALSE)
```

The default color of the fit line is blue. This can be change by setting `colour`. As with any other line, the attributes `linetype` and `size` can also be set. To emphasize the line, you can make the dots less prominent by changing the `colour` of the points (Figure \@ref(fig:FIG-SCATTER-FIT-LM), bottom right):

```{r, eval=FALSE}
hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = lm, se = FALSE, colour = "black")
```

```{r FIG-SCATTER-FIT-LM, echo=FALSE, fig.show="hold", fig.cap="An lm fit with the default 95\\% confidence region (top left); A 99\\% confidence region (top right); No confidence region (bottom left); In black with grey points (bottom right)", fig.width=4, fig.height=4}

library(gcookbook) # Load gcookbook for the heightweight data set

# We'll use the heightweight data set and create a base plot called `hw_sp` (for heightweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn))

hw_sp +
  geom_point() +
  stat_smooth(method = lm)

hw_sp +
  geom_point() +
  stat_smooth(method = lm, level = 0.99)

hw_sp +
  geom_point() +
  stat_smooth(method = lm, se = FALSE)

hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = lm, se = FALSE, colour = "black")
```

### Discussion

The linear regression line is not the only way of fitting a model to the data -- in fact, it's not even the default. If you add `stat_smooth()` without specifying the method, it will use a LOESS (locally weighted polynomial) curve by default, as shown in Figure \@ref(fig:FIG-SCATTER-FIT-LOESS):

```{r, eval=FALSE}
hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth()

# Equivalent to:
hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = loess)
```

```{r FIG-SCATTER-FIT-LOESS, echo=FALSE, fig.cap="A LOESS fit", fig.width=4, fig.height=4}
hw_sp +
  geom_point(colour = "grey60") +
  stat_smooth(method = loess)
```

It may be useful to specify additional parameters for the modeling function, which in this case is `loess()`. If, for example, you wanted to use `loess(degree = 1)`, you would call `stat_smooth(method = loess, method.args = list(degree = 1))`. The same could be done for other modeling functions like `lm()` or `glm()`.

Another common type of model fit is a logistic regression. Logistic regression isn't appropriate for `heightweight`, but it's perfect for the `biopsy` data set in the `MASS` package. In the `biopsy` data, there are nine different measured attributes of breast cancer biopsies, as well as the class of the tumor, which is either benign or malignant. To prepare the data for logistic regression, we must convert the factor `class`, with the levels `benign` and `malignant`, to a vector with numeric values of 0 and 1. We'll make a copy of the `biopsy` data frame called `biopsy_mod`, then store the numeric coded `class` in a column called `classn`:

```{r}
library(MASS) # Load MASS for the biopsy data set

biopsy_mod <- biopsy %>%
  mutate(classn = recode(class, benign = 0, malignant = 1))

biopsy_mod
```

Although there are many attributes we could examine, for this example we'll just look at the relationship of `V1` (clump thickness) and the `class` of the tumor. Because there is a large degree of overplotting, we'll jitter the points and make them semitransparent (`alpha = 0.4`), hollow (`shape = 21`), and slightly smaller (`size = 1.5`). Then we'll add a fitted logistic regression line (Figure \@ref(fig:FIG-SCATTER-FIT-LOGISTIC)) by telling `stat_smooth()` to use the `glm()` function with `family = binomial`:

```{r FIG-SCATTER-FIT-LOGISTIC, fig.cap="A logistic model"}
ggplot(biopsy_mod, aes(x = V1, y = classn)) +
  geom_point(
    position = position_jitter(width = 0.3, height = 0.06),
    alpha = 0.4,
    shape = 21,
    size = 1.5
  ) +
  stat_smooth(method = glm, method.args = list(family = binomial))
```

If your scatter plot has points grouped by a factor, and that factor is mapped to an aesthetic such as `colour` or `shape`, one fit line will be drawn for each factor level. First we'll make the base plot object `hw_sp`, then we'll add the LOESS lines to it. We'll also make the points less prominent by making them semitransparent, using `alpha = .4` (Figure \@ref(fig:FIG-SCATTER-FIT-GROUP)):

```{r FIG-SCATTER-FIT-GROUP-1, eval=FALSE}
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
    geom_point() +
    scale_colour_brewer(palette = "Set1")

hw_sp +
  geom_smooth()
```

Notice that the blue line, for males, doesn't run all the way to the right side of the plot. There are two reasons for this. The first is that by default, `stat_smooth()` limits the prediction to within the range of the predictor data on the x-axis. The second is that even if it extrapolates, the `loess()` function only offers prediction within the *x* range of the data.

If you want the lines to extrapolate from the data, as shown in the right-hand image of Figure \@ref(fig:FIG-SCATTER-FIT-GROUP), you must use a model method that allows extrapolation, like `lm()`, and pass `stat_smooth()` the option `fullrange = TRUE`:

```{r FIG-SCATTER-FIT-GROUP-2, eval=FALSE}
hw_sp +
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE)
```

```{r FIG-SCATTER-FIT-GROUP, ref.label=c("FIG-SCATTER-FIT-GROUP-1", "FIG-SCATTER-FIT-GROUP-2"), echo=FALSE, fig.show="hold", fig.cap="LOESS fit lines for each group (left); Extrapolated linear fit lines (right)", message=FALSE}
```

In this example with the `heightweight` data set, the default settings for `stat_smooth()` (with `loess` and no extrapolation) may make more sense than the extrapolated linear predictions, because humans don't grow linearly and we don't grow forever.


Adding Fitted Lines from an Existing Model {#RECIPE-SCATTER-FITLINES-MODEL}
------------------------------------------

### Problem

You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.

### Solution

Usually the easiest way to overlay a fitted model is to simply ask `stat_smooth()` to do it for you, as described in Recipe \@ref(RECIPE-SCATTER-FITLINES). Sometimes, however, you may want to create the model yourself and then add it to your graph. This allows you to be sure that the model you're using for other calculations is the same one that you see.

In this example, we'll build a quadratic model using `lm()` with `ageYear` as a predictor of `heightIn`. Then we'll use the `predict()` function and find the predicted values of `heightIn` across the range of values for the predictor, `ageYear`:

```{r}
library(gcookbook) # Load gcookbook for the heightweight data set

model <- lm(heightIn ~ ageYear + I(ageYear^2), heightweight)
model

# Create a data frame with ageYear column, interpolating across range
xmin <- min(heightweight$ageYear)
xmax <- max(heightweight$ageYear)
predicted <- data.frame(ageYear = seq(xmin, xmax, length.out = 100))

# Calculate predicted values of heightIn
predicted$heightIn <- predict(model, predicted)
predicted
```

We can now plot the data points along with the values predicted from the model (as you'll see in Figure \@ref(fig:FIG-SCATTER-FIT-MODEL)):

```{r FIG-SCATTER-FIT-MODEL-1, eval=FALSE}
# Create a base plot called `hw_sp` (for heightweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
    geom_point(colour = "grey40")

hw_sp +
  geom_line(data = predicted, size = 1)
```

### Discussion

Any model object (e.g. `lm`) can be used, so long as it has a corresponding `predict()` method. For example, `lm` has `predict.lm()`, loess has `predict.loess()`, and so on. Adding lines from a model can be simplified by using the function `predictvals()`, defined below. If you simply pass a model to `predictvals()`, the function will do the work of finding the variable names and the range of the predictor, and will return a data frame with predictor and predicted values. That data frame can then be passed to `geom_line()` to draw the fitted line, as we did earlier:

```{r}
# Given a model, predict values of yvar from xvar
# This function supports one predictor and one predicted variable
# xrange: If NULL, determine the x range from the model object. If a vector with
#   two numbers, use those as the min and max of the prediction range.
# samples: Number of samples across the x range.
# ...: Further arguments to be passed to predict()
predictvals <- function(model, xvar, yvar, xrange = NULL, samples = 100, ...) {

  # If xrange isn't passed in, determine xrange from the models.
  # Different ways of extracting the x range, depending on model type
  if (is.null(xrange)) {
    if (any(class(model) %in% c("lm", "glm")))
      xrange <- range(model$model[[xvar]])
    else if (any(class(model) %in% "loess"))
      xrange <- range(model$x)
  }

  newdata <- data.frame(x = seq(xrange[1], xrange[2], length.out = samples))
  names(newdata) <- xvar
  newdata[[yvar]] <- predict(model, newdata = newdata, ...)
  newdata
}
```

With the heightweight data set, we'll make a linear model with `lm()` and a LOESS model with `loess()` (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL)):

```{r}
modlinear <- lm(heightIn ~ ageYear, heightweight)
modloess  <- loess(heightIn ~ ageYear, heightweight)
```

Then we can call `predictvals()` on each model, and pass the resulting data frames to `geom_line()`:

```{r FIG-SCATTER-FIT-MODEL-2, eval=FALSE}
lm_predicted    <- predictvals(modlinear, "ageYear", "heightIn")
loess_predicted <- predictvals(modloess, "ageYear", "heightIn")

hw_sp +
  geom_line(data = lm_predicted, colour = "red", size = .8) +
  geom_line(data = loess_predicted, colour = "blue", size = .8)
```

```{r FIG-SCATTER-FIT-MODEL, ref.label=c("FIG-SCATTER-FIT-MODEL-1", "FIG-SCATTER-FIT-MODEL-2"), echo=FALSE, fig.show="hold", fig.cap="A quadratic prediction line from an lm object (left); Prediction lines from linear (in red) and LOESS (in blue) models (right)", fig.width=4, fig.height=4}
```

For `glm` models use a nonlinear link function, you need to specify `type = "response"` to the `predictvals()` function. This is because the default behavior of `glm` is to return predicted values in the scale of the linear predictors, instead of in the scale of the response (*y*) variable.

To illustrate this, we'll use the `biopsy` data set from the `MASS` package. As we did in Recipe \@ref(RECIPE-SCATTER-FITLINES), we'll use `V1` to predict `class`. Since logistic regressions require numeric values from 0 to 1, we need to convert the factor `class` to 0s and 1s:

```{r}
library(MASS) # Load MASS for the biopsy data set

# Using the biopsy data set, create a new column that stores the factor `class` as a numeric variable named `classn`. If `class` == "benign", set `classn` to 0. If `class` == "malignant", set `classn` to 1. Save this new dataset as `biopsy_mod`.

biopsy_mod <- biopsy %>%
  mutate(classn = recode(class, benign = 0, malignant = 1))

biopsy_mod
```

Next, we'll perform the logistic regression:

```{r}
fitlogistic <- glm(classn ~ V1, biopsy_mod, family = binomial)
```

Finally, we'll make the graph with jittered points and the `fitlogistic` line. We'll make the line in a shade of blue by specifying a color in RGB values, and slightly thicker, with `size = 1` (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-LOGISTIC)):

```{r FIG-SCATTER-FIT-MODEL-LOGISTIC, fig.cap="A fitted logistic model"}
# Get predicted values
glm_predicted <- predictvals(fitlogistic, "V1", "classn", type = "response")

ggplot(biopsy_mod, aes(x = V1, y = classn)) +
  geom_point(
    position = position_jitter(width = .3, height = .08),
    alpha = 0.4,
    shape = 21,
    size = 1.5
  ) +
  geom_line(data = glm_predicted, colour = "#1177FF", size = 1)
```


Adding Fitted Lines from Multiple Existing Models {#RECIPE-SCATTER-FITLINES-MODEL-MULTI}
-------------------------------------------------

### Problem

You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.

### Solution

Use the `predictvals()` function from the previous recipe (\@ref(RECIPE-SCATTER-FITLINES-MODEL)) along with functions from the `dplyr` package, including `group_by()` and `do()`.

With the `heightweight` data set, we'll make a linear model for each of the levels of `sex`. The model building is done for each subset of the data frame by specifying the model computation we want within the `do()` function.

The following code splits the `heightweight` data frame into two data frames, by the grouping variable `sex`. This creates a data frame subset for males and a data frame subset for females. We then apply `lm(heightIn ~ ageYear, .)` to each subset. The `.` in `lm(heightIn ~ ageYear)` represents the data frame we are piping in from the previous line -- in this case, the two data frame subsets we have just created. Finally, this code returns a data frame which contains a column, `model`, which is a list of the model outputs corresponding to each level of the grouping variable `sex`, female and male.

```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
library(dplyr)

# Create an lm model object for each value of sex; this returns a data frame
models <- heightweight %>%
  group_by(sex) %>%
  do(model = lm(heightIn ~ ageYear, .)) %>%
  ungroup()

# Print the data frame
models

# Print out the model column of the data frame
models$model
```

Now that we have the list of model objects, we can run the `predictvals()` as defined in \@ref(RECIPE-SCATTER-FITLINES-MODEL) to get the predicted values from each model:

```{r}
predvals <- models %>%
  group_by(sex) %>%
  do(predictvals(.$model[[1]], xvar = "ageYear", yvar = "heightIn"))
```

Finally, we can plot the data with the predicted values (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-MULTI)):

```{r eval=FALSE}
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
  geom_point() +
  geom_line(data = predvals)

# Using facets instead of colors for the groups
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point() +
  geom_line(data = predvals) +
  facet_grid(. ~ sex)
```

```{r FIG-SCATTER-FIT-MODEL-MULTI, echo=FALSE, fig.cap="Predictions from two separate lm objects, one for each subset of data (left); With facets (right)", fig.width = 12}
p1 <- ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
  geom_point() +
  geom_line(data = predvals)

# Using facets instead of colors for the groups
p2 <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point() +
  geom_line(data = predvals) +
  facet_grid(. ~ sex)

p1 + plot_spacer() + p2 + plot_layout(widths = c(5, 1, 9))
```

### Discussion

The `group_by()` and `do()` calls are used for splitting the data into parts, running functions on those parts, and then reassembling the output.

With the preceding code, the *x* range of the predicted values for each group spans the *x* range of each group, and no further; for the males, the prediction line stops at the oldest male, while for females, the prediction line continues further right, to the oldest female. To form prediction lines that have the same *x* range across all groups, we can simply pass in `xrange`, like this:

```{r}
predvals <- models %>%
  group_by(sex) %>%
  do(predictvals(
    .$model[[1]],
    xvar = "ageYear",
    yvar = "heightIn",
    xrange = range(heightweight$ageYear))
  )
```

Then we can plot it, the same as we did before:

```{r FIG-SCATTER-FIT-MODEL-MULTI-RANGE, fig.cap="Predictions for each group extend to the full x range of all groups together"}
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
  geom_point() +
  geom_line(data = predvals)
```

As you can see in Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-MULTI-RANGE), the line for males now extends as far to the right as the line for females. Keep in mind that extrapolating past the data isn't always appropriate, though; whether or not it's justified will depend on the nature of your data and the assumptions you bring to the table.


Adding Annotations with Model Coefficients {#RECIPE-SCATTER-FITLINES-TEXT}
------------------------------------------

### Problem

You want to add numerical information about a model to a plot.

### Solution

To add simple text to a plot, simply add an annotation. In this example, we'll create a linear model and use the `predictvals()` function defined in Recipe \@ref(RECIPE-SCATTER-FITLINES-MODEL) to create a prediction line from the model. Then we'll add an annotation:

```{r}
library(gcookbook) # Load gcookbook for the heightweight data set

model <- lm(heightIn ~ ageYear, heightweight)
summary(model)
```

This shows that the *r^2* value is 0.4249. We'll create a graph and manually add the text using `annotate()` (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-TEXT)):

```{r eval=FALSE}
# First generate prediction data
pred <- predictvals(model, "ageYear", "heightIn")

# Save a base plot
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
    geom_point() +
    geom_line(data = pred)

hw_sp +
  annotate("text", x = 16.5, y = 52, label = "r^2=0.42")
```

Instead of using a plain text string, it's also possible to enter formulas using R's math expression syntax, by using `parse = TRUE`:

```{r eval=FALSE}
hw_sp +
  annotate("text", x = 16.5, y = 52, label = "r^2 == 0.42", parse = TRUE)
```

```{r FIG-SCATTER-FIT-MODEL-TEXT, echo=FALSE, fig.show="hold", fig.cap="Plain text (left); Math expression (right)", fig.width=3.5, fig.height=3.5}
pred <- predictvals(model, "ageYear", "heightIn")

hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
    geom_point() +
    geom_line(data = pred)

hw_sp +
  annotate("text", x = 16.5, y = 52, label = "r^2=0.42")

hw_sp +
  annotate("text", x = 16.5, y = 52, label = "r^2 == 0.42", parse = TRUE)
```

### Discussion

Text geoms in ggplot do not take expression objects directly; instead, they take character strings that can be turned into expressions with R's `parse()` function.

If you use a mathematical expression, the syntax must be correct for the expression to be a valid R expression object. You can test the validity by wrapping the object in the `expression()` function and seeing if it throws an error (make sure *not* to use quotes around the expression). In the example here, `==` is a valid construct in an expression to express equality, but `=` is not:

```{r, eval=FALSE}
expression(r^2 == 0.42) # Valid
expression(r^2 = 0.42)  # Not valid
#> Error: unexpected '=' in "expression(r\^2 ="
```

It's possible to automatically extract values from the model object and build an expression using those values. In this example, we'll create a string which when parsed, yields a valid expression:

```{r}
# Use sprintf() to construct our string.
# The %.3g and %.2g are replaced with numbers with 3 significant digits and 2
# significant digits, respectively. The numbers are supplied after the string.

eqn <- sprintf(
    "italic(y) == %.3g + %.3g * italic(x) * ',' ~~ italic(r)^2 ~ '=' ~ %.2g",
    coef(model)[1],
    coef(model)[2],
    summary(model)$r.squared
  )

eqn

# Test validity by using parse()
parse(text = eqn)
```

Now that we have the expression string, we can add it to the plot. In this example we'll put the text in the bottom-right corner, by setting `x = Inf` and `y = -Inf` and using horizontal and vertical adjustments so that the text all fits inside the plotting area (Figure \@ref(fig:FIG-SCATTER-FIT-MODEL-TEXT-AUTO)):

```{r FIG-SCATTER-FIT-MODEL-TEXT-AUTO, fig.cap="Scatter plot with automatically generated expression", fig.width=4, fig.height=4}
hw_sp +
  annotate(
    "text",
    x = Inf, y = -Inf,
    label = eqn, parse = TRUE,
    hjust = 1.1, vjust = -.5
  )
```

### See Also

The math expression syntax in R can be a bit tricky. See Recipe \@ref(RECIPE-ANNOTATE-TEXT-MATH) for more information.


Adding Marginal Rugs to a Scatter Plot {#RECIPE-SCATTER-RUG}
--------------------------------------

### Problem

You want to add marginal rugs to a scatter plot.

### Solution

Use `geom_rug()`. For this example (Figure \@ref(fig:FIG-SCATTER-RUG)), we'll use the `faithful` data set. This data set has two columns with data about the Old Faithful geyser: `eruptions`, which is the length of each eruption, and `waiting`, which is the length of time until the next eruption:

```{r FIG-SCATTER-RUG, fig.cap="Marginal rug added to a scatter plot", fig.width=5, fig.height=5}
ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_rug()
```

### Discussion

A marginal rug plot is essentially a one-dimensional scatter plot that can be used to visualize the distribution of data on each axis.

In this particular data set, the marginal rug is not as informative as it could be. The resolution of the `waiting` variable is in whole minutes, and because of this, the rug lines have a lot of overplotting. To reduce overplotting, we can jitter the line positions and make them slightly thinner by specifying size (Figure \@ref(fig:FIG-SCATTER-RUG-JITTER)). This helps the viewer see the distribution more clearly:

```{r FIG-SCATTER-RUG-JITTER, fig.cap="Marginal rug with thinner, jittered lines", fig.width=5, fig.height=5}
ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_rug(position = "jitter", size = 0.2)
```

### See Also

For more about overplotting, see Recipe \@ref(RECIPE-SCATTER-OVERPLOT).


Labeling Points in a Scatter Plot {#RECIPE-SCATTER-LABELS}
---------------------------------

### Problem

You want to add labels to points in a scatter plot.

### Solution

For annotating just one or a few points, you can use `annotate()` or `geom_text()`. For this example, we'll use the countries data set and visualize the relationship between health expenditures and infant mortality rate per 1,000 live births. To keep things manageable, we'll filter the data to only look at data from 2009 for a subset of countries that spent more than $2,000 USD per capita:

```{r}
library(gcookbook) # Load gcookbook for the countries data set
library(dplyr)

# Filter the data to only look at 2009 data for countries that spent > 2000 USD per capita
countries_sub <- countries %>%
  filter(Year == 2009 & healthexp > 2000)
```

We'll save the basic scatter plot object in `countries_sp` (for countries scatter plot) and add then add our annotations to it. To manually add annotations, use `annotate()`, and specify the coordinates and label (Figure \@ref(fig:FIG-SCATTER-LABEL), left). It may require some trial-and-error tweaking to get the labels positioned just right:

```{r FIG-SCATTER-LABEL-1, eval=FALSE}
countries_sp <- ggplot(countries_sub, aes(x = healthexp, y = infmortality)) +
    geom_point()

countries_sp +
  annotate("text", x = 4350, y = 5.4, label = "Canada") +
  annotate("text", x = 7400, y = 6.8, label = "USA")
```

To automatically add the labels from your data (Figure \@ref(fig:FIG-SCATTER-LABEL), right), use `geom_text()` and map a column that is a factor or character vector to the label aesthetic. In this case, we'll use `Name`, and we'll make the font slightly smaller to reduce crowding. The default value for `size` is 5, which doesn't correspond directly to a point size:

```{r FIG-SCATTER-LABEL-2, eval=FALSE}
countries_sp +
  geom_text(aes(label = Name), size = 4)
```


```{r FIG-SCATTER-LABEL, ref.label=c("FIG-SCATTER-LABEL-1", "FIG-SCATTER-LABEL-2"), echo=FALSE, fig.show="hold", fig.cap="A scatter plot with manually labeled points (left); With automatically labeled points and a smaller font (right)", fig.width=4.5, fig.height=4.5}
```

As you can see in the center of (Figure \@ref(fig:FIG-SCATTER-LABEL), right), you may find yourself with a plot where labels are overlapping. To automatically adjust point labels so that they don't overlap, we can use `geom_text_repel` (Figure \@ref(fig:FIG-GGREPEL-LABEL), left) or `geom_label_repel` (which adds a box around the label, Figure \@ref(fig:FIG-GGREPEL-LABEL), right) from the ggrepel package, which functions similarly to `geom_text`.


(ref:cap-FIG-GGREPEL-LABEL) A scatter plot labeled with `geom_text_repel` (left); Labeled with `geom_label_repel` (right)

```{r FIG-GGREPEL-LABEL, fig.show="hold", fig.cap="(ref:cap-FIG-GGREPEL-LABEL)", fig.width=4.5, fig.height=4.5}
# Make sure to have installed ggrepel with install.packages("ggrepel")
library(ggrepel)
countries_sp +
  geom_text_repel(aes(label = Name), size = 3)

countries_sp +
  geom_label_repel(aes(label = Name), size = 3)
```


### Discussion

Using `geom_text_repel` or `geom_label_repel` is the easiest way to have nicely-placed labels on a plot. It makes automatic (and random) decisions about label placement, so if exact control over where each label is placed, you should use `annotate()` or `geom_text()`.

The automatic method for placing annotations using `geom_text()` centers each annotation on the *x* and *y* coordinates. You'll probably want to shift the text vertically, horizontally, or both.

Setting `vjust = 0` will make the baseline of the text on the same level as the point (Figure \@ref(fig:FIG-SCATTER-LABEL-VJUST), left), and setting `vjust = 1` will make the top of the text level with the point. This usually isn't enough, though -- you can increase or decrease `vjust` to shift the labels higher or lower, or you can add or subtract a bit to or from the y mapping to get the same effect (Figure \@ref(fig:FIG-SCATTER-LABEL-VJUST), right):

```{r FIG-SCATTER-LABEL-VJUST, fig.show="hold", fig.cap="A scatter plot with vjust=0 (left); With a little extra added to y (right)", fig.width=4.5, fig.height=4.5}
countries_sp +
  geom_text(aes(label = Name), size = 3, vjust = 0)

# Add a little extra to y
countries_sp +
  geom_text(aes(y = infmortality + .1, label = Name), size = 3)
```

It often makes sense to right- or left-justify the labels relative to the points. To left-justify, set `hjust = 0` (Figure \@ref(fig:FIG-SCATTER-LABEL-HJUST), left), and to right-justify, set `hjust = 1`. As was the case with `vjust`, the labels will still slightly overlap with the points. This time, though, it's not a good idea to try to fix it by increasing or decreasing `hjust`. Doing so will shift the labels a distance proportional to the length of the label, making longer labels move further than shorter ones. It's better to just set hjust to 0 or 1, and then add or subtract a bit to or from `x` (Figure \@ref(fig:FIG-SCATTER-LABEL-HJUST), right):

```{r FIG-SCATTER-LABEL-HJUST, fig.show="hold", fig.cap="A scatter plot with hjust=0 (left); With a little extra added to x (right)", fig.width=4.5, fig.height=4.5}
countries_sp +
  geom_text(
    aes(label = Name),
    size = 3,
    hjust = 0
  )

countries_sp +
  geom_text(
    aes(x = healthexp + 100, label = Name),
    size = 3,
    hjust = 0
  )
```

> **Note**
>
> If you are using a logarithmic axis, instead of adding to x or y, you'll need to *multiply* the x or y value by a number to shift the labels a consistent amount.

Besides right- or left-justifying all of your labels, you can also adjust the position of all of the labels at once is to use `position = position_nudge()`. This allows you to specify the amount of vertical or horizontal distance you want to move the labels. As you can see from the figures below (Figure \@ref(fig:FIG-SCATTER-LABEL-NUDGE), this strategy works best when there are fewer labels, or fewer points that can cause overlap with labels. Note that the units you specify with `x = ...` and `y = ...` correspond to the units of the x and y axis.

```{r FIG-SCATTER-LABEL-NUDGE, fig.show="hold", fig.cap="Original scatter plot (left); Scatter plot with labels nudged down and to the right (right)", fig.width=4.5, fig.height=4.5}

countries_sp +
  geom_text(
    aes(x = healthexp + 100, label = Name),
    size = 3,
    hjust = 0
  )

countries_sp +
  geom_text(
    aes(x = healthexp + 100, label = Name),
    size = 3,
    hjust = 0,
    position = position_nudge(x = 100, y = -0.2)
  )
```

If you want to label just some of the points but want the placement to be handled automatically, you can add a new column to your data frame containing just the labels you want. Here's one way to do that: first we'll make a copy of the data we're using, then we'll copy the `Name` column into `plotname`, converting from a factor to a character vector, for reasons we'll see below.

```{r}
cdat <- countries %>%
  filter(Year == 2009, healthexp > 2000) %>%
  mutate(plotname = as.character(Name))
```

Now that `plotname` is a character vector, we can use an `ifelse()` function and the `%in%` operator to identify if each row of `plotname` matches the list of names we want to show on our plot, which we have specified manually below. The `%in%` operator returns a logical vector that allows us to specify within the `ifelse()` function that we want to replace all values of `plotname` that do not match one of our specified names with a blank string.

```{r}
countrylist <- c("Canada", "Ireland", "United Kingdom", "United States",
  "New Zealand", "Iceland", "Japan", "Luxembourg", "Netherlands", "Switzerland")

cdat <- cdat %>%
  mutate(plotname = ifelse(plotname %in% countrylist, plotname, ""))

# Take a look at the resulting `plotname` variable, as compared to the original `Name` variable
cdat %>%
  select(Name, plotname)
```

Now we can make the plot (Figure \@ref(fig:FIG-SCATTER-LABEL-SELECT)). This time, we'll also expand the *x* range so that the text will fit:

```{r FIG-SCATTER-LABEL-SELECT, fig.cap="Scatter plot with selected labels and expanded x range"}
ggplot(cdat, aes(x = healthexp, y = infmortality)) +
  geom_point() +
  geom_text(aes(x = healthexp + 100, label = plotname), size = 4, hjust = 0) +
  xlim(2000, 10000)
```

If any individual position adjustments are needed, you have a couple of options. One option is to copy the columns used for the *x* and *y* coordinates and modify the numbers for the individual items to move the text around. Make sure to use the original numbers for the coordinates of the points, of course!

Finally, another option is to save the output to a vector format such as PDF or SVG (see Recipes Recipe \@ref(RECIPE-OUTPUT-VECTOR) and Recipe \@ref(RECIPE-OUTPUT-VECTOR-SVG)), then edit it in a program like Illustrator or Inkscape.

### See Also

For more on controlling the appearance of the text, see Recipe
\@ref(RECIPE-APPEARANCE-TEXT-APPEARANCE).

If you want to manually edit a PDF or SVG file, see Recipe
\@ref(RECIPE-OUTPUT-EDIT-VECTOR).


Creating a Balloon Plot {#RECIPE-SCATTER-BALLOON}
-----------------------

### Problem

You want to make a balloon plot, where the area of the dots is proportional to their numerical value.

### Solution

Use `geom_point()` with `scale_size_area()`. For this example, we'll filter the data set `countries` to only include data from the year 2009, for certain countries we have specified in `countrylist`:

```{r}
library(gcookbook) # Load gcookbook for the countries data set

countrylist <- c("Canada", "Ireland", "United Kingdom", "United States",
  "New Zealand", "Iceland", "Japan", "Luxembourg", "Netherlands", "Switzerland")

cdat <- countries %>%
  filter(Year == 2009, Name %in% countrylist)

cdat
```

If we just map `GDP` to `size`, the value of `GDP` gets mapped to the *radius* of the dots (Figure \@ref(fig:FIG-SCATTER-BALLOON), left), which is not what we want; a doubling of value results in a quadrupling of area, and this will distort the interpretation of the data. We instead want to map the value of `GDP` to the *area* of the dots, which we can do this using `scale_size_area()` (Figure \@ref(fig:FIG-SCATTER-BALLOON), right):

```{r FIG-SCATTER-BALLOON, fig.show="hold", fig.cap="Balloon plot with value mapped to radius (left); With value mapped to area (right)"}
# Create a base plot using the cdat data frame. We will call this base plot `cdat_sp` (for cdat scatter plot)
cdat_sp <- ggplot(cdat, aes(x = healthexp, y = infmortality, size = GDP)) +
    geom_point(shape = 21, colour = "black", fill = "cornsilk")

# GDP mapped to radius (default with scale_size_continuous)
cdat_sp

# GDP mapped to area instead, and larger circles
cdat_sp +
  scale_size_area(max_size = 15)
```

### Discussion

The example here is a scatter plot, but that is not the only way to use balloon plots. It may also be useful to use balloon plots to represent values on a grid, where the x- and y-axes are categorical, as in Figure \@ref(fig:FIG-SCATTER-BALLOON-CAT):

```{r FIG-SCATTER-BALLOON-CAT, fig.show="hold", fig.cap="Balloon plot with categorical axes and text labels (left); With guide points to help position text (right)", fig.width=4.5, fig.height=4.5}
# Create a data frame that adds up counts for males and females
hec <- HairEyeColor %>%
  # Convert to long format
  as_tibble() %>%
  group_by(Hair, Eye) %>%
  summarize(count = sum(n))

# Create the base balloon plot
hec_sp <- ggplot(hec, aes(x = Eye, y = Hair)) +
  geom_point(aes(size = count), shape = 21, colour = "black", fill = "cornsilk") +
  scale_size_area(max_size = 20, guide = FALSE) +
  geom_text(aes(
    y = as.numeric(as.factor(Hair)) - sqrt(count)/34, label = count),
    vjust = 1.3,
    colour = "grey60",
    size = 4
  )

hec_sp

# Add red guide points
hec_sp +
  geom_point(aes(y = as.numeric(as.factor(Hair)) - sqrt(count)/34), colour = "red", size = 1)
```

In this example we've used a few tricks to add the text labels under the circles. First, we used `vjust = 1.3` to justify the top of text slightly below the *y* coordinate. Next, we wanted to set the *y* coordinate so that it is at the bottom of each circle. This requires a little wrangling and arithmetic: we need to first convert the levels of `Hair` and `Eye` into numeric values, which involves converting these variables from being a character vector to being a factor variable, and then converting them again into a numeric variable. We then take the *numeric* value of `Hair` and subtract a small value from it, where the value depends in some way on count. This actually requires taking the square root of count, since the radius has a linear relationship with the square root of `count`. The number that this value is divided by (34 in this case) is found by trial and error; it depends on the particular data values, radius, text size, and output image size.

To help find the correct *y* offset, we can add guide points in red and adjusted the value until they lined up with the bottom of each circle. Once we have the correct value, we can place the text and remove the points.

The text under the circles is in a shade of grey. This is so that it doesn't jump out at the viewer and overwhelm the perceptual impact of the circles, but is still available if the viewer wants to know the exact values.

### See Also

To add labels to the circles, see Recipes Recipe \@ref(RECIPE-SCATTER-LABELS) and Recipe \@ref(RECIPE-ANNOTATE-TEXT).

See Recipe \@ref(RECIPE-SCATTER-CONTINUOUS-SCATTER) for ways of mapping variables to other aesthetics in a scatter plot.


Making a Scatter Plot Matrix {#RECIPE-SCATTER-SPLOM}
----------------------------

### Problem

You want to make a scatter plot matrix.

### Solution

A scatter plot matrix is an excellent way of visualizing the pairwise relationships among several variables. To make one, use the `pairs()` function from R's base graphics.

For this example, we'll use a subset of the `countries` data. We'll pull out the data for the year 2009, and keep only the columns that are relevant:

```{r}
library(gcookbook) # Load gcookbook for the countries data set

c2009 <- countries %>%
  filter(Year == 2009) %>%
  select(Name, GDP, laborrate, healthexp, infmortality)

c2009
```

To make the scatter plot matrix (Figure \@ref(fig:FIG-SCATTER-SPLOM)), we'll use all of the variables except for `Name`, since making a scatter plot matrix using the names of the countries wouldn't make sense and  would produce strange-looking results:

```{r FIG-SCATTER-SPLOM, fig.cap="A scatter plot matrix", fig.width=7, fig.height=7}
c2009_num <- select(c2009, -Name)
pairs(c2009_num)
```

### Discussion

You can also use customized functions for the panels. To show the correlation coefficient of each pair of variables instead of a scatter plot, we'll define the function `panel.cor`. This will also show higher correlations in a larger font. Don't worry about the details for now -- just paste this code into your R session or script:

```{r}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr")
  on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use = "complete.obs"))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex =  cex.cor * (1 + r) / 2)
}
```

To show histograms of each variable along the diagonal, we'll define `panel.hist`:

```{r}
panel.hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks
  nB <- length(breaks)
  y <- h$counts
  y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "white", ...)
}
```

Both of these panel functions are taken from the `pairs()` help page, so if it's more convenient, you can simply open that help page, then copy and paste. The last line of this version of the `panel.cor` function is slightly modified, however, so that the changes in font size aren't as extreme as with the original.

Now that we've defined these functions we can use them for our scatter plot matrix, by telling `pairs()` to use `panel.cor` for the upper panels and `panel.hist` for the diagonal panels.

We'll also throw in one more thing: `panel.smooth` for the lower panels, which makes a scatter plot and adds a LOWESS smoothed line, as shown in Figure \@ref(fig:FIG-SCATTER-SPLOM-PANELS). (LOWESS is slightly different from LOESS, which we saw in Recipe \@ref(RECIPE-SCATTER-FITLINES), but the differences aren't important for this sort of rough exploratory visualization):

```{r FIG-SCATTER-SPLOM-PANELS, fig.cap="Scatter plot with correlations in the upper triangle, smoothing lines in the lower triangle, and histograms on the diagonal", fig.width=7, fig.height=7}
pairs(
  c2009_num,
  upper.panel = panel.cor,
  diag.panel  = panel.hist,
  lower.panel = panel.smooth
)
```

It may be more desirable to use linear regression lines instead of LOWESS lines. The `panel.lm()` function will do the trick (unlike the previous panel functions, this one isn't in the pairs help page):

```{r}
panel.lm <- function (x, y, col = par("col"), bg = NA, pch = par("pch"),
                      cex = 1, col.smooth = "black", ...) {
  points(x, y, pch = pch, col = col, bg = bg, cex = cex)
  abline(stats::lm(y ~ x),  col = col.smooth, ...)
}
```

This time the default line color is black instead of red, though you can change it here (and with `panel.smooth`) by setting `col.smooth` when you call `pairs()`.

We'll also use small points in the visualization, so that we can distinguish them a bit better (Figure \@ref(fig:FIG-SCATTER-SPLOM-PANELS2)). This is done by setting `pch = "."`:

```{r FIG-SCATTER-SPLOM-PANELS2, fig.cap="Scatter plot matrix with smaller points and linear fit lines", fig.width=7, fig.height=7}
pairs(
  c2009_num,
  upper.panel = panel.cor,
  diag.panel  = panel.hist,
  lower.panel = panel.smooth,
  pch = "."
)
```

The size of the points can also be controlled using the `cex` parameter. The default value for `cex` is 1; make it smaller for smaller points and larger for larger points. Values below .5 might not render properly with PDF output.

### See Also

To create a correlation matrix, see Recipe \@ref(RECIPE-MISCGRAPH-CORRMATRIX).

It is worth noting that we didn't use ggplot here because it doesn't make scatter plot matrices (at least, not well).

Other packages like `GGally` have been developed as extensions to ggplot to fill in this gap. The `ggpairs()` function from the `GGally` package makes scatter plot matrices, for example.