Aims and objectives

Aim:

Present the fundamentals of creating high-quality charts in R with the ggplot2 package.

Objectives:

  • Introduce ggplot2
  • Explain the “grammar of graphics” approach
  • Show how to construct a range of plots
  • Signpost resources

This workshop is based on the free eBook “R for Data Science” (http://r4ds.had.co.nz), reproduced here under the CC BY-NC-ND 3.0 US license.

Part 1

The topics for this first part will include the following:

  • Base R plotting
  • The grammar of graphics
  • Aesthetics
  • Facets

Base R plotting

Base R has built-in plotting functions, such as plot().

# Random number seed
set.seed(5)

# Scatter plot
plot(rnorm(20), rnorm(20))

ggplot2 - Basics

Grammar of graphics

The ggplot2 package is a part of the Tidyverse, a popular collection of packages designed for data science.

ggplot2 follows the “grammar of graphics” principles that plots can be composed from:

  • A data set
  • Mappings of variables to aesthetic features
  • Geometric objects to represent the data (“geoms”)
  • Scales - properties of the visual mappings
  • Facets (for small multiples)
  • A coordinate system
  • Annotations
  • A theme (plot styling)

We will use package::function() syntax throughout this course instead of library() calls. This makes it easier to see where functions come from, avoids “function masking” and improves the reproducibility and reusability of code.
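As a small illustration of function masking (a sketch, not part of the original workshop material): both the stats package, which R attaches at startup, and dplyr export a function called filter(). Naming the package removes any doubt about which one runs.

```r
# stats::filter() applies a linear filter to a numeric series;
# here it computes a 3-point moving average
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that match a condition
four_cyl <- dplyr::filter(ggplot2::mpg, cyl == 4)

class(moving_avg)  # a time-series object
class(four_cyl)    # a data frame (tibble)
```

Without the package prefix, the function that runs depends on which packages are attached and in what order.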

Exceptions to this are ggplot2’s aes() and vars() functions, which are always nested inside other ggplot2 functions. We will use the import package to load those functions as well as the mpg dataset:

import::from("ggplot2", "aes", "vars", "mpg")

Loading the data

For this intro we’ll use the dataset mpg, which is included in the ggplot2 package. It contains information on cars’ fuel efficiency. We can use dplyr::glimpse(mpg), head(mpg) or simply mpg to have a quick look at the data.

# Examine the mpg dataset
dplyr::glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

displ contains the cars’ engine size in litres
hwy contains the cars’ fuel efficiency on the motorway in miles per gallon

A minimal plot

ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy)) +
  ggplot2::geom_point()

  • Contains data, aesthetic mappings and a geom function

  • Add components with the + operator

Setting the aesthetics

  • aes() can be called inside ggplot() or a geom function.
  • Setting aesthetics in a geom function allows you to add layers with different data or aesthetic mappings.
ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy)) +
  ggplot2::geom_point() +
  ggplot2::geom_point(aes(x = displ, y = cty), colour = "blue")

General chart template

# chart template
ggplot(data = <DATA>, aes(<MAPPINGS>)) +
  <GEOM_FUNCTION> +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <THEME_FUNCTION>

Required:

  • data
  • aesthetic mappings
  • a geometric function

The other functions are optional, with default values used if they are not specified.
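A filled-in version of the template might look like the sketch below. The particular coordinate, facet, scale and theme functions are illustrative choices rather than the only options, and ggplot2::aes() and ggplot2::vars() are written in full so that the block runs on its own.

```r
p_template <-
  ggplot2::ggplot(data = ggplot2::mpg, ggplot2::aes(x = displ, y = hwy)) +  # data and mappings
  ggplot2::geom_point() +                                                   # geom function
  ggplot2::coord_cartesian(ylim = c(10, 45)) +                              # coordinate function
  ggplot2::facet_wrap(ggplot2::vars(drv)) +                                 # facet function
  ggplot2::scale_x_continuous(name = "Engine size (litres)") +              # scale function
  ggplot2::theme_minimal()                                                  # theme function

p_template
```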

Exercises: Section 1

  1. Run ggplot2::ggplot(mpg, aes(x = displ, y = cty)). Why is the data not plotted?
  2. What does the drv variable describe? Read the help for the mpg data set to find out by running ?ggplot2::mpg
  3. Make a scatter plot of hwy vs. cty
  4. Make a scatter plot of urban fuel consumption vs engine size (displacement).

Note

You will need to run the following first to import the necessary functions and data:

import::from("ggplot2", "aes", "vars", "mpg")

Aesthetics

Typical aesthetics: x axis position, y axis position, colour, fill, size, alpha (transparency).

For example, we can use the point colour to show the class of vehicle.

# Aesthetics
ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy, colour = class)) +
  ggplot2::geom_point()

ggplot2 looks at the data type to set a default colour scale - categorical variables get discrete colours, continuous variables get a continuous colour scale. These can be customised using scale_ functions (shown later).

# ggplot2 looks at the data type to set a default colour scale
ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy, colour = year)) +
  ggplot2::geom_point()
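If a numeric variable such as year should be treated as a set of categories, one option is to convert it to a factor inside aes(). This is a minimal sketch; the legend title set with labs() is an illustrative addition.

```r
# Converting year to a factor gives it discrete colours
# instead of a continuous colour gradient
p_year <-
  ggplot2::ggplot(
    data = ggplot2::mpg,
    ggplot2::aes(x = displ, y = hwy, colour = factor(year))
  ) +
  ggplot2::geom_point() +
  ggplot2::labs(colour = "Year")

p_year
```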

You could also use the point shape to distinguish between categories. For example, we use it here in addition to colour to show the vehicle class.

ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy, shape = class, colour = class)) +
  ggplot2::geom_point()

But note there is a problem: suv is missing from the plot because by default, shape only supports six categories (you can override this manually). Some aesthetics are only suitable in certain situations.
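To give all seven classes a shape, the shapes can be supplied manually. A minimal sketch, assuming point shapes 1 to 7 are acceptable (any seven valid shape codes would do):

```r
# Supply one shape per vehicle class so that no class is dropped
p_shape <-
  ggplot2::ggplot(
    data = ggplot2::mpg,
    ggplot2::aes(x = displ, y = hwy, shape = class, colour = class)
  ) +
  ggplot2::geom_point() +
  ggplot2::scale_shape_manual(values = 1:7)

p_shape
```

Even so, seven shapes are hard to tell apart, so colour or faceting is often the better choice for this many categories.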

Scale functions

Use scale_ functions to set visual properties of the aesthetics, such as colours and axis scales.

# Use scale_ functions to set properties of the aesthetics
dplyr::filter(mpg, manufacturer %in% c('audi', 'volkswagen')) |> 
  ggplot2::ggplot(aes(x = displ, y = hwy, colour = manufacturer)) +
  ggplot2::geom_point(size = 3) +
  ggplot2::scale_colour_manual(values = c("audi" = "#12436D", "volkswagen" = "#28A197")) +
  ggplot2::scale_x_continuous(limits = c(1, 5))

Hint: use colours() to get the list of built-in colours. Use hex values to specify any colour.

Setting an aesthetic to a fixed value

You can also set an aesthetic to a fixed value, rather than mapping it to a data variable.

# Set an aesthetic to a fixed value
ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy)) +
  ggplot2::geom_point(colour = "orange")

Exercises: Section 2

1

  • What’s wrong with this code? Why aren’t the points blue?
# What's wrong with this code? Why aren't the points blue?
ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy, colour = "blue")) +
  ggplot2::geom_point()

2

  • Map a continuous variable to colour, size and shape. How do these aesthetics behave differently for categorical vs. continuous variables? Hint: use dplyr::glimpse(mpg) to identify variable types.

3

  • What happens if you map the same variable to multiple aesthetics?

4

  • What happens if you map an aesthetic to something other than a variable name, such as the logical expression aes(colour = displ < 5)?

Faceting (small multiples)

Faceting generates subplots for different values of a data variable.

ggplot2::facet_wrap() splits the data by the specified variable and wraps the subplots into a grid.

ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy)) +
  ggplot2::geom_point() +
  ggplot2::facet_wrap(vars(class))

Use vars() to specify variables for faceting, in the same way as aes() for aesthetics.

ggplot2::facet_grid() can be used to facet by two variables.

ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy)) +
  ggplot2::geom_point() +
  ggplot2::facet_grid(rows = vars(drv), cols = vars(class))

Part 2

The topics for the second part will include:

  • More geoms
  • Chart positioning
  • Themes, titles and multiple plots

Avoiding overplotting

What is wrong with this plot?

# What is wrong with this plot?
ggplot2::ggplot(data = mpg, aes(x = cty, y = hwy)) +
  ggplot2::geom_point()

Because several observations share the same cty and hwy values, some points are overplotted. To correct this, you can add some random noise to the data with position = "jitter" or ggplot2::geom_jitter(). This is a bit of a compromise: you either have a chart that is accurate but suffers from overplotting, or one that contains some random noise but reveals how many observations there are.

# Jittered points (red) reveal the overplotted observations;
# the original points (black) are drawn on top for comparison
ggplot2::ggplot(data = mpg, aes(x = cty, y = hwy)) +
  ggplot2::geom_jitter(colour = 'red') +
  ggplot2::geom_point()
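The amount of noise can be controlled with ggplot2::position_jitter(). In this sketch the width, height, seed and alpha values are arbitrary choices for illustration. Setting seed makes the jitter reproducible between runs.

```r
# Jitter with a fixed amount of noise and a fixed random seed
p_jitter <-
  ggplot2::ggplot(data = ggplot2::mpg, ggplot2::aes(x = cty, y = hwy)) +
  ggplot2::geom_point(
    position = ggplot2::position_jitter(width = 0.3, height = 0.3, seed = 1),
    alpha = 0.5
  )

p_jitter
```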

Fitted lines

ggplot2::geom_smooth() fits a smoothed line to the data and draws it with a confidence band. For example, if we just replace ggplot2::geom_point() with ggplot2::geom_smooth(), we get a loess curve (the default method for data sets of this size).

ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy)) +
  ggplot2::geom_smooth()
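geom_smooth() can also fit other models. A minimal sketch, assuming a straight-line fit is wanted: method = "lm" fits a linear model and se = FALSE hides the confidence band.

```r
# Points with a straight-line (linear model) fit and no confidence band
p_lm <-
  ggplot2::ggplot(data = ggplot2::mpg, ggplot2::aes(x = displ, y = hwy)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```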

We can add the points as well, using the + operator. We have also added a vertical line using ggplot2::geom_line() with a constant x value (this can be used, for example, to mark a significant event in a time series). Note that the order of layers determines their drawing order on the plot. As the points are added after the smooth line, they are drawn on top of (and may obscure) the line.

# We can add the points as well, using the + operator
ggplot2::ggplot(data = mpg, aes(x = displ, y = hwy)) +
  ggplot2::geom_smooth() +
  ggplot2::geom_point() +
  ggplot2::geom_line(aes(x = 4.5))
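For a reference line, ggplot2 also provides ggplot2::geom_vline(), which spans the full height of the panel rather than only the range of the data. A minimal sketch, keeping the x position of 4.5 from the example above; the dashed linetype is an arbitrary styling choice.

```r
# Vertical reference line drawn with geom_vline()
p_vline <-
  ggplot2::ggplot(data = ggplot2::mpg, ggplot2::aes(x = displ, y = hwy)) +
  ggplot2::geom_smooth(method = "loess", formula = y ~ x) +
  ggplot2::geom_point() +
  ggplot2::geom_vline(xintercept = 4.5, linetype = "dashed")

p_vline
```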

Bar charts

Counts of a single discrete variable.

# Bar charts counting rows in the data
ggplot2::ggplot(data = mpg, aes(x = class)) +
  ggplot2::geom_bar()

Bar charts from tabulated data.

# Bar charts from tabulated data
mpg |>
  dplyr::group_by(class) |> 
  dplyr::summarise(count = dplyr::n()) |> 
  ggplot2::ggplot(aes(x = class, y = count)) +
  ggplot2::geom_col() 

Bar chart positioning

How could this chart be made easier to use?

Note: With bar charts, fill = determines the colour inside the bar and colour = controls the lines around the bar.

# Bar chart positioning
ggplot2::ggplot(data = mpg, aes(x = manufacturer, fill = class)) +
  ggplot2::geom_bar(colour = 'black') +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0, hjust = 1))

By default bars are “stacked” on top of each other. This makes comparing proportions rather difficult. You can control this with the position argument. ‘dodge’ places bars next to each other, whilst ‘fill’ scales every stack to the same height (to allow for easier comparison of proportions).

# position = 'dodge' places the bars side by side instead of stacking them
ggplot2::ggplot(data = mpg, aes(x = manufacturer, fill = class)) +
  ggplot2::geom_bar(colour = 'black', position = 'dodge') +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0, hjust = 1))
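A sketch of the ‘fill’ option mentioned above, using the same chart. The y-axis label is set by hand here because the bars now show proportions rather than counts.

```r
# position = 'fill' scales every stack to a total height of 1
p_fill <-
  ggplot2::ggplot(data = ggplot2::mpg, ggplot2::aes(x = manufacturer, fill = class)) +
  ggplot2::geom_bar(colour = "black", position = "fill") +
  ggplot2::labs(y = "Proportion") +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0, hjust = 1))

p_fill
```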

Density plots and histograms

Histograms are a great tool for summarizing and describing a continuous variable.

# Histogram
ggplot2::ggplot(data = mpg, aes(x = hwy)) +
  ggplot2::geom_histogram(bins = 30) 

Density plots are another nice way to visualise a continuous variable.

# Density plot
ggplot2::ggplot(data = mpg, aes(x = hwy)) +
  ggplot2::geom_density() 

We can use the alpha argument to reveal overlapping areas if we plot by group.

# Overlapping density plots
ggplot2::ggplot(data = mpg, aes(x = hwy, fill = drv)) +
  ggplot2::geom_density(alpha = 0.5) 

Boxplots and violin plots

# Box plots
ggplot2::ggplot(data = mpg, aes(x = drv, y = hwy)) +
  ggplot2::geom_boxplot(fill = "red")

# Violin plots
ggplot2::ggplot(data = mpg, aes(x = drv, y = hwy)) +
  ggplot2::geom_violin(aes(fill = drv))

Line charts

Number of unemployed people in the US, from the ggplot2::economics dataset

# Line charts
ggplot2::economics |> 
  ggplot2::ggplot(aes(x = date, y = unemploy)) +
  ggplot2::geom_line()

Exercises: Section 3

  1. Using the mpg dataset create a histogram of cty. What impact do different values for the bins argument have?

  2. Make a density plot of cty by drv. Use the alpha argument to make the overlapping regions visible.

  3. Recreate the US unemployed population line graph. Add another line to show the total population.

Hint: you will need to supply a new geom layer with a new aesthetic mapping.

Labels, themes and multiple plots

Labels

Labels and titles can be added using the labs() layer.

# Labels and titles
ggplot2::ggplot(data = mpg, aes(x = class, y = ggplot2::after_stat(prop), group = 1)) +
  ggplot2::geom_bar() +
  ggplot2::labs(title = "Proportion of sample by class", x = "Class", y = "Proportion")

Multiple plots

Plots can be arranged using the grid.arrange() function from the gridExtra package. First we store the plots in variables using the <- operator.

# Multiple plots
plot1 <-
  ggplot2::ggplot(data = mpg, aes(x = cyl, y = ggplot2::after_stat(prop), group = 1)) +
  ggplot2::geom_bar(fill = "red") +
  ggplot2::labs(title = "Proportion of sample by engine type", 
                x = "Number of cylinders", 
                y = "Proportion")

plot2 <-
  ggplot2::ggplot(data = mpg, aes(x = cty, y = hwy)) +
  ggplot2::geom_jitter(colour = 'red') +
  ggplot2::labs(title = "Fuel efficiency comparison", 
                subtitle = "Note: the points are jittered", 
                x = "City fuel efficiency", 
                y = "Highway fuel efficiency")

# gridExtra
gridExtra::grid.arrange(plot1, plot2, nrow = 1)

Themes

You can change the theme to a number of presets.

# Themes
gridExtra::grid.arrange(
  plot2 + ggplot2::theme_bw(), plot2 + ggplot2::theme_classic(), 
  plot2 + ggplot2::theme_dark(), plot2 + ggplot2::theme_light())

You can also make your own custom themes. Theme components are set using four element functions: element_text, element_line, element_rect, and element_blank. For example:

# You can also make your own custom themes
ugly_theme <-
  ggplot2::theme(
    text = ggplot2::element_text(colour = 'orange', face = 'bold'),
    panel.grid.major = ggplot2::element_line(colour = "violet", linetype = "dashed"),
    panel.grid.minor = ggplot2::element_blank(),
    panel.background = ggplot2::element_rect(fill = 'black', colour = 'red')
  )

plot2 + ugly_theme

Pre-built MoJ and Government Analysis Function themes and colour schemes are available in the mojchart package (https://github.com/moj-analytical-services/mojchart).

# MoJ colour scheme
ggplot2::ggplot(data = mpg, aes(x = class, fill = drv)) +
  ggplot2::geom_bar(position = "dodge") +
  mojchart::scale_fill_moj(n = 3) +
  mojchart::theme_gss(xlabel = TRUE)

The Government Analysis Function colour scheme is accessible.

# Accessible colour scheme
ggplot2::ggplot(data = mpg, aes(x = class, fill = drv)) +
  ggplot2::geom_bar(position = "dodge") +
  mojchart::scale_fill_moj(n = 3, scheme = "govanal_bars") +
  mojchart::theme_gss(xlabel = TRUE)

Exercises: Section 4

  1. Create a basic scatter plot showing highway fuel efficiency vs city fuel efficiency. Apply ggplot2::theme_minimal() and colour the points by the number of cylinders the car has.
    • Hint: here is some code to start you off:
    ggplot2::ggplot(data = mpg) +
       ggplot2::geom_point(mapping = aes(x = hwy, y = cty, colour = as.factor(cyl)))
  2. Change the title and axis labels to something sensible based on what the chart is showing.
  3. Create a custom theme changing:
    • major gridlines to grey, dashed
    • removing minor gridlines
    • x and y axis text to black, pt 10, italic
    • move the legend position to the bottom of the plot
  4. Instead of the custom theme, use the gss theme from mojchart and one of the built-in colour schemes.
    • (Number 4 may be homework if we are running out of time!)

Saving plots

Most of the time plots will be created directly in an R Markdown output or a Shiny app. However, plots can also be saved as an image (PNG) file:

  • ‘Export’ button in RStudio viewer

  • ggplot2::ggsave(filename = "plotname.png", plot = myplot) saves the plot to your current working directory in RStudio. It can then be downloaded from the platform via ‘More’ -> ‘Export…’

In R Markdown or Shiny, the ggplot2 call creates the plot in situ.
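ggsave() also accepts arguments that control the size and resolution of the saved image. In this sketch the dimensions and dpi are arbitrary example values, and the file is written to a temporary directory (an assumption made so that the example does not clutter the working directory).

```r
my_plot <-
  ggplot2::ggplot(data = ggplot2::mpg, ggplot2::aes(x = displ, y = hwy)) +
  ggplot2::geom_point()

# Build a path inside R's temporary directory for this session
out_file <- file.path(tempdir(), "plotname.png")

# Save a 16 cm x 10 cm PNG at 300 dots per inch
ggplot2::ggsave(
  filename = out_file,
  plot = my_plot,
  width = 16,
  height = 10,
  units = "cm",
  dpi = 300
)
```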

Resources

References

Extras

Alternatives

  • Plotly: Plotly is a JavaScript framework for creating interactive charts. The plotly R package has its own grammar for constructing charts, with power and flexibility rivalling that of ggplot2. However, it also has a function, ggplotly(), which can turn a ggplot2 chart into an interactive chart. Find out more here: https://plot.ly/r/
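A minimal sketch of the ggplotly() conversion, assuming the plotly package is installed. The chart itself is an arbitrary example built from mpg.

```r
# Build an ordinary ggplot2 chart first
static_plot <-
  ggplot2::ggplot(
    data = ggplot2::mpg,
    ggplot2::aes(x = displ, y = hwy, colour = class)
  ) +
  ggplot2::geom_point()

# Convert it to an interactive plotly chart
interactive_plot <- plotly::ggplotly(static_plot)

# In an interactive session, printing the object opens it in the viewer
if (interactive()) print(interactive_plot)
```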

Any questions?