5 Data Structures

When developing RAP processes, you should ensure that your data adheres to tidy data formats at each stage - an MoJ level 1 RAP component. At its core this means adhering to three basic principles:

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

In most cases, this also means having a single column of values, with every other column identifying what that value describes.

For example, if you were creating a dataset showing the prison population by age over time, it might look like this:

Age Pop_2017 Pop_2018 Pop_2019
15-17 470 444 409
18-20 3540 3241 3192
21+ 70706 68861 69142

This arrangement is not tidy because each observation doesn’t have its own row: each row contains an observation for 2017, 2018 and 2019, and the year is stored in the column names rather than in a column of its own. To convert this dataset to tidy data, you would rearrange it as follows:

Age Year Population
15-17 2017 470
18-20 2017 3540
21+ 2017 70706
15-17 2018 444
18-20 2018 3241
21+ 2018 68861
15-17 2019 409
18-20 2019 3192
21+ 2019 69142

As you can see from the above example, turning your dataset into tidy data can be thought of as making it longer and thinner, with a single column of values.
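The reshaping described above can be sketched in R with `tidyr::pivot_longer()`. This is a minimal illustration rather than a prescribed method: the object names `prison_wide` and `prison_tidy` are made up for the example, the figures are taken from the tidy table above, and it assumes tidyr 1.1.0 or later (for `names_transform`).

```r
library(tidyr)

# Wide (untidy) layout: one population column per year.
# Object and column names are illustrative assumptions.
prison_wide <- data.frame(
  Age = c("15-17", "18-20", "21+"),
  Pop_2017 = c(470, 3540, 70706),
  Pop_2018 = c(444, 3241, 68861),
  Pop_2019 = c(409, 3192, 69142)
)

# Move the year out of the column names and into its own column,
# leaving a single column of values (Population).
prison_tidy <- pivot_longer(
  prison_wide,
  cols = starts_with("Pop_"),
  names_to = "Year",
  names_prefix = "Pop_",                   # strip "Pop_" so only the year remains
  names_transform = list(Year = as.integer),
  values_to = "Population"
)

print(prison_tidy)
```

The result has nine rows and the columns `Age`, `Year` and `Population`. The rows come out grouped by age rather than by year, but row order does not affect whether the data is tidy.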

There are three main advantages to following a tidy data structure within RAP:

  1. Picking one consistent way of storing data allows you to learn how to use tools and packages within R and apply them across multiple datasets. It also enables you to use and develop functions and packages that will work across multiple datasets because you know how the data will be structured.
  2. Commonly used tools like dplyr and ggplot2 are designed to work with tidy data.
  3. R is designed to work with vectorised data. Placing variables in columns transforms your data into a set of vectors and so it will be easier to work with within R.
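As a rough illustration of the second and third points, the sketch below applies dplyr to the tidy prison population data from earlier in this chapter. The object names (`prison_tidy`, `totals`, `shares`) are assumptions made for the example, not part of any MoJ package.

```r
library(dplyr)

# Tidy version of the example dataset (names are illustrative).
prison_tidy <- data.frame(
  Age = rep(c("15-17", "18-20", "21+"), times = 3),
  Year = rep(c(2017L, 2018L, 2019L), each = 3),
  Population = c(470, 3540, 70706, 444, 3241, 68861, 409, 3192, 69142)
)

# Because Year is a column, totals per year are a one-line grouped summary.
totals <- prison_tidy %>%
  group_by(Year) %>%
  summarise(Total = sum(Population))

# Each column is a vector, so derived variables are vectorised arithmetic:
# here, each age group's share of that year's population.
shares <- prison_tidy %>%
  group_by(Year) %>%
  mutate(Share = Population / sum(Population)) %>%
  ungroup()

print(totals)
print(shares)
```

The same two pipelines would work unchanged on any other dataset with a grouping column and a single value column, which is the practical benefit of keeping one consistent structure.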

For these reasons, you should get your data into tidy format as early as possible in your RAP process, so that every later step can benefit from these advantages.

Sometimes you may want to output data in an ‘untidy’ format, for example a table that will be read by a person rather than by further code. Make sure any such table is a final output, not a dataset that your code goes on to manipulate. All data wrangling should be done with tidy data.
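One way to follow this pattern, sketched here under the assumption that you are using tidyr, is to keep the data tidy throughout and call `pivot_wider()` only as the last step before publishing. The object names `prison_tidy` and `presentation_table` are hypothetical.

```r
library(tidyr)

# Tidy data used for all wrangling (names are illustrative).
prison_tidy <- data.frame(
  Age = rep(c("15-17", "18-20", "21+"), times = 3),
  Year = rep(c(2017L, 2018L, 2019L), each = 3),
  Population = c(470, 3540, 70706, 444, 3241, 68861, 409, 3192, 69142)
)

# Final step only: spread the years into columns for human readers.
# Nothing downstream should read from presentation_table.
presentation_table <- pivot_wider(
  prison_tidy,
  names_from = Year,
  values_from = Population
)

print(presentation_table)
```

The resulting table has one row per age group and one column per year, which is easier to read but should not be fed back into the pipeline.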

For more information on tidy data, see R for Data Science: Tidy data.