5 Data Structures

When developing RAP processes, you should ensure that your data adheres to tidy data formats at each stage - an MoJ level 1 RAP component. At its core this means adhering to three basic principles:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell

In most cases, this will also mean having only one column of values, with every other column identifying what that value describes.

For example, if you were creating a dataset showing the prison population by age over time, it might look like this:

Age	Pop_2017	Pop_2018	Pop_2019
15-17	470	444	409
18-20	3540	3241	3192
21+	70706	68861	69142

This way of arranging a dataset is not tidy because each observation doesn’t have its own row. Each row contains an observation for 2017, 2018 and 2019. To convert this dataset to tidy data, you would rearrange it to the following:

Age	Year	Population
15-17	2017	470
18-20	2017	3540
21+	2017	70706
15-17	2018	444
18-20	2018	3241
21+	2018	68861
15-17	2019	409
18-20	2019	3192
21+	2019	69142

As you can see from the above example, turning your dataset into tidy data can be thought of as making it longer and thinner, with a single column of values.
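As a minimal sketch of this reshaping in R, the wide layout can be converted with `pivot_longer()` from the tidyr package. The object names `prison_pop_wide` and `prison_pop_tidy` are illustrative, and the figures are those from the tidy table above; this is one reasonable approach rather than a prescribed method.

```r
library(tidyr)

# Wide (untidy) layout: one population column per year.
# Object names are illustrative; values are taken from the tidy table above.
prison_pop_wide <- data.frame(
  Age = c("15-17", "18-20", "21+"),
  Pop_2017 = c(470, 3540, 70706),
  Pop_2018 = c(444, 3241, 68861),
  Pop_2019 = c(409, 3192, 69142)
)

# Gather the year columns into a Year column and a single Population column.
prison_pop_tidy <- pivot_longer(
  prison_pop_wide,
  cols = starts_with("Pop_"),
  names_to = "Year",
  names_prefix = "Pop_",      # strip the "Pop_" prefix from the column names
  values_to = "Population"
)

# Year arrives as text, so convert it to an integer.
prison_pop_tidy$Year <- as.integer(prison_pop_tidy$Year)

# pivot_longer() keeps each age group's rows together; sort by year
# to match the row order of the table above.
prison_pop_tidy <- prison_pop_tidy[order(prison_pop_tidy$Year), ]
```

The result has nine rows and three columns (`Age`, `Year`, `Population`), with one observation per row.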

There are three main advantages to following a tidy data structure within RAP:

Picking one consistent way of storing data allows you to learn how to use tools and packages within R and apply them across multiple datasets. It also enables you to use and develop functions and packages that will work across multiple datasets because you know how the data will be structured.
Commonly used tools like dplyr and ggplot2 are designed to work with tidy data.
R is designed to work with vectorised data. Placing variables in columns transforms your data into a set of vectors and so it will be easier to work with within R.
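To illustrate the second and third points, the sketch below applies dplyr verbs to the tidy prison population data from the example above. The object names (`prison_pop`, `total_by_year`, `prison_pop_share`) and the derived `Total` and `Share` columns are assumptions made for illustration only.

```r
library(dplyr)

# Tidy layout: one row per age group and year (values from the example above).
prison_pop <- data.frame(
  Age = rep(c("15-17", "18-20", "21+"), times = 3),
  Year = rep(c(2017L, 2018L, 2019L), each = 3),
  Population = c(470, 3540, 70706, 444, 3241, 68861, 409, 3192, 69142)
)

# Because each variable is a column, grouping and summarising need no reshaping.
total_by_year <- prison_pop %>%
  group_by(Year) %>%
  summarise(Total = sum(Population))

# Columns are vectors, so arithmetic applies to every row at once.
prison_pop_share <- prison_pop %>%
  group_by(Year) %>%
  mutate(Share = Population / sum(Population)) %>%
  ungroup()
```

The same code would work unchanged if more years or age groups were added as extra rows, which is much harder to achieve when each year has its own column.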

For these reasons, you should get your data into tidy format as early as possible in your RAP process, so that every later step can benefit from these advantages.

Sometimes you may want to output data in an ‘untidy’ format, such as a table that will be read by a person rather than by further code. Make sure that any such table is a final output and not a dataset that your code goes on to manipulate. All data wrangling should be done with tidy data.
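One way to follow this pattern, sketched here under the assumption that tidyr is available, is to keep the data tidy throughout and call `pivot_wider()` only as the last step when producing a table for readers. The object names and the output file name are hypothetical.

```r
library(tidyr)

# Tidy data used for all wrangling (values from the example above).
prison_pop <- data.frame(
  Age = rep(c("15-17", "18-20", "21+"), times = 3),
  Year = rep(c(2017L, 2018L, 2019L), each = 3),
  Population = c(470, 3540, 70706, 444, 3241, 68861, 409, 3192, 69142)
)

# Final step only: spread the years across columns for human readers.
presentation_table <- pivot_wider(
  prison_pop,
  names_from = Year,
  values_from = Population,
  names_prefix = "Pop_"
)

# Write the table out as a final product (hypothetical file name):
# write.csv(presentation_table, "prison_population_by_age.csv", row.names = FALSE)
```

Nothing downstream should read `presentation_table` back in; any further analysis should start from `prison_pop`.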

For more information on tidy data, see R for Data Science: Tidy data.