5 Data Structures
When developing RAP processes, you should ensure that your data adheres to tidy data formats at each stage - an MoJ level 1 RAP component. At its core this means adhering to three basic principles:
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell
In most cases, this also means having a single column of values, with every other column identifying what that value describes.
For example, if you were creating a dataset showing the prison population by age over time, it might look like this:
Age | Pop_2017 | Pop_2018 | Pop_2019 |
---|---|---|---|
15-17 | 470 | 444 | 409 |
18-20 | 3540 | 3241 | 3192 |
21+ | 70706 | 68861 | 69142 |
This way of arranging a dataset is not tidy because each observation doesn’t have its own row: each row contains observations for 2017, 2018 and 2019, and the year is held in the column names rather than in a column of its own. To convert this dataset to tidy data, you would rearrange it to the following:
Age | Year | Population |
---|---|---|
15-17 | 2017 | 470 |
18-20 | 2017 | 3540 |
21+ | 2017 | 70706 |
15-17 | 2018 | 444 |
18-20 | 2018 | 3241 |
21+ | 2018 | 68861 |
15-17 | 2019 | 409 |
18-20 | 2019 | 3192 |
21+ | 2019 | 69142 |
As you can see from the above example, turning your dataset into tidy data can be thought of as making it longer and thinner, with a single column of values.
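As a minimal sketch of this conversion, the example below uses `pivot_longer()` from the tidyr package, which is one common way to reshape wide data in R. The data frame names `pop_wide` and `pop_tidy` are illustrative, and the values are taken from the tidy table above.

```r
library(tidyr)

# Wide (untidy) layout: one population column per year.
# Object names are illustrative; values come from the tidy table above.
pop_wide <- data.frame(
  Age = c("15-17", "18-20", "21+"),
  Pop_2017 = c(470, 3540, 70706),
  Pop_2018 = c(444, 3241, 68861),
  Pop_2019 = c(409, 3192, 69142)
)

# Move the year out of the column names into its own column,
# leaving a single column of values.
pop_tidy <- pivot_longer(
  pop_wide,
  cols = starts_with("Pop_"),
  names_to = "Year",
  names_prefix = "Pop_",
  names_transform = list(Year = as.integer),
  values_to = "Population"
)

# Sort by year then age group so the rows match the table above.
pop_tidy <- pop_tidy[order(pop_tidy$Year, pop_tidy$Age), ]
```

The result has one row per combination of age group and year, with `Age` and `Year` identifying each value in `Population`.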
There are three main advantages to following a tidy data structure within RAP:
- Picking one consistent way of storing data allows you to learn how to use tools and packages within R and apply them across multiple datasets. It also enables you to use and develop functions and packages that will work across multiple datasets because you know how the data will be structured.
- Commonly used tools like dplyr and ggplot2 are designed to work with tidy data.
- R is designed to work with vectorised data. Placing variables in columns transforms your data into a set of vectors and so it will be easier to work with within R.
For these reasons, you should get your data into tidy format as early as possible in your RAP process, so that every later step can benefit from these advantages.
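To illustrate these advantages, the sketch below applies two dplyr operations to the tidy prison population example. It assumes the dplyr package is installed; the object names `pop_tidy`, `totals` and `pop_share` are illustrative rather than part of any MoJ codebase.

```r
library(dplyr)

# Tidy version of the example dataset (illustrative object name).
pop_tidy <- data.frame(
  Age = rep(c("15-17", "18-20", "21+"), times = 3),
  Year = rep(c(2017, 2018, 2019), each = 3),
  Population = c(470, 3540, 70706, 444, 3241, 68861, 409, 3192, 69142)
)

# Because Year is a column, grouping by it needs no reshaping.
totals <- pop_tidy %>%
  group_by(Year) %>%
  summarise(Total = sum(Population))

# Vectorised arithmetic on the single value column:
# each age group's share of the population within its year.
pop_share <- pop_tidy %>%
  group_by(Year) %>%
  mutate(Share = Population / sum(Population)) %>%
  ungroup()
```

The same code would work unchanged if more years or age groups were added, because new data becomes extra rows rather than extra columns.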
Sometimes you may want to output data in an ‘untidy’ format, for example a table that will be read by a person rather than by further code. Make sure these untidy tables are final outputs only, not datasets that your code goes on to manipulate. All data wrangling should be done with tidy data.
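One way to follow this pattern is to keep the data tidy throughout and reshape it only in the last step. The sketch below assumes the tidyr package and uses `pivot_wider()`; the names `pop_tidy` and `pop_presentation` are illustrative.

```r
library(tidyr)

# Tidy dataset used for all wrangling (illustrative object name).
pop_tidy <- data.frame(
  Age = rep(c("15-17", "18-20", "21+"), times = 3),
  Year = rep(c(2017, 2018, 2019), each = 3),
  Population = c(470, 3540, 70706, 444, 3241, 68861, 409, 3192, 69142)
)

# Final step only: spread the years across columns for a
# human-readable table. Nothing downstream should read this object.
pop_presentation <- pivot_wider(
  pop_tidy,
  names_from = Year,
  values_from = Population,
  names_prefix = "Pop_"
)
```

Keeping the reshape as the last step means any later change to the analysis is made against the tidy data, and the presentation table is simply regenerated.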
For more information on tidy data, see R for Data Science: Tidy data.