Section 1 Dataset design

This guidance is mainly intended for producers of datasets and data tools that sit alongside statistical publications, although some sections are also relevant to statistical tables and commentary. This section focuses on the structure and design principles underlying open datasets for publication.

When publishing datasets we should seek to meet the same standards across all our datasets. They should:

Be open
Be machine-readable
Follow tidy principles
Use standardised variable categories to facilitate linking between datasets

Following these standards will allow users to reuse the data easily in their own analytical software, without first having to convert it to the format they require. Data should always be published in CSV format, but can also be published in other machine-readable open formats if desired (e.g. XML, JSON).

1.1 Open data

Open data should be available to all, accessible to all and reusable by all. The Open Data Institute (ODI) have produced a short guide to Open Data. Fundamentally, many of the principles of open data are contained in this guidance document. By following the guidance below in the production of published data, producers will be moving towards producing more open data.

1.2 Machine-readable data

At its simplest level, machine-readable data is data that can be read and manipulated using software. Excel tables are a form of machine-readable data, one step up from a pasted image of a table.

However, truly machine-readable data should be provided as pure data files with no formatting. To be open and machine-readable, data should also not require proprietary software in order to read it. It is anticipated that most producers will publish data in CSV format to meet this requirement.
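As a minimal sketch of what a pure data file with no formatting looks like, the following writes a small table to CSV text using only Python's standard csv module. The column names and values are hypothetical and are not taken from any published dataset.

```python
import csv
import io

# Hypothetical example rows; the column names and values are illustrative only.
rows = [
    {"year": 2019, "sex": "female", "count": 120},
    {"year": 2019, "sex": "male", "count": 340},
]


def to_csv_text(records, fieldnames):
    """Serialise a list of dicts as plain CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()        # one header row holding the variable names
    writer.writerows(records)   # one line per row, with no titles, notes or merged cells
    return buffer.getvalue()


csv_text = to_csv_text(rows, ["year", "sex", "count"])
print(csv_text)
```

The output contains nothing except a header row and data rows, so any software that reads CSV can load it without manual clean-up.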

1.3 Tidy Data principles

‘Tidy’ is the term given to a dataset that follows a set of principles that make it easy to analyse, reducing the amount of time a user needs to spend preparing the dataset before analysis. For detailed background on tidy datasets, read Hadley Wickham’s paper (pdf).

For a dataset to be tidy:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

The best way to think of a tidy dataset is that it is an aggregated list of data.

1.3.1 Each variable forms a column

In most tidy datasets, there will be one column of quantitative data. This will usually be a frequency. If you have more than one column containing a frequency then it is almost certain that your dataset is not tidy. For example, if you have multiple columns each containing the relevant value for a different year then the year should be converted to a variable so that there is only a single column of quantitative data. If using R, the easiest way to convert a dataset like this to a tidy dataset is to use the gather function in the tidyr package (superseded in more recent versions of tidyr by pivot_longer).
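The year example above can be sketched in Python using only the standard library. This is an illustration of the reshaping step rather than a recommended tool: the function name wide_to_long, the column names and the figures are all hypothetical.

```python
def wide_to_long(rows, id_columns, variable_name, value_name):
    """Turn every non-identifier column into its own row.

    The old column header becomes a value of `variable_name` and the cell
    becomes a value of `value_name`.
    """
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            long_row = {key: row[key] for key in id_columns}
            long_row[variable_name] = column
            long_row[value_name] = value
            long_rows.append(long_row)
    return long_rows


# Hypothetical untidy data: one frequency column per year.
wide = [
    {"offence": "theft", "2018": 10, "2019": 12},
    {"offence": "fraud", "2018": 4, "2019": 7},
]

# After reshaping there is a single quantitative column ("count")
# and the year is a variable in its own right.
tidy = wide_to_long(wide, ["offence"], "year", "count")
for row in tidy:
    print(row)
```

Each original row produces one output row per year column, so two rows with two year columns become four tidy rows.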

It is not always the case that a tidy dataset will only have a single quantitative column. Multiple quantitative columns can be included in a tidy dataset where they are different measurements of the same observation or group of observations. For example, one column could be the number of offenders given a custodial sentence, and a second column could contain the total number of months those offenders were sentenced to.

Each column in the dataset should be a single variable. Do not construct a variable that concatenates two or more variables. Among other things, this makes it harder to link the dataset to other datasets.
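Where an existing dataset already contains a concatenated variable, it can be split back into its components. The sketch below assumes a hypothetical column sex_age whose values join a sex category and an age group with an underscore; the helper name split_combined and the data are illustrative only.

```python
def split_combined(row, combined_column, new_columns, separator="_"):
    """Replace one concatenated column with one column per component."""
    # Split at most len(new_columns) - 1 times so later parts are kept whole.
    parts = row[combined_column].split(separator, len(new_columns) - 1)
    if len(parts) != len(new_columns):
        raise ValueError(
            f"expected {len(new_columns)} parts in {row[combined_column]!r}"
        )
    result = {key: value for key, value in row.items() if key != combined_column}
    result.update(zip(new_columns, parts))
    return result


# Hypothetical row with two variables squeezed into a single column.
row = {"sex_age": "male_18-24", "count": 15}
split_row = split_combined(row, "sex_age", ["sex", "age_group"])
print(split_row)
```

With sex and age_group held separately, the dataset can be linked to any other dataset that carries either variable on its own.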

1.3.2 Each observation forms a row

Each row should form a single observation, or set of observations, and any given observation must not be included in more than one row. A case-level dataset would be tidy, as each row would contain a single observation. However, a tidy dataset can also group observations together according to the variables in the dataset. This will be true for datasets which contain a frequency column. Although not strictly case-level, these datasets can be easily deconstructed into a case-level dataset by replicating each row according to the frequency column.
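The deconstruction described above can be sketched as follows, assuming the frequency column holds whole-number counts. The function name expand_to_cases, the column names and the data are hypothetical.

```python
def expand_to_cases(rows, frequency_column):
    """Replicate each grouped row once per observation it represents."""
    cases = []
    for row in rows:
        # Keep every variable except the frequency itself.
        case = {key: value for key, value in row.items() if key != frequency_column}
        # Assumes the frequency is a non-negative integer.
        cases.extend(dict(case) for _ in range(row[frequency_column]))
    return cases


# Hypothetical grouped (frequency) dataset.
grouped = [
    {"sex": "female", "outcome": "custody", "count": 2},
    {"sex": "male", "outcome": "custody", "count": 3},
]

cases = expand_to_cases(grouped, "count")
print(len(cases))
```

The case-level result has one row per observation, so its length equals the sum of the frequency column in the grouped dataset.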

A simple test of whether you have a tidy dataset is that summing the frequency column should give you the total number of observations. If it does not then your dataset is not tidy. All quantitative information in a row should be related and be of the same observation(s). If any quantitative information in a row is mutually exclusive with any other quantitative information in that row then they should be on separate rows, with an appropriate column (variable) added to identify them.
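The summing test can be sketched in a few lines of Python. The example assumes the true number of observations is known independently; the column names and figures are hypothetical. The second dataset adds an "all" row, which counts every observation twice and so fails the test.

```python
def frequency_total(rows, frequency_column="count"):
    """Sum the frequency column across all rows."""
    return sum(row[frequency_column] for row in rows)


known_total = 100  # hypothetical number of observations, known from the source data

tidy_rows = [
    {"sex": "female", "count": 40},
    {"sex": "male", "count": 60},
]

# Adding a total row means each observation now appears in two rows.
untidy_rows = tidy_rows + [{"sex": "all", "count": 100}]

print(frequency_total(tidy_rows) == known_total)
print(frequency_total(untidy_rows) == known_total)
```

A mismatch does not say where the problem lies, but it is a quick signal that some observations are included in more than one row or are missing.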

1.4 Standardised variable categories

Standardising the way that data are presented and the way that variables are categorised makes it easier for users to link datasets together, both within MoJ and with datasets from other sources. It also improves the accessibility of data, as a single set of guidance can be applied across all MoJ datasets and users know what to expect when accessing them.

The remainder of this guidance addresses the categories that producers should aim to use in published datasets. Following this agreed set of standards increases the openness of data produced.

1.4.1 Variable naming conventions

This guidance contains a number of standard variable classifications with associated names. Using these variable names when following these classifications will ensure that users know that a variable with a particular name is classified in a particular way.

All these variable names also follow a set of naming conventions. Producers should try to follow these conventions in the naming of other variables:

Variable names should be all in lower case, with words separated by an underscore. This is intended to improve readability and reduce the scope for error in case-sensitive applications.
Groups of related variables should all begin with the same word or prefix (e.g. use ethnicity and ethnic_group, not grouped_ethnicity). This ensures that, when looking through a list of variables, all related variables will be together if the list is arranged alphabetically.
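The conventions above can be checked or applied automatically. The following is a minimal sketch using Python's standard re module; the helper names to_variable_name and is_valid_variable_name are hypothetical, and the exact pattern is one possible reading of the conventions rather than an agreed rule.

```python
import re


def to_variable_name(label):
    """Convert a free-text or camelCase label to lower case with underscores."""
    # Put an underscore at lower-to-upper boundaries, e.g. "ethnicGroup".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label.strip())
    # Collapse spaces, hyphens and other punctuation into single underscores.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()


def is_valid_variable_name(name):
    """True if the name is lower case with words separated by single underscores."""
    return re.fullmatch(r"[a-z][a-z0-9]*(_[a-z0-9]+)*", name) is not None


print(to_variable_name("Ethnic Group"))
print(is_valid_variable_name("grouped-Ethnicity"))

# A shared prefix keeps related variables together in an alphabetical list.
print(sorted(["sex", "ethnicity", "age", "ethnic_group"]))
```

Sorting the example list places ethnic_group and ethnicity next to each other, which is the behaviour the shared-prefix convention is intended to produce.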