Section 3 Inclusion of Geographical Identifiers

In accordance with the GSS Geography Policy, all published data should include a variable identifying the geographical coverage of the data. The geographical units used should be GSS standard codes and names for UK statistical geographies where possible. Details of standard statistical geographies can be found on the ONS Open Geography portal. The documents on the portal include a graphical representation of the UK Statistical Geographies.

Producers should try to use variable names, codes and values consistent with ONS standardised codes and provide appropriate metadata to describe the variable. However, a more descriptive variable name can be used if needed.

3.1 National level data

Producers are encouraged to publish data at sub-national level. However, where data are provided without a geographic breakdown, a variable should still be included to denote the overall geographical coverage of the data. In the case of MoJ, this is most likely to be the code for England and Wales.

The relevant ONS variable names and values for this are below:

CTRY18CD   CTRY18NM
K04000001  England and Wales

3.2 Sub-national data

Producers are encouraged to make data available at the lowest geographical level possible, taking into account appropriate considerations around data quality and disclosure control. Where data are being made available at sub-national level, the geographical units used should be one of the recognised UK Statistical Geographies where possible.

3.2.1 Preferred geographies

Within the wide range of UK Statistical Geographies, those preferred within MoJ for sub-national data are (in order of preference):

  • Any Statistical Building Block level (i.e. Output Area, LSOA, MSOA)

  • Local Authority Districts

  • Regions

  • Countries (i.e. England and Wales presented separately)

3.2.2 Single point geographies (Postcodes)

Some sub-national data does not relate to a whole area, but to a single location (e.g. a prison or court). In these cases, the data should include a variable with the postcode of the location. The postcode can be provided either as a variable in the dataset itself or in an accompanying lookup file which lists the postcodes of all locations contained in the dataset.

As a matter of best practice, producers should also use the ONS Open Geography Portal lookup tables to allocate each postcode location to an Output Area and include the Output Area in the dataset to allow users to easily identify the location within other geographical units.
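The allocation of postcodes to Output Areas described above could be sketched as follows. This is a minimal illustration only: the lookup column names (pcds, oa_code), the location names, the postcodes and the Output Area codes are all hypothetical placeholders, and a real lookup table would be downloaded from the ONS Open Geography Portal rather than written inline.

```python
import csv
import io

# Hypothetical extract of a postcode-to-Output-Area lookup. Column names and
# all values are placeholders, not real ONS data.
LOOKUP_CSV = """pcds,oa_code
AB1 2CD,OA_EXAMPLE_1
EF3 4GH,OA_EXAMPLE_2
"""

# Hypothetical dataset of single-point locations with inconsistently
# formatted postcodes.
DATA_CSV = """location_name,postcode,value
Example Prison,ab12cd,18
Example Court,EF3  4GH,23
Unmatched Site,ZZ9 9ZZ,12
"""


def normalise_postcode(raw):
    """Upper-case and strip all whitespace so formatting differences do not block a match."""
    return "".join(raw.split()).upper()


def load_lookup(text):
    """Build a dict from normalised postcode to Output Area code."""
    reader = csv.DictReader(io.StringIO(text))
    return {normalise_postcode(r["pcds"]): r["oa_code"] for r in reader}


def add_output_area(data_text, lookup):
    """Return the dataset rows with an added output_area_code column ('' if unmatched)."""
    rows = []
    for row in csv.DictReader(io.StringIO(data_text)):
        row["output_area_code"] = lookup.get(normalise_postcode(row["postcode"]), "")
        rows.append(row)
    return rows


rows = add_output_area(DATA_CSV, load_lookup(LOOKUP_CSV))
for row in rows:
    print(row["location_name"], row["output_area_code"] or "NO MATCH")
```

Normalising both sides of the match before joining means that differences in spacing or letter case do not leave locations unallocated. Any postcode that still fails to match is left blank so that it can be checked by hand.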

3.2.3 Non-standard geographies

It will sometimes be necessary to publish data according to non-standard geographies that don’t form one of the recognised UK Statistical Geographies (for example, YOT areas). In these cases, producers should provide a lookup between the non-standard geography used and at least one of the preferred geographical units in Section 3.2.1. If this is not possible, a lookup to any other standard geographical unit should be provided. If no such lookup can be provided, appropriate metadata should be supplied instead, explaining the nature of the geographical breakdown used and that it is not compatible with any recognised UK Statistical Geography. This prevents users from attempting matches that cannot be made.

3.3 Dataset formatting

Each geographic variable in a dataset should only relate to a single geographical level. Different levels should not be combined into a single variable as this would violate the principles of Tidy data.
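One way a producer might check this rule before publication is to inspect the three-character prefix of each GSS code, which identifies the type of area. The sketch below is illustrative only: the prefix-to-level mapping is an assumption covering just the prefixes that appear in this section, and the "mixed" column is a hypothetical example built from codes used elsewhere in this section. A complete mapping would need to come from an authoritative source such as the ONS Register of Geographic Codes.

```python
# Assumed mapping from the three-character prefix of a GSS code to a
# geographical level. Only prefixes appearing in this section are listed.
LEVEL_BY_PREFIX = {
    "E07": "local authority district",
    "E09": "local authority district",
    "W06": "local authority district",
    "E12": "region",
    "E92": "country",
    "W92": "country",  # Wales uses the same code as a Region and a Country
    "K04": "england and wales",
}


def levels_in(codes):
    """Return the set of geographical levels found in a column of GSS codes."""
    return {LEVEL_BY_PREFIX.get(code[:3], "unknown") for code in codes}


def is_single_level(codes):
    """True when every code belongs to the same known geographical level."""
    levels = levels_in(codes)
    return len(levels) == 1 and "unknown" not in levels


# Codes taken from the examples in this section.
lad_only = ["E09000002", "E07000039", "W06000015"]
# Hypothetical column that wrongly mixes a district, a region and a country.
mixed = ["E09000002", "E12000007", "E92000001"]

print(is_single_level(lad_only))  # True
print(is_single_level(mixed))     # False
```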

Producers are only required to include one variable which denotes the lowest geographical level of data available in the dataset. Users will then be able to aggregate data up to larger areas if needed. However, producers are encouraged to provide additional variables which show the relevant values for each of the Preferred Geographies in Section 3.2.1 and any others that they deem relevant.

The combination of these points means that datasets shouldn’t contain any aggregated totals for higher geographies. These totals should be calculated by the user by combining the relevant lower-level geographical units, or separate datasets can be provided at each geographical level.
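To illustrate how a user might calculate higher-level totals themselves, the sketch below aggregates local authority values up to country level using a lookup. The codes and values are those used in the examples in this section. The lookup is written inline for illustration, but would normally be read from a lookup file obtained from the ONS Open Geography Portal. The function name is hypothetical.

```python
from collections import defaultdict

# Lowest-level data: local authority code -> value (from this section's examples).
lad_values = {"E09000002": 18, "E07000039": 23, "W06000015": 12}

# Local authority -> country lookup, written inline for illustration only.
lad_to_country = {
    "E09000002": "E92000001",  # Barking and Dagenham -> England
    "E07000039": "E92000001",  # South Derbyshire -> England
    "W06000015": "W92000004",  # Cardiff -> Wales
}


def aggregate(values, lookup):
    """Sum lower-level values into the higher-level units given by the lookup."""
    totals = defaultdict(int)
    for code, value in values.items():
        # A KeyError here means a code is missing from the lookup.
        totals[lookup[code]] += value
    return dict(totals)


country_totals = aggregate(lad_values, lad_to_country)
print(country_totals)  # {'E92000001': 41, 'W92000004': 12}
```

Because the published dataset contains no pre-aggregated rows, the user can sum over any subset of rows without risk of double counting.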

3.3.1 Examples

Data should be arranged to look like the below example:

Example A: Complete Geographical Hierarchy

LAD19CD    LAD19NM               RGN19CD    RGN19NM        CTRY18CD   CTRY18NM  value
E09000002  Barking and Dagenham  E12000007  London         E92000001  England   18
E07000039  South Derbyshire      E12000004  East Midlands  E92000001  England   23
W06000015  Cardiff               W92000004  Wales          W92000004  Wales     12

Example A provides data at all preferred geographical levels. Note that, under the UK Statistical Geographies hierarchy, Wales is both a Region and a Country, and the same code is used in both cases. If needed, the country variables could instead hold the code for England and Wales as a whole, as in Example B.

Example B: England and Wales combined

LAD19CD    LAD19NM               RGN19CD    RGN19NM        CTRY18CD   CTRY18NM           value
E09000002  Barking and Dagenham  E12000007  London         K04000001  England and Wales  18
E07000039  South Derbyshire      E12000004  East Midlands  K04000001  England and Wales  23
W06000015  Cardiff               W92000004  Wales          K04000001  England and Wales  12

While Example B would be valid, this arrangement is less useful than Example A as the user has to perform additional data processing to obtain a figure for England. Producers should also avoid mixing the combined code for England and Wales with the separate codes for England and for Wales in the same variable (and therefore within the same dataset).

For this reason, the code for England and Wales should generally only be used in cases where data is being presented at a national level with no sub-national breakdown at all. Where sub-national data is being made available, the country variable should be used to identify England and Wales separately. In this case, users can obtain an England and Wales total simply by using the total value for the whole dataset, without filtering on any geographical variable.

As detailed above, while Example A demonstrates a best practice dataset layout, relevant lookups to higher level geographies are available to users from the ONS Open Geography Portal. Therefore, the minimum requirement is only for producers to include a variable for the lowest geographical level available, such as in Example C.

Example C: Only lowest geography included

local_authority_code  local_authority_name  value
E09000002             Barking and Dagenham  18
E07000039             South Derbyshire      23
W06000015             Cardiff               12

In Example C, the variables have also been given more understandable names for those not familiar with standard UK Statistical Geography variable names. In this case, producers should also provide users with appropriate metadata to detail which UK Statistical Geography the variables relate to.

As detailed above, multiple Geographical levels should not be included in the same variable. Example D below would therefore be an invalid dataset structure which violates the principles of Tidy Data:

Example D: Invalid structure, combining levels

geography_code  geography_name        value
E09000002       Barking and Dagenham  18
E07000039       South Derbyshire      23
W06000015       Cardiff               12

This example also highlights why the practice of using the ONS geographical variable names should be followed, as it forces different geographical levels to be included as different variables.

If producers wish to provide users with pre-totalled data for higher geographical levels, then each geographical level should be provided in a different dataset, as in Example E.

Example E: Separate datasets for pre-aggregated data

Dataset 1 – ‘LAD Values.csv’

LAD19CD    LAD19NM               value
E09000002  Barking and Dagenham  18
E07000039  South Derbyshire      23
W06000015  Cardiff               12

Dataset 2 – ‘E&W Totals.csv’

CTRY18CD   CTRY18NM           value
K04000001  England and Wales  53
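A producer following Example E might generate the two datasets from the same source rows, so that the pre-aggregated total always agrees with the lower-level data. The sketch below is one possible approach, not a prescribed method. It builds the CSV content in memory, and the helper name to_csv is hypothetical. Writing each string to its own file would complete the process.

```python
import csv
import io

# Local authority rows from Example E, Dataset 1.
lad_rows = [
    ("E09000002", "Barking and Dagenham", 18),
    ("E07000039", "South Derbyshire", 23),
    ("W06000015", "Cardiff", 12),
]


def to_csv(header, rows):
    """Serialise a header and rows to CSV text, one geographical level per dataset."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Dataset 1: values at local authority level.
lad_csv = to_csv(("LAD19CD", "LAD19NM", "value"), lad_rows)

# Dataset 2: the England and Wales total, derived from Dataset 1 rather than
# entered by hand, so the two datasets cannot disagree.
total = sum(value for _, _, value in lad_rows)
ew_csv = to_csv(
    ("CTRY18CD", "CTRY18NM", "value"),
    [("K04000001", "England and Wales", total)],
)

print(lad_csv)
print(ew_csv)
```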