# Ordnance Survey data
This guide walks through the end-to-end process of matching messy address data to Ordnance Survey data using uk_address_matcher.
The steps are:
| Step | Indicative timing[^1] |
|---|---|
| Create a data package in the Ordnance Survey Data Hub and obtain an API key | about 5 mins |
| Install tooling | about 5 mins |
| Build the canonical dataset | 5 mins for full UK |
| Pre-process for matching (national-scale only) | 5 mins |
| Match | Less than 1 minute |
## What is Ordnance Survey data?
Ordnance Survey is the UK's authoritative provider of address data. Many public sector organisations can use this data for free under the Public Sector Geospatial Agreement (PSGA).
## Why is there a special process for Ordnance Survey data?
uk_address_matcher makes no assumptions about the canonical dataset provided by the user.
However, in recognition that many users will be matching to Ordnance Survey data, we provide a streamlined process for downloading and preparing it for matching. There are two main reasons:
- We provide a tool to easily download and extract the data into the required format
- With careful processing of the Ordnance Survey files, it's possible to achieve higher accuracy than by simply using the raw NGD Builtaddress data. Our tool does this processing for you.
## Step 1: Create a data package and obtain an API key
Create a 'data package' in the OS Data Hub containing the Ordnance Survey data you want to match to.
Choose NGD or AddressBase. Default to NGD if you have no preference.
**Optimising accuracy when linking to a local area**
If you are matching to a local area, restrict your data package to only that area to make matching faster and more accurate.
- Log in and create a new recipe for your geographical area, then follow the process to create a new data package.
- Navigate to data packages and create a new data package corresponding to the area of interest. Use 'load polygon' to load a pre-defined polygon; polygons exist for geographic areas such as local authorities.
- Obtain your `package_id` and `version_id`:
    - The `package_id` is in the URL for the data package. For instance, it is `18296` in `https://osdatahub.os.uk/data/downloads/data-packages/18296`.
    - Within the data package, the `version_id` can be retrieved by hovering over the download link for your data package, which is in the format `https://osdatahub.os.uk/api/dataPackages/{package_id}/{version_id}/file`.
- Obtain your API key and secret from the API Projects page.
You should now have the following values:

- The data package `package_id`
- The data package `version_id`
- Your API key and secret (`OS_PROJECT_API_KEY`, `OS_PROJECT_API_SECRET`)
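The key and secret are named like environment variables, so one way to keep them to hand is to export them in your shell before running the builder (an assumption about how you might store them; the setup wizard in step 3 defines how credentials are actually consumed):

```shell
# Record the values from step 1 (replace the placeholders with your own)
export OS_PROJECT_API_KEY="your-api-key"
export OS_PROJECT_API_SECRET="your-api-secret"
```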
## Step 2: Create a folder for your project and install tooling
Install uv, then create a project and add uk_address_matcher:

```shell
mkdir address_project && cd address_project
uv init --bare
uv add uk_address_matcher
```
## Step 3: Build the canonical dataset
Our ukam-os-builder tool
downloads and processes Ordnance Survey data into a Parquet file optimised for
matching.
You can use it as follows. Make sure you have your `package_id`, `version_id`, and API key/secret to hand from step 1.

```shell
# Config wizard — point the tool to your data package
uvx --from ukam-os-builder ukam-os-setup

# Download and build the flat file
uvx --from ukam-os-builder ukam-os-build
```
By default the output lands in `data/output/`. Unless you set `num_chunks=1`,
the output is a folder of Parquet files representing a single table. DuckDB
reads them as one table via `con.read_parquet('data/output/*.parquet')`.
## Step 4: Pre-process for matching (national-scale only)
If you're linking to a small canonical dataset (say, fewer than 500,000 rows), it's simplest to process the data on the fly.
If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.
For a local council region, skip this step — everything runs on the fly.
```python
import duckdb
from uk_address_matcher import prepare_canonical_folder

con = duckdb.connect()
df_canonical = con.read_parquet("data/output/*.parquet")

prepare_canonical_folder(
    df_canonical,
    output_folder="./ukam_prepared_canonical",
    con=con,
    overwrite=True,
)
```
The prepared folder can be used for all subsequent matching runs.
## Step 5: Match
Create a script called `match.py`. If you skipped the pre-processing in step 4, match directly against the Parquet output from step 3:

```python
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
df_messy = con.read_parquet("messy_addresses.parquet")
df_canonical = con.read_parquet("data/output/*.parquet")

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
result.matches().show(max_width=10000)
```

If you pre-processed the canonical data in step 4, point the matcher at the prepared folder instead:

```python
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
df_messy = con.read_parquet("messy_addresses.parquet")

matcher = AddressMatcher(
    canonical_addresses="./ukam_prepared_canonical",
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
result.matches().show(max_width=10000)
```
Run with:

```shell
uv run match.py
```
## Video
The following video shows the end-to-end process of matching council tax data from a local council to the full Ordnance Survey dataset for that council.
[^1]: Timings on a MacBook M4 Max.