Ordnance Survey data¶
This guide walks through the end-to-end process of matching messy data to Ordnance Survey data using uk_address_matcher.
The steps are:
| Step | Indicative timing [1] |
|---|---|
| Create a data package in the Ordnance Survey Data Hub and obtain an API key | about 5 mins |
| Install tooling | about 5 mins |
| Build the canonical dataset | 5 mins for full UK |
| Pre-process for matching (national-scale only) | 5 mins |
| Match | Less than 1 minute |
What is Ordnance Survey data?¶
Ordnance Survey is the UK's authoritative provider of address data. Many public sector organisations can use this data for free under the Public Sector Geospatial Agreement (PSGA).
Why is there a special process for Ordnance Survey data?¶
uk_address_matcher makes no assumptions about the canonical dataset provided by the user.
However, in recognition that many users will be matching to Ordnance Survey data, we provide a streamlined process for downloading and preparing it for matching. There are two main reasons:
- We provide a tool to easily download and extract the data into the required format
- With careful processing of Ordnance Survey files, it's possible to achieve higher accuracy than simply using the raw NGD Builtaddress. Our tool does this processing for you.
Step 1: Create a data package and obtain an API key¶
Create a 'data package' in the OS Data Hub containing the Ordnance Survey data you want to match to.
Choose NGD or AddressBase. Default to NGD if you have no preference.
Optimising accuracy when linking to a local area
If you are matching to a local area, restrict your data package to only that area to make matching faster and more accurate.
- Log in and create a new recipe for your geographical area. Follow the process and create a new data package.
- Navigate to data packages and create a new data package corresponding to the area of interest. Use 'load polygon' to load a pre-defined polygon; these exist for geographic areas such as local authorities.
- Obtain your package_id and version_id.
  The package_id is in the URL for the data package. For instance, it is 18296 in https://osdatahub.os.uk/data/downloads/data-packages/18296
  The version_id can be retrieved by hovering over the download link for your data package, which has the format https://osdatahub.os.uk/api/dataPackages/{package_id}/{version_id}/file
- Obtain your API key and secret from the API Projects page.
You should now have the following values:
- The data package package_id
- The data package version_id
- Your API key and secret (OS_PROJECT_API_KEY and OS_PROJECT_API_SECRET)
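If you have copied the download link, both IDs can be pulled out of it programmatically. The sketch below is illustrative only: parse_download_link is a hypothetical helper, the regular expression simply follows the link format quoted above, and the version ID 67890 is a made-up placeholder.

```python
import re

# Pattern follows the download-link format quoted in this guide:
# https://osdatahub.os.uk/api/dataPackages/{package_id}/{version_id}/file
_DOWNLOAD_LINK = re.compile(
    r"/api/dataPackages/(?P<package_id>[^/]+)/(?P<version_id>[^/]+)/file"
)


def parse_download_link(url: str) -> dict:
    """Return the package_id and version_id embedded in a download link."""
    match = _DOWNLOAD_LINK.search(url)
    if match is None:
        raise ValueError(f"Not a recognised data package download link: {url}")
    return match.groupdict()


# 18296 is the example package_id from this guide; 67890 is a placeholder.
example_ids = parse_download_link(
    "https://osdatahub.os.uk/api/dataPackages/18296/67890/file"
)
print(example_ids)
```

Both values come back as strings, which is usually what a configuration prompt expects.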
Step 2: Create a folder for your project and install tooling¶
Install uv, then
create a project and add uk_address_matcher:
mkdir address_project && cd address_project
uv init --bare
uv add uk_address_matcher
Step 3: Build the canonical dataset¶
Our ukam-os-builder tool
downloads and processes Ordnance Survey data into a Parquet file optimised for
matching.
You can use it as follows. Make sure you have your package_id, version_id, and API key/secret to hand from step 1.
# Config wizard: point the tool to your data package
uvx --from ukam-os-builder ukam-os-setup
# Download and build the flat file
uvx --from ukam-os-builder ukam-os-build
By default the output lands in data/output/. Unless you set num_chunks=1,
the output is a folder of Parquet files representing a single table. DuckDB
reads them as one table via con.read_parquet('data/output/*.parquet').
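Before pointing DuckDB at the glob, it can be worth checking that the build actually wrote Parquet chunks. A minimal standard-library sketch follows; list_parquet_chunks is a hypothetical helper, and the file names in the demo are made up (the builder's real file names may differ).

```python
import tempfile
from pathlib import Path


def list_parquet_chunks(output_dir):
    """Return the Parquet files directly inside output_dir, sorted by name."""
    return sorted(Path(output_dir).glob("*.parquet"))


# Demo against a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chunk_000.parquet", "chunk_001.parquet", "build.log"):
        (Path(tmp) / name).touch()
    demo_chunk_names = [p.name for p in list_parquet_chunks(tmp)]

print(demo_chunk_names)
```

In a real project you would call list_parquet_chunks("data/output") and stop early if the list is empty.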
Step 4: Pre-process for matching (national-scale only)¶
If you're linking to a small canonical dataset (say, fewer than 500,000 rows), then it's simplest to process the data on the fly.
If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.
For a local council region, skip this step. Everything runs on the fly.
import duckdb
from uk_address_matcher import prepare_canonical_folder

con = duckdb.connect()
df_canonical = con.read_parquet("data/output/*.parquet")

prepare_canonical_folder(
    df_canonical,
    output_folder="./ukam_prepared_canonical",
    con=con,
    overwrite=True,
)
The prepared folder can be used for all subsequent matching runs.
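The rule of thumb at the start of this step can be written down as a small check. This is only a sketch: should_prepare_canonical is a hypothetical helper, the 500,000-row limit is the indicative figure from this guide rather than a hard boundary, and the row counts in the demo are illustrative, not measured.

```python
# Indicative threshold from this guide; tune it for your own hardware.
ON_THE_FLY_ROW_LIMIT = 500_000


def should_prepare_canonical(row_count: int, limit: int = ON_THE_FLY_ROW_LIMIT) -> bool:
    """True when a one-time preparation step is likely to pay off."""
    return row_count >= limit


# Illustrative row counts only.
local_council = should_prepare_canonical(120_000)
national = should_prepare_canonical(40_000_000)
print(local_council, national)
```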
Filtering before canonical preparation¶
If you already know that some Ordnance Survey records should never be eligible
matches for your use case, it is usually better to filter them out before
calling prepare_canonical_folder().
This is the right place for organisation-wide policy decisions such as:
- residential-only matching
- excluding parent shells
- excluding clearly unwanted residential subtypes such as garages
- excluding known nuisance classes that your users would not expect to match to
Applying these exclusions before preparation has two advantages:
- Term frequencies and the inverted index are built on the reduced canonical pool, which usually improves discrimination.
- The prepared folder itself becomes policy-aligned, so downstream users are less likely to leave unwanted records in by accident.
By contrast, canonical_address_filter= on AddressMatcher is most useful when
different users need different subsets from the same prepared folder, or when
you are running exploratory experiments.
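The first advantage above can be seen in a toy example. The sketch below counts tokens in a handful of made-up rows before and after a policy filter. It is not how uk_address_matcher computes term frequencies internally; it only illustrates that removing unwanted rows changes the statistics that later steps rely on.

```python
from collections import Counter

# Made-up canonical rows: (classification code, cleaned address text).
canonical_rows = [
    ("RD02", "1 HIGH STREET"),
    ("RD02", "2 HIGH STREET"),
    ("RG02", "GARAGE 1 HIGH STREET"),
    ("RG02", "GARAGE 2 HIGH STREET"),
    ("RC01", "CAR PARK SPACE 3 HIGH STREET"),
]


def term_frequencies(rows):
    """Relative frequency of each token across the given rows."""
    counts = Counter(token for _, address in rows for token in address.split())
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


# Policy filter applied before "preparation": keep dwellings (RD) only.
household_rows = [row for row in canonical_rows if row[0].startswith("RD")]

tf_all = term_frequencies(canonical_rows)
tf_household = term_frequencies(household_rows)

print(tf_all.get("GARAGE"), tf_household.get("GARAGE"))
```

In the filtered pool the token GARAGE disappears entirely, and the house number 1 accounts for a larger share of the remaining tokens.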
Do not treat policy exclusions as text cleaning
Be careful not to mix up two different concerns:
- text cleaning, such as normalising abbreviations or whitespace
- candidate-pool policy, such as deciding that `RG` garages, `PP` parent
shells, or `CAR PARK SPACE` records should not be eligible matches
If a record should not participate in matching, prefer filtering the
canonical rows rather than stripping words from `clean_full_address`.
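The difference between the two concerns is easiest to see side by side. The rows below are made up, and plain Python lists stand in for the canonical table; this is a sketch of the idea rather than the library's own processing.

```python
# Made-up canonical rows for illustration only.
rows = [
    {"classificationcode": "RD02", "clean_full_address": "1 HIGH STREET"},
    {"classificationcode": "RG02", "clean_full_address": "GARAGE 1 HIGH STREET"},
]

# Text cleaning misused as policy: the garage row stays in the pool and now
# looks identical to the house.
stripped = [
    {**row, "clean_full_address": row["clean_full_address"].replace("GARAGE ", "")}
    for row in rows
]

# Candidate-pool policy: the garage row is removed from the pool entirely.
filtered = [row for row in rows if not row["classificationcode"].startswith("RG")]

stripped_addresses = [row["clean_full_address"] for row in stripped]
filtered_addresses = [row["clean_full_address"] for row in filtered]
print(stripped_addresses, filtered_addresses)
```

Stripping the word leaves two indistinguishable candidates for the same address, whereas filtering leaves one.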
AddressBase classification codes¶
The classificationcode column is the main Ordnance Survey field to use for
this. Officially, Ordnance Survey defines the scheme as a hierarchical code with
primary, secondary, tertiary, and quaternary levels.
- Official OS classification documentation: addressclassificationcodevalue
- Practical AddressBase summary: Ideal Postcodes classification guide
Top-level groups
- C: Commercial. Usually exclude for residential-only matching.
- L: Land. Usually exclude unless your messy data contains land parcels or parks.
- M: Military. Usually exclude unless you expect military addresses.
- O: Other, Ordnance Survey only. Usually exclude.
- P: Parent shell. Often exclude from candidate pools.
- R: Residential. Usually include for residential matching.
- U: Unclassified. Review carefully before including.
- X: Dual use. Review carefully, may be useful in mixed datasets.
- Z: Object of interest. Usually exclude from postal-style matching.
Residential secondary groups
- RD: Dwelling. Usually include.
- RG: Garage. Often exclude for household or council-tax matching.
- RH: House in multiple occupation. Often include, depending on the source data.
- RI: Residential institution. Include only if your messy data covers care homes, halls, prisons, and similar settings.
Commercial secondary groups
- CA: Agricultural.
- CB: Ancillary building.
- CC: Community services.
- CE: Education.
- CH: Hotel, motel, boarding, guest house.
- CI: Industrial.
- CL: Leisure.
- CM: Medical.
- CN: Animal centre.
- CO: Office.
- CR: Retail.
- CT: Transport.
- CU: Utility.
- CX: Emergency or rescue service.
Other commonly relevant non-residential groups
- LD: Land, development.
- LP: Land, park.
- PP: Parent shell, property shell.
- OR: Royal Mail infrastructure.
- UC: Awaiting classification.
- UP: Pending internal investigation.
Ordnance Survey also defines deeper levels. For example:
- RD02 means detached house
- CE01 means college
- RI01 means care home
For a full, maintained list of tertiary and quaternary codes, use the official OS page above rather than copying the entire code set into local documentation.
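Because the scheme is hierarchical, the primary and secondary groups are simply the first one and two characters of the code, which is what the substr(classificationcode, 1, 1) and substr(classificationcode, 1, 2) expressions in this guide rely on. A minimal sketch of the same idea in Python, using a hypothetical helper and only the top-level groups listed above:

```python
# Top-level groups as listed in this guide.
PRIMARY_GROUPS = {
    "C": "Commercial",
    "L": "Land",
    "M": "Military",
    "O": "Other",
    "P": "Parent shell",
    "R": "Residential",
    "U": "Unclassified",
    "X": "Dual use",
    "Z": "Object of interest",
}


def classification_prefixes(code: str) -> dict:
    """Split a classification code into its primary and secondary prefixes."""
    code = code.strip().upper()
    if not code or code[0] not in PRIMARY_GROUPS:
        raise ValueError(f"Unrecognised classification code: {code!r}")
    return {
        "primary": code[:1],
        # A bare primary code such as "R" has no secondary level.
        "secondary": code[:2] if len(code) >= 2 else None,
        "primary_label": PRIMARY_GROUPS[code[0]],
    }


detached = classification_prefixes("RD02")
print(detached)
```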
Recommended filtering approach¶
Use this rule of thumb:
| Situation | Best place to apply the rule | Why |
|---|---|---|
| A stable organisation-wide exclusion | Before prepare_canonical_folder() | The prepared data, term frequencies, and index all reflect the reduced candidate pool |
| Different users need different subsets | canonical_address_filter= at match time | You can reuse one prepared folder while still narrowing eligibility |
Best default for most residential OS workflows
Start by keeping only residential rows, excluding known garage and parent shell classes, and then excluding the smaller ancillary residential groups that often contain parking and shared-facility rows.
Quick filtering recipes¶
Residential Ordnance Survey data often contains non-household rows
Some residential records in Ordnance Survey point to car park spaces, garages, bins, communal stores, and similar ancillary objects rather than homes. For household-style matching, you will usually want to omit these before linkage begins.
The recipes below show a practical way to do that quickly.
To keep residential rows only:
df_canonical = df_canonical.filter("substr(classificationcode, 1, 1) = 'R'")
To build a household-style subset and prepare it in one go:
from uk_address_matcher import prepare_canonical_folder

residential_household_filter = (
    "substr(classificationcode, 1, 1) = 'R' "
    "AND substr(classificationcode, 1, 2) <> 'RG' "
    "AND substr(classificationcode, 1, 2) <> 'PP' "
    "AND substr(classificationcode, 1, 2) <> 'RB' "
    "AND substr(classificationcode, 1, 2) <> 'RC' "
    "AND clean_full_address NOT LIKE 'CAR PARK SPACE%' "
    "AND clean_full_address NOT LIKE 'CAR PARK %' "
    "AND clean_full_address NOT LIKE 'PARKING SPACE%' "
    "AND clean_full_address NOT LIKE 'GARAGE %' "
    "AND clean_full_address NOT LIKE 'GARAGES %'"
)

prepare_canonical_folder(
    df_canonical.filter(residential_household_filter),
    output_folder="./ukam_prepared_canonical",
    con=con,
    overwrite=True,
)
This is the clearest starting recipe for a household-style subset. It keeps ordinary residential rows while excluding:
- RG garages
- PP parent shells
- RB ancillary/shared residential rows such as bins, stores, and communal spaces
- RC parking and hardstanding-style residential rows
- parking-style prefixes that still leak through as address text
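If the exclusion lists change often, the predicate can be generated rather than typed by hand. The sketch below is an assumption-laden illustration: build_household_filter is a hypothetical helper, the rows are made up, and the standard-library sqlite3 module stands in for DuckDB so the example needs no extra packages. One difference to be aware of is that SQLite's LIKE is case-insensitive for ASCII text by default, whereas DuckDB's LIKE is case-sensitive.

```python
import sqlite3


def build_household_filter(excluded_groups, excluded_prefixes):
    """Build a SQL predicate that keeps residential rows minus the exclusions."""
    # Values are interpolated directly, so only pass trusted, hard-coded lists.
    clauses = ["substr(classificationcode, 1, 1) = 'R'"]
    clauses += [
        f"substr(classificationcode, 1, 2) <> '{group}'" for group in excluded_groups
    ]
    clauses += [
        f"clean_full_address NOT LIKE '{prefix}%'" for prefix in excluded_prefixes
    ]
    return " AND ".join(clauses)


predicate = build_household_filter(["RG", "RB", "RC"], ["CAR PARK ", "GARAGE "])

# sqlite3 is only a stand-in for DuckDB here.
sqlite_con = sqlite3.connect(":memory:")
sqlite_con.execute(
    "CREATE TABLE canonical (classificationcode TEXT, clean_full_address TEXT)"
)
sqlite_con.executemany(
    "INSERT INTO canonical VALUES (?, ?)",
    [
        ("RD02", "1 HIGH STREET"),
        ("RG02", "GARAGE 1 HIGH STREET"),
        ("RD", "CAR PARK SPACE 2 HIGH STREET"),
        ("CR08", "3 HIGH STREET"),
    ],
)
kept = [
    row[0]
    for row in sqlite_con.execute(
        f"SELECT clean_full_address FROM canonical WHERE {predicate} "
        "ORDER BY clean_full_address"
    )
]
print(kept)
```

The resulting string has the same shape as the hand-written filters in this guide, so it could be passed to a relation's filter() call or to canonical_address_filter= in the same way.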
To filter on address-text prefixes only:
prefix_filter = (
    "clean_full_address NOT LIKE 'CAR PARK SPACE%' "
    "AND clean_full_address NOT LIKE 'CAR PARK %' "
    "AND clean_full_address NOT LIKE 'PARKING SPACE%' "
    "AND clean_full_address NOT LIKE 'GARAGE %' "
    "AND clean_full_address NOT LIKE 'GARAGES %'"
)
df_canonical = df_canonical.filter(prefix_filter)
Use this only when you need a quick local heuristic. It is less robust than excluding the relevant classification groups first.
To select parent shells only:
df_parent_shells = df_canonical.filter("substr(classificationcode, 1, 2) = 'PP'")
If you need different subsets for different users, keep the broader prepared
folder and apply the restriction later with canonical_address_filter=.
Step 5: Match¶
Create a script called match.py containing one of the scripts below, depending on your setup.
If you pass canonical_address_filter=, the value should be a DuckDB SQL
predicate over the canonical rows. In other words, write the part that would go
after WHERE, not a full SELECT statement.
If you skipped Step 4 and are matching on the fly, read the build output directly:
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
df_messy = con.read_parquet("messy_addresses.parquet")
df_canonical = con.read_parquet("data/output/*.parquet")

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
result.matches().show(max_width=10000)
If you prepared a folder in Step 4 and want to apply a filter at match time:
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
df_messy = con.read_parquet("messy_addresses.parquet")

common_exclusions = (
    # Keep residential rows only.
    "substr(classificationcode, 1, 1) = 'R' "
    # Remove garages and parent shells.
    "AND substr(classificationcode, 1, 2) <> 'RG' "
    "AND substr(classificationcode, 1, 2) <> 'PP' "
    # Remove ancillary/shared residential rows and parking-style residential rows.
    "AND substr(classificationcode, 1, 2) <> 'RB' "
    "AND substr(classificationcode, 1, 2) <> 'RC' "
    # Remove parking and garage-style address text that can still leak through.
    "AND clean_full_address NOT LIKE 'CAR PARK SPACE%' "
    "AND clean_full_address NOT LIKE 'CAR PARK %' "
    "AND clean_full_address NOT LIKE 'PARKING SPACE%' "
    "AND clean_full_address NOT LIKE 'GARAGE %' "
    "AND clean_full_address NOT LIKE 'GARAGES %'"
)

matcher = AddressMatcher(
    canonical_addresses="./ukam_prepared_canonical",
    addresses_to_match=df_messy,
    canonical_address_filter=common_exclusions,
    con=con,
)
result = matcher.match()
result.matches().show(max_width=10000)
This is useful when you want to trial a household-style subset quickly without rebuilding the prepared folder first. If the exclusions become standard policy, move them earlier into the Step 4 preparation query.
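Since the filter must be a bare predicate rather than a full statement, a quick sanity check before passing it on can catch the most common mistake. This is a rough heuristic sketch only: check_is_predicate is a hypothetical helper, it is not part of uk_address_matcher, and it does not validate the SQL itself.

```python
import re

# Keywords that suggest a full statement rather than a bare predicate.
_STATEMENT_START = re.compile(r"^\s*(select|with|where)\b", re.IGNORECASE)


def check_is_predicate(sql_filter: str) -> str:
    """Return the filter unchanged if it looks like a bare WHERE predicate."""
    if _STATEMENT_START.match(sql_filter) or ";" in sql_filter:
        raise ValueError("Pass only the condition that would follow WHERE")
    return sql_filter.strip()


ok_filter = check_is_predicate("substr(classificationcode, 1, 1) = 'R'")

try:
    check_is_predicate("SELECT * FROM canonical WHERE classificationcode = 'RD02'")
    rejected = False
except ValueError:
    rejected = True

print(ok_filter, rejected)
```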
If you prepared a folder in Step 4 and need no further filtering:
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
df_messy = con.read_parquet("messy_addresses.parquet")

matcher = AddressMatcher(
    canonical_addresses="./ukam_prepared_canonical",
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
result.matches().show(max_width=10000)
Run with:
uv run match.py
Video¶
The following video shows the end-to-end process of matching council tax data from a local council to the full Ordnance Survey dataset for that council.
[1] Timings on a MacBook M4 Max.