UK Address Matcher¶

Fast, simple address matching (geocoding) in Python.

Why use this library¶

Simple. Set up in seconds, runs on a laptop. No separate infrastructure or services needed.
Fast. Match 100,000 addresses in ~30 seconds.¹
Proven accuracy. We use public, labelled datasets to measure and document accuracy.
Support for Ordnance Survey data. We provide an automated build pipeline for users wishing to match to Ordnance Survey data. Matching to any other canonical dataset is also supported.

The end-to-end process of matching 100,000 addresses to Ordnance Survey data, including all software downloads and data processing, takes:²

Less than a minute if you are matching to a small area such as a local council region.
If matching to the whole UK, there's a one-time preprocessing step that takes around 10 minutes. Subsequent matching of 100k records takes less than a minute.

Installation¶

pip install uk_address_matcher

What does it do?¶

Given the following data:

a "messy" dataset of addresses that you want to match
a "canonical" dataset of known addresses, often an Ordnance Survey dataset such as AddressBase or NGD.

this package will find the best matching canonical address for each messy address.

Example:¶

Your data should be in the following format³:

Messy data¶

unique_id	address_concat	postcode
m_1	Flat A Example Court, 10 Demo Road, Townton	AB1 2BC
...more rows

Canonical data¶

unique_id	address_concat	postcode
c_1	Flat A, 10 Demo Road, Townton	AB1 2BC
c_2	Flat B, 10 Demo Road, Townton	AB1 2BC
c_3	Basement Flat, 10 Demo Road, Townton	AB1 2BC
...more rows
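To try the format out, the two tables above can be written to CSV files with the Python standard library. This is a minimal sketch: the helper `write_address_csv` is a hypothetical name introduced here, not part of uk_address_matcher, and the file paths are simply the ones used in the matching example that follows. Note that it overwrites any existing files at those paths.

```python
import csv
from pathlib import Path

# Column layout described above; the postcode column is optional.
COLUMNS = ["unique_id", "address_concat", "postcode"]

messy_rows = [
    ("m_1", "Flat A Example Court, 10 Demo Road, Townton", "AB1 2BC"),
]
canonical_rows = [
    ("c_1", "Flat A, 10 Demo Road, Townton", "AB1 2BC"),
    ("c_2", "Flat B, 10 Demo Road, Townton", "AB1 2BC"),
    ("c_3", "Basement Flat, 10 Demo Road, Townton", "AB1 2BC"),
]


def write_address_csv(path, rows):
    """Hypothetical helper: write rows to a CSV file with the expected header."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        # Addresses contain commas, so csv.writer quotes those fields.
        writer.writerows(rows)
    return path


write_address_csv("example_data/messy_example.csv", messy_rows)
write_address_csv("example_data/canonical_example.csv", canonical_rows)
```

Using `csv.writer` rather than joining strings by hand keeps the commas inside each address from being read as column separators.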

You can match it as follows:

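import duckdb
from uk_address_matcher import AddressMatcher

# In-memory DuckDB connection shared by the input tables and the matcher
con = duckdb.connect()
messy = con.read_csv("example_data/messy_example.csv")
canonical = con.read_csv("example_data/canonical_example.csv")

# Match each messy address against the canonical dataset
matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()

# A large max_width stops long address columns from being truncated
result.matches().show(max_width=10000)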

Example output:

unique_id	resolved_canonical_id	original_address_concat	original_address_concat_canonical	match_reason	match_weight	distinguishability
m_1	c_1	Flat A Example Court, 10 Demo Road, Townton	Flat A, 10 Demo Road, Townton	splink: probabilistic match	13.5885	11.5033

The approach above is recommended if your canonical dataset is relatively small, say, under 1 million rows. If you're matching to a larger canonical dataset, a preprocessing step is recommended. See "Choose whether to pre-process your canonical dataset" for details.
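As a rough illustration of that rule of thumb, the sketch below counts the rows in a canonical CSV file and compares the count with the 1 million figure. The function names and the constant are hypothetical and are not part of uk_address_matcher; the threshold is guidance from the text above, not a limit the library enforces.

```python
import csv

# Rule-of-thumb threshold taken from the text above (assumption: the
# boundary is approximate, so treat results near it as a judgment call).
SMALL_CANONICAL_MAX_ROWS = 1_000_000


def count_data_rows(csv_path):
    """Count records in a CSV file, excluding the header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row if present
        return sum(1 for _ in reader)


def should_preprocess(n_rows, threshold=SMALL_CANONICAL_MAX_ROWS):
    """Return True when the canonical dataset is at or above the threshold."""
    return n_rows >= threshold
```

For example, `should_preprocess(count_data_rows("example_data/canonical_example.csv"))` would suggest matching directly for a small example file.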

Licence¶

This project is free and open source and is released under the MIT licence.

¹ Timings on a MacBook Pro M4 Max.
² Does not include the time taken to download Ordnance Survey data since this depends on the speed of your internet connection.
³ The postcode column is optional. If you include it, the matcher will use it directly. If you do not, the matcher will attempt to detect and extract postcodes from address_concat. uk_address_matcher also supports matching addresses that lack a postcode.
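To give a feel for what detecting a postcode inside address_concat can involve, here is a minimal sketch using a simplified regular expression. It is an illustration only and is not the library's implementation: the pattern, the function name `split_postcode`, and the choice to take the last match in the string are all assumptions, and the pattern does not cover every valid UK postcode format.

```python
import re

# Simplified UK postcode pattern (assumption): outward code such as "AB1",
# optional whitespace, then inward code such as "2BC".
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b",
    re.IGNORECASE,
)


def split_postcode(address_concat):
    """Return (address_without_postcode, postcode or None)."""
    matches = list(POSTCODE_RE.finditer(address_concat))
    if not matches:
        return address_concat.strip(), None
    m = matches[-1]  # postcodes usually come at the end of an address
    postcode = f"{m.group(1).upper()} {m.group(2).upper()}"
    remainder = address_concat[: m.start()] + address_concat[m.end():]
    return remainder.strip(" ,"), postcode


print(split_postcode("Flat A, 10 Demo Road, Townton, AB1 2BC"))
```

Addresses with no recognisable postcode come back unchanged with `None` in the second position, which mirrors the note that a postcode is not required.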