UK Address Matcher
Fast, simple address matching (geocoding) in Python.
Why use this library
- Simple. Set up in seconds, runs on a laptop. No separate infrastructure or services needed.
- Fast. Match 100,000 addresses in ~30 seconds.[^1]
- Proven accuracy. We use public, labelled datasets to measure and document accuracy.
- Support for Ordnance Survey data. We provide an automated build pipeline for users wishing to match to Ordnance Survey data. Matching to any other canonical dataset is also supported.
The end-to-end process of matching 100,000 addresses to Ordnance Survey data, including all software downloads and data processing, takes:[^2]
- Less than a minute if you are matching to a small area such as a local council region.
- If matching to the whole UK, there's a one-time preprocessing step that takes around 10 minutes. Subsequent matching of 100k records takes less than a minute.
Installation
```shell
pip install --pre uk_address_matcher
```
What does it do?
Given the following data:
- a "messy" dataset of addresses that you want to match
- a "canonical" dataset of known addresses, often an Ordnance Survey dataset such as AddressBase or NGD.
this package will find the best matching canonical address for each messy address.
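To make the idea of "best matching canonical address" concrete, here is a deliberately naive toy sketch that scores candidates by token overlap (Jaccard similarity). This is not how `uk_address_matcher` works internally; it uses a swapped-in, much simpler technique purely for illustration, and the helper names `tokens`, `jaccard` and `best_match` are hypothetical.

```python
import re


def tokens(address: str) -> set[str]:
    """Lower-case an address and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", address.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of tokens the two addresses have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(messy_address: str, canonical: dict[str, str]) -> tuple[str, float]:
    """Return (unique_id, score) of the highest-scoring canonical address."""
    scored = ((uid, jaccard(messy_address, addr)) for uid, addr in canonical.items())
    return max(scored, key=lambda pair: pair[1])


# Illustrative canonical addresses (toy data, not from any real dataset).
canonical = {
    "c_1": "Flat A, 10 Demo Road, Townton",
    "c_2": "Flat B, 10 Demo Road, Townton",
    "c_3": "Basement Flat, 10 Demo Road, Townton",
}

best_id, score = best_match("Flat A Example Court, 10 Demo Road, Townton", canonical)
print(best_id, round(score, 2))  # c_1 0.75
```

A naive score like this breaks down quickly on real data (abbreviations, typos, missing tokens), which is the gap a dedicated matcher is meant to fill.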
Example:
Your data should be in the following format:[^3]
Messy data
| unique_id | address_concat | postcode |
|---|---|---|
| m_1 | Flat A Example Court, 10 Demo Road, Townton | AB1 2BC |
| ...more rows |
Canonical data
| unique_id | address_concat | postcode |
|---|---|---|
| c_1 | Flat A, 10 Demo Road, Townton | AB1 2BC |
| c_2 | Flat B, 10 Demo Road, Townton | AB1 2BC |
| c_3 | Basement Flat, 10 Demo Road, Townton | AB1 2BC |
| ...more rows |
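If your source data stores address parts in separate columns, they need to be combined into a single `address_concat` column first. The sketch below shows one way to do that with the Python standard library. The input column names (`id`, `line_1`, `line_2`, `town`, `postcode`) and the helper `to_matcher_row` are hypothetical and not part of `uk_address_matcher`.

```python
import csv
import io

# Column names taken from the tables above.
REQUIRED_COLUMNS = ["unique_id", "address_concat", "postcode"]


def to_matcher_row(raw: dict) -> dict:
    """Map one raw record (hypothetical schema) to the three-column format."""
    parts = [raw.get("line_1", ""), raw.get("line_2", ""), raw.get("town", "")]
    # Join the non-empty address parts with ", " to form address_concat.
    address_concat = ", ".join(p.strip() for p in parts if p and p.strip())
    return {
        "unique_id": raw["id"],
        "address_concat": address_concat,
        "postcode": (raw.get("postcode") or "").strip().upper(),
    }


# Illustrative input record (assumed schema, toy data).
raw_rows = [
    {
        "id": "m_1",
        "line_1": "Flat A Example Court",
        "line_2": "10 Demo Road",
        "town": "Townton",
        "postcode": "ab1 2bc",
    },
]

# Write the converted rows as CSV text; use open(...) instead to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS)
writer.writeheader()
writer.writerows(to_matcher_row(r) for r in raw_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Saved to a file, CSV in this shape could then be loaded with `con.read_csv(...)` in the same way as the example files used in the matching code.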
You can match it as follows:
```python
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()

# Load the messy and canonical example datasets as DuckDB relations.
messy = con.read_csv("example_data/messy_example.csv")
canonical = con.read_csv("example_data/canonical_example.csv")

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
)

result = matcher.match()
result.matches().show(max_width=10000)
```
Example output:
| unique_id | resolved_canonical_id | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|
| m_1 | c_1 | Flat A Example Court, 10 Demo Road, Townton | Flat A, 10 Demo Road, Townton | splink: probabilistic match | 13.5885 | 11.5033 |
The above is recommended if your canonical dataset is relatively small, say, under 1 million rows. If you're matching to a larger canonical dataset, a preprocessing step is recommended. See "Choose whether to pre-process your canonical dataset" for details.
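Once you have output like the example table, you may want to accept only confident matches and send the rest for manual review. The following is a minimal sketch of that idea, splitting rows on the `match_weight` and `distinguishability` columns. The threshold values and the `split_by_confidence` helper are illustrative assumptions, not recommendations from the library, and the second row is made-up data included only to exercise the review branch.

```python
def split_by_confidence(rows, min_weight=10.0, min_distinguishability=5.0):
    """Split match rows into (accepted, review) using illustrative thresholds."""
    accepted, review = [], []
    for row in rows:
        confident = (
            row["match_weight"] >= min_weight
            and row["distinguishability"] >= min_distinguishability
        )
        (accepted if confident else review).append(row)
    return accepted, review


rows = [
    # Values from the example output above.
    {"unique_id": "m_1", "match_weight": 13.5885, "distinguishability": 11.5033},
    # Hypothetical low-confidence row, invented for this sketch.
    {"unique_id": "m_2", "match_weight": 4.2, "distinguishability": 0.3},
]

accepted, review = split_by_confidence(rows)
print([r["unique_id"] for r in accepted])  # ['m_1']
print([r["unique_id"] for r in review])    # ['m_2']
```

Sensible thresholds depend on your data and on how costly a wrong match is, so treat the defaults here as placeholders to be tuned against labelled examples.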
Licence
This project is free and open source and is released under the MIT licence.
Next steps
[^1]: Timings on a MacBook Pro M4 Max.
[^2]: Does not include the time taken to download Ordnance Survey data, since this depends on the speed of your internet connection.
[^3]: The `postcode` column is optional. If you include it, the matcher will use it directly. If you do not, the matcher will attempt to detect and extract postcodes from `address_concat`. `uk_address_matcher` also supports matching addresses that lack a postcode.