Getting started
Install
uk_address_matcher is a Python package available on PyPI. You can install it with pip:
pip install uk_address_matcher
Input data requirements
Both your messy addresses and your canonical addresses need at least these columns:
| Column | Description |
|---|---|
| unique_id | Stable unique identifier |
| address_concat | Address text, which can include the postcode |
Optionally you can provide:
| Column | Description |
|---|---|
| postcode | If provided, this postcode is used in preference to any postcode found in address_concat |
| ukam_label | The unique ID of the true match. If provided, it enables accuracy analysis output |
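As a quick pre-flight check, the required columns listed above can be verified before matching. The sketch below is illustrative only: `missing_required_columns` is a hypothetical helper, not part of uk_address_matcher, and it works on a plain list of column names.

```python
# Hypothetical helper (not part of uk_address_matcher): checks a list of
# column names against the required columns described above.
REQUIRED_COLUMNS = ("unique_id", "address_concat")


def missing_required_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example: a messy dataset that is missing its identifier column
print(missing_required_columns(["address_concat", "postcode"]))
# prints ['unique_id']
```

With DuckDB, the column names of a relation are available as `relation.columns`, so the same check could be applied to a relation before it is passed to the matcher.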
Choose whether to pre-process your canonical dataset
If you're linking to a small canonical dataset (say, fewer than 500,000 rows), then it's simplest to process the data on the fly.
If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.
The examples below use the fictional London datasets from ukam_datasets, which are included for runnable examples.
On-the-fly processing:
import duckdb
from uk_address_matcher import AddressMatcher, ukam_datasets
con = duckdb.connect()
df_messy = ukam_datasets.as_relation("fictional_london_messy", con=con)
df_canonical = ukam_datasets.as_relation("fictional_london_canonical", con=con)
matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
print(result.matches().limit(5).to_df().to_markdown(index=False))
| unique_id | resolved_canonical_id | ukam_label | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|---|
| m_0001872 | c_0005356 | c_0005356 | FLAT 2,131 PRIMROSEWICK WY,WEST ALDER,LONDON | Flat 2, 131 Primrosewick Way, West Alder, London | exact: full match | nan | nan |
| m_0001832 | c_0000916 | c_0000916 | Unit 6, 102 Bartonstone Gdns, East Bramwick, London | Unit 6, 102 Bartonstone Gardens, East Bramwick, London | exact: full match | nan | nan |
| m_0001440 | c_0008498 | c_0008498 | Suite 9,3 Elmhurst St,Kingsford,London | Suite 9, 3 Elmhurst Street, Kingsford, London | splink: probabilistic match | 43.03 | nan |
| m_0001762 | c_0003898 | c_0003898 | UNIT 11, 12 YORKSTNOE ST, MAPLE GREEN, LONDON | Unit 11, 12 Yorkstone Street, Maple Green, London | splink: probabilistic match | 39.6298 | nan |
| m_0001689 | c_0007665 | c_0007665 | Suite 12, 120 Novalane Ave, New Huxley, London | Suite 12, 120 Novalane Avenue, New Huxley, London | exact: full match | nan | nan |
One-time pre-processing:
import duckdb
import os
import tempfile
from uk_address_matcher import (
    AddressMatcher,
    prepare_canonical_folder,
    ukam_datasets,
)
con = duckdb.connect()
df_messy = ukam_datasets.as_relation("fictional_london_messy", con=con)
df_canonical = ukam_datasets.as_relation("fictional_london_canonical", con=con)
# One-time preparation
output_folder = tempfile.mkdtemp()
prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
    overwrite=True,
)
# Pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
print("Prepared folder contents:")
for f in sorted(os.listdir(output_folder)):
    print(f" {f}")
print()
print(result.matches().limit(5).to_df().to_markdown(index=False))
Prepared folder contents:
 ukam_canonical_addresses.parquet
 ukam_inverted_index.parquet
 ukam_manifest.json
 ukam_term_frequencies.parquet

| unique_id | resolved_canonical_id | ukam_label | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|---|
| m_0001872 | c_0005356 | c_0005356 | FLAT 2,131 PRIMROSEWICK WY,WEST ALDER,LONDON | Flat 2, 131 Primrosewick Way, West Alder, London | exact: full match | nan | nan |
| m_0001832 | c_0000916 | c_0000916 | Unit 6, 102 Bartonstone Gdns, East Bramwick, London | Unit 6, 102 Bartonstone Gardens, East Bramwick, London | exact: full match | nan | nan |
| m_0001440 | c_0008498 | c_0008498 | Suite 9,3 Elmhurst St,Kingsford,London | Suite 9, 3 Elmhurst Street, Kingsford, London | splink: probabilistic match | 43.03 | nan |
| m_0001762 | c_0003898 | c_0003898 | UNIT 11, 12 YORKSTNOE ST, MAPLE GREEN, LONDON | Unit 11, 12 Yorkstone Street, Maple Green, London | splink: probabilistic match | 39.6298 | nan |
| m_0001689 | c_0007665 | c_0007665 | Suite 12, 120 Novalane Ave, New Huxley, London | Suite 12, 120 Novalane Avenue, New Huxley, London | exact: full match | nan | nan |
The output_folder contains Parquet files plus ukam_manifest.json
(package version, row counts, file hashes) for reproducibility.
Subsequent matching exercises that use the same canonical data can reuse this folder, skipping the prepare_canonical_folder step.
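Before reusing a prepared folder, it can be worth confirming that the manifest is present. The sketch below assumes only the ukam_manifest.json filename shown above; `load_manifest` and `needs_preparation` are hypothetical helpers, and no particular key names inside the manifest are assumed.

```python
import json
from pathlib import Path


def load_manifest(folder):
    """Return the parsed ukam_manifest.json from `folder`, or None if absent.

    Hypothetical helper for illustration; not part of uk_address_matcher.
    """
    manifest_path = Path(folder) / "ukam_manifest.json"
    if not manifest_path.is_file():
        return None
    with manifest_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def needs_preparation(folder):
    """True when the folder has no manifest, so preparation should run."""
    return load_manifest(folder) is None
```

A caller could run prepare_canonical_folder only when `needs_preparation(output_folder)` is true, and otherwise pass the existing folder path straight to AddressMatcher.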
Reading results
matcher.match() returns a MatchResult object:
| Property / method | Returns |
|---|---|
| .matches() | DuckDB relation with unique_id, resolved_canonical_id, match_reason, and more. |
| .match_metrics() | Match-reason breakdown with counts and percentages. |
| .accuracy_analysis() | Threshold-based accuracy analysis from labelled data (requires ukam_label in messy input). |
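To give a feel for the kind of summary a match-reason breakdown provides, the sketch below recomputes counts and percentages by hand from the five match_reason values in the sample output above. It uses plain Python on hard-coded values and is not how .match_metrics() is implemented; the exact shape of that method's output is not shown here.

```python
from collections import Counter

# match_reason values copied from the five sample rows shown earlier
match_reasons = [
    "exact: full match",
    "exact: full match",
    "splink: probabilistic match",
    "splink: probabilistic match",
    "exact: full match",
]


def reason_breakdown(reasons):
    """Return {reason: (count, percentage)} for a list of match reasons."""
    counts = Counter(reasons)
    total = len(reasons)
    return {
        reason: (count, 100.0 * count / total)
        for reason, count in counts.items()
    }


for reason, (count, pct) in reason_breakdown(match_reasons).items():
    print(f"{reason}: {count} ({pct:.0f}%)")
```

For the five sample rows this reports three exact matches (60%) and two probabilistic matches (40%).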
Customising stages
The default pipeline is ExactMatchStage → SplinkStage. Pass your own
stages list to change behaviour:
from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)
matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=20.0,
            final_distinguishability_threshold=5.0,
        ),
    ],
)
Use AddressMatcher.available_stages() to discover registered stage classes. See Choosing a matching threshold and Optimising accuracy for further advice on accuracy. The API reference documents the full API.
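When picking a value for final_match_weight_threshold, it may help to relate weights to probabilities. The sketch below assumes that match_weight follows Splink's convention of being the base-2 logarithm of the match odds; that is an assumption about this package rather than something stated here, and `weight_to_probability` is a hypothetical helper.

```python
def weight_to_probability(match_weight):
    """Convert a log2-odds match weight to a probability.

    Assumes weight = log2(p / (1 - p)), as in Splink; this is an
    assumption about uk_address_matcher's match_weight column.
    """
    return 1.0 / (1.0 + 2.0 ** (-match_weight))


# Print the implied probability for a few illustrative weights
for weight in (0.0, 5.0, 20.0):
    print(weight, weight_to_probability(weight))
```

Under that assumption, a weight of 0 corresponds to even odds, and each additional unit of weight doubles the odds of a match.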
Using labelled data
If you know the correct match for each address, add a ukam_label column to
your messy data. It propagates through to results, enabling accuracy analysis
with MatchResult.accuracy_analysis().
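As a rough illustration of what labels make possible, the sketch below computes the share of rows whose resolved_canonical_id equals ukam_label, using ID pairs copied from the sample output above. It is a hand-rolled calculation in plain Python with a hypothetical helper, `label_agreement`, and not the implementation of MatchResult.accuracy_analysis(), which performs a threshold-based analysis.

```python
# (resolved_canonical_id, ukam_label) pairs from the sample output
rows = [
    ("c_0005356", "c_0005356"),
    ("c_0000916", "c_0000916"),
    ("c_0008498", "c_0008498"),
    ("c_0003898", "c_0003898"),
    ("c_0007665", "c_0007665"),
]


def label_agreement(pairs):
    """Fraction of pairs where the resolved ID equals the labelled ID."""
    if not pairs:
        raise ValueError("no labelled rows to evaluate")
    correct = sum(1 for resolved, label in pairs if resolved == label)
    return correct / len(pairs)


print(label_agreement(rows))
# prints 1.0
```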