Getting started¶
Install¶
`uk_address_matcher` is a Python package available on PyPI. You can install it with pip:

```
pip install uk_address_matcher
```
Input data requirements¶
Both your messy addresses and your canonical addresses need at least these columns:
| Column | Description |
|---|---|
| `unique_id` | Stable unique identifier |
| `address_concat` | Address text, which can include the postcode |
Optionally, you can provide:

| Column | Description |
|---|---|
| `postcode` | If provided, this postcode is used in preference to any postcode in `address_concat` |
| `ukam_label` | The unique ID of the true match. If provided, it enables accuracy analysis output |
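Before passing data to the matcher, it can be worth checking the columns yourself. The helper below is a minimal illustrative sketch, not part of `uk_address_matcher`:

```python
# Required and recognised optional columns, per the tables above
REQUIRED = {"unique_id", "address_concat"}
OPTIONAL = {"postcode", "ukam_label"}

def check_columns(rows):
    """Return the recognised optional columns present in `rows`,
    raising if a required column is missing."""
    columns = set(rows[0]) if rows else set()
    missing = REQUIRED - columns
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return columns & OPTIONAL

messy = [
    {"unique_id": "m_1",
     "address_concat": "Flat A Example Court, 10 Demo Road, Townton",
     "postcode": "AB1 2BC"},
]
print(check_columns(messy))  # {'postcode'}
```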
Choose whether to pre-process your canonical dataset¶
If you're linking to a small canonical dataset (say, fewer than 500,000 rows), it's simplest to process the data on-the-fly.
If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.
For a small canonical dataset, pass the relations directly and the matcher processes them on-the-fly:

```python
import duckdb
import pyarrow as pa

from uk_address_matcher import AddressMatcher

con = duckdb.connect()

# Usually data would be loaded from files.
# It's hard-coded here so this example can be run as-is.
messy = pa.Table.from_pylist([
    {"unique_id": "m_1", "address_concat": "Flat A Example Court, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
])
messy_df = con.from_arrow(messy)

canonical = pa.Table.from_pylist([
    {"unique_id": "c_1", "address_concat": "9 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_2", "address_concat": "Flat A, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_3", "address_concat": "Flat B, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_4", "address_concat": "Flat C, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_5", "address_concat": "Basement Flat, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
])
canonical_df = con.from_arrow(canonical)

matcher = AddressMatcher(
    canonical_addresses=canonical_df,
    addresses_to_match=messy_df,
    con=con,
)
result = matcher.match()

print(result.matches().limit(5).to_df().to_markdown(index=False))
```
| unique_id | resolved_canonical_id | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|
| m_1 | c_2 | Flat A Example Court, 10 Demo Road, Townton | Flat A, 10 Demo Road, Townton | splink: probabilistic match | 13.5885 | 11.5033 |
For a large canonical dataset, run the one-time preparation, then pass the folder path instead of a relation:

```python
import os
import tempfile

import duckdb

from uk_address_matcher import AddressMatcher, prepare_canonical_folder

con = duckdb.connect()

df_canonical = con.read_csv("example_data/canonical_example.csv")
df_messy = con.read_csv("example_data/messy_example.csv")

# One-time preparation
output_folder = tempfile.mkdtemp()
prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
    overwrite=True,
)

# Pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()

print("Prepared folder contents:")
for f in sorted(os.listdir(output_folder)):
    print(f"  {f}")
print()
print(result.matches().limit(5).to_df().to_markdown(index=False))
```
```
Prepared folder contents:
  ukam_canonical_addresses.parquet
  ukam_inverted_index.parquet
  ukam_manifest.json
  ukam_term_frequencies.parquet
```
| unique_id | resolved_canonical_id | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|
| m_1 | c_2 | Flat A Example Court, 10 Demo Road, Townton | Flat A, 10 Demo Road, Townton | splink: probabilistic match | 13.5885 | 11.5033 |
The `output_folder` contains parquet files plus `ukam_manifest.json` (package version, row counts, file hashes) for reproducibility. Subsequent matching exercises that use the same canonical data can reuse this folder, skipping the `prepare_canonical_folder` step.
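When reusing a prepared folder, you can guard against stale or corrupted files by checking them against the manifest. This sketch is illustrative only: the real `ukam_manifest.json` layout is not documented here, so it assumes a hypothetical `{"files": {filename: sha256_hex}}` structure:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def verify_folder(folder):
    """Raise ValueError if any file listed in the (assumed) manifest
    no longer matches its recorded SHA-256 hash."""
    folder = Path(folder)
    manifest = json.loads((folder / "ukam_manifest.json").read_text())
    for name, expected in manifest["files"].items():
        actual = hashlib.sha256((folder / name).read_bytes()).hexdigest()
        if actual != expected:
            raise ValueError(f"{name} has changed since preparation")

# Demo with a throwaway folder standing in for a real prepared one
folder = Path(tempfile.mkdtemp())
(folder / "ukam_canonical_addresses.parquet").write_bytes(b"placeholder")
manifest = {"files": {
    "ukam_canonical_addresses.parquet":
        hashlib.sha256(b"placeholder").hexdigest(),
}}
(folder / "ukam_manifest.json").write_text(json.dumps(manifest))
verify_folder(folder)  # passes silently
```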
Reading results¶
`matcher.match()` returns a `MatchResult` object:

| Property / method | Returns |
|---|---|
| `.matches()` | DuckDB relation with `unique_id`, `resolved_canonical_id`, `match_reason`, and more. |
| `.match_metrics()` | Match-reason breakdown with counts and percentages. |
| `.accuracy_analysis()` | Threshold-based accuracy analysis from labelled data (requires `ukam_label` in messy input). |
| `.splink_predictions()` | Raw Splink predictions (only available when a `SplinkStage` ran). |
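To make the match-reason breakdown concrete, here is a plain-Python sketch of the kind of summary `.match_metrics()` reports. The hard-coded rows are hypothetical, and the real method's output shape may differ:

```python
from collections import Counter

# Hypothetical match rows, each carrying the reason a match was made
matches = [
    {"unique_id": "m_1", "match_reason": "exact match"},
    {"unique_id": "m_2", "match_reason": "splink: probabilistic match"},
    {"unique_id": "m_3", "match_reason": "splink: probabilistic match"},
]

# Count rows per match reason and express each as a percentage
counts = Counter(row["match_reason"] for row in matches)
total = len(matches)
for reason, n in counts.most_common():
    print(f"{reason}: {n} ({100 * n / total:.1f}%)")
```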
Customising stages¶
The default pipeline is `ExactMatchStage` → `SplinkStage`. Pass your own `stages` list to change behaviour:
```python
from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=20.0,
            final_distinguishability_threshold=5.0,
        ),
    ],
)
```
Use `AddressMatcher.available_stages()` to discover registered stage classes. See the API reference for full parameter tables.
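The `SplinkStage` thresholds operate on Splink match weights. In Splink, a match weight `w` is the log2 of the Bayes factor, so it converts to a match probability of `2**w / (1 + 2**w)`; a minimal sketch of that conversion:

```python
def weight_to_probability(w):
    """Convert a Splink match weight (log2 Bayes factor) to a probability."""
    k = 2.0 ** w  # Bayes factor
    return k / (1.0 + k)

# A weight of 0 is even odds; 20 is overwhelming evidence of a match
print(weight_to_probability(0.0))   # 0.5
print(weight_to_probability(20.0))  # very close to 1
```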
Using labelled data¶
If you know the correct match for each address, add a `ukam_label` column to your messy data. It propagates through to results, enabling accuracy analysis with `MatchResult.accuracy_analysis()`.
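As an illustration of what labelled accuracy measures, this sketch (not the library's implementation) compares the resolved match against the label for a couple of hypothetical rows:

```python
# Hypothetical results where ukam_label has propagated through to the output
results = [
    {"unique_id": "m_1", "resolved_canonical_id": "c_2", "ukam_label": "c_2"},
    {"unique_id": "m_2", "resolved_canonical_id": "c_9", "ukam_label": "c_7"},
]

# A row is correct when the resolved canonical ID equals the labelled truth
correct = sum(r["resolved_canonical_id"] == r["ukam_label"] for r in results)
print(f"accuracy: {correct / len(results):.0%}")  # accuracy: 50%
```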