
Getting started

Install

uk_address_matcher is a Python package available on PyPI. You can install it with pip:

```bash
pip install uk_address_matcher
```

Input data requirements

Both your messy addresses and your canonical addresses need at least these columns:

| Column | Description |
|---|---|
| `unique_id` | Stable unique identifier |
| `address_concat` | Address text, which can include the postcode |

Optionally you can provide:

| Column | Description |
|---|---|
| `postcode` | If provided, this postcode is used in preference to any postcode in `address_concat` |
| `ukam_label` | The unique ID of the true match. If provided, it enables accuracy analysis output |
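To make the requirements concrete, here is a minimal sketch of valid input rows, together with an illustrative validation check (`check_rows` is not part of the package):

```python
# Rows using the column names from the tables above
messy_rows = [
    {"unique_id": "m_1", "address_concat": "10 Demo Road, Townton AB1 2BC"},
    {"unique_id": "m_2", "address_concat": "Flat B, 10 Demo Road", "postcode": "AB1 2BC"},
]

REQUIRED = {"unique_id", "address_concat"}

def check_rows(rows):
    """Illustrative check: every row has the required columns and IDs are unique."""
    missing = [r for r in rows if not REQUIRED <= r.keys()]
    ids = [r["unique_id"] for r in rows]
    return not missing and len(ids) == len(set(ids))

print(check_rows(messy_rows))  # True
```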

Choose whether to pre-process your canonical dataset

If you're linking to a small canonical dataset (say, fewer than 500,000 rows), it's simplest to process the data on the fly.

If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.

Processing the canonical data on the fly:

```python
import duckdb
import pyarrow as pa
from uk_address_matcher import AddressMatcher

con = duckdb.connect()

# Usually data would be loaded from files.
# It's hard-coded here so this example can be run easily
messy = pa.Table.from_pylist([
    {"unique_id": "m_1", "address_concat": "Flat A Example Court, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
])
messy_df = con.from_arrow(messy)

canonical = pa.Table.from_pylist([
    {"unique_id": "c_1", "address_concat": "9 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_2", "address_concat": "Flat A, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_3", "address_concat": "Flat B, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_4", "address_concat": "Flat C, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_5", "address_concat": "Basement Flat, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
])
canonical_df = con.from_arrow(canonical)

matcher = AddressMatcher(
    canonical_addresses=canonical_df,
    addresses_to_match=messy_df,
    con=con,
)
result = matcher.match()
print(result.matches().limit(5).to_df().to_markdown(index=False))
```
Output:

| unique_id | resolved_canonical_id | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|
| m_1 | c_2 | Flat A Example Court, 10 Demo Road, Townton | Flat A, 10 Demo Road, Townton | splink: probabilistic match | 13.5885 | 11.5033 |
Pre-processing a large canonical dataset:

```python
import duckdb
import os
import tempfile
from uk_address_matcher import AddressMatcher, prepare_canonical_folder

con = duckdb.connect()
df_canonical = con.read_csv("example_data/canonical_example.csv")
df_messy = con.read_csv("example_data/messy_example.csv")

# One-time preparation
output_folder = tempfile.mkdtemp()
prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
    overwrite=True,
)

# Pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()

print("Prepared folder contents:")
for f in sorted(os.listdir(output_folder)):
    print(f"  {f}")
print()
print(result.matches().limit(5).to_df().to_markdown(index=False))
```

```
Prepared folder contents:
  ukam_canonical_addresses.parquet
  ukam_inverted_index.parquet
  ukam_manifest.json
  ukam_term_frequencies.parquet
```

| unique_id | resolved_canonical_id | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|
| m_1 | c_2 | Flat A Example Court, 10 Demo Road, Townton | Flat A, 10 Demo Road, Townton | splink: probabilistic match | 13.5885 | 11.5033 |

The `output_folder` contains parquet files plus `ukam_manifest.json` (package version, row counts, file hashes) for reproducibility.

Subsequent matching exercises that use the same canonical data can reuse this folder, skipping the prepare_canonical_folder step.
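Before reusing a folder, you may want to confirm it is complete. Here is a small sketch; the `is_prepared_folder` helper is illustrative and not part of the package, though the filenames match the listing above:

```python
import os
import tempfile

# Files written by prepare_canonical_folder (see the folder listing above)
EXPECTED_FILES = {
    "ukam_canonical_addresses.parquet",
    "ukam_inverted_index.parquet",
    "ukam_manifest.json",
    "ukam_term_frequencies.parquet",
}

def is_prepared_folder(path):
    """Illustrative check: does the folder contain every expected file?"""
    return os.path.isdir(path) and EXPECTED_FILES <= set(os.listdir(path))

# An empty directory is not a prepared folder
empty = tempfile.mkdtemp()
print(is_prepared_folder(empty))  # False
```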

Reading results

matcher.match() returns a MatchResult object:

| Property / method | Returns |
|---|---|
| `.matches()` | DuckDB relation with `unique_id`, `resolved_canonical_id`, `match_reason`, and more |
| `.match_metrics()` | Match-reason breakdown with counts and percentages |
| `.accuracy_analysis()` | Threshold-based accuracy analysis from labelled data (requires `ukam_label` in messy input) |
| `.splink_predictions()` | Raw Splink predictions (only available when a `SplinkStage` ran) |
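The kind of breakdown `.match_metrics()` reports can be sketched as a tally by `match_reason`. The rows and reason labels below are hypothetical stand-ins for real output:

```python
from collections import Counter

# Hypothetical rows shaped like the .matches() output
rows = [
    {"unique_id": "m_1", "match_reason": "exact match"},
    {"unique_id": "m_2", "match_reason": "splink: probabilistic match"},
    {"unique_id": "m_3", "match_reason": "splink: probabilistic match"},
    {"unique_id": "m_4", "match_reason": "splink: probabilistic match"},
]
counts = Counter(r["match_reason"] for r in rows)
for reason, n in counts.most_common():
    print(f"{reason}: {n} ({100 * n / len(rows):.0f}%)")
```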
Customising stages

The default pipeline is `ExactMatchStage` followed by `SplinkStage`. Pass your own `stages` list to change behaviour:

```python
from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=20.0,
            final_distinguishability_threshold=5.0,
        ),
    ],
)
```

Use AddressMatcher.available_stages() to discover registered stage classes. See the API reference for full parameter tables.
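The two `SplinkStage` thresholds gate which candidate pairs are accepted as matches. A rough sketch of that gating logic, with hypothetical candidates (illustrative only; the package's internals may differ):

```python
# Candidates with a Splink match weight and a distinguishability score
candidates = [
    {"id": "c_2", "match_weight": 25.1, "distinguishability": 11.5},
    {"id": "c_3", "match_weight": 25.1, "distinguishability": 2.0},
    {"id": "c_9", "match_weight": 13.6, "distinguishability": 8.0},
]

# Keep only candidates clearing both thresholds from the example above
accepted = [
    c for c in candidates
    if c["match_weight"] >= 20.0 and c["distinguishability"] >= 5.0
]
print([c["id"] for c in accepted])  # ['c_2']
```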

Using labelled data

If you know the correct match for each address, add a `ukam_label` column to your messy data. It propagates through to results, enabling accuracy analysis with `MatchResult.accuracy_analysis()`.
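At its simplest, accuracy from labelled data is agreement between the resolved match and the label. An illustrative sketch with hypothetical rows (the package's `accuracy_analysis()` is threshold-based and richer than this):

```python
# Hypothetical result rows: ukam_label holds the true canonical ID
results = [
    {"unique_id": "m_1", "resolved_canonical_id": "c_2", "ukam_label": "c_2"},
    {"unique_id": "m_2", "resolved_canonical_id": "c_7", "ukam_label": "c_5"},
]

# Fraction of rows where the resolved match agrees with the label
accuracy = sum(
    r["resolved_canonical_id"] == r["ukam_label"] for r in results
) / len(results)
print(accuracy)  # 0.5
```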