Getting started¶
Install¶
`uk_address_matcher` is a Python package available on PyPI. You can install it with pip:

```
pip install uk_address_matcher
```
Input data requirements¶
Both your messy addresses and your canonical addresses need at least these columns:
| Column | Description |
|---|---|
| `unique_id` | Stable unique identifier |
| `address_concat` | Address text, which can include the postcode |
Optionally, you can provide:

| Column | Description |
|---|---|
| `postcode` | If provided, this postcode is used in preference to any postcode in `address_concat` |
| `ukam_label` | The unique ID of the true match. If provided, it enables accuracy analysis output |
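Before passing data to the matcher, it can be worth checking the columns yourself. The helper below is a minimal illustrative sketch, not part of `uk_address_matcher`:

```python
# Required and recognised optional columns, per the tables above
REQUIRED = {"unique_id", "address_concat"}
OPTIONAL = {"postcode", "ukam_label"}

def check_columns(rows):
    """Return the recognised optional columns present in `rows`,
    raising if a required column is missing."""
    columns = set(rows[0]) if rows else set()
    missing = REQUIRED - columns
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return columns & OPTIONAL

messy = [
    {"unique_id": "m_1",
     "address_concat": "Flat A Example Court, 10 Demo Road, Townton",
     "postcode": "AB1 2BC"},
]
print(check_columns(messy))  # {'postcode'}
```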
Choose whether to pre-process your canonical dataset¶
If you're linking to a small canonical dataset (say, fewer than 500,000 rows), it's simplest to process the data on-the-fly.
If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.
For a small canonical dataset, pass the relations directly and the matcher processes them on-the-fly:

```python
import duckdb
import pyarrow as pa

from uk_address_matcher import AddressMatcher

con = duckdb.connect()

# Usually data would be loaded from files.
# It's hard-coded here so this example can be run as-is.
messy = pa.Table.from_pylist([
    {"unique_id": "m_1", "address_concat": "Flat A Example Court, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
])
messy_df = con.from_arrow(messy)

canonical = pa.Table.from_pylist([
    {"unique_id": "c_1", "address_concat": "9 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_2", "address_concat": "Flat A, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_3", "address_concat": "Flat B, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_4", "address_concat": "Flat C, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
    {"unique_id": "c_5", "address_concat": "Basement Flat, 10 Demo Road, Townton", "postcode": "AB1 2BC"},
])
canonical_df = con.from_arrow(canonical)

matcher = AddressMatcher(
    canonical_addresses=canonical_df,
    addresses_to_match=messy_df,
    con=con,
)
result = matcher.match()

print(result.matches().limit(5).to_df().to_markdown(index=False))
```
| unique_id | resolved_canonical_id | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|
| m_1 | c_2 | Flat A Example Court, 10 Demo Road, Townton | Flat A, 10 Demo Road, Townton | splink: probabilistic match | 13.5885 | 11.5033 |
For a large canonical dataset, run the one-time preparation, then pass the folder path instead of a relation:

```python
import os
import tempfile

import duckdb

from uk_address_matcher import AddressMatcher, prepare_canonical_folder

con = duckdb.connect()

df_canonical = con.read_csv("example_data/canonical_example.csv")
df_messy = con.read_csv("example_data/messy_example.csv")

# One-time preparation
output_folder = tempfile.mkdtemp()
prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
    overwrite=True,
)

# Pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()

print("Prepared folder contents:")
for f in sorted(os.listdir(output_folder)):
    print(f"  {f}")
print()
print(result.matches().limit(5).to_df().to_markdown(index=False))
```
```
Prepared folder contents:
  ukam_canonical_addresses.parquet
  ukam_inverted_index.parquet
  ukam_manifest.json
  ukam_term_frequencies.parquet
```
| unique_id | resolved_canonical_id | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|
| m_1 | c_2 | Flat A Example Court, 10 Demo Road, Townton | Flat A, 10 Demo Road, Townton | splink: probabilistic match | 13.5885 | 11.5033 |
The `output_folder` contains parquet files plus `ukam_manifest.json` (package version, row counts, file hashes) for reproducibility. Subsequent matching exercises that use the same canonical data can reuse this folder, skipping the `prepare_canonical_folder` step.
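When reusing a prepared folder, you can guard against stale or corrupted files by checking them against the manifest. This sketch is illustrative only: the real `ukam_manifest.json` layout is not documented here, so it assumes a hypothetical `{"files": {filename: sha256_hex}}` structure:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def verify_folder(folder):
    """Raise ValueError if any file listed in the (assumed) manifest
    no longer matches its recorded SHA-256 hash."""
    folder = Path(folder)
    manifest = json.loads((folder / "ukam_manifest.json").read_text())
    for name, expected in manifest["files"].items():
        actual = hashlib.sha256((folder / name).read_bytes()).hexdigest()
        if actual != expected:
            raise ValueError(f"{name} has changed since preparation")

# Demo with a throwaway folder standing in for a real prepared one
folder = Path(tempfile.mkdtemp())
(folder / "ukam_canonical_addresses.parquet").write_bytes(b"placeholder")
manifest = {"files": {
    "ukam_canonical_addresses.parquet":
        hashlib.sha256(b"placeholder").hexdigest(),
}}
(folder / "ukam_manifest.json").write_text(json.dumps(manifest))
verify_folder(folder)  # passes silently
```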
Reading results¶
`matcher.match()` returns a `MatchResult` object:

| Property / method | Returns |
|---|---|
| `.matches()` | DuckDB relation with `unique_id`, `resolved_canonical_id`, `match_reason`, and more. |
| `.match_metrics()` | Match-reason breakdown with counts and percentages. |
| `.accuracy_analysis()` | Threshold-based accuracy analysis from labelled data (requires `ukam_label` in messy input). |
| `.splink_predictions()` | Raw Splink predictions (only available when a `SplinkStage` ran). |
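To make the match-reason breakdown concrete, here is a plain-Python sketch of the kind of summary `.match_metrics()` reports. The hard-coded rows are hypothetical, and the real method's output shape may differ:

```python
from collections import Counter

# Hypothetical match rows, each carrying the reason a match was made
matches = [
    {"unique_id": "m_1", "match_reason": "exact match"},
    {"unique_id": "m_2", "match_reason": "splink: probabilistic match"},
    {"unique_id": "m_3", "match_reason": "splink: probabilistic match"},
]

# Count rows per match reason and express each as a percentage
counts = Counter(row["match_reason"] for row in matches)
total = len(matches)
for reason, n in counts.most_common():
    print(f"{reason}: {n} ({100 * n / total:.1f}%)")
```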
Customising stages¶
The default pipeline is `ExactMatchStage` → `SplinkStage`. Pass your own `stages` list to change behaviour:
```python
from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=20.0,
            final_distinguishability_threshold=5.0,
        ),
    ],
)
```
Use `AddressMatcher.available_stages()` to discover registered stage classes. See the API reference for full parameter tables.
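The `SplinkStage` thresholds operate on Splink match weights. In Splink, a match weight `w` is the log2 of the Bayes factor, so it converts to a match probability of `2**w / (1 + 2**w)`; a minimal sketch of that conversion:

```python
def weight_to_probability(w):
    """Convert a Splink match weight (log2 Bayes factor) to a probability."""
    k = 2.0 ** w  # Bayes factor
    return k / (1.0 + k)

# A weight of 0 is even odds; 20 is overwhelming evidence of a match
print(weight_to_probability(0.0))   # 0.5
print(weight_to_probability(20.0))  # very close to 1
```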
Using labelled data¶
If you know the correct match for each address, add a `ukam_label` column to your messy data. It propagates through to results, enabling accuracy analysis with `MatchResult.accuracy_analysis()`.
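As an illustration of what labelled accuracy measures, this sketch (not the library's implementation) compares the resolved match against the label for a couple of hypothetical rows:

```python
# Hypothetical results where ukam_label has propagated through to the output
results = [
    {"unique_id": "m_1", "resolved_canonical_id": "c_2", "ukam_label": "c_2"},
    {"unique_id": "m_2", "resolved_canonical_id": "c_9", "ukam_label": "c_7"},
]

# A row is correct when the resolved canonical ID equals the labelled truth
correct = sum(r["resolved_canonical_id"] == r["ukam_label"] for r in results)
print(f"accuracy: {correct / len(results):.0%}")  # accuracy: 50%
```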