
Getting started

Install

uk_address_matcher is a Python package available on PyPI. Install it with pip:

pip install uk_address_matcher

Input data requirements

Both your messy addresses and your canonical addresses need at least these columns:

| Column | Description |
|---|---|
| unique_id | Stable unique identifier |
| address_concat | Address text, which can include the postcode |

Optionally you can provide:

| Column | Description |
|---|---|
| postcode | If provided, this postcode is used in preference to any postcode found in address_concat |
| ukam_label | The unique ID of the true match. If provided, it enables accuracy analysis output |
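
Before matching, it can help to confirm that an input has the columns listed above. The following is a minimal sketch in plain Python; `check_input_columns` is a hypothetical helper written for illustration and is not part of uk_address_matcher.

```python
# Hypothetical helper for illustration; not part of uk_address_matcher.
REQUIRED_COLUMNS = {"unique_id", "address_concat"}
OPTIONAL_COLUMNS = {"postcode", "ukam_label"}


def check_input_columns(columns):
    """Return (missing_required, optional_present) for a list of column names."""
    present = set(columns)
    missing_required = sorted(REQUIRED_COLUMNS - present)
    optional_present = sorted(OPTIONAL_COLUMNS & present)
    return missing_required, optional_present


missing, optional = check_input_columns(["unique_id", "address", "postcode"])
print(missing)   # ['address_concat']
print(optional)  # ['postcode']
```

With a DuckDB relation, the column names could be taken from its `columns` attribute and passed to the helper.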

Choose whether to pre-process your canonical dataset

If you're linking to a small canonical dataset (say, fewer than 500,000 rows), then it's simplest to process the data on the fly.

If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.

The examples below use the fictional London datasets from ukam_datasets, which are bundled with the package so that the examples run as written. The first example processes the canonical data on the fly:

import duckdb
from uk_address_matcher import AddressMatcher, ukam_datasets

con = duckdb.connect()

df_messy = ukam_datasets.as_relation("fictional_london_messy", con=con)
df_canonical = ukam_datasets.as_relation("fictional_london_canonical", con=con)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
print(result.matches().limit(5).to_df().to_markdown(index=False))
| unique_id | resolved_canonical_id | ukam_label | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|---|
| m_0001872 | c_0005356 | c_0005356 | FLAT 2,131 PRIMROSEWICK WY,WEST ALDER,LONDON | Flat 2, 131 Primrosewick Way, West Alder, London | exact: full match | nan | nan |
| m_0001832 | c_0000916 | c_0000916 | Unit 6, 102 Bartonstone Gdns, East Bramwick, London | Unit 6, 102 Bartonstone Gardens, East Bramwick, London | exact: full match | nan | nan |
| m_0001440 | c_0008498 | c_0008498 | Suite 9,3 Elmhurst St,Kingsford,London | Suite 9, 3 Elmhurst Street, Kingsford, London | splink: probabilistic match | 43.03 | nan |
| m_0001762 | c_0003898 | c_0003898 | UNIT 11, 12 YORKSTNOE ST, MAPLE GREEN, LONDON | Unit 11, 12 Yorkstone Street, Maple Green, London | splink: probabilistic match | 39.6298 | nan |
| m_0001689 | c_0007665 | c_0007665 | Suite 12, 120 Novalane Ave, New Huxley, London | Suite 12, 120 Novalane Avenue, New Huxley, London | exact: full match | nan | nan |

The second example pre-processes the canonical data once with prepare_canonical_folder, then passes the resulting folder path to AddressMatcher:

import duckdb
import os
import tempfile
from uk_address_matcher import (
    AddressMatcher,
    prepare_canonical_folder,
    ukam_datasets,
)

con = duckdb.connect()
df_messy = ukam_datasets.as_relation("fictional_london_messy", con=con)
df_canonical = ukam_datasets.as_relation("fictional_london_canonical", con=con)

# One-time preparation
output_folder = tempfile.mkdtemp()
prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
    overwrite=True,
)

# Pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()

print("Prepared folder contents:")
for f in sorted(os.listdir(output_folder)):
    print(f"  {f}")
print()
print(result.matches().limit(5).to_df().to_markdown(index=False))

Prepared folder contents:
  ukam_canonical_addresses.parquet
  ukam_inverted_index.parquet
  ukam_manifest.json
  ukam_term_frequencies.parquet

| unique_id | resolved_canonical_id | ukam_label | original_address_concat | original_address_concat_canonical | match_reason | match_weight | distinguishability |
|---|---|---|---|---|---|---|---|
| m_0001872 | c_0005356 | c_0005356 | FLAT 2,131 PRIMROSEWICK WY,WEST ALDER,LONDON | Flat 2, 131 Primrosewick Way, West Alder, London | exact: full match | nan | nan |
| m_0001832 | c_0000916 | c_0000916 | Unit 6, 102 Bartonstone Gdns, East Bramwick, London | Unit 6, 102 Bartonstone Gardens, East Bramwick, London | exact: full match | nan | nan |
| m_0001440 | c_0008498 | c_0008498 | Suite 9,3 Elmhurst St,Kingsford,London | Suite 9, 3 Elmhurst Street, Kingsford, London | splink: probabilistic match | 43.03 | nan |
| m_0001762 | c_0003898 | c_0003898 | UNIT 11, 12 YORKSTNOE ST, MAPLE GREEN, LONDON | Unit 11, 12 Yorkstone Street, Maple Green, London | splink: probabilistic match | 39.6298 | nan |
| m_0001689 | c_0007665 | c_0007665 | Suite 12, 120 Novalane Ave, New Huxley, London | Suite 12, 120 Novalane Avenue, New Huxley, London | exact: full match | nan | nan |

The output_folder contains parquet files plus ukam_manifest.json (package version, row counts, file hashes) for reproducibility.

Subsequent matching exercises that use the same canonical data can reuse this folder, skipping the prepare_canonical_folder step.
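
Before reusing a prepared folder, you may want to confirm that it is complete. The sketch below checks for the four file names shown in the folder listing above and loads the manifest as a plain dictionary. Both helper functions are hypothetical, and the manifest's key names are not assumed. The demonstration runs against a stand-in folder of placeholder files, not real prepared output.

```python
import json
import os
import tempfile

# File names taken from the prepared folder listing shown above.
EXPECTED_FILES = [
    "ukam_canonical_addresses.parquet",
    "ukam_inverted_index.parquet",
    "ukam_manifest.json",
    "ukam_term_frequencies.parquet",
]


def missing_prepared_files(folder):
    """Hypothetical helper: list the expected files absent from `folder`."""
    present = set(os.listdir(folder)) if os.path.isdir(folder) else set()
    return [name for name in EXPECTED_FILES if name not in present]


def load_manifest(folder):
    """Read ukam_manifest.json as a dict without assuming its keys."""
    path = os.path.join(folder, "ukam_manifest.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Stand-in folder with placeholder files, for demonstration only.
demo_folder = tempfile.mkdtemp()
for name in EXPECTED_FILES:
    with open(os.path.join(demo_folder, name), "w", encoding="utf-8") as f:
        f.write("{}" if name.endswith(".json") else "")

print(missing_prepared_files(demo_folder))  # []
print(load_manifest(demo_folder))           # {}
```

If the list of missing files is empty, the folder path can be passed as canonical_addresses in the same way as in the example above.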

Reading results

matcher.match() returns a MatchResult object:

| Property / method | Returns |
|---|---|
| .matches() | DuckDB relation with unique_id, resolved_canonical_id, match_reason, and more. |
| .match_metrics() | Match-reason breakdown with counts and percentages. |
| .accuracy_analysis() | Threshold-based accuracy analysis from labelled data (requires ukam_label in messy input). |
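
To make "match-reason breakdown with counts and percentages" concrete, the sketch below computes such a breakdown by hand from the match_reason values of the five sample rows shown earlier. It illustrates the idea only. The exact columns and formatting returned by .match_metrics() may differ, and `match_reason_breakdown` is a hypothetical helper.

```python
from collections import Counter

# match_reason values copied from the five sample rows shown earlier.
match_reasons = [
    "exact: full match",
    "exact: full match",
    "splink: probabilistic match",
    "splink: probabilistic match",
    "exact: full match",
]


def match_reason_breakdown(reasons):
    """Return (reason, count, percentage) tuples, most common first."""
    counts = Counter(reasons)
    total = sum(counts.values())
    return [
        (reason, n, round(100 * n / total, 1))
        for reason, n in counts.most_common()
    ]


for reason, n, pct in match_reason_breakdown(match_reasons):
    print(f"{reason}: {n} ({pct}%)")
# exact: full match: 3 (60.0%)
# splink: probabilistic match: 2 (40.0%)
```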

Customising stages

The default pipeline is ExactMatchStage followed by SplinkStage. Pass your own stages list to change this behaviour:

from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=20.0,
            final_distinguishability_threshold=5.0,
        ),
    ],
)

Use AddressMatcher.available_stages() to discover the registered stage classes. See Choosing a matching threshold and Optimising accuracy for advice on improving accuracy, and the API reference for full details of each class.
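
The two SplinkStage thresholds in the example above can be pictured with a small sketch. It rests on assumptions not confirmed by this page: that distinguishability is the gap in match weight between the best and second-best candidate, and that a match is accepted only when both thresholds are met. The function `pick_match` and the candidate weights are hypothetical, and the package's actual rule may differ.

```python
def pick_match(candidate_weights, weight_threshold=20.0,
               distinguishability_threshold=5.0):
    """Return the accepted canonical ID, or None if no candidate qualifies.

    candidate_weights maps canonical ID -> match weight.
    ASSUMPTION: distinguishability = best weight minus second-best weight.
    """
    if not candidate_weights:
        return None
    ranked = sorted(candidate_weights.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best_weight = ranked[0]
    if best_weight < weight_threshold:
        return None  # best candidate is too weak
    if len(ranked) > 1:
        gap = best_weight - ranked[1][1]
        if gap < distinguishability_threshold:
            return None  # top two candidates are too close to call
    return best_id


print(pick_match({"c_1": 43.0, "c_2": 12.5}))  # c_1
print(pick_match({"c_1": 25.0, "c_2": 23.0}))  # None
print(pick_match({"c_1": 15.0}))               # None
```

Under these assumptions, raising either threshold trades fewer accepted matches for fewer false positives.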

Using labelled data

If you know the correct match for each address, add a ukam_label column to your messy data. It propagates through to results, enabling accuracy analysis with MatchResult.accuracy_analysis().
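
As a rough illustration of what the label makes possible, the sketch below compares resolved_canonical_id with ukam_label by hand. The first two rows come from the sample output shown earlier and the last two are made up to show an incorrect and an unmatched record. `summarise_accuracy` is a hypothetical helper, and the output of MatchResult.accuracy_analysis() is threshold-based and will look different.

```python
# Rows in the shape (unique_id, resolved_canonical_id, ukam_label).
rows = [
    ("m_0001872", "c_0005356", "c_0005356"),
    ("m_0001832", "c_0000916", "c_0000916"),
    ("m_0009999", "c_0000001", "c_0000002"),  # made-up incorrect match
    ("m_0009998", None, "c_0000003"),         # made-up unmatched record
]


def summarise_accuracy(rows):
    """Count matched and correctly matched records against their labels."""
    matched = [(resolved, label) for _, resolved, label in rows if resolved is not None]
    correct = sum(1 for resolved, label in matched if resolved == label)
    return {
        "total": len(rows),
        "matched": len(matched),
        "correct": correct,
        # Share of accepted matches that agree with the label.
        "precision": round(correct / len(matched), 3) if matched else None,
    }


print(summarise_accuracy(rows))
# {'total': 4, 'matched': 3, 'correct': 2, 'precision': 0.667}
```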