Getting started

Install

uk_address_matcher is a Python package available on PyPI. You can install it with pip:

pip install uk_address_matcher
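To confirm the install worked, one option is to ask Python's standard `importlib.metadata` for the installed version. This is a minimal sketch: `installed_version` is a hypothetical helper written for this page, not part of uk_address_matcher.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("uk_address_matcher")
    print(version or "uk_address_matcher is not installed in this environment")
```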

Input data requirements

Both your messy addresses and your canonical addresses need at least these columns:

Column	Description
`unique_id`	Stable unique identifier
`address_concat`	Address text, which can include the postcode

Optionally you can provide:

Column	Description
`postcode`	If provided, this postcode is used in preference to any postcode found in `address_concat`
`ukam_label`	The unique ID of the true match. If provided, it enables accuracy analysis output
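Before matching, it can help to check that a dataset has the columns listed above and that `unique_id` really is unique. The sketch below uses only the standard library. `check_input_columns` and `duplicate_ids` are hypothetical helpers for illustration, not functions provided by the package, and the package may perform its own validation.

```python
from collections import Counter

# Column names taken from the tables above.
REQUIRED_COLUMNS = {"unique_id", "address_concat"}
OPTIONAL_COLUMNS = {"postcode", "ukam_label"}


def check_input_columns(columns):
    """Return (missing required columns, optional columns present), both sorted."""
    present = set(columns)
    missing = sorted(REQUIRED_COLUMNS - present)
    optional = sorted(OPTIONAL_COLUMNS & present)
    return missing, optional


def duplicate_ids(ids):
    """Return the sorted list of identifiers that occur more than once."""
    return sorted(value for value, n in Counter(ids).items() if n > 1)


print(check_input_columns(["unique_id", "address_concat", "postcode"]))
# → ([], ['postcode'])
print(check_input_columns(["id", "address_concat"]))
# → (['unique_id'], [])
print(duplicate_ids(["m_1", "m_2", "m_1"]))
# → ['m_1']
```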

Choose whether to pre-process your canonical dataset

If you're linking to a small canonical dataset (say, fewer than 500,000 rows), then it's simplest to process the data on-the-fly.

If you're linking to a large canonical dataset (for example, national-scale NGD), then we recommend a one-time pre-processing step. It computes reusable datasets (indices and feature tables) once, so subsequent matching runs are fast.

The examples below use the fictional London datasets from ukam_datasets, which are included for runnable examples.

The two approaches are shown in turn: local / regional (processing on-the-fly), then national-scale (preprocessed data).

Local / regional (processing on-the-fly): source code, followed by its output.

import duckdb
from uk_address_matcher import AddressMatcher, ukam_datasets

con = duckdb.connect()

df_messy = ukam_datasets.as_relation("fictional_london_messy", con=con)
df_canonical = ukam_datasets.as_relation("fictional_london_canonical", con=con)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()
print(result.matches().limit(5).to_df().to_markdown(index=False))

unique_id	resolved_canonical_id	ukam_label	original_address_concat	original_address_concat_canonical	match_reason	match_weight	distinguishability
m_0001872	c_0005356	c_0005356	FLAT 2,131 PRIMROSEWICK WY,WEST ALDER,LONDON	Flat 2, 131 Primrosewick Way, West Alder, London	exact: full match	nan	nan
m_0001832	c_0000916	c_0000916	Unit 6, 102 Bartonstone Gdns, East Bramwick, London	Unit 6, 102 Bartonstone Gardens, East Bramwick, London	exact: full match	nan	nan
m_0001440	c_0008498	c_0008498	Suite 9,3 Elmhurst St,Kingsford,London	Suite 9, 3 Elmhurst Street, Kingsford, London	splink: probabilistic match	43.03	nan
m_0001762	c_0003898	c_0003898	UNIT 11, 12 YORKSTNOE ST, MAPLE GREEN, LONDON	Unit 11, 12 Yorkstone Street, Maple Green, London	splink: probabilistic match	39.6298	nan
m_0001689	c_0007665	c_0007665	Suite 12, 120 Novalane Ave, New Huxley, London	Suite 12, 120 Novalane Avenue, New Huxley, London	exact: full match	nan	nan

National-scale (preprocessed data): source code, followed by its output.

import duckdb
import os
import tempfile
from uk_address_matcher import (
    AddressMatcher,
    prepare_canonical_folder,
    ukam_datasets,
)

con = duckdb.connect()
df_messy = ukam_datasets.as_relation("fictional_london_messy", con=con)
df_canonical = ukam_datasets.as_relation("fictional_london_canonical", con=con)

# One-time preparation
output_folder = tempfile.mkdtemp()
prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
    overwrite=True,
)

# Pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()

print("Prepared folder contents:")
for f in sorted(os.listdir(output_folder)):
    print(f"  {f}")
print()
print(result.matches().limit(5).to_df().to_markdown(index=False))

Prepared folder contents:
  ukam_canonical_addresses.parquet
  ukam_inverted_index.parquet
  ukam_manifest.json
  ukam_term_frequencies.parquet

unique_id	resolved_canonical_id	ukam_label	original_address_concat	original_address_concat_canonical	match_reason	match_weight	distinguishability
m_0001872	c_0005356	c_0005356	FLAT 2,131 PRIMROSEWICK WY,WEST ALDER,LONDON	Flat 2, 131 Primrosewick Way, West Alder, London	exact: full match	nan	nan
m_0001832	c_0000916	c_0000916	Unit 6, 102 Bartonstone Gdns, East Bramwick, London	Unit 6, 102 Bartonstone Gardens, East Bramwick, London	exact: full match	nan	nan
m_0001440	c_0008498	c_0008498	Suite 9,3 Elmhurst St,Kingsford,London	Suite 9, 3 Elmhurst Street, Kingsford, London	splink: probabilistic match	43.03	nan
m_0001762	c_0003898	c_0003898	UNIT 11, 12 YORKSTNOE ST, MAPLE GREEN, LONDON	Unit 11, 12 Yorkstone Street, Maple Green, London	splink: probabilistic match	39.6298	nan
m_0001689	c_0007665	c_0007665	Suite 12, 120 Novalane Ave, New Huxley, London	Suite 12, 120 Novalane Avenue, New Huxley, London	exact: full match	nan	nan

The output_folder contains parquet files plus ukam_manifest.json (package version, row counts, file hashes) for reproducibility.

Subsequent matching exercises that use the same canonical data can reuse this folder, skipping the prepare_canonical_folder step.
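When reusing a prepared folder, a quick check that the expected files are present can decide whether the preparation step needs to run again. The sketch below is an illustration only: `missing_prepared_files` is a hypothetical helper, and the file list is copied from the example output on this page, so it is an assumption that may not hold for other package versions.

```python
import os
import tempfile

# File names as printed in the example output above (assumed, may vary by version).
EXPECTED_FILES = {
    "ukam_canonical_addresses.parquet",
    "ukam_inverted_index.parquet",
    "ukam_manifest.json",
    "ukam_term_frequencies.parquet",
}


def missing_prepared_files(folder):
    """Return the sorted expected file names that are absent from `folder`."""
    if not os.path.isdir(folder):
        return sorted(EXPECTED_FILES)
    return sorted(EXPECTED_FILES - set(os.listdir(folder)))


# Demonstration with empty placeholder files, not real prepared data.
demo_folder = tempfile.mkdtemp()
for name in EXPECTED_FILES:
    open(os.path.join(demo_folder, name), "w").close()

print(missing_prepared_files(demo_folder))
# → []
```

If the returned list is empty, the folder path can be passed as `canonical_addresses`; otherwise `prepare_canonical_folder` would need to be run first.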

Reading results

matcher.match() returns a MatchResult object:

Property / method	Returns
`.matches()`	DuckDB relation with `unique_id`, `resolved_canonical_id`, `match_reason`, and more.
`.match_metrics()`	Match-reason breakdown with counts and percentages.
`.accuracy_analysis()`	Threshold-based accuracy analysis from labelled data (requires `ukam_label` in messy input).
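To make the idea of a match-reason breakdown concrete, the sketch below computes counts and percentages from the `match_reason` values of the five example rows shown earlier. It is plain Python for illustration, not the package's implementation of `.match_metrics()`, and `reason_breakdown` is a hypothetical name.

```python
from collections import Counter

# match_reason values copied from the five example output rows above.
match_reasons = [
    "exact: full match",
    "exact: full match",
    "splink: probabilistic match",
    "splink: probabilistic match",
    "exact: full match",
]


def reason_breakdown(reasons):
    """Return (reason, count, percentage) tuples, most frequent reason first."""
    counts = Counter(reasons)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (reason, n, round(100 * n / total, 1))
        for reason, n in counts.most_common()
    ]


for reason, n, pct in reason_breakdown(match_reasons):
    print(f"{reason}: {n} ({pct}%)")
# exact: full match: 3 (60.0%)
# splink: probabilistic match: 2 (40.0%)
```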

Customising stages

The default pipeline is ExactMatchStage → SplinkStage. Pass your own stages list to change behaviour:

from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=20.0,
            final_distinguishability_threshold=5.0,
        ),
    ],
)

Use AddressMatcher.available_stages() to discover registered stage classes. For further advice on accuracy, see "Choosing a matching threshold" and "Optimising accuracy". The API reference documents the main classes and functions.
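As a rough mental model of a staged pipeline, the toy sketch below runs stages in order and hands each stage only the records that earlier stages left unmatched. That hand-off behaviour is an assumption made for illustration, and the package's actual stages may work differently. The fuzzy stage uses `difflib` string similarity purely as a stand-in; it is not how Splink scores matches. All names and the toy addresses are hypothetical.

```python
import difflib

# Toy data loosely modelled on the fictional examples above.
canonical = {
    "c1": "flat 2, 131 primrosewick way, west alder, london",
    "c2": "unit 11, 12 yorkstone street, maple green, london",
}
messy = {
    "m1": "FLAT 2, 131 PRIMROSEWICK WAY, WEST ALDER, LONDON",
    "m2": "UNIT 11, 12 YORKSTNOE ST, MAPLE GREEN, LONDON",
    "m3": "1 Nowhere Road",
}


def exact_stage(remaining):
    """Match records whose lower-cased text equals a canonical address."""
    lookup = {text: cid for cid, text in canonical.items()}
    return {
        mid: lookup[text.lower()]
        for mid, text in remaining.items()
        if text.lower() in lookup
    }


def fuzzy_stage(remaining, cutoff=0.8):
    """Match records to the most similar canonical address above `cutoff`."""
    ids = list(canonical.keys())
    texts = list(canonical.values())
    found = {}
    for mid, text in remaining.items():
        best = difflib.get_close_matches(text.lower(), texts, n=1, cutoff=cutoff)
        if best:
            found[mid] = ids[texts.index(best[0])]
    return found


def run_stages(records, stages):
    """Run stages in order; each stage sees only still-unmatched records."""
    matched = {}
    remaining = dict(records)
    for name, stage in stages:
        for mid, cid in stage(remaining).items():
            matched[mid] = (cid, name)
            del remaining[mid]
    return matched, remaining


matched, unmatched = run_stages(messy, [("exact", exact_stage), ("fuzzy", fuzzy_stage)])
print(matched)
print(sorted(unmatched))
```

Here `m1` is resolved by the exact stage, `m2` falls through to the fuzzy stage, and `m3` stays unmatched because nothing is similar enough.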

Using labelled data

If you know the correct match for each address, add a ukam_label column to your messy data. It propagates through to results, enabling accuracy analysis with MatchResult.accuracy_analysis().
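The simplest quantity labels make possible is the share of labelled records whose `resolved_canonical_id` equals `ukam_label`. The sketch below computes it for the five example rows shown earlier. It is a plain-Python illustration, not the package's threshold-based analysis, and `label_accuracy` is a hypothetical helper.

```python
# (unique_id, resolved_canonical_id, ukam_label) copied from the example output above.
rows = [
    ("m_0001872", "c_0005356", "c_0005356"),
    ("m_0001832", "c_0000916", "c_0000916"),
    ("m_0001440", "c_0008498", "c_0008498"),
    ("m_0001762", "c_0003898", "c_0003898"),
    ("m_0001689", "c_0007665", "c_0007665"),
]


def label_accuracy(rows):
    """Return the fraction of labelled rows resolved to their label, or None."""
    labelled = [(resolved, label) for _, resolved, label in rows if label is not None]
    if not labelled:
        return None
    correct = sum(resolved == label for resolved, label in labelled)
    return correct / len(labelled)


print(label_accuracy(rows))
# → 1.0
```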