API reference

AddressMatcher

Primary entry point for address matching.

Accepts either a raw DuckDBPyRelation (cleaned on the fly) or a str / Path pointing to a folder created by prepare_canonical_folder for canonical addresses. Messy addresses can be a DuckDB relation or a list of AddressRecord / dicts.

Stages default to [ExactMatchStage(), SplinkStage()]. To customise matching behaviour, pass your own list; the existing stage dataclasses (ExactMatchStage, UniqueTrigramStage, SplinkStage) expose their configuration options as fields.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| canonical_addresses | Union[DuckDBPyRelation, str, Path] | Canonical dataset to match against. Can be a DuckDBPyRelation or a path to a prepared canonical folder. | required |
| canonical_address_filter | str \| None | Optional DuckDB SQL expression used to filter canonical addresses after load (for prepared folders) or directly on the provided canonical relation. | None |
| addresses_to_match | Union[DuckDBPyRelation, list[AddressRecord], list[dict]] | Messy addresses to resolve. Can be a DuckDBPyRelation, a list of AddressRecord, or a list of dicts with address_concat, postcode, and unique_id fields. | required |
| con | DuckDBPyConnection | DuckDB connection to use for all operations. | required |
| stages | Optional[list[MatchingStage]] | Optional list of MatchingStage instances defining the matching pipeline. Defaults to exact match followed by Splink. | None |
| debug_options | Optional[DebugOptions] | Optional DebugOptions to control debug output and logging. | None |
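The list-of-dicts form of addresses_to_match can be sketched without the library installed. The helper check_records below is hypothetical (it is not part of uk_address_matcher) and the addresses are made up; it only checks that each dict carries the three fields named in the parameter description.

```python
# Field names taken from the addresses_to_match description above.
REQUIRED_FIELDS = ("unique_id", "address_concat", "postcode")


def check_records(records: list[dict]) -> list[dict]:
    """Hypothetical helper: raise if any record lacks a required field."""
    for position, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {position} is missing fields: {missing}")
    return records


# Illustrative (made-up) messy addresses in the list-of-dicts form.
messy_records = check_records(
    [
        {"unique_id": 1, "address_concat": "10 DOWNING ST LONDON", "postcode": "SW1A 2AA"},
        {"unique_id": 2, "address_concat": "FLAT 2 4 HIGH STREET", "postcode": "AB1 2CD"},
    ]
)
```

A list shaped like messy_records is what the description suggests could be passed as addresses_to_match in place of a DuckDB relation.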

Examples:

Simple matching:

import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
canonical = con.read_parquet("./canonical.parquet")
messy = con.read_parquet("./messy.parquet")

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()

Custom stages:

from uk_address_matcher import (
    AddressMatcher, ExactMatchStage, UniqueTrigramStage, SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(),
    ],
)
result = matcher.match()

Pre-prepared canonical data:

matcher = AddressMatcher(
    canonical_addresses="./prepared_addressbase",
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()

__init__(canonical_addresses, addresses_to_match, *, canonical_address_filter=None, con, stages=None, debug_options=None)

match()

Runs the full matching pipeline.

Each stage is executed in sequence. Earlier stages consume easy matches; later stages handle the remainder.
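The "earlier stages consume easy matches, later stages handle the remainder" idea can be illustrated with a toy pipeline in plain Python. This is a sketch of the general pattern only, not the library's implementation: the stage functions, the CANONICAL lookup, and run_pipeline are all hypothetical names with made-up data.

```python
# Toy canonical lookup: address text -> canonical id (made-up data).
CANONICAL = {"10 DOWNING STREET": 100, "4 HIGH STREET": 200}


def exact_stage(unmatched):
    """Resolve records whose address text equals a canonical address."""
    return {
        uid: (CANONICAL[address], "exact")
        for uid, address in unmatched.items()
        if address in CANONICAL
    }


def house_number_stage(unmatched):
    """Toy stand-in for a fuzzier stage: match on a unique leading number."""
    resolved = {}
    for uid, address in unmatched.items():
        number = address.split()[0]
        candidates = [cid for canon, cid in CANONICAL.items() if canon.split()[0] == number]
        if len(candidates) == 1:
            resolved[uid] = (candidates[0], "house_number")
    return resolved


def run_pipeline(records, stages):
    """Run stages in order; each stage only sees what earlier stages left."""
    unmatched = dict(records)
    results = {}
    for stage in stages:
        resolved = stage(unmatched)
        results.update(resolved)
        for uid in resolved:
            del unmatched[uid]
    # Anything no stage resolved is reported as unmatched.
    for uid in unmatched:
        results[uid] = (None, "unmatched")
    return results


messy = {1: "10 DOWNING STREET", 2: "4 HIGH ST", 3: "99 NOWHERE ROAD"}
results = run_pipeline(messy, [exact_stage, house_number_stage])
```

Ordering cheap, high-precision stages first keeps the more expensive stages working on a smaller remainder, which is presumably why the default pipeline puts the exact match ahead of Splink.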

Returns:

| Type | Description |
| --- | --- |
| MatchResult | A MatchResult wrapper around the final DuckDB relation, including unique_id, resolved_canonical_id, match_reason, and any additional columns produced by the stages. |

available_stages() classmethod

All registered MatchingStage subclasses.

Delegates to MatchingStage.available_stages() which walks the subclass tree dynamically, so newly added stages are picked up automatically without maintaining a hard-coded list.
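Walking the subclass tree is typically done with Python's built-in `__subclasses__()` method. The sketch below shows that general technique on stand-in classes; it is not the library's code, and CustomSplinkStage is a hypothetical subclass added only to show that nested subclasses are discovered.

```python
class MatchingStage:
    """Stand-in base class for this sketch (not the library's class)."""

    @classmethod
    def available_stages(cls) -> list[type]:
        # Collect direct subclasses, then recurse into each of them.
        found = []
        for subclass in cls.__subclasses__():
            found.append(subclass)
            found.extend(subclass.available_stages())
        return found


class ExactMatchStage(MatchingStage):
    pass


class SplinkStage(MatchingStage):
    pass


class CustomSplinkStage(SplinkStage):  # hypothetical user-defined stage
    pass


stage_names = sorted(stage.__name__ for stage in MatchingStage.available_stages())
```

Because the walk happens at call time, a stage class defined after import (such as CustomSplinkStage here) appears in the result without any registry being updated.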

Results

MatchResult dataclass

Wraps match output with connection-scoped inspection helpers.

Access the underlying DuckDB relation via .matches().

Key methods

match_metrics - match-reason breakdown with counts and percentages.
match_reasons - distinct match-reason values.
splink_predictions - raw Splink predictions table (requires SplinkStage).

matches(*, all_columns=False)

The underlying DuckDB relation containing match results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| all_columns | bool | When True, return every column. By default only the key result columns are returned. | False |

match_metrics(*, order='descending')

Match-reason breakdown with counts and percentages.
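As a rough illustration of what such a breakdown contains, the sketch below computes counts and percentages per match reason in plain Python. It is not the library's implementation: the function is a stand-in, the output field names are assumptions, and order is assumed here to sort by count.

```python
from collections import Counter


def match_metrics(match_reasons, order="descending"):
    """Stand-in: count each match reason and express it as a percentage."""
    counts = Counter(match_reasons)
    total = sum(counts.values())
    rows = [
        {"match_reason": reason, "count": n, "percentage": round(100 * n / total, 1)}
        for reason, n in counts.items()
    ]
    # Assumption: 'order' controls sorting by count.
    rows.sort(key=lambda row: row["count"], reverse=(order == "descending"))
    return rows


reasons = ["exact", "exact", "splink", "unmatched"]  # made-up match reasons
metrics = match_metrics(reasons)
```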

splink_predictions returns the Splink predictions as a DuckDB relation.

Use ukam_ids to filter on the input-side identifier.