API reference

AddressMatcher

Primary entry point for address matching.

For canonical addresses, accepts either a raw DuckDBPyRelation (cleaned on the fly) or a str / Path pointing to a folder created by prepare_canonical_folder. Messy addresses can be a DuckDB relation or a list of AddressRecord / dicts.

Stages default to [ExactMatchStage(), SplinkStage()]. Pass your own list to customise matching behaviour — the existing stage dataclasses (ExactMatchStage, UniqueTrigramStage, SplinkStage) already expose all the knobs you need.

Parameters:

canonical_addresses (Union[DuckDBPyRelation, str, Path], required)
    Canonical dataset to match against. Can be a DuckDBPyRelation or a path to a prepared canonical folder.

canonical_address_filter (str | None, default None)
    Optional DuckDB SQL expression used to filter canonical addresses after load (for prepared folders) or directly on the provided canonical relation.

addresses_to_match (Union[DuckDBPyRelation, list[AddressRecord], list[dict]], required)
    Messy addresses to resolve. Can be a DuckDBPyRelation, a list of AddressRecord, or a list of dicts with address_concat, postcode, and unique_id fields.

con (DuckDBPyConnection, required)
    DuckDB connection to use for all operations.

stages (Optional[list[MatchingStage]], default None)
    Optional list of MatchingStage instances defining the matching pipeline. Defaults to exact match followed by Splink.

cleaning_num_chunks (int, default 10)
    Number of chunks to use for cleaning and term frequency derivation when canonical input is a raw relation. Also used for messy-address cleaning. Must be a positive integer.

debug_options (Optional[DebugOptions], default None)
    Optional DebugOptions to control debug output and logging.

Examples:

Simple matching:

import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
canonical = con.read_parquet("./canonical.parquet")
messy = con.read_parquet("./messy.parquet")

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()

Custom stages:

from uk_address_matcher import (
    AddressMatcher, ExactMatchStage, UniqueTrigramStage, SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(),
    ],
)
result = matcher.match()

Pre-prepared canonical data:

matcher = AddressMatcher(
    canonical_addresses="./prepared_addressbase",
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()

__init__(canonical_addresses, addresses_to_match, *, canonical_address_filter=None, con, stages=None, debug_options=None, cleaning_num_chunks=10)

match()

Runs the full matching pipeline.

Each stage is executed in sequence. Earlier stages consume easy matches; later stages handle the remainder.

Returns:

MatchResult
    A MatchResult wrapper around the final DuckDB relation, including unique_id, resolved_canonical_id, match_reason, and any additional columns produced by the stages.
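The staged consumption described here (earlier stages take the easy matches, later stages handle what remains) can be sketched in plain Python. The stage functions below are hypothetical stand-ins for the real MatchingStage classes, not the library's implementation:

```python
# Illustrative sketch: each stage matches what it can and passes the
# remainder to the next stage. Stage functions here are hypothetical.

def exact_stage(records, canonical):
    matched, remainder = [], []
    for rec in records:
        if rec["address_concat"] in canonical:
            matched.append({**rec, "match_reason": "exact"})
        else:
            remainder.append(rec)
    return matched, remainder

def fallback_stage(records, canonical):
    # A later, more expensive stage handles whatever is left.
    return [{**rec, "match_reason": "probabilistic"} for rec in records], []

def run_pipeline(records, canonical, stages):
    results = []
    for stage in stages:
        matched, records = stage(records, canonical)
        results.extend(matched)
        if not records:
            break
    return results

canonical = {"1 HIGH ST"}
messy = [
    {"unique_id": "a", "address_concat": "1 HIGH ST"},
    {"unique_id": "b", "address_concat": "1 HIGH STREET"},
]
results = run_pipeline(messy, canonical, [exact_stage, fallback_stage])
```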

available_stages() classmethod

All registered MatchingStage subclasses.

Delegates to MatchingStage.available_stages() which walks the subclass tree dynamically, so newly added stages are picked up automatically without maintaining a hard-coded list.
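The dynamic subclass walk can be illustrated with a recursive traversal of `__subclasses__()`; the classes below are minimal stand-ins, not the real stages:

```python
# Minimal sketch of recursive subclass discovery, the mechanism the
# docstring describes. These classes are stand-ins, not the real stages.

class MatchingStageBase:
    @classmethod
    def available_stages(cls):
        found = []
        for sub in cls.__subclasses__():
            found.append(sub)
            found.extend(sub.available_stages())  # recurse into the tree
        return found

class ExactStage(MatchingStageBase): ...
class TrigramStage(MatchingStageBase): ...
class CustomTrigramStage(TrigramStage): ...  # nested subclass, still found

names = [s.__name__ for s in MatchingStageBase.available_stages()]
```

Because the walk happens at call time, any subclass defined before the call is discovered without registration.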

Results

MatchResult

MatchResult dataclass

Wraps match output with connection-scoped inspection helpers.

Access the underlying DuckDB relation via .matches().

Key methods

  • match_metrics: match-reason breakdown with counts and percentages.
  • match_reasons: distinct match-reason values.
  • _splink_predictions: raw Splink predictions table (requires SplinkStage).

matches(*, all_columns=False)

The underlying DuckDB relation containing match results.

Parameters:

all_columns (bool, default False)
    When True, return every column. By default only the key result columns are returned.

match_metrics(*, order='descending')

Match-reason breakdown with counts and percentages.
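The breakdown amounts to grouping results by match_reason and deriving percentages; a standard-library sketch of the same computation (illustrative values, not the library's implementation):

```python
from collections import Counter

# Hypothetical match_reason values for illustration.
reasons = ["exact", "exact", "splink", "splink", "splink", None]

counts = Counter(reasons)
total = len(reasons)
metrics = [
    {"match_reason": reason, "count": n, "percentage": 100 * n / total}
    for reason, n in counts.most_common()  # descending by count
]
```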

accuracy_analysis(*, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=None)

Generate an accuracy chart or table from labelled match results.

Mirrors Splink's linker.evaluation.accuracy_analysis_from_labels_table API. Requires a ukam_label column in the input addresses.

Parameters:

match_weight_round_to_nearest (float | None, default 0.1)
    Round splink match weights to this increment before grouping. Pass None for full precision.

output_type (Literal['threshold_selection', 'precision_recall', 'table'], default 'threshold_selection')
    One of:

      • "threshold_selection" (default): interactive panel showing precision/recall curves against match-weight threshold.
      • "precision_recall": precision vs recall curve.
      • "table": the raw truth-space data as a list of dicts.

    "roc" is intentionally not supported here because this record-level evaluation does not observe the full negative-pair universe, making TN-dependent ROC interpretation unreliable.

add_metrics (List[Literal['specificity', 'npv', 'accuracy', 'f1', 'f2', 'f0_5', 'p4', 'phi']] | None, default None)
    Extra metrics to include in the "threshold_selection" chart. Accepted values: "specificity", "npv", "accuracy", "f1", "f2", "f0_5", "p4", "phi".

Returns:

Any
    An Altair chart, or a list of dicts when output_type="table".

accuracy_data(*, match_weight_round_to_nearest=0.1)

Compute threshold metrics swept over every emitted decision score.

Each row in the returned list corresponds to one threshold value and contains the confusion-matrix counts (tp, tn, fp, fn) plus the derived rates (tp_rate, fp_rate, precision, recall, f1).

Requires a ukam_label column in the match results relation.

Important semantics: this uses top-1 outcome evaluation. Wrong-ID rows are false positives at their emitted score; they are not score-floored. Recall is derived from true positives as TP/P (equivalently FN = P - TP).

The ground-truth positive class is determined by looking up each record's ukam_label in the canonical dataset. A record whose ukam_label matches a canonical unique_id is treated as an expected match; all others are treated as expected non-matches.
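Under these semantics, ground-truth labelling reduces to a set membership test against canonical IDs; a minimal sketch with hypothetical IDs and row shapes:

```python
# Sketch: a record is an expected match (positive) iff its ukam_label
# exists as a unique_id in the canonical dataset.
canonical_ids = {"uprn-1", "uprn-2", "uprn-3"}

records = [
    {"unique_id": "a", "ukam_label": "uprn-1"},       # expected match
    {"unique_id": "b", "ukam_label": "not-present"},  # expected non-match
]

for rec in records:
    rec["expected_match"] = rec["ukam_label"] in canonical_ids
```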

The score used as the decision threshold is:

    - ``+999`` for non-splink matches (exact, peeled, trigram).
  • The actual match_weight for splink probabilistic matches. - -999 for rows with match_reason = NULL.
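That score assignment can be sketched directly from the rules above; the row shape here is illustrative, not the library's internal representation:

```python
def decision_score(row):
    # +999 for deterministic matches, -999 for unmatched rows,
    # otherwise the splink match_weight.
    reason = row.get("match_reason")
    if reason is None:
        return -999
    if reason in ("exact", "peeled", "trigram"):
        return 999
    return row["match_weight"]

rows = [
    {"match_reason": "exact"},
    {"match_reason": "splink", "match_weight": 4.2},
    {"match_reason": None},
]
scores = [decision_score(r) for r in rows]  # [999, 4.2, -999]
```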

Parameters:

match_weight_round_to_nearest (float | None, default 0.1)
    Round splink match weights to this increment before grouping to reduce the number of threshold points. Pass None to keep full precision.
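The rounding step amounts to snapping each weight to the nearest multiple of the increment; a minimal sketch of that arithmetic (not the library's implementation):

```python
def round_to_nearest(weight, increment):
    # Snap a match weight to the nearest multiple of `increment`;
    # pass increment=None to keep full precision.
    if increment is None:
        return weight
    return round(weight / increment) * increment

rounded = round_to_nearest(3.14159, 0.1)  # ~3.1, up to float precision
```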

Returns:

list[dict]
    List of dicts with keys: truth_threshold, match_probability, tp, tn, fp, fp_neg, fn, tp_rate, tn_rate, fp_rate, fn_rate, precision, recall, f1.

    fp is every accepted non-TP row (used by precision). fp_neg counts only the accepted rows whose label is absent from the canonical dataset; it is used to derive tn, tn_rate, and fp_rate.