API reference

AddressMatcher

Primary entry point for address matching.

For canonical addresses, accepts either a raw DuckDBPyRelation (cleaned on the fly) or a str / Path pointing to a folder created by prepare_canonical_folder. Messy addresses can be a DuckDB relation or a list of AddressRecord / dicts.

Stages default to [ExactMatchStage(), SplinkStage()]. Pass your own list to customise matching behaviour — the existing stage dataclasses (ExactMatchStage, UniqueTrigramStage, SplinkStage) already expose all the knobs you need.

Parameters:

canonical_addresses (Union[DuckDBPyRelation, str, Path], required)
    Canonical dataset to match against. Can be a DuckDBPyRelation or a path to a prepared canonical folder.

canonical_address_filter (str | None, default None)
    Optional DuckDB SQL expression used to filter canonical addresses after load (for prepared folders) or directly on the provided canonical relation.

addresses_to_match (Union[DuckDBPyRelation, list[AddressRecord], list[dict]], required)
    Messy addresses to resolve. Can be a DuckDBPyRelation, a list of AddressRecord, or a list of dicts with address_concat, postcode, and unique_id fields.

con (DuckDBPyConnection, required)
    DuckDB connection to use for all operations.

stages (Optional[list[MatchingStage]], default None)
    Optional list of MatchingStage instances defining the matching pipeline. Defaults to exact match followed by Splink.

cleaning_num_chunks (int, default 10)
    Number of chunks to use for cleaning and term frequency derivation when canonical input is a raw relation. Also used for messy-address cleaning. Must be a positive integer.

debug_options (Optional[DebugOptions], default None)
    Optional DebugOptions to control debug output and logging.

Examples:

Simple matching:

import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
canonical = con.read_parquet("./canonical.parquet")
messy = con.read_parquet("./messy.parquet")

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()

Custom stages:

from uk_address_matcher import (
    AddressMatcher, ExactMatchStage, UniqueTrigramStage, SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(),
    ],
)
result = matcher.match()

Pre-prepared canonical data:

matcher = AddressMatcher(
    canonical_addresses="./prepared_addressbase",
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()

__init__(canonical_addresses, addresses_to_match, *, canonical_address_filter=None, con, stages=None, debug_options=None, cleaning_num_chunks=10)

match()

Runs the full matching pipeline.

Each stage is executed in sequence. Earlier stages consume easy matches; later stages handle the remainder.

Returns:

MatchResult
    A MatchResult wrapper around the final DuckDB relation, including unique_id, resolved_canonical_id, match_reason, and any additional columns produced by the stages.
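The staged consumption described here (earlier stages take the easy matches, later stages handle what remains) can be sketched in plain Python. The stage functions below are hypothetical stand-ins for the real MatchingStage classes, not the library's implementation:

```python
# Illustrative sketch: each stage matches what it can and passes the
# remainder to the next stage. Stage functions here are hypothetical.

def exact_stage(records, canonical):
    matched, remainder = [], []
    for rec in records:
        if rec["address_concat"] in canonical:
            matched.append({**rec, "match_reason": "exact"})
        else:
            remainder.append(rec)
    return matched, remainder

def fallback_stage(records, canonical):
    # A later, more expensive stage handles whatever is left.
    return [{**rec, "match_reason": "probabilistic"} for rec in records], []

def run_pipeline(records, canonical, stages):
    results = []
    for stage in stages:
        matched, records = stage(records, canonical)
        results.extend(matched)
        if not records:
            break
    return results

canonical = {"1 HIGH ST"}
messy = [
    {"unique_id": "a", "address_concat": "1 HIGH ST"},
    {"unique_id": "b", "address_concat": "1 HIGH STREET"},
]
results = run_pipeline(messy, canonical, [exact_stage, fallback_stage])
```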

available_stages() classmethod

All registered MatchingStage subclasses.

Delegates to MatchingStage.available_stages() which walks the subclass tree dynamically, so newly added stages are picked up automatically without maintaining a hard-coded list.
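The dynamic subclass walk can be illustrated with a recursive traversal of `__subclasses__()`; the classes below are minimal stand-ins, not the real stages:

```python
# Minimal sketch of recursive subclass discovery, the mechanism the
# docstring describes. These classes are stand-ins, not the real stages.

class MatchingStageBase:
    @classmethod
    def available_stages(cls):
        found = []
        for sub in cls.__subclasses__():
            found.append(sub)
            found.extend(sub.available_stages())  # recurse into the tree
        return found

class ExactStage(MatchingStageBase): ...
class TrigramStage(MatchingStageBase): ...
class CustomTrigramStage(TrigramStage): ...  # nested subclass, still found

names = [s.__name__ for s in MatchingStageBase.available_stages()]
```

Because the walk happens at call time, any subclass defined before the call is discovered without registration.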

Results

MatchResult

MatchResult dataclass

Wraps match output with connection-scoped inspection helpers.

Access the underlying DuckDB relation via .matches().

Key methods

  • match_metrics: match-reason breakdown with counts and percentages.
  • match_reasons: distinct match-reason values.
  • _splink_predictions: raw Splink predictions table (requires SplinkStage).

matches(*, all_columns=False)

The underlying DuckDB relation containing match results.

Parameters:

all_columns (bool, default False)
    When True, return every column. By default only the key result columns are returned.

match_metrics(*, order='descending')

Match-reason breakdown with counts and percentages.
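The breakdown amounts to grouping results by match_reason and deriving percentages; a standard-library sketch of the same computation (illustrative values, not the library's implementation):

```python
from collections import Counter

# Hypothetical match_reason values for illustration.
reasons = ["exact", "exact", "splink", "splink", "splink", None]

counts = Counter(reasons)
total = len(reasons)
metrics = [
    {"match_reason": reason, "count": n, "percentage": 100 * n / total}
    for reason, n in counts.most_common()  # descending by count
]
```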

accuracy_analysis(*, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=None)

Generate an accuracy chart or table from labelled match results.

Mirrors Splink's linker.evaluation.accuracy_analysis_from_labels_table API. Requires a ukam_label column in the input addresses.

Parameters:

match_weight_round_to_nearest (float | None, default 0.1)
    Round splink match weights to this increment before grouping. Pass None for full precision.

output_type (Literal['threshold_selection', 'precision_recall', 'table'], default 'threshold_selection')
    One of:

      • "threshold_selection" (default): interactive panel showing precision/recall curves against match-weight threshold.
      • "precision_recall": precision vs recall curve.
      • "table": the raw truth-space data as a list of dicts.

    "roc" is intentionally not supported here because this record-level evaluation does not observe the full negative-pair universe, making TN-dependent ROC interpretation unreliable.

add_metrics (List[Literal['specificity', 'npv', 'accuracy', 'f1', 'f2', 'f0_5', 'p4', 'phi']] | None, default None)
    Extra metrics to include in the "threshold_selection" chart. Accepted values: "specificity", "npv", "accuracy", "f1", "f2", "f0_5", "p4", "phi".

Returns:

Any
    An Altair chart, or a list of dicts when output_type="table".

accuracy_data(*, match_weight_round_to_nearest=0.1)

Compute threshold metrics swept over every emitted decision score.

Each row in the returned list corresponds to one threshold value and contains the confusion-matrix counts (tp, tn, fp, fn) plus the derived rates (tp_rate, fp_rate, precision, recall, f1).

Requires a ukam_label column in the match results relation.

Important semantics: this uses top-1 outcome evaluation. Wrong-ID rows are false positives at their emitted score; they are not score-floored. Recall is derived from true positives as TP/P (equivalently FN = P - TP).

The ground-truth positive class is determined by looking up each record's ukam_label in the canonical dataset. A record whose ukam_label matches a canonical unique_id is treated as an expected match; all others are treated as expected non-matches.
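Under these semantics, ground-truth labelling reduces to a set membership test against canonical IDs; a minimal sketch with hypothetical IDs and row shapes:

```python
# Sketch: a record is an expected match (positive) iff its ukam_label
# exists as a unique_id in the canonical dataset.
canonical_ids = {"uprn-1", "uprn-2", "uprn-3"}

records = [
    {"unique_id": "a", "ukam_label": "uprn-1"},       # expected match
    {"unique_id": "b", "ukam_label": "not-present"},  # expected non-match
]

for rec in records:
    rec["expected_match"] = rec["ukam_label"] in canonical_ids
```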

The score used as the decision threshold is:

    - ``+999`` for non-splink matches (exact, peeled, trigram).
  • The actual match_weight for splink probabilistic matches. - -999 for rows with match_reason = NULL.
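That score assignment can be sketched directly from the rules above; the row shape here is illustrative, not the library's internal representation:

```python
def decision_score(row):
    # +999 for deterministic matches, -999 for unmatched rows,
    # otherwise the splink match_weight.
    reason = row.get("match_reason")
    if reason is None:
        return -999
    if reason in ("exact", "peeled", "trigram"):
        return 999
    return row["match_weight"]

rows = [
    {"match_reason": "exact"},
    {"match_reason": "splink", "match_weight": 4.2},
    {"match_reason": None},
]
scores = [decision_score(r) for r in rows]  # [999, 4.2, -999]
```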

Parameters:

match_weight_round_to_nearest (float | None, default 0.1)
    Round splink match weights to this increment before grouping to reduce the number of threshold points. Pass None to keep full precision.
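The rounding step amounts to snapping each weight to the nearest multiple of the increment; a minimal sketch of that arithmetic (not the library's implementation):

```python
def round_to_nearest(weight, increment):
    # Snap a match weight to the nearest multiple of `increment`;
    # pass increment=None to keep full precision.
    if increment is None:
        return weight
    return round(weight / increment) * increment

rounded = round_to_nearest(3.14159, 0.1)  # ~3.1, up to float precision
```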

Returns:

list[dict]
    List of dicts with keys: truth_threshold, match_probability, tp, tn, fp, fp_neg, fn, tp_rate, tn_rate, fp_rate, fn_rate, precision, recall, f1.

    fp is every accepted non-TP row (used by precision). fp_neg counts only the accepted rows whose label is absent from the canonical dataset; it is used to derive tn, tn_rate, and fp_rate.