# API reference

## AddressMatcher
Primary entry point for address matching.

Accepts either a raw `DuckDBPyRelation` (cleaned on the fly) or a `str` / `Path` pointing to a folder created by `prepare_canonical_folder` for canonical addresses. Messy addresses can be a DuckDB relation or a list of `AddressRecord` objects or dicts.

Stages default to `[ExactMatchStage(), SplinkStage()]`. Pass your own list to customise matching behaviour; the existing stage dataclasses (`ExactMatchStage`, `UniqueTrigramStage`, `SplinkStage`) already expose all the knobs you need.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `canonical_addresses` | `Union[DuckDBPyRelation, str, Path]` | Canonical dataset to match against. Can be a raw `DuckDBPyRelation` (cleaned on the fly) or a `str` / `Path` pointing to a folder created by `prepare_canonical_folder`. | *required* |
| `canonical_address_filter` | `str \| None` | Optional DuckDB SQL expression used to filter canonical addresses after load (for prepared folders) or directly on the provided canonical relation. | `None` |
| `addresses_to_match` | `Union[DuckDBPyRelation, list[AddressRecord], list[dict]]` | Messy addresses to resolve. Can be a DuckDB relation or a list of `AddressRecord` objects or dicts. | *required* |
| `con` | `DuckDBPyConnection` | DuckDB connection to use for all operations. | *required* |
| `stages` | `Optional[list[MatchingStage]]` | Optional list of `MatchingStage` instances; defaults to `[ExactMatchStage(), SplinkStage()]`. | `None` |
| `cleaning_num_chunks` | `int` | Number of chunks to use for cleaning and term frequency derivation when canonical input is a raw relation. Also used for messy-address cleaning. Must be a positive integer. | `10` |
| `debug_options` | `Optional[DebugOptions]` | Optional `DebugOptions` instance. | `None` |
Examples:

Simple matching:

```python
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
canonical = con.read_parquet("./canonical.parquet")
messy = con.read_parquet("./messy.parquet")

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()
```

Custom stages:

```python
from uk_address_matcher import (
    AddressMatcher, ExactMatchStage, UniqueTrigramStage, SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(),
    ],
)
result = matcher.match()
```

Pre-prepared canonical data:

```python
matcher = AddressMatcher(
    canonical_addresses="./prepared_addressbase",
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()
```
### `__init__(canonical_addresses, addresses_to_match, *, canonical_address_filter=None, con, stages=None, debug_options=None, cleaning_num_chunks=10)`
### `match()`
Runs the full matching pipeline.
Each stage is executed in sequence. Earlier stages consume easy matches; later stages handle the remainder.
Returns:

| Type | Description |
|---|---|
| `MatchResult` | A `MatchResult` wrapping the match results, including any additional columns produced by the stages. |
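The stage sequencing described above can be sketched in plain Python. This is an illustrative model only: the stage functions and record shapes below are simplified stand-ins, not the library's internals.

```python
# Illustrative sketch of a staged matching pipeline: each stage claims
# the records it can resolve and passes the remainder to the next stage.
def run_pipeline(records, stages):
    matches = []
    remaining = list(records)
    for stage in stages:
        matched, remaining = stage(remaining)
        matches.extend(matched)
    return matches, remaining

# A cheap "exact" stage resolves easy cases (already-normalised strings);
# a later, more expensive stage handles whatever is left.
def exact_stage(records):
    matched = [r for r in records if r == r.strip().lower()]
    unmatched = [r for r in records if r != r.strip().lower()]
    return matched, unmatched

def fallback_stage(records):
    # Normalise everything remaining; nothing is passed on.
    return [r.strip().lower() for r in records], []

matches, leftover = run_pipeline(
    ["10 high st", " 10 High St "], [exact_stage, fallback_stage]
)
```

The key property mirrored here is that later stages only ever see records the earlier stages could not resolve.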
### `available_stages()` *classmethod*
All registered `MatchingStage` subclasses.

Delegates to `MatchingStage.available_stages()`, which walks the subclass tree dynamically, so newly added stages are picked up automatically without maintaining a hard-coded list.
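The dynamic subclass walk can be sketched with plain Python classes. This is a generic illustration of the pattern (`walk_subclasses` and the stage classes here are hypothetical), not the library's code:

```python
# Recursively collect every subclass of a base class, so newly defined
# stages are discovered without a hard-coded registry.
def walk_subclasses(cls):
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(walk_subclasses(sub))  # pick up nested subclasses too
    return found

class Stage: ...
class ExactStage(Stage): ...
class FuzzyStage(Stage): ...
class SplinkLikeStage(FuzzyStage): ...  # indirect subclass is still found

names = [c.__name__ for c in walk_subclasses(Stage)]
```

Because discovery happens at call time via `__subclasses__()`, any class defined (and imported) before the call is included automatically.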
## Results

### `MatchResult` *dataclass*
Wraps match output with connection-scoped inspection helpers. Access the underlying DuckDB relation via `.matches()`.

Key methods:

- `match_metrics` - match-reason breakdown with counts and percentages.
- `match_reasons` - distinct match-reason values.
- `_splink_predictions` - raw Splink predictions table (requires `SplinkStage`).
#### `matches(*, all_columns=False)`
The underlying DuckDB relation containing match results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `all_columns` | `bool` | When `True`, return every column. By default only the key result columns are returned. | `False` |
#### `match_metrics(*, order='descending')`

Match-reason breakdown with counts and percentages.
#### `accuracy_analysis(*, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=None)`

Generate an accuracy chart or table from labelled match results.

Mirrors Splink's `linker.evaluation.accuracy_analysis_from_labels_table` API. Requires a `ukam_label` column in the input addresses.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `match_weight_round_to_nearest` | `float \| None` | Round splink match weights to this increment before grouping. Pass `None` to disable rounding. | `0.1` |
| `output_type` | `Literal['threshold_selection', 'precision_recall', 'table']` | One of `'threshold_selection'`, `'precision_recall'`, or `'table'`. | `'threshold_selection'` |
| `add_metrics` | `List[Literal['specificity', 'npv', 'accuracy', 'f1', 'f2', 'f0_5', 'p4', 'phi']] \| None` | Extra metrics to include in the output. | `None` |
Returns:

| Type | Description |
|---|---|
| `Any` | An Altair chart, or a list of dicts when `output_type='table'`. |
#### `accuracy_data(*, match_weight_round_to_nearest=0.1)`
Compute threshold metrics swept over every emitted decision score.

Each row in the returned list corresponds to one threshold value and contains the confusion-matrix counts (`tp`, `tn`, `fp`, `fn`) plus the derived rates (`tp_rate`, `fp_rate`, `precision`, `recall`, `f1`).

Requires a `ukam_label` column in the match results relation.

Important semantics: this uses top-1 outcome evaluation. Wrong-ID rows are false positives at their emitted score; they are not score-floored. Recall is derived from true positives as `TP / P` (equivalently `FN = P - TP`).
The ground-truth positive class is determined by looking up each record's `ukam_label` in the canonical dataset. A record whose `ukam_label` matches a canonical `unique_id` is treated as an expected match; all others are treated as expected non-matches.
The score used as the decision threshold is:

- `+999` for non-splink matches (exact, peeled, trigram).
- The actual `match_weight` for splink probabilistic matches.
- `-999` for rows with `match_reason = NULL`.
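Under those semantics, the per-threshold computation can be sketched as follows. This is a self-contained illustration with made-up scores and labels; the real method reads these values from the match results relation.

```python
# Each emitted match has a decision score and a top-1 correctness flag
# (the emitted canonical ID either equals the record's ukam_label or not).
rows = [
    {"score": 999.0,  "correct": True},   # non-splink (e.g. exact) match
    {"score": 12.5,   "correct": True},   # confident splink match
    {"score": 3.0,    "correct": False},  # wrong-ID match: FP at its emitted score
    {"score": -999.0, "correct": False},  # no match emitted
]
positives = 3  # records whose ukam_label exists in the canonical data (P)

def metrics_at(threshold, rows, positives):
    tp = sum(r["correct"] and r["score"] >= threshold for r in rows)
    fp = sum((not r["correct"]) and r["score"] >= threshold for r in rows)
    fn = positives - tp  # FN = P - TP, as described above
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / positives  # recall derived as TP / P
    return {"threshold": threshold, "tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

m = metrics_at(5.0, rows, positives)
```

Note how the wrong-ID row contributes a false positive only at thresholds at or below its emitted score of `3.0`; it is never promoted to a higher threshold.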
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `match_weight_round_to_nearest` | `float \| None` | Round splink match weights to this increment before grouping, to reduce the number of threshold points. Pass `None` to disable rounding. | `0.1` |
Returns:

| Type | Description |
|---|---|
| `list[dict]` | List of dicts, one per threshold, with keys for the threshold, the confusion-matrix counts (`tp`, `tn`, `fp`, `fn`), and the derived rates (`tp_rate`, `fp_rate`, `precision`, `recall`, `f1`). The expected-match class comes from looking up `ukam_label` in the canonical dataset; it is used to derive `P` and hence `FN = P - TP`. |