API reference
AddressMatcher
Primary entry point for address matching.
Canonical addresses can be supplied either as a raw DuckDBPyRelation (cleaned on the fly) or as a
str / Path pointing to a folder created by
prepare_canonical_folder.
Messy addresses can be a DuckDB relation or a list of
AddressRecord objects or dicts.
Stages default to [ExactMatchStage(), SplinkStage()]. Pass your own
list to customise matching behaviour. The existing stage dataclasses
(ExactMatchStage, UniqueTrigramStage, SplinkStage) expose
the settings you are likely to need.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| canonical_addresses | Union[DuckDBPyRelation, str, Path] | Canonical dataset to match against. Can be a DuckDBPyRelation (cleaned on the fly) or a str / Path pointing to a folder created by prepare_canonical_folder. | required |
| canonical_address_filter | str \| None | Optional DuckDB SQL expression used to filter canonical addresses after load (for prepared folders) or directly on the provided canonical relation. | None |
| addresses_to_match | Union[DuckDBPyRelation, list[AddressRecord], list[dict]] | Messy addresses to resolve. Can be a DuckDB relation or a list of AddressRecord objects or dicts. | required |
| con | DuckDBPyConnection | DuckDB connection to use for all operations. | required |
| stages | Optional[list[MatchingStage]] | Optional list of MatchingStage instances. When omitted, [ExactMatchStage(), SplinkStage()] is used. | None |
| debug_options | Optional[DebugOptions] | Optional DebugOptions instance. | None |
Examples:
Simple matching:
import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()
canonical = con.read_parquet("./canonical.parquet")
messy = con.read_parquet("./messy.parquet")

matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()
Custom stages:
from uk_address_matcher import (
    AddressMatcher, ExactMatchStage, UniqueTrigramStage, SplinkStage,
)

# Reuses con, canonical and messy from the simple matching example.
matcher = AddressMatcher(
    canonical_addresses=canonical,
    addresses_to_match=messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(),
    ],
)
result = matcher.match()
Pre-prepared canonical data:
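# Reuses con and messy from the simple matching example; the string is the
# path to a folder created by prepare_canonical_folder.
matcher = AddressMatcher(
    canonical_addresses="./prepared_addressbase",
    addresses_to_match=messy,
    con=con,
)
result = matcher.match()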
__init__(canonical_addresses, addresses_to_match, *, canonical_address_filter=None, con, stages=None, debug_options=None)
match()
Runs the full matching pipeline.
Each stage is executed in sequence. Earlier stages consume easy matches; later stages handle the remainder.
Returns:
| Type | Description |
|---|---|
| MatchResult | A MatchResult wrapping the match output, including any additional columns produced by the stages. |
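The staged behaviour described for match() can be illustrated with a small pure-Python sketch. It is not the library's implementation: the stage functions, the dictionary-based canonical data and run_pipeline are hypothetical stand-ins, used only to show how each stage sees just the addresses that earlier stages left unmatched.

```python
from typing import Callable, Optional

# Hypothetical stand-in: a "stage" is a function that returns a canonical id
# for a messy address, or None when it cannot decide.
Stage = Callable[[str, dict], Optional[int]]


def exact_stage(address: str, canonical: dict) -> Optional[int]:
    """Match only when the normalised strings are identical."""
    return canonical.get(address.strip().lower())


def prefix_stage(address: str, canonical: dict) -> Optional[int]:
    """Toy fallback: match when exactly one canonical address shares the prefix."""
    key = address.strip().lower()
    hits = [uid for text, uid in canonical.items() if text.startswith(key)]
    return hits[0] if len(hits) == 1 else None


def run_pipeline(messy, canonical, stages):
    """Run stages in order; each stage only sees addresses still unmatched."""
    results = {}
    remaining = list(messy)
    for stage in stages:
        still_unmatched = []
        for address in remaining:
            match = stage(address, canonical)
            if match is None:
                still_unmatched.append(address)
            else:
                # Record which stage produced the match.
                results[address] = (match, stage.__name__)
        remaining = still_unmatched
    for address in remaining:
        results[address] = (None, "unmatched")
    return results


canonical = {"10 downing street london": 1, "11 downing street london": 2}
messy = ["10 Downing Street London", "11 downing", "1 unknown road"]
results = run_pipeline(messy, canonical, [exact_stage, prefix_stage])
```

In this toy run the first address is resolved by the exact stage, the second falls through to the prefix stage, and the third stays unmatched. Putting cheap, high-precision stages first keeps the more expensive stages working on a smaller remainder.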
available_stages() classmethod
All registered MatchingStage subclasses.
Delegates to MatchingStage.available_stages() which walks the
subclass tree dynamically, so newly added stages are picked up
automatically without maintaining a hard-coded list.
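A dynamic subclass walk of this kind can be sketched with Python's built-in __subclasses__(). The class names below are hypothetical and the real MatchingStage.available_stages() may differ in detail; the sketch only shows why no hard-coded list is needed.

```python
class MatchingStageSketch:
    """Hypothetical base class standing in for MatchingStage."""

    @classmethod
    def available_stages(cls):
        """Return every direct and indirect subclass, discovered at call time."""
        found = []
        for sub in cls.__subclasses__():
            found.append(sub)
            # Recurse so that subclasses of subclasses are included too.
            found.extend(sub.available_stages())
        return found


class ExactSketch(MatchingStageSketch):
    pass


class FuzzySketch(MatchingStageSketch):
    pass


class TrigramSketch(FuzzySketch):
    pass


names_before = sorted(s.__name__ for s in MatchingStageSketch.available_stages())


# A stage defined later is picked up without touching the base class.
class LateSketch(MatchingStageSketch):
    pass


names_after = sorted(s.__name__ for s in MatchingStageSketch.available_stages())
```

Because the walk happens on every call, names_after contains LateSketch even though nothing registered it explicitly.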
Results
MatchResult
MatchResult dataclass
Wraps match output with connection-scoped inspection helpers.
Access the underlying DuckDB relation via .matches().
Key methods:
- match_metrics: match-reason breakdown with counts and percentages.
- match_reasons: distinct match-reason values.
- splink_predictions: raw Splink predictions table (requires SplinkStage).
matches(*, all_columns=False)
The underlying DuckDB relation containing match results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_columns | bool | When True, return every column. By default only the key result columns are returned. | False |
match_metrics(*, order='descending')
Match-reason breakdown with counts and percentages.
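To make the shape of such a breakdown concrete, here is a minimal pure-Python sketch. The function name, the output keys and the assumption that order sorts rows by count are all illustrative; the real method works on the match relation and may order or name things differently.

```python
from collections import Counter


def match_reason_breakdown(reasons, order="descending"):
    """Count each match reason and express it as a share of all records."""
    counts = Counter(reasons)
    total = sum(counts.values())
    if total == 0:
        return []
    rows = [
        {"match_reason": reason, "count": n, "percentage": round(100 * n / total, 1)}
        for reason, n in counts.items()
    ]
    # Assumption: "descending" puts the most frequent reason first.
    rows.sort(key=lambda row: row["count"], reverse=(order == "descending"))
    return rows


reasons = ["exact", "exact", "splink", "unmatched", "exact"]
breakdown = match_reason_breakdown(reasons)
```

With this input the first row reports the "exact" reason with a count of 3, which is 60.0 percent of the five records.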
splink_predictions(limit=None, ukam_ids=None, *, threshold_match_probability=None, threshold_match_weight=None)
Splink predictions as a DuckDB relation.
Use ukam_ids to filter on the input-side identifier.
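When choosing between the two threshold arguments, it can help to convert between the scales. The sketch below assumes the usual Splink convention that a match weight is the base-2 log-odds of the match probability; the helper names are hypothetical and are not part of this package.

```python
import math


def probability_to_weight(probability: float) -> float:
    """Convert a match probability in (0, 1) to a base-2 log-odds weight."""
    return math.log2(probability / (1 - probability))


def weight_to_probability(weight: float) -> float:
    """Convert a base-2 log-odds weight back to a probability."""
    return 2**weight / (1 + 2**weight)


# A probability of 0.5 corresponds to a weight of 0 (even odds).
even_odds_weight = probability_to_weight(0.5)
# Round trip: the weight for 0.9 converts back to roughly 0.9.
round_trip = weight_to_probability(probability_to_weight(0.9))
```

Under that assumption, positive weights correspond to probabilities above 0.5 and negative weights to probabilities below it.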