Optimising accuracy¶
uk_address_matcher has a variety of settings that can be tuned to optimise accuracy depending on your use case.
In 'biggest wins' we describe the most important settings that are likely to result in unambiguously better accuracy.
In 'optimising match stages' we describe settings which are harder to make recommendations about because the best settings depend on the input data.
Biggest wins¶
If matching to Ordnance Survey data, use ukam-os-builder to prepare it for matching¶
In its typical form, Ordnance Survey address data contains one address UPRN per row.
However, if processed in a specific way, Ordnance Survey contains variants on an address that provides a greater 'target area' to match to. Our ukam-os-builder tool does this processing for you to increase the chance of a match. Use of this tool is documented in this guide.
To illustrate why this is important, a single address could have two variants:
- Basement Flat, 10 Demo Road, Townton
- Flat A, Example Court, 10 Demo Road, Townton
By providing uk_address_matcher with a canonical address dataset built by ukam-os-builder, you will match against all these variants, giving more options for a high scoring match.
Filter your input data to the smallest possible dataset¶
The smaller the number of addresses to match to, the more accurate your results are likely to be, because there is less chance of multiple similar candidates (such as two different '1 High Street' addresses in two different geographical locations).
Matching will also be faster, because there are fewer candidates to compare against.
There are two primary ways to filter your input data:
- Geographically. If your input data comes from a known geographical area such as a local authority, use an extract of Ordnance Survey data for that area only.
- By classification or other metadata. Ordnance Survey data contains rich metadata about each address, such as its classification. For example, if you know your messy data is residential only, filter out non-residential addresses from the canonical dataset.
How to filter¶
The mechanism for filtering depends on whether you are pre-processing your canonical dataset or processing it on the fly.
If you are processing data on-the-fly, then you can simply filter your data before passing it to AddressMatcher:
Filtering for on the fly processing¶
import duckdb

from uk_address_matcher import AddressMatcher

con = duckdb.connect()

df_canonical = con.read_csv("path/to/canonical.csv")

# Filter to residential addresses only
df_canonical = df_canonical.filter("substr(classificationcode, 1, 1) = 'R'")

df_messy = con.read_csv("path/to/messy.csv")

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()
Filtering a pre-prepared dataset¶
If you are pre-processing your canonical dataset, consider whether users of the pre-processed file will always want a filter applied. If so, apply the filter before passing the data to prepare_canonical_folder.
import duckdb
import tempfile

from uk_address_matcher import prepare_canonical_folder

con = duckdb.connect()

df_canonical = con.read_csv("path/to/canonical.csv")
df_canonical = df_canonical.filter("substr(classificationcode, 1, 1) = 'R'")

output_folder = tempfile.mkdtemp()

prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
)
However, if different users will need different filters, you can also apply a filter after pre-processing the whole dataset. This will result in a small degradation in accuracy because indices and term frequencies will be computed globally, making them less discriminative.
# Prepare the whole, unfiltered dataset once...
output_folder = "path_to_prepared_canonical_folder"

df_canonical = con.read_csv("path/to/canonical.csv")

prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
)

# ...then apply a per-user filter at match time
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    canonical_address_filter="substr(classificationcode, 1, 1) = 'R'",
    con=con,
)

result = matcher.match()
Optimising match stages¶
uk_address_matcher uses an 'ensemble' methodology, whereby it sequentially applies a number of matching strategies called 'stages'.
Stages run in order. Once a record is matched by one stage, later stages do not revisit it. The results contain a column indicating which stage produced the match.
Stages are therefore intended to be ordered from highest precision to lowest precision. For example, the first stage is usually the ExactMatchStage, since it makes sense to find all exact matches (full address string and postcode) before applying any more sophisticated fuzzy matching strategies.
Matching stages¶
The available stages are as follows:
| Stage | Type | What it is good at | Accuracy implication |
|---|---|---|---|
| `ExactMatchStage` | Deterministic | Cleaned address text is already the same on both sides | Very high precision, should usually run first |
| `PeeledAddressStage` | Deterministic | One side has extra trailing locality words such as LONDON or HACKNEY | High precision, useful before probabilistic matching |
| `UniqueTrigramStage` | Deterministic | A distinctive phrase identifies one canonical address within the postcode | High precision, removes clear fuzzy cases before Splink |
| `SplinkStage` | Scored | Typos, abbreviations, partial matches, and other fuzzy cases | Precision and recall depend on threshold choice |
Summary recommendation¶
You almost always want to use the ExactMatchStage. The PeeledAddressStage and UniqueTrigramStage offer high, but not perfect, precision (i.e. there is a chance of a small number of false positives).
You almost always want to run the SplinkStage last, to attempt to find any matches missed by the previous stages. In some cases it may produce higher accuracy than the PeeledAddressStage and UniqueTrigramStage, which is why those two stages are not always worth including.
from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=2.0,
            final_distinguishability_threshold=1.0,
        ),
    ],
)
Tuning the Splink stage¶
For guidance on choosing Splink thresholds, see here.
Stage API docs¶
ExactMatchStage¶
ExactMatchStage (dataclass)¶
Bases: MatchingStage
Deterministic exact matching on clean_full_address and postcode.
This is usually the first stage in a pipeline. It accepts the easy, unambiguous cases before any probabilistic matching is attempted.
A match is emitted when the cleaned messy address and the cleaned canonical address are identical and the postcode is also identical. A cleaned address match on its own is not enough: differing postcodes will not match.
Example
"10 Demo Road Townton" matches
"10 Demo Road, Townton" only when cleaning normalises punctuation
and whitespace and both records have the same postcode.
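The match condition above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's implementation — the `clean` function here is a crude stand-in for uk_address_matcher's internal address cleaning:

```python
import re


def clean(address: str) -> str:
    """Crude normalisation: lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", address.lower())
    return " ".join(text.split())


def is_exact_match(messy_addr, messy_pc, canon_addr, canon_pc) -> bool:
    # The cleaned address text must be identical on both sides...
    # ...AND the postcodes must match: a clean-address match alone is not enough.
    return clean(messy_addr) == clean(canon_addr) and messy_pc == canon_pc


print(is_exact_match("10 Demo Road, Townton", "AB1 2CD",
                     "10 Demo Road Townton", "AB1 2CD"))   # True
print(is_exact_match("10 Demo Road, Townton", "AB1 2CD",
                     "10 Demo Road Townton", "ZZ9 9ZZ"))   # False: postcodes differ
```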
PeeledAddressStage¶
PeeledAddressStage (dataclass)¶
Bases: MatchingStage
Deterministic matching after peeling common UK locality suffixes.
This stage removes trailing locality words such as borough, county, or city
names, then performs an exact match on the peeled address plus postcode.
It is useful when one side includes extra suffixes such as "Hackney
London" and the other does not, but it still requires the postcodes to
be identical.
Use this before SplinkStage so these high-precision cases are resolved
without needing probabilistic thresholds.
Example
"100 Test Street Hackney London" can match
"100 Test Street" when both share the same postcode. Peeling only
relaxes the address text comparison; it does not allow cross-postcode
matches.
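Peeling can be sketched as stripping known locality words from the end of the token list. This is a conceptual sketch only — the suffix list here is invented for illustration, and the library's actual list of locality words is internal to PeeledAddressStage:

```python
# Illustrative suffix set only; the real stage uses its own list of UK localities
LOCALITY_SUFFIXES = {"london", "hackney"}


def peel(address: str) -> str:
    """Remove trailing locality words, outermost first."""
    tokens = address.lower().split()
    while tokens and tokens[-1] in LOCALITY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# "Hackney" then "London" are peeled; the postcode comparison is unaffected
print(peel("100 Test Street Hackney London"))  # "100 test street"
print(peel("100 Test Street"))                 # unchanged: "100 test street"
```

Note that only the address text comparison is relaxed: the peeled addresses must still share an identical postcode to match.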
UniqueTrigramStage¶
UniqueTrigramStage (dataclass)¶
Bases: MatchingStage
Deterministic matching using n-grams that identify one canonical row.
The stage looks for n-grams, usually trigrams, that appear in exactly one canonical address within the same postcode and with the same numeric and unit structure. If all supporting evidence points to one canonical row, the stage emits a match without using a score.
This is a good stage to place before SplinkStage because it removes
clear-cut fuzzy cases from the scored stage.
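The core idea — a trigram that occurs in exactly one canonical address acts as a fingerprint — can be sketched as follows. This is a simplified illustration, not the library's implementation (the real stage also checks postcode and numeric/unit structure):

```python
from collections import defaultdict


def trigrams(address: str) -> set[str]:
    """All consecutive 3-token phrases in an address."""
    tokens = address.lower().split()
    return {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


# Hypothetical canonical addresses, all within one postcode
canonical = {
    1: "basement flat 10 demo road townton",
    2: "flat a example court 10 demo road townton",
}

# Index: trigram -> set of canonical ids containing it
index = defaultdict(set)
for uprn, addr in canonical.items():
    for tg in trigrams(addr):
        index[tg].add(uprn)


def unique_trigram_match(messy: str):
    """Return a canonical id only if all discriminating trigrams agree on it."""
    hits = {next(iter(ids)) for tg in trigrams(messy)
            if len(ids := index.get(tg, set())) == 1}
    return hits.pop() if len(hits) == 1 else None


# "basement flat 10" appears in only one canonical row, so it pins the match
print(unique_trigram_match("basement flat 10 demo rd townton"))  # 1
# Shared trigrams like "10 demo road" are ambiguous, so no match is emitted
print(unique_trigram_match("10 demo road townton"))              # None
```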
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ngram_size` | `int` | Size of the token n-gram to index. | `3` |
| `min_unique_hits` | `int` | Minimum number of unique n-grams that must support the same canonical address before a match is accepted. | `1` |
| `include_conflicts` | `bool` | When … | `False` |
| `include_trigram_text` | `bool` | When … | `True` |
SplinkStage¶
SplinkStage (dataclass)¶
Bases: MatchingStage
Probabilistic matching stage built on Splink.
This stage is usually placed last because it is the only stage that emits a score and therefore requires threshold tuning. Earlier deterministic stages should remove the obvious high-precision matches first, leaving Splink to handle the harder residual cases.
The stage returns the standard match columns plus two key diagnostics:
- match_weight: the strength of evidence for the selected canonical candidate. Higher is better.
- distinguishability: the gap in match_weight between the best candidate and the next best candidate for the same messy record. Higher means the winner is clearer. NULL usually means there was only one candidate left after blocking.
Setting final_match_weight_threshold=-20 and
final_distinguishability_threshold=0.0 is a permissive configuration
that keeps almost all top-ranked Splink candidates. Raising either
threshold filters out more weak or ambiguous matches, typically improving
precision at the cost of recall.
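The interaction between the two final thresholds can be sketched with plain Python. This mirrors the filtering described above but is not the library's code; in particular, treating a NULL (`None`) distinguishability as passing is an assumption here:

```python
# Hypothetical top-ranked Splink candidates for three messy records
candidates = [
    {"match_weight": 12.4, "distinguishability": 8.1},   # strong, clear winner
    {"match_weight": 3.0,  "distinguishability": 0.2},   # above weight but ambiguous
    {"match_weight": -5.0, "distinguishability": None},  # weak, sole candidate
]


def passes(row, weight_threshold=2.0, disting_threshold=1.0):
    """Apply final_match_weight_threshold then final_distinguishability_threshold."""
    if row["match_weight"] < weight_threshold:
        return False
    # Assumption: NULL distinguishability (single surviving candidate) passes
    if row["distinguishability"] is None:
        return True
    return row["distinguishability"] >= disting_threshold


print([passes(r) for r in candidates])  # [True, False, False]
```

Raising either threshold rejects more rows, trading recall for precision; the permissive `-20` / `0.0` configuration would keep all three example rows except on distinguishability grounds.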
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `predict_threshold_match_weight` | `float` | Initial minimum score passed to … | `-50` |
| `improve_threshold_match_weight` | `float` | Minimum score considered when applying the token-based score adjustment step. | `-20` |
| `improve_top_n_matches` | `int` | Number of top candidate pairs per messy address to retain for the token-based score adjustment step. | `5` |
| `improve_use_bigrams` | `bool` | Whether the token-based improvement step should use bigrams as well as single tokens. | `True` |
| `final_match_weight_threshold` | `float` | Minimum match_weight required for a match to be emitted. | `-20.0` |
| `final_distinguishability_threshold` | `Optional[float]` | Minimum distinguishability required for a Splink match to be emitted. Set to … | `0.0` |
| `include_full_postcode_block` | `bool` | Whether to include a strict full-postcode blocking rule when generating Splink candidate pairs. | `False` |
| `include_outside_postcode_block` | `bool` | Whether to include broader blocking rules that can generate candidate pairs across postcode boundaries. | `True` |
| `additional_columns_to_retain` | `Optional[list[str]]` | Extra columns to keep in the Splink predictions and downstream inspection output. | `None` |
| `settings` | `Optional[SettingsCreator]` | Optional custom Splink settings object. Leave as … | `None` |
| `retain_intermediate_calculation_columns` | `bool` | Retain Splink comparison columns needed for debugging and waterfall charts. | `False` |