
Optimising accuracy

uk_address_matcher has a variety of settings that can be tuned to optimise accuracy depending on your use case.

In 'biggest wins' we describe the most important settings that are likely to result in unambiguously better accuracy.

In 'optimising match stages' we describe settings which are harder to make recommendations about because the best settings depend on the input data.

Biggest wins

If matching to Ordnance Survey data, use ukam-os-builder to prepare it for matching

In its typical form, Ordnance Survey address data contains one address UPRN per row.

However, if processed in a specific way, Ordnance Survey data contains variants of an address, providing a greater 'target area' to match to. Our ukam-os-builder tool does this processing for you to increase the chance of a match. Use of this tool is documented in this guide.

To illustrate why this is important, a single address could have two variants:

  1. Basement Flat, 10 Demo Road, Townton
  2. Flat A, Example Court, 10 Demo Road, Townton

By providing uk_address_matcher with a canonical address dataset built by ukam-os-builder, you will match against all of these variants, giving more options for a high-scoring match.
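The effect can be sketched in plain Python (an illustration of the idea, not the library's implementation; the UPRN and address strings below are made up):

```python
# Matching against several variants of the same UPRN gives more chances of an
# exact hit than a single canonical form would.
canonical_variants = {
    "100090000001": [
        "BASEMENT FLAT 10 DEMO ROAD TOWNTON",
        "FLAT A EXAMPLE COURT 10 DEMO ROAD TOWNTON",
    ],
}

def exact_match_uprn(messy_address, variants_by_uprn):
    """Return the UPRN whose variant list contains the messy address."""
    for uprn, variants in variants_by_uprn.items():
        if messy_address in variants:
            return uprn
    return None

# A single-variant dataset holding only the 'Basement Flat' form would miss
# this record; the variant list catches it.
match = exact_match_uprn("FLAT A EXAMPLE COURT 10 DEMO ROAD TOWNTON", canonical_variants)
```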

Filter your input data to the smallest possible dataset

The smaller the number of addresses to match to, the more accurate your results are likely to be, because there is less chance of multiple similar candidates (such as two different '1 High Street' addresses in two different geographical locations).

Matching will also be faster, because there are fewer candidates to compare against.

There are two primary ways to filter your input data:

  • Geographically. If your input data comes from a known geographical area such as a local authority, use an extract of Ordnance Survey data for that area only.
  • By classification or other metadata. Ordnance Survey data contains rich metadata about each address, such as its classification. For example, if you know your messy data is residential only, filter out non-residential addresses from the canonical dataset.

How to filter

The mechanism for filtering depends on whether you are pre-processing your canonical dataset or processing it on the fly.

If you are processing data on-the-fly, then you can simply filter your data before passing it to AddressMatcher:

Filtering for on-the-fly processing

import duckdb
from uk_address_matcher import AddressMatcher

con = duckdb.connect()

df_canonical = con.read_csv("path/to/canonical.csv")

# Filter to residential addresses only
df_canonical = df_canonical.filter("substr(classificationcode, 1, 1) = 'R'")
df_messy = con.read_csv("path/to/messy.csv")

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)
result = matcher.match()

Filtering a pre-prepared dataset

If you are pre-processing your canonical dataset, consider whether users of the preprocessed file will always want a filter applied. If so, apply the filter before passing the data to prepare_canonical_folder.

import duckdb
import tempfile
from uk_address_matcher import prepare_canonical_folder

con = duckdb.connect()
df_canonical = con.read_csv("path/to/canonical.csv")
df_canonical = df_canonical.filter("substr(classificationcode, 1, 1) = 'R'")

output_folder = tempfile.mkdtemp()
prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con
)

However, if different users will need different filters, you can also apply a filter after pre-processing the whole dataset. This will result in a small degradation in accuracy because indices and term frequencies will be computed globally, making them less discriminative.

output_folder = "path_to_prepared_canonical_folder"
df_canonical = con.read_csv("path/to/canonical.csv")
df_messy = con.read_csv("path/to/messy.csv")

prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con
)

matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    canonical_address_filter="substr(classificationcode, 1, 1) = 'R'",
    con=con,
)
result = matcher.match()

Optimising match stages

uk_address_matcher uses an 'ensemble' methodology, whereby it sequentially applies a number of matching strategies called 'stages'.

Stages run in order. Once a record is matched by one stage, later stages do not revisit it. The results contain a column indicating which stage produced the match.

As a result, stages are intended to be ordered from highest precision to lowest. For example, the first stage is usually the ExactMatchStage, since it makes sense to find all exact matches (full address string and postcode) before applying any more sophisticated fuzzy matching strategies.
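The first-match-wins behaviour can be sketched as follows (a hypothetical illustration, not the library's actual stage API):

```python
# Each stage only sees records left unmatched by earlier stages, and the
# stage that produced each match is recorded alongside the result.
def run_stages(messy_records, stages):
    results = {}
    remaining = list(messy_records)
    for stage_name, stage_fn in stages:
        still_unmatched = []
        for record in remaining:
            match = stage_fn(record)
            if match is not None:
                # Later stages never revisit this record
                results[record] = {"match": match, "stage": stage_name}
            else:
                still_unmatched.append(record)
        remaining = still_unmatched
    return results, remaining

# A high-precision lookup stage followed by a looser fallback stage
lookup = {"10 DEMO ROAD": "uprn-1"}
stages = [
    ("exact", lookup.get),
    ("fuzzy", lambda r: "uprn-2" if "ROAD" in r else None),
]
results, unmatched = run_stages(["10 DEMO ROAD", "10 DEM0 ROAD", "ELSEWHERE"], stages)
```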

Matching stages

The available stages are as follows:

| Stage | Type | What it is good at | Accuracy implication |
| --- | --- | --- | --- |
| ExactMatchStage | Deterministic | Cleaned address text is already the same on both sides | Very high precision, should usually run first |
| PeeledAddressStage | Deterministic | One side has extra trailing locality words such as LONDON or HACKNEY | High precision, useful before probabilistic matching |
| UniqueTrigramStage | Deterministic | A distinctive phrase identifies one canonical address within the postcode | High precision, removes clear fuzzy cases before Splink |
| SplinkStage | Scored | Typos, abbreviations, partial matches, and other fuzzy cases | Precision and recall depend on threshold choice |

Summary recommendation

You will almost always want to use the ExactMatchStage. The PeeledAddressStage and UniqueTrigramStage produce high, but not perfect, precision (i.e. there is a chance of a small number of false positives).

You will almost always want to use the SplinkStage last, to attempt to find any matches missed by the previous stages. In some cases it may produce higher accuracy than the PeeledAddressStage and UniqueTrigramStage, which is why those two stages are optional.

from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=2.0,
            final_distinguishability_threshold=1.0,
        ),
    ],
)

For guidance on choosing Splink thresholds, see here.

Stage API docs

ExactMatchStage

ExactMatchStage dataclass

Bases: MatchingStage

Deterministic exact matching on clean_full_address and postcode.

This is usually the first stage in a pipeline. It accepts the easy, unambiguous cases before any probabilistic matching is attempted.

A match is emitted when the cleaned messy address and the cleaned canonical address are identical and the postcode is also identical. A cleaned address match on its own is not enough: differing postcodes will not match.

Example

"10 Demo Road Townton" matches "10 Demo Road, Townton" only when cleaning normalises punctuation and whitespace and both records have the same postcode.

PeeledAddressStage

PeeledAddressStage dataclass

Bases: MatchingStage

Deterministic matching after peeling common UK locality suffixes.

This stage removes trailing locality words such as borough, county, or city names, then performs an exact match on the peeled address plus postcode. It is useful when one side includes extra suffixes such as "Hackney London" and the other does not, but it still requires the postcodes to be identical.

Use this before SplinkStage so these high-precision cases are resolved without needing probabilistic thresholds.

Example

"100 Test Street Hackney London" can match "100 Test Street" when both share the same postcode. Peeling only relaxes the address text comparison; it does not allow cross-postcode matches.

UniqueTrigramStage

UniqueTrigramStage dataclass

Bases: MatchingStage

Deterministic matching using n-grams that identify one canonical row.

The stage looks for n-grams, usually trigrams, that appear in exactly one canonical address within the same postcode and with the same numeric and unit structure. If all supporting evidence points to one canonical row, the stage emits a match without using a score.

This is a good stage to place before SplinkStage because it removes clear-cut fuzzy cases from the scored stage.
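The unique-n-gram idea can be sketched like this (an illustration using token trigrams, not the library's implementation):

```python
from collections import defaultdict

def trigrams(address):
    tokens = address.split()
    return {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def unique_trigram_match(messy, canonical_in_postcode):
    """Return the canonical id if messy's trigrams single it out, else None."""
    owner = defaultdict(set)  # trigram -> canonical ids containing it
    for cid, addr in canonical_in_postcode.items():
        for tg in trigrams(addr):
            owner[tg].add(cid)
    # Canonical ids pointed at by trigrams unique to exactly one row
    hits = {next(iter(ids)) for tg, ids in owner.items()
            if len(ids) == 1 and tg in trigrams(messy)}
    return hits.pop() if len(hits) == 1 else None

# 'EXAMPLE COURT 10' appears in only one canonical row, so it resolves the match
canonical = {
    "c1": "BASEMENT FLAT 10 DEMO ROAD",
    "c2": "FLAT A EXAMPLE COURT 10 DEMO ROAD",
}
winner = unique_trigram_match("EXAMPLE COURT 10 DEMO ROAD", canonical)
```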

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ngram_size | int | Size of the token n-gram to index. The default of 3 works well for most address data. | 3 |
| min_unique_hits | int | Minimum number of unique n-grams that must support the same canonical address before a match is accepted. | 1 |
| include_conflicts | bool | When True, keep an intermediate conflicts table for debugging ambiguous n-gram matches. | False |
| include_trigram_text | bool | When True, retain supporting trigram text in the intermediate output for inspection. | True |

SplinkStage

SplinkStage dataclass

Bases: MatchingStage

Probabilistic matching stage built on Splink.

This stage is usually placed last because it is the only stage that emits a score and therefore requires threshold tuning. Earlier deterministic stages should remove the obvious high-precision matches first, leaving Splink to handle the harder residual cases.

The stage returns the standard match columns plus two key diagnostics:

  • match_weight: the strength of evidence for the selected canonical candidate. Higher is better.
  • distinguishability: the gap in match_weight between the best candidate and the next best candidate for the same messy record. Higher means the winner is clearer. NULL usually means there was only one candidate left after blocking.

Setting final_match_weight_threshold=-20 and final_distinguishability_threshold=0.0 is a permissive configuration that keeps almost all top-ranked Splink candidates. Raising either threshold filters out more weak or ambiguous matches, typically improving precision at the cost of recall.
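How the two final thresholds interact can be sketched with a hypothetical helper (not part of the library's API):

```python
# A candidate survives only if its match_weight clears the weight threshold
# AND its distinguishability (gap to the runner-up) clears the
# distinguishability threshold.
def accept(match_weight, distinguishability,
           weight_threshold=2.0, dist_threshold=1.0):
    if match_weight < weight_threshold:
        return False  # evidence for the winner is too weak
    if dist_threshold is None or distinguishability is None:
        # Filter disabled, or only one candidate survived blocking
        return True
    return distinguishability >= dist_threshold
```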

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predict_threshold_match_weight | float | Initial minimum score passed to linker.inference.predict(). Lower values retain more candidate pairs for later refinement. | -50 |
| improve_threshold_match_weight | float | Minimum score considered when applying the token-based score adjustment step. | -20 |
| improve_top_n_matches | int | Number of top candidate pairs per messy address to retain for the token-based score adjustment step. | 5 |
| improve_use_bigrams | bool | Whether the token-based improvement step should use bigrams as well as single tokens. | True |
| final_match_weight_threshold | float | Minimum match_weight required for a Splink match to be emitted in the final results. | -20.0 |
| final_distinguishability_threshold | Optional[float] | Minimum distinguishability required for a Splink match to be emitted. Set to None to disable this filter. | 0.0 |
| include_full_postcode_block | bool | Whether to include a strict full-postcode blocking rule when generating Splink candidate pairs. | False |
| include_outside_postcode_block | bool | Whether to include broader blocking rules that can generate candidate pairs across postcode boundaries. | True |
| additional_columns_to_retain | Optional[list[str]] | Extra columns to keep in the Splink predictions and downstream inspection output. | None |
| settings | Optional[SettingsCreator] | Optional custom Splink settings object. Leave as None to use the library defaults. | None |
| retain_intermediate_calculation_columns | bool | Retain Splink comparison columns needed for debugging and waterfall charts. | False |