Optimising accuracy¶
uk_address_matcher has a variety of settings that can be tuned to optimise accuracy depending on your use case.
In 'biggest wins' we describe the most important settings that are likely to result in unambiguously better accuracy.
In 'optimising match stages' we describe settings which are harder to make recommendations about because the best settings depend on the input data.
Biggest wins¶
If matching to Ordnance Survey data, use ukam-os-builder to prepare it for matching¶
In its typical form, Ordnance Survey address data contains one address UPRN per row.
However, if processed in a specific way, Ordnance Survey contains variants on an address that provides a greater 'target area' to match to. Our ukam-os-builder tool does this processing for you to increase the chance of a match. Use of this tool is documented in this guide.
To illustrate why this is important, a single address could have two variants:
- Basement Flat, 10 Demo Road, Townton
- Flat A, Example Court, 10 Demo Road, Townton
By providing uk_address_matcher with a canonical address dataset built by ukam-os-builder, you will match against all these variants, giving more options for a high scoring match.
Filter your input data to the smallest possible dataset¶
The smaller the number of addresses to match to, the more accurate your results are likely to be, because there is less chance of multiple similar candidates (such as two different '1 High Street' addresses in two different geographical locations).
Matching will also be faster, because there are fewer candidates to compare against.
There are two primary ways to filter your input data:
- Geographically. If your input data comes from a known geographical area such as a local authority, use an extract of Ordnance Survey data for that area only.
- By classification or other metadata. Ordnance Survey data contains rich metadata about each address, such as its classification. For example, if you know your messy data is residential only, filter out non-residential addresses from the canonical dataset.
How to filter¶
The mechanism for filtering depends on whether you are pre-processing your canonical dataset or processing it on the fly.
If you are processing data on-the-fly, then you can simply filter your data before passing it to AddressMatcher:
Filtering for on the fly processing¶
import duckdb

from uk_address_matcher import AddressMatcher

con = duckdb.connect()

df_canonical = con.read_csv("path/to/canonical.csv")

# Filter to residential addresses only
df_canonical = df_canonical.filter("substr(classificationcode, 1, 1) = 'R'")

df_messy = con.read_csv("path/to/messy.csv")

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()
Filtering a pre-prepared dataset¶
If you are pre-processing your canonical dataset, consider whether users of the pre-processed file will always want a filter applied. If so, apply the filter before passing the data to prepare_canonical_folder.
import duckdb
import tempfile

from uk_address_matcher import prepare_canonical_folder

con = duckdb.connect()

df_canonical = con.read_csv("path/to/canonical.csv")
df_canonical = df_canonical.filter("substr(classificationcode, 1, 1) = 'R'")

output_folder = tempfile.mkdtemp()

prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
)
However, if different users will need different filters, you can also apply a filter after pre-processing the whole dataset. This will result in a small degradation in accuracy because indices and term frequencies will be computed globally, making them less discriminative.
# Prepare the whole, unfiltered dataset once...
output_folder = "path_to_prepared_canonical_folder"

df_canonical = con.read_csv("path/to/canonical.csv")

prepare_canonical_folder(
    df_canonical,
    output_folder=output_folder,
    con=con,
)

# ...then apply a per-user filter at match time
matcher = AddressMatcher(
    canonical_addresses=output_folder,
    addresses_to_match=df_messy,
    canonical_address_filter="substr(classificationcode, 1, 1) = 'R'",
    con=con,
)

result = matcher.match()
Optimising match stages¶
uk_address_matcher uses an 'ensemble' methodology, whereby it sequentially applies a number of matching strategies called 'stages'.
Stages run in order. Once a record is matched by one stage, later stages do not revisit it. The results contain a column indicating which stage produced the match.
Stages are therefore intended to be ordered from highest precision to lowest precision. For example, the first stage is usually the ExactMatchStage, since it makes sense to find all exact matches (full address string and postcode) before applying any more sophisticated fuzzy matching strategies.
Matching stages¶
The available stages are as follows:
| Stage | Type | What it is good at | Accuracy implication |
|---|---|---|---|
| `ExactMatchStage` | Deterministic | Cleaned address text is already the same on both sides | Very high precision, should usually run first |
| `PeeledAddressStage` | Deterministic | One side has extra trailing locality words such as LONDON or HACKNEY | High precision, useful before probabilistic matching |
| `UniqueTrigramStage` | Deterministic | A distinctive phrase identifies one canonical address within the postcode | High precision, removes clear fuzzy cases before Splink |
| `SplinkStage` | Scored | Typos, abbreviations, partial matches, and other fuzzy cases | Precision and recall depend on threshold choice |
Summary recommendation¶
You almost always want to use the ExactMatchStage. The PeeledAddressStage and UniqueTrigramStage offer high, but not perfect, precision (i.e. there is a chance of a small number of false positives).
You almost always want to run the SplinkStage last, to attempt to find any matches missed by the previous stages. In some cases it may produce higher accuracy than the PeeledAddressStage and UniqueTrigramStage, which is why those two stages are not always worth including.
from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    PeeledAddressStage,
    UniqueTrigramStage,
    SplinkStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        PeeledAddressStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=2.0,
            final_distinguishability_threshold=1.0,
        ),
    ],
)
Tuning the Splink stage¶
For guidance on choosing Splink thresholds, see here.
Stage API docs¶
ExactMatchStage¶
ExactMatchStage (dataclass)¶
Bases: MatchingStage
Deterministic exact matching on clean_full_address and postcode.
This is usually the first stage in a pipeline. It accepts the easy, unambiguous cases before any probabilistic matching is attempted.
A match is emitted when the cleaned messy address and the cleaned canonical address are identical and the postcode is also identical. A cleaned address match on its own is not enough: differing postcodes will not match.
Example
"10 Demo Road Townton" matches
"10 Demo Road, Townton" only when cleaning normalises punctuation
and whitespace and both records have the same postcode.
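The match condition above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's implementation — the `clean` function here is a crude stand-in for uk_address_matcher's internal address cleaning:

```python
import re


def clean(address: str) -> str:
    """Crude normalisation: lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", address.lower())
    return " ".join(text.split())


def is_exact_match(messy_addr, messy_pc, canon_addr, canon_pc) -> bool:
    # The cleaned address text must be identical on both sides...
    # ...AND the postcodes must match: a clean-address match alone is not enough.
    return clean(messy_addr) == clean(canon_addr) and messy_pc == canon_pc


print(is_exact_match("10 Demo Road, Townton", "AB1 2CD",
                     "10 Demo Road Townton", "AB1 2CD"))   # True
print(is_exact_match("10 Demo Road, Townton", "AB1 2CD",
                     "10 Demo Road Townton", "ZZ9 9ZZ"))   # False: postcodes differ
```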
PeeledAddressStage¶
PeeledAddressStage (dataclass)¶
Bases: MatchingStage
Deterministic matching after peeling common UK locality suffixes.
This stage removes trailing locality words such as borough, county, or city
names, then performs an exact match on the peeled address plus postcode.
It is useful when one side includes extra suffixes such as "Hackney
London" and the other does not, but it still requires the postcodes to
be identical.
Use this before SplinkStage so these high-precision cases are resolved
without needing probabilistic thresholds.
Example
"100 Test Street Hackney London" can match
"100 Test Street" when both share the same postcode. Peeling only
relaxes the address text comparison; it does not allow cross-postcode
matches.
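Peeling can be sketched as stripping known locality words from the end of the token list. This is a conceptual sketch only — the suffix list here is invented for illustration, and the library's actual list of locality words is internal to PeeledAddressStage:

```python
# Illustrative suffix set only; the real stage uses its own list of UK localities
LOCALITY_SUFFIXES = {"london", "hackney"}


def peel(address: str) -> str:
    """Remove trailing locality words, outermost first."""
    tokens = address.lower().split()
    while tokens and tokens[-1] in LOCALITY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# "Hackney" then "London" are peeled; the postcode comparison is unaffected
print(peel("100 Test Street Hackney London"))  # "100 test street"
print(peel("100 Test Street"))                 # unchanged: "100 test street"
```

Note that only the address text comparison is relaxed: the peeled addresses must still share an identical postcode to match.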
UniqueTrigramStage¶
UniqueTrigramStage (dataclass)¶
Bases: MatchingStage
Deterministic matching using n-grams that identify one canonical row.
The stage looks for n-grams, usually trigrams, that appear in exactly one canonical address within the same postcode and with the same numeric and unit structure. If all supporting evidence points to one canonical row, the stage emits a match without using a score.
This is a good stage to place before SplinkStage because it removes
clear-cut fuzzy cases from the scored stage.
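The core idea — a trigram that occurs in exactly one canonical address acts as a fingerprint — can be sketched as follows. This is a simplified illustration, not the library's implementation (the real stage also checks postcode and numeric/unit structure):

```python
from collections import defaultdict


def trigrams(address: str) -> set[str]:
    """All consecutive 3-token phrases in an address."""
    tokens = address.lower().split()
    return {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


# Hypothetical canonical addresses, all within one postcode
canonical = {
    1: "basement flat 10 demo road townton",
    2: "flat a example court 10 demo road townton",
}

# Index: trigram -> set of canonical ids containing it
index = defaultdict(set)
for uprn, addr in canonical.items():
    for tg in trigrams(addr):
        index[tg].add(uprn)


def unique_trigram_match(messy: str):
    """Return a canonical id only if all discriminating trigrams agree on it."""
    hits = {next(iter(ids)) for tg in trigrams(messy)
            if len(ids := index.get(tg, set())) == 1}
    return hits.pop() if len(hits) == 1 else None


# "basement flat 10" appears in only one canonical row, so it pins the match
print(unique_trigram_match("basement flat 10 demo rd townton"))  # 1
# Shared trigrams like "10 demo road" are ambiguous, so no match is emitted
print(unique_trigram_match("10 demo road townton"))              # None
```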
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ngram_size` | `int` | Size of the token n-gram to index. | `3` |
| `min_unique_hits` | `int` | Minimum number of unique n-grams that must support the same canonical address before a match is accepted. | `1` |
| `include_conflicts` | `bool` | When … | `False` |
| `include_trigram_text` | `bool` | When … | `True` |
SplinkStage¶
SplinkStage (dataclass)¶
Bases: MatchingStage
Probabilistic matching stage built on Splink.
This stage is usually placed last because it is the only stage that emits a score and therefore requires threshold tuning. Earlier deterministic stages should remove the obvious high-precision matches first, leaving Splink to handle the harder residual cases.
The stage returns the standard match columns plus two key diagnostics:
- match_weight: the strength of evidence for the selected canonical candidate. Higher is better.
- distinguishability: the gap in match_weight between the best candidate and the next best candidate for the same messy record. Higher means the winner is clearer. NULL usually means there was only one candidate left after blocking.
Setting final_match_weight_threshold=-20 and
final_distinguishability_threshold=0.0 is a permissive configuration
that keeps almost all top-ranked Splink candidates. Raising either
threshold filters out more weak or ambiguous matches, typically improving
precision at the cost of recall.
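The interaction between the two final thresholds can be sketched with plain Python. This mirrors the filtering described above but is not the library's code; in particular, treating a NULL (`None`) distinguishability as passing is an assumption here:

```python
# Hypothetical top-ranked Splink candidates for three messy records
candidates = [
    {"match_weight": 12.4, "distinguishability": 8.1},   # strong, clear winner
    {"match_weight": 3.0,  "distinguishability": 0.2},   # above weight but ambiguous
    {"match_weight": -5.0, "distinguishability": None},  # weak, sole candidate
]


def passes(row, weight_threshold=2.0, disting_threshold=1.0):
    """Apply final_match_weight_threshold then final_distinguishability_threshold."""
    if row["match_weight"] < weight_threshold:
        return False
    # Assumption: NULL distinguishability (single surviving candidate) passes
    if row["distinguishability"] is None:
        return True
    return row["distinguishability"] >= disting_threshold


print([passes(r) for r in candidates])  # [True, False, False]
```

Raising either threshold rejects more rows, trading recall for precision; the permissive `-20` / `0.0` configuration would keep all three example rows except on distinguishability grounds.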
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `predict_threshold_match_weight` | `float` | Initial minimum score passed to … | `-50` |
| `improve_threshold_match_weight` | `float` | Minimum score considered when applying the token-based score adjustment step. | `-20` |
| `improve_top_n_matches` | `int` | Number of top candidate pairs per messy address to retain for the token-based score adjustment step. | `5` |
| `improve_use_bigrams` | `bool` | Whether the token-based improvement step should use bigrams as well as single tokens. | `True` |
| `final_match_weight_threshold` | `float` | Minimum match_weight required for a match to be emitted. | `-20.0` |
| `final_distinguishability_threshold` | `Optional[float]` | Minimum distinguishability required for a Splink match to be emitted. Set to … | `0.0` |
| `include_full_postcode_block` | `bool` | Whether to include a strict full-postcode blocking rule when generating Splink candidate pairs. | `False` |
| `include_outside_postcode_block` | `bool` | Whether to include broader blocking rules that can generate candidate pairs across postcode boundaries. | `True` |
| `additional_columns_to_retain` | `Optional[list[str]]` | Extra columns to keep in the Splink predictions and downstream inspection output. | `None` |
| `settings` | `Optional[SettingsCreator]` | Optional custom Splink settings object. Leave as … | `None` |
| `retain_intermediate_calculation_columns` | `bool` | Retain Splink comparison columns needed for debugging and waterfall charts. | `False` |