
Choosing blocking rules to optimise runtime


To link records, we need to compare pairs of records and decide which pairs are matches.

For example, consider the following two records:

first_name  surname  dob         city    email
Robert      Allen    1971-05-24  nan     roberta25@smith.net
Rob         Allen    1971-06-24  London  roberta25@smith.net

These can be represented as a pairwise comparison as follows:

first_name_l  first_name_r  surname_l  surname_r  dob_l       dob_r       city_l  city_r  email_l              email_r
Robert        Rob           Allen      Allen      1971-05-24  1971-06-24  nan     London  roberta25@smith.net  roberta25@smith.net

For most large datasets, it is computationally intractable to compare every row with every other row, since the number of comparisons rises quadratically with the number of records.

Instead we rely on blocking rules, which specify which pairwise comparisons to generate. For example, we could generate the subset of pairwise comparisons where either first name or surname matches.

This is part of a two-step process to link data:

  1. Use blocking rules to generate candidate pairwise record comparisons

  2. Use a probabilistic linkage model to score these candidate pairs, to determine which ones should be linked
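
As a rough preview of how these two steps fit together in Splink, the sketch below defines blocking rules alongside a linkage model and scores the candidate pairs. It assumes the SettingsCreator / Linker API with illustrative comparison choices; model training is skipped here and is covered in later tutorials.

from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
import splink.comparison_library as cl

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    # Step 1: blocking rules decide which candidate pairs are generated
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    # Step 2: the probabilistic model scores those candidate pairs
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
    ],
)

linker = Linker(df, settings, db_api=DuckDBAPI())

# predict() scores only the candidate pairs produced by the blocking rules
# (in practice you would estimate the model parameters first)
predictions = linker.inference.predict()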

Blocking rules are the most important determinant of the performance of your linkage job.

When deciding on your blocking rules, you're trading off accuracy for performance:

  • If your rules are too loose (generating too many comparisons), your linkage job may be slow or fail to complete.
  • If they're too tight, you may miss some valid links.

This tutorial clarifies what blocking rules are, and how to choose good rules.

In Splink, blocking rules are specified as SQL expressions.

For example, to generate the subset of record comparisons where the first name and surname match, we can specify the following blocking rule:

from splink import block_on
block_on("first_name", "surname")

When executed, this blocking rule will be converted to a SQL statement with the following form:

SELECT ...
FROM input_tables as l
INNER JOIN input_tables as r
ON l.first_name = r.first_name AND l.surname = r.surname

Since blocking rules are SQL expressions, they can be arbitrarily complex. For example, you could create record comparisons where the initial of the first name and the surname match with the following rule:

from splink import block_on
block_on("substr(first_name, 1, 2)", "surname")

Devising effective blocking rules for prediction

The aims of your blocking rules are twofold:

  1. Eliminate enough non-matching comparison pairs so your record linkage job is small enough to compute
  2. Eliminate as few truly matching pairs as possible (ideally none)

It is usually impossible to find a single blocking rule which achieves both aims, so we recommend using multiple blocking rules.

When we specify multiple blocking rules, Splink will generate all comparison pairs that meet any one of the rules.

For example, consider the following blocking rule:

block_on("first_name", "dob")

This rule is likely to be effective in reducing the number of comparison pairs. It will retain all truly matching pairs, except those with errors or nulls in either the first_name or dob fields.

Now consider a second blocking rule:

block_on("email").

This will retain all truly matching pairs, except those with errors or nulls in the email column.

Individually, these blocking rules are problematic because each excludes true matches that have errors or nulls in the fields it blocks on. But between them, they might do quite a good job.

For a true match to be eliminated by the use of these two blocking rules, it would have to have an error in both email AND (first_name or dob).

This is not completely implausible, but it is significantly less likely than if we'd used a single rule.
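
To put rough numbers on this, here is an illustrative calculation. The 5% per-field error rate and the independence assumption are invented for the example, not measured from any dataset:

# Assumption: each field independently has a 5% chance of an error or null
# that breaks an exact match between two records for the same entity
p_err = 0.05

# block_on("first_name", "dob") misses a true match if first_name OR dob is affected
p_miss_name_dob = 1 - (1 - p_err) ** 2        # ~0.0975

# block_on("email") misses a true match if email is affected
p_miss_email = p_err                          # 0.05

# Using both rules, a true match is lost only if BOTH rules miss it
p_miss_both = p_miss_name_dob * p_miss_email  # ~0.005, i.e. around 0.5%

print(p_miss_name_dob, p_miss_email, p_miss_both)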

More generally, we can often specify multiple blocking rules such that it becomes highly implausible that a true match would fail to meet at least one of them. This is the recommended approach in Splink. Generally we would recommend between about 3 and 10 blocking rules, though even more is possible.

The question then becomes how to choose what to put in this list.

Splink contains a number of tools to help you choose effective blocking rules. Let's try them out, using our small test dataset:

from splink import DuckDBAPI, block_on, splink_datasets

df = splink_datasets.fake_1000
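
The dataset is a small pandas DataFrame, so you can inspect it with the usual pandas methods before experimenting with blocking rules:

# Quick look at the input data; the blocking rules below reference
# first_name, surname, dob, city and email
df.head()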

Counting the number of comparisons created by a single blocking rule

On large datasets, some blocking rules imply the creation of trillions of record comparisons, which would cause a linkage job to fail.

Before using a blocking rule in a linkage job, it's therefore a good idea to count the number of comparisons it generates, to ensure it is not too loose:

from splink.blocking_analysis import count_comparisons_from_blocking_rule

db_api = DuckDBAPI()

br = block_on("substr(first_name, 1,1)", "surname")

counts = count_comparisons_from_blocking_rule(
    table_or_tables=df,
    blocking_rule=br,
    link_type="dedupe_only",
    db_api=db_api,
)

counts
{'number_of_comparisons_generated_pre_filter_conditions': 1632,
 'number_of_comparisons_to_be_scored_post_filter_conditions': 473,
 'filter_conditions_identified': '',
 'equi_join_conditions_identified': 'SUBSTR(l.first_name, 1, 1) = SUBSTR(r.first_name, 1, 1) AND l."surname" = r."surname"',
 'link_type_join_condition': 'where l."unique_id" < r."unique_id"'}
br = "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2"

counts = count_comparisons_from_blocking_rule(
    table_or_tables=df,
    blocking_rule=br,
    link_type="dedupe_only",
    db_api=db_api,
)
counts
{'number_of_comparisons_generated_pre_filter_conditions': 4827,
 'number_of_comparisons_to_be_scored_post_filter_conditions': 372,
 'filter_conditions_identified': 'LEVENSHTEIN(l.surname, r.surname) < 2',
 'equi_join_conditions_identified': 'l.first_name = r.first_name',
 'link_type_join_condition': 'where l."unique_id" < r."unique_id"'}

The maximum number of comparisons that you can compute will be affected by your choice of SQL backend, and how powerful your computer is.

For linkages in DuckDB on a standard laptop, we suggest using blocking rules that create no more than about 20 million comparisons. For Spark and Athena, try starting with fewer than 100 million comparisons, before scaling up.
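
Because count_comparisons_from_blocking_rule returns a plain dictionary, it is easy to check a candidate rule against a comparison budget before using it. A minimal sketch, treating the 20 million figure above as an assumed budget:

# Assumed budget for DuckDB on a standard laptop (see guidance above)
MAX_COMPARISONS = 20_000_000

n_scored = counts["number_of_comparisons_to_be_scored_post_filter_conditions"]

if n_scored > MAX_COMPARISONS:
    print(f"Rule is too loose: {n_scored:,} comparisons would be scored")
else:
    print(f"Rule is within budget: {n_scored:,} comparisons to be scored")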

Finding 'worst offending' values for your blocking rule

Blocking rules can be affected by skew: some values of a field may be much more common than others, and this can lead to a disproportionate number of comparisons being generated.

It can be useful to identify whether your data is afflicted by this problem.

from splink.blocking_analysis import n_largest_blocks

result = n_largest_blocks(
    table_or_tables=df,
    blocking_rule=block_on("city", "first_name"),
    link_type="dedupe_only",
    db_api=db_api,
    n_largest=3,
)

result.as_pandas_dataframe()
   key_0       key_1     count_l  count_r  block_count
0  Birmingham  Theodore        7        7           49
1  London      Oliver          7        7           49
2  London      James           6        6           36

In this case, we can see that Olivers in London will result in 49 comparisons being generated. This is acceptable on this small dataset, but on a larger dataset, Olivers in London could be responsible for many millions of comparisons.
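
Because as_pandas_dataframe() returns an ordinary DataFrame, you can also check skew programmatically. A small sketch that flags any block whose comparison count exceeds an assumed per-block threshold (the threshold is illustrative; tune it to your dataset and backend):

# Illustrative per-block threshold
MAX_BLOCK_COMPARISONS = 1_000_000

top_blocks = n_largest_blocks(
    table_or_tables=df,
    blocking_rule=block_on("city", "first_name"),
    link_type="dedupe_only",
    db_api=db_api,
    n_largest=10,
).as_pandas_dataframe()

oversized = top_blocks[top_blocks["block_count"] > MAX_BLOCK_COMPARISONS]
print(f"{len(oversized)} block(s) exceed the threshold")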

Counting the number of comparisons created by a list of blocking rules

As noted above, it's usually a good idea to use multiple blocking rules. It's therefore useful to know how many record comparisons will be generated when these rules are applied.

Since the same record comparison may be created by several blocking rules, and Splink automatically deduplicates these comparisons, we cannot simply total the number of comparisons generated by each rule individually.

Splink provides a chart that shows the marginal (additional) comparisons generated by each blocking rule, after deduplication:

from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules_for_analysis = [
    block_on("substr(first_name, 1,1)", "surname"),
    block_on("surname"),
    block_on("email"),
    block_on("city", "first_name"),
    "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2",
]


cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=blocking_rules_for_analysis,
    db_api=db_api,
    link_type="dedupe_only",
)
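
If you prefer the underlying counts to the rendered chart, Splink also provides a data counterpart to this function taking the same arguments. This assumes it is available in your installed version; check the Blocking API documentation if the import fails:

from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_data,
)

# Same arguments as the chart function above, but returns the cumulative
# comparison counts as data rather than rendering a chart
cumulative_comparisons_to_be_scored_from_blocking_rules_data(
    table_or_tables=df,
    blocking_rules=blocking_rules_for_analysis,
    db_api=db_api,
    link_type="dedupe_only",
)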

Digging deeper: Understanding why certain blocking rules create large numbers of comparisons

Finally, we can use the profile_columns function we saw in the previous tutorial to understand a specific blocking rule in more depth.

Suppose we're interested in blocking on city and first initial.

Within each distinct value of (city, first initial), all possible pairwise comparisons will be generated.

So, for instance, if there are 15 distinct records with (city, first initial) = (London, J), then these records will result in n(n-1)/2 = 105 pairwise comparisons being generated.

In a larger dataset, we might observe 10,000 (London, J) records, which would then be responsible for 49,995,000 comparisons.
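
The arithmetic behind these figures is just the number of distinct pairs within a block:

def comparisons_within_block(n_records: int) -> int:
    # Number of distinct pairwise comparisons among n records: n(n-1)/2
    return n_records * (n_records - 1) // 2

print(comparisons_within_block(15))      # 105
print(comparisons_within_block(10_000))  # 49995000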

These high-frequency values therefore have a disproportionate influence on the overall number of pairwise comparisons, and so it can be useful to analyse skew, as follows:

from splink.exploratory import profile_columns

profile_columns(df, column_expressions=["city || left(first_name,1)"], db_api=db_api)

Further Reading

For a deeper dive on blocking, please refer to the Blocking Topic Guides.

For more on the blocking tools in Splink, please refer to the Blocking API documentation.

📊 For more on the charts used in this tutorial, please refer to the Charts Gallery.

Next steps

Now that we have chosen which records to compare, we can use those comparisons to train a linkage model.