Blocking Rules for Splink Predictions

Prediction Blocking Rules choose which record pairs from a dataset get considered and scored by the Splink model.

The aims of Prediction Blocking Rules are to:

  • Capture as many true matches as possible
  • Reduce the total number of comparisons being generated
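
To make the second aim concrete, here is a minimal sketch in plain Python (not Splink code; the toy records and the helper names count_all_pairs and count_blocked_pairs are hypothetical). It compares the number of pairs in an unblocked dedupe job, n(n-1)/2, with the number of pairs that agree exactly on a single blocking column.

```python
from collections import Counter
from math import comb

# Hypothetical toy records; illustration only, not Splink's implementation.
records = [
    {"id": 1, "surname": "smith"},
    {"id": 2, "surname": "smith"},
    {"id": 3, "surname": "jones"},
    {"id": 4, "surname": "jones"},
    {"id": 5, "surname": "jones"},
    {"id": 6, "surname": "patel"},
]


def count_all_pairs(n):
    """Pairs compared when deduplicating n records with no blocking."""
    return comb(n, 2)


def count_blocked_pairs(rows, key):
    """Pairs whose values agree exactly on `key`: sum of C(group size, 2)."""
    group_sizes = Counter(row[key] for row in rows)
    return sum(comb(size, 2) for size in group_sizes.values())


all_pairs = count_all_pairs(len(records))
blocked_pairs = count_blocked_pairs(records, "surname")
print(f"unblocked: {all_pairs} pairs, blocked on surname: {blocked_pairs} pairs")
```

The same arithmetic shows why blocking matters at scale: count_all_pairs(1_000_000) is roughly 5 x 10^11 comparisons.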

Blocking Rules for Prediction are defined through blocking_rules_to_generate_predictions in the Settings dictionary of a model. For example:

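# Assumes Splink 3 with the DuckDB backend
import splink.duckdb.blocking_rule_library as brl
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        brl.block_on(["first_name", "surname"]),
        brl.block_on("dob"),
    ],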
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email"),
    ],
}

On its own, the first rule, brl.block_on(["first_name", "surname"]), generates comparisons for all true matches where both names match exactly. But it would miss a true match where there was a typo in (say) the first name.

In general, it is usually impossible to find a single rule which both:

  • Reduces the number of comparisons generated to a computationally tractable number

  • Ensures comparisons are generated for all true matches

This is why blocking_rules_to_generate_predictions is a list. Suppose we also block on postcode:

settings_example = {
    "blocking_rules_to_generate_predictions": [
        brl.block_on(["first_name", "surname"]),
        brl.block_on("postcode")
    ]
}

This generates all pairwise comparisons that satisfy at least one of the rules.

We will now generate a pairwise comparison for the record where there was a typo in the first name, so long as there isn't also a difference in the postcode.

By specifying a variety of blocking_rules_to_generate_predictions, it becomes unlikely that a truly matching record would not be captured by at least one of the rules.

Note

Unlike Training Rules, Prediction Rules are considered collectively and are order-dependent. So, in the example above, the l.postcode = r.postcode blocking rule only generates record comparisons that match on postcode and were not already captured by the first_name and surname rule.
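
The order dependence described in the note can be sketched in plain Python. This is an illustration under stated assumptions, not Splink's implementation: the toy records, the rule lambdas and the helper pairs_by_rule are all hypothetical. Each pair is assigned to the first rule in the list that it satisfies, so later rules only contribute pairs that earlier rules missed.

```python
from itertools import combinations

# Hypothetical toy records; record 2 has a typo in first_name.
records = [
    {"id": 1, "first_name": "john", "surname": "smith", "postcode": "AB1"},
    {"id": 2, "first_name": "jon", "surname": "smith", "postcode": "AB1"},
    {"id": 3, "first_name": "john", "surname": "smith", "postcode": "AB1"},
    {"id": 4, "first_name": "mary", "surname": "jones", "postcode": "CD2"},
    {"id": 5, "first_name": "mary", "surname": "jones", "postcode": "EF3"},
]

# Hypothetical stand-ins for blocking rules: (label, predicate on a pair).
rules = [
    (
        "first_name and surname",
        lambda l, r: l["first_name"] == r["first_name"]
        and l["surname"] == r["surname"],
    ),
    ("postcode", lambda l, r: l["postcode"] == r["postcode"]),
]


def pairs_by_rule(rows, rule_list):
    """Assign each pair to the first rule it satisfies, in list order."""
    captured = set()
    result = {}
    for label, rule in rule_list:
        new_pairs = set()
        for l, r in combinations(rows, 2):
            pair = (l["id"], r["id"])
            if pair not in captured and rule(l, r):
                new_pairs.add(pair)
        captured |= new_pairs
        result[label] = new_pairs
    return result


result = pairs_by_rule(records, rules)
reversed_result = pairs_by_rule(records, list(reversed(rules)))

for label, pairs in result.items():
    print(label, sorted(pairs))
```

In this toy data the pair (1, 3) satisfies both rules but is counted only under the names rule, while the typo pairs (1, 2) and (2, 3) are picked up by the postcode rule. Reversing the rule order changes which rule each pair is attributed to, but not the overall set of pairs generated.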

Choosing Prediction Rules

When defining blocking rules it is important to consider the number of pairwise comparisons generated by your blocking rules. Splink provides a number of functions that can help with this.

Once a linker has been instantiated, we can use the cumulative_num_comparisons_from_blocking_rules_chart function to look at the cumulative number of comparisons generated by blocking_rules_to_generate_predictions. For example, a settings dictionary like this:

settings = {
    "blocking_rules_to_generate_predictions": [
        brl.block_on("first_name"),
        brl.block_on("surname")
    ],
}

will generate something like:

from splink.duckdb.linker import DuckDBLinker

linker = DuckDBLinker(df, settings)
linker.cumulative_num_comparisons_from_blocking_rules_chart()

In the resulting chart, as in the note above, the l.surname = r.surname bar (in light blue) is a count of all record comparisons that match on surname and have not already been captured by the first_name rule.

You can also return the underlying data for this chart using the cumulative_comparisons_from_blocking_rules_records function:

linker.cumulative_comparisons_from_blocking_rules_records()

[{'row_count': 2253, 'rule': 'l.first_name = r.first_name'}, {'row_count': 2568, 'rule': 'l.surname = r.surname'}]
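
As a small sketch of how these records could be turned into the running totals that the chart displays, the snippet below uses only the standard library. It assumes, in line with the order-dependence note above, that each row_count is the number of new comparisons contributed by that rule rather than an already-cumulative figure; check this against your Splink version before relying on it.

```python
from itertools import accumulate

# Records copied from the example output above.
records = [
    {"row_count": 2253, "rule": "l.first_name = r.first_name"},
    {"row_count": 2568, "rule": "l.surname = r.surname"},
]

# Assumption: row_count is the marginal count for each rule, in rule order.
running_totals = list(accumulate(rec["row_count"] for rec in records))

for rec, total in zip(records, running_totals):
    print(f"{rec['rule']}: +{rec['row_count']} (cumulative {total})")
```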