Blocking for Model Training
Model Training Blocking Rules determine which record pairs from a dataset are considered when training a Splink model. They are used during Expectation Maximisation (EM), which in most cases is used to estimate the m probabilities.
The aim of Model Training Blocking Rules is to reduce the number of record pairs considered when training a Splink model, and so reduce the computational resources required. Each Training Blocking Rule defines a training "block" of record pairs containing a combination of matches and non-matches, which is then considered by Splink's Expectation Maximisation algorithm.
The Expectation Maximisation algorithm seems to work best when somewhere between around 0.1% and 99.9% of the pairwise record comparisons are true matches. It works less efficiently if there is a huge imbalance between the two (e.g. a billion non-matches and only a hundred matches).
Note
Unlike Prediction Rules, it does not matter if Training Rules exclude some true matches. They just need to generate examples of both matches and non-matches.
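To make the idea of a training "block" concrete, the following is a minimal sketch in plain Python (it does not use Splink). The toy records, the `entity_id` ground-truth column, and the helper names `block_pairs` and `match_share` are all assumptions made for illustration; real linkage data would not normally carry a ground-truth label.

```python
from itertools import combinations

# Toy records. `entity_id` is a hypothetical ground-truth label used only
# to illustrate the idea; it is not something Splink requires.
records = [
    {"id": 1, "first_name": "john", "surname": "smith", "entity_id": "A"},
    {"id": 2, "first_name": "john", "surname": "smith", "entity_id": "A"},
    {"id": 3, "first_name": "john", "surname": "jones", "entity_id": "B"},
    {"id": 4, "first_name": "mary", "surname": "jones", "entity_id": "C"},
    {"id": 5, "first_name": "mary", "surname": "jonas", "entity_id": "C"},
    {"id": 6, "first_name": "john", "surname": "smyth", "entity_id": "A"},
]


def block_pairs(rows, columns):
    """Return all record pairs that agree exactly on every column in `columns`."""
    return [
        (a, b)
        for a, b in combinations(rows, 2)
        if all(a[c] == b[c] for c in columns)
    ]


def match_share(pairs):
    """Proportion of pairs whose two records refer to the same entity."""
    matches = sum(a["entity_id"] == b["entity_id"] for a, b in pairs)
    return matches / len(pairs)


all_pairs = block_pairs(records, [])  # no blocking: every possible pair
first_name_block = block_pairs(records, ["first_name"])

print(len(all_pairs), match_share(all_pairs))
print(len(first_name_block), match_share(first_name_block))
```

In this toy data, blocking on `first_name` cuts the number of pairs roughly in half while the block still contains both matches and non-matches, which is the property the EM algorithm needs.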
Using Training Rules in Splink
Blocking Rules for Model Training are passed as a parameter to the estimate_parameters_using_expectation_maximisation function. After a linker object has been instantiated, you can estimate the m probabilities with training sessions such as:
from splink.duckdb.blocking_rule_library import block_on
blocking_rule_for_training = block_on("first_name")
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)
Here, we have defined a "block" of record pairs where first_name is the same. As names are not unique, we can be fairly confident that this "block" contains a combination of matches and non-matches, which is what the EM algorithm requires.
Matching only on first_name is likely to generate a large "block" of pairwise comparisons, which will take longer to run. In this case, it may be worthwhile applying a stricter blocking rule to reduce runtime. For example, a match on first_name and surname:
from splink.duckdb.blocking_rule_library import block_on
blocking_rule_for_training = block_on(["first_name", "surname"])
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)
which will still have a combination of matches and non-matches, but fewer record pairs to consider.
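The effect of a stricter rule on the number of pairs can be sketched without Splink. The example below is an illustration under stated assumptions: the toy `rows` and the helper `count_pairs` are hypothetical, and the count relies on the fact that a group of n records sharing a blocking key yields n(n-1)/2 pairs.

```python
from collections import Counter
from math import comb

# Hypothetical (first_name, surname) rows, for illustration only.
rows = [
    ("john", "smith"), ("john", "smith"), ("john", "smith"),
    ("john", "jones"), ("john", "jones"),
    ("mary", "jones"), ("mary", "smith"), ("mary", "smith"),
]


def count_pairs(rows, key):
    """Count pairs sharing the same blocking key, without enumerating them."""
    group_sizes = Counter(key(r) for r in rows)
    # A group of n records contributes n choose 2 pairs.
    return sum(comb(n, 2) for n in group_sizes.values())


by_first_name = count_pairs(rows, key=lambda r: r[0])
by_full_name = count_pairs(rows, key=lambda r: (r[0], r[1]))

print(by_first_name, by_full_name)
```

Because the pair count grows quadratically with group size, splitting large groups with an extra column tends to reduce the total sharply.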
Choosing Training Rules
The idea behind Training Rules is to consider "blocks" of record pairs with a mixture of matches and non-matches. In practice, most blocking rules produce a mixture of matches and non-matches, so the primary consideration should be to reduce the runtime of model training by choosing Training Rules that reduce the number of record pairs in the training set.
There are some tools within Splink to help choose these rules. For example, the count_num_comparisons_from_blocking_rule function gives the number of record pairs generated by a blocking rule:
from splink.duckdb.blocking_rule_library import block_on
blocking_rule = block_on(["first_name", "surname"])
linker.count_num_comparisons_from_blocking_rule(blocking_rule)
1056
It is recommended that you run this function to check how many comparisons are generated before training a model so that you do not needlessly run a training session on billions of comparisons.
Note
Unlike Prediction Rules, Training Rules are treated separately for each EM training session, so the total number of comparisons for Model Training is simply the sum of count_num_comparisons_from_blocking_rule across all Blocking Rules (as opposed to the result of cumulative_comparisons_from_blocking_rules_records).
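The difference between the two totals can be illustrated with a small plain-Python sketch. It does not call Splink; the toy `records` and the helper `pairs_for` are hypothetical. The sketch assumes that, for prediction, a pair produced by more than one rule is only counted once, whereas each training session counts its own pairs independently.

```python
from itertools import combinations

# Hypothetical toy records, for illustration only.
records = [
    {"id": 1, "first_name": "john", "surname": "smith"},
    {"id": 2, "first_name": "john", "surname": "smith"},
    {"id": 3, "first_name": "john", "surname": "jones"},
    {"id": 4, "first_name": "mary", "surname": "jones"},
]


def pairs_for(rows, column):
    """Set of (id, id) pairs whose records agree exactly on `column`."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(rows, 2)
        if a[column] == b[column]
    }


first_name_pairs = pairs_for(records, "first_name")
surname_pairs = pairs_for(records, "surname")

# Training: each rule is its own EM session, so the counts simply add up.
training_total = len(first_name_pairs) + len(surname_pairs)

# Prediction-style cumulative count: a pair shared by both rules counts once.
cumulative_total = len(first_name_pairs | surname_pairs)

print(training_total, cumulative_total)
```

Here the pair (1, 2) satisfies both rules, so it appears in both training sessions but only once in the cumulative count.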