Run times, performance, and linking large data
This topic guide covers the fundamental drivers of the run time of Splink jobs. It also describes the tools that are built into Splink that help you to understand how long a job is likely to take.
In summary, your choice of blocking rules is by far the most important driver of performance.
Additional factors which affect performance are:
- the complexity of your comparisons, and whether you apply term frequency adjustments
- whether you choose to set `retain_matching_columns` and `retain_intermediate_calculation_columns` to `True` in your settings
- whether you filter out comparisons with a match score below a given threshold (using a `threshold_match_probability` or `threshold_match_weight` when you call `predict()`)
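As a rough illustration of where these settings sit, here is a minimal sketch of a settings dictionary and a thresholded `predict()` call. The input file, column names and comparisons are illustrative assumptions, and the import paths follow Splink 3's DuckDB backend, so they may differ in other versions.

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

# Hypothetical input file - any DataFrame with these columns would do
df = pd.read_csv("input_records.csv")

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name and l.dob = r.dob",
    ],
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("surname"),
        cl.exact_match("dob"),
    ],
    # These two flags enrich the output (useful for waterfall charts and
    # other diagnostics) but enlarge the comparisons table and slow the job down
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)

# (In a real workflow you would estimate model parameters before predicting.)
# Filtering on a threshold reduces the volume of results that are returned
df_predictions = linker.predict(threshold_match_probability=0.9)
```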
Blocking rules
In most large datasets, it is computationally intractable to compare every row with every other row.
The number of comparisons grows with the square of the number of input records, using the formula \(\frac{n(n-1)}{2}\). For instance, a million input records implies around 500bn comparisons.
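To get a feel for this growth, here is a short arithmetic sketch:

```python
def total_comparisons(n: int) -> int:
    # Number of distinct pairs among n records: n(n-1)/2
    return n * (n - 1) // 2

print(f"{total_comparisons(1_000):,}")      # 499,500
print(f"{total_comparisons(1_000_000):,}")  # 499,999,500,000 - roughly 500bn
```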
In Splink, we use a technique called blocking to dramatically reduce the number of comparisons by comparing only records that adhere to certain rules, such as requiring that first name and date of birth are equal. Blocking is described further here.
Even after blocking, the number of comparisons generated is usually much higher than the number of input records - often between 10 and 1,000 times higher. As a result, the performance of Splink is influenced most heavily by the number of comparisons generated by the blocking rules, rather than the number of input records.
This is the case for both main uses of blocking rules in Splink: estimating parameters using expectation maximisation, and generating predictions. (See here for more information on this distinction).
How many comparisons will be generated by a blocking rule?
The `linker.count_num_comparisons_from_blocking_rule()` function, documented here, will compute the number of comparisons that will be generated by a blocking rule.
We recommend using this function before attempting linkage, since some blocking rules may imply trillions of comparisons, resulting in record linkage jobs which run for many hours or never complete.
In general, we recommend a strategy of starting with strict blocking rules, and gradually loosening them. Sticking to less than 10 million comparisons is a good place to start, before scaling jobs up to 100s of millions (DuckDB on a laptop), or sometimes billions (Athena or Spark).
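As a sketch of how this check might look in practice, reusing the `linker` constructed in the earlier sketch (the blocking rule shown is an illustrative assumption):

```python
# `linker` is assumed to be the configured DuckDBLinker from the earlier sketch
strict_rule = "l.first_name = r.first_name and l.surname = r.surname"

# Check the scale of the job before calling predict()
n_comparisons = linker.count_num_comparisons_from_blocking_rule(strict_rule)
print(f"{n_comparisons:,} comparisons would be generated by this rule")
```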
Examples of strict and loose blocking rules
To give an example of how `blocking_rules_to_generate_predictions` may be incrementally loosened, we may start with the following rule: `l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob`.
This is a very strict rule, and will only create comparisons where full name and date of birth match. This has the advantage of creating few record comparisons, but the disadvantage that the rule will miss true matches where there are typos or nulls in any of these three fields.
This blocking rule could be loosened to:
`substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname and l.year_of_birth = r.year_of_birth`
This now allows for typos or aliases in the first name, so long as the first letter is the same, and for errors in the month or day of birth.
Depending on the size of your input data, the rule could be further loosened to `substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname`, or even `l.surname = r.surname`.
The user could use the `linker.count_num_comparisons_from_blocking_rule()` function to select which rule is appropriate for their data, as sketched below.
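Continuing the earlier sketches, one way to do this is to compare the counts implied by each candidate rule before deciding which to use:

```python
candidate_rules = [
    "l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob",
    "substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname and l.year_of_birth = r.year_of_birth",
    "substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname",
    "l.surname = r.surname",
]

# `linker` is assumed to be a configured Splink linker, as in the earlier sketches
for rule in candidate_rules:
    n = linker.count_num_comparisons_from_blocking_rule(rule)
    print(f"{n:>15,} comparisons from: {rule}")
```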