Run times, performance and linking large data
This topic guide covers the fundamental drivers of the run time of Splink jobs.
Blocking¶
The primary driver of run time is the number of record pairs that the Splink model has to process. In Splink, the number of pairs to consider is reduced using Blocking Rules which are covered in depth in their own set of topic guides.
Complexity of comparisons¶
More complex comparisons reduces performance. Complexity is added to comparisons in a number of ways, including:
- Increasing the number of comparison levels
- Using more computationally expensive comparison functions
- Adding Term Frequency Adjustments
Performant Term Frequency Adjustments
Model training with Term Frequency adjustments can be made more performant by setting estimate_without_term_frequencies
parameter to True
in estimate_parameters_using_expectation_maximisation
.
Retaining columns through the linkage process¶
The size your dataset has an impact on the performance of Splink. This is also applicable to the tables that Splink creates and uses under the hood. Some Splink functionality requires additional calculated columns to be stored. For example:
- The
comparison_viewer_dashboard
requiresretain_matching_columns
andretain_intermediate_calculation_columns
to be set toTrue
in the settings dictionary, but this makes some processes less performant.
Filtering out pairwise in the predict()
step¶
Reducing the number of pairwise comparisons that need to be returned will make Splink perform faster. One way of doing this is to filter comparisons with a match score below a given threshold (using a threshold_match_probability
or threshold_match_weight
) when you call predict()
.
Spark Performance¶
As Spark is designed to distribute processing across multiple machines so there are additional configuration options available to make jobs run more quickly. For more information, check out the Spark Performance Topic Guide.
Balancing computational performance and model accuracy
There is usually a trade off between performance and accuracy in Splink models. I.e. some model design decisions that improve computational performance can also have a negative impact the accuracy of the model.
Be sure to check how the suggestions in this topic guide impact the accuracy of your model to ensure the best results.