
threshold_selection_tool_from_labels_table¶

At a glance

Useful for: Selecting an optimal match weight threshold for generating linked clusters.

API Documentation: threshold_selection_tool_from_labels_table()

What is needed to generate the chart? A trained linker and a corresponding labelled dataset of record pairs.
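The labelled dataset pairs record IDs with a clerical judgement of whether they match. A minimal sketch of its expected shape, assuming the column naming described in Splink's labelling documentation (the values shown are illustrative):

import pandas as pd

# Each row records a clerical decision about one pair of records;
# clerical_match_score is 1.0 for a definite match and 0.0 for a non-match
df_labels = pd.DataFrame([
    {"source_dataset_l": "fake_1000", "unique_id_l": 0,
     "source_dataset_r": "fake_1000", "unique_id_r": 1,
     "clerical_match_score": 1.0},
])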

Worked Example¶

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets, splink_dataset_labels
import logging, sys
logging.disable(sys.maxsize)  # silence Splink's log output for a cleaner example

# Load a 1,000-row synthetic person dataset bundled with Splink
df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
}

linker = DuckDBLinker(df, settings)

# Estimate the u probabilities from a random sample of record pairs
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the m probabilities with two passes of expectation maximisation,
# each blocking on different columns
blocking_rule_for_training = block_on(["first_name", "surname"])
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

blocking_rule_for_training = block_on("dob")
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)


# Register the clerically labelled record pairs with the linker
df_labels = splink_dataset_labels.fake_1000_labels
labels_table = linker.register_labels_table(df_labels)

linker.threshold_selection_tool_from_labels_table(labels_table, add_metrics=['f1'])
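The add_metrics argument controls which metrics are plotted in addition to precision and recall. As a sketch, requesting a couple of extra F-family metrics (the names beyond 'f1' are assumptions - check the API documentation for the supported list):

linker.threshold_selection_tool_from_labels_table(
    labels_table,
    add_metrics=["f1", "f2", "f0_5"],  # assumed metric names; verify in the API docs
)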

What the chart shows¶

For a given match weight threshold, a record pair scoring above the threshold is labelled a match and one scoring below it a non-match. Lowering the threshold to the extreme generates many more matches - this maximises True Positives (high recall) but at the expense of more False Positives (low precision).

You can then see the effect on the confusion matrix of raising the match threshold. As more predicted matches become non-matches at the higher threshold, True Positives become False Negatives, but False Positives become True Negatives.

This demonstrates the trade-off, when selecting a match threshold, between Type I errors (false positives) and Type II errors (false negatives) - or, equivalently, between precision and recall.
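To make the trade-off concrete, here is a small illustrative sketch (toy scores and labels, not Splink output) showing how the confusion matrix shifts as the threshold rises:

# Toy match weights and clerical labels (1 = true match), purely illustrative
scores = [9.5, 7.2, 4.1, 2.8, 0.3, -1.5, -4.0]
labels = [1, 1, 1, 0, 1, 0, 0]

for threshold in [0, 5]:
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
    tn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 0)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}")

# threshold=0: TP=4 FP=1 FN=0 TN=2
# threshold=5: TP=2 FP=0 FN=2 TN=3

Raising the threshold from 0 to 5 converts two True Positives into False Negatives, and the single False Positive into a True Negative.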

This chart adds further context to accuracy_chart_from_labels_table(), showing:

  • the relationship between match weight and match probability (see the sketch after this list)
  • various accuracy metrics comparing the Splink scores against clerical labels
  • the confusion matrix of the predictions and the labels
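
On the first of these: Splink's match weight is the log2 of the Bayes factor for a pair, so a weight of w corresponds to a match probability of 2**w / (1 + 2**w). A minimal sketch of the conversion:

def match_weight_to_probability(match_weight):
    # match weight is log2 of the Bayes factor, so invert that
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

match_weight_to_probability(0)   # 0.5  - the evidence is evenly balanced
match_weight_to_probability(5)   # ~0.97
match_weight_to_probability(-5)  # ~0.03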

How to interpret the chart¶

Precision can be maximised by increasing the match threshold (reducing false positives).

Recall can be maximised by decreasing the match threshold (reducing false negatives).

Additional metrics, such as the F1 score, can be used to find the best compromise between the two, looking for the threshold at which the chosen metric peaks.
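For example, the F1 score (the harmonic mean of precision and recall) peaks at a single threshold. A sketch of locating that peak, using made-up precision/recall values rather than real chart data:

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Illustrative (threshold, precision, recall) triples
results = [(-5, 0.70, 0.99), (0, 0.85, 0.95), (5, 0.95, 0.88), (10, 0.99, 0.70)]

best = max(results, key=lambda r: f1(r[1], r[2]))
print(f"best threshold by F1: {best[0]}")  # best threshold by F1: 5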

Actions to take as a result of the chart¶

Having identified an optimal match weight threshold, it can be applied when generating linked clusters using cluster_pairwise_predictions_at_threshold().
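For example, a chosen match weight threshold of 5 corresponds to a match probability of 2**5 / (1 + 2**5) ≈ 0.97. A sketch of applying it, continuing from the linker trained above:

# Score all record pairs generated by the blocking rules
df_predict = linker.predict()

# Cluster using the probability equivalent of a match weight of 5
clusters = linker.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.97
)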