Skip to content

accuracy_chart_from_labels_table

At a glance

Useful for: Selecting an optimal match weight threshold for generating linked clusters.

API Documentation: accuracy_chart_from_labels_table()

What is needed to generate the chart? A linker with some data and a corresponding labelled dataset

Worked Example

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets, splink_dataset_labels
import logging, sys
logging.disable(sys.maxsize)

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
}

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

blocking_rule_for_training = block_on(["first_name", "surname"])

linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

blocking_rule_for_training = block_on("dob")
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)


df_labels = splink_dataset_labels.fake_1000_labels
labels_table = linker.register_labels_table(df_labels)

linker.accuracy_chart_from_labels_table(labels_table, add_metrics=['f1'])

What the chart shows

For a given match weight threshold, a record pair with a score above this threshold will be labelled a match and below the threshold will be labelled a non-match. For all possible match weight thresholds, this chart shows various accuracy metrics comparing the Splink scores against clerical labels.

Precision and recall are shown by default, but various additional metrics can be added: specificity, negative predictive value (NPV), accuracy, \(F_1\), \(F_2\), \(F_{0.5}\), \(P_4\) and \(\phi\) (Matthews correlation coefficient).

How to interpret the chart

Precision can be maximised by increasing the match threshold (reducing false positives).

Recall can be maximised by decreasing the match threshold (reducing false negatives).

Additional metrics can be used to find the optimal compromise between these two, looking for the threshold at which peak accuracy is achieved.

Confusion matrix

See threshold_selection_tool_from_labels_table for a more complete visualisation of the impact of match threshold on false positives and false negatives, with reference to the confusion matrix.

Actions to take as a result of the chart

Having identified an optimal match weight threshold, this can be applied when generating linked clusters using cluster_pairwise_predictions_at_thresholds().