accuracy_analysis_from_labels_table

At a glance

Useful for: Selecting an optimal match weight threshold for generating linked clusters.

API Documentation: accuracy_analysis_from_labels_table()

What is needed to generate the chart? A linker with some data and a corresponding labelled dataset.
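The labelled dataset pairs record IDs with a clerical judgement. As a minimal sketch of the expected shape (the IDs and scores below are placeholders, not real data; depending on your inputs, source_dataset_l and source_dataset_r columns may also be required):

import pandas as pd

# Hypothetical labels: one row per labelled record pair, with
# clerical_match_score = 1.0 for a match and 0.0 for a non-match
df_labels = pd.DataFrame(
    {
        "unique_id_l": [0, 0],
        "unique_id_r": [1, 2],
        "clerical_match_score": [1.0, 0.0],
    }
)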

What the chart shows

For a given match weight threshold, a record pair scoring above the threshold is predicted to be a match, and a pair scoring below it a non-match. For every possible match weight threshold, this chart shows various accuracy metrics comparing the resulting Splink predictions against the clerical labels.

Precision and recall are shown by default, but various additional metrics can be added: specificity, negative predictive value (NPV), accuracy, \(F_1\), \(F_2\), \(F_{0.5}\), \(P_4\) and \(\phi\) (Matthews correlation coefficient).
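For reference, these metrics are standard functions of the confusion matrix counts: true positives (\(TP\)), false positives (\(FP\)), true negatives (\(TN\)) and false negatives (\(FN\)). For example:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}},
\]

so \(F_2\) weights recall more heavily than precision, while \(F_{0.5}\) favours precision.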

How to interpret the chart

Precision can be maximised by increasing the match threshold (reducing false positives).

Recall can be maximised by decreasing the match threshold (reducing false negatives).

Additional metrics can be used to find the optimal compromise between the two, looking for the threshold at which the chosen metric peaks.
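For example, having generated the metrics as a table (see the worked example below), the threshold at which a chosen metric peaks can be read off directly. A minimal sketch, assuming a linker and labels_table set up as in the worked example:

# Find the match weight threshold at which F1 is maximised
df_metrics = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="table", add_metrics=["f1"]
).as_pandas_dataframe()

best = df_metrics.loc[df_metrics["f1"].idxmax()]
print(best["truth_threshold"], best["f1"])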

Confusion matrix

See threshold_selection_tool_from_labels_table for a more complete visualisation of the impact of match threshold on false positives and false negatives, with reference to the confusion matrix.

Actions to take as a result of the chart

Having identified an optimal match weight threshold, you can apply it when generating linked clusters using cluster_pairwise_predictions_at_threshold().
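A minimal sketch of that final step, assuming the trained linker from the worked example below; the threshold of 8.0 is illustrative, and the match weight \(w\) is converted to a match probability via \(p = 2^w / (1 + 2^w)\):

# Score all record pairs generated by the blocking rules
df_predict = linker.inference.predict()

# Convert the chosen match weight threshold to a match probability
chosen_match_weight = 8.0
threshold_probability = 2**chosen_match_weight / (1 + 2**chosen_match_weight)

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=threshold_probability
)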

Worked Example

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.datasets import splink_dataset_labels

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroAtThresholds("surname", [0.9, 0.7]),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
            datetime_metrics=["year", "month"],
            datetime_thresholds=[1, 1],
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("substr(first_name,1,1)"),
        block_on("substr(surname, 1,1)"),
    ],
)

linker = Linker(df, settings, db_api)

# Estimate the prior probability that two random records match,
# using a deterministic rule and an assumed recall
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)

# Estimate the u probabilities by random sampling
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the m probabilities with expectation maximisation,
# using two different blocking rules for training
blocking_rule_for_training = block_on("first_name", "surname")
linker.training.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)

blocking_rule_for_training = block_on("dob")
linker.training.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training
)


# Register the clerical labels so they can be compared against Splink scores
df_labels = splink_dataset_labels.fake_1000_labels
labels_table = linker.table_management.register_labels_table(df_labels)

chart = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="accuracy", add_metrics=["f1"]
)
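In a notebook the chart renders directly. Since the returned object is an Altair chart, it can also be saved to disk (the filename is illustrative):

chart.save("accuracy_analysis_chart.html")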

Note that you can also produce a ROC chart or a precision-recall chart, or get the results as a table:

linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="roc", add_metrics=["f1"]
)
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="precision_recall", add_metrics=["f1"]
)
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="table", add_metrics=["f1"]
).as_pandas_dataframe()
truth_threshold match_probability total_clerical_labels p n tp tn fp fn P_rate ... precision recall specificity npv accuracy f1 f2 f0_5 p4 phi
0 -23.8 6.846774e-08 3176.0 2031.0 1145.0 1446.0 1055.0 90.0 585.0 0.639484 ... 0.941406 0.711965 0.921397 0.643293 0.787469 0.810765 0.748447 0.884404 0.783298 0.608544
1 -22.7 1.467638e-07 3176.0 2031.0 1145.0 1446.0 1077.0 68.0 585.0 0.639484 ... 0.955086 0.711965 0.940611 0.648014 0.794395 0.815797 0.750156 0.894027 0.790841 0.627351
2 -21.7 2.935275e-07 3176.0 2031.0 1145.0 1446.0 1083.0 62.0 585.0 0.639484 ... 0.958886 0.711965 0.945852 0.649281 0.796285 0.817180 0.750623 0.896689 0.792887 0.632504
3 -21.6 3.145950e-07 3176.0 2031.0 1145.0 1446.0 1088.0 57.0 585.0 0.639484 ... 0.962076 0.711965 0.950218 0.650329 0.797859 0.818336 0.751013 0.898918 0.794588 0.636808
4 -20.6 6.291899e-07 3176.0 2031.0 1145.0 1446.0 1094.0 51.0 585.0 0.639484 ... 0.965932 0.711965 0.955459 0.651578 0.799748 0.819728 0.751481 0.901609 0.796624 0.641982
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
278 24.2 9.999999e-01 3176.0 2031.0 1145.0 5.0 1145.0 0.0 2026.0 0.639484 ... 1.000000 0.002462 1.000000 0.361085 0.362091 0.004912 0.003075 0.012189 0.009733 0.029815
279 24.3 1.000000e+00 3176.0 2031.0 1145.0 4.0 1145.0 0.0 2027.0 0.639484 ... 1.000000 0.001969 1.000000 0.360971 0.361776 0.003931 0.002461 0.009770 0.007805 0.026663
280 24.4 1.000000e+00 3176.0 2031.0 1145.0 3.0 1145.0 0.0 2028.0 0.639484 ... 1.000000 0.001477 1.000000 0.360857 0.361461 0.002950 0.001846 0.007342 0.005867 0.023087
281 24.6 1.000000e+00 3176.0 2031.0 1145.0 2.0 1145.0 0.0 2029.0 0.639484 ... 1.000000 0.000985 1.000000 0.360744 0.361146 0.001968 0.001231 0.004904 0.003921 0.018848
282 25.1 1.000000e+00 3176.0 2031.0 1145.0 1.0 1145.0 0.0 2030.0 0.639484 ... 1.000000 0.000492 1.000000 0.360630 0.360831 0.000984 0.000615 0.002457 0.001965 0.013325

283 rows × 25 columns