
# accuracy_analysis_from_labels_table

**At a glance**

- **Useful for:** Selecting an optimal match weight threshold for generating linked clusters.
- **API Documentation:** `accuracy_analysis_from_labels_table()`
- **What is needed to generate the chart?** A linker with some data and a corresponding labelled dataset.

## What the chart shows

For a given match weight threshold, a record pair scoring above the threshold is labelled a match, and a pair scoring below it is labelled a non-match. This chart sweeps across all possible match weight thresholds and, at each one, shows various accuracy metrics comparing the Splink scores against clerical labels.
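A match weight \(w\) is the \(\log_2\) of the match odds, so a match weight threshold is equivalent to a match probability threshold of

\[
p = \frac{2^w}{1 + 2^w}
\]

This is the relationship between the `truth_threshold` and `match_probability` columns in the table output shown in the worked example below.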

Precision and recall are shown by default, but various additional metrics can be added: specificity, negative predictive value (NPV), accuracy, \(F_1\), \(F_2\), \(F_{0.5}\), \(P_4\) and \(\phi\) (Matthews correlation coefficient).
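For reference, writing \(TP\), \(FP\), \(TN\) and \(FN\) for the confusion matrix counts at a given threshold, these metrics follow their standard definitions, for example

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_\beta = (1 + \beta^2)\,\frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
\]

so \(F_1\) weights precision and recall equally, while \(F_2\) favours recall and \(F_{0.5}\) favours precision.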

## How to interpret the chart

Precision can be maximised by increasing the match threshold (reducing false positives).

Recall can be maximised by decreasing the match threshold (reducing false negatives).

Additional metrics can be used to find the optimal compromise between these two, looking for the threshold at which the chosen metric peaks.

> **Confusion matrix**
>
> See `threshold_selection_tool_from_labels_table` for a more complete visualisation of the impact of the match threshold on false positives and false negatives, with reference to the confusion matrix.

## Actions to take as a result of the chart

Once an optimal match weight threshold has been identified, it can be applied when generating linked clusters using `cluster_pairwise_predictions_at_threshold()`, as sketched below.
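A minimal sketch of that follow-on step, assuming the trained `linker` from the worked example below and an illustrative threshold equivalent to a match probability of 0.95:

```python
# Score all pairwise record comparisons generated by the blocking rules
df_predict = linker.inference.predict()

# Group pairs scoring above the chosen threshold into linked clusters
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict,
    threshold_match_probability=0.95,  # illustrative value; pick yours from the chart
)
```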

## Worked Example

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.datasets import splink_dataset_labels

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroAtThresholds("surname", [0.9, 0.7]),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
            datetime_metrics=["year", "month"],
            datetime_thresholds=[1, 1],
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("substr(first_name, 1, 1)"),
        block_on("substr(surname, 1, 1)"),
    ],
)

linker = Linker(df, settings, db_api)

# Estimate the probability that two random records match
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)

# Estimate the u probabilities by random sampling
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the m probabilities with two rounds of expectation maximisation,
# blocking on different columns in each round
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

# Register the clerical labels and generate the accuracy chart
df_labels = splink_dataset_labels.fake_1000_labels
labels_table = linker.table_management.register_labels_table(df_labels)

chart = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="accuracy", add_metrics=["f1"]
)
```
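The returned `chart` is typically an Altair chart object, so it renders inline in a notebook; elsewhere it can be written to disk with Altair's standard `chart.save("accuracy_chart.html")`.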

Note that you can also produce a ROC chart, a precision-recall chart, or get the results as a table:

```python
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="roc", add_metrics=["f1"]
)
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="precision_recall", add_metrics=["f1"]
)
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="table", add_metrics=["f1"]
).as_pandas_dataframe()
```
|     | truth_threshold | match_probability | total_clerical_labels | p | n | tp | tn | fp | fn | P_rate | … | precision | recall | specificity | npv | accuracy | f1 | f2 | f0_5 | p4 | phi |
|-----|-----------------|-------------------|-----------------------|---|---|----|----|----|----|--------|---|-----------|--------|-------------|-----|----------|----|----|------|----|-----|
| 0   | -23.8 | 6.846774e-08 | 3176.0 | 2031.0 | 1145.0 | 1446.0 | 1055.0 | 90.0 | 585.0 | 0.639484 | … | 0.941406 | 0.711965 | 0.921397 | 0.643293 | 0.787469 | 0.810765 | 0.748447 | 0.884404 | 0.783298 | 0.608544 |
| 1   | -22.7 | 1.467638e-07 | 3176.0 | 2031.0 | 1145.0 | 1446.0 | 1077.0 | 68.0 | 585.0 | 0.639484 | … | 0.955086 | 0.711965 | 0.940611 | 0.648014 | 0.794395 | 0.815797 | 0.750156 | 0.894027 | 0.790841 | 0.627351 |
| 2   | -21.7 | 2.935275e-07 | 3176.0 | 2031.0 | 1145.0 | 1446.0 | 1083.0 | 62.0 | 585.0 | 0.639484 | … | 0.958886 | 0.711965 | 0.945852 | 0.649281 | 0.796285 | 0.817180 | 0.750623 | 0.896689 | 0.792887 | 0.632504 |
| 3   | -21.6 | 3.145950e-07 | 3176.0 | 2031.0 | 1145.0 | 1446.0 | 1088.0 | 57.0 | 585.0 | 0.639484 | … | 0.962076 | 0.711965 | 0.950218 | 0.650329 | 0.797859 | 0.818336 | 0.751013 | 0.898918 | 0.794588 | 0.636808 |
| 4   | -20.6 | 6.291899e-07 | 3176.0 | 2031.0 | 1145.0 | 1446.0 | 1094.0 | 51.0 | 585.0 | 0.639484 | … | 0.965932 | 0.711965 | 0.955459 | 0.651578 | 0.799748 | 0.819728 | 0.751481 | 0.901609 | 0.796624 | 0.641982 |
| …   | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 278 | 24.2 | 9.999999e-01 | 3176.0 | 2031.0 | 1145.0 | 5.0 | 1145.0 | 0.0 | 2026.0 | 0.639484 | … | 1.000000 | 0.002462 | 1.000000 | 0.361085 | 0.362091 | 0.004912 | 0.003075 | 0.012189 | 0.009733 | 0.029815 |
| 279 | 24.3 | 1.000000e+00 | 3176.0 | 2031.0 | 1145.0 | 4.0 | 1145.0 | 0.0 | 2027.0 | 0.639484 | … | 1.000000 | 0.001969 | 1.000000 | 0.360971 | 0.361776 | 0.003931 | 0.002461 | 0.009770 | 0.007805 | 0.026663 |
| 280 | 24.4 | 1.000000e+00 | 3176.0 | 2031.0 | 1145.0 | 3.0 | 1145.0 | 0.0 | 2028.0 | 0.639484 | … | 1.000000 | 0.001477 | 1.000000 | 0.360857 | 0.361461 | 0.002950 | 0.001846 | 0.007342 | 0.005867 | 0.023087 |
| 281 | 24.6 | 1.000000e+00 | 3176.0 | 2031.0 | 1145.0 | 2.0 | 1145.0 | 0.0 | 2029.0 | 0.639484 | … | 1.000000 | 0.000985 | 1.000000 | 0.360744 | 0.361146 | 0.001968 | 0.001231 | 0.004904 | 0.003921 | 0.018848 |
| 282 | 25.1 | 1.000000e+00 | 3176.0 | 2031.0 | 1145.0 | 1.0 | 1145.0 | 0.0 | 2030.0 | 0.639484 | … | 1.000000 | 0.000492 | 1.000000 | 0.360630 | 0.360831 | 0.000984 | 0.000615 | 0.002457 | 0.001965 | 0.013325 |

*283 rows × 25 columns*
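Because `as_pandas_dataframe()` returns an ordinary pandas DataFrame, the threshold at which a given metric peaks can be read off directly. A minimal sketch using the `f1` column from the table above:

```python
df_metrics = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="table", add_metrics=["f1"]
).as_pandas_dataframe()

# Select the row where F1 peaks: the best precision/recall compromise under F1
best_row = df_metrics.loc[df_metrics["f1"].idxmax()]
print(
    f"Peak F1 of {best_row['f1']:.3f} at match weight {best_row['truth_threshold']} "
    f"(match probability {best_row['match_probability']:.3g})"
)
```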