7. Evaluation

Evaluation of prediction results

In the previous tutorial, we looked at various ways to visualise the results of our model. These are useful for evaluating a linkage pipeline because they allow us to understand how our model works and verify that it is doing something sensible. They can also help us identify examples where the model is not performing as expected.

In addition to these spot checks, Splink also has functions to perform more formal accuracy analysis. These functions allow you to understand the likely prevalence of false positives and false negatives in your linkage models.

They rely on the existence of a sample of labelled (ground truth) matches, which may have been produced (for example) by human beings. For the accuracy analysis to be unbiased, the sample should be representative of the overall dataset.

# Rerun our predictions so we're ready to view the charts
import pandas as pd

from splink import DuckDBAPI, Linker, splink_datasets

pd.options.display.max_columns = 1000

db_api = DuckDBAPI()
df = splink_datasets.fake_1000
import json
import urllib.request

from splink import block_on

url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"

with urllib.request.urlopen(url) as u:
    settings = json.loads(u.read().decode())

# The data quality is very poor in this dataset, so we need looser blocking rules
# to achieve decent recall
settings["blocking_rules_to_generate_predictions"] = [
    block_on("first_name"),
    block_on("city"),
    block_on("email"),
    block_on("dob"),
]

linker = Linker(df, settings, db_api=db_api)
df_predictions = linker.inference.predict(threshold_match_probability=0.01)
Blocking time: 0.02 seconds
Predict time: 0.80 seconds

 -- WARNING --
You have called predict(), but some parameter estimates have neither been estimated nor specified in your settings dictionary. To produce predictions, the following untrained parameters will use default values.
Comparison: 'email':
    m values not fully trained

Load in labels

The labels file contains a list of pairwise comparisons which represent matches and non-matches.

The required format of the labels file is described here.
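
To make the format concrete, here is a minimal sketch (using pandas, already imported above) of what a hand-built labels table might look like. The column names and example values mirror the sample printed below; clerical_match_score is 1.0 for a confirmed match and 0.0 for a confirmed non-match.

# A hypothetical hand-labelled table in the required format
my_labels = pd.DataFrame(
    [
        {"unique_id_l": 0, "source_dataset_l": "fake_1000",
         "unique_id_r": 1, "source_dataset_r": "fake_1000",
         "clerical_match_score": 1.0},  # labelled as a match
        {"unique_id_l": 0, "source_dataset_l": "fake_1000",
         "unique_id_r": 4, "source_dataset_r": "fake_1000",
         "clerical_match_score": 0.0},  # labelled as a non-match
    ]
)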

from splink.datasets import splink_dataset_labels

df_labels = splink_dataset_labels.fake_1000_labels
labels_table = linker.table_management.register_labels_table(df_labels)
df_labels.head(5)
   unique_id_l source_dataset_l  unique_id_r source_dataset_r  clerical_match_score
0            0        fake_1000            1        fake_1000                   1.0
1            0        fake_1000            2        fake_1000                   1.0
2            0        fake_1000            3        fake_1000                   1.0
3            0        fake_1000            4        fake_1000                   0.0
4            0        fake_1000            5        fake_1000                   0.0

View examples of false positives and false negatives

splink_df = linker.evaluation.prediction_errors_from_labels_table(
    labels_table, include_false_negatives=True, include_false_positives=False
)
false_negatives = splink_df.as_record_dict(limit=5)
linker.visualisations.waterfall_chart(false_negatives)
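
The waterfall chart is useful for digging into individual errors. If you prefer a tabular view, the same Splink DataFrame can be converted to pandas (the as_pandas_dataframe method is also used in the truth table section below):

# Inspect the same false negatives as a table rather than a chart
splink_df.as_pandas_dataframe(limit=5)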

False positives

# Note I've picked a threshold match probability of 0.01 here because otherwise
# in this simple example there are no false positives
splink_df = linker.evaluation.prediction_errors_from_labels_table(
    labels_table, include_false_negatives=False, include_false_positives=True, threshold_match_probability=0.01
)
false_positives = splink_df.as_record_dict(limit=5)
linker.visualisations.waterfall_chart(false_positives)

Threshold Selection chart

Splink includes an interactive dashboard that shows key accuracy statistics:

linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="threshold_selection", add_metrics=["f1"]
)
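
Judging by the columns of the truth table later in this tutorial, additional metrics such as f2, f0_5, p4 and phi appear to be computed; assuming add_metrics accepts these names, the dashboard can be extended, for example:

# Assumption: metric names beyond "f1" (e.g. "f2", "phi") are accepted by add_metrics
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table,
    output_type="threshold_selection",
    add_metrics=["f1", "f2", "phi"],
)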

Receiver operating characteristic curve

A ROC chart shows how the number of false positives and false negatives varies depending on the match threshold chosen. The match threshold is the match weight cutoff above which pairwise comparisons are accepted as matches.

linker.evaluation.accuracy_analysis_from_labels_table(labels_table, output_type="roc")
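
The truth table section below refers to both ROC and precision-recall curves, so a precision-recall view is presumably available from the same function. A sketch, assuming "precision_recall" is a supported output_type:

# Assumption: output_type="precision_recall" is supported
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="precision_recall"
)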

Truth table

Splink can also report the underlying table used to construct the ROC and precision-recall curves.

roc_table = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="table"
)
roc_table.as_pandas_dataframe(limit=5)
   truth_threshold  match_probability  total_clerical_labels       p       n      tp      tn    fp     fn    P_rate    N_rate   tp_rate   tn_rate   fp_rate   fn_rate  precision    recall  specificity       npv  accuracy        f1        f2      f0_5        p4       phi
0            -18.9           0.000002                 3176.0  2031.0  1145.0  1709.0  1103.0  42.0  322.0  0.639484  0.360516  0.841457  0.963319  0.036681  0.158543   0.976014  0.841457     0.963319  0.774035  0.885390  0.903755  0.865316  0.945766  0.880476  0.776931
1            -16.7           0.000009                 3176.0  2031.0  1145.0  1709.0  1119.0  26.0  322.0  0.639484  0.360516  0.841457  0.977293  0.022707  0.158543   0.985014  0.841457     0.977293  0.776544  0.890428  0.907594  0.866721  0.952514  0.886010  0.789637
2            -12.8           0.000140                 3176.0  2031.0  1145.0  1709.0  1125.0  20.0  322.0  0.639484  0.360516  0.841457  0.982533  0.017467  0.158543   0.988433  0.841457     0.982533  0.777471  0.892317  0.909043  0.867249  0.955069  0.888076  0.794416
3            -12.5           0.000173                 3176.0  2031.0  1145.0  1708.0  1125.0  20.0  323.0  0.639484  0.360516  0.840965  0.982533  0.017467  0.159035   0.988426  0.840965     0.982533  0.776934  0.892003  0.908752  0.866829  0.954937  0.887763  0.793897
4            -12.4           0.000185                 3176.0  2031.0  1145.0  1705.0  1132.0  13.0  326.0  0.639484  0.360516  0.839488  0.988646  0.011354  0.160512   0.992433  0.839488     0.988646  0.776406  0.893262  0.909576  0.866186  0.957542  0.889225  0.797936
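
The derived metrics follow directly from the raw counts. For example, taking tp, fp and fn from the first row of the table above:

# Sanity-check the derived metrics against the raw counts in row 0
tp, fp, fn = 1709, 42, 322
precision = tp / (tp + fp)                          # 0.976014...
recall = tp / (tp + fn)                             # 0.841457...
f1 = 2 * precision * recall / (precision + recall)  # 0.903755...

These reproduce the precision, recall and f1 values reported in that row.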

Unlinkables chart

Finally, it can be interesting to analyse whether your dataset contains any 'unlinkable' records.

'Unlinkable records' are records with such poor data quality that they don't even link to themselves at a high enough probability to be accepted as matches.

For example, in a typical linkage problem, a 'John Smith' record with nulls for their address and postcode may be unlinkable. By 'unlinkable' we don't mean there are no matches; rather, we mean it is not possible to determine whether there are matches.

A high proportion of unlinkable records is an indication of poor data quality in the input dataset.

linker.evaluation.unlinkables_chart()

For this dataset and this trained model, we can see that most records are (theoretically) linkable: at a match weight of 6, around 99% of records could be linked to themselves.
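
To see why a particular record scores poorly, one approach is to compare a record against an exact copy of itself: a sparse record will achieve only a modest match weight even in this best-case comparison. A sketch, assuming linker.inference.compare_two_records accepts two record dictionaries (check the API documentation for the exact signature):

# Score a record against itself to find its best-case match weight
record = df.iloc[0].to_dict()
self_comparison = linker.inference.compare_two_records(record, record)
self_comparison.as_pandas_dataframe()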

Further Reading

For more on the quality assurance tools in Splink, please refer to the Evaluation API documentation.

📊 For more on the charts used in this tutorial, please refer to the Charts Gallery.

For more on the Evaluation Metrics used in this tutorial, please refer to the Edge Metrics guide.

That's it!

That wraps up the Splink tutorial! Don't worry, there are still plenty of resources to help with the next steps of your Splink journey:

For some end-to-end notebooks of Splink pipelines, check out our Examples

For more deep dives into the different aspects of Splink, and record linkage more generally, check out our Topic Guides

For a reference on all the functionality available in Splink, see our Documentation