6. Quality assurance

Quality assurance of prediction results

In the previous tutorial, we looked at various ways to visualise the results of our model.

These are useful for quality assurance purposes because they allow us to understand how our model works and verify that it is doing something sensible. They can also be useful to identify examples where the model is not performing as expected.

In addition to these spot checks, Splink also has functions to perform more formal accuracy analysis. These functions allow you to understand the likely prevalence of false positives and false negatives in your linkage models.

They rely on a sample of labelled (ground truth) matches, which may have been produced, for example, by human reviewers. For the accuracy analysis to be unbiased, the sample should be representative of the overall dataset.

# Rerun our predictions so we're ready to view the charts
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd 
import altair as alt
alt.renderers.enable('mimetype')

df = pd.read_csv("./data/fake_1000.csv")
linker = DuckDBLinker(df)
linker.load_settings_from_json("./demo_settings/saved_model_from_demo.json")
df_predictions = linker.predict(threshold_match_probability=0.2)

Load in labels

The labels file contains a list of pairwise comparisons which represent matches and non-matches.

The required format of the labels file is described here.

df_labels = pd.read_csv("./data/fake_1000_labels.csv")
df_labels.head(5)
linker.register_table(df_labels, "labels")
linker._initialise_df_concat_with_tf()

Receiver operating characteristic curve

A ROC chart plots the true positive rate against the false positive rate, showing how the balance between false positives and false negatives changes with the match threshold chosen. The match threshold is the match weight used as the cutoff for accepting pairwise comparisons as matches.

linker.roc_chart_from_labels_table("labels")
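As a rough illustration of what the ROC chart summarises, the sketch below computes the false positive rate and true positive rate at each candidate threshold for a handful of labelled pairs. It does not use Splink: the data is made up, `labelled_pairs` and `roc_points` are hypothetical names, and treating a pair as a predicted match when its weight is greater than or equal to the threshold is an assumption of this sketch.

```python
# Made-up labelled pairs: (match_weight, is_true_match). Not Splink output.
labelled_pairs = [
    (-8.0, False), (-5.0, False), (-2.0, True), (-1.0, False),
    (1.0, True), (3.0, False), (4.0, True), (9.0, True),
]


def roc_points(pairs):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    Assumption: a pair is a predicted match when weight >= threshold.
    """
    positives = sum(1 for _, is_match in pairs if is_match)
    negatives = len(pairs) - positives
    points = []
    # Use every distinct observed weight as a candidate threshold
    for threshold in sorted({weight for weight, _ in pairs}):
        tp = sum(1 for w, m in pairs if w >= threshold and m)
        fp = sum(1 for w, m in pairs if w >= threshold and not m)
        points.append((threshold, fp / negatives, tp / positives))
    return points


points = roc_points(labelled_pairs)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:5.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising the threshold moves along the curve towards the origin: fewer false positives are accepted, at the cost of missing more true matches.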

Precision-recall chart

A precision-recall curve is an alternative representation of the truth space. It plots precision against recall as the match threshold varies.

This can be plotted as follows:

linker.precision_recall_chart_from_labels_table("labels")
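To make the trade-off concrete, here is a minimal sketch of how precision and recall could be computed at a few thresholds. It is independent of Splink: the pairs are invented, `precision_recall_at` is a hypothetical helper, and accepting pairs with weight greater than or equal to the threshold is an assumption.

```python
# Made-up labelled pairs: (match_weight, is_true_match). Not Splink output.
labelled_pairs = [
    (-8.0, False), (-5.0, False), (-2.0, True), (-1.0, False),
    (1.0, True), (3.0, False), (4.0, True), (9.0, True),
]


def precision_recall_at(pairs, threshold):
    """Precision and recall when pairs with weight >= threshold are accepted."""
    tp = sum(1 for w, m in pairs if w >= threshold and m)
    fp = sum(1 for w, m in pairs if w >= threshold and not m)
    fn = sum(1 for w, m in pairs if w < threshold and m)
    # Precision is undefined when no pair is accepted, so return None
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall


curve = {t: precision_recall_at(labelled_pairs, t) for t in (-8.0, 1.0, 4.0, 10.0)}
for threshold, (precision, recall) in curve.items():
    print(threshold, precision, recall)
```

In this toy data a low threshold gives perfect recall but poor precision, and a high threshold gives the reverse.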

Truth table

Finally, Splink can also report the underlying table used to construct the ROC and precision-recall curves.

linker._initialise_df_concat_with_tf()
roc_table = linker.truth_space_table_from_labels_table("labels")
roc_table.as_pandas_dataframe(limit=5)
   truth_threshold  match_probability  row_count     P       N    TP     TN      FP   FN  P_rate    N_rate  TP_rate   TN_rate   FP_rate  FN_rate  precision  recall        F1
0       -16.180925           0.000013     1225.0  80.0  1145.0  80.0    0.0  1145.0  0.0     0.0  0.934694      1.0  0.000000  1.000000      0.0   0.065306     1.0  0.122699
1       -15.197628           0.000027     1225.0  80.0  1145.0  80.0  106.0  1039.0  0.0     0.0  0.934694      1.0  0.092576  0.907424      0.0   0.071492     1.0  0.133556
2       -15.058351           0.000029     1225.0  80.0  1145.0  80.0  194.0   951.0  0.0     0.0  0.934694      1.0  0.169432  0.830568      0.0   0.077595     1.0  0.144144
3       -14.281196           0.000050     1225.0  80.0  1145.0  80.0  373.0   772.0  0.0     0.0  0.934694      1.0  0.325764  0.674236      0.0   0.093897     1.0  0.171674
4       -14.075054           0.000058     1225.0  80.0  1145.0  80.0  416.0   729.0  0.0     0.0  0.934694      1.0  0.363319  0.636681      0.0   0.098888     1.0  0.180180
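Several columns in this table can be derived from one another. The sketch below reproduces a few of them for the row with index 1, assuming that `truth_threshold` is a match weight equal to the base-2 logarithm of the Bayes factor (which is consistent with the `match_probability` values shown) and that the rate columns follow the standard confusion-matrix definitions. The function names are hypothetical and are not part of Splink.

```python
def match_weight_to_probability(match_weight):
    """Convert a match weight (assumed log2 of the Bayes factor) to a probability."""
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


def rates_from_counts(tp, tn, fp, fn):
    """Standard confusion-matrix rates from raw counts."""
    positives = tp + fn  # all labelled matches
    negatives = tn + fp  # all labelled non-matches
    return {
        "TP_rate": tp / positives,
        "FN_rate": fn / positives,
        "TN_rate": tn / negatives,
        "FP_rate": fp / negatives,
        "precision": tp / (tp + fp),
        "recall": tp / positives,
    }


# Values copied from the row with index 1 of the table above
probability = match_weight_to_probability(-15.197628)
rates = rates_from_counts(tp=80, tn=106, fp=1039, fn=0)

print("match_probability", round(probability, 6))
for name, value in rates.items():
    print(name, round(value, 6))
```

Under the same assumption, the `threshold_match_probability=0.2` used when predicting corresponds to a match weight of -2, since 2^-2 / (1 + 2^-2) = 0.2.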