Predicting which records match

In the previous tutorial, we built and estimated a linkage model.

In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match.

from splink import Linker, DuckDBAPI, splink_datasets

import pandas as pd

pd.options.display.max_columns = 1000

db_api = DuckDBAPI()
df = splink_datasets.fake_1000

Load estimated model from previous tutorial

import json
import urllib.request

url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"

with urllib.request.urlopen(url) as u:
    settings = json.loads(u.read().decode())


# Reuse the DuckDB backend created above rather than constructing a second one
linker = Linker(df, settings, db_api=db_api)

Predicting match weights using the trained model

We use linker.inference.predict() to run the model.

Under the hood this will:

  • Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions

  • Use the rules specified in the Comparisons to evaluate the similarity of the input data

  • Use the estimated match weights, applying term frequency adjustments where requested, to produce the final match_weight and match_probability scores
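
To make the last step more concrete, here is a minimal sketch of how per-comparison Bayes factors could be combined into a final score. It assumes the usual Fellegi-Sunter arithmetic: the prior odds are multiplied by every Bayes factor (including term frequency adjustments), the match weight is the base-2 logarithm of the result, and the match probability is the same odds converted to a probability. The function name combine_bayes_factors is hypothetical, and the prior_odds value is an illustrative assumption; in a real model the prior comes from probability_two_random_records_match in the settings. The Bayes factors are taken from the first row of the predictions table shown later in this tutorial.

```python
import math


def combine_bayes_factors(prior_odds, bayes_factors):
    """Return (match_weight, match_probability) for one pairwise comparison.

    Hypothetical helper, not part of Splink: multiplies the prior odds by
    every Bayes factor, then expresses the posterior odds in both forms.
    """
    posterior_odds = prior_odds * math.prod(bayes_factors)
    match_weight = math.log2(posterior_odds)
    match_probability = posterior_odds / (1 + posterior_odds)
    return match_weight, match_probability


# Bayes factors (bf_*) and term frequency adjustments (bf_tf_adj_*)
# copied from the first row of the predictions output.
row_bayes_factors = [
    84.821765, 0.962892,  # first_name
    1.0, 1.0,             # surname (null comparison, neutral)
    0.460743,             # dob (no term frequency adjustment)
    10.20126, 0.259162,   # city
    1.0, 1.0,             # email (null comparison, neutral)
]

# ASSUMPTION: illustrative prior odds, not read from the saved model.
prior_odds = 0.003

weight, probability = combine_bayes_factors(prior_odds, row_bayes_factors)
print(f"match_weight={weight:.3f}, match_probability={probability:.3f}")
```

A Bayes factor of 1.0 leaves the score unchanged, which is why comparisons involving a null value (gamma of -1) have no effect on the result.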

Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.

df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name bf_tf_adj_first_name surname_l surname_r gamma_surname tf_surname_l tf_surname_r bf_surname bf_tf_adj_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email match_key
0 -1.749664 0.229211 324 326 Kai Kai 4 0.006017 0.006017 84.821765 0.962892 None Turner -1 NaN 0.007326 1.000000 1.000000 2018-12-31 2009-11-03 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 k.t50eherand@z.ncom None -1 0.001267 NaN 1.0 1.0 0
1 -1.626076 0.244695 25 27 Gabriel None -1 0.001203 NaN 1.000000 1.000000 Thomas Thomas 4 0.004884 0.004884 88.870507 1.001222 1977-09-13 1977-10-17 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 gabriel.t54@nichols.info None -1 0.002535 NaN 1.0 1.0 1
2 -1.551265 0.254405 626 629 geeorGe George 1 0.001203 0.014440 4.176727 1.000000 Davidson Davidson 4 0.007326 0.007326 88.870507 0.667482 1999-05-07 2000-05-06 0 0.460743 Southamptn None -1 0.001230 NaN 1.00000 1.000000 None gdavidson@johnson-brown.com -1 NaN 0.00507 1.0 1.0 1
3 -1.427735 0.270985 600 602 Toby Toby 4 0.004813 0.004813 84.821765 1.203614 None None -1 NaN NaN 1.000000 1.000000 2003-04-23 2013-03-21 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 toby.d@menhez.com None -1 0.001267 NaN 1.0 1.0 0
4 -1.427735 0.270985 599 602 Toby Toby 4 0.004813 0.004813 84.821765 1.203614 Haall None -1 0.001221 NaN 1.000000 1.000000 2003-04-23 2013-03-21 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 None None -1 NaN NaN 1.0 1.0 0
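
The match_weight and match_probability columns above are two views of the same score: the match weight is the base-2 logarithm of the odds implied by the match probability. The sketch below shows the conversion in both directions, which can help when choosing between threshold_match_probability and threshold_match_weight. The two function names are hypothetical helpers, not part of Splink.

```python
import math


def probability_to_match_weight(probability):
    """Convert a match probability (strictly between 0 and 1) to a match weight."""
    return math.log2(probability / (1 - probability))


def match_weight_to_probability(match_weight):
    """Convert a match weight back to a match probability."""
    odds = 2 ** match_weight
    return odds / (1 + odds)


# First row of the predictions table: match_probability = 0.229211.
# The result agrees with the reported match_weight of -1.749664 up to
# the rounding of the displayed probability.
print(probability_to_match_weight(0.229211))

# The threshold used above, expressed as a match weight.
print(probability_to_match_weight(0.2))

# A match weight of 0 corresponds to even odds.
print(match_weight_to_probability(0.0))
```

A probability threshold of 0.2 corresponds to odds of 1 to 4, that is, a match weight of about -2.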

Clustering

The result of linker.inference.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:

A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99

Often, an alternative representation of this result is more useful: each row is an input record, and records that link are assigned to the same cluster.

With a score threshold of 0.5, the above data could be represented conceptually as:

ID, Cluster ID
A,  1
B,  1
C,  1
D,  2
E,  2

The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit=10)
Completed iteration 1, root rows count 2


Completed iteration 2, root rows count 0
cluster_id unique_id first_name surname dob city email cluster __splink_salt tf_surname tf_email tf_city tf_first_name
0 0 0 Robert Alan 1971-06-24 None robert255@smith.net 0 0.012924 0.001221 0.001267 NaN 0.003610
1 1 1 Robert Allen 1971-05-24 None roberta25@smith.net 0 0.478756 0.002442 0.002535 NaN 0.003610
2 1 2 Rob Allen 1971-06-24 London roberta25@smith.net 0 0.409662 0.002442 0.002535 0.212792 0.001203
3 3 3 Robert Alen 1971-06-24 Lonon None 0 0.311029 0.001221 NaN 0.007380 0.003610
4 4 4 Grace None 1997-04-26 Hull grace.kelly52@jones.com 1 0.486141 NaN 0.002535 0.001230 0.006017
5 5 5 Grace Kelly 1991-04-26 None grace.kelly52@jones.com 1 0.434566 0.002442 0.002535 NaN 0.006017
6 6 6 Logan pMurphy 1973-08-01 None None 2 0.423760 0.001221 NaN NaN 0.012034
7 7 7 None None 2015-03-03 Portsmouth evied56@harris-bailey.net 3 0.683689 NaN 0.002535 0.017220 NaN
8 8 8 None Dean 2015-03-03 None None 3 0.553086 0.003663 NaN NaN NaN
9 8 9 Evie Dean 2015-03-03 Pootsmruth evihd56@earris-bailey.net 3 0.753070 0.003663 0.001267 0.001230 0.008424
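
To illustrate what connected components does, here is a minimal pure-Python sketch applied to the conceptual A to E example from earlier. It uses a union-find structure and is not Splink's implementation; the function name connected_components, the choice to label each cluster by its smallest member, and the use of >= for the threshold comparison are all assumptions made for this illustration.

```python
def connected_components(edges, threshold):
    """Group records into clusters, linking pairs that score at or above threshold.

    edges is an iterable of (left_id, right_id, score) tuples.
    Returns a dict mapping each record id to its cluster label.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right, score in edges:
        root_left, root_right = find(left), find(right)
        # ASSUMPTION: a pair links when its score is >= the threshold.
        if score >= threshold and root_left != root_right:
            # Keep the smaller id as the root so cluster labels are stable.
            parent[max(root_left, root_right)] = min(root_left, root_right)

    return {node: find(node) for node in sorted(parent)}


edges = [
    ("A", "B", 0.9),
    ("B", "C", 0.95),
    ("C", "D", 0.1),
    ("D", "E", 0.99),
]

clusters = connected_components(edges, threshold=0.5)
print(clusters)
# {'A': 'A', 'B': 'A', 'C': 'A', 'D': 'D', 'E': 'D'}
```

A, B and C end up in one cluster because A links to B and B links to C, even though A and C were never compared directly. The low-scoring C to D pair is ignored, so D and E form a separate cluster.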
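The predictions are stored as a table in the database, so they can also be queried directly with SQL:

sql = f"""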
select *
from {df_predictions.physical_name}
limit 2
"""
linker.misc.query_sql(sql)
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name bf_tf_adj_first_name surname_l surname_r gamma_surname tf_surname_l tf_surname_r bf_surname bf_tf_adj_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email match_key
0 -1.749664 0.229211 324 326 Kai Kai 4 0.006017 0.006017 84.821765 0.962892 None Turner -1 NaN 0.007326 1.000000 1.000000 2018-12-31 2009-11-03 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 k.t50eherand@z.ncom None -1 0.001267 NaN 1.0 1.0 0
1 -1.626076 0.244695 25 27 Gabriel None -1 0.001203 NaN 1.000000 1.000000 Thomas Thomas 4 0.004884 0.004884 88.870507 1.001222 1977-09-13 1977-10-17 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 gabriel.t54@nichols.info None -1 0.002535 NaN 1.0 1.0 1

Further Reading

For more on the prediction tools in Splink, please refer to the Prediction API documentation.

Next steps

Now that we have made predictions with a model, we can move on to visualising it to understand how it is working.