Predicting which records match

In the previous tutorial, we built and estimated a linkage model.

In this tutorial, we will load the estimated model and use it to predict which pairwise record comparisons are matches.

from splink import Linker, DuckDBAPI, splink_datasets

import pandas as pd

pd.options.display.max_columns = 1000

db_api = DuckDBAPI()
df = splink_datasets.fake_1000

Load estimated model from previous tutorial

import json
import urllib.request

url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"

with urllib.request.urlopen(url) as u:
    settings = json.loads(u.read().decode())


linker = Linker(df, settings, db_api=db_api)

Predicting match weights using the trained model

We use linker.inference.predict() to run the model.

Under the hood this will:

Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions
Use the rules specified in the Comparisons to evaluate the similarity of the input data
Use the estimated match weights, applying term frequency adjustments where requested, to produce the final match_weight and match_probability scores
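
The first step above can be illustrated with a small pure-Python sketch. It is not how Splink generates comparisons (Splink runs SQL against the database backend); the toy records and the two equality rules are hypothetical and exist only to show what "matches at least one blocking rule" means. Each pair is kept once and tagged with the index of the first rule it satisfies, which is assumed here to be analogous to the match_key column in the output further down.

```python
from itertools import combinations

# Hypothetical toy records; the column names mirror the tutorial's dataset.
records = [
    {"unique_id": 1, "first_name": "Kai", "surname": "Turner"},
    {"unique_id": 2, "first_name": "Kai", "surname": None},
    {"unique_id": 3, "first_name": "Toby", "surname": "Turner"},
    {"unique_id": 4, "first_name": "Evie", "surname": "Dean"},
]

# Hypothetical blocking rules: each one is an equality test on a single column.
blocking_columns = ["first_name", "surname"]


def blocked_pairs(records, blocking_columns):
    """Return {(id_l, id_r): index of the first rule the pair satisfies}."""
    pairs = {}
    for left, right in combinations(records, 2):
        for rule_index, column in enumerate(blocking_columns):
            # As in SQL, a null never equals anything, including another null.
            if left[column] is not None and left[column] == right[column]:
                pairs[(left["unique_id"], right["unique_id"])] = rule_index
                break  # keep each pair once, under the first rule it satisfies
    return pairs


pairs = blocked_pairs(records, blocking_columns)
print(pairs)
```

Only pairs (1, 2) and (1, 3) survive; the other four possible pairs are never scored. The sketch enumerates every pair for clarity, which is exactly the cost that blocking is meant to avoid on real data.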

Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.
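
The two threshold types describe the same cut-off on different scales. The sketch below assumes that the match weight is the base-2 logarithm of the match odds, which is consistent with the match_weight and match_probability pairs shown in this tutorial's output. The helper names are illustrative and are not part of the Splink API.

```python
import math


def probability_to_match_weight(probability: float) -> float:
    """Convert a match probability to a match weight (log2 of the odds)."""
    return math.log2(probability / (1.0 - probability))


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 of the odds) back to a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# A probability threshold of 0.2 means odds of 0.25, i.e. a match weight of about -2.
weight_threshold = probability_to_match_weight(0.2)
print(weight_threshold)
print(match_weight_to_probability(weight_threshold))
```

Under this assumption, threshold_match_probability=0.2 and threshold_match_weight=-2 would keep the same rows.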

df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained

	match_weight	match_probability	unique_id_l	unique_id_r	first_name_l	first_name_r	gamma_first_name	tf_first_name_l	tf_first_name_r	bf_first_name	bf_tf_adj_first_name	surname_l	surname_r	gamma_surname	tf_surname_l	tf_surname_r	bf_surname	bf_tf_adj_surname	dob_l	dob_r	bf_dob	city_l	city_r	gamma_city	tf_city_l	tf_city_r	bf_city	bf_tf_adj_city	email_l	email_r	gamma_email	tf_email_l	tf_email_r	bf_email	bf_tf_adj_email	match_key
0	-1.749664	0.229211	324	326	Kai	Kai	4	0.006017	0.006017	84.821765	0.962892	None	Turner	-1	NaN	0.007326	1.000000	1.000000	2018-12-31	2009-11-03	0.460743	London	London	1	0.212792	0.212792	10.20126	0.259162	k.t50eherand@z.ncom	None	-1	0.001267	NaN	1.0	1.0	0
1	-1.626076	0.244695	25	27	Gabriel	None	-1	0.001203	NaN	1.000000	1.000000	Thomas	Thomas	4	0.004884	0.004884	88.870507	1.001222	1977-09-13	1977-10-17	0.460743	London	London	1	0.212792	0.212792	10.20126	0.259162	gabriel.t54@nichols.info	None	-1	0.002535	NaN	1.0	1.0	1
2	-1.551265	0.254405	626	629	geeorGe	George	1	0.001203	0.014440	4.176727	1.000000	Davidson	Davidson	4	0.007326	0.007326	88.870507	0.667482	1999-05-07	2000-05-06	0.460743	Southamptn	None	-1	0.001230	NaN	1.00000	1.000000	None	gdavidson@johnson-brown.com	-1	NaN	0.00507	1.0	1.0	1
3	-1.427735	0.270985	600	602	Toby	Toby	4	0.004813	0.004813	84.821765	1.203614	None	None	-1	NaN	NaN	1.000000	1.000000	2003-04-23	2013-03-21	0.460743	London	London	1	0.212792	0.212792	10.20126	0.259162	toby.d@menhez.com	None	-1	0.001267	NaN	1.0	1.0	0
4	-1.427735	0.270985	599	602	Toby	Toby	4	0.004813	0.004813	84.821765	1.203614	Haall	None	-1	0.001221	NaN	1.000000	1.000000	2003-04-23	2013-03-21	0.460743	London	London	1	0.212792	0.212792	10.20126	0.259162	None	None	-1	NaN	NaN	1.0	1.0	0
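
The bf_ and bf_tf_adj_ columns show how each comparison contributes to the final score. As a rough check, and assuming the match weight is the log2 prior odds plus the sum of the log2 Bayes factors, the sketch below infers the prior term from row 0 and uses it to reproduce the match_weight of row 1. The values are copied from the table above; the variable and function names are illustrative.

```python
import math

# Bayes factors (bf_* and bf_tf_adj_* columns) copied from rows 0 and 1 above,
# in the order first_name, surname, dob, city, email.
row_0 = {
    "match_weight": -1.749664,
    "bayes_factors": [84.821765, 0.962892, 1.0, 1.0, 0.460743,
                      10.20126, 0.259162, 1.0, 1.0],
}
row_1 = {
    "match_weight": -1.626076,
    "bayes_factors": [1.0, 1.0, 88.870507, 1.001222, 0.460743,
                      10.20126, 0.259162, 1.0, 1.0],
}


def evidence_weight(bayes_factors):
    """Sum of log2 Bayes factors: the evidence added by the comparisons."""
    return sum(math.log2(bf) for bf in bayes_factors)


# Assumption: match_weight = log2(prior odds) + sum of log2(Bayes factors).
# Infer the prior term from row 0, then apply it to row 1.
prior_weight = row_0["match_weight"] - evidence_weight(row_0["bayes_factors"])
predicted_weight_row_1 = prior_weight + evidence_weight(row_1["bayes_factors"])

print(prior_weight)
print(predicted_weight_row_1, row_1["match_weight"])
```

The predicted value agrees with the reported match_weight of row 1 to within the rounding of the displayed columns, which supports reading each Bayes factor as a multiplicative adjustment to the prior odds.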

Clustering

The result of linker.inference.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:

A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99

Often, an alternative representation of this result is more useful: each row is an input record, and records that link are assigned to the same cluster.

With a score threshold of 0.5, the above data could be represented conceptually as:

ID, Cluster ID
A,  1
B,  1
C,  1
D,  2
E,  2
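
The conversion from the pairwise scores to this table can be sketched in plain Python: ignore pairs scoring below the threshold, then group records that remain linked through any chain of pairs. This is a toy union-find illustration on the example data, not Splink's implementation, and the function name is hypothetical.

```python
# Pairwise scores from the conceptual example above.
scored_pairs = [
    ("A", "B", 0.9),
    ("B", "C", 0.95),
    ("C", "D", 0.1),
    ("D", "E", 0.99),
]


def cluster_at_threshold(scored_pairs, threshold):
    """Group records linked by a chain of pairs scoring at or above threshold."""
    parent = {}

    def find(record):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    for left, right, score in scored_pairs:
        root_left, root_right = find(left), find(right)
        if score >= threshold and root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    # Number the clusters 1, 2, ... in order of first appearance.
    cluster_numbers = {}
    assignments = {}
    for record in list(parent):
        root = find(record)
        cluster_numbers.setdefault(root, len(cluster_numbers) + 1)
        assignments[record] = cluster_numbers[root]
    return assignments


toy_clusters = cluster_at_threshold(scored_pairs, threshold=0.5)
print(toy_clusters)
```

The weak C -> D link is dropped, so A, B and C end up in one group and D and E in another, matching the table above.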

The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit=10)

Completed iteration 1, root rows count 2
Completed iteration 2, root rows count 0

	cluster_id	unique_id	first_name	surname	dob	city	email	cluster	__splink_salt	tf_surname	tf_email	tf_city	tf_first_name
0	0	0	Robert	Alan	1971-06-24	None	robert255@smith.net	0	0.012924	0.001221	0.001267	NaN	0.003610
1	1	1	Robert	Allen	1971-05-24	None	roberta25@smith.net	0	0.478756	0.002442	0.002535	NaN	0.003610
2	1	2	Rob	Allen	1971-06-24	London	roberta25@smith.net	0	0.409662	0.002442	0.002535	0.212792	0.001203
3	3	3	Robert	Alen	1971-06-24	Lonon	None	0	0.311029	0.001221	NaN	0.007380	0.003610
4	4	4	Grace	None	1997-04-26	Hull	grace.kelly52@jones.com	1	0.486141	NaN	0.002535	0.001230	0.006017
5	5	5	Grace	Kelly	1991-04-26	None	grace.kelly52@jones.com	1	0.434566	0.002442	0.002535	NaN	0.006017
6	6	6	Logan	pMurphy	1973-08-01	None	None	2	0.423760	0.001221	NaN	NaN	0.012034
7	7	7	None	None	2015-03-03	Portsmouth	evied56@harris-bailey.net	3	0.683689	NaN	0.002535	0.017220	NaN
8	8	8	None	Dean	2015-03-03	None	None	3	0.553086	0.003663	NaN	NaN	NaN
9	8	9	Evie	Dean	2015-03-03	Pootsmruth	evihd56@earris-bailey.net	3	0.753070	0.003663	0.001267	0.001230	0.008424

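The predictions are stored as a table in the database, so you can also query them directly with SQL:

sql = f"""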
select *
from {df_predictions.physical_name}
limit 2
"""
linker.misc.query_sql(sql)

	match_weight	match_probability	unique_id_l	unique_id_r	first_name_l	first_name_r	gamma_first_name	tf_first_name_l	tf_first_name_r	bf_first_name	bf_tf_adj_first_name	surname_l	surname_r	gamma_surname	tf_surname_l	tf_surname_r	bf_surname	bf_tf_adj_surname	dob_l	dob_r	gamma_dob	bf_dob	city_l	city_r	gamma_city	tf_city_l	tf_city_r	bf_city	bf_tf_adj_city	email_l	email_r	gamma_email	tf_email_l	tf_email_r	bf_email	bf_tf_adj_email	match_key
0	-1.749664	0.229211	324	326	Kai	Kai	4	0.006017	0.006017	84.821765	0.962892	None	Turner	-1	NaN	0.007326	1.000000	1.000000	2018-12-31	2009-11-03	0	0.460743	London	London	1	0.212792	0.212792	10.20126	0.259162	k.t50eherand@z.ncom	None	-1	0.001267	NaN	1.0	1.0	0
1	-1.626076	0.244695	25	27	Gabriel	None	-1	0.001203	NaN	1.000000	1.000000	Thomas	Thomas	4	0.004884	0.004884	88.870507	1.001222	1977-09-13	1977-10-17	0	0.460743	London	London	1	0.212792	0.212792	10.20126	0.259162	gabriel.t54@nichols.info	None	-1	0.002535	NaN	1.0	1.0	1

Next steps

Now that we have made predictions with a model, we can move on to visualising it to understand how it works.