Predicting which records match

In the previous tutorial, we built and estimated a linkage model.

In this tutorial, we will load the estimated model and use it to predict which pairwise record comparisons are matches.

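import pandas as pd

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker

# Show every column when displaying the wide prediction tables below
pd.options.display.max_columns = 1000

# Load Splink's built-in fake_1000 demo dataset
df = splink_datasets.fake_1000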

Load estimated model from previous tutorial

linker = DuckDBLinker(df)
linker.load_model("../demo_settings/saved_model_from_demo.json")

Predicting match weights using the trained model

We use linker.predict() to run the model.

Under the hood this will:

Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions
Use the rules specified in the Comparisons to evaluate the similarity of the input data
Use the estimated match weights, applying term frequency adjustments where requested, to produce the final match_weight and match_probability scores
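
The final scoring step can be sketched in a few lines of Python. This is a minimal illustration, not Splink's implementation: it assumes the match weight is the log2 of the prior odds plus the sum of the log2 Bayes factors (the bf_ columns in the predict() output, including bf_tf_adj_city). The function name combine_bayes_factors and the prior_odds placeholder are hypothetical; the Bayes factors are copied from rows 0 and 3 of the output shown in this tutorial.

```python
import math


def combine_bayes_factors(bayes_factors, prior_odds):
    """Hypothetical helper: log2(prior odds) plus the sum of log2 Bayes factors.

    Assumption: this mirrors how the bf_ columns combine into match_weight.
    """
    return math.log2(prior_odds) + sum(math.log2(bf) for bf in bayes_factors)


# Bayes factors in column order:
# first_name, surname, dob, city, tf_adj_city, email
row_0 = [84.391685, 1.0, 90.597357, 1.0, 1.0, 252.361018]
row_3 = [84.391685, 0.237771, 0.588069, 10.167002, 1.120874, 1.0]

# The prior is not shown in this tutorial, so use a placeholder value.
# It adds the same constant to every row and cancels in the difference.
prior_odds = 1.0

difference = combine_bayes_factors(row_0, prior_odds) - combine_bayes_factors(
    row_3, prior_odds
)

# Roughly 13.81, in line with 12.655148 - (-1.153442) from the match_weight column
print(round(difference, 2))
```

Because the prior contributes the same constant to every row, the difference between two rows' match weights depends only on their Bayes factors, which is what the sketch compares.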

Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.
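
The two thresholds describe the same cut-off on different scales. As a minimal sketch, assuming the match weight is the log2 of the match odds (which is consistent with the match_weight and match_probability values in the output of this tutorial), the conversion looks like this. The function names are hypothetical and are not part of Splink's API.

```python
import math


def weight_to_probability(match_weight: float) -> float:
    """Convert a log2 match weight into a match probability (assumed relation)."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_weight(match_probability: float) -> float:
    """Convert a match probability (strictly between 0 and 1) into a log2 match weight."""
    return math.log2(match_probability / (1.0 - match_probability))


# Row 0 of the predictions has match_weight 12.655148
print(round(weight_to_probability(12.655148), 6))

# A probability threshold of 0.2 corresponds to a weight of about -2
print(round(probability_to_weight(0.2), 6))
```

Under this assumption, threshold_match_probability=0.2 and threshold_match_weight=-2 would keep roughly the same rows.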

df_predictions = linker.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)

	match_weight	match_probability	unique_id_l	unique_id_r	first_name_l	first_name_r	gamma_first_name	bf_first_name	surname_l	surname_r	gamma_surname	bf_surname	dob_l	dob_r	gamma_dob	bf_dob	city_l	city_r	gamma_city	tf_city_l	tf_city_r	bf_city	bf_tf_adj_city	email_l	email_r	gamma_email	bf_email
0	12.655148	0.999845	4	5	Grace	Grace	4	84.391685	NaN	Kelly	-1	1.000000	1997-04-26	1991-04-26	4	90.597357	Hull	NaN	-1	0.001230	NaN	1.000000	1.000000	grace.kelly52@jones.com	grace.kelly52@jones.com	3	252.361018
1	11.142456	0.999558	26	29	Thomas	Thomas	4	84.391685	Gabriel	Gabriel	4	88.441584	1976-09-15	1976-08-15	4	90.597357	Loodon	NaN	-1	0.001230	NaN	1.000000	1.000000	gabriel.t54@nnichls.info	NaN	-1	1.000000
2	11.142456	0.999558	28	29	Thomas	Thomas	4	84.391685	Gabriel	Gabriel	4	88.441584	1976-09-15	1976-08-15	4	90.597357	London	NaN	-1	0.212792	NaN	1.000000	1.000000	gabriel.t54@nichols.info	NaN	-1	1.000000
3	-1.153442	0.310131	37	860	Theodore	Theodore	4	84.391685	Morris	Marshall	0	0.237771	1978-08-19	1972-07-25	1	0.588069	Birmingham	Birmingham	1	0.049200	0.0492	10.167002	1.120874	t.m39@brooks-sawyer.com	NaN	-1	1.000000
4	-1.153442	0.310131	39	860	Theodore	Theodore	4	84.391685	Morris	Marshall	0	0.237771	1978-08-19	1972-07-25	1	0.588069	Birmingham	Birmingham	1	0.049200	0.0492	10.167002	1.120874	t.m39@brooks-sawyer.com	NaN	-1	1.000000

Clustering

The result of linker.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:

A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99

Often, an alternative representation of this result is more useful: each row is an input record, and records that link to one another are assigned to the same cluster.

With a score threshold of 0.5, the above data could be represented conceptually as:

ID, Cluster ID
A,  1
B,  1
C,  1
D,  2
E,  2
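
As a rough illustration of how the pairwise scores above collapse into this cluster table, here is a minimal union-find sketch in plain Python. It is not Splink's implementation; the function name cluster_pairs is hypothetical, and it assumes that a pair links when its score is at or above the threshold.

```python
def cluster_pairs(pairs, threshold):
    """Group records into clusters by linking every pair that meets the threshold.

    pairs: iterable of (left_id, right_id, score) tuples.
    Returns a dict mapping each record ID to an integer cluster ID.
    """
    parent = {}

    def find(record):
        # Register unseen records as their own root, then walk up to the root
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    for left, right, score in pairs:
        root_left, root_right = find(left), find(right)
        # Assumption: scores at or above the threshold count as links
        if score >= threshold and root_left != root_right:
            parent[root_right] = root_left

    # Number the clusters 1, 2, ... in sorted order of record ID
    cluster_ids = {}
    result = {}
    for record in sorted(parent):
        root = find(record)
        cluster_ids.setdefault(root, len(cluster_ids) + 1)
        result[record] = cluster_ids[root]
    return result


pairs = [
    ("A", "B", 0.9),
    ("B", "C", 0.95),
    ("C", "D", 0.1),
    ("D", "E", 0.99),
]
clusters = cluster_pairs(pairs, threshold=0.5)
print(clusters)  # {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2}
```

Note that A and C end up in the same cluster even though they were never compared directly: the links A -> B and B -> C connect them transitively.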

The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:

clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)
clusters.as_pandas_dataframe(limit=10)

Completed iteration 1, root rows count 11
Completed iteration 2, root rows count 1
Completed iteration 3, root rows count 0

	cluster_id	unique_id	first_name	surname	dob	city	email	cluster	tf_city
0	0	0	Robert	Alan	1971-06-24	NaN	robert255@smith.net	0	NaN
1	0	1	Robert	Allen	1971-05-24	NaN	roberta25@smith.net	0	NaN
2	0	2	Rob	Allen	1971-06-24	London	roberta25@smith.net	0	0.212792
3	0	3	Robert	Alen	1971-06-24	Lonon	NaN	0	0.007380
4	4	4	Grace	NaN	1997-04-26	Hull	grace.kelly52@jones.com	1	0.001230
5	4	5	Grace	Kelly	1991-04-26	NaN	grace.kelly52@jones.com	1	NaN
6	6	6	Logan	pMurphy	1973-08-01	NaN	NaN	2	NaN
7	7	7	NaN	NaN	2015-03-03	Portsmouth	evied56@harris-bailey.net	3	0.017220
8	8	8	NaN	Dean	2015-03-03	NaN	NaN	3	NaN
9	8	9	Evie	Dean	2015-03-03	Pootsmruth	evihd56@earris-bailey.net	3	0.001230

The table of predictions can also be queried directly with SQL, using its physical_name and linker.query_sql():

sql = f"""
select * 
from {df_predictions.physical_name}
limit 2
"""
linker.query_sql(sql)

	match_weight	match_probability	unique_id_l	unique_id_r	first_name_l	first_name_r	gamma_first_name	bf_first_name	surname_l	surname_r	gamma_surname	bf_surname	dob_l	dob_r	gamma_dob	bf_dob	city_l	city_r	gamma_city	tf_city_l	tf_city_r	bf_city	bf_tf_adj_city	email_l	email_r	gamma_email	bf_email	match_key
0	12.655148	0.999845	4	5	Grace	Grace	4	84.391685	NaN	Kelly	-1	1.000000	1997-04-26	1991-04-26	4	90.597357	Hull	NaN	-1	0.00123	NaN	1.0	1.0	grace.kelly52@jones.com	grace.kelly52@jones.com	3	252.361018	0
1	11.142456	0.999558	26	29	Thomas	Thomas	4	84.391685	Gabriel	Gabriel	4	88.441584	1976-09-15	1976-08-15	4	90.597357	Loodon	NaN	-1	0.00123	NaN	1.0	1.0	gabriel.t54@nnichls.info	NaN	-1	1.000000	0

Next steps

Now that we have made predictions with a model, we can move on to visualising it to understand how it works.