Predicting which records match

In the previous tutorial, we built and estimated a linkage model.

In this tutorial, we will load the estimated model and use it to predict which pairwise record comparisons are matches.

from splink.duckdb.linker import DuckDBLinker
from splink.datasets import splink_datasets
import pandas as pd
pd.options.display.max_columns = 1000
df = splink_datasets.fake_1000

Load estimated model from previous tutorial

linker = DuckDBLinker(df)
linker.load_model("../demo_settings/saved_model_from_demo.json")

Predicting match weights using the trained model

We use linker.predict() to run the model.

Under the hood this will:

  • Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions

  • Use the rules specified in the Comparisons to evaluate the similarity of the input data

  • Use the estimated match weights, applying term frequency adjustments where requested, to produce the final match_weight and match_probability scores

Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.
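
The two thresholds express the same cut-off on different scales. As a rough sketch, assuming the match weight is the base-2 logarithm of the match odds (the convention Splink uses), the conversion can be written as below. The helper names are hypothetical and are not part of the Splink API.

```python
import math


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a log2-odds match weight into a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability: float) -> float:
    """Convert a match probability (strictly between 0 and 1) into a log2-odds match weight."""
    return math.log2(probability / (1.0 - probability))


# A probability threshold of 0.2 corresponds to a match weight of about -2.
weight_threshold = probability_to_match_weight(0.2)

# A match weight of 12.655148 corresponds to a probability very close to 1.
high_probability = match_weight_to_probability(12.655148)

print(weight_threshold, high_probability)
```

Under this assumption, passing threshold_match_probability=0.2 should keep roughly the same rows as passing threshold_match_weight=-2.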

df_predictions = linker.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name bf_first_name surname_l surname_r gamma_surname bf_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email bf_email match_key
0 12.655148 0.999845 4 5 Grace Grace 4 84.391685 NaN Kelly -1 1.000000 1997-04-26 1991-04-26 4 90.597357 Hull NaN -1 0.001230 NaN 1.000000 1.000000 grace.kelly52@jones.com grace.kelly52@jones.com 3 252.361018 0
1 11.142456 0.999558 26 29 Thomas Thomas 4 84.391685 Gabriel Gabriel 4 88.441584 1976-09-15 1976-08-15 4 90.597357 Loodon NaN -1 0.001230 NaN 1.000000 1.000000 gabriel.t54@nnichls.info NaN -1 1.000000 0
2 11.142456 0.999558 28 29 Thomas Thomas 4 84.391685 Gabriel Gabriel 4 88.441584 1976-09-15 1976-08-15 4 90.597357 London NaN -1 0.212792 NaN 1.000000 1.000000 gabriel.t54@nichols.info NaN -1 1.000000 0
3 -1.153442 0.310131 37 860 Theodore Theodore 4 84.391685 Morris Marshall 0 0.237771 1978-08-19 1972-07-25 1 0.588069 Birmingham Birmingham 1 0.049200 0.0492 10.167002 1.120874 t.m39@brooks-sawyer.com NaN -1 1.000000 0
4 -1.153442 0.310131 39 860 Theodore Theodore 4 84.391685 Morris Marshall 0 0.237771 1978-08-19 1972-07-25 1 0.588069 Birmingham Birmingham 1 0.049200 0.0492 10.167002 1.120874 t.m39@brooks-sawyer.com NaN -1 1.000000 0
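
The bf_ columns in this output are the Bayes factors contributed by each comparison. The sketch below assumes that the final score combines them multiplicatively with a single prior shared by every row, so that 2 ** match_weight equals the prior multiplied by all the bf_ values, including bf_tf_adj_city. Under that assumption, dividing out the bf_ values for any row should leave the same constant. The numbers are copied from rows 0 and 3 above; the function name is hypothetical.

```python
import math

# Bayes factors in column order: bf_first_name, bf_surname, bf_dob,
# bf_city, bf_tf_adj_city, bf_email (values copied from rows 0 and 3).
row_0 = {
    "match_weight": 12.655148,
    "bayes_factors": [84.391685, 1.0, 90.597357, 1.0, 1.0, 252.361018],
}
row_3 = {
    "match_weight": -1.153442,
    "bayes_factors": [84.391685, 0.237771, 0.588069, 10.167002, 1.120874, 1.0],
}


def implied_prior_bayes_factor(row: dict) -> float:
    """Divide the overall Bayes factor by the product of the per-column Bayes factors."""
    overall_bayes_factor = 2.0 ** row["match_weight"]
    return overall_bayes_factor / math.prod(row["bayes_factors"])


prior_0 = implied_prior_bayes_factor(row_0)
prior_3 = implied_prior_bayes_factor(row_3)

# If the assumption holds, both rows imply (almost) the same prior.
print(prior_0, prior_3)
```

The two values agree up to the rounding of the displayed figures, which is consistent with a shared prior. The prior itself is inferred here rather than read from the model.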

Clustering

The result of linker.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:

A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99

Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.

With a score threshold of 0.5, the above data could be represented conceptually as:

ID, Cluster ID
A,  1
B,  1
C,  1
D,  2
E,  2
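
The grouping above can be reproduced with a few lines of plain Python. This is a minimal sketch using a union-find structure, not Splink's own implementation. It assumes that a pair scoring at or above the threshold counts as a link; the function name cluster_pairs is hypothetical.

```python
def cluster_pairs(pairs, threshold):
    """Group record IDs into clusters by linking pairs that score >= threshold."""
    parent = {}

    def find(record):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    for left, right, score in pairs:
        # Look up both ends even for weak links so isolated records are kept.
        root_left, root_right = find(left), find(right)
        if score >= threshold and root_left != root_right:
            # Merge the two groups, keeping the smaller ID as the root.
            parent[max(root_left, root_right)] = min(root_left, root_right)

    clusters = {}
    for record in sorted(parent):
        clusters.setdefault(find(record), []).append(record)
    return sorted(clusters.values())


pairs = [("A", "B", 0.9), ("B", "C", 0.95), ("C", "D", 0.1), ("D", "E", 0.99)]
clusters = cluster_pairs(pairs, threshold=0.5)
print(clusters)  # [['A', 'B', 'C'], ['D', 'E']]
```

The C -> D link scores 0.1, below the threshold, so it does not join the two groups. Lowering the threshold below 0.1 would merge all five records into a single cluster.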

The algorithm that converts the pairwise results into clusters is called connected components, and it is included in Splink. You can use it as follows:

clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)
clusters.as_pandas_dataframe(limit=10)
Completed iteration 1, root rows count 11
Completed iteration 2, root rows count 1
Completed iteration 3, root rows count 0
cluster_id unique_id first_name surname dob city email cluster tf_city
0 0 0 Robert Alan 1971-06-24 NaN robert255@smith.net 0 NaN
1 0 1 Robert Allen 1971-05-24 NaN roberta25@smith.net 0 NaN
2 0 2 Rob Allen 1971-06-24 London roberta25@smith.net 0 0.212792
3 0 3 Robert Alen 1971-06-24 Lonon NaN 0 0.007380
4 4 4 Grace NaN 1997-04-26 Hull grace.kelly52@jones.com 1 0.001230
5 4 5 Grace Kelly 1991-04-26 NaN grace.kelly52@jones.com 1 NaN
6 6 6 Logan pMurphy 1973-08-01 NaN NaN 2 NaN
7 7 7 NaN NaN 2015-03-03 Portsmouth evied56@harris-bailey.net 3 0.017220
8 8 8 NaN Dean 2015-03-03 NaN NaN 3 NaN
9 8 9 Evie Dean 2015-03-03 Pootsmruth evihd56@earris-bailey.net 3 0.001230
sql = f"""
select * 
from {df_predictions.physical_name}
limit 2
"""
linker.query_sql(sql)
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name bf_first_name surname_l surname_r gamma_surname bf_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email bf_email match_key
0 12.655148 0.999845 4 5 Grace Grace 4 84.391685 NaN Kelly -1 1.000000 1997-04-26 1991-04-26 4 90.597357 Hull NaN -1 0.00123 NaN 1.0 1.0 grace.kelly52@jones.com grace.kelly52@jones.com 3 252.361018 0
1 11.142456 0.999558 26 29 Thomas Thomas 4 84.391685 Gabriel Gabriel 4 88.441584 1976-09-15 1976-08-15 4 90.597357 Loodon NaN -1 0.00123 NaN 1.0 1.0 gabriel.t54@nnichls.info NaN -1 1.000000 0

Further Reading

For more on the prediction tools in Splink, please refer to the Prediction API documentation.

Next steps

Now that we have made predictions with the model, we can move on to visualising them to understand how the model is working.