Predicting which records match

In the previous tutorial, we built and estimated a linkage model.

In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match.

from splink.duckdb.duckdb_linker import DuckDBLinker
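import pandas as pd
# Show every column when displaying the wide prediction tables
pd.options.display.max_columns = 1000
# Load the input records used throughout the tutorials
df = pd.read_csv("./data/fake_1000.csv")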

Load estimated model from previous tutorial

linker = DuckDBLinker(df)
linker.load_settings_from_json("./demo_settings/saved_model_from_demo.json")

Predicting match weights using the trained model

We use linker.predict() to run the model.

Under the hood this will:

  • Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions

  • Use the rules specified in the Comparisons to evaluate the similarity of the input data

  • Use the estimated match weights, applying term frequency adjustments where requested, to produce the final match_weight and match_probability scores
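
The scoring step in the last bullet can be sketched in plain Python. This is a minimal illustration, not Splink's implementation: it assumes the match weight is the base-2 log of the prior odds multiplied by every bf_ column, and that match_probability is recovered from the weight as 2^w / (1 + 2^w). The function names are hypothetical; the two match weights are copied from the linker.predict() output shown in this tutorial.

```python
import math

def match_weight(prior_odds, bayes_factors):
    """Combine prior odds and per-column Bayes factors into a log2 match weight."""
    return math.log2(prior_odds) + sum(math.log2(bf) for bf in bayes_factors)

def weight_to_probability(weight):
    """Convert a log2 match weight into a match probability."""
    odds = 2.0 ** weight
    return odds / (1.0 + odds)

def probability_to_weight(probability):
    """Inverse conversion, e.g. to express a probability threshold as a weight."""
    return math.log2(probability / (1.0 - probability))

# Match weights copied from the predictions output in this tutorial.
print(round(weight_to_probability(12.728972), 6))
print(round(weight_to_probability(-1.284382), 6))
# Under these assumptions, a probability threshold of 0.2 is a match weight of -2.
print(probability_to_weight(0.2))
```

Under these assumptions, the two printed probabilities should agree with the match_probability column for the same rows, which is a quick way to check how the two score columns relate.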

Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.

df_predictions = linker.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name bf_first_name surname_l surname_r gamma_surname bf_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email bf_email match_key
0 12.728972 0.999853 4 5 Grace Grace 1 85.803382 NaN Kelly -1 1.000000 1997-04-26 1991-04-26 1 92.704873 Hull NaN -1 0.001230 NaN 1.000000 1.000000 grace.kelly52@jones.com grace.kelly52@jones.com 3 255.30162 0
1 11.216421 0.999580 26 29 Thomas Thomas 1 85.803382 Gabriel Gabriel 3 89.480899 1976-09-15 1976-08-15 1 92.704873 Loodon NaN -1 0.001230 NaN 1.000000 1.000000 gabriel.t54@nnichls.info NaN -1 1.00000 0
2 11.216421 0.999580 28 29 Thomas Thomas 1 85.803382 Gabriel Gabriel 3 89.480899 1976-09-15 1976-08-15 1 92.704873 London NaN -1 0.212792 NaN 1.000000 1.000000 gabriel.t54@nichols.info NaN -1 1.00000 0
3 -1.284382 0.291055 37 860 Theodore Theodore 1 85.803382 Morris Marshall 0 0.267994 1978-08-19 1972-07-25 0 0.464198 Birmingham Birmingham 1 0.049200 0.0492 10.264354 1.120874 t.m39@brooks-sawyer.com NaN -1 1.00000 0
4 -1.284382 0.291055 39 860 Theodore Theodore 1 85.803382 Morris Marshall 0 0.267994 1978-08-19 1972-07-25 0 0.464198 Birmingham Birmingham 1 0.049200 0.0492 10.264354 1.120874 t.m39@brooks-sawyer.com NaN -1 1.00000 0

Clustering

The result of linker.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:

A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99

Often, an alternative representation of this result is more useful: one row per input record, with records that link to each other assigned to the same cluster.

With a score threshold of 0.5, the above data could be represented conceptually as:

ID, Cluster ID
A,  1
B,  1
C,  1
D,  2
E,  2
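
The grouping above can be reproduced with a small union-find sketch in plain Python. It is only an illustration of the idea, not Splink's implementation: the function name cluster_pairs is hypothetical, and treating a score equal to the threshold as a link is an assumption.

```python
def cluster_pairs(pairs, record_ids, threshold):
    """Group records into clusters using links that score at or above threshold."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for left, right, score in pairs:
        if score >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Keep the smaller ID as representative so the result is deterministic.
                parent[max(root_l, root_r)] = min(root_l, root_r)

    return {rid: find(rid) for rid in record_ids}

# The pairwise scores from the example above.
pairs = [("A", "B", 0.9), ("B", "C", 0.95), ("C", "D", 0.1), ("D", "E", 0.99)]
clusters = cluster_pairs(pairs, ["A", "B", "C", "D", "E"], threshold=0.5)
print(clusters)
```

Here each cluster is labelled by its smallest record ID rather than by 1 and 2, but the grouping is the same: A, B and C share one cluster, and D and E share another, because the C -> D link falls below the threshold.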

The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:

clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)
clusters.as_pandas_dataframe(limit=10)
Completed iteration 1, root rows count 13
Completed iteration 2, root rows count 1
Completed iteration 3, root rows count 0

cluster_id unique_id first_name surname dob city email cluster tf_city
0 0 0 Robert Alan 1971-06-24 NaN robert255@smith.net 0 NaN
1 0 1 Robert Allen 1971-05-24 NaN roberta25@smith.net 0 NaN
2 0 2 Rob Allen 1971-06-24 London roberta25@smith.net 0 0.212792
3 0 3 Robert Alen 1971-06-24 Lonon NaN 0 0.007380
4 4 4 Grace NaN 1997-04-26 Hull grace.kelly52@jones.com 1 0.001230
5 4 5 Grace Kelly 1991-04-26 NaN grace.kelly52@jones.com 1 NaN
6 6 6 Logan pMurphy 1973-08-01 NaN NaN 2 NaN
7 7 7 NaN NaN 2015-03-03 Portsmouth evied56@harris-bailey.net 3 0.017220
8 8 8 NaN Dean 2015-03-03 NaN NaN 3 NaN
9 8 9 Evie Dean 2015-03-03 Pootsmruth evihd56@earris-bailey.net 3 0.001230

The predictions can also be queried directly with SQL, using the physical_name of the predictions table:

sql = f"""
select * 
from {df_predictions.physical_name}
limit 2
"""
linker.query_sql(sql)
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name bf_first_name surname_l surname_r gamma_surname bf_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email bf_email match_key
0 12.728972 0.999853 4 5 Grace Grace 1 85.803382 NaN Kelly -1 1.000000 1997-04-26 1991-04-26 1 92.704873 Hull NaN -1 0.00123 NaN 1.0 1.0 grace.kelly52@jones.com grace.kelly52@jones.com 3 255.30162 0
1 11.216421 0.999580 26 29 Thomas Thomas 1 85.803382 Gabriel Gabriel 3 89.480899 1976-09-15 1976-08-15 1 92.704873 Loodon NaN -1 0.00123 NaN 1.0 1.0 gabriel.t54@nnichls.info NaN -1 1.00000 0