Predicting which records match
In the previous tutorial, we built and estimated a linkage model.
In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match.
```python
from splink import Linker, DuckDBAPI, splink_datasets
import pandas as pd

pd.options.display.max_columns = 1000

db_api = DuckDBAPI()
df = splink_datasets.fake_1000
```
Load estimated model from previous tutorial
```python
import json
import urllib.request

url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"

# urllib.request must be imported explicitly for urlopen to be available
with urllib.request.urlopen(url) as u:
    settings = json.loads(u.read().decode())

linker = Linker(df, settings, db_api=DuckDBAPI())
```
Predicting match weights using the trained model
We use `linker.inference.predict()` to run the model. Under the hood this will:

- Generate all pairwise record comparisons that match at least one of the `blocking_rules_to_generate_predictions`
- Use the rules specified in the `Comparisons` to evaluate the similarity of the input data
- Use the estimated match weights, applying term frequency adjustments where requested, to produce the final `match_weight` and `match_probability` scores

Optionally, a `threshold_match_probability` or `threshold_match_weight` can be provided, which will drop any row where the predicted score is below the threshold.
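The two scores are two views of the same quantity: `match_weight` is the log2 of the overall Bayes factor, so `match_probability = 2**match_weight / (1 + 2**match_weight)`. A quick stdlib-only illustration (these helper functions are our own, not part of the Splink API):

```python
import math

def weight_to_probability(match_weight: float) -> float:
    """Convert a log2 Bayes factor (match weight) to a match probability."""
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

def probability_to_weight(match_probability: float) -> float:
    """Inverse: convert a match probability back to a match weight."""
    return math.log2(match_probability / (1 - match_probability))

# First prediction row below has match_weight ≈ -1.749664
print(round(weight_to_probability(-1.749664), 4))  # 0.2293

# threshold_match_probability=0.2 corresponds to a match weight of log2(0.2/0.8)
print(probability_to_weight(0.2))  # -2.0
```

This also means `threshold_match_probability` and `threshold_match_weight` are interchangeable ways of expressing the same cut-off.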
```python
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
```
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
m values not fully trained
match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | tf_first_name_l | tf_first_name_r | bf_first_name | bf_tf_adj_first_name | surname_l | surname_r | gamma_surname | tf_surname_l | tf_surname_r | bf_surname | bf_tf_adj_surname | dob_l | dob_r | gamma_dob | bf_dob | city_l | city_r | gamma_city | tf_city_l | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | tf_email_l | tf_email_r | bf_email | bf_tf_adj_email | match_key | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.749664 | 0.229211 | 324 | 326 | Kai | Kai | 4 | 0.006017 | 0.006017 | 84.821765 | 0.962892 | None | Turner | -1 | NaN | 0.007326 | 1.000000 | 1.000000 | 2018-12-31 | 2009-11-03 | 0 | 0.460743 | London | London | 1 | 0.212792 | 0.212792 | 10.20126 | 0.259162 | k.t50eherand@z.ncom | None | -1 | 0.001267 | NaN | 1.0 | 1.0 | 0 |
1 | -1.626076 | 0.244695 | 25 | 27 | Gabriel | None | -1 | 0.001203 | NaN | 1.000000 | 1.000000 | Thomas | Thomas | 4 | 0.004884 | 0.004884 | 88.870507 | 1.001222 | 1977-09-13 | 1977-10-17 | 0 | 0.460743 | London | London | 1 | 0.212792 | 0.212792 | 10.20126 | 0.259162 | gabriel.t54@nichols.info | None | -1 | 0.002535 | NaN | 1.0 | 1.0 | 1 |
2 | -1.551265 | 0.254405 | 626 | 629 | geeorGe | George | 1 | 0.001203 | 0.014440 | 4.176727 | 1.000000 | Davidson | Davidson | 4 | 0.007326 | 0.007326 | 88.870507 | 0.667482 | 1999-05-07 | 2000-05-06 | 0 | 0.460743 | Southamptn | None | -1 | 0.001230 | NaN | 1.00000 | 1.000000 | None | gdavidson@johnson-brown.com | -1 | NaN | 0.00507 | 1.0 | 1.0 | 1 |
3 | -1.427735 | 0.270985 | 600 | 602 | Toby | Toby | 4 | 0.004813 | 0.004813 | 84.821765 | 1.203614 | None | None | -1 | NaN | NaN | 1.000000 | 1.000000 | 2003-04-23 | 2013-03-21 | 0 | 0.460743 | London | London | 1 | 0.212792 | 0.212792 | 10.20126 | 0.259162 | toby.d@menhez.com | None | -1 | 0.001267 | NaN | 1.0 | 1.0 | 0 |
4 | -1.427735 | 0.270985 | 599 | 602 | Toby | Toby | 4 | 0.004813 | 0.004813 | 84.821765 | 1.203614 | Haall | None | -1 | 0.001221 | NaN | 1.000000 | 1.000000 | 2003-04-23 | 2013-03-21 | 0 | 0.460743 | London | London | 1 | 0.212792 | 0.212792 | 10.20126 | 0.259162 | None | None | -1 | NaN | NaN | 1.0 | 1.0 | 0 |
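The columns in this output let you decompose each score: `match_weight` is the prior match weight plus the sum of the log2 Bayes factors of each comparison (the `bf_*` columns, multiplied by their term frequency adjustments `bf_tf_adj_*` where present). A stdlib-only sketch checking this against the first two rows above, both of which recover the same prior:

```python
import math

def prior_match_weight(match_weight, bayes_factors):
    """Strip the per-comparison evidence (log2 Bayes factors) out of a
    final match weight, leaving the prior match weight."""
    return match_weight - sum(math.log2(bf) for bf in bayes_factors)

# bf_* and bf_tf_adj_* values copied from the first two prediction rows above
rows = [
    {
        "match_weight": -1.749664,  # row 0: Kai / Kai
        "bayes_factors": [
            84.821765 * 0.962892,  # first_name: bf * tf adjustment
            1.0 * 1.0,             # surname
            0.460743,              # dob (no tf adjustment)
            10.20126 * 0.259162,   # city
            1.0 * 1.0,             # email
        ],
    },
    {
        "match_weight": -1.626076,  # row 1: Gabriel / None
        "bayes_factors": [
            1.0 * 1.0,               # first_name
            88.870507 * 1.001222,    # surname
            0.460743,                # dob
            10.20126 * 0.259162,     # city
            1.0 * 1.0,               # email
        ],
    },
]

for row in rows:
    # Both rows recover the same prior match weight, about -8.39
    print(round(prior_match_weight(row["match_weight"], row["bayes_factors"]), 2))
```

This additive decomposition is what Splink's waterfall charts (covered in the visualisation tutorial) display row by row.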
Clustering
The result of `linker.inference.predict()` is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:
A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99
Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.
With a score threshold of 0.5, the above data could be represented conceptually as:
ID, Cluster ID
A, 1
B, 1
C, 1
D, 2
E, 2
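This pairwise-to-cluster conversion is a graph problem: records are nodes, pairs scoring at or above the threshold are edges, and clusters are connected components. A minimal pure-Python sketch of the idea using union-find (an illustration only, not Splink's actual implementation):

```python
def cluster_pairwise_scores(edges, threshold):
    """Assign each record to a cluster: records joined by any chain of
    pairwise scores at or above the threshold share a cluster id."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, score in edges:
        find(left), find(right)  # register both nodes
        if score >= threshold:
            parent[find(left)] = find(right)  # union the two components

    return {node: find(node) for node in parent}


edges = [
    ("A", "B", 0.90),
    ("B", "C", 0.95),
    ("C", "D", 0.10),
    ("D", "E", 0.99),
]

# A, B and C fall into one cluster; D and E into another
print(cluster_pairwise_scores(edges, threshold=0.5))
```

Note that A and C end up in the same cluster even though the pair (A, C) was never scored: cluster membership is transitive, which is one reason the clustered representation can differ from simply thresholding the pairwise scores.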
The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:
```python
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)

clusters.as_pandas_dataframe(limit=10)
```
Completed iteration 1, root rows count 2
Completed iteration 2, root rows count 0
cluster_id | unique_id | first_name | surname | dob | city | email | cluster | __splink_salt | tf_surname | tf_email | tf_city | tf_first_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | Robert | Alan | 1971-06-24 | None | robert255@smith.net | 0 | 0.012924 | 0.001221 | 0.001267 | NaN | 0.003610 |
1 | 1 | 1 | Robert | Allen | 1971-05-24 | None | roberta25@smith.net | 0 | 0.478756 | 0.002442 | 0.002535 | NaN | 0.003610 |
2 | 1 | 2 | Rob | Allen | 1971-06-24 | London | roberta25@smith.net | 0 | 0.409662 | 0.002442 | 0.002535 | 0.212792 | 0.001203 |
3 | 3 | 3 | Robert | Alen | 1971-06-24 | Lonon | None | 0 | 0.311029 | 0.001221 | NaN | 0.007380 | 0.003610 |
4 | 4 | 4 | Grace | None | 1997-04-26 | Hull | grace.kelly52@jones.com | 1 | 0.486141 | NaN | 0.002535 | 0.001230 | 0.006017 |
5 | 5 | 5 | Grace | Kelly | 1991-04-26 | None | grace.kelly52@jones.com | 1 | 0.434566 | 0.002442 | 0.002535 | NaN | 0.006017 |
6 | 6 | 6 | Logan | pMurphy | 1973-08-01 | None | None | 2 | 0.423760 | 0.001221 | NaN | NaN | 0.012034 |
7 | 7 | 7 | None | None | 2015-03-03 | Portsmouth | evied56@harris-bailey.net | 3 | 0.683689 | NaN | 0.002535 | 0.017220 | NaN |
8 | 8 | 8 | None | Dean | 2015-03-03 | None | None | 3 | 0.553086 | 0.003663 | NaN | NaN | NaN |
9 | 8 | 9 | Evie | Dean | 2015-03-03 | Pootsmruth | evihd56@earris-bailey.net | 3 | 0.753070 | 0.003663 | 0.001267 | 0.001230 | 0.008424 |
You can also query the underlying predictions table directly with SQL, using `linker.misc.query_sql`:

```python
sql = f"""
select *
from {df_predictions.physical_name}
limit 2
"""

linker.misc.query_sql(sql)
```
match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | tf_first_name_l | tf_first_name_r | bf_first_name | bf_tf_adj_first_name | surname_l | surname_r | gamma_surname | tf_surname_l | tf_surname_r | bf_surname | bf_tf_adj_surname | dob_l | dob_r | gamma_dob | bf_dob | city_l | city_r | gamma_city | tf_city_l | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | tf_email_l | tf_email_r | bf_email | bf_tf_adj_email | match_key | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.749664 | 0.229211 | 324 | 326 | Kai | Kai | 4 | 0.006017 | 0.006017 | 84.821765 | 0.962892 | None | Turner | -1 | NaN | 0.007326 | 1.000000 | 1.000000 | 2018-12-31 | 2009-11-03 | 0 | 0.460743 | London | London | 1 | 0.212792 | 0.212792 | 10.20126 | 0.259162 | k.t50eherand@z.ncom | None | -1 | 0.001267 | NaN | 1.0 | 1.0 | 0 |
1 | -1.626076 | 0.244695 | 25 | 27 | Gabriel | None | -1 | 0.001203 | NaN | 1.000000 | 1.000000 | Thomas | Thomas | 4 | 0.004884 | 0.004884 | 88.870507 | 1.001222 | 1977-09-13 | 1977-10-17 | 0 | 0.460743 | London | London | 1 | 0.212792 | 0.212792 | 10.20126 | 0.259162 | gabriel.t54@nichols.info | None | -1 | 0.002535 | NaN | 1.0 | 1.0 | 1 |
Further Reading
For more on the prediction tools in Splink, please refer to the Prediction API documentation.
Next steps
Now that we have made predictions with a model, we can move on to visualising the results to understand how the model is working.