Predicting which records match
In the previous tutorial, we built and estimated a linkage model.
In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match.
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd
pd.options.display.max_columns = 1000
df = pd.read_csv("./data/fake_1000.csv")
Load estimated model from previous tutorial
linker = DuckDBLinker(df)
linker.load_settings_from_json("./demo_settings/saved_model_from_demo.json")
Predicting match weights using the trained model
We use linker.predict() to run the model.
Under the hood this will:
- Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions
- Use the rules specified in the Comparisons to evaluate the similarity of the input data
- Use the estimated match weights, applying term frequency adjustments where requested, to produce the final match_weight and match_probability scores
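The three steps above can be sketched in plain Python to show how they fit together. This is a conceptual illustration only, not Splink's implementation: the toy records, the two blocking rules, the per-column match weights and the prior are all made-up values, each comparison has just an agree and a disagree level, and term frequency adjustments are left out. It assumes the final match weight is the sum of the prior weight and the per-comparison weights, all on a log2 scale.

```python
from itertools import combinations

# Toy input records (hypothetical data, not taken from fake_1000.csv).
records = [
    {"id": 1, "first_name": "john", "surname": "smith", "dob": "1990-01-01"},
    {"id": 2, "first_name": "john", "surname": "smyth", "dob": "1990-01-01"},
    {"id": 3, "first_name": "mary", "surname": "jones", "dob": "1985-06-30"},
    {"id": 4, "first_name": "john", "surname": "brown", "dob": "1972-11-15"},
]

# Step 1: blocking. A pair is kept only if at least one rule is satisfied.
blocking_rules = [
    lambda left, right: left["first_name"] == right["first_name"],
    lambda left, right: left["dob"] == right["dob"],
]

# Step 2: comparisons. Illustrative match weights (log2 Bayes factors) for
# agreement and disagreement on each column; a trained model estimates these.
match_weights = {
    "first_name": {"agree": 3.0, "disagree": -2.0},
    "surname": {"agree": 4.0, "disagree": -3.0},
    "dob": {"agree": 5.0, "disagree": -4.0},
}
prior_match_weight = -6.0  # assumed log2 prior odds that a random pair matches


def score_pair(left, right):
    """Step 3: sum the weights, then convert the total to a probability."""
    weight = prior_match_weight
    for column, levels in match_weights.items():
        level = "agree" if left[column] == right[column] else "disagree"
        weight += levels[level]
    probability = 2**weight / (1 + 2**weight)
    return weight, probability


predictions = []
for left, right in combinations(records, 2):
    if any(rule(left, right) for rule in blocking_rules):
        weight, probability = score_pair(left, right)
        predictions.append((left["id"], right["id"], weight, probability))

for row in predictions:
    print(row)
```

Only three of the six possible pairs are scored here, because the other three satisfy neither blocking rule. That is the purpose of blocking: it avoids comparing every record with every other record.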
Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.
df_predictions = linker.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
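The two threshold arguments express the same cut-off on different scales. As a minimal sketch, assuming a match weight is the base-2 logarithm of the odds of a match, the helper functions below (hypothetical names, not part of Splink) convert between the two:

```python
import math


def probability_to_match_weight(probability):
    """Convert a match probability in (0, 1) to a log2-odds match weight."""
    return math.log2(probability / (1 - probability))


def match_weight_to_probability(weight):
    """Convert a log2-odds match weight back to a match probability."""
    odds = 2**weight
    return odds / (1 + odds)


# The 0.2 probability threshold used above, expressed as a match weight.
print(probability_to_match_weight(0.2))  # -2.0
# A match weight of 0 means even odds.
print(match_weight_to_probability(0.0))  # 0.5
```

Under that assumption, a probability threshold of 0.2 and a match weight threshold of -2 keep the same rows.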
Clustering
The result of linker.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:
A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99
Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.
With a score threshold of 0.5, the above data could be represented conceptually as:
ID, Cluster ID
A, 1
B, 1
C, 1
D, 2
E, 2
The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:
clusters = linker.cluster_pairwise_predictions_at_threshold(df_predictions, threshold_match_probability=0.5)
clusters.as_pandas_dataframe(limit=10)
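To make the conversion concrete, here is a minimal pure-Python sketch of connected components applied to the toy scores from the example above. It uses a union-find structure, which is one common way to compute connected components and not necessarily how Splink implements it. The function name is hypothetical, and treating scores at or above the threshold as links is an assumption.

```python
def connected_components(edges, threshold):
    """Group nodes linked by edges whose score is at or above the threshold."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right, score in edges:
        # Calling find on both ends registers every record, linked or not.
        root_left, root_right = find(left), find(right)
        if score >= threshold and root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


edges = [
    ("A", "B", 0.9),
    ("B", "C", 0.95),
    ("C", "D", 0.1),
    ("D", "E", 0.99),
]
print(connected_components(edges, threshold=0.5))
# [['A', 'B', 'C'], ['D', 'E']]
```

A and C end up in the same cluster even though they were never compared directly, because both link to B. The low-scoring C to D comparison is the only thing separating the two clusters, so lowering the threshold to 0.1 would merge all five records into one.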
sql = f"""
select *
from {df_predictions.physical_name}
limit 2
"""
linker.query_sql(sql)