Deterministic dedupe
Linking a dataset of real historical persons with Deterrministic RulesΒΆ
While Splink is primarily a tool for probabilistic records linkage, it includes functionality to perform deterministic (i.e. rules based) linkage.
Significant work has gone into optimising the performance of rules based matching, so Splink is likely to be significantly faster than writing the basic SQL by hand.
In this example, we deduplicate a 50k row dataset based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors introduced. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons
.
# Uncomment and run this cell if you're running in Google Colab.
# !pip install splink
import pandas as pd
from splink import splink_datasets
pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()
unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q2296770-1 | Q2296770 | thomas clifford, 1st baron clifford of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
1 | Q2296770-2 | Q2296770 | thomas of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
2 | Q2296770-3 | Q2296770 | tom 1st baron clifford of chudleigh | tom chudleigh | tom | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
3 | Q2296770-4 | Q2296770 | thomas 1st chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8hu | None | politician |
4 | Q2296770-5 | Q2296770 | thomas clifford, 1st baron chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | None | politician |
When defining the settings object, specity your deterministic rules in the blocking_rules_to_generate_predictions
key.
For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define comparisons
nor any other parameters required for model training in a probabilistic model.
Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:
from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)
db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
table_or_tables=df,
blocking_rules=[
block_on("first_name", "surname", "dob"),
block_on("surname", "dob", "postcode_fake"),
block_on("first_name", "dob", "occupation"),
],
db_api=db_api,
link_type="dedupe_only",
)
from splink import Linker, SettingsCreator
settings = SettingsCreator(
link_type="dedupe_only",
blocking_rules_to_generate_predictions=[
block_on("first_name", "surname", "dob"),
block_on("surname", "dob", "postcode_fake"),
block_on("first_name", "dob", "occupation"),
],
retain_intermediate_calculation_columns=True,
)
linker = Linker(df, settings, db_api=db_api)
The results of the linkage can be viewed with the deterministic_link
function.
df_predict = linker.inference.deterministic_link()
df_predict.as_pandas_dataframe().head()
unique_id_l | unique_id_r | occupation_l | occupation_r | first_name_l | first_name_r | dob_l | dob_r | surname_l | surname_r | postcode_fake_l | postcode_fake_r | match_key | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q55455287-12 | Q55455287-2 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2uu | 0 |
1 | Q55455287-12 | Q55455287-3 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2uu | 0 |
2 | Q55455287-12 | Q55455287-4 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2sz | 0 |
3 | Q55455287-12 | Q55455287-5 | None | None | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2ug | 0 |
4 | Q55455287-12 | Q55455287-6 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | None | 0 |
Which can be used to generate clusters.
Note, for deterministic linkage, each comparison has been assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1
in the cluster_pairwise_predictions_at_threshold
function.
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
df_predict
)
Completed iteration 1, root rows count 94
Completed iteration 2, root rows count 10
Completed iteration 3, root rows count 0
clusters.as_pandas_dataframe(limit=5)
cluster_id | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation | __splink_salt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q16025107-1 | Q5497940-9 | Q5497940 | frederick hall | frederick hall | frederick | hall | 1855-01-01 | bristol, city of | bs11 9pn | None | None | 0.002739 |
1 | Q1149445-1 | Q1149445-9 | Q1149445 | earl egerton | earl egerton | earl | egerton | 1800-01-01 | westminster | w1d 2hf | None | None | 0.991459 |
2 | Q20664532-1 | Q21466387-2 | Q21466387 | harry brooker | harry brooker | harry | brooker | 1848-01-01 | plymouth | pl4 9hx | male | painter | 0.506127 |
3 | Q1124636-1 | Q1124636-12 | Q1124636 | tom stapleton | tom stapleton | tom | stapleton | 1535-01-01 | None | bn6 9na | male | theologian | 0.612694 |
4 | Q18508292-1 | Q21466711-4 | Q21466711 | harry s0ence | harry s0ence | harry | s0ence | 1860-01-01 | london | se1 7pb | male | painter | 0.488917 |
These results can then be passed into the Cluster Studio Dashboard
.
linker.visualisations.cluster_studio_dashboard(
df_predict,
clusters,
"dashboards/50k_deterministic_cluster.html",
sampling_method="by_cluster_size",
overwrite=True,
)
from IPython.display import IFrame
IFrame(src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200)