Linking a dataset of real historical persons with Deterministic Rules

While Splink is primarily a tool for probabilistic record linkage, it also includes functionality to perform deterministic (i.e. rules-based) linkage.

Significant work has gone into optimising the performance of rules-based matching, so Splink is likely to be significantly faster than writing the equivalent SQL by hand.

In this example, we deduplicate a 50k row dataset based on historical persons scraped from Wikidata. Duplicate records have been introduced with a variety of errors. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons.

# Uncomment and run this cell if you're running in Google Colab.
# !pip install splink
import pandas as pd

from splink import splink_datasets

pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()
|   | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
|---|-----------|---------|-----------|-------------------|------------|---------|-----|-------------|---------------|--------|------------|
| 0 | Q2296770-1 | Q2296770 | thomas clifford, 1st baron clifford of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 1 | Q2296770-2 | Q2296770 | thomas of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 2 | Q2296770-3 | Q2296770 | tom 1st baron clifford of chudleigh | tom chudleigh | tom | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 3 | Q2296770-4 | Q2296770 | thomas 1st chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8hu | None | politician |
| 4 | Q2296770-5 | Q2296770 | thomas clifford, 1st baron chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | None | politician |

When defining the settings object, specify your deterministic rules in the blocking_rules_to_generate_predictions key.

For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define comparisons or any of the other parameters required for training a probabilistic model.
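
For intuition, each block_on rule is simply an exact-match condition on the listed columns. Below is a minimal sketch of what one of the rules used in this example would look like if written by hand against DuckDB; this is for illustration only and is not how Splink executes the rules internally.

import duckdb

# Hand-rolled equivalent of block_on("first_name", "surname", "dob"):
# pair up rows that agree exactly on all three columns.
# DuckDB can query the local pandas DataFrame `df` by name.
hand_rolled_pairs = duckdb.sql(
    """
    SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r
    FROM df AS l
    JOIN df AS r
      ON l.first_name = r.first_name
     AND l.surname    = r.surname
     AND l.dob        = r.dob
    WHERE l.unique_id < r.unique_id  -- keep each pair once, exclude self-pairs
    """
).df()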

Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:

from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=[
        block_on("first_name", "surname", "dob"),
        block_on("surname", "dob", "postcode_fake"),
        block_on("first_name", "dob", "occupation"),
    ],
    db_api=db_api,
    link_type="dedupe_only",
)
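
The chart above shows cumulative counts, crediting each rule only with the comparisons not already covered by earlier rules. If you want the count for a single rule in isolation, the blocking_analysis module also provides count_comparisons_from_blocking_rule; the sketch below assumes it takes the same style of arguments as the chart function above.

from splink.blocking_analysis import count_comparisons_from_blocking_rule

# Comparison counts for one blocking rule considered on its own
count_comparisons_from_blocking_rule(
    table_or_tables=df,
    blocking_rule=block_on("first_name", "surname", "dob"),
    link_type="dedupe_only",
    db_api=db_api,
)
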
from splink import Linker, SettingsCreator

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname", "dob"),
        block_on("surname", "dob", "postcode_fake"),
        block_on("first_name", "dob", "occupation"),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=db_api)

The results of the linkage can be viewed with the deterministic_link function.

df_predict = linker.inference.deterministic_link()
df_predict.as_pandas_dataframe().head()
|   | unique_id_l | unique_id_r | occupation_l | occupation_r | first_name_l | first_name_r | dob_l | dob_r | surname_l | surname_r | postcode_fake_l | postcode_fake_r | match_key |
|---|-------------|-------------|--------------|--------------|--------------|--------------|-------|-------|-----------|-----------|-----------------|-----------------|-----------|
| 0 | Q55455287-12 | Q55455287-2 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2uu | 0 |
| 1 | Q55455287-12 | Q55455287-3 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2uu | 0 |
| 2 | Q55455287-12 | Q55455287-4 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2sz | 0 |
| 3 | Q55455287-12 | Q55455287-5 | None | None | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | ta4 2ug | 0 |
| 4 | Q55455287-12 | Q55455287-6 | None | writer | jaido | jaido | 1836-01-01 | 1836-01-01 | morata | morata | ta4 2ug | None | 0 |
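
The match_key column records which blocking rule generated each pair, indexed by the rule's position in blocking_rules_to_generate_predictions (0 for the first rule, and so on); pairs already produced by an earlier rule are not generated again. As a quick, hypothetical follow-up, you can count how many pairs each rule contributed:

# Number of pairs produced by each of the three deterministic rules
df_predict.as_pandas_dataframe()["match_key"].value_counts()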

These pairwise predictions can then be used to generate clusters.

Note that for deterministic linkage, each comparison is assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1 in the cluster_pairwise_predictions_at_threshold function.

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=1
)
Completed iteration 1, root rows count 94
Completed iteration 2, root rows count 10
Completed iteration 3, root rows count 0
clusters.as_pandas_dataframe(limit=5)
|   | cluster_id | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation | __splink_salt |
|---|------------|-----------|---------|-----------|-------------------|------------|---------|-----|-------------|---------------|--------|------------|---------------|
| 0 | Q16025107-1 | Q5497940-9 | Q5497940 | frederick hall | frederick hall | frederick | hall | 1855-01-01 | bristol, city of | bs11 9pn | None | None | 0.002739 |
| 1 | Q1149445-1 | Q1149445-9 | Q1149445 | earl egerton | earl egerton | earl | egerton | 1800-01-01 | westminster | w1d 2hf | None | None | 0.991459 |
| 2 | Q20664532-1 | Q21466387-2 | Q21466387 | harry brooker | harry brooker | harry | brooker | 1848-01-01 | plymouth | pl4 9hx | male | painter | 0.506127 |
| 3 | Q1124636-1 | Q1124636-12 | Q1124636 | tom stapleton | tom stapleton | tom | stapleton | 1535-01-01 | None | bn6 9na | male | theologian | 0.612694 |
| 4 | Q18508292-1 | Q21466711-4 | Q21466711 | harry s0ence | harry s0ence | harry | s0ence | 1860-01-01 | london | se1 7pb | male | painter | 0.488917 |
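
Before building a dashboard it can be useful to sanity-check the clusters themselves. A small, hypothetical follow-up that looks at the distribution of cluster sizes:

# How many clusters of each size were produced
clusters_df = clusters.as_pandas_dataframe()
clusters_df.groupby("cluster_id").size().value_counts().sort_index()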

These results can then be passed into the Cluster Studio Dashboard.

linker.visualisations.cluster_studio_dashboard(
    df_predict,
    clusters,
    "dashboards/50k_deterministic_cluster.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)

from IPython.display import IFrame

IFrame(src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200)