Deterministic dedupe

Linking a dataset of real historical persons with Deterrministic Rules¶

While Splink is primarily a tool for probabilistic records linkage, it includes functionality to perform deterministic (i.e. rules based) linkage.

Significant work has gone into optimising the performance of rules based matching, so Splink is likely to be significantly faster than writing the basic SQL by hand.

In this example, we deduplicate a 50k row dataset based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors introduced. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons.

# Uncomment and run this cell if you're running in Google Colab.
# !pip install splink

import pandas as pd

from splink import splink_datasets

pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()

	unique_id	cluster	full_name	first_and_surname	first_name	surname	dob	birth_place	postcode_fake	gender	occupation
0	Q2296770-1	Q2296770	thomas clifford, 1st baron clifford of chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	male	politician
1	Q2296770-2	Q2296770	thomas of chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	male	politician
2	Q2296770-3	Q2296770	tom 1st baron clifford of chudleigh	tom chudleigh	tom	chudleigh	1630-08-01	devon	tq13 8df	male	politician
3	Q2296770-4	Q2296770	thomas 1st chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8hu	None	politician
4	Q2296770-5	Q2296770	thomas clifford, 1st baron chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	None	politician

When defining the settings object, specity your deterministic rules in the blocking_rules_to_generate_predictions key.

For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define comparisons nor any other parameters required for model training in a probabilistic model.

Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:

from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=[
        block_on("first_name", "surname", "dob"),
        block_on("surname", "dob", "postcode_fake"),
        block_on("first_name", "dob", "occupation"),
    ],
    db_api=db_api,
    link_type="dedupe_only",
)

from splink import Linker, SettingsCreator

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname", "dob"),
        block_on("surname", "dob", "postcode_fake"),
        block_on("first_name", "dob", "occupation"),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=db_api)

The results of the linkage can be viewed with the deterministic_link function.

df_predict = linker.inference.deterministic_link()
df_predict.as_pandas_dataframe().head()

	unique_id_l	unique_id_r	occupation_l	occupation_r	first_name_l	first_name_r	dob_l	dob_r	surname_l	surname_r	postcode_fake_l	postcode_fake_r
0	Q55455287-12	Q55455287-2	None	writer	jaido	jaido	1836-01-01	1836-01-01	morata	morata	ta4 2ug	ta4 2uu
1	Q55455287-12	Q55455287-3	None	writer	jaido	jaido	1836-01-01	1836-01-01	morata	morata	ta4 2ug	ta4 2uu
2	Q55455287-12	Q55455287-4	None	writer	jaido	jaido	1836-01-01	1836-01-01	morata	morata	ta4 2ug	ta4 2sz
3	Q55455287-12	Q55455287-5	None	None	jaido	jaido	1836-01-01	1836-01-01	morata	morata	ta4 2ug	ta4 2ug
4	Q55455287-12	Q55455287-6	None	writer	jaido	jaido	1836-01-01	1836-01-01	morata	morata	ta4 2ug	None

Which can be used to generate clusters.

Note, for deterministic linkage, each comparison has been assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1 in the cluster_pairwise_predictions_at_threshold function.

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict
)

Completed iteration 1, root rows count 94


Completed iteration 2, root rows count 10


Completed iteration 3, root rows count 0

clusters.as_pandas_dataframe(limit=5)

	cluster_id	unique_id	cluster	full_name	first_and_surname	first_name	surname	dob	birth_place	postcode_fake	gender	occupation	__splink_salt
0	Q16025107-1	Q5497940-9	Q5497940	frederick hall	frederick hall	frederick	hall	1855-01-01	bristol, city of	bs11 9pn	None	None	0.002739
1	Q1149445-1	Q1149445-9	Q1149445	earl egerton	earl egerton	earl	egerton	1800-01-01	westminster	w1d 2hf	None	None	0.991459
2	Q20664532-1	Q21466387-2	Q21466387	harry brooker	harry brooker	harry	brooker	1848-01-01	plymouth	pl4 9hx	male	painter	0.506127
3	Q1124636-1	Q1124636-12	Q1124636	tom stapleton	tom stapleton	tom	stapleton	1535-01-01	None	bn6 9na	male	theologian	0.612694
4	Q18508292-1	Q21466711-4	Q21466711	harry s0ence	harry s0ence	harry	s0ence	1860-01-01	london	se1 7pb	male	painter	0.488917

These results can then be passed into the Cluster Studio Dashboard.

linker.visualisations.cluster_studio_dashboard(
    df_predict,
    clusters,
    "dashboards/50k_deterministic_cluster.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)

from IPython.display import IFrame

IFrame(src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200)