Deterministic dedupe

Linking a dataset of real historical persons with Deterministic Rules

While Splink is primarily a tool for probabilistic record linkage, it also provides functionality to perform deterministic (i.e. rules-based) linkage.

In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from Wikidata, with duplicate records introduced that contain a variety of errors. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons.

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
import altair as alt
alt.renderers.enable('html')

import pandas as pd 
pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()
| | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q2296770-1 | Q2296770 | thomas clifford, 1st baron clifford of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 1 | Q2296770-2 | Q2296770 | thomas of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 2 | Q2296770-3 | Q2296770 | tom 1st baron clifford of chudleigh | tom chudleigh | tom | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 3 | Q2296770-4 | Q2296770 | thomas 1st chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8hu | None | politician |
| 4 | Q2296770-5 | Q2296770 | thomas clifford, 1st baron chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | None | politician |

When defining the settings object, simply pass your deterministic rules into blocking_rules_to_generate_predictions.

For a deterministic linkage, the entire linkage methodology is based on these rules, so there is no need to define comparisons or any of the other parameters required to train a probabilistic model.

from splink.duckdb.blocking_rule_library import block_on

# Simple settings dictionary will be used for exploratory analysis
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name", "surname", "dob"]),
        block_on(["surname", "dob", "postcode_fake"]),
        block_on(["first_name", "dob", "occupation"]),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}
linker = DuckDBLinker(df, settings)

linker.debug_mode = False
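
To make the idea of a deterministic rule concrete, the sketch below mimics in plain Python what a set of exact-match rules does: two records are linked when they agree on every column of at least one rule. This is not Splink's implementation (Splink generates SQL and runs it in the backend). The function name `deterministic_pairs` and the sample rows are hypothetical, and treating a missing value as never equal to anything is an assumption modelled on SQL equality semantics.

```python
from itertools import combinations

# Each rule is a list of columns that must all agree exactly.
RULES = [
    ["first_name", "surname", "dob"],
    ["surname", "dob", "postcode_fake"],
    ["first_name", "dob", "occupation"],
]


def deterministic_pairs(records, rules):
    """Return (id_l, id_r, match_key) for every pair satisfying at least one rule.

    match_key is the index of the first rule the pair satisfies.
    """
    pairs = []
    for left, right in combinations(records, 2):
        for match_key, cols in enumerate(rules):
            # Assumption: a missing value (None) never equals anything, as in SQL.
            if all(left[c] is not None and left[c] == right[c] for c in cols):
                pairs.append((left["unique_id"], right["unique_id"], match_key))
                break  # record the pair once, under the first matching rule
    return pairs


# Hypothetical sample rows, loosely modelled on the dataset shown earlier.
records = [
    {"unique_id": "a-1", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": "politician"},
    {"unique_id": "a-2", "first_name": "tom", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": None},
    {"unique_id": "b-1", "first_name": "john", "surname": "hare",
     "dob": "1857-05-31", "postcode_fake": "ct4 6jr", "occupation": "religious"},
]

pairs = deterministic_pairs(records, RULES)
```

Here the first two rows disagree on first_name, so the first rule fails, but they agree on surname, dob and postcode_fake, so the pair is linked by the second rule (match_key 1). Comparing all pairs is quadratic in the number of records and is only suitable for illustration.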

Once the linker object is defined, you can profile the dataset columns.

linker.profile_columns(
    ["first_name", "surname", "substr(dob, 1,4)"], top_n=10, bottom_n=5
)
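
As a rough illustration of what such a profile summarises, the following sketch counts value frequencies for a derived expression (the birth year, i.e. the first four characters of dob) and lists the most and least common values. It uses only the standard library and hypothetical sample values. The helper `value_profile` is an invented name and is not how profile_columns is implemented.

```python
from collections import Counter


def value_profile(values, top_n=10, bottom_n=5):
    """Return the top_n most common and bottom_n least common non-null values."""
    counts = Counter(v for v in values if v is not None)
    ranked = counts.most_common()  # sorted by descending frequency
    top = ranked[:top_n]
    bottom = ranked[-bottom_n:] if bottom_n else []
    return top, bottom


# Hypothetical sample of dob strings in YYYY-MM-DD form.
dobs = ["1630-08-01", "1630-08-01", "1698-01-01", "1857-05-31", None, "1857-01-01"]

# Plain-Python equivalent of the SQL expression substr(dob, 1, 4).
years = [d[:4] if d is not None else None for d in dobs]

top, bottom = value_profile(years, top_n=2, bottom_n=1)
```

Skewed distributions matter for deterministic rules: a rule built on a very common value (for example a frequent surname) links many more record pairs than one built on rare values.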

In a deterministic linkage, the blocking rules chart shows how many pairwise comparisons (i.e. matched record pairs) each deterministic rule generates, accumulated in the order the rules are listed.

linker.cumulative_num_comparisons_from_blocking_rules_chart()
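
The cumulative counting can be sketched as follows, on the assumption that a pair already produced by an earlier rule is not counted again for a later rule. The helper `cumulative_new_pairs` and the sample pairs are hypothetical and only illustrate the arithmetic behind such a chart.

```python
from itertools import accumulate


def cumulative_new_pairs(pairs_per_rule):
    """Given one set of matched pairs per rule (in rule order), return
    a (new_pairs_added, running_total) tuple for each rule."""
    seen = set()
    added = []
    for pairs in pairs_per_rule:
        new = set(pairs) - seen  # pairs not already produced by an earlier rule
        added.append(len(new))
        seen |= new
    return list(zip(added, accumulate(added)))


# Hypothetical pairs matched by three rules, in the order the rules are listed.
pairs_per_rule = [
    {("a-1", "a-2"), ("a-1", "a-3")},
    {("a-1", "a-2"), ("a-2", "a-3")},
    {("a-1", "a-3")},
]

counts = cumulative_new_pairs(pairs_per_rule)
```

In this toy input the second rule contributes only one new pair and the third contributes none, which is the kind of diminishing return the chart makes visible.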

The results of the linkage can be obtained with the deterministic_link method.

df_predict = linker.deterministic_link()
df_predict.as_pandas_dataframe().head()
| | unique_id_l | unique_id_r | first_name_l | first_name_r | surname_l | surname_r | occupation_l | occupation_r | postcode_fake_l | postcode_fake_r | dob_l | dob_r | match_key | match_probability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q2296770-1 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
| 1 | Q2296770-2 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
| 2 | Q2296770-3 | Q2296770-7 | tom | tom | chudleigh | chudleigh | politician | NaN | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
| 3 | Q2296770-4 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8hu | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
| 4 | Q2296770-5 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |

These pairwise results can then be used to generate clusters.

Note that, for deterministic linkage, every comparison is assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1 in the cluster_pairwise_predictions_at_threshold method.

clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=1)
Completed iteration 1, root rows count 94
Completed iteration 2, root rows count 10
Completed iteration 3, root rows count 0
clusters.as_pandas_dataframe(limit=5)
| | cluster_id | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q33436042-4 | Q33436042-4 | Q33436042 | charlie louis wiliam merlin | charlie merlin | charlie | merlin | 1822-01-01 | radstock | NaN | male | NaN |
| 1 | Q7791916-15 | Q7791916-15 | Q7791916 | tom longman | tom longman | tom | longman | 1698-01-01 | bristol | bs5 6rq | male | publisher |
| 2 | Q97991018-1 | Q97991018-3 | Q97991018 | john hare | john hare | john | hare | 1857-05-31 | canterbury | ct4 6jr | male | religious |
| 3 | Q363965-1 | Q363965-3 | Q363965 | robert t. a. innes | robert innes | robert | innes | 1861-11-10 | edinburgh | eh3 5jz | male | astronomer |
| 4 | Q457399-4 | Q457399-4 | Q457399 | charles eliot | charles eliot | charles | eliot | 1862-01-09 | sibford gower | ox15 6pr | male | NaN |
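
Clustering pairwise links amounts to finding the connected components of a graph whose nodes are records and whose edges are the matched pairs. The sketch below does this with a small union-find structure in plain Python. It is a swapped-in technique for illustration, not Splink's own procedure (the log lines above suggest Splink iterates until no rows change). The function `cluster_pairs`, the sample ids, and the choice of the smallest member id as the cluster id are all assumptions of this sketch.

```python
def cluster_pairs(node_ids, pairs):
    """Map each node id to a cluster id, where clusters are the connected
    components of the graph whose edges are the given pairs."""
    parent = {n: n for n in node_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: keep the smaller id as root so cluster ids are deterministic.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {n: find(n) for n in node_ids}


# Hypothetical ids: a-1, a-2 and a-3 are chained together, b-1 is unlinked.
ids = ["a-1", "a-2", "a-3", "b-1"]
links = [("a-1", "a-2"), ("a-2", "a-3")]

cluster_of = cluster_pairs(ids, links)
```

Note that a-1 and a-3 end up in the same cluster even though no rule linked them directly. This transitivity is why an overly loose deterministic rule can merge unrelated records into one large cluster.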

These results can then be passed into the Cluster Studio Dashboard.

linker.cluster_studio_dashboard(df_predict, clusters, "dashboards/50k_deterministic_cluster.html", sampling_method='by_cluster_size', overwrite=True)

from IPython.display import IFrame

IFrame(
    src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200
)