Deterministic dedupe
Linking a dataset of real historical persons with Deterministic Rules
While Splink is primarily a tool for probabilistic record linkage, it also supports deterministic (i.e. rules-based) linkage.
In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from Wikidata. Duplicate records have been introduced with a variety of errors. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons.
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
import altair as alt
alt.renderers.enable('html')
import pandas as pd
pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()
| | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q2296770-1 | Q2296770 | thomas clifford, 1st baron clifford of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
1 | Q2296770-2 | Q2296770 | thomas of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
2 | Q2296770-3 | Q2296770 | tom 1st baron clifford of chudleigh | tom chudleigh | tom | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
3 | Q2296770-4 | Q2296770 | thomas 1st chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8hu | None | politician |
4 | Q2296770-5 | Q2296770 | thomas clifford, 1st baron chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | None | politician |
When defining the settings object, pass your deterministic rules to blocking_rules_to_generate_predictions.
For a deterministic linkage, the entire linkage methodology is based on these rules, so there is no need to define comparisons or any of the other parameters required to train a probabilistic model.
from splink.duckdb.blocking_rule_library import block_on
# Simple settings dictionary will be used for exploratory analysis
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name", "surname", "dob"]),
        block_on(["surname", "dob", "postcode_fake"]),
        block_on(["first_name", "dob", "occupation"]),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}
linker = DuckDBLinker(df, settings)
linker.debug_mode = False
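To make the semantics of these rules concrete, here is a minimal pure-Python sketch of what a set of deterministic rules does conceptually. It is not Splink's implementation (Splink generates SQL joins); the toy records and the helper names `rule_matches` and `deterministic_pairs` are hypothetical. It assumes each rule is an exact-equality match on all listed columns, that missing values never match (as with SQL NULL comparisons), and that a pair is attributed to the first rule it satisfies.

```python
from itertools import combinations

# Hypothetical toy records, loosely modelled on the rows shown earlier
records = [
    {"unique_id": "a-1", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": "politician"},
    {"unique_id": "a-2", "first_name": "tom", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": "politician"},
    {"unique_id": "a-3", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8hu", "occupation": None},
    {"unique_id": "b-1", "first_name": "john", "surname": "hare",
     "dob": "1857-05-31", "postcode_fake": "ct4 6jr", "occupation": "religious"},
]

# The same column combinations as the block_on rules in the settings
rules = [
    ["first_name", "surname", "dob"],
    ["surname", "dob", "postcode_fake"],
    ["first_name", "dob", "occupation"],
]


def rule_matches(left, right, columns):
    """True if both records agree exactly on every column and none is missing."""
    return all(left[c] is not None and left[c] == right[c] for c in columns)


def deterministic_pairs(records, rules):
    """Return (id_l, id_r, match_key) for each pair satisfying at least one rule."""
    pairs = []
    for left, right in combinations(records, 2):
        for match_key, columns in enumerate(rules):
            if rule_matches(left, right, columns):
                # Attribute the pair to the first rule that matches it
                pairs.append((left["unique_id"], right["unique_id"], match_key))
                break
    return pairs


pairs = deterministic_pairs(records, rules)
print(pairs)
```

In this sketch, a-1 and a-3 are linked by the first rule and a-1 and a-2 by the second, while a-2 and a-3 satisfy no rule directly.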
Once the linker object is defined, you can profile the dataset columns.
linker.profile_columns(
    ["first_name", "surname", "substr(dob, 1,4)"], top_n=10, bottom_n=5
)
In a deterministic linkage, the blocking rules chart shows, cumulatively, how many record pairs have been matched by each of the deterministic rules.
linker.cumulative_num_comparisons_from_blocking_rules_chart()
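The idea behind a cumulative count can be sketched in plain Python. This is an illustration only, under the assumption that each rule is credited solely with pairs not already produced by an earlier rule; the toy records and the helper `new_pairs_per_rule` are hypothetical and are not part of Splink.

```python
from itertools import accumulate, combinations

# Hypothetical toy records; None marks a missing value
records = [
    {"unique_id": "r1", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": "politician"},
    {"unique_id": "r2", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": "politician"},
    {"unique_id": "r3", "first_name": "tom", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": None},
    {"unique_id": "r4", "first_name": "thomas", "surname": "smith",
     "dob": "1630-08-01", "postcode_fake": "ab1 2cd", "occupation": "politician"},
]

rules = [
    ["first_name", "surname", "dob"],
    ["surname", "dob", "postcode_fake"],
    ["first_name", "dob", "occupation"],
]


def new_pairs_per_rule(records, rules):
    """Count, for each rule in order, the pairs it adds beyond earlier rules."""
    seen = set()
    counts = []
    for columns in rules:
        added = 0
        for left, right in combinations(records, 2):
            key = (left["unique_id"], right["unique_id"])
            values_l = [left[c] for c in columns]
            values_r = [right[c] for c in columns]
            # Missing values never match; already-seen pairs are not recounted
            if None not in values_l and values_l == values_r and key not in seen:
                seen.add(key)
                added += 1
        counts.append(added)
    return counts


counts = new_pairs_per_rule(records, rules)
cumulative = list(accumulate(counts))
print(counts, cumulative)
```

Here the pair (r1, r2) satisfies all three rules but is counted only once, under the first rule.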
The results of the linkage can be generated with the deterministic_link function and then viewed as a table of pairwise matches.
df_predict = linker.deterministic_link()
df_predict.as_pandas_dataframe().head()
| | unique_id_l | unique_id_r | first_name_l | first_name_r | surname_l | surname_r | occupation_l | occupation_r | postcode_fake_l | postcode_fake_r | dob_l | dob_r | match_key | match_probability |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q2296770-1 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
1 | Q2296770-2 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
2 | Q2296770-3 | Q2296770-7 | tom | tom | chudleigh | chudleigh | politician | NaN | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
3 | Q2296770-4 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8hu | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
4 | Q2296770-5 | Q2296770-6 | thomas | thomas | chudleigh | chudleigh | politician | politician | tq13 8df | tq13 8df | 1630-08-01 | 1630-08-01 | 0 | 1.0 |
These pairwise matches can be used to generate clusters.
Note that, for deterministic linkage, each comparison has been assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1 in the cluster_pairwise_predictions_at_threshold function.
clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=1)
Completed iteration 1, root rows count 94
Completed iteration 2, root rows count 10
Completed iteration 3, root rows count 0
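Conceptually, clustering pairwise matches amounts to finding the connected components of a graph whose nodes are records and whose edges are matched pairs. The sketch below illustrates this with a small union-find structure. It is not how Splink computes clusters internally (the log above suggests an iterative procedure); the function `connected_components`, the toy ids, and the choice to label each cluster with its smallest id are assumptions made for illustration.

```python
def connected_components(node_ids, edges):
    """Map each node id to the smallest id in its connected component."""
    parent = {n: n for n in node_ids}

    def find(n):
        # Walk up to the root, shortening the path as we go
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so it labels the cluster
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {n: find(n) for n in node_ids}


# Hypothetical record ids and matched pairs (every pair has probability 1)
node_ids = ["a-1", "a-2", "a-3", "b-1"]
edges = [("a-1", "a-2"), ("a-2", "a-3")]

cluster_of = connected_components(node_ids, edges)
print(cluster_of)
```

Note that a-1 and a-3 end up in the same cluster even though no rule links them directly, because both are linked to a-2. Records with no matches, such as b-1, form singleton clusters.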
clusters.as_pandas_dataframe(limit=5)
| | cluster_id | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q33436042-4 | Q33436042-4 | Q33436042 | charlie louis wiliam merlin | charlie merlin | charlie | merlin | 1822-01-01 | radstock | NaN | male | NaN |
1 | Q7791916-15 | Q7791916-15 | Q7791916 | tom longman | tom longman | tom | longman | 1698-01-01 | bristol | bs5 6rq | male | publisher |
2 | Q97991018-1 | Q97991018-3 | Q97991018 | john hare | john hare | john | hare | 1857-05-31 | canterbury | ct4 6jr | male | religious |
3 | Q363965-1 | Q363965-3 | Q363965 | robert t. a. innes | robert innes | robert | innes | 1861-11-10 | edinburgh | eh3 5jz | male | astronomer |
4 | Q457399-4 | Q457399-4 | Q457399 | charles eliot | charles eliot | charles | eliot | 1862-01-09 | sibford gower | ox15 6pr | male | NaN |
These results can then be passed into the Cluster Studio Dashboard.
linker.cluster_studio_dashboard(df_predict, clusters, "dashboards/50k_deterministic_cluster.html", sampling_method='by_cluster_size', overwrite=True)
from IPython.display import IFrame
IFrame(
    src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200
)