Deduplicate 50k rows of historical persons

Linking a dataset of real historical persons

In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from Wikidata. Duplicate records have been introduced, each containing a variety of errors.

from splink import splink_datasets

df = splink_datasets.historical_50k

df.head()

	unique_id	cluster	full_name	first_and_surname	first_name	surname	dob	birth_place	postcode_fake	gender	occupation
0	Q2296770-1	Q2296770	thomas clifford, 1st baron clifford of chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	male	politician
1	Q2296770-2	Q2296770	thomas of chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	male	politician
2	Q2296770-3	Q2296770	tom 1st baron clifford of chudleigh	tom chudleigh	tom	chudleigh	1630-08-01	devon	tq13 8df	male	politician
3	Q2296770-4	Q2296770	thomas 1st chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8hu	None	politician
4	Q2296770-5	Q2296770	thomas clifford, 1st baron chudleigh	thomas chudleigh	thomas	chudleigh	1630-08-01	devon	tq13 8df	None	politician

from splink import DuckDBAPI
from splink.exploratory import profile_columns

db_api = DuckDBAPI()
profile_columns(df, db_api, column_expressions=["first_name", "substr(surname,1,2)"])

from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules = [
    block_on("substr(first_name,1,3)", "substr(surname,1,4)"),
    block_on("surname", "dob"),
    block_on("first_name", "dob"),
    block_on("postcode_fake", "first_name"),
    block_on("postcode_fake", "surname"),
    block_on("dob", "birth_place"),
    block_on("substr(postcode_fake,1,3)", "dob"),
    block_on("substr(postcode_fake,1,3)", "first_name"),
    block_on("substr(postcode_fake,1,3)", "surname"),
    block_on("substr(first_name,1,2)", "substr(surname,1,2)", "substr(dob,1,4)"),
]

db_api = DuckDBAPI()

cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="dedupe_only",
)
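To make the idea behind the chart above concrete, here is a minimal pure-Python sketch of how blocking rules restrict the pairs that get scored, and why the chart is cumulative: each rule only contributes pairs that no earlier rule has already generated. This is not Splink's implementation (Splink generates SQL joins). The toy records and the helper `pairs_for_block` are invented for illustration, and treating null keys as non-matching is an assumption modelled on SQL equality.

```python
from itertools import combinations

# Toy records, invented purely for illustration
toy = [
    {"unique_id": 1, "first_name": "thomas", "surname": "chudleigh", "dob": "1630-08-01"},
    {"unique_id": 2, "first_name": "tom", "surname": "chudleigh", "dob": "1630-08-01"},
    {"unique_id": 3, "first_name": "thomas", "surname": "clifford", "dob": "1630-08-01"},
    {"unique_id": 4, "first_name": "hall", "surname": "caine", "dob": "1900-01-01"},
    {"unique_id": 5, "first_name": "sir", "surname": "caine", "dob": "1900-01-01"},
]


def pairs_for_block(records, keys):
    """Return unordered id pairs that agree exactly on every key (hypothetical helper)."""
    return {
        (a["unique_id"], b["unique_id"])
        for a, b in combinations(records, 2)
        if all(a[k] is not None and a[k] == b[k] for k in keys)
    }


# Rough equivalents of block_on("surname", "dob") and block_on("first_name", "dob")
rules = [("surname", "dob"), ("first_name", "dob")]

all_pairs = len(toy) * (len(toy) - 1) // 2  # comparisons with no blocking at all
seen = set()
cumulative = []
for keys in rules:
    # Only pairs not already produced by an earlier rule are new
    seen |= pairs_for_block(toy, keys)
    cumulative.append(len(seen))

print(all_pairs, cumulative)
```

On this toy table, blocking scores 3 of the 10 possible pairs. On the real data the reduction is far larger, which is what makes prediction tractable.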

import splink.comparison_library as cl

from splink import Linker, SettingsCreator

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.ForenameSurnameComparison(
            "first_name",
            "surname",
            forename_surname_concat_col_name="first_name_surname_concat",
        ),
        cl.DateOfBirthComparison(
            "dob", input_is_string=True
        ),
        cl.PostcodeComparison("postcode_fake"),
        cl.ExactMatch("birth_place").configure(term_frequency_adjustments=True),
        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)
# Needed to apply term frequencies to first+surname comparison
df["first_name_surname_concat"] = df["first_name"] + " " + df["surname"]
linker = Linker(df, settings, db_api=db_api)

linker.training.estimate_probability_two_random_records_match(
    [
        block_on("first_name", "surname", "dob"),
        block_on("substr(first_name,1,2)", "surname", "substr(postcode_fake,1,2)"),
        block_on("dob", "postcode_fake"),
    ],
    recall=0.6,
)

Probability two random records match is estimated to be  0.000136.
This means that amongst all possible pairwise record comparisons, one in 7,362.31 are expected to match.  With 1,279,041,753 total possible comparisons, we expect a total of around 173,728.33 matching pairs
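The figures in this log message can be reproduced with a few lines of arithmetic. The sketch below assumes the input has 50,578 rows. That number is not stated in the log; it is the value of n for which n(n-1)/2 equals the logged total of 1,279,041,753 comparisons.

```python
# Assumption: 50,578 rows, inferred because n*(n-1)/2 reproduces the logged total
n_records = 50_578

# Every unordered pair of distinct records is one possible comparison
total_comparisons = n_records * (n_records - 1) // 2

# "One in 7,362.31" pairs is expected to be a match
one_in = 7_362.31
probability = 1 / one_in

# Expected number of truly matching pairs among all comparisons
expected_matches = total_comparisons / one_in

print(f"{total_comparisons:,}")
print(f"{probability:.6f}")
print(f"{expected_matches:,.2f}")
```

The last figure differs from the logged 173,728.33 only in the second decimal place, because the log rounds the "one in" value before printing it.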

linker.training.estimate_u_using_random_sampling(max_pairs=5e6)

----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name_surname (no m values are trained).
    - dob (no m values are trained).
    - postcode_fake (no m values are trained).
    - birth_place (no m values are trained).
    - occupation (no m values are trained).
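As a rough illustration of what a u probability measures: it is the chance that two non-matching records fall into a given comparison level. Because almost all randomly chosen pairs are non-matches, the share of random pairs that agree on a column approximates u for the exact-match level. The sketch below enumerates every pair of a toy column rather than sampling, and the values are invented; it is not Splink's implementation.

```python
from itertools import combinations

# Toy birth_place values, invented purely for illustration
birth_places = ["devon", "devon", "london", "london", "london", "kent"]

pairs = list(combinations(birth_places, 2))

# Share of pairs that agree exactly, used as a stand-in for the u probability
# of the "exact match" level (assumes the pairs are essentially all non-matches)
u_exact = sum(left == right for left, right in pairs) / len(pairs)

# Everything else falls into the "all other comparisons" level
u_else = 1 - u_exact

print(len(pairs), round(u_exact, 4), round(u_else, 4))
```

Common values such as a large city inflate u, which is one reason the model above enables term frequency adjustments on `birth_place` and `occupation`.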

training_blocking_rule = block_on("first_name", "surname")
training_session_names = (
    linker.training.estimate_parameters_using_expectation_maximisation(
        training_blocking_rule, estimate_without_term_frequencies=True
    )
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")

Parameter estimates will be made for the following comparison(s):
    - dob
    - postcode_fake
    - birth_place
    - occupation

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name_surname

Iteration 1: Largest change in params was 0.248 in probability_two_random_records_match
Iteration 2: Largest change in params was -0.0935 in the m_probability of postcode_fake, level `Exact match on full postcode`
Iteration 3: Largest change in params was -0.0239 in the m_probability of birth_place, level `Exact match on birth_place`
Iteration 4: Largest change in params was 0.00984 in the m_probability of birth_place, level `All other comparisons`
Iteration 5: Largest change in params was -0.00477 in the m_probability of birth_place, level `Exact match on birth_place`
Iteration 6: Largest change in params was 0.00274 in the m_probability of birth_place, level `All other comparisons`
Iteration 7: Largest change in params was 0.00189 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 8: Largest change in params was 0.00129 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 9: Largest change in params was 0.000863 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 10: Largest change in params was 0.000576 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 11: Largest change in params was 0.000383 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 12: Largest change in params was 0.000254 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 13: Largest change in params was 0.000169 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 14: Largest change in params was 0.000112 in the m_probability of dob, level `Abs date difference <= 10 year`
Iteration 15: Largest change in params was 7.43e-05 in the m_probability of dob, level `Abs date difference <= 10 year`

EM converged after 15 iterations

Your model is not yet fully trained. Missing estimates for:
    - first_name_surname (no m values are trained).

training_blocking_rule = block_on("dob")
training_session_dob = (
    linker.training.estimate_parameters_using_expectation_maximisation(
        training_blocking_rule, estimate_without_term_frequencies=True
    )
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name_surname
    - postcode_fake
    - birth_place
    - occupation

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

Iteration 1: Largest change in params was -0.472 in the m_probability of first_name_surname, level `Exact match on first_name_surname_concat`
Iteration 2: Largest change in params was 0.0536 in the m_probability of first_name_surname, level `All other comparisons`
Iteration 3: Largest change in params was 0.0179 in the m_probability of first_name_surname, level `All other comparisons`
Iteration 4: Largest change in params was 0.00547 in the m_probability of first_name_surname, level `All other comparisons`
Iteration 5: Largest change in params was 0.00169 in the m_probability of first_name_surname, level `All other comparisons`
Iteration 6: Largest change in params was 0.00053 in the m_probability of first_name_surname, level `All other comparisons`
Iteration 7: Largest change in params was 0.000168 in the m_probability of first_name_surname, level `All other comparisons`
Iteration 8: Largest change in params was 5.38e-05 in the m_probability of first_name_surname, level `All other comparisons`

EM converged after 8 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

The final match weights can be viewed in the match weights chart:

linker.visualisations.match_weights_chart()

linker.evaluation.unlinkables_chart()

df_predict = linker.inference.predict()
df_e = df_predict.as_pandas_dataframe(limit=5)
df_e

Blocking time: 0.66 seconds
Predict time: 1.32 seconds

	match_weight	match_probability	unique_id_l	unique_id_r	surname_l	surname_r	first_name_l	first_name_r	first_name_surname_concat_l	first_name_surname_concat_r	...	bf_birth_place	bf_tf_adj_birth_place	occupation_l	occupation_r	gamma_occupation	tf_occupation_l	tf_occupation_r	bf_occupation	bf_tf_adj_occupation	match_key
0	11.155625	0.999562	Q19654778-17	Q19654778-4	chattock	chattock	richard	ritchie	richard chattock	ritchie chattock	...	0.164723	1.000000	photographer	photographer	1	0.018862	0.018862	23.537422	2.020099	4
1	21.080818	1.000000	Q2331144-2	Q2331144-9	caine	caine	sir	hall	sir caine	hall caine	...	165.631265	20.031894	novelist	writer	0	0.007078	0.053264	0.107239	1.000000	4
2	20.499240	0.999999	Q3377781-1	Q3377781-4	meux	meux	hedworth	admiral	hedworth meux	admiral meux	...	165.631265	0.094897	politician	politician	1	0.088932	0.088932	23.537422	0.428451	4
3	20.499240	0.999999	Q3377781-2	Q3377781-4	meux	meux	hedworth	admiral	hedworth meux	admiral meux	...	165.631265	0.094897	politician	politician	1	0.088932	0.088932	23.537422	0.428451	4
4	20.499240	0.999999	Q3377781-3	Q3377781-4	meux	meux	hedworth	admiral	hedworth meux	admiral meux	...	165.631265	0.094897	politician	politician	1	0.088932	0.088932	23.537422	0.428451	4

5 rows × 42 columns
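The `match_weight` and `match_probability` columns carry the same information on different scales: the match weight is the base-2 logarithm of the overall odds of a match. A minimal sketch of the conversion, checked against the first rows shown above (the function name is hypothetical):

```python
def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a probability (hypothetical helper)."""
    odds = 2**match_weight
    return odds / (1 + odds)


# Match weights taken from the rows shown above
for weight in (11.155625, 21.080818, 20.499240):
    print(f"{weight:.6f} -> {match_weight_to_probability(weight):.6f}")
```

A weight of 0 corresponds to a probability of 0.5, and each additional unit of weight doubles the odds.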

You can also view rows in this dataset as a waterfall chart as follows:

records_to_plot = df_e.to_dict(orient="records")
linker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False)

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)

Completed iteration 1, num representatives needing updating: 810
Completed iteration 2, num representatives needing updating: 183
Completed iteration 3, num representatives needing updating: 59
Completed iteration 4, num representatives needing updating: 6
Completed iteration 5, num representatives needing updating: 1
Completed iteration 6, num representatives needing updating: 0
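Clustering at a threshold amounts to finding connected components: pairs scoring at or above the threshold become edges, and records joined by any chain of edges end up in the same cluster. The sketch below shows the idea with a small union-find on invented pairs. It is an illustration only, not the iterative algorithm whose progress is logged above, and treating the threshold as inclusive is an assumption.

```python
def cluster_pairs(scored_pairs, threshold):
    """Map each record id to a cluster representative (hypothetical helper)."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own representative
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right, probability in scored_pairs:
        root_l, root_r = find(left), find(right)
        # Only pairs at or above the threshold act as edges
        if probability >= threshold and root_l != root_r:
            parent[max(root_l, root_r)] = min(root_l, root_r)

    return {node: find(node) for node in list(parent)}


# Toy pairwise predictions, invented purely for illustration
toy_pairs = [
    ("a", "b", 0.99),
    ("b", "c", 0.97),
    ("c", "d", 0.40),
    ("d", "e", 0.96),
]

clusters_sketch = cluster_pairs(toy_pairs, threshold=0.95)
print(clusters_sketch)
```

Records `a` and `c` share a cluster even though they were never compared directly, because both link to `b`. This transitivity is why a single false positive link can merge two otherwise separate clusters.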

from IPython.display import IFrame

linker.visualisations.cluster_studio_dashboard(
    df_predict,
    clusters,
    "dashboards/50k_cluster.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)


IFrame(src="./dashboards/50k_cluster.html", width="100%", height=1200)

linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", output_type="accuracy", match_weight_round_to_nearest=0.02
)

Blocking time: 1.37 seconds
Predict time: 1.38 seconds
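The `cluster` column acts as a ground-truth label: two records are a true match when their `cluster` values are equal. A minimal sketch of how pairwise accuracy figures follow from that, using invented scored pairs and a hypothetical helper (not Splink's implementation):

```python
# Toy scored pairs: (cluster_l, cluster_r, match_probability), invented for illustration
scored_pairs = [
    ("Q1", "Q1", 0.999),
    ("Q1", "Q1", 0.97),
    ("Q1", "Q2", 0.96),
    ("Q2", "Q2", 0.30),
    ("Q2", "Q3", 0.01),
]


def confusion_counts(pairs, threshold):
    """Count TP, FP, FN, TN for pairwise predictions at a threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for cluster_l, cluster_r, probability in pairs:
        is_match = cluster_l == cluster_r       # ground truth from the label column
        predicted = probability >= threshold    # model decision
        if predicted:
            counts["tp" if is_match else "fp"] += 1
        else:
            counts["fn" if is_match else "tn"] += 1
    return counts


counts = confusion_counts(scored_pairs, threshold=0.95)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, precision, recall)
```

Raising the threshold typically trades recall for precision, which is the trade-off the accuracy chart lets you explore.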

records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster",
    threshold_match_probability=0.999,
    include_false_negatives=False,
    include_false_positives=True,
).as_record_dict()
linker.visualisations.waterfall_chart(records)

Blocking time: 1.80 seconds
Predict time: 0.59 seconds

# Some false negatives occur because no blocking rule generated the pair,
# so the model never had the chance to score it
records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster",
    threshold_match_probability=0.5,
    include_false_negatives=True,
    include_false_positives=False,
).as_record_dict(limit=50)

linker.visualisations.waterfall_chart(records)

Blocking time: 1.08 seconds
Predict time: 0.48 seconds