Febrl3 Dedupe

Deduplicating the febrl3 dataset

See A.2 here and here for the source of this data.

from splink.datasets import splink_datasets

df = splink_datasets.febrl3

df = df.rename(columns=lambda x: x.strip())

df["cluster"] = df["rec_id"].apply(lambda x: "-".join(x.split("-")[:2]))

df["date_of_birth"] = df["date_of_birth"].astype(str).str.strip()
df["soc_sec_id"] = df["soc_sec_id"].astype(str).str.strip()

df.head(2)

	rec_id	given_name	surname	street_number	address_1	address_2	suburb	postcode	state	date_of_birth	soc_sec_id	cluster
0	rec-1496-org	mitchell	green	7	wallaby place	delmar	cleveland	2119	sa	19560409	1804974	rec-1496
1	rec-552-dup-3	harley	mccarthy	177	pridhamstreet	milton	marsden	3165	nsw	19080419	6089216	rec-552
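The `cluster` label above is built by keeping the first two hyphen-separated parts of `rec_id`, so an original record and all of its duplicates share one label. A minimal standalone sketch of that step follows; the helper name `cluster_from_rec_id` is ours for illustration and is not part of Splink.

```python
def cluster_from_rec_id(rec_id: str) -> str:
    """Keep the first two hyphen-separated parts of a febrl record ID."""
    return "-".join(rec_id.split("-")[:2])


# Record IDs taken from the two rows shown above
for rec_id in ["rec-1496-org", "rec-552-dup-3"]:
    print(rec_id, "->", cluster_from_rec_id(rec_id))
```

Because the label is recoverable from the ID alone, it can serve as ground truth when evaluating the model later on.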


from splink import DuckDBAPI, Linker, SettingsCreator

# TODO:  Allow missingness to be analysed without a linker
settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
)

linker = Linker(df, settings, db_api=DuckDBAPI())

It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing:

from splink.exploratory import completeness_chart

completeness_chart(df, db_api=DuckDBAPI())

from splink.exploratory import profile_columns

profile_columns(df, db_api=DuckDBAPI(), column_expressions=["given_name", "surname"])
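To make the idea of completeness concrete, here is a rough pure-Python sketch of the quantity such a chart summarises: the share of non-missing values per column. The toy rows are hypothetical (not taken from febrl3), the function name `completeness` is ours, and treating empty strings as missing is an assumption of this sketch rather than a statement about how Splink counts.

```python
def completeness(records, columns):
    """Share of non-missing values per column (None and '' count as missing here)."""
    total = len(records)
    return {
        col: sum(1 for r in records if r.get(col) not in (None, "")) / total
        for col in columns
    }


# Hypothetical toy rows for illustration only
toy = [
    {"given_name": "mitchell", "surname": "green"},
    {"given_name": None, "surname": "mccarthy"},
    {"given_name": "harley", "surname": ""},
    {"given_name": "kyra", "surname": "wilde"},
]

shares = completeness(toy, ["given_name", "surname"])
print(shares)
```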

from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules = [
    block_on("soc_sec_id"),
    block_on("given_name"),
    block_on("surname"),
    block_on("date_of_birth"),
    block_on("postcode"),
]

db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="dedupe_only",
    unique_id_column_name="rec_id",
)
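The chart above reports, rule by rule, how many additional comparisons each blocking rule contributes beyond the rules listed before it. The sketch below imitates that bookkeeping on a handful of hypothetical records. It enumerates pairs by brute force, which is only sensible for toy data, and the function names are ours rather than Splink's.

```python
from itertools import combinations


def pairs_for_rule(records, key):
    """All unordered record-ID pairs that agree on `key` (missing values never match)."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if a[key] is not None and a[key] == b[key]
    }


def cumulative_new_pairs(records, keys):
    """For each rule, in order, count the pairs not already produced by an earlier rule."""
    seen, counts = set(), []
    for key in keys:
        new = pairs_for_rule(records, key) - seen
        counts.append((key, len(new)))
        seen |= new
    return counts


# Hypothetical toy records for illustration only
toy = [
    {"id": 1, "surname": "green", "postcode": "2119"},
    {"id": 2, "surname": "green", "postcode": "2119"},
    {"id": 3, "surname": "greene", "postcode": "2119"},
    {"id": 4, "surname": "wilde", "postcode": "3859"},
]

counts = cumulative_new_pairs(toy, ["surname", "postcode"])
print(counts)
```

Here the pair (1, 2) is found by `surname`, so `postcode` is only credited with the two pairs it adds. This is why the order of the rules affects the per-rule bars but not the overall total.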

import splink.comparison_library as cl

from splink import Linker

settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.NameComparison("given_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison(
            "date_of_birth",
            input_is_string=True,
            datetime_format="%Y%m%d",
        ),
        cl.DamerauLevenshteinAtThresholds("soc_sec_id", [2]),
        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
        cl.ExactMatch("postcode").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=DuckDBAPI())
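The `soc_sec_id` comparison above uses a Damerau-Levenshtein threshold of 2, so IDs that differ by a couple of typos can still land in a fuzzy-match level. As a rough illustration of that kind of distance, here is a pure-Python sketch of the restricted variant (optimal string alignment). This is an assumption-laden stand-in: the function the database backend actually runs may differ in edge cases, and the second ID below is a hypothetical typo, not a febrl3 value.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


original = "6089216"       # soc_sec_id from the sample rows
with_typo = "6089126"      # hypothetical transposition of two digits
print(osa_distance(original, with_typo))
```

A plain Levenshtein distance would score this transposition as 2 edits; counting it as 1 is the reason a Damerau-style distance suits hand-keyed identifiers.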

from splink import block_on

deterministic_rules = [
    block_on("soc_sec_id"),
    block_on("given_name", "surname", "date_of_birth"),
    "l.given_name = r.surname and l.surname = r.given_name and l.date_of_birth = r.date_of_birth",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.9
)

Probability two random records match is estimated to be  0.000528.
This means that amongst all possible pairwise record comparisons, one in 1,893.56 are expected to match.  With 12,497,500 total possible comparisons, we expect a total of around 6,600.00 matching pairs
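The figures in this log message fit together with simple arithmetic. The sketch below reproduces them from the logged values. The record count of 5,000 is not stated in the log; it is inferred from the fact that 12,497,500 equals 5000 × 4999 / 2.

```python
n_records = 5000  # assumption: implied by the logged number of possible comparisons

# Number of unordered pairs among n records: n * (n - 1) / 2
total_pairs = n_records * (n_records - 1) // 2

expected_matches = 6600  # "around 6,600.00 matching pairs" in the log

prob_two_random_records_match = expected_matches / total_pairs
one_in = total_pairs / expected_matches

print(f"{total_pairs:,}")                        # 12,497,500
print(f"{prob_two_random_records_match:.6f}")    # 0.000528
print(f"{one_in:,.2f}")                          # 1,893.56
```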

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----



u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - surname (no m values are trained).
    - date_of_birth (some u values are not trained, no m values are trained).
    - soc_sec_id (no m values are trained).
    - street_number (no m values are trained).
    - postcode (no m values are trained).

em_blocking_rule_1 = block_on("date_of_birth")
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_1
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - soc_sec_id
    - street_number
    - postcode

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - date_of_birth

Iteration 1: Largest change in params was -0.376 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.0156 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was 0.000699 in the m_probability of postcode, level `All other comparisons`
Iteration 4: Largest change in params was -3.77e-05 in the m_probability of postcode, level `Exact match on postcode`

EM converged after 4 iterations

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (some u values are not trained, no m values are trained).
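The log above notes that `date_of_birth` cannot be estimated in a session that blocks on `date_of_birth`. One way to see why: every pair produced by that blocking rule agrees on the blocked column, so the column shows no variation to learn from, while the other columns still vary. The toy records and the helper `blocked_pairs` below are hypothetical and only illustrate this point.

```python
from itertools import combinations

# Hypothetical toy records for illustration only
toy = [
    {"id": 1, "date_of_birth": "19560409", "surname": "green"},
    {"id": 2, "date_of_birth": "19560409", "surname": "greene"},
    {"id": 3, "date_of_birth": "19080419", "surname": "mccarthy"},
    {"id": 4, "date_of_birth": "19080419", "surname": "mccarthy"},
]


def blocked_pairs(records, key):
    """Record pairs that share the same value of `key`."""
    return [(a, b) for a, b in combinations(records, 2) if a[key] == b[key]]


pairs = blocked_pairs(toy, "date_of_birth")

# Distinct agreement outcomes observed among the blocked pairs
dob_agreement = {a["date_of_birth"] == b["date_of_birth"] for a, b in pairs}
surname_agreement = {a["surname"] == b["surname"] for a, b in pairs}

print(dob_agreement)      # only one outcome is ever observed
print(surname_agreement)  # both outcomes are observed
```

This is the motivation for running a further training session with a different blocking rule, so that the column skipped in one session can be estimated in another.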

em_blocking_rule_2 = block_on("postcode")
session_postcode = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_2
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."postcode" = r."postcode"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - date_of_birth
    - soc_sec_id
    - street_number

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - postcode

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.0681 in probability_two_random_records_match
Iteration 2: Largest change in params was -0.00185 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
Iteration 3: Largest change in params was -5.7e-05 in the m_probability of date_of_birth, level `Exact match on date_of_birth`

EM converged after 3 iterations
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (some u values are not trained, some m values are not trained).

linker.visualisations.match_weights_chart()
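When reading a match weights chart, it helps to recall how a weight relates to the m and u probabilities that were just trained: in the Fellegi-Sunter framing, a comparison level's Bayes factor is m / u, and its match weight is the base-2 logarithm of that ratio. The m and u values below are hypothetical, chosen only to show the sign and rough size of the result.

```python
import math


def match_weight(m: float, u: float) -> float:
    """log2 of the Bayes factor m / u for one comparison level."""
    return math.log2(m / u)


# Hypothetical values: the level is common among matches, rare among non-matches
w_agree = match_weight(m=0.9, u=0.01)

# Hypothetical values: the level is rare among matches, common among non-matches
w_disagree = match_weight(m=0.1, u=0.99)

print(round(w_agree, 2), round(w_disagree, 2))
```

Positive weights are evidence for a match and negative weights are evidence against one; a weight of 0 means the level is equally likely either way.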

results = linker.inference.predict(threshold_match_probability=0.2)


 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained

linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1, output_type="accuracy"
)


 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained
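Evaluating against a labels column amounts to comparing predicted pairs with the pairs implied by the `cluster` labels. The sketch below shows that comparison at a single threshold on hypothetical toy data. It is not Splink's implementation, and the function names are ours.

```python
from itertools import combinations


def true_pairs(cluster_by_id):
    """All unordered ID pairs whose records carry the same cluster label."""
    return {
        (a, b)
        for a, b in combinations(sorted(cluster_by_id), 2)
        if cluster_by_id[a] == cluster_by_id[b]
    }


def precision_recall(predicted, truth):
    """Precision and recall of a set of predicted pairs against the true pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical labels and predictions for illustration only
clusters = {
    "rec-1-org": "rec-1",
    "rec-1-dup-0": "rec-1",
    "rec-2-org": "rec-2",
    "rec-2-dup-0": "rec-2",
    "rec-3-org": "rec-3",
}
predicted = {("rec-1-dup-0", "rec-1-org"), ("rec-2-org", "rec-3-org")}

truth = true_pairs(clusters)
result = precision_recall(predicted, truth)
print(result)
```

One of the two predicted pairs is correct and one of the two true pairs is found, so both measures come out at 0.5.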

pred_errors_df = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_pandas_dataframe()
len(pred_errors_df)  # not echoed: a notebook cell only displays its last expression
pred_errors_df.head()

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained

	clerical_match_score	found_by_blocking_rules	match_weight	match_probability	rec_id_l	rec_id_r	given_name_l	given_name_r	tf_given_name_l	...	postcode_l	postcode_r	tf_postcode_l	tf_postcode_r	bf_postcode	bf_tf_adj_postcode	cluster_l	cluster_r	match_key
0	1.0	False	-27.805731	4.262268e-09	rec-993-dup-1	rec-993-dup-3	westbrook	jake	0.0004	...	2704	2074	0.0002	0.0014	0.230173	1.0	rec-993	rec-993	5
1	1.0	False	-27.805731	4.262268e-09	rec-829-dup-0	rec-829-dup-2	wilde	kyra	0.0002	...	3859	3595	0.0004	0.0006	0.230173	1.0	rec-829	rec-829	5
2	1.0	False	-19.717877	1.159651e-06	rec-829-dup-0	rec-829-dup-1	wilde	kyra	0.0002	...	3859	3889	0.0004	0.0002	0.230173	1.0	rec-829	rec-829	5
3	1.0	True	-15.453190	2.229034e-05	rec-721-dup-0	rec-721-dup-1	mikhaili	elly	0.0008	...	4806	4860	0.0008	0.0014	0.230173	1.0	rec-721	rec-721	2
4	1.0	True	-12.931781	1.279648e-04	rec-401-dup-1	rec-401-dup-3	whitbe	alexa-ose	0.0002	...	3040	3041	0.0020	0.0004	0.230173	1.0	rec-401	rec-401	0

5 rows × 45 columns
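The `match_weight` and `match_probability` columns in this table carry the same information on different scales. Treating the weight as the base-2 log of the final odds, the probability is 2^w / (1 + 2^w). The sketch below applies that conversion to two weights copied from the table; the results should agree with the tabulated probabilities up to rounding.

```python
def match_probability(match_weight: float) -> float:
    """Convert a log2-odds match weight into a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# Match weights copied from rows 0 and 4 of the table above
for w in (-27.805731, -12.931781):
    print(f"{w}: {match_probability(w):.6e}")
```

Every row shown has a probability far below the 0.2 prediction threshold, even though `cluster_l` and `cluster_r` agree, which is what makes these rows prediction errors.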

The following chart seems to suggest that, where the model makes errors, it is because the data is corrupted beyond recognition, and no reasonable linkage model could be expected to find these matches.

records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_record_dict(limit=10)
linker.visualisations.waterfall_chart(records)

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained