Skip to content

Febrl3 Dedupe

Deduplicating the febrl3 dataset¶

See A.2 here and here for the source of this data

Open In Colab

from splink.datasets import splink_datasets

df = splink_datasets.febrl3
df = df.rename(columns=lambda x: x.strip())

df["cluster"] = df["rec_id"].apply(lambda x: "-".join(x.split("-")[:2]))

df["date_of_birth"] = df["date_of_birth"].astype(str).str.strip()
df["soc_sec_id"] = df["soc_sec_id"].astype(str).str.strip()

df.head(2)
rec_id given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id cluster
0 rec-1496-org mitchell green 7 wallaby place delmar cleveland 2119 sa 19560409 1804974 rec-1496
1 rec-552-dup-3 harley mccarthy 177 pridhamstreet milton marsden 3165 nsw 19080419 6089216 rec-552
df["date_of_birth"] = df["date_of_birth"].astype(str).str.strip()
df["soc_sec_id"] = df["soc_sec_id"].astype(str).str.strip()
df["date_of_birth"] = df["date_of_birth"].astype(str).str.strip()
df["soc_sec_id"] = df["soc_sec_id"].astype(str).str.strip()
from splink import DuckDBAPI, Linker, SettingsCreator

# TODO:  Allow missingness to be analysed without a linker
settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
)

linker = Linker(df, settings, db_api=DuckDBAPI())

It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing:

from splink.exploratory import completeness_chart

completeness_chart(df, db_api=DuckDBAPI())
from splink.exploratory import profile_columns

profile_columns(df, db_api=DuckDBAPI(), column_expressions=["given_name", "surname"])
from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules = [
    block_on("soc_sec_id"),
    block_on("given_name"),
    block_on("surname"),
    block_on("date_of_birth"),
    block_on("postcode"),
]

db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="dedupe_only",
    unique_id_column_name="rec_id",
)
import splink.comparison_library as cl

from splink import Linker

settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.NameComparison("given_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison(
            "date_of_birth",
            input_is_string=True,
            datetime_format="%Y%m%d",
        ),
        cl.DamerauLevenshteinAtThresholds("soc_sec_id", [2]),
        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
        cl.ExactMatch("postcode").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=DuckDBAPI())
from splink import block_on

deterministic_rules = [
    block_on("soc_sec_id"),
    block_on("given_name", "surname", "date_of_birth"),
    "l.given_name = r.surname and l.surname = r.given_name and l.date_of_birth = r.date_of_birth",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.9
)
Probability two random records match is estimated to be  0.000528.
This means that amongst all possible pairwise record comparisons, one in 1,893.56 are expected to match.  With 12,497,500 total possible comparisons, we expect a total of around 6,600.00 matching pairs
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----



FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))


u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - surname (no m values are trained).
    - date_of_birth (some u values are not trained, no m values are trained).
    - soc_sec_id (no m values are trained).
    - street_number (no m values are trained).
    - postcode (no m values are trained).
em_blocking_rule_1 = block_on("date_of_birth")
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_1
)
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - soc_sec_id
    - street_number
    - postcode

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - date_of_birth

Iteration 1: Largest change in params was -0.376 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.0156 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was 0.000699 in the m_probability of postcode, level `All other comparisons`
Iteration 4: Largest change in params was -3.77e-05 in the m_probability of postcode, level `Exact match on postcode`

EM converged after 4 iterations

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (some u values are not trained, no m values are trained).
em_blocking_rule_2 = block_on("postcode")
session_postcode = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_2
)
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."postcode" = r."postcode"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - date_of_birth
    - soc_sec_id
    - street_number

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - postcode

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.0681 in probability_two_random_records_match
Iteration 2: Largest change in params was -0.00185 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
Iteration 3: Largest change in params was -5.7e-05 in the m_probability of date_of_birth, level `Exact match on date_of_birth`

EM converged after 3 iterations
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (some u values are not trained, some m values are not trained).
linker.visualisations.match_weights_chart()
results = linker.inference.predict(threshold_match_probability=0.2)
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))



 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained
linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1, output_type="accuracy"
)
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))



 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained
pred_errors_df = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_pandas_dataframe()
len(pred_errors_df)
pred_errors_df.head()
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained
clerical_match_score found_by_blocking_rules match_weight match_probability rec_id_l rec_id_r given_name_l given_name_r gamma_given_name tf_given_name_l ... postcode_l postcode_r gamma_postcode tf_postcode_l tf_postcode_r bf_postcode bf_tf_adj_postcode cluster_l cluster_r match_key
0 1.0 False -27.805731 4.262268e-09 rec-993-dup-1 rec-993-dup-3 westbrook jake 0 0.0004 ... 2704 2074 0 0.0002 0.0014 0.230173 1.0 rec-993 rec-993 5
1 1.0 False -27.805731 4.262268e-09 rec-829-dup-0 rec-829-dup-2 wilde kyra 0 0.0002 ... 3859 3595 0 0.0004 0.0006 0.230173 1.0 rec-829 rec-829 5
2 1.0 False -19.717877 1.159651e-06 rec-829-dup-0 rec-829-dup-1 wilde kyra 0 0.0002 ... 3859 3889 0 0.0004 0.0002 0.230173 1.0 rec-829 rec-829 5
3 1.0 True -15.453190 2.229034e-05 rec-721-dup-0 rec-721-dup-1 mikhaili elly 0 0.0008 ... 4806 4860 0 0.0008 0.0014 0.230173 1.0 rec-721 rec-721 2
4 1.0 True -12.931781 1.279648e-04 rec-401-dup-1 rec-401-dup-3 whitbe alexa-ose 0 0.0002 ... 3040 3041 0 0.0020 0.0004 0.230173 1.0 rec-401 rec-401 0

5 rows × 45 columns

The following chart seems to suggest that, where the model is making errors, it's because the data is corrupted beyond recognition and no reasonable linkage model could find these matches

records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_record_dict(limit=10)
linker.visualisations.waterfall_chart(records)
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained