Deduplicating the febrl3 dataset
See A.2 here and here for the source of this data
from splink.datasets import splink_datasets
df = splink_datasets.febrl3
df = df.rename(columns=lambda x: x.strip())
df["cluster"] = df["rec_id"].apply(lambda x: "-".join(x.split("-")[:2]))
df["date_of_birth"] = df["date_of_birth"].astype(str).str.strip()
df["soc_sec_id"] = df["soc_sec_id"].astype(str).str.strip()
df.head(2)
| | rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rec-1496-org | mitchell | green | 7 | wallaby place | delmar | cleveland | 2119 | sa | 19560409 | 1804974 | rec-1496 |
| 1 | rec-552-dup-3 | harley | mccarthy | 177 | pridhamstreet | milton | marsden | 3165 | nsw | 19080419 | 6089216 | rec-552 |
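The `cluster` column above is derived by keeping the first two hyphen-separated parts of `rec_id`, so an original record and its duplicates share one label. A minimal standalone sketch of that transformation, using the two `rec_id` values shown in the table (the helper name `cluster_from_rec_id` is introduced here for illustration only):

```python
def cluster_from_rec_id(rec_id: str) -> str:
    """Keep the 'rec-<number>' prefix and drop the '-org' / '-dup-N' suffix."""
    return "-".join(rec_id.split("-")[:2])


# The two rec_id values from the table above.
examples = ["rec-1496-org", "rec-552-dup-3"]
clusters = [cluster_from_rec_id(r) for r in examples]
print(clusters)  # ['rec-1496', 'rec-552']
```

Because the label is recoverable from the identifier, it can serve as ground truth when evaluating the model later on.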
from splink import DuckDBAPI, Linker, SettingsCreator
# TODO: Allow missingness to be analysed without a linker
settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
)
linker = Linker(df, settings, db_api=DuckDBAPI())
It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing:
from splink.exploratory import completeness_chart
completeness_chart(df, db_api=DuckDBAPI())
from splink.exploratory import profile_columns
profile_columns(df, db_api=DuckDBAPI(), column_expressions=["given_name", "surname"])
from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)
blocking_rules = [
    block_on("soc_sec_id"),
    block_on("given_name"),
    block_on("surname"),
    block_on("date_of_birth"),
    block_on("postcode"),
]
db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="dedupe_only",
    unique_id_column_name="rec_id",
)
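To make the idea behind the cumulative chart concrete, here is a rough pure-Python sketch of counting how many new candidate pairs each blocking column adds, in order. The records and the helper `cumulative_pair_counts` are invented for illustration; this is not how Splink computes the chart internally, only the same counting idea on toy data.

```python
from itertools import combinations

# Toy records, invented for illustration (not taken from febrl3).
records = [
    {"rec_id": "a", "surname": "green", "postcode": "2119"},
    {"rec_id": "b", "surname": "green", "postcode": "2119"},
    {"rec_id": "c", "surname": "green", "postcode": "3165"},
    {"rec_id": "d", "surname": "mccarthy", "postcode": "3165"},
]


def cumulative_pair_counts(rows, columns):
    """For each column, count pairs that agree on it and were not
    already produced by an earlier column."""
    seen = set()
    counts = []
    for col in columns:
        new_pairs = {
            (left["rec_id"], right["rec_id"])
            for left, right in combinations(rows, 2)
            if left[col] is not None and left[col] == right[col]
        } - seen
        counts.append((col, len(new_pairs)))
        seen |= new_pairs
    return counts


counts = cumulative_pair_counts(records, ["surname", "postcode"])
print(counts)  # [('surname', 3), ('postcode', 1)]
```

In this toy case the two rules together generate 4 of the 6 possible pairs; on real data the aim is for the rules to generate far fewer comparisons than the full cross product while still capturing most true matches.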
import splink.comparison_library as cl
from splink import Linker
settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.NameComparison("given_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison(
            "date_of_birth",
            input_is_string=True,
            datetime_format="%Y%m%d",
        ),
        cl.DamerauLevenshteinAtThresholds("soc_sec_id", [2]),
        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
        cl.ExactMatch("postcode").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)
linker = Linker(df, settings, db_api=DuckDBAPI())
from splink import block_on
deterministic_rules = [
    block_on("soc_sec_id"),
    block_on("given_name", "surname", "date_of_birth"),
    "l.given_name = r.surname and l.surname = r.given_name and l.date_of_birth = r.date_of_birth",
]
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.9
)
Probability two random records match is estimated to be 0.000528.
This means that amongst all possible pairwise record comparisons, one in 1,893.56 are expected to match. With 12,497,500 total possible comparisons, we expect a total of around 6,600.00 matching pairs
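The figures in this log message are related by simple arithmetic, which the following sketch reproduces. It takes the total comparison count and the expected number of matching pairs directly from the output above and makes no assumption about how Splink arrives at the 6,600 estimate.

```python
import math

# Figures reported in the log output above.
total_comparisons = 12_497_500
expected_matching_pairs = 6_600

# A dedupe job over n records has n * (n - 1) / 2 possible pairs; solve for n.
n_records = (1 + math.isqrt(1 + 8 * total_comparisons)) // 2

probability = expected_matching_pairs / total_comparisons
one_in = 1 / probability

print(n_records)              # 5000
print(round(probability, 6))  # 0.000528
print(round(one_in, 2))       # 1893.56
```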
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- given_name (no m values are trained).
- surname (no m values are trained).
- date_of_birth (some u values are not trained, no m values are trained).
- soc_sec_id (no m values are trained).
- street_number (no m values are trained).
- postcode (no m values are trained).
em_blocking_rule_1 = block_on("date_of_birth")
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_1
)
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"
Parameter estimates will be made for the following comparison(s):
- given_name
- surname
- soc_sec_id
- street_number
- postcode
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- date_of_birth
Iteration 1: Largest change in params was -0.376 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.0156 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was 0.000699 in the m_probability of postcode, level `All other comparisons`
Iteration 4: Largest change in params was -3.77e-05 in the m_probability of postcode, level `Exact match on postcode`
EM converged after 4 iterations
Your model is not yet fully trained. Missing estimates for:
- date_of_birth (some u values are not trained, no m values are trained).
em_blocking_rule_2 = block_on("postcode")
session_postcode = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_2
)
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."postcode" = r."postcode"
Parameter estimates will be made for the following comparison(s):
- given_name
- surname
- date_of_birth
- soc_sec_id
- street_number
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- postcode
WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value
WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value
WARNING:
Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value
Iteration 1: Largest change in params was 0.0681 in probability_two_random_records_match
Iteration 2: Largest change in params was -0.00185 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
Iteration 3: Largest change in params was -5.7e-05 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
EM converged after 3 iterations
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- date_of_birth (some u values are not trained, some m values are not trained).
linker.visualisations.match_weights_chart()
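The match weights chart plots one weight per comparison level. In the Fellegi-Sunter framework that Splink uses, a level's weight is the base-2 logarithm of its Bayes factor, m / u. A minimal sketch of that relationship follows; the m and u values are hypothetical and are not the parameters estimated in the training sessions above.

```python
import math


def partial_match_weight(m: float, u: float) -> float:
    """log2 of the Bayes factor m / u for a single comparison level."""
    return math.log2(m / u)


# Hypothetical values for illustration only.
# An "exact match" level: common among true matches, rare among non-matches.
weight_exact = partial_match_weight(m=0.9, u=0.01)
# An "all other comparisons" level: rare among matches, common among non-matches.
weight_else = partial_match_weight(m=0.1, u=0.99)

print(weight_exact)  # positive: evidence in favour of a match
print(weight_else)   # negative: evidence against a match
```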
results = linker.inference.predict(threshold_match_probability=0.2)
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained
linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1, output_type="accuracy"
)
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained
pred_errors_df = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_pandas_dataframe()
len(pred_errors_df)
pred_errors_df.head()
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained
| | clerical_match_score | found_by_blocking_rules | match_weight | match_probability | rec_id_l | rec_id_r | given_name_l | given_name_r | gamma_given_name | tf_given_name_l | ... | postcode_l | postcode_r | gamma_postcode | tf_postcode_l | tf_postcode_r | bf_postcode | bf_tf_adj_postcode | cluster_l | cluster_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | False | -27.805731 | 4.262268e-09 | rec-993-dup-1 | rec-993-dup-3 | westbrook | jake | 0 | 0.0004 | ... | 2704 | 2074 | 0 | 0.0002 | 0.0014 | 0.230173 | 1.0 | rec-993 | rec-993 | 5 |
| 1 | 1.0 | False | -27.805731 | 4.262268e-09 | rec-829-dup-0 | rec-829-dup-2 | wilde | kyra | 0 | 0.0002 | ... | 3859 | 3595 | 0 | 0.0004 | 0.0006 | 0.230173 | 1.0 | rec-829 | rec-829 | 5 |
| 2 | 1.0 | False | -19.717877 | 1.159651e-06 | rec-829-dup-0 | rec-829-dup-1 | wilde | kyra | 0 | 0.0002 | ... | 3859 | 3889 | 0 | 0.0004 | 0.0002 | 0.230173 | 1.0 | rec-829 | rec-829 | 5 |
| 3 | 1.0 | True | -15.453190 | 2.229034e-05 | rec-721-dup-0 | rec-721-dup-1 | mikhaili | elly | 0 | 0.0008 | ... | 4806 | 4860 | 0 | 0.0008 | 0.0014 | 0.230173 | 1.0 | rec-721 | rec-721 | 2 |
| 4 | 1.0 | True | -12.931781 | 1.279648e-04 | rec-401-dup-1 | rec-401-dup-3 | whitbe | alexa-ose | 0 | 0.0002 | ... | 3040 | 3041 | 0 | 0.0020 | 0.0004 | 0.230173 | 1.0 | rec-401 | rec-401 | 0 |
5 rows × 45 columns
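The `match_weight` and `match_probability` columns in this table carry the same information on different scales. The reported values are consistent with treating the weight as the base-2 logarithm of the final odds, as the sketch below checks against two rows of the table. The helper name `match_probability_from_weight` is introduced here for illustration and is not a Splink function.

```python
import math


def match_probability_from_weight(match_weight: float) -> float:
    """Convert a match weight (log2 odds) into a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# (match_weight, match_probability) pairs copied from rows 0 and 4 above.
table_rows = [(-27.805731, 4.262268e-09), (-12.931781, 1.279648e-04)]

for weight, reported in table_rows:
    computed = match_probability_from_weight(weight)
    print(weight, computed, reported)
    assert math.isclose(computed, reported, rel_tol=1e-3)
```

Every row shown has a `clerical_match_score` of 1.0 but a very low predicted probability, so these prediction errors are all true matches that the model scored as non-matches.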
The following chart suggests that, where the model makes errors, it is because the records are corrupted beyond recognition, and no reasonable linkage model could find these matches.
records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_record_dict(limit=10)
linker.visualisations.waterfall_chart(records)
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained