# Deduplicating the febrl3 dataset
See A.2 here and here for the source of this data
```python
from splink.datasets import splink_datasets

df = splink_datasets.febrl3
```
```python
df = df.rename(columns=lambda x: x.strip())
df["cluster"] = df["rec_id"].apply(lambda x: "-".join(x.split("-")[:2]))
df["date_of_birth"] = df["date_of_birth"].astype(str).str.strip()
df["soc_sec_id"] = df["soc_sec_id"].astype(str).str.strip()
df.head(2)
```
| | rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rec-1496-org | mitchell | green | 7 | wallaby place | delmar | cleveland | 2119 | sa | 19560409 | 1804974 | rec-1496 |
| 1 | rec-552-dup-3 | harley | mccarthy | 177 | pridhamstreet | milton | marsden | 3165 | nsw | 19080419 | 6089216 | rec-552 |
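The `cluster` derivation above keeps only the first two dash-separated tokens of `rec_id`, so an original record and all of its duplicates share one label. A minimal sketch of the same logic:

```python
# Same logic as the .apply(...) above: keep the first two "-"-separated
# tokens of rec_id, so originals and duplicates get the same cluster label.
def to_cluster(rec_id: str) -> str:
    return "-".join(rec_id.split("-")[:2])

print(to_cluster("rec-1496-org"))   # rec-1496
print(to_cluster("rec-552-dup-3"))  # rec-552
```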
TODO: Allow missingness to be analysed without a linker

```python
from splink import DuckDBAPI, Linker, SettingsCreator

settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
)

linker = Linker(df, settings, db_api=DuckDBAPI())
```
It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing:
```python
from splink.exploratory import completeness_chart

completeness_chart(df, db_api=DuckDBAPI())
```
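The completeness chart reports, for each column, the share of rows with a non-null value. A hand-rolled sketch of that measure on toy records (illustrative only, not Splink's implementation):

```python
# Illustrative completeness measure: fraction of non-null values per column.
rows = [
    {"given_name": "mitchell", "surname": "green"},
    {"given_name": None, "surname": "mccarthy"},
]

def completeness(rows, column):
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows)

print(completeness(rows, "given_name"))  # 0.5
print(completeness(rows, "surname"))     # 1.0
```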
```python
from splink.exploratory import profile_columns

profile_columns(df, db_api=DuckDBAPI(), column_expressions=["given_name", "surname"])
```
```python
from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules = [
    block_on("soc_sec_id"),
    block_on("given_name"),
    block_on("surname"),
    block_on("date_of_birth"),
    block_on("postcode"),
]

db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="dedupe_only",
    unique_id_column_name="rec_id",
)
```
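A blocking rule such as `block_on("soc_sec_id")` restricts scoring to record pairs that agree exactly on that column. In pure Python the idea looks roughly like this (hypothetical records; Splink actually generates SQL rather than looping):

```python
# Toy illustration of blocking: only pairs of records that agree on the
# blocked column are generated for scoring; all other pairs are skipped.
records = [
    {"id": 1, "soc_sec_id": "1804974"},
    {"id": 2, "soc_sec_id": "1804974"},
    {"id": 3, "soc_sec_id": "6089216"},
]

pairs = [
    (a["id"], b["id"])
    for i, a in enumerate(records)
    for b in records[i + 1:]
    if a["soc_sec_id"] == b["soc_sec_id"]
]
print(pairs)  # [(1, 2)]
```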
```python
import splink.comparison_library as cl

from splink import Linker

settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.NameComparison("given_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison(
            "date_of_birth",
            input_is_string=True,
            datetime_format="%Y%m%d",
        ),
        cl.DamerauLevenshteinAtThresholds("soc_sec_id", [2]),
        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
        cl.ExactMatch("postcode").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=DuckDBAPI())
```
```python
from splink import block_on

deterministic_rules = [
    block_on("soc_sec_id"),
    block_on("given_name", "surname", "date_of_birth"),
    "l.given_name = r.surname and l.surname = r.given_name and l.date_of_birth = r.date_of_birth",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.9
)
```
Probability two random records match is estimated to be 0.000528.
This means that amongst all possible pairwise record comparisons, one in 1,893.56 are expected to match. With 12,497,500 total possible comparisons, we expect a total of around 6,600.00 matching pairs.
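The figures above follow from the pairwise-comparison count: the quoted 12,497,500 total comparisons implies n = 5,000 records, since n(n-1)/2 comparisons are possible, and multiplying by the estimated match probability gives the expected number of matching pairs (the quoted 1,893.56 uses the unrounded probability estimate):

```python
n = 5000  # implied by the 12,497,500 total comparisons quoted above
total_comparisons = n * (n - 1) // 2
print(total_comparisons)  # 12497500

p = 0.000528  # estimated probability two random records match
print(round(total_comparisons * p))  # 6599, i.e. around 6,600 expected pairs
print(round(1 / p))                  # 1894, i.e. roughly one match per 1,894 comparisons
```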
```python
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
```
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- given_name (no m values are trained).
- surname (no m values are trained).
- date_of_birth (some u values are not trained, no m values are trained).
- soc_sec_id (no m values are trained).
- street_number (no m values are trained).
- postcode (no m values are trained).
```python
em_blocking_rule_1 = block_on("date_of_birth")
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_1
)
```
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"
Parameter estimates will be made for the following comparison(s):
- given_name
- surname
- soc_sec_id
- street_number
- postcode
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- date_of_birth
Iteration 1: Largest change in params was -0.376 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.0156 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was 0.000699 in the m_probability of postcode, level `All other comparisons`
Iteration 4: Largest change in params was -3.77e-05 in the m_probability of postcode, level `Exact match on postcode`
EM converged after 4 iterations
Your model is not yet fully trained. Missing estimates for:
- date_of_birth (some u values are not trained, no m values are trained).
```python
em_blocking_rule_2 = block_on("postcode")
session_postcode = linker.training.estimate_parameters_using_expectation_maximisation(
    em_blocking_rule_2
)
```
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."postcode" = r."postcode"
Parameter estimates will be made for the following comparison(s):
- given_name
- surname
- date_of_birth
- soc_sec_id
- street_number
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- postcode
WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value
WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value
WARNING:
Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value
Iteration 1: Largest change in params was 0.0681 in probability_two_random_records_match
Iteration 2: Largest change in params was -0.00185 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
Iteration 3: Largest change in params was -5.7e-05 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
EM converged after 3 iterations
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- date_of_birth (some u values are not trained, some m values are not trained).
```python
linker.visualisations.match_weights_chart()
```
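The chart shows, per comparison level, the Fellegi-Sunter match weight, i.e. log base 2 of the Bayes factor m/u. A minimal illustration with made-up m and u values (not taken from this model):

```python
import math

# Match weight for one comparison level: log2(m / u), where
# m = P(level observed | records match) and u = P(level | non-match).
def match_weight(m: float, u: float) -> float:
    return math.log2(m / u)

print(round(match_weight(0.9, 0.01), 2))  # 6.49 (evidence for a match)
print(round(match_weight(0.05, 0.5), 2))  # -3.32 (evidence against)
```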
```python
results = linker.inference.predict(threshold_match_probability=0.2)
```
-- WARNING --
You have called predict(), but there are some parameters which have neither been estimated nor specified in your settings dictionary. To produce predictions, the following untrained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained
```python
linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1, output_type="accuracy"
)
```
-- WARNING --
You have called predict(), but there are some parameters which have neither been estimated nor specified in your settings dictionary. To produce predictions, the following untrained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained
```python
pred_errors_df = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_pandas_dataframe()
len(pred_errors_df)
pred_errors_df.head()
```
-- WARNING --
You have called predict(), but there are some parameters which have neither been estimated nor specified in your settings dictionary. To produce predictions, the following untrained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained
| | clerical_match_score | found_by_blocking_rules | match_weight | match_probability | rec_id_l | rec_id_r | given_name_l | given_name_r | gamma_given_name | tf_given_name_l | ... | postcode_l | postcode_r | gamma_postcode | tf_postcode_l | tf_postcode_r | bf_postcode | bf_tf_adj_postcode | cluster_l | cluster_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | False | -27.805731 | 4.262268e-09 | rec-993-dup-1 | rec-993-dup-3 | westbrook | jake | 0 | 0.0004 | ... | 2704 | 2074 | 0 | 0.0002 | 0.0014 | 0.230173 | 1.0 | rec-993 | rec-993 | 5 |
| 1 | 1.0 | False | -27.805731 | 4.262268e-09 | rec-829-dup-0 | rec-829-dup-2 | wilde | kyra | 0 | 0.0002 | ... | 3859 | 3595 | 0 | 0.0004 | 0.0006 | 0.230173 | 1.0 | rec-829 | rec-829 | 5 |
| 2 | 1.0 | False | -19.717877 | 1.159651e-06 | rec-829-dup-0 | rec-829-dup-1 | wilde | kyra | 0 | 0.0002 | ... | 3859 | 3889 | 0 | 0.0004 | 0.0002 | 0.230173 | 1.0 | rec-829 | rec-829 | 5 |
| 3 | 1.0 | True | -15.453190 | 2.229034e-05 | rec-721-dup-0 | rec-721-dup-1 | mikhaili | elly | 0 | 0.0008 | ... | 4806 | 4860 | 0 | 0.0008 | 0.0014 | 0.230173 | 1.0 | rec-721 | rec-721 | 2 |
| 4 | 1.0 | True | -12.931781 | 1.279648e-04 | rec-401-dup-1 | rec-401-dup-3 | whitbe | alexa-ose | 0 | 0.0002 | ... | 3040 | 3041 | 0 | 0.0020 | 0.0004 | 0.230173 | 1.0 | rec-401 | rec-401 | 0 |
5 rows × 45 columns
The following chart suggests that, where the model makes errors, it is because the data is corrupted beyond recognition, and no reasonable linkage model could have found these matches.
```python
records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster"
).as_record_dict(limit=10)

linker.visualisations.waterfall_chart(records)
```
-- WARNING --
You have called predict(), but there are some parameters which have neither been estimated nor specified in your settings dictionary. To produce predictions, the following untrained parameters will use default values.
Comparison: 'date_of_birth':
m values not fully trained
Comparison: 'date_of_birth':
u values not fully trained