Skip to content

Linking two tables of persons

Linking without deduplication

A simple record linkage model using the link_only link type.

import pandas as pd 
df_l = pd.read_parquet("./data/fake_df_l.parquet")
df_r = pd.read_parquet("./data/fake_df_r.parquet")
df_l.head(2)
unique_id first_name surname dob city email group
0 0 Julia None 2015-10-29 London hannah88@powers.com 0
1 4 oNah Watson 2008-03-23 Bolton matthew78@ballard-mcdonald.net 1
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb import duckdb_comparison_library as cl
settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        cl.levenshtein_at_thresholds("first_name", 2),
        cl.levenshtein_at_thresholds("surname"),
        cl.levenshtein_at_thresholds("dob"),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email"),
    ],       
}
linker = DuckDBLinker([df_l, df_r], settings, input_table_aliases=["df_left", "df_right"])
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
Probability two random records match is estimated to be  0.00582.
This means that amongst all possible pairwise record comparisons, one in 171.80 are expected to match.  With 148,239 total possible comparisons, we expect a total of around 862.86 matching pairs

linker.estimate_u_using_random_sampling(target_rows=1e6)
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).

session_dob = linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")
session_email = linker.estimate_parameters_using_expectation_maximisation("l.email = r.email")

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.dob = r.dob

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

Iteration 1: Largest change in params was -0.388 in the m_probability of first_name, level `Exact match`
Iteration 2: Largest change in params was 0.21 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0695 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0254 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.0121 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00674 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00414 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.00271 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.00185 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.0013 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000935 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000682 in probability_two_random_records_match
Iteration 13: Largest change in params was 0.000503 in probability_two_random_records_match
Iteration 14: Largest change in params was 0.000375 in probability_two_random_records_match
Iteration 15: Largest change in params was 0.000281 in probability_two_random_records_match
Iteration 16: Largest change in params was 0.000211 in probability_two_random_records_match
Iteration 17: Largest change in params was 0.00016 in probability_two_random_records_match
Iteration 18: Largest change in params was 0.000121 in probability_two_random_records_match
Iteration 19: Largest change in params was 9.17e-05 in probability_two_random_records_match

EM converged after 19 iterations

Your model is not yet fully trained. Missing estimates for:
    - dob (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.email = r.email

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - dob
    - city

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - email

Iteration 1: Largest change in params was 0.417 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.101 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0254 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0102 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00526 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00312 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00204 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.00142 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.00104 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000784 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000609 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000485 in probability_two_random_records_match
Iteration 13: Largest change in params was 0.000392 in probability_two_random_records_match
Iteration 14: Largest change in params was 0.000322 in probability_two_random_records_match
Iteration 15: Largest change in params was 0.000268 in probability_two_random_records_match
Iteration 16: Largest change in params was 0.000225 in probability_two_random_records_match
Iteration 17: Largest change in params was 0.000191 in probability_two_random_records_match
Iteration 18: Largest change in params was 0.000163 in probability_two_random_records_match
Iteration 19: Largest change in params was 0.00014 in probability_two_random_records_match
Iteration 20: Largest change in params was 0.00012 in probability_two_random_records_match
Iteration 21: Largest change in params was 0.000105 in probability_two_random_records_match
Iteration 22: Largest change in params was 9.1e-05 in probability_two_random_records_match

EM converged after 22 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

results = linker.predict(threshold_match_probability=0.9)
results.as_pandas_dataframe(limit=5)
match_weight match_probability source_dataset_l unique_id_l source_dataset_r unique_id_r first_name_l first_name_r gamma_first_name surname_l ... dob_l dob_r gamma_dob city_l city_r gamma_city email_l email_r gamma_email match_key
0 4.875607 0.967058 df_left 0 df_right 1 Julia Julia 2 NaN ... 2015-10-29 2015-07-31 0 London London 1 hannah88@powers.com hannah88@powers.com 3 0
1 4.875607 0.967058 df_left 0 df_right 2 Julia Julia 2 NaN ... 2015-10-29 2016-01-27 0 London London 1 hannah88@powers.com hannah88@powers.com 3 0
2 4.840904 0.966283 df_left 27 df_right 28 Matilda Matilda 2 Hsrir ... 1983-04-30 1983-04-30 3 London London 1 patrcio47@davis.cam patricia47@davis.com 0 0
3 15.216464 0.999974 df_left 32 df_right 34 Baxter Baxter 2 Aria ... 1992-09-07 1992-09-30 1 London London 1 christineshepherd@allen.com christineshepherd@allen.com 3 0
4 18.083396 0.999996 df_left 38 df_right 39 Olivia Olivia 2 Andrews ... 2009-01-23 2009-01-23 3 NaN London -1 hesterkurt@taylor-fitzgerald.com hesterkurt@taylor-fitzgerald.com 3 0

5 rows × 22 columns