Skip to content

Linking two tables of persons

Linking without deduplicationΒΆ

A simple record linkage model using the link_only link type.

from splink.datasets import splink_datasets
df = splink_datasets.fake_1000

# Split a simple dataset into two, separate datasets which can be linked together.
df_l = df.sample(frac=0.5)
df_r = df.drop(df_l.index)

df_l.head(2)
unique_id first_name surname dob city email cluster
681 681 Elizabeth Sahw 2006-04-21 NaN e.shaw@smith-hall.biz 174
655 655 Dylan Robert 1990-10-26 Birmingham NaN 166
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl


settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name",),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],       
}

linker = DuckDBLinker([df_l, df_r], settings, input_table_aliases=["df_left", "df_right"])
linker.completeness_chart(cols=["first_name", "surname", "dob", "city", "email"])
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
Probability two random records match is estimated to be  0.00326.
This means that amongst all possible pairwise record comparisons, one in 306.48 are expected to match.  With 250,000 total possible comparisons, we expect a total of around 815.71 matching pairs
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).
session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(block_on("email"))
session_first_name = linker.estimate_parameters_using_expectation_maximisation(block_on("first_name"))
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

Iteration 1: Largest change in params was -0.434 in the m_probability of surname, level `Exact match surname`
Iteration 2: Largest change in params was 0.122 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0454 in the m_probability of first_name, level `All other comparisons`
Iteration 4: Largest change in params was 0.0153 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00601 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00259 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00117 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000537 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000249 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000116 in probability_two_random_records_match
Iteration 11: Largest change in params was 5.41e-05 in probability_two_random_records_match

EM converged after 11 iterations

Your model is not yet fully trained. Missing estimates for:
    - dob (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."email" = r."email"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - dob
    - city

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - email

Iteration 1: Largest change in params was -0.492 in the m_probability of dob, level `Exact match`
Iteration 2: Largest change in params was 0.0862 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0181 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0064 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was 0.00312 in the m_probability of surname, level `All other comparisons`
Iteration 6: Largest change in params was 0.00176 in the m_probability of surname, level `All other comparisons`
Iteration 7: Largest change in params was 0.0011 in the m_probability of surname, level `All other comparisons`
Iteration 8: Largest change in params was 0.000735 in the m_probability of surname, level `All other comparisons`
Iteration 9: Largest change in params was 0.000516 in the m_probability of surname, level `All other comparisons`
Iteration 10: Largest change in params was 0.000374 in the m_probability of surname, level `All other comparisons`
Iteration 11: Largest change in params was 0.000279 in the m_probability of surname, level `All other comparisons`
Iteration 12: Largest change in params was 0.000212 in the m_probability of surname, level `All other comparisons`
Iteration 13: Largest change in params was 0.000164 in the m_probability of surname, level `All other comparisons`
Iteration 14: Largest change in params was 0.000128 in the m_probability of surname, level `All other comparisons`
Iteration 15: Largest change in params was 0.000101 in the m_probability of surname, level `All other comparisons`
Iteration 16: Largest change in params was 8.03e-05 in the m_probability of surname, level `All other comparisons`

EM converged after 16 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - dob
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name

Iteration 1: Largest change in params was -0.164 in the m_probability of surname, level `All other comparisons`
Iteration 2: Largest change in params was -0.00765 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was -0.00102 in the m_probability of surname, level `All other comparisons`
Iteration 4: Largest change in params was -0.000162 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was -2.68e-05 in the m_probability of surname, level `All other comparisons`

EM converged after 5 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values
results = linker.predict(threshold_match_probability=0.9)
results.as_pandas_dataframe(limit=5)
match_weight match_probability source_dataset_l source_dataset_r unique_id_l unique_id_r first_name_l first_name_r gamma_first_name surname_l ... dob_l dob_r gamma_dob city_l city_r gamma_city email_l email_r gamma_email match_key
0 15.720147 0.999981 df_left df_right 655 656 Dylan Dylan 4 Robert ... 1990-10-26 1990-10-26 5 Birmingham Birmingham 1 NaN droberts73@taylor-lang.com -1 0
1 18.533203 0.999997 df_left df_right 686 685 Rosie Rosie 4 Johnston ... 1978-11-23 1978-11-23 5 Sheffield Sheffield 1 NaN rosiej32@robinson-moran.net -1 0
2 14.264530 0.999949 df_left df_right 815 819 Logan Logan 4 Morgan ... 1977-01-29 1976-12-30 3 Coventry Coventry 1 NaN loganmorgan43@icbride-kmng.com -1 0
3 25.783560 1.000000 df_left df_right 766 765 Adam Adam 4 Edwards ... 1971-08-28 1971-08-28 5 Cardiff Cardiff 1 adam.edwards38@bullock-edwards.com adam.edwards38@bullock-edward.com 2 0
4 7.646308 0.995033 df_left df_right 125 123 Harley Harley 4 Kaur ... 1973-11-26 1973-10-27 3 Mancseter Mancheeser 0 harleyk@houston.net harleyk@houston.net 3 0

5 rows Γ— 22 columns