Linking two tables of persons

Linking without deduplication

This example builds a simple record linkage model using the link_only link type.

With link_only, only between-dataset record comparisons are generated. No within-dataset record comparisons are created, meaning that the model does not attempt to find within-dataset duplicates.
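To make the difference concrete, the sketch below counts the candidate pairs each link type would consider before any blocking is applied. It is an illustration only: the helper names (`count_link_only`, `count_dedupe_only`, `count_link_and_dedupe`) are hypothetical and not part of Splink, and the 500/500 split assumes the 1,000-row `fake_1000` dataset is halved as in this example.

```python
from itertools import combinations, product


def count_link_only(n_left: int, n_right: int) -> int:
    """Pairs with one record from each dataset (between-dataset only)."""
    return n_left * n_right


def count_dedupe_only(n: int) -> int:
    """Unordered pairs within a single dataset."""
    return n * (n - 1) // 2


def count_link_and_dedupe(n_left: int, n_right: int) -> int:
    """Unordered pairs over both datasets combined (between and within)."""
    return count_dedupe_only(n_left + n_right)


# Tiny enumerated example to check the formulas against brute force.
left_ids = ["l1", "l2", "l3"]
right_ids = ["r1", "r2"]

link_only_pairs = list(product(left_ids, right_ids))
all_pairs = list(combinations(left_ids + right_ids, 2))

assert len(link_only_pairs) == count_link_only(3, 2)
assert len(all_pairs) == count_link_and_dedupe(3, 2)

print(count_link_only(500, 500))        # 250000
print(count_link_and_dedupe(500, 500))  # 499500
```

With two 500-row tables, link_only considers 250,000 pairs, which agrees with the "total possible comparisons" figure Splink reports further down this page.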

from splink import splink_datasets

df = splink_datasets.fake_1000

# Split a simple dataset into two separate datasets that can be linked together.
df_l = df.sample(frac=0.5)
df_r = df.drop(df_l.index)

df_l.head(2)
     unique_id first_name   surname         dob    city                        email  cluster
922        922       Evie     Jones  2002-07-22     NaN  eviejones@brewer-sparks.org      230
224        224       Logn  Feeruson  2013-10-15  London         l.fergson46@shah.com       58

import splink.comparison_library as cl

from splink import DuckDBAPI, Linker, SettingsCreator, block_on

settings = SettingsCreator(
    link_type="link_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        cl.NameComparison(
            "first_name",
        ),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
            invalid_dates_as_null=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
)

linker = Linker(
    [df_l, df_r],
    settings,
    db_api=DuckDBAPI(),
    input_table_aliases=["df_left", "df_right"],
)
from splink.exploratory import completeness_chart

completeness_chart(
    [df_l, df_r],
    cols=["first_name", "surname", "dob", "city", "email"],
    db_api=DuckDBAPI(),
    table_names_for_chart=["df_left", "df_right"],
)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    block_on("email"),
]


linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
Probability two random records match is estimated to be  0.00338.
This means that amongst all possible pairwise record comparisons, one in 295.61 are expected to match.  With 250,000 total possible comparisons, we expect a total of around 845.71 matching pairs
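The figures in this message are linked by simple arithmetic, which the sketch below reproduces. The numbers 295.61, 250,000 and 0.7 come from the output and call above; the variable names are illustrative. The final step assumes Splink divides the number of pairs found by the deterministic rules by the supplied recall, so the implied count is an inference rather than a logged value.

```python
# Two tables of 500 records each: link_only compares every left record
# with every right record.
total_comparisons = 500 * 500

# "one in 295.61" from the log message.
one_in = 295.61
probability_two_random_records_match = 1 / one_in
expected_matching_pairs = total_comparisons / one_in

# Assumption: expected matches = pairs found by the deterministic rules / recall,
# so the rules themselves would have found roughly this many pairs.
recall = 0.7
implied_pairs_found_by_rules = expected_matching_pairs * recall

print(f"{probability_two_random_records_match:.5f}")  # 0.00338
print(f"{expected_matching_pairs:.2f}")               # 845.71
print(f"{implied_pairs_found_by_rules:.0f}")
```

A lower recall value would scale the estimate up, since the same number of rule-based matches would then be assumed to represent a smaller share of all true matches.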
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).
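As a rough picture of what the u estimation step does, the sketch below takes the cross product of two small lists and measures how often a comparison level occurs among the resulting pairs. It treats every pair as a non-match, which is a reasonable approximation only when true matches are rare (about one pair in 296 in this example). The names in the lists are invented toy data, not drawn from `fake_1000`, and `comparison_level` is a hypothetical two-level stand-in for the richer levels in `cl.NameComparison`.

```python
from collections import Counter
from itertools import product

# Invented toy first names, used only for illustration.
left_names = ["Evie", "Freya", "Jessica", "Oliver"]
right_names = ["Freya", "Evie", "Noah", "Jessica", "Amelia"]


def comparison_level(left: str, right: str) -> str:
    """Hypothetical two-level comparison: exact match or anything else."""
    return "exact_match" if left == right else "all_other"


# Every left record paired with every right record.
pairs = list(product(left_names, right_names))

# Proportion of pairs falling into each level approximates the u probability.
level_counts = Counter(comparison_level(l, r) for l, r in pairs)
u_estimates = {level: count / len(pairs) for level, count in level_counts.items()}

print(u_estimates)
```

In a real run the sample is far larger and is capped by `max_pairs`, which is why the warning above suggests raising it for more accurate estimates.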
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("email")
)
session_first_name = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name")
)
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.418 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.104 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0711 in the m_probability of first_name, level `All other comparisons`
Iteration 4: Largest change in params was 0.0237 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.0093 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00407 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.0019 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000916 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000449 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000222 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.00011 in probability_two_random_records_match
Iteration 12: Largest change in params was 5.46e-05 in probability_two_random_records_match

EM converged after 12 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - dob (no m values are trained).
    - email (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."email" = r."email"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - dob
    - city

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - email

Iteration 1: Largest change in params was -0.483 in the m_probability of dob, level `Exact match on dob`
Iteration 2: Largest change in params was 0.0905 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.02 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00718 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.0031 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00148 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.000737 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000377 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000196 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000102 in probability_two_random_records_match
Iteration 11: Largest change in params was 5.37e-05 in probability_two_random_records_match

EM converged after 11 iterations

Your model is not yet fully trained. Missing estimates for:
    - email (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - dob
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name

Iteration 1: Largest change in params was -0.169 in the m_probability of surname, level `All other comparisons`
Iteration 2: Largest change in params was -0.0127 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was -0.00388 in the m_probability of surname, level `All other comparisons`
Iteration 4: Largest change in params was -0.00164 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 5: Largest change in params was -0.00089 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 6: Largest change in params was -0.000454 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 7: Largest change in params was -0.000225 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 8: Largest change in params was -0.00011 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 9: Largest change in params was -5.31e-05 in the m_probability of email, level `Jaro-Winkler >0.88 on username`

EM converged after 9 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values
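Three sessions are run because, as the log notes, a comparison cannot be trained in a session that blocks on its own column. The sketch below restates that bookkeeping with plain Python sets, using the comparison names and blocking columns from the log above. It does not call Splink, and the variable names are illustrative.

```python
# Comparisons defined in the settings.
comparisons = {"first_name", "surname", "dob", "city", "email"}

# Column each EM session blocks on, in the order the sessions were run.
session_blocking_columns = ["dob", "email", "first_name"]

trained = set()
for blocked_column in session_blocking_columns:
    # A session can estimate m values for every comparison except the one
    # whose column is fixed by the blocking rule.
    trainable = comparisons - {blocked_column}
    trained |= trainable
    missing = sorted(comparisons - trained)
    print(f"after blocking on {blocked_column}: still missing {missing}")

untrained = comparisons - trained
```

After the first session only dob is left without any m estimate, and the second session covers it. In the log, the third session is still useful because it supplies the email level that was not observed when blocking on dob.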
results = linker.inference.predict(threshold_match_probability=0.9)
results.as_pandas_dataframe(limit=5)
   match_weight  match_probability  source_dataset_l  source_dataset_r  unique_id_l  unique_id_r  \
0      3.180767           0.900674           df_left          df_right          242          240
1      3.180767           0.900674           df_left          df_right          241          240
2      3.212523           0.902626           df_left          df_right          679          682
3      3.224126           0.903331           df_left          df_right          576          580
4      3.224126           0.903331           df_left          df_right          577          580

   first_name_l  first_name_r  gamma_first_name  surname_l  ...       dob_l       dob_r  gamma_dob  \
0         Freya         Freya                 4       Shah  ...  1970-12-17  1970-12-16          4
1         Freya         Freya                 4       None  ...  1970-12-17  1970-12-16          4
2     Elizabeth     Elizabeth                 4       Shaw  ...  2006-04-21  2016-04-18          1
3       Jessica       Jessica                 4       None  ...  1974-11-17  1974-12-17          4
4       Jessica       Jessica                 4       None  ...  1974-11-17  1974-12-17          4

    city_l    city_r  gamma_city                   email_l                 email_r  gamma_email  match_key
0   Lonnod    noLdon           0                      None                    None           -1          0
1   London    noLdon           0             f.s@flynn.com                    None           -1          0
2  Cardiff  Cardifrf           0     e.shaw@smith-hall.biz  e.shaw@smith-hall.lbiz            3          0
3     None   Walsall          -1  jesscac.owen@elliott.org                    None           -1          0
4     None   Walsall          -1  jessica.owen@elliott.org                    None           -1          0

5 rows × 22 columns
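The match_weight and match_probability columns carry the same information on different scales. The sketch below assumes the match weight is the base-2 logarithm of the odds of a match, which is consistent with the values printed above. The helper names are hypothetical, and the inputs are copied from the first rows of the results and from the `threshold_match_probability=0.9` argument.

```python
import math


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a log2-odds match weight into a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability: float) -> float:
    """Convert a probability into a log2-odds match weight."""
    return math.log2(probability / (1.0 - probability))


# Values taken from rows 0 and 2 of the results table.
print(round(match_weight_to_probability(3.180767), 6))
print(round(match_weight_to_probability(3.212523), 6))

# The 0.9 probability threshold expressed as a match weight (log2 of 9).
print(round(probability_to_match_weight(0.9), 4))
```

Seen this way, the five rows shown sit just above the threshold: a probability of 0.9 corresponds to a match weight of roughly 3.17, and the listed weights range from about 3.18 to 3.22.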