Linking two tables of persons

Linking without deduplication¶

A simple record linkage model using the link_only link type.

from splink.datasets import splink_datasets
df = splink_datasets.fake_1000

# Split a simple dataset into two, separate datasets which can be linked together.
df_l = df.sample(frac=0.5)
df_r = df.drop(df_l.index)

df_l.head(2)

	unique_id	first_name	surname	dob	city	email	cluster
681	681	Elizabeth	Sahw	2006-04-21	NaN	e.shaw@smith-hall.biz	174
655	655	Dylan	Robert	1990-10-26	Birmingham	NaN	166

from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl


settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name",),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],       
}

linker = DuckDBLinker([df_l, df_r], settings, input_table_aliases=["df_left", "df_right"])

linker.completeness_chart(cols=["first_name", "surname", "dob", "city", "email"])

deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)

Probability two random records match is estimated to be  0.00326.
This means that amongst all possible pairwise record comparisons, one in 306.48 are expected to match.  With 250,000 total possible comparisons, we expect a total of around 815.71 matching pairs

linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)

----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).

session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(block_on("email"))
session_first_name = linker.estimate_parameters_using_expectation_maximisation(block_on("first_name"))

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

Iteration 1: Largest change in params was -0.434 in the m_probability of surname, level `Exact match surname`
Iteration 2: Largest change in params was 0.122 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0454 in the m_probability of first_name, level `All other comparisons`
Iteration 4: Largest change in params was 0.0153 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00601 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00259 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00117 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000537 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000249 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000116 in probability_two_random_records_match
Iteration 11: Largest change in params was 5.41e-05 in probability_two_random_records_match

EM converged after 11 iterations

Your model is not yet fully trained. Missing estimates for:
    - dob (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."email" = r."email"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - dob
    - city

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - email

Iteration 1: Largest change in params was -0.492 in the m_probability of dob, level `Exact match`
Iteration 2: Largest change in params was 0.0862 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0181 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0064 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was 0.00312 in the m_probability of surname, level `All other comparisons`
Iteration 6: Largest change in params was 0.00176 in the m_probability of surname, level `All other comparisons`
Iteration 7: Largest change in params was 0.0011 in the m_probability of surname, level `All other comparisons`
Iteration 8: Largest change in params was 0.000735 in the m_probability of surname, level `All other comparisons`
Iteration 9: Largest change in params was 0.000516 in the m_probability of surname, level `All other comparisons`
Iteration 10: Largest change in params was 0.000374 in the m_probability of surname, level `All other comparisons`
Iteration 11: Largest change in params was 0.000279 in the m_probability of surname, level `All other comparisons`
Iteration 12: Largest change in params was 0.000212 in the m_probability of surname, level `All other comparisons`
Iteration 13: Largest change in params was 0.000164 in the m_probability of surname, level `All other comparisons`
Iteration 14: Largest change in params was 0.000128 in the m_probability of surname, level `All other comparisons`
Iteration 15: Largest change in params was 0.000101 in the m_probability of surname, level `All other comparisons`
Iteration 16: Largest change in params was 8.03e-05 in the m_probability of surname, level `All other comparisons`

EM converged after 16 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - dob
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name

Iteration 1: Largest change in params was -0.164 in the m_probability of surname, level `All other comparisons`
Iteration 2: Largest change in params was -0.00765 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was -0.00102 in the m_probability of surname, level `All other comparisons`
Iteration 4: Largest change in params was -0.000162 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was -2.68e-05 in the m_probability of surname, level `All other comparisons`

EM converged after 5 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

results = linker.predict(threshold_match_probability=0.9)

results.as_pandas_dataframe(limit=5)

	match_weight	match_probability	source_dataset_l	source_dataset_r	unique_id_l	unique_id_r	first_name_l	first_name_r	gamma_first_name	surname_l	...	dob_l	dob_r	gamma_dob	city_l	city_r	gamma_city	email_l	email_r	gamma_email
0	15.720147	0.999981	df_left	df_right	655	656	Dylan	Dylan	4	Robert	...	1990-10-26	1990-10-26	5	Birmingham	Birmingham	1	NaN	droberts73@taylor-lang.com	-1
1	18.533203	0.999997	df_left	df_right	686	685	Rosie	Rosie	4	Johnston	...	1978-11-23	1978-11-23	5	Sheffield	Sheffield	1	NaN	rosiej32@robinson-moran.net	-1
2	14.264530	0.999949	df_left	df_right	815	819	Logan	Logan	4	Morgan	...	1977-01-29	1976-12-30	3	Coventry	Coventry	1	NaN	loganmorgan43@icbride-kmng.com	-1
3	25.783560	1.000000	df_left	df_right	766	765	Adam	Adam	4	Edwards	...	1971-08-28	1971-08-28	5	Cardiff	Cardiff	1	adam.edwards38@bullock-edwards.com	adam.edwards38@bullock-edward.com	2
4	7.646308	0.995033	df_left	df_right	125	123	Harley	Harley	4	Kaur	...	1973-11-26	1973-10-27	3	Mancseter	Mancheeser	0	harleyk@houston.net	harleyk@houston.net	3

5 rows × 22 columns