Linking two tables of persons
Linking without deduplication
A simple record linkage model using the `link_only` link type.

With `link_only`, only between-dataset record comparisons are generated. No within-dataset record comparisons are created, meaning that the model does not attempt to find within-dataset duplicates.
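To make the difference concrete, the sketch below counts the comparisons each link type would consider before any blocking, using two hypothetical lists of record IDs. The `link_and_dedupe` count is included only for contrast, and the variable names are illustrative rather than part of Splink.

```python
from itertools import combinations, product

# Hypothetical record IDs for two small input tables (illustrative only).
left_ids = ["l1", "l2", "l3"]
right_ids = ["r1", "r2"]

# link_only: every pair takes one record from each table.
link_only_pairs = list(product(left_ids, right_ids))

# link_and_dedupe: any two distinct records, regardless of source table.
link_and_dedupe_pairs = list(combinations(left_ids + right_ids, 2))

# Pairs that fall within a single table are the ones link_only leaves out.
within_table_pairs = len(link_and_dedupe_pairs) - len(link_only_pairs)

print(len(link_only_pairs), len(link_and_dedupe_pairs), within_table_pairs)
```

With 500 records in each table, the same counting gives 500 × 500 = 250,000 between-dataset pairs.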
from splink import splink_datasets
df = splink_datasets.fake_1000
# Split a simple dataset into two separate datasets which can be linked together.
df_l = df.sample(frac=0.5)
df_r = df.drop(df_l.index)
df_l.head(2)
| | unique_id | first_name | surname | dob | city | email | cluster |
---|---|---|---|---|---|---|---|
922 | 922 | Evie | Jones | 2002-07-22 | NaN | eviejones@brewer-sparks.org | 230 |
224 | 224 | Logn | Feeruson | 2013-10-15 | London | l.fergson46@shah.com | 58 |
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on
settings = SettingsCreator(
    link_type="link_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
            invalid_dates_as_null=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
)

linker = Linker(
    [df_l, df_r],
    settings,
    db_api=DuckDBAPI(),
    input_table_aliases=["df_left", "df_right"],
)
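The two prediction blocking rules mean that a pair is only scored if it agrees exactly on first name or on surname. A rough pure-Python sketch of that candidate generation, on made-up records, is shown here. Splink performs this step in SQL on the chosen backend, so the code only illustrates the logic and the helper `agrees` is hypothetical.

```python
from itertools import product

# Made-up records; only the two blocking columns matter here.
left = [
    {"unique_id": 1, "first_name": "Evie", "surname": "Jones"},
    {"unique_id": 2, "first_name": "Freya", "surname": "Shah"},
]
right = [
    {"unique_id": 3, "first_name": "Evie", "surname": "Jonse"},
    {"unique_id": 4, "first_name": "Freyja", "surname": "Shah"},
    {"unique_id": 5, "first_name": "Oliver", "surname": "Smith"},
]


def agrees(rec_l, rec_r, col):
    # SQL-style equality: a null never equals anything, including another null.
    return rec_l[col] is not None and rec_l[col] == rec_r[col]


# Keep a cross-table pair if it satisfies at least one blocking rule.
candidate_pairs = [
    (rec_l["unique_id"], rec_r["unique_id"])
    for rec_l, rec_r in product(left, right)
    if agrees(rec_l, rec_r, "first_name") or agrees(rec_l, rec_r, "surname")
]

print(candidate_pairs)
```

Only two of the six possible pairs survive, which is the point of blocking: it trades a little recall for far fewer comparisons.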
from splink.exploratory import completeness_chart
completeness_chart(
    [df_l, df_r],
    cols=["first_name", "surname", "dob", "city", "email"],
    db_api=DuckDBAPI(),
    table_names_for_chart=["df_left", "df_right"],
)
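Completeness here is taken to mean the share of non-null values in each column. As a minimal sketch of that quantity, computed on made-up rows rather than the real data (the `completeness` helper is hypothetical):

```python
# Made-up rows; None stands in for a missing value.
rows = [
    {"first_name": "Evie", "city": None, "email": "evie@example.com"},
    {"first_name": "Logn", "city": "London", "email": None},
    {"first_name": None, "city": "London", "email": None},
    {"first_name": "Freya", "city": "Cardiff", "email": "f.s@example.com"},
]


def completeness(rows, col):
    # Fraction of rows where the column is populated.
    return sum(row[col] is not None for row in rows) / len(rows)


summary = {col: completeness(rows, col) for col in ["first_name", "city", "email"]}
print(summary)
```

Columns with low completeness contribute less evidence, since a comparison involving a null carries no information either way.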
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    block_on("email"),
]
linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
Probability two random records match is estimated to be 0.00338.
This means that amongst all possible pairwise record comparisons, one in 295.61 are expected to match. With 250,000 total possible comparisons, we expect a total of around 845.71 matching pairs
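The figures in this message can be reproduced with a few lines of arithmetic. The sketch assumes each input table has 500 rows (half of the 1,000-row dataset), and assumes the estimate is formed by dividing the number of pairs found by the deterministic rules by the stated recall. The count of pairs found is back-calculated from the output, not reported by Splink.

```python
n_left, n_right = 500, 500            # assumed sizes after the 50/50 split
total_comparisons = n_left * n_right  # link_only: one record from each table

expected_matches = 845.71             # taken from the message above
recall = 0.7                          # value passed to the estimator

probability = expected_matches / total_comparisons
one_in = total_comparisons / expected_matches

# Back-calculated assumption: pairs found by the deterministic rules.
implied_pairs_found = expected_matches * recall

print(total_comparisons, probability, one_in, implied_pairs_found)
```

A lower assumed recall would scale the estimate up, so the choice of `recall` directly affects the prior used by the model.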
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- first_name (no m values are trained).
- surname (no m values are trained).
- dob (no m values are trained).
- city (no m values are trained).
- email (no m values are trained).
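Because almost every randomly chosen pair is a non-match, the u probability for a comparison level can be approximated by how often random pairs fall into that level. The sketch below does this for an exact match on first name, enumerating all cross-table pairs of made-up values instead of sampling. It illustrates the idea only and is not how Splink implements the estimate.

```python
from itertools import product

# Made-up first names for two small tables.
left_names = ["Evie", "Freya", "Jessica", "Freya"]
right_names = ["Freya", "Elizabeth", "Jessica", "Evie", "Oliver"]

# Treat every cross-table pair as a (probable) non-match.
pairs = list(product(left_names, right_names))
agreeing = sum(name_l == name_r for name_l, name_r in pairs)

# Share of random pairs that agree exactly: an approximation of u.
u_exact_match = agreeing / len(pairs)
print(agreeing, len(pairs), u_exact_match)
```

Common values such as a frequent first name push u up, which is why an exact match on a common name is weaker evidence than one on a rare name.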
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("email")
)
session_first_name = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name")
)
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"
Parameter estimates will be made for the following comparison(s):
- first_name
- surname
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- dob
WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
Iteration 1: Largest change in params was -0.418 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.104 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0711 in the m_probability of first_name, level `All other comparisons`
Iteration 4: Largest change in params was 0.0237 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.0093 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00407 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.0019 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000916 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000449 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000222 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.00011 in probability_two_random_records_match
Iteration 12: Largest change in params was 5.46e-05 in probability_two_random_records_match
EM converged after 12 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- dob (no m values are trained).
- email (some m values are not trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."email" = r."email"
Parameter estimates will be made for the following comparison(s):
- first_name
- surname
- dob
- city
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- email
Iteration 1: Largest change in params was -0.483 in the m_probability of dob, level `Exact match on dob`
Iteration 2: Largest change in params was 0.0905 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.02 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00718 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.0031 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00148 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.000737 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000377 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000196 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000102 in probability_two_random_records_match
Iteration 11: Largest change in params was 5.37e-05 in probability_two_random_records_match
EM converged after 11 iterations
Your model is not yet fully trained. Missing estimates for:
- email (some m values are not trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"
Parameter estimates will be made for the following comparison(s):
- surname
- dob
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- first_name
Iteration 1: Largest change in params was -0.169 in the m_probability of surname, level `All other comparisons`
Iteration 2: Largest change in params was -0.0127 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was -0.00388 in the m_probability of surname, level `All other comparisons`
Iteration 4: Largest change in params was -0.00164 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 5: Largest change in params was -0.00089 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 6: Largest change in params was -0.000454 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 7: Largest change in params was -0.000225 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 8: Largest change in params was -0.00011 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
Iteration 9: Largest change in params was -5.31e-05 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
EM converged after 9 iterations
Your model is fully trained. All comparisons have at least one estimate for their m and u values
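Each session leaves out the comparison on the column it blocks on, which is why several sessions with different blocking rules are run. The bookkeeping below is a sketch of that coverage logic only. It ignores individual comparison levels, such as the email level that was never observed in the first session.

```python
comparisons = {"first_name", "surname", "dob", "city", "email"}

# Blocking column of each EM session, in the order they were run.
session_blocking_columns = ["dob", "email", "first_name"]

trained = set()
missing_after_each = []
for blocked in session_blocking_columns:
    # A session can estimate m values for every comparison except the blocked one.
    trained |= comparisons - {blocked}
    missing_after_each.append(sorted(comparisons - trained))

print(missing_after_each)
```

Two sessions with different blocking columns are already enough to cover every comparison at this level of detail; further sessions provide additional estimates for levels that were unobserved earlier.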
results = linker.inference.predict(threshold_match_probability=0.9)
results.as_pandas_dataframe(limit=5)
| | match_weight | match_probability | source_dataset_l | source_dataset_r | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | surname_l | ... | dob_l | dob_r | gamma_dob | city_l | city_r | gamma_city | email_l | email_r | gamma_email | match_key |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.180767 | 0.900674 | df_left | df_right | 242 | 240 | Freya | Freya | 4 | Shah | ... | 1970-12-17 | 1970-12-16 | 4 | Lonnod | noLdon | 0 | None | None | -1 | 0 |
1 | 3.180767 | 0.900674 | df_left | df_right | 241 | 240 | Freya | Freya | 4 | None | ... | 1970-12-17 | 1970-12-16 | 4 | London | noLdon | 0 | f.s@flynn.com | None | -1 | 0 |
2 | 3.212523 | 0.902626 | df_left | df_right | 679 | 682 | Elizabeth | Elizabeth | 4 | Shaw | ... | 2006-04-21 | 2016-04-18 | 1 | Cardiff | Cardifrf | 0 | e.shaw@smith-hall.biz | e.shaw@smith-hall.lbiz | 3 | 0 |
3 | 3.224126 | 0.903331 | df_left | df_right | 576 | 580 | Jessica | Jessica | 4 | None | ... | 1974-11-17 | 1974-12-17 | 4 | None | Walsall | -1 | jesscac.owen@elliott.org | None | -1 | 0 |
4 | 3.224126 | 0.903331 | df_left | df_right | 577 | 580 | Jessica | Jessica | 4 | None | ... | 1974-11-17 | 1974-12-17 | 4 | None | Walsall | -1 | jessica.owen@elliott.org | None | -1 | 0 |
5 rows × 22 columns
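The `match_weight` and `match_probability` columns carry the same information. Assuming the match weight is the base-2 logarithm of the odds of a match, the probabilities in the table can be recovered from the weights. The function name below is illustrative.

```python
def match_weight_to_probability(match_weight):
    # Convert log2 odds to odds, then odds to a probability.
    odds = 2 ** match_weight
    return odds / (1 + odds)


# Match weights copied from the first rows of the results table.
for weight in (3.180767, 3.212523, 3.224126):
    print(weight, match_weight_to_probability(weight))
```

Under this relationship a threshold of 0.9 on the match probability corresponds to a match weight of log2(9), roughly 3.17, which is why every row shown sits just above that value.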