Linking two tables of persons
Linking without deduplicationΒΆ
A simple record linkage model using the link_only
link type.
from splink.datasets import splink_datasets
df = splink_datasets.fake_1000
# Split a simple dataset into two, separate datasets which can be linked together.
df_l = df.sample(frac=0.5)
df_r = df.drop(df_l.index)
df_l.head(2)
unique_id | first_name | surname | dob | city | cluster | ||
---|---|---|---|---|---|---|---|
681 | 681 | Elizabeth | Sahw | 2006-04-21 | NaN | e.shaw@smith-hall.biz | 174 |
655 | 655 | Dylan | Robert | 1990-10-26 | Birmingham | NaN | 166 |
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
settings = {
"link_type": "link_only",
"blocking_rules_to_generate_predictions": [
block_on("first_name"),
block_on("surname"),
],
"comparisons": [
ctl.name_comparison("first_name",),
ctl.name_comparison("surname"),
ctl.date_comparison("dob", cast_strings_to_date=True),
cl.exact_match("city", term_frequency_adjustments=True),
ctl.email_comparison("email", include_username_fuzzy_level=False),
],
}
linker = DuckDBLinker([df_l, df_r], settings, input_table_aliases=["df_left", "df_right"])
linker.completeness_chart(cols=["first_name", "surname", "dob", "city", "email"])
deterministic_rules = [
"l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
"l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
"l.email = r.email"
]
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
Probability two random records match is estimated to be 0.00326.
This means that amongst all possible pairwise record comparisons, one in 306.48 are expected to match. With 250,000 total possible comparisons, we expect a total of around 815.71 matching pairs
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)
----- Estimating u probabilities using random sampling -----
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- first_name (no m values are trained).
- surname (no m values are trained).
- dob (no m values are trained).
- city (no m values are trained).
- email (no m values are trained).
session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(block_on("email"))
session_first_name = linker.estimate_parameters_using_expectation_maximisation(block_on("first_name"))
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"
Parameter estimates will be made for the following comparison(s):
- first_name
- surname
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- dob
Iteration 1: Largest change in params was -0.434 in the m_probability of surname, level `Exact match surname`
Iteration 2: Largest change in params was 0.122 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0454 in the m_probability of first_name, level `All other comparisons`
Iteration 4: Largest change in params was 0.0153 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00601 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00259 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00117 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000537 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000249 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000116 in probability_two_random_records_match
Iteration 11: Largest change in params was 5.41e-05 in probability_two_random_records_match
EM converged after 11 iterations
Your model is not yet fully trained. Missing estimates for:
- dob (no m values are trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."email" = r."email"
Parameter estimates will be made for the following comparison(s):
- first_name
- surname
- dob
- city
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- email
Iteration 1: Largest change in params was -0.492 in the m_probability of dob, level `Exact match`
Iteration 2: Largest change in params was 0.0862 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0181 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0064 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was 0.00312 in the m_probability of surname, level `All other comparisons`
Iteration 6: Largest change in params was 0.00176 in the m_probability of surname, level `All other comparisons`
Iteration 7: Largest change in params was 0.0011 in the m_probability of surname, level `All other comparisons`
Iteration 8: Largest change in params was 0.000735 in the m_probability of surname, level `All other comparisons`
Iteration 9: Largest change in params was 0.000516 in the m_probability of surname, level `All other comparisons`
Iteration 10: Largest change in params was 0.000374 in the m_probability of surname, level `All other comparisons`
Iteration 11: Largest change in params was 0.000279 in the m_probability of surname, level `All other comparisons`
Iteration 12: Largest change in params was 0.000212 in the m_probability of surname, level `All other comparisons`
Iteration 13: Largest change in params was 0.000164 in the m_probability of surname, level `All other comparisons`
Iteration 14: Largest change in params was 0.000128 in the m_probability of surname, level `All other comparisons`
Iteration 15: Largest change in params was 0.000101 in the m_probability of surname, level `All other comparisons`
Iteration 16: Largest change in params was 8.03e-05 in the m_probability of surname, level `All other comparisons`
EM converged after 16 iterations
Your model is fully trained. All comparisons have at least one estimate for their m and u values
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"
Parameter estimates will be made for the following comparison(s):
- surname
- dob
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- first_name
Iteration 1: Largest change in params was -0.164 in the m_probability of surname, level `All other comparisons`
Iteration 2: Largest change in params was -0.00765 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was -0.00102 in the m_probability of surname, level `All other comparisons`
Iteration 4: Largest change in params was -0.000162 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was -2.68e-05 in the m_probability of surname, level `All other comparisons`
EM converged after 5 iterations
Your model is fully trained. All comparisons have at least one estimate for their m and u values
results = linker.predict(threshold_match_probability=0.9)
results.as_pandas_dataframe(limit=5)
match_weight | match_probability | source_dataset_l | source_dataset_r | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | surname_l | ... | dob_l | dob_r | gamma_dob | city_l | city_r | gamma_city | email_l | email_r | gamma_email | match_key | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 15.720147 | 0.999981 | df_left | df_right | 655 | 656 | Dylan | Dylan | 4 | Robert | ... | 1990-10-26 | 1990-10-26 | 5 | Birmingham | Birmingham | 1 | NaN | droberts73@taylor-lang.com | -1 | 0 |
1 | 18.533203 | 0.999997 | df_left | df_right | 686 | 685 | Rosie | Rosie | 4 | Johnston | ... | 1978-11-23 | 1978-11-23 | 5 | Sheffield | Sheffield | 1 | NaN | rosiej32@robinson-moran.net | -1 | 0 |
2 | 14.264530 | 0.999949 | df_left | df_right | 815 | 819 | Logan | Logan | 4 | Morgan | ... | 1977-01-29 | 1976-12-30 | 3 | Coventry | Coventry | 1 | NaN | loganmorgan43@icbride-kmng.com | -1 | 0 |
3 | 25.783560 | 1.000000 | df_left | df_right | 766 | 765 | Adam | Adam | 4 | Edwards | ... | 1971-08-28 | 1971-08-28 | 5 | Cardiff | Cardiff | 1 | adam.edwards38@bullock-edwards.com | adam.edwards38@bullock-edward.com | 2 | 0 |
4 | 7.646308 | 0.995033 | df_left | df_right | 125 | 123 | Harley | Harley | 4 | Kaur | ... | 1973-11-26 | 1973-10-27 | 3 | Mancseter | Mancheeser | 0 | harleyk@houston.net | harleyk@houston.net | 3 | 0 |
5 rows Γ 22 columns