Evaluation when you have fully labelled data

In this example, our data contains a fully populated ground-truth column called cluster, which lets us analyse the accuracy of the final model.
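To make the role of the ground-truth column concrete: two records count as a true match exactly when they share the same cluster value. The sketch below counts the truly matching pairs implied by such a column. It uses a small made-up list of labels rather than the real dataset, and the names `count_matching_pairs` and `toy_clusters` are illustrative, not part of Splink.

```python
from collections import Counter
from math import comb


def count_matching_pairs(cluster_labels):
    """Count record pairs that share a cluster label (i.e. true matches)."""
    cluster_sizes = Counter(cluster_labels).values()
    # A cluster of k records contributes k choose 2 matching pairs.
    return sum(comb(size, 2) for size in cluster_sizes)


# Hypothetical labels: three records in cluster 0, two in cluster 1, one in cluster 2.
toy_clusters = [0, 0, 0, 1, 1, 2]
matching_pairs = count_matching_pairs(toy_clusters)
print(matching_pairs)
```

Every other pair of records is, by the same labels, a true non-match.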

from splink import splink_datasets

df = splink_datasets.fake_1000
df.head(2)
   unique_id first_name surname         dob city                email  cluster
0          0     Robert    Alan  1971-06-24  NaN  robert255@smith.net        0
1          1     Robert   Allen  1971-05-24  NaN  roberta25@smith.net        0
from splink import SettingsCreator, Linker, block_on, DuckDBAPI

import splink.comparison_library as cl

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("email"),
    ],
    comparisons=[
        cl.ForenameSurnameComparison("first_name", "surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    retain_intermediate_calculation_columns=True,
)
db_api = DuckDBAPI()
linker = Linker(df, settings, db_api=db_api)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.7
)
Probability two random records match is estimated to be  0.00333.
This means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs
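The figures in this message follow from simple arithmetic, sketched below. Among n records there are n(n-1)/2 distinct pairs, and the expected number of matching pairs is that total multiplied by the match probability. The variable names are illustrative, and because the printed "one in 300.13" is itself rounded, the result only approximately reproduces the 1,664.29 reported above.

```python
n_records = 1000

# Number of distinct pairwise comparisons in a dedupe of n records.
total_pairs = n_records * (n_records - 1) // 2

# "One in 300.13" pairs is expected to match (rounded value from the log).
one_in = 300.13
prob_two_random_records_match = 1 / one_in

expected_matching_pairs = total_pairs * prob_two_random_records_match

print(total_pairs)
print(round(prob_two_random_records_match, 5))
print(round(expected_matching_pairs, 1))
```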
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name_surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob"), estimate_without_term_frequencies=True
)
session_email = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("email"), estimate_without_term_frequencies=True
)
session_name = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname"), estimate_without_term_frequencies=True
)
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name_surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.751 in the m_probability of first_name_surname, level `(Exact match on first_name) AND (Exact match on surname)`
Iteration 2: Largest change in params was 0.196 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0536 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0189 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00731 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.0029 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00116 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000469 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000189 in probability_two_random_records_match
Iteration 10: Largest change in params was 7.62e-05 in probability_two_random_records_match

EM converged after 10 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - dob (no m values are trained).
    - email (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."email" = r."email"

Parameter estimates will be made for the following comparison(s):
    - first_name_surname
    - dob
    - city

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - email

Iteration 1: Largest change in params was -0.438 in the m_probability of dob, level `Exact match on dob`
Iteration 2: Largest change in params was 0.122 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0286 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.01 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00448 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00237 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.0014 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000893 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000597 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000413 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000292 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000211 in probability_two_random_records_match
Iteration 13: Largest change in params was 0.000154 in probability_two_random_records_match
Iteration 14: Largest change in params was 0.000113 in probability_two_random_records_match
Iteration 15: Largest change in params was 8.4e-05 in probability_two_random_records_match

EM converged after 15 iterations

Your model is not yet fully trained. Missing estimates for:
    - email (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")

Parameter estimates will be made for the following comparison(s):
    - dob
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name_surname

WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.473 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.0452 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.00766 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00135 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00025 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.000468 in the m_probability of email, level `All other comparisons`
Iteration 7: Largest change in params was 0.00776 in the m_probability of email, level `All other comparisons`
Iteration 8: Largest change in params was 0.00992 in the m_probability of email, level `All other comparisons`
Iteration 9: Largest change in params was 0.00277 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000972 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000337 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000118 in probability_two_random_records_match
Iteration 13: Largest change in params was 4.14e-05 in probability_two_random_records_match

EM converged after 13 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - email (some m values are not trained).
linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", output_type="table"
).as_pandas_dataframe(limit=5)
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
truth_threshold match_probability total_clerical_labels p n tp tn fp fn P_rate ... precision recall specificity npv accuracy f1 f2 f0_5 p4 phi
0 -17.8 0.000004 499500.0 2031.0 497469.0 1650.0 495130.0 2339.0 381.0 0.004066 ... 0.413638 0.812408 0.995298 0.999231 0.994555 0.548173 0.681086 0.458665 0.707466 0.577474
1 -17.7 0.000005 499500.0 2031.0 497469.0 1650.0 495225.0 2244.0 381.0 0.004066 ... 0.423729 0.812408 0.995489 0.999231 0.994745 0.556962 0.686470 0.468564 0.714769 0.584558
2 -17.1 0.000007 499500.0 2031.0 497469.0 1650.0 495311.0 2158.0 381.0 0.004066 ... 0.433298 0.812408 0.995662 0.999231 0.994917 0.565165 0.691418 0.477901 0.721512 0.591197
3 -17.0 0.000008 499500.0 2031.0 497469.0 1650.0 495354.0 2115.0 381.0 0.004066 ... 0.438247 0.812408 0.995748 0.999231 0.995003 0.569358 0.693919 0.482710 0.724931 0.594601
4 -16.9 0.000008 499500.0 2031.0 497469.0 1650.0 495386.0 2083.0 381.0 0.004066 ... 0.442004 0.812408 0.995813 0.999231 0.995067 0.572519 0.695792 0.486353 0.727497 0.597173

5 rows × 25 columns
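The metric columns in this table can be checked against the confusion-matrix counts (tp, tn, fp, fn) using the standard definitions. The sketch below recomputes a few of them for the first row; `confusion_metrics` is a helper written for this illustration and is not a Splink function.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Counts taken from the first row of the table (truth_threshold = -17.8).
row0 = confusion_metrics(tp=1650, tn=495130, fp=2339, fn=381)
for name, value in row0.items():
    print(f"{name}: {value:.6f}")
```

At such a low threshold, recall is high while precision is poor; raising the threshold trades false positives for false negatives.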

linker.evaluation.accuracy_analysis_from_labels_column("cluster", output_type="roc")
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster",
    output_type="threshold_selection",
    threshold_match_probability=0.5,
    add_metrics=["f1"],
)
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
# Show some prediction errors (false positives and false negatives)
linker.evaluation.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_pandas_dataframe(limit=5)
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
clerical_match_score found_by_blocking_rules match_weight match_probability unique_id_l unique_id_r surname_l surname_r first_name_l first_name_r ... email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email cluster_l cluster_r match_key
0 1.0 False -15.568945 0.000021 452 454 Daves Reuben None Davies ... rd@lewis.com idlewrs.cocm 0 0.003802 0.001267 0.01099 1.0 115 115 4
1 1.0 False -14.884057 0.000033 715 717 Joes Jones None Mia ... None mia.j63@martinez.biz -1 NaN 0.005070 1.00000 1.0 182 182 4
2 1.0 False -14.884057 0.000033 626 628 Davidson None geeorGe Geeorge ... None gdavidson@johnson-brown.com -1 NaN 0.005070 1.00000 1.0 158 158 4
3 1.0 False -13.761589 0.000072 983 984 Milller Miller Jessica aessicJ ... None jessica.miller@johnson.com -1 NaN 0.007605 1.00000 1.0 246 246 4
4 1.0 True -11.637585 0.000314 594 595 Kik Kiirk Grace Grace ... gk@frey-robinson.org rgk@frey-robinon.org 0 0.001267 0.001267 0.01099 1.0 146 146 0

5 rows × 38 columns
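The match_weight and match_probability columns carry the same information on different scales: the match weight is the base-2 logarithm of the match odds, so probability = 2^weight / (1 + 2^weight). The sketch below applies that conversion to two rows from the table above; `match_weight_to_probability` is an illustrative helper, not a Splink API call.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# Match weights copied from the first and last rows of the table above.
for weight in (-15.568945, -11.637585):
    print(weight, round(match_weight_to_probability(weight), 6))

# A match weight of 0 means even odds.
print(match_weight_to_probability(0.0))
```

Under this conversion, the threshold_match_probability=0.5 used in the threshold selection chart corresponds to a match weight of 0.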

records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_record_dict(limit=5)

linker.visualisations.waterfall_chart(records)
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained