Evaluation from ground truth column

Evaluation when you have fully labelled data

In this example, our data contains a fully populated ground-truth column called cluster, which lets us analyse the accuracy of the final model.

from splink import splink_datasets

df = splink_datasets.fake_1000
df.head(2)

	unique_id	first_name	surname	dob	city	email	cluster
0	0	Robert	Alan	1971-06-24	NaN	robert255@smith.net	0
1	1	Robert	Allen	1971-05-24	NaN	roberta25@smith.net	0

from splink import SettingsCreator, Linker, block_on, DuckDBAPI

import splink.comparison_library as cl

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("email"),
    ],
    comparisons=[
        cl.ForenameSurnameComparison("first_name", "surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    retain_intermediate_calculation_columns=True,
)

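db_api = DuckDBAPI()
linker = Linker(df, settings, db_api=db_api)

# High-precision deterministic rules; combined with an assumed recall they
# give an estimate of the probability that two random records match.
deterministic_rules = [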
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.7
)

Probability two random records match is estimated to be  0.00333.
This means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs
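The figures in this output can be reproduced with a little arithmetic. The sketch below is not Splink code: it assumes the 1,000 records of fake_1000 and works backwards from the printed total to infer that the deterministic rules found roughly 1,165 pairs. That count is an inference from the output, not something Splink printed.

```python
# Worked arithmetic behind the prior estimate (illustrative, not Splink code).
n_records = 1000
total_comparisons = n_records * (n_records - 1) // 2  # unordered pairs

# ASSUMPTION: inferred from the printed output, not reported by Splink.
pairs_found_by_rules = 1165
recall = 0.7  # the recall passed to the estimator above

# If the rules find only 70% of true matches, scale the count up accordingly.
expected_matches = pairs_found_by_rules / recall
prob_two_random_records_match = expected_matches / total_comparisons

print(f"{total_comparisons:,} total comparisons")
print(f"{expected_matches:,.2f} expected matching pairs")
print(f"probability = {prob_two_random_records_match:.5f}")
print(f"one in {1 / prob_two_random_records_match:.2f} comparisons")
```

A lower assumed recall would scale the estimated number of matches, and therefore the prior, upwards.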

linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)

You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name_surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).

session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob"), estimate_without_term_frequencies=True
)
session_email = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("email"), estimate_without_term_frequencies=True
)
session_name = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname"), estimate_without_term_frequencies=True
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name_surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.751 in the m_probability of first_name_surname, level `(Exact match on first_name) AND (Exact match on surname)`
Iteration 2: Largest change in params was 0.196 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0536 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0189 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00731 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.0029 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00116 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000469 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000189 in probability_two_random_records_match
Iteration 10: Largest change in params was 7.62e-05 in probability_two_random_records_match

EM converged after 10 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - dob (no m values are trained).
    - email (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."email" = r."email"

Parameter estimates will be made for the following comparison(s):
    - first_name_surname
    - dob
    - city

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - email

Iteration 1: Largest change in params was -0.438 in the m_probability of dob, level `Exact match on dob`
Iteration 2: Largest change in params was 0.122 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0286 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.01 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00448 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00237 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.0014 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000893 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000597 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000413 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000292 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000211 in probability_two_random_records_match
Iteration 13: Largest change in params was 0.000154 in probability_two_random_records_match
Iteration 14: Largest change in params was 0.000113 in probability_two_random_records_match
Iteration 15: Largest change in params was 8.4e-05 in probability_two_random_records_match

EM converged after 15 iterations

Your model is not yet fully trained. Missing estimates for:
    - email (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")

Parameter estimates will be made for the following comparison(s):
    - dob
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name_surname

WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.473 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.0452 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.00766 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00135 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00025 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.000468 in the m_probability of email, level `All other comparisons`
Iteration 7: Largest change in params was 0.00776 in the m_probability of email, level `All other comparisons`
Iteration 8: Largest change in params was 0.00992 in the m_probability of email, level `All other comparisons`
Iteration 9: Largest change in params was 0.00277 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000972 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000337 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000118 in probability_two_random_records_match
Iteration 13: Largest change in params was 4.14e-05 in probability_two_random_records_match

EM converged after 13 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - email (some m values are not trained).
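The logs show why three EM sessions are used: a comparison whose columns appear in the blocking rule cannot be estimated in that session. As a rough check, the sketch below uses plain Python sets (comparison names copied from the logs, no Splink calls) to confirm that the three sessions together cover every comparison.

```python
# Which comparisons each EM session can estimate m values for.
all_comparisons = {"first_name_surname", "dob", "city", "email"}

# Each blocking rule fixes one comparison, which that session cannot train.
blocked_per_session = {
    'block_on("dob")': {"dob"},
    'block_on("email")': {"email"},
    'block_on("first_name", "surname")': {"first_name_surname"},
}

trained = set()
for rule, blocked in blocked_per_session.items():
    trainable = all_comparisons - blocked
    trained |= trainable
    print(f"{rule}: trains {sorted(trainable)}")

never_trained = all_comparisons - trained
print("never trained:", sorted(never_trained))
```

Coverage at the comparison level does not guarantee that every level is trained: here one email level is still missing because it was never observed in the training data.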

linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", output_type="table"
).as_pandas_dataframe(limit=5)

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained

	truth_threshold	match_probability	total_clerical_labels	p	n	tp	tn	fp	fn	P_rate	...	precision	recall	specificity	npv	accuracy	f1	f2	f0_5	p4	phi
0	-17.8	0.000004	499500.0	2031.0	497469.0	1650.0	495130.0	2339.0	381.0	0.004066	...	0.413638	0.812408	0.995298	0.999231	0.994555	0.548173	0.681086	0.458665	0.707466	0.577474
1	-17.7	0.000005	499500.0	2031.0	497469.0	1650.0	495225.0	2244.0	381.0	0.004066	...	0.423729	0.812408	0.995489	0.999231	0.994745	0.556962	0.686470	0.468564	0.714769	0.584558
2	-17.1	0.000007	499500.0	2031.0	497469.0	1650.0	495311.0	2158.0	381.0	0.004066	...	0.433298	0.812408	0.995662	0.999231	0.994917	0.565165	0.691418	0.477901	0.721512	0.591197
3	-17.0	0.000008	499500.0	2031.0	497469.0	1650.0	495354.0	2115.0	381.0	0.004066	...	0.438247	0.812408	0.995748	0.999231	0.995003	0.569358	0.693919	0.482710	0.724931	0.594601
4	-16.9	0.000008	499500.0	2031.0	497469.0	1650.0	495386.0	2083.0	381.0	0.004066	...	0.442004	0.812408	0.995813	0.999231	0.995067	0.572519	0.695792	0.486353	0.727497	0.597173

5 rows × 25 columns
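The derived columns follow from the confusion-matrix counts using the standard definitions. As a sanity check, this sketch recomputes several of them from row 0 of the table; it uses textbook formulas rather than Splink's own implementation.

```python
# Confusion-matrix counts copied from row 0 of the table above.
tp, tn, fp, fn = 1650, 495130, 2339, 381

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)


def f_beta(beta: float) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta**2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


print(f"precision={precision:.6f}  recall={recall:.6f}")
print(f"specificity={specificity:.6f}  npv={npv:.6f}  accuracy={accuracy:.6f}")
print(f"f1={f_beta(1):.6f}  f2={f_beta(2):.6f}  f0_5={f_beta(0.5):.6f}")
```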

linker.evaluation.accuracy_analysis_from_labels_column("cluster", output_type="roc")

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
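A ROC curve plots the true positive rate against the false positive rate at each threshold. The sketch below derives those points from the five rows of the accuracy table printed earlier; it illustrates the calculation only and is not how Splink builds its chart.

```python
# (truth_threshold, tp, fp) copied from the first five rows of the table.
rows = [
    (-17.8, 1650, 2339),
    (-17.7, 1650, 2244),
    (-17.1, 1650, 2158),
    (-17.0, 1650, 2115),
    (-16.9, 1650, 2083),
]
p, n = 2031, 497469  # labelled positives and negatives (columns p and n)

# Each ROC point is (false positive rate, true positive rate).
roc_points = [(fp / n, tp / p) for _, tp, fp in rows]

for (threshold, _, _), (fpr, tpr) in zip(rows, roc_points):
    print(f"threshold={threshold:6.1f}  fpr={fpr:.6f}  tpr={tpr:.6f}")
```

Over these rows, raising the threshold removes false positives without losing any true positives, so the points move left along a horizontal line.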

linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster",
    output_type="threshold_selection",
    threshold_match_probability=0.5,
    add_metrics=["f1"],
)

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
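The truth_threshold column is expressed as a match weight (log2 odds), while threshold_match_probability is a probability. The two are related by p = 2^w / (1 + 2^w), so a probability of 0.5 corresponds to a match weight of 0. The helper names below are our own for illustration, not Splink functions.

```python
import math


def match_weight_to_probability(weight: float) -> float:
    """Convert a match weight (log2 odds) into a match probability."""
    odds = 2**weight
    return odds / (1 + odds)


def probability_to_match_weight(probability: float) -> float:
    """Convert a match probability into a match weight (log2 odds)."""
    return math.log2(probability / (1 - probability))


print(probability_to_match_weight(0.5))
print(f"{match_weight_to_probability(-17.8):.6f}")
print(f"{match_weight_to_probability(-16.9):.6f}")
```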

# Show a sample of prediction errors (false positives and false negatives) as a table
linker.evaluation.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_pandas_dataframe(limit=5)

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained

	clerical_match_score	found_by_blocking_rules	match_weight	match_probability	unique_id_l	unique_id_r	surname_l	surname_r	first_name_l	first_name_r	...	email_l	email_r	gamma_email	tf_email_l	tf_email_r	bf_email	bf_tf_adj_email	cluster_l	cluster_r	match_key
0	1.0	False	-15.568945	0.000021	452	454	Daves	Reuben	None	Davies	...	rd@lewis.com	idlewrs.cocm	0	0.003802	0.001267	0.01099	1.0	115	115	4
1	1.0	False	-14.884057	0.000033	715	717	Joes	Jones	None	Mia	...	None	mia.j63@martinez.biz	-1	NaN	0.005070	1.00000	1.0	182	182	4
2	1.0	False	-14.884057	0.000033	626	628	Davidson	None	geeorGe	Geeorge	...	None	gdavidson@johnson-brown.com	-1	NaN	0.005070	1.00000	1.0	158	158	4
3	1.0	False	-13.761589	0.000072	983	984	Milller	Miller	Jessica	aessicJ	...	None	jessica.miller@johnson.com	-1	NaN	0.007605	1.00000	1.0	246	246	4
4	1.0	True	-11.637585	0.000314	594	595	Kik	Kiirk	Grace	Grace	...	gk@frey-robinson.org	rgk@frey-robinon.org	0	0.001267	0.001267	0.01099	1.0	146	146	0

5 rows × 38 columns
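Each row in this table is a pair where the model's prediction disagrees with the cluster label. The sketch below shows the logic on two rows from the table. It assumes a match probability threshold of 0.5 and a "greater than or equal" comparison; the function is hypothetical, and Splink's exact boundary handling may differ.

```python
def classify_pair(
    cluster_l: int,
    cluster_r: int,
    match_probability: float,
    threshold: float = 0.5,  # ASSUMPTION: threshold used to call a match
) -> str:
    """Label a scored pair against the ground-truth cluster column."""
    is_true_match = cluster_l == cluster_r
    predicted_match = match_probability >= threshold
    if is_true_match and not predicted_match:
        return "false negative"
    if predicted_match and not is_true_match:
        return "false positive"
    return "correct"


# (unique_id_l, unique_id_r, cluster_l, cluster_r, match_probability)
pairs = [
    (452, 454, 115, 115, 0.000021),
    (594, 595, 146, 146, 0.000314),
]
for id_l, id_r, c_l, c_r, prob in pairs:
    print(id_l, id_r, classify_pair(c_l, c_r, prob))
```

Both pairs share a cluster but score far below the threshold, so they are false negatives; the first was also missed by every blocking rule (found_by_blocking_rules is False).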

records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_record_dict(limit=5)

linker.visualisations.waterfall_chart(records)

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
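A waterfall chart starts from the prior match weight and adds log2 of each comparison's Bayes factor to reach the final match weight. The sketch below mimics that accumulation. Only the prior and the email Bayes factor come from the output above; the other Bayes factors are hypothetical values chosen for illustration, and term frequency adjustments are ignored.

```python
import math

prior = 0.00333  # probability two random records match, printed earlier

bayes_factors = {
    "first_name_surname": 0.2,  # HYPOTHETICAL value
    "dob": 0.05,                # HYPOTHETICAL value
    "city": 1.0,                # HYPOTHETICAL: a factor of 1 adds no evidence
    "email": 0.01099,           # bf_email from the prediction errors table
}

# Start at the prior match weight, then add one bar per comparison.
running_weight = math.log2(prior / (1 - prior))
print(f"{'prior':<20} {running_weight:+8.3f}")
for name, bf in bayes_factors.items():
    step = math.log2(bf)
    running_weight += step
    print(f"{name:<20} {step:+8.3f}  (running total {running_weight:+8.3f})")

final_probability = 2**running_weight / (1 + 2**running_weight)
print(f"final match probability = {final_probability:.6f}")
```

Because the bars add in log space, the final odds equal the prior odds multiplied by every Bayes factor.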