# Evaluation from ground truth column

## Evaluation when you have fully labelled data

In this example, our data contains a fully-populated ground-truth column called `cluster`, which enables us to perform accuracy analysis of the final model.
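To make the role of the `cluster` column concrete: two records are a true match exactly when they share a cluster label. The sketch below uses a small made-up list of labels (not the `fake_1000` data) and hypothetical helper names to count true-match pairs and total candidate pairs.

```python
from collections import Counter
from math import comb


def count_true_match_pairs(cluster_labels):
    """Number of record pairs that share a cluster label (hypothetical helper)."""
    cluster_sizes = Counter(cluster_labels).values()
    # A cluster of k records contributes k-choose-2 matching pairs.
    return sum(comb(size, 2) for size in cluster_sizes)


def count_all_pairs(n_records):
    """Number of distinct pairwise comparisons when deduplicating n records."""
    return comb(n_records, 2)


# Toy labels, invented for illustration: clusters of size 2, 3 and 1.
toy_clusters = [0, 0, 1, 1, 1, 2]
true_pairs = count_true_match_pairs(toy_clusters)
all_pairs = count_all_pairs(len(toy_clusters))
print(true_pairs, all_pairs)

# The same formula applied to a 1,000-record dataset.
print(count_all_pairs(1000))
```

For 1,000 records the second call gives 499,500 pairs, which is the total number of comparisons Splink reports further down this page.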

```
from splink import splink_datasets
df = splink_datasets.fake_1000
df.head(2)
```

| | unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 | NaN | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 | NaN | roberta25@smith.net | 0 |

```
from splink import SettingsCreator, Linker, block_on, DuckDBAPI
import splink.comparison_library as cl

settings = SettingsCreator(
    link_type="dedupe_only",
    # Candidate pairs are generated from any of these blocking rules
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("email"),
    ],
    comparisons=[
        cl.ForenameSurnameComparison("first_name", "surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    retain_intermediate_calculation_columns=True,
)
```

```
db_api = DuckDBAPI()
linker = Linker(df, settings, db_api=db_api)

# High-precision rules used to estimate the prior match probability
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.7
)
```

```
Probability two random records match is estimated to be 0.00333.
This means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match. With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs
```
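The three figures in this log line are tied together by simple arithmetic. The sketch below reproduces them under one assumption: that the deterministic rules found 1,165 pairs. That count is not printed by Splink; it is inferred here by working backwards from the logged 1,664.29 expected matches and `recall=0.7`.

```python
# Assumption: inferred from the log output, not reported by Splink directly.
pairs_found_by_rules = 1165
assumed_recall = 0.7  # the `recall` argument passed above

# 1,000 records deduplicated against themselves: n * (n - 1) / 2 comparisons.
total_comparisons = 1000 * 999 // 2

# Scale the deterministic matches up by the recall to estimate all true matches.
expected_matches = pairs_found_by_rules / assumed_recall
probability_two_random_records_match = expected_matches / total_comparisons

print(f"{total_comparisons:,}")
print(f"{expected_matches:,.2f}")
print(f"{probability_two_random_records_match:.3g}")
print(f"{1 / probability_two_random_records_match:.2f}")
```

A lower `recall` would scale the estimate up, so this argument is worth choosing with some care.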

```
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)
```

```
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- first_name_surname (no m values are trained).
- dob (no m values are trained).
- city (no m values are trained).
- email (no m values are trained).
```

```
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob"), estimate_without_term_frequencies=True
)
session_email = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("email"), estimate_without_term_frequencies=True
)
# Use a distinct name so the dob session above is not overwritten
session_name = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname"), estimate_without_term_frequencies=True
)
```

```
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"
Parameter estimates will be made for the following comparison(s):
- first_name_surname
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- dob
WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
Iteration 1: Largest change in params was -0.751 in the m_probability of first_name_surname, level `(Exact match on first_name) AND (Exact match on surname)`
Iteration 2: Largest change in params was 0.196 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0536 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0189 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00731 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.0029 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00116 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000469 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000189 in probability_two_random_records_match
Iteration 10: Largest change in params was 7.62e-05 in probability_two_random_records_match
EM converged after 10 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- dob (no m values are trained).
- email (some m values are not trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."email" = r."email"
Parameter estimates will be made for the following comparison(s):
- first_name_surname
- dob
- city
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- email
Iteration 1: Largest change in params was -0.438 in the m_probability of dob, level `Exact match on dob`
Iteration 2: Largest change in params was 0.122 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0286 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.01 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00448 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00237 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.0014 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000893 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000597 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000413 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000292 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000211 in probability_two_random_records_match
Iteration 13: Largest change in params was 0.000154 in probability_two_random_records_match
Iteration 14: Largest change in params was 0.000113 in probability_two_random_records_match
Iteration 15: Largest change in params was 8.4e-05 in probability_two_random_records_match
EM converged after 15 iterations
Your model is not yet fully trained. Missing estimates for:
- email (some m values are not trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")
Parameter estimates will be made for the following comparison(s):
- dob
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- first_name_surname
WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
Iteration 1: Largest change in params was 0.473 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.0452 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.00766 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00135 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00025 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.000468 in the m_probability of email, level `All other comparisons`
Iteration 7: Largest change in params was 0.00776 in the m_probability of email, level `All other comparisons`
Iteration 8: Largest change in params was 0.00992 in the m_probability of email, level `All other comparisons`
Iteration 9: Largest change in params was 0.00277 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000972 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.000337 in probability_two_random_records_match
Iteration 12: Largest change in params was 0.000118 in probability_two_random_records_match
Iteration 13: Largest change in params was 4.14e-05 in probability_two_random_records_match
EM converged after 13 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- email (some m values are not trained).
```

```
linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster", output_type="table"
).as_pandas_dataframe(limit=5)
```

```
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
m values not fully trained
```

| | truth_threshold | match_probability | total_clerical_labels | p | n | tp | tn | fp | fn | P_rate | ... | precision | recall | specificity | npv | accuracy | f1 | f2 | f0_5 | p4 | phi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -17.8 | 0.000004 | 499500.0 | 2031.0 | 497469.0 | 1650.0 | 495130.0 | 2339.0 | 381.0 | 0.004066 | ... | 0.413638 | 0.812408 | 0.995298 | 0.999231 | 0.994555 | 0.548173 | 0.681086 | 0.458665 | 0.707466 | 0.577474 |
| 1 | -17.7 | 0.000005 | 499500.0 | 2031.0 | 497469.0 | 1650.0 | 495225.0 | 2244.0 | 381.0 | 0.004066 | ... | 0.423729 | 0.812408 | 0.995489 | 0.999231 | 0.994745 | 0.556962 | 0.686470 | 0.468564 | 0.714769 | 0.584558 |
| 2 | -17.1 | 0.000007 | 499500.0 | 2031.0 | 497469.0 | 1650.0 | 495311.0 | 2158.0 | 381.0 | 0.004066 | ... | 0.433298 | 0.812408 | 0.995662 | 0.999231 | 0.994917 | 0.565165 | 0.691418 | 0.477901 | 0.721512 | 0.591197 |
| 3 | -17.0 | 0.000008 | 499500.0 | 2031.0 | 497469.0 | 1650.0 | 495354.0 | 2115.0 | 381.0 | 0.004066 | ... | 0.438247 | 0.812408 | 0.995748 | 0.999231 | 0.995003 | 0.569358 | 0.693919 | 0.482710 | 0.724931 | 0.594601 |
| 4 | -16.9 | 0.000008 | 499500.0 | 2031.0 | 497469.0 | 1650.0 | 495386.0 | 2083.0 | 381.0 | 0.004066 | ... | 0.442004 | 0.812408 | 0.995813 | 0.999231 | 0.995067 | 0.572519 | 0.695792 | 0.486353 | 0.727497 | 0.597173 |

5 rows × 25 columns
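The derived columns in this table follow from the four confusion-matrix counts using the standard definitions. As a quick check, the sketch below recomputes several of them for the first row (`tp=1650`, `tn=495130`, `fp=2339`, `fn=381`). The function name is hypothetical and is not part of Splink.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Counts copied from the first row of the table above.
row0 = confusion_metrics(tp=1650, tn=495130, fp=2339, fn=381)
for name, value in row0.items():
    print(f"{name}: {value:.6f}")
```

Note that recall stays constant across the five rows shown because `tp` and `fn` do not change at these very low thresholds; only `fp` falls as the threshold rises, which is why precision improves.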

```
linker.evaluation.accuracy_analysis_from_labels_column("cluster", output_type="roc")
```

```
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
m values not fully trained
```

```
linker.evaluation.accuracy_analysis_from_labels_column(
    "cluster",
    output_type="threshold_selection",
    threshold_match_probability=0.5,
    add_metrics=["f1"],
)
```

```
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
m values not fully trained
```

```
# Show some of the model's prediction errors (false positives and false negatives)
linker.evaluation.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_pandas_dataframe(limit=5)
```

```
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
m values not fully trained
```

| | clerical_match_score | found_by_blocking_rules | match_weight | match_probability | unique_id_l | unique_id_r | surname_l | surname_r | first_name_l | first_name_r | ... | email_l | email_r | gamma_email | tf_email_l | tf_email_r | bf_email | bf_tf_adj_email | cluster_l | cluster_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | False | -15.568945 | 0.000021 | 452 | 454 | Daves | Reuben | None | Davies | ... | rd@lewis.com | idlewrs.cocm | 0 | 0.003802 | 0.001267 | 0.01099 | 1.0 | 115 | 115 | 4 |
| 1 | 1.0 | False | -14.884057 | 0.000033 | 715 | 717 | Joes | Jones | None | Mia | ... | None | mia.j63@martinez.biz | -1 | NaN | 0.005070 | 1.00000 | 1.0 | 182 | 182 | 4 |
| 2 | 1.0 | False | -14.884057 | 0.000033 | 626 | 628 | Davidson | None | geeorGe | Geeorge | ... | None | gdavidson@johnson-brown.com | -1 | NaN | 0.005070 | 1.00000 | 1.0 | 158 | 158 | 4 |
| 3 | 1.0 | False | -13.761589 | 0.000072 | 983 | 984 | Milller | Miller | Jessica | aessicJ | ... | None | jessica.miller@johnson.com | -1 | NaN | 0.007605 | 1.00000 | 1.0 | 246 | 246 | 4 |
| 4 | 1.0 | True | -11.637585 | 0.000314 | 594 | 595 | Kik | Kiirk | Grace | Grace | ... | gk@frey-robinson.org | rgk@frey-robinon.org | 0 | 0.001267 | 0.001267 | 0.01099 | 1.0 | 146 | 146 | 0 |

5 rows × 38 columns
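The `match_weight` and `match_probability` columns carry the same information on different scales. The sketch below assumes the usual Splink convention that a match weight is the base-2 logarithm of the odds of a match, and checks it against two rows of the table above. The function name is illustrative, not a Splink API.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# Match weights copied from rows 0 and 4 of the table above, plus weight 0,
# which corresponds to the 0.5 probability threshold used earlier.
for weight in (-15.568945, -11.637585, 0.0):
    probability = match_weight_to_probability(weight)
    print(f"{weight:>11.6f} -> {probability:.6f}")
```

All five rows shown have a `clerical_match_score` of 1.0 but a match probability far below 0.5, so they are false negatives: true matches that the model scored as non-matches.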

```
records = linker.evaluation.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_record_dict(limit=5)

linker.visualisations.waterfall_chart(records)
```

```
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
m values not fully trained
```