Skip to content

unlinkables_chart

At a glance

Useful for: Looking at how many records have insufficient information to be linked to themselves.

API Documentation: unlinkables_chart()

What is needed to generate the chart? A trained Splink model

Worked Example

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets
import logging, sys
logging.disable(sys.maxsize)

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
}

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

blocking_rule_for_training = block_on(["first_name", "surname"])

linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

blocking_rule_for_training = block_on("dob")
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

linker.unlinkables_chart()
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

What the chart shows

The unlinkables_chart shows the proportion of records with insufficient information to be matched to themselves at differing match thresholds.

What the chart tooltip shows

This tooltip shows a number of statistics based on the match weight of the selected point of the line, including:

  • The chosen match weight and corresponding match probability.
  • The proportion of records of records that cannot be linked to themselves given the chosen match weight threshold for a match.

How to interpret the chart

This chart gives an indication of both data quality and/or model predictiveness within a Splink model. If a high proportion of records are not linkable to themselves at a low match threshold (e.g. 0 match weight/50% probability) we can conclude that either/or:

  • the data quality is low enough such that a significant proportion of records are unable to be linked to themselves
  • the parameters of the Splink model are such that features have not been assigned enough weight, and therefore will not perform well

This chart also gives an indication of the number of False Negatives (i.e. missed links) at a given threshold, assuming sufficient data quality. For example:

  • we know that a record should be linked to itself, so seeing that a match weight \(\approx\) 10 gives 16% of records unable to link to themselves
  • exact matches generally provide the strongest matches, therefore, we can expect that any "fuzzy" matches to have lower match scores. As a result, we can deduce that the propoertion of False Negatives will be higher than 16%.

Actions to take as a result of the chart

If the level of unlinkable records is extremely high at low match weight thresholds, you have a poorly performing model. This may be an issue that can be resolved by tweaking the models comparisons, but if the poor performance is primarily down to poor data quality, there is very little that can be done to improve the model.

When interpretted as an indicator of False Negatives, this chart can be used to establish an upper bound for match weight, depending on the propensity for False Negatives in the particular use case.