Defining and customising how record comparisons are made

A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field is different from that for a first name field.

By tailoring the definitions of similarity, linking models are better able to distinguish between different gradations of similarity, leading to more accurate data linking models.

Note that, for performance reasons, Splink requires the user to define a fixed number of discrete levels (gradations) of similarity, rather than working with a continuous similarity score.
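To illustrate the idea of discrete levels, here is a toy sketch (not Splink code) that buckets a continuous 0-1 similarity score into a small set of gradations; the thresholds and labels are arbitrary choices for illustration:

```python
# Illustrative only: bucketing a continuous similarity score into
# a small number of discrete levels, which is the idea behind
# Splink's ComparisonLevels. Thresholds here are arbitrary.
def similarity_level(score: float) -> str:
    """Map a 0-1 similarity score to a discrete gradation."""
    if score >= 1.0:
        return "exact match"
    elif score >= 0.9:
        return "close match"
    elif score >= 0.7:
        return "loose match"
    return "all other"

print(similarity_level(1.0))   # exact match
print(similarity_level(0.92))  # close match
print(similarity_level(0.5))   # all other
```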

Comparing information

Comparisons are defined on pairwise record comparisons. Suppose for instance your data contains first_name, surname and dob columns:

id  first_name  surname  dob
1   john        smith    1991-04-11
2   jon         smith    1991-04-17
3   john        smyth    1991-04-11

To compare these records, at the blocking stage, Splink will arrange them into a table of pairwise record comparisons:

id_l  id_r  first_name_l  first_name_r  surname_l  surname_r  dob_l       dob_r
1     2     john          jon           smith      smith      1991-04-11  1991-04-17
1     3     john          john          smith      smyth      1991-04-11  1991-04-11
2     3     jon           john          smith      smyth      1991-04-17  1991-04-11

When defining comparisons, we are defining rules that operate on each row of this latter table of pairwise comparisons.
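The pairwise table above can be sketched in plain Python. This is an illustrative toy only - Splink generates these comparisons in SQL on the chosen backend, and blocking rules would normally restrict which pairs are generated:

```python
from itertools import combinations

records = [
    {"id": 1, "first_name": "john", "surname": "smith", "dob": "1991-04-11"},
    {"id": 2, "first_name": "jon", "surname": "smith", "dob": "1991-04-17"},
    {"id": 3, "first_name": "john", "surname": "smyth", "dob": "1991-04-11"},
]

# Without blocking rules, every pair of records is compared once,
# producing _l and _r columns for each input column.
pairwise = []
for l, r in combinations(records, 2):
    row = {f"{k}_l": v for k, v in l.items()}
    row.update({f"{k}_r": v for k, v in r.items()})
    pairwise.append(row)

for row in pairwise:
    print(row)
```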

Comparisons and ComparisonLevels

A Splink model contains a collection of Comparisons and ComparisonLevels organised in a hierarchy. An example is as follows:

Data Linking Model
├── Comparison: Date of birth
│   ├── ComparisonLevel: Exact match
│   ├── ComparisonLevel: Up to one character difference
│   ├── ComparisonLevel: Up to three character difference
│   └── ComparisonLevel: All other
└── Comparison: Name
    ├── ComparisonLevel: Exact match on first name and surname
    ├── ComparisonLevel: Exact match on first name
    └── etc.

A fuller description of Comparisons and ComparisonLevels can be found here and here respectively.

How are these comparisons specified?

Three ways of specifying Comparisons

In Splink, there are three ways of specifying Comparisons:

  • Using pre-baked comparisons from a backend's ComparisonLibrary (most simple/succinct)
  • Composing pre-defined ComparisonLevels from a backend's ComparisonLevelLibrary
  • Writing a full spec of a Comparison by hand (most verbose/flexible)

Method 1: Using the ComparisonLibrary

The ComparisonLibrary for each backend (DuckDB, Spark, etc.) contains pre-baked similarity functions that cover many common use cases.

These functions generate an entire Comparison, composed of several ComparisonLevels.

The following provides an example of using the ComparisonLibrary for DuckDB:

from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

first_name_comparison = exact_match("first_name")
print(first_name_comparison.human_readable_description)
Comparison 'Exact match vs. anything else' of first_name.
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: first_name_l IS NULL OR first_name_r IS NULL
    - 'Exact match' with SQL rule: first_name_l = first_name_r
    - 'All other comparisons' with SQL rule: ELSE


Note that, under the hood, these functions generate a Python dictionary, which conforms to the underlying .json specification of a model:

first_name_comparison.as_dict()
{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': 'first_name_l IS NULL OR first_name_r IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': 'first_name_l = first_name_r',
   'label_for_charts': 'Exact match'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. anything else'}

We can now generate a second, more complex comparison:

from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)

dob_comparison = levenshtein_at_thresholds("dob", [1, 2])
print(dob_comparison.human_readable_description)
Comparison 'Exact match vs. levenshtein at thresholds 1, 2 vs. anything else' of dob.
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: dob_l IS NULL OR dob_r IS NULL
    - 'Exact match' with SQL rule: dob_l = dob_r
    - 'levenshtein <= 1' with SQL rule: levenshtein(dob_l, dob_r) <= 1
    - 'levenshtein <= 2' with SQL rule: levenshtein(dob_l, dob_r) <= 2
    - 'All other comparisons' with SQL rule: ELSE
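A Comparison's levels are evaluated like an ordered SQL CASE expression: each pairwise record is assigned to the first level whose condition is true, which is why the overlapping thresholds above (levenshtein <= 1, then <= 2) are unambiguous. A rough Python sketch of that cascade, illustrative only, using a hand-rolled edit distance and the labels from the output above:

```python
# Illustrative only: ComparisonLevels behave like an ordered SQL
# CASE WHEN - the first level whose condition holds wins.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Levels in priority order, mirroring the dob comparison above.
levels = [
    ("Null", lambda l, r: l is None or r is None),
    ("Exact match", lambda l, r: l == r),
    ("levenshtein <= 1", lambda l, r: levenshtein(l, r) <= 1),
    ("levenshtein <= 2", lambda l, r: levenshtein(l, r) <= 2),
]

def assign_level(dob_l, dob_r):
    for label, condition in levels:
        if condition(dob_l, dob_r):
            return label
    return "All other comparisons"  # the ELSE level

print(assign_level("1991-04-11", "1991-04-11"))  # Exact match
print(assign_level("1991-04-11", "1991-04-17"))  # levenshtein <= 1
```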


These Comparisons can be specified in a data linking model as follows:

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        exact_match("first_name"),
        levenshtein_at_thresholds("dob", [1, 2]),

    ]
}

Method 2: ComparisonLevels

ComparisonLevels provide a lower-level API that gives the user greater control over their comparisons.

For example, the user may wish to specify a comparison with levels for an exact match on the first_name field and on a dmetaphone (phonetic) transform of it.

The example below assumes the user has derived a column dmeta_first_name which contains the dmetaphone of the first name.

from splink.spark.spark_comparison_level_library import exact_match_level, null_level, else_level
from splink.spark.spark_comparison_library import levenshtein_at_thresholds

comparison_first_name = {
    'output_column_name': 'first_name',
    'comparison_description': 'First name jaro dmeta',
    'comparison_levels': [
        null_level("first_name"),
        exact_match_level("first_name", term_frequency_adjustments=True),
        exact_match_level("dmeta_first_name", term_frequency_adjustments=True),
        else_level()
    ],
}



from splink.comparison import Comparison
print(Comparison(comparison_first_name).human_readable_description)
Comparison 'First name jaro dmeta' of first_name and dmeta_first_name.
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: first_name_l IS NULL OR first_name_r IS NULL
    - 'Exact match' with SQL rule: first_name_l = first_name_r
    - 'Exact match' with SQL rule: dmeta_first_name_l = dmeta_first_name_r
    - 'All other comparisons' with SQL rule: ELSE


This can now be specified in the settings dictionary as follows:

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        comparison_first_name, # The comparison specified above using ComparisonLevels
        levenshtein_at_thresholds("dob", [1, 2], term_frequency_adjustments=True), # From comparison_library 

    ]
}

Method 3: Providing the spec as a dictionary

Ultimately, comparisons are specified as a dictionary which conforms to the formal jsonschema specification of the settings dictionary - see [here](https://github.com/moj-analytical-services/splink/blob/master/splink/files/settings_jsonschema.json) and here.

The library functions described above are convenience functions that provide a shorthand way to produce valid dictionaries.

For maximum control over your settings, you can specify your comparisons as a dictionary.

comparison_first_name = {
    "output_column_name": "first_name",
    "comparison_description": "First name jaro dmeta",
    "comparison_levels": [
        {
            "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "first_name_l = first_name_r",
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 1.0,
            "tf_minimum_u_value": 0.001,
        },
        {
            "sql_condition": "dmeta_first_name_l = dmeta_first_name_r",
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "dmeta_first_name",
            "tf_adjustment_weight": 1.0,
        },
        {
            "sql_condition": "jaro_winkler_sim(first_name_l, first_name_r) > 0.8",
            "label_for_charts": "Jaro-Winkler > 0.8",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 0.5,
            "tf_minimum_u_value": 0.001,
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],

}

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        comparison_first_name, # The comparison specified above using the dict
        levenshtein_at_thresholds("dob", [1, 2], term_frequency_adjustments=True), # From comparison_library 

    ]
}
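Since hand-written comparisons are plain dictionaries, it can be useful to sanity-check their structure before passing them to Splink. The sketch below is illustrative only - `check_comparison` is a hypothetical helper covering a few obvious requirements, not a substitute for validating against the full jsonschema:

```python
# Illustrative only: a minimal structural check for a hand-written
# comparison dictionary. The authoritative rules live in Splink's
# settings jsonschema; this covers just a few obvious requirements.
def check_comparison(comparison):
    problems = []
    levels = comparison.get("comparison_levels", [])
    if not levels:
        problems.append("no comparison_levels defined")
    for i, level in enumerate(levels):
        if "sql_condition" not in level:
            problems.append(f"level {i} is missing sql_condition")
    if levels and levels[-1].get("sql_condition") != "ELSE":
        problems.append("final level should be the ELSE (catch-all) level")
    return problems

example = {
    "output_column_name": "city",
    "comparison_levels": [
        {"sql_condition": "city_l = city_r", "label_for_charts": "Exact match"},
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}
print(check_comparison(example))  # [] if well-formed
```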