Defining and customising how record comparisons are made¶
A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field is different than for a first name field.
By tailoring the definitions of similarity, linking models are more effectively able to distinguish between different gradations of similarity, leading to more accurate data linking models.
Comparisons
and ComparisonLevels
¶
Recall that a Splink model contains a collection of Comparisons
and ComparisonLevels
organised in a hierarchy.
Each ComparisonLevel
defines the different gradations of similarity that make up a Comparison
.
An example is as follows:
Data Linking Model
├─-- Comparison: Date of birth
│ ├─-- ComparisonLevel: Exact match
│ ├─-- ComparisonLevel: Up to one character difference
│ ├─-- ComparisonLevel: Up to three character difference
│ ├─-- ComparisonLevel: All other
├─-- Comparison: Name
│ ├─-- ComparisonLevel: Exact match on first name and surname
│ ├─-- ComparisonLevel: Exact match on first name
│ ├─-- etc.
Three ways of specifying Comparisons¶
In Splink, there are three ways of specifying Comparisons
:
- Using 'out-of-the-box'
Comparison
s (Most simple/succinct) - Composing pre-defined
ComparisonLevels
- Writing a full dictionary spec of a
Comparison
by hand (most verbose/flexible)
Method 1: Using the ComparisonLibrary
¶
The ComparisonLibrary
contains pre-baked similarity functions that cover many common use cases.
These functions generate an entire Comparison
, composed of several ComparisonLevels
.
You can find a listing of all available Comparison
s at the page for its API documentation here
The following provides an example of using the ExactMatch
Comparison
, and producing the description (with associated SQL) for the duckdb
backend:
import splink.comparison_library as cl
first_name_comparison = cl.ExactMatch("first_name")
print(first_name_comparison.get_comparison("duckdb").human_readable_description)
Comparison 'ExactMatch' of "first_name".
Similarity is assessed using the following ComparisonLevels:
- 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
- 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
- 'All other comparisons' with SQL rule: ELSE
Note that, under the hood, these functions generate a Python dictionary, which conforms to the underlying .json
specification of a model:
first_name_comparison.get_comparison("duckdb").as_dict()
{'output_column_name': 'first_name',
'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'first_name is NULL',
'is_null_level': True},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match on first_name'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'ExactMatch'}
We can now generate a second, more complex comparison using one of our data-specific comparisons, the PostcodeComparison
:
pc_comparison = cl.PostcodeComparison("postcode")
print(pc_comparison.get_comparison("duckdb").human_readable_description)
Comparison 'PostcodeComparison' of "postcode".
Similarity is assessed using the following ComparisonLevels:
- 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
- 'Exact match on full postcode' with SQL rule: "postcode_l" = "postcode_r"
- 'Exact match on sector' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
- 'Exact match on district' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
- 'Exact match on area' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}', 0), '')
- 'All other comparisons' with SQL rule: ELSE
For a deep dive on out of the box comparisons, see the dedicated topic guide.
Comparisons can be further configured using the .configure()
method - full API docs here.
Method 2: ComparisonLevels
¶
ComparisonLevels
provide a lower-level API that allows you to compose your own comparisons.
For example, the user may wish to specify a comparison that has levels for a match on soundex and jaro_winkler of the first_name
field.
The below example assumes the user has derived a column soundex_first_name
which contains the soundex of the first name.
from splink.comparison_library import CustomComparison
import splink.comparison_level_library as cll
custom_name_comparison = CustomComparison(
output_column_name="first_name",
comparison_levels=[
cll.NullLevel("first_name"),
cll.ExactMatchLevel("first_name").configure(tf_adjustment_column="first_name"),
cll.ExactMatchLevel("soundex_first_name").configure(
tf_adjustment_column="soundex_first_name"
),
cll.ElseLevel(),
],
)
print(custom_name_comparison.get_comparison("duckdb").human_readable_description)
Comparison 'CustomComparison' of "first_name" and "soundex_first_name".
Similarity is assessed using the following ComparisonLevels:
- 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
- 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
- 'Exact match on soundex_first_name' with SQL rule: "soundex_first_name_l" = "soundex_first_name_r"
- 'All other comparisons' with SQL rule: ELSE
This can now be specified in the settings dictionary as follows:
from splink import SettingsCreator, block_on
settings = SettingsCreator(
link_type="dedupe_only",
blocking_rules_to_generate_predictions=[
block_on("first_name"),
block_on("surname"),
],
comparisons=[
custom_name_comparison,
cl.LevenshteinAtThresholds("dob", [1, 2]),
],
)
To inspect the custom comparison as a dictionary, you can call custom_name_comparison.get_comparison("duckdb").as_dict()
Note that ComparisonLevels
can be further configured using the .configure()
method - full API documentation here
Method 3: Providing the spec as a dictionary¶
Behind the scenes in Splink, all Comparisons
are eventually turned into a dictionary which conforms to the formal jsonschema
specification of the settings dictionary and here.
The library functions described above are convenience functions that provide a shorthand way to produce valid dictionaries.
For maximum control over your settings, you can specify your comparisons as a dictionary.
comparison_first_name = {
"output_column_name": "first_name",
"comparison_levels": [
{
"sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
"label_for_charts": "Null",
"is_null_level": True,
},
{
"sql_condition": "first_name_l = first_name_r",
"label_for_charts": "Exact match",
"tf_adjustment_column": "first_name",
"tf_adjustment_weight": 1.0,
"tf_minimum_u_value": 0.001,
},
{
"sql_condition": "dmeta_first_name_l = dmeta_first_name_r",
"label_for_charts": "Exact match",
"tf_adjustment_column": "dmeta_first_name",
"tf_adjustment_weight": 1.0,
},
{
"sql_condition": "jaro_winkler_sim(first_name_l, first_name_r) > 0.8",
"label_for_charts": "Exact match",
"tf_adjustment_column": "first_name",
"tf_adjustment_weight": 0.5,
"tf_minimum_u_value": 0.001,
},
{"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
],
}
settings = SettingsCreator(
link_type="dedupe_only",
blocking_rules_to_generate_predictions=[
block_on("first_name"),
block_on("surname"),
],
comparisons=[
comparison_first_name,
cl.LevenshteinAtThresholds("dob", [1, 2]),
],
)
Examples¶
Below are some examples of how you can define the same comparison, but through different methods.
Exact match Comparison with Term-Frequency Adjustments¶
import splink.comparison_library as cl
first_name_comparison = cl.ExactMatch("first_name").configure(
term_frequency_adjustments=True
)
import splink.duckdb.comparison_level_library as cll
first_name_comparison = cl.CustomComparison(
output_column_name="first_name",
comparison_description="Exact match vs. anything else",
comparison_levels=[
cll.NullLevel("first_name"),
cll.ExactMatchLevel("first_name").configure(tf_adjustment_column="first_name"),
cll.ElseLevel(),
],
)
first_name_comparison = {
'output_column_name': 'first_name',
'comparison_levels': [
{
'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True
},
{
'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match',
'tf_adjustment_column': 'first_name',
'tf_adjustment_weight': 1.0
},
{
'sql_condition': 'ELSE',
'label_for_charts': 'All other comparisons'
}],
'comparison_description': 'Exact match vs. anything else'
}
Each of which gives
{
'output_column_name': 'first_name',
'comparison_levels': [
{
'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True
},
{
'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match',
'tf_adjustment_column': 'first_name',
'tf_adjustment_weight': 1.0
},
{
'sql_condition': 'ELSE',
'label_for_charts': 'All other comparisons'
}],
'comparison_description': 'Exact match vs. anything else'
}
Levenshtein Comparison¶
import splink.comparison_library as cl
email_comparison = cl.LevenshteinAtThresholds("email", [2, 4])
import splink.comparison_library as cl
import splink.comparison_level_library as cll
email_comparison = cl.CustomComparison(
output_column_name="email",
comparison_description="Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else",
comparison_levels=[
cll.NullLevel("email"),
cll.LevenshteinLevel("email", distance_threshold=2),
cll.LevenshteinLevel("email", distance_threshold=4),
cll.ElseLevel(),
],
)
email_comparison = {
'output_column_name': 'email',
'comparison_levels': [{'sql_condition': '"email_l" IS NULL OR "email_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True},
{
'sql_condition': '"email_l" = "email_r"',
'label_for_charts': 'Exact match'
},
{
'sql_condition': 'levenshtein("email_l", "email_r") <= 2',
'label_for_charts': 'Levenshtein <= 2'
},
{
'sql_condition': 'levenshtein("email_l", "email_r") <= 4',
'label_for_charts': 'Levenshtein <= 4'
},
{
'sql_condition': 'ELSE',
'label_for_charts': 'All other comparisons'
}],
'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'}
Each of which gives
{
'output_column_name': 'email',
'comparison_levels': [
{
'sql_condition': '"email_l" IS NULL OR "email_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True},
{
'sql_condition': '"email_l" = "email_r"',
'label_for_charts': 'Exact match'
},
{
'sql_condition': 'levenshtein("email_l", "email_r") <= 2',
'label_for_charts': 'Levenshtein <= 2'
},
{
'sql_condition': 'levenshtein("email_l", "email_r") <= 4',
'label_for_charts': 'Levenshtein <= 4'
},
{
'sql_condition': 'ELSE',
'label_for_charts': 'All other comparisons'
}],
'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'
}
in your settings dictionary.