Defining and customising how record comparisons are made¶
A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field is different than for a first name field.
By tailoring the definitions of similarity, linking models are more effectively able to distinguish beteween different gradations of similarity, leading to more accurate data linking models.
Note that for performance reasons, Splink requires the user to define n
discrete levels (gradations) of similarity.
Comparing information¶
Comparisons are defined on pairwise record comparisons. Suppose for instance your data contains first_name
and surname
and dob
:
id | first_name | surname | dob |
---|---|---|---|
1 | john | smith | 1991-04-11 |
2 | jon | smith | 1991-04-17 |
3 | john | smyth | 1991-04-11 |
To compare these records, at the blocking stage, Splink will set these records against each other in a table of pairwise record comparisons:
id_l | id_r | first_name_l | first_name_r | surname_l | surname_r | dob_l | dob_r |
---|---|---|---|---|---|---|---|
1 | 2 | john | jon | smith | smith | 1991-04-11 | 1991-04-17 |
1 | 3 | john | john | smith | smyth | 1991-04-11 | 1991-04-11 |
2 | 3 | jon | john | smith | smyth | 1991-04-17 | 1991-04-11 |
When defining comparisons, we are defining rules that operate on each row of this latter table of pairwise comparisons
Comparisons
and ComparisonLevels
¶
A Splink model contains a collection of Comparisons
and ComparisonLevels
organised in a hierarchy. An example is as follows:
Data Linking Model
├─-- Comparison: Date of birth
│ ├─-- ComparisonLevel: Exact match
│ ├─-- ComparisonLevel: Up to one character difference
│ ├─-- ComparisonLevel: Up to three character difference
│ ├─-- ComparisonLevel: All other
├─-- Comparison: Name
│ ├─-- ComparisonLevel: Exact match on first name and surname
│ ├─-- ComparisonLevel: Exact match on first name
│ ├─-- etc.
A fuller description of Comaprison
s and ComparisonLevel
s can be found here and here respectively.
How are these comparisons specified?
Three ways of specifying Comparisons¶
In Splink, there are three ways of specifying Comparisons
:
- Using pre-baked comparisons from a backend's
ComparisonLibrary
. (Most simple/succinct) - Composing pre-defined
ComparisonLevels
from a backend'sComparisonLevelLibrary
- Writing a full spec of a
Comparison
by hand (most verbose/flexible)
Method 1: Using the ComparisonLibrary
¶
The ComparisonLibrary
for a each backend (DuckDB
, Spark
, etc.) contains pre-baked similarity functions that cover many common use cases.
These functions generate an entire Comparison
, composed of several ComparisonLevels
The following provides an example of using the ComparisonLibrary
for Spark
from splink.duckdb.duckdb_comparison_library import (
exact_match,
levenshtein_at_thresholds,
)
first_name_comparison = exact_match("first_name")
print(first_name_comparison.human_readable_description)
Note that, under the hood, these functions generate a Python dictionary, which conforms to the underlying .json
specification of a model:
first_name_comparison.as_dict()
We can now generate a second, more complex comparison:
from splink.duckdb.duckdb_comparison_library import (
exact_match,
levenshtein_at_thresholds,
)
dob_comparison = levenshtein_at_thresholds("dob", [1, 2])
print(dob_comparison.human_readable_description)
These Comparisons
can be specified in a data linking model as follows:
settings = {
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
"l.first_name = r.first_name",
"l.surname = r.surname",
],
"comparisons": [
exact_match("first_name"),
levenshtein_at_thresholds("dob", [1, 2]),
]
}
Method 2: ComparisonLevels
¶
The ComparisonLevels
API provides a lower-level API that gives the user greater control over their comparisons.
For example, the user may wish to specify a comparison that has levels for a match on dmetaphone and jaro_winkler of the first_name
field.
The below example assumes the user has derived a column dmeta_first_name
which contains the dmetaphone of the first name.
from splink.spark.spark_comparison_level_library import exact_match_level, null_level, else_level
from splink.spark.spark_comparison_library import levenshtein_at_thresholds
comparison_first_name = {
'output_column_name': 'first_name',
'comparison_description': 'First name jaro dmeta',
'comparison_levels': [
null_level("first_name"),
exact_match_level("first_name", term_frequency_adjustments=True),
exact_match_level("dmeta_first_name", term_frequency_adjustments=True),
else_level()
],
}
from splink.comparison import Comparison
print(Comparison(comparison_first_name).human_readable_description)
This can now be specified in the settings dictionary as follows:
settings = {
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
"l.first_name = r.first_name",
"l.surname = r.surname",
],
"comparisons": [
comparison_first_name, # The comparison specified above using ComparisonLevels
levenshtein_at_thresholds("dob", [1, 2], term_frequency_adjustments=True), # From comparison_library
]
}
Method 3: Providing the spec as a dictionary¶
Ultimately, comparisons are specified as a dictionary which conforms to the formal jsonschema
specification of the settings dictionary and here.
The library functions described above are convenience functions that provide a shorthand way to produce valid dictionaries.
For maximium control over your settings, you can specify your comparisons as a dictionary.
comparison_first_name = {
"output_column_name": "first_name",
"comparison_description": "First name jaro dmeta",
"comparison_levels": [
{
"sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
"label_for_charts": "Null",
"is_null_level": True,
},
{
"sql_condition": "first_name_l = first_name_r",
"label_for_charts": "Exact match",
"tf_adjustment_column": "first_name",
"tf_adjustment_weight": 1.0,
"tf_minimum_u_value": 0.001,
},
{
"sql_condition": "dmeta_first_name_l = dmeta_first_name_r",
"label_for_charts": "Exact match",
"tf_adjustment_column": "dmeta_first_name",
"tf_adjustment_weight": 1.0,
},
{
"sql_condition": "jaro_winkler_sim(first_name_l, first_name_r) > 0.8",
"label_for_charts": "Exact match",
"tf_adjustment_column": "first_name",
"tf_adjustment_weight": 0.5,
"tf_minimum_u_value": 0.001,
},
{"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
],
}
settings = {
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
"l.first_name = r.first_name",
"l.surname = r.surname",
],
"comparisons": [
comparison_first_name, # The comparison specified above using the dict
levenshtein_at_thresholds("dob", [1, 2], term_frequency_adjustments=True), # From comparison_library
]
}