Out-of-the-box Comparisons for specific data typesΒΆ
Similarity is defined differently for types of data (e.g. names, dates of birth, postcodes, addresses, ids). The Comparison Template Library contains functions to generate ready-made comparisons for a variety of data types.
Below are examples of how to structure comparisons for a variety of data types.
Date ComparisonsΒΆ
Date comparisons are generally structured as:
- Null level
- Exact match
- Fuzzy match (using metric of choice)
- Interval match (within X days/months/years)
- Else level
The comparison_template_library contains the date_comparison function which gives this structure, with some pre-defined parameters, out-of-the-box.
from splink.duckdb.comparison_template_library import date_comparison
date_of_birth_comparison = date_comparison("date_of_birth")
Gives a comparison structured as follows:
Comparison: Date of birth
ββ-- ComparisonLevel: Exact match
ββ-- ComparisonLevel: Up to one character difference
ββ-- ComparisonLevel: Dates within 1 month of each other
ββ-- ComparisonLevel: Dates within 1 year of each other
ββ-- ComparisonLevel: Dates within 10 years of each other
ββ-- ComparisonLevel: All other
Or, using human_readable_description
to generate automatically from date_of_birth_comparison
:
print(date_of_birth_comparison.human_readable_description)
Comparison 'Exact match vs. Date_Of_Birth within damerau-levenshtein threshold 1 vs. Dates within the following thresholds Month(s): 1, Year(s): 1, Year(s): 10 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
- 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
- 'Damerau_levenshtein <= 1' with SQL rule: damerau_levenshtein("date_of_birth_l", "date_of_birth_r") <= 1
- 'Within 1 month' with SQL rule:
abs(date_diff('month', "date_of_birth_l",
"date_of_birth_r")) <= 1
- 'Within 1 year' with SQL rule:
abs(date_diff('year', "date_of_birth_l",
"date_of_birth_r")) <= 1
- 'Within 10 years' with SQL rule:
abs(date_diff('year', "date_of_birth_l",
"date_of_birth_r")) <= 10
- 'All other comparisons' with SQL rule: ELSE
The date_comparison function also allows the user flexibility to change the parameters and/or fuzzy matching comparison levels.
For example:
date_of_birth_comparison = date_comparison(
"date_of_birth",
levenshtein_thresholds=[2],
damerau_levenshtein_thresholds=[],
datediff_thresholds=[7, 1, 1],
datediff_metrics=["day", "month", "year"],
)
print(date_of_birth_comparison.human_readable_description)
Comparison 'Exact match vs. Date_Of_Birth within levenshtein threshold 2 vs. Dates within the following thresholds Day(s): 7, Month(s): 1, Year(s): 1 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
- 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
- 'Levenshtein <= 2' with SQL rule: levenshtein("date_of_birth_l", "date_of_birth_r") <= 2
- 'Within 7 days' with SQL rule:
abs(date_diff('day', "date_of_birth_l",
"date_of_birth_r")) <= 7
- 'Within 1 month' with SQL rule:
abs(date_diff('month', "date_of_birth_l",
"date_of_birth_r")) <= 1
- 'Within 1 year' with SQL rule:
abs(date_diff('year', "date_of_birth_l",
"date_of_birth_r")) <= 1
- 'All other comparisons' with SQL rule: ELSE
To see this as a specifications dictionary you can call
date_of_birth_comparison.as_dict()
{'output_column_name': 'date_of_birth',
'comparison_levels': [{'sql_condition': '"date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': '"date_of_birth_l" = "date_of_birth_r"',
'label_for_charts': 'Exact match'},
{'sql_condition': 'levenshtein("date_of_birth_l", "date_of_birth_r") <= 2',
'label_for_charts': 'Levenshtein <= 2'},
{'sql_condition': '\n abs(date_diff(\'day\', "date_of_birth_l",\n "date_of_birth_r")) <= 7\n ',
'label_for_charts': 'Within 7 days'},
{'sql_condition': '\n abs(date_diff(\'month\', "date_of_birth_l",\n "date_of_birth_r")) <= 1\n ',
'label_for_charts': 'Within 1 month'},
{'sql_condition': '\n abs(date_diff(\'year\', "date_of_birth_l",\n "date_of_birth_r")) <= 1\n ',
'label_for_charts': 'Within 1 year'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'Exact match vs. Date_Of_Birth within levenshtein threshold 2 vs. Dates within the following thresholds Day(s): 7, Month(s): 1, Year(s): 1 vs. anything else'}
which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.
Name ComparisonsΒΆ
Name comparisons for an individual name column (e.g. forename, surname) are generally structured as:
- Null level
- Exact match
- Fuzzy match (using metric of choice)
- Else level
The comparison_template_library contains the name_comparison function which gives this structure, with some pre-defined parameters, out-of-the-box.
from splink.duckdb.comparison_template_library import name_comparison
first_name_comparison = name_comparison("first_name")
Gives a comparison structured as follows:
Comparison: First Name
ββ-- ComparisonLevel: Exact match
ββ-- ComparisonLevel: Up to one character difference
ββ-- ComparisonLevel: First Names with Jaro-Winkler similarity of 0.9 or greater
ββ-- ComparisonLevel: First Names with Jaro-Winkler similarity of 0.8 or greater
ββ-- ComparisonLevel: All other
Or, using human_readable_description
to generate automatically from first_name_comparison
:
print(first_name_comparison.human_readable_description)
Comparison 'Exact match vs. First_Name within levenshtein threshold 1 vs. First_Name within damerau-levenshtein threshold 1 vs. First_Name within jaro_winkler thresholds 0.9, 0.8 vs. anything else' of "first_name".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
- 'Exact match first_name' with SQL rule: "first_name_l" = "first_name_r"
- 'Damerau_levenshtein <= 1' with SQL rule: damerau_levenshtein("first_name_l", "first_name_r") <= 1
- 'Jaro_winkler_similarity >= 0.9' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9
- 'Jaro_winkler_similarity >= 0.8' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8
- 'All other comparisons' with SQL rule: ELSE
The name_comparison function also allows flexibility to change the parameters and/or fuzzy matching comparison levels.
For example:
surname_comparison = name_comparison(
"surname",
phonetic_col_name="surname_dm",
term_frequency_adjustments=True,
levenshtein_thresholds=[2],
damerau_levenshtein_thresholds=[],
jaro_winkler_thresholds=[],
jaccard_thresholds=[1],
)
print(surname_comparison.human_readable_description)
Comparison 'Exact match vs. Names with phonetic exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. anything else' of "surname" and "surname_dm".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: "surname_l" IS NULL OR "surname_r" IS NULL
- 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
- 'Exact match surname_dm' with SQL rule: "surname_dm_l" = "surname_dm_r"
- 'Levenshtein <= 2' with SQL rule: levenshtein("surname_l", "surname_r") <= 2
- 'Jaccard >= 1' with SQL rule: jaccard("surname_l", "surname_r") >= 1
- 'All other comparisons' with SQL rule: ELSE
Where surname_dm
refers to a column which has used the DoubleMetaphone algorithm on surname
to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the topic guide.
To see this as a specifications dictionary you can call
surname_comparison.as_dict()
{'output_column_name': 'custom_surname_surname_dm',
'comparison_levels': [{'sql_condition': '"surname_l" IS NULL OR "surname_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': '"surname_l" = "surname_r"',
'label_for_charts': 'Exact match surname',
'tf_adjustment_column': 'surname',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"surname_dm_l" = "surname_dm_r"',
'label_for_charts': 'Exact match surname_dm',
'tf_adjustment_column': 'surname_dm',
'tf_adjustment_weight': 1.0},
{'sql_condition': 'levenshtein("surname_l", "surname_r") <= 2',
'label_for_charts': 'Levenshtein <= 2'},
{'sql_condition': 'jaccard("surname_l", "surname_r") >= 1',
'label_for_charts': 'Jaccard >= 1'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. anything else'}
which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.
Forename and Surname ComparisonsΒΆ
It can be helpful to construct a single comparison for for comparing the forename and surname of two records as:
-
The Fellegi-Sunter model assumes that comparisons are independent. We know that forename and surname are usually correlated given the regional variation of names etc, so considering then in a single comparison can help to create better models.
-
Term-frequencies of individual forename and surname individually does not necessarily reflect how common the combination of forename and surname are. For example, in the UK population βMohammed Khanβ is a relatively common full name despite neither name occurring frequently. For more information on term-frequencies, see the dedicated topic guide. Addressing forename and surname in a single comparison can allows the model to consider the joint term-frequency as well as individual.
-
It is common for some records to have swapped forename and surname by mistake. Addressing forename and surname in a single comparison can allows the model to consider these name inversions.
Forename and Surname comparisons for an individual name column (e.g. forename, surname) are generally structured as:
- Null level
- Exact match Forename and Surname
- Exact match Forename and Surname swapped
- Exact match Surname
- Exact match Forename
- Fuzzy match Surname (using metric of choice)
- Fuzzy match Forename (using metric of choice)
- Else level
The comparison_template_library contains the forename_surname_comparison function which gives this structure, with some pre-defined parameters, out-of-the-box.
from splink.duckdb.comparison_template_library import forename_surname_comparison
name_comparison = forename_surname_comparison("forename", "surname")
Gives a comparison structured as follows:
Comparison: First Name
ββ-- ComparisonLevel: Exact match Forename and Surname
ββ-- ComparisonLevel: Exact match Forename and Surname swapped
ββ-- ComparisonLevel: Exact match Surname
ββ-- ComparisonLevel: Exact match Forename
ββ-- ComparisonLevel: Surnames with Jaro-Winkler similarity greater than 0.88
ββ-- ComparisonLevel: Forenames with Jaro-Winkler similarity greater than 0.88
ββ-- ComparisonLevel: All other
Or, using human_readable_description
to generate automatically from first_name_comparison
:
print(name_comparison.human_readable_description)
Comparison 'Exact match vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within jaro-winkler threshold 0.88 vs. Forename within jaro-winkler threshold 0.88 vs. anything else' of "surname" and "forename".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
- 'Full name exact match' with SQL rule: forename_l = forename_r AND surname_l = surname_r
- 'Exact match on reversed cols' with SQL rule: "forename_l" = "surname_r" and "forename_r" = "surname_l"
- 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
- 'Exact match forename' with SQL rule: "forename_l" = "forename_r"
- 'Jaro_winkler_similarity surname >= 0.88' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.88
- 'Jaro_winkler_similarity forename >= 0.88' with SQL rule: jaro_winkler_similarity("forename_l", "forename_r") >= 0.88
- 'All other comparisons' with SQL rule: ELSE
The forename_surname_comparison function also allows flexibility to change the parameters and/or fuzzy matching comparison levels.
For example:
full_name_comparison = forename_surname_comparison(
"forename",
"surname",
term_frequency_adjustments=True,
tf_adjustment_col_forename_and_surname="full_name",
phonetic_forename_col_name="forename_dm",
phonetic_surname_col_name="surname_dm",
levenshtein_thresholds=[2],
jaro_winkler_thresholds=[],
jaccard_thresholds=[1],
)
print(full_name_comparison.human_readable_description)
Comparison 'Exact match vs. Phonetic match forename and surname vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. Forename within levenshtein threshold 2 vs. Forename within jaccard threshold 1 vs. anything else' of "surname", "forename", "surname_dm" and "forename_dm".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
- 'Full name exact match' with SQL rule: forename_l = forename_r AND surname_l = surname_r
- 'Full name phonetic match' with SQL rule: forename_dm_l = forename_dm_r AND surname_dm_l = surname_dm_r
- 'Exact match on reversed cols' with SQL rule: "forename_l" = "surname_r" and "forename_r" = "surname_l"
- 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
- 'Exact match forename' with SQL rule: "forename_l" = "forename_r"
- 'Levenshtein surname <= 2' with SQL rule: levenshtein("surname_l", "surname_r") <= 2
- 'Jaccard surname >= 1' with SQL rule: jaccard("surname_l", "surname_r") >= 1
- 'Levenshtein forename <= 2' with SQL rule: levenshtein("forename_l", "forename_r") <= 2
- 'Jaccard forename >= 1' with SQL rule: jaccard("forename_l", "forename_r") >= 1
- 'All other comparisons' with SQL rule: ELSE
Where:
-
forename_dm
andsurname_dm
refer to columns which have used the DoubleMetaphone algorithm onforename
andsurname
to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the topic guide. These columns will have to already exist in the dataset, or be created in the feature engineering stage when preparing datasets for linking. -
full_name
is a column containingforename
andsurname
so that the model can consider the term-frequency of the full name, as well asforename
andsurname
individually. These columns will have to already exist in the dataset, or be created in the feature engineering stage when preparing datasets for linking.
To see this as a specifications dictionary you can call
full_name_comparison.as_dict()
{'output_column_name': 'custom_surname_forename_surname_dm_forename_dm',
'comparison_levels': [{'sql_condition': '("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': 'forename_l = forename_r AND surname_l = surname_r',
'label_for_charts': 'Full name exact match',
'tf_adjustment_column': 'full_name',
'tf_adjustment_weight': 1.0},
{'sql_condition': 'forename_dm_l = forename_dm_r AND surname_dm_l = surname_dm_r',
'label_for_charts': 'Full name phonetic match',
'tf_adjustment_column': 'full_name',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"forename_l" = "surname_r" and "forename_r" = "surname_l"',
'label_for_charts': 'Exact match on reversed cols',
'tf_adjustment_column': 'full_name',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"surname_l" = "surname_r"',
'label_for_charts': 'Exact match surname',
'tf_adjustment_column': 'surname',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"forename_l" = "forename_r"',
'label_for_charts': 'Exact match forename',
'tf_adjustment_column': 'forename',
'tf_adjustment_weight': 1.0},
{'sql_condition': 'levenshtein("surname_l", "surname_r") <= 2',
'label_for_charts': 'Levenshtein surname <= 2'},
{'sql_condition': 'jaccard("surname_l", "surname_r") >= 1',
'label_for_charts': 'Jaccard surname >= 1'},
{'sql_condition': 'levenshtein("forename_l", "forename_r") <= 2',
'label_for_charts': 'Levenshtein forename <= 2'},
{'sql_condition': 'jaccard("forename_l", "forename_r") >= 1',
'label_for_charts': 'Jaccard forename >= 1'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'Exact match vs. Phonetic match forename and surname vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. Forename within levenshtein threshold 2 vs. Forename within jaccard threshold 1 vs. anything else'}
Which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.
Postcode ComparisonsΒΆ
The comparison_template_library contains the postcode_comparison function which provides a sensible approach to comparing postcodes in terms of their constituent components, out-of-the-box. See Feature Engineering for more details.
from splink.duckdb.comparison_template_library import postcode_comparison
pc_comparison = postcode_comparison("postcode")
Gives a comparison structured as follows:
Comparison: Postcode
ββ-- ComparisonLevel: Exact match
ββ-- ComparisonLevel: Exact match on sector
ββ-- ComparisonLevel: Exact match on district
ββ-- ComparisonLevel: Exact match on area
ββ-- ComparisonLevel: All other
Or, using human_readable_description
to generate automatically from pc_comparison
:
print(pc_comparison.human_readable_description)
Comparison 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. all other comparisons' of "postcode".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
- 'Exact match postcode' with SQL rule: lower("postcode_l") = lower("postcode_r")
- 'Exact match Postcode Sector' with SQL rule:
regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')
=
regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')
- 'Exact match Postcode District' with SQL rule:
regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')
=
regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')
- 'Exact match Postcode Area' with SQL rule:
regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}')
=
regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}')
- 'All other comparisons' with SQL rule: ELSE
where individual postcode components are extracted under-the-hood using the regex_extract
argument.
Note that the 'Exact match Postcode District' level also captures matches on subdistricts where they exist in the data.
Performing comparisons based on substrings alone doesn't always give the best sense of whether two postcodes are close together since locations which are geographically close can be in different postcode regions e.g. London postcodes starting 'N' vs 'SW'. Given this, the postcode_comparison function also allows the user flexibility to include cll.distance_in_km_level() by supplying lat_col
, long_col
and km_thresholds
arguments. This can help to improve results. (See Feature Enginnering for more details.)
Users also have the option to set invalid_postcodes_as_null
to True
. If True
, postcodes that do not adhere to a valid postcode format as determined by valid_postcode_regex
will be included in the null level. valid_postcode_regex
defaults to "^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$"
.
For example:
pc_comparison = postcode_comparison(
"postcode",
invalid_postcodes_as_null=True,
lat_col="lat",
long_col="long",
km_thresholds=[1, 10, 50]
)
print(pc_comparison.human_readable_description)
Comparison 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. Postcode within km_distance thresholds 1, 10, 50 vs. all other comparisons' of "postcode", "long" and "lat".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule:
regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
IS NULL OR
regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
IS NULL OR
regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
=='' OR
regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
==''
- 'Exact match postcode' with SQL rule: lower("postcode_l") = lower("postcode_r")
- 'Exact match Postcode Sector' with SQL rule:
regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')
=
regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')
- 'Exact match Postcode District' with SQL rule:
regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')
=
regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')
- 'Exact match Postcode Area' with SQL rule:
regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}')
=
regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}')
- 'Distance less than 1km' with SQL rule:
cast(
acos(
case
when (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
) > 1 then 1
when (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
) < -1 then -1
else (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
)
end
) * 6371
as float
)
<= 1
- 'Distance less than 10km' with SQL rule:
cast(
acos(
case
when (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
) > 1 then 1
when (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
) < -1 then -1
else (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
)
end
) * 6371
as float
)
<= 10
- 'Distance less than 50km' with SQL rule:
cast(
acos(
case
when (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
) > 1 then 1
when (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
) < -1 then -1
else (
sin( radians("lat_l") ) * sin( radians("lat_r") ) +
cos( radians("lat_l") ) * cos( radians("lat_r") )
* cos( radians("long_r" - "long_l") )
)
end
) * 6371
as float
)
<= 50
- 'All other comparisons' with SQL rule: ELSE
To see this as a specifications dictionary you can call
pc_comparison.as_dict()
{'output_column_name': 'postcode',
'comparison_levels': [{'sql_condition': '\n regexp_extract("postcode_l", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n IS NULL OR \n regexp_extract("postcode_r", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n IS NULL OR\n \n regexp_extract("postcode_l", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n ==\'\' OR \n regexp_extract("postcode_r", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n ==\'\'',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': 'lower("postcode_l") = lower("postcode_r")',
'label_for_charts': 'Exact match postcode'},
{'sql_condition': '\n regexp_extract(lower("postcode_l"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\')\n = \n regexp_extract(lower("postcode_r"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\')\n ',
'label_for_charts': 'Exact match Postcode Sector'},
{'sql_condition': '\n regexp_extract(lower("postcode_l"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?\')\n = \n regexp_extract(lower("postcode_r"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?\')\n ',
'label_for_charts': 'Exact match Postcode District'},
{'sql_condition': '\n regexp_extract(lower("postcode_l"), \'^[A-Za-z]{1,2}\')\n = \n regexp_extract(lower("postcode_r"), \'^[A-Za-z]{1,2}\')\n ',
'label_for_charts': 'Exact match Postcode Area'},
{'sql_condition': '\n \n cast(\n acos(\n \n case\n when (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n ) > 1 then 1\n when (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n ) < -1 then -1\n else (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n )\n end\n \n ) * 6371\n as float\n )\n <= 1\n ',
'label_for_charts': 'Distance less than 1km'},
{'sql_condition': '\n \n cast(\n acos(\n \n case\n when (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n ) > 1 then 1\n when (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n ) < -1 then -1\n else (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n )\n end\n \n ) * 6371\n as float\n )\n <= 10\n ',
'label_for_charts': 'Distance less than 10km'},
{'sql_condition': '\n \n cast(\n acos(\n \n case\n when (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n ) > 1 then 1\n when (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n ) < -1 then -1\n else (\n sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n cos( radians("lat_l") ) * cos( radians("lat_r") )\n * cos( radians("long_r" - "long_l") )\n )\n end\n \n ) * 6371\n as float\n )\n <= 50\n ',
'label_for_charts': 'Distance less than 50km'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. Postcode within km_distance thresholds 1, 10, 50 vs. all other comparisons'}
which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.
Email ComparisonsΒΆ
Email comparisons are generally structured as:
- Null Level
- Exact match on email address
- Exact match on username
- Fuzzy match on email address
- Fuzzy match on username
- All other comparisons
The comparison_template_library contains the email_comparison function which provides a sensible approach to comparing emails out-of-the-box.
from splink.duckdb.comparison_template_library import email_comparison
standard_email_comparison = email_comparison("email")
Gives a comparison structured as follows:
Comparison: Email
ββ-- ComparisonLevel: Exact match
ββ-- ComparisonLevel: Exact match on username with different domain
ββ-- ComparisonLevel: Fuzzy match on email using Jaro-Winkler
ββ-- ComparisonLevel: Fuzzy match on username using Jaro-Winkler
ββ-- ComparisonLevel: All other comparisons
Or, using human_readable_description
to generate automatically from email_comparison
:
print(standard_email_comparison.human_readable_description)
Comparison 'Exact match vs. Exact username match different domain vs. Fuzzy Email within jaro_winkler threshold 0.88 vs. Fuzzy Username within jaro_winkler threshold 0.88 vs. anything else' of "email".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
- 'Exact match email' with SQL rule: "email_l" = "email_r"
- 'Exact match email' with SQL rule:
regexp_extract("email_l", '^[^@]+')
=
regexp_extract("email_r", '^[^@]+')
- 'Jaro_winkler_similarity email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
- 'Jaro_winkler_similarity Username >= 0.88' with SQL rule: jaro_winkler_similarity(
regexp_extract("email_l", '^[^@]+')
,
regexp_extract("email_r", '^[^@]+')
) >= 0.88
- 'All other comparisons' with SQL rule: ELSE
where individual email components are extracted under-the-hood using the regex_extract
argument.
By default, the fuzzy matching is done using Jaro-Winkler thresholds. This will bias the start of a string, specifically the first four characters, which may not be appropriate for all emails. The email_comparison
function is flexible and allows a number of other string fuzzy matching functions.
Users also have the option to set invalid_emails_as_null
to True
. If True
, postcodes that do not adhere to a valid email format as determined by valid_email_regex
will be included in the null level. valid_email_regex
defaults to "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"
.
For example:
bespoke_email_comparison = email_comparison(
"email",
jaro_winkler_thresholds=[],
jaro_thresholds=[0.8],
include_username_match_level=False,
include_domain_match_level=True,
invalid_emails_as_null=True,
)
print(bespoke_email_comparison.human_readable_description)
Comparison 'Exact match vs. Domain-only match vs.anything else' of "email".
Similarity is assessed using the following ComparisonLevels:
- 'Null' with SQL rule:
regexp_extract("email_l", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
IS NULL OR
regexp_extract("email_r", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
IS NULL OR
regexp_extract("email_l", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
=='' OR
regexp_extract("email_r", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
==''
- 'Exact match email' with SQL rule: "email_l" = "email_r"
- 'Jaro_similarity email >= 0.8' with SQL rule: jaro_similarity("email_l", "email_r") >= 0.8
- 'Jaro_similarity email >= 0.8' with SQL rule: jaro_similarity("email_l", "email_r") >= 0.8
- 'Exact match Email Domain' with SQL rule:
regexp_extract("email_l", '@([^@]+)$')
=
regexp_extract("email_r", '@([^@]+)$')
- 'All other comparisons' with SQL rule: ELSE
To see this as a specifications dictionary you can call
bespoke_email_comparison.as_dict()
{'output_column_name': 'email',
'comparison_levels': [{'sql_condition': '\n regexp_extract("email_l", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n IS NULL OR \n regexp_extract("email_r", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n IS NULL OR\n \n regexp_extract("email_l", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n ==\'\' OR \n regexp_extract("email_r", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n ==\'\'',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': '"email_l" = "email_r"',
'label_for_charts': 'Exact match email'},
{'sql_condition': 'jaro_similarity("email_l", "email_r") >= 0.8',
'label_for_charts': 'Jaro_similarity email >= 0.8'},
{'sql_condition': 'jaro_similarity("email_l", "email_r") >= 0.8',
'label_for_charts': 'Jaro_similarity email >= 0.8'},
{'sql_condition': '\n regexp_extract("email_l", \'@([^@]+)$\')\n = \n regexp_extract("email_r", \'@([^@]+)$\')\n ',
'label_for_charts': 'Exact match Email Domain'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'Exact match vs. Domain-only match vs.anything else'}
which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.