Skip to content

Out-of-the-box Comparisons for specific data typesΒΆ

Similarity is defined differently for types of data (e.g. names, dates of birth, postcodes, addresses, ids). The Comparison Template Library contains functions to generate ready-made comparisons for a variety of data types.

Below are examples of how to structure comparisons for a variety of data types.


Date ComparisonsΒΆ

Date comparisons are generally structured as:

  • Null level
  • Exact match
  • Fuzzy match (using metric of choice)
  • Interval match (within X days/months/years)
  • Else level

The comparison_template_library contains the date_comparison function which gives this structure, with some pre-defined parameters, out-of-the-box.

from splink.duckdb.comparison_template_library import date_comparison

date_of_birth_comparison = date_comparison("date_of_birth")

Gives a comparison structured as follows:

Comparison: Date of birth
β”œβ”€-- ComparisonLevel: Exact match
β”œβ”€-- ComparisonLevel: Up to one character difference
β”œβ”€-- ComparisonLevel: Dates within 1 month of each other
β”œβ”€-- ComparisonLevel: Dates within 1 year of each other
β”œβ”€-- ComparisonLevel: Dates within 10 years of each other
β”œβ”€-- ComparisonLevel: All other

Or, using human_readable_description to generate automatically from date_of_birth_comparison:

print(date_of_birth_comparison.human_readable_description)
Comparison 'Exact match vs. Date_Of_Birth within damerau-levenshtein threshold 1 vs. Dates within the following thresholds Month(s): 1, Year(s): 1, Year(s): 10 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
    - 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'Damerau_levenshtein <= 1' with SQL rule: damerau_levenshtein("date_of_birth_l", "date_of_birth_r") <= 1
    - 'Within 1 month' with SQL rule: 
            abs(date_diff('month', "date_of_birth_l",
              "date_of_birth_r")) <= 1

    - 'Within 1 year' with SQL rule: 
            abs(date_diff('year', "date_of_birth_l",
              "date_of_birth_r")) <= 1

    - 'Within 10 years' with SQL rule: 
            abs(date_diff('year', "date_of_birth_l",
              "date_of_birth_r")) <= 10

    - 'All other comparisons' with SQL rule: ELSE

The date_comparison function also allows the user flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

date_of_birth_comparison = date_comparison(
    "date_of_birth",
    levenshtein_thresholds=[2],
    damerau_levenshtein_thresholds=[],
    datediff_thresholds=[7, 1, 1],
    datediff_metrics=["day", "month", "year"],
)
print(date_of_birth_comparison.human_readable_description)
Comparison 'Exact match vs. Date_Of_Birth within levenshtein threshold 2 vs. Dates within the following thresholds Day(s): 7, Month(s): 1, Year(s): 1 vs. anything else' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL
    - 'Exact match' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'Levenshtein <= 2' with SQL rule: levenshtein("date_of_birth_l", "date_of_birth_r") <= 2
    - 'Within 7 days' with SQL rule: 
            abs(date_diff('day', "date_of_birth_l",
              "date_of_birth_r")) <= 7

    - 'Within 1 month' with SQL rule: 
            abs(date_diff('month', "date_of_birth_l",
              "date_of_birth_r")) <= 1

    - 'Within 1 year' with SQL rule: 
            abs(date_diff('year', "date_of_birth_l",
              "date_of_birth_r")) <= 1

    - 'All other comparisons' with SQL rule: ELSE

To see this as a specifications dictionary you can call

date_of_birth_comparison.as_dict()
{'output_column_name': 'date_of_birth',
 'comparison_levels': [{'sql_condition': '"date_of_birth_l" IS NULL OR "date_of_birth_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"date_of_birth_l" = "date_of_birth_r"',
   'label_for_charts': 'Exact match'},
  {'sql_condition': 'levenshtein("date_of_birth_l", "date_of_birth_r") <= 2',
   'label_for_charts': 'Levenshtein <= 2'},
  {'sql_condition': '\n            abs(date_diff(\'day\', "date_of_birth_l",\n              "date_of_birth_r")) <= 7\n        ',
   'label_for_charts': 'Within 7 days'},
  {'sql_condition': '\n            abs(date_diff(\'month\', "date_of_birth_l",\n              "date_of_birth_r")) <= 1\n        ',
   'label_for_charts': 'Within 1 month'},
  {'sql_condition': '\n            abs(date_diff(\'year\', "date_of_birth_l",\n              "date_of_birth_r")) <= 1\n        ',
   'label_for_charts': 'Within 1 year'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Date_Of_Birth within levenshtein threshold 2 vs. Dates within the following thresholds Day(s): 7, Month(s): 1, Year(s): 1 vs. anything else'}

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.


Name ComparisonsΒΆ

Name comparisons for an individual name column (e.g. forename, surname) are generally structured as:

The comparison_template_library contains the name_comparison function which gives this structure, with some pre-defined parameters, out-of-the-box.

from splink.duckdb.comparison_template_library import name_comparison

first_name_comparison = name_comparison("first_name")

Gives a comparison structured as follows:

Comparison: First Name
β”œβ”€-- ComparisonLevel: Exact match
β”œβ”€-- ComparisonLevel: Up to one character difference
β”œβ”€-- ComparisonLevel: First Names with Jaro-Winkler similarity of 0.9 or greater 
β”œβ”€-- ComparisonLevel: First Names with Jaro-Winkler similarity of 0.8 or greater
β”œβ”€-- ComparisonLevel: All other

Or, using human_readable_description to generate automatically from first_name_comparison:

print(first_name_comparison.human_readable_description)
Comparison 'Exact match vs. First_Name within levenshtein threshold 1 vs. First_Name within damerau-levenshtein threshold 1 vs. First_Name within jaro_winkler thresholds 0.9, 0.8 vs. anything else' of "first_name".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Damerau_levenshtein <= 1' with SQL rule: damerau_levenshtein("first_name_l", "first_name_r") <= 1
    - 'Jaro_winkler_similarity >= 0.9' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9
    - 'Jaro_winkler_similarity >= 0.8' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8
    - 'All other comparisons' with SQL rule: ELSE

The name_comparison function also allows flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

surname_comparison = name_comparison(
    "surname",
    phonetic_col_name="surname_dm",
    term_frequency_adjustments=True,
    levenshtein_thresholds=[2],
    damerau_levenshtein_thresholds=[],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
print(surname_comparison.human_readable_description)
Comparison 'Exact match vs. Names with phonetic exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. anything else' of "surname" and "surname_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "surname_l" IS NULL OR "surname_r" IS NULL
    - 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match surname_dm' with SQL rule: "surname_dm_l" = "surname_dm_r"
    - 'Levenshtein <= 2' with SQL rule: levenshtein("surname_l", "surname_r") <= 2
    - 'Jaccard >= 1' with SQL rule: jaccard("surname_l", "surname_r") >= 1
    - 'All other comparisons' with SQL rule: ELSE

Where surname_dm refers to a column which has used the DoubleMetaphone algorithm on surname to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the topic guide.

To see this as a specifications dictionary you can call

surname_comparison.as_dict()
{'output_column_name': 'custom_surname_surname_dm',
 'comparison_levels': [{'sql_condition': '"surname_l" IS NULL OR "surname_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"surname_l" = "surname_r"',
   'label_for_charts': 'Exact match surname',
   'tf_adjustment_column': 'surname',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"surname_dm_l" = "surname_dm_r"',
   'label_for_charts': 'Exact match surname_dm',
   'tf_adjustment_column': 'surname_dm',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': 'levenshtein("surname_l", "surname_r") <= 2',
   'label_for_charts': 'Levenshtein <= 2'},
  {'sql_condition': 'jaccard("surname_l", "surname_r") >= 1',
   'label_for_charts': 'Jaccard >= 1'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Names with phonetic exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. anything else'}

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.


Forename and Surname ComparisonsΒΆ

It can be helpful to construct a single comparison for for comparing the forename and surname of two records as:

  1. The Fellegi-Sunter model assumes that comparisons are independent. We know that forename and surname are usually correlated given the regional variation of names etc, so considering then in a single comparison can help to create better models.

  2. Term-frequencies of individual forename and surname individually does not necessarily reflect how common the combination of forename and surname are. For example, in the UK population β€œMohammed Khan” is a relatively common full name despite neither name occurring frequently. For more information on term-frequencies, see the dedicated topic guide. Addressing forename and surname in a single comparison can allows the model to consider the joint term-frequency as well as individual.

  3. It is common for some records to have swapped forename and surname by mistake. Addressing forename and surname in a single comparison can allows the model to consider these name inversions.

Forename and Surname comparisons for an individual name column (e.g. forename, surname) are generally structured as:

  • Null level
  • Exact match Forename and Surname
  • Exact match Forename and Surname swapped
  • Exact match Surname
  • Exact match Forename
  • Fuzzy match Surname (using metric of choice)
  • Fuzzy match Forename (using metric of choice)
  • Else level

The comparison_template_library contains the forename_surname_comparison function which gives this structure, with some pre-defined parameters, out-of-the-box.

from splink.duckdb.comparison_template_library import forename_surname_comparison

name_comparison = forename_surname_comparison("forename", "surname")

Gives a comparison structured as follows:

Comparison: First Name
β”œβ”€-- ComparisonLevel: Exact match Forename and Surname
β”œβ”€-- ComparisonLevel: Exact match Forename and Surname swapped
β”œβ”€-- ComparisonLevel: Exact match Surname
β”œβ”€-- ComparisonLevel: Exact match Forename
β”œβ”€-- ComparisonLevel: Surnames with Jaro-Winkler similarity greater than 0.88
β”œβ”€-- ComparisonLevel: Forenames with Jaro-Winkler similarity greater than 0.88
β”œβ”€-- ComparisonLevel: All other

Or, using human_readable_description to generate automatically from first_name_comparison:

print(name_comparison.human_readable_description)
Comparison 'Exact match vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within jaro-winkler threshold 0.88 vs. Forename within jaro-winkler threshold 0.88 vs. anything else' of "surname" and "forename".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
    - 'Full name exact match' with SQL rule: forename_l = forename_r AND surname_l = surname_r
    - 'Exact match on reversed cols' with SQL rule: "forename_l" = "surname_r" and "forename_r" = "surname_l"
    - 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match forename' with SQL rule: "forename_l" = "forename_r"
    - 'Jaro_winkler_similarity surname >= 0.88' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.88
    - 'Jaro_winkler_similarity forename >= 0.88' with SQL rule: jaro_winkler_similarity("forename_l", "forename_r") >= 0.88
    - 'All other comparisons' with SQL rule: ELSE

The forename_surname_comparison function also allows flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

full_name_comparison = forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
print(full_name_comparison.human_readable_description)
Comparison 'Exact match vs. Phonetic match forename and surname vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. Forename within levenshtein threshold 2 vs. Forename within jaccard threshold 1 vs. anything else' of "surname", "forename", "surname_dm" and "forename_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
    - 'Full name exact match' with SQL rule: forename_l = forename_r AND surname_l = surname_r
    - 'Full name phonetic match' with SQL rule: forename_dm_l = forename_dm_r AND surname_dm_l = surname_dm_r
    - 'Exact match on reversed cols' with SQL rule: "forename_l" = "surname_r" and "forename_r" = "surname_l"
    - 'Exact match surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match forename' with SQL rule: "forename_l" = "forename_r"
    - 'Levenshtein surname <= 2' with SQL rule: levenshtein("surname_l", "surname_r") <= 2
    - 'Jaccard surname >= 1' with SQL rule: jaccard("surname_l", "surname_r") >= 1
    - 'Levenshtein forename <= 2' with SQL rule: levenshtein("forename_l", "forename_r") <= 2
    - 'Jaccard forename >= 1' with SQL rule: jaccard("forename_l", "forename_r") >= 1
    - 'All other comparisons' with SQL rule: ELSE

Where:

  • forename_dm and surname_dm refer to columns which have used the DoubleMetaphone algorithm on forename and surname to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the topic guide. These columns will have to already exist in the dataset, or be created in the feature engineering stage when preparing datasets for linking.

  • full_name is a column containing forename and surname so that the model can consider the term-frequency of the full name, as well as forename and surname individually. These columns will have to already exist in the dataset, or be created in the feature engineering stage when preparing datasets for linking.

To see this as a specifications dictionary you can call

full_name_comparison.as_dict()
{'output_column_name': 'custom_surname_forename_surname_dm_forename_dm',
 'comparison_levels': [{'sql_condition': '("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': 'forename_l = forename_r AND surname_l = surname_r',
   'label_for_charts': 'Full name exact match',
   'tf_adjustment_column': 'full_name',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': 'forename_dm_l = forename_dm_r AND surname_dm_l = surname_dm_r',
   'label_for_charts': 'Full name phonetic match',
   'tf_adjustment_column': 'full_name',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"forename_l" = "surname_r" and "forename_r" = "surname_l"',
   'label_for_charts': 'Exact match on reversed cols',
   'tf_adjustment_column': 'full_name',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"surname_l" = "surname_r"',
   'label_for_charts': 'Exact match surname',
   'tf_adjustment_column': 'surname',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"forename_l" = "forename_r"',
   'label_for_charts': 'Exact match forename',
   'tf_adjustment_column': 'forename',
   'tf_adjustment_weight': 1.0},
  {'sql_condition': 'levenshtein("surname_l", "surname_r") <= 2',
   'label_for_charts': 'Levenshtein surname <= 2'},
  {'sql_condition': 'jaccard("surname_l", "surname_r") >= 1',
   'label_for_charts': 'Jaccard surname >= 1'},
  {'sql_condition': 'levenshtein("forename_l", "forename_r") <= 2',
   'label_for_charts': 'Levenshtein forename <= 2'},
  {'sql_condition': 'jaccard("forename_l", "forename_r") >= 1',
   'label_for_charts': 'Jaccard forename >= 1'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Phonetic match forename and surname vs. Forename and surname columns reversed vs. Surname exact match vs. Forename exact match vs. Surname within levenshtein threshold 2 vs. Surname within jaccard threshold 1 vs. Forename within levenshtein threshold 2 vs. Forename within jaccard threshold 1 vs. anything else'}

Which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.


Postcode ComparisonsΒΆ

The comparison_template_library contains the postcode_comparison function which provides a sensible approach to comparing postcodes in terms of their constituent components, out-of-the-box. See Feature Engineering for more details.

from splink.duckdb.comparison_template_library import postcode_comparison

pc_comparison = postcode_comparison("postcode")

Gives a comparison structured as follows:

Comparison: Postcode
β”œβ”€-- ComparisonLevel: Exact match
β”œβ”€-- ComparisonLevel: Exact match on sector
β”œβ”€-- ComparisonLevel: Exact match on district
β”œβ”€-- ComparisonLevel: Exact match on area
β”œβ”€-- ComparisonLevel: All other

Or, using human_readable_description to generate automatically from pc_comparison:

print(pc_comparison.human_readable_description)
Comparison 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. all other comparisons' of "postcode".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match postcode' with SQL rule: lower("postcode_l") = lower("postcode_r")
    - 'Exact match Postcode Sector' with SQL rule: 
        regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')
     = 
        regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')

    - 'Exact match Postcode District' with SQL rule: 
        regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')
     = 
        regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')

    - 'Exact match Postcode Area' with SQL rule: 
        regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}')
     = 
        regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}')

    - 'All other comparisons' with SQL rule: ELSE

where individual postcode components are extracted under-the-hood using the regex_extract argument.

Note that the 'Exact match Postcode District' level also captures matches on subdistricts where they exist in the data.

Performing comparisons based on substrings alone doesn't always give the best sense of whether two postcodes are close together since locations which are geographically close can be in different postcode regions e.g. London postcodes starting 'N' vs 'SW'. Given this, the postcode_comparison function also allows the user flexibility to include cll.distance_in_km_level() by supplying lat_col, long_col and km_thresholds arguments. This can help to improve results. (See Feature Enginnering for more details.)

Users also have the option to set invalid_postcodes_as_null to True. If True, postcodes that do not adhere to a valid postcode format as determined by valid_postcode_regex will be included in the null level. valid_postcode_regex defaults to "^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$".

For example:

pc_comparison = postcode_comparison(
    "postcode",
    invalid_postcodes_as_null=True,
    lat_col="lat",
    long_col="long",
    km_thresholds=[1, 10, 50]
)
print(pc_comparison.human_readable_description)
Comparison 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. Postcode within km_distance thresholds 1, 10, 50 vs. all other comparisons' of "postcode", "long" and "lat".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: 
        regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
     IS NULL OR 
        regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
     IS NULL OR

        regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
    =='' OR 
        regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$')
     ==''
    - 'Exact match postcode' with SQL rule: lower("postcode_l") = lower("postcode_r")
    - 'Exact match Postcode Sector' with SQL rule: 
        regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')
     = 
        regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]')

    - 'Exact match Postcode District' with SQL rule: 
        regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')
     = 
        regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?')

    - 'Exact match Postcode Area' with SQL rule: 
        regexp_extract(lower("postcode_l"), '^[A-Za-z]{1,2}')
     = 
        regexp_extract(lower("postcode_r"), '^[A-Za-z]{1,2}')

    - 'Distance less than 1km' with SQL rule:

        cast(
            acos(

        case
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) > 1 then 1
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) < -1 then -1
            else (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    )
        end

            ) * 6371
            as float
        )
     <= 1

    - 'Distance less than 10km' with SQL rule:

        cast(
            acos(

        case
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) > 1 then 1
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) < -1 then -1
            else (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    )
        end

            ) * 6371
            as float
        )
     <= 10

    - 'Distance less than 50km' with SQL rule:

        cast(
            acos(

        case
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) > 1 then 1
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) < -1 then -1
            else (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    )
        end

            ) * 6371
            as float
        )
     <= 50

    - 'All other comparisons' with SQL rule: ELSE

To see this as a specifications dictionary you can call

pc_comparison.as_dict()
{'output_column_name': 'postcode',
 'comparison_levels': [{'sql_condition': '\n        regexp_extract("postcode_l", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n     IS NULL OR \n        regexp_extract("postcode_r", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n     IS NULL OR\n                      \n        regexp_extract("postcode_l", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n    ==\'\' OR \n        regexp_extract("postcode_r", \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$\')\n     ==\'\'',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': 'lower("postcode_l") = lower("postcode_r")',
   'label_for_charts': 'Exact match postcode'},
  {'sql_condition': '\n        regexp_extract(lower("postcode_l"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\')\n     = \n        regexp_extract(lower("postcode_r"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\')\n    ',
   'label_for_charts': 'Exact match Postcode Sector'},
  {'sql_condition': '\n        regexp_extract(lower("postcode_l"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?\')\n     = \n        regexp_extract(lower("postcode_r"), \'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?\')\n    ',
   'label_for_charts': 'Exact match Postcode District'},
  {'sql_condition': '\n        regexp_extract(lower("postcode_l"), \'^[A-Za-z]{1,2}\')\n     = \n        regexp_extract(lower("postcode_r"), \'^[A-Za-z]{1,2}\')\n    ',
   'label_for_charts': 'Exact match Postcode Area'},
  {'sql_condition': '\n        \n        cast(\n            acos(\n                \n        case\n            when (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    ) > 1 then 1\n            when (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    ) < -1 then -1\n            else (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    )\n        end\n    \n            ) * 6371\n            as float\n        )\n     <= 1\n        ',
   'label_for_charts': 'Distance less than 1km'},
  {'sql_condition': '\n        \n        cast(\n            acos(\n                \n        case\n            when (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    ) > 1 then 1\n            when (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    ) < -1 then -1\n            else (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    )\n        end\n    \n            ) * 6371\n            as float\n        )\n     <= 10\n        ',
   'label_for_charts': 'Distance less than 10km'},
  {'sql_condition': '\n        \n        cast(\n            acos(\n                \n        case\n            when (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    ) > 1 then 1\n            when (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    ) < -1 then -1\n            else (\n        sin( radians("lat_l") ) * sin( radians("lat_r") ) +\n        cos( radians("lat_l") ) * cos( radians("lat_r") )\n            * cos( radians("long_r" - "long_l") )\n    )\n        end\n    \n            ) * 6371\n            as float\n        )\n     <= 50\n        ',
   'label_for_charts': 'Distance less than 50km'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match on full postcode vs. exact match on sector vs. exact match on district vs. exact match on area vs. Postcode within km_distance thresholds 1, 10, 50 vs. all other comparisons'}

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.


Email ComparisonsΒΆ

Email comparisons are generally structured as:

  • Null Level
  • Exact match on email address
  • Exact match on username
  • Fuzzy match on email address
  • Fuzzy match on username
  • All other comparisons

The comparison_template_library contains the email_comparison function which provides a sensible approach to comparing emails out-of-the-box.

from splink.duckdb.comparison_template_library import email_comparison

standard_email_comparison = email_comparison("email")

Gives a comparison structured as follows:

Comparison: Email
β”œβ”€-- ComparisonLevel: Exact match
β”œβ”€-- ComparisonLevel: Exact match on username with different domain
β”œβ”€-- ComparisonLevel: Fuzzy match on email using Jaro-Winkler
β”œβ”€-- ComparisonLevel: Fuzzy match on username using Jaro-Winkler
β”œβ”€-- ComparisonLevel: All other comparisons

Or, using human_readable_description to generate automatically from email_comparison:

print(standard_email_comparison.human_readable_description)
Comparison 'Exact match vs. Exact username match different domain vs. Fuzzy Email within jaro_winkler threshold 0.88 vs. Fuzzy Username within jaro_winkler threshold 0.88 vs. anything else' of "email".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
    - 'Exact match email' with SQL rule: "email_l" = "email_r"
    - 'Exact match email' with SQL rule: 
        regexp_extract("email_l", '^[^@]+')
     = 
        regexp_extract("email_r", '^[^@]+')

    - 'Jaro_winkler_similarity email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
    - 'Jaro_winkler_similarity Username >= 0.88' with SQL rule: jaro_winkler_similarity(
        regexp_extract("email_l", '^[^@]+')
    , 
        regexp_extract("email_r", '^[^@]+')
    ) >= 0.88
    - 'All other comparisons' with SQL rule: ELSE

where individual email components are extracted under-the-hood using the regex_extract argument.

By default, the fuzzy matching is done using Jaro-Winkler thresholds. This will bias the start of a string, specifically the first four characters, which may not be appropriate for all emails. The email_comparison function is flexible and allows a number of other string fuzzy matching functions.

Users also have the option to set invalid_emails_as_null to True. If True, postcodes that do not adhere to a valid email format as determined by valid_email_regex will be included in the null level. valid_email_regex defaults to "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$".

For example:

bespoke_email_comparison = email_comparison(
    "email",
    jaro_winkler_thresholds=[],
    jaro_thresholds=[0.8],
    include_username_match_level=False,
    include_domain_match_level=True,
    invalid_emails_as_null=True,
)
print(bespoke_email_comparison.human_readable_description)
Comparison 'Exact match vs. Domain-only match vs.anything else' of "email".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: 
        regexp_extract("email_l", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
     IS NULL OR 
        regexp_extract("email_r", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
     IS NULL OR

        regexp_extract("email_l", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
    =='' OR 
        regexp_extract("email_r", '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$')
     ==''
    - 'Exact match email' with SQL rule: "email_l" = "email_r"
    - 'Jaro_similarity email >= 0.8' with SQL rule: jaro_similarity("email_l", "email_r") >= 0.8
    - 'Jaro_similarity email >= 0.8' with SQL rule: jaro_similarity("email_l", "email_r") >= 0.8
    - 'Exact match Email Domain' with SQL rule: 
        regexp_extract("email_l", '@([^@]+)$')
     = 
        regexp_extract("email_r", '@([^@]+)$')

    - 'All other comparisons' with SQL rule: ELSE

To see this as a specifications dictionary you can call

bespoke_email_comparison.as_dict()
{'output_column_name': 'email',
 'comparison_levels': [{'sql_condition': '\n        regexp_extract("email_l", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n     IS NULL OR \n        regexp_extract("email_r", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n     IS NULL OR\n                      \n        regexp_extract("email_l", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n    ==\'\' OR \n        regexp_extract("email_r", \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\')\n     ==\'\'',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"email_l" = "email_r"',
   'label_for_charts': 'Exact match email'},
  {'sql_condition': 'jaro_similarity("email_l", "email_r") >= 0.8',
   'label_for_charts': 'Jaro_similarity email >= 0.8'},
  {'sql_condition': 'jaro_similarity("email_l", "email_r") >= 0.8',
   'label_for_charts': 'Jaro_similarity email >= 0.8'},
  {'sql_condition': '\n        regexp_extract("email_l", \'@([^@]+)$\')\n     = \n        regexp_extract("email_r", \'@([^@]+)$\')\n    ',
   'label_for_charts': 'Exact match Email Domain'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. Domain-only match vs.anything else'}

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.