Out-of-the-box `Comparisons` for specific data types¶

Splink has pre-defined Comparisons available for variety of data types.

DateOfBirthComparison¶

You can find full API docs for DateOfBirthComparison here

import splink.comparison_library as cl

date_of_birth_comparison = cl.DateOfBirthComparison(
    "date_of_birth",
    input_is_string=True,
)

You can view the structure of the comparison as follows:

print(date_of_birth_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'DateOfBirthComparison' of "date_of_birth".
Similarity is assessed using the following ComparisonLevels:
    - 'transformed date_of_birth is NULL' with SQL rule: try_strptime("date_of_birth_l", '%Y-%m-%d') IS NULL OR try_strptime("date_of_birth_r", '%Y-%m-%d') IS NULL
    - 'Exact match on date of birth' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
    - 'DamerauLevenshtein distance <= 1' with SQL rule: damerau_levenshtein("date_of_birth_l", "date_of_birth_r") <= 1
    - 'Abs date difference <= 1 month' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 2629800.0
    - 'Abs date difference <= 1 year' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 31557600.0
    - 'Abs date difference <= 10 year' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 315576000.0
    - 'All other comparisons' with SQL rule: ELSE

To see this as a specifications dictionary you can use:

date_of_birth_comparison.get_comparison("duckdb").as_dict()

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.

Name Comparison¶

A Name comparison is intended for use on an individual name column (e.g. forename, surname)

You can find full API docs for NameComparison here

import splink.comparison_library as cl

first_name_comparison = cl.NameComparison("first_name")

print(first_name_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'NameComparison' of "first_name".
Similarity is assessed using the following ComparisonLevels:
    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92
    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88
    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7
    - 'All other comparisons' with SQL rule: ELSE

The NameComparison also allows flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

surname_comparison = cl.NameComparison(
    "surname",
    jaro_winkler_thresholds=[0.95, 0.9],
    dmeta_col_name="surname_dmeta",
)
print(surname_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'NameComparison' of "surname" and "surname_dmeta".
Similarity is assessed using the following ComparisonLevels:
    - 'surname is NULL' with SQL rule: "surname_l" IS NULL OR "surname_r" IS NULL
    - 'Exact match on surname' with SQL rule: "surname_l" = "surname_r"
    - 'Jaro-Winkler distance of surname >= 0.95' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.95
    - 'Jaro-Winkler distance of surname >= 0.9' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.9
    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect("surname_dmeta_l", "surname_dmeta_r")) >= 1
    - 'All other comparisons' with SQL rule: ELSE

Where surname_dm refers to a column which has used the DoubleMetaphone algorithm on surname to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the topic guide.

To see this as a specifications dictionary you can call

surname_comparison.get_comparison("duckdb").as_dict()

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.

Forename and Surname Comparison¶

It can be helpful to construct a single comparison for for comparing the forename and surname because:

The Fellegi-Sunter model assumes that columns are independent. We know that forename and surname are usually correlated given the regional variation of names etc, so considering then in a single comparison can help to create better models.

As a result term-frequencies of individual forename and surname individually does not necessarily reflect how common the combination of forename and surname are. For more information on term-frequencies, see the dedicated topic guide. Combining forename and surname in a single comparison can allows the model to consider the joint term-frequency as well as individual.
It is common for some records to have swapped forename and surname by mistake. Addressing forename and surname in a single comparison can allows the model to consider these name inversions.

The ForenameSurnameComparison has been designed to accomodate this.

You can find full API docs for ForenameSurnameComparison here

import splink.comparison_library as cl

full_name_comparison = cl.ForenameSurnameComparison("forename", "surname")

print(full_name_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'ForenameSurnameComparison' of "forename" and "surname".
Similarity is assessed using the following ComparisonLevels:
    - '(forename is NULL) AND (surname is NULL)' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
    - '(Exact match on forename) AND (Exact match on surname)' with SQL rule: ("forename_l" = "forename_r") AND ("surname_l" = "surname_r")
    - 'Match on reversed cols: forename and surname' with SQL rule: "forename_l" = "surname_r" AND "forename_r" = "surname_l"
    - '(Jaro-Winkler distance of forename >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)' with SQL rule: (jaro_winkler_similarity("forename_l", "forename_r") >= 0.92) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.92)
    - '(Jaro-Winkler distance of forename >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)' with SQL rule: (jaro_winkler_similarity("forename_l", "forename_r") >= 0.88) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.88)
    - 'Exact match on surname' with SQL rule: "surname_l" = "surname_r"
    - 'Exact match on forename' with SQL rule: "forename_l" = "forename_r"
    - 'All other comparisons' with SQL rule: ELSE

As noted in the feature engineering guide, to take advantage of term frequency adjustments on full name, you need to derive a full name column prior to importing data into Splink. You then provide the column name using the forename_surname_concat_col_name argument:

full_name_comparison = cl.ForenameSurnameComparison("forename", "surname", forename_surname_concat_col_name="first_and_last_name")

To see this as a specifications dictionary you can call

full_name_comparison.get_comparison("duckdb").as_dict()

Which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.

Postcode Comparisons¶

See Feature Engineering for more details.

import splink.comparison_library as cl

pc_comparison = cl.PostcodeComparison("postcode")

print(pc_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'PostcodeComparison' of "postcode".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on full postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}', 0), '')
    - 'All other comparisons' with SQL rule: ELSE

If you have derived lat long columns, you can model geographical distances.

pc_comparison = cl.PostcodeComparison("postcode", lat_col="lat", long_col="long", km_thresholds=[1,10,50])
print(pc_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'PostcodeComparison' of "postcode", "long" and "lat".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
    - 'Distance in km <= 1' with SQL rule:  cast( acos( case when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) > 1 then 1 when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) < -1 then -1 else ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) end ) * 6371 as float ) <= 1
    - 'Distance in km <= 10' with SQL rule:  cast( acos( case when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) > 1 then 1 when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) < -1 then -1 else ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) end ) * 6371 as float ) <= 10
    - 'Distance in km <= 50' with SQL rule:  cast( acos( case when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) > 1 then 1 when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) < -1 then -1 else ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) end ) * 6371 as float ) <= 50
    - 'All other comparisons' with SQL rule: ELSE

To see this as a specifications dictionary you can call

pc_comparison.get_comparison("duckdb").as_dict()

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.

Email Comparison¶

You can find full API docs for EmailComparison here

import splink.comparison_library as cl

email_comparison = cl.EmailComparison("email")

print(email_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'EmailComparison' of "email".
Similarity is assessed using the following ComparisonLevels:
    - 'email is NULL' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
    - 'Exact match on email' with SQL rule: "email_l" = "email_r"
    - 'Exact match on username' with SQL rule: NULLIF(regexp_extract("email_l", '^[^@]+', 0), '') = NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')
    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
    - 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract("email_l", '^[^@]+', 0), ''), NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')) >= 0.88
    - 'All other comparisons' with SQL rule: ELSE

To see this as a specifications dictionary you can call

email_comparison.as_dict()

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide , if desired.

Out-of-the-box Comparisons for specific data types¶

DateOfBirthComparison¶

Name Comparison¶

Forename and Surname Comparison¶

Postcode Comparisons¶

Email Comparison¶

Out-of-the-box `Comparisons` for specific data types¶