Documentation for comparison_template_library

The comparison_template_library contains pre-made comparisons with pre-defined parameters available for use directly as described in this topic guide. However, not every comparison is available for every Splink-compatible SQL backend. More detail on creating comparisons for specific data types is also included in the topic guide.

The pre-made Splink comparison templates available for each SQL dialect are as given in this table:

duckdb spark athena sqlite
date_comparison
forename_surname_comparison
name_comparison
postcode_comparison

The detailed API for each of these is outlined below.

Library comparison APIs

splink.comparison_template_library.DateComparisonBase

Bases: Comparison

__init__(col_name, cast_strings_to_date=False, date_format=None, invalid_dates_as_null=False, include_exact_match_level=True, term_frequency_adjustments=False, separate_1st_january=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], datediff_thresholds=[1, 1, 10], datediff_metrics=['month', 'year', 'year'], m_probability_exact_match=None, m_probability_1st_january=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_datediff=None, m_probability_else=None)

A wrapper to generate a comparison for the date column col_name with preselected defaults.

The default arguments will give a comparison with comparison levels:

  • Exact match

  • Damerau-Levenshtein distance <= 1

  • Date difference <= 1 month

  • Date difference <= 1 year

  • Date difference <= 10 years

  • Anything else
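
To make the Damerau-Levenshtein level concrete, the following is a minimal pure-Python sketch of the restricted (optimal string alignment) variant of the distance, applied to ISO date strings. It is illustrative only: Splink evaluates this level in SQL on the chosen backend, whose exact variant of the distance may differ, and the function name osa_distance is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "12" typed as "21".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("1985-03-12", "1985-03-21"))  # transposed day digits -> 1
print(osa_distance("1985-03-12", "1986-03-12"))  # one substituted digit -> 1
print(osa_distance("1985-03-12", "1990-11-30"))  # unrelated dates, well above 1
```

A threshold of 1 therefore catches single-keystroke errors in a date string, including swapped adjacent digits, without matching unrelated dates.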

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
cast_strings_to_date bool

Set to True to enable date-casting when input dates are strings. Also adjust date_format if date-strings are not in (yyyy-mm-dd) format. Defaults to False.

False
date_format str

Format of input dates if date-strings are given. Must be consistent across record pairs. If None (the default), downstream functions for each backend assign date_format to ISO 8601 format (yyyy-mm-dd). Set to "yyyy-MM-dd" for Spark and "%Y-%m-%d" for DuckDB when invalid_dates_as_null=True.

None
invalid_dates_as_null bool

If True, assign any dates that do not adhere to date_format to the null level. Defaults to False.

False
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
separate_1st_january bool

If True, include a separate exact match comparison level when date is 1st January.

False
levenshtein_thresholds Union[int, list]

The thresholds to use for levenshtein similarity level(s). Defaults to []

[]
damerau_levenshtein_thresholds Union[int, list]

The thresholds to use for damerau-levenshtein similarity level(s). Defaults to [1]

[1]
datediff_thresholds Union[int, list]

The thresholds to use for datediff similarity level(s). Defaults to [1, 1, 10].

[1, 1, 10]
datediff_metrics Union[str, list]

The metrics to apply thresholds to for datediff similarity level(s). Each metric is paired with the threshold at the same position in datediff_thresholds. Defaults to ["month", "year", "year"].

['month', 'year', 'year']
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_dl Union[float, list]

If provided, overrides the default m probabilities for the damerau-levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_datediff Union[float, list]

If provided, overrides the default m probabilities for the datediff thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None
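
The pairing of datediff_thresholds with datediff_metrics can be sketched in plain Python. This is an assumption-laden illustration rather than Splink's implementation: it approximates months and years by average day counts, whereas each SQL backend applies its own date-difference semantics, and the names DAYS_PER_UNIT and first_datediff_level are hypothetical.

```python
from datetime import date

# Approximate length of each unit in days; an assumption made for this sketch only.
DAYS_PER_UNIT = {"day": 1, "month": 30.4375, "year": 365.25}


def first_datediff_level(d1, d2, thresholds=(1, 1, 10), metrics=("month", "year", "year")):
    """Return the first datediff level that the pair of dates satisfies."""
    if len(thresholds) != len(metrics):
        raise ValueError("thresholds and metrics must have the same length")
    gap_days = abs((d1 - d2).days)
    # Thresholds and metrics are paired by position and checked in order,
    # so the tightest level should come first.
    for threshold, metric in zip(thresholds, metrics):
        if gap_days <= threshold * DAYS_PER_UNIT[metric]:
            return f"datediff <= {threshold} {metric}"
    return "anything else"


print(first_datediff_level(date(1990, 1, 1), date(1990, 1, 20)))
print(first_datediff_level(date(1990, 1, 1), date(1990, 6, 1)))
print(first_datediff_level(date(1990, 1, 1), date(1995, 1, 1)))
print(first_datediff_level(date(1990, 1, 1), date(2005, 1, 1)))
```

In the full comparison, the exact match and string-distance levels are evaluated before any of these datediff levels.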

Examples:

Basic Date Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison("date_of_birth")

Bespoke Date Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    damerau_levenshtein_thresholds=[],
                    levenshtein_thresholds=[2],
                    datediff_thresholds=[1, 1],
                    datediff_metrics=["month", "year"])

Date Comparison casting columns to date and assigning values that do not match the date_format to the null level

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    cast_strings_to_date=True,
                    date_format='%d/%m/%Y',
                    invalid_dates_as_null=True)

Basic Date Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison("date_of_birth")

Bespoke Date Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    damerau_levenshtein_thresholds=[],
                    levenshtein_thresholds=[2],
                    datediff_thresholds=[1, 1],
                    datediff_metrics=["month", "year"])

Date Comparison casting columns to date and assigning values that do not match the date_format to the null level

import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    cast_strings_to_date=True,
                    date_format='dd/MM/yyyy',
                    invalid_dates_as_null=True)
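
The combined effect of cast_strings_to_date and invalid_dates_as_null can be imitated outside Splink with Python's datetime module, whose strptime format codes resemble the DuckDB-style ones used above. This is only a sketch of the idea (parse the string, fall back to null on failure); the helper name parse_or_null is hypothetical and the real casting happens in backend SQL.

```python
from datetime import date, datetime
from typing import Optional


def parse_or_null(value, date_format: str = "%Y-%m-%d") -> Optional[date]:
    """Return a date if value matches date_format, otherwise None (the null level)."""
    try:
        return datetime.strptime(value, date_format).date()
    except (ValueError, TypeError):
        # ValueError: wrong format or impossible date; TypeError: value is not a string.
        return None


print(parse_or_null("25/12/1990", "%d/%m/%Y"))  # parses to 1990-12-25
print(parse_or_null("1990-12-25", "%d/%m/%Y"))  # wrong format -> None
print(parse_or_null("31/02/1990", "%d/%m/%Y"))  # impossible date -> None
```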

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary.


splink.comparison_template_library.NameComparisonBase

Bases: Comparison

__init__(col_name, regex_extract=None, set_to_lowercase=False, include_exact_match_level=True, phonetic_col_name=None, term_frequency_adjustments=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], jaro_thresholds=[], jaro_winkler_thresholds=[0.9, 0.8], jaccard_thresholds=[], m_probability_exact_match_name=None, m_probability_exact_match_phonetic_name=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_jar=None, m_probability_or_probabilities_jw=None, m_probability_or_probabilities_jac=None, m_probability_else=None)

A wrapper to generate a comparison for the name column col_name with preselected defaults.

The default arguments will give a comparison with comparison levels:

  • Exact match

  • Damerau-Levenshtein Distance <= 1

  • Jaro Winkler similarity >= 0.9

  • Jaro Winkler similarity >= 0.8

  • Anything else
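
For intuition about the 0.9 and 0.8 Jaro-Winkler levels, here is a self-contained Python sketch of the textbook Jaro and Jaro-Winkler similarity together with a helper that picks the first threshold a pair satisfies. It is not how Splink computes the score (that is delegated to the SQL backend), and the names jaro, jaro_winkler and jw_level are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_weight * (1 - j)


def jw_level(s1: str, s2: str, thresholds=(0.9, 0.8)) -> str:
    """Return the first (highest) Jaro-Winkler threshold the pair meets."""
    score = jaro_winkler(s1, s2)
    for threshold in thresholds:
        if score >= threshold:
            return f"jaro_winkler >= {threshold}"
    return "anything else"


print(round(jaro_winkler("MARTHA", "MARHTA"), 4), jw_level("MARTHA", "MARHTA"))
print(round(jaro_winkler("DIXON", "DICKSONX"), 4), jw_level("DIXON", "DICKSONX"))
print(jw_level("smith", "jones"))
```

Because levels are checked from the strictest threshold down, thresholds are listed in descending order, as in the default [0.9, 0.8].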

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, all names are set to lowercase during the pairwise comparisons. Defaults to False

False
include_exact_match_level bool

If True, include an exact match level for col_name. Defaults to True.

True
phonetic_col_name str

The name of the column with phonetic reduction (such as dmetaphone) of col_name. Including this parameter will create an exact match level for phonetic_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None.

None
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level for "col_name". Defaults to False.

False
term_frequency_adjustments_phonetic_name bool

If True, apply term frequency adjustments to the exact match level for "phonetic_col_name". Defaults to False.

required
levenshtein_thresholds Union[int, list]

The thresholds to use for levenshtein similarity level(s). Defaults to []

[]
damerau_levenshtein_thresholds Union[int, list]

The thresholds to use for damerau-levenshtein similarity level(s). Defaults to [1]

[1]
jaro_thresholds Union[int, list]

The thresholds to use for jaro similarity level(s). Defaults to []

[]
jaro_winkler_thresholds Union[int, list]

The thresholds to use for jaro_winkler similarity level(s). Defaults to [0.9, 0.8]

[0.9, 0.8]
jaccard_thresholds Union[int, list]

The thresholds to use for jaccard similarity level(s). Defaults to []

[]
m_probability_exact_match_name float

Starting m probability for exact match level. Defaults to None.

None
m_probability_exact_match_phonetic_name float

Starting m probability for exact match level for phonetic_col_name. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_dl Union[float, list]

If provided, overrides the default m probabilities for the damerau-levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_datediff Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

required
m_probability_or_probabilities_jar Union[float, list]

Starting m probabilities for the jaro thresholds specified. Defaults to None.

None
m_probability_or_probabilities_jw Union[float, list]

Starting m probabilities for the jaro winkler thresholds specified. Defaults to None.

None
m_probability_or_probabilities_jac Union[float, list]

Starting m probabilities for the jaccard thresholds specified. Defaults to None.

None
m_probability_else float

Starting m probability for the 'everything else' level. Defaults to None.

None
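
As a rough illustration of what a jaccard_thresholds level measures, the sketch below computes Jaccard similarity over the sets of characters in two strings. Treating each string as a set of characters is an assumption made for this example; the tokenisation actually used depends on the SQL backend, and the function name jaccard_similarity is hypothetical.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the characters of each string."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard_similarity("smith", "smyth"))    # 4 shared characters out of 6 distinct
print(jaccard_similarity("listen", "silent"))  # same character set, different order
print(jaccard_similarity("smith", "jones"))
```

Under this character-set assumption the measure ignores order and repetition, so anagrams score 1.0; a Jaccard level is therefore looser than an exact match even at a threshold of 1.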

Examples:

Basic Name Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.name_comparison("name")

Bespoke Name Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.name_comparison("name",
                    phonetic_col_name="name_dm",
                    term_frequency_adjustments=True,
                    levenshtein_thresholds=[2],
                    damerau_levenshtein_thresholds=[],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1]
                    )

Basic Name Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.name_comparison("name")

Bespoke Name Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.name_comparison("name",
                    phonetic_col_name="name_dm",
                    term_frequency_adjustments=True,
                    levenshtein_thresholds=[2],
                    damerau_levenshtein_thresholds=[],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1]
                    )

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary.


splink.comparison_template_library.ForenameSurnameComparisonBase

Bases: Comparison

__init__(forename_col_name, surname_col_name, set_to_lowercase=False, include_exact_match_level=True, include_columns_reversed=True, term_frequency_adjustments=False, tf_adjustment_col_forename_and_surname=None, phonetic_forename_col_name=None, phonetic_surname_col_name=None, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[], jaro_winkler_thresholds=[0.88], jaro_thresholds=[], jaccard_thresholds=[], m_probability_exact_match_forename_surname=None, m_probability_exact_match_phonetic_forename_surname=None, m_probability_columns_reversed_forename_surname=None, m_probability_exact_match_surname=None, m_probability_exact_match_forename=None, m_probability_exact_match_phonetic_surname=None, m_probability_exact_match_phonetic_forename=None, m_probability_or_probabilities_surname_lev=None, m_probability_or_probabilities_surname_dl=None, m_probability_or_probabilities_surname_jw=None, m_probability_or_probabilities_surname_jac=None, m_probability_or_probabilities_forename_lev=None, m_probability_or_probabilities_forename_dl=None, m_probability_or_probabilities_forename_jw=None, m_probability_or_probabilities_forename_jac=None, m_probability_else=None)

A wrapper to generate a comparison for the forename and surname columns forename_col_name and surname_col_name with preselected defaults.

The default arguments will give a comparison with comparison levels:

  • Exact match forename and surname

  • Match of forename and surname reversed

  • Exact match surname

  • Exact match forename

  • Fuzzy match surname jaro-winkler >= 0.88

  • Fuzzy match forename jaro-winkler >= 0.88

  • Anything else
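
The ordering of these levels matters: a record pair is assigned to the first level whose condition it satisfies. The sketch below mirrors the exact-match and reversed-columns levels in plain Python to show that precedence. The fuzzy Jaro-Winkler levels and null handling are deliberately left out, and the function name and dictionary keys are hypothetical rather than part of Splink.

```python
def forename_surname_level(left: dict, right: dict) -> str:
    """Return the first level satisfied by a pair of records (fuzzy levels omitted)."""
    f_l, s_l = left["forename"], left["surname"]
    f_r, s_r = right["forename"], right["surname"]
    if f_l == f_r and s_l == s_r:
        return "exact match forename and surname"
    if f_l == s_r and s_l == f_r:
        return "forename and surname reversed"
    if s_l == s_r:
        return "exact match surname"
    if f_l == f_r:
        return "exact match forename"
    return "anything else"


a = {"forename": "james", "surname": "taylor"}
print(forename_surname_level(a, {"forename": "james", "surname": "taylor"}))
print(forename_surname_level(a, {"forename": "taylor", "surname": "james"}))
print(forename_surname_level(a, {"forename": "jim", "surname": "taylor"}))
print(forename_surname_level(a, {"forename": "james", "surname": "tailor"}))
print(forename_surname_level(a, {"forename": "anna", "surname": "green"}))
```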

Parameters:

Name Type Description Default
forename_col_name str

The name of the forename column to compare

required
surname_col_name str

The name of the surname column to compare

required
set_to_lowercase bool

If True, all names are set to lowercase during the pairwise comparisons. Defaults to False

False
include_exact_match_level bool

If True, include an exact match level for forename_col_name and surname_col_name. Defaults to True.

True
include_columns_reversed bool

If True, include a comparison level for forename and surname being swapped. Defaults to True

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level for forename_col_name and surname_col_name. Applies term frequency adjustments to full name exact match level and columns reversed exact match level if tf_adjustment_col_forename_and_surname is provided. Applies term frequency adjustments to phonetic_forename_col_name and phonetic_surname_col_name exact match levels, if they are provided. Defaults to False.

False
tf_adjustment_col_forename_and_surname str

The name of a combined forename surname column. This column is used to provide term frequency adjustments for forename surname exact match and columns reversed levels. Defaults to None

None
phonetic_forename_col_name str

The name of the column with phonetic reduction (such as dmetaphone) of forename_col_name. Including parameter will create an exact match level for phonetic_forename_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None

None
phonetic_surname_col_name str

The name of the column with phonetic reduction (such as dmetaphone) of surname_col_name. Including this parameter will create an exact match level for phonetic_surname_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None

None
levenshtein_thresholds Union[int, list]

The thresholds to use for levenshtein similarity level(s) for surname_col_name and forename_col_name. Defaults to []

[]
damerau_levenshtein_thresholds Union[int, list]

The thresholds to use for damerau-levenshtein similarity level(s). Defaults to []

[]
jaro_winkler_thresholds Union[int, list]

The thresholds to use for jaro_winkler similarity level(s) for surname_col_name and forename_col_name. Defaults to [0.88]

[0.88]
jaro_thresholds Union[int, list]

The thresholds to use for jaro similarity level(s) for surname_col_name and forename_col_name. Defaults to []

[]
jaccard_thresholds Union[int, list]

The thresholds to use for jaccard similarity level(s) for surname_col_name and forename_col_name. Defaults to []

[]
m_probability_exact_match_forename_surname float

If provided, overrides the default m probability for the exact match level for forename and surname. Defaults to None.

None
m_probability_exact_match_phonetic_forename_surname float

If provided, overrides the default m probability for the phonetic match level for forename and surname. Defaults to None.

None
m_probability_columns_reversed_forename_surname float

If provided, overrides the default m probability for the columns reversed level for forename and surname. Defaults to None.

None
m_probability_exact_match_surname float

If provided, overrides the default m probability for the surname exact match level. Defaults to None.

None
m_probability_exact_match_forename float

If provided, overrides the default m probability for the forename exact match level. Defaults to None.

None
m_probability_exact_match_phonetic_surname float

If provided, overrides the default m probability for the surname phonetic match level. Defaults to None.

None
m_probability_exact_match_phonetic_forename float

If provided, overrides the default m probability for the forename phonetic match level. Defaults to None.

None
m_probability_or_probabilities_surname_lev Union[float, list]

If provided, overrides the default m probabilities for the surname levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_surname_dl Union[float, list]

If provided, overrides the default m probabilities for the surname damerau-levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_surname_jw Union[float, list]

If provided, overrides the default m probabilities for the surname jaro-winkler thresholds specified. Defaults to None.

None
m_probability_or_probabilities_surname_jac Union[float, list]

If provided, overrides the default m probabilities for the surname jaccard thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_lev Union[float, list]

If provided, overrides the default m probabilities for the forename levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_dl Union[float, list]

If provided, overrides the default m probabilities for the forename damerau-levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_jw Union[float, list]

If provided, overrides the default m probabilities for the forename jaro-winkler thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_jac Union[float, list]

If provided, overrides the default m probabilities for the forename jaccard thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Examples:

Basic Forename Surname Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")

Bespoke Forename Surname Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.forename_surname_comparison(
        "forename",
        "surname",
        term_frequency_adjustments=True,
        tf_adjustment_col_forename_and_surname="full_name",
        phonetic_forename_col_name="forename_dm",
        phonetic_surname_col_name="surname_dm",
        levenshtein_thresholds=[2],
        jaro_winkler_thresholds=[],
        jaccard_thresholds=[1],
    )

Basic Forename Surname Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")

Bespoke Forename Surname Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.forename_surname_comparison(
        "forename",
        "surname",
        term_frequency_adjustments=True,
        tf_adjustment_col_forename_and_surname="full_name",
        phonetic_forename_col_name="forename_dm",
        phonetic_surname_col_name="surname_dm",
        levenshtein_thresholds=[2],
        jaro_winkler_thresholds=[],
        jaccard_thresholds=[1],
    )

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary.


splink.comparison_template_library.PostcodeComparisonBase

Bases: Comparison

__init__(col_name, invalid_postcodes_as_null=False, set_to_lowercase=True, valid_postcode_regex='^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$', term_frequency_adjustments_full=False, include_full_match_level=True, include_sector_match_level=True, include_district_match_level=True, include_area_match_level=True, lat_col=None, long_col=None, km_thresholds=[], m_probability_full_match=None, m_probability_sector_match=None, m_probability_district_match=None, m_probability_area_match=None, m_probability_or_probabilities_km_distance=None, m_probability_else=None)

A wrapper to generate a comparison for the postcode column col_name with preselected defaults.

The default arguments will give a comparison with levels:

  • Exact match on full postcode

  • Exact match on sector

  • Exact match on district

  • Exact match on area

  • All other comparisons
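
The hierarchy of full postcode, sector, district and area can be sketched with Python's re module, using the default valid_postcode_regex from the signature above for validation. This is a simplified illustration that behaves as if invalid_postcodes_as_null=True and set_to_lowercase=True; the helper names postcode_parts and postcode_level are hypothetical, and Splink performs the equivalent extraction in backend SQL.

```python
import re

# Default valid_postcode_regex, copied from the signature above.
VALID_POSTCODE = re.compile(r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$")


def postcode_parts(postcode: str):
    """Split a valid UK postcode into its components, or return None if invalid."""
    if not VALID_POSTCODE.match(postcode):
        return None
    outward, inward = postcode.lower().split(" ")
    return {
        "full": f"{outward} {inward}",
        "sector": f"{outward} {inward[0]}",  # outward code plus first inward digit
        "district": outward,                 # outward code
        "area": re.match(r"[a-z]{1,2}", outward).group(0),  # leading letters
    }


def postcode_level(a: str, b: str) -> str:
    """Return the most specific component on which two postcodes agree."""
    parts_a, parts_b = postcode_parts(a), postcode_parts(b)
    if parts_a is None or parts_b is None:
        return "null"
    for part in ("full", "sector", "district", "area"):
        if parts_a[part] == parts_b[part]:
            return f"exact match on {part}"
    return "all other comparisons"


print(postcode_parts("SW1A 2AA"))
print(postcode_level("SW1A 2AA", "sw1a 2aa"))
print(postcode_level("SW1A 2AA", "SW1A 2BB"))
print(postcode_level("SW1A 2AA", "SW1A 1AA"))
print(postcode_level("SW1A 2AA", "SW1P 3BU"))
print(postcode_level("SW1A 2AA", "M1 1AE"))
print(postcode_level("SW1A2AA", "SW1A 2AA"))
```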

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
invalid_postcodes_as_null bool

If True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to False

False
set_to_lowercase bool

If True, all postcodes are set to lowercase during the pairwise comparisons. Defaults to True

True
valid_postcode_regex str

Regular expression pattern that is used to validate postcodes. If invalid_postcodes_as_null is True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to "^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$".

'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$'
term_frequency_adjustments_full bool

If True, apply term frequency adjustments to the full postcode exact match level. Defaults to False.

False
include_full_match_level bool

If True, include an exact match on full postcode level. Defaults to True.

True
include_sector_match_level bool

If True, include an exact match on sector level. Defaults to True.

True
include_district_match_level bool

If True, include an exact match on district level. Defaults to True.

True
include_area_match_level bool

If True, include an exact match on area level. Defaults to True.

True
include_distance_in_km_level bool

If True, include a comparison of distance between postcodes as measured in kilometers. Defaults to False.

required
lat_col str

The name of a latitude column, or the respective array or struct column containing the information, plus an index. For example: lat, long_lat['lat'] or long_lat[0].

None
long_col str

The name of a longitude column, or the respective array or struct column containing the information, plus an index. For example: long, long_lat['long'] or long_lat[1].

None
km_thresholds int, float, list

The total distance in kilometers to evaluate the distance_in_km_level comparison against.

[]
m_probability_full_match float

Starting m probability for full match level. Defaults to None.

None
m_probability_sector_match float

Starting m probability for sector match level. Defaults to None.

None
m_probability_district_match float

Starting m probability for district match level. Defaults to None.

None
m_probability_area_match float

Starting m probability for area match level. Defaults to None.

None
m_probability_or_probabilities_km_distance float

Starting m probability for 'distance in km' level. Defaults to None.

None
m_probability_else float

Starting m probability for the 'everything else' level. Defaults to None.

None
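
To show how lat_col, long_col and km_thresholds fit together, the sketch below computes a great-circle distance with the haversine formula and assigns it to the first threshold it falls under. It is an illustration under stated assumptions (a spherical Earth of radius 6371 km, hypothetical function names haversine_km and km_level); the distance calculation in Splink itself runs in backend SQL.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def km_level(distance_km: float, km_thresholds=(10, 100)) -> str:
    """Return the first (smallest) distance threshold that the pair falls within."""
    for threshold in km_thresholds:
        if distance_km <= threshold:
            return f"distance <= {threshold} km"
    return "all other comparisons"


london = (51.5074, -0.1278)
reading = (51.4543, -0.9781)
paris = (48.8566, 2.3522)

print(km_level(haversine_km(*london, *london)))
print(km_level(haversine_km(*london, *reading)))
print(km_level(haversine_km(*london, *paris)))
```

Thresholds are listed in ascending order so that the tightest distance band is tested first.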

Examples:

Basic Postcode Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.postcode_comparison("postcode")

Bespoke Postcode Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                    invalid_postcodes_as_null=True,
                    lat_col="lat",
                    long_col="long",
                    km_thresholds=[10, 100]
                    )

Basic Postcode Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.postcode_comparison("postcode")

Bespoke Postcode Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                    invalid_postcodes_as_null=True,
                    lat_col="lat",
                    long_col="long",
                    km_thresholds=[10, 100]
                    )

Basic Postcode Comparison

import splink.athena.athena_comparison_template_library as ctl
ctl.postcode_comparison("postcode")

Bespoke Postcode Comparison

import splink.athena.athena_comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                    invalid_postcodes_as_null=True,
                    lat_col="lat",
                    long_col="long",
                    km_thresholds=[10, 100]
                    )

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary.