Documentation for comparison_template_library
The comparison_template_library
contains premade comparisons with predefined parameters available for use directly as described in this topic guide.
However, not every comparison is available for every Splink-compatible SQL backend. More detail on creating comparisons for specific data types is also included in the topic guide.
The premade Splink comparison templates available for each SQL dialect are as given in this table:
                             duckdb  spark  athena  sqlite
date_comparison              ✓       ✓
forename_surname_comparison  ✓       ✓
name_comparison              ✓       ✓
postcode_comparison          ✓       ✓      ✓
The detailed API for each of these is outlined below.
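As a rough illustration of the availability table, the sketch below maps each template to the backends it is ticked for and returns the module path to import. The helper name `template_module` and the `AVAILABILITY` dictionary are hypothetical and not part of Splink; the module path pattern is taken from the import lines used in the examples on this page, and the tick-to-backend mapping assumes the ticks fill the columns from left to right.

```python
# Availability per template, transcribed from the table above
# (assumption: ticks fill the backend columns from left to right).
AVAILABILITY = {
    "date_comparison": {"duckdb", "spark"},
    "forename_surname_comparison": {"duckdb", "spark"},
    "name_comparison": {"duckdb", "spark"},
    "postcode_comparison": {"duckdb", "spark", "athena"},
}


def template_module(dialect: str, template: str) -> str:
    """Return the dotted module path to import for a template on a backend.

    Hypothetical helper: it only builds the path string and never imports Splink.
    """
    if template not in AVAILABILITY:
        raise KeyError(f"Unknown template: {template}")
    if dialect not in AVAILABILITY[template]:
        raise ValueError(f"{template} is not available for {dialect}")
    return f"splink.{dialect}.{dialect}_comparison_template_library"


print(template_module("athena", "postcode_comparison"))
```

Checking availability before importing gives a clearer error than an import failure when a template is requested for a backend that lacks it.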
Library comparison APIs
splink.comparison_template_library.DateComparisonBase
Bases: Comparison
__init__(col_name, cast_strings_to_date=False, date_format=None, invalid_dates_as_null=False, include_exact_match_level=True, term_frequency_adjustments=False, separate_1st_january=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], datediff_thresholds=[1, 1, 10], datediff_metrics=['month', 'year', 'year'], m_probability_exact_match=None, m_probability_1st_january=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_datediff=None, m_probability_else=None)
A wrapper to generate a comparison for the date column col_name with preselected defaults.
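The default arguments will give a comparison with comparison levels:

Exact match

Damerau-Levenshtein distance <= 1

Date difference <= 1 month

Date difference <= 1 year

Date difference <= 10 years

Anything else

Setting separate_1st_january=True splits the exact match into two levels: exact match (1st of January only) and exact match (all other dates).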
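To make the relationship between the arguments and the resulting levels concrete, here is a minimal sketch that lists the level descriptions implied by a set of arguments, using the defaults from the `__init__` signature above. The function `describe_date_levels` is hypothetical and is not how Splink builds its levels; the ordering of the string-distance levels and the requirement that `datediff_thresholds` and `datediff_metrics` have equal length are assumptions.

```python
def describe_date_levels(
    include_exact_match_level=True,
    separate_1st_january=False,
    levenshtein_thresholds=(),
    damerau_levenshtein_thresholds=(1,),
    datediff_thresholds=(1, 1, 10),
    datediff_metrics=("month", "year", "year"),
):
    """List level descriptions implied by the arguments (illustrative only)."""
    # Assumption: each threshold is paired with the metric at the same position.
    if len(datediff_thresholds) != len(datediff_metrics):
        raise ValueError("datediff_thresholds and datediff_metrics must pair up")

    levels = []
    if include_exact_match_level:
        if separate_1st_january:
            levels.append("Exact match (1st of January only)")
            levels.append("Exact match (all other dates)")
        else:
            levels.append("Exact match")
    for threshold in levenshtein_thresholds:
        levels.append(f"Levenshtein distance <= {threshold}")
    for threshold in damerau_levenshtein_thresholds:
        levels.append(f"Damerau-Levenshtein distance <= {threshold}")
    for threshold, metric in zip(datediff_thresholds, datediff_metrics):
        plural = "" if threshold == 1 else "s"
        levels.append(f"Date difference <= {threshold} {metric}{plural}")
    levels.append("Anything else")
    return levels


for level in describe_date_levels():
    print(level)
```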
Parameters:
Name  Type  Description  Default 

col_name 
str

The name of the column to compare. 
required 
cast_strings_to_date 
bool

Set to True to enable date-casting when input dates are strings. Also adjust date_format if date-strings are not in (yyyy-mm-dd) format. Defaults to False. 
False

date_format 
str

Format of input dates if date-strings are given. Must be consistent across record pairs. If None (the default), downstream functions for each backend assign date_format to ISO 8601 format (yyyy-mm-dd). Set to "yyyy-MM-dd" for Spark and "%Y-%m-%d" for DuckDB when invalid_dates_as_null=True. 
None

invalid_dates_as_null 
bool

Assign any dates that do not adhere to date_format to the null level. Defaults to False. 
False

include_exact_match_level 
bool

If True, include an exact match level. Defaults to True. 
True

term_frequency_adjustments 
bool

If True, apply term frequency adjustments to the exact match level. Defaults to False. 
False

separate_1st_january 
bool

If True, include a separate exact match comparison level when date is 1st January. 
False

levenshtein_thresholds 
Union[int, list]

The thresholds to use for levenshtein similarity level(s). Defaults to [] 
[]

damerau_levenshtein_thresholds 
Union[int, list]

The thresholds to use for Damerau-Levenshtein similarity level(s). Defaults to [1] 
[1]

datediff_thresholds 
Union[int, list]

The thresholds to use for datediff similarity level(s). Defaults to [1, 1, 10]. 
[1, 1, 10]

datediff_metrics 
Union[str, list]

The metrics to apply thresholds to for datediff similarity level(s). Defaults to ["month", "year", "year"]. 
['month', 'year', 'year']

m_probability_exact_match 
float

If provided, overrides the default m probability for the exact match level. Defaults to None. 
None

m_probability_or_probabilities_lev 
Union[float, list]

If provided, overrides the default m probabilities for the levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_dl 
Union[float, list]

If provided, overrides the default m probabilities for the Damerau-Levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_datediff 
Union[float, list]

If provided, overrides the default m probabilities for the datediff thresholds specified. Defaults to None. 
None

m_probability_else 
float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None. 
None
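The interaction of cast_strings_to_date, date_format and invalid_dates_as_null can be pictured with a small Python analogy: parse each string with the given format and treat anything that fails to parse as null. This is only a sketch using Python's `datetime.strptime`, whose directives resemble the DuckDB-style format shown on this page; Splink performs the equivalent step in SQL on the backend, and the helper `parse_or_null` is hypothetical.

```python
from datetime import date, datetime
from typing import Optional


def parse_or_null(value: Optional[str], date_format: str = "%Y-%m-%d") -> Optional[date]:
    """Parse a date-string, returning None when it does not match date_format."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, date_format).date()
    except ValueError:
        # Mirrors the idea behind invalid_dates_as_null=True:
        # non-conforming values fall into the null level.
        return None


print(parse_or_null("31/12/1999", "%d/%m/%Y"))   # a valid date
print(parse_or_null("1999-12-31", "%d/%m/%Y"))   # wrong format, treated as null
```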

Examples:
DuckDB
Basic Date Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison("date_of_birth")
Bespoke Date Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    damerau_levenshtein_thresholds=[],
                    levenshtein_thresholds=[2],
                    datediff_thresholds=[1, 1],
                    datediff_metrics=["month", "year"])
Date Comparison casting strings to dates, with invalid dates assigned to the null level
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    cast_strings_to_date=True,
                    date_format='%d/%m/%Y',
                    invalid_dates_as_null=True)
Spark
Basic Date Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison("date_of_birth")
Bespoke Date Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    damerau_levenshtein_thresholds=[],
                    levenshtein_thresholds=[2],
                    datediff_thresholds=[1, 1],
                    datediff_metrics=["month", "year"])
Date Comparison casting strings to dates, with invalid dates assigned to the null level
import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    cast_strings_to_date=True,
                    date_format='dd/MM/yyyy',
                    invalid_dates_as_null=True)
Returns:
Name  Type  Description 

Comparison 
Comparison

A comparison that can be included in the Splink settings dictionary. 
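For orientation, the returned comparison is placed in the "comparisons" list of a settings dictionary. The sketch below assembles such a dictionary; `build_settings` is a hypothetical helper, a placeholder string stands in for the real comparison object so that the snippet runs without Splink installed, and a real settings dictionary will usually carry further keys.

```python
def build_settings(comparisons, link_type="dedupe_only"):
    """Assemble a minimal settings dictionary around a list of comparisons."""
    return {"link_type": link_type, "comparisons": list(comparisons)}


# With Splink installed, the list would hold template objects, for example:
#   import splink.duckdb.duckdb_comparison_template_library as ctl
#   settings = build_settings([ctl.date_comparison("date_of_birth")])
settings = build_settings(["<comparison placeholder>"])
print(settings)
```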
splink.comparison_template_library.NameComparisonBase
Bases: Comparison
__init__(col_name, regex_extract=None, set_to_lowercase=False, include_exact_match_level=True, phonetic_col_name=None, term_frequency_adjustments=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], jaro_thresholds=[], jaro_winkler_thresholds=[0.9, 0.8], jaccard_thresholds=[], m_probability_exact_match_name=None, m_probability_exact_match_phonetic_name=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_jar=None, m_probability_or_probabilities_jw=None, m_probability_or_probabilities_jac=None, m_probability_else=None)
A wrapper to generate a comparison for the name column col_name with preselected defaults.
The default arguments will give a comparison with comparison levels:

Exact match

Damerau-Levenshtein distance <= 1

Jaro-Winkler similarity >= 0.9

Jaro-Winkler similarity >= 0.8

Anything else
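The default levels can be read as a cascade in which a record pair lands in the first level whose condition holds. The sketch below implements that cascade in pure Python with textbook versions of the two metrics (the optimal string alignment form of Damerau-Levenshtein, and Jaro-Winkler with a prefix scale of 0.1). It is an illustration under those assumptions: the SQL backends compute these metrics with their own functions, which may differ in detail, and `name_level` is a hypothetical helper.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment distance (adjacent transpositions cost 1)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_flags, b_flags = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_flags[j] and b[j] == ch:
                a_flags[i] = b_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_matched = [c for c, f in zip(a, a_flags) if f]
    b_matched = [c for c, f in zip(b, b_flags) if f]
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def name_level(a: str, b: str) -> str:
    """Return the first default level that a pair of names satisfies."""
    if a == b:
        return "Exact match"
    if damerau_levenshtein(a, b) <= 1:
        return "Damerau-Levenshtein distance <= 1"
    similarity = jaro_winkler(a, b)
    if similarity >= 0.9:
        return "Jaro-Winkler similarity >= 0.9"
    if similarity >= 0.8:
        return "Jaro-Winkler similarity >= 0.8"
    return "Anything else"


for pair in [("smith", "simth"), ("elizabeth", "elisabet"), ("dwayne", "duane")]:
    print(pair, "->", name_level(*pair))
```

Because the levels are checked in order, a pair that satisfies both a distance level and a similarity level is assigned to whichever comes first.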
Parameters:
Name  Type  Description  Default 

col_name 
str

The name of the column to compare. 
required 
regex_extract 
str

Regular expression pattern to evaluate a match on. 
None

set_to_lowercase 
bool

If True, all names are set to lowercase during the pairwise comparisons. Defaults to False 
False

include_exact_match_level 
bool

If True, include an exact match level for col_name. Defaults to True. 
True

phonetic_col_name 
str

The name of the column with phonetic reduction (such as dmetaphone) of col_name. Including this parameter will create an exact match level for phonetic_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None. 
None

term_frequency_adjustments 
bool

If True, apply term frequency adjustments to the exact match level for "col_name". Defaults to False. 
False

term_frequency_adjustments_phonetic_name 
bool

If True, apply term frequency adjustments to the exact match level for "phonetic_col_name". Defaults to False. Note that this parameter does not appear in the __init__ signature above. 
False

levenshtein_thresholds 
Union[int, list]

The thresholds to use for levenshtein similarity level(s). Defaults to [] 
[]

damerau_levenshtein_thresholds 
Union[int, list]

The thresholds to use for Damerau-Levenshtein similarity level(s). Defaults to [1] 
[1]

jaro_thresholds 
Union[int, list]

The thresholds to use for jaro similarity level(s). Defaults to [] 
[]

jaro_winkler_thresholds 
Union[int, list]

The thresholds to use for jaro_winkler similarity level(s). Defaults to [0.9, 0.8] 
[0.9, 0.8]

jaccard_thresholds 
Union[int, list]

The thresholds to use for jaccard similarity level(s). Defaults to [] 
[]

m_probability_exact_match_name 
float

Starting m probability for exact match level. Defaults to None. 
None

m_probability_exact_match_phonetic_name 
float

Starting m probability for exact match level for phonetic_col_name. Defaults to None. 
None

m_probability_or_probabilities_lev 
Union[float, list]

If provided, overrides the default m probabilities for the Levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_dl 
Union[float, list]

If provided, overrides the default m probabilities for the Damerau-Levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_datediff 
Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. Note that this parameter does not appear in the __init__ signature above. 
None

m_probability_or_probabilities_jar 
Union[float, list]

Starting m probabilities for the jaro thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_jw 
Union[float, list]

Starting m probabilities for the jaro winkler thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_jac 
Union[float, list]

Starting m probabilities for the jaccard thresholds specified. Defaults to None. 
None

m_probability_else 
float

Starting m probability for the 'everything else' level. Defaults to None. 
None

Examples:
DuckDB
Basic Name Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.name_comparison("name")
Bespoke Name Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.name_comparison("name",
                    phonetic_col_name="name_dm",
                    term_frequency_adjustments=True,
                    levenshtein_thresholds=[2],
                    damerau_levenshtein_thresholds=[],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1]
                    )
Spark
Basic Name Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.name_comparison("name")
Bespoke Name Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.name_comparison("name",
                    phonetic_col_name="name_dm",
                    term_frequency_adjustments=True,
                    levenshtein_thresholds=[2],
                    damerau_levenshtein_thresholds=[],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1]
                    )
Returns:
Name  Type  Description 

Comparison 
Comparison

A comparison that can be included in the Splink settings dictionary. 
splink.comparison_template_library.ForenameSurnameComparisonBase
Bases: Comparison
__init__(forename_col_name, surname_col_name, set_to_lowercase=False, include_exact_match_level=True, include_columns_reversed=True, term_frequency_adjustments=False, tf_adjustment_col_forename_and_surname=None, phonetic_forename_col_name=None, phonetic_surname_col_name=None, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[], jaro_winkler_thresholds=[0.88], jaro_thresholds=[], jaccard_thresholds=[], m_probability_exact_match_forename_surname=None, m_probability_exact_match_phonetic_forename_surname=None, m_probability_columns_reversed_forename_surname=None, m_probability_exact_match_surname=None, m_probability_exact_match_forename=None, m_probability_exact_match_phonetic_surname=None, m_probability_exact_match_phonetic_forename=None, m_probability_or_probabilities_surname_lev=None, m_probability_or_probabilities_surname_dl=None, m_probability_or_probabilities_surname_jw=None, m_probability_or_probabilities_surname_jac=None, m_probability_or_probabilities_forename_lev=None, m_probability_or_probabilities_forename_dl=None, m_probability_or_probabilities_forename_jw=None, m_probability_or_probabilities_forename_jac=None, m_probability_else=None)
A wrapper to generate a comparison for the forename and surname columns forename_col_name and surname_col_name with preselected defaults.
The default arguments will give a comparison with comparison levels:

Exact match forename and surname

Match of forename and surname reversed

Exact match surname

Exact match forename

Fuzzy match surname Jaro-Winkler >= 0.88

Fuzzy match forename Jaro-Winkler >= 0.88

Anything else
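The ordering of these levels, and in particular the reversed-columns level, can be sketched as a cascade over a pair of (forename, surname) tuples. In the sketch below the fuzzy levels take a pluggable `similarity` function; the default uses `difflib.SequenceMatcher`, which is a stand-in and not Jaro-Winkler, so fuzzy results will differ from Splink's. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in similarity (difflib ratio), used here instead of Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def forename_surname_level(left, right, similarity=ratio, threshold=0.88):
    """Return the first level satisfied by two (forename, surname) tuples."""
    left_forename, left_surname = left
    right_forename, right_surname = right

    if left_forename == right_forename and left_surname == right_surname:
        return "Exact match forename and surname"
    # Forename recorded in the surname column and vice versa.
    if left_forename == right_surname and left_surname == right_forename:
        return "Match of forename and surname reversed"
    if left_surname == right_surname:
        return "Exact match surname"
    if left_forename == right_forename:
        return "Exact match forename"
    if similarity(left_surname, right_surname) >= threshold:
        return "Fuzzy match surname"
    if similarity(left_forename, right_forename) >= threshold:
        return "Fuzzy match forename"
    return "Anything else"


print(forename_surname_level(("john", "smith"), ("smith", "john")))
```

Checking the reversed case before the single-column levels keeps swapped names from being scored as a plain mismatch.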
Parameters:
Name  Type  Description  Default 

forename_col_name 
str

The name of the forename column to compare 
required 
surname_col_name 
str

The name of the surname column to compare 
required 
set_to_lowercase 
bool

If True, all names are set to lowercase during the pairwise comparisons. Defaults to False 
False

include_exact_match_level 
bool

If True, include an exact match level for col_name. Defaults to True. 
True

include_columns_reversed 
bool

If True, include a comparison level for forename and surname being swapped. Defaults to True 
True

term_frequency_adjustments 
bool

If True, apply term frequency adjustments to the exact match level for forename_col_name and surname_col_name. Applies term frequency adjustments to full name exact match level and columns reversed exact match level if tf_adjustment_col_forename_and_surname is provided. Applies term frequency adjustments to phonetic_forename_col_name and phonetic_surname_col_name exact match levels, if they are provided. Defaults to False. 
False

tf_adjustment_col_forename_and_surname 
str

The name of a combined forename surname column. This column is used to provide term frequency adjustments for forename surname exact match and columns reversed levels. Defaults to None 
None

phonetic_forename_col_name 
str

The name of the column with phonetic reduction (such as dmetaphone) of forename_col_name. Including this parameter will create an exact match level for phonetic_forename_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None. 
None

phonetic_surname_col_name 
str

The name of the column with phonetic reduction (such as dmetaphone) of surname_col_name. Including this parameter will create an exact match level for phonetic_surname_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None 
None

levenshtein_thresholds 
Union[int, list]

The thresholds to use for levenshtein similarity level(s) for surname_col_name and forename_col_name. Defaults to [] 
[]

damerau_levenshtein_thresholds 
Union[int, list]

The thresholds to use for Damerau-Levenshtein similarity level(s). Defaults to [] 
[]

jaro_winkler_thresholds 
Union[int, list]

The thresholds to use for jaro_winkler similarity level(s) for surname_col_name and forename_col_name. Defaults to [0.88] 
[0.88]

jaro_thresholds 
Union[int, list]

The thresholds to use for jaro similarity level(s) for surname_col_name and forename_col_name. Defaults to [] 
[]

jaccard_thresholds 
Union[int, list]

The thresholds to use for jaccard similarity level(s) for surname_col_name and forename_col_name. Defaults to [] 
[]

m_probability_exact_match_forename_surname 
float

If provided, overrides the default m probability for the exact match level for forename and surname. Defaults to None. 
None

m_probability_exact_match_phonetic_forename_surname 
float

If provided, overrides the default m probability for the phonetic match level for forename and surname. Defaults to None. 
None

m_probability_columns_reversed_forename_surname 
float

If provided, overrides the default m probability for the columns reversed level for forename and surname. Defaults to None. 
None

m_probability_exact_match_surname 
float

If provided, overrides the default m probability for the surname exact match level. Defaults to None. 
None

m_probability_exact_match_forename 
float

If provided, overrides the default m probability for the forename exact match level. Defaults to None. 
None

m_probability_exact_match_phonetic_surname 
float

If provided, overrides the default m probability for the surname phonetic match level. Defaults to None. 
None

m_probability_exact_match_phonetic_forename 
float

If provided, overrides the default m probability for the forename phonetic match level. Defaults to None. 
None

m_probability_or_probabilities_surname_lev 
Union[float, list]

If provided, overrides the default m probabilities for the surname Levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_surname_dl 
Union[float, list]

If provided, overrides the default m probabilities for the surname Damerau-Levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_surname_jw 
Union[float, list]

If provided, overrides the default m probabilities for the surname Jaro-Winkler thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_surname_jac 
Union[float, list]

If provided, overrides the default m probabilities for the surname Jaccard thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_forename_lev 
Union[float, list]

If provided, overrides the default m probabilities for the forename Levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_forename_dl 
Union[float, list]

If provided, overrides the default m probabilities for the forename Damerau-Levenshtein thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_forename_jw 
Union[float, list]

If provided, overrides the default m probabilities for the forename Jaro-Winkler thresholds specified. Defaults to None. 
None

m_probability_or_probabilities_forename_jac 
Union[float, list]

If provided, overrides the default m probabilities for the forename Jaccard thresholds specified. Defaults to None. 
None

m_probability_else 
float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None. 
None

Examples:
DuckDB
Basic Forename Surname Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")
Bespoke Forename Surname Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
Spark
Basic Forename Surname Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")
Bespoke Forename Surname Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
Returns:
Name  Type  Description 

Comparison 
Comparison

A comparison that can be included in the Splink settings dictionary. 
splink.comparison_template_library.PostcodeComparisonBase
Bases: Comparison
__init__(col_name, invalid_postcodes_as_null=False, set_to_lowercase=True, valid_postcode_regex='^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$', term_frequency_adjustments_full=False, include_full_match_level=True, include_sector_match_level=True, include_district_match_level=True, include_area_match_level=True, lat_col=None, long_col=None, km_thresholds=[], m_probability_full_match=None, m_probability_sector_match=None, m_probability_district_match=None, m_probability_area_match=None, m_probability_or_probabilities_km_distance=None, m_probability_else=None)
A wrapper to generate a comparison for a postcode column 'col_name' with preselected defaults.
The default arguments will give a comparison with levels:

Exact match on full postcode

Exact match on sector

Exact match on district

Exact match on area

All other comparisons
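The levels above correspond to progressively shorter prefixes of a UK postcode. The sketch below splits a postcode such as "SW1A 2AA" into area ("sw"), district ("sw1a") and sector ("sw1a 2") and returns the first level on which two postcodes agree. It assumes the validation pattern uses the character ranges A-Za-z and 0-9, lowercases the input as with set_to_lowercase=True, and sends invalid postcodes to a null level as with invalid_postcodes_as_null=True. The helper names are hypothetical and the splitting rules follow the usual UK postcode structure; Splink's own SQL may extract the parts differently.

```python
import re

# Assumed form of the validation pattern, with explicit character ranges.
VALID_POSTCODE = re.compile(r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$")


def postcode_parts(postcode: str):
    """Split a valid postcode into full, sector, district and area; else None."""
    cleaned = postcode.lower()
    if not VALID_POSTCODE.match(cleaned):
        return None
    outward, inward = cleaned.split(" ")
    area = re.match(r"[a-z]{1,2}", outward).group()
    return {
        "full": cleaned,
        "sector": f"{outward} {inward[0]}",
        "district": outward,
        "area": area,
    }


def postcode_level(a: str, b: str) -> str:
    """Return the first level on which two postcodes agree."""
    parts_a, parts_b = postcode_parts(a), postcode_parts(b)
    if parts_a is None or parts_b is None:
        return "Null"
    for key, label in (
        ("full", "Exact match on full postcode"),
        ("sector", "Exact match on sector"),
        ("district", "Exact match on district"),
        ("area", "Exact match on area"),
    ):
        if parts_a[key] == parts_b[key]:
            return label
    return "All other comparisons"


print(postcode_level("SW1A 2AA", "SW1A 2AB"))
```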
Parameters:
Name  Type  Description  Default 

col_name 
str

The name of the column to compare. 
required 
invalid_postcodes_as_null 
bool

If True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to False 
False

set_to_lowercase 
bool

If True, all postcodes are set to lowercase during the pairwise comparisons. Defaults to True 
True

valid_postcode_regex 
str

Regular expression pattern that is used to validate postcodes. If invalid_postcodes_as_null is True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to "^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$" 
'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$'

term_frequency_adjustments_full 
bool

If True, apply term frequency adjustments to the full postcode exact match level. Defaults to False. 
False

include_full_match_level 
bool

If True, include an exact match on full postcode level. Defaults to True. 
True

include_sector_match_level 
bool

If True, include an exact match on sector level. Defaults to True. 
True

include_district_match_level 
bool

If True, include an exact match on district level. Defaults to True. 
True

include_area_match_level 
bool

If True, include an exact match on area level. Defaults to True. 
True

include_distance_in_km_level 
bool

If True, include a comparison of distance between postcodes as measured in kilometers. Defaults to False. Note that this parameter does not appear in the __init__ signature above. 
False

lat_col 
str

The name of a latitude column or the respective array or struct column containing the information, plus an index. For example: lat, long_lat['lat'] or long_lat[0]. 
None

long_col 
str

The name of a longitude column or the respective array or struct column containing the information, plus an index. For example: long, long_lat['long'] or long_lat[1]. 
None

km_thresholds 
int, float, list

The total distance in kilometers to evaluate the distance_in_km_level comparison against. 
[]

m_probability_full_match 
float

Starting m probability for full match level. Defaults to None. 
None

m_probability_sector_match 
float

Starting m probability for sector match level. Defaults to None. 
None

m_probability_district_match 
float

Starting m probability for district match level. Defaults to None. 
None

m_probability_area_match 
float

Starting m probability for area match level. Defaults to None. 
None

m_probability_or_probabilities_km_distance 
float

Starting m probability for 'distance in km' level. Defaults to None. 
None

m_probability_else 
float

Starting m probability for the 'everything else' level. Defaults to None. 
None
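To show how lat_col, long_col and km_thresholds relate, the sketch below computes a great-circle distance with the haversine formula and assigns it to the first threshold it falls within. The haversine formula, the Earth radius of 6371 km and the use of "<=" at each threshold are assumptions made for illustration; the formula Splink's SQL actually uses may differ, and both function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def distance_level(km, km_thresholds=(10, 100)):
    """Return the tightest threshold level that the distance satisfies."""
    for threshold in sorted(km_thresholds):
        if km <= threshold:
            return f"Distance <= {threshold} km"
    return "Anything else"


# Central London to central Paris, using approximate coordinates.
distance = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(round(distance), distance_level(distance))
```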

Examples:
DuckDB
Basic Postcode Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.postcode_comparison("postcode")
Bespoke Postcode Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                        invalid_postcodes_as_null=True,
                        lat_col="lat",
                        long_col="long",
                        km_thresholds=[10, 100]
                        )
Spark
Basic Postcode Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.postcode_comparison("postcode")
Bespoke Postcode Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                        invalid_postcodes_as_null=True,
                        lat_col="lat",
                        long_col="long",
                        km_thresholds=[10, 100]
                        )
Athena
Basic Postcode Comparison
import splink.athena.athena_comparison_template_library as ctl
ctl.postcode_comparison("postcode")
Bespoke Postcode Comparison
import splink.athena.athena_comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                        invalid_postcodes_as_null=True,
                        lat_col="lat",
                        long_col="long",
                        km_thresholds=[10, 100]
                        )
Returns:
Name  Type  Description 

Comparison 
Comparison

A comparison that can be included in the Splink settings dictionary. 