Documentation for comparison_template_library
The comparison_template_library contains pre-made comparisons with pre-defined parameters, available for use directly as described in this topic guide.
However, not every comparison is available for every Splink-compatible SQL backend. More detail on creating comparisons for specific data types is also included in the topic guide.
The pre-made Splink comparison templates available for each SQL dialect are given in this table:
Comparison template | duckdb | spark | athena | sqlite |
---|---|---|---|---|
date_comparison | ✓ | ✓ | | |
forename_surname_comparison | ✓ | ✓ | | |
name_comparison | ✓ | ✓ | | |
postcode_comparison | ✓ | ✓ | ✓ | |
The detailed API for each of these is outlined below.
Library comparison APIs
splink.comparison_template_library.DateComparisonBase
Bases: Comparison
__init__(col_name, cast_strings_to_date=False, date_format=None, invalid_dates_as_null=False, include_exact_match_level=True, term_frequency_adjustments=False, separate_1st_january=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], datediff_thresholds=[1, 1, 10], datediff_metrics=['month', 'year', 'year'], m_probability_exact_match=None, m_probability_1st_january=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_datediff=None, m_probability_else=None)
A wrapper to generate a comparison for the date column named in col_name, with preselected defaults.
The default arguments will give a comparison with comparison levels:
- Exact match
- Damerau-Levenshtein distance <= 1
- Date difference <= 1 month
- Date difference <= 1 year
- Date difference <= 10 years
- Anything else
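To make the "Damerau-Levenshtein distance <= 1" level concrete, here is a minimal pure-Python sketch of an edit distance that counts a swap of two adjacent characters as a single edit. It is illustrative only: it implements the optimal string alignment variant, and which variant each SQL backend actually uses is not stated on this page. The function name osa_distance is hypothetical and is not part of Splink.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (a restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "12" <-> "21"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


# Two dates written as ISO strings whose day digits were swapped
print(osa_distance("1990-01-12", "1990-01-21"))
```

Under this measure a date with two transposed digits sits at distance 1 from the true value, which is the kind of typing error a threshold of 1 is meant to catch.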
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
cast_strings_to_date | bool | Set to True to enable date-casting when input dates are strings. Also adjust date_format if date strings are not in yyyy-mm-dd format. Defaults to False. | False |
date_format | str | Format of input dates if date strings are given. Must be consistent across record pairs. If None (the default), downstream functions for each backend set date_format to ISO 8601 format (yyyy-mm-dd). Set to "yyyy-MM-dd" for Spark and "%Y-%m-%d" for DuckDB when invalid_dates_as_null=True. | None |
invalid_dates_as_null | bool | If True, assign any dates that do not adhere to date_format to the null level. Defaults to False. | False |
include_exact_match_level | bool | If True, include an exact match level. Defaults to True. | True |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. Defaults to False. | False |
separate_1st_january | bool | If True, include a separate exact match comparison level when the date is 1st January. Defaults to False. | False |
levenshtein_thresholds | Union[int, list] | The thresholds to use for levenshtein similarity level(s). Defaults to []. | [] |
damerau_levenshtein_thresholds | Union[int, list] | The thresholds to use for damerau-levenshtein similarity level(s). Defaults to [1]. | [1] |
datediff_thresholds | Union[int, list] | The thresholds to use for datediff similarity level(s). Defaults to [1, 1, 10]. | [1, 1, 10] |
datediff_metrics | Union[str, list] | The metrics to apply thresholds to for datediff similarity level(s). Defaults to ["month", "year", "year"]. | ['month', 'year', 'year'] |
m_probability_exact_match | float | If provided, overrides the default m probability for the exact match level. Defaults to None. | None |
m_probability_or_probabilities_lev | Union[float, list] | If provided, overrides the default m probabilities for the levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_dl | Union[float, list] | If provided, overrides the default m probabilities for the damerau-levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_datediff | Union[float, list] | If provided, overrides the default m probabilities for the datediff thresholds specified. Defaults to None. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. Defaults to None. | None |
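The datediff_thresholds and datediff_metrics parameters are read pairwise: the first threshold goes with the first metric, and so on. The sketch below illustrates that pairing in plain Python using the defaults from the signature above. It is not Splink's implementation: the conversion of months and years to days is an approximation chosen for this example, the length check is an assumption, and the name datediff_level is hypothetical.

```python
from datetime import date

# Approximate lengths in days; an assumption made for this sketch only
DAYS_PER_UNIT = {"day": 1.0, "month": 30.4375, "year": 365.25}


def datediff_level(d1, d2, thresholds=(1, 1, 10), metrics=("month", "year", "year")):
    """Return the index of the first (threshold, metric) pair satisfied, else None."""
    if len(thresholds) != len(metrics):
        raise ValueError("thresholds and metrics must have the same length")
    gap_days = abs((d1 - d2).days)
    for index, (threshold, metric) in enumerate(zip(thresholds, metrics)):
        if gap_days <= threshold * DAYS_PER_UNIT[metric]:
            return index
    return None  # falls through to later levels ('anything else')


print(datediff_level(date(2000, 1, 1), date(2000, 1, 21)))  # within 1 month
print(datediff_level(date(2000, 1, 1), date(2005, 1, 1)))   # within 10 years
```

Listing the pairs from tightest to loosest matters here, because the first pair that is satisfied determines the level.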
Examples:
DuckDB
Basic Date Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison("date_of_birth")
Bespoke Date Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison(
    "date_of_birth",
    damerau_levenshtein_thresholds=[],
    levenshtein_thresholds=[2],
    datediff_thresholds=[1, 1],
    datediff_metrics=["month", "year"],
)
Date Comparison casting strings to dates and assigning invalid dates to the null level
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.date_comparison(
    "date_of_birth",
    cast_strings_to_date=True,
    date_format="%d/%m/%Y",
    invalid_dates_as_null=True,
)
Spark
Basic Date Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison("date_of_birth")
Bespoke Date Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison(
    "date_of_birth",
    damerau_levenshtein_thresholds=[],
    levenshtein_thresholds=[2],
    datediff_thresholds=[1, 1],
    datediff_metrics=["month", "year"],
)
Date Comparison casting strings to dates and assigning invalid dates to the null level
import splink.spark.spark_comparison_template_library as ctl
ctl.date_comparison(
    "date_of_birth",
    cast_strings_to_date=True,
    date_format="dd/MM/yyyy",  # Spark pattern: MM is month, mm would be minutes
    invalid_dates_as_null=True,
)
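The effect of invalid_dates_as_null can be pictured with Python's own date parsing: a string that does not fit the expected format becomes a null value rather than raising an error. This is only an analogy in plain Python, not what the SQL backends execute. The helper name parse_or_null is hypothetical, and the format string uses Python/DuckDB style codes.

```python
from datetime import date, datetime
from typing import Optional


def parse_or_null(value: Optional[str], date_format: str = "%d/%m/%Y") -> Optional[date]:
    """Parse a date string; return None when it does not fit date_format."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, date_format).date()
    except ValueError:
        # Wrong layout or an impossible date such as 31 February
        return None


print(parse_or_null("31/12/1999"))  # a valid date
print(parse_or_null("1999-12-31"))  # wrong layout for this format
print(parse_or_null("31/02/2000"))  # impossible calendar date
```

Records whose dates end up as null in this way are then handled by the null level instead of being scored as a mismatch.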
Returns:
Name | Type | Description |
---|---|---|
Comparison | Comparison | A comparison that can be included in the Splink settings dictionary. |
splink.comparison_template_library.NameComparisonBase
Bases: Comparison
__init__(col_name, regex_extract=None, set_to_lowercase=False, include_exact_match_level=True, phonetic_col_name=None, term_frequency_adjustments=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], jaro_thresholds=[], jaro_winkler_thresholds=[0.9, 0.8], jaccard_thresholds=[], m_probability_exact_match_name=None, m_probability_exact_match_phonetic_name=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_jar=None, m_probability_or_probabilities_jw=None, m_probability_or_probabilities_jac=None, m_probability_else=None)
A wrapper to generate a comparison for the name column named in col_name, with preselected defaults.
The default arguments will give a comparison with comparison levels:
- Exact match
- Damerau-Levenshtein distance <= 1
- Jaro-Winkler similarity >= 0.9
- Jaro-Winkler similarity >= 0.8
- Anything else
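For readers unfamiliar with the Jaro-Winkler thresholds used above, the following is a small pure-Python sketch of the standard Jaro and Jaro-Winkler formulas (prefix scale 0.1, prefix capped at four characters). It is meant for intuition only; each SQL backend ships its own implementation, and the function names here are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def similarity_level(a: str, b: str, thresholds=(0.9, 0.8)):
    """Return the first threshold met (checked in the order given), else None."""
    score = jaro_winkler(a, b)
    for threshold in thresholds:
        if score >= threshold:
            return threshold
    return None


print(round(jaro_winkler("martha", "marhta"), 4))
print(similarity_level("dwayne", "duane"))
```

Because thresholds are checked from highest to lowest, a pair lands in the strictest level it qualifies for.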
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, all names are set to lowercase during the pairwise comparisons. Defaults to False. | False |
include_exact_match_level | bool | If True, include an exact match level for col_name. Defaults to True. | True |
phonetic_col_name | str | The name of the column with a phonetic reduction (such as dmetaphone) of col_name. Including this parameter will create an exact match level for phonetic_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None. | None |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level for col_name. Defaults to False. | False |
term_frequency_adjustments_phonetic_name | bool | If True, apply term frequency adjustments to the exact match level for phonetic_col_name. Defaults to False. This parameter is not listed in the __init__ signature above. | False |
levenshtein_thresholds | Union[int, list] | The thresholds to use for levenshtein similarity level(s). Defaults to []. | [] |
damerau_levenshtein_thresholds | Union[int, list] | The thresholds to use for damerau-levenshtein similarity level(s). Defaults to [1]. | [1] |
jaro_thresholds | Union[int, list] | The thresholds to use for jaro similarity level(s). Defaults to []. | [] |
jaro_winkler_thresholds | Union[int, list] | The thresholds to use for jaro_winkler similarity level(s). Defaults to [0.9, 0.8]. | [0.9, 0.8] |
jaccard_thresholds | Union[int, list] | The thresholds to use for jaccard similarity level(s). Defaults to []. | [] |
m_probability_exact_match_name | float | Starting m probability for the exact match level. Defaults to None. | None |
m_probability_exact_match_phonetic_name | float | Starting m probability for the exact match level for phonetic_col_name. Defaults to None. | None |
m_probability_or_probabilities_lev | Union[float, list] | If provided, overrides the default m probabilities for the levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_dl | Union[float, list] | If provided, overrides the default m probabilities for the damerau-levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_datediff | Union[float, list] | If provided, overrides the default m probabilities for the datediff thresholds specified. Defaults to None. This parameter is not listed in the __init__ signature above. | None |
m_probability_or_probabilities_jar | Union[float, list] | Starting m probabilities for the jaro thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_jw | Union[float, list] | Starting m probabilities for the jaro winkler thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_jac | Union[float, list] | Starting m probabilities for the jaccard thresholds specified. Defaults to None. | None |
m_probability_else | float | Starting m probability for the 'everything else' level. Defaults to None. | None |
Examples:
DuckDB
Basic Name Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.name_comparison("name")
Bespoke Name Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.name_comparison(
    "name",
    phonetic_col_name="name_dm",
    term_frequency_adjustments=True,
    levenshtein_thresholds=[2],
    damerau_levenshtein_thresholds=[],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
Spark
Basic Name Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.name_comparison("name")
Bespoke Name Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.name_comparison(
    "name",
    phonetic_col_name="name_dm",
    term_frequency_adjustments=True,
    levenshtein_thresholds=[2],
    damerau_levenshtein_thresholds=[],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
Returns:
Name | Type | Description |
---|---|---|
Comparison | Comparison | A comparison that can be included in the Splink settings dictionary. |
splink.comparison_template_library.ForenameSurnameComparisonBase
Bases: Comparison
__init__(forename_col_name, surname_col_name, set_to_lowercase=False, include_exact_match_level=True, include_columns_reversed=True, term_frequency_adjustments=False, tf_adjustment_col_forename_and_surname=None, phonetic_forename_col_name=None, phonetic_surname_col_name=None, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[], jaro_winkler_thresholds=[0.88], jaro_thresholds=[], jaccard_thresholds=[], m_probability_exact_match_forename_surname=None, m_probability_exact_match_phonetic_forename_surname=None, m_probability_columns_reversed_forename_surname=None, m_probability_exact_match_surname=None, m_probability_exact_match_forename=None, m_probability_exact_match_phonetic_surname=None, m_probability_exact_match_phonetic_forename=None, m_probability_or_probabilities_surname_lev=None, m_probability_or_probabilities_surname_dl=None, m_probability_or_probabilities_surname_jw=None, m_probability_or_probabilities_surname_jac=None, m_probability_or_probabilities_forename_lev=None, m_probability_or_probabilities_forename_dl=None, m_probability_or_probabilities_forename_jw=None, m_probability_or_probabilities_forename_jac=None, m_probability_else=None)
A wrapper to generate a comparison for the forename and surname columns named in forename_col_name and surname_col_name, with preselected defaults.
The default arguments will give a comparison with comparison levels:
- Exact match forename and surname
- Match of forename and surname reversed
- Exact match surname
- Exact match forename
- Fuzzy match surname jaro-winkler >= 0.88
- Fuzzy match forename jaro-winkler >= 0.88
- Anything else
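The order of the levels above matters: a record pair is assigned to the first level whose condition it meets. The plain-Python sketch below mirrors only the exact-match and reversed-columns levels from that list, in that order. The two Jaro-Winkler levels and null handling are deliberately left out, and the function name is hypothetical rather than part of Splink.

```python
def forename_surname_level(left, right):
    """Assign a (forename, surname) pair to the first level it satisfies.

    Assumes both inputs are tuples of two non-null strings.
    Fuzzy (Jaro-Winkler) levels are omitted from this sketch.
    """
    l_forename, l_surname = left
    r_forename, r_surname = right
    if l_forename == r_forename and l_surname == r_surname:
        return "exact match forename and surname"
    if l_forename == r_surname and l_surname == r_forename:
        return "forename and surname reversed"
    if l_surname == r_surname:
        return "exact match surname"
    if l_forename == r_forename:
        return "exact match forename"
    return "anything else"


print(forename_surname_level(("john", "smith"), ("smith", "john")))
print(forename_surname_level(("john", "smith"), ("jon", "smith")))
```

Checking the combined levels before the single-column ones keeps a full match from being counted merely as a surname match.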
Parameters:
Name | Type | Description | Default |
---|---|---|---|
forename_col_name | str | The name of the forename column to compare. | required |
surname_col_name | str | The name of the surname column to compare. | required |
set_to_lowercase | bool | If True, all names are set to lowercase during the pairwise comparisons. Defaults to False. | False |
include_exact_match_level | bool | If True, include an exact match level for forename_col_name and surname_col_name. Defaults to True. | True |
include_columns_reversed | bool | If True, include a comparison level for forename and surname being swapped. Defaults to True. | True |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level for forename_col_name and surname_col_name. Applies term frequency adjustments to the full name exact match level and the columns reversed exact match level if tf_adjustment_col_forename_and_surname is provided. Applies term frequency adjustments to the phonetic_forename_col_name and phonetic_surname_col_name exact match levels, if they are provided. Defaults to False. | False |
tf_adjustment_col_forename_and_surname | str | The name of a combined forename surname column. This column is used to provide term frequency adjustments for the forename surname exact match and columns reversed levels. Defaults to None. | None |
phonetic_forename_col_name | str | The name of the column with a phonetic reduction (such as dmetaphone) of forename_col_name. Including this parameter will create an exact match level for phonetic_forename_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None. | None |
phonetic_surname_col_name | str | The name of the column with a phonetic reduction (such as dmetaphone) of surname_col_name. Including this parameter will create an exact match level for phonetic_surname_col_name. The phonetic column must be present in the dataset to use this parameter. Defaults to None. | None |
levenshtein_thresholds | Union[int, list] | The thresholds to use for levenshtein similarity level(s) for surname_col_name and forename_col_name. Defaults to []. | [] |
damerau_levenshtein_thresholds | Union[int, list] | The thresholds to use for damerau-levenshtein similarity level(s). Defaults to []. | [] |
jaro_winkler_thresholds | Union[int, list] | The thresholds to use for jaro_winkler similarity level(s) for surname_col_name and forename_col_name. Defaults to [0.88]. | [0.88] |
jaro_thresholds | Union[int, list] | The thresholds to use for jaro similarity level(s) for surname_col_name and forename_col_name. Defaults to []. | [] |
jaccard_thresholds | Union[int, list] | The thresholds to use for jaccard similarity level(s) for surname_col_name and forename_col_name. Defaults to []. | [] |
m_probability_exact_match_forename_surname | float | If provided, overrides the default m probability for the exact match level for forename and surname. Defaults to None. | None |
m_probability_exact_match_phonetic_forename_surname | float | If provided, overrides the default m probability for the phonetic match level for forename and surname. Defaults to None. | None |
m_probability_columns_reversed_forename_surname | float | If provided, overrides the default m probability for the columns reversed level for forename and surname. Defaults to None. | None |
m_probability_exact_match_surname | float | If provided, overrides the default m probability for the surname exact match level. Defaults to None. | None |
m_probability_exact_match_forename | float | If provided, overrides the default m probability for the forename exact match level. Defaults to None. | None |
m_probability_exact_match_phonetic_surname | float | If provided, overrides the default m probability for the surname phonetic match level. Defaults to None. | None |
m_probability_exact_match_phonetic_forename | float | If provided, overrides the default m probability for the forename phonetic match level. Defaults to None. | None |
m_probability_or_probabilities_surname_lev | Union[float, list] | If provided, overrides the default m probabilities for the surname levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_surname_dl | Union[float, list] | If provided, overrides the default m probabilities for the surname damerau-levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_surname_jw | Union[float, list] | If provided, overrides the default m probabilities for the surname jaro winkler thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_surname_jac | Union[float, list] | If provided, overrides the default m probabilities for the surname jaccard thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_forename_lev | Union[float, list] | If provided, overrides the default m probabilities for the forename levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_forename_dl | Union[float, list] | If provided, overrides the default m probabilities for the forename damerau-levenshtein thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_forename_jw | Union[float, list] | If provided, overrides the default m probabilities for the forename jaro winkler thresholds specified. Defaults to None. | None |
m_probability_or_probabilities_forename_jac | Union[float, list] | If provided, overrides the default m probabilities for the forename jaccard thresholds specified. Defaults to None. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. Defaults to None. | None |
Examples:
DuckDB
Basic Forename Surname Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")
Bespoke Forename Surname Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
Spark
Basic Forename Surname Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")
Bespoke Forename Surname Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)
Returns:
Name | Type | Description |
---|---|---|
Comparison | Comparison | A comparison that can be included in the Splink settings dictionary. |
splink.comparison_template_library.PostcodeComparisonBase
Bases: Comparison
__init__(col_name, invalid_postcodes_as_null=False, set_to_lowercase=True, valid_postcode_regex='^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$', term_frequency_adjustments_full=False, include_full_match_level=True, include_sector_match_level=True, include_district_match_level=True, include_area_match_level=True, lat_col=None, long_col=None, km_thresholds=[], m_probability_full_match=None, m_probability_sector_match=None, m_probability_district_match=None, m_probability_area_match=None, m_probability_or_probabilities_km_distance=None, m_probability_else=None)
A wrapper to generate a comparison for the postcode column named in col_name, with preselected defaults.
The default arguments will give a comparison with levels:
- Exact match on full postcode
- Exact match on sector
- Exact match on district
- Exact match on area
- All other comparisons
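The full, sector, district and area levels correspond to progressively shorter prefixes of a UK postcode. As a rough illustration, the sketch below adds capture groups to the default valid_postcode_regex from the signature above and compares two postcodes level by level. How Splink extracts these parts in SQL is not shown on this page, so the helper names and the exact definitions of each part are assumptions made for this example.

```python
import re

# Default valid_postcode_regex from the signature, with named groups added
POSTCODE_RE = re.compile(
    r"^(?P<area>[A-Za-z]{1,2})(?P<district_rest>[0-9][A-Za-z0-9]?)"
    r" (?P<sector_digit>[0-9])[A-Za-z]{2}$"
)


def postcode_parts(postcode):
    """Split a postcode into lowercase parts, or return None if invalid."""
    match = POSTCODE_RE.match(postcode)
    if match is None:
        return None
    area = match.group("area").lower()
    district = area + match.group("district_rest").lower()
    sector = f"{district} {match.group('sector_digit')}"
    return {"area": area, "district": district, "sector": sector, "full": postcode.lower()}


def postcode_level(left, right):
    """Return the most specific level on which two postcodes agree."""
    l_parts, r_parts = postcode_parts(left), postcode_parts(right)
    if l_parts is None or r_parts is None:
        return "null"  # mirrors invalid_postcodes_as_null=True
    for part in ("full", "sector", "district", "area"):
        if l_parts[part] == r_parts[part]:
            return part
    return "all other comparisons"


print(postcode_parts("SW1A 1AA"))
print(postcode_level("SW1A 1AA", "SW1A 2AA"))
```

Lowercasing both sides before comparing plays the same role as set_to_lowercase=True in the signature.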
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
invalid_postcodes_as_null | bool | If True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to False. | False |
set_to_lowercase | bool | If True, all postcodes are set to lowercase during the pairwise comparisons. Defaults to True. | True |
valid_postcode_regex | str | Regular expression pattern used to validate postcodes. If invalid_postcodes_as_null is True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to "^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$". | '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$' |
term_frequency_adjustments_full | bool | If True, apply term frequency adjustments to the full postcode exact match level. Defaults to False. | False |
include_full_match_level | bool | If True, include an exact match on full postcode level. Defaults to True. | True |
include_sector_match_level | bool | If True, include an exact match on sector level. Defaults to True. | True |
include_district_match_level | bool | If True, include an exact match on district level. Defaults to True. | True |
include_area_match_level | bool | If True, include an exact match on area level. Defaults to True. | True |
include_distance_in_km_level | bool | If True, include a comparison of distance between postcodes as measured in kilometers. Defaults to False. This parameter is not listed in the __init__ signature above. | False |
lat_col | str | The name of a latitude column, or of the array or struct column containing the information plus an index. For example: lat, long_lat['lat'] or long_lat[0]. | None |
long_col | str | The name of a longitude column, or of the array or struct column containing the information plus an index. For example: long, long_lat['long'] or long_lat[1]. | None |
km_thresholds | int, float, list | The total distance in kilometers to evaluate the distance_in_km_level comparison against. | [] |
m_probability_full_match | float | Starting m probability for the full match level. Defaults to None. | None |
m_probability_sector_match | float | Starting m probability for the sector match level. Defaults to None. | None |
m_probability_district_match | float | Starting m probability for the district match level. Defaults to None. | None |
m_probability_area_match | float | Starting m probability for the area match level. Defaults to None. | None |
m_probability_or_probabilities_km_distance | float | Starting m probability for the 'distance in km' level. Defaults to None. | None |
m_probability_else | float | Starting m probability for the 'everything else' level. Defaults to None. | None |
Examples:
DuckDB
Basic Postcode Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.postcode_comparison("postcode")
Bespoke Postcode Comparison
import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.postcode_comparison(
    "postcode",
    invalid_postcodes_as_null=True,
    lat_col="lat",
    long_col="long",
    km_thresholds=[10, 100],
)
Spark
Basic Postcode Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.postcode_comparison("postcode")
Bespoke Postcode Comparison
import splink.spark.spark_comparison_template_library as ctl
ctl.postcode_comparison(
    "postcode",
    invalid_postcodes_as_null=True,
    lat_col="lat",
    long_col="long",
    km_thresholds=[10, 100],
)
Athena
Basic Postcode Comparison
import splink.athena.athena_comparison_template_library as ctl
ctl.postcode_comparison("postcode")
Bespoke Postcode Comparison
import splink.athena.athena_comparison_template_library as ctl
ctl.postcode_comparison(
    "postcode",
    invalid_postcodes_as_null=True,
    lat_col="lat",
    long_col="long",
    km_thresholds=[10, 100],
)
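The km_thresholds, lat_col and long_col parameters describe a comparison on the physical distance between two records. A common way to compute such a distance from latitude and longitude is the haversine formula, sketched below in plain Python. This is an assumption for illustration: the page does not say which formula the SQL backends use, and the function names and the 6371 km Earth radius are choices made for this example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def km_level(distance_km, km_thresholds=(10, 100)):
    """Return the smallest threshold the distance falls within, else None."""
    for threshold in sorted(km_thresholds):
        if distance_km <= threshold:
            return threshold
    return None


# Approximate coordinates for central London and central Paris
distance = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(round(distance, 1), km_level(distance))
```

With thresholds of 10 and 100 km, two addresses in the same town would fall in the first level, while a pair as far apart as these two cities would fall through to the remaining levels.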
Returns:
Name | Type | Description |
---|---|---|
Comparison | Comparison | A comparison that can be included in the Splink settings dictionary. |