Documentation for comparison_level_library
The comparison_level_library contains pre-made comparison levels that can be used to construct custom comparisons, as described in this topic guide.
However, not every comparison level is available for every Splink-compatible SQL backend.
The pre-made Splink comparison levels available for each SQL dialect are listed in this table:
comparison level | duckdb | spark | athena | sqlite |
---|---|---|---|---|
array_intersect_level | ✓ | ✓ | ✓ |  |
columns_reversed_level | ✓ | ✓ | ✓ | ✓ |
damerau_levenshtein_level | ✓ | ✓ |  |  |
datediff_level | ✓ | ✓ |  |  |
distance_function_level | ✓ | ✓ | ✓ | ✓ |
distance_in_km_level | ✓ | ✓ | ✓ |  |
else_level | ✓ | ✓ | ✓ | ✓ |
exact_match_level | ✓ | ✓ | ✓ | ✓ |
jaccard_level | ✓ | ✓ |  |  |
jaro_level | ✓ | ✓ |  |  |
jaro_winkler_level | ✓ | ✓ |  |  |
levenshtein_level | ✓ | ✓ | ✓ | ✓ |
null_level | ✓ | ✓ | ✓ | ✓ |
percentage_difference_level | ✓ | ✓ | ✓ | ✓ |
The detailed API for each of these is outlined below.
Library comparison level APIs
splink.comparison_level_library.NullLevelBase
Bases: ComparisonLevel
__init__(col_name, valid_string_regex=None)
Represents a comparison level where one or both sides of the comparison contain null values, so the similarity cannot be evaluated. This level is assumed to have a partial match weight of zero (no effect on the overall match weight).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
valid_string_regex | str | Regular expression pattern; a value that does not match it is treated as null. | None |
Examples:
Simple null comparison level
import splink.duckdb.duckdb_comparison_level_library as cll
cll.null_level("name")
import splink.duckdb.duckdb_comparison_level_library as cll
cll.null_level("name", valid_string_regex="^[A-Z]{1,7}$")
Simple null level
import splink.spark.spark_comparison_level_library as cll
cll.null_level("name")
import splink.spark.spark_comparison_level_library as cll
cll.null_level("name", valid_string_regex="^[A-Z]{1,7}$")
Simple null level
import splink.athena.athena_comparison_level_library as cll
cll.null_level("name")
import splink.athena.athena_comparison_level_library as cll
cll.null_level("name", valid_string_regex="^[A-Z]{1,7}$")
Simple null level
import splink.sqlite.sqlite_comparison_level_library as cll
cll.null_level("name")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | Comparison level for null entries |
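The effect of valid_string_regex can be pictured with a small plain-Python sketch. This is not Splink's implementation (Splink generates SQL); the helper names `is_treated_as_null` and `falls_in_null_level` are hypothetical, and reading "not matched" as "no match found anywhere in the string" is an assumption.

```python
import re
from typing import Optional


def is_treated_as_null(value: Optional[str], valid_string_regex: Optional[str] = None) -> bool:
    """Return True if the value counts as null under the assumed semantics."""
    if value is None:
        return True
    if valid_string_regex is None:
        return False
    # Assumption: a value is invalid when the pattern matches nowhere in it.
    return re.search(valid_string_regex, value) is None


def falls_in_null_level(left: Optional[str], right: Optional[str],
                        valid_string_regex: Optional[str] = None) -> bool:
    """A record pair lands in the null level if either side is (treated as) null."""
    return (is_treated_as_null(left, valid_string_regex)
            or is_treated_as_null(right, valid_string_regex))
```

With the pattern from the examples above, "smith" would be treated as null because it contains no run of uppercase letters spanning the whole string.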
splink.comparison_level_library.ExactMatchLevelBase
Bases: ComparisonLevel
__init__(col_name, regex_extract=None, set_to_lowercase=False, m_probability=None, term_frequency_adjustments=False, include_colname_in_charts_label=False, manual_chart_label=None)
Represents a comparison level where there is an exact match.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. Defaults to False. | False |
include_colname_in_charts_label | bool | If True, include col_name in chart labels (e.g. linker.match_weights_chart()) | False |
manual_chart_label | str | String to include in the chart label. Setting it to col_name would recreate the behaviour of include_colname_in_charts_label=True. | None |
Examples:
Simple Exact match level
import splink.duckdb.duckdb_comparison_level_library as cll
cll.exact_match_level("name")
import splink.duckdb.duckdb_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)
import splink.duckdb.duckdb_comparison_level_library as cll
cll.exact_match_level("name", regex_extract="^[A-Z]{1,4}")
Simple Exact match level
import splink.spark.spark_comparison_level_library as cll
cll.exact_match_level("name")
import splink.spark.spark_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)
import splink.spark.spark_comparison_level_library as cll
cll.exact_match_level("name", regex_extract="^[A-Z]{1,4}")
Simple Exact match level
import splink.athena.athena_comparison_level_library as cll
cll.exact_match_level("name")
import splink.athena.athena_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)
import splink.athena.athena_comparison_level_library as cll
cll.exact_match_level("name", regex_extract="^[A-Z]{1,4}")
Simple Exact match level
import splink.sqlite.sqlite_comparison_level_library as cll
cll.exact_match_level("name")
import splink.sqlite.sqlite_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)
splink.comparison_level_library.ElseLevelBase
Bases: ComparisonLevel
__init__(m_probability=None)
Represents a comparison level for all cases which have not been considered by preceding comparison levels.
Examples:
import splink.duckdb.duckdb_comparison_level_library as cll
cll.else_level()
import splink.spark.spark_comparison_level_library as cll
cll.else_level()
import splink.athena.athena_comparison_level_library as cll
cll.else_level()
import splink.sqlite.sqlite_comparison_level_library as cll
cll.else_level()
splink.comparison_level_library.DistanceFunctionLevelBase
Bases: ComparisonLevel
__init__(col_name, distance_function_name, distance_threshold, regex_extract=None, set_to_lowercase=False, higher_is_more_similar=True, include_colname_in_charts_label=False, m_probability=None)
Represents a comparison level using a user-provided distance function, where the similarity is assessed by comparing the function's output against distance_threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_function_name | str | The name of the distance function | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
higher_is_more_similar | bool | If True, a higher value of the distance function indicates a higher similarity (e.g. jaro_winkler). If False, a higher value indicates a lower similarity (e.g. levenshtein). | True |
include_colname_in_charts_label | bool | If True, includes col_name in charts label | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
Apply the levenshtein function to a comparison level
import splink.duckdb.duckdb_comparison_level_library as cll
cll.distance_function_level("name", "levenshtein", 2, higher_is_more_similar=False)
Apply the levenshtein function to a comparison level
import splink.spark.spark_comparison_level_library as cll
cll.distance_function_level("name", "levenshtein", 2, higher_is_more_similar=False)
Apply the levenshtein_distance function to a comparison level
import splink.athena.athena_comparison_level_library as cll
cll.distance_function_level("name", "levenshtein_distance", 2, higher_is_more_similar=False)
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level for a given distance function |
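How distance_threshold and higher_is_more_similar interact can be sketched in plain Python. The function name `meets_threshold` is hypothetical, and treating the comparison as inclusive in both directions is an assumption (the levenshtein examples on this page describe "less than (or equal to)", the other direction is not spelled out).

```python
def meets_threshold(score: float, distance_threshold: float,
                    higher_is_more_similar: bool = True) -> bool:
    """Decide whether a distance/similarity score falls inside the level."""
    if higher_is_more_similar:
        # Similarity-style functions (e.g. jaro_winkler): bigger is better.
        return score >= distance_threshold
    # Distance-style functions (e.g. levenshtein): smaller is better.
    return score <= distance_threshold
```

This is why the levenshtein examples above need higher_is_more_similar=False: with the default of True, a distance of 0 (identical strings) would fail a threshold of 2.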
splink.comparison_level_library.LevenshteinLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)
Represents a comparison level using a levenshtein distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
include_colname_in_charts_label | bool | If True, includes col_name in charts label | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
Comparison level with levenshtein distance score less than (or equal to) 1
import splink.duckdb.duckdb_comparison_level_library as cll
cll.levenshtein_level("name", 1)
Comparison level with levenshtein distance score less than (or equal to) 1 on a substring of the name column, as determined by a regular expression.
import splink.duckdb.duckdb_comparison_level_library as cll
cll.levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")
Comparison level with levenshtein distance score less than (or equal to) 1
import splink.spark.spark_comparison_level_library as cll
cll.levenshtein_level("name", 1)
Comparison level with levenshtein distance score less than (or equal to) 1 on a substring of the name column, as determined by a regular expression.
import splink.spark.spark_comparison_level_library as cll
cll.levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")
Comparison level with levenshtein distance score less than (or equal to) 1
import splink.athena.athena_comparison_level_library as cll
cll.levenshtein_level("name", 1)
Comparison level with levenshtein distance score less than (or equal to) 1 on a substring of the name column, as determined by a regular expression.
import splink.athena.athena_comparison_level_library as cll
cll.levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the levenshtein similarity |
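For readers unfamiliar with the metric, here is a minimal pure-Python sketch of levenshtein distance and of the "less than (or equal to)" test described in the examples above. It is illustrative only: in Splink the distance is computed by the SQL backend, and the helper names here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def in_levenshtein_level(left: str, right: str, distance_threshold: int) -> bool:
    """True when the edit distance is at most the threshold."""
    return levenshtein(left, right) <= distance_threshold
```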
splink.comparison_level_library.DamerauLevenshteinLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)
Represents a comparison level using a damerau-levenshtein distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
include_colname_in_charts_label | bool | If True, includes col_name in charts label | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
Comparison level with damerau-levenshtein distance score less than (or equal to) 1
import splink.duckdb.duckdb_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1)
Comparison level with damerau-levenshtein distance score less than (or equal to) 1 on a substring of the name column, as determined by a regular expression.
import splink.duckdb.duckdb_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")
Comparison level with damerau-levenshtein distance score less than (or equal to) 1
import splink.spark.spark_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1)
Comparison level with damerau-levenshtein distance score less than (or equal to) 1 on a substring of the name column, as determined by a regular expression.
import splink.spark.spark_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the Damerau-Levenshtein similarity |
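The difference from plain levenshtein is that swapping two adjacent characters counts as a single edit. The sketch below implements the restricted ("optimal string alignment") variant in pure Python. Which variant a given backend implements is not stated here, so treat that choice, and the function name `osa_distance`, as assumptions.

```python
def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance extended with adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent characters swapped: count as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For example, "smith" and "smiht" are one edit apart under this metric, whereas plain levenshtein would count two substitutions.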
splink.comparison_level_library.JaroLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)
Represents a comparison level using the jaro distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
include_colname_in_charts_label | bool | If True, includes col_name in charts label | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
Comparison level with jaro score greater than 0.9
import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_level("name", 0.9)
import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_level("name", 0.9, regex_extract="^[A-Z]{1,4}")
Comparison level with jaro score greater than 0.9
import splink.spark.spark_comparison_level_library as cll
cll.jaro_level("name", 0.9)
import splink.spark.spark_comparison_level_library as cll
cll.jaro_level("name", 0.9, regex_extract="^[A-Z]{1,4}")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the jaro similarity |
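To give a feel for what a threshold such as 0.9 means, here is a pure-Python sketch of the textbook jaro similarity. Backend implementations may differ in edge cases, and the function name `jaro` is simply chosen for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are within this many positions of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3
```

Under this definition "MARTHA" and "MARHTA" score roughly 0.944, so they would fall inside a level with threshold 0.9.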
splink.comparison_level_library.JaroWinklerLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)
Represents a comparison level using the jaro winkler distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
include_colname_in_charts_label | bool | If True, includes col_name in charts label | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
Comparison level with jaro-winkler score greater than 0.9
import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9)
import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9, regex_extract="^[A-Z]{1,4}")
Comparison level with jaro-winkler score greater than 0.9
import splink.spark.spark_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9)
import splink.spark.spark_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9, regex_extract="^[A-Z]{1,4}")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the jaro winkler similarity |
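Jaro-winkler adjusts a jaro score upwards when the two strings share a common prefix. The sketch below shows only that adjustment, taking an already computed jaro score as input. The function name is hypothetical, and the scaling factor of 0.1 and prefix cap of 4 are the conventional textbook values, assumed here rather than taken from any particular backend.

```python
def jaro_winkler_from_jaro(jaro_score: float, s1: str, s2: str,
                           scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix boost to an existing jaro score."""
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    # The boost grows with the shared prefix and shrinks as the score nears 1.
    return jaro_score + prefix * scaling * (1 - jaro_score)
```

For "MARTHA" and "MARHTA" (jaro score 17/18, shared prefix "MAR") this gives roughly 0.961.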
splink.comparison_level_library.JaccardLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)
Represents a comparison level using a jaccard distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
include_colname_in_charts_label | bool | If True, includes col_name in charts label | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
Comparison level with jaccard score greater than 0.9
import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaccard_level("name", 0.9)
import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaccard_level("name", 0.9, regex_extract="^[A-Z]{1,4}")
Comparison level with jaccard score greater than 0.9
import splink.spark.spark_comparison_level_library as cll
cll.jaccard_level("name", 0.9)
import splink.spark.spark_comparison_level_library as cll
cll.jaccard_level("name", 0.9, regex_extract="^[A-Z]{1,4}")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the jaccard similarity |
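Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the sets of characters in each string. That tokenisation is an assumption (a backend could equally work on n-grams or words), as is returning 0.0 when both strings are empty.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both strings empty: the ratio is undefined, 0.0 is an arbitrary choice.
        return 0.0
    return len(set_a & set_b) / len(union)
```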
splink.comparison_level_library.ColumnsReversedLevelBase
Bases: ComparisonLevel
__init__(col_name_1, col_name_2, regex_extract=None, set_to_lowercase=False, m_probability=None, tf_adjustment_column=None)
Represents a comparison level where the columns are reversed. For example, if surname is in the forename field and vice versa.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name_1 | str | First column, e.g. forename | required |
col_name_2 | str | Second column, e.g. surname | required |
regex_extract | str | Regular expression pattern to evaluate a match on. | None |
set_to_lowercase | bool | If True, sets all entries to lowercase. | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
tf_adjustment_column | str | Column to use for term frequency adjustments if an exact match is observed. Defaults to None. | None |
Examples:
Comparison level on first_name and surname columns reversed
import splink.duckdb.duckdb_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")
import splink.duckdb.duckdb_comparison_level_library as cll
cll.columns_reversed_level("first_name",
"surname",
regex_extract="^[A-Z]{1,4}")
import splink.spark.spark_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")
import splink.spark.spark_comparison_level_library as cll
cll.columns_reversed_level("first_name",
"surname",
regex_extract="^[A-Z]{1,4}")
import splink.athena.athena_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")
import splink.athena.athena_comparison_level_library as cll
cll.columns_reversed_level("first_name",
"surname",
regex_extract="^[A-Z]{1,4}")
import splink.sqlite.sqlite_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the exact match of two columns. |
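The condition this level tests can be written out for a single pair of records. The sketch below uses plain dictionaries in place of table rows; the function name and the choice to return False when any value is missing are assumptions made for illustration.

```python
from typing import Mapping, Optional


def columns_reversed(left: Mapping[str, Optional[str]],
                     right: Mapping[str, Optional[str]],
                     col_name_1: str, col_name_2: str) -> bool:
    """True when col_name_1 on one record equals col_name_2 on the other, both ways."""
    l1, l2 = left.get(col_name_1), left.get(col_name_2)
    r1, r2 = right.get(col_name_1), right.get(col_name_2)
    if None in (l1, l2, r1, r2):
        # Assumption: missing values never count as a reversed match.
        return False
    return l1 == r2 and l2 == r1
```

For instance, a record with first_name "John" and surname "Smith" would be caught against one with first_name "Smith" and surname "John".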
splink.comparison_level_library.DistanceInKMLevelBase
Bases: ComparisonLevel
__init__(lat_col, long_col, km_threshold, not_null=False, m_probability=None)
Uses the haversine formula to transform comparisons of latitude/longitude pairs into distances measured in kilometers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lat_col | str | The name of a latitude column, or of the array or struct column containing the latitude, plus an index. For example: long_lat['lat'] or long_lat[0] | required |
long_col | str | The name of a longitude column, or of the array or struct column containing the longitude, plus an index. For example: long_lat['long'] or long_lat[1] | required |
km_threshold | int | The total distance in kilometers to evaluate your comparisons against | required |
not_null | bool | If True, exclude null values. This is only necessary if you are not capturing nulls elsewhere in your comparison level. | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
import splink.duckdb.duckdb_comparison_level_library as cll
cll.distance_in_km_level("lat_col",
"long_col",
km_threshold=5)
import splink.spark.spark_comparison_level_library as cll
cll.distance_in_km_level("lat_col",
"long_col",
km_threshold=5)
import splink.athena.athena_comparison_level_library as cll
cll.distance_in_km_level("lat_col",
"long_col",
km_threshold=5)
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the distance between two coordinates |
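The haversine formula mentioned above can be sketched in a few lines of Python. This is a standalone illustration rather than the SQL Splink generates: the function names are hypothetical, the Earth radius of 6371 km is the commonly used mean value, and treating km_threshold as inclusive is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(lat1: float, lon1: float, lat2: float, lon2: float,
              km_threshold: float) -> bool:
    """True when the two points are no further apart than km_threshold."""
    return haversine_km(lat1, lon1, lat2, lon2) <= km_threshold
```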
splink.comparison_level_library.PercentageDifferenceLevelBase
Bases: ComparisonLevel
__init__(col_name, percentage_distance_threshold, m_probability=None)
Represents a comparison level based around the percentage difference between two numbers.
Note: the percentage is calculated by dividing the absolute difference between the values by the larger of the two values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
percentage_distance_threshold | float | Percentage difference threshold for the comparison level | required |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Examples:
import splink.duckdb.duckdb_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)
import splink.spark.spark_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)
import splink.athena.athena_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)
import splink.sqlite.sqlite_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the percentage difference between two values |
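The calculation described in the note above (absolute difference divided by the largest value) is simple enough to write out. In this sketch the function names are hypothetical, inputs are assumed to be non-negative, and using a strict "less than" against the threshold is an assumption.

```python
def percentage_difference(a: float, b: float) -> float:
    """Absolute difference divided by the larger value (non-negative inputs assumed)."""
    largest = max(a, b)
    if largest == 0:
        # Both values are zero, so there is no difference to measure.
        return 0.0
    return abs(a - b) / largest


def in_percentage_level(a: float, b: float, percentage_distance_threshold: float) -> bool:
    """True when the percentage difference is below the threshold."""
    return percentage_difference(a, b) < percentage_distance_threshold
```

So with a threshold of 0.5, as in the examples above, values of 100 and 80 (a difference of 0.2) would fall inside the level, while 100 and 40 (a difference of 0.6) would not.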
splink.comparison_level_library.ArrayIntersectLevelBase
Bases: ComparisonLevel
__init__(col_name, m_probability=None, term_frequency_adjustments=False, min_intersection=1, include_colname_in_charts_label=False)
Represents a comparison level based around the size of an intersection of arrays.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
m_probability | float | Starting value for m probability. Defaults to None. | None |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. Defaults to False. | False |
min_intersection | int | The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1 | 1 |
include_colname_in_charts_label | bool | Should the charts label contain the column name? Defaults to False | False |
Examples:
import splink.duckdb.duckdb_comparison_level_library as cll
cll.array_intersect_level("name")
import splink.spark.spark_comparison_level_library as cll
cll.array_intersect_level("name")
import splink.athena.athena_comparison_level_library as cll
cll.array_intersect_level("name")
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates the size of intersection of arrays |
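What min_intersection controls can be shown with Python lists standing in for array columns. The function name is hypothetical, and counting distinct shared elements (rather than duplicates) is an assumption about how the intersection is sized.

```python
from typing import Hashable, Sequence


def in_array_intersect_level(left: Sequence[Hashable], right: Sequence[Hashable],
                             min_intersection: int = 1) -> bool:
    """True when the arrays share at least min_intersection distinct elements."""
    shared = set(left) & set(right)
    return len(shared) >= min_intersection
```

With the default of 1, any single shared element is enough; raising min_intersection makes the level stricter.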
splink.comparison_level_library.DateDiffLevelBase
Bases: ComparisonLevel
__init__(date_col, date_threshold, date_metric='day', m_probability=None, cast_strings_to_date=False, date_format=None)
Represents a comparison level based around the difference between dates within a column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
date_col | str | Input column name | required |
date_threshold | int | The total difference in time between two given dates. This is used in tandem with date_metric. | required |
date_metric | str | The unit of time with which to measure your date_threshold, for example 'day', 'month' or 'year'. | 'day' |
m_probability | float | Starting value for m probability. Defaults to None. | None |
cast_strings_to_date | bool | Set to True and adjust the date_format param when input dates are strings to enable date-casting. Defaults to False. | False |
date_format | str | Format of input dates if date-strings are given. Must be consistent across record pairs. If None (the default), downstream functions for each backend assign date_format to ISO 8601 format (yyyy-mm-dd). | None |
Examples:
Date Difference comparison level at threshold 1 year
import splink.duckdb.duckdb_comparison_level_library as cll
cll.datediff_level("date",
date_threshold=1,
date_metric="year"
)
import splink.duckdb.duckdb_comparison_level_library as cll
cll.datediff_level("dob",
date_threshold=3,
date_metric='month',
cast_strings_to_date=True
)
import splink.duckdb.duckdb_comparison_level_library as cll
cll.datediff_level("dob",
date_threshold=3,
date_metric='month',
cast_strings_to_date=True,
date_format='%d/%m/%Y'
)
Date Difference comparison level at threshold 1 year
import splink.spark.spark_comparison_level_library as cll
cll.datediff_level("date",
date_threshold=1,
date_metric="year"
)
import splink.spark.spark_comparison_level_library as cll
cll.datediff_level("dob",
date_threshold=3,
date_metric='month',
cast_strings_to_date=True
)
import splink.spark.spark_comparison_level_library as cll
cll.datediff_level("dob",
date_threshold=3,
date_metric='month',
cast_strings_to_date=True,
date_format='%d/%m/%Y'
)
Returns:
Name | Type | Description |
---|---|---|
ComparisonLevel | ComparisonLevel | A comparison level that evaluates whether two dates fall within a given interval. |
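The roles of date_threshold, cast_strings_to_date and date_format can be illustrated with Python's datetime module. The sketch covers only the 'day' metric, since how months and years are counted is backend-specific. The helper names are hypothetical, and using an absolute, inclusive difference is an assumption.

```python
from datetime import date, datetime


def parse_date(value: str, date_format: str = "%Y-%m-%d") -> date:
    """Cast a date-string to a date; the default format is ISO 8601 (yyyy-mm-dd)."""
    return datetime.strptime(value, date_format).date()


def within_days(left: date, right: date, date_threshold: int) -> bool:
    """True when the two dates are at most date_threshold days apart."""
    return abs((left - right).days) <= date_threshold
```

A string such as "01/02/2000" would need date_format='%d/%m/%Y', as in the examples above, before it can be compared with an ISO-formatted date.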