
Documentation for comparison_level_library

The comparison_level_library contains pre-made comparison levels for constructing custom comparisons, as described in this topic guide. However, not every comparison level is available for every Splink-compatible SQL backend.

The pre-made Splink comparison levels available for each SQL dialect are as given in this table:

duckdb spark athena sqlite
array_intersect_level
columns_reversed_level
damerau_levenshtein_level
datediff_level
distance_function_level
distance_in_km_level
else_level
exact_match_level
jaccard_level
jaro_level
jaro_winkler_level
levenshtein_level
null_level
percentage_difference_level

The detailed API for each of these is outlined below.

Library comparison level APIs

splink.comparison_level_library.NullLevelBase

Bases: ComparisonLevel

__init__(col_name, valid_string_regex=None)

Represents a comparison level where one or both sides of the comparison contain null values, so the similarity cannot be evaluated. Assumed to have a partial match weight of zero (no effect on the overall match weight).

Parameters:

Name Type Description Default
col_name str

Input column name

required
valid_string_regex str

Regular expression pattern; a value that does not match it is treated as null.

None

Examples:

Simple null comparison level

import splink.duckdb.duckdb_comparison_level_library as cll
cll.null_level("name")

Null comparison level including strings that do not match a given regex pattern

import splink.duckdb.duckdb_comparison_level_library as cll
cll.null_level("name", valid_string_regex="^[A-Z]{1,7}$")

Simple null level

import splink.spark.spark_comparison_level_library as cll
cll.null_level("name")

Null comparison level including strings that do not match a given regex pattern

import splink.spark.spark_comparison_level_library as cll
cll.null_level("name", valid_string_regex="^[A-Z]{1,7}$")

Simple null level

import splink.athena.athena_comparison_level_library as cll
cll.null_level("name")

Null comparison level including strings that do not match a given regex pattern

import splink.athena.athena_comparison_level_library as cll
cll.null_level("name", valid_string_regex="^[A-Z]{1,7}$")

Simple null level

import splink.sqlite.sqlite_comparison_level_library as cll
cll.null_level("name")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

Comparison level for null entries
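
The null logic described above can be sketched in plain Python. This is only an illustration, not Splink's implementation (Splink generates SQL for the chosen backend). The function name is_null_pair is hypothetical, and it is an assumption that the pattern is searched for within the string rather than required to match it in full.

```python
import re
from typing import Optional


def is_null_pair(value_l: Optional[str],
                 value_r: Optional[str],
                 valid_string_regex: Optional[str] = None) -> bool:
    """Return True when the pair would fall into the null level."""

    def treated_as_null(value: Optional[str]) -> bool:
        if value is None:
            return True
        if valid_string_regex is None:
            return False
        # Assumption: a value with no regex match is treated as null
        return re.search(valid_string_regex, value) is None

    # One null side is enough for the pair to be classed as null
    return treated_as_null(value_l) or treated_as_null(value_r)


print(is_null_pair("SMITH", None))
print(is_null_pair("SMITH", "smith", valid_string_regex="^[A-Z]{1,7}$"))
print(is_null_pair("SMITH", "JONES", valid_string_regex="^[A-Z]{1,7}$"))
```

With the pattern ^[A-Z]{1,7}$, the lowercase value "smith" fails validation, so the second pair is treated as null even though neither value is missing.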


splink.comparison_level_library.ExactMatchLevelBase

Bases: ComparisonLevel

__init__(col_name, regex_extract=None, set_to_lowercase=False, m_probability=None, term_frequency_adjustments=False, include_colname_in_charts_label=False, manual_chart_label=None)

Represents a comparison level where there is an exact match.

Parameters:

Name Type Description Default
col_name str

Input column name

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
m_probability float

Starting value for m probability. Defaults to None.

None
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
include_colname_in_charts_label bool

If True, include col_name in chart labels (e.g. linker.match_weights_chart())

False
manual_chart_label str

String to include in the chart label. Setting it to col_name would recreate the behaviour of include_colname_in_charts_label=True.

None

Examples:

Simple Exact match level

import splink.duckdb.duckdb_comparison_level_library as cll
cll.exact_match_level("name")

Exact match level with term-frequency adjustments

import splink.duckdb.duckdb_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)

Exact match level on a substring of col_name as determined by a regular expression

import splink.duckdb.duckdb_comparison_level_library as cll
cll.exact_match_level("name", regex_extract="^[A-Z]{1,4}")

Simple Exact match level

import splink.spark.spark_comparison_level_library as cll
cll.exact_match_level("name")

Exact match level with term-frequency adjustments

import splink.spark.spark_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)

Exact match level on a substring of col_name as determined by a regular expression

import splink.spark.spark_comparison_level_library as cll
cll.exact_match_level("name", regex_extract="^[A-Z]{1,4}")

Simple Exact match level

import splink.athena.athena_comparison_level_library as cll
cll.exact_match_level("name")

Exact match level with term-frequency adjustments

import splink.athena.athena_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)

Exact match level on a substring of col_name as determined by a regular expression

import splink.athena.athena_comparison_level_library as cll
cll.exact_match_level("name", regex_extract="^[A-Z]{1,4}")

Simple Exact match level

import splink.sqlite.sqlite_comparison_level_library as cll
cll.exact_match_level("name")

Exact match level with term-frequency adjustments

import splink.sqlite.sqlite_comparison_level_library as cll
cll.exact_match_level("name", term_frequency_adjustments=True)
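
To illustrate how regex_extract and set_to_lowercase change what counts as an exact match, here is a minimal plain-Python sketch. It is not Splink's implementation: the helper names are hypothetical, and both the order of the two steps (extract, then lowercase) and the empty-string result for a non-matching value are assumptions.

```python
import re
from typing import Optional


def normalise(value: str,
              regex_extract: Optional[str] = None,
              set_to_lowercase: bool = False) -> str:
    """Apply the optional transformations before values are compared."""
    if regex_extract is not None:
        match = re.search(regex_extract, value)
        # Assumption: a value with no match is reduced to an empty string
        value = match.group(0) if match else ""
    if set_to_lowercase:
        value = value.lower()
    return value


def is_exact_match(value_l: str,
                   value_r: str,
                   regex_extract: Optional[str] = None,
                   set_to_lowercase: bool = False) -> bool:
    """Compare the two values after normalisation."""
    return (normalise(value_l, regex_extract, set_to_lowercase)
            == normalise(value_r, regex_extract, set_to_lowercase))


# Only the first four capital letters are compared, so these agree
print(is_exact_match("SMITH", "SMITHSON", regex_extract="^[A-Z]{1,4}"))
# Case differences disappear once both sides are lowercased
print(is_exact_match("Smith", "SMITH", set_to_lowercase=True))
```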


splink.comparison_level_library.ElseLevelBase

Bases: ComparisonLevel

__init__(m_probability=None)

Represents a comparison level for all cases which have not been considered by preceding comparison levels.

Examples:

import splink.duckdb.duckdb_comparison_level_library as cll
cll.else_level()

import splink.spark.spark_comparison_level_library as cll
cll.else_level()

import splink.athena.athena_comparison_level_library as cll
cll.else_level()

import splink.sqlite.sqlite_comparison_level_library as cll
cll.else_level()
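
The phrase "not considered by preceding comparison levels" implies that levels are evaluated in order and the first one whose condition holds wins. The sketch below illustrates that idea in plain Python. The function assign_level and the three-level layout are hypothetical and are not part of Splink.

```python
def assign_level(value_l, value_r):
    """Return the label of the first level whose condition holds."""
    levels = [
        ("null", lambda l, r: l is None or r is None),
        ("exact_match", lambda l, r: l == r),
        # The else level has no condition of its own: it catches the rest
        ("else", lambda l, r: True),
    ]
    for label, condition in levels:
        if condition(value_l, value_r):
            return label


print(assign_level(None, "john"))    # null
print(assign_level("john", "john"))  # exact_match
print(assign_level("john", "jane"))  # else
```

Because the else level always matches, it only makes sense as the last level of a comparison.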

splink.comparison_level_library.DistanceFunctionLevelBase

Bases: ComparisonLevel

__init__(col_name, distance_function_name, distance_threshold, regex_extract=None, set_to_lowercase=False, higher_is_more_similar=True, include_colname_in_charts_label=False, m_probability=None)

Represents a comparison level using a user-provided distance function, where the similarity is assessed against a threshold.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_function_name str

The name of the distance function

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
higher_is_more_similar bool

If True, a higher value of the distance function indicates a higher similarity (e.g. jaro_winkler). If false, a higher value indicates a lower similarity (e.g. levenshtein).

True
include_colname_in_charts_label bool

If True, includes col_name in charts label

False
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

Apply the levenshtein function to a comparison level

import splink.duckdb.duckdb_comparison_level_library as cll
cll.distance_function_level("name",
                            "levenshtein",
                            2,
                            higher_is_more_similar=False)

Apply the levenshtein function to a comparison level

import splink.spark.spark_comparison_level_library as cll
cll.distance_function_level("name",
                            "levenshtein",
                            2,
                            higher_is_more_similar=False)

Apply the levenshtein_distance function to a comparison level

import splink.athena.athena_comparison_level_library as cll
cll.distance_function_level("name",
                            "levenshtein_distance",
                            2,
                            higher_is_more_similar=False)

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level for a given distance function
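
The role of higher_is_more_similar can be illustrated with a small plain-Python sketch. The function passes_threshold is hypothetical, and treating the threshold as inclusive in both directions is an assumption, not a statement about the SQL Splink generates.

```python
def passes_threshold(score: float,
                     distance_threshold: float,
                     higher_is_more_similar: bool = True) -> bool:
    """Decide whether a score from a distance function meets the threshold."""
    if higher_is_more_similar:
        # Similarity-style functions (e.g. jaro_winkler): larger is better
        return score >= distance_threshold
    # Distance-style functions (e.g. levenshtein): smaller is better
    return score <= distance_threshold


print(passes_threshold(0.93, 0.9))                         # True
print(passes_threshold(3, 2, higher_is_more_similar=False))  # False
```

Because regex_extract and set_to_lowercase come before higher_is_more_similar in the signature, passing the flag by keyword avoids assigning it to the wrong parameter.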


splink.comparison_level_library.LevenshteinLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)

Represents a comparison level using a levenshtein distance function.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
include_colname_in_charts_label bool

If True, includes col_name in charts label

False
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

Comparison level with levenshtein distance score less than (or equal to) 1

import splink.duckdb.duckdb_comparison_level_library as cll
cll.levenshtein_level("name", 1)

Comparison level with levenshtein distance score less than (or equal to) 1 on a substring of the name column as determined by a regular expression.

import splink.duckdb.duckdb_comparison_level_library as cll
cll.levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")

Comparison level with levenshtein distance score less than (or equal to) 1

import splink.spark.spark_comparison_level_library as cll
cll.levenshtein_level("name", 1)

Comparison level with levenshtein distance score less than (or equal to) 1 on a substring of the name column as determined by a regular expression.

import splink.spark.spark_comparison_level_library as cll
cll.levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")

Comparison level with levenshtein distance score less than (or equal to) 1

import splink.athena.athena_comparison_level_library as cll
cll.levenshtein_level("name", 1)

Comparison level with levenshtein distance score less than (or equal to) 1 on a substring of the name column as determined by a regular expression.

import splink.athena.athena_comparison_level_library as cll
cll.levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the levenshtein similarity
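
For readers unfamiliar with the metric, the following is a minimal plain-Python sketch of the standard Levenshtein edit distance and of the threshold check a level like this applies. Each backend uses its own built-in SQL function, so this only illustrates the measure. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def within_levenshtein(a: str, b: str, distance_threshold: int) -> bool:
    """True when the edit distance is at or below the threshold."""
    return levenshtein(a, b) <= distance_threshold


print(levenshtein("smith", "smyth"))            # 1
print(within_levenshtein("smith", "smyth", 1))  # True
```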


splink.comparison_level_library.DamerauLevenshteinLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)

Represents a comparison level using a damerau-levenshtein distance function.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
include_colname_in_charts_label bool

If True, includes col_name in charts label

False
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

Comparison level with damerau-levenshtein distance score less than (or equal to) 1

import splink.duckdb.duckdb_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1)

Comparison level with damerau-levenshtein distance score less than (or equal to) 1 on a substring of the name column as determined by a regular expression.

import splink.duckdb.duckdb_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")

Comparison level with damerau-levenshtein distance score less than (or equal to) 1

import splink.spark.spark_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1)

Comparison level with damerau-levenshtein distance score less than (or equal to) 1 on a substring of the name column as determined by a regular expression.

import splink.spark.spark_comparison_level_library as cll
cll.damerau_levenshtein_level("name", 1, regex_extract="^[A-Z]{1,4}")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the Damerau-Levenshtein similarity
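
The difference from plain Levenshtein is that a swap of two adjacent characters counts as a single edit. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein in plain Python. It is an assumption that this variant is close enough for illustration, since a backend's own function may implement the unrestricted form. The function name is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where an adjacent transposition costs one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition, e.g. "ho" <-> "oh"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("jhon", "john"))  # 1 (plain Levenshtein would give 2)
```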


splink.comparison_level_library.JaroLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)

Represents a comparison level using the jaro distance function.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
include_colname_in_charts_label bool

If True, includes col_name in charts label

False
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

Comparison level with jaro score greater than 0.9

import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_level("name", 0.9)

Comparison level with a jaro score greater than 0.9 on a substring of the name column as determined by a regular expression.

import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_level("name", 0.9, regex_extract="^[A-Z]{1,4}")

Comparison level with jaro score greater than 0.9

import splink.spark.spark_comparison_level_library as cll
cll.jaro_level("name", 0.9)

Comparison level with a jaro score greater than 0.9 on a substring of the name column as determined by a regular expression.

import splink.spark.spark_comparison_level_library as cll
cll.jaro_level("name", 0.9, regex_extract="^[A-Z]{1,4}")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the jaro similarity
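
As background on the score being thresholded, here is a plain-Python sketch of the textbook Jaro similarity, which ranges from 0 (no similarity) to 1 (identical). It is illustrative only: the backends supply their own implementations, and the function name jaro is hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this many positions
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```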


splink.comparison_level_library.JaroWinklerLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)

Represents a comparison level using the jaro winkler distance function

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
include_colname_in_charts_label bool

If True, includes col_name in charts label

False
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

Comparison level with jaro-winkler score greater than 0.9

import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9)

Comparison level with jaro-winkler score greater than 0.9 on a substring of name column as determined by a regular expression.

import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9, regex_extract="^[A-Z]{1,4}")

Comparison level with jaro-winkler score greater than 0.9

import splink.spark.spark_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9)

Comparison level with jaro-winkler score greater than 0.9 on a substring of name column as determined by a regular expression.

import splink.spark.spark_comparison_level_library as cll
cll.jaro_winkler_level("name", 0.9, regex_extract="^[A-Z]{1,4}")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the jaro winkler similarity
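
Jaro-Winkler boosts a Jaro score when the two strings share a common prefix. The plain-Python sketch below applies the standard adjustment to a Jaro score that has already been computed. The function name is hypothetical, and the scaling factor of 0.1 and maximum prefix length of 4 are the conventional defaults, assumed here rather than taken from any backend.

```python
def jaro_winkler_from_jaro(jaro_score: float,
                           s1: str,
                           s2: str,
                           prefix_scale: float = 0.1,
                           max_prefix: int = 4) -> float:
    """Apply the Winkler prefix boost to an existing Jaro score."""
    prefix_length = 0
    for char_1, char_2 in zip(s1, s2):
        if char_1 != char_2 or prefix_length == max_prefix:
            break
        prefix_length += 1
    # The boost shrinks the remaining gap to 1 in proportion to the prefix
    return jaro_score + prefix_length * prefix_scale * (1 - jaro_score)


# The Jaro score of MARTHA / MARHTA is 17/18; the shared prefix is "MAR"
print(round(jaro_winkler_from_jaro(17 / 18, "MARTHA", "MARHTA"), 4))  # 0.9611
```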


splink.comparison_level_library.JaccardLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, regex_extract=None, set_to_lowercase=False, include_colname_in_charts_label=False, m_probability=None)

Represents a comparison level using a jaccard distance function

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
include_colname_in_charts_label bool

If True, includes col_name in charts label

False
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

Comparison level with jaccard score greater than 0.9

import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaccard_level("name", 0.9)

Comparison level with jaccard score greater than 0.9 on a substring of name column as determined by a regular expression.

import splink.duckdb.duckdb_comparison_level_library as cll
cll.jaccard_level("name", 0.9, regex_extract="^[A-Z]{1,4}")

Comparison level with jaccard score greater than 0.9

import splink.spark.spark_comparison_level_library as cll
cll.jaccard_level("name", 0.9)

Comparison level with jaccard score greater than 0.9 on a substring of name column as determined by a regular expression.

import splink.spark.spark_comparison_level_library as cll
cll.jaccard_level("name", 0.9, regex_extract="^[A-Z]{1,4}")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the jaccard similarity
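
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The plain-Python sketch below computes it over the sets of characters in each string. That choice of character sets, and the value returned for two empty strings, are assumptions made for illustration, and a backend's function may tokenise differently. The function name is hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Assumption: two empty strings are treated as identical
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Shared characters {n, h, t} out of seven distinct characters overall
print(round(jaccard("night", "nacht"), 4))  # 0.4286
```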


splink.comparison_level_library.ColumnsReversedLevelBase

Bases: ComparisonLevel

__init__(col_name_1, col_name_2, regex_extract=None, set_to_lowercase=False, m_probability=None, tf_adjustment_column=None)

Represents a comparison level where the columns are reversed, for example where the surname is in the forename field and vice versa.

Parameters:

Name Type Description Default
col_name_1 str

First column, e.g. forename

required
col_name_2 str

Second column, e.g. surname

required
regex_extract str

Regular expression pattern to evaluate a match on.

None
set_to_lowercase bool

If True, sets all entries to lowercase.

False
m_probability float

Starting value for m probability. Defaults to None.

None
tf_adjustment_column str

Column to use for term frequency adjustments if an exact match is observed. Defaults to None.

None

Examples:

Comparison level on first_name and surname columns reversed

import splink.duckdb.duckdb_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")

Comparison level on first_name and surname column reversed on a substring of each column as determined by a regular expression.

import splink.duckdb.duckdb_comparison_level_library as cll
cll.columns_reversed_level("first_name",
                           "surname",
                           regex_extract="^[A-Z]{1,4}")

import splink.spark.spark_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")

Comparison level on first_name and surname column reversed on a substring of each column as determined by a regular expression.

import splink.spark.spark_comparison_level_library as cll
cll.columns_reversed_level("first_name",
                           "surname",
                           regex_extract="^[A-Z]{1,4}")

import splink.athena.athena_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")

Comparison level on first_name and surname column reversed on a substring of each column as determined by a regular expression.

import splink.athena.athena_comparison_level_library as cll
cll.columns_reversed_level("first_name",
                           "surname",
                           regex_extract="^[A-Z]{1,4}")

import splink.sqlite.sqlite_comparison_level_library as cll
cll.columns_reversed_level("first_name", "surname")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the exact match of two columns.
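
The condition this level describes can be sketched in plain Python: the first column of one record equals the second column of the other, and vice versa. The function columns_reversed and the dictionary-based records are hypothetical illustrations and not part of Splink.

```python
def columns_reversed(record_l: dict, record_r: dict,
                     col_name_1: str, col_name_2: str) -> bool:
    """True when the two columns hold each other's values across the records."""
    return (record_l[col_name_1] == record_r[col_name_2]
            and record_l[col_name_2] == record_r[col_name_1])


left = {"first_name": "John", "surname": "Smith"}
right = {"first_name": "Smith", "surname": "John"}
print(columns_reversed(left, right, "first_name", "surname"))  # True
```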


splink.comparison_level_library.DistanceInKMLevelBase

Bases: ComparisonLevel

__init__(lat_col, long_col, km_threshold, not_null=False, m_probability=None)

Uses the haversine formula to transform comparisons of latitude/longitude pairs into distances measured in kilometers.

Parameters:

Name Type Description Default
lat_col str

The name of a latitude column, or the respective array or struct column containing the information, plus an index. For example: long_lat['lat'] or long_lat[0]

required
long_col str

The name of a longitude column, or the respective array or struct column containing the information, plus an index. For example: long_lat['long'] or long_lat[1]

required
km_threshold int

The total distance in kilometers to evaluate your comparisons against

required
not_null bool

If True, remove any nulls. This is only necessary if you are not capturing nulls elsewhere in your comparison level.

False
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

import splink.duckdb.duckdb_comparison_level_library as cll
cll.distance_in_km_level("lat_col",
                         "long_col",
                         km_threshold=5)

import splink.spark.spark_comparison_level_library as cll
cll.distance_in_km_level("lat_col",
                         "long_col",
                         km_threshold=5)

import splink.athena.athena_comparison_level_library as cll
cll.distance_in_km_level("lat_col",
                         "long_col",
                         km_threshold=5)

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the distance between two coordinates
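
The haversine formula mentioned above can be sketched in plain Python as follows. This illustrates the calculation only; Splink expresses it in SQL. The function names and the mean Earth radius of 6371 km are assumptions.

```python
import math


def haversine_km(lat_1: float, long_1: float,
                 lat_2: float, long_2: float,
                 radius_km: float = 6371.0) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi_1, phi_2 = math.radians(lat_1), math.radians(lat_2)
    delta_phi = phi_2 - phi_1
    delta_lambda = math.radians(long_2 - long_1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def within_km(lat_1, long_1, lat_2, long_2, km_threshold) -> bool:
    """True when the two points are no further apart than km_threshold."""
    return haversine_km(lat_1, long_1, lat_2, long_2) <= km_threshold


# Central London to central Paris is roughly 344 km
print(round(haversine_km(51.5074, -0.1278, 48.8566, 2.3522)))
print(within_km(51.5074, -0.1278, 48.8566, 2.3522, km_threshold=5))  # False
```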


splink.comparison_level_library.PercentageDifferenceLevelBase

Bases: ComparisonLevel

__init__(col_name, percentage_distance_threshold, m_probability=None)

Represents a comparison level based around the percentage difference between two numbers.

Note: the percentage is calculated by dividing the absolute difference between the values by the larger of the two values.

Parameters:

Name Type Description Default
col_name str

Input column name

required
percentage_distance_threshold float

Percentage difference threshold for the comparison level

required
m_probability float

Starting value for m probability. Defaults to None.

None

Examples:

import splink.duckdb.duckdb_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)

import splink.spark.spark_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)

import splink.athena.athena_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)

import splink.sqlite.sqlite_comparison_level_library as cll
cll.percentage_difference_level("value", 0.5)

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the percentage difference between two values
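
The calculation described in the note can be sketched in plain Python as below. The function names are hypothetical, the inputs are assumed to be non-negative, and both the strict comparison against the threshold and the handling of two zero values are assumptions made for illustration.

```python
def percentage_difference(value_l: float, value_r: float) -> float:
    """Absolute difference divided by the larger of the two values."""
    largest = max(value_l, value_r)
    if largest == 0:
        # Assumption: two zero values are treated as having no difference
        return 0.0
    return abs(value_l - value_r) / largest


def within_percentage(value_l: float, value_r: float,
                      percentage_distance_threshold: float) -> bool:
    """True when the percentage difference is below the threshold."""
    return percentage_difference(value_l, value_r) < percentage_distance_threshold


print(percentage_difference(100, 80))   # 0.2
print(within_percentage(100, 40, 0.5))  # False, the difference is 0.6
```

Note that the threshold is a proportion, so 0.5 corresponds to a 50% difference.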


splink.comparison_level_library.ArrayIntersectLevelBase

Bases: ComparisonLevel

__init__(col_name, m_probability=None, term_frequency_adjustments=False, min_intersection=1, include_colname_in_charts_label=False)

Represents a comparison level based around the size of an intersection of arrays

Parameters:

Name Type Description Default
col_name str

Input column name

required
m_probability float

Starting value for m probability. Defaults to None.

None
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
min_intersection int

The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1

1
include_colname_in_charts_label bool

Should the charts label contain the column name? Defaults to False

False

Examples:

import splink.duckdb.duckdb_comparison_level_library as cll
cll.array_intersect_level("name")

import splink.spark.spark_comparison_level_library as cll
cll.array_intersect_level("name")

import splink.athena.athena_comparison_level_library as cll
cll.array_intersect_level("name")

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates the size of intersection of arrays
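
The min_intersection parameter can be illustrated with a short plain-Python sketch: two arrays fall into the level when they share at least that many distinct elements. The function name is hypothetical, and counting distinct shared elements is an assumption about how duplicates are handled.

```python
def arrays_intersect(array_l: list, array_r: list,
                     min_intersection: int = 1) -> bool:
    """True when the arrays share at least min_intersection distinct elements."""
    shared = set(array_l) & set(array_r)
    return len(shared) >= min_intersection


print(arrays_intersect(["ann", "anna"], ["anna", "annie"]))                      # True
print(arrays_intersect(["ann", "anna"], ["anna", "annie"], min_intersection=2))  # False
```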


splink.comparison_level_library.DateDiffLevelBase

Bases: ComparisonLevel

__init__(date_col, date_threshold, date_metric='day', m_probability=None, cast_strings_to_date=False, date_format=None)

Represents a comparison level based around the difference between dates within a column

Parameters:

Name Type Description Default
date_col str

Input column name

required
date_threshold int

The total difference in time between two given dates. This is used in tandem with date_metric to determine whether two dates fall within this level. If you are using year as your metric, then a value of 1 would require that your dates lie within 1 year of one another.

required
date_metric str

The unit of time with which to measure your date_threshold. Your metric should be one of day, month or year. Defaults to day.

'day'
m_probability float

Starting value for m probability. Defaults to None.

None
cast_strings_to_date bool

Set to true and adjust date_format param when input dates are strings to enable date-casting. Defaults to False.

False
date_format str

Format of input dates if date-strings are given. Must be consistent across record pairs. If None (the default), downstream functions for each backend assign date_format to ISO 8601 format (yyyy-mm-dd).

None

Examples:

Date Difference comparison level at threshold 1 year

import splink.duckdb.duckdb_comparison_level_library as cll
cll.datediff_level("date",
                   date_threshold=1,
                   date_metric="year")

Date Difference comparison with date-casting and unspecified date_format (default = %Y-%m-%d)

import splink.duckdb.duckdb_comparison_level_library as cll
cll.datediff_level("dob",
                   date_threshold=3,
                   date_metric='month',
                   cast_strings_to_date=True)

Date Difference comparison with date-casting and specified date_format

import splink.duckdb.duckdb_comparison_level_library as cll
cll.datediff_level("dob",
                   date_threshold=3,
                   date_metric='month',
                   cast_strings_to_date=True,
                   date_format='%d/%m/%Y')

Date Difference comparison level at threshold 1 year

import splink.spark.spark_comparison_level_library as cll
cll.datediff_level("date",
                   date_threshold=1,
                   date_metric="year")

Date Difference comparison with date-casting and unspecified date_format (default = %Y-%m-%d)

import splink.spark.spark_comparison_level_library as cll
cll.datediff_level("dob",
                   date_threshold=3,
                   date_metric='month',
                   cast_strings_to_date=True)

Date Difference comparison with date-casting and specified date_format

import splink.spark.spark_comparison_level_library as cll
cll.datediff_level("dob",
                   date_threshold=3,
                   date_metric='month',
                   cast_strings_to_date=True,
                   date_format='%d/%m/%Y')

Returns:

Name Type Description
ComparisonLevel ComparisonLevel

A comparison level that evaluates whether two dates fall within a given interval.
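
To show how date_threshold, date_metric and date_format interact, here is a rough plain-Python sketch. It is not how any backend computes the difference: the function name is hypothetical, and converting months and years to days using average Gregorian lengths is an assumption made to keep the example short.

```python
from datetime import datetime

# Assumption: average unit lengths; real backends use their own date arithmetic
DAYS_PER_UNIT = {"day": 1.0, "month": 365.25 / 12, "year": 365.25}


def within_date_threshold(date_l: str, date_r: str,
                          date_threshold: int,
                          date_metric: str = "day",
                          date_format: str = "%Y-%m-%d") -> bool:
    """True when two date strings are within date_threshold units of each other."""
    if date_metric not in DAYS_PER_UNIT:
        raise ValueError("date_metric should be one of day, month or year")
    parsed_l = datetime.strptime(date_l, date_format)
    parsed_r = datetime.strptime(date_r, date_format)
    difference_in_days = abs((parsed_l - parsed_r).days)
    return difference_in_days / DAYS_PER_UNIT[date_metric] <= date_threshold


print(within_date_threshold("2000-01-01", "2000-06-30", 1, "year"))  # True
print(within_date_threshold("01/01/2000", "15/05/2000", 3, "month",
                            date_format="%d/%m/%Y"))                 # False
```

The second call parses day-first strings, which is why date_format must be consistent across record pairs.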