Documentation for the `comparison_library`

`AbsoluteTimeDifferenceAtThresholds(col_name, *, input_is_string, metrics, thresholds, datetime_format=None, term_frequency_adjustments=False, invalid_dates_as_null=True)`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with multiple levels based on absolute time differences:

Exact match in col_name
Absolute time difference levels at specified thresholds
...
Anything else

For example, with metrics = ['day', 'month'] and thresholds = [1, 3] the levels are:

Exact match in col_name
Absolute time difference in col_name <= 1 day
Absolute time difference in col_name <= 3 months
Anything else

This comparison uses the AbsoluteTimeDifferenceLevel, which computes the total elapsed time between two dates, rather than counting calendar intervals.
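To illustrate the difference between elapsed time and calendar intervals, here is a minimal pure-Python sketch of the level logic described above. It is not Splink's implementation: the function name `time_difference_level` and the `SECONDS_PER_UNIT` table are hypothetical, and the month and year lengths (averaged over a 365.25-day year) are assumptions made for illustration.

```python
from datetime import datetime

# Assumed unit lengths in seconds; month and year are averages, not calendar units.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "month": 86400 * 365.25 / 12,
    "year": 86400 * 365.25,
}


def time_difference_level(a, b, metrics, thresholds):
    """Return the label of the first level that the pair of datetimes satisfies."""
    if a == b:
        return "Exact match"
    elapsed = abs((a - b).total_seconds())
    # Levels are checked in the order given; the first satisfied level wins.
    for metric, threshold in zip(metrics, thresholds):
        if elapsed <= threshold * SECONDS_PER_UNIT[metric]:
            return f"<= {threshold} {metric}"
    return "Anything else"


print(time_difference_level(datetime(2020, 1, 31), datetime(2020, 2, 1), ["day", "month"], [1, 3]))
```

31 January and 1 February fall in different calendar months but are only one day apart in elapsed time, so under this sketch the pair lands in the `<= 1 day` level.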

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`input_is_string`	`bool`	If True, the input dates are treated as strings and parsed according to `datetime_format`.	required
`metrics`	`Union[DateMetricType, List[DateMetricType]]`	The unit(s) of time to use when comparing dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.	required
`thresholds`	`Union[int, float, List[Union[int, float]]]`	The threshold(s) to use for the time difference level(s).	required
`datetime_format`	`str`	The format string for parsing dates if `input_is_string` is True. ISO 8601 format used if not provided.	`None`
`term_frequency_adjustments`	`bool`	Whether to apply term frequency adjustments. Defaults to False.	`False`
`invalid_dates_as_null`	`bool`	If True and `input_is_string` is True, treat invalid dates as null. Defaults to True.	`True`

`ArrayIntersectAtSizes(col_name, size_threshold_or_thresholds=[1])`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with multiple levels based on the intersection sizes of array elements:

Intersection at specified size thresholds
...
Anything else

For example, with size_threshold_or_thresholds = [3, 1], the levels are:

Intersection of arrays in col_name has at least 3 elements
Intersection of arrays in col_name has at least 1 element
Anything else (e.g., empty intersection)
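The level logic above can be sketched in plain Python. This is an illustration only, not Splink's code: `array_intersect_level` is a hypothetical helper, and checking thresholds from largest to smallest is an assumption based on the example ordering.

```python
def array_intersect_level(left, right, size_thresholds=(1,)):
    """Return the label of the first intersection-size level the two arrays satisfy."""
    overlap = len(set(left) & set(right))
    # Check the most demanding threshold first so the strictest level wins.
    for size in sorted(size_thresholds, reverse=True):
        if overlap >= size:
            return f"intersection size >= {size}"
    return "Anything else"


print(array_intersect_level(["a", "b", "c"], ["a", "b", "c", "d"], [3, 1]))
print(array_intersect_level(["a"], ["a", "z"], [3, 1]))
```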

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`size_threshold_or_thresholds`	`Union[int, list[int]]`	The size threshold(s) for the intersection levels. Defaults to [1].	`[1]`

`CosineSimilarityAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.8, 0.7])`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with two or more levels:

Cosine similarity levels at specified thresholds
...
Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

Cosine similarity in col_name >= 0.9
Cosine similarity in col_name >= 0.7
Anything else
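As a rough illustration of how a similarity score maps onto these levels, the following sketch computes cosine similarity for two numeric vectors in plain Python. The names `cosine_similarity` and `cosine_level` are hypothetical, and the sketch assumes non-zero vectors of equal length; the SQL backend performs the real calculation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cosine_level(u, v, thresholds=(0.9, 0.8, 0.7)):
    """Return the label of the highest threshold that the similarity reaches."""
    score = cosine_similarity(u, v)
    for threshold in sorted(thresholds, reverse=True):
        if score >= threshold:
            return f">= {threshold}"
    return "Anything else"


print(cosine_level([1, 1], [1, 0], [0.9, 0.7]))
```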

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`score_threshold_or_thresholds`	`Union[float, list]`	The threshold(s) to use for the cosine similarity level(s). Defaults to [0.9, 0.8, 0.7].	`[0.9, 0.8, 0.7]`

`CustomComparison(comparison_levels, output_column_name=None, comparison_description=None)`

Bases: ComparisonCreator

Represents a comparison of the data with custom supplied levels.

Parameters:

Name	Type	Description	Default
`output_column_name`	`str`	The column name to use to refer to this comparison	`None`
`comparison_levels`	`list`	A list of some combination of `ComparisonLevelCreator` objects, or dicts. These represent the similarity levels assessed by the comparison, in order of decreasing specificity	required
`comparison_description`	`str`	An optional description of the comparison	`None`

`DamerauLevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2])`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

Exact match in col_name
Damerau-Levenshtein levels at specified distance thresholds
...
Anything else

For example, with distance_threshold_or_thresholds = [1, 3] the levels are:

Exact match in col_name
Damerau-Levenshtein distance in col_name <= 1
Damerau-Levenshtein distance in col_name <= 3
Anything else
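To make the distance levels concrete, here is a small pure-Python sketch. It uses the restricted (optimal string alignment) variant of Damerau-Levenshtein, which is an assumption and may differ from the function your SQL backend provides; `osa_distance` and `damerau_levenshtein_level` are hypothetical names.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # A swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def damerau_levenshtein_level(a, b, thresholds=(1, 2)):
    """Return the label of the first level that the pair satisfies."""
    if a == b:
        return "Exact match"
    distance = osa_distance(a, b)
    for threshold in sorted(thresholds):
        if distance <= threshold:
            return f"<= {threshold}"
    return "Anything else"


print(damerau_levenshtein_level("smith", "smtih", [1, 3]))
```

The transposed pair "smith" / "smtih" is a single edit here, whereas plain Levenshtein distance would count it as two.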

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`distance_threshold_or_thresholds`	`Union[int, list]`	The threshold(s) to use for the Damerau-Levenshtein distance level(s). Defaults to [1, 2].	`[1, 2]`

`DateOfBirthComparison(col_name, *, input_is_string, datetime_thresholds=[1, 1, 10], datetime_metrics=['month', 'year', 'year'], datetime_format=None, invalid_dates_as_null=True)`

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a date of birth column in the col_name provided.

Note that input_is_string is a required argument: you must state whether col_name contains values of type date/datetime or string.

The default arguments will give a comparison with comparison levels:

Exact match (all other dates)
Damerau-Levenshtein distance <= 1
Date difference <= 1 month
Date difference <= 1 year
Date difference <= 10 years
Anything else

Parameters:

Name	Type	Description	Default
`col_name`	`Union[str, ColumnExpression]`	The column name	required
`input_is_string`	`bool`	If True, the provided `col_name` must be of type string. If False, it must be a date or datetime.	required
`datetime_thresholds`	`Union[int, float, List[Union[int, float]]]`	Numeric thresholds for date differences. Defaults to [1, 1, 10].	`[1, 1, 10]`
`datetime_metrics`	`Union[DateMetricType, List[DateMetricType]]`	Metrics for date differences. Defaults to ["month", "year", "year"].	`['month', 'year', 'year']`
`datetime_format`	`str`	The datetime format used to cast strings to dates. Only used if input is a string.	`None`
`invalid_dates_as_null`	`bool`	If True, treat invalid dates as null, as opposed to allowing e.g. an exact or Levenshtein match where one or both sides are an invalid date. Only used if input is a string. Defaults to True.	`True`

`DistanceFunctionAtThresholds(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True)`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

Exact match in col_name
Custom distance function levels at specified thresholds
...
Anything else

For example, with distance_threshold_or_thresholds = [1, 3], distance_function_name = 'hamming' and higher_is_more_similar = False, the levels are:

Exact match in col_name
Hamming distance of col_name <= 1
Hamming distance of col_name <= 3
Anything else
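The role of higher_is_more_similar can be sketched in plain Python: it decides whether a level is satisfied by values at or below the threshold (distances) or at or above it (similarity scores). The helpers `hamming` and `distance_function_level` are hypothetical and stand in for whatever SQL function you name; checking thresholds in the order supplied is an assumption.

```python
import operator


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_function_level(a, b, distance_function, thresholds, higher_is_more_similar=True):
    """Return the label of the first level that the pair satisfies."""
    if a == b:
        return "Exact match"
    # Similarity scores must reach the threshold; distances must not exceed it.
    passes = operator.ge if higher_is_more_similar else operator.le
    symbol = ">=" if higher_is_more_similar else "<="
    value = distance_function(a, b)
    for threshold in thresholds:
        if passes(value, threshold):
            return f"{symbol} {threshold}"
    return "Anything else"


print(distance_function_level("karolin", "kathrin", hamming, [1, 3], higher_is_more_similar=False))
```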

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`distance_function_name`	`str`	The name of the SQL distance function.	required
`distance_threshold_or_thresholds`	`Union[float, list]`	The threshold(s) to use for the distance function level(s).	required
`higher_is_more_similar`	`bool`	Whether higher values of the distance function indicate greater similarity (e.g. True for Jaro-Winkler, False for Levenshtein). Defaults to True.	`True`

`DistanceInKMAtThresholds(lat_col, long_col, km_thresholds)`

Bases: ComparisonCreator

A comparison of the latitude and longitude coordinates defined in lat_col and long_col, with levels based on the great circle distance between them in km.

An example of the output with km_thresholds = [1, 10] would be:

The two coordinates are within 1 km of one another
The two coordinates are within 10 km of one another
Anything else (i.e. the coordinates are more than 10 km apart)
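For intuition, the great circle distance can be approximated with the haversine formula. The sketch below is a plain-Python illustration, not Splink's SQL: `great_circle_km` and `distance_level` are hypothetical names, and the mean Earth radius of 6371 km is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_level(lat1, lon1, lat2, lon2, km_thresholds):
    """Return the label of the smallest threshold that the distance falls within."""
    distance = great_circle_km(lat1, lon1, lat2, lon2)
    for threshold in sorted(km_thresholds):
        if distance <= threshold:
            return f"within {threshold} km"
    return "Anything else"


# Central London to central Paris is a few hundred km, so no level is met.
print(distance_level(51.5074, -0.1278, 48.8566, 2.3522, [1, 10]))
```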

Parameters:

Name	Type	Description	Default
`lat_col`	`str`	The name of the latitude column to compare.	required
`long_col`	`str`	The name of the longitude column to compare.	required
`km_thresholds`	`iterable[float] | float`	The km threshold(s) for the distance levels.	required

`EmailComparison(col_name)`

Bases: ComparisonCreator

Generate an 'out of the box' comparison for an email address column in the col_name provided.

The default comparison levels are:

Null comparison: e.g., one email is missing or invalid.
Exact match on full email: e.g., john@smith.com vs. john@smith.com.
Exact match on username part of email: e.g., john@company.com vs. john@other.com.
Jaro-Winkler similarity > 0.88 on full email: e.g., john.smith@company.com vs. john.smyth@company.com.
Jaro-Winkler similarity > 0.88 on username part of email: e.g., john.smith@company.com vs. john.smyth@other.com.
Anything else: e.g., john@company.com vs. rebecca@other.com.
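The exact-match levels above can be sketched in plain Python by splitting each address at the `@` sign. This is an illustration only: `email_level` is a hypothetical helper, it treats only missing values as null, and it omits the two Jaro-Winkler levels for brevity.

```python
def email_level(left, right):
    """Return the label of the first exact-match level that the pair satisfies."""
    if not left or not right:
        return "Null comparison"
    if left == right:
        return "Exact match on full email"
    # The username is everything before the first "@".
    if left.split("@")[0] == right.split("@")[0]:
        return "Exact match on username"
    return "Anything else"


print(email_level("john@company.com", "john@other.com"))
```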

Parameters:

Name	Type	Description	Default
`col_name`	`Union[str, ColumnExpression]`	The column name or expression for the email addresses to be compared.	required

`ExactMatch(col_name)`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with two levels:

Exact match in col_name
Anything else

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare	required

`ForenameSurnameComparison(forename_col_name, surname_col_name, *, jaro_winkler_thresholds=[0.92, 0.88], forename_surname_concat_col_name=None)`

Bases: ComparisonCreator

Generate an 'out of the box' comparison for forename and surname columns in the forename_col_name and surname_col_name provided.

It's recommended to derive an additional column containing a concatenated forename and surname column so that term frequencies can be applied to the full name. If you have derived a column, provide it at forename_surname_concat_col_name.

The default comparison levels are:

Null comparison on both forename and surname
Exact match on both forename and surname
Columns reversed comparison (forename and surname swapped)
Jaro-Winkler similarity > 0.92 on both forename and surname
Jaro-Winkler similarity > 0.88 on both forename and surname
Exact match on surname
Exact match on forename
Anything else
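The ordering of these levels matters: a pair is assigned to the first level it satisfies. The plain-Python sketch below illustrates this for the exact-match and reversed-column levels only, leaving out the Jaro-Winkler levels. `name_pair_level` is a hypothetical helper, and its reading of the null level (both the forename and the surname comparison involve a missing value) is an assumption.

```python
def _same(a, b):
    """True only when both values are present and equal (None never matches)."""
    return a is not None and b is not None and a == b


def name_pair_level(forename_l, surname_l, forename_r, surname_r):
    """Return the label of the first level that the pair of names satisfies."""
    forename_missing = forename_l is None or forename_r is None
    surname_missing = surname_l is None or surname_r is None
    if forename_missing and surname_missing:
        return "Null comparison"
    if _same(forename_l, forename_r) and _same(surname_l, surname_r):
        return "Exact match on both"
    # Forename of one record equals surname of the other, and vice versa.
    if _same(forename_l, surname_r) and _same(surname_l, forename_r):
        return "Columns reversed"
    if _same(surname_l, surname_r):
        return "Exact match on surname"
    if _same(forename_l, forename_r):
        return "Exact match on forename"
    return "Anything else"


print(name_pair_level("john", "smith", "smith", "john"))
```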

Parameters:

Name	Type	Description	Default
`forename_col_name`	`Union[str, ColumnExpression]`	The column name or expression for the forenames to be compared.	required
`surname_col_name`	`Union[str, ColumnExpression]`	The column name or expression for the surnames to be compared.	required
`jaro_winkler_thresholds`	`Union[float, list[float]]`	Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88].	`[0.92, 0.88]`
`forename_surname_concat_col_name`	`str`	The column name for concatenated forename and surname values. If provided, term frequencies are applied on the exact match using this column	`None`

`JaccardAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

Exact match in col_name
Jaccard score levels at specified thresholds
...
Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

Exact match in col_name
Jaccard score in col_name >= 0.9
Jaccard score in col_name >= 0.7
Anything else
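As an illustration, a Jaccard score is the size of the intersection of two sets divided by the size of their union. The sketch below assumes the sets are the distinct characters of each string, which may not match how your SQL backend tokenises values; `jaccard` and `jaccard_level` are hypothetical names.

```python
def jaccard(a, b):
    """Jaccard score of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def jaccard_level(a, b, thresholds=(0.9, 0.7)):
    """Return the label of the first level that the pair satisfies."""
    if a == b:
        return "Exact match"
    score = jaccard(a, b)
    for threshold in sorted(thresholds, reverse=True):
        if score >= threshold:
            return f">= {threshold}"
    return "Anything else"


print(jaccard_level("abcd", "abcde"))
```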

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`score_threshold_or_thresholds`	`Union[float, list]`	The threshold(s) to use for the Jaccard similarity level(s). Defaults to [0.9, 0.7].	`[0.9, 0.7]`

`JaroAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

Exact match in col_name
Jaro score levels at specified thresholds
...
Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

Exact match in col_name
Jaro score in col_name >= 0.9
Jaro score in col_name >= 0.7
Anything else

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`score_threshold_or_thresholds`	`Union[float, list]`	The threshold(s) to use for the Jaro similarity level(s). Defaults to [0.9, 0.7].	`[0.9, 0.7]`

`JaroWinklerAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

Exact match in col_name
Jaro-Winkler score levels at specified thresholds
...
Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

Exact match in col_name
Jaro-Winkler score in col_name >= 0.9
Jaro-Winkler score in col_name >= 0.7
Anything else
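For readers unfamiliar with the score itself, here is a plain-Python sketch of the textbook Jaro and Jaro-Winkler calculations and how a score maps onto the levels above. The function names are hypothetical, the prefix scale of 0.1 and maximum prefix length of 4 are the conventional values (an assumption here), and backend implementations may differ in edge cases.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings, in the range 0 to 1."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        # Only characters within the sliding window can be matched.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2) + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def jaro_winkler_level(s1, s2, thresholds=(0.9, 0.7)):
    """Return the label of the first level that the pair satisfies."""
    if s1 == s2:
        return "Exact match"
    score = jaro_winkler(s1, s2)
    for threshold in sorted(thresholds, reverse=True):
        if score >= threshold:
            return f">= {threshold}"
    return "Anything else"


print(jaro_winkler_level("MARTHA", "MARHTA"))
```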

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`score_threshold_or_thresholds`	`Union[float, list]`	The threshold(s) to use for the Jaro-Winkler similarity level(s). Defaults to [0.9, 0.7].	`[0.9, 0.7]`

`LevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2])`

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

Exact match in col_name
Levenshtein levels at specified distance thresholds
...
Anything else

For example, with distance_threshold_or_thresholds = [1, 3] the levels are:

Exact match in col_name
Levenshtein distance in col_name <= 1
Levenshtein distance in col_name <= 3
Anything else

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare	required
`distance_threshold_or_thresholds`	`Union[int, list]`	The threshold(s) to use for the Levenshtein distance level(s). Defaults to [1, 2].	`[1, 2]`

`NameComparison(col_name, *, jaro_winkler_thresholds=[0.92, 0.88, 0.7], dmeta_col_name=None)`

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a name column in the col_name provided.

It's also possible to include a level for a dmetaphone match, but this requires you to derive a dmetaphone column prior to importing it into Splink. Note this is expected to be a column containing arrays of dmetaphone values, which are of length 1 or 2.

The default comparison levels are:

Null comparison
Exact match
Jaro-Winkler similarity > 0.92
Jaro-Winkler similarity > 0.88
Jaro-Winkler similarity > 0.70
Anything else

Parameters:

Name	Type	Description	Default
`col_name`	`Union[str, ColumnExpression]`	The column name or expression for the names to be compared.	required
`jaro_winkler_thresholds`	`Union[float, list[float]]`	Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88, 0.7].	`[0.92, 0.88, 0.7]`
`dmeta_col_name`	`str`	The column name for dmetaphone values. If provided, an array intersection level is included. This column must contain arrays of dmetaphone values, which are of length 1 or 2.	`None`

`PairwiseStringDistanceFunctionAtThresholds(col_name, distance_function_name, distance_threshold_or_thresholds)`

Bases: ComparisonCreator

Represents a comparison of the most similar pair of values where the first value is in the array data in col_name for the first record and the second value is in the array data in col_name for the second record. The comparison has three or more levels:

Exact match between any pair of values
User-selected string distance function levels at specified thresholds
...
Anything else

For example, with distance_threshold_or_thresholds = [1, 3] and distance_function_name = 'levenshtein' the levels are:

Exact match between any pair of values
Levenshtein distance between the most similar pair of values <= 1
Levenshtein distance between the most similar pair of values <= 3
Anything else
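The idea of comparing the most similar pair across two arrays can be sketched in plain Python. This is an illustration under assumptions, not Splink's SQL: `levenshtein` and `pairwise_level` are hypothetical helpers, and empty arrays are simply sent to the final level.

```python
from itertools import product


def levenshtein(a, b):
    """Levenshtein edit distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def pairwise_level(left_values, right_values, thresholds):
    """Return the label of the first level satisfied by the closest pair of values."""
    if not left_values or not right_values:
        return "Anything else"
    if set(left_values) & set(right_values):
        return "Exact match between any pair of values"
    # Score every cross-array pair and keep the smallest distance.
    best = min(levenshtein(l, r) for l, r in product(left_values, right_values))
    for threshold in sorted(thresholds):
        if best <= threshold:
            return f"<= {threshold}"
    return "Anything else"


print(pairwise_level(["john", "jon"], ["jonh", "mary"], [1, 3]))
```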

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the column to compare.	required
`distance_function_name`	`str`	The name of the string distance function. Must be one of `"levenshtein"`, `"damerau_levenshtein"`, `"jaro_winkler"` or `"jaro"`.	required
`distance_threshold_or_thresholds`	`Union[float, list]`	The threshold(s) to use for the distance function level(s).	required

`PostcodeComparison(col_name, *, invalid_postcodes_as_null=False, lat_col=None, long_col=None, km_thresholds=[1, 10, 100])`

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a postcode column in the col_name provided.

The default comparison levels are:

Null comparison
Exact match on full postcode
Exact match on sector
Exact match on district
Exact match on area
Distance in km (if lat_col and long_col are provided)

It's also possible to include levels for distance in km, but this requires you to have geocoded your postcodes prior to importing them into Splink. Use the lat_col and long_col arguments to tell Splink where to find the latitude and longitude columns.

See https://ideal-postcodes.co.uk/guides/uk-postcode-format for definitions
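To make the sector, district and area levels concrete, here is a plain-Python sketch that splits a UK postcode with a simplified regular expression. The pattern, `postcode_parts` and `postcode_level` are hypothetical and do not reproduce Splink's own expressions; the sketch also treats any unparseable postcode as null, which corresponds to the behaviour described for invalid_postcodes_as_null=True.

```python
import re

# Simplified pattern: area letters, district digits/letter, sector digit, unit letters.
POSTCODE = re.compile(r"^([A-Z]{1,2})([0-9][A-Z0-9]?) ?([0-9])([A-Z]{2})$")


def postcode_parts(postcode):
    """Split a postcode into area, district, sector and full form, or return None."""
    match = POSTCODE.match(postcode.strip().upper())
    if match is None:
        return None
    area, district_suffix, sector_digit, unit = match.groups()
    district = area + district_suffix
    sector = f"{district} {sector_digit}"
    return {"area": area, "district": district, "sector": sector, "full": sector + unit}


def postcode_level(left, right):
    """Return the label of the first (most specific) level that the pair satisfies."""
    parts_l, parts_r = postcode_parts(left), postcode_parts(right)
    if parts_l is None or parts_r is None:
        return "Null comparison"
    for part in ("full", "sector", "district", "area"):
        if parts_l[part] == parts_r[part]:
            return f"Exact match on {part}"
    return "Anything else"


print(postcode_parts("SW1A 2AA"))
print(postcode_level("SW1A 2AA", "SW1A 1AA"))
```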

Parameters:

Name	Type	Description	Default
`col_name`	`Union[str, ColumnExpression]`	The column name or expression for the postcodes to be compared.	required
`invalid_postcodes_as_null`	`bool`	If True, treat invalid postcodes as null. Defaults to False.	`False`
`lat_col`	`Union[str, ColumnExpression]`	The column name or expression for latitude. Required if `km_thresholds` is provided.	`None`
`long_col`	`Union[str, ColumnExpression]`	The column name or expression for longitude. Required if `km_thresholds` is provided.	`None`
`km_thresholds`	`Union[float, List[float]]`	Thresholds for distance in kilometers. If provided, `lat_col` and `long_col` must also be provided.	`[1, 10, 100]`

AbsoluteDateDifferenceAtThresholds

An alias of AbsoluteTimeDifferenceAtThresholds.

Configuring comparisons

Note that all comparisons have a .configure() method as follows:

Configure the comparison creator with options that are common to all comparisons.

For m and u probabilities, the first element in the list corresponds to the first comparison level, usually an exact match level. Subsequent elements correspond to comparison levels in sequential order, through to the last element, which is usually the 'ELSE' level.

All options have default values set initially. Any call to .configure() sets only the options that are supplied. Subsequent calls to .configure() do not reset previously set options to their defaults; to restore a default you must explicitly pass the value corresponding to that default.

Generally speaking only a single call (at most) to .configure() should be required.

Parameters:

Name	Type	Description	Default
`term_frequency_adjustments`	`bool`	Whether term frequency adjustments are switched on for this comparison. Only applied to exact match levels. Default corresponds to False.	`unsupplied_option`
`m_probabilities`	`list`	List of m probabilities. Default corresponds to None.	`unsupplied_option`
`u_probabilities`	`list`	List of u probabilities. Default corresponds to None.	`unsupplied_option`

Example

from splink.comparison_library import LevenshteinAtThresholds

cc = LevenshteinAtThresholds("name", 2)
cc.configure(
    # probabilities for the exact match level, Levenshtein <= 2, and else,
    # in that order
    m_probabilities=[0.9, 0.08, 0.02],
    u_probabilities=[0.01, 0.05, 0.94],
)
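The "unsupplied option" behaviour described above can be illustrated with a sentinel default in plain Python. This is a sketch of the general pattern, not Splink's source: `ConfigurableSketch` and `_UNSUPPLIED` are hypothetical names, and returning `self` from `configure` is a choice made here for convenience.

```python
_UNSUPPLIED = object()  # sentinel meaning "the caller did not pass this option"


class ConfigurableSketch:
    def __init__(self):
        # Initial defaults, matching those listed in the parameter table.
        self.term_frequency_adjustments = False
        self.m_probabilities = None
        self.u_probabilities = None

    def configure(
        self,
        *,
        term_frequency_adjustments=_UNSUPPLIED,
        m_probabilities=_UNSUPPLIED,
        u_probabilities=_UNSUPPLIED,
    ):
        supplied = {
            "term_frequency_adjustments": term_frequency_adjustments,
            "m_probabilities": m_probabilities,
            "u_probabilities": u_probabilities,
        }
        # Only overwrite options the caller actually passed.
        for name, value in supplied.items():
            if value is not _UNSUPPLIED:
                setattr(self, name, value)
        return self


sketch = ConfigurableSketch()
sketch.configure(term_frequency_adjustments=True)
sketch.configure(m_probabilities=[0.9, 0.1])  # does not reset the earlier option
print(sketch.term_frequency_adjustments, sketch.m_probabilities)
```

Because the sentinel is distinct from every real value, passing `None` or `False` explicitly still counts as supplying the option, which is how a default can be restored.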
