Documentation for the comparison_library¶

AbsoluteTimeDifferenceAtThresholds(col_name, *, input_is_string, metrics, thresholds, datetime_format=None, term_frequency_adjustments=False, invalid_dates_as_null=True) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with multiple levels based on absolute time differences:

  • Exact match in col_name
  • Absolute time difference levels at specified thresholds
  • ...
  • Anything else

For example, with metrics = ['day', 'month'] and thresholds = [1, 3] the levels are:

  • Exact match in col_name
  • Absolute time difference in col_name <= 1 day
  • Absolute time difference in col_name <= 3 months
  • Anything else

This comparison uses the AbsoluteTimeDifferenceLevel, which computes the total elapsed time between two dates, rather than counting calendar intervals.
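
To make the elapsed-time idea concrete, here is a minimal pure-Python sketch of how a pair of datetimes could be assigned to a level. It is not Splink's implementation: the function name `time_difference_level`, the label strings, and the unit conversion factors (an average month of 30.4375 days and an average year of 365.25 days) are assumptions made for illustration.

```python
from datetime import datetime

# Assumed conversion factors, for illustration only: months and years are
# treated as average lengths rather than calendar intervals.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "month": 86400 * 30.4375,
    "year": 86400 * 365.25,
}


def time_difference_level(left, right, metrics, thresholds):
    """Return a label for the first level that the pair of datetimes satisfies."""
    if left == right:
        return "Exact match"
    # Total elapsed time between the two values, regardless of which is later.
    elapsed = abs((left - right).total_seconds())
    for metric, threshold in zip(metrics, thresholds):
        if elapsed <= threshold * SECONDS_PER_UNIT[metric]:
            return f"Absolute time difference <= {threshold} {metric}"
    return "Anything else"


print(
    time_difference_level(
        datetime(2020, 1, 1), datetime(2020, 1, 2), ["day", "month"], [1, 3]
    )
)
```

Because the comparison is on elapsed seconds, a pair that is one day and one second apart already falls through the 1 day level in this sketch and lands in the 3 month level.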

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
input_is_string bool

If True, the input dates are treated as strings and parsed according to datetime_format.

required
metrics Union[DateMetricType, List[DateMetricType]]

The unit(s) of time to use when comparing dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.

required
thresholds Union[int, float, List[Union[int, float]]]

The threshold(s) to use for the time difference level(s).

required
datetime_format str

The format string for parsing dates if input_is_string is True. ISO 8601 format used if not provided.

None
term_frequency_adjustments bool

Whether to apply term frequency adjustments. Defaults to False.

False
invalid_dates_as_null bool

If True and input_is_string is True, treat invalid dates as null. Defaults to True.

True

ArrayIntersectAtSizes(col_name, size_threshold_or_thresholds=[1]) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with multiple levels based on the intersection sizes of array elements:

  • Intersection at specified size thresholds
  • ...
  • Anything else

For example, with size_threshold_or_thresholds = [3, 1], the levels are:

  • Intersection of arrays in col_name has at least 3 elements
  • Intersection of arrays in col_name has at least 1 element
  • Anything else (e.g., empty intersection)
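
The level assignment above can be sketched in plain Python as follows. This is only an illustration of the semantics, not how Splink evaluates the comparison; the function name `array_intersect_level` and the label strings are hypothetical, and the sketch assumes the thresholds are passed largest first, as in the example.

```python
def array_intersect_level(left, right, size_thresholds=(1,)):
    """Return a label for the first size threshold the intersection meets."""
    # Count the distinct elements that appear in both arrays.
    overlap = len(set(left) & set(right))
    # Thresholds are checked in the order given (assumed largest first).
    for size in size_thresholds:
        if overlap >= size:
            return f"Intersection size >= {size}"
    return "Anything else"


print(array_intersect_level(["a", "b", "c", "d"], ["a", "b", "c"], [3, 1]))
print(array_intersect_level(["a"], ["b"], [3, 1]))
```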

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
size_threshold_or_thresholds Union[int, list[int]]

The size threshold(s) for the intersection levels. Defaults to [1].

[1]

CosineSimilarityAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.8, 0.7]) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with two or more levels:

  • Cosine similarity levels at specified thresholds
  • ...
  • Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

  • Cosine similarity in col_name >= 0.9
  • Cosine similarity in col_name >= 0.7
  • Anything else
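
As a rough illustration of what these levels measure, the sketch below computes cosine similarity for two numeric vectors and picks the first threshold that is met. It assumes the column holds equal-length, non-zero numeric arrays; the function names and label strings are hypothetical and this is not Splink's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def cosine_level(a, b, thresholds=(0.9, 0.8, 0.7)):
    """Return a label for the first threshold the similarity reaches."""
    score = cosine_similarity(a, b)
    for threshold in thresholds:
        if score >= threshold:
            return f"Cosine similarity >= {threshold}"
    return "Anything else"


print(cosine_level([1, 0], [1, 0]))
print(cosine_level([1, 1], [1, 0]))
```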

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
score_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the cosine similarity level(s). Defaults to [0.9, 0.8, 0.7].

[0.9, 0.8, 0.7]

CustomComparison(comparison_levels, output_column_name=None, comparison_description=None) ¶

Bases: ComparisonCreator

Represents a comparison of the data with custom supplied levels.

Parameters:

Name Type Description Default
output_column_name str

The column name to use to refer to this comparison

None
comparison_levels list

A list of some combination of ComparisonLevelCreator objects, or dicts. These represent the similarity levels assessed by the comparison, in order of decreasing specificity

required
comparison_description str

An optional description of the comparison

None

DamerauLevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2]) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Damerau-Levenshtein levels at specified distance thresholds
  • ...
  • Anything else

For example, with distance_threshold_or_thresholds = [1, 3] the levels are:

  • Exact match in col_name
  • Damerau-Levenshtein distance in col_name <= 1
  • Damerau-Levenshtein distance in col_name <= 3
  • Anything else
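
The following pure-Python sketch shows how such levels could be assigned. It uses the optimal string alignment variant of Damerau-Levenshtein distance, in which swapping two adjacent characters counts as a single edit; whether a given SQL backend uses this exact variant is not stated here, so treat that as an assumption. The function names and label strings are hypothetical.

```python
def osa_distance(s, t):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "ht" <-> "th".
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def damerau_levenshtein_level(left, right, thresholds=(1, 2)):
    """Return a label for the first level the pair of strings satisfies."""
    if left == right:
        return "Exact match"
    distance = osa_distance(left, right)
    for threshold in thresholds:
        if distance <= threshold:
            return f"Damerau-Levenshtein distance <= {threshold}"
    return "Anything else"


print(damerau_levenshtein_level("smith", "smiht", [1, 3]))
```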

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the Damerau-Levenshtein distance level(s). Defaults to [1, 2].

[1, 2]

DateOfBirthComparison(col_name, *, input_is_string, datetime_thresholds=[1, 1, 10], datetime_metrics=['month', 'year', 'year'], datetime_format=None, invalid_dates_as_null=True) ¶

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a date of birth column in the col_name provided.

Note that input_is_string is a required argument: you must indicate whether col_name contains values of type date/datetime or string.

The default arguments will give a comparison with comparison levels:

  • Exact match (all other dates)
  • Damerau-Levenshtein distance <= 1
  • Date difference <= 1 month
  • Date difference <= 1 year
  • Date difference <= 10 years
  • Anything else

Parameters:

Name Type Description Default
col_name Union[str, ColumnExpression]

The column name

required
input_is_string bool

If True, the provided col_name must be of type string. If False, it must be a date or datetime.

required
datetime_thresholds Union[int, float, List[Union[int, float]]]

Numeric thresholds for date differences. Defaults to [1, 1, 10].

[1, 1, 10]
datetime_metrics Union[DateMetricType, List[DateMetricType]]

Metrics for date differences. Defaults to ["month", "year", "year"].

['month', 'year', 'year']
datetime_format str

The datetime format used to cast strings to dates. Only used if input is a string.

None
invalid_dates_as_null bool

If True, treat invalid dates as null, as opposed to allowing e.g. an exact or Levenshtein match where one or both sides are an invalid date. Only used if input is a string. Defaults to True.

True

DistanceFunctionAtThresholds(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Custom distance function levels at specified thresholds
  • ...
  • Anything else

For example, with distance_threshold_or_thresholds = [1, 3], distance_function_name = 'hamming' and higher_is_more_similar = False, the levels are:

  • Exact match in col_name
  • Hamming distance of col_name <= 1
  • Hamming distance of col_name <= 3
  • Anything else
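
The role of higher_is_more_similar can be sketched in plain Python: it decides whether a score passes a threshold by being at least the threshold or at most the threshold. The helper names, the Python `hamming` function, and the label strings below are hypothetical stand-ins; in Splink the distance function is a SQL function referenced by name.

```python
import operator


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_function_level(
    left, right, distance_fn, thresholds, higher_is_more_similar=True
):
    """Return a label for the first level the pair satisfies."""
    if left == right:
        return "Exact match"
    score = distance_fn(left, right)
    # Similarity scores pass when they are high; distances pass when low.
    passes = operator.ge if higher_is_more_similar else operator.le
    symbol = ">=" if higher_is_more_similar else "<="
    for threshold in thresholds:
        if passes(score, threshold):
            return f"{distance_fn.__name__} {symbol} {threshold}"
    return "Anything else"


print(
    distance_function_level(
        "karolin", "kathrin", hamming, [1, 3], higher_is_more_similar=False
    )
)
```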

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
distance_function_name str

The name of the SQL distance function.

required
distance_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the distance function level(s).

required
higher_is_more_similar bool

Are higher values of the distance function more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) Default is True

True

DistanceInKMAtThresholds(lat_col, long_col, km_thresholds) ¶

Bases: ComparisonCreator

A comparison of the latitude and longitude coordinates defined in 'lat_col' and 'long_col', based on the great circle distance between them in km.

An example of the output with km_thresholds = [1, 10] would be:

  • The two coordinates are within 1 km of one another
  • The two coordinates are within 10 km of one another
  • Anything else (i.e. the coordinates are more than 10 km apart)
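
A great circle distance can be approximated with the haversine formula. The sketch below is an illustration in pure Python, not Splink's implementation; the function names, the label strings, and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_in_km_level(point_l, point_r, km_thresholds):
    """Return a label for the first km threshold the pair falls within."""
    distance = great_circle_km(*point_l, *point_r)
    for km in km_thresholds:
        if distance <= km:
            return f"Within {km} km"
    return "Anything else"


# Two points about 5.6 km apart along the same meridian.
print(distance_in_km_level((51.5, -0.1), (51.55, -0.1), [1, 10]))
```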

Parameters:

Name Type Description Default
lat_col str

The name of the latitude column to compare.

required
long_col str

The name of the longitude column to compare.

required
km_thresholds iterable[float] | float

The km threshold(s) for the distance levels.

required

EmailComparison(col_name) ¶

Bases: ComparisonCreator

Generate an 'out of the box' comparison for an email address column in the col_name provided.

The default comparison levels are:

  • Null comparison: e.g., one email is missing or invalid.
  • Exact match on full email: e.g., john@smith.com vs. john@smith.com.
  • Exact match on username part of email: e.g., john@company.com vs. john@other.com.
  • Jaro-Winkler similarity > 0.88 on full email: e.g., john.smith@company.com vs. john.smyth@company.com.
  • Jaro-Winkler similarity > 0.88 on username part of email: e.g., john.smith@company.com vs. john.smyth@other.com.
  • Anything else: e.g., john@company.com vs. rebecca@other.com.

Parameters:

Name Type Description Default
col_name Union[str, ColumnExpression]

The column name or expression for the email addresses to be compared.

required

ExactMatch(col_name) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with two levels:

  • Exact match in col_name
  • Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required

ForenameSurnameComparison(forename_col_name, surname_col_name, *, jaro_winkler_thresholds=[0.92, 0.88], forename_surname_concat_col_name=None) ¶

Bases: ComparisonCreator

Generate an 'out of the box' comparison for forename and surname columns in the forename_col_name and surname_col_name provided.

It's recommended to derive an additional column containing the concatenated forename and surname so that term frequencies can be applied to the full name. If you have derived such a column, pass its name as forename_surname_concat_col_name.

The default comparison levels are:

  • Null comparison on both forename and surname
  • Exact match on both forename and surname
  • Columns reversed comparison (forename and surname swapped)
  • Jaro-Winkler similarity > 0.92 on both forename and surname
  • Jaro-Winkler similarity > 0.88 on both forename and surname
  • Exact match on surname
  • Exact match on forename
  • Anything else

Parameters:

Name Type Description Default
forename_col_name Union[str, ColumnExpression]

The column name or expression for the forenames to be compared.

required
surname_col_name Union[str, ColumnExpression]

The column name or expression for the surnames to be compared.

required
jaro_winkler_thresholds Union[float, list[float]]

Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88].

[0.92, 0.88]
forename_surname_concat_col_name str

The column name for concatenated forename and surname values. If provided, term frequencies are applied on the exact match using this column

None

JaccardAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7]) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Jaccard score levels at specified thresholds
  • ...
  • Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

  • Exact match in col_name
  • Jaccard score in col_name >= 0.9
  • Jaccard score in col_name >= 0.7
  • Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
score_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the Jaccard similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]

JaroAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7]) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Jaro score levels at specified thresholds
  • ...
  • Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

  • Exact match in col_name
  • Jaro score in col_name >= 0.9
  • Jaro score in col_name >= 0.7
  • Anything else
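
For readers unfamiliar with the metric, here is a pure-Python sketch of the standard Jaro similarity and of the level assignment described above. It is meant only to illustrate the scoring; Splink delegates the calculation to the SQL backend, and the function names and label strings here are hypothetical.

```python
def jaro(s, t):
    """Standard Jaro similarity between two strings, in the range 0 to 1."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    s_chars = [c for c, m in zip(s, s_matched) if m]
    t_chars = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    return (
        matches / len(s) + matches / len(t) + (matches - transpositions) / matches
    ) / 3


def jaro_level(left, right, thresholds=(0.9, 0.7)):
    """Return a label for the first level the pair of strings satisfies."""
    if left == right:
        return "Exact match"
    score = jaro(left, right)
    for threshold in thresholds:
        if score >= threshold:
            return f"Jaro score >= {threshold}"
    return "Anything else"


print(jaro_level("MARTHA", "MARHTA"))
```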

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
score_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the Jaro similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]

JaroWinklerAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7]) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Jaro-Winkler score levels at specified thresholds
  • ...
  • Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

  • Exact match in col_name
  • Jaro-Winkler score in col_name >= 0.9
  • Jaro-Winkler score in col_name >= 0.7
  • Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
score_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the Jaro-Winkler similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]

LevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2]) ¶

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Levenshtein levels at specified distance thresholds
  • ...
  • Anything else

For example, with distance_threshold_or_thresholds = [1, 3] the levels are:

  • Exact match in col_name
  • Levenshtein distance in col_name <= 1
  • Levenshtein distance in col_name <= 3
  • Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the Levenshtein distance level(s). Defaults to [1, 2].

[1, 2]

NameComparison(col_name, *, jaro_winkler_thresholds=[0.92, 0.88, 0.7], dmeta_col_name=None) ¶

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a name column in the col_name provided.

It's also possible to include a level for a dmetaphone match, but this requires you to derive a dmetaphone column prior to importing it into Splink. Note this is expected to be a column containing arrays of dmetaphone values, which are of length 1 or 2.

The default comparison levels are:

  • Null comparison
  • Exact match
  • Jaro-Winkler similarity > 0.92
  • Jaro-Winkler similarity > 0.88
  • Jaro-Winkler similarity > 0.70
  • Anything else

Parameters:

Name Type Description Default
col_name Union[str, ColumnExpression]

The column name or expression for the names to be compared.

required
jaro_winkler_thresholds Union[float, list[float]]

Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88, 0.7].

[0.92, 0.88, 0.7]
dmeta_col_name str

The column name for dmetaphone values. If provided, array intersection level is included. This column must contain arrays of dmetaphone values, which are of length 1 or 2.

None

PostcodeComparison(col_name, *, invalid_postcodes_as_null=False, lat_col=None, long_col=None, km_thresholds=[1, 10, 100]) ¶

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a postcode column in the col_name provided.

The default comparison levels are:

  • Null comparison
  • Exact match on full postcode
  • Exact match on sector
  • Exact match on district
  • Exact match on area
  • Distance in km (if lat_col and long_col are provided)

It's also possible to include levels for distance in km, but this requires you to have geocoded your postcodes prior to importing them into Splink. Use the lat_col and long_col arguments to tell Splink where to find the latitude and longitude columns.

See https://ideal-postcodes.co.uk/guides/uk-postcode-format for definitions

Parameters:

Name Type Description Default
col_name Union[str, ColumnExpression]

The column name or expression for the postcodes to be compared.

required
invalid_postcodes_as_null bool

If True, treat invalid postcodes as null. Defaults to False.

False
lat_col Union[str, ColumnExpression]

The column name or expression for latitude. Required if km_thresholds is provided.

None
long_col Union[str, ColumnExpression]

The column name or expression for longitude. Required if km_thresholds is provided.

None
km_thresholds Union[float, List[float]]

Thresholds for distance in kilometers. If provided, lat_col and long_col must also be provided.

[1, 10, 100]

AbsoluteDateDifferenceAtThresholds¶

An alias of AbsoluteTimeDifferenceAtThresholds.

Configuring comparisons¶

Note that all comparisons have a .configure() method as follows:

Configure the comparison creator with options that are common to all comparisons.

For m and u probabilities, the first element in the list corresponds to the first comparison level, usually an exact match level. Subsequent elements correspond to comparison levels in sequential order, through to the last element, which is usually the 'ELSE' level.

All options have default options set initially. Any call to .configure() will set any options that are supplied. Any subsequent calls to .configure() will not override these values with defaults; to override values you must explicitly provide a value corresponding to the default.

Generally speaking only a single call (at most) to .configure() should be required.
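
The behaviour described above (only supplied options are changed, and earlier values survive later calls) is commonly implemented with a sentinel default. The sketch below illustrates that pattern in plain Python; the class name `ConfigurableSketch` is hypothetical, the sentinel is named after the `unsupplied_option` default shown in the parameter table, and this is not Splink's actual code.

```python
# Sentinel that means "the caller did not pass this argument", which is
# distinct from passing None or False explicitly.
unsupplied_option = object()


class ConfigurableSketch:
    def __init__(self):
        # Initial defaults, matching those listed for .configure().
        self.term_frequency_adjustments = False
        self.m_probabilities = None
        self.u_probabilities = None

    def configure(
        self,
        *,
        term_frequency_adjustments=unsupplied_option,
        m_probabilities=unsupplied_option,
        u_probabilities=unsupplied_option,
    ):
        supplied = {
            "term_frequency_adjustments": term_frequency_adjustments,
            "m_probabilities": m_probabilities,
            "u_probabilities": u_probabilities,
        }
        for name, value in supplied.items():
            # Leave any option the caller did not supply untouched.
            if value is not unsupplied_option:
                setattr(self, name, value)


cc = ConfigurableSketch()
cc.configure(m_probabilities=[0.9, 0.08, 0.02])
cc.configure(term_frequency_adjustments=True)  # m_probabilities is kept
cc.configure(m_probabilities=None)             # explicit reset to the default
```

With this pattern, resetting an option requires passing the default value explicitly, which matches the note above about overriding values.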

Parameters:

Name Type Description Default
term_frequency_adjustments bool

Whether term frequency adjustments are switched on for this comparison. Only applied to exact match levels. Default corresponds to False.

unsupplied_option
m_probabilities list

List of m probabilities. Default corresponds to None.

unsupplied_option
u_probabilities list

List of u probabilities. Default corresponds to None.

unsupplied_option
Example
from splink.comparison_library import LevenshteinAtThresholds

cc = LevenshteinAtThresholds("name", 2)
cc.configure(
    m_probabilities=[0.9, 0.08, 0.02],
    u_probabilities=[0.01, 0.05, 0.94]
    # probabilities for exact match level, levenshtein <= 2, and else
    # in that order
)