Documentation for the `comparison_level_library`

`AbsoluteDifferenceLevel(col_name, difference_threshold)`

Bases: ComparisonLevelCreator

Represents a comparison level where the absolute difference between two numerical values is within a specified threshold.

Parameters:

Name	Type	Description	Default
`col_name`	`str \| ColumnExpression`	Input column name or ColumnExpression.	required
`difference_threshold`	`int \| float`	The maximum allowed absolute difference between the two values.	required

`AbsoluteTimeDifferenceLevel(col_name, *, input_is_string, threshold, metric, datetime_format=None)`

Bases: ComparisonLevelCreator

Computes the absolute elapsed time between two dates (total duration).

This function computes the amount of time that has passed between two dates, in contrast to functions like date_diff found in some SQL backends, which count the number of full calendar intervals (e.g., months, years) crossed.

For instance, the difference between January 29th and March 2nd would be less than two months in terms of elapsed time, unlike a date_diff calculation that would give an answer of 2 calendar intervals crossed.

Note that the threshold is inclusive, e.g. a level with a 10 day threshold will include a difference in date of exactly 10 days.

Parameters:

Name	Type	Description	Default
`col_name`	`str`	The name of the input column containing the dates to compare	required
`input_is_string`	`bool`	Indicates if the input date/times are in string format, requiring parsing according to `datetime_format`.	required
`threshold`	`int`	The maximum allowed difference between the two dates, in units specified by `metric`.	required
`metric`	`str`	The unit of time to use when comparing the dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.	required
`datetime_format`	`str`	The format string for parsing dates. ISO 8601 format used if not provided.	`None`
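
The distinction between elapsed time and calendar intervals crossed can be illustrated in plain Python. This is a sketch of the semantics only, not the library's implementation: the helper names are hypothetical, and the average month length of 365.25 / 12 days is an assumption made for the example.

```python
from datetime import date

# Assumption for illustration: one "month" is an average-length month.
AVERAGE_DAYS_PER_MONTH = 365.25 / 12


def calendar_months_crossed(start: date, end: date) -> int:
    """Count month boundaries crossed, in the style of a date_diff function."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def elapsed_months(start: date, end: date) -> float:
    """Absolute elapsed time, expressed in average-length months."""
    return abs((end - start).days) / AVERAGE_DAYS_PER_MONTH


def within_day_threshold(start: date, end: date, threshold_days: int) -> bool:
    """Inclusive check: a difference equal to the threshold still matches."""
    return abs((end - start).days) <= threshold_days


start, end = date(2023, 1, 29), date(2023, 3, 2)
print(calendar_months_crossed(start, end))      # 2
print(round(elapsed_months(start, end), 2))     # 1.05
print(within_day_threshold(date(2023, 1, 1), date(2023, 1, 11), 10))  # True
```

January 29th to March 2nd crosses two month boundaries but spans only 32 days, which is why an elapsed-time measure reports just over one month.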

`And(*comparison_levels)`

Bases: _Merge

Represents a comparison level that is an 'AND' of other comparison levels

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical "AND".

Parameters:

Name	Type	Description	Default
`*comparison_levels`	`ComparisonLevelCreator \| dict`	These represent the comparison levels you wish to combine via 'AND'	`()`
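
As a rough picture of what merging SQL conditions with a logical operator means, the sketch below joins condition strings in plain Python. The helper `merge_conditions` is hypothetical and the exact SQL text the library generates may differ; the point is only that each condition is kept intact and combined with the operator.

```python
def merge_conditions(conditions: list[str], operator: str) -> str:
    """Wrap each SQL condition in parentheses and join with a logical operator."""
    return f" {operator} ".join(f"({condition})" for condition in conditions)


combined = merge_conditions(
    ["first_name_l = first_name_r", "dob_l = dob_r"],
    "AND",
)
print(combined)  # (first_name_l = first_name_r) AND (dob_l = dob_r)
```

Passing "OR" instead of "AND" gives the corresponding disjunction.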

`ArrayIntersectLevel(col_name, min_intersection=1)`

Bases: ComparisonLevelCreator

Represents a comparison level based around the size of an intersection of arrays

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`min_intersection`	`int`	The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1	`1`
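
The intersection-size test can be pictured with Python sets. This is a minimal sketch with a hypothetical function name; it assumes that only distinct shared elements are counted, which may not match how every SQL backend treats duplicates.

```python
def array_intersect_matches(left: list, right: list, min_intersection: int = 1) -> bool:
    """True when the arrays share at least `min_intersection` distinct elements."""
    return len(set(left) & set(right)) >= min_intersection


print(array_intersect_matches(["a", "b", "c"], ["c", "d"]))                       # True
print(array_intersect_matches(["a", "b", "c"], ["c", "d"], min_intersection=2))   # False
```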

`ArraySubsetLevel(col_name, empty_is_subset=False)`

Bases: ComparisonLevelCreator

Represents a comparison level where the smaller array is an exact subset of the larger array. If arrays are equal length, they must have the same elements

The order of items in the arrays does not matter for this comparison.

Parameters:

Name	Type	Description	Default
`col_name`	`str \| ColumnExpression`	Input column name or ColumnExpression	required
`empty_is_subset`	`bool`	If True, an empty array is considered a subset of any array (including another empty array). Default is False.	`False`
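
The subset rule and the `empty_is_subset` switch can be sketched in plain Python as follows. The function name is hypothetical, and the sketch assumes each array holds distinct values, so that set containment is an adequate stand-in for the SQL logic.

```python
def array_subset_matches(left: list, right: list, empty_is_subset: bool = False) -> bool:
    """True when the smaller array is a subset of the larger one, ignoring order."""
    smaller, larger = sorted((left, right), key=len)
    if not smaller:
        # An empty array only counts as a subset when explicitly allowed.
        return empty_is_subset
    return set(smaller) <= set(larger)


print(array_subset_matches([2, 1], [1, 2, 3]))               # True
print(array_subset_matches([1, 4], [1, 2, 3]))               # False
print(array_subset_matches([], [1, 2, 3]))                   # False
print(array_subset_matches([], [1, 2, 3], empty_is_subset=True))  # True
```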

`ColumnsReversedLevel(col_name_1, col_name_2, symmetrical=False)`

Bases: ComparisonLevelCreator

Represents a comparison level where the columns are reversed. For example, if surname is in the forename field and vice versa

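By default, only one direction is checked: the left-hand value of `col_name_1` must equal the right-hand value of `col_name_2`. If the symmetrical argument is True, the reverse must also hold: the right-hand value of `col_name_1` must equal the left-hand value of `col_name_2`.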

Parameters:

Name	Type	Description	Default
`col_name_1`	`str`	First column, e.g. forename	required
`col_name_2`	`str`	Second column, e.g. surname	required
`symmetrical`	`bool`	If True, equality is required in both directions. Default is False.	`False`

`CosineSimilarityLevel(col_name, similarity_threshold)`

Bases: ComparisonLevelCreator

A comparison level using a cosine similarity function

e.g. array_cosine_similarity(val_l, val_r) >= similarity_threshold

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`similarity_threshold`	`float`	The threshold to use to assess similarity. Should be between 0 and 1.	required
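
For readers unfamiliar with the measure, cosine similarity and the threshold check can be sketched in plain Python. The function names are hypothetical and the sketch assumes non-zero numeric vectors of equal length; the backend's own function does the real work.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_matches(u: list[float], v: list[float], similarity_threshold: float) -> bool:
    """Mirror of the documented condition: similarity >= threshold."""
    return cosine_similarity(u, v) >= similarity_threshold


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))               # 0.0
print(cosine_similarity_matches([1.0, 0.0], [1.0, 1.0], 0.7))  # True
```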

`CustomLevel(sql_condition, label_for_charts=None, base_dialect_str=None)`

Bases: ComparisonLevelCreator

Represents a comparison level with a custom sql expression

Must be in a form suitable for use in a SQL CASE WHEN expression e.g. "substr(name_l, 1, 1) = substr(name_r, 1, 1)"

Parameters:

Name	Type	Description	Default
`sql_condition`	`str`	SQL condition to assess similarity	required
`label_for_charts`	`str`	A label for this level to be used in charts. Default None, so that `sql_condition` is used	`None`
`base_dialect_str`	`str`	If specified, the SQL dialect that this expression will be parsed as when attempting to translate to other backends	`None`

`DamerauLevenshteinLevel(col_name, distance_threshold)`

Bases: ComparisonLevelCreator

A comparison level using a Damerau-Levenshtein distance function

e.g. damerau_levenshtein(val_l, val_r) <= distance_threshold

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`distance_threshold`	`int`	The threshold to use to assess similarity	required

`DistanceFunctionLevel(col_name, distance_function_name, distance_threshold, higher_is_more_similar=True)`

Bases: ComparisonLevelCreator

A comparison level using an arbitrary distance function

e.g. custom_distance(val_l, val_r) >= (<=) distance_threshold

The function given by `distance_function_name` must exist in the SQL backend you use, and must take two parameters of the type in `col_name`, returning a numeric type.

Parameters:

Name	Type	Description	Default
`col_name`	`str \| ColumnExpression`	Input column name	required
`distance_function_name`	`str`	the name of the SQL distance function	required
`distance_threshold`	`Union[int, float]`	The threshold to use to assess similarity	required
`higher_is_more_similar`	`bool`	Are higher values of the distance function more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) Default is True	`True`
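
The effect of `higher_is_more_similar` on the direction of the comparison can be sketched in plain Python. Everything here is illustrative: `shared_prefix_length` and `length_gap` are hypothetical stand-ins for a backend function, not functions the library provides.

```python
from typing import Callable


def shared_prefix_length(a: str, b: str) -> int:
    """Hypothetical score where higher means more similar."""
    count = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        count += 1
    return count


def length_gap(a: str, b: str) -> int:
    """Hypothetical score where lower means more similar."""
    return abs(len(a) - len(b))


def distance_function_matches(
    left: str,
    right: str,
    distance_function: Callable[[str, str], float],
    distance_threshold: float,
    higher_is_more_similar: bool = True,
) -> bool:
    """Use >= for similarity-style scores and <= for distance-style scores."""
    score = distance_function(left, right)
    if higher_is_more_similar:
        return score >= distance_threshold
    return score <= distance_threshold


print(distance_function_matches("martha", "marhta", shared_prefix_length, 3))  # True
print(distance_function_matches("anne", "ann", length_gap, 0, higher_is_more_similar=False))  # False
```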

`DistanceInKMLevel(lat_col, long_col, km_threshold, not_null=False)`

Bases: ComparisonLevelCreator

Use the haversine formula to transform comparisons of lat,lngs into distances measured in kilometers

Parameters:

Name	Type	Description	Default
`lat_col`	`str`	The name of a latitude column, or the respective array or struct column containing the information, plus an index. For example: long_lat['lat'] or long_lat[0]	required
`long_col`	`str`	The name of a longitude column, or the respective array or struct column containing the information, plus an index. For example: long_lat['long'] or long_lat[1]	required
`km_threshold`	`int`	The total distance in kilometers to evaluate your comparisons against	required
`not_null`	`bool`	If true, ensure no attempt is made to compute this if any inputs are null. This is only necessary if you are not capturing nulls elsewhere in your comparison level.	`False`
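
For reference, the haversine formula itself can be written in a few lines of Python. This is a sketch rather than the SQL the library emits: the function names are hypothetical, the Earth radius of 6371 km is an assumed constant, and treating the threshold as inclusive is also an assumption.

```python
import math

# Assumption: mean Earth radius in kilometers; a backend may use another constant.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_in_km_matches(lat1, lon1, lat2, lon2, km_threshold: float) -> bool:
    """Assumed inclusive comparison of the distance against the threshold."""
    return haversine_km(lat1, lon1, lat2, lon2) <= km_threshold


# London to Paris, roughly 344 km apart.
london_paris = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(round(london_paris))
print(distance_in_km_matches(51.5074, -0.1278, 48.8566, 2.3522, km_threshold=100))
```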

`ElseLevel`

Bases: ComparisonLevelCreator

This level is used to capture all comparisons that do not match any other specified levels. It corresponds to the ELSE clause in a SQL CASE statement.

`ExactMatchLevel(col_name, term_frequency_adjustments=False)`

Bases: ComparisonLevelCreator

Represents a comparison level where there is an exact match

e.g. val_l = val_r

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`term_frequency_adjustments`	`bool`	If True, apply term frequency adjustments to the exact match level. Defaults to False.	`False`

`JaccardLevel(col_name, distance_threshold)`

Bases: ComparisonLevelCreator

A comparison level using a Jaccard distance function

e.g. jaccard(val_l, val_r) >= distance_threshold

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`distance_threshold`	`Union[int, float]`	The threshold to use to assess similarity	required

`JaroLevel(col_name, distance_threshold)`

Bases: ComparisonLevelCreator

A comparison level using a Jaro distance function

e.g. jaro(val_l, val_r) >= distance_threshold

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`distance_threshold`	`Union[int, float]`	The threshold to use to assess similarity	required

`JaroWinklerLevel(col_name, distance_threshold)`

Bases: ComparisonLevelCreator

A comparison level using a Jaro-Winkler distance function

e.g. jaro_winkler(val_l, val_r) >= distance_threshold

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`distance_threshold`	`Union[int, float]`	The threshold to use to assess similarity	required

`LevenshteinLevel(col_name, distance_threshold)`

Bases: ComparisonLevelCreator

A comparison level using a Levenshtein distance function

e.g. levenshtein(val_l, val_r) <= distance_threshold

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name	required
`distance_threshold`	`int`	The threshold to use to assess similarity	required
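
To make the threshold concrete, the sketch below computes Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions) in plain Python and applies the documented `<=` condition. The function names are hypothetical; in practice the SQL backend's own function is used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_matches(left: str, right: str, distance_threshold: int) -> bool:
    """Mirror of the documented condition: distance <= threshold."""
    return levenshtein(left, right) <= distance_threshold


print(levenshtein("kitten", "sitting"))          # 3
print(levenshtein_matches("smith", "smyth", 1))  # True
```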

`LiteralMatchLevel(col_name, literal_value, literal_datatype, side_of_comparison='both')`

Bases: ComparisonLevelCreator

Represents a comparison level where a column matches a literal value

e.g. val_l = 'literal' AND/OR val_r = 'literal'

Parameters:

Name	Type	Description	Default
`col_name`	`Union[str, ColumnExpression]`	Input column name or ColumnExpression	required
`literal_value`	`str`	The literal value to compare against e.g. 'male'	required
`literal_datatype`	`str`	The datatype of the literal value. Must be one of: "string", "int", "float", "date"	required
`side_of_comparison`	`str`	Which side(s) of the comparison to apply. Must be one of: "left", "right", "both". Defaults to "both".	`'both'`

`Not(comparison_level)`

Bases: ComparisonLevelCreator

Represents a comparison level that is the negation of another comparison level

The resulting ComparisonLevelCreator is equivalent to the passed ComparisonLevelCreator, but with its SQL condition negated with a logical "NOT".

Parameters:

Name	Type	Description	Default
`comparison_level`	`ComparisonLevelCreator \| dict`	This represents the comparison level you wish to negate with 'NOT'	required

`NullLevel(col_name, valid_string_pattern=None)`

Bases: ComparisonLevelCreator

Represents a comparison level where either or both values are NULL

e.g. val_l IS NULL OR val_r IS NULL

Parameters:

Name	Type	Description	Default
`col_name`	`Union[str, ColumnExpression]`	Input column name or ColumnExpression	required
`valid_string_pattern`	`str`	If provided, a regex pattern to extract a valid substring from the column before checking for NULL. Default is None.	`None`

Note

If a valid_string_pattern is provided, the NULL check will be performed on the extracted substring rather than the original column value.

`Or(*comparison_levels)`

Bases: _Merge

Represents a comparison level that is an 'OR' of other comparison levels

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical "OR".

Parameters:

Name	Type	Description	Default
`*comparison_levels`	`ComparisonLevelCreator \| dict`	These represent the comparison levels you wish to combine via 'OR'	`()`

`PairwiseStringDistanceFunctionLevel(col_name, distance_function_name, distance_threshold)`

Bases: ComparisonLevelCreator

A comparison level using the most similar string distance between any pair of values between arrays in an array column.

The function given by `distance_function_name` must be one of "levenshtein", "damerau_levenshtein", "jaro_winkler", or "jaro".

Parameters:

Name	Type	Description	Default
`col_name`	`str \| ColumnExpression`	Input column name	required
`distance_function_name`	`str`	the name of the string distance function	required
`distance_threshold`	`Union[int, float]`	The threshold to use to assess similarity	required
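
The idea of taking the most similar pair across two arrays can be sketched in plain Python. The sketch uses an edit distance, so "most similar" means the smallest distance; for a similarity score such as Jaro the largest value would presumably be taken instead. The function names are hypothetical and non-empty arrays are assumed.

```python
from itertools import product
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Compact edit distance, included only to keep the sketch self-contained."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def best_pairwise_distance(
    left: list[str], right: list[str], distance: Callable[[str, str], int]
) -> int:
    """Smallest distance over every (left value, right value) pair."""
    return min(distance(l_value, r_value) for l_value, r_value in product(left, right))


best = best_pairwise_distance(["jon", "jonathan"], ["john", "johnny"], levenshtein)
print(best)       # 1
print(best <= 1)  # True
```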

`PercentageDifferenceLevel(col_name, percentage_threshold)`

Bases: ComparisonLevelCreator

Represents a comparison level where the difference between two numerical values is within a specified percentage threshold.

The percentage difference is calculated as the absolute difference between the two values divided by the greater of the two values.

Parameters:

Name	Type	Description	Default
`col_name`	`str`	Input column name.	required
`percentage_threshold`	`float`	The threshold percentage to use to assess similarity e.g. 0.1 for 10%.	required
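
The percentage difference described above can be worked through in plain Python. The function name is hypothetical and the sketch assumes positive values; how a value exactly equal to the threshold is treated is not stated here, so the examples stay away from the boundary.

```python
def percentage_difference(a: float, b: float) -> float:
    """Absolute difference divided by the greater of the two values."""
    return abs(a - b) / max(a, b)


difference = percentage_difference(200, 150)
print(difference)         # 0.25
print(difference < 0.3)   # True: within a 30% threshold
print(difference < 0.1)   # False: outside a 10% threshold
```

Dividing by the greater value makes the measure independent of argument order: `percentage_difference(150, 200)` gives the same result.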

AbsoluteDateDifferenceLevel

An alias of AbsoluteTimeDifferenceLevel.

Configuring comparisons

Note that all comparison levels have a .configure() method as follows:

Configure the comparison level with options which are common to all comparison levels. The options align to the keys in the json specification of a comparison level. These options are usually not needed, but are available for advanced users.

All options have default values set initially. Any call to .configure() will set only the options that are supplied. Subsequent calls to .configure() will not reset previously supplied values to their defaults; to restore a default you must explicitly provide the value corresponding to that default.

Generally speaking only a single call (at most) to .configure() should be required.

Parameters:

Name	Type	Description	Default
`m_probability`	`float`	The m probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level.	`unsupplied_option`
`u_probability`	`float`	The u probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level.	`unsupplied_option`
`tf_adjustment_column`	`str`	Make term frequency adjustments for this comparison level using this input column. Default is equivalent to None, meaning that term-frequency adjustments will not be applied for this level.	`unsupplied_option`
`tf_adjustment_weight`	`float`	Make term frequency adjustments for this comparison level using this weight. Default is equivalent to None, meaning term-frequency adjustments are fully-weighted if turned on.	`unsupplied_option`
`tf_minimum_u_value`	`float`	When term frequency adjustments are turned on, where the term frequency adjustment implies a u value below this value, use this minimum value instead. Default is equivalent to None, meaning no minimum value.	`unsupplied_option`
`is_null_level`	`bool`	If true, m and u values will not be estimated and instead the match weight will be zero for this column. Default is equivalent to False.	`unsupplied_option`
`label_for_charts`	`str`	If provided, a custom label that will be used for this level in any charts. Default is equivalent to None, in which case a default label will be provided for this level.	`unsupplied_option`
`disable_tf_exact_match_detection`	`bool`	If true, if term frequency adjustments are set, the corresponding adjustment will be made using the u-value for this level, rather than the usual case where it is the u-value of the exact match level in the same comparison. Default is equivalent to False.	`unsupplied_option`
`fix_m_probability`	`bool`	If true, the m probability for this level will be fixed and not estimated during training. Default is equivalent to False.	`unsupplied_option`
`fix_u_probability`	`bool`	If true, the u probability for this level will be fixed and not estimated during training. Default is equivalent to False.	`unsupplied_option`
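
The way repeated .configure() calls interact with the `unsupplied_option` default can be illustrated with a sentinel object. This is a sketch of the general pattern under stated assumptions, not the library's code: `LevelSketch` and its two options are hypothetical, and only the behavior described above is modelled.

```python
class _Unsupplied:
    """Sentinel type that distinguishes 'not passed' from an explicit None."""

    def __repr__(self) -> str:
        return "unsupplied_option"


unsupplied_option = _Unsupplied()


class LevelSketch:
    """Hypothetical stand-in for a comparison level with two options."""

    def __init__(self) -> None:
        self.m_probability = None
        self.label_for_charts = None

    def configure(self, *, m_probability=unsupplied_option, label_for_charts=unsupplied_option):
        # Only options that were actually passed are updated.
        if m_probability is not unsupplied_option:
            self.m_probability = m_probability
        if label_for_charts is not unsupplied_option:
            self.label_for_charts = label_for_charts
        return self


level = LevelSketch().configure(m_probability=0.9)
level.configure(label_for_charts="Exact match")
print(level.m_probability)   # 0.9, untouched by the second call

level.configure(m_probability=None)
print(level.m_probability)   # None, because the default was passed explicitly
```

Using a dedicated sentinel rather than None as the default is what lets an explicit None be told apart from an omitted argument.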
