Documentation for the comparison_level_library

AbsoluteDifferenceLevel(col_name, difference_threshold)

Bases: ComparisonLevelCreator

Represents a comparison level where the absolute difference between two numerical values is within a specified threshold.

Parameters:

Name Type Description Default
col_name str | ColumnExpression

Input column name or ColumnExpression.

required
difference_threshold int | float

The maximum allowed absolute difference between the two values.

required

AbsoluteTimeDifferenceLevel(col_name, *, input_is_string, threshold, metric, datetime_format=None)

Bases: ComparisonLevelCreator

Computes the absolute elapsed time between two dates (total duration).

This function computes the amount of time that has passed between two dates, in contrast to functions like date_diff found in some SQL backends, which count the number of full calendar intervals (e.g., months, years) crossed.

For instance, the difference between January 29th and March 2nd would be less than two months in terms of elapsed time, unlike a date_diff calculation that would give an answer of 2 calendar intervals crossed.

Note that the threshold is inclusive, e.g. a level with a 10 day threshold will include a difference in date of exactly 10 days.

Parameters:

Name Type Description Default
col_name str

The name of the input column containing the dates to compare

required
input_is_string bool

Indicates if the input date/times are in string format, requiring parsing according to datetime_format.

required
threshold int

The maximum allowed difference between the two dates, in units specified by metric.

required
metric str

The unit of time to use when comparing the dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.

required
datetime_format str

The format string for parsing dates. ISO 8601 format used if not provided.

None
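
The difference between elapsed time and calendar intervals crossed can be illustrated with a short standalone Python sketch. It does not call Splink; the helper names and the average month length of 365.25 / 12 days are assumptions chosen for illustration, and the SQL that a given backend actually runs may differ.

```python
from datetime import date

SECONDS_PER_DAY = 86_400
# Assumption for illustration: one "month" is an average month of 365.25 / 12 days.
SECONDS_PER_MONTH = 365.25 / 12 * SECONDS_PER_DAY


def elapsed_seconds(d1: date, d2: date) -> float:
    """Absolute elapsed time between two dates, in seconds."""
    return abs((d2 - d1).total_seconds())


def calendar_months_crossed(d1: date, d2: date) -> int:
    """Number of month boundaries crossed, in the style of a SQL date_diff."""
    return abs((d2.year - d1.year) * 12 + (d2.month - d1.month))


def within_threshold(d1: date, d2: date, threshold: int, unit_seconds: float) -> bool:
    """Inclusive check: a difference equal to the threshold still qualifies."""
    return elapsed_seconds(d1, d2) <= threshold * unit_seconds


start, end = date(2023, 1, 29), date(2023, 3, 2)
elapsed_months = elapsed_seconds(start, end) / SECONDS_PER_MONTH

print(calendar_months_crossed(start, end))  # 2 month boundaries crossed
print(elapsed_months < 2)                   # True: under two months elapsed
print(within_threshold(date(2023, 1, 1), date(2023, 1, 11), 10, SECONDS_PER_DAY))  # True
```

The last line mirrors the inclusive threshold: dates exactly 10 days apart satisfy a 10 day level.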

And(*comparison_levels)

Bases: _Merge

Represents a comparison level that is an 'AND' of other comparison levels

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical "AND".

Parameters:

Name Type Description Default
*comparison_levels ComparisonLevelCreator | dict

These represent the comparison levels you wish to combine via 'AND'

()

ArrayIntersectLevel(col_name, min_intersection)

Bases: ComparisonLevelCreator

Represents a comparison level based around the size of an intersection of arrays

Parameters:

Name Type Description Default
col_name str

Input column name

required
min_intersection int

The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1

required

ArraySubsetLevel(col_name, empty_is_subset=False)

Bases: ComparisonLevelCreator

Represents a comparison level where the smaller array is an exact subset of the larger array. If arrays are equal length, they must have the same elements

The order of items in the arrays does not matter for this comparison.

Parameters:

Name Type Description Default
col_name str | ColumnExpression

Input column name or ColumnExpression

required
empty_is_subset bool

If True, an empty array is considered a subset of any array (including another empty array). Default is False.

False
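
As a rough illustration of the subset rule described above, the following standalone Python sketch mimics the semantics on plain lists. The function name is hypothetical, duplicate elements are ignored for simplicity, and the SQL generated for a real backend may handle edge cases differently.

```python
def is_array_subset(left: list, right: list, empty_is_subset: bool = False) -> bool:
    """Check whether the shorter list is contained in the longer one, ignoring order."""
    smaller, larger = sorted((left, right), key=len)
    if not smaller:
        # An empty array only counts as a subset when explicitly allowed.
        return empty_is_subset
    return set(smaller) <= set(larger)


print(is_array_subset(["b", "a"], ["a", "b", "c"]))      # True: order does not matter
print(is_array_subset(["a", "d"], ["a", "b", "c"]))      # False: "d" is missing
print(is_array_subset([], ["a"]))                        # False by default
print(is_array_subset([], ["a"], empty_is_subset=True))  # True when allowed
```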

ColumnsReversedLevel(col_name_1, col_name_2, symmetrical=False)

Bases: ComparisonLevelCreator

Represents a comparison level where the columns are reversed. For example, if surname is in the forename field and vice versa

By default, the condition is col_name_1_l = col_name_2_r. If the symmetrical argument is True, the condition is col_name_1_l = col_name_2_r AND col_name_1_r = col_name_2_l.

Parameters:

Name Type Description Default
col_name_1 str

First column, e.g. forename

required
col_name_2 str

Second column, e.g. surname

required
symmetrical bool

If True, equality is required in both directions. Default is False.

False

CosineSimilarityLevel(col_name, similarity_threshold)

Bases: ComparisonLevelCreator

A comparison level using a cosine similarity function

e.g. array_cosine_similarity(val_l, val_r) >= similarity_threshold

Parameters:

Name Type Description Default
col_name str

Input column name

required
similarity_threshold float

The threshold to use to assess similarity. Should be between 0 and 1.

required
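
For readers unfamiliar with the measure, here is a minimal pure-Python sketch of cosine similarity between two numeric arrays and the threshold check that the level expresses. The function names are hypothetical; in practice the calculation is done by the SQL backend's own function.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def meets_threshold(a: list, b: list, similarity_threshold: float) -> bool:
    """Mirror of the condition: similarity >= similarity_threshold."""
    return cosine_similarity(a, b) >= similarity_threshold


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))     # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))     # orthogonal vectors
print(meets_threshold([1.0, 1.0], [1.0, 0.0], 0.9))  # False: similarity is about 0.71
```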

CustomLevel(sql_condition, label_for_charts=None, base_dialect_str=None)

Bases: ComparisonLevelCreator

Represents a comparison level with a custom sql expression

Must be in a form suitable for use in a SQL CASE WHEN expression e.g. "substr(name_l, 1, 1) = substr(name_r, 1, 1)"

Parameters:

Name Type Description Default
sql_condition str

SQL condition to assess similarity

required
label_for_charts str

A label for this level to be used in charts. Default None, so that sql_condition is used

None
base_dialect_str str

If specified, the SQL dialect that this expression will be parsed as when attempting to translate to other backends

None

DamerauLevenshteinLevel(col_name, distance_threshold)

Bases: ComparisonLevelCreator

A comparison level using a Damerau-Levenshtein distance function

e.g. damerau_levenshtein(val_l, val_r) <= distance_threshold

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold int

The threshold to use to assess similarity

required

DistanceFunctionLevel(col_name, distance_function_name, distance_threshold, higher_is_more_similar=True)

Bases: ComparisonLevelCreator

A comparison level using an arbitrary distance function

e.g. custom_distance(val_l, val_r) >= (<=) distance_threshold

The function given by distance_function_name must exist in the SQL backend you use, and must take two parameters of the type in col_name, returning a numeric type

Parameters:

Name Type Description Default
col_name str | ColumnExpression

Input column name

required
distance_function_name str

the name of the SQL distance function

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
higher_is_more_similar bool

Are higher values of the distance function more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) Default is True

True
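
The effect of higher_is_more_similar on the comparison operator can be sketched as a small string builder. This is an illustration only: the function name and the _l / _r column suffixes are assumptions, and the SQL Splink actually emits may be formatted differently.

```python
def distance_condition(
    col_name: str,
    distance_function_name: str,
    distance_threshold: float,
    higher_is_more_similar: bool = True,
) -> str:
    """Build a CASE WHEN style condition for an arbitrary distance function."""
    # Similarity scores must reach the threshold; true distances must stay under it.
    operator = ">=" if higher_is_more_similar else "<="
    return (
        f"{distance_function_name}({col_name}_l, {col_name}_r) "
        f"{operator} {distance_threshold}"
    )


print(distance_condition("name", "jaro_winkler", 0.9))
print(distance_condition("name", "levenshtein", 2, higher_is_more_similar=False))
```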

DistanceInKMLevel(lat_col, long_col, km_threshold, not_null=False)

Bases: ComparisonLevelCreator

Use the haversine formula to transform comparisons of lat,lngs into distances measured in kilometers

Parameters:

Name Type Description Default
lat_col str

The name of a latitude column, or the respective array or struct column containing the information, plus an index. For example: long_lat['lat'] or long_lat[0]

required
long_col str

The name of a longitude column, or the respective array or struct column containing the information, plus an index. For example: long_lat['long'] or long_lat[1]

required
km_threshold int

The total distance in kilometers to evaluate your comparisons against

required
not_null bool

If true, ensure no attempt is made to compute this if any inputs are null. This is only necessary if you are not capturing nulls elsewhere in your comparison level.

False
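
The haversine formula itself can be sketched in a few lines of standalone Python. The function names and the mean Earth radius of 6371 km are assumptions for illustration; the constant used by a given backend is not stated here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # Assumption: mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(lat1, lon1, lat2, lon2, km_threshold: float) -> bool:
    """Mirror of the level's condition: distance within km_threshold."""
    return haversine_km(lat1, lon1, lat2, lon2) <= km_threshold


# Approximate coordinates for central London and central Paris.
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(round(haversine_km(*london, *paris)))
print(within_km(*london, *paris, km_threshold=100))  # False: the cities are further apart
```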

ElseLevel

Bases: ComparisonLevelCreator

This level is used to capture all comparisons that do not match any other specified levels. It corresponds to the ELSE clause in a SQL CASE statement.

ExactMatchLevel(col_name, term_frequency_adjustments=False)

Bases: ComparisonLevelCreator

Represents a comparison level where there is an exact match

e.g. val_l = val_r

Parameters:

Name Type Description Default
col_name str

Input column name

required
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False

JaccardLevel(col_name, distance_threshold)

Bases: ComparisonLevelCreator

A comparison level using a Jaccard distance function

e.g. jaccard(val_l, val_r) >= distance_threshold

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required

JaroLevel(col_name, distance_threshold)

Bases: ComparisonLevelCreator

A comparison level using a Jaro distance function

e.g. jaro(val_l, val_r) >= distance_threshold

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required

JaroWinklerLevel(col_name, distance_threshold)

Bases: ComparisonLevelCreator

A comparison level using a Jaro-Winkler distance function

e.g. jaro_winkler(val_l, val_r) >= distance_threshold

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required

LevenshteinLevel(col_name, distance_threshold)

Bases: ComparisonLevelCreator

A comparison level using a Levenshtein distance function

e.g. levenshtein(val_l, val_r) <= distance_threshold

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold int

The threshold to use to assess similarity

required
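
To make the meaning of distance_threshold concrete, the sketch below computes the Levenshtein edit distance in pure Python using the standard dynamic-programming recurrence. It is not the backend's implementation, only a reference for what the number represents.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("smith", "smyth"))          # 1
print(levenshtein("kitten", "sitting"))       # 3
print(levenshtein("kitten", "sitting") <= 2)  # False: would not satisfy a threshold of 2
```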

LiteralMatchLevel(col_name, literal_value, literal_datatype, side_of_comparison='both')

Bases: ComparisonLevelCreator

Represents a comparison level where a column matches a literal value

e.g. val_l = 'literal' AND/OR val_r = 'literal'

Parameters:

Name Type Description Default
col_name Union[str, ColumnExpression]

Input column name or ColumnExpression

required
literal_value str

The literal value to compare against e.g. 'male'

required
literal_datatype str

The datatype of the literal value. Must be one of: "string", "int", "float", "date"

required
side_of_comparison str

Which side(s) of the comparison to apply. Must be one of: "left", "right", "both". Defaults to "both".

'both'

Not(comparison_level)

Bases: ComparisonLevelCreator

Represents a comparison level that is the negation of another comparison level

The resulting ComparisonLevelCreator is equivalent to the passed ComparisonLevelCreator, but with its SQL condition negated using a logical "NOT".

Parameters:

Name Type Description Default
comparison_level ComparisonLevelCreator | dict

This represents the comparison level you wish to negate with 'NOT'

required

NullLevel(col_name, valid_string_pattern=None)

Bases: ComparisonLevelCreator

Represents a comparison level where either or both values are NULL

e.g. val_l IS NULL OR val_r IS NULL

Parameters:

Name Type Description Default
col_name Union[str, ColumnExpression]

Input column name or ColumnExpression

required
valid_string_pattern str

If provided, a regex pattern to extract a valid substring from the column before checking for NULL. Default is None.

None
Note

If a valid_string_pattern is provided, the NULL check will be performed on the extracted substring rather than the original column value.
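
The note above can be illustrated with a standalone Python sketch in which None stands in for SQL NULL. The function names are hypothetical, and the sketch assumes that a value which does not match the pattern is treated as NULL; the exact behaviour depends on the backend's regex extraction function.

```python
import re
from typing import Optional


def extract_valid(value: Optional[str], valid_string_pattern: Optional[str] = None) -> Optional[str]:
    """Return the substring matching the pattern, or None if there is no match."""
    if value is None or valid_string_pattern is None:
        return value
    match = re.search(valid_string_pattern, value)
    # Assumption: a failed extraction is treated like NULL.
    return match.group(0) if match else None


def is_null_comparison(val_l, val_r, valid_string_pattern=None) -> bool:
    """Mirror of: val_l IS NULL OR val_r IS NULL, applied after extraction."""
    return (
        extract_valid(val_l, valid_string_pattern) is None
        or extract_valid(val_r, valid_string_pattern) is None
    )


pattern = r"[A-Z]{2}\d{2}"  # hypothetical pattern for a valid code
print(is_null_comparison("AB12", None))             # True: right side is NULL
print(is_null_comparison("AB12", "????"))           # False without a pattern
print(is_null_comparison("AB12", "????", pattern))  # True: "????" has no valid substring
```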

Or(*comparison_levels)

Bases: _Merge

Represents a comparison level that is an 'OR' of other comparison levels

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical "OR".

Parameters:

Name Type Description Default
*comparison_levels ComparisonLevelCreator | dict

These represent the comparison levels you wish to combine via 'OR'

()
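
Conceptually, And, Or and Not combine the SQL conditions of the levels they wrap. The following sketch shows that idea on plain strings; the helper names are hypothetical and the parenthesisation Splink actually produces may differ.

```python
def merge_and(*conditions: str) -> str:
    """Join SQL conditions with a logical AND, bracketing each one."""
    return " AND ".join(f"({c})" for c in conditions)


def merge_or(*conditions: str) -> str:
    """Join SQL conditions with a logical OR, bracketing each one."""
    return " OR ".join(f"({c})" for c in conditions)


def negate(condition: str) -> str:
    """Negate a SQL condition with a logical NOT."""
    return f"NOT ({condition})"


exact_first = "first_name_l = first_name_r"
exact_last = "surname_l = surname_r"

print(merge_and(exact_first, exact_last))
print(merge_or(exact_first, exact_last))
print(negate(merge_or(exact_first, exact_last)))
```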

PairwiseStringDistanceFunctionLevel(col_name, distance_function_name, distance_threshold)

Bases: ComparisonLevelCreator

A comparison level using the most similar string distance between any pair of values between arrays in an array column.

The function given by distance_function_name must be one of "levenshtein", "damerau_levenshtein", "jaro_winkler", or "jaro".

Parameters:

Name Type Description Default
col_name str | ColumnExpression

Input column name

required
distance_function_name str

the name of the string distance function

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required

PercentageDifferenceLevel(col_name, percentage_threshold)

Bases: ComparisonLevelCreator

Represents a comparison level where the difference between two numerical values is within a specified percentage threshold.

The percentage difference is calculated as the absolute difference between the two values divided by the greater of the two values.

Parameters:

Name Type Description Default
col_name str

Input column name.

required
percentage_threshold float

The threshold percentage to use to assess similarity e.g. 0.1 for 10%.

required
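
The calculation described above can be written out as a small Python sketch. The function name is hypothetical, positive inputs are assumed, and whether a difference exactly equal to the threshold qualifies is not stated in this reference, so the examples avoid the boundary.

```python
def percentage_difference(a: float, b: float) -> float:
    """Absolute difference divided by the greater of the two values."""
    return abs(a - b) / max(a, b)


percentage_threshold = 0.1  # i.e. 10%

print(percentage_difference(100, 95))                         # 0.05
print(percentage_difference(100, 95) < percentage_threshold)  # True: within 10%
print(percentage_difference(100, 80) < percentage_threshold)  # False: 20% apart
```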

AbsoluteDateDifferenceLevel

An alias of AbsoluteTimeDifferenceLevel.

Configuring comparisons

Note that all comparison levels have a .configure() method as follows:

Configure the comparison level with options which are common to all comparison levels. The options align to the keys in the json specification of a comparison level. These options are usually not needed, but are available for advanced users.

All options have default values set initially. Any call to .configure() will set any options that are supplied. Subsequent calls to .configure() will not reset previously supplied values to their defaults; to restore a default you must explicitly provide the value corresponding to it.

Generally speaking only a single call (at most) to .configure() should be required.

Parameters:

Name Type Description Default
m_probability float

The m probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level.

unsupplied_option
u_probability float

The u probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level.

unsupplied_option
tf_adjustment_column str

Make term frequency adjustments for this comparison level using this input column. Default is equivalent to None, meaning that term-frequency adjustments will not be applied for this level.

unsupplied_option
tf_adjustment_weight float

Make term frequency adjustments for this comparison level using this weight. Default is equivalent to None, meaning term-frequency adjustments are fully-weighted if turned on.

unsupplied_option
tf_minimum_u_value float

When term frequency adjustments are turned on, where the term frequency adjustment implies a u value below this value, use this minimum value instead. Default is equivalent to None, meaning no minimum value.

unsupplied_option
is_null_level bool

If true, m and u values will not be estimated and instead the match weight will be zero for this column. Default is equivalent to False.

unsupplied_option
label_for_charts str

If provided, a custom label that will be used for this level in any charts. Default is equivalent to None, in which case a default label will be provided for this level.

unsupplied_option
disable_tf_exact_match_detection bool

If true, if term frequency adjustments are set, the corresponding adjustment will be made using the u-value for this level, rather than the usual case where it is the u-value of the exact match level in the same comparison. Default is equivalent to False.

unsupplied_option
fix_m_probability bool

If true, the m probability for this level will be fixed and not estimated during training. Default is equivalent to False.

unsupplied_option
fix_u_probability bool

If true, the u probability for this level will be fixed and not estimated during training. Default is equivalent to False.

unsupplied_option
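
The "unsupplied_option" default and the behaviour of repeated .configure() calls can be pictured with a sentinel-value sketch. The class below is hypothetical and covers only three of the options; it illustrates the pattern and is not Splink's implementation.

```python
class _Unsupplied:
    """Sentinel type that distinguishes 'not passed' from an explicit None."""

    def __repr__(self) -> str:
        return "unsupplied_option"


unsupplied_option = _Unsupplied()


class LevelSketch:
    """Hypothetical stand-in for a comparison level with configurable options."""

    def __init__(self) -> None:
        self.options = {
            "m_probability": None,
            "label_for_charts": None,
            "is_null_level": False,
        }

    def configure(
        self,
        *,
        m_probability=unsupplied_option,
        label_for_charts=unsupplied_option,
        is_null_level=unsupplied_option,
    ) -> "LevelSketch":
        supplied = {
            "m_probability": m_probability,
            "label_for_charts": label_for_charts,
            "is_null_level": is_null_level,
        }
        for name, value in supplied.items():
            # Only options that were actually passed overwrite the stored value.
            if value is not unsupplied_option:
                self.options[name] = value
        return self


level = LevelSketch().configure(m_probability=0.9)
level.configure(label_for_charts="Exact match")  # m_probability is left untouched
print(level.options["m_probability"])            # 0.9

level.configure(m_probability=None)              # explicit reset to the default
print(level.options["m_probability"])            # None
```

Because None is itself a meaningful value for several options, a separate sentinel is what lets a later call leave earlier settings alone unless a value is passed explicitly.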