Documentation for the comparison_level_library
AbsoluteDifferenceLevel(col_name, difference_threshold)
Bases: ComparisonLevelCreator
Represents a comparison level where the absolute difference between two numerical values is within a specified threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str \| ColumnExpression | Input column name or ColumnExpression. | required |
difference_threshold | int \| float | The maximum allowed absolute difference between the two values. | required |
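As a rough illustration of the condition this level represents, the plain-Python sketch below applies the check to a single pair of values. The helper name `within_absolute_difference` is hypothetical and not part of the library, and treating the threshold as inclusive is an assumption based on the phrase "maximum allowed".

```python
def within_absolute_difference(value_l, value_r, difference_threshold):
    """Return True when |value_l - value_r| is at most the threshold.

    Hypothetical helper for illustration only; the inclusive boundary
    is an assumption, not confirmed library behaviour.
    """
    return abs(value_l - value_r) <= difference_threshold
```

For example, `within_absolute_difference(10, 13, 5)` holds, while a pair such as `(10, 16.5)` falls outside a threshold of 5.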
AbsoluteTimeDifferenceLevel(col_name, *, input_is_string, threshold, metric, datetime_format=None)
Bases: ComparisonLevelCreator
Computes the absolute elapsed time between two dates (total duration).
This function computes the amount of time that has passed between two dates, in contrast to functions like date_diff found in some SQL backends, which count the number of full calendar intervals (e.g., months, years) crossed.
For instance, the difference between January 29th and March 2nd would be less than two months in terms of elapsed time, unlike a date_diff calculation that would give an answer of 2 calendar intervals crossed.
Note that the threshold is inclusive, e.g. a level with a 10 day threshold will include a difference in date of exactly 10 days.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the input column containing the dates to compare | required |
input_is_string | bool | Indicates if the input date/times are in string format, requiring parsing according to | required |
threshold | int | The maximum allowed difference between the two dates, in units specified by | required |
metric | str | The unit of time to use when comparing the dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'. | required |
datetime_format | str | The format string for parsing dates. ISO 8601 format used if not provided. | None |
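The distinction between elapsed time and calendar intervals crossed can be sketched in plain Python. Everything below is illustrative: the helper names are hypothetical, and the fixed unit lengths (in particular an average month of 365.25 / 12 days) are assumptions rather than the library's documented conversion.

```python
from datetime import datetime

# Assumption: fixed-length units, using an average month and year length.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3_600,
    "day": 86_400,
    "month": 86_400 * 365.25 / 12,
    "year": 86_400 * 365.25,
}


def elapsed_time_within(dt_l, dt_r, threshold, metric):
    """Inclusive check on the absolute elapsed time between two datetimes."""
    elapsed_seconds = abs((dt_r - dt_l).total_seconds())
    return elapsed_seconds <= threshold * SECONDS_PER_UNIT[metric]


def calendar_months_crossed(dt_l, dt_r):
    """Count month boundaries crossed, as a date_diff-style function would."""
    return abs((dt_r.year - dt_l.year) * 12 + (dt_r.month - dt_l.month))


jan_29 = datetime(2023, 1, 29)
mar_2 = datetime(2023, 3, 2)
```

With these two dates, `calendar_months_crossed` counts two month boundaries, whereas the elapsed time is 32 days, which is within a two-month threshold but not within a one-month threshold under the assumed month length.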
And(*comparison_levels)
Bases: _Merge
Represents a comparison level that is an 'AND' of other comparison levels
Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical "AND".
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*comparison_levels | ComparisonLevelCreator \| dict | These represent the comparison levels you wish to combine via 'AND' | () |
ArrayIntersectLevel(col_name, min_intersection)
Bases: ComparisonLevelCreator
Represents a comparison level based around the size of an intersection of arrays
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
min_intersection | int | The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1 | required |
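The idea of "cardinality of the intersection" can be sketched in plain Python using sets. The helper names are hypothetical, and counting distinct shared values (ignoring duplicates) is an assumption made for this illustration.

```python
def array_intersection_size(array_l, array_r):
    """Number of distinct values that appear in both arrays."""
    return len(set(array_l) & set(array_r))


def meets_min_intersection(array_l, array_r, min_intersection=1):
    """True when the arrays share at least min_intersection distinct values."""
    return array_intersection_size(array_l, array_r) >= min_intersection
```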
ArraySubsetLevel(col_name, empty_is_subset=False)
Bases: ComparisonLevelCreator
Represents a comparison level where the smaller array is an exact subset of the larger array. If the arrays are of equal length, they must contain the same elements.
The order of items in the arrays does not matter for this comparison.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str \| ColumnExpression | Input column name or ColumnExpression | required |
empty_is_subset | bool | If True, an empty array is considered a subset of any array (including another empty array). Default is False. | False |
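A minimal plain-Python sketch of the subset rule described above follows. The helper name `is_array_subset` is hypothetical, and the sketch assumes arrays without duplicate values so that set comparison is adequate.

```python
def is_array_subset(array_l, array_r, empty_is_subset=False):
    """True when the shorter array's values all appear in the longer one.

    Assumes no duplicate values within an array. Order is ignored.
    """
    smaller, larger = sorted((array_l, array_r), key=len)
    if not smaller:
        # An empty array only counts as a subset when explicitly allowed.
        return empty_is_subset
    return set(smaller) <= set(larger)
```

Because both arrays are compared as sets, `["a", "b"]` and `["b", "a"]` match, while `[1, 4]` is not a subset of `[1, 2, 3]`.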
ColumnsReversedLevel(col_name_1, col_name_2, symmetrical=False)
Bases: ComparisonLevelCreator
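Represents a comparison level where the columns are reversed, for example if the surname is in the forename field and vice versa.
By default, the condition checks one direction only: col_name_1 on the left record equals col_name_2 on the right record. If the symmetrical argument is True, the reverse must also hold, i.e. col_name_1 on the right record must also equal col_name_2 on the left record.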
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name_1 | str | First column, e.g. forename | required |
col_name_2 | str | Second column, e.g. surname | required |
symmetrical | bool | If True, equality is required in both directions. Default is False. | False |
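One way to read the two modes is sketched below in plain Python, building the condition as a string. The `_l`/`_r` suffix convention for left and right record values and the exact form of the condition are assumptions for illustration, and `columns_reversed_condition` is a hypothetical helper, not a library function.

```python
def columns_reversed_condition(col_name_1, col_name_2, symmetrical=False):
    """Build a SQL-like condition string for a reversed-columns check.

    Assumes left/right record values are exposed as <col>_l and <col>_r.
    """
    condition = f"{col_name_1}_l = {col_name_2}_r"
    if symmetrical:
        # Also require the swap to hold in the opposite direction.
        condition += f" AND {col_name_1}_r = {col_name_2}_l"
    return condition
```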
CosineSimilarityLevel(col_name, similarity_threshold)
Bases: ComparisonLevelCreator
A comparison level using a cosine similarity function
e.g. array_cosine_similarity(val_l, val_r) >= similarity_threshold
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
similarity_threshold | float | The threshold to use to assess similarity. Should be between 0 and 1. | required |
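For readers unfamiliar with the measure, here is a small plain-Python sketch of cosine similarity and the threshold check. The helper names are hypothetical, and the sketch assumes equal-length, non-zero numeric vectors; it is not the backend's `array_cosine_similarity` implementation.

```python
import math


def cosine_similarity(vec_l, vec_r):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(vec_l, vec_r))
    norm_l = math.sqrt(sum(a * a for a in vec_l))
    norm_r = math.sqrt(sum(b * b for b in vec_r))
    return dot / (norm_l * norm_r)


def meets_cosine_threshold(vec_l, vec_r, similarity_threshold):
    """True when the similarity is at least the threshold."""
    return cosine_similarity(vec_l, vec_r) >= similarity_threshold
```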
CustomLevel(sql_condition, label_for_charts=None, base_dialect_str=None)
Bases: ComparisonLevelCreator
Represents a comparison level with a custom sql expression
Must be in a form suitable for use in a SQL CASE WHEN expression e.g. "substr(name_l, 1, 1) = substr(name_r, 1, 1)"
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql_condition | str | SQL condition to assess similarity | required |
label_for_charts | str | A label for this level to be used in charts. Default None, so that | None |
base_dialect_str | str | If specified, the SQL dialect that this expression will be parsed as when attempting to translate to other backends | None |
DamerauLevenshteinLevel(col_name, distance_threshold)
Bases: ComparisonLevelCreator
A comparison level using a Damerau-Levenshtein distance function
e.g. damerau_levenshtein(val_l, val_r) <= distance_threshold
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | int | The threshold to use to assess similarity | required |
DistanceFunctionLevel(col_name, distance_function_name, distance_threshold, higher_is_more_similar=True)
Bases: ComparisonLevelCreator
A comparison level using an arbitrary distance function
e.g. custom_distance(val_l, val_r) >= (<=) distance_threshold
The function given by distance_function_name must exist in the SQL backend you use, and must take two parameters of the type in col_name, returning a numeric type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str \| ColumnExpression | Input column name | required |
distance_function_name | str | the name of the SQL distance function | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
higher_is_more_similar | bool | Are higher values of the distance function more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) Default is True | True |
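The role of `higher_is_more_similar` in choosing the comparison operator can be sketched as a string builder in plain Python. The helper name and the `_l`/`_r` suffix convention are assumptions for illustration, not the library's internals.

```python
def distance_function_condition(
    col_name, distance_function_name, distance_threshold, higher_is_more_similar=True
):
    """Build a SQL-like condition, picking >= or <= from the flag.

    Assumes left/right record values are exposed as <col>_l and <col>_r.
    """
    operator = ">=" if higher_is_more_similar else "<="
    return (
        f"{distance_function_name}({col_name}_l, {col_name}_r) "
        f"{operator} {distance_threshold}"
    )
```

A similarity score such as Jaro-Winkler keeps the default and yields a `>=` condition, while a true distance such as Levenshtein would pass `higher_is_more_similar=False` to get `<=`.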
DistanceInKMLevel(lat_col, long_col, km_threshold, not_null=False)
Bases: ComparisonLevelCreator
Use the haversine formula to transform comparisons of lat,lngs into distances measured in kilometers
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lat_col | str | The name of a latitude column or the respective array or struct column containing the information. For example: long_lat['lat'] or long_lat[0] | required |
long_col | str | The name of a longitude column or the respective array or struct column containing the information, plus an index. For example: long_lat['long'] or long_lat[1] | required |
km_threshold | int | The total distance in kilometers to evaluate your comparisons against | required |
not_null | bool | If true, ensure no attempt is made to compute this if any inputs are null. This is only necessary if you are not capturing nulls elsewhere in your comparison level. | False |
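For reference, the haversine formula itself can be written in a few lines of plain Python. This is a sketch, not the SQL the library generates: the helper names are hypothetical, and the mean Earth radius of 6371 km and the inclusive threshold are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat_l, long_l, lat_r, long_r):
    """Great-circle distance in kilometers between two lat/long points (degrees)."""
    phi_l, phi_r = math.radians(lat_l), math.radians(lat_r)
    d_phi = phi_r - phi_l
    d_lambda = math.radians(long_r - long_l)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi_l) * math.cos(phi_r) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(lat_l, long_l, lat_r, long_r, km_threshold):
    """True when the two points are no further apart than km_threshold."""
    return haversine_km(lat_l, long_l, lat_r, long_r) <= km_threshold
```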
ElseLevel
Bases: ComparisonLevelCreator
This level is used to capture all comparisons that do not match any other specified levels. It corresponds to the ELSE clause in a SQL CASE statement.
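To illustrate how an ELSE branch catches everything the earlier levels miss, the plain-Python sketch below assembles a CASE expression from an ordered list of SQL conditions. The helper name and the numbering of the branches are assumptions made for this example, not the library's output format.

```python
def build_case_expression(level_conditions):
    """Assemble a CASE expression from ordered SQL conditions.

    Branches are numbered from highest to lowest and the ELSE branch
    gets 0; this numbering scheme is an assumption for illustration.
    """
    lines = ["CASE"]
    for offset, condition in enumerate(level_conditions):
        level_value = len(level_conditions) - offset
        lines.append(f"  WHEN {condition} THEN {level_value}")
    lines.append("  ELSE 0")
    lines.append("END")
    return "\n".join(lines)
```

Since a CASE expression evaluates its WHEN branches in order, a record pair only reaches the ELSE branch when none of the preceding conditions is true.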
ExactMatchLevel(col_name, term_frequency_adjustments=False)
Bases: ComparisonLevelCreator
Represents a comparison level where there is an exact match
e.g. val_l = val_r
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. Defaults to False. | False |
JaccardLevel(col_name, distance_threshold)
Bases: ComparisonLevelCreator
A comparison level using a Jaccard distance function
e.g. jaccard(val_l, val_r) >= distance_threshold
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
JaroLevel(col_name, distance_threshold)
Bases: ComparisonLevelCreator
A comparison level using a Jaro distance function
e.g. jaro(val_l, val_r) >= distance_threshold
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
JaroWinklerLevel(col_name, distance_threshold)
Bases: ComparisonLevelCreator
A comparison level using a Jaro-Winkler distance function
e.g. jaro_winkler(val_l, val_r) >= distance_threshold
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
LevenshteinLevel(col_name, distance_threshold)
Bases: ComparisonLevelCreator
A comparison level using a Levenshtein distance function
e.g. levenshtein(val_l, val_r) <= distance_threshold
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | int | The threshold to use to assess similarity | required |
LiteralMatchLevel(col_name, literal_value, literal_datatype, side_of_comparison='both')
Bases: ComparisonLevelCreator
Represents a comparison level where a column matches a literal value
e.g. val_l = 'literal' AND/OR val_r = 'literal'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | Union[str, ColumnExpression] | Input column name or ColumnExpression | required |
literal_value | str | The literal value to compare against e.g. 'male' | required |
literal_datatype | str | The datatype of the literal value. Must be one of: "string", "int", "float", "date" | required |
side_of_comparison | str | Which side(s) of the comparison to apply. Must be one of: "left", "right", "both". Defaults to "both". | 'both' |
Not(comparison_level)
Bases: ComparisonLevelCreator
Represents a comparison level that is the negation of another comparison level
The resulting ComparisonLevelCreator is equivalent to the passed ComparisonLevelCreator but with its SQL condition negated with a logical "NOT".
Parameters:
Name | Type | Description | Default |
---|---|---|---|
comparison_level | ComparisonLevelCreator \| dict | This represents the comparison level you wish to negate with 'NOT' | required |
NullLevel(col_name, valid_string_pattern=None)
Bases: ComparisonLevelCreator
Represents a comparison level where either or both values are NULL
e.g. val_l IS NULL OR val_r IS NULL
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | Union[str, ColumnExpression] | Input column name or ColumnExpression | required |
valid_string_pattern | str | If provided, a regex pattern to extract a valid substring from the column before checking for NULL. Default is None. | None |
Note
If a valid_string_pattern is provided, the NULL check will be performed on the extracted substring rather than the original column value.
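The effect of `valid_string_pattern` can be sketched in plain Python with the `re` module. The helper names are hypothetical, and treating a value with no regex match as NULL (`None`) is an assumption about the intended behaviour rather than a statement of what each SQL backend does.

```python
import re


def extract_valid_substring(value, valid_string_pattern=None):
    """Return the first regex match, the raw value if no pattern, else None."""
    if value is None:
        return None
    if valid_string_pattern is None:
        return value
    match = re.search(valid_string_pattern, value)
    # Assumption: a value with no valid substring is treated as NULL.
    return match.group(0) if match else None


def is_null_comparison(value_l, value_r, valid_string_pattern=None):
    """True when either side is NULL after optional substring extraction."""
    left = extract_valid_substring(value_l, valid_string_pattern)
    right = extract_valid_substring(value_r, valid_string_pattern)
    return left is None or right is None
```

Under this reading, a placeholder such as `"n/a"` falls into the null level once a pattern describing valid values is supplied, even though the raw column value is not NULL.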
Or(*comparison_levels)
Bases: _Merge
Represents a comparison level that is an 'OR' of other comparison levels
Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical "OR".
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*comparison_levels | ComparisonLevelCreator \| dict | These represent the comparison levels you wish to combine via 'OR' | () |
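Merging and negating SQL conditions, as And, Or and Not are described as doing, can be sketched with plain string handling. The helper names `merge_conditions` and `negate_condition` are hypothetical, and wrapping each condition in parentheses is an assumption made here to keep operator precedence unambiguous.

```python
def merge_conditions(operator, *sql_conditions):
    """Join SQL conditions with AND or OR, parenthesising each one."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f"({condition})" for condition in sql_conditions)


def negate_condition(sql_condition):
    """Wrap a SQL condition in a logical NOT."""
    return f"NOT ({sql_condition})"
```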
PairwiseStringDistanceFunctionLevel(col_name, distance_function_name, distance_threshold)
Bases: ComparisonLevelCreator
A comparison level using the most similar string distance between any pair of values between arrays in an array column.
The function given by distance_function_name must be one of "levenshtein", "damerau_levenshtein", "jaro_winkler", or "jaro".
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str \| ColumnExpression | Input column name | required |
distance_function_name | str | the name of the string distance function | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
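The "most similar pair" idea can be sketched in plain Python using Levenshtein distance, where the most similar pair is the one with the smallest distance. The helper names are hypothetical, the arrays are assumed to be non-empty, and the Levenshtein routine is a textbook dynamic-programming version rather than any backend's implementation.

```python
from itertools import product


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def best_pairwise_distance(array_l, array_r):
    """Smallest distance over every (left, right) pair; arrays must be non-empty."""
    return min(levenshtein(left, right) for left, right in product(array_l, array_r))


def pairwise_level_matches(array_l, array_r, distance_threshold):
    """True when the closest pair is within the distance threshold."""
    return best_pairwise_distance(array_l, array_r) <= distance_threshold
```

For a similarity score such as Jaro-Winkler the same pattern would take the maximum over pairs and compare with `>=` instead.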
PercentageDifferenceLevel(col_name, percentage_threshold)
Bases: ComparisonLevelCreator
Represents a comparison level where the difference between two numerical values is within a specified percentage threshold.
The percentage difference is calculated as the absolute difference between the two values divided by the greater of the two values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name. | required |
percentage_threshold | float | The threshold percentage to use to assess similarity e.g. 0.1 for 10%. | required |
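The calculation described above can be written out in plain Python as follows. The helper names are hypothetical, the sketch assumes positive values, and the use of a strict `<` at the boundary is an assumption, since the text only says "within".

```python
def percentage_difference(value_l, value_r):
    """Absolute difference divided by the greater of the two values."""
    return abs(value_l - value_r) / max(value_l, value_r)


def within_percentage_difference(value_l, value_r, percentage_threshold):
    """True when the percentage difference is below the threshold.

    Assumes positive inputs; strictness at the boundary is an assumption.
    """
    return percentage_difference(value_l, value_r) < percentage_threshold
```

For instance, 100 and 90 differ by 10 / 100, i.e. 0.1, so the pair (100, 95) is within a threshold of 0.1 while (100, 80) is not.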
AbsoluteDateDifferenceLevel
An alias of AbsoluteTimeDifferenceLevel.
Configuring comparisons
Note that all comparison levels have a .configure() method as follows:
Configure the comparison level with options which are common to all comparison levels. The options align to the keys in the JSON specification of a comparison level. These options are usually not needed, but are available for advanced users.
All options have default values set initially. Any call to .configure() will set any options that are supplied. Any subsequent calls to .configure() will not override these values with defaults; to override values you must explicitly provide a value corresponding to the default.
Generally speaking, at most a single call to .configure() should be required.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
m_probability | float | The m probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level. | unsupplied_option |
u_probability | float | The u probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level. | unsupplied_option |
tf_adjustment_column | str | Make term frequency adjustments for this comparison level using this input column. Default is equivalent to None, meaning that term-frequency adjustments will not be applied for this level. | unsupplied_option |
tf_adjustment_weight | float | Make term frequency adjustments for this comparison level using this weight. Default is equivalent to None, meaning term-frequency adjustments are fully-weighted if turned on. | unsupplied_option |
tf_minimum_u_value | float | When term frequency adjustments are turned on, where the term frequency adjustment implies a u value below this value, use this minimum value instead. Default is equivalent to None, meaning no minimum value. | unsupplied_option |
is_null_level | bool | If true, m and u values will not be estimated and instead the match weight will be zero for this column. Default is equivalent to False. | unsupplied_option |
label_for_charts | str | If provided, a custom label that will be used for this level in any charts. Default is equivalent to None, in which case a default label will be provided for this level. | unsupplied_option |
disable_tf_exact_match_detection | bool | If true, if term frequency adjustments are set, the corresponding adjustment will be made using the u-value for this level, rather than the usual case where it is the u-value of the exact match level in the same comparison. Default is equivalent to False. | unsupplied_option |
fix_m_probability | bool | If true, the m probability for this level will be fixed and not estimated during training. Default is equivalent to False. | unsupplied_option |
fix_u_probability | bool | If true, the u probability for this level will be fixed and not estimated during training. Default is equivalent to False. | unsupplied_option |
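The "unsupplied option" behaviour described for .configure() is a common sentinel pattern, sketched below with a toy class. `ConfigurableLevel` and `UNSUPPLIED` are hypothetical names, only two of the options are modelled, and returning `self` to allow chaining is an assumption of this sketch.

```python
UNSUPPLIED = object()  # sentinel meaning "this argument was not passed"


class ConfigurableLevel:
    """Toy stand-in for a comparison level; not the library's class."""

    def __init__(self):
        # Defaults are set once, at construction time.
        self.options = {"m_probability": None, "is_null_level": False}

    def configure(self, *, m_probability=UNSUPPLIED, is_null_level=UNSUPPLIED):
        supplied = {"m_probability": m_probability, "is_null_level": is_null_level}
        for name, value in supplied.items():
            # Only overwrite options the caller actually passed.
            if value is not UNSUPPLIED:
                self.options[name] = value
        return self


level = ConfigurableLevel().configure(m_probability=0.9)
level.configure(is_null_level=True)  # m_probability keeps its earlier value
```

Because `None` is distinct from the sentinel, a later call such as `level.configure(m_probability=None)` is how a value would be reset to its default in this sketch.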