Documentation for comparison_level_library
splink.comparison_level_library.NullLevelBase
Bases: ComparisonLevel
__init__(col_name)
Represents comparisons where one or both sides of the comparison contain null values, so the similarity cannot be evaluated. Such comparisons are assumed to have a partial match weight of zero (no effect on the overall match weight).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
Returns:
Type | Description |
---|---|
ComparisonLevel | Comparison level |
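The note about a partial match weight of zero can be illustrated with a minimal sketch. It assumes the usual Fellegi-Sunter definition of a partial match weight as log2(m / u); the helper names `is_null_comparison` and `partial_match_weight` are hypothetical and not part of Splink, which expresses the level as a SQL condition rather than Python.

```python
import math
from typing import Optional


def is_null_comparison(left: Optional[str], right: Optional[str]) -> bool:
    """True when one or both sides are null, so similarity cannot be evaluated."""
    return left is None or right is None


def partial_match_weight(m: float, u: float) -> float:
    """Partial match weight as log2 of the Bayes factor m / u (assumed definition)."""
    return math.log2(m / u)


# A Bayes factor of 1 (m equal to u) contributes nothing to the overall match weight.
null_weight = partial_match_weight(0.5, 0.5)
```

Because the weight is zero, a record pair that falls into the null level is scored only on its other comparisons.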
splink.comparison_level_library.ExactMatchLevelBase
Bases: ComparisonLevel
__init__(col_name, m_probability=None, term_frequency_adjustments=False, include_colname_in_charts_label=False)
splink.comparison_level_library.ElseLevelBase
Bases: ComparisonLevel
__init__(m_probability=None)
splink.comparison_level_library.DistanceFunctionLevelBase
Bases: ComparisonLevel
__init__(col_name, distance_function_name, distance_threshold, higher_is_more_similar=True, m_probability=None)
Represents a comparison using a user-provided distance function, where the similarity is assessed against a threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_function_name | str | The name of the distance function | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
higher_is_more_similar | bool | If True, a higher value of the distance function indicates a higher similarity (e.g. jaro_winkler). If False, a higher value indicates a lower similarity (e.g. levenshtein). | True |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Returns:
Type | Description |
---|---|
ComparisonLevel | A comparison level for a given distance function |
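The role of higher_is_more_similar can be sketched in a few lines of Python. This is an illustration of the thresholding logic only; the helper name `meets_threshold` is hypothetical, and treating the comparison as inclusive is an assumption rather than something this page states.

```python
def meets_threshold(
    value: float,
    distance_threshold: float,
    higher_is_more_similar: bool = True,
) -> bool:
    """Decide whether a distance-function value falls into the level."""
    # Assumption: the comparison is inclusive (>= or <=).
    if higher_is_more_similar:
        # Similarity-style functions (e.g. jaro_winkler): bigger is better.
        return value >= distance_threshold
    # Distance-style functions (e.g. levenshtein): smaller is better.
    return value <= distance_threshold
```

For example, a jaro_winkler score of 0.92 passes a threshold of 0.9 with the default setting, while a levenshtein distance of 3 fails a threshold of 2 when higher_is_more_similar is False.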
splink.comparison_level_library.LevenshteinLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, m_probability=None)
Represents a comparison using a Levenshtein distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Returns:
Type | Description |
---|---|
ComparisonLevel | A comparison level that evaluates the levenshtein similarity |
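As a rough illustration of what this level evaluates, the sketch below computes the Levenshtein edit distance in pure Python and compares it with a threshold. The function names are hypothetical; in practice the distance is computed by the SQL backend, and the inclusive comparison is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def levenshtein_level(left: str, right: str, distance_threshold: int) -> bool:
    # A lower distance means more similar (assumed inclusive comparison).
    return levenshtein(left, right) <= distance_threshold
```

With a distance_threshold of 1, "smith" and "smyth" fall into the level, whereas "kitten" and "sitting" (distance 3) do not.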
splink.comparison_level_library.JaroWinklerLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, m_probability=None)
Represents a comparison using the Jaro-Winkler distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Returns:
Type | Description |
---|---|
ComparisonLevel | A comparison level that evaluates the jaro winkler similarity |
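For readers unfamiliar with the measure, here is a minimal pure-Python sketch of Jaro-Winkler similarity and how a threshold might be applied to it. The function names, the prefix scale of 0.1, and the prefix cap of 4 characters are the textbook defaults assumed for illustration; a given SQL backend may differ in detail.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def jaro_winkler_level(left: str, right: str, distance_threshold: float) -> bool:
    # A higher score means more similar (assumed inclusive comparison).
    return jaro_winkler(left, right) >= distance_threshold
```

Because the score is a similarity, thresholds are typically values close to 1, and a pair qualifies when its score is at least the threshold.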
splink.comparison_level_library.JaccardLevelBase
Bases: DistanceFunctionLevelBase
__init__(col_name, distance_threshold, m_probability=None)
Represents a comparison using a Jaccard distance function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
distance_threshold | Union[int, float] | The threshold to use to assess similarity | required |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Returns:
Type | Description |
---|---|
ComparisonLevel | A comparison level that evaluates the jaccard similarity |
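A small sketch may help make the measure concrete. It assumes Jaccard similarity is taken over the sets of characters in each string, which is one common convention; the backend actually used may tokenise differently, and the function names here are hypothetical.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Size of the intersection divided by the size of the union of character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty strings: returning 0.0 is an arbitrary choice for this sketch.
        return 0.0
    return len(set_a & set_b) / len(union)


def jaccard_level(left: str, right: str, distance_threshold: float) -> bool:
    # A higher score means more similar (assumed inclusive comparison).
    return jaccard_similarity(left, right) >= distance_threshold
```

For "abc" and "bcd" the shared characters are b and c out of four distinct characters, giving a similarity of 0.5.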
splink.comparison_level_library.ColumnsReversedLevelBase
Bases: ComparisonLevel
__init__(col_name_1, col_name_2, m_probability=None, tf_adjustment_column=None)
Represents a comparison where the columns are reversed, for example where the surname is in the forename field and vice versa.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name_1 | str | First column, e.g. forename | required |
col_name_2 | str | Second column, e.g. surname | required |
m_probability | float | Starting value for m probability. Defaults to None. | None |
tf_adjustment_column | str | Column to use for term frequency adjustments if an exact match is observed. Defaults to None. | None |
Returns:
Type | Description |
---|---|
ComparisonLevel | A comparison level that evaluates the exact match of two columns. |
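The reversed-column check can be sketched over two records represented as dictionaries. The helper name `columns_reversed` and the record layout are hypothetical; treating null values as non-matching mirrors SQL equality semantics and is an assumption of this sketch.

```python
from typing import Mapping, Optional


def columns_reversed(
    left: Mapping[str, Optional[str]],
    right: Mapping[str, Optional[str]],
    col_name_1: str,
    col_name_2: str,
) -> bool:
    """True when col_name_1 on one side equals col_name_2 on the other, both ways."""
    values = (left[col_name_1], left[col_name_2], right[col_name_1], right[col_name_2])
    if any(v is None for v in values):
        # Assumption: nulls never count as equal, as in SQL.
        return False
    return (
        left[col_name_1] == right[col_name_2]
        and left[col_name_2] == right[col_name_1]
    )


record_a = {"forename": "John", "surname": "Smith"}
record_b = {"forename": "Smith", "surname": "John"}
swapped = columns_reversed(record_a, record_b, "forename", "surname")
```

Here `swapped` is True because the two records carry the same names with forename and surname exchanged.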
splink.comparison_level_library.DistanceInKMLevelBase
Bases: ComparisonLevel
__init__(lat_col, long_col, km_threshold, not_null=False, m_probability=None)
Uses the haversine formula to transform comparisons of latitude/longitude pairs into distances measured in kilometers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lat_col | str | The name of a latitude column, or the respective array or struct column containing the information. For example: long_lat['lat'] or long_lat[0] | required |
long_col | str | The name of a longitude column, or the respective array or struct column containing the information, plus an index. For example: long_lat['long'] or long_lat[1] | required |
km_threshold | int | The total distance in kilometers to evaluate your comparisons against | required |
not_null | bool | If True, remove any null values. This is only necessary if you are not capturing nulls elsewhere in your comparison level. | False |
m_probability | float | Starting value for m probability. Defaults to None. | None |
Returns:
Type | Description |
---|---|
ComparisonLevel | A comparison level that evaluates the distance between two coordinates |
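The haversine calculation itself is short enough to sketch in Python. The function names and the mean Earth radius of 6371 km are assumptions made for illustration, as is the inclusive comparison with km_threshold; the library performs the equivalent calculation in SQL.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(
    lat1: float, lon1: float, lat2: float, lon2: float, km_threshold: float
) -> bool:
    # Assumption: a pair qualifies when its distance is at most the threshold.
    return haversine_km(lat1, lon1, lat2, lon2) <= km_threshold


# Central London to central Paris, roughly 340 to 350 km apart.
london_paris_km = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
```

A comparison with several distance levels would typically order them from the smallest km_threshold to the largest, so that each pair lands in the tightest level it satisfies.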
splink.comparison_level_library.PercentageDifferenceLevelBase
Bases: ComparisonLevel
__init__(col_name, percentage_distance_threshold, m_probability=None)
splink.comparison_level_library.ArrayIntersectLevelBase
Bases: ComparisonLevel
__init__(col_name, m_probability=None, term_frequency_adjustments=False, min_intersection=1, include_colname_in_charts_label=False)
Represents a comparison level based on the size of the intersection of two arrays.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | Input column name | required |
m_probability | float | Starting value for m probability. Defaults to None. | None |
term_frequency_adjustments | bool | Whether to apply term frequency adjustments if an exact match is observed. Defaults to False. | False |
min_intersection | int | The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1. | 1 |
include_colname_in_charts_label | bool | Should the charts label contain the column name? Defaults to False. | False |
Returns:
Type | Description |
---|---|
ComparisonLevel | A comparison level that evaluates the size of intersection of arrays |
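The effect of min_intersection can be shown with a minimal sketch. The helper name `array_intersect_level` is hypothetical, and counting distinct shared elements (ignoring duplicates) is an assumption of this illustration.

```python
from typing import Hashable, Sequence


def array_intersect_level(
    left: Sequence[Hashable],
    right: Sequence[Hashable],
    min_intersection: int = 1,
) -> bool:
    """True when the arrays share at least min_intersection distinct elements."""
    shared = set(left) & set(right)
    return len(shared) >= min_intersection
```

For example, ["a", "b", "c"] and ["b", "c", "d"] share two elements, so they satisfy min_intersection=2 but not min_intersection=3.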