Documentation for comparison_library
splink.comparison_library.ExactMatchBase
Bases: Comparison
__init__(col_name, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_else=None, include_colname_in_charts_label=False)
A comparison of the data in col_name
with two levels:
- Exact match
- Anything else
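The two levels can be pictured with a minimal pure-Python sketch. It is only an illustration: Splink evaluates comparison levels in the SQL backend, and the function name `exact_match_level` is hypothetical rather than part of the library. Null handling is deliberately left out.

```python
from typing import Any


def exact_match_level(left: Any, right: Any) -> str:
    """Return the label of the level a pair of values falls into.

    Hypothetical helper for illustration only; not a Splink function.
    """
    # Level 1: the two values are identical.
    if left == right:
        return "Exact match"
    # Level 2: catch-all for every other pair.
    return "Anything else"


if __name__ == "__main__":
    print(exact_match_level("Smith", "Smith"))
    print(exact_match_level("Smith", "Smyth"))
```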
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare | required |
term_frequency_adjustments | bool | If True, term frequency adjustments will be made on the exact match level. | False |
m_probability_exact_match | float | If provided, overrides the default m probability for the exact match level. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. | None |
include_colname_in_charts_label | bool | If True, append the column name to the label used in charts. | False |
Returns:
Name | Type | Description |
---|---|---|
Comparison | A comparison that can be included in the Splink settings dictionary |
splink.comparison_library.DistanceFunctionAtThresholdsComparisonBase
Bases: Comparison
__init__(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True, include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)
A comparison of the data in col_name
with a user-provided distance
function used to assess middle similarity levels.
The user-provided distance function must exist in the SQL backend.
An example of the output with default arguments and setting distance_function_name to jaccard and distance_threshold_or_thresholds = [0.9, 0.7] would be
- Exact match
- Jaccard distance >= 0.9
- Jaccard distance >= 0.7
- Anything else
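How thresholds turn into ordered levels can be sketched in plain Python. This is a hedged illustration, not Splink's implementation: the helpers `level_labels` and `assign_level` are hypothetical, and the sketch assumes that levels are tested from most to least similar, using "at or above" when higher_is_more_similar is True and "at or below" otherwise.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def _as_list(thresholds: Union[Number, Sequence[Number]]) -> List[Number]:
    # Accept either a single threshold or a list of thresholds.
    if isinstance(thresholds, (list, tuple)):
        return list(thresholds)
    return [thresholds]


def level_labels(
    function_name: str,
    thresholds: Union[Number, Sequence[Number]],
    higher_is_more_similar: bool = True,
    include_exact_match_level: bool = True,
) -> List[str]:
    """Build the ordered level labels (hypothetical helper, not Splink API)."""
    operator = ">=" if higher_is_more_similar else "<="
    labels = ["Exact match"] if include_exact_match_level else []
    labels += [f"{function_name} {operator} {t}" for t in _as_list(thresholds)]
    labels.append("Anything else")
    return labels


def assign_level(
    score: Number,
    thresholds: Union[Number, Sequence[Number]],
    higher_is_more_similar: bool = True,
) -> int:
    """Return the index of the first threshold the score satisfies.

    Returns len(thresholds) when no threshold is met ('anything else').
    """
    ordered = _as_list(thresholds)
    for index, threshold in enumerate(ordered):
        met = score >= threshold if higher_is_more_similar else score <= threshold
        if met:
            return index
    return len(ordered)


if __name__ == "__main__":
    print(level_labels("jaccard", [0.9, 0.7]))
    print(assign_level(0.8, [0.9, 0.7]))
```

Because the first satisfied threshold wins, thresholds should be listed from strictest to loosest: descending for similarity scores, ascending for distances.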
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare | required |
distance_function_name | str | The name of the distance function. | required |
distance_threshold_or_thresholds | Union[int, list] | The threshold(s) to use for the middle similarity level(s). | required |
higher_is_more_similar | bool | If True, a higher value of the distance function indicates a higher similarity (e.g. jaro_winkler). If False, a higher value indicates a lower similarity (e.g. levenshtein). | True |
include_exact_match_level | bool | If True, include an exact match level. | True |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. | False |
m_probability_exact_match | float | If provided, overrides the default m probability for the exact match level. | None |
m_probability_or_probabilities_lev | Union[float, list] | If provided, overrides the default m probabilities for the thresholds specified. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. | None |
Returns:
Name | Type | Description |
---|---|---|
Comparison | A comparison that can be included in the Splink settings dictionary |
splink.comparison_library.LevenshteinAtThresholdsComparisonBase
Bases: DistanceFunctionAtThresholdsComparisonBase
__init__(col_name, distance_threshold_or_thresholds=[1, 2], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)
A comparison of the data in col_name
with the levenshtein distance used to
assess middle similarity levels.
An example of the output with default arguments and setting
distance_threshold_or_thresholds = [1,2]
would be
- Exact match
- levenshtein distance <= 1
- levenshtein distance <= 2
- Anything else
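The levels listed above can be reproduced with a short pure-Python sketch. It assumes the standard edit-distance definition (insertions, deletions and substitutions each cost 1); the SQL backend's own levenshtein function is what Splink would actually call, and the helper names here are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_level(a: str, b: str, thresholds: Sequence[int] = (1, 2)) -> str:
    """Return the label of the first level the pair satisfies.

    Hypothetical helper; thresholds are expected in ascending order.
    """
    if a == b:
        return "Exact match"
    distance = levenshtein(a, b)
    for threshold in thresholds:
        if distance <= threshold:
            return f"levenshtein distance <= {threshold}"
    return "Anything else"


if __name__ == "__main__":
    for left, right in [("Smith", "Smith"), ("Smith", "Smyth"), ("Smith", "Jones")]:
        print(left, right, levenshtein_level(left, right))
```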
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare | required |
distance_threshold_or_thresholds | Union[int, list] | The threshold(s) to use for the middle similarity level(s). | [1, 2] |
include_exact_match_level | bool | If True, include an exact match level. | True |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. | False |
m_probability_exact_match | float | If provided, overrides the default m probability for the exact match level. | None |
m_probability_or_probabilities_lev | Union[float, list] | If provided, overrides the default m probabilities for the thresholds specified. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. | None |
Returns:
Name | Type | Description |
---|---|---|
Comparison | A comparison that can be included in the Splink settings dictionary |
splink.comparison_library.JaccardAtThresholdsComparisonBase
Bases: DistanceFunctionAtThresholdsComparisonBase
__init__(col_name, distance_threshold_or_thresholds=[0.9, 0.7], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)
A comparison of the data in col_name
with the jaccard distance used to
assess middle similarity levels.
An example of the output with default arguments and setting
distance_threshold_or_thresholds = [0.9, 0.7]
would be
- Exact match
- Jaccard distance >= 0.9
- Jaccard distance >= 0.7
- Anything else
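As a rough illustration of these levels, the sketch below computes a Jaccard score over the sets of characters in each string and maps it to a level. Both the character-set definition and the helper names are assumptions made for the example; the actual score comes from whatever jaccard function the SQL backend provides, which may tokenise differently. Higher scores are treated as more similar, so a level is met when the score is at or above its threshold.

```python
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Jaccard score of the character sets of two strings (assumed definition)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty strings: treated as identical by convention in this sketch.
        return 1.0
    return len(set_a & set_b) / len(union)


def jaccard_level(a: str, b: str, thresholds: Sequence[float] = (0.9, 0.7)) -> str:
    """Return the label of the first level the pair satisfies.

    Hypothetical helper; thresholds are expected in descending order.
    """
    if a == b:
        return "Exact match"
    score = jaccard(a, b)
    for threshold in thresholds:
        if score >= threshold:
            return f"Jaccard >= {threshold}"
    return "Anything else"


if __name__ == "__main__":
    print(jaccard_level("listen", "silent"))
    print(jaccard_level("abcdefgh", "abcdefg"))
    print(jaccard_level("abcd", "abce"))
```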
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare | required |
distance_threshold_or_thresholds | Union[int, list] | The threshold(s) to use for the middle similarity level(s). | [0.9, 0.7] |
include_exact_match_level | bool | If True, include an exact match level. | True |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. | False |
m_probability_exact_match | float | If provided, overrides the default m probability for the exact match level. | None |
m_probability_or_probabilities_lev | Union[float, list] | If provided, overrides the default m probabilities for the thresholds specified. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. | None |
Returns:
Name | Type | Description |
---|---|---|
Comparison | A comparison that can be included in the Splink settings dictionary |
splink.comparison_library.JaroWinklerAtThresholdsComparisonBase
Bases: DistanceFunctionAtThresholdsComparisonBase
__init__(col_name, distance_threshold_or_thresholds=[0.9, 0.7], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)
A comparison of the data in col_name
with the jaro_winkler distance used to
assess middle similarity levels.
An example of the output with default arguments and setting
distance_threshold_or_thresholds = [0.9, 0.7]
would be
- Exact match
- jaro_winkler distance >= 0.9
- jaro_winkler distance >= 0.7
- Anything else
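To make the level boundaries concrete, here is a pure-Python sketch of the textbook Jaro-Winkler score and a level lookup. It assumes the common formulation (prefix of up to 4 characters, scaling factor 0.1, boost applied at every Jaro score); SQL backends may implement a slightly different variant, and the helper names are hypothetical. Higher scores mean more similar values, so a level is met when the score is at or above its threshold.

```python
from typing import Sequence


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        stop = min(len2, i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for char1, char2 in zip(s1[:4], s2[:4]):
        if char1 != char2:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def jaro_winkler_level(
    a: str, b: str, thresholds: Sequence[float] = (0.9, 0.7)
) -> str:
    """Return the first level met; thresholds in descending order."""
    if a == b:
        return "Exact match"
    score = jaro_winkler(a, b)
    for threshold in thresholds:
        if score >= threshold:
            return f"jaro_winkler >= {threshold}"
    return "Anything else"


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(jaro_winkler_level("DIXON", "DICKSONX"))
```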
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare | required |
distance_threshold_or_thresholds | Union[int, list] | The threshold(s) to use for the middle similarity level(s). | [0.9, 0.7] |
include_exact_match_level | bool | If True, include an exact match level. | True |
term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. | False |
m_probability_exact_match | float | If provided, overrides the default m probability for the exact match level. | None |
m_probability_or_probabilities_lev | Union[float, list] | If provided, overrides the default m probabilities for the thresholds specified. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. | None |
Returns:
Name | Type | Description |
---|---|---|
Comparison | A comparison that can be included in the Splink settings dictionary |
splink.comparison_library.ArrayIntersectAtSizesComparisonBase
Bases: Comparison
__init__(col_name, size_or_sizes=[1], m_probability_or_probabilities_sizes=None, m_probability_else=None)
A comparison of the data in array column col_name with various intersection sizes to assess similarity levels.
An example of the output with default arguments and setting
size_or_sizes = [3, 1]
would be
- Intersection has at least 3 elements
- Intersection has at least 1 element (i.e. 1 or 2)
- Anything else (i.e. empty intersection)
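The same logic can be sketched in plain Python. The helper `intersection_level` is hypothetical, and the sketch assumes the intersection is counted over distinct elements; the SQL backend's array functions determine the real behaviour, for example with duplicates.

```python
from typing import Hashable, Sequence, Union


def intersection_level(
    left: Sequence[Hashable],
    right: Sequence[Hashable],
    sizes: Union[int, Sequence[int]] = (1,),
) -> str:
    """Return the label of the first size level the pair satisfies.

    Hypothetical helper; sizes are expected in descending order.
    """
    ordered_sizes = [sizes] if isinstance(sizes, int) else list(sizes)
    # Count distinct elements shared by both arrays.
    overlap = len(set(left) & set(right))
    for size in ordered_sizes:
        if overlap >= size:
            noun = "element" if size == 1 else "elements"
            return f"Intersection has at least {size} {noun}"
    return "Anything else"


if __name__ == "__main__":
    print(intersection_level(["a", "b", "c", "d"], ["b", "c", "d", "e"], [3, 1]))
    print(intersection_level(["a", "b"], ["b", "z"], [3, 1]))
    print(intersection_level(["a"], ["b"], [3, 1]))
```

Listing the sizes in descending order matters: with [1, 3] every pair sharing at least one element would stop at the first level and the size-3 level would never be reached.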
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare | required |
size_or_sizes | Union[int, list] | The size(s) of intersection to use for the non-'else' similarity level(s). Should be in descending order. | [1] |
m_probability_or_probabilities_sizes | Union[float, list] | If provided, overrides the default m probabilities for the sizes specified. | None |
m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. | None |
Returns:
Name | Type | Description |
---|---|---|
Comparison | A comparison that can be included in the Splink settings dictionary |