Documentation for comparison_library

splink.comparison_library.ExactMatchBase

Bases: Comparison

__init__(col_name, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_else=None, include_colname_in_charts_label=False)

A comparison of the data in col_name with two levels:
- Exact match
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
term_frequency_adjustments bool

If True, term frequency adjustments will be made on the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None
include_colname_in_charts_label bool

If True, append the column name to the label used in charts. Defaults to False.

False

Returns:

Name Type Description
Comparison

A comparison that can be included in the Splink settings dictionary
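
To make the two levels concrete, here is a minimal pure-Python sketch of how a pair of values would be assigned to a level. It is an illustration only, not Splink's implementation (Splink evaluates levels as SQL in the backend), and the function name `exact_match_level` is hypothetical.

```python
def exact_match_level(left_value, right_value):
    """Assign a pair of column values to one of the two levels.

    Hypothetical helper for illustration; not part of Splink.
    """
    if left_value == right_value:
        return "Exact match"
    return "Anything else"


# Example record pairs for a surname column (made-up data).
pairs = [("smith", "smith"), ("smith", "smyth")]
labels = [exact_match_level(left, right) for left, right in pairs]
```

Each pair lands in exactly one level, which is what allows a separate m probability to be attached to each level.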

splink.comparison_library.DistanceFunctionAtThresholdsComparisonBase

Bases: Comparison

__init__(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True, include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)

A comparison of the data in col_name with a user-provided distance function used to assess middle similarity levels.

The user-provided distance function must exist in the SQL backend.

An example of the output with default arguments, distance_function_name set to jaccard and distance_threshold_or_thresholds = [0.9, 0.7], would be:
- Exact match
- Jaccard distance >= 0.9
- Jaccard distance >= 0.7
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_function_name str

The name of the distance function.

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the middle similarity level(s). This argument is required and has no default.

required
higher_is_more_similar bool

If True, a higher value of the distance function indicates a higher similarity (e.g. jaro_winkler). If False, a higher value indicates a lower similarity (e.g. levenshtein).

True
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison
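
The interaction between the thresholds and higher_is_more_similar can be sketched in plain Python. This is an illustrative assumption about the ordering logic, not Splink's code: the helper `assign_level` and the toy distance `length_difference` are hypothetical, and in Splink the distance function is a SQL function in the backend rather than a Python callable.

```python
def length_difference(left, right):
    """Toy distance: absolute difference in length (lower is more similar)."""
    return abs(len(left) - len(right))


def assign_level(left, right, distance_function, thresholds,
                 higher_is_more_similar=True, include_exact_match_level=True):
    """Return the label of the first level that the pair satisfies."""
    # Accept a single threshold as well as a list of thresholds.
    if isinstance(thresholds, (int, float)):
        thresholds = [thresholds]
    if include_exact_match_level and left == right:
        return "Exact match"
    score = distance_function(left, right)
    operator = ">=" if higher_is_more_similar else "<="
    for threshold in thresholds:
        if higher_is_more_similar:
            passes = score >= threshold
        else:
            passes = score <= threshold
        if passes:
            return f"{distance_function.__name__} {operator} {threshold}"
    return "Anything else"


close = assign_level("ann", "anne", length_difference, [1, 2],
                     higher_is_more_similar=False)
far = assign_level("ann", "annabel", length_difference, [1, 2],
                   higher_is_more_similar=False)
```

Levels are checked in the order given, so thresholds should run from most to least similar: descending when higher_is_more_similar is True, ascending when it is False.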


splink.comparison_library.LevenshteinAtThresholdsComparisonBase

Bases: DistanceFunctionAtThresholdsComparisonBase

__init__(col_name, distance_threshold_or_thresholds=[1, 2], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)

A comparison of the data in col_name with the levenshtein distance used to assess middle similarity levels.

An example of the output with default arguments and setting distance_threshold_or_thresholds = [1, 2] would be:
- Exact match
- levenshtein distance <= 1
- levenshtein distance <= 2
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the middle similarity level(s). Defaults to [1, 2].

[1, 2]
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison
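
As a rough illustration of the default levels, the sketch below computes the Levenshtein (edit) distance in pure Python and maps a pair of strings to a level. It assumes the standard definition of the distance (insertions, deletions and substitutions each cost 1); the helper names are hypothetical and Splink itself relies on the backend's SQL function.

```python
def levenshtein(a, b):
    """Standard edit distance using a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def levenshtein_level(left, right, thresholds=(1, 2)):
    """Return the first level satisfied; lower distance is more similar."""
    if left == right:
        return "Exact match"
    distance = levenshtein(left, right)
    for threshold in thresholds:
        if distance <= threshold:
            return f"levenshtein distance <= {threshold}"
    return "Anything else"


one_edit = levenshtein_level("smith", "smyth")      # one substitution
two_edits = levenshtein_level("smith", "smythe")    # substitution + insertion
unrelated = levenshtein_level("kitten", "sitting")  # distance 3
```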

splink.comparison_library.JaccardAtThresholdsComparisonBase

Bases: DistanceFunctionAtThresholdsComparisonBase

__init__(col_name, distance_threshold_or_thresholds=[0.9, 0.7], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)

A comparison of the data in col_name with the jaccard distance used to assess middle similarity levels.

An example of the output with default arguments and setting distance_threshold_or_thresholds = [0.9, 0.7] would be:
- Exact match
- Jaccard distance >= 0.9
- Jaccard distance >= 0.7
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the middle similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison
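
The sketch below shows one way the default thresholds could sort string pairs into levels. It assumes a Jaccard score computed over the sets of characters in each string, with a higher score meaning more similar; the backend's own jaccard function may be defined differently, and the helper names here are hypothetical.

```python
def jaccard(a, b):
    """Jaccard score of the character sets: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # Two empty strings are treated as identical.
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_level(left, right, thresholds=(0.9, 0.7)):
    """Return the first level satisfied; higher score is more similar."""
    if left == right:
        return "Exact match"
    score = jaccard(left, right)
    for threshold in thresholds:
        if score >= threshold:
            return f"Jaccard >= {threshold}"
    return "Anything else"


anagram = jaccard_level("listen", "silent")          # same characters, score 1.0
near = jaccard_level("abcdefgh", "abcdefgx")         # 7 shared of 9, about 0.78
different = jaccard_level("night", "nacht")          # 3 shared of 7, about 0.43
```

Because the score ignores character order, anagrams reach the top non-exact level, which is worth keeping in mind when choosing this comparison for name columns.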

splink.comparison_library.JaroWinklerAtThresholdsComparisonBase

Bases: DistanceFunctionAtThresholdsComparisonBase

__init__(col_name, distance_threshold_or_thresholds=[0.9, 0.7], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)

A comparison of the data in col_name with the jaro_winkler distance used to assess middle similarity levels.

An example of the output with default arguments and setting distance_threshold_or_thresholds = [0.9, 0.7] would be:
- Exact match
- jaro_winkler distance >= 0.9
- jaro_winkler distance >= 0.7
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the middle similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison
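
For intuition about what the 0.9 and 0.7 thresholds mean, here is a pure-Python sketch of the textbook Jaro-Winkler similarity (prefix scale 0.1, prefix capped at 4 characters) and the resulting level assignment. The backend's jaro_winkler function may differ in detail, and the helper names are hypothetical.

```python
def jaro(a, b):
    """Textbook Jaro similarity between two strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, char in enumerate(a):
        start = max(0, i - window)
        end = min(len(b), i + window + 1)
        for j in range(start, end):
            if not b_matched[j] and b[j] == char:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_chars = [c for c, m in zip(a, a_matched) if m]
    b_chars = [c for c, m in zip(b, b_matched) if m]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_chars, b_chars)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def jaro_winkler_level(left, right, thresholds=(0.9, 0.7)):
    """Return the first level satisfied; higher score is more similar."""
    if left == right:
        return "Exact match"
    score = jaro_winkler(left, right)
    for threshold in thresholds:
        if score >= threshold:
            return f"jaro_winkler >= {threshold}"
    return "Anything else"


high = jaro_winkler_level("MARTHA", "MARHTA")    # score about 0.961
medium = jaro_winkler_level("DIXON", "DICKSONX")  # score about 0.813
low = jaro_winkler_level("abc", "xyz")            # score 0.0
```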

splink.comparison_library.ArrayIntersectAtSizesComparisonBase

Bases: Comparison

__init__(col_name, size_or_sizes=[1], m_probability_or_probabilities_sizes=None, m_probability_else=None)

A comparison of the data in array column col_name with various intersection sizes to assess similarity levels.

An example of the output with default arguments and setting size_or_sizes = [3, 1] would be:
- Intersection has at least 3 elements
- Intersection has at least 1 element (i.e. 1 or 2)
- Anything else (i.e. empty intersection)

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
size_or_sizes Union[int, list]

The size(s) of intersection to use for the non-'else' similarity level(s). Should be in descending order. Defaults to [1].

[1]
m_probability_or_probabilities_sizes Union[float, list]

If provided, overrides the default m probabilities for the sizes specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison
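
The descending-order requirement on size_or_sizes is easier to see in a small sketch. The pure-Python function below is an illustration under the assumption that the intersection is taken over distinct elements; `array_intersect_level` is a hypothetical name, and Splink evaluates the real comparison with the backend's array functions.

```python
def array_intersect_level(left, right, sizes=(1,)):
    """Return the first level whose minimum intersection size is met."""
    # Accept a single size as well as a list of sizes.
    if isinstance(sizes, int):
        sizes = [sizes]
    overlap = len(set(left) & set(right))
    for size in sizes:
        if overlap >= size:
            noun = "element" if size == 1 else "elements"
            return f"Intersection has at least {size} {noun}"
    return "Anything else"


# Made-up array column values, e.g. lists of postcodes per record.
large = array_intersect_level(["a", "b", "c", "d"], ["a", "b", "c"], sizes=[3, 1])
small = array_intersect_level(["a", "b"], ["b", "z"], sizes=[3, 1])
empty = array_intersect_level(["a"], ["z"], sizes=[3, 1])
```

If the sizes were given in ascending order, such as [1, 3], every non-empty intersection would satisfy the first level and the second level could never be reached.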