Documentation for comparison_library

exact_match(col_name, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_else=None)

A comparison of the data in col_name with two levels:

- Exact match
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
term_frequency_adjustments bool

If True, term frequency adjustments will be made on the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary

Source code in splink/comparison_library.py
def exact_match(
    col_name,
    term_frequency_adjustments=False,
    m_probability_exact_match=None,
    m_probability_else=None,
) -> Comparison:
    """A comparison of the data in `col_name` with two levels:
    - Exact match
    - Anything else

    Args:
        col_name (str): The name of the column to compare
        term_frequency_adjustments (bool, optional): If True, term frequency adjustments
            will be made on the exact match level. Defaults to False.
        m_probability_exact_match (float, optional): If provided, overrides the
            default m probability for the exact match level. Defaults to None.
        m_probability_else (float, optional): If provided, overrides the
            default m probability for the 'anything else' level. Defaults to None.

    Returns:
        Comparison: A comparison that can be included in the Splink settings dictionary
    """

    comparison_dict = {
        "comparison_description": "Exact match vs. anything else",
        "comparison_levels": [
            cl.null_level(col_name),
            cl.exact_match_level(
                col_name,
                term_frequency_adjustments=term_frequency_adjustments,
                m_probability=m_probability_exact_match,
            ),
            cl.else_level(m_probability=m_probability_else),
        ],
    }
    return Comparison(comparison_dict)
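To make the outcome of this comparison concrete, here is a minimal plain-Python sketch of how a single pair of values would fall into the levels built above. It does not call Splink: `exact_match_label` is a hypothetical helper, and the sketch assumes that the null level applies when either value is missing and that levels are checked in the order they are listed.

```python
from typing import Optional


def exact_match_label(left: Optional[str], right: Optional[str]) -> str:
    """Return the label of the first level that a pair of values satisfies."""
    # Null level: assumed here to apply when either side is missing.
    if left is None or right is None:
        return "Null"
    # Exact match level: both values are present and identical.
    if left == right:
        return "Exact match"
    # Else level: everything not caught by an earlier level.
    return "Anything else"


print(exact_match_label("Smith", "Smith"))  # Exact match
print(exact_match_label("Smith", "Smyth"))  # Anything else
print(exact_match_label(None, "Smith"))     # Null
```

Because the levels are evaluated in order, a pair with a missing value never reaches the exact match test.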

distance_function_at_thresholds(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True, include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)

A comparison of the data in col_name with a user-provided distance function used to assess middle similarity levels.

The user-provided distance function must exist in the SQL backend.

With the default arguments, distance_function_name set to jaccard and distance_threshold_or_thresholds = [0.9, 0.7], the comparison has the following levels:

- Exact match
- Jaccard similarity >= 0.9
- Jaccard similarity >= 0.7
- Anything else
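The way thresholds interact with higher_is_more_similar can be sketched in plain Python. This is an illustration only, not Splink code: `assign_level` is a hypothetical helper, and it assumes that a threshold is passed with `>=` when higher scores mean more similar and with `<=` otherwise, with thresholds tested in the order given.

```python
from typing import List, Optional


def assign_level(
    score: Optional[float],
    thresholds: List[float],
    higher_is_more_similar: bool = True,
    is_exact_match: bool = False,
) -> str:
    """Return the label of the first level satisfied by a precomputed score."""
    if score is None:
        return "Null"
    if is_exact_match:
        return "Exact match"
    op = ">=" if higher_is_more_similar else "<="
    for threshold in thresholds:
        # Assumed direction of the test; thresholds are tried in order.
        if higher_is_more_similar:
            passed = score >= threshold
        else:
            passed = score <= threshold
        if passed:
            return f"score {op} {threshold}"
    return "Anything else"


print(assign_level(0.95, [0.9, 0.7]))  # score >= 0.9
print(assign_level(0.80, [0.9, 0.7]))  # score >= 0.7
print(assign_level(2, [1, 2], higher_is_more_similar=False))  # score <= 2
```

Under these assumptions the strictest threshold must come first: with [0.7, 0.9] and higher_is_more_similar=True, the 0.9 level could never be reached.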

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_function_name str

The name of the distance function

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the middle similarity level(s).

required
higher_is_more_similar bool

If True, a higher value of the distance function indicates a higher similarity (e.g. jaro_winkler). If False, a higher value indicates a lower similarity (e.g. levenshtein).

True
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary

Source code in splink/comparison_library.py
def distance_function_at_thresholds(
    col_name: str,
    distance_function_name: str,
    distance_threshold_or_thresholds: Union[int, list],
    higher_is_more_similar: bool = True,
    include_exact_match_level=True,
    term_frequency_adjustments=False,
    m_probability_exact_match=None,
    m_probability_or_probabilities_lev: Union[float, list] = None,
    m_probability_else=None,
) -> Comparison:
    """A comparison of the data in `col_name` with a user-provided distance function
    used to assess middle similarity levels.

    The user-provided distance function must exist in the SQL backend.

    With the default arguments, `distance_function_name` set to `jaccard` and
    `distance_threshold_or_thresholds = [0.9, 0.7]`, the comparison has the levels
    - Exact match
    - Jaccard similarity >= 0.9
    - Jaccard similarity >= 0.7
    - Anything else

    Args:
        col_name (str): The name of the column to compare
        distance_function_name (str): The name of the distance function
        distance_threshold_or_thresholds (Union[int, list]): The threshold(s)
            to use for the middle similarity level(s).
        higher_is_more_similar (bool): If True, a higher value of the distance function
            indicates a higher similarity (e.g. jaro_winkler). If False, a higher
            value indicates a lower similarity (e.g. levenshtein).
        include_exact_match_level (bool, optional): If True, include an exact match
            level. Defaults to True.
        term_frequency_adjustments (bool, optional): If True, apply term frequency
            adjustments to the exact match level. Defaults to False.
        m_probability_exact_match (float, optional): If provided, overrides the
            default m probability for the exact match level. Defaults to None.
        m_probability_or_probabilities_lev (Union[float, list], optional):
            If provided, overrides the default m probabilities
            for the thresholds specified. Defaults to None.
        m_probability_else (float, optional): If provided, overrides the
            default m probability for the 'anything else' level. Defaults to None.

    Returns:
        Comparison: A comparison that can be included in the Splink settings dictionary
    """

    distance_thresholds = ensure_is_iterable(distance_threshold_or_thresholds)

    if m_probability_or_probabilities_lev is None:
        m_probability_or_probabilities_lev = [None] * len(distance_thresholds)
    m_probabilities = ensure_is_iterable(m_probability_or_probabilities_lev)

    comparison_levels = []
    comparison_levels.append(cl.null_level(col_name))
    if include_exact_match_level:
        level = cl.exact_match_level(
            col_name,
            term_frequency_adjustments=term_frequency_adjustments,
            m_probability=m_probability_exact_match,
        )
        comparison_levels.append(level)

    for thres, m_prob in zip(distance_thresholds, m_probabilities):
        level = cl.distance_function_level(
            col_name,
            distance_function_name=distance_function_name,
            higher_is_more_similar=higher_is_more_similar,
            distance_threshold=thres,
            m_probability=m_prob,
        )
        comparison_levels.append(level)

    comparison_levels.append(
        cl.else_level(m_probability=m_probability_else),
    )

    comparison_desc = ""
    if include_exact_match_level:
        comparison_desc += "Exact match vs. "

    thres_desc = ", ".join([str(d) for d in distance_thresholds])
    plural = "" if len(distance_thresholds) == 1 else "s"
    comparison_desc += (
        f"{distance_function_name} at threshold{plural} {thres_desc} vs. "
    )
    comparison_desc += "anything else"

    comparison_dict = {
        "comparison_description": comparison_desc,
        "comparison_levels": comparison_levels,
    }
    return Comparison(comparison_dict)
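The description string assembled at the end of the function can be reproduced without Splink. The sketch below mirrors that string-building logic; `as_list` is a hypothetical stand-in for `ensure_is_iterable`, assumed here to wrap a scalar in a list and leave lists unchanged, and `describe` is likewise a name invented for this illustration.

```python
def as_list(value):
    """Hypothetical stand-in for ensure_is_iterable: wrap scalars in a list."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def describe(distance_function_name, thresholds, include_exact_match_level=True):
    """Build the comparison description the same way the source above does."""
    thresholds = as_list(thresholds)
    desc = "Exact match vs. " if include_exact_match_level else ""
    thres_desc = ", ".join(str(t) for t in thresholds)
    plural = "" if len(thresholds) == 1 else "s"
    desc += f"{distance_function_name} at threshold{plural} {thres_desc} vs. "
    return desc + "anything else"


print(describe("jaccard", [0.9, 0.7]))
# Exact match vs. jaccard at thresholds 0.9, 0.7 vs. anything else
print(describe("levenshtein", 2, include_exact_match_level=False))
# levenshtein at threshold 2 vs. anything else

# zip stops at the shorter input, so a shorter list of m probabilities
# leaves the trailing thresholds without a level.
print(list(zip([1, 2, 3], [0.2, 0.1])))
# [(1, 0.2), (2, 0.1)]
```

Since the source pairs thresholds and m probabilities with `zip`, it is safest to pass exactly one m probability per threshold whenever m_probability_or_probabilities_lev is supplied.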

levenshtein_at_thresholds(col_name, distance_threshold_or_thresholds=[1, 2], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)

A comparison of the data in col_name with the levenshtein distance used to assess middle similarity levels.

With the default arguments, i.e. distance_threshold_or_thresholds = [1, 2], the comparison has the following levels:

- Exact match
- levenshtein distance <= 1
- levenshtein distance <= 2
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the middle similarity level(s). Defaults to [1, 2].

[1, 2]
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary

Source code in splink/comparison_library.py
def levenshtein_at_thresholds(
    col_name: str,
    distance_threshold_or_thresholds: Union[int, list] = [1, 2],
    include_exact_match_level=True,
    term_frequency_adjustments=False,
    m_probability_exact_match=None,
    m_probability_or_probabilities_lev: Union[float, list] = None,
    m_probability_else=None,
) -> Comparison:
    """A comparison of the data in `col_name` with the levenshtein distance used to
    assess middle similarity levels.

    An example of the output with default arguments and setting
    `distance_threshold_or_thresholds = [1,2]` would be
    - Exact match
    - levenshtein distance <= 1
    - levenshtein distance <= 2
    - Anything else

    Args:
        col_name (str): The name of the column to compare
        distance_threshold_or_thresholds (Union[int, list], optional): The threshold(s)
            to use for the middle similarity level(s). Defaults to [1, 2].
        include_exact_match_level (bool, optional): If True, include an exact match
            level. Defaults to True.
        term_frequency_adjustments (bool, optional): If True, apply term frequency
            adjustments to the exact match level. Defaults to False.
        m_probability_exact_match (float, optional): If provided, overrides the
            default m probability for the exact match level. Defaults to None.
        m_probability_or_probabilities_lev (Union[float, list], optional):
            If provided, overrides the default m probabilities
            for the thresholds specified. Defaults to None.
        m_probability_else (float, optional): If provided, overrides the
            default m probability for the 'anything else' level. Defaults to None.

    Returns:
        Comparison: A comparison that can be included in the Splink settings dictionary
    """

    return distance_function_at_thresholds(
        col_name,
        cl._mutable_params["levenshtein"],
        distance_threshold_or_thresholds,
        False,
        include_exact_match_level,
        term_frequency_adjustments,
        m_probability_exact_match,
        m_probability_or_probabilities_lev,
        m_probability_else,
    )
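As an illustration of what the default thresholds [1, 2] mean in practice, the following sketch computes edit distances in plain Python and assigns a level. In Splink the distance is computed by the SQL backend, so this is only an approximation of the behaviour; `levenshtein` and `levenshtein_level` are helpers written for this example, and the null handling is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_level(left, right, thresholds=(1, 2)):
    """Return the first level a pair satisfies, lower distance meaning more similar."""
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match"
    distance = levenshtein(left, right)
    for threshold in thresholds:
        if distance <= threshold:
            return f"levenshtein distance <= {threshold}"
    return "Anything else"


print(levenshtein_level("Smith", "Smyth"))   # levenshtein distance <= 1
print(levenshtein_level("Smith", "Smythe"))  # levenshtein distance <= 2
print(levenshtein_level("Smith", "Jones"))   # Anything else
```

Because a lower edit distance means a closer match, the wrapper above passes False for higher_is_more_similar, and the thresholds are listed from strictest to loosest.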

jaccard_at_thresholds(col_name, distance_threshold_or_thresholds=[0.9, 0.7], include_exact_match_level=True, term_frequency_adjustments=False, m_probability_exact_match=None, m_probability_or_probabilities_lev=None, m_probability_else=None)

A comparison of the data in col_name with the jaccard distance used to assess middle similarity levels.

With the default arguments, i.e. distance_threshold_or_thresholds = [0.9, 0.7], the comparison has the following levels:

- Exact match
- Jaccard similarity >= 0.9
- Jaccard similarity >= 0.7
- Anything else

Parameters:

Name Type Description Default
col_name str

The name of the column to compare

required
distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the middle similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]
include_exact_match_level bool

If True, include an exact match level. Defaults to True.

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False
m_probability_exact_match float

If provided, overrides the default m probability for the exact match level. Defaults to None.

None
m_probability_or_probabilities_lev Union[float, list]

If provided, overrides the default m probabilities for the thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary

Source code in splink/comparison_library.py
def jaccard_at_thresholds(
    col_name: str,
    distance_threshold_or_thresholds: Union[int, list] = [0.9, 0.7],
    include_exact_match_level=True,
    term_frequency_adjustments=False,
    m_probability_exact_match=None,
    m_probability_or_probabilities_lev: Union[float, list] = None,
    m_probability_else=None,
) -> Comparison:
    """A comparison of the data in `col_name` with the jaccard distance used to
    assess middle similarity levels.

    With the default arguments, i.e.
    `distance_threshold_or_thresholds = [0.9, 0.7]`, the comparison has the levels
    - Exact match
    - Jaccard similarity >= 0.9
    - Jaccard similarity >= 0.7
    - Anything else

    Args:
        col_name (str): The name of the column to compare
        distance_threshold_or_thresholds (Union[int, list], optional): The threshold(s)
            to use for the middle similarity level(s). Defaults to [0.9, 0.7].
        include_exact_match_level (bool, optional): If True, include an exact match
            level. Defaults to True.
        term_frequency_adjustments (bool, optional): If True, apply term frequency
            adjustments to the exact match level. Defaults to False.
        m_probability_exact_match (float, optional): If provided, overrides the
            default m probability for the exact match level. Defaults to None.
        m_probability_or_probabilities_lev (Union[float, list], optional):
            If provided, overrides the default m probabilities
            for the thresholds specified. Defaults to None.
        m_probability_else (float, optional): If provided, overrides the
            default m probability for the 'anything else' level. Defaults to None.

    Returns:
        Comparison: A comparison that can be included in the Splink settings dictionary
    """

    return distance_function_at_thresholds(
        col_name,
        "jaccard",
        distance_threshold_or_thresholds,
        True,
        include_exact_match_level,
        term_frequency_adjustments,
        m_probability_exact_match,
        m_probability_or_probabilities_lev,
        m_probability_else,
    )
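To give a feel for the default thresholds [0.9, 0.7], the sketch below computes a Jaccard similarity over the sets of characters in each string and assigns a level. This is an assumption made for illustration: the actual `jaccard` function is supplied by the SQL backend and may be defined differently, and `jaccard_similarity` and `jaccard_level` are helpers invented for this example.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the character sets of two strings (assumed definition)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty strings share everything they contain.
        return 1.0
    return len(set_a & set_b) / len(union)


def jaccard_level(left, right, thresholds=(0.9, 0.7)):
    """Return the first level a pair satisfies, higher similarity meaning more similar."""
    if left is None or right is None:
        return "Null"
    if left == right:
        return "Exact match"
    similarity = jaccard_similarity(left, right)
    for threshold in thresholds:
        if similarity >= threshold:
            return f"Jaccard similarity >= {threshold}"
    return "Anything else"


print(jaccard_level("listen", "silent"))       # Jaccard similarity >= 0.9
print(jaccard_level("abcdefgh", "abcdefghi"))  # Jaccard similarity >= 0.7
print(jaccard_level("night", "nacht"))         # Anything else
```

Under this character-set definition, anagrams such as "listen" and "silent" score 1.0 without being an exact match, which is one reason to keep the exact match level ahead of the threshold levels.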