
Documentation for comparison_template_library

The comparison_template_library contains pre-made comparisons with pre-defined parameters available for use directly as described in this topic guide. However, not every comparison is available for every Splink-compatible SQL backend. More detail on creating comparisons for specific data types is also included in the topic guide.

The pre-made Splink comparison templates available for each SQL dialect are as given in this table:


Backends (table columns): DuckDB, Spark, Athena, SQLite, PostgreSQL

Comparison templates (table rows): date_comparison, email_comparison, forename_surname_comparison, name_comparison, postcode_comparison

The detailed API for each of these is outlined below.

Library comparison APIs

Bases: Comparison

Source code in splink/comparison_template_library.py
class DateComparisonBase(Comparison):
    def __init__(
        self,
        col_name: str,
        cast_strings_to_date: bool = False,
        date_format: str = None,
        invalid_dates_as_null: bool = False,
        include_exact_match_level: bool = True,
        term_frequency_adjustments: bool = False,
        separate_1st_january: bool = False,
        levenshtein_thresholds: int | list = [],
        damerau_levenshtein_thresholds: int | list = [1],
        datediff_thresholds: int | list = [1, 1, 10],
        datediff_metrics: str | list = ["month", "year", "year"],
        m_probability_exact_match: float = None,
        m_probability_1st_january: float = None,
        m_probability_or_probabilities_lev: float | list = None,
        m_probability_or_probabilities_dl: float | list = None,
        m_probability_or_probabilities_datediff: float | list = None,
        m_probability_else: float = None,
    ) -> Comparison:
        """A wrapper to generate a comparison for the date column
        `col_name` with preselected defaults.

        The default arguments will give a comparison with comparison levels:\n
        - Exact match\n
        - Damerau-Levenshtein distance <= 1\n
        - Date difference <= 1 month\n
        - Date difference <= 1 year\n
        - Date difference <= 10 years \n
        - Anything else

        Setting `separate_1st_january=True` adds a further exact match level
        for dates that fall on 1st January, placed before the general exact
        match level.

        Args:
            col_name (str): The name of the column to compare.
            cast_strings_to_date (bool, optional): Set to True to
                enable date-casting when input dates are strings. Also adjust
                date_format if date-strings are not in (yyyy-mm-dd) format.
                Defaults to False.
            date_format (str, optional): Format of input dates if date-strings
                are given. Must be consistent across record pairs. If None
                (the default), downstream functions for each backend assign
                date_format to ISO 8601 format (yyyy-mm-dd).
                Set to "yyyy-MM-dd" for Spark and "%Y-%m-%d" for DuckDB
                when invalid_dates_as_null=True
            invalid_dates_as_null (bool, optional): assign any dates that do not adhere
                to date_format to the null level. Defaults to False.
            include_exact_match_level (bool, optional): If True, include an exact match
                level. Defaults to True.
            term_frequency_adjustments (bool, optional): If True, apply term frequency
                adjustments to the exact match level. Defaults to False.
            separate_1st_january (bool, optional): If True, include a separate
                exact match comparison level when date is 1st January.
            levenshtein_thresholds (Union[int, list], optional): The thresholds to use
                for levenshtein similarity level(s).
                Defaults to []
            damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
                to use for damerau-levenshtein similarity level(s).
                Defaults to [1]
            datediff_thresholds (Union[int, list], optional): The thresholds to use
                for datediff similarity level(s).
                Defaults to [1, 1, 10].
            datediff_metrics (Union[str, list], optional): The metrics to apply
                thresholds to for datediff similarity level(s).
                Defaults to ["month", "year", "year"].
            m_probability_exact_match (float, optional): If provided, overrides the
                default m probability for the exact match level. Defaults to None.
            m_probability_1st_january (float, optional): If provided, overrides the
                default m probability for the 1st January exact match level.
                Only used when separate_1st_january=True. Defaults to None.
            m_probability_or_probabilities_lev (Union[float, list], optional):
                If provided, overrides the default m probabilities
                for the levenshtein thresholds specified. Defaults to None.
            m_probability_or_probabilities_dl (Union[float, list], optional):
                If provided, overrides the default m probabilities
                for the damerau-levenshtein thresholds specified. Defaults to None.
            m_probability_or_probabilities_datediff (Union[float, list], optional):
                If provided, overrides the default m probabilities
                for the datediff thresholds specified. Defaults to None.
            m_probability_else (float, optional): If provided, overrides the
                default m probability for the 'anything else' level. Defaults to None.

        Examples:
            === ":simple-duckdb: DuckDB"
                Basic Date Comparison
                ``` python
                import splink.duckdb.comparison_template_library as ctl
                ctl.date_comparison("date_of_birth")
                ```
                Bespoke Date Comparison
                ``` python
                import splink.duckdb.comparison_template_library as ctl
                ctl.date_comparison("date_of_birth",
                                    damerau_levenshtein_thresholds=[],
                                    levenshtein_thresholds=[2],
                                    datediff_thresholds=[1, 1],
                                    datediff_metrics=["month", "year"])
                ```
                Date Comparison casting string columns to dates and assigning
                values that do not match the date_format to the null level
                ``` python
                import splink.duckdb.comparison_template_library as ctl
                ctl.date_comparison("date_of_birth",
                                    cast_strings_to_date=True,
                                    date_format='%d/%m/%Y',
                                    invalid_dates_as_null=True)
                ```
            === ":simple-apachespark: Spark"
                Basic Date Comparison
                ``` python
                import splink.spark.comparison_template_library as ctl
                ctl.date_comparison("date_of_birth")
                ```
                Bespoke Date Comparison
                ``` python
                import splink.spark.comparison_template_library as ctl
                ctl.date_comparison("date_of_birth",
                                    damerau_levenshtein_thresholds=[],
                                    levenshtein_thresholds=[2],
                                    datediff_thresholds=[1, 1],
                                    datediff_metrics=["month", "year"])
                ```
                Date Comparison casting string columns to dates and assigning
                values that do not match the date_format to the null level
                ``` python
                import splink.spark.comparison_template_library as ctl
                ctl.date_comparison("date_of_birth",
                                    cast_strings_to_date=True,
                                    date_format='dd/MM/yyyy',
                                    invalid_dates_as_null=True)
                ```
        Returns:
            Comparison: A comparison that can be included in the Splink settings
                dictionary.
        """
        # Construct Comparison
        comparison_levels = []
        comparison_levels.append(
            self._null_level(
                col_name,
                invalid_dates_as_null=invalid_dates_as_null,
                valid_string_pattern=date_format,
            )
        )

        # Validate user inputs
        datediff_error_logger(thresholds=datediff_thresholds, metrics=datediff_metrics)

        if separate_1st_january:
            dob_first_jan = {
                "sql_condition": f"SUBSTR({col_name}_l, 6, 5) = '01-01'",
                "label_for_charts": "Date is 1st Jan",
            }
            comparison_level = and_(
                self._exact_match_level(col_name),
                dob_first_jan,
                label_for_charts="Exact match and 1st Jan",
            )

            if m_probability_1st_january:
                comparison_level["m_probability"] = m_probability_1st_january
            if term_frequency_adjustments:
                comparison_level["tf_adjustment_column"] = col_name
            comparison_levels.append(comparison_level)

        if include_exact_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                term_frequency_adjustments=term_frequency_adjustments,
                m_probability=m_probability_exact_match,
            )
            comparison_levels.append(comparison_level)

        levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
        if len(levenshtein_thresholds) > 0:
            threshold_comparison_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="levenshtein",
                distance_threshold_or_thresholds=levenshtein_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_lev,
            )
            comparison_levels = comparison_levels + threshold_comparison_levels

        damerau_levenshtein_thresholds = ensure_is_iterable(
            damerau_levenshtein_thresholds
        )
        if len(damerau_levenshtein_thresholds) > 0:
            damerau_levenshtein_thresholds = ensure_is_iterable(
                damerau_levenshtein_thresholds
            )
            threshold_comparison_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="damerau-levenshtein",
                distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_dl,
            )
            comparison_levels = comparison_levels + threshold_comparison_levels

        datediff_thresholds = ensure_is_iterable(datediff_thresholds)
        datediff_metrics = ensure_is_iterable(datediff_metrics)
        if len(datediff_thresholds) > 0:
            if m_probability_or_probabilities_datediff is None:
                m_probability_or_probabilities_datediff = [None] * len(
                    datediff_thresholds
                )
            m_probability_or_probabilities_datediff = ensure_is_iterable(
                m_probability_or_probabilities_datediff
            )

            for thres, metric, m_prob in zip(
                datediff_thresholds,
                datediff_metrics,
                m_probability_or_probabilities_datediff,
            ):
                comparison_level = self._datediff_level(
                    col_name,
                    date_threshold=thres,
                    date_metric=metric,
                    m_probability=m_prob,
                    cast_strings_to_date=cast_strings_to_date,
                    date_format=date_format,
                )
                comparison_levels.append(comparison_level)

        comparison_levels.append(
            self._else_level(m_probability=m_probability_else),
        )

        # Construct Description
        comparison_desc = ""
        if include_exact_match_level:
            comparison_desc += "Exact match vs. "

        if len(levenshtein_thresholds) > 0:
            desc = distance_threshold_description(
                col_name, "levenshtein", levenshtein_thresholds
            )
            comparison_desc += desc

        if len(damerau_levenshtein_thresholds) > 0:
            desc = distance_threshold_description(
                col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
            )
            comparison_desc += desc

        if len(datediff_thresholds) > 0:
            datediff_desc = ", ".join(
                [
                    f"{m.title()}(s): {v}"
                    for v, m in zip(datediff_thresholds, datediff_metrics)
                ]
            )
            plural = "" if len(datediff_thresholds) == 1 else "s"
            comparison_desc += (
                f"Dates within the following threshold{plural} {datediff_desc} vs. "
            )

        comparison_desc += "anything else"

        comparison_dict = {
            "comparison_description": comparison_desc,
            "comparison_levels": comparison_levels,
        }
        super().__init__(comparison_dict)

    @property
    def _is_distance_subclass(self):
        return False
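
The `separate_1st_january` branch in the listing above tests `SUBSTR({col_name}_l, 6, 5) = '01-01'`. A minimal Python sketch of that check follows, assuming the column holds ISO 8601 strings (yyyy-mm-dd). The function name `is_first_january` is hypothetical and is not part of Splink.

```python
def is_first_january(iso_date: str) -> bool:
    """Return True when an ISO 8601 date string (yyyy-mm-dd) falls on 1st January.

    SQL SUBSTR(s, 6, 5) is 1-indexed: it starts at the 6th character and takes
    5 characters, which is the "mm-dd" part. The Python equivalent is s[5:10].
    """
    return iso_date[5:10] == "01-01"


# Hypothetical sample values, for illustration only.
for value in ["1990-01-01", "1990-01-10", "2001-10-01"]:
    print(value, is_first_january(value))
```

In the listing, the condition is applied only to the left-hand record and combined with the exact match level, so the right-hand value is necessarily the same date.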

__init__(col_name, cast_strings_to_date=False, date_format=None, invalid_dates_as_null=False, include_exact_match_level=True, term_frequency_adjustments=False, separate_1st_january=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], datediff_thresholds=[1, 1, 10], datediff_metrics=['month', 'year', 'year'], m_probability_exact_match=None, m_probability_1st_january=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_datediff=None, m_probability_else=None)

A wrapper to generate a comparison for the date column col_name with preselected defaults.

The default arguments will give a comparison with comparison levels:

  • Exact match

  • Damerau-Levenshtein distance <= 1

  • Date difference <= 1 month

  • Date difference <= 1 year

  • Date difference <= 10 years

  • Anything else

Setting separate_1st_january=True adds a further exact match level for dates that fall on 1st January, placed before the general exact match level.
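
The order of the levels follows the order in which the constructor in the source listing above appends them. The sketch below mirrors that order in plain Python, without importing Splink. The names `as_list` and `sketch_level_labels` and the label strings are hypothetical, chosen for illustration; only "Exact match and 1st Jan" is taken from the listing.

```python
from typing import List


def as_list(value) -> list:
    """Wrap a scalar in a list; copy lists and tuples (stand-in for ensure_is_iterable)."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def sketch_level_labels(
    include_exact_match_level: bool = True,
    separate_1st_january: bool = False,
    levenshtein_thresholds=(),
    damerau_levenshtein_thresholds=(1,),
    datediff_thresholds=(1, 1, 10),
    datediff_metrics=("month", "year", "year"),
) -> List[str]:
    """List illustrative labels for the levels, in the order the constructor adds them."""
    labels = ["Null"]  # the null level is always added first
    if separate_1st_january:
        labels.append("Exact match and 1st Jan")
    if include_exact_match_level:
        labels.append("Exact match")
    for threshold in as_list(levenshtein_thresholds):
        labels.append(f"Levenshtein <= {threshold}")
    for threshold in as_list(damerau_levenshtein_thresholds):
        labels.append(f"Damerau-Levenshtein <= {threshold}")
    # Each datediff threshold is paired with the metric at the same position.
    for threshold, metric in zip(as_list(datediff_thresholds), as_list(datediff_metrics)):
        labels.append(f"Date difference <= {threshold} {metric}(s)")
    labels.append("Anything else")
    return labels


print(sketch_level_labels())
print(sketch_level_labels(separate_1st_january=True, levenshtein_thresholds=2))
```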

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col_name | str | The name of the column to compare. | required |
| cast_strings_to_date | bool | Set to True to enable date-casting when input dates are strings. Also adjust date_format if date-strings are not in (yyyy-mm-dd) format. | False |
| date_format | str | Format of input dates if date-strings are given. Must be consistent across record pairs. If None (the default), downstream functions for each backend assign date_format to ISO 8601 format (yyyy-mm-dd). Set to "yyyy-MM-dd" for Spark and "%Y-%m-%d" for DuckDB when invalid_dates_as_null=True. | None |
| invalid_dates_as_null | bool | Assign any dates that do not adhere to date_format to the null level. | False |
| include_exact_match_level | bool | If True, include an exact match level. | True |
| term_frequency_adjustments | bool | If True, apply term frequency adjustments to the exact match level. | False |
| separate_1st_january | bool | If True, include a separate exact match comparison level when date is 1st January. | False |
| levenshtein_thresholds | Union[int, list] | The thresholds to use for levenshtein similarity level(s). | [] |
| damerau_levenshtein_thresholds | Union[int, list] | The thresholds to use for damerau-levenshtein similarity level(s). | [1] |
| datediff_thresholds | Union[int, list] | The thresholds to use for datediff similarity level(s). | [1, 1, 10] |
| datediff_metrics | Union[str, list] | The metrics to apply thresholds to for datediff similarity level(s). | ['month', 'year', 'year'] |
| m_probability_exact_match | float | If provided, overrides the default m probability for the exact match level. | None |
| m_probability_1st_january | float | If provided, overrides the default m probability for the 1st January exact match level. Only used when separate_1st_january=True. | None |
| m_probability_or_probabilities_lev | Union[float, list] | If provided, overrides the default m probabilities for the levenshtein thresholds specified. | None |
| m_probability_or_probabilities_dl | Union[float, list] | If provided, overrides the default m probabilities for the damerau-levenshtein thresholds specified. | None |
| m_probability_or_probabilities_datediff | Union[float, list] | If provided, overrides the default m probabilities for the datediff thresholds specified. | None |
| m_probability_else | float | If provided, overrides the default m probability for the 'anything else' level. | None |
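
`datediff_thresholds`, `datediff_metrics` and `m_probability_or_probabilities_datediff` are matched up by position: the source listing normalises each to a list and iterates over them with `zip`. The sketch below shows that pairing in plain Python. The helper names `as_list` and `pair_datediff_levels` are hypothetical, and the length check is an assumption standing in for whatever `datediff_error_logger` validates; it is not Splink's actual error handling.

```python
def as_list(value) -> list:
    """Wrap a scalar in a list; copy lists and tuples (stand-in for ensure_is_iterable)."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def pair_datediff_levels(thresholds, metrics, m_probabilities=None) -> list:
    """Pair each threshold with the metric and m probability at the same position."""
    thresholds = as_list(thresholds)
    metrics = as_list(metrics)
    # Assumed validation: a bare zip would silently drop unmatched entries.
    if len(thresholds) != len(metrics):
        raise ValueError("datediff_thresholds and datediff_metrics must have equal length")
    # As in the listing, missing m probabilities become one None per threshold.
    if m_probabilities is None:
        m_probabilities = [None] * len(thresholds)
    m_probabilities = as_list(m_probabilities)
    return [
        {"date_threshold": t, "date_metric": m, "m_probability": p}
        for t, m, p in zip(thresholds, metrics, m_probabilities)
    ]


# The defaults from the signature: within 1 month, within 1 year, within 10 years.
for level in pair_datediff_levels([1, 1, 10], ["month", "year", "year"]):
    print(level)
```

A single threshold can be passed as a scalar, for example `pair_datediff_levels(1, "year")`, which is why the signature accepts either an `int` or a `list`.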

Examples:

DuckDB

Basic Date Comparison

``` python
import splink.duckdb.comparison_template_library as ctl
ctl.date_comparison("date_of_birth")
```

Bespoke Date Comparison

``` python
import splink.duckdb.comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    damerau_levenshtein_thresholds=[],
                    levenshtein_thresholds=[2],
                    datediff_thresholds=[1, 1],
                    datediff_metrics=["month", "year"])
```

Date Comparison casting string columns to dates and assigning values that do not match the date_format to the null level

``` python
import splink.duckdb.comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    cast_strings_to_date=True,
                    date_format='%d/%m/%Y',
                    invalid_dates_as_null=True)
```

Spark

Basic Date Comparison

``` python
import splink.spark.comparison_template_library as ctl
ctl.date_comparison("date_of_birth")
```

Bespoke Date Comparison

``` python
import splink.spark.comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    damerau_levenshtein_thresholds=[],
                    levenshtein_thresholds=[2],
                    datediff_thresholds=[1, 1],
                    datediff_metrics=["month", "year"])
```

Date Comparison casting string columns to dates and assigning values that do not match the date_format to the null level

``` python
import splink.spark.comparison_template_library as ctl
ctl.date_comparison("date_of_birth",
                    cast_strings_to_date=True,
                    date_format='dd/MM/yyyy',
                    invalid_dates_as_null=True)
```
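
To get a feel for which values `invalid_dates_as_null=True` would send to the null level, the idea can be approximated in plain Python. This is only an illustration: Splink performs the check in SQL on the chosen backend, and the function `matches_date_format` is hypothetical. Python's `datetime.strptime` uses `%`-style directives like those in the DuckDB examples, whereas Spark uses letter patterns such as `dd/MM/yyyy`, so the sketch corresponds to the DuckDB form only.

```python
from datetime import datetime
from typing import Optional


def matches_date_format(value: Optional[str], date_format: str = "%Y-%m-%d") -> bool:
    """Return True when value parses as a date under date_format."""
    try:
        datetime.strptime(value, date_format)
    except (ValueError, TypeError):
        # ValueError: wrong layout or an impossible date; TypeError: value is None.
        return False
    return True


# Hypothetical sample values, for illustration only.
values = ["31/12/1990", "1990-12-31", "31/02/1990", None]
null_level = [v for v in values if not matches_date_format(v, "%d/%m/%Y")]
print(null_level)
```

Here the ISO-formatted string, the impossible date and the missing value all fail the `%d/%m/%Y` check, which is the kind of record the null level is meant to absorb.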
Source code in splink/comparison_template_library.py
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
def __init__(
    self,
    col_name: str,
    cast_strings_to_date: bool = False,
    date_format: str = None,
    invalid_dates_as_null: bool = False,
    include_exact_match_level: bool = True,
    term_frequency_adjustments: bool = False,
    separate_1st_january: bool = False,
    levenshtein_thresholds: int | list = [],
    damerau_levenshtein_thresholds: int | list = [1],
    datediff_thresholds: int | list = [1, 1, 10],
    datediff_metrics: str | list = ["month", "year", "year"],
    m_probability_exact_match: float = None,
    m_probability_1st_january: float = None,
    m_probability_or_probabilities_lev: float | list = None,
    m_probability_or_probabilities_dl: float | list = None,
    m_probability_or_probabilities_datediff: float | list = None,
    m_probability_else: float = None,
) -> Comparison:
    """A wrapper to generate a comparison for a date column the data in
    `col_name` with preselected defaults.

    The default arguments will give a comparison with comparison levels:\n
    - Exact match (1st of January only)\n
    - Exact match (all other dates)\n
    - Damerau-Levenshtein distance <= 1\n
    - Date difference <= 1 year\n
    - Date difference <= 10 years \n
    - Anything else

    Args:
        col_name (str): The name of the column to compare.
        cast_strings_to_date (bool, optional): Set to True to
            enable date-casting when input dates are strings. Also adjust
            date_format if date-strings are not in (yyyy-mm-dd) format.
            Defaults to False.
        date_format (str, optional): Format of input dates if date-strings
            are given. Must be consistent across record pairs. If None
            (the default), downstream functions for each backend assign
            date_format to ISO 8601 format (yyyy-mm-dd).
            Set to "yyyy-MM-dd" for Spark and "%Y-%m-%d" for DuckDB
            when invalid_dates_as_null=True
        invalid_dates_as_null (bool, optional): assign any dates that do not adhere
            to date_format to the null level. Defaults to False.
        include_exact_match_level (bool, optional): If True, include an exact match
            level. Defaults to True.
        term_frequency_adjustments (bool, optional): If True, apply term frequency
            adjustments to the exact match level. Defaults to False.
        separate_1st_january (bool, optional): If True, include a separate
            exact match comparison level when date is 1st January.
        levenshtein_thresholds (Union[int, list], optional): The thresholds to use
            for levenshtein similarity level(s).
            Defaults to []
        damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
            to use for damerau-levenshtein similarity level(s).
            Defaults to [1]
        datediff_thresholds (Union[int, list], optional): The thresholds to use
            for datediff similarity level(s).
            Defaults to [1, 1].
        datediff_metrics (Union[str, list], optional): The metrics to apply
            thresholds to for datediff similarity level(s).
            Defaults to ["month", "year"].
        cast_strings_to_date (bool, optional): Set to True to
            enable date-casting when input dates are strings. Also adjust
            date_format if date-strings are not in (yyyy-mm-dd) format.
            Defaults to False.
        date_format (str, optional): Format of input dates if date-strings
            are given. Must be consistent across record pairs. If None
            (the default), downstream functions for each backend assign
            date_format to ISO 8601 format (yyyy-mm-dd).
        m_probability_exact_match (float, optional): If provided, overrides the
            default m probability for the exact match level. Defaults to None.
        m_probability_or_probabilities_lev (Union[float, list], optional):
            If provided, overrides the default m probabilities
            for the levenshtein thresholds specified. Defaults to None.
        m_probability_or_probabilities_dl (Union[float, list], optional):
            _description_. If provided, overrides the default m probabilities
            for the damerau-levenshtein thresholds specified. Defaults to None.
        m_probability_or_probabilities_datediff (Union[float, list], optional):
            If provided, overrides the default m probabilities
            for the datediff thresholds specified. Defaults to None.
        m_probability_else (float, optional): If provided, overrides the
            default m probability for the 'anything else' level. Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            Basic Date Comparison
            ``` python
            import splink.duckdb.comparison_template_library as ctl
            ctl.date_comparison("date_of_birth")
            ```
            Bespoke Date Comparison
            ``` python
            import splink.duckdb.comparison_template_library as ctl
            ctl.date_comparison("date_of_birth",
                                damerau_levenshtein_thresholds=[],
                                levenshtein_thresholds=[2],
                                datediff_thresholds=[1, 1],
                                datediff_metrics=["month", "year"])
            ```
            Date Comparison casting columns date and assigning values that do not
            match the date_format to the null level
            ``` python
            import splink.duckdb.comparison_template_library as ctl
            ctl.date_comparison("date_of_birth",
                                cast_strings_to_date=True,
                                date_format='%d/%m/%Y',
                                invalid_dates_as_null=True)
            ```
        === ":simple-apachespark: Spark"
            Basic Date Comparison
            ``` python
            import splink.spark.comparison_template_library as ctl
            ctl.date_comparison("date_of_birth")
            ```
            Bespoke Date Comparison
            ``` python
            import splink.spark.comparison_template_library as ctl
            ctl.date_comparison("date_of_birth",
                                damerau_levenshtein_thresholds=[],
                                levenshtein_thresholds=[2],
                                datediff_thresholds=[1, 1],
                                datediff_metrics=["month", "year"])
            ```
            Date Comparison casting columns date and assigning values that do not
            match the date_format to the null level
            ``` python
            import splink.spark.comparison_template_library as ctl
            ctl.date_comparison("date_of_birth",
                                cast_strings_to_date=True,
                                date_format='dd/mm/yyyy',
                                invalid_dates_as_null=True)
            ```
    Returns:
        Comparison: A comparison that can be inclued in the Splink settings
            dictionary.
    """
    # Construct Comparison
    comparison_levels = []
    comparison_levels.append(
        self._null_level(
            col_name,
            invalid_dates_as_null=invalid_dates_as_null,
            valid_string_pattern=date_format,
        )
    )

    # Validate user inputs
    datediff_error_logger(thresholds=datediff_thresholds, metrics=datediff_metrics)

    if separate_1st_january:
        dob_first_jan = {
            "sql_condition": f"SUBSTR({col_name}_l, 6, 5) = '01-01'",
            "label_for_charts": "Date is 1st Jan",
        }
        comparison_level = and_(
            self._exact_match_level(col_name),
            dob_first_jan,
            label_for_charts="Exact match and 1st Jan",
        )

        if m_probability_1st_january:
            comparison_level["m_probability"] = m_probability_1st_january
        if term_frequency_adjustments:
            comparison_level["tf_adjustment_column"] = col_name
        comparison_levels.append(comparison_level)

    if include_exact_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            term_frequency_adjustments=term_frequency_adjustments,
            m_probability=m_probability_exact_match,
        )
        comparison_levels.append(comparison_level)

    levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
    if len(levenshtein_thresholds) > 0:
        threshold_comparison_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="levenshtein",
            distance_threshold_or_thresholds=levenshtein_thresholds,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_lev,
        )
        comparison_levels = comparison_levels + threshold_comparison_levels

    damerau_levenshtein_thresholds = ensure_is_iterable(
        damerau_levenshtein_thresholds
    )
    if len(damerau_levenshtein_thresholds) > 0:
        damerau_levenshtein_thresholds = ensure_is_iterable(
            damerau_levenshtein_thresholds
        )
        threshold_comparison_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="damerau-levenshtein",
            distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_dl,
        )
        comparison_levels = comparison_levels + threshold_comparison_levels

    datediff_thresholds = ensure_is_iterable(datediff_thresholds)
    datediff_metrics = ensure_is_iterable(datediff_metrics)
    if len(datediff_thresholds) > 0:
        if m_probability_or_probabilities_datediff is None:
            m_probability_or_probabilities_datediff = [None] * len(
                datediff_thresholds
            )
        m_probability_or_probabilities_datediff = ensure_is_iterable(
            m_probability_or_probabilities_datediff
        )

        for thres, metric, m_prob in zip(
            datediff_thresholds,
            datediff_metrics,
            m_probability_or_probabilities_datediff,
        ):
            comparison_level = self._datediff_level(
                col_name,
                date_threshold=thres,
                date_metric=metric,
                m_probability=m_prob,
                cast_strings_to_date=cast_strings_to_date,
                date_format=date_format,
            )
            comparison_levels.append(comparison_level)

    comparison_levels.append(
        self._else_level(m_probability=m_probability_else),
    )

    # Construct Description
    comparison_desc = ""
    if include_exact_match_level:
        comparison_desc += "Exact match vs. "

    if len(levenshtein_thresholds) > 0:
        desc = distance_threshold_description(
            col_name, "levenshtein", levenshtein_thresholds
        )
        comparison_desc += desc

    if len(damerau_levenshtein_thresholds) > 0:
        desc = distance_threshold_description(
            col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
        )
        comparison_desc += desc

    if len(datediff_thresholds) > 0:
        datediff_desc = ", ".join(
            [
                f"{m.title()}(s): {v}"
                for v, m in zip(datediff_thresholds, datediff_metrics)
            ]
        )
        plural = "" if len(datediff_thresholds) == 1 else "s"
        comparison_desc += (
            f"Dates within the following threshold{plural} {datediff_desc} vs. "
        )

    comparison_desc += "anything else"

    comparison_dict = {
        "comparison_description": comparison_desc,
        "comparison_levels": comparison_levels,
    }
    super().__init__(comparison_dict)
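
The description-building step at the end of this listing is plain string assembly, so it can be tried in isolation. The sketch below re-creates only the datediff part of that logic. `build_date_description` is a hypothetical helper name, not part of Splink, and the threshold values are made up for illustration.

```python
def build_date_description(
    datediff_thresholds, datediff_metrics, include_exact_match_level=True
):
    """Re-create the description string for the datediff levels only."""
    desc = ""
    if include_exact_match_level:
        desc += "Exact match vs. "

    if len(datediff_thresholds) > 0:
        # Pair each threshold with its metric, e.g. (1, "month") -> "Month(s): 1"
        datediff_desc = ", ".join(
            f"{m.title()}(s): {v}"
            for v, m in zip(datediff_thresholds, datediff_metrics)
        )
        plural = "" if len(datediff_thresholds) == 1 else "s"
        desc += f"Dates within the following threshold{plural} {datediff_desc} vs. "

    return desc + "anything else"


print(build_date_description([1, 1, 10], ["month", "year", "year"]))
# Exact match vs. Dates within the following thresholds Month(s): 1, Year(s): 1, Year(s): 10 vs. anything else
```

Because `zip` stops at the shortest input, thresholds and metrics are best passed as lists of equal length. Any surplus entries in the longer list are silently ignored, both here and in the level-building loop of the listing.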

Bases: Comparison

Source code in splink/comparison_template_library.py
class NameComparisonBase(Comparison):
    def __init__(
        self,
        col_name: str,
        regex_extract: str = None,
        set_to_lowercase: bool = False,
        include_exact_match_level: bool = True,
        phonetic_col_name: str = None,
        term_frequency_adjustments: bool = False,
        levenshtein_thresholds: int | list = [],
        damerau_levenshtein_thresholds: int | list = [1],
        jaro_thresholds: float | list = [],
        jaro_winkler_thresholds: float | list = [0.9, 0.8],
        jaccard_thresholds: float | list = [],
        m_probability_exact_match_name: float = None,
        m_probability_exact_match_phonetic_name: float = None,
        m_probability_or_probabilities_lev: float | list = None,
        m_probability_or_probabilities_dl: float | list = None,
        m_probability_or_probabilities_jar: float | list = None,
        m_probability_or_probabilities_jw: float | list = None,
        m_probability_or_probabilities_jac: float | list = None,
        m_probability_else: float = None,
    ) -> Comparison:
        """A wrapper to generate a comparison for the name data in
        `col_name`, with preselected defaults.

        The default arguments will give a comparison with comparison levels:\n
        - Exact match \n
        - Damerau-Levenshtein Distance <= 1\n
        - Jaro Winkler similarity >= 0.9\n
        - Jaro Winkler similarity >= 0.8\n
        - Anything else

        Args:
            col_name (str): The name of the column to compare.
            regex_extract (str, optional): Regular expression pattern to evaluate
                a match on. Defaults to None.
            set_to_lowercase (bool, optional): If True, all names are set to
                lowercase during the pairwise comparisons.
                Defaults to False.
            include_exact_match_level (bool, optional): If True, include an exact match
                level for col_name. Defaults to True.
            phonetic_col_name (str, optional): The name of the column holding a
                phonetic reduction (such as dmetaphone) of col_name. Supplying it
                creates an exact match level for phonetic_col_name, provided
                include_exact_match_level is True. The phonetic column must be
                present in the dataset to use this parameter.
                Defaults to None.
            term_frequency_adjustments (bool, optional): If True, apply term
                frequency adjustments to the exact match level for "col_name"
                and, when "phonetic_col_name" is given, to its exact match
                level as well.
                Defaults to False.
            levenshtein_thresholds (Union[int, list], optional): The thresholds to use
                for levenshtein distance level(s).
                Defaults to []
            damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
                to use for damerau-levenshtein distance level(s).
                Defaults to [1]
            jaro_thresholds (Union[float, list], optional): The thresholds to use
                for jaro similarity level(s).
                Defaults to []
            jaro_winkler_thresholds (Union[float, list], optional): The thresholds to
                use for jaro_winkler similarity level(s).
                Defaults to [0.9, 0.8]
            jaccard_thresholds (Union[float, list], optional): The thresholds to use
                for jaccard similarity level(s).
                Defaults to []
            m_probability_exact_match_name (float, optional): Starting m probability
                for exact match level. Defaults to None.
            m_probability_exact_match_phonetic_name (float, optional): Starting m
                probability for exact match level for phonetic_col_name.
                Defaults to None.
            m_probability_or_probabilities_lev (Union[float, list], optional):
                Starting m probabilities for the levenshtein thresholds specified.
                If provided, overrides the default m probabilities.
                Defaults to None.
            m_probability_or_probabilities_dl (Union[float, list], optional):
                Starting m probabilities for the damerau-levenshtein thresholds
                specified. If provided, overrides the default m probabilities.
                Defaults to None.
            m_probability_or_probabilities_jar (Union[float, list], optional):
                Starting m probabilities for the jaro thresholds specified.
                Defaults to None.
            m_probability_or_probabilities_jw (Union[float, list], optional):
                Starting m probabilities for the jaro winkler thresholds specified.
                Defaults to None.
            m_probability_or_probabilities_jac (Union[float, list], optional):
                Starting m probabilities for the jaccard thresholds specified.
                Defaults to None.
            m_probability_else (float, optional): Starting m probability for
                the 'everything else' level. Defaults to None.

        Examples:
            === ":simple-duckdb: DuckDB"
                Basic Name Comparison
                ``` python
                import splink.duckdb.comparison_template_library as ctl
                ctl.name_comparison("name")
                ```
                Bespoke Name Comparison
                ``` python
                import splink.duckdb.comparison_template_library as ctl
                ctl.name_comparison("name",
                                    phonetic_col_name = "name_dm",
                                    term_frequency_adjustments = True,
                                    levenshtein_thresholds=[2],
                                    damerau_levenshtein_thresholds=[],
                                    jaro_winkler_thresholds=[],
                                    jaccard_thresholds=[1]
                                    )
                ```
            === ":simple-apachespark: Spark"
                Basic Name Comparison
                ``` python
                import splink.spark.comparison_template_library as ctl
                ctl.name_comparison("name")
                ```
                Bespoke Name Comparison
                ``` python
                import splink.spark.comparison_template_library as ctl
                ctl.name_comparison("name",
                                    phonetic_col_name = "name_dm",
                                    term_frequency_adjustments = True,
                                    levenshtein_thresholds=[2],
                                    damerau_levenshtein_thresholds=[],
                                    jaro_winkler_thresholds=[],
                                    jaccard_thresholds=[1]
                                    )
                ```
            === ":simple-sqlite: SQLite"
                Basic Name Comparison
                ``` python
                import splink.sqlite.comparison_template_library as ctl
                ctl.name_comparison("name")
                ```
                Bespoke Name Comparison
                ``` python
                import splink.sqlite.comparison_template_library as ctl
                ctl.name_comparison("name",
                                    phonetic_col_name = "name_dm",
                                    term_frequency_adjustments = True,
                                    levenshtein_thresholds=[2],
                                    damerau_levenshtein_thresholds=[],
                                    jaro_winkler_thresholds=[0.8],
                                    )
                ```

        Returns:
            Comparison: A comparison that can be included in the Splink settings
                dictionary.
        """

        # Construct Comparison
        comparison_levels = []
        comparison_levels.append(self._null_level(col_name))

        if include_exact_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                term_frequency_adjustments=term_frequency_adjustments,
                m_probability=m_probability_exact_match_name,
                include_colname_in_charts_label=True,
                regex_extract=regex_extract,
                set_to_lowercase=set_to_lowercase,
            )
            comparison_levels.append(comparison_level)

            if phonetic_col_name is not None:
                comparison_level = self._exact_match_level(
                    phonetic_col_name,
                    term_frequency_adjustments=term_frequency_adjustments,
                    m_probability=m_probability_exact_match_phonetic_name,
                    include_colname_in_charts_label=True,
                    regex_extract=regex_extract,
                    set_to_lowercase=set_to_lowercase,
                )
                comparison_levels.append(comparison_level)

        levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
        if len(levenshtein_thresholds) > 0:
            threshold_comparison_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="levenshtein",
                distance_threshold_or_thresholds=levenshtein_thresholds,
                regex_extract=regex_extract,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_lev,
            )
            comparison_levels = comparison_levels + threshold_comparison_levels

        damerau_levenshtein_thresholds = ensure_is_iterable(
            damerau_levenshtein_thresholds
        )
        if len(damerau_levenshtein_thresholds) > 0:
            threshold_comparison_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="damerau-levenshtein",
                distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
                regex_extract=regex_extract,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_dl,
            )
            comparison_levels = comparison_levels + threshold_comparison_levels

        jaro_thresholds = ensure_is_iterable(jaro_thresholds)
        if len(jaro_thresholds) > 0:
            threshold_comparison_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="jaro",
                distance_threshold_or_thresholds=jaro_thresholds,
                regex_extract=regex_extract,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_jar,
            )
            comparison_levels = comparison_levels + threshold_comparison_levels

        jaro_winkler_thresholds = ensure_is_iterable(jaro_winkler_thresholds)
        if len(jaro_winkler_thresholds) > 0:
            threshold_comparison_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="jaro-winkler",
                distance_threshold_or_thresholds=jaro_winkler_thresholds,
                regex_extract=regex_extract,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_jw,
            )
            comparison_levels = comparison_levels + threshold_comparison_levels

        jaccard_thresholds = ensure_is_iterable(jaccard_thresholds)
        if len(jaccard_thresholds) > 0:
            threshold_comparison_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="jaccard",
                distance_threshold_or_thresholds=jaccard_thresholds,
                regex_extract=regex_extract,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_jac,
            )
            comparison_levels = comparison_levels + threshold_comparison_levels

        comparison_levels.append(
            self._else_level(m_probability=m_probability_else),
        )

        # Construct Description
        comparison_desc = ""
        if include_exact_match_level:
            comparison_desc += "Exact match vs. "

        if phonetic_col_name is not None:
            comparison_desc += "Names with phonetic exact match vs. "

        if len(levenshtein_thresholds) > 0:
            desc = distance_threshold_description(
                col_name, "levenshtein", levenshtein_thresholds
            )
            comparison_desc += desc

        if len(damerau_levenshtein_thresholds) > 0:
            desc = distance_threshold_description(
                col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
            )
            comparison_desc += desc

        if len(jaro_thresholds) > 0:
            desc = distance_threshold_description(col_name, "jaro", jaro_thresholds)
            comparison_desc += desc

        if len(jaro_winkler_thresholds) > 0:
            desc = distance_threshold_description(
                col_name, "jaro_winkler", jaro_winkler_thresholds
            )
            comparison_desc += desc

        if len(jaccard_thresholds) > 0:
            desc = distance_threshold_description(
                col_name, "jaccard", jaccard_thresholds
            )
            comparison_desc += desc

        comparison_desc += "anything else"

        comparison_dict = {
            "comparison_description": comparison_desc,
            "comparison_levels": comparison_levels,
        }
        super().__init__(comparison_dict)

    @property
    def _is_distance_subclass(self):
        return False

__init__(col_name, regex_extract=None, set_to_lowercase=False, include_exact_match_level=True, phonetic_col_name=None, term_frequency_adjustments=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[1], jaro_thresholds=[], jaro_winkler_thresholds=[0.9, 0.8], jaccard_thresholds=[], m_probability_exact_match_name=None, m_probability_exact_match_phonetic_name=None, m_probability_or_probabilities_lev=None, m_probability_or_probabilities_dl=None, m_probability_or_probabilities_jar=None, m_probability_or_probabilities_jw=None, m_probability_or_probabilities_jac=None, m_probability_else=None)

A wrapper to generate a comparison for the name data in col_name, with preselected defaults.

The default arguments will give a comparison with comparison levels:

  • Exact match

  • Damerau-Levenshtein Distance <= 1

  • Jaro Winkler similarity >= 0.9

  • Jaro Winkler similarity >= 0.8

  • Anything else
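
To see how the threshold arguments translate into an ordered set of levels, the following sketch mirrors the order in which the constructor in the source listing appends levels. It is an illustration only: `name_comparison_level_labels` and its label strings are hypothetical, not Splink output, and it assumes each group of threshold levels is emitted in the order the thresholds are given. The constructor also adds a null level first, which the list above leaves out.

```python
def name_comparison_level_labels(
    include_exact_match_level=True,
    phonetic_col_name=None,
    levenshtein_thresholds=(),
    damerau_levenshtein_thresholds=(1,),
    jaro_thresholds=(),
    jaro_winkler_thresholds=(0.9, 0.8),
    jaccard_thresholds=(),
):
    """List human-readable labels for the levels, in the order they are added."""
    labels = ["Null"]  # the constructor always starts with a null level

    if include_exact_match_level:
        labels.append("Exact match")
        # The phonetic level is only added alongside the exact match level
        if phonetic_col_name is not None:
            labels.append(f"Exact match on {phonetic_col_name}")

    labels += [f"Levenshtein distance <= {t}" for t in levenshtein_thresholds]
    labels += [
        f"Damerau-Levenshtein distance <= {t}"
        for t in damerau_levenshtein_thresholds
    ]
    labels += [f"Jaro similarity >= {t}" for t in jaro_thresholds]
    labels += [f"Jaro-Winkler similarity >= {t}" for t in jaro_winkler_thresholds]
    labels += [f"Jaccard similarity >= {t}" for t in jaccard_thresholds]

    labels.append("Anything else")
    return labels


for label in name_comparison_level_labels():
    print(label)
```

Calling the helper with `phonetic_col_name="name_dm"` inserts the phonetic exact match level directly after the exact match level, which is where the constructor places it.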

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `col_name` | `str` | The name of the column to compare. | required |
| `regex_extract` | `str` | Regular expression pattern to evaluate a match on. | `None` |
| `set_to_lowercase` | `bool` | If True, all names are set to lowercase during the pairwise comparisons. | `False` |
| `include_exact_match_level` | `bool` | If True, include an exact match level for col_name. | `True` |
| `phonetic_col_name` | `str` | The name of the column holding a phonetic reduction (such as dmetaphone) of col_name. Supplying it creates an exact match level for phonetic_col_name, provided include_exact_match_level is True. The phonetic column must be present in the dataset to use this parameter. | `None` |
| `term_frequency_adjustments` | `bool` | If True, apply term frequency adjustments to the exact match level for col_name and, when phonetic_col_name is given, to its exact match level as well. | `False` |
| `levenshtein_thresholds` | `Union[int, list]` | The thresholds to use for levenshtein distance level(s). | `[]` |
| `damerau_levenshtein_thresholds` | `Union[int, list]` | The thresholds to use for damerau-levenshtein distance level(s). | `[1]` |
| `jaro_thresholds` | `Union[float, list]` | The thresholds to use for jaro similarity level(s). | `[]` |
| `jaro_winkler_thresholds` | `Union[float, list]` | The thresholds to use for jaro_winkler similarity level(s). | `[0.9, 0.8]` |
| `jaccard_thresholds` | `Union[float, list]` | The thresholds to use for jaccard similarity level(s). | `[]` |
| `m_probability_exact_match_name` | `float` | Starting m probability for exact match level. | `None` |
| `m_probability_exact_match_phonetic_name` | `float` | Starting m probability for exact match level for phonetic_col_name. | `None` |
| `m_probability_or_probabilities_lev` | `Union[float, list]` | Starting m probabilities for the levenshtein thresholds specified. If provided, overrides the default m probabilities. | `None` |
| `m_probability_or_probabilities_dl` | `Union[float, list]` | Starting m probabilities for the damerau-levenshtein thresholds specified. If provided, overrides the default m probabilities. | `None` |
| `m_probability_or_probabilities_jar` | `Union[float, list]` | Starting m probabilities for the jaro thresholds specified. | `None` |
| `m_probability_or_probabilities_jw` | `Union[float, list]` | Starting m probabilities for the jaro winkler thresholds specified. | `None` |
| `m_probability_or_probabilities_jac` | `Union[float, list]` | Starting m probabilities for the jaccard thresholds specified. | `None` |
| `m_probability_else` | `float` | Starting m probability for the 'everything else' level. | `None` |
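
To make the default `damerau_levenshtein_thresholds` value of `[1]` concrete, here is a minimal pure-Python sketch of an edit distance that counts an adjacent transposition as a single edit. It uses the optimal string alignment variant of Damerau-Levenshtein. Each SQL backend supplies its own implementation, and which variant it uses is not stated here, so treat `osa_distance` as a hypothetical stand-in rather than what Splink runs.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "th" <-> "ht"
            if (
                i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("martha", "marhta"))  # 1: one swapped pair of letters
print(osa_distance("john", "jon"))       # 1: one deleted letter
print(osa_distance("smith", "smith"))    # 0: identical
```

With a threshold of 1, the first two pairs would fall into the Damerau-Levenshtein level. Under plain Levenshtein distance the swapped pair costs 2, which is why a transposition-aware measure is often a better fit for typed names.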

Examples:

DuckDB

Basic Name Comparison

import splink.duckdb.comparison_template_library as ctl
ctl.name_comparison("name")

Bespoke Name Comparison

import splink.duckdb.comparison_template_library as ctl
ctl.name_comparison("name",
                    phonetic_col_name = "name_dm",
                    term_frequency_adjustments = True,
                    levenshtein_thresholds=[2],
                    damerau_levenshtein_thresholds=[],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1]
                    )

Spark

Basic Name Comparison

import splink.spark.comparison_template_library as ctl
ctl.name_comparison("name")

Bespoke Name Comparison

import splink.spark.comparison_template_library as ctl
ctl.name_comparison("name",
                    phonetic_col_name = "name_dm",
                    term_frequency_adjustments = True,
                    levenshtein_thresholds=[2],
                    damerau_levenshtein_thresholds=[],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1]
                    )

SQLite

Basic Name Comparison

import splink.sqlite.comparison_template_library as ctl
ctl.name_comparison("name")

Bespoke Name Comparison

import splink.sqlite.comparison_template_library as ctl
ctl.name_comparison("name",
                    phonetic_col_name = "name_dm",
                    term_frequency_adjustments = True,
                    levenshtein_thresholds=[2],
                    damerau_levenshtein_thresholds=[],
                    jaro_winkler_thresholds=[0.8],
                    )
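
The threshold and m probability arguments accept either a single number or a list. The sketch below shows one way such inputs could be normalised and paired up, following the pattern the date comparison source uses for its datediff levels. `as_list` and `pair_thresholds_with_m` are hypothetical helpers written for this illustration. The library's own `ensure_is_iterable` is not shown in this page and may behave differently.

```python
def as_list(value):
    """Wrap a bare number in a list; copy lists and tuples into a list."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def pair_thresholds_with_m(thresholds, m_probabilities=None):
    """Pair each threshold with a starting m probability (None if not given)."""
    thresholds = as_list(thresholds)
    if m_probabilities is None:
        # No starting values supplied: one None placeholder per threshold
        m_probabilities = [None] * len(thresholds)
    m_probabilities = as_list(m_probabilities)
    return list(zip(thresholds, m_probabilities))


print(pair_thresholds_with_m([0.9, 0.8]))              # [(0.9, None), (0.8, None)]
print(pair_thresholds_with_m(1, 0.3))                  # [(1, 0.3)]
print(pair_thresholds_with_m([0.9, 0.8], [0.2, 0.1]))  # [(0.9, 0.2), (0.8, 0.1)]
```

Passing an empty list, as the bespoke examples do for `damerau_levenshtein_thresholds`, yields no pairs and therefore no levels for that measure.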

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary.

Source code in splink/comparison_template_library.py
def __init__(
    self,
    col_name: str,
    regex_extract: str = None,
    set_to_lowercase: bool = False,
    include_exact_match_level: bool = True,
    phonetic_col_name: str = None,
    term_frequency_adjustments: bool = False,
    levenshtein_thresholds: int | list = [],
    damerau_levenshtein_thresholds: int | list = [1],
    jaro_thresholds: float | list = [],
    jaro_winkler_thresholds: float | list = [0.9, 0.8],
    jaccard_thresholds: float | list = [],
    m_probability_exact_match_name: float = None,
    m_probability_exact_match_phonetic_name: float = None,
    m_probability_or_probabilities_lev: float | list = None,
    m_probability_or_probabilities_dl: float | list = None,
    m_probability_or_probabilities_jar: float | list = None,
    m_probability_or_probabilities_jw: float | list = None,
    m_probability_or_probabilities_jac: float | list = None,
    m_probability_else: float = None,
) -> Comparison:
    """A wrapper to generate a comparison for the name data in
    `col_name`, with preselected defaults.

    The default arguments will give a comparison with comparison levels:\n
    - Exact match \n
    - Damerau-Levenshtein Distance <= 1\n
    - Jaro Winkler similarity >= 0.9\n
    - Jaro Winkler similarity >= 0.8\n
    - Anything else

    Args:
        col_name (str): The name of the column to compare.
        regex_extract (str, optional): Regular expression pattern to evaluate
            a match on. Defaults to None.
        set_to_lowercase (bool, optional): If True, all names are set to
            lowercase during the pairwise comparisons.
            Defaults to False.
        include_exact_match_level (bool, optional): If True, include an exact match
            level for col_name. Defaults to True.
        phonetic_col_name (str, optional): The name of the column holding a
            phonetic reduction (such as dmetaphone) of col_name. Supplying it
            creates an exact match level for phonetic_col_name, provided
            include_exact_match_level is True. The phonetic column must be
            present in the dataset to use this parameter.
            Defaults to None.
        term_frequency_adjustments (bool, optional): If True, apply term
            frequency adjustments to the exact match level for "col_name"
            and, when "phonetic_col_name" is given, to its exact match
            level as well.
            Defaults to False.
        levenshtein_thresholds (Union[int, list], optional): The thresholds to use
            for levenshtein distance level(s).
            Defaults to []
        damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
            to use for damerau-levenshtein distance level(s).
            Defaults to [1]
        jaro_thresholds (Union[float, list], optional): The thresholds to use
            for jaro similarity level(s).
            Defaults to []
        jaro_winkler_thresholds (Union[float, list], optional): The thresholds to
            use for jaro_winkler similarity level(s).
            Defaults to [0.9, 0.8]
        jaccard_thresholds (Union[float, list], optional): The thresholds to use
            for jaccard similarity level(s).
            Defaults to []
        m_probability_exact_match_name (float, optional): Starting m probability
            for exact match level. Defaults to None.
        m_probability_exact_match_phonetic_name (float, optional): Starting m
            probability for exact match level for phonetic_col_name.
            Defaults to None.
        m_probability_or_probabilities_lev (Union[float, list], optional):
            Starting m probabilities for the levenshtein thresholds specified.
            If provided, overrides the default m probabilities.
            Defaults to None.
        m_probability_or_probabilities_dl (Union[float, list], optional):
            Starting m probabilities for the damerau-levenshtein thresholds
            specified. If provided, overrides the default m probabilities.
            Defaults to None.
        m_probability_or_probabilities_jar (Union[float, list], optional):
            Starting m probabilities for the jaro thresholds specified.
            Defaults to None.
        m_probability_or_probabilities_jw (Union[float, list], optional):
            Starting m probabilities for the jaro winkler thresholds specified.
            Defaults to None.
        m_probability_or_probabilities_jac (Union[float, list], optional):
            Starting m probabilities for the jaccard thresholds specified.
            Defaults to None.
        m_probability_else (float, optional): Starting m probability for
            the 'everything else' level. Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            Basic Name Comparison
            ``` python
            import splink.duckdb.comparison_template_library as ctl
            ctl.name_comparison("name")
            ```
            Bespoke Name Comparison
            ``` python
            import splink.duckdb.comparison_template_library as ctl
            ctl.name_comparison("name",
                                phonetic_col_name = "name_dm",
                                term_frequency_adjustments = True,
                                levenshtein_thresholds=[2],
                                damerau_levenshtein_thresholds=[],
                                jaro_winkler_thresholds=[],
                                jaccard_thresholds=[1]
                                )
            ```
        === ":simple-apachespark: Spark"
            Basic Name Comparison
            ``` python
            import splink.spark.comparison_template_library as ctl
            ctl.name_comparison("name")
            ```
            Bespoke Name Comparison
            ``` python
            import splink.spark.comparison_template_library as ctl
            ctl.name_comparison("name",
                                phonetic_col_name = "name_dm",
                                term_frequency_adjustments = True,
                                levenshtein_thresholds=[2],
                                damerau_levenshtein_thresholds=[],
                                jaro_winkler_thresholds=[],
                                jaccard_thresholds=[1]
                                )
            ```
        === ":simple-sqlite: SQLite"
            Basic Name Comparison
            ``` python
            import splink.sqlite.comparison_template_library as ctl
            ctl.name_comparison("name")
            ```
            Bespoke Name Comparison
            ``` python
            import splink.sqlite.comparison_template_library as ctl
            ctl.name_comparison("name",
                                phonetic_col_name = "name_dm",
                                term_frequency_adjustments = True,
                                levenshtein_thresholds=[2],
                                damerau_levenshtein_thresholds=[],
                                jaro_winkler_thresholds=[0.8],
                                )
            ```

    Returns:
        Comparison: A comparison that can be included in the Splink settings
            dictionary.
    """

    # Construct Comparison
    comparison_levels = []
    comparison_levels.append(self._null_level(col_name))

    if include_exact_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            term_frequency_adjustments=term_frequency_adjustments,
            m_probability=m_probability_exact_match_name,
            include_colname_in_charts_label=True,
            regex_extract=regex_extract,
            set_to_lowercase=set_to_lowercase,
        )
        comparison_levels.append(comparison_level)

        if phonetic_col_name is not None:
            comparison_level = self._exact_match_level(
                phonetic_col_name,
                term_frequency_adjustments=term_frequency_adjustments,
                m_probability=m_probability_exact_match_phonetic_name,
                include_colname_in_charts_label=True,
                regex_extract=regex_extract,
                set_to_lowercase=set_to_lowercase,
            )
            comparison_levels.append(comparison_level)

    levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
    if len(levenshtein_thresholds) > 0:
        threshold_comparison_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="levenshtein",
            distance_threshold_or_thresholds=levenshtein_thresholds,
            regex_extract=regex_extract,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_lev,
        )
        comparison_levels = comparison_levels + threshold_comparison_levels

    damerau_levenshtein_thresholds = ensure_is_iterable(
        damerau_levenshtein_thresholds
    )
    if len(damerau_levenshtein_thresholds) > 0:
        threshold_comparison_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="damerau-levenshtein",
            distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
            regex_extract=regex_extract,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_dl,
        )
        comparison_levels = comparison_levels + threshold_comparison_levels

    jaro_thresholds = ensure_is_iterable(jaro_thresholds)
    if len(jaro_thresholds) > 0:
        threshold_comparison_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="jaro",
            distance_threshold_or_thresholds=jaro_thresholds,
            regex_extract=regex_extract,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_jar,
        )
        comparison_levels = comparison_levels + threshold_comparison_levels

    jaro_winkler_thresholds = ensure_is_iterable(jaro_winkler_thresholds)
    if len(jaro_winkler_thresholds) > 0:
        threshold_comparison_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="jaro-winkler",
            distance_threshold_or_thresholds=jaro_winkler_thresholds,
            regex_extract=regex_extract,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_jw,
        )
        comparison_levels = comparison_levels + threshold_comparison_levels

    jaccard_thresholds = ensure_is_iterable(jaccard_thresholds)
    if len(jaccard_thresholds) > 0:
        threshold_comparison_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="jaccard",
            distance_threshold_or_thresholds=jaccard_thresholds,
            regex_extract=regex_extract,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_jac,
        )
        comparison_levels = comparison_levels + threshold_comparison_levels

    comparison_levels.append(
        self._else_level(m_probability=m_probability_else),
    )

    # Construct Description
    comparison_desc = ""
    if include_exact_match_level:
        comparison_desc += "Exact match vs. "

    if phonetic_col_name is not None:
        comparison_desc += "Names with phonetic exact match vs. "

    if len(levenshtein_thresholds) > 0:
        desc = distance_threshold_description(
            col_name, "levenshtein", levenshtein_thresholds
        )
        comparison_desc += desc

    if len(damerau_levenshtein_thresholds) > 0:
        desc = distance_threshold_description(
            col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
        )
        comparison_desc += desc

    if len(jaro_thresholds) > 0:
        desc = distance_threshold_description(col_name, "jaro", jaro_thresholds)
        comparison_desc += desc

    if len(jaro_winkler_thresholds) > 0:
        desc = distance_threshold_description(
            col_name, "jaro-winkler", jaro_winkler_thresholds
        )
        comparison_desc += desc

    if len(jaccard_thresholds) > 0:
        desc = distance_threshold_description(
            col_name, "jaccard", jaccard_thresholds
        )
        comparison_desc += desc

    comparison_desc += "anything else"

    comparison_dict = {
        "comparison_description": comparison_desc,
        "comparison_levels": comparison_levels,
    }
    super().__init__(comparison_dict)
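
The listing above normalises each thresholds argument and then adds fuzzy-match levels only for the distance functions whose threshold list is non-empty. A minimal sketch of that pattern follows. `as_threshold_list` is a hypothetical stand-in for `ensure_is_iterable` (its assumed behaviour is to wrap a scalar in a list), and `active_distance_functions` is an illustrative helper that is not part of Splink.

```python
from collections.abc import Iterable


def as_threshold_list(value):
    """Hypothetical stand-in for ensure_is_iterable: wrap scalars in a list."""
    if isinstance(value, Iterable) and not isinstance(value, str):
        return list(value)
    return [value]


def active_distance_functions(**thresholds):
    """Return the distance function names that would contribute levels.

    A function contributes levels only when its normalised threshold
    list is non-empty, mirroring the `len(...) > 0` checks above.
    """
    return [
        name
        for name, value in thresholds.items()
        if len(as_threshold_list(value)) > 0
    ]


# Same thresholds as the SQLite example: a scalar is accepted as well as a list.
selected = active_distance_functions(
    levenshtein=2,
    damerau_levenshtein=[],
    jaro_winkler=[0.8],
)
print(selected)
```

Passing an empty list is therefore how a distance function is switched off, which is why the examples set `damerau_levenshtein_thresholds=[]` explicitly.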

Bases: Comparison

Source code in splink/comparison_template_library.py
class ForenameSurnameComparisonBase(Comparison):
    def __init__(
        self,
        forename_col_name,
        surname_col_name,
        set_to_lowercase=False,
        include_exact_match_level: bool = True,
        include_columns_reversed: bool = True,
        term_frequency_adjustments: bool = False,
        tf_adjustment_col_forename_and_surname: str = None,
        phonetic_forename_col_name: str = None,
        phonetic_surname_col_name: str = None,
        levenshtein_thresholds: int | list = [],
        damerau_levenshtein_thresholds: int | list = [],
        jaro_winkler_thresholds: float | list = [0.88],
        jaro_thresholds: float | list = [],
        jaccard_thresholds: float | list = [],
        m_probability_exact_match_forename_surname: float = None,
        m_probability_exact_match_phonetic_forename_surname: float = None,
        m_probability_columns_reversed_forename_surname: float = None,
        m_probability_exact_match_surname: float = None,
        m_probability_exact_match_forename: float = None,
        m_probability_or_probabilities_surname_lev: float | list = None,
        m_probability_or_probabilities_surname_dl: float | list = None,
        m_probability_or_probabilities_surname_jw: float | list = None,
        m_probability_or_probabilities_surname_jar: float | list = None,
        m_probability_or_probabilities_surname_jac: float | list = None,
        m_probability_or_probabilities_forename_lev: float | list = None,
        m_probability_or_probabilities_forename_dl: float | list = None,
        m_probability_or_probabilities_forename_jw: float | list = None,
        m_probability_or_probabilities_forename_jar: float | list = None,
        m_probability_or_probabilities_forename_jac: float | list = None,
        m_probability_else: float = None,
    ) -> Comparison:
        """A wrapper to generate a comparison for the forename and surname
        columns with preselected defaults.

        The default arguments will give a comparison with comparison levels:\n
        - Exact match forename and surname\n
        - Match of forename and surname reversed\n
        - Exact match surname\n
        - Exact match forename\n
        - Fuzzy match surname jaro-winkler >= 0.88\n
        - Fuzzy match forename jaro-winkler >= 0.88\n
        - Anything else

        Args:
            forename_col_name (str): The name of the forename column to compare
            surname_col_name (str): The name of the surname column to compare
            set_to_lowercase (bool): If True, all names are set to lowercase
                during the pairwise comparisons.
                Defaults to False
            include_exact_match_level (bool, optional): If True, include an exact match
                level for the full name (forename and surname). Defaults to True.
            include_columns_reversed (bool, optional): If True, include a comparison
                level for forename and surname being swapped. Defaults to True
            term_frequency_adjustments (bool, optional): If True, apply term
                frequency adjustments to the exact match level for forename_col_name
                and surname_col_name.
                Applies term frequency adjustments to full name exact match level
                and columns reversed exact match level if
                tf_adjustment_col_forename_and_surname is provided.
                Applies term frequency adjustments to phonetic_forename_col_name
                and phonetic_surname_col_name exact match levels, if they are provided.
                Defaults to False.
            tf_adjustment_col_forename_and_surname (str, optional): The name
                of a combined forename surname column. This column is used to provide
                term frequency adjustments for forename surname exact match and columns
                reversed levels.
                Defaults to None
            phonetic_forename_col_name (str, optional): The name of the column with
                phonetic reduction (such as dmetaphone) of forename_col_name. Including
                this parameter along with phonetic_surname_col_name will create an exact
                match level for "Full name phonetic match".
                The phonetic column must be present in the dataset to use this
                parameter.
                Defaults to None
            phonetic_surname_col_name (str, optional): The name of the column with
                phonetic reduction (such as dmetaphone) of surname_col_name. Including
                this parameter along with phonetic_forename_col_name will create an
                exact match level for "Full name phonetic match".
                The phonetic column must be present in
                the dataset to use this parameter.
                Defaults to None
            levenshtein_thresholds (Union[int, list], optional): The thresholds
                to use for levenshtein similarity level(s) for surname_col_name
                and forename_col_name.
                Defaults to []
            damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
                to use for damerau-levenshtein similarity level(s) for
                surname_col_name and forename_col_name.
                Defaults to []
            jaro_winkler_thresholds (Union[float, list], optional): The thresholds
                to use for jaro_winkler similarity level(s) for surname_col_name
                and forename_col_name.
                Defaults to [0.88]
            jaro_thresholds (Union[float, list], optional): The thresholds
                to use for jaro similarity level(s) for surname_col_name
                and forename_col_name.
                Defaults to []
            jaccard_thresholds (Union[float, list], optional): The thresholds to
                use for jaccard similarity level(s) for surname_col_name and
                forename_col_name.
                Defaults to []
            m_probability_exact_match_forename_surname (float, optional): If provided,
                overrides the default m probability for the exact match level for
                forename and surname.
                Defaults to None.
            m_probability_exact_match_phonetic_forename_surname (float, optional): If
                provided, overrides the default m probability for the phonetic match
                level for forename and surname.
                Defaults to None.
            m_probability_columns_reversed_forename_surname (float, optional): If
                provided, overrides the default m probability for the columns reversed
                level for forename and surname.
                Defaults to None.
            m_probability_exact_match_surname (float, optional): If provided,
                overrides the default m probability for the surname exact match
                level.
                Defaults to None.
            m_probability_exact_match_forename (float, optional): If provided,
                overrides the default m probability for the forename exact match
                level.
                Defaults to None.
            m_probability_or_probabilities_surname_lev (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                surname levenshtein thresholds specified. Defaults to None.
            m_probability_or_probabilities_surname_dl (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                surname damerau-levenshtein thresholds specified. Defaults to None.
            m_probability_or_probabilities_surname_jw (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                surname jaro-winkler thresholds specified. Defaults to None.
            m_probability_or_probabilities_surname_jar (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                surname jaro thresholds specified. Defaults to None.
            m_probability_or_probabilities_surname_jac (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                surname jaccard thresholds specified. Defaults to None.
            m_probability_or_probabilities_forename_lev (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                forename levenshtein thresholds specified. Defaults to None.
            m_probability_or_probabilities_forename_dl (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                forename damerau-levenshtein thresholds specified. Defaults to None.
            m_probability_or_probabilities_forename_jw (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                forename jaro-winkler thresholds specified. Defaults to None.
            m_probability_or_probabilities_forename_jar (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                forename jaro thresholds specified. Defaults to None.
            m_probability_or_probabilities_forename_jac (Union[float, list], optional):
                If provided, overrides the default m probabilities for the
                forename jaccard thresholds specified. Defaults to None.
            m_probability_else (float, optional): If provided, overrides the
                default m probability for the 'anything else' level. Defaults to None.

        Examples:
            === ":simple-duckdb: DuckDB"
                Basic Forename Surname Comparison
                ```py
                import splink.duckdb.comparison_template_library as ctl
                ctl.forename_surname_comparison("first_name", "surname")
                ```

                Bespoke Forename Surname Comparison
                ```py
                import splink.duckdb.comparison_template_library as ctl
                ctl.forename_surname_comparison(
                        "forename",
                        "surname",
                        term_frequency_adjustments=True,
                        tf_adjustment_col_forename_and_surname="full_name",
                        phonetic_forename_col_name="forename_dm",
                        phonetic_surname_col_name="surname_dm",
                        levenshtein_thresholds=[2],
                        jaro_winkler_thresholds=[],
                        jaccard_thresholds=[1],
                    )
                ```
            === ":simple-apachespark: Spark"
                Basic Forename Surname Comparison
                ```py
                import splink.spark.comparison_template_library as ctl
                ctl.forename_surname_comparison("first_name", "surname")
                ```

                Bespoke Forename Surname Comparison
                ```py
                import splink.spark.comparison_template_library as ctl
                ctl.forename_surname_comparison(
                        "forename",
                        "surname",
                        term_frequency_adjustments=True,
                        tf_adjustment_col_forename_and_surname="full_name",
                        phonetic_forename_col_name="forename_dm",
                        phonetic_surname_col_name="surname_dm",
                        levenshtein_thresholds=[2],
                        jaro_winkler_thresholds=[],
                        jaccard_thresholds=[1],
                    )
                ```
            === ":simple-sqlite: SQLite"
                Basic Forename Surname Comparison
                ```py
                import splink.sqlite.comparison_template_library as ctl
                ctl.forename_surname_comparison("first_name", "surname")
                ```

                Bespoke Forename Surname Comparison
                ```py
                import splink.sqlite.comparison_template_library as ctl
                ctl.forename_surname_comparison(
                        "forename",
                        "surname",
                        term_frequency_adjustments=True,
                        tf_adjustment_col_forename_and_surname="full_name",
                        phonetic_forename_col_name="forename_dm",
                        phonetic_surname_col_name="surname_dm",
                        levenshtein_thresholds=[2],
                        jaro_winkler_thresholds=[0.8],
                    )
                ```


        Returns:
            Comparison: A comparison that can be included in the Splink settings
                dictionary.
        """

        # Construct Comparison
        comparison_levels = []

        comparison_level = and_(
            self._null_level(forename_col_name),
            self._null_level(surname_col_name),
            label_for_charts="Null",
        )

        comparison_levels.append(comparison_level)

        ### Forename surname exact match

        if include_exact_match_level:
            if set_to_lowercase:
                forename_col_name_l = f"lower({forename_col_name}_l)"
                forename_col_name_r = f"lower({forename_col_name}_r)"
                surname_col_name_l = f"lower({surname_col_name}_l)"
                surname_col_name_r = f"lower({surname_col_name}_r)"
            else:
                forename_col_name_l = f"{forename_col_name}_l"
                forename_col_name_r = f"{forename_col_name}_r"
                surname_col_name_l = f"{surname_col_name}_l"
                surname_col_name_r = f"{surname_col_name}_r"

            comparison_level = {
                "sql_condition": f"{forename_col_name_l} = {forename_col_name_r} "
                f"AND {surname_col_name_l} = {surname_col_name_r}",
                "tf_adjustment_column": tf_adjustment_col_forename_and_surname,
                "tf_adjustment_weight": 1.0,
                "m_probability": m_probability_exact_match_forename_surname,
                "label_for_charts": "Full name exact match",
            }

            comparison_levels.append(comparison_level)

        ### Phonetic forename surname match

        if (
            phonetic_forename_col_name is not None
            and phonetic_surname_col_name is not None
        ):
            comparison_level = {
                "sql_condition": f"{phonetic_forename_col_name}_l = "
                f"{phonetic_forename_col_name}_r"
                f" AND {phonetic_surname_col_name}_l = {phonetic_surname_col_name}_r",
                "tf_adjustment_column": tf_adjustment_col_forename_and_surname,
                "tf_adjustment_weight": 1.0,
                "m_probability": m_probability_exact_match_phonetic_forename_surname,
                "label_for_charts": "Full name phonetic match",
            }
            comparison_levels.append(comparison_level)

        ### Columns reversed match

        if include_columns_reversed:
            comparison_level = self._columns_reversed_level(
                forename_col_name,
                surname_col_name,
                set_to_lowercase=set_to_lowercase,
                tf_adjustment_column=tf_adjustment_col_forename_and_surname,
                m_probability=m_probability_columns_reversed_forename_surname,
            )
            comparison_levels.append(comparison_level)

        ### Surname Exact match

        comparison_level = self._exact_match_level(
            surname_col_name,
            set_to_lowercase=set_to_lowercase,
            term_frequency_adjustments=term_frequency_adjustments,
            m_probability=m_probability_exact_match_surname,
            include_colname_in_charts_label=True,
        )
        comparison_levels.append(comparison_level)

        ### Forename Exact match

        comparison_level = self._exact_match_level(
            forename_col_name,
            set_to_lowercase=set_to_lowercase,
            term_frequency_adjustments=term_frequency_adjustments,
            m_probability=m_probability_exact_match_forename,
            include_colname_in_charts_label=True,
        )
        comparison_levels.append(comparison_level)

        ### Ensure fuzzy match thresholds are iterable
        levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
        damerau_levenshtein_thresholds = ensure_is_iterable(
            damerau_levenshtein_thresholds
        )
        jaro_thresholds = ensure_is_iterable(jaro_thresholds)
        jaro_winkler_thresholds = ensure_is_iterable(jaro_winkler_thresholds)
        jaccard_thresholds = ensure_is_iterable(jaccard_thresholds)

        ### Surname Fuzzy match
        if len(levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                surname_col_name,
                distance_function_name="levenshtein",
                distance_threshold_or_thresholds=levenshtein_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_lev,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(damerau_levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                surname_col_name,
                distance_function_name="damerau-levenshtein",
                distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_dl,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                surname_col_name,
                distance_function_name="jaro",
                distance_threshold_or_thresholds=jaro_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_jar,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_winkler_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                surname_col_name,
                distance_function_name="jaro-winkler",
                distance_threshold_or_thresholds=jaro_winkler_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_jw,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaccard_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                surname_col_name,
                distance_function_name="jaccard",
                distance_threshold_or_thresholds=jaccard_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_jac,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        ### Forename Fuzzy match

        if len(levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                forename_col_name,
                distance_function_name="levenshtein",
                distance_threshold_or_thresholds=levenshtein_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_lev,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(damerau_levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                forename_col_name,
                distance_function_name="damerau-levenshtein",
                distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_dl,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_winkler_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                forename_col_name,
                distance_function_name="jaro-winkler",
                distance_threshold_or_thresholds=jaro_winkler_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_jw,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                forename_col_name,
                distance_function_name="jaro",
                distance_threshold_or_thresholds=jaro_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_jar,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaccard_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                forename_col_name,
                distance_function_name="jaccard",
                distance_threshold_or_thresholds=jaccard_thresholds,
                set_to_lowercase=set_to_lowercase,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_jac,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        comparison_levels.append(
            self._else_level(m_probability=m_probability_else),
        )

        # Construct Description
        comparison_desc = ""
        if include_exact_match_level:
            comparison_desc += "Exact match vs. "

        if (
            phonetic_forename_col_name is not None
            and phonetic_surname_col_name is not None
        ):
            comparison_desc += "Phonetic match forename and surname vs. "

        if include_columns_reversed:
            comparison_desc += "Forename and surname columns reversed vs. "

        comparison_desc += "Surname exact match vs. "

        comparison_desc += "Forename exact match vs. "

        if len(levenshtein_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                surname_col_name, "levenshtein", levenshtein_thresholds
            )

        if len(damerau_levenshtein_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                surname_col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
            )

        if len(jaro_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                surname_col_name, "jaro", jaro_thresholds
            )

        if len(jaro_winkler_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                surname_col_name, "jaro-winkler", jaro_winkler_thresholds
            )

        if len(jaccard_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                surname_col_name, "jaccard", jaccard_thresholds
            )

        if len(levenshtein_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                forename_col_name, "levenshtein", levenshtein_thresholds
            )

        if len(damerau_levenshtein_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                forename_col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
            )

        if len(jaro_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                forename_col_name, "jaro", jaro_thresholds
            )

        if len(jaro_winkler_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                forename_col_name, "jaro-winkler", jaro_winkler_thresholds
            )

        if len(jaccard_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                forename_col_name, "jaccard", jaccard_thresholds
            )

        comparison_desc += "anything else"

        comparison_dict = {
            "comparison_description": comparison_desc,
            "comparison_levels": comparison_levels,
        }
        super().__init__(comparison_dict)

    @property
    def _is_distance_subclass(self):
        return False

__init__(forename_col_name, surname_col_name, set_to_lowercase=False, include_exact_match_level=True, include_columns_reversed=True, term_frequency_adjustments=False, tf_adjustment_col_forename_and_surname=None, phonetic_forename_col_name=None, phonetic_surname_col_name=None, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[], jaro_winkler_thresholds=[0.88], jaro_thresholds=[], jaccard_thresholds=[], m_probability_exact_match_forename_surname=None, m_probability_exact_match_phonetic_forename_surname=None, m_probability_columns_reversed_forename_surname=None, m_probability_exact_match_surname=None, m_probability_exact_match_forename=None, m_probability_or_probabilities_surname_lev=None, m_probability_or_probabilities_surname_dl=None, m_probability_or_probabilities_surname_jw=None, m_probability_or_probabilities_surname_jar=None, m_probability_or_probabilities_surname_jac=None, m_probability_or_probabilities_forename_lev=None, m_probability_or_probabilities_forename_dl=None, m_probability_or_probabilities_forename_jw=None, m_probability_or_probabilities_forename_jar=None, m_probability_or_probabilities_forename_jac=None, m_probability_else=None)

A wrapper to generate a comparison for forename and surname columns (forename_col_name and surname_col_name) with preselected defaults.

The default arguments will give a comparison with comparison levels:

  • Exact match forename and surname

  • Match of forename and surname reversed

  • Exact match surname

  • Exact match forename

  • Fuzzy match surname jaro-winkler >= 0.88

  • Fuzzy match forename jaro-winkler >= 0.88

  • Anything else
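
The way the arguments determine which levels appear, and in what order, can be sketched in plain Python. This is an illustrative sketch only: `expected_levels` and its label strings are hypothetical and not part of Splink, and the `<=` operator shown for the edit-distance thresholds is an assumption. The source listing for `__init__` also adds a null level first, which the list above omits.

```python
def expected_levels(
    include_exact_match_level=True,
    include_columns_reversed=True,
    phonetic_columns_given=False,
    levenshtein_thresholds=(),
    damerau_levenshtein_thresholds=(),
    jaro_thresholds=(),
    jaro_winkler_thresholds=(0.88,),
    jaccard_thresholds=(),
):
    """Return illustrative level labels in evaluation order (hypothetical helper)."""
    levels = ["Null"]  # the source always adds a null level first
    if include_exact_match_level:
        levels.append("Exact match forename and surname")
    if phonetic_columns_given:  # both phonetic column names were supplied
        levels.append("Phonetic match forename and surname")
    if include_columns_reversed:
        levels.append("Match of forename and surname reversed")
    levels += ["Exact match surname", "Exact match forename"]

    # Fuzzy levels: all surname levels first, then all forename levels,
    # each in the order levenshtein, damerau-levenshtein, jaro,
    # jaro-winkler, jaccard (mirroring the source listing).
    fuzzy = [
        ("levenshtein", "<=", levenshtein_thresholds),
        ("damerau-levenshtein", "<=", damerau_levenshtein_thresholds),
        ("jaro", ">=", jaro_thresholds),
        ("jaro-winkler", ">=", jaro_winkler_thresholds),
        ("jaccard", ">=", jaccard_thresholds),
    ]
    for column in ("surname", "forename"):
        for name, operator, thresholds in fuzzy:
            for threshold in thresholds:
                levels.append(f"Fuzzy match {column} {name} {operator} {threshold}")

    levels.append("Anything else")
    return levels


for label in expected_levels():
    print(label)
```

Passing, for example, `levenshtein_thresholds=(2,)` and `jaro_winkler_thresholds=()` swaps the two jaro-winkler levels for two levenshtein levels in the same positions.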

Parameters:

Name Type Description Default
forename_col_name str

The name of the forename column to compare

required
surname_col_name str

The name of the surname column to compare

required
set_to_lowercase bool

If True, all names are set to lowercase during the pairwise comparisons. Defaults to False

False
include_exact_match_level bool

If True, include an exact match level for forename and surname. Defaults to True.

True
include_columns_reversed bool

If True, include a comparison level for forename and surname being swapped. Defaults to True

True
term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level for forename_col_name and surname_col_name. Applies term frequency adjustments to full name exact match level and columns reversed exact match level if tf_adjustment_col_forename_and_surname is provided. Applies term frequency adjustments to phonetic_forename_col_name and phonetic_surname_col_name exact match levels, if they are provided. Defaults to False.

False
tf_adjustment_col_forename_and_surname str

The name of a combined forename surname column. This column is used to provide term frequency adjustments for forename surname exact match and columns reversed levels. Defaults to None

None
phonetic_forename_col_name str

The name of the column with phonetic reduction (such as dmetaphone) of forename_col_name. Including this parameter along with phonetic_surname_col_name will create an exact match level for "Full name phonetic match". The phonetic column must be present in the dataset to use this parameter. Defaults to None

None
phonetic_surname_col_name str

The name of the column with phonetic reduction (such as dmetaphone) of surname_col_name. Including this parameter along with phonetic_forename_col_name will create an exact match level for "Full name phonetic match". The phonetic column must be present in the dataset to use this parameter. Defaults to None

None
levenshtein_thresholds Union[int, list]

The thresholds to use for levenshtein similarity level(s) for surname_col_name and forename_col_name. Defaults to []

[]
damerau_levenshtein_thresholds Union[int, list]

The thresholds to use for damerau-levenshtein similarity level(s) for surname_col_name and forename_col_name. Defaults to []

[]
jaro_winkler_thresholds Union[float, list]

The thresholds to use for jaro_winkler similarity level(s) for surname_col_name and forename_col_name. Defaults to [0.88]

[0.88]
jaro_thresholds Union[float, list]

The thresholds to use for jaro similarity level(s) for surname_col_name and forename_col_name. Defaults to []

[]
jaccard_thresholds Union[float, list]

The thresholds to use for jaccard similarity level(s) for surname_col_name and forename_col_name. Defaults to []

[]
m_probability_exact_match_forename_surname float

If provided, overrides the default m probability for the exact match level for forename and surname. Defaults to None.

None
m_probability_exact_match_phonetic_forename_surname float

If provided, overrides the default m probability for the phonetic match level for forename and surname. Defaults to None.

None
m_probability_columns_reversed_forename_surname float

If provided, overrides the default m probability for the columns reversed level for forename and surname. Defaults to None.

None
m_probability_exact_match_surname float

If provided, overrides the default m probability for the surname exact match level. Defaults to None.

None
m_probability_exact_match_forename float

If provided, overrides the default m probability for the forename exact match level. Defaults to None.

None
m_probability_or_probabilities_surname_lev Union[float, list]

If provided, overrides the default m probabilities for the surname levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_surname_dl Union[float, list]

If provided, overrides the default m probabilities for the surname damerau-levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_surname_jw Union[float, list]

If provided, overrides the default m probabilities for the surname jaro-winkler thresholds specified. Defaults to None.

None
m_probability_or_probabilities_surname_jar Union[float, list]

If provided, overrides the default m probabilities for the surname jaro thresholds specified. Defaults to None.

None
m_probability_or_probabilities_surname_jac Union[float, list]

If provided, overrides the default m probabilities for the surname jaccard thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_lev Union[float, list]

If provided, overrides the default m probabilities for the forename levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_dl Union[float, list]

If provided, overrides the default m probabilities for the forename damerau-levenshtein thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_jw Union[float, list]

If provided, overrides the default m probabilities for the forename jaro-winkler thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_jar Union[float, list]

If provided, overrides the default m probabilities for the forename jaro thresholds specified. Defaults to None.

None
m_probability_or_probabilities_forename_jac Union[float, list]

If provided, overrides the default m probabilities for the forename jaccard thresholds specified. Defaults to None.

None
m_probability_else float

If provided, overrides the default m probability for the 'anything else' level. Defaults to None.

None
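
Each `*_thresholds` parameter accepts either a single value or a list, and the source passes every one of them through `ensure_is_iterable` before checking its length. That helper's implementation is not shown on this page, so the following is only a minimal sketch of the behaviour one would expect; `as_threshold_list` is a hypothetical stand-in, not Splink's function.

```python
from collections.abc import Iterable


def as_threshold_list(value):
    """Wrap a scalar threshold in a list; pass lists and tuples through.

    Hypothetical stand-in for the `ensure_is_iterable` call in the source.
    """
    # Strings are iterable but should be treated as a single value.
    if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        return list(value)
    return [value]


# A scalar and a one-element list describe the same single threshold level.
print(as_threshold_list(0.88))
print(as_threshold_list([0.88]))
# An empty list means "no levels for this distance function".
print(len(as_threshold_list([])) > 0)
```

Under this reading, `jaro_winkler_thresholds=0.88` and `jaro_winkler_thresholds=[0.88]` would produce the same levels, and an empty list disables that distance function.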

Examples:

DuckDB

Basic Forename Surname Comparison

import splink.duckdb.comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")

Bespoke Forename Surname Comparison

import splink.duckdb.comparison_template_library as ctl
ctl.forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)

Spark

Basic Forename Surname Comparison

import splink.spark.comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")

Bespoke Forename Surname Comparison

import splink.spark.comparison_template_library as ctl
ctl.forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[],
    jaccard_thresholds=[1],
)

SQLite

Basic Forename Surname Comparison

import splink.sqlite.comparison_template_library as ctl
ctl.forename_surname_comparison("first_name", "surname")

Bespoke Forename Surname Comparison

import splink.sqlite.comparison_template_library as ctl
ctl.forename_surname_comparison(
    "forename",
    "surname",
    term_frequency_adjustments=True,
    tf_adjustment_col_forename_and_surname="full_name",
    phonetic_forename_col_name="forename_dm",
    phonetic_surname_col_name="surname_dm",
    levenshtein_thresholds=[2],
    jaro_winkler_thresholds=[0.8],
)

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary.
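
To give a feel for what the returned comparison contains, the SQL condition of the "Full name exact match" level can be reproduced outside Splink. The sketch below mirrors the string formatting in the source listing for `__init__`; the function name `full_name_exact_match_sql` is hypothetical and exists only for illustration.

```python
def full_name_exact_match_sql(forename_col, surname_col, set_to_lowercase=False):
    """Build the SQL condition for the full-name exact match level.

    Hypothetical helper; it mirrors the f-strings in the source listing,
    where `_l` and `_r` denote the left and right record of a pair.
    """
    if set_to_lowercase:
        forename_l = f"lower({forename_col}_l)"
        forename_r = f"lower({forename_col}_r)"
        surname_l = f"lower({surname_col}_l)"
        surname_r = f"lower({surname_col}_r)"
    else:
        forename_l = f"{forename_col}_l"
        forename_r = f"{forename_col}_r"
        surname_l = f"{surname_col}_l"
        surname_r = f"{surname_col}_r"
    return f"{forename_l} = {forename_r} AND {surname_l} = {surname_r}"


print(full_name_exact_match_sql("first_name", "surname"))
print(full_name_exact_match_sql("first_name", "surname", set_to_lowercase=True))
```

With `set_to_lowercase=True` both sides of each equality are wrapped in `lower(...)`, so the comparison becomes case-insensitive without modifying the underlying data.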

Source code in splink/comparison_template_library.py
def __init__(
    self,
    forename_col_name,
    surname_col_name,
    set_to_lowercase=False,
    include_exact_match_level: bool = True,
    include_columns_reversed: bool = True,
    term_frequency_adjustments: bool = False,
    tf_adjustment_col_forename_and_surname: str = None,
    phonetic_forename_col_name: str = None,
    phonetic_surname_col_name: str = None,
    levenshtein_thresholds: int | list = [],
    damerau_levenshtein_thresholds: int | list = [],
    jaro_winkler_thresholds: float | list = [0.88],
    jaro_thresholds: float | list = [],
    jaccard_thresholds: float | list = [],
    m_probability_exact_match_forename_surname: float = None,
    m_probability_exact_match_phonetic_forename_surname: float = None,
    m_probability_columns_reversed_forename_surname: float = None,
    m_probability_exact_match_surname: float = None,
    m_probability_exact_match_forename: float = None,
    m_probability_or_probabilities_surname_lev: float | list = None,
    m_probability_or_probabilities_surname_dl: float | list = None,
    m_probability_or_probabilities_surname_jw: float | list = None,
    m_probability_or_probabilities_surname_jar: float | list = None,
    m_probability_or_probabilities_surname_jac: float | list = None,
    m_probability_or_probabilities_forename_lev: float | list = None,
    m_probability_or_probabilities_forename_dl: float | list = None,
    m_probability_or_probabilities_forename_jw: float | list = None,
    m_probability_or_probabilities_forename_jar: float | list = None,
    m_probability_or_probabilities_forename_jac: float | list = None,
    m_probability_else: float = None,
) -> Comparison:
    """A wrapper to generate a comparison for forename and surname columns
    (`forename_col_name` and `surname_col_name`) with preselected defaults.

    The default arguments will give a comparison with comparison levels:\n
    - Exact match forename and surname\n
    - Match of forename and surname reversed\n
    - Exact match surname\n
    - Exact match forename\n
    - Fuzzy match surname jaro-winkler >= 0.88\n
    - Fuzzy match forename jaro-winkler >= 0.88\n
    - Anything else

    Args:
        forename_col_name (str): The name of the forename column to compare
        surname_col_name (str): The name of the surname column to compare
        set_to_lowercase (bool): If True, all names are set to lowercase
            during the pairwise comparisons.
            Defaults to False
        include_exact_match_level (bool, optional): If True, include an exact match
            level for forename and surname. Defaults to True.
        include_columns_reversed (bool, optional): If True, include a comparison
            level for forename and surname being swapped. Defaults to True
        term_frequency_adjustments (bool, optional): If True, apply term
            frequency adjustments to the exact match level for forename_col_name
            and surname_col_name.
            Applies term frequency adjustments to full name exact match level
            and columns reversed exact match level if
            tf_adjustment_col_forename_and_surname is provided.
            Applies term frequency adjustments to phonetic_forename_col_name
            and phonetic_surname_col_name exact match levels, if they are provided.
            Defaults to False.
        tf_adjustment_col_forename_and_surname (str, optional): The name
            of a combined forename surname column. This column is used to provide
            term frequency adjustments for forename surname exact match and columns
            reversed levels.
            Defaults to None
        phonetic_forename_col_name (str, optional): The name of the column with
            phonetic reduction (such as dmetaphone) of forename_col_name. Including
            this parameter along with phonetic_surname_col_name will create an
            exact match level for "Full name phonetic match".
            The phonetic column must be present in the dataset to use this
            parameter.
            Defaults to None
        phonetic_surname_col_name (str, optional): The name of the column with
            phonetic reduction (such as dmetaphone) of surname_col_name. Including
            this parameter along with phonetic_forename_col_name will create an
            exact match level for "Full name phonetic match".
            The phonetic column must be present in
            the dataset to use this parameter.
            Defaults to None
        levenshtein_thresholds (Union[int, list], optional): The thresholds
            to use for levenshtein similarity level(s) for surname_col_name
            and forename_col_name.
            Defaults to []
        damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
            to use for damerau-levenshtein similarity level(s) for surname_col_name
            and forename_col_name.
            Defaults to []
        jaro_winkler_thresholds (Union[float, list], optional): The thresholds
            to use for jaro_winkler similarity level(s) for surname_col_name
            and forename_col_name.
            Defaults to [0.88]
        jaro_thresholds (Union[float, list], optional): The thresholds
            to use for jaro similarity level(s) for surname_col_name
            and forename_col_name.
            Defaults to []
        jaccard_thresholds (Union[float, list], optional): The thresholds to
            use for jaccard similarity level(s) for surname_col_name and
            forename_col_name.
            Defaults to []
        m_probability_exact_match_forename_surname (float, optional): If provided,
            overrides the default m probability for the exact match level for
            forename and surname.
            Defaults to None.
        m_probability_exact_match_phonetic_forename_surname (float, optional): If
            provided, overrides the default m probability for the phonetic match
            level for forename and surname.
            Defaults to None.
        m_probability_columns_reversed_forename_surname (float, optional): If
            provided, overrides the default m probability for the columns reversed
            level for forename and surname.
            Defaults to None.
        m_probability_exact_match_surname (float, optional): If provided,
            overrides the default m probability for the surname exact match
            level.
            Defaults to None.
        m_probability_exact_match_forename (float, optional): If provided,
            overrides the default m probability for the forename exact match
            level.
            Defaults to None.
        m_probability_or_probabilities_surname_lev (Union[float, list], optional):
            If provided, overrides the default m probabilities for the surname
            levenshtein thresholds specified. Defaults to None.
        m_probability_or_probabilities_surname_dl (Union[float, list], optional):
            If provided, overrides the default m probabilities for the surname
            damerau-levenshtein thresholds specified. Defaults to None.
        m_probability_or_probabilities_surname_jw (Union[float, list], optional):
            If provided, overrides the default m probabilities for the surname
            jaro-winkler thresholds specified. Defaults to None.
        m_probability_or_probabilities_surname_jar (Union[float, list], optional):
            If provided, overrides the default m probabilities for the surname
            jaro thresholds specified. Defaults to None.
        m_probability_or_probabilities_surname_jac (Union[float, list], optional):
            If provided, overrides the default m probabilities for the surname
            jaccard thresholds specified. Defaults to None.
        m_probability_or_probabilities_forename_lev (Union[float, list], optional):
            If provided, overrides the default m probabilities for the forename
            levenshtein thresholds specified. Defaults to None.
        m_probability_or_probabilities_forename_dl (Union[float, list], optional):
            If provided, overrides the default m probabilities for the forename
            damerau-levenshtein thresholds specified. Defaults to None.
        m_probability_or_probabilities_forename_jw (Union[float, list], optional):
            If provided, overrides the default m probabilities for the forename
            jaro-winkler thresholds specified. Defaults to None.
        m_probability_or_probabilities_forename_jar (Union[float, list], optional):
            If provided, overrides the default m probabilities for the forename
            jaro thresholds specified. Defaults to None.
        m_probability_or_probabilities_forename_jac (Union[float, list], optional):
            If provided, overrides the default m probabilities for the forename
            jaccard thresholds specified. Defaults to None.
        m_probability_else (float, optional): If provided, overrides the
            default m probability for the 'anything else' level. Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            Basic Forename Surname Comparison
            ```py
            import splink.duckdb.comparison_template_library as ctl
            ctl.forename_surname_comparison("first_name", "surname")
            ```

            Bespoke Forename Surname Comparison
            ```py
            import splink.duckdb.comparison_template_library as ctl
            ctl.forename_surname_comparison(
                    "forename",
                    "surname",
                    term_frequency_adjustments=True,
                    tf_adjustment_col_forename_and_surname="full_name",
                    phonetic_forename_col_name="forename_dm",
                    phonetic_surname_col_name="surname_dm",
                    levenshtein_thresholds=[2],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1],
                )
            ```
        === ":simple-apachespark: Spark"
            Basic Forename Surname Comparison
            ```py
            import splink.spark.comparison_template_library as ctl
            ctl.forename_surname_comparison("first_name", "surname")
            ```

            Bespoke Forename Surname Comparison
            ```py
            import splink.spark.comparison_template_library as ctl
            ctl.forename_surname_comparison(
                    "forename",
                    "surname",
                    term_frequency_adjustments=True,
                    tf_adjustment_col_forename_and_surname="full_name",
                    phonetic_forename_col_name="forename_dm",
                    phonetic_surname_col_name="surname_dm",
                    levenshtein_thresholds=[2],
                    jaro_winkler_thresholds=[],
                    jaccard_thresholds=[1],
                )
            ```
        === ":simple-sqlite: SQLite"
            Basic Forename Surname Comparison
            ```py
            import splink.sqlite.comparison_template_library as ctl
            ctl.forename_surname_comparison("first_name", "surname")
            ```

            Bespoke Forename Surname Comparison
            ```py
            import splink.sqlite.comparison_template_library as ctl
            ctl.forename_surname_comparison(
                    "forename",
                    "surname",
                    term_frequency_adjustments=True,
                    tf_adjustment_col_forename_and_surname="full_name",
                    phonetic_forename_col_name="forename_dm",
                    phonetic_surname_col_name="surname_dm",
                    levenshtein_thresholds=[2],
                    jaro_winkler_thresholds=[0.8],
                )
            ```


    Returns:
        Comparison: A comparison that can be included in the Splink settings
            dictionary.
    """

    # Construct Comparison
    comparison_levels = []

    comparison_level = and_(
        self._null_level(forename_col_name),
        self._null_level(surname_col_name),
        label_for_charts="Null",
    )

    comparison_levels.append(comparison_level)

    ### Forename surname exact match

    if include_exact_match_level:
        if set_to_lowercase:
            forename_col_name_l = f"lower({forename_col_name}_l)"
            forename_col_name_r = f"lower({forename_col_name}_r)"
            surname_col_name_l = f"lower({surname_col_name}_l)"
            surname_col_name_r = f"lower({surname_col_name}_r)"
        else:
            forename_col_name_l = f"{forename_col_name}_l"
            forename_col_name_r = f"{forename_col_name}_r"
            surname_col_name_l = f"{surname_col_name}_l"
            surname_col_name_r = f"{surname_col_name}_r"

        comparison_level = {
            "sql_condition": f"{forename_col_name_l} = {forename_col_name_r} "
            f"AND {surname_col_name_l} = {surname_col_name_r}",
            "tf_adjustment_column": tf_adjustment_col_forename_and_surname,
            "tf_adjustment_weight": 1.0,
            "m_probability": m_probability_exact_match_forename_surname,
            "label_for_charts": "Full name exact match",
        }

        comparison_levels.append(comparison_level)

    ### Phonetic forename surname match

    if (
        phonetic_forename_col_name is not None
        and phonetic_surname_col_name is not None
    ):
        comparison_level = {
            "sql_condition": f"{phonetic_forename_col_name}_l = "
            f"{phonetic_forename_col_name}_r"
            f" AND {phonetic_surname_col_name}_l = {phonetic_surname_col_name}_r",
            "tf_adjustment_column": tf_adjustment_col_forename_and_surname,
            "tf_adjustment_weight": 1.0,
            "m_probability": m_probability_exact_match_phonetic_forename_surname,
            "label_for_charts": "Full name phonetic match",
        }
        comparison_levels.append(comparison_level)

    ### Columns reversed match

    if include_columns_reversed:
        comparison_level = self._columns_reversed_level(
            forename_col_name,
            surname_col_name,
            set_to_lowercase=set_to_lowercase,
            tf_adjustment_column=tf_adjustment_col_forename_and_surname,
            m_probability=m_probability_columns_reversed_forename_surname,
        )
        comparison_levels.append(comparison_level)

    ### Surname Exact match

    comparison_level = self._exact_match_level(
        surname_col_name,
        set_to_lowercase=set_to_lowercase,
        term_frequency_adjustments=term_frequency_adjustments,
        m_probability=m_probability_exact_match_surname,
        include_colname_in_charts_label=True,
    )
    comparison_levels.append(comparison_level)

    ### Forename Exact match

    comparison_level = self._exact_match_level(
        forename_col_name,
        set_to_lowercase=set_to_lowercase,
        term_frequency_adjustments=term_frequency_adjustments,
        m_probability=m_probability_exact_match_forename,
        include_colname_in_charts_label=True,
    )
    comparison_levels.append(comparison_level)

    ### Ensure fuzzy match thresholds are iterable
    levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
    damerau_levenshtein_thresholds = ensure_is_iterable(
        damerau_levenshtein_thresholds
    )
    jaro_thresholds = ensure_is_iterable(jaro_thresholds)
    jaro_winkler_thresholds = ensure_is_iterable(jaro_winkler_thresholds)
    jaccard_thresholds = ensure_is_iterable(jaccard_thresholds)

    ### Surname Fuzzy match
    if len(levenshtein_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            surname_col_name,
            distance_function_name="levenshtein",
            distance_threshold_or_thresholds=levenshtein_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_lev,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(damerau_levenshtein_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            surname_col_name,
            distance_function_name="damerau-levenshtein",
            distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_dl,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaro_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            surname_col_name,
            distance_function_name="jaro",
            distance_threshold_or_thresholds=jaro_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_jar,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaro_winkler_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            surname_col_name,
            distance_function_name="jaro-winkler",
            distance_threshold_or_thresholds=jaro_winkler_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_jw,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaccard_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            surname_col_name,
            distance_function_name="jaccard",
            distance_threshold_or_thresholds=jaccard_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_surname_jac,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    ### Forename Fuzzy match

    if len(levenshtein_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            forename_col_name,
            distance_function_name="levenshtein",
            distance_threshold_or_thresholds=levenshtein_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_lev,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(damerau_levenshtein_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            forename_col_name,
            distance_function_name="damerau-levenshtein",
            distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_dl,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaro_winkler_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            forename_col_name,
            distance_function_name="jaro-winkler",
            distance_threshold_or_thresholds=jaro_winkler_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_jw,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaro_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            forename_col_name,
            distance_function_name="jaro",
            distance_threshold_or_thresholds=jaro_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_jar,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaccard_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            forename_col_name,
            distance_function_name="jaccard",
            distance_threshold_or_thresholds=jaccard_thresholds,
            set_to_lowercase=set_to_lowercase,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_forename_jac,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    comparison_levels.append(
        self._else_level(m_probability=m_probability_else),
    )

    # Construct Description
    comparison_desc = ""
    if include_exact_match_level:
        comparison_desc += "Exact match vs. "

    if (
        phonetic_forename_col_name is not None
        and phonetic_surname_col_name is not None
    ):
        comparison_desc += "Phonetic match forename and surname vs. "

    if include_columns_reversed:
        comparison_desc += "Forename and surname columns reversed vs. "

    comparison_desc += "Surname exact match vs. "

    comparison_desc += "Forename exact match vs. "

    if len(levenshtein_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            surname_col_name, "levenshtein", levenshtein_thresholds
        )

    if len(damerau_levenshtein_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            surname_col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
        )

    if len(jaro_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            surname_col_name, "jaro", jaro_thresholds
        )

    if len(jaro_winkler_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            surname_col_name, "jaro-winkler", jaro_winkler_thresholds
        )

    if len(jaccard_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            surname_col_name, "jaccard", jaccard_thresholds
        )

    if len(levenshtein_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            forename_col_name, "levenshtein", levenshtein_thresholds
        )

    if len(damerau_levenshtein_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            forename_col_name, "damerau-levenshtein", damerau_levenshtein_thresholds
        )

    if len(jaro_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            forename_col_name, "jaro", jaro_thresholds
        )

    if len(jaro_winkler_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            forename_col_name, "jaro-winkler", jaro_winkler_thresholds
        )

    if len(jaccard_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            forename_col_name, "jaccard", jaccard_thresholds
        )

    comparison_desc += "anything else"

    comparison_dict = {
        "comparison_description": comparison_desc,
        "comparison_levels": comparison_levels,
    }
    super().__init__(comparison_dict)

Bases: Comparison

Source code in splink/comparison_template_library.py
class PostcodeComparisonBase(Comparison):
    def __init__(
        self,
        col_name: str,
        invalid_postcodes_as_null=False,
        set_to_lowercase=True,
        valid_postcode_regex="^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$",
        term_frequency_adjustments_full=False,
        include_full_match_level=True,
        include_sector_match_level=True,
        include_district_match_level=True,
        include_area_match_level=True,
        lat_col: str = None,
        long_col: str = None,
        km_thresholds: int | float | list = [],
        m_probability_full_match=None,
        m_probability_sector_match=None,
        m_probability_district_match=None,
        m_probability_area_match=None,
        m_probability_or_probabilities_km_distance=None,
        m_probability_else=None,
    ) -> Comparison:
        """A wrapper to generate a comparison for a postcode column 'col_name'
            with preselected defaults.

        The default arguments will give a comparison with levels:\n
        - Exact match on full postcode\n
        - Exact match on sector\n
        - Exact match on district\n
        - Exact match on area\n
        - All other comparisons

        Args:
            col_name (str): The name of the column to compare.
            invalid_postcodes_as_null (bool): If True, postcodes that do not adhere
                to valid_postcode_regex will be included in the null level.
                Defaults to False
            set_to_lowercase (bool): If True, all postcodes are set to lowercase
                during the pairwise comparisons.
                Defaults to True
            valid_postcode_regex (str): regular expression pattern that is used
                to validate postcodes. If invalid_postcodes_as_null is True,
                postcodes that do not adhere to valid_postcode_regex will be included
                 in the null level.
                 Defaults to "^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$"
            term_frequency_adjustments_full (bool, optional): If True, apply
                term frequency adjustments to the full postcode exact match level.
                Defaults to False.
            include_full_match_level (bool, optional): If True, include an exact
                match on full postcode level. Defaults to True.
            include_sector_match_level (bool, optional): If True, include an exact
                match on sector level. Defaults to True.
            include_district_match_level (bool, optional): If True, include an exact
                match on district level. Defaults to True.
            include_area_match_level (bool, optional): If True, include an exact
                match on area level. Defaults to True.
            lat_col (str): The name of a latitude column, or of an array or
                struct column containing the latitude, plus an index.
                For example: lat, long_lat['lat'] or long_lat[0].
            long_col (str): The name of a longitude column, or of an array or
                struct column containing the longitude, plus an index.
                For example: long, long_lat['long'] or long_lat[1].
            km_thresholds (int, float, list): One or more distance thresholds in
                kilometers. A 'distance in km' level is added for each threshold;
                if empty (the default), no distance levels are included.
            m_probability_full_match (float, optional): Starting m
                probability for full match level. Defaults to None.
            m_probability_sector_match (float, optional): Starting m
                probability for sector match level. Defaults to None.
            m_probability_district_match (float, optional): Starting m
                probability for district match level. Defaults to None.
            m_probability_area_match (float, optional): Starting m
                probability for area match level. Defaults to None.
            m_probability_or_probabilities_km_distance (float, list, optional):
                Starting m probability, or list of m probabilities (one per km
                threshold), for the 'distance in km' levels. Defaults to None.
            m_probability_else (float, optional): Starting m probability for
                the 'everything else' level. Defaults to None.

        Examples:
            === ":simple-duckdb: DuckDB"
                Basic Postcode Comparison
                ``` python
                import splink.duckdb.comparison_template_library as ctl
                ctl.postcode_comparison("postcode")
                ```
                Bespoke Postcode Comparison
                ``` python
                import splink.duckdb.comparison_template_library as ctl
                ctl.postcode_comparison("postcode",
                                    invalid_postcodes_as_null=True,
                                    lat_col="lat",
                                    long_col="long",
                                    km_thresholds=[10, 100]
                                    )
                ```
            === ":simple-apachespark: Spark"
                Basic Postcode Comparison
                ``` python
                import splink.spark.comparison_template_library as ctl
                ctl.postcode_comparison("postcode")
                ```
                Bespoke Postcode Comparison
                ``` python
                import splink.spark.comparison_template_library as ctl
                ctl.postcode_comparison("postcode",
                                    invalid_postcodes_as_null=True,
                                    lat_col="lat",
                                    long_col="long",
                                    km_thresholds=[10, 100]
                                    )
                ```
            === ":simple-amazonaws: Athena"
                Basic Postcode Comparison
                ``` python
                import splink.athena.comparison_template_library as ctl
                ctl.postcode_comparison("postcode")
                ```
                Bespoke Postcode Comparison
                ``` python
                import splink.athena.comparison_template_library as ctl
                ctl.postcode_comparison("postcode",
                                    invalid_postcodes_as_null=True,
                                    lat_col="lat",
                                    long_col="long",
                                    km_thresholds=[10, 100]
                                    )
                ```

        Returns:
            Comparison: A comparison that can be included in the Splink settings
                dictionary.
        """

        comparison_levels = []

        if invalid_postcodes_as_null:
            comparison_levels.append(self._null_level(col_name, valid_postcode_regex))
        else:
            comparison_levels.append(self._null_level(col_name))

        if include_full_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                regex_extract=None,
                term_frequency_adjustments=term_frequency_adjustments_full,
                set_to_lowercase=set_to_lowercase,
                m_probability=m_probability_full_match,
                include_colname_in_charts_label=True,
            )
            comparison_levels.append(comparison_level)

        if include_sector_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                regex_extract="^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]",
                set_to_lowercase=set_to_lowercase,
                m_probability=m_probability_sector_match,
                manual_col_name_for_charts_label="Postcode Sector",
            )
            comparison_levels.append(comparison_level)

        if include_district_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                regex_extract="^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?",
                set_to_lowercase=set_to_lowercase,
                m_probability=m_probability_district_match,
                manual_col_name_for_charts_label="Postcode District",
            )
            comparison_levels.append(comparison_level)

        if include_area_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                regex_extract="^[A-Za-z]{1,2}",
                set_to_lowercase=set_to_lowercase,
                m_probability=m_probability_area_match,
                manual_col_name_for_charts_label="Postcode Area",
            )
            comparison_levels.append(comparison_level)

        km_thresholds = ensure_is_iterable(km_thresholds)
        if len(km_thresholds) > 0:
            if m_probability_or_probabilities_km_distance is None:
                m_probability_or_probabilities_km_distance = [None] * len(km_thresholds)
            m_probability_or_probabilities_km_distance = ensure_is_iterable(
                m_probability_or_probabilities_km_distance
            )

            for thres, m_prob in zip(
                km_thresholds,
                m_probability_or_probabilities_km_distance,
            ):
                comparison_level = self._distance_in_km_level(
                    lat_col,
                    long_col,
                    km_threshold=thres,
                    m_probability=m_prob,
                )
                comparison_levels.append(comparison_level)

        comparison_levels.append(
            self._else_level(m_probability=m_probability_else),
        )

        # Construct Description
        comparison_desc = ""
        if include_full_match_level:
            comparison_desc += "Exact match on full postcode vs. "

        if include_sector_match_level:
            comparison_desc += "exact match on sector vs. "

        if include_district_match_level:
            comparison_desc += "exact match on district vs. "

        if include_area_match_level:
            comparison_desc += "exact match on area vs. "

        if len(km_thresholds) > 0:
            desc = distance_threshold_description(
                col_name, "km_distance", km_thresholds
            )
            comparison_desc += desc

        comparison_desc += "all other comparisons"

        comparison_dict = {
            "output_column_name": col_name,
            "comparison_description": comparison_desc,
            "comparison_levels": comparison_levels,
        }
        super().__init__(comparison_dict)
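
The way the listing above pairs each km threshold with a starting m probability can be sketched in isolation. This is an illustrative sketch only: `pair_thresholds` is a hypothetical helper, and `ensure_is_iterable` here is a simplified stand-in whose behaviour (wrapping a scalar in a list) is assumed rather than copied from Splink.

```python
def ensure_is_iterable(value):
    """Simplified stand-in (assumed behaviour): wrap a scalar in a list."""
    return value if isinstance(value, (list, tuple)) else [value]


def pair_thresholds(km_thresholds, m_probabilities=None):
    """Hypothetical helper mirroring the pairing logic in the listing."""
    km_thresholds = ensure_is_iterable(km_thresholds)
    if m_probabilities is None:
        # No starting values supplied: one None per threshold.
        m_probabilities = [None] * len(km_thresholds)
    m_probabilities = ensure_is_iterable(m_probabilities)
    # zip stops at the shorter input, so surplus thresholds are dropped.
    return list(zip(km_thresholds, m_probabilities))


print(pair_thresholds([10, 100]))
print(pair_thresholds([10, 100], [0.3]))
```

Because `zip` stops at the shorter of its inputs, supplying fewer m probabilities than thresholds would, under this sketch, silently produce fewer distance levels, so it seems safest to pass lists of equal length.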

__init__(col_name, invalid_postcodes_as_null=False, set_to_lowercase=True, valid_postcode_regex='^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$', term_frequency_adjustments_full=False, include_full_match_level=True, include_sector_match_level=True, include_district_match_level=True, include_area_match_level=True, lat_col=None, long_col=None, km_thresholds=[], m_probability_full_match=None, m_probability_sector_match=None, m_probability_district_match=None, m_probability_area_match=None, m_probability_or_probabilities_km_distance=None, m_probability_else=None)

A wrapper to generate a comparison for a postcode column 'col_name' with preselected defaults.

The default arguments will give a comparison with levels:

  • Exact match on full postcode

  • Exact match on sector

  • Exact match on district

  • Exact match on area

  • All other comparisons
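
The sector, district and area levels each compare a prefix of the postcode extracted with a regular expression. As a rough illustration using only Python's `re` module (the patterns are copied from the source listing; `postcode_parts` is a hypothetical helper that is not part of Splink, and Splink itself applies the extraction in SQL rather than in Python):

```python
import re

# Patterns copied from the PostcodeComparisonBase source listing.
FULL = r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$"
SECTOR = r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]"
DISTRICT = r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?"
AREA = r"^[A-Za-z]{1,2}"


def postcode_parts(postcode):
    """Hypothetical helper: return the prefix each level would compare."""

    def extract(pattern):
        match = re.search(pattern, postcode)
        return match.group(0) if match else None

    return {
        "valid": re.search(FULL, postcode) is not None,
        "sector": extract(SECTOR),
        "district": extract(DISTRICT),
        "area": extract(AREA),
    }


print(postcode_parts("SW1A 2AA"))
print(postcode_parts("M1 1AE"))
```

Two postcodes that differ only in their final two letters share a sector, so they would fall into the sector level rather than the full match level.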

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
invalid_postcodes_as_null bool

If True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to False

False
set_to_lowercase bool

If True, all postcodes are set to lowercase during the pairwise comparisons. Defaults to True

True
valid_postcode_regex str

Regular expression pattern that is used to validate postcodes. If invalid_postcodes_as_null is True, postcodes that do not adhere to valid_postcode_regex will be included in the null level. Defaults to "^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$"

'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$'
term_frequency_adjustments_full bool

If True, apply term frequency adjustments to the full postcode exact match level. Defaults to False.

False
include_full_match_level bool

If True, include an exact match on full postcode level. Defaults to True.

True
include_sector_match_level bool

If True, include an exact match on sector level. Defaults to True.

True
include_district_match_level bool

If True, include an exact match on district level. Defaults to True.

True
include_area_match_level bool

If True, include an exact match on area level. Defaults to True.

True
lat_col str

The name of a latitude column, or of an array or struct column containing the latitude, plus an index. For example: lat, long_lat['lat'] or long_lat[0].

None
long_col str

The name of a longitude column, or of an array or struct column containing the longitude, plus an index. For example: long, long_lat['long'] or long_lat[1].

None
km_thresholds (int, float, list)

One or more distance thresholds in kilometers. A 'distance in km' level is added for each threshold; with the default empty list, no distance levels are included.

[]
m_probability_full_match float

Starting m probability for full match level. Defaults to None.

None
m_probability_sector_match float

Starting m probability for sector match level. Defaults to None.

None
m_probability_district_match float

Starting m probability for district match level. Defaults to None.

None
m_probability_area_match float

Starting m probability for area match level. Defaults to None.

None
m_probability_or_probabilities_km_distance (float, list)

Starting m probability, or list of m probabilities (one per km threshold), for the 'distance in km' levels. Defaults to None.

None
m_probability_else float

Starting m probability for the 'everything else' level. Defaults to None.

None
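
To make the role of lat_col, long_col and km_thresholds concrete, here is a minimal sketch of a distance-based level assignment. It assumes a haversine great-circle distance, a mean Earth radius of 6371 km, and an inclusive threshold boundary; none of these details are confirmed by the listing, and both functions are hypothetical rather than Splink's own.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def first_matching_threshold(distance_km, km_thresholds):
    """Return the first threshold the distance falls within, else None.

    Thresholds are checked in the order given, so they should be ascending.
    """
    for threshold in km_thresholds:
        if distance_km <= threshold:  # inclusive boundary is an assumption
            return threshold
    return None


# Central London to central Paris (approximate coordinates).
distance = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(round(distance), first_matching_threshold(distance, [10, 100]))
```

Since the levels are checked in order and the first match wins, listing thresholds from smallest to largest (as in km_thresholds=[10, 100]) keeps the tighter level from being shadowed by the looser one.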

Examples:

DuckDB

Basic Postcode Comparison

import splink.duckdb.comparison_template_library as ctl
ctl.postcode_comparison("postcode")

Bespoke Postcode Comparison

import splink.duckdb.comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                    invalid_postcodes_as_null=True,
                    lat_col="lat",
                    long_col="long",
                    km_thresholds=[10, 100]
                    )

Spark

Basic Postcode Comparison

import splink.spark.comparison_template_library as ctl
ctl.postcode_comparison("postcode")

Bespoke Postcode Comparison

import splink.spark.comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                    invalid_postcodes_as_null=True,
                    lat_col="lat",
                    long_col="long",
                    km_thresholds=[10, 100]
                    )

Athena

Basic Postcode Comparison

import splink.athena.comparison_template_library as ctl
ctl.postcode_comparison("postcode")

Bespoke Postcode Comparison

import splink.athena.comparison_template_library as ctl
ctl.postcode_comparison("postcode",
                    invalid_postcodes_as_null=True,
                    lat_col="lat",
                    long_col="long",
                    km_thresholds=[10, 100]
                    )

Returns:

Name Type Description
Comparison Comparison

A comparison that can be included in the Splink settings dictionary.

Source code in splink/comparison_template_library.py
def __init__(
    self,
    col_name: str,
    invalid_postcodes_as_null=False,
    set_to_lowercase=True,
    valid_postcode_regex="^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$",
    term_frequency_adjustments_full=False,
    include_full_match_level=True,
    include_sector_match_level=True,
    include_district_match_level=True,
    include_area_match_level=True,
    lat_col: str = None,
    long_col: str = None,
    km_thresholds: int | float | list = [],
    m_probability_full_match=None,
    m_probability_sector_match=None,
    m_probability_district_match=None,
    m_probability_area_match=None,
    m_probability_or_probabilities_km_distance=None,
    m_probability_else=None,
) -> Comparison:
    """A wrapper to generate a comparison for a postcode column 'col_name'
        with preselected defaults.

    The default arguments will give a comparison with levels:\n
    - Exact match on full postcode\n
    - Exact match on sector\n
    - Exact match on district\n
    - Exact match on area\n
    - All other comparisons

    Args:
        col_name (str): The name of the column to compare.
        invalid_postcodes_as_null (bool): If True, postcodes that do not adhere
            to valid_postcode_regex will be included in the null level.
            Defaults to False
        set_to_lowercase (bool): If True, all postcodes are set to lowercase
            during the pairwise comparisons.
            Defaults to True
        valid_postcode_regex (str): regular expression pattern that is used
            to validate postcodes. If invalid_postcodes_as_null is True,
            postcodes that do not adhere to valid_postcode_regex will be included
             in the null level.
             Defaults to "^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9][A-Za-z]{2}$"
        term_frequency_adjustments_full (bool, optional): If True, apply
            term frequency adjustments to the full postcode exact match level.
            Defaults to False.
        include_full_match_level (bool, optional): If True, include an exact
            match on full postcode level. Defaults to True.
        include_sector_match_level (bool, optional): If True, include an exact
            match on sector level. Defaults to True.
        include_district_match_level (bool, optional): If True, include an exact
            match on district level. Defaults to True.
        include_area_match_level (bool, optional): If True, include an exact
            match on area level. Defaults to True.
        lat_col (str): The name of a latitude column, or of an array or
            struct column containing the latitude, plus an index.
            For example: lat, long_lat['lat'] or long_lat[0].
        long_col (str): The name of a longitude column, or of an array or
            struct column containing the longitude, plus an index.
            For example: long, long_lat['long'] or long_lat[1].
        km_thresholds (int, float, list): One or more distance thresholds in
            kilometers. A 'distance in km' level is added for each threshold;
            if empty (the default), no distance levels are included.
        m_probability_full_match (float, optional): Starting m
            probability for full match level. Defaults to None.
        m_probability_sector_match (float, optional): Starting m
            probability for sector match level. Defaults to None.
        m_probability_district_match (float, optional): Starting m
            probability for district match level. Defaults to None.
        m_probability_area_match (float, optional): Starting m
            probability for area match level. Defaults to None.
        m_probability_or_probabilities_km_distance (float, list, optional):
            Starting m probability, or list of m probabilities (one per km
            threshold), for the 'distance in km' levels. Defaults to None.
        m_probability_else (float, optional): Starting m probability for
            the 'everything else' level. Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            Basic Postcode Comparison
            ``` python
            import splink.duckdb.comparison_template_library as ctl
            ctl.postcode_comparison("postcode")
            ```
            Bespoke Postcode Comparison
            ``` python
            import splink.duckdb.comparison_template_library as ctl
            ctl.postcode_comparison("postcode",
                                invalid_postcodes_as_null=True,
                                lat_col="lat",
                                long_col="long",
                                km_thresholds=[10, 100]
                                )
            ```
        === ":simple-apachespark: Spark"
            Basic Postcode Comparison
            ``` python
            import splink.spark.comparison_template_library as ctl
            ctl.postcode_comparison("postcode")
            ```
            Bespoke Postcode Comparison
            ``` python
            import splink.spark.comparison_template_library as ctl
            ctl.postcode_comparison("postcode",
                                invalid_postcodes_as_null=True,
                                lat_col="lat",
                                long_col="long",
                                km_thresholds=[10, 100]
                                )
            ```
        === ":simple-amazonaws: Athena"
            Basic Postcode Comparison
            ``` python
            import splink.athena.comparison_template_library as ctl
            ctl.postcode_comparison("postcode")
            ```
            Bespoke Postcode Comparison
            ``` python
            import splink.athena.comparison_template_library as ctl
            ctl.postcode_comparison("postcode",
                                invalid_postcodes_as_null=True,
                                lat_col="lat",
                                long_col="long",
                                km_thresholds=[10, 100]
                                )
            ```

    Returns:
        Comparison: A comparison that can be included in the Splink settings
            dictionary.
    """

    comparison_levels = []

    if invalid_postcodes_as_null:
        comparison_levels.append(self._null_level(col_name, valid_postcode_regex))
    else:
        comparison_levels.append(self._null_level(col_name))

    if include_full_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            regex_extract=None,
            term_frequency_adjustments=term_frequency_adjustments_full,
            set_to_lowercase=set_to_lowercase,
            m_probability=m_probability_full_match,
            include_colname_in_charts_label=True,
        )
        comparison_levels.append(comparison_level)

    if include_sector_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            regex_extract="^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]",
            set_to_lowercase=set_to_lowercase,
            m_probability=m_probability_sector_match,
            manual_col_name_for_charts_label="Postcode Sector",
        )
        comparison_levels.append(comparison_level)

    if include_district_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            regex_extract="^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?",
            set_to_lowercase=set_to_lowercase,
            m_probability=m_probability_district_match,
            manual_col_name_for_charts_label="Postcode District",
        )
        comparison_levels.append(comparison_level)

    if include_area_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            regex_extract="^[A-Za-z]{1,2}",
            set_to_lowercase=set_to_lowercase,
            m_probability=m_probability_area_match,
            manual_col_name_for_charts_label="Postcode Area",
        )
        comparison_levels.append(comparison_level)

    km_thresholds = ensure_is_iterable(km_thresholds)
    if len(km_thresholds) > 0:
        if m_probability_or_probabilities_km_distance is None:
            m_probability_or_probabilities_km_distance = [None] * len(km_thresholds)
        m_probability_or_probabilities_km_distance = ensure_is_iterable(
            m_probability_or_probabilities_km_distance
        )

        for thres, m_prob in zip(
            km_thresholds,
            m_probability_or_probabilities_km_distance,
        ):
            comparison_level = self._distance_in_km_level(
                lat_col,
                long_col,
                km_threshold=thres,
                m_probability=m_prob,
            )
            comparison_levels.append(comparison_level)

    comparison_levels.append(
        self._else_level(m_probability=m_probability_else),
    )

    # Construct Description
    comparison_desc = ""
    if include_full_match_level:
        comparison_desc += "Exact match on full postcode vs. "

    if include_sector_match_level:
        comparison_desc += "exact match on sector vs. "

    if include_district_match_level:
        comparison_desc += "exact match on district vs. "

    if include_area_match_level:
        comparison_desc += "exact match on area vs. "

    if len(km_thresholds) > 0:
        desc = distance_threshold_description(
            col_name, "km_distance", km_thresholds
        )
        comparison_desc += desc

    comparison_desc += "all other comparisons"

    comparison_dict = {
        "output_column_name": col_name,
        "comparison_description": comparison_desc,
        "comparison_levels": comparison_levels,
    }
    super().__init__(comparison_dict)
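
The `regex_extract` patterns in the source above carve a UK postcode into progressively coarser components. The following is a minimal sketch using Python's standard `re` module, not Splink code: the helper name `postcode_components` is hypothetical, and the regex engines of the SQL backends may differ in detail from Python's.

```python
import re

# Patterns copied from the sector, district and area levels in the source above.
SECTOR = r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]"
DISTRICT = r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?"
AREA = r"^[A-Za-z]{1,2}"


def postcode_components(postcode: str) -> dict:
    """Return the substring each pattern extracts, or None when it does not match.

    Hypothetical helper for illustration only; it is not part of Splink.
    """

    def extract(pattern: str):
        match = re.search(pattern, postcode)
        return match.group(0) if match else None

    return {
        "sector": extract(SECTOR),
        "district": extract(DISTRICT),
        "area": extract(AREA),
    }


print(postcode_components("SW1A 2AA"))
print(postcode_components("M1 1AE"))
```

Two postcodes that agree on sector necessarily agree on district and area, which is consistent with the levels being appended from most specific to least specific. Note that the sector pattern requires a space between the outward and inward parts, so a postcode stored without the space would not reach the sector level in this sketch.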

Bases: Comparison

Source code in splink/comparison_template_library.py
class EmailComparisonBase(Comparison):
    def __init__(
        self,
        col_name: str,
        invalid_emails_as_null: bool = False,
        valid_email_regex: str = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$",
        term_frequency_adjustments_full: bool = False,
        include_exact_match_level: bool = True,
        include_username_match_level: bool = True,
        include_username_fuzzy_level: bool = True,
        include_domain_match_level: bool = False,
        levenshtein_thresholds: int | list = [],
        damerau_levenshtein_thresholds: int | list = [],
        jaro_winkler_thresholds: float | list = [0.88],
        jaro_thresholds: float | list = [],
        m_probability_full_match: float = None,
        m_probability_username_match: float = None,
        m_probability_or_probabilities_username_lev: float | list = None,
        m_probability_or_probabilities_username_dl: float | list = None,
        m_probability_or_probabilities_username_jw: float | list = None,
        m_probability_or_probabilities_username_jar: float | list = None,
        m_probability_or_probabilities_email_lev: float | list = None,
        m_probability_or_probabilities_email_dl: float | list = None,
        m_probability_or_probabilities_email_jw: float | list = None,
        m_probability_or_probabilities_email_jar: float | list = None,
        m_probability_domain_match: float | list = None,
        m_probability_else: float | list = None,
    ) -> Comparison:
        """A wrapper to generate a comparison for an email column
        'col_name' with preselected defaults.

        The default arguments will give a comparison with levels:\n
        - Exact match on email\n
        - Exact match on username with different domain\n
        - Fuzzy match on email using Jaro-Winkler\n
        - Fuzzy match on username using Jaro-Winkler\n
        - All other comparisons

        Args:
            col_name (str): The name of the column to compare.
            invalid_emails_as_null (bool): If True, emails that do not adhere
                to valid_email_regex will be included in the null level.
                Defaults to False.
            valid_email_regex (str): Regular expression pattern that is used
                to validate emails. If invalid_emails_as_null is True,
                emails that do not adhere to valid_email_regex will be included
                in the null level.
                Defaults to "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"
            term_frequency_adjustments_full (bool, optional): If True, apply
                term frequency adjustments to the full email exact match level.
                Defaults to False.
            include_exact_match_level (bool, optional): If True, include an exact
                match on full email level. Defaults to True.
            include_username_match_level (bool, optional): If True, include an exact
                match on username only level. Defaults to True.
            include_username_fuzzy_level (bool, optional): If True, include a level
                for fuzzy match on username. Defaults to True.
            include_domain_match_level (bool, optional): If True, include an exact
                match on domain only level. Defaults to False.
            levenshtein_thresholds (Union[int, list], optional): The thresholds
                to use for levenshtein similarity level(s).
                Defaults to []
            damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
                to use for damerau-levenshtein similarity level(s).
                Defaults to []
            jaro_winkler_thresholds (Union[float, list], optional): The thresholds
                to use for jaro_winkler similarity level(s).
                Defaults to [0.88]
            jaro_thresholds (Union[float, list], optional): The thresholds
                to use for jaro similarity level(s).
                Defaults to []
            m_probability_full_match (float, optional): Starting m
                probability for full match level. Defaults to None.
            m_probability_username_match (float, optional): Starting m probability
                for username only match level. Defaults to None.
            m_probability_or_probabilities_username_lev (Union[float, list], optional):
                Starting m probability or probabilities for the levenshtein
                threshold level(s) on the username. If provided, overrides the
                default m probabilities for the thresholds specified.
                Defaults to None.
            m_probability_or_probabilities_username_dl (Union[float, list], optional):
                Starting m probability or probabilities for the
                damerau-levenshtein threshold level(s) on the username. If
                provided, overrides the default m probabilities for the
                thresholds specified. Defaults to None.
            m_probability_or_probabilities_username_jw (Union[float, list], optional):
                Starting m probability or probabilities for the jaro_winkler
                threshold level(s) on the username. If provided, overrides the
                default m probabilities for the thresholds specified.
                Defaults to None.
            m_probability_or_probabilities_username_jar (Union[float, list], optional):
                Starting m probability or probabilities for the jaro
                threshold level(s) on the username. If provided, overrides the
                default m probabilities for the thresholds specified.
                Defaults to None.
            m_probability_or_probabilities_email_lev (Union[float, list], optional):
                Starting m probability or probabilities for the levenshtein
                threshold level(s) on the full email. If provided, overrides the
                default m probabilities for the thresholds specified.
                Defaults to None.
            m_probability_or_probabilities_email_dl (Union[float, list], optional):
                Starting m probability or probabilities for the
                damerau-levenshtein threshold level(s) on the full email. If
                provided, overrides the default m probabilities for the
                thresholds specified. Defaults to None.
            m_probability_or_probabilities_email_jw (Union[float, list], optional):
                Starting m probability or probabilities for the jaro_winkler
                threshold level(s) on the full email. If provided, overrides the
                default m probabilities for the thresholds specified.
                Defaults to None.
            m_probability_or_probabilities_email_jar (Union[float, list], optional):
                Starting m probability or probabilities for the jaro
                threshold level(s) on the full email. If provided, overrides the
                default m probabilities for the thresholds specified.
                Defaults to None.
            m_probability_domain_match (float, optional): Starting m probability
                for domain only match level. Defaults to None.
            m_probability_else (float, optional): Starting m probability for
                the 'everything else' level. Defaults to None.

        Examples:
            === ":simple-duckdb: DuckDB"
                Basic email Comparison
                ``` python
                import splink.duckdb.duckdb_comparison_template_library as ctl
                ctl.email_comparison("email")
                ```
                Bespoke email Comparison
                ``` python
                import splink.duckdb.duckdb_comparison_template_library as ctl
                ctl.email_comparison("email",
                                    levenshtein_thresholds = [2],
                                    damerau_levenshtein_thresholds = [2],
                                    invalid_emails_as_null = True,
                                    include_username_match_level = True,
                                    include_domain_match_level = True,
                                    )
                ```
            === ":simple-apachespark: Spark"
                Basic email Comparison
                ``` python
                import splink.spark.spark_comparison_template_library as ctl
                ctl.email_comparison(col_name = "email")
                ```
                Bespoke email Comparison
                ``` python
                import splink.spark.spark_comparison_template_library as ctl
                ctl.email_comparison("email",
                                    levenshtein_thresholds = [2],
                                    damerau_levenshtein_thresholds = [2],
                                    invalid_emails_as_null = True,
                                    include_username_match_level = True,
                                    include_domain_match_level = True,
                                    )

                ```

        Returns:
            Comparison: A comparison that can be included in the Splink settings
                dictionary.
        """
        # Construct comparison

        comparison_levels = []

        # Decide whether invalid emails should be treated as null
        if invalid_emails_as_null:
            comparison_levels.append(
                self._null_level(col_name, valid_string_pattern=valid_email_regex)
            )
        else:
            comparison_levels.append(self._null_level(col_name))

        # Exact match on full email

        if include_exact_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                regex_extract=None,
                term_frequency_adjustments=term_frequency_adjustments_full,
                m_probability=m_probability_full_match,
                include_colname_in_charts_label=True,
            )
            comparison_levels.append(comparison_level)

        # Exact match on username with different domain

        if include_username_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                regex_extract="^[^@]+",
                m_probability=m_probability_username_match,
                include_colname_in_charts_label=True,
                manual_col_name_for_charts_label="Username",
            )
            comparison_levels.append(comparison_level)

        # Ensure fuzzy match thresholds are iterable

        damerau_levenshtein_thresholds = ensure_is_iterable(
            damerau_levenshtein_thresholds
        )
        levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
        jaro_winkler_thresholds = ensure_is_iterable(jaro_winkler_thresholds)
        jaro_thresholds = ensure_is_iterable(jaro_thresholds)

        # Fuzzy match on full email

        if len(levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="levenshtein",
                distance_threshold_or_thresholds=levenshtein_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_email_lev,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(damerau_levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="damerau-levenshtein",
                distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_email_dl,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_winkler_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="jaro-winkler",
                distance_threshold_or_thresholds=jaro_winkler_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_email_jw,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                distance_function_name="jaro",
                distance_threshold_or_thresholds=jaro_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_email_jar,
                include_colname_in_charts_label=True,
            )
            comparison_levels = comparison_levels + threshold_levels

        # Fuzzy match on username only
        if include_username_fuzzy_level:
            if len(levenshtein_thresholds) > 0:
                threshold_levels = distance_threshold_comparison_levels(
                    self,
                    col_name,
                    regex_extract="^[^@]+",
                    distance_function_name="levenshtein",
                    distance_threshold_or_thresholds=levenshtein_thresholds,
                    m_probability_or_probabilities_thres=m_probability_or_probabilities_username_lev,
                    include_colname_in_charts_label=True,
                    manual_col_name_for_charts_label="Username",
                )
                comparison_levels = comparison_levels + threshold_levels

            if len(damerau_levenshtein_thresholds) > 0:
                threshold_levels = distance_threshold_comparison_levels(
                    self,
                    col_name,
                    regex_extract="^[^@]+",
                    distance_function_name="damerau-levenshtein",
                    distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
                    m_probability_or_probabilities_thres=m_probability_or_probabilities_username_dl,
                    include_colname_in_charts_label=True,
                    manual_col_name_for_charts_label="Username",
                )
                comparison_levels = comparison_levels + threshold_levels

            if len(jaro_winkler_thresholds) > 0:
                threshold_levels = distance_threshold_comparison_levels(
                    self,
                    col_name,
                    regex_extract="^[^@]+",
                    distance_function_name="jaro-winkler",
                    distance_threshold_or_thresholds=jaro_winkler_thresholds,
                    m_probability_or_probabilities_thres=m_probability_or_probabilities_username_jw,
                    include_colname_in_charts_label=True,
                    manual_col_name_for_charts_label="Username",
                )
                comparison_levels = comparison_levels + threshold_levels

            if len(jaro_thresholds) > 0:
                threshold_levels = distance_threshold_comparison_levels(
                    self,
                    col_name,
                    regex_extract="^[^@]+",
                    distance_function_name="jaro",
                    distance_threshold_or_thresholds=jaro_thresholds,
                    m_probability_or_probabilities_thres=m_probability_or_probabilities_username_jar,
                    include_colname_in_charts_label=True,
                    manual_col_name_for_charts_label="Username",
                )
                comparison_levels = comparison_levels + threshold_levels

        # Domain-only match

        if include_domain_match_level:
            comparison_level = self._exact_match_level(
                col_name,
                regex_extract="@([^@]+)$",
                m_probability=m_probability_domain_match,
                manual_col_name_for_charts_label="Email Domain",
            )
            comparison_levels.append(comparison_level)

        comparison_levels.append(
            self._else_level(m_probability=m_probability_else),
        )

        # Construct Description

        comparison_desc = ""
        if include_exact_match_level:
            comparison_desc += "Exact match vs. "

        if include_username_match_level:
            comparison_desc += "Exact username match different domain vs. "

        if len(levenshtein_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                "fuzzy email", "levenshtein", levenshtein_thresholds
            )
            comparison_desc += distance_threshold_description(
                "fuzzy username", "levenshtein", levenshtein_thresholds
            )

        if len(damerau_levenshtein_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                "fuzzy email", "damerau_levenshtein", damerau_levenshtein_thresholds
            )
            comparison_desc += distance_threshold_description(
                "fuzzy username", "damerau_levenshtein", damerau_levenshtein_thresholds
            )

        if len(jaro_winkler_thresholds) > 0:
            comparison_desc += distance_threshold_description(
                "fuzzy email", "jaro_winkler", jaro_winkler_thresholds
            )
            comparison_desc += distance_threshold_description(
                "fuzzy username", "jaro_winkler", jaro_winkler_thresholds
            )

        if include_domain_match_level:
            comparison_desc += "Domain-only match vs. "

        comparison_desc += "anything else"

        comparison_dict = {
            "comparison_description": comparison_desc,
            "comparison_levels": comparison_levels,
        }
        super().__init__(comparison_dict)

    @property
    def _is_distance_subclass(self):
        return False
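
The regular expressions in the source above decide which emails count as valid and which part of an email each level compares. The sketch below applies them with Python's standard `re` module. It is an illustration under assumptions rather than Splink code: the function names are hypothetical, and the sketch returns the whole regex match (so the domain keeps its leading `@`), whereas the source does not show which part of the match a given SQL backend returns.

```python
import re

# Patterns copied from the EmailComparisonBase source above.
VALID_EMAIL = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"
USERNAME = r"^[^@]+"
DOMAIN = r"@([^@]+)$"


def is_valid_email(email: str) -> bool:
    """True when the email matches the default valid_email_regex."""
    return re.search(VALID_EMAIL, email) is not None


def username_part(email: str):
    """Everything before the first '@', or None if the string starts with '@'."""
    match = re.search(USERNAME, email)
    return match.group(0) if match else None


def domain_part(email: str):
    """The final '@' and everything after it (whole match), or None."""
    match = re.search(DOMAIN, email)
    return match.group(0) if match else None


left, right = "j.smith@example.com", "j.smith@example.org"
print(is_valid_email(left), is_valid_email("not-an-email"))
print(username_part(left) == username_part(right))  # same username
print(domain_part(left) == domain_part(right))      # different domain
```

With `invalid_emails_as_null=True`, a value such as `"not-an-email"` would fall into the null level instead of being compared, which is what the first printed line illustrates.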

__init__(col_name, invalid_emails_as_null=False, valid_email_regex='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$', term_frequency_adjustments_full=False, include_exact_match_level=True, include_username_match_level=True, include_username_fuzzy_level=True, include_domain_match_level=False, levenshtein_thresholds=[], damerau_levenshtein_thresholds=[], jaro_winkler_thresholds=[0.88], jaro_thresholds=[], m_probability_full_match=None, m_probability_username_match=None, m_probability_or_probabilities_username_lev=None, m_probability_or_probabilities_username_dl=None, m_probability_or_probabilities_username_jw=None, m_probability_or_probabilities_username_jar=None, m_probability_or_probabilities_email_lev=None, m_probability_or_probabilities_email_dl=None, m_probability_or_probabilities_email_jw=None, m_probability_or_probabilities_email_jar=None, m_probability_domain_match=None, m_probability_else=None)

A wrapper to generate a comparison for an email column 'col_name' with preselected defaults.

The default arguments will give a comparison with levels:

  • Exact match on email

  • Exact match on username with different domain

  • Fuzzy match on email using Jaro-Winkler

  • Fuzzy match on username using Jaro-Winkler

  • All other comparisons
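
To give a feel for what the default Jaro-Winkler threshold of 0.88 admits, here is a small pure-Python sketch of the textbook Jaro and Jaro-Winkler similarities. It is an assumption-laden illustration: each SQL backend supplies its own implementation, which may differ in edge cases, and the function names below are not part of Splink.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


THRESHOLD = 0.88  # the default jaro_winkler_thresholds value
for left, right in [("john.smith", "jon.smith"), ("john.smith", "mary.jones")]:
    score = jaro_winkler(left, right)
    print(left, right, round(score, 3), score >= THRESHOLD)
```

In this sketch a one-character slip in the username clears the 0.88 threshold comfortably, while two unrelated usernames fall far below it.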

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `col_name` | `str` | The name of the column to compare. | required |
| `invalid_emails_as_null` | `bool` | If True, emails that do not adhere to valid_email_regex will be included in the null level. Defaults to False. | `False` |
| `valid_email_regex` | `str` | Regular expression pattern that is used to validate emails. If invalid_emails_as_null is True, emails that do not adhere to valid_email_regex will be included in the null level. | `'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$'` |
| `term_frequency_adjustments_full` | `bool` | If True, apply term frequency adjustments to the full email exact match level. Defaults to False. | `False` |
| `include_exact_match_level` | `bool` | If True, include an exact match on full email level. Defaults to True. | `True` |
| `include_username_match_level` | `bool` | If True, include an exact match on username only level. Defaults to True. | `True` |
| `include_username_fuzzy_level` | `bool` | If True, include a level for fuzzy match on username. Defaults to True. | `True` |
| `include_domain_match_level` | `bool` | If True, include an exact match on domain only level. Defaults to False. | `False` |
| `levenshtein_thresholds` | `Union[int, list]` | The thresholds to use for levenshtein similarity level(s). Defaults to [] | `[]` |
| `damerau_levenshtein_thresholds` | `Union[int, list]` | The thresholds to use for damerau-levenshtein similarity level(s). Defaults to [] | `[]` |
| `jaro_winkler_thresholds` | `Union[float, list]` | The thresholds to use for jaro_winkler similarity level(s). Defaults to [0.88] | `[0.88]` |
| `jaro_thresholds` | `Union[float, list]` | The thresholds to use for jaro similarity level(s). Defaults to [] | `[]` |
| `m_probability_full_match` | `float` | Starting m probability for full match level. Defaults to None. | `None` |
| `m_probability_username_match` | `float` | Starting m probability for username only match level. Defaults to None. | `None` |
| `m_probability_or_probabilities_username_lev` | `Union[float, list]` | Starting m probability or probabilities for the levenshtein threshold level(s) on the username. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_or_probabilities_username_dl` | `Union[float, list]` | Starting m probability or probabilities for the damerau-levenshtein threshold level(s) on the username. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_or_probabilities_username_jw` | `Union[float, list]` | Starting m probability or probabilities for the jaro_winkler threshold level(s) on the username. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_or_probabilities_username_jar` | `Union[float, list]` | Starting m probability or probabilities for the jaro threshold level(s) on the username. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_or_probabilities_email_lev` | `Union[float, list]` | Starting m probability or probabilities for the levenshtein threshold level(s) on the full email. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_or_probabilities_email_dl` | `Union[float, list]` | Starting m probability or probabilities for the damerau-levenshtein threshold level(s) on the full email. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_or_probabilities_email_jw` | `Union[float, list]` | Starting m probability or probabilities for the jaro_winkler threshold level(s) on the full email. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_or_probabilities_email_jar` | `Union[float, list]` | Starting m probability or probabilities for the jaro threshold level(s) on the full email. If provided, overrides the default m probabilities for the thresholds specified. Defaults to None. | `None` |
| `m_probability_domain_match` | `float` | Starting m probability for domain only match level. Defaults to None. | `None` |
| `m_probability_else` | `float` | Starting m probability for the 'everything else' level. Defaults to None. | `None` |
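
Several of the parameters above accept either a single value or a list. The sketch below shows one way such arguments can be normalised and paired with optional starting m probabilities, mirroring the pattern visible in the postcode source on this page (`ensure_is_iterable`, then padding with `None`, then `zip`). The `ensure_is_iterable` shown here is a hypothetical stand-in written for this example, and `pair_thresholds` is an invented name; whether the email levels are paired in exactly this way inside Splink is an assumption.

```python
from collections.abc import Iterable


def ensure_is_iterable(value):
    """Hypothetical stand-in: wrap a scalar in a list, pass other iterables through.

    Strings are treated as scalars so that "abc" does not become ["a", "b", "c"].
    """
    if isinstance(value, Iterable) and not isinstance(value, str):
        return list(value)
    return [value]


def pair_thresholds(thresholds, m_probabilities=None):
    """Pair each threshold with its starting m probability (None when not given)."""
    thresholds = ensure_is_iterable(thresholds)
    if m_probabilities is None:
        # No starting values supplied: leave every level's m probability unset.
        m_probabilities = [None] * len(thresholds)
    m_probabilities = ensure_is_iterable(m_probabilities)
    return list(zip(thresholds, m_probabilities))


print(pair_thresholds(0.88))                # scalar threshold, no m probability
print(pair_thresholds([1, 2], [0.9, 0.8]))  # one m probability per threshold
print(pair_thresholds([]))                  # empty list: no fuzzy levels at all
```

Because `zip` stops at the shorter input, supplying fewer m probabilities than thresholds would silently drop the unmatched thresholds in this sketch, so it seems safest to pass lists of equal length.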

Examples:

DuckDB

Basic email Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.email_comparison("email")

Bespoke email Comparison

import splink.duckdb.duckdb_comparison_template_library as ctl
ctl.email_comparison("email",
                    levenshtein_thresholds = [2],
                    damerau_levenshtein_thresholds = [2],
                    invalid_emails_as_null = True,
                    include_username_match_level = True,
                    include_domain_match_level = True,
                    )

Spark

Basic email Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.email_comparison(col_name = "email")

Bespoke email Comparison

import splink.spark.spark_comparison_template_library as ctl
ctl.email_comparison("email",
                    levenshtein_thresholds = [2],
                    damerau_levenshtein_thresholds = [2],
                    invalid_emails_as_null = True,
                    include_username_match_level = True,
                    include_domain_match_level = True,
                    )
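
The bespoke examples pass `levenshtein_thresholds = [2]`. As a rough guide to what that admits, the sketch below computes the classic Levenshtein edit distance in pure Python. It assumes a pair reaches the level when its distance is at most the threshold; the function is written for this illustration and is not Splink's or any backend's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete from a
                    current[j - 1] + 1,                   # insert into a
                    previous[j - 1] + substitution_cost,  # substitute
                )
            )
        previous = current
    return previous[-1]


THRESHOLD = 2  # as in levenshtein_thresholds = [2]
pairs = [
    ("john.smith@example.com", "jon.smith@example.com"),
    ("john.smith@example.com", "mary.jones@example.com"),
]
for left, right in pairs:
    distance = levenshtein(left, right)
    print(left, right, distance, distance <= THRESHOLD)
```

A dropped letter costs one edit and so stays within the threshold, whereas an unrelated address needs many more edits and does not.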

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Comparison` | `Comparison` | A comparison that can be included in the Splink settings dictionary. |

Source code in splink/comparison_template_library.py
def __init__(
    self,
    col_name: str,
    invalid_emails_as_null: bool = False,
    valid_email_regex: str = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$",
    term_frequency_adjustments_full: bool = False,
    include_exact_match_level: bool = True,
    include_username_match_level: bool = True,
    include_username_fuzzy_level: bool = True,
    include_domain_match_level: bool = False,
    levenshtein_thresholds: int | list = [],
    damerau_levenshtein_thresholds: int | list = [],
    jaro_winkler_thresholds: float | list = [0.88],
    jaro_thresholds: float | list = [],
    m_probability_full_match: float = None,
    m_probability_username_match: float = None,
    m_probability_or_probabilities_username_lev: float | list = None,
    m_probability_or_probabilities_username_dl: float | list = None,
    m_probability_or_probabilities_username_jw: float | list = None,
    m_probability_or_probabilities_username_jar: float | list = None,
    m_probability_or_probabilities_email_lev: float | list = None,
    m_probability_or_probabilities_email_dl: float | list = None,
    m_probability_or_probabilities_email_jw: float | list = None,
    m_probability_or_probabilities_email_jar: float | list = None,
    m_probability_domain_match: float | list = None,
    m_probability_else: float | list = None,
) -> Comparison:
    """A wrapper to generate a comparison for an email column
    'col_name' with preselected defaults.

    The default arguments will give a comparison with levels:\n
    - Exact match on email\n
    - Exact match on username with different domain\n
    - Fuzzy match on email using Jaro-Winkler\n
    - Fuzzy match on username using Jaro-Winkler\n
    - All other comparisons

    Args:
        col_name (str): The name of the column to compare.
        invalid_emails_as_null (bool): If True, emails that do not adhere
            to valid_email_regex will be included in the null level.
            Defaults to False.
        valid_email_regex (str): Regular expression pattern that is used
            to validate emails. If invalid_emails_as_null is True,
            emails that do not adhere to valid_email_regex will be included
            in the null level.
            Defaults to "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"
        term_frequency_adjustments_full (bool, optional): If True, apply
            term frequency adjustments to the full email exact match level.
            Defaults to False.
        include_exact_match_level (bool, optional): If True, include an exact
            match on full email level. Defaults to True.
        include_username_match_level (bool, optional): If True, include an exact
            match on username only level. Defaults to True.
        include_username_fuzzy_level (bool, optional): If True, include a level
            for fuzzy match on username. Defaults to True.
        include_domain_match_level (bool, optional): If True, include an exact
            match on domain only level. Defaults to True.
        levenshtein_thresholds (Union[int, list], optional): The thresholds
            to use for levenshtein similarity level(s).
            Defaults to []
        damerau_levenshtein_thresholds (Union[int, list], optional): The thresholds
            to use for damerau-levenshtein similarity level(s).
            Defaults to []
        jaro_winkler_thresholds (Union[int, list], optional): The thresholds
            to use for jaro_winkler similarity level(s).
            Defaults to [0.88]
        jaro_thresholds (Union[int, list], optional): The thresholds
            to use for jaro similarity level(s).
            Defaults to []
        m_probability_full_match (float, optional): Starting m
            probability for full match level. Defaults to None.
        m_probability_username_match (float, optional): Starting m probability
            for username only match level. Defaults to None.
        m_probability_or_probabilities_username_lev (Union[float, list], optional):
            Starting m probability or probabilities for the levenshtein
            threshold level(s) on the username. If provided, overrides the
            default m probabilities for the thresholds specified.
            Defaults to None.
        m_probability_or_probabilities_username_dl (Union[float, list], optional):
            Starting m probability or probabilities for the
            damerau-levenshtein threshold level(s) on the username. If
            provided, overrides the default m probabilities for the
            thresholds specified. Defaults to None.
        m_probability_or_probabilities_username_jw (Union[float, list], optional):
            Starting m probability or probabilities for the jaro_winkler
            threshold level(s) on the username. If provided, overrides the
            default m probabilities for the thresholds specified.
            Defaults to None.
        m_probability_or_probabilities_username_jar (Union[float, list], optional):
            Starting m probability or probabilities for the jaro
            threshold level(s) on the username. If provided, overrides the
            default m probabilities for the thresholds specified.
            Defaults to None.
        m_probability_or_probabilities_email_lev (Union[float, list], optional):
            Starting m probability or probabilities for the levenshtein
            threshold level(s) on the full email. If provided, overrides the
            default m probabilities for the thresholds specified.
            Defaults to None.
        m_probability_or_probabilities_email_dl (Union[float, list], optional):
            Starting m probability or probabilities for the
            damerau-levenshtein threshold level(s) on the full email. If
            provided, overrides the default m probabilities for the
            thresholds specified. Defaults to None.
        m_probability_or_probabilities_email_jw (Union[float, list], optional):
            Starting m probability or probabilities for the jaro_winkler
            threshold level(s) on the full email. If provided, overrides the
            default m probabilities for the thresholds specified.
            Defaults to None.
        m_probability_or_probabilities_email_jar (Union[float, list], optional):
            Starting m probability or probabilities for the jaro
            threshold level(s) on the full email. If provided, overrides the
            default m probabilities for the thresholds specified.
            Defaults to None.
        m_probability_domain_match (float, optional): Starting m probability
            for domain only match level. Defaults to None.
        m_probability_else (float, optional): Starting m probability for
            the 'everything else' level. Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            Basic email Comparison
            ``` python
            import splink.duckdb.duckdb_comparison_template_library as ctl
            ctl.email_comparison("email")
            ```
            Bespoke email Comparison
            ``` python
            import splink.duckdb.duckdb_comparison_template_library as ctl
            ctl.email_comparison("email",
                                levenshtein_thresholds = [2],
                                damerau_levenshtein_thresholds = [2],
                                invalid_emails_as_null = True,
                                include_username_match_level = True,
                                include_domain_match_level = True,
                                )
            ```
        === ":simple-apachespark: Spark"
            Basic email Comparison
            ``` python
            import splink.spark.spark_comparison_template_library as ctl
            ctl.email_comparison(col_name = "email")
            ```
            Bespoke email Comparison
            ``` python
            import splink.spark.spark_comparison_template_library as ctl
            ctl.email_comparison("email",
                                levenshtein_thresholds = [2],
                                damerau_levenshtein_thresholds = [2],
                                invalid_emails_as_null = True,
                                include_username_match_level = True,
                                include_domain_match_level = True,
                                )

            ```

    Returns:
        Comparison: A comparison that can be included in the Splink settings
            dictionary.
    """
    # Construct comparison

    comparison_levels = []

    # Decide whether invalid emails should be treated as null
    if invalid_emails_as_null:
        comparison_levels.append(
            self._null_level(col_name, valid_string_pattern=valid_email_regex)
        )
    else:
        comparison_levels.append(self._null_level(col_name))

    # Exact match on full email

    if include_exact_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            regex_extract=None,
            term_frequency_adjustments=term_frequency_adjustments_full,
            m_probability=m_probability_full_match,
            include_colname_in_charts_label=True,
        )
        comparison_levels.append(comparison_level)

    # Exact match on username with different domain

    if include_username_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            regex_extract="^[^@]+",
            m_probability=m_probability_username_match,
            include_colname_in_charts_label=True,
            manual_col_name_for_charts_label="Username",
        )
        comparison_levels.append(comparison_level)

    # Ensure fuzzy match thresholds are iterable

    damerau_levenshtein_thresholds = ensure_is_iterable(
        damerau_levenshtein_thresholds
    )
    levenshtein_thresholds = ensure_is_iterable(levenshtein_thresholds)
    jaro_winkler_thresholds = ensure_is_iterable(jaro_winkler_thresholds)
    jaro_thresholds = ensure_is_iterable(jaro_thresholds)

    # Fuzzy match on full email

    if len(levenshtein_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="levenshtein",
            distance_threshold_or_thresholds=levenshtein_thresholds,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_email_lev,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(damerau_levenshtein_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="damerau-levenshtein",
            distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_email_dl,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaro_winkler_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="jaro-winkler",
            distance_threshold_or_thresholds=jaro_winkler_thresholds,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_email_jw,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    if len(jaro_thresholds) > 0:
        threshold_levels = distance_threshold_comparison_levels(
            self,
            col_name,
            distance_function_name="jaro",
            distance_threshold_or_thresholds=jaro_thresholds,
            m_probability_or_probabilities_thres=m_probability_or_probabilities_email_jar,
            include_colname_in_charts_label=True,
        )
        comparison_levels = comparison_levels + threshold_levels

    # Fuzzy match on username only
    if include_username_fuzzy_level:
        if len(levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                regex_extract="^[^@]+",
                distance_function_name="levenshtein",
                distance_threshold_or_thresholds=levenshtein_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_username_lev,
                include_colname_in_charts_label=True,
                manual_col_name_for_charts_label="Username",
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(damerau_levenshtein_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                regex_extract="^[^@]+",
                distance_function_name="damerau-levenshtein",
                distance_threshold_or_thresholds=damerau_levenshtein_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_username_dl,
                include_colname_in_charts_label=True,
                manual_col_name_for_charts_label="Username",
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_winkler_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                regex_extract="^[^@]+",
                distance_function_name="jaro-winkler",
                distance_threshold_or_thresholds=jaro_winkler_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_username_jw,
                include_colname_in_charts_label=True,
                manual_col_name_for_charts_label="Username",
            )
            comparison_levels = comparison_levels + threshold_levels

        if len(jaro_thresholds) > 0:
            threshold_levels = distance_threshold_comparison_levels(
                self,
                col_name,
                # Restrict the comparison to the username, as in the blocks above
                regex_extract="^[^@]+",
                distance_function_name="jaro",
                distance_threshold_or_thresholds=jaro_thresholds,
                m_probability_or_probabilities_thres=m_probability_or_probabilities_username_jar,
                include_colname_in_charts_label=True,
                manual_col_name_for_charts_label="Username",
            )
            comparison_levels = comparison_levels + threshold_levels

    # Domain-only match

    if include_domain_match_level:
        comparison_level = self._exact_match_level(
            col_name,
            regex_extract="@([^@]+)$",
            m_probability=m_probability_domain_match,
            manual_col_name_for_charts_label="Email Domain",
        )
        comparison_levels.append(comparison_level)

    comparison_levels.append(
        self._else_level(m_probability=m_probability_else),
    )

    # Construct Description

    comparison_desc = ""
    if include_exact_match_level:
        comparison_desc += "Exact match vs. "

    if include_username_match_level:
        comparison_desc += "Exact username match different domain vs. "

    if len(levenshtein_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            "fuzzy email", "levenshtein", levenshtein_thresholds
        )
        comparison_desc += distance_threshold_description(
            "fuzzy username", "levenshtein", levenshtein_thresholds
        )

    if len(damerau_levenshtein_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            "fuzzy email", "damerau_levenshtein", damerau_levenshtein_thresholds
        )
        comparison_desc += distance_threshold_description(
            "fuzzy username", "damerau_levenshtein", damerau_levenshtein_thresholds
        )

    if len(jaro_winkler_thresholds) > 0:
        comparison_desc += distance_threshold_description(
            "fuzzy email", "jaro_winkler", jaro_winkler_thresholds
        )
        comparison_desc += distance_threshold_description(
            "fuzzy username", "jaro_winkler", jaro_winkler_thresholds
        )

    if include_domain_match_level:
        comparison_desc += "Domain-only match vs. "

    comparison_desc += "anything else"

    comparison_dict = {
        "comparison_description": comparison_desc,
        "comparison_levels": comparison_levels,
    }
    super().__init__(comparison_dict)
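
The username and domain levels in the listing above rely on two regular expressions, `^[^@]+` and `@([^@]+)$`, plus the default `valid_email_regex`. As a rough illustration of what those patterns select, the sketch below applies them with Python's standard `re` module. The helper `split_email` is hypothetical and is not part of Splink. It also assumes the domain is taken from the first capture group, whereas the SQL backends may return the whole match (including the `@`), depending on how their regex-extract function is called.

```python
import re

# Patterns copied from the source listing above.
USERNAME_PATTERN = r"^[^@]+"
DOMAIN_PATTERN = r"@([^@]+)$"
VALID_EMAIL_PATTERN = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"


def split_email(email):
    """Hypothetical helper: return (username, domain) for a valid email.

    Returns (None, None) when the email does not match VALID_EMAIL_PATTERN,
    loosely mirroring the idea of treating invalid emails as null.
    """
    if re.match(VALID_EMAIL_PATTERN, email) is None:
        return None, None
    username = re.search(USERNAME_PATTERN, email).group(0)
    # Assumption: use the capture group, i.e. the domain without the "@".
    domain = re.search(DOMAIN_PATTERN, email).group(1)
    return username, domain


if __name__ == "__main__":
    for value in ["jane.doe@example.com", "jane.doe@other.org", "not-an-email"]:
        print(value, split_email(value))
```

Two emails such as `jane.doe@example.com` and `jane.doe@other.org` share the same username under this split, which is the situation the "exact match on username with different domain" level is meant to capture.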


Last update: 2023-12-11
Created: 2023-12-11