Defining a Splink Model¶

When building any linkage model in Splink, there are 3 key things which need to be defined:

  1. What type of linkage you want (defined by the link type)
  2. What pairs of records to consider (defined by blocking rules)
  3. What features to consider, and how they should be compared (defined by comparisons)

All aspects of a Splink model are defined via the settings dictionary: a JSON-like object which underpins the model.

For example, consider a simple model (as defined in the README). The model is defined by the following settings dictionary:

import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email", 2),
    ],
}

Where:

1. Type of linkage

The "link_type" is defined as a deduplication for a single dataset.

    "link_type": "dedupe_only",

2. Pairs of records to consider

The "blocking_rules_to_generate_predictions" define a subset of pairs of records for the model to be considered when making predictions. In this case, where there is a match on "first_name" or "surname".

    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],

For more information on how blocking is used in Splink, see the dedicated topic guide.
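
Because blocking rules directly control how many pairwise comparisons are generated, it can be worth checking the size of each candidate rule. A minimal sketch, assuming a linker has already been built (as in the training example further down this page) and using the linker's count_num_comparisons_from_blocking_rule method:

# Count the pairwise comparisons a candidate blocking rule would generate
# (assumes `linker` has been constructed as in the example below)
linker.count_num_comparisons_from_blocking_rule("l.first_name = r.first_name")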

3. Features to consider, and how they should be compared

The "comparisons" define the features to be compared between records: "first_name", "surname", "dob", "city" and "email".

    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email", 2),
    ],

Functions from the comparison template library and comparison library are used to define how these features should be compared.

import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl

For more information on how comparisons are defined, see the dedicated topic guide.
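
The library functions are a convenience: a comparison can equally be written out by hand as a dictionary of comparison levels and passed directly into the "comparisons" list. As an illustrative sketch, mirroring the structure Splink generates for the "city" comparison in the full dictionary below (but omitting the term frequency adjustment keys):

# A hand-written comparison, structured like the "city" entry shown below
city_comparison = {
    "output_column_name": "city",
    "comparison_levels": [
        {
            "sql_condition": '"city_l" IS NULL OR "city_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": '"city_l" = "city_r"',
            "label_for_charts": "Exact match",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}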

The comparison_library and comparison_template_library functions generate the comparison entries within the settings dictionary. The full settings dictionary, once these functions have been evaluated, is shown below:

Settings Dictionary in full
{
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname"
    ],
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                {
                    "sql_condition": "\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"first_name_l\" = \"first_name_r\"",
                    "label_for_charts": "Exact match first_name"
                },
                {
                    "sql_condition": "damerau_levenshtein(\"first_name_l\", \"first_name_r\") <= 1",
                    "label_for_charts": "Damerau_levenshtein <= 1"
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.9"
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.8"
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons"
                }
            ],
            "comparison_description": "Exact match vs. First_Name within levenshtein threshold 1 vs. First_Name within damerau-levenshtein threshold 1 vs. First_Name within jaro_winkler thresholds 0.9, 0.8 vs. anything else"
        },
        {
            "output_column_name": "surname",
            "comparison_levels": [
                {
                    "sql_condition": "\"surname_l\" IS NULL OR \"surname_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"surname_l\" = \"surname_r\"",
                    "label_for_charts": "Exact match surname"
                },
                {
                    "sql_condition": "damerau_levenshtein(\"surname_l\", \"surname_r\") <= 1",
                    "label_for_charts": "Damerau_levenshtein <= 1"
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.9"
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.8",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.8"
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons"
                }
            ],
            "comparison_description": "Exact match vs. Surname within levenshtein threshold 1 vs. Surname within damerau-levenshtein threshold 1 vs. Surname within jaro_winkler thresholds 0.9, 0.8 vs. anything else"
        },
        {
            "output_column_name": "dob",
            "comparison_levels": [
                {
                    "sql_condition": "\"dob_l\" IS NULL OR \"dob_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"dob_l\" = \"dob_r\"",
                    "label_for_charts": "Exact match"
                },
                {
                    "sql_condition": "damerau_levenshtein(\"dob_l\", \"dob_r\") <= 1",
                    "label_for_charts": "Damerau_levenshtein <= 1"
                },
                {
                    "sql_condition": "\n            abs(date_diff('month',strptime(\"dob_l\",\n              '%Y-%m-%d'),strptime(\"dob_r\",\n              '%Y-%m-%d'))) <= 1\n        ",
                    "label_for_charts": "Within 1 month"
                },
                {
                    "sql_condition": "\n            abs(date_diff('year',strptime(\"dob_l\",\n              '%Y-%m-%d'),strptime(\"dob_r\",\n              '%Y-%m-%d'))) <= 1\n        ",
                    "label_for_charts": "Within 1 year"
                },
                {
                    "sql_condition": "\n            abs(date_diff('year',strptime(\"dob_l\",\n              '%Y-%m-%d'),strptime(\"dob_r\",\n              '%Y-%m-%d'))) <= 10\n        ",
                    "label_for_charts": "Within 10 years"
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons"
                }
            ],
            "comparison_description": "Exact match vs. Dob within damerau-levenshtein threshold 1 vs. Dates within the following thresholds Month(s): 1, Year(s): 1, Year(s): 10 vs. anything else"
        },
        {
            "output_column_name": "city",
            "comparison_levels": [
                {
                    "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"city_l\" = \"city_r\"",
                    "label_for_charts": "Exact match",
                    "tf_adjustment_column": "city",
                    "tf_adjustment_weight": 1.0
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons"
                }
            ],
            "comparison_description": "Exact match vs. anything else"
        },
        {
            "output_column_name": "email",
            "comparison_levels": [
                {
                    "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"email_l\" = \"email_r\"",
                    "label_for_charts": "Exact match"
                },
                {
                    "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2",
                    "label_for_charts": "Levenshtein <= 2"
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons"
                }
            ],
            "comparison_description": "Exact match vs. Email within levenshtein threshold 2 vs. anything else"
        }
    ],
    "sql_dialect": "duckdb",
    "linker_uid": "wpYkgjrm",
    "probability_two_random_records_match": 0.0001
}

With our finalised settings object, we can train a Splink model using the following code:

Example model using the settings dictionary
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email", 2),
    ],
}

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

blocking_rule_for_training = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

blocking_rule_for_training = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

pairwise_predictions = linker.predict()

clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)
clusters.as_pandas_dataframe(limit=5)
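
After training, it can be helpful to inspect the estimated parameters before saving the model. A short sketch using the linker from the example above (match_weights_chart and m_u_parameters_chart are the linker's standard chart methods):

# Visualise the match weights estimated during training
linker.match_weights_chart()

# Inspect the estimated m and u probabilities for each comparison level
linker.m_u_parameters_chart()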

Advanced usage of the settings dictionary¶

The sections above cover the three key aspects of the Splink settings dictionary. In reality, these are only a small proportion of the parameters that can be defined within the settings. However, the additional parameters are needed much less frequently, either because they are optional or because they have sensible defaults.

For a list of all possible parameters that can be used within the settings dictionary, see the Settings Dictionary Reference and the Interactive Settings Editor.
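
For instance, alongside the three key aspects above, the settings dictionary can also specify the name of the unique identifier column, the prior probability that two random records match, and whether the compared columns are retained in the output. A hedged sketch using keys documented in the Settings Dictionary Reference (the values shown are illustrative, not recommendations):

# Illustrative sketch of some less frequently used settings keys
settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "unique_id",
    "probability_two_random_records_match": 0.0001,
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": False,
    # ... plus blocking_rules_to_generate_predictions and comparisons as above
}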

Saving a trained model¶

Once you have a trained Splink model, it is often helpful to save it. The save_model_to_json function allows the user to save the specification of their trained model.

linker.save_model_to_json("model.json")

which, using the example settings and model training from above, gives the following output:

Model JSON
{
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname"
    ],
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                {
                    "sql_condition": "\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"first_name_l\" = \"first_name_r\"",
                    "label_for_charts": "Exact match first_name",
                    "m_probability": 0.42539626585084356,
                    "u_probability": 0.0055691509084167595
                },
                {
                    "sql_condition": "damerau_levenshtein(\"first_name_l\", \"first_name_r\") <= 1",
                    "label_for_charts": "Damerau_levenshtein <= 1",
                    "m_probability": 0.07083503518755285,
                    "u_probability": 0.0015597577555308368
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.9",
                    "m_probability": 0.059652193684047713,
                    "u_probability": 0.0013298726980595723
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.8",
                    "m_probability": 0.07293119026085258,
                    "u_probability": 0.0060561117290816955
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.3711853150167032,
                    "u_probability": 0.9854851069089111
                }
            ],
            "comparison_description": "Exact match vs. First_Name within levenshtein threshold 1 vs. First_Name within damerau-levenshtein threshold 1 vs. First_Name within jaro_winkler thresholds 0.9, 0.8 vs. anything else"
        },
        {
            "output_column_name": "surname",
            "comparison_levels": [
                {
                    "sql_condition": "\"surname_l\" IS NULL OR \"surname_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"surname_l\" = \"surname_r\"",
                    "label_for_charts": "Exact match surname",
                    "m_probability": 0.4799599325681923,
                    "u_probability": 0.0074429557488431336
                },
                {
                    "sql_condition": "damerau_levenshtein(\"surname_l\", \"surname_r\") <= 1",
                    "label_for_charts": "Damerau_levenshtein <= 1",
                    "m_probability": 0.07008569271116605,
                    "u_probability": 0.0019083296710011445
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.9",
                    "m_probability": 0.020227438482587002,
                    "u_probability": 0.0006063412008846001
                },
                {
                    "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.8",
                    "label_for_charts": "Jaro_winkler_similarity >= 0.8",
                    "m_probability": 0.04851972996678833,
                    "u_probability": 0.003760255509361861
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.38120720627126636,
                    "u_probability": 0.9862821178699093
                }
            ],
            "comparison_description": "Exact match vs. Surname within levenshtein threshold 1 vs. Surname within damerau-levenshtein threshold 1 vs. Surname within jaro_winkler thresholds 0.9, 0.8 vs. anything else"
        },
        {
            "output_column_name": "dob",
            "comparison_levels": [
                {
                    "sql_condition": "\"dob_l\" IS NULL OR \"dob_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"dob_l\" = \"dob_r\"",
                    "label_for_charts": "Exact match",
                    "m_probability": 0.6110365914628999,
                    "u_probability": 0.003703703703703704
                },
                {
                    "sql_condition": "damerau_levenshtein(\"dob_l\", \"dob_r\") <= 1",
                    "label_for_charts": "Damerau_levenshtein <= 1",
                    "m_probability": 0.011899382842926385,
                    "u_probability": 0.0014194194194194194
                },
                {
                    "sql_condition": "\n            abs(date_diff('month',strptime(\"dob_l\",\n              '%Y-%m-%d'),strptime(\"dob_r\",\n              '%Y-%m-%d'))) <= 1\n        ",
                    "label_for_charts": "Within 1 month",
                    "m_probability": 0.14328591788922354,
                    "u_probability": 0.004904904904904905
                },
                {
                    "sql_condition": "\n            abs(date_diff('year',strptime(\"dob_l\",\n              '%Y-%m-%d'),strptime(\"dob_r\",\n              '%Y-%m-%d'))) <= 1\n        ",
                    "label_for_charts": "Within 1 year",
                    "m_probability": 0.23377741936500165,
                    "u_probability": 0.05535935935935936
                },
                {
                    "sql_condition": "\n            abs(date_diff('year',strptime(\"dob_l\",\n              '%Y-%m-%d'),strptime(\"dob_r\",\n              '%Y-%m-%d'))) <= 10\n        ",
                    "label_for_charts": "Within 10 years",
                    "u_probability": 0.3138258258258258
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 6.88439948607414e-07,
                    "u_probability": 0.6207867867867868
                }
            ],
            "comparison_description": "Exact match vs. Dob within damerau-levenshtein threshold 1 vs. Dates within the following thresholds Month(s): 1, Year(s): 1, Year(s): 10 vs. anything else"
        },
        {
            "output_column_name": "city",
            "comparison_levels": [
                {
                    "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"city_l\" = \"city_r\"",
                    "label_for_charts": "Exact match",
                    "m_probability": 0.6594293995874598,
                    "u_probability": 0.09307809682341518,
                    "tf_adjustment_column": "city",
                    "tf_adjustment_weight": 1.0
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.34057060041254017,
                    "u_probability": 0.9069219031765848
                }
            ],
            "comparison_description": "Exact match vs. anything else"
        },
        {
            "output_column_name": "email",
            "comparison_levels": [
                {
                    "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": true
                },
                {
                    "sql_condition": "\"email_l\" = \"email_r\"",
                    "label_for_charts": "Exact match",
                    "m_probability": 0.676510796673033,
                    "u_probability": 0.004070304803112018
                },
                {
                    "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2",
                    "label_for_charts": "Levenshtein <= 2",
                    "m_probability": 0.07663549181271126,
                    "u_probability": 0.00038595529012665426
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other comparisons",
                    "m_probability": 0.24685371151425578,
                    "u_probability": 0.9955437399067614
                }
            ],
            "comparison_description": "Exact match vs. Email within levenshtein threshold 2 vs. anything else"
        }
    ],
    "sql_dialect": "duckdb",
    "linker_uid": "mHRb2HT7",
    "probability_two_random_records_match": 0.0001
}

This is simply the settings dictionary with additional entries for "m_probability" and "u_probability" in each of the "comparison_levels", which have been estimated during model training.

For example, in the first_name exact match level:

{
    "sql_condition": "\"first_name_l\" = \"first_name_r\"",
    "label_for_charts": "Exact match first_name",
    "m_probability": 0.42539626585084356,
    "u_probability": 0.0055691509084167595
},

where the m_probability and u_probability values are used to generate the match weight for an exact match on "first_name" between two records (i.e. the amount of evidence provided by two records sharing the same first name) when making predictions.
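
Concretely, the match weight for a comparison level is the log base 2 of the ratio m/u (the Bayes factor). A quick check with the values above:

import math

# m and u probabilities from the trained first_name exact match level above
m = 0.42539626585084356
u = 0.0055691509084167595

bayes_factor = m / u                    # roughly 76: an exact first_name match is ~76x more
                                        # likely among matching pairs than non-matching pairs
match_weight = math.log2(bayes_factor)  # roughly 6.3 - the match weight for this level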

Loading a pre-trained model¶

When using a pre-trained model, you can read the model in from a JSON file and recreate the linker object to make new pairwise predictions. For example:

linker = DuckDBLinker(new_df)
linker.load_model("model.json")

Here the linker is initialised with a dataset, but no settings dictionary. The load_model function is then used to add the settings dictionary (including the trained "m_probability" and "u_probability" values) to the linker.

Once you have loaded the model, you can generate predictions, clusters etc. as normal. For example:

pairwise_predictions = linker.predict()
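
The resulting predictions can then be clustered exactly as in the training example above:

clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)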