# Defining a Splink Model

## What makes a Splink Model?
When building any linkage model in Splink, there are 3 key things which need to be defined:
- What type of linkage you want (defined by the link type)
- What pairs of records to consider (defined by blocking rules)
- What features to consider, and how they should be compared (defined by comparisons)
## Defining a Splink model with a settings dictionary
All aspects of a Splink model are defined via the settings dictionary: a JSON-like object which underpins the model.
For example, consider a simple model (as defined in the README). The model is defined by the following settings dictionary:

```python
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email", 2),
    ],
}
```
Where:

**1. Type of linkage**

The `"link_type"` is defined as a deduplication of a single dataset:
```python
    "link_type": "dedupe_only",
```
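Other link types are available; for instance, to link records across two separate datasets, both are passed to the linker. A minimal sketch, assuming two pandas DataFrames `df_left` and `df_right` with the same column names:

```python
# "link_and_dedupe" finds links between the datasets as well as duplicates
# within them; "link_only" finds links between the datasets only
settings = {
    "link_type": "link_and_dedupe",
    # ... blocking rules and comparisons as above ...
}

linker = DuckDBLinker([df_left, df_right], settings)
```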
**2. Pairs of records to consider**

The `"blocking_rules_to_generate_predictions"` define the subset of record pairs that the model will consider when making predictions. In this case, pairs are considered where there is a match on `"first_name"` or `"surname"`:
```python
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
```
For more information on how blocking is used in Splink, see the dedicated topic guide.
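As an illustration of how blocking rules interact, here is a sketch of an alternative (hypothetical) set of rules. A pair of records is considered if it satisfies at least one rule, so adding rules loosens the blocking, while combining conditions with `and` inside a single rule tightens it:

```python
# Each rule is a SQL condition over the record aliases "l" and "r".
# A record pair is considered if it satisfies ANY of these rules:
stricter_blocking_rules = [
    "l.first_name = r.first_name and l.dob = r.dob",
    "l.surname = r.surname and l.city = r.city",
]
```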
**3. Features to consider, and how they should be compared**

The `"comparisons"` define the features to be compared between records: `"first_name"`, `"surname"`, `"dob"`, `"city"` and `"email"`:
```python
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email", 2),
    ],
```
Functions from the comparison template library and comparison library are used to define how these features should be compared:
```python
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
```
For more information on how comparisons are defined, see the dedicated topic guide.
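If you want to see what one of these functions produces before adding it to the settings, you can inspect the comparison object directly. A small sketch, assuming the `human_readable_description` property available on the comparison objects returned by these libraries:

```python
import splink.duckdb.comparison_template_library as ctl

comparison = ctl.name_comparison("first_name")

# Print a plain-English summary of the comparison levels
print(comparison.human_readable_description)
```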
These functions generate comparisons within the settings dictionary. See below for the full settings dictionary once the `comparison_library` and `comparison_template_library` functions have been evaluated and constructed:
Settings Dictionary in full

```json
{
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
"l.first_name = r.first_name",
"l.surname = r.surname"
],
"comparisons": [
{
"output_column_name": "first_name",
"comparison_levels": [
{
"sql_condition": "\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL",
"label_for_charts": "Null",
"is_null_level": true
},
{
"sql_condition": "\"first_name_l\" = \"first_name_r\"",
"label_for_charts": "Exact match first_name"
},
{
"sql_condition": "damerau_levenshtein(\"first_name_l\", \"first_name_r\") <= 1",
"label_for_charts": "Damerau_levenshtein <= 1"
},
{
"sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9",
"label_for_charts": "Jaro_winkler_similarity >= 0.9"
},
{
"sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8",
"label_for_charts": "Jaro_winkler_similarity >= 0.8"
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons"
}
],
"comparison_description": "Exact match vs. First_Name within levenshtein threshold 1 vs. First_Name within damerau-levenshtein threshold 1 vs. First_Name within jaro_winkler thresholds 0.9, 0.8 vs. anything else"
},
{
"output_column_name": "surname",
"comparison_levels": [
{
"sql_condition": "\"surname_l\" IS NULL OR \"surname_r\" IS NULL",
"label_for_charts": "Null",
"is_null_level": true
},
{
"sql_condition": "\"surname_l\" = \"surname_r\"",
"label_for_charts": "Exact match surname"
},
{
"sql_condition": "damerau_levenshtein(\"surname_l\", \"surname_r\") <= 1",
"label_for_charts": "Damerau_levenshtein <= 1"
},
{
"sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9",
"label_for_charts": "Jaro_winkler_similarity >= 0.9"
},
{
"sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.8",
"label_for_charts": "Jaro_winkler_similarity >= 0.8"
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons"
}
],
"comparison_description": "Exact match vs. Surname within levenshtein threshold 1 vs. Surname within damerau-levenshtein threshold 1 vs. Surname within jaro_winkler thresholds 0.9, 0.8 vs. anything else"
},
{
"output_column_name": "dob",
"comparison_levels": [
{
"sql_condition": "\"dob_l\" IS NULL OR \"dob_r\" IS NULL",
"label_for_charts": "Null",
"is_null_level": true
},
{
"sql_condition": "\"dob_l\" = \"dob_r\"",
"label_for_charts": "Exact match"
},
{
"sql_condition": "damerau_levenshtein(\"dob_l\", \"dob_r\") <= 1",
"label_for_charts": "Damerau_levenshtein <= 1"
},
{
"sql_condition": "\n abs(date_diff('month',strptime(\"dob_l\",\n '%Y-%m-%d'),strptime(\"dob_r\",\n '%Y-%m-%d'))) <= 1\n ",
"label_for_charts": "Within 1 month"
},
{
"sql_condition": "\n abs(date_diff('year',strptime(\"dob_l\",\n '%Y-%m-%d'),strptime(\"dob_r\",\n '%Y-%m-%d'))) <= 1\n ",
"label_for_charts": "Within 1 year"
},
{
"sql_condition": "\n abs(date_diff('year',strptime(\"dob_l\",\n '%Y-%m-%d'),strptime(\"dob_r\",\n '%Y-%m-%d'))) <= 10\n ",
"label_for_charts": "Within 10 years"
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons"
}
],
"comparison_description": "Exact match vs. Dob within damerau-levenshtein threshold 1 vs. Dates within the following thresholds Month(s): 1, Year(s): 1, Year(s): 10 vs. anything else"
},
{
"output_column_name": "city",
"comparison_levels": [
{
"sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL",
"label_for_charts": "Null",
"is_null_level": true
},
{
"sql_condition": "\"city_l\" = \"city_r\"",
"label_for_charts": "Exact match",
"tf_adjustment_column": "city",
"tf_adjustment_weight": 1.0
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons"
}
],
"comparison_description": "Exact match vs. anything else"
},
{
"output_column_name": "email",
"comparison_levels": [
{
"sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL",
"label_for_charts": "Null",
"is_null_level": true
},
{
"sql_condition": "\"email_l\" = \"email_r\"",
"label_for_charts": "Exact match"
},
{
"sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2",
"label_for_charts": "Levenshtein <= 2"
},
{
"sql_condition": "ELSE",
"label_for_charts": "All other comparisons"
}
],
"comparison_description": "Exact match vs. Email within levenshtein threshold 2 vs. anything else"
}
],
"sql_dialect": "duckdb",
"linker_uid": "wpYkgjrm",
"probability_two_random_records_match": 0.0001
}
```
With our finalised settings object, we can train a Splink model using the following code:
Example model using the settings dictionary

```python
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("email", 2),
    ],
}

linker = DuckDBLinker(df, settings)

# Estimate the u probabilities by sampling random pairs of records
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the m probabilities with expectation maximisation,
# using two different blocking rules for the training sessions
blocking_rule_for_training = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

blocking_rule_for_training = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

# Score all record pairs generated by the blocking rules, then cluster
# pairs with match probability >= 0.95 into groups of linked records
pairwise_predictions = linker.predict()
clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)
clusters.as_pandas_dataframe(limit=5)
```
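After training, it can be useful to sanity-check the estimated parameters before making predictions. For example, the `match_weights_chart` method produces a chart of the match weights for each comparison level:

```python
# Visualise the trained match weights for every comparison level
linker.match_weights_chart()
```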
## Advanced usage of the settings dictionary

The section above covers the three key aspects of the Splink settings dictionary. In reality, these are a small proportion of the parameters that can be defined within the settings; however, the additional parameters are used much less frequently, either because they are not required or because they have a sensible default.
For a list of all possible parameters that can be used within the settings dictionary, see the Settings Dictionary Reference and the Interactive Settings Editor.
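As an illustration, the settings dictionary can also control things such as the unique ID column name, which columns are retained in outputs, and the EM training behaviour. A sketch with some of these optional parameters (the values shown are illustrative, not recommendations):

```python
# A sketch of some optional settings parameters (illustrative values)
settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "unique_id",              # column that uniquely identifies records
    "retain_matching_columns": True,                   # keep compared columns in prediction output
    "retain_intermediate_calculation_columns": False,  # keep per-level calculation columns
    "max_iterations": 25,                              # cap on EM training iterations
    "em_convergence": 0.0001,                          # EM stopping criterion
}
```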
## Saving a trained model

Once you have a trained Splink model, it is often helpful to save it out. The `save_model_to_json` function allows the user to save the specifications of their trained model:

```python
linker.save_model_to_json("model.json")
```
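If the file already exists, it can be replaced by passing the `overwrite` argument:

```python
# Replace an existing model.json rather than raising an error
linker.save_model_to_json("model.json", overwrite=True)
```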
Using the example settings and model training from above, this writes a JSON file containing the full model specification.
The saved model is simply the settings dictionary with additional entries for `"m_probability"` and `"u_probability"` in each of the `"comparison_levels"`, which have been estimated during model training.
For example, in the first name exact match level (with the trained probability values shown as placeholders):

```json
{
    "sql_condition": "\"first_name_l\" = \"first_name_r\"",
    "label_for_charts": "Exact match first_name",
    "m_probability": <estimated during training>,
    "u_probability": <estimated during training>
},
```
where the `m_probability` and `u_probability` values are then used to generate the match weight for an exact match on `"first_name"` between two records (i.e. the amount of evidence provided by records having the same first name) in model predictions.
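Concretely, Splink combines each level's probabilities into a Bayes factor, `m / u`, and takes its base-2 logarithm to obtain the match weight. A quick sketch with illustrative (untrained) values:

```python
import math

# Illustrative values only -- in practice m and u come from model training
m = 0.6   # probability of an exact first_name match among true matches
u = 0.01  # probability of an exact first_name match among non-matches

bayes_factor = m / u                    # 60.0: evidence in favour of a match
match_weight = math.log2(bayes_factor)  # ~5.91
```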
## Loading a pre-trained model

When using a pre-trained model, you can read the model in from a JSON file and recreate the linker object to make new pairwise predictions. For example:

```python
linker = DuckDBLinker(new_df)
linker.load_model("model.json")
```
Here, the linker is initialised with a dataset but no settings dictionary. The `load_model` function is then used to add the settings dictionary (including the trained `"m_probability"` and `"u_probability"` values) to the linker.
Once you have loaded the model, you can generate predictions, clusters etc. as normal. For example:

```python
pairwise_predictions = linker.predict()
```
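You can also restrict the output to higher-confidence pairs directly, using the `threshold_match_probability` argument of `predict`:

```python
# Keep only pairs with an estimated match probability of at least 0.9
pairwise_predictions = linker.predict(threshold_match_probability=0.9)
df_predictions = pairwise_predictions.as_pandas_dataframe()
```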