Specifying and estimating a linkage model¶
In the last tutorial we looked at how we can use blocking rules to generate pairwise record comparisons.
Now it's time to estimate a probabilistic linkage model to score each of these comparisons. The resultant match score is a prediction of whether the two records represent the same entity (e.g. are the same person).
The purpose of estimating the model is to learn the relative importance of different parts of your data for data linking.
For example, a match on date of birth is a much stronger indicator that two records refer to the same entity than a match on gender. A mismatch on gender may be a stronger indication against two records referring to the same entity than a mismatch on name, since names are more likely to be entered differently.
The relative importance of different information is captured in the (partial) 'match weights', which can be learned from your data. These match weights are then added up to compute the overall match score.
The match weights are derived from the m and u parameters of the underlying Fellegi-Sunter model. Splink uses various statistical routines to estimate these parameters. Further details of the underlying theory can be found here, which will help you understand this part of the tutorial.
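As a rough illustration of how partial match weights follow from m and u, the sketch below uses the standard Fellegi-Sunter relationship (partial match weight = log2(m / u)) and adds the weights to a prior to obtain a match probability. The m, u and prior values and the helper names are hypothetical, chosen only for illustration; they are not values estimated by Splink.

```python
import math


def partial_match_weight(m, u):
    """Partial match weight for one comparison level: log2 of the Bayes factor m/u."""
    return math.log2(m / u)


def match_probability(prior, partial_weights):
    """Add the prior weight to the partial weights, then convert log2 odds to a probability."""
    prior_weight = math.log2(prior / (1 - prior))
    total_weight = prior_weight + sum(partial_weights)
    odds = 2 ** total_weight
    return odds / (1 + odds)


# Hypothetical (m, u) values, for illustration only
dob_exact = partial_match_weight(m=0.9, u=0.01)        # strong evidence for a match
gender_exact = partial_match_weight(m=0.98, u=0.5)     # weak evidence for a match
gender_mismatch = partial_match_weight(m=0.02, u=0.5)  # evidence against a match

print(round(dob_exact, 2), round(gender_exact, 2), round(gender_mismatch, 2))
print(match_probability(prior=0.003, partial_weights=[dob_exact, gender_exact]))
```

A match on a field that rarely agrees by chance (low u) earns a large positive weight, while a mismatch on a field that almost always agrees among true matches (high m) earns a large negative one.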
Specifying a linkage model¶
To build a linkage model, the user defines the partial match weights that splink needs to estimate. This is done by defining how the information in the input records should be compared.
To be concrete, here is an example comparison:
first_name_l | first_name_r | surname_l | surname_r | dob_l | dob_r | city_l | city_r | email_l | email_r |
---|---|---|---|---|---|---|---|---|---|
Robert | Rob | Allen | Allen | 1971-05-24 | 1971-06-24 | nan | London | roberta25@smith.net | roberta25@smith.net |
What functions should we use to assess the similarity of Rob vs. Robert in the first_name field?
Should similarity in the dob field be computed in the same way, or a different way?
Your job as the developer of a linkage model is to decide what comparisons are most appropriate for the types of data you have.
Splink can then estimate how much weight to place on a fuzzy match of Rob vs. Robert, relative to an exact match on Robert, or a non-match.
Defining these scenarios is done using Comparisons.
Comparisons¶
The concept of a Comparison has a specific definition within Splink: it defines how data from one or more input columns is compared.
For example, one Comparison may represent how similarity is assessed for a person's date of birth.
Another Comparison may represent the comparison of a person's name or location.
A model is composed of many Comparisons, which between them assess the similarity of all of the columns being used for data linking.
Each Comparison contains two or more ComparisonLevels which define n discrete gradations of similarity between the input columns within the Comparison.
As such ComparisonLevels are nested within Comparisons as follows:
Data Linking Model
├─-- Comparison: Date of birth
│ ├─-- ComparisonLevel: Exact match
│ ├─-- ComparisonLevel: One character difference
│ ├─-- ComparisonLevel: All other
├─-- Comparison: Surname
│ ├─-- ComparisonLevel: Exact match on surname
│ ├─-- ComparisonLevel: All other
│ etc.
Our example data would therefore result in the following comparisons, for dob and surname:
dob_l | dob_r | comparison_level | interpretation |
---|---|---|---|
1971-05-24 | 1971-05-24 | Exact match | great match |
1971-05-24 | 1971-06-24 | One character difference | fuzzy match |
1971-05-24 | 2000-01-02 | All other | bad match |
surname_l | surname_r | comparison_level | interpretation |
---|---|---|---|
Rob | Rob | Exact match | great match |
Rob | Jane | All other | bad match |
Rob | Robert | All other | bad match, this comparison has no notion of nicknames |
More information about specifying comparisons can be found here and here.
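To make the idea of discrete gradations concrete, here is a minimal pure-Python sketch of how a pair of dob values could be assigned to the levels shown in the dob table above. It is not Splink's implementation: the function names are hypothetical, the edit distance is a textbook Levenshtein routine, and the extra "Null" level is an assumption added so that missing values are handled explicitly.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def dob_comparison_level(dob_l, dob_r):
    """Return the first level whose rule is satisfied, checked from most to least similar."""
    if dob_l is None or dob_r is None:
        return "Null"  # assumed extra level for missing data
    if dob_l == dob_r:
        return "Exact match"
    if levenshtein(dob_l, dob_r) <= 1:
        return "One character difference"
    return "All other"


for left, right in [
    ("1971-05-24", "1971-05-24"),
    ("1971-05-24", "1971-06-24"),
    ("1971-05-24", "2000-01-02"),
]:
    print(left, right, dob_comparison_level(left, right))
```

Because the rules are checked in order and the first satisfied rule wins, every pair lands in exactly one level.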
We will now use these concepts to build a data linking model.
# Begin by reading in the tutorial data again
from splink import splink_datasets
df = splink_datasets.fake_1000
Specifying the model using comparisons¶
Splink includes a library of comparison functions at splink.comparison_library to make it simple to get started. These are split into two categories:
- Generic Comparison functions which apply a particular fuzzy matching function. For example, Levenshtein distance.
import splink.comparison_library as cl
city_comparison = cl.LevenshteinAtThresholds("city", 2)
print(city_comparison.get_comparison("duckdb").human_readable_description)
Comparison 'LevenshteinAtThresholds' of "city".
Similarity is assessed using the following ComparisonLevels:
- 'city is NULL' with SQL rule: "city_l" IS NULL OR "city_r" IS NULL
- 'Exact match on city' with SQL rule: "city_l" = "city_r"
- 'Levenshtein distance of city <= 2' with SQL rule: levenshtein("city_l", "city_r") <= 2
- 'All other comparisons' with SQL rule: ELSE
- Comparison functions tailored for specific data types. For example, email.
email_comparison = cl.EmailComparison("email")
print(email_comparison.get_comparison("duckdb").human_readable_description)
Comparison 'EmailComparison' of "email".
Similarity is assessed using the following ComparisonLevels:
- 'email is NULL' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
- 'Exact match on email' with SQL rule: "email_l" = "email_r"
- 'Exact match on username' with SQL rule: NULLIF(regexp_extract("email_l", '^[^@]+', 0), '') = NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')
- 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
- 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract("email_l", '^[^@]+', 0), ''), NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')) >= 0.88
- 'All other comparisons' with SQL rule: ELSE
Specifying the full settings dictionary¶
Comparisons are specified as part of the Splink settings, a Python dictionary which controls all of the configuration of a Splink model:
from splink import Linker, SettingsCreator, block_on, DuckDBAPI
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.LevenshteinAtThresholds("dob", 1),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "city"),
        block_on("surname"),
    ],
    retain_intermediate_calculation_columns=True,
)
linker = Linker(df, settings, db_api=DuckDBAPI())
In words, this settings dictionary says:
- We are performing a dedupe_only job (the other options are link_only or link_and_dedupe, which may be used if there are multiple input datasets).
- When comparing records, we will use information from the first_name, surname, dob, city and email columns to compute a match score.
- The blocking_rules_to_generate_predictions setting states that we will only check for duplicates amongst records where either the first_name AND city or the surname is identical.
- We have enabled term frequency adjustments for the 'city' column, because some values (e.g. London) appear much more frequently than others.
- We have set retain_intermediate_calculation_columns to True so that Splink outputs additional information that helps the user understand the calculations. If it were False, the computations would run faster.
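The effect of the two blocking rules can be sketched in plain Python: a pair of records is only scored if it satisfies at least one rule. The toy records and helper names below are hypothetical, and the sketch ignores how Splink actually executes blocking in SQL; it only shows the logic of "first_name AND city, OR surname".

```python
from itertools import combinations

# Hypothetical toy records, for illustration only
records = [
    {"id": 1, "first_name": "Robert", "surname": "Allen", "city": "London"},
    {"id": 2, "first_name": "Robert", "surname": "Alan", "city": "London"},
    {"id": 3, "first_name": "Rob", "surname": "Allen", "city": None},
    {"id": 4, "first_name": "Jane", "surname": "Smith", "city": "Leeds"},
]


def equal_and_not_null(left, right, column):
    """SQL-style equality: a null on either side never counts as equal."""
    return left[column] is not None and left[column] == right[column]


def passes_blocking(left, right):
    """True if the pair satisfies at least one of the two blocking rules."""
    same_first_name_and_city = equal_and_not_null(
        left, right, "first_name"
    ) and equal_and_not_null(left, right, "city")
    same_surname = equal_and_not_null(left, right, "surname")
    return same_first_name_and_city or same_surname


candidate_pairs = [
    (left["id"], right["id"])
    for left, right in combinations(records, 2)
    if passes_blocking(left, right)
]
print(candidate_pairs)
```

Only the pairs that pass a rule are compared, which is what keeps the number of comparisons manageable on larger datasets.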
Estimate the parameters of the model¶
Now that we have specified our linkage model, we need to estimate the probability_two_random_records_match, u, and m parameters.
- The probability_two_random_records_match parameter is the probability that two records taken at random from your input data represent a match (typically a very small number).
- The u values are the proportion of records falling into each ComparisonLevel amongst truly non-matching records.
- The m values are the proportion of records falling into each ComparisonLevel amongst truly matching records.
You can read more about the theory of what these mean.
We can estimate these parameters using unlabeled data. If we have labels, then we can estimate them even more accurately.
Estimation of probability_two_random_records_match¶
In some cases, the probability_two_random_records_match will be known. For example, if you are linking two tables of 10,000 records and expect a one-to-one match, then you should set this value to 1/10_000 in your settings instead of estimating it.
More generally, this parameter is unknown and needs to be estimated.
It can be estimated accurately enough for most purposes by combining a series of deterministic matching rules and a guess of the recall corresponding to those rules. For further details of the rationale behind this approach see here.
In this example, I guess that the following deterministic matching rules have a recall of about 70%. That means, between them, the rules recover 70% of all true matches.
deterministic_rules = [
    block_on("first_name", "dob"),
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    block_on("email"),
]
linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
Probability two random records match is estimated to be 0.00298.
This means that amongst all possible pairwise record comparisons, one in 335.56 are expected to match. With 499,500 total possible comparisons, we expect a total of around 1,488.57 matching pairs
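The arithmetic behind this estimate can be sketched as follows, assuming the approach described above: count the pairs found by the deterministic rules, scale that count up by the assumed recall, and divide by the number of possible comparisons. The function name and the count of 1,042 pairs are hypothetical; the count was chosen only so that the result is consistent with the output above, and Splink's internals may differ in detail.

```python
def estimate_prior(pairs_found_by_rules, recall, num_records):
    """Scale the deterministic-rule matches up by recall, then divide by all possible pairs."""
    total_comparisons = num_records * (num_records - 1) // 2  # dedupe: n choose 2
    expected_matches = pairs_found_by_rules / recall
    return expected_matches / total_comparisons


# 1,042 is a hypothetical count, chosen to be consistent with the output above
prior = estimate_prior(pairs_found_by_rules=1042, recall=0.7, num_records=1000)
print(f"{prior:.5f}")

# Known prior for a one-to-one link of two 10,000-row tables:
# 10,000 true matches out of 10,000 * 10,000 possible cross-table pairs
one_to_one_prior = 10_000 / (10_000 * 10_000)
print(one_to_one_prior)
```

A lower recall guess inflates the estimated number of true matches, so the resulting prior is only as good as the recall assumption.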
Estimation of u probabilities¶
Once we have the probability_two_random_records_match parameter, we can estimate the u probabilities.
We estimate u using the estimate_u_using_random_sampling method, which doesn't require any labels.
It works by sampling random pairs of records, since most of these pairs are going to be non-matches. Over these non-matches we compute the distribution of ComparisonLevels for each Comparison.
For instance, for gender, we would find that the gender matches 50% of the time, and mismatches 50% of the time.
For dob on the other hand, we would find that the dob matches 1% of the time, has a "one character difference" 3% of the time, and everything else happens 96% of the time.
The larger the random sample, the more accurate the predictions. You control this using the max_pairs parameter. For large datasets, we recommend using at least 10 million - but the higher the better and 1 billion is often appropriate for larger datasets.
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- first_name (no m values are trained).
- surname (no m values are trained).
- dob (no m values are trained).
- city (no m values are trained).
- email (no m values are trained).
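The random-sampling idea can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than Splink's implementation: the toy records, the two-level gender comparison and the function names are all hypothetical, and every sampled pair is treated as a non-match.

```python
import random
from collections import Counter


def gender_level(left, right):
    """Two-level comparison: exact match or everything else."""
    return "Exact match" if left["gender"] == right["gender"] else "All other"


def estimate_u(records, level_fn, max_pairs, seed=0):
    """Approximate u as the share of randomly sampled pairs that land in each level."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(max_pairs):
        left, right = rng.sample(records, 2)  # two distinct records, chosen at random
        counts[level_fn(left, right)] += 1
    return {level: n / max_pairs for level, n in counts.items()}


# Toy data: 1,000 records with genders split evenly
records = [{"id": i, "gender": "M" if i % 2 == 0 else "F"} for i in range(1000)]
u_gender = estimate_u(records, gender_level, max_pairs=20_000)
print(u_gender)
```

With an even split of genders, both levels come out close to 0.5, and increasing max_pairs reduces the sampling noise in the estimate.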
Estimation of m probabilities¶
m is the trickiest of the parameters to estimate, because we have to have some idea of what the true matches are.
If we have labels, we can directly estimate it.
If we do not have labelled data, the m parameters can be estimated using an iterative maximum likelihood approach called Expectation Maximisation.
Estimating directly¶
If we have labels, we can estimate m directly using the estimate_m_from_label_column method of the linker.
For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.
Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.
For example (in this tutorial we don't have labels, so we're not actually going to use this):
linker.training.estimate_m_from_label_column("social_security_number")
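Conceptually, estimating m from a label column amounts to treating pairs that share a populated label as true matches and measuring how often they fall into each ComparisonLevel. The sketch below illustrates that idea with hypothetical toy data and function names; it is not Splink's implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; the label column is only partially populated
records = [
    {"ssn": "111", "first_name": "Robert"},
    {"ssn": "111", "first_name": "Robert"},
    {"ssn": "111", "first_name": "Rob"},
    {"ssn": "222", "first_name": "Jane"},
    {"ssn": "222", "first_name": "Jane"},
    {"ssn": None, "first_name": "Jane"},  # unlabelled, so it is ignored
]


def first_name_level(left, right):
    """Two-level comparison: exact match or everything else."""
    return "Exact match" if left["first_name"] == right["first_name"] else "All other"


def estimate_m_from_labels(records, label, level_fn):
    """m = share of label-agreeing (assumed true match) pairs falling in each level."""
    counts = Counter(
        level_fn(left, right)
        for left, right in combinations(records, 2)
        if left[label] is not None and left[label] == right[label]
    )
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


m_first_name = estimate_m_from_labels(records, "ssn", first_name_level)
print(m_first_name)
```

Records with a missing label simply contribute no pairs, which is why the column does not need to be fully populated.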
Estimating with Expectation Maximisation¶
This algorithm estimates the m values by generating pairwise record comparisons, and using them to maximise a likelihood function.
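A heavily simplified sketch of Expectation Maximisation for this kind of model is shown below. It assumes each comparison has only two levels (agree or disagree), holds the u values fixed, and uses made-up comparison vectors and starting values; the function name and all numbers are hypothetical, and Splink's own implementation handles multiple levels per comparison.

```python
def em_estimate_m(vectors, u, m_init, prior_init, iterations=50):
    """Estimate m (and the proportion of matches) by EM, holding u fixed.

    vectors: list of tuples of 0/1 agreement flags, one flag per comparison.
    u, m_init: per-comparison probability of agreement among non-matches / matches.
    """
    m, prior = list(m_init), prior_init
    for _ in range(iterations):
        # E-step: probability that each pair is a match under the current parameters
        match_probs = []
        for vec in vectors:
            like_match, like_non_match = prior, 1 - prior
            for agree, m_k, u_k in zip(vec, m, u):
                like_match *= m_k if agree else 1 - m_k
                like_non_match *= u_k if agree else 1 - u_k
            match_probs.append(like_match / (like_match + like_non_match))
        # M-step: re-estimate parameters from the match-probability-weighted counts
        total_match = sum(match_probs)
        prior = total_match / len(vectors)
        m = [
            sum(p for p, vec in zip(match_probs, vectors) if vec[k]) / total_match
            for k in range(len(m))
        ]
    return m, prior


# Hypothetical comparison vectors for two comparisons, e.g. (dob agrees, city agrees)
vectors = [(1, 1)] * 80 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 900
m, prior = em_estimate_m(vectors, u=[0.01, 0.05], m_init=[0.9, 0.9], prior_init=0.1)
print([round(value, 3) for value in m], round(prior, 4))
```

Each iteration alternates between scoring the pairs with the current parameters and updating the parameters from those scores, until the values stop changing.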
Each estimation pass requires the user to configure an estimation blocking rule to reduce the number of record comparisons generated to a manageable level.
In our first estimation pass, we block on first_name and surname, meaning we will generate all record comparisons that have first_name and surname exactly equal.
Recall we are trying to estimate the m values of the model, i.e. the proportion of records falling into each ComparisonLevel amongst truly matching records.
This means that, in this training session, we cannot produce parameter estimates for the first_name or surname columns, since they will be equal for all the comparisons we do.
We can, however, produce parameter estimates for all of the other columns. The output messages produced by Splink confirm this.
training_blocking_rule = block_on("first_name", "surname")
training_session_fname_sname = (
    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
)
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")
Parameter estimates will be made for the following comparison(s):
- dob
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- first_name
- surname
WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
Iteration 1: Largest change in params was -0.521 in the m_probability of dob, level `Exact match on dob`
Iteration 2: Largest change in params was 0.0516 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0183 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00744 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00349 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00183 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00103 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000607 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000367 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000226 in probability_two_random_records_match
Iteration 11: Largest change in params was 0.00014 in probability_two_random_records_match
Iteration 12: Largest change in params was 8.73e-05 in probability_two_random_records_match
EM converged after 12 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- first_name (no m values are trained).
- surname (no m values are trained).
- email (some m values are not trained).
In a second estimation pass, we block on dob. This allows us to estimate parameters for the first_name and surname comparisons.
Between the two estimation passes, we now have parameter estimates for all comparisons.
training_blocking_rule = block_on("dob")
training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    training_blocking_rule
)
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"
Parameter estimates will be made for the following comparison(s):
- first_name
- surname
- city
- email
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- dob
WARNING:
Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
Iteration 1: Largest change in params was -0.407 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.0929 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0548 in the m_probability of first_name, level `All other comparisons`
Iteration 4: Largest change in params was 0.0186 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00758 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00339 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.0016 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000773 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000379 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000189 in probability_two_random_records_match
Iteration 11: Largest change in params was 9.68e-05 in probability_two_random_records_match
EM converged after 11 iterations
m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- email (some m values are not trained).
Note that Splink includes other algorithms for estimating m and u values, which are documented here.
Visualising model parameters¶
Splink can generate a number of charts to help you understand your model. For an introduction to these charts and how to interpret them, please see this video.
The final estimated match weights can be viewed in the match weights chart:
linker.visualisations.match_weights_chart()
linker.visualisations.m_u_parameters_chart()
We can also compare the estimates that were produced by the different EM training sessions:
linker.visualisations.parameter_estimate_comparisons_chart()
Saving the model¶
We can save the model, including our estimated parameters, to a .json file, so we can use it in the next tutorial.
settings = linker.misc.save_model_to_json(
    "../demo_settings/saved_model_from_demo.json", overwrite=True
)
Detecting unlinkable records¶
An interesting application of our trained model that is useful to explore before making any predictions is to detect 'unlinkable' records.
Unlinkable records are those which do not contain enough information to be linked. A simple example would be a record containing only 'John Smith', and null in all other fields. This record may link to other records, but we'll never know because there's not enough information to disambiguate any potential links. Unlinkable records can be found by linking records to themselves - if, even when matched to themselves, they don't meet the match threshold score, we can be sure they will never link to anything.
linker.evaluation.unlinkables_chart()
In the above chart, we can see that about 1.3% of records in the input dataset are unlinkable at a threshold match weight of 6.11 (corresponding to a match probability of around 98.6%).
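The relationship between the threshold match weight and the match probability, and the self-linking idea behind unlinkable records, can be sketched as follows. The exact-match weights, the prior weight and the function names are hypothetical, and the sketch assumes that null fields contribute zero weight; it is an illustration of the concept rather than Splink's implementation.

```python
def weight_to_probability(match_weight):
    """Convert a match weight (log2 odds) into a match probability."""
    odds = 2 ** match_weight
    return odds / (1 + odds)


def is_unlinkable(record, exact_match_weights, prior_weight, threshold_weight):
    """Score a record against itself: only populated fields add their exact-match weight."""
    self_match_weight = prior_weight + sum(
        weight
        for column, weight in exact_match_weights.items()
        if record.get(column) is not None
    )
    return self_match_weight < threshold_weight


# Hypothetical exact-match weights and prior weight, for illustration only
exact_match_weights = {"first_name": 4.0, "surname": 4.5, "dob": 6.5, "email": 8.0}
prior_weight = -8.4

sparse = {"first_name": "John", "surname": "Smith", "dob": None, "email": None}
full = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1971-05-24",
    "email": "js@example.com",
}

print(round(weight_to_probability(6.11), 3))
print(is_unlinkable(sparse, exact_match_weights, prior_weight, threshold_weight=6.11))
print(is_unlinkable(full, exact_match_weights, prior_weight, threshold_weight=6.11))
```

A record with only a name cannot accumulate enough weight to overcome the prior, even when compared with itself, so it can never clear the threshold against any other record.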
Further Reading
For more on the model estimation tools in Splink, please refer to the Model Training API documentation.
For a deeper dive on:
- choosing comparisons, please refer to the Comparisons Topic Guides
- the underlying model theory, please refer to the Fellegi Sunter Topic Guide
- model training, please refer to the Model Training Topic Guides (Coming Soon).
For more on the charts used in this tutorial, please refer to the Charts Gallery.
Next steps¶
Now we have trained a model, we can move on to using it to predict matching records.