Skip to content

Febrl4 link-only

Linking the febrl4 datasets¶

See A.2 here and here for the source of this data.

It consists of two datasets, A and B, of 5000 records each, with each record in dataset A having a corresponding record in dataset B. The aim will be to capture as many of those 5000 true links as possible, with minimal false linkages.

It is worth noting that we should not necessarily expect to capture all links. There are some links that although we know they do correspond to the same person, the data is so mismatched between them that we would not reasonably expect a model to link them, and indeed should a model do so may indicate that we have overengineered things using our knowledge of true links, which will not be a helpful reference in situations where we attempt to link unlabelled data, as will usually be the case.

Exploring data and defining model¶

Firstly let's read in the data and have a little look at it

import pandas as pd
import altair as alt
from splink.datasets import splink_datasets
from IPython.display import IFrame
alt.renderers.enable('html')

df_a = splink_datasets.febrl4a
df_b = splink_datasets.febrl4b

def prepare_data(data):
    data = data.rename(columns=lambda x: x.strip())
    data["cluster"] = data["rec_id"].apply(lambda x: "-".join(x.split('-')[:2]))
    data["date_of_birth"] = data["date_of_birth"].astype(str).str.strip()
    data["date_of_birth"] = data["date_of_birth"].replace("", None)

    data["soc_sec_id"] = data["soc_sec_id"].astype(str).str.strip()
    data["soc_sec_id"] = data["soc_sec_id"].replace("", None)

    data["postcode"] = data["postcode"].astype(str).str.strip()
    data["postcode"] = data["postcode"].replace("", None)
    return data
dfs = [
    prepare_data(dataset)
    for dataset in [df_a, df_b]
]

display(dfs[0].head(2))
display(dfs[1].head(2))
rec_id given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id cluster
0 rec-1070-org michaela neumann 8 stanley street miami winston hills 4223 nsw 19151111 5304218 rec-1070
1 rec-1016-org courtney painter 12 pinkerton circuit bega flats richlands 4560 vic 19161214 4066625 rec-1016
rec_id given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id cluster
0 rec-561-dup-0 elton 3 light setreet pinehill windermere 3212 vic 19651013 1551941 rec-561
1 rec-2642-dup-0 mitchell maxon 47 edkins street lochaoair north ryde 3355 nsw 19390212 8859999 rec-2642

Next, to better understand which variables will prove useful in linking, we have a look at how populated each column is, as well as the distribution of unique values within each

from splink.duckdb.linker import DuckDBLinker

basic_settings = {
    "unique_id_column_name": "rec_id",
    "link_type": "link_only",
    # NB as we are linking one-one, we know the probability that a random pair will be a match
    # hence we could set:
    # "probability_two_random_records_match": 1/5000,
    # however we will not specify this here, as we will use this as a check that
    # our estimation procedure returns something sensible
}

linker = DuckDBLinker(dfs, basic_settings)
linker.missingness_chart()
cols_to_profile = list(dfs[0].columns)
cols_to_profile = [col for col in cols_to_profile if col not in ("rec_id", "cluster")]
linker.profile_columns(cols_to_profile)

Next let's come up with some candidate blocking rules, which define which record comparisons are generated, and have a look at how many comparisons each will generate.

For blocking rules that we use in prediction, our aim is to have the union of all rules cover all true matches, whilst avoiding generating so many comparisons that it becomes computationally intractable - i.e. each true match should have at least one of the following conditions holding.

blocking_rules = [
    "l.given_name = r.given_name AND l.surname = r.surname",
    "l.date_of_birth = r.date_of_birth",
    "l.soc_sec_id = r.soc_sec_id",
    "l.state = r.state AND l.address_1 = r.address_1",
    "l.street_number = r.street_number AND l.address_1 = r.address_1",
    "l.postcode = r.postcode",
]
linker.cumulative_num_comparisons_from_blocking_rules_chart(blocking_rules)

The broadest rule, having a matching postcode, unsurpisingly gives the largest number of comparisons. For this small dataset we still have a very manageable number, but if it was larger we might have needed to include a further AND condition with it to break the number of comparisons further.

Now we get the full settings by including the blocking rules, as well as deciding the actual comparisons we will be including in our model.

We will define two models, each with a separate linker with different settings, so that we can compare performance. One will be a very basic model, whilst the other will include a lot more detail.

import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_level_library as cll

# the simple model only considers a few columns, and only two comparison levels for each
simple_model_settings = {
    **basic_settings,
    "blocking_rules_to_generate_predictions": blocking_rules,
    "comparisons": [
        cl.exact_match("given_name", term_frequency_adjustments=True),
        cl.exact_match("surname", term_frequency_adjustments=True),
        cl.exact_match("street_number", term_frequency_adjustments=True),
    ],
    "retain_intermediate_calculation_columns": True,
}
# the detailed model considers more columns, using the information we saw in the exploratory phase
# we also include further comparison levels to account for typos and other differences
detailed_model_settings = {
    **basic_settings,
    "blocking_rules_to_generate_predictions": blocking_rules,
    "comparisons": [
        ctl.name_comparison("given_name", term_frequency_adjustments=True),
        ctl.name_comparison("surname", term_frequency_adjustments=True),
        ctl.date_comparison("date_of_birth", 
                            damerau_levenshtein_thresholds=[],
                            cast_strings_to_date=True,
                            invalid_dates_as_null=True,
                            date_format="%Y%m%d"),
        cl.damerau_levenshtein_at_thresholds("soc_sec_id", [1, 2]),
        cl.exact_match("street_number", term_frequency_adjustments=True),
        cl.damerau_levenshtein_at_thresholds("postcode", [1, 2], term_frequency_adjustments=True),
        # we don't consider further location columns as they will be strongly correlated with postcode
    ],
    "retain_intermediate_calculation_columns": True,
}


linker_simple = DuckDBLinker(dfs, simple_model_settings)
linker_detailed = DuckDBLinker(dfs, detailed_model_settings)

Estimating model parameters¶

We need to furnish our models with parameter estimates so that we can generate results. We will focus on the detailed model, generating the values for the simple model at the end

We can instead estimate the probability two random records match, and compare with the known value of 1/5000 = 0.0002, to see how well our estimation procedure works.

To do this we come up with some deterministic rules - the aim here is that we generate very few false positives (i.e. we expect that the majority of records with at least one of these conditions holding are true matches), whilst also capturing the majority of matches - our guess here is that these two rules should capture 80% of all matches.

deterministic_rules = [
    "l.soc_sec_id = r.soc_sec_id",
    "l.given_name = r.given_name and l.surname = r.surname and l.date_of_birth = r.date_of_birth",
]

linker_detailed.estimate_probability_two_random_records_match(deterministic_rules, recall=0.8)
Probability two random records match is estimated to be  0.000239.
This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs

Even playing around with changing these deterministic rules, or the nominal recall leaves us with an answer which is pretty close to our known value

Next we estimate u and m values for each comparison, so that we can move to generating predictions

linker_detailed.estimate_u_using_random_sampling(max_pairs=1e7)
----- Estimating u probabilities using random sampling -----



FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))



Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - surname (no m values are trained).
    - date_of_birth (no m values are trained).
    - soc_sec_id (no m values are trained).
    - street_number (no m values are trained).
    - postcode (no m values are trained).

When training the m values using expectation maximisation, we need somre more blocking rules to reduce the total number of comparisons. For each rule, we want to ensure that we have neither proportionally too many matches, or too few.

We must run this multiple times using different rules so that we can obtain estimates for all comparisons - if we block on e.g. date_of_birth, then we cannot compute the m values for the date_of_birth comparison, as we have only looked at records where these match.

session_dob = linker_detailed.estimate_parameters_using_expectation_maximisation(
    "l.date_of_birth = r.date_of_birth"
)
session_pc = linker_detailed.estimate_parameters_using_expectation_maximisation(
    "l.postcode = r.postcode"
)
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.date_of_birth = r.date_of_birth

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - soc_sec_id
    - street_number
    - postcode

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - date_of_birth

Iteration 1: Largest change in params was -0.319 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.0034 in the m_probability of given_name, level `All other comparisons`
Iteration 3: Largest change in params was 7.12e-05 in the m_probability of soc_sec_id, level `All other comparisons`

EM converged after 3 iterations

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.postcode = r.postcode

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - date_of_birth
    - soc_sec_id
    - street_number

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - postcode

Iteration 1: Largest change in params was 0.0258 in the m_probability of date_of_birth, level `All other comparisons`
Iteration 2: Largest change in params was 0.000194 in the m_probability of date_of_birth, level `All other comparisons`
Iteration 3: Largest change in params was 1.23e-06 in the m_probability of date_of_birth, level `All other comparisons`

EM converged after 3 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

If we wish we can have a look at how our parameter estimates changes over these training sessions

session_dob.m_u_values_interactive_history_chart()

For variables that aren't used in the m-training blocking rules, we have two estimates --- one from each of the training sessions (see for example street_number). We can have a look at how the values compare between them, to ensure that we don't have drastically different values, which may be indicative of an issue.

linker_detailed.parameter_estimate_comparisons_chart()

We repeat our parameter estimations for the simple model in much the same fashion

linker_simple.estimate_probability_two_random_records_match(deterministic_rules, recall=0.8)
linker_simple.estimate_u_using_random_sampling(max_pairs=1e7)
session_ssid = linker_simple.estimate_parameters_using_expectation_maximisation(
    "l.given_name = r.given_name"
)
session_pc = linker_simple.estimate_parameters_using_expectation_maximisation(
    "l.street_number = r.street_number"
)
linker_simple.parameter_estimate_comparisons_chart()
Probability two random records match is estimated to be  0.000239.
This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - surname (no m values are trained).
    - street_number (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.given_name = r.given_name

Parameter estimates will be made for the following comparison(s):
    - surname
    - street_number

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - given_name

Iteration 1: Largest change in params was -0.105 in the m_probability of surname, level `Exact match`
Iteration 2: Largest change in params was -0.0403 in the m_probability of surname, level `Exact match`
Iteration 3: Largest change in params was 0.0295 in the m_probability of surname, level `All other comparisons`
Iteration 4: Largest change in params was 0.0205 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was -0.0136 in the m_probability of surname, level `Exact match`
Iteration 6: Largest change in params was -0.00885 in the m_probability of surname, level `Exact match`
Iteration 7: Largest change in params was -0.00571 in the m_probability of surname, level `Exact match`
Iteration 8: Largest change in params was 0.00368 in the m_probability of surname, level `All other comparisons`
Iteration 9: Largest change in params was 0.00237 in the m_probability of surname, level `All other comparisons`
Iteration 10: Largest change in params was -0.00154 in the m_probability of surname, level `Exact match`
Iteration 11: Largest change in params was -0.000999 in the m_probability of surname, level `Exact match`
Iteration 12: Largest change in params was -0.000651 in the m_probability of surname, level `Exact match`
Iteration 13: Largest change in params was -0.000425 in the m_probability of surname, level `Exact match`
Iteration 14: Largest change in params was 0.000278 in the m_probability of surname, level `All other comparisons`
Iteration 15: Largest change in params was 0.000182 in the m_probability of surname, level `All other comparisons`
Iteration 16: Largest change in params was -0.00012 in the m_probability of surname, level `Exact match`
Iteration 17: Largest change in params was -7.86e-05 in the m_probability of surname, level `Exact match`

EM converged after 17 iterations

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.street_number = r.street_number

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - street_number

Iteration 1: Largest change in params was 0.0728 in the m_probability of given_name, level `All other comparisons`
Iteration 2: Largest change in params was -0.0472 in the m_probability of given_name, level `Exact match`
Iteration 3: Largest change in params was 0.031 in the m_probability of given_name, level `All other comparisons`
Iteration 4: Largest change in params was -0.0181 in the m_probability of given_name, level `Exact match`
Iteration 5: Largest change in params was -0.01 in the m_probability of given_name, level `Exact match`
Iteration 6: Largest change in params was -0.00546 in the m_probability of given_name, level `Exact match`
Iteration 7: Largest change in params was 0.00301 in the m_probability of given_name, level `All other comparisons`
Iteration 8: Largest change in params was 0.00171 in the m_probability of given_name, level `All other comparisons`
Iteration 9: Largest change in params was 0.000999 in the m_probability of given_name, level `All other comparisons`
Iteration 10: Largest change in params was -0.000606 in the m_probability of given_name, level `Exact match`
Iteration 11: Largest change in params was -0.000394 in the m_probability of surname, level `Exact match`
Iteration 12: Largest change in params was 0.00029 in the m_probability of surname, level `All other comparisons`
Iteration 13: Largest change in params was 0.00021 in the m_probability of surname, level `All other comparisons`
Iteration 14: Largest change in params was -0.00015 in the m_probability of surname, level `Exact match`
Iteration 15: Largest change in params was -0.000107 in the m_probability of surname, level `Exact match`
Iteration 16: Largest change in params was 7.53e-05 in the m_probability of surname, level `All other comparisons`

EM converged after 16 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values
# import json
# we can have a look at the full settings if we wish, including the values of our estimated parameters:
# print(json.dumps(linker_detailed._settings_obj.as_dict(), indent=2))
# we can also get a handy summary of of the model in an easily readable format if we wish:
# print(linker_detailed._settings_obj.human_readable_description)
# (we suppress output here for brevity)

We can now visualise some of the details of our models. We can look at the match weights, which tell us the relative importance for/against a match for each of our comparsion levels.

Comparing the two models will show the added benefit we get in the more detailed model --- what in the simple model is classed as 'all other comparisons' is instead broken down further, and we can see that the detail of how this is broken down in fact gives us quite a bit of useful information about the likelihood of a match.

linker_simple.match_weights_chart()
linker_detailed.match_weights_chart()

As well as the match weights, which give us an idea of the overall effect of each comparison level, we can also look at the individual u and m parameter estimates, which tells us about the prevalence of coincidences and mistakes (for further details/explanation about this see this article). We might want to revise aspects of our model based on the information we ascertain here.

Note however that some of these values are very small, which is why the match weight chart is often more useful for getting a decent picture of things.

# linker_simple.m_u_parameters_chart()
linker_detailed.m_u_parameters_chart()

It is also useful to have a look at unlinkable records - these are records which do not contain enough information to be linked at some match probability threshold. We can figure this out be seeing whether records are able to be matched with themselves.

This is of course relative to the information we have put into the model - we see that in our simple model, at a 99% match threshold nearly 10% of records are unlinkable, as we have not included enough information in the model for distinct records to be adequately distinguished; this is not an issue in our more detailed model.

linker_simple.unlinkables_chart()
linker_detailed.unlinkables_chart()

Our simple model doesn't do terribly, but suffers if we want to have a high match probability --- to be 99% (match weight ~7) certain of matches we have ~10% of records that we will be unable to link.

Our detailed model, however, has enough nuance that we can at least self-link records.

Predictions¶

Now that we have had a look into the details of the models, we will focus on only our more detailed model, which should be able to capture more of the genuine links in our data

predictions = linker_detailed.predict()
df_predictions = predictions.as_pandas_dataframe()
df_predictions.head(5)
match_weight match_probability source_dataset_l source_dataset_r rec_id_l rec_id_r given_name_l given_name_r gamma_given_name tf_given_name_l ... gamma_postcode tf_postcode_l tf_postcode_r bf_postcode bf_tf_adj_postcode address_1_l address_1_r state_l state_r match_key
0 45.265873 1.0 __splink__input_table_0 __splink__input_table_1 rec-4405-org rec-4405-dup-0 charles charles 4 0.0017 ... 3 0.0002 0.0002 739.211191 5.692728 salkauskas crescent salkauskas crescent nsw nsw 0
1 50.067581 1.0 __splink__input_table_0 __splink__input_table_1 rec-3585-org rec-3585-dup-0 mikayla mikayla 4 0.0011 ... 3 0.0008 0.0008 739.211191 1.423182 randwick road randwick road vic vic 0
2 37.545864 1.0 __splink__input_table_0 __splink__input_table_1 rec-298-org rec-298-dup-0 blake blake 4 0.0038 ... 2 0.0005 0.0005 11.541654 1.000000 cutlack street belmont park belted galloway stud vic vic 0
3 49.248563 1.0 __splink__input_table_0 __splink__input_table_1 rec-4866-org rec-4866-dup-0 charlie charlie 4 0.0023 ... 3 0.0012 0.0012 739.211191 0.948788 hawkesbury crescent vic vic 0
4 47.422741 1.0 __splink__input_table_0 __splink__input_table_1 rec-420-org rec-420-dup-0 kate kate 4 0.0010 ... 3 0.0002 0.0002 739.211191 5.692728 sinclair street sinclair street sa nsw 0

5 rows × 47 columns

We can see how our model performs at different probability thresholds, with a couple of options depending on the space we wish to view things

# linker_detailed.roc_chart_from_labels_column("cluster")
linker_detailed.precision_recall_chart_from_labels_column("cluster")