Febrl4 link-only

Linking the febrl4 datasets

See A.2 here and here for the source of this data.

It consists of two datasets, A and B, of 5000 records each, with each record in dataset A having a corresponding record in dataset B. The aim will be to capture as many of those 5000 true links as possible, with minimal false linkages.

It is worth noting that we should not necessarily expect to capture all links. Some pairs that we know refer to the same person are so badly mismatched that we would not reasonably expect a model to link them. If a model does link them, that may indicate that we have over-engineered it using our knowledge of the true links, which will not help when we link unlabelled data, as will usually be the case.

Exploring data and defining model

Firstly, let's read in the data and have a quick look at it.

from splink import splink_datasets

df_a = splink_datasets.febrl4a
df_b = splink_datasets.febrl4b


def prepare_data(data):
    data = data.rename(columns=lambda x: x.strip())
    data["cluster"] = data["rec_id"].apply(lambda x: "-".join(x.split("-")[:2]))
    data["date_of_birth"] = data["date_of_birth"].astype(str).str.strip()
    data["soc_sec_id"] = data["soc_sec_id"].astype(str).str.strip()
    data["postcode"] = data["postcode"].astype(str).str.strip()
    return data


dfs = [prepare_data(dataset) for dataset in [df_a, df_b]]

display(dfs[0].head(2))
display(dfs[1].head(2))

	rec_id	given_name	surname	street_number	address_1	address_2	suburb	postcode	state	date_of_birth	soc_sec_id	cluster
0	rec-1070-org	michaela	neumann	8	stanley street	miami	winston hills	4223	nsw	19151111	5304218	rec-1070
1	rec-1016-org	courtney	painter	12	pinkerton circuit	bega flats	richlands	4560	vic	19161214	4066625	rec-1016

	rec_id	given_name	surname	street_number	address_1	address_2	suburb	postcode	state	date_of_birth	soc_sec_id	cluster
0	rec-561-dup-0	elton		3	light setreet	pinehill	windermere	3212	vic	19651013	1551941	rec-561
1	rec-2642-dup-0	mitchell	maxon	47	edkins street	lochaoair	north ryde	3355	nsw	19390212	8859999	rec-2642
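The `cluster` column built in `prepare_data` is what later lets us tell true links from non-links. As a minimal sketch of that one step in isolation (the helper name `cluster_label` is ours, not part of Splink), the label is simply the first two hyphen-separated parts of `rec_id`:

```python
def cluster_label(rec_id: str) -> str:
    """Keep the first two hyphen-separated parts of a febrl4 record id."""
    return "-".join(rec_id.split("-")[:2])


# Record ids taken from the two tables above.
print(cluster_label("rec-1070-org"))   # record from dataset A
print(cluster_label("rec-561-dup-0"))  # record from dataset B

# A record and its duplicate share a label, so equal labels mark a true link.
# "rec-561-org" is a hypothetical id following the pattern seen in dataset A.
same_person = cluster_label("rec-561-org") == cluster_label("rec-561-dup-0")
print(same_person)
```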

Next, to better understand which variables will prove useful in linking, we have a look at how populated each column is, as well as the distribution of unique values within each

from splink import DuckDBAPI, Linker, SettingsCreator

basic_settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="link_only",
    # NB as we are linking one-one, we know the probability that a random pair will be a match
    # hence we could set:
    # "probability_two_random_records_match": 1/5000,
    # however we will not specify this here, as we will use this as a check that
    # our estimation procedure returns something sensible
)

linker = Linker(dfs, basic_settings, db_api=DuckDBAPI())

It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing

from splink.exploratory import completeness_chart

completeness_chart(dfs, db_api=DuckDBAPI())

from splink.exploratory import profile_columns

profile_columns(dfs, db_api=DuckDBAPI(), column_expressions=["given_name", "surname"])

Next let's come up with some candidate blocking rules, which define which record comparisons are generated, and have a look at how many comparisons each will generate.

For blocking rules that we use in prediction, our aim is to have the union of all rules cover all true matches, whilst avoiding generating so many comparisons that it becomes computationally intractable - i.e. each true match should have at least one of the following conditions holding.

from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules = [
    block_on("given_name", "surname"),
    # A blocking rule can also be an arbitrary SQL expression
    "l.given_name = r.surname and l.surname = r.given_name",
    block_on("date_of_birth"),
    block_on("soc_sec_id"),
    block_on("state", "address_1"),
    block_on("street_number", "address_1"),
    block_on("postcode"),
]


db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=dfs,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="link_only",
    unique_id_column_name="rec_id",
    source_dataset_column_name="source_dataset",
)

The broadest rule, having a matching postcode, unsurprisingly gives the largest number of comparisons. For this small dataset we still have a very manageable number, but if it were larger we might have needed to add a further AND condition to this rule to reduce the number of comparisons.
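To make the two aims of blocking concrete, here is a small sketch in plain Python. It does not use Splink; the toy records, the assumed true links and the helper names are all hypothetical. It counts the pairs an exact-match rule would generate between two datasets, and checks that every true link is covered by at least one rule.

```python
from collections import Counter

# Hypothetical toy records: (record id, postcode, date_of_birth).
dataset_a = [
    ("a1", "4223", "19151111"),
    ("a2", "4560", "19161214"),
    ("a3", "4223", "19700101"),
]
dataset_b = [
    ("b1", "4223", "19151111"),
    ("b2", "4650", "19161214"),  # postcode typo, same person as a2
    ("b3", "4223", "19800202"),
]
POSTCODE, DOB = 1, 2


def pairs_generated(left, right, column):
    """Count cross-dataset pairs that agree exactly on one column."""
    left_counts = Counter(row[column] for row in left)
    right_counts = Counter(row[column] for row in right)
    return sum(n * right_counts[value] for value, n in left_counts.items())


def covered(left_row, right_row):
    """True if at least one of the two toy blocking rules holds."""
    return (
        left_row[POSTCODE] == right_row[POSTCODE] or left_row[DOB] == right_row[DOB]
    )


postcode_pairs = pairs_generated(dataset_a, dataset_b, POSTCODE)
dob_pairs = pairs_generated(dataset_a, dataset_b, DOB)

# Assumed ground truth for the toy data: a1-b1 and a2-b2 are the true links.
true_links = [(dataset_a[0], dataset_b[0]), (dataset_a[1], dataset_b[1])]
all_true_links_covered = all(covered(l, r) for l, r in true_links)

print(postcode_pairs, dob_pairs, all_true_links_covered)
```

In this toy data the postcode rule alone misses the a2-b2 link because of the typo, which is why we rely on the union of several rules rather than any single one.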

Now we get the full settings by including the blocking rules, as well as deciding the actual comparisons we will be including in our model.

We will define two models, each with a separate linker with different settings, so that we can compare performance. One will be a very basic model, whilst the other will include a lot more detail.

import splink.comparison_level_library as cll
import splink.comparison_library as cl


# the simple model only considers a few columns, and only two comparison levels for each
simple_model_settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="link_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.ExactMatch("given_name").configure(term_frequency_adjustments=True),
        cl.ExactMatch("surname").configure(term_frequency_adjustments=True),
        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)

# the detailed model considers more columns, using the information we saw in the exploratory phase
# we also include further comparison levels to account for typos and other differences
detailed_model_settings = SettingsCreator(
    unique_id_column_name="rec_id",
    link_type="link_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.NameComparison("given_name").configure(term_frequency_adjustments=True),
        cl.NameComparison("surname").configure(term_frequency_adjustments=True),
        cl.DateOfBirthComparison(
            "date_of_birth",
            input_is_string=True,
            datetime_format="%Y%m%d",
            invalid_dates_as_null=True,
        ),
        cl.DamerauLevenshteinAtThresholds("soc_sec_id", [1, 2]),
        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
        cl.DamerauLevenshteinAtThresholds("postcode", [1, 2]).configure(
            term_frequency_adjustments=True
        ),
        # we don't consider further location columns as they will be strongly correlated with postcode
    ],
    retain_intermediate_calculation_columns=True,
)


linker_simple = Linker(dfs, simple_model_settings, db_api=DuckDBAPI())
linker_detailed = Linker(dfs, detailed_model_settings, db_api=DuckDBAPI())

Estimating model parameters

We need to furnish our models with parameter estimates so that we can generate results. We will focus on the detailed model, generating the values for the simple model at the end.

Rather than specifying the probability that two random records match, we can estimate it and compare with the known value of 1/5000 = 0.0002 to see how well our estimation procedure works.

To do this we come up with some deterministic rules. The aim is to generate very few false positives (i.e. we expect that the majority of record pairs with at least one of these conditions holding are true matches), whilst also capturing the majority of matches. Our guess here is that these two rules should capture 80% of all matches.

deterministic_rules = [
    block_on("soc_sec_id"),
    block_on("given_name", "surname", "date_of_birth"),
]

linker_detailed.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.8
)

Probability two random records match is estimated to be  0.000239.
This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs
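The printed figures can be reproduced by hand. This is a minimal sketch, assuming the estimate is formed by dividing the number of pairs found by the deterministic rules by the nominal recall, and then by the total number of possible comparisons. The count of 4,778 pairs is back-calculated from the output above (5,972.50 × 0.8); it is not a number reported by Splink.

```python
total_comparisons = 5000 * 5000  # every record in A against every record in B
recall = 0.8                     # our guess at the share of matches the rules find

# Assumption: back-calculated from the printed output, not reported directly.
pairs_found_by_rules = 4778

estimated_matches = pairs_found_by_rules / recall
estimated_probability = estimated_matches / total_comparisons

# Known value for this one-to-one linkage: 5000 true links.
known_probability = 5000 / total_comparisons

print(f"estimated matches:     {estimated_matches:,.2f}")
print(f"estimated probability: {estimated_probability:.6f}")
print(f"one in:                {1 / estimated_probability:,.2f}")
print(f"known probability:     {known_probability:.6f}")
```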

Even if we change these deterministic rules or the nominal recall, we are left with an answer which is pretty close to our known value.

Next we estimate u and m values for each comparison, so that we can move to generating predictions

# We generally recommend setting max pairs higher (e.g. 1e7 or more)
# But this will run faster for the purpose of this demo
linker_detailed.training.estimate_u_using_random_sampling(max_pairs=1e6)

You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----





u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - surname (no m values are trained).
    - date_of_birth (some u values are not trained, no m values are trained).
    - soc_sec_id (no m values are trained).
    - street_number (no m values are trained).
    - postcode (no m values are trained).

When training the m values using expectation maximisation, we need some more blocking rules to reduce the total number of comparisons. For each rule, we want to ensure that we have neither proportionally too many matches nor too few.

We must run this multiple times using different rules so that we can obtain estimates for all comparisons - if we block on e.g. date_of_birth, then we cannot compute the m values for the date_of_birth comparison, as we have only looked at records where these match.
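A toy illustration of why a blocked column cannot be trained, in plain Python with hypothetical records: once we keep only pairs that agree on date_of_birth, that comparison takes a single value across every pair, so there is nothing left to learn about it, while the other columns still vary.

```python
from itertools import product

# Hypothetical toy records from the two datasets.
left = [
    {"dob": "19151111", "postcode": "4223"},
    {"dob": "19161214", "postcode": "4560"},
]
right = [
    {"dob": "19151111", "postcode": "4223"},
    {"dob": "19161214", "postcode": "4650"},
    {"dob": "19151111", "postcode": "3212"},
]

# Equivalent of blocking on date_of_birth: keep only pairs that agree on it.
blocked_pairs = [(l, r) for l, r in product(left, right) if l["dob"] == r["dob"]]

# Distinct agreement outcomes seen for each column within the blocked pairs.
dob_agreement = {l["dob"] == r["dob"] for l, r in blocked_pairs}
postcode_agreement = {l["postcode"] == r["postcode"] for l, r in blocked_pairs}

print(len(blocked_pairs), dob_agreement, postcode_agreement)
```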

session_dob = (
    linker_detailed.training.estimate_parameters_using_expectation_maximisation(
        block_on("date_of_birth"), estimate_without_term_frequencies=True
    )
)
session_pc = (
    linker_detailed.training.estimate_parameters_using_expectation_maximisation(
        block_on("postcode"), estimate_without_term_frequencies=True
    )
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - soc_sec_id
    - street_number
    - postcode

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - date_of_birth

Iteration 1: Largest change in params was -0.331 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.00365 in the m_probability of given_name, level `All other comparisons`
Iteration 3: Largest change in params was 9.22e-05 in the m_probability of soc_sec_id, level `All other comparisons`

EM converged after 3 iterations

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (some u values are not trained, no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."postcode" = r."postcode"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname
    - date_of_birth
    - soc_sec_id
    - street_number

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - postcode

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.0374 in the m_probability of date_of_birth, level `All other comparisons`
Iteration 2: Largest change in params was 0.000457 in the m_probability of date_of_birth, level `All other comparisons`
Iteration 3: Largest change in params was 7.66e-06 in the m_probability of soc_sec_id, level `All other comparisons`

EM converged after 3 iterations
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (some u values are not trained, some m values are not trained).

If we wish, we can have a look at how our parameter estimates change over these training sessions.

session_dob.m_u_values_interactive_history_chart()

For variables that aren't used in the m-training blocking rules, we have two estimates --- one from each of the training sessions (see for example street_number). We can have a look at how the values compare between them, to ensure that we don't have drastically different values, which may be indicative of an issue.

linker_detailed.visualisations.parameter_estimate_comparisons_chart()

We repeat our parameter estimations for the simple model in much the same fashion

linker_simple.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.8
)
linker_simple.training.estimate_u_using_random_sampling(max_pairs=1e7)
session_given_name = (
    linker_simple.training.estimate_parameters_using_expectation_maximisation(
        block_on("given_name"), estimate_without_term_frequencies=True
    )
)
session_street_number = (
    linker_simple.training.estimate_parameters_using_expectation_maximisation(
        block_on("street_number"), estimate_without_term_frequencies=True
    )
)
linker_simple.visualisations.parameter_estimate_comparisons_chart()

Probability two random records match is estimated to be  0.000239.
This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - surname (no m values are trained).
    - street_number (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."given_name" = r."given_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - street_number

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - given_name

Iteration 1: Largest change in params was 0.0812 in the m_probability of surname, level `All other comparisons`
Iteration 2: Largest change in params was -0.0261 in the m_probability of surname, level `Exact match on surname`
Iteration 3: Largest change in params was -0.0247 in the m_probability of surname, level `Exact match on surname`
Iteration 4: Largest change in params was 0.0227 in the m_probability of surname, level `All other comparisons`
Iteration 5: Largest change in params was -0.0198 in the m_probability of surname, level `Exact match on surname`
Iteration 6: Largest change in params was 0.0164 in the m_probability of surname, level `All other comparisons`
Iteration 7: Largest change in params was -0.0131 in the m_probability of surname, level `Exact match on surname`
Iteration 8: Largest change in params was 0.0101 in the m_probability of surname, level `All other comparisons`
Iteration 9: Largest change in params was -0.00769 in the m_probability of surname, level `Exact match on surname`
Iteration 10: Largest change in params was 0.00576 in the m_probability of surname, level `All other comparisons`
Iteration 11: Largest change in params was -0.00428 in the m_probability of surname, level `Exact match on surname`
Iteration 12: Largest change in params was 0.00316 in the m_probability of surname, level `All other comparisons`
Iteration 13: Largest change in params was -0.00234 in the m_probability of surname, level `Exact match on surname`
Iteration 14: Largest change in params was -0.00172 in the m_probability of surname, level `Exact match on surname`
Iteration 15: Largest change in params was 0.00127 in the m_probability of surname, level `All other comparisons`
Iteration 16: Largest change in params was -0.000939 in the m_probability of surname, level `Exact match on surname`
Iteration 17: Largest change in params was -0.000694 in the m_probability of surname, level `Exact match on surname`
Iteration 18: Largest change in params was -0.000514 in the m_probability of surname, level `Exact match on surname`
Iteration 19: Largest change in params was -0.000381 in the m_probability of surname, level `Exact match on surname`
Iteration 20: Largest change in params was -0.000282 in the m_probability of surname, level `Exact match on surname`
Iteration 21: Largest change in params was 0.00021 in the m_probability of surname, level `All other comparisons`
Iteration 22: Largest change in params was -0.000156 in the m_probability of surname, level `Exact match on surname`
Iteration 23: Largest change in params was 0.000116 in the m_probability of surname, level `All other comparisons`
Iteration 24: Largest change in params was 8.59e-05 in the m_probability of surname, level `All other comparisons`

EM converged after 24 iterations

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."street_number" = r."street_number"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - surname

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - street_number

Iteration 1: Largest change in params was -0.0446 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was -0.0285 in the m_probability of surname, level `All other comparisons`
Iteration 3: Largest change in params was -0.026 in the m_probability of given_name, level `Exact match on given_name`
Iteration 4: Largest change in params was 0.0252 in the m_probability of given_name, level `All other comparisons`
Iteration 5: Largest change in params was -0.0231 in the m_probability of given_name, level `Exact match on given_name`
Iteration 6: Largest change in params was -0.02 in the m_probability of given_name, level `Exact match on given_name`
Iteration 7: Largest change in params was -0.0164 in the m_probability of given_name, level `Exact match on given_name`
Iteration 8: Largest change in params was -0.013 in the m_probability of given_name, level `Exact match on given_name`
Iteration 9: Largest change in params was 0.01 in the m_probability of given_name, level `All other comparisons`
Iteration 10: Largest change in params was -0.00757 in the m_probability of given_name, level `Exact match on given_name`
Iteration 11: Largest change in params was 0.00564 in the m_probability of given_name, level `All other comparisons`
Iteration 12: Largest change in params was -0.00419 in the m_probability of given_name, level `Exact match on given_name`
Iteration 13: Largest change in params was 0.0031 in the m_probability of given_name, level `All other comparisons`
Iteration 14: Largest change in params was -0.00231 in the m_probability of given_name, level `Exact match on given_name`
Iteration 15: Largest change in params was -0.00173 in the m_probability of given_name, level `Exact match on given_name`
Iteration 16: Largest change in params was 0.0013 in the m_probability of given_name, level `All other comparisons`
Iteration 17: Largest change in params was 0.000988 in the m_probability of given_name, level `All other comparisons`
Iteration 18: Largest change in params was -0.000756 in the m_probability of given_name, level `Exact match on given_name`
Iteration 19: Largest change in params was -0.000584 in the m_probability of given_name, level `Exact match on given_name`
Iteration 20: Largest change in params was -0.000465 in the m_probability of surname, level `Exact match on surname`
Iteration 21: Largest change in params was -0.000388 in the m_probability of surname, level `Exact match on surname`
Iteration 22: Largest change in params was -0.000322 in the m_probability of surname, level `Exact match on surname`
Iteration 23: Largest change in params was 0.000266 in the m_probability of surname, level `All other comparisons`
Iteration 24: Largest change in params was -0.000219 in the m_probability of surname, level `Exact match on surname`
Iteration 25: Largest change in params was -0.00018 in the m_probability of surname, level `Exact match on surname`

EM converged after 25 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

# import json
# we can have a look at the full settings if we wish, including the values of our estimated parameters:
# print(json.dumps(linker_detailed._settings_obj.as_dict(), indent=2))
# we can also get a handy summary of the model in an easily readable format if we wish:
# print(linker_detailed._settings_obj.human_readable_description)
# (we suppress output here for brevity)

We can now visualise some of the details of our models. We can look at the match weights, which tell us the relative importance for/against a match for each of our comparison levels.

Comparing the two models will show the added benefit we get in the more detailed model --- what in the simple model is classed as 'all other comparisons' is instead broken down further, and we can see that the detail of how this is broken down in fact gives us quite a bit of useful information about the likelihood of a match.

linker_simple.visualisations.match_weights_chart()

linker_detailed.visualisations.match_weights_chart()
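As a reminder of what these charts plot, a match weight is the base-2 logarithm of a Bayes factor, and the Bayes factor of a comparison level is the ratio m/u. The sketch below is a hand calculation, not a Splink call. The m and u values are hypothetical, while the three Bayes factors are taken from the `bf_postcode` column of the prediction output in this tutorial.

```python
import math


def bayes_factor(m: float, u: float) -> float:
    """Ratio of how often a level occurs among matches vs. non-matches."""
    return m / u


def match_weight(bf: float) -> float:
    """Match weight is log base 2 of the Bayes factor."""
    return math.log2(bf)


# Hypothetical m and u for an exact-match level.
print(bayes_factor(m=0.9, u=0.001))

# Bayes factors taken from the bf_postcode column of the prediction output.
for bf in (759.407155, 11.264825, 0.043565):
    print(f"bf={bf:>10.6f}  match weight={match_weight(bf):+.2f}")
```

A Bayes factor above 1 gives a positive weight (evidence for a match) and one below 1 gives a negative weight (evidence against).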

As well as the match weights, which give us an idea of the overall effect of each comparison level, we can also look at the individual u and m parameter estimates, which tell us about the prevalence of coincidences and mistakes (for further details/explanation about this see this article). We might want to revise aspects of our model based on the information we ascertain here.

Note however that some of these values are very small, which is why the match weight chart is often more useful for getting a decent picture of things.

# linker_simple.visualisations.m_u_parameters_chart()
linker_detailed.visualisations.m_u_parameters_chart()

It is also useful to have a look at unlinkable records - these are records which do not contain enough information to be linked at some match probability threshold. We can figure this out by seeing whether records are able to be matched with themselves.

This is of course relative to the information we have put into the model - we see that in our simple model, at a 99% match threshold nearly 10% of records are unlinkable, as we have not included enough information in the model for distinct records to be adequately distinguished; this is not an issue in our more detailed model.

linker_simple.evaluation.unlinkables_chart()

linker_detailed.evaluation.unlinkables_chart()

Our simple model doesn't do terribly, but suffers if we want to have a high match probability --- to be 99% (match weight ~7) certain of matches we have ~10% of records that we will be unable to link.

Our detailed model, however, has enough nuance that we can at least self-link records.
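The self-link idea can be sketched numerically. The conversion between match probability and match weight below is the standard log-odds relationship, and it reproduces the pairs of values shown in Splink's prediction output. The per-column weights are hypothetical, and we assume the prior enters the self-link score as it does for any other pair.

```python
import math


def probability_to_weight(p: float) -> float:
    """Match weight is the base-2 log of the odds."""
    return math.log2(p / (1 - p))


def weight_to_probability(w: float) -> float:
    return 2**w / (1 + 2**w)


prior_weight = probability_to_weight(0.000239)  # estimated prior from above
threshold_weight = probability_to_weight(0.99)  # the "~7" quoted in the text

# Hypothetical exact-match weights for the populated columns of two records.
sparse_record_weights = [7.0, 8.0]             # little information available
full_record_weights = [7.0, 8.0, 6.0, 10.0]    # more columns populated


def self_link_probability(weights):
    """Match probability of a record compared with itself."""
    return weight_to_probability(prior_weight + sum(weights))


print(f"prior weight:     {prior_weight:.2f}")
print(f"threshold weight: {threshold_weight:.2f}")
print(f"sparse record:    {self_link_probability(sparse_record_weights):.3f}")
print(f"full record:      {self_link_probability(full_record_weights):.6f}")
```

With these made-up weights the sparse record cannot reach 99% even against itself, so it would count as unlinkable at that threshold.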

Predictions

Now that we have had a look into the details of the models, we will focus on only our more detailed model, which should be able to capture more of the genuine links in our data

predictions = linker_detailed.inference.predict(threshold_match_probability=0.2)
df_predictions = predictions.as_pandas_dataframe()
df_predictions.head(5)




 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained

	match_weight	match_probability	source_dataset_l	source_dataset_r	rec_id_l	rec_id_r	given_name_l	given_name_r	gamma_given_name	tf_given_name_l	...	gamma_postcode	tf_postcode_l	tf_postcode_r	bf_postcode	bf_tf_adj_postcode	address_1_l	address_1_r	state_l	state_r	match_key
0	-1.830001	0.219521	__splink__input_table_0	__splink__input_table_1	rec-760-org	rec-3951-dup-0	lachlan	lachlan	4	0.0113	...	3	0.0007	0.0007	759.407155	1.583362	bushby close	templestoew avenue	nsw	vic	0
1	-1.801736	0.222896	__splink__input_table_0	__splink__input_table_1	rec-4980-org	rec-4980-dup-0	isabella	ctercteko	0	0.0069	...	3	0.0004	0.0004	759.407155	2.770884	sturt avenue	sturta venue	vic	vic	2
2	-1.271794	0.292859	__splink__input_table_0	__splink__input_table_1	rec-585-org	rec-585-dup-0	danny	stephenson	0	0.0001	...	2	0.0016	0.0012	11.264825	1.000000	o'shanassy street	o'shanassy street	tas	tas	1
3	-1.213441	0.301305	__splink__input_table_0	__splink__input_table_1	rec-1250-org	rec-1250-dup-0	luke	gazzola	0	0.0055	...	2	0.0015	0.0002	11.264825	1.000000	newman morris circuit	newman morr is circuit	nsw	nsw	1
4	-0.380336	0.434472	__splink__input_table_0	__splink__input_table_1	rec-4763-org	rec-4763-dup-0	max	alisha	0	0.0021	...	1	0.0004	0.0016	0.043565	1.000000	duffy street	duffy s treet	nsw	nsw	2

5 rows × 47 columns

We can see how our model performs at different probability thresholds, with a couple of options depending on the space we wish to view things

linker_detailed.evaluation.accuracy_analysis_from_labels_column(
    "cluster", output_type="accuracy"
)

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained

We can also easily see how many individuals we identify and link by looking at the clusters generated at some threshold match probability of interest, in this example 99%.

clusters = linker_detailed.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.99
)
df_clusters = clusters.as_pandas_dataframe().sort_values("cluster_id")
df_clusters.groupby("cluster_id").size().value_counts()

Completed iteration 1, root rows count 0





2    4959
1      82
Name: count, dtype: int64
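A quick sanity check on these counts, as plain arithmetic. The share computed at the end is only an upper bound on recall, since it assumes every two-record cluster is a true link, which this output alone does not confirm.

```python
# Cluster sizes and their counts, copied from the output above.
cluster_size_counts = {2: 4959, 1: 82}

# Every input record should appear in exactly one cluster.
total_records = sum(size * count for size, count in cluster_size_counts.items())

linked_pairs = cluster_size_counts[2]
unlinked_pairs = cluster_size_counts[1] // 2  # singletons come from both datasets

# Assumption: all two-record clusters are true links.
share_of_true_links = linked_pairs / 5000

print(total_records, linked_pairs, unlinked_pairs, f"{share_of_true_links:.2%}")
```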

In this case, we happen to know what the true links are, so we can manually inspect the ones that are doing worst to see what our model is not capturing - i.e. where we have false negatives.

Similarly, we can look at the non-links which are performing the best, to see whether we have an issue with false positives.

Ordinarily we would not have this luxury, and so would need to dig a bit deeper for clues as to how to improve our model, such as manually inspecting records across threshold probabilities.

df_predictions["cluster_l"] = df_predictions["rec_id_l"].apply(
    lambda x: "-".join(x.split("-")[:2])
)
df_predictions["cluster_r"] = df_predictions["rec_id_r"].apply(
    lambda x: "-".join(x.split("-")[:2])
)
df_true_links = df_predictions[
    df_predictions["cluster_l"] == df_predictions["cluster_r"]
].sort_values("match_probability")

records_to_view = 3
linker_detailed.visualisations.waterfall_chart(
    df_true_links.head(records_to_view).to_dict(orient="records")
)

df_non_links = df_predictions[
    df_predictions["cluster_l"] != df_predictions["cluster_r"]
].sort_values("match_probability", ascending=False)
linker_detailed.visualisations.waterfall_chart(
    df_non_links.head(records_to_view).to_dict(orient="records")
)

Looking at the non-links we have done well in having no false positives at any substantial match probability --- however looking at some of the true links we can see that there are a few that we are not capturing with sufficient match probability.

We can see that there are a few features that we are not capturing/weighting appropriately:

- single-character transpositions, particularly in postcode (which are being lumped in with more 'severe typos'/probable non-matches)
- given names and surnames being swapped, with typos
- given names and surnames cross-matching on one pair only, with no match on the other
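To see what a single-character transposition looks like in edit-distance terms, here is a small pure-Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is an illustration only and is not the implementation used by Splink's backend. The postcode "3212" appears in the data above; "3221" is a hypothetical transposed version.

```python
def osa_distance(a: str, b: str, allow_transpositions: bool = True) -> int:
    """Edit distance; optionally counts an adjacent swap as a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                allow_transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


with_swaps = osa_distance("3212", "3221")
without_swaps = osa_distance("3212", "3221", allow_transpositions=False)
print(with_swaps, without_swaps)
```

A transposition therefore sits at the same distance as any other single-character typo, which is why a dedicated comparison level is needed if we want to weight it differently.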

We will quickly see if we can incorporate these features into a new model. As we are now going into more detail with the inter-relationship between given name and surname, it is probably no longer sensible to model them as independent comparisons, and so we will need to switch to a combined comparison on full name.

# we need to append a full name column to our source data frames
# so that we can use it for term frequency adjustments
dfs[0]["full_name"] = dfs[0]["given_name"] + "_" + dfs[0]["surname"]
dfs[1]["full_name"] = dfs[1]["given_name"] + "_" + dfs[1]["surname"]


extended_model_settings = {
    "unique_id_column_name": "rec_id",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": blocking_rules,
    "comparisons": [
        {
            "output_column_name": "Full name",
            "comparison_levels": [
                {
                    "sql_condition": "(given_name_l IS NULL OR given_name_r IS NULL) and (surname_l IS NULL OR surname_r IS NULL)",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                # full name match
                cll.ExactMatchLevel("full_name", term_frequency_adjustments=True),
                # typos - keep levels across full name rather than scoring separately
                cll.JaroWinklerLevel("full_name", 0.9),
                cll.JaroWinklerLevel("full_name", 0.7),
                # name switched
                cll.ColumnsReversedLevel("given_name", "surname"),
                # name switched + typo
                {
                    "sql_condition": "jaro_winkler_similarity(given_name_l, surname_r) + jaro_winkler_similarity(surname_l, given_name_r) >= 1.8",
                    "label_for_charts": "switched + jaro_winkler_similarity >= 1.8",
                },
                {
                    "sql_condition": "jaro_winkler_similarity(given_name_l, surname_r) + jaro_winkler_similarity(surname_l, given_name_r) >= 1.4",
                    "label_for_charts": "switched + jaro_winkler_similarity >= 1.4",
                },
                # single name match
                cll.ExactMatchLevel("given_name", term_frequency_adjustments=True),
                cll.ExactMatchLevel("surname", term_frequency_adjustments=True),
                # single name cross-match
                {
                    "sql_condition": "given_name_l = surname_r OR surname_l = given_name_r",
                    "label_for_charts": "single name cross-matches",
                },
                # single name typos
                cll.JaroWinklerLevel("given_name", 0.9),
                cll.JaroWinklerLevel("surname", 0.9),
                # the rest
                cll.ElseLevel(),
            ],
        },
        cl.DateOfBirthComparison(
            "date_of_birth",
            input_is_string=True,
            datetime_format="%Y%m%d",
            invalid_dates_as_null=True,
        ),
        {
            "output_column_name": "Social security ID",
            "comparison_levels": [
                cll.NullLevel("soc_sec_id"),
                cll.ExactMatchLevel("soc_sec_id", term_frequency_adjustments=True),
                cll.DamerauLevenshteinLevel("soc_sec_id", 1),
                cll.DamerauLevenshteinLevel("soc_sec_id", 2),
                cll.ElseLevel(),
            ],
        },
        {
            "output_column_name": "Street number",
            "comparison_levels": [
                cll.NullLevel("street_number"),
                cll.ExactMatchLevel("street_number", term_frequency_adjustments=True),
                cll.DamerauLevenshteinLevel("street_number", 1),
                cll.ElseLevel(),
            ],
        },
        {
            "output_column_name": "Postcode",
            "comparison_levels": [
                cll.NullLevel("postcode"),
                cll.ExactMatchLevel("postcode", term_frequency_adjustments=True),
                cll.DamerauLevenshteinLevel("postcode", 1),
                cll.DamerauLevenshteinLevel("postcode", 2),
                cll.ElseLevel(),
            ],
        },
        # we don't consider further location columns as they will be strongly correlated with postcode
    ],
    "retain_intermediate_calculation_columns": True,
}
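The order of the levels in the full name comparison matters, because each record pair is assigned to the first level whose condition it satisfies. The following is a simplified pure-Python sketch of that cascade, not Splink's implementation: the fuzzy Jaro-Winkler levels are left out, the exact full name match is approximated by requiring both name parts to match, and the helper names `eq` and `full_name_level` are hypothetical.

```python
def eq(a, b):
    """SQL-style equality: a comparison involving a missing value is never true."""
    return a is not None and b is not None and a == b


def full_name_level(given_l, sur_l, given_r, sur_r):
    """Return the label of the first level that the record pair satisfies."""
    # Mirrors the null condition in the settings above
    if (given_l is None or given_r is None) and (sur_l is None or sur_r is None):
        return "null"
    if eq(given_l, given_r) and eq(sur_l, sur_r):
        return "exact match on full name"
    if eq(given_l, sur_r) and eq(sur_l, given_r):
        return "names switched"
    if eq(given_l, given_r):
        return "given name match only"
    if eq(sur_l, sur_r):
        return "surname match only"
    if eq(given_l, sur_r) or eq(sur_l, given_r):
        return "single name cross-match"
    return "all other comparisons"


print(full_name_level("michaela", "neumann", "michaela", "neumann"))
print(full_name_level("michaela", "neumann", "neumann", "michaela"))
print(full_name_level("michaela", "neumann", "neumann", "painter"))
```

A pair with fully switched names would also satisfy the single name cross-match condition, which is why the more specific level has to come first.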

# train
linker_advanced = Linker(dfs, extended_model_settings, db_api=DuckDBAPI())
linker_advanced.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.8
)
# We recommend increasing max_pairs to 1e8 to improve the accuracy of the u
# values in the full name comparison, as we have subdivided the data more finely

# Here, 1e7 for speed
linker_advanced.training.estimate_u_using_random_sampling(max_pairs=1e7)

Probability two random records match is estimated to be  0.000239.
This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs
----- Estimating u probabilities using random sampling -----



u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - Full name (no m values are trained).
    - date_of_birth (some u values are not trained, no m values are trained).
    - Social security ID (no m values are trained).
    - Street number (no m values are trained).
    - Postcode (no m values are trained).
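The prior printed at the top of the output above can be reproduced by hand. This is a minimal sketch, assuming the estimate is the number of pairs found by the deterministic rules, divided by the supplied recall, divided by the total number of comparisons. The pair count of 4,778 is back-calculated from the printed figures and is not something Splink reports here.

```python
n_a, n_b = 5_000, 5_000            # records in datasets A and B
total_comparisons = n_a * n_b      # link-only: every A record against every B record

deterministic_pairs = 4_778        # assumption: back-calculated from the output above
recall = 0.8                       # as passed to estimate_probability_two_random_records_match

expected_matches = deterministic_pairs / recall
prior = expected_matches / total_comparisons

print(f"{expected_matches:,.2f}")  # 5,972.50
print(f"{prior:.6f}")              # 0.000239
print(f"{1 / prior:,.2f}")         # 4,185.85
```

Since this dataset is known to contain 5,000 true links, the expected total of roughly 5,972 suggests that a recall of 0.8 is a somewhat pessimistic guess for these deterministic rules.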

session_dob = (
    linker_advanced.training.estimate_parameters_using_expectation_maximisation(
        "l.date_of_birth = r.date_of_birth", estimate_without_term_frequencies=True
    )
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.date_of_birth = r.date_of_birth

Parameter estimates will be made for the following comparison(s):
    - Full name
    - Social security ID
    - Street number
    - Postcode

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - date_of_birth

WARNING:
Level single name cross-matches on comparison Full name not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.465 in the m_probability of Full name, level `Exact match on full_name`
Iteration 2: Largest change in params was 0.00252 in the m_probability of Social security ID, level `All other comparisons`
Iteration 3: Largest change in params was 4.98e-05 in the m_probability of Social security ID, level `All other comparisons`

EM converged after 3 iterations
m probability not trained for Full name - single name cross-matches (comparison vector value: 3). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - Full name (some m values are not trained).
    - date_of_birth (some u values are not trained, no m values are trained).

session_pc = (
    linker_advanced.training.estimate_parameters_using_expectation_maximisation(
        "l.postcode = r.postcode", estimate_without_term_frequencies=True
    )
)

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.postcode = r.postcode

Parameter estimates will be made for the following comparison(s):
    - Full name
    - date_of_birth
    - Social security ID
    - Street number

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - Postcode

WARNING:
Level single name cross-matches on comparison Full name not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value

WARNING:
Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.0374 in the m_probability of date_of_birth, level `All other comparisons`
Iteration 2: Largest change in params was 0.000656 in the m_probability of date_of_birth, level `All other comparisons`
Iteration 3: Largest change in params was 1.75e-05 in the m_probability of Social security ID, level `All other comparisons`

EM converged after 3 iterations
m probability not trained for Full name - single name cross-matches (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - Full name (some m values are not trained).
    - date_of_birth (some u values are not trained, some m values are not trained).
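The two EM sessions above complement each other: each one cannot estimate parameters for the comparison that its blocking rule holds fixed, so a second rule on a different column is needed to cover it. A small sketch of that bookkeeping, with the comparison names taken from the logs above:

```python
comparisons = {
    "Full name",
    "date_of_birth",
    "Social security ID",
    "Street number",
    "Postcode",
}

# Comparison that each blocking rule holds fixed, and so cannot train in that session
blocked_in_session = {
    "l.date_of_birth = r.date_of_birth": {"date_of_birth"},
    "l.postcode = r.postcode": {"Postcode"},
}

trained = set()
for rule, blocked in blocked_in_session.items():
    trained |= comparisons - blocked

print(trained == comparisons)  # True
```

Covering every comparison does not guarantee that every level is trained: levels that are never observed among the blocked pairs, such as the single name cross-match level here, are still reported as untrained.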

linker_advanced.visualisations.parameter_estimate_comparisons_chart()

linker_advanced.visualisations.match_weights_chart()
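The match weights chart shows, for each comparison level, the base-2 logarithm of the ratio of its m and u probabilities. The sketch below works through that calculation and shows how weights combine with the prior into a match probability. It ignores term frequency adjustments, and the m and u values are made up for illustration rather than taken from the model trained above.

```python
import math


def match_weight(m: float, u: float) -> float:
    """log2 of the Bayes factor m / u for one comparison level."""
    return math.log2(m / u)


def match_probability(prior: float, weights: list[float]) -> float:
    """Combine a prior with per-comparison match weights into a probability."""
    prior_weight = math.log2(prior / (1 - prior))
    bayes_factor = 2 ** (prior_weight + sum(weights))
    return bayes_factor / (1 + bayes_factor)


# Illustrative values only
print(match_weight(0.5, 0.125))   # 2.0: level is 4x more likely among matches
print(match_weight(0.125, 0.5))   # -2.0: level is 4x more likely among non-matches
```

With a prior as small as the one estimated for this model, a pair needs a large positive total weight across its comparisons before its match probability approaches the 0.99 threshold used for clustering.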

predictions_adv = linker_advanced.inference.predict()
df_predictions_adv = predictions_adv.as_pandas_dataframe()
clusters_adv = linker_advanced.clustering.cluster_pairwise_predictions_at_threshold(
    predictions_adv, threshold_match_probability=0.99
)
df_clusters_adv = clusters_adv.as_pandas_dataframe().sort_values("cluster_id")
df_clusters_adv.groupby("cluster_id").size().value_counts()

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'Full name':
    m values not fully trained
Comparison: 'date_of_birth':
    m values not fully trained
Comparison: 'date_of_birth':
    u values not fully trained
Completed iteration 1, root rows count 0





2    4960
1      80
Name: count, dtype: int64

This is only a modest improvement on our previous model. However, it is worth reiterating that we should not necessarily expect to recover all matches: in several cases the two records differ so much that no model could reasonably be confident that they refer to the same entity.
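The cluster-size table above can be read as 4,960 linked pairs plus 80 records left unlinked, which together account for all 10,000 input records. Here is a toy illustration of how such a table arises, using `collections.Counter` in place of the pandas `groupby(...).size().value_counts()` call, with invented cluster assignments:

```python
from collections import Counter

# Hypothetical cluster assignments: record id -> cluster id
cluster_of = {
    "rec-1-org": "c1", "rec-1-dup-0": "c1",   # linked pair
    "rec-2-org": "c2", "rec-2-dup-0": "c2",   # linked pair
    "rec-3-org": "c3",                        # left unlinked
    "rec-3-dup-0": "c4",                      # left unlinked
}

cluster_sizes = Counter(cluster_of.values())         # cluster id -> number of records
size_distribution = Counter(cluster_sizes.values())  # cluster size -> number of clusters
print(dict(size_distribution))  # {2: 2, 1: 2}

# Sanity check on the real output: every record appears in exactly one cluster
assert 4960 * 2 + 80 * 1 == 10_000
```

Note that the table counts clusters, not correct links: checking each size-2 cluster against the `cluster` column derived from `rec_id` would be needed to confirm that the linked pairs are true matches.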

If we wished to improve matters we could iterate on this process: investigating where our model is not performing as we would hope, and adjusting the relevant comparisons to address those shortcomings.
