Documentation for Linker object methods related to parameter estimation

The Linker object manages the data linkage process and holds the data linkage model.

Most of Splink's functionality can be accessed by calling methods (functions) on the linker, such as linker.predict(), linker.profile_columns() etc.

The Linker class is intended for subclassing for specific backends, e.g. a DuckDBLinker.

estimate_m_from_label_column(label_colname)

Estimate the m parameters of the linkage model from a label (ground truth) column in the input dataframe(s).

The m parameters represent the proportion of record comparisons that fall into each comparison level amongst truly matching records.

The ground truth column is used to generate pairwise record comparisons which are then assumed to be matches.

For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.

Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.
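To make the idea concrete, here is a minimal pure-Python sketch of estimating m values from a label column. It is an illustration under stated assumptions rather than Splink's implementation: the toy records, the `first_name_level` comparison and the `estimate_m` helper are all hypothetical names introduced for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; "ssn" plays the role of the label column
# and is allowed to be missing (None).
records = [
    {"id": 1, "first_name": "john", "ssn": "A"},
    {"id": 2, "first_name": "john", "ssn": "A"},
    {"id": 3, "first_name": "jon", "ssn": "A"},
    {"id": 4, "first_name": "mary", "ssn": "B"},
    {"id": 5, "first_name": "mary", "ssn": "B"},
    {"id": 6, "first_name": "lucy", "ssn": None},
]


def first_name_level(left, right):
    """Hypothetical two-level comparison: 1 = exact match, 0 = anything else."""
    return 1 if left["first_name"] == right["first_name"] else 0


def estimate_m(rows, label_col, level_fn):
    """Proportion of assumed-matching pairs that fall into each comparison level."""
    # Pairs that share a non-null label are assumed to be true matches.
    matching_pairs = [
        (left, right)
        for left, right in combinations(rows, 2)
        if left[label_col] is not None and left[label_col] == right[label_col]
    ]
    counts = Counter(level_fn(left, right) for left, right in matching_pairs)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


m_first_name = estimate_m(records, "ssn", first_name_level)
```

Record 6 has no label, so it contributes no pairs; this mirrors the point that the column does not need to be fully populated.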

Parameters:

Name Type Description Default
label_colname str

The name of the column containing the ground truth label in the input data.

required

Examples:

linker.estimate_m_from_label_column("social_security_number")

Returns:

Type Description

Updates the estimated m parameters within the linker object and returns nothing.

estimate_parameters_using_expectation_maximisation(blocking_rule, comparisons_to_deactivate=None, comparison_levels_to_reverse_blocking_rule=None, fix_probability_two_random_records_match=False, fix_m_probabilities=False, fix_u_probabilities=True, populate_probability_two_random_records_match_from_trained_values=False)

Estimate the parameters of the linkage model using expectation maximisation.

By default, the m probabilities are estimated, but not the u probabilities, because good estimates for the u probabilities can be obtained from linker.estimate_u_using_random_sampling(). You can change this by setting fix_u_probabilities to False.

The blocking rule provided is used to generate pairwise record comparisons. Usually, this should be a blocking rule that results in a dataframe where matches are between about 1% and 99% of the comparisons.

By default, m parameters are estimated for all comparisons except those which are included in the blocking rule.

For example, if the blocking rule is l.first_name = r.first_name, then parameter estimates will be made for all comparisons except those which use first_name in their sql_condition.

By default, the probability two random records match is estimated for the blocked data, and then the m and u parameters for the columns specified in the blocking rules are used to estimate the global probability two random records match.

To control which comparisons should have their parameters estimated, and the process of 'reversing out' the global probability two random records match, the user may specify comparisons_to_deactivate and comparison_levels_to_reverse_blocking_rule. This is useful, for example, if you block on the dmetaphone of a column but match on the original column.
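One way to picture the 'reversing out' step is as a Bayes factor adjustment. The sketch below is an assumption about how such an adjustment can work, not a copy of Splink's code: the probability estimated on the blocked pairs is converted to odds, divided by the Bayes factor (m / u) of each comparison level implied by the blocking rule, and converted back. The function names and the m and u values are hypothetical.

```python
def prob_to_odds(p):
    return p / (1 - p)


def odds_to_prob(odds):
    return odds / (1 + odds)


def reverse_out_global_probability(blocked_probability, reversed_levels):
    """Sketch of turning a match probability estimated on blocked pairs
    into a global one.

    blocked_probability: probability two records match, among blocked pairs.
    reversed_levels: (m, u) tuples, one per comparison level implied by
        the blocking rule (hypothetical values in this example).
    """
    odds = prob_to_odds(blocked_probability)
    for m, u in reversed_levels:
        # Divide out the evidence already "used up" by the blocking rule.
        odds /= m / u
    return odds_to_prob(odds)


# Hypothetical case: blocking on exact first_name and exact dob.
global_probability = reverse_out_global_probability(
    0.5,
    [(0.8, 0.01), (0.9, 0.001)],
)
```

With these made-up numbers, half of the blocked pairs are matches, but the global probability comes out far smaller because agreement on both columns is strong evidence of a match.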

Examples:

Default behaviour

br_training = "l.first_name = r.first_name and l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(br_training)

Specify which comparisons to deactivate

br_training = "l.dmeta_first_name = r.dmeta_first_name"
settings_obj = linker._settings_obj
comp = settings_obj._get_comparison_by_output_column_name("first_name")
dmeta_level = comp._get_comparison_level_by_comparison_vector_value(1)
linker.estimate_parameters_using_expectation_maximisation(
    br_training,
    comparisons_to_deactivate=["first_name"],
    comparison_levels_to_reverse_blocking_rule=[dmeta_level],
)

Parameters:

Name Type Description Default
blocking_rule str

The blocking rule used to generate pairwise record comparisons.

required
comparisons_to_deactivate list

By default, splink will analyse the blocking rule provided and estimate the m parameters for all comparisons except those included in the blocking rule. If comparisons_to_deactivate are provided, splink will instead estimate m parameters for all comparisons except those specified in the comparisons_to_deactivate list. This list can contain either the output_column_name of the Comparison as a string, or Comparison objects. Defaults to None.

None
comparison_levels_to_reverse_blocking_rule list

By default, splink will analyse the blocking rule provided and adjust the global probability two random records match to account for the matches specified in the blocking rule. If provided, this argument will overrule this default behaviour. The user must provide a list of ComparisonLevel objects. Defaults to None.

None
fix_probability_two_random_records_match bool

If True, do not update the probability two random records match after each iteration. Defaults to False.

False
fix_m_probabilities bool

If True, do not update the m probabilities after each iteration. Defaults to False.

False
fix_u_probabilities bool

If True, do not update the u probabilities after each iteration. Defaults to True.

True

Examples:

blocking_rule = "l.first_name = r.first_name and l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

Returns:

Name Type Description
EMTrainingSession EMTrainingSession

An object containing information about the training session such as how parameters changed during the iteration history

estimate_u_using_random_sampling(max_pairs=None, seed=None, *, target_rows=None)

Estimate the u parameters of the linkage model using random sampling.

The u parameters represent the proportion of record comparisons that fall into each comparison level amongst truly non-matching records.

This procedure takes a sample of the data and generates the cartesian product of pairwise record comparisons amongst the sampled records. The validity of the u values rests on the assumption that the resultant pairwise comparisons are non-matches (or at least, they are very unlikely to be matches). For large datasets, this is typically true.

The results of estimate_u_using_random_sampling, and therefore an entire splink model, can be made reproducible by setting the seed parameter. Setting the seed will have performance implications as additional processing is required.
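The procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not Splink's implementation: the `estimate_u` helper, the `exact_first_name` comparison and the toy data are hypothetical, and the sketch samples records directly rather than targeting a number of pairs.

```python
import random
from collections import Counter
from itertools import combinations


def exact_first_name(left, right):
    """Hypothetical two-level comparison: 1 = exact match, 0 = anything else."""
    return 1 if left["first_name"] == right["first_name"] else 0


def estimate_u(rows, level_fn, sample_size, seed=None):
    """Proportion of sampled pairs in each comparison level.

    All sampled pairs are treated as non-matches, which is the
    assumption the u estimates rest on.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample = rng.sample(rows, min(sample_size, len(rows)))
    # Every unordered pair amongst the sampled records.
    counts = Counter(level_fn(left, right) for left, right in combinations(sample, 2))
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Hypothetical toy data: 200 records cycling through five first names.
names = ["amy", "bob", "cat", "dan", "eve"]
toy_records = [{"id": i, "first_name": names[i % 5]} for i in range(200)]

u_first = estimate_u(toy_records, exact_first_name, sample_size=50, seed=42)
u_second = estimate_u(toy_records, exact_first_name, sample_size=50, seed=42)
```

Running the estimate twice with the same seed gives identical values, which is the reproducibility the seed parameter is meant to provide.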

Parameters:

Name Type Description Default
max_pairs int

The maximum number of pairwise record comparisons to

None
seed int

Seed for random sampling. Assign to get reproducible u

None

Examples:

linker.estimate_u_using_random_sampling(1e8)

Returns:

Name Type Description
None

Updates the estimated u parameters within the linker object and returns nothing.

save_model_to_json(out_path=None, overwrite=False)

Save the configuration and parameters of the linkage model to a .json file.

The model can later be loaded back in using linker.load_model(). The settings dict is also returned in case you want to save it in a different way.

Examples:

linker.save_model_to_json("my_settings.json", overwrite=True)

Parameters:

Name Type Description Default
out_path str

File path for json file. If None, don't save to file. Defaults to None.

None
overwrite bool

Overwrite if already exists? Defaults to False.

False

Returns:

Name Type Description
dict dict

The settings as a dictionary.
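The interaction between out_path and overwrite can be pictured with the following standalone sketch. It is not the library's code: `save_settings` is a hypothetical helper, the settings dict is a made-up fragment, and the choice of `FileExistsError` for an existing file is an assumption made for this example.

```python
import json
import os
import tempfile


def save_settings(settings, out_path=None, overwrite=False):
    """Hypothetical helper mirroring the documented out_path/overwrite behaviour."""
    if out_path is not None:
        if os.path.isfile(out_path) and not overwrite:
            # Assumed behaviour: refuse to clobber an existing file.
            raise FileExistsError(f"{out_path} already exists; pass overwrite=True")
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(settings, f, indent=4)
    # The dict is returned whether or not a file was written.
    return settings


# Made-up settings fragment, for illustration only.
example_settings = {"link_type": "dedupe_only", "probability_two_random_records_match": 0.001}

settings_path = os.path.join(tempfile.mkdtemp(), "my_settings.json")
returned = save_settings(example_settings, settings_path)

with open(settings_path, encoding="utf-8") as f:
    reloaded = json.load(f)
```

Calling the helper with out_path=None simply returns the dictionary, which is convenient if you prefer to store the settings somewhere other than a local file.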

estimate_m_from_pairwise_labels(labels_splinkdataframe_or_table_name)

Estimate the m parameters of the linkage model from a dataframe of pairwise labels.

The table of labels should be in the following format, and should be registered with your database:

|source_dataset_l|unique_id_l|source_dataset_r|unique_id_r|
|----------------|-----------|----------------|-----------|
|df_1            |1          |df_2            |2          |
|df_1            |1          |df_2            |3          |

Note that source_dataset and unique_id should correspond to the values specified in the settings dict, and the input_table_aliases passed to the linker object. Note that at the moment, this method does not respect values in a clerical_match_score column. If provided, these are ignored and it is assumed that every row in the table of labels is a score of 1, i.e. a perfect match.
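As an illustration of the expected layout, the sketch below writes a small labels file with the four required columns using only the standard library. The rows and the temporary output location are hypothetical; in practice the values must correspond to your own source_dataset and unique_id values.

```python
import csv
import os
import tempfile

# Column names required by the labels table format described above.
columns = ["source_dataset_l", "unique_id_l", "source_dataset_r", "unique_id_r"]

# Hypothetical labelled pairs; each row is treated as a match.
label_rows = [
    ("df_1", 1, "df_2", 2),
    ("df_1", 1, "df_2", 3),
]

labels_path = os.path.join(tempfile.mkdtemp(), "pairwise_labels_to_estimate_m.csv")
with open(labels_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(label_rows)

# Read the file back to confirm the layout.
with open(labels_path, newline="", encoding="utf-8") as f:
    written = list(csv.DictReader(f))
```

A file in this shape can then be loaded and registered with the linker before estimating m from it.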

Parameters:

Name Type Description Default
labels_splinkdataframe_or_table_name str

Name of table containing labels in the database or SplinkDataframe

required

Examples:

import pandas as pd

pairwise_labels = pd.read_csv("./data/pairwise_labels_to_estimate_m.csv")
linker.register_table(pairwise_labels, "labels", overwrite=True)
linker.estimate_m_from_pairwise_labels("labels")

estimate_probability_two_random_records_match(deterministic_matching_rules, recall)

Estimate the model parameter probability_two_random_records_match using a direct estimation approach.

See here for discussion of methodology

Parameters:

Name Type Description Default
deterministic_matching_rules list

A list of deterministic matching rules that should be designed to admit very few (none if possible) false positives

required
recall float

A guess at the recall the deterministic matching rules will attain, i.e. the proportion of true matches that these deterministic rules will recover.

required
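The arithmetic behind a direct estimate of this kind can be sketched as below. This is an illustration under assumptions, not the library's code: it assumes deduplication of a single dataset, so the number of possible comparisons is n(n-1)/2, and the function name and input figures are hypothetical.

```python
def direct_estimate(n_records, n_deterministic_matches, recall):
    """Sketch of a direct estimate of the probability two random records match.

    n_records: number of records in the (single) input dataset.
    n_deterministic_matches: pairs found by the deterministic rules,
        assumed to contain very few false positives.
    recall: guessed proportion of true matches the rules recover.
    """
    # Assumption: dedupe of one dataset, so n * (n - 1) / 2 possible pairs.
    total_comparisons = n_records * (n_records - 1) / 2
    # Scale up the matches found to account for the ones the rules missed.
    estimated_true_matches = n_deterministic_matches / recall
    return estimated_true_matches / total_comparisons


# Hypothetical figures: 1,000 records, 350 rule-based matches, 70% recall.
probability = direct_estimate(1000, 350, 0.7)
```

With these figures the rules are assumed to find 70% of roughly 500 true matching pairs, out of 499,500 possible pairs, giving a probability of about 0.001. A lower recall guess yields a higher estimate, which is why the rules should admit as few false positives as possible while the recall guess stays realistic.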