
Documentation for Linker object methods related to parameter estimation

The Linker object manages the data linkage process and holds the data linkage model.

Most of Splink's functionality can be accessed by calling methods (functions) on the linker, such as linker.predict(), linker.profile_columns() etc.

The Linker class is intended for subclassing for specific backends, e.g. a DuckDBLinker.

estimate_m_from_label_column(label_colname)

Estimate the m parameters of the linkage model from a label (ground truth) column in the input dataframe(s).

The m parameters represent the proportion of record comparisons that fall into each comparison level amongst truly matching records.

The ground truth column is used to generate pairwise record comparisons which are then assumed to be matches.

For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.

Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.
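To make the idea concrete, here is a minimal pure-Python sketch of how m values could be derived from a label column. It is not Splink's implementation: the toy records, the two-level `first_name_level` comparison and the helper `m_from_label` are all hypothetical names introduced for illustration. Records with a missing label are skipped, which mirrors the note above about partially populated columns.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records; "ssn" plays the role of the label column
# and may be missing (None) for some records.
records = [
    {"id": 1, "first_name": "john", "ssn": "A"},
    {"id": 2, "first_name": "jon", "ssn": "A"},
    {"id": 3, "first_name": "john", "ssn": "A"},
    {"id": 4, "first_name": "mary", "ssn": "B"},
    {"id": 5, "first_name": "mary", "ssn": "B"},
    {"id": 6, "first_name": "paul", "ssn": None},
]


def first_name_level(left, right):
    """Toy comparison with two levels: 1 = exact match, 0 = anything else."""
    return 1 if left["first_name"] == right["first_name"] else 0


def m_from_label(records, label, level_fn):
    """Proportion of same-label pairs that fall into each comparison level."""
    groups = defaultdict(list)
    for rec in records:
        if rec[label] is not None:  # unlabelled records contribute no pairs
            groups[rec[label]].append(rec)

    counts = Counter()
    for group in groups.values():
        # Every pair sharing a label is assumed to be a true match.
        for left, right in combinations(group, 2):
            counts[level_fn(left, right)] += 1

    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


m_first_name = m_from_label(records, "ssn", first_name_level)
print(m_first_name)
```

Four labelled pairs are generated here (three for label "A", one for label "B"), and the m value for each level is simply its share of those pairs.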

Parameters:

Name Type Description Default
label_colname str

The name of the column containing the ground truth label in the input data.

required

Examples:

>>> linker.estimate_m_from_label_column("social_security_number")

Returns:

Type Description

Updates the estimated m parameters within the linker object and returns nothing.

estimate_parameters_using_expectation_maximisation(blocking_rule, comparisons_to_deactivate=None, comparison_levels_to_reverse_blocking_rule=None, fix_probability_two_random_records_match=False, fix_m_probabilities=False, fix_u_probabilities=True, populate_probability_two_random_records_match_from_trained_values=False)

Estimate the parameters of the linkage model using expectation maximisation.

By default, the m probabilities are estimated, but not the u probabilities, because good estimates for the u probabilities can be obtained from linker.estimate_u_using_random_sampling(). You can change this by setting fix_u_probabilities to False.
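The following toy sketch shows what expectation maximisation with fixed u probabilities can look like for simple agree/disagree comparisons. It is an illustration under stated assumptions, not Splink's code: the function `em_fixed_u`, the binary comparison levels and the pattern counts are all hypothetical, and `lam` stands for the probability that two records within the blocked comparisons match.

```python
def em_fixed_u(pattern_counts, u, m_init, lam_init, n_iter=200):
    """Toy EM for binary agree/disagree comparisons with u held fixed.

    pattern_counts maps a tuple of 0/1 agreement flags (one per comparison)
    to the number of blocked record pairs showing that pattern.
    """
    m, lam = list(m_init), lam_init
    n_pairs = sum(pattern_counts.values())
    for _ in range(n_iter):
        # E-step: posterior match probability for each agreement pattern.
        match_weight = {}
        for pattern in pattern_counts:
            p_match, p_non = lam, 1.0 - lam
            for k, agree in enumerate(pattern):
                p_match *= m[k] if agree else 1.0 - m[k]
                p_non *= u[k] if agree else 1.0 - u[k]
            match_weight[pattern] = p_match / (p_match + p_non)

        # M-step: update lam and m from the expected matches; u is not updated.
        expected_matches = sum(
            match_weight[p] * c for p, c in pattern_counts.items()
        )
        lam = expected_matches / n_pairs
        m = [
            sum(match_weight[p] * c * p[k] for p, c in pattern_counts.items())
            / expected_matches
            for k in range(len(m))
        ]
    return m, lam


# Hypothetical counts for two comparisons, e.g. (surname agrees, dob agrees).
pattern_counts = {(1, 1): 2195, (1, 0): 1205, (0, 1): 555, (0, 0): 6045}
m_hat, lam_hat = em_fixed_u(
    pattern_counts, u=[0.1, 0.05], m_init=[0.8, 0.8], lam_init=0.5
)
print(m_hat, lam_hat)
```

The counts were constructed from a model with m = (0.9, 0.8), u = (0.1, 0.05) and a 30% match rate, so the estimates should move towards those values as the iterations proceed.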

The blocking rule provided is used to generate pairwise record comparisons. Usually, this should be a blocking rule that results in a dataframe where matches are between about 1% and 99% of the comparisons.

By default, m parameters are estimated for all comparisons except those which are included in the blocking rule.

For example, if the blocking rule is l.first_name = r.first_name, then parameter estimates will be made for all comparisons except those which use first_name in their sql_condition.

By default, the probability two random records match is estimated for the blocked data, and then the m and u parameters for the columns specified in the blocking rules are used to estimate the global probability two random records match.

To control which comparisons have their parameters estimated, and how the global probability two random records match is 'reversed out', the user may specify comparisons_to_deactivate and comparison_levels_to_reverse_blocking_rule. This is useful, for example, if you block on the dmetaphone of a column but match on the original column.
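One way to read the 'reversing out' step is as an odds adjustment: the match odds observed within the blocked comparisons are divided by the Bayes factor (m / u) of each comparison level implied by the blocking rule. The sketch below works through that arithmetic. It is an assumption about the general approach rather than a copy of Splink's code, and the helper names and the m and u values are hypothetical.

```python
def prob_to_odds(p):
    return p / (1.0 - p)


def odds_to_prob(odds):
    return odds / (1.0 + odds)


def reverse_blocking_rule(prob_within_block, levels):
    """Convert a within-block match probability to a global one.

    levels is a list of (m, u) pairs, one for each comparison level that
    the blocking rule forces to be true (e.g. exact match on first_name).
    """
    odds = prob_to_odds(prob_within_block)
    for m, u in levels:
        odds /= m / u  # divide out the Bayes factor of the blocked level
    return odds_to_prob(odds)


# Hypothetical numbers: half of the blocked pairs are matches, and the
# exact-match level on the blocked column has m = 0.9 and u = 0.01.
global_prob = reverse_blocking_rule(0.5, [(0.9, 0.01)])
print(global_prob)
```

With these numbers the within-block odds of 1 are divided by a Bayes factor of 90, so the global estimate is far smaller than the 50% match rate seen among the blocked pairs.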

Examples:

>>> # Default behaviour
>>> br_training = "l.first_name = r.first_name and l.dob = r.dob"
>>> linker.estimate_parameters_using_expectation_maximisation(br_training)
>>> # Specify which comparisons to deactivate
>>> br_training = "l.dmeta_first_name = r.dmeta_first_name"
>>> settings_obj = linker._settings_obj
>>> comp = settings_obj._get_comparison_by_output_column_name("first_name")
>>> dmeta_level = comp._get_comparison_level_by_comparison_vector_value(1)
>>> linker.estimate_parameters_using_expectation_maximisation(
...     br_training,
...     comparisons_to_deactivate=["first_name"],
...     comparison_levels_to_reverse_blocking_rule=[dmeta_level],
... )

Parameters:

Name Type Description Default
blocking_rule str

The blocking rule used to generate pairwise record comparisons.

required
comparisons_to_deactivate list

By default, Splink will analyse the blocking rule provided and estimate the m parameters for all comparisons except those included in the blocking rule. If comparisons_to_deactivate is provided, Splink will instead estimate m parameters for all comparisons except those specified in the comparisons_to_deactivate list. This list can contain either the output_column_name of each Comparison as a string, or Comparison objects. Defaults to None.

None
comparison_levels_to_reverse_blocking_rule list

By default, Splink will analyse the blocking rule provided and adjust the global probability two random records match to account for the matches specified in the blocking rule. If provided, this argument overrides that default behaviour. The user must provide a list of ComparisonLevel objects. Defaults to None.

None
fix_probability_two_random_records_match bool

If True, do not update the probability two random records match after each iteration. Defaults to False.

False
fix_m_probabilities bool

If True, do not update the m probabilities after each iteration. Defaults to False.

False
fix_u_probabilities bool

If True, do not update the u probabilities after each iteration. Defaults to True.

True

Examples:

>>> blocking_rule = "l.first_name = r.first_name and l.dob = r.dob"
>>> linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

Returns:

Name Type Description
EMTrainingSession EMTrainingSession

An object containing information about the training session, such as how the parameters changed over the iteration history.

estimate_u_using_random_sampling(target_rows)

Estimate the u parameters of the linkage model using random sampling.

The u parameters represent the proportion of record comparisons that fall into each comparison level amongst truly non-matching records.

This procedure takes a sample of the data and generates the cartesian product of pairwise record comparisons amongst the sampled records. The validity of the u values rests on the assumption that the resultant pairwise comparisons are non-matches (or at least, they are very unlikely to be matches). For large datasets, this is typically true.
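The procedure can be sketched in plain Python as follows. This is an illustration only, assuming a single dataset being deduplicated: the helpers `sample_size_for_target` and `estimate_u`, the toy records and the two-level comparison are hypothetical and do not reflect how Splink generates its SQL. The sample size is chosen so that the number of pairs, n(n-1)/2, is roughly the target number of comparisons.

```python
import math
import random
from collections import Counter
from itertools import combinations


def sample_size_for_target(target_rows):
    """Number of records n such that n * (n - 1) / 2 is about target_rows."""
    return math.ceil((1 + math.sqrt(1 + 8 * target_rows)) / 2)


def estimate_u(records, level_fn, target_rows, seed=0):
    """Share of sampled pairs in each comparison level.

    All sampled pairs are treated as non-matches, which is the assumption
    the u estimates rest on.
    """
    n = min(len(records), sample_size_for_target(target_rows))
    sample = random.Random(seed).sample(records, n)
    counts = Counter(
        level_fn(left, right) for left, right in combinations(sample, 2)
    )
    total = sum(counts.values())
    return {level: c / total for level, c in counts.items()}


def first_name_level(left, right):
    """Toy comparison with two levels: 1 = exact match, 0 = anything else."""
    return 1 if left["first_name"] == right["first_name"] else 0


# Hypothetical data: 100 records sharing 10 distinct first names.
names = ["ann", "bob", "cat", "dan", "eve", "fay", "gus", "hal", "ivy", "jon"]
records = [{"id": i, "first_name": names[i % 10]} for i in range(100)]

u_first_name = estimate_u(records, first_name_level, target_rows=1e6)
print(sample_size_for_target(1e8), u_first_name)
```

In this toy data every name is shared by ten records, so chance agreement on first_name is fairly common. On real data with many distinct values the u value for an exact match would typically be much smaller.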

Parameters:

Name Type Description Default
target_rows int

The target number of pairwise record comparisons from

required

Examples:

>>> linker.estimate_u_using_random_sampling(1e8)

Returns:

Name Type Description
None

Updates the estimated u parameters within the linker object and returns nothing.

save_settings_to_json(out_path, overwrite=False)

Save the configuration and parameters of the linkage model to a .json file.

The model can later be loaded back in using linker.load_settings_from_json().

Examples:

>>> linker.save_settings_to_json("my_settings.json", overwrite=True)

Parameters:

Name Type Description Default
out_path str

File path for json file

required
overwrite bool

Overwrite if already exists? Defaults to False.

False
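The overwrite flag can be pictured with the small stand-alone sketch below. It is not Splink's implementation: `save_json` is a hypothetical helper, the settings dictionary is made up, and raising an error when the file already exists and overwrite is False is an assumption about the intended behaviour.

```python
import json
import os
import tempfile


def save_json(settings, out_path, overwrite=False):
    """Write settings to out_path, refusing to replace an existing file
    unless overwrite=True (assumed behaviour, see the note above)."""
    if os.path.exists(out_path) and not overwrite:
        raise FileExistsError(f"{out_path} already exists; pass overwrite=True")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, indent=4)


# Hypothetical settings dictionary standing in for a trained model.
settings = {"link_type": "dedupe_only", "probability_two_random_records_match": 0.002}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_settings.json")
    save_json(settings, path)

    # A second save without overwrite=True is refused.
    try:
        save_json(settings, path)
        refused = False
    except FileExistsError:
        refused = True

    # With overwrite=True the file is replaced and can be read back.
    save_json(settings, path, overwrite=True)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(refused, reloaded == settings)
```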

estimate_m_from_pairwise_labels(table_name)

estimate_probability_two_random_records_match(deterministic_matching_rules, recall)

Estimate the model parameter probability_two_random_records_match using a direct estimation approach.

See here for discussion of methodology

Parameters:

Name Type Description Default
deterministic_matching_rules list

A list of deterministic matching rules that should be designed to admit very few (none if possible) false positives.

required
recall float

A guess at the recall the deterministic matching rules will attain, i.e. what proportion of true matches will be recovered by these deterministic rules.

required
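The arithmetic behind a direct estimate of this kind can be sketched as follows. This is a simplified illustration for a single dataset being deduplicated, not Splink's code: the helper `direct_estimate` and all the numbers are hypothetical. The matches found by the deterministic rules are scaled up by the assumed recall and then divided by the total number of possible pairwise comparisons.

```python
def direct_estimate(n_rule_matches, recall, n_records):
    """Estimate the probability that two random records match.

    n_rule_matches: pairs found by the deterministic matching rules
    recall: assumed share of true matches that those rules recover
    n_records: number of records in the dataset being deduplicated
    """
    total_comparisons = n_records * (n_records - 1) // 2
    estimated_true_matches = n_rule_matches / recall
    return estimated_true_matches / total_comparisons


# Hypothetical numbers: 1,000 records, 700 pairs found by the rules,
# and a guess that the rules recover 70% of the true matches.
prob = direct_estimate(n_rule_matches=700, recall=0.7, n_records=1000)
print(prob)
```

Because the result scales inversely with recall, an over-optimistic recall guess leads to an underestimate of the match probability, which is one reason the rules should admit very few false positives and the recall should be chosen with care.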