Methods in Linker.training

Estimate the parameters of the linkage model, accessed via linker.training.

estimate_probability_two_random_records_match(deterministic_matching_rules, recall, max_rows_limit=int(1000000000.0))

Estimate the model parameter probability_two_random_records_match using a direct estimation approach.

This method counts the number of matches found using deterministic rules and divides by the total number of possible record comparisons. The recall of the deterministic rules is used to adjust this proportion up to reflect missed matches, providing an estimate of the probability that two random records from the input data are a match.

Note that if more than one deterministic rule is provided, any duplicate pairs are automatically removed, so you do not need to worry about double counting.

See here for discussion of methodology.
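The arithmetic described above can be sketched in plain Python. This is an illustration only, not Splink's implementation: the toy records, the two rule functions, and the recall value are all hypothetical, and it assumes a deduplication job on a single table, so the number of possible comparisons is n(n-1)/2.

```python
from itertools import combinations

# Hypothetical toy records: (unique_id, forename, dob, email)
records = [
    (1, "ann", "1990-01-01", "ann@example.com"),
    (2, "ann", "1990-01-01", "ann@example.com"),
    (3, "bob", "1985-05-05", "bob@example.com"),
    (4, "bob", "1985-05-05", None),
    (5, "cat", "1970-07-07", "cat@example.com"),
]


def rule_forename_dob(l, r):
    """Hypothetical deterministic rule: same forename and date of birth."""
    return l[1] == r[1] and l[2] == r[2]


def rule_email(l, r):
    """Hypothetical deterministic rule: same non-null email."""
    return l[3] is not None and l[3] == r[3]


rules = [rule_forename_dob, rule_email]

# Every distinct pair of records (dedupe assumption: n * (n - 1) / 2 pairs)
all_pairs = list(combinations(records, 2))

# Collect id pairs in a set, so a pair caught by several rules counts once
matched_pairs = {
    (l[0], r[0]) for l, r in all_pairs if any(rule(l, r) for rule in rules)
}

recall = 0.8  # assumed share of true matches that the rules recover

# Scale the observed matches up by recall, then divide by all comparisons
estimated_true_matches = len(matched_pairs) / recall
probability_two_random_records_match = estimated_true_matches / len(all_pairs)
print(probability_two_random_records_match)
```

The pair (1, 2) satisfies both rules but contributes a single match, which mirrors the note about duplicate pairs being removed.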

Parameters:

Name Type Description Default
deterministic_matching_rules list

A list of deterministic matching rules designed to admit very few (preferably no) false positives.

required
recall float

An estimate of the recall the deterministic matching rules will achieve, i.e., the proportion of all true matches these rules will recover.

required
max_rows_limit int

Maximum number of rows to consider during estimation. Defaults to 1e9.

int(1000000000.0)

Examples:

from splink import block_on

# Assumes `linker` is an already-constructed Linker
deterministic_rules = [
    block_on("forename", "dob"),
    "l.forename = r.forename and levenshtein(r.surname, l.surname) <= 2",
    block_on("email"),
]
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.8
)

estimate_u_using_random_sampling(max_pairs=1000000.0, seed=None)

Estimate the u parameters of the linkage model using random sampling.

The u parameters estimate the proportion of record comparisons that fall into each comparison level amongst truly non-matching records.

This procedure takes a sample of the data and generates the cartesian product of pairwise record comparisons amongst the sampled records. The validity of the u values rests on the assumption that the resultant pairwise comparisons are non-matches (or at least, they are very unlikely to be matches). For large datasets, this is typically true.

The results of estimate_u_using_random_sampling, and therefore an entire Splink model, can be made reproducible by setting the seed parameter. Setting the seed has performance implications, because additional processing is required.
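As a rough sketch of the idea, the snippet below samples records, compares every distinct pair within the sample, and reports the share of comparisons that fall into each level. It is not Splink's implementation: `estimate_u`, `comparison_level`, and the toy first-name data are hypothetical, and every sampled pair is treated as a non-match, which is the assumption described above.

```python
import random
from collections import Counter
from itertools import combinations


def comparison_level(l, r):
    """Hypothetical two-level comparison on a single first-name column."""
    return "exact_match" if l == r else "else"


def estimate_u(values, sample_size, seed=None):
    """Estimate u as the share of sampled pairs falling into each level."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample = rng.sample(values, sample_size)
    # All distinct pairs within the sample, assumed to be non-matches
    counts = Counter(
        comparison_level(l, r) for l, r in combinations(sample, 2)
    )
    total = sum(counts.values())
    return {level: counts[level] / total for level in ("exact_match", "else")}


# Toy data: 1,000 records whose first names are drawn from five values
first_names = ["ann", "bob", "cat", "dan", "eve"] * 200

u = estimate_u(first_names, sample_size=100, seed=42)
print(u)
```

Calling `estimate_u` twice with the same seed returns identical values, while `seed=None` draws a different sample on each run.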

Parameters:

Name Type Description Default
max_pairs int

The maximum number of pairwise record comparisons to sample. Larger values give more accurate estimates but lead to longer runtimes. In our experience, at least 1e9 (one billion) gives the best results but can take a long time to compute. 1e7 (ten million) is often adequate while testing different model specifications, before the final model is estimated.

1000000.0
seed int

Seed for random sampling. Assign a value to get reproducible u probabilities. Note that a seed for random sampling is only supported for DuckDB and Spark; for Athena and SQLite, set it to None.

None

Examples:

linker.training.estimate_u_using_random_sampling(max_pairs=1e8)

Returns:

Name Type Description
Nothing None

Updates the estimated u parameters within the linker object and returns nothing.

estimate_parameters_using_expectation_maximisation(blocking_rule, estimate_without_term_frequencies=False, fix_probability_two_random_records_match=False, fix_m_probabilities=False, fix_u_probabilities=True, populate_probability_two_random_records_match_from_trained_values=False)

Estimate the parameters of the linkage model using expectation maximisation.

By default, the m probabilities are estimated, but not the u probabilities, because good estimates for the u probabilities can be obtained from linker.training.estimate_u_using_random_sampling(). You can change this by setting fix_u_probabilities to False.

The blocking rule provided is used to generate pairwise record comparisons. Usually, this should be a blocking rule that results in a dataframe where matches are between about 1% and 99% of the blocked comparisons.

By default, m parameters are estimated for all comparisons except those which are included in the blocking rule.

For example, if the blocking rule is block_on("first_name"), then parameter estimates will be made for all comparisons except those which use first_name in their sql_condition.

By default, the probability two random records match is allowed to vary during EM estimation, but is not saved back to the model. See this PR for the rationale.
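To make the default behaviour concrete, here is a minimal expectation maximisation sketch for a Fellegi-Sunter style model. It is a simplification, not Splink's implementation: `run_em` is a hypothetical helper, every comparison has only two levels (agree or disagree), there are no term frequency adjustments, and the comparison vectors are synthetic stand-ins for blocked pairs. The u probabilities are held fixed, while the m probabilities and the match probability are updated at each iteration.

```python
import random
from math import prod


def run_em(gammas, u, m_start, lam_start, n_iter=100):
    """EM with u fixed; updates m and the match probability `lam`."""
    m, lam = list(m_start), lam_start
    n_cols = len(m)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        weights = []
        for g in gammas:
            p_match = lam * prod(
                m[k] if g[k] else 1 - m[k] for k in range(n_cols)
            )
            p_non_match = (1 - lam) * prod(
                u[k] if g[k] else 1 - u[k] for k in range(n_cols)
            )
            weights.append(p_match / (p_match + p_non_match))
        # M-step: re-estimate lam and m from the weights; u is untouched
        total_weight = sum(weights)
        lam = total_weight / len(gammas)
        m = [
            sum(w * g[k] for w, g in zip(weights, gammas)) / total_weight
            for k in range(n_cols)
        ]
    return m, lam


# Synthetic comparison vectors generated from known (hypothetical) parameters
rng = random.Random(0)
true_m, true_u, true_lam = [0.9, 0.8, 0.85], [0.1, 0.05, 0.2], 0.3
gammas = []
for _ in range(2000):
    is_match = rng.random() < true_lam
    probs = true_m if is_match else true_u
    gammas.append(tuple(int(rng.random() < p) for p in probs))

m_est, lam_est = run_em(
    gammas, u=true_u, m_start=[0.7, 0.7, 0.7], lam_start=0.5
)
print(m_est, lam_est)
```

In this sketch `lam_est` describes the synthetic pairs only. That is analogous to the match probability among blocked comparisons, which by default varies during training but is not saved back to the model.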

Parameters:

Name Type Description Default
blocking_rule BlockingRuleCreator | str

The blocking rule used to generate pairwise record comparisons.

required
estimate_without_term_frequencies bool

If True, the iterations of the EM algorithm ignore any term frequency adjustments and only depend on the comparison vectors. This allows the EM algorithm to run much faster, but the estimation of the parameters will change slightly.

False
fix_probability_two_random_records_match bool

If True, do not update the probability two random records match after each iteration. Defaults to False.

False
fix_m_probabilities bool

If True, do not update the m probabilities after each iteration. Defaults to False.

False
fix_u_probabilities bool

If True, do not update the u probabilities after each iteration. Defaults to True.

True
populate_probability_two_random_records_match_from_trained_values bool

If True, derive probability_two_random_records_match from the blocked value. Defaults to False.

False

Examples:

from splink import block_on

br_training = block_on("first_name", "dob")
linker.training.estimate_parameters_using_expectation_maximisation(
    br_training
)

Returns:

Name Type Description
EMTrainingSession EMTrainingSession

An object containing information about the training session, such as how parameters changed during the iteration history.

estimate_m_from_pairwise_labels(labels_splinkdataframe_or_table_name)

Estimate the m probabilities of the linkage model from a dataframe of pairwise labels.

The table of labels should be in the following format, and should be registered with your database:

source_dataset_l unique_id_l source_dataset_r unique_id_r
df_1 1 df_2 2
df_1 1 df_2 3

Note that source_dataset and unique_id should correspond to the values specified in the settings dict, and the input_table_aliases passed to the linker object.

Note that at the moment, this method does not respect values in a clerical_match_score column. If provided, these are ignored and it is assumed that every row in the table of labels is a score of 1, i.e. a perfect match.
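The following sketch shows how a labels table in this format could drive an m estimate for a single comparison. It illustrates the idea and is not Splink's implementation: the `records` lookup, the `first_name_level` comparison, and the data values are hypothetical. Each row is treated as a match, and any clerical_match_score is ignored, as described above.

```python
from collections import Counter

# Hypothetical input records keyed by (source_dataset, unique_id)
records = {
    ("df_1", 1): {"first_name": "john"},
    ("df_2", 2): {"first_name": "john"},
    ("df_2", 3): {"first_name": "jon"},
}

# Labels in the documented format; every row is assumed to be a match
labels = [
    {"source_dataset_l": "df_1", "unique_id_l": 1,
     "source_dataset_r": "df_2", "unique_id_r": 2},
    {"source_dataset_l": "df_1", "unique_id_l": 1,
     "source_dataset_r": "df_2", "unique_id_r": 3,
     "clerical_match_score": 0.4},  # ignored: still counted as a match
]


def first_name_level(l, r):
    """Hypothetical two-level comparison on first_name."""
    return "exact_match" if l["first_name"] == r["first_name"] else "else"


# Count how many labelled pairs fall into each comparison level
level_counts = Counter()
for row in labels:
    left = records[(row["source_dataset_l"], row["unique_id_l"])]
    right = records[(row["source_dataset_r"], row["unique_id_r"])]
    level_counts[first_name_level(left, right)] += 1

# m is the share of labelled (matching) pairs in each level
total = sum(level_counts.values())
m_first_name = {level: n / total for level, n in level_counts.items()}
print(m_first_name)
```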

Parameters:

Name Type Description Default
labels_splinkdataframe_or_table_name str

Name of the table containing labels in the database, or a SplinkDataFrame.

required

Examples:

import pandas as pd

pairwise_labels = pd.read_csv("./data/pairwise_labels_to_estimate_m.csv")

linker.table_management.register_table(
    pairwise_labels, "labels", overwrite=True
)

linker.training.estimate_m_from_pairwise_labels("labels")

estimate_m_from_label_column(label_colname)

Estimate the m parameters of the linkage model from a label (ground truth) column in the input dataframe(s).

The m parameters represent the proportion of record comparisons that fall into each comparison level amongst truly matching records.

The ground truth column is used to generate pairwise record comparisons which are then assumed to be matches.

For example, if the entities being matched are persons, and your input dataset(s) contain a social security number column, this column could be used to estimate the m values for the model.

Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.
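A small sketch of the pair-generation step: records that share a non-null label are paired up and treated as matches, and records with a missing label are skipped. This is an illustration under assumed data, not Splink's implementation; the `ssn` column name, the toy records, and `first_name_level` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical input records; the label column is only partially populated
records = [
    {"unique_id": 1, "first_name": "john", "ssn": "111"},
    {"unique_id": 2, "first_name": "john", "ssn": "111"},
    {"unique_id": 3, "first_name": "jon", "ssn": "111"},
    {"unique_id": 4, "first_name": "mary", "ssn": "222"},
    {"unique_id": 5, "first_name": "mary", "ssn": "222"},
    {"unique_id": 6, "first_name": "zoe", "ssn": None},
]

# Group records by label, skipping rows where the label is missing
groups = defaultdict(list)
for rec in records:
    if rec["ssn"] is not None:
        groups[rec["ssn"]].append(rec)

# Every pair of records sharing a label is assumed to be a true match
pairs = [pair for group in groups.values() for pair in combinations(group, 2)]


def first_name_level(l, r):
    """Hypothetical two-level comparison on first_name."""
    return "exact_match" if l["first_name"] == r["first_name"] else "else"


# m is the share of these matching pairs that fall into each level
counts = Counter(first_name_level(l, r) for l, r in pairs)
m_first_name = {level: n / len(pairs) for level, n in counts.items()}
print(m_first_name)
```

Record 6 has no label, so it contributes no pairs; the estimate relies only on the populated part of the column.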

Parameters:

Name Type Description Default
label_colname str

The name of the column containing the ground truth label in the input data.

required

Examples:

linker.training.estimate_m_from_label_column("social_security_number")

Returns:

Name Type Description
Nothing None

Updates the estimated m parameters within the linker object.