Methods in Linker.training
Estimate the parameters of the linkage model, accessed via linker.training.
estimate_probability_two_random_records_match(deterministic_matching_rules, recall, max_rows_limit=int(1000000000.0))
Estimate the model parameter probability_two_random_records_match using a direct estimation approach.
This method counts the number of matches found using deterministic rules and divides by the total number of possible record comparisons. The recall of the deterministic rules is used to adjust this proportion up to reflect missed matches, providing an estimate of the probability that two random records from the input data are a match.
Note that if more than one deterministic rule is provided, any duplicate pairs are automatically removed, so you do not need to worry about double counting.
See here for discussion of methodology.
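The arithmetic described above can be sketched in plain Python. This is only an illustration, not Splink's implementation: the helper name `direct_estimate` and its arguments are hypothetical, and it assumes a single-table deduplication job, where the number of possible comparisons is n(n-1)/2.

```python
def direct_estimate(num_deterministic_matches: int, recall: float, num_records: int) -> float:
    """Hypothetical helper: direct estimate for a single-table dedupe job."""
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    # Every unordered pair of distinct records is one possible comparison.
    total_comparisons = num_records * (num_records - 1) / 2
    # Scale the observed matches up to account for matches the rules missed.
    estimated_true_matches = num_deterministic_matches / recall
    return estimated_true_matches / total_comparisons


# 800 pairs found by the rules, assumed recall of 80%, 10,000 input records.
prob = direct_estimate(num_deterministic_matches=800, recall=0.8, num_records=10_000)
print(f"{prob:.3e}")
```

A lower recall produces a larger estimate, because more true matches are assumed to have been missed by the rules.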
Parameters:
Name | Type | Description | Default |
---|---|---|---|
deterministic_matching_rules | list | A list of deterministic matching rules designed to admit very few (preferably no) false positives. | required |
recall | float | An estimate of the recall the deterministic matching rules will achieve, i.e., the proportion of all true matches these rules will recover. | required |
max_rows_limit | int | Maximum number of rows to consider during estimation. Defaults to 1e9. | int(1000000000.0) |
Examples:
from splink import block_on

deterministic_rules = [
    block_on("forename", "dob"),
    "l.forename = r.forename and levenshtein(r.surname, l.surname) <= 2",
    block_on("email"),
]
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.8
)
estimate_u_using_random_sampling(max_pairs=1000000.0, seed=None)
Estimate the u parameters of the linkage model using random sampling.
The u parameters estimate the proportion of record comparisons that fall into each comparison level amongst truly non-matching records.
This procedure takes a sample of the data and generates the cartesian product of pairwise record comparisons amongst the sampled records. The validity of the u values rests on the assumption that the resultant pairwise comparisons are non-matches (or at least, they are very unlikely to be matches). For large datasets, this is typically true.
The results of estimate_u_using_random_sampling, and therefore an entire Splink model, can be made reproducible by setting the seed parameter. Setting the seed has a performance cost because additional processing is required.
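The sampling idea can be illustrated with a small pure-Python sketch. It is not how Splink computes u internally (Splink runs this in SQL on the chosen backend); the function `estimate_u_sketch`, the two-level comparison on `first_name`, and the toy records are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def estimate_u_sketch(records, max_pairs, seed=None):
    """Hypothetical sketch: u proportions for one two-level comparison."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    # Pick the largest sample size n whose n*(n-1)/2 pairs fit in max_pairs.
    n = 2
    while (n + 1) * n / 2 <= max_pairs and n < len(records):
        n += 1
    sample = rng.sample(records, n)
    counts = Counter()
    # Cartesian product of the sample with itself, each unordered pair once.
    # These pairs are assumed to be (almost all) non-matches.
    for left, right in combinations(sample, 2):
        level = "exact_match" if left["first_name"] == right["first_name"] else "else"
        counts[level] += 1
    total = sum(counts.values())
    return {level: count / total for level, count in counts.items()}


names = ["john", "mary", "amir", "li"]
records = [{"unique_id": i, "first_name": names[i % 4]} for i in range(200)]
u_first_name = estimate_u_sketch(records, max_pairs=1000, seed=42)
print(u_first_name)
```

Re-running with the same seed returns the same proportions, which mirrors the reproducibility that the seed parameter provides.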
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_pairs | int | The maximum number of pairwise record comparisons to sample. Larger values give more accurate estimates but lead to longer runtimes. In our experience at least 1e9 (one billion) gives best results but can take a long time to compute. 1e7 (ten million) is often adequate whilst testing different model specifications, before the final model is estimated. | 1000000.0 |
seed | int | Seed for random sampling. Assign to get reproducible u probabilities. Seeding is only supported for DuckDB and Spark; for Athena and SQLite, set to None. | None |
Examples:
linker.training.estimate_u_using_random_sampling(max_pairs=1e8)
Returns:
Name | Type | Description |
---|---|---|
Nothing | None | Updates the estimated u parameters within the linker object and returns nothing. |
estimate_parameters_using_expectation_maximisation(blocking_rule, estimate_without_term_frequencies=False, fix_probability_two_random_records_match=False, fix_m_probabilities=False, fix_u_probabilities=True, populate_probability_two_random_records_match_from_trained_values=False)
Estimate the parameters of the linkage model using expectation maximisation.
By default, the m probabilities are estimated, but not the u probabilities, because good estimates for the u probabilities can be obtained from linker.training.estimate_u_using_random_sampling(). You can change this by setting fix_u_probabilities to False.
The blocking rule provided is used to generate pairwise record comparisons. Usually, this should be a blocking rule that results in a dataframe where matches are between about 1% and 99% of the blocked comparisons.
By default, m parameters are estimated for all comparisons except those which are included in the blocking rule.
For example, if the blocking rule is block_on("first_name"), then parameter estimates will be made for all comparisons except those which use first_name in their sql_condition.
By default, the probability two random records match is allowed to vary during EM estimation, but is not saved back to the model. See this PR for the rationale.
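To make the estimation loop concrete, here is a minimal expectation-maximisation sketch for a Fellegi-Sunter style model with u held fixed, mirroring the default fix_u_probabilities=True. It is a simplified illustration, not Splink's implementation: the function `em_estimate_m`, the toy comparison vectors, and the starting values are all assumptions.

```python
from math import prod


def em_estimate_m(vectors, u, m, lam, n_iterations=10):
    """Hypothetical sketch. vectors: tuples of level indices, one per comparison.

    u and m are lists (one per comparison) of per-level probabilities;
    lam is the probability that a blocked pair is a match.
    """
    for _ in range(n_iterations):
        # E-step: posterior match probability for every comparison vector.
        weights = []
        for gamma in vectors:
            p_match = lam * prod(m[c][lvl] for c, lvl in enumerate(gamma))
            p_non_match = (1 - lam) * prod(u[c][lvl] for c, lvl in enumerate(gamma))
            weights.append(p_match / (p_match + p_non_match))
        # M-step: re-estimate lam and m from the weights; u is left untouched.
        total = sum(weights)
        lam = total / len(vectors)
        m = [
            [
                sum(w for w, g in zip(weights, vectors) if g[c] == lvl) / total
                for lvl in range(len(m[c]))
            ]
            for c in range(len(m))
        ]
    return m, lam


# Two comparisons, each with levels 0 ("else") and 1 ("exact match").
vectors = [(1, 1)] * 50 + [(0, 0)] * 400 + [(1, 0)] * 25 + [(0, 1)] * 25
u_fixed = [[0.9, 0.1], [0.9, 0.1]]
m_start = [[0.1, 0.9], [0.1, 0.9]]
m_est, lam_est = em_estimate_m(vectors, u_fixed, m_start, lam=0.1)
print(m_est, lam_est)
```

In this sketch lam varies between iterations, in the same spirit as the default behaviour described above, where the probability that two random records match is allowed to vary during EM but is not saved back to the model.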
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blocking_rule | BlockingRuleCreator \| str | The blocking rule used to generate pairwise record comparisons. | required |
estimate_without_term_frequencies | bool | If True, the iterations of the EM algorithm ignore any term frequency adjustments and only depend on the comparison vectors. This allows the EM algorithm to run much faster, but the estimation of the parameters will change slightly. | False |
fix_probability_two_random_records_match | bool | If True, do not update the probability two random records match after each iteration. Defaults to False. | False |
fix_m_probabilities | bool | If True, do not update the m probabilities after each iteration. Defaults to False. | False |
fix_u_probabilities | bool | If True, do not update the u probabilities after each iteration. Defaults to True. | True |
populate_probability_two_random_records_match_from_trained_values | bool | If True, derive this parameter from the blocked value. Defaults to False. | False |
Examples:
from splink import block_on

br_training = block_on("first_name", "dob")
linker.training.estimate_parameters_using_expectation_maximisation(
    br_training
)
Returns:
Name | Type | Description |
---|---|---|
EMTrainingSession | EMTrainingSession | An object containing information about the training session, such as how parameters changed during the iteration history. |
estimate_m_from_pairwise_labels(labels_splinkdataframe_or_table_name)
Estimate the m probabilities of the linkage model from a dataframe of pairwise labels.
The table of labels should be in the following format, and should be registered with your database:
source_dataset_l | unique_id_l | source_dataset_r | unique_id_r |
---|---|---|---|
df_1 | 1 | df_2 | 2 |
df_1 | 1 | df_2 | 3 |
Note that source_dataset and unique_id should correspond to the values specified in the settings dict, and the input_table_aliases passed to the linker object.
Note that at the moment, this method does not respect values in a clerical_match_score column. If provided, these are ignored and it is assumed that every row in the table of labels is a score of 1, i.e. a perfect match.
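Conceptually, each labelled pair is looked up in the input data, compared, and counted towards the comparison level it falls into. The sketch below shows that counting step in plain Python; the function `m_from_pairwise_labels_sketch`, the toy records, and the two-level comparison are hypothetical and only illustrate the idea, not Splink's SQL implementation.

```python
from collections import Counter

# Toy input records keyed by (source_dataset, unique_id); values are assumptions.
records = {
    ("df_1", 1): {"first_name": "john", "surname": "smith"},
    ("df_2", 2): {"first_name": "john", "surname": "smyth"},
    ("df_2", 3): {"first_name": "jon", "surname": "smith"},
}

# Labels in the column layout shown in the table above.
labels = [
    {"source_dataset_l": "df_1", "unique_id_l": 1, "source_dataset_r": "df_2", "unique_id_r": 2},
    {"source_dataset_l": "df_1", "unique_id_l": 1, "source_dataset_r": "df_2", "unique_id_r": 3},
]


def m_from_pairwise_labels_sketch(records, labels, column):
    """Hypothetical sketch: m proportions for one two-level comparison."""
    counts = Counter()
    for row in labels:
        left = records[(row["source_dataset_l"], row["unique_id_l"])]
        right = records[(row["source_dataset_r"], row["unique_id_r"])]
        # Every labelled row is treated as a true match (score of 1).
        level = "exact_match" if left[column] == right[column] else "else"
        counts[level] += 1
    total = sum(counts.values())
    return {level: count / total for level, count in counts.items()}


m_first_name = m_from_pairwise_labels_sketch(records, labels, "first_name")
print(m_first_name)
```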
Parameters:
Name | Type | Description | Default |
---|---|---|---|
labels_splinkdataframe_or_table_name | str | Name of the table containing labels in the database, or a SplinkDataFrame. | required |
Examples:
import pandas as pd

pairwise_labels = pd.read_csv("./data/pairwise_labels_to_estimate_m.csv")
linker.table_management.register_table(
    pairwise_labels, "labels", overwrite=True
)
linker.training.estimate_m_from_pairwise_labels("labels")
estimate_m_from_label_column(label_colname)
Estimate the m parameters of the linkage model from a label (ground truth) column in the input dataframe(s).
The m parameters represent the proportion of record comparisons that fall into each comparison level amongst truly matching records.
The ground truth column is used to generate pairwise record comparisons which are then assumed to be matches.
For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.
Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.
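The idea can be sketched as follows: records sharing a label value are paired up, those pairs are treated as matches, and the share of pairs in each comparison level gives the m values. This is an illustration only. The function `m_from_label_column_sketch` and the toy data are hypothetical, and skipping records with a missing label is an assumption about how partially populated columns would be handled.

```python
from collections import Counter, defaultdict
from itertools import combinations


def m_from_label_column_sketch(records, label_colname, column):
    """Hypothetical sketch: m proportions for one two-level comparison."""
    # Group records by their ground truth label, ignoring missing labels.
    groups = defaultdict(list)
    for record in records:
        label = record.get(label_colname)
        if label is not None:
            groups[label].append(record)
    counts = Counter()
    for group in groups.values():
        # Every pair of records sharing a label is assumed to be a match.
        for left, right in combinations(group, 2):
            level = "exact_match" if left[column] == right[column] else "else"
            counts[level] += 1
    total = sum(counts.values())
    return {level: count / total for level, count in counts.items()}


people = [
    {"social_security_number": "A", "first_name": "john"},
    {"social_security_number": "A", "first_name": "john"},
    {"social_security_number": "A", "first_name": "jon"},
    {"social_security_number": "B", "first_name": "mary"},
    {"social_security_number": "B", "first_name": "mary"},
    {"social_security_number": None, "first_name": "mary"},  # unlabelled, skipped
]
m_first_name = m_from_label_column_sketch(people, "social_security_number", "first_name")
print(m_first_name)
```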
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label_colname | str | The name of the column containing the ground truth label in the input data. | required |
Examples:
linker.training.estimate_m_from_label_column("social_security_number")
Returns:
Name | Type | Description |
---|---|---|
Nothing | None | Updates the estimated m parameters within the linker object. |