Documentation for Linker object methods related to parameter estimation
The Linker object manages the data linkage process and holds the data linkage model.
Most of Splink's functionality can be accessed by calling methods (functions) on the linker, such as linker.predict(), linker.profile_columns() etc.
The Linker class is intended for subclassing for specific backends, e.g. a DuckDBLinker.
estimate_m_from_label_column(label_colname)
Estimate the m parameters of the linkage model from a label (ground truth) column in the input dataframe(s).
The m parameters represent the proportion of record comparisons that fall into each comparison level amongst truly matching records.
The ground truth column is used to generate pairwise record comparisons which are then assumed to be matches.
For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.
Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.
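The idea described above can be illustrated without Splink at all. The following is a minimal pure-Python sketch, not Splink's implementation: the toy records, the `first_name_level` comparison and the `estimate_m` helper are all hypothetical names introduced for illustration. It pairs up records that share a label value, assumes every such pair is a match, and reports the proportion of pairs falling into each comparison level.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records; "ssn" plays the role of the label column and may be missing.
records = [
    {"id": 1, "first_name": "ann", "ssn": "A"},
    {"id": 2, "first_name": "ann", "ssn": "A"},
    {"id": 3, "first_name": "anne", "ssn": "A"},
    {"id": 4, "first_name": "bob", "ssn": "B"},
    {"id": 5, "first_name": "bob", "ssn": "B"},
    {"id": 6, "first_name": "carl", "ssn": None},  # unlabelled: contributes no pairs
]


def first_name_level(left, right):
    """Hypothetical two-level comparison: 1 = exact match, 0 = anything else."""
    return 1 if left["first_name"] == right["first_name"] else 0


def estimate_m(records, label_col, level_fn):
    # Group records by label value, skipping rows where the label is missing.
    groups = defaultdict(list)
    for rec in records:
        if rec[label_col] is not None:
            groups[rec[label_col]].append(rec)
    # Every pair inside a group is assumed to be a true match.
    counts = Counter()
    for members in groups.values():
        for left, right in combinations(members, 2):
            counts[level_fn(left, right)] += 1
    total = sum(counts.values())
    # m for a level = share of assumed-match pairs that fall into that level.
    return {level: n / total for level, n in counts.items()}


m_first_name = estimate_m(records, "ssn", first_name_level)
print(m_first_name)
```

Record 6 has no label, so it generates no pairs, which is why a partially populated column is still usable.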
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label_colname | str | The name of the column containing the ground truth label in the input data. | required |
Examples:
linker.estimate_m_from_label_column("social_security_number")
Returns:
Type | Description |
---|---|
None | Updates the estimated m parameters within the linker object and returns nothing. |
estimate_parameters_using_expectation_maximisation(blocking_rule, comparisons_to_deactivate=None, comparison_levels_to_reverse_blocking_rule=None, fix_probability_two_random_records_match=False, fix_m_probabilities=False, fix_u_probabilities=True, populate_probability_two_random_records_match_from_trained_values=False)
Estimate the parameters of the linkage model using expectation maximisation.
By default, the m probabilities are estimated, but not the u probabilities, because good estimates for the u probabilities can be obtained from linker.estimate_u_using_random_sampling(). You can change this by setting fix_u_probabilities to False.
The blocking rule provided is used to generate pairwise record comparisons. Usually, this should be a blocking rule that results in a dataframe where matches are between about 1% and 99% of the comparisons.
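To make the procedure concrete, here is a toy expectation maximisation loop in plain Python. It is a sketch under stated assumptions, not Splink's code: the comparison vectors in `pairs`, the starting values and the `em_step` helper are hypothetical, comparisons are treated as conditionally independent, and the u probabilities are held fixed throughout (the default behaviour described above).

```python
# Hypothetical comparison vectors for blocked record pairs, one level per comparison.
# Two comparisons, each with levels 0 (differ) and 1 (agree).
pairs = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 50

# u probabilities are held fixed (assumed to come from random sampling).
u = [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}]
# Starting guesses for m and for the probability that a blocked pair is a match.
m = [{0: 0.2, 1: 0.8}, {0: 0.2, 1: 0.8}]
lam = 0.5


def em_step(pairs, m, u, lam):
    # E-step: posterior match probability of each pair given current parameters.
    weights = []
    for gamma in pairs:
        p_match, p_non = lam, 1.0 - lam
        for k, level in enumerate(gamma):
            p_match *= m[k][level]
            p_non *= u[k][level]
        weights.append(p_match / (p_match + p_non))
    # M-step: lambda and m become match-weighted proportions; u is left untouched.
    total_w = sum(weights)
    new_lam = total_w / len(pairs)
    new_m = [
        {
            level: sum(w for w, g in zip(weights, pairs) if g[k] == level) / total_w
            for level in m[k]
        }
        for k in range(len(m))
    ]
    return new_m, new_lam


for _ in range(25):
    m, lam = em_step(pairs, m, u, lam)

print(lam, m)
```

If the blocked pairs were almost all matches or almost all non-matches, one of the two mixture components would have very little data to learn from, which is the intuition behind the 1% to 99% guidance.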
By default, m parameters are estimated for all comparisons except those which are included in the blocking rule.
For example, if the blocking rule is l.first_name = r.first_name, then parameter estimates will be made for all comparisons except those which use first_name in their sql_condition.
By default, the probability two random records match is estimated for the blocked data, and then the m and u parameters for the columns specified in the blocking rules are used to estimate the global probability two random records match.
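One way to read this 'reversing out' step is as Bayes' theorem applied backwards. The sketch below is an assumption about the reasoning, not a copy of Splink's implementation: the helper names and the example m and u values are hypothetical. It converts the blocked-data probability to odds, divides by the Bayes factor m/u of each comparison level implied by the blocking rule, and converts back to a probability.

```python
import math


def prob_to_odds(p):
    return p / (1.0 - p)


def odds_to_prob(odds):
    return odds / (1.0 + odds)


def reverse_blocking_rule(p_blocked, blocked_levels):
    """Hypothetical helper: blocked_levels is a list of (m, u) pairs, one for each
    comparison level that the blocking rule forces every pair into."""
    odds = prob_to_odds(p_blocked)
    for m_prob, u_prob in blocked_levels:
        # Agreement on a blocked column multiplied the prior odds by m/u,
        # so dividing by m/u recovers the odds before blocking.
        odds /= m_prob / u_prob
    return odds_to_prob(odds)


# Assumed values for exact-match levels on first_name and dob.
p_global = reverse_blocking_rule(0.5, [(0.9, 0.01), (0.95, 0.001)])
print(p_global)
```

Because pairs that agree on the blocking columns are far more likely to be matches than arbitrary pairs, the global probability comes out much smaller than the blocked-data probability.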
To control which comparisons should have their parameters estimated, and the process of 'reversing out' the global probability two random records match, the user may specify comparisons_to_deactivate and comparison_levels_to_reverse_blocking_rule. This is useful, for example, if you block on the dmetaphone of a column but match on the original column.
Examples:
Default behaviour
br_training = "l.first_name = r.first_name and l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(br_training)
Specify which comparisons to deactivate
br_training = "l.dmeta_first_name = r.dmeta_first_name"
settings_obj = linker._settings_obj
comp = settings_obj._get_comparison_by_output_column_name("first_name")
dmeta_level = comp._get_comparison_level_by_comparison_vector_value(1)
linker.estimate_parameters_using_expectation_maximisation(
    br_training,
    comparisons_to_deactivate=["first_name"],
    comparison_levels_to_reverse_blocking_rule=[dmeta_level],
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blocking_rule | str | The blocking rule used to generate pairwise record comparisons. | required |
comparisons_to_deactivate | list | By default, Splink will analyse the blocking rule provided and estimate the m parameters for all comparisons except those included in the blocking rule. If comparisons_to_deactivate are provided, Splink will instead estimate m parameters for all comparisons except those specified in the comparisons_to_deactivate list. This list can contain either the output_column_name of the Comparison as a string, or Comparison objects. Defaults to None. | None |
comparison_levels_to_reverse_blocking_rule | list | By default, Splink will analyse the blocking rule provided and adjust the global probability two random records match to account for the matches specified in the blocking rule. If provided, this argument overrides that default behaviour. The user must provide a list of ComparisonLevel objects. Defaults to None. | None |
fix_probability_two_random_records_match | bool | If True, do not update the probability two random records match after each iteration. Defaults to False. | False |
fix_m_probabilities | bool | If True, do not update the m probabilities after each iteration. Defaults to False. | False |
fix_u_probabilities | bool | If True, do not update the u probabilities after each iteration. Defaults to True. | True |
Examples:
blocking_rule = "l.first_name = r.first_name and l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)
Returns:
Name | Type | Description |
---|---|---|
EMTrainingSession | EMTrainingSession | An object containing information about the training session, such as how parameters changed during the iteration history. |
estimate_u_using_random_sampling(max_pairs=None, seed=None, *, target_rows=None)
Estimate the u parameters of the linkage model using random sampling.
The u parameters represent the proportion of record comparisons that fall into each comparison level amongst truly non-matching records.
This procedure takes a sample of the data and generates the cartesian product of pairwise record comparisons amongst the sampled records. The validity of the u values rests on the assumption that the resultant pairwise comparisons are non-matches (or at least, they are very unlikely to be matches). For large datasets, this is typically true.
The results of estimate_u_using_random_sampling, and therefore an entire splink model, can be made reproducible by setting the seed parameter. Setting the seed will have performance implications as additional processing is required.
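The sampling idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Splink's implementation: the toy `records`, the `first_name_level` comparison and the `estimate_u` helper are hypothetical, and the sketch enumerates each unordered pair of sampled records once rather than a full cartesian product.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical toy data: 100 records spread evenly over five first names.
names = ["ann", "bob", "carl", "dee", "eve"]
records = [{"id": i, "first_name": names[i % 5]} for i in range(100)]


def first_name_level(left, right):
    """Hypothetical two-level comparison: 1 = exact match, 0 = anything else."""
    return 1 if left["first_name"] == right["first_name"] else 0


def estimate_u(records, level_fn, sample_size, seed=None):
    # A fixed seed makes the sample, and therefore the estimate, reproducible.
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    # All pairs among the sampled records are treated as non-matches.
    counts = Counter(level_fn(left, right) for left, right in combinations(sample, 2))
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


u_first_name = estimate_u(records, first_name_level, sample_size=40, seed=1)
print(u_first_name)
```

In this toy data many sampled pairs share a first name purely by coincidence, which is exactly what the u parameter for the exact-match level is meant to capture.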
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_pairs | int | The maximum number of pairwise record comparisons to sample. | None |
seed | int | Seed for random sampling. Assign to get reproducible u probabilities. | None |
Examples:
linker.estimate_u_using_random_sampling(1e8)
Returns:
Name | Type | Description |
---|---|---|
None | None | Updates the estimated u parameters within the linker object and returns nothing. |
save_model_to_json(out_path=None, overwrite=False)
Save the configuration and parameters of the linkage model to a .json file.
The model can later be loaded back in using linker.load_model().
The settings dict is also returned in case you want to save it in a different way.
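The out_path and overwrite semantics can be mimicked with the standard library alone. The sketch below is a hypothetical stand-in, not Splink's code: `save_settings_to_json` and the toy `settings` dict are invented for illustration, and raising a ValueError when the file already exists is an assumption about how overwrite=False behaves.

```python
import json
import os
import tempfile

# Hypothetical toy settings dict standing in for a trained model.
settings = {"link_type": "dedupe_only", "probability_two_random_records_match": 0.001}


def save_settings_to_json(settings, out_path=None, overwrite=False):
    """Hypothetical helper mirroring the documented out_path/overwrite arguments."""
    if out_path is not None:
        # Assumption: refuse to clobber an existing file unless overwrite=True.
        if os.path.isfile(out_path) and not overwrite:
            raise ValueError(f"{out_path} already exists; pass overwrite=True to replace it.")
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(settings, f, indent=4)
    # The dict is always returned, so the caller can persist it another way.
    return settings


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_settings.json")
    returned = save_settings_to_json(settings, path)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)
    try:
        save_settings_to_json(settings, path)  # second write without overwrite=True
        refused = False
    except ValueError:
        refused = True
```

Passing out_path=None skips the file entirely and simply hands back the dictionary.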
Examples:
linker.save_model_to_json("my_settings.json", overwrite=True)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
out_path | str | File path for json file. If None, don't save to file. Defaults to None. | None |
overwrite | bool | Overwrite if already exists? Defaults to False. | False |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | The settings as a dictionary. |
estimate_m_from_pairwise_labels(labels_splinkdataframe_or_table_name)
Estimate the m parameters of the linkage model from a dataframe of pairwise labels.
The table of labels should be in the following format, and should be registered with your database:
|source_dataset_l|unique_id_l|source_dataset_r|unique_id_r|
|----------------|-----------|----------------|-----------|
|df_1            |1          |df_2            |2          |
|df_1            |1          |df_2            |3          |
Note that source_dataset and unique_id should correspond to the values specified in the settings dict, and the input_table_aliases passed to the linker object. Note that at the moment, this method does not respect values in a clerical_match_score column. If provided, these are ignored and it is assumed that every row in the table of labels is a score of 1, i.e. a perfect match.
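If your labels start life as groups of records judged to be the same entity, they need expanding into one row per pair before they fit the format above. The following sketch shows one way to do that with the standard library; the `clusters` structure and the `pairwise_label_rows` helper are hypothetical and not part of Splink.

```python
import csv
import io
from itertools import combinations

# Hypothetical clerical clusters: each inner list holds (source_dataset, unique_id)
# keys that a reviewer judged to refer to the same entity.
clusters = [
    [("df_1", 1), ("df_2", 2), ("df_2", 3)],
]

COLUMNS = ["source_dataset_l", "unique_id_l", "source_dataset_r", "unique_id_r"]


def pairwise_label_rows(clusters):
    rows = []
    for cluster in clusters:
        # Every unordered pair within a cluster becomes one labelled match.
        for (ds_l, id_l), (ds_r, id_r) in combinations(cluster, 2):
            rows.append(dict(zip(COLUMNS, (ds_l, id_l, ds_r, id_r))))
    return rows


rows = pairwise_label_rows(clusters)

# Write the rows as CSV text; a real workflow would write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

No score column is written, in line with the note that every row is treated as a perfect match.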
Parameters:
Name | Type | Description | Default |
---|---|---|---|
labels_splinkdataframe_or_table_name | str | Name of table containing labels in the database or SplinkDataframe | required |
Examples:
import pandas as pd
pairwise_labels = pd.read_csv("./data/pairwise_labels_to_estimate_m.csv")
linker.register_table(pairwise_labels, "labels", overwrite=True)
linker.estimate_m_from_pairwise_labels("labels")
estimate_probability_two_random_records_match(deterministic_matching_rules, recall)
Estimate the model parameter probability_two_random_records_match using a direct estimation approach.
See here for discussion of methodology
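The arithmetic behind a direct estimate of this kind can be sketched as follows. This is an assumption about the general approach rather than Splink's code: the `direct_estimate` helper and the example numbers are hypothetical, and the total number of comparisons is computed for a single deduplication job.

```python
import math


def direct_estimate(n_rule_matches, recall, n_records):
    """Hypothetical helper.

    n_rule_matches: number of record pairs found by the deterministic rules
    recall: guessed share of all true matches that those rules recover
    n_records: number of input records (deduplication of one dataset assumed)
    """
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be greater than 0 and at most 1")
    # Scale up the matches found to account for the ones the rules missed.
    estimated_true_matches = n_rule_matches / recall
    # Number of distinct unordered record pairs in a single dataset.
    total_comparisons = n_records * (n_records - 1) // 2
    return estimated_true_matches / total_comparisons


# Assumed inputs: 700 rule-based matches, 70% recall, 1,000 records.
p = direct_estimate(n_rule_matches=700, recall=0.7, n_records=1000)
print(p)
```

False positives in the deterministic rules would inflate n_rule_matches directly, which is why the rules should admit as few as possible.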
Parameters:
Name | Type | Description | Default |
---|---|---|---|
deterministic_matching_rules | list | A list of deterministic matching rules that should be designed to admit very few (none if possible) false positives | required |
recall | float | A guess at the recall the deterministic matching rules will attain, i.e. what proportion of true matches will be recovered by these deterministic rules | required |