Methods in Linker.inference
Use your Splink model to make predictions (perform inference). Accessed via `linker.inference`.
deterministic_link()
Uses the blocking rules specified by `blocking_rules_to_generate_predictions` in your settings to generate pairwise record comparisons.
For deterministic linkage, this should be a list of blocking rules which are strict enough to generate only true links.
Deterministic linkage, however, is likely to result in missed links (false negatives).
Returns:
Name | Type | Description |
---|---|---|
SplinkDataFrame | SplinkDataFrame | A SplinkDataFrame of the pairwise comparisons. |
```py
from splink import Linker, SettingsCreator, block_on

# `df` is your input data; `db_api` is a Splink database API object
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname"),
        block_on("dob", "first_name"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
splink_df = linker.inference.deterministic_link()
```
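Conceptually, each blocking rule acts like a self-join of the input table on the listed columns, and a pair is kept if at least one rule matches. The following is a minimal pure-Python sketch of that idea, not Splink's implementation (Splink generates SQL for the chosen backend). The `records` data and the `pairs_from_blocking_rules` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input; records 1, 2 and 3 are assumed to be the same person
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith", "dob": "1971-05-23"},
    {"unique_id": 3, "first_name": "John", "surname": "Smyth", "dob": "1971-05-24"},
    {"unique_id": 4, "first_name": "Jane", "surname": "Jones", "dob": "1980-01-01"},
]

# Each rule is a tuple of columns that must all be equal for a pair to be linked
blocking_rules = [("first_name", "surname"), ("dob", "first_name")]


def pairs_from_blocking_rules(records, blocking_rules):
    """Return the sorted, de-duplicated pairs found by any of the rules."""
    pairs = set()
    for rule in blocking_rules:
        blocks = defaultdict(list)
        for rec in records:
            key = tuple(rec[col] for col in rule)
            if None in key:
                continue  # a missing value never satisfies an equality rule
            blocks[key].append(rec["unique_id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


linked_pairs = pairs_from_blocking_rules(records, blocking_rules)
print(linked_pairs)  # [(1, 2), (1, 3)]
```

In this toy data the pair (2, 3) is never generated because no single rule covers it, which is the kind of missed link (false negative) that strict deterministic rules tend to produce.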
predict(threshold_match_probability=None, threshold_match_weight=None, materialise_after_computing_term_frequencies=True, materialise_blocked_pairs=True)
Create a dataframe of scored pairwise comparisons using the parameters of the linkage model.
Uses the blocking rules specified in the `blocking_rules_to_generate_predictions` key of the settings to generate the pairwise comparisons.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold_match_probability | float | If specified, filter the results to include only pairwise comparisons with a match_probability above this threshold. Defaults to None. | None |
threshold_match_weight | float | If specified, filter the results to include only pairwise comparisons with a match_weight above this threshold. Defaults to None. | None |
materialise_after_computing_term_frequencies | bool | If True, Splink will materialise the table containing the input nodes (rows) joined to any term frequencies requested in the settings object. If False, this will be computed as part of a large CTE pipeline. Defaults to True. | True |
materialise_blocked_pairs | bool | If True, materialise the table of record pairs produced by the blocking phase before they are scored. Defaults to True. | True |
Examples:
```py
from splink import Linker

linker = Linker(df, "saved_settings.json", db_api=db_api)
splink_df = linker.inference.predict(threshold_match_probability=0.95)
splink_df.as_pandas_dataframe(limit=5)
```
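The two thresholds filter the same score on different scales. As the Splink documentation describes it, the match weight is the base-2 logarithm of the odds of a match, so a weight of 0 corresponds to a probability of 0.5. The sketch below illustrates that relationship in plain Python; the helper functions and the `scored_pairs` values are hypothetical and are not part of the Splink API.

```python
import math


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 of the odds) into a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability: float) -> float:
    """Inverse conversion; only valid for probabilities strictly between 0 and 1."""
    return math.log2(probability / (1.0 - probability))


# Hypothetical scored pairs, shaped loosely like rows of a predictions table
scored_pairs = [
    {"unique_id_l": 1, "unique_id_r": 2, "match_weight": 8.0},
    {"unique_id_l": 1, "unique_id_r": 3, "match_weight": 4.0},
    {"unique_id_l": 2, "unique_id_r": 4, "match_weight": -3.0},
]
for pair in scored_pairs:
    pair["match_probability"] = match_weight_to_probability(pair["match_weight"])

# Filtering on probability, mirroring the "above this threshold" wording
threshold_match_probability = 0.95
kept = [
    p for p in scored_pairs
    if p["match_probability"] > threshold_match_probability
]

# The same filter expressed on the match weight scale
equivalent_weight = probability_to_match_weight(threshold_match_probability)
kept_by_weight = [p for p in scored_pairs if p["match_weight"] > equivalent_weight]
```

Under these assumptions a probability threshold of 0.95 is equivalent to a match weight threshold of about 4.25, so only the first pair survives either filter.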
find_matches_to_new_records(records_or_tablename, blocking_rules=[], match_weight_threshold=-4)
Given one or more records, find matching records in the input dataset(s) and return them in order of the Splink prediction score.
This effectively provides a way of searching the input datasets for the given record(s).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
records_or_tablename | List[dict] | Input search record(s) as a list of dicts, or the name of a table registered to the database. | required |
blocking_rules | list | Blocking rules to select which records to find and score. If [], no blocking rule is used, meaning the input records are compared to all records provided to the linker when it was instantiated. Defaults to []. | [] |
match_weight_threshold | int | Return matches with a match weight above this threshold. Defaults to -4. | -4 |
Examples:
```py
from splink import Linker

linker = Linker(df, "saved_settings.json", db_api=db_api)

# You should load or pre-compute tf tables for any columns with
# term frequency adjustments
linker.table_management.compute_tf_table("first_name")
# OR
linker.table_management.register_term_frequency_lookup(df, "first_name")

record = {
    'unique_id': 1,
    'first_name': "John",
    'surname': "Smith",
    'dob': "1971-05-24",
    'city': "London",
    'email': "john@smith.net",
}
df = linker.inference.find_matches_to_new_records(
    [record], blocking_rules=[]
)
```
Returns:
Name | Type | Description |
---|---|---|
SplinkDataFrame | SplinkDataFrame | The pairwise comparisons. |
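The search behaviour described above (score the new record against the input records, drop weak candidates, return the rest best-first) can be sketched without Splink. This is a toy illustration only: `TOY_WEIGHTS`, `toy_match_weight` and `find_matches` are hypothetical names, and the per-column weights are invented rather than taken from a trained model.

```python
# Hypothetical per-column weights: (weight if values agree, weight if they differ)
TOY_WEIGHTS = {
    "first_name": (4.0, -3.0),
    "surname": (5.0, -4.0),
    "dob": (6.0, -5.0),
}


def toy_match_weight(record_a: dict, record_b: dict) -> float:
    """Sum the toy weights over all compared columns."""
    return sum(
        agree if record_a[col] == record_b[col] else disagree
        for col, (agree, disagree) in TOY_WEIGHTS.items()
    )


def find_matches(new_record: dict, input_records: list, match_weight_threshold=-4):
    """Score new_record against every input record, filter, then sort best-first."""
    scored = [
        {"unique_id": rec["unique_id"], "match_weight": toy_match_weight(new_record, rec)}
        for rec in input_records
    ]
    kept = [s for s in scored if s["match_weight"] > match_weight_threshold]
    return sorted(kept, key=lambda s: s["match_weight"], reverse=True)


input_records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 2, "first_name": "Jon", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 3, "first_name": "John", "surname": "Jones", "dob": "1985-02-01"},
    {"unique_id": 4, "first_name": "Jane", "surname": "Doe", "dob": "1990-07-15"},
]
new_record = {"first_name": "John", "surname": "Smith", "dob": "1971-05-24"}

matches = find_matches(new_record, input_records)
```

With these invented weights, records 1 and 2 are returned in that order, while records 3 and 4 score at or below -4 and are dropped by the default threshold.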
compare_two_records(record_1, record_2, include_found_by_blocking_rules=False)
Use the linkage model to compare and score a pairwise record comparison based on the two input records provided.
If your inputs contain multiple rows, scores for the cartesian product of the two inputs will be returned.
If your inputs contain hardcoded term frequency columns (e.g. a tf_first_name column), then these values will be used instead of any provided term frequency lookup tables or term frequency values derived from the input data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record_1 | dict | Dictionary representing the first record. Column names and data types must be the same as the columns in the settings object. | required |
record_2 | dict | Dictionary representing the second record. Column names and data types must be the same as the columns in the settings object. | required |
include_found_by_blocking_rules | bool | If True, outputs a column indicating whether the record pair would have been found by any of the blocking rules specified in settings.blocking_rules_to_generate_predictions. Defaults to False. | False |
Examples:
```py
from splink import Linker

linker = Linker(df, "saved_settings.json", db_api=db_api)

# You should load or pre-compute tf tables for any columns with
# term frequency adjustments
linker.table_management.compute_tf_table("first_name")
# OR
linker.table_management.register_term_frequency_lookup(df, "first_name")

record_1 = {
    'unique_id': 1,
    'first_name': "John",
    'surname': "Smith",
    'dob': "1971-05-24",
    'city': "London",
    'email': "john@smith.net",
}
record_2 = {
    'unique_id': 1,
    'first_name': "Jon",
    'surname': "Smith",
    'dob': "1971-05-23",
    'city': "London",
    'email': "john@smith.net",
}
df = linker.inference.compare_two_records(record_1, record_2)
```
Returns:
Name | Type | Description |
---|---|---|
SplinkDataFrame | SplinkDataFrame | Pairwise comparison with scored prediction. |
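To make the note about multi-row inputs concrete, the sketch below enumerates the cartesian product of two small inputs in plain Python. It only shows which pairs would be scored, not how Splink scores them; `rows_1` and `rows_2` are hypothetical data.

```python
from itertools import product

# Hypothetical multi-row inputs: two rows on the left, three on the right
rows_1 = [
    {"unique_id": 1, "first_name": "John"},
    {"unique_id": 2, "first_name": "Jon"},
]
rows_2 = [
    {"unique_id": 10, "first_name": "John"},
    {"unique_id": 11, "first_name": "Jane"},
    {"unique_id": 12, "first_name": "Jon"},
]

# Every left row is paired with every right row: 2 x 3 = 6 comparisons
pairs = [
    (left["unique_id"], right["unique_id"])
    for left, right in product(rows_1, rows_2)
]
```

The number of comparisons grows as the product of the two input sizes, so this style of comparison is best suited to small inputs.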