Methods in Linker.inference

Use your Splink model to make predictions (perform inference). Accessed via linker.inference.

`deterministic_link()`

Uses the blocking rules specified by `blocking_rules_to_generate_predictions` in your settings to generate pairwise record comparisons.

For deterministic linkage, this should be a list of blocking rules which are strict enough to generate only true links.

Deterministic linkage, however, is likely to result in missed links (false negatives).

Returns:

| Name | Type | Description |
|---|---|---|
| `SplinkDataFrame` | `SplinkDataFrame` | A SplinkDataFrame of the pairwise comparisons. |

```py
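from splink import Linker, SettingsCreator, block_on

# `df` (input data) and `db_api` (backend API) are assumed to be defined already
settings = SettingsCreator(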
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname"),
        block_on("dob", "first_name"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
splink_df = linker.inference.deterministic_link()
```
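
To make the trade-off above concrete, here is a minimal plain-Python sketch of what "link any pair that satisfies at least one exact-match rule" means. It is an illustration only, not Splink's implementation (Splink generates SQL and runs it in your backend); the toy records and the `deterministic_pairs` helper are hypothetical.

```python
from itertools import combinations

# Toy records; the column names mirror the blocking rules in the example above.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 3, "first_name": "John", "surname": "Smyth", "dob": "1971-05-24"},
    {"unique_id": 4, "first_name": "Jon", "surname": "Smyth", "dob": "1971-05-23"},
]

# Each rule lists the columns that must agree exactly for a pair to be linked.
rules = [("first_name", "surname"), ("dob", "first_name")]


def deterministic_pairs(records, rules):
    """Return the set of (id_left, id_right) pairs matched by at least one rule."""
    pairs = set()
    for left, right in combinations(records, 2):
        for cols in rules:
            # Missing values never count as agreement
            if all(left[c] is not None and left[c] == right[c] for c in cols):
                pairs.add((left["unique_id"], right["unique_id"]))
                break  # one matching rule is enough; avoids duplicate pairs
    return pairs


links = deterministic_pairs(records, rules)
print(sorted(links))  # [(1, 2), (1, 3), (2, 3)]
```

Record 4 differs slightly on every column, so no rule links it to the others. If it refers to the same person, it is a missed link (false negative).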

`predict(threshold_match_probability=None, threshold_match_weight=None, materialise_after_computing_term_frequencies=True, materialise_blocked_pairs=True)`

Create a dataframe of scored pairwise comparisons using the parameters of the linkage model.

Uses the blocking rules specified in the `blocking_rules_to_generate_predictions` key of the settings to generate the pairwise comparisons.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `threshold_match_probability` | `float` | If specified, filter the results to include only pairwise comparisons with a `match_probability` above this threshold. | `None` |
| `threshold_match_weight` | `float` | If specified, filter the results to include only pairwise comparisons with a `match_weight` above this threshold. | `None` |
| `materialise_after_computing_term_frequencies` | `bool` | If True, Splink will materialise the table containing the input nodes (rows) joined to any term frequencies requested in the settings object. If False, this will be computed as part of a large CTE pipeline. | `True` |
| `materialise_blocked_pairs` | `bool` | If True, materialise the table of record pairs to be scored during the blocking phase. | `True` |

Examples:

```py
from splink import Linker

linker = Linker(df, "saved_settings.json", db_api=db_api)
splink_df = linker.inference.predict(threshold_match_probability=0.95)
splink_df.as_pandas_dataframe(limit=5)
```
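
The two thresholds express the same score on different scales. The sketch below assumes the usual Splink relationship, in which the match weight is the base-2 logarithm of the odds implied by the match probability. The helper names are hypothetical and are not part of the Splink API.

```python
import math


def match_weight_from_probability(p: float) -> float:
    """Convert a match probability in (0, 1) to a match weight (log2 odds)."""
    return math.log2(p / (1.0 - p))


def match_probability_from_weight(w: float) -> float:
    """Convert a match weight (log2 odds) back to a match probability."""
    bayes_factor = 2.0**w
    return bayes_factor / (1.0 + bayes_factor)


# A probability threshold of 0.95 corresponds to a weight of about 4.25
print(round(match_weight_from_probability(0.95), 3))  # 4.248

# A weight threshold of -4 corresponds to a probability of about 0.06
print(round(match_probability_from_weight(-4), 4))  # 0.0588
```

A weight of 0 corresponds to a probability of 0.5. The first helper is undefined at exactly 0 or 1, so clamp inputs if you need to handle those values.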

`find_matches_to_new_records(records_or_tablename, blocking_rules=[], match_weight_threshold=-4)`

Given one or more records, find records in the input dataset(s) which match, and return them in order of the Splink prediction score.

This effectively provides a way of searching the input datasets for the given record(s).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `records_or_tablename` | `List[dict]` | Input search record(s) as a list of dicts, or the name of a table registered to the database. | required |
| `blocking_rules` | `list` | Blocking rules to select which records to find and score. If `[]`, no blocking rule is used, meaning the input records are compared to all records provided to the linker when it was instantiated. | `[]` |
| `match_weight_threshold` | `int` | Return matches with a match weight above this threshold. | `-4` |

Examples:

```py
from splink import Linker

linker = Linker(df, "saved_settings.json", db_api=db_api)

# You should load or pre-compute tf tables for any columns with
# term frequency adjustments
linker.table_management.compute_tf_table("first_name")
# OR
linker.table_management.register_term_frequency_lookup(df, "first_name")

record = {
    "unique_id": 1,
    "first_name": "John",
    "surname": "Smith",
    "dob": "1971-05-24",
    "city": "London",
    "email": "john@smith.net",
}
splink_df = linker.inference.find_matches_to_new_records(
    [record], blocking_rules=[]
)
```

Returns:

| Name | Type | Description |
|---|---|---|
| `SplinkDataFrame` | `SplinkDataFrame` | The pairwise comparisons. |

`compare_two_records(record_1, record_2, include_found_by_blocking_rules=False)`

Use the linkage model to compare and score a pairwise record comparison based on the two input records provided.

If your inputs contain multiple rows, scores for the Cartesian product of the two inputs will be returned.

If your inputs contain hardcoded term frequency columns (e.g. a `tf_first_name` column), then these values will be used instead of any provided term frequency lookup tables or term frequency values derived from the input data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `record_1` | `dict` | Dictionary representing the first record. Column names and data types must be the same as the columns in the settings object. | required |
| `record_2` | `dict` | Dictionary representing the second record. Column names and data types must be the same as the columns in the settings object. | required |
| `include_found_by_blocking_rules` | `bool` | If True, outputs a column indicating whether the record pair would have been found by any of the blocking rules specified in `settings.blocking_rules_to_generate_predictions`. | `False` |

Examples:

```py
from splink import Linker

linker = Linker(df, "saved_settings.json", db_api=db_api)

# You should load or pre-compute tf tables for any columns with
# term frequency adjustments
linker.table_management.compute_tf_table("first_name")
# OR
linker.table_management.register_term_frequency_lookup(df, "first_name")

record_1 = {
    "unique_id": 1,
    "first_name": "John",
    "surname": "Smith",
    "dob": "1971-05-24",
    "city": "London",
    "email": "john@smith.net",
}

record_2 = {
    "unique_id": 1,
    "first_name": "Jon",
    "surname": "Smith",
    "dob": "1971-05-23",
    "city": "London",
    "email": "john@smith.net",
}
splink_df = linker.inference.compare_two_records(record_1, record_2)
```

Returns:

| Name | Type | Description |
|---|---|---|
| `SplinkDataFrame` | `SplinkDataFrame` | Pairwise comparison with scored prediction. |
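
As a rough illustration of the Cartesian product behaviour for multi-row inputs, the plain-Python sketch below enumerates the pairs that would each receive a score. It does not call Splink, and the toy rows are hypothetical.

```python
from itertools import product

# Two multi-row inputs (toy data for illustration only)
left_rows = [
    {"unique_id": 1, "first_name": "John"},
    {"unique_id": 2, "first_name": "Jane"},
]
right_rows = [
    {"unique_id": 10, "first_name": "Jon"},
    {"unique_id": 11, "first_name": "John"},
    {"unique_id": 12, "first_name": "Janet"},
]

# Every left row is paired with every right row
pairs = [
    (left["unique_id"], right["unique_id"])
    for left, right in product(left_rows, right_rows)
]
print(len(pairs))  # 6
```

With 2 rows on the left and 3 on the right, 2 × 3 = 6 comparisons are produced, so the output grows quickly as either input gets larger.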
