Methods in Linker.inference

Use your Splink model to make predictions (perform inference). Accessed via linker.inference.

deterministic_link()

Uses the blocking rules specified by blocking_rules_to_generate_predictions in your settings to generate pairwise record comparisons.

For deterministic linkage, this should be a list of blocking rules which are strict enough to generate only true links.

Deterministic linkage, however, is likely to result in missed links (false negatives).

Returns:

| Name | Type | Description |
| --- | --- | --- |
| SplinkDataFrame | SplinkDataFrame | A SplinkDataFrame of the pairwise comparisons. |

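```py
from splink import Linker, SettingsCreator, block_on

# `df` (the input data) and `db_api` (the database backend) are assumed
# to have been created already.
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname"),
        block_on("dob", "first_name"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
splink_df = linker.inference.deterministic_link()
```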
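To make the behaviour concrete, the following plain-Python sketch mimics what a deterministic link does conceptually: a pair of records is linked if it satisfies at least one blocking rule. It does not use Splink; the records, the `rules` tuples, and the `deterministic_pairs` helper are illustrative assumptions, not part of the library.

```python
from itertools import combinations

# Toy records; column names mirror the example above, the values are made up.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith", "dob": "1971-05-23"},
    {"unique_id": 3, "first_name": "John", "surname": "Smyth", "dob": "1971-05-24"},
    {"unique_id": 4, "first_name": "Jane", "surname": "Smith", "dob": "1980-01-01"},
]

# Each rule lists the columns that must all be equal, in the spirit of block_on(...).
rules = [("first_name", "surname"), ("dob", "first_name")]


def deterministic_pairs(rows, blocking_rules):
    """Return sorted (id_l, id_r) pairs that satisfy at least one rule."""
    linked = set()
    for left, right in combinations(rows, 2):
        for cols in blocking_rules:
            if all(left[c] is not None and left[c] == right[c] for c in cols):
                linked.add((left["unique_id"], right["unique_id"]))
                break  # report a pair once, even if several rules match
    return sorted(linked)


pairs = deterministic_pairs(records, rules)
print(pairs)  # [(1, 2), (1, 3)]
```

Records 2 and 3 plausibly describe the same person, but they differ on both surname and dob, so neither rule links them. This is the kind of missed link (false negative) that strict deterministic rules tend to produce.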

predict(threshold_match_probability=None, threshold_match_weight=None, materialise_after_computing_term_frequencies=True, materialise_blocked_pairs=True)

Create a dataframe of scored pairwise comparisons using the parameters of the linkage model.

Uses the blocking rules specified in the blocking_rules_to_generate_predictions key of the settings to generate the pairwise comparisons.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold_match_probability | float | If specified, filter the results to include only pairwise comparisons with a match_probability above this threshold. | None |
| threshold_match_weight | float | If specified, filter the results to include only pairwise comparisons with a match_weight above this threshold. | None |
| materialise_after_computing_term_frequencies | bool | If True, Splink will materialise the table containing the input nodes (rows) joined to any term frequencies requested in the settings object. If False, this will be computed as part of a large CTE pipeline. | True |
| materialise_blocked_pairs | bool | In the blocking phase, materialise the table of pairs of records that will be scored. | True |

Examples:

```py
linker = Linker(df, "saved_settings.json", db_api=db_api)
splink_df = linker.inference.predict(threshold_match_probability=0.95)
splink_df.as_pandas_dataframe(limit=5)
```
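The two thresholds express the same score on different scales. As a minimal sketch, assuming the usual Splink convention that a match weight is the base-2 logarithm of the match odds, the helpers below convert between them. The function names are illustrative and are not part of the Splink API.

```python
import math


def probability_to_match_weight(p):
    """Convert a match probability (0 < p < 1) to a match weight."""
    return math.log2(p / (1 - p))


def match_weight_to_probability(w):
    """Convert a match weight back to a match probability."""
    odds = 2 ** w
    return odds / (1 + odds)


print(round(probability_to_match_weight(0.95), 2))  # 4.25
print(match_weight_to_probability(0))  # 0.5
```

Under this assumption, threshold_match_probability=0.95 corresponds to a threshold_match_weight of about 4.25, and a match weight of 0 corresponds to even odds.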

find_matches_to_new_records(records_or_tablename, blocking_rules=[], match_weight_threshold=-4)

Given one or more records, find matching records in the input dataset(s) and return them ordered by Splink prediction score.

This effectively provides a way of searching the input datasets for the given record(s).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| records_or_tablename | List[dict] | Input search record(s) as a list of dicts, or the name of a table registered to the database. | required |
| blocking_rules | list | Blocking rules to select which records to find and score. If [], no blocking rule is used, meaning the input records will be compared to all records provided to the linker when it was instantiated. | [] |
| match_weight_threshold | int | Return matches with a match weight above this threshold. | -4 |

Examples:

```py
linker = Linker(df, "saved_settings.json", db_api=db_api)

# You should load or pre-compute tf tables for any columns with
# term frequency adjustments
linker.table_management.compute_tf_table("first_name")
# OR
linker.table_management.register_term_frequency_lookup(df, "first_name")

record = {
    "unique_id": 1,
    "first_name": "John",
    "surname": "Smith",
    "dob": "1971-05-24",
    "city": "London",
    "email": "john@smith.net",
}
# Compare the new record against every input record (no blocking rules)
matches = linker.inference.find_matches_to_new_records(
    [record], blocking_rules=[]
)
```

Returns:

| Name | Type | Description |
| --- | --- | --- |
| SplinkDataFrame | SplinkDataFrame | The pairwise comparisons. |
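The search behaviour described for find_matches_to_new_records (score against every input record when blocking_rules=[], keep results above the threshold, order by score) can be sketched in plain Python. The per-column weights and the `toy_match_weight` and `find_matches` helpers are hypothetical stand-ins for a trained model, not Splink code.

```python
# Hypothetical per-column weights; a real Splink model estimates these from data.
AGREE_WEIGHT = {"first_name": 4.0, "surname": 5.0, "dob": 7.0}
DISAGREE_WEIGHT = {"first_name": -2.0, "surname": -3.0, "dob": -4.0}


def toy_match_weight(a, b):
    """Sum an agreement or disagreement weight for each compared column."""
    return sum(
        AGREE_WEIGHT[c] if a[c] == b[c] else DISAGREE_WEIGHT[c]
        for c in AGREE_WEIGHT
    )


def find_matches(new_record, existing, match_weight_threshold=-4):
    """Score new_record against every existing record, best matches first."""
    scored = [(toy_match_weight(new_record, r), r["unique_id"]) for r in existing]
    kept = [s for s in scored if s[0] > match_weight_threshold]
    return sorted(kept, reverse=True)


existing = [
    {"unique_id": 10, "first_name": "John", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 11, "first_name": "Jon", "surname": "Smith", "dob": "1971-05-24"},
    {"unique_id": 12, "first_name": "Jane", "surname": "Doe", "dob": "1980-01-01"},
    {"unique_id": 13, "first_name": "John", "surname": "Jones", "dob": "1971-05-24"},
]
new_record = {"first_name": "John", "surname": "Smith", "dob": "1971-05-24"}

results = find_matches(new_record, existing)
print(results)  # [(16.0, 10), (10.0, 11), (8.0, 13)]
```

Record 12 scores -9.0 in this toy model, which is below the default threshold of -4, so it is dropped from the results.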

compare_two_records(record_1, record_2, include_found_by_blocking_rules=False)

Use the linkage model to compare and score a pairwise record comparison based on the two input records provided.

If your inputs contain multiple rows, scores for the cartesian product of the two inputs will be returned.
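As a small illustration of the cartesian product, the sketch below pairs every row of one input with every row of the other using itertools.product. The row contents are made up, and no scoring is performed.

```python
from itertools import product

# Two made-up inputs: 2 rows on the left, 3 rows on the right.
left_rows = [
    {"unique_id": 1, "first_name": "John"},
    {"unique_id": 2, "first_name": "Jon"},
]
right_rows = [
    {"unique_id": 7, "first_name": "John"},
    {"unique_id": 8, "first_name": "Jonathan"},
    {"unique_id": 9, "first_name": "Joan"},
]

# Every left row is paired with every right row: 2 x 3 = 6 comparisons.
id_pairs = [(l["unique_id"], r["unique_id"]) for l, r in product(left_rows, right_rows)]
print(len(id_pairs))  # 6
```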

If your inputs contain hardcoded term frequency columns (e.g. a tf_first_name column), then these values will be used instead of any provided term frequency lookup tables or term frequency values derived from the input data.
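The precedence rule for term frequencies can be sketched as follows. This is only an illustration of the ordering described above; `resolve_term_frequency` and the lookup values are hypothetical and do not reflect how Splink implements it internally.

```python
def resolve_term_frequency(record, column, lookup):
    """Pick the term frequency to use for record[column].

    A hardcoded tf_<column> value on the record wins; otherwise fall back
    to the lookup table. Returns None if neither source has a value.
    """
    hardcoded = record.get(f"tf_{column}")
    if hardcoded is not None:
        return hardcoded
    return lookup.get(record.get(column))


# Made-up lookup table of first-name frequencies.
first_name_tf = {"John": 0.05, "Jon": 0.002}

with_override = {"first_name": "John", "tf_first_name": 0.01}
without_override = {"first_name": "Jon"}

print(resolve_term_frequency(with_override, "first_name", first_name_tf))  # 0.01
print(resolve_term_frequency(without_override, "first_name", first_name_tf))  # 0.002
```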

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| record_1 | dict | Dictionary representing the first record. Column names and data types must be the same as the columns in the settings object. | required |
| record_2 | dict | Dictionary representing the second record. Column names and data types must be the same as the columns in the settings object. | required |
| include_found_by_blocking_rules | bool | If True, outputs a column indicating whether the record pair would have been found by any of the blocking rules specified in settings.blocking_rules_to_generate_predictions. | False |

Examples:

```py
linker = Linker(df, "saved_settings.json", db_api=db_api)

# You should load or pre-compute tf tables for any columns with
# term frequency adjustments
linker.table_management.compute_tf_table("first_name")
# OR
linker.table_management.register_term_frequency_lookup(df, "first_name")

record_1 = {
    "unique_id": 1,
    "first_name": "John",
    "surname": "Smith",
    "dob": "1971-05-24",
    "city": "London",
    "email": "john@smith.net",
}

record_2 = {
    "unique_id": 2,
    "first_name": "Jon",
    "surname": "Smith",
    "dob": "1971-05-23",
    "city": "London",
    "email": "john@smith.net",
}
comparison = linker.inference.compare_two_records(record_1, record_2)
```

Returns:

| Name | Type | Description |
| --- | --- | --- |
| SplinkDataFrame | SplinkDataFrame | Pairwise comparison with scored prediction. |