Documentation for splink.clustering

cluster_pairwise_predictions_at_threshold(nodes, edges, db_api, node_id_column_name, edge_id_column_name_left=None, edge_id_column_name_right=None, threshold_match_probability=None)

Clusters the pairwise match predictions into groups of connected records using the connected components graph clustering algorithm.

Pairs of records with an estimated match probability at or above threshold_match_probability are considered to be a match (i.e. they represent the same entity).

If no match probability column is provided, all edges (comparisons) are assumed to be matches.
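To make the thresholding and grouping behaviour concrete, here is a minimal pure-Python sketch of connected components using union-find. It is an illustration only, not Splink's implementation: the function name connected_components, the tuple layout of the edges, the choice to treat threshold=None as "keep every edge", and the choice to label each cluster by its smallest node ID are all assumptions made for this example.

```python
def connected_components(node_ids, edges, threshold=None):
    """Map each node ID to a cluster label.

    Illustrative sketch only. Assumptions: edges are (left, right, prob)
    tuples, prob=None means "no match probability supplied", and each
    cluster is labelled by the smallest node ID it contains.
    """
    parent = {n: n for n in node_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in edges:
        # Drop edges strictly below the threshold; edges without a
        # probability are treated as matches.
        if prob is not None and threshold is not None and prob < threshold:
            continue
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Attach the larger root to the smaller so labels are deterministic.
            parent[max(root_l, root_r)] = min(root_l, root_r)

    return {n: find(n) for n in node_ids}


nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2, 0.8), (3, 2, 0.9), (4, 5, 0.99)]

clusters_low = connected_components(nodes, edges, threshold=0.5)
clusters_high = connected_components(nodes, edges, threshold=0.85)
```

With a threshold of 0.5 every edge is kept, so nodes 1, 2 and 3 share a cluster; raising the threshold to 0.85 drops the 0.8 edge and node 1 is left on its own. A node that appears in no edge, such as 6, stays in a cluster by itself in this sketch.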

If your node and edge column names follow Splink naming conventions, then you can omit edge_id_column_name_left and edge_id_column_name_right. For example, if you have a table of nodes with a column unique_id, it would be assumed that the edge table has columns unique_id_l and unique_id_r.
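The naming convention can be sketched as a small helper. The function resolve_edge_columns below is hypothetical and written only for illustration; it is not part of Splink. It mirrors the documented defaults: an omitted edge column name falls back to the node ID column name with an _l or _r suffix.

```python
from typing import Optional, Tuple


def resolve_edge_columns(
    node_id_column_name: str,
    edge_id_column_name_left: Optional[str] = None,
    edge_id_column_name_right: Optional[str] = None,
) -> Tuple[str, str]:
    """Return the (left, right) edge column names (hypothetical helper).

    An omitted name defaults to the node ID column name plus "_l" or "_r".
    """
    left = edge_id_column_name_left or f"{node_id_column_name}_l"
    right = edge_id_column_name_right or f"{node_id_column_name}_r"
    return left, right


# Splink-style names: nothing extra needs to be passed.
default_cols = resolve_edge_columns("unique_id")

# Custom edge column names must be given explicitly.
custom_cols = resolve_edge_columns("my_id", "n_1", "n_2")
```

Here default_cols is ("unique_id_l", "unique_id_r"), while custom_cols is ("n_1", "n_2").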

Parameters:

Name | Type | Description | Default
nodes | AcceptableInputTableType | The table containing node information | required
edges | AcceptableInputTableType | The table containing edge information | required
db_api | DatabaseAPISubClass | The database API to use for querying | required
node_id_column_name | str | The name of the column containing node IDs | required
edge_id_column_name_left | Optional[str] | The name of the column containing left edge IDs. If not provided, assumed to be f"{node_id_column_name}_l" | None
edge_id_column_name_right | Optional[str] | The name of the column containing right edge IDs. If not provided, assumed to be f"{node_id_column_name}_r" | None
threshold_match_probability | Optional[float] | Pairwise comparisons with a match_probability at or above this threshold are matched | None

Returns:

Name | Type | Description
SplinkDataFrame | SplinkDataFrame | A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold.

Examples:

from splink import DuckDBAPI
from splink.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

nodes = [
    {"my_id": 1},
    {"my_id": 2},
    {"my_id": 3},
    {"my_id": 4},
    {"my_id": 5},
    {"my_id": 6},
]

edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

# All three edges have a match_probability at or above 0.5, so all are treated as matches
cc = cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="my_id",
    edge_id_column_name_left="n_1",
    edge_id_column_name_right="n_2",
    db_api=db_api,
    threshold_match_probability=0.5,
)

# Inspect the clustered IDs as a DuckDB relation
cc.as_duckdbpyrelation()