Documentation for `splink.clustering`

`cluster_pairwise_predictions_at_threshold(nodes, edges, db_api, node_id_column_name, edge_id_column_name_left=None, edge_id_column_name_right=None, threshold_match_probability=None, threshold_match_weight=None)`

Clusters the pairwise match predictions into groups of connected records using the connected components graph clustering algorithm.

Pairs of records with an estimated match probability at or above `threshold_match_probability` (or a match weight at or above `threshold_match_weight`) are considered to be a match (i.e. they represent the same entity).

If neither a match probability threshold nor a match weight threshold is provided, all edges (comparisons) are assumed to be matches.
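
The thresholding and clustering steps described above can be sketched in plain Python. This is only an illustration of the idea, not Splink's implementation: Splink runs the clustering through the supplied `db_api`. The function name `connected_components_at_threshold`, the `left`/`right` edge keys, and the choice of the smallest node ID as the cluster label are all assumptions made for this sketch.

```python
from typing import Optional


def connected_components_at_threshold(
    node_ids: list,
    edges: list,
    threshold_match_probability: Optional[float] = None,
) -> dict:
    """Map each node ID to a cluster label (hypothetical helper).

    The label is the smallest node ID in the node's connected component.
    """
    # Every node starts as its own cluster (union-find forest).
    parent = {n: n for n in node_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one so that the
            # root of a component is always its smallest node ID.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    for edge in edges:
        # With no threshold, every edge is treated as a match.
        if (
            threshold_match_probability is None
            or edge["match_probability"] >= threshold_match_probability
        ):
            union(edge["left"], edge["right"])

    return {n: find(n) for n in node_ids}


example_nodes = [1, 2, 3, 4, 5, 6]
example_edges = [
    {"left": 1, "right": 2, "match_probability": 0.8},
    {"left": 3, "right": 2, "match_probability": 0.9},
    {"left": 4, "right": 5, "match_probability": 0.99},
]

clusters = connected_components_at_threshold(example_nodes, example_edges, 0.5)
```

At a threshold of 0.5 all three edges qualify, so nodes 1, 2 and 3 share one cluster, nodes 4 and 5 share another, and node 6 remains on its own. Raising the threshold to 0.85 drops the 1-2 edge, which leaves node 1 as a singleton.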

If your node and edge column names follow Splink naming conventions, you can omit `edge_id_column_name_left` and `edge_id_column_name_right`. For example, if your table of nodes has a column `unique_id`, the edge table is assumed to have columns `unique_id_l` and `unique_id_r`.
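
The naming convention above amounts to a simple fallback rule. The helper below is a hypothetical illustration of that rule (`resolve_edge_id_columns` is not part of Splink's API); it only mirrors the documented defaults of `f"{node_id_column_name}_l"` and `f"{node_id_column_name}_r"`.

```python
from typing import Optional, Tuple


def resolve_edge_id_columns(
    node_id_column_name: str,
    edge_id_column_name_left: Optional[str] = None,
    edge_id_column_name_right: Optional[str] = None,
) -> Tuple[str, str]:
    """Return the (left, right) edge ID column names (hypothetical helper)."""
    # Fall back to the "<node id>_l" / "<node id>_r" convention
    # whenever an explicit edge column name is not supplied.
    left = (
        edge_id_column_name_left
        if edge_id_column_name_left is not None
        else f"{node_id_column_name}_l"
    )
    right = (
        edge_id_column_name_right
        if edge_id_column_name_right is not None
        else f"{node_id_column_name}_r"
    )
    return left, right


default_columns = resolve_edge_id_columns("unique_id")
custom_columns = resolve_edge_id_columns("my_id", "n_1", "n_2")
```

Here `default_columns` follows the convention, while `custom_columns` corresponds to an edge table whose ID columns are named explicitly.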

Parameters:

Name	Type	Description	Default
`nodes`	`AcceptableInputTableType`	The table containing node information	required
`edges`	`AcceptableInputTableType`	The table containing edge information	required
`db_api`	`DatabaseAPISubClass`	The database API to use for querying	required
`node_id_column_name`	`str`	The name of the column containing node IDs	required
`edge_id_column_name_left`	`Optional[str]`	The name of the column containing left edge IDs. If not provided, assumed to be f"{node_id_column_name}_l"	`None`
`edge_id_column_name_right`	`Optional[str]`	The name of the column containing right edge IDs. If not provided, assumed to be f"{node_id_column_name}_r"	`None`
`threshold_match_probability`	`Optional[float]`	Pairwise comparisons with a match_probability at or above this threshold are matched	`None`
`threshold_match_weight`	`Optional[float]`	Pairwise comparisons with a match_weight at or above this threshold are matched	`None`
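
The two threshold parameters express a cut-off on different scales. As a rough guide, and assuming the usual Splink convention that a match weight is the base-2 logarithm of the odds of a match, the conversion can be sketched as follows. The function names are hypothetical and are not taken from this page.

```python
import math


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 odds, assumed) to a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(match_probability: float) -> float:
    """Convert a probability strictly between 0 and 1 to a match weight."""
    odds = match_probability / (1.0 - match_probability)
    return math.log2(odds)


# A match weight of 0 means even odds, i.e. a probability of 0.5.
even_odds_probability = match_weight_to_probability(0.0)

# A probability of 0.8 is odds of 4 to 1, i.e. a match weight of about 2.
weight_for_0_8 = probability_to_match_weight(0.8)
```

Under this assumption, `threshold_match_weight=2` and `threshold_match_probability=0.8` describe approximately the same cut-off.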

Returns:

Name	Type	Description
`SplinkDataFrame`	`SplinkDataFrame`	A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold.

Examples:

from splink import DuckDBAPI
from splink.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

nodes = [
    {"my_id": 1},
    {"my_id": 2},
    {"my_id": 3},
    {"my_id": 4},
    {"my_id": 5},
    {"my_id": 6},
]

edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

cc = cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="my_id",
    edge_id_column_name_left="n_1",
    edge_id_column_name_right="n_2",
    db_api=db_api,
    threshold_match_probability=0.5,
)

cc.as_duckdbpyrelation()
