Documentation for splink.clustering
cluster_pairwise_predictions_at_threshold(nodes, edges, db_api, node_id_column_name, edge_id_column_name_left=None, edge_id_column_name_right=None, threshold_match_probability=None, threshold_match_weight=None)
Clusters the pairwise match predictions into groups of connected records using the connected components graph clustering algorithm.
Records with an estimated match probability at or above threshold_match_probability (or a match weight at or above threshold_match_weight) are considered to be a match (i.e. they represent the same entity).
If neither threshold_match_probability nor threshold_match_weight is provided, every edge (pairwise comparison) is assumed to be a match.
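
To make the thresholding and clustering behaviour concrete, here is a minimal pure-Python sketch of connected components using union-find. It is an illustration only, not Splink's implementation (Splink runs the clustering through the database backend passed as db_api). The function name `connected_components` and the tuple layout of the edges are assumptions made for this sketch.

```python
from collections import defaultdict


def connected_components(node_ids, edges, threshold=None):
    """Group node IDs into clusters using union-find (illustrative sketch).

    edges is an iterable of (left_id, right_id, match_probability) tuples.
    If threshold is None, every edge is treated as a match.
    """
    parent = {n: n for n in node_ids}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in edges:
        # Only edges at or above the threshold link two records.
        if threshold is None or prob >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Keep the smaller ID as the cluster representative.
                parent[max(root_l, root_r)] = min(root_l, root_r)

    clusters = defaultdict(list)
    for n in node_ids:
        clusters[find(n)].append(n)
    return dict(clusters)


node_ids = [1, 2, 3, 4, 5, 6]
edges = [(1, 2, 0.8), (3, 2, 0.9), (4, 5, 0.99)]

clusters = connected_components(node_ids, edges, threshold=0.5)
high_threshold_clusters = connected_components(node_ids, edges, threshold=0.85)
```

With a threshold of 0.5 all three edges count as matches, so records 1, 2 and 3 share a cluster, 4 and 5 share another, and 6 stays on its own. Raising the threshold to 0.85 drops the 0.8 edge, which splits record 1 away from 2 and 3.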
If your node and edge column names follow Splink naming conventions, then you can omit edge_id_column_name_left and edge_id_column_name_right. For example, if you have a table of nodes with a column unique_id, it would be assumed that the edge table has columns unique_id_l and unique_id_r.
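
The defaulting rule described above can be sketched as a small helper. The name `resolve_edge_columns` is hypothetical and is not part of Splink; it only mirrors the documented fallback to `f"{node_id_column_name}_l"` and `f"{node_id_column_name}_r"`.

```python
def resolve_edge_columns(
    node_id_column_name,
    edge_id_column_name_left=None,
    edge_id_column_name_right=None,
):
    """Return the (left, right) edge ID column names (hypothetical helper).

    Falls back to the Splink naming convention when a name is not given.
    """
    left = (
        edge_id_column_name_left
        if edge_id_column_name_left is not None
        else f"{node_id_column_name}_l"
    )
    right = (
        edge_id_column_name_right
        if edge_id_column_name_right is not None
        else f"{node_id_column_name}_r"
    )
    return left, right


# Convention-based names are derived from the node ID column.
default_columns = resolve_edge_columns("unique_id")

# Explicit names take precedence over the convention.
custom_columns = resolve_edge_columns("my_id", "n_1", "n_2")
```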
Parameters:
Name | Type | Description | Default |
---|---|---|---|
nodes | AcceptableInputTableType | The table containing node information | required |
edges | AcceptableInputTableType | The table containing edge information | required |
db_api | DatabaseAPISubClass | The database API to use for querying | required |
node_id_column_name | str | The name of the column containing node IDs | required |
edge_id_column_name_left | Optional[str] | The name of the column containing left edge IDs. If not provided, assumed to be f"{node_id_column_name}_l" | None |
edge_id_column_name_right | Optional[str] | The name of the column containing right edge IDs. If not provided, assumed to be f"{node_id_column_name}_r" | None |
threshold_match_probability | Optional[float] | Pairwise comparisons with a match_probability at or above this threshold are matched | None |
threshold_match_weight | Optional[float] | Pairwise comparisons with a match_weight at or above this threshold are matched | None |
Returns:
Name | Type | Description |
---|---|---|
SplinkDataFrame | SplinkDataFrame | A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold. |
Examples:
from splink import DuckDBAPI
from splink.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

nodes = [
    {"my_id": 1},
    {"my_id": 2},
    {"my_id": 3},
    {"my_id": 4},
    {"my_id": 5},
    {"my_id": 6},
]

edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

cc = cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="my_id",
    edge_id_column_name_left="n_1",
    edge_id_column_name_right="n_2",
    db_api=db_api,
    threshold_match_probability=0.5,
)

cc.as_duckdbpyrelation()
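
The example above thresholds on match probability. If you would rather pass threshold_match_weight, the sketch below converts between the two scales. It assumes the usual Splink convention that a match weight is the base-2 logarithm of the match odds, so that probability = 2^w / (1 + 2^w); check this against your Splink version before relying on it. The helper names are hypothetical and not part of Splink.

```python
import math


def probability_to_match_weight(probability):
    """Convert a match probability (strictly between 0 and 1) to a match weight.

    Assumes match weight = log2(probability / (1 - probability)).
    """
    return math.log2(probability / (1 - probability))


def match_weight_to_probability(match_weight):
    """Convert a match weight back to a match probability (inverse of the above)."""
    odds = 2**match_weight
    return odds / (1 + odds)


# Under this assumption a probability of 0.5 corresponds to a weight of 0.
weight_at_half = probability_to_match_weight(0.5)

# A weight of 2 corresponds to odds of 4 to 1, i.e. a probability of 0.8.
probability_at_two = match_weight_to_probability(2.0)
```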