Methods in Linker.clustering
Cluster the results of the linkage model and analyse clusters, accessed via linker.clustering.
cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None)
Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using the connected components graph clustering algorithm.
Records with an estimated match_probability at or above threshold_match_probability are considered to be a match (i.e. they represent the same entity).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_predict | SplinkDataFrame | The results of linker.inference.predict() | required |
threshold_match_probability | float | Pairwise comparisons with a match_probability at or above this threshold are matched | None |
Returns:
Name | Type | Description |
---|---|---|
SplinkDataFrame | SplinkDataFrame | A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold. |
Examples:
df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
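As a rough illustration of what connected components clustering at a threshold does, the following standalone sketch groups records with a small union-find structure. It does not use Splink and is not Splink's implementation: the function name cluster_at_threshold, the tuple layout of the pairs and the choice of the smallest record ID as the cluster label are all assumptions made for this example.

```python
from collections import defaultdict


def cluster_at_threshold(node_ids, scored_pairs, threshold):
    """Group node_ids into connected components, keeping only the pairs
    whose match probability is at or above the threshold.

    Hypothetical helper for illustration; not part of Splink.
    """
    parent = {node: node for node in node_ids}

    def find(node):
        # Walk up to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right, probability in scored_pairs:
        if probability >= threshold:
            root_left, root_right = find(left), find(right)
            if root_left != root_right:
                # Assumption: label each cluster by its smallest record ID.
                parent[max(root_left, root_right)] = min(root_left, root_right)

    clusters = defaultdict(list)
    for node in node_ids:
        clusters[find(node)].append(node)
    return dict(clusters)


record_ids = [1, 2, 3, 4, 5]
# (left ID, right ID, match probability); made-up values for the example
pairs = [(1, 2, 0.99), (2, 3, 0.96), (3, 4, 0.60), (4, 5, 0.97)]

clusters = cluster_at_threshold(record_ids, pairs, threshold=0.95)
print(clusters)  # {1: [1, 2, 3], 4: [4, 5]}
```

Records 1 and 3 end up in the same cluster even though they were never compared directly, because both link to record 2. The 0.60 edge between records 3 and 4 falls below the threshold, so the two groups stay separate; lowering the threshold to 0.5 would merge all five records into one cluster.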
compute_graph_metrics(df_predict, df_clustered, *, threshold_match_probability=None)
Generates tables containing graph metrics (for nodes, edges and clusters) and returns them as a data class of SplinkDataFrames.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_predict | SplinkDataFrame | The results of linker.inference.predict() | required |
df_clustered | SplinkDataFrame | The outputs of linker.clustering.cluster_pairwise_predictions_at_threshold() | required |
threshold_match_probability | float | Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on df_clustered. | None |
Returns:
Name | Type | Description |
---|---|---|
GraphMetricsResults | GraphMetricsResults | A data class containing SplinkDataFrames of cluster IDs and selected node, edge or cluster metrics. Attribute "nodes" holds the node metrics table, "edges" the edge metrics table, and "clusters" the cluster metrics table. |
Examples:
df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
graph_metrics = linker.clustering.compute_graph_metrics(
    df_predict, df_clustered, threshold_match_probability=0.95
)
node_metrics = graph_metrics.nodes.as_pandas_dataframe()
edge_metrics = graph_metrics.edges.as_pandas_dataframe()
cluster_metrics = graph_metrics.clusters.as_pandas_dataframe()
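To give a feel for the kind of node-level and cluster-level quantities a graph metrics table can hold, here is a small standalone sketch that computes node degree and, per cluster, the node count, edge count and density. It does not call Splink, and the function name simple_graph_metrics, the dictionary keys and the choice of metrics are assumptions for illustration; they are not a statement of which columns Splink returns.

```python
from collections import Counter


def simple_graph_metrics(edges, node_to_cluster):
    """Compute node degree and basic per-cluster metrics.

    Hypothetical helper for illustration; not part of Splink.
    Assumes every edge has already been filtered at the clustering
    threshold, so both ends of an edge share a cluster.
    """
    degree = Counter()
    edges_per_cluster = Counter()
    for left, right in edges:
        degree[left] += 1
        degree[right] += 1
        edges_per_cluster[node_to_cluster[left]] += 1

    nodes_per_cluster = Counter(node_to_cluster.values())
    cluster_metrics = {}
    for cluster_id, n_nodes in nodes_per_cluster.items():
        n_edges = edges_per_cluster[cluster_id]
        # Density = edges present / edges possible; undefined for one node.
        possible_edges = n_nodes * (n_nodes - 1) / 2
        density = n_edges / possible_edges if possible_edges else None
        cluster_metrics[cluster_id] = {
            "n_nodes": n_nodes,
            "n_edges": n_edges,
            "density": density,
        }

    # Nodes with no retained edges get a degree of 0.
    node_metrics = {node: degree[node] for node in node_to_cluster}
    return node_metrics, cluster_metrics


# Made-up example: three clusters labelled 1, 4 and 6
edges = [(1, 2), (2, 3), (4, 5)]
node_to_cluster = {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}

node_metrics, cluster_metrics = simple_graph_metrics(edges, node_to_cluster)
print(node_metrics)     # {1: 1, 2: 2, 3: 1, 4: 1, 5: 1, 6: 0}
print(cluster_metrics)
```

A density well below 1 in a large cluster means many of its records are linked only indirectly, which can be a sign that unrelated entities have been chained together.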