Methods in Linker.clustering

Cluster the results of the linkage model and analyse clusters, accessed via linker.clustering.

cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None)

Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using the connected components graph clustering algorithm.

Record pairs with an estimated match_probability at or above threshold_match_probability are considered to be a match (i.e. they represent the same entity).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df_predict | SplinkDataFrame | The results of linker.inference.predict() | required |
| threshold_match_probability | float | Pairwise comparisons with a match_probability at or above this threshold are matched | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| SplinkDataFrame | SplinkDataFrame | A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold. |

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
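
To illustrate what connected components clustering at a threshold does, here is a minimal pure-Python sketch. It is not Splink's implementation: the function name `cluster_at_threshold`, the tuple layout of `scored_pairs`, and the choice of the smallest record ID as the cluster ID are all assumptions made for this example.

```python
def cluster_at_threshold(record_ids, scored_pairs, threshold):
    """Group record IDs into connected components (illustrative sketch).

    scored_pairs: iterable of (id_left, id_right, match_probability).
    Pairs with match_probability >= threshold are treated as graph edges.
    Returns {record_id: cluster_id}; the cluster ID is the smallest record
    ID in the component, a convention chosen only for this sketch.
    """
    # Union-find: every record starts as the root of its own cluster
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in scored_pairs:
        if prob >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Attach the larger root under the smaller one
                lo, hi = sorted((root_l, root_r))
                parent[hi] = lo

    return {rid: find(rid) for rid in record_ids}


pairs = [
    ("a", "b", 0.99),
    ("b", "c", 0.96),
    ("c", "d", 0.60),  # below the 0.95 threshold, so not an edge
    ("d", "e", 0.97),
]
clusters = cluster_at_threshold(["a", "b", "c", "d", "e", "f"], pairs, 0.95)
print(clusters)
```

In this toy data, records a and c end up in the same cluster even though they were never compared directly, because both are linked to b. Record f has no qualifying pair and forms a cluster of its own.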

compute_graph_metrics(df_predict, df_clustered, *, threshold_match_probability=None)

Generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of SplinkDataFrames.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df_predict | SplinkDataFrame | The results of linker.inference.predict() | required |
| df_clustered | SplinkDataFrame | The outputs of linker.clustering.cluster_pairwise_predictions_at_threshold() | required |
| threshold_match_probability | float | Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on df_clustered. If no such metadata is available, this value must be provided. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| GraphMetricsResults | GraphMetricsResults | A data class containing SplinkDataFrames of cluster IDs and selected node, edge or cluster metrics: attribute "nodes" for the node metrics table, attribute "edges" for the edge metrics table, and attribute "clusters" for the cluster metrics table. |

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
graph_metrics = linker.clustering.compute_graph_metrics(
    df_predict, df_clustered, threshold_match_probability=0.95
)

node_metrics = graph_metrics.nodes.as_pandas_dataframe()
edge_metrics = graph_metrics.edges.as_pandas_dataframe()
cluster_metrics = graph_metrics.clusters.as_pandas_dataframe()
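
To give a feel for what node-level and cluster-level graph metrics look like, the following pure-Python sketch computes node degree, cluster size, edge count and density from a list of edges that has already been filtered at the threshold. The function `simple_graph_metrics` and its input shapes are hypothetical, and the metrics chosen are illustrative; they are not necessarily the columns Splink returns.

```python
from collections import defaultdict


def simple_graph_metrics(edges, cluster_of):
    """Compute a few illustrative graph metrics (sketch, not Splink's code).

    edges: iterable of (id_left, id_right) pairs at or above the threshold.
    cluster_of: {record_id: cluster_id}, e.g. from a clustering step.
    Returns (node_degree, cluster_metrics).
    """
    # Node metric: number of edges touching each record
    node_degree = {node: 0 for node in cluster_of}
    n_edges = defaultdict(int)
    for left, right in edges:
        node_degree[left] += 1
        node_degree[right] += 1
        # Both ends of an edge belong to the same cluster
        n_edges[cluster_of[left]] += 1

    n_nodes = defaultdict(int)
    for cluster_id in cluster_of.values():
        n_nodes[cluster_id] += 1

    cluster_metrics = {}
    for cluster_id, n in n_nodes.items():
        possible = n * (n - 1) // 2  # edges in a fully connected cluster
        e = n_edges[cluster_id]
        cluster_metrics[cluster_id] = {
            "n_nodes": n,
            "n_edges": e,
            # Density is undefined for a single-node cluster
            "density": e / possible if possible else None,
        }
    return node_degree, cluster_metrics


cluster_of = {"a": "a", "b": "a", "c": "a", "d": "d", "e": "d", "f": "f"}
edges = [("a", "b"), ("b", "c"), ("d", "e")]
node_degree, cluster_metrics = simple_graph_metrics(edges, cluster_of)
print(node_degree)
print(cluster_metrics)
```

A low density in a large cluster can hint that the cluster is held together by only a few links, which is the kind of signal these metrics are typically used to surface.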