
Methods in Linker.clustering

Cluster the results of the linkage model and analyse clusters, accessed via linker.clustering.

cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None, threshold_match_weight=None)

Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using the connected components graph clustering algorithm.

Records with an estimated match_probability at or above threshold_match_probability (or records with a match_weight at or above threshold_match_weight) are considered to be a match (i.e. they represent the same entity).
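As a rough illustration of what thresholded connected-components clustering means, the following is a minimal pure-Python sketch using a union-find structure. It is not Splink's implementation; the function name cluster_at_threshold and the (left_id, right_id, match_probability) tuple layout are assumptions made for this example only.

```python
def cluster_at_threshold(record_ids, scored_pairs, threshold):
    """Label each record with the smallest record id in its connected component.

    Only pairs whose score is at or above `threshold` count as edges.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in scored_pairs:
        if prob >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Keep the smaller id as the root so labels are deterministic.
                low, high = sorted((root_l, root_r))
                parent[high] = low

    return {r: find(r) for r in record_ids}


# Hypothetical pairwise predictions: (left_id, right_id, match_probability)
pairs = [(1, 2, 0.99), (2, 3, 0.96), (3, 4, 0.60), (5, 6, 0.95)]
clusters = cluster_at_threshold([1, 2, 3, 4, 5, 6, 7], pairs, threshold=0.95)
print(clusters)
```

Records 1, 2 and 3 end up in one cluster because they are linked transitively, even though no direct 1-3 comparison is present. Record 4 stays on its own because its only link falls below the threshold.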

Parameters:

Name Type Description Default
df_predict SplinkDataFrame

The results of linker.inference.predict()

required
threshold_match_probability float

Pairwise comparisons with a match_probability at or above this threshold are matched

None
threshold_match_weight float

Pairwise comparisons with a match_weight at or above this threshold are matched. Only one of threshold_match_probability or threshold_match_weight should be provided

None
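The two thresholds express the same cut-off on different scales. The sketch below assumes the usual Splink convention that the final match weight is the base-2 logarithm of the posterior odds of a match; the helper names are hypothetical and are not part of the Splink API.

```python
import math


def probability_to_match_weight(probability):
    """Convert a match probability (0 < p < 1) to a match weight, assuming weight = log2(odds)."""
    return math.log2(probability / (1 - probability))


def match_weight_to_probability(weight):
    """Inverse conversion: odds = 2 ** weight, probability = odds / (1 + odds)."""
    odds = 2 ** weight
    return odds / (1 + odds)


# A probability threshold of 0.95 corresponds to a weight of log2(19).
print(probability_to_match_weight(0.95))
# A weight of 0 corresponds to even odds.
print(match_weight_to_probability(0.0))
```

Under that assumption, passing either threshold selects the same set of pairwise comparisons, which is why only one of the two should be supplied.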

Returns:

Name Type Description
SplinkDataFrame SplinkDataFrame

A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold.

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)

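cluster_using_single_best_links(df_predict, duplicate_free_datasets, threshold_match_probability=None, threshold_match_weight=None, ties_method='lowest_id')

Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using a single best links method. This method restricts each cluster to at most one record from each source dataset in the duplicate_free_datasets list.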

This method adds a record to a cluster only if the record and the cluster are each other's best match, and only if adding the record does not violate the constraint of at most one record from each of the duplicate_free_datasets.
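The "mutually best match" idea can be sketched in a few lines of pure Python. This is a one-pass simplification, not Splink's implementation: it only filters links down to those where each record is the other's top-scoring partner within the partner's source dataset, it assumes there are no tied scores, and it does not perform the additional duplicate-free check on the resulting clusters. The function name and the (source_dataset, id) record representation are assumptions for illustration.

```python
def mutual_best_links(scored_pairs):
    """Keep a link only if each endpoint is the other's highest-scoring partner
    among records from the partner's source dataset.

    Records are (source_dataset, record_id) tuples; ties are not handled here.
    """
    best = {}  # (record, partner_dataset) -> (score, partner)
    for left, right, score in scored_pairs:
        for record, partner in ((left, right), (right, left)):
            key = (record, partner[0])
            if key not in best or score > best[key][0]:
                best[key] = (score, partner)

    kept = []
    for left, right, score in scored_pairs:
        left_prefers_right = best[(left, right[0])][1] == right
        right_prefers_left = best[(right, left[0])][1] == left
        if left_prefers_right and right_prefers_left:
            kept.append((left, right, score))
    return kept


# Hypothetical links between source datasets "A" and "B"
pairs = [
    (("A", 1), ("B", 1), 0.99),
    (("A", 1), ("B", 2), 0.97),
    (("A", 2), ("B", 2), 0.96),
]
kept_links = mutual_best_links(pairs)
print(kept_links)
```

Here only the A1-B1 link survives: B2's best partner in dataset A is A1, but A1 prefers B1, so neither of B2's links is mutual.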

Parameters:

Name Type Description Default
df_predict SplinkDataFrame

The results of linker.inference.predict()

required
duplicate_free_datasets List[str]

The source datasets which should be treated as having no duplicates. Clusters will not form with more than one record from each of these datasets. This can be a subset of all of the source datasets in the input data.

required
threshold_match_probability float

Pairwise comparisons with a match_probability at or above this threshold are matched

None
threshold_match_weight float

Pairwise comparisons with a match_weight at or above this threshold are matched. Only one of threshold_match_probability or threshold_match_weight should be provided

None
ties_method str

How the clustering method should deal with ties. There are two options: 'drop' and 'lowest_id'. After linking datasets A and B, if record A1 is tied between records B1 and B2 from dataset B, then the 'drop' option will drop both links, whereas the 'lowest_id' option will keep the link to record B1. If the links A1 to B1 and A1 to C1 are tied, where each record is from a different source dataset, then both links will be kept, even with the 'drop' option.

'lowest_id'
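The two tie-handling options described above can be illustrated with a small pure-Python sketch. The function resolve_tie is hypothetical and is not part of the Splink API; it picks a partner for one record from the candidates in a single source dataset, so ties across different source datasets never compete with each other.

```python
def resolve_tie(candidates, ties_method="lowest_id"):
    """Choose the best partner from [(partner_id, score), ...] for one source dataset.

    Returns the chosen partner id, or None if the tie is dropped.
    Scores are compared for exact equality, which is enough for this sketch.
    """
    if not candidates:
        return None
    top_score = max(score for _, score in candidates)
    tied = sorted(pid for pid, score in candidates if score == top_score)
    if len(tied) == 1:
        return tied[0]
    if ties_method == "lowest_id":
        return tied[0]
    if ties_method == "drop":
        return None
    raise ValueError(f"Unknown ties_method: {ties_method!r}")


# Hypothetical candidates for record A1, grouped by partner source dataset.
candidates_by_dataset = {
    "B": [("B1", 0.9), ("B2", 0.9)],  # tied within dataset B
    "C": [("C1", 0.9)],               # same score, but a different dataset
}
kept_lowest = {ds: resolve_tie(c, "lowest_id") for ds, c in candidates_by_dataset.items()}
kept_drop = {ds: resolve_tie(c, "drop") for ds, c in candidates_by_dataset.items()}
print(kept_lowest)
print(kept_drop)
```

With 'lowest_id' the link to B1 is kept; with 'drop' both B links are discarded. The link to C1 is kept in both cases because it is the only candidate from dataset C.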

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_using_single_best_links(
    df_predict,
    duplicate_free_datasets=["A", "B"],
    threshold_match_probability=0.95
)

compute_graph_metrics(df_predict, df_clustered, *, threshold_match_probability=None)

Generates tables of graph metrics (for nodes, edges and clusters) and returns a data class of SplinkDataFrames.

Parameters:

Name Type Description Default
df_predict SplinkDataFrame

The results of linker.inference.predict()

required
df_clustered SplinkDataFrame

The outputs of linker.clustering.cluster_pairwise_predictions_at_threshold()

required
threshold_match_probability float

Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on df_clustered. If no such metadata is available, this value must be provided.

None

Returns:

Name Type Description
GraphMetricsResults GraphMetricsResults

A data class containing SplinkDataFrames of cluster IDs and selected node, edge or cluster metrics:

attribute "nodes" for the node metrics table

attribute "edges" for the edge metrics table

attribute "clusters" for the cluster metrics table

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
graph_metrics = linker.clustering.compute_graph_metrics(
    df_predict, df_clustered, threshold_match_probability=0.95
)

node_metrics = graph_metrics.nodes.as_pandas_dataframe()
edge_metrics = graph_metrics.edges.as_pandas_dataframe()
cluster_metrics = graph_metrics.clusters.as_pandas_dataframe()
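To give a feel for what node-level and cluster-level graph metrics are, here is a small pure-Python sketch that computes node degree and cluster density from thresholded edges. It is an illustration under stated assumptions, not Splink's implementation: the function name, the input layout and the output keys (n_nodes, n_edges, density) are chosen for this example and may not match the columns Splink produces.

```python
from collections import Counter


def simple_graph_metrics(scored_pairs, cluster_of, threshold):
    """Compute node degree and per-cluster size, edge count and density.

    scored_pairs: iterable of (left_id, right_id, match_probability)
    cluster_of:   dict mapping every record id to its cluster id
    Only pairs at or above `threshold` are treated as edges.
    """
    edges = [(l, r) for l, r, prob in scored_pairs if prob >= threshold]

    degree = Counter()
    for left, right in edges:
        degree[left] += 1
        degree[right] += 1
    node_metrics = {node: degree[node] for node in cluster_of}

    n_nodes = Counter(cluster_of.values())
    # Both endpoints of a kept edge are assumed to share a cluster.
    n_edges = Counter(cluster_of[left] for left, _ in edges)

    cluster_metrics = {}
    for cluster_id, n in n_nodes.items():
        e = n_edges[cluster_id]
        # Density = edges present / edges possible; undefined for singletons.
        density = None if n < 2 else 2 * e / (n * (n - 1))
        cluster_metrics[cluster_id] = {"n_nodes": n, "n_edges": e, "density": density}

    return node_metrics, cluster_metrics


pairs = [(1, 2, 0.99), (2, 3, 0.96), (3, 4, 0.60)]
cluster_of = {1: 1, 2: 1, 3: 1, 4: 4}
node_metrics, cluster_metrics = simple_graph_metrics(pairs, cluster_of, threshold=0.95)
print(node_metrics)
print(cluster_metrics)
```

The threshold passed here plays the same role as threshold_match_probability above: it decides which pairwise comparisons count as edges, so it should match the threshold used to build the clusters.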