Methods in Linker.clustering

Cluster the results of the linkage model and analyse clusters, accessed via linker.clustering.

`cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None, threshold_match_weight=None)`

Clusters the pairwise match predictions that result from `linker.inference.predict()` into groups of connected records using the connected components graph clustering algorithm.

Record pairs with an estimated `match_probability` at or above `threshold_match_probability` (or with a `match_weight` at or above `threshold_match_weight`) are considered to be a match (i.e. they represent the same entity).

Parameters:

Name	Type	Description	Default
`df_predict`	`SplinkDataFrame`	The results of `linker.inference.predict()`	required
`threshold_match_probability`	`float`	Pairwise comparisons with a `match_probability` at or above this threshold are matched	`None`
`threshold_match_weight`	`float`	Pairwise comparisons with a `match_weight` at or above this threshold are matched. Only one of threshold_match_probability or threshold_match_weight should be provided	`None`
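
The two thresholds express the same cut-off on different scales. As a minimal sketch, assuming the usual Splink convention that a match weight is the base-2 logarithm of the odds implied by the match probability, the conversion can be written as below. The helper names `probability_to_match_weight` and `match_weight_to_probability` are hypothetical and are not part of the `linker.clustering` API.

```python
import math


def probability_to_match_weight(match_probability):
    """Convert a probability in (0, 1) to a base-2 log-odds match weight (assumed convention)."""
    return math.log2(match_probability / (1 - match_probability))


def match_weight_to_probability(match_weight):
    """Invert the conversion above: weight -> odds -> probability."""
    odds = 2 ** match_weight
    return odds / (1 + odds)


# A probability of 0.5 corresponds to even odds, i.e. a match weight of 0.
even_weight = probability_to_match_weight(0.5)
# Converting to a weight and back recovers the original probability (up to rounding).
round_trip = match_weight_to_probability(probability_to_match_weight(0.95))
```

Under this convention, choosing either threshold selects the same set of pairs, which is why only one of them needs to be supplied.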

Returns:

Name	Type	Description
`SplinkDataFrame`	`SplinkDataFrame`	A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold.

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
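
To illustrate what connected components clustering at a threshold does, here is a small pure-Python sketch using a union-find structure. It is not Splink's implementation (which runs in SQL on the chosen backend); the function `cluster_at_threshold` and the toy `pairs` data are hypothetical, and only records that appear in the pairwise predictions are clustered.

```python
def cluster_at_threshold(pairs, threshold_match_probability):
    """Assign a cluster ID to each record.

    pairs: iterable of (id_l, id_r, match_probability) tuples.
    Returns a dict mapping record ID -> cluster ID (the smallest ID in the cluster).
    """
    parent = {}

    def find(record):
        # Register unseen records as their own cluster, then follow parent links.
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the cluster representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for id_l, id_r, match_probability in pairs:
        # Register both records so that unmatched ones become singleton clusters.
        find(id_l)
        find(id_r)
        if match_probability >= threshold_match_probability:
            union(id_l, id_r)

    return {record: find(record) for record in parent}


pairs = [
    ("a", "b", 0.99),
    ("b", "c", 0.97),
    ("c", "d", 0.60),  # below the threshold, so "d" stays on its own
    ("e", "f", 0.96),
]
clusters = cluster_at_threshold(pairs, threshold_match_probability=0.95)
```

Note that "a" and "c" end up in the same cluster even though they were never compared directly: connected components links records transitively through "b".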

`cluster_using_single_best_links(df_predict, duplicate_free_datasets, threshold_match_probability=None, threshold_match_weight=None, ties_method='lowest_id')`

Clusters the pairwise match predictions that result from `linker.inference.predict()` into groups of connected records using a single best links method, which restricts each cluster to at most one record from each source dataset in the `duplicate_free_datasets` list.

This method adds a record to a cluster only if the record and the cluster are mutually each other's best match, and if adding the record does not violate the constraint of at most one record from each of the `duplicate_free_datasets`.

Parameters:

Name	Type	Description	Default
`df_predict`	`SplinkDataFrame`	The results of `linker.inference.predict()`	required
`duplicate_free_datasets`	`List[str]`	The source datasets which should be treated as having no duplicates. Clusters will not form with more than one record from each of these datasets. This can be a subset of all of the source datasets in the input data.	required
`threshold_match_probability`	`float`	Pairwise comparisons with a `match_probability` at or above this threshold are matched	`None`
`threshold_match_weight`	`float`	Pairwise comparisons with a `match_weight` at or above this threshold are matched. Only one of threshold_match_probability or threshold_match_weight should be provided	`None`
`ties_method`	`str`	How the clustering method should deal with ties. There are two options: 'drop' and 'lowest_id'. After linking datasets A and B, if record A1 is tied between records B1 and B2 from dataset B, then the 'drop' option will drop both links, whereas the 'lowest_id' option will keep the link to record B1. If the links A1 to B1 and A1 to C1 are tied where each record is from a different source dataset then both links will be kept, even with the 'drop' option.	`'lowest_id'`

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_using_single_best_links(
    df_predict,
    duplicate_free_datasets=["A", "B"],
    threshold_match_probability=0.95
)
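
The behaviour of `ties_method` can be illustrated with a simplified sketch. This is not the full single best links algorithm (it omits the mutual-best check and the cluster-level constraint); it only shows how tied best links from one record into a single source dataset might be resolved. The function `best_links` and the tuple layout of `links` are hypothetical.

```python
from collections import defaultdict


def best_links(links, ties_method="lowest_id"):
    """Keep the best link from each left record into each right-hand source dataset.

    links: iterable of (id_l, source_dataset_r, id_r, match_probability) tuples.
    Returns a sorted list of (id_l, id_r, match_probability) tuples.
    """
    if ties_method not in ("drop", "lowest_id"):
        raise ValueError("ties_method must be 'drop' or 'lowest_id'")

    # Ties only matter within one target dataset, so group by (left record, dataset).
    grouped = defaultdict(list)
    for id_l, source_r, id_r, match_probability in links:
        grouped[(id_l, source_r)].append((match_probability, id_r))

    kept = []
    for (id_l, _source_r), candidates in grouped.items():
        best = max(probability for probability, _ in candidates)
        tied = sorted(id_r for probability, id_r in candidates if probability == best)
        if len(tied) > 1 and ties_method == "drop":
            continue  # 'drop': discard every tied link into this dataset
        kept.append((id_l, tied[0], best))  # 'lowest_id': keep the smallest ID
    return sorted(kept)


links = [
    ("A1", "B", "B1", 0.98),
    ("A1", "B", "B2", 0.98),  # tied with B1 within dataset B
    ("A1", "C", "C1", 0.98),  # same score, but a different dataset
]
kept_lowest_id = best_links(links, ties_method="lowest_id")
kept_drop = best_links(links, ties_method="drop")
```

With `'lowest_id'` the sketch keeps A1 to B1 and A1 to C1; with `'drop'` it keeps only A1 to C1, because the tie between B1 and B2 is within one dataset while C1 is in another.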

`compute_graph_metrics(df_predict, df_clustered, *, threshold_match_probability=None)`

Generates tables containing graph metrics (for nodes, edges and clusters), and returns them as a data class of SplinkDataFrames.

Parameters:

Name	Type	Description	Default
`df_predict`	`SplinkDataFrame`	The results of `linker.inference.predict()`	required
`df_clustered`	`SplinkDataFrame`	The outputs of `linker.clustering.cluster_pairwise_predictions_at_threshold()`	required
`threshold_match_probability`	`float`	Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on `df_clustered`. If no such metadata is available, this value must be provided.	`None`

Returns:

Name	Type	Description
`GraphMetricsResults`	`GraphMetricsResults`	A data class containing SplinkDataFrames of cluster IDs and selected node, edge or cluster metrics: attribute `nodes` for the node metrics table, attribute `edges` for the edge metrics table, and attribute `clusters` for the cluster metrics table.

Examples:

df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
graph_metrics = linker.clustering.compute_graph_metrics(
    df_predict, df_clustered, threshold_match_probability=0.95
)

node_metrics = graph_metrics.nodes.as_pandas_dataframe()
edge_metrics = graph_metrics.edges.as_pandas_dataframe()
cluster_metrics = graph_metrics.clusters.as_pandas_dataframe()
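
For intuition about the kind of quantities such tables can hold, the sketch below computes two common graph metrics by hand: node degree (the number of links touching a record) and cluster density (the share of possible links within a cluster that are present). The function `simple_graph_metrics`, its inputs and the output key names are hypothetical and are not claimed to match the columns Splink returns.

```python
from collections import defaultdict


def simple_graph_metrics(edges, cluster_of):
    """Compute node degree and per-cluster size, edge count and density.

    edges: iterable of (id_l, id_r) links that met the match threshold.
    cluster_of: dict mapping every record ID to its cluster ID.
    """
    degree = {node: 0 for node in cluster_of}
    n_edges = defaultdict(int)
    for id_l, id_r in edges:
        degree[id_l] += 1
        degree[id_r] += 1
        # Both ends of a retained edge belong to the same cluster.
        n_edges[cluster_of[id_l]] += 1

    n_nodes = defaultdict(int)
    for cluster_id in cluster_of.values():
        n_nodes[cluster_id] += 1

    clusters = {}
    for cluster_id, n in n_nodes.items():
        possible_edges = n * (n - 1) / 2
        # Density is undefined for a singleton cluster, so report None.
        density = n_edges[cluster_id] / possible_edges if possible_edges else None
        clusters[cluster_id] = {
            "n_nodes": n,
            "n_edges": n_edges[cluster_id],
            "density": density,
        }
    return degree, clusters


toy_edges = [("a", "b"), ("b", "c")]
toy_cluster_of = {"a": "a", "b": "a", "c": "a", "d": "d"}
toy_degree, toy_clusters = simple_graph_metrics(toy_edges, toy_cluster_of)
```

In this toy graph the cluster containing "a", "b" and "c" has two of its three possible links, so a low density of this kind can flag clusters held together by a single chain of matches.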
