Methods in Linker.clustering
Cluster the results of the linkage model and analyse clusters, accessed via linker.clustering.
cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None, threshold_match_weight=None)
Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using the connected components graph clustering algorithm.
Pairs of records with an estimated match_probability at or above threshold_match_probability (or with a match_weight at or above threshold_match_weight) are considered to be a match (i.e. they represent the same entity).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
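df_predict | SplinkDataFrame | The pairwise predictions returned by linker.inference.predict() | required |
threshold_match_probability | float | Pairwise comparisons with a match_probability at or above this threshold are treated as matches | None |
threshold_match_weight | float | Pairwise comparisons with a match_weight at or above this threshold are treated as matches | None |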
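The two thresholds express a cutoff on different scales. As a minimal sketch, assuming Splink's usual convention that a match weight is the base-2 logarithm of the match odds, the conversion could look like this (the helper names are hypothetical and not part of the Splink API):

```python
import math


def probability_to_match_weight(p: float) -> float:
    """Convert a match probability (0 < p < 1) to a log2-odds match weight."""
    return math.log2(p / (1.0 - p))


def match_weight_to_probability(w: float) -> float:
    """Convert a log2-odds match weight back to a match probability."""
    odds = 2.0 ** w
    return odds / (1.0 + odds)


# Assumed convention: a probability of 0.5 corresponds to a weight of 0.
weight_at_95 = probability_to_match_weight(0.95)
probability_back = match_weight_to_probability(weight_at_95)
```

Under that assumption, a probability threshold of 0.95 corresponds to a match weight of roughly 4.25.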
Returns:
Name | Type | Description |
---|---|---|
SplinkDataFrame | SplinkDataFrame | A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold. |
Examples:
df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
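To illustrate what connected components clustering at a threshold means, here is a minimal pure-Python sketch using a union-find structure. It is not Splink's implementation: the function name, the input shapes, and the choice of the smallest record ID as the cluster ID are all assumptions made for the example.

```python
def cluster_at_threshold(node_ids, edges, threshold):
    """Group records into connected components.

    edges is an iterable of (id_l, id_r, match_probability) tuples.
    Only edges at or above the threshold join two records.
    """
    parent = {n: n for n in node_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for id_l, id_r, prob in edges:
        if prob >= threshold:
            root_l, root_r = find(id_l), find(id_r)
            if root_l != root_r:
                # Assumption: the smaller ID becomes the cluster representative.
                if root_l < root_r:
                    parent[root_r] = root_l
                else:
                    parent[root_l] = root_r

    return {n: find(n) for n in node_ids}


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 0.99), ("b", "c", 0.96), ("c", "d", 0.60), ("d", "e", 0.97)]
clusters = cluster_at_threshold(nodes, edges, threshold=0.95)
```

In this toy input the c to d link falls below the threshold, so the records split into two clusters: a, b and c together, and d and e together. Note that a and c share a cluster even though they were never compared directly, which is the transitive behaviour of connected components.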
cluster_using_single_best_links(df_predict, duplicate_free_datasets, threshold_match_probability=None, threshold_match_weight=None, ties_method='lowest_id')
Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using a single best links method that restricts the clusters to have at most one record from each source dataset in the duplicate_free_datasets list.
This method will include a record in a cluster if it is mutually the best match for the record and for the cluster, and if adding the record will not violate the criterion of having at most one record from each of the duplicate_free_datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_predict | SplinkDataFrame | The pairwise predictions returned by linker.inference.predict() | required |
duplicate_free_datasets | List[str] | The source datasets which should be treated as having no duplicates. Clusters will not form with more than one record from each of these datasets. This can be a subset of all of the source datasets in the input data. | required |
threshold_match_probability | float | Pairwise comparisons with a match_probability at or above this threshold are treated as matches | None |
threshold_match_weight | float | Pairwise comparisons with a match_weight at or above this threshold are treated as matches | None |
ties_method | str | How the clustering method should deal with ties. There are two options: 'drop' and 'lowest_id'. After linking datasets A and B, if record A1 is tied between records B1 and B2 from dataset B, then the 'drop' option will drop both links, whereas the 'lowest_id' option will keep the link to record B1. If the links A1 to B1 and A1 to C1 are tied, where each record is from a different source dataset, then both links will be kept, even with the 'drop' option. | 'lowest_id' |
Examples:
df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_using_single_best_links(
    df_predict,
    duplicate_free_datasets=["A", "B"],
    threshold_match_probability=0.95
)
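The tie-handling rules described for ties_method can be illustrated with a simplified pure-Python sketch. It covers only the pairwise "mutual best link" filter and does not reproduce the cluster-level check that the real method performs. Every name here (best_partner, single_best_links as a free function, source_of) is hypothetical and chosen for the example.

```python
from collections import defaultdict


def best_partner(edges, source_of, duplicate_free, ties_method):
    """For each (record, other dataset) pair, pick the record's best link."""
    candidates = defaultdict(list)
    for id_l, id_r, prob in edges:
        for rec, other in ((id_l, id_r), (id_r, id_l)):
            if source_of[other] in duplicate_free:
                candidates[(rec, source_of[other])].append((prob, other))

    best = {}
    for key, cands in candidates.items():
        top = max(p for p, _ in cands)
        tied = sorted(o for p, o in cands if p == top)
        if len(tied) == 1 or ties_method == "lowest_id":
            best[key] = tied[0]
        # With "drop", a tie leaves no best link into that dataset.
    return best


def single_best_links(edges, source_of, duplicate_free_datasets,
                      threshold, ties_method="lowest_id"):
    """Keep only links that are the best choice for both of their records."""
    duplicate_free = set(duplicate_free_datasets)
    strong = [
        (l, r, p) for l, r, p in edges
        if p >= threshold
        # Two records from the same duplicate-free dataset can never link.
        and not (source_of[l] == source_of[r] and source_of[l] in duplicate_free)
    ]
    best = best_partner(strong, source_of, duplicate_free, ties_method)

    kept = []
    for id_l, id_r, prob in strong:
        src_l, src_r = source_of[id_l], source_of[id_r]
        ok_l = src_r not in duplicate_free or best.get((id_l, src_r)) == id_r
        ok_r = src_l not in duplicate_free or best.get((id_r, src_l)) == id_l
        if ok_l and ok_r:
            kept.append((id_l, id_r, prob))
    return kept


source_of = {"A1": "A", "B1": "B", "B2": "B", "C1": "C"}
edges = [("A1", "B1", 0.97), ("A1", "B2", 0.97), ("A1", "C1", 0.97)]
kept_lowest = single_best_links(edges, source_of, ["A", "B", "C"], 0.95, "lowest_id")
kept_drop = single_best_links(edges, source_of, ["A", "B", "C"], 0.95, "drop")
```

With this toy input, 'lowest_id' keeps the A1 to B1 and A1 to C1 links, while 'drop' keeps only A1 to C1, since the tie between B1 and B2 removes both links into dataset B.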
compute_graph_metrics(df_predict, df_clustered, *, threshold_match_probability=None)
Generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of SplinkDataFrames.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_predict | SplinkDataFrame | The pairwise predictions returned by linker.inference.predict() | required |
df_clustered | SplinkDataFrame | The clustered output, for example from linker.clustering.cluster_pairwise_predictions_at_threshold() | required |
threshold_match_probability | float | Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from stored metadata. | None |
Returns:
Name | Type | Description |
---|---|---|
GraphMetricsResult | GraphMetricsResults | A data class containing SplinkDataFrames of cluster IDs and selected node, edge or cluster metrics: attribute "nodes" for the node metrics table, attribute "edges" for the edge metrics table, and attribute "clusters" for the cluster metrics table. |
Examples:
df_predict = linker.inference.predict(threshold_match_probability=0.5)
df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
graph_metrics = linker.clustering.compute_graph_metrics(
    df_predict, df_clustered, threshold_match_probability=0.95
)
node_metrics = graph_metrics.nodes.as_pandas_dataframe()
edge_metrics = graph_metrics.edges.as_pandas_dataframe()
cluster_metrics = graph_metrics.clusters.as_pandas_dataframe()
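This section does not list which metrics the three tables contain. As a rough illustration of the kind of quantity a node or cluster metric can be, the sketch below computes node degree and cluster density in pure Python. The function name, the input shapes and the choice of metrics are assumptions for the example and may not match the columns Splink returns.

```python
from collections import defaultdict


def simple_graph_metrics(edges, cluster_of):
    """Compute node degree and per-cluster size, edge count and density.

    edges is a list of (id_l, id_r) pairs already filtered to matches.
    cluster_of maps each record ID to its cluster ID.
    """
    degree = {node: 0 for node in cluster_of}
    n_nodes = defaultdict(int)
    n_edges = defaultdict(int)

    for cluster in cluster_of.values():
        n_nodes[cluster] += 1

    for id_l, id_r in edges:
        degree[id_l] += 1
        degree[id_r] += 1
        n_edges[cluster_of[id_l]] += 1

    cluster_metrics = {}
    for cluster, n in n_nodes.items():
        possible = n * (n - 1) / 2  # edges in a fully connected cluster
        density = n_edges[cluster] / possible if possible else None
        cluster_metrics[cluster] = {
            "n_nodes": n,
            "n_edges": n_edges[cluster],
            "density": density,
        }
    return degree, cluster_metrics


cluster_of = {"a": "a", "b": "a", "c": "a", "d": "d"}
match_edges = [("a", "b"), ("b", "c")]
node_degree, cluster_stats = simple_graph_metrics(match_edges, cluster_of)
```

Here cluster "a" has three nodes and two of its three possible edges, giving a density of two thirds. A low density like this can flag a cluster that is held together by a chain of links, not by mutual agreement between all of its records.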