
How to compute graph metrics with Splink

Introduction to the compute_graph_metrics() method

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the compute_graph_metrics() method.

The method is called on the linker like so:

linker.clustering.compute_graph_metrics(df_predict, df_clustered, threshold_match_probability=0.95)

Parameters:

  • df_predict (SplinkDataFrame, required): The results of linker.inference.predict().
  • df_clustered (SplinkDataFrame, required): The outputs of linker.clustering.cluster_pairwise_predictions_at_threshold().
  • threshold_match_probability (float, default None): Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on df_clustered. If no such metadata is available, this value must be provided.

Warning

threshold_match_probability should equal the clustering threshold passed to cluster_pairwise_predictions_at_threshold(). If Splink has this information, it is passed automatically. Otherwise, you must provide the value yourself and make sure the two thresholds align.
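
To illustrate why the two thresholds need to match, here is a minimal, library-free sketch. It is not Splink's implementation. It assumes that clusters are the connected components of the edges at or above the clustering threshold, and it uses invented toy data and a hypothetical helper, cross_cluster_edges, to find retained edges whose endpoints were assigned to different clusters.

```python
# Toy pairwise predictions: (left_id, right_id, match_probability).
# These values are invented for illustration only.
predictions = [
    ("a", "b", 0.99),
    ("b", "c", 0.97),
    ("c", "d", 0.90),  # below a 0.95 clustering threshold
    ("d", "e", 0.98),
]

# Cluster assignments as they would look after clustering at 0.95:
# the 0.90 edge is dropped, so {a, b, c} and {d, e} are separate clusters.
cluster_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}


def cross_cluster_edges(predictions, cluster_of, threshold):
    """Return edges kept at `threshold` whose endpoints are in different clusters."""
    return [
        (left, right, prob)
        for left, right, prob in predictions
        if prob >= threshold and cluster_of[left] != cluster_of[right]
    ]


# Matching threshold: every retained edge lies inside a single cluster.
aligned = cross_cluster_edges(predictions, cluster_of, threshold=0.95)

# Lower threshold than the one used for clustering: the c-d edge is retained
# even though c and d were assigned to different clusters.
misaligned = cross_cluster_edges(predictions, cluster_of, threshold=0.85)

print(aligned)
print(misaligned)
```

With misaligned thresholds, edge and node metrics would be computed over edges that the clustering step never used, so they would no longer describe the clusters in df_clustered.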

The method generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes. The individual Splink dataframes containing node, edge and cluster metrics can be accessed as follows:

graph_metrics = linker.clustering.compute_graph_metrics(
    pairwise_predictions, clusters
)

df_edges = graph_metrics.edges.as_pandas_dataframe()
df_nodes = graph_metrics.nodes.as_pandas_dataframe()
df_clusters = graph_metrics.clusters.as_pandas_dataframe()

The metrics computed by compute_graph_metrics() include all those mentioned in the Graph metrics chapter, namely:

  • Node degree
  • 'Is bridge'
  • Cluster size
  • Cluster density
  • Cluster centralisation
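
As a rough illustration of what these metrics measure, the sketch below computes node degree, cluster size, density and a degree centralisation for a single toy cluster using only the standard library. The function name cluster_metrics and the toy data are hypothetical. Density is taken as the share of possible undirected edges that are present, and centralisation is assumed to follow the Freeman-style degree definition; Splink's own calculation may differ in detail.

```python
def cluster_metrics(nodes, edges):
    """Compute degree, size, density and degree centralisation for one cluster.

    `nodes` is an iterable of record ids; `edges` is a list of (left, right) pairs.
    """
    # Node degree: number of edges attached to each node.
    degree = {node: 0 for node in nodes}
    for left, right in edges:
        degree[left] += 1
        degree[right] += 1

    n_nodes = len(degree)
    n_edges = len(edges)

    # Density: edges present divided by edges possible, N * (N - 1) / 2.
    possible_edges = n_nodes * (n_nodes - 1) / 2
    density = n_edges / possible_edges if possible_edges else None

    # Assumed Freeman-style centralisation: total shortfall from the maximum
    # degree, normalised by (N - 1) * (N - 2). Undefined for fewer than 3 nodes.
    if n_nodes > 2:
        max_degree = max(degree.values())
        shortfall = sum(max_degree - d for d in degree.values())
        centralisation = shortfall / ((n_nodes - 1) * (n_nodes - 2))
    else:
        centralisation = None

    return {
        "degree": degree,
        "n_nodes": n_nodes,
        "n_edges": n_edges,
        "density": density,
        "cluster_centralisation": centralisation,
    }


toy = cluster_metrics(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")],
)
triangle = cluster_metrics(["x", "y", "z"], [("x", "y"), ("y", "z"), ("x", "z")])
singleton = cluster_metrics(["s"], [])

print(toy)
```

Under these assumptions a fully connected three-node cluster has density 1.0 and centralisation 0.0, and a single-node cluster has no defined density or centralisation.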

All of these metrics are calculated by default. If you are unable to install the igraph package required for 'is bridge', that metric is skipped; all other metrics are still generated.
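
For readers unfamiliar with the term, a bridge is an edge whose removal splits a connected cluster in two. The following naive, standard-library sketch shows the idea on a toy cluster. It is not how Splink or igraph compute the metric, the helper names is_connected and find_bridges are hypothetical, and the approach assumes the input cluster is connected to begin with.

```python
def is_connected(nodes, edges):
    """Return True if every node can be reached from the first node via `edges`."""
    nodes = list(nodes)
    if not nodes:
        return True
    neighbours = {node: set() for node in nodes}
    for left, right in edges:
        neighbours[left].add(right)
        neighbours[right].add(left)
    # Depth-first traversal from an arbitrary starting node.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)


def find_bridges(nodes, edges):
    """Naive check: an edge is a bridge if removing it disconnects the cluster."""
    return [
        edge
        for i, edge in enumerate(edges)
        if not is_connected(nodes, edges[:i] + edges[i + 1:])
    ]


# Toy cluster: a triangle (a, b, c) with a two-edge tail (c-d, d-e).
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]

bridges = find_bridges(nodes, edges)
print(bridges)
```

The triangle edges are not bridges because each has an alternative path around it, whereas removing either tail edge cuts off part of the cluster. This re-checks connectivity once per edge, so it is only suitable for small illustrations.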

Full code example

This code snippet computes graph metrics for a simple Splink dedupe model. A pandas dataframe of cluster metrics is displayed as the final output.

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.historical_50k

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch(
            "first_name",
        ).configure(term_frequency_adjustments=True),
        cl.JaroWinklerAtThresholds("surname", score_threshold_or_thresholds=[0.9, 0.8]),
        cl.LevenshteinAtThresholds(
            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
        ),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("postcode_fake", "first_name"),
        block_on("first_name", "surname"),
        block_on("dob", "substr(postcode_fake,1,2)"),
        block_on("postcode_fake", "substr(dob,1,3)"),
        block_on("postcode_fake", "substr(dob,4,5)"),
    ],
    retain_intermediate_calculation_columns=True,
)

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob", "substr(postcode_fake, 1,3)")
)

pairwise_predictions = linker.inference.predict()
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

graph_metrics = linker.clustering.compute_graph_metrics(pairwise_predictions, clusters)

df_clusters = graph_metrics.clusters.as_pandas_dataframe()
df_clusters
         cluster_id  n_nodes  n_edges   density  cluster_centralisation
0        Q5076213-1       10     31.0  0.688889                0.250000
1         Q760788-1        9     30.0  0.833333                0.214286
2      Q88466525-10        3      3.0  1.000000                0.000000
3       Q88466525-1       10     37.0  0.822222                0.222222
4        Q1386511-1       13     47.0  0.602564                0.272727
...             ...      ...      ...       ...                     ...
21346   Q1562561-16        1      0.0       NaN                     NaN
21347   Q15999964-5        1      0.0       NaN                     NaN
21348   Q5363139-12        1      0.0       NaN                     NaN
21349    Q4722328-5        1      0.0       NaN                     NaN
21350   Q7528564-13        1      0.0       NaN                     NaN

21351 rows × 5 columns
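
One possible follow-up, offered as a sketch rather than a recommended workflow, is to pick out larger clusters with relatively low density for closer inspection. The example works on plain records, such as those returned by pandas' df_clusters.to_dict("records"), and reuses the first rows and one singleton row of the output above. The helper name flag_sparse_clusters and the cut-off values are hypothetical.

```python
# Rows copied from the example output above, as a list of dicts.
records = [
    {"cluster_id": "Q5076213-1", "n_nodes": 10, "n_edges": 31.0,
     "density": 0.688889, "cluster_centralisation": 0.250000},
    {"cluster_id": "Q760788-1", "n_nodes": 9, "n_edges": 30.0,
     "density": 0.833333, "cluster_centralisation": 0.214286},
    {"cluster_id": "Q88466525-10", "n_nodes": 3, "n_edges": 3.0,
     "density": 1.000000, "cluster_centralisation": 0.000000},
    {"cluster_id": "Q88466525-1", "n_nodes": 10, "n_edges": 37.0,
     "density": 0.822222, "cluster_centralisation": 0.222222},
    {"cluster_id": "Q1386511-1", "n_nodes": 13, "n_edges": 47.0,
     "density": 0.602564, "cluster_centralisation": 0.272727},
    {"cluster_id": "Q1562561-16", "n_nodes": 1, "n_edges": 0.0,
     "density": float("nan"), "cluster_centralisation": float("nan")},
]


def flag_sparse_clusters(records, min_nodes=3, max_density=0.7):
    """Return ids of clusters with at least `min_nodes` nodes and density below `max_density`.

    Rows with NaN density (single-node clusters) never satisfy the comparison,
    so they are excluded automatically.
    """
    return [
        row["cluster_id"]
        for row in records
        if row["n_nodes"] >= min_nodes and row["density"] < max_density
    ]


flagged = flag_sparse_clusters(records)
print(flagged)
```

The cut-offs here are arbitrary; suitable values depend on the data and on how the clusters will be used.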