How to compute graph metrics with Splink

Introduction to the `compute_graph_metrics()` method

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the compute_graph_metrics() method.

The method is called via linker.clustering like so:

linker.clustering.compute_graph_metrics(df_predict, df_clustered, threshold_match_probability=0.95)

Parameters:

Name	Type	Description	Default
`df_predict`	`SplinkDataFrame`	The results of `linker.inference.predict()`	required
`df_clustered`	`SplinkDataFrame`	The outputs of `linker.clustering.cluster_pairwise_predictions_at_threshold()`	required
`threshold_match_probability`	`float`	Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on `df_clustered`. If no such metadata is available, this value must be provided.	`None`

Warning

threshold_match_probability should match the clustering threshold passed to cluster_pairwise_predictions_at_threshold(). If Splink has this information, it is passed automatically; otherwise you must provide it yourself and ensure that the two threshold values align.
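
One way to keep the two thresholds aligned is to define the value once and pass it to both calls. The sketch below assumes an already-trained `linker`; the helper name `cluster_and_compute_metrics` is hypothetical and not part of Splink.

```python
def cluster_and_compute_metrics(linker, threshold=0.95):
    """Cluster and compute graph metrics using one shared threshold.

    Hypothetical helper, not part of Splink: it only wraps the calls
    described above so the two threshold values cannot drift apart.
    """
    df_predict = linker.inference.predict()
    df_clustered = linker.clustering.cluster_pairwise_predictions_at_threshold(
        df_predict, threshold
    )
    # Pass the same value explicitly rather than relying on metadata.
    return linker.clustering.compute_graph_metrics(
        df_predict, df_clustered, threshold_match_probability=threshold
    )
```

Passing the threshold explicitly is redundant when Splink can read it from the metadata on `df_clustered`, but it makes the intended value visible at the call site.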

The method generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes. The individual Splink dataframes containing node, edge and cluster metrics can be accessed as follows:

graph_metrics = linker.clustering.compute_graph_metrics(
    pairwise_predictions, clusters
)

df_edges = graph_metrics.edges.as_pandas_dataframe()
df_nodes = graph_metrics.nodes.as_pandas_dataframe()
df_clusters = graph_metrics.clusters.as_pandas_dataframe()

The metrics computed by compute_graph_metrics() include all those mentioned in the Graph metrics chapter, namely:

Node degree
'Is bridge'
Cluster size
Cluster density
Cluster centralisation

All of these metrics are calculated by default. If you are unable to install the igraph package required for 'is bridge', that metric won't be calculated, but all other metrics will still be generated.
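
To make the role of the threshold concrete, the following sketch filters a toy edge list at a match probability threshold and then counts node degree. It illustrates the definitions only and is not Splink's implementation; the edge data and variable names are invented for the example.

```python
from collections import Counter

# Toy pairwise predictions: (left id, right id, match_probability).
# Invented purely for illustration.
edges = [
    ("a", "b", 0.99),
    ("a", "c", 0.97),
    ("b", "c", 0.96),
    ("c", "d", 0.40),  # below the threshold, so it is dropped
]
threshold_match_probability = 0.95

# Keep only edges at or above the threshold.
kept = [(left, right) for left, right, p in edges if p >= threshold_match_probability]

# Node degree: number of retained edges touching each node.
node_degree = Counter()
for left, right in kept:
    node_degree[left] += 1
    node_degree[right] += 1

print(dict(node_degree))
```

Node `d` has no retained edges, so it does not appear in the counter; looking it up returns a degree of 0.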

Full code example

This code snippet computes graph metrics for a simple Splink dedupe model. A pandas dataframe of cluster metrics is displayed as the final output.

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.historical_50k

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch(
            "first_name",
        ).configure(term_frequency_adjustments=True),
        cl.JaroWinklerAtThresholds("surname", score_threshold_or_thresholds=[0.9, 0.8]),
        cl.LevenshteinAtThresholds(
            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
        ),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("postcode_fake", "first_name"),
        block_on("first_name", "surname"),
        block_on("dob", "substr(postcode_fake,1,2)"),
        block_on("postcode_fake", "substr(dob,1,3)"),
        block_on("postcode_fake", "substr(dob,4,5)"),
    ],
    retain_intermediate_calculation_columns=True,
)

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob", "substr(postcode_fake, 1,3)")
)

pairwise_predictions = linker.inference.predict()
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

graph_metrics = linker.clustering.compute_graph_metrics(pairwise_predictions, clusters)

df_clusters = graph_metrics.clusters.as_pandas_dataframe()

df_clusters

	cluster_id	n_nodes	n_edges	density	cluster_centralisation
0	Q5076213-1	10	31.0	0.688889	0.250000
1	Q760788-1	9	30.0	0.833333	0.214286
2	Q88466525-10	3	3.0	1.000000	0.000000
3	Q88466525-1	10	37.0	0.822222	0.222222
4	Q1386511-1	13	47.0	0.602564	0.272727
...	...	...	...	...	...
21346	Q1562561-16	1	0.0	NaN	NaN
21347	Q15999964-5	1	0.0	NaN	NaN
21348	Q5363139-12	1	0.0	NaN	NaN
21349	Q4722328-5	1	0.0	NaN	NaN
21350	Q7528564-13	1	0.0	NaN	NaN

21351 rows × 5 columns
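
The density values in this output are consistent with the standard density formula for an undirected graph, 2 × n_edges / (n_nodes × (n_nodes − 1)). The sketch below recomputes a few rows as a sanity check; the function name `cluster_density` is hypothetical and the code is not taken from Splink.

```python
import math


def cluster_density(n_nodes, n_edges):
    """Density of an undirected graph: edges present / edges possible."""
    possible_edges = n_nodes * (n_nodes - 1) / 2
    if possible_edges == 0:
        # A single-node cluster has no possible edges; the table shows NaN.
        return math.nan
    return n_edges / possible_edges


# (n_nodes, n_edges) pairs taken from the first rows of the output above.
for n_nodes, n_edges in [(10, 31), (9, 30), (3, 3)]:
    print(n_nodes, n_edges, round(cluster_density(n_nodes, n_edges), 6))
```

The printed values can be compared against the density column for the corresponding clusters.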
