How to compute graph metrics with Splink
Introduction to the compute_graph_metrics() method
To let users calculate a variety of graph metrics for their linked data, Splink provides the compute_graph_metrics() method.
The method is called on the linker like so:
linker.clustering.compute_graph_metrics(df_predict, df_clustered, threshold_match_probability=0.95)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_predict | SplinkDataFrame | The results of linker.inference.predict() | required |
df_clustered | SplinkDataFrame | The outputs of linker.clustering.cluster_pairwise_predictions_at_threshold() | required |
threshold_match_probability | float | Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on df_clustered. | None |
Warning
threshold_match_probability should be the same as the clustering threshold passed to cluster_pairwise_predictions_at_threshold(). If this information is available to Splink, it is passed automatically; otherwise you must provide it yourself and take care that the threshold values align.
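To illustrate why the two thresholds need to agree, here is a minimal pure-Python sketch. It does not call Splink; the records, probabilities and the edges_at_threshold helper are hypothetical and only mimic the "at or above this threshold" filter described above.

```python
# Hypothetical pairwise predictions: (record_a, record_b, match_probability).
pairwise = [
    ("a", "b", 0.99),
    ("b", "c", 0.96),
    ("c", "d", 0.90),  # below the clustering threshold used in this sketch
]


def edges_at_threshold(predictions, threshold_match_probability):
    """Keep only comparisons with a probability at or above the threshold."""
    return [
        (left, right)
        for left, right, probability in predictions
        if probability >= threshold_match_probability
    ]


clustering_threshold = 0.95

# Same threshold as clustering: the edges describe the clusters as they were built.
edges_consistent = edges_at_threshold(pairwise, clustering_threshold)

# Lower threshold: an extra edge appears that the clustering step never used.
edges_mismatched = edges_at_threshold(pairwise, 0.85)

print(edges_consistent)  # [('a', 'b'), ('b', 'c')]
print(edges_mismatched)  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

In the mismatched case, the edge between c and d would be counted even though d was not placed in the same cluster at 0.95, so the edge set and the clusters would no longer describe the same graph.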
The method generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes. The individual Splink dataframes containing node, edge and cluster metrics can be accessed as follows:
graph_metrics = linker.clustering.compute_graph_metrics(
    pairwise_predictions, clusters
)
df_edges = graph_metrics.edges.as_pandas_dataframe()
df_nodes = graph_metrics.nodes.as_pandas_dataframe()
df_clusters = graph_metrics.clusters.as_pandas_dataframe()
The metrics computed by compute_graph_metrics() include all those mentioned in the Graph metrics chapter, namely:
- Node degree
- 'Is bridge'
- Cluster size
- Cluster density
- Cluster centrality
All of these metrics are calculated by default. If you are unable to install the igraph package required for 'is bridge', that metric won't be calculated, but all other metrics will still be generated.
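As a rough illustration of what the metrics listed above measure, the following pure-Python sketch computes them for one small, hypothetical cluster. It is not Splink's implementation and does not use igraph. The cluster_metrics and is_bridge helpers are invented for this example, and the centralisation formula (degree centralisation normalised by (n - 1)(n - 2)) is an assumption. The density formula does agree with the example output later on this page: 10 nodes and 31 edges give 31 / 45 ≈ 0.688889.

```python
from collections import defaultdict


def build_neighbours(edges):
    """Adjacency sets for an undirected graph; each edge is listed once."""
    neighbours = defaultdict(set)
    for left, right in edges:
        neighbours[left].add(right)
        neighbours[right].add(left)
    return neighbours


def cluster_metrics(nodes, edges):
    """Node degree, cluster size, density and centralisation for one cluster."""
    neighbours = build_neighbours(edges)
    degree = {node: len(neighbours[node]) for node in nodes}
    n_nodes, n_edges = len(nodes), len(edges)

    # Density: edges present divided by edges possible (undefined for one node).
    possible_edges = n_nodes * (n_nodes - 1) / 2
    density = n_edges / possible_edges if possible_edges else None

    # Assumed formula: degree centralisation, undefined below three nodes.
    centralisation = None
    if n_nodes > 2:
        max_degree = max(degree.values())
        spread = sum(max_degree - d for d in degree.values())
        centralisation = spread / ((n_nodes - 1) * (n_nodes - 2))

    return {
        "degree": degree,
        "n_nodes": n_nodes,
        "n_edges": n_edges,
        "density": density,
        "centralisation": centralisation,
    }


def is_bridge(edge, edges):
    """An edge is a bridge if removing it disconnects its two endpoints."""
    neighbours = build_neighbours([e for e in edges if e != edge])
    start, target = edge
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return target not in seen


# Hypothetical cluster: a triangle (a, b, c) with one extra node d hanging off c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

metrics = cluster_metrics(nodes, edges)
bridges = [edge for edge in edges if is_bridge(edge, edges)]

print(metrics["degree"])   # {'a': 2, 'b': 2, 'c': 3, 'd': 1}
print(metrics["density"])  # 4 edges out of 6 possible
print(bridges)             # [('c', 'd')]
```

The only bridge here is the edge from c to d: removing it would cut d off from the rest of the cluster, which is the kind of weak link the 'is bridge' metric is meant to flag.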
Full code example
This code snippet computes graph metrics for a simple Splink dedupe model. A pandas dataframe of cluster metrics is displayed as the final output.
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.historical_50k

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch(
            "first_name",
        ).configure(term_frequency_adjustments=True),
        cl.JaroWinklerAtThresholds("surname", score_threshold_or_thresholds=[0.9, 0.8]),
        cl.LevenshteinAtThresholds(
            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
        ),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("postcode_fake", "first_name"),
        block_on("first_name", "surname"),
        block_on("dob", "substr(postcode_fake,1,2)"),
        block_on("postcode_fake", "substr(dob,1,3)"),
        block_on("postcode_fake", "substr(dob,4,5)"),
    ],
    retain_intermediate_calculation_columns=True,
)

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api)

# Train the model
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob", "substr(postcode_fake, 1,3)")
)

# Predict pairwise matches, then cluster at a match probability of 0.95
pairwise_predictions = linker.inference.predict()
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

# No threshold is passed here, so it is taken from the clustering output
graph_metrics = linker.clustering.compute_graph_metrics(pairwise_predictions, clusters)

df_clusters = graph_metrics.clusters.as_pandas_dataframe()
df_clusters
| | cluster_id | n_nodes | n_edges | density | cluster_centralisation |
|---|---|---|---|---|---|
| 0 | Q5076213-1 | 10 | 31.0 | 0.688889 | 0.250000 |
| 1 | Q760788-1 | 9 | 30.0 | 0.833333 | 0.214286 |
| 2 | Q88466525-10 | 3 | 3.0 | 1.000000 | 0.000000 |
| 3 | Q88466525-1 | 10 | 37.0 | 0.822222 | 0.222222 |
| 4 | Q1386511-1 | 13 | 47.0 | 0.602564 | 0.272727 |
| ... | ... | ... | ... | ... | ... |
| 21346 | Q1562561-16 | 1 | 0.0 | NaN | NaN |
| 21347 | Q15999964-5 | 1 | 0.0 | NaN | NaN |
| 21348 | Q5363139-12 | 1 | 0.0 | NaN | NaN |
| 21349 | Q4722328-5 | 1 | 0.0 | NaN | NaN |
| 21350 | Q7528564-13 | 1 | 0.0 | NaN | NaN |
21351 rows × 5 columns
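One possible way to use this output is to shortlist larger clusters with relatively low density for manual review. The sketch below is only an illustration in plain Python: it copies a few rows from the table above rather than reading df_clusters, and the clusters_to_review helper and both cutoff values are hypothetical choices, not Splink recommendations.

```python
import math

# Rows copied from the cluster metrics output above.
# A density of nan marks single-node clusters, where density is undefined.
cluster_rows = [
    {"cluster_id": "Q5076213-1", "n_nodes": 10, "density": 0.688889},
    {"cluster_id": "Q760788-1", "n_nodes": 9, "density": 0.833333},
    {"cluster_id": "Q88466525-10", "n_nodes": 3, "density": 1.0},
    {"cluster_id": "Q88466525-1", "n_nodes": 10, "density": 0.822222},
    {"cluster_id": "Q1386511-1", "n_nodes": 13, "density": 0.602564},
    {"cluster_id": "Q1562561-16", "n_nodes": 1, "density": math.nan},
]

MIN_NODES = 5      # hypothetical cutoff: ignore very small clusters
MAX_DENSITY = 0.7  # hypothetical cutoff: flag loosely connected clusters


def clusters_to_review(rows, min_nodes, max_density):
    """Return qualifying rows, least dense first, skipping undefined densities."""
    candidates = [
        row
        for row in rows
        if not math.isnan(row["density"])
        and row["n_nodes"] >= min_nodes
        and row["density"] < max_density
    ]
    return sorted(candidates, key=lambda row: row["density"])


flagged = [
    row["cluster_id"]
    for row in clusters_to_review(cluster_rows, MIN_NODES, MAX_DENSITY)
]
print(flagged)  # ['Q1386511-1', 'Q5076213-1']
```

Suitable cutoffs depend on the data and the linkage model, so the values here should be treated as placeholders.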