How to compute graph metrics with Splink

Introduction to the `compute_graph_metrics()` method

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the compute_graph_metrics() method.

The method is called on the linker like so:

linker.compute_graph_metrics(df_predict, df_clustered, threshold_match_probability=0.95)

with arguments

Args:
    df_predict (SplinkDataFrame): The results of `linker.predict()`
    df_clustered (SplinkDataFrame): The outputs of
        `linker.cluster_pairwise_predictions_at_threshold()`
    threshold_match_probability (float): Filter the pairwise match predictions
        to include only pairwise comparisons with a match_probability at or
        above this threshold.

Warning

threshold_match_probability should equal the clustering threshold passed to cluster_pairwise_predictions_at_threshold(). If Splink has access to this value, it is passed automatically; otherwise you must provide it yourself and take care that the two threshold values align.
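
To see why the two thresholds need to agree, here is a minimal sketch in plain Python. The record IDs and probabilities are invented for illustration, and `edges_at_threshold` is a helper defined here, not part of Splink. It shows that filtering the same predictions at different thresholds gives different edge sets, so metrics computed at one threshold would not describe clusters built at another.

```python
# Hypothetical pairwise predictions: (record_l, record_r, match_probability).
predictions = [
    ("a", "b", 0.99),
    ("b", "c", 0.96),
    ("c", "d", 0.92),
    ("d", "e", 0.97),
]


def edges_at_threshold(preds, threshold):
    """Keep only pairs whose match_probability is at or above the threshold."""
    return [(left, right) for left, right, prob in preds if prob >= threshold]


clustering_threshold = 0.95

# Reusing one threshold keeps the metric edges identical to the cluster edges.
edges_for_clusters = edges_at_threshold(predictions, clustering_threshold)
edges_for_metrics = edges_at_threshold(predictions, clustering_threshold)

# A lower threshold admits the (c, d) edge, which the clustering never used.
edges_mismatched = edges_at_threshold(predictions, 0.90)

print(len(edges_for_clusters), len(edges_mismatched))
```

Holding the threshold in a single variable, as `clustering_threshold` does here, is one simple way to keep the two calls aligned.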

The method generates tables of graph metrics for nodes, edges and clusters, and returns them as a data class of Splink dataframes. If the returned object is assigned to a variable named graph_metrics, the individual Splink dataframes can be accessed as follows:

graph_metrics.nodes for node metrics
graph_metrics.edges for edge metrics
graph_metrics.clusters for cluster metrics
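
The shape of the returned object can be pictured as a small container with three attributes. The sketch below is purely illustrative: `GraphMetricsSketch` is a hypothetical stand-in rather than Splink's actual class, plain lists of dicts stand in for Splink dataframes, and the node and edge column names are assumptions. Only the cluster column names are taken from the example output later in this guide.

```python
from dataclasses import dataclass


@dataclass
class GraphMetricsSketch:
    """Hypothetical stand-in for the object returned by compute_graph_metrics()."""

    nodes: list     # one row per record (node-level metrics)
    edges: list     # one row per pairwise link (edge-level metrics)
    clusters: list  # one row per cluster (cluster-level metrics)


# A fully connected cluster of three records (column names for nodes and
# edges are assumed for illustration).
graph_metrics = GraphMetricsSketch(
    nodes=[
        {"node_id": "a", "node_degree": 2},
        {"node_id": "b", "node_degree": 2},
        {"node_id": "c", "node_degree": 2},
    ],
    edges=[
        {"node_l": "a", "node_r": "b", "is_bridge": False},
        {"node_l": "b", "node_r": "c", "is_bridge": False},
        {"node_l": "a", "node_r": "c", "is_bridge": False},
    ],
    clusters=[
        {
            "cluster_id": "Q15966633-11",
            "n_nodes": 3,
            "n_edges": 3.0,
            "density": 1.0,
            "cluster_centralisation": 0.0,
        }
    ],
)

# Each level of metrics is reached through its own attribute.
print(len(graph_metrics.nodes), len(graph_metrics.edges), len(graph_metrics.clusters))
```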

The metrics computed by compute_graph_metrics() include all those mentioned in the Graph metrics chapter, namely:

Node degree
'Is bridge'
Cluster size
Cluster density
Cluster centralisation

All of these metrics are calculated by default. If you are unable to install the igraph package required for 'is bridge', that metric won't be calculated; all other metrics will still be generated.
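
For intuition about what each metric measures, the following sketch computes them by hand for a toy cluster in plain Python. It is not Splink's implementation. It assumes the usual definitions: density as actual edges over possible edges, and degree centralisation as the summed gap to the maximum node degree divided by (N - 1)(N - 2). These definitions are consistent with the example output later in this guide, but the Graph metrics chapter remains the authority. The bridge check simply removes each edge and counts components, which is fine for small graphs only.

```python
from collections import defaultdict

# Toy cluster: a triangle (a, b, c) with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adjacency = defaultdict(set)
    for left, right in edge_list:
        adjacency[left].add(right)
        adjacency[right].add(left)
    return adjacency


def count_components(nodes, edge_list):
    """Count connected components with a depth-first search."""
    adjacency = build_adjacency(edge_list)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return components


adjacency = build_adjacency(edges)
nodes = sorted(adjacency)

# Node degree: number of edges touching each node.
node_degree = {node: len(adjacency[node]) for node in nodes}

# Cluster size: number of nodes (and edges) in the cluster.
n_nodes = len(nodes)
n_edges = len(edges)

# Cluster density: actual edges divided by possible edges.
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

# Cluster centralisation (assumed degree-centralisation formula).
max_degree = max(node_degree.values())
centralisation = sum(max_degree - d for d in node_degree.values()) / (
    (n_nodes - 1) * (n_nodes - 2)
)

# 'Is bridge': an edge whose removal splits the cluster into more components.
base_components = count_components(nodes, edges)
is_bridge = {
    edge: count_components(nodes, [e for e in edges if e != edge]) > base_components
    for edge in edges
}

print(node_degree, n_nodes, n_edges, density, centralisation, is_bridge)
```

In this toy cluster only the edge between c and d is a bridge, because removing it isolates d.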

This topic guide is a work in progress and we welcome any feedback.

Full code example

This code snippet computes graph metrics for a simple Splink dedupe model. A pandas dataframe of cluster metrics is displayed as the final output.

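import ssl

import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker

# Disable HTTPS certificate verification for default urllib requests
# (presumably a workaround for certificate errors when fetching the
# example dataset). Avoid this outside local experimentation.
ssl._create_default_https_context = ssl._create_unverified_context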

df = splink_datasets.historical_50k

settings_dict = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["postcode_fake", "first_name"]),
        block_on(["first_name", "surname"]),
        block_on(["dob", "substr(postcode_fake,1,2)"]),
        block_on(["postcode_fake", "substr(dob,1,3)"]),
        block_on(["postcode_fake", "substr(dob,4,5)"]),
    ],
    "comparisons": [
        cl.exact_match(
            "first_name",
            term_frequency_adjustments=True,
        ),
        cl.jaro_winkler_at_thresholds(
            "surname", distance_threshold_or_thresholds=[0.9, 0.8]
        ),
        cl.levenshtein_at_thresholds(
            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
        ),
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings_dict)

# Note: target_rows is deprecated in favour of max_pairs (see the warning in the output).
linker.estimate_u_using_random_sampling(target_rows=1e6)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["first_name", "surname"])
)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["dob", "substr(postcode_fake, 1,3)"])
)

df_predict = linker.predict()
df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.95)

graph_metrics = linker.compute_graph_metrics(df_predict, df_clustered)

graph_metrics.clusters.as_pandas_dataframe()

/var/folders/nd/c3xr518x3txg5kcqp1h7zwc80000gp/T/ipykernel_13654/2355919473.py:39: SplinkDeprecated: target_rows is deprecated; use max_pairs
  linker.estimate_u_using_random_sampling(target_rows=1e6)
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - postcode_fake (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")

Parameter estimates will be made for the following comparison(s):
    - postcode_fake

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name
    - surname

Iteration 1: Largest change in params was -0.352 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.108 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 3: Largest change in params was 0.019 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 4: Largest change in params was 0.00276 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 5: Largest change in params was 0.000388 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 6: Largest change in params was 5.44e-05 in the m_probability of postcode_fake, level `All other comparisons`

EM converged after 6 iterations

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."dob" = r."dob") AND (SUBSTR(l."postcode_fake", 1, 3) = SUBSTR(r."postcode_fake", 1, 3))

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - postcode_fake

Iteration 1: Largest change in params was 0.508 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.0868 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0212 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00704 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00306 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00149 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.000761 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000395 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000206 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000108 in probability_two_random_records_match
Iteration 11: Largest change in params was 5.66e-05 in probability_two_random_records_match

EM converged after 11 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values
Completed iteration 1, root rows count 316
Completed iteration 2, root rows count 63
Completed iteration 3, root rows count 12
Completed iteration 4, root rows count 0

         cluster_id  n_nodes  n_edges   density  cluster_centralisation
0       Q98761652-1        5      8.0  0.800000                0.333333
1       Q10307857-1       11     35.0  0.636364                0.200000
2       Q18910925-1       20    172.0  0.905263                0.105263
3       Q13530025-1       11     32.0  0.581818                0.266667
4      Q15966633-11        3      3.0  1.000000                0.000000
...             ...      ...      ...       ...                     ...
21530    Q5006750-7        1      0.0       NaN                     NaN
21531   Q5166888-13        1      0.0       NaN                     NaN
21532    Q5546247-8        1      0.0       NaN                     NaN
21533    Q6698372-5        1      0.0       NaN                     NaN
21534    Q7794499-6        1      0.0       NaN                     NaN

21535 rows × 5 columns
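
As a sanity check on this output, the density column can be reproduced from n_nodes and n_edges. The sketch below assumes density is the number of edges divided by the number of possible edges, N(N - 1)/2, and treats single-node clusters as undefined (NaN), which matches the rows shown. The `density` helper is defined here for illustration and is not a Splink function.

```python
import math

# Rows copied from the cluster metrics output: (cluster_id, n_nodes, n_edges, density).
rows = [
    ("Q98761652-1", 5, 8.0, 0.800000),
    ("Q10307857-1", 11, 35.0, 0.636364),
    ("Q18910925-1", 20, 172.0, 0.905263),
    ("Q13530025-1", 11, 32.0, 0.581818),
    ("Q15966633-11", 3, 3.0, 1.000000),
    ("Q5006750-7", 1, 0.0, math.nan),
]


def density(n_nodes, n_edges):
    """Edges divided by possible edges; undefined (NaN) for a single node."""
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges if possible_edges else math.nan


checks = {}
for cluster_id, n_nodes, n_edges, reported in rows:
    computed = density(n_nodes, n_edges)
    both_nan = math.isnan(computed) and math.isnan(reported)
    checks[cluster_id] = both_nan or math.isclose(computed, reported, abs_tol=1e-6)
    print(cluster_id, round(computed, 6), checks[cluster_id])
```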
