How to compute graph metrics with Splink
Introduction to the compute_graph_metrics() method
To enable users to calculate a variety of graph metrics for their linked data, Splink provides the compute_graph_metrics() method.
The method is called on the linker like so:
linker.compute_graph_metrics(df_predict, df_clustered, threshold_match_probability=0.95)
Args:
    df_predict (SplinkDataFrame): The results of `linker.predict()`
    df_clustered (SplinkDataFrame): The outputs of
        `linker.cluster_pairwise_predictions_at_threshold()`
    threshold_match_probability (float): Filter the pairwise match predictions
        to include only pairwise comparisons with a match_probability at or
        above this threshold.
Warning
threshold_match_probability should be the same as the clustering threshold passed to cluster_pairwise_predictions_at_threshold(). If this information is available to Splink, it is passed automatically; otherwise you must provide it yourself and take care that the two threshold values align.
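To see why the two thresholds need to align, here is a minimal sketch in plain Python. It does not call Splink; the `predictions` list and the `edges_at_threshold` helper are hypothetical stand-ins for the pairwise predictions and the "at or above" filter described above.

```python
# Hypothetical pairwise predictions: (record_l, record_r, match_probability).
predictions = [
    ("a", "b", 0.99),
    ("b", "c", 0.95),
    ("c", "d", 0.90),
    ("e", "f", 0.40),
]


def edges_at_threshold(pairs, threshold_match_probability):
    """Keep pairs whose match probability is at or above the threshold."""
    return [
        (left, right)
        for left, right, probability in pairs
        if probability >= threshold_match_probability
    ]


# Suppose clusters were formed at 0.95, so only a-b and b-c were used as links.
clustering_threshold = 0.95

# Same threshold: the edges match the ones the clusters were built from.
edges_aligned = edges_at_threshold(predictions, clustering_threshold)

# Lower threshold: c-d now appears as an edge, although "d" was never
# clustered with "c", so edge-based metrics would no longer describe the
# clusters that were actually produced.
edges_misaligned = edges_at_threshold(predictions, 0.90)

print(edges_aligned)
print(edges_misaligned)
```

In this toy case the misaligned threshold introduces an edge between two records that sit in different clusters, which is the kind of inconsistency the warning is about.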
The method generates tables of graph metrics for nodes, edges and clusters, and returns them as a data class of Splink dataframes. If the returned object is assigned to a variable named graph_metrics, as in the full code example below, the individual Splink dataframes can be accessed as follows:
graph_metrics.nodes for node metrics
graph_metrics.edges for edge metrics
graph_metrics.clusters for cluster metrics
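The access pattern can be pictured with a small stand-in data class. This is only a sketch: the class name `GraphMetricsSketch`, the use of plain lists instead of Splink dataframes, and the column names in the toy rows are all assumptions made for illustration, not Splink's actual return type.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class GraphMetricsSketch:
    """Hypothetical stand-in for the object returned by compute_graph_metrics()."""

    nodes: Any     # one row per record
    edges: Any     # one row per pairwise link
    clusters: Any  # one row per cluster


# Toy rows in place of Splink dataframes (column names are illustrative).
results = GraphMetricsSketch(
    nodes=[{"node_id": "a", "node_degree": 2}],
    edges=[{"node_l": "a", "node_r": "b", "is_bridge": False}],
    clusters=[{"cluster_id": "a", "n_nodes": 3, "n_edges": 3}],
)

# Each table is reached through an attribute on the returned object.
print(results.nodes)
print(results.edges)
print(results.clusters)
```

With the real return value, each attribute holds a Splink dataframe, which the full code example converts with as_pandas_dataframe().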
The metrics computed by compute_graph_metrics() include all those mentioned in the Graph metrics chapter, namely:
- Node degree
- 'Is bridge'
- Cluster size
- Cluster density
- Cluster centralisation
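As a rough illustration of what these metrics measure, the sketch below computes node degree, cluster size, density and a degree-based centralisation for one small, hypothetical cluster in plain Python. The formulas are the standard undirected-graph definitions and are an assumption about what Splink computes rather than a copy of its implementation.

```python
from collections import Counter

# Hypothetical cluster: each tuple is a link between two records.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

# Node degree: the number of edges touching each node.
node_degree = Counter()
for left, right in edges:
    node_degree[left] += 1
    node_degree[right] += 1

# Cluster size: the number of nodes (and, for reference, edges).
n_nodes = len(node_degree)
n_edges = len(edges)

# Density: observed edges divided by the n * (n - 1) / 2 possible edges.
# Undefined for a single-node cluster.
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

# Degree centralisation: how far the cluster is from every node having the
# maximum degree, scaled so that a star-shaped cluster scores 1.
# Only defined for clusters with more than two nodes.
max_degree = max(node_degree.values())
cluster_centralisation = sum(
    max_degree - degree for degree in node_degree.values()
) / ((n_nodes - 1) * (n_nodes - 2))

print(dict(node_degree))
print(n_nodes, n_edges)
print(round(density, 6), round(cluster_centralisation, 6))
```

Here node "a" is linked to every other record, while "d" hangs off a single link, which pulls density below 1 and centralisation above 0.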
All of these metrics are calculated by default. If you are unable to install the igraph package required for 'is bridge', that metric is skipped; all other metrics are still generated.
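For intuition about the 'is bridge' metric, the sketch below flags bridges in a small, hypothetical cluster using only the standard library. It uses a naive remove-and-search approach chosen for readability; it is not how igraph or Splink compute the metric, and the helper names are made up for this example.

```python
from collections import defaultdict, deque


def reachable(edges, start, goal):
    """Breadth-first search: can `goal` be reached from `start` via `edges`?"""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def find_bridges(edges):
    """An edge is a bridge if removing it disconnects its two endpoints."""
    bridges = []
    for index, (left, right) in enumerate(edges):
        remaining = edges[:index] + edges[index + 1:]
        if not reachable(remaining, left, right):
            bridges.append((left, right))
    return bridges


# Hypothetical cluster: a triangle a-b-c with "d" attached only through "c".
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
bridges = find_bridges(edges)
print(bridges)
```

The link between "c" and "d" is the only one whose removal would split the cluster, so it is the only bridge; links like this are often worth a closer look as potential false positives.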
This topic guide is a work in progress and we welcome any feedback.
Full code example
This code snippet computes graph metrics for a simple Splink dedupe model. A pandas dataframe of cluster metrics is displayed as the final output.
import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker

import ssl

# Workaround: disables HTTPS certificate verification for this session.
# Likely only needed if the dataset download fails with a certificate error.
ssl._create_default_https_context = ssl._create_unverified_context

df = splink_datasets.historical_50k

settings_dict = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["postcode_fake", "first_name"]),
        block_on(["first_name", "surname"]),
        block_on(["dob", "substr(postcode_fake,1,2)"]),
        block_on(["postcode_fake", "substr(dob,1,3)"]),
        block_on(["postcode_fake", "substr(dob,4,5)"]),
    ],
    "comparisons": [
        cl.exact_match(
            "first_name",
            term_frequency_adjustments=True,
        ),
        cl.jaro_winkler_at_thresholds(
            "surname", distance_threshold_or_thresholds=[0.9, 0.8]
        ),
        cl.levenshtein_at_thresholds(
            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
        ),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings_dict)

# Estimate u probabilities, then m probabilities in two EM sessions.
# target_rows is deprecated in favour of max_pairs (see the warning in the output).
linker.estimate_u_using_random_sampling(target_rows=1e6)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["first_name", "surname"])
)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["dob", "substr(postcode_fake, 1,3)"])
)

# Score pairs, cluster at a 0.95 threshold, then compute graph metrics.
df_predict = linker.predict()
df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.95)

graph_metrics = linker.compute_graph_metrics(df_predict, df_clustered)

# Display the cluster-level metrics as a pandas dataframe.
graph_metrics.clusters.as_pandas_dataframe()
/var/folders/nd/c3xr518x3txg5kcqp1h7zwc80000gp/T/ipykernel_13654/2355919473.py:39: SplinkDeprecated: target_rows is deprecated; use max_pairs
linker.estimate_u_using_random_sampling(target_rows=1e6)
----- Estimating u probabilities using random sampling -----
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- first_name (no m values are trained).
- surname (no m values are trained).
- postcode_fake (no m values are trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")
Parameter estimates will be made for the following comparison(s):
- postcode_fake
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- first_name
- surname
Iteration 1: Largest change in params was -0.352 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.108 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 3: Largest change in params was 0.019 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 4: Largest change in params was 0.00276 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 5: Largest change in params was 0.000388 in the m_probability of postcode_fake, level `All other comparisons`
Iteration 6: Largest change in params was 5.44e-05 in the m_probability of postcode_fake, level `All other comparisons`
EM converged after 6 iterations
Your model is not yet fully trained. Missing estimates for:
- first_name (no m values are trained).
- surname (no m values are trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
(l."dob" = r."dob") AND (SUBSTR(l."postcode_fake", 1, 3) = SUBSTR(r."postcode_fake", 1, 3))
Parameter estimates will be made for the following comparison(s):
- first_name
- surname
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- postcode_fake
Iteration 1: Largest change in params was 0.508 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.0868 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0212 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00704 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00306 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00149 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.000761 in probability_two_random_records_match
Iteration 8: Largest change in params was 0.000395 in probability_two_random_records_match
Iteration 9: Largest change in params was 0.000206 in probability_two_random_records_match
Iteration 10: Largest change in params was 0.000108 in probability_two_random_records_match
Iteration 11: Largest change in params was 5.66e-05 in probability_two_random_records_match
EM converged after 11 iterations
Your model is fully trained. All comparisons have at least one estimate for their m and u values
Completed iteration 1, root rows count 316
Completed iteration 2, root rows count 63
Completed iteration 3, root rows count 12
Completed iteration 4, root rows count 0
| | cluster_id | n_nodes | n_edges | density | cluster_centralisation |
---|---|---|---|---|---|
0 | Q98761652-1 | 5 | 8.0 | 0.800000 | 0.333333 |
1 | Q10307857-1 | 11 | 35.0 | 0.636364 | 0.200000 |
2 | Q18910925-1 | 20 | 172.0 | 0.905263 | 0.105263 |
3 | Q13530025-1 | 11 | 32.0 | 0.581818 | 0.266667 |
4 | Q15966633-11 | 3 | 3.0 | 1.000000 | 0.000000 |
... | ... | ... | ... | ... | ... |
21530 | Q5006750-7 | 1 | 0.0 | NaN | NaN |
21531 | Q5166888-13 | 1 | 0.0 | NaN | NaN |
21532 | Q5546247-8 | 1 | 0.0 | NaN | NaN |
21533 | Q6698372-5 | 1 | 0.0 | NaN | NaN |
21534 | Q7794499-6 | 1 | 0.0 | NaN | NaN |
21535 rows × 5 columns
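One way to use this output is to shortlist clusters for manual review. The sketch below works on a few rows copied from the table above, held as plain Python dictionaries; the `MIN_NODES` and `MAX_DENSITY` cut-offs and the `needs_review` helper are hypothetical choices for illustration, not recommendations from Splink.

```python
import math

# Rows copied from the cluster metrics output shown above.
cluster_metrics = [
    {"cluster_id": "Q98761652-1", "n_nodes": 5, "n_edges": 8, "density": 0.800000},
    {"cluster_id": "Q10307857-1", "n_nodes": 11, "n_edges": 35, "density": 0.636364},
    {"cluster_id": "Q18910925-1", "n_nodes": 20, "n_edges": 172, "density": 0.905263},
    {"cluster_id": "Q13530025-1", "n_nodes": 11, "n_edges": 32, "density": 0.581818},
    {"cluster_id": "Q15966633-11", "n_nodes": 3, "n_edges": 3, "density": 1.000000},
    {"cluster_id": "Q5006750-7", "n_nodes": 1, "n_edges": 0, "density": math.nan},
]

# Hypothetical review rule: large clusters that are loosely connected.
MIN_NODES = 10
MAX_DENSITY = 0.7


def needs_review(row):
    """Flag clusters with many nodes but relatively few links between them."""
    if math.isnan(row["density"]):
        # Single-record clusters have no density and nothing to review.
        return False
    return row["n_nodes"] >= MIN_NODES and row["density"] < MAX_DENSITY


flagged = [row["cluster_id"] for row in cluster_metrics if needs_review(row)]
print(flagged)
```

Under these assumed cut-offs, the two 11-node clusters are flagged, while the dense 20-node cluster and the small clusters are not. The same rule could be written as a boolean filter on the pandas dataframe returned by as_pandas_dataframe().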