Documentation for Linker object methods related to exploratory analysis

The Linker object manages the data linkage process and holds the data linkage model.

Most of Splink's functionality can be accessed by calling methods (functions) on the linker, such as linker.predict(), linker.profile_columns() etc.

The Linker class is intended for subclassing for specific backends, e.g. a DuckDBLinker.

count_num_comparisons_from_blocking_rule(blocking_rule)

Compute the number of pairwise record comparisons that would be generated by a blocking rule

Parameters:

Name Type Description Default
blocking_rule str

The blocking rule to analyse

required
link_type str

The link type. This is needed only if the linker has not yet been provided with a settings dictionary. Defaults to None.

None
unique_id_column_name str

This is needed only if the linker has not yet been provided with a settings dictionary. Defaults to None.

None

Examples:

br = "l.first_name = r.first_name"
linker.count_num_comparisons_from_blocking_rule(br)
19387

br = "l.name = r.name and substr(l.dob,1,4) = substr(r.dob,1,4)"
linker.count_num_comparisons_from_blocking_rule(br)
394

Returns:

Name Type Description
int int

The number of comparisons generated by the blocking rule
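
To make the returned number concrete, the following is a minimal sketch of the quantity being counted for a simple equality rule such as "l.first_name = r.first_name". It is not Splink's implementation: it assumes a dedupe-only link type in which each unordered pair is counted once, and the helper count_comparisons and the sample records are hypothetical.

```python
from collections import Counter


def count_comparisons(records, key):
    """Count unordered record pairs that share the same non-null key value."""
    # A null never satisfies an SQL equality, so records with a null key
    # generate no comparisons under this rule.
    block_sizes = Counter(key(r) for r in records if key(r) is not None)
    # A block containing n records yields n * (n - 1) / 2 unordered pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Hypothetical input data, for illustration only.
records = [
    {"unique_id": 1, "first_name": "john"},
    {"unique_id": 2, "first_name": "john"},
    {"unique_id": 3, "first_name": "john"},
    {"unique_id": 4, "first_name": "mary"},
    {"unique_id": 5, "first_name": "mary"},
    {"unique_id": 6, "first_name": None},
]

n_pairs = count_comparisons(records, key=lambda r: r["first_name"])
print(n_pairs)  # 3 pairs from "john" + 1 pair from "mary" = 4
```

Because the count grows quadratically with block size, a rule whose key has a few very common values can dominate the total.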

cumulative_comparisons_from_blocking_rules_records(blocking_rules=None)

Output the number of comparisons generated by each successive blocking rule.

This is equivalent to the output size of df_predict and details how many comparisons each of your individual blocking rules will contribute to the total.

Parameters:

Name Type Description Default
blocking_rules str or list

The blocking rule(s) to compute comparisons for. If None, the rules set out in your settings object will be used.

None

Examples:

linker_settings = DuckDBLinker(df, settings)
# Compute the cumulative number of comparisons generated by the rules
# in your settings object.
linker_settings.cumulative_comparisons_from_blocking_rules_records()

# Generate total comparisons with custom blocking rules.
blocking_rules = [
    "l.surname = r.surname",
    "l.first_name = r.first_name and substr(l.dob,1,4) = substr(r.dob,1,4)",
]

linker_settings.cumulative_comparisons_from_blocking_rules_records(
    blocking_rules
)

Returns:

Name Type Description
List

A list of blocking rules, each with the number of comparisons it is forecast to generate.
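
The idea of "successive" rules can be sketched as follows. This is an illustration only, assuming that a pair already produced by an earlier rule is not counted again by a later one. The function cumulative_comparisons, the rule predicates, and the output keys are hypothetical and do not reflect the structure Splink returns.

```python
from itertools import combinations


def cumulative_comparisons(records, rules):
    """For each rule in order, count the pairs not produced by earlier rules."""
    seen = set()
    rows = []
    for name, rule in rules:
        # Pairs this rule generates, minus those an earlier rule already covered.
        new_pairs = {
            (left["unique_id"], right["unique_id"])
            for left, right in combinations(records, 2)
            if rule(left, right)
        } - seen
        seen |= new_pairs
        rows.append(
            {"rule": name, "new_comparisons": len(new_pairs), "cumulative": len(seen)}
        )
    return rows


# Hypothetical input data, for illustration only.
records = [
    {"unique_id": 1, "surname": "smith", "first_name": "john"},
    {"unique_id": 2, "surname": "smith", "first_name": "john"},
    {"unique_id": 3, "surname": "smith", "first_name": "jane"},
    {"unique_id": 4, "surname": "jones", "first_name": "jane"},
]

rules = [
    ("l.surname = r.surname", lambda l_rec, r_rec: l_rec["surname"] == r_rec["surname"]),
    ("l.first_name = r.first_name", lambda l_rec, r_rec: l_rec["first_name"] == r_rec["first_name"]),
]

rows = cumulative_comparisons(records, rules)
for row in rows:
    print(row)
```

Here the second rule matches two pairs, but one of them (records 1 and 2) was already generated by the surname rule, so it only contributes one new comparison.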

cumulative_num_comparisons_from_blocking_rules_chart(blocking_rules=None)

Display a chart with the cumulative number of comparisons generated by a selection of blocking rules.

This is equivalent to the output size of df_predict and details how many comparisons each of your individual blocking rules will contribute to the total.

Parameters:

Name Type Description Default
blocking_rules str or list

The blocking rule(s) to compute comparisons for. If None, the rules set out in your settings object will be used.

None

Examples:

linker_settings = DuckDBLinker(df, settings)
# Compute the cumulative number of comparisons generated by the rules
# in your settings object.
linker_settings.cumulative_num_comparisons_from_blocking_rules_chart()

# Generate total comparisons with custom blocking rules.
blocking_rules = [
    "l.surname = r.surname",
    "l.first_name = r.first_name and substr(l.dob,1,4) = substr(r.dob,1,4)",
]

linker_settings.cumulative_num_comparisons_from_blocking_rules_chart(
    blocking_rules
)

Returns:

Name Type Description
VegaLite

A VegaLite chart object. See altair.vegalite.v4.display.VegaLite. The vegalite spec is available as a dictionary using the spec attribute.

missingness_chart(input_dataset=None)

Generate a summary chart of the missingness (prevalence of nulls) of columns in the input datasets. By default, missingness is assessed across all input datasets.

Parameters:

Name Type Description Default
input_dataset str

Name of one of the input tables in the

None

Examples:

linker.missingness_chart()

# To view offline (if you don't have an internet connection):
from splink.charts import save_offline_chart
c = linker.missingness_chart()
save_offline_chart(c.spec, "test_chart.html")

# View the resulting HTML file in Jupyter (or just load it in your browser):
from IPython.display import IFrame
IFrame(src="./test_chart.html", width=1000, height=500)
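
As a rough illustration of the statistic the chart summarises, the sketch below computes the proportion of nulls per column in plain Python. It is not how Splink computes missingness: the function missingness and the sample rows are hypothetical, and nulls are assumed to be represented as None.

```python
def missingness(records, columns):
    """Return the fraction of null (None) values for each column."""
    total = len(records)
    if total == 0:
        # No records means there is nothing to be missing.
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for r in records if r.get(col) is None) / total
        for col in columns
    }


# Hypothetical input data, for illustration only.
sample_rows = [
    {"first_name": "john", "dob": "1990-01-01", "city": "leeds"},
    {"first_name": None, "dob": None, "city": "york"},
    {"first_name": "jane", "dob": None, "city": "bath"},
    {"first_name": "mary", "dob": "1985-06-30", "city": "hull"},
]

null_rates = missingness(sample_rows, ["first_name", "dob", "city"])
print(null_rates)
```

Columns with a high null rate carry less information for linkage, which is worth knowing before choosing comparisons and blocking rules.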

profile_columns(column_expressions, top_n=10, bottom_n=10)

unlinkables_chart(x_col='match_weight', source_dataset=None, as_dict=False)

Generate an interactive chart displaying the proportion of records that are "unlinkable" for a given splink score threshold and model parameters.

Unlinkable records are those that, even when compared with themselves, do not contain enough information to confirm a match.

Parameters:

Name Type Description Default
x_col str

Column to use for the x-axis. Defaults to "match_weight".

'match_weight'
source_dataset str

Name of the source dataset to use for the title of the output chart.

None
as_dict bool

If True, return a dict version of the chart.

False

Examples:

For the simplest code pipeline, load a pre-trained model and run this against the test data.

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
linker = DuckDBLinker(df)
linker.load_settings("saved_settings.json")
linker.unlinkables_chart()

For more complex code pipelines, you can run an entire pipeline that estimates your m and u values before calling unlinkables_chart().
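
The notion of an unlinkable record can be sketched with a toy model. This is a simplified illustration, not Splink's calculation: it assumes that a record compared with itself matches exactly on every non-null column, that a null column contributes a match weight of zero, and that the per-column weights below (which are made up) come from an already trained model. The function names are hypothetical.

```python
def self_match_weight(record, weights, prior=0.0):
    """Match weight obtained when a record is compared with itself."""
    # Only non-null columns can contribute evidence; nulls add nothing.
    return prior + sum(w for col, w in weights.items() if record.get(col) is not None)


def proportion_unlinkable(records, weights, threshold, prior=0.0):
    """Fraction of records whose self-comparison falls below the threshold."""
    scores = [self_match_weight(r, weights, prior) for r in records]
    return sum(score < threshold for score in scores) / len(scores)


# Made-up exact-match weights, for illustration only.
weights = {"first_name": 4.0, "surname": 5.0, "dob": 8.0}

# Hypothetical input data, for illustration only.
people = [
    {"first_name": "john", "surname": "smith", "dob": "1990-01-01"},  # 17.0
    {"first_name": "jane", "surname": "jones", "dob": None},          # 9.0
    {"first_name": None, "surname": None, "dob": "1985-06-30"},       # 8.0
    {"first_name": "mary", "surname": None, "dob": None},             # 4.0
]

for threshold in (8.5, 10.0):
    print(threshold, proportion_unlinkable(people, weights, threshold))
```

Raising the threshold increases the share of records that cannot reach it even against a perfect copy of themselves, which is the relationship the chart plots.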

Returns:

Name Type Description
VegaLite

A VegaLite chart object. See altair.vegalite.v4.display.VegaLite. The vegalite spec is available as a dictionary using the spec attribute.