Documentation for Linker object methods related to exploratory analysis
The Linker object manages the data linkage process and holds the data linkage model.
Most of Splink's functionality can be accessed by calling methods (functions) on the linker, such as linker.predict(), linker.profile_columns() etc.
The Linker class is intended for subclassing for specific backends, e.g. a DuckDBLinker.
count_num_comparisons_from_blocking_rule(blocking_rule)
Compute the number of pairwise record comparisons that would be generated by a blocking rule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blocking_rule | str | The blocking rule to analyse | required |
link_type | str | The link type. This is needed only if the linker has not yet been provided with a settings dictionary. Defaults to None. | None |
unique_id_column_name | str | This is needed only if the linker has not yet been provided with a settings dictionary. Defaults to None. | None |
Examples:
br = "l.first_name = r.first_name"
linker.count_num_comparisons_from_blocking_rule(br)
19387
br = "l.name = r.name and substr(l.dob,1,4) = substr(r.dob,1,4)"
linker.count_num_comparisons_from_blocking_rule(br)
394
Returns:
Name | Type | Description |
---|---|---|
int | int | The number of comparisons generated by the blocking rule |
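To make the returned number concrete, here is a minimal pure-Python sketch of what a comparison count means for a simple equality rule such as l.first_name = r.first_name. It is not Splink's implementation: it assumes a single table being deduplicated (each unordered pair counted once), and the function name and sample records are hypothetical.

```python
from collections import Counter


def count_equality_block_comparisons(records, key):
    """Count within-block record pairs for a rule like l.key = r.key.

    Assumption: one input table being deduplicated, so each unordered
    pair is counted once. Records whose key is None are skipped, since
    NULL never satisfies an SQL equality condition.
    """
    block_sizes = Counter(r[key] for r in records if r.get(key) is not None)
    # A block of n records yields n * (n - 1) / 2 distinct pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Hypothetical sample data, for illustration only.
people = [
    {"unique_id": 1, "first_name": "amy"},
    {"unique_id": 2, "first_name": "amy"},
    {"unique_id": 3, "first_name": "amy"},
    {"unique_id": 4, "first_name": "bob"},
    {"unique_id": 5, "first_name": "bob"},
    {"unique_id": 6, "first_name": None},
]

print(count_equality_block_comparisons(people, "first_name"))
```

Because pair counts grow quadratically with block size, one very common value can dominate the total, which is why checking a rule before running it is worthwhile.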
cumulative_comparisons_from_blocking_rules_records(blocking_rules=None)
Output the number of comparisons generated by each successive blocking rule.
The total is equivalent to the output size of df_predict; the output details how many comparisons each individual blocking rule contributes to that total.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blocking_rules | str or list | The blocking rule(s) to compute comparisons for. If null, the rules set out in your settings object will be used. | None |
Examples:
linker_settings = DuckDBLinker(df, settings)
# Compute the cumulative number of comparisons generated by the rules
# in your settings object.
linker_settings.cumulative_comparisons_from_blocking_rules_records()
# Generate total comparisons with custom blocking rules.
blocking_rules = [
    "l.surname = r.surname",
    # Adjacent string literals are concatenated into a single rule.
    "l.first_name = r.first_name "
    "and substr(l.dob,1,4) = substr(r.dob,1,4)",
]
linker_settings.cumulative_comparisons_from_blocking_rules_records(
    blocking_rules
)
Returns:
Name | Type | Description |
---|---|---|
List | | A list of blocking rules and the corresponding number of comparisons each is forecast to generate. |
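The idea of successive rules each contributing to a total can be sketched in plain Python. This is only an illustration, not Splink's implementation: it assumes that a pair already generated by an earlier rule is not counted again for a later one, and the function name, the predicate-based rule format, and the output keys (rule, row_count, cumulative_rows) are all hypothetical.

```python
from itertools import combinations


def cumulative_comparisons(records, rules):
    """Return one row per rule: the new pairs it adds and the running total.

    `rules` is a list of (label, predicate) tuples; each predicate takes
    two records and returns True when the pair satisfies the rule.
    """
    seen = set()
    rows = []
    for label, predicate in rules:
        matched = {
            (left["unique_id"], right["unique_id"])
            for left, right in combinations(records, 2)
            if predicate(left, right)
        }
        # Only pairs not produced by an earlier rule count towards this one.
        new_pairs = matched - seen
        seen |= new_pairs
        rows.append(
            {"rule": label, "row_count": len(new_pairs), "cumulative_rows": len(seen)}
        )
    return rows


# Hypothetical sample data, for illustration only.
people = [
    {"unique_id": 1, "first_name": "amy", "surname": "lee"},
    {"unique_id": 2, "first_name": "amy", "surname": "lee"},
    {"unique_id": 3, "first_name": "amy", "surname": "kim"},
    {"unique_id": 4, "first_name": "bob", "surname": "kim"},
]
rules = [
    ("l.surname = r.surname", lambda l, r: l["surname"] == r["surname"]),
    ("l.first_name = r.first_name", lambda l, r: l["first_name"] == r["first_name"]),
]

for row in cumulative_comparisons(people, rules):
    print(row)
```

Under this assumption the order of the rules changes the per-rule counts but not the final total.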
cumulative_num_comparisons_from_blocking_rules_chart(blocking_rules=None)
Display a chart with the cumulative number of comparisons generated by a selection of blocking rules.
The total is equivalent to the output size of df_predict; the chart details how many comparisons each individual blocking rule contributes to that total.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blocking_rules | str or list | The blocking rule(s) to compute comparisons for. If null, the rules set out in your settings object will be used. | None |
Examples:
linker_settings = DuckDBLinker(df, settings)
# Compute the cumulative number of comparisons generated by the rules
# in your settings object.
linker_settings.cumulative_num_comparisons_from_blocking_rules_chart()
# Generate total comparisons with custom blocking rules.
blocking_rules = [
    "l.surname = r.surname",
    # Adjacent string literals are concatenated into a single rule.
    "l.first_name = r.first_name "
    "and substr(l.dob,1,4) = substr(r.dob,1,4)",
]
linker_settings.cumulative_num_comparisons_from_blocking_rules_chart(
    blocking_rules
)
Returns:
Name | Type | Description |
---|---|---|
VegaLite | | A VegaLite chart object. See altair.vegalite.v4.display.VegaLite. The vegalite spec is available as a dictionary using the |
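Since a Vega-Lite spec is ultimately a JSON-serialisable dictionary, the following sketch shows how per-rule counts could be turned into a cumulative bar chart spec by hand. It is not the spec Splink produces: the function name, field names, and the choice of the Vega-Lite v5 schema URL are assumptions made for illustration.

```python
import json
from itertools import accumulate


def cumulative_chart_spec(rule_counts):
    """Build a minimal Vega-Lite bar chart spec as a plain dict.

    `rule_counts` is a list of (rule_label, new_comparisons) tuples,
    in the order the rules are applied.
    """
    totals = accumulate(count for _, count in rule_counts)
    values = [
        {"rule": label, "cumulative_rows": total}
        for (label, _), total in zip(rule_counts, totals)
    ]
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": values},
        "mark": "bar",
        "encoding": {
            # sort=None keeps the rules in the order they were supplied.
            "y": {"field": "rule", "type": "nominal", "sort": None},
            "x": {"field": "cumulative_rows", "type": "quantitative"},
        },
    }


# Hypothetical counts, for illustration only.
spec = cumulative_chart_spec(
    [("l.surname = r.surname", 2), ("l.first_name = r.first_name", 2)]
)
print(json.dumps(spec, indent=2))
```

A dictionary like this can be written to a file and rendered by any Vega-Lite viewer.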
missingness_chart(input_dataset=None)
Generate a summary chart of the missingness (prevalence of nulls) of columns in the input datasets. By default, missingness is assessed across all input datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_dataset | str | Name of one of the input tables in the | None |
Examples:
linker.missingness_chart()
from splink.charts import save_offline_chart
c = linker.missingness_chart()
save_offline_chart(c.spec, "test_chart.html")
from IPython.display import IFrame
IFrame(src="./test_chart.html", width=1000, height=500)
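The quantity being charted, the prevalence of nulls per column, can be sketched in a few lines of plain Python. This is an illustration of the idea rather than Splink's implementation; the function name and sample records are hypothetical, and a missing key is treated the same as None.

```python
def missingness(records):
    """Return {column: proportion of records where the value is None}."""
    total = len(records)
    if total == 0:
        return {}
    columns = sorted({col for rec in records for col in rec})
    return {
        # rec.get(col) is None covers both explicit None and absent keys.
        col: sum(rec.get(col) is None for rec in records) / total
        for col in columns
    }


# Hypothetical sample data, for illustration only.
people = [
    {"first_name": "amy", "dob": "1990-01-01", "email": None},
    {"first_name": "bob", "dob": None, "email": "bob@example.com"},
    {"first_name": "cat", "dob": "1985-06-30", "email": None},
    {"first_name": "dan", "dob": "1970-12-12", "email": "dan@example.com"},
]

for column, share in missingness(people).items():
    print(f"{column}: {share:.0%} null")
```

Columns with a high share of nulls tend to contribute little to matching, so they are worth spotting early.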
profile_columns(column_expressions, top_n=10, bottom_n=10)
unlinkables_chart(x_col='match_weight', source_dataset=None, as_dict=False)
Generate an interactive chart displaying the proportion of records that are "unlinkable" for a given Splink score threshold and set of model parameters.
Unlinkable records are those that, even when compared with themselves, do not contain enough information to confirm a match.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | Column to use for the x-axis. Defaults to "match_weight". | 'match_weight' |
source_dataset | str | Name of the source dataset to use for the title of the output chart. | None |
as_dict | bool | If True, return a dict version of the chart. | False |
Examples:
For the simplest code pipeline, load a pre-trained model and run this against the test data.
import pandas as pd
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
linker = DuckDBLinker(df)
linker.load_settings("saved_settings.json")
linker.unlinkables_chart()
Returns:
Name | Type | Description |
---|---|---|
VegaLite | | A VegaLite chart object. See altair.vegalite.v4.display.VegaLite. The vegalite spec is available as a dictionary using the |
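The definition of an unlinkable record given above can be illustrated with a small sketch. It assumes each record has already been scored against itself to give a match weight; the weights, thresholds, and function name here are hypothetical and this is not how Splink computes the chart internally.

```python
def unlinkable_proportion(self_match_weights, threshold):
    """Share of records whose self-comparison scores below `threshold`.

    A record that cannot reach the threshold even when compared with
    itself could never be linked to anything at that threshold.
    """
    if not self_match_weights:
        return 0.0
    below = sum(weight < threshold for weight in self_match_weights)
    return below / len(self_match_weights)


# Hypothetical self-match weights, for illustration only.
weights = [2.5, 8.0, 12.0, 15.5]

for threshold in (0, 5, 10, 20):
    share = unlinkable_proportion(weights, threshold)
    print(f"threshold {threshold}: {share:.0%} unlinkable")
```

Sweeping the threshold in this way gives the curve that such a chart would plot: the stricter the threshold, the larger the share of records that cannot be linked.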