# Methods in Linker.evaluation

Evaluate the performance of a Splink model. Accessed via `linker.evaluation`.
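For orientation, here is a minimal sketch of how the namespace is reached (it assumes a DuckDB backend and a settings definition, here called `settings`, created elsewhere):

```py
import pandas as pd
from splink import Linker, DuckDBAPI

df = pd.read_csv("my_data.csv")  # hypothetical input file
linker = Linker(df, settings, db_api=DuckDBAPI())  # `settings` assumed defined elsewhere

# After training, all evaluation methods hang off the `evaluation` namespace
linker.evaluation.unlinkables_chart()
```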
### `prediction_errors_from_labels_table(labels_splinkdataframe_or_table_name, include_false_positives=True, include_false_negatives=True, threshold_match_probability=0.5)`
Find false positives and false negatives based on a comparison between the `clerical_match_score` in the labels table and the Splink-predicted match probability.

The table of labels should be in the following format, and should be registered as a table with your database using `labels_table = linker.table_management.register_labels_table(my_df)`:

| source_dataset_l | unique_id_l | source_dataset_r | unique_id_r | clerical_match_score |
|---|---|---|---|---|
| df_1 | 1 | df_2 | 2 | 0.99 |
| df_1 | 1 | df_2 | 3 | 0.2 |
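For instance, a labels DataFrame matching this format might be built and registered as follows (a sketch; the values are the illustrative ones from the table above):

```py
import pandas as pd

# Illustrative labels: pairs of records with a clerical (human-assigned) match score
my_df = pd.DataFrame(
    {
        "source_dataset_l": ["df_1", "df_1"],
        "unique_id_l": [1, 1],
        "source_dataset_r": ["df_2", "df_2"],
        "unique_id_r": [2, 3],
        "clerical_match_score": [0.99, 0.2],
    }
)

labels_table = linker.table_management.register_labels_table(my_df)
```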
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `labels_splinkdataframe_or_table_name` | `str \| SplinkDataFrame` | Name of table containing labels in the database, or the `SplinkDataFrame` itself | *required* |
| `include_false_positives` | `bool` | Whether to include false positives in the output. Defaults to `True`. | `True` |
| `include_false_negatives` | `bool` | Whether to include false negatives in the output. Defaults to `True`. | `True` |
| `threshold_match_probability` | `float` | Threshold probability above which a prediction is considered to be a match. Defaults to `0.5`. | `0.5` |
Examples:

```py
labels_table = linker.table_management.register_labels_table(df_labels)

linker.evaluation.prediction_errors_from_labels_table(
    labels_table, include_false_negatives=True, include_false_positives=False
).as_pandas_dataframe()
```
Returns:

| Name | Type | Description |
|---|---|---|
| `SplinkDataFrame` | `SplinkDataFrame` | Table containing false positives and false negatives |
### `accuracy_analysis_from_labels_column(labels_column_name, *, threshold_match_probability=0.5, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=[], positives_not_captured_by_blocking_rules_scored_as_zero=True)`

Generate an accuracy chart or table from ground truth data, where the ground truth is in a column of the input dataset called `labels_column_name`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `labels_column_name` | `str` | Column name containing labels in the input table | *required* |
| `threshold_match_probability` | `float` | Threshold probability above which a Splink prediction is treated as a match when computing the metrics. Defaults to `0.5`. | `0.5` |
| `match_weight_round_to_nearest` | `float` | When provided, thresholds are rounded to the nearest given value. When large numbers of labels are provided, this is sometimes necessary to reduce the size of the ROC table, and therefore the number of points plotted on the chart. Defaults to `0.1`. | `0.1` |
| `add_metrics` | `list(str)` | Precision and recall metrics are always included. Where provided, specifies additional metrics to include, for example `["f1"]`. | `[]` |
| `positives_not_captured_by_blocking_rules_scored_as_zero` | `bool` | If `True`, true matches not captured by the blocking rules are scored as zero match probability, so they count as false negatives. Defaults to `True`. | `True` |
Examples:

```py
linker.evaluation.accuracy_analysis_from_labels_column("ground_truth", add_metrics=["f1"])
```
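By default the method returns a chart. As a sketch of retrieving the underlying metrics instead (this assumes `output_type="table"` is an accepted option; `output_type` defaults to `'threshold_selection'`):

```py
# Request the metrics as a table rather than a chart
results = linker.evaluation.accuracy_analysis_from_labels_column(
    "ground_truth",
    output_type="table",  # assumption: "table" is a supported output_type
    add_metrics=["f1"],
)
results.as_pandas_dataframe()  # SplinkDataFrame -> pandas
```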
Returns:

| Name | Type | Description |
|---|---|---|
| `chart` | `Union[ChartReturnType, SplinkDataFrame]` | An altair chart, or a table of accuracy metrics, depending on `output_type` |
### `accuracy_analysis_from_labels_table(labels_splinkdataframe_or_table_name, *, threshold_match_probability=0.5, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=[])`

Generate an accuracy chart or table from labelled (ground truth) data.

The table of labels should be in the following format, and should be registered as a table with your database using `labels_table = linker.table_management.register_labels_table(my_df)`:

| source_dataset_l | unique_id_l | source_dataset_r | unique_id_r | clerical_match_score |
|---|---|---|---|---|
| df_1 | 1 | df_2 | 2 | 0.99 |
| df_1 | 1 | df_2 | 3 | 0.2 |

Note that `source_dataset` and `unique_id` should correspond to the values specified in the settings dict, and the `input_table_aliases` passed to the `linker` object. For `dedupe_only` links, the `source_dataset` columns can be omitted.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `labels_splinkdataframe_or_table_name` | `str \| SplinkDataFrame` | Name of table containing labels in the database, or the `SplinkDataFrame` itself | *required* |
| `threshold_match_probability` | `float` | Where the `clerical_match_score` is a probability rather than binary, scores above this threshold are treated as positive labels. Defaults to `0.5`. | `0.5` |
| `match_weight_round_to_nearest` | `float` | When provided, thresholds are rounded to the nearest given value. When large numbers of labels are provided, this is sometimes necessary to reduce the size of the ROC table, and therefore the number of points plotted on the chart. Defaults to `0.1`. | `0.1` |
| `add_metrics` | `list(str)` | Precision and recall metrics are always included. Where provided, specifies additional metrics to include, for example `["f1"]`. | `[]` |
Returns:

| Type | Description |
|---|---|
| `Union[ChartReturnType, SplinkDataFrame]` | An altair chart, or a table of accuracy metrics, depending on `output_type` |

Examples:

```py
linker.evaluation.accuracy_analysis_from_labels_table("ground_truth", add_metrics=["f1"])
```
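Putting the pieces together, a typical workflow registers the labels and then scores the model against them (a sketch, reusing the hypothetical `my_df` labels DataFrame from above):

```py
# Register the clerical labels, then evaluate model accuracy against them
labels_table = linker.table_management.register_labels_table(my_df)

chart = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table,
    add_metrics=["f1"],
)
chart  # display the altair chart in a notebook
```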
### `prediction_errors_from_labels_column(label_colname, include_false_positives=True, include_false_negatives=True, threshold_match_probability=0.5)`

Generate a dataframe containing false positives and false negatives, based on a comparison between the Splink match probability and the labels column. A labels column is a column in the input dataset that contains the 'ground truth' cluster to which the record belongs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `label_colname` | `str` | Name of labels column in input data | *required* |
| `include_false_positives` | `bool` | Whether to include false positives in the output. Defaults to `True`. | `True` |
| `include_false_negatives` | `bool` | Whether to include false negatives in the output. Defaults to `True`. | `True` |
| `threshold_match_probability` | `float` | Threshold above which a score is considered to be a match. Defaults to `0.5`. | `0.5` |
Returns:

| Name | Type | Description |
|---|---|---|
| `SplinkDataFrame` | `SplinkDataFrame` | Table containing false positives and false negatives |

Examples:

```py
linker.evaluation.prediction_errors_from_labels_column(
    "ground_truth_cluster",
    include_false_negatives=True,
    include_false_positives=False,
).as_pandas_dataframe()
```
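For context, a labels column is simply an extra column on the input records. A sketch of input data with a hypothetical `ground_truth_cluster` column:

```py
import pandas as pd

# Records sharing a ground_truth_cluster value are known true matches
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "first_name": ["Amy", "Amy", "Ben"],
        "ground_truth_cluster": ["c_001", "c_001", "c_002"],
    }
)
```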
### `unlinkables_chart(x_col='match_weight', name_of_data_in_title=None, as_dict=False)`

Generate an interactive chart displaying the proportion of records that are "unlinkable" for a given Splink score threshold and model parameters.

Unlinkable records are those that, even when compared with themselves, do not contain enough information to confirm a match.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x_col` | `str` | Column to use for the x-axis. Defaults to `"match_weight"`. | `'match_weight'` |
| `name_of_data_in_title` | `str` | Name of the source dataset to use for the title of the output chart. | `None` |
| `as_dict` | `bool` | If `True`, return a dict version of the chart. | `False` |
Returns:

| Type | Description |
|---|---|
| `ChartReturnType` | An altair chart |

Examples:

After estimating the parameters of the model, run:

```py
linker.evaluation.unlinkables_chart()
```
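Since the return value is an altair chart, it can be saved to a standalone file for sharing (a sketch, assuming the standard altair `save` method is available on the returned object):

```py
# Save the interactive chart to a standalone HTML file
chart = linker.evaluation.unlinkables_chart()
chart.save("unlinkables_chart.html")
```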
### `labelling_tool_for_specific_record(unique_id, source_dataset=None, out_path='labelling_tool.html', overwrite=False, match_weight_threshold=-4, view_in_jupyter=False, show_splink_predictions_in_interface=True)`

Create a standalone, offline labelling dashboard for a specific record, as identified by its unique id.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `unique_id` | `str` | The unique id of the record for which to create the labelling tool | *required* |
| `source_dataset` | `str` | If there are multiple datasets, you must also specify the `source_dataset` to identify the record. Defaults to `None`. | `None` |
| `out_path` | `str` | The output path for the labelling tool. Defaults to `"labelling_tool.html"`. | `'labelling_tool.html'` |
| `overwrite` | `bool` | If `True`, overwrite files at the output path if they exist. Defaults to `False`. | `False` |
| `match_weight_threshold` | `int` | Include possible matches in the output which score above this threshold. Defaults to `-4`. | `-4` |
| `view_in_jupyter` | `bool` | If you're viewing in the Jupyter html viewer, set this to `True` to extract your labels. Defaults to `False`. | `False` |
| `show_splink_predictions_in_interface` | `bool` | Whether to show information about the Splink model's predictions that could potentially bias the decision of the clerical labeller. Defaults to `True`. | `True` |
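As an illustrative sketch (the unique id and dataset name here are hypothetical):

```py
# Build an offline labelling dashboard for one record and write it to HTML
linker.evaluation.labelling_tool_for_specific_record(
    unique_id="Q123",        # hypothetical record id
    source_dataset="df_1",   # needed when linking multiple datasets
    out_path="labelling_tool.html",
    overwrite=True,
)
```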