Documentation for splink.blocking_analysis

count_comparisons_from_blocking_rule(*, table_or_tables, blocking_rule, link_type, db_api, unique_id_column_name='unique_id', source_dataset_column_name=None, compute_post_filter_count=True, max_rows_limit=int(1000000000.0))

Analyse a blocking rule to understand the number of comparisons it will generate.

Read more about the definition of pre and post filter conditions here

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table_or_tables | (dataframe, str) | Input data | required |
| blocking_rule | Union[BlockingRuleCreator, str, Dict[str, Any]] | The blocking rule to analyse | required |
| link_type | user_input_link_type_options | The link type - "link_only", "dedupe_only" or "link_and_dedupe" | required |
| db_api | DatabaseAPISubClass | Database API | required |
| unique_id_column_name | str | Defaults to "unique_id". | 'unique_id' |
| source_dataset_column_name | Optional[str] | Defaults to None. | None |
| compute_post_filter_count | bool | Whether to use a slower methodology to calculate how many comparisons will be generated post filter conditions. Defaults to True. | True |
| max_rows_limit | int | Calculation of post filter counts will only proceed if the fast method returns a value below this limit. Defaults to int(1e9). | int(1000000000.0) |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Union[int, str]] | A dictionary containing the results |
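To build intuition for why the choice of blocking rule matters, the following plain-Python sketch counts the candidate pairs that an exact-match blocking key would produce when deduplicating a single table. It is an illustration only and does not use Splink: the helper `count_dedupe_pairs` and the sample records are hypothetical, and the figures Splink reports for pre and post filter conditions follow its own definitions and may differ from this simple unordered-pair count.

```python
from collections import Counter


def count_dedupe_pairs(records, key_columns):
    """Count unordered record pairs that share the same blocking key.

    Hypothetical helper for illustration; not part of Splink.
    Assumption: records with a null in any key column are skipped,
    mirroring SQL equality, where NULL never matches NULL.
    """
    block_sizes = Counter(
        tuple(record[col] for col in key_columns)
        for record in records
        if all(record[col] is not None for col in key_columns)
    )
    # A block of n records yields n * (n - 1) / 2 unordered pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Made-up sample data.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "John", "surname": "Smith"},
    {"unique_id": 4, "first_name": "Mary", "surname": "Jones"},
    {"unique_id": 5, "first_name": "Mary", "surname": "Smith"},
    {"unique_id": 6, "first_name": None, "surname": "Patel"},
]

loose = count_dedupe_pairs(records, ["surname"])
tight = count_dedupe_pairs(records, ["first_name", "surname"])
print(f"surname only: {loose} pairs; first_name + surname: {tight} pairs")
```

In this toy data the looser rule generates twice as many pairs as the tighter one, which is the kind of trade-off the function above is meant to expose on real data.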

cumulative_comparisons_to_be_scored_from_blocking_rules_chart(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)

cumulative_comparisons_to_be_scored_from_blocking_rules_data(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)

n_largest_blocks(*, table_or_tables, blocking_rule, link_type, db_api, n_largest=5)

Find the values responsible for creating the largest blocks of records.

For example, when blocking on first name and surname, the 'John Smith' block might be the largest block of records. In cases where values are highly skewed, a few values may be responsible for generating a large proportion of all comparisons. This function helps you find the culprit values.

The analysis is performed pre filter conditions; read more about what this means here.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table_or_tables | (dataframe, str) | Input data | required |
| blocking_rule | Union[BlockingRuleCreator, str, Dict[str, Any]] | The blocking rule to analyse | required |
| link_type | user_input_link_type_options | The link type - "link_only", "dedupe_only" or "link_and_dedupe" | required |
| db_api | DatabaseAPISubClass | Database API | required |
| n_largest | int | How many rows to return. Defaults to 5. | 5 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| SplinkDataFrame | 'SplinkDataFrame' | A dataframe containing the n_largest blocks |
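The idea of a "largest block" can be sketched in plain Python by grouping records on the blocking key and ranking the groups by size. This is a conceptual illustration under stated assumptions, not Splink's implementation: the helper `largest_blocks` and the sample records are hypothetical, and the real function returns a SplinkDataFrame rather than a list of tuples.

```python
from collections import Counter


def largest_blocks(records, key_columns, n_largest=5):
    """Return the n_largest blocking-key values with their record counts.

    Hypothetical helper for illustration; not part of Splink.
    Assumption: records with a null in any key column are skipped,
    since SQL equality never matches NULL to NULL.
    """
    block_sizes = Counter(
        tuple(record[col] for col in key_columns)
        for record in records
        if all(record[col] is not None for col in key_columns)
    )
    # most_common sorts blocks from largest to smallest.
    return block_sizes.most_common(n_largest)


# Made-up sample data with a skewed surname distribution.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "John", "surname": "Smith"},
    {"unique_id": 4, "first_name": "Mary", "surname": "Jones"},
    {"unique_id": 5, "first_name": "Mary", "surname": "Smith"},
    {"unique_id": 6, "first_name": None, "surname": "Patel"},
]

for key, size in largest_blocks(records, ["first_name", "surname"], n_largest=2):
    print(key, size)
```

Here the ('John', 'Smith') block dominates, so it is the value that would contribute most of the comparisons under this blocking key.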