Documentation for `splink.blocking_analysis`

`count_comparisons_from_blocking_rule(*, table_or_tables, blocking_rule, link_type, db_api, unique_id_column_name='unique_id', source_dataset_column_name=None, compute_post_filter_count=True, max_rows_limit=int(1000000000.0))`

Analyse a blocking rule to understand the number of comparisons it will generate.

Read more about the definition of pre- and post-filter conditions in the Splink documentation.
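To make the idea of counting comparisons concrete, here is a minimal pure-Python sketch for a dedupe setting. It does not call Splink and is not Splink's implementation. It assumes that "pre filter" means pairs of records sharing the same blocking key, and that "post filter" means the subset of those pairs that also pass an additional, non-equality condition. The toy records and the helper names `pre_filter_count` and `post_filter_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records, used only for illustration.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "Jon", "surname": "Smith"},
    {"unique_id": 4, "first_name": "Mary", "surname": "Jones"},
]


def pre_filter_count(rows, key):
    """Count pairs that share a blocking key: n * (n - 1) / 2 per block."""
    block_sizes = Counter(row[key] for row in rows)
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


def post_filter_count(rows, key, extra_condition):
    """Count pairs that share the key and also pass an extra condition.

    This enumerates every pair, so it is the slow path by construction.
    """
    return sum(
        1
        for left, right in combinations(rows, 2)
        if left[key] == right[key] and extra_condition(left, right)
    )


pre = pre_filter_count(records, "surname")
post = post_filter_count(
    records,
    "surname",
    lambda a, b: a["first_name"] != b["first_name"],
)
print(pre, post)
```

The pre-filter count only needs block sizes, which is why it can be computed cheaply, while the post-filter count has to evaluate the extra condition on each candidate pair.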

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `table_or_tables` | `(dataframe, str)` | Input data | required |
| `blocking_rule` | `Union[BlockingRuleCreator, str, Dict[str, Any]]` | The blocking rule to analyse | required |
| `link_type` | `user_input_link_type_options` | The link type: "link_only", "dedupe_only" or "link_and_dedupe" | required |
| `db_api` | `DatabaseAPISubClass` | Database API | required |
| `unique_id_column_name` | `str` | Name of the unique ID column. Defaults to "unique_id". | `'unique_id'` |
| `source_dataset_column_name` | `Optional[str]` | Name of the source dataset column. Defaults to None. | `None` |
| `compute_post_filter_count` | `bool` | Whether to use a slower methodology to calculate how many comparisons will be generated after filter conditions are applied. Defaults to True. | `True` |
| `max_rows_limit` | `int` | The post-filter count is only calculated if the fast method returns a value below this limit. Defaults to int(1e9). | `int(1000000000.0)` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Union[int, str]]` | A dictionary containing the results |
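The interaction between `compute_post_filter_count` and `max_rows_limit` described in the parameter table can be sketched as follows. This is an illustration of the documented behaviour only, not Splink's code: the function name `count_comparisons_sketch`, the dictionary keys, and the message strings are all hypothetical. The mix of integer and string values is one plausible reason for a `Union[int, str]` return type.

```python
from typing import Callable, Dict, Union


def count_comparisons_sketch(
    fast_count: int,
    slow_count_fn: Callable[[], int],
    compute_post_filter_count: bool = True,
    max_rows_limit: int = int(1e9),
) -> Dict[str, Union[int, str]]:
    """Run the slow count only when requested and when the fast count is below the limit."""
    result: Dict[str, Union[int, str]] = {"pre_filter_count": fast_count}

    if not compute_post_filter_count:
        # Hypothetical message: the caller opted out of the slow path.
        result["post_filter_count"] = "not computed: compute_post_filter_count is False"
    elif fast_count >= max_rows_limit:
        # Hypothetical message: the fast estimate is too large to proceed.
        result["post_filter_count"] = "not computed: fast count is not below max_rows_limit"
    else:
        result["post_filter_count"] = slow_count_fn()

    return result


small = count_comparisons_sketch(10, lambda: 7)
too_big = count_comparisons_sketch(2 * 10**9, lambda: 7)
skipped = count_comparisons_sketch(10, lambda: 7, compute_post_filter_count=False)
print(small, too_big, skipped, sep="\n")
```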

`cumulative_comparisons_to_be_scored_from_blocking_rules_chart(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)`

`cumulative_comparisons_to_be_scored_from_blocking_rules_data(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)`
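This page gives no description for the two `cumulative_comparisons_to_be_scored_from_blocking_rules_*` functions. Going only by their names and the `blocking_rules` parameter, one plausible reading is that they report, for an ordered list of blocking rules, how many new comparisons each rule adds on top of the rules before it. The pure-Python sketch below illustrates that reading for a dedupe setting. It is an assumption, not Splink's implementation, and the records, helper names, and output keys are hypothetical.

```python
from itertools import combinations

# Hypothetical toy records, used only for illustration.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "Jon", "surname": "Smith"},
    {"unique_id": 4, "first_name": "John", "surname": "Jones"},
]


def pairs_for_rule(rows, key):
    """Return the set of record-id pairs that agree on the blocking key."""
    return {
        (left["unique_id"], right["unique_id"])
        for left, right in combinations(rows, 2)
        if left[key] == right[key]
    }


def cumulative_comparisons(rows, keys):
    """For each rule in order, count pairs not already produced by earlier rules."""
    seen = set()
    summary = []
    for key in keys:
        new_pairs = pairs_for_rule(rows, key) - seen
        seen |= new_pairs
        summary.append(
            {
                "rule": key,
                "new_comparisons": len(new_pairs),
                "cumulative_comparisons": len(seen),
            }
        )
    return summary


summary = cumulative_comparisons(records, ["surname", "first_name"])
for row in summary:
    print(row)
```

Under this reading, the order of the rules matters for the per-rule counts but not for the final cumulative total.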

`n_largest_blocks(*, table_or_tables, blocking_rule, link_type, db_api, n_largest=5)`

Find the values responsible for creating the largest blocks of records.

For example, when blocking on first name and surname, the 'John Smith' block might be the largest block of records. When values are highly skewed, a few values may be responsible for generating a large proportion of all comparisons. This function helps you find the culprit values.

The analysis is performed before filter conditions are applied. Read more about what this means in the Splink documentation.
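The idea of finding the largest blocks can be sketched in plain Python by grouping records on the blocking columns and ranking the groups by size. This is only an illustration for a dedupe setting, not Splink's implementation or its output format. The records, the helper name `n_largest_blocks_sketch`, and the output keys are hypothetical.

```python
from collections import Counter

# Hypothetical toy records with a skewed name distribution.
records = [
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "Mary", "surname": "Jones"},
    {"first_name": "Mary", "surname": "Jones"},
    {"first_name": "Ann", "surname": "Lee"},
]


def n_largest_blocks_sketch(rows, keys, n_largest=5):
    """Rank blocking-key values by block size, largest first."""
    block_sizes = Counter(tuple(row[k] for k in keys) for row in rows)
    return [
        {
            "block": block,
            "records": size,
            # Pairs within a block of n records in a dedupe setting.
            "comparisons": size * (size - 1) // 2,
        }
        for block, size in block_sizes.most_common(n_largest)
    ]


largest = n_largest_blocks_sketch(records, ["first_name", "surname"], n_largest=2)
for row in largest:
    print(row)
```

Because the number of pairs grows quadratically with block size, a single over-represented value can dominate the total number of comparisons.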

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `table_or_tables` | `(dataframe, str)` | Input data | required |
| `blocking_rule` | `Union[BlockingRuleCreator, str, Dict[str, Any]]` | The blocking rule to analyse | required |
| `link_type` | `user_input_link_type_options` | The link type: "link_only", "dedupe_only" or "link_and_dedupe" | required |
| `db_api` | `DatabaseAPISubClass` | Database API | required |
| `n_largest` | `int` | How many rows to return. Defaults to 5. | `5` |

Returns:

| Type | Description |
| --- | --- |
| `SplinkDataFrame` | A dataframe containing the `n_largest` blocks |
