Documentation for splink.blocking_analysis
count_comparisons_from_blocking_rule(*, table_or_tables, blocking_rule, link_type, db_api, unique_id_column_name='unique_id', source_dataset_column_name=None, compute_post_filter_count=True, max_rows_limit=int(1000000000.0))
Analyse a blocking rule to understand the number of comparisons it will generate.
Read more about the definition of pre and post filter conditions here
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_or_tables | (dataframe, str) | Input data | required |
blocking_rule | Union[BlockingRuleCreator, str, Dict[str, Any]] | The blocking rule to analyse | required |
link_type | user_input_link_type_options | The link type - "link_only", "dedupe_only" or "link_and_dedupe" | required |
db_api | DatabaseAPISubClass | Database API | required |
unique_id_column_name | str | Defaults to "unique_id". | 'unique_id' |
source_dataset_column_name | Optional[str] | Defaults to None. | None |
compute_post_filter_count | bool | Whether to use a slower methodology to calculate how many comparisons will be generated post filter conditions. Defaults to True. | True |
max_rows_limit | int | Calculation of post filter counts will only proceed if the fast method returns a value below this limit. Defaults to int(1e9). | int(1000000000.0) |
Returns:
Type | Description |
---|---|
dict[str, Union[int, str]] | A dictionary containing the results |
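To make the idea of "the number of comparisons a blocking rule will generate" concrete, here is a minimal plain-Python sketch. It is not Splink's implementation: the helper `count_dedupe_comparisons` and the sample records are hypothetical, it assumes a "dedupe_only" link type and a blocking rule that is an exact match on the listed columns, and it does not model the pre and post filter distinction or `max_rows_limit`.

```python
from collections import Counter


def count_dedupe_comparisons(records, key_columns):
    """Count unordered record pairs that agree on every column in key_columns.

    Hypothetical helper for illustration only; Splink computes its counts
    in SQL through the db_api.
    """
    block_sizes = Counter()
    for record in records:
        key = tuple(record.get(col) for col in key_columns)
        # Assumption: mirror SQL equality, where a NULL never matches anything.
        if any(value is None for value in key):
            continue
        block_sizes[key] += 1
    # A block of n records yields n * (n - 1) / 2 distinct pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Made-up sample data, not taken from any Splink dataset.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "John", "surname": "Smith"},
    {"unique_id": 4, "first_name": "Jane", "surname": "Doe"},
    {"unique_id": 5, "first_name": "Jane", "surname": "Doe"},
    {"unique_id": 6, "first_name": None, "surname": "Smith"},
]

total = count_dedupe_comparisons(records, ["first_name", "surname"])
print(total)
```

A looser rule produces more comparisons: blocking on surname alone puts four records in the "Smith" block, so the total rises. Comparing such totals across candidate rules is the kind of question this function is meant to answer at scale.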
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)
cumulative_comparisons_to_be_scored_from_blocking_rules_data(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)
n_largest_blocks(*, table_or_tables, blocking_rule, link_type, db_api, n_largest=5)
Find the values responsible for creating the largest blocks of records.
For example, when blocking on first name and surname, the 'John Smith' block might be the largest block of records. In cases where values are highly skewed, a few values may be responsible for generating a large proportion of all comparisons. This function helps you find the culprit values.
The analysis is performed pre filter conditions; read more about what this means here
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_or_tables | (dataframe, str) | Input data | required |
blocking_rule | Union[BlockingRuleCreator, str, Dict[str, Any]] | The blocking rule to analyse | required |
link_type | user_input_link_type_options | The link type - "link_only", "dedupe_only" or "link_and_dedupe" | required |
db_api | DatabaseAPISubClass | Database API | required |
n_largest | int | How many rows to return. Defaults to 5. | 5 |
Returns:
Name | Type | Description |
---|---|---|
SplinkDataFrame | 'SplinkDataFrame' | A dataframe containing the n_largest blocks |
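The idea of finding the values behind the largest blocks can be sketched in plain Python. This is an illustration under stated assumptions, not Splink's implementation: `n_largest_blocks_sketch` and the sample records are hypothetical, the blocking rule is assumed to be an exact match on the listed columns, and the result is a list of tuples rather than a SplinkDataFrame.

```python
from collections import Counter


def n_largest_blocks_sketch(records, key_columns, n_largest=5):
    """Return the n_largest (blocking key, record count) pairs, biggest first.

    Hypothetical helper for illustration only.
    """
    block_sizes = Counter()
    for record in records:
        key = tuple(record.get(col) for col in key_columns)
        # Assumption: records with a missing key value join no block,
        # mirroring SQL equality on NULL.
        if any(value is None for value in key):
            continue
        block_sizes[key] += 1
    # Counter.most_common sorts by count, descending.
    return block_sizes.most_common(n_largest)


# Made-up sample data with a deliberately skewed 'John Smith' block.
people = [
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "Jane", "surname": "Doe"},
    {"first_name": "Jane", "surname": "Doe"},
    {"first_name": "Amir", "surname": "Khan"},
]

largest = n_largest_blocks_sketch(people, ["first_name", "surname"], n_largest=2)
for key, size in largest:
    print(key, size)
```

In this toy data the 'John Smith' key tops the list, which is the kind of culprit value the description refers to.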