
Documentation for blocking_rules_library

The blocking_rules_library contains a series of pre-made blocking rules for use when constructing blocking rule strategies and EM training blocks, as described in this topic guide.

These conform to a more performant standard that is outlined in detail here.

The detailed API for each of these is outlined below.

Blocking Rule APIs

The block_on function generates blocking rules that facilitate efficient equi-joins based on the columns or SQL statements specified in the col_names argument. When multiple columns or SQL snippets are provided, the function generates a compound blocking rule, connecting individual match conditions with "AND" clauses.

This function is designed for scenarios where you aim to achieve efficient yet straightforward blocking conditions based on one or more columns or SQL snippets.
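
As a brief illustration, a minimal sketch combining a SQL snippet and a column (the SQL condition shown in the comment is an approximation of what is generated; exact quoting and table prefixing depend on the dialect):

```python
from splink.duckdb.blocking_rule_library import block_on

# A single column gives a simple exact-match blocking rule
block_on("first_name")

# Multiple entries (columns and/or SQL snippets) give a compound rule:
# each entry becomes an exact-match condition and the conditions are
# joined with "AND", roughly:
#   substr(l.surname, 1, 2) = substr(r.surname, 1, 2)
#   AND l.first_name = r.first_name
block_on(["substr(surname,1,2)", "first_name"])
```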

For more information on the intended use cases of block_on, please see the following discussion.

Further information on equi-join conditions can be found here.

This function acts as a shorthand alias for the brl.and_ syntax:

import splink.duckdb.blocking_rule_library as brl
brl.and_(brl.exact_match_rule, brl.exact_match_rule, ...)
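
As a minimal sketch of that equivalence (note that exact_match_rule itself is deprecated in favour of block_on, as described further down this page):

```python
import splink.duckdb.blocking_rule_library as brl
from splink.duckdb.blocking_rule_library import block_on

# Shorthand form
block_on(["first_name", "surname"])

# Equivalent longhand form
brl.and_(
    brl.exact_match_rule("first_name"),
    brl.exact_match_rule("surname"),
)
```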

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| col_names | list[str] | A list of input columns or SQL conditions you wish to create blocks on. | required |
| salting_partitions | int, optional | The number of salting partitions to add to the blocking rule. More information on salting can be found within the docs. Salting is only valid for Spark. | 1 |
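
Because salting only takes effect on Spark, a brief sketch (the value of salting_partitions here is purely illustrative):

```python
from splink.spark.blocking_rule_library import block_on

# Split this blocking rule into 10 salted partitions so the resulting
# join is spread more evenly across Spark tasks; ignored on other backends
block_on(["first_name", "surname"], salting_partitions=10)
```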

Examples:

DuckDB

from splink.duckdb.blocking_rule_library import block_on
block_on("first_name")  # check for exact matches on first name
sql = "substr(surname,1,2)"
block_on([sql, "surname"])

Spark

from splink.spark.blocking_rule_library import block_on
block_on("first_name")  # check for exact matches on first name
sql = "substr(surname,1,2)"
block_on([sql, "surname"], salting_partitions=1)

Athena

from splink.athena.blocking_rule_library import block_on
block_on("first_name")  # check for exact matches on first name
sql = "substr(surname,1,2)"
block_on([sql, "surname"])

SQLite

from splink.sqlite.blocking_rule_library import block_on
block_on("first_name")  # check for exact matches on first name
sql = "substr(surname,1,2)"
block_on([sql, "surname"])

PostgreSQL

from splink.postgres.blocking_rule_library import block_on
block_on("first_name")  # check for exact matches on first name
sql = "substr(surname,1,2)"
block_on([sql, "surname"])
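
Rules produced by block_on are typically passed to a linker via the settings dictionary. A minimal sketch, assuming a DuckDB backend and a small illustrative DataFrame (the comparison shown is only a placeholder to make the settings complete):

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl

# Illustrative input data
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "first_name": ["Alice", "Alice", "Bob"],
        "surname": ["Smith", "Smyth", "Jones"],
    }
)

settings = {
    "link_type": "dedupe_only",
    # Candidate pairs are only generated where at least one rule holds
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on(["substr(surname,1,2)", "first_name"]),
    ],
    "comparisons": [cl.exact_match("surname")],
}

linker = DuckDBLinker(df, settings)
```
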
Source code in splink/blocking_rules_library.py
def block_on(
    _exact_match,
    col_names: list[str],
    salting_partitions: int = 1,
) -> BlockingRule:
    """The `block_on` function generates blocking rules that facilitate
    efficient equi-joins based on the columns or SQL statements
    specified in the col_names argument. When multiple columns or
    SQL snippets are provided, the function generates a compound
    blocking rule, connecting individual match conditions with
    "AND" clauses.

    This function is designed for scenarios where you aim to achieve
    efficient yet straightforward blocking conditions based on one
    or more columns or SQL snippets.

    For more information on the intended use cases of `block_on`, please see
    [the following discussion](https://github.com/moj-analytical-services/splink/issues/1376).

    Further information on equi-join conditions can be found
    [here](https://moj-analytical-services.github.io/splink/topic_guides/blocking/performance.html)

    This function acts as a shorthand alias for the `brl.and_` syntax:
    ```py
    import splink.duckdb.blocking_rule_library as brl
    brl.and_(brl.exact_match_rule, brl.exact_match_rule, ...)
    ```

    Args:
        col_names (list[str]): A list of input columns or sql conditions
            you wish to create blocks on.
        salting_partitions (optional, int): Whether to add salting
            to the blocking rule. More information on salting can
            be found within the docs. Salting is only valid for Spark.

    Examples:
        === ":simple-duckdb: DuckDB"
            ``` python
            from splink.duckdb.blocking_rule_library import block_on
            block_on("first_name")  # check for exact matches on first name
            sql = "substr(surname,1,2)"
            block_on([sql, "surname"])
            ```
        === ":simple-apachespark: Spark"
            ``` python
            from splink.spark.blocking_rule_library import block_on
            block_on("first_name")  # check for exact matches on first name
            sql = "substr(surname,1,2)"
            block_on([sql, "surname"], salting_partitions=1)
            ```
        === ":simple-amazonaws: Athena"
            ``` python
            from splink.athena.blocking_rule_library import block_on
            block_on("first_name")  # check for exact matches on first name
            sql = "substr(surname,1,2)"
            block_on([sql, "surname"])
            ```
        === ":simple-sqlite: SQLite"
            ``` python
            from splink.sqlite.blocking_rule_library import block_on
            block_on("first_name")  # check for exact matches on first name
            sql = "substr(surname,1,2)"
            block_on([sql, "surname"])
            ```
        === "PostgreSQL"
            ``` python
            from splink.postgres.blocking_rule_library import block_on
            block_on("first_name")  # check for exact matches on first name
            sql = "substr(surname,1,2)"
            block_on([sql, "surname"])
            ```
    """  # noqa: E501

    col_names = ensure_is_list(col_names)
    em_rules = [_exact_match(col) for col in col_names]
    return and_(*em_rules, salting_partitions=salting_partitions)

exact_match_rule represents an exact match blocking rule.

DEPRECATED: exact_match_rule is deprecated. Please use block_on instead, which acts as a wrapper with additional functionality.
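
For example, a deprecated call and its preferred replacement (a minimal sketch):

```python
import splink.duckdb.blocking_rule_library as brl
from splink.duckdb.blocking_rule_library import block_on

# Deprecated: emits a DeprecationWarning (see the source below)
brl.exact_match_rule("surname")

# Preferred replacement
block_on("surname")

# block_on also handles compound rules, which previously required and_
block_on(["first_name", "surname"])
```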

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| col_name | str | Input column name, or a string representing a SQL statement you'd like to match on. For example, surname or "substr(surname,1,2)" are both valid. | required |
| salting_partitions | int, optional | The number of salting partitions to add to the blocking rule. More information on salting can be found within the docs. Salting is currently only valid for Spark. | None |
Source code in splink/blocking_rules_library.py
def exact_match_rule(
    col_name: str,
    _sql_dialect: str,
    salting_partitions: int = None,
) -> BlockingRule:
    """Represents an exact match blocking rule.

    **DEPRECATED:**
    `exact_match_rule` is deprecated. Please use `block_on`
    instead, which acts as a wrapper with additional functionality.

    Args:
        col_name (str): Input column name, or a str represent a sql
            statement you'd like to match on. For example, `surname` or
            `"substr(surname,1,2)"` are both valid.
        salting_partitions (optional, int): Whether to add salting
            to the blocking rule. More information on salting can
            be found within the docs. Salting is currently only valid
            for Spark.
    """
    warnings.warn(
        "`exact_match_rule` is deprecated; use `block_on`",
        DeprecationWarning,
        stacklevel=2,
    )

    syntax_tree = sqlglot.parse_one(col_name, read=_sql_dialect)

    l_col = add_quotes_and_table_prefix(syntax_tree, "l").sql(_sql_dialect)
    r_col = add_quotes_and_table_prefix(syntax_tree, "r").sql(_sql_dialect)

    blocking_rule = f"{l_col} = {r_col}"

    return blocking_rule_to_obj(
        {
            "blocking_rule": blocking_rule,
            "salting_partitions": salting_partitions,
            "sql_dialect": _sql_dialect,
        }
    )