Documentation for `comparison_level_composition` functions

`comparison_level_composition` lets you merge existing comparison levels using a logical SQL clause: OR, AND or NOT.

This extends the base comparison levels by letting users "join" existing levels with these SQL clauses.

For example, `or_(null_level("first_name"), null_level("surname"))` creates a check for nulls in either `first_name` or `surname`, rather than restricting the user to a single column.
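To make the effect of such an OR composition concrete, the sketch below evaluates a merged condition with Python's built-in `sqlite3` module instead of Splink. It assumes each null level reduces to a condition of the form `col_l IS NULL OR col_r IS NULL`; the `pairs` table and its rows are made up for illustration.

```python
import sqlite3

# Assumed shape of the two null-level conditions (not taken from Splink itself).
first_name_null = "first_name_l IS NULL OR first_name_r IS NULL"
surname_null = "surname_l IS NULL OR surname_r IS NULL"

# Joining with OR: each condition is wrapped in parentheses first.
either_null = f"({first_name_null}) OR ({surname_null})"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pairs (id INTEGER, first_name_l TEXT, first_name_r TEXT, "
    "surname_l TEXT, surname_r TEXT)"
)
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "Ann", "Lee", "Lee"),  # nothing missing
        (2, None, "Bob", "Ray", "Ray"),   # first name missing on the left
        (3, "Cat", "Cat", "Fox", None),   # surname missing on the right
    ],
)

# Collect the ids of record pairs that the composed condition flags.
flagged = [
    row[0]
    for row in conn.execute(f"SELECT id FROM pairs WHERE {either_null} ORDER BY id")
]
print(flagged)
```

Pairs 2 and 3 are flagged because a null in either column satisfies the merged condition, which a single-column null level would not capture.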

The Splink comparison level composition functions available for each SQL dialect are as given in this table:

	DuckDB	Spark	Athena	SQLite	PostgreSQL
and_	✓	✓	✓	✓	✓
not_	✓	✓	✓	✓	✓
or_	✓	✓	✓	✓	✓

The detailed API for each of these is outlined below.

Library comparison composition APIs

`and_(*clls, label_for_charts=None, m_probability=None, is_null_level=None)`

Merge ComparisonLevels using logical "AND".

Merge multiple ComparisonLevels into a single ComparisonLevel by merging their SQL conditions using a logical "AND".

By default, we generate a new label_for_charts for the new ComparisonLevel. You can override this, and any other ComparisonLevel attributes, by passing them as keyword arguments.
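As a rough illustration of the merge, the sketch below joins the conditions of plain dictionaries. It assumes each condition is wrapped in parentheses and joined with the clause, which is what the example output further down suggests. `merge_levels` is a hypothetical helper, not part of Splink, and the default-label format is an assumption.

```python
def merge_levels(*levels, clause="AND", label_for_charts=None):
    """Hypothetical stand-in for the merge step, working on plain dicts."""
    if clause not in ("AND", "OR"):
        raise ValueError("clause must be 'AND' or 'OR'")
    if not levels:
        raise ValueError("at least one level is required")

    # Wrap every condition in parentheses so operator precedence is preserved.
    sql = f" {clause} ".join(f"({lvl['sql_condition']})" for lvl in levels)

    # Assumed default label: the input labels joined by the clause.
    default_label = f" {clause} ".join(
        lvl.get("label_for_charts", lvl["sql_condition"]) for lvl in levels
    )
    return {
        "sql_condition": sql,
        "label_for_charts": label_for_charts or default_label,
    }


misspelling = {
    "sql_condition": 'levenshtein("name_l", "name_r") <= 1',
    "label_for_charts": "Levenshtein <= 1",
}
contains = {
    "sql_condition": "(contains(name_l, name_r) OR contains(name_r, name_l))"
}
merged = merge_levels(misspelling, contains, label_for_charts="Spelling error")
print(merged["sql_condition"])
```

Passing `label_for_charts` replaces the generated label, while the SQL condition is always built from the inputs. Switching `clause` to `"OR"` gives the corresponding OR merge.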

Parameters:

Name	Type	Description	Default
`*clls`	`ComparisonLevel \| dict`	ComparisonLevels or comparison level dictionaries to merge	`()`
`label_for_charts`	`str`	A label for this comparison level, which will appear on charts as a reminder of what the level represents. Defaults to a composition of the input labels, `label_1 AND label_2`.	`None`
`m_probability`	`float`	Starting value for m probability. Defaults to None.	`None`
`is_null_level`	`bool`	If true, m and u values will not be estimated and instead the match weight will be zero for this column. Defaults to None.	`None`

Examples:

DuckDB Spark Athena SQLite

Simple null level composition with an AND clause

import splink.duckdb.comparison_level_library as cll
cll.and_(cll.null_level("first_name"), cll.null_level("surname"))

Composing a levenshtein level with a custom contains level

import splink.duckdb.comparison_level_library as cll
misspelling = cll.levenshtein_level("name", 1)
contains = {
    "sql_condition": "(contains(name_l, name_r) OR contains(name_r, name_l))"
}
merged = cll.and_(misspelling, contains, label_for_charts="Spelling error")

merged.as_dict()

{
    'sql_condition': '(levenshtein("name_l", "name_r") <= 1) AND ((contains(name_l, name_r) OR contains(name_r, name_l)))',
    'label_for_charts': 'Spelling error'
}

Simple null level composition with an AND clause

import splink.spark.comparison_level_library as cll
cll.and_(cll.null_level("first_name"), cll.null_level("surname"))

Composing a levenshtein level with a custom contains level

import splink.spark.comparison_level_library as cll
misspelling = cll.levenshtein_level("name", 1)
contains = {
    "sql_condition": "(contains(name_l, name_r) OR contains(name_r, name_l))"
}
merged = cll.and_(misspelling, contains, label_for_charts="Spelling error")

merged.as_dict()

{
    'sql_condition': '(levenshtein("name_l", "name_r") <= 1) AND ((contains(name_l, name_r) OR contains(name_r, name_l)))',
    'label_for_charts': 'Spelling error'
}

Simple null level composition with an AND clause

import splink.athena.comparison_level_library as cll
cll.and_(cll.null_level("first_name"), cll.null_level("surname"))

Composing a levenshtein level with a custom contains level

import splink.athena.comparison_level_library as cll
misspelling = cll.levenshtein_level("name", 1)
contains = {
    "sql_condition": "(contains(name_l, name_r) OR contains(name_r, name_l))"
}
merged = cll.and_(misspelling, contains, label_for_charts="Spelling error")

merged.as_dict()

{
    'sql_condition': '(levenshtein("name_l", "name_r") <= 1) AND ((contains(name_l, name_r) OR contains(name_r, name_l)))',
    'label_for_charts': 'Spelling error'
}

Simple null level composition with an AND clause

import splink.sqlite.comparison_level_library as cll
cll.and_(cll.null_level("first_name"), cll.null_level("surname"))

Returns:

Name	Type	Description
`ComparisonLevel`	`ComparisonLevel`	A new ComparisonLevel with the merged SQL condition

Source code in splink/comparison_level_composition.py

def and_(
    *clls: ComparisonLevel | dict,
    label_for_charts=None,
    m_probability=None,
    is_null_level=None,
) -> ComparisonLevel:
    """Merge ComparisonLevels using logical "AND".

    Merge multiple ComparisonLevels into a single ComparisonLevel by
    merging their SQL conditions using a logical "AND".

    By default, we generate a new `label_for_charts` for the new ComparisonLevel.
    You can override this, and any other ComparisonLevel attributes, by passing
    them as keyword arguments.

    Args:
        *clls (ComparisonLevel | dict): ComparisonLevels or comparison
            level dictionaries to merge
        label_for_charts (str, optional): A label for this comparison level,
            which will appear on charts as a reminder of what the level represents.
            Defaults to a composition of - `label_1 AND label_2`
        m_probability (float, optional): Starting value for m probability.
            Defaults to None.
        is_null_level (bool, optional): If true, m and u values will not be
            estimated and instead the match weight will be zero for this column.
            Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            Simple null level composition with an `AND` clause
            ``` python
            import splink.duckdb.comparison_level_library as cll
            cll.and_(cll.null_level("first_name"), cll.null_level("surname"))
            ```
            Composing a levenshtein level with a custom `contains` level
            ``` python
            import splink.duckdb.comparison_level_library as cll
            misspelling = cll.levenshtein_level("name", 1)
            contains = {
                "sql_condition": "(contains(name_l, name_r) OR " \
                "contains(name_r, name_l))"
            }
            merged = cll.and_(misspelling, contains, label_for_charts="Spelling error")
            ```
            ```python
            merged.as_dict()
            ```
            >{
            > 'sql_condition': '(levenshtein("name_l", "name_r") <= 1) ' \
            >  'AND ((contains(name_l, name_r) OR contains(name_r, name_l)))',
            >  'label_for_charts': 'Spelling error'
            >}
        === ":simple-apachespark: Spark"
            Simple null level composition with an `AND` clause
            ``` python
            import splink.spark.comparison_level_library as cll
            cll.and_(cll.null_level("first_name"), cll.null_level("surname"))
            ```
            Composing a levenshtein level with a custom `contains` level
            ``` python
            import splink.spark.comparison_level_library as cll
            misspelling = cll.levenshtein_level("name", 1)
            contains = {
                "sql_condition": "(contains(name_l, name_r) OR " \
                "contains(name_r, name_l))"
            }
            merged = cll.and_(misspelling, contains, label_for_charts="Spelling error")
            ```
            ```python
            merged.as_dict()
            ```
            >{
            > 'sql_condition': '(levenshtein("name_l", "name_r") <= 1) ' \
            >  'AND ((contains(name_l, name_r) OR contains(name_r, name_l)))',
            >  'label_for_charts': 'Spelling error'
            >}
        === ":simple-amazonaws: Athena"
            Simple null level composition with an `AND` clause
            ``` python
            import splink.athena.comparison_level_library as cll
            cll.and_(cll.null_level("first_name"), cll.null_level("surname"))
            ```
            Composing a levenshtein level with a custom `contains` level
            ``` python
            import splink.athena.comparison_level_library as cll
            misspelling = cll.levenshtein_level("name", 1)
            contains = {
                "sql_condition": "(contains(name_l, name_r) OR " \
                "contains(name_r, name_l))"
            }
            merged = cll.and_(misspelling, contains, label_for_charts="Spelling error")
            ```
            ```python
            merged.as_dict()
            ```
            >{
            > 'sql_condition': '(levenshtein("name_l", "name_r") <= 1) ' \
            >  'AND ((contains(name_l, name_r) OR contains(name_r, name_l)))',
            >  'label_for_charts': 'Spelling error'
            >}
        === ":simple-sqlite: SQLite"
            Simple null level composition with an `AND` clause
            ``` python
            import splink.sqlite.comparison_level_library as cll
            cll.and_(cll.null_level("first_name"), cll.null_level("surname"))
            ```

    Returns:
        ComparisonLevel: A new ComparisonLevel with the merged
            SQL condition
    """
    return _cl_merge(
        *clls,
        clause="AND",
        label_for_charts=label_for_charts,
        m_probability=m_probability,
        is_null_level=is_null_level,
    )

`not_(cll, label_for_charts=None, m_probability=None)`

Negate a ComparisonLevel.

Returns a ComparisonLevel with the same SQL condition as the input, but prefixed with "NOT".

By default, we generate a new label_for_charts for the new ComparisonLevel. You can override this, and any other ComparisonLevel attributes, by passing them as keyword arguments.
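The negation can be sketched on plain dictionaries as follows. This mirrors the logic visible in the source listing further down, but `negate_level` is a hypothetical helper that works on dicts rather than ComparisonLevel objects, so treat it as an approximation rather than the library's implementation.

```python
def negate_level(level, label_for_charts=None, m_probability=None):
    """Hypothetical dict-based mirror of the negation logic; not Splink's API."""
    result = {"sql_condition": f"NOT ({level['sql_condition']})"}

    # A level that was not a null level is explicitly marked as non-null;
    # for a null level the flag is left unset.
    if not level.get("is_null_level", False):
        result["is_null_level"] = False

    result["label_for_charts"] = (
        label_for_charts
        if label_for_charts
        else f"NOT ({level.get('label_for_charts')})"
    )

    # Only a truthy starting value is copied across.
    if m_probability:
        result["m_probability"] = m_probability
    return result


dob_first_jan = {
    "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
    "label_for_charts": "Date is 1st Jan",
}
negated = negate_level(dob_first_jan)
print(negated["sql_condition"])
print(negated["label_for_charts"])
```

Because the check on `m_probability` is a truthiness test, a starting value of `0` or `0.0` is not carried over in this sketch.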

Parameters:

Name	Type	Description	Default
`cll`	`ComparisonLevel \| dict`	ComparisonLevel or comparison level dictionary	required
`label_for_charts`	`str`	A label for this comparison level, which will appear on charts as a reminder of what the level represents.	`None`
`m_probability`	`float`	Starting value for m probability. Defaults to None.	`None`

Examples:

DuckDB Spark Athena SQLite

Not an exact match on first name

import splink.duckdb.comparison_level_library as cll
cll.not_(cll.exact_match_level("first_name"))

Find all exact matches not on the first of January

import splink.duckdb.comparison_level_library as cll
dob_first_jan =  {
   "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
   "label_for_charts": "Date is 1st Jan",
}
exact_match_not_first_jan = cll.and_(
    cll.exact_match_level("dob"),
    cll.not_(dob_first_jan),
    label_for_charts = "Exact match and not the 1st Jan"
)

Not an exact match on first name

import splink.spark.comparison_level_library as cll
cll.not_(cll.exact_match_level("first_name"))

Find all exact matches not on the first of January

import splink.spark.comparison_level_library as cll
dob_first_jan =  {
   "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
   "label_for_charts": "Date is 1st Jan",
}
exact_match_not_first_jan = cll.and_(
    cll.exact_match_level("dob"),
    cll.not_(dob_first_jan),
    label_for_charts = "Exact match and not the 1st Jan"
)

Not an exact match on first name

import splink.athena.comparison_level_library as cll
cll.not_(cll.exact_match_level("first_name"))

Find all exact matches not on the first of January

import splink.athena.comparison_level_library as cll
dob_first_jan =  {
   "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
   "label_for_charts": "Date is 1st Jan",
}
exact_match_not_first_jan = cll.and_(
    cll.exact_match_level("dob"),
    cll.not_(dob_first_jan),
    label_for_charts = "Exact match and not the 1st Jan"
)

Not an exact match on first name

import splink.sqlite.comparison_level_library as cll
cll.not_(cll.exact_match_level("first_name"))

Find all exact matches not on the first of January

import splink.sqlite.comparison_level_library as cll
dob_first_jan =  {
   "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
   "label_for_charts": "Date is 1st Jan",
}
exact_match_not_first_jan = cll.and_(
    cll.exact_match_level("dob"),
    cll.not_(dob_first_jan),
    label_for_charts = "Exact match and not the 1st Jan"
)

Returns:

Type	Description
`ComparisonLevel`	A new ComparisonLevel with the negated SQL condition and label_for_charts

Source code in splink/comparison_level_composition.py

def not_(
    cll: ComparisonLevel | dict,
    label_for_charts: str | None = None,
    m_probability: float | None = None,
) -> ComparisonLevel:
    """Negate a ComparisonLevel.

    Returns a ComparisonLevel with the same SQL condition as the input,
    but prefixed with "NOT".

    By default, we generate a new `label_for_charts` for the new ComparisonLevel.
    You can override this, and any other ComparisonLevel attributes, by passing
    them as keyword arguments.

    Args:
        cll (ComparisonLevel | dict): ComparisonLevel or comparison
            level dictionary
        label_for_charts (str, optional): A label for this comparison level,
            which will appear on charts as a reminder of what the level represents.
        m_probability (float, optional): Starting value for m probability.
            Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            *Not* an exact match on first name
            ``` python
            import splink.duckdb.comparison_level_library as cll
            cll.not_(cll.exact_match_level("first_name"))
            ```
            Find all exact matches *not* on the first of January
            ``` python
            import splink.duckdb.comparison_level_library as cll
            dob_first_jan =  {
               "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
               "label_for_charts": "Date is 1st Jan",
            }
            exact_match_not_first_jan = cll.and_(
                cll.exact_match_level("dob"),
                cll.not_(dob_first_jan),
                label_for_charts = "Exact match and not the 1st Jan"
            )
            ```
        === ":simple-apachespark: Spark"
            *Not* an exact match on first name
            ``` python
            import splink.spark.comparison_level_library as cll
            cll.not_(cll.exact_match_level("first_name"))
            ```
            Find all exact matches *not* on the first of January
            ``` python
            import splink.spark.comparison_level_library as cll
            dob_first_jan =  {
               "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
               "label_for_charts": "Date is 1st Jan",
            }
            exact_match_not_first_jan = cll.and_(
                cll.exact_match_level("dob"),
                cll.not_(dob_first_jan),
                label_for_charts = "Exact match and not the 1st Jan"
            )
            ```
        === ":simple-amazonaws: Athena"
            *Not* an exact match on first name
            ``` python
            import splink.athena.comparison_level_library as cll
            cll.not_(cll.exact_match_level("first_name"))
            ```
            Find all exact matches *not* on the first of January
            ``` python
            import splink.athena.comparison_level_library as cll
            dob_first_jan =  {
               "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
               "label_for_charts": "Date is 1st Jan",
            }
            exact_match_not_first_jan = cll.and_(
                cll.exact_match_level("dob"),
                cll.not_(dob_first_jan),
                label_for_charts = "Exact match and not the 1st Jan"
            )
            ```
        === ":simple-sqlite: SQLite"
            *Not* an exact match on first name
            ``` python
            import splink.sqlite.comparison_level_library as cll
            cll.not_(cll.exact_match_level("first_name"))
            ```
            Find all exact matches *not* on the first of January
            ``` python
            import splink.sqlite.comparison_level_library as cll
            dob_first_jan =  {
               "sql_condition": "SUBSTR(dob_std_l, -5) = '01-01'",
               "label_for_charts": "Date is 1st Jan",
            }
            exact_match_not_first_jan = cll.and_(
                cll.exact_match_level("dob"),
                cll.not_(dob_first_jan),
                label_for_charts = "Exact match and not the 1st Jan"
            )
            ```

    Returns:
        ComparisonLevel
            A new ComparisonLevel with the negated SQL condition and label_for_charts
    """
    cls, sql_dialect = _parse_comparison_levels(cll)
    cl = cls[0]
    result = {}
    result["sql_condition"] = f"NOT ({cl.sql_condition})"

    # Invert if is_null_level.
    # If NOT is_null_level, then we don't know if the inverted level is null or not
    if not cl.is_null_level:
        result["is_null_level"] = False

    result["label_for_charts"] = (
        label_for_charts if label_for_charts else f"NOT ({cl.label_for_charts})"
    )

    if m_probability:
        result["m_probability"] = m_probability

    return ComparisonLevel(result, sql_dialect=sql_dialect)

`or_(*clls, label_for_charts=None, m_probability=None, is_null_level=None)`

Merge ComparisonLevels using logical "OR".

Merge multiple ComparisonLevels into a single ComparisonLevel by merging their SQL conditions using a logical "OR".

By default, we generate a new label_for_charts for the new ComparisonLevel. You can override this, and any other ComparisonLevel attributes, by passing them as keyword arguments.

Parameters:

Name	Type	Description	Default
`*clls`	`ComparisonLevel \| dict`	ComparisonLevels or comparison level dictionaries to merge	`()`
`label_for_charts`	`str`	A label for this comparison level, which will appear on charts as a reminder of what the level represents. Defaults to a composition of the input labels, `label_1 OR label_2`.	`None`
`m_probability`	`float`	Starting value for m probability. Defaults to None.	`None`
`is_null_level`	`bool`	If true, m and u values will not be estimated and instead the match weight will be zero for this column. Defaults to None.	`None`

Examples:

DuckDB Spark Athena SQLite

Simple null level composition with an OR clause

import splink.duckdb.comparison_level_library as cll
cll.or_(cll.null_level("first_name"), cll.null_level("surname"))

Composing a levenshtein level with a custom contains level

import splink.duckdb.comparison_level_library as cll
misspelling = cll.levenshtein_level("name", 1)
contains = {
    "sql_condition": "(contains(name_l, name_r) OR contains(name_r, name_l))"
}
merged = cll.or_(misspelling, contains, label_for_charts="Spelling error")

merged.as_dict()

{
    'sql_condition': '(levenshtein("name_l", "name_r") <= 1) OR ((contains(name_l, name_r) OR contains(name_r, name_l)))',
    'label_for_charts': 'Spelling error'
}

Simple null level composition with an OR clause

import splink.spark.comparison_level_library as cll
cll.or_(cll.null_level("first_name"), cll.null_level("surname"))

Composing a levenshtein level with a custom contains level

import splink.spark.comparison_level_library as cll
misspelling = cll.levenshtein_level("name", 1)
contains = {
    "sql_condition": "(contains(name_l, name_r) OR contains(name_r, name_l))"
}
merged = cll.or_(misspelling, contains, label_for_charts="Spelling error")

merged.as_dict()

{
    'sql_condition': '(levenshtein("name_l", "name_r") <= 1) OR ((contains(name_l, name_r) OR contains(name_r, name_l)))',
    'label_for_charts': 'Spelling error'
}

Simple null level composition with an OR clause

import splink.athena.comparison_level_library as cll
cll.or_(cll.null_level("first_name"), cll.null_level("surname"))

Composing a levenshtein level with a custom contains level

import splink.athena.comparison_level_library as cll
misspelling = cll.levenshtein_level("name", 1)
contains = {
    "sql_condition": "(contains(name_l, name_r) OR contains(name_r, name_l))"
}
merged = cll.or_(misspelling, contains, label_for_charts="Spelling error")

merged.as_dict()

{
    'sql_condition': '(levenshtein("name_l", "name_r") <= 1) OR ((contains(name_l, name_r) OR contains(name_r, name_l)))',
    'label_for_charts': 'Spelling error'
}

Simple null level composition with an OR clause

import splink.sqlite.comparison_level_library as cll
cll.or_(cll.null_level("first_name"), cll.null_level("surname"))

Returns:

Name	Type	Description
`ComparisonLevel`	`ComparisonLevel`	A new ComparisonLevel with the merged SQL condition

Source code in splink/comparison_level_composition.py

def or_(
    *clls: ComparisonLevel | dict,
    label_for_charts: str | None = None,
    m_probability: float | None = None,
    is_null_level: bool | None = None,
) -> ComparisonLevel:
    """Merge ComparisonLevels using logical "OR".

    Merge multiple ComparisonLevels into a single ComparisonLevel by
    merging their SQL conditions using a logical "OR".

    By default, we generate a new `label_for_charts` for the new ComparisonLevel.
    You can override this, and any other ComparisonLevel attributes, by passing
    them as keyword arguments.

    Args:
        *clls (ComparisonLevel | dict): ComparisonLevels or comparison
            level dictionaries to merge
        label_for_charts (str, optional): A label for this comparison level,
            which will appear on charts as a reminder of what the level represents.
            Defaults to a composition of - `label_1 OR label_2`
        m_probability (float, optional): Starting value for m probability.
            Defaults to None.
        is_null_level (bool, optional): If true, m and u values will not be
            estimated and instead the match weight will be zero for this column.
            Defaults to None.

    Examples:
        === ":simple-duckdb: DuckDB"
            Simple null level composition with an `OR` clause
            ``` python
            import splink.duckdb.comparison_level_library as cll
            cll.or_(cll.null_level("first_name"), cll.null_level("surname"))
            ```
            Composing a levenshtein level with a custom `contains` level
            ``` python
            import splink.duckdb.comparison_level_library as cll
            misspelling = cll.levenshtein_level("name", 1)
            contains = {
                "sql_condition": "(contains(name_l, name_r) OR " \
                "contains(name_r, name_l))"
            }
            merged = cll.or_(misspelling, contains, label_for_charts="Spelling error")
            ```
            ```python
            merged.as_dict()
            ```
            >{
            > 'sql_condition': '(levenshtein("name_l", "name_r") <= 1) ' \
            >  'OR ((contains(name_l, name_r) OR contains(name_r, name_l)))',
            >  'label_for_charts': 'Spelling error'
            >}
        === ":simple-apachespark: Spark"
            Simple null level composition with an `OR` clause
            ``` python
            import splink.spark.comparison_level_library as cll
            cll.or_(cll.null_level("first_name"), cll.null_level("surname"))
            ```
            Composing a levenshtein level with a custom `contains` level
            ``` python
            import splink.spark.comparison_level_library as cll
            misspelling = cll.levenshtein_level("name", 1)
            contains = {
                "sql_condition": "(contains(name_l, name_r) OR " \
                "contains(name_r, name_l))"
            }
            merged = cll.or_(misspelling, contains, label_for_charts="Spelling error")
            ```
            ```python
            merged.as_dict()
            ```
            >{
            > 'sql_condition': '(levenshtein("name_l", "name_r") <= 1) ' \
            >  'OR ((contains(name_l, name_r) OR contains(name_r, name_l)))',
            >  'label_for_charts': 'Spelling error'
            >}
        === ":simple-amazonaws: Athena"
            Simple null level composition with an `OR` clause
            ``` python
            import splink.athena.comparison_level_library as cll
            cll.or_(cll.null_level("first_name"), cll.null_level("surname"))
            ```
            Composing a levenshtein level with a custom `contains` level
            ``` python
            import splink.athena.comparison_level_library as cll
            misspelling = cll.levenshtein_level("name", 1)
            contains = {
                "sql_condition": "(contains(name_l, name_r) OR " \
                "contains(name_r, name_l))"
            }
            merged = cll.or_(misspelling, contains, label_for_charts="Spelling error")
            ```
            ```python
            merged.as_dict()
            ```
            >{
            > 'sql_condition': '(levenshtein("name_l", "name_r") <= 1) ' \
            >  'OR ((contains(name_l, name_r) OR contains(name_r, name_l)))',
            >  'label_for_charts': 'Spelling error'
            >}
        === ":simple-sqlite: SQLite"
            Simple null level composition with an `OR` clause
            ``` python
            import splink.sqlite.comparison_level_library as cll
            cll.or_(cll.null_level("first_name"), cll.null_level("surname"))
            ```

    Returns:
        ComparisonLevel: A new ComparisonLevel with the merged
            SQL condition
    """

    return _cl_merge(
        *clls,
        clause="OR",
        label_for_charts=label_for_charts,
        m_probability=m_probability,
        is_null_level=is_null_level,
    )
