Column Expressions¶

In comparisons, you may wish to consider expressions which are not simply columns of your input table. For instance you may have a forename column in your data, but when comparing records you may wish to also use the values in this column transformed all to lowercase, or just the first three letters of the name, or perhaps both of these transformations taken together.

If it is feasible to do so, then it may be best to derive a new column containing the transformed data. Particularly if it is an expensive calculation, or you wish to refer to it many times, deriving the column once on your input data may well be preferable, as it is cheaper than doing so directly in comparisons where each input record may need to be processed many times. However, there may be situations where you don't wish to derive a new column, perhaps for large data where you have many such transformations, or when you are experimenting with different models.

This is where a ColumnExpression may be used. It represents some SQL expression, which may be a column, or some more complicated construct, to which you can also apply zero or more transformations. These are lazily evaluated, and in particular will not be tied to a specific SQL dialect until they are put (via settings into a linker). This can be particularly useful if you want to write code that can easily be switched between different backends.

Term frequency adjustments

One caveat to using a ColumnExpression is that it cannot be combined with term frequency adjustments. Term frequency adjustments can only be computed on the raw values in a column prior to any function transforms.

If you wish to use term frequencies with transformations of an input column, you must pre-compute a new column in your input data with the transforms applied, instead of a ColumnExpression.

from splink import ColumnExpression

email_lowercase = ColumnExpression("email").lower()
dob_as_string = ColumnExpression("dob").cast_to_string()
surname_initial_lowercase = ColumnExpression("surname").substr(1, 1).lower()
entry_date = ColumnExpression("entry_date_str").try_parse_date(date_format="YYYY-MM-DD")
full_name_lowercase = ColumnExpression("first_name || ' ' || surname").lower()

You can use a ColumnExpression in most places where you might also use a simple column name, such as in a library comparison, a library comparison level, or in a blocking rule:

from splink import block_on
import splink.comparison_library as cl
import splink.comparison_level_library as cll

full_name_lower_br = block_on([full_name_lowercase])

email_comparison = cl.DamerauLevenshteinAtThresholds(email_lowercase, distance_threshold_or_thresholds=[1, 3])
entry_date_comparison = cl.AbsoluteTimeDifferenceAtThresholds(
    entry_date,
    input_is_string=False,
    metrics=["day", "day"],
    thresholds=[1, 10],
)
name_comparison = cl.CustomComparison(
    comparison_levels=[
        cll.NullLevel(full_name_lowercase),
        cll.ExactMatch(full_name_lowercase),
        cll.ExactMatch("surname")
        cll.ExactMatch("first_name"),
        cll.ExactMatch(surname_initial_lowercase),
        cll.ElseLevel()
    ],
    output_column_name="name",
)

`ColumnExpression`¶

Enables transforms to be applied to a column before it's passed into a comparison level.

Dialect agnostic. Execution is delayed until the dialect is known.

For example

from splink.column_expression import ColumnExpression
col = (
    ColumnExpression("first_name")
    .lower()
    .regex_extract("^[A-Z]{1,4}")
)

ExactMatchLevel(col)

Note that this will typically be created without a dialect, and the dialect will later be populated when the ColumnExpression is passed via a comparison level creator into a Linker.

`lower()` ¶

Applies a lowercase transform to the input expression.

`substr(start, length)` ¶

Applies a substring transform to the input expression of a given length starting from a specified index.

Parameters:

Name	Type	Description	Default
`start`	`int`	The starting index of the substring.	required
`length`	`int`	The length of the substring.	required

`cast_to_string()` ¶

Applies a cast to string transform to the input expression.

`regex_extract(pattern, capture_group=0)` ¶

Applies a regex extract transform to the input expression.

Parameters:

Name	Type	Description	Default
`pattern`	`str`	The regex pattern to match.	required
`capture_group`	`int`	The capture group to extract from the matched pattern. Defaults to 0, meaning the full pattern is extracted	`0`

`nullif(null_value)` ¶

Applies a nullif transform to the input expression, with the specified string value that should be converted to NULL.

Parameters:

Name	Type	Description	Default
`null_value`	`str`	The string literal that should be converted to NULL.	required

`try_parse_date(date_format=None)` ¶

Applies a 'try parse date' transform to the input expression.

Parameters:

Name	Type	Description	Default
`date_format`	`str`	The date format to attempt to parse. Defaults to None, meaning the dialect-specific default format is used.	`None`

`try_parse_timestamp(timestamp_format=None)` ¶

Applies a 'try parse timestamp' transform to the input expression.

Parameters:

Name	Type	Description	Default
`timestamp_format`	`str`	The timestamp format to attempt to parse. Defaults to None, meaning the dialect-specific default format is used.	`None`

`access_extreme_array_element(first_or_last)` ¶

Applies a transformation to access either the first or the last element of an array

Parameters:

Name	Type	Description	Default
`first_or_last`	`str`	'first' for returning the first elemen of the array, 'last' for the last element	required

Column Expressions¶

ColumnExpression¶

lower() ¶

substr(start, length) ¶

cast_to_string() ¶

regex_extract(pattern, capture_group=0) ¶

nullif(null_value) ¶

try_parse_date(date_format=None) ¶

try_parse_timestamp(timestamp_format=None) ¶

access_extreme_array_element(first_or_last) ¶

`ColumnExpression`¶

`lower()` ¶

`substr(start, length)` ¶

`cast_to_string()` ¶

`regex_extract(pattern, capture_group=0)` ¶

`nullif(null_value)` ¶

`try_parse_date(date_format=None)` ¶

`try_parse_timestamp(timestamp_format=None)` ¶

`access_extreme_array_element(first_or_last)` ¶