Column Expressions¶
In comparisons, you may wish to consider expressions which are not simply columns of your input table.
For instance you may have a forename
column in your data, but when comparing records you may wish to also use the values in this column transformed all to lowercase, or just the first three letters of the name, or perhaps both of these transformations taken together.
If it is feasible to do so, then it may be best to derive a new column containing the transformed data. Particularly if it is an expensive calculation, or you wish to refer to it many times, deriving the column once on your input data may well be preferable, as it is cheaper than doing so directly in comparisons where each input record may need to be processed many times. However, there may be situations where you don't wish to derive a new column, perhaps for large data where you have many such transformations, or when you are experimenting with different models.
This is where a ColumnExpression
may be used. It represents some SQL expression, which may be a column, or some more complicated construct,
to which you can also apply zero or more transformations. These are lazily evaluated, and in particular will not be tied to a specific SQL dialect until they are put (via settings into a linker).
This can be particularly useful if you want to write code that can easily be switched between different backends.
Term frequency adjustments
One caveat to using a ColumnExpression
is that it cannot be combined with term frequency adjustments.
Term frequency adjustments can only be computed on the raw values in a column prior to any function transforms.
If you wish to use term frequencies with transformations of an input column, you must pre-compute a new column in your input data
with the transforms applied, instead of a ColumnExpression
.
from splink import ColumnExpression
email_lowercase = ColumnExpression("email").lower()
dob_as_string = ColumnExpression("dob").cast_to_string()
surname_initial_lowercase = ColumnExpression("surname").substr(1, 1).lower()
entry_date = ColumnExpression("entry_date_str").try_parse_date(date_format="YYYY-MM-DD")
full_name_lowercase = ColumnExpression("first_name || ' ' || surname").lower()
You can use a ColumnExpression
in most places where you might also use a simple column name, such as in a library comparison, a library comparison level, or in a blocking rule:
from splink import block_on
import splink.comparison_library as cl
import splink.comparison_level_library as cll
full_name_lower_br = block_on([full_name_lowercase])
email_comparison = cl.DamerauLevenshteinAtThresholds(email_lowercase, distance_threshold_or_thresholds=[1, 3])
entry_date_comparison = cl.AbsoluteTimeDifferenceAtThresholds(
entry_date,
input_is_string=False,
metrics=["day", "day"],
thresholds=[1, 10],
)
name_comparison = cl.CustomComparison(
comparison_levels=[
cll.NullLevel(full_name_lowercase),
cll.ExactMatch(full_name_lowercase),
cll.ExactMatch("surname")
cll.ExactMatch("first_name"),
cll.ExactMatch(surname_initial_lowercase),
cll.ElseLevel()
],
output_column_name="name",
)
ColumnExpression
¶
Enables transforms to be applied to a column before it's passed into a comparison level.
Dialect agnostic. Execution is delayed until the dialect is known.
For example
from splink.column_expression import ColumnExpression
col = (
ColumnExpression("first_name")
.lower()
.regex_extract("^[A-Z]{1,4}")
)
ExactMatchLevel(col)
Note that this will typically be created without a dialect, and the dialect
will later be populated when the ColumnExpression
is passed via a comparison
level creator into a Linker
.
lower()
¶
Applies a lowercase transform to the input expression.
substr(start, length)
¶
Applies a substring transform to the input expression of a given length starting from a specified index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start |
int
|
The starting index of the substring. |
required |
length |
int
|
The length of the substring. |
required |
cast_to_string()
¶
Applies a cast to string transform to the input expression.
regex_extract(pattern, capture_group=0)
¶
Applies a regex extract transform to the input expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern |
str
|
The regex pattern to match. |
required |
capture_group |
int
|
The capture group to extract from the matched pattern. Defaults to 0, meaning the full pattern is extracted |
0
|
nullif(null_value)
¶
Applies a nullif transform to the input expression, with the specified string value that should be converted to NULL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
null_value |
str
|
The string literal that should be converted to NULL. |
required |
try_parse_date(date_format=None)
¶
Applies a 'try parse date' transform to the input expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
date_format |
str
|
The date format to attempt to parse. Defaults to None, meaning the dialect-specific default format is used. |
None
|
try_parse_timestamp(timestamp_format=None)
¶
Applies a 'try parse timestamp' transform to the input expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timestamp_format |
str
|
The timestamp format to attempt to parse. Defaults to None, meaning the dialect-specific default format is used. |
None
|
access_extreme_array_element(first_or_last)
¶
Applies a transformation to access either the first or the last element of an array
Parameters:
Name | Type | Description | Default |
---|---|---|---|
first_or_last |
str
|
'first' for returning the first elemen of the array, 'last' for the last element |
required |