Extracting partial strings

It can sometimes be useful to compare records on substrings or other parts of column values. For example, one approach to comparing postcodes is to consider their constituent components, e.g. area, district, etc. (see Feature Engineering for more details).

Functions such as substring and regular expression extraction let you compare parts of strings without needing to engineer new features from the source data.

Splink supports this functionality via the ColumnExpression.

Examples

1. Exact match on postcode area

Suppose you want to compare records on a postcode column, but only care about finding links between people who share the same postcode area (given by the first one or two letters of the postcode). The regular expression that picks out those leading letters is ^[A-Z]{1,2}:

import splink.comparison_level_library as cll
from splink import ColumnExpression

pc_ce = ColumnExpression("postcode").regex_extract("^[A-Z]{1,2}")
print(cll.ExactMatchLevel(pc_ce).get_comparison_level("duckdb").sql_condition)

NULLIF(regexp_extract("postcode_l", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Z]{1,2}', 0), '')

We may therefore configure a comparison as follows:

from splink.comparison_library import CustomComparison

cc = CustomComparison(
    output_column_name="postcode",
    comparison_levels=[
        cll.NullLevel("postcode"),
        cll.ExactMatchLevel(pc_ce),
        cll.ElseLevel(),
    ],
)
print(cc.get_comparison("duckdb").human_readable_description)

Comparison 'CustomComparison' of "postcode".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Z]{1,2}', 0), '')
    - 'All other comparisons' with SQL rule: ELSE

Applied to a few example record pairs, this comparison assigns the following levels:

person_id_l	person_id_r	postcode_l	postcode_r	comparison_level
7	1	SE1P 0NY	SE1P 0NY	exact match
5	1	SE2 4UZ	SE1P 0NY	exact match
9	2	SW14 7PQ	SW3 9JG	exact match
4	8	N7 8RL	EC2R 8AH	else level
6	3		SE2 4UZ	null level
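
To make the levels in the table above easier to check, here is a rough pure-Python sketch of the same three-level logic. It uses the standard re module rather than Splink or DuckDB, so it only approximates the SQL shown earlier. The helper names postcode_area and comparison_level are hypothetical and exist only for this illustration, and the empty postcode in the last row is represented as None.

```python
import re

# Same pattern as in the Splink example: one or two leading capital letters.
AREA_PATTERN = re.compile(r"^[A-Z]{1,2}")


def postcode_area(postcode):
    """Return the leading one or two letters, or None if nothing matches."""
    if postcode is None:
        return None
    match = AREA_PATTERN.search(postcode)
    return match.group(0) if match else None


def comparison_level(postcode_l, postcode_r):
    """Assign a level in the same order as the comparison levels above."""
    if postcode_l is None or postcode_r is None:
        return "null level"
    area_l = postcode_area(postcode_l)
    area_r = postcode_area(postcode_r)
    # A failed extraction never counts as a match, loosely mirroring
    # the NULLIF(..., '') wrapper in the generated SQL.
    if area_l is not None and area_l == area_r:
        return "exact match"
    return "else level"


# Postcode pairs taken from the table above.
pairs = [
    ("SE1P 0NY", "SE1P 0NY"),
    ("SE2 4UZ", "SE1P 0NY"),
    ("SW14 7PQ", "SW3 9JG"),
    ("N7 8RL", "EC2R 8AH"),
    (None, "SE2 4UZ"),
]
levels = [comparison_level(left, right) for left, right in pairs]
```

Because the quantifier {1,2} is greedy, "N7 8RL" yields the area "N" while "EC2R 8AH" yields "EC", which is why that pair falls through to the else level.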

2. Exact match on initial

In this example we use the .substr function to create a comparison level based on the first letter of a column value.

Note that the substr function is 1-indexed, so the first character is given by substr(1, 1). The first two characters would be given by substr(1, 2).

import splink.comparison_level_library as cll
from splink import ColumnExpression

initial = ColumnExpression("first_name").substr(1, 1)
print(cll.ExactMatchLevel(initial).get_comparison_level("duckdb").sql_condition)

SUBSTRING("first_name_l", 1, 1) = SUBSTRING("first_name_r", 1, 1)
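
Since Python slicing is 0-indexed, the 1-indexed convention can be easy to misread. The following minimal sketch mimics SUBSTRING(value, start, length) in plain Python for start values of 1 or more. The helper sql_substr and the sample names are hypothetical and are not part of Splink.

```python
def sql_substr(value, start, length):
    """Mimic 1-indexed SQL SUBSTRING(value, start, length) for start >= 1."""
    if start < 1:
        raise ValueError("this sketch only handles start >= 1")
    if value is None:
        return None
    begin = start - 1  # shift from 1-indexed to Python's 0-indexed slicing
    return value[begin:begin + length]


initial = sql_substr("Robert", 1, 1)    # first character
first_two = sql_substr("Robert", 1, 2)  # first two characters

# Two different names that share an initial compare as equal on this level.
same_initial = sql_substr("Robert", 1, 1) == sql_substr("Rob", 1, 1)
```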

Additional info

Regular expressions containing “\” (the Python escape character) are tricky to make work with the Spark linker because of escaping, so consider using alternative syntax, for example replacing “\d” with “[0-9]”.
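
As a quick sanity check of that substitution, the sketch below compares a pattern written with “\d” against a backslash-free equivalent using “[0-9]”. It runs in Python's re module, not in Spark, and the pattern itself (postcode area followed by one or two digits) is an illustrative assumption rather than something taken from Splink. For plain ASCII input such as these postcodes, the two character classes match the same characters.

```python
import re

# Illustrative patterns: postcode area followed by one or two digits.
with_escape = r"^[A-Z]{1,2}\d{1,2}"
without_escape = "^[A-Z]{1,2}[0-9]{1,2}"  # no backslash, so nothing to escape

postcodes = ["SE1P 0NY", "SW14 7PQ", "N7 8RL", "EC2R 8AH"]


def extract(pattern, value):
    """Return the matched prefix, or None if the pattern does not match."""
    match = re.search(pattern, value)
    return match.group(0) if match else None


results_with = [extract(with_escape, p) for p in postcodes]
results_without = [extract(without_escape, p) for p in postcodes]
```

Both lists contain the same extracted prefixes, so the backslash-free form can be passed through layers of string escaping without changing what it matches on this data.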

Different regex patterns can achieve the same result with differing efficiency. Consider optimising your regular expressions to improve performance (see here, for example).