Extracting partial strings¶
It can sometimes be useful to make comparisons based on substrings or parts of column values. For example, one approach to comparing postcodes is to consider their constituent components, e.g. area, district, etc (see Featuring Engineering for more details).
We can use functions such as substrings and regular expressions to enable users to compare strings without needing to engineer new features from source data.
Splink supports this functionality via the use of the ComparisonExpression
.
Examples¶
1. Exact match on postcode area¶
Suppose you wish to make comparisons on a postcode column in your data, however only care about finding links between people who share the same area code (given by the first 1 to 2 letters of the postcode). The regular expression to pick out the first two characters is ^[A-Z]{1,2}
:
import splink.comparison_level_library as cll
from splink import ColumnExpression
pc_ce = ColumnExpression("postcode").regex_extract("^[A-Z]{1,2}")
print(cll.ExactMatchLevel(pc_ce).get_comparison_level("duckdb").sql_condition)
NULLIF(regexp_extract("postcode_l", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Z]{1,2}', 0), '')
We may therefore configure a comparison as follows:
from splink.comparison_library import CustomComparison
cc = CustomComparison(
output_column_name="postcode",
comparison_levels=[
cll.NullLevel("postcode"),
cll.ExactMatchLevel(pc_ce),
cll.ElseLevel()
]
)
print(cc.get_comparison("duckdb").human_readable_description)
Comparison 'CustomComparison' of "postcode".
Similarity is assessed using the following ComparisonLevels:
- 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
- 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Z]{1,2}', 0), '')
- 'All other comparisons' with SQL rule: ELSE
person_id_l | person_id_r | postcode_l | postcode_r | comparison_level |
---|---|---|---|---|
7 | 1 | SE1P 0NY | SE1P 0NY | exact match |
5 | 1 | SE2 4UZ | SE1P 0NY | exact match |
9 | 2 | SW14 7PQ | SW3 9JG | exact match |
4 | 8 | N7 8RL | EC2R 8AH | else level |
6 | 3 | SE2 4UZ | null level |
2. Exact match on initial¶
In this example we use the .substr
function to create a comparison level based on the first letter of a column value.
Note that the substr
function is 1-indexed, so the first character is given by substr(1, 1)
: The first two characters would be given by substr(1, 2)
.
import splink.comparison_level_library as cll
from splink import ColumnExpression
initial = ColumnExpression("first_name").substr(1,1)
print(cll.ExactMatchLevel(initial).get_comparison_level("duckdb").sql_condition)
SUBSTRING("first_name_l", 1, 1) = SUBSTRING("first_name_r", 1, 1)
Additional info¶
Regular expressions containing “\” (the python escape character) are tricky to make work with the Spark linker due to escaping so consider using alternative syntax, for example replacing “\d” with “[0-9]”.
Different regex patterns can achieve the same result but with more or less efficiency. You might want to consider optimising your regular expressions to improve performance (see here, for example).