
Feature Engineering for Data Linkage

During record linkage, the features in a given dataset are used to provide evidence as to whether two records are a match. As with any predictive model, the quality of a Splink model is determined by the features it is given.

Below are some examples of features that can be created from common columns, and how to use them to build more detailed comparisons in a Splink model.


Postcodes

In this example, we derive latitude and longitude coordinates from a postcode column to create a more nuanced comparison. By doing so, we account for similarity not just in the string of the postcode, but in the geographical location it represents. This could be useful if we believe, for instance, that people move house, but generally stay within the same geographical area.

We start with a comparison that uses the postcode's components. For example, UK postcodes can be broken down into the following substrings:

Image: UK postcode components, from https://ideal-postcodes.co.uk/guides/uk-postcode-format (see the image source for more details).
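
To make these substrings concrete, here is a small Python sketch (a hypothetical helper, not part of Splink) that extracts the components using the same regular expressions that appear in the generated SQL below:

import re

# Illustrative only: these patterns mirror the regexes Splink
# generates in the SQL shown below
def postcode_components(postcode):
    patterns = {
        "sector": r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]",
        "district": r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?",
        "area": r"^[A-Za-z]{1,2}",
    }
    components = {}
    for name, pattern in patterns.items():
        match = re.match(pattern, postcode)
        components[name] = match.group(0) if match else None
    return components

postcode_components("SW1A 1AA")
# {'sector': 'SW1A 1', 'district': 'SW1A', 'area': 'SW'}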

The pre-built postcode comparison generates a comparison with levels for an exact match on full postcode, sector, district and area in turn.

Example code using the pre-built comparison:

import splink.comparison_library as cl

pc_comparison = cl.PostcodeComparison("postcode").get_comparison("duckdb")
print(pc_comparison.human_readable_description)
Output
Comparison 'PostcodeComparison' of "postcode".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on full postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}', 0), '')
    - 'All other comparisons' with SQL rule: ELSE

Note that, by default, this comparison cannot compute geographical distance, because it does not assume latitude and longitude coordinates are available.

We now derive latitude and longitude columns so that we can take advantage of geographical distance. We will use the ONS Postcode Directory to look up the lat-long coordinates for each postcode.

Read in a dataset with postcodes:

import duckdb

from splink import splink_datasets

df = splink_datasets.historical_50k

# duckdb can query the in-scope pandas DataFrame `df` by name.
# Replace the CSV path below with the location of your ONSPD download.
sql = """
WITH postcode_lookup AS (
    SELECT
        pcd AS postcode,
        lat,
        long
    FROM
        read_csv_auto('./path/to/ONSPD_FEB_2023_UK.csv')
)
SELECT
    df.*,
    postcode_lookup.lat,
    postcode_lookup.long
FROM
    df
LEFT JOIN
    postcode_lookup
ON
    upper(df.postcode_fake) = postcode_lookup.postcode
"""

df_with_postcode = duckdb.sql(sql)
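
As a quick, optional sanity check (not part of the original pipeline), you can count how many rows failed to match the lookup, since unmatched rows are left with NULL coordinates:

# Rows that found no postcode in the lookup will have NULL lat/long
duckdb.sql("SELECT count(*) AS unmatched FROM df_with_postcode WHERE lat IS NULL")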

Now that coordinates have been added, a more detailed postcode comparison can be produced using PostcodeComparison:

pc_comparison = cl.PostcodeComparison(
    "postcode", lat_col="lat", long_col="long", km_thresholds=[1, 10]
).get_comparison("duckdb")
print(pc_comparison.human_readable_description)
Output
Comparison 'PostcodeComparison' of "postcode", "lat" and "long".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
    - 'Distance less than 1km' with SQL rule:
        cast(
            acos(

        case
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) > 1 then 1
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) < -1 then -1
            else (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    )
        end

            ) * 6371
            as float
        )
    <= 1
    - 'Distance less than 10km' with SQL rule:
        cast(
            acos(

        case
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) > 1 then 1
            when (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    ) < -1 then -1
            else (
        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
        cos( radians("lat_l") ) * cos( radians("lat_r") )
            * cos( radians("long_r" - "long_l") )
    )
        end

            ) * 6371
            as float
        )
    <= 10
    - 'All other comparisons' with SQL rule: ELSE
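
For reference, the generated SQL above implements the spherical law of cosines for great-circle distance, clamping the cosine term to [-1, 1] to guard against floating-point error before taking the arccosine:

distance_km = 6371 * acos( sin(lat_l) * sin(lat_r) + cos(lat_l) * cos(lat_r) * cos(long_r - long_l) )

where the latitudes and longitudes are converted to radians and 6371 is the Earth's mean radius in kilometres.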

Alternatively, the same structure can be built manually by using cll.DistanceInKMLevel() in conjunction with other comparison levels:

import splink.comparison_level_library as cll
import splink.comparison_library as cl

custom_postcode_comparison = cl.CustomComparison(
    output_column_name="postcode",
    comparison_description="Postcode",
    comparison_levels=[
        cll.NullLevel("postcode"),
        cll.ExactMatchLevel("postcode"),
        cll.DistanceInKMLevel("lat", "long", 1),
        cll.DistanceInKMLevel("lat", "long", 10),
        cll.DistanceInKMLevel("lat", "long", 50),
        cll.ElseLevel(),
    ],
)
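
As with the pre-built comparisons, the result can be inspected in the same way:

print(custom_postcode_comparison.get_comparison("duckdb").human_readable_description)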

Phonetic transformations

Phonetic transformation algorithms can be used to identify words that sound similar, even if they are spelled differently. These are particularly useful for names and can be used as an additional comparison level within name comparisons.

For a more detailed explanation on phonetic transformation algorithms, see the topic guide.

Example

There are a number of Python packages that support phonetic transformations, which can be applied to a pandas dataframe before it is loaded into the Linker. For example, creating a Double Metaphone column with the phonetics Python library:

import pandas as pd
import phonetics

from splink import splink_datasets
df = splink_datasets.fake_1000

# Define a function to apply the Double Metaphone algorithm to a name,
# passing missing values through as nulls
def dmetaphone_name(name):
    if pd.isna(name):
        return None
    return phonetics.dmetaphone(name)

# Apply the function to the "first_name" and "surname" columns
df["first_name_dm"] = df["first_name"].apply(dmetaphone_name)
df["surname_dm"] = df["surname"].apply(dmetaphone_name)

df.head()
Output
|   | unique_id | first_name | surname | dob | city | email | group | first_name_dm | surname_dm |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Julia |  | 2015-10-29 | London | hannah88@powers.com | 0 | ('JL', 'AL') |  |
| 1 | 1 | Julia | Taylor | 2015-07-31 | London | hannah88@powers.com | 0 | ('JL', 'AL') | ('TLR', '') |
| 2 | 2 | Julia | Taylor | 2016-01-27 | London | hannah88@powers.com | 0 | ('JL', 'AL') | ('TLR', '') |
| 3 | 3 | Julia | Taylor | 2015-10-29 |  | hannah88opowersc@m | 0 | ('JL', 'AL') | ('TLR', '') |
| 4 | 4 | oNah | Watson | 2008-03-23 | Bolton | matthew78@ballard-mcdonald.net | 1 | ('AN', '') | ('ATSN', 'FTSN') |

Note: Soundex and Metaphone are also supported in phonetics
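
These alternatives can be applied in the same way, for example (illustrative; exact encodings depend on the library version):

phonetics.soundex("Watson")
phonetics.metaphone("Watson")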

Now that the dmetaphone columns have been added, they can be used within comparisons. For example, using the NameComparison function from the comparison library.

import splink.comparison_library as cl

comparison = cl.NameComparison("first_name", dmeta_col_name="first_name_dm").get_comparison("duckdb")
print(comparison.human_readable_description)
Output
Comparison 'NameComparison' of "first_name" and "first_name_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92
    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88
    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect("first_name_dm_l", "first_name_dm_r")) >= 1
    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7
    - 'All other comparisons' with SQL rule: ELSE

Full name

If Splink has access to a combined full name column, it can use the term frequency of the full name, as opposed to treating forename and surname as independent.

This can be important because correlations in names are common. For example, in the UK, “Mohammed Khan” is a more common full name than the individual frequencies of "Mohammed" or "Khan" would suggest.

The following example shows how to do this.

For more on term frequency, see the dedicated topic guide.

Example

Derive a full name column:

import pandas as pd

from splink import splink_datasets

df = splink_datasets.fake_1000

df['full_name'] = df['first_name'] + ' ' + df['surname']

df.head()
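
Before building a comparison, you can sanity-check the independence assumption directly (an optional, illustrative check, using names that appear in this dataset):

# If forename and surname were independent, we would expect
# P(full name) to be roughly P(first_name) * P(surname)
p_first = (df["first_name"] == "Julia").mean()
p_surname = (df["surname"] == "Taylor").mean()
p_full = (df["full_name"] == "Julia Taylor").mean()

print(f"independent: {p_first * p_surname:.6f}, observed: {p_full:.6f}")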

Now that the full_name column has been added, it can be used within comparisons. For example, using the ForenameSurnameComparison function from the comparison library.

comparison = cl.ForenameSurnameComparison(
    "first_name", "surname", forename_surname_concat_col_name="full_name"
)
comparison.get_comparison("duckdb").as_dict()
Output
{'output_column_name': 'first_name_surname',
'comparison_levels': [{'sql_condition': '("first_name_l" IS NULL OR "first_name_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)',
'label_for_charts': '(first_name is NULL) AND (surname is NULL)',
'is_null_level': True},
{'sql_condition': '"full_name_l" = "full_name_r"',
'label_for_charts': 'Exact match on full_name',
'tf_adjustment_column': 'full_name',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"first_name_l" = "surname_r" AND "first_name_r" = "surname_l"',
'label_for_charts': 'Match on reversed cols: first_name and surname'},
{'sql_condition': '(jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.92)',
'label_for_charts': '(Jaro-Winkler distance of first_name >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)'},
{'sql_condition': '(jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.88)',
'label_for_charts': '(Jaro-Winkler distance of first_name >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)'},
{'sql_condition': '"surname_l" = "surname_r"',
'label_for_charts': 'Exact match on surname',
'tf_adjustment_column': 'surname',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match on first_name',
'tf_adjustment_column': 'first_name',
'tf_adjustment_weight': 1.0},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'ForenameSurnameComparison'}

Note that the first match level (after the null level) is now:

{'sql_condition': '"full_name_l" = "full_name_r"',
'label_for_charts': 'Exact match on full_name',
'tf_adjustment_column': 'full_name',
'tf_adjustment_weight': 1.0},

whereas without specifying forename_surname_concat_col_name we would have had:

{'sql_condition': '("first_name_l" = "first_name_r") AND ("surname_l" = "surname_r")',
'label_for_charts': '(Exact match on first_name) AND (Exact match on surname)'},
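
Finally, a minimal sketch of wiring this comparison into a model. Note that the derived full_name column must be present in the dataframe passed to the Linker for the term frequency adjustment to work; the blocking rule here is arbitrary and chosen purely for illustration:

from splink import DuckDBAPI, Linker, SettingsCreator, block_on

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[comparison],
    blocking_rules_to_generate_predictions=[block_on("dob")],
)
linker = Linker(df, settings, db_api=DuckDBAPI())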