Skip to content

Linking businesses

Open In Colab

Matching businesses

In this example we attempt to match open business rates data from Stockport (source) with Companies House data (source).

This is quite a challenging scenario because the data is not initially well suited for Splink.

First, there are only three columns to match on:

  • The business name
  • The address
  • The business postcode

Second, the data itself is quite different to many other applications: the business name is an ordred 'bag of words'.

We will therefore need to use a few preprocessing tricks to build a successful matching model/

Let's start by taking a look at the data. To make the size of these files managable, I have selected only rows from the companies house data where substr(postcode, 1,4) matches at least one postcode in the Stockport data, resulting in:

  • Companies house data is around 1 million records
  • Stockport data has 4,540 records

In this example, to make it easy to understand, I break the calculation into multiple steps. Many calculations can be combined into single steps - you can see a more condensed version here.

First, let's take a look at the data:

import duckdb
import altair as alt

df_companies_house = duckdb.read_parquet("https://github.com/RobinL/company_matching_example/raw/refs/heads/main/companies_house.parquet")
df_companies_house.sort("company_name").show(max_rows=5)

df_stockport_business_rates = duckdb.read_parquet("https://github.com/RobinL/company_matching_example/raw/refs/heads/main/stockport_business_rates.parquet")
df_stockport_business_rates.sort("company_name").show(max_rows=5)
┌──────────────────────────────┬────────────────┬─────────────────────────────────────────┬──────────┐
│         company_name         │ company_number │             address_concat              │ postcode │
│           varchar            │    varchar     │                 varchar                 │ varchar  │
├──────────────────────────────┼────────────────┼─────────────────────────────────────────┼──────────┤
│ ! LTD                        │ 08209948       │ 9 PRINCES SQUARE                        │ HG1 1ND  │
│ !BIG IMPACT GRAPHICS LIMITED │ 11743365       │ 372 OLD STREET 335 ROSDEN HOUSE         │ EC1V 9LT │
│ !NFOGENIE LTD                │ 13522064       │ 71-75 SHELTON STREET                    │ WC2H 9JQ │
│       ·                      │    ·           │       ·                                 │    ·     │
│       ·                      │    ·           │       ·                                 │    ·     │
│       ·                      │    ·           │       ·                                 │    ·     │
│ 4TECH MOTO LIMITED           │ 05583123       │ 1 BOND STREET                           │ BB8 9DG  │
│ 4TECH SOFTWARE LIMITED       │ 07415956       │ ELEVEN BRINDLEYPLACE 2 BRUNSWICK SQUARE │ B1 2LP   │
├──────────────────────────────┴────────────────┴─────────────────────────────────────────┴──────────┤
│ ? rows (>9999 rows, 5 shown)                                                             4 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────┬────────────────┬──────────────────────────────────────────────────────────────────────────┬──────────┐
│            company_name             │ company_number │                              address_concat                              │ postcode │
│               varchar               │    varchar     │                                 varchar                                  │ varchar  │
├─────────────────────────────────────┼────────────────┼──────────────────────────────────────────────────────────────────────────┼──────────┤
│ 08 UK LTD                           │ NULL           │ 15 MADOC STREET LLANDUDNO CONWY WALES                                    │ LL30 2TL │
│ 1 GR LTD                            │ NULL           │ 296-296A GORTON ROAD REDDISH STOCKPORT                                   │ SK5 6RN  │
│ 101 CANVAS PRINTS LTD               │ NULL           │ SUITE 44 4TH FLOOR HOULDSWORTH MILL HOULDSWORTH STREET REDDISH STOCKPORT │ SK5 6DA  │
│           ·                         │  ·             │                           ·                                              │    ·     │
│           ·                         │  ·             │                           ·                                              │    ·     │
│           ·                         │  ·             │                           ·                                              │    ·     │
│ ZYGOCITY LTD T/A CRAFTSTAR STENCILS │ NULL           │ UNIT 46 GOYT MILL UPPER HIBBERT LANE MARPLE STOCKPORT                    │ SK6 7HX  │
│ ZYNC GROUP LTD IN LIQUIDATION       │ NULL           │ C/O COWGILLS LIMITED FOURTH FLOOR UNIT 5B THE PARKLANDS BOLTON           │ BL6 4SD  │
├─────────────────────────────────────┴────────────────┴──────────────────────────────────────────────────────────────────────────┴──────────┤
│ 4540 rows (5 shown)                                                                                                              4 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Feature engineering

We will do some feature engineering to make this data more amenable to matching:

  • To match companies, we will tokenise the company name and compute the term frequenices of each token. This will allow us to more accurately quantify how closely two business names match
  • We will also pick out a subset of tokens to give us access to new blocking strategies
  • To match addresses, we will use the first number in the address and the postcode only. This is simpistic - for a more advanced approach to address matching in Splink see here

We'll perform this featuring on both datasets simultanouesly to ensure the token frequencies are representative of all the data

# Start by creating a table with the data vertically concatenated
sql = """
create or replace table all_input_data as
with concat_data as (
    select *, 'stockport' as source_dataset

from df_stockport_business_rates
union all
select *, 'z_all_companies' as source_dataset
from df_companies_house
)
select ROW_NUMBER() OVER () as unique_id, *
from concat_data
"""

duckdb.execute(sql)

duckdb.table("all_input_data").show(max_rows=10, max_width=100000)
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))


┌───────────┬──────────────────────────────┬────────────────┬──────────────────────────────────────────────────────────────────────────┬──────────┬─────────────────┐
│ unique_id │         company_name         │ company_number │                              address_concat                              │ postcode │ source_dataset  │
│   int64   │           varchar            │    varchar     │                                 varchar                                  │ varchar  │     varchar     │
├───────────┼──────────────────────────────┼────────────────┼──────────────────────────────────────────────────────────────────────────┼──────────┼─────────────────┤
│         1 │ 08 UK LTD                    │ NULL           │ 15 MADOC STREET LLANDUDNO CONWY WALES                                    │ LL30 2TL │ stockport       │
│         2 │ 1 GR LTD                     │ NULL           │ 296-296A GORTON ROAD REDDISH STOCKPORT                                   │ SK5 6RN  │ stockport       │
│         3 │ 101 CANVAS PRINTS LTD        │ NULL           │ SUITE 44 4TH FLOOR HOULDSWORTH MILL HOULDSWORTH STREET REDDISH STOCKPORT │ SK5 6DA  │ stockport       │
│         4 │ 121 DIRECT MAIL LTD          │ NULL           │ UNIT 18 CHARTER HOUSE LATHAM CLOSE BREDBURY STOCKPORT                    │ SK6 2SD  │ stockport       │
│         5 │ 121 PERSONAL TRAINING LTD    │ NULL           │ 198 WILMSLOW ROAD HEALD GREEN STOCKPORT                                  │ SK8 3BH  │ stockport       │
│         · │             ·                │  ·             │              ·                                                           │    ·     │     ·           │
│         · │             ·                │  ·             │              ·                                                           │    ·     │     ·           │
│         · │             ·                │  ·             │              ·                                                           │    ·     │     ·           │
│      9996 │ AL-MAALIK LOCUM SERVICES LTD │ 14351700       │ 239 MOSELEY ROAD LEVENSHULME                                             │ M19 2LJ  │ z_all_companies │
│      9997 │ AL-MAALIK PHARMACIES LTD     │ 14421852       │ 239 MOSELEY ROAD LEVENSHULME                                             │ M19 2LJ  │ z_all_companies │
│      9998 │ AL-MAALIK PROPERTIES LTD     │ 14412454       │ 1041 STOCKPORT ROAD                                                      │ M19 2TF  │ z_all_companies │
│      9999 │ AL-MADAD OUTREACH LTD        │ 14827390       │ 124-128 CITY ROAD                                                        │ EC1V 2NX │ z_all_companies │
│     10000 │ AL-MADANI LTD                │ 14415359       │ 71-75 SHELTON STREET COVENT GARDEN                                       │ WC2H 9JQ │ z_all_companies │
├───────────┴──────────────────────────────┴────────────────┴──────────────────────────────────────────────────────────────────────────┴──────────┴─────────────────┤
│ ? rows (>9999 rows, 10 shown)                                                                                                                           6 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
# Tokenize company names, and explode tokens across rows
# The tokenisation approach here is simplistic, more advanced strategies could be used
# in a more fully-fledged application


sql = """
CREATE OR REPLACE TABLE unnested AS
SELECT
    unique_id,
    unnest(regexp_split_to_array(upper(trim(company_name)), '\\s+')) as name_token,
    generate_subscripts(regexp_split_to_array(upper(trim(company_name)), '\\s+'), 1) as token_position_in_name
FROM all_input_data
"""

unnested = duckdb.execute(sql)
duckdb.table("unnested").show(max_rows=10, max_width=100000)
┌───────────┬────────────┬────────────────────────┐
│ unique_id │ name_token │ token_position_in_name │
│   int64   │  varchar   │         int64          │
├───────────┼────────────┼────────────────────────┤
│         1 │ 08         │                      1 │
│         1 │ UK         │                      2 │
│         1 │ LTD        │                      3 │
│         2 │ 1          │                      1 │
│         2 │ GR         │                      2 │
│         · │ ·          │                      · │
│         · │ ·          │                      · │
│         · │ ·          │                      · │
│      2619 │ MEDATECH   │                      1 │
│      2619 │ UK         │                      2 │
│      2619 │ LTD        │                      3 │
│      2620 │ MEDIA      │                      1 │
│      2620 │ +          │                      2 │
├───────────┴────────────┴────────────────────────┤
│ ? rows (>9999 rows, 10 shown)         3 columns │
└─────────────────────────────────────────────────┘
# Compute token relative frequencies
sql = """
CREATE OR REPLACE TABLE token_frequencies AS
SELECT
    name_token as token,
    count(*)::float/(select count(*) from unnested) as rel_freq
FROM unnested
GROUP BY token
ORDER BY rel_freq DESC

"""

# Execute and show results
token_frequencies = duckdb.execute(sql)
duckdb.table("token_frequencies").show(max_rows=10, max_width=100000)
┌────────────┬───────────────┐
│   token    │   rel_freq    │
│  varchar   │     float     │
├────────────┼───────────────┤
│ LTD        │    0.14284134 │
│ LIMITED    │    0.13450336 │
│ SERVICES   │   0.012372046 │
│ &          │    0.01089211 │
│ MANAGEMENT │  0.0076835644 │
│   ·        │        ·      │
│   ·        │        ·      │
│   ·        │        ·      │
│ MENDIP     │ 6.0517837e-06 │
│ MVP        │ 6.0517837e-06 │
│ FINGERS    │ 6.0517837e-06 │
│ GRADUATE   │ 6.0517837e-06 │
│ REACT      │ 6.0517837e-06 │
├────────────┴───────────────┤
│ ? rows           2 columns │
└────────────────────────────┘
# Reconstruct the company name as a list of tokens with their relative frequency
# This creates a new column called name_tokens_with_freq
# Example: 101 CANVAS PRINTS LTD  becomes
# [
#  {'token': 101, 'rel_freq': 3.19e-05},
#  {'token': CANVAS, 'rel_freq': 2.20e-05},
#  {'token': PRINTS, 'rel_freq': 3.16e-05},
#  {'token': LTD, 'rel_freq': 0.142}
# ]
sql = """
CREATE OR REPLACE TABLE input_data_with_tokens AS
WITH
tokens_with_freq AS (
    SELECT
        m.unique_id,
        -- Create a list of structs containing each token and its frequency
        list_transform(
            list_zip(
                array_agg(u.name_token ORDER BY u.token_position_in_name),
                array_agg(COALESCE(tf.rel_freq, 0.0) ORDER BY u.token_position_in_name)
            ),
            x -> struct_pack(token := x[1], rel_freq := x[2])
        ) as name_tokens_with_freq
    FROM all_input_data m
    JOIN unnested u ON m.unique_id = u.unique_id
    LEFT JOIN token_frequencies tf ON u.name_token = tf.token
    GROUP BY m.unique_id
)
SELECT
    m.*,
    t.name_tokens_with_freq
FROM all_input_data m
LEFT JOIN tokens_with_freq t ON m.unique_id = t.unique_id
order by m.unique_id
"""

# Execute and show results
duckdb.execute(sql)
duckdb.table("input_data_with_tokens").show(max_rows=10, max_width=200)
┌───────────┬──────────────────────┬────────────────┬──────────────────────┬──────────┬─────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────┐
│ unique_id │     company_name     │ company_number │    address_concat    │ postcode │ source_dataset  │                                    name_tokens_with_freq                                     │
│   int64   │       varchar        │    varchar     │       varchar        │ varchar  │     varchar     │                           struct(token varchar, rel_freq float)[]                            │
├───────────┼──────────────────────┼────────────────┼──────────────────────┼──────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│         1 │ 08 UK LTD            │ NULL           │ 15 MADOC STREET LL…  │ LL30 2TL │ stockport       │ [{'token': 08, 'rel_freq': 2.2006486e-06}, {'token': UK, 'rel_freq': 0.0059337737}, {'toke…  │
│         2 │ 1 GR LTD             │ NULL           │ 296-296A GORTON RO…  │ SK5 6RN  │ stockport       │ [{'token': 1, 'rel_freq': 0.000688803}, {'token': GR, 'rel_freq': 2.3656972e-05}, {'token'…  │
│         3 │ 101 CANVAS PRINTS …  │ NULL           │ SUITE 44 4TH FLOOR…  │ SK5 6DA  │ stockport       │ [{'token': 101, 'rel_freq': 3.1909403e-05}, {'token': CANVAS, 'rel_freq': 2.2006485e-05}, …  │
│         4 │ 121 DIRECT MAIL LTD  │ NULL           │ UNIT 18 CHARTER HO…  │ SK6 2SD  │ stockport       │ [{'token': 121, 'rel_freq': 1.7055027e-05}, {'token': DIRECT, 'rel_freq': 0.00050367345}, …  │
│         5 │ 121 PERSONAL TRAIN…  │ NULL           │ 198 WILMSLOW ROAD …  │ SK8 3BH  │ stockport       │ [{'token': 121, 'rel_freq': 1.7055027e-05}, {'token': PERSONAL, 'rel_freq': 8.8851186e-05}…  │
│         · │          ·           │  ·             │          ·           │    ·     │     ·           │                                              ·                                               │
│         · │          ·           │  ·             │          ·           │    ·     │     ·           │                                              ·                                               │
│         · │          ·           │  ·             │          ·           │    ·     │     ·           │                                              ·                                               │
│      9996 │ AL-MAALIK LOCUM SE…  │ 14351700       │ 239 MOSELEY ROAD L…  │ M19 2LJ  │ z_all_companies │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': LOCUM, 'rel_freq': 6.6844696e-0…  │
│      9997 │ AL-MAALIK PHARMACI…  │ 14421852       │ 239 MOSELEY ROAD L…  │ M19 2LJ  │ z_all_companies │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': PHARMACIES, 'rel_freq': 4.67637…  │
│      9998 │ AL-MAALIK PROPERTI…  │ 14412454       │ 1041 STOCKPORT ROAD  │ M19 2TF  │ z_all_companies │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': PROPERTIES, 'rel_freq': 0.00506…  │
│      9999 │ AL-MADAD OUTREACH …  │ 14827390       │ 124-128 CITY ROAD    │ EC1V 2NX │ z_all_companies │ [{'token': AL-MADAD, 'rel_freq': 2.7508108e-07}, {'token': OUTREACH, 'rel_freq': 1.9530757…  │
│     10000 │ AL-MADANI LTD        │ 14415359       │ 71-75 SHELTON STRE…  │ WC2H 9JQ │ z_all_companies │ [{'token': AL-MADANI, 'rel_freq': 2.7508108e-07}, {'token': LTD, 'rel_freq': 0.14284134}]    │
├───────────┴──────────────────────┴────────────────┴──────────────────────┴──────────┴─────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ? rows (>9999 rows, 10 shown)                                                                                                                                                              7 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
# Derive rarest tokens - these will be used for blocking
# This would be simpler if Duckdb list sort supported a lambda function
# but it doesn't, so we have to unnest the list and sort manually
# https://duckdb.org/docs/sql/functions/list.html#sorting-lists
sql = """
with
tokens_unnested as (
    SELECT
        unique_id,
        unnest(name_tokens_with_freq) as token_info
    FROM input_data_with_tokens
),
rare_tokens as (
    SELECT
        unique_id,
        array_agg(token_info.token ORDER BY token_info.rel_freq ASC)[:2] as rarest_tokens
    FROM tokens_unnested
    WHERE token_info.rel_freq < 0.01
    GROUP BY unique_id
)
select m.*, rarest_tokens
from input_data_with_tokens as m
left join rare_tokens on m.unique_id = rare_tokens.unique_id
order by m.unique_id
"""
input_data_with_tokens_and_rarest_tokens = duckdb.sql(sql)
input_data_with_tokens_and_rarest_tokens.show(max_rows=10, max_width=1000)
┌───────────┬──────────────────────────────┬────────────────┬──────────────────────────────────────────────────────────────────────────┬──────────┬─────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────┐
│ unique_id │         company_name         │ company_number │                              address_concat                              │ postcode │ source_dataset  │                                                                                name_tokens_with_freq                                                                                │      rarest_tokens      │
│   int64   │           varchar            │    varchar     │                                 varchar                                  │ varchar  │     varchar     │                                                                       struct(token varchar, rel_freq float)[]                                                                       │        varchar[]        │
├───────────┼──────────────────────────────┼────────────────┼──────────────────────────────────────────────────────────────────────────┼──────────┼─────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────┤
│         1 │ 08 UK LTD                    │ NULL           │ 15 MADOC STREET LLANDUDNO CONWY WALES                                    │ LL30 2TL │ stockport       │ [{'token': 08, 'rel_freq': 2.2006486e-06}, {'token': UK, 'rel_freq': 0.0059337737}, {'token': LTD, 'rel_freq': 0.14284134}]                                                         │ [08, UK]                │
│         2 │ 1 GR LTD                     │ NULL           │ 296-296A GORTON ROAD REDDISH STOCKPORT                                   │ SK5 6RN  │ stockport       │ [{'token': 1, 'rel_freq': 0.000688803}, {'token': GR, 'rel_freq': 2.3656972e-05}, {'token': LTD, 'rel_freq': 0.14284134}]                                                           │ [GR, 1]                 │
│         3 │ 101 CANVAS PRINTS LTD        │ NULL           │ SUITE 44 4TH FLOOR HOULDSWORTH MILL HOULDSWORTH STREET REDDISH STOCKPORT │ SK5 6DA  │ stockport       │ [{'token': 101, 'rel_freq': 3.1909403e-05}, {'token': CANVAS, 'rel_freq': 2.2006485e-05}, {'token': PRINTS, 'rel_freq': 3.163432e-05}, {'token': LTD, 'rel_freq': 0.14284134}]      │ [CANVAS, PRINTS]        │
│         4 │ 121 DIRECT MAIL LTD          │ NULL           │ UNIT 18 CHARTER HOUSE LATHAM CLOSE BREDBURY STOCKPORT                    │ SK6 2SD  │ stockport       │ [{'token': 121, 'rel_freq': 1.7055027e-05}, {'token': DIRECT, 'rel_freq': 0.00050367345}, {'token': MAIL, 'rel_freq': 1.5679621e-05}, {'token': LTD, 'rel_freq': 0.14284134}]       │ [MAIL, 121]             │
│         5 │ 121 PERSONAL TRAINING LTD    │ NULL           │ 198 WILMSLOW ROAD HEALD GREEN STOCKPORT                                  │ SK8 3BH  │ stockport       │ [{'token': 121, 'rel_freq': 1.7055027e-05}, {'token': PERSONAL, 'rel_freq': 8.8851186e-05}, {'token': TRAINING, 'rel_freq': 0.00092289696}, {'token': LTD, 'rel_freq': 0.14284134}] │ [121, PERSONAL]         │
│         · │             ·                │  ·             │              ·                                                           │    ·     │     ·           │                                                                                          ·                                                                                          │        ·                │
│         · │             ·                │  ·             │              ·                                                           │    ·     │     ·           │                                                                                          ·                                                                                          │        ·                │
│         · │             ·                │  ·             │              ·                                                           │    ·     │     ·           │                                                                                          ·                                                                                          │        ·                │
│      9996 │ AL-MAALIK LOCUM SERVICES LTD │ 14351700       │ 239 MOSELEY ROAD LEVENSHULME                                             │ M19 2LJ  │ z_all_companies │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': LOCUM, 'rel_freq': 6.6844696e-05}, {'token': SERVICES, 'rel_freq': 0.012372046}, {'token': LTD, 'rel_freq': 0.14284134}] │ [AL-MAALIK, LOCUM]      │
│      9997 │ AL-MAALIK PHARMACIES LTD     │ 14421852       │ 239 MOSELEY ROAD LEVENSHULME                                             │ M19 2LJ  │ z_all_companies │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': PHARMACIES, 'rel_freq': 4.676378e-06}, {'token': LTD, 'rel_freq': 0.14284134}]                                           │ [AL-MAALIK, PHARMACIES] │
│      9998 │ AL-MAALIK PROPERTIES LTD     │ 14412454       │ 1041 STOCKPORT ROAD                                                      │ M19 2TF  │ z_all_companies │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': PROPERTIES, 'rel_freq': 0.005062042}, {'token': LTD, 'rel_freq': 0.14284134}]                                            │ [AL-MAALIK, PROPERTIES] │
│      9999 │ AL-MADAD OUTREACH LTD        │ 14827390       │ 124-128 CITY ROAD                                                        │ EC1V 2NX │ z_all_companies │ [{'token': AL-MADAD, 'rel_freq': 2.7508108e-07}, {'token': OUTREACH, 'rel_freq': 1.9530757e-05}, {'token': LTD, 'rel_freq': 0.14284134}]                                            │ [AL-MADAD, OUTREACH]    │
│     10000 │ AL-MADANI LTD                │ 14415359       │ 71-75 SHELTON STREET COVENT GARDEN                                       │ WC2H 9JQ │ z_all_companies │ [{'token': AL-MADANI, 'rel_freq': 2.7508108e-07}, {'token': LTD, 'rel_freq': 0.14284134}]                                                                                           │ [AL-MADANI]             │
├───────────┴──────────────────────────────┴────────────────┴──────────────────────────────────────────────────────────────────────────┴──────────┴─────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────┤
│ ? rows (>9999 rows, 10 shown)                                                                                                                                                                                                                                                                                                                                           8 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
# Finally we're going to extract the 'house number' from the address, which, combined with the postcode will be used for matching

sql = """
CREATE OR REPLACE TABLE data_for_matching AS
select
    unique_id,
    source_dataset,
    company_name,
    company_number,
    COALESCE(
        REGEXP_EXTRACT(address_concat, '(\\d+[A-Z]?)'),
        REGEXP_EXTRACT(address_concat, '(\\S+)(?=\\s+HOUSE)')
    ) AS first_num_in_address,
    postcode,
    name_tokens_with_freq,
    rarest_tokens
from input_data_with_tokens_and_rarest_tokens

"""
duckdb.execute(sql)
duckdb.table("data_for_matching").show(max_rows=10, max_width=400)
┌───────────┬─────────────────┬──────────────────────────────┬────────────────┬──────────────────────┬──────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────┐
│ unique_id │ source_dataset  │         company_name         │ company_number │ first_num_in_address │ postcode │                                                                                name_tokens_with_freq                                                                                │      rarest_tokens      │
│   int64   │     varchar     │           varchar            │    varchar     │       varchar        │ varchar  │                                                                       struct(token varchar, rel_freq float)[]                                                                       │        varchar[]        │
├───────────┼─────────────────┼──────────────────────────────┼────────────────┼──────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────┤
│         1 │ stockport       │ 08 UK LTD                    │ NULL           │ 15                   │ LL30 2TL │ [{'token': 08, 'rel_freq': 2.2006486e-06}, {'token': UK, 'rel_freq': 0.0059337737}, {'token': LTD, 'rel_freq': 0.14284134}]                                                         │ [08, UK]                │
│         2 │ stockport       │ 1 GR LTD                     │ NULL           │ 296                  │ SK5 6RN  │ [{'token': 1, 'rel_freq': 0.000688803}, {'token': GR, 'rel_freq': 2.3656972e-05}, {'token': LTD, 'rel_freq': 0.14284134}]                                                           │ [GR, 1]                 │
│         3 │ stockport       │ 101 CANVAS PRINTS LTD        │ NULL           │ 44                   │ SK5 6DA  │ [{'token': 101, 'rel_freq': 3.1909403e-05}, {'token': CANVAS, 'rel_freq': 2.2006485e-05}, {'token': PRINTS, 'rel_freq': 3.163432e-05}, {'token': LTD, 'rel_freq': 0.14284134}]      │ [CANVAS, PRINTS]        │
│         4 │ stockport       │ 121 DIRECT MAIL LTD          │ NULL           │ 18                   │ SK6 2SD  │ [{'token': 121, 'rel_freq': 1.7055027e-05}, {'token': DIRECT, 'rel_freq': 0.00050367345}, {'token': MAIL, 'rel_freq': 1.5679621e-05}, {'token': LTD, 'rel_freq': 0.14284134}]       │ [MAIL, 121]             │
│         5 │ stockport       │ 121 PERSONAL TRAINING LTD    │ NULL           │ 198                  │ SK8 3BH  │ [{'token': 121, 'rel_freq': 1.7055027e-05}, {'token': PERSONAL, 'rel_freq': 8.8851186e-05}, {'token': TRAINING, 'rel_freq': 0.00092289696}, {'token': LTD, 'rel_freq': 0.14284134}] │ [121, PERSONAL]         │
│         · │     ·           │             ·                │  ·             │  ·                   │    ·     │                                                                                          ·                                                                                          │        ·                │
│         · │     ·           │             ·                │  ·             │  ·                   │    ·     │                                                                                          ·                                                                                          │        ·                │
│         · │     ·           │             ·                │  ·             │  ·                   │    ·     │                                                                                          ·                                                                                          │        ·                │
│      9996 │ z_all_companies │ AL-MAALIK LOCUM SERVICES LTD │ 14351700       │ 239                  │ M19 2LJ  │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': LOCUM, 'rel_freq': 6.6844696e-05}, {'token': SERVICES, 'rel_freq': 0.012372046}, {'token': LTD, 'rel_freq': 0.14284134}] │ [AL-MAALIK, LOCUM]      │
│      9997 │ z_all_companies │ AL-MAALIK PHARMACIES LTD     │ 14421852       │ 239                  │ M19 2LJ  │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': PHARMACIES, 'rel_freq': 4.676378e-06}, {'token': LTD, 'rel_freq': 0.14284134}]                                           │ [AL-MAALIK, PHARMACIES] │
│      9998 │ z_all_companies │ AL-MAALIK PROPERTIES LTD     │ 14412454       │ 1041                 │ M19 2TF  │ [{'token': AL-MAALIK, 'rel_freq': 8.252432e-07}, {'token': PROPERTIES, 'rel_freq': 0.005062042}, {'token': LTD, 'rel_freq': 0.14284134}]                                            │ [AL-MAALIK, PROPERTIES] │
│      9999 │ z_all_companies │ AL-MADAD OUTREACH LTD        │ 14827390       │ 124                  │ EC1V 2NX │ [{'token': AL-MADAD, 'rel_freq': 2.7508108e-07}, {'token': OUTREACH, 'rel_freq': 1.9530757e-05}, {'token': LTD, 'rel_freq': 0.14284134}]                                            │ [AL-MADAD, OUTREACH]    │
│     10000 │ z_all_companies │ AL-MADANI LTD                │ 14415359       │ 71                   │ WC2H 9JQ │ [{'token': AL-MADANI, 'rel_freq': 2.7508108e-07}, {'token': LTD, 'rel_freq': 0.14284134}]                                                                                           │ [AL-MADANI]             │
├───────────┴─────────────────┴──────────────────────────────┴────────────────┴──────────────────────┴──────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────┤
│ ? rows (>9999 rows, 10 shown)                                                                                                                                                                                                                                                                                       8 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We've now prepared our data for matching. We'll use a fairly straightforward matching model with one innovation: we need to define a comparison that is able to exploit the information in name_tokens_with_freq effectively.

The approach we're going to take is to define similarity as the number of tokens in common between two business names, adjusting for term frequency. This can be computed by multiplying together the token frequencies of the matching tokens.

We will create a comparison level for different thresholds of this value.

def calculate_tf_product_array_sql(token_rel_freq_array_name):

    return f"""
    list_intersect({token_rel_freq_array_name}_l, {token_rel_freq_array_name}_r)
        .list_transform(x -> x.rel_freq::float)
        .list_concat([1.0::FLOAT]) -- in case there are no matches
        .list_reduce((p, q) -> p * q)
    """
# Prepare our two input datasets

df_stockport = duckdb.sql(
    "select * from data_for_matching where source_dataset = 'stockport'"
)

df_all_companies = duckdb.sql(
    "select * from data_for_matching where source_dataset = 'z_all_companies'"
)
from splink import DuckDBAPI, Linker, SettingsCreator, block_on
import splink.comparison_library as cl
con = duckdb.connect(":default:")
db_api = DuckDBAPI(connection=con)


settings = SettingsCreator(
    link_type="link_only",
    unique_id_column_name="unique_id",
    probability_two_random_records_match=1/1e6,
    comparisons=[
        {
            "output_column_name": "name_tokens_with_freq",
            "comparison_levels": [
                {
                    "sql_condition": '"name_tokens_with_freq_l" IS NULL OR "name_tokens_with_freq_r" IS NULL',
                    "label_for_charts": "name_tokens_with_freq is NULL",
                    "is_null_level": True,
                },
                {
                    "sql_condition": f"""
                    {calculate_tf_product_array_sql("name_tokens_with_freq")} < 1e-12
                    """,
                    "label_for_charts": "Array product is less than 1e-10",
                },
                {
                    "sql_condition": f"""
                    {calculate_tf_product_array_sql("name_tokens_with_freq")} < 1e-10
                    """,
                    "label_for_charts": "Array product is less than 1e-10",
                },
                {
                    "sql_condition": f"""
                    {calculate_tf_product_array_sql("name_tokens_with_freq")} < 1e-8
                    """,
                    "label_for_charts": "Array product is less than 1e-8",
                },
                {
                    "sql_condition": f"""
                    {calculate_tf_product_array_sql("name_tokens_with_freq")} < 1e-6
                    """,
                    "label_for_charts": "Array product is less than 1e-6",
                },
                {
                    "sql_condition": f"""
                    {calculate_tf_product_array_sql("name_tokens_with_freq")} < 1e-4
                    """,
                    "label_for_charts": "Array product is less than 1e-4",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
            "comparison_description": "ExactMatch",
        },
        cl.PostcodeComparison("postcode"),
        cl.ExactMatch("first_num_in_address"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("postcode"),
        block_on("rarest_tokens", "substr(postcode,1,3)"),
        "l.rarest_tokens[1] = r.rarest_tokens[2] and substr(l.company_name,1,3) = substr(r.company_name,1,3)",
        "l.rarest_tokens[2] = r.rarest_tokens[1] and substr(l.company_name,1,3) = substr(r.company_name,1,3)",
        block_on("company_name"),
    ],
    additional_columns_to_retain=["company_name"],
    retain_intermediate_calculation_columns=True,
    retain_matching_columns=True,
)

linker = Linker([df_stockport, df_all_companies], settings, db_api)
# Probably not worth training m values given how few columns we have;
# by not training them we'll be using the defaults

linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
linker.visualisations.match_weights_chart()
----- Estimating u probabilities using random sampling -----



FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))



Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - name_tokens_with_freq (no m values are trained).
    - postcode (no m values are trained).
    - first_num_in_address (no m values are trained).

png

# Get some results

df_predictions = linker.inference.predict(threshold_match_weight=-5)



# Since the companies house data is a canonical list, here we're taking the
# best matching company for each Stockport business
sql = f"""
WITH ranked_matches AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY unique_id_l ORDER BY match_weight DESC) as rank
    FROM {df_predictions.physical_name}
),
best_match as (
    select * from ranked_matches
    where rank = 1
    order by match_weight desc
),
matched_stockport_ids as (
    select distinct unique_id_l
    from best_match
)
select * from best_match

order by match_probability desc
"""

ranked_matches = con.sql(sql)
ranked_matches.show(max_rows=10, max_width=400)
Blocking time: 0.18 seconds



FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))


Predict time: 3.65 seconds

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'name_tokens_with_freq':
    m values not fully trained
Comparison: 'postcode':
    m values not fully trained
Comparison: 'first_num_in_address':
    m values not fully trained


┌─────────────────────┬─────────────────────┬──────────────────┬──────────────────┬─────────────┬─────────────┬──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬───────────┬───────┐
│    match_weight     │  match_probability  │ source_dataset_l │ source_dataset_r │ unique_id_l │ unique_id_r │ name_tokens_with_f…  │ name_tokens_with_f…  │ gamma_name_tokens_…  │ bf_name_tokens_wit…  │ … │ first_num_in_addre…  │ gamma_first_num_in…  │ bf_first_num_in_ad…  │    company_name_l    │    company_name_r    │   rarest_tokens_l    │   rarest_tokens_r    │ match_key │ rank  │
│       double        │       double        │     varchar      │     varchar      │    int64    │    int64    │ struct(token varch…  │ struct(token varch…  │        int32         │        double        │   │       varchar        │        int32         │        double        │       varchar        │       varchar        │      varchar[]       │      varchar[]       │  varchar  │ int64 │
├─────────────────────┼─────────────────────┼──────────────────┼──────────────────┼─────────────┼─────────────┼──────────────────────┼──────────────────────┼──────────────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┼──────────────────────┼──────────────────────┼──────────────────────┼──────────────────────┼───────────┼───────┤
│  21.416127918577743 │  0.9999996426424596 │ stockport        │ z_all_companies  │          80 │      828925 │ [{'token': ACOUSTI…  │ [{'token': ACOUSTI…  │                    5 │    5881782.930380143 │ … │ 316                  │                    1 │   31.128138879477106 │ ACOUSTIC AND ENGIN…  │ ACOUSTIC AND ENGIN…  │ [ACOUSTIC, CONSULT…  │ [ACOUSTIC, CONSULT…  │ 0         │     1 │
│  21.416127918577743 │  0.9999996426424596 │ stockport        │ z_all_companies  │        2323 │      487414 │ [{'token': LA, 're…  │ [{'token': LA, 're…  │                    5 │    5881782.930380143 │ … │ 29                   │                    1 │   31.128138879477106 │ LA BELLE PRESTIGE …  │ LA BELLE PRESTIGE …  │ [BELLE, LA]          │ [BELLE, LA]          │ 0         │     1 │
│  21.416127918577743 │  0.9999996426424596 │ stockport        │ z_all_companies  │        1928 │       33589 │ [{'token': HM, 're…  │ [{'token': HM, 're…  │                    5 │    5881782.930380143 │ … │ 27                   │                    1 │   31.128138879477106 │ HM ATKINS CLOTHING…  │ HM ATKINS CLOTHING…  │ [ATKINS, HM]         │ [ATKINS, HM]         │ 0         │     1 │
│  21.416127918577743 │  0.9999996426424596 │ stockport        │ z_all_companies  │         643 │      686474 │ [{'token': BURTON,…  │ [{'token': BURTON,…  │                    5 │    5881782.930380143 │ … │                      │                    1 │   31.128138879477106 │ BURTON ROOFING MER…  │ BURTON ROOFING MER…  │ [MERCHANTS, BURTON]  │ [MERCHANTS, BURTON]  │ 0         │     1 │
│  21.416127918577743 │  0.9999996426424596 │ stockport        │ z_all_companies  │          58 │      599776 │ [{'token': ABCETA,…  │ [{'token': ABCETA,…  │                    5 │    5881782.930380143 │ … │ 19                   │                    1 │   31.128138879477106 │ ABCETA PLAYTHINGS …  │ ABCETA PLAYTHINGS …  │ [ABCETA, PLAYTHINGS] │ [ABCETA, PLAYTHINGS] │ 0         │     1 │
│           ·         │           ·         │     ·            │        ·         │           · │         ·   │          ·           │          ·           │                    · │            ·         │ · │ ·                    │                    · │            ·         │          ·           │          ·           │         ·            │         ·            │ ·         │     · │
│           ·         │           ·         │     ·            │        ·         │           · │         ·   │          ·           │          ·           │                    · │            ·         │ · │ ·                    │                    · │            ·         │          ·           │          ·           │         ·            │         ·            │ ·         │     · │
│           ·         │           ·         │     ·            │        ·         │           · │         ·   │          ·           │          ·           │                    · │            ·         │ · │ ·                    │                    · │            ·         │          ·           │          ·           │         ·            │         ·            │ ·         │     · │
│ -3.4705625254057795 │ 0.08274585730410321 │ stockport        │ z_all_companies  │        3079 │      355661 │ [{'token': PERPETU…  │ [{'token': PERPETU…  │                    5 │    5881782.930380143 │ … │                      │                    0 │  0.05157398705697865 │ PERPETUAL WATCH SE…  │ PERPETUAL WATCH SE…  │ [PERPETUAL, WATCH]   │ [PERPETUAL, WATCH]   │ 4         │     1 │
│  -4.644182870808291 │ 0.03845316514889586 │ stockport        │ z_all_companies  │        2140 │      634191 │ [{'token': JAMES, …  │ [{'token': JAMES, …  │                    2 │   270.89098005456975 │ … │ 4                    │                    1 │   31.128138879477106 │ JAMES RETAIL LIMITED │ JAMES RETAIL SOHAM…  │ [JAMES, RETAIL]      │ [SOHAM, JAMES]       │ 2         │     1 │
│  -4.644182870808291 │ 0.03845316514889586 │ stockport        │ z_all_companies  │        3986 │     1036621 │ [{'token': TGL, 'r…  │ [{'token': TGL, 'r…  │                    2 │   270.89098005456975 │ … │                      │                    1 │   31.128138879477106 │ TGL SOLUTIONS LTD    │ TGL SOLUTIONS LIMI…  │ [TGL, SOLUTIONS]     │ [TGL, SOLUTIONS]     │ 1         │     1 │
│  -4.644182870808291 │ 0.03845316514889586 │ stockport        │ z_all_companies  │        1230 │      617874 │ [{'token': DOM, 'r…  │ [{'token': DOM, 'r…  │                    2 │   270.89098005456975 │ … │ 26                   │                    1 │   31.128138879477106 │ DOM PROPERTIES LTD   │ DOM PROPERTIES LIM…  │ [DOM, PROPERTIES]    │ [DOM, PROPERTIES]    │ 1         │     1 │
│  -4.644182870808291 │ 0.03845316514889586 │ stockport        │ z_all_companies  │        3657 │      508179 │ [{'token': SIMULAT…  │ [{'token': SIMULAT…  │                    2 │   270.89098005456975 │ … │ 10                   │                    1 │   31.128138879477106 │ SIMULATION SOLUTIO…  │ SIMULATION SOLUTIO…  │ [SIMULATION, SOLUT…  │ [SIMULATION, SOLUT…  │ 1         │     1 │
├─────────────────────┴─────────────────────┴──────────────────┴──────────────────┴─────────────┴─────────────┴──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴───────────┴───────┤
│ 2757 rows (10 shown)                                                                                                                                                                                                                                                                                                                                                       24 columns (19 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
sql = """
CREATE OR REPLACE TABLE comparison AS
select * from ranked_matches
where company_name_l != company_name_r
order by random()
limit 1
"""
duckdb.execute(sql)


sql = """
SELECT match_probability, COLUMNS('^(t[^f_].*|b[^f_].*|[^tb].*_l)$') AS '\\1'
FROM comparison
UNION ALL
SELECT match_probability, COLUMNS('^(t[^f_].*|b[^f_].*|[^tb].*_r)$') AS '\\1'
FROM comparison;
"""
duckdb.sql(sql).show(max_rows=10, max_width=200)

recs = duckdb.table("comparison").df().to_dict(orient="records")
linker.visualisations.waterfall_chart(recs)
┌────────────────────┬──────────────────┬─────────────┬─────────────────────────────────────────────────┬────────────┬────────────────────────┬────────────────────────────────┬───────────────────────┐
│ match_probability  │ source_dataset_l │ unique_id_l │             name_tokens_with_freq_l             │ postcode_l │ first_num_in_address_l │         company_name_l         │    rarest_tokens_l    │
│       double       │     varchar      │    int64    │     struct(token varchar, rel_freq float)[]     │  varchar   │        varchar         │            varchar             │       varchar[]       │
├────────────────────┼──────────────────┼─────────────┼─────────────────────────────────────────────────┼────────────┼────────────────────────┼────────────────────────────────┼───────────────────────┤
│ 0.9998652868923441 │ stockport        │         872 │ [{'token': CLARITAS, 'rel_freq': 1.9255674e-0…  │ SK4 4DT    │ 10                     │ CLARITAS WEALTH MANAGMENT LTD  │ [CLARITAS, MANAGMENT] │
│ 0.9998652868923441 │ z_all_companies  │      537567 │ [{'token': CLARITAS, 'rel_freq': 1.9255674e-0…  │ SK4 4DT    │ 10                     │ CLARITAS WEALTH MANAGEMENT LTD │ [CLARITAS, WEALTH]    │
└────────────────────┴──────────────────┴─────────────┴─────────────────────────────────────────────────┴────────────┴────────────────────────┴────────────────────────────────┴───────────────────────┘

png

import altair as alt
# Plot histogram of match weights

# This list does not include any stockport records where no comparison pairs were made
# because blocking rules did not find any potential matches

# Add in unmatched records
sql = """
WITH matched_stockport_ids as (
    select distinct unique_id_l
    from ranked_matches
)
select
    -10 as match_weight,
    0.0 as match_probability,
from data_for_matching t
where t.source_dataset = 'stockport'
and t.unique_id not in (select unique_id_l from matched_stockport_ids)
"""

unmatched_records = duckdb.sql(sql)
unmatched_records.show(max_rows=10, max_width=400)


sql = """
select match_probability, match_weight from ranked_matches
union all
select match_probability, match_weight from unmatched_records
order by match_probability desc
"""

all_records = con.sql(sql)

# Create the histogram
chart = (
    alt.Chart(all_records.df())
    .mark_bar()
    .encode(
        alt.X("match_weight:Q", bin=alt.Bin(maxbins=50), title="Match Weight"),
        alt.Y("count():Q", title="Count"),
    )
    .properties(title="Distribution of Match Probabilities", width=600, height=400)
)

# Save the chart
chart
┌──────────────┬───────────────────┐
│ match_weight │ match_probability │
│    int32     │   decimal(2,1)    │
├──────────────┼───────────────────┤
│          -10 │               0.0 │
│          -10 │               0.0 │
│          -10 │               0.0 │
│          -10 │               0.0 │
│          -10 │               0.0 │
│           ·  │                ·  │
│           ·  │                ·  │
│           ·  │                ·  │
│          -10 │               0.0 │
│          -10 │               0.0 │
│          -10 │               0.0 │
│          -10 │               0.0 │
│          -10 │               0.0 │
├──────────────┴───────────────────┤
│ 1783 rows (10 shown)   2 columns │
└──────────────────────────────────┘

png

Extensions and final notes

When running this against the full companies house data rather than the small subset here, the number of unmatched records (the bar on the left hand side of the histogram) reduces to about 890 records.

There are a number of ways the accuracy of this example could be improved:

  • Additional data cleaning, e.g. dealing better with company names like A B C Engineering vs ABC engineering, or translating word aliases like LTD to LIMITED
  • A more sophisticated approach to tokenisation
  • Blocking could be improved in various ways, e.g. by blocking on address tokens, using more 'rare' tokens, using partial postcodes better and so on
  • Allowing fuzzy matching on tokens in the company name, not just full token matching