Choosing String Comparators¶
When building a Splink model, one of the most important aspects is defining the Comparisons
and Comparison Levels
that the model will train on. Each Comparison Level
within a Comparison
should contain a different amount of evidence that two records are a match, which the model can assign a Match Weight to. When considering different amounts of evidence for the model, it is helpful to explore fuzzy matching as a way of distinguishing strings that are similar to, but not the same as, one another.
This guide is intended to show how Splink's string comparators perform in different situations, to help you choose the most appropriate comparator for a given column as well as the most appropriate threshold (or thresholds). For descriptions and examples of each string comparator available in Splink, see the dedicated topic guide.
What options are available when comparing strings?¶
There are three main classes of string comparator that are considered within Splink:
- String Similarity Scores
- String Distance Scores
- Phonetic Matching
where
String Similarity Scores are scores between 0 and 1 indicating how similar two strings are. 0 represents two completely dissimilar strings and 1 represents identical strings. E.g. Jaro-Winkler Similarity.
String Distance Scores are integer distances, counting the number of operations to convert one string into another. A lower string distance indicates more similar strings. E.g. Levenshtein Distance.
Phonetic Matching indicates whether two strings are phonetically similar. The two strings are passed through a phonetic transformation algorithm and the resulting phonetic codes are compared. E.g. Double Metaphone.
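To get a feel for the difference between the three, it can help to compute one of each by hand. Below is a minimal sketch using the third-party jellyfish package (purely illustrative and not part of Splink; the function names assume a recent jellyfish release):
import jellyfish

s1, s2 = "Richard", "Rick"

# String similarity score: 1 means identical, 0 means completely dissimilar
print(jellyfish.jaro_winkler_similarity(s1, s2))

# String distance score: the number of edits needed to turn one string into the other
print(jellyfish.levenshtein_distance(s1, s2))

# Phonetic matching: compare the phonetic codes rather than the raw strings
print(jellyfish.metaphone(s1), jellyfish.metaphone(s2))
print(jellyfish.metaphone(s1) == jellyfish.metaphone(s2))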
Comparing String Similarity and Distance Scores¶
Splink contains a comparison_helpers
module, which includes helper functions for comparing the string similarity and distance scores. These can be useful when choosing the most appropriate fuzzy matching function.
For a single pair of strings, the comparator_score
function returns the scores for all of the available comparators. For example, consider a simple transposition of two characters: "Richard" vs "iRchard":
import splink.comparison_helpers as ch
ch.comparator_score("Richard", "iRchard")
|   | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|---|---------|---------|----------------------|------------------------------|-----------------|-------------------------|--------------------|
| 0 | Richard | iRchard | 2 | 1 | 0.95 | 0.95 | 1.0 |
Now consider a collection of common variations of the name "Richard" - which comparators will consider these variations as sufficiently similar to "Richard"?
import pandas as pd
data = {
"string1": [
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
"Richard",
],
"string2": [
"Richard",
"ichard",
"Richar",
"iRchard",
"Richadr",
"Rich",
"Rick",
"Ricky",
"Dick",
"Rico",
"Rachael",
"Stephen",
],
"error_type": [
"None",
"Deletion",
"Deletion",
"Transposition",
"Transposition",
"Shortening",
"Nickname/Alias",
"Nickname/Alias",
"Nickname/Alias",
"Nickname/Alias",
"Different Name",
"Different Name",
],
}
df = pd.DataFrame(data)
df
|    | string1 | string2 | error_type |
|----|---------|---------|------------|
| 0  | Richard | Richard | None |
| 1  | Richard | ichard | Deletion |
| 2  | Richard | Richar | Deletion |
| 3  | Richard | iRchard | Transposition |
| 4  | Richard | Richadr | Transposition |
| 5  | Richard | Rich | Shortening |
| 6  | Richard | Rick | Nickname/Alias |
| 7  | Richard | Ricky | Nickname/Alias |
| 8  | Richard | Dick | Nickname/Alias |
| 9  | Richard | Rico | Nickname/Alias |
| 10 | Richard | Rachael | Different Name |
| 11 | Richard | Stephen | Different Name |
The comparator_score_chart
function allows you to compare two lists of strings and visualise how similar the elements are according to the available string similarity and distance metrics.
ch.comparator_score_chart(data, "string1", "string2")
Here we can see that all of the metrics are fairly sensitive to transcription errors ("Richadr", "Richar", "iRchard"). However, for nicknames/aliases ("Rick", "Ricky", "Rico"), simple metrics such as Jaccard, Levenshtein and Damerau-Levenshtein tend to be less useful. The same can be said for name shortenings ("Rich"), though to a lesser extent than for more complex nicknames. Even the more subtle metrics, Jaro and Jaro-Winkler, still struggle to identify less obvious nicknames/aliases such as "Dick".
If you would prefer the underlying dataframe instead of the chart, there is the comparator_score_df
function.
ch.comparator_score_df(data, "string1", "string2")
|    | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|----|---------|---------|----------------------|------------------------------|-----------------|-------------------------|--------------------|
| 0  | Richard | Richard | 0 | 0 | 1.00 | 1.00 | 1.00 |
| 1  | Richard | ichard | 1 | 1 | 0.95 | 0.95 | 0.86 |
| 2  | Richard | Richar | 1 | 1 | 0.95 | 0.97 | 0.86 |
| 3  | Richard | iRchard | 2 | 1 | 0.95 | 0.95 | 1.00 |
| 4  | Richard | Richadr | 2 | 1 | 0.95 | 0.97 | 1.00 |
| 5  | Richard | Rich | 3 | 3 | 0.86 | 0.91 | 0.57 |
| 6  | Richard | Rick | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 7  | Richard | Ricky | 4 | 4 | 0.68 | 0.68 | 0.33 |
| 8  | Richard | Dick | 5 | 5 | 0.60 | 0.60 | 0.22 |
| 9  | Richard | Rico | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 10 | Richard | Rachael | 3 | 3 | 0.71 | 0.74 | 0.44 |
| 11 | Richard | Stephen | 7 | 7 | 0.43 | 0.43 | 0.08 |
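Because comparator_score_df returns an ordinary pandas DataFrame, it is straightforward to summarise the scores by error type. The sketch below is one way to do this (it assumes the output rows come back in the same order as the input lists):
scores = ch.comparator_score_df(data, "string1", "string2")
# Attach the error types (assumes rows are returned in input order)
scores["error_type"] = data["error_type"]
# Average score per error type for each similarity metric
print(
    scores.groupby("error_type")[
        ["jaro_similarity", "jaro_winkler_similarity", "jaccard_similarity"]
    ].mean()
)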
Choosing thresholds¶
We can add distance and similarity thresholds to the comparators to see what strings would be included in a given comparison level:
ch.comparator_score_threshold_chart(
data, "string1", "string2", distance_threshold=2, similarity_threshold=0.8
)
To group our variations on "Richard" into the same Comparison Level
, a good choice of metric could be Jaro-Winkler with a threshold of 0.8. Lowering the threshold any further would increase the chance of false positives.
For example, a single Jaro-Winkler Comparison Level
with a threshold of 0.7 would lead to "Rachael" being considered as providing the same amount of evidence for a record match as "iRchard".
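Before settling on a threshold, it can be useful to list exactly which of the example strings would fall above each candidate value, for example by filtering the comparator_score_df output shown above:
scores = ch.comparator_score_df(data, "string1", "string2")
# Strings that would fall in a Jaro-Winkler >= 0.8 level
print(scores[scores["jaro_winkler_similarity"] >= 0.8]["string2"].tolist())
# Lowering the threshold to 0.7 also pulls in "Rachael"
print(scores[scores["jaro_winkler_similarity"] >= 0.7]["string2"].tolist())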
An alternative way around this is to construct a Comparison
with multiple levels, each corresponding to a different threshold of Jaro-Winkler similarity. For example, below we construct such a Comparison
using the Comparison Library
function jaro_winkler_at_thresholds:
import splink.duckdb.comparison_library as cl
first_name_comparison = cl.jaro_winkler_at_thresholds("first_name", [0.9, 0.8, 0.7])
If we print this comparison as a dictionary we can see the underlying SQL.
first_name_comparison.as_dict()
{'output_column_name': 'first_name',
'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
'label_for_charts': 'Jaro_winkler_similarity >= 0.9'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8',
'label_for_charts': 'Jaro_winkler_similarity >= 0.8'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
'label_for_charts': 'Jaro_winkler_similarity >= 0.7'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'Exact match vs. First_Name within jaro_winkler_similarity thresholds 0.9, 0.8, 0.7 vs. anything else'}
Where:
- The Exact Match level will catch perfect matches ("Richard").
- The 0.9 threshold will catch Shortenings and Typos ("ichard", "Richar", "iRchard", "Richadr", "Rich").
- The 0.8 threshold will catch simple Nicknames/Aliases ("Rick", "Rico").
- The 0.7 threshold will catch more complex Nicknames/Aliases ("Ricky"), but will also include less relevant names (e.g. "Rachael"). However, this should not be a concern, as the model should assign less predictive power (i.e. a lower Match Weight) to this level of evidence.
- All other comparisons will end up in the "Else" level.
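For context, a comparison built this way is simply one entry in the model's settings dictionary. The sketch below shows roughly how it might sit alongside other settings; the link type, blocking rule and remaining comparisons are illustrative placeholders rather than recommendations:
settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.jaro_winkler_at_thresholds("first_name", [0.9, 0.8, 0.7]),
        # ... comparisons for the other columns in your data
    ],
    # Placeholder blocking rule - choose rules appropriate to your data
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
}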
Phonetic Matching¶
There are similar functions available within Splink to help users get familiar with phonetic transformations, and to create visualisations similar to those for string comparators.
To see the phonetic transformations for a single string, there is the phonetic_transform
function:
import splink.comparison_helpers as ch
ch.phonetic_transform("Richard")
{'soundex': 'R02063', 'metaphone': 'RXRT', 'dmetaphone': ('RXRT', 'RKRT')}
ch.phonetic_transform("Steven")
{'soundex': 'S30105', 'metaphone': 'STFN', 'dmetaphone': ('STFN', '')}
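A phonetic "match" then simply amounts to checking whether the generated codes are equal. For example, using the phonetic_transform output shown above:
codes_stephen = ch.phonetic_transform("Stephen")
codes_steven = ch.phonetic_transform("Steven")
# Compare the Soundex codes and the primary Double Metaphone codes
print(codes_stephen["soundex"] == codes_steven["soundex"])  # True - both 'S30105'
print(codes_stephen["dmetaphone"][0] == codes_steven["dmetaphone"][0])  # True - both 'STFN'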
Now consider a collection of common variations of the name "Stephen". Which phonetic transforms will consider these as sufficiently similar to "Stephen"?
data = {
"string1": [
"Stephen",
"Stephen",
"Stephen",
"Stephen",
"Stephen",
"Stephen",
"Stephen",
"Stephen",
"Stephen",
"Stephen",
"Stephen",
],
"string2": [
"Stephen",
"Steven",
"Stephan",
"Steve",
"Stehpen",
"tSephen",
"Stephne",
"Stphen",
"Stepheb",
"Stephanie",
"Richard",
],
"error_type": [
"None",
"Spelling Variation",
"Spelling Variation/Similar Name",
"Nickname/Alias",
"Transposition",
"Transposition",
"Transposition",
"Deletion",
"Replacement",
"Different Name",
"Different Name",
],
}
df = pd.DataFrame(data)
df
|    | string1 | string2 | error_type |
|----|---------|---------|------------|
| 0  | Stephen | Stephen | None |
| 1  | Stephen | Steven | Spelling Variation |
| 2  | Stephen | Stephan | Spelling Variation/Similar Name |
| 3  | Stephen | Steve | Nickname/Alias |
| 4  | Stephen | Stehpen | Transposition |
| 5  | Stephen | tSephen | Transposition |
| 6  | Stephen | Stephne | Transposition |
| 7  | Stephen | Stphen | Deletion |
| 8  | Stephen | Stepheb | Replacement |
| 9  | Stephen | Stephanie | Different Name |
| 10 | Stephen | Richard | Different Name |
The phonetic_match_chart
function allows you to compare two lists of strings and see whether the elements are considered a match by each of the available phonetic matching algorithms.
ch.phonetic_match_chart(data, "string1", "string2")
Here we can see that all of the algorithms recognise simple phonetically similar names ("Stephen", "Steven"). However, there is some variation when it comes to transposition errors ("Stehpen", "Stephne"), with Soundex and the Metaphone-based algorithms giving different results. The algorithms also behave differently for different names ("Stephanie").
Given there is no clear winner that captures all of the similar names, it is recommended that phonetic matches are used as a single Comparison Level
within a Comparison
which also includes string comparators in the other levels. To see an example of this, see the Combining String scores and Phonetic matching section of this topic guide.
If you would prefer the underlying dataframe instead of the chart, there is the phonetic_transform_df
function.
ch.phonetic_transform_df(data, "string1", "string2")
|    | string1 | string2 | soundex | metaphone | dmetaphone |
|----|---------|---------|---------|-----------|------------|
| 0  | Stephen | Stephen | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 1  | Stephen | Steven | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 2  | Stephen | Stephan | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 3  | Stephen | Steve | [S30105, S3010] | [STFN, STF] | [(STFN, ), (STF, )] |
| 4  | Stephen | Stehpen | [S30105, S30105] | [STFN, STPN] | [(STFN, ), (STPN, )] |
| 5  | Stephen | tSephen | [S30105, t50105] | [STFN, TSFN] | [(STFN, ), (TSFN, )] |
| 6  | Stephen | Stephne | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 7  | Stephen | Stphen | [S30105, S3105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 8  | Stephen | Stepheb | [S30105, S30101] | [STFN, STFP] | [(STFN, ), (STFP, )] |
| 9  | Stephen | Stephanie | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 10 | Stephen | Richard | [S30105, R02063] | [STFN, RXRT] | [(STFN, ), (RXRT, RKRT)] |
Combining String scores and Phonetic matching¶
Once you have considered all of the string comparators and phonetic transforms for a given column, you may decide that you would like multiple comparison levels combining several of these options.
For this you can construct a custom comparison to catch all of the edge cases you want. For example, if you decide that the comparison for first_name
in the model should consider:
- A Dmetaphone level for phonetic similarity
- A Levenshtein level with distance of 2 for typos
- A Jaro-Winkler level with similarity 0.8 for fuzzy matching
The name_comparison function from the Comparison Template Library
can be configured as follows:
import splink.duckdb.comparison_template_library as ctl
first_name_comparison = ctl.name_comparison(
"first_name",
phonetic_col_name="first_name_dm",
damerau_levenshtein_thresholds=[],
levenshtein_thresholds=[2],
jaro_winkler_thresholds=[0.8],
)
first_name_comparison.as_dict()
{'output_column_name': 'custom_first_name_first_name_dm',
'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'Null',
'is_null_level': True},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match first_name'},
{'sql_condition': '"first_name_dm_l" = "first_name_dm_r"',
'label_for_charts': 'Exact match first_name_dm'},
{'sql_condition': 'levenshtein("first_name_l", "first_name_r") <= 2',
'label_for_charts': 'Levenshtein <= 2'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8',
'label_for_charts': 'Jaro_winkler_similarity >= 0.8'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'Exact match vs. Names with phonetic exact match vs. First_Name within levenshtein threshold 2 vs. First_Name within jaro_winkler threshold 0.8 vs. anything else'}
where first_name_dm
refers to a column in the dataset which has been created during the feature engineering step to give the Dmetaphone
transform of first_name
.
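As a rough sketch of that feature engineering step (not covered in this guide), the column could be derived with the phonetic_transform helper before the data is passed to the linker. Whether to store both Double Metaphone codes or only the primary one is a modelling choice:
import pandas as pd
import splink.comparison_helpers as ch

records = pd.DataFrame({"first_name": ["Richard", "Rick", "Stephen", "Steven"]})
# Store the full Double Metaphone output; use [0] to keep only the primary code
records["first_name_dm"] = records["first_name"].apply(
    lambda name: ch.phonetic_transform(name)["dmetaphone"]
)
print(records)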