# Choosing String Comparators
When building a Splink model, one of the most important tasks is defining the Comparisons and Comparison Levels that the model will train on. Each Comparison Level within a Comparison should capture a different amount of evidence that two records are a match, to which the model can assign a match weight. When considering different amounts of evidence for the model, it is helpful to explore fuzzy matching as a way of distinguishing strings that are similar to, but not the same as, one another.

This guide shows how Splink's string comparators perform in different situations, to help you choose the most appropriate comparator for a given column as well as the most appropriate threshold (or thresholds). For descriptions and examples of each string comparator available in Splink, see the dedicated topic guide.
## What options are available when comparing strings?
There are three main classes of string comparator considered within Splink:

- String Similarity Scores: scores between 0 and 1 indicating how similar two strings are. 0 represents two completely dissimilar strings and 1 represents identical strings (e.g. Jaro-Winkler Similarity).
- String Distance Scores: integer distances counting the number of edit operations needed to convert one string into another. A lower distance indicates more similar strings (e.g. Levenshtein Distance).
- Phonetic Matching: a check of whether two strings are phonetically similar. Each string is passed through a phonetic transformation algorithm and the resulting phonetic codes are compared (e.g. Double Metaphone).
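To make the distance-score idea concrete, here is a minimal pure-Python sketch of Levenshtein distance. Splink computes this in SQL via its backends; the function below is illustrative only:

```python
def levenshtein(s1: str, s2: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # insertions, deletions and substitutions needed to turn s1 into s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (c1 != c2),  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Richard", "iRchard"))  # 2
```

Note that the swapped pair in "iRchard" costs two substitutions: plain Levenshtein has no transposition operation, which is exactly what Damerau-Levenshtein adds.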
## Comparing String Similarity and Distance Scores
Splink contains a `comparison_helpers` module which includes helper functions for comparing the string similarity and distance scores, which can help when choosing the most appropriate fuzzy matching function.

For comparing two strings, the `comparator_score` function returns the scores for all of the available comparators. For example, consider a simple transposition, "Richard" vs "iRchard":

```python
from splink.exploratory import similarity_analysis as sa

sa.comparator_score("Richard", "iRchard")
```
| | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|---|---|---|---|---|---|---|---|
| 0 | Richard | iRchard | 2 | 1 | 0.95 | 0.95 | 1.0 |
Now consider a collection of common variations of the name "Richard" - which comparators will consider these variations as sufficiently similar to "Richard"?
```python
import pandas as pd

data = [
    {"string1": "Richard", "string2": "Richard", "error_type": "None"},
    {"string1": "Richard", "string2": "ichard", "error_type": "Deletion"},
    {"string1": "Richard", "string2": "Richar", "error_type": "Deletion"},
    {"string1": "Richard", "string2": "iRchard", "error_type": "Transposition"},
    {"string1": "Richard", "string2": "Richadr", "error_type": "Transposition"},
    {"string1": "Richard", "string2": "Rich", "error_type": "Shortening"},
    {"string1": "Richard", "string2": "Rick", "error_type": "Nickname/Alias"},
    {"string1": "Richard", "string2": "Ricky", "error_type": "Nickname/Alias"},
    {"string1": "Richard", "string2": "Dick", "error_type": "Nickname/Alias"},
    {"string1": "Richard", "string2": "Rico", "error_type": "Nickname/Alias"},
    {"string1": "Richard", "string2": "Rachael", "error_type": "Different Name"},
    {"string1": "Richard", "string2": "Stephen", "error_type": "Different Name"},
]

df = pd.DataFrame(data)
df
```
| | string1 | string2 | error_type |
|---|---|---|---|
| 0 | Richard | Richard | None |
| 1 | Richard | ichard | Deletion |
| 2 | Richard | Richar | Deletion |
| 3 | Richard | iRchard | Transposition |
| 4 | Richard | Richadr | Transposition |
| 5 | Richard | Rich | Shortening |
| 6 | Richard | Rick | Nickname/Alias |
| 7 | Richard | Ricky | Nickname/Alias |
| 8 | Richard | Dick | Nickname/Alias |
| 9 | Richard | Rico | Nickname/Alias |
| 10 | Richard | Rachael | Different Name |
| 11 | Richard | Stephen | Different Name |
The `comparator_score_chart` function allows you to compare two lists of strings and see how similar their elements are according to the available string similarity and distance metrics.

```python
sa.comparator_score_chart(data, "string1", "string2")
```
Here we can see that all of the metrics are fairly sensitive to transcription errors ("Richadr", "Richar", "iRchard"). However, when considering nicknames/aliases ("Rick", "Ricky", "Rico"), simple metrics such as Jaccard, Levenshtein and Damerau-Levenshtein tend to be less useful. The same can be said for name shortenings ("Rich"), though to a lesser extent than for more complex nicknames. Even the more sophisticated Jaro and Jaro-Winkler metrics struggle to identify less obvious nicknames/aliases such as "Dick".
If you would prefer the underlying dataframe to the chart, there is the `comparator_score_df` function.

```python
sa.comparator_score_df(data, "string1", "string2")
```
| | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|---|---|---|---|---|---|---|---|
| 0 | Richard | Richard | 0 | 0 | 1.00 | 1.00 | 1.00 |
| 1 | Richard | ichard | 1 | 1 | 0.95 | 0.95 | 0.86 |
| 2 | Richard | Richar | 1 | 1 | 0.95 | 0.97 | 0.86 |
| 3 | Richard | iRchard | 2 | 1 | 0.95 | 0.95 | 1.00 |
| 4 | Richard | Richadr | 2 | 1 | 0.95 | 0.97 | 1.00 |
| 5 | Richard | Rich | 3 | 3 | 0.86 | 0.91 | 0.57 |
| 6 | Richard | Rick | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 7 | Richard | Ricky | 4 | 4 | 0.68 | 0.68 | 0.33 |
| 8 | Richard | Dick | 5 | 5 | 0.60 | 0.60 | 0.22 |
| 9 | Richard | Rico | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 10 | Richard | Rachael | 3 | 3 | 0.71 | 0.74 | 0.44 |
| 11 | Richard | Stephen | 7 | 7 | 0.43 | 0.43 | 0.08 |
## Choosing thresholds
We can add distance and similarity thresholds to the comparators to see what strings would be included in a given comparison level:
```python
sa.comparator_score_threshold_chart(
    data, "string1", "string2", distance_threshold=2, similarity_threshold=0.8
)
```
To place our variations on "Richard" in the same Comparison Level, a good choice of metric could be Jaro-Winkler with a threshold of 0.8. Lowering the threshold any further would increase the chance of false positives. For example, a single Jaro-Winkler Comparison Level with a threshold of 0.7 would lead to "Rachael" being considered as providing the same amount of evidence for a record matching as "iRchard".
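The effect of a threshold can also be checked directly against the scores table. Below is a small sketch that filters the Jaro-Winkler scores from the `comparator_score_df` output above at the 0.8 threshold (the scores are copied in by hand, so this is illustrative rather than Splink API):

```python
# Jaro-Winkler scores copied from the comparator_score_df table above.
jw_scores = {
    "Richard": 1.00, "ichard": 0.95, "Richar": 0.97, "iRchard": 0.95,
    "Richadr": 0.97, "Rich": 0.91, "Rick": 0.81, "Ricky": 0.68,
    "Dick": 0.60, "Rico": 0.81, "Rachael": 0.74, "Stephen": 0.43,
}

# Strings that a single Jaro-Winkler >= 0.8 level would group with "Richard".
within_level = [s for s, score in jw_scores.items() if score >= 0.8]
print(within_level)
# ['Richard', 'ichard', 'Richar', 'iRchard', 'Richadr', 'Rich', 'Rick', 'Rico']
```

Raising or lowering the threshold here makes the trade-off visible: 0.8 keeps the typos, shortenings and simple nicknames while excluding "Rachael".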
An alternative is to construct a Comparison with multiple levels, each corresponding to a different threshold of Jaro-Winkler similarity. For example, below we construct a Comparison using the Comparison Library function `JaroWinklerAtThresholds` with multiple levels for different similarity thresholds:
```python
import splink.comparison_library as cl

first_name_comparison = cl.JaroWinklerAtThresholds("first_name", [0.9, 0.8, 0.7])
```
If we print this comparison as a dictionary, we can see the underlying SQL:

```python
first_name_comparison.get_comparison("duckdb").as_dict()
```
```
{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.8'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'JaroWinklerAtThresholds'}
```
Where:

- The Exact Match level will catch perfect matches ("Richard").
- The 0.9 threshold will catch Shortenings and Typos ("ichard", "Richar", "iRchard", "Richadr", "Rich").
- The 0.8 threshold will catch simple Nicknames/Aliases ("Rick", "Rico").
- The 0.7 threshold will also catch less relevant names (e.g. "Rachael"). However, this should not be a concern, as the model should assign less predictive power (i.e. Match Weight) to this level of evidence.
- All other comparisons will end up in the "Else" level, including the harder Nicknames/Aliases "Ricky" (0.68) and "Dick" (0.60), which fall below the 0.7 threshold.
## Phonetic Matching
Similar functions are available within Splink to help users get familiar with phonetic transformations, and you can create visualisations analogous to those for string comparators.
To see the phonetic transformations of a single string, there is the `phonetic_transform` function:

```python
sa.phonetic_transform("Richard")
```

```
{'soundex': 'R02063', 'metaphone': 'RXRT', 'dmetaphone': ('RXRT', 'RKRT')}
```

```python
sa.phonetic_transform("Steven")
```

```
{'soundex': 'S30105', 'metaphone': 'STFN', 'dmetaphone': ('STFN', '')}
```
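To illustrate how a phonetic code is built, here is a sketch of the classic American Soundex algorithm. Note that Splink's `phonetic_transform` helper returns a longer, unpadded variant (e.g. `S30105` rather than `S315`), so the codes differ in form, but the idea is the same:

```python
def soundex(name: str) -> str:
    # Classic American Soundex: keep the first letter, map the remaining
    # consonants to digit classes, collapse adjacent repeats, drop vowels,
    # then pad/truncate to 4 characters.
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not break a run of the same code
            prev = code
    return (result + "000")[:4]

print(soundex("Richard"))                      # R263
print(soundex("Stephen"), soundex("Steven"))   # S315 S315 -> a phonetic match
```

"Stephen" and "Steven" produce identical codes, which is exactly the behaviour a phonetic Comparison Level relies on.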
Now consider a collection of common variations of the name "Stephen". Which phonetic transforms will consider these as sufficiently similar to "Stephen"?
```python
data = [
    {"string1": "Stephen", "string2": "Stephen", "error_type": "None"},
    {"string1": "Stephen", "string2": "Steven", "error_type": "Spelling Variation"},
    {"string1": "Stephen", "string2": "Stephan", "error_type": "Spelling Variation/Similar Name"},
    {"string1": "Stephen", "string2": "Steve", "error_type": "Nickname/Alias"},
    {"string1": "Stephen", "string2": "Stehpen", "error_type": "Transposition"},
    {"string1": "Stephen", "string2": "tSephen", "error_type": "Transposition"},
    {"string1": "Stephen", "string2": "Stephne", "error_type": "Transposition"},
    {"string1": "Stephen", "string2": "Stphen", "error_type": "Deletion"},
    {"string1": "Stephen", "string2": "Stepheb", "error_type": "Replacement"},
    {"string1": "Stephen", "string2": "Stephanie", "error_type": "Different Name"},
    {"string1": "Stephen", "string2": "Richard", "error_type": "Different Name"},
]

df = pd.DataFrame(data)
df
```
| | string1 | string2 | error_type |
|---|---|---|---|
| 0 | Stephen | Stephen | None |
| 1 | Stephen | Steven | Spelling Variation |
| 2 | Stephen | Stephan | Spelling Variation/Similar Name |
| 3 | Stephen | Steve | Nickname/Alias |
| 4 | Stephen | Stehpen | Transposition |
| 5 | Stephen | tSephen | Transposition |
| 6 | Stephen | Stephne | Transposition |
| 7 | Stephen | Stphen | Deletion |
| 8 | Stephen | Stepheb | Replacement |
| 9 | Stephen | Stephanie | Different Name |
| 10 | Stephen | Richard | Different Name |
The `phonetic_match_chart` function allows you to compare two lists of strings and see whether their elements match according to each of the available phonetic matching algorithms.

```python
sa.phonetic_match_chart(data, "string1", "string2")
```
Here we can see that all of the algorithms recognise simple phonetically similar names ("Stephen", "Steven"). However, there is some variation when it comes to transposition errors ("Stehpen", "Stephne"), with Soundex and the Metaphone-based algorithms giving different results. Behaviour also differs for different names such as "Stephanie".
Given there is no clear winner that captures all of the similar names, it is recommended that phonetic matches are used as a single Comparison Level within a Comparison that also includes string comparators in its other levels. To see an example of this, see the Combining String scores and Phonetic matching section of this topic guide.
If you would prefer the underlying dataframe to the chart, there is the `phonetic_transform_df` function.

```python
sa.phonetic_transform_df(data, "string1", "string2")
```
| | string1 | string2 | soundex | metaphone | dmetaphone |
|---|---|---|---|---|---|
| 0 | Stephen | Stephen | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 1 | Stephen | Steven | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 2 | Stephen | Stephan | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 3 | Stephen | Steve | [S30105, S3010] | [STFN, STF] | [(STFN, ), (STF, )] |
| 4 | Stephen | Stehpen | [S30105, S30105] | [STFN, STPN] | [(STFN, ), (STPN, )] |
| 5 | Stephen | tSephen | [S30105, t50105] | [STFN, TSFN] | [(STFN, ), (TSFN, )] |
| 6 | Stephen | Stephne | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 7 | Stephen | Stphen | [S30105, S3105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 8 | Stephen | Stepheb | [S30105, S30101] | [STFN, STFP] | [(STFN, ), (STFP, )] |
| 9 | Stephen | Stephanie | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 10 | Stephen | Richard | [S30105, R02063] | [STFN, RXRT] | [(STFN, ), (RXRT, RKRT)] |
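As a quick cross-check of the discussion above, the sketch below takes the Metaphone codes from the table and lists which variants share "Stephen"'s code, `STFN` (codes copied by hand, illustrative only):

```python
# Metaphone codes copied from the phonetic_transform_df table above.
codes = {
    "Steven": "STFN", "Stephan": "STFN", "Steve": "STF",
    "Stehpen": "STPN", "tSephen": "TSFN", "Stephne": "STFN",
    "Stphen": "STFN", "Stepheb": "STFP", "Stephanie": "STFN",
    "Richard": "RXRT",
}

stephen_code = "STFN"
matches = [name for name, code in codes.items() if code == stephen_code]
print(matches)
# ['Steven', 'Stephan', 'Stephne', 'Stphen', 'Stephanie']
```

Metaphone catches the spelling variants and one transposition ("Stephne") but misses the other transpositions, while also pulling in "Stephanie", which is why combining phonetic levels with string-score levels is recommended.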
## Combining String scores and Phonetic matching
Once you have considered all of the string comparators and phonetic transforms for a given column, you may decide that you would like multiple comparison levels combining several of these options.

For this you can construct a custom comparison to catch all of the edge cases you want. For example, suppose you decide that the comparison for `first_name` in the model should include:

- A Dmetaphone level for phonetic similarity
- A Levenshtein level with a distance of 2 for typos
- A Jaro-Winkler level with a similarity of 0.8 for fuzzy matching
```python
import splink.comparison_library as cl
import splink.comparison_level_library as cll

first_name_comparison = cl.CustomComparison(
    output_column_name="first_name",
    comparison_levels=[
        cll.NullLevel("first_name"),
        cll.ExactMatchLevel("first_name"),
        cll.JaroWinklerLevel("first_name", 0.8),
        cll.LevenshteinLevel("first_name", 2),
        cll.ArrayIntersectLevel("first_name_dm", 1),
        cll.ElseLevel(),
    ],
)

print(first_name_comparison.get_comparison("duckdb").human_readable_description)
```

```
Comparison 'CustomComparison' of "first_name" and "first_name_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Jaro-Winkler distance of first_name >= 0.8' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8
    - 'Levenshtein distance of first_name <= 2' with SQL rule: levenshtein("first_name_l", "first_name_r") <= 2
    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect("first_name_dm_l", "first_name_dm_r")) >= 1
    - 'All other comparisons' with SQL rule: ELSE
```
where `first_name_dm` refers to a column in the dataset which has been created during the feature engineering step to give the Dmetaphone transform of `first_name`.
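For intuition, the `ArrayIntersectLevel` above is equivalent to checking that the two Double Metaphone code arrays share at least one code. A hedged Python sketch of that logic, using codes from earlier in this guide (the empty-string handling is an illustrative choice, not necessarily what the SQL does):

```python
def dmetaphone_codes_match(codes_l, codes_r) -> bool:
    # Mirrors the ArrayIntersectLevel condition: true when the two
    # Double Metaphone code arrays share at least one code. Empty
    # secondary codes are excluded here so they cannot match each other.
    shared = (set(codes_l) & set(codes_r)) - {""}
    return len(shared) >= 1

# Codes taken from the phonetic_transform outputs earlier in this guide.
print(dmetaphone_codes_match(("STFN", ""), ("STFN", "")))      # True  (Stephen vs Steven)
print(dmetaphone_codes_match(("STFN", ""), ("RXRT", "RKRT")))  # False (Stephen vs Richard)
```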