Documentation for the comparison_library
AbsoluteTimeDifferenceAtThresholds(col_name, *, input_is_string, metrics, thresholds, datetime_format=None, term_frequency_adjustments=False, invalid_dates_as_null=True)
Bases: ComparisonCreator
Represents a comparison of the data in col_name with multiple levels based on absolute time differences:
- Exact match in col_name
- Absolute time difference levels at specified thresholds
- ...
- Anything else
For example, with metrics = ['day', 'month'] and thresholds = [1, 3] the levels are:
- Exact match in col_name
- Absolute time difference in col_name <= 1 day
- Absolute time difference in col_name <= 3 months
- Anything else
This comparison uses the AbsoluteTimeDifferenceLevel, which computes the total elapsed time between two dates, rather than counting calendar intervals.
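The difference between total elapsed time and calendar intervals can be illustrated with a small pure-Python sketch. It does not call the library: the function name `time_difference_level` and the unit lengths in `SECONDS_PER_UNIT` (average month and year lengths) are assumptions made for illustration only.

```python
from datetime import datetime

# Assumed unit lengths in seconds. Month and year use average lengths here;
# this is an illustrative choice, not necessarily what the library uses.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "month": 86400 * 365.25 / 12,
    "year": 86400 * 365.25,
}


def time_difference_level(left, right, metrics, thresholds):
    """Return a label for the first level that the pair of datetimes satisfies."""
    if left == right:
        return "exact match"
    # Total elapsed time, independent of calendar boundaries
    elapsed = abs((left - right).total_seconds())
    for metric, threshold in zip(metrics, thresholds):
        if elapsed <= threshold * SECONDS_PER_UNIT[metric]:
            return f"<= {threshold} {metric}"
    return "anything else"


# 31 January and 1 February fall in different calendar months,
# but only one day of time has elapsed between them.
print(time_difference_level(
    datetime(2000, 1, 31), datetime(2000, 2, 1), ["day", "month"], [1, 3]
))
```

Levels are checked in the order given, so the tightest threshold should come first.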
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
input_is_string | bool | If True, the input dates are treated as strings and parsed according to | required |
metrics | Union[DateMetricType, List[DateMetricType]] | The unit(s) of time to use when comparing dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'. | required |
thresholds | Union[int, float, List[Union[int, float]]] | The threshold(s) to use for the time difference level(s). | required |
datetime_format | str | The format string for parsing dates if | None |
term_frequency_adjustments | bool | Whether to apply term frequency adjustments. Defaults to False. | False |
invalid_dates_as_null | bool | If True and | True |
ArrayIntersectAtSizes(col_name, size_threshold_or_thresholds=[1])
Bases: ComparisonCreator
Represents a comparison of the data in col_name with multiple levels based on the intersection sizes of array elements:
- Intersection at specified size thresholds
- ...
- Anything else
For example, with size_threshold_or_thresholds = [3, 1], the levels are:
- Intersection of arrays in col_name has at least 3 elements
- Intersection of arrays in col_name has at least 1 element
- Anything else (e.g., empty intersection)
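As a rough illustration of how intersection-size levels behave, the following pure-Python sketch assigns a pair of arrays to the first threshold it meets. It is not the library's implementation; `array_intersect_level` is a hypothetical helper, and treating the arrays as sets of distinct elements is an assumption.

```python
def array_intersect_level(left, right, size_thresholds=(1,)):
    """Return a label for the first size threshold the intersection meets."""
    # Assumption: duplicates are ignored, so arrays are compared as sets
    overlap = len(set(left) & set(right))
    for size in size_thresholds:
        if overlap >= size:
            return f"intersection size >= {size}"
    return "anything else"


print(array_intersect_level(["a", "b", "c", "d"], ["b", "c", "d", "e"], [3, 1]))
print(array_intersect_level(["a"], ["a", "z"], [3, 1]))
print(array_intersect_level(["a"], ["z"], [3, 1]))
```

Thresholds are checked in the order supplied, which is why the example passes the larger size first.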
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
size_threshold_or_thresholds | Union[int, list[int]] | The size threshold(s) for the intersection levels. Defaults to [1]. | [1] |
CosineSimilarityAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.8, 0.7])
Bases: ComparisonCreator
Represents a comparison of the data in col_name with two or more levels:
- Cosine similarity levels at specified thresholds
- ...
- Anything else
For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:
- Cosine similarity in col_name >= 0.9
- Cosine similarity in col_name >= 0.7
- Anything else
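For readers unfamiliar with the measure, here is a minimal sketch of cosine similarity on two numeric vectors and the resulting level assignment. The helper names are hypothetical, the vectors are assumed to be non-zero and of equal length, and the code does not reflect how the library evaluates the comparison in SQL.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_level(u, v, thresholds=(0.9, 0.8, 0.7)):
    """Return a label for the first threshold the similarity reaches."""
    score = cosine_similarity(u, v)
    for threshold in thresholds:
        if score >= threshold:
            return f">= {threshold}"
    return "anything else"


print(cosine_level([1.0, 1.0], [1.0, 0.0], (0.9, 0.7)))
print(cosine_level([1.0, 0.0], [0.0, 1.0], (0.9, 0.7)))
```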
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
score_threshold_or_thresholds | Union[float, list] | The threshold(s) to use for the cosine similarity level(s). Defaults to [0.9, 0.8, 0.7]. | [0.9, 0.8, 0.7] |
CustomComparison(comparison_levels, output_column_name=None, comparison_description=None)
Bases: ComparisonCreator
Represents a comparison of the data with custom supplied levels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_column_name | str | The column name to use to refer to this comparison | None |
comparison_levels | list | A list of some combination of | required |
comparison_description | str | An optional description of the comparison | None |
DamerauLevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2])
Bases: ComparisonCreator
Represents a comparison of the data in col_name with three or more levels:
- Exact match in col_name
- Damerau-Levenshtein levels at specified distance thresholds
- ...
- Anything else
For example, with distance_threshold_or_thresholds = [1, 3] the levels are:
- Exact match in col_name
- Damerau-Levenshtein distance in col_name <= 1
- Damerau-Levenshtein distance in col_name <= 3
- Anything else
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
distance_threshold_or_thresholds | Union[int, list] | The threshold(s) to use for the Damerau-Levenshtein distance level(s). Defaults to [1, 2]. | [1, 2] |
DateOfBirthComparison(col_name, *, input_is_string, datetime_thresholds=[1, 1, 10], datetime_metrics=['month', 'year', 'year'], datetime_format=None, invalid_dates_as_null=True)
Bases: ComparisonCreator
Generate an 'out of the box' comparison for a date of birth column in the col_name provided.
Note that input_is_string is a required argument: you must denote whether the col_name contains values of type date/datetime or string.
The default arguments will give a comparison with comparison levels:
- Exact match (all other dates)
- Damerau-Levenshtein distance <= 1
- Date difference <= 1 month
- Date difference <= 1 year
- Date difference <= 10 years
- Anything else
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | Union[str, ColumnExpression] | The column name | required |
input_is_string | bool | If True, the provided | required |
datetime_thresholds | Union[int, float, List[Union[int, float]]] | Numeric thresholds for date differences. Defaults to [1, 1, 10]. | [1, 1, 10] |
datetime_metrics | Union[DateMetricType, List[DateMetricType]] | Metrics for date differences. Defaults to ["month", "year", "year"]. | ['month', 'year', 'year'] |
datetime_format | str | The datetime format used to cast strings to dates. Only used if input is a string. | None |
invalid_dates_as_null | bool | If True, treat invalid dates as null as opposed to allowing e.g. an exact or levenshtein match where one side or both are an invalid date. Only used if input is a string. Defaults to True. | True |
DistanceFunctionAtThresholds(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True)
Bases: ComparisonCreator
Represents a comparison of the data in col_name with three or more levels:
- Exact match in col_name
- Custom distance function levels at specified thresholds
- ...
- Anything else
For example, with distance_threshold_or_thresholds = [1, 3], distance_function_name = 'hamming' and higher_is_more_similar = False, the levels are:
- Exact match in col_name
- Hamming distance of col_name <= 1
- Hamming distance of col_name <= 3
- Anything else
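The role of higher_is_more_similar can be sketched in plain Python: it decides whether a value must be at least, or at most, each threshold. In the sketch below a Python callable stands in for the SQL function named by distance_function_name; `hamming` and `distance_function_level` are hypothetical helpers, not library functions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_function_level(left, right, distance_function, thresholds,
                            higher_is_more_similar=True):
    """Return a label for the first threshold satisfied by the function value."""
    if left == right:
        return "exact match"
    value = distance_function(left, right)
    for threshold in thresholds:
        if higher_is_more_similar:
            passes, symbol = value >= threshold, ">="
        else:
            passes, symbol = value <= threshold, "<="
        if passes:
            return f"{symbol} {threshold}"
    return "anything else"


print(distance_function_level("karolin", "kathrin", hamming, [1, 3],
                              higher_is_more_similar=False))
```

For a similarity score such as Jaro-Winkler the same function would be called with higher_is_more_similar=True and thresholds listed from highest to lowest.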
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
distance_function_name | str | The name of the SQL distance function. | required |
distance_threshold_or_thresholds | Union[float, list] | The threshold(s) to use for the distance function level(s). | required |
higher_is_more_similar | bool | Are higher values of the distance function more similar? (e.g. True for Jaro-Winkler, False for Levenshtein). Defaults to True. | True |
DistanceInKMAtThresholds(lat_col, long_col, km_thresholds)
Bases: ComparisonCreator
A comparison of the latitude, longitude coordinates defined in 'lat_col' and 'long_col' giving the great circle distance between them in km.
An example of the output with km_thresholds = [1, 10] would be:
- The two coordinates are within 1 km of one another
- The two coordinates are within 10 km of one another
- Anything else (i.e. the coordinates are more than 10 km apart)
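Great circle distance is commonly computed with the haversine formula. The sketch below shows that formula and the threshold check in plain Python; the helper names and the mean Earth radius of 6371 km are assumptions, and the library may compute the distance differently in SQL.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_level(lat1, lon1, lat2, lon2, km_thresholds):
    """Return a label for the first km threshold the pair falls within."""
    distance = great_circle_km(lat1, lon1, lat2, lon2)
    for km in km_thresholds:
        if distance <= km:
            return f"within {km} km"
    return "anything else"


# Two points 0.05 degrees of latitude apart, roughly 5.6 km
print(distance_level(51.5, 0.0, 51.55, 0.0, [1, 10]))
```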
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lat_col | str | The name of the latitude column to compare. | required |
long_col | str | The name of the longitude column to compare. | required |
km_thresholds | iterable[float] \| float | The km threshold(s) for the distance levels. | required |
EmailComparison(col_name)
Bases: ComparisonCreator
Generate an 'out of the box' comparison for an email address column in the col_name provided.
The default comparison levels are:
- Null comparison: e.g., one email is missing or invalid.
- Exact match on full email: e.g., john@smith.com vs. john@smith.com.
- Exact match on username part of email: e.g., john@company.com vs. john@other.com.
- Jaro-Winkler similarity > 0.88 on full email: e.g., john.smith@company.com vs. john.smyth@company.com.
- Jaro-Winkler similarity > 0.88 on username part of email: e.g., john.smith@company.com vs. john.smyth@other.com.
- Anything else: e.g., john@company.com vs. rebecca@other.com.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | Union[str, ColumnExpression] | The column name or expression for the email addresses to be compared. | required |
ExactMatch(col_name)
Bases: ComparisonCreator
Represents a comparison of the data in col_name with two levels:
- Exact match in col_name
- Anything else
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
ForenameSurnameComparison(forename_col_name, surname_col_name, *, jaro_winkler_thresholds=[0.92, 0.88], forename_surname_concat_col_name=None)
Bases: ComparisonCreator
Generate an 'out of the box' comparison for forename and surname columns in the forename_col_name and surname_col_name provided.
It's recommended to derive an additional column containing a concatenated forename and surname so that term frequencies can be applied to the full name. If you have derived a column, provide it at forename_surname_concat_col_name.
The default comparison levels are:
- Null comparison on both forename and surname
- Exact match on both forename and surname
- Columns reversed comparison (forename and surname swapped)
- Jaro-Winkler similarity > 0.92 on both forename and surname
- Jaro-Winkler similarity > 0.88 on both forename and surname
- Exact match on surname
- Exact match on forename
- Anything else
Parameters:
Name | Type | Description | Default |
---|---|---|---|
forename_col_name | Union[str, ColumnExpression] | The column name or expression for the forenames to be compared. | required |
surname_col_name | Union[str, ColumnExpression] | The column name or expression for the surnames to be compared. | required |
jaro_winkler_thresholds | Union[float, list[float]] | Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88]. | [0.92, 0.88] |
forename_surname_concat_col_name | str | The column name for concatenated forename and surname values. If provided, term frequencies are applied on the exact match using this column. | None |
JaccardAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])
Bases: ComparisonCreator
Represents a comparison of the data in col_name with three or more levels:
- Exact match in col_name
- Jaccard score levels at specified thresholds
- ...
- Anything else
For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:
- Exact match in col_name
- Jaccard score in col_name >= 0.9
- Jaccard score in col_name >= 0.7
- Anything else
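The Jaccard score is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the sets of characters in each string, which is an assumption about tokenisation; SQL backends may define their Jaccard function differently. Both helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard score of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty strings
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_level(left, right, thresholds=(0.9, 0.7)):
    """Return a label for the first threshold the Jaccard score reaches."""
    if left == right:
        return "exact match"
    score = jaccard(left, right)
    for threshold in thresholds:
        if score >= threshold:
            return f">= {threshold}"
    return "anything else"


print(jaccard("abc", "abd"))  # 2 shared characters out of 4 distinct ones
print(jaccard_level("abcdefghij", "abcdefghijk"))
```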
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
score_threshold_or_thresholds | Union[float, list] | The threshold(s) to use for the Jaccard similarity level(s). Defaults to [0.9, 0.7]. | [0.9, 0.7] |
JaroAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])
Bases: ComparisonCreator
Represents a comparison of the data in col_name with three or more levels:
- Exact match in col_name
- Jaro score levels at specified thresholds
- ...
- Anything else
For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:
- Exact match in col_name
- Jaro score in col_name >= 0.9
- Jaro score in col_name >= 0.7
- Anything else
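To give a feel for what a Jaro score measures, here is a compact pure-Python sketch of the standard Jaro similarity together with the level assignment. It is written for illustration only; `jaro` and `jaro_level` are hypothetical helpers, and the library relies on the SQL backend's own function rather than on code like this.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_level(left, right, thresholds=(0.9, 0.7)):
    """Return a label for the first threshold the Jaro score reaches."""
    if left == right:
        return "exact match"
    score = jaro(left, right)
    for threshold in thresholds:
        if score >= threshold:
            return f">= {threshold}"
    return "anything else"


print(jaro_level("MARTHA", "MARHTA"))
print(jaro_level("DIXON", "DICKSONX"))
```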
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
score_threshold_or_thresholds | Union[float, list] | The threshold(s) to use for the Jaro similarity level(s). Defaults to [0.9, 0.7]. | [0.9, 0.7] |
JaroWinklerAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])
Bases: ComparisonCreator
Represents a comparison of the data in col_name with three or more levels:
- Exact match in col_name
- Jaro-Winkler score levels at specified thresholds
- ...
- Anything else
For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:
- Exact match in col_name
- Jaro-Winkler score in col_name >= 0.9
- Jaro-Winkler score in col_name >= 0.7
- Anything else
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
score_threshold_or_thresholds | Union[float, list] | The threshold(s) to use for the Jaro-Winkler similarity level(s). Defaults to [0.9, 0.7]. | [0.9, 0.7] |
LevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2])
Bases: ComparisonCreator
Represents a comparison of the data in col_name with three or more levels:
- Exact match in col_name
- Levenshtein levels at specified distance thresholds
- ...
- Anything else
For example, with distance_threshold_or_thresholds = [1, 3] the levels are:
- Exact match in col_name
- Levenshtein distance in col_name <= 1
- Levenshtein distance in col_name <= 3
- Anything else
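The following sketch shows the Levenshtein distance (the minimum number of single-character insertions, deletions and substitutions) and how thresholds map a pair of strings to a level. It is a plain-Python illustration with hypothetical helper names, not the SQL the library generates.

```python
def levenshtein(a, b):
    """Edit distance using the classic dynamic programme, two rows at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_level(left, right, thresholds=(1, 2)):
    """Return a label for the first distance threshold the pair falls within."""
    if left == right:
        return "exact match"
    distance = levenshtein(left, right)
    for threshold in thresholds:
        if distance <= threshold:
            return f"<= {threshold}"
    return "anything else"


print(levenshtein_level("smith", "smyth", [1, 3]))
print(levenshtein_level("kitten", "sitting", [1, 3]))
```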
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | str | The name of the column to compare. | required |
distance_threshold_or_thresholds | Union[int, list] | The threshold(s) to use for the Levenshtein distance level(s). Defaults to [1, 2]. | [1, 2] |
NameComparison(col_name, *, jaro_winkler_thresholds=[0.92, 0.88, 0.7], dmeta_col_name=None)
Bases: ComparisonCreator
Generate an 'out of the box' comparison for a name column in the col_name provided.
It's also possible to include a level for a dmetaphone match, but this requires you to derive a dmetaphone column prior to importing it into Splink. Note this is expected to be a column containing arrays of dmetaphone values, which are of length 1 or 2.
The default comparison levels are:
- Null comparison
- Exact match
- Jaro-Winkler similarity > 0.92
- Jaro-Winkler similarity > 0.88
- Jaro-Winkler similarity > 0.70
- Anything else
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | Union[str, ColumnExpression] | The column name or expression for the names to be compared. | required |
jaro_winkler_thresholds | Union[float, list[float]] | Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88, 0.7]. | [0.92, 0.88, 0.7] |
dmeta_col_name | str | The column name for dmetaphone values. If provided, array intersection level is included. This column must contain arrays of dmetaphone values, which are of length 1 or 2. | None |
PostcodeComparison(col_name, *, invalid_postcodes_as_null=False, lat_col=None, long_col=None, km_thresholds=[1, 10, 100])
Bases: ComparisonCreator
Generate an 'out of the box' comparison for a postcode column in the col_name provided.
The default comparison levels are:
- Null comparison
- Exact match on full postcode
- Exact match on sector
- Exact match on district
- Exact match on area
- Distance in km (if lat_col and long_col are provided)
It's also possible to include levels for distance in km, but this requires you to have geocoded your postcodes prior to importing them into Splink. Use the lat_col and long_col arguments to tell Splink where to find the latitude and longitude columns.
See https://ideal-postcodes.co.uk/guides/uk-postcode-format for definitions.
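To make the area, district and sector levels concrete, the sketch below splits a UK postcode into those parts and reports the most specific part two postcodes share. The regular expression is a simplified assumption about postcode structure, the helper names are hypothetical, and unparseable postcodes are simply sent to the final level; none of this is taken from the library's implementation.

```python
import re

# Simplified pattern: area letters, district suffix, sector digit, unit letters
POSTCODE_PATTERN = re.compile(
    r"^([A-Z]{1,2})([0-9][A-Z0-9]?)\s*([0-9])([A-Z]{2})$"
)


def postcode_parts(postcode):
    """Split a postcode into area, district, sector and full form, or None."""
    match = POSTCODE_PATTERN.match(postcode.strip().upper())
    if match is None:
        return None
    area, district_suffix, sector_digit, unit = match.groups()
    district = area + district_suffix
    sector = f"{district} {sector_digit}"
    return {"full": sector + unit, "sector": sector,
            "district": district, "area": area}


def postcode_level(left, right):
    """Return the most specific postcode part on which the two values agree."""
    parts_left, parts_right = postcode_parts(left), postcode_parts(right)
    if parts_left is None or parts_right is None:
        return "anything else"  # simplification for unparseable input
    for part in ("full", "sector", "district", "area"):
        if parts_left[part] == parts_right[part]:
            return f"exact match on {part}"
    return "anything else"


print(postcode_parts("sw1a1aa"))
print(postcode_level("SW1A 1AA", "SW1A 2AA"))
```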
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name | Union[str, ColumnExpression] | The column name or expression for the postcodes to be compared. | required |
invalid_postcodes_as_null | bool | If True, treat invalid postcodes as null. Defaults to False. | False |
lat_col | Union[str, ColumnExpression] | The column name or expression for latitude. Required if | None |
long_col | Union[str, ColumnExpression] | The column name or expression for longitude. Required if | None |
km_thresholds | Union[float, List[float]] | Thresholds for distance in kilometers. If provided, | [1, 10, 100] |
AbsoluteDateDifferenceAtThresholds
An alias of AbsoluteTimeDifferenceAtThresholds.
Configuring comparisons
Note that all comparisons have a .configure() method as follows:
Configure the comparison creator with options that are common to all comparisons.
For m and u probabilities, the first element in the list corresponds to the first comparison level, usually an exact match level. Subsequent elements correspond to comparison levels in sequential order, through to the last element, which is usually the 'ELSE' level.
All options have default values set initially. Any call to .configure() will set any options that are supplied. Any subsequent calls to .configure() will not override these values with defaults; to override values you must explicitly provide a value corresponding to the default.
Generally speaking only a single call (at most) to .configure() should be required.
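The behaviour of repeated .configure() calls can be pictured with a sentinel default, as in the sketch below. This is an illustration of the general pattern, not the library's code: the class `ConfigurableSketch` and the `_Unsupplied` sentinel are hypothetical stand-ins.

```python
class _Unsupplied:
    """Sentinel type marking an option that the caller did not pass."""

    def __repr__(self):
        return "unsupplied_option"


unsupplied_option = _Unsupplied()


class ConfigurableSketch:
    """Hypothetical stand-in for a comparison creator with .configure()."""

    def __init__(self):
        # Initial defaults
        self.term_frequency_adjustments = False
        self.m_probabilities = None
        self.u_probabilities = None

    def configure(self, *, term_frequency_adjustments=unsupplied_option,
                  m_probabilities=unsupplied_option,
                  u_probabilities=unsupplied_option):
        supplied = {
            "term_frequency_adjustments": term_frequency_adjustments,
            "m_probabilities": m_probabilities,
            "u_probabilities": u_probabilities,
        }
        for name, value in supplied.items():
            # Only options that were actually passed overwrite current values
            if value is not unsupplied_option:
                setattr(self, name, value)
        return self


cc = ConfigurableSketch()
cc.configure(term_frequency_adjustments=True)
cc.configure(m_probabilities=[0.9, 0.08, 0.02])
print(cc.term_frequency_adjustments)  # still True after the second call
```

Because None is itself a meaningful value for the probability lists, a dedicated sentinel is what lets a call distinguish "not passed" from "explicitly reset to the default".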
Parameters:
Name | Type | Description | Default |
---|---|---|---|
term_frequency_adjustments | bool | Whether term frequency adjustments are switched on for this comparison. Only applied to exact match levels. Default corresponds to False. | unsupplied_option |
m_probabilities | list | List of m probabilities. Default corresponds to None. | unsupplied_option |
u_probabilities | list | List of u probabilities. Default corresponds to None. | unsupplied_option |
Example
from splink.comparison_library import LevenshteinAtThresholds

cc = LevenshteinAtThresholds("name", 2)
cc.configure(
    # probabilities for exact match level, levenshtein <= 2, and else,
    # in that order
    m_probabilities=[0.9, 0.08, 0.02],
    u_probabilities=[0.01, 0.05, 0.94],
)