Documentation for comparison_level_library

splink.comparison_level_library.NullLevelBase

Bases: ComparisonLevel

__init__(col_name)

Represents comparisons where one or both sides contain null values, so the similarity cannot be evaluated. Assumed to have a partial match weight of zero (no effect on the overall match weight).

Parameters:

Name Type Description Default
col_name str

Input column name

required

Returns:

Name Type Description
ComparisonLevel

Comparison level
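
To illustrate why a null level has no effect on the overall score, the sketch below assumes the usual Fellegi-Sunter definition of a partial match weight as log2 of the Bayes factor m / u. The helper names `is_null_level` and `match_weight` are hypothetical and are not part of splink.

```python
import math
from typing import Optional


def is_null_level(left: Optional[str], right: Optional[str]) -> bool:
    """Return True when either side is missing, so similarity cannot be evaluated."""
    return left is None or right is None


def match_weight(m: float, u: float) -> float:
    """Partial match weight, assumed here to be log2 of the Bayes factor m / u."""
    return math.log2(m / u)


# A null level is treated as uninformative: m equals u, so the weight is zero.
print(is_null_level("smith", None))
print(match_weight(0.5, 0.5))
```

A level with m greater than u would contribute a positive weight, and one with m less than u a negative weight; the null level sits exactly between the two.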


splink.comparison_level_library.ExactMatchLevelBase

Bases: ComparisonLevel

__init__(col_name, m_probability=None, term_frequency_adjustments=False, include_colname_in_charts_label=False)


splink.comparison_level_library.ElseLevelBase

Bases: ComparisonLevel

__init__(m_probability=None)


splink.comparison_level_library.DistanceFunctionLevelBase

Bases: ComparisonLevel

__init__(col_name, distance_function_name, distance_threshold, higher_is_more_similar=True, m_probability=None)

Represents a comparison using a user-provided distance function, where the similarity is assessed against a threshold.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_function_name str

The name of the distance function

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
higher_is_more_similar bool

If True, a higher value of the distance function indicates a higher similarity (e.g. jaro_winkler). If False, a higher value indicates a lower similarity (e.g. levenshtein).

True
m_probability float

Starting value for m probability. Defaults to None.

None

Returns:

Name Type Description
ComparisonLevel

A comparison level for a given distance function
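
The role of higher_is_more_similar can be sketched as a choice of comparison operator. The helper below is hypothetical: it builds an illustrative SQL condition string and assumes that the two sides of a record pair are exposed as columns suffixed `_l` and `_r`. It is not the library's implementation.

```python
from typing import Union


def distance_condition(
    col_name: str,
    distance_function_name: str,
    distance_threshold: Union[int, float],
    higher_is_more_similar: bool = True,
) -> str:
    """Build an illustrative SQL condition for a distance-function level."""
    # Similarity scores pass at or above the threshold,
    # true distances pass at or below it.
    operator = ">=" if higher_is_more_similar else "<="
    return (
        f"{distance_function_name}({col_name}_l, {col_name}_r) "
        f"{operator} {distance_threshold}"
    )


print(distance_condition("first_name", "jaro_winkler", 0.9))
print(distance_condition("surname", "levenshtein", 2, higher_is_more_similar=False))
```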


splink.comparison_level_library.LevenshteinLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, m_probability=None)

Represents a comparison using the Levenshtein distance function.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
m_probability float

Starting value for m probability. Defaults to None.

None

Returns:

Name Type Description
ComparisonLevel

A comparison level that evaluates the levenshtein similarity
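
As a minimal sketch of what this level tests, the pure-Python code below computes the Levenshtein (edit) distance and checks it against a threshold. Because a higher Levenshtein value means lower similarity, a pair passes when the distance is at or below the threshold. The function names are hypothetical; in practice the distance is computed by the SQL backend.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def within_levenshtein_threshold(a: str, b: str, distance_threshold: int) -> bool:
    """True when the edit distance does not exceed the threshold."""
    return levenshtein(a, b) <= distance_threshold


print(levenshtein("kitten", "sitting"))
print(within_levenshtein_threshold("smith", "smyth", 1))
```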


splink.comparison_level_library.JaroWinklerLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, m_probability=None)

Represents a comparison using the Jaro-Winkler distance function.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
m_probability float

Starting value for m probability. Defaults to None.

None

Returns:

Name Type Description
ComparisonLevel

A comparison level that evaluates the jaro winkler similarity
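
For readers unfamiliar with the measure, the following sketch implements the textbook Jaro-Winkler similarity, assuming a prefix scale of 0.1 and a maximum prefix length of 4. Backend implementations may differ in detail, and the function names here are hypothetical. Since a higher score means greater similarity, a pair passes when the score is at or above the threshold.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in the range [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, char in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == char:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```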

splink.comparison_level_library.JaccardLevelBase

Bases: DistanceFunctionLevelBase

__init__(col_name, distance_threshold, m_probability=None)

Represents a comparison using the Jaccard distance function.

Parameters:

Name Type Description Default
col_name str

Input column name

required
distance_threshold Union[int, float]

The threshold to use to assess similarity

required
m_probability float

Starting value for m probability. Defaults to None.

None

Returns:

Name Type Description
ComparisonLevel

A comparison level that evaluates the jaccard similarity
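
A minimal sketch of the Jaccard measure, assuming it is applied to the sets of characters in each string (backends may tokenise differently). The function names and the convention for two empty strings are choices made for this example only.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Size of the intersection of character sets divided by the size of their union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty strings are identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def meets_jaccard_threshold(a: str, b: str, distance_threshold: float) -> bool:
    """Higher Jaccard values mean greater similarity, so test at or above the threshold."""
    return jaccard_similarity(a, b) >= distance_threshold


print(jaccard_similarity("abc", "bcd"))
print(meets_jaccard_threshold("abc", "bcd", 0.6))
```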

splink.comparison_level_library.ColumnsReversedLevelBase

Bases: ComparisonLevel

__init__(col_name_1, col_name_2, m_probability=None, tf_adjustment_column=None)

Represents a comparison where the columns are reversed, for example when the surname appears in the forename field and vice versa.

Parameters:

Name Type Description Default
col_name_1 str

First column, e.g. forename

required
col_name_2 str

Second column, e.g. surname

required
m_probability float

Starting value for m probability. Defaults to None.

None
tf_adjustment_column str

Column to use for term frequency adjustments if an exact match is observed. Defaults to None.

None

Returns:

Name Type Description
ComparisonLevel

A comparison level that evaluates the exact match of two columns.
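
The reversed-columns check can be sketched on two plain dictionaries standing in for the left and right records. The function name and the record layout are hypothetical; nulls are treated as non-matching here, mirroring SQL equality semantics.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def columns_reversed(
    record_l: Record, record_r: Record, col_name_1: str, col_name_2: str
) -> bool:
    """True when column 1 on one side equals column 2 on the other, and vice versa."""
    values = (
        record_l[col_name_1],
        record_l[col_name_2],
        record_r[col_name_1],
        record_r[col_name_2],
    )
    if any(value is None for value in values):
        return False  # a null never equals anything, as in SQL
    return (
        record_l[col_name_1] == record_r[col_name_2]
        and record_l[col_name_2] == record_r[col_name_1]
    )


left = {"forename": "john", "surname": "smith"}
right = {"forename": "smith", "surname": "john"}
print(columns_reversed(left, right, "forename", "surname"))
```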

splink.comparison_level_library.DistanceInKMLevelBase

Bases: ComparisonLevel

__init__(lat_col, long_col, km_threshold, not_null=False, m_probability=None)

Uses the haversine formula to convert pairs of latitude/longitude coordinates into distances measured in kilometers.

Parameters:

Name Type Description Default
lat_col str

The name of a latitude column, or the array or struct column containing the latitude plus an index or key. For example: long_lat['lat'] or long_lat[0].

required
long_col str

The name of a longitude column, or the array or struct column containing the longitude plus an index or key. For example: long_lat['long'] or long_lat[1].

required
km_threshold int

The total distance in kilometers to evaluate your comparisons against

required
not_null bool

If True, exclude null coordinates from this level. This is only necessary if you are not capturing nulls elsewhere in your comparison.

False
m_probability float

Starting value for m probability. Defaults to None.

None

Returns:

Name Type Description
ComparisonLevel

A comparison level that evaluates the distance between two coordinates
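
The haversine calculation itself can be sketched in a few lines of Python. This example assumes a mean Earth radius of 6371 km and coordinates in decimal degrees; the function names are hypothetical, and the comparison against km_threshold is assumed to be inclusive.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat_l: float, long_l: float, lat_r: float, long_r: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi_l, phi_r = math.radians(lat_l), math.radians(lat_r)
    d_phi = phi_r - phi_l
    d_lambda = math.radians(long_r - long_l)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi_l) * math.cos(phi_r) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km_threshold(
    lat_l: float, long_l: float, lat_r: float, long_r: float, km_threshold: float
) -> bool:
    """True when the two points are no further apart than km_threshold."""
    return haversine_km(lat_l, long_l, lat_r, long_r) <= km_threshold


# Approximate coordinates for London and Paris.
print(round(haversine_km(51.5074, -0.1278, 48.8566, 2.3522), 1))
```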

splink.comparison_level_library.PercentageDifferenceLevelBase

Bases: ComparisonLevel

__init__(col_name, percentage_distance_threshold, m_probability=None)

splink.comparison_level_library.ArrayIntersectLevelBase

Bases: ComparisonLevel

__init__(col_name, m_probability=None, term_frequency_adjustments=False, min_intersection=1, include_colname_in_charts_label=False)

Represents a comparison level based on the size of the intersection of two arrays.

Parameters:

Name Type Description Default
col_name str

Input column name

required
m_probability float

Starting value for m probability. Defaults to None.

None
term_frequency_adjustments bool

Whether to apply term frequency adjustments to this level. Defaults to False.

False
min_intersection int

The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1

1
include_colname_in_charts_label bool

Should the charts label contain the column name? Defaults to False

False

Returns:

Name Type Description
ComparisonLevel

A comparison level that evaluates the size of intersection of arrays
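
The min_intersection test can be sketched with Python sets. The function names are hypothetical, and duplicates within an array are assumed to count once, which is how a set intersection behaves.

```python
from typing import Hashable, Sequence


def array_intersect_size(left: Sequence[Hashable], right: Sequence[Hashable]) -> int:
    """Number of distinct values that appear in both arrays."""
    return len(set(left) & set(right))


def meets_min_intersection(
    left: Sequence[Hashable], right: Sequence[Hashable], min_intersection: int = 1
) -> bool:
    """True when the arrays share at least min_intersection distinct values."""
    return array_intersect_size(left, right) >= min_intersection


postcodes_l = ["AB1 2CD", "EF3 4GH", "IJ5 6KL"]
postcodes_r = ["EF3 4GH", "IJ5 6KL", "MN7 8OP"]
print(array_intersect_size(postcodes_l, postcodes_r))
print(meets_min_intersection(postcodes_l, postcodes_r, min_intersection=3))
```

Stacking several such levels with decreasing min_intersection values is one way to grade array overlap from strong to weak.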