
User Defined Functions

User Defined Functions (UDFs) are custom functions that add functionality to a SQL backend where it does not already exist. These are particularly useful within Splink, as it supports multiple SQL engines, each with different inherent functionality. UDFs are an important tool for providing consistent functionality across backends.

For example, DuckDB has an in-built string comparison function for Jaccard similarity whereas Spark SQL doesn't have an equivalent function. Therefore, a UDF is required to use functions like JaccardAtThresholds() and JaccardLevel() with a Spark backend.

Spark

Spark supports UDFs written in Scala and Java.

Splink currently uses UDFs written in Scala, which are registered as follows:
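A minimal sketch of that registration step, assuming PySpark and a locally available copy of Splink's Scala UDF JAR; the JAR path and the JaccardSimilarity class name below are illustrative assumptions:

```python
from pyspark.sql import SparkSession, types

# Attach the JAR containing the Scala UDFs when building the session.
# The path is a placeholder - point it at your copy of the JAR.
spark = (
    SparkSession.builder
    .config("spark.jars", "path/to/scala-udf-similarity.jar")
    .getOrCreate()
)

# Register the Scala implementation under the SQL name "jaccard".
# The class name is assumed from Splink's scala-udf-similarity package.
spark.udf.registerJavaFunction(
    "jaccard",
    "uk.gov.moj.dash.linkage.JaccardSimilarity",
    types.DoubleType(),
)
```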

Now that the Spark UDFs have been successfully registered, they can be used in Spark SQL. For example,

jaccard("name_column_1", "name_column_2") >= 0.9

which provides the basis for functions such as JaccardAtThresholds() and JaccardLevel().

DuckDB

Python UDFs can be registered to a DuckDB connection from version 0.8.0 onwards.

The DuckDB documentation explains how to do this, with examples. Note that these functions should be registered against the DuckDB connection provided to the linker, using connection.create_function.
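For illustration, here is a minimal sketch of registering a scalar Python UDF on a DuckDB connection; the function name jaro_sim and the use of rapidfuzz are assumptions for the example, not part of Splink's API:

```python
import duckdb
from duckdb.typing import DOUBLE, VARCHAR
from rapidfuzz.distance import Jaro


def jaro_sim(s1: str, s2: str) -> float:
    # Scalar Python UDF: invoked once per row
    return Jaro.normalized_similarity(s1, s2)


# Register against the same connection that will be passed to the linker
con = duckdb.connect()
con.create_function("jaro_sim", jaro_sim, [VARCHAR, VARCHAR], DOUBLE)

# The UDF is now available in SQL executed on this connection
con.sql("SELECT jaro_sim('martha', 'marhta') AS sim").show()
```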

Note that performance will generally be substantially slower than using native DuckDB functions. Consider using vectorised UDFs where possible.
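A sketch of what a vectorised UDF can look like, assuming DuckDB's Arrow UDF type (type="arrow") and pyarrow's compute kernels; the function operates on whole Arrow arrays rather than one value per call, which avoids much of the per-row overhead:

```python
import duckdb
import pyarrow.compute as pc
from duckdb.typing import VARCHAR


def upper_vectorised(names):
    # Receives a pyarrow array covering a whole vector of rows
    return pc.utf8_upper(names)


con = duckdb.connect()
con.create_function(
    "upper_vectorised", upper_vectorised, [VARCHAR], VARCHAR, type="arrow"
)
con.sql("SELECT upper_vectorised('smith') AS name").show()
```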

Athena

Athena supports UDFs written in Java; however, these have not yet been implemented in Splink.

SQLite

Python UDFs can be registered to a SQLite connection using the create_function method. An example is as follows:

```python
import sqlite3

from rapidfuzz.distance.Levenshtein import distance

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, distance)
```

The function levenshtein is now available to use as a SQL function on this connection.
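For example, a hypothetical query run directly on the connection:

```python
# Returns (1,): one edit turns 'smith' into 'smyth'
print(conn.execute("SELECT levenshtein('smith', 'smyth')").fetchone())
```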