User Defined Functions¶
User Defined Functions (UDFs) are functions that can be created to add functionality to a given SQL backend that does not already exist. These are particularly useful within Splink as it supports multiple SQL engines each with different inherent functionality. UDFs are an important tool for creating consistent functionality across backends.
For example, DuckDB has an in-built string comparison function for Jaccard similarity whereas Spark SQL doesn't have an equivalent function. Therefore, a UDF is required to use functions like jaccard_at_thresholds()
and jaccard_level()
with a Spark backend.
Spark¶
Spark supports UDFs written in Scala and Java.
Splink currently uses UDFs written in Scala and are implemented as follows:
- The UDFs are created in a separate repository,
splink_scalaudfs
, with the Scala functions being defined inSimilarity.scala
. - The functions are then stored in a Java Archive (JAR) file - for more on JAR files, see the Java documentation.
- Once the JAR file containing the UDFs has been created, it is copied across to the spark_jars folder in Splink.
- Specify the the correct jar location within Splink.
- UDFS are then registered within the Spark Linker.
Now the Spark UDFs have been successfully registered, they can be used in Spark SQL. For example,
jaccard("name_column_1", "name_column_2") >= 0.9
which provides the basis for functions such as jaccard_at_thresholds()
and jaccard_level()
.
DuckDB¶
Python UDFs can be registered to a DuckDB connection from version 0.8.0 onwards.
The documentation is here, an examples are here. Note that these functions should be registered against the DuckDB connection provided to the linker using connection.create_function
.
Note that performance will generally be substantially slower than using native DuckDB functions. Consider using vectorised UDFs were possible - see here.
Athena¶
Athena supports UDFs written in Java, however these have not yet been implemented in Splink.
SQLite¶
Python UDFs can be registered to a SQLite connection using the create_function
function. An example is as follows:
from rapidfuzz.distance.Levenshtein import distance
conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, distance)
The function levenshtein
is now available to use as a Python function