# User Defined Functions
User Defined Functions (UDFs) are functions that can be created to add functionality to a given SQL backend that it does not already provide. They are particularly useful in Splink, which supports multiple SQL engines, each with different inherent functionality. UDFs are an important tool for providing consistent functionality across backends.
For example, DuckDB has an in-built string comparison function for Jaccard similarity, whereas Spark SQL has no equivalent. A UDF is therefore required to use functions such as jaccard_at_thresholds() and jaccard_level() with a Spark backend.
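To make the comparison concrete, the quantity such a function computes can be sketched in plain Python as the overlap between two strings' character sets. This is a simplified model for illustration only; backend implementations may tokenise differently (e.g. into n-grams):

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity of two strings, treating each as a set of characters.

    A simplified sketch: |A intersect B| / |A union B|. Real backend UDFs
    may operate on n-grams or other tokenisations.
    """
    a, b = set(s1), set(s2)
    if not a and not b:
        return 1.0  # treat two empty strings as identical
    return len(a & b) / len(a | b)


print(jaccard_similarity("duck", "luck"))  # 3 shared chars / 5 total -> 0.6
```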
## Spark

Spark supports UDFs written in Scala and Java. Splink currently uses UDFs written in Scala, which are implemented as follows:
- The UDFs are created in a separate repository, splink_scalaudfs, with the Scala functions being defined in Similarity.scala.
- The functions are then stored in a Java Archive (JAR) file - for more on JAR files, see the Java documentation.
- Once the JAR file containing the UDFs has been created, it is copied across to the spark_jars folder in Splink.
- The correct JAR location is then specified within Splink.
- The UDFs are then registered within the Spark linker.
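The steps above can be sketched from the Python side roughly as follows. Note this is a configuration sketch only: the JAR path and the Java class name are illustrative placeholders, not the actual names used by Splink:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

# Point Spark at the JAR containing the Scala UDFs.
# The path below is an illustrative placeholder.
spark = (
    SparkSession.builder
    .config("spark.jars", "path/to/scala-udf-similarity.jar")
    .getOrCreate()
)

# Register a Scala UDF so it is callable from Spark SQL.
# The Java class name below is a placeholder assumption.
spark.udf.registerJavaFunction(
    "jaccard", "com.example.JaccardSimilarity", DoubleType()
)
```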
Once the Spark UDFs have been registered, they can be used in Spark SQL. For example:

```sql
jaccard("name_column_1", "name_column_2") >= 0.9
```
## DuckDB

Python UDFs can be registered to a DuckDB connection from DuckDB version 0.8.0 onwards.
Note that performance will generally be substantially slower than with native DuckDB functions. Consider using vectorised UDFs where possible - see the DuckDB documentation on Python UDFs.
## Athena

Athena supports UDFs written in Java; however, these have not yet been implemented in Splink.
## SQLite

Python UDFs can be registered to a SQLite connection using the create_function function. An example is as follows:

```python
import sqlite3
from rapidfuzz.distance.Levenshtein import distance

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, distance)
```
The levenshtein function is now registered and available to use within SQL run against that connection.
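Any Python callable with the right arity can be registered this way. As a fully self-contained sketch that does not depend on rapidfuzz, the example below registers a small pure-Python Levenshtein implementation (written here for illustration) and calls it from SQL:

```python
import sqlite3


def levenshtein(s1: str, s2: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only the previous row of the DP table.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution
            ))
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)

dist = conn.execute("SELECT levenshtein('kitten', 'sitting')").fetchone()[0]
print(dist)  # -> 3
```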