Performance of comparison functions

Comparing the execution speed of different fuzzy matching functions

An important determinant of Splink performance is the computational complexity of any similarity or distance measures (fuzzy matching functions) used as part of your model config.

For example, you may be considering using Jaro-Winkler or Levenshtein, and wish to know which will take longer to compute.
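As a rough illustration of where this choice appears, a model config might include comparisons like the following (a minimal sketch assuming the Splink 4 comparison library; the column names and thresholds are illustrative):

    # A minimal sketch, assuming the Splink 4 comparison library API;
    # column names and thresholds are illustrative.
    import splink.comparison_library as cl

    # Each comparison level below invokes a fuzzy matching function,
    # so its cost is paid once per pairwise comparison.
    first_name_comparison = cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7])
    surname_comparison = cl.LevenshteinAtThresholds("surname", [1, 2])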

This page contains summary statistics from performance benchmarking these functions. The code used to generate these results, along with the raw results, can be found in the splink_speed_testing repo.

The timings are based on performing 10,000,000 comparisons with each function.

DuckDB

The following chart shows the performance of different functions in DuckDB.
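If you want a quick sense of these numbers on your own hardware, one rough approach (much simpler than the full benchmarking harness) is to time DuckDB's built-in functions directly, for example:

    # A rough sketch of timing DuckDB's built-in fuzzy matching functions
    # directly; the row count and generated strings are illustrative.
    import time

    import duckdb

    con = duckdb.connect()
    # Generate a table of string pairs to compare.
    con.execute(
        "CREATE TABLE pairs AS "
        "SELECT 'string_' || i::VARCHAR AS s1, 'string_' || (i + 1)::VARCHAR AS s2 "
        "FROM range(1000000) t(i)"
    )

    for fn in ("levenshtein", "jaro_winkler_similarity"):
        start = time.time()
        con.execute(f"SELECT sum({fn}(s1, s2)) FROM pairs").fetchall()
        print(f"{fn}: {time.time() - start:.2f}s")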

Spark

The following chart shows the performance of different functions in Spark.
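A similar rough timing can be done in Spark using its built-in levenshtein function (note that Jaro-Winkler is not a Spark SQL built-in; Splink supplies it via a UDF, which this sketch does not cover):

    # A rough sketch of timing Spark's built-in levenshtein function;
    # the row count and generated strings are illustrative.
    import time

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Generate a DataFrame of string pairs to compare.
    df = spark.range(1_000_000).select(
        F.concat(F.lit("string_"), F.col("id").cast("string")).alias("s1"),
        F.concat(F.lit("string_"), (F.col("id") + 1).cast("string")).alias("s2"),
    )

    start = time.time()
    df.select(F.sum(F.levenshtein("s1", "s2"))).collect()
    print(f"levenshtein: {time.time() - start:.2f}s")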

Caveats and notes

These charts are intended to provide a rough, high-level guide to performance. Real-world performance can be sensitive to a number of factors:

  • For some functions, such as Levenshtein, longer input strings take longer to compare (see the sketch after this list).
  • For some functions, the result may be cheaper to compute when the two input strings are similar.
  • For the cosine similarity function, we used an embedding length of 10. This is far lower than many typical applications: OpenAI's embeddings, for example, can have a length of 1,536. We used the shorter length because longer embeddings exhausted memory (RAM), causing spill to disk, which meant the test was no longer a pure test of the function itself.
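The first of these caveats can be observed directly by timing the same function over inputs of different lengths (a rough sketch using DuckDB's built-in levenshtein; the lengths and row count are illustrative):

    # A rough sketch illustrating the first caveat above: Levenshtein cost
    # grows with input length. Lengths and row count are illustrative.
    import time

    import duckdb

    con = duckdb.connect()
    for length in (5, 50):
        s = "a" * length
        start = time.time()
        con.execute(
            "SELECT sum(levenshtein(?, ? || i::VARCHAR)) FROM range(1000000) t(i)",
            [s, s],
        ).fetchall()
        print(f"input length {length}: {time.time() - start:.2f}s")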

If you wish to run your own benchmarks, head over to the splink_speed_testing repo, create tests following the existing examples, and then run them using the command

pytest benchmarks/
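For reference, such a test might look like the following (a hedged sketch assuming the repo uses the pytest-benchmark plugin; the function and table names are illustrative):

    # A hedged sketch of what such a test might look like, assuming the
    # pytest-benchmark plugin; function and table names are illustrative.
    import duckdb

    def test_benchmark_levenshtein_duckdb(benchmark):
        con = duckdb.connect()
        con.execute(
            "CREATE TABLE pairs AS "
            "SELECT 'string_' || i::VARCHAR AS s1, 'string_' || (i + 1)::VARCHAR AS s2 "
            "FROM range(1000000) t(i)"
        )

        def run():
            con.execute("SELECT sum(levenshtein(s1, s2)) FROM pairs").fetchall()

        # pytest-benchmark calls `run` repeatedly and reports timing statistics.
        benchmark(run)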