Splink's SQL backends: Spark, DuckDB, etc.
Splink is a Python library. It implements all data linking computations by generating SQL and submitting the SQL statements to a backend of the user's choosing for execution.
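The generate-then-submit pattern described above can be illustrated with a minimal sketch. This is not Splink's own code: the helper names `build_pair_sql` and `run_on_backend`, the `people` table, and its columns are all hypothetical, and Python's built-in `sqlite3` module stands in for the backend so that the example runs without extra installs.

```python
import sqlite3


def build_pair_sql(table: str, column: str) -> str:
    """Generate SQL that pairs up records sharing the same value in `column`.

    Hypothetical helper for illustration only. Identifiers are interpolated
    directly into the string, so they must come from trusted code.
    """
    return (
        f"SELECT l.id AS id_l, r.id AS id_r "
        f"FROM {table} AS l JOIN {table} AS r "
        f"ON l.{column} = r.{column} AND l.id < r.id "
        f"ORDER BY id_l, id_r"
    )


def run_on_backend(conn: sqlite3.Connection, sql: str) -> list:
    """Submit a generated SQL statement to the backend and fetch the rows."""
    return conn.execute(sql).fetchall()


# An in-memory SQLite database plays the role of the execution backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, surname TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "smith"), (2, "jones"), (3, "smith")],
)

sql = build_pair_sql("people", "surname")
pairs = run_on_backend(conn, sql)
print(pairs)  # [(1, 3)]
```

The Python side only builds a string; all of the record comparison work happens inside the database engine, which is why the same approach can be pointed at engines of very different sizes.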
For smaller input datasets of up to 1-2 million records, users can link data in Python on their laptop using the DuckDB backend. This is the recommended approach because the DuckDB backend is installed automatically when the user installs Splink using pip install splink. No additional configuration is needed.
Linking larger datasets requires computationally intensive calculations and generates datasets that are too large to be processed on a standard laptop. For these scenarios, we recommend using one of Splink's big data backends, currently Spark or AWS Athena. When these backends are used, the SQL generated by Splink is sent to the chosen backend for execution.
The Splink code you write is almost identical between backends, so it's straightforward to migrate between backends. Often, it's a good idea to start working using DuckDB on a sample of data, because it will produce results very quickly. When you're comfortable with your model, you may wish to migrate to a big data backend to estimate/predict on the full dataset.
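As a rough illustration of prototyping on a sample, the sketch below draws a reproducible random subset of records using only the Python standard library. The function name `sample_records`, the record layout, and the sample size are assumptions made for this example; they are not part of Splink.

```python
import random


def sample_records(records: list, k: int, seed: int = 0) -> list:
    """Return up to `k` records chosen at random without replacement.

    A fixed seed makes the sample reproducible between runs, which helps
    when comparing model settings on the same subset.
    """
    rng = random.Random(seed)
    k = min(k, len(records))  # never ask for more records than exist
    return rng.sample(records, k)


# Hypothetical input: 1,000 records with an id and a surname.
records = [{"unique_id": i, "surname": f"name_{i % 50}"} for i in range(1000)]

sample = sample_records(records, k=100, seed=42)
print(len(sample))  # 100
```

A sample like this could be passed to the DuckDB backend while the model is being developed, with the full dataset reserved for the big data backend.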
Choosing a backend
Import the linker from the backend of your choosing, and the backend-specific comparison libraries.
Once you have initialised the linker object, there is no difference in the subsequent code between backends.
Note, however, that not all comparison functions are available in all backends. For example, a Jaro-Winkler comparison function doesn't exist in DuckDB or Athena.
    # DuckDB
    from splink.duckdb.duckdb_linker import DuckDBLinker
    import splink.duckdb.duckdb_comparison_library as cl
    import splink.duckdb.duckdb_comparison_level_library as cll

    linker = DuckDBLinker(your_args)

    # Spark
    from splink.spark.spark_linker import SparkLinker
    import splink.spark.spark_comparison_library as cl
    import splink.spark.spark_comparison_level_library as cll

    linker = SparkLinker(your_args)

    # Athena
    from splink.athena.athena_linker import AthenaLinker
    import splink.athena.athena_comparison_library as cl
    import splink.athena.athena_comparison_level_library as cll

    linker = AthenaLinker(your_args)

    # SQLite
    from splink.sqlite.sqlite_linker import SQLiteLinker
    import splink.sqlite.sqlite_comparison_library as cl
    import splink.sqlite.sqlite_comparison_level_library as cll

    linker = SQLiteLinker(your_args)
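The four sets of imports above follow a single naming pattern, so a script that needs to switch backends from one setting could derive the module paths from the backend name. The sketch below is one possible way to do that. The helpers `backend_import_paths` and `load_backend` are hypothetical and not part of Splink, and `load_backend` assumes Splink is installed with the module layout shown above.

```python
import importlib

# Linker class names exactly as they appear in the imports above.
LINKER_CLASSES = {
    "duckdb": "DuckDBLinker",
    "spark": "SparkLinker",
    "athena": "AthenaLinker",
    "sqlite": "SQLiteLinker",
}


def backend_import_paths(backend: str) -> dict:
    """Build the module paths for a backend from the shared naming pattern."""
    if backend not in LINKER_CLASSES:
        raise ValueError(f"Unknown backend: {backend!r}")
    prefix = f"splink.{backend}.{backend}"
    return {
        "linker_module": f"{prefix}_linker",
        "linker_class": LINKER_CLASSES[backend],
        "comparison_library": f"{prefix}_comparison_library",
        "comparison_level_library": f"{prefix}_comparison_level_library",
    }


def load_backend(backend: str):
    """Import and return (linker class, cl, cll) for the chosen backend.

    Only works where Splink and the backend's own dependencies are installed.
    """
    paths = backend_import_paths(backend)
    linker_module = importlib.import_module(paths["linker_module"])
    linker_cls = getattr(linker_module, paths["linker_class"])
    cl = importlib.import_module(paths["comparison_library"])
    cll = importlib.import_module(paths["comparison_level_library"])
    return linker_cls, cl, cll


paths = backend_import_paths("spark")
print(paths["linker_module"])  # splink.spark.spark_linker
```

With this in place, moving from a DuckDB prototype to Spark would mean changing one string, for example `linker_cls, cl, cll = load_backend("spark")`. Because not every comparison function exists in every backend, it may be worth checking with `hasattr(cl, name)` before relying on a particular function.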