Splink's SQL backends: Spark, DuckDB, etc.
Splink is a Python library. It implements all data linking computations by generating SQL and submitting the SQL statements to a backend of the user's choosing for execution.
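The generate-then-submit pattern described above can be illustrated with a toy example. This is a minimal sketch, not Splink's actual SQL or internals: it uses Python's built-in sqlite3 module as a stand-in backend, and the function names build_exact_match_sql and run_on_backend, the people table, and its columns are all hypothetical names chosen for illustration.

```python
import sqlite3


def build_exact_match_sql(table, column):
    """Generate SQL that pairs up records sharing the same value in `column`.

    Hypothetical helper: identifiers are interpolated directly, which is
    acceptable in a toy example but unsafe with untrusted input.
    """
    return (
        f"SELECT l.id AS id_l, r.id AS id_r "
        f"FROM {table} AS l JOIN {table} AS r "
        f"ON l.{column} = r.{column} AND l.id < r.id "
        f"ORDER BY id_l, id_r"
    )


def run_on_backend(connection, sql):
    """Submit a SQL string to the backend and collect the result rows."""
    return connection.execute(sql).fetchall()


# An in-memory SQLite database stands in for the chosen backend.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER, surname TEXT)")
con.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "smith"), (2, "jones"), (3, "smith")],
)

sql = build_exact_match_sql("people", "surname")
pairs = run_on_backend(con, sql)
print(pairs)  # [(1, 3)]
```

Because the Python side only builds strings and hands them over, the same calling code could in principle target any engine that accepts the generated SQL, which is the property that makes switching backends cheap.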
For smaller input datasets of up to 1-2 million records, users can link data in Python on their laptop using the DuckDB backend. This is the recommended approach because the DuckDB backend is installed automatically when the user installs Splink using pip install splink. No additional configuration is needed.
Linking larger datasets is computationally intensive and generates datasets which are too large to be processed on a standard laptop. For these scenarios, we recommend using one of Splink's big data backends, currently Spark or AWS Athena. When these backends are used, the SQL generated by Splink is sent to the chosen backend for execution.
The Splink code you write is almost identical between backends, so it's straightforward to migrate from one to another. Often, it's a good idea to start with DuckDB on a sample of data, because it produces results very quickly. When you're comfortable with your model, you may wish to migrate to a big data backend to estimate/predict on the full dataset.
Choosing a backend
Import the linker from the backend of your choosing, along with the backend-specific comparison libraries.
Once you have initialised the linker object, there is no difference in the subsequent code between backends.
Note, however, that not all comparison functions are available in all backends. Tables detailing the functions available for each backend can be found on the comparison library API page and the comparison level library API page.
# DuckDB
from splink.duckdb.duckdb_linker import DuckDBLinker
import splink.duckdb.duckdb_comparison_library as cl
import splink.duckdb.duckdb_comparison_level_library as cll
linker = DuckDBLinker(your_args)

# Spark
from splink.spark.spark_linker import SparkLinker
import splink.spark.spark_comparison_library as cl
import splink.spark.spark_comparison_level_library as cll
linker = SparkLinker(your_args)

# AWS Athena
from splink.athena.athena_linker import AthenaLinker
import splink.athena.athena_comparison_library as cl
import splink.athena.athena_comparison_level_library as cll
linker = AthenaLinker(your_args)

# SQLite
from splink.sqlite.sqlite_linker import SQLiteLinker
import splink.sqlite.sqlite_comparison_library as cl
import splink.sqlite.sqlite_comparison_level_library as cll
linker = SQLiteLinker(your_args)
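
Since the four sets of imports differ only in the backend name, one possible way to make the backend a configuration choice is a small lookup helper. The sketch below is an illustration rather than part of Splink: the BACKENDS table, linker_location and load_linker_class are hypothetical names, and the module and class names are simply copied from the import statements above. Actually loading a linker class assumes Splink and the relevant backend's dependencies are installed.

```python
import importlib

# Module path and class name for each backend, copied from the imports above.
BACKENDS = {
    "duckdb": ("splink.duckdb.duckdb_linker", "DuckDBLinker"),
    "spark": ("splink.spark.spark_linker", "SparkLinker"),
    "athena": ("splink.athena.athena_linker", "AthenaLinker"),
    "sqlite": ("splink.sqlite.sqlite_linker", "SQLiteLinker"),
}


def linker_location(backend):
    """Return (module_path, class_name) for a backend name (case-insensitive)."""
    try:
        return BACKENDS[backend.lower()]
    except KeyError:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of: {known}"
        ) from None


def load_linker_class(backend):
    """Import and return the linker class for the chosen backend.

    The import happens lazily, so only the selected backend's
    dependencies need to be installed.
    """
    module_path, class_name = linker_location(backend)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


print(linker_location("DuckDB"))
```

With a helper like this, a script could read the backend name from a setting, call load_linker_class on it, and leave the rest of the linking code unchanged, for example when moving from DuckDB on a sample to Spark on the full dataset.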