Skip to content

Using PostgreSQL as a Splink backend

Splink is compatible with using PostgreSQL (or simply as Postgres) as a SQL backend - for other options have a look at the overview of Splink backends.

Setup

Splink makes use of SQLAlchemy for connecting to Postgres, and the default database adapter is psycopg2, but you should be able to use any other if you prefer. The PostgresLinker requires a valid engine upon creation to manage interactions with the database:

from sqlalchemy import create_engine

from splink.postgres.linker import PostgresLinker
import splink.postgres.comparison_library as cl

# create a sqlalchemy engine to manage connecting to the database
engine = create_engine("postgresql+psycopg2://USER:PASSWORD@HOST:PORT/DB_NAME")

settings = {
    "link_type": "dedupe_only",
}

You can pass data to the linker in one of two ways:

  • use the name of a pre-existing table in your database

    linker = PostgresLinker("my_data_table", settings, engine=engine)
    
  • or pass a pandas DataFrame directly, in which case the linker will create a corresponding table for you automatically in the database

    import pandas as pd
    
    # create pandas frame from csv
    df = pd.read_csv("./my_data_table.csv")
    
    linker = PostgresLinker(df, settings, engine=engine)
    

Permissions

When you connect to Postgres, you must do so with a role that has sufficient privileges for Splink to operate correctly. These are:

  • CREATE ON DATABASE, to allow Splink to create a schema for working, and install the fuzzystrmatch extension
  • USAGE ON LANGUAGE SQL and USAGE ON TYPE float8 - these are required for creating the UDFs that Splink employs for calculations

Things to know

Schemas

When you create a PostgresLinker, Splink will create a new schema within the database you specify - by default this schema is called splink, but you can choose another name by passing the appropriate argument when creating the linker:

linker = PostgresLinker(df, settings, engine=engine, schema="another_splink_schema")
This schema is where all of Splink's work will be carried out, and where any tables created by Splink will live.

By default when looking for tables, Splink will check the schema it created, and the public schema; if you have tables in other schemas that you would like to be discoverable by Splink, you can use the parameter other_schemas_to_search:

linker = PostgresLinker(df, settings, engine=engine, other_schemas_to_search=["my_data_schema_1", "my_data_schema_2"])

User-Defined Functions (UDFs)

Splink makes use of Postgres' user-defined functions in order to operate, which are defined in the schema created by Splink when you create the linker. These functions are all defined using SQL, and are:

Information

The information below is only relevant if you are planning on making changes to Splink. If you are only intending to use Splink with Postgres, you do not need to read any further.

To run only the Splink tests that run against Postgres, you can run simply:

pytest -m postgres_only tests/
For more information see the documentation page for testing in Splink.

The tests will are run using a temporary database and user that are created at the start of the test session, and destroyed at the end.

Postgres via docker

If you are trying to run tests with Splink on Postgres, or simply develop using Postgres, you may prefer to not actually install Postgres on you system, but to run it instead using Docker. In this case you can simply run the setup script (a thin wrapper around docker-compose):

./scripts/postgres/setup.sh
Included in the docker-compose file is a pgAdmin container to allow easy exploration of the database as you work, which can be accessed in-browser on the default port.

When you are finished you can remove these resources:

./scripts/postgres/teardown.sh

Running with a pre-existing database

If you have a pre-existing Postgres server you wish to use to run the tests against, you will need to specify environment variables for the credentials where they differ from default (in parentheses):

  • SPLINKTEST_PG_USER (splinkognito)
  • SPLINKTEST_PG_PASSWORD (splink123!)
  • SPLINKTEST_PG_HOST (localhost)
  • SPLINKTEST_PG_PORT (5432)
  • SPLINKTEST_PG_DB (splink_db) - tests will not actually run against this, but it is from a connection to this that the temporary test database + user will be created

While care has been taken to ensure that tests are run using minimal permissions, and are cleaned up after, it is probably wise to run tests connected to a non-important database, in case anything goes wrong. In addition to the above privileges, in order to run the tests you will need:

  • CREATE DATABASE to create a temporary testing database
  • CREATEROLE to create a temporary user role with limited privileges, which will be actually used for all the SQL execution in the tests