Development Quickstart

Splink is a complex project with many dependencies. This page provides step-by-step instructions for getting set up to develop Splink. Once you have followed these instructions, you should be all set to start making changes.

Step 0: Unix-like operating system

We highly recommend developing Splink on a Unix-like operating system, such as macOS or Linux. While it is possible to develop on another operating system such as Windows, we do not provide instructions for doing so.

Luckily, Windows users can easily fulfil this requirement by installing the Windows Subsystem for Linux (WSL):

  • Open PowerShell as Administrator: Right-click the Start button, select “Windows Terminal (Admin)”, and ensure PowerShell is the selected shell.
  • Run the command wsl --install.
  • You can find more guidance on setting up WSL on the Microsoft website, but you don't need to do anything additional.
  • Open the Windows Terminal again (does not need to be Admin) and select the Ubuntu shell. Follow the rest of these instructions in that shell.

Step 1: Clone Splink

If you haven't already, create a fork of the Splink repository. You can find the Splink repository here, or click here to go directly to making a fork. Clone your fork to whatever directory you want to work in with git clone https://github.com/<YOUR_USERNAME>/splink.git.
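
As a rough sketch, the clone step can be wrapped in a small shell helper so the username is supplied once. The function names `fork_url` and `clone_fork` are illustrative only and are not part of the Splink repository.

```shell
# Build the clone URL for a fork owned by the given GitHub username.
fork_url() {
  printf 'https://github.com/%s/splink.git\n' "$1"
}

# Clone the fork into the current directory (needs git and network access).
clone_fork() {
  git clone "$(fork_url "$1")"
}

# Print the URL that would be cloned for a placeholder username.
fork_url "YOUR_USERNAME"
```

Calling clone_fork with your own username then performs the actual clone.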

Step 2: Choose how to install system dependencies

Developing Splink requires Python, as well as Poetry (the package manager we use to install Python package dependencies). Running Spark or PostgreSQL on your computer to test those backends requires additional dependencies. Athena only runs in the AWS cloud, so to locally run the tests for that backend you will need to create an AWS account and configure Splink to use it.

There are two ways to install these system dependencies: globally on your computer, or in an isolated conda environment.

The decision of which approach to take is subjective.

If you already have Python and Poetry installed (plus Java and PostgreSQL if you want to run the Spark and PostgreSQL backends locally), there is probably little advantage to using conda.

On the other hand, conda is particularly suitable if any of the following apply:

  • You're already a conda user
  • You're working in an environment where security policies prevent the installation of system-level packages like Java
  • You don't want to install some of the requirements, such as Java, globally

Step 3, Manual install option: Install system dependencies

Python

Check if Python is already installed by running python3 --version. If that outputs a version like 3.10.12, you've already got it! Otherwise, follow the instructions for installation on your platform from the Python website.
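
The check above can be made a little more forgiving with a guard for the case where python3 is missing entirely. This is a minimal sketch; the function name `check_python` is hypothetical.

```shell
# Print the Python 3 version if present, or a hint if it is missing.
check_python() {
  if command -v python3 >/dev/null 2>&1; then
    python3 --version
  else
    echo "python3 not found; install it from the Python website"
  fi
}

check_python
```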

Poetry

Run these commands to install Poetry globally. Note that we currently use an older version of Poetry, so the version must be specified.

pip install --upgrade pip
pip install poetry==1.4.2
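
Because the version is pinned, it can be worth confirming what actually got installed. The sketch below assumes that `poetry --version` prints a line of the form "Poetry (version 1.4.2)"; the helper names are illustrative.

```shell
# Extract the bare version number from `poetry --version` output,
# assumed to look like "Poetry (version 1.4.2)".
poetry_version_from() {
  printf '%s\n' "$1" | sed -E 's/.*version ([0-9][0-9.]*).*/\1/'
}

# Report whether the Poetry on PATH matches the pinned version.
check_poetry() {
  if ! command -v poetry >/dev/null 2>&1; then
    echo "poetry not found"
    return 1
  fi
  found=$(poetry_version_from "$(poetry --version)")
  if [ "$found" = "1.4.2" ]; then
    echo "poetry $found OK"
  else
    echo "poetry $found found, expected 1.4.2"
    return 1
  fi
}
```

Run check_poetry after the install commands to see the result.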

Java

The instructions to install Java globally depend on your operating system. Generally, some version of Java will be available from your operating system's package manager. Note that you must install a version of Java earlier than Java 18 because Splink currently uses an older version of Spark.

As an example, you could run this on Ubuntu:

sudo apt install openjdk-11-jre-headless
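
To confirm that the Java on your PATH satisfies the "earlier than Java 18" requirement, something like the following sketch could be used. It assumes `java -version` reports a quoted version string such as "11.0.20" or the legacy "1.8.0_292" on stderr; the function names are hypothetical.

```shell
# Reduce a Java version string to its major version number.
# Handles the legacy "1.8.0_292" form and the modern "11.0.20" form.
java_major() {
  case "$1" in
    1.*) printf '%s\n' "$1" | cut -d. -f2 ;;
    *)   printf '%s\n' "$1" | cut -d. -f1 | sed 's/[^0-9].*//' ;;
  esac
}

# Succeed only if the given version string is earlier than Java 18.
java_ok_for_splink() {
  [ "$(java_major "$1")" -lt 18 ]
}

# Check the Java on PATH, if any. `java -version` writes to stderr.
if command -v java >/dev/null 2>&1; then
  installed=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  if [ -z "$installed" ]; then
    echo "could not determine the Java version"
  elif java_ok_for_splink "$installed"; then
    echo "Java $installed is earlier than 18"
  else
    echo "Java $installed is too new (need earlier than 18)"
  fi
else
  echo "java not found on PATH"
fi
```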

PostgreSQL (optional)

Follow the instructions on the PostgreSQL website to install it on your computer.

Then, we will need to set up a database for Splink. You can achieve that with the following commands:

initdb splink_db
pg_ctl -D splink_db start --wait -l ./splink_db_log
createdb splink_db # The inner database
psql -d splink_db <<SQL
  CREATE USER splinkognito CREATEDB CREATEROLE password 'splink123!' ;
SQL

Most of these commands are one-time setup, but the pg_ctl -D splink_db start --wait -l ./splink_db_log command will need to be run each time you want to start PostgreSQL (after rebooting, for example).
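
Since the start command has to be repeated after each reboot, one option is a small wrapper that starts the server only when it is not already running. This is a sketch under the assumption that the data directory is the splink_db directory created above; the function name `start_splink_pg` is hypothetical.

```shell
# Start the Splink PostgreSQL server only if it is not already running.
# `pg_ctl status` exits non-zero when no server is running in the data directory.
start_splink_pg() {
  data_dir="${1:-splink_db}"
  if pg_ctl -D "$data_dir" status >/dev/null 2>&1; then
    echo "PostgreSQL already running in $data_dir"
  else
    pg_ctl -D "$data_dir" start --wait -l "./${data_dir}_log"
  fi
}
```

Run start_splink_pg from the directory that contains splink_db.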

Alternatively, you can run PostgreSQL using Docker. First, install Docker Desktop.

Then run the setup script (a thin wrapper around docker-compose) each time you want to start your PostgreSQL server:

./scripts/postgres_docker/setup.sh

and the teardown script each time you want to stop it:

./scripts/postgres_docker/teardown.sh

Included in the docker-compose file is a pgAdmin container to allow easy exploration of the database as you work, which can be accessed in-browser on the default port. The default username is a@b.com with password b.

Step 3, Conda install option: Install system dependencies

These instructions are the same no matter what operating system you are using. As an added benefit, these installations will be specific to the conda environment you create for Splink, so they will not interfere with other projects.

For convenience, we have created an automatic installation script that will install all dependencies for you. It will create an isolated conda environment called splink.

From the directory where you have cloned the Splink repository, simply run:

./scripts/conda/development_setup_with_conda.sh

If you use a shell besides bash, add the mamba CLI to your PATH by running ~/miniforge3/bin/mamba init <your_shell> -- e.g. ~/miniforge3/bin/mamba init zsh for zsh.

If you've run this successfully, restart your terminal and skip to the "Step 5: Activating your environment(s)" section.

If you would prefer to manually go through the steps to have a better understanding of what you are installing, continue to the next section.

Install Conda itself

First, we need to install a conda CLI. Any will do, but we recommend Miniforge, which can be installed like so:

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b

Miniforge is great because it defaults to the community-curated conda-forge channel, and it installs the mamba CLI by default, which is generally faster than the conda CLI.

Before you'll be able to run the mamba command, you need to run ~/miniforge3/bin/mamba init for your shell -- e.g. ~/miniforge3/bin/mamba init for Bash or ~/miniforge3/bin/mamba init zsh for zsh.
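
If you are unsure which variant applies to you, the following sketch prints the init command for your current login shell, following the pattern described above (no argument for Bash, the shell name otherwise). It assumes Miniforge was installed to its default location under your home directory; the function name `mamba_init_command` is illustrative.

```shell
# Print the mamba init command for a given shell name:
# no argument for Bash, the shell name otherwise.
mamba_init_command() {
  case "$1" in
    bash) echo "$HOME/miniforge3/bin/mamba init" ;;
    *)    echo "$HOME/miniforge3/bin/mamba init $1" ;;
  esac
}

# Show the command for the current login shell (taken from $SHELL).
mamba_init_command "$(basename "${SHELL:-bash}")"
```

The snippet only prints the command; copy it into your terminal to run it.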

Install Conda packages

The rest is easy, because all the other dependencies can be installed as conda packages. Simply run:

mamba env create -n splink --file ./scripts/conda/development_environment.yaml

Now run mamba activate splink to enter your newly created conda environment -- you will need to do this again each time you open a new terminal. Run the rest of the steps in this guide inside this environment. mamba deactivate leaves the environment.

Step 4: Python package dependencies

Splink manages the other Python packages it depends on using Poetry. Simply run poetry install in the Splink directory to install them. You can find more options for this command (such as how to install optional dependencies) on the managing dependencies with Poetry page.

To enter the virtual environment created by poetry, run poetry shell. You will need to do this again each time you open a new terminal. Use exit to leave the Poetry shell.

Step 5: Activating your environment(s)

Depending on the options you chose in this document, you now have either:

  • Only a Poetry virtual environment.
  • Both a conda environment and a Poetry virtual environment.

If you did not use conda, then each time you open a terminal to develop Splink, after navigating to the repository directory, run poetry shell.

If you did use conda, then each time you open a terminal to develop Splink, after navigating to the repository directory, run mamba activate splink and then poetry shell.
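
A quick way to see which environments are active in the current terminal is sketched below. It relies on two assumptions: that conda or mamba activation sets the CONDA_DEFAULT_ENV variable, and that entering the Poetry shell sets VIRTUAL_ENV, as Python virtual environments normally do. The function name `splink_env_status` is hypothetical.

```shell
# Describe which development environments look active in this shell.
# CONDA_DEFAULT_ENV is assumed to be set by conda/mamba activation;
# VIRTUAL_ENV is assumed to be set when Poetry's virtual environment is active.
splink_env_status() {
  if [ "${CONDA_DEFAULT_ENV:-}" = "splink" ]; then
    echo "conda: splink"
  else
    echo "conda: not active"
  fi
  if [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "virtualenv: $VIRTUAL_ENV"
  else
    echo "virtualenv: not active"
  fi
}

splink_env_status
```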

Step 6: Checking that it worked

If you have installed all the dependencies, including PostgreSQL, you should be able to run the following command without error (this will take about 10 minutes):

pytest tests/

This runs all the Splink tests across the default DuckDB and Spark backends, and runs some integration tests across the rest of the backends except for Athena, which can't run locally.

If you haven't installed PostgreSQL, try this:

pytest tests/ --ignore tests/test_full_example_postgres.py
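
The choice between the two commands above can be captured in a small helper, sketched here. The function name `splink_test_command` and its "yes"/"no" argument are hypothetical conveniences, not something the Splink repository provides.

```shell
# Choose the test command depending on whether PostgreSQL is set up.
# The argument is "yes" if PostgreSQL is available, anything else otherwise.
splink_test_command() {
  if [ "${1:-no}" = "yes" ]; then
    echo "pytest tests/"
  else
    echo "pytest tests/ --ignore tests/test_full_example_postgres.py"
  fi
}

# Print the command for a machine without PostgreSQL.
splink_test_command no
```

The helper only prints the command, so copy the output into your terminal to run the tests.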

Step 7: Visual Studio Code (optional)

You're now all set to develop Splink. If you have a text editor/IDE you are comfortable with for working on Python packages, you can use that. If you don't, we recommend Visual Studio Code. Here are some tips on how to get started: