
Splink 3 updates, and Splink 4 development announcement - April 2024

This post describes significant updates to Splink since our previous post and details of development work taking place on the forthcoming release of Splink 4.

Latest Splink version: v3.9.14

Here are some highlights of Splink development since our last update in December 2023. As always, keep an eye on the changelog for more regular updates.

Graph metrics

Linked data can be interpreted as graphs, as described in our graph definitions guide. Given this, graph metrics are useful in record linkage because they give insights into the quality of your final output (linked data) and, by extension, the linkage pipeline. They are particularly relevant for the analysis of clusters.

For example, a cluster where all entities are connected to all others with high match weights is likely to be more reliable than a cluster where many of the entities connect to only a small proportion of the other entities in the cluster. This can be measured by a graph metric called density.

Several graph metrics can now be computed using linker.compute_graph_metrics.
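To make the idea of density concrete, here is a minimal sketch in plain Python. It does not use Splink's API; the function name cluster_density and the toy clusters are assumptions made purely for illustration. Density is taken to be the number of edges present in a cluster divided by the number of edges that could exist between its nodes.

```python
def cluster_density(nodes, edges):
    """Density of an undirected cluster: edges present / edges possible.

    Hypothetical helper for illustration only, not part of Splink.
    """
    n = len(nodes)
    if n < 2:
        return None  # density is undefined for a single-node cluster
    possible = n * (n - 1) / 2
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return len(unique_edges) / possible


nodes = ["a", "b", "c", "d"]

# Every record is linked to every other record
fully_connected = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
]

# Records are linked in a chain: a-b-c-d
chain = [("a", "b"), ("b", "c"), ("c", "d")]

dense = cluster_density(nodes, fully_connected)
sparse = cluster_density(nodes, chain)
print(dense, sparse)
```

The fully connected cluster has the maximum possible density, while the chain has only half of its possible edges, which is the kind of cluster that may deserve closer inspection.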

🚀 DuckDB Performance Improvements and Benchmarking

The DuckDB backend is now fully parallelised, resulting in large performance increases especially on high core count machines.

We now recommend the DuckDB backend for most users. It is the fastest backend, and is capable of linking large datasets, especially if you have access to high-spec machines.

For the first time, we have also conducted formal benchmarking of DuckDB on machines of different sizes. Check out our blog post outlining the results of this investigation.

Blocking on an array column

In some circumstances, it is useful to block on an array column. For example, if each person's record has an array (list) of associated postcodes, then we may wish to generate all record comparisons where at least one postcode matches (the intersection of the two arrays is of length 1 or more). This feature was added in PR 1692, with thanks to GitHub user nerskin for this external contribution!
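The rule behind array blocking can be sketched in plain Python as follows. This is only an illustration of which pairs are generated, not how Splink implements the feature; the function array_block and the sample records are hypothetical.

```python
from itertools import combinations

# Toy input: each record carries a list of postcodes (made-up values)
records = [
    {"id": 1, "postcodes": ["AB1 2CD", "EF3 4GH"]},
    {"id": 2, "postcodes": ["EF3 4GH"]},
    {"id": 3, "postcodes": ["XY9 8ZW"]},
    {"id": 4, "postcodes": ["XY9 8ZW", "AB1 2CD"]},
]


def array_block(records, column):
    """Return id pairs whose arrays in `column` share at least one value."""
    pairs = []
    for left, right in combinations(records, 2):
        # A pair is kept when the two arrays have a non-empty intersection
        if set(left[column]) & set(right[column]):
            pairs.append((left["id"], right["id"]))
    return pairs


candidate_pairs = array_block(records, "postcodes")
print(candidate_pairs)
```

Records 1 and 3 share no postcode, so that pair is never generated. The brute-force loop here compares every pair and is only suitable for a toy example.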

📚 More Documentation

We have been building more guidance and documentation to make life as easy as possible for users.

Warning

Splink 3 has entered maintenance mode. We will continue to apply bugfixes, but new features should be built on the splink4_dev branch. We are no longer accepting new features on the master (Splink 3) branch.

The team has been focussing development efforts on Splink 4, due to be released later this year.

We’re pleased to announce we’ve recently reached an important milestone: all tests are passing, and all of the tutorial and example notebooks have been updated and work successfully in the new version.

Development releases of Splink 4 have commenced, and you can try it out using pip install --pre splink, or try it out in your web browser using the Colab links at the top of the tutorial and example notebooks.

As a result, Splink 3 has entered maintenance mode. We will continue to apply bugfixes, but new features should be built on the splink4_dev branch. We are no longer accepting new features on the master (Splink 3) branch.

Splink 4 represents an incremental improvement to version 3 that makes Splink easier to use without making any major changes to workflows. The core functionality has not changed - the steps to train a model and predict results are the same, and models trained in Splink 3 will still work in Splink 4.

Improve ease of use

The primary aim is to improve the user-facing API so that:

  • The user has to write less code to achieve the same result
  • Function imports are simpler and grouped more intuitively
  • Settings and configuration can now be constructed entirely using Python objects, meaning that the user can rely heavily on autocomplete, rather than needing to remember the names of settings.
  • Less dialect-specific code is needed

You can see an example of how the code changes between version 3 and 4 in the screenshot below:

[Image: screenshot comparing the same code in Splink 3 and Splink 4]

The corresponding code is here.

Improve ease of development

A second important aim of Splink 4 is to improve the internal codebase to make Splink easier to develop for the core team and external contributors. These changes don’t affect the end user, but should enable a faster pace of development.

A wide range of improvements have been made such as:

  • Code quality: type hinting, mypy conformance etc.
  • Making CI run much faster
  • Reducing rigidities in dependencies
  • Decoupling parts of the codebase and reducing mutable state

🗓 Timelines

We expect to do regular beta releases to PyPI in the coming months. They can be found here, and you can install the latest version of Splink 4 by running pip install --pre splink.

Warning

During this time, there may be further breaking changes to the public API so please use Splink 4 with caution. However, we think the new API is now relatively stable, and big changes are unlikely.

We expect to bring Splink 4 out of beta, and do a first full release sometime in the autumn.

🗣 Feedback

We would love feedback on Splink 4, so please check it out and let us know what you think! The best way to get in contact is via our discussion forum.