Splink Updates - July 2023

It's hard to keep up to date with all of the new features being added to Splink, so we have launched this blog to share a round-up of the latest developments every few months.

So, without further ado, here are some of the highlights from the first half of 2023!

Latest Splink version: v3.9.4

🚀 Massive speed gains in EM training

There's now an option to make EM training much faster: in one example we've seen a 1000-fold speedup. Kudos to external contributor @aymonwuolanne from the Australian Bureau of Statistics!

To make use of this, set the estimate_without_term_frequencies parameter to True; for example:

linker.estimate_parameters_using_expectation_maximisation(..., estimate_without_term_frequencies=True)

Note: If True, the EM algorithm ignores term frequency adjustments during the iterations. Instead, the adjustments are added once the EM algorithm has converged. This will result in a slight difference in the final parameter estimates.
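
To give a feel for what "added once the EM algorithm has converged" means, here is a minimal sketch in plain Python. It is not Splink's implementation. It assumes the common formulation in which the exact-match Bayes factor m/u is rescaled by u/tf(value), where tf(value) is the relative frequency of the matched value. The numbers and the helper names are made up for illustration.

```python
import math

def match_weight(m: float, u: float) -> float:
    """Log2 Bayes factor for a comparison level with probabilities m and u."""
    return math.log2(m / u)

def tf_adjusted_weight(m: float, u: float, term_frequency: float) -> float:
    """Weight after rescaling the Bayes factor by u / term_frequency.

    Assumption: this mirrors the usual term-frequency adjustment, applied
    after m and u have been estimated without looking at value frequencies.
    """
    return match_weight(m, u) + math.log2(u / term_frequency)

# Hypothetical converged EM estimates for an exact match on first_name.
m_exact, u_exact = 0.9, 0.01

# Hypothetical relative frequencies of two names in the data.
term_frequencies = {"john": 0.05, "zebedee": 0.0001}

base = match_weight(m_exact, u_exact)
adjusted = {
    name: tf_adjusted_weight(m_exact, u_exact, tf)
    for name, tf in term_frequencies.items()
}

print(f"unadjusted weight: {base:.2f}")
for name, weight in adjusted.items():
    print(f"{name}: {weight:.2f}")
```

In this toy setup an exact match on a rare name ends up with a higher weight than the unadjusted estimate, and a match on a common name with a lower one, without the value frequencies ever entering the EM iterations.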

๐ŸŽ Out-of-the-box Comparisonsยถ

Splink now contains lots of new out-of-the-box comparisons for dates, names, postcodes etc. The Comparison Template Library (CTL) provides suggested settings for common types of data used in linkage models.

For example, a Comparison for "first_name" can now be written as:

import splink.duckdb.comparison_template_library as ctl

first_name_comparison = ctl.name_comparison("first_name")
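
Conceptually, a comparison like this assigns every pair of values to one of a small number of ordered levels (null, exact match, one or more fuzzy-match bands, everything else). The sketch below illustrates that idea in plain Python; it is not what ctl.name_comparison generates. The thresholds are made up, and difflib's ratio is only a stand-in for a proper string-similarity measure such as Jaro-Winkler.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical similarity thresholds, highest first.
THRESHOLDS = (0.9, 0.8)

def comparison_level(left: Optional[str], right: Optional[str]) -> str:
    """Return the label of the first level that the pair satisfies."""
    if left is None or right is None:
        return "null"
    if left == right:
        return "exact match"
    # Stand-in similarity score between 0 and 1.
    similarity = SequenceMatcher(None, left, right).ratio()
    for threshold in THRESHOLDS:
        if similarity >= threshold:
            return f"similarity >= {threshold}"
    return "all other comparisons"

pairs = [("john", "john"), ("john", "jon"), ("john", "mary"), ("john", None)]
for left, right in pairs:
    print(left, right, "->", comparison_level(left, right))
```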

Check out these new functions in the CTL Topic Guide and CTL Documentation.

Blocking Rule Library

Blocking has historically been a point of confusion for users, so we have been working behind the scenes to make it easier! The recently launched Blocking Rule Library (BRL) provides a set of functions for defining Blocking Rules (similar to the Comparison Library functions).

For example, a Blocking Rule for "date_of_birth" can now be written as:

import splink.duckdb.blocking_rule_library as brl

brl.exact_match_rule("date_of_birth")

Note: from Splink v3.9.6, exact_match_rule has been superseded by block_on. We advise using block_on going forward.
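
As a rough illustration of what an exact-match blocking rule achieves, the sketch below groups records by date_of_birth and only pairs records within the same group, rather than comparing every record with every other. It is plain Python with made-up records, not how Splink executes blocking (Splink generates SQL for the chosen backend).

```python
from collections import defaultdict
from itertools import combinations

# Made-up records for illustration.
records = [
    {"id": 1, "date_of_birth": "1990-01-01"},
    {"id": 2, "date_of_birth": "1990-01-01"},
    {"id": 3, "date_of_birth": "1985-06-30"},
    {"id": 4, "date_of_birth": "1990-01-01"},
    {"id": 5, "date_of_birth": None},
]

def blocked_pairs(rows, column):
    """Yield id pairs whose values in `column` are equal and not null."""
    blocks = defaultdict(list)
    for row in rows:
        if row[column] is not None:  # nulls never satisfy an equality rule
            blocks[row[column]].append(row["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)

all_pairs = list(combinations([row["id"] for row in records], 2))
candidate_pairs = list(blocked_pairs(records, "date_of_birth"))

print(f"{len(all_pairs)} pairs without blocking")
print(f"{len(candidate_pairs)} pairs after blocking: {candidate_pairs}")
```

Even with five records the number of pairs to score drops from 10 to 3; on real datasets the reduction is what makes linkage tractable.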

Check out these new functions in the BRL Documentation as well as some new Blocking Topic Guides to better explain what Blocking Rules are, how they are used in Splink, and how to choose them.

Keep a look out, as there are more improvements in the pipeline for Blocking in the coming months!

๐Ÿ˜ Postgres Supportยถ

With a massive thanks to external contributor @hanslemm, Splink now supports Postgres. To get started, check out the Postgres Topic Guide.

๐Ÿท Clerical Labelling Tool (beta)ยถ

Clerical labelling is an important tool for generating performance metrics for linkage models (False Positive Rate, Recall, Precision etc.).
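
For reference, the metrics mentioned above can be computed from clerical labels and model predictions as in the following sketch. The labels and predictions are made up, and the function is a generic illustration rather than part of Splink.

```python
def linkage_metrics(labels, predictions):
    """Compute precision, recall and false positive rate.

    `labels` are clerical decisions (True = same entity) and `predictions`
    are the model's decisions for the same record pairs.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for label, pred in pairs if label and pred)
    fp = sum(1 for label, pred in pairs if not label and pred)
    fn = sum(1 for label, pred in pairs if label and not pred)
    tn = sum(1 for label, pred in pairs if not label and not pred)
    return {
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "false_positive_rate": fp / (fp + tn) if fp + tn else float("nan"),
    }

# Made-up clerical labels and model predictions for eight record pairs.
clerical_labels = [True, True, True, False, False, False, False, True]
model_predictions = [True, True, False, False, True, False, False, True]

metrics = linkage_metrics(clerical_labels, model_predictions)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```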

Splink now has a (beta) GUI for clerical labelling which produces labels in a form that can be easily ingested into Splink to generate these performance metrics. Check out the example tool, the linked Pull Request, and some previous tweets.

This tool is still in the beta phase, so is a work in progress and subject to change based on feedback we get from users. As a result, it is not thoroughly documented at this stage. We recommend checking out the links above to see a ready-made example of the tool. However, if you would like to generate your own, this example is a good starting point.

We would love any feedback from users, so please comment on the PR or open a discussion.

📊 Charts in Altair 5

Charts are now all fully-fledged Altair charts, making them much easier to work with.

For example, a chart c can now be saved with:

c.save("chart.png", scale_factor=2)

where json, html, png, svg and pdf are all supported.

Reduced duplication in Comparison libraries

Historically, importing the comparison libraries (CL, CTL, CLL) has meant declaring the backend twice. For example:

import splink.duckdb.duckdb_comparison_level_library as cll

This repetition has now been removed:

import splink.duckdb.comparison_level_library as cll

The original structure still works, but raises a warning prompting a switch to the new version.

In-built datasets

When following along with the tutorial or example notebooks, one issue can be references to data paths that do not exist on users' machines. To overcome this, Splink now has a splink_datasets module which stores these datasets, so that users can copy and paste working code without fear of path issues. For example:

from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

This returns the fake 1,000-row dataset that is used in the Splink tutorial.

For more information check out the in-built datasets Documentation.

Regular Expressions in Comparisons

When comparing records, some columns will have a particular structure (e.g. dates, postcodes, email addresses). It can be useful to compare sections of a column entry. Splink's string comparison level functions now include a regex_extract parameter to extract the portion of each string that should be compared. For example, an exact_match comparison on the leading letters of a postcode (the postcode area at the start of the outcode) can be written as:

import splink.duckdb.comparison_library as cl

pc_comparison = cl.exact_match("postcode", regex_extract="^[A-Z]{1,2}")
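
To see which part of a postcode a pattern like this keeps, it can be tried out with Python's re module before putting it into a comparison. This is only an approximation of what happens inside Splink, since the regular expression is evaluated by the SQL backend (DuckDB here), whose regex dialect may differ from Python's for more complex patterns. The extract helper and the sample postcodes are just for illustration.

```python
import re

PATTERN = re.compile(r"^[A-Z]{1,2}")

def extract(postcode: str) -> str:
    """Return the part of `postcode` matched by PATTERN, or '' if none."""
    match = PATTERN.search(postcode)
    return match.group(0) if match else ""

for postcode in ["SW1A 1AA", "M1 1AE", "sw1a 1aa"]:
    print(repr(postcode), "->", repr(extract(postcode)))

# Two postcodes compare as equal when their extracted parts are equal.
print(extract("SW1A 1AA") == extract("SW1P 3BT"))
```

Note that the pattern is case-sensitive, so lower-case postcodes would need standardising before the comparison.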

Splink's string comparison level functions also now include a valid_string_regex parameter which sends any entries that do not conform to a specified structure to the null level. For example, a levenshtein comparison that ensures emails have an "@" symbol can be written as:

import splink.duckdb.comparison_library as cl

email_comparison = cl.levenshtein_at_thresholds("email", valid_string_regex="^[^@]+@[^@]+$")
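
When writing a validation pattern it is worth checking which strings it actually accepts. The sketch below uses Python's re module as a stand-in for the backend's regex engine (an assumption: dialects can differ) and treats a value as valid if the pattern is found anywhere in it, which is why the anchors matter. A pattern such as ^[^@]+ only requires the value to start with a character other than "@", so requiring an "@" means including it in the pattern explicitly, as in the illustrative ^[^@]+@[^@]+$ below.

```python
import re

def is_valid(value: str, pattern: str) -> bool:
    """True if `pattern` is found in `value` (search semantics)."""
    return re.search(pattern, value) is not None

starts_without_at = r"^[^@]+"      # does not require an "@"
requires_at = r"^[^@]+@[^@]+$"     # exactly one "@", text on both sides

for email in ["jane@example.com", "jane.example.com", "@example.com"]:
    print(
        f"{email!r}: "
        f"starts_without_at={is_valid(email, starts_without_at)}, "
        f"requires_at={is_valid(email, requires_at)}"
    )
```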

For more on how Regular Expressions can be used in Splink, check out the Regex topic guide.

Note: from Splink v3.9.6, valid_string_regex has been renamed as valid_string_pattern.

📚 Documentation Improvements

We have been putting a lot of effort into improving our documentation site, including launching this blog!

Some of the improvements include:

Thanks to everyone who filled out our feedback survey. If you have any more feedback or ideas for how we can make the docs better please do let us know by raising an issue, starting a discussion or filling out the survey.

🔜 What's in the pipeline?

  • More Blocking improvements
  • 📋 Settings dictionary improvements
  • More guidance on how to evaluate Splink models and linkages