Splink Updates - July 2023¶
Welcome to the Splink Blog! ¶
It's hard to keep up to date with all of the new features being added to Splink, so we have launched this blog to share a round-up of the latest developments every few months.
So, without further ado, here are some of the highlights from the first half of 2023!
Latest Splink version: v3.9.4
Massive speed gains in EM training¶
There’s now an option to make EM training much faster - in one example we’ve seen a 1,000-fold speedup. Kudos to external contributor @aymonwuolanne from the Australian Bureau of Statistics!
To make use of this, set the `estimate_without_term_frequencies` parameter to `True`; for example:

```python
linker.estimate_parameters_using_expectation_maximisation(..., estimate_without_term_frequencies=True)
```
Note: if `True`, the EM algorithm ignores term frequency adjustments during its iterations; the adjustments are instead applied once the algorithm has converged. This results in slight differences in the final parameter estimates.
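To see why term frequencies matter, note that agreement on a rare name is stronger evidence of a match than agreement on a common one. The sketch below is a toy illustration of that idea only (it is not Splink's internal implementation): it computes relative value frequencies and a match-weight adjustment that is positive for rarer-than-average values and negative for common ones.

```python
import math

def term_frequencies(values):
    """Compute the relative frequency of each value in a column."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def tf_adjustment(tf, average_tf):
    """Illustrative match-weight adjustment (in bits): rarer-than-average
    values get a positive adjustment, common values a negative one."""
    return math.log2(average_tf / tf)

names = ["john", "john", "john", "zubin"]
tfs = term_frequencies(names)
avg = 1 / len(tfs)  # frequency if all distinct values were equally common
print(tf_adjustment(tfs["zubin"], avg))  # positive: rare name
print(tf_adjustment(tfs["john"], avg))   # negative: common name
```

Deferring these adjustments until after convergence, rather than recomputing them every iteration, is what makes `estimate_without_term_frequencies=True` so much faster.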
Out-of-the-box Comparisons¶
Splink now contains lots of new out-of-the-box comparisons for dates, names, postcodes etc. The Comparison Template Library (CTL) provides suggested settings for common types of data used in linkage models.
For example, a Comparison for `"first_name"` can now be written as:

```python
import splink.duckdb.comparison_template_library as ctl

first_name_comparison = ctl.name_comparison("first_name")
```
Check out these new functions in the Topic Guide and Documentation.
Blocking Rule Library¶
Blocking has, historically, been a point of confusion for users so we have been working behind the scenes to make that easier! The recently launched Blocking Rules Library (BRL) provides a set of functions for defining Blocking Rules (similar to the Comparison Library functions).
For example, a Blocking Rule for `"date_of_birth"` can now be written as:

```python
import splink.duckdb.blocking_rule_library as brl

brl.exact_match_rule("date_of_birth")
```
Note: from Splink v3.9.6, `exact_match_rule` has been superseded by `block_on`. We advise using `block_on` going forward.
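Conceptually, a blocking rule restricts the record pairs that get compared to those agreeing on some key, so the linker never has to score every possible pair. The plain-Python sketch below illustrates the idea only (Splink actually implements blocking as SQL, e.g. `l.date_of_birth = r.date_of_birth`):

```python
from itertools import combinations

def block_on_key(records, key):
    """Generate candidate pairs only within groups of records that
    share the same value of `key`, mimicking a blocking rule."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[key], []).append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

records = [
    {"id": 1, "date_of_birth": "1990-01-01"},
    {"id": 2, "date_of_birth": "1990-01-01"},
    {"id": 3, "date_of_birth": "1985-05-20"},
]
pairs = list(block_on_key(records, "date_of_birth"))
print(len(pairs))  # 1: only records 1 and 2 share a date of birth
```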
Check out these new functions in the BRL Documentation as well as some new Blocking Topic Guides to better explain what Blocking Rules are, how they are used in Splink, and how to choose them.
Keep an eye out, as there are more improvements in the pipeline for Blocking in the coming months!
Postgres Support¶
With a massive thanks to external contributor @hanslemm, Splink now supports Postgres. To get started, check out the Postgres Topic Guide.
Clerical Labelling Tool (beta)¶
Clerical labelling is an important tool for generating performance metrics for linkage models (False Positive Rate, Recall, Precision etc.).
Splink now has a (beta) GUI for clerical labelling which produces labels in a form that can be easily ingested into Splink to generate these performance metrics. Check out the example tool, linked Pull Request, and some previous tweets:
> Draft new Splink tool to speed up manual labelling of record linkage data. Example dashboard: https://t.co/yc1yHpa90X
>
> Grateful for any feedback whilst I'm still working on this, on Twitter or the draft PR: https://t.co/eXSNHHe2kc
>
> Free and open source pic.twitter.com/MEo4DmaxO9
>
> — Robin Linacre (@RobinLinacre) April 28, 2023
This tool is still in its beta phase, so it is a work in progress and subject to change based on feedback we get from users. As a result, it is not thoroughly documented at this stage. We recommend checking out the links above to see a ready-made example of the tool. However, if you would like to generate your own, this example is a good starting point.
We would love any feedback from users, so please comment on the PR or open a discussion.
Charts in Altair 5¶
Charts are now all fully-fledged Altair charts, making them much easier to work with.
For example, a chart `c` can now be saved with:

```python
c.save("chart.png", scale_factor=2)
```

where `json`, `html`, `png`, `svg` and `pdf` formats are all supported.
Reduced duplication in Comparison libraries¶
Historically, importing the comparison libraries meant declaring the backend twice. For example:

```python
import splink.duckdb.duckdb_comparison_level_library as cll
```

This duplication has now been removed:

```python
import splink.duckdb.comparison_level_library as cll
```
In-built datasets¶
When following along with the tutorial or example notebooks, one common issue is references to paths to data that do not exist on users' machines. To overcome this, Splink now has a `splink_datasets` module which stores these datasets, so any user can copy and paste working code without fear of path issues. For example:

```python
from splink.datasets import splink_datasets

df = splink_datasets.fake_1000
```
For more information check out the in-built datasets Documentation.
Regular Expressions in Comparisons¶
When comparing records, some columns have a particular structure (e.g. dates, postcodes, email addresses), and it can be useful to compare only a section of a column entry. Splink's string comparison level functions now include a `regex_extract` parameter to extract a portion of the strings to be compared. For example, an `exact_match` comparison on the first section of a postcode (the outcode) can be written as:

```python
import splink.duckdb.duckdb_comparison_library as cl

pc_comparison = cl.exact_match("postcode", regex_extract="^[A-Z]{1,2}")
```
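To see what the pattern `^[A-Z]{1,2}` captures, here is a plain-Python sketch of the same extraction using the standard `re` module (illustrative only; Splink applies the regex within SQL before comparing):

```python
import re

def extract_outcode_letters(postcode):
    """Return the leading one or two capital letters of a UK postcode,
    or None if the value does not start with a capital letter."""
    match = re.match(r"^[A-Z]{1,2}", postcode)
    return match.group(0) if match else None

print(extract_outcode_letters("SW1A 1AA"))  # SW
print(extract_outcode_letters("E1 6AN"))    # E
```

Two postcodes then agree at this level whenever their extracted outcode letters match exactly, even if the rest of the postcode differs.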
Splink's string comparison level functions also now include a `valid_string_regex` parameter, which sends any entries that do not conform to a specified structure to the null level. For example, a `levenshtein` comparison that ensures emails contain an "@" symbol can be written as:

```python
import splink.duckdb.duckdb_comparison_library as cl

email_comparison = cl.levenshtein_at_thresholds("email", valid_string_regex="^[^@]+@[^@]+$")
```
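As a plain-Python sketch of this validity check (illustrative only, not Splink's implementation), a pattern such as `^[^@]+@[^@]+$` accepts strings containing exactly one '@' with text on either side, and everything else is sent to the null level:

```python
import re

EMAIL_PATTERN = r"^[^@]+@[^@]+$"  # exactly one '@' with text either side

def to_null_if_invalid(value, pattern):
    """Send entries that don't match the pattern to the null level."""
    return value if re.fullmatch(pattern, value) else None

print(to_null_if_invalid("robin@example.com", EMAIL_PATTERN))  # robin@example.com
print(to_null_if_invalid("not-an-email", EMAIL_PATTERN))       # None
```

Sending malformed values to the null level means they neither count as evidence for nor against a match.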
For more on how Regular Expressions can be used in Splink, check out the Regex topic guide.
Note: from Splink v3.9.6, `valid_string_regex` has been renamed to `valid_string_pattern`.
Documentation Improvements¶
We have been putting a lot of effort into improving our documentation site, including launching this blog!
Some of the improvements include:
- More Topic Guides covering things such as Record Linkage Theory, Guidance on Splink's backends and String Fuzzy Matching.
- A Contributors Guide to make contributing to Splink even easier. If you are interested in getting involved in open source, check the guide out!
- Adding tables to the Comparison libraries documentation to show the functions available for each SQL backend.
Thanks to everyone who filled out our feedback survey. If you have any more feedback or ideas for how we can make the docs better please do let us know by raising an issue, starting a discussion or filling out the survey.
What's in the pipeline?¶
- More Blocking improvements
- Settings dictionary improvements
- More guidance on how to evaluate Splink models and linkages