In-built datasets

Splink provides several in-built datasets to help you get up and running, test ideas, or explore Splink features. To use them, import splink_datasets:

from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

You can then use the resulting DataFrame to set up a linker:

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

df = splink_datasets.fake_1000
linker = DuckDBLinker(
    df,
    {
        "link_type": "dedupe_only",
        "comparisons": [cl.exact_match("first_name"), cl.exact_match("surname")],
    },
)
Troubleshooting

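If you get an SSLCertVerificationError when trying to use the in-built datasets, you can work around it with the standard-library ssl module by running the following before requesting a dataset. Note that this disables HTTPS certificate verification for the whole Python process, so use it with care:

import ssl

ssl._create_default_https_context = ssl._create_unverified_context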

Each attribute of splink_datasets is a dataset available for use, returned as a pandas DataFrame. These datasets are not packaged with Splink; they are downloaded only when requested and are then cached for future use. The cache can be cleared using splink_dataset_utils, which also reports which datasets are available and which have already been cached.
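
The download-on-request and caching behaviour described above can be pictured with a small, self-contained sketch. This is not Splink's implementation: the LazyDatasets class, its method names, and the fake_fetch stand-in for a download are all hypothetical and exist only to illustrate the pattern.

```python
import tempfile
from pathlib import Path


class LazyDatasets:
    """Illustrative only: fetch a dataset on first access, then reuse the cached copy."""

    def __init__(self, cache_dir, fetch, available):
        self._cache_dir = Path(cache_dir)
        self._fetch = fetch  # hypothetical callable: dataset name -> bytes
        self._available = set(available)

    def __getattr__(self, name):
        # Only reached for attributes not found normally, i.e. dataset names.
        if name.startswith("_") or name not in self._available:
            raise AttributeError(name)
        path = self._cache_dir / f"{name}.csv"
        if not path.exists():
            # First request: "download" the data and store it in the cache.
            path.write_bytes(self._fetch(name))
        return path.read_bytes()

    def list_downloaded(self):
        # Names of the datasets currently held in the cache.
        return sorted(p.stem for p in self._cache_dir.glob("*.csv"))

    def clear(self, datasets=None):
        # None clears everything; otherwise only the named datasets.
        for p in self._cache_dir.glob("*.csv"):
            if datasets is None or p.stem in datasets:
                p.unlink()


calls = []


def fake_fetch(name):
    # Stand-in for a network download; records each call so we can count them.
    calls.append(name)
    return b"unique_id,first_name\n0,Julia\n"


datasets = LazyDatasets(tempfile.mkdtemp(), fake_fetch, ["fake_1000", "febrl3"])

first = datasets.fake_1000   # triggers the stand-in download
second = datasets.fake_1000  # served from the cache, no second fetch
```

In this sketch the second access never calls fake_fetch, which mirrors why a cached dataset remains usable with a limited internet connection.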

Available datasets

The datasets available are listed below:

| dataset name | description | rows | unique entities | link to source |
| --- | --- | --- | --- | --- |
| fake_1000 | Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled. | 1,000 | 250 | source |
| historical_50k | The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors. | 50,000 | 5,156 | source |
| febrl3 | The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL3 data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record. | 5,000 | 2,000 | source |
| febrl4a | The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records. | 5,000 | 5,000 | source |
| febrl4b | The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a. | 5,000 | 5,000 | source |
| transactions_origin | This data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing. | 45,326 | 45,326 | source |
| transactions_destination | This data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing. | 45,326 | 45,326 | source |
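
To make the "rows" and "unique entities" columns concrete: fake_1000 has 1,000 rows describing 250 people, so each person appears four times on average. The sketch below computes the same quantities on a handful of made-up records. It assumes that a column named cluster holds the true entity identifier; that column name and the records themselves are illustrative assumptions, not taken from the table above.

```python
from collections import Counter

# Made-up records for illustration; "cluster" is assumed to hold the true entity ID.
records = [
    {"unique_id": 0, "first_name": "Julia", "cluster": 0},
    {"unique_id": 1, "first_name": "Julia", "cluster": 0},
    {"unique_id": 2, "first_name": "Juila", "cluster": 0},
    {"unique_id": 3, "first_name": "Noah", "cluster": 1},
    {"unique_id": 4, "first_name": "Noa", "cluster": 1},
    {"unique_id": 5, "first_name": "Lily", "cluster": 2},
]

# Number of records belonging to each entity.
cluster_sizes = Counter(r["cluster"] for r in records)

n_rows = len(records)
n_entities = len(cluster_sizes)
records_per_entity = n_rows / n_entities

# An entity with k records contributes k * (k - 1) / 2 true matching pairs.
n_true_pairs = sum(k * (k - 1) // 2 for k in cluster_sizes.values())
```

The count of true matching pairs is a useful sanity check when deduplicating, since it is the number of pairs a perfect model would link.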

Some of the splink_datasets have corresponding clerical labels to help assess model performance. These are requested through the splink_dataset_labels module.

Available dataset labels

The dataset labels available are listed below:

| dataset name | description | rows | unique entities | link to source |
| --- | --- | --- | --- | --- |
| fake_1000_labels | Clerical labels for fake_1000 | 3,176 | NA | source |
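
As a rough illustration of how clerical labels help assess model performance, the sketch below compares hypothetical match probabilities against hypothetical labelled pairs and computes precision and recall at a chosen threshold. The data, the pair-keyed dictionaries, and the precision_recall helper are assumptions made for this example; Splink provides its own evaluation tooling, and this only shows the underlying idea.

```python
# Hypothetical clerical labels: (id_l, id_r) -> 1.0 for a match, 0.0 for a non-match.
labels = {
    (0, 1): 1.0,
    (0, 2): 1.0,
    (1, 2): 1.0,
    (3, 4): 1.0,
    (0, 3): 0.0,
    (2, 5): 0.0,
}

# Hypothetical model output: (id_l, id_r) -> predicted match probability.
predictions = {
    (0, 1): 0.98,
    (0, 2): 0.91,
    (1, 2): 0.40,
    (3, 4): 0.87,
    (0, 3): 0.03,
    (2, 5): 0.95,
}


def precision_recall(labels, predictions, threshold):
    """Score predictions against clerical labels at a probability threshold."""
    tp = fp = fn = 0
    for pair, label in labels.items():
        # Pairs the model never scored are treated as predicted non-matches.
        predicted_match = predictions.get(pair, 0.0) >= threshold
        is_match = label >= 0.5
        if predicted_match and is_match:
            tp += 1
        elif predicted_match:
            fp += 1
        elif is_match:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


precision, recall = precision_recall(labels, predictions, threshold=0.9)
```

Lowering the threshold generally trades precision for recall, which is why labelled pairs are useful for choosing a threshold.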

In addition to splink_datasets, you can also import splink_dataset_utils, which has a few functions to help manage the datasets. This can be useful if you have a limited internet connection and want to see what is already cached, or if you need to clear items from the cache (e.g. if datasets have been updated, or if disk space is an issue).

For example:

from splink.datasets import splink_dataset_utils

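# Print the datasets that are already cached locally
splink_dataset_utils.show_downloaded_data()
# Delete the cached copy of fake_1000
splink_dataset_utils.clear_downloaded_data(['fake_1000'])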

list_downloaded_datasets()

Return a list of datasets that have already been downloaded.

list_all_datasets()

Return a list of all available datasets, regardless of whether they have already been downloaded.

show_downloaded_data()

Print a list of datasets that have already been downloaded.

clear_downloaded_data(datasets=None)

Delete any pre-downloaded data stored locally.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| datasets | list | A list of dataset names (without any file suffix) to delete. If None, all datasets will be deleted. | None |