In-built datasets

Splink provides some datasets to help you get up and running, test ideas, or explore Splink features. To use them, import splink_datasets:

from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

which you can then use to set up a linker:

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

df = splink_datasets.fake_1000
linker = DuckDBLinker(
    df,
    {
        "link_type": "dedupe_only",
        "comparisons": [cl.exact_match("first_name"), cl.exact_match("surname")],
    },
)

Troubleshooting

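If you get an SSLCertVerificationError when trying to use the in-built datasets, you can work around it with the ssl package by running the following before requesting a dataset. Note that this turns off certificate verification for HTTPS requests made through the default context in the current Python session:

import ssl

ssl._create_default_https_context = ssl._create_unverified_context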

`splink_datasets`

Each attribute of splink_datasets is a dataset available for use, returned as a pandas DataFrame. These datasets are not packaged directly with Splink; each one is downloaded the first time it is requested and then cached for future use. The cache can be cleared using splink_dataset_utils, which can also list the available datasets and show which of them have already been cached.
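The download-on-request and caching behaviour described above can be illustrated with a minimal sketch. This is not Splink's implementation: the `LazyDatasets` class, the `fake_fetch` function, and the `.csv` cache file naming are all hypothetical, and the "download" is simulated so that the example runs offline.

```python
import tempfile
from pathlib import Path


class LazyDatasets:
    """Hypothetical registry that fetches a dataset on first attribute access."""

    def __init__(self, cache_dir, fetch):
        self._cache_dir = Path(cache_dir)
        self._fetch = fetch  # callable: dataset name -> raw bytes

    def __getattr__(self, name):
        # __getattr__ only runs for names not found normally, i.e. dataset names.
        if name.startswith("_"):
            raise AttributeError(name)
        path = self._cache_dir / f"{name}.csv"  # assumed cache file naming
        if not path.exists():
            # First request: fetch the data and store it in the cache.
            self._cache_dir.mkdir(parents=True, exist_ok=True)
            path.write_bytes(self._fetch(name))
        return path.read_bytes()


calls = []


def fake_fetch(name):
    # Stands in for an HTTP download and records each call.
    calls.append(name)
    return b"unique_id,first_name\n0,Julia\n"


with tempfile.TemporaryDirectory() as tmp:
    datasets = LazyDatasets(tmp, fake_fetch)
    first = datasets.fake_1000   # triggers the simulated download
    second = datasets.fake_1000  # served from the cache, no second fetch
```

After the block runs, `calls` contains a single entry, which shows that the second access did not fetch the data again.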

Available datasets

The datasets available are listed below:

dataset name	description	rows	unique entities	link to source
`fake_1000`	Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled.	1,000	250	source
`historical_50k`	The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors.	50,000	5,156	source
`febrl3`	The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. The FEBRL3 dataset contains 5,000 records (2,000 originals and 3,000 duplicates), with a maximum of 5 duplicates based on one original record.	5,000	2,000	source
`febrl4a`	The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5,000 original records.	5,000	5,000	source
`febrl4b`	The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5,000 duplicate records, one for each record in FEBRL4a.	5,000	5,000	source
`transactions_origin`	This data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing.	45,326	45,326	source
`transactions_destination`	This data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing.	45,326	45,326	source
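As a rough guide to how much duplication each dataset contains, the rows and unique entities columns can be combined into an average number of records per entity. The sketch below copies the figures for three datasets from the table above; the variable names are illustrative only.

```python
# (rows, unique entities) copied from the table of available datasets
datasets = {
    "fake_1000": (1_000, 250),
    "historical_50k": (50_000, 5_156),
    "febrl3": (5_000, 2_000),
}

# Average number of records that refer to the same underlying entity
records_per_entity = {
    name: rows / entities for name, (rows, entities) in datasets.items()
}

for name, ratio in records_per_entity.items():
    print(f"{name}: {ratio:.2f} records per entity on average")
```

A ratio of 1 would mean no duplicates within the dataset, as is the case for febrl4a, febrl4b, and the two transactions datasets.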

`splink_dataset_labels`

Some of the splink_datasets have corresponding clerical labels to help assess model performance. These are requested through the splink_dataset_labels module.

Available datasets

The datasets available are listed below:

dataset name	description	rows	unique entities	link to source
`fake_1000_labels`	Clerical labels for fake_1000	3,176	NA	source
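To illustrate how clerical labels can help assess model performance, the sketch below compares a set of predicted matches against labelled record pairs and computes precision and recall. The data is made up, and the layout (a pair of record IDs mapped to a 0/1 judgement) is an assumption for illustration; it is not the schema of fake_1000_labels or part of the Splink API.

```python
# Hypothetical clerical labels: each pair of record IDs with a 0/1 judgement.
clerical_labels = {
    (0, 1): 1,
    (0, 2): 1,
    (3, 4): 0,
    (5, 6): 1,
}

# Hypothetical pairs that a model predicted as matches.
predicted_matches = {(0, 1), (3, 4), (5, 6)}

# Labelled matches that the model also predicted.
true_positives = sum(
    1 for pair, label in clerical_labels.items()
    if label == 1 and pair in predicted_matches
)
# Labelled non-matches that the model predicted as matches.
false_positives = sum(
    1 for pair, label in clerical_labels.items()
    if label == 0 and pair in predicted_matches
)
# Labelled matches that the model missed.
false_negatives = sum(
    1 for pair, label in clerical_labels.items()
    if label == 1 and pair not in predicted_matches
)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
```

Here the model finds two of the three labelled matches and makes one false match, so precision and recall are both 2/3.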

`splink_dataset_utils` API

In addition to splink_datasets, you can also import splink_dataset_utils, which has a few functions to help manage the datasets. This can be useful if you have a limited internet connection and want to see what is already cached, or if you need to clear cached items (e.g. if datasets have been updated, or if disk space is an issue).

For example:

from splink.datasets import splink_dataset_utils

splink_dataset_utils.show_downloaded_data()
splink_dataset_utils.clear_downloaded_data(['fake_1000'])

`list_downloaded_datasets()`

Return a list of datasets that have already been downloaded.

`list_all_datasets()`

Return a list of all available datasets, regardless of whether they have already been downloaded.

`show_downloaded_data()`

Print a list of datasets that have already been downloaded.

`clear_downloaded_data(datasets=None)`

Delete any pre-downloaded data stored locally.

Parameters:

Name	Type	Description	Default
`datasets`	`list`	A list of dataset names (without any file suffix) to delete. If `None`, all datasets will be deleted. Default `None`	`None`
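The semantics of the `datasets` parameter (names without a file suffix, with `None` meaning everything) can be sketched as follows. This is a standalone illustration, not Splink's implementation: the `clear_cached_files` helper, the temporary cache directory, and the `.parquet` suffix are all assumptions.

```python
import tempfile
from pathlib import Path


def clear_cached_files(cache_dir, datasets=None):
    """Delete cached files; hypothetical helper, not part of Splink.

    `datasets` holds names without a file suffix; None deletes everything.
    Returns the sorted names that were removed.
    """
    removed = []
    # Materialise the listing first so files can be deleted while looping.
    for path in list(Path(cache_dir).iterdir()):
        if path.is_file() and (datasets is None or path.stem in datasets):
            path.unlink()
            removed.append(path.stem)
    return sorted(removed)


with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files standing in for cached datasets.
    for name in ("fake_1000", "febrl3", "febrl4a"):
        (Path(tmp) / f"{name}.parquet").write_bytes(b"")

    removed_some = clear_cached_files(tmp, ["fake_1000"])  # one named dataset
    remaining = sorted(p.stem for p in Path(tmp).iterdir())
    removed_rest = clear_cached_files(tmp)                 # None: clear all
    left_over = list(Path(tmp).iterdir())
```

Matching on the file stem is what lets callers pass `fake_1000` rather than a full file name.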
