Datasets table

| dataset name | description | rows | unique entities | link to source |
| --- | --- | --- | --- | --- |
| fake_1000 | Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled. | 1,000 | 250 | source |
| historical_50k | The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors. | 50,000 | 5,156 | source |
| febrl3 | The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL3 data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record. | 5,000 | 2,000 | source |
| febrl4a | The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records. | 5,000 | 5,000 | source |
| febrl4b | The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a. | 5,000 | 5,000 | source |
| transactions_origin | This data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing. | 45,326 | 45,326 | source |
| transactions_destination | This data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing. | 45,326 | 45,326 | source |
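
The rows and unique entities figures are related in a simple way: for a labelled dataset, the difference between them is the number of records beyond the first for each entity (for febrl3, 5,000 - 2,000 = 3,000, matching the 3,000 duplicates in its description). The sketch below shows how such a summary could be computed from a labelled dataset. It uses a small hand-made list of records rather than any of the datasets above, and the `cluster` label column and the `summarise` helper are assumptions made for illustration, not part of the Splink API.

```python
from collections import Counter

# Hypothetical toy records. The "cluster" field stands in for a ground-truth
# entity label; the real datasets may name this column differently.
records = [
    {"unique_id": 0, "first_name": "Julia", "cluster": 0},
    {"unique_id": 1, "first_name": "Julia", "cluster": 0},
    {"unique_id": 2, "first_name": "Juila", "cluster": 0},
    {"unique_id": 3, "first_name": "Noah", "cluster": 1},
    {"unique_id": 4, "first_name": "Rosie", "cluster": 2},
    {"unique_id": 5, "first_name": "Rose", "cluster": 2},
]


def summarise(records, label="cluster"):
    """Count rows, distinct entities, and surplus records per labelled dataset."""
    # Number of records carrying each entity label.
    sizes = Counter(record[label] for record in records)
    return {
        "rows": len(records),
        "unique_entities": len(sizes),
        # Records beyond the first one for each entity.
        "duplicates": len(records) - len(sizes),
        "max_cluster_size": max(sizes.values(), default=0),
    }


summary = summarise(records)
print(summary)
```

Where rows equals unique entities, as for febrl4a, febrl4b and the two transactions datasets, there are no duplicates within the dataset itself; per the descriptions above, the matches lie between the paired datasets.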