Example Notebooks
This section provides a series of examples to help you get started with Splink. You can find the underlying notebooks in the demos folder of the Splink repository.
DuckDB examples
Entity type: Persons
Deduplicating 50,000 records of realistic data based on historical persons
Using the link_only setting to link, but not dedupe, two datasets
Accuracy analysis and ROC charts using a ground truth (cluster) column
Estimating m probabilities from pairwise labels
Deduplicating 50,000 records with Deterministic Rules
Deduplicating the febrl3 dataset. Note this dataset comes from febrl, as referenced in A.2 here and replicated here.
Linking the febrl4 datasets. As above, these datasets are from febrl, replicated here.
Cookbook of various Splink techniques
Interactive comparison playground
Investigating Bias in a Splink Model
Entity type: Financial transactions
Linking financial transactions
PySpark examples
Deduplication of a small dataset using PySpark. Entity type is persons.
Athena examples
Deduplicating 50,000 records of realistic data based on historical persons
SQLite examples
Deduplicating 50,000 records of realistic data based on historical persons