Splink
Clerical Labelling
This page is under construction - check back soon!