Getting Started¶
Install¶
Splink supports Python 3.8+.
To obtain the latest released version of Splink you can install from PyPI using pip:
pip install splink
Or, if you prefer, you can install Splink using conda:
conda install -c conda-forge splink
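To confirm that the install worked, you can query the installed version from Python. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and is not part of Splink.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report whether Splink is visible in the current environment.
found = installed_version("splink")
print(f"splink version: {found}" if found else "splink is not installed")
```

Running this inside the environment where you ran pip or conda shows whether that environment can see the package.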
Backend Specific Installs¶
From Splink v3.9.7, packages required by specific Splink backends can be optionally installed by adding the [<backend>] suffix to the package name in your pip install command.
Note that SQLite and DuckDB come packaged with Splink and do not need to be installed separately.
The following backends are supported:
pip install 'splink[spark]'
pip install 'splink[athena]'
pip install 'splink[postgres]'
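After installing an extra, it can be useful to check whether the backend's dependency is importable. The sketch below assumes each extra provides one well-known module (`pyspark`, `awswrangler`, `psycopg2`). That mapping is an assumption made for illustration and may not match the packages a given Splink release actually pulls in, and the helper names are hypothetical.

```python
from importlib.util import find_spec

# Assumed mapping from Splink extra to a top-level module it is expected
# to make importable. Adjust if your Splink version differs.
BACKEND_MODULES = {
    "spark": "pyspark",
    "athena": "awswrangler",
    "postgres": "psycopg2",
}


def module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def available_backends(modules=BACKEND_MODULES):
    """Map each backend name to whether its assumed module is installed."""
    return {backend: module_available(name) for backend, name in modules.items()}


for backend, ok in available_backends().items():
    print(f"{backend}: {'available' if ok else 'missing'}")
```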
DuckDB-less Installation¶
Should you be unable to install DuckDB on your local machine, you can still run Splink without the DuckDB dependency using a small workaround.
To start, install the latest released version of Splink from PyPI without any dependencies using:
pip install splink --no-deps
Then, to install the remaining requirements, download the following requirements file from our GitHub repository using:
github_url="https://raw.githubusercontent.com/moj-analytical-services/splink/master/scripts/duckdbless_requirements.txt"
output_file="splink_requirements.txt"
# Download the file from GitHub using curl
curl -o "$output_file" "$github_url"
Alternatively, if you cannot download the file directly from GitHub or would rather create it manually:
- Create a file called splink_requirements.txt
- Copy and paste the contents of our duckdbless requirements file into it.
Finally, run the following command within your virtual environment to install the remaining Splink dependencies:
pip install -r splink_requirements.txt
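If curl is not available, the download step above can be approximated in Python using only the standard library. This is a sketch rather than an official script: the function name `download_file` is hypothetical, and the URL and output file name are the ones given in the curl example.

```python
import shutil
import urllib.request
from pathlib import Path

GITHUB_URL = (
    "https://raw.githubusercontent.com/moj-analytical-services/"
    "splink/master/scripts/duckdbless_requirements.txt"
)
OUTPUT_FILE = "splink_requirements.txt"


def download_file(url: str, destination: str) -> Path:
    """Stream the contents of `url` into `destination` and return its path."""
    target = Path(destination)
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Usage (requires network access):
# download_file(GITHUB_URL, OUTPUT_FILE)
```

Once the file is saved, the pip install -r command works exactly as it does with the curl download.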
Quickstart¶
To get a basic Splink model up and running, use the following code. It demonstrates how to:
- Estimate the parameters of a deduplication model
- Use the parameter estimates to identify duplicate records
- Use clustering to generate an estimated unique person ID.
For a more detailed tutorial, see the Tutorials section below.
Simple Splink Model Example
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets

# Load a built-in example dataset
df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
}

linker = DuckDBLinker(df, settings)

# Estimate the u probabilities from randomly sampled record pairs
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the remaining parameters with expectation maximisation,
# running two training sessions with different blocking rules
blocking_rule_for_training = block_on(["first_name", "surname"])
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training, estimate_without_term_frequencies=True
)

blocking_rule_for_training = block_on("substr(dob, 1, 4)")  # block on year
linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule_for_training, estimate_without_term_frequencies=True
)

# Score candidate pairs, then cluster them using a 0.95 match probability threshold
pairwise_predictions = linker.predict()
clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)
clusters.as_pandas_dataframe(limit=5)
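To give an intuition for the final clustering step, the sketch below groups records into connected components using only the pairs whose match probability meets a threshold. It is a plain-Python illustration of the idea, not Splink's implementation. The function name, the tuple layout (left id, right id, match probability), the inclusive `>=` comparison, and the choice of the smallest id as the cluster label are all assumptions made for the example.

```python
def cluster_at_threshold(pairs, threshold):
    """Group record ids into clusters via union-find.

    `pairs` is an iterable of (id_left, id_right, match_probability).
    Returns a dict mapping each record id to its cluster label, where the
    label is the smallest record id in that cluster.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root,
        # shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, probability in pairs:
        root_l, root_r = find(left), find(right)
        if probability >= threshold and root_l != root_r:
            # Attach the larger root to the smaller one so the smallest id
            # always ends up as the cluster label.
            parent[max(root_l, root_r)] = min(root_l, root_r)

    return {record: find(record) for record in list(parent)}


example_pairs = [
    (1, 2, 0.99),
    (2, 3, 0.97),
    (4, 5, 0.50),  # below the threshold, so 4 and 5 stay separate
    (6, 7, 0.96),
]
print(cluster_at_threshold(example_pairs, 0.95))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 5, 6: 6, 7: 6}
```

Records 1 and 3 are never compared directly, yet they share a cluster because both link to record 2. This transitive grouping is what turns pairwise scores into an estimated unique person ID.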
Tutorials¶
You can learn more about Splink in the step-by-step tutorial.
Videos¶
Example Notebooks¶
You can see end-to-end examples of several use cases in the example notebooks, which can also be run interactively via Binder.
Charts Gallery¶
You can see all of the interactive charts provided in Splink by checking out the Charts Gallery.