Link type: Linking, Deduping or Both

Splink allows data to be linked, deduplicated or both.

Linking refers to finding links between datasets, whereas deduplication refers to finding links within a single dataset.

Data linking is therefore only meaningful when more than one dataset is provided.

This guide shows how to specify the settings dictionary and initialise the linker for the three link types.
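To make the distinction between the three link types concrete, the sketch below enumerates which record pairs each one would consider. It is a conceptual illustration in plain Python, not how Splink generates comparisons internally; the function name `candidate_pairs` and the toy dataset names are assumptions made for this example.

```python
from itertools import combinations, product


def candidate_pairs(datasets, link_type):
    """Enumerate the record pairs that each link type considers.

    `datasets` maps a dataset name to a list of record ids.
    Hypothetical helper for illustration only; not part of Splink.
    """
    # Pairs of records drawn from the same dataset (deduplication)
    within = [
        ((name, a), (name, b))
        for name, ids in datasets.items()
        for a, b in combinations(ids, 2)
    ]
    # Pairs of records drawn from two different datasets (linking)
    between = [
        ((left, a), (right, b))
        for left, right in combinations(datasets, 2)
        for a, b in product(datasets[left], datasets[right])
    ]
    if link_type == "dedupe_only":
        return within
    if link_type == "link_only":
        return between
    if link_type == "link_and_dedupe":
        return within + between
    raise ValueError(f"Unknown link_type: {link_type!r}")


single = {"table_1": [1, 2, 3]}
several = {"table_1": [1, 2], "table_2": [1, 2, 3]}

print(len(candidate_pairs(single, "dedupe_only")))       # within-dataset pairs only
print(len(candidate_pairs(several, "link_only")))        # between-dataset pairs only
print(len(candidate_pairs(several, "link_and_dedupe")))  # both kinds of pair
```

With a single dataset there are no between-dataset pairs at all, which is why linking needs at least two inputs.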

Deduplication

The dedupe_only link type expects the user to provide a single input table, and is specified as follows:

# DuckDB
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "dedupe_only",
    # etc.
}
linker = DuckDBLinker(df, settings)

# Spark
from splink.spark.linker import SparkLinker

settings = {
    "link_type": "dedupe_only",
    # etc.
}
linker = SparkLinker(df, settings)

# Athena
from splink.athena.linker import AthenaLinker

settings = {
    "link_type": "dedupe_only",
    # etc.
}
linker = AthenaLinker(df, settings)

# SQLite
from splink.sqlite.linker import SQLiteLinker

settings = {
    "link_type": "dedupe_only",
    # etc.
}
linker = SQLiteLinker(df, settings)

Link only

The link_only link type expects the user to provide a list of input tables, and is specified as follows:

# DuckDB
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_only",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = DuckDBLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

# Spark
from splink.spark.linker import SparkLinker

settings = {
    "link_type": "link_only",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = SparkLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

# Athena
from splink.athena.linker import AthenaLinker

settings = {
    "link_type": "link_only",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = AthenaLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

# SQLite
from splink.sqlite.linker import SQLiteLinker

settings = {
    "link_type": "link_only",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = SQLiteLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

The input_table_aliases argument is optional and is used to label the tables in the outputs. If it is not provided, Splink chooses default labels automatically.

Link and dedupe

The link_and_dedupe link type expects the user to provide a list of input tables, and is specified as follows:

# DuckDB
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_and_dedupe",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = DuckDBLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

# Spark
from splink.spark.linker import SparkLinker

settings = {
    "link_type": "link_and_dedupe",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = SparkLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

# Athena
from splink.athena.linker import AthenaLinker

settings = {
    "link_type": "link_and_dedupe",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = AthenaLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

# SQLite
from splink.sqlite.linker import SQLiteLinker

settings = {
    "link_type": "link_and_dedupe",
    # etc.
}

input_aliases = ["table_1", "table_2", "table_3"]
linker = SQLiteLinker([df_1, df_2, df_3], settings, input_table_aliases=input_aliases)

The input_table_aliases argument is optional and is used to label the tables in the outputs. If it is not provided, Splink chooses default labels automatically.
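The expectations described in this guide (a single table for dedupe_only, a list of tables for the other two link types, and one alias per table when aliases are given) can be checked before a linker is constructed. The sketch below is a hypothetical pre-flight helper in plain Python. `check_inputs` is not part of Splink, and the rules it enforces are assumptions drawn from the descriptions above rather than from Splink's own validation.

```python
VALID_LINK_TYPES = {"dedupe_only", "link_only", "link_and_dedupe"}


def check_inputs(tables, link_type, input_table_aliases=None):
    """Check that the inputs are consistent with the chosen link type.

    Hypothetical helper for illustration; Splink does not provide it.
    Returns the number of input tables if all checks pass.
    """
    if link_type not in VALID_LINK_TYPES:
        raise ValueError(f"Unknown link_type: {link_type!r}")

    # Accept either a single table or a list/tuple of tables
    if not isinstance(tables, (list, tuple)):
        tables = [tables]
    n_tables = len(tables)

    if link_type == "dedupe_only" and n_tables != 1:
        raise ValueError("dedupe_only expects a single input table")
    if link_type != "dedupe_only" and n_tables < 2:
        raise ValueError(f"{link_type} expects more than one input table")

    # Aliases are optional, but if given there should be one per table
    if input_table_aliases is not None and len(input_table_aliases) != n_tables:
        raise ValueError("Provide exactly one alias per input table")

    return n_tables


# Strings stand in for dataframes here; any table object would do
check_inputs("df", "dedupe_only")
check_inputs(
    ["df_1", "df_2", "df_3"],
    "link_only",
    input_table_aliases=["table_1", "table_2", "table_3"],
)
```

Running a check like this first surfaces a mismatch between the link type and the inputs as a clear error message, rather than leaving it to be discovered later in the linkage job.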