Methods in Linker.table_management

Register Splink tables against your database backend and manage the Splink cache. Accessed via linker.table_management.

compute_tf_table(column_name)

Compute a term frequency table for a given column and persist it to the database.

This method is useful if you want to pre-compute term frequency tables, for example so that real-time linkage executes faster, or so that you can estimate several models without recomputing the term frequency tables each time.

Examples:

Real time linkage
```py
linker = Linker(df, settings="saved_settings.json", db_api=db_api)
linker.table_management.compute_tf_table("surname")
linker.inference.compare_two_records(record_left, record_right)
```
Pre-computed term frequency tables
```py
import pandas as pd

linker = Linker(df, settings="saved_settings.json", db_api=db_api)
df_first_name_tf = linker.table_management.compute_tf_table("first_name")
# Save the table, here via pandas, so that a later job can reuse it
df_first_name_tf.as_pandas_dataframe().to_parquet("folder/first_name_tf")

# On a subsequent data linking job, read this table rather than recompute it
df_first_name_tf = pd.read_parquet("folder/first_name_tf")
linker.table_management.register_term_frequency_lookup(
    df_first_name_tf, "first_name"
)
```
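
To make concrete what a term frequency table holds, here is a minimal plain-Python sketch. It does not use Splink: the helper `term_frequencies` and the sample records are hypothetical, and the sketch assumes that each frequency is the share of non-null records holding that value, stored in a column named `tf_<column_name>` as in the `register_term_frequency_lookup` example on this page.

```python
from collections import Counter


def term_frequencies(records, column_name):
    """Return one row per distinct value with its relative frequency.

    Assumption: frequencies are taken over records whose value is not None.
    """
    values = [r[column_name] for r in records if r.get(column_name) is not None]
    counts = Counter(values)
    total = len(values)
    return [
        {column_name: value, f"tf_{column_name}": count / total}
        for value, count in sorted(counts.items())
    ]


# Hypothetical input records; the null value is ignored
records = [
    {"first_name": "alfie"},
    {"first_name": "alfie"},
    {"first_name": "theodore"},
    {"first_name": None},
]
tf_rows = term_frequencies(records, "first_name")
for row in tf_rows:
    print(row)
```

Rarer values receive smaller frequencies, which is the information a linkage model can draw on when a column has term frequency adjustments enabled.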

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `column_name` | `str` | The column name in the input table. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `SplinkDataFrame` | `SplinkDataFrame` | The resultant table as a Splink data frame. |

invalidate_cache()

Invalidate the Splink cache. Any previously-computed tables will be recomputed. This is useful, for example, if the input data tables have changed.
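
The effect of invalidating a cache can be illustrated with a toy model. The class `ToyTableCache` below is hypothetical and is not how Splink implements its cache; it only shows why a cached result goes stale when the input changes, and why clearing the cache forces a recomputation.

```python
class ToyTableCache:
    """Hypothetical stand-in for a table cache (not Splink's implementation)."""

    def __init__(self):
        self._tables = {}
        self.compute_calls = 0

    def get_or_compute(self, name, compute):
        # Only run the (possibly expensive) computation on a cache miss
        if name not in self._tables:
            self.compute_calls += 1
            self._tables[name] = compute()
        return self._tables[name]

    def invalidate(self):
        # Forget every cached table so the next request recomputes it
        self._tables.clear()


input_rows = ["alfie", "theodore"]
cache = ToyTableCache()

first = cache.get_or_compute("row_count", lambda: len(input_rows))
input_rows.append("oscar")  # the input data changes
stale = cache.get_or_compute("row_count", lambda: len(input_rows))

cache.invalidate()
fresh = cache.get_or_compute("row_count", lambda: len(input_rows))
print(first, stale, fresh)
```

In this toy, the second lookup still returns the old count because the cache does not know that the input changed; only after `invalidate()` is the value recomputed.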

register_table_input_nodes_concat_with_tf(input_data, overwrite=False)

Register a pre-computed version of the input_nodes_concat_with_tf table that you want to re-use, for example one that you created in a previous run.

This method allows you to register this table in the Splink cache so it will be used rather than Splink computing this table anew.
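
As an illustration of what such a table might contain, the sketch below attaches a term frequency column to each input record in plain Python. It assumes, based on the table's name, that the table holds the concatenated input records with `tf_` columns joined on; the helper `concat_with_tf` and the sample data are hypothetical and are not part of Splink.

```python
def concat_with_tf(records, tf_rows, column_name):
    """Left-join a term frequency lookup onto input records.

    Records whose value is missing from the lookup get None.
    """
    tf_column = f"tf_{column_name}"
    lookup = {row[column_name]: row[tf_column] for row in tf_rows}
    return [{**r, tf_column: lookup.get(r.get(column_name))} for r in records]


# Hypothetical input records and term frequency lookup
records = [
    {"unique_id": 1, "first_name": "alfie"},
    {"unique_id": 2, "first_name": "theodore"},
    {"unique_id": 3, "first_name": None},
]
tf_rows = [
    {"first_name": "alfie", "tf_first_name": 0.013},
    {"first_name": "theodore", "tf_first_name": 0.012},
]

rows_with_tf = concat_with_tf(records, tf_rows, "first_name")
for row in rows_with_tf:
    print(row)
```

Because building a table of this kind touches every input record, reusing a saved copy from an earlier run can avoid repeating that work.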

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_data` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe. | required |
| `overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `SplinkDataFrame` | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline. |

register_table_predict(input_data, overwrite=False)

Register a pre-computed version of the prediction table for use in Splink.

This method allows you to register a pre-computed prediction table in the Splink cache so it will be used rather than Splink computing the table anew.

Examples:

```py
import pandas as pd

predict_df = pd.read_parquet("path/to/predict_df.parquet")
predict_as_splinkdataframe = linker.table_management.register_table_predict(predict_df)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predict_as_splinkdataframe, threshold_match_probability=0.75
)
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_data` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe. | required |
| `overwrite` | `bool` | Overwrite the table in the underlying database if it exists. Defaults to False. | `False` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline. |

register_term_frequency_lookup(input_data, col_name, overwrite=False)

Register a pre-computed term frequency lookup table for a given column.

This method allows you to register a term frequency table in the Splink cache for a specific column. This table will then be used during linkage rather than computing the term frequency table anew from your input data.
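
Before registering a lookup, it can help to check its shape. The example for this method uses one column holding the value (`first_name`) and one holding its frequency (`tf_first_name`). The sketch below assumes that `<col_name>` plus `tf_<col_name>` naming and that frequencies lie in (0, 1]; the helper `check_tf_lookup` is hypothetical and not part of Splink.

```python
def check_tf_lookup(rows, col_name):
    """Return a list of problems found in a term frequency lookup.

    Assumes rows are dicts with keys col_name and "tf_<col_name>".
    """
    tf_col = f"tf_{col_name}"
    problems = []
    for i, row in enumerate(rows):
        if col_name not in row or tf_col not in row:
            problems.append(f"row {i}: expected keys {col_name!r} and {tf_col!r}")
            continue
        tf = row[tf_col]
        if not isinstance(tf, (int, float)) or not 0 < tf <= 1:
            problems.append(f"row {i}: {tf_col} should be a number in (0, 1]")
    return problems


good = [
    {"first_name": "theodore", "tf_first_name": 0.012},
    {"first_name": "alfie", "tf_first_name": 0.013},
]
bad = [
    {"first_name": "alfie", "tf": 0.5},  # wrong column name
    {"first_name": "theodore", "tf_first_name": 1.5},  # not a valid share
]

print(check_tf_lookup(good, "first_name"))
print(check_tf_lookup(bad, "first_name"))
```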

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_data` | `AcceptableInputTableType` | The data representing the term frequency table. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe. | required |
| `col_name` | `str` | The name of the column for which the term frequency lookup table is being registered. | required |
| `overwrite` | `bool` | Overwrite the table in the underlying database if it exists. Defaults to False. | `False` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| | `SplinkDataFrame` | An abstraction representing the registered term frequency table. |

Examples:

```py
import pandas as pd

tf_table = [
    {"first_name": "theodore", "tf_first_name": 0.012},
    {"first_name": "alfie", "tf_first_name": 0.013},
]
tf_df = pd.DataFrame(tf_table)
linker.table_management.register_term_frequency_lookup(tf_df, "first_name")
```

register_table(input_table, table_name, overwrite=False)

Register a table with your backend database, either for use in one of the Splink methods or simply to make it available for querying.

Tables can be a dictionary, a record-level dictionary, a pandas dataframe, a pyarrow table or, when using the Spark backend, a Spark dataframe.

Examples:

```py
test_dict = {"a": [666, 777, 888], "b": [4, 5, 6]}
linker.table_management.register_table(test_dict, "test_dict")
linker.misc.query_sql("select * from test_dict")
```
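
The difference between a dictionary and a record-level dictionary may be easier to see side by side. The sketch below assumes that "dictionary" means a mapping of column names to lists of values (as in `test_dict`) and that "record-level dictionary" means a list with one dict per row; the helper `columns_to_records` is hypothetical and only converts between the two shapes.

```python
def columns_to_records(column_dict):
    """Convert {"col": [values, ...]} into [{"col": value, ...}, ...]."""
    lengths = {len(values) for values in column_dict.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    names = list(column_dict)
    return [dict(zip(names, row)) for row in zip(*column_dict.values())]


# Column-oriented dictionary, as in the example above
test_dict = {"a": [666, 777, 888], "b": [4, 5, 6]}

# Record-level form of the same data: one dict per row
test_records = columns_to_records(test_dict)
print(test_records)
```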

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_table` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe. | required |
| `table_name` | `str` | The name you wish to assign to the table. | required |
| `overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `SplinkDataFrame` | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline. |