Methods in Linker.table_management¶
Register Splink tables against your database backend and manage the Splink cache.
Accessed via `linker.table_management`.
compute_tf_table(column_name)¶
Compute a term frequency table for a given column and persist it to the database.
This method is useful if you want to pre-compute term frequency tables, e.g. so that real-time linkage executes faster, or so that you can estimate various models without having to recompute term frequency tables each time.
Examples:
Real time linkage
```py
linker = Linker(df, settings="saved_settings.json", db_api=db_api)
linker.table_management.compute_tf_table("surname")
linker.compare_two_records(record_left, record_right)
```
Pre-computed term frequency tables
```py
linker = Linker(df, db_api)
df_first_name_tf = linker.table_management.compute_tf_table("first_name")
df_first_name_tf.write.parquet("folder/first_name_tf")
# On subsequent data linking job, read this table rather than recompute
df_first_name_tf = pd.read_parquet("folder/first_name_tf")
linker.table_management.register_term_frequency_lookup(
df_first_name_tf, "first_name"
)
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`column_name` | `str` | The column name in the input table | *required* |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | The resultant table as a Splink data frame |
invalidate_cache()¶
Invalidate the Splink cache. Any previously-computed tables will be recomputed. This is useful, for example, if the input data tables have changed.
register_table_input_nodes_concat_with_tf(input_data, overwrite=False)¶
Register a pre-computed version of the input_nodes_concat_with_tf table that you want to re-use, e.g. one that you created in a previous run.
This method allows you to register this table in the Splink cache so it will be used rather than Splink computing the table anew.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_data` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline |
register_table_predict(input_data, overwrite=False)¶
Register a pre-computed version of the prediction table for use in Splink.
This method allows you to register a pre-computed prediction table in the Splink cache so it will be used rather than Splink computing the table anew.
Examples:
```py
predict_df = pd.read_parquet("path/to/predict_df.parquet")
predict_as_splinkdataframe = linker.table_management.register_table_predict(predict_df)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predict_as_splinkdataframe, threshold_match_probability=0.75
)
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_data` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline. |
register_term_frequency_lookup(input_data, col_name, overwrite=False)¶
Register a pre-computed term frequency lookup table for a given column.
This method allows you to register a term frequency table in the Splink cache for a specific column. This table will then be used during linkage rather than computing the term frequency table anew from your input data.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_data` | `AcceptableInputTableType` | The data representing the term frequency table. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe. | *required* |
`col_name` | `str` | The name of the column for which the term frequency lookup table is being registered. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the registered term frequency table. |
Examples:
```py
tf_table = [
    {"first_name": "theodore", "tf_first_name": 0.012},
    {"first_name": "alfie", "tf_first_name": 0.013},
]
tf_df = pd.DataFrame(tf_table)
linker.table_management.register_term_frequency_lookup(
    tf_df,
    "first_name"
)
```
register_table(input_table, table_name, overwrite=False)¶
Register a table with your backend database, to be used in one of the Splink methods, or simply to allow querying.
Tables can be of type: dictionary, record-level dictionary, pandas dataframe, pyarrow table or, in the Spark case, a Spark dataframe.
Examples:
```py
test_dict = {"a": [666, 777, 888], "b": [4, 5, 6]}
linker.table_management.register_table(test_dict, "test_dict")
linker.query_sql("select * from test_dict")
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_table` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe. | *required* |
`table_name` | `str` | The name you wish to assign to the table. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline |