Methods in Linker.table_management¶
Register Splink tables against your database backend and manage the Splink cache.
Accessed via `linker.table_management`.
compute_tf_table(column_name)¶
Compute a term frequency table for a given column and persist it to the database.
This method is useful if you want to pre-compute term frequency tables, e.g. so that real-time linkage executes faster, or so that you can estimate various models without having to recompute term frequency tables each time.
Examples:
Real time linkage
```py
linker = Linker(df, settings="saved_settings.json", db_api=db_api)
linker.table_management.compute_tf_table("surname")
linker.compare_two_records(record_left, record_right)
```
Pre-computed term frequency tables
```py
linker = Linker(df, db_api)
df_first_name_tf = linker.table_management.compute_tf_table("first_name")
df_first_name_tf.write.parquet("folder/first_name_tf")
# On subsequent data linking job, read this table rather than recompute
df_first_name_tf = pd.read_parquet("folder/first_name_tf")
linker.table_management.register_term_frequency_lookup(
df_first_name_tf, "first_name"
)
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`column_name` | `str` | The column name in the input table | *required* |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | The resultant table as a Splink data frame |
invalidate_cache()¶
Invalidate the Splink cache. Any previously-computed tables will be recomputed. This is useful, for example, if the input data tables have changed.
register_table_input_nodes_concat_with_tf(input_data, overwrite=False)¶
Register a pre-computed version of the input_nodes_concat_with_tf table that you want to re-use, e.g. one that you created in a previous run.
This method allows you to register this table in the Splink cache so it will be used rather than Splink computing the table anew.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_data` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline |
register_table_predict(input_data, overwrite=False)¶
Register a pre-computed version of the prediction table for use in Splink.
This method allows you to register a pre-computed prediction table in the Splink cache so it will be used rather than Splink computing the table anew.
Examples:
```py
predict_df = pd.read_parquet("path/to/predict_df.parquet")
predict_as_splinkdataframe = linker.table_management.register_table_predict(predict_df)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predict_as_splinkdataframe, threshold_match_probability=0.75
)
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_data` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline. |
register_term_frequency_lookup(input_data, col_name, overwrite=False)¶
Register a pre-computed term frequency lookup table for a given column.
This method allows you to register a term frequency table in the Splink cache for a specific column. This table will then be used during linkage rather than computing the term frequency table anew from your input data.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_data` | `AcceptableInputTableType` | The data representing the term frequency table. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe. | *required* |
`col_name` | `str` | The name of the column for which the term frequency lookup table is being registered. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the registered term frequency table. |
Examples:
```py
tf_table = [
    {"first_name": "theodore", "tf_first_name": 0.012},
    {"first_name": "alfie", "tf_first_name": 0.013},
]
tf_df = pd.DataFrame(tf_table)
linker.table_management.register_term_frequency_lookup(
    tf_df,
    "first_name"
)
```
register_table(input_table, table_name, overwrite=False)¶
Register a table with your backend database, to be used in one of the Splink methods, or simply to allow querying.
Tables can be of type: dictionary, record-level dictionary, pandas dataframe, pyarrow table or, in the Spark case, a Spark dataframe.
Examples:
```py
test_dict = {"a": [666, 777, 888], "b": [4, 5, 6]}
linker.table_management.register_table(test_dict, "test_dict")
linker.query_sql("select * from test_dict")
```
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input_table` | `AcceptableInputTableType` | The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe. | *required* |
`table_name` | `str` | The name you wish to assign to the table. | *required* |
`overwrite` | `bool` | Overwrite the table in the underlying database if it exists. | `False` |
Returns:

Name | Type | Description |
---|---|---|
SplinkDataFrame | `SplinkDataFrame` | An abstraction representing the table created by the SQL pipeline |