Skip to content

Retrieving and Querying Splink results with the SplinkDataFrame

Splink returns tables of results using a class called a SplinkDataFrame. e.g. when you run df_predict = linker.predict() df_predict is a SplinkDataFrame

A SplinkDataFrame is a abstraction Splink's results, which under the hood are a table in the underlying database.

It's possible to convert a SplinkDataFrame into a Pandas dataframe using splink_df.as_pandas_dataframe(). However, this is not recommended because Splink results can be very large, so converting them into pandas can be slow and result in out of memory errors.

You can find out the name of the table in the underlying database using df_predict.physical_name. This enables you to run SQL queries directly against the results.

You can execute queries using linker.query_sql.

This is the recommended approach as it's typically faster and more memory efficient than using pandas dataframes.

The following is an example of this approach, in which we use SQL to find the best match to each input record in a link_type="link_only" job (i.e remove duplicate matches):

# Linker is a duckdb linker with link_type set to "link_only"
results = linker.predict(threshold_match_probability=0.75)

sql = f"""
with ranked as

(
select *,
row_number() OVER (
    PARTITION BY unique_id_l order by match_weight desc
    ) as row_number
from {results.physical_name}
)

select *
from ranked
where row_number = 1


"""
linker.query_sql(sql)

Note that linker.query_sql will return a pandas dataframe by default, but you can instead return a SplinkDataFrame as follows:

linker.query_sql(sql, output_type='splink_df')