Documentation for SplinkDataFrame

Bases: ABC

Abstraction over a dataframe that handles basic operations such as retrieving data and retrieving column names, which need different implementations depending on the backend (for example a Spark dataframe or a SQLite table). Use methods such as as_pandas_dataframe() and as_record_dict() to retrieve data.
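To make the idea of a backend-agnostic wrapper concrete, here is a minimal sketch built on Python's standard-library sqlite3 module. It is not Splink's implementation: the class names `TableHandle` and `SQLiteTableHandle`, the `columns` property, and the `edges` table are all hypothetical, chosen only to illustrate how one abstract interface can sit on top of a backend-specific table and how a `limit` argument can map onto a SQL LIMIT.

```python
from __future__ import annotations

import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Optional


class TableHandle(ABC):
    """Hypothetical backend-agnostic wrapper around a stored table."""

    def __init__(self, physical_name: str):
        self.physical_name = physical_name

    @property
    @abstractmethod
    def columns(self) -> list[str]:
        """Column names, retrieved in a backend-specific way."""

    @abstractmethod
    def as_record_dict(self, limit: Optional[int] = None) -> list[dict[str, Any]]:
        """Rows as dictionaries, optionally capped at `limit` rows."""


class SQLiteTableHandle(TableHandle):
    """SQLite flavour of the hypothetical wrapper."""

    def __init__(self, physical_name: str, con: sqlite3.Connection):
        super().__init__(physical_name)
        self.con = con

    @property
    def columns(self) -> list[str]:
        # LIMIT 0 fetches no rows but still populates cursor.description.
        cur = self.con.execute(f'SELECT * FROM "{self.physical_name}" LIMIT 0')
        return [d[0] for d in cur.description]

    def as_record_dict(self, limit: Optional[int] = None) -> list[dict[str, Any]]:
        sql = f'SELECT * FROM "{self.physical_name}"'
        params: tuple = ()
        if limit is not None:
            # limit=None means "all rows"; otherwise append a LIMIT clause.
            sql += " LIMIT ?"
            params = (limit,)
        cur = self.con.execute(sql, params)
        names = [d[0] for d in cur.description]
        return [dict(zip(names, row)) for row in cur.fetchall()]


# Usage: an in-memory table standing in for a predictions table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (id_l INTEGER, id_r INTEGER, score REAL)")
con.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [(1, 2, 0.9), (1, 3, 0.4), (2, 3, 0.7)],
)
handle = SQLiteTableHandle("edges", con)
all_rows = handle.as_record_dict()
first_two = handle.as_record_dict(limit=2)
```

A Spark or DuckDB variant would subclass the same abstract base and override only the backend-specific parts, so calling code never needs to know which engine holds the data.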

Return the dataframe as a DuckDBPyRelation. Only available when using the DuckDB backend.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| limit | int | If provided, return this number of rows (equivalent to a limit statement in SQL). Defaults to None, meaning return all rows. | None |

Returns:

| Type | Description |
|---|---|
| DuckDBPyRelation | duckdb.DuckDBPyRelation: A DuckDBPyRelation object |

Return the dataframe as a pandas dataframe.

This can be computationally expensive if the dataframe is large.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| limit | int | If provided, return this number of rows (equivalent to a limit statement in SQL). Defaults to None, meaning return all rows. | None |

Examples:

df_predict = linker.inference.predict()
df_ten_edges = df_predict.as_pandas_dataframe(10)

Return the dataframe as a list of record dictionaries.

This can be computationally expensive if the dataframe is large.

Examples:

df_predict = linker.inference.predict()
ten_edges = df_predict.as_record_dict(10)

Returns:

| Name | Type | Description |
|---|---|---|
| list | list[dict[str, Any]] | A list of records, each of which is a dictionary |

Return the dataframe as a spark dataframe. Only available when using the Spark backend.

Returns:

| Type | Description |
|---|---|
| 'SparkDataFrame' | spark.DataFrame: A Spark DataFrame |

Drops the table from the underlying database, and removes it from the (linker) cache.

By default this will fail if the table is not one created by Splink, but this check can be overridden.

Examples:

df_predict = linker.inference.predict()
df_predict.drop_table_from_database_and_remove_from_cache()
# predictions table no longer in the database / cache
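The guard described above (refuse to drop a table unless it was created by the tool, with an explicit override) can be sketched with the standard-library sqlite3 module. This is an illustration under stated assumptions, not Splink's code: the `__splink__` name prefix used to recognise tool-created tables, the `force` flag, the dictionary cache, and the helper names are all hypothetical.

```python
import sqlite3

# Assumption for this sketch: tool-created tables share a recognisable prefix.
MANAGED_PREFIX = "__splink__"


def table_exists(con: sqlite3.Connection, name: str) -> bool:
    """Return True if a table with this name exists in the SQLite database."""
    row = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def drop_table(con: sqlite3.Connection, cache: dict, name: str, force: bool = False) -> None:
    """Drop `name` from the database and the cache, guarding unmanaged tables."""
    if not force and not name.startswith(MANAGED_PREFIX):
        raise ValueError(
            f"Refusing to drop {name!r}: it does not look like a managed table. "
            "Pass force=True to override."
        )
    con.execute(f'DROP TABLE IF EXISTS "{name}"')
    # Remove the cache entry too, so stale results are never reused.
    cache.pop(name, None)


# Usage: one managed table and one user table.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "__splink__df_predict" (x INTEGER)')
con.execute('CREATE TABLE "customers" (x INTEGER)')
cache = {"__splink__df_predict": object(), "customers": object()}

drop_table(con, cache, "__splink__df_predict")  # allowed: managed table

try:
    drop_table(con, cache, "customers")  # blocked: not a managed table
    refused = False
except ValueError:
    refused = True
```

Dropping the database table and the cache entry together keeps the two in step; removing only one would leave the cache pointing at a table that no longer exists.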

Save the dataframe in csv format.

Examples:

df_predict = linker.inference.predict()
df_predict.to_csv("model_predictions.csv", overwrite=True)
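The `overwrite=True` argument in the example suggests that writing to an existing path is otherwise guarded. How Splink behaves when `overwrite` is left at its default is not stated here, so the following is only a generic sketch of the pattern using the standard-library csv module. The function `write_records_to_csv` and its error behaviour are hypothetical, and the input is a list of record dictionaries of the kind as_record_dict() is documented to return.

```python
import csv
import os
import tempfile


def write_records_to_csv(records, path, overwrite=False):
    """Write a list of record dictionaries to `path` as CSV.

    Hypothetical helper: raises FileExistsError if the file is already
    present and overwrite is False.
    """
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
    fieldnames = list(records[0].keys()) if records else []
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


# Usage: write once, then show that a second write needs overwrite=True.
records = [{"id_l": 1, "id_r": 2, "match_probability": 0.9}]
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "model_predictions.csv")

write_records_to_csv(records, out_path)

try:
    write_records_to_csv(records, out_path)  # file exists, overwrite not set
    blocked = False
except FileExistsError:
    blocked = True

write_records_to_csv(records, out_path, overwrite=True)  # replaces the file
```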

Save the dataframe in parquet format.

Examples:

df_predict = linker.inference.predict()
df_predict.to_parquet("model_predictions.parquet", overwrite=True)