# Documentation for SplinkDataFrame

Bases: `ABC`
Abstraction over a dataframe to handle basic operations, such as retrieving data and retrieving column names, which need different implementations depending on whether the underlying table is a Spark dataframe, a SQLite table, etc.

Use methods like `as_pandas_dataframe()` and `as_record_dict()` to retrieve data.
## as_duckdbpyrelation(limit=None)

Return the dataframe as a `duckdb.DuckDBPyRelation`. Only available when using the DuckDB backend.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `limit` | `int` | If provided, return this number of rows (equivalent to a `LIMIT` clause in SQL). Defaults to `None`, meaning return all rows. | `None` |

Returns:

| Type | Description |
|---|---|
| `DuckDBPyRelation` | A `duckdb.DuckDBPyRelation` object |
## as_pandas_dataframe(limit=None)

Return the dataframe as a pandas dataframe.

This can be computationally expensive if the dataframe is large.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `limit` | `int` | If provided, return this number of rows (equivalent to a `LIMIT` clause in SQL). Defaults to `None`, meaning return all rows. | `None` |

Examples:

```python
df_predict = linker.inference.predict()
df_ten_edges = df_predict.as_pandas_dataframe(10)
```
## as_record_dict(limit=None)

Return the dataframe as a list of record dictionaries.

This can be computationally expensive if the dataframe is large.

Examples:

```python
df_predict = linker.inference.predict()
ten_edges = df_predict.as_record_dict(10)
```

Returns:

| Name | Type | Description |
|---|---|---|
| `list` | `list[dict[str, Any]]` | A list of records, each of which is a dictionary |
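Because the return value is a plain `list` of `dict`s, ordinary Python tooling applies directly, with no dataframe library needed. A sketch using hand-written records as a stand-in for real `as_record_dict()` output (the keys shown are illustrative):

```python
# Illustrative records of the shape as_record_dict() returns:
# one dict per row, keyed by column name
records = [
    {"id_l": "a", "id_r": "b", "match_probability": 0.99},
    {"id_l": "c", "id_r": "d", "match_probability": 0.40},
]

# Filter and project with ordinary comprehensions
high_conf = [r for r in records if r["match_probability"] > 0.9]
pairs = [(r["id_l"], r["id_r"]) for r in high_conf]
print(pairs)  # → [('a', 'b')]
```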
## as_spark_dataframe()

Return the dataframe as a Spark dataframe. Only available when using the Spark backend.

Returns:

| Type | Description |
|---|---|
| `SparkDataFrame` | A `spark.DataFrame` object |
## drop_table_from_database_and_remove_from_cache(force_non_splink_table=False)

Drops the table from the underlying database and removes it from the (linker) cache.

By default this will fail if the table is not one created by Splink, but this check can be overridden.

Examples:

```python
df_predict = linker.inference.predict()
df_predict.drop_table_from_database_and_remove_from_cache()
# predictions table no longer in the database / cache
```
## to_csv(filepath, overwrite=False)

Save the dataframe in csv format.

Examples:

```python
df_predict = linker.inference.predict()
df_predict.to_csv("model_predictions.csv", overwrite=True)
```
## to_parquet(filepath, overwrite=False)

Save the dataframe in parquet format.

Examples:

```python
df_predict = linker.inference.predict()
df_predict.to_parquet("model_predictions.parquet", overwrite=True)
```