awswrangler
extensions
init_athena_params(func=None, *, allow_boto3_session=False)
Takes a wrangler Athena function and sets boto3_session and s3_output_path if they exist in the function's parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | Callable | A function from wr.athena that requires | None |
Returns:
Type | Description |
---|---|
 | A similar function call but with pre-defined parameters. |
Source code in pydbtools/_wrangler.py
start_query_execution_and_wait(sql, *args, **kwargs)
Calls start_query_execution followed by wait_query. *args and **kwargs are passed to start_query_execution.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql | str | An SQL string, which works with `__temp__` references. | required |
Source code in pydbtools/_wrangler.py
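As an illustration, a CTAS statement of the kind this function is typically used for. Table names here are placeholders, and the call itself is shown commented out because it needs AWS credentials and access to Athena:

```python
# Compose a CTAS statement; __temp__ is resolved by pydbtools to the
# user's temporary database.
sql = """
CREATE TABLE __temp__.summary AS
SELECT col1, COUNT(*) AS n
FROM my_database.my_table
GROUP BY col1
"""

# import pydbtools as pydb
# pydb.start_query_execution_and_wait(sql)  # blocks until the query finishes
```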
check_sql(sql)
Validates sql to confirm it is a SELECT statement.
Source code in pydbtools/_wrangler.py
create_temp_table(sql, table_name, boto3_session=None, force_ec2=False, region_name=None)
Create a table inside the temporary database from an SQL query.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql | str | The SQL query you want to create a temp table out of. Should be a query that starts with a WITH or SELECT clause. | required |
table_name | str | The name of the temp table you wish to create. | required |
force_ec2 | bool | Whether to force boto to get the credentials from the EC2 instance. This is for dbtools, the R wrapper that calls this package via reticulate and requires credentials to be refreshed via the EC2 instance (and therefore sets this to True); it is not necessary when using this package in Python. Default is False. | False |
region_name | str | Name of the AWS region you want to run queries in. Defaults to pydbtools.utils.aws_default_region (which, if left unset, is "eu-west-1"). | None |
Source code in pydbtools/_wrangler.py
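A hypothetical usage sketch (my_database.orders is a made-up table; the commented calls require AWS credentials):

```python
# SQL for the temp table: must start with a WITH or SELECT clause.
sql = """
SELECT order_id, amount
FROM my_database.orders
WHERE year = 2021
"""

# import pydbtools as pydb
# pydb.create_temp_table(sql, table_name="recent_orders")
# The table is then queryable as __temp__.recent_orders.
```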
create_table(sql, database, table, location, partition_cols=None, boto3_session=None)
Create a table in a database from a SELECT statement
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql | str | SQL starting with a WITH or SELECT clause | required |
database | str | Database name | required |
table | str | Table name | required |
location | str | S3 path to where the table should be stored | required |
partition_cols | List[str] | partition columns (optional) | None |
boto3_session | | optional boto3 session | None |
Source code in pydbtools/_wrangler.py
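A sketch with placeholder names throughout; the commented call needs AWS credentials and write access to the S3 location:

```python
# SELECT statement to persist as a partitioned table.
sql = "SELECT * FROM my_database.raw_events WHERE year = 2021"

# import pydbtools as pydb
# pydb.create_table(
#     sql,
#     database="my_results_db",
#     table="events_2021",
#     location="s3://my-bucket/my_results_db/events_2021/",
#     partition_cols=["month"],
# )
```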
read_sql_queries(sql)
Reads a number of SQL statements and returns the result of the last SELECT statement as a dataframe. Temporary tables can be created using CREATE TEMP TABLE tablename AS (sql query) and accessed using `__temp__` as the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql | str | SQL commands | required |
Returns:
Type | Description |
---|---|
Optional[DataFrame] | A Pandas DataFrame of the result of the last SELECT statement, if there is one. |
Example
If the file eg.sql contains the SQL code

```sql
create temp table A as (
    select * from database.table1
    where year = 2021
);
create temp table B as (
    select * from database.table2
    where amount > 10
);
select * from __temp__.A
left join __temp__.B
on A.id = B.id;
```

then the final join can be read into a dataframe with

```python
df = read_sql_queries(open("eg.sql", "r").read())
```
Source code in pydbtools/_wrangler.py
read_sql_queries_gen(sql)
Reads a number of SQL statements and returns the result of any SELECT statements as a dataframe generator. Temporary tables can be created using CREATE TEMP TABLE tablename AS (sql query) and accessed using `__temp__` as the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql | str | SQL commands | required |
Returns:
Type | Description |
---|---|
Iterator[DataFrame] | An iterator of Pandas DataFrames. |
Example
If the file eg.sql contains the SQL code

```sql
create temp table A as (
    select * from database.table1
    where year = 2021
);
create temp table B as (
    select * from database.table2
    where amount > 10
);
select * from __temp__.A
left join __temp__.B
on A.id = B.id;
select * from __temp__.A
where country = 'UK';
```

then the two SELECT results can be read with

```python
df_iter = read_sql_queries_gen(open("eg.sql", "r").read())
df1 = next(df_iter)
df2 = next(df_iter)
```
Source code in pydbtools/_wrangler.py
delete_table_and_data(table, database, boto3_session=None)
Deletes both a table from an Athena database and the underlying data on S3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | str | The table name to drop. | required |
database | str | The database name. | required |
Returns:
Type | Description |
---|---|
 | True if the table exists and is deleted, False if the table does not exist. |
Source code in pydbtools/_wrangler.py
delete_temp_table(table, boto3_session=None)
Deletes a temporary table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | str | The table name to drop. | required |
Returns:
Type | Description |
---|---|
 | True if the table exists and is deleted, False if the table does not exist. |
Source code in pydbtools/_wrangler.py
delete_database_and_data(database, boto3_session=None)
Deletes both an Athena database and the underlying data on S3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
database | str | The database name to drop. | required |
Returns:
Type | Description |
---|---|
 | True if the database exists and is deleted, False if the database does not exist. |
Source code in pydbtools/_wrangler.py
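A combined sketch of the three deletion helpers above (names are placeholders; the commented calls need AWS credentials):

```python
# Placeholder names for the objects to clean up.
table, database = "events_2021", "my_results_db"

# import pydbtools as pydb
# pydb.delete_table_and_data(table=table, database=database)
# pydb.delete_temp_table("recent_orders")
# pydb.delete_database_and_data(database)
# Each helper returns True when the object existed and was removed.
```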
delete_partitions_and_data(database, table, expression, boto3_session=None)
Deletes partitions and the underlying data on S3 from an Athena database table matching an expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
database | str | The database name. | required |
table | str | The table name. | required |
expression | str | The expression to match. | required |
Please see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions for instructions on constructing the expression; at a basic level you can use SQL syntax on your partition columns.
Examples:

```python
delete_partitions_and_data("my_database", "my_table", "year = 2020 and month = 5")
```
Source code in pydbtools/_wrangler.py
save_query_to_parquet(sql, file_path)
Saves the results of a query to a parquet file at a given location.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql | str | The SQL query. | required |
file_path | str | The path to save the result to. | required |
Examples:

```python
save_query_to_parquet(
    "select * from my_database.my_table",
    "result.parquet",
)
```
Source code in pydbtools/_wrangler.py
dataframe_to_temp_table(df, table, boto3_session=None)
Creates a temporary table from a dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame | required |
table | str | The name of the table in the temporary database | required |
boto3_session | | optional boto3 session | None |
Source code in pydbtools/_wrangler.py
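A small sketch, assuming pandas is installed (the commented call needs AWS credentials):

```python
import pandas as pd

# Toy DataFrame to upload to the temporary database.
df = pd.DataFrame({"id": [1, 2, 3], "amount": [9.5, 12.0, 3.25]})

# import pydbtools as pydb
# pydb.dataframe_to_temp_table(df, table="my_temp")
# ...after which it can be queried as __temp__.my_temp.
```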
dataframe_to_table(df, database, table, location, mode='overwrite', partition_cols=None, boto3_session=None, **kwargs)
Creates a table from a dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame | required |
database | str | Database name | required |
table | str | Table name | required |
location | str | S3 path to where the table should be stored | required |
mode | str | "overwrite" (default), "append", or "overwrite_partitions" | 'overwrite' |
partition_cols | List[str] | partition columns (optional) | None |
boto3_session | | optional boto3 session | None |
**kwargs | | arguments for to_parquet | {} |
Source code in pydbtools/_wrangler.py
create_database(database, **kwargs)
Creates a new database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
database | str | The name of the database | required |
Returns:
Type | Description |
---|---|
bool | False if the database already exists, True if it has been created. |
Source code in pydbtools/_wrangler.py
file_to_table(path, database, table, location, mode='overwrite', partition_cols=None, boto3_session=None, chunksize=None, metadata=None, **kwargs)
Writes a csv, json, or parquet file to a database table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The location of the file | required |
database | str | Database name | required |
table | str | Table name | required |
location | str | S3 path to where the table should be stored | required |
mode | str | "overwrite" (default), "append", or "overwrite_partitions" | 'overwrite' |
partition_cols | List[str] | partition columns (optional) | None |
boto3_session | | optional boto3 session | None |
chunksize | Union[int, str] | size of chunks in memory or rows, e.g. "100MB", 100000 | None |
metadata | | a mojap_metadata instance | None |
**kwargs | | arguments for arrow_pd_parser.reader.read, e.g. use chunksize for very large files and metadata to apply metadata | {} |
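A sketch with placeholder arguments collected in a dict for readability; the commented call needs AWS credentials and a real S3 bucket:

```python
# Placeholder arguments for loading a local CSV into an Athena table.
kwargs = dict(
    database="my_results_db",
    table="sales",
    location="s3://my-bucket/my_results_db/sales/",
    mode="overwrite",
    partition_cols=["region"],
    chunksize="100MB",  # stream large files rather than loading them whole
)

# import pydbtools as pydb
# pydb.file_to_table("data/sales.csv", **kwargs)
```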