Using SQL templating with pydbtools¶
pydbtools can read SQL files and also render SQL with Jinja templating; this notebook demonstrates both.
Note: this functionality is new in pydbtools v4.0.0.
Setup¶
Just run this script to create the source database so we can use it for our example.
import os
import pandas as pd
import awswrangler as wr
import pydbtools as pydb
# setup your own testing area (set foldername = GH username)
foldername = "mratford" # GH username
foldername = foldername.lower().replace("-", "_")
bucketname = "alpha-everyone"
s3_base_path = f"s3://{bucketname}/{foldername}/"
db_name = f"aws_example_{foldername}"
source_db_base_path = f"s3://{bucketname}/{foldername}/source_db/"
# Delete all the s3 files in a given path
if wr.s3.list_objects(s3_base_path):
    print("deleting objs")
    wr.s3.delete_objects(s3_base_path)
# Delete the database if it exists
df_dbs = wr.catalog.databases(None)
if db_name in df_dbs["Database"].to_list():
    print(f"{db_name} found, deleting")
    wr.catalog.delete_database(name=db_name)
# Setup source database
# Create the database
wr.catalog.create_database(db_name)
# Iterate through the tables in data/ and write them to our db using awswrangler
for table_name in ["department", "employees", "sales"]:
    df = pd.read_csv(f"data/{table_name}.csv")
    table_path = os.path.join(source_db_base_path, f"{table_name}/")
    wr.s3.to_parquet(
        df=df,
        path=table_path,
        index=False,
        dataset=True,  # True allows the other params below, i.e. writing to db.table
        database=db_name,
        table=table_name,
        mode="overwrite",
    )
Task¶
We now have a database with 3 tables. We are going to write a SQL query with Jinja templating, which we can "render" with parameters and then run using pydbtools.
Render a template¶
Jinja uses double curly braces for its parameters.
sql_template = """
SELECT *
FROM {{ db_name }}.{{ table }}
"""
print(
pydb.render_sql_template(
sql_template, {"db_name": db_name, "table": "department"}
)
)
SELECT * FROM aws_example_mratford.department
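To make the mechanics concrete, here is a minimal stand-in for what rendering does (real Jinja is far more capable; this regex substitution only handles simple `{{ name }}` placeholders and is not how pydbtools is implemented):

```python
import re

def render(template, params):
    # Replace each {{ name }} placeholder with the matching value from params.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(params[m.group(1)]),
        template,
    )

sql = render(
    "SELECT * FROM {{ db_name }}.{{ table }}",
    {"db_name": "aws_example_mratford", "table": "department"},
)
# sql is now "SELECT * FROM aws_example_mratford.department"
```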
# We can now use this rendered SQL to actually run the query.
sql = pydb.render_sql_template(
sql_template, {"db_name": db_name, "table": "department"}
)
pydb.read_sql_query(sql)
|   | department_id | department_name |
|---|---|---|
| 0 | 1 | Sales |
| 1 | 2 | Admin |
| 2 | 3 | Management |
| 3 | 4 | Technical |
| 4 | 5 | Maintenance |
| 5 | 6 | HR |
# We can also use the same template to read a different table
sql = pydb.render_sql_template(
sql_template, {"db_name": db_name, "table": "sales"}
)
pydb.read_sql_query(sql)
|   | employee_id | qtr | sales |
|---|---|---|---|
| 0 | 1 | 1 | 768.17 |
| 1 | 2 | 1 | 391.98 |
| 2 | 3 | 1 | 406.36 |
| 3 | 4 | 1 | 816.25 |
| 4 | 5 | 1 | 437.05 |
| ... | ... | ... | ... |
| 174 | 43 | 4 | 442.61 |
| 175 | 44 | 4 | 857.64 |
| 176 | 45 | 4 | 644.43 |
| 177 | 46 | 4 | 988.18 |
| 178 | 59 | 4 | 927.30 |

179 rows × 3 columns
Read and render a file¶
So you might be thinking this seems pointless, as we could just use f-strings. True, so far. What does come into use is the ability to read in an SQL file and render it.
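The difference from f-strings can be sketched in two lines (the `db_name` value here is hypothetical, purely for illustration): an f-string is filled in the moment it is defined, while a template keeps its placeholders until render time, which is what lets the template live in a file.

```python
db_name = "aws_example_demo"  # hypothetical value for illustration

# An f-string interpolates the moment it is defined, so the finished text
# is what would have to be stored on disk: one file per database/table.
fstring_sql = f"SELECT * FROM {db_name}.department"

# A Jinja template keeps its placeholders as plain text, so one .sql file
# can be rendered with different parameters each time it is read.
template_sql = "SELECT * FROM {{ db_name }}.{{ table }}"
```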
# Let's first create an SQL file
# as your db_name depends on who is running this tutorial, we'll use an f-string to create a basic SQL file
with open("tempfile.sql", "w") as f:
f.write(f"SELECT * FROM {db_name}.employees")
Now if you open up the file tempfile.sql you'll see a simple SQL file, and JupyterLab should also give you some syntax colouring for SQL (because it recognises the .sql extension).
With pydbtools you can read an SQL file and then use that query.
sql = pydb.get_sql_from_file("tempfile.sql")
pydb.read_sql_query(sql)
|   | employee_id | sex | forename | surname | department_id | manager_id |
|---|---|---|---|---|---|---|
| 0 | 1 | M | Dexter | Mitchell | 1.0 | 17.0 |
| 1 | 2 | F | Summer | Bennett | 1.0 | 17.0 |
| 2 | 3 | M | Pip | Carter | 1.0 | 17.0 |
| 3 | 4 | F | Bella | Long | 1.0 | 17.0 |
| 4 | 5 | F | Lexie | Perry | NaN | 17.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 219 | 223 | M | Austin | Turner | 5.0 | 226.0 |
| 220 | 224 | M | Theo | Lewis | 5.0 | 226.0 |
| 221 | 225 | M | William | Miller | NaN | 226.0 |
| 222 | 226 | F | Amelie | Watson | 5.0 | 22.0 |
| 223 | 227 | M | Andy | Poulton | 5.0 | NaN |

224 rows × 6 columns
We can also read in and render SQL templates. This means you can store your SQL template as a file, parameterise it when reading it in, and then run it.
First let's overwrite our new file with the SQL template we originally created.
# Note no f-strings this time. We are using jinja templating
with open("tempfile.sql", "w") as f:
f.write("SELECT * FROM {{ db_name }}.{{ table_name }}")
Again it is worth looking at the file (note: if it looks the same, close the file in Jupyter and reopen it). You should see the same SQL but with the Jinja templating.
Now let's read in and render our template.
sql = pydb.get_sql_from_file(
"tempfile.sql", jinja_args={"db_name": db_name, "table_name": "department"}
)
pydb.read_sql_query(sql)
|   | department_id | department_name |
|---|---|---|
| 0 | 1 | Sales |
| 1 | 2 | Admin |
| 2 | 3 | Management |
| 3 | 4 | Technical |
| 4 | 5 | Maintenance |
| 5 | 6 | HR |
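Putting the two steps together, here is a minimal sketch of the read-then-render pattern (a simplified stand-in for `get_sql_from_file` with `jinja_args`, using only the standard library, not the pydbtools implementation):

```python
import pathlib
import re
import tempfile

def get_sql(path, jinja_args=None):
    # Read the SQL text from disk, then substitute any {{ name }} placeholders.
    sql = pathlib.Path(path).read_text()
    if jinja_args:
        sql = re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda m: str(jinja_args[m.group(1)]),
            sql,
        )
    return sql

# Demonstrate against a throwaway file rather than tempfile.sql.
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write("SELECT * FROM {{ db_name }}.{{ table_name }}")
    path = f.name

sql = get_sql(path, jinja_args={"db_name": "aws_example_demo", "table_name": "department"})
# sql is now "SELECT * FROM aws_example_demo.department"
```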
Conclusion¶
Writing your SQL as a file can be really helpful, as you get syntax highlighting that you don't get when just writing a string. With Jinja templating you can create SQL templates and then parameterise them as you please.
Finally, when you start working with a heavy codebase of SQL across multiple users, you can use tools like sqlfluff to lint your SQL files and ensure your team is working to the same standard of SQL. (sqlfluff also supports Jinja-templated SQL files.)
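For example, a minimal `.sqlfluff` configuration file telling sqlfluff to treat your files as Jinja-templated might look like the sketch below (the `ansi` dialect is an assumption here; pick the dialect that matches your query engine):

```ini
[sqlfluff]
dialect = ansi
templater = jinja
```

With that in place, running `sqlfluff lint tempfile.sql` from the same directory will lint the templated SQL.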