Using SQL templating with pydbtools¶
pydbtools can read SQL files and also render SQL with Jinja templating; this notebook demonstrates both.
Note: this functionality is new in pydbtools v4.0.0.
Setup¶
Just run this script to create the source database so we can use it for our example.
import os
import pandas as pd
import awswrangler as wr
import pydbtools as pydb
# setup your own testing area (set foldername = GH username)
foldername = "mratford" # GH username
foldername = foldername.lower().replace("-", "_")
bucketname = "alpha-everyone"
s3_base_path = f"s3://{bucketname}/{foldername}/"
db_name = f"aws_example_{foldername}"
source_db_base_path = f"s3://{bucketname}/{foldername}/source_db/"
# Delete all the s3 files in a given path
if wr.s3.list_objects(s3_base_path):
    print("deleting objs")
    wr.s3.delete_objects(s3_base_path)
# Delete the database if it exists
df_dbs = wr.catalog.databases(None)
if db_name in df_dbs["Database"].to_list():
    print(f"{db_name} found, deleting")
    wr.catalog.delete_database(name=db_name)
# Setup source database
# Create the database
wr.catalog.create_database(db_name)
# Iterate through the tables in data/ and write them to our db using awswrangler
for table_name in ["department", "employees", "sales"]:
    df = pd.read_csv(f"data/{table_name}.csv")
    table_path = os.path.join(source_db_base_path, f"{table_name}/")
    wr.s3.to_parquet(
        df=df,
        path=table_path,
        index=False,
        dataset=True,  # True allows the other params below, i.e. writing to db.table
        database=db_name,
        table=table_name,
        mode="overwrite",
    )
Task¶
We now have a database with 3 tables. We are going to write a SQL query with Jinja templating, which we can "render" with parameters and then run using pydbtools.
Render a template¶
Jinja uses double curly braces for its parameters.
sql_template = """
SELECT *
FROM {{ db_name }}.{{ table }}
"""
print(
pydb.render_sql_template(
sql_template, {"db_name": db_name, "table": "department"}
)
)
SELECT * FROM aws_example_mratford.department
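To make the mechanics concrete, here is a minimal stand-in for what rendering does (real Jinja is far more capable; this regex substitution only handles simple `{{ name }}` placeholders and is not how pydbtools is implemented):

```python
import re

def render(template, params):
    # Replace each {{ name }} placeholder with the matching value from params.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(params[m.group(1)]),
        template,
    )

sql = render(
    "SELECT * FROM {{ db_name }}.{{ table }}",
    {"db_name": "aws_example_mratford", "table": "department"},
)
# sql is now "SELECT * FROM aws_example_mratford.department"
```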
# We can now use this rendered SQL to actually run the query.
sql = pydb.render_sql_template(
sql_template, {"db_name": db_name, "table": "department"}
)
pydb.read_sql_query(sql)
|   | department_id | department_name |
|---|---|---|
| 0 | 1 | Sales |
| 1 | 2 | Admin |
| 2 | 3 | Management |
| 3 | 4 | Technical |
| 4 | 5 | Maintenance |
| 5 | 6 | HR |
# We can also use the same template to read a different table
sql = pydb.render_sql_template(
sql_template, {"db_name": db_name, "table": "sales"}
)
pydb.read_sql_query(sql)
|   | employee_id | qtr | sales |
|---|---|---|---|
| 0 | 1 | 1 | 768.17 |
| 1 | 2 | 1 | 391.98 |
| 2 | 3 | 1 | 406.36 |
| 3 | 4 | 1 | 816.25 |
| 4 | 5 | 1 | 437.05 |
| ... | ... | ... | ... |
| 174 | 43 | 4 | 442.61 |
| 175 | 44 | 4 | 857.64 |
| 176 | 45 | 4 | 644.43 |
| 177 | 46 | 4 | 988.18 |
| 178 | 59 | 4 | 927.30 |

179 rows × 3 columns
Read and render a file¶
So you might be thinking this seems pointless, as we could just use f-strings. True, so far. What does come into use is the ability to read in an SQL file and render it.
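The difference from f-strings can be sketched in two lines (the `db_name` value here is hypothetical, purely for illustration): an f-string is filled in the moment it is defined, while a template keeps its placeholders until render time, which is what lets the template live in a file.

```python
db_name = "aws_example_demo"  # hypothetical value for illustration

# An f-string interpolates the moment it is defined, so the finished text
# is what would have to be stored on disk: one file per database/table.
fstring_sql = f"SELECT * FROM {db_name}.department"

# A Jinja template keeps its placeholders as plain text, so one .sql file
# can be rendered with different parameters each time it is read.
template_sql = "SELECT * FROM {{ db_name }}.{{ table }}"
```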
# Let's first create an SQL file
# as your db_name depends on who is running this tutorial, we'll use an f-string to create a basic SQL file
with open("tempfile.sql", "w") as f:
f.write(f"SELECT * FROM {db_name}.employees")
Now if you open up the file tempfile.sql you'll see a simple SQL file, and JupyterLab should also give you some syntax colouring for SQL (because it recognises the .sql extension).
With pydbtools you can read an SQL file and then use that query.
sql = pydb.get_sql_from_file("tempfile.sql")
pydb.read_sql_query(sql)
|   | employee_id | sex | forename | surname | department_id | manager_id |
|---|---|---|---|---|---|---|
| 0 | 1 | M | Dexter | Mitchell | 1.0 | 17.0 |
| 1 | 2 | F | Summer | Bennett | 1.0 | 17.0 |
| 2 | 3 | M | Pip | Carter | 1.0 | 17.0 |
| 3 | 4 | F | Bella | Long | 1.0 | 17.0 |
| 4 | 5 | F | Lexie | Perry | NaN | 17.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 219 | 223 | M | Austin | Turner | 5.0 | 226.0 |
| 220 | 224 | M | Theo | Lewis | 5.0 | 226.0 |
| 221 | 225 | M | William | Miller | NaN | 226.0 |
| 222 | 226 | F | Amelie | Watson | 5.0 | 22.0 |
| 223 | 227 | M | Andy | Poulton | 5.0 | NaN |

224 rows × 6 columns
We can also read in and render SQL templates. This means you can store your SQL template as a file, parameterise it when reading it in, and then run it.
First let's overwrite our new file with the SQL template we originally created.
# Note no f-strings this time. We are using jinja templating
with open("tempfile.sql", "w") as f:
f.write("SELECT * FROM {{ db_name }}.{{ table_name }}")
Again it is worth looking at the file (note: if it looks the same, close the file in Jupyter and reopen it). You should see the same SQL but with the Jinja templating.
Now let's read in and render our template.
sql = pydb.get_sql_from_file(
"tempfile.sql", jinja_args={"db_name": db_name, "table_name": "department"}
)
pydb.read_sql_query(sql)
|   | department_id | department_name |
|---|---|---|
| 0 | 1 | Sales |
| 1 | 2 | Admin |
| 2 | 3 | Management |
| 3 | 4 | Technical |
| 4 | 5 | Maintenance |
| 5 | 6 | HR |
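Putting the two steps together, here is a minimal sketch of the read-then-render pattern (a simplified stand-in for `get_sql_from_file` with `jinja_args`, using only the standard library, not the pydbtools implementation):

```python
import pathlib
import re
import tempfile

def get_sql(path, jinja_args=None):
    # Read the SQL text from disk, then substitute any {{ name }} placeholders.
    sql = pathlib.Path(path).read_text()
    if jinja_args:
        sql = re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda m: str(jinja_args[m.group(1)]),
            sql,
        )
    return sql

# Demonstrate against a throwaway file rather than tempfile.sql.
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write("SELECT * FROM {{ db_name }}.{{ table_name }}")
    path = f.name

sql = get_sql(path, jinja_args={"db_name": "aws_example_demo", "table_name": "department"})
# sql is now "SELECT * FROM aws_example_demo.department"
```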
Conclusion¶
Writing your SQL as a file can be really helpful, as you get syntax highlighting that you don't get when just writing a string. With Jinja templating you can create SQL templates and then parameterise them as you please.
Finally, when you start working with a heavy codebase of SQL across multiple users, you can use tools like sqlfluff to lint your SQL files and ensure your team is working to the same standard of SQL. (sqlfluff also supports Jinja-templated SQL files.)
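For example, a minimal `.sqlfluff` configuration file telling sqlfluff to treat your files as Jinja-templated might look like the sketch below (the `ansi` dialect is an assumption here; pick the dialect that matches your query engine):

```ini
[sqlfluff]
dialect = ansi
templater = jinja
```

With that in place, running `sqlfluff lint tempfile.sql` from the same directory will lint the templated SQL.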