Building a transaction data lake using
Amazon Athena, Apache Iceberg and dbt


Dr Soumaya Mauthoor

September 2024

Published in 2022 under the Johnson Conservative government

image

MoJ Analytical Platform


image

Previous ELT Architecture

image

Modern Table Formats

Iceberg was the obvious choice for our use case because of its enhanced Athena support
image
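
To make the Athena support concrete, here is a minimal sketch of the DDL Athena accepts for creating an empty Iceberg table. The helper name, database, table, columns and bucket are hypothetical and chosen for illustration only; the one Iceberg-specific part is the `table_type` table property.

```python
def iceberg_create_table_sql(database, table, columns, location, partition_by=()):
    """Build an Athena DDL statement that creates an empty Iceberg table.

    All identifiers passed in are illustrative; nothing here is taken
    from the real MoJ pipeline.
    """
    column_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    sql = f"CREATE TABLE {database}.{table} (\n  {column_defs}\n)"
    if partition_by:
        sql += f"\nPARTITIONED BY ({', '.join(partition_by)})"
    sql += f"\nLOCATION '{location}'"
    # This table property is what tells Athena to create an Iceberg table.
    sql += "\nTBLPROPERTIES ('table_type' = 'ICEBERG')"
    return sql


ddl = iceberg_create_table_sql(
    database="curated",  # hypothetical database name
    table="store_sales",
    columns=[("ss_item_sk", "bigint"), ("ss_sold_date_sk", "bigint")],
    location="s3://example-bucket/curated/store_sales/",  # hypothetical bucket
)
print(ddl)
```

The statement is only built as a string here; submitting it to Athena (for example through the console or an SDK) is left out of the sketch.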

Glue PySpark vs Athena Curation Benchmarking

Criteria

  1. Cost
  2. Complexity
  3. Run Time

Dataset

TPC-DS store_sales

  • scale: 100 (~10GB)
  • scale: 3000 (~400GB)
image
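
The cost criterion can be sketched as simple arithmetic, since Athena bills by data scanned and Glue bills by DPU-hours. The default prices below (5 USD per TB scanned, 0.44 USD per DPU-hour) are assumptions based on commonly quoted list prices and vary by region and over time; the inputs in the usage lines are placeholders, not benchmark results.

```python
TB = 1024 ** 4  # bytes per terabyte, as used for this estimate


def athena_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate the cost of an Athena query from the bytes it scanned.

    usd_per_tb is an assumed list price, not a figure from the slides.
    """
    return bytes_scanned / TB * usd_per_tb


def glue_job_cost(dpus, runtime_hours, usd_per_dpu_hour=0.44):
    """Estimate the cost of a Glue job from its DPU count and run time.

    usd_per_dpu_hour is an assumed list price, not a figure from the slides.
    """
    return dpus * runtime_hours * usd_per_dpu_hour


# Placeholder inputs: 1 TB scanned versus 10 DPUs running for one hour.
print(athena_query_cost(1 * TB))
print(glue_job_cost(dpus=10, runtime_hours=1.0))
```

Run time feeds directly into the Glue estimate but not the Athena one, which is one reason the two engines can rank differently on cost and speed.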

Bulk Insert


CREATE TABLE target_table
AS SELECT * FROM source_table

  • Athena is cheaper at scales up to 3TB
  • Glue PySpark is faster at the 3TB scale
image
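
When the target is an Iceberg table, Athena's CTAS needs a `WITH` clause in addition to the generic statement above. The sketch below builds such a statement; the helper name, table names and bucket are hypothetical, and the exact properties used in the benchmark are not stated on the slide.

```python
def iceberg_ctas_sql(target, source, location):
    """Build an Athena CTAS statement that bulk-loads an Iceberg table.

    table_type, is_external and location are Athena CTAS table properties;
    the table names and location passed in are illustrative only.
    """
    return (
        f"CREATE TABLE {target}\n"
        "WITH (\n"
        "  table_type = 'ICEBERG',\n"
        "  is_external = false,\n"
        f"  location = '{location}'\n"
        ")\n"
        f"AS SELECT * FROM {source}"
    )


ctas = iceberg_ctas_sql(
    target="curated.store_sales",  # hypothetical target table
    source="raw.store_sales",  # hypothetical source table
    location="s3://example-bucket/curated/store_sales/",  # hypothetical bucket
)
print(ctas)
```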

SCD2 Merge

Update 0.1% of rows

MERGE INTO target_table
USING source_query
ON search_condition
WHEN MATCHED THEN UPDATE []
WHEN NOT MATCHED THEN INSERT []

  • Athena is cheaper and faster
  • Glue PySpark runs out of memory at the 3TB scale
image
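
The SCD2 (slowly changing dimension type 2) logic that the MERGE expresses can be illustrated in memory: a matched row whose value has changed is closed off and a new current version is added, while an unmatched key is inserted. This is a toy model under assumed column names (`key`, `value`, `valid_from`, `valid_to`), not the SQL used in the benchmark.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Version:
    key: int
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def scd2_merge(target, updates, as_of):
    """Return a new history after applying updates ({key: value}) as of a date."""
    merged = []
    keys_with_current = set()
    for row in target:
        is_current = row.valid_to is None
        if is_current:
            keys_with_current.add(row.key)
        if is_current and row.key in updates and updates[row.key] != row.value:
            # Matched and changed: close the old version, open a new one.
            merged.append(replace(row, valid_to=as_of))
            merged.append(Version(row.key, updates[row.key], as_of))
        else:
            # Historical, unmatched or unchanged rows are kept as they are.
            merged.append(row)
    for key, value in updates.items():
        if key not in keys_with_current:
            # Not matched: insert a first (or new) current version.
            merged.append(Version(key, value, as_of))
    return merged


history = [
    Version(1, "open", date(2024, 1, 1)),
    Version(2, "open", date(2024, 1, 1)),
]
result = scd2_merge(history, {2: "closed", 3: "open"}, as_of=date(2024, 9, 1))
for row in result:
    print(row)
```

In this example key 1 is untouched, key 2 ends up with a closed version plus a new current one, and key 3 is inserted, so earlier states remain queryable after the merge.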

Full Load Blue-Green Deployment

image

Incremental Blue-Green Deployment

image

Outcomes

  • Reduced query costs by 99%

  • Reduced query time by 50-75%

  • Enabled daily refresh cycle

  • Stabilised data pipeline

  • Ensured data quality

  • Streamlined technology stack

  • Facilitated phased development

image

Questions?






Our data strategy (https://intranet.justice.gov.uk/blog/becoming-a-truly-data-led-justice-system/):

  • We will improve justice outcomes through data driven insight and innovation.
  • We will ensure data meets user needs.
  • We will build a data culture to value data as a strategic asset.

Modern table formats provide a table-like abstraction on top of native file formats like Parquet by storing additional metadata.