 
 
 
 
Data Pipeline Architecture To-Be 
Option 1: Convert curated tables to Iceberg  table format
Option 2: Migrate curation to Athena  + dbt  in conjunction with Iceberg
 
Outcome 
Out-of-the-box, Athena + Iceberg is cheaper  and more performant  for our use cases than Glue PySpark + Iceberg 
Iceberg is compatible with the data engineering tool set which facilitates adoption  
Iceberg simplifies  the code which makes it easier to maintain 
 
Hence we are proceeding with option 2 
This also unifies the data engineering tech stack which facilitates collaboration  and minimizes duplication
 
Lessons learnt 
Re-evaluate objectives regularly!
The investigation was initially supposed to compare Glue PySpark against Hudi and Iceberg table formats.
We quickly expanded the investigation to include Athena, but wasted time investigating Hudi further after it was already clear that Iceberg was the winner for our use cases.
 
 
Data curation processes 
Bulk insert full loads 
Remove duplicate data 
Apply Type 2 Slowly Changing Dimension (SCD2)  to track row changes over time: 
 
| id | status  | updated_at | valid_from | valid_to   | is_current |
|----|---------|------------|------------|------------|------------|
| 1  | pending | 2019-01-01 | 2019-01-01 | 2019-01-02 | False      |
| 1  | shipped | 2019-01-02 | 2019-01-02 | null       | True       |
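For illustration, a point-in-time lookup against such a table might look like the sketch below (hypothetical table name orders):

-- State of id 1 as at 2019-01-01 (hypothetical SCD2 table "orders")
SELECT id, status
FROM orders
WHERE id = 1
  AND DATE '2019-01-01' >= valid_from
  AND (valid_to IS NULL OR DATE '2019-01-01' < valid_to)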
 
 
SCD2 is difficult because:
Deltas can contain multiple and/or late-arriving updates 
Requires updating historic records 
 
 
Performance has degraded over the last few months, with monthly costs quadrupling 
Uses a complex process for handling data shuffling, which makes it hard to maintain 
Produces large volumes of intermittent missing data and duplicates, but given the complexity of the current job the root cause could not be identified 
 
Could we improve performance and simplify the PySpark job by making use of Iceberg?
 
A way to organise a dataset's files to present them as a table 
 
The Apache Hive table format defines a table as all the files in one or more particular directories
 
Modern table formats (Apache Hudi, Delta Lake, and Apache Iceberg) store additional metadata and table statistics
 
This allows query engines to better identify relevant data files, minimising data scans and speeding up queries
 
 
 
ACID Transactions 
Hive does not easily support updates or deletes
 
A common work-around is the Write-Audit-Publish (WAP) pattern:
Rewrite some or all of the data to a staging location 
Audit the new data 
Replace the existing data or point the data catalogue to the new location 
 
 
This causes huge data duplication and redundant ETL jobs
 
Modern table formats provide ACID guarantees on inserts, deletes, and updates, with the option to run these operations concurrently
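For example, on an Iceberg table Athena (engine version 3) can apply row-level changes directly, without the WAP rewrite; the table and predicates below are hypothetical:

-- Row-level DML on a hypothetical Iceberg table, run directly from Athena
DELETE FROM curated.orders WHERE status = 'cancelled';
UPDATE curated.orders SET is_current = false WHERE id = 1 AND valid_to IS NOT NULL;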
 
 
 
Why Apache Iceberg? 
Comparison of table formats:
Performance  is very dependent on optimisation 
Community support  is comparable 
Ecosystem support  is more varied: 
 
| Ecosystem        | Hudi           | Delta Lake     | Iceberg        |
|------------------|----------------|----------------|----------------|
| AWS Glue PySpark | Read+Write+DDL | Read+Write+DDL | Read+Write+DDL |
| Amazon Athena    | Read           | Read           | Read+Write+DDL |
 
 
Athena has write and DDL support only for Iceberg tables 
=> Iceberg makes Athena  a viable alternative to AWS Glue PySpark for ETL 
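As a sketch, creating a writable Iceberg table from Athena looks roughly like this (database, columns and S3 location are hypothetical):

-- Hypothetical Iceberg table created and managed entirely from Athena
CREATE TABLE curated.orders (
  id bigint,
  status string,
  updated_at timestamp)
PARTITIONED BY (day(updated_at))
LOCATION 's3://example-curated-bucket/orders/'
TBLPROPERTIES ('table_type' = 'ICEBERG');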
 
Why Amazon Athena? 
Athena runs queries in a distributed query engine using Trino  under the hood 
Athena has many advantages over Glue PySpark:
Costs based on amount of data scanned ($5/TB) 
Determines optimum cluster query settings dynamically 
Sacrifices  mid-query fault-tolerance for faster execution 
Shallower learning curve 
 
 
Athena engine version 3 is better integrated with the Glue Data Catalog and Iceberg 
Athena has various service quotas but these can be increased 
Athena in conjunction with dbt can be used for ETL (see the model sketch below) 
dbt can manage concurrent workloads to minimise throttling 
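As an illustration, an incremental dbt model targeting an Iceberg table might look like the sketch below; the model and source names are hypothetical and the config keys assume the dbt-athena adapter:

-- models/curated/orders_history.sql (hypothetical dbt model, dbt-athena adapter assumed)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    table_type='iceberg'
) }}

select id, status, updated_at
from {{ source('staging', 'orders_delta') }}
{% if is_incremental() %}
  -- only process rows newer than those already in the target table
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}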
 
 
3) Evaluation Methodology 
 
Evaluation criteria 
In order of importance :
Compatibility with existing tech stack and tool sets 
Minimise running costs 
Minimise code complexity / maximise readability 
Minimise execution time 
 
Time is less important because we use daily batch processes which run overnight 
Time is still relevant because:
there is a direct relationship between time and cost for Glue PySpark 
Athena has runtime quotas 
 
 
 
 
TPC-DS Benchmark 
TPC-DS  is a data warehousing benchmark consisting of:
25 tables  whose total size can vary (1GB to 100TB) 
99 SQL queries  ranging from simple aggregations to advanced pattern analysis 
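For a flavour of the workload, the sketch below is a simple aggregation in the spirit of the benchmark (not a verbatim TPC-DS query) using the standard store_sales, date_dim and item tables:

-- Yearly store sales per brand (TPC-DS-style aggregation, not an official query)
SELECT d.d_year,
       i.i_brand,
       SUM(ss.ss_ext_sales_price) AS total_sales
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY d.d_year, i.i_brand
ORDER BY d.d_year, total_sales DESC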
 
AWS often uses TPC-DS for example to validate:
 
 
Data curation data generation 
Used the TPC-DS connector for AWS Glue to generate the TPC-DS store_sales table at scales:
0.1TB (~290 million rows, 21 GB) 
3TB (~8 billion rows, 440 GB) 
 
 
Used a PySpark job to simulate updates with an increasing proportion of rows updated: 0.1%, 1%, 10%, 99%
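The evaluation did this in PySpark; purely for illustration, an equivalent delta could be sketched in SQL along these lines (table names hypothetical, sampling ~1% of rows and modifying a value column):

-- Sketch: build a ~1% delta by sampling store_sales and changing a value column
CREATE TABLE staging.store_sales_delta AS
SELECT ss_item_sk,
       ss_ticket_number,
       ss_quantity + 1 AS ss_quantity  -- simulated change
FROM tpcds.store_sales
TABLESAMPLE BERNOULLI (1)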
 
By comparison, our largest table oasys_question:
Contains ~3 billion rows, 460 GB 
Receives up to ~2.5 million daily updates (0.08%) 
 
 
 
 
Bulk Insert comparison 
Athena (blue) is cheaper than PySpark (orange) at both scales 
PySpark is faster at larger scales (dashed square) 
 
 
MERGE and SCD2 logic 
MERGE (ANSI SQL:2003) combines UPDATE and INSERT:
MERGE INTO {object_name | subquery} [ [AS] alias ]
USING table_reference [ [AS] alias ]
ON search_condition
WHEN MATCHED
   THEN UPDATE SET column = { expression | DEFAULT }[, ...]
 WHEN NOT MATCHED
   THEN INSERT [( column[, ...] )] VALUES ( expression[, ...] )
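For SCD2 the MERGE source can be constructed so that each changed row both closes the current record and inserts a new one. A minimal sketch, assuming hypothetical curated.orders (target) and staging.orders_delta (deduplicated delta) tables with the columns from the earlier example:

MERGE INTO curated.orders AS t
USING (
    -- keyed rows: close an existing current record, or insert the first record for a new id
    SELECT d.id AS merge_key, d.id, d.status, d.updated_at
    FROM staging.orders_delta AS d
    UNION ALL
    -- NULL-keyed rows never match, so they insert the new current version for changed ids
    SELECT NULL AS merge_key, d.id, d.status, d.updated_at
    FROM staging.orders_delta AS d
    JOIN curated.orders AS c
      ON d.id = c.id AND c.is_current = true AND d.updated_at > c.updated_at
) AS s
ON t.id = s.merge_key AND t.is_current = true
WHEN MATCHED AND s.updated_at > t.updated_at
  THEN UPDATE SET valid_to = s.updated_at, is_current = false
WHEN NOT MATCHED
  THEN INSERT (id, status, updated_at, valid_from, valid_to, is_current)
       VALUES (s.id, s.status, s.updated_at, s.updated_at, NULL, true)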
 
 
SCD2 comparison - 100 GB 
Athena is consistently cheaper and faster than PySpark, by a massive margin 
However, Athena fails at the highest update proportions 
 
 
SCD2 comparison - 3 TB 
PySpark fails at all update proportions 
Athena passes at the lower update proportions, as per our use cases 
 
 
Data derivation evaluation 
We used the TPC-DS queries as a substitute for data derivation processes
Stats for TPC-DS queries against Iceberg tables relative to Hive: 
| Scale | Partitioned | File Size | Execution Time | Data Scanned |
|-------|-------------|-----------|----------------|--------------|
| 1GB   | No          | 0.7x      | 2.7x           | 1.5x         |
| 3TB   | Yes         | 0.03x     | 1.2x           | 0.9x         |
 
Note that the 3TB Hive dataset was optimised, unlike the Iceberg dataset, and we still obtained comparable performance.
 
The data engineering team has built various tools to support data analysis on the MoJ Analytical Platform.
We verified that these tools are compatible with Iceberg, including:
 
Considerations and Limitations 
Athena support for Iceberg tables has various limitations .
We came across the following limitations during the evaluation:
Athena supports only millisecond precision for timestamps so timestamp columns need to be cast as TIMESTAMP(6) 
Athena has a 100-partition limit with INSERT INTO, which applies to Iceberg tables as well. See here for a work-around 
Iceberg metadata is not fully integrated with the Glue Data Catalogue, for example:
dropped columns still appear 
partitioned columns are not flagged as partitioned 
 
 
 
 
 
Risks 
No time to investigate the impact of:
data skew on write-performance 
table width on write-performance 
simultaneously updating and querying a table on read-performance 
 
 
Replacing dependency on specialist Spark expertise with specialist Athena and Iceberg expertise
 
Athena might not be able to handle future volumes
 
 
 
Known Unknowns 
When is Glue PySpark preferred over Athena? 
How to best improve Athena query performance using sorting, partitions, file compaction, etc. 
What is the maximum volume capacity with these optimisations in place? 
How to best scale up for full refreshes in a disaster recovery scenario 
How to best integrate with dbt and create-a-derived-table  
How to best monitor code complexity and flag violations 
How to best publish Iceberg metadata not available in the Data Catalogue 
 
 
 
Contributors 
David Bridgwood 
Chris Foufoulides 
Gwion Aprhobat 
Khristiania Raihan 
Siva Bathina 
Soumaya Mauthoor 
Theodore Manassis 
William Orr
Acknowledgements 
Alex Vilela, Anis Khan, Calum Barnett
 
If we had had more time... 
Run Bulk Insert and SCD2 with Glue PySpark and Athena against Hive tables to estimate the performance gains from Iceberg
 
Run the TPC-DS queries in Spark SQL to compare performance against Athena
 
Terraform the codebase to allow collaborators to more easily reproduce the results
 
Investigate SCD2 failures to identify origin and improve understanding of Glue PySpark vs Athena
 
 
 
Apache Iceberg’s approach is to define the table through three categories of metadata. These categories are:
“metadata files” that define the table 
“manifest lists” that define a snapshot of the table 
“manifests” that define groups of data files that may be part of one or more snapshots 
 
source 
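A hedged example of inspecting this metadata from Athena (engine version 3), assuming a hypothetical table curated.orders:

-- Iceberg metadata tables exposed through Athena
SELECT * FROM "curated"."orders$files";      -- data files behind the current snapshot
SELECT * FROM "curated"."orders$manifests";  -- manifests referenced by the current snapshot
SELECT * FROM "curated"."orders$history";    -- snapshot history of the table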
 
Athena Resource limits 
When you submit a query, the Athena engine query planner estimates the compute capacity required to run the query and prepares a cluster of compute nodes accordingly.
Some queries like DDL queries run on only one node. Complex queries over large data sets run on much bigger clusters. The nodes are uniform, with the same memory, CPU, and disk configurations. Athena scales out, not up, to process more demanding queries.
Sometimes the demands of a query exceed the resources available to the cluster running the query. The resource most commonly exhausted is memory, but in rare cases it can also be disk space.
source 
 
Because of its size, a distributed dataset is usually stored in partitions, with each partition holding a group of rows. 
This also improves parallelism for operations like a map or filter. 
A shuffle is any operation over a dataset that requires redistributing data across its partitions. 
Examples include sorting and grouping by key.
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale.
Hive transforms HiveQL queries into MapReduce jobs that run on Apache Hadoop. 
It queries data stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon S3. 
Hive stores its database and table metadata in a metastore, 
which is a database or file backed store that enables easy data abstraction and discovery.
Advantages of Hive:
1. De-facto standard
2. Works with basically every engine
3. Can make use of partitions to speed up queries
4. File format agnostic
5. Can atomically update a partition
Problems with Hive:
1. Table scans unless you make efficient use of partitions
Metadata structures are used to define:
- What is the table?
- What is the table’s schema?
- How is the table partitioned?
- What data files make up the table?
Data Definition Language [DDL](https://en.wikipedia.org/wiki/Data_definition_language) queries. 
In the context of SQL, data definition or data description language (DDL) is a syntax for creating and modifying database objects such as tables, indices, and users. 
DDL statements are similar to a computer programming language for defining data structures, especially database schemas. 
Common examples of DDL statements include CREATE, ALTER, and DROP.
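For instance (generic SQL, hypothetical table name):

-- Common DDL statements: create, modify, and remove a table definition
CREATE TABLE reporting.daily_orders (order_day DATE, n_orders BIGINT);
ALTER TABLE reporting.daily_orders ADD COLUMN total_value DOUBLE;
DROP TABLE reporting.daily_orders;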
Trino can push down the processing of queries, or parts of queries, into the connected data source: https://trino.io/docs/current/optimizer/pushdown.html
Why dbt?
write custom business logic using SQL
automate data quality testing
deploy the code
publish documentation side-by-side with the code
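For example, data quality checks are plain SQL; a minimal sketch of a singular dbt test, assuming a hypothetical orders_history model, is:

-- tests/assert_one_current_row_per_id.sql (hypothetical singular dbt test)
-- the test fails if this query returns any rows,
-- i.e. if any id has more than one is_current = true record
select id
from {{ ref('orders_history') }}
where is_current = true
group by id
having count(*) > 1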
The relative execution time and data scanned vary depending on the scale and partitioning, but are within an acceptable range.