A data lakehouse is an open data architecture that combines the scalability and cost-effectiveness of data lakes with the reliability and performance of data warehouses on a single data platform.
Put simply, the data lakehouse is the only data architecture that lets you store all types of data in your data lake (unstructured, semi-structured, and structured) while maintaining the data quality and governance standards of a data warehouse.
One of the key pillars of the data lakehouse is the open format. The format in which data is stored is likely the most important decision to make when building a data lakehouse. It’s inspiring to think that merely changing the storage format can unlock new features and improve overall system performance.
Unfortunately, all of the comparisons available between Delta and Iceberg to assist us in making an informed decision are limited to features. That’s why we took the comparison to the performance level by simulating real-world scenarios using the TPC-DS benchmark.
What is TPC-DS?
TPC-DS is a data warehousing benchmark defined by the Transaction Processing Performance Council (TPC). TPC is a non-profit organization founded by the database community in the late 1980s with the goal of developing benchmarks that may be used objectively to test database system performance by simulating real-world scenarios. TPC has had a significant impact on the database industry.
“Decision support” is what the “DS” in TPC-DS stands for. There are 99 queries in total, ranging from simple aggregations to advanced pattern analysis.
In this benchmark we used Delta 1.0 and Iceberg 0.13.0 with the environment components listed in the table below:
We used the open-source TPC-DS benchmark published by Delta OSS and extended it to support Iceberg. We first recorded load performance, which is the time it takes to load data from Parquet format into Delta and Iceberg tables. We then recorded query performance. Each TPC-DS query was run three times, and the average running time of the three runs was used.
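The "run three times and average" measurement described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `average_runtime` and the `run_query` callable are assumptions introduced here.

```python
import time
from statistics import mean
from typing import Callable


def average_runtime(run_query: Callable[[], object], runs: int = 3) -> float:
    """Run `run_query` `runs` times and return the mean wall-clock time in seconds.

    `run_query` is a hypothetical zero-argument callable that executes one
    TPC-DS query end to end (including fetching its results).
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

With a Spark session, the callable could be something like `lambda: spark.sql(query_text).collect()`, so that the timing covers the full execution of the query rather than only its planning.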
1. Overall performance
Across loading and querying combined, the overall performance was in favour of Delta, which was 3.5X faster than Iceberg. It took 1.68 hours to load the data into Delta and run the TPC-DS queries, and 5.99 hours to do the same for Iceberg. [chart-1]
2. Load performance
When loading data from Parquet to our intended formats, Delta was 1.3X faster in overall performance than Iceberg.[chart-2]
To analyse the load performance further, we looked at the detailed load results for each table and noticed that the gap in load time widens as the table size grows. For example, when loading the customer table, Delta and Iceberg had practically the same performance. Meanwhile, for the store_sales table, which is one of the biggest tables in the TPC-DS benchmark, Delta was 1.5X faster than Iceberg.
This suggests that Delta is faster and scales better than Iceberg when loading data. [chart-3]
3. Query performance
When performing the TPC-DS queries, Delta was 4.5X faster in overall performance than Iceberg. It took 1.14 hours to perform all queries on Delta and it took 5.27 hours to do the same on Iceberg.[chart-4]
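As a sanity check, the ratios can be recomputed from the hours reported in the text. The sketch below assumes that the overall time is simply load time plus query time, so load time is derived by subtraction; that decomposition is an assumption, not something the benchmark report states. The unrounded ratios come out slightly above the quoted 3.5X and 4.5X, which suggests the quoted figures were rounded down.

```python
# Hours reported in the text.
total_hours = {"delta": 1.68, "iceberg": 5.99}  # load + queries
query_hours = {"delta": 1.14, "iceberg": 5.27}  # queries only

# Assumption: load time = total time - query time.
load_hours = {
    name: round(total_hours[name] - query_hours[name], 2) for name in total_hours
}


def speedup(slower: float, faster: float) -> float:
    """Return how many times faster the `faster` run is than the `slower` one."""
    return slower / faster


overall = speedup(total_hours["iceberg"], total_hours["delta"])
query = speedup(query_hours["iceberg"], query_hours["delta"])
load = speedup(load_hours["iceberg"], load_hours["delta"])

print(f"overall: {overall:.2f}X")  # overall: 3.57X
print(f"query:   {query:.2f}X")    # query:   4.62X
print(f"load:    {load:.2f}X")     # load:    1.33X
```

The derived load ratio of about 1.3X matches the load figure reported in the previous section, which supports the load-plus-query decomposition.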
Iceberg and Delta delivered approximately the same performance in query34, query41, query46 and query68. The difference in those queries was less than 1 second.
However, Delta was faster than Iceberg in all the remaining TPC-DS queries by different margins.
In some queries, such as query72, Delta was 66X faster than Iceberg.
In the other queries, Delta was faster than Iceberg by a factor ranging from about 1.1X to 24X. [chart-5]
In this benchmark, Delta outperformed Iceberg in both scalability and performance, by margins we did not expect. For us and our customers, the results gave a clear answer about which solution to opt for when building a lakehouse.
It is also important to state that both Iceberg and Delta will keep improving. As they do, we will keep an eye on their performance and share the results with the wider community.
To analyse the results further and draw your own insights from this benchmark, you can download the full benchmark reports here.