A Short Introduction to Apache Iceberg

Gautam Goswami
4 min readMar 31, 2022

A table can be defined as an arrangement of data in rows and columns and in a similar fashion, if you visualize from the Big Data perspective, the large number of individual files that hold the actual data can be organized in a tabular manner too.

We are already familiar with Apache Hive that works as a data warehouse system to query and analyze large datasets stored in the HDFS (Hadoop Distributed File System) or Amazon S3. Also an integral part of the big data ecosystem. Hive is a simple directory-based design where actual data files are getting stored at the folder/directory level in HDFS. If interested, you can read here how to import data into Hive tables. Hive keeps track of data at the folder level not in actual data files. Because of the directory-based model in Hive, listings are much slower, renames are not atomic, and results are eventually consistent. To work with data in a table, Hive needs to perform file list operations and this causes a performance bottleneck while executing SQL queries. Apache Iceberg is a new table format for storing large, slow-moving tabular data and can improve on the more standard table layout built into Hive, Trino, and Spark

The giant OTT platform Netflix originally developed Iceberg to decode their established issues related to managing/storing huge volumes of data in tables probably in petabyte-scales. Later in 2018, Iceberg was open-sourced as an Apache Incubator project.

The main aim to designed and developed Iceberg was basically to address the data consistency and performance issues that Hive having. Below we can see few major issues that Hive holding as said above and how resolved by Apache Iceberg.

– Schema Evolution

In nutshell, Schema evolution permits us to update the schema used to write new data while maintaining backward compatibility with the schemas of our old data. To make schema evolution support in Hive, actual data files need to be modified or rewritten. As an example, if we want to handle schema changes/evolution in Hive ORC tables like column deletions occurring at source database which is MySQL by leveraging Flume to import data, here are few major steps we need to follow like

Taking backup of old schema file -> Move the New AVSC Schema File to HDFS -> Create AVRO table with new location set and scheme location set ->Verify the data in AVRO after Schema changes in MySQL ->Take a Backup of the current ORC table and Drop the Original ORC table -> Create ORC table with a new location set -> Insert the data into the ORC table and eventually verify the ORC table after the schema changes. Then Continue the Incremental loads from the Next day onwards with the new target directory which was created after the schema changes.

But Apache Iceberg schema updates are metadata changes only and because of that, no data files need to be rewritten to perform the update.

Using unique IDs, Iceberg tracks each column in a table. While we add a new column, a new ID would assign to it to avoid any existing data usage by mistake.

Following schema evolution changes are currently supporting by Apache Iceberg

  • Add — add a new column to the table or to a nested struct
  • Drop — remove an existing column from the table or a nested struct
  • Rename — rename an existing column or field in a nested struct
  • Update — widen the type of a column, struct field, map key, map value, or list element
  • Reorder — change the order of columns or fields in a nested struct

To ensure schema evolution changes are unfettered and free of side-effects as well as without rewriting files, Apache Iceberg never read existing values from another column while adding a new column. Similarly for dropping or updating a column or field, Iceberg does not change the values in any other column.

– Partition Evolution

In Apache Hive partitioning can be done by dividing a table into related groups based on the values of a particular column like date, city, country, etc, Partitioning reduces the query response time in Apache Hive as data is stored in horizontal slices. In Hive partitioning, partitions are explicit and appear as a column and must be given partition values. Due to this approach, Hive having several issues like can’t validate partition values so fully dependent on the writer to produce the correct value, 100% dependent on the user to write queries correctly, Working queries are tightly coupled with the table’s partitioning scheme, so partitioning configuration cannot be changed without breaking queries, etc.

Apache Iceberg introduces the concept of hidden partitioning where the reading of unnecessary partitions can be avoided automatically. Data consumers that fire the queries don’t need to know how the table is partitioned and add extra filters to their queries. Iceberg partition layouts can evolve as needed. Iceberg can hide partitioning because it does not require user-maintained partition columns. Iceberg produces partition values by taking a column value and optionally transforming it.

Apache Iceberg is used in production where a single table can contain tens of petabytes of data and can read these huge tables without leveraging distributed SQL engine. It was developed for gigantic tables. By using a set of Java API that Iceberg produces, we can manage table metadata, like schema, partition spec, metadata, and data files that store table data.

Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.

Reference:- https://iceberg.apache.org

--

--

Gautam Goswami

Enthusiastic about learning /sharing gyan on Big Data & related headways. Presently Engineering & Data Streaming head @ www.irisidea.com. Crafted dataview.in