Apache Iceberg Catalogs Explained: Types, Comparisons, and How to Choose the Right Catalog

In modern lakehouse architectures, data lives in object storage, tables use open formats like Apache Iceberg, and multiple compute engines query the same datasets. Even with storage, format, and engines in place, one layer remains essential: the catalog.

A catalog manages Iceberg table metadata, including schema, partitioning, snapshots, and data file locations. It acts as the control plane that enables interoperability, governance, and scalability.

This blog covers:

  • What Iceberg catalogs are and why they matter

  • The main types of catalogs and their roles

  • A catalog-by-catalog breakdown (use case, advantages, disadvantages, reasons not to choose)

  • How RisingWave uses Iceberg catalogs for its internal and external Iceberg tables

  • Examples of using different catalogs in RisingWave

What Is an Iceberg Catalog?

Iceberg separates data (the files in object storage) from metadata (the information that describes those files and the table’s state). That metadata forms a clear chain:

  • Data files hold the actual rows.

  • Manifest files track which data files belong to the table.

  • A manifest list groups manifests for a specific snapshot.

  • A metadata file records the table’s full state (schema, partition specs, snapshots, properties) and points to the relevant manifest list.

  • The catalog stores the table’s identity and a pointer to the current metadata file.

Apache Iceberg table architecture.

When an engine writes to an Iceberg table, it creates new data files, records them in manifest files, groups those into a manifest list, and writes a new metadata file describing the updated table state. The final step is updating the catalog so the new metadata file becomes the table’s current pointer, making the new snapshot discoverable to every engine.

Apache Iceberg Write Path.
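The final catalog step is what makes Iceberg commits atomic: the catalog swaps the table's current-metadata pointer only if no other writer committed in the meantime (a compare-and-swap). The following toy sketch illustrates that idea only; it is not a real catalog client, and all names and paths are illustrative.

```python
# Toy illustration of the commit step an Iceberg catalog provides:
# an atomic compare-and-swap of the table's current metadata pointer.
class ToyCatalog:
    def __init__(self):
        self._pointers = {}  # table name -> current metadata file path

    def current_metadata(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer only if no one else committed first."""
        if self._pointers.get(table) != expected:
            raise RuntimeError("concurrent commit detected; retry from new base")
        self._pointers[table] = new

catalog = ToyCatalog()
catalog.commit("sales.orders", None, "s3://bucket/metadata/v1.json")

# A writer reads the current pointer, writes new metadata, then swaps atomically.
base = catalog.current_metadata("sales.orders")
catalog.commit("sales.orders", base, "s3://bucket/metadata/v2.json")
print(catalog.current_metadata("sales.orders"))  # s3://bucket/metadata/v2.json
```

If a second writer tries to commit against the stale `v1` pointer, the swap fails and that writer must retry from the new base, which is how real catalogs provide isolation between concurrent writers.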

When an engine reads an Iceberg table, it follows the same chain in reverse. It starts at the catalog to find the table’s current metadata file, uses that to locate the appropriate manifest list for the snapshot being queried, then consults manifest files to determine exactly which data files match the query (via partition and predicate pruning). Only those data files are read from object storage.

Apache Iceberg Read Path.
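To make the pruning step concrete, here is a minimal sketch of the planning logic in plain Python. The dictionaries stand in for Iceberg's real metadata, manifest-list, and manifest files; only files whose partition matches the predicate survive planning.

```python
# Toy read path: resolve snapshot -> manifests -> data files,
# pruning by partition value. Structures are illustrative stand-ins.
metadata = {
    "current_snapshot": "snap-2",
    "snapshots": {
        "snap-2": {"manifest_list": ["m1", "m2"]},
    },
}
manifests = {
    "m1": [{"file": "s3://b/d1.parquet", "partition": {"day": "2024-01-01"}}],
    "m2": [{"file": "s3://b/d2.parquet", "partition": {"day": "2024-01-02"}}],
}

def plan_files(metadata, manifests, day):
    """Return only the data files whose partition matches the predicate."""
    snap = metadata["snapshots"][metadata["current_snapshot"]]
    return [
        entry["file"]
        for m in snap["manifest_list"]
        for entry in manifests[m]
        if entry["partition"]["day"] == day
    ]

print(plan_files(metadata, manifests, "2024-01-02"))  # ['s3://b/d2.parquet']
```

A query filtering on `day = '2024-01-02'` never touches `d1.parquet`; this is why planning against structured metadata is so much faster than listing directories.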

An Iceberg catalog, then, is the system that helps engines locate and manage tables by storing and organizing their metadata. When a query engine needs to interact with a table, the catalog tells it:

  • Where the table is located

  • What schema it has

  • What snapshots exist

  • What partitioning strategy is used

  • What data files belong to the table

Catalogs are the gateway to your data. In the lakehouse world:

  • Data = object storage (S3, GCS, Azure Blob)

  • Format = Iceberg

  • Engines = RisingWave, Spark, Trino, Snowflake, etc.

  • Catalog = metadata control plane

A Brief History: Why Catalogs Matter

As internet usage and data volumes grew, vertically scalable databases such as Oracle and MySQL gave way to massively parallel systems like Greenplum and Teradata.

Then came Hadoop, which democratized big data by combining:

  • Distributed storage (HDFS)

  • Distributed processing (MapReduce)

  • Cluster resource management (YARN)

In 2010, Apache Hive introduced a SQL layer on top of Hadoop. Hive made it possible to query large datasets in HDFS using familiar SQL instead of Java or Scala MapReduce jobs. Hive’s architecture consisted of two core services:

  • A Query Engine for executing SQL

  • A Metastore (Hive Metastore, HMS) for virtualizing data stored in HDFS as tables

The Hive Metastore became the central metadata repository for data lakes. It stored:

  • Table definitions

  • Schemas

  • Partition information

  • Column types

  • File locations in HDFS or object storage

Over time, additional technologies such as Avro (schema evolution improvements) and Parquet (columnar storage for better performance) helped establish the modern data lake architecture. Because nearly every engine integrated with Hive Metastore, it became the de facto metadata layer of the data lake.

Problems in Traditional Hive-Based Data Lakes

Despite its importance, Hive Metastore and traditional data lakes had structural limitations:

No Transactional Guarantees

Lack of ACID compliance caused inconsistencies during concurrent writes.

Poor Performance at Scale

Directory-based metadata and file listing operations led to slow query planning on large datasets.

Unpredictable Schema Changes

Schema updates often required full table rewrites or complex coordination.

Manual Partitioning

Partition management was error-prone and required operational discipline.

Operational Complexity

Hive Metastore was not designed for the cloud, making it difficult to operate in modern cloud-native environments.

How Apache Iceberg Solves These Issues

Apache Iceberg was designed to address these architectural weaknesses:

ACID Transactions

Ensures data integrity with atomic commits and isolation.

Advanced Metadata and Indexing

Replaces directory-based metadata with structured table metadata for faster planning and execution.

Safe Schema Evolution

Allows schema changes without rewriting existing data.

Hidden Partitioning

Automates partition management and reduces human error.

The Lakehouse Era

Today, in the Lakehouse era:

  • Data lives in object stores (S3, GCS, Azure Blob)

  • Format is an open table format like Iceberg

  • Engines include Spark, Trino, Snowflake, RisingWave, and others

However, even with object storage, open formats, and modern engines in place, one essential layer remains:

You still need a catalog so engines can find and access Iceberg tables.

And that is where modern Iceberg catalogs come in.

Types of Catalogs

Service-Based Catalogs (example: Hive Metastore)

Service-based catalogs operate as centralized services where metadata is stored in a relational database and accessed through an API. They provide a single control point for managing table definitions and schemas, but they can face scalability and maintainability challenges as deployments grow.

File-Based Catalogs (example: Hadoop Catalog)

File-based catalogs store metadata alongside data files in the same storage system. They do not require an always-on service and are decentralized by design. However, they often introduce performance tuning and coordination challenges, especially in multi-user or large-scale environments.

REST Catalogs (example: Lakekeeper, Polaris, Nessie)

REST catalogs store metadata in files but expose access through a standardized REST API. They are based on an open specification, support concurrent operations, and allow pluggable backends. This model enables interoperability across engines while maintaining centralized metadata access.

Catalog-by-Catalog Breakdown

Each section below covers the typical use case, advantages, disadvantages, and reasons you might not choose that catalog.

Hadoop Catalog

Use Case

The Hadoop Catalog is designed for local or on-premise Iceberg storage environments backed by HDFS or S3. It is most appropriate for small deployments where simplicity is the primary goal.

Advantages

One of its key strengths is that it does not require an external Metastore. Metadata is stored alongside the data files within the same file system or object store. This makes setup straightforward and eliminates the need to operate an always-on service. Its decentralized nature makes it simple to deploy in smaller environments.

Disadvantages

The Hadoop Catalog does not support multi-user concurrent writes. This significantly limits its suitability for collaborative or production-scale workloads. It is also unsuitable for large-scale clusters where distributed coordination is required.

Reasons Not to Choose

It is not scalable for large cloud environments and lacks multi-user support, making it inappropriate for enterprise-scale or multi-engine lakehouse architectures.

Hive Catalog

Use Case

The Hive Catalog is intended for traditional Hadoop data lake environments where Iceberg is integrated with Hive and the Hive Metastore (HMS).

Advantages

It is compatible with Hive Metastore and allows seamless integration with query engines that rely on HMS. For organizations already operating Hive infrastructure, this provides familiarity and straightforward adoption.

Disadvantages

The Hive Catalog requires Hive Metastore maintenance, which introduces operational overhead. It has limited scalability and is not suitable for cloud-native architectures that demand elasticity and serverless operation.

Reasons Not to Choose

It requires manual HMS management and is not ideal for multi-cluster sharing or serverless deployments, limiting its appeal in modern cloud environments.

RisingWave Support

Supported in RisingWave through Hive Metastore integration for Iceberg sources, sinks, and internal tables.
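As a hedged sketch, a Hive-cataloged Iceberg table can be read with a source like the following, using the same connector option style as the examples later in this post (the metastore URI, warehouse path, and table names are placeholders, and exact options may vary by RisingWave version):

CREATE SOURCE hive_orders_src WITH (
  connector='iceberg',
  catalog.type='hive',
  catalog.uri='thrift://hive-metastore:9083',
  warehouse.path='s3://lakehouse/warehouse',
  database.name='sales',
  table.name='orders',
  s3.region='us-west-2'
);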

AWS Glue Catalog

Use Case

AWS Glue is designed for AWS-based data lakes, particularly those using EMR and Athena.

Advantages

It provides native AWS support and removes the need to manage a separate Metastore. It integrates directly with AWS Athena, making it convenient within AWS-centric ecosystems.

Disadvantages

It is AWS-only and offers no cross-cloud interoperability, and it can introduce high query complexity.

Reasons Not to Choose

It is not suitable for on-premise environments and offers limited cross-cluster support outside AWS.

RisingWave Support

Fully supported in RisingWave for Iceberg sources, sinks, and internal tables using the AWS Glue catalog (catalog.type = 'glue').

DynamoDB Catalog

Use Case

The DynamoDB Catalog supports AWS serverless environments such as Lambda and Fargate.

Advantages

It provides a fully serverless architecture with low maintenance requirements. This aligns well with AWS serverless workloads and reduces operational burden.

Disadvantages

It is AWS-only, incompatible with Athena, and query performance depends on DynamoDB characteristics.

Reasons Not to Choose

It is not suitable for multi-cluster Iceberg usage and lacks broad query engine support, limiting interoperability.

JDBC Catalog

Use Case

The JDBC Catalog uses existing relational databases such as PostgreSQL or MySQL to store metadata.

Advantages

Metadata is stored in familiar relational systems, making it easy to manage with well-known operational practices.

Disadvantages

The database can become a single point of failure. Query performance limitations may arise, especially at larger scales.

Reasons Not to Choose

It is not scalable for large data lakes and is not ideal for AWS Athena or cross-cluster sharing.

RisingWave Support

Fully supported in RisingWave via catalog.type = 'jdbc' for Iceberg sources, sinks, and internal tables.
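For illustration, a JDBC-cataloged source might look like the following sketch; the JDBC URI, credentials, and catalog name are placeholders, and the option names follow RisingWave's documented Iceberg connector options (verify against your version):

CREATE SOURCE jdbc_orders_src WITH (
  connector='iceberg',
  catalog.type='jdbc',
  catalog.uri='jdbc:postgresql://catalog-db:5432/iceberg',
  catalog.jdbc.user='iceberg_user',
  catalog.jdbc.password='iceberg_pass',
  catalog.name='jdbc_catalog',
  warehouse.path='s3://lakehouse/warehouse',
  database.name='sales',
  table.name='orders',
  s3.region='us-west-2'
);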

REST Catalog (Iceberg REST Catalog)

Use Case

A REST catalog uses a RESTful API to manage Iceberg table metadata. It is intended for multi-cluster, multi-engine sharing with standardized metadata access.

Advantages

It is ideal for cloud-native architectures. Multiple compute engines can share Iceberg tables, and it enables interoperability across languages such as Python, Java, and Rust.

Disadvantages

It requires deploying and maintaining an additional REST server, increasing operational overhead.

Reasons Not to Choose

It is not usable with AWS Athena and is better suited for advanced users managing cross-cluster setups.

RisingWave Support

Fully supported in RisingWave via the REST catalog (catalog.type = 'rest'), including credential vending support, when creating an Iceberg SOURCE, SINK, or CONNECTION for internal Iceberg tables.

Nessie

Use Case

Nessie provides Git-like version control for data and supports multi-table transaction use cases.

Advantages

It enables branching and tagging (“data as code”) and supports atomic multi-table and multi-statement transactions, allowing versioned workflows.

Disadvantages

It requires self-hosting and maintenance and has limited integration with popular tools.

Reasons Not to Choose

It introduces additional infrastructure complexity and has a smaller ecosystem compared to Glue or Hive.

Apache Polaris

Use Case

Apache Polaris is designed for centralized, cloud-native Iceberg metadata management across engines and clouds.

Advantages

It is built on the Iceberg REST Catalog API as an open standard. It is cloud-neutral (supporting S3, Azure, and GCS), offers native RBAC and credential vending, fully supports schema and partition evolution and time travel, and enables multi-engine interoperability.

Disadvantages

It is a new project that is incubating and evolving rapidly. It requires setup and operational understanding.

Reasons Not to Choose

It introduces overhead for simple, single-engine deployments and requires additional setup for credential vending and governance.

Lakekeeper

Use Case

Lakekeeper is a Rust-native Iceberg REST Catalog with built-in governance and extensibility.

Advantages

It offers single-binary deployment, with no JVM or Python required. It integrates with OpenFGA and OPA for policy systems and supports OIDC authentication. It emits change events via Kafka or NATS. It is Kubernetes-ready with Helm support and is horizontally scalable with autoscaling.

Disadvantages

It is still maturing in enterprise adoption and may require Rust ecosystem familiarity for extensibility.

Reasons Not to Choose

It is earlier-stage compared to vendor-backed solutions and may require setup effort for advanced governance features.

RisingWave Support

Supported in RisingWave as a self-hosted REST catalog service, including credential vending. A Lakekeeper catalog can be used when creating Iceberg sources, sinks, and internal tables.

Apache Gravitino

Use Case

Apache Gravitino is intended for geo-distributed metadata lakes and unified Data + AI asset governance.

Advantages

It provides a single source of truth for multi-regional, federated metadata. It enables unified metadata access for both data and AI assets, centralized access and security control, and compatibility across multiple sources, regions, and formats.

Disadvantages

It requires managing Gravitino infrastructure and is still growing in ecosystem support.

Reasons Not to Choose

It may be too complex for small or monolithic deployments and requires understanding federated metadata systems.

Snowflake

Use Case

Snowflake supports managing Iceberg tables on external cloud storage within Snowflake.

Advantages

It provides high performance with Iceberg integration, supports ACID, schema evolution, and time travel, and has minimal operational overhead.

Disadvantages

The cloud and region must match the Snowflake account. It only supports Parquet and does not allow external modification by third-party clients.

Reasons Not to Choose

It is limited to Snowflake-compatible cloud setups and is not ideal if you rely on external tools or formats beyond Parquet.

RisingWave Support

Supported in RisingWave for creating Iceberg sources (read-only integration).

Unity Catalog

Use Case

Unity Catalog is a centralized governance layer for the Databricks lakehouse, managing data and AI assets across diverse sources such as data lakes and warehouses.

Advantages

Unity Catalog centralizes access control for assets (via the catalog or cloud paths), supports many asset types (tables, ML models), and enables zero-copy external sharing to cut storage and sync overhead. It also provides discovery, lifecycle visibility, and lineage via a layered, relational ER-backed architecture. It runs as a multi-tenant REST service decoupled from the query engine, using events to keep search/lineage up to date. Performance comes from batched requests, immutable caching, write-through caching for mutable metadata, sharded relational storage, LRU/timeout eviction, and snapshot + serializable isolation.

Disadvantages

Unity Catalog is tightly integrated with the Databricks runtime architecture. Its system design includes multiple layers such as ER modeling, adapter layers, credential management, background services, and caching strategies, making it architecturally complex.

Reasons Not to Choose

It may be less suitable outside the Databricks ecosystem and can be heavier compared to Iceberg-only catalog implementations.

RisingWave Support

Supported in RisingWave for writing to Databricks Unity Catalog-managed Iceberg tables via CREATE SINK (append-only).

RisingWave and Iceberg Interaction Model

Now, let’s discuss how RisingWave supports working with Iceberg and the catalog layer. RisingWave supports two modes: internally managed Iceberg tables and externally managed Iceberg tables.

Internally Managed Iceberg Tables

Internally managed tables are created in RisingWave using ENGINE = iceberg. RisingWave manages the table lifecycle (catalog, ingestion, compaction) using either the built-in catalog or a self-hosted REST catalog, while remaining compatible with external query engines via standard Iceberg protocols. This mode is ideal for Medallion-style architectures.

SQL example (Internal table — Built-in Catalog):

CREATE CONNECTION built_in_catalog_conn WITH (
    type='iceberg',
    warehouse.path='s3://my-bucket/warehouse/',
    s3.region='us-west-2',
    s3.access.key='your-key',
    s3.secret.key='your-secret',
    hosted_catalog=true
);

SET iceberg_engine_connection='public.built_in_catalog_conn';

CREATE TABLE my_iceberg_table (
    id INT PRIMARY KEY,
    name VARCHAR
) ENGINE = iceberg;

Externally Managed Iceberg Tables

Externally managed tables are managed outside RisingWave (for example, via Glue, Unity Catalog, S3 Tables, etc.). In this mode, RisingWave integrates by reading via CREATE SOURCE and writing via CREATE SINK, with exactly-once delivery, optional compaction, and support for both append-only and mutable workloads.

SQL example (External table — Glue source):

CREATE SOURCE orders_src WITH (
  connector='iceberg',
  warehouse.path='s3://lakehouse/warehouse',
  database.name='sales',
  table.name='orders',
  catalog.type='glue',
  s3.region='us-west-2'
);
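Between source and sink, a materialized view typically does the continuous transformation. A sketch of the `mv_daily_sales` view referenced by the sink example below might look like this (the `order_date` and `amount` columns are hypothetical):

CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders_src
GROUP BY order_date;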

SQL example (External table — REST sink, Lakekeeper-style):

CREATE SINK daily_sales_sink
FROM mv_daily_sales
WITH (
  connector='iceberg',
  type='append-only',
  warehouse.path='s3://lakehouse/warehouse',
  database.name='sales',
  table.name='daily_sales',
  catalog.type='rest',
  catalog.uri='http://lakekeeper:8181',
  s3.region='us-west-2'
);

Final Thoughts

Catalogs are the metadata backbone of the lakehouse, and they aren’t interchangeable. Each catalog represents trade-offs across:

  • Simplicity vs. scalability

  • Governance depth vs. operational overhead

  • Cloud-native design vs. ecosystem lock-in

  • Multi-engine interoperability vs. platform-specific integration

Choosing the right catalog depends on your infrastructure, governance requirements, and architectural direction. Getting this decision right directly impacts:

  • Governance

  • Scalability

  • Interoperability

  • Operational overhead

As discussed above, RisingWave supports multiple catalogs for creating Apache Iceberg sources, sinks, and internal tables. It ships a built-in catalog that uses RisingWave's metastore as a JDBC-compliant catalog, works with self-hosted REST catalogs such as Lakekeeper, and integrates with external catalogs including AWS Glue, Hive Metastore, JDBC, REST, Snowflake, Databricks Unity Catalog, and Amazon S3 Tables. Together, these cover internal Iceberg tables, external Iceberg ingestion and delivery, and exactly-once streaming pipelines.

With a clear understanding of the catalog ecosystem and RisingWave’s interaction model, you can design a lakehouse architecture that is open, interoperable, real-time, governed, and scalable.
