DuckDB Labs released DuckLake, a data lake format that stores catalog metadata in SQL tables rather than scattered JSON files. The approach simplifies querying, improves performance on small updates, and maintains Iceberg compatibility for existing workflows.
Traditional data lake formats - Iceberg, Delta Lake, Hudi - store metadata as files in object storage. Every query starts by reading these files to figure out which data files to access. That works fine for large batch jobs, but it turns into real overhead when you're making frequent small updates or running many concurrent queries.
DuckLake moves metadata into a SQL database. Want to know which partitions contain data for a specific date range? Query the catalog. Need to find the latest version of a table? SQL query. Looking for files modified in the last hour? Another SQL query. No file scanning, no JSON parsing, no building an in-memory catalog before you can start the actual work.
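Here's a rough sketch of what that looks like from Python with the ducklake extension. The attach string, paths, and table are placeholders rather than anything from the DuckLake docs; the point is that after writing data, the metadata database can be opened like any other SQL database.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# The catalog is a small SQL database; data files land under DATA_PATH.
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/')")

con.execute("CREATE TABLE lake.events (ts TIMESTAMP, user_id BIGINT, action TEXT)")
con.execute("INSERT INTO lake.events VALUES (TIMESTAMP '2025-06-01 12:00:00', 42, 'login')")
con.close()

# Because the catalog is ordinary SQL tables, metadata questions are plain
# SQL queries against the metadata database (here a local DuckDB file).
meta = duckdb.connect("metadata.ducklake")
print(meta.execute("SHOW TABLES").fetchall())
```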
Why Metadata Storage Matters
Data lakes excel at storing massive datasets: object storage is cheap, durable, and scales horizontally. The problem is coordination - how do multiple writers avoid corrupting each other's data? How do readers find the files they need without scanning terabytes of storage?
Existing formats solve this with metadata files that track which data files exist, what schema they use, which partitions they contain, and which version of the table they belong to. These metadata files become the catalog - the index that makes the data lake queryable.
But metadata files have their own problems. They're eventually consistent - different readers might see different versions during updates. They accumulate over time - a table with frequent updates can have thousands of small metadata files. They require scanning and parsing before you can start actual query execution.
SQL databases already solve coordination and consistency. They handle concurrent writes, maintain indexes, enforce constraints, and provide transactional guarantees. DuckLake uses that existing infrastructure for metadata rather than reinventing it with files.
Practical Implications
The most immediate benefit is lower query latency. DuckDB's implementation avoids metadata file scanning entirely. A query that touches one partition in a table with ten thousand partitions only reads the catalog entries for that partition. The rest stays untouched.
Small updates improve significantly. Adding a single new row to a table doesn't require writing a new metadata file and updating a chain of pointers. It's a SQL insert that updates the catalog. Concurrent writers don't coordinate through object storage semantics - they use database transactions.
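To make that concrete, here's a minimal sketch (again with placeholder paths and a placeholder table) of a single-row change going through an ordinary database transaction rather than a metadata-file commit.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/')")
con.execute("CREATE TABLE IF NOT EXISTS lake.events (ts TIMESTAMP, user_id BIGINT, action TEXT)")

# A single-row change: the new data file is written once, and the catalog
# update is a handful of row inserts in the metadata database instead of a
# new metadata file plus a chain of pointer rewrites.
con.execute("BEGIN")
con.execute("INSERT INTO lake.events VALUES (TIMESTAMP '2025-06-01 12:05:00', 7, 'page_view')")
con.execute("COMMIT")

# Concurrent writers coordinate through these transactions, so a conflicting
# commit fails cleanly instead of leaving half-written metadata behind.
```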
Partitioning becomes more flexible. Traditional formats require committing to a partitioning scheme upfront. Change the scheme and you're rewriting metadata for the entire table. DuckLake stores partition information in SQL, making it easier to add new partition columns or change granularity without full rewrites.
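The same idea applies to partition evolution. The statement below, continuing with the events table from the sketches above, is an assumption about DuckLake's SQL surface rather than a quote from its documentation, but it captures the shape of the change - a catalog update, not a table rewrite.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/')")

# Partitioning lives in catalog rows, so changing the scheme is a metadata
# update: existing files keep their old layout, new writes follow the new one.
# (Assumed syntax for setting a partition key on an existing table.)
con.execute("ALTER TABLE lake.events SET PARTITIONED BY (action)")
```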
Iceberg Compatibility
DuckLake maintains compatibility with Apache Iceberg's table format. That means existing tools that read Iceberg tables can read DuckLake tables. The data files use the same structure. The schema evolution rules work the same way. The time travel and snapshot features are equivalent.
The difference is internal. DuckLake can read metadata from SQL catalogs or from Iceberg's file-based catalogs. Write an Iceberg table through DuckDB and it updates both representations. Query a DuckLake-native table through an Iceberg-compatible tool and it works - the tool reads the metadata from files while DuckDB reads from SQL.
This compatibility matters for migration. You don't have to rebuild your entire data lake to try DuckLake. You point DuckDB at existing Iceberg tables and start using SQL catalog features where they help, then gradually migrate tables to the native DuckLake format as the performance or management benefits justify the work.
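In practice that first step can be as small as scanning an existing Iceberg table through DuckDB's iceberg extension. The iceberg_scan table function is real; the path is a placeholder you'd swap for your own table location.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Query an existing Iceberg table in place; no rewrite of the lake needed.
# For tables in S3 you would also load httpfs and configure credentials.
rows = con.execute("""
    SELECT count(*) AS n
    FROM iceberg_scan('path/to/iceberg_table')
""").fetchall()
print(rows)
```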
Where This Fits
DuckLake isn't replacing all data lake formats. It's optimised for workloads that DuckDB targets - analytical queries on datasets that fit on one machine, frequent small updates, interactive query latency expectations.
If you're running massive Spark jobs across petabytes of data in S3, the overhead of file-based metadata is negligible compared to data processing time. But if you're running a real-time analytics dashboard that queries the last hour of data every few seconds, metadata latency becomes the bottleneck.
The SQL catalog approach also assumes you have a database available. That's not always true in cloud-native architectures where object storage is the only shared state. DuckDB's typical deployment - analytical queries in application servers or data science notebooks - usually has database access. The catalog can live in Postgres, SQLite, or even DuckDB itself.
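As a sketch of what that choice looks like, the attach strings below follow the announced 'ducklake:' syntax for different catalog backends; treat the exact connection strings, hostnames, and bucket paths as illustrative rather than authoritative.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Catalog in a single DuckDB file: fine for notebooks and single-node work.
con.execute("ATTACH 'ducklake:catalog.ducklake' AS lake_local (DATA_PATH 'lake_data/')")

# Catalog in Postgres: shared state for multiple concurrent writers.
# (Commented out so the sketch runs without a database server.)
# con.execute(
#     "ATTACH 'ducklake:postgres:dbname=ducklake host=db.internal' "
#     "AS lake_shared (DATA_PATH 's3://my-bucket/lake/')"
# )
```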
For teams already using DuckDB, this is a clear win: better performance on common workloads, a simpler mental model for metadata management, and backward compatibility with existing formats. No migration is required - just enable the SQL catalog and see whether query performance improves.
For teams not using DuckDB, it's a data point in the evolution of data lake formats. The file-based metadata approach worked when data lakes were primarily batch processing systems with infrequent updates. As more workloads shift toward streaming ingestion and interactive queries, metadata management needs to evolve.
SQL catalogs are one answer. Other formats are exploring different solutions - more efficient metadata file formats, better caching strategies, hybrid approaches that use both files and databases. The goal is the same - make data lakes work better for modern query patterns without sacrificing the scalability and cost benefits of object storage.
DuckLake's contribution is showing that SQL metadata works in practice and integrates cleanly with existing formats. That opens design space for other systems to explore similar approaches.