Data | November 7, 2024 | 3 min read

Lakehouse: a storage architecture without choosing the lesser evil

What the lakehouse approach is and when it solves the real problem of choosing between a data warehouse and a data lake.

When a company starts building analytics infrastructure, the same choice eventually appears: data warehouse or data lake. The first is structured, manageable, expensive to change. The second is flexible, cheap to store in, but prone to becoming a swamp without strict discipline.

Both options carry compromises. The lakehouse is an attempt to remove this choice by taking the best of both approaches. In 2024, it moved from a concept to a widely available technology. But as with any architectural idea, it is important to understand what it actually solves and when it is needed.

Where the tension between warehouse and lake appears

A data warehouse works well with well-structured data and clear analytical queries. It is reliable, manageable, and delivers stable query performance. But it is expensive for unstructured data, adapts slowly to schema changes, and makes it costly to keep "raw" data around for future use cases.

A data lake stores data in its original formats and is cheap and flexible. But without governance it becomes chaos: nobody knows what is where, data is duplicated, schemas are inconsistent, and analytical query performance suffers.

In practice, many companies ended up building both: a lake for storage, a warehouse for analytics, and ETL between them. This doubles the infrastructure, doubles the cost, doubles the points of failure.

What a lakehouse is

A lakehouse combines storing data in open formats (like a lake) with management capabilities, ACID transactions, and analytical query performance (like a warehouse). The key technologies, open table formats such as Delta Lake, Apache Iceberg, and Apache Hudi, make this possible on top of ordinary cloud storage.

In practice this means: data is stored once, in an open format, but structured operations with transactional guarantees can be applied to it. An analytics engine, an ML framework, and a streaming processor can work with the same data without copying.

Key capabilities this opens: time travel (you can query the table as of any earlier version or timestamp that is still retained), rollback of changes, and schema evolution without rebuilding the table.
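To make time travel and rollback less abstract, here is a minimal sketch of the idea behind log-based table formats: data files are immutable, and every commit appends a new snapshot that lists which files make up the table. This is a toy model under my own assumptions, not the API of Delta Lake, Iceberg, or Hudi; `ToyTable`, `Snapshot`, and the file names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One committed table state: an immutable list of data file names."""
    version: int
    files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical toy model of a log-based table, not a real library API."""
    log: list[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    @property
    def current(self) -> Snapshot:
        return self.log[-1]

    def commit(self, expected_version: int, files: tuple[str, ...]) -> Snapshot:
        # Optimistic concurrency: reject the commit if another writer has
        # advanced the table since this writer read it.
        if expected_version != self.current.version:
            raise RuntimeError("conflict: table changed since it was read")
        snapshot = Snapshot(self.current.version + 1, files)
        self.log.append(snapshot)  # the single atomic step in this toy model
        return snapshot

    def append(self, new_file: str) -> Snapshot:
        base = self.current
        return self.commit(base.version, base.files + (new_file,))

    def read(self, version: Optional[int] = None) -> tuple[str, ...]:
        # Time travel: any snapshot still present in the log can be read.
        if version is None:
            return self.current.files
        for snapshot in self.log:
            if snapshot.version == version:
                return snapshot.files
        raise KeyError(f"version {version} is not retained")

    def rollback(self, version: int) -> Snapshot:
        # Rollback is a new commit that points at an older file list.
        return self.commit(self.current.version, self.read(version))


table = ToyTable()
table.append("part-0001.parquet")  # becomes version 1
table.append("part-0002.parquet")  # becomes version 2
table.rollback(1)                  # becomes version 3, same files as version 1
```

Note that nothing is deleted or copied on rollback: the old files were never modified, so going back is only a metadata change. That is also why time travel reaches only as far as the snapshots that are still kept.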

When this solves a real problem

A lakehouse makes sense in several situations.

First: the company is paying for two infrastructure layers - lake and warehouse - and wants to merge them without losing capabilities.

Second: data is needed simultaneously for analytics and ML models, and currently has to be synchronized between two places.

Third: data schemas change frequently, and the warehouse approach creates constant labor-intensive migrations.

Fourth: data volume has grown to a level where the cost of a traditional warehouse is a meaningful budget line.
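The third situation, frequent schema changes, is where table formats tend to help, because the schema is table metadata rather than something baked into every stored file. A toy illustration of that idea, assuming name-based columns and hypothetical order data (real formats have their own rules for column identity and type changes):

```python
from __future__ import annotations

# Hypothetical "data files": each keeps its rows exactly as they were written.
old_file = [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 15.5}]
new_file = [{"order_id": 3, "amount": 9.9, "currency": "EUR"}]

# The table schema lives in metadata; adding a column only changes this list.
schema_v1 = ["order_id", "amount"]
schema_v2 = schema_v1 + ["currency"]


def read_with_schema(files: list[list[dict]], schema: list[str]) -> list[dict]:
    """Project every stored row onto the requested schema at read time."""
    return [
        {column: row.get(column) for column in schema}  # absent column -> None
        for data_file in files
        for row in data_file
    ]


# Old rows are read under the new schema without rewriting old_file.
rows = read_with_schema([old_file, new_file], schema_v2)
```

The old file is never touched: rows written before the change simply come back with an empty `currency`, which is the opposite of a migration that rewrites the whole table.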

When it is still premature

If data volumes are modest and analytics works fine on PostgreSQL or a simple cloud warehouse, the additional complexity is not needed. A lakehouse adds operational complexity: new formats, new tools, and new skills required of the team.

A few questions to check before deciding:

Do you currently have both a lake and a warehouse - and pay for both?
Is data for analytics and for ML consumed from different places and synchronized manually?
Does the team run into slow or costly schema migrations?
Does the team have the skills to work with open table formats and Spark or Flink?
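If it helps to keep the decision honest, the four questions can be written down as a small check. This is only my reading of the list above, not a formal method: the field names and the rule (at least one real pain point plus the skills to operate the stack) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checklist:
    """Answers to the four questions; all field names are hypothetical."""
    pays_for_lake_and_warehouse: bool
    syncs_analytics_and_ml_manually: bool
    schema_migrations_are_costly: bool
    team_knows_table_formats: bool


def assess(answers: Checklist) -> str:
    # Count the pain points that a lakehouse is meant to address.
    pains = sum([
        answers.pays_for_lake_and_warehouse,
        answers.syncs_analytics_and_ml_manually,
        answers.schema_migrations_are_costly,
    ])
    if pains == 0:
        return "premature: no problem here for a lakehouse to solve"
    if not answers.team_knows_table_formats:
        return "build skills first: the problem is real, the team is not ready"
    return "worth evaluating: real pain points and a team that can operate it"


verdict = assess(Checklist(True, True, False, True))
```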

The lakehouse is a good architectural idea that has matured to production readiness. But a good idea without a mature team is a source of new problems, not a solution to old ones.
