Data July 9, 2018 3 min read

Data warehouse or data lake: how to make the right call

A breakdown of two architectural approaches to corporate data storage and the criteria that actually matter for mid-size companies.

Right now the market is pushing two visions of the future for corporate data. The first is the classic data warehouse: structured tables, a fixed schema, SQL queries. The second is the data lake: everything goes in as-is, in its original format, and structure is added later at read time.

I regularly see companies choose between these approaches based on marketing brochures rather than real needs. Both architectures work - but each in its own context.

What a data warehouse actually is

A data warehouse is infrastructure where data arrives already cleaned and conformed to a shared schema. Transformation happens before writing, not after. Queries run fast because the structure is predictable.

The main constraint: the schema has to be defined upfront. If business rules change or a new data source appears with a different structure, you need a migration. That is not a disaster, but it is work.

A warehouse fits well for analytics with established questions: sales by region, funnels, financial reporting. Wherever the questions are known and repeat.

What a data lake actually is

A data lake is a store of raw data. Logs, events, files, JSON, tables - everything in one place, in its original form. The schema is defined at read time, not write time.

The main advantage is flexibility. You can add a new data source without changing the existing structure. You can ask a question that was not anticipated at design time.

The main downside, which rarely gets mentioned: a data lake without governance quickly becomes a data swamp. Nobody knows what is where, data is undocumented, quality is unknown. This is not a technical flaw in the architecture - it is an organisational one. But it is very common.

Where the logic of choosing breaks down

Mid-size companies often overestimate the flexibility of a lake and underestimate the cost of managing it. The idea "let's collect everything first and sort it out later" only works if there is a team that actually does the sorting.

The opposite mistake also happens: companies with genuinely varied and unpredictable data build a rigid warehouse and then struggle with every schema change.

Neither mistake is about the technology. Both are about a mismatch between the architecture and the real situation.

How to make the decision

A few questions that help clarify the picture:

How well do you understand which questions you will be asking the data? If the questions are settled - warehouse. If they change constantly - lake or a hybrid.
Does the company have people who will actively maintain metadata and data documentation? Without that, a lake degrades.
How many data sources do you have now, and how often do new ones appear? At a high rate of change, a lake produces less friction.
What is the cost of a delay in answering an analytical question? A warehouse is usually faster for predictable queries.
Do you have machine learning tasks that need raw data in its original form? That is an argument for a lake, or at least a hybrid approach.

There is no single right answer that works for everyone. But the right answer for a specific company usually becomes clear after honest responses to these questions.

Back to all posts

Contact

What a data warehouse actually is

What a data lake actually is

Where the logic of choosing breaks down

How to make the decision

If this resonated, write to me. I reply personally.