Data lake: questions to ask before you start building
Why data lake projects often turn into data swamps, and what founders and managers should ask before committing budget.
A data lake is one of those concepts that sounds compelling at the pitch level: "We'll collect all the data in one place, and then we'll be able to do anything we want with it." The idea is intuitively attractive. The problem is the word "then".
I have seen a number of projects that started with genuine enthusiasm around a data lake and ended up as what people in the field call a data swamp: a lake without banks, where the data exists but it is impossible to find what you need or to know what to trust.
What the architectural idea is
A data lake is a storage layer where data lands in raw form: structured, semi-structured, unstructured, in the original source format. Unlike a traditional data warehouse, where the schema is defined in advance, here the schema is applied on read. That gives flexibility: you do not need to know in advance how you will analyse the data.
The flexibility is real. But it is not free.
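To make "schema on read" concrete, here is a minimal sketch in Python. The sample records and the field names (user_id, amount) are invented for illustration. It assumes raw JSON lines landed in storage with no validation, and that the reader decides at query time which fields it needs and what to do with records that do not fit.

```python
import json

# Hypothetical raw events as they might land in a lake: nothing was validated on write.
RAW_LINES = [
    '{"user_id": 1, "amount": "19.90", "ts": "2024-01-05"}',
    '{"user_id": 2, "amount": 5, "currency": "EUR"}',
    '{"uid": 3, "amount": "n/a"}',
]


def read_with_schema(lines):
    """Apply a schema while reading: pick fields, coerce types, set aside misfits."""
    rows, rejects = [], []
    for line in lines:
        record = json.loads(line)
        try:
            rows.append({
                "user_id": int(record["user_id"]),
                "amount": float(record["amount"]),
            })
        except (KeyError, ValueError, TypeError):
            # The record was stored successfully but does not match this reader's schema.
            rejects.append(record)
    return rows, rejects


rows, rejects = read_with_schema(RAW_LINES)
print(rows)
print(f"{len(rejects)} record(s) did not fit the schema")
```

All three records were accepted at write time, and only the reader finds out that one of them is unusable. That deferred cost is the price of the flexibility.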
Why lakes turn into swamps
Without governance, a lake becomes unmanageable quickly. The specific reasons:
No data catalogue. The data is there, but nobody knows exactly what sits in which folder, where it came from, how fresh it is, or how far it can be trusted.
No owner for data quality. In a traditional warehouse, ETL processes usually clean and transform data on ingestion. In a lake, data lands as-is. Who fixes duplicates, errors, changed formats, and when?
Blurred access controls. "Everything in one place" quickly comes to mean "analysts have access to data they should not see". Access management in a raw storage layer is non-trivial.
Misaligned expectations. The business expects analytics, business analysts expect ready-made data marts, data engineers are building pipelines, and nobody has agreed on who does what with the raw data.
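The first two gaps can be made visible with very little tooling. Below is a minimal sketch of a catalogue record and a check that flags datasets with no owner or stale loads. The field names, the example paths and the seven-day freshness threshold are assumptions made for illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CatalogueEntry:
    """One dataset in the lake, described well enough to be found and trusted."""
    path: str
    source: str
    owner: Optional[str]          # the team accountable for quality, if any
    last_loaded: date
    contains_personal_data: bool  # drives who may be granted access


def governance_gaps(entry: CatalogueEntry, today: date, max_age_days: int = 7) -> List[str]:
    """Return the governance problems found for one catalogue entry."""
    gaps = []
    if not entry.owner:
        gaps.append("no owner")
    if (today - entry.last_loaded).days > max_age_days:
        gaps.append("stale")
    return gaps


# Hypothetical entries: one maintained dataset and one forgotten dump.
entries = [
    CatalogueEntry("raw/crm/contacts/", "CRM export", "sales-ops", date(2024, 3, 1), True),
    CatalogueEntry("raw/misc/dump_2021/", "unknown", None, date(2021, 6, 30), False),
]

today = date(2024, 3, 4)
for entry in entries:
    print(entry.path, governance_gaps(entry, today))
```

The code is the easy part. The hard part is organisational: someone has to be named in the owner field and to accept what that implies.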
Where a lake is actually needed
A data lake makes sense in a few scenarios:
When there is genuinely a lot of heterogeneous data (logs, events, device data streams, unstructured text), and the cost of pre-structuring all of it into a single schema is higher than the cost of raw storage.
When there is a team of analysts and engineers who will actively work with raw data, for example for exploratory analysis or for building machine learning models.
When the analytics tasks are not known in advance, as in research or product experiments.
For most mid-sized companies with a handful of operational data sources that need reliable reports and dashboards, a well-designed data warehouse, or even a well-structured schema in PostgreSQL, is often enough.
Questions before starting
Before agreeing to a recommendation to build a lake, ask:
- What specific tasks can we not solve today, and why will a lake solve them?
- Who will own the data catalogue - not technically, but organisationally?
- Do we have a team that will work with raw data, or are we building infrastructure for analysts who do not exist yet?
- How will we manage data quality - who cleans the data, and when?
- How is access management structured - especially for data containing personal or commercially sensitive information?
A lake is a powerful tool. But without answers to these questions, building one is often infrastructure for the sake of infrastructure.