Data lake: why the lake turns into a swamp
Why the data lake concept fails without data governance, and how a manager can tell real value from architectural hype.
The data lake concept captured the industry's attention roughly three or four years ago. The idea is appealing: instead of designing a schema upfront and deciding in advance what data you will need, you collect everything in raw form into one store and figure out the details as needed. Flexibility, scale, the ability to return to data with new questions later.
In practice, for most companies I have seen, the story unfolds differently. Data does get collected. The lake fills up. And then it turns out that nobody can find what they need, understand what it is, or trust it. The lake becomes a swamp.
Why this happens
The reason is not technical. Hadoop and its ecosystem can genuinely store data in any shape and any volume. The reason is organisational.
In a traditional data warehouse, before loading you have to answer questions: what is this data, where does it come from, what does it mean, how does it relate to other data. That takes effort - but those answers are the value of the warehouse. Without them, data is just bytes.
A data lake tempts you into deferring these questions. Load now, figure it out later. "Later" for most teams never arrives - because the next data stream is already waiting, priorities shift, and the people who understood the context of the loaded data have moved on.
A year or two later, an analyst gets access to the lake and sees thousands of files with unclear names, no descriptions, no lineage. Technically, the data exists. In practice, it does not.
What a data lake actually requires
A useful data lake demands the same organisational practices as a conventional warehouse - they are just less obvious, because the technology creates the illusion that they are not needed.
A data catalog. What is stored, where it came from, who owns it, when it was last updated. Without a catalog, the lake cannot be navigated.
Data ownership. Every dataset in the lake must have an owner. Someone who ensures the data remains current, that the schema has not broken, that new consumers can rely on it.
Lifecycle management. Data ages, loses relevance, begins contradicting fresher sources. A process for archiving and deletion is required - otherwise the swamp only grows.
Quality controls. Raw data almost always contains problems - gaps, duplicates, anomalies. Who sees them and what gets done about them?
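None of this needs heavy tooling to get started. As a minimal sketch, assuming the catalog is kept as plain Python records, the four practices above might look like the code below. The names CatalogEntry, audit and profile, the example paths, and the one-year staleness threshold are all illustrative assumptions, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class CatalogEntry:
    """One dataset in the lake (hypothetical record layout)."""
    path: str               # where the data lives in the lake
    description: str        # what the data is
    source: str             # where it came from
    owner: Optional[str]    # who answers for the contents
    last_updated: date      # when it was last refreshed


def audit(entries, today, max_age_days=365):
    """Return (path, problem) pairs for entries that break the basic rules."""
    problems = []
    for e in entries:
        if not e.owner:
            problems.append((e.path, "no owner"))
        if not e.description.strip():
            problems.append((e.path, "no description"))
        if today - e.last_updated > timedelta(days=max_age_days):
            problems.append((e.path, "stale: archive or delete?"))
    return problems


def profile(rows, key):
    """Count rows with missing values and rows that repeat an earlier key."""
    seen = set()
    gaps = 0
    duplicates = 0
    for row in rows:
        if any(v is None or v == "" for v in row.values()):
            gaps += 1
        k = row.get(key)
        if k in seen:
            duplicates += 1
        seen.add(k)
    return {"rows": len(rows), "rows_with_gaps": gaps, "duplicate_keys": duplicates}


# Example data, invented for illustration only.
entries = [
    CatalogEntry("/lake/crm/customers", "Customer master data",
                 "CRM export", "sales-ops", date(2016, 5, 1)),
    CatalogEntry("/lake/tmp/dump_0412", "", "unknown", None, date(2014, 4, 12)),
]
problems = audit(entries, today=date(2016, 6, 1))

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 1, "email": "a@example.com"},
]
report = profile(rows, key="id")
```

Here the first entry passes the audit, while the second is flagged three times: no owner, no description, and stale. The point of the sketch is not the code but the fact that someone has to fill in these fields and act on the flags.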
How to tell a useful project from hype
When a proposal arrives to "build a data lake", I ask a few questions that quickly reveal whether there is a real need behind it.
What specific analytical task cannot be done today, and why? If there is no answer - there is no task.
Who will use the data in the lake and how exactly - which tool, which question? If there are no concrete consumers, there is nobody to build for.
Who will be responsible for data quality in the lake - not the technology, but the contents? If there is no answer - the lake will become a swamp regardless of the technology.
A practical conclusion
A data lake is a real architectural concept with real use cases. But it does not replace data discipline - it only repackages that discipline. Companies that skip this part of the work end up with an expensive storage layer that nobody uses.
Before investing in a data lake, it is worth asking: do we have a data management practice that we want to scale - or do we want the technology to substitute for that practice? In the second case, it is better to start smaller.