Data January 27, 2021 3 min read

When a data lake turns into a swamp

Why data lakes often stop working one or two years after launch, and what to do about it.

A data lake is one of those ideas that sounds convincing during the planning stage. We collect all company data in one place, in a single storage layer. Analysts take what they need. AI projects get their raw material. Everything is connected, everything is accessible.

A year and a half or two years later, I often see something different. The storage exists, there is a lot in it, but nobody knows exactly what is where. New data streams are added without documentation. Old ones are never removed. Queries take unpredictable amounts of time. Analysts duplicate layers because they do not trust what is already there.

This is what gets called a data swamp: a swamp where a lake was supposed to be.

Why it happens

The cause is almost always the same: a data lake is built as technical infrastructure without building the organisational infrastructure alongside it.

The technical part gets done fairly quickly: pick a platform, configure pipelines, start loading. The organisational part does not. Who is responsible for the quality of a specific dataset? Who decides what is stale and can be removed? How does a new analyst find out what already exists and in what state? How does documentation stay current?

Without answers to these questions, the storage grows in size and falls in value.

Three symptoms that diagnose it easily

The first is duplication. Different teams create similar tables or datasets because they either could not find what already existed or did not trust its accuracy. This shows up in the structure of the storage and in how analysts answer the question "where did this data come from?"

The second is fear of deletion. Nobody takes responsibility for cleanup because it is unclear who is still using old data. As a result, three-year-old datasets that no one uses sit in the storage, and no one removes them either.

The third is the answer "I'll need to check". A simple question about the source of a specific metric has no quick answer, and you have to find the specific person who "set that up".

What helps and what does not

Changing platforms does not help. The swamp migrates with the data. A one-time cleanup without changing processes does not help either: the clutter comes back within six months.

What helps is assigning owners to specific data domains. Not technical administrators, but business owners who care about a particular slice of data. They make decisions about relevance, structure, and quality.

What helps is a minimal data catalog: a registry of what exists, where it came from, and who owns it. It does not have to be an expensive tool. A structured internal wiki that is kept alive is often enough.
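To show how little such a registry needs, here is a minimal sketch in Python. Every name in it (`CatalogEntry`, `Catalog`, the example datasets and owners) is hypothetical and made up for illustration. It records only the three things named above (what exists, where it came from, who owns it) plus a short description, and it refuses to register the same name twice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the registry (hypothetical structure)."""
    name: str               # what exists: dataset or table name
    source: str             # where it came from
    owner: Optional[str]    # who owns it; None means nobody yet
    description: str = ""   # short documentation; empty means undocumented


class Catalog:
    """An in-memory registry; a wiki table or spreadsheet plays the same role."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Rejecting an existing name forces the "does this already exist?" question.
        if entry.name in self._entries:
            raise ValueError(f"{entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> Optional[CatalogEntry]:
        # Returns None when the dataset is not in the catalog.
        return self._entries.get(name)

    def unowned(self) -> list[str]:
        # Names of datasets nobody has taken responsibility for, sorted.
        return sorted(n for n, e in self._entries.items() if e.owner is None)


catalog = Catalog()
catalog.register(CatalogEntry("orders", "crm_export", "sales-ops", "Daily orders"))
catalog.register(CatalogEntry("tmp_orders_v2", "unknown", None))
print(catalog.unowned())
```

The tool matters less than the habit: an entry without an owner is visible as a problem the moment someone asks for the list.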

What helps is a policy: data without documentation and an owner gets marked stale after a set period and is excluded from reporting.
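A rule like this is simple enough to automate. The sketch below is one possible reading, not a prescription: the 90-day grace period is an assumed value (the policy only needs "a set period"), the `Dataset` fields are hypothetical, and it treats a dataset as incomplete when either the owner or the documentation is missing.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed value for illustration; pick whatever period fits your organisation.
GRACE_PERIOD = timedelta(days=90)


@dataclass
class Dataset:
    """Hypothetical minimal record of a dataset in the lake."""
    name: str
    created: date
    owner: Optional[str] = None   # None means no owner assigned
    documented: bool = False


def is_stale(ds: Dataset, today: date, grace: timedelta = GRACE_PERIOD) -> bool:
    # Assumption: missing either an owner or documentation counts as incomplete.
    incomplete = ds.owner is None or not ds.documented
    # Stale only once the grace period has fully passed.
    return incomplete and (today - ds.created) > grace


def reportable(datasets: list[Dataset], today: date) -> list[str]:
    # Names of datasets that may still feed reports, in input order.
    return [d.name for d in datasets if not is_stale(d, today)]


lake = [
    Dataset("orders", date(2019, 5, 1), owner="sales-ops", documented=True),
    Dataset("tmp_orders_v2", date(2020, 1, 1)),
    Dataset("new_events", date(2021, 1, 1)),
]
print(reportable(lake, today=date(2021, 1, 27)))
```

The grace period gives a new dataset time to find an owner. After that, the default is exclusion from reporting, so the work of documenting falls on whoever wants the data used.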

A practical test

Ask any analyst on your team to explain in five minutes: where a specific key metric in your dashboard comes from, which dataset feeds it, and who is responsible for keeping it current.

If this takes more than five minutes or requires tracking down a specific person, you have the signs of a swamp, regardless of how modern the platform underneath it is.

A data lake does not solve the data problem on its own. It creates new infrastructure that requires the same governance as any other.

Contact

If this resonated, write to me. I reply personally.