Data January 24, 2017 3 min read

A data lake without governance becomes a swamp

Why corporate data lake projects often end up as a file store nobody knows how to use.

The term "data lake" appeared a few years ago and quickly became a standard part of corporate IT strategies. The idea is attractive: collect all company data in one place, in raw form, without a predefined schema, then use it as needed. Storage got cheaper, and Hadoop and cloud services lowered the barrier to entry.

I have worked with several projects that started exactly this way. And I have watched most of them, a year or eighteen months in, end up in the same place: the storage keeps growing, but nobody actually uses the data inside, because nobody knows what is there or whether it can be trusted.

Why "put everything in and sort it out later" does not work

The logic of "collect first, figure it out later" feels pragmatic. In practice it defers the problem rather than solving it.

When data lands in a lake without description - without information about where it came from, what it means, who is responsible for it, and when it was current - it loses context. Six months later a file called export_final_v3.csv tells an analyst nothing. A year later there are thousands of such files.

The result is that analysts do not use the lake - they go directly to the source or build their own local copies. The lake keeps growing and so does the storage bill, but no value is created.

Three symptoms I see most often

No data owner. Files in the lake came from ten different systems, but no dataset has a specific person or team responsible for its accuracy and freshness. When something turns out to be wrong, it is unclear who to go to.

No catalog. People find data by asking around in chat and in person. "Hey, do you have a sales export for last year? What format? Can we trust it?" - this exchange happens daily in companies that have a lake and no catalog.

No quality policy at ingestion. Everything lands in the lake: duplicates, incomplete exports, test data mixed with real data. Without minimal checks at the entry point, data degrades from day one.
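To make "minimal checks at the entry point" concrete, here is a rough sketch in Python. It assumes CSV exports and a known list of expected columns; the function names, the in-memory set of hashes, and the wording of the messages are all hypothetical, not taken from any particular tool. It catches exact duplicates by content hash and flags exports that are empty or have ragged rows.

```python
from __future__ import annotations

import csv
import hashlib
import io


def content_hash(data: bytes) -> str:
    """Fingerprint a file by its bytes, so renamed copies are still caught."""
    return hashlib.sha256(data).hexdigest()


def check_at_ingestion(
    data: bytes, seen_hashes: set[str], expected_columns: list[str]
) -> list[str]:
    """Return a list of problems; an empty list means the file may be loaded."""
    problems = []

    # Duplicate check: same bytes as something already accepted.
    digest = content_hash(data)
    if digest in seen_hashes:
        problems.append("duplicate of a file already in the lake")

    # Completeness check: header matches, and every row has every field.
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    if not rows:
        problems.append("empty export")
    else:
        header, body = rows[0], rows[1:]
        if header != expected_columns:
            problems.append(f"unexpected columns: {header}")
        if not body:
            problems.append("header only, no data rows")
        elif any(len(row) != len(header) for row in body):
            problems.append("ragged rows (possibly a truncated export)")

    # Only remember files that were actually accepted.
    if not problems:
        seen_hashes.add(digest)
    return problems
```

In a real lake the set of hashes would live in a database rather than in memory, and separating test data from real data usually needs a flag from the source system. That cannot be inferred reliably from the file alone.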

What must exist from the start

A data lake is not a technical project - it is an operational one. The technology is the easiest part.

Three organisational requirements without which the architecture makes no sense:

First - a catalog describing every dataset: source, owner, update frequency, format, known limitations. Not documentation for show, but a working tool.
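As an illustration only, a catalog entry with the fields listed above could look like the following Python sketch. The class name, field names, and example values are assumptions made for this example; a real catalog tool will have its own schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """One record in a hypothetical lake catalog (field names are assumptions)."""

    name: str
    source: str
    owner: str
    update_frequency: str
    data_format: str
    known_limitations: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of required text fields that were left blank."""
        required = ("name", "source", "owner", "update_frequency", "data_format")
        return [f for f in required if not getattr(self, f).strip()]

    def summary(self) -> str:
        """Short human-readable description for an analyst browsing the catalog."""
        limits = "; ".join(self.known_limitations) or "none recorded"
        return (
            f"{self.name}: from {self.source}, owned by {self.owner}, "
            f"updated {self.update_frequency}, format {self.data_format}, "
            f"known limitations: {limits}"
        )


# Example values are invented for illustration.
entry = CatalogEntry(
    name="customers_2016_q4",
    source="CRM nightly export",
    owner="Jane Doe <jane.doe@example.com>",
    update_frequency="daily",
    data_format="CSV, UTF-8",
    known_limitations=["addresses before 2015 are not normalised"],
)
print(entry.summary())
```

Refusing to register a dataset while `missing_fields()` is non-empty is one simple way to keep the catalog a working tool rather than optional paperwork.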

Second - an explicit owner for every dataset. Not the team in general, not "the IT department", but a specific person responsible for the data being current and meaning what it says.

Third - a minimum ingestion policy. What actually goes into the lake? Test data? Personal data without masking? Incomplete exports? The boundary must be defined before loading begins.
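One way to make that boundary explicit is to write the policy down as data and check every submission against it. The sketch below is a hypothetical Python example: the class names and flags are assumptions, and it relies on the submitting team declaring honestly what the dataset contains.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """What the submitting team declares about a dataset (hypothetical fields)."""

    dataset: str
    is_test_data: bool
    has_personal_data: bool
    personal_data_masked: bool
    is_complete: bool


@dataclass(frozen=True)
class IngestionPolicy:
    """The boundary, decided before loading begins. Defaults are the strict choice."""

    allow_test_data: bool = False
    allow_unmasked_personal_data: bool = False
    allow_incomplete_exports: bool = False

    def violations(self, s: Submission) -> list[str]:
        """Return the reasons a submission falls outside the policy."""
        found = []
        if s.is_test_data and not self.allow_test_data:
            found.append("test data is not accepted")
        unmasked = s.has_personal_data and not s.personal_data_masked
        if unmasked and not self.allow_unmasked_personal_data:
            found.append("personal data must be masked before loading")
        if not s.is_complete and not self.allow_incomplete_exports:
            found.append("incomplete exports are not accepted")
        return found


policy = IngestionPolicy()
submission = Submission(
    dataset="crm_dump",
    is_test_data=False,
    has_personal_data=True,
    personal_data_masked=False,
    is_complete=True,
)
for reason in policy.violations(submission):
    print(f"{submission.dataset}: {reason}")
```

The value is less in the code than in the fact that each flag forces a decision someone has to sign off on before the first file is loaded.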

The question worth asking before you start

Before launching a data lake project, I ask one simple question: if a year from now an analyst outside your team opens the lake catalog and finds a customer dataset from last quarter - will they be able to understand in ten minutes where that data came from, whether it can be trusted, and who to contact with questions?

If the answer is not obvious, the organisational part of the project is not ready. The technical part can run in parallel, but without resolving this question, the lake will remain a well-structured store of files nobody understands.
