Data | March 14, 2019 | 3 min read

A data lake without governance is a swamp, not an asset

Why a data lake without access policy and governance turns into unmanageable storage that nobody can get trustworthy data from.

The data lake idea sounds compelling: gather everything in one place, give analysts access, and let them work with data without restrictions. In practice, most large data lakes turn into what the industry calls a "data swamp" within 18-24 months. Lots of information, very little ability to trust it.

I have seen this enough times to consider it the standard outcome, not the exception, whenever a few specific practices are missing.

What goes wrong

The problem is not the technology - S3, HDFS, and GCS all do their job. The problem is that in the rush to "just put everything in there," three things get skipped:

Data catalogue. Who knows what is actually in the lake? If a new analyst needs to spend a day figuring out which table is current and which of three versions of customer_id to use - that is not a data lake, it is a landfill with smudged labels.

Access control. "Everyone can access everything" is not a policy - it is the absence of one. Customer personal data, financial reports, and application logs require different access levels. When that is not enforced, you get both security problems and GDPR exposure at the same time.

Data quality and lineage. Who put this table here? When was it last updated? Is it current or is it a 2016 snapshot nobody deleted? Without answers to these questions, analysts make decisions on data they should not trust.

The minimum necessary controls

This is not a call to build a complex governance infrastructure before any data arrives. It is about the minimum set of practices without which a data lake becomes a liability faster than it becomes an asset.

Naming scheme and zones. Divide the storage into zones: raw, curated, sandbox. This simple convention alone removes most of the confusion about which copy of a dataset to use.
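A minimal sketch of what such a convention could look like in code. The zone names come from the list above; the meaning attached to each zone, the `dataset_path` helper, the lowercase snake_case rule, and the `load_date=` path layout are all assumptions made for illustration, not a standard.

```python
from datetime import date
from enum import Enum


class Zone(Enum):
    # Zone names are from the post; the meanings are assumed for this sketch.
    RAW = "raw"          # data as it arrived from the source, unmodified
    CURATED = "curated"  # cleaned and documented, intended for reporting
    SANDBOX = "sandbox"  # analysts' experiments, no guarantees


def dataset_path(zone: Zone, source: str, dataset: str, load_date: date) -> str:
    """Build a predictable storage prefix for one load (hypothetical layout)."""
    for part in (source, dataset):
        # Allow only lowercase letters, digits, and underscores.
        if not part.replace("_", "").isalnum() or part != part.lower():
            raise ValueError(f"use lowercase snake_case names, got {part!r}")
    return f"{zone.value}/{source}/{dataset}/load_date={load_date.isoformat()}/"


print(dataset_path(Zone.RAW, "crm", "customers", date(2019, 3, 14)))
# raw/crm/customers/load_date=2019-03-14/
```

The point is not this particular layout but that every path is produced by one function, so nobody has to guess where a dataset lives or which zone it belongs to.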

Metadata at load time. Source, load time, owner, brief description. Filling this in when new data is added is the responsibility of whoever adds it - not a separate task for "later."
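One way to make "not later" enforceable is to refuse a load that arrives without these four fields. The sketch below assumes the metadata is written as a JSON sidecar next to the data; the `LoadMetadata` class, the `metadata_sidecar` helper, and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class LoadMetadata:
    # The four fields named above: source, load time, owner, description.
    source: str
    loaded_at: str
    owner: str
    description: str

    def __post_init__(self):
        # Reject empty or whitespace-only values instead of deferring them.
        missing = [k for k, v in asdict(self).items() if not str(v).strip()]
        if missing:
            raise ValueError(f"missing metadata fields: {', '.join(missing)}")


def metadata_sidecar(source: str, owner: str, description: str,
                     now: Optional[datetime] = None) -> str:
    """Return the JSON document to store alongside a new load."""
    now = now or datetime.now(timezone.utc)
    meta = LoadMetadata(source, now.isoformat(), owner, description)
    return json.dumps(asdict(meta), indent=2)


print(metadata_sidecar("crm", "sales-analytics", "Daily customer export"))
```

Calling `metadata_sidecar("crm", "", "Daily customer export")` raises `ValueError`, which is the behaviour you want from an ingestion job: no owner, no load.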

Access tiers. At minimum three levels: open (accessible across the company), restricted (by request), confidential (PII and financial data). This is not about complex ACLs - it is about a conscious policy.
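As an illustration, the three tiers fit in a few lines. The tier names follow the list above; how confidential access is granted is not specified here, so the rule below (a clearance flag plus a per-dataset approval) is an assumption, as are the `User` class and `can_read` function.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    OPEN = "open"                  # accessible across the company
    RESTRICTED = "restricted"      # by request
    CONFIDENTIAL = "confidential"  # PII and financial data


@dataclass(frozen=True)
class User:
    name: str
    approved_datasets: frozenset = frozenset()  # granted through access requests
    confidential_cleared: bool = False          # hypothetical clearance flag


def can_read(user: User, dataset: str, tier: Tier) -> bool:
    """Decide read access from the dataset's tier (sketch, not a real ACL)."""
    if tier is Tier.OPEN:
        return True
    if tier is Tier.RESTRICTED:
        return dataset in user.approved_datasets
    # Assumed rule for confidential data: clearance AND an approved request.
    return user.confidential_cleared and dataset in user.approved_datasets


analyst = User("dana", approved_datasets=frozenset({"finance_reports"}))
print(can_read(analyst, "app_logs", Tier.OPEN))                 # True
print(can_read(analyst, "finance_reports", Tier.CONFIDENTIAL))  # False
```

In a real lake the enforcement would live in the storage or query layer's own permission system; the value of writing the policy down like this is that every dataset must be assigned a tier before anyone can read it.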

Quality checks on ingestion. Basic checks: is the data present, are there obvious anomalies, does the schema match. Not a complex data quality framework - just minimum gate checks.
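The three checks above can be sketched as a single gate function that runs before a batch is accepted. It assumes rows arrive as Python dictionaries and treats "too many nulls in a column" as the obvious anomaly; the `gate_check` name, the 20% threshold, and the sample schema are illustrative assumptions.

```python
from __future__ import annotations

from typing import Any


def gate_check(rows: list[dict[str, Any]], schema: dict[str, type],
               max_null_fraction: float = 0.2) -> list[str]:
    """Return a list of problems; an empty list means the batch may be loaded."""
    # Check 1: is the data present at all?
    if not rows:
        return ["no rows received"]
    problems: list[str] = []
    # Check 2: does every row match the expected columns and types?
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            problems.append(f"row {i}: columns {sorted(row)} != expected {sorted(schema)}")
            continue
        for column, expected in schema.items():
            value = row[column]
            if value is not None and not isinstance(value, expected):
                problems.append(
                    f"row {i}: {column} is {type(value).__name__}, expected {expected.__name__}")
    # Check 3: an obvious anomaly, here a column that is mostly null.
    for column in schema:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls / len(rows) > max_null_fraction:
            problems.append(f"{column}: {nulls}/{len(rows)} values are null")
    return problems


schema = {"customer_id": int, "email": str}
batch = [{"customer_id": 1, "email": "a@example.com"},
         {"customer_id": "2", "email": None}]
for problem in gate_check(batch, schema):
    print(problem)
```

A batch that returns any problems stays out of the curated zone until someone has looked at it.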

Who owns this

Data lake governance is not a technical question - it is an operational one. Someone specific must own the catalogue, approve the access policy, and make decisions about what is stale and can be removed.

In most organisations this is either nobody, or "everyone a little" - which in practice is indistinguishable from nobody. Until that question is resolved organisationally, no technical solution will help.

In short

A data lake works when it has an owner: a person or team responsible for ensuring that the data inside is understandable, trustworthy, and accessible to the right people with the right permissions. Without that, adding another petabyte of data only makes the problem worse.

Contact

If this resonated, write to me. I reply personally.