Raw data layer before data lake: when it makes sense and how to keep it from becoming a swamp
Storing everything in one place is not a strategy. Without catalogs, owners, and metadata you do not get a lake - you get a swamp.
The idea of storing all raw data in one place is logically appealing: you never know what will be useful, storage keeps getting cheaper, and you can always process later. Companies of all sizes eventually arrive at this thought and start accumulating - logs, events, exports, files, database snapshots.
A year or two passes, and it becomes clear that accumulating was easy and using the data is not. Nobody knows exactly what is there, in what format, or whether any of it is still current. The data exists, but extracting something useful from it is harder than it looked.
This is not a failure of the raw layer idea. It is a failure of implementation without the right pieces in place. Before adding more sources, it is worth understanding why "store everything" is not a strategy and what an undisciplined archive actually costs.
When a raw layer is genuinely justified
Storing data before processing makes sense under a few conditions.
Data changes or disappears at the source. If the source is an external system, an API, or a partner that does not guarantee history, a raw copy gives you control over the historical record.
Processing requirements are not yet defined. Sometimes the business does not know precisely what will be analysed. Saving the raw data and processing it later is reasonable - if it is a conscious decision, not procrastination.
Audit or reproducibility is required. In regulated industries, or when working with financial data, you need to be able to reproduce a calculation from the original data. That requires immutable raw copies.
In all other cases, "let's save everything first and figure it out later" is not a strategy. It is deferring the problem with interest.
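Where immutable raw copies are required, as in the audit case above, one way to approximate immutability on a plain filesystem is to make the landing write refuse to overwrite anything. The sketch below is only an illustration under assumptions: the function name `land_raw_copy`, the directory layout, and the timestamp-based file name are all hypothetical, and a real setup might rely on object-store versioning or retention locks instead.

```python
from datetime import datetime, timezone
from pathlib import Path


def land_raw_copy(payload: bytes, landing_dir: Path, source: str,
                  fetched_at: datetime) -> Path:
    """Write one raw payload to <landing_dir>/<source>/<UTC timestamp>.bin.

    fetched_at is expected to be timezone-aware. The file is opened in
    exclusive-create mode, so an existing raw copy is never overwritten.
    """
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = landing_dir / source / f"{stamp}.bin"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" raises FileExistsError if the target already exists.
    with target.open("xb") as f:
        f.write(payload)
    return target
```

A second write for the same source and timestamp fails loudly instead of silently replacing history, which is the property a later audit depends on.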
Three things without which a raw layer becomes a swamp
A catalog. A list of what is in the store: source, format, period, update frequency, last updated. Without a catalog the store is opaque to new people and becomes opaque to its creators within six months.
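A catalog does not need a dedicated tool to get started. As a minimal sketch, the fields listed above can be held in a small record type. The class name, the field names, and the example dataset are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog record; field names are illustrative assumptions."""
    name: str
    source: str              # where the data comes from
    data_format: str         # e.g. "csv", "parquet", "json"
    period_start: date       # first day covered by the data
    period_end: date         # last day covered by the data
    update_frequency: str    # e.g. "daily", "weekly", "one-off"
    last_updated: datetime   # when the dataset was last refreshed


# A hypothetical dataset, invented for this example.
entry = CatalogEntry(
    name="orders_export",
    source="partner SFTP drop",
    data_format="csv",
    period_start=date(2023, 1, 1),
    period_end=date(2023, 12, 31),
    update_frequency="daily",
    last_updated=datetime(2024, 1, 2, 6, 0),
)

# The catalog itself can start as a plain mapping from name to entry.
catalog = {entry.name: entry}
```

Even a file of such records kept next to the data answers the first question a newcomer asks: what is in here?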
An owner. Each dataset needs a specific person or team responsible for its accuracy and freshness. "Owned by the whole team" means owned by nobody. When data goes stale or something breaks, it must be clear who fixes it.
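The "owned by nobody" case is easy to check mechanically once owners are written down somewhere. The sketch below assumes a simple mapping from dataset name to owner. The dataset names, the owner values, and the list of placeholder words are hypothetical.

```python
# Hypothetical mapping of dataset name -> owner; all values are invented.
owners = {
    "orders_export": "billing-team",
    "clickstream_events": "",
    "crm_snapshot": "everyone",
    "partner_feed": None,
}

# Values that look like an owner but name nobody in particular (an assumption).
PLACEHOLDER_OWNERS = {"", "everyone", "all", "whole team", "tbd"}


def datasets_without_owner(owners: dict) -> list:
    """Return names of datasets with no specific person or team assigned."""
    unowned = []
    for name, owner in owners.items():
        if owner is None or owner.strip().lower() in PLACEHOLDER_OWNERS:
            unowned.append(name)
    return sorted(unowned)


print(datasets_without_owner(owners))
```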
Quality metadata. Data is sometimes incomplete, partially broken, or full of gaps. That is normal - if it is documented. The consumer of the data should know what they are getting, including the known limitations. Otherwise they find out at the worst possible moment.
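One possible shape for quality metadata is a short note stored alongside the dataset and shown to anyone about to use it. The following is a sketch only: the class, its fields, and the example limitations are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QualityNote:
    """Known limitations of one dataset; structure is an assumption."""
    dataset: str
    known_gaps: list = field(default_factory=list)    # missing periods or sources
    known_issues: list = field(default_factory=list)  # broken or unreliable fields

    def summary(self) -> str:
        """Render the note as plain text for the data consumer."""
        lines = [f"Dataset: {self.dataset}"]
        if not self.known_gaps and not self.known_issues:
            # Absence of notes is not proof of quality, so say so explicitly.
            lines.append("No known limitations recorded.")
        lines.extend(f"Gap: {gap}" for gap in self.known_gaps)
        lines.extend(f"Issue: {issue}" for issue in self.known_issues)
        return "\n".join(lines)


# Hypothetical example values.
note = QualityNote(
    dataset="clickstream_events",
    known_gaps=["no events for 2023-03-04 to 2023-03-06"],
    known_issues=["user_agent is empty for mobile clients"],
)
print(note.summary())
```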
How to organise layers inside the raw store
Even when everything lives in one place, internal structure helps.
A minimal scheme:
- raw / landing - data as it arrived, no processing;
- validated - passed basic schema and completeness checks;
- curated - cleaned, brought to a common format, ready for analysis.
Moving between layers is an explicit process, not an accident. If data does not pass validation it stays in raw and goes no further. That keeps incorrect data from quietly flowing into analytics.
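The promotion step from raw to validated can be sketched as a function that copies a file forward only when every check passes. This is a minimal illustration for CSV files. The required columns, the function names, and the directory layout are assumptions, and real checks would be richer.

```python
import csv
import shutil
from pathlib import Path

REQUIRED_COLUMNS = {"order_id", "amount", "created_at"}  # hypothetical schema


def validation_errors(path: Path) -> list:
    """Run basic schema and completeness checks on one CSV file."""
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        errors = []
        row_count = 0
        for row in reader:
            row_count += 1
            # A required field counts as empty if it is absent or blank.
            empty = [c for c in sorted(REQUIRED_COLUMNS)
                     if not (row[c] or "").strip()]
            if empty:
                errors.append(f"line {reader.line_num}: empty {empty}")
        if row_count == 0:
            errors.append("file has no data rows")
        return errors


def promote(raw_file: Path, validated_dir: Path) -> bool:
    """Copy a file into the validated layer only if it passes every check."""
    errors = validation_errors(raw_file)
    if errors:
        print(f"{raw_file.name} stays in raw: {errors}")
        return False
    validated_dir.mkdir(parents=True, exist_ok=True)
    # Copy rather than move, so the raw copy stays untouched.
    shutil.copy2(raw_file, validated_dir / raw_file.name)
    return True
```

Copying instead of moving is a deliberate choice in this sketch: the raw file remains the unmodified record, and the validated layer only ever contains files that passed.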
Signs that the swamp is already forming
A few signals I have seen in different companies:
- nobody can say what is in which directory;
- the same data source was loaded multiple times by different teams because nobody knew it already existed;
- an analyst spends more time figuring out what the data is than actually analysing it;
- "data from last year" exists in several copies with different values.
None of these problems is solved by adding more disk space.
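One of these signals, the same source loaded more than once, can be partly detected by hashing file contents. The sketch below groups byte-identical files under a root directory. The function names are hypothetical, and the approach only catches exact copies: copies with different values will not match and need a comparison by source and period instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_files(root: Path) -> list:
    """Return groups of files under root whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```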
A quick self-check
If you have a raw data store, pick any dataset in it and ask three questions: what is this, who is responsible for it, and when was its accuracy last verified?
If all three have answers, that is good. If not, that is where to start bringing order - not by adding new sources, but by describing what is already there.
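The same three questions can be run across every dataset at once if the answers are recorded in a structured form. This is a rough sketch: the record layout, the key names, and the example datasets are assumptions made up for illustration.

```python
from datetime import date

# Hypothetical records; every name and value is invented.
datasets = [
    {"name": "orders_export",
     "description": "Daily order export from the billing system",
     "owner": "billing-team",
     "last_verified": date(2024, 5, 1)},
    {"name": "crm_snapshot",
     "description": "",
     "owner": "sales-ops",
     "last_verified": None},
]

# Record key -> the question it answers.
QUESTIONS = {
    "description": "what is this?",
    "owner": "who is responsible for it?",
    "last_verified": "when was its accuracy last verified?",
}


def unanswered(dataset: dict) -> list:
    """Return the questions that have no answer for this dataset."""
    return [q for key, q in QUESTIONS.items() if not dataset.get(key)]


for d in datasets:
    gaps = unanswered(d)
    status = "ok" if not gaps else "unanswered: " + "; ".join(gaps)
    print(f"{d['name']}: {status}")
```

The datasets with unanswered questions form the to-do list for describing what is already there.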