You can no longer store everything: the economics of archives, logs, and history layers
Cheap disk lowers the cost of writing data, but not the cost of searching it, maintaining it, or understanding what is actually inside.
A few years ago the main argument against keeping everything was the cost of a hard drive. Today the price per gigabyte has fallen far enough that the opposite sounds like common sense: if it is cheap, store everything.
The problem is that storage cost is not just the cost of the disk. It is the cost of search, indexing, backup, migration, audit, and - most importantly - the cost of understanding what is actually sitting in that storage two years from now.
What actually gets more expensive
When volume grows without discipline, the surrounding costs grow with it:
- backup windows get longer and more expensive;
- search across an unstructured archive degrades without indexes, and indexes are a separate cost to build and maintain;
- during an incident, figuring out what part of the stored data is still relevant becomes manual work;
- during an audit or investigation, "let's check the logs" turns into "let's find someone who knows where they are";
- migration to a new storage platform scales linearly with volume.
The audit point is worth dwelling on: logs are only useful if someone knows what is in them and can query them quickly.
Cheap disk eliminates none of these line items. It only hides them until the situation becomes unmanageable.
Three layers worth separating
In practice, most company data falls into three categories with fundamentally different economics.
The operational layer - data needed right now: recent transactions, active sessions, current tasks. Speed matters here, the volume is relatively small, and the useful life is clear.
The analytical layer - aggregated history: period summaries, trends, comparisons. Volume is moderate, queries are predictable, retention is driven by the business.
The raw archive - full logs, events, source exports. Volume is large, queries are rare and unpredictable. This is the layer where discipline is almost always missing.
When all three land in the same place without labeling, nobody can say what is still needed, what is obsolete, and what is safe to delete.
Retention policy is a decision, not a default setting
Keeping everything forever is also a decision. It just tends to be made by default, without thinking through the consequences.
A sensible retention policy answers a few questions:
- What is the useful life of this data type from a business perspective?
- Do we need the full records, or are aggregates enough?
- Who will search for this, and why?
- What happens if we delete it after one year - and after five?
For application logs the answers are usually different from the answers for financial transactions. For access logs they differ from the answers for user events. The policy should differ too.
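One way to make such a policy explicit is to write it down as data rather than leave it implied by whatever the storage system happens to do. The sketch below is only an illustration: the data type names, the `RetentionRule` structure, and every number of days are hypothetical, chosen to show that each type gets its own rule and that "keep forever" has to be stated on purpose.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    keep_raw_days: Optional[int]         # None = keep raw records indefinitely
    keep_aggregates_days: Optional[int]  # None = keep aggregates indefinitely


# Hypothetical values: real numbers come from the business and from regulation.
POLICY = {
    "application_logs": RetentionRule(keep_raw_days=90, keep_aggregates_days=730),
    "access_logs": RetentionRule(keep_raw_days=365, keep_aggregates_days=None),
    "user_events": RetentionRule(keep_raw_days=180, keep_aggregates_days=1825),
    # "Forever" is allowed, but it is written down as an explicit decision.
    "financial_transactions": RetentionRule(keep_raw_days=None, keep_aggregates_days=None),
}


def raw_expired(data_type: str, created: date, today: date) -> bool:
    """Return True if raw records of this type created on `created` may be deleted."""
    rule = POLICY[data_type]  # KeyError for an unknown type: no rule, no silent default
    if rule.keep_raw_days is None:
        return False
    return today - created > timedelta(days=rule.keep_raw_days)
```

The lookup deliberately fails for a data type that has no rule, so a new kind of data cannot slip into "keep forever" just because nobody decided otherwise.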
Metadata matters more than volume
The paradox of storing large volumes is that the problem is not the volume itself - it is the absence of description. If nobody knows what is in the folder called "archive_2010_final_v2", that is not an archive. It is garbage with an unknown composition.
The minimum metadata for any dataset:
- what it is (source, content, format);
- what time period it covers;
- who owns it or created it;
- when it expires and can be deleted.
Without this, within a few years any storage becomes a place people are afraid to touch.
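The four items above fit into a small record that can sit next to each dataset, for example as a JSON file. This is a minimal sketch, not a standard: the class name `DatasetMetadata`, its field names, and the sample values are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetMetadata:
    source: str          # what it is: where the data came from
    content: str         # what it is: what the records describe
    data_format: str     # what it is: how it is stored
    period_start: date   # first day covered
    period_end: date     # last day covered
    owner: str           # who owns or created it
    expires_on: date     # from this day on it can be deleted

    def __post_init__(self) -> None:
        # Reject records that would describe nothing.
        if self.period_end < self.period_start:
            raise ValueError("period_end is before period_start")
        for name in ("source", "content", "data_format", "owner"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def is_expired(self, today: date) -> bool:
        return today >= self.expires_on

    def to_json(self) -> str:
        # default=str turns the date fields into ISO strings such as "2026-01-01".
        return json.dumps(asdict(self), default=str, indent=2)


# Hypothetical example values.
meta = DatasetMetadata(
    source="billing-service",
    content="raw request logs",
    data_format="jsonl.gz",
    period_start=date(2023, 1, 1),
    period_end=date(2023, 12, 31),
    owner="platform-team",
    expires_on=date(2026, 1, 1),
)
print(meta.to_json())
```

The validation is the point of the sketch: a dataset with no owner or no expiry date should fail when it is registered, not when someone tries to work out what it is years later.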
A simple audit question
Before expanding storage capacity next time, I suggest asking one question: in the past six months, what from the archive was actually opened, by whom, and for what reason?
If there is no answer, there is no storage discipline. Adding more space only defers the problem; it does not solve it.
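If archive reads are recorded anywhere, the question can be answered mechanically. The sketch below assumes a hypothetical access record (`AccessEvent` with a dataset name, a user, a stated reason, and a timestamp) and a known list of dataset names. Neither comes from a particular product, and six months is approximated as 183 days.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AccessEvent(NamedTuple):
    dataset: str   # which archive dataset was opened
    user: str      # by whom
    reason: str    # for what reason
    at: datetime   # when


def recent_access(
    events: Iterable[AccessEvent],
    datasets: Iterable[str],
    now: datetime,
    window: timedelta = timedelta(days=183),  # roughly six months
) -> Tuple[Dict[str, List[Tuple[str, str]]], List[str]]:
    """Return (who opened what and why within the window, datasets nobody opened)."""
    cutoff = now - window
    used: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for event in events:
        if cutoff <= event.at <= now:
            used[event.dataset].append((event.user, event.reason))
    untouched = sorted(set(datasets) - set(used))
    return dict(used), untouched
```

The second return value is the more interesting one: the datasets nobody opened in the window are the first candidates for a retention review. That does not make them safe to delete, since rare reads are normal for a raw archive.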