Data July 21, 2014 3 min read

When a data warehouse becomes the bottleneck

Signs that your corporate data warehouse has stopped keeping up, and how to think about the next step.

A data warehouse was built to solve a problem: consolidate data from different systems in one place and let analysts ask questions without touching the transactional databases. For a few years it works well. Then at some point the warehouse starts holding back exactly what it was built to enable.

I have seen this scenario often enough to recognise its symptoms. The problem is not that the warehouse is bad. The problem is that the company has grown while the architecture stayed the same.

What the symptoms look like

The signs that a warehouse is stopping to keep up rarely appear all at once. They build gradually:

heavy reports take noticeably longer than they did a year ago;
analysts try to run large queries at night to avoid blocking each other;
adding a new data source has become a multi-week project;
ETL processes increasingly miss the overnight load window;
data history is truncated to a fixed horizon - "otherwise it doesn't fit."

If several of these sound familiar, the warehouse has become a bottleneck, not a tool.

Why this happens

The classic relational warehouse architecture is optimised for a specific class of tasks: regular reports, aggregations known in advance, a relatively small number of analysts. When data volume grows several times over, the number of users doubles, and tasks start to include ad hoc exploration and more complex computations - the architecture starts to strain.

Vertical scaling - buying more powerful hardware - helps, but only up to a point. And the cost on that path grows faster than the performance.

What the options are

There is no single right answer, but there are a few directions worth evaluating.

Optimising the existing system. Sometimes the problem is not the architecture but specific queries, the schema, or index configuration. Before changing platforms, it is worth understanding how much headroom there is for optimisation inside the current system.

Offloading to specialised layers. You can move specific heavy scenarios - exploratory analytics, raw data work - to a separate tool, leaving the warehouse for operational reporting.

Moving to columnar storage. Analytical workloads are fundamentally different from transactional ones. Systems that store data by column rather than by row give a significant speed advantage on aggregations over large volumes. This is no longer exotic - such systems exist and run in production.

Cloud analytical platforms. Several providers offer analytical warehouses where compute is separated from storage and scales independently. The new economics of cloud data warehousing - where you pay for what you query, not for hardware you own - is an early signal of how this model changes what analytics costs. The economics here are different and worth calculating honestly.

How to avoid making the wrong choice

Before moving toward a solution, a few questions deserve an honest answer:

What exactly is the problem - slow reports, slow loads, analyst frustration, or all three?
How much data actually needs to be in active access, and how much could be archived?
Who will own the new platform - does the team have the necessary experience?
What is the migration cost, including years of accumulated ETL logic?
What is the cost of doing nothing - what does the current slowdown actually cost?

Replacing a warehouse is an expensive and lengthy project. But sometimes continuing to patch something that no longer fits the task costs more.

Back to all posts

Contact

What the symptoms look like

Why this happens

What the options are

How to avoid making the wrong choice

If this resonated, write to me. I reply personally.