Data November 15, 2012 4 min read

Amazon Redshift and the new economics of data warehousing

Cloud MPP storage changes not just the stack but the psychology of a pilot. Why the right question is no longer "can we afford this" but "where do we start".

In late 2012, Amazon announced Redshift: a columnar MPP data warehouse, managed in the cloud, at a price point that had never existed for this class of system. A few hundred dollars a month for something that previously cost hundreds of thousands in hardware and licences.

I want to talk not about Redshift's technical specifications, but about what changes in how you approach analytics when the economics of the task shift this dramatically.

What MPP is and why it was expensive

MPP - massively parallel processing - is an architecture in which a query over a large dataset runs not on a single machine but across a cluster, where each node processes its own slice of the data and the partial results are then combined. This makes it possible to run analytics over billions of rows in seconds rather than hours. The underlying shift toward columnar storage as the basis for this speed is something I covered in "Columnar storage and a new pace of analytics".
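The "each node processes its own slice" idea can be sketched in a few lines of Python. This is only an illustration of the scatter-and-merge pattern, not how Redshift is implemented: the function names and the sample `sales` rows are made up for the example, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(rows, node_count):
    """Distribute rows round-robin so every node holds a similar share."""
    return [rows[i::node_count] for i in range(node_count)]


def node_partial_sum(slice_rows):
    """Work done on one node: aggregate only the locally stored slice."""
    totals = {}
    for region, amount in slice_rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


def mpp_sum_by_region(rows, node_count=4):
    """Scatter the rows, aggregate each slice independently, merge the partials."""
    slices = split_into_slices(rows, node_count)
    # Threads are a stand-in for separate machines (assumption for the sketch).
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_partial_sum, slices))
    # Leader step: combine the small per-node results into the final answer.
    result = {}
    for partial in partials:
        for region, amount in partial.items():
            result[region] = result.get(region, 0) + amount
    return result


# Hypothetical sample data: (region, amount) pairs.
sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
print(mpp_sum_by_region(sales))
```

The point of the pattern is that the expensive scan happens on every slice at once, while the final merge only touches one small dictionary per node.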

Before the cloud, this kind of infrastructure required hardware purchases, licences for specialised software, a team to install and configure it, and a long procurement cycle. The minimum entry cost was several hundred thousand dollars and six months of work. That meant serious analytics was available only to large companies with dedicated resources.

Redshift removes most of that barrier. A cluster launches in minutes. Pricing starts at a level that is accessible to a mid-sized business. Amazon handles the infrastructure management.

What changes in the psychology of a pilot

When the entry threshold is high, the first question is "can we afford this". That is a budgeting and justification question. It requires a business case, approval, and planning several quarters out.

When the entry threshold is low, the first question shifts to "where do we start". That is a different mindset. A pilot stops being a commitment and becomes an experiment. Instead of proving value before the work begins, you can demonstrate it within a few weeks of actual work on real data.

This matters beyond the technical dimension. It changes how decisions about analytical infrastructure get made. A CTO or data lead no longer needs to convince a CFO of the abstract value of a warehouse - they can show a concrete result on concrete data for a reasonable cost.

What cloud DWH actually changes in operations

A few things that become practically accessible for companies that could not do them before:

Historical data. Traditional OLTP databases are not designed for analytical queries over large volumes of history. A warehouse lets you store and analyse years of data without slowing down the transactional systems that produce it.

Source integration. Data from CRM, ERP, web analytics, and finance can be brought to one place and analysed on top of a unified model - rather than switching between reports from different systems that never quite agree.

SQL as a shared language. Redshift speaks a PostgreSQL-based SQL dialect, so most standard queries run with little or no change. An analyst who can write queries can start working without training on a specialised tool.

Scale to the task. The cluster can be expanded for peak loads and scaled back down. For companies with seasonality or one-off analytical projects, that is a meaningful saving.
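To make the source integration and shared SQL points concrete, here is a minimal sketch of one query over data from two systems. It uses Python's built-in `sqlite3` as a local stand-in for the warehouse. The table names (`crm_customers`, `finance_invoices`), the columns, and the figures are all hypothetical, chosen only for illustration.

```python
import sqlite3


def build_demo_warehouse():
    """Load two hypothetical source extracts into one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE crm_customers (
            customer_id INTEGER PRIMARY KEY,
            segment     TEXT
        );
        CREATE TABLE finance_invoices (
            invoice_id  INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount      INTEGER
        );
        INSERT INTO crm_customers VALUES (1, 'enterprise'), (2, 'smb'), (3, 'smb');
        INSERT INTO finance_invoices VALUES
            (10, 1, 500), (11, 2, 120), (12, 3, 80), (13, 1, 300);
        """
    )
    return conn


# Plain SQL: join the finance data to the CRM data on a shared key.
REVENUE_BY_SEGMENT = """
    SELECT c.segment, SUM(i.amount) AS revenue
    FROM finance_invoices AS i
    JOIN crm_customers AS c ON c.customer_id = i.customer_id
    GROUP BY c.segment
    ORDER BY revenue DESC
"""


def revenue_by_segment(conn):
    """Return (segment, revenue) rows, highest revenue first."""
    return conn.execute(REVENUE_BY_SEGMENT).fetchall()


conn = build_demo_warehouse()
print(revenue_by_segment(conn))
```

Once both sources sit behind one key, the question "revenue by customer segment" is a single join rather than a reconciliation of two reports.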

What it does not solve

An honest conversation also has to acknowledge that a cloud DWH does not eliminate data work - it only simplifies it at the infrastructure level.

Data quality is still determined by what gets loaded into the warehouse. If the sources are disordered, Redshift will process chaos quickly - which is no better than processing chaos slowly. The data model still requires design. ETL processes still need to be built and maintained.
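A small pre-load check shows the kind of discipline that stays on your side of the line. This is a sketch under assumed rules: the field names (`order_id`, `amount`, `order_date`) and the three checks are hypothetical examples, not a complete validation framework.

```python
from datetime import date


def find_quality_issues(rows):
    """Return (row_index, problem) pairs for rows that should not be loaded as-is."""
    issues = []
    seen_ids = set()
    for index, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id is None:
            issues.append((index, "missing order_id"))
        elif order_id in seen_ids:
            issues.append((index, "duplicate order_id"))
        else:
            seen_ids.add(order_id)

        amount = row.get("amount")
        if amount is None or amount < 0:
            issues.append((index, "missing or negative amount"))

        if row.get("order_date") is None:
            issues.append((index, "missing order_date"))
    return issues


# Hypothetical extract from a source system.
sample_rows = [
    {"order_id": 1, "amount": 100, "order_date": date(2012, 11, 1)},
    {"order_id": 1, "amount": 50, "order_date": date(2012, 11, 2)},
    {"order_id": 2, "amount": -5, "order_date": None},
]

for index, problem in find_quality_issues(sample_rows):
    print(f"row {index}: {problem}")
```

Checks like these are cheap to run before every load, and they catch the disorder at the source instead of in a dashboard.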

Redshift removes the infrastructure barrier. The analytical barrier - understanding the business, data quality, update discipline - remains.

Questions to assess your readiness

Before launching a pilot with a cloud warehouse, it is worth answering a few questions:

What specific analytical questions can we not answer today because of infrastructure constraints?
Is there data worth consolidating in one place that is currently scattered across systems?
Who will write the queries and build the data model?
How will we keep the data current - is this a one-time load or an ongoing process?

If you have answers to these questions, the pilot makes sense - and its cost is no longer the obstacle.
