Data September 27, 2013 4 min read

Spark, Hadoop, or MPP: choose the workload type, not the brand

Different tasks need different computation and storage modes. Getting the platform wrong costs more than it appears at the start.

Conversations about big data often contain the same question: "Are you on Hadoop or not?" Or: "We need Hadoop - we heard it is for big data."

That is roughly like asking a doctor: "We need antibiotics - we heard they cure things." Perhaps. Depends on the diagnosis.

Hadoop, MPP databases, and in-memory engines like Spark are different tools with different characteristics. The right choice is determined by the workload, not by what a market-trends article says. The question of when a Hadoop cluster actually makes sense versus a DWH is the starting point; this post extends that argument as Spark enters the picture and the workload classification becomes more nuanced.

Three workload classes, three modes of operation

Before talking about tools, it helps to classify the task.

Batch processing of large volumes - jobs where you need to process terabytes or petabytes, where response time is measured in hours but reliability and scalability matter. Historical recalculations, raw log reprocessing, data transformation for warehouses. Hadoop MapReduce was designed precisely for this - cheap horizontal scaling on commodity hardware with fault tolerance through replication.

Interactive analytics - queries where an analyst waits seconds or minutes for an answer. Slices across large tables, aggregations, ad-hoc analysis. Here MPP architecture - Teradata, Greenplum, Vertica - works fundamentally differently: data is stored in columnar format, queries are parallelised across nodes. The interactive response speed is incomparably faster than MapReduce - a consequence of columnar storage and the new pace of analytics it enables.

Iterative computation - machine learning, graph algorithms, tasks where the same data needs to be read many times within a single computation. MapReduce is catastrophically inefficient here: every iteration writes results to disk and reads from disk. Spark addresses this by keeping intermediate results in memory. On machine learning tasks, the speed difference can be an order of magnitude.

Why one platform does not cover all workloads

Trying to use one tool for all three workload classes is an expensive mistake. Not because the tool is bad, but because it was never designed to do everything at once.

Hadoop with MapReduce is slow for interactive analytics. An analyst waiting twenty minutes instead of twenty seconds conducts fewer explorations. That is a direct productivity loss.

An MPP database is excellent for analytical queries but expensive for storing large volumes of raw unstructured logs. Keeping terabytes of raw data in Teradata is either very expensive, or forces aggressive pre-filtering before loading, which loses detail.

Both platforms are inefficient for iterative ML tasks, for the reasons described above.

Mature data platform architectures use different tools for different layers: cheap raw storage, MPP for analytics, specialised compute for ML. The separation of ODS, data marts, and DWH is the structural expression of this principle.

What to ask before choosing a platform

The right conversation starts not with "which platform should we choose" but with describing the workload. Several useful questions:

What is the current data volume and what is the growth rate? Terabytes versus hundreds of gigabytes call for different solutions.
What is the acceptable latency for analytical queries? Seconds, minutes, or hours are fundamentally different requirements.
What are the tasks - one-time batch calculations, regular analytics, or iterative algorithms?
What is the budget for hardware and licensing? Some MPP solutions are only justified at a certain scale.
Does the team have expertise in a specific platform? The cost of training and hiring is part of the cost of the solution.

One practical guide

If a company is just starting to build analytical infrastructure and data volumes do not exceed a few terabytes - it probably does not need Hadoop or an expensive MPP system right now. A properly chosen relational database with a well-thought-out schema covers most analytical needs for small and mid-sized businesses.

Hadoop and specialised platforms make sense when data volumes or requirements for speed and flexibility genuinely push beyond what standard solutions can handle. Not before.

Choosing a platform "because competitors use it" or "because everyone is talking about it" is not an architectural decision. It is following a trend at the expense of the budget.

Back to all posts

Contact

Three workload classes, three modes of operation

Why one platform does not cover all workloads

What to ask before choosing a platform

One practical guide

If this resonated, write to me. I reply personally.