Is MapReduce enough: where the batch model starts hurting business
Some business scenarios already demand a different computation tempo. Not because batch is bad, but because the latency has become too expensive.
MapReduce solved a real problem. When Google described the model in 2004 and Hadoop implemented it in open source, there was finally a way to process terabytes of data on a cluster of commodity machines. This was not an academic breakthrough - it was a practical solution that changed how companies think about large-scale data.
By 2013, Hadoop had become a standard element of the analytics architecture for anyone working with serious data volumes. In most cases it still does the job.
But a number of companies are running into tasks that the batch model serves poorly - not because data volumes have grown, but because the latency has become too expensive.
What "batch model" actually means
MapReduce and the systems built on top of it work like this: data accumulates, a job runs, and some time later - minutes or hours - a result appears. That result reflects the state of the world at the moment the job started, not at the moment the result is read. This is a fundamental limit of the batch model, and it is the main difference from the streaming alternative.
For most analytics tasks this is fine. Last month's report, cohort analysis, scoring calculations - all of this comfortably waits for the batch cycle.
Latency starts costing money when a decision needs to be made before the next batch cycle delivers the answer.
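To make the delay concrete, here is a minimal sketch of how stale a batch result can get. The class name, field names and the nightly schedule are illustrative assumptions, not figures from any real deployment.

```python
from dataclasses import dataclass


@dataclass
class BatchSchedule:
    """Hypothetical batch schedule; both values are in minutes."""
    interval_min: float  # time between two job starts
    runtime_min: float   # time one job needs to finish

    def best_case_age_min(self) -> float:
        # The result describes the world at job start, so it is already
        # runtime_min old at the moment it is published.
        return self.runtime_min

    def worst_case_age_min(self) -> float:
        # The same result stays in use until the next job finishes,
        # one full interval later.
        return self.interval_min + self.runtime_min


# Assumed example: a nightly job that takes two hours to run.
nightly = BatchSchedule(interval_min=24 * 60, runtime_min=120)
print(nightly.best_case_age_min(), nightly.worst_case_age_min())
```

Under these assumptions a decision made just before the next result arrives relies on data that is 26 hours old. Whether that matters depends entirely on the task.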
Where latency is already costing money
A few real task classes where this shows up:
Fraud detection. If a transaction is flagged as suspicious two hours after the money has moved, the information exists but the moment for action is gone. Cutting the latency to a few seconds changes the situation entirely.
Real-time recommendations. If a user is looking at a product right now, but the recommendation system updates once a night, it knows nothing about their current behaviour. Relevance drops precisely when the user is most active.
Operational monitoring. If an anomaly in a system is discovered from a nightly report rather than at the moment it occurs, the cost of the incident grows with every hour of delay.
Dynamic pricing. If prices are recalculated once a day but market conditions change faster, the company either overpays or under-earns at every update cycle.
What emerged in response
The tooling market responded. In 2011 Twitter open-sourced Storm, designed specifically for stream processing of events. LinkedIn released Kafka as infrastructure for high-throughput data streams.
This is not a replacement for MapReduce. It is a different tool for a different class of problems.
Stream processing takes a different approach: each event is processed as it arrives, or with minimal delay. This is technically harder. There is no final moment when all data is guaranteed to be present, and delivery and ordering guarantees are more difficult to provide. But for tasks where latency is more expensive than complexity, it is the right model.
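A minimal sketch of the per-event idea, in plain Python rather than the Storm or Kafka APIs. The class, the 60-second window and the threshold of three transactions are all hypothetical. The sketch also assumes events for one card arrive in timestamp order, which is exactly the kind of guarantee a real streaming system has to work for.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flags a card that makes too many transactions within a short window."""

    def __init__(self, window_sec: float = 60.0, max_events: int = 3):
        self.window_sec = window_sec
        self.max_events = max_events
        self._recent = defaultdict(deque)  # card_id -> timestamps still in the window

    def on_event(self, card_id: str, ts: float) -> bool:
        """Handle one event as it arrives; return True if it looks suspicious."""
        recent = self._recent[card_id]
        recent.append(ts)
        # Drop timestamps that have fallen out of the window.
        while recent and recent[0] <= ts - self.window_sec:
            recent.popleft()
        return len(recent) > self.max_events


check = VelocityCheck(window_sec=60, max_events=3)
events = [("card-1", 0), ("card-1", 10), ("card-2", 12),
          ("card-1", 20), ("card-1", 30), ("card-1", 200)]
flags = [check.on_event(card, ts) for card, ts in events]
print(flags)
```

The decision is available the moment the fourth transaction of card-1 arrives, not after the next batch run. The price is state that must be kept in memory and an ordering assumption that a batch job would not need.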
How to think about the choice
There is no reason to replace Hadoop with a streaming system if latency is not a problem. That would be an unjustifiably complex and expensive move.
The right question is different: do we have tasks where a decision needs to be made before the batch cycle can deliver the answer? And if so - what does that latency actually cost, in money or missed opportunity?
If the answer is "nothing significant", the batch model is handling it fine and no change is needed. If the answer is "a few percentage points of conversion" or "some fraction of fraudulent transactions get through", it is worth looking at a hybrid approach.
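The question can be turned into rough arithmetic. The sketch below is a back-of-the-envelope estimate only: the function name and every number in it are made-up assumptions, to be replaced with figures from your own business.

```python
def latency_cost_per_day(events_per_day: int,
                         share_lost_to_delay: float,
                         loss_per_event: float) -> float:
    """Rough daily cost of decisions that arrive too late.

    share_lost_to_delay is the fraction of events where the batch answer
    came after the moment for action had passed.
    """
    return events_per_day * share_lost_to_delay * loss_per_event


# Hypothetical inputs: 200,000 transactions a day, 0.1% of them fraudulent
# and missed because of the delay, average loss of 80 per missed case.
daily_cost = latency_cost_per_day(200_000, 0.001, 80.0)
print(f"{daily_cost:.0f} per day")
```

If the resulting figure is small compared with the cost of building and running a streaming layer, the batch model is still the right answer.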
Most mature analytics architectures end up combining both layers: batch for historical analysis, stream for operational decisions. The choice is not either-or. It is which tool for which purpose.
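One way the two layers can fit together is sketched below. It is an illustration under stated assumptions, not a description of any particular product. The batch layer recomputes counts from history up to a cutoff, the stream layer counts only what arrived after the cutoff, and a query adds the two. All names and the sample events are hypothetical.

```python
from collections import Counter


def build_batch_view(events, cutoff):
    """Batch layer: recompute counts from all events up to the cutoff."""
    return Counter(product for ts, product in events if ts <= cutoff)


class StreamLayer:
    """Stream layer: counts only events newer than the last batch cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.delta = Counter()

    def on_event(self, ts, product):
        if ts > self.cutoff:
            self.delta[product] += 1


def query(view, stream, product):
    """Serve a count that covers both history and recent events."""
    return view[product] + stream.delta[product]


history = [(1, "A"), (2, "B"), (3, "A")]  # (timestamp, product viewed)
view = build_batch_view(history, cutoff=3)

stream = StreamLayer(cutoff=3)
for ts, product in [(4, "A"), (5, "C")]:
    stream.on_event(ts, product)

print(query(view, stream, "A"), query(view, stream, "B"), query(view, stream, "C"))
```

When the next batch job finishes, its view replaces the old one and the stream layer restarts from the new cutoff, so any error accumulated in the fast path is corrected by the slow one.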