Data October 15, 2014 4 min read

Streaming data pipelines: why batch processing is starting to show its limits

What changes when you move from nightly batch jobs to continuous data streams, and when that shift is worth the added complexity.

For most companies, data moves in batches. The sales system exports a file at midnight, the warehouse picks it up, runs some transformations, and by morning the reports are updated. This has been the standard operating model for analytics for decades, and for many use cases it still works fine.

But there is a growing category of problems where "by morning" is too late, and the batch model starts to bend under requirements it was not designed to meet. Whether a business actually needs near-real-time data, or whether 15-minute batches are perfectly fine, depends on which decisions need to be made and how quickly.

What batch processing assumes

The batch model assumes that the world can be described in snapshots. You take a picture of your data at a point in time, process that picture, and produce outputs. The outputs are as fresh as the last snapshot.

This works well when the questions you need to answer are backward-looking: what happened last week, what was the total last quarter, which customers bought what over the past month. Analytics of that kind has natural time horizons that fit nightly or hourly batches.

It starts to fail when the questions become real-time: is this transaction fraudulent right now, has this machine behaved anomalously in the last five minutes, what is this user looking at and what should we show them next. For those questions, yesterday's batch run is structurally irrelevant.

What streaming architecture offers

A streaming architecture, rather than collecting data and processing it in bulk at intervals, processes each event as it arrives. The pipeline is always running. Results are updated continuously.
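To make the contrast concrete, here is a minimal sketch in plain Python, with no streaming framework involved. The event shape (a customer id and an amount) and both function names are assumptions made for illustration. The batch function returns one answer after reading the whole snapshot; the streaming function emits an updated total after every event.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (customer_id, amount).
Event = Tuple[str, float]


def batch_totals(events: Iterable[Event]) -> Dict[str, float]:
    """Batch style: consume the whole snapshot, then return one result."""
    totals: Dict[str, float] = {}
    for customer, amount in events:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def streaming_totals(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Streaming style: update state per event and emit the new total right away."""
    totals: Dict[str, float] = {}
    for customer, amount in events:
        totals[customer] = totals.get(customer, 0.0) + amount
        yield customer, totals[customer]


sample = [("a", 10.0), ("b", 5.0), ("a", 2.5)]
print(batch_totals(sample))            # {'a': 12.5, 'b': 5.0}
print(list(streaming_totals(sample)))  # [('a', 10.0), ('b', 5.0), ('a', 12.5)]
```

The arithmetic is identical in both versions. What differs is when a result becomes visible: once at the end, or after each event.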

Apache Kafka, which LinkedIn open-sourced in 2011 and which by 2014 had become the most-discussed tool in this space, is built around this model. It acts as a durable, high-throughput log of events. Consumers read from that log in near-real time, and because the log is retained, multiple downstream systems can process the same stream independently without interfering with each other.

The practical implication is that you can have fraud detection, analytics, and operational alerting all reading from the same event stream, each at its own pace, without any of them blocking the others.
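The idea of independent readers can be sketched with a toy in-memory log. This is not Kafka's API: `EventLog`, `append`, and `poll` are hypothetical names, and a real cluster adds partitions, replication, and persistence. The sketch only shows the core mechanism, which is that the log keeps every event and each consumer tracks its own read position.

```python
from typing import Dict, List


class EventLog:
    """Toy append-only log, a rough stand-in for a single retained partition."""

    def __init__(self) -> None:
        self._events: List[str] = []
        # Consumer name -> offset of the next event that consumer will read.
        self._offsets: Dict[str, int] = {}

    def append(self, event: str) -> int:
        """Store the event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str, max_events: int = 10) -> List[str]:
        """Return up to max_events unread events for this consumer only."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for event in ["t1", "t2", "t3"]:
    log.append(event)

print(log.poll("fraud", max_events=1))  # ['t1']
print(log.poll("analytics"))            # ['t1', 't2', 't3']
print(log.poll("fraud"))                # ['t2', 't3']
```

Reading never removes anything from the log, so the slower fraud consumer and the faster analytics consumer each see every event without affecting the other.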

The complexity that comes with it

Streaming pipelines are not a strict improvement over batch jobs. They are a different trade-off. Batch jobs are easier to reason about, easier to rerun when something goes wrong, and easier to test: you can inspect the input file, run the job, and check the output. Whether real-time is justified, or a nightly batch is the honest choice, depends on latency requirements that most teams underspecify before committing to streaming infrastructure.

A streaming pipeline is always in motion. Debugging it requires different tools and different mental models. Handling late-arriving events - data that arrives out of order because of network delays or retries - requires explicit design decisions that a batch job, which waits for the whole file before it starts, rarely has to make.
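One common way to handle late events is to count them in event-time windows and accept stragglers only for a limited grace period. The sketch below assumes one-minute tumbling windows and a 30-second grace period; both numbers, and the `WindowedCounter` class itself, are illustrative choices rather than anything a particular framework prescribes.

```python
from collections import defaultdict
from typing import Dict, List

WINDOW_SECONDS = 60    # assumed tumbling window size
ALLOWED_LATENESS = 30  # assumed grace period for late events


class WindowedCounter:
    """Counts events per event-time window and drops events that arrive too late."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = defaultdict(int)  # window start -> count
        self.max_event_time = 0      # highest event timestamp seen so far
        self.dropped: List[int] = []  # timestamps rejected as too late

    def add(self, event_time: int) -> None:
        self.max_event_time = max(self.max_event_time, event_time)
        # The watermark trails the newest timestamp by the grace period.
        watermark = self.max_event_time - ALLOWED_LATENESS
        window_start = event_time - (event_time % WINDOW_SECONDS)
        window_end = window_start + WINDOW_SECONDS
        if window_end <= watermark:
            # The window closed before this event showed up.
            self.dropped.append(event_time)
            return
        self.counts[window_start] += 1


counter = WindowedCounter()
for timestamp in [10, 70, 50, 130, 20]:  # arrival order, not event order
    counter.add(timestamp)

print(dict(counter.counts))  # {0: 2, 60: 1, 120: 1}
print(counter.dropped)       # [20]
```

The event stamped 50 arrives out of order but within the grace period, so it still lands in the first window. The event stamped 20 arrives after that window has closed and is set aside. Whether to drop such events, route them elsewhere, or reopen the window is exactly the kind of decision the pipeline designer has to make up front.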

The operational overhead is also higher. Kafka clusters require tuning and monitoring. Failure modes are more complex than "the job failed at 2 AM."

When it is worth considering

The threshold I use: if your business is losing something measurable because key decisions are made on data that is hours old when it could be minutes old, streaming is worth the conversation. If the main reason you want it is that it sounds more modern than batch jobs, it is probably not worth the operational complexity at your current scale.

Fraud detection, real-time inventory management, live operational dashboards for manufacturing or logistics - these are cases where the latency reduction is genuinely valuable. Standard analytics reporting for management is usually not.

A practical note on getting there

Very few teams jump directly from batch to full streaming. The more common path is a hybrid: keep existing batch processes for stable reporting, and add streaming only for the specific use cases where real-time matters. Kafka's log-retention model actually supports this well - you can run batch consumers alongside streaming consumers on the same data.

That incremental path is less dramatic but much more manageable.

