Data September 11, 2018 3 min read

Streaming data: when you need it and when batch is enough

How to decide whether your company needs streaming data processing, or whether that is unnecessary complexity for tasks that batch loading handles perfectly well.

Kafka, Apache Flink, real-time stream processing - all of this sounds like the hallmark of a modern data architecture. I often hear from companies that want "streams" but cannot always say exactly why.

Stream processing solves real problems, but it is more expensive to build and operate than batch. And most tasks that companies describe as "we need real-time data" are actually fine with data that is a few minutes old, or at most a few hours old - which a batch approach handles perfectly well.

What "real time" actually means

In conversations with business stakeholders, "real time" often means "faster than now". If data currently updates once a day, "real time" might reasonably mean once an hour or every fifteen minutes. That is not streaming - it is just more frequent batching.
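More frequent batching usually means running the same incremental job on a shorter schedule. Here is a minimal sketch, assuming a hypothetical source with an `updated_at` column and an in-memory list standing in for the source table. The names `run_incremental_batch`, `source`, and `mark` are illustrative and do not come from any particular tool.

```python
from datetime import datetime, timedelta

def run_incremental_batch(source_rows, high_water_mark, now):
    """Return rows changed in (high_water_mark, now] and the new mark.

    source_rows: iterable of dicts with an 'updated_at' datetime (assumed schema).
    """
    batch = [r for r in source_rows if high_water_mark < r["updated_at"] <= now]
    return batch, now

# Simulate a scheduler firing every 15 minutes (cron or an orchestrator in practice).
start = datetime(2018, 9, 11, 9, 0)
source = [
    {"id": 1, "updated_at": start + timedelta(minutes=5)},
    {"id": 2, "updated_at": start + timedelta(minutes=20)},
    {"id": 3, "updated_at": start + timedelta(minutes=29)},
]

mark = start
loaded = []  # ids picked up by each run
for tick in range(1, 3):
    now = start + timedelta(minutes=15 * tick)
    batch, mark = run_incremental_batch(source, mark, now)
    loaded.append([r["id"] for r in batch])

print(loaded)  # [[1], [2, 3]]
```

Going from daily to every fifteen minutes changes only the schedule and the size of each batch; the job itself stays an ordinary batch job.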

True stream processing is needed where a delay of minutes is unacceptable for business reasons: detecting fraudulent transactions, reacting to events in equipment management systems, security monitoring with immediate response.

If a business decision is made once a day or once an hour, data with a 15-minute lag and data with a seconds-level lag produce the same outcome.

The cost of streaming architecture

Streaming systems are more complex than batch in several important ways.

First, they are harder to test. A batch job can be run locally on a sample of data and the result verified. A streaming pipeline requires simulating the stream, handling out-of-order messages, and verifying behaviour under delays.
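To make the testing gap concrete: a batch count can be checked with one function call on a sample, while the streaming version of the same count also has to decide what to do with events that arrive late. Below is a rough sketch, assuming one-minute tumbling windows and a fixed allowed lateness. `TumblingCounter` and its parameters are hypothetical and are not the API of Flink, Kafka Streams, or Spark.

```python
from collections import Counter

WINDOW = 60  # window size in seconds (assumed)

def batch_count(events):
    """Batch version: all events are present, arrival order is irrelevant."""
    return dict(Counter(ts // WINDOW * WINDOW for ts in events))

class TumblingCounter:
    """Streaming version: counts per window and finalises windows behind a watermark."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.open = Counter()  # window start -> count, still accepting events
        self.closed = {}       # window start -> final count
        self.dropped = 0       # events that arrived after their window was final
        self.max_ts = 0        # highest event time seen so far

    def on_event(self, ts):
        window = ts // WINDOW * WINDOW
        watermark = self.max_ts - self.allowed_lateness
        if window + WINDOW <= watermark:
            self.dropped += 1  # too late: this window is already final
            return
        self.open[window] += 1
        self.max_ts = max(self.max_ts, ts)
        watermark = self.max_ts - self.allowed_lateness
        # Finalise every open window that now lies entirely behind the watermark.
        for w in [w for w in self.open if w + WINDOW <= watermark]:
            self.closed[w] = self.open.pop(w)

# Event times in arrival order; 30 and 10 arrive out of order.
arrivals = [5, 50, 70, 30, 200, 10]
stream = TumblingCounter(allowed_lateness=30)
for ts in arrivals:
    stream.on_event(ts)

print(batch_count(arrivals))          # {0: 4, 60: 1, 180: 1}
print(stream.closed, stream.dropped)  # {0: 3, 60: 1} 1
```

The batch result has one correct answer. The streaming result depends on arrival order and on the lateness setting, so a test suite has to cover those combinations as well.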

Second, they require a different operational model. Kafka, Flink, and Spark Streaming are separate systems that need to be deployed, monitored, and maintained. This is additional infrastructure overhead.

Third, "exactly once" semantics - the guarantee that each event is processed exactly once, no more and no less - is technically challenging. Getting it wrong leads to duplicate data or lost events.
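One common way to approximate exactly-once results is at-least-once delivery combined with an idempotent consumer. Here is a minimal sketch, assuming every event carries a unique `event_id` and that the set of processed ids and the running total are stored together (here simply held in memory). `IdempotentSum` is a hypothetical name used only for illustration.

```python
class IdempotentSum:
    """Applies each event at most once, even if it is delivered several times."""

    def __init__(self):
        self.seen = set()  # ids already applied; durable storage in a real system
        self.total = 0

    def handle(self, event_id, amount):
        if event_id in self.seen:
            return False   # duplicate delivery: ignore it
        # In a real system these two updates must be committed atomically;
        # a crash between them would lose or double-count the event.
        self.seen.add(event_id)
        self.total += amount
        return True

# At-least-once delivery: event "b" is redelivered after a simulated retry.
deliveries = [("a", 10), ("b", 5), ("b", 5), ("c", 1)]

naive_total = sum(amount for _, amount in deliveries)

consumer = IdempotentSum()
for event_id, amount in deliveries:
    consumer.handle(event_id, amount)

print(naive_total, consumer.total)  # 21 16
```

The deduplication itself is trivial. The hard part is the comment in the middle: keeping the id set and the result consistent across failures.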

Where batch is the right choice

Batch processing is not an outdated approach. For most analytical tasks it remains the right choice:

daily and weekly reports;
loading data from external systems for analytics;
training machine learning models;
aggregations and data transformations for the warehouse;
periodic synchronisation between systems.

If data is needed for decisions made on a schedule, batch covers this without unnecessary complexity.

Questions for making the decision

What business decision depends on this data, and how often is it made?
What is the cost of a data delay in minutes? In hours?
Is there a specific scenario where a delay of even a few minutes is already unacceptable?
Are we prepared for the operational overhead of maintaining streaming infrastructure?
Is there a simpler way to get the required speed - for example, more frequent batching at a short interval?
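
The questions above can be reduced to a rough rule of thumb. The sketch below is one way to encode it. The function name, the one-minute threshold, and the returned labels are assumptions made for illustration, not fixed rules.

```python
from datetime import timedelta

def suggest_approach(max_tolerable_lag, current_batch_interval,
                     streaming_threshold=timedelta(minutes=1)):
    """Suggest an approach from the largest data delay the business can accept."""
    if max_tolerable_lag < streaming_threshold:
        return "streaming"            # seconds matter, e.g. fraud detection
    if max_tolerable_lag < current_batch_interval:
        return "more frequent batch"  # shorten the schedule first
    return "current batch"            # the existing schedule is already enough

daily = timedelta(days=1)
print(suggest_approach(timedelta(seconds=2), daily))   # streaming
print(suggest_approach(timedelta(minutes=15), daily))  # more frequent batch
print(suggest_approach(timedelta(days=1), daily))      # current batch
```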

If there is no concrete scenario where seconds matter, the case for streaming architecture has not yet been made.
