Data May 22, 2019 3 min read

Stream processing is an operations question, not an architecture question

Why the decision to move to streaming should start with understanding operational load, not with choosing a technology.

Conversations about stream processing almost always start with technology: Kafka, Flink, Spark Streaming. Someone on the team heard that large companies do it this way, or an architect insists that "batch is outdated." The decision is made, the stack is chosen, and only then does it become clear what the team has signed up for.

This is the wrong order. Moving to stream processing is primarily an operational decision. And before making it, several questions need to be answered that have nothing to do with platform selection.

Why streaming is harder than batch in operational terms

Batch processing is operationally simple: a job runs, finishes, and leaves a result. If something goes wrong, the failure is visible right away, the job can be rerun, and debugging is relatively easy.

A streaming system runs continuously. It must always be up. A failure means processing has stopped and a backlog has accumulated. Recovery requires not just a restart, but understanding from which point to recover, how to handle deduplication, how to avoid losing data or counting it twice.

State in a streaming system is significantly more complex than in a batch job. Delivery guarantees, exactly-once semantics, offset management - all of this is real operational overhead that requires expertise and constant attention.
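To make the recovery problem concrete, here is a minimal sketch in Python. It is not tied to Kafka, Flink, or any real client library: the `Event` and `Consumer` classes, the `commit_every` parameter, and the simulated crash are all made up for illustration. It assumes at-least-once delivery, a producer that assigns unique event ids, and a consumer whose state survives a restart while its committed offset lags behind.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    offset: int     # position of the event in the log (also its list index)
    event_id: str   # unique id assigned by the producer (an assumption)
    amount: int


@dataclass
class Consumer:
    """Stands in for a consumer whose state lives in a durable store."""
    committed_offset: int = 0                    # where a restart resumes
    seen_ids: set = field(default_factory=set)   # ids already applied
    total: int = 0                               # the running result

    def process(self, log, commit_every=2, crash_at=None):
        for event in log[self.committed_offset:]:
            if crash_at is not None and event.offset == crash_at:
                raise RuntimeError("simulated crash")
            # Deduplicate: replayed events must not be counted twice.
            if event.event_id not in self.seen_ids:
                self.seen_ids.add(event.event_id)
                self.total += event.amount
            # Offsets are committed only periodically, so they lag the state.
            if (event.offset + 1) % commit_every == 0:
                self.committed_offset = event.offset + 1


def run_demo():
    log = [Event(i, f"evt-{i}", (i + 1) * 10) for i in range(5)]
    consumer = Consumer()
    try:
        consumer.process(log, crash_at=3)
    except RuntimeError:
        pass  # offset 2 was applied, but only offsets 0 and 1 were committed
    state_after_crash = (consumer.committed_offset, consumer.total)
    consumer.process(log)  # restart: replays offset 2, then continues
    return state_after_crash, consumer.total
```

After the simulated crash the consumer resumes from offset 2, which it had already applied. Without the `seen_ids` check that event would be counted twice. Someone on the team has to own exactly this kind of reasoning in production, for every stateful step in the pipeline.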

When streaming is genuinely needed

Streaming is justified when the business has a real latency requirement - not "it would be nice to see data faster" but a specific business task that stops working when delay exceeds a few minutes.

Examples where this is genuinely true: real-time fraud monitoring, where every minute of delay is money. Operational dashboards where a manager is making decisions right now. Event-driven responses - notifications, triggers, automated actions where delay changes the outcome.

Examples where streaming is unnecessary: analytics over historical periods. Nightly reports. ML training that itself takes hours. Data synchronisation where a five-minute delay is indistinguishable from instant for the user.

What to decide before choosing a platform

First question: what is the required latency, and where does that requirement come from? Who from the business set the number, and what happens if it is not met?

Second: does the team have people with experience operating streaming systems? Not developing them - operating them. These are different skills.

Third: what is the acceptable data loss in the event of a system failure? This determines the delivery guarantee requirements and the overall system complexity.

Fourth: what does recovery look like after a serious failure? Can history be replayed from a specific point?

Fifth: what happens to downstream systems if the stream stops? This concerns not just the streaming component but everything that depends on it.
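For the first question, it helps to check whether the stated number actually rules out batch. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes a batch job that starts on a fixed schedule and publishes its output only when the run finishes. The function names are illustrative.

```python
from datetime import timedelta


def worst_case_batch_latency(interval, run_duration):
    """An event that arrives just after a run has read its input waits one
    full interval, then the duration of the next run, before it is visible."""
    return interval + run_duration


def batch_meets_requirement(required, interval, run_duration):
    """True if a scheduled batch stays within the required latency."""
    return worst_case_batch_latency(interval, run_duration) <= required
```

For example, a job scheduled every 5 minutes that takes 2 minutes to run has a worst case of about 7 minutes. That satisfies a 10-minute requirement but not a 30-second one.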

A practical recommendation

If your team has no experience operating streaming systems and there is no hard business latency requirement that batch cannot meet - do not start with streaming. A well-built batch process running every few minutes covers most real business needs and costs significantly less to operate.
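One common shape for such a batch process is an incremental run driven by a watermark: each run picks up only the rows changed since the previous run. The sketch below assumes rows are dictionaries with an `updated_at` timestamp. The `run_batch` name and the row layout are hypothetical, and scheduling (cron or similar) is left out.

```python
from datetime import datetime


def run_batch(rows, watermark):
    """Return the rows changed since the watermark and the new watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    if not new_rows:
        return [], watermark  # nothing new, so the watermark stays put
    return new_rows, max(r["updated_at"] for r in new_rows)


rows = [
    {"id": 1, "updated_at": datetime(2019, 5, 22, 12, 0)},
    {"id": 2, "updated_at": datetime(2019, 5, 22, 12, 3)},
    {"id": 3, "updated_at": datetime(2019, 5, 22, 12, 7)},
]
picked, watermark = run_batch(rows, datetime(2019, 5, 22, 12, 1))
```

Rerunning with the same watermark returns the same rows, which is what keeps recovery simple: a failed run is just run again. This version ignores rows that arrive late with an older timestamp, so a real job would need a policy for those.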

Streaming is a powerful tool for specific problems. But like any powerful tool, it requires understanding when to use it and how to manage it. Start with the task, not the technology.

