Data March 1, 2013 4 min read

Processing speed as an argument: what faster data computation changes

Data engineering is finding a new balance between batch and faster processing - and that changes which problems are solvable at all.

For a long time the standard answer to "how do you process a large volume of data" was batch processing: accumulate data through the day, run the calculations overnight, look at the results in the morning. That worked because there were practically no alternatives - and because most tasks ran fine on that cycle.

Tools and approaches are now emerging that shift this balance. They do not replace batch processing - it remains the right choice for many tasks. But they extend the range of what is technically and economically achievable.

What the problem with batch processing was

Batch processing is well suited to tasks where data freshness is not critical. Nightly reports, monthly exports, period aggregations - all of this works fine in batch.

The problem starts where delay has a cost. Fraud detected 24 hours after the transaction is detected too late. An operational anomaly in a production process that is noticed only the following morning means a stopped shift or defective output. A recommendation built on yesterday's data is a missed moment.

Where delay has a cost, the batch approach creates a structural constraint. You can compensate for it organisationally, but you cannot remove it.

What faster processing gives you

Tools that process data significantly faster than a batch cycle open up several classes of problems:

Real-time operational monitoring. Detecting deviations in an event stream while they are still relevant. Applicable in manufacturing, financial operations, logistics.

Low-latency aggregation. Dashboards that show state not as of "yesterday" but as of "the last few minutes." For certain management tasks, this changes the nature of decision-making.

Iterative computation over accumulated data. When the result of a new calculation is needed in minutes rather than tomorrow, that affects which hypotheses are worth testing in the first place.
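To make the first class of problems concrete, here is a minimal sketch of deviation detection over an event stream. It is an illustration, not a production design: the class name, the window size, the threshold of three standard deviations, and the sample readings are all assumptions chosen for the example.

```python
from collections import deque
from math import sqrt


class RollingDeviationDetector:
    """Flags values that deviate strongly from a rolling window of recent values."""

    def __init__(self, window_size=20, threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window_size)  # keeps only the most recent values
        self.threshold = threshold               # assumed: std devs that count as a deviation
        self.min_samples = min_samples           # do not judge before enough history exists

    def observe(self, value):
        """Return True if `value` deviates from the window, then record the value."""
        is_anomaly = False
        if len(self.window) >= self.min_samples:
            mean = sum(self.window) / len(self.window)
            variance = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = sqrt(variance)
            if std > 0 and abs(value - mean) > self.threshold * std:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly


# Hypothetical sensor readings; the eighth value is an obvious spike.
detector = RollingDeviationDetector(window_size=10, threshold=3.0)
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 25.0, 10.0]
flags = [detector.observe(r) for r in readings]
```

The point of the sketch is the timing: each value is judged as it arrives, so the spike is flagged immediately rather than in the next morning's report.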

Where this applies and where it does not

Faster processing is not a universal improvement. It has its own trade-offs.

It is harder to engineer. A batch pipeline is easier to verify, easier to replay after a failure, easier to debug. Stream processing requires more careful handling of state, event ordering, and late arrivals.
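The three concerns named above (state, event ordering, late arrivals) can be seen in even a very small example. The sketch below counts events in fixed one-minute windows and uses a simple watermark to decide when a window is final. The class, its parameters, and the event times are hypothetical; real streaming frameworks handle this with considerably more care.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed window, tolerating a bounded amount of lateness."""

    def __init__(self, window_seconds=60, allowed_lateness=30):
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness  # assumed tolerance for out-of-order events
        self.open_windows = defaultdict(int)      # state: window start -> running count
        self.closed = {}                          # finalized windows: window start -> count
        self.max_event_time = 0                   # highest event time seen so far
        self.dropped_late = 0                     # events that arrived after their window closed

    def add(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        window_start = event_time - event_time % self.window_seconds

        if window_start + self.window_seconds <= watermark:
            # The window this event belongs to is already final: too late to count.
            self.dropped_late += 1
        else:
            self.open_windows[window_start] += 1

        # Finalize every open window whose end is at or before the watermark.
        for start in sorted(self.open_windows):
            if start + self.window_seconds <= watermark:
                self.closed[start] = self.open_windows.pop(start)


counter = TumblingWindowCounter(window_seconds=60, allowed_lateness=30)
# Event times in seconds: 50 arrives out of order but in time, 10 arrives too late.
for t in [5, 20, 70, 50, 100, 130, 10]:
    counter.add(t)
```

In a batch job none of this bookkeeping is needed, because all the data for the period is already present when the calculation starts. That difference is a large part of the extra engineering cost.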

It is more expensive when misapplied. If the business task does not require data fresher than a few hours, there is no reason to pay for real-time infrastructure.

The sensible approach is to start with a question about the cost of delay. If delay within a nightly cycle does not create a problem, batch processing remains the right choice. If it does - then alternatives are worth considering.

How this changes the conversation about data architecture

Previously the architectural choice was relatively simple: there is data, there is storage, there is a batch calculation, there is a report. Now a question about layers arises: which data and tasks need a fast path, and which can stay on a slow one?
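One way to picture the fast path and slow path question is as a routing rule per task. The sketch below assumes a nightly batch cycle of 24 hours and invents a `max_tolerable_delay_hours` figure for each task; both the field and the numbers are hypothetical and would have to come from the business in practice.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    max_tolerable_delay_hours: float  # hypothetical: how stale the result may be


def choose_path(task, batch_cycle_hours=24.0):
    """Route a task to the fast path only if a full batch cycle would be too slow."""
    if task.max_tolerable_delay_hours >= batch_cycle_hours:
        return "slow"
    return "fast"


tasks = [
    Task("monthly export", 24 * 30),
    Task("nightly report", 24),
    Task("fraud check", 0.01),
    Task("production line anomaly alert", 0.25),
]
paths = {task.name: choose_path(task) for task in tasks}
```

Under these assumptions only the fraud check and the anomaly alert end up on the fast path; everything else stays in batch.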

This is not complexity for its own sake. It is a consequence of tasks that were previously unsolvable - technically or economically - starting to become solvable. And that creates new questions for architecture.

A practical filter

Before investing in faster processing infrastructure, it is worth answering a few questions:

Do we have a task where a data delay of several hours creates a measurable problem?
Do we have a process that will actually use the output of fast processing - that is, will someone be looking at the dashboard every five minutes, or will a system react automatically?
Do we have the engineering capability to support a streaming pipeline, or will we need to acquire it?
Is the delay really caused by processing speed, or by data collection speed - meaning the data simply arrives late for a different reason that faster processing will not fix?

If the answers say "yes, there is a task, there is a process, there is capability", and the delay really does come from processing rather than collection - the investment is justified. If not - batch processing is likely to remain the right answer for a long time yet.
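The filter can also be written down as a small decision function, which makes the order of the questions explicit. This is only a sketch: the field names and the returned recommendations are made up for the example, and the "fix data collection first" outcome is an assumption about what follows when the delay does not come from processing.

```python
from dataclasses import dataclass


@dataclass
class FilterAnswers:
    delay_has_measurable_cost: bool   # a delay of several hours creates a measurable problem
    output_will_be_acted_on: bool     # a person or a system reacts to the fast output
    team_can_run_streaming: bool      # engineering capability exists or can be acquired
    delay_is_from_processing: bool    # False if the data simply arrives late


def recommend(answers):
    """Turn the four answers into a rough recommendation."""
    if not answers.delay_has_measurable_cost:
        return "stay with batch"
    if not answers.delay_is_from_processing:
        # Assumed outcome: faster processing cannot help if collection is the bottleneck.
        return "fix data collection first"
    if answers.output_will_be_acted_on and answers.team_can_run_streaming:
        return "invest in faster processing"
    return "stay with batch for now"


justified = recommend(FilterAnswers(True, True, True, True))
collection_bound = recommend(FilterAnswers(True, True, True, False))
no_cost = recommend(FilterAnswers(False, True, True, True))
```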

If this resonated, write to me. I reply personally.