Data October 21, 2025 3 min read

Data observability: catching broken pipelines before your users do

A silent data failure is more dangerous than a crashed service. I explain what data observability is and why operations teams, not just engineers, need it.

When a server goes down or an application stops loading, it is immediately visible. Alerts, calls, incidents. The team responds.

When a data pipeline breaks, there is silence. The report looks normal. The numbers are there. Only a few days later does someone notice that the figures do not add up. Or no one notices at all, and the company makes decisions based on incorrect data for several weeks.

This is a fundamental difference between infrastructure observability and data observability.

Why data failures are silent

The pipeline did not crash - it simply stopped delivering correct data. The source changed its schema without notice. An API started returning empty fields. A connection dropped at 3am and the job carried on with an incomplete dataset. A transformation ran but was applied to last week's data instead of today's.

None of this throws an exception. There is no error log. There is only a quiet distortion.
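As a minimal illustration of such a quiet distortion, the sketch below sums order amounts from a payload in which one field came back empty. The records, field names, and both functions are hypothetical, not taken from any real pipeline.

```python
# Hypothetical API payload: the second record came back with an empty field.
records = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 80.0},
]


def total_revenue(rows):
    # Treating a missing amount as 0 keeps the job "green" but distorts the total.
    return sum(row.get("amount") or 0 for row in rows)


def missing_amounts(rows):
    # The same data, inspected for the problem instead of papering over it.
    return [row["order_id"] for row in rows if row.get("amount") is None]


print(total_revenue(records))    # runs without any exception
print(missing_amounts(records))  # order ids whose amount is missing
```

The first function is what a typical job does; the second is the kind of question an observability check asks.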

Infrastructure monitoring does not see this. It watches CPU, memory, endpoint availability - not whether the data matches the expected distribution.

What data observability is

It is a set of practices and tools that allows you to ask the same questions about data that monitoring asks about infrastructure:

Is the data fresh? When did the most recent record arrive?
Is the volume normal? Did the same number of rows arrive today as usual?
Has the schema changed? Are all columns present, do the types match?
Are the values within expected bounds? Are there nulls where there should not be, or unexpected outliers?
Did the pipeline complete successfully - not just without errors, but with the expected result?

When there are automatic answers to these questions, the team sees a problem before it reaches the report consumer.
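The last question, whether a run produced the expected result rather than merely finishing, can be sketched as a check on a run summary. The `RunResult` structure, its fields, and the thresholds are assumptions made for illustration; a real job would need to emit something equivalent.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RunResult:
    """Hypothetical summary a pipeline job could emit when it finishes."""
    exit_code: int
    rows_written: int
    newest_record_date: date


def validate_run(result: RunResult, expected_date: date, min_rows: int) -> list:
    """Return a list of problems; an empty list means the run met expectations."""
    problems = []
    if result.exit_code != 0:
        problems.append(f"job exited with code {result.exit_code}")
    if result.rows_written < min_rows:
        problems.append(
            f"only {result.rows_written} rows written, expected at least {min_rows}"
        )
    if result.newest_record_date < expected_date:
        problems.append(
            f"newest record is from {result.newest_record_date}, expected {expected_date}"
        )
    return problems


# A run that exited cleanly but processed last week's data.
stale_run = RunResult(exit_code=0, rows_written=1200, newest_record_date=date(2025, 10, 14))
print(validate_run(stale_run, expected_date=date(2025, 10, 21), min_rows=1000))
```

An exit code of 0 is only one of three conditions here, which is the difference between "without errors" and "with the expected result".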

How this works in practice

Data observability does not necessarily require buying a specialised tool. A lot can be built in-house:

Freshness monitoring - the simplest check: when were the data in this table last updated? This can be set up in any monitoring system.
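A freshness check can be as small as the sketch below. It assumes the timestamp of the newest record is already available, for example from a `MAX(updated_at)` query on the table; that column name and the 24-hour threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(last_updated: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True if the newest record is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age


# Hypothetical values; in practice last_load would be read from the table itself.
check_time = datetime(2025, 10, 21, 9, 0, tzinfo=timezone.utc)
last_load = datetime(2025, 10, 21, 3, 0, tzinfo=timezone.utc)
print(is_fresh(last_load, max_age=timedelta(hours=24), now=check_time))
```

Passing `now` explicitly keeps the function easy to test; a scheduled check would simply omit it.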

Volume checks - comparing today's row count against the median of the last N days. A sharp deviation in either direction triggers an alert.
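The comparison against the median could look roughly like this. The 30% tolerance and the sample row counts are assumptions; a sensible threshold depends on how much the data normally fluctuates.

```python
from statistics import median


def volume_deviation(today: int, history: list) -> float:
    """Relative deviation of today's row count from the median of past days."""
    baseline = median(history)
    if baseline == 0:
        # No meaningful baseline: any rows at all count as an infinite deviation.
        return float("inf") if today > 0 else 0.0
    return (today - baseline) / baseline


def volume_alert(today: int, history: list, tolerance: float = 0.3) -> bool:
    """Alert on a sharp deviation in either direction."""
    return abs(volume_deviation(today, history)) > tolerance


# Hypothetical row counts for the last five days.
last_days = [1000, 1020, 980, 1010, 990]
print(volume_alert(400, last_days))   # far fewer rows than usual
print(volume_alert(1050, last_days))  # within normal variation
```

The median is used rather than the mean so that one earlier bad day does not shift the baseline.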

Schema tests - automatic verification that the data structure has not changed unexpectedly. This can be done with dbt tests, Great Expectations, or hand-written checks.
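A hand-written schema check might look like the sketch below. It is not the dbt or Great Expectations API; the expected schema, column names, and type labels are hypothetical, and the actual schema would normally be read from the database catalogue.

```python
# Hypothetical contract for one table: column name -> type label.
EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "created_at": "str"}


def schema_diff(expected: dict, actual: dict) -> dict:
    """Compare column names and types; all three lists empty means no drift."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != actual[c]),
    }


# What the source delivers after an unannounced change.
actual_schema = {"order_id": "int", "amount": "str", "customer": "str"}
print(schema_diff(EXPECTED_SCHEMA, actual_schema))
```

Reporting the three kinds of drift separately makes the alert actionable: a missing column usually breaks downstream queries, while an unexpected one may only need acknowledging.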

Distribution monitoring - more complex but valuable: checking that key metrics, such as the null rate or the typical values of a column, stay within normal bounds.
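Two of the simplest distribution signals are the share of nulls and the share of values outside an expected range. The sketch below assumes the column has already been loaded into a list; the sample values and the 0 to 100 range are illustrative.

```python
def null_rate(values: list) -> float:
    """Share of missing values; an empty column is left to the volume check."""
    if not values:
        return 0.0
    return sum(1 for v in values if v is None) / len(values)


def out_of_bounds_rate(values: list, low: float, high: float) -> float:
    """Share of non-null values that fall outside [low, high]."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(1 for v in present if not low <= v <= high) / len(present)


# Hypothetical column: one null and one suspicious outlier.
amounts = [10.0, 12.5, None, 11.0, 950.0]
print(null_rate(amounts))
print(out_of_bounds_rate(amounts, low=0.0, high=100.0))
```

Each rate can then be compared against its own threshold, or against its history in the same way as the row count.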

The operational value

The main benefit is reducing the time between a problem appearing and being detected. This interval is called MTTD - Mean Time To Detect. The shorter it is, the fewer decisions are made on bad data.
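MTTD is simply the average gap between when a problem started and when it was noticed. The sketch below computes it from a list of incident timestamps; the two incidents are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents: list) -> timedelta:
    """Average of (detected - occurred) over (occurred, detected) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((detected - occurred for occurred, detected in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents: one noticed after days, one caught by a check in hours.
incidents = [
    (datetime(2025, 10, 1, 3, 0), datetime(2025, 10, 4, 9, 0)),    # 78 hours
    (datetime(2025, 10, 10, 3, 0), datetime(2025, 10, 10, 5, 0)),  # 2 hours
]
print(mean_time_to_detect(incidents))
```

Tracking this number before and after introducing checks is one way to show whether the monitoring is paying off.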

A second effect is reducing team anxiety. When there is no monitoring, everyone lives with the feeling that "something might be broken but we don't know". When monitoring is in place and it is silent - that is information.

Where to start

If you currently have no data monitoring at all, start with three questions:

Which data, if broken, would have the largest operational consequences?
When should this data be updated, and how can that be verified automatically?
What volume or distribution is normal for it?

Answers to these three questions are already a monitoring project. Everything else is layered complexity on top of that foundation.
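One possible way to write those answers down so that checks can be driven from them is a small specification per dataset. The class, its fields, and the example dataset are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DatasetSpec:
    """Answers to the three starting questions for one dataset."""
    name: str
    impact: str              # what breaks operationally if this data is wrong
    max_age: timedelta       # how recently it must have been updated
    expected_rows: int       # normal daily volume
    row_tolerance: float     # accepted relative deviation from that volume

    def row_bounds(self) -> tuple:
        """Lowest and highest row counts still considered normal."""
        low = round(self.expected_rows * (1 - self.row_tolerance))
        high = round(self.expected_rows * (1 + self.row_tolerance))
        return low, high


# Hypothetical example entry.
orders = DatasetSpec(
    name="daily_orders",
    impact="revenue report used for purchasing decisions",
    max_age=timedelta(hours=24),
    expected_rows=1000,
    row_tolerance=0.25,
)
print(orders.row_bounds())
```

Keeping the specification separate from the checks means a new dataset can be added by describing it, without writing new monitoring code.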

Contact

If this resonated, write to me. I reply personally.