m@ksim.pro
Data 3 min read

ETL pipelines fail quietly and that destroys data trust

How ETL pipelines degrade data quality without anyone noticing, and what to do before the dashboards become decorative.

At some point in almost every data project, the team starts qualifying their own reports. "These numbers are probably right, but the last sync had issues." "Sales figures look off - the pipeline might have skipped some records." The moment that language becomes normal, something important has been lost. Not a number - trust in the data.

ETL pipelines are one of the main places where that trust gets destroyed, slowly and invisibly.

The quiet failure problem

Pipelines fail in loud ways and quiet ways. Loud failures are almost fine - the job crashes, alerts fire, someone investigates. Quiet failures are the dangerous ones: the job completes successfully, records are processed, and the output is wrong.

Quiet failure modes I see regularly:

  • a source schema changes silently and the pipeline keeps loading the wrong columns, filling the target fields with nulls or defaults;
  • a deduplication key is not unique in the source, and records are silently dropped or multiplied;
  • a timezone or encoding mismatch introduces systematic offsets that nobody catches for weeks;
  • a filter condition that was correct on test data turns out to be wrong at production data volume;
  • a partial load completes with no error because the job does not know how many records to expect.

None of these raise an exception. They just corrupt the data quietly.
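
Two of the failure modes above, schema drift and a non-unique deduplication key, can be caught with cheap checks before the load. The following is a minimal sketch, assuming a batch arrives as a list of dicts. The column names, `EXPECTED_COLUMNS`, and both function names are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical expected schema for an illustrative "orders" source.
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}


def schema_drift(rows, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names across the whole batch."""
    seen = set()
    for row in rows:
        seen.update(row)  # adds the dict's keys
    return expected - seen, seen - expected


def duplicate_keys(rows, key):
    """Return key values that occur more than once, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Illustrative batch: the source renamed customer_id to customer_ref,
# and order_id 1 appears twice.
batch = [
    {"order_id": 1, "customer_ref": "a", "amount": 10.0},
    {"order_id": 1, "customer_ref": "b", "amount": 12.5},
    {"order_id": 2, "customer_ref": "c", "amount": 7.0},
]

missing, unexpected = schema_drift(batch)
duplicates = duplicate_keys(batch, key="order_id")

if missing or unexpected or duplicates:
    print(f"missing={missing} unexpected={unexpected} duplicates={duplicates}")
```

Neither check fixes anything. Both only turn a quiet failure into a visible one before deduplication or column mapping hides it.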

Why monitoring is not enough

The usual answer is "add monitoring". It helps, but it is not sufficient. Standard pipeline monitoring tells you whether the job ran, how long it took, and how many rows it processed. It does not tell you whether those rows are correct.

Data quality monitoring is a different thing. It means tracking distributions, ranges, cardinality, and referential integrity in the output - not just process metrics. If the number of distinct customers in today's load is 40% of yesterday's, that is a signal. A process monitor will not fire on it.
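
The distinct-customer signal can be sketched as a cardinality comparison between two loads. This is an illustration only: the function names and the `min_ratio` threshold of 0.7 are assumptions, and a real threshold depends on how much the source normally varies from day to day.

```python
def distinct_ratio(today_values, yesterday_values):
    """Ratio of distinct values today vs. yesterday, or None if no baseline."""
    yesterday = len(set(yesterday_values))
    if yesterday == 0:
        return None
    return len(set(today_values)) / yesterday


def cardinality_alert(today_values, yesterday_values, min_ratio=0.7):
    """True when today's distinct count fell below min_ratio of yesterday's."""
    ratio = distinct_ratio(today_values, yesterday_values)
    return ratio is not None and ratio < min_ratio


# Illustrative data: 10 distinct customers yesterday, 4 today.
yesterday_customers = [f"c{i}" for i in range(10)]
today_customers = ["c0", "c1", "c2", "c3", "c3"]  # repeats do not count

alert = cardinality_alert(today_customers, yesterday_customers)
print("cardinality alert:", alert)
```

The job that produced today's load would still report success and a plausible row count. Only a comparison against the output itself shows the drop.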

The gap between "the pipeline ran" and "the data is correct" is where most data quality problems live.

The cost of rebuilding trust

Once analysts stop trusting a data source, it is very hard to win that trust back. They start maintaining their own extracts, their own transformations, their own copies. You end up with multiple unofficial "sources of truth" that diverge further over time. Every meeting where numbers are discussed becomes a debate about which version of the data is real.

I have seen companies spend more effort on managing that divergence than they originally spent building the pipeline. The cost is not just technical - it is the time of senior people reconciling spreadsheets instead of making decisions.

Practical baseline for a trustworthy pipeline

Before optimising for throughput or adding new sources, I recommend three things:

  1. Row count contracts. Define expected ranges for load volume per source per period. Flag and halt - not just log - when those ranges are violated.
  2. Null rate tracking. Track the null rate for every key field. A sudden spike is almost always a schema change or extraction problem upstream.
  3. Cross-system reconciliation. Pick two or three numbers that should match across the source and the target - totals, counts, checksums. Compare them automatically on every load.
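
The three checks above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not a definitive implementation: `LoadHalted`, the function names, the thresholds, and the `amount_cents` field are all hypothetical. Halting on null rate and reconciliation failures is also my assumption here; the list only asks for halting on row counts.

```python
class LoadHalted(Exception):
    """Raised to stop a load instead of only logging the violation."""


def check_row_count(actual, expected_min, expected_max):
    """Row count contract: halt when the volume is outside the agreed range."""
    if not expected_min <= actual <= expected_max:
        raise LoadHalted(
            f"row count {actual} outside [{expected_min}, {expected_max}]"
        )


def null_rates(rows, key_fields):
    """Fraction of rows where each key field is missing or None."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in key_fields}
    return {
        field: sum(1 for row in rows if row.get(field) is None) / total
        for field in key_fields
    }


def check_null_rates(rows, key_fields, max_rate):
    """Halt when any key field exceeds the allowed null rate."""
    rates = null_rates(rows, key_fields)
    offenders = {f: r for f, r in rates.items() if r > max_rate}
    if offenders:
        raise LoadHalted(f"null rate too high: {offenders}")
    return rates


def check_reconciliation(source_total, target_total, tolerance=0):
    """Halt when a number that should match across systems does not."""
    if abs(source_total - target_total) > tolerance:
        raise LoadHalted(
            f"totals differ: source={source_total} target={target_total}"
        )


# Illustrative data: the target received only two of three source rows.
source = [
    {"order_id": 1, "amount_cents": 1000},
    {"order_id": 2, "amount_cents": 250},
    {"order_id": 3, "amount_cents": 4000},
]
target = source[:2]

try:
    check_row_count(len(target), expected_min=3, expected_max=10)
    check_null_rates(target, ["order_id", "amount_cents"], max_rate=0.01)
    check_reconciliation(
        sum(r["amount_cents"] for r in source),
        sum(r["amount_cents"] for r in target),
    )
    outcome = "ok"
except LoadHalted as exc:
    outcome = f"halted: {exc}"

print(outcome)
```

Amounts are kept as integer cents so the reconciliation compares exact values. With floating-point totals a small non-zero `tolerance` would be needed.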

This is not glamorous work. But it is the foundation that makes everything built on top of it reliable.

What to fix first

If a pipeline is already running in production with none of the above, do not stop it to rebuild. Add the checks incrementally, run them in parallel, and fix the findings. The goal is to get to a state where the team can say "the data is right" without qualifying it. That is the real output of data engineering work.
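
One way to add checks incrementally is to run them in an observe-only mode first and switch to enforcement once the findings are fixed. The sketch below reads "in parallel" as "alongside the existing pipeline, without blocking it", which is my interpretation. `run_checks`, the `enforce` flag, and the two example checks are hypothetical names.

```python
import logging

logger = logging.getLogger("pipeline.checks")


def run_checks(checks, rows, enforce=False):
    """Run every check and collect findings.

    Each check takes the rows and returns None or a problem message.
    With enforce=False the findings are only logged and returned, so the
    existing pipeline keeps running. With enforce=True any finding raises.
    """
    findings = []
    for name, check in checks.items():
        problem = check(rows)
        if problem is not None:
            findings.append((name, problem))
            logger.warning("check %s failed: %s", name, problem)
    if enforce and findings:
        raise RuntimeError(f"{len(findings)} data quality check(s) failed")
    return findings


def not_empty(rows):
    return None if rows else "load is empty"


def ids_present(rows):
    missing = sum(1 for row in rows if row.get("id") is None)
    return None if missing == 0 else f"{missing} row(s) without id"


checks = {"not_empty": not_empty, "ids_present": ids_present}
rows = [{"id": 1}, {"id": None}]

# Observe-only: the load continues and the finding is recorded for follow-up.
findings = run_checks(checks, rows, enforce=False)
print(findings)
```

Once a check has been clean for long enough, the same call with `enforce=True` turns it into a gate.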
