Data November 12, 2014 4 min read

Knowing when your ETL pipeline is sick

Why data pipelines fail silently, what signs to watch for, and how to build monitoring before the problem shows up in a management report.

Most ETL pipelines fail quietly. The job does not throw an error, the logs look clean, the scheduled task shows "completed successfully" - but the data that landed in the warehouse is wrong. Partial. Duplicated. Three days stale. Nobody notices until a manager asks why last week's numbers do not match what the CRM shows.

I have seen this pattern often enough that I now treat "our pipelines run and we get alerts when they fail" as the floor, not the ceiling, of what data infrastructure monitoring should cover. An ETL pipeline behaves like a production line, with queues, stoppages, and work that happens out of sight, and the monitoring should reflect that.

Why silent failures are the common case

ETL jobs are built to be resilient. A well-written job handles connection timeouts, retries failed requests, and logs what it could not process before completing. This is the right behavior for availability. The problem is that "completed with some records skipped" looks exactly the same in a success log as "completed with all records processed."
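
One way to keep those two outcomes apart is to have the job report its own counts and derive the status from them. This is a minimal sketch: `LoadResult`, `job_status`, and the 1% threshold are illustrative assumptions, not part of any particular ETL tool.

```python
from dataclasses import dataclass


@dataclass
class LoadResult:
    """Counts reported by one load run (hypothetical structure)."""
    rows_read: int
    rows_written: int
    rows_skipped: int

    @property
    def skip_rate(self) -> float:
        # Avoid dividing by zero when the source returned nothing.
        return self.rows_skipped / self.rows_read if self.rows_read else 0.0


def job_status(result: LoadResult, max_skip_rate: float = 0.01) -> str:
    """Return 'ok', 'degraded' or 'empty' instead of a bare 'completed'."""
    if result.rows_read == 0:
        return "empty"
    if result.skip_rate > max_skip_rate:
        return "degraded"
    return "ok"
```

A run that skipped 100 of 1,000 rows would come back as "degraded" here, even though the process itself exited cleanly.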

Source systems change without warning. A vendor updates their API and a field that used to hold a numeric value now holds a string. The pipeline does not crash - it silently drops the rows it cannot parse or, worse, coerces the value into something plausible-looking but wrong.
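
A rough sketch of the alternative: parse strictly, and keep the rows that do not parse in a separate list so they can be counted and inspected. The field name `amount` and both function names are assumptions for illustration.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Parse a numeric field strictly; return None instead of guessing."""
    if isinstance(raw, bool):  # bool is a subclass of int, reject it explicitly
        return None
    if isinstance(raw, (int, Decimal)):
        return Decimal(raw)
    if isinstance(raw, str):
        try:
            value = Decimal(raw.strip())
        except InvalidOperation:
            return None
        # Decimal accepts "NaN" and "Infinity"; treat those as unparseable.
        return value if value.is_finite() else None
    return None


def split_rows(rows, field="amount"):
    """Separate rows whose field parses from rows that must be rejected."""
    accepted, rejected = [], []
    for row in rows:
        value = parse_amount(row.get(field))
        if value is None:
            rejected.append(row)
        else:
            accepted.append({**row, field: value})
    return accepted, rejected
```

The length of `rejected` is the number worth logging and alerting on. A reject count that jumps from zero is often the first visible sign that the source changed.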

Schema drift is the most common form of this. The pipeline was built against the source system as it existed when someone wrote the ETL code, and the source system has been through several releases since then. Technical debt in data pipelines accumulates silently, and "we will rewrite it later" almost never happens. Left unchecked, schema drift is how pipelines end up permanently broken in ways nobody admits.
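
Drift can be caught by comparing the schema the pipeline was written against with the schema the source reports today. The sketch below assumes both are available as plain column-to-type mappings. How you obtain the observed schema depends on the source system and is not shown.

```python
def schema_drift(expected: dict, observed: dict) -> dict:
    """Compare two {column: type_name} mappings and report the differences."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "missing": sorted(expected_cols - observed_cols),
        "added": sorted(observed_cols - expected_cols),
        "changed": {
            col: (expected[col], observed[col])
            for col in sorted(expected_cols & observed_cols)
            if expected[col] != observed[col]
        },
    }


def has_drift(report: dict) -> bool:
    """True if any column went missing, appeared, or changed type."""
    return bool(report["missing"] or report["added"] or report["changed"])
```

Running this before each load, and refusing to continue when a column is missing or has changed type, turns a silent failure into a loud one.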

What useful monitoring actually covers

Job-level monitoring - "did the job run, did it complete, how long did it take" - is necessary but not sufficient. The monitoring that catches real problems operates at the data level:

Row counts compared against a baseline or against what the source system reports. A job that processes 40,000 records one day and 400 the next either had a very quiet day upstream or dropped 99% of the data. Without a row count check, that difference stays invisible until someone happens to look at the numbers.
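
A minimal sketch of such a check, using the median of recent runs as the baseline. The 50% tolerance in each direction is an arbitrary starting point, not a recommendation. A spike is flagged as well, since duplicated loads show up as too many rows rather than too few.

```python
from statistics import median


def row_count_status(todays_count, recent_counts, max_drop=0.5, max_rise=0.5):
    """Compare today's row count with the median of recent runs."""
    if not recent_counts:
        return "unknown"  # no history yet, nothing to compare against
    baseline = median(recent_counts)
    if todays_count < baseline * (1 - max_drop):
        return "low"
    if todays_count > baseline * (1 + max_rise):
        return "high"
    return "ok"
```

With recent counts around 40,000, a run that lands 400 rows returns "low", which is the 40,000-to-400 case described above.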

Null rates on fields that should not be null. If a "customer ID" field starts arriving with 15% nulls when it was previously 0%, something changed upstream.
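
As a sketch, assuming rows arrive as dictionaries and that a missing key counts the same as an explicit null (the field name `customer_id` is illustrative):

```python
def null_rate(rows, field):
    """Share of rows where the field is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(field) is None)
    return nulls / len(rows)


def null_rate_exceeded(rows, field, max_rate=0.0):
    """True when the observed null rate is above the allowed rate."""
    return null_rate(rows, field) > max_rate
```

For a field that has historically been fully populated, `max_rate=0.0` is a reasonable default. Noisier fields need a tolerance based on what they looked like in the past.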

Freshness checks. A table that should be updated every four hours should have a check that raises an alert when the most recent record is six hours old. Not everything needs to be real-time, but everything should have a known expected freshness, and a monitor that notices when that is not met.
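
A freshness check reduces to a single comparison once you have the timestamp of the newest record. The sketch below assumes timezone-aware timestamps. Fetching `latest_record_at` from the warehouse is left out because it depends on the database.

```python
from datetime import datetime, timedelta, timezone


def is_stale(latest_record_at, max_age, now=None):
    """True when the newest record is older than the allowed age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - latest_record_at > max_age
```

For the four-hour table in the example, `max_age=timedelta(hours=6)` leaves room for one late run before anyone is alerted.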

Value distribution checks for fields where you know the expected range. An order total field that suddenly contains zeros or negative values, in a dataset that previously had neither, is a sign worth investigating.
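
The simplest version of a distribution check is a range check that reports what share of values falls outside the expected bounds. The bounds in the usage note are assumptions for a hypothetical order total field.

```python
def share_outside(values, low, high):
    """Share of values that fall outside the inclusive range [low, high]."""
    if not values:
        return 0.0
    outside = sum(1 for value in values if not (low <= value <= high))
    return outside / len(values)
```

Calling `share_outside(order_totals, 0.01, 10000)` would count zeros and negatives as out of range. Comparing the result with the share seen in previous loads says more than a fixed limit does, because some sources legitimately contain a few odd values.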

The threshold question

Not every anomaly needs to page someone at 3 AM. The monitoring design question is about thresholds and routing: what conditions are serious enough to interrupt production, what conditions should surface as a morning report, and what conditions can wait for a weekly review.

I usually start with: anything that makes reports silently wrong should alert before the next time those reports are read. Everything else can be more relaxed.
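
That starting rule can be written down as a small routing function. This is one possible reading of it, not a fixed policy: the three routes mirror the ones named above, and the parameter names are made up for the example.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"
    MORNING_REPORT = "morning_report"
    WEEKLY_REVIEW = "weekly_review"


def route_alert(feeds_report, hours_until_report_read, hours_until_morning_report):
    """Pick the least disruptive route that still beats the next report read."""
    if not feeds_report:
        return Route.WEEKLY_REVIEW
    if hours_until_report_read < hours_until_morning_report:
        # The morning report would arrive too late to prevent a wrong read.
        return Route.PAGE
    return Route.MORNING_REPORT
```

The point of the design is that urgency comes from when the data will next be used, not from how alarming the anomaly looks.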

Building this incrementally

You do not need to instrument everything at once. In a new data environment, I find it useful to start with the three or four tables that feed the reports management actually uses and relies on for decisions. Those are the places where a silent failure has the most visible consequences, and they are usually a small fraction of the total pipeline.

Build the row count and freshness checks there first. Add null and distribution checks where the source systems are known to be unstable. Once that baseline is working, extend it to the rest of the pipeline.

The goal is not to achieve theoretical coverage of every possible failure mode. The goal is to ensure that the data you are making decisions from is what you think it is.
