Data September 20, 2013 5 min read

Technical debt in data pipelines: why "we will rewrite it later" almost never happens

Data pipelines age faster than they appear to, especially when nobody owns the schema. Deferring the refactor has a concrete cost.

In most companies, the first data pipeline is built quickly, for a specific task. Extract from the database, a couple of transformations, load into the warehouse or a spreadsheet. It works - good enough. "Later, when we have time, we will do it properly."

That "later" almost never arrives. The post on ETL as a production line describes the structural failure modes; what this post adds is how those failure modes harden into debt over time. The reason is not that the team is lazy or does not understand the problem. The system is set up so that the urgent always displaces the important - and as long as the pipeline is working, it is not urgent.

But it ages quickly. And quietly.

How the debt accumulates in data

A data pipeline does not degrade the way an application breaks. It does not crash - it quietly stops being true.

The source schema changed, and a new kind of value appeared in a field that the pipeline does not handle. A new table was added, and nobody remembered to include it in the extract. The transformation logic was written for the current data and will break on edge cases that have not appeared yet.
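To make the first failure mode concrete, here is a minimal sketch of a check that notices when the source starts sending something the pipeline was not written for. The column names and status values are hypothetical, and the rows are assumed to arrive as plain dictionaries; a real pipeline would take its expectations from whatever schema documentation exists.

```python
# Hypothetical expectations for one source table; every name here is illustrative.
EXPECTED_COLUMNS = {"order_id", "status", "amount"}
KNOWN_STATUSES = {"new", "paid", "cancelled"}


def find_drift(rows):
    """Describe everything in `rows` that the pipeline was not written to handle."""
    problems = []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - set(row)
        extra = set(row) - EXPECTED_COLUMNS
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        # Only judge the value when the column is actually present.
        if "status" in row and row["status"] not in KNOWN_STATUSES:
            problems.append(f"row {i}: unhandled status {row['status']!r}")
    return problems


if __name__ == "__main__":
    sample = [
        {"order_id": 1, "status": "paid", "amount": 10},
        {"order_id": 2, "status": "refunded", "amount": 5, "currency": "EUR"},
    ]
    for problem in find_drift(sample):
        print(problem)
```

The point is not this particular check. Once expectations are written down anywhere, a silent change in the source becomes a loud one.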

The most dangerous situation is when nobody owns the schema. That means: nobody knows exactly what field X means in the source system and how it should be interpreted on the output side. Dirty master data breaks any BI - the schema ownership problem is what puts master data in that state. That context belonged to the person who built the pipeline a year ago. Now it lives in oral tradition, or nowhere.

The pipeline then becomes a black box. It does something, the result looks plausible - but it cannot be verified.

Why the refactor does not happen

If the technical debt is understood and visible - why is it not paid down? I see three persistent reasons.

The first is no owner. The pipeline was written by an analyst or a developer who was solving a task. Once the task was solved, they moved on. Nobody was assigned responsibility for maintaining and evolving this piece of infrastructure.

The second is no visibility. Technical debt in data is invisible to the business until the moment it shows up as wrong reports or failures. Investing in a pipeline refactor is hard to justify - "we will redo what already works so it keeps working, only better." That does not sound like a priority.

The third is fear of breaking something. When a pipeline is opaque and undocumented, touching it is frightening. Any change might break something - it is unclear what. So workarounds get layered on top without touching the foundation.

This is the classic technical debt dynamic. If it is not broken, nobody fixes it. When it breaks, it gets fixed in a panic.

When the debt becomes critical

Several signs indicate the situation is crossing into dangerous territory.

The first is when a change in the source - a table structure change, a system migration, an API update - requires disproportionate time to adapt the pipeline. If changing a field in the source leads to a week of repair work downstream, that debt costs real money.

The second is when analysts spend a significant part of their time "fixing" data instead of analysing it. If they routinely say "this part is a bit off, I need to adjust it manually" - the pipeline is not doing the work, the people around it are.

The third is when nobody can explain why the report shows a particular number. If explaining a result takes three hours and a phone call - trust in the data is already gone, even if that is not acknowledged out loud yet.

How to get out of this state

Rewriting the pipeline from scratch is expensive and risky. An incremental approach works better.

Start with documentation: describe what each step does, where the data comes from, what the source of truth is for each field. This by itself surfaces contradictions and gaps.

Assign an owner - someone who is responsible for the health and understanding of the pipeline. This does not have to be a full-time developer. It is a responsibility, not a job title.

Add monitoring: row counts, null rates, critical value ranges. If something breaks, it should appear as a signal - not be discovered the next time someone tries to use the report.
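A minimal sketch of those three checks, assuming a batch is available as a list of dictionaries. The function name, thresholds and column names are all hypothetical; in practice the alerts would go to whatever channel the team already watches.

```python
def check_batch(rows, min_rows, max_null_rate, ranges):
    """Return alert messages for one loaded batch; an empty list means healthy.

    rows: list of dicts, one per record.
    ranges: {column: (low, high)} for the critical numeric columns.
    """
    alerts = []
    if len(rows) < min_rows:
        alerts.append(f"row count {len(rows)} is below the expected minimum {min_rows}")
    if not rows:
        return alerts  # nothing else can be measured on an empty batch
    for column, (low, high) in ranges.items():
        values = [row.get(column) for row in rows]
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            alerts.append(f"{column}: null rate {null_rate:.0%} exceeds {max_null_rate:.0%}")
        out_of_range = [v for v in values if v is not None and not low <= v <= high]
        if out_of_range:
            alerts.append(f"{column}: {len(out_of_range)} value(s) outside [{low}, {high}]")
    return alerts


if __name__ == "__main__":
    batch = [{"amount": 10}, {"amount": None}, {"amount": -5}, {"amount": 20}]
    for alert in check_batch(batch, min_rows=5, max_null_rate=0.1, ranges={"amount": (0, 1000)}):
        print(alert)
```

Run after every load, a check like this turns "the report looks odd" into a message that arrives before anyone opens the report.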

Refactor in modules: do not rewrite everything at once. Isolate and rebuild section by section, starting with the most fragile parts.
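One way to make a section-by-section rebuild less frightening is to run the old and the rebuilt step on the same sample and compare the results before switching over. The sketch below assumes each step is a function from a list of dictionaries to a list of dictionaries, and that the output has a unique key; `legacy_step` and `rebuilt_step` are invented purely for illustration.

```python
def compare_steps(old_step, new_step, sample_rows, key):
    """Run both implementations on the same input and list the records that differ.

    Returns tuples of (key value, old record or None, new record or None).
    """
    old_out = {row[key]: row for row in old_step(sample_rows)}
    new_out = {row[key]: row for row in new_step(sample_rows)}
    differences = []
    for k in sorted(set(old_out) | set(new_out)):
        if old_out.get(k) != new_out.get(k):
            differences.append((k, old_out.get(k), new_out.get(k)))
    return differences


# Hypothetical steps: the rebuilt one keeps zero-value orders, the legacy one drops them.
def legacy_step(rows):
    return [{"id": r["id"], "total": r["net"] + r["tax"]} for r in rows if r["net"] > 0]


def rebuilt_step(rows):
    return [{"id": r["id"], "total": r["net"] + r["tax"]} for r in rows if r["net"] >= 0]


if __name__ == "__main__":
    sample = [{"id": 1, "net": 100, "tax": 20}, {"id": 2, "net": 0, "tax": 0}]
    for diff in compare_steps(legacy_step, rebuilt_step, sample, key="id"):
        print(diff)
```

Every difference is either a bug in the new step or an undocumented behaviour of the old one. Both are worth knowing before the switch.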

Questions that help assess the current state:

Who last made changes to the pipeline, and do they remember why?
Is there documentation for the schema and transformation logic?
How long does adapting to a source change take?
Are there alerts that fire before an analyst discovers the problem?
Does anyone on the team know what share of records passes through every step without exceptions?

If most of the answers are "we don't know" or "we would need to check" - that is the description of the debt. Not abstract debt. Concrete debt.
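The last question on the list is the easiest one to answer with a few lines of code. Here is a rough sketch, assuming each step can be expressed as a function applied to one record that raises an exception when it cannot handle it. The function and the step names are hypothetical, and catching every exception is a deliberate simplification.

```python
def run_with_counts(rows, steps):
    """Apply (name, function) steps record by record and count survivors per step.

    Returns (surviving records, per-step report, share of records that passed every step).
    """
    current = list(rows)
    initial = len(current)
    report = []
    for name, step in steps:
        survivors = []
        failed = 0
        for row in current:
            try:
                survivors.append(step(row))
            except Exception:  # simplification: any error counts as a failed record
                failed += 1
        report.append({"step": name, "in": len(current), "out": len(survivors), "failed": failed})
        current = survivors
    share = len(current) / initial if initial else 1.0
    return current, report, share


if __name__ == "__main__":
    def parse_amount(row):
        return {**row, "amount": float(row["amount"])}

    data = [{"amount": "10"}, {"amount": "n/a"}]
    _, report, share = run_with_counts(data, [("parse_amount", parse_amount)])
    print(report, share)
```

If the share turns out lower than anyone expected, that number is the debt, measured.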

Contact

If this resonated, write to me. I reply personally.