Data | October 31, 2016 | 3 min read

Who owns the data pipeline when the answer is nobody

In most companies data pipelines are built by whoever needed the data, owned by nobody, and relied upon by everyone. That is a systemic fragility, not a technical problem.

The conversation about data pipelines in companies usually starts with a technical question: which tool should we use, how do we handle failures, how do we schedule runs. Those are real questions, but they are the second conversation. The first conversation is: who owns this thing?

In most companies I look at, the honest answer is nobody. A pipeline was built by a developer who needed data for a specific feature. That developer has since moved to another team. The pipeline runs every night, something breaks occasionally, and the person who gets paged is whoever happened to be nearby when the alert fired - not the owner of the data, not the owner of the consuming application, and usually not someone who understands what the pipeline does.

This is not a tool problem. It is an ownership model problem.

How pipelines end up ownerless

Pipelines are built incrementally. The first version is simple: pull data from the source, transform it slightly, load it into the analytics table. That works. Then a second team discovers the table and starts depending on it. Then a third. Then someone adds a transformation for a new requirement, and the pipeline now has three consumers with different expectations of what the output should look like.

At no point in this history was a decision made about who is responsible for the pipeline. It happened organically. And ownership that emerges organically tends to mean that everyone depends on the pipeline, but nobody is on the hook when it breaks.

What "ownership" means in practice

Pipeline ownership is not about who wrote the original code. It means:

Someone reviews changes before they go to production.
Someone decides whether the schema of the output can change and communicates that to consumers.
Someone is responsible for the pipeline being up - not just for debugging when it is down.
Someone defines what "correct" looks like for the output data.

That last point is underappreciated. Most data quality problems are not detectable by the pipeline itself - the job runs successfully, but the output is subtly wrong because an assumption upstream changed. The owner is the person who cares enough to define what correct means and validate that it is still true.
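To make "define what correct means" concrete, here is a minimal sketch of owner-written expectations that run against the output after the job finishes. The table shape (`order_id`, `amount`) and the three checks are hypothetical, chosen only for illustration; real expectations depend on what the consumers actually rely on.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Expectation:
    """One statement by the owner about what correct output looks like."""
    name: str
    check: Callable[[List[Row]], bool]


def validate_output(rows: List[Row], expectations: List[Expectation]) -> List[str]:
    """Return the names of the expectations that the output fails."""
    return [e.name for e in expectations if not e.check(rows)]


# Hypothetical expectations for a hypothetical daily orders table.
EXPECTATIONS = [
    Expectation("has rows", lambda rows: len(rows) > 0),
    Expectation("amount is non-negative",
                lambda rows: all(r["amount"] >= 0 for r in rows)),
    Expectation("order_id is unique",
                lambda rows: len({r["order_id"] for r in rows}) == len(rows)),
]

rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": -5.00},  # the job "succeeded", the data is wrong
]
failures = validate_output(rows, EXPECTATIONS)
print(failures)
```

The point of the sketch is that the job itself can exit successfully while `failures` is non-empty. Someone has to write and maintain that list of expectations, and that someone is the owner.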

A simple model that works

I have seen teams make progress with a very lightweight model:

Each pipeline has a named owner - a person, not a team. That person is recorded in a maintained document alongside the pipeline's source, destination, schedule, and the list of known consumers.
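The maintained document can be a wiki page or a spreadsheet, but it can also live next to the code. A minimal sketch of that second option, assuming a hypothetical pipeline, hypothetical names, and a cron string for the schedule:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PipelineRecord:
    name: str
    owner: str                       # a named person, not a team
    source: str
    destination: str
    schedule: str                    # cron expression (an assumption of this sketch)
    consumers: Tuple[str, ...] = ()  # known consumers of the output


# Hypothetical entries; every value here is made up for illustration.
REGISTRY: Dict[str, PipelineRecord] = {
    "orders_nightly": PipelineRecord(
        name="orders_nightly",
        owner="alice@example.com",
        source="orders_db.orders",
        destination="analytics.daily_orders",
        schedule="0 2 * * *",
        consumers=("finance-reports", "recommendations"),
    ),
    "legacy_export": PipelineRecord(
        name="legacy_export",
        owner="",                    # nobody has claimed this one yet
        source="crm.contacts",
        destination="analytics.contacts",
        schedule="0 3 * * *",
    ),
}


def ownerless(registry: Dict[str, PipelineRecord]) -> List[str]:
    """Return the names of pipelines that have no named owner."""
    return sorted(name for name, rec in registry.items() if not rec.owner.strip())


print(ownerless(REGISTRY))
```

Keeping the record in the repository means a change to the pipeline and a change to its owner can go through the same review.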

Any change to the output schema requires the owner to notify all listed consumers. This can be a Slack message or an email - it does not have to be a formal process.
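The notification itself stays informal, but noticing that the schema changed does not have to depend on memory. A minimal sketch, assuming schemas are represented as plain column-to-type mappings and that the consumer list comes from the ownership document; the column names and consumers are hypothetical:

```python
from typing import Dict, List, Optional

Schema = Dict[str, str]  # column name -> type name


def schema_changes(old: Schema, new: Schema) -> List[str]:
    """Describe every difference between two output schemas."""
    changes = []
    for col in sorted(old.keys() | new.keys()):
        if col not in new:
            changes.append(f"removed column {col}")
        elif col not in old:
            changes.append(f"added column {col} ({new[col]})")
        elif old[col] != new[col]:
            changes.append(f"changed column {col}: {old[col]} -> {new[col]}")
    return changes


def consumer_notice(pipeline: str, consumers: List[str],
                    old: Schema, new: Schema) -> Optional[str]:
    """Build the message the owner sends, or None if nothing changed."""
    changes = schema_changes(old, new)
    if not changes:
        return None
    lines = [f"Schema change in {pipeline}, affecting: {', '.join(consumers)}"]
    lines += [f"- {c}" for c in changes]
    return "\n".join(lines)


old = {"order_id": "int", "amount": "float"}
new = {"order_id": "str", "amount": "float", "currency": "str"}
notice = consumer_notice("orders_nightly", ["finance-reports"], old, new)
print(notice)
```

The resulting text can be pasted into Slack or an email as it is. Sending it remains the owner's job.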

When the pipeline breaks, the alert goes to the owner first. The owner can escalate, but the default is that they are the first point of contact.

The harder part

The harder part is what to do about existing pipelines that have no owner. The most practical approach I have seen: when a pipeline breaks and someone fixes it, that person becomes the owner for the next 90 days. That makes ownership follow whoever most recently understood the pipeline well enough to repair it, which is usually the best available proxy for who should own it.

It also creates an incentive to build pipelines that are easy to understand and fix - because whoever builds or repairs one knows they may be the one getting paged for it later.
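The 90-day rule is simple enough to track mechanically. A minimal sketch, with hypothetical names and dates; what happens once the term lapses is not something the rule itself settles, so the sketch only reports that the term has ended:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict

OWNERSHIP_TERM = timedelta(days=90)


@dataclass
class Ownership:
    owner: str
    since: date

    @property
    def expires(self) -> date:
        return self.since + OWNERSHIP_TERM

    def is_active(self, today: date) -> bool:
        """True while the 90-day term is still running."""
        return today < self.expires


def record_fix(ownerships: Dict[str, Ownership], pipeline: str,
               fixer: str, fixed_on: date) -> Ownership:
    """Whoever fixed the pipeline becomes its owner, starting on the fix date."""
    ownerships[pipeline] = Ownership(owner=fixer, since=fixed_on)
    return ownerships[pipeline]


ownerships: Dict[str, Ownership] = {}
term = record_fix(ownerships, "legacy_export", "bob@example.com", date(2016, 10, 31))
print(term.owner, term.expires)
```

A later fix by someone else simply overwrites the entry, so the record always points at the most recent person who understood the pipeline well enough to repair it.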
