Data May 14, 2020 3 min read

Data producer-consumer contracts: the unglamorous fix that works

Why data pipelines break most often at team boundaries, and how formalising the contract between producers and consumers prevents the majority of those breaks.

The most persistent cause of broken data pipelines in mid-size companies is not bad infrastructure. It is an absent or informal agreement between the team that produces data and the team that consumes it. A schema changes, a field is renamed, a column is added that silently shifts the semantics of a downstream calculation - and nobody considers themselves responsible for the break because nobody defined the boundary.

This is a governance problem, not a technical one. And it has a straightforward, if unglamorous, solution.

What a data contract is

A data contract is a written agreement between a data producer and a data consumer about what the data looks like, what it means, and what guarantees the producer makes about it.

At minimum it covers:

The schema: field names, types, nullability.
Semantics: what does "revenue" mean in this table - gross, net, with or without returns?
Latency: when is the data expected to be available and with what freshness guarantee?
Stability: how will changes be communicated, and how much notice will the consumer get?

This does not require a complex system. A well-maintained document or a version-controlled YAML file is sufficient for most teams.
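To make the four parts concrete, here is a minimal sketch of a contract as a plain Python structure that could live in version control next to the pipeline code. Every name in it (`FieldSpec`, `DataContract`, the `orders_daily` dataset, the owner address, the specific guarantees) is a hypothetical illustration, not a standard format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: str        # schema: logical type, e.g. "string", "decimal"
    nullable: bool   # schema: may the value be missing?
    meaning: str     # semantics: what the field actually represents


@dataclass(frozen=True)
class DataContract:
    dataset: str
    version: str
    owner: str                      # hypothetical named producer contact
    fields: Tuple[FieldSpec, ...]
    available_by_utc: str           # latency: when the data should be ready
    change_notice_days: int         # stability: notice before a change ships


# Hypothetical example contract; all values are made up for illustration.
orders_contract = DataContract(
    dataset="orders_daily",
    version="1.0.0",
    owner="jane.doe@example.com",
    fields=(
        FieldSpec("order_id", "string", False, "Unique order identifier"),
        FieldSpec("revenue", "decimal", False, "Net revenue, after returns"),
        FieldSpec("coupon", "string", True, "Coupon code, if one was used"),
    ),
    available_by_utc="06:00",
    change_notice_days=14,
)
```

The same information fits just as well in a YAML file or a shared document. What matters is that all four parts are written down in one place.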

Why the boundary between teams is where things break

When a single team owns the full data path from source to consumption, changes are visible and the cost of a break lands on the person who made it. When the path crosses a team boundary, the producer does not feel the downstream consequences of their changes and the consumer has no mechanism to protect themselves from changes they do not know about.

This is a classic organisational problem that looks technical. The technical fix - schema validation, automated tests - only works if it is anchored to a social agreement about who is responsible for what.

What makes a contract useful in practice

A contract that nobody looks at is not a contract. Three things make one stick.

Versioning and changelog - when a producer changes the contract, the change is versioned and communicated before it is deployed. The consumer has time to adapt. This is the same logic as API versioning, applied to data.
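One way to support this is a small check that compares two versions of a schema and lists the changes a consumer needs to hear about in advance. The sketch below assumes schemas are simple dictionaries mapping a field name to a `(type, nullable)` pair; the function name and the rules for what counts as breaking are illustrative assumptions. It only sees structural changes: a field whose meaning shifts while its name and type stay the same still has to be recorded in the changelog by a person.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """List structural changes between two schema versions that could break a consumer.

    Schemas are assumed to look like {field_name: (type_name, nullable)}.
    """
    problems = []
    for name, (old_type, old_nullable) in old.items():
        if name not in new:
            # A rename looks the same as a removal from the consumer's side.
            problems.append(f"removed or renamed field: {name}")
            continue
        new_type, new_nullable = new[name]
        if new_type != old_type:
            problems.append(f"type changed for {name}: {old_type} -> {new_type}")
        if new_nullable and not old_nullable:
            problems.append(f"{name} became nullable")
    return problems


# Hypothetical versions of the same dataset's schema.
v1 = {"order_id": ("string", False), "revenue": ("decimal", False)}
v2 = {"order_id": ("string", False), "net_revenue": ("decimal", False)}

print(breaking_changes(v1, v2))
```

A non-empty result is a signal to bump the contract version and notify consumers before the change is deployed.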

Automated validation - the consumer runs schema checks at ingestion time and alerts on violations. This is not a substitute for the contract, but it catches cases where the contract is broken accidentally.
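A consumer-side check at ingestion can be very small. The sketch below validates one record against an expected schema and returns a list of violations; the schema layout, the field names, and the function name are assumptions made for illustration, and what to do with the violations (alert, quarantine the batch, fail the job) is left to the caller.

```python
# Hypothetical expected schema: field name -> (Python type, nullable).
EXPECTED = {
    "order_id": (str, False),
    "revenue": (float, False),
    "coupon": (str, True),
}


def validate_record(record: dict, schema: dict) -> list:
    """Return a list of human-readable contract violations for one record."""
    violations = []
    for name, (py_type, nullable) in schema.items():
        if name not in record:
            violations.append(f"missing field: {name}")
            continue
        value = record[name]
        if value is None:
            if not nullable:
                violations.append(f"null in non-nullable field: {name}")
        elif not isinstance(value, py_type):
            violations.append(
                f"{name}: expected {py_type.__name__}, got {type(value).__name__}"
            )
    # Fields the contract does not mention are also worth flagging.
    for name in sorted(record.keys() - schema.keys()):
        violations.append(f"unexpected field: {name}")
    return violations


good = {"order_id": "A1", "revenue": 10.0, "coupon": None}
bad = {"order_id": "A2", "revenue": "10.0"}

print(validate_record(good, EXPECTED))
print(validate_record(bad, EXPECTED))
```

A check like this catches a type or nullability change the moment it arrives, but it cannot tell that "revenue" quietly switched from gross to net. That part still depends on the written contract.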

Named ownership - each contract has a named producer owner. That person is the point of contact when something breaks or when the consumer needs a change. Anonymous ownership means nobody answers.

A practical starting point

If your engineering teams produce data that analytics or other engineering teams consume, spend one meeting listing the ten most critical data flows. For each one, write down: who produces it, who consumes it, what the schema is, and what the current implicit expectations are.

That list is your starting inventory. Prioritise the flows where a break would be most painful. Write a contract for those first. The others can follow as you build the habit.

The one pattern to avoid

The anti-pattern I see most often is treating this as a platform engineering problem - building a metadata catalog, a lineage tool, a data observability product - before establishing the social and process norms that make any tooling useful. The tool comes after the habit, not before it.

