Data October 28, 2025 3 min read

Data contracts: the agreement that comes before the pipeline

When two teams exchange data without a formal agreement, the pipeline works until it does not. Data contracts make expectations explicit and incidents avoidable.

Most data pipelines break not because of a bug in the transformation code, but because something upstream changed without warning. A field was renamed. A null was introduced where null was not expected. A table was restructured because the source team had a legitimate reason to do so - and nobody told the downstream consumers.

This is not a technology problem. It is a coordination problem. Data contracts are the mechanism that turns an informal dependency into an explicit agreement - with defined expectations on both sides.

What a data contract actually is

A data contract is a formal specification of what a data producer commits to delivering, and what a data consumer can rely on. At minimum it describes:

schema: field names, types, nullability;
semantics: what the fields mean, especially where the name is ambiguous;
freshness: how often data is updated and what latency is expected;
stability: what notice is given before breaking changes.

Some organisations also include an SLA on availability and a contact or escalation path.
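To make the four elements concrete, here is a minimal sketch of a contract expressed as plain Python dataclasses. It is not the format of any particular tool: the class names (`FieldSpec`, `DataContract`), the `orders` dataset, its fields, and the contact address are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FieldSpec:
    name: str          # schema: field name
    dtype: type        # schema: expected Python type
    nullable: bool     # schema: whether null is allowed
    description: str   # semantics: what the field means


@dataclass(frozen=True)
class DataContract:
    dataset: str
    fields: tuple[FieldSpec, ...]
    max_staleness: timedelta           # freshness: maximum age of the data
    breaking_change_notice: timedelta  # stability: notice before breaking changes
    owner_contact: str = ""            # optional contact or escalation path


# Hypothetical example contract for an "orders" dataset.
orders_contract = DataContract(
    dataset="orders",
    fields=(
        FieldSpec("order_id", str, False, "Unique order identifier"),
        FieldSpec("amount", float, False, "Order total in EUR, including VAT"),
        FieldSpec("shipped_at", str, True, "ISO 8601 timestamp; null until shipped"),
    ),
    max_staleness=timedelta(hours=24),
    breaking_change_notice=timedelta(days=30),
    owner_contact="orders-team@example.com",
)
```

The `description` field is where ambiguity gets resolved: "amount" alone does not say which currency or whether tax is included, and that is exactly the kind of detail consumers otherwise guess.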

The idea is not new. What is changing in 2025 is that the tooling has matured: dbt, Soda, Great Expectations, and dedicated contract frameworks like Bitol make a contract machine-readable and enforceable, rather than a PDF that lives in a wiki and is never updated.

Where the value is

The immediate value of a data contract is defensive: it makes clear what a breaking change means, so the source team thinks twice before renaming a column. But the more important value is in what it enables.
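One way to pin down "breaking change" is to compare two versions of a schema mechanically. The sketch below is an assumption about what a team might count as breaking (a removed or renamed field, a changed type, a field that becomes nullable); it treats added fields as non-breaking. The function name and the schema layout are hypothetical.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Compare two schema versions of the form {field_name: (type_name, nullable)}.

    Returns a list of human-readable breaking changes. Added fields are
    treated as non-breaking in this sketch.
    """
    problems = []
    for name, (old_type, old_nullable) in old.items():
        if name not in new:
            # A rename looks the same as a removal to a downstream consumer.
            problems.append(f"{name}: removed or renamed")
            continue
        new_type, new_nullable = new[name]
        if new_type != old_type:
            problems.append(f"{name}: type changed from {old_type} to {new_type}")
        if new_nullable and not old_nullable:
            problems.append(f"{name}: became nullable")
    return problems


v1 = {"order_id": ("string", False), "amount": ("float", False)}
v2 = {"order_id": ("string", False), "total": ("float", False), "note": ("string", True)}

print(breaking_changes(v1, v2))  # ['amount: removed or renamed']
```

A check like this could run in the producer's own CI, so the column rename is flagged before it ships rather than after a downstream job fails.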

When contracts exist, downstream teams can work confidently off a known spec without needing constant coordination with the source. Monitoring and validation can be automated against the contract. When something breaks, the contract tells you whether the source violated its commitment or the consumer was depending on undocumented behaviour.
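As a rough illustration of automated validation, the sketch below checks a batch of rows against a contract and separately lists the fields a consumer reads that the contract never promised. It uses only the standard library and is not the API of dbt, Soda, or Great Expectations; `CONTRACT`, `validate`, and `undocumented_dependencies` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract: {field_name: (python_type, nullable)} plus a freshness limit.
CONTRACT = {
    "fields": {
        "order_id": (str, False),
        "amount": (float, False),
        "shipped_at": (str, True),
    },
    "max_staleness": timedelta(hours=24),
}


def validate(rows, last_updated, now, contract=CONTRACT):
    """Return the list of contract violations on the producer side."""
    violations = []
    if now - last_updated > contract["max_staleness"]:
        violations.append("freshness: data older than max_staleness")
    for i, row in enumerate(rows):
        for name, (dtype, nullable) in contract["fields"].items():
            if name not in row:
                violations.append(f"row {i}: missing field {name}")
            elif row[name] is None:
                if not nullable:
                    violations.append(f"row {i}: {name} is null")
            elif not isinstance(row[name], dtype):
                violations.append(f"row {i}: {name} is not {dtype.__name__}")
    return violations


def undocumented_dependencies(consumer_fields, contract=CONTRACT):
    """Fields the consumer reads that the contract does not cover."""
    return sorted(set(consumer_fields) - set(contract["fields"]))


now = datetime(2025, 10, 28, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": "A1", "amount": 10.0, "shipped_at": None},
    {"order_id": "A2", "amount": None, "shipped_at": None},
]

print(validate(rows, now - timedelta(hours=2), now))     # ['row 1: amount is null']
print(undocumented_dependencies(["order_id", "coupon"]))  # ['coupon']
```

The two functions map onto the two possible answers when something breaks: a non-empty result from `validate` points at the producer, while a non-empty result from `undocumented_dependencies` points at a consumer relying on behaviour nobody committed to.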

Without contracts, debugging a broken pipeline starts with "let me ask the source team what changed" - which usually involves finding the right person, waiting, and discovering that the change went in two weeks ago.

The organisational dynamics

Data contracts work best when they are treated as a two-party commitment, not a document imposed on producers by a central data team. In practice this means the producing team has meaningful input into the contract: what they can actually commit to, and what change notice is realistic given their development cadence.

The common failure mode is a central team defining contracts unilaterally based on what consumers want, without understanding what producers can realistically maintain. Then the first time a source team breaks a contract because they had to, the mechanism loses credibility.

Getting this right requires a governance function that brokers the agreement rather than one side dictating to the other.

A practical starting point

I recommend starting with a small scope: two or three pipelines that are critical and have broken in the past year. For those pipelines, document the current implicit contract explicitly. Then answer: does the producing team know this is what downstream depends on? Can they commit to it? What would they need to change to do so?
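A quick way to draft the current implicit contract is to summarise what the data actually looks like today and bring that to the producing team as a starting point. The sketch below is one possible approach, assuming you can pull a sample of rows as dictionaries; the function name and sample data are hypothetical, and a field missing from a row is counted as null.

```python
def infer_implicit_contract(rows):
    """Summarise observed field types and nullability from sample rows.

    A field that is absent from a row is treated the same as a null value.
    """
    names = sorted({name for row in rows for name in row})
    observed = {}
    for name in names:
        values = [row.get(name) for row in rows]
        observed[name] = {
            "types": sorted({type(v).__name__ for v in values if v is not None}),
            "nullable": any(v is None for v in values),
        }
    return observed


sample = [
    {"order_id": "A1", "amount": 10.0},
    {"order_id": "A2", "amount": None},
]

for field_name, summary in infer_implicit_contract(sample).items():
    print(field_name, summary)
```

The output describes only what was observed in the sample, not what the producer intends. A field that shows up as non-nullable here may simply not have been null yet, which is precisely the question to put to the producing team.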

That exercise usually surfaces either a genuine contract that just needs to be written down, or a dependency that was never stable and needs to be renegotiated. Both are valuable to know.

Once you have two or three contracts that have each survived one breaking change, the pattern is established and adoption becomes easier to expand.

Contact

If this resonated, write to me. I reply personally.