Data contracts: how teams agree on quality
When multiple teams share data, conflicts of expectation are inevitable. Data contracts are a practical tool for making those expectations explicit.
When one team in a company produces data and several others consume it, conflict eventually arises. The producer changes a schema or format, and consumers' pipelines break. Or a consumer builds processes on data the producer considers temporary or experimental. Nobody is at fault, but everyone suffers.
Anyone who works with data in a multi-team company knows this problem. There is a practical remedy: data contracts.
What a data contract is
A data contract is an explicit agreement between a data producer and its consumers. It specifies exactly what is delivered, in what format, how often, which quality and availability guarantees apply, and what happens when things change.
Conceptually this is similar to an API contract in software development. If a service has an API with documented endpoints, versioning, and a deprecation process, integration is predictable. If the API changes without warning, integration is fragile.
Data works the same way. A dataset or table without an explicit contract is effectively an unstable API.
Why this is rarely done
I often hear the argument: "We have a small team, we don't have time to document everything." That is understandable, but it slightly misframes the issue.
A data contract is not comprehensive documentation. It is a minimal explicit agreement that prevents costly surprises. Writing one takes an hour. Its absence can cost several days of work when something breaks.
The second reason is unclear ownership. Who owns this data? If nobody can answer that, there is nobody to write the contract or to honor it.
What belongs in a minimal contract
I work with this minimum:
- Description: what the data is, what it represents, what business logic drives its formation.
- Schema: fields, types, whether they are required, valid values.
- Refresh: how often, on what schedule or trigger.
- Quality: guarantees of completeness, latency, and uniqueness, or an explicit statement that there are no guarantees.
- Changes: how the producer gives notice of changes and how far in advance.
- Contact: who answers questions.
This is not a large document. It is a few paragraphs or a table. But having it raises the level of shared understanding between the producer and its consumers.
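The six items above can also be kept in machine-readable form, so that the schema part is checked automatically. The following is only a minimal sketch in plain Python, not a reference to any particular contract tool. The names `FieldSpec`, `DataContract`, `violations`, and the `orders_contract` example with its fields, schedule, and contact address are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """One schema entry: name, Python type, required flag, valid values."""
    name: str
    type: type
    required: bool = True
    allowed: Optional[frozenset] = None  # None means any value of the type


@dataclass(frozen=True)
class DataContract:
    """The six sections of the minimal contract (hypothetical layout)."""
    description: str
    schema: tuple          # tuple of FieldSpec
    refresh: str
    quality: str
    change_policy: str
    contact: str


def violations(contract: DataContract, record: dict) -> list:
    """Return human-readable schema violations for a single record."""
    problems = []
    for spec in contract.schema:
        value = record.get(spec.name)
        if value is None:
            if spec.required:
                problems.append(f"{spec.name}: missing required field")
            continue
        if not isinstance(value, spec.type):
            problems.append(
                f"{spec.name}: expected {spec.type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif spec.allowed is not None and value not in spec.allowed:
            problems.append(f"{spec.name}: value {value!r} not allowed")
    return problems


# Illustrative contract; every value here is an assumption.
orders_contract = DataContract(
    description="One row per confirmed customer order.",
    schema=(
        FieldSpec("order_id", str),
        FieldSpec("amount", float),
        FieldSpec("status", str, allowed=frozenset({"paid", "refunded"})),
        FieldSpec("coupon", str, required=False),
    ),
    refresh="daily, by 06:00 UTC",
    quality="order_id is unique; no completeness guarantee for coupon",
    change_policy="breaking changes announced 14 days ahead",
    contact="orders-data@example.com",
)

good = {"order_id": "A1", "amount": 9.5, "status": "paid"}
bad = {"order_id": "A1", "amount": "9.5", "status": "lost"}
print(violations(orders_contract, good))
print(violations(orders_contract, bad))
```

Only the schema section is enforced here. Refresh, quality, change policy, and contact stay as free text, which matches the idea that the contract is first of all an agreement between people and only secondarily a check.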
When this matters most
Contracts are especially important in three situations.
The first: data from one system is used to train ML models. A schema or logic change can silently break a training pipeline.
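One way to keep a schema change from passing silently is to compare the schema promised in the contract with the schema actually observed before training starts. The sketch below assumes both are available as plain dictionaries mapping column names to type names. The functions `schema_drift` and `is_breaking` and the example columns are hypothetical. It catches schema changes only, and a change in the logic behind a column would need separate checks on the data itself.

```python
def schema_drift(expected: dict, observed: dict) -> dict:
    """Compare two {column: type_name} mappings and list the differences."""
    removed = sorted(set(expected) - set(observed))
    added = sorted(set(observed) - set(expected))
    retyped = sorted(
        name
        for name in set(expected) & set(observed)
        if expected[name] != observed[name]
    )
    return {"removed": removed, "added": added, "retyped": retyped}


def is_breaking(drift: dict) -> bool:
    """Assumption: extra columns are harmless, removed or retyped ones are not."""
    return bool(drift["removed"] or drift["retyped"])


# Schema from the contract versus what arrived today (illustrative values).
expected = {"user_id": "int", "age": "int", "country": "str"}
observed = {"user_id": "int", "age": "float", "region": "str"}

drift = schema_drift(expected, observed)
print(drift)
if is_breaking(drift):
    print("Schema differs from the contract; stop before training.")
```

Treating added columns as non-breaking is a design choice. A stricter consumer could fail on any difference at all.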
The second: data is used to calculate metrics or KPIs that drive decisions. Here the cost of an unannounced change is high.
The third: data is passed to external partners or contractors. In this case the contract is not just an internal agreement but part of a commercial relationship.
If even one of these applies to your situation, it is worth starting with an inventory of the data flows that currently have no explicit contract.
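Such an inventory can begin as a simple list that is filtered by the three situations above. The sketch below is one possible shape for it, not a prescribed process. The `DataFlow` record, the use labels `ml_training`, `kpi`, and `external`, and the example flows are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Labels for the three situations where a missing contract hurts most.
HIGH_RISK_USES = frozenset({"ml_training", "kpi", "external"})


@dataclass(frozen=True)
class DataFlow:
    name: str
    uses: frozenset      # how consumers use this flow
    has_contract: bool


def uncovered(flows):
    """Flows without a contract that match a high-risk use, riskiest first."""
    risky = [f for f in flows if not f.has_contract and f.uses & HIGH_RISK_USES]
    # More high-risk uses first, then alphabetical for a stable order.
    return sorted(risky, key=lambda f: (-len(f.uses & HIGH_RISK_USES), f.name))


# Illustrative inventory; names and uses are hypothetical.
flows = [
    DataFlow("events", frozenset({"ml_training", "kpi"}), has_contract=False),
    DataFlow("orders", frozenset({"kpi"}), has_contract=True),
    DataFlow("partners_export", frozenset({"external"}), has_contract=False),
    DataFlow("scratch", frozenset({"adhoc"}), has_contract=False),
]

for flow in uncovered(flows):
    print(flow.name, sorted(flow.uses & HIGH_RISK_USES))
```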