Data November 20, 2023 3 min read

Data contracts: the discipline that separates order from chaos

What data contracts are, why they matter for any team passing data between systems, and how to start without complex infrastructure.

There is a situation I have seen in many companies. Analysts complain that data in reports changed unexpectedly. Engineers say they changed nothing - they just added a new field or renamed an old one. Analysts respond that this is what broke the dashboard. Several hours of investigation, an urgent patch, everything works again - until next time.

This is a symptom of a missing contract between the people who produce data and the people who consume it.

What a data contract is

A data contract is an explicit agreement about what the data provider commits to deliver: which fields, of what types, on what schedule, with what quality guarantees. It is more than documentation or a schema description: it is a commitment, and violating it is an event that requires a response.

The idea is simple: if one service delivers data to another, there must be an explicit understanding of the "interface" between them. If the provider changes the data structure without notice, that is a breach of contract; the consumer has the right to expect advance notice or a new version.

This is a direct analogy to API contracts in software development, applied to data.

Why this matters now

As the number of systems in a company grows, so does the number of integrations. Every integration is a potential break point when changes happen. The more systems, the harder it is to understand what connects to what and what will break if the data structure in a source changes.

Add AI systems that consume data for training or inference. An unannounced schema change in the source can silently degrade model quality: nothing fails visibly at first, but the consequences can be serious.

Data contracts are a way to make dependencies explicit and manageable.

What this looks like in practice

A full data contract implementation requires tooling and process. But you can start simply.

At the minimum level, a contract is a document (or a file in a repository) that describes which fields are present, their types, whether each is required or optional, expected value ranges, and the update frequency. The commitment: if something changes, notify consumers and give them time to adapt.
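As an illustration, such a minimal contract could be kept as a small Python file in the repository. This is only a sketch: the class names (`FieldSpec`, `DataContract`), the `orders_daily` dataset, its fields, and the 14-day notice period are all hypothetical, chosen to show the shape of the information rather than a standard format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One field the provider commits to deliver."""
    name: str
    dtype: type
    required: bool = True
    min_value: Optional[float] = None  # expected value range, if any
    max_value: Optional[float] = None


@dataclass(frozen=True)
class DataContract:
    """The agreement itself: what is delivered, how often, and by whom."""
    dataset: str
    version: str
    owner: str
    update_frequency: str
    change_notice_days: int  # how much time consumers get to adapt
    fields: Tuple[FieldSpec, ...]


# Hypothetical contract for an illustrative dataset.
ORDERS_CONTRACT = DataContract(
    dataset="orders_daily",
    version="1.0.0",
    owner="orders-team",
    update_frequency="daily",
    change_notice_days=14,
    fields=(
        FieldSpec("order_id", str),
        FieldSpec("amount", float, min_value=0.0),
        FieldSpec("coupon_code", str, required=False),
    ),
)


def required_field_names(contract: DataContract) -> list:
    """Names of the fields consumers can always rely on."""
    return [f.name for f in contract.fields if f.required]
```

Because the file lives next to the code, a change to the contract shows up in review like any other change, which is a natural moment to notify consumers.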

The next level is automatic validation: every time data is updated, it is checked against the contract. A contract violation is an event that stops the pipeline and generates a notification.
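A rough sketch of that check in plain Python, with no validation library involved. The contract layout, the field names, and the `notify` callback are assumptions made for the example; a real pipeline would plug in its own alerting.

```python
class ContractViolation(Exception):
    """Raised to stop the pipeline when a batch breaks the contract."""


# Hypothetical contract: field name -> expectations for that field.
CONTRACT = {
    "order_id": {"type": str, "required": True},
    "amount": {"type": float, "required": True, "min": 0.0},
    "coupon_code": {"type": str, "required": False},
}


def find_violations(rows, contract):
    """Return a list of human-readable violations for a batch of dict rows."""
    violations = []
    for i, row in enumerate(rows):
        for name, spec in contract.items():
            value = row.get(name)
            if value is None:
                if spec["required"]:
                    violations.append(f"row {i}: missing required field '{name}'")
                continue
            if not isinstance(value, spec["type"]):
                violations.append(
                    f"row {i}: '{name}' expected {spec['type'].__name__}, "
                    f"got {type(value).__name__}"
                )
                continue
            if "min" in spec and value < spec["min"]:
                violations.append(f"row {i}: '{name}' below minimum {spec['min']}")
        # Fields the contract does not know about are reported too.
        for name in sorted(row.keys() - contract.keys()):
            violations.append(f"row {i}: unexpected field '{name}'")
    return violations


def enforce(rows, contract, notify=print):
    """Notify and raise on any violation; otherwise pass the batch through."""
    violations = find_violations(rows, contract)
    if violations:
        notify(f"{len(violations)} contract violation(s): " + "; ".join(violations))
        raise ContractViolation(violations)
    return rows
```

With this in place, a batch where `amount` was renamed to `total` is reported as one missing required field plus one unexpected field, and the exception keeps the batch from reaching the dashboard.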

A few tools make this easier: Great Expectations for data validation, dbt for documenting transformations and dependencies, and separate YAML contract files versioned alongside the code.

What this gives the business

Fewer unexpected breakages. When data changes are agreed in advance, incidents like "analytics broke because someone renamed a column" happen less often.

Faster understanding of the consequences of changes. If a source needs to change, you can immediately see who depends on it and who needs to be notified.
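One way to picture "who depends on it": if each contract records its direct consumers, the full list of affected parties is a walk over that graph. The registry below is a hypothetical example, and the dataset names are made up.

```python
from collections import deque

# Hypothetical registry: dataset -> its direct consumers.
DOWNSTREAM = {
    "orders_daily": ["revenue_dashboard", "churn_model"],
    "revenue_dashboard": ["board_report"],
    "churn_model": [],
}


def affected_consumers(source, downstream):
    """Everything that directly or indirectly consumes `source`, nearest first."""
    seen = []
    queue = deque(downstream.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen or node == source:
            continue  # already listed, or a cycle back to the source
        seen.append(node)
        queue.extend(downstream.get(node, []))
    return seen
```

For the registry above, a change to `orders_daily` touches the dashboard and the model directly, and the board report through the dashboard.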

A reliable foundation for AI. Models trained or running on data with a known and respected contract produce more predictable results.

Where to start

No need to implement everything at once. A sensible start:

1. Identify the three to five most critical data flows in the company.
2. For each one, describe the current implicit contract - what is actually being delivered.
3. Make that contract explicit and agreed between producer and consumers.
4. Set a rule: changes to data structure happen only after notifying consumers.
5. As maturity grows, add automatic validation.
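The notification rule is easier to keep when a proposed change can be compared against the agreed contract before it ships. A minimal sketch, assuming each contract version is a plain mapping of field names to a type name and a required flag. Which changes count as breaking is a policy choice; the split below is one reasonable assumption, not a standard.

```python
def classify_changes(old, new):
    """Compare two contract versions.

    Returns (breaking, additive): changes that need advance notice and a
    migration window, and changes that only add something new.
    """
    breaking, additive = [], []
    for name, spec in old.items():
        if name not in new:
            breaking.append(f"removed or renamed field '{name}'")
        elif new[name]["type"] != spec["type"]:
            breaking.append(
                f"type of '{name}' changed from {spec['type']} to {new[name]['type']}"
            )
        elif spec["required"] and not new[name]["required"]:
            breaking.append(f"'{name}' is no longer guaranteed to be present")
    for name in new:
        if name not in old:
            additive.append(f"added field '{name}'")
    return breaking, additive


# Hypothetical versions: 'amount' has been renamed to 'total'.
OLD = {
    "order_id": {"type": "string", "required": True},
    "amount": {"type": "float", "required": True},
}
NEW = {
    "order_id": {"type": "string", "required": True},
    "total": {"type": "float", "required": True},
}

breaking, additive = classify_changes(OLD, NEW)
```

Here the rename shows up as one breaking change and one added field. Even additive changes are worth announcing, since a consumer that selects every column can still be surprised by a new one.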

This is not a revolution. It is a discipline that accumulates value gradually.

Contact

If this resonated, write to me. I reply personally.