Data November 3, 2017 3 min read

Data quality: four metrics that are worth tracking in practice

Most data quality programs stall because the metrics are too abstract. Here are four concrete measurements that show up problems early and connect to business outcomes.

Data quality is one of those topics where the theory is sensible and the practice is painful. Teams agree that quality matters, someone is asked to "measure it", and a few weeks later there is a dashboard no one looks at.

The usual reason: the metrics chosen are too general to be actionable. "Overall completeness: 94%" tells you almost nothing about what to fix or where the damage is.

I have found a small set of measurements that consistently surface real problems and are cheap enough to maintain.

Completeness by field, not by record

The standard approach measures how many records have all fields filled. That number hides everything. A record can be 95% complete and still be useless if the one missing field is the one your reports depend on.

More useful: track completeness for each critical field separately, and define "critical" by asking which fields would break a downstream report or decision if missing. Often this list is five to eight fields per dataset. Check those every day.

Freshness - not "was the job green" but "how old is the data"

Most pipelines have a success flag: the job ran, no errors logged. But success flags do not tell you whether the data is current. I have seen systems where a job "succeeds" by reloading last week's snapshot because the upstream source was unavailable and the error was swallowed silently.

Track the age of the newest record in each critical dataset. Set a threshold - say, "sales data older than 26 hours is a problem" - and alert on it. This catches a class of stale-data incidents that run undetected for days.

Referential consistency between systems

In most companies, the same entity - a customer, a product, a contract - exists in multiple systems with different identifiers. Reports that join across systems silently drop rows when the join keys do not match.

Measure: for each cross-system join in your reporting layer, track the match rate. If yesterday 98% of orders matched a customer record and today it is 91%, something broke. You will never see this from a pipeline success flag.

Volume anomalies relative to history

A sudden drop in row count - or a sudden spike - is almost always a signal of something wrong upstream. A source table was truncated. A filter was accidentally added. An extract window changed.

Track daily row counts per dataset against a rolling average. A simple rule - flag anything more than 20% outside the 30-day range - catches most of these incidents before the business notices.

Connecting quality to cost

These four metrics only stay maintained if someone sees the cost of ignoring them. I always try to attach each metric to a business event: a stale pricing feed means offers are priced with yesterday's rates; a broken customer join means a segment of orders disappears from revenue reports.

When the conversation shifts from "our completeness score is 93%" to "three times last month, campaign reports excluded 8% of orders because the join to the product table broke at midnight" - maintenance becomes a priority rather than a housekeeping task.

A minimal starting point

You do not need a data observability platform to start. A set of SQL queries run daily and results written to a table is enough to build the habit. The platform can come later, once the team has learned which checks matter and how to respond to failures.

The goal in the first three months is not a perfect quality score. It is building the instinct to check before trusting.

Back to all posts

Contact