Data quality before analytics: why dirty master data breaks any BI
Dashboards and BI tools produce answers that are exactly as good as the data underneath them. Until reference data and master records are in order, visualisation only makes the disorder look convincing.
A company buys a BI system, builds dashboards, adds nice charts. A few months later management stops trusting the numbers. "You have one figure, she has another, the system has a third - how do we know which one is right?"
I hear this often. And almost always the problem is not the BI system and not the analysts. The problem is the data it is all built on.
What "dirty reference data" looks like
Reference data is data about entities that appear throughout the company: customers, vendors, products, employees, warehouses, accounts. When it is in order, different systems talk about the same thing using the same identifier. When it is not, chaos follows.
Typical symptoms:
- one customer is recorded in the CRM under one name, in the ERP under another, and in the delivery database under a third;
- a product has different SKUs in different systems and there is no mapping table anywhere;
- "region" in one report means a federal district, in another a state, in a third a time zone;
- a vendor is recorded with and without a tax ID, in various name formats, with historical duplicates.
The BI system takes this data and faithfully displays it. The analyst faithfully counts it. The result is a polished chart that does not answer a single real question.
Why fixing reports does not fix the problem
The common reaction is to try to fix the problem where it is visible: in the report. Add a condition to the query, join fields by hand, write an Excel mapping table.
This works once, for one report. Then a second report appears, then a third and a fourth. Each has its own joining logic. A year later the company has ten reports with ten different interpretations of the same data - and no confidence that any of them are correct.
Fixing a report is treating a symptom. The root of the problem is the absence of master data management: until the foundational data is in order, any layer built on top of it only makes the disorder look convincing.
What MDM is and why it matters
MDM - Master Data Management - is not a specific product or software category. It is a process and a responsibility.
The idea is simple: the company needs one authoritative source for each key entity. If customers, then one customer registry that all systems draw from. If products, then one catalogue with a named owner who keeps it current.
In practice this means:
- assigning an owner for each key reference dataset;
- defining the rules: what counts as a duplicate, how to name things, what format to use;
- choosing the source of truth and writing down how other systems synchronise with it;
- establishing procedures for new records: who creates them, who verifies them.
This is organisational work. Tools can help, but no tool saves you without a process and an accountable owner.
When to start fixing data quality
There are signs that the moment has arrived:
- the same metric is calculated differently in different departments;
- merging two reports requires manual reconciliation;
- when one system is replaced it is unclear how to migrate the data to another;
- a new employee cannot figure out on their own where to find the "correct" numbers.
All of these point to one thing: data has been accumulating without governance. Investing in BI in this state produces a polished interface on top of disorder.
A practical sequence
The approach I usually suggest:
- Identify three to five key reference datasets used in most reports.
- For each one, find the current source of truth - or choose one if none exists.
- Assign an owner and agree on the rules.
- Document the mappings between different identifiers across systems.
- Only then build or rebuild the analytics layer on top.
This is not fast. But without this step any BI project ends up in the same place a year later: polished dashboards and distrust of the numbers.