Data March 9, 2017 3 min read

A data catalog: the discipline of knowing what you have

Why metadata management is not a technical project but an operational necessity for companies that work with data seriously.

There is a question I ask in almost every company where I start working: if an analyst needs sales data for the past two years, how do they find it? Not "do we have it" - I am confident we do. Specifically: how do they find it, how do they decide which dataset to use and whether to trust it, and whom do they ask if something is unclear?

In most cases the answer is: through colleagues. Send a message to the right person who knows where things live. That works - until that person is on holiday, has left the company, or is busy.

What a data catalog is and why it matters

A data catalog is not a knowledge base and not documentation. It is a working tool: a centralised registry of what data the company has, where it physically lives, what it means, who is responsible for it, and how much it can be trusted.

The difference between "documentation about data" and a "working catalog" is fundamental. Documentation is written once and goes stale. A catalog is kept current as part of the operational process - because without that upkeep it loses its value faster than most teams expect.

A good catalog answers several key questions for each dataset:

Where did this data come from and when was it last updated?
Who is the owner - a specific person, not a department?
What fields does the dataset contain and what do they mean?
What known limitations or quirks does it have?
How is this data connected to other datasets?
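
To make those questions concrete, here is a minimal sketch of what one catalog entry could look like as a data structure. The class name, field names, and the unanswered_questions helper are my own illustrative assumptions, not the schema of any particular catalog tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class CatalogEntry:
    """One dataset in the catalog. All names here are hypothetical."""
    name: str
    source: str = ""                     # where the data came from
    last_updated: Optional[date] = None  # when it was last refreshed
    owner: str = ""                      # a specific person, not a department
    fields: dict[str, str] = field(default_factory=dict)        # field -> meaning
    known_limitations: list[str] = field(default_factory=list)  # quirks, caveats
    related_datasets: list[str] = field(default_factory=list)   # connections


def unanswered_questions(entry: CatalogEntry) -> list[str]:
    """Return which of the key questions the entry cannot answer yet.

    Limitations and related datasets are not checked, because an empty
    list there can be a legitimate answer.
    """
    gaps = []
    if not entry.source.strip():
        gaps.append("source")
    if entry.last_updated is None:
        gaps.append("last_updated")
    if not entry.owner.strip():
        gaps.append("owner")
    if not entry.fields:
        gaps.append("fields")
    return gaps


# Example entry: the source and fields are described, but nobody owns it
# and nobody recorded when it was last refreshed.
sales = CatalogEntry(
    name="sales_orders",
    source="ERP export",
    fields={"order_id": "unique order number", "amount": "net value"},
)
print(unanswered_questions(sales))
```

An entry with an empty result is not automatically trustworthy, but an entry with gaps is a clear signal of where an analyst will end up asking colleagues instead.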

When the absence of a catalog starts costing money

The first symptoms are invisible. An analyst spends an hour finding the right data instead of ten minutes. That seems normal: "they'll figure it out."

Then the company launches an analytics project, and half the time goes not to analysis but to figuring out which data can be trusted. That is already a visible loss.

Then it gets worse. Two reports show different figures for the same metric and nobody can quickly explain why. That is already a management crisis.

A data catalog does not solve all these problems automatically. But it creates the conditions under which they do not accumulate into a crisis.

Where to start if there is no catalog

The mistake I see most often: a company buys a metadata management tool and expects it to fill itself. The tool matters, but it is secondary.

First step - inventory. What data sources exist in the company? This does not mean listing the databases the IT department runs, but listing data from a business perspective: customer data, sales data, inventory data. For each source, record who uses it and who is responsible for its accuracy.

Second step - choose priority datasets. There is no need to catalog everything at once. Start with the data used most often and on which key decisions depend.

Third step - assign ownership. Every dataset in the catalog must have an owner. Without that, the catalog becomes documentation - current today and stale in six months.
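
The three steps above can be sketched in a few lines. This is only an illustration under my own assumptions: the InventoryItem fields (monthly_users, drives_key_decisions) and the ranking rule are hypothetical stand-ins for whatever "used most often" and "key decisions depend on it" mean in a given company.

```python
from dataclasses import dataclass


@dataclass
class InventoryItem:
    """One business-level data source from the inventory (hypothetical fields)."""
    name: str
    monthly_users: int           # rough measure of how often it is used
    drives_key_decisions: bool   # do key decisions depend on it?
    owner: str = ""              # a specific person; empty means unassigned


def prioritise(items: list[InventoryItem], top_n: int = 3) -> list[InventoryItem]:
    """Step 2: decision-critical data first, then by number of users."""
    ranked = sorted(
        items,
        key=lambda item: (item.drives_key_decisions, item.monthly_users),
        reverse=True,
    )
    return ranked[:top_n]


def without_owner(items: list[InventoryItem]) -> list[str]:
    """Step 3: names of datasets that still need an owner assigned."""
    return [item.name for item in items if not item.owner.strip()]


# Step 1: the inventory itself, written from a business perspective.
inventory = [
    InventoryItem("customer data", monthly_users=40, drives_key_decisions=True, owner="analyst_a"),
    InventoryItem("sales data", monthly_users=25, drives_key_decisions=True),
    InventoryItem("web logs", monthly_users=60, drives_key_decisions=False),
]

first_batch = prioritise(inventory, top_n=2)
print([item.name for item in first_batch])
print(without_owner(first_batch))
```

The ranking rule is deliberately crude. The point is that the first batch is small and that every item in it either has a named owner or shows up on a list of items that need one.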

A simple check

Ask a random analyst in the company to explain where the data in their favourite report comes from - all the way back to the source. If the chain is clear and documented, the catalog is working. If the answer is "I'm used to working with this data and I know it can be trusted" - that knowledge lives in a person's head, not in a system.
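
The same check can be run against the catalog itself rather than against a person. Below is a minimal sketch, assuming lineage is recorded as a simple mapping from each dataset to its direct inputs. The mapping format, the dataset names, and the trace_to_sources function are hypothetical, not taken from any specific tool.

```python
def trace_to_sources(dataset: str, upstream: dict[str, list[str]]) -> tuple[set[str], set[str]]:
    """Walk lineage from a report back to its origins.

    Returns (sources, undocumented):
      sources      - datasets recorded with no inputs, i.e. original sources
      undocumented - datasets the lineage mapping says nothing about
    """
    sources: set[str] = set()
    undocumented: set[str] = set()
    seen: set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:  # guards against cycles and repeated visits
            continue
        seen.add(current)
        if current not in upstream:
            undocumented.add(current)       # the chain breaks here
        elif not upstream[current]:
            sources.add(current)            # documented origin
        else:
            stack.extend(upstream[current])  # keep walking upstream
    return sources, undocumented


# Hypothetical lineage records: dataset -> its direct inputs.
lineage = {
    "monthly_sales_report": ["sales_clean"],
    "sales_clean": ["erp_orders", "crm_accounts"],
    "erp_orders": [],
}

sources, undocumented = trace_to_sources("monthly_sales_report", lineage)
print(sorted(sources))
print(sorted(undocumented))
```

If the second set is empty, the chain is documented all the way back. Every name in it is a place where the knowledge still lives in someone's head.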

If this resonated, write to me. I reply personally.