Data March 24, 2021 3 min read

Data catalog: what it gives you and when you can skip it

A practical look at a data registry: what it delivers, when it justifies the investment, and when it is premature complexity.

A data catalog is a registry of a company's data: what exists, where it comes from, who owns it, how to use it. The topic has grown popular alongside rising interest in data governance and AI projects. With that popularity comes a familiar effect: teams start implementing the tool before there is a real need for it.

Here is my view on when a data catalog actually solves a problem, and when it is an extra layer that needs maintenance but delivers nothing.

What happens without a catalog

In small teams, the absence of a catalog is usually not a problem. Everyone knows where things live because they built it themselves. Documentation exists in the memory of three people.

The problem appears with growth - of the team, of systems, of data volume. Questions emerge that have no quick answers: which table holds last year's order data? Is that the same table the marketing analyst uses? Does "customer_id" mean the same thing in both places? Who can explain why this column sometimes contains null values?

If answering questions like these takes hours and means tracking down one specific person, that is a real cost. It is easy to miss because it accrues a little at a time, but it adds up.

What a data catalog provides

In its minimal form, a data catalog is a documented answer to the question "what do we have". A list of sources, tables, key fields with their meaning, and - most importantly - who owns each dataset.
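To make the minimal form concrete, here is a rough sketch of what one entry could look like as a data structure. The class names, field names, and sample values are all hypothetical and chosen only for illustration; a wiki table with the same columns does the same job.

```python
from dataclasses import dataclass, field


@dataclass
class KeyField:
    name: str      # column name as it appears in the table
    meaning: str   # plain-language definition agreed on by the team


@dataclass
class CatalogEntry:
    source: str    # system the table lives in (hypothetical name)
    table: str
    owner: str     # person or team who answers questions about this data
    key_fields: list[KeyField] = field(default_factory=list)


def find_owner(catalog, table):
    """Return the owner of a table, or None if the table is not catalogued."""
    for entry in catalog:
        if entry.table == table:
            return entry.owner
    return None


# Illustrative data only: one catalogued table with one documented field.
catalog = [
    CatalogEntry(
        source="warehouse",
        table="orders",
        owner="sales-analytics",
        key_fields=[KeyField("customer_id", "ID of the customer who placed the order")],
    ),
]

print(find_owner(catalog, "orders"))
```

The point of the sketch is the `owner` attribute: a lookup that returns `None` means nobody has claimed the dataset yet, which is exactly the gap a catalog is meant to expose.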

In a fuller form, it adds: change history, sensitivity classification, relationships between objects, quality metrics.

For AI projects a catalog is especially valuable: before starting a pilot you need to know what data is available. Without a catalog, that investigation alone can take weeks.

When a catalog is not needed

If you have one or two analytical systems and a data team of up to five people - a full data catalog is premature. The cost of maintaining it will exceed the benefit. A structured internal document or wiki describing the key sources is enough.

If analytics in the company happens irregularly and mostly by hand - a catalog creates an illusion of order without real improvement.

If nobody will maintain it - better not to start. An outdated catalog is worse than no catalog: it creates a false sense that everything is documented.
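One way to notice a catalog going stale is to record when each entry was last reviewed and flag the ones that have gone quiet. A minimal sketch, assuming a hypothetical `last_reviewed` mapping and an arbitrary 180-day threshold (both are my assumptions for illustration, not part of any particular tool):

```python
from datetime import date, timedelta


def stale_entries(last_reviewed, today, max_age_days=180):
    """Return the names of entries not reviewed within max_age_days, sorted."""
    limit = timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items()
        if today - reviewed > limit
    )


# Illustrative data only: table name -> date of the last review.
last_reviewed = {
    "orders": date(2021, 3, 1),
    "customers": date(2020, 6, 15),
}

stale = stale_entries(last_reviewed, today=date(2021, 3, 24))
print(stale)
```

If a list like this keeps growing and nobody acts on it, that is a fair signal that the catalog has no real maintainer.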

Three signs that the time has come

First: you have more than one analyst on the team, and they regularly ask each other about data structure.

Second: a new employee or contractor cannot independently figure out where a specific metric comes from within a reasonable amount of time.

Third: the company has sensitive data - personal, financial, commercial - and there is no clear picture of exactly where it is stored and who has access to it.

If even one of these signs is present, investment in a minimal catalog is justified.

How to start without a large investment

Expensive enterprise solutions are not needed at the beginning. Start small:

Create a single document or wiki page listing all key data sources.
For each source, record: what it stores, where the data comes from, and who is responsible.
Agree to update it whenever the structure changes.
Make this part of how data work is done, not a one-time exercise.
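If the list of sources lives in a structured file rather than free text, the steps above can be backed by a small completeness check. The sketch below assumes each source is a plain dictionary with hypothetical keys `stores`, `origin`, and `owner`, matching the three things to record for each source; the sample sources are invented.

```python
# The three things to record for each source (hypothetical key names).
REQUIRED = ("stores", "origin", "owner")


def missing_fields(source):
    """Return the required keys that are absent or blank for one source."""
    return [key for key in REQUIRED if not source.get(key, "").strip()]


# Illustrative data only.
sources = [
    {
        "name": "orders",
        "stores": "One row per customer order",
        "origin": "Online shop database",
        "owner": "sales-analytics",
    },
    {
        "name": "web_events",
        "stores": "Clickstream events",
        "origin": "",
        "owner": "marketing",
    },
]

# Map each incomplete source to the fields it still lacks.
gaps = {}
for source in sources:
    missing = missing_fields(source)
    if missing:
        gaps[source["name"]] = missing

print(gaps)
```

Running a check like this whenever the structure changes is one cheap way to make the update habit part of regular data work.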

In my experience, a catalog like this solves roughly 80% of the problems. Specialised tools add value later, once the foundation exists and you know exactly what is missing.

If this resonated, write to me. I reply personally.