Data | February 23, 2016 | 3 min read

Why structuring data must come before any ML model

Before the conversation reaches algorithm selection, you need to establish whether there is data worth learning from. I walk through that step in detail.

When a company decides to try machine learning, the conversation usually starts with the task and the algorithm. We want to predict customer churn - let us look at classification. We want to forecast demand - regression, perhaps.

This framing is natural when you look at the problem top-down. But in practice it is exactly where the main cause of later failure gets built in: before any discussion of algorithms, you need to establish whether there is data that is actually fit for training.

What "fit for training" means

A machine learning model is not a system you configure and launch. It is a system that learns from examples. That requires several conditions to hold.

First: examples must exist in sufficient quantity. For most business tasks this means thousands of records, not dozens. Too little data and the model does not generalise - it memorises.

Second: the examples must have correct answers. For a churn prediction task you need to know who left and when. If that history lives only in managers' heads or scattered across email threads, there is no training data.

Third: the data must come from a period and context comparable to those in which the model will operate. Data from three years ago, collected in a market that has since changed substantially, will produce a model that forecasts the past.

Common problems I encounter

The most frequent one is the gap between what the data should logically contain and what is actually in it. The "first contact date" field exists in the CRM but is populated for only 40% of customers. The "reason for decline" field exists, but managers filled it in freely over three years with no consistent vocabulary.
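A minimal sketch of how this gap could be measured, assuming the CRM export is available as a list of dictionaries. The field names and sample rows are hypothetical, and the normalisation is deliberately light (case and whitespace only), so it will not merge synonyms such as "price" and "too expensive".

```python
from collections import Counter

def fill_rate(records, field):
    """Share of records where `field` is present and non-blank."""
    if not records:
        return 0.0
    # Note: `or ""` also treats 0 and False as empty; fine for text and date fields.
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records)

def vocabulary(records, field):
    """Count distinct non-blank values after trimming and lower-casing."""
    values = (str(r.get(field) or "").strip().lower() for r in records)
    return Counter(v for v in values if v)

# Hypothetical CRM export; field names are illustrative only.
customers = [
    {"id": 1, "first_contact_date": "2015-03-02", "decline_reason": "Price"},
    {"id": 2, "first_contact_date": "", "decline_reason": "too expensive"},
    {"id": 3, "first_contact_date": None, "decline_reason": "price "},
    {"id": 4, "first_contact_date": "2015-07-19", "decline_reason": ""},
    {"id": 5, "first_contact_date": None, "decline_reason": "went to competitor"},
]

rate = fill_rate(customers, "first_contact_date")
reasons = vocabulary(customers, "decline_reason")
print(f"first_contact_date filled: {rate:.0%}")
print(reasons.most_common())
```

A long tail of near-duplicate values in the second output is usually the first visible sign of a free-text field with no agreed vocabulary.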

The second common problem is mixed contexts. A single table may contain data from different periods with different business rules, different products, different pricing. A model trained on this will absorb patterns that no longer exist.
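One way to make mixed contexts visible is to tag every row with the business-rule period it belongs to. The sketch below assumes someone can supply the dates on which rules changed; `RULE_CHANGES`, the `created` field and the sample rows are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date

# Hypothetical dates on which pricing or business rules changed.
RULE_CHANGES = [date(2014, 1, 1), date(2015, 6, 1)]

def regime_of(day, boundaries=RULE_CHANGES):
    """Index of the rule period that `day` falls into (0 = oldest period)."""
    return bisect_right(boundaries, day)

def rows_per_regime(rows, date_field="created"):
    """Count how many rows fall into each rule period."""
    return Counter(regime_of(r[date_field]) for r in rows)

rows = [
    {"created": date(2013, 5, 1)},
    {"created": date(2014, 1, 1)},   # a change date belongs to the new period
    {"created": date(2015, 5, 31)},
    {"created": date(2015, 9, 9)},
]

counts = rows_per_regime(rows)
# Keep only rows from the most recent period, the one the model will operate in.
current = [r for r in rows if regime_of(r["created"]) == len(RULE_CHANGES)]
print(dict(counts), len(current))
```

Whether to train only on the latest period or to keep older rows with the period as a feature is a judgment call; the count per period at least shows how much data each option leaves.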

The third is absent metadata. What does the value "3" in the status field mean? When did the logic of that field change? Without documentation, the data technically exists but can only be interpreted correctly by someone who "remembers how it worked".
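A small illustration of what even minimal documentation makes possible: with a written data dictionary, values nobody can explain can be listed automatically. The meanings in `STATUS_MEANINGS` are invented for the example and are not taken from any real system.

```python
# Hypothetical data dictionary; the meanings are invented for illustration.
STATUS_MEANINGS = {"1": "lead", "2": "active customer", "3": "contract ended"}

def undocumented_values(records, field, dictionary):
    """Return the values found in `field` that the dictionary does not explain."""
    seen = {str(r[field]) for r in records if r.get(field) is not None}
    return sorted(seen - set(dictionary))

records = [{"status": 1}, {"status": 3}, {"status": 7}, {"status": None}]
unknown = undocumented_values(records, "status", STATUS_MEANINGS)
print(unknown)
```

This does not answer the harder question of when a field's logic changed; that still has to come from people or from change logs.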

What to do before choosing an algorithm

In almost every project where I help assess ML readiness, we walk through the same steps:

Inventory: what data exists, where it lives, who is responsible for it.
Completeness check: for key fields, what percentage of records are filled and whether there are anomalies.
History check: how far back the data goes and how the collection logic has changed.
Linkage check: whether data from different sources can be joined and whether the keys to do that exist.
Target variable definition: what exactly we want to predict and whether it appears explicitly in the data.
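Two of the steps above, the history check and the linkage check, reduce to very small computations. The sketch below assumes both sources are lists of dictionaries sharing a `customer_id` key; the source names, field names and rows are hypothetical.

```python
from datetime import date

def history_span(rows, date_field):
    """Earliest and latest date present in `date_field`, or None if there are none."""
    dates = [r[date_field] for r in rows if r.get(date_field) is not None]
    return (min(dates), max(dates)) if dates else None

def linkage_rate(left, right, key):
    """Share of `left` rows whose key value also appears in `right`."""
    if not left:
        return 0.0
    right_keys = {r[key] for r in right if r.get(key) is not None}
    matched = sum(1 for r in left if r.get(key) in right_keys)
    return matched / len(left)

# Hypothetical CRM and billing extracts.
crm = [
    {"customer_id": 1, "created": date(2013, 4, 2)},
    {"customer_id": 2, "created": date(2014, 8, 15)},
    {"customer_id": 3, "created": None},
    {"customer_id": 4, "created": date(2015, 12, 1)},
]
billing = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 5}]

span = history_span(crm, "created")
rate = linkage_rate(crm, billing, "customer_id")
print(span, f"{rate:.0%} of CRM customers found in billing")
```

A low linkage rate does not say which source is wrong, only that joined features will be missing for a large share of customers.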

Only after this does a conversation about algorithms become concrete.

A practical test

If, before an ML project starts, the team cannot produce a sample of 500 examples with correct answers within one working day, that is a signal the data is not ready. Not because 500 examples are sufficient, but because if even that is hard to do, building a proper model will be harder still.
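Once the data is in one place, the mechanical part of this test is short. A sketch, assuming rows are dictionaries and that a missing answer is stored as `None`; the `churned` field and the generated rows are hypothetical. The hard part in practice is getting to this point within a day, not running the function.

```python
import random

def labelled_sample(rows, target, n=500, seed=0):
    """Return n random rows with a known `target`, or None if there are too few."""
    labelled = [r for r in rows if r.get(target) is not None]
    if len(labelled) < n:
        return None
    # A fixed seed makes the sample reproducible for later review.
    return random.Random(seed).sample(labelled, n)

# Hypothetical data: 600 customers, every sixth one has no recorded outcome.
rows = [
    {"id": i, "churned": None if i % 6 == 0 else i % 2 == 1}
    for i in range(600)
]

sample = labelled_sample(rows, "churned")
print("ready" if sample is not None else "not ready")
```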

That is not a verdict against the project. It is a diagnosis that needs to be addressed before modelling begins.
