Data July 24, 2019 3 min read

Why ML teams keep rewriting the same thing over and over

Feature stores and feature management in machine learning: where the duplication comes from and how to get rid of it.

When a company launches its second ML project, the same thing almost always happens. A new team - or the same person - starts writing code to prepare data. Then it turns out that a significant part of that work was already done last time. The code lives in a different repository, sits in a colleague's notebook, or was simply lost after a reorganisation.

I call this the feature rewriting problem. A feature in machine learning is a specific variable the model uses to make predictions: average transaction value over the last 90 days, days since the last purchase, the ratio of cancelled orders. Computing them means querying the raw data, normalising it, and aggregating it. That is real work. And it repeats in every new project.
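To make the three example features concrete, here is a minimal sketch in plain Python. The `Order` record, the `customer_features` function, and the choice to exclude cancelled orders from the average are all assumptions made for illustration, not a description of any particular system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Order:
    # Hypothetical order record; field names are illustrative.
    customer_id: str
    day: date
    amount: float
    cancelled: bool


def customer_features(orders, customer_id, as_of, window_days=90):
    """Compute three example features for one customer as of a given date."""
    # Only look at this customer's orders known on or before `as_of`.
    mine = [o for o in orders if o.customer_id == customer_id and o.day <= as_of]
    completed = [o for o in mine if not o.cancelled]
    window_start = as_of - timedelta(days=window_days)
    recent = [o for o in completed if o.day > window_start]

    return {
        # Assumption: cancelled orders do not count towards the average.
        "avg_transaction_value_90d": (
            sum(o.amount for o in recent) / len(recent) if recent else 0.0
        ),
        "days_since_last_purchase": (
            (as_of - max(o.day for o in completed)).days if completed else None
        ),
        "cancelled_order_ratio": (
            sum(1 for o in mine if o.cancelled) / len(mine) if mine else 0.0
        ),
    }
```

Even in this toy version there are several judgment calls (what counts as a purchase, where the window starts, what to return for a customer with no orders). Each of them is a place where two independent rewrites can quietly disagree.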

Where the duplication comes from

The root of the problem is usually organisational, not technical. ML projects inside companies are most often launched as separate initiatives with separate teams. Each team solves the problem from scratch, including data preparation.

Even when it is technically possible to reuse code from a previous project, it usually does not happen: different library versions, unfamiliar logic, fear of breaking something in someone else's code, or simply the sense that rewriting is faster. The result is that the company has five versions of "customer revenue" and three of them compute it differently.

This creates more than technical debt. When different models use different versions of the same features, results become incomparable. The business loses the ability to understand which model performs better and why.
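A small illustration of how "customer revenue" can diverge. Both functions and the sample data below are hypothetical; the point is only that each definition is defensible on its own and the two still return different numbers for the same customer.

```python
# Hypothetical order data for one customer.
orders = [
    {"customer": "c1", "amount": 120.0, "status": "completed"},
    {"customer": "c1", "amount": 80.0, "status": "cancelled"},
    {"customer": "c1", "amount": 50.0, "status": "refunded"},
]


def revenue_team_a(orders, customer):
    # Team A: revenue is the sum of every order the customer placed.
    return sum(o["amount"] for o in orders if o["customer"] == customer)


def revenue_team_b(orders, customer):
    # Team B: revenue counts only orders that were completed.
    return sum(
        o["amount"]
        for o in orders
        if o["customer"] == customer and o["status"] == "completed"
    )


print(revenue_team_a(orders, "c1"))  # 250.0
print(revenue_team_b(orders, "c1"))  # 120.0
```

A model trained on the first definition and a model trained on the second are not using the same input, even though both feature columns are named "customer revenue".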

What a feature store is

A feature store is a centralised repository of computed features with versioning and the ability to reuse them. The idea is simple: a feature is computed once, described, stored, and available to any team.

In practice this means the team building a churn model can take an already-built feature "customer activity over 30 days" - the same one used by the team building the recommendation engine. The computation logic is shared, history is kept, results are comparable.
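The idea can be sketched as a tiny in-memory registry. This is a toy under stated assumptions, not how any production feature store is implemented: `FeatureDefinition`, `FeatureRegistry`, and the feature name `customer_activity_30d` are invented for this example, and a real system would also store the computed values, not just the definitions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class FeatureDefinition:
    # Hypothetical metadata: what a shared feature needs to be reusable.
    name: str
    version: int
    description: str
    owner: str
    compute: Callable[[dict], float]


class FeatureRegistry:
    """Toy registry: one definition per (name, version), never overwritten."""

    def __init__(self) -> None:
        self._definitions: Dict[Tuple[str, int], FeatureDefinition] = {}

    def register(self, definition: FeatureDefinition) -> None:
        key = (definition.name, definition.version)
        if key in self._definitions:
            # Changing the logic requires a new version, not an overwrite.
            raise ValueError(f"{definition.name} v{definition.version} already exists")
        self._definitions[key] = definition

    def get(self, name: str, version: Optional[int] = None) -> FeatureDefinition:
        versions = [v for (n, v) in self._definitions if n == name]
        if not versions:
            raise KeyError(name)
        # Default to the latest version when none is requested.
        chosen = max(versions) if version is None else version
        return self._definitions[(name, chosen)]


registry = FeatureRegistry()
registry.register(
    FeatureDefinition(
        name="customer_activity_30d",
        version=1,
        description="Number of events in the last 30 days",
        owner="data-platform",
        compute=lambda customer: float(len(customer["events_30d"])),
    )
)

# The churn team and the recommendations team both look up the same definition.
churn_feature = registry.get("customer_activity_30d")
recs_feature = registry.get("customer_activity_30d")
```

Refusing to overwrite an existing version is the design choice that keeps results comparable: a model trained against version 1 can keep asking for version 1 after someone registers a changed version 2.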

Feature stores started to be actively discussed and adopted at large technology companies around 2017-2018. Uber, Airbnb, and others described their approaches publicly. In 2019 this is no longer relevant only for FAANG-scale companies - it matters for any company running more than one ML project.

How this connects to data management

A feature store is not a tool for a data scientist. It is an infrastructure decision that requires involvement at the data architecture level. You need to decide who owns features, who documents them, how they are versioned when computation logic changes, and how data is synchronised between training and production inference.

That last point is its own topic. The classic mistake: features for training are computed in batch mode on historical data, but in production the same logic runs on a stream, where data arrives with latency. The feature values the model sees in production differ from the ones it was trained on, and results are worse than expected.
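One common way to reduce this mismatch is to route both paths through a single feature function, so the batch job and the serving code cannot drift apart in logic. The sketch below assumes a hypothetical `activity_30d` feature and in-memory event lists; it does not address late-arriving data, only duplicated logic.

```python
from datetime import datetime, timedelta


def activity_30d(event_times, as_of):
    """Count events in the 30 days up to and including `as_of`."""
    start = as_of - timedelta(days=30)
    return sum(1 for t in event_times if start < t <= as_of)


def build_training_rows(history, label_times):
    """Batch path: one row per (customer, timestamp) pair from historical data."""
    return [
        {"customer": c, "as_of": t, "activity_30d": activity_30d(history[c], t)}
        for c, t in label_times
    ]


def serve_features(history, customer, now):
    """Online path: the same function, evaluated at request time."""
    return {"activity_30d": activity_30d(history[customer], now)}


# Hypothetical event history for one customer.
history = {
    "c1": [datetime(2019, 6, 1), datetime(2019, 7, 1), datetime(2019, 7, 20)]
}
moment = datetime(2019, 7, 24)

training_row = build_training_rows(history, [("c1", moment)])[0]
online_row = serve_features(history, "c1", moment)
```

A parity check of this kind (same customer, same timestamp, both paths) is cheap to run on a sample of production requests and catches the case where the two implementations have diverged.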

A practical check

If your company has more than one ML project, or you plan to launch the next one, a few questions are worth asking:

Where does the feature computation logic physically live in each project?
Are there features that are computed independently in more than one project?
Who can explain exactly how a specific feature in a production model is calculated?
Does the computation logic match between training time and live serving?

Having no answers to these questions is a signal that infrastructure debt is already accumulating. It is easier to address before the number of projects doubles again.
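The second question can be answered with a very simple inventory. The sketch below assumes you can list feature names per project by hand or from code review; the project and feature names are made up, and matching by name will miss features that are the same in substance but named differently.

```python
from collections import defaultdict


def shared_features(project_features):
    """Return features that appear in more than one project, with the projects."""
    owners = defaultdict(set)
    for project, names in project_features.items():
        for name in names:
            owners[name].add(project)
    return {
        name: sorted(projects)
        for name, projects in owners.items()
        if len(projects) > 1
    }


# Hypothetical inventory collected from three projects.
inventory = {
    "churn": {"customer_revenue", "activity_30d"},
    "recommendations": {"activity_30d", "basket_size"},
    "fraud": {"customer_revenue"},
}

duplicates = shared_features(inventory)
```

Every entry in `duplicates` is a candidate for a single shared definition, and a place to check whether the existing copies actually agree.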

If this resonated, write to me. I reply personally.