The gap between experiment and production: why ML models fail to reach production
Many ML projects show good results in experiments and then perform poorly in production. Here I look at why this happens.
There is a familiar scenario in companies starting to work with machine learning. The data team builds a model that performs well on test data. Everyone is satisfied. Then the rollout into the real environment begins - and something goes wrong. The model is slower than required. Or it degrades over time. Or it is hard to update. Or nobody understands what it outputs when it receives unusual input.
This gap between experiment and production is one of the main reasons ML projects fail to deliver expected value. And it is almost always predictable.
Why an experiment and production are different tasks
In an experiment the goal is to maximize prediction quality on a given dataset. This is a static task: the data is fixed, the metric is defined, the result is reproducible.
In production the task is different: the model must run continuously on live data that changes over time, integrated with other systems, under load, with quality monitoring, and with the ability to update without stopping the service.
These are different engineering challenges. The first requires data science. The second requires production systems engineering. For a long time these two competencies existed separately, and it was precisely this separation that gave rise to the discipline of MLOps.
What specifically breaks
Here are several typical problems I observe when a model moves from experiment to production.
Data drift. The data the model was trained on differs from the data it receives in the real environment. Seasonality, changes in user behaviour, new product categories - all of this makes the model less accurate over time without retraining.
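One way to make drift visible is to compare the distribution of a feature at training time with its distribution in live traffic. The sketch below uses the Population Stability Index (PSI), which is a common choice but only one of several. The function name `psi`, the bin count and the `eps` floor are my assumptions for illustration, and the code uses only the standard library.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], live: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width and derived from the reference sample only.
    Both samples are assumed to be non-empty.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # out-of-range values go to the edge bins
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


training_sample = [float(v) for v in range(100)]
live_sample = [v + 50.0 for v in training_sample]  # simulated shift in live data

print(psi(training_sample, training_sample))
print(psi(training_sample, live_sample))
```

A frequently cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold that should trigger retraining depends on the feature and the model, so treat it as a starting point rather than a rule.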
Absence of monitoring. In an experiment, model quality is known from the test set. In production there is often no automated system that signals when the model has started making more errors. Degradation is noticed by accident, or not at all.
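A minimal sketch of what such a signal could look like: a rolling error rate over the most recent predictions, compared against the error rate measured on the test set. It assumes that true outcomes eventually become available for at least part of the traffic, which is not the case for every model. The class name `QualityMonitor` and all default values are hypothetical.

```python
from collections import deque
from typing import Optional


class QualityMonitor:
    """Tracks the error rate over the most recent labelled predictions."""

    def __init__(self, baseline_error: float, tolerance: float = 0.05,
                 window: int = 500, min_samples: int = 50) -> None:
        self.baseline_error = baseline_error  # error rate measured on the test set
        self.tolerance = tolerance            # allowed increase before alerting
        self.min_samples = min_samples        # do not judge on too little data
        self._errors = deque(maxlen=window)   # old outcomes drop out automatically

    def record(self, prediction, actual) -> None:
        """Store whether one prediction turned out to be wrong."""
        self._errors.append(prediction != actual)

    def error_rate(self) -> Optional[float]:
        """Error rate over the window, or None while there is too little data."""
        if len(self._errors) < self.min_samples:
            return None
        return sum(self._errors) / len(self._errors)

    def degraded(self) -> bool:
        """True when the windowed error rate exceeds baseline plus tolerance."""
        rate = self.error_rate()
        return rate is not None and rate > self.baseline_error + self.tolerance


monitor = QualityMonitor(baseline_error=0.10, window=100, min_samples=10)
for i in range(40):
    monitor.record(prediction=1, actual=i % 2)  # every second prediction is wrong
if monitor.degraded():
    print("model quality has dropped below the agreed level")
```

In a real system the `print` would be replaced by whatever alerting channel the team already uses. The point is only that the comparison runs automatically instead of depending on someone noticing.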
Reproducibility. The experiment is hard to reproduce: library versions differ, data comes from different sources, parameters were never fixed. As a result, updating the model or rolling back to a previous version becomes a risky operation.
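One inexpensive habit that helps here is to store a small manifest next to every trained model: interpreter version, library versions, parameters, seed and a hash of the training data. The sketch below is one possible shape for it, not a replacement for a proper experiment tracker. The function names and the manifest fields are my assumptions; the calls themselves come from the Python standard library.

```python
import hashlib
import json
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional


def package_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Installed version of each package, or None if it is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(params: dict, data: bytes, seed: int,
                   packages: Iterable[str] = ()) -> dict:
    """Collect what is needed to rerun a training job under the same conditions."""
    return {
        "python": platform.python_version(),
        "packages": package_versions(packages),
        "seed": seed,
        "params": params,
        "data_sha256": hashlib.sha256(data).hexdigest(),
    }


def manifest_id(manifest: dict) -> str:
    """Short stable identifier: same inputs and environment give the same id."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


manifest = build_manifest(
    params={"learning_rate": 0.1, "max_depth": 6},  # illustrative values
    data=b"contents of the training file",
    seed=42,
    packages=["numpy"],
)
print(manifest_id(manifest))
```

If two model versions have different identifiers, the manifest shows what changed between them, which makes a rollback a deliberate decision rather than a guess.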
Integration with systems. The model exists separately from the business system that is supposed to use it. The integration is written hastily, does not handle errors, and there is no logging of requests and responses.
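A minimal sketch of the opposite of a hasty integration: every call to the model is logged with a request id, failures are caught, and the business system receives a fallback value instead of an exception. The function `predict_with_fallback`, the logger name and the shape of the returned dictionary are hypothetical; `model_call` stands in for whatever client the real model exposes.

```python
import logging
import time
import uuid
from typing import Any, Callable, Dict, Mapping

logger = logging.getLogger("model_client")


def predict_with_fallback(model_call: Callable[[Mapping[str, Any]], float],
                          features: Mapping[str, Any],
                          fallback_value: float) -> Dict[str, Any]:
    """Call the model, log request and response, and degrade gracefully on failure."""
    request_id = uuid.uuid4().hex
    started = time.perf_counter()
    try:
        value = model_call(features)
        source = "model"
    except Exception:
        # Log the full traceback, then answer with the agreed fallback value.
        logger.exception("request %s: model call failed, using fallback", request_id)
        value, source = fallback_value, "fallback"
    latency_ms = (time.perf_counter() - started) * 1000
    logger.info("request %s: features=%r value=%r source=%s latency_ms=%.1f",
                request_id, dict(features), value, source, latency_ms)
    return {"request_id": request_id, "value": value, "source": source}


def unavailable_model(features: Mapping[str, Any]) -> float:
    """Stand-in for a model service that is currently down."""
    raise RuntimeError("model service unavailable")


logging.basicConfig(level=logging.INFO)
result = predict_with_fallback(unavailable_model, {"basket_size": 3}, fallback_value=0.0)
print(result["source"])
```

Returning the `source` field lets the calling system, and later the monitoring, distinguish real predictions from fallback answers. What the fallback value should be is a business decision rather than a technical one.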
What this means for the manager
MLOps is not an additional expense. It is the condition for ML working in production at all.
A few questions worth asking the team before an ML project moves from experiment to planned rollout:
- How will we know if the model's quality has deteriorated after a month? After six months?
- How will model updates work: who performs them, how often, and what triggers them?
- What happens when the model receives data that differs significantly from what it was trained on?
- How does the integration with the business system handle model failures - is there a fallback?
- Who is the owner of the model in production and who is called first when something goes wrong?
If there are no answers, the project is not yet ready for rollout. A good experiment is a starting point, not a result.
The gap between experiment and production narrows every year: tools are improving and practices are settling. But the gap itself has not gone away. It needs to be planned for and built into the project from the beginning.