The gap between an ML experiment and a production system
Why machine learning in a notebook and machine learning in a running product are different tasks with different requirements.
Companies that start working with machine learning tend to follow a predictable path. A data specialist is hired or a contractor is brought in. An experiment is run: historical data is collected, a model is built, and the model performs well on a test set. Everyone is pleased.
Then comes the "now let's put this into production" phase, and it turns out that the distance from an experiment to a working system is considerably greater than it seemed.
I have seen this scenario often enough to treat it not as a mistake of a particular team, but as a systemic problem in how organisations understand ML projects.
How an experiment differs from a system
An ML experiment answers the question: "Can this task in principle be solved with ML, and how well?" That is a valuable question. The answer is a necessary but not sufficient condition for a product.
A production system answers a different set of questions:
- How will the model receive new data in real time or on a schedule?
- How will you track that prediction quality does not degrade over time?
- What happens when the model makes an error - who notices and what do they do?
- How do you update the model when data or requirements change?
- How does the system behave under load, when the data source is unavailable, or when input values are anomalous?
None of these questions arises during an experiment. All of them become pressing once the system is in operation.
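To make the last question in the list concrete, here is a minimal sketch of one possible answer to "what happens when the data source is unavailable". All names in it (predict_with_fallback, fetch_features, model_predict, fallback_value) are hypothetical and not taken from any particular library. It also assumes that a fixed default value is an acceptable answer for the business, which is a product decision rather than an engineering one.

```python
from typing import Callable, Dict, Tuple


def predict_with_fallback(
    fetch_features: Callable[[], Dict[str, float]],
    model_predict: Callable[[Dict[str, float]], float],
    fallback_value: float,
) -> Tuple[float, str]:
    """Return (prediction, source), where source names the path that produced it."""
    try:
        features = fetch_features()
    except (ConnectionError, TimeoutError):
        # The data source is down: return an agreed default instead of failing.
        return fallback_value, "fallback"
    return model_predict(features), "model"
```

Returning the source alongside the value lets the caller count how often the fallback path is taken, which is itself a useful operational signal.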
Three typical gaps
Data gap. The experiment ran on a historical export: clean, prepared, without missing values. In production, data comes from live systems: with delays, with fields that are occasionally missing, and in formats that change without notice. A model trained on "ideal" data may perform noticeably worse on a real stream.
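One way to narrow the data gap is to check every incoming record before it reaches the model. The sketch below is only an illustration: the field names and types in EXPECTED_FIELDS are invented for the example, and a real system would take them from its own data contract.

```python
import math
from typing import Any, Dict, List

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_FIELDS = {"age": float, "amount": float, "country": str}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the record is usable."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        value = record.get(name)
        if value is None:
            problems.append(f"missing field: {name}")
        elif expected_type is float:
            # Accept ints as numbers, but reject bools, strings and NaN.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"wrong type for {name}: {type(value).__name__}")
            elif math.isnan(value):
                problems.append(f"NaN value for {name}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems
```

Records that fail the check can be logged and routed to a fallback path rather than silently scored, so that a format change upstream shows up as a visible count of rejected records.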
Monitoring gap. During the experiment, model quality is measured once - on the test set. In production, data changes, user behaviour changes, business conditions change. A model that worked well six months ago can quietly degrade - and nobody will know until it becomes visible in business metrics.
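A deliberately crude sketch of what ongoing monitoring might look like: compare recent values of a single input feature (or of the model's output score) with a baseline sample kept from training time. The function names and the threshold of 0.5 standard deviations are assumptions made for illustration. Real monitoring usually tracks several features and uses more robust statistics.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; shift is undefined")
    return abs(mean(recent) - mean(baseline)) / spread


def has_drifted(
    baseline: Sequence[float],
    recent: Sequence[float],
    threshold: float = 0.5,  # assumed value; tune per feature
) -> bool:
    """Flag drift when the recent mean moves more than `threshold` deviations."""
    return mean_shift(baseline, recent) > threshold
```

Run on a schedule, even a check this simple turns "nobody will know" into an alert that reaches a named person before the problem shows up in business metrics.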
Ownership gap. The experiment is done - who is responsible for the system in production? The data specialist wrote the model but has no operational experience. Developers deployed the application but do not understand the model's logic. The ops team maintains the infrastructure but does not know what normal behaviour looks like for this system.
What this means for a product owner
An ML project cannot be judged by the results of an experiment alone. An experiment is research. A product is separate work.
When estimating budget and timelines for an ML project, I recommend making three phases explicit: research (can this work), engineering (how will this work in production), and operations (how will this be maintained and improved). In practice, only the first phase tends to be scoped.
Questions for assessing project readiness
If you are being shown ML experiment results and offered a production launch, a few questions help you gauge how thoroughly the engineering side has been thought through:
- How will the model receive data for predictions - from where and in what format?
- How will prediction quality be monitored after launch?
- Who will be responsible for the system if problems arise?
- What will happen if the data source is temporarily unavailable?
- When and how is model retraining planned?
If there are no confident answers, the project is still at the experiment stage and is not ready for launch.