ML in production: the gap between a pilot and a working system
Why machine learning pilots often fail to become production systems, and what to do differently from the very beginning.
Over the past couple of years I have kept seeing the same pattern. A company runs a machine learning pilot - predictive analytics, a recommendation system, a classifier. The pilot shows good results in the test environment. Management is impressed. A decision is made to scale.
Then the problems start. The model that predicted accurately on historical data performs worse on the live stream. The infrastructure for regular retraining was never built. The team that ran the pilot has moved on to the next project. Six months later the model has degraded and is quietly switched off.
Why a pilot is not a small production system
An ML pilot answers a specific research question: can we build a working model for this task, with this data, at this level of accuracy? That is an important question. But an answer of "yes" does not mean the system is ready for production operation.
In production, a model is not a finished product. It is a component in a system that must:
- receive fresh data regularly and in a predictable format;
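- detect quality degradation (data drift, where the input distribution shifts, and concept drift, where the relationship between inputs and the target changes);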
- retrain on a schedule or on a trigger;
- have versioning and the ability to roll back to a previous version;
- log predictions and outcomes for subsequent analysis;
- handle anomalies in input data without crashing.
None of that is needed in a pilot. All of it is needed in production.
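To make the drift-detection requirement concrete, here is a minimal sketch of one common check, the Population Stability Index (PSI), computed for a single numeric feature. It covers data drift only, not concept drift. The function name, the ten equal-width bins and the 0.25 alert level are my assumptions for illustration (0.25 is a widely used rule of thumb, not a universal constant), and a real system would run a check like this per feature on a schedule.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float], live: Sequence[float], bins: int = 10
) -> float:
    """PSI between a reference (training) sample and a live sample of one feature."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if the feature is constant.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Live values outside the reference range are clamped into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bin_shares(reference)
    live_shares = bin_shares(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares))


# Illustrative alert level (assumption): treat PSI above 0.25 as significant drift.
PSI_ALERT_LEVEL = 0.25
```

A PSI of zero means the two samples fall into the bins in identical proportions; the larger the value, the further the live data has moved from what the model was trained on.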
Where the gap comes from
The main source of the gap is that pilots and production systems are built for different goals, often by different people.
A data scientist in a pilot optimises a model metric. They work with Jupyter notebooks, static datasets, a local environment. Their goal is to show that a solution is possible.
Moving to production requires a different mindset and different skills: data engineering, DevOps, monitoring, ongoing support. In companies that did not think about this in advance, the data scientist's notebook becomes the "production system" - without version control, without monitoring, without process.
What to design from the beginning
If you are running an ML pilot with the intention of taking it to production, a few things are worth deciding at the outset:
Data infrastructure. Where does the data for training and for predictions come from? Is this an automated pipeline or a manual export? What will this process look like a year from now?
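One way to pin down "a predictable format" is to write the expected schema down and check every incoming record against it before the record reaches the model. The sketch below is a simplified illustration: the field names, types and ranges in SCHEMA are hypothetical, and a real pipeline would more likely use a dedicated validation library.

```python
from typing import Any

# Hypothetical schema (assumption): field name -> (expected type, min, max).
SCHEMA = {
    "age": (int, 0, 120),
    "monthly_spend": (float, 0.0, 1_000_000.0),
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field, (expected_type, lo, hi) in SCHEMA.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
            continue
        # NaN fails every comparison, so it is reported as out of range too.
        if not lo <= value <= hi:
            problems.append(f"{field}: {value} outside [{lo}, {hi}]")
    return problems
```

Returning the problems instead of raising lets the caller decide what to do with a bad record: skip it, send it to a quarantine table, or fall back to a default prediction. That is what handling anomalies without crashing means in practice.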
Ownership of model quality. Who monitors whether the model continues to work correctly? What quality metrics are tracked, and by whom?
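As a rough sketch of what "tracked" can mean here: once real outcomes arrive, they can be compared with the logged predictions over a rolling window. The class name, window size and accuracy floor below are assumptions chosen for illustration, and accuracy stands in for whatever metric actually matters for the task.

```python
from collections import deque
from typing import Any, Optional


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent predictions whose outcomes are known."""

    def __init__(self, window: int = 500, min_accuracy: float = 0.85, min_samples: int = 50):
        self.results = deque(maxlen=window)  # True for a correct prediction, False otherwise
        self.min_accuracy = min_accuracy
        self.min_samples = min_samples

    def record(self, predicted: Any, actual: Any) -> None:
        self.results.append(predicted == actual)

    def accuracy(self) -> Optional[float]:
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def is_degraded(self) -> bool:
        # Stay silent until there is enough evidence to avoid noisy alerts.
        if len(self.results) < self.min_samples:
            return False
        return self.accuracy() < self.min_accuracy
```

The code is the easy part. The ownership question is who receives the alert when is_degraded() turns true, and what they are expected to do about it.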
Retraining process. How often, by whom, and under what conditions is the model updated? Who makes the decision to deploy a new version?
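These questions can be answered in code as well as in a process document. The sketch below is an in-memory illustration with hypothetical names: a retraining trigger that fires on model age or on a drift score, and a registry that records who approved each promotion and can roll back to the previous version. The 30-day and 0.25 thresholds are placeholders, and a production setup would persist this state in a model registry service and not in a Python object.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


def should_retrain(
    days_since_training: int,
    drift_score: float,
    max_age_days: int = 30,         # placeholder schedule (assumption)
    drift_threshold: float = 0.25,  # placeholder trigger level (assumption)
) -> bool:
    """Retrain on a schedule or on a trigger, whichever comes first."""
    return days_since_training >= max_age_days or drift_score > drift_threshold


@dataclass
class ModelRegistry:
    versions: dict = field(default_factory=dict)  # version -> model artifact
    history: list = field(default_factory=list)   # (version, approved_by); last entry is live

    def register(self, version: str, model: Any) -> None:
        if version in self.versions:
            raise ValueError(f"version {version} already registered")
        self.versions[version] = model

    def promote(self, version: str, approved_by: str) -> None:
        if version not in self.versions:
            raise KeyError(version)
        self.history.append((version, approved_by))

    @property
    def live_version(self) -> Optional[str]:
        return self.history[-1][0] if self.history else None

    def rollback(self) -> str:
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.history.pop()
        return self.live_version
```

Requiring an approved_by value on every promotion forces the question "who makes the decision to deploy" to have an answer each time, and it leaves a trail to read when something goes wrong.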
Integration. How do model results enter business processes? Who acts on the predictions, and how?
Questions before starting a pilot
- Do we have a plan for who will maintain this system if the pilot succeeds?
- Do we have infrastructure for automated data delivery to the model, or are we expecting manual exports?
- Who will be the business owner of the system - not IT, but the business side?
- How will we know when the model has degraded - what is the signal, and who will see it?
- What is the cost of a model error in this specific application, and how do we handle bad predictions?
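The last question becomes easier to discuss once the costs are written down as numbers. Here is a minimal sketch for a binary classifier, assuming the model outputs a probability and that the costs of a false positive and a false negative can be estimated. The function name, the "review" band and its margin are hypothetical. Acting is cheaper in expectation when (1 - p) * cost_fp < p * cost_fn, which gives the threshold p > cost_fp / (cost_fp + cost_fn).

```python
def decide(
    p: float,
    cost_false_positive: float,
    cost_false_negative: float,
    review_margin: float = 0.1,  # width of the manual-review band (assumption)
) -> str:
    """Return "act", "skip" or "review" for a predicted probability p of the positive class."""
    # Threshold at which the expected cost of acting equals the expected cost of skipping.
    threshold = cost_false_positive / (cost_false_positive + cost_false_negative)
    # Predictions too close to the threshold go to a human instead of being automated.
    if abs(p - threshold) < review_margin:
        return "review"
    return "act" if p > threshold else "skip"


# A missed positive costs nine times more than a false alarm,
# so the threshold is 10 / (10 + 90) = 0.1 and even p = 0.3 is worth acting on.
print(decide(0.3, cost_false_positive=10, cost_false_negative=90))  # prints "act"
```

When the two error costs are far apart, the right threshold is nowhere near 0.5, and that is a business decision, not a modelling one.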
A well-designed pilot is not just a good model. It is a pilot that answers all of these questions before the decision to scale is made.