When a good model goes bad: drift, detection, and business cost
A model that passed every test at launch can quietly degrade over months. Understanding why helps you decide how much monitoring is worth the investment.
A client came to me last year with a problem I have seen more than once. Their demand forecasting model had performed well at launch, confirmed by back-tests and a three-month live evaluation. A year later, forecast quality had degraded significantly. The business impact was real: excess inventory in some categories, stockouts in others. Nobody had noticed the gradual change because nothing was monitoring the model's output.
This is model drift. It is common, it is slow enough to be invisible without deliberate tracking, and the damage it causes is preventable.
Two types of drift
The first type is data drift. The distribution of input features changes over time. A demand model trained on 2019 purchasing patterns has never seen the patterns that emerged during a health crisis. A credit model trained before a recession is looking at a different population of applicants afterward. The model logic has not changed. The world has.
The second type is concept drift. The relationship between inputs and the target variable itself changes. What predicted customer churn in 2019 may not predict it in 2021 - not because the customer data looks different, but because the underlying dynamics have shifted. Price sensitivity, competitive alternatives, product changes - the factors that drive the outcome have changed in weight or direction.
Both types are gradual in most cases, which makes them easy to miss.
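For readers who prefer notation, the distinction can be written down roughly as follows. The symbols are assumptions introduced here, not part of the original discussion: X stands for the input features, Y for the target, P_0 for the distribution at training time, and P_t for the distribution at some later time t.

```latex
\begin{align*}
\text{Data drift:}    \quad & P_t(X) \neq P_0(X), \qquad P_t(Y \mid X) = P_0(Y \mid X) \\
\text{Concept drift:} \quad & P_t(Y \mid X) \neq P_0(Y \mid X)
\end{align*}
```

The two are not mutually exclusive: the inputs and the input-to-outcome relationship can shift at the same time.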
Why drift is invisible without monitoring
A production ML model, unlike a server, does not throw errors when its performance degrades. It keeps producing outputs. The outputs look plausible. The system keeps running.
The first signal that something is wrong often comes from the business side: someone notices the forecasts feel off, a report shows an unusual pattern, a manager asks why the numbers seem inconsistent with reality. By that point the model may have been underperforming for months.
What monitoring requires in practice
Catching drift requires two things:
- Tracking input distributions over time. If the statistical profile of the data you feed the model has shifted significantly from the training data profile, that is an early warning.
- Measuring outcome quality against ground truth. This requires having actual results available with some lag - the actual sales figures, the actual churn events, the actual defaults - to compare against the model's predictions.
For many business use cases the lag is short enough to make this practical. A demand forecast can be evaluated against actual sales within days. A fraud model can be evaluated against confirmed fraud cases within weeks.
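As a rough sketch of the second requirement, the snippet below matches logged forecasts to whatever actuals have arrived so far and computes the mean absolute error. The SKU keys, dates, and figures are invented for illustration. Mean absolute error is only one possible quality metric, and the code assumes predictions and actuals share a common key.

```python
from datetime import date


def mean_absolute_error(predictions, actuals):
    """Average absolute gap between predictions and actuals, matched by key.

    Predictions whose actual has not arrived yet (ground truth lags) are skipped.
    Returns None when there is nothing to evaluate.
    """
    errors = [abs(predictions[k] - actuals[k]) for k in predictions if k in actuals]
    if not errors:
        return None
    return sum(errors) / len(errors)


# Hypothetical logged forecasts: (sku, day) -> forecast units
predictions = {
    ("sku-1", date(2024, 3, 4)): 120.0,
    ("sku-1", date(2024, 3, 5)): 130.0,
    ("sku-2", date(2024, 3, 4)): 40.0,
    ("sku-2", date(2024, 3, 5)): 45.0,  # actual sales not reported yet
}

# Hypothetical actual sales that have arrived so far
actuals = {
    ("sku-1", date(2024, 3, 4)): 110.0,
    ("sku-1", date(2024, 3, 5)): 135.0,
    ("sku-2", date(2024, 3, 4)): 52.0,
}

mae = mean_absolute_error(predictions, actuals)
evaluated = sum(1 for k in predictions if k in actuals)
print(f"MAE over {evaluated} of {len(predictions)} forecasts: {mae}")
```

Here the fourth forecast is skipped because its actual has not arrived, and the error over the other three works out to 9.0 units.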
A minimum viable monitoring setup
You do not need a sophisticated MLOps platform to start. A practical minimum for a model running in production:
- Log every prediction with its input features and a timestamp.
- Compare input feature distributions against the training baseline every week. A simple statistical distance measure is enough to flag large shifts.
- Evaluate prediction quality against ground truth every month, with a defined threshold that triggers a review.
- Name a person responsible for reviewing the results. Automated alerts with no human accountable for acting on them do not close the loop.
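The logging and distribution-comparison steps could be sketched as below, using only the Python standard library. The Population Stability Index (PSI) is one example of a simple statistical distance, chosen here for illustration and not prescribed by anything above. The function names, the JSON Lines record layout, and the sample data are all hypothetical.

```python
import io
import json
import math
from datetime import datetime, timezone


def log_prediction(stream, features, prediction):
    """Write one prediction as a JSON line: UTC timestamp, inputs, output."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "prediction": prediction,
    }
    stream.write(json.dumps(record) + "\n")


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the baseline's range. Current values outside
    that range are clamped into the first or last bin. Empty bins are floored
    at eps so the logarithm stays defined.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant baseline: everything lands in edge bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    # PSI = sum over bins of (q - p) * ln(q / p); every term is >= 0
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Logging demo, with an in-memory buffer standing in for a log file
buf = io.StringIO()
log_prediction(buf, {"price": 9.99, "promo": 0}, 120.0)
record = json.loads(buf.getvalue())

# Drift demo: a made-up feature whose values shift upward by 3 units
baseline = [i / 100 for i in range(1000)]
shifted = [v + 3 for v in baseline]
stable = psi(baseline, baseline)
drifted = psi(baseline, shifted)
print(f"PSI unchanged: {stable:.3f}, PSI shifted: {drifted:.3f}")
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a large shift, but the threshold that should trigger a review is a judgment call for each feature and business.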
The retraining question
There is no universal cadence. How often to retrain depends on how fast the underlying world changes. A model predicting equipment failure in a stable industrial environment may hold for years. A model predicting consumer behaviour in a rapidly shifting market may need retraining every quarter.
The honest answer is: monitor the performance, and let the data tell you when it is time. Picking a fixed schedule upfront is a guess. A monitored schedule is a response to evidence.
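One way to turn that into a concrete rule, as a sketch: flag the model for retraining when its monthly error has stayed above the launch baseline by a set margin for several evaluations in a row. The 20% tolerance and the two-evaluation patience are placeholder values, not recommendations, and the error figures are invented.

```python
def needs_retraining(monthly_errors, baseline_error, tolerance=0.20, patience=2):
    """Return True when the last `patience` evaluations all exceed
    baseline_error * (1 + tolerance).

    Requiring several consecutive breaches avoids reacting to one noisy month.
    """
    if len(monthly_errors) < patience:
        return False
    limit = baseline_error * (1 + tolerance)
    return all(e > limit for e in monthly_errors[-patience:])


# Hypothetical monthly error history, oldest first; baseline measured at launch
baseline_error = 9.0
steady_decline = [9.0, 9.4, 10.1, 11.5, 12.2]
single_spike = [9.0, 9.4, 11.5, 10.1]

retrain_now = needs_retraining(steady_decline, baseline_error)
retrain_on_spike = needs_retraining(single_spike, baseline_error)
print(retrain_now, retrain_on_spike)
```

With these numbers the limit is 10.8, so the sustained decline triggers a retrain while the single bad month does not.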