Narrow AI in production: where the line falls between a pilot and a working system
Why most AI pilots never reach production, and what it actually takes for a model to work in real conditions rather than just in a demo.
Interest in applied AI among executives has grown sharply in recent years. Companies launch pilots for image recognition, churn prediction, and automatic classification of support tickets. The pilots show promising results. Then most of them stop.
This stall is often called the "valley of death" for AI projects. I have seen it often enough to recognise exactly where the boundary lies.
Why pilots look better than the systems that follow
A pilot is a controlled environment. Data is prepared, conditions are fixed, metrics are chosen to highlight strengths. The team is motivated and working on the problem deliberately.
Production is different. Data arrives in unexpected formats, conditions change, users behave differently than anticipated. The system has to work not once on a clean dataset, but continuously, under load, without constant attention from data scientists.
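The gap between accuracy on a test set and accuracy in real conditions is the most common source of disappointment: the test set is usually carved from the same prepared data as the training set, so it says little about inputs the pilot never saw.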
What is needed beyond a good model
A good model is necessary but not sufficient. To reach production you also need:
Quality monitoring. Models degrade over time as the data changes and the distribution of inputs shifts. You need a process that notices this and responds. Without it, you learn about the problem from users rather than from the system.
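One way to make "notices this" concrete is to compare the distribution of a single input feature at training time with what the system sees now. The sketch below uses the Population Stability Index for that. It is a minimal illustration, not a complete monitoring setup: the function names, the ten equal-width bins and the 0.2 alert threshold (a common rule of thumb) are all assumptions made for this example.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of one numeric feature; 0.0 means identical histograms."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")

    # Bin edges come from the reference (training-time) sample only.
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant feature

    def bin_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Live values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    live_shares = bin_shares(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares))


def drift_alert(reference, live, threshold: float = 0.2) -> bool:
    """Return True when the shift is large enough to warrant a human look.

    The 0.2 default is an assumed rule of thumb, not a universal constant.
    """
    return population_stability_index(reference, live) > threshold
```

Run on a schedule against recent inputs, a check like this gives the system a chance to raise the alarm before users do. It only watches inputs, so it complements rather than replaces tracking of the model's actual accuracy once true labels arrive.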
Version management. When the model is updated, you need to know: which version is running now, what changed, and whether you can roll back. This seems obvious, but most pilots do not account for it.
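The three questions above (what is running, what changed, can we roll back) can be answered by even a very small registry. The following is a sketch under stated assumptions: `ModelRegistry` and its methods are hypothetical names invented for this example, and everything is kept in memory, whereas a real setup would persist this information or use a dedicated model registry tool.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory record of model versions and deployments."""

    notes: dict[str, str] = field(default_factory=dict)      # version -> what changed
    deployments: list[str] = field(default_factory=list)     # deployment order, newest last

    def register(self, version: str, note: str) -> None:
        """Record a new version together with a short description of the change."""
        if version in self.notes:
            raise ValueError(f"version {version!r} is already registered")
        self.notes[version] = note

    def deploy(self, version: str) -> None:
        """Mark a registered version as the one serving traffic."""
        if version not in self.notes:
            raise KeyError(f"unknown version {version!r}")
        self.deployments.append(version)

    @property
    def current(self) -> str:
        """Answer 'which version is running now?'."""
        if not self.deployments:
            raise RuntimeError("nothing has been deployed yet")
        return self.deployments[-1]

    def rollback(self) -> str:
        """Return to the previously deployed version and report which one it is."""
        if len(self.deployments) < 2:
            raise RuntimeError("no earlier deployment to roll back to")
        self.deployments.pop()
        return self.current
```

For example, after registering and deploying "v1" and then "v2", `registry.current` is "v2", `registry.notes["v2"]` says what changed, and `registry.rollback()` puts "v1" back in service.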
Failure handling. What happens when the model is not confident in its answer? What if inputs arrive that differ significantly from the training data? The system needs to be able to acknowledge uncertainty and hand off to a human rather than give a confident wrong answer.
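For a classifier, the simplest form of this handoff is a confidence threshold. The sketch below assumes the model returns a probability per class; `Decision`, `route_prediction` and the 0.8 threshold are hypothetical choices for illustration, and the threshold in particular would need tuning on real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    label: Optional[str]   # None when the case is handed to a human
    confidence: float      # probability of the top class
    needs_human: bool


def route_prediction(probabilities: dict[str, float], threshold: float = 0.8) -> Decision:
    """Accept the top class only if its probability clears the threshold."""
    if not probabilities:
        raise ValueError("probabilities must contain at least one class")
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        # Low confidence: do not answer automatically, send to manual review.
        return Decision(label=None, confidence=confidence, needs_human=True)
    return Decision(label=label, confidence=confidence, needs_human=False)
```

With a support ticket scored as `{"billing": 0.55, "technical": 0.45}`, the function returns a decision flagged for a human instead of a label. One caveat: a model can report high probability on inputs unlike anything it was trained on, so a threshold like this covers the first question above better than the second.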
Inference infrastructure. Speed, cost, reliability. A model that works well on a researcher's GPU may prove too slow or too expensive once it has to serve production traffic.
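Latency is the easiest of these to measure before launch. The sketch below times a prediction function over a batch of inputs and reports percentiles rather than an average, since slow outliers are what users notice. `measure_latency`, `percentile` and the `predict` callable are assumed names for this example, and the nearest-rank percentile method is one simple choice among several.

```python
import math
import time
from typing import Any, Callable, Sequence


def percentile(sorted_samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]


def measure_latency(
    predict: Callable[[Any], Any],
    inputs: Sequence[Any],
    warmup: int = 5,
) -> dict[str, float]:
    """Time predict() on each input and return p50/p95/p99 in milliseconds."""
    if not inputs:
        raise ValueError("inputs must be non-empty")

    # Warm-up calls are not timed, so one-off start-up costs do not skew results.
    for x in inputs[:warmup]:
        predict(x)

    samples = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    return {
        "p50_ms": percentile(samples, 50),
        "p95_ms": percentile(samples, 95),
        "p99_ms": percentile(samples, 99),
    }
```

Running this on the target production hardware, with inputs that resemble real traffic, gives a more honest picture than timings taken on the development machine. It measures one request at a time, so behaviour under concurrent load still needs a separate test.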
The organisational gap
Technical questions are not the only ones. Equally important: who owns the system after the pilot ends? The data scientist who built it is usually not the right answer, because they will have moved on to the next project.
Someone needs to monitor quality, respond to degradation, and decide when to retrain. Most companies do not have this role. The pilot ends, the system seems to be working, and then it gradually degrades with no one watching.
Checklist before going to production
- How will we know when the model starts performing worse? Are there metrics and alerts?
- Who is responsible for the system after launch?
- How does the system handle cases where confidence in the answer is low?
- Can we roll back a model update if it made results worse?
- Has the system been tested under load and on edge cases, not just on "normal" inputs?
If there are no answers to these questions, the system is not ready for production, even if the test-set metric looks excellent.