AI November 6, 2019 3 min read

NLP in production: the gap between a demo and a working system

Language models in 2019 deliver impressive demonstrations, but the road from a demo to a working product is much longer than it looks.

In autumn 2019, Google announced that it is using BERT to improve search ranking. This is a landmark event: the world's largest search engine is applying a next-generation language model in a production system. In parallel, a steady stream of research papers reports state-of-the-art results on text comprehension, classification, and question-answering tasks.

Against this backdrop, I see many executives asking: "Can we do something similar?" The question usually becomes more specific: "Let's build smart search over our documents" or "Let's automate the handling of incoming requests."

The ideas are right. But between what a research paper demonstrates and what works in a real system there is a large gap that rarely gets discussed openly.

Where the gap comes from

Research results are measured on standard benchmarks: datasets specifically created to evaluate NLP tasks. That data is clean, labelled, in English, and it represents a specific, carefully defined type of task.

A real system works on your data. And your data consists of emails written in ten different styles, with errors, abbreviations, and mixed languages; documents that were scanned from paper through poor OCR; queries phrased the way people in your company actually think, not the way they are formulated in academic datasets.

A model that shows 95% accuracy on a benchmark may reach only 60% on your data, and that does not mean the model is bad. It means the benchmark and your problem are different problems.
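The effect is easy to reproduce on a toy scale. The sketch below is only an illustration: `toy_predict` is a hypothetical keyword rule standing in for a trained model, and both example sets are invented. The point is that the same evaluation function gives very different numbers on clean and on messy inputs.

```python
def accuracy(predict, examples):
    """Share of (text, label) pairs where predict(text) equals label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def toy_predict(text):
    """Hypothetical stand-in for a trained classifier: a single keyword rule."""
    return "billing" if "invoice" in text.lower() else "other"


# Invented, benchmark-style inputs: clean and unambiguous.
benchmark_like = [
    ("Please resend the invoice for March.", "billing"),
    ("The invoice total looks incorrect.", "billing"),
    ("How do I reset my password?", "other"),
    ("Where is your office located?", "other"),
]

# Invented, in-house-style inputs: abbreviations, typos, misleading keywords.
in_house = [
    ("pls resend inv. for march asap", "billing"),
    ("inovice total wrong??", "billing"),
    ("Invoice attached, FYI only, no action needed", "other"),
    ("pw reset not working", "other"),
]

benchmark_accuracy = accuracy(toy_predict, benchmark_like)
in_house_accuracy = accuracy(toy_predict, in_house)
print(benchmark_accuracy, in_house_accuracy)
```

The rule is not "bad"; it was simply never confronted with the way the second group of people writes.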

What a real system requires

First: data for fine-tuning or evaluation. An academic model needs to be adapted to your domain. That requires labelled examples from your subject area. Those need to be collected, labelled, and reviewed. That is manual work, and it takes time.

Second: a precise definition of the task. What exactly should the system do? "Smart search" is not a task. "A system that takes an incoming customer request in natural language and returns the three most relevant paragraphs from the knowledge base" is a task - one for which you can define a metric and verify the result.
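Once the task is phrased this way, a metric follows almost directly. A minimal sketch, assuming you have a small test set where each request is paired with the paragraph ids a reviewer marked as relevant: count how often at least one relevant paragraph appears among the top three returned. The function names, the paragraph ids, and the choice of this particular metric are assumptions for illustration.

```python
def hit_at_k(ranked_ids, relevant_ids, k=3):
    """True if any of the top-k returned paragraph ids is marked relevant."""
    return any(pid in relevant_ids for pid in ranked_ids[:k])


def hit_rate_at_k(results, k=3):
    """results: list of (ranked_ids, relevant_ids) pairs, one per test request."""
    hits = sum(hit_at_k(ranked, relevant, k) for ranked, relevant in results)
    return hits / len(results)


# Invented evaluation data: system ranking vs. reviewer-labelled paragraphs.
results = [
    (["p7", "p2", "p9", "p1"], {"p2"}),        # relevant paragraph at rank 2
    (["p4", "p5", "p6", "p3"], {"p3"}),        # relevant paragraph only at rank 4
    (["p1", "p8", "p2"], {"p2", "p5"}),        # relevant paragraph at rank 3
    (["p6", "p1", "p4"], {"p9"}),              # nothing relevant returned
]

print(hit_rate_at_k(results, k=3))
```

A number like this can be tracked from week to week, which "smart search" as a goal never allows.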

Third: a threshold and error-handling logic. At what confidence level does the model's result get shown to the user? What happens when the model does not know? How does the system behave on unusual inputs? These are engineering questions that require as much work as the model itself.
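In code, these decisions tend to look like a thin routing layer around the model. The sketch below is one possible shape, not a recommended design: the threshold of 0.8, the length limit, the action names, and the `toy_predict` stand-in are all assumptions chosen for illustration.

```python
def route(text, predict, threshold=0.8, max_chars=5000):
    """Decide what to do with one request. Returns (action, answer)."""
    # Unusual inputs are handled before the model is ever called.
    if not text or not text.strip():
        return ("reject", None)
    if len(text) > max_chars:
        return ("human_review", None)
    try:
        answer, confidence = predict(text)
    except Exception:
        # A model failure should degrade to a person, not to an error page.
        return ("human_review", None)
    if confidence >= threshold:
        return ("show", answer)
    # "The model does not know": below the threshold nothing is shown.
    return ("human_review", None)


def toy_predict(text):
    """Hypothetical model: returns (answer, confidence)."""
    if "password" in text.lower():
        return ("Use the reset link on the login page.", 0.93)
    return ("Unclear request.", 0.41)


print(route("How do I reset my password?", toy_predict))
print(route("Something odd happened last Tuesday", toy_predict))
print(route("   ", toy_predict))
```

Choosing the threshold itself is a business decision: it trades the share of requests handled automatically against the share of wrong answers shown to users.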

Fourth: monitoring in production. A model that worked well at launch can degrade if the input data shifts. You need to understand how to detect this.
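One simple signal, sketched below under stated assumptions: store the model's confidence scores from the launch period and compare their average with a recent window. The tolerance of 0.10 and the sample values are invented, and average confidence is only a proxy; periodically labelling a sample of real traffic and measuring actual accuracy is the more reliable check.

```python
from statistics import mean


def confidence_drop(baseline_scores, recent_scores, tolerance=0.10):
    """Compare mean confidence at launch with a recent window.

    Returns (alert, drop), where alert is True if the mean has fallen
    by more than the tolerance.
    """
    drop = mean(baseline_scores) - mean(recent_scores)
    return drop > tolerance, drop


# Invented confidence scores: launch week vs. a later week.
baseline = [0.91, 0.88, 0.93, 0.90]
recent = [0.72, 0.65, 0.80, 0.70]

alert, drop = confidence_drop(baseline, recent)
print(alert, round(drop, 4))
```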

A realistic view of the timeline

None of this means NLP tasks inside a company are unachievable. It means the path to them needs to be planned honestly.

A good NLP pilot, one that actually works and delivers a measurable result, takes three to six months with the right competencies on the team. It does not take a week of "playing with the model."

Good questions before starting:

Can we clearly describe the task in a way that allows us to measure the quality of the result?
Do we have at least a few hundred labelled examples - or are we prepared to create them?
Is there someone on the team who understands the difference between benchmark accuracy and performance on real data?
What happens when the system makes a mistake - what is the handling process?

The answers to these questions determine whether the pilot becomes a working system or a demonstration that is quietly shelved.

If this resonated, write to me. I reply personally.