AI October 16, 2018 3 min read

BERT and the new baseline for applied NLP

What the BERT model changes in the practical use of text processing, and why it matters for companies working with unstructured data.

In early October 2018 Google published a paper on BERT (Bidirectional Encoder Representations from Transformers). The results came in well above the previous best scores on several standard NLP benchmarks at once, covering question answering, natural language inference, and sentence-pair classification.

For the academic community this is another important step in the development of language models. For people thinking about practical applications of NLP in business, it is a change to the baseline.

What changed technically

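The core idea of BERT is pre-training on a large corpus of unlabelled text with bidirectional context, then fine-tuning for a specific task. Previous approaches either read text in one direction only (left-to-right or right-to-left), combined two such one-directional models, or used smaller volumes of data for pre-training. BERT instead hides a fraction of the words in each input and learns to predict them from the words on both sides.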

Bidirectionality matters: the word "bank" in "river bank" and "bank account" carries different meaning. A model that sees the full context simultaneously handles this much better.
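To make the difference concrete, here is a toy sketch of which neighbouring words are available when a model builds its representation of "bank". It is only an illustration of context visibility, not how BERT is implemented; the function name `visible_context` and the two example sentences are made up for this post.

```python
def visible_context(tokens, position, mode):
    """Return the tokens a model may use when representing tokens[position]."""
    left = tokens[:position]
    right = tokens[position + 1:]
    if mode == "left_to_right":
        return left
    if mode == "right_to_left":
        return right
    if mode == "bidirectional":
        return left + right
    raise ValueError(f"unknown mode: {mode}")


# Two made-up sentences built around the "bank" example.
sentences = [
    "the river bank was muddy".split(),
    "the bank account was empty".split(),
]

for tokens in sentences:
    target = tokens.index("bank")
    for mode in ("left_to_right", "right_to_left", "bidirectional"):
        # Print the words each reading direction can draw on for "bank".
        print(" ".join(tokens), "|", mode, "|", visible_context(tokens, target, mode))
```

Reading left-to-right, the model has "river" for the first sentence but nothing useful for the second; reading right-to-left, it is the other way round. Only the bidirectional view has the disambiguating word in both cases.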

Here is the practically important part: Google released the weights of the pre-trained model. This means a team can take a model that already "understands" language at a deep level and fine-tune it on their own data for a specific task, without spending resources on pre-training from scratch on billions of words of text.
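The division of labour can be sketched without any real model. In the toy below, `encode` is a hypothetical stand-in for a pre-trained encoder (simple keyword counts, nothing like BERT's learned vectors), and only a small classification head is trained on a handful of invented labelled tickets. This is a simplification: real fine-tuning also updates the encoder's own weights, whereas this sketch keeps them frozen.

```python
import math

# Hypothetical stand-in for a pre-trained encoder: fixed, never updated here.
# A real encoder such as BERT returns a learned vector, not keyword counts.
PRETRAINED_VOCAB = ["refund", "invoice", "password", "login"]


def encode(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in PRETRAINED_VOCAB]


def predict(weights, bias, features):
    """Probability that the text belongs to class 1 (logistic regression)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, epochs=200, lr=0.5):
    """Fit a logistic-regression head on top of the frozen encoder."""
    weights = [0.0] * len(PRETRAINED_VOCAB)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = encode(text)
            error = predict(weights, bias, features) - label
            # Gradient step on the log loss; only the head's parameters move.
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Tiny invented labelled set: 1 = billing, 0 = account access.
labelled = [
    ("please refund my last invoice", 1),
    ("invoice shows the wrong amount", 1),
    ("i forgot my password", 0),
    ("login fails after password reset", 0),
]

weights, bias = train_head(labelled)
print("billing probability:", predict(weights, bias, encode("need a refund")))
print("billing probability:", predict(weights, bias, encode("cannot login")))
```

The expensive part (the encoder) is reused as-is; the task-specific part is small and trains in moments on a few labelled examples. That asymmetry is what released pre-trained weights buy a team.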

What this means for applied tasks

A few months ago, the accuracy that BERT reaches on basic NLP tasks after a short fine-tuning run required serious custom development. Now the baseline has shifted: what used to be a ceiling has become the starting point.

For companies that work with text - customer support, analysis of incoming requests, document classification, information extraction - this opens possibilities that previously required either large teams or compromises on quality.

Concrete tasks where this applies now:

automatic classification of incoming support tickets by topic and sentiment;
extracting structured information from unstructured documents;
semantic search - finding not exact word matches but meaning;
question answering over a corpus of internal documents.
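As an illustration of the semantic search item, here is a minimal sketch of ranking documents by the cosine similarity of their vectors rather than by shared words. The three-dimensional vectors are invented for the example; in a real system they would come from a model such as BERT and have hundreds of dimensions. The names `cosine_similarity` and `search` are mine, not part of any library.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Rank documents by cosine similarity to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), doc) for doc, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for sentence embeddings.
index = {
    "How to reset a forgotten password": [0.9, 0.1, 0.0],
    "Refund policy for annual plans": [0.0, 0.2, 0.9],
    "Recovering access to a locked account": [0.8, 0.3, 0.1],
}

# Hypothetical embedding of the query "I can't sign in".
query = [0.85, 0.2, 0.05]

for score, doc in search(query, index):
    print(f"{score:.3f}  {doc}")
```

The query shares no words with either access-related document, yet both rank above the refund policy because their vectors point in a similar direction. Keyword search would have returned nothing.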

What stands between the paper and deployment

A new model is not a ready-made solution. Between what is published in the paper and a working system inside a company, there are several steps.

First, labelled data is needed for fine-tuning on the specific task. The quality of that data directly affects the result.

Second, BERT is a large model by 2018 standards. The base version has 110 million parameters. This creates computational requirements for fine-tuning and inference. On CPU it is slow; on GPU significantly faster - but that means additional infrastructure decisions.
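A back-of-the-envelope calculation shows where the 110 million figure comes from and what it means for memory. The configuration values below (vocabulary size, hidden size, number of layers and so on) are assumptions taken from the commonly cited base configuration, not from this post, and the count leaves out any task-specific output layer.

```python
# Assumed BERT-base configuration (not stated in this post).
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
SEGMENT_TYPES = 2
HIDDEN = 768
FEED_FORWARD = 3072
LAYERS = 12


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Scale and shift vectors of a layer normalisation."""
    return 2 * size


# Token, position and segment embedding tables, followed by one layer norm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + SEGMENT_TYPES) * HIDDEN + layer_norm(HIDDEN)

per_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value and attention output projections
    + layer_norm(HIDDEN)
    + linear(HIDDEN, FEED_FORWARD)    # feed-forward expansion
    + linear(FEED_FORWARD, HIDDEN)    # feed-forward projection back
    + layer_norm(HIDDEN)
)

pooler = linear(HIDDEN, HIDDEN)

total_params = embeddings + LAYERS * per_layer + pooler
fp32_megabytes = total_params * 4 / 1e6  # 4 bytes per 32-bit float

print(f"{total_params:,} parameters, about {fp32_megabytes:.0f} MB in 32-bit floats")
```

That is the footprint of the weights alone. Fine-tuning also has to hold gradients, optimiser state and intermediate activations, so the memory actually needed is several times higher, which is where the infrastructure decisions start.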

Third, even with a good model, the organisational questions remain: who evaluates the quality of results, how the system fits into the existing process, and what happens with errors.

The practical conclusion

For a manager, the right question now is not "do we need BERT". The right question is "do we have text-processing tasks where quality is a genuine bottleneck".

If such tasks exist and have been deferred because of high cost or insufficient accuracy, now is the moment to revisit that assessment.

The technology will not do everything by itself. But what is now possible has just moved up a level.

Contact

If this resonated, write to me. I reply personally.