Transformers move beyond NLP: what it means
The transformer architecture that reshaped text processing is now being applied to images and structured data. Here is what that means for business.
The transformer architecture appeared in 2017 as a solution for text processing. Within a few years it reshaped NLP - language models, machine translation, text analysis. GPT-3, released last year, became one of the most visible examples of what a sufficiently large transformer model can do.
Now the same architecture is being applied outside of text. Late 2020 and early 2021 have brought convincing results on images (Vision Transformer, ViT) and on tasks involving tabular and structured data. This is more than an extension of one technology - it signals a shift in how machine learning systems are built.
Why this matters more than it seems
Before the transformer, different machine learning tasks required fundamentally different approaches: recurrent networks, often combined with attention mechanisms, for text; convolutional networks for images; gradient boosting and decision trees for structured data.
The transformer turns out to be a general enough architecture to work across all of these domains with appropriate adaptation. This matters for two reasons.
The first: knowledge and skills transfer. A team that knows how to work with transformer models for text will be able to apply the same foundation to other tasks faster than if each task required a completely different approach.
The second: a path opens toward multimodal systems. A shared architecture for different data types is the technical foundation for systems that work with several modalities at once. OpenAI's CLIP, which relates images to text descriptions, is one of the first examples.
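To make the "appropriate adaptation" concrete, here is a minimal sketch in Python with NumPy of how two very different inputs can be turned into the same kind of object: a sequence of equally sized token vectors, which is what a transformer consumes. Everything specific here is an assumption chosen for illustration: the 32x32 image, the 8-pixel patches, the model width of 64, the random untrained weights, and the simple scheme of scaling a per-column embedding by the cell value for the table row. It is not the exact recipe of ViT or of any particular tabular model.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, c)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def project(tokens, d_model, rng):
    """Map raw tokens to the model width with a random (untrained) linear layer."""
    weights = rng.normal(scale=0.02, size=(tokens.shape[1], d_model))
    return tokens @ weights

rng = np.random.default_rng(0)
d_model = 64  # assumed model width, illustrative only

# Image: 16 patches of 8x8 pixels, each projected to a d_model vector.
image = rng.random((32, 32, 3))
image_tokens = project(image_to_patch_tokens(image, patch_size=8), d_model, rng)

# Table row: one token per column, built by scaling a per-column
# embedding vector by the cell value (a hypothetical, simplified scheme).
row = np.array([0.5, 12.0, 3.0, 7.5, 1.0])
column_embeddings = rng.normal(scale=0.02, size=(row.size, d_model))
row_tokens = row[:, None] * column_embeddings

print(image_tokens.shape)  # (16, 64)
print(row_tokens.shape)    # (5, 64)
```

Both results have the shape (number of tokens, model width). From that point on, the same transformer layers can process either one, which is why skills and tooling carry over between domains.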
What this means for practical tasks
For most companies, the immediate consequences are not yet visible. New architectures first appear in research papers, then in open-source tools, then in managed cloud services, and only then become practically accessible without deep expertise.
But there are a few areas where the change will be felt earlier than others.
Computer vision: Vision Transformer shows results that are competitive with or better than the previous standard - convolutional networks - especially when it is trained on large amounts of data. This means vendors of ready-made solutions for visual tasks will gradually update the architectures under the hood.
Analysis of mixed data: tasks that require jointly analysing text, images, and structured data are exactly the area where a unified architecture has an advantage. Medical records with images, product data sheets, reports with embedded tables - these are real business data types.
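The advantage of a unified architecture for mixed data can be sketched in a few lines of Python with NumPy. Once text, image, and table inputs are all represented as token vectors of the same width, they can be placed in one sequence, and a single attention step lets every token draw on every other token regardless of where it came from. The token counts, the model width of 32, and the random untrained weights below are assumptions for illustration only. The random arrays stand in for inputs that a real system would embed with trained layers.

```python
import numpy as np

def self_attention(tokens, rng):
    """Single-head scaled dot-product self-attention with random (untrained) weights."""
    d_model = tokens.shape[1]
    w_q, w_k, w_v = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3)
    )
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(d_model)
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 32  # assumed model width, illustrative only

# Stand-ins for already-embedded inputs from three modalities.
text_tokens = rng.normal(size=(12, d_model))   # e.g. words of a report
image_tokens = rng.normal(size=(16, d_model))  # e.g. patches of a scan
table_tokens = rng.normal(size=(5, d_model))   # e.g. cells of a table row

# One sequence, one attention pass over all modalities together.
sequence = np.concatenate([text_tokens, image_tokens, table_tokens], axis=0)
output, weights = self_attention(sequence, rng)

print(output.shape)   # (33, 32)
print(weights.shape)  # (33, 33)

# Share of the first text token's attention that goes to image and table tokens.
print(weights[0, 12:].sum())
```

The attention matrix is 33 by 33: every text token has a weight for every image patch and every table cell. With separate architectures per data type, this kind of cross-referencing has to be engineered by hand at the point where the models are joined.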
How to think about this on a one-to-two-year horizon
There is no need to switch to new architectures immediately. If a machine learning system is doing its job, leave it in place. Chasing every new architectural development is not worthwhile.
What is useful: watching how your AI tool and cloud service vendors update their offerings. New models behind the same APIs are typically an improvement with no extra effort on your side.
And it is worth keeping in mind: tasks that today seem too complex or expensive to automate with AI may well be solvable in two years - not because the tasks themselves changed, but because the underlying architecture got better.