word2vec and the new semantics of search: why text collections will soon think in proximity
Vector word representations open a new level of search, matching, and recommendations. What this means for companies that hold large text collections.
In early 2013, a team of researchers at Google published a paper on a method that became known as word2vec. The idea is that words can be represented as vectors in a high-dimensional space - and that the proximity of vectors reflects the semantic proximity of words. This sounds like a mathematical abstraction until you see what follows from it.
What follows is that text collections can be not only indexed and searched by keywords, but also worked with through the notion of semantic closeness. This is a fundamentally different class of tasks.
How vector representation differs from ordinary search
A conventional search index works on word matching. If the query says "automobile" and the document says "car" or "vehicle", these are - from the perspective of a standard index - different terms. Bridging them requires synonym dictionaries and manual tuning.
Vector representation approaches the task differently. Words that are used in similar contexts end up with similar vectors. "Automobile", "car", "motor vehicle" sit close together in the vector space - not because someone configured this, but because they appear in similar contexts in real texts.
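The intuition that similar contexts produce similar vectors can be sketched without any neural network. The example below is not word2vec itself: it simply counts neighbouring words in a tiny invented corpus and compares the resulting count vectors with cosine similarity. The corpus, the window size, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny invented corpus (assumption: real models see millions of sentences).
corpus = [
    "the car drove down the road",
    "the automobile drove down the highway",
    "she parked the car near the road",
    "she parked the automobile near the highway",
    "he ate a ripe banana for breakfast",
    "he ate a ripe apple for lunch",
]

def context_counts(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

counts = context_counts(corpus)
sim_synonyms = cosine(counts["car"], counts["automobile"])
sim_unrelated = cosine(counts["car"], counts["banana"])
print(f"car vs automobile: {sim_synonyms:.2f}")
print(f"car vs banana:     {sim_unrelated:.2f}")
```

Nobody told the code that "car" and "automobile" are synonyms; they score as close only because they share neighbours. word2vec learns dense vectors rather than raw counts, but it exploits the same signal.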
The consequence: a search query can return documents that contain not a single word from the query, but are semantically close to it. This does not work perfectly, but it works on fundamentally different principles from traditional search.
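A minimal sketch of such a search, assuming word vectors are already available. Here they are hand-made three-dimensional toy values, and a document is represented by the average of its word vectors, which is one simple baseline among several possible choices rather than a prescribed method.

```python
import math

# Hand-made toy vectors (assumption: a trained model would supply these).
word_vectors = {
    "automobile":  [0.9, 0.1, 0.0],
    "car":         [0.8, 0.2, 0.0],
    "repair":      [0.6, 0.5, 0.1],
    "maintenance": [0.5, 0.6, 0.1],
    "guide":       [0.2, 0.6, 0.2],
    "banana":      [0.0, 0.1, 0.9],
    "bread":       [0.1, 0.2, 0.8],
    "recipe":      [0.1, 0.4, 0.7],
}

def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(query, documents):
    """Return (score, document) pairs, best match first."""
    q = embed(query)
    scored = []
    for doc in documents:
        d = embed(doc)
        if q is not None and d is not None:
            scored.append((cosine(q, d), doc))
    return sorted(scored, reverse=True)

documents = ["car maintenance guide", "banana bread recipe"]
query = "automobile repair"

results = semantic_search(query, documents)
shared_words = set(query.split()) & set(documents[0].split())

for score, doc in results:
    print(f"{score:.2f}  {doc}")
print("words shared with the top hit:", shared_words)
```

The top-ranked document shares no word with the query, so a pure keyword index would not have returned it.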
Where this opens new possibilities
A few classes of tasks where this approach delivers something genuinely new.
Search across internal documentation. When a company has accumulated thousands of documents - regulations, contracts, technical documentation, correspondence - finding what you need via exact keywords is difficult. Semantic search allows documents to be found by the meaning of a query, not by exact term matching. Corporate knowledge bases have faced exactly this limitation with keyword-only search.
Matching and deduplication. If a company works with large databases of products, customers, or vendors, the same entities are often recorded differently. Vector representation makes it possible to find similar records by meaning, not only by exact string matching.
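As a rough sketch of how this could look, the example below compares every pair of records by the cosine similarity of their averaged word vectors and flags pairs above a threshold. The vectors are hand-made toy values, and the threshold of 0.9 is an arbitrary assumption that would need tuning on real data.

```python
import math
from itertools import combinations

# Hand-made toy vectors (assumption: a trained model would supply these).
word_vectors = {
    "laptop":   [0.9, 0.1, 0.0, 0.1],
    "notebook": [0.8, 0.2, 0.0, 0.1],
    "bag":      [0.2, 0.9, 0.0, 0.1],
    "case":     [0.3, 0.7, 0.0, 0.2],
    "black":    [0.1, 0.1, 0.1, 0.9],
    "garden":   [0.0, 0.1, 0.9, 0.1],
    "hose":     [0.1, 0.0, 0.8, 0.2],
    "green":    [0.1, 0.1, 0.2, 0.9],
}

# The same product entered twice in different words, plus an unrelated one.
records = {
    1: "laptop bag black",
    2: "black notebook case",
    3: "garden hose green",
}

def embed(text):
    """Average the vectors of known words (empty list if none are known)."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

THRESHOLD = 0.9  # assumption: chosen by hand for this toy data

duplicates = [
    (a, b)
    for a, b in combinations(sorted(records), 2)
    if cosine(embed(records[a]), embed(records[b])) >= THRESHOLD
]
print("likely duplicates:", duplicates)
```

Comparing all pairs grows quadratically with the number of records, so on a large database this check would normally run only on candidates pre-selected by a cheaper filter.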
Content-based recommendations. If a company has a catalogue - of products, articles, vacancies, listings - vector representations enable recommendations of similar items based on the semantic similarity of their descriptions.
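A "similar items" lookup could be sketched as follows, assuming each catalogue description has already been turned into a vector. The item names and vector values are invented for illustration.

```python
import heapq
import math

# Assumption: one precomputed description vector per catalogue item.
item_vectors = {
    "trail running shoes": [0.9, 0.1, 0.1],
    "road running shoes":  [0.8, 0.2, 0.1],
    "hiking boots":        [0.7, 0.1, 0.4],
    "espresso machine":    [0.0, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(item, k=2):
    """Return the k items whose vectors are closest to the given item's."""
    target = item_vectors[item]
    candidates = (
        (cosine(target, vec), name)
        for name, vec in item_vectors.items()
        if name != item  # never recommend the item itself
    )
    return [name for _, name in heapq.nlargest(k, candidates)]

similar = recommend("trail running shoes", k=2)
print(similar)
```

This uses only the descriptions, so it works for new items that have no purchase or click history yet.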
Classification and tagging. Automatic categorisation of incoming queries, documents, or requests by topic, based on meaning rather than just the presence of keywords. Keyword-based routing of reviews and tickets already delivers measurable value at lower complexity, so it is the baseline a semantic classifier has to beat.
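One simple way to classify by meaning is a nearest-centroid rule: average the vectors of a few labelled examples per category, then assign a new message to the category whose average is closest. The sketch below assumes hand-made toy vectors and invented category names; it is an illustration of the idea, not a production classifier.

```python
import math

# Hand-made toy vectors (assumption: a trained model would supply these).
word_vectors = {
    "invoice":     [0.9, 0.1, 0.1],
    "payment":     [0.8, 0.1, 0.2],
    "refund":      [0.9, 0.0, 0.2],
    "card":        [0.7, 0.2, 0.2],
    "charge":      [0.8, 0.2, 0.1],
    "password":    [0.1, 0.9, 0.1],
    "login":       [0.1, 0.8, 0.2],
    "account":     [0.3, 0.7, 0.2],
    "locked":      [0.0, 0.9, 0.2],
    "credentials": [0.1, 0.9, 0.1],
}

# A few labelled examples per category (invented for illustration).
labelled = {
    "billing": ["invoice payment", "payment refund"],
    "access":  ["password login", "login account"],
}

def embed(text):
    """Average the vectors of known words (empty list if none are known)."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

centroids = {
    label: centroid([embed(text) for text in texts])
    for label, texts in labelled.items()
}

def classify(text):
    """Return the closest category, or None if no word is known."""
    vec = embed(text)
    if not vec:
        return None
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

print(classify("unexpected card charge"))
print(classify("credentials locked"))
```

Neither test message shares a word with the labelled examples, so a keyword rule built from those examples would not have routed them.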
What stands between experiment and application
As with any new technology, there is a gap between results in academic conditions and real-world use.
The quality of vector representation depends on the data it was trained on. Public models are trained on general texts - news, Wikipedia, books. For a specialised domain - medicine, law, a specific industry - these models perform worse, because domain terms are rare or absent in general text. Fine-tuning on domain texts, or training from scratch, is needed, which in turn requires a large enough volume of domain text.
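A quick, rough diagnostic before investing in training is to measure how many tokens of a domain text are missing from a general model's vocabulary. The sketch below assumes a tiny invented vocabulary and sample sentence; the out-of-vocabulary rate is only a proxy, since a word can be in the vocabulary and still carry the wrong meaning for the domain.

```python
from collections import Counter

# Assumption: stands in for the vocabulary of a general-purpose model.
general_vocab = {
    "the", "patient", "was", "given", "a", "dose", "of", "for", "pain", "and",
}

# Invented sample of domain text.
domain_text = "the patient was given a dose of enoxaparin for thromboprophylaxis"

def oov_report(text, vocab):
    """Return the share of tokens not in vocab, and a count of those tokens."""
    tokens = text.lower().split()
    missing = Counter(t for t in tokens if t not in vocab)
    rate = sum(missing.values()) / len(tokens) if tokens else 0.0
    return rate, missing

rate, missing = oov_report(domain_text, general_vocab)
print(f"out-of-vocabulary rate: {rate:.0%}")
print("missing terms:", sorted(missing))
```

In this toy case the missing tokens are exactly the domain-specific terms, which is the pattern that suggests domain training is worth considering.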
Interpretability is lower than with traditional search. Why the system returned a particular result is harder to explain than "the document contained this word".
The infrastructure is different from a traditional search index. A keyword index looks up exact terms; vector search has to find the stored vectors nearest to a query vector. That is a different class of problem with different computational requirements.
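The difference between the two kinds of lookup can be sketched side by side. The documents and vectors below are invented, and the vector search is a brute-force scan chosen for clarity, not for scale.

```python
import math
from collections import defaultdict

docs = {
    0: "car repair manual",
    1: "automobile service handbook",
    2: "banana bread recipe",
}

# Keyword side: an inverted index maps each term to the documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

keyword_hits = inverted.get("car", set())  # a single dictionary lookup

# Vector side: one dense vector per document (hand-made toy values).
doc_vectors = {
    0: [0.9, 0.2, 0.0],
    1: [0.8, 0.3, 0.1],
    2: [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query_vec, vectors, k):
    """Brute force: score the query against every stored vector."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

query_vec = [0.85, 0.25, 0.05]  # assumption: the embedded query
vector_hits = nearest(query_vec, doc_vectors, k=2)

print("keyword hits:", keyword_hits)
print("vector hits: ", vector_hits)
```

The keyword lookup touches one dictionary entry, while the scan touches every vector on every query. On large collections that cost is why dedicated approximate nearest-neighbour indexes are typically used instead of a full scan.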
All of this is solvable, but requires deliberate investment.
How to think about this now
For a company that has a large text collection - documents, descriptions, communications - and is experiencing problems with search or matching, this direction is worth keeping in view.
A few orienting questions:
- Do we have a text collection where keyword search produces poor results?
- Is there a deduplication or matching problem where records are written in different ways?
- Do we have a catalogue that needs "similar item" recommendations?
- Do we have enough domain-specific texts to train or fine-tune a model?
- Do we have the technical capacity to build a prototype?
The direction is young. But the trajectory is clear enough: semantic proximity as the basis for working with text data is not a replacement for keywords - it is an additional layer that opens up classes of tasks that previously required manual work.